
# Mathematics Underlying Neural Network Model Creation and Training

**Tensors** are the fundamental data structures of neural networks.

This notebook provides a detailed exploration of the mathematical operations and concepts that underlie neural network model building. For simplicity, the Iris dataset is used alongside TensorFlow to illustrate the operations involving tensors in the model architecture, with a particular focus on the first dense layer, activation functions, and batch normalization.

The notebook emphasizes the importance of understanding tensor operations when building neural networks.



## Python Code:

Building and Training a Neural Network Model with TensorFlow to Classify the Iris Dataset


In [2]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Standardize the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Sequential model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    BatchNormalization(),
    Dropout(0.5),
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.5),
    Dense(3, activation='softmax')  # 3 output neurons for the 3 classes in the Iris dataset
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Define early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train the model
history = model.fit(X_train, y_train, epochs=100, batch_size=8, validation_split=0.2, callbacks=[early_stopping])

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy}')

# Make predictions
y_pred = model.predict(X_test)
y_pred_classes = tf.argmax(y_pred, axis=1).numpy()  # convert the tensor to a NumPy array for scikit-learn

# Evaluate the model using accuracy_score
accuracy = accuracy_score(y_test, y_pred_classes)
print(f'Accuracy: {accuracy}')




Epoch 1/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - accuracy: 0.4314 - loss: 1.7515 - val_accuracy: 0.6667 - val_loss: 0.9488
Epoch 2/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.6000 - loss: 0.8342 - val_accuracy: 0.7083 - val_loss: 0.8571
Epoch 3/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6171 - loss: 1.0188 - val_accuracy: 0.7083 - val_loss: 0.7894
Epoch 4/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.7471 - loss: 0.7989 - val_accuracy: 0.7083 - val_loss: 0.7263
Epoch 5/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.6785 - loss: 0.8814 - val_accuracy: 0.7500 - val_loss: 0.6753
Epoch 6/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.6954 - loss: 0.8061 - val_accuracy: 0.7917 - val_loss: 0.6338
Epoch 7/100
...

In [3]:
model.summary()
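
The summary output is not reproduced in this export. As a rough cross-check, the parameter counts that `model.summary()` should report can be derived by hand. The sketch below assumes the Iris input has 4 features and that each `BatchNormalization` layer holds four vectors per feature (trainable `gamma` and `beta`, plus non-trainable moving mean and variance). The helper names `dense_params` and `batchnorm_params` are illustrative and not part of Keras.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Return (trainable, non_trainable) counts for one BatchNormalization layer."""
    return 2 * n_features, 2 * n_features


# (name, trainable, non_trainable); Dropout layers have no parameters.
layers = [
    ("dense (4 -> 64)", dense_params(4, 64), 0),
    ("batch_normalization (64)", *batchnorm_params(64)),
    ("dense (64 -> 64)", dense_params(64, 64), 0),
    ("batch_normalization (64)", *batchnorm_params(64)),
    ("dense (64 -> 3)", dense_params(64, 3), 0),
]

trainable = sum(t for _, t, _ in layers)
non_trainable = sum(n for _, _, n in layers)

for name, t, n in layers:
    print(f"{name:28s} {t + n:5d}")
print(f"trainable={trainable}, non-trainable={non_trainable}, "
      f"total={trainable + non_trainable}")
```

Under these assumptions the count comes to 4,931 trainable and 256 non-trainable parameters.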


## Explanation of Model Architecture:

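- **Dense Layer**: Each neuron computes a weighted sum of its inputs plus a bias, where $\mathbf{w}$ is the neuron's weight vector and $\mathbf{x}$ is one input sample:
  $$ z = \mathbf{w} \cdot \mathbf{x} + b $$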

- **Activation Function (ReLU)**: Defined as:
  $$ \text{ReLU}(z) = \max(0, z) $$

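- **Batch Normalization**: Standardizes each feature of the previous layer's output over the current mini-batch (zero mean, unit variance) and then applies a learnable scale and shift, which tends to make training faster and more stable.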

- **Dropout**: Randomly sets a fraction of its input units (0.5 in this model) to 0 during training to mitigate overfitting; it has no effect at inference time.

- **Output Layer**: Utilizes the softmax function for multi-class classification:
  $$ P(y = k | X) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} $$
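
The two activation functions above can be written in a few lines of NumPy. This is a minimal sketch, not the Keras implementation; the logits are made-up values, and subtracting the row maximum inside `softmax` is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np


def relu(z):
    """Elementwise max(0, z)."""
    return np.maximum(0.0, z)


def softmax(z):
    """Row-wise softmax; the max is subtracted only for numerical stability."""
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical logits for one sample and K = 3 classes.
logits = np.array([[2.0, 1.0, 0.1]])
probs = softmax(logits)

print(relu(np.array([-1.0, 0.5])))   # negative entries are clipped to 0
print(probs, probs.sum())            # class probabilities sum to 1
```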





# Mathematics of Model Architecture

1. **Dense Layer**:
   - **Input**: $ \mathbf{X} $ of shape $(n, m)$, where $ n $ is the batch size and $ m $ is the number of input features.
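   - **Weights and Biases**: $ \mathbf{W} $ of shape $(m, k)$ and $ \mathbf{b} $ of shape $(k,)$, where $ k $ is the number of neurons in the layer; $ \mathbf{b} $ is added to every row of $ \mathbf{X} \mathbf{W} $ (broadcasting).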
   - **Linear Transformation**:
     $$
     \mathbf{Z} = \mathbf{X} \mathbf{W} + \mathbf{b} $$
   - **Activation Function** (e.g., ReLU), applied elementwise; here $ \sigma $ denotes the activation, not the standard deviation:
     $$
     \mathbf{A} = \sigma(\mathbf{Z}) $$
   - **Output**: $ \mathbf{A} $ of shape $(n, k)$.

2. **Batch Normalization Layer**:
   - **Input**: $ \mathbf{A} $ of shape $(n, k)$.
   - **Mean and Variance** (computed per feature, over the $ n $ rows of the batch):
     $$
     \mu = \frac{1}{n} \sum_{i=1}^{n} \mathbf{A}_i $$

     $$
     \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{A}_i - \mu)^2 $$
   - **Normalization**:
     $$
     \hat{\mathbf{A}} = \frac{\mathbf{A} - \mu}{\sqrt{\sigma^2 + \epsilon}} $$
   - **Scaling and Shifting** (with learnable $ \gamma $ and $ \beta $ of shape $(k,)$, applied elementwise per feature):
     $$
     \mathbf{Y} = \gamma \hat{\mathbf{A}} + \beta $$
   - **Output**: $ \mathbf{Y} $ of shape $(n, k)$.
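
The two layers can be sketched in NumPy to make the shapes concrete. This is an illustration under stated assumptions, not the Keras implementation: the input is random data with $n = 8$, $m = 4$, $k = 64$, the function names are invented for this example, and the batch-normalization step shows training-mode behaviour only (at inference Keras uses moving averages instead of batch statistics).

```python
import numpy as np


def dense_forward(X, W, b):
    """Z = XW + b followed by ReLU; b of shape (k,) is broadcast over rows."""
    Z = X @ W + b
    return np.maximum(0.0, Z)


def batch_norm_forward(A, gamma, beta, eps=1e-5):
    """Training-mode batch normalization using per-feature batch statistics."""
    mu = A.mean(axis=0)                      # shape (k,)
    var = A.var(axis=0)                      # population variance, 1/n
    A_hat = (A - mu) / np.sqrt(var + eps)
    return gamma * A_hat + beta              # elementwise scale and shift


rng = np.random.default_rng(0)
n, m, k = 8, 4, 64                           # batch size, input features, neurons
X = rng.normal(size=(n, m))
W = rng.normal(scale=0.1, size=(m, k))
b = np.zeros(k)

A = dense_forward(X, W, b)
Y = batch_norm_forward(A, gamma=np.ones(k), beta=np.zeros(k))
print(X.shape, W.shape, A.shape, Y.shape)
```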





### The Output $Y$ from 1st Dense and Batch Normalization Layers:

The output $Y$ provides a transformed representation of the input data $\mathbf{X}_{\text{train}}$, which can be further processed in subsequent layers of a neural network. Here's a more detailed explanation:

### Transformed Representation

1. **Dense Layer Output $\mathbf{A}$**:
   - The Dense Layer applies a linear transformation to the input data $\mathbf{X}_{\text{train}}$ using weights and biases, followed by an activation function (in this case, ReLU). This transformation allows the model to learn complex patterns and relationships in the data.

2. **Batch Normalization Output $\mathbf{Y}$**:
   - The Batch Normalization Layer normalizes the output of the Dense Layer, which helps to mitigate issues related to internal covariate shift. By normalizing the activations, the model can learn more effectively, as the inputs to each layer remain more stable throughout training.

### Benefits of Normalization

- **Stabilizes Learning**: Normalization reduces the sensitivity of the network to the scale of the inputs, making the training process more stable.
- **Faster Convergence**: By keeping the activations within a certain range, normalization can lead to faster convergence during training, allowing the model to reach good weights more quickly.
- **Improved Performance**: Normalization can help improve the overall performance of the model, as it allows for better gradient flow and reduces the likelihood of vanishing or exploding gradients.
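
The reduced sensitivity to input scale can be checked numerically. The sketch below uses random data and an arbitrary rescaling (factor 1000, offset 50), both chosen purely for illustration, and shows that the normalized activations are nearly identical before and after rescaling.

```python
import numpy as np


def batch_norm(A, eps=1e-5):
    """Per-feature standardization over the batch (no scale/shift)."""
    return (A - A.mean(axis=0)) / np.sqrt(A.var(axis=0) + eps)


rng = np.random.default_rng(0)
A = rng.normal(size=(16, 4))        # hypothetical activations
A_scaled = 1000.0 * A + 50.0        # same data on a very different scale

max_diff = np.abs(batch_norm(A) - batch_norm(A_scaled)).max()
print(max_diff)                     # small; the residual difference comes from eps
```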

### Subsequent Layers

After the Batch Normalization Layer, the output $\mathbf{Y}$ is fed into additional layers of the neural network, such as:

- **Dropout Layers**: To prevent overfitting.
- **Additional Dense Layers**: To further learn complex representations.
- **Activation Layers**: To introduce non-linearity.
- **Output Layers**: To produce the final predictions.
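
As an illustration of the dropout step, here is a minimal sketch of "inverted" dropout, the variant in which surviving units are scaled by $1 / (1 - \text{rate})$ during training so that the expected activation is unchanged. The function name and the all-ones input are assumptions made for this example.

```python
import numpy as np


def dropout(y, rate, rng, training=True):
    """Zero a random fraction `rate` of units and rescale the survivors."""
    if not training:
        return y                              # dropout is inactive at inference
    keep_prob = 1.0 - rate
    mask = rng.random(y.shape) < keep_prob    # True where the unit is kept
    return np.where(mask, y / keep_prob, 0.0)


rng = np.random.default_rng(0)
Y = np.ones((2, 4))                           # hypothetical layer output
Y_train = dropout(Y, rate=0.5, rng=rng)
Y_infer = dropout(Y, rate=0.5, rng=rng, training=False)

print(Y_train)   # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
print(Y_infer)   # unchanged
```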



The transformed representations from the Dense Layer and Batch Normalization Layer are crucial for the effective training and performance of neural networks, enabling them to learn from the input data $\mathbf{X}_{\text{train}}$ more efficiently.

## Example:





## 1. Dense Layer Calculation:

### Step 1: Linear Transformation

We need to compute $\mathbf{Z} = \mathbf{X} \mathbf{W} + \mathbf{b}$:

1. **Input $ \mathbf{X} $**:
   $
   \mathbf{X} = \begin{bmatrix} 1.0 & 2.0 & 3.0 \\ 4.0 & 5.0 & 6.0 \end{bmatrix}
   $

2. **Weights $ \mathbf{W} $**:
   $
   \mathbf{W} = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 & 0.8 \\ 0.9 & 1.0 & 1.1 & 1.2 \end{bmatrix}
   $

3. **Biases $ \mathbf{b} $** (shape $(4,)$, one bias per output neuron):
   $
   \mathbf{b} = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 \end{bmatrix}
   $

#### Calculation of $ \mathbf{Z} $:

- First, we compute $ \mathbf{X} \mathbf{W} $:

$ \begin{bmatrix} 1.0 & 2.0 & 3.0 \\ 4.0 & 5.0 & 6.0 \end{bmatrix} \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 & 0.8 \\ 0.9 & 1.0 & 1.1 & 1.2 \end{bmatrix}
$

Calculating each element of the resulting matrix:

- For the first row:
  - First column: $ 1.0 \cdot 0.1 + 2.0 \cdot 0.5 + 3.0 \cdot 0.9 = 0.1 + 1.0 + 2.7 = 3.8 $
  - Second column: $ 1.0 \cdot 0.2 + 2.0 \cdot 0.6 + 3.0 \cdot 1.0 = 0.2 + 1.2 + 3.0 = 4.4 $
  - Third column: $ 1.0 \cdot 0.3 + 2.0 \cdot 0.7 + 3.0 \cdot 1.1 = 0.3 + 1.4 + 3.3 = 5.0 $
  - Fourth column: $ 1.0 \cdot 0.4 + 2.0 \cdot 0.8 + 3.0 \cdot 1.2 = 0.4 + 1.6 + 3.6 = 5.6 $

- For the second row:
  - First column: $ 4.0 \cdot 0.1 + 5.0 \cdot 0.5 + 6.0 \cdot 0.9 = 0.4 + 2.5 + 5.4 = 8.3 $
  - Second column: $ 4.0 \cdot 0.2 + 5.0 \cdot 0.6 + 6.0 \cdot 1.0 = 0.8 + 3.0 + 6.0 = 9.8 $
  - Third column: $ 4.0 \cdot 0.3 + 5.0 \cdot 0.7 + 6.0 \cdot 1.1 = 1.2 + 3.5 + 6.6 = 11.3 $
  - Fourth column: $ 4.0 \cdot 0.4 + 5.0 \cdot 0.8 + 6.0 \cdot 1.2 = 1.6 + 4.0 + 7.2 = 12.8 $

Thus, we have:

$
\mathbf{X} \mathbf{W} = \begin{bmatrix} 3.8 & 4.4 & 5.0 & 5.6 \\ 8.3 & 9.8 & 11.3 & 12.8 \end{bmatrix}
$
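
The hand calculation can be checked with NumPy's matrix-multiplication operator `@`, using the same $\mathbf{X}$ and $\mathbf{W}$ as above:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])            # shape (2, 3): 2 samples, 3 features
W = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]])       # shape (3, 4): 3 features, 4 neurons

XW = X @ W                                 # shape (2, 4)
print(XW.shape)
print(np.round(XW, 4))
```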



### Step 2: Adding Biases

$\mathbf{b} = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 \end{bmatrix}
$

Since $ \mathbf{b} $ has shape $(4,)$, with one entry per output neuron, it is broadcast across the rows of $ \mathbf{X} \mathbf{W} $; that is, the same bias vector is added to each row:

- For the first row:
  - $ 3.8 + 0.1 = 3.9 $
  - $ 4.4 + 0.2 = 4.6 $
  - $ 5.0 + 0.3 = 5.3 $
  - $ 5.6 + 0.4 = 6.0 $

- For the second row:
  - $ 8.3 + 0.1 = 8.4 $
  - $ 9.8 + 0.2 = 10.0 $
  - $ 11.3 + 0.3 = 11.6 $
  - $ 12.8 + 0.4 = 13.2 $

Thus,

$
\mathbf{Z} = \begin{bmatrix} 3.9 & 4.6 & 5.3 & 6.0 \\ 8.4 & 10.0 & 11.6 & 13.2 \end{bmatrix}
$
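
A short NumPy check of the bias addition, which also shows why the shape of $\mathbf{b}$ matters. A vector of shape `(4,)` broadcasts across the rows of the `(2, 4)` matrix, whereas reshaping it into a `(4, 1)` column (done here only to demonstrate the failure) is incompatible with that matrix.

```python
import numpy as np

XW = np.array([[3.8, 4.4, 5.0, 5.6],
               [8.3, 9.8, 11.3, 12.8]])    # shape (2, 4)
b = np.array([0.1, 0.2, 0.3, 0.4])         # shape (4,)

Z = XW + b                                 # b is added to every row
print(np.round(Z, 4))

# A (4, 1) column cannot be broadcast against a (2, 4) matrix.
try:
    XW + b.reshape(4, 1)
    column_ok = True
except ValueError:
    column_ok = False
print("column-shaped bias broadcasts:", column_ok)
```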

### Step 3: Activation Function (ReLU)

Next, apply the ReLU activation function $ \sigma(\mathbf{Z}) = \max(0, \mathbf{Z}) $:

$
\mathbf{A} = \begin{bmatrix} \max(0, 3.9) & \max(0, 4.6) & \max(0, 5.3) & \max(0, 6.0) \\ \max(0, 8.4) & \max(0, 10.0) & \max(0, 11.6) & \max(0, 13.2) \end{bmatrix}
$

Since all values in this example are positive, ReLU leaves them unchanged. **Output of the Dense Layer**:

$
\mathbf{A} = \begin{bmatrix} 3.9 & 4.6 & 5.3 & 6.0 \\ 8.4 & 10.0 & 11.6 & 13.2 \end{bmatrix}
$

This output $ \mathbf{A} $ from the `Dense` layer is then passed to the `BatchNormalization` layer.
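
The ReLU step is a single call to `np.maximum`. Because every entry of $\mathbf{Z}$ in this example is positive, the sketch also applies ReLU to a made-up row containing a negative value to show the clipping.

```python
import numpy as np

Z = np.array([[3.9, 4.6, 5.3, 6.0],
              [8.4, 10.0, 11.6, 13.2]])
A = np.maximum(0.0, Z)                     # elementwise ReLU
print(A)

# Hypothetical pre-activations with a negative entry, for contrast.
Z_mixed = np.array([[-1.5, 0.0, 2.0]])
A_mixed = np.maximum(0.0, Z_mixed)
print(A_mixed)
```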



## 2. Batch Normalization calculations


### Step 1: Mean and Variance Calculation

Given the output $\mathbf{A} $:

$
\mathbf{A} = \begin{bmatrix} 3.9 & 4.6 & 5.3 & 6.0 \\ 8.4 & 10.0 & 11.6 & 13.2 \end{bmatrix}
$

 **Mean $\mu $**:


   $
   \mu = \frac{1}{n} \sum_{i=1}^{n} \mathbf{A}_i
   $

   Here, $n = 2 $ (the batch size), and we compute the mean for each feature across the batch:

   - For the first feature:
     $
     \mu_1 = \frac{3.9 + 8.4}{2} = \frac{12.3}{2} = 6.15
     $
   - For the second feature:
     $
     \mu_2 = \frac{4.6 + 10.0}{2} = \frac{14.6}{2} = 7.3
     $
   - For the third feature:
     $
     \mu_3 = \frac{5.3 + 11.6}{2} = \frac{16.9}{2} = 8.45
     $
   - For the fourth feature:
     $
     \mu_4 = \frac{6.0 + 13.2}{2} = \frac{19.2}{2} = 9.6
     $

   Thus, the mean vector $\mu $ is:

   $
   \mu = \begin{bmatrix} 6.15 \\ 7.3 \\ 8.45 \\ 9.6 \end{bmatrix}
   $

**Variance $\sigma^2 $**:
   

   $
   \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{A}_i - \mu)^2
   $

   We compute the variance for each feature:

   - For the first feature:
     $
     \sigma_1^2 = \frac{(3.9 - 6.15)^2 + (8.4 - 6.15)^2}{2} = \frac{(-2.25)^2 + (2.25)^2}{2} = \frac{5.0625 + 5.0625}{2} = \frac{10.125}{2} = 5.0625
     $
   - For the second feature:
     $
     \sigma_2^2 = \frac{(4.6 - 7.3)^2 + (10.0 - 7.3)^2}{2} = \frac{(-2.7)^2 + (2.7)^2}{2} = \frac{7.29 + 7.29}{2} = \frac{14.58}{2} = 7.29
     $
   - For the third feature:
     $
     \sigma_3^2 = \frac{(5.3 - 8.45)^2 + (11.6 - 8.45)^2}{2} = \frac{(-3.15)^2 + (3.15)^2}{2} = \frac{9.9225 + 9.9225}{2} = \frac{19.845}{2} = 9.9225
     $
   - For the fourth feature:
     $
     \sigma_4^2 = \frac{(6.0 - 9.6)^2 + (13.2 - 9.6)^2}{2} = \frac{(-3.6)^2 + (3.6)^2}{2} = \frac{12.96 + 12.96}{2} = \frac{25.92}{2} = 12.96
     $

   Thus, the variance vector $\sigma^2 $ is:

   $
   \sigma^2 = \begin{bmatrix} 5.0625 \\ 7.29 \\ 9.9225 \\ 12.96 \end{bmatrix}
   $
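
The per-feature statistics correspond to reductions along axis 0 (the batch axis). Note that `np.var` uses the population form (division by $n$) by default, which matches the formula above.

```python
import numpy as np

A = np.array([[3.9, 4.6, 5.3, 6.0],
              [8.4, 10.0, 11.6, 13.2]])

mu = A.mean(axis=0)     # one mean per feature, shape (4,)
var = A.var(axis=0)     # population variance (ddof=0), shape (4,)

print(np.round(mu, 4))
print(np.round(var, 4))
```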



### Step 2: Normalization

Compute the normalized output $ \hat{\mathbf{A}} $ using the mean $ \mu $ and variance $ \sigma^2 $ calculated previously:

$
\hat{\mathbf{A}} = \frac{\mathbf{A} - \mu}{\sqrt{\sigma^2 + \epsilon}}
$

where $ \epsilon $ is a small constant added for numerical stability. Typical values are $ 1 \times 10^{-5} $ or $ 1 \times 10^{-3} $ (the latter is the default of Keras's `BatchNormalization` layer). For this example, let's use $ \epsilon = 1 \times 10^{-5} $.

**Calculate $ \sqrt{\sigma^2 + \epsilon} $**:

   $
   \sqrt{\sigma^2 + \epsilon} = \begin{bmatrix} \sqrt{5.0625 + 1 \times 10^{-5}} \\ \sqrt{7.29 + 1 \times 10^{-5}} \\ \sqrt{9.9225 + 1 \times 10^{-5}} \\ \sqrt{12.96 + 1 \times 10^{-5}} \end{bmatrix}
   $

   Evaluating the square roots ($ \epsilon $ changes each value only around the sixth decimal place):

   - For the first feature:
     $
     \sqrt{5.0625 + 1 \times 10^{-5}} \approx \sqrt{5.0625} = 2.25
     $
   - For the second feature:
     $
     \sqrt{7.29 + 1 \times 10^{-5}} \approx \sqrt{7.29} = 2.7
     $
   - For the third feature:
     $
     \sqrt{9.9225 + 1 \times 10^{-5}} \approx \sqrt{9.9225} = 3.15
     $
   - For the fourth feature:
     $
     \sqrt{12.96 + 1 \times 10^{-5}} \approx \sqrt{12.96} = 3.6
     $

   Thus, we have:

   $
   \sqrt{\sigma^2 + \epsilon} \approx \begin{bmatrix} 2.25 \\ 2.7 \\ 3.15 \\ 3.6 \end{bmatrix}
   $

**Calculate $ \hat{\mathbf{A}} $**:

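Now we can compute $ \hat{\mathbf{A}} $. First, form the numerator by subtracting the per-feature mean from every row of $ \mathbf{A} $ ($ \mu $ is written as a row here so that its entries line up with the columns of $ \mathbf{A} $):

$
\mathbf{A} - \mu = \begin{bmatrix} 3.9 & 4.6 & 5.3 & 6.0 \\ 8.4 & 10.0 & 11.6 & 13.2 \end{bmatrix} - \begin{bmatrix} 6.15 & 7.3 & 8.45 & 9.6 \end{bmatrix}
$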

Calculating $ \mathbf{A} - \mu $:

- For the first row:
  - $ 3.9 - 6.15 = -2.25 $
  - $ 4.6 - 7.3 = -2.7 $
  - $ 5.3 - 8.45 = -3.15 $
  - $ 6.0 - 9.6 = -3.6 $

- For the second row:
  - $ 8.4 - 6.15 = 2.25 $
  - $ 10.0 - 7.3 = 2.7 $
  - $ 11.6 - 8.45 = 3.15 $
  - $ 13.2 - 9.6 = 3.6 $

Thus,

$
\mathbf{A} - \mu = \begin{bmatrix} -2.25 & -2.7 & -3.15 & -3.6 \\ 2.25 & 2.7 & 3.15 & 3.6 \end{bmatrix}
$

**Next, divide each column by the corresponding entry of $ \sqrt{\sigma^2 + \epsilon} $**:

$
\hat{\mathbf{A}} = \begin{bmatrix} \frac{-2.25}{2.25} & \frac{-2.7}{2.7} & \frac{-3.15}{3.15} & \frac{-3.6}{3.6} \\ \frac{2.25}{2.25} & \frac{2.7}{2.7} & \frac{3.15}{3.15} & \frac{3.6}{3.6} \end{bmatrix}
$

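$ \approx \begin{bmatrix} -1 & -1 & -1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix}
$

The result is approximate only because of $ \epsilon $. With a batch of just two samples, each feature's two values sit symmetrically around their mean, so they are always normalized to approximately $ -1 $ and $ +1 $, whatever the original values were.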
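
The normalization step can be reproduced in NumPy with the same $\epsilon = 1 \times 10^{-5}$. The entries come out marginally smaller than 1 in magnitude because of $\epsilon$ in the denominator.

```python
import numpy as np

A = np.array([[3.9, 4.6, 5.3, 6.0],
              [8.4, 10.0, 11.6, 13.2]])
eps = 1e-5

mu = A.mean(axis=0)
var = A.var(axis=0)
A_hat = (A - mu) / np.sqrt(var + eps)   # mu and var broadcast across rows

print(np.round(A_hat, 4))
```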



### Step 3: Scaling and Shifting:


$
\mathbf{Y} = \gamma \hat{\mathbf{A}} + \beta
$

Where $ \gamma $ and $ \beta $ are learnable parameters with one entry per feature, applied elementwise to each row of $ \hat{\mathbf{A}} $. For this example, let's assume their usual initial values:

- $ \gamma = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix} $ (scale factor)
- $ \beta = \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix} $ (shift factor)

Therefore, with $ \odot $ denoting elementwise multiplication broadcast across the rows:

$
\mathbf{Y} = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix} \odot \begin{bmatrix} -1 & -1 & -1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix}
$

Then, calculating $ \gamma \odot \hat{\mathbf{A}} $:

$
\gamma \odot \hat{\mathbf{A}} = \begin{bmatrix} -1 & -1 & -1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix}
$


Finally, adding the shift $ \beta $ to each row:

$
\mathbf{Y} = \begin{bmatrix} -1 & -1 & -1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} -1 & -1 & -1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix}
$

Thus,

**Final Output**:
  $
   \mathbf{Y} = \begin{bmatrix} -1 & -1 & -1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix}
  $
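
With $\gamma = 1$ and $\beta = 0$ the scale-and-shift step is the identity, which hides what it does. The sketch below repeats the step with those values and then with made-up values of $\gamma$ and $\beta$ (purely hypothetical, standing in for parameters the network might learn) to show the per-feature effect.

```python
import numpy as np

A_hat = np.array([[-1.0, -1.0, -1.0, -1.0],
                  [1.0, 1.0, 1.0, 1.0]])

# Values used in the worked example: identity transform.
gamma = np.ones(4)
beta = np.zeros(4)
Y = gamma * A_hat + beta            # elementwise, broadcast across rows

# Hypothetical learned values, for illustration only.
gamma_learned = np.array([2.0, 1.0, 0.5, 1.0])
beta_learned = np.array([0.0, 1.0, 0.0, -1.0])
Y_learned = gamma_learned * A_hat + beta_learned

print(Y)
print(Y_learned)
```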





## Visualization of tensor operations in a neural network using TensorFlow Playground



TensorFlow Playground offers only two-class classification (and regression) problems, so it cannot reproduce the three-class Iris task directly. It is still a convenient way to watch a small network learn a decision boundary. Follow these steps:

1. **Go to TensorFlow Playground**: Open your web browser and navigate to [TensorFlow Playground](http://playground.tensorflow.org).

2. **Select a Dataset**:
   - In the DATA panel on the left, choose one of the classification datasets shown as thumbnails:
     - **Circle**
     - **Exclusive or**
     - **Gaussian**
     - **Spiral**
   - Each dataset has two classes. Spiral is typically the hardest to separate and shows most clearly how extra layers and neurons help.

3. **Configure the Neural Network**:
   - **Add Layers**: Use the "+" and "-" buttons next to the hidden-layer count to add or remove hidden layers. Start with one hidden layer and set its number of neurons to either 4 or 8.
   - **Activation Function**: Select "ReLU" from the Activation dropdown; the choice applies to all hidden layers.
   - **Output Layer**: The Playground output is fixed to a single two-class prediction and cannot be changed. The Iris model above instead uses 3 softmax output neurons, one per species.

4. **Adjust Hyperparameters**:
   - Set the **learning rate** to a suitable value (e.g., 0.01 or 0.1).
   - Adjust the **regularization** settings if needed (you can start with no regularization).
   - There is no epoch limit to set: training runs until you pause it, so let the epoch counter reach a few hundred.

5. **Train the Model**:
   - Click the play button to start training the model.
   - Observe how the model learns to classify the data points. You can visualize the decision boundary and compare the training and test loss to see how well the model performs on the selected dataset.

Enjoy experimenting with different configurations and observing how the neural network learns!



# Conclusion

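In this notebook, I've walked through the tensor operations behind a TensorFlow model for the Iris dataset, focusing on the first dense layer, its ReLU activation, and batch normalization.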

Understanding the mathematics underlying neural network model architecture is crucial for effectively building and training neural networks.
