# Neural Network
## Neurons
Neurons are the fundamental building blocks of a neural network. A neuron takes a vector of inputs, combines it with its own weights and bias, and passes the result through an activation function (e.g. linear or sigmoid, as in linear/logistic regression) to produce a single output value.

The input of a neuron will be a vector that contains all features

The output of a neuron can be the input of another neuron
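As a minimal sketch of what a single neuron computes, assuming a sigmoid activation and made-up values for the inputs, weights, and bias (the name `neuron` is illustrative, not from any library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, g=sigmoid):
    """Output of one neuron: the activation applied to the weighted sum."""
    z = np.dot(w, x) + b
    return g(z)

x = np.array([0.5, -1.0])   # input features (illustrative values)
w = np.array([2.0, 1.0])    # one weight per feature
b = 0.0                     # bias
a = neuron(x, w, b)         # z = 2*0.5 + 1*(-1) + 0 = 0, so a = sigmoid(0) = 0.5
print(a)
```

The scalar `a` could then be one entry of the input vector of a neuron in the next layer.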

## Layers
A layer consists of one or more neurons that do not interact with each other.

* Input layer (Layer 0): first layer where the data is first introduced to the network
* Hidden layer(s): layers between the input and output layer
* Output layer: the final layer of the neural network that produces a final result

Each layer is performing "feature engineering" to create some new features based on the input that allows the model to make better predictions. This process is automatic.

## Activation function
The activation function transforms the weighted sum $z = \vec w \cdot \vec x + b$ into the neuron's output. It can be linear (as in linear regression), sigmoid (as in logistic regression), or another function.


<img src = "https://miro.medium.com/v2/resize:fit:1358/1*Gh5PS4R_A5drl5ebd_gNrg@2x.png" width = 500>

## Notations
$n^{[l]}$: number of neurons in the $l$th layer

$\vec w^{[l]}_j$, $b^{[l]}_j$: the weights and bias of the $j$th neuron in $l$th layer

$\vec a^{[l]}$: the output vector of the $l$th layer and the input of the $(l + 1)$th layer

$\vec a^{[0]}$: the input vector

$g$: the activation function


$\vec a^{[l]}$: a column vector whose number of rows equals the number of neurons in the current layer, $n^{[l]}$

General equation for the output of the $j$th neuron in the $l$th layer: $a^{[l]}_j = g(\vec w^{[l]}_j \cdot \vec a^{[l-1]} + b^{[l]}_j)$

$$ \vec a^{[l]} = \begin{bmatrix} a^{[l]}_1 \\ a^{[l]}_2 \\ \vdots \\ a^{[l]}_{n^{[l]}} \end{bmatrix}$$

# Forward propagation
In forward propagation, data moves through the network from left (input layer) to right (output layer), and each layer computes its output from the output of the previous layer



In [3]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, LeakyReLU
from tensorflow.keras import Sequential
from tensorflow.keras.losses import MeanSquaredError, BinaryCrossentropy, SparseCategoricalCrossentropy
from tensorflow.keras.activations import sigmoid, linear, relu

In [4]:
# TensorFlow convention: inputs must be 2D matrices, not 1D arrays
x = np.array([[200, 17]]) # a 1x2 matrix

In [5]:
# Create a Dense layer (there are other types of layers)
layer_1 = Dense(units = 3, activation='sigmoid')
# units: number of neurons in this layer
# activation: the activation function used

In [6]:
# Get the weights of a layer (empty until the layer is built by its first call)
layer_1.get_weights()

[]

In [7]:
# Output after the first layer
a1 = layer_1(x)
print(a1)

tf.Tensor([[1. 1. 1.]], shape=(1, 3), dtype=float32)


In [8]:
# Create a neural network (a series of layers)
model = Sequential([
    tf.keras.Input(shape=(1,)), # Input shape
    Dense(units = 1, activation = 'sigmoid', name = 'layer1'), # Layer1
    Dense(units = 2, activation = 'sigmoid', name = 'layer2')  # Layer2
    
])

In [9]:
# Build model
model.build()

# Model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 layer1 (Dense)              (None, 1)                 2         
                                                                 
 layer2 (Dense)              (None, 2)                 4         
                                                                 
Total params: 6
Trainable params: 6
Non-trainable params: 0
_________________________________________________________________


In [10]:
# Get layers
layer1 = model.get_layer('layer1')

# Set weights 
w = np.array([[0]])
b = np.array([0])
layer1.set_weights([w,b])

# Get weights
layer1.get_weights()

[array([[0.]], dtype=float32), array([0.], dtype=float32)]

# Implementation in Python

In [11]:
# Create data
np.set_printoptions(precision=2)

def load_coffee_data():
    """ Creates a coffee roasting data set.
        roasting duration: 12-15 minutes is best
        temperature range: 175-260C is best
    """
    rng = np.random.default_rng(2)
    X = rng.random(400).reshape(-1,2)
    X[:,1] = X[:,1] * 4 + 11.5          # 12-15 min is best
    X[:,0] = X[:,0] * (285-150) + 150  # 350-500 F (175-260 C) is best
    Y = np.zeros(len(X))
    
    i=0
    for t,d in X:
        y = -3/(260-175)*t + 21
        if (t > 175 and t < 260 and d > 12 and d < 15 and d<=y ):
            Y[i] = 1
        else:
            Y[i] = 0
        i += 1

    return (X, Y.reshape(-1,1))

X, Y = load_coffee_data()
print(X.shape, Y.shape)

(200, 2) (200, 1)


In [12]:
# Normalize data
norm_l = tf.keras.layers.Normalization(axis=-1)
norm_l.adapt(X)  # learns mean, variance
Xn = norm_l(X)



In [13]:
# Sigmoid function
def sigmoid(z):
    z = np.clip(z, -500, 500) # prevent overflow
    g = 1.0 / (1.0 + np.exp(-z))
    return g

In [14]:
# Dense layer function
def dense(a_in, W, b):
    units = W.shape[1] # get the number of neurons
    a_out = np.zeros(units) # create output array
    
    # loop through the array
    for j in range(units):
        w_j = W[:,j] # get weights for jth neuron
        b_j = b[j] # get bias for jth neuron
        z = np.dot(w_j, a_in) + b_j # weighted sum for the jth neuron
        a_j = sigmoid(z) # get jth value using the sigmoid function
        # print(a_j)
        a_out[j] = a_j
    #print(a_out)
    return (a_out)

In [15]:
# Create a sequence of layers
def network(x, W1, b1, W2, b2):
    a1 = dense(x, W1, b1)
    a2 = dense(a1, W2, b2)
    
    return (a2)

In [16]:
# Create weights and bias
W1_tmp = np.array( [[-8.93,  0.29, 12.9 ],   # the weights of the jth neuron are the jth column of the matrix
                    [-0.1,  -7.32, 10.81]] )

b1_tmp = np.array( [-9.82, -9.28,  0.96] )

W2_tmp = np.array( [[-31.18],
                    [-27.59],
                    [-32.56]] )

b2_tmp = np.array( [15.41] )

In [17]:
# Create model
def model(X, W1, b1, W2, b2):
    m = X.shape[0] # get the number of training examples
    p = np.zeros((m,1)) # create an output for probability
    for i in range(m):
        p[i,0] = network(X[i], W1, b1, W2, b2)[0] # network returns a length-1 array; store its scalar
    return(p)

In [18]:
# Test
X_tst = np.array([
    [200,13.9],  # positive example
    [200,17]])   # negative example
X_tstn = norm_l(X_tst)  # remember to normalize
predictions = model(X_tstn, W1_tmp, b1_tmp, W2_tmp, b2_tmp)
print(predictions)

[[9.72e-01]
 [3.29e-08]]




In [19]:
x = np.array([[1],[2]])
print(x.shape)

(2, 1)


# Vectorization
Vectorization allows us to compute the predictions for all training examples at the same time, which is a more efficient implementation

$X$: a matrix containing all the training examples, where the number of rows equals the number of features and the number of columns equals the number of training examples. The $i$th column of $X$ is the $i$th training example

$$\mathbf{X} = 
\begin{bmatrix}
| & | &  & |\\
(\mathbf{x}^{(1)}) & (\mathbf{x}^{(2)}) & \cdots & (\mathbf{x}^{(m)}) \\
| & | &  & | \\
\end{bmatrix} $$

$W^{[l]}$: a matrix containing all the weights of the $l$th layer, where the number of rows equals the number of output features (neurons in the current layer) and the number of columns equals the number of input features (neurons in the previous layer). $\mathbf{w}^{[l](j)}$ is a column vector containing the weights of the $j$th neuron in the $l$th layer, so the $j$th row of $W^{[l]}$ is its transpose

$$
\mathbf{W^{[l]}} = 
\begin{bmatrix}
--- (\mathbf{w}^{[l](1)})^T --- \\
--- (\mathbf{w}^{[l](2)})^T --- \\
\vdots \\
--- (\mathbf{w}^{[l](n)})^T --- \\
\end{bmatrix} $$

$\mathbf b^{[l]}$: a column vector containing all the biases of the $l$th layer, where $b^{[l](j)}$ is the bias of the $j$th neuron in the $l$th layer

$$
\mathbf{ b^{[l]}} = 
\begin{bmatrix}
 b^{[l](1)}  \\
 b^{[l](2)} \\
\vdots \\
b^{[l](n)} \\
\end{bmatrix}\quad
$$

$g^{[l]}()$: the activation function for the $l$th layer

$A^{[l]}$: a matrix containing the outputs of the $l$th layer for all training examples, where the number of rows equals the number of neurons in the current layer and the number of columns equals the number of training examples. The $i$th column of $A^{[l]}$ is the output vector of the $l$th layer for the $i$th training example

$$\mathbf{A^{[l]}} = 
\begin{bmatrix}
| & | &  & |\\
(\mathbf{a}^{[l](1)}) & (\mathbf{a}^{[l](2)}) & \cdots & (\mathbf{a}^{[l](m)}) \\
| & | &  & | \\
\end{bmatrix} $$

Note:
* $A^{[0]} = X$ the output for the 0th layer is the matrix of training examples
* $A^{[L]}$ is the output of the final layer, which is the final prediction 

## Vectorized computation
$$A^{[l]} = g^{[l]}(W^{[l]}A^{[l-1]} + b^{[l]})$$

Repeatedly calculate the output of the next layer using the output from the previous layer

In [42]:
# Vectorized implementation
def dense_vec(a_in, W, b):
    z = np.dot(W, a_in) + b # Matrix multiplication
    a_out = sigmoid(z)
    
    return (a_out)

In [43]:
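X_tst = 0.1*np.arange(1,9,1).reshape(2,4) # shape (2, 4): 2 features (rows), 4 examples (columns)
W_tst = 0.1*np.arange(1,7,1).reshape(3,2) # shape (3, 2): 3 output features (rows), 2 input features (columns)
b_tst = 0.1*np.arange(1,4,1).reshape(3,1) # shape (3, 1): one bias per output feature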

print(dense_vec(X_tst, W_tst, b_tst))

[[0.55 0.56 0.57 0.57]
 [0.61 0.62 0.64 0.65]
 [0.66 0.68 0.7  0.73]]
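
The vectorized formula can be applied layer after layer to get the final prediction. The following is a minimal sketch, assuming sigmoid activations in every layer and random illustrative weights; the name `forward` and the list-of-pairs layout of `params` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward propagation through every layer.

    X has shape (n_features, m). params is a list of (W, b) pairs, one per
    layer, with W of shape (n_out, n_in) and b of shape (n_out, 1).
    """
    A = X                            # A^[0] = X
    for W, b in params:
        A = sigmoid(W @ A + b)       # A^[l] = g(W^[l] A^[l-1] + b^[l])
    return A                         # A^[L], the final prediction

rng = np.random.default_rng(0)
X = rng.random((2, 4))                                   # 2 features, 4 examples
params = [(rng.normal(size=(3, 2)), np.zeros((3, 1))),   # layer 1: 2 -> 3 neurons
          (rng.normal(size=(1, 3)), np.zeros((1, 1)))]   # layer 2: 3 -> 1 neuron
A_L = forward(X, params)
print(A_L.shape)   # one column per training example
```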


# Neural network training
Steps:
1. Define the model as a sequence of layers
2. Compile the model and specify which loss function to use
3. Call `fit` to minimize the cost using gradient descent

In [20]:
# Create training example
x = np.arange(1,26)
X = np.array([x]).T
y = np.array([[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,]]).T
print(X.shape)
print(y.shape)

(25, 1)
(25, 1)


In [21]:
# Define the model
model1 = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid'),
])

In [22]:
# Define loss function
# Binary cross-entropy is the same loss function as in logistic regression (exactly 2 classes)
model1.compile(loss = BinaryCrossentropy())

In [23]:
# Minimize the cost
# X, y are training examples
# epochs is the number of passes over the entire training set
model1.fit(X, y, epochs=100)

Epoch 1/100
Epoch 2/100
...

<keras.callbacks.History at 0x1549c75b0>

# Activation functions
Issue: the sigmoid function outputs values between 0 and 1, which suits binary (0 or 1) predictions, but many quantities are neither binary nor limited to that range

## Linear (No activation function)
$$g(z) = z = \vec w \cdot \vec x + b$$
Since $g(z) = z$, this is equivalent to no activation function

## Sigmoid
$$g(z) = \frac{1}{1 + e^{-z}}$$

Derivatives
$$\frac{\partial g(z)}{\partial z} = \frac{1}{1 + e^{-z}}  (1 - \frac{1}{1 + e^{-z}}) = g(z) (1-g(z))$$

## Tanh
$$g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
<img src="https://vidyasheela.com/web-contents/img/post_img/39/tanh%20activation%20function-new.png" width=500>

Derivatives
$$\frac{\partial g(z)}{\partial z} = 1 - \tanh^2(z) = 1 - g(z)^2$$

## ReLU
$$g(z) = \max(0, z) = \begin{cases}
0 & \text{if $z<0$}\\
z & \text{if $z \geq 0$}\\
\end{cases}
$$
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*DfMRHwxY1gyyDmrIAd-gjQ.png" width=500>

ReLU outputs 0 when $z < 0$, which effectively switches the neuron off for those inputs. Combining many such units lets the network build more complex, piecewise linear functions for decision boundaries

Derivatives
$$\frac{\partial g(z)}{\partial z} = \begin{cases}
0 & \text{if $z<0$}\\
1 & \text{if $z \geq 0$}\\
\end{cases}
$$

## Leaky ReLU
$$g(z) =  \begin{cases}
az & \text{if $z<0$}\\
z & \text{if $z \geq 0$}\\
\end{cases}
$$
where $a$ is a very small positive number

<img src="https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-25_at_3.09.45_PM.png" width=400>

Derivatives
$$\frac{\partial g(z)}{\partial z} = \begin{cases}
a & \text{if $z<0$}\\
1 & \text{if $z \geq 0$}\\
\end{cases}
$$
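
The formulas above can be checked numerically. The following is a sketch in NumPy, assuming a leaky slope of `a = 0.01` (an illustrative choice, not a value given in these notes); it compares each derivative formula with a central finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    g = sigmoid(z)
    return g * (1.0 - g)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return np.where(z < 0, 0.0, 1.0)

def leaky_relu(z, a=0.01):
    return np.where(z < 0, a * z, z)

def d_leaky_relu(z, a=0.01):
    return np.where(z < 0, a, 1.0)

# Test points avoid z = 0, where ReLU and Leaky ReLU have a kink
z = np.array([-2.0, -0.5, 0.5, 2.0])
h = 1e-6
pairs = [(sigmoid, d_sigmoid), (np.tanh, d_tanh),
         (relu, d_relu), (leaky_relu, d_leaky_relu)]

# Largest gap between the finite difference and the formula, per activation
max_errors = [np.max(np.abs((g(z + h) - g(z - h)) / (2 * h) - dg(z)))
              for g, dg in pairs]
print(max_errors)
```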

## Select activation functions
For the output layer:
* For binary classification, use sigmoid
* For regression (when the result can be negative value), use linear activation function
* For regression (when the result cannot be negative), use ReLU

For the hidden layers:
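* ReLU is the most commonly used since it is cheap to compute and usually faster to train
* Tanh is almost always better than sigmoid since its output is centered at 0
* Leaky ReLU keeps a small gradient when $z < 0$, which avoids neurons that stop learning, but it is used less often than ReLU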

# Random initialization
When initializing $W$, if all the weights are initialized to 0, the neurons in the same layer compute the same output and receive the same gradient, so they still have identical weights after training. Therefore, we initialize the weights to random values close to 0, so each neuron ends up learning something different

## Vanishing/Exploding gradient
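When a neural network is very deep, the activations and gradients are products of many weight matrices, so they can grow or shrink exponentially with the number of layers. This can make the gradient descent steps extremely large or extremely small, which makes training unstable or very slow

As a partial solution, we can scale the random initial weights so that the variance of each entry of $W^{[l]}$ is $\frac{1}{n^{[l-1]}}$, where $n^{[l-1]}$ is the number of inputs to the layer (a common variant uses $\frac{2}{n^{[l-1]} + n^{[l]}}$). This keeps the scale of the activations roughly constant from layer to layer and slows down vanishing or exploding gradients. This method is called Xavier initialization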
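
A minimal sketch of this kind of initialization, assuming the common variant that draws each weight with mean 0 and variance $\frac{1}{n^{[l-1]}}$ and starts the biases at 0. The function name `xavier_init` and the layer sizes are illustrative.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Weights with mean 0 and variance 1 / n_in; biases start at 0."""
    W = rng.normal(0.0, 1.0, size=(n_out, n_in)) * np.sqrt(1.0 / n_in)
    b = np.zeros((n_out, 1))
    return W, b

rng = np.random.default_rng(0)
W, b = xavier_init(n_in=400, n_out=300, rng=rng)
print(W.var())   # close to 1 / 400
```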

# Mean normalization
Normalizing the input data ensures that all the feature values are in similar ranges, so gradient descent can converge faster

<img src="https://miro.medium.com/v2/resize:fit:992/format:webp/1*DK6tNx7Ke_27-CdLT3_1Ug.png">

Normalize the validation and test sets using the mean and variance computed from the training set, so that all sets are scaled in the same way
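
A minimal sketch of this in NumPy, assuming the data is stored with one example per row (as in the coffee data above) and using made-up feature values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on very different scales (illustrative values)
X_train = rng.normal(loc=[200.0, 13.0], scale=[30.0, 1.0], size=(100, 2))
X_test = rng.normal(loc=[200.0, 13.0], scale=[30.0, 1.0], size=(20, 2))

# Statistics come from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_n = (X_train - mu) / sigma
X_test_n = (X_test - mu) / sigma    # reuse the training statistics

print(X_train_n.mean(axis=0), X_train_n.std(axis=0))
```

The test set is not expected to have exactly zero mean and unit variance after this step; what matters is that it goes through the same transformation as the training set.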

In [24]:
# Implementation
layer2 = Dense(units=10, activation='relu')

# Vectorized cost function for binary classification

\begin{equation*}
\begin{split}
J &= f(A^{[L]}, Y) \\
&= -\frac{1}{m} \sum_i (y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)) \\
&= -\frac{1}{m} \sum_i (y_i \log(a_i^{[L]}) + (1 - y_i) \log(1 - a_i^{[L]})) \\
&= -\frac{1}{m} \underbrace{\sum_{\text{axis} = 1} (Y \odot \log(A^{[L]}) + (1 - Y) \odot \log(1 - A^{[L]}))}_\text{scalar}.
\end{split}
\end{equation*}

$Y$: a row vector ($1 \times m$) that contains the true label for each training example (0 or 1)

$A^{[L]}$: a row vector ($1 \times m$) that contains the value predicted by the model for each training example
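
A minimal NumPy sketch of this cost, assuming $Y$ and $A^{[L]}$ are stored as $1 \times m$ arrays (one column per example, matching the layout of $A^{[l]}$ above). The clipping constant `eps` is an added safeguard against $\log(0)$ and is not part of the formula.

```python
import numpy as np

def binary_cross_entropy(A_L, Y, eps=1e-12):
    """Vectorized cost J for binary classification."""
    m = Y.shape[1]
    A = np.clip(A_L, eps, 1 - eps)    # keep log() away from 0
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

Y = np.array([[1, 0, 1, 0]])                # true labels
A_L = np.array([[0.9, 0.1, 0.8, 0.3]])      # predicted probabilities
J = binary_cross_entropy(A_L, Y)
print(J)
```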

# Regularized cost function
$$J = - \frac{1}{m} \sum_{axis = 1} (Y \odot \log(A^{[L]}) + (1 - Y) \odot \log(1 - A^{[L]})) + \frac{\lambda}{2m} \sum_{l=1}^{L} ||W^{[l]}||^2$$

$\lambda$: regularization parameter

$||W^{[l]}||^2$: the squared Frobenius norm of the weight matrix of the $l$th layer (the sum of the squares of all its entries)
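
The regularization term can be computed on its own and added to the unregularized cost. A short sketch, with the function name `l2_penalty` and the weight values chosen only for illustration:

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lambda / 2m) times the sum of squared Frobenius norms of all W^[l]."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

weights = [np.array([[1.0, -2.0], [0.5, 0.0]]),   # W^[1]
           np.array([[3.0, 1.0]])]                # W^[2]
penalty = l2_penalty(weights, lam=0.1, m=10)
print(penalty)
```

The biases are left out of the penalty here, matching the formula above, which sums over the weight matrices only.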

# Dropout regularization
During training, each neuron in a layer is kept with probability $p_{\text{keep}}$ and dropped otherwise. A dropped neuron outputs 0 for that iteration, so its weights and bias are not updated. The output of the layer is then divided by $p_{\text{keep}}$ so that its expected value stays the same. During back propagation, the same set of neurons is dropped. In each iteration of gradient descent, a new random set of neurons is dropped

This can be done by creating a matrix $D^{[l]}$ that has the same size as the output of the $l$th layer, $A^{[l]}$, where each entry is 1 with probability $p_{\text{keep}}$ and 0 otherwise. The output of the $l$th layer after dropout is then $\frac{D^{[l]} \odot A^{[l]}}{p_{\text{keep}}}$

Dropout regularization forces the network to spread out its weights instead of relying heavily on a few features, which reduces overfitting


Dropout is only applied during training, not during validation or testing
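
A minimal sketch of dropout on one layer's output, assuming the variant that divides by the probability of keeping a neuron (often called inverted dropout). The names `dropout_forward`, `dropout_backward`, and `keep_prob` are illustrative.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Zero out random entries of A and rescale the rest by 1 / keep_prob."""
    D = (rng.random(A.shape) < keep_prob).astype(A.dtype)   # 1 = keep, 0 = drop
    A_drop = A * D / keep_prob
    return A_drop, D

def dropout_backward(dA, D, keep_prob):
    """Apply the same mask and scaling to the incoming gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(0)
A = np.ones((50, 1000))                     # illustrative layer output
A_drop, D = dropout_forward(A, keep_prob=0.8, rng=rng)
print(D.mean(), A_drop.mean())              # fraction kept, mean after rescaling
```

Because of the rescaling, the mean of the output stays close to the mean without dropout, so nothing needs to change at validation or test time.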

# Multiclass classification
Classification problems that have more than two classes, meaning the output($y$) can have more than two possible values

## Softmax
The softmax activation function is typically used only in the output layer of a neural network

For a classification problem that has $n$ different classes,
$$
\begin{cases}
z_1 = \vec w_1 \cdot \vec x + b_1\\
z_2 = \vec w_2 \cdot \vec x + b_2\\
\vdots\\
z_n = \vec w_n \cdot \vec x + b_n\\
\end{cases}
$$
for weights $\vec w_1, \vec w_2, \dots, \vec w_n$ and biases $b_1, b_2, \dots, b_n$

$$
\begin{cases}
a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + \dots + e^{z_n}}\\
a_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + \dots + e^{z_n}}\\
\vdots\\
a_n = \frac{e^{z_n}}{e^{z_1} + e^{z_2} + \dots + e^{z_n}}\\
\end{cases}
$$
Thus,
$$a_1 + a_2 + \dots + a_n = 1$$
where $a_j$ is the probability that the given input belongs to the $j$th class

Summary:
$$a_j = \frac{e^{z_j}}{ \sum_{k=1}^{n}{e^{z_k} }} $$

The output $\mathbf{a}$ is a vector of length $n$ that contains the probability of the input belonging to each class
\begin{align}
\mathbf{a}(x) =
\begin{bmatrix}
P(y = 1 | \mathbf{x}; \mathbf{w},b) \\
\vdots \\
P(y = n | \mathbf{x}; \mathbf{w},b)
\end{bmatrix}
=
\frac{1}{ \sum_{k=1}^{n}{e^{z_k} }}
\begin{bmatrix}
e^{z_1} \\
\vdots \\
e^{z_{n}} \\
\end{bmatrix} 
\end{align}

## Cost function for softmax
Loss function:
\begin{equation}
  L(\vec{a},y)=\begin{cases}
    -\log(a_1), & \text{if $y=1$}\\
    \vdots & \\
    -\log(a_n), & \text{if $y=n$}
  \end{cases}
\end{equation}

$L$ calculates the loss for each training example
 
For the loss function, only the term that corresponds to the target class contributes to the loss; the other terms are zero. To write the cost equation we need an indicator function that is 1 when the index matches the target and 0 otherwise.
$$\mathbf{1}\{y == j\} = \begin{cases}
    1, & \text{if $y==j$}\\
    0, & \text{otherwise}
  \end{cases}$$
Cost:
\begin{align}
J(\mathbf{w},b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} (\sum_{j=1}^{n}  1\left\{y^{(i)} == j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^n e^{z^{(i)}_k} })\right]
\end{align}

$-\sum_{j=1}^{n}  1\left\{y^{(i)} == j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^n e^{z^{(i)}_k} }$: the loss for the $i$th training example (only the term for the true class is nonzero)

Here $m$ is the number of examples and $n$ is the number of classes (outputs). The cost is the average of the losses over all examples.
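
A minimal NumPy sketch of this cost, assuming the $z$ values are stored with one example per row and the labels are integers. Note that the code numbers the classes from 0 to $n-1$, whereas the formulas number them from 1 to $n$. The indicator function becomes an index lookup: for each example, only the probability of the true class is read.

```python
import numpy as np

def softmax_rows(Z):
    """Softmax applied to each row of Z (shape (m, n))."""
    Z = Z - Z.max(axis=1, keepdims=True)     # does not change the result, avoids overflow
    ez = np.exp(Z)
    return ez / ez.sum(axis=1, keepdims=True)

def softmax_cost(Z, y):
    """Average of -log(a_j) over the examples, where j is the true class."""
    m = Z.shape[0]
    A = softmax_rows(Z)
    return -np.mean(np.log(A[np.arange(m), y]))

Z = np.zeros((3, 4))            # 3 examples, 4 classes, all z values equal
y = np.array([0, 2, 3])         # true classes
J = softmax_cost(Z, y)          # every class has probability 1/4, so J = log(4)
print(J)
```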

# Code

In [45]:
# Softmax function
def softmax(z):     # Input z is a 1D array of values
    ez = np.exp(z - np.max(z)) # subtracting the max avoids overflow and does not change the result
    sm = ez / np.sum(ez)
    return sm
# sm has one entry per class: the probability of the given data belonging to each class

In [46]:
# Create data set
from sklearn.datasets import make_blobs

centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0,random_state=30)
print(X_train.shape)
print(y_train.shape)

(2000, 2)
(2000,)


In [47]:
# TensorFlow implementation that reduces round-off errors

model2 = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=4, activation='linear') # The last layer produces a set of z values (logits), which must be converted to probabilities using softmax
])

# SparseCategoricalCrossentropy is the loss function for multiclass classification
# from_logits=True tells the loss function that the model outputs raw z values, so softmax is applied inside the loss
# optimizer - Adam algorithm (see below)
model2.compile(loss=SparseCategoricalCrossentropy(from_logits=True), optimizer = tf.keras.optimizers.Adam(0.001))


model2.fit(X_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x142088bb0>

In [48]:
z_value = model2.predict(X_train) # The model predicts a matrix of z values (one row per example)
prob = tf.nn.softmax(z_value).numpy() # Convert the z values to probabilities using the softmax function

for i in range(10):
    print(prob[i], np.argmax(prob[i])) # Print probability (each row adds up to 1) and predicted class

[0.   0.01 0.97 0.02] 2
[9.97e-01 2.55e-03 4.03e-05 2.45e-06] 0
[9.75e-01 2.43e-02 5.66e-04 6.04e-05] 0
[0.01 0.99 0.   0.  ] 1
[3.90e-03 2.77e-04 9.96e-01 1.18e-04] 2
[6.83e-04 1.26e-03 9.95e-01 3.29e-03] 2
[0.   0.99 0.   0.  ] 1
[1.00e+00 1.89e-05 2.85e-06 7.95e-08] 0
[4.26e-03 9.93e-01 1.59e-03 7.49e-04] 1
[9.77e-05 2.27e-04 2.12e-03 9.98e-01] 3


# Multilabel classification
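Each input can have multiple labels at the same time (e.g. an image that contains both a car and a pedestrian), so the target is a vector with one 0/1 entry per label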

# Additional layer type
## Convolutional layer
Convolutional layer only allows each neuron to get part of the complete input, which can
* Speed up computation
* Require less training data
* Reduce overfitting

# Back propagation
Back propagation trains a neural network efficiently using the chain rule. For a single training example, it determines how the weights and biases should change to lower the cost by calculating the gradient. Back propagation is performed for each training example, and the resulting gradients are averaged to find the most suitable set of parameters for the model.

Back propagation computes the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule. This makes the calculation of the derivatives faster and more efficient.

Since a weight $W$ only affects the cost through its effect on the next layer, we only need the derivatives of later layers to determine the effect of $W$ on the cost. The same computation is then repeated recursively for the previous layer.
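
The following is a sketch of back propagation for a small network, assuming one hidden layer, sigmoid activations in both layers, and the binary cross-entropy cost. All sizes and values are illustrative. The gradient of the last layer (`dZ2`) is computed first and then reused for the hidden layer, and one entry is compared with a finite difference as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return A1, A2

def cost(A2, Y):
    m = Y.shape[1]
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

def backward(X, Y, W2, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y                           # per-example loss gradient w.r.t. Z2
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)     # chain rule: reuse dZ2, times sigmoid'(Z1)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5))                # 2 features, 5 examples
Y = np.array([[0, 1, 1, 0, 1]])
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))

A1, A2 = forward(X, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(X, Y, W2, A1, A2)

# Compare one entry of dW1 with a central finite difference of the cost
h = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += h
W1_minus[0, 0] -= h
numeric = (cost(forward(X, W1_plus, b1, W2, b2)[1], Y)
           - cost(forward(X, W1_minus, b1, W2, b2)[1], Y)) / (2 * h)
print(abs(numeric - dW1[0, 0]))            # should be very small
```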


# Advanced optimization

## Mini-batch gradient descent (Stochastic gradient descent)
Mini-batch gradient descent speeds up training by dividing the training set into multiple mini-batches, where each mini-batch contains a subset of the training examples

For each iteration of gradient descent, the training is only performed on one mini-batch instead of the entire training set. After each iteration, we move on to the next mini-batch

In mini-batch gradient descent, an iteration does not necessarily reduce the cost, but the general trend should be a decreasing cost.

### Notations

$X$: the matrix of all training examples

$Y$: the matrix for all labels of $X$

$X^{\{t\}}$: a $n$ by $k$ matrix for the $t$th mini-batch, where $n$ is the number of features for each training example and $k$ is the number of training examples in this mini-batch

$Y^{\{t\}}$: a $1$ by $k$ matrix for the labels of the $t$th mini-batch

### Choosing mini-batch size
In general, the smaller the batch size, the faster each iteration, but the noisier the cost function

If the training set is small, use batch gradient descent. Otherwise, set the mini-batch size to a power of two (typically 64, 128, 256, or 512 examples per mini-batch)

Note: the training examples and their labels should be randomly shuffled before splitting into mini-batches
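
A minimal sketch of shuffling and splitting, assuming the column layout from the notations above (one training example per column). The function name `make_mini_batches` and the toy data are illustrative.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size, rng):
    """Shuffle the examples, then cut them into mini-batches.

    X has shape (n, m) and Y has shape (1, m). Returns a list of
    (X_t, Y_t) pairs; the last mini-batch may be smaller.
    """
    m = X.shape[1]
    perm = rng.permutation(m)                 # same shuffle for X and Y
    X_s, Y_s = X[:, perm], Y[:, perm]
    return [(X_s[:, k:k + batch_size], Y_s[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

rng = np.random.default_rng(0)
X = np.vstack([np.arange(10), np.arange(10) * 10])   # 2 features, 10 examples
Y = np.arange(10).reshape(1, 10)                     # label = example index, to check the pairing
batches = make_mini_batches(X, Y, batch_size=4, rng=rng)
print([X_t.shape for X_t, _ in batches])
```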


## Exponentially weighted average
An exponentially weighted average is a running average of the data in which older values contribute less to the current average

$$v_{t} = \beta v_{t-1} + (1 - \beta) \theta_{t}$$
$$v_0 = 0$$

$t$: number of iterations

$\beta$: the weight parameter (between 0 and 1) that determines how much the current value, $\theta_{t}$, contributes to the overall average. A larger $\beta$ means the current value is less significant (the average covers roughly the last $\frac{1}{1-\beta}$ values)

$\theta_{t}$: the current value

### Bias correction
Because $v_0$ is initialized to 0, the first few averages are biased toward 0 and underestimate the data. Dividing by $1 - \beta^t$ corrects this, and the correction fades as $t$ grows since $\beta^t$ approaches 0. Here $v_{t-1}$ is the uncorrected average from the previous step
$$\hat v_{t} = \frac{v_t}{1-\beta^t} = \frac{\beta v_{t-1} + (1 - \beta) \theta_{t}}{{1-\beta^t}}$$
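
A short sketch of the running average with and without bias correction, assuming the correction is applied to a copy of $v_t$ while the recursion itself keeps the uncorrected value. The function name `ewa` and the constant test signal are illustrative.

```python
import numpy as np

def ewa(thetas, beta, bias_correction=True):
    """Exponentially weighted average of a sequence, starting from v_0 = 0."""
    v = 0.0
    out = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta              # uncorrected running average
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)

thetas = np.full(5, 10.0)                              # constant signal of 10
raw = ewa(thetas, beta=0.9, bias_correction=False)     # starts far below 10
corrected = ewa(thetas, beta=0.9)                      # equals 10 from the first step
print(raw)
print(corrected)
```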

## Momentum
The momentum algorithm reduces the oscillations in gradient descent by using an exponentially weighted average of the gradients

To do this, we first calculate $dW$ and $db$. Then, apply the exponentially weighted average formula, where
$$v_{dW} = \beta v_{dW} + (1 - \beta) dW$$
$$v_{db} = \beta v_{db} + (1 - \beta) db$$

Then, update the weights and biases
$$W = W - \alpha v_{dW}$$
$$b = b - \alpha v_{db}$$
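
A minimal sketch of one momentum update, shown for $W$ only (the update for $b$ is identical). The objective $J(W) = \sum W^2$, the starting point, and the values of $\alpha$ and $\beta$ are assumptions chosen for illustration.

```python
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.1, beta=0.9):
    """One gradient descent step with momentum."""
    v_dW = beta * v_dW + (1 - beta) * dW     # exponentially weighted average of gradients
    W = W - alpha * v_dW
    return W, v_dW

# Illustrative objective J(W) = sum(W**2), whose gradient is 2W
W = np.array([1.0, -2.0])
v_dW = np.zeros_like(W)
for _ in range(300):
    W, v_dW = momentum_step(W, 2 * W, v_dW)
print(W)    # close to the minimum at 0
```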

## RMSprop
Root mean square prop (RMSprop) is another algorithm that reduces the oscillations in gradient descent and speeds up learning

To do this, we first calculate $dW$ and $db$. Then, apply the exponentially weighted average formula to the squared gradients (squared element-wise), where
$$s_{dW} = \beta_2 s_{dW} + (1 - \beta_2) dW^2$$
$$s_{db} = \beta_2 s_{db} + (1 - \beta_2) db^2$$

Then, update the weights and biases
$$W = W - \alpha \frac{dW}{\sqrt{(s_{dW})} + \epsilon}$$
$$b = b - \alpha \frac{db}{\sqrt{(s_{db})} + \epsilon}$$

$\epsilon$: a very small number to prevent division by 0
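
A minimal sketch of one RMSprop update for $W$. The gradient values and hyperparameters are illustrative assumptions. The example uses two coordinates whose gradients differ by a factor of 100 to show that dividing by $\sqrt{s_{dW}}$ evens out the step sizes.

```python
import numpy as np

def rmsprop_step(W, dW, s_dW, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSprop step: scale each gradient entry by its running RMS."""
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2      # average of squared gradients
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)
    return W, s_dW

W = np.array([1.0, 1.0])
dW = np.array([200.0, 2.0])          # steep direction and flat direction
W_new, s_dW = rmsprop_step(W, dW, s_dW=np.zeros_like(W))
print(W - W_new)                     # both coordinates move by about the same amount
```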


## Adam algorithm
The Adam algorithm combines the Momentum and RMSprop algorithms for better performance

To do this, we first calculate $dW$ and $db$. Then, apply the exponentially weighted average formula to the gradients and to the squared gradients, where
$$v_{dW} = \beta_1 v_{dW} + (1 - \beta_1) dW$$

$$v_{db} = \beta_1 v_{db} + (1 - \beta_1) db$$

$$s_{dW} = \beta_2 s_{dW} + (1 - \beta_2) dW^2$$

$$s_{db} = \beta_2 s_{db} + (1 - \beta_2) db^2$$

Next, apply bias correction
$$\hat v_{dW} = \frac{v_{dW}}{1-(\beta_1)^t}, \quad \hat v_{db} = \frac{v_{db}}{1-(\beta_1)^t}$$

$$\hat s_{dW} = \frac{s_{dW}}{1-(\beta_2)^t}, \quad \hat s_{db} = \frac{s_{db}}{1-(\beta_2)^t}$$

Then, update the weights and biases
$$W = W - \alpha \frac{\hat v_{dW}}{\sqrt{\hat s_{dW}} + \epsilon}$$
$$b = b - \alpha \frac{\hat v_{db}}{\sqrt{\hat s_{db}} + \epsilon}$$

$\alpha$: learning rate

$\beta_1, \beta_2$: two weight parameters

$t$: number of iterations

$\epsilon$: a very small number to prevent division by 0

In TensorFlow, the Adam optimizer adapts the effective step size for each parameter, so the learning rate $\alpha$ needs less manual tuning. When successive steps point in the same direction, the effective step grows, which speeds up learning; when the steps oscillate, the gradients partially cancel in the average and the effective step shrinks, which improves stability
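
A minimal NumPy sketch of one Adam update for $W$, assuming the standard formulation in which the bias-corrected values are kept separate from the running averages. The default values shown for `alpha`, `beta1`, `beta2`, and `eps` are common choices and should be treated as assumptions here.

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. t is the iteration number, starting at 1."""
    v = beta1 * v + (1 - beta1) * dW           # momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * dW ** 2      # RMSprop-style average of squared gradients
    v_hat = v / (1 - beta1 ** t)               # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

# Illustrative objective J(W) = sum(W**2), whose gradient is 2W
W0 = np.array([1.0, -2.0])
v = np.zeros_like(W0)
s = np.zeros_like(W0)
W1, v, s = adam_step(W0, dW=2 * W0, v=v, s=s, t=1)
print(W0 - W1)    # on the first step every entry moves by about alpha
```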


## Learning rate decay
Learning rate decay reduces the oscillation near the minimum by using a smaller learning rate as training proceeds, which can speed up convergence

$$\alpha = \frac{\alpha_0}{1 + \text{Decay Rate} \times \text{Epoch Number}}$$

$\alpha$: current learning rate

$\alpha_0$: initial learning rate

$\text{Decay Rate}$: a hyperparameter

$\text{Epoch Number}$: the number of iterations through the entire training set. After going through all the training examples, epoch number will increase by 1 (if mini-batch is used, epoch number will increase when all mini-batches are used once)

Note: there are other learning rate decay schedules, but the general purpose is the same
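
The decay formula above translates directly into code. In this sketch the function name and the values of $\alpha_0$ and the decay rate are illustrative.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Learning rate for a given epoch number (the first epoch is 0)."""
    return alpha0 / (1 + decay_rate * epoch)

# With alpha0 = 0.2 and decay rate 1, the rate halves after the first epoch
rates = [decayed_learning_rate(0.2, 1.0, epoch) for epoch in range(4)]
print(rates)
```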



# Hyperparameter tuning
When trying out different values of hyperparameters, use random sampling (random values for each hyperparameter) instead of a grid search (combinations of a few preset values) to explore the hyperparameter space better

After finding a region of values that works well for the model, we can zoom in to that region and sample it more densely

## Hyperparameter scaling
When a hyperparameter spans several orders of magnitude (e.g. the learning rate), sample it uniformly on a log scale instead of a linear scale to explore the hyperparameter space better
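
A minimal sketch of sampling on a log scale, assuming a learning rate range from $10^{-4}$ to $10^{-1}$ (an illustrative range). The exponent is drawn uniformly, so each decade receives about the same share of samples.

```python
import numpy as np

def sample_log_uniform(low, high, size, rng):
    """Draw values whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    exponents = rng.uniform(np.log10(low), np.log10(high), size=size)
    return 10.0 ** exponents

rng = np.random.default_rng(0)
alphas = sample_log_uniform(1e-4, 1e-1, size=1000, rng=rng)

# Share of samples in the smallest decade, [1e-4, 1e-3)
frac_small = np.mean(alphas < 1e-3)
print(frac_small)
```

With uniform sampling on a linear scale, by contrast, about 99% of the samples would fall between $10^{-3}$ and $10^{-1}$, leaving the smallest decade almost unexplored.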

# Batch normalization
Batch normalization increases the training speed by normalizing the values in each hidden layer of the neural network. Also, since each hidden layer is normalized, a change in the input data has a smaller effect on the deeper layers, which allows the network to perform more stably even when the distribution of the input data changes

When performing batch normalization with mini-batches, each mini-batch is scaled by its own mean and variance instead of those of the entire training set. This adds some noise to the hidden layer values, which has a slight regularization effect that helps prevent overfitting. However, it is not a substitute for a regularization algorithm

## Implementation
For the intermediate values $Z^{[l]}$ in the $l$th layer of a neural network, where
$$Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$$

$Z^{[l]}$: a matrix whose number of rows equals the number of neurons in the current layer and whose number of columns equals the number of training examples in the mini-batch

$$\mathbf{Z^{[l]}} = 
\begin{bmatrix}
| & | &  & |\\
(\mathbf{z}^{(1)}) & (\mathbf{z}^{(2)}) & \cdots & (\mathbf{z}^{(m)}) \\
| & | &  & | \\
\end{bmatrix} $$

Then,

$$\mu = \frac{1}{m} \sum_{i=1}^{m} z^{(i)}$$

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (z^{(i)} - \mu)^2$$

$$z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

$$\tilde{z}^{(i)} = \gamma^{[l]} z^{(i)}_{\text{norm}} + \beta^{[l]}$$

$z^{(i)}_{\text{norm}}$: normalized data with a mean of 0 and a variance of 1

$\gamma^{[l]}, \beta^{[l]}$: column vectors whose number of entries equals the number of neurons in the $l$th layer. $\gamma^{[l]}$ and $\beta^{[l]}$ are parameters learned by gradient descent that control the scale and the mean of $\tilde{z}$, respectively

$\tilde{z}^{(i)}$: the rescaled $z$ value of the $i$th training example, which is passed into the activation function of the $l$th layer

### Update $\gamma^{[l]}, \beta^{[l]}$

$$\gamma^{[l]} = \gamma^{[l]} - \alpha * d\gamma^{[l]}$$
$$\beta^{[l]} = \beta^{[l]} - \alpha * d\beta^{[l]}$$

Note: optimization algorithms also work in batch normalization when updating $\gamma^{[l]}$ and $\beta^{[l]}$
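
A minimal sketch of the forward computation for one layer and one mini-batch, assuming the column layout used above (one example per column). The function name `batch_norm_forward` and the values of $\gamma$, $\beta$, and $\epsilon$ are illustrative.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each row (neuron) of Z over the mini-batch, then rescale.

    Z has shape (n_l, m); gamma and beta have shape (n_l, 1).
    """
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)     # mean 0, variance 1 per neuron
    Z_tilde = gamma * Z_norm + beta            # learnable scale and shift
    return Z_tilde, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 64))    # 4 neurons, mini-batch of 64
gamma = np.full((4, 1), 2.0)
beta = np.full((4, 1), 0.5)
Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))    # about beta and gamma per neuron
```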


## Batch normalization during testing
At test time, a single test example does not have a mean and variance of its own for forward propagation. Thus, we estimate $\mu$ and $\sigma^2$ with exponentially weighted averages of the values seen across all mini-batches during training

During training, the $t$th mini-batch produces $\mu^{\{t\}[l]}$ and $\sigma^{2\{t\}[l]}$ for each layer. We keep exponentially weighted averages of these values over all mini-batches and use them to compute $z_{\text{norm}}$ and $\tilde{z}$ for the test example
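
A sketch of how the running statistics could be tracked for one layer. The class name `RunningStats`, the weight parameter of 0.9, and the starting values (mean 0, variance 1) are assumptions made for this example, and the mini-batches are simulated with random numbers.

```python
import numpy as np

class RunningStats:
    """Exponentially weighted averages of the per-mini-batch mean and variance."""

    def __init__(self, n_units, momentum=0.9):
        self.momentum = momentum
        self.mu = np.zeros((n_units, 1))
        self.var = np.ones((n_units, 1))

    def update(self, Z):
        """Call once per training mini-batch; Z has shape (n_units, m)."""
        k = self.momentum
        self.mu = k * self.mu + (1 - k) * Z.mean(axis=1, keepdims=True)
        self.var = k * self.var + (1 - k) * Z.var(axis=1, keepdims=True)

    def normalize(self, z, gamma, beta, eps=1e-8):
        """Use the stored statistics for a test example z of shape (n_units, 1)."""
        z_norm = (z - self.mu) / np.sqrt(self.var + eps)
        return gamma * z_norm + beta

rng = np.random.default_rng(0)
stats = RunningStats(n_units=3)
for _ in range(200):                                   # simulated training mini-batches
    stats.update(rng.normal(loc=5.0, scale=2.0, size=(3, 64)))

z_test = np.full((3, 1), 5.0)                          # a test example at the training mean
z_tilde = stats.normalize(z_test, gamma=np.ones((3, 1)), beta=np.zeros((3, 1)))
print(stats.mu.ravel(), z_tilde.ravel())               # mean near 5, output near 0
```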

