### ***Batch normalization***

Batch normalization, often abbreviated as BatchNorm, is a technique used to improve the speed, performance, and stability of artificial neural networks. It was introduced by Sergey Ioffe and Christian Szegedy in 2015 in their paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".

Here's a detailed explanation of what BatchNorm is and how it works:

### The Problem: Internal Covariate Shift
- **Internal Covariate Shift** refers to the change in the distribution of network activations due to the change in network parameters during training. It can slow down the training process because each layer needs to continuously adapt to new distributions in every training step.

### The Solution: Batch Normalization
- **Normalization**: Traditionally, input data to a neural network is normalized to have zero mean and unit variance. This helps the model learn more effectively. BatchNorm extends this idea to the intermediate layers of the network.
- **Working of BatchNorm**: During training, BatchNorm standardizes the outputs of a previous layer (in the original paper, the values just before the nonlinearity) by subtracting the batch mean and dividing by the batch standard deviation. Afterward, it applies a scale factor and a shift factor. These two parameters are learnable and get updated through backpropagation.
  
  The process is as follows:
  1. Compute the mean and variance of a mini-batch.
  2. Normalize the activations of the mini-batch by subtracting the mean and dividing by the square root of the variance plus a small epsilon (to avoid division by zero).
  3. Scale and shift the normalized values using two parameters ($\gamma$ and $\beta$) that are learned during training.
  
- **Benefits**:
  - **Stabilizes the Learning Process**: By normalizing the inputs to layers within the network, it helps to stabilize the learning process and reduces the number of epochs required to train deep networks.
  - **Enables Higher Learning Rates**: Since the network is more stable, it allows the use of higher learning rates, which can further speed up training.
  - **Reduces Sensitivity to Initial Weights**: With BatchNorm, the network is less sensitive to the initial starting weights.
  - **Acts as Regularization**: BatchNorm adds a slight noise to the activations, which can have a regularization effect, similar to dropout.
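
The three steps listed above can be sketched for a single feature in plain Python. This is a minimal illustration, not a reference implementation: the function name `batch_norm_train`, the default `eps` value, and the sample activations are assumptions made for the example.

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    m = len(batch)
    # Step 1: mini-batch mean and (biased) variance.
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    # Step 2: normalize; eps guards against division by zero.
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # Step 3: scale and shift with the learnable parameters.
    return [gamma * xh + beta for xh in x_hat]

activations = [2.0, 4.0, 6.0, 8.0]  # hypothetical values
out = batch_norm_train(activations)
out_mean = sum(out) / len(out)
out_var = sum((y - out_mean) ** 2 for y in out) / len(out)
print(out_mean, out_var)  # mean close to 0, variance close to 1
```

With `gamma=1.0` and `beta=0.0` the output is simply the standardized batch; in a real network both parameters would be updated by the optimizer.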

### Implementation Details:
- **Mini-Batch Dependence**: The normalization statistics are computed over each mini-batch, so the effectiveness of BatchNorm depends on the batch size: very small batches give noisy estimates of the mean and variance.
- **Inference**: During inference (model deployment), the mean and variance used for normalization are not computed from the current batch; instead, fixed estimates of the training-set statistics are used, typically running averages accumulated during training. These are often referred to as the "population statistics".
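
One common way to obtain the population statistics is to keep exponential moving averages of the batch mean and variance during training. The sketch below assumes that approach; the class name `RunningBatchNorm1D`, the `momentum` value of 0.1, and the initial running values are illustrative choices, not something the text above prescribes.

```python
import math

class RunningBatchNorm1D:
    """Single-feature batch norm that tracks running statistics (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum  # assumed update rate for the running averages
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward_train(self, batch):
        m = len(batch)
        mean = sum(batch) / m
        var = sum((x - mean) ** 2 for x in batch) / m
        # Blend the current batch statistics into the running estimates.
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

    def forward_eval(self, x):
        # Inference uses the stored estimates, so one example can be processed alone.
        return (x - self.running_mean) / math.sqrt(self.running_var + self.eps)

bn = RunningBatchNorm1D()
for _ in range(200):
    bn.forward_train([2.0, 4.0, 6.0, 8.0])  # hypothetical, identical batches
print(bn.running_mean, bn.running_var)
print(bn.forward_eval(5.0))
```

Because `forward_eval` does not look at other examples, the output for a given input no longer depends on which batch it happens to arrive in.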

Batch normalization has become a standard component in constructing deep neural networks because of its ability to speed up training and enable the use of higher learning rates, making it easier to train deep networks effectively.

### **'Internal covariate shift'**: ***When & why does this happen?***

Internal covariate shift refers to the phenomenon during the training of deep neural networks where the distribution of each layer's inputs changes as the parameters (weights and biases) of the previous layers change. Here is when and why it happens:

1. **During Backpropagation**: In each training iteration, neural networks are updated via backpropagation. This is where the network learns from the error between its predictions and the true outcomes. The weights are adjusted to reduce this error.

2. **After Weight Updates**: As weights are updated to minimize the loss, the activation functions' outputs (the inputs to the subsequent layers) change. Since the network is deep, even small changes in the early layers can amplify as they propagate through the network, leading to significant changes in the distributions of the deeper layers' inputs.

3. **With Large Learning Rates**: If the learning rate is large, the updates to the weights can be significant, causing dramatic shifts in the distributions of the activations. This can destabilize the network, making it difficult for the network to converge to a solution.

4. **High Parameter Sensitivity**: Early in training, when weights are initialized randomly, the network can be particularly sensitive to these parameter updates. This sensitivity can lead to large shifts in the activation distributions.

5. **Across Different Data Batches**: Since training is usually performed on mini-batches of data, the statistics of one batch may differ from those of another. This means that each batch can cause the activations to have different distributions, contributing to the internal covariate shift.

The effect of internal covariate shift is that the layers have to constantly adapt to new distributions of data, which can slow down training and make it harder to use higher learning rates without causing instability. It also makes the network more sensitive to the initial parameter values and can aggravate the vanishing or exploding gradients problem, where the gradients become too small or too large to enable effective learning.
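
The shift can be made concrete with a toy example: the same inputs pass through a first layer before and after a weight update, and the statistics of what the second layer receives change. Everything here is hypothetical (the one-unit `layer1`, the weight values, the random inputs); it only illustrates the idea.

```python
import random

random.seed(0)
# A fixed batch of inputs to a hypothetical first layer.
inputs = [random.gauss(0.0, 1.0) for _ in range(1000)]

def layer1(x, w, b):
    """Toy first layer: affine map followed by ReLU."""
    return max(0.0, w * x + b)

def mean_and_std(values):
    m = sum(values) / len(values)
    s = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return m, s

# What the second layer sees before and after a (made-up) weight update.
before = [layer1(x, w=1.0, b=0.0) for x in inputs]
after = [layer1(x, w=1.5, b=0.5) for x in inputs]

print(mean_and_std(before))
print(mean_and_std(after))
```

The input data never changed, yet the second layer's inputs now have a larger mean and spread, which is the kind of moving target described above.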

### **Here's how it works**

1. **Normalization**: For each batch during training, BatchNorm standardizes the activations of a previous layer. This standardization adjusts the activations to have a mean of zero and a standard deviation of one, similar to feature scaling during data preprocessing.

   - The **mean** (also known as the average) is the sum of all data points divided by the number of points. It provides a central value for the data.
   - The **standard deviation** measures the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean, whereas a high standard deviation indicates that the values are spread out over a wider range.

   When a dataset has a mean of 0 and a standard deviation of 1, we say it is standardized (its values are z-scores). Standardizing does not by itself make the data normally distributed; the standard normal distribution is the particular Gaussian distribution that has these two properties. A standardized dataset is characterized as follows:

   1. **Mean 0**: This means that the data points, on average, are centered around the value 0. If you sum up all the deviations of the data points from this mean, it will be zero.

   2. **Standard Deviation of 1**: This means that the typical (root-mean-square) distance of the data points from the mean is 1. It's a measure of the spread of the dataset. Most of the data points (about 68% if the data is normally distributed) will lie within one standard deviation of the mean, which in this case is between -1 and +1.


   The mathematical formula for BatchNorm can be summarized as follows for each activation $x$:

   1. Compute the mean ($\mu_B$) and variance ($\sigma_B^2$) for the batch.
   2. Normalize the activations of the batch $x$ to $\hat{x}$ using the formula:
      $$ \hat{x_i} = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$
   
   
   The term $\sigma_{B}^{2}$ in the batch normalization formula represents the variance of a particular batch of data.

   In the context of batch normalization, we typically deal with a mini-batch of data during training. Here's what each term represents:

   - $x_i$: The individual data point from the mini-batch.
   - $\mu_{B}$: The mean of the mini-batch.
   - $\sigma_{B}^{2}$: The variance of the mini-batch.
   - $\epsilon$: A small constant added for numerical stability (to avoid division by zero).

   So, $\sigma_{B}^{2}$ is calculated by taking the average of the squared differences from the mini-batch mean $\mu_{B}$. It measures how spread out the values of the mini-batch are. In the normalization step, this variance is used to scale the data points of the mini-batch to unit variance, which means that after normalization the transformed data will have a variance of approximately 1 (very slightly less because of $\epsilon$, and before the application of the gamma and beta parameters). This is part of the process that helps stabilize learning, because every mini-batch is brought to the same mean and variance.
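
   As a worked example, take a hypothetical mini-batch $x = (1, 2, 3, 4)$ and ignore $\epsilon$ for readability (the values are chosen only for illustration):

   $$ \mu_B = \frac{1 + 2 + 3 + 4}{4} = 2.5 $$

   $$ \sigma_B^2 = \frac{(-1.5)^2 + (-0.5)^2 + (0.5)^2 + (1.5)^2}{4} = \frac{5}{4} = 1.25 $$

   $$ \hat{x} = \frac{x - 2.5}{\sqrt{1.25}} \approx (-1.342,\ -0.447,\ 0.447,\ 1.342) $$

   The normalized values average to 0, and their squared values average to $(1.8 + 0.2 + 0.2 + 1.8)/4 = 1$, so the variance is 1 as expected.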

2. **Scaling and Shifting**: After the standardization, the BatchNorm layer then applies two trainable parameters to each activation:
   - **Gamma ($\gamma$)**: A scaling factor that multiplies the normalized activation. This allows the network to scale the activations if it determines that the activations should have a different standard deviation.
   - **Beta ($\beta$)**: A shifting factor that is added to the scaled activations. This plays the same role as the bias term in a neural network layer, and it allows the network to shift the activations if it determines that they should have a different mean.

      Scale and shift the normalized value $\hat{x}$:
      $$ y_i = \gamma \cdot \hat{x_i} + \beta $$
      where $y_i$ is the output of the BatchNorm layer for the activation $x_i$, and $\gamma$ and $\beta$ are parameters learned during the training of the network.
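
A short sketch can show why the scale and shift matter: with suitable values of $\gamma$ and $\beta$ the layer can leave the standardized values untouched or even undo the normalization completely, so no representational power is lost. The helper names `standardize` and `scale_and_shift` and the sample batch are assumptions made for this illustration.

```python
import math

def standardize(batch, eps=1e-5):
    """Return the normalized batch together with its mean and variance."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return x_hat, mean, var

def scale_and_shift(x_hat, gamma, beta):
    # y_i = gamma * x_hat_i + beta
    return [gamma * xh + beta for xh in x_hat]

eps = 1e-5
batch = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations
x_hat, mean, var = standardize(batch, eps)

# gamma = 1, beta = 0 leaves the standardized values unchanged.
unchanged = scale_and_shift(x_hat, gamma=1.0, beta=0.0)

# gamma = sqrt(var + eps), beta = mean undoes the normalization entirely.
recovered = scale_and_shift(x_hat, gamma=math.sqrt(var + eps), beta=mean)
print(recovered)  # approximately the original batch
```

In training, the network is free to settle anywhere between these extremes, since both parameters are learned per feature.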

The effects of BatchNorm are:

- It reduces internal covariate shift, which is the change in the distribution of network activations due to the update in weights during training. By normalizing the activations, it makes sure the subsequent layers receive data that's on a similar scale, sparing those later layers from having to adapt to a constantly changing distribution.
- It allows for higher learning rates because it helps mitigate exploding or vanishing gradients, making the network more stable.
- It acts as a regularizer, reducing (sometimes even eliminating) the need for Dropout.
- It can speed up convergence of the training process and lead to faster training times.

BatchNorm has become a standard component in many neural network architectures, especially deep convolutional networks, due to these benefits.