# Batch Normalization

**What is Batch Normalization**

1. Batch-Normalization (BN) is an algorithmic method which makes 
the training of Deep Neural Networks (DNN) **faster** and more 
**stable**.

2. It consists of normalizing activation vectors from hidden 
layers using the **mean and variance** of the current batch. 

3. This normalization step is applied right before (or right after) the 
nonlinear function.




**Where it is applied**

BatchNorm is usually applied after a linear layer and before the activation function.

Example:

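Case 1 (Linear → BatchNorm → ReLU)

1. Layer : [-5, -2, 0, 3, 6]
2. BatchNorm : [-1.41, -0.63, -0.10, 0.68, 1.46]
3. ReLU : [0, 0, 0, 0.68, 1.46]

Case 2 (Linear → ReLU → BatchNorm)

1. Layer : [-5, -2, 0, 3, 6]
2. ReLU : [0, 0, 0, 3, 6]
3. BatchNorm : [-0.75, -0.75, -0.75, 0.5, 1.75]

(The five values are treated as one neuron's outputs over a batch of 5.) Case 1 is better. In case 2, ReLU runs first, so the non-positive inputs are all clipped to 0 and BatchNorm then maps them to the same value, which loses the information on the negative side. In case 1, the values are first centered and rescaled, and only then does ReLU remove the negatives.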
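The two orderings can be checked with a few lines of plain Python. This is only a sketch: `batch_norm` and `relu` are illustrative helpers, not library functions. It assumes the five values are one neuron's outputs over a batch, uses the biased batch variance, and leaves out the learnable scale and shift.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize a list of activations to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    """Clip negative values to zero."""
    return [max(0.0, v) for v in values]

layer_out = [-5.0, -2.0, 0.0, 3.0, 6.0]

case1 = relu(batch_norm(layer_out))   # Linear -> BatchNorm -> ReLU
case2 = batch_norm(relu(layer_out))   # Linear -> ReLU -> BatchNorm

print([round(v, 2) for v in case1])
print([round(v, 2) for v in case2])
```

In the second ordering the three non-positive inputs end up with exactly the same value, so BatchNorm can no longer tell them apart.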


**Why use BatchNorm** 

1. With unnormalized data, the contour plot of the loss looks like an elongated oval. A high learning rate overshoots in that case, so we keep the learning rate low, which makes training slow.
(See the first slide of the dropout notes for how it looks.)

2. With normalized data, the contour plot is closer to a circle, so a higher learning rate can be used, which makes training faster.

3. Due to internal covariate shift, the input distribution of later layers keeps changing,
 so those layers constantly need to re-adapt. This leads to instability and slower training.
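The learning-rate argument in points 1 and 2 can be illustrated on a toy quadratic loss. This is a sketch under assumptions, not a model of a real network: the loss is f(w) = 0.5 * Σ cᵢ wᵢ², and the curvature values `[1, 100]` (oval contours) and `[1, 1]` (circular contours) are made up for the illustration.

```python
def gradient_descent(curvatures, lr, steps=50):
    """Minimize f(w) = 0.5 * sum(c_i * w_i**2) from w = (1, ..., 1); return the final loss."""
    w = [1.0 for _ in curvatures]
    for _ in range(steps):
        # The gradient of f with respect to w_i is c_i * w_i.
        w = [wi - lr * c * wi for wi, c in zip(w, curvatures)]
    return 0.5 * sum(c * wi ** 2 for c, wi in zip(curvatures, w))

elongated = [1.0, 100.0]   # oval contours: directions with very different scales
round_bowl = [1.0, 1.0]    # circular contours: comparable scales

for lr in (0.5, 0.015):
    print(lr, gradient_descent(elongated, lr), gradient_descent(round_bowl, lr))
```

With the high learning rate the elongated loss blows up while the round one converges. With the low learning rate the elongated loss is stable but still far from its minimum after 50 steps.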





**Concept of Internal Covariate Shift**

1. First understand covariate shift. It happens when:
    
    1. The input distribution changes.

    2. But the relationship between input → output stays the same.

2. For example:
   
    1. Train only on red roses → the model learns “rose = red + flower features”.
    
    2. Later it sees blue/yellow roses → still roses, but the input distribution has changed.
    
    3. The model struggles because it never saw these distributions before. So it has to re-adapt to the new types of roses, even though the labeling rule didn’t change.

3. Internal covariate shift (ICS) is defined as the change in the distribution of network activations caused by changes in the network parameters during training.

4. It is the same concept as covariate shift, but happening inside the neural network.
   Example: inside layer 5 of a network, the input distribution keeps changing every batch because the weights of layers 1–4 are updated continuously. So deeper layers constantly receive different feature distributions, which causes unstable training and slower convergence.


**How BatchNorm Works?**

See the slides for an example.

1. z11 -> z11(N) -> z11(BN) -> g(z11(BN)) -> a11 (see the Batch Normalization slides).

2. z11 = w1 * (cgpa) + w2 * (iq) + b. The pre-activations have shape (4 × 2), where 4 is the batch size. For each neuron, we normalize across the batch (down each column) by subtracting the batch mean mu and dividing by the batch standard deviation sigma. mu and sigma are computed from the batch and are not learnable. After normalization each column has mean = 0 and standard deviation = 1 (so variance = 1).

3. z11(BN) = gamma * z11(N) + beta, and then g(z11(BN)) is the activation,
where gamma and beta are learnable parameters.

4. While normalizing z11, we add a small epsilon in the denominator, i.e. z11(N) = (z11 - mu) / sqrt(sigma^2 + epsilon), so that the division is safe when sigma is 0.
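The four steps above can be sketched in plain Python. The input values are hypothetical, `batch_norm_forward` is an illustrative name, and the biased batch variance is an assumption. It is a minimal sketch of the training-time forward pass, not a reference implementation.

```python
import math

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Batch-normalize pre-activations of shape (batch_size, num_neurons).

    gamma and beta hold one learnable scale and shift per neuron.
    """
    batch_size = len(z)
    num_neurons = len(z[0])
    out = [[0.0] * num_neurons for _ in range(batch_size)]
    for j in range(num_neurons):                        # one column per neuron
        column = [row[j] for row in z]
        mu = sum(column) / batch_size                   # batch mean
        var = sum((v - mu) ** 2 for v in column) / batch_size  # batch variance
        for i, v in enumerate(column):
            z_norm = (v - mu) / math.sqrt(var + eps)    # z(N)
            out[i][j] = gamma[j] * z_norm + beta[j]     # z(BN)
    return out

# Hypothetical pre-activations: 4 rows (batch size) and 2 neurons (columns).
z = [[7.0, 0.2],
     [5.0, 0.4],
     [9.0, 0.1],
     [3.0, 0.3]]

z_bn = batch_norm_forward(z, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in z_bn:
    print([round(v, 3) for v in row])
```

With gamma = 1 and beta = 0 each output column has mean close to 0 and variance close to 1. Other values of gamma and beta let the network rescale and shift that distribution.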


**Batch Norm during test**

1. During training we provide the data in batches, but at test time we may provide a single row. How is z11 normalized in that case? (We need a mean and a standard deviation, and they cannot be computed meaningfully from a single row.)

2. For that, during training we keep an exponentially weighted moving average of the mean and variance, separately for each neuron, and use these stored values at test time.

3. The update is of the form [val_store = alpha * val_store + (1 - alpha) * current_calculated] (check a reference for the exact expression), and for the standard deviation we track its square, the variance, in the same way. [alpha is the momentum, a hyperparameter]
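A minimal sketch of this bookkeeping for a single neuron follows. The per-batch statistics and the starting values (0 for the mean, 1 for the variance) are assumptions made for the illustration. Note also that PyTorch's `momentum` argument weights the new value instead (`running = (1 - momentum) * running + momentum * current`), so its default `momentum=0.1` corresponds to alpha = 0.9 in the notation used here.

```python
def update_running_stat(stored, current, alpha=0.9):
    """Exponentially weighted moving average in the form used in these notes."""
    return alpha * stored + (1 - alpha) * current

running_mean, running_var = 0.0, 1.0   # assumed starting values

# Hypothetical (mean, variance) pairs computed from three training batches.
batch_stats = [(2.0, 4.0), (2.2, 3.6), (1.8, 4.4)]
for batch_mean, batch_var in batch_stats:
    running_mean = update_running_stat(running_mean, batch_mean)
    running_var = update_running_stat(running_var, batch_var)

def normalize_at_test_time(z, eps=1e-5):
    """Normalize a single value with the stored statistics; no batch is needed."""
    return (z - running_mean) / (running_var + eps) ** 0.5

print(running_mean, running_var, normalize_at_test_time(3.0))
```

After only three batches the stored values are still far from the batch statistics, because alpha = 0.9 makes the average move slowly. Over many batches they settle near the typical batch mean and variance.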

**Advantages**

1. Makes training more stable.

2. Makes training faster, as you can choose a higher learning rate.

3. Acts as a regularizer to some extent (not as much as dropout). [The mean and standard deviation depend on the batch itself, so when the batch changes the activations change slightly. This introduces a little randomness or noise, which slightly decreases overfitting.]

4. Weight initialization matters less, because normalization reduces its impact on convergence (see slides). [Without BatchNorm the cost function is stretched and it takes longer to reach the optimal solution; with it the cost surface is more uniform, so the optimal solution is reached more easily.]
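The noise mentioned in point 3 can be made visible by normalizing the same activation inside several random batches. The population of activations, the batch size of 32, and the helper `normalize_in_batch` are all assumptions for this sketch.

```python
import math
import random

def normalize_in_batch(sample, others, eps=1e-5):
    """Return the normalized value of `sample` given the rest of its batch."""
    batch = [sample] + others
    mu = sum(batch) / len(batch)
    var = sum((v - mu) ** 2 for v in batch) / len(batch)
    return (sample - mu) / math.sqrt(var + eps)

random.seed(0)
# Hypothetical activations of one neuron over a whole dataset.
population = [random.gauss(0.0, 1.0) for _ in range(1000)]

sample = 0.5
outputs = [normalize_in_batch(sample, random.sample(population, 31)) for _ in range(5)]
print([round(v, 3) for v in outputs])
```

The same input produces a slightly different normalized value in each batch, which is the small random perturbation that gives BatchNorm its mild regularizing effect.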

# Dropout

**What is Dropout and how it works**

1. Dropout is a technique used to reduce overfitting by **randomly turning off p% of the neurons** in the hidden layer during each forward pass. It encourages neurons to learn independent, useful representations.

2. It is applied to the hidden layers, after the ReLU activation function.

3. This has a **regularization** effect because it prevents a neural network from relying too heavily on specific neurons during training.

4. If the model overfits, increase the value of p; if it underfits, decrease the value of p. [Tip: first apply dropout to the last layer only, then check.]

**How it works in testing** 

1. No dropout is applied.

2. All neurons are active.

3. Their outputs (weights) are scaled (typically multiplied by 1−p) so that the expected activation matches training. (See the notes PDF.)
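The scaling in point 3 can be checked numerically. The sketch below compares the classic scheme described above with "inverted" dropout, which rescales the surviving activations by 1/(1−p) during training so that nothing needs to be scaled at test time. PyTorch's `nn.Dropout` uses the inverted form. The constant activations and the helper names are assumptions for the illustration.

```python
import random

def dropout_train(activations, p, inverted, rng):
    """Zero each activation with probability p (training-time behaviour)."""
    keep = 1.0 - p
    scale = 1.0 / keep if inverted else 1.0   # inverted dropout rescales survivors
    return [a * scale if rng.random() < keep else 0.0 for a in activations]

def dropout_test(activations, p, inverted):
    """All neurons stay active; classic dropout scales by (1 - p) here."""
    scale = 1.0 if inverted else 1.0 - p
    return [a * scale for a in activations]

rng = random.Random(0)
p = 0.3
activations = [1.0] * 100_000   # hypothetical layer with constant activations

for inverted in (False, True):
    train_mean = sum(dropout_train(activations, p, inverted, rng)) / len(activations)
    test_mean = sum(dropout_test(activations, p, inverted)) / len(activations)
    print(inverted, round(train_mean, 3), round(test_mean, 3))
```

In both schemes the average activation seen during training is close to the one produced at test time. They differ only in where the scaling factor is applied.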

**Drawbacks**

1. It delays convergence (i.e. reaching the right weights and biases): because neurons are dropped randomly, gradients become noisy, so the model takes longer to learn patterns. [Training becomes a little slower.]

2. The value of the cost function changes (in every forward pass, some nodes are not considered). This makes debugging harder and the calculation of gradients a little more difficult.

# L2 Regularization

**What is Regularization**

1. Regularization is applied to the weights of the model to penalize large values and
encourage smaller, more generalizable weights.

2. L2 regularization adds a penalty term λ Σ wᵢ² to the loss function.

3. Weight decay directly modifies the gradient update rule to include λwᵢ, effectively
shrinking the weights during training.

4. It encourages the network to distribute learning across multiple parameters, avoiding
reliance on a few large weights. [Remember: large weights lead to memorizing noise and to instability. For example, with a weight of 100 an input change from 50 to 51 changes the output by 100 * (51 - 50) = 100, whereas with a weight of 1 it changes the output by 1 * (51 - 50) = 1. With large weights the model becomes too sensitive and memorizes tiny noise in the data.]

5. Regularization is typically applied only to weights, not biases, as biases don't directly
affect model complexity.
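The link between points 2 and 3 can be sketched for plain SGD. The weights, gradients, and hyperparameters below are hypothetical. The only claim illustrated is that adding λ Σ wᵢ² to the loss contributes 2λwᵢ to each gradient, which matches a weight-decay step whose coefficient is 2λ.

```python
def sgd_step_l2_penalty(w, grad_data, lr, lam):
    """One SGD step on loss = data_loss + lam * sum(w_i ** 2)."""
    # The penalty adds 2 * lam * w_i to each gradient component.
    return [wi - lr * (g + 2 * lam * wi) for wi, g in zip(w, grad_data)]

def sgd_step_weight_decay(w, grad_data, lr, decay):
    """One SGD step that adds decay * w_i to the gradient directly."""
    return [wi - lr * (g + decay * wi) for wi, g in zip(w, grad_data)]

w = [3.0, -2.0, 0.5]            # hypothetical weights
grad_data = [0.1, -0.4, 0.0]    # hypothetical gradient of the data loss
lr, lam = 0.1, 0.01

a = sgd_step_l2_penalty(w, grad_data, lr, lam)
b = sgd_step_weight_decay(w, grad_data, lr, decay=2 * lam)
print(a)
print(b)
```

The last weight has zero data gradient and still shrinks slightly, which is the "decay" in weight decay. In PyTorch, the `weight_decay` argument of `torch.optim.SGD` adds `weight_decay * w` to the gradient in this way. For adaptive optimizers such as Adam, the penalty and the decay are no longer equivalent.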


# Code

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from torchinfo import summary

In [16]:
torch.manual_seed(42)

<torch._C.Generator at 0x1113abf50>

In [17]:

df = pd.read_csv('../2. Dataset/fmnist_small.csv')


In [18]:
x = df.iloc[:, 1:]/255.0
y = df.iloc[:,0]

In [19]:
xtrain , xtest , ytrain , ytest = train_test_split( x , y , test_size=0.2 , random_state=20)

In [20]:
xtrain_tensor = torch.from_numpy(xtrain.values).float()
xtest_tensor = torch.from_numpy(xtest.values).float()
ytrain_tensor = torch.from_numpy(ytrain.values)
ytest_tensor = torch.from_numpy(ytest.values)

In [21]:
print(xtrain_tensor.shape , ytrain_tensor.shape)

torch.Size([4800, 784]) torch.Size([4800])


In [22]:
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):

  def __init__(self, features, labels):

    self.features = features
    self.labels = labels

  def __len__(self):

    return len(self.features)

  def __getitem__(self, idx):

    return self.features[idx], self.labels[idx]


In [23]:
train_dataset = CustomDataset(xtrain_tensor,ytrain_tensor)
test_dataset = CustomDataset(xtest_tensor,ytest_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True , pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False , pin_memory=True)

In [24]:
class MyNN(nn.Module):

  def __init__(self, num_features):

    super().__init__()

    self.network = nn.Sequential(
        nn.Linear(num_features, 128), 
        nn.BatchNorm1d(128),
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(128, 64),
        nn.BatchNorm1d(64), 
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(64, 10)
    )

  def forward(self, features):

    out = self.network(features)

    return out

In [25]:
device = 'cpu'
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = 'mps'
    print("MPS is available")

MPS is available


In [26]:
model = MyNN(xtrain_tensor.shape[1]) 
model = model.to(device) # so that the weights are also moved to the device
summary(model , input_size = xtrain_tensor.shape , device=device)   # pass device explicitly; otherwise summary may use a different device and raise a runtime error

Layer (type:depth-idx)                   Output Shape              Param #
MyNN                                     [4800, 10]                --
├─Sequential: 1-1                        [4800, 10]                --
│    └─Linear: 2-1                       [4800, 128]               100,480
│    └─BatchNorm1d: 2-2                  [4800, 128]               256
│    └─ReLU: 2-3                         [4800, 128]               --
│    └─Dropout: 2-4                      [4800, 128]               --
│    └─Linear: 2-5                       [4800, 64]                8,256
│    └─BatchNorm1d: 2-6                  [4800, 64]                128
│    └─ReLU: 2-7                         [4800, 64]                --
│    └─Dropout: 2-8                      [4800, 64]                --
│    └─Linear: 2-9                       [4800, 10]                650
Total params: 109,770
Trainable params: 109,770
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 526.90
Input size (MB): 15.05
Forward

In [27]:
epochs = 10
learning_rate = 0.1

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr= learning_rate)

In [28]:
model.train()  # training mode: BatchNorm uses batch statistics, Dropout is active

for epoch in range(epochs):

  total_epoch_loss = 0

  for batch_features, batch_labels in train_loader:

    batch_features, batch_labels = batch_features.to(device), batch_labels.to(device)
 
    outputs = model(batch_features)

    loss = criterion(outputs, batch_labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    total_epoch_loss = total_epoch_loss + loss.item()

  avg_loss = total_epoch_loss/len(train_loader)
  print(f'Epoch: {epoch + 1} , Loss: {avg_loss}')




Epoch: 1 , Loss: 0.9772230315208436
Epoch: 2 , Loss: 0.6982409131526947
Epoch: 3 , Loss: 0.6130022386709849
Epoch: 4 , Loss: 0.5782167559862137
Epoch: 5 , Loss: 0.5423443067073822
Epoch: 6 , Loss: 0.5220815540353457
Epoch: 7 , Loss: 0.4715494939684868
Epoch: 8 , Loss: 0.4639199024438858
Epoch: 9 , Loss: 0.4541665451725324
Epoch: 10 , Loss: 0.44277151897549627


In [29]:
model.eval()  # evaluation mode: BatchNorm uses running statistics, Dropout is disabled

MyNN(
  (network): Sequential(
    (0): Linear(in_features=784, out_features=128, bias=True)
    (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Dropout(p=0.3, inplace=False)
    (4): Linear(in_features=128, out_features=64, bias=True)
    (5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Dropout(p=0.3, inplace=False)
    (8): Linear(in_features=64, out_features=10, bias=True)
  )
)

In [30]:
total = 0
correct = 0

with torch.no_grad():

  for batch_features, batch_labels in test_loader:

    batch_features, batch_labels = batch_features.to(device), batch_labels.to(device)

    outputs = model(batch_features)

    _, predicted = torch.max(outputs, 1)
    # torch.max(input, dim) reduces along dim 1, i.e. over the class scores of each row,
    # and returns (max_values, max_indices); the indices are the predicted classes

    total = total + batch_labels.shape[0]

    correct = correct + (predicted == batch_labels).sum().item()

print(correct/total)


0.83
