# Additional techniques for building neural networks with PyTorch

This notebook explores additional techniques to enhance neural networks' performance using PyTorch.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

In [2]:
# Generate dummy data
X_train = np.random.rand(100, 10).astype(np.float32)
y_train = np.random.rand(100, 1).astype(np.float32)

X_train = torch.tensor(X_train)
y_train = torch.tensor(y_train)

## Regularization techniques

Regularization is essential to prevent overfitting, where a model performs well on training data but poorly on unseen data. In PyTorch, L1 and L2 regularization can be applied by adding penalty terms to the loss function or, for L2, by setting the `weight_decay` argument of the optimizer.

### L1 and L2 regularization

- **L1 regularization**: Adds a penalty proportional to the sum of the absolute values of the model weights, encouraging sparsity (i.e., some weights are driven toward zero). In PyTorch, we add L1 by manually adding the penalty to the loss function.
- **L2 regularization**: Adds a penalty proportional to the sum of the squared model weights, encouraging smaller weights. In PyTorch, L2 regularization can be implemented using the `weight_decay` parameter in optimizers.

In [3]:
# Define a simple feedforward neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Instantiate the model
model = SimpleNN()

# Define L1 regularization manually
def l1_regularization(model, lambda_l1):
    l1_penalty = sum(param.abs().sum() for param in model.parameters())
    return lambda_l1 * l1_penalty

# Define L2 regularization (already available in optimizer)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)  # L2 regularization via weight_decay
criterion = nn.MSELoss()

# Training loop with L1 regularization (the optimizer above also applies L2 via weight_decay)
num_epochs = 10
lambda_l1 = 0.01
for epoch in range(num_epochs):
    model.train()
    
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    
    # Add L1 regularization
    loss += l1_regularization(model, lambda_l1)
    
    loss.backward()
    optimizer.step()
    
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

Epoch 1, Loss: 2.6478
Epoch 2, Loss: 2.6148
Epoch 3, Loss: 2.5822
Epoch 4, Loss: 2.5500
Epoch 5, Loss: 2.5184
Epoch 6, Loss: 2.4874
Epoch 7, Loss: 2.4570
Epoch 8, Loss: 2.4270
Epoch 9, Loss: 2.3974
Epoch 10, Loss: 2.3683


**Explanation**:
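- **L1 regularization**: We sum the absolute values of the model's parameters (weights and biases here, since the helper iterates over `model.parameters()`), multiply the sum by a regularization coefficient (`lambda_l1`), and add the result to the loss. The printed loss therefore includes the L1 penalty term.
- **L2 regularization**: Implemented directly through the `weight_decay` parameter of optimizers like `Adam`; its value controls the strength of the L2 penalty. Because the optimizer above sets `weight_decay=0.01`, this training loop applies L1 and L2 together.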
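
To make the link between `weight_decay` and an explicit L2 penalty concrete, the following is a minimal sketch for plain `SGD`, where `weight_decay=wd` corresponds to adding `(wd / 2)` times the sum of squared parameters to the loss. The names `model_a`, `model_b`, and `wd`, the tensor sizes, and the learning rate are illustrative assumptions, not part of the notebook above.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
x = torch.rand(16, 10)
y = torch.rand(16, 1)
wd = 0.01  # illustrative regularization strength

# Two identical copies of a small linear model
model_a = nn.Linear(10, 1)
model_b = copy.deepcopy(model_a)

# (a) L2 through the optimizer's weight_decay argument
opt_a = optim.SGD(model_a.parameters(), lr=0.1, weight_decay=wd)
opt_a.zero_grad()
nn.functional.mse_loss(model_a(x), y).backward()
opt_a.step()

# (b) L2 added to the loss by hand: (wd / 2) * sum of squared parameters
opt_b = optim.SGD(model_b.parameters(), lr=0.1)
opt_b.zero_grad()
l2_penalty = sum(p.pow(2).sum() for p in model_b.parameters())
loss_b = nn.functional.mse_loss(model_b(x), y) + 0.5 * wd * l2_penalty
loss_b.backward()
opt_b.step()

# After one step, both models should hold the same parameters
same_update = all(
    torch.allclose(pa, pb, atol=1e-6)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same_update)
```

Note that `weight_decay` acts on every parameter passed to the optimizer, biases included, which is why the manual penalty in this sketch also sums over all parameters.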

### Dropout

Dropout is a regularization technique that randomly sets a fraction of input units to zero during training. This prevents the network from becoming overly dependent on any particular feature, helping it generalize better to unseen data.

In [4]:
class DropoutNN(nn.Module):
    def __init__(self):
        super(DropoutNN, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.dropout1 = nn.Dropout(0.5)  # 50% dropout
        self.fc2 = nn.Linear(64, 32)
        self.dropout2 = nn.Dropout(0.2)  # 20% dropout
        self.fc3 = nn.Linear(32, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x)) # Apply first layer and activation function
        x = self.dropout1(x) # Apply dropout
        x = torch.relu(self.fc2(x)) # Apply second layer and activation function
        x = self.dropout2(x) # Apply dropout
        x = self.fc3(x) # Apply final output layer
        return x

# Instantiate the model
model = DropoutNN()

# Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Training loop
for epoch in range(num_epochs):
    model.train()
    
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    
    loss.backward()
    optimizer.step()
    
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

Epoch 1, Loss: 0.2200
Epoch 2, Loss: 0.2077
Epoch 3, Loss: 0.2065
Epoch 4, Loss: 0.1859
Epoch 5, Loss: 0.1817
Epoch 6, Loss: 0.1665
Epoch 7, Loss: 0.1619
Epoch 8, Loss: 0.1487
Epoch 9, Loss: 0.1325
Epoch 10, Loss: 0.1374


**Explanation**:
- `nn.Dropout(p)`: during training, randomly sets each input element to zero with probability `p` and scales the remaining elements by `1/(1-p)` so that the expected activation is unchanged. This encourages the model to learn redundant representations, improving generalization.

In PyTorch, dropout layers are created in the `__init__` method of the network class, alongside the other layers. Defining them there makes dropout part of the model architecture and fixes the dropout rate of each layer when the model is instantiated.

In the `forward` method, we apply dropout to the model’s activations (the outputs of layers). Calling `self.dropout1(x)` or `self.dropout2(x)` randomly zeroes a portion of those activations according to that layer's dropout rate, which helps to prevent overfitting.

Dropout is only active in training mode. Before evaluation (e.g., when testing the model or making predictions), we switch the model to evaluation mode with `model.eval()`, which disables dropout so the entire network is used for making predictions. Calling `model.train()` turns it back on.
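
The difference between the two modes can be checked on a standalone dropout layer. This is a small sketch, assuming an input of ones so that the effect is easy to read; the names `drop`, `train_out`, and `eval_out` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2.0
drop.train()
train_out = drop(x)
unique_train = set(train_out.unique().tolist())

# Evaluation mode: dropout is a no-op and the input passes through unchanged
drop.eval()
eval_out = drop(x)

print(unique_train)
print(torch.equal(eval_out, x))
```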


## Normalization techniques
Normalization layers rescale the inputs to each layer so that their distribution stays in a predictable range, which tends to make training faster and more stable.

### Layer normalization
Layer normalization normalizes the inputs across the features for each individual training example. This means that for each data point, the values of its features are adjusted to have a mean of zero and a standard deviation of one, independent of other data points. Since layer normalization is done for each example independently, it is particularly useful in RNNs and other models where the data comes in sequences, as it ensures that each step in the sequence is treated consistently.

- How it works:
    - For each sample, it calculates the mean and variance across all the features (neurons) of the layer.
    - Then, it normalizes that sample's activations (e.g., the outputs of neurons) to have a mean of 0 and a standard deviation of 1, and applies a learnable scale and shift.

In [5]:
class LayerNormNN(nn.Module):
    def __init__(self):
        super(LayerNormNN, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.layernorm1 = nn.LayerNorm(64)
        self.fc2 = nn.Linear(64, 32)
        self.layernorm2 = nn.LayerNorm(32)
        self.fc3 = nn.Linear(32, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.layernorm1(x)
        x = torch.relu(self.fc2(x))
        x = self.layernorm2(x)
        x = self.fc3(x)
        return x

# Instantiate the model
model = LayerNormNN()

# Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Training loop
for epoch in range(num_epochs):
    model.train()
    
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    
    loss.backward()
    optimizer.step()
    
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

Epoch 1, Loss: 0.6791
Epoch 2, Loss: 0.3399
Epoch 3, Loss: 0.2070
Epoch 4, Loss: 0.2194
Epoch 5, Loss: 0.2590
Epoch 6, Loss: 0.2628
Epoch 7, Loss: 0.2324
Epoch 8, Loss: 0.1872
Epoch 9, Loss: 0.1449
Epoch 10, Loss: 0.1163


**Explanation**:

- **Layer normalization** (`nn.LayerNorm(normalized_shape)`): Normalizes the activations of a layer for each individual data sample. It adjusts the mean and variance of the activations, which helps stabilize and speed up training.
    - `normalized_shape`: The shape of the input that needs to be normalized, typically the number of features in the layer.
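
As a quick check of what `nn.LayerNorm` computes, the sketch below reproduces its output by hand for a freshly created layer, whose learnable scale and shift start at 1 and 0. The tensor shape and the names `ln`, `manual`, and `matches` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(4)
x = torch.randn(3, 4)  # 3 samples, 4 features

# Per-sample statistics: reduce over the feature dimension only
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + ln.eps)

matches = torch.allclose(ln(x), manual, atol=1e-6)
print(matches)
```

Because the statistics are computed per sample, the result for one row does not change if the other rows of the batch change.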


### Batch normalization
Batch normalization normalizes the inputs across the batch dimension. This means that it computes the mean and standard deviation of each feature across the entire batch of training examples, and then normalizes each feature using these statistics. By normalizing the data in batches, batch normalization reduces what's known as "internal covariate shift," where the distribution of each layer’s inputs changes during training. This helps the model to converge faster and can allow for higher learning rates, leading to quicker and often more effective training. Batch normalization is more commonly used in feedforward networks, CNNs, and in some cases RNNs.

- How it works:
    - During training, for each mini-batch, it calculates the mean and variance of the activations (outputs of neurons) across that batch.
    - Then, it normalizes these activations to have a mean of 0 and a standard deviation of 1.
    - After normalization, it applies a scale and shift, allowing the network to adjust the normalized values.

In [6]:
class BatchNormNN(nn.Module):
    def __init__(self):
        super(BatchNormNN, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.batchnorm1 = nn.BatchNorm1d(64)
        self.fc2 = nn.Linear(64, 32)
        self.batchnorm2 = nn.BatchNorm1d(32)
        self.fc3 = nn.Linear(32, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.batchnorm1(x)
        x = torch.relu(self.fc2(x))
        x = self.batchnorm2(x)
        x = self.fc3(x)
        return x

# Instantiate the model
model = BatchNormNN()

# Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Training loop
for epoch in range(num_epochs):
    model.train()
    
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    
    loss.backward()
    optimizer.step()
    
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

Epoch 1, Loss: 0.8467
Epoch 2, Loss: 0.7256
Epoch 3, Loss: 0.6294
Epoch 4, Loss: 0.5546
Epoch 5, Loss: 0.4976
Epoch 6, Loss: 0.4552
Epoch 7, Loss: 0.4238
Epoch 8, Loss: 0.4000
Epoch 9, Loss: 0.3809
Epoch 10, Loss: 0.3645


**Explanation**:

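- **Batch normalization** (`nn.BatchNorm1d(num_features)`): Normalizes each feature using the mean and variance computed across the current batch. This helps stabilize training and can improve convergence by reducing internal covariate shift. In evaluation mode (`model.eval()`), the layer uses running estimates of the mean and variance collected during training instead of the current batch statistics.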
    - `num_features`: The number of features in the input to the BatchNorm layer.
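
The sketch below reproduces the training-mode computation of `nn.BatchNorm1d` by hand and then shows that evaluation mode gives a different result for the same input. It assumes a freshly created layer (scale 1, shift 0, default momentum of 0.1); the input shape and the names `bn`, `manual`, and `eval_out` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(8, 3) * 2 + 5  # 8 samples, 3 features, shifted away from 0

# Training mode: per-feature statistics are computed over the batch dimension
bn.train()
train_out = bn(x)
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
manual = (x - mean) / torch.sqrt(var + bn.eps)
matches_manual = torch.allclose(train_out, manual, atol=1e-5)

# The forward pass also updated the running mean: 0.9 * 0 + 0.1 * batch mean
running_ok = torch.allclose(bn.running_mean, 0.1 * mean, atol=1e-6)

# Evaluation mode: running estimates replace the batch statistics
bn.eval()
eval_out = bn(x)
differs_in_eval = not torch.allclose(eval_out, train_out, atol=1e-3)

print(matches_manual, running_ok, differs_in_eval)
```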


### Group normalization
Group normalization is a technique that splits the channels of the data into smaller groups and then applies normalization within each group. In linear (fully connected) layers, channels typically refer to the number of features or neurons in the layer. In convolutional layers, channels are the depth dimension of the input or output tensor. For example, in an RGB image, there are 3 channels corresponding to red, green, and blue.

It is particularly useful in scenarios where batch normalization might not perform well, such as with very small batch sizes. Unlike batch normalization, group normalization does not depend on the batch size, making it more effective in situations with small or varying batch sizes. We can control the granularity of the normalization by adjusting the number of groups, which gives more flexibility across tasks.

- How it works:
    - Grouping: The channels of the input are divided into groups (e.g., if you have 32 channels and divide into 4 groups, each group will have 8 channels).
    - Normalization: For each group, it computes the mean and variance and then normalizes the activations within that group to have a mean of 0 and a standard deviation of 1.

In [7]:
class GroupNormNN(nn.Module):
    def __init__(self):
        super(GroupNormNN, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.groupnorm1 = nn.GroupNorm(num_groups=8, num_channels=64)
        self.fc2 = nn.Linear(64, 32)
        self.groupnorm2 = nn.GroupNorm(num_groups=4, num_channels=32)
        self.fc3 = nn.Linear(32, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.groupnorm1(x.unsqueeze(2)).squeeze(2)  # Add dimension, normalize, and remove it
        x = torch.relu(self.fc2(x))
        x = self.groupnorm2(x.unsqueeze(2)).squeeze(2)
        x = self.fc3(x)
        return x

# Instantiate the model
model = GroupNormNN()

# Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Training loop
for epoch in range(num_epochs):
    model.train()
    
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    
    loss.backward()
    optimizer.step()
    
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

Epoch 1, Loss: 0.7407
Epoch 2, Loss: 0.4172
Epoch 3, Loss: 0.2973
Epoch 4, Loss: 0.2756
Epoch 5, Loss: 0.2725
Epoch 6, Loss: 0.2597
Epoch 7, Loss: 0.2292
Epoch 8, Loss: 0.1956
Epoch 9, Loss: 0.1638
Epoch 10, Loss: 0.1439


**Explanation**:

- **Group normalization** (`nn.GroupNorm(num_groups, num_channels)`): Divides the channels into groups and normalizes within each group.
    - `num_groups`: The number of groups to divide the channels into.
    - `num_channels`: The number of channels in the input.
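
To illustrate the grouping step, this sketch recomputes the output of `nn.GroupNorm` by hand for a freshly created layer (scale 1, shift 0). The choice of 8 channels in 2 groups and the names `gn`, `grouped`, and `manual` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gn = nn.GroupNorm(num_groups=2, num_channels=8)
x = torch.randn(3, 8, 1)  # (batch, channels, length), as in the model above

# Split the 8 channels into 2 groups of 4 and compute statistics per group
grouped = x.view(3, 2, -1)
mean = grouped.mean(dim=-1, keepdim=True)
var = grouped.var(dim=-1, unbiased=False, keepdim=True)
manual = ((grouped - mean) / torch.sqrt(var + gn.eps)).view_as(x)

matches = torch.allclose(gn(x), manual, atol=1e-5)
print(matches)
```

No statistic in this computation reduces over the batch dimension, which is why the result does not depend on the batch size.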


## Weight initialization strategies

Weight initialization is crucial for training neural networks efficiently and effectively. Proper initialization can help avoid problems like vanishing or exploding gradients, particularly in deep networks.

### Glorot (Xavier) initialization
Glorot initialization, also known as Xavier initialization, is suited to layers with linear or `tanh` activation functions. It aims to keep the scale of the gradients similar across all layers by drawing weights from a distribution with zero mean and a variance of 2/(input units + output units). It is also a reasonable general-purpose choice when we are unsure about the network's behavior.

In [8]:
# Define a simple feedforward neural network with Glorot Initialization
class GlorotNN(nn.Module):
    def __init__(self):
        super(GlorotNN, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        
        # Apply Glorot (Xavier) initialization to the weights
        nn.init.xavier_uniform_(self.fc1.weight)
        nn.init.xavier_uniform_(self.fc2.weight)
        nn.init.xavier_uniform_(self.fc3.weight)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Instantiate and train the model with Glorot Initialization
model_glorot = GlorotNN()
optimizer = optim.Adam(model_glorot.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Training loop
print("Training model with Glorot initialization:")
for epoch in range(10):
    model_glorot.train()
    optimizer.zero_grad()
    outputs = model_glorot(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

Training model with Glorot initialization:
Epoch 1, Loss: 0.2428
Epoch 2, Loss: 0.2132
Epoch 3, Loss: 0.1874
Epoch 4, Loss: 0.1653
Epoch 5, Loss: 0.1472
Epoch 6, Loss: 0.1334
Epoch 7, Loss: 0.1237
Epoch 8, Loss: 0.1182
Epoch 9, Loss: 0.1160
Epoch 10, Loss: 0.1164


**Explanation**

- **`__init__` method**: In PyTorch, the `__init__` method is where we define the architecture of our neural network. When we create a layer such as `nn.Linear`, PyTorch automatically creates its weights and biases and initializes them with a default scheme, so we don’t need to create them explicitly. Sometimes, however, we want to replace that default with a custom initialization for better training performance.

- **How to customize initialization?**
    - **Accessing weights** (`self.fc1.weight`): In the model class, we can access the weights of a layer directly by referring to them as `self.fc1.weight`, `self.fc2.weight`, etc. The `.weight` attribute of the `nn.Linear` object stores the matrix of weights for that layer. Even though we didn’t manually define `weight`, PyTorch did it for us under the hood when we created `self.fc1`.
    - **Glorot initialization** (`nn.init.xavier_uniform_`): This method initializes the weights using a uniform distribution with bounds calculated to maintain the variance of gradients throughout the network.
        - Adding lines like `nn.init.xavier_uniform_(self.fc1.weight)` in the `__init__` method is how we replace the default initialization with our custom one. Specifically, this line of code applies Glorot (Xavier) initialization to the weights of `self.fc1`.
        - Weight initialization is something we typically only need to do once, right after the layers are created. By putting it in the `__init__` method, we ensure that the custom initialization is applied every time we instantiate the model.
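
The variance of 2/(input units + output units) can be checked numerically. For a uniform distribution on `[-bound, bound]` the variance is `bound**2 / 3`, so the target variance corresponds to `bound = sqrt(6 / (fan_in + fan_out))`. The sketch below assumes a larger layer than the ones in the model above so that the empirical variance is a stable estimate; the layer size and variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 256, 128
layer = nn.Linear(fan_in, fan_out)
nn.init.xavier_uniform_(layer.weight)

# Bound of the uniform distribution and the variance it implies
bound = math.sqrt(6.0 / (fan_in + fan_out))
target_var = 2.0 / (fan_in + fan_out)

max_abs = layer.weight.abs().max().item()
empirical_var = layer.weight.var().item()

print(f'bound={bound:.4f}, max |w|={max_abs:.4f}')
print(f'target var={target_var:.6f}, empirical var={empirical_var:.6f}')
```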


### Kaiming (He) initialization
Kaiming initialization, also known as He initialization, is ideal for layers with ReLU or its variants as activation functions. This method is designed to mitigate the vanishing gradient problem by initializing weights from a distribution with zero mean and a variance of 2/(input units).

In [9]:
# Define a simple feedforward neural network with Kaiming Initialization
class KaimingNN(nn.Module):
    def __init__(self):
        super(KaimingNN, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        
        # Apply Kaiming (He) initialization to the weights
        nn.init.kaiming_uniform_(self.fc1.weight, nonlinearity='relu')
        nn.init.kaiming_uniform_(self.fc2.weight, nonlinearity='relu')
        nn.init.kaiming_uniform_(self.fc3.weight, nonlinearity='linear')  # linear output
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Instantiate and train the model with Kaiming Initialization
model_kaiming = KaimingNN()
optimizer = optim.Adam(model_kaiming.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Training loop
print("Training model with Kaiming initialization:")
for epoch in range(10):
    model_kaiming.train()
    optimizer.zero_grad()
    outputs = model_kaiming(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

Training model with Kaiming initialization:
Epoch 1, Loss: 0.1149
Epoch 2, Loss: 0.1116
Epoch 3, Loss: 0.1085
Epoch 4, Loss: 0.1056
Epoch 5, Loss: 0.1031
Epoch 6, Loss: 0.1009
Epoch 7, Loss: 0.0989
Epoch 8, Loss: 0.0971
Epoch 9, Loss: 0.0955
Epoch 10, Loss: 0.0940


**Explanation**

- **Kaiming initialization** (`nn.init.kaiming_uniform_`): Weights are drawn from a uniform distribution whose bounds account for the chosen nonlinearity. It is best suited to networks that use ReLU (or its variants), where it helps maintain the variance of gradients during backpropagation and so reduces the risk of vanishing or exploding gradients.
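
The variance of 2/(input units) can be verified in the same numerical way. With `nonlinearity='relu'` and the default `fan_in` mode, the uniform bound works out to `sqrt(6 / fan_in)`, which gives a variance of `bound**2 / 3 = 2 / fan_in`. This is a sketch with an illustrative layer size chosen to make the empirical estimate stable.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 256, 128
layer = nn.Linear(fan_in, fan_out)
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')

# Bound of the uniform distribution and the variance it implies
bound = math.sqrt(6.0 / fan_in)
target_var = 2.0 / fan_in

max_abs = layer.weight.abs().max().item()
empirical_var = layer.weight.var().item()

print(f'bound={bound:.4f}, max |w|={max_abs:.4f}')
print(f'target var={target_var:.6f}, empirical var={empirical_var:.6f}')
```

Only the number of input units enters the bound here, whereas the Glorot bound averages over input and output units.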

## Learning rate scheduling
Learning rate scheduling adjusts the learning rate during training, which can lead to better convergence. Common schedules include reducing the learning rate on a plateau or using cyclic learning rates.
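
As an illustration of the reduce-on-plateau idea, here is a minimal sketch using `optim.lr_scheduler.ReduceLROnPlateau`. The validation losses are made-up numbers chosen to stall after the second epoch, and the `factor` and `patience` values are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=2
)

# Hypothetical validation losses: improve once, then stall
val_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.8]
lrs = []
for val_loss in val_losses:
    optimizer.step()          # stands in for a real training epoch
    scheduler.step(val_loss)  # this scheduler is driven by the monitored metric
    lrs.append(optimizer.param_groups[0]['lr'])

print(lrs)
```

Unlike schedulers that follow a fixed timetable, this one only lowers the rate once the monitored metric has stopped improving for more than `patience` epochs.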

In [10]:
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Instantiate the model
model = SimpleNN()

# Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Define a learning rate scheduler
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

# Training loop with learning rate scheduling
num_epochs = 20
for epoch in range(num_epochs):
    model.train()
    
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    
    loss.backward()
    optimizer.step()
    
    # Step the learning rate scheduler
    scheduler.step()
    
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}, Learning Rate: {scheduler.get_last_lr()[0]:.6f}')

Epoch 1, Loss: 0.3112, Learning Rate: 0.010000
Epoch 2, Loss: 0.1733, Learning Rate: 0.010000
Epoch 3, Loss: 0.1172, Learning Rate: 0.010000
Epoch 4, Loss: 0.0900, Learning Rate: 0.010000
Epoch 5, Loss: 0.1065, Learning Rate: 0.001000
Epoch 6, Loss: 0.1190, Learning Rate: 0.001000
Epoch 7, Loss: 0.1176, Learning Rate: 0.001000
Epoch 8, Loss: 0.1148, Learning Rate: 0.001000
Epoch 9, Loss: 0.1113, Learning Rate: 0.001000
Epoch 10, Loss: 0.1076, Learning Rate: 0.000100
Epoch 11, Loss: 0.1040, Learning Rate: 0.000100
Epoch 12, Loss: 0.1036, Learning Rate: 0.000100
Epoch 13, Loss: 0.1033, Learning Rate: 0.000100
Epoch 14, Loss: 0.1029, Learning Rate: 0.000100
Epoch 15, Loss: 0.1025, Learning Rate: 0.000010
Epoch 16, Loss: 0.1021, Learning Rate: 0.000010
Epoch 17, Loss: 0.1021, Learning Rate: 0.000010
Epoch 18, Loss: 0.1021, Learning Rate: 0.000010
Epoch 19, Loss: 0.1020, Learning Rate: 0.000010
Epoch 20, Loss: 0.1020, Learning Rate: 0.000001


**Explanation**

- **optimizer** (`optim.Adam`): An optimizer that implements the Adam algorithm.
- **Learning rate scheduler** (`optim.lr_scheduler.StepLR`): A learning rate scheduler adjusts the learning rate during training. This can help the model converge more efficiently by using a higher learning rate in the early stages and reducing it as training progresses. `StepLR` is a specific type of learning rate scheduler that reduces the learning rate by a fixed factor (`gamma`) every few epochs (`step_size`). Here, the learning rate will be multiplied by 0.1 every 5 epochs.
- **Updates the model parameters** (`optimizer.step()`): After computing the gradients of the loss with respect to the model parameters, `optimizer.step()` updates the parameters to minimize the loss. This is called in each iteration (or batch) of the training loop after `loss.backward()` has been called to compute the gradients.
- **Updates the learning rate** (`scheduler.step()`): Updates the learning rate according to the schedule defined by the learning rate scheduler. It usually decreases the learning rate as training progresses. Typically called at the end of each epoch (or after a set number of iterations), depending on the type of scheduler.
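- **Last computed learning rate** (`scheduler.get_last_lr()`): Returns the most recent learning rate that was computed by the scheduler. It’s useful for monitoring how the learning rate changes over time during training. In the loop above it is called after `scheduler.step()`, so the printed value is the rate that will be used in the next epoch, which is why the drop to 0.001 already appears on the epoch 5 line.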
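
The `StepLR` schedule can also be written in closed form: the rate used in a given epoch (counting from 0) is `base_lr * gamma ** (epoch // step_size)`. The sketch below records the rate before each update and compares it against that formula; the small `nn.Linear` model and the use of `SGD` are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.optim as optim

base_lr, step_size, gamma = 0.01, 5, 0.1
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=base_lr)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)

observed = []
for epoch in range(20):
    # Record the rate that this epoch's update actually uses
    observed.append(optimizer.param_groups[0]['lr'])
    optimizer.step()   # stands in for a real training step
    scheduler.step()

expected = [base_lr * gamma ** (epoch // step_size) for epoch in range(20)]
schedule_matches = all(
    math.isclose(o, e, rel_tol=1e-9) for o, e in zip(observed, expected)
)
print(schedule_matches)
```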