<a href="https://colab.research.google.com/github/ppujari/PyTorch/blob/main/PyTorch_tips.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch Tip-11

**Leverage *torch.nn.ModuleList* for Dynamic Model Architectures**  
**Dynamically add or remove layers:**
*    This is useful when the number of layers is itself something you want to vary, for example a stack of recurrent layers whose depth changes between experiments.

**Iterate and modify layers:**  
*    Easily loop through and modify layers, e.g., to freeze certain layers or apply different learning rates.
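
As a rough illustration of the two bullets above, the sketch below freezes one layer and gives the remaining layers different learning rates through optimizer parameter groups. The layer sizes and learning rates are made-up values chosen only for the example.

```python
import torch
import torch.nn as nn

# Hypothetical three-layer stack, held in a ModuleList so it can be iterated.
layers = nn.ModuleList([nn.Linear(10, 32), nn.Linear(32, 32), nn.Linear(32, 1)])

# Freeze the first layer: its parameters will no longer receive gradients.
for param in layers[0].parameters():
    param.requires_grad = False

# One optimizer parameter group per trainable layer, each with its own learning rate.
param_groups = [
    {"params": layers[1].parameters(), "lr": 1e-3},
    {"params": layers[2].parameters(), "lr": 1e-2},
]
optimizer = torch.optim.SGD(param_groups, lr=1e-3)

# Count only the parameters that are still trainable.
trainable = sum(p.numel() for p in layers.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")
```

Because the frozen layer is simply left out of the parameter groups, the optimizer never touches it.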


**Use case:**  
Consider a text classification task where the input is a variable-length sequence of words and the output is a probability distribution over a set of classes. An RNN is a natural choice for variable-length input, and a ModuleList lets us also vary how many RNN layers the model stacks.

**How to use ModuleList to make the RNN dynamic:**

**Variable number of layers:**
*    Create a ModuleList to store the RNN layers.
*    Dynamically add or remove layers based on a hyperparameter or the input sequence length.
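
One way this could look is sketched below. `StackedRNN` and its arguments are hypothetical names introduced for illustration: the depth comes from a `num_layers` hyperparameter, while each `nn.RNN` layer handles the varying sequence length by itself.

```python
import torch
import torch.nn as nn

class StackedRNN(nn.Module):
    """Hypothetical model whose depth is set by the `num_layers` argument."""

    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Only the first layer sees the raw input features.
            in_features = input_size if i == 0 else hidden_size
            self.layers.append(nn.RNN(in_features, hidden_size, batch_first=True))

    def forward(self, x):
        # nn.RNN returns (output, final_hidden_state); only the output is passed on.
        for rnn in self.layers:
            x, _ = rnn(x)
        return x

model = StackedRNN(input_size=8, hidden_size=16, num_layers=3)
short_batch = torch.randn(4, 5, 8)   # 4 sequences of length 5
long_batch = torch.randn(4, 12, 8)   # 4 sequences of length 12
print(model(short_batch).shape, model(long_batch).shape)
```

The same model accepts both batches; only the time dimension of the output changes.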


In [None]:
# This is an abstract code snippet
import torch
import torch.nn as nn

class DynamicModel(nn.Module):
    def __init__(self):  # constructor
        """
        Initialize the parent nn.Module before registering any layers.

        DynamicModel: the name of the current class.
        self: the current instance of the DynamicModel class.
        Passing both to super() tells Python to look up the parent class of
        DynamicModel (nn.Module); calling its __init__() ensures that the
        DynamicModel instance is properly initialized (inheritance).
        """
        super(DynamicModel, self).__init__()
        self.layers = nn.ModuleList()

    def add_layer(self, layer):
        self.layers.append(layer)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

**Another Simple Example: Dynamic Linear Regression**

Let's create a small regression model whose layers we can add or remove dynamically. This is useful for experimenting with different model architectures.

In [3]:
#This is a basic example, but it demonstrates the flexibility of ModuleList
import torch
import torch.nn as nn

# Sample data
X = torch.randn(100, 10)  # 100 samples, each with 10 features
y = 2 * X[:, 0] + 3 * X[:, 1] + torch.randn(100)  # Linear relationship with noise

class DynamicLinearRegression(nn.Module):
    def __init__(self, input_size, output_size):
        super(DynamicLinearRegression, self).__init__()
        self.layers = nn.ModuleList()
        self.layers.append(nn.Linear(input_size, 64))
        self.layers.append(nn.ReLU())
        self.layers.append(nn.Linear(64, output_size))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Example usage:
model = DynamicLinearRegression(input_size=10, output_size=1)

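# Add a hidden layer before the output layer (appending after the final
# Linear(64, 1) would cause a shape mismatch in forward):
model.layers.insert(2, nn.Linear(64, 64))
model.layers.insert(3, nn.ReLU())

# Remove the two layers just added, restoring the original architecture:
del model.layers[2:4]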


In [4]:
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

In [5]:
# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass
    output = model(X)
    loss = criterion(output, y.unsqueeze(1))  # Reshape y to match output shape

    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

Epoch [10/100], Loss: 9.5748
Epoch [20/100], Loss: 7.4827
Epoch [30/100], Loss: 5.1878
Epoch [40/100], Loss: 3.2073
Epoch [50/100], Loss: 1.9626
Epoch [60/100], Loss: 1.3691
Epoch [70/100], Loss: 1.1244
Epoch [80/100], Loss: 1.0142
Epoch [90/100], Loss: 0.9540
Epoch [100/100], Loss: 0.9120


**Conclusion:**
By using a ModuleList, we can easily modify the architecture of the model by adding or removing layers without changing the core forward method. This flexibility is crucial for experimentation and optimization.
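
A short sketch of why `nn.ModuleList` is used here instead of a plain Python list: only layers held in a ModuleList are registered as submodules, so only their parameters reach the optimizer. Both class names below are hypothetical and exist only for the comparison.

```python
import torch.nn as nn

class PlainListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list: the layers are NOT registered as submodules.
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 2)]

class ModuleListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer, so its parameters are tracked.
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])

plain_count = sum(p.numel() for p in PlainListModel().parameters())
registered_count = sum(p.numel() for p in ModuleListModel().parameters())
print(plain_count, registered_count)
```

With the plain list, `model.parameters()` is empty, so an optimizer built from it would have nothing to train.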

# PyTorch Tip-10
**Using the torch.cuda.amp module**

**Why Use Mixed Precision?**  
Mixed precision training speeds up computations by using lower precision (e.g., float16) while maintaining the accuracy of float32 for critical operations. It reduces memory usage and can accelerate training significantly, especially on GPUs with Tensor Cores like NVIDIA’s.  

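**How to Implement Mixed Precision Training**  
Use the torch.cuda.amp (Automatic Mixed Precision) module for seamless integration.

In [None]:
# Assumes YourModel and dataloader are defined elsewhere.
# Recent PyTorch releases also expose these as torch.amp.GradScaler("cuda")
# and torch.amp.autocast("cuda").
import torch
from torch.cuda.amp import GradScaler, autocast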

model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

scaler = GradScaler()  # Initialize GradScaler for mixed precision

for inputs, targets in dataloader:  # Training loop
    inputs, targets = inputs.cuda(), targets.cuda()

    optimizer.zero_grad()
    
    # Enable mixed precision for the forward pass
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    
    # Scale the loss for backpropagation
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

print("Training completed with mixed precision!")

**Benefits:**  
**Speed:** Mixed precision uses hardware acceleration, reducing training time.  
**Memory Efficiency:** Enables larger batch sizes by cutting memory usage.

**When to Use:**  
Ideal for deep learning models that require heavy GPU resources.  
Works well for architectures like CNNs and Transformers.
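
To see what autocast actually changes, the sketch below runs a single linear layer under `torch.autocast`. It uses the CPU with bfloat16 purely so that it runs without a GPU; that choice is an assumption of this example and not part of the CUDA recipe above.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)       # parameters stay in float32
inputs = torch.randn(8, 16)

# Assumption: CPU autocast with bfloat16, chosen only so the sketch runs without a GPU.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    outputs = model(inputs)

print(model.weight.dtype)  # dtype in which the weights are stored
print(outputs.dtype)       # dtype in which the linear layer was computed
```

The weights are not converted; only the computation inside the context manager runs in the lower-precision type.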

# PyTorch Tip-9
**Gradient Clipping to Prevent Exploding Gradients:**  
When training deep neural networks, especially RNNs or LSTMs, gradients can sometimes grow too large, leading to exploding gradients. This causes instability and poor model convergence. To mitigate this, you can use gradient clipping, which limits the size of gradients during backpropagation.  

**How to Implement Gradient Clipping:**  
PyTorch makes it easy to clip gradients using torch.nn.utils.clip_grad_norm_. This function rescales the gradients in place so that their combined norm does not exceed max_norm; gradients whose combined norm is already below the limit are left unchanged.


In [None]:
# Assumes model, dataloader, loss_function and optimizer are defined elsewhere
import torch.nn.utils as utils

max_norm = 1.0  # Maximum norm for the gradients
for inputs, labels in dataloader:
    outputs = model(inputs)
    loss = loss_function(outputs, labels)
    
    loss.backward()  # Backpropagate
    
    # Clip gradients to avoid exploding gradients
    utils.clip_grad_norm_(model.parameters(), max_norm)
    
    optimizer.step()  # Update weights
    optimizer.zero_grad()  # Clear gradients

**I have used this in many cases, and it helps in the following ways:**  
**Stabilizes Training:**  
Especially useful when training deep or recurrent networks.  
**Prevents Gradient Explosion:**  
Keeps gradient values in a reasonable range to ensure smooth updates.  
**Improves Convergence:**  
Helps models converge more reliably by avoiding runaway updates.

Note that this technique is especially useful when working with deep architectures, RNNs, or complex optimization landscapes.
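
The effect of clipping can be checked directly. The sketch below uses a made-up linear model with deliberately large targets, so that the gradient norm starts well above `max_norm`, and compares the norm before and after the call.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)
model = nn.Linear(10, 1)
inputs = torch.randn(32, 10)
targets = 100 * torch.randn(32, 1)   # large targets produce large gradients

loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

max_norm = 1.0
# clip_grad_norm_ returns the total gradient norm measured BEFORE clipping.
norm_before = clip_grad_norm_(model.parameters(), max_norm)

# Recompute the total norm from the (now rescaled) gradients.
norm_after = torch.norm(torch.stack([p.grad.norm(2) for p in model.parameters()]), 2)
print(f"before: {norm_before.item():.2f}, after: {norm_after.item():.2f}")
```

The direction of the gradient is preserved; only its length is scaled down.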

# PyTorch Tip-8

**Difference between torch.manual_seed() and torch.cuda.manual_seed()**  
**torch.manual_seed()**  
This function sets the random seed for generating random numbers for the CPU and all GPU devices in PyTorch. This includes operations like random initialization of tensors, weights, and data augmentation that rely on random number generation.

**Effect:**

Ensures that any operation involving randomness, on the CPU or on any GPU, draws the same random numbers every time the code is run.  
**Use Case:** When you want to ensure consistent random behavior across both CPU and GPU operations in your PyTorch model.

**torch.cuda.manual_seed()**  
 This function sets the random seed specifically for CUDA operations on the current GPU device only. **Does not affect CPU random number generation**  
 **Use Case:** When you need to ensure consistent random behavior specifically for GPU computations but don't want to affect CPU operations.  
 ## Key Differences:  
 | Features |torch.manual_seed(42)|torch.cuda.manual_seed(42)|
 | :- | -: | :-: |
 |Scope|Affects both CPU and GPU operations|Affects only the current GPU device|
 |CPU Randomness|Sets the seed for CPU random number generation|Does not affect CPU random number generation|
 |GPU Randomness|Sets the seed for all GPU devices|Sets the seed only for the current GPU|
 |Multi-GPU Behavior|Seed is applied to all available GPUs|Affects only the active GPU (no effect on other GPUs)|
 |Common Use Case|Reproducibility across both CPU and GPU operations|Reproducibility for operations on the current GPU device only|


In [None]:
import torch
from torch import nn, optim
import numpy as np
import random

In [None]:
class LinearRegression(nn.Module):
    '''
        Neural network built from two Linear layers. Subclassing nn.Module is necessary when defining any network in PyTorch.
    '''

    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.layer1 = nn.Linear(in_features=1, out_features=1, bias=True, dtype=torch.float32)
        self.layer2 = nn.Linear(in_features=1, out_features=1, bias=True, dtype=torch.float32)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.forward1 = self.layer1(x)
        return self.layer2(self.forward1)

In [None]:
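# torch.manual_seed(42)  # uncomment to get identical weights on every run
# torch.cuda.manual_seed(42) only seeds the current GPU's generator. The layers
# below are initialized on the CPU, so their weights still change between runs.
torch.cuda.manual_seed(42)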
device = "cuda" if torch.cuda.is_available() else "cpu"
print('device = ', device )
model_linear = LinearRegression()
model_linear.to(device=device)
print(model_linear.state_dict())

device =  cpu
OrderedDict([('layer1.weight', tensor([[0.8815]])), ('layer1.bias', tensor([-0.7336])), ('layer2.weight', tensor([[0.8692]])), ('layer2.bias', tensor([0.1872]))])
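
For contrast with the run above, here is a minimal sketch of seeding everything in one place. `set_seed` is a hypothetical helper name, not a PyTorch function. Re-seeding before each initialization yields identical weights.

```python
import random
import torch
import torch.nn as nn

def set_seed(seed: int) -> None:
    """Hypothetical helper: seed the Python and PyTorch RNGs in one call."""
    random.seed(seed)
    torch.manual_seed(seed)           # CPU and all CUDA devices
    torch.cuda.manual_seed_all(seed)  # explicit; silently ignored without CUDA
    # If NumPy is used for data handling, np.random.seed(seed) belongs here too.

set_seed(42)
first = nn.Linear(1, 1).weight.detach().clone()

set_seed(42)
second = nn.Linear(1, 1).weight.detach().clone()

print(torch.equal(first, second))
```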


# **PyTorch Tip-7**

# **Gradient Accumulation for Large Batch Training:**  
When training deep learning models on limited hardware (like a single GPU with limited memory), you may not be able to fit a large batch into memory. Gradient accumulation is a trick that lets you simulate a larger batch size without increasing memory usage.  
It works like a divide-and-conquer strategy: instead of processing the entire batch at once, the batch is divided into smaller sub-batches, and the gradients of each sub-batch are added up.  
After all sub-batches have been processed, the accumulated gradients are used to update the model's weights.

**How to Do It:**  
You can accumulate gradients over several smaller batches and update the model only after accumulating enough gradients to match the desired larger batch size.

**Example:**

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

In [None]:
# Define your model, loss function, and optimizer
model = YourModel()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)


# Hyperparameters
batch_size = 64  # Desired effective batch size
sub_batch_size = 16  # Largest batch that fits in memory
accumulation_steps = batch_size // sub_batch_size  # 4 sub-batches per weight update
num_epochs = 10  # Example value

In [None]:
# Assume we want to use an effective batch size of 64, but can only use 16 due to memory constraints.
# data_loader is assumed to yield sub-batches of size sub_batch_size.
optimizer.zero_grad()  # Start from clean gradients

for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(data_loader):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss = loss / accumulation_steps  # Scale loss so the accumulated gradient is an average

        loss.backward()  # Accumulate gradients

        if (i + 1) % accumulation_steps == 0:
            optimizer.step()  # Update weights
            optimizer.zero_grad()  # Reset gradients

# This simulates a larger batch size of 64 by using 4 batches of 16 and accumulating gradients.


**Why This Helps:**  
**Memory-efficient:**  
You can work with smaller batches while achieving the same effect as
training with a larger batch.

**Stable training:**  
 Larger batch sizes can help stabilize gradient updates and potentially lead to better model convergence.

This is particularly useful when dealing with large models or datasets where memory is a constraint.
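
The claim that accumulation reproduces the large-batch gradient can be verified on a toy problem. The sketch below assumes a mean-reduced loss and equally sized sub-batches. The model and data are made up for the check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()          # mean reduction (the default)
X, y = torch.randn(64, 10), torch.randn(64, 1)

# Gradient from one full batch of 64.
model.zero_grad()
criterion(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Gradient accumulated over 4 sub-batches of 16, each loss scaled by 1/4.
accumulation_steps = 4
model.zero_grad()
for X_sub, y_sub in zip(X.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(X_sub), y_sub) / accumulation_steps
    loss.backward()               # gradients add up in .grad
accumulated_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accumulated_grad, atol=1e-5))
```

Layers that compute statistics per batch, such as batch normalization, still see only the small sub-batches, so the equivalence is exact only for models without such layers.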

# **PyTorch Tip-6**

When training a model, efficient data loading is crucial. PyTorch provides the DataLoader class, which can handle batching, shuffling, and parallel data loading with ease.

**Benefits of DataLoader:**   
*    **Batching:** Automatically splits your dataset into batches.
*    **Shuffling:** Randomizes the order of data, which helps in breaking any potential patterns in the data.
*    **Parallel Loading:** Loads data in parallel using multiple worker processes, speeding up the data pipeline.

**Example:**  
**Step 1: Create a Dataset**  
First, you need to create a custom dataset by subclassing torch.utils.data.Dataset.  



In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        y = self.labels[idx]
        return x, y

# Example data
data = torch.randn(1000, 10)
labels = torch.randint(0, 2, (1000,))

**Step 2: Create a DataLoader**  
Then, you can create a DataLoader to handle batching and shuffling.

In [None]:
# Create dataset instance
dataset = CustomDataset(data, labels)

# Create DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

# Iterate through DataLoader
for batch_data, batch_labels in dataloader:
    # Your training code here
    print(batch_data, batch_labels)
    break



tensor([[-5.0268e-02,  2.4722e+00,  8.5836e-01, -7.1339e-01, -5.5074e-01,
          1.2122e+00, -4.0883e-01, -1.8010e+00, -4.8705e-01, -1.4218e-01],
        [ 6.0705e-02, -1.1061e+00, -4.1906e-01,  1.4947e-01, -7.9631e-01,
         -5.4613e-01,  7.5147e-01, -2.6709e-02,  6.7446e-01, -1.3347e+00],
        [-1.8181e+00,  1.0670e+00, -2.6840e-01, -1.7488e-01, -2.8205e-02,
          6.4314e-02,  1.1637e+00,  2.9105e-01, -6.5504e-01,  1.8453e+00],
        [-1.0356e+00, -1.5176e+00, -5.7280e-01, -5.0991e-01, -2.1609e+00,
         -4.2958e-01, -3.7735e-01, -7.3656e-01,  8.8699e-02,  1.4433e+00],
        [ 4.1850e-01,  2.5052e+00,  1.5065e-01,  1.1184e+00, -2.3551e-01,
          1.5556e+00,  6.9068e-01,  7.1115e-01,  1.2963e+00, -1.7523e+00],
        [ 9.2839e-01, -1.9293e-03,  9.6393e-01,  3.3717e-02,  1.8298e-01,
         -1.6492e+00,  1.4252e+00,  3.4436e-01, -3.8601e-01, -1.5281e-01],
        [-5.1026e-01,  2.3432e-01,  2.4819e-01,  9.3880e-01,  4.5567e-02,
         -1.5704e-02,  1.0721e+0
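
When the data already lives in tensors, `torch.utils.data.TensorDataset` can stand in for the custom class. The sketch below is an alternative to the example above, not a replacement for it; it prints shapes instead of full tensors and uses `num_workers=0` to keep loading in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.randn(1000, 10)
labels = torch.randint(0, 2, (1000,))

# TensorDataset indexes every tensor along the first dimension, like CustomDataset above.
dataset = TensorDataset(data, labels)

# num_workers=0 loads in the main process, which is the simplest setting to debug.
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

batch_data, batch_labels = next(iter(dataloader))
print(batch_data.shape, batch_labels.shape)   # shapes only, instead of the full tensors
print(len(dataloader))                        # number of batches per epoch
```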

# **PyTorch Tip-5**

A PyTorch tip that combines efficiency and debugging best practices:

**Overfit a Single Batch for Sanity Checks**

Before investing significant time training on a large dataset, use a small batch to verify your model's functionality. This can catch errors early and save you time:

**Create a Data Loader:** Set up your data loader as usual for training.  
**Grab a Single Batch:** Extract the first batch of data (images, labels) from the data loader using next(iter(data_loader)).  
**Overfit the Batch:** Train your model on this single batch for a few epochs. You expect the model to overfit (achieve very high accuracy) on this small sample.

In [None]:
import torch

# Assuming you have your model (`model`), data loader (`data_loader`),
# `loss_function` and `optimizer` defined

# Get the first batch of data
images, labels = next(iter(data_loader))

# Train repeatedly on this single batch (e.g., 5 passes; use more if the loss is still falling)
for epoch in range(5):
    # Forward pass on the same batch every time
    outputs = model(images)
    loss = loss_function(outputs, labels)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Evaluate the model's performance on the single batch (optional)
# You can calculate metrics such as accuracy, precision, recall, etc.

# If your model cannot even overfit a single batch, it suggests an issue with the
# model architecture, learning process, or data preprocessing.

# If successful, proceed with training on the full dataset


**Benefits:**

**Early Debugging:** Catch potential errors before extensive training.  
**Faster Iteration:** Quickly test model changes without waiting for full dataset training.  
**Efficiency:** Reduce training time on large datasets if there are fundamental issues.  

**Remember:** Overfitting a single batch is a sanity check, not a complete evaluation. If successful, proceed with training on the entire dataset.
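
A self-contained version of this check might look like the sketch below. The random batch, the small network and the 200 optimization steps are all assumptions made for illustration; in practice the batch would come from `next(iter(data_loader))`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for `next(iter(data_loader))`: one fixed batch of 16 samples.
features = torch.randn(16, 10)
labels = torch.randint(0, 2, (16,))

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

initial_loss = loss_function(model(features), labels).item()
for step in range(200):
    # Same batch on every step: the model is expected to memorize it.
    optimizer.zero_grad()
    loss = loss_function(model(features), labels)
    loss.backward()
    optimizer.step()
final_loss = loss_function(model(features), labels).item()

accuracy = (model(features).argmax(dim=1) == labels).float().mean().item()
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, accuracy on the batch: {accuracy:.2f}")
```

If the loss does not fall clearly on a batch this small, the problem is likely in the model, the loss, or the data pipeline, not in the amount of training.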

In [None]:
# Use `torch.cuda.is_available()` to check if CUDA is available before using it.
# This avoids runtime errors on machines without a GPU.

# Use `torch.no_grad()` context manager to disable gradient computation when it is not needed.
# This can save memory and improve performance.

# Use `torch.jit.trace()` to create a traced version of your model for improved performance.
# This can be especially useful for models that are deployed for repeated inference.

# Use `torch.utils.data.DataLoader` to load your data in batches.
# This keeps memory use bounded and lets worker processes load data in parallel.

# Use `torch.optim.lr_scheduler` to adjust the learning rate during training.
# This can help improve the convergence of the model.

# Use `torch.nn.utils.clip_grad_norm_` to clip the gradients during training.
# This can help prevent the gradients from becoming too large and causing the model to diverge.

# Use `torch.utils.tensorboard` to visualize the training process.
# This can help you track the progress of the model and identify any potential problems.
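
Two of the tips above, the CUDA check and `torch.no_grad()`, fit in a few lines. The model and input sizes in this sketch are arbitrary.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 2).to(device)
inputs = torch.randn(4, 10, device=device)

model.eval()
with torch.no_grad():              # no autograd graph is recorded inside this block
    predictions = model(inputs)

print(predictions.requires_grad)
```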


# **PyTorch Tip - 4**

**Explore Static Graphs with torch.compile (PyTorch 2.0 or later):**

If you're using PyTorch version 2.0 or above and aiming to deploy your model for inference, consider leveraging **torch.compile**. This feature can offer significant speedups by converting your model's dynamic computational graph into a static one. **torch.compile** makes PyTorch code run faster by JIT-compiling it into optimized kernels.

**Understanding Dynamic vs. Static Graphs:**

**Dynamic Graphs (Default):** In PyTorch's eager execution mode, the computational graph is built on-the-fly during each forward pass. While flexible, this approach can introduce overhead due to graph creation in every run.  
**Static Graphs:** torch.compile optimizes the model by pre-compiling the computational graph into a more efficient, fixed structure. This static graph can then be repeatedly executed for inference tasks, leading to faster predictions.


**Trade-offs:** While torch.compile generally accelerates inference, it incurs some overhead during the compilation process itself. However, this is usually a one-time cost that is outweighed by the speedup in most deployment scenarios.  
**Limited Flexibility:** Once compiled, the static graph cannot be easily modified. Changes at runtime, such as new input shapes or data-dependent control flow, can trigger recompilation or graph breaks, so if your model needs dynamic adjustments at runtime, torch.compile might not be the most suitable option.

**How to Use torch.compile:**

In [None]:
import torch

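# Load your trained model (assumed to be available as `model`)
model.eval()  # Switch layers such as dropout to inference behavior

compiled_model = torch.compile(model)

# Use the compiled model for inference on new data (assumed to be in `data`)
with torch.no_grad():
    predictions = compiled_model(data)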


Python functions can be optimized by passing the callable to torch.compile. We can then call the returned optimized function in place of the original function.

In [None]:
def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b
opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))

Alternatively, we can decorate the function.

In [None]:
@torch.compile
def opt_foo2(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b
print(opt_foo2(torch.randn(10, 10), torch.randn(10, 10)))

**Benefits of torch.compile:**

**Reduced Inference Latency:** By eliminating the need to construct the graph dynamically each time, you can achieve noticeably faster inference speeds. This is crucial for real-time applications where low latency is essential.  
**Potential for Further Optimizations:** torch.compile often paves the way for additional optimizations under the hood, such as kernel fusion and improved memory access patterns.  
**Kernel fusion** combines several operations into a single GPU kernel, so intermediate results do not have to be written to memory and read back. By reducing data transfer overhead and improving cache utilization, it can significantly accelerate computations.  

By adopting torch.compile for deployment, you can significantly enhance your model's inference performance, making it more efficient and responsive in real-world applications.

Further Reading: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html

# **PyTorch Tip - 3**


**Using torch.where() for Conditional Element-wise Operations**  
PyTorch's **torch.where()** function allows you to perform conditional element-wise operations efficiently. It takes three arguments: the condition, the tensor to select values from when the condition is true, and the tensor to select values from when the condition is false. This is particularly useful for implementing conditional logic within your neural network models.

Here's a quick example. The elements from tensor_true are selected where the condition is True, and elements from tensor_false are selected where the condition is False. This allows for flexible conditional operations within your PyTorch code.

In [None]:
import torch

# Define tensors
condition = torch.tensor([[True, False], [False, True]])
tensor_true = torch.tensor([[1, 2], [3, 4]])
tensor_false = torch.tensor([[5, 6], [7, 8]])

# Perform conditional element-wise operation
result = torch.where(condition, tensor_true, tensor_false)
print(result)


tensor([[1, 6],
        [7, 4]])


**Some practical uses:**

**1. Masked Operations:** You might have a tensor and want to perform different operations on elements based on some condition. For instance, in natural language processing, you might want to mask out certain tokens during tokenization or in attention mechanisms based on some condition.

In [None]:
import torch

# Example: Masking tokens with a special token ID
input_ids = torch.tensor([101, 102, 103, 104, 105])  # Example input tensor
mask_condition = input_ids == 103  # Condition to mask out token with ID 103
special_token_id = 1000  # Special token ID to replace masked tokens

# Mask out tokens with ID 103
masked_input_ids = torch.where(mask_condition, torch.tensor(special_token_id), input_ids)

print(masked_input_ids)


tensor([ 101,  102, 1000,  104,  105])


**2. Loss Function Modification:** During training, you may want to apply different weights to different elements of the loss function based on some condition.

In [None]:
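import torch
import torch.nn.functional as F

# Example: Modifying loss function based on class imbalance
# binary_cross_entropy_with_logits expects raw logits, so these scores are treated as logits, not probabilities
predicted_scores = torch.tensor([0.1, 0.8, 0.3, 0.9, 0.2])  # Example predicted scores (logits)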
true_labels = torch.tensor([0, 1, 0, 1, 1])  # Example true labels

# Calculate binary cross-entropy loss with class imbalance handling
positive_weight = 2.0  # Weight for positive class
negative_weight = 1.0  # Weight for negative class
loss_weights = torch.where(true_labels == 1, positive_weight, negative_weight)

# Calculate weighted binary cross-entropy loss
loss = F.binary_cross_entropy_with_logits(predicted_scores, true_labels.float(), weight=loss_weights)

print(loss)


tensor(0.8439)


**3. Model Interpretability:** In some scenarios, you might want to interpret the output of your model differently based on certain conditions.

In [None]:
import torch

# Example: Interpreting model output differently based on confidence
output_scores = torch.tensor([0.8, 0.6, 0.9, 0.4, 0.7])  # Example output scores
confidence_threshold = 0.7  # Threshold for high confidence

# Determine model predictions based on confidence
predictions = torch.where(output_scores >= confidence_threshold, torch.tensor(1), torch.tensor(0))

print(predictions)

tensor([1, 0, 1, 0, 1])
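
As one more illustration of conditional logic inside a model, the sketch below writes a leaky ReLU with `torch.where()` and compares it with the built-in `F.leaky_relu`. The input values and slope are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
negative_slope = 0.1

# Keep positive values, scale the rest: the definition of leaky ReLU.
manual = torch.where(x > 0, x, negative_slope * x)
builtin = F.leaky_relu(x, negative_slope=negative_slope)

print(manual)
print(torch.allclose(manual, builtin))
```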
