In [2]:
# PyTorch Adam Optimizer Examples
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn import preprocessing
import torch
import torch.nn as nn
import torch.optim as optim

print("PyTorch version:", torch.__version__)
print("Device available:", "CUDA" if torch.cuda.is_available() else "CPU")

PyTorch version: 2.7.1+cu126
Device available: CUDA


In [None]:
#import os
#os.getcwd()
#os.chdir('~/Deep-Learning-and-PyTorch/Gradient-Based-Learning/Round2')
#import sys
#sys.path.append("path")
#from utils import *
# $ jupyter notebook --notebook-dir=Deep-Learning-and-PyTorch/Gradient-Based-Learning/Round2
# PyTorch implementation
import torch
import torch.nn as nn
import torch.optim as optim

In [None]:
# =============================================================================
# CUDA DEVICE SETUP AND OPTIMIZATION
# =============================================================================

# Check if CUDA is available
cuda_available = torch.cuda.is_available()
if cuda_available:
    device = torch.device('cuda')
    cuda_device_name = torch.cuda.get_device_name(0)
    cuda_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)  # GB
    print(f"   CUDA is available!")
    print(f"   Device: {cuda_device_name}")
    print(f"   Memory: {cuda_memory:.1f} GB")
    print(f"   Using GPU acceleration for PyTorch computations")
else:
    device = torch.device('cpu')
    print(f"⚠️  CUDA is not available - using CPU")
    print(f"   PyTorch will run on CPU (slower but still functional)")

print(f"\nSelected device: {device}")

print(f"\n{'='*60}")
print("WHY PYTORCH + CUDA IS OPTIMIZED FOR DEEP LEARNING")
print(f"{'='*60}")

print("""
TENSOR OPERATIONS ON GPU:
• Tensors are multi-dimensional arrays well suited to parallel processing
• A GPU has thousands of simple cores; a CPU has far fewer, more general-purpose cores
• Matrix operations (dot products, convolutions) are highly parallelizable
• PyTorch tensors move between CPU and GPU with .to(device)

CUDA ADVANTAGES:
• Massive parallelism: thousands of concurrent threads
• Memory bandwidth: GPU memory bandwidth is typically several times higher than system RAM
• Specialized cores: Tensor Cores accelerate matrix multiplications on recent NVIDIA GPUs
• Host-to-device transfers still cost time, so small workloads may not benefit

PYTORCH CUDA OPTIMIZATIONS:
• Kernel fusion: available through torch.compile (eager mode does not fuse automatically)
• Memory pooling: a caching allocator reduces allocation overhead
• Mixed precision: opt-in via torch.autocast (FP16/BF16 for speed, FP32 where accuracy matters)
• Asynchronous execution: CUDA kernels are launched asynchronously with respect to the CPU

SPEEDUPS:
• Depend strongly on model size, batch size, and hardware
• Large matrix multiplications and convolutions gain the most
• Small models, such as the linear models in this notebook, may run no faster on GPU than on CPU
""")

if cuda_available:
    print("CUDA is available; larger models and batches benefit most from GPU acceleration")
else:
    print("For GPU acceleration, ensure CUDA-compatible GPU and drivers are installed")

print(f"{'='*60}\n")

<h1 align="center">Gradient Based Optimization with PyTorch</h1>


This notebook demonstrates gradient-based optimization using **PyTorch** instead of TensorFlow/Keras. PyTorch provides automatic differentiation through its autograd system, making gradient computation much more efficient and easier to implement.

We'll explore how PyTorch's built-in functions can be used for:
- Automatic gradient computation
- Tensor operations
- Loss function calculations
- Optimization algorithms

The idea is to tune (adjust) the parameters according to the gradient of the average loss incurred by the neural network on a training set. This average loss is also known as the **training error** and defines an **objective or cost function** $f(\mathbf{w})$ that we want to minimize using PyTorch's optimization tools.

Here we discuss a simple iterative algorithm called **gradient descent** (GD), implemented with PyTorch. GD minimizes the training error by repeatedly improving the current guess for the optimal parameters, taking a small step in the direction of the negative gradient. We will also discuss a variation of GD known as **stochastic gradient descent** (SGD), which estimates the gradient from a subset of the training data. SGD is one of the most widely used optimization methods in deep learning and is readily available in PyTorch.
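
As a minimal sketch of the gradient descent update described above, the following minimizes a toy objective $f(w) = (w-3)^2$ using autograd. The objective, the starting point, the learning rate `lr`, and the number of steps are illustrative assumptions, not values used elsewhere in this notebook.

```python
import torch

# Toy objective f(w) = (w - 3)^2 with its minimum at w = 3 (illustrative choice)
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # learning rate (step size), assumed for this sketch

for step in range(100):
    loss = (w - 3.0) ** 2
    loss.backward()              # autograd writes d(loss)/dw into w.grad
    with torch.no_grad():
        w -= lr * w.grad         # step in the direction of the negative gradient
    w.grad.zero_()               # clear the accumulated gradient for the next step

print(f"w after 100 steps: {w.item():.4f}")
```

Each step shrinks the distance to the minimizer by a constant factor here, so `w` ends up very close to 3.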

### Goals

- How PyTorch gradients can be used to learn the parameters of a neural network
- The basic idea behind stochastic gradient descent (SGD) using PyTorch optimizers
- SGD components "batch", "batch size", "learning rate" and "epoch" in PyTorch
- PyTorch's advanced optimization algorithms such as Adam, RMSprop, and others

### Recommended Resources for PyTorch

- PyTorch Official Documentation: https://pytorch.org/docs/stable/index.html
- PyTorch Tutorials: https://pytorch.org/tutorials/
- "Deep Learning with PyTorch" by Eli Stevens, Luca Antiga, and Thomas Viehmann
- PyTorch autograd documentation: https://pytorch.org/docs/stable/autograd.html

### Stochastic Gradient Descent with PyTorch

PyTorch provides several built-in optimizers:
- torch.optim.SGD: Stochastic Gradient Descent
- torch.optim.Adam: Adam optimizer
- torch.optim.RMSprop: RMSprop optimizer
- And many more: https://pytorch.org/docs/stable/optim.html

Gradient-based methods aim to find a good choice for the weights (and bias) of an **artificial neural network (ANN)**. To do so, we need a loss function that measures how good a particular choice of weights is. PyTorch provides many built-in loss functions in the `torch.nn` module.

For a given pair of predicted label value $\hat{y}$ and true label value $y$, PyTorch loss functions like `nn.MSELoss()` or `nn.CrossEntropyLoss()` provide a measure for the error, or "loss", incurred in predicting the true label $y$ by $\hat{y}$.

Some particular PyTorch loss functions that have proven useful in many applications:
- **nn.MSELoss()**: Mean Squared Error for regression problems with numeric labels
- **nn.CrossEntropyLoss()**: Cross Entropy Loss for classification problems
- **nn.BCELoss()**: Binary Cross Entropy for binary classification
- **nn.L1Loss()**: Mean Absolute Error (L1 loss)
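
To make the loss functions listed above concrete, here is a small sketch that evaluates each of them on hand-picked tensors. The input values are illustrative assumptions chosen so that the results are easy to check by hand.

```python
import math
import torch
import torch.nn as nn

y_hat = torch.tensor([2.0, 0.0])   # predictions (illustrative values)
y = torch.tensor([1.0, 2.0])       # true labels

mse = nn.MSELoss()(y_hat, y)       # mean of squared errors: (1 + 4) / 2
mae = nn.L1Loss()(y_hat, y)        # mean of absolute errors: (1 + 2) / 2

# nn.BCELoss expects probabilities in [0, 1] and float targets
bce = nn.BCELoss()(torch.tensor([0.5]), torch.tensor([1.0]))               # -log(0.5)

# nn.CrossEntropyLoss expects raw logits and integer class indices
ce = nn.CrossEntropyLoss()(torch.tensor([[0.0, 0.0]]), torch.tensor([0]))  # -log(0.5)

print(mse.item(), mae.item(), bce.item(), ce.item(), math.log(2))
```

Note the different input conventions: `nn.BCELoss` takes probabilities, while `nn.CrossEntropyLoss` takes unnormalized logits and applies the softmax internally.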

To measure the quality of a particular choice for the parameters of a neural network, we evaluate the loss on labeled data points, which we represent as PyTorch tensors. PyTorch's automatic differentiation system (autograd) then computes the gradients of the loss during the **backward pass**.

The training process in PyTorch typically follows this pattern:
1. **Forward pass**: Compute predictions using the model
2. **Loss computation**: Calculate loss using a PyTorch loss function
3. **Backward pass**: Compute gradients using `loss.backward()`
4. **Parameter update**: Update weights using a PyTorch optimizer

By using PyTorch's built-in functions, we can solve:
$$ \min_{\mathbf{w} \in \mathbb{R}^{d}} f(\mathbf{w})$$
more efficiently than manual implementations.
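
The four-step pattern above can be sketched as follows. The synthetic data, the `nn.Linear` model, and the hyperparameters are assumptions made for this illustration. The sketch also includes `optimizer.zero_grad()`, which is needed because PyTorch accumulates gradients across `backward()` calls.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 3)                         # synthetic features (assumed)
y = X @ torch.tensor([[1.0], [-2.0], [0.5]])   # synthetic targets (assumed)

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(50):
    predictions = model(X)                 # 1. forward pass
    loss = criterion(predictions, y)       # 2. loss computation
    optimizer.zero_grad()                  # clear gradients from the previous step
    loss.backward()                        # 3. backward pass
    optimizer.step()                       # 4. parameter update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```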

## Mean Squared Error (MSE) with PyTorch

PyTorch provides the MSE loss function through `torch.nn.MSELoss()`. For numeric label values $y \in \mathbb{R}$, the squared error loss is:

$$L(y,\hat{y}) = (\underbrace{y- \hat{y}}_{\mbox{prediction error}})^{2}.$$

In PyTorch, we can compute this using:
```python
mse_loss = nn.MSELoss()
loss = mse_loss(predictions, targets)
```

The **mean squared error (MSE)** is computed automatically:
$$ f(\mathbf{w}) = (1/m) \big( \big( y^{(1)}-\hat{y}^{(1)}\big)^{2}+\big( y^{(2)}-\hat{y}^{(2)}\big)^{2}+\ldots+\big( y^{(m)}-\hat{y}^{(m)}\big)^{2} \big). $$

PyTorch records the operations applied to tensors created with `requires_grad=True`, so the gradient of the loss with respect to those tensors can be obtained with a single `loss.backward()` call instead of a hand-derived formula.
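
As a quick check of the MSE formula, the sketch below compares `nn.MSELoss()` with a manual computation and compares the autograd gradient with the analytic gradient $2(\hat{y}-y)/m$. The tensor values are illustrative assumptions.

```python
import torch
import torch.nn as nn

predictions = torch.tensor([1.5, 0.0, 2.0], requires_grad=True)
targets = torch.tensor([1.0, 1.0, 4.0])

loss_builtin = nn.MSELoss()(predictions, targets)
loss_manual = ((targets - predictions) ** 2).mean()   # (1/m) * sum of squared errors

loss_builtin.backward()
# Analytic gradient of the MSE with respect to each prediction: 2 * (y_hat - y) / m
expected_grad = 2 * (predictions.detach() - targets) / targets.numel()

print(loss_builtin.item(), loss_manual.item())
print(predictions.grad, expected_grad)
```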

The shape of the loss $f(\mathbf{w})$, viewed as a function of the weights $\mathbf{w}$, depends on two components. First, it depends on how the predictor map depends on the weights. Second, it depends on the choice of the loss function.

PyTorch's automatic differentiation system computes gradients regardless of the complexity of the model architecture. For a linear predictor combined with the MSE loss, the objective $f(\mathbf{w})$ is a quadratic, and therefore convex, function of the weights.

PyTorch provides several advantages:
- **Automatic differentiation**: No need to manually compute gradients
- **GPU acceleration**: Seamless GPU support with `.cuda()` or `.to(device)`
- **Built-in optimizers**: Ready-to-use optimization algorithms
- **Dynamic computation graphs**: Flexible model architectures

A convex function has the attractive property that any local minimum is also a [global minimum](https://en.wikipedia.org/wiki/Maxima_and_minima#/media/File:Extrema_example_original.svg). Gradient descent with a sufficiently small learning rate therefore converges toward a global minimum instead of getting stuck in a poor local one.
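
To illustrate this property for a linear predictor with MSE loss, the sketch below runs plain gradient descent from an arbitrary starting point and compares the result with the closed-form least-squares solution. The synthetic data, the learning rate, and the step count are assumptions for this example. It uses a bias-free model and `float64` to keep the comparison tight.

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 2, dtype=torch.float64)
w_true = torch.tensor([[2.0], [-1.0]], dtype=torch.float64)
y = X @ w_true + 0.1 * torch.randn(100, 1, dtype=torch.float64)

# Closed-form minimizer of the MSE for a linear predictor without bias
w_closed = torch.linalg.lstsq(X, y).solution

# Gradient descent from an arbitrary starting point
w = torch.zeros(2, 1, dtype=torch.float64, requires_grad=True)
for _ in range(500):
    loss = ((X @ w - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= 0.1 * w.grad
    w.grad.zero_()

print("gradient descent:", w.detach().flatten())
print("least squares   :", w_closed.flatten())
```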

<img src="../R2/MSELinPred.jpeg" width=400>

In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [20, 10]
MSELinPred_image = plt.imread("../R2/MSELinPred.jpeg")
plt.imshow(MSELinPred_image)

Deep learning methods use predictor maps represented by neural networks with tunable weights. In this case, the predictor depends non-linearly on the weights. As a result, we obtain (highly) non-convex loss landscapes.

PyTorch's automatic differentiation system efficiently handles these complex, non-convex optimization problems. The framework provides:
- **Automatic gradient computation** for any differentiable function
- **Memory-efficient backpropagation** through dynamic computation graphs
- **Advanced optimizers** that can handle non-convex landscapes better than simple SGD

Below, examples of loss function landscapes of more complicated models (neural networks) illustrate that finding a minimum of these loss functions is not a trivial task, but PyTorch makes it much more manageable.

<img src="../R2/NNloss.png" width=500/>

<center><a href="https://www.cs.umd.edu/~tomg/projects/landscapes/">image source</a></center>
<center><a href="https://arxiv.org/abs/1712.09913/">original paper</a></center>

Here you can find more examples of visualizations for loss functions obtained from representing a predictor map using neural networks:

[3D visualization of NN loss functions](http://www.telesens.co/loss-landscape-viz/viewer.html)

**Key PyTorch Advantages for Complex Loss Landscapes:**
- Efficient computation on GPUs
- Advanced optimizers (Adam, RMSprop, etc.) that adapt to the loss landscape
- Automatic mixed precision for faster training
- Easy experimentation with different architectures and loss functions

## PyTorch Adam Optimizer

**Adam (Adaptive Moment Estimation)** is one of the most popular optimization algorithms in deep learning. PyTorch provides Adam through `torch.optim.Adam`. 

### Key Features of Adam:
- **Adaptive Learning Rates**: Automatically adjusts learning rates for each parameter
- **Momentum**: Uses moving averages of gradients (first moment)
- **RMSprop**: Uses moving averages of squared gradients (second moment)
- **Bias Correction**: Corrects bias in moment estimates during early training

### Mathematical Foundation:
Adam combines the best properties of AdaGrad and RMSprop:

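$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(momentum)}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(RMSprop)}$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(bias correction)}$$
$$\mathbf{w}_t = \mathbf{w}_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \quad \text{(parameter update)}$$

Where:
- $g_t$ is the gradient at time step $t$ (squares, square roots, and divisions are element-wise)
- $\beta_1$ = 0.9 (default), $\beta_2$ = 0.999 (default)
- $m_t$ and $v_t$ are the (biased) first and second moment estimates, initialized to zero
- $\hat{m}_t$ and $\hat{v}_t$ are their bias-corrected versions
- $\alpha$ is the learning rate (`lr`) and $\epsilon$ (`eps`) is a small constant for numerical stability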
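
The moment updates above, together with the bias correction and parameter update from the Adam paper, can be sketched by hand and compared against `torch.optim.Adam`. The toy loss (sum of squares), the parameter size, and the number of steps are assumptions made for this check.

```python
import torch

torch.manual_seed(0)
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

w_ref = torch.randn(5, requires_grad=True)      # parameter updated by torch.optim.Adam
w_man = w_ref.detach().clone()                  # copy updated by hand
optimizer = torch.optim.Adam([w_ref], lr=lr, betas=(beta1, beta2), eps=eps)

m = torch.zeros_like(w_man)                     # first moment estimate
v = torch.zeros_like(w_man)                     # second moment estimate

for t in range(1, 11):
    # Reference update on the toy loss sum(w^2)
    optimizer.zero_grad()
    (w_ref ** 2).sum().backward()
    optimizer.step()

    # Manual update with the same loss
    g = 2 * w_man                               # analytic gradient of sum(w^2)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    w_man = w_man - lr * m_hat / (v_hat.sqrt() + eps)

print("max difference:", (w_ref.detach() - w_man).abs().max().item())
```

The two parameter vectors should agree up to floating-point rounding.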

### PyTorch Adam vs SGD Comparison:

```python
# SGD Optimizer
optimizer_sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# Adam Optimizer (recommended for most cases)
optimizer_adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
```
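
As a self-contained sketch of such a comparison, the helper below trains the same one-feature linear model with either optimizer. The function name `fit`, the synthetic data, and the learning rates are hypothetical choices for this illustration and are not the training cells of this notebook.

```python
import torch
import torch.nn as nn

def fit(optimizer_cls, lr, epochs=100):
    """Train a one-feature linear model and return the per-epoch losses."""
    torch.manual_seed(0)                         # same data and initial weights for each run
    X = torch.randn(200, 1)
    y = 2.5 * X + 0.1 * torch.randn(200, 1)      # synthetic data (assumed)
    model = nn.Linear(1, 1)
    optimizer = optimizer_cls(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    losses = []
    for _ in range(epochs):
        loss = criterion(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

losses_sgd_demo = fit(torch.optim.SGD, lr=0.01)
losses_adam_demo = fit(torch.optim.Adam, lr=0.001)
print(f"SGD : {losses_sgd_demo[0]:.4f} -> {losses_sgd_demo[-1]:.4f}")
print(f"Adam: {losses_adam_demo[0]:.4f} -> {losses_adam_demo[-1]:.4f}")
```

Adam moves each parameter by roughly `lr` per step early in training, so with `lr=0.001` it covers little distance in 100 epochs. Learning rates for SGD and Adam are not directly comparable, and each should be tuned separately before drawing conclusions.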

In [4]:
# PyTorch Linear Model for batch training with CUDA support
class LinearModelBatch(nn.Module):
    def __init__(self, n_features):
        super(LinearModelBatch, self).__init__()
        self.linear = nn.Linear(n_features, 1)
        
    def forward(self, x):
        return self.linear(x)

def create_pytorch_batches(X_tensor, y_tensor, batch_size):
    """Create mini-batches using PyTorch DataLoader with CUDA support"""
    dataset = torch.utils.data.TensorDataset(X_tensor, y_tensor)
    dataloader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=True, num_workers=0
    )
    return dataloader

In [None]:
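# Compare Adam vs SGD Optimizers
# Assumes X, y, losses_adam and weights_adam already exist from an earlier
# Adam training cell (not shown in this section).
def train_with_sgd(X, y, epochs=100, lr=0.01):
    """Train linear model using SGD optimizer for comparison"""
    
    X_tensor = torch.tensor(X, dtype=torch.float32)
    y_tensor = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
    
    # LinearModelBatch (defined above) wraps a single nn.Linear layer
    model = LinearModelBatch(X.shape[1])
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)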
    criterion = nn.MSELoss()
    
    losses = []
    weights = []
    
    for epoch in range(epochs):
        predictions = model(X_tensor)
        loss = criterion(predictions, y_tensor)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        losses.append(loss.item())
        weights.append(model.linear.weight.data.clone().numpy().flatten())
        
        if (epoch + 1) % 20 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.6f}')
    
    return model, losses, weights

print("\nTraining with SGD optimizer...")
model_sgd, losses_sgd, weights_sgd = train_with_sgd(X, y, epochs=100, lr=0.01)

# Compare final results
print(f"\nFinal Results Comparison:")
print(f"Adam - Final Loss: {losses_adam[-1]:.6f}, Final Weight: {weights_adam[-1][0]:.6f}")
print(f"SGD  - Final Loss: {losses_sgd[-1]:.6f}, Final Weight: {weights_sgd[-1][0]:.6f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Loss comparison
axes[0].plot(losses_adam, 'r-', label='Adam', linewidth=2)
axes[0].plot(losses_sgd, 'b-', label='SGD', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss Convergence: Adam vs SGD')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Weight evolution
adam_weights_flat = [w[0] for w in weights_adam]
sgd_weights_flat = [w[0] for w in weights_sgd]

axes[1].plot(adam_weights_flat, 'r-', label='Adam', linewidth=2)
axes[1].plot(sgd_weights_flat, 'b-', label='SGD', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Weight Value')
axes[1].set_title('Weight Evolution: Adam vs SGD')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Loss landscape (weight vs loss)
axes[2].plot(adam_weights_flat, losses_adam, 'ro-', label='Adam', markersize=3)
axes[2].plot(sgd_weights_flat, losses_sgd, 'bo-', label='SGD', markersize=3)
axes[2].set_xlabel('Weight Value')
axes[2].set_ylabel('Loss')
axes[2].set_title('Optimization Path: Weight vs Loss')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Advanced Adam Features and Hyperparameter Tuning

def compare_adam_hyperparameters(X, y, epochs=100):
    """Compare different Adam hyperparameter configurations"""
    
    X_tensor = torch.tensor(X, dtype=torch.float32)
    y_tensor = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
    
    # Different Adam configurations
    adam_configs = [
        {'lr': 0.001, 'betas': (0.9, 0.999), 'name': 'Adam Default'},
        {'lr': 0.01, 'betas': (0.9, 0.999), 'name': 'Adam High LR'},
        {'lr': 0.001, 'betas': (0.5, 0.999), 'name': 'Adam Low β1'},
        {'lr': 0.001, 'betas': (0.9, 0.99), 'name': 'Adam Low β2'},
        {'lr': 0.001, 'betas': (0.95, 0.999), 'name': 'Adam High β1'},
    ]
    
    results = {}
    
    for config in adam_configs:
        print(f"\nTraining with {config['name']}: lr={config['lr']}, betas={config['betas']}")
        
        # Initialize fresh model
        model = LinearModelBatch(X.shape[1])  # linear model class defined earlier in this notebook
        optimizer = torch.optim.Adam(model.parameters(), 
                                   lr=config['lr'], 
                                   betas=config['betas'])
        criterion = nn.MSELoss()
        
        losses = []
        for epoch in range(epochs):
            predictions = model(X_tensor)
            loss = criterion(predictions, y_tensor)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            losses.append(loss.item())
        
        results[config['name']] = {
            'losses': losses,
            'final_loss': losses[-1],
            'final_weight': model.linear.weight.data.item()
        }
        
        print(f"  Final Loss: {losses[-1]:.6f}, Final Weight: {model.linear.weight.data.item():.6f}")
    
    return results

# Run hyperparameter comparison
print("=== Adam Hyperparameter Comparison ===")
adam_results = compare_adam_hyperparameters(X, y, epochs=100)

# Plot comparison
plt.figure(figsize=(12, 8))

# Plot loss curves
plt.subplot(2, 2, 1)
for name, result in adam_results.items():
    plt.plot(result['losses'], label=name, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Convergence with Different Adam Configurations')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot final losses as bar chart
plt.subplot(2, 2, 2)
names = list(adam_results.keys())
final_losses = [adam_results[name]['final_loss'] for name in names]
colors = ['red', 'blue', 'green', 'orange', 'purple']
bars = plt.bar(range(len(names)), final_losses, color=colors)
plt.xlabel('Configuration')
plt.ylabel('Final Loss')
plt.title('Final Loss Comparison')
plt.xticks(range(len(names)), [name.replace('Adam ', '') for name in names], rotation=45)  # keep labels unique
for bar, loss in zip(bars, final_losses):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{loss:.3f}', ha='center', va='bottom')

# Plot weights comparison
plt.subplot(2, 2, 3)
final_weights = [adam_results[name]['final_weight'] for name in names]
bars = plt.bar(range(len(names)), final_weights, color=colors)
plt.xlabel('Configuration')
plt.ylabel('Final Weight')
plt.title('Final Weight Comparison')
plt.xticks(range(len(names)), [name.replace('Adam ', '') for name in names], rotation=45)  # keep labels unique
for bar, weight in zip(bars, final_weights):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{weight:.3f}', ha='center', va='bottom')

# Learning rate sensitivity
plt.subplot(2, 2, 4)
learning_rates = [0.0001, 0.001, 0.01, 0.1]
lr_losses = []

for lr in learning_rates:
    model = LinearModelBatch(X.shape[1])  # linear model class defined earlier in this notebook
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    
    X_tensor = torch.tensor(X, dtype=torch.float32)
    y_tensor = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
    
    for _ in range(50):  # Quick training
        predictions = model(X_tensor)
        loss = criterion(predictions, y_tensor)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    lr_losses.append(loss.item())

plt.semilogx(learning_rates, lr_losses, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Learning Rate')
plt.ylabel('Final Loss (50 epochs)')
plt.title('Adam Learning Rate Sensitivity')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Adam Optimizer: Best Practices & Recommendations

### When to Use Adam vs SGD

**Use Adam when:**
-  **Quick prototyping**: Adam often works well with default parameters
-  **Sparse gradients**: Better handling of sparse features
-  **Non-stationary objectives**: When the optimization landscape changes
-  **Fast initial progress**: Adam typically converges faster initially

**Use SGD when:**
-  **Final performance**: SGD often achieves better generalization
-  **Memory constraints**: Plain SGD keeps no per-parameter state, whereas Adam stores two extra buffers per parameter
-  **Fine-tuning**: When you need precise control over learning dynamics
-  **Well-studied problems**: When you know good SGD hyperparameters

### Adam Hyperparameter Guidelines

| Parameter | Default | Typical range | Purpose |
|-----------|---------|---------------|---------|
| `lr` (learning rate) | 0.001 | 1e-5 to 1e-1 | Controls step size |
| `betas[0]` (β₁) | 0.9 | 0.8 to 0.95 | Decay rate of the gradient average (first moment) |
| `betas[1]` (β₂) | 0.999 | 0.99 to 0.9999 | Decay rate of the squared-gradient average (second moment) |
| `eps` (ε) | 1e-8 | 1e-10 to 1e-6 | Numerical stability |
| `weight_decay` | 0 | 1e-6 to 1e-2 | L2 penalty added to the gradient |

### Common PyTorch Adam Patterns

```python
# Standard Adam setup
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Adam with an L2 penalty (weight_decay is added to the gradient);
# for decoupled weight decay use torch.optim.AdamW instead
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Adam with learning rate scheduling
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Training loop with Adam (model, criterion, dataloader and num_epochs defined elsewhere)
for epoch in range(num_epochs):
    for batch_X, batch_y in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(batch_X), batch_y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # Update learning rate once per epoch
```

### Adam vs Other Optimizers Summary

| Optimizer | Convergence Speed | Final Performance | Memory Usage | Stability |
|-----------|------------------|-------------------|--------------|-----------|
| **Adam** | HIGH | MEDIUM | HIGH | HIGH |
| **SGD** | LOW | HIGH | LOW | MEDIUM |
| **AdamW** | HIGH | HIGH | HIGH | HIGH |
| **RMSprop** | MEDIUM | MEDIUM | MEDIUM | MEDIUM |
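
The memory column can be checked with a rough sketch that counts the elements held in each optimizer's per-parameter state after one step. The helper names `state_elements` and `run_one_step` and the small `nn.Linear(10, 1)` model are hypothetical choices for this illustration.

```python
import torch
import torch.nn as nn

def state_elements(optimizer):
    """Count elements in per-parameter state buffers (ignores scalar step counters)."""
    return sum(
        value.numel()
        for state in optimizer.state.values()
        for value in state.values()
        if torch.is_tensor(value) and value.dim() > 0
    )

def run_one_step(optimizer_cls, **kwargs):
    torch.manual_seed(0)
    model = nn.Linear(10, 1)                     # 10 weights + 1 bias = 11 parameters
    optimizer = optimizer_cls(model.parameters(), lr=0.01, **kwargs)
    model(torch.randn(4, 10)).sum().backward()
    optimizer.step()                             # state buffers are created on the first step
    return state_elements(optimizer)

n_params = 11
sgd_state = run_one_step(torch.optim.SGD)
sgd_momentum_state = run_one_step(torch.optim.SGD, momentum=0.9)
adam_state = run_one_step(torch.optim.Adam)

print(f"SGD: {sgd_state}, SGD+momentum: {sgd_momentum_state}, Adam: {adam_state}")
```

Plain SGD holds no buffers, SGD with momentum holds one buffer per parameter, and Adam holds two (the first and second moment estimates).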

# Batch vs Mini-Batch Training in PyTorch

## Understanding Different Batch Training Approaches

This section demonstrates the fundamental differences between:
- **Full Batch Training**: Uses entire dataset for each gradient update
- **Mini-Batch Training**: Uses small subsets of data for each gradient update  
- **Stochastic Gradient Descent (SGD)**: Uses single samples for each gradient update

Each approach has different trade-offs in terms of:
- Memory usage
- Convergence speed
- Computational efficiency
- Final model performance
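
One way to see the difference between the three approaches is to count the gradient updates each one performs per epoch. The sketch below uses a `DataLoader` over a placeholder dataset. The dataset size and batch sizes are illustrative assumptions.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

n_samples = 500                                   # illustrative dataset size
dataset = TensorDataset(torch.randn(n_samples, 3), torch.randn(n_samples, 1))

updates_per_epoch = {}
for batch_size in [n_samples, 50, 25, 1]:         # full batch, two mini-batch sizes, single-sample SGD
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    updates_per_epoch[batch_size] = len(loader)   # equals ceil(n_samples / batch_size)
    assert len(loader) == math.ceil(n_samples / batch_size)

print(updates_per_epoch)
```

Smaller batches mean more, noisier updates per pass over the data, while full batch training makes exactly one update per epoch.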

In [None]:
def train_pytorch_full_batch(X, y, epochs=50, lr=0.01):
    """Full batch training in PyTorch with CUDA support"""
    print(f"PyTorch FULL BATCH training (dataset size: {len(X)})")
    print(f"Device: {device}")
    
    # Convert to tensors and move to device (GPU/CPU)
    X_tensor = torch.tensor(X, dtype=torch.float32).to(device)
    y_tensor = torch.tensor(y, dtype=torch.float32).reshape(-1, 1).to(device)
    
    # Initialize model and move to device
    model = LinearModelBatch(X.shape[1]).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    
    # Training
    losses = []
    weights = []
    
    for epoch in range(epochs):
        # Full batch forward pass
        predictions = model(X_tensor)
        loss = criterion(predictions, y_tensor)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Store metrics (move to CPU for numpy conversion)
        losses.append(loss.item())
        weights.append(model.linear.weight.data.cpu().clone().numpy().flatten()[0])
        
        if epoch % 10 == 0:
            print(f"  Epoch {epoch:3d}/{epochs}, Loss: {loss.item():.6f}, Weight: {weights[-1]:.6f}")
    
    return model, losses, weights

In [None]:
def train_pytorch_mini_batch(X, y, epochs=50, lr=0.01, batch_size=10):
    """Mini-batch training in PyTorch with CUDA support"""
    print(f"PyTorch MINI-BATCH training (batch size: {batch_size})")
    print(f"Device: {device}")
    
    # Convert to tensors and move to device (GPU/CPU)
    X_tensor = torch.tensor(X, dtype=torch.float32).to(device)
    y_tensor = torch.tensor(y, dtype=torch.float32).reshape(-1, 1).to(device)
    
    # Initialize model and move to device
    model = LinearModelBatch(X.shape[1]).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    
    # Create data loader
    dataloader = create_pytorch_batches(X_tensor, y_tensor, batch_size)
    
    # Training
    losses = []
    weights = []
    
    for epoch in range(epochs):
        epoch_losses = []
        
        for batch_X, batch_y in dataloader:
            # Batches inherit the device of the dataset tensors, so .to(device) is a no-op here;
            # it is kept as a safeguard in case the loader is fed CPU tensors
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            # Mini-batch forward pass
            predictions = model(batch_X)
            loss = criterion(predictions, batch_y)
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            epoch_losses.append(loss.item())
        
        # Average loss for the epoch
        avg_loss = np.mean(epoch_losses)
        losses.append(avg_loss)
        weights.append(model.linear.weight.data.cpu().clone().numpy().flatten()[0])
        
        if epoch % 10 == 0:
            print(f"  Epoch {epoch:3d}/{epochs}, Loss: {avg_loss:.6f}, Weight: {weights[-1]:.6f}")
    
    return model, losses, weights

In [None]:
# Generate dataset for PyTorch batch comparison
import time
from sklearn.preprocessing import StandardScaler

print("Generating dataset for PyTorch batch comparison...")
np.random.seed(42)
X_pytorch, y_pytorch = make_regression(n_samples=500, n_features=3, noise=15, random_state=42)
X_pytorch = StandardScaler().fit_transform(X_pytorch)  # Normalize features
y_pytorch = StandardScaler().fit_transform(y_pytorch.reshape(-1, 1)).flatten()  # Normalize target

print(f"PyTorch dataset shape: X={X_pytorch.shape}, y={y_pytorch.shape}")

def timed(train_fn, *args, **kwargs):
    """Run a training function and append the elapsed wall-clock time to its results."""
    start = time.perf_counter()
    model, losses, weights = train_fn(*args, **kwargs)
    return model, losses, weights, time.perf_counter() - start

# Run PyTorch batch comparisons
print("\n" + "="*60)
print("PYTORCH BATCH TRAINING COMPARISON")
print("="*60)

print("\n1. PyTorch FULL BATCH:")
model_full, losses_full, weights_full, time_full = timed(
    train_pytorch_full_batch, X_pytorch, y_pytorch, epochs=40, lr=0.01)

print("\n2. PyTorch MINI-BATCH (batch_size=25):")
model_mini25, losses_mini25, weights_mini25, time_mini25 = timed(
    train_pytorch_mini_batch, X_pytorch, y_pytorch, epochs=40, lr=0.01, batch_size=25)

print("\n3. PyTorch MINI-BATCH (batch_size=50):")
model_mini50, losses_mini50, weights_mini50, time_mini50 = timed(
    train_pytorch_mini_batch, X_pytorch, y_pytorch, epochs=40, lr=0.01, batch_size=50)

print("\n4. PyTorch SGD (batch_size=1):")
# Single-sample SGD is mini-batch training with batch_size=1
model_sgd, losses_sgd, weights_sgd, time_sgd = timed(
    train_pytorch_mini_batch, X_pytorch, y_pytorch, epochs=20, lr=0.01, batch_size=1)

print("\n" + "="*60)
print("TRAINING RESULTS SUMMARY")
print("="*60)
print(f"Full Batch (500) - Final Loss: {losses_full[-1]:.6f}, Time: {time_full:.2f}s")
print(f"Mini-Batch (25)  - Final Loss: {losses_mini25[-1]:.6f}, Time: {time_mini25:.2f}s")
print(f"Mini-Batch (50)  - Final Loss: {losses_mini50[-1]:.6f}, Time: {time_mini50:.2f}s")
print(f"SGD (1)          - Final Loss: {losses_sgd[-1]:.6f}, Time: {time_sgd:.2f}s")
print("="*60)

In [None]:
# Comprehensive PyTorch batch analysis visualization
fig, axes = plt.subplots(2, 3, figsize=(20, 12))
fig.suptitle('PyTorch: Batch vs Mini-Batch Training Analysis', fontsize=16)

# Loss convergence comparison
axes[0, 0].plot(losses_full, 'b-', linewidth=2, label='Full Batch (500)')
axes[0, 0].plot(losses_mini25, 'r-', linewidth=2, label='Mini-Batch (25)')
axes[0, 0].plot(losses_mini50, 'g-', linewidth=2, label='Mini-Batch (50)')
axes[0, 0].plot(losses_sgd, 'm-', linewidth=2, label='SGD (1)', alpha=0.7)
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].set_title('Loss Convergence Comparison')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Weight evolution (first weight only for multi-feature model)
axes[0, 1].plot(weights_full, 'b-', linewidth=2, label='Full Batch')
axes[0, 1].plot(weights_mini25, 'r-', linewidth=2, label='Mini-Batch (25)')
axes[0, 1].plot(weights_mini50, 'g-', linewidth=2, label='Mini-Batch (50)')
axes[0, 1].plot(weights_sgd, 'm-', linewidth=2, label='SGD (1)', alpha=0.7)
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('First Weight Value')
axes[0, 1].set_title('Weight Evolution Comparison')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Training time comparison
methods = ['Full Batch\n(500)', 'Mini-Batch\n(25)', 'Mini-Batch\n(50)', 'SGD\n(1)']
times = [time_full, time_mini25, time_mini50, time_sgd]
colors = ['blue', 'red', 'green', 'magenta']

bars = axes[0, 2].bar(methods, times, color=colors, alpha=0.7)
axes[0, 2].set_ylabel('Training Time (seconds)')
axes[0, 2].set_title('Training Time Comparison')
axes[0, 2].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, time_val in zip(bars, times):
    height = bar.get_height()
    axes[0, 2].text(bar.get_x() + bar.get_width()/2., height + 0.01,
                    f'{time_val:.2f}s', ha='center', va='bottom')

# Memory usage analysis (theoretical)
batch_sizes_analysis = [1, 25, 50, 100, 250, len(X_pytorch)]
memory_usage = [size * X_pytorch.shape[1] * 4 / 1024 for size in batch_sizes_analysis]  # KB

axes[1, 0].semilogx(batch_sizes_analysis, memory_usage, 'o-', linewidth=2, markersize=8)
axes[1, 0].set_xlabel('Batch Size (log scale)')
axes[1, 0].set_ylabel('Memory Usage (KB)')
axes[1, 0].set_title('Memory Usage vs Batch Size')
axes[1, 0].grid(True, alpha=0.3)

# Updates per epoch analysis
batch_sizes_for_updates = [1, 25, 50, 500]  # Our actual batch sizes
updates_per_epoch = [len(X_pytorch) // size for size in batch_sizes_for_updates]

axes[1, 1].bar(range(len(batch_sizes_for_updates)), updates_per_epoch, 
               color=['magenta', 'red', 'green', 'blue'], alpha=0.7)
axes[1, 1].set_xlabel('Batch Configuration')
axes[1, 1].set_ylabel('Gradient Updates per Epoch')
axes[1, 1].set_title('Gradient Updates per Epoch')
axes[1, 1].set_xticks(range(len(batch_sizes_for_updates)))
axes[1, 1].set_xticklabels([f'Batch={size}' for size in batch_sizes_for_updates])
axes[1, 1].grid(True, alpha=0.3, axis='y')

# Add value labels
for i, updates in enumerate(updates_per_epoch):
    axes[1, 1].text(i, updates + 1, str(updates), ha='center', va='bottom')

# Final performance comparison
final_losses = [losses_full[-1], losses_mini25[-1], losses_mini50[-1], losses_sgd[-1]]
bars = axes[1, 2].bar(methods, final_losses, color=colors, alpha=0.7)
axes[1, 2].set_ylabel('Final Loss')
axes[1, 2].set_title('Final Performance Comparison')
axes[1, 2].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, loss in zip(bars, final_losses):
    height = bar.get_height()
    axes[1, 2].text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                    f'{loss:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Print detailed analysis
print("\nDETAILED ANALYSIS:")
print("-" * 40)
print(f"• Full Batch processes {len(X_pytorch)} samples per update")
print(f"• Mini-Batch (25) processes 25 samples per update, {len(X_pytorch)//25} updates/epoch")
print(f"• Mini-Batch (50) processes 50 samples per update, {len(X_pytorch)//50} updates/epoch")
print(f"• SGD processes 1 sample per update, {len(X_pytorch)} updates/epoch")
print("\n• Memory usage scales linearly with batch size")
print("• Mini-batches provide a good balance of speed and stability")
print("• SGD is the noisiest; in non-convex problems this noise can help escape poor local minima")
print("• Full batch has the smoothest convergence but makes only one update per epoch")

In [None]:
# Advanced PyTorch DataLoader Features
print("=" * 60)
print("ADVANCED PYTORCH DATALOADER FEATURES")
print("=" * 60)

def demonstrate_dataloader_features():
    """Demonstrate advanced DataLoader capabilities"""
    
    # Create a sample dataset
    X_demo = torch.randn(100, 5)  # 100 samples, 5 features
    y_demo = torch.randn(100, 1)  # 100 targets
    
    print("\n1. BASIC DATALOADER:")
    dataset = torch.utils.data.TensorDataset(X_demo, y_demo)
    basic_loader = torch.utils.data.DataLoader(dataset, batch_size=20, shuffle=True)
    
    print(f"   Dataset size: {len(dataset)}")
    print(f"   Batch size: {basic_loader.batch_size}")
    print(f"   Number of batches: {len(basic_loader)}")
    
    # Show first batch
    for i, (batch_x, batch_y) in enumerate(basic_loader):
        print(f"   Batch {i+1}: X shape {batch_x.shape}, y shape {batch_y.shape}")
        if i == 0:  # Only show first batch
            break
    
    print("\n2. DATALOADER WITH DROP_LAST:")
    drop_last_loader = torch.utils.data.DataLoader(dataset, batch_size=30, 
                                                   shuffle=True, drop_last=True)
    print(f"   With drop_last=True: {len(drop_last_loader)} batches")
    
    drop_last_false_loader = torch.utils.data.DataLoader(dataset, batch_size=30, 
                                                         shuffle=True, drop_last=False)
    print(f"   With drop_last=False: {len(drop_last_false_loader)} batches")
    
    print("\n3. BATCH SIZE EFFECTS:")
    batch_sizes = [10, 25, 50, 100]
    for bs in batch_sizes:
        loader = torch.utils.data.DataLoader(dataset, batch_size=bs)
        print(f"   Batch size {bs:3d}: {len(loader)} batches")
    
    print("\n4. SHUFFLE COMPARISON:")
    # Without shuffle
    no_shuffle_loader = torch.utils.data.DataLoader(dataset, batch_size=20, shuffle=False)
    batch1_no_shuffle = next(iter(no_shuffle_loader))[0][0]  # First sample of first batch
    
    # With shuffle
    shuffle_loader = torch.utils.data.DataLoader(dataset, batch_size=20, shuffle=True)
    batch1_shuffle = next(iter(shuffle_loader))[0][0]  # First sample of first batch
    
    print(f"   First sample without shuffle: {batch1_no_shuffle[:3].numpy()}")
    print(f"   First sample with shuffle: {batch1_shuffle[:3].numpy()}")
    print("   (Note: values usually differ because shuffling reorders the samples)")

demonstrate_dataloader_features()

# Demonstrate the importance of batch size selection
print("\n5. BATCH SIZE SELECTION IMPACT:")
print("-" * 40)

def test_batch_size_impact():
    """Test different batch sizes on the same problem"""
    X_test = torch.randn(200, 4)
    y_test = torch.sum(X_test, dim=1, keepdim=True) + 0.1 * torch.randn(200, 1)
    
    batch_sizes = [1, 10, 50, 200]  # SGD, small mini-batch, medium mini-batch, full batch
    results = {}
    
    for bs in batch_sizes:
        model = LinearModelBatch(4)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        criterion = nn.MSELoss()
        
        dataset = torch.utils.data.TensorDataset(X_test, y_test)
        dataloader = torch.utils.data.DataLoader(dataset, batch_size=bs, shuffle=True)
        
        losses = []
        for epoch in range(20):
            epoch_loss = 0
            num_batches = 0
            for batch_x, batch_y in dataloader:
                optimizer.zero_grad()
                pred = model(batch_x)
                loss = criterion(pred, batch_y)
                loss.backward()
                optimizer.step()
                
                epoch_loss += loss.item()
                num_batches += 1
            
            avg_loss = epoch_loss / num_batches
            losses.append(avg_loss)
        
        results[f'Batch_{bs}'] = {
            'final_loss': losses[-1],
            'losses': losses,
            'batch_size': bs
        }
        
        batch_type = 'SGD' if bs == 1 else 'Full Batch' if bs == 200 else 'Mini-Batch'
        print(f"   {batch_type} (size={bs}): Final Loss = {losses[-1]:.6f}")
    
    return results

batch_impact_results = test_batch_size_impact()

# Visualize batch size impact
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot loss curves
colors = ['red', 'blue', 'green', 'purple']
for i, (name, result) in enumerate(batch_impact_results.items()):
    label = f"Batch Size {result['batch_size']}"
    ax1.plot(result['losses'], color=colors[i], linewidth=2, label=label)

ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Loss Convergence for Different Batch Sizes')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Bar chart of final losses
batch_sizes = [result['batch_size'] for result in batch_impact_results.values()]
final_losses = [result['final_loss'] for result in batch_impact_results.values()]

bars = ax2.bar([f'Size {bs}' for bs in batch_sizes], final_losses, color=colors, alpha=0.7)
ax2.set_ylabel('Final Loss')
ax2.set_title('Final Loss by Batch Size')
ax2.grid(True, alpha=0.3, axis='y')

# Add value labels
for bar, loss in zip(bars, final_losses):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + height*0.01,
             f'{loss:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print(f"\nBatch size impact analysis completed!")

## PyTorch Batch vs Mini-Batch: Key Takeaways

### **When to use each approach?**

#### Full-Batch Training
- **Best for**: Small datasets (< 1000 samples)
- **Advantages**: Smoothest convergence, most accurate gradients
- **Disadvantages**: Memory intensive; only one parameter update per epoch, so progress per epoch is slow
- **Use cases**: Final fine-tuning, small-scale experiments

#### Mini-Batch Training
- **Best for**: Most deep learning applications
- **Advantages**: Memory efficient, good balance of speed/stability
- **Disadvantages**: Slightly noisy gradients
- **Recommended sizes**: 32-128 for most problems

#### Stochastic Gradient Descent (SGD, batch size 1)
- **Best for**: Very large datasets, online learning
- **Advantages**: Minimal memory usage, can escape local minima
- **Disadvantages**: Very noisy convergence, poor GPU utilization
- **Use cases**: Streaming data, extremely memory-constrained environments
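
In PyTorch the three approaches above differ only in the `batch_size` passed to the `DataLoader`. The sketch below uses a hypothetical 200-sample toy dataset and an illustrative helper `updates_per_epoch` (neither is part of the notebook's earlier code) to count how many parameter updates each regime performs per epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 200 samples, 4 features (illustrative only)
X = torch.randn(200, 4)
y = X.sum(dim=1, keepdim=True)
dataset = TensorDataset(X, y)

def updates_per_epoch(dataset, batch_size):
    """Number of batches (= optimizer steps) in one pass over the data."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    # With the default drop_last=False this is ceil(len(dataset) / batch_size)
    return len(loader)

regimes = {"SGD": 1, "Mini-batch": 32, "Full batch": len(dataset)}
counts = {name: updates_per_epoch(dataset, bs) for name, bs in regimes.items()}

for name, bs in regimes.items():
    print(f"{name:<10} batch_size={bs:>3} -> {counts[name]} updates/epoch")
```

Fewer updates per epoch means each epoch is cheaper to launch on a GPU but moves the parameters fewer times, which is the trade-off the lists above describe.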

### **PyTorch advantages for batch processing**

1. **DataLoader**: Automatic batching, shuffling, and parallel loading
2. **GPU Acceleration**: Efficient parallel processing within batches  
3. **Memory Management**: Autograd frees intermediate buffers after `backward()`; gradients accumulate in `.grad` until cleared with `optimizer.zero_grad()`
4. **Flexibility**: Easy to experiment with different batch sizes
5. **Integration**: Seamless with neural network architectures
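
A minimal sketch of how these pieces fit together in one device-agnostic training loop. The toy data, the `nn.Linear` model, and the hyperparameters are assumptions chosen for illustration; they are not the notebook's `LinearModelBatch` setup.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
X = torch.randn(256, 4)            # hypothetical toy inputs
y = X.sum(dim=1, keepdim=True)     # target is the sum of the features

loader = DataLoader(
    TensorDataset(X, y),
    batch_size=32,                 # easy to change when experimenting
    shuffle=True,                  # reshuffled at the start of every epoch
    num_workers=0,                 # values > 0 load batches in worker processes
    pin_memory=(device.type == "cuda"),  # page-locked memory speeds host-to-GPU copies
)

model = nn.Linear(4, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.MSELoss()

epoch_losses = []
for epoch in range(5):
    running = 0.0
    for batch_x, batch_y in loader:
        # Move each batch to the model's device
        batch_x = batch_x.to(device, non_blocking=True)
        batch_y = batch_y.to(device, non_blocking=True)

        optimizer.zero_grad()      # clear gradients left from the previous step
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()

        running += loss.item() * batch_x.size(0)
    epoch_losses.append(running / len(loader.dataset))  # sample-weighted mean

print([round(l, 4) for l in epoch_losses])
```

Weighting each batch loss by its size keeps the epoch average correct even when the last batch is smaller than the rest.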

### **Recommendations**

- **Start with batch_size=32** for most problems
- **Increase batch size** if you have more GPU memory
- **Use power-of-2 batch sizes** (16, 32, 64, 128) for GPU efficiency
- **Enable shuffling** for training, disable for validation
- **Monitor memory usage** and adjust batch size accordingly
- **Consider gradient accumulation** for very large effective batch sizes
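
Gradient accumulation, mentioned in the last recommendation, can be sketched as follows. The toy data, the micro-batch size of 16, and `accum_steps = 4` are assumptions chosen for illustration; the idea is to call `backward()` on several small batches and step the optimizer only once per group, giving an effective batch size of 64 here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(128, 4)            # hypothetical toy inputs
y = X.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.MSELoss()

accum_steps = 4                    # assumed value: effective batch = 16 * 4 = 64
optimizer_steps = 0

optimizer.zero_grad()
for step, (batch_x, batch_y) in enumerate(loader, start=1):
    # Scale the loss so the summed gradients match the mean over the large batch
    loss = criterion(model(batch_x), batch_y) / accum_steps
    loss.backward()                # gradients add up in each parameter's .grad

    if step % accum_steps == 0:
        optimizer.step()           # one update per accum_steps micro-batches
        optimizer.zero_grad()      # reset before the next group
        optimizer_steps += 1

print(f"{len(loader)} micro-batches -> {optimizer_steps} optimizer steps")
```

This sketch assumes the number of batches is a multiple of `accum_steps`; otherwise the gradients from the trailing micro-batches would need a final `optimizer.step()` after the loop.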

In [None]:
# =============================================================================
# GPU vs CPU PERFORMANCE COMPARISON (if CUDA is available)
# =============================================================================

if cuda_available:
    print("DEMONSTRATING GPU vs CPU PERFORMANCE")
    print("="*50)
    
    import time
    
    # Create a larger dataset for performance comparison
    X_large, y_large = make_regression(n_samples=5000, n_features=50, noise=10, random_state=42)
    
    def benchmark_device_performance(X, y, device_name, device_obj, epochs=10):
        """Benchmark training performance on specified device"""
        print(f"\n📊 Training on {device_name}...")
        
        # Move data to device
        X_tensor = torch.tensor(X, dtype=torch.float32).to(device_obj)
        y_tensor = torch.tensor(y, dtype=torch.float32).reshape(-1, 1).to(device_obj)
        
        # Create model
        model = LinearModelBatch(X.shape[1]).to(device_obj)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
        criterion = nn.MSELoss()
        
        # Warm-up (important for GPU)
        for _ in range(3):
            predictions = model(X_tensor)
            loss = criterion(predictions, y_tensor)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
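        # Benchmark: drain queued warm-up kernels so they are not timed
        if device_obj.type == 'cuda':
            torch.cuda.synchronize()
        start_time = time.perf_counter()
        
        for epoch in range(epochs):
            predictions = model(X_tensor)
            loss = criterion(predictions, y_tensor)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        # CUDA kernels run asynchronously; wait for them before stopping the timer
        if device_obj.type == 'cuda':
            torch.cuda.synchronize()
        end_time = time.perf_counter()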
        
        training_time = end_time - start_time
        print(f"   Time: {training_time:.3f} seconds")
        print(f"   Final Loss: {loss.item():.6f}")
        
        return training_time, loss.item()
    
    # Benchmark CPU
    cpu_time, cpu_loss = benchmark_device_performance(
        X_large, y_large, "CPU", torch.device('cpu'), epochs=20
    )
    
    # Benchmark GPU
    gpu_time, gpu_loss = benchmark_device_performance(
        X_large, y_large, "GPU (CUDA)", torch.device('cuda'), epochs=20
    )
    
    # Results
    speedup = cpu_time / gpu_time
    print(f"\nPERFORMANCE RESULTS:")
    print(f"   CPU Time: {cpu_time:.3f}s")
    print(f"   GPU Time: {gpu_time:.3f}s")
    print(f"   Speedup (CPU time / GPU time): {speedup:.2f}x")
    
    if speedup > 2:
        print("   Significant GPU acceleration achieved!")
    elif speedup > 1.2:
        print("   ⚡ Moderate GPU acceleration achieved!")
    else:
        print("   ⚠️  Limited acceleration - dataset might be too small for GPU benefits")
    
    print("\nInsights:")
    print("   • GPU excels at parallel matrix operations")
    print("   • Larger datasets and models see bigger GPU benefits")
    print("   • Kernel launch and memory transfer overhead can dominate on small workloads")
    print("   • Tensor cores accelerate FP16/BF16 matrix math (this FP32 benchmark does not enable mixed precision)")

else:
    print("⚠️ CUDA not available - GPU performance comparison skipped")
    print("   To enable GPU acceleration:")
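    print("   1. Install a CUDA-capable NVIDIA GPU")
    print("   2. Install the NVIDIA driver (the pip wheels bundle the CUDA runtime)")
    print("   3. Install PyTorch with CUDA support:")
    print("      pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118")
    print("      (replace cu118 with the CUDA tag that matches your setup, e.g. cu126)")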