# Lab 3: PyTorch Basics & Training a Neural Network

**Duration**: ~3 hours

### Learning Objectives
By the end of this lab, you will be able to:
1. Understand tensors (PyTorch's building blocks)
2. Build a simple neural network from scratch
3. Understand the training loop: forward pass, loss, backpropagation
4. Train a classifier on MNIST digits

### Prerequisites
- Completed Labs 1-2
- Basic Python knowledge

### Why PyTorch?

scikit-learn is great for classical ML, but for **deep learning** we need:
- Automatic differentiation (computing gradients)
- GPU acceleration
- Flexible neural network building

PyTorch gives us all this with Pythonic syntax!

In [None]:
# Install PyTorch (run once)
# For CPU: !pip install torch torchvision
# For GPU (Colab): Already installed!

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

import numpy as np
import matplotlib.pyplot as plt

# Check PyTorch version and device
print(f"PyTorch version: {torch.__version__}")

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

---

# Part 1: Tensors - The Building Blocks

## What is a Tensor?

A tensor is just a **fancy array**. That's it!

```
TENSOR DIMENSIONS:

Scalar (0D):     5
                 └── Just a number

Vector (1D):     [1, 2, 3, 4, 5]
                 └── List of numbers

Matrix (2D):     [[1, 2, 3],
                  [4, 5, 6]]
                 └── Table of numbers

3D Tensor:       [[[1, 2], [3, 4]],
                  [[5, 6], [7, 8]]]
                 └── Cube of numbers (e.g., RGB image)
```

**Why not just use NumPy arrays?**
- Tensors can run on a GPU (often much faster for large operations; the actual speedup depends on your hardware and problem size)
- Tensors track gradients for automatic differentiation

In [None]:
# SOLVED EXAMPLE: Creating Tensors

# From Python list
t1 = torch.tensor([1, 2, 3, 4, 5])
print(f"From list: {t1}")
print(f"Shape: {t1.shape}")
print(f"Data type: {t1.dtype}")

print("\n" + "="*50)

# From NumPy array
np_array = np.array([[1, 2, 3], [4, 5, 6]])
t2 = torch.from_numpy(np_array)
print(f"From numpy:\n{t2}")
print(f"Shape: {t2.shape}")

print("\n" + "="*50)

# Special tensors
zeros = torch.zeros(3, 4)  # 3x4 matrix of zeros
ones = torch.ones(2, 3)    # 2x3 matrix of ones
rand = torch.rand(2, 2)    # 2x2 matrix of random [0,1]
randn = torch.randn(2, 2)  # 2x2 matrix of random normal

print(f"Zeros (3x4):\n{zeros}")
print(f"\nRandom [0,1]:\n{rand}")
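
One detail about the NumPy conversion above is easy to miss: `torch.from_numpy` shares memory with the source array, whereas `torch.tensor` makes a copy. A minimal sketch (the variable names `arr`, `shared`, and `copied` are illustrative, not part of the lab):

```python
import numpy as np
import torch

arr = np.array([1, 2, 3])
shared = torch.from_numpy(arr)   # shares memory with arr
copied = torch.tensor(arr)       # independent copy of the data

arr[0] = 100                     # modify the NumPy array in place

print(shared)  # the change is visible through the shared tensor
print(copied)  # the copy still holds the original values
```

If you need the tensor to be independent of the array, `torch.tensor(arr)` is the safer choice.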

In [None]:
# SOLVED EXAMPLE: Tensor Operations

a = torch.tensor([1., 2., 3.])
b = torch.tensor([4., 5., 6.])

# Basic math (element-wise)
print(f"a = {a}")
print(f"b = {b}")
print(f"a + b = {a + b}")
print(f"a * b = {a * b}")
print(f"a ** 2 = {a ** 2}")

print("\n" + "="*50)

# Matrix operations
A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[5., 6.], [7., 8.]])

print(f"Matrix A:\n{A}")
print(f"\nMatrix B:\n{B}")
print(f"\nA @ B (matrix multiply):\n{A @ B}")
print(f"\nA.T (transpose):\n{A.T}")

In [None]:
# SOLVED EXAMPLE: Reshaping Tensors

# This is CRUCIAL for neural networks!
x = torch.arange(12)  # [0, 1, 2, ..., 11]
print(f"Original: {x}")
print(f"Shape: {x.shape}")

# Reshape to 3x4
x_3x4 = x.reshape(3, 4)
print(f"\nReshaped to 3x4:\n{x_3x4}")

# Reshape to 2x2x3
x_2x2x3 = x.reshape(2, 2, 3)
print(f"\nReshaped to 2x2x3:\n{x_2x2x3}")

# Use -1 to infer dimension
x_auto = x.reshape(4, -1)  # -1 means "figure it out"
print(f"\nReshaped to 4x?:\n{x_auto}")
print(f"Shape: {x_auto.shape}")
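
To see why reshaping matters for neural networks, here is a small sketch that flattens a batch of image-shaped tensors into rows, which is what a fully connected layer expects. The batch is random stand-in data, assumed here only for illustration:

```python
import torch

# Hypothetical batch: 5 grayscale images of 28x28 pixels
batch = torch.randn(5, 1, 28, 28)

# Keep the batch dimension, let -1 infer the rest (1 * 28 * 28 = 784)
flat = batch.reshape(batch.shape[0], -1)

print(flat.shape)  # torch.Size([5, 784])
```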

## Moving Tensors to GPU

If you have a GPU, you can make computations **much faster**!

In [None]:
# SOLVED EXAMPLE: GPU Operations

# Create tensor on CPU
x_cpu = torch.rand(1000, 1000)
print(f"Tensor device: {x_cpu.device}")

# Move to GPU (if available)
x_gpu = x_cpu.to(device)
print(f"After .to(device): {x_gpu.device}")

# Time comparison (only meaningful on GPU)
import time

# CPU timing
x_cpu = torch.rand(5000, 5000)
y_cpu = torch.rand(5000, 5000)

start = time.perf_counter()  # monotonic, high-resolution timer for benchmarks
z_cpu = x_cpu @ y_cpu
cpu_time = time.perf_counter() - start
print(f"\nCPU matrix multiply: {cpu_time:.4f} seconds")

if device.type == 'cuda':
    x_gpu = x_cpu.to(device)
    y_gpu = y_cpu.to(device)

    # Warm-up (the first GPU call includes one-time setup cost)
    _ = x_gpu @ y_gpu
    torch.cuda.synchronize()

    start = time.perf_counter()
    z_gpu = x_gpu @ y_gpu
    torch.cuda.synchronize()  # GPU calls are asynchronous: wait before stopping the timer
    gpu_time = time.perf_counter() - start
    print(f"GPU matrix multiply: {gpu_time:.4f} seconds")
    print(f"Speedup: {cpu_time / gpu_time:.1f}x faster!")

## Question 1.1

Create a 3x3 tensor filled with the numbers 1-9, then:
1. Print the sum of all elements
2. Print the mean of each row
3. Reshape it to 9x1

In [None]:
# YOUR CODE HERE

# Create 3x3 tensor with values 1-9
# Hint: torch.arange(1., 10.).reshape(...)
# (use floats: .mean() raises an error on integer tensors)

# Sum all elements
# Hint: tensor.sum()

# Mean of each row
# Hint: tensor.mean(dim=1)

# Reshape to 9x1


---

# Part 2: Building Blocks of Neural Networks

## The Neuron: A Simple Function

```
A SINGLE NEURON:

  inputs         weights           output
    │              │                 │
   x1 ─────► w1 ──┐                  │
                  │                  │
   x2 ─────► w2 ──┼──► Σ + b ──► f ──┴──► y
                  │     │        │
   x3 ─────► w3 ──┘   bias    activation
                              (e.g., ReLU)

   y = f(w1*x1 + w2*x2 + w3*x3 + b)
```
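
The neuron formula above can be computed by hand. This is a minimal sketch with made-up inputs, weights, and bias (the specific numbers are assumptions chosen for illustration), using ReLU as the activation `f`:

```python
import torch

# Illustrative values; any numbers would do
x = torch.tensor([1.0, 2.0, 3.0])    # inputs x1, x2, x3
w = torch.tensor([0.5, -1.0, 0.25])  # weights w1, w2, w3
b = torch.tensor(0.1)                # bias

z = (w * x).sum() + b   # weighted sum: w1*x1 + w2*x2 + w3*x3 + b
y = torch.relu(z)       # activation f

print(f"z = {z.item():.2f}")
print(f"y = {y.item():.2f}")
```

Here the weighted sum is 0.5 - 2.0 + 0.75 + 0.1 = -0.65, which is negative, so ReLU outputs 0.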

**The Linear Layer**: Many neurons in parallel!

In [None]:
# SOLVED EXAMPLE: The Linear Layer

# A linear layer: input_size -> output_size
linear = nn.Linear(in_features=3, out_features=2)

print("Linear layer: 3 inputs → 2 outputs")
print(f"Weight shape: {linear.weight.shape}")  # 2x3 matrix
print(f"Bias shape: {linear.bias.shape}")      # 2 values

print(f"\nWeights:\n{linear.weight}")
print(f"\nBias: {linear.bias}")

# Forward pass
x = torch.tensor([1., 2., 3.])  # Input
y = linear(x)                     # Output

print(f"\nInput: {x}")
print(f"Output: {y}")

# Manual calculation to verify
y_manual = x @ linear.weight.T + linear.bias
print(f"Manual: {y_manual}")

## Activation Functions: Adding Non-linearity

Without activations, stacking linear layers gains nothing: the composition of several linear layers is itself just one linear layer!

Activations add **non-linearity** so networks can learn complex patterns.
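
The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses arbitrary layer sizes (4, 8, and 3 are assumptions for illustration) and builds the equivalent single layer from W = W2·W1 and b = W2·b1 + b2:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)
x = torch.randn(5, 4)

# Two linear layers with no activation in between
stacked = layer2(layer1(x))

# Equivalent single linear map: W = W2 @ W1, b = W2 @ b1 + b2
W = layer2.weight @ layer1.weight
b = layer2.weight @ layer1.bias + layer2.bias
single = x @ W.T + b

print(torch.allclose(stacked, single, atol=1e-5))

# With a ReLU in between, the collapsed layer no longer matches
with_relu = layer2(torch.relu(layer1(x)))
print(torch.allclose(with_relu, single, atol=1e-5))
```

The first comparison should agree up to floating-point error, while inserting a ReLU between the layers breaks the equivalence.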

```
COMMON ACTIVATIONS:

ReLU:           Sigmoid:        Softmax:
    │    ╱         │ ───────       │ 
    │   ╱          │╱              │   Probabilities!
────┼──╱       ────┼────       ────┼──── 
    │              │               │   Sum to 1.0
    │              │               │

f(x) = max(0,x)  f(x)=1/(1+e^-x)  f(x)=e^x/Σe^x

Use: Hidden      Use: Binary     Use: Multi-class
     layers           output          output
```

In [None]:
# SOLVED EXAMPLE: Activation Functions

x = torch.linspace(-5, 5, 100)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# ReLU
axes[0].plot(x, torch.relu(x))
axes[0].set_title('ReLU: max(0, x)')
axes[0].axhline(0, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(0, color='gray', linestyle='--', alpha=0.5)
axes[0].grid(True, alpha=0.3)

# Sigmoid
axes[1].plot(x, torch.sigmoid(x))
axes[1].set_title('Sigmoid: 1/(1+e^-x)')
axes[1].axhline(0.5, color='gray', linestyle='--', alpha=0.5)
axes[1].grid(True, alpha=0.3)

# Tanh
axes[2].plot(x, torch.tanh(x))
axes[2].set_title('Tanh')
axes[2].axhline(0, color='gray', linestyle='--', alpha=0.5)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# SOLVED EXAMPLE: Softmax for Classification

# Raw scores from network (called "logits")
logits = torch.tensor([2.0, 1.0, 0.1])

# Convert to probabilities with softmax
probs = F.softmax(logits, dim=0)

print(f"Raw scores (logits): {logits}")
print(f"Probabilities (softmax): {probs}")
print(f"Sum of probabilities: {probs.sum():.4f}")

# Interpretation
classes = ['Cat', 'Dog', 'Bird']
for cls, prob in zip(classes, probs):
    print(f"  {cls}: {prob:.1%}")
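
Softmax is simple enough to write by hand. The sketch below applies the formula e^x / Σe^x directly and compares the result with `F.softmax`. Library implementations also take care of numerical stability for large logits; this version does not.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])

# Softmax by hand: exponentiate, then divide by the sum
exp_scores = torch.exp(logits)
manual_probs = exp_scores / exp_scores.sum()

print(manual_probs)
print(torch.allclose(manual_probs, F.softmax(logits, dim=0)))
```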

---

# Part 3: Building a Neural Network

## The Sequential Model

Stack layers like LEGO blocks!

In [None]:
# SOLVED EXAMPLE: Building a Simple Network

# Network architecture:
# Input (784) -> Hidden (128) -> ReLU -> Output (10)

model = nn.Sequential(
    nn.Linear(784, 128),  # First layer: 784 inputs, 128 outputs
    nn.ReLU(),            # Activation
    nn.Linear(128, 10)    # Output layer: 128 inputs, 10 classes
)

print(model)
print(f"\nNumber of parameters: {sum(p.numel() for p in model.parameters()):,}")
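
The parameter count printed above can be derived by hand: each `nn.Linear` layer has one weight per input-output pair plus one bias per output. A short sketch of the arithmetic (`linear_params` is a hypothetical helper written for this example, not a PyTorch function):

```python
def linear_params(in_features, out_features):
    """Weights (in * out) plus one bias per output."""
    return in_features * out_features + out_features

layer1 = linear_params(784, 128)  # 784*128 + 128 = 100,480
layer2 = linear_params(128, 10)   # 128*10 + 10  = 1,290
total = layer1 + layer2

print(f"{layer1:,} + {layer2:,} = {total:,}")  # 100,480 + 1,290 = 101,770
```

`nn.ReLU` has no learnable parameters, so it does not contribute to the total.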

In [None]:
# SOLVED EXAMPLE: Forward Pass

# Fake input: batch of 5 images, each 28x28 = 784 pixels
fake_images = torch.randn(5, 784)

# Forward pass
output = model(fake_images)

print(f"Input shape: {fake_images.shape}")  # [5, 784]
print(f"Output shape: {output.shape}")       # [5, 10]

# Get predictions (softmax preserves ordering, so argmax over logits or probs agrees)
probs = F.softmax(output, dim=1)
predictions = output.argmax(dim=1)

print(f"\nPredictions (class indices): {predictions}")
print(f"\nFirst sample probabilities:")
for i, p in enumerate(probs[0]):
    print(f"  Class {i}: {p:.2%}")

## Custom Network Class

For more control, define your own network class:

In [None]:
# SOLVED EXAMPLE: Custom Network Class

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        
        # Define layers
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)
        
    def forward(self, x):
        # Define forward pass
        x = self.fc1(x)       # Linear
        x = F.relu(x)          # Activation
        x = self.fc2(x)       # Linear
        return x

# Create network
net = SimpleNet(input_size=784, hidden_size=128, num_classes=10)
print(net)

# Test forward pass
test_input = torch.randn(1, 784)
test_output = net(test_input)
print(f"\nOutput shape: {test_output.shape}")

## Question 2.1

Create a deeper network with:
- Input: 784
- Hidden 1: 256 + ReLU
- Hidden 2: 128 + ReLU
- Hidden 3: 64 + ReLU
- Output: 10

How many parameters does it have?

In [None]:
# YOUR CODE HERE

# Using nn.Sequential
deep_net = nn.Sequential(
    # Add your layers here
)

# Count parameters
# num_params = sum(p.numel() for p in deep_net.parameters())


---

# Part 4: The Training Loop

## How Neural Networks Learn

```
THE TRAINING LOOP:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  1. FORWARD PASS                                            │
│     Input ──► Network ──► Prediction                        │
│                                                             │
│  2. COMPUTE LOSS                                            │
│     How wrong are we? (prediction vs actual)                │
│                                                             │
│  3. BACKWARD PASS                                           │
│     Compute gradients (which direction to adjust?)          │
│                                                             │
│  4. UPDATE WEIGHTS                                          │
│     weights = weights - learning_rate * gradients           │
│                                                             │
│  5. REPEAT! (thousands of times)                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

In [None]:
# SOLVED EXAMPLE: Understanding Gradients

# Create a tensor that requires gradients
x = torch.tensor([2.0], requires_grad=True)

# Compute a function: y = x^2
y = x ** 2

print(f"x = {x.item()}")
print(f"y = x² = {y.item()}")

# Compute gradient: dy/dx = 2x
y.backward()

print(f"dy/dx = {x.grad.item()}")
print(f"(We expect 2*x = 2*2 = 4)")
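
Gradients become useful once we act on them. The sketch below combines the gradient from the example above with the update rule `weights = weights - learning_rate * gradients` to minimize y = x² by hand. The learning rate of 0.1 and the 5 steps are arbitrary choices for illustration.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
learning_rate = 0.1  # illustrative value

for step in range(5):
    y = x ** 2          # forward pass
    y.backward()        # compute dy/dx = 2x
    with torch.no_grad():
        x -= learning_rate * x.grad   # gradient descent update
    x.grad.zero_()      # clear the gradient before the next step
    print(f"step {step + 1}: x = {x.item():.4f}")
```

Each update multiplies x by 0.8 (since x - 0.1 * 2x = 0.8x), so x moves steadily toward the minimum at 0. In practice an optimizer does the update and gradient reset for you.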

In [None]:
# SOLVED EXAMPLE: Loss Functions

# For classification: Cross Entropy Loss
logits = torch.tensor([[2.0, 1.0, 0.1]])  # Raw scores for 3 classes
target = torch.tensor([0])                  # True class is 0

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)

print(f"Logits: {logits}")
print(f"True class: {target.item()}")
print(f"Loss: {loss.item():.4f}")

# Lower loss = better!
# Perfect prediction would give loss ≈ 0
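
To see what `nn.CrossEntropyLoss` computes, here is a sketch that reproduces it by hand for the same logits: take the log-softmax and negate the entry for the true class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])

# Cross entropy = -log(probability assigned to the true class)
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[0, target[0]]

builtin_loss = nn.CrossEntropyLoss()(logits, target)

print(f"Manual:   {manual_loss.item():.4f}")
print(f"Built-in: {builtin_loss.item():.4f}")
```

Note that `nn.CrossEntropyLoss` applies the softmax internally, so it expects raw logits rather than probabilities.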

## Load the MNIST Dataset

MNIST contains 70,000 handwritten digit images (28x28 pixels, grayscale): 60,000 for training and 10,000 for testing.

In [None]:
# SOLVED EXAMPLE: Loading MNIST

# Transform: Convert to tensor and normalize
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])

# Download and load training data
train_dataset = datasets.MNIST(
    root='./data', 
    train=True, 
    download=True, 
    transform=transform
)

test_dataset = datasets.MNIST(
    root='./data', 
    train=False, 
    download=True, 
    transform=transform
)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")

# Look at one sample
image, label = train_dataset[0]
print(f"\nImage shape: {image.shape}")  # [1, 28, 28]
print(f"Label: {label}")

In [None]:
# SOLVED EXAMPLE: Visualize Some Digits

fig, axes = plt.subplots(2, 5, figsize=(12, 5))

for i, ax in enumerate(axes.flat):
    image, label = train_dataset[i]
    ax.imshow(image.squeeze(), cmap='gray')
    ax.set_title(f'Label: {label}')
    ax.axis('off')

plt.suptitle('MNIST Digits', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# SOLVED EXAMPLE: Create Data Loaders

# DataLoader: Loads data in batches for efficient training
batch_size = 64

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"Number of batches (training): {len(train_loader)}")
print(f"Number of batches (test): {len(test_loader)}")

# Get one batch
images, labels = next(iter(train_loader))
print(f"\nBatch images shape: {images.shape}")  # [64, 1, 28, 28]
print(f"Batch labels shape: {labels.shape}")    # [64]
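
The number of batches is the dataset size divided by the batch size, rounded up, and the last batch may be smaller than the rest. The sketch below shows this on a small stand-in dataset of 1,000 random samples (an assumption made so the example runs quickly; it is not MNIST):

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 1,000 fake samples (not MNIST) to keep the sketch fast
fake_data = TensorDataset(torch.randn(1000, 784), torch.zeros(1000, dtype=torch.long))
loader = DataLoader(fake_data, batch_size=64, shuffle=False)

print(len(loader), math.ceil(1000 / 64))   # both are 16

batch_sizes = [len(labels) for _, labels in loader]
print(batch_sizes[-1])                      # last batch: 1000 - 15 * 64 = 40
```

The same arithmetic gives 938 training batches (60,000 / 64, rounded up) and 157 test batches for MNIST. Passing `drop_last=True` to `DataLoader` discards the smaller final batch if you need every batch to be the same size.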

## The Complete Training Loop

In [None]:
# SOLVED EXAMPLE: Complete Training Pipeline

# 1. Define the model
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()  # 28x28 -> 784
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        
    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Create model and move to device
model = MNISTNet().to(device)
print(model)

# 2. Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

print(f"\nModel on: {next(model.parameters()).device}")

In [None]:
# SOLVED EXAMPLE: Training Function

def train_epoch(model, train_loader, criterion, optimizer, device):
    model.train()  # Set to training mode
    total_loss = 0
    correct = 0
    total = 0
    
    for images, labels in train_loader:
        # Move to device
        images, labels = images.to(device), labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward pass
        optimizer.zero_grad()  # Clear old gradients
        loss.backward()         # Compute new gradients
        optimizer.step()        # Update weights
        
        # Track stats
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    
    return total_loss / len(train_loader), 100. * correct / total


def test(model, test_loader, criterion, device):
    model.eval()  # Set to evaluation mode
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():  # No gradients needed for testing
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    return total_loss / len(test_loader), 100. * correct / total
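
The `model.train()` and `model.eval()` calls matter for layers that behave differently during training and evaluation, such as dropout. `MNISTNet` above has no such layers, so there the calls are a good habit rather than a necessity. A small sketch with a standalone `nn.Dropout` layer (the layer and input here are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

dropout.train()          # training mode: randomly zero elements, scale the rest by 1/(1-p)
train_out = dropout(x)

dropout.eval()           # evaluation mode: dropout is a no-op
eval_out = dropout(x)

print(f"train: {train_out}")
print(f"eval:  {eval_out}")
```

In training mode each element is either zeroed or scaled to 2.0. In evaluation mode the input passes through unchanged.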

In [None]:
# SOLVED EXAMPLE: Train for 5 Epochs

num_epochs = 5
history = {'train_loss': [], 'train_acc': [], 'test_loss': [], 'test_acc': []}

print("Training MNIST Classifier...")
print("=" * 60)

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = test(model, test_loader, criterion, device)
    
    # Save history
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['test_loss'].append(test_loss)
    history['test_acc'].append(test_acc)
    
    print(f"Epoch {epoch+1}/{num_epochs}:")
    print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
    print(f"  Test Loss:  {test_loss:.4f}, Test Acc:  {test_acc:.2f}%")

print("\nTraining complete!")

In [None]:
# SOLVED EXAMPLE: Plot Training History

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss
axes[0].plot(history['train_loss'], 'o-', label='Train')
axes[0].plot(history['test_loss'], 's-', label='Test')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss over Training')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(history['train_acc'], 'o-', label='Train')
axes[1].plot(history['test_acc'], 's-', label='Test')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy (%)')
axes[1].set_title('Accuracy over Training')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# SOLVED EXAMPLE: Visualize Predictions

model.eval()

# Get some test images
images, labels = next(iter(test_loader))
images, labels = images.to(device), labels.to(device)

# Get predictions
with torch.no_grad():
    outputs = model(images)
    _, predictions = outputs.max(1)

# Move back to CPU for plotting
images = images.cpu()
labels = labels.cpu()
predictions = predictions.cpu()

# Plot
fig, axes = plt.subplots(2, 5, figsize=(12, 5))

for i, ax in enumerate(axes.flat):
    ax.imshow(images[i].squeeze(), cmap='gray')
    color = 'green' if predictions[i] == labels[i] else 'red'
    ax.set_title(f'Pred: {predictions[i].item()}\nTrue: {labels[i].item()}', color=color)
    ax.axis('off')

plt.suptitle('Model Predictions (Green=Correct, Red=Wrong)', fontsize=14)
plt.tight_layout()
plt.show()

## Question 3.1

Experiment with the network architecture. Try:
1. Adding more hidden layers
2. Changing the number of neurons per layer
3. Using dropout for regularization

Can you get above 98% test accuracy?

In [None]:
# YOUR CODE HERE

class ImprovedMNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Design your improved network
        # Hint: Try nn.Dropout(0.2) between layers
        pass
        
    def forward(self, x):
        pass

# Train and evaluate


---

# Challenge Problems

## Challenge 1: Fashion MNIST

Replace MNIST digits with Fashion MNIST (10 clothing categories).
Train a classifier and report your accuracy.

In [None]:
# YOUR CODE HERE

# Load Fashion MNIST
# Hint: datasets.FashionMNIST instead of datasets.MNIST

# Fashion MNIST classes:
fashion_classes = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                   'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']


## Challenge 2: Learning Rate Experiments

The learning rate is crucial! Compare:
- lr = 0.1 (too high?)
- lr = 0.01
- lr = 0.001 (the value used above, which is also Adam's default)
- lr = 0.0001 (too low?)

Plot the training curves for each.

In [None]:
# YOUR CODE HERE


## Challenge 3: Confusion Matrix

Create a confusion matrix showing which digits get confused with each other most often.

In [None]:
# YOUR CODE HERE

# Hint: Use sklearn.metrics.confusion_matrix
# and seaborn.heatmap for visualization

from sklearn.metrics import confusion_matrix


---

# Summary

## What We Learned

| Concept | Description |
|---------|-------------|
| **Tensors** | Multi-dimensional arrays that can run on GPU |
| **nn.Linear** | Fully connected layer: y = Wx + b |
| **Activations** | Non-linear functions (ReLU, Sigmoid, Softmax) |
| **Forward Pass** | Input → Network → Output |
| **Loss Function** | Measures how wrong our predictions are |
| **Backward Pass** | Computes gradients via backpropagation |
| **Optimizer** | Updates weights to minimize loss |

## Key Code Patterns

```python
# Define model
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# Training loop
for epoch in range(num_epochs):
    for images, labels in train_loader:
        outputs = model(images)        # Forward
        loss = criterion(outputs, labels)  # Loss
        optimizer.zero_grad()           # Clear gradients
        loss.backward()                 # Backward
        optimizer.step()                # Update weights
```

## What's Next?

In **Lab 4**, we'll use these skills to build a **language model** that predicts the next character!