# <font color="#418FDE" size="6.5" uppercase>**Execution and Review**</font>

>Last update: 20260130.
    
By the end of this lecture, you will be able to:
- Implement the full training and evaluation pipeline for the capstone project using PyTorch 2.10.0. 
- Apply at least one performance or scalability technique, such as torch.compile, AMP, or DDP, to the project. 
- Analyze project outcomes, document lessons learned, and identify areas for future improvement. 


## **1. Running PyTorch Experiments**

### **1.1. Running the Training Loop**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_10/Lecture_B/image_01_01.jpg?v=1769784542" width="250">



>* Training loop batches data, predicts, computes loss, backpropagates
>* Consistent structure, modes, and randomness ensure reliable learning

>* Carefully manage training state, mode, and gradients
>* Handle devices, graphs, and errors for stability

>* Add advanced controls like clipping, gradient accumulation, logging
>* Use a modular loop for reliable, scalable experiments



In [None]:
#@title Python Code - Running the Training Loop

# This script shows a simple PyTorch training loop.
# It focuses on running epochs and batches clearly.
# Use it to understand each training step precisely.

# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Import required standard libraries.
import os
import random
import math

# Import torch and related modules.
import torch
import torch.nn as nn
import torch.optim as optim

# Import torchvision for a tiny MNIST subset.
from torchvision import datasets
from torchvision import transforms

# Set deterministic random seeds.
SEED = 42
random.seed(SEED)
torch.manual_seed(SEED)

# Detect device for training loop.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print PyTorch version and device.
print("PyTorch", torch.__version__, "Device", DEVICE)

# Define a simple transform for MNIST.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# Download MNIST training data.
train_dataset_full = datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)

# Use a small subset for quick training.
subset_size = 512
indices = list(range(subset_size))
train_dataset = torch.utils.data.Subset(train_dataset_full, indices)

# Create a data loader with small batch size.
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=64, shuffle=True
)

# Define a simple fully connected classifier.
class SimpleMLP(nn.Module):
    # Initialize layers for the network.
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    # Define forward pass for inputs.
    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create model and move to device.
model = SimpleMLP().to(DEVICE)

# Define loss function and optimizer.
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Optionally compile model for performance.
if hasattr(torch, "compile"):
    model = torch.compile(model)

# Define a single training epoch function.
def train_one_epoch(model, loader, optimizer, criterion, device):
    # Set model to training mode.
    model.train()
    running_loss = 0.0
    total_batches = 0

    # Iterate over data batches.
    for batch_idx, (inputs, targets) in enumerate(loader):
        # Move data to the selected device.
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Validate input and target shapes.
        assert inputs.ndim == 4
        assert targets.ndim == 1

        # Zero gradients from previous step.
        optimizer.zero_grad()

        # Forward pass to get predictions.
        outputs = model(inputs)

        # Check output shape matches targets.
        assert outputs.shape[0] == targets.shape[0]

        # Compute loss for this batch.
        loss = criterion(outputs, targets)

        # Backward pass to compute gradients.
        loss.backward()

        # Update model parameters.
        optimizer.step()

        # Accumulate loss for reporting.
        running_loss += loss.item()
        total_batches += 1

    # Compute average loss for the epoch.
    avg_loss = running_loss / max(total_batches, 1)
    return avg_loss

# Run a small number of epochs.
num_epochs = 3

# Training loop over epochs.
for epoch in range(1, num_epochs + 1):
    # Train for one full epoch.
    avg_loss = train_one_epoch(
        model, train_loader, optimizer, criterion, DEVICE
    )

    # Print concise epoch summary.
    print(f"Epoch {epoch} average training loss: {avg_loss:.4f}")

# Confirm script finished successfully.
print("Training loop completed without errors.")
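
The bullets above mention gradient clipping and gradient accumulation, which the basic loop does not show. The following is a minimal sketch of how the two could be combined, not the course's reference implementation. The `train_with_accumulation` helper, the synthetic data, and the values `accum_steps=4` and `max_norm=1.0` are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic data: 8 micro-batches of 16 samples (illustrative sizes).
micro_batches = [
    (torch.randn(16, 10), torch.randint(0, 3, (16,))) for _ in range(8)
]

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)


def train_with_accumulation(model, batches, optimizer, criterion,
                            accum_steps=4, max_norm=1.0):
    """Hypothetical helper: accumulate gradients, clip, then step."""
    model.train()
    optimizer.zero_grad()
    optimizer_steps = 0
    grad_norms = []
    for i, (inputs, targets) in enumerate(batches, start=1):
        loss = criterion(model(inputs), targets)
        # Divide so the accumulated gradient matches one large-batch average.
        (loss / accum_steps).backward()
        if i % accum_steps == 0:
            # clip_grad_norm_ returns the total norm measured before clipping.
            total_norm = torch.nn.utils.clip_grad_norm_(
                model.parameters(), max_norm=max_norm
            )
            grad_norms.append(float(total_norm))
            optimizer.step()
            optimizer.zero_grad()
            optimizer_steps += 1
    return optimizer_steps, grad_norms


optimizer_steps, grad_norms = train_with_accumulation(
    model, micro_batches, optimizer, criterion
)
print("optimizer steps:", optimizer_steps)
print("gradient norms before clipping:", [round(n, 3) for n in grad_norms])
```

With 8 micro-batches and `accum_steps=4`, the optimizer steps twice. In this sketch, gradients from micro-batches left over when the count is not a multiple of `accum_steps` are never applied; a real loop would need to decide how to handle them.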



### **1.2. Experiment Metrics Logging**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_10/Lecture_B/image_01_02.jpg?v=1769784632" width="250">



>* Log training and validation metrics every epoch
>* Use consistent logging to track learning and reproducibility

>* Combine live feedback, saved logs, and visualizations
>* Use plots to compare settings and performance tradeoffs

>* Log task-specific metrics that match real goals
>* Store rich metadata to enable reproducible experiments



In [None]:
#@title Python Code - Experiment Metrics Logging

# This script shows simple experiment metrics logging.
# It uses TensorFlow to simulate a training experiment.
# Focus on recording metrics and visualizing learning progress.

# !pip install tensorflow==2.20.0

# Import standard libraries for math and plotting.
import os
import random
import numpy as np
import matplotlib.pyplot as plt

# Import TensorFlow and check version.
import tensorflow as tf

# Set deterministic random seeds for reproducibility.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Detect device type for information only.
physical_gpus = tf.config.list_physical_devices("GPU")
DEVICE = "GPU" if physical_gpus else "CPU"

# Print framework version and device summary.
print("TensorFlow version:", tf.__version__, "Device:", DEVICE)

# Load a small subset of MNIST digits dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Reduce dataset size to keep runtime very small.
train_samples = 2000
test_samples = 500
x_train = x_train[:train_samples]
y_train = y_train[:train_samples]

# Slice test data for quick evaluation.
x_test = x_test[:test_samples]
y_test = y_test[:test_samples]

# Normalize images to range zero to one.
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Add channel dimension for convolutional layers.
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)

# Validate shapes before building the model.
assert x_train.shape[0] == train_samples
assert x_test.shape[0] == test_samples

# Build a tiny convolutional classification model.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(
        8,
        (3, 3),
        activation="relu",
        input_shape=(28, 28, 1),
    ),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Compile model with optimizer, loss, and accuracy metric.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Prepare a simple dictionary for metrics logging.
metrics_log = {
    "epoch": [],
    "train_loss": [],
    "train_accuracy": [],
    "val_loss": [],
    "val_accuracy": [],
}


# Define a custom callback to capture metrics each epoch.
class SimpleMetricsLogger(tf.keras.callbacks.Callback):
    # Initialize callback with external log dictionary.
    def __init__(self, log_dict):
        super().__init__()
        self.log_dict = log_dict

    # At epoch end, store metrics into the dictionary.
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.log_dict["epoch"].append(epoch + 1)
        self.log_dict["train_loss"].append(float(logs.get("loss", 0.0)))
        self.log_dict["train_accuracy"].append(
            float(logs.get("accuracy", 0.0))
        )
        self.log_dict["val_loss"].append(float(logs.get("val_loss", 0.0)))
        self.log_dict["val_accuracy"].append(
            float(logs.get("val_accuracy", 0.0))
        )


# Create callback instance for training.
logger_callback = SimpleMetricsLogger(metrics_log)

# Train the model quietly while logging metrics.
history = model.fit(
    x_train,
    y_train,
    epochs=5,
    batch_size=64,
    validation_data=(x_test, y_test),
    verbose=0,
    callbacks=[logger_callback],
)

# Convert logged epochs and training loss to NumPy arrays.
logged_epochs = np.array(metrics_log["epoch"], dtype=int)
train_loss = np.array(metrics_log["train_loss"], dtype=float)

# Print a compact summary of logged metrics.
print("Logged epochs:", logged_epochs.tolist())
print("Train loss per epoch:", np.round(train_loss, 3).tolist())

# Plot training and validation accuracy curves.
plt.figure(figsize=(6, 4))
plt.plot(metrics_log["epoch"], metrics_log["train_accuracy"], label="train_acc")
plt.plot(metrics_log["epoch"], metrics_log["val_accuracy"], label="val_acc")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.title("Experiment metrics logging example")
plt.legend()
plt.tight_layout()
plt.show()
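
The cell above relies on a Keras callback, but the logging idea itself is framework-agnostic. As a rough sketch, the same per-epoch record can be written to a CSV file with a JSON metadata file beside it, using only the standard library, and called from any PyTorch training loop. The `RunLogger` class, the file names, and the metric values below are hypothetical placeholders, not results from a real run.

```python
import csv
import json
import os
import platform
import tempfile
import time


class RunLogger:
    """Hypothetical helper: one CSV row per epoch plus a metadata file."""

    def __init__(self, run_dir, config, fieldnames):
        os.makedirs(run_dir, exist_ok=True)
        self.csv_path = os.path.join(run_dir, "metrics.csv")
        self.metadata_path = os.path.join(run_dir, "metadata.json")
        self.fieldnames = ["epoch"] + list(fieldnames)
        # Store what is needed to reproduce the run next to the metrics.
        metadata = {
            "config": config,
            "python_version": platform.python_version(),
            "started_at": time.strftime("%Y-%m-%d %H:%M:%S"),
        }
        with open(self.metadata_path, "w") as f:
            json.dump(metadata, f, indent=2)
        with open(self.csv_path, "w", newline="") as f:
            csv.DictWriter(f, fieldnames=self.fieldnames).writeheader()

    def log(self, epoch, **metrics):
        # Append each epoch so earlier rows survive a later crash.
        with open(self.csv_path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=self.fieldnames)
            writer.writerow({"epoch": epoch, **metrics})


run_dir = tempfile.mkdtemp(prefix="run_")
logger = RunLogger(
    run_dir,
    config={"lr": 0.1, "seed": 42},
    fieldnames=["train_loss", "val_loss"],
)

# Placeholder numbers standing in for values produced by a training loop.
placeholder_metrics = [(0.9, 1.0), (0.6, 0.8), (0.5, 0.75)]
for epoch, (train_loss, val_loss) in enumerate(placeholder_metrics, start=1):
    logger.log(epoch, train_loss=train_loss, val_loss=val_loss)

with open(logger.csv_path, newline="") as f:
    rows = list(csv.DictReader(f))
print(len(rows), "epochs logged to", logger.csv_path)
```

Appending one row per epoch keeps the log usable even if the run stops early, and the metadata file records the configuration that produced it.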




### **1.3. Managing Checkpoints**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_10/Lecture_B/image_01_03.jpg?v=1769784719" width="250">



>* Checkpoints save model and training state snapshots
>* They prevent progress loss and enable flexible experimentation

>* Save weights, optimizer, and best model regularly
>* Include metadata and clear filenames for reproducibility

>* Plan retention, pruning, and safe checkpoint saving
>* Regularly test restoring checkpoints to ensure reliability



In [None]:
#@title Python Code - Managing checkpoints

# This script shows simple PyTorch checkpointing.
# It focuses on saving and loading safely.
# Use it as a template for experiments.

# !pip install torch torchvision

# Import standard libraries for paths and randomness.
import os
import random
import math

# Import torch and basic neural network tools.
import torch
import torch.nn as nn
import torch.optim as optim

# Set a deterministic random seed value.
SEED = 42
random.seed(SEED)

# Set seeds for torch and CUDA if available.
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Select device based on GPU availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print the torch version and selected device.
print("torch version:", torch.__version__, "device:", device)

# Define a tiny feedforward model for demonstration.
class TinyNet(nn.Module):
    # Initialize layers with small hidden size.
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    # Define forward pass through the network.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Create a tiny synthetic classification dataset.
def make_tiny_dataset(num_samples: int, input_dim: int, num_classes: int):
    # Create random features with normal distribution.
    x = torch.randn(num_samples, input_dim)
    # Create random integer labels for classes.
    y = torch.randint(0, num_classes, (num_samples,))
    return x, y

# Simple training step for one small epoch.
def train_one_epoch(model, optimizer, criterion, x, y):
    # Set model to training mode for updates.
    model.train()
    optimizer.zero_grad()
    # Forward pass and loss computation.
    logits = model(x)
    loss = criterion(logits, y)
    # Backward pass and parameter update.
    loss.backward()
    optimizer.step()
    return loss.item()

# Simple evaluation step without gradient tracking.
def evaluate(model, criterion, x, y):
    # Set model to evaluation mode for metrics.
    model.eval()
    with torch.no_grad():
        logits = model(x)
        loss = criterion(logits, y)
        # Compute accuracy over tiny dataset.
        preds = torch.argmax(logits, dim=1)
        correct = (preds == y).sum().item()
        acc = correct / max(1, y.numel())
    return loss.item(), acc

# Helper to build a checkpoint dictionary.
def build_checkpoint(epoch, model, optimizer, best_val_loss, config):
    # Store model and optimizer state dictionaries.
    return {
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "best_val_loss": best_val_loss,
        "config": config,
    }

# Helper to save checkpoint safely to disk.
def save_checkpoint(state, is_best, checkpoint_dir):
    # Ensure checkpoint directory exists on disk.
    os.makedirs(checkpoint_dir, exist_ok=True)
    # Define main checkpoint file path.
    ckpt_path = os.path.join(checkpoint_dir, "last_checkpoint.pt")
    # Write to a temporary file first, then rename it into place,
    # so an interrupted save cannot corrupt the previous checkpoint.
    tmp_path = ckpt_path + ".tmp"
    torch.save(state, tmp_path)
    os.replace(tmp_path, ckpt_path)
    # Optionally save separate best checkpoint file.
    if is_best:
        best_path = os.path.join(checkpoint_dir, "best_checkpoint.pt")
        torch.save(state, best_path)

# Helper to load checkpoint if it exists.
def load_checkpoint_if_available(model, optimizer, checkpoint_dir):
    # Define checkpoint path for last checkpoint.
    ckpt_path = os.path.join(checkpoint_dir, "last_checkpoint.pt")
    if not os.path.exists(ckpt_path):
        # Return defaults when no checkpoint exists.
        return 0, math.inf
    # Load checkpoint to CPU for safety.
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    # Restore model and optimizer states.
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    # Return starting epoch and best validation loss.
    return int(checkpoint["epoch"]) + 1, float(checkpoint["best_val_loss"])

# Main function to run tiny experiment.
def main():
    # Define simple configuration dictionary.
    config = {
        "input_dim": 8,
        "hidden_dim": 16,
        "num_classes": 3,
        "num_epochs": 4,
        "lr": 0.01,
        "checkpoint_dir": "checkpoints_demo",
    }

    # Create tiny training and validation datasets on the CPU.
    x_train, y_train = make_tiny_dataset(64, config["input_dim"], config["num_classes"])
    x_val, y_val = make_tiny_dataset(32, config["input_dim"], config["num_classes"])

    # Validate shapes before training loop.
    assert x_train.shape[1] == config["input_dim"]
    assert x_val.shape[1] == config["input_dim"]

    # Move tensors to selected device.
    x_train_device = x_train.to(device)
    y_train_device = y_train.to(device)
    x_val_device = x_val.to(device)
    y_val_device = y_val.to(device)

    # Initialize model, loss function, and optimizer.
    model = TinyNet(config["input_dim"], config["hidden_dim"], config["num_classes"]).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=config["lr"])

    # Try to resume from existing checkpoint.
    start_epoch, best_val_loss = load_checkpoint_if_available(
        model, optimizer, config["checkpoint_dir"]
    )

    # Print basic resume information for clarity.
    print("Starting epoch:", start_epoch, "best_val_loss:", round(best_val_loss, 4))

    # Run a short training loop with checkpointing.
    for epoch in range(start_epoch, config["num_epochs"]):
        train_loss = train_one_epoch(
            model, optimizer, criterion, x_train_device, y_train_device
        )
        val_loss, val_acc = evaluate(
            model, criterion, x_val_device, y_val_device
        )
        # Decide if this is the best validation loss.
        is_best = val_loss < best_val_loss
        if is_best:
            best_val_loss = val_loss
        # Build checkpoint state dictionary.
        state = build_checkpoint(epoch, model, optimizer, best_val_loss, config)
        # Save last and maybe best checkpoint.
        save_checkpoint(state, is_best, config["checkpoint_dir"])
        # Print compact progress information.
        print(
            f"Epoch {epoch} train_loss={train_loss:.3f} val_loss={val_loss:.3f} val_acc={val_acc:.2f}"
        )

    # Load best checkpoint to evaluate final performance.
    best_path = os.path.join(config["checkpoint_dir"], "best_checkpoint.pt")
    if os.path.exists(best_path):
        best_state = torch.load(best_path, map_location=device)
        model.load_state_dict(best_state["model_state"])
        # Evaluate best model on validation data.
        best_loss, best_acc = evaluate(
            model, criterion, x_val_device, y_val_device
        )
        print("Best checkpoint val_loss=", round(best_loss, 3), "val_acc=", round(best_acc, 2))
    else:
        # Inform user when no best checkpoint exists.
        print("No best checkpoint found, used last model only.")

# Run the main function when script executes.
main()
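
The cell above keeps only a last and a best checkpoint, so the retention and pruning mentioned in the bullets do not arise. For runs that also save one file per epoch, a pruning step might look like the sketch below. It assumes files named `epoch_XXXX.pt` and a hypothetical `prune_checkpoints` helper. Empty placeholder files stand in for real checkpoints written with `torch.save`.

```python
import os
import re
import tempfile


def prune_checkpoints(checkpoint_dir, keep_last=3):
    """Hypothetical helper: delete all but the newest `keep_last` epoch files."""
    pattern = re.compile(r"^epoch_(\d+)\.pt$")
    found = []
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match:
            found.append((int(match.group(1)), name))
    # Sort by epoch number, not by file name or modification time.
    found.sort()
    to_delete = found[:-keep_last] if keep_last > 0 else found
    for _, name in to_delete:
        os.remove(os.path.join(checkpoint_dir, name))
    return [name for _, name in to_delete]


checkpoint_dir = tempfile.mkdtemp(prefix="ckpt_")
for epoch in range(6):
    open(os.path.join(checkpoint_dir, f"epoch_{epoch:04d}.pt"), "w").close()
# A best-model file does not match the pattern, so it is never pruned.
open(os.path.join(checkpoint_dir, "best_checkpoint.pt"), "w").close()

removed = prune_checkpoints(checkpoint_dir, keep_last=3)
remaining = sorted(os.listdir(checkpoint_dir))
print("removed:", removed)
print("remaining:", remaining)
```

Matching on an explicit file-name pattern keeps the best checkpoint and any unrelated files out of reach of the pruning step.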




## **2. Performance Optimizations in PyTorch**

### **2.1. Adding torch.compile**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_10/Lecture_B/image_02_01.jpg?v=1769784840" width="250">



>* Compilation analyzes your model and optimizes execution
>* Training runs faster without changing most code

>* Wrap the model with torch.compile; little other code changes
>* Start safe, then try faster compile settings

>* Compilation has a warm-up cost but speeds up training
>* Refactor dynamic models so the compiler can optimize effectively



In [None]:
#@title Python Code - Adding torch compile

# This script shows how to wrap a model with torch.compile.
# We compare eager and compiled models on tiny data.
# On a run this short, compile warm-up usually dominates the compiled timing.

# Install PyTorch if needed for local environments.
# !pip install torch torchvision torchaudio --quiet

# Import required standard libraries.
import os
import random
import time

# Import torch and related modules.
import torch
import torch.nn as nn
import torch.optim as optim

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)

# Set torch manual seed for cpu and cuda.
torch.manual_seed(seed_value)

# Detect device preferring cuda when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print torch version and selected device.
print("Torch version:", torch.__version__, "Device:", device)

# Define a tiny feedforward model for classification.
class TinyNet(nn.Module):
    # Initialize layers with small hidden size.
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    # Define forward pass using sequential network.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Create a tiny synthetic classification dataset.
def make_tiny_dataset(num_samples: int, input_dim: int, num_classes: int):
    # Create random features with normal distribution.
    x = torch.randn(num_samples, input_dim)
    # Create random integer labels for classes.
    y = torch.randint(0, num_classes, (num_samples,))
    # Validate shapes before returning tensors.
    assert x.shape == (num_samples, input_dim)
    assert y.shape == (num_samples,)
    return x, y

# Simple training loop returning final loss value.
def train_one_model(model: nn.Module, data, labels, epochs: int = 3):
    # Move model and data to selected device.
    model = model.to(device)
    data = data.to(device)
    labels = labels.to(device)

    # Define optimizer and loss function.
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()

    # Track loss for final reporting.
    final_loss = None

    # Loop over epochs with small number.
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        final_loss = loss.item()

    # Return last epoch loss value.
    return final_loss

# Helper to time training for a given model.
def time_training(model: nn.Module, data, labels, label: str):
    # Synchronize cuda before timing if available.
    if device.type == "cuda":
        torch.cuda.synchronize()
    # perf_counter is a monotonic, high-resolution clock suited to timing.
    start = time.perf_counter()

    # Train model and capture final loss.
    final_loss = train_one_model(model, data, labels, epochs=5)

    # Synchronize cuda again after training.
    if device.type == "cuda":
        torch.cuda.synchronize()
    duration = time.perf_counter() - start

    # Print short summary line for this run.
    print(label, "time: {:.4f}s loss: {:.4f}".format(duration, final_loss))

# Main execution block for eager and compiled runs.
def main():
    # Define tiny problem dimensions.
    input_dim = 32
    hidden_dim = 64
    num_classes = 4

    # Build small synthetic dataset.
    x, y = make_tiny_dataset(num_samples=512, input_dim=input_dim, num_classes=num_classes)

    # Create eager model instance.
    eager_model = TinyNet(input_dim=input_dim, hidden_dim=hidden_dim, num_classes=num_classes)

    # Time training for eager model.
    time_training(eager_model, x, y, label="Eager model")

    # Create fresh model for compiled run.
    compiled_model = TinyNet(input_dim=input_dim, hidden_dim=hidden_dim, num_classes=num_classes)

    # Wrap model with torch compile using default mode.
    compiled_model = torch.compile(compiled_model)

    # Time training for compiled model.
    time_training(compiled_model, x, y, label="Compiled model")

# Run main when script is executed.
main()
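
Because the first calls to a compiled model include the compilation itself, a fairer comparison times only the steps that follow a warm-up phase. The sketch below shows that measurement pattern with a hypothetical `time_steps` helper. It is demonstrated on an eager model so that it runs in any environment; the same helper could be given a step function built around a `torch.compile`-wrapped model. The warm-up and iteration counts are illustrative assumptions.

```python
import time

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def time_steps(step_fn, warmup=3, iters=10):
    """Hypothetical helper: mean seconds per step, excluding warm-up."""
    # Warm-up calls absorb one-time costs such as compilation or caching.
    for _ in range(warmup):
        step_fn()
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


def make_step(model, x, y):
    """Build a closure that runs one full optimization step."""
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()

    def step():
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    return step


x = torch.randn(512, 32, device=device)
y = torch.randint(0, 4, (512,), device=device)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4)).to(device)

seconds_per_step = time_steps(make_step(model, x, y))
print(f"Eager: {seconds_per_step * 1e3:.3f} ms per step after warm-up")
```

Reporting time per step after warm-up separates the steady-state speed from the one-time compilation cost, which can then be measured and reported on its own.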




### **2.2. Using AMP or DDP**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_10/Lecture_B/image_02_02.jpg?v=1769784921" width="250">



>* AMP uses lower precision to boost GPU efficiency
>* Train faster, larger models without extra hardware

>* DDP trains across multiple GPUs in parallel
>* Faster runs enable more experiments within deadlines

>* Pick AMP for memory or single-GPU limits
>* Use DDP or both after profiling bottlenecks



In [None]:
#@title Python Code - Using AMP or DDP

# This script shows basic PyTorch AMP usage.
# It trains a tiny model on random data.
# Focus is on clear beginner friendly structure.

# !pip install torch torchvision torchaudio

# Import required standard libraries.
import os
import random
import math

# Import torch and related modules.
import torch
import torch.nn as nn
import torch.optim as optim

# Set deterministic random seeds.
seed_value = 42
random.seed(seed_value)

# Set seeds for torch randomness.
torch.manual_seed(seed_value)

# Select device based on availability.
use_cuda = torch.cuda.is_available()

device = torch.device("cuda" if use_cuda else "cpu")

# Print PyTorch version and device.
print("PyTorch version:", torch.__version__, "Device:", device)

# Define a tiny feedforward network.
class TinyNet(nn.Module):
    # Initialize layers for the network.
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    # Define forward computation step.
    def forward(self, x):
        return self.net(x)

# Set simple problem dimensions.
input_dim = 20
hidden_dim = 32

# Set number of output classes.
output_dim = 3

# Create random training data tensor.
num_samples = 256

# Generate random features and labels.
features = torch.randn(num_samples, input_dim)

# Create random integer class labels.
labels = torch.randint(0, output_dim, (num_samples,))

# Move data to selected device.
features = features.to(device)
labels = labels.to(device)

# Validate shapes before training.
assert features.shape == (num_samples, input_dim)

# Validate labels shape and type.
assert labels.shape == (num_samples,)

# Create model and move to device.
model = TinyNet(input_dim, hidden_dim, output_dim).to(device)

# Define loss function and optimizer.
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Create gradient scaler for AMP (torch.cuda.amp.GradScaler is deprecated).
scaler = torch.amp.GradScaler("cuda", enabled=use_cuda)

# Set training hyperparameters.
batch_size = 32
epochs = 3

# Compute number of batches per epoch.
num_batches = math.ceil(num_samples / batch_size)

# Training loop with AMP enabled.
for epoch in range(epochs):
    # Set model to training mode.
    model.train()
    epoch_loss = 0.0

    # Iterate over mini batches.
    for batch_idx in range(num_batches):
        start = batch_idx * batch_size
        end = start + batch_size

        # Slice current batch safely.
        batch_x = features[start:end]
        batch_y = labels[start:end]

        # Ensure batch is non empty.
        if batch_x.size(0) == 0:
            continue

        # Zero gradients before backward.
        optimizer.zero_grad(set_to_none=True)

        # Forward pass under autocast (torch.cuda.amp.autocast is deprecated).
        with torch.autocast(device_type=device.type, enabled=use_cuda):
            outputs = model(batch_x)
            loss = criterion(outputs, batch_y)

        # Scale loss and backpropagate.
        scaler.scale(loss).backward()

        # Step optimizer through scaler.
        scaler.step(optimizer)
        scaler.update()

        # Accumulate loss value.
        epoch_loss += loss.item() * batch_x.size(0)

    # Compute average epoch loss.
    avg_loss = epoch_loss / num_samples

    # Print compact training progress.
    print(f"Epoch {epoch+1}/{epochs}, loss {avg_loss:.4f}")

# Switch model to evaluation mode.
model.eval()

# Disable gradients for evaluation.
with torch.no_grad():
    # Use autocast also during evaluation.
    with torch.autocast(device_type=device.type, enabled=use_cuda):
        logits = model(features)

# Compute predicted classes.
preds = torch.argmax(logits, dim=1)

# Calculate accuracy on training data.
correct = (preds == labels).sum().item()

# Compute accuracy fraction safely.
accuracy = correct / num_samples

# Print final accuracy summary.
print(f"Final training accuracy with AMP: {accuracy:.2%}")
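
The cell above demonstrates only AMP. For the DDP half of this section, the sketch below wraps a model in `DistributedDataParallel` inside a single-process group on the CPU, so that it can run without a launcher or multiple GPUs. The `gloo` backend, the file-based `init_method`, and `world_size=1` are assumptions made for illustration; a real multi-GPU job would typically be started with `torchrun`, use one process per GPU, and shard the data with `DistributedSampler`.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

torch.manual_seed(0)

# Single-process group so the sketch runs without a launcher.
# Under torchrun, rank and world size would come from the environment.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{init_file}",
    rank=0,
    world_size=1,
)

# Tiny model and random data, mirroring the AMP example's dimensions.
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 3))
ddp_model = DDP(model)  # on GPU, device_ids=[local_rank] would be passed

features = torch.randn(256, 20)
labels = torch.randint(0, 3, (256,))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(ddp_model.parameters(), lr=0.1)

losses = []
for _ in range(5):
    optimizer.zero_grad()
    loss = criterion(ddp_model(features), labels)
    # During backward, DDP averages gradients across all ranks.
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

world_size = dist.get_world_size()
# Always tear down the process group when training ends.
dist.destroy_process_group()

print("world size:", world_size)
print("losses:", [round(value, 4) for value in losses])
```

The training loop itself is unchanged from the single-device version; only the process-group setup, the `DDP` wrapper, and the teardown are new.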



### **2.3. Evaluating Optimization Results**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_10/Lecture_B/image_02_03.jpg?v=1769784994" width="250">



>* Define what performance improvement means for you
>* Measure baseline metrics to compare future optimizations

>* Compare runs with identical settings and metrics
>* Weigh speed gains against accuracy and usefulness

>* Check stability, consistency, and numerical issues carefully
>* Balance speed gains with deployment complexity and workflow



In [None]:
#@title Python Code - Evaluating Optimization Results

# This script compares baseline and optimized training performance.
# It uses a tiny synthetic dataset for speed and clarity.
# We focus on timing and accuracy for simple evaluation.

# !pip install torch torchvision

# Import required standard libraries.
import time
import random
import os

# Import torch and related modules.
import torch
import torch.nn as nn
import torch.optim as optim

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
torch.manual_seed(seed_value)

# Detect device and print PyTorch version.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("PyTorch version:", torch.__version__)

# Create a tiny synthetic classification dataset.
num_samples = 512
num_features = 20
num_classes = 3

# Generate random features and integer labels.
X = torch.randn(num_samples, num_features)
y = torch.randint(0, num_classes, (num_samples,))

# Split into train and validation subsets.
train_size = int(0.8 * num_samples)
val_size = num_samples - train_size

# Use torch.utils.data random split for subsets.
train_dataset, val_dataset = torch.utils.data.random_split(
    list(zip(X, y)), [train_size, val_size]
)

# Define a simple feedforward neural network.
class TinyNet(nn.Module):

    # Initialize layers with small hidden size.
    def __init__(self, in_features, hidden_size, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes),
        )

    # Forward pass through the network.
    def forward(self, x):
        return self.net(x)

# Create data loaders with small batch size.
batch_size = 64
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True
)
val_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, shuffle=False
)

# Utility function to run one training epoch.
def train_one_epoch(model, loader, optimizer, criterion, scaler=None):
    model.train()
    total_loss = 0.0
    for inputs, targets in loader:
        inputs = inputs.to(device)
        targets = targets.to(device)
        optimizer.zero_grad()
        if scaler is None:
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        else:
            with torch.autocast(device_type=device.type, dtype=torch.float16):
                outputs = model(inputs)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        total_loss += loss.item() * inputs.size(0)
    return total_loss / len(loader.dataset)

# Utility function to evaluate accuracy on validation set.
@torch.no_grad()
def evaluate_accuracy(model, loader):
    model.eval()
    correct = 0
    total = 0
    for inputs, targets in loader:
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = model(inputs)
        preds = outputs.argmax(dim=1)
        correct += (preds == targets).sum().item()
        total += targets.size(0)
    if total == 0:
        return 0.0
    return correct / total

# Helper to run a short experiment and collect metrics.
def run_experiment(use_amp=False, use_compile=False):
    model = TinyNet(num_features, 32, num_classes).to(device)
    if use_compile and hasattr(torch, "compile"):
        model = torch.compile(model)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.amp.GradScaler("cuda") if use_amp and device.type == "cuda" else None
    epochs = 3
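    # Synchronize CUDA so the timer covers all queued GPU work.
    if device.type == "cuda":
        torch.cuda.synchronize()
    start_time = time.perf_counter()
    for _ in range(epochs):
        train_one_epoch(model, train_loader, optimizer, criterion, scaler)
    if device.type == "cuda":
        torch.cuda.synchronize()
    total_time = time.perf_counter() - start_time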
    val_acc = evaluate_accuracy(model, val_loader)
    return total_time, val_acc

# Run baseline experiment without optimizations.
baseline_time, baseline_acc = run_experiment(
    use_amp=False, use_compile=False
)

# Run optimized experiment with AMP when CUDA is available.
# On CPU this repeats the baseline, so any reported speedup is timing noise.
use_amp_flag = device.type == "cuda"
opt_time, opt_acc = run_experiment(
    use_amp=use_amp_flag, use_compile=False
)

# Compute relative speedup and accuracy change.
speedup = baseline_time / opt_time if opt_time > 0 else 1.0
acc_diff = opt_acc - baseline_acc

# Print concise summary of both runs.
print("Device used:", device)
print("Baseline time (s):", round(baseline_time, 4))
print("Baseline accuracy:", round(baseline_acc, 4))
print("Optimized time (s):", round(opt_time, 4))
print("Optimized accuracy:", round(opt_acc, 4))
print("Speedup factor:", round(speedup, 3))
print("Accuracy difference:", round(acc_diff, 4))
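A single three-epoch timing is noisy, so the speedup printed above can shift from run to run. One possible way to steady the comparison is to repeat each configuration and compare medians. In the sketch below, `summarize_speedup` is a hypothetical helper and the timing lists are made-up placeholders, not measured results; in practice they would be filled from repeated calls to `run_experiment`.

```python
import statistics


def summarize_speedup(baseline_times, optimized_times):
    """Compare two lists of wall-clock timings using their medians."""
    base = statistics.median(baseline_times)
    opt = statistics.median(optimized_times)
    speedup = base / opt if opt > 0 else float("nan")
    return {
        "baseline_median_s": base,
        "optimized_median_s": opt,
        "speedup": speedup,
    }


# Placeholder timings in seconds (illustrative only, not real measurements).
baseline_times = [1.20, 1.10, 1.30]
optimized_times = [0.60, 0.55, 0.70]

summary = summarize_speedup(baseline_times, optimized_times)
print("Median baseline (s):", summary["baseline_median_s"])
print("Median optimized (s):", summary["optimized_median_s"])
print("Median speedup:", round(summary["speedup"], 3))
```

The median is used here because it is less sensitive than the mean to one slow run, such as a first run that includes warm-up costs.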




## **3. Reflect Review Improve**

### **3.1. Capstone Results Summary**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_10/Lecture_B/image_03_01.jpg?v=1769785081" width="250">



>* Describe model goal, audience, setup, and metrics
>* Tell a concise story of performance and limitations

>* Compare performance across data slices and subgroups
>* Note fairness issues, surprises, and inconsistent behaviors

>* Link metrics to real-world use and constraints
>* Summarize strengths, limits, and best future improvements
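The slice comparison mentioned above can be sketched as a small helper that reports accuracy per group instead of one overall number. This is a minimal illustration, assuming predictions, targets, and a slice label per sample are already available as tensors; `accuracy_by_slice` is a hypothetical name, and the toy tensors are invented for the example.

```python
import torch


def accuracy_by_slice(preds, targets, slice_ids):
    """Return {slice_id: accuracy} for each distinct value in slice_ids."""
    results = {}
    for slice_id in torch.unique(slice_ids).tolist():
        mask = slice_ids == slice_id
        results[slice_id] = (preds[mask] == targets[mask]).float().mean().item()
    return results


# Toy values standing in for real model outputs (assumption for illustration).
preds = torch.tensor([0, 1, 1, 2, 2, 0])
targets = torch.tensor([0, 1, 0, 2, 1, 0])

# Here each slice is simply the true class; any per-sample grouping would work.
report = accuracy_by_slice(preds, targets, slice_ids=targets)
overall = (preds == targets).float().mean().item()

print("Overall accuracy:", round(overall, 3))
for slice_id, acc in report.items():
    print("Slice", slice_id, "accuracy:", round(acc, 3))
```

Reporting the per-slice values next to the overall figure makes it easier to spot a subgroup that the headline metric hides.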



### **3.2. Common PyTorch Pitfalls**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_10/Lecture_B/image_03_02.jpg?v=1769785097" width="250">



>* Switch train/eval modes to avoid distorted metrics
>* Manage gradient tracking carefully to preserve computation graphs

>* Unoptimized data pipelines cause bottlenecks and instability
>* Poor device placement and memory choices hurt performance

>* Avoid data leakage, overused validation, uncontrolled randomness
>* Ensure clean splits, documentation, and reproducible experiments
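As a rough illustration of the clean-split and reproducibility points, the sketch below builds train, validation, and test splits from a dedicated seeded generator and then checks that they do not overlap. The toy dataset, the 70/15/15 sizes, and the helper name `make_splits` are assumptions for the example, not part of the capstone code.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset standing in for the capstone data (assumption for illustration).
features = torch.arange(100, dtype=torch.float32).unsqueeze(1)
labels = torch.arange(100) % 2
dataset = TensorDataset(features, labels)


def make_splits(data, seed=0):
    """Split into train/val/test using a dedicated, seeded generator."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(data, [70, 15, 15], generator=generator)


train_set, val_set, test_set = make_splits(dataset)

# Subset.indices lists the positions each split draws from the parent dataset.
train_idx = set(train_set.indices)
val_idx = set(val_set.indices)
test_idx = set(test_set.indices)

no_overlap = (
    train_idx.isdisjoint(val_idx)
    and train_idx.isdisjoint(test_idx)
    and val_idx.isdisjoint(test_idx)
)
print("Splits are disjoint:", no_overlap)

# Re-running with the same seed should give the same train split.
same_again = make_splits(dataset)[0].indices == train_set.indices
print("Same seed gives same train split:", same_again)
```

Passing an explicit generator keeps the split independent of any other random calls made earlier in the notebook.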



In [None]:
#@title Python Code - Common PyTorch Pitfalls

# This script shows common PyTorch beginner pitfalls.
# It compares wrong and correct evaluation patterns.
# Focus on modes, devices, gradients, and evaluation mistakes.

# !pip install torch torchvision

# Import required standard libraries.
import random

# Import torch and torchvision if available.
try:
    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets
    from torchvision import transforms
except Exception as e:
    raise RuntimeError("PyTorch and torchvision are required here") from e

# Set deterministic random seeds for reproducibility.
random.seed(0)

# Set torch manual seed for reproducibility.
torch.manual_seed(0)

# Select device safely with fallback to cpu.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print framework version and selected device.
print("Torch", torch.__version__, "Device", device)

# Define a tiny neural network with dropout layer.
class TinyNet(nn.Module):
    # Initialize layers including dropout and linear.
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, 64)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(64, 10)

    # Define forward pass using dropout and relu.
    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.drop(x)
        x = self.fc2(x)
        return x

# Prepare a simple transform to tensor only.
transform = transforms.ToTensor()

# Download the MNIST training set; a small subset is taken below.
train_dataset = datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)

# Download the MNIST test set; a small subset is taken below.
test_dataset = datasets.MNIST(
    root="./data", train=False, download=True, transform=transform
)

# Use only a small subset for speed.
small_train_size = 256

# Use only a small subset for evaluation.
small_test_size = 256

# Create subset indices safely within dataset length.
train_indices = list(range(min(small_train_size, len(train_dataset))))

# Create subset indices for test dataset.
test_indices = list(range(min(small_test_size, len(test_dataset))))

# Wrap subsets with torch subset utility.
train_subset = torch.utils.data.Subset(train_dataset, train_indices)

# Wrap test subset similarly for evaluation.
test_subset = torch.utils.data.Subset(test_dataset, test_indices)

# Create data loaders with small batch size.
train_loader = DataLoader(train_subset, batch_size=64, shuffle=True)

# Create test loader without shuffle for evaluation.
test_loader = DataLoader(test_subset, batch_size=64, shuffle=False)

# Helper function to compute accuracy on loader.
def evaluate_model(model, loader, use_eval_mode, use_no_grad):
    # Optionally set evaluation or training mode.
    if use_eval_mode:
        model.eval()
    else:
        model.train()

    # Optionally disable gradient tracking.
    context = torch.no_grad() if use_no_grad else torch.enable_grad()

    # Initialize counters for accuracy computation.
    correct = 0
    total = 0

    # Use chosen context manager for evaluation.
    with context:
        for images, labels in loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            preds = outputs.argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)

    # Avoid division by zero with safe check.
    if total == 0:
        return 0.0

    # Return accuracy as float percentage.
    return 100.0 * correct / float(total)

# Create model instance and move to device.
model = TinyNet().to(device)

# Define loss function and optimizer.
criterion = nn.CrossEntropyLoss()

# Use simple stochastic gradient descent optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Train for one short epoch to get non-random weights.
model.train()

# Iterate over one epoch only for speed.
for images, labels in train_loader:
    images = images.to(device)
    labels = labels.to(device)
    optimizer.zero_grad()
    outputs = model(images)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

# Demonstrate bad practice evaluation with train mode.
bad_acc = evaluate_model(
    model=model, loader=test_loader, use_eval_mode=False, use_no_grad=False
)

# Demonstrate good practice evaluation with eval mode.
good_acc = evaluate_model(
    model=model, loader=test_loader, use_eval_mode=True, use_no_grad=True
)

# Show both accuracies to highlight dropout effect.
print("Bad evaluation accuracy with train mode:", round(bad_acc, 2))

# Show correct evaluation accuracy with eval and no_grad.
print("Good evaluation accuracy with eval mode:", round(good_acc, 2))

# Create a CPU tensor that mismatches the device when CUDA is selected.
cpu_tensor = torch.ones(2, 2)

# Create another tensor directly on the selected device.
device_tensor = torch.ones(2, 2, device=device)

# Check devices before the risky operation.
if cpu_tensor.device != device_tensor.device:
    print("Warning: device mismatch detected; moving tensor before addition.")

# Move cpu tensor to correct device before addition.
fixed_tensor = cpu_tensor.to(device) + device_tensor

# Print the result shape to confirm the addition succeeded.
print("Safe addition result shape:", tuple(fixed_tensor.shape))
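The evaluation helper above toggles `torch.no_grad()`. A related gradient-tracking pitfall shows up when logging: summing loss tensors directly keeps each batch's computation graph reachable, which can grow memory over a long epoch. The sketch below is a minimal illustration with a made-up linear model and random data; the variable names are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny stand-in model and batch (assumptions for illustration).
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))

# Pitfall: adding loss tensors keeps every batch's graph attached to the sum.
running_bad = 0.0
for _ in range(3):
    loss = criterion(model(inputs), targets)
    running_bad = running_bad + loss

# Safer: convert each loss to a Python float before accumulating.
running_good = 0.0
for _ in range(3):
    loss = criterion(model(inputs), targets)
    running_good += loss.item()

print("Tensor sum still tracks gradients:", running_bad.requires_grad)
print("Float sum type:", type(running_good).__name__)
```

Both sums hold the same value; only the float version lets the per-batch graphs be freed as soon as each backward pass is done.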




### **3.3. Next Steps with PyTorch**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_10/Lecture_B/image_03_03.jpg?v=1769785188" width="250">



>* Turn your capstone into an evolving project
>* Plan concrete growth directions tied to real use

>* Explore new deployment options and advanced training methods
>* Run focused experiments to understand model behavior deeply

>* Embed projects in real-world, collaborative contexts
>* Use new constraints to iteratively refine systems



# <font color="#418FDE" size="6.5" uppercase>**Execution and Review**</font>


In this lecture, you learned to:
- Implement the full training and evaluation pipeline for the capstone project using PyTorch 2.10.0. 
- Apply at least one performance or scalability technique, such as torch.compile, AMP, or DDP, to the project. 
- Analyze project outcomes, document lessons learned, and identify areas for future improvement. 

<font color='yellow'>Congratulations on completing this course!</font>