# <font color="#418FDE" size="6.5" uppercase>**Losses and Optimizers**</font>

>Last update: 20260129.
    
By the end of this lecture, you will be able to:
- Select and configure appropriate loss functions for common supervised learning tasks in PyTorch. 
- Use optimizers such as SGD and Adam to update model parameters within a training loop. 
- Inspect and debug optimization behavior using learning rate settings, gradient norms, and loss curves. 


## **1. PyTorch Loss Functions**

### **1.1. Regression and Classification**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_01_01.jpg?v=1769663895" width="250">



>* Decide if your task is regression or classification
>* Regression predicts numbers; classification predicts class probabilities

>* Regression losses reflect how you value errors
>* They use continuous outputs, penalizing large deviations

>* Classification loss compares class scores and labels
>* It rewards high probability on the correct class
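
The pairing of task type and loss can be sketched in a few lines. This is a minimal illustration, assuming a made-up batch of 4 examples with 5 features; the names `features`, `regression_head`, and `classification_head` are hypothetical and not part of the lecture's code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch: 4 examples with 5 features each.
features = torch.randn(4, 5)

# Regression: one continuous output per example, compared with float targets.
regression_head = nn.Linear(5, 1)
regression_targets = torch.randn(4, 1)
regression_loss = nn.MSELoss()(regression_head(features), regression_targets)

# Classification: one raw score (logit) per class, compared with integer labels.
classification_head = nn.Linear(5, 3)
class_labels = torch.tensor([0, 2, 1, 2])
classification_loss = nn.CrossEntropyLoss()(
    classification_head(features), class_labels
)

print("Regression output shape:", tuple(regression_head(features).shape))
print("Classification output shape:", tuple(classification_head(features).shape))
print("MSE loss:", regression_loss.item())
print("Cross-entropy loss:", classification_loss.item())
```

Note the different target types: float tensors shaped like the predictions for regression, and integer class indices for single-label classification.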



### **1.2. MSE and L1 Losses**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_01_02.jpg?v=1769663912" width="250">



>* MSE and L1 measure regression prediction errors
>* MSE penalizes large errors quadratically; L1 penalizes all errors linearly

>* MSE emphasizes large errors and gives smooth gradients for optimization
>* It can overreact to noisy labels and outliers

>* L1 loss resists outliers and noisy labels
>* Choose L1 or MSE based on error tradeoffs
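
One way to see why MSE reacts more strongly to outliers is to compare the gradients each loss sends back to the predictions. The sketch below uses three made-up errors (0.5, 1.0, and 10.0) and `reduction="sum"` so that each per-example gradient is easy to read; the tensor values are illustrative assumptions.

```python
import torch

# Three errors of increasing size; the last plays the role of an outlier.
targets = torch.zeros(3)
preds = torch.tensor([0.5, 1.0, 10.0], requires_grad=True)

# With reduction="sum", the MSE gradient per example is 2 * error.
mse = torch.nn.MSELoss(reduction="sum")(preds, targets)
mse_grads, = torch.autograd.grad(mse, preds)

# The L1 gradient per example is only the sign of the error.
l1 = torch.nn.L1Loss(reduction="sum")(preds, targets)
l1_grads, = torch.autograd.grad(l1, preds)

print("MSE gradients:", mse_grads.tolist())
print("L1 gradients:", l1_grads.tolist())
```

The MSE gradient grows with the size of the error, so the outlier dominates the update, while the L1 gradient has the same magnitude for every example.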



In [None]:
#@title Python Code - MSE and L1 Losses

# This script compares MSE and L1 losses.
# It uses tiny tensors for clear intuition.
# Run cells in order inside Google Colab.

# Optional install line for PyTorch in Colab.
# !pip install torch torchvision torchaudio

# Import torch for tensor and loss operations.
import torch

# Set deterministic seed for reproducible results.
torch.manual_seed(0)

# Create small batch of target regression values.
true_targets = torch.tensor([[2.0], [0.0], [1.0]])

# Create predictions with one large error included.
predictions = torch.tensor([[2.5], [3.0], [0.5]])

# Validate shapes to avoid broadcasting mistakes.
assert predictions.shape == true_targets.shape

# Define mean squared error loss function object.
mse_loss_fn = torch.nn.MSELoss(reduction="mean")

# Define L1 loss function object for comparison.
l1_loss_fn = torch.nn.L1Loss(reduction="mean")

# Compute MSE loss value for current predictions.
mse_value = mse_loss_fn(predictions, true_targets)

# Compute L1 loss value for current predictions.
l1_value = l1_loss_fn(predictions, true_targets)

# Print both loss values with short explanations.
print("MSE loss value:", float(mse_value))

# Show L1 loss value for the same predictions.
print("L1 loss value:", float(l1_value))

# Compute elementwise squared errors for illustration.
squared_errors = (predictions - true_targets) ** 2

# Compute elementwise absolute errors for illustration.
absolute_errors = (predictions - true_targets).abs()

# Print per example squared and absolute errors.
print("Squared errors per example:", squared_errors.view(-1).tolist())

# Print per example absolute errors for comparison.
print("Absolute errors per example:", absolute_errors.view(-1).tolist())

# Show that large errors dominate MSE more strongly.
print("MSE emphasizes large errors more than L1.")



### **1.3. Classification Loss Choices**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_01_03.jpg?v=1769663937" width="250">



>* Classification losses depend on outputs and labels
>* Pass raw logits to cross-entropy; do not apply softmax first

>* Use cross-entropy for single-label classification tasks
>* Use BCEWithLogitsLoss (sigmoid plus BCE) for multi-label tasks; the wrong choice hurts training

>* Adjust loss for imbalance and noisy labels
>* Use weighting or focal losses to prioritize errors
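
The advice about raw logits can be checked directly. The sketch below, using the functional API as an assumption of convenience, shows that cross-entropy on logits matches `log_softmax` followed by `nll_loss`, and that applying softmax before the loss silently changes the value.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])

# Recommended usage: pass raw logits.
correct = F.cross_entropy(logits, target)

# Equivalent manual computation: log-softmax followed by negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Common mistake: softmax is applied twice because the loss applies it internally.
double_softmax = F.cross_entropy(torch.softmax(logits, dim=1), target)

print("Cross-entropy on logits:", correct.item())
print("Manual log_softmax + nll:", manual.item())
print("Cross-entropy on probabilities (mistake):", double_softmax.item())
```

No error is raised in the mistaken case, which is why it tends to go unnoticed; the loss is simply larger and its gradients are weaker than intended.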



In [None]:
#@title Python Code - Classification Loss Choices

# This script compares basic classification loss choices.
# It uses tiny tensors to keep things simple.
# Focus on logits, targets, and loss configuration.

# Optional install for PyTorch if missing.
# !pip install torch torchvision torchaudio

# Import required standard libraries.
import os
import random
import math

# Try importing torch and handle absence gracefully.
try:
    import torch
    import torch.nn as nn
except ImportError:
    raise SystemExit("PyTorch is required for this example.")

# Set deterministic random seeds for reproducibility.
random.seed(0)

torch.manual_seed(0)

# Print PyTorch version in one short line.
print("PyTorch version:", torch.__version__)

# Create logits for single label three class example.
logits_single = torch.tensor([[2.0, 0.5, -1.0]])

# Create integer target for single label classification.
target_single = torch.tensor([0])

# Validate shapes for single label example.
assert logits_single.shape == (1, 3)

# Define cross entropy loss for single label case.
ce_loss = nn.CrossEntropyLoss()

# Compute loss using logits and integer target.
loss_single = ce_loss(logits_single, target_single)

# Print single label cross entropy loss value.
print("Single label CE loss:", float(loss_single))

# Create logits for multi label three class example.
logits_multi = torch.tensor([[2.0, -0.5, 1.0]])

# Create multi hot targets for multi label example.
target_multi = torch.tensor([[1.0, 0.0, 1.0]])

# Validate shapes for multi label example.
assert logits_multi.shape == target_multi.shape

# Define BCE with logits loss for multi label.
bce_loss = nn.BCEWithLogitsLoss()

# Compute loss using logits and multi hot targets.
loss_multi = bce_loss(logits_multi, target_multi)

# Print multi label BCE with logits loss value.
print("Multi label BCE loss:", float(loss_multi))

# Define class weights for imbalanced single label case.
class_weights = torch.tensor([3.0, 1.0, 1.0])

# Validate weights length matches class count.
assert class_weights.shape[0] == logits_single.shape[1]

# Create weighted cross entropy loss instance.
weighted_ce = nn.CrossEntropyLoss(weight=class_weights)

# Compute weighted loss for same single label example.
loss_weighted = weighted_ce(logits_single, target_single)

# Print weighted cross entropy loss value.
print("Weighted single label CE loss:", float(loss_weighted))

# Show how probabilities differ for softmax versus sigmoid.
softmax_probs = torch.softmax(logits_single, dim=1)

# Compute sigmoid probabilities for multi label logits.
sigmoid_probs = torch.sigmoid(logits_multi)

# Print both probability style outputs briefly.
print("Softmax probs single:", softmax_probs.tolist())

# Print sigmoid probabilities for multi label example.
print("Sigmoid probs multi:", sigmoid_probs.tolist())

# Final line confirms script finished without issues.
print("Finished comparing classification loss choices.")



## **2. PyTorch Optimizers Overview**

### **2.1. Momentum in SGD**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_02_01.jpg?v=1769663995" width="250">



>* Momentum makes SGD faster and more stable
>* It smooths noisy gradients using past helpful directions

>* Momentum stores a velocity from past gradients
>* It speeds movement and smooths zigzagging updates

>* Too much momentum can overshoot and oscillate
>* Start with moderate momentum, tune for stability
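
The velocity idea can be written out by hand and compared with `torch.optim.SGD`. This is a sketch assuming the default settings (`dampening=0`, `nesterov=False`), under which the update is `velocity = momentum * velocity + grad` followed by `w = w - lr * velocity`; the toy loss `w**2` and the names `w_opt` and `w_manual` are illustrative.

```python
import torch

lr, momentum = 0.1, 0.9

# Parameter updated by torch.optim.SGD with momentum.
w_opt = torch.tensor([5.0], requires_grad=True)
optimizer = torch.optim.SGD([w_opt], lr=lr, momentum=momentum)

# Same starting point, updated by hand with an explicit velocity.
w_manual = torch.tensor([5.0])
velocity = torch.zeros(1)

for step in range(5):
    # Optimizer path: the loss is w**2, so autograd produces the gradient 2 * w.
    optimizer.zero_grad()
    loss = (w_opt ** 2).sum()
    loss.backward()
    optimizer.step()

    # Manual path: accumulate past gradients into the velocity, then step.
    grad = 2 * w_manual
    velocity = momentum * velocity + grad
    w_manual = w_manual - lr * velocity

    print(step, round(w_opt.item(), 4), round(w_manual.item(), 4))
```

Both columns should agree up to floating-point rounding, and the parameter overshoots past zero within a few steps, which illustrates how high momentum can oscillate.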



In [None]:
#@title Python Code - Momentum in SGD

# This script demonstrates momentum in SGD.
# We compare SGD with and without momentum.
# Focus on a tiny linear regression example.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and check version.
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)

# Set TensorFlow random seed.
tf.random.set_seed(seed_value)

# Create simple synthetic linear regression data.
true_w = 2.0
true_b = -1.0
num_samples = 64

# Generate input features.
X = np.linspace(-1.0, 1.0, num_samples).astype(np.float32)

# Generate targets with small noise.
noise = 0.05 * np.random.randn(num_samples).astype(np.float32)
y = true_w * X + true_b + noise

# Reshape data for TensorFlow models.
X_tf = X.reshape(-1, 1)
y_tf = y.reshape(-1, 1)

# Build a tiny linear model function.
def build_model():
    # Create a simple sequential model.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1,)),
        tf.keras.layers.Dense(1)
    ])
    return model

# Create two models and copy weights so both start from identical parameters.
model_sgd = build_model()
model_mom = build_model()
model_mom.set_weights(model_sgd.get_weights())

# Configure optimizer without momentum.
opt_sgd = tf.keras.optimizers.SGD(learning_rate=0.1)

# Configure optimizer with momentum enabled.
opt_mom = tf.keras.optimizers.SGD(
    learning_rate=0.1,
    momentum=0.9
)

# Define mean squared error loss function.
loss_fn = tf.keras.losses.MeanSquaredError()

# Prepare dataset as TensorFlow tensors.
X_tensor = tf.convert_to_tensor(X_tf)
y_tensor = tf.convert_to_tensor(y_tf)

# Validate shapes before training.
assert X_tensor.shape[0] == y_tensor.shape[0]

# Training settings for both optimizers.
epochs = 20

# Containers to store loss history.
loss_history_sgd = []
loss_history_mom = []

# Training loop comparing both optimizers.
for epoch in range(epochs):
    # Record gradients and update plain SGD model.
    with tf.GradientTape() as tape_sgd:
        preds_sgd = model_sgd(X_tensor, training=True)
        loss_sgd = loss_fn(y_tensor, preds_sgd)
    grads_sgd = tape_sgd.gradient(loss_sgd, model_sgd.trainable_variables)
    opt_sgd.apply_gradients(zip(grads_sgd, model_sgd.trainable_variables))

    # Record gradients and update momentum model.
    with tf.GradientTape() as tape_mom:
        preds_mom = model_mom(X_tensor, training=True)
        loss_mom = loss_fn(y_tensor, preds_mom)
    grads_mom = tape_mom.gradient(loss_mom, model_mom.trainable_variables)
    opt_mom.apply_gradients(zip(grads_mom, model_mom.trainable_variables))

    # Store scalar losses for later inspection.
    loss_history_sgd.append(float(loss_sgd.numpy()))
    loss_history_mom.append(float(loss_mom.numpy()))

# Print a compact comparison of losses.
print("Epoch  Loss_SGD  Loss_Momentum")
for i in range(epochs):
    # Format each epoch line clearly.
    print(i + 1, round(loss_history_sgd[i], 4), round(loss_history_mom[i], 4))




### **2.2. Adaptive Adam Optimizers**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_02_02.jpg?v=1769664054" width="250">



>* Adam adapts learning rates per parameter automatically
>* This helps parameters with very different gradient scales train faster and more stably

>* Adam fits into the usual training loop
>* Its adaptive per-parameter steps make it more forgiving of the learning rate choice

>* Tune learning rate, betas, and weight decay
>* Monitor loss curves to compare and refine Adam
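
The per-parameter adaptation can be made visible with a single optimizer step. In this sketch, two parameters receive gradients that differ by four orders of magnitude (an assumption chosen for illustration), and the helper `first_step_sizes` is hypothetical. With default betas and epsilon, the first Adam step is approximately `lr` in size for each parameter, whereas the SGD step is proportional to the gradient.

```python
import torch

lr = 0.1

# Two parameters whose gradients differ by four orders of magnitude.
scales = torch.tensor([100.0, 0.01])

def first_step_sizes(optimizer_cls):
    """Return the absolute size of the first update for each parameter."""
    params = torch.zeros(2, requires_grad=True)
    optimizer = optimizer_cls([params], lr=lr)
    loss = (scales * params).sum()  # the gradient equals `scales`
    loss.backward()
    optimizer.step()
    return params.detach().abs()

sgd_steps = first_step_sizes(torch.optim.SGD)
adam_steps = first_step_sizes(torch.optim.Adam)

print("SGD first step sizes:", sgd_steps.tolist())
print("Adam first step sizes:", adam_steps.tolist())
```

SGD moves the first parameter ten thousand times farther than the second, while Adam moves both by roughly the same amount.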



In [None]:
#@title Python Code - Adaptive Adam Optimizers

# This script demonstrates adaptive Adam optimizer basics.
# We compare SGD and Adam on a tiny regression task.
# Focus on training loop and optimizer configuration.

# !pip install torch torchvision torchaudio

# Import required standard libraries.
import os
import math
import random

# Import torch and check availability.
import torch
import torch.nn as nn

# Set deterministic seeds for reproducibility.
random.seed(0)

# Set torch manual seed for reproducibility.
torch.manual_seed(0)

# Select device based on GPU availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print framework version and selected device.
print("PyTorch version:", torch.__version__, "Device:", device)

# Create small synthetic regression dataset.
true_w, true_b = 2.0, -1.0

# Generate input features as a column vector.
x = torch.linspace(-1.0, 1.0, steps=40).unsqueeze(1)

# Generate targets with a simple linear relationship.
y = true_w * x + true_b

# Move data tensors to the selected device.
x, y = x.to(device), y.to(device)

# Validate shapes before training loop.
assert x.shape == y.shape

# Define a tiny linear regression model.
model_sgd = nn.Linear(in_features=1, out_features=1).to(device)

# Create a separate model copy for Adam optimizer.
model_adam = nn.Linear(in_features=1, out_features=1).to(device)

# Initialize Adam model with same parameters as SGD model.
model_adam.load_state_dict(model_sgd.state_dict())

# Define mean squared error loss function.
criterion = nn.MSELoss()

# Configure SGD optimizer with fixed learning rate.
optimizer_sgd = torch.optim.SGD(model_sgd.parameters(), lr=0.05)

# Configure Adam optimizer with typical hyperparameters.
optimizer_adam = torch.optim.Adam(
    model_adam.parameters(), lr=0.05, betas=(0.9, 0.999)
)

# Set number of training epochs small for speed.
num_epochs = 40

# Prepare lists to store loss values for inspection.
sgd_losses, adam_losses = [], []

# Training loop comparing SGD and Adam side by side.
for epoch in range(num_epochs):

    # Zero gradients for SGD model.
    optimizer_sgd.zero_grad()

    # Forward pass for SGD model.
    preds_sgd = model_sgd(x)

    # Compute loss for SGD model.
    loss_sgd = criterion(preds_sgd, y)

    # Backward pass to compute gradients.
    loss_sgd.backward()

    # Optional gradient norm inspection for SGD.
    total_norm_sgd = 0.0

    # Accumulate squared gradient norms for SGD.
    for p in model_sgd.parameters():
        if p.grad is not None:
            total_norm_sgd += p.grad.detach().norm().item() ** 2

    # Take square root to get overall gradient norm.
    total_norm_sgd = math.sqrt(total_norm_sgd)

    # Update parameters using SGD step.
    optimizer_sgd.step()

    # Zero gradients for Adam model.
    optimizer_adam.zero_grad()

    # Forward pass for Adam model.
    preds_adam = model_adam(x)

    # Compute loss for Adam model.
    loss_adam = criterion(preds_adam, y)

    # Backward pass to compute gradients.
    loss_adam.backward()

    # Optional gradient norm inspection for Adam.
    total_norm_adam = 0.0

    # Accumulate squared gradient norms for Adam.
    for p in model_adam.parameters():
        if p.grad is not None:
            total_norm_adam += p.grad.detach().norm().item() ** 2

    # Take square root to get overall gradient norm.
    total_norm_adam = math.sqrt(total_norm_adam)

    # Update parameters using Adam step.
    optimizer_adam.step()

    # Store detached loss values for later printing.
    sgd_losses.append(loss_sgd.item())

    # Store Adam loss values for comparison.
    adam_losses.append(loss_adam.item())

# Print a few selected epochs to avoid spam.
print("Epoch 0 SGD loss:", round(sgd_losses[0], 4))

# Print Adam loss at first epoch.
print("Epoch 0 Adam loss:", round(adam_losses[0], 4))

# Print middle epoch losses for both optimizers.
mid = num_epochs // 2

# Show SGD loss at middle epoch.
print("Epoch", mid, "SGD loss:", round(sgd_losses[mid], 4))

# Show Adam loss at middle epoch.
print("Epoch", mid, "Adam loss:", round(adam_losses[mid], 4))

# Print final epoch losses for both optimizers.
print("Epoch", num_epochs - 1, "SGD loss:", round(sgd_losses[-1], 4))

# Print final Adam loss to compare convergence.
print("Epoch", num_epochs - 1, "Adam loss:", round(adam_losses[-1], 4))

# Show learned parameters for SGD model.
print("SGD learned w, b:", model_sgd.weight.item(), model_sgd.bias.item())

# Show learned parameters for Adam model.
print("Adam learned w, b:", model_adam.weight.item(), model_adam.bias.item())

# Print true underlying parameters for reference.
print("True w, b:", true_w, true_b)



### **2.3. Tuning Optimizer Settings**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_02_03.jpg?v=1769664088" width="250">



>* Learning rate balances speed and training stability
>* Watch loss curves to adjust update step sizes

>* Optimizer hyperparameters shape update speed and smoothness
>* Tuning momentum and Adam settings improves stability

>* Consider a learning rate schedule instead of a fixed rate
>* Schedules balance fast early learning and stable convergence
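
A learning rate schedule can be attached to any optimizer through `torch.optim.lr_scheduler`. The sketch below uses `StepLR`, which is one of several available schedulers and is chosen here only as a simple example; the data, the step size of 5, and the decay factor of 0.5 are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny synthetic regression problem.
x = torch.randn(32, 1)
y = 2.0 * x

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate every 5 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

lr_history = []
for epoch in range(15):
    # Record the learning rate used for this epoch.
    lr_history.append(optimizer.param_groups[0]["lr"])

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    # Advance the schedule after the optimizer step.
    scheduler.step()

print("Learning rate per epoch:", lr_history)
```

The recorded rates stay at 0.1 for the first five epochs, then drop to 0.05 and 0.025, giving larger steps early and smaller steps near convergence.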



In [None]:
#@title Python Code - Tuning Optimizer Settings

# This script shows basic optimizer tuning concepts.
# We compare SGD and Adam with different learning rates.
# Focus on simple training loops and clear outputs.

# !pip install torch torchvision

# Import required standard libraries.
import os
import random
import math

# Import torch and check availability.
import torch
import torch.nn as nn
import torch.optim as optim

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)

# Set seeds for torch random generators.
torch.manual_seed(seed_value)

# Select device based on GPU availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print torch version and selected device.
print("PyTorch version:", torch.__version__, "Device:", device)

# Create a tiny synthetic regression dataset.
true_w = torch.tensor([[2.0], [-3.0]])

# Create true bias for the synthetic data.
true_b = torch.tensor([0.5])

# Generate input features on CPU for simplicity.
X = torch.randn(200, 2)

# Generate targets with a simple linear rule.
y = X @ true_w + true_b

# Move data to the selected device.
X = X.to(device)

# Move targets to the selected device.
y = y.to(device)

# Confirm shapes are as expected.
assert X.shape == (200, 2)

# Confirm target shape matches expectations.
assert y.shape == (200, 1)

# Define a simple linear regression model.
class TinyRegressor(nn.Module):
    # Initialize with one linear layer.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1)

    # Define forward computation for inputs.
    def forward(self, x):
        return self.linear(x)


# Helper function to train for a few epochs.
def train_model(optimizer_name, learning_rate, momentum=None):
    # Create a fresh model for each run.
    model = TinyRegressor().to(device)

    # Use mean squared error loss function.
    criterion = nn.MSELoss()

    # Choose optimizer based on requested name.
    if optimizer_name == "SGD":
        if momentum is None:
            optimizer = optim.SGD(model.parameters(), lr=learning_rate)
        else:
            optimizer = optim.SGD(
                model.parameters(), lr=learning_rate, momentum=momentum
            )
    else:
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Store losses for each epoch.
    losses = []

    # Run a short deterministic training loop.
    for epoch in range(15):
        # Set model to training mode.
        model.train()

        # Forward pass to compute predictions.
        preds = model(X)

        # Compute loss between predictions and targets.
        loss = criterion(preds, y)

        # Zero gradients before backward pass.
        optimizer.zero_grad()

        # Backward pass to compute gradients.
        loss.backward()

        # Optional gradient norm check for stability.
        total_norm = 0.0
        for p in model.parameters():
            if p.grad is not None:
                param_norm = p.grad.detach().norm(2).item()
                total_norm += param_norm ** 2
        total_norm = math.sqrt(total_norm)

        # Take one optimization step.
        optimizer.step()

        # Record loss value for later inspection.
        losses.append(loss.item())

    # Return final loss and last gradient norm.
    return losses[-1], total_norm


# Define different optimizer settings to compare.
settings = [
    ("SGD", 0.001, 0.0),
    ("SGD", 0.1, 0.0),
    ("SGD", 0.1, 0.9),
    ("Adam", 0.001, None),
]

# Run experiments and collect results.
results = []
for name, lr, mom in settings:
    # Train model with given hyperparameters.
    final_loss, grad_norm = train_model(name, lr, mom)

    # Store a summary tuple for printing.
    results.append((name, lr, mom, final_loss, grad_norm))

# Print a short header for clarity.
print("Optimizer, lr, momentum, final_loss, grad_norm")

# Print one summary line per configuration.
for name, lr, mom, loss_val, gnorm in results:
    print(
        name,
        "lr=",
        lr,
        "mom=",
        mom,
        "loss=",
        round(loss_val, 4),
        "g_norm=",
        round(gnorm, 4),
    )




## **3. Monitoring Optimization Progress**

### **3.1. Tracking Loss Trends**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_03_01.jpg?v=1769664158" width="250">



>* Track loss over epochs to judge learning
>* Flat or erratic loss suggests training problems

>* Compare training and validation loss to assess generalization
>* Use loss patterns to detect overfitting or underfitting

>* Loss curve shape reveals learning rate issues
>* Visual patterns guide debugging and training adjustments
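
Recording training and validation loss side by side takes only a few extra lines in a PyTorch loop. This is a minimal sketch on synthetic data; the 80/20 split, the learning rate, and the number of epochs are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: y = 2x - 1 plus noise, split into training and validation parts.
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 2.0 * x - 1.0 + 0.1 * torch.randn_like(x)
perm = torch.randperm(100)
train_idx, val_idx = perm[:80], perm[80:]

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

train_losses, val_losses = [], []
for epoch in range(30):
    # Training step on the training split.
    model.train()
    optimizer.zero_grad()
    train_loss = criterion(model(x[train_idx]), y[train_idx])
    train_loss.backward()
    optimizer.step()

    # Validation loss without tracking gradients.
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x[val_idx]), y[val_idx])

    train_losses.append(train_loss.item())
    val_losses.append(val_loss.item())

print("First epoch train/val:", round(train_losses[0], 4), round(val_losses[0], 4))
print("Last epoch train/val:", round(train_losses[-1], 4), round(val_losses[-1], 4))
```

The two lists can be passed to any plotting library; a validation curve that rises while the training curve keeps falling is the usual sign of overfitting.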



In [None]:
#@title Python Code - Tracking Loss Trends

# This script shows how loss trends behave.
# We train a tiny model and record losses.
# Then we plot and briefly inspect them.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras utilities.
import tensorflow as tf
from tensorflow import keras

# Print TensorFlow version once.
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Load MNIST dataset from Keras.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Use a small subset for quick training.
train_samples = 2000
test_samples = 1000
x_train = x_train[:train_samples]
y_train = y_train[:train_samples]

# Reduce test set size.
x_test = x_test[:test_samples]
y_test = y_test[:test_samples]

# Normalize pixel values to range zero one.
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Flatten images into vectors.
input_shape = (28 * 28,)
x_train = x_train.reshape((-1, input_shape[0]))
x_test = x_test.reshape((-1, input_shape[0]))

# Validate shapes before building model.
assert x_train.shape[1] == input_shape[0]
assert x_test.shape[1] == input_shape[0]

# Build a simple dense neural network.
model = keras.Sequential([
    keras.layers.Input(shape=input_shape),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

# Choose optimizer and loss function.
optimizer = keras.optimizers.Adam(learning_rate=0.001)
loss_fn = keras.losses.SparseCategoricalCrossentropy()

# Prepare metrics for monitoring.
train_loss_metric = keras.metrics.Mean(name="train_loss")
val_loss_metric = keras.metrics.Mean(name="val_loss")

# Create TensorFlow datasets for batching.
batch_size = 64
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(buffer_size=train_samples, seed=seed_value)
train_ds = train_ds.batch(batch_size)

# Create validation dataset.
val_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))
val_ds = val_ds.batch(batch_size)

# Prepare lists to store epoch losses.
train_losses = []
val_losses = []

# Define number of epochs for demonstration.
num_epochs = 5

# Training loop with manual loss tracking.
for epoch in range(num_epochs):
    train_loss_metric.reset_state()
    val_loss_metric.reset_state()

    # Iterate over training batches.
    for step, (images, labels) in enumerate(train_ds):
        with tf.GradientTape() as tape:
            # The model ends in softmax, so its outputs are probabilities.
            probs = model(images, training=True)
            loss_value = loss_fn(labels, probs)
        gradients = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        train_loss_metric.update_state(loss_value)

    # Iterate over validation batches.
    for val_images, val_labels in val_ds:
        val_probs = model(val_images, training=False)
        val_loss_value = loss_fn(val_labels, val_probs)
        val_loss_metric.update_state(val_loss_value)

    # Store average losses for this epoch.
    epoch_train_loss = float(train_loss_metric.result().numpy())
    epoch_val_loss = float(val_loss_metric.result().numpy())
    train_losses.append(epoch_train_loss)
    val_losses.append(epoch_val_loss)

    # Print concise summary for this epoch.
    print(
        f"Epoch {epoch + 1}: train_loss={epoch_train_loss:.4f}, "
        f"val_loss={epoch_val_loss:.4f}"
    )

# Import matplotlib for plotting.
import matplotlib.pyplot as plt

# Create a simple loss curve plot.
plt.figure(figsize=(6, 4))
plt.plot(range(1, num_epochs + 1), train_losses, label="Train loss")
plt.plot(range(1, num_epochs + 1), val_losses, label="Validation loss")

# Label axes and add legend.
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training and validation loss trends")
plt.legend()

# Display the plot to inspect trends.
plt.show()



### **3.2. Gradient Norm Monitoring**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_03_02.jpg?v=1769664232" width="250">



>* Gradient norms show overall strength of parameter updates
>* Healthy training often shows norms that start large and gradually shrink

>* Huge gradient norms signal unstable, exploding training
>* Fix by lowering learning rate or clipping

>* Tiny gradients mean stalled or vanishing learning
>* Log norms to intervene and improve optimization
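
Clipping can be sketched with `torch.nn.utils.clip_grad_norm_`, which rescales all gradients in place so their combined norm does not exceed `max_norm` and returns the norm measured before clipping. The inputs below are deliberately scaled by 100 to provoke a large gradient; that scaling and the threshold of 1.0 are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(3, 1)

# Deliberately large inputs and targets to produce a large gradient.
x = 100.0 * torch.randn(16, 3)
y = 100.0 * torch.randn(16, 1)

loss = nn.MSELoss()(model(x), y)
loss.backward()

# Clip in place; the return value is the total norm before clipping.
max_norm = 1.0
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

# Recompute the total norm from the (now clipped) gradients.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print("Gradient norm before clipping:", norm_before.item())
print("Gradient norm after clipping:", norm_after.item())
```

In a training loop, the clipping call goes between `loss.backward()` and `optimizer.step()`, and logging `norm_before` each step doubles as gradient norm monitoring.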



In [None]:
#@title Python Code - Gradient Norm Monitoring

# This script shows gradient norm monitoring simply.
# We train a tiny model and track gradient magnitudes.
# Use this to debug unstable or stagnant optimization.

# Install PyTorch if needed in your environment.
# !pip install torch torchvision torchaudio

# Import required standard libraries.
import os
import random
import math

# Import torch and check version and device.
import torch
import torch.nn as nn
import torch.optim as optim

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)

# Set seeds for torch random number generators.
torch.manual_seed(seed_value)

# Select device based on GPU availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print framework version and selected device.
print("PyTorch version:", torch.__version__, "Device:", device)

# Create a tiny synthetic regression dataset.
num_samples = 64

# Generate input features from normal distribution.
X = torch.randn(num_samples, 1)

# Generate targets with simple linear relationship.
y = 3.0 * X + 0.5

# Move data to selected device safely.
X = X.to(device)

y = y.to(device)

# Validate shapes before building model.
assert X.shape[0] == y.shape[0]

# Define a tiny linear regression model.
model = nn.Linear(1, 1).to(device)

# Choose mean squared error loss function.
criterion = nn.MSELoss()

# Configure optimizer with slightly high learning rate.
optimizer = optim.SGD(model.parameters(), lr=0.2)

# Function to compute total gradient norm value.
def compute_grad_norm(parameters):
    total_norm_sq = 0.0
    for p in parameters:
        if p.grad is None:
            continue
        param_norm = p.grad.detach().norm(2).item()
        total_norm_sq += param_norm ** 2
    return math.sqrt(total_norm_sq)

# Training loop with gradient norm monitoring.
num_epochs = 8

grad_norm_history = []

# Run several epochs and record loss and gradient norms.
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    grad_norm = compute_grad_norm(model.parameters())
    grad_norm_history.append(grad_norm)
    optimizer.step()
    print(
        f"Epoch {epoch+1}: loss={loss.item():.4f}, grad_norm={grad_norm:.4f}"
    )

# Show final learned parameters.
for name, param in model.named_parameters():
    print(name, "value:", param.data.view(-1).tolist())

# Print short summary of gradient norm trend.
print("Gradient norms per epoch:", [round(g, 4) for g in grad_norm_history])



### **3.3. Learning Rate Dynamics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_03_03.jpg?v=1769664278" width="250">



>* Learning rate controls how fast parameters change
>* A rate that is too high or too low prevents steady loss improvement

>* Track schedule changes with loss and validation
>* Relate loss patterns to learning rates that are too high or too low

>* Correlate learning rate, loss, and gradients carefully
>* Use logs to tune schedules and reduce instability
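
The three regimes (too low, reasonable, too high) can be reproduced on a one-parameter problem. In this sketch the loss is `w**2`, so each gradient descent step multiplies `w` by `1 - 2 * lr`; the helper `run_gd` and the three learning rates are hypothetical choices for illustration.

```python
import torch

def run_gd(lr, steps=20):
    """Minimize w**2 from w = 1 with plain SGD and return the loss history."""
    w = torch.tensor([1.0], requires_grad=True)
    optimizer = torch.optim.SGD([w], lr=lr)
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = (w ** 2).sum()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

# Too low, reasonable, and too high learning rates.
histories = {lr: run_gd(lr) for lr in (0.01, 0.4, 1.1)}

for lr, losses in histories.items():
    print(f"lr={lr}: first loss={losses[0]:.4f}, last loss={losses[-1]:.4g}")
```

With `lr=0.01` the loss falls slowly, with `lr=0.4` it collapses toward zero within a few steps, and with `lr=1.1` the factor `1 - 2 * lr` has magnitude above one, so the loss grows at every step.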



In [None]:
#@title Python Code - Learning Rate Dynamics

# This script visualizes simple learning rate dynamics.
# It compares high and low learning rate behaviors.
# Use it to connect curves with optimization stability.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import math
import random

# Import TensorFlow and check version.
import tensorflow as tf

# Set deterministic random seeds for reproducibility.
seed_value = 42
random.seed(seed_value)

# Set TensorFlow random seed for reproducibility.
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create a tiny synthetic regression dataset.
true_w = tf.constant([[2.0]], dtype=tf.float32)

# Define true bias for the synthetic data.
true_b = tf.constant([0.5], dtype=tf.float32)

# Generate simple input features as a column vector.
x_values = tf.linspace(-1.0, 1.0, 64)

# Reshape inputs to match linear model expectations.
x_values = tf.reshape(x_values, (-1, 1))

# Generate noiseless targets using the true parameters.
y_values = tf.matmul(x_values, true_w) + true_b

# Confirm shapes are as expected before training.
assert x_values.shape == (64, 1)

# Confirm target shape matches input batch dimension.
assert y_values.shape == (64, 1)

# Define a simple linear regression model function.
def linear_model(x, w, b):
    return tf.matmul(x, w) + b

# Define mean squared error loss function.
def mse_loss(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))

# Draw one initial weight so both models start from the same point.
w_init = tf.random.normal((1, 1))

# Create trainable parameters for the low learning rate model.
w_low = tf.Variable(w_init)

# Initialize bias for low learning rate model.
b_low = tf.Variable(tf.zeros((1,)))

# Initialize the high learning rate model from the same starting weight.
w_high = tf.Variable(w_init)

# Initialize bias for high learning rate model.
b_high = tf.Variable(tf.zeros((1,)))

# Create optimizers with different learning rates.
optimizer_low = tf.keras.optimizers.SGD(learning_rate=0.01)

# Define a more aggressive learning rate optimizer.
optimizer_high = tf.keras.optimizers.SGD(learning_rate=0.5)

# Prepare containers to store loss and gradient norms.
loss_history_low = []

# Store high learning rate loss values per epoch.
loss_history_high = []

# Store gradient norms for low learning rate model.
grad_norms_low = []

# Store gradient norms for high learning rate model.
grad_norms_high = []

# Define a single training step function.
@tf.function
def train_step(x_batch, y_batch, w_var, b_var, optimizer):
    with tf.GradientTape() as tape:
        preds = linear_model(x_batch, w_var, b_var)
        loss = mse_loss(y_batch, preds)
    grads = tape.gradient(loss, [w_var, b_var])
    optimizer.apply_gradients(zip(grads, [w_var, b_var]))
    grad_norm = tf.sqrt(
        tf.add(tf.reduce_sum(tf.square(grads[0])), tf.reduce_sum(tf.square(grads[1])))
    )
    return loss, grad_norm

# Set number of epochs for the demonstration.
num_epochs = 20

# Run a short training loop for both learning rates.
for epoch in range(num_epochs):
    loss_low, grad_low = train_step(
        x_values, y_values, w_low, b_low, optimizer_low
    )
    loss_high, grad_high = train_step(
        x_values, y_values, w_high, b_high, optimizer_high
    )
    loss_history_low.append(float(loss_low.numpy()))
    loss_history_high.append(float(loss_high.numpy()))
    grad_norms_low.append(float(grad_low.numpy()))
    grad_norms_high.append(float(grad_high.numpy()))

# Import matplotlib for plotting learning curves.
import matplotlib.pyplot as plt

# Create a figure with two vertically stacked subplots.
fig, axes = plt.subplots(2, 1, figsize=(6, 6))

# Plot loss curves for both learning rates.
axes[0].plot(loss_history_low, label="low lr 0.01")

# Plot high learning rate loss curve for comparison.
axes[0].plot(loss_history_high, label="high lr 0.5")

# Add labels and legend to the loss subplot.
axes[0].set_ylabel("MSE loss")

# Add a simple title describing the loss behavior.
axes[0].set_title("Loss versus epoch for two learning rates")

# Show legend to distinguish the two curves.
axes[0].legend()

# Plot gradient norm curves for both learning rates.
axes[1].plot(grad_norms_low, label="low lr grad norm")

# Plot gradient norms for the high learning rate model.
axes[1].plot(grad_norms_high, label="high lr grad norm")

# Label the y axis for gradient norms.
axes[1].set_ylabel("Gradient norm")

# Label the x axis shared by both subplots.
axes[1].set_xlabel("Epoch")

# Add legend to the gradient norm subplot.
axes[1].legend()

# Adjust layout to prevent overlapping labels.
plt.tight_layout()

# Print a short summary connecting curves to dynamics.
print("Low lr shows smooth loss and stable gradients.")

# Print a second line describing high learning rate behavior.
print("High lr may oscillate with larger gradient norms.")

# Display the combined plot for visual inspection.
plt.show()



# <font color="#418FDE" size="6.5" uppercase>**Losses and Optimizers**</font>


In this lecture, you learned to:
- Select and configure appropriate loss functions for common supervised learning tasks in PyTorch. 
- Use optimizers such as SGD and Adam to update model parameters within a training loop. 
- Inspect and debug optimization behavior using learning rate settings, gradient norms, and loss curves. 

In the next lecture (Lecture B), we will go over 'Training Loop Design'.