# <font color="#418FDE" size="6.5" uppercase>**Losses and Optimizers**</font>

>Last update: 20260129.
    
By the end of this lecture, you will be able to:
- Select and configure appropriate loss functions for common supervised learning tasks in PyTorch. 
- Use optimizers such as SGD and Adam to update model parameters within a training loop. 
- Inspect and debug optimization behavior using learning rate settings, gradient norms, and loss curves. 


## **1. Choosing Loss Functions**

### **1.1. Regression and Classification**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_01_01.jpg?v=1769701742" width="250">



>* First decide whether the task is regression or classification
>* The two differ in output type and in how the loss is defined

>* Regression losses measure the distance between predictions and targets
>* Continuous outputs make the loss sensitive to target scale and outliers

>* Classification losses use probabilities over discrete classes
>* They must match the output layer and its probability transform (softmax or sigmoid)



### **1.2. MSE and L1 Losses**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_01_02.jpg?v=1769701760" width="250">



>* MSE squares errors, heavily penalizing large outliers
>* L1 uses absolute errors, making it more robust to outliers

>* MSE suits clean data and trains smoothly and quickly
>* L1 tolerates outliers and pulls predictions toward the median

>* Configure how per-sample losses are averaged, summed, or weighted across a batch
>* Combine MSE and L1 (as Huber-style losses do) for both robustness and smoothness
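
The point that L1 targets the median while MSE targets the mean can be checked with a constant predictor. The sketch below is a minimal illustration in plain Python. The target values, the candidate grid, and the helper names `mse_of_constant` and `l1_of_constant` are made up for this example.

```python
import statistics

# Hypothetical targets with one large outlier (illustrative values only).
targets = [1.0, 2.0, 3.0, 4.0, 100.0]

def mse_of_constant(c, ys):
    # Mean squared error when every prediction equals the constant c.
    return sum((c - y) ** 2 for y in ys) / len(ys)

def l1_of_constant(c, ys):
    # Mean absolute error when every prediction equals the constant c.
    return sum(abs(c - y) for y in ys) / len(ys)

# Search a grid of candidate constants from 0.0 to 100.0 in steps of 0.5.
candidates = [i * 0.5 for i in range(201)]
best_mse = min(candidates, key=lambda c: mse_of_constant(c, targets))
best_l1 = min(candidates, key=lambda c: l1_of_constant(c, targets))

# The MSE minimizer lands on the mean, the L1 minimizer on the median.
print("Best constant under MSE:", best_mse, "| mean:", statistics.mean(targets))
print("Best constant under L1:", best_l1, "| median:", statistics.median(targets))
```

The outlier drags the MSE-optimal constant far from most of the data, while the L1-optimal constant stays with the bulk of the targets.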



In [None]:
#@title Python Code - MSE and L1 Losses

# This script compares MSE and L1 losses.
# It uses tiny synthetic regression data.
# It prints simple values for clear intuition.

# Optional install for plotting library if missing.
# !pip install matplotlib

# Import required standard libraries.
import math
import random
import statistics

# Import matplotlib for a small plot.
import matplotlib.pyplot as plt

# Set deterministic random seeds for reproducibility.
random.seed(42)

# Create a tiny synthetic regression dataset.
true_weights = 2.0
true_bias = 1.0

# Generate simple input features and targets.
xs = [float(x) for x in range(-3, 4)]
ys = [true_weights * x + true_bias for x in xs]

# Add one strong outlier to the targets.
ys_noisy = ys.copy()
ys_noisy[-1] = ys_noisy[-1] + 10.0

# Define a helper to compute MSE loss.
def mse_loss(preds, targets):
    assert len(preds) == len(targets)
    squared_errors = [(p - t) ** 2 for p, t in zip(preds, targets)]
    return sum(squared_errors) / len(squared_errors)

# Define a helper to compute L1 loss.
def l1_loss(preds, targets):
    assert len(preds) == len(targets)
    abs_errors = [abs(p - t) for p, t in zip(preds, targets)]
    return sum(abs_errors) / len(abs_errors)

# Build two simple prediction sets for comparison.
preds_perfect = ys
preds_shifted = [y + 1.5 for y in ys]

# Compute losses on clean data without outlier.
mse_clean = mse_loss(preds_shifted, ys)
l1_clean = l1_loss(preds_shifted, ys)

# Compute losses on noisy data with outlier.
mse_noisy = mse_loss(preds_shifted, ys_noisy)
l1_noisy = l1_loss(preds_shifted, ys_noisy)

# Print a short summary of loss values.
print("Clean MSE:", round(mse_clean, 3))
print("Clean L1:", round(l1_clean, 3))
print("Noisy MSE:", round(mse_noisy, 3))
print("Noisy L1:", round(l1_noisy, 3))

# Prepare values for a tiny bar plot.
labels = ["MSE clean", "L1 clean", "MSE noisy", "L1 noisy"]
values = [mse_clean, l1_clean, mse_noisy, l1_noisy]

# Create a simple bar chart to visualize sensitivity.
plt.figure(figsize=(6, 4))
plt.bar(labels, values, color=["C0", "C1", "C0", "C1"])

# Add basic labels and title for clarity.
plt.ylabel("Average loss value")
plt.title("MSE reacts more to the outlier than L1")

# Display the plot to compare both losses.
plt.tight_layout()
plt.show()
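
The hand-written helpers above have direct counterparts in PyTorch. The sketch below is a minimal illustration, assuming PyTorch is installed. The tensor values are made up, and `nn.HuberLoss` is shown only as one common way to blend MSE-like and L1-like behavior. The cell above does not use it.

```python
import torch
import torch.nn as nn

# Illustrative predictions and targets; the last target acts as an outlier.
preds = torch.tensor([1.5, 2.5, 3.5, 4.5])
targets = torch.tensor([0.0, 1.0, 2.0, 13.0])

# The reduction argument controls how per-sample losses are combined.
mse_mean = nn.MSELoss(reduction="mean")(preds, targets)
mse_sum = nn.MSELoss(reduction="sum")(preds, targets)
mse_each = nn.MSELoss(reduction="none")(preds, targets)

# L1 loss averages absolute errors by default.
l1_mean = nn.L1Loss()(preds, targets)

# Huber loss is quadratic for small errors and linear for large ones.
huber = nn.HuberLoss(delta=1.0)(preds, targets)

print("MSE mean:", mse_mean.item())
print("MSE sum:", mse_sum.item())
print("MSE per sample:", mse_each.tolist())
print("L1 mean:", l1_mean.item())
print("Huber:", huber.item())
```

With `reduction="none"` the per-sample losses stay available, which is one way to apply custom weights before averaging.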



### **1.3. CrossEntropy and BCE**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_01_03.jpg?v=1769701791" width="250">



>* Cross entropy handles single-label multi-class classification
>* BCE handles binary and multi-label yes/no tasks

>* Cross entropy scores multi-class predictions as probabilities
>* PyTorch's `CrossEntropyLoss` takes raw logits and integer class labels

>* BCE handles independent yes or no predictions
>* Pair a sigmoid with float targets in [0, 1]; `BCEWithLogitsLoss` applies the sigmoid internally
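
As a minimal PyTorch sketch of these points, assuming PyTorch is installed: `nn.CrossEntropyLoss` consumes raw logits with integer class indices, and `nn.BCEWithLogitsLoss` consumes raw logits with float targets. The tensor values are illustrative, and the manual checks are included only to show what each loss computes.

```python
import torch
import torch.nn as nn

# Multi-class: logits of shape (batch, classes), integer labels of shape (batch,).
logits_multi = torch.tensor([[2.0, 0.5, -1.0]])
labels_multi = torch.tensor([0])
ce_loss = nn.CrossEntropyLoss()(logits_multi, labels_multi)

# Manual check: cross entropy is the negative log-softmax at the true class.
manual_ce = -torch.log_softmax(logits_multi, dim=1)[0, 0]

# Binary or multi-label: one logit per decision, float targets of the same shape.
logits_binary = torch.tensor([[1.5]])
labels_binary = torch.tensor([[1.0]])
bce_from_logits = nn.BCEWithLogitsLoss()(logits_binary, labels_binary)

# Equivalent two-step form: apply sigmoid first, then plain BCELoss.
bce_from_probs = nn.BCELoss()(torch.sigmoid(logits_binary), labels_binary)

print("Cross entropy:", ce_loss.item(), "manual:", manual_ce.item())
print("BCE with logits:", bce_from_logits.item(), "BCE on probabilities:", bce_from_probs.item())
```

Passing logits directly is usually preferred because the loss can then combine the sigmoid or softmax with the logarithm in a numerically stable way.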



In [None]:
#@title Python Code - CrossEntropy and BCE

# This script compares cross entropy and BCE losses.
# It uses tiny tensors to keep things simple.
# Note: this cell uses TensorFlow; the PyTorch counterparts are
# nn.CrossEntropyLoss and nn.BCEWithLogitsLoss.
# Run it in Colab to see printed outputs.

# Uncomment if tensorflow is missing in your environment.
# !pip install tensorflow==2.20.0

# Import tensorflow and numpy for basic tensors.
import tensorflow as tf
import numpy as np

# Set deterministic random seeds for reproducibility.
tf.random.set_seed(0)
np.random.seed(0)

# Print tensorflow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create tiny logits for a three class example.
logits_multi = tf.constant([[2.0, 0.5, -1.0]], dtype=tf.float32)

# Create integer label for the correct class index.
labels_multi = tf.constant([0], dtype=tf.int32)

# Check shapes to ensure they are compatible.
print("Multi logits shape:", logits_multi.shape)

# Compute sparse categorical cross entropy from logits.
ce_loss = tf.keras.losses.sparse_categorical_crossentropy(
    labels_multi, logits_multi, from_logits=True
)

# Print the scalar cross entropy loss value.
print("Cross entropy loss value:", float(ce_loss.numpy()[0]))

# Convert logits to probabilities using softmax function.
probs_multi = tf.nn.softmax(logits_multi, axis=1)

# Print the resulting probability distribution values.
print("Softmax probabilities:", probs_multi.numpy())

# Create logits for a binary classification example.
logits_binary = tf.constant([[1.5]], dtype=tf.float32)

# Create binary label as float between zero and one.
labels_binary = tf.constant([[1.0]], dtype=tf.float32)

# Validate shapes for binary loss computation.
print("Binary logits shape:", logits_binary.shape)

# Compute binary cross entropy using logits directly.
bce_loss = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=labels_binary, logits=logits_binary
)

# Print the scalar binary cross entropy loss.
print("Binary cross entropy loss:", float(bce_loss.numpy()[0][0]))

# Convert binary logits to probabilities with sigmoid.
probs_binary = tf.nn.sigmoid(logits_binary)

# Print the predicted positive class probability.
print("Sigmoid probability:", float(probs_binary.numpy()[0][0]))

# Show both losses together for quick comparison.
print("CE and BCE losses:", float(ce_loss.numpy()[0]), float(bce_loss.numpy()[0][0]))



## **2. Training Optimizers**

### **2.1. Momentum Based SGD**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_02_01.jpg?v=1769701824" width="250">



>* Momentum stores a velocity from recent gradients
>* Updates follow consistent downhill directions, smoothing noise

>* Momentum smooths noisy updates and loss zigzags
>* Helps traverse narrow valleys faster, improving convergence

>* Momentum coefficient controls memory of past gradients
>* Tune learning rate and momentum together, guided by the loss curve
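
The velocity idea can be written out in a few lines of plain Python. This is a sketch under stated assumptions: the update `v = momentum * v + grad; w = w - lr * v` follows the form used by `torch.optim.SGD` with default dampening and no Nesterov, and the quadratic "narrow valley" function, starting point, and step count are made up for illustration.

```python
def loss(w):
    # f(x, y) = 0.5 * (x**2 + 25 * y**2): shallow along x, steep along y.
    x, y = w
    return 0.5 * (x ** 2 + 25.0 * y ** 2)

def grad(w):
    # Analytic gradient of the loss above.
    x, y = w
    return (x, 25.0 * y)

def run_sgd(lr, momentum, steps, start=(1.0, 1.0)):
    w = list(start)
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        for i in range(2):
            # Velocity accumulates past gradients; momentum=0 gives plain SGD.
            velocity[i] = momentum * velocity[i] + g[i]
            w[i] -= lr * velocity[i]
    return loss(w)

loss_plain = run_sgd(lr=0.02, momentum=0.0, steps=100)
loss_momentum = run_sgd(lr=0.02, momentum=0.9, steps=100)

print("Plain SGD final loss:", loss_plain)
print("Momentum SGD final loss:", loss_momentum)
```

With the same learning rate, the momentum run makes faster progress along the shallow direction, which is where plain SGD is slowest.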



In [None]:
#@title Python Code - Momentum Based SGD

# This script compares SGD with and without momentum.
# It trains a tiny model on synthetic regression data.
# Focus on optimizer behavior and loss trajectories.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import tensorflow and check version.
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)

# Set tensorflow random seed for reproducibility.
tf.random.set_seed(seed_value)

# Generate simple one dimensional regression data.
num_samples = 200
x = np.linspace(-1.0, 1.0, num_samples)

# Create noisy targets using a linear relationship.
true_w, true_b = 2.0, -0.5
y = true_w * x + true_b

# Add small gaussian noise to targets.
noise = 0.1 * np.random.randn(num_samples)
y_noisy = y + noise

# Reshape data for keras dense layer.
X = x.reshape(-1, 1).astype("float32")
Y = y_noisy.reshape(-1, 1).astype("float32")

# Validate shapes before building models.
assert X.shape[0] == Y.shape[0]

# Build a tiny sequential regression model.
def build_model():
    # Use an explicit Input layer and a single dense unit with linear activation.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1,)),
        tf.keras.layers.Dense(1)
    ])
    return model

# Create two models and copy the initial weights so the comparison is fair.
model_plain = build_model()
model_momentum = build_model()
model_momentum.set_weights(model_plain.get_weights())

# Define mean squared error loss function.
loss_fn = tf.keras.losses.MeanSquaredError()

# Configure plain SGD optimizer without momentum.
optimizer_plain = tf.keras.optimizers.SGD(
    learning_rate=0.1, momentum=0.0
)

# Configure SGD optimizer with momentum enabled.
optimizer_momentum = tf.keras.optimizers.SGD(
    learning_rate=0.1, momentum=0.9
)

# Prepare dataset as small batches for training.
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((X, Y))

# Shuffle the dataset with a fixed seed.
dataset = dataset.shuffle(buffer_size=num_samples, seed=seed_value)

# Batch and prefetch for efficient iteration.
dataset = dataset.batch(batch_size).prefetch(1)

# Training parameters for short demonstration.
num_epochs = 15

# Containers to store epoch losses for both optimizers.
plain_losses = []
momentum_losses = []

# Training loop comparing both optimizers side by side.
for epoch in range(num_epochs):
    # Reset epoch loss accumulators.
    epoch_loss_plain = 0.0
    epoch_loss_momentum = 0.0

    # Counter for number of processed batches.
    batch_count = 0

    # Iterate over mini batches from dataset.
    for batch_x, batch_y in dataset:
        batch_count += 1

        # Use gradient tape for plain optimizer.
        with tf.GradientTape() as tape_plain:
            preds_plain = model_plain(batch_x, training=True)
            loss_plain = loss_fn(batch_y, preds_plain)

        # Compute gradients for plain optimizer.
        grads_plain = tape_plain.gradient(
            loss_plain, model_plain.trainable_variables
        )

        # Apply gradients using plain SGD.
        optimizer_plain.apply_gradients(
            zip(grads_plain, model_plain.trainable_variables)
        )

        # Use gradient tape for momentum optimizer.
        with tf.GradientTape() as tape_mom:
            preds_mom = model_momentum(batch_x, training=True)
            loss_mom = loss_fn(batch_y, preds_mom)

        # Compute gradients for momentum optimizer.
        grads_mom = tape_mom.gradient(
            loss_mom, model_momentum.trainable_variables
        )

        # Apply gradients using momentum SGD.
        optimizer_momentum.apply_gradients(
            zip(grads_mom, model_momentum.trainable_variables)
        )

        # Accumulate batch losses for averaging.
        epoch_loss_plain += float(loss_plain)
        epoch_loss_momentum += float(loss_mom)

    # Compute mean loss over all batches.
    mean_plain = epoch_loss_plain / batch_count
    mean_mom = epoch_loss_momentum / batch_count

    # Store losses for later inspection.
    plain_losses.append(mean_plain)
    momentum_losses.append(mean_mom)

    # Print concise epoch summary for both optimizers.
    print(
        f"Epoch {epoch+1:02d} - SGD loss: {mean_plain:.4f}, "
        f"Momentum SGD loss: {mean_mom:.4f}"
    )

# Import matplotlib for a simple loss curve plot.
import matplotlib.pyplot as plt

# Create a small figure for loss comparison.
plt.figure(figsize=(6, 4))

# Plot plain SGD loss trajectory.
plt.plot(plain_losses, label="SGD no momentum")

# Plot momentum SGD loss trajectory.
plt.plot(momentum_losses, label="SGD with momentum")

# Label axes and add legend for clarity.
plt.xlabel("Epoch")
plt.ylabel("Mean training loss")

# Add title emphasizing momentum effect.
plt.title("Effect of momentum on SGD convergence")

# Show legend and tight layout for readability.
plt.legend(); plt.tight_layout(); plt.show()



### **2.2. Adaptive Optimizers Overview**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_02_02.jpg?v=1769701944" width="250">



>* Adaptive optimizers adjust each parameter’s learning rate
>* Helps deep networks where layers learn at different speeds

>* RMSProp scales learning rates using recent gradient sizes
>* Adam adds momentum, stabilizing and speeding training

>* The training loop stays the same with adaptive optimizers
>* Optimizer tracks gradient history, adds tunable hyperparameters
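
To make the per-parameter scaling concrete, here is a minimal scalar sketch of one Adam update in plain Python. The defaults mirror those of `torch.optim.Adam` (`lr=0.001`, betas `(0.9, 0.999)`, `eps=1e-8`). The helper name `adam_step`, the dictionary-based state, and the gradient values are hypothetical choices for this example.

```python
import math

def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update running averages of the gradient and the squared gradient.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # The step is scaled by the root of the squared-gradient average.
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

adam_updates = []
sgd_updates = []
for g in (0.01, 1.0, 100.0):
    state = {"t": 0, "m": 0.0, "v": 0.0}
    adam_updates.append(adam_step(0.0, g, state, lr=0.1))
    sgd_updates.append(0.0 - 0.1 * g)

print("Adam first-step updates:", adam_updates)
print("SGD first-step updates:", sgd_updates)
```

On the first step, Adam moves each parameter by roughly the learning rate regardless of gradient magnitude, while the SGD step grows in proportion to the gradient.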



In [None]:
#@title Python Code - Adaptive Optimizers Overview

# This script compares SGD and Adam optimizers.
# It shows how adaptive optimizers change updates.
# We use a tiny regression example in PyTorch.

# Install PyTorch if not already available.
# !pip install torch torchvision torchaudio

# Import required standard libraries.
import math
import random
import os

# Import torch and check availability.
import torch
import torch.nn as nn
import torch.optim as optim

# Set deterministic seeds for reproducibility.
random.seed(0)
torch.manual_seed(0)

# Select device based on GPU availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print framework version and selected device.
print("PyTorch version:", torch.__version__, "Device:", device)

# Create simple synthetic regression data.
true_w = 2.0
true_b = -1.0

# Generate small input tensor on CPU.
x = torch.linspace(-1.0, 1.0, steps=40).unsqueeze(1)

# Generate targets with a little noise.
noise = 0.1 * torch.randn_like(x)
y = true_w * x + true_b + noise

# Move data to selected device.
x = x.to(device)
y = y.to(device)

# Define a tiny linear regression model.
class TinyRegressor(nn.Module):
    # Initialize with one linear layer.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    # Define forward computation step.
    def forward(self, x):
        return self.linear(x)

# Create two identical models for fair comparison.
model_sgd = TinyRegressor().to(device)
model_adam = TinyRegressor().to(device)

# Copy parameters from SGD model to Adam model.
model_adam.load_state_dict(model_sgd.state_dict())

# Define mean squared error loss function.
criterion = nn.MSELoss()

# Configure SGD optimizer with fixed learning rate.
optimizer_sgd = optim.SGD(model_sgd.parameters(), lr=0.1)

# Configure Adam optimizer with default betas.
optimizer_adam = optim.Adam(model_adam.parameters(), lr=0.1)

# Helper function to run one training step.
def train_step(model, optimizer, x_batch, y_batch):
    # Set gradients to zero before backward.
    optimizer.zero_grad()
    # Forward pass through the model.
    preds = model(x_batch)
    # Compute loss between predictions and targets.
    loss = criterion(preds, y_batch)
    # Backpropagate gradients through the graph.
    loss.backward()
    # Compute gradient norm for monitoring.
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.detach().norm(2).item()
            total_norm += param_norm ** 2
    total_norm = math.sqrt(total_norm)
    # Update parameters using optimizer step.
    optimizer.step()
    # Return scalar loss and gradient norm.
    return loss.item(), total_norm

# Train both models for a few epochs.
num_epochs = 15

# Store history for selected epochs.
record_epochs = [1, 5, 10, 15]

# Print header for comparison table.
print("Epoch  Optim  Loss       GradNorm")

# Loop over epochs and train both optimizers.
for epoch in range(1, num_epochs + 1):
    # Run one step for SGD model.
    loss_sgd, grad_sgd = train_step(model_sgd, optimizer_sgd, x, y)
    # Run one step for Adam model.
    loss_adam, grad_adam = train_step(model_adam, optimizer_adam, x, y)
    # Print only selected epochs for clarity.
    if epoch in record_epochs:
        print(
            f"{epoch:5d}  SGD   {loss_sgd:8.4f}  {grad_sgd:8.4f}"
        )
        print(
            f"{epoch:5d}  Adam  {loss_adam:8.4f}  {grad_adam:8.4f}"
        )

# Show final learned parameters for both optimizers.
print("True w, b:", true_w, true_b)
print("SGD w, b:", model_sgd.linear.weight.item(), model_sgd.linear.bias.item())
print("Adam w, b:", model_adam.linear.weight.item(), model_adam.linear.bias.item())



### **2.3. Tuning Optimizer Settings**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_02_03.jpg?v=1769702036" width="250">



>* Learning rate size controls training speed and stability
>* Start with a default, then adjust based on the loss curve

>* Optimizer hyperparameters shape update behavior and stability
>* Adjust settings per task using experiments and feedback

>* Use loss curves and metrics to guide tuning
>* Run experiments, adjust schedules, build optimization intuition
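
A small learning rate sweep shows the speed and stability trade-off directly. The sketch below uses plain gradient descent on a one-dimensional quadratic in plain Python. The function, the learning rate grid, and the helper name `final_loss` are assumptions made for illustration.

```python
def final_loss(lr, steps=20, w0=1.0):
    # Gradient descent on f(w) = 2 * w**2, whose gradient is 4 * w.
    # For this function, updates are stable only when lr < 0.5.
    w = w0
    for _ in range(steps):
        w -= lr * 4.0 * w
    return 2.0 * w ** 2

initial_loss = 2.0 * 1.0 ** 2
learning_rates = [0.01, 0.1, 0.4, 0.6]
results = {lr: final_loss(lr) for lr in learning_rates}

print(f"initial loss = {initial_loss:.3e}")
for lr, value in results.items():
    print(f"lr={lr:<5} final loss={value:.3e}")
```

The smallest rate converges slowly, the middle rates converge quickly, and the largest rate diverges, which is the pattern to look for when reading real loss curves.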



In [None]:
#@title Python Code - Tuning Optimizer Settings

# This script shows basic optimizer tuning concepts.
# We compare SGD and Adam on a tiny regression task.
# Focus on learning rate and loss behavior over epochs.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import math

# Import TensorFlow and NumPy for modeling.
import tensorflow as tf
import numpy as np

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Select device based on GPU availability.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    device_name = "GPU"
else:
    device_name = "CPU"

# Print which device type is used.
print("Using device type:", device_name)

# Create a tiny synthetic regression dataset.
num_samples = 64
x_values = np.linspace(-1.0, 1.0, num_samples).astype("float32")
true_w, true_b = 2.0, -0.5

# Generate noisy targets for y = 2x - 0.5.
noise = 0.1 * np.random.randn(num_samples).astype("float32")
y_values = true_w * x_values + true_b + noise

# Reshape features and targets to column vectors.
x_train = x_values.reshape(-1, 1)
y_train = y_values.reshape(-1, 1)

# Validate shapes before building models.
assert x_train.shape == (num_samples, 1)
assert y_train.shape == (num_samples, 1)

# Build a simple one layer regression model.
def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1,)),
        tf.keras.layers.Dense(1)
    ])
    return model

# Create two models and copy the initial weights so the comparison is fair.
model_sgd = build_model()
model_adam = build_model()
model_adam.set_weights(model_sgd.get_weights())

# Define mean squared error loss function.
loss_fn = tf.keras.losses.MeanSquaredError()

# Configure SGD optimizer with moderate learning rate.
optimizer_sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.0)

# Configure Adam optimizer with smaller learning rate.
optimizer_adam = tf.keras.optimizers.Adam(learning_rate=0.05)

# Prepare lists to store loss history for each optimizer.
sgd_losses = []
adam_losses = []

# Define a small number of training epochs.
num_epochs = 20
batch_size = 16

# Helper function to compute gradient norm safely.
def gradient_norm(gradients):
    squared_sum = 0.0
    for g in gradients:
        if g is not None:
            squared_sum += tf.reduce_sum(tf.square(g))
    return float(tf.sqrt(squared_sum))

# Training loop comparing SGD and Adam side by side.
for epoch in range(num_epochs):
    indices = np.arange(num_samples)
    np.random.shuffle(indices)

    # Shuffle data for this epoch.
    x_shuffled = x_train[indices]
    y_shuffled = y_train[indices]

    # Iterate over mini batches.
    for start in range(0, num_samples, batch_size):
        end = start + batch_size
        xb = x_shuffled[start:end]
        yb = y_shuffled[start:end]

        # One training step for SGD model.
        with tf.GradientTape() as tape_sgd:
            preds_sgd = model_sgd(xb, training=True)
            loss_sgd = loss_fn(yb, preds_sgd)
        grads_sgd = tape_sgd.gradient(loss_sgd, model_sgd.trainable_variables)
        optimizer_sgd.apply_gradients(zip(grads_sgd, model_sgd.trainable_variables))

        # One training step for Adam model.
        with tf.GradientTape() as tape_adam:
            preds_adam = model_adam(xb, training=True)
            loss_adam = loss_fn(yb, preds_adam)
        grads_adam = tape_adam.gradient(loss_adam, model_adam.trainable_variables)
        optimizer_adam.apply_gradients(zip(grads_adam, model_adam.trainable_variables))

    # Compute full batch loss after epoch.
    full_preds_sgd = model_sgd(x_train, training=False)
    full_preds_adam = model_adam(x_train, training=False)

    # Calculate losses for logging and analysis.
    epoch_loss_sgd = float(loss_fn(y_train, full_preds_sgd))
    epoch_loss_adam = float(loss_fn(y_train, full_preds_adam))

    # Store losses for later inspection.
    sgd_losses.append(epoch_loss_sgd)
    adam_losses.append(epoch_loss_adam)

    # Compute gradient norms on full batch for insight.
    with tf.GradientTape() as tape_sgd_full:
        preds_sgd_full = model_sgd(x_train, training=True)
        loss_sgd_full = loss_fn(y_train, preds_sgd_full)
    grads_sgd_full = tape_sgd_full.gradient(
        loss_sgd_full, model_sgd.trainable_variables
    )

    with tf.GradientTape() as tape_adam_full:
        preds_adam_full = model_adam(x_train, training=True)
        loss_adam_full = loss_fn(y_train, preds_adam_full)
    grads_adam_full = tape_adam_full.gradient(
        loss_adam_full, model_adam.trainable_variables
    )

    # Measure gradient norms for both optimizers.
    sgd_grad_norm = gradient_norm(grads_sgd_full)
    adam_grad_norm = gradient_norm(grads_adam_full)

    # Print compact training summary for this epoch.
    print(
        f"Epoch {epoch+1:02d} | "
        f"SGD loss {epoch_loss_sgd:.4f}, grad {sgd_grad_norm:.4f} | "
        f"Adam loss {epoch_loss_adam:.4f}, grad {adam_grad_norm:.4f}"
    )

# Print final learned parameters for both optimizers.
sgd_w, sgd_b = model_sgd.layers[0].get_weights()
adam_w, adam_b = model_adam.layers[0].get_weights()

# Show how close each optimizer came to true parameters.
print("True w, b:", true_w, true_b)
print("SGD w, b:", float(sgd_w[0][0]), float(sgd_b[0]))
print("Adam w, b:", float(adam_w[0][0]), float(adam_b[0]))




## **3. Monitoring Optimization Progress**

### **3.1. Tracking Loss Trends**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_03_01.jpg?v=1769702168" width="250">



>* Loss curves show if the model learns
>* Unusual loss patterns signal optimization setup problems

>* Compare training and validation loss to assess generalization
>* Use curves to spot overfitting and underfitting, and to tune hyperparameters

>* Loss curves reveal subtle learning rate problems
>* Logging and comparing curves guides optimization debugging
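
Reading curves can be partly automated. The sketch below is one possible heuristic in plain Python, not a standard API. The helper names `best_epoch` and `looks_overfit`, the `patience` threshold, and the loss values are all hypothetical and chosen only to illustrate the pattern of falling training loss with rising validation loss.

```python
def best_epoch(val_losses):
    # Index of the epoch with the lowest validation loss.
    return min(range(len(val_losses)), key=val_losses.__getitem__)

def looks_overfit(train_losses, val_losses, patience=3):
    # Flag overfitting when validation loss has not improved for `patience`
    # epochs while training loss has continued to fall since the best epoch.
    best = best_epoch(val_losses)
    stalled = (len(val_losses) - 1 - best) >= patience
    train_still_falling = train_losses[-1] < train_losses[best]
    return stalled and train_still_falling

# Synthetic curves for illustration only (not measured results).
train_curve = [1.00, 0.60, 0.40, 0.30, 0.22, 0.17, 0.13]
val_curve = [1.10, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]

print("Best validation epoch (0-based):", best_epoch(val_curve))
print("Looks overfit:", looks_overfit(train_curve, val_curve))
```

A check like this is a starting point for early stopping decisions; the curves themselves should still be plotted and inspected.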



In [None]:
#@title Python Code - Tracking Loss Trends

# This script shows how to track loss trends.
# We use TensorFlow to simulate a training loop.
# Focus on plotting simple training and validation losses.

# !pip install tensorflow

# Import required standard libraries.
import os
import random
import math

# Import TensorFlow and NumPy.
import tensorflow as tf
import numpy as np

# Set deterministic random seeds.
random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)

# Print TensorFlow version once.
print("TensorFlow version:", tf.__version__)

# Choose device based on GPU availability.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    device_name = "/GPU:0"
else:
    device_name = "/CPU:0"

# Briefly report selected device.
print("Using device:", device_name)

# Create a small synthetic regression dataset.
num_samples = 256
x_data = np.linspace(-2.0, 2.0, num_samples).astype("float32")

# Generate targets with a simple nonlinear pattern.
y_true = (0.5 * x_data ** 3 - 0.3 * x_data).astype("float32")

# Add small Gaussian noise.
noise = 0.05 * np.random.randn(num_samples).astype("float32")
y_data = y_true + noise

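# Shuffle indices so both splits cover the full input range.
# A contiguous split of the sorted inputs would make validation pure extrapolation.
indices = np.random.permutation(num_samples)
train_size = 200

# Prepare training subset.
x_train = x_data[indices[:train_size]]
y_train = y_data[indices[:train_size]]

# Prepare validation subset.
x_val = x_data[indices[train_size:]]
y_val = y_data[indices[train_size:]]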

# Validate shapes before building model.
assert x_train.shape[0] == y_train.shape[0]
assert x_val.shape[0] == y_val.shape[0]

# Build a tiny regression model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,)),
    tf.keras.layers.Dense(8, activation="tanh"),
    tf.keras.layers.Dense(1)
])

# Choose mean squared error loss.
loss_fn = tf.keras.losses.MeanSquaredError()

# Use Adam optimizer with moderate learning rate.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)

# Prepare lists to store loss history.
train_losses = []
val_losses = []

# Prepare list to store gradient norms.
grad_norms = []

# Define batch size and epochs.
batch_size = 32
epochs = 20

# Create TensorFlow datasets for batching.
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(buffer_size=train_size, seed=0)

# Batch the training dataset.
train_ds = train_ds.batch(batch_size)

# Use context manager for selected device.
with tf.device(device_name):
    for epoch in range(epochs):
        epoch_losses = []
        epoch_grad_norms = []

        # Iterate over mini batches.
        for batch_x, batch_y in train_ds:
            batch_x = tf.reshape(batch_x, (-1, 1))
            batch_y = tf.reshape(batch_y, (-1, 1))

            # Record gradients with GradientTape.
            with tf.GradientTape() as tape:
                preds = model(batch_x, training=True)
                loss_value = loss_fn(batch_y, preds)

            # Compute gradients of loss w.r.t parameters.
            grads = tape.gradient(loss_value, model.trainable_variables)

            # Apply gradients using optimizer.
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

            # Store batch loss for averaging.
            epoch_losses.append(float(loss_value.numpy()))

            # Compute gradient norm for monitoring.
            squared_sum = 0.0
            for g in grads:
                if g is not None:
                    squared_sum += float(tf.reduce_sum(g ** 2).numpy())

            # Take square root for L2 norm.
            grad_norm = math.sqrt(squared_sum)
            epoch_grad_norms.append(grad_norm)

        # Compute mean training loss for epoch.
        mean_train_loss = float(np.mean(epoch_losses))
        train_losses.append(mean_train_loss)

        # Compute mean gradient norm for epoch.
        mean_grad_norm = float(np.mean(epoch_grad_norms))
        grad_norms.append(mean_grad_norm)

        # Evaluate validation loss without gradient tracking.
        x_val_tensor = tf.reshape(x_val, (-1, 1))
        y_val_tensor = tf.reshape(y_val, (-1, 1))
        val_preds = model(x_val_tensor, training=False)
        val_loss_value = loss_fn(y_val_tensor, val_preds)

        # Store validation loss.
        val_losses.append(float(val_loss_value.numpy()))

# Import matplotlib for plotting.
import matplotlib.pyplot as plt

# Create a figure with two subplots.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Plot training and validation loss trends.
axes[0].plot(range(1, epochs + 1), train_losses, label="train")
axes[0].plot(range(1, epochs + 1), val_losses, label="val")

# Label the first subplot clearly.
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Loss")
axes[0].set_title("Loss trends over epochs")
axes[0].legend()

# Plot gradient norm trend.
axes[1].plot(range(1, epochs + 1), grad_norms, label="grad_norm")

# Label the second subplot clearly.
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Gradient L2 norm")
axes[1].set_title("Gradient norm over epochs")
axes[1].legend()

# Adjust layout for readability.
plt.tight_layout()

# Print a short numeric summary.
print("Final train loss:", round(train_losses[-1], 4))
print("Final val loss:", round(val_losses[-1], 4))
print("Final grad norm:", round(grad_norms[-1], 4))

# Display the plots to visually inspect trends.
plt.show()



### **3.2. Gradient Norm Monitoring**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_03_02.jpg?v=1769702282" width="250">



>* Gradient norms summarize overall gradient strength during training
>* Tracking them reveals unstable, vanishing, or exploding updates

>* Small or huge norms signal learning problems
>* Plot norms with loss to spot issues

>* Use gradient norms to choose tuning actions
>* Build intuition to prevent failures and speed experiments
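
In PyTorch, the global gradient norm can be computed by hand or read from `torch.nn.utils.clip_grad_norm_`, which returns the total norm measured before clipping. The sketch below is a minimal illustration, assuming PyTorch is installed. The model, the random data, the helper name `global_grad_norm`, and the clipping threshold are made up for this example.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny model and random data, used only to produce some gradients.
model = nn.Linear(3, 1)
x = torch.randn(16, 3)
y = torch.randn(16, 1)
loss = nn.MSELoss()(model(x), y)
loss.backward()

def global_grad_norm(module):
    # L2 norm over all parameter gradients, treated as one long vector.
    total = 0.0
    for p in module.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return math.sqrt(total)

manual_norm = global_grad_norm(model)

# With an infinite threshold nothing is clipped; the call only reports the norm.
reported_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))

# Clip to half of the measured norm and measure again.
target = 0.5 * manual_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=target)
clipped_norm = global_grad_norm(model)

print("Manual norm:", manual_norm)
print("Reported norm:", reported_norm.item())
print("Norm after clipping:", clipped_norm, "target:", target)
```

Logging the returned norm at every step gives a gradient norm curve at no extra cost when clipping is already part of the training loop.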



In [None]:
#@title Python Code - Gradient Norm Monitoring

# This script shows gradient norm monitoring basics.
# It uses TensorFlow to simulate a tiny training.
# Focus on computing and printing simple gradient norms.

# Install TensorFlow only if missing in your environment.
# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import math

# Import TensorFlow and NumPy for computation.
import tensorflow as tf
import numpy as np

# Set deterministic random seeds for reproducibility.
seed_value = 42
random.seed(seed_value)

# Set NumPy and TensorFlow seeds deterministically.
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Select device string based on GPU availability.
physical_gpus = tf.config.list_physical_devices("GPU")

# Choose CPU if no GPU is detected.
if physical_gpus:
    device_name = "/GPU:0"
else:
    device_name = "/CPU:0"

# Create a tiny synthetic regression dataset.
num_samples = 64
input_dim = 3

# Generate random inputs from a standard normal distribution.
X = np.random.randn(num_samples, input_dim).astype(np.float32)

# Define true weights for synthetic targets (no bias term is used).
true_w = np.array([[2.0], [-1.0], [0.5]], dtype=np.float32)

# Compute targets with linear rule plus small noise.
y = X @ true_w + 0.1 * np.random.randn(num_samples, 1).astype(np.float32)

# Validate shapes before building the model.
assert X.shape == (num_samples, input_dim)

# Validate target shape for safety.
assert y.shape == (num_samples, 1)

# Build a tiny sequential model for regression.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(1)
])

# Choose mean squared error loss function.
loss_fn = tf.keras.losses.MeanSquaredError()

# Create an SGD optimizer with moderate learning rate.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

# Prepare lists to store loss and gradient norms.
loss_history = []
grad_norm_history = []

# Define a small number of training epochs.
num_epochs = 8

# Use selected device context for training loop.
with tf.device(device_name):
    for epoch in range(num_epochs):

        # Record operations for automatic differentiation.
        with tf.GradientTape() as tape:
            preds = model(X, training=True)

            # Compute scalar loss value for this batch.
            loss_value = loss_fn(y, preds)

        # Compute gradients of loss with respect to weights.
        grads = tape.gradient(loss_value, model.trainable_variables)

        # Filter out any None gradients defensively.
        valid_grads = [g for g in grads if g is not None]

        # Compute global L2 norm of all gradients.
        squared_sums = [tf.reduce_sum(tf.square(g)) for g in valid_grads]

        # Add small epsilon to avoid numerical issues.
        total_squared = tf.add_n(squared_sums) + 1e-12

        # Take square root to obtain gradient L2 norm.
        grad_norm = tf.sqrt(total_squared)

        # Apply gradients to update model parameters.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Store scalar values for later inspection.
        loss_history.append(float(loss_value.numpy()))

        # Store gradient norm as Python float value.
        grad_norm_history.append(float(grad_norm.numpy()))

# Print a compact header for monitoring results.
print("Epoch | Loss | Gradient L2 Norm")

# Loop through history and print one summary line per epoch.
for epoch in range(num_epochs):
    loss_val = loss_history[epoch]

    # Retrieve corresponding gradient norm value.
    gnorm_val = grad_norm_history[epoch]

    # Print rounded values for readability.
    print(epoch + 1, "|", round(loss_val, 4), "|", round(gnorm_val, 4))
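
One way to read the printed history is to compare the first and last gradient norms alongside the loss trend. The helper below is a minimal, framework-free sketch of that idea. The name `summarize_run`, the `explode_ratio` threshold, and the example numbers are illustrative assumptions, not part of TensorFlow or of the script above.

```python
# Hypothetical helper: summarize a training run from its recorded histories.
# The threshold below is an illustrative assumption, not an established rule.
import math


def summarize_run(loss_history, grad_norm_history, explode_ratio=10.0):
    """Return a short diagnostic label for a recorded training run."""
    # Non-finite values mean the run has already diverged.
    if any(not math.isfinite(v) for v in loss_history + grad_norm_history):
        return "diverged"

    # A gradient norm that grows sharply suggests the step size is too large.
    if grad_norm_history[-1] > explode_ratio * grad_norm_history[0]:
        return "gradients growing"

    # A falling loss with a shrinking gradient norm is the healthy pattern.
    loss_fell = loss_history[-1] < loss_history[0]
    norm_fell = grad_norm_history[-1] < grad_norm_history[0]
    if loss_fell and norm_fell:
        return "converging"

    return "stalled or noisy"


# Example histories (made-up numbers for illustration only).
healthy = summarize_run([5.0, 2.0, 0.8, 0.3], [4.0, 2.5, 1.2, 0.5])
unstable = summarize_run([5.0, 9.0, 40.0, 300.0], [4.0, 9.0, 30.0, 120.0])
print(healthy)
print(unstable)
```

The same function could be called with the `loss_history` and `grad_norm_history` lists recorded above, since both are plain Python lists of floats.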




### **3.3. Learning Rate Dynamics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_A/image_03_03.jpg?v=1769702319" width="250">



>* Learning rate controls how fast parameters change
>* Watch loss behavior to judge healthy learning rates

>* Schedulers change learning rate during training
>* Visualizing rate and loss reveals exploration, refinement

>* Diverging loss suggests too-high rate, slow plateaus too-low
>* Adjust schedules to stabilize, refine, or stop training
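
The effect of the learning rate on loss behavior can be seen on a one-parameter problem without any framework. The sketch below runs plain gradient descent on f(w) = w² with three rates. The function name `run_gradient_descent` and the specific rates are illustrative assumptions chosen to show slow, healthy, and unstable behavior.

```python
# Minimal sketch: gradient descent on f(w) = w**2 with three learning rates.
# The function name and the chosen rates are illustrative assumptions.

def run_gradient_descent(learning_rate, steps=20, w_start=1.0):
    """Return the loss after each update for f(w) = w**2."""
    w = w_start
    losses = []
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # plain gradient descent update
        losses.append(w * w)
    return losses


too_small = run_gradient_descent(0.01)
reasonable = run_gradient_descent(0.4)
too_large = run_gradient_descent(1.1)

print("lr=0.01 final loss:", too_small[-1])
print("lr=0.4  final loss:", reasonable[-1])
print("lr=1.1  final loss:", too_large[-1])
```

Each update multiplies `w` by `1 - 2 * learning_rate`. With a rate of 0.01 that factor is 0.98, so the loss falls slowly. With 0.4 it is 0.2, so the loss collapses within a few steps. With 1.1 it is -1.2, so the loss grows on every step.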



In [None]:
#@title Python Code - Learning Rate Dynamics

# This script visualizes simple learning rate dynamics.
# It compares constant and decaying learning rates.
# Use it to connect curves with optimization behavior.

# Optional TensorFlow install for environments without it.
# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import math

# Import numpy for numeric operations.
import numpy as np

# Import matplotlib for plotting results.
import matplotlib.pyplot as plt

# Lower TensorFlow's C++ log verbosity, then import tensorflow.
os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")
import tensorflow as tf

# Set seeds for reproducible behavior.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create a tiny synthetic regression dataset.
true_w = 2.0
true_b = -1.0
x_data = np.linspace(-1.0, 1.0, 64).astype(np.float32)
noise = 0.1 * np.random.randn(64).astype(np.float32)
y_data = true_w * x_data + true_b + noise

# Validate shapes before creating tensors.
assert x_data.shape == y_data.shape

# Convert numpy arrays to TensorFlow tensors.
x_tensor = tf.convert_to_tensor(x_data.reshape(-1, 1))
y_tensor = tf.convert_to_tensor(y_data.reshape(-1, 1))

# Define a simple linear regression model.
class SimpleLinear(tf.Module):

    # Initialize trainable parameters.
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([1, 1]))
        self.b = tf.Variable(tf.zeros([1]))

    # Forward pass computing predictions.
    def __call__(self, x):
        return tf.matmul(x, self.w) + self.b

# Define mean squared error loss function.
def mse_loss(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))

# Training step using given optimizer and model.
@tf.function
def train_step(model, optimizer, x_batch, y_batch):
    with tf.GradientTape() as tape:
        preds = model(x_batch)
        loss = mse_loss(y_batch, preds)
    grads = tape.gradient(loss, [model.w, model.b])
    optimizer.apply_gradients(zip(grads, [model.w, model.b]))
    grad_norm = tf.sqrt(sum(tf.reduce_sum(g * g) for g in grads))
    return loss, grad_norm

# Helper to run training with a schedule.
def run_training(lr_schedule, label, epochs):
    model = SimpleLinear()
    losses = []
    grad_norms = []
    lrs = []
    # Create the optimizer once; building a new one every epoch
    # forces the tf.function train_step to retrace on each call.
    optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule(0))
    for epoch in range(epochs):
        lr = lr_schedule(epoch)
        # Update the learning rate in place for this epoch.
        optimizer.learning_rate = lr
        loss, grad_norm = train_step(model, optimizer, x_tensor, y_tensor)
        losses.append(float(loss.numpy()))
        grad_norms.append(float(grad_norm.numpy()))
        lrs.append(lr)
    return {
        "label": label,
        "losses": losses,
        "grad_norms": grad_norms,
        "lrs": lrs,
    }

# Define a constant learning rate schedule.
def constant_lr(epoch):
    return 0.1

# Define a simple exponential decay schedule.
def decaying_lr(epoch):
    return 0.1 * (0.9 ** epoch)

# Set a small number of epochs for clarity.
num_epochs = 25

# Run training with constant learning rate.
result_const = run_training(constant_lr, "constant", num_epochs)

# Run training with decaying learning rate.
result_decay = run_training(decaying_lr, "decay", num_epochs)

# Print a brief numeric summary for both schedules.
print("Final loss constant:", round(result_const["losses"][-1], 4))
print("Final loss decay:", round(result_decay["losses"][-1], 4))
print("First three learning rates constant:", result_const["lrs"][:3])
print("First three learning rates decay:", [round(v, 4) for v in result_decay["lrs"][:3]])
print("First three grad norms constant:", [round(v, 4) for v in result_const["grad_norms"][:3]])
print("First three grad norms decay:", [round(v, 4) for v in result_decay["grad_norms"][:3]])

# Create a figure with two subplots.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Plot loss curves for both learning rate schedules.
axes[0].plot(result_const["losses"], label="constant lr")
axes[0].plot(result_decay["losses"], label="decaying lr")
axes[0].set_title("Loss vs epoch")
axes[0].set_xlabel("epoch")
axes[0].set_ylabel("loss")
axes[0].legend()

# Plot learning rate values over epochs.
axes[1].plot(result_const["lrs"], label="constant lr")
axes[1].plot(result_decay["lrs"], label="decaying lr")
axes[1].set_title("Learning rate vs epoch")
axes[1].set_xlabel("epoch")
axes[1].set_ylabel("learning rate")
axes[1].legend()

# Adjust layout and display the single figure.
plt.tight_layout()
plt.show()
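
Exponential decay is only one possible schedule shape. The sketch below shows two other common ones, step decay and cosine annealing, written as plain functions that take an epoch index and return a rate. The function names, the base rate of 0.1, and the decay settings are illustrative assumptions rather than values taken from any library.

```python
# Sketch of two alternative schedules with the same epoch -> rate signature.
# Names, base rate, and decay settings are illustrative assumptions.
import math


def step_decay_lr(epoch, base_lr=0.1, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` every `epochs_per_drop` epochs."""
    return base_lr * (drop ** (epoch // epochs_per_drop))


def cosine_lr(epoch, base_lr=0.1, total_epochs=25, min_lr=0.0):
    """Anneal from base_lr toward min_lr along half a cosine wave."""
    # Clamp the epoch so progress stays within [0, 1].
    progress = min(epoch, total_epochs - 1) / (total_epochs - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the rate each schedule gives at the start, middle, and end.
for epoch in (0, 10, 24):
    print(epoch, round(step_decay_lr(epoch), 4), round(cosine_lr(epoch), 4))
```

Because both functions map an epoch index to a float, either could be passed to `run_training` in place of `decaying_lr` to compare the resulting loss curves.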



# <font color="#418FDE" size="6.5" uppercase>**Losses and Optimizers**</font>


In this lecture, you learned to:
- Select and configure appropriate loss functions for common supervised learning tasks in PyTorch. 
- Use optimizers such as SGD and Adam to update model parameters within a training loop. 
- Inspect and debug optimization behavior using learning rate settings, gradient norms, and loss curves. 

In the next Lecture (Lecture B), we will go over 'Training Loop Design'.