# <font color="#418FDE" size="6.5" uppercase>**Training Loop Design**</font>

>Last update: 20260129.
    
By the end of this lecture, you will be able to:
- Implement a standard PyTorch training loop that iterates over DataLoader batches and updates model parameters. 
- Add evaluation and metric computation to the training workflow without leaking gradients or mixing modes. 
- Handle device placement and optional mixed precision to improve performance while maintaining correctness. 


## **1. Training Epoch Loop**

### **1.1. Iterating Over Batches**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_B/image_01_01.jpg?v=1769704620" width="250">



>* Training uses small batches instead of whole dataset
>* Batches flow sequentially, refining parameters efficiently

>* DataLoader yields shuffled input-target batches each epoch
>* Loop runs batches, computes loss, updates model parameters

>* Shuffled, varied batches reduce ordering bias in what the model learns
>* Correct iteration visits every sample once per epoch and keeps runs reproducible
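
The batch rhythm described above can be sketched in PyTorch as follows. This is only a minimal illustration on synthetic data: the dataset size (20 samples), the batch size (8), and the seeds are assumptions chosen for the example, not values from the lecture.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data (assumed sizes): 20 samples, 3 features, 1 target each.
torch.manual_seed(0)
features = torch.randn(20, 3)
targets = torch.randn(20, 1)

# A seeded generator makes the shuffle order reproducible across runs.
generator = torch.Generator().manual_seed(0)
loader = DataLoader(
    TensorDataset(features, targets),
    batch_size=8,
    shuffle=True,
    generator=generator,
)

batch_sizes = []
for epoch in range(2):
    for batch_x, batch_y in loader:
        # Each iteration yields one (inputs, targets) pair of at most 8 rows.
        batch_sizes.append(batch_x.shape[0])

# 20 samples with batch_size=8 give batches of 8, 8, and 4 in every epoch.
print(batch_sizes)  # [8, 8, 4, 8, 8, 4]
```

The last batch of each epoch is smaller because 20 is not a multiple of 8; passing `drop_last=True` to the `DataLoader` would discard it instead.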



In [None]:
#@title Python Code - Iterating Over Batches

# This script shows iterating over training batches.
# We use TensorFlow to simulate a batch training loop.
# Focus on the rhythm of batches per epoch.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and check version.
import tensorflow as tf

# Set deterministic random seeds.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Choose device string based on GPU availability.
physical_gpus = tf.config.list_physical_devices("GPU")
use_gpu = bool(physical_gpus)
device_name = "GPU" if use_gpu else "CPU"

# Inform user which device is used.
print("Using device:", device_name)

# Create a tiny synthetic regression dataset.
num_samples = 64
num_features = 3
X = np.random.randn(num_samples, num_features).astype("float32")

# Create targets with a simple linear rule.
true_w = np.array([[2.0], [-1.0], [0.5]], dtype="float32")
y = X @ true_w + 0.1

# Validate shapes before building dataset.
assert X.shape == (num_samples, num_features)
assert y.shape == (num_samples, 1)

# Build a tf.data.Dataset and batch it.
batch_size = 8
dataset = tf.data.Dataset.from_tensor_slices((X, y))

# Shuffle slightly then batch for training.
dataset = dataset.shuffle(buffer_size=num_samples, seed=seed_value)
dataset = dataset.batch(batch_size)

# Define a simple dense regression model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(num_features,)),
    tf.keras.layers.Dense(1)
])

# Create optimizer and loss function.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.MeanSquaredError()

# Training parameters for one short epoch.
num_epochs = 1
print("Starting training for", num_epochs, "epoch.")

# Iterate over epochs and batches explicitly.
for epoch in range(num_epochs):
    epoch_loss = 0.0
    batch_count = 0

    # Loop over batches from the dataset.
    for step, (batch_x, batch_y) in enumerate(dataset):
        batch_count += 1

        # Validate batch shapes before forward pass.
        assert batch_x.shape[0] <= batch_size
        assert batch_y.shape[0] <= batch_size

        # Record operations for automatic differentiation.
        with tf.GradientTape() as tape:
            preds = model(batch_x, training=True)
            loss_value = loss_fn(batch_y, preds)

        # Compute gradients of loss with respect to weights.
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Accumulate loss for simple reporting.
        epoch_loss += float(loss_value.numpy())

        # Print a short message for the first few batches.
        if step < 2:
            print("Epoch", epoch + 1, "batch", step + 1, "loss", round(float(loss_value), 4))

    # Compute average loss over all batches.
    avg_loss = epoch_loss / max(batch_count, 1)
    print("Epoch", epoch + 1, "average loss", round(avg_loss, 4))




### **1.2. Forward Pass and Loss**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_B/image_01_02.jpg?v=1769704665" width="250">



>* Forward pass turns input batches into predictions
>* Layers run sequentially; output reflects learned understanding

>* Compute loss measuring predictions versus true targets
>* Loss is scalar, differentiable, usually batch-averaged

>* Treat the forward pass and loss computation as one step
>* Loss drives gradients, enabling iterative performance improvement
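
A PyTorch version of the same idea may help make "scalar, differentiable, batch-averaged" concrete. The sketch below assumes a toy batch of 8 samples with 20 features and 3 classes, and a small two-layer network invented for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed toy sizes: batch of 8, 20 features, 3 classes.
batch_x = torch.randn(8, 20)
batch_y = torch.randint(0, 3, (8,))

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 3))

# CrossEntropyLoss expects raw logits and averages over the batch by default.
criterion = nn.CrossEntropyLoss()

logits = model(batch_x)            # forward pass, shape (8, 3)
loss = criterion(logits, batch_y)  # zero-dimensional (scalar) tensor

# reduction="none" exposes the per-sample losses behind the batch average.
per_sample = nn.CrossEntropyLoss(reduction="none")(logits, batch_y)

print(logits.shape, loss.shape, loss.requires_grad)
print(torch.allclose(loss, per_sample.mean()))
```

Note that the model here outputs logits with no softmax layer, because `nn.CrossEntropyLoss` applies log-softmax internally.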



In [None]:
#@title Python Code - Forward Pass and Loss

# This script demonstrates forward pass and loss.
# It uses TensorFlow to mimic training logic.
# Focus on batches, predictions, and the scalar loss.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras utilities.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic random seeds.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Detect available device type.
physical_gpus = tf.config.list_physical_devices("GPU")
use_gpu = bool(physical_gpus)
device_type = "GPU" if use_gpu else "CPU"

# Print framework version and device.
print("TensorFlow version:", tf.__version__)
print("Using device type:", device_type)

# Create a tiny synthetic classification dataset.
num_samples = 64
num_features = 20
num_classes = 3

# Generate random feature data.
X = np.random.randn(num_samples, num_features).astype("float32")
Y = np.random.randint(num_classes, size=(num_samples,))

# Validate shapes before building dataset.
assert X.shape == (num_samples, num_features)
assert Y.shape == (num_samples,)

# Build a tf.data Dataset with small batch size.
batch_size = 8
dataset = tf.data.Dataset.from_tensor_slices((X, Y))

dataset = dataset.shuffle(buffer_size=num_samples, seed=seed_value)

dataset = dataset.batch(batch_size)

# Define a simple dense neural network model.
model = keras.Sequential([
    layers.Input(shape=(num_features,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

# Choose an optimizer and loss function.
optimizer = keras.optimizers.SGD(learning_rate=0.1)
loss_fn = keras.losses.SparseCategoricalCrossentropy()

# Run one training epoch focusing on forward pass.
print("\nRunning one epoch with manual loop...")

# Iterate over batches from the dataset.
for step, (batch_x, batch_y) in enumerate(dataset):
    # Validate batch shapes defensively.
    assert batch_x.shape[1] == num_features
    assert len(batch_y.shape) == 1

    # Record operations for automatic differentiation.
    with tf.GradientTape() as tape:
        # Forward pass: the softmax output layer returns class probabilities.
        probs = model(batch_x, training=True)

        # Compute scalar loss for current batch (the loss expects probabilities by default).
        loss_value = loss_fn(batch_y, probs)

    # Compute gradients of loss with respect to weights.
    grads = tape.gradient(loss_value, model.trainable_variables)

    # Apply gradients to update model parameters.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Print a short summary for first few steps.
    if step < 3:
        print(
            "Step", int(step),
            "batch_loss:", float(loss_value.numpy()),
        )

# Compute predictions on a single batch without gradients.
first_batch_x, first_batch_y = next(iter(dataset))

# Call the model with training=False; no GradientTape is active, so nothing is recorded.
pred_probs = model(first_batch_x, training=False)

# Compute evaluation loss on that batch.
eval_loss = loss_fn(first_batch_y, pred_probs).numpy()

# Print final evaluation loss for the batch.
print("\nEvaluation loss on first batch:", float(eval_loss))



### **1.3. Backward pass and update**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_B/image_01_03.jpg?v=1769704749" width="250">



>* Clear old gradients, then call `loss.backward()`
>* Backward pass computes each weight’s effect on loss

>* Optimizer uses gradients to adjust each parameter
>* Small weight changes accumulate, improving model performance

>* For every batch, run zero_grad, forward, backward, and optimizer step in that order
>* The correct sequence keeps gradients from accumulating and learning stable
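
To see why clearing gradients matters, the following sketch runs two backward passes without zeroing in between. The tiny linear model and the two data points are made up for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
x = torch.tensor([[1.0], [2.0]])
y = torch.tensor([[2.0], [4.0]])

# First backward pass: gradients are written into .grad.
criterion(model(x), y).backward()
first_grad = model.weight.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added on top.
criterion(model(x), y).backward()
accumulated_grad = model.weight.grad.clone()

# Clearing gradients first restores the single-pass value.
model.zero_grad()
criterion(model(x), y).backward()
fresh_grad = model.weight.grad.clone()

print(torch.allclose(accumulated_grad, 2 * first_grad))  # True
print(torch.allclose(fresh_grad, first_grad))            # True
```

Because the parameters do not change between the passes, the accumulated gradient is exactly twice the single-pass gradient, which is the stale-gradient effect that `optimizer.zero_grad()` prevents in a real loop.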



In [None]:
#@title Python Code - Backward pass and update

# This script shows a simple training loop.
# We focus on backward pass and updates.
# Run it in Colab to follow along.

# Install PyTorch if not already available.
# !pip install torch torchvision torchaudio

# Import required standard libraries.
import os
import random
import math

# Import torch and related submodules.
import torch
import torch.nn as nn
import torch.optim as optim

# Print PyTorch version in one short line.
print("PyTorch version:", torch.__version__)

# Set deterministic random seeds for reproducibility.
random.seed(0)

# Seed PyTorch's random number generator (NumPy is not used in this script).
torch.manual_seed(0)

# Select device based on GPU availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print which device will be used for training.
print("Using device:", device.type)

# Create a tiny synthetic regression dataset.
num_samples = 64

# Generate one input feature as a tensor of shape (num_samples, 1).
X = torch.linspace(-1.0, 1.0, num_samples).unsqueeze(1)

# Create targets using a simple linear relationship.
y = 3.0 * X + 0.5

# Move data to the selected device.
X = X.to(device)

y = y.to(device)

# Define a very small linear regression model.
model = nn.Linear(in_features=1, out_features=1)

# Move model parameters to the selected device.
model = model.to(device)

# Define mean squared error loss function.
criterion = nn.MSELoss()

# Create a simple stochastic gradient descent optimizer.
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Wrap data into a TensorDataset for DataLoader.
dataset = torch.utils.data.TensorDataset(X, y)

# Create DataLoader to iterate over mini batches.
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)

# Set number of epochs for the training loop.
num_epochs = 5

# Loop over epochs to repeatedly see the data.
for epoch in range(num_epochs):

    # Set model to training mode for this epoch.
    model.train()

    # Track running loss for simple monitoring.
    running_loss = 0.0

    # Iterate over mini batches from the DataLoader.
    for batch_inputs, batch_targets in loader:

        # Ensure batch tensors are on the correct device.
        batch_inputs = batch_inputs.to(device)

        # Move targets to the same device as inputs.
        batch_targets = batch_targets.to(device)

        # Validate shapes before forward computation.
        assert batch_inputs.shape[1] == 1

        # Clear old gradients so they do not accumulate.
        optimizer.zero_grad()

        # Forward pass to compute model predictions.
        predictions = model(batch_inputs)

        # Compute loss comparing predictions and targets.
        loss = criterion(predictions, batch_targets)

        # Backward pass computes gradients for all parameters.
        loss.backward()

        # Optimizer step updates parameters using gradients.
        optimizer.step()

        # Accumulate loss value for reporting later.
        running_loss += loss.item() * batch_inputs.size(0)

    # Compute average loss over all samples.
    epoch_loss = running_loss / num_samples

    # Print a short summary line for this epoch.
    print(f"Epoch {epoch + 1}, loss {epoch_loss:.4f}")

# Evaluate model on the full dataset without gradients.
with torch.no_grad():

    # Switch model to evaluation mode for inference.
    model.eval()

    # Compute predictions on the full dataset.
    preds = model(X)

    # Compute final mean squared error on dataset.
    final_loss = criterion(preds, y).item()

# Print final loss to confirm training effectiveness.
print("Final MSE on tiny dataset:", round(final_loss, 4))



## **2. Evaluation and Metrics**

### **2.1. Training vs Evaluation Modes**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_B/image_02_01.jpg?v=1769704788" width="250">



>* Training and evaluation modes behave differently by design
>* Wrong mode choice corrupts metrics and generalization

>* Switch between train and eval at boundaries
>* Eval mode gives stable, deployment-like, deterministic behavior

>* Metrics must reflect true inference-time behavior
>* Switch modes correctly to ensure fair, reliable evaluation
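
In PyTorch the mode switch is done with `model.train()` and `model.eval()`. The sketch below uses a small network with a dropout layer (an assumed architecture, chosen only because dropout behaves differently in the two modes) to show the effect.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5), nn.Linear(10, 2))
x = torch.randn(4, 10)

# Training mode: dropout samples a new random mask on every call.
model.train()
train_out_a = model(x)
train_out_b = model(x)

# Evaluation mode: dropout is a no-op, so repeated calls agree.
model.eval()
with torch.no_grad():
    eval_out_a = model(x)
    eval_out_b = model(x)

print("training flag after eval():", model.training)
print("eval outputs identical:", torch.equal(eval_out_a, eval_out_b))
print("train outputs identical:", torch.equal(train_out_a, train_out_b))
```

A common pattern is to call `model.train()` at the start of each training epoch and `model.eval()` right before validation, so the mode always matches the phase.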



In [None]:
#@title Python Code - Training vs Evaluation Modes

# This script shows training and evaluation modes.
# We use TensorFlow to mimic PyTorch behavior.
# Focus on metrics without mixing gradient modes.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras utilities.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Select device preferring GPU when available.
physical_gpus = tf.config.list_physical_devices("GPU")
use_gpu = bool(physical_gpus)
print("Using GPU:", use_gpu)

# Load a small subset of MNIST digits.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Reduce dataset size for quick demonstration.
train_samples = 2000
test_samples = 500
x_train = x_train[:train_samples]
y_train = y_train[:train_samples]

# Slice test data for faster evaluation.
x_test = x_test[:test_samples]
y_test = y_test[:test_samples]

# Normalize images to the range [0, 1].
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Add channel dimension for convolution layers.
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)

# Validate shapes before building model.
assert x_train.shape[0] == train_samples
assert x_test.shape[0] == test_samples

# Build a simple convolutional classification model.
# An explicit Input layer replaces the older input_shape argument.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# Compile model with optimizer loss and accuracy.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Train briefly with silent logs for speed.
history = model.fit(
    x_train,
    y_train,
    epochs=2,
    batch_size=64,
    verbose=0,
)

# Define a helper to compute accuracy manually.
def compute_accuracy(pred_probs, true_labels):
    # Convert probabilities to predicted classes.
    pred_classes = np.argmax(pred_probs, axis=1)
    # Ensure shapes match before comparison.
    assert pred_classes.shape == true_labels.shape
    # Compute mean accuracy as float.
    return float(np.mean(pred_classes == true_labels))

# Run the test set through the model with training behavior (dropout active).
# Keras has no model.train(); the `training` argument controls layer behavior.
train_mode_probs = model(x_test, training=True).numpy()
train_mode_acc = compute_accuracy(train_mode_probs, y_test)

# Run the same data with inference behavior (dropout disabled).
# Keras has no model.eval() either; pass training=False instead.
model_probs = model(x_test, training=False).numpy()
clean_eval_acc = compute_accuracy(model_probs, y_test)

# Print both accuracies to compare modes.
print("Accuracy with training behavior on validation:", round(train_mode_acc, 4))
print("Accuracy with proper evaluation mode:", round(clean_eval_acc, 4))

# Show that Keras evaluate uses evaluation behavior.
loss_eval, acc_eval = model.evaluate(
    x_test,
    y_test,
    verbose=0,
)
print("Keras evaluate accuracy (eval mode):", round(acc_eval, 4))



### **2.2. Safe Inference with no_grad**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_B/image_02_02.jpg?v=1769704889" width="250">



>* no_grad stops tracking graphs during evaluation
>* acts like read-only mode, saving memory

>* no_grad prevents wasted graphs and GPU memory
>* enables larger, more frequent, stable evaluations

>* Separates training from evaluation as pure observation
>* Skips gradient tracking, so no gradients are computed and no graph is built
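
One way to package these ideas is a reusable evaluation helper. The function below is a hypothetical sketch (the name `evaluate` and its signature are assumptions, not part of PyTorch): it switches to eval mode, disables gradient tracking, and restores the previous mode afterwards.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, criterion, device):
    """Return (mean loss, accuracy) over a loader without tracking gradients."""
    was_training = model.training
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)
            # Weight each batch loss by its size so a short last batch is not over-counted.
            total_loss += criterion(logits, labels).item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    model.train(was_training)  # restore the caller's mode
    return total_loss / seen, correct / seen


# Toy usage on random data (assumed sizes: 10 samples, 4 features, 3 classes).
torch.manual_seed(0)
device = torch.device("cpu")
model = nn.Linear(4, 3).to(device)
data = TensorDataset(torch.randn(10, 4), torch.randint(0, 3, (10,)))
loader = DataLoader(data, batch_size=4)
val_loss, val_acc = evaluate(model, loader, nn.CrossEntropyLoss(), device)
print(val_loss, val_acc)
```

Since no backward pass runs inside the helper, the parameters' `.grad` fields are left untouched by evaluation.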



In [None]:
#@title Python Code - Safe Inference with no_grad

# This script demonstrates safe inference with no_grad.
# We compare training and evaluation behaviors for gradients.
# Focus on evaluation metrics without leaking training information.

# !pip install torch torchvision

# Import required standard libraries.
import os
import random
import math

# Import torch and related submodules.
import torch
import torch.nn as nn
import torch.optim as optim

# Set deterministic random seeds for reproducibility.
random.seed(0)
torch.manual_seed(0)

# Detect device preference for computation.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print torch version and selected device.
print("Torch version:", torch.__version__, "Device:", device)

# Define a tiny model for classification.
class TinyNet(nn.Module):
    # Initialize a single linear layer (no activation is used).
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)

    # Define forward computation for inputs.
    def forward(self, x):
        return self.fc(x)

# Set simple problem dimensions.
input_dim = 4
num_classes = 3
batch_size = 8

# Create a tiny random training batch.
train_inputs = torch.randn(batch_size, input_dim)
train_labels = torch.randint(0, num_classes, (batch_size,))

# Create a tiny random validation batch.
val_inputs = torch.randn(batch_size, input_dim)
val_labels = torch.randint(0, num_classes, (batch_size,))

# Move tensors to the selected device.
train_inputs = train_inputs.to(device)
train_labels = train_labels.to(device)
val_inputs = val_inputs.to(device)
val_labels = val_labels.to(device)

# Initialize model, loss function, and optimizer.
model = TinyNet(input_dim, num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Perform one training step with gradients.
model.train()
optimizer.zero_grad()
train_logits = model(train_inputs)
train_loss = criterion(train_logits, train_labels)
train_loss.backward()
optimizer.step()

# Sum the per-parameter gradient norms to confirm that gradients exist.
grad_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        grad_norm += p.grad.norm().item()

# Print the training loss and the summed gradient norms.
print("Training loss at this step:", train_loss.item())
print("Sum of gradient norms after training step:", round(grad_norm, 4))

# Switch model to evaluation mode.
model.eval()

# Run validation without no_grad to show graph.
val_logits_graph = model(val_inputs)
val_loss_graph = criterion(val_logits_graph, val_labels)

# Confirm that validation loss requires gradients.
print("Validation loss requires grad (no context):", val_loss_graph.requires_grad)

# Compute accuracy from logits with graph.
val_preds_graph = val_logits_graph.argmax(dim=1)
val_acc_graph = (val_preds_graph == val_labels).float().mean().item()

# Print accuracy computed with gradient tracking.
print("Validation accuracy with graph tracking:", round(val_acc_graph, 3))

# Now run safe inference using no_grad context.
with torch.no_grad():
    val_logits_safe = model(val_inputs)
    val_loss_safe = criterion(val_logits_safe, val_labels)
    val_preds_safe = val_logits_safe.argmax(dim=1)

# Compute accuracy from predictions without gradients.
val_acc_safe = (val_preds_safe == val_labels).float().mean().item()

# Show that loss no longer requires gradients.
print("Validation loss requires grad (no_grad):", val_loss_safe.requires_grad)

# Print accuracy computed under no_grad context.
print("Validation accuracy with no_grad:", round(val_acc_safe, 3))

# Under no_grad no graph is built, which is what saves memory during evaluation.
print("Safe inference uses no_grad for lightweight evaluation.")



### **2.3. Accuracy and F1 Metrics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_B/image_02_03.jpg?v=1769704963" width="250">



>* Accuracy measures fraction of correctly classified examples
>* It’s simple but unreliable with imbalanced classes

>* Use precision, recall, F1 for imbalanced data
>* F1 balances missed positives and false alarms

>* Aggregate predictions before computing accuracy and F1
>* Use class-wise or averaged scores to guide decisions
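
A small worked example shows how accuracy can mislead on imbalanced data. The labels below are hypothetical (95 negatives, 5 positives), and the helper functions are plain-Python sketches written for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1 = 2PR / (P + R) for one class, returning 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
f1_pos = f1_for_class(y_true, y_pred, positive=1)
f1_neg = f1_for_class(y_true, y_pred, positive=0)
macro_f1 = (f1_pos + f1_neg) / 2

print(acc)                 # 0.95
print(f1_pos)              # 0.0
print(round(macro_f1, 4))  # 0.4872
```

The classifier never finds a positive example, yet its accuracy is 0.95. The per-class F1 of 0.0 and the macro F1 of about 0.49 expose the failure.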



In [None]:
#@title Python Code - Accuracy and F1 Metrics

# This script shows accuracy and F1 metrics.
# We use TensorFlow to simulate evaluation workflow.
# Focus is on safe metric computation without gradients.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras utilities.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Select device string based on GPU availability.
physical_gpus = tf.config.list_physical_devices("GPU")
use_gpu = bool(physical_gpus)
device_name = "GPU" if use_gpu else "CPU"

# Print which device will be used.
print("Using device:", device_name)

# Load MNIST dataset from Keras datasets.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Use a small subset for quick demonstration.
train_samples = 2000
test_samples = 500
x_train = x_train[:train_samples]
y_train = y_train[:train_samples]

# Slice test data subset for evaluation.
x_test = x_test[:test_samples]
y_test = y_test[:test_samples]

# Normalize images to range [0,1].
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Add channel dimension for convolutional layers.
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)

# Validate shapes before building model.
assert x_train.shape[0] == y_train.shape[0]
assert x_test.shape[0] == y_test.shape[0]

# Build a small CNN classification model.
# An explicit Input layer replaces the older input_shape argument.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# Compile model with optimizer and loss.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Train briefly with silent output for speed.
history = model.fit(
    x_train,
    y_train,
    epochs=2,
    batch_size=64,
    verbose=0,
)

# Create dataset for evaluation without shuffling.
val_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))
val_ds = val_ds.batch(64)

# Define function to compute accuracy from predictions.
def compute_accuracy(y_true, y_pred):
    correct = np.sum(y_true == y_pred)
    total = y_true.shape[0]
    return correct / float(total)

# Define function to compute precision, recall, F1.
def compute_f1(y_true, y_pred, num_classes):
    f1_scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / float(tp + fp + 1e-8)
        recall = tp / float(tp + fn + 1e-8)
        f1 = 2.0 * precision * recall / float(precision + recall + 1e-8)
        f1_scores.append(f1)
    macro_f1 = float(np.mean(f1_scores))
    return macro_f1

# Collect predictions batch by batch; no GradientTape is active, so no gradients are recorded.
all_preds = []
all_labels = []

# Call the model with training=False for inference behavior.
for batch_images, batch_labels in val_ds:
    probs = model(batch_images, training=False)
    batch_pred = tf.argmax(probs, axis=1)
    all_preds.append(batch_pred.numpy())
    all_labels.append(batch_labels.numpy())

# Concatenate all predictions and labels.
all_preds = np.concatenate(all_preds, axis=0)
all_labels = np.concatenate(all_labels, axis=0)

# Validate final concatenated shapes.
assert all_preds.shape == all_labels.shape

# Compute accuracy on full validation subset.
val_accuracy = compute_accuracy(all_labels, all_preds)

# Compute macro F1 across all classes.
num_classes = 10
val_macro_f1 = compute_f1(all_labels, all_preds, num_classes)

# Print concise metric summary lines.
print("Validation accuracy:", round(val_accuracy, 4))
print("Validation macro F1:", round(val_macro_f1, 4))
print("Difference F1 minus accuracy:", round(val_macro_f1 - val_accuracy, 4))



## **3. Devices and Mixed Precision**

### **3.1. Moving Data to CUDA**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_B/image_03_01.jpg?v=1769705063" width="250">



>* Keep model and data on same device
>* Move model once, move each batch before forward

>* Move every batch tensor onto the GPU
>* Consistent device placement prevents bugs and slowdowns

>* Minimize slow CPU–GPU transfers per batch
>* Keep computation on GPU; return results only when needed
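
In PyTorch these rules translate into a few `.to(device)` calls. The sketch below falls back to the CPU when CUDA is unavailable; the model, data sizes, and the choice to accumulate the loss on the device are assumptions made for the illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
model = nn.Linear(8, 1).to(device)  # move the parameters once
dataset = TensorDataset(torch.randn(32, 8), torch.randn(32, 1))
# Pinned host memory only helps host-to-GPU copies, so enable it only with CUDA.
loader = DataLoader(dataset, batch_size=16, pin_memory=(device.type == "cuda"))

running_loss = torch.zeros((), device=device)
with torch.no_grad():
    for batch_x, batch_y in loader:
        # Move every batch to the model's device before the forward pass.
        batch_x = batch_x.to(device, non_blocking=True)
        batch_y = batch_y.to(device, non_blocking=True)
        # Accumulate on the device; calling .item() here would force a sync per batch.
        running_loss += nn.functional.mse_loss(model(batch_x), batch_y)

mean_loss = running_loss.item() / len(loader)  # one transfer back to the CPU
param_device = next(model.parameters()).device
print(param_device.type, batch_x.device.type, mean_loss)
```

Keeping `running_loss` as a tensor on the device and calling `.item()` once at the end is one way to avoid a CPU-GPU synchronization on every batch.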



In [None]:
#@title Python Code - Moving Data to CUDA

# This script shows moving data to devices.
# It uses TensorFlow to mimic CUDA placement.
# Focus is on safe device aware training.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and check version.
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds everywhere.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Detect GPU availability for this session.
physical_gpus = tf.config.list_physical_devices("GPU")
use_gpu = bool(physical_gpus)
print("GPU available:", use_gpu)

# Choose device string based on availability.
if use_gpu:
    device_name = "/GPU:0"
else:
    device_name = "/CPU:0"
print("Using device:", device_name)

# Create a tiny synthetic dataset on CPU.
num_samples = 64
num_features = 8
x_cpu = np.random.randn(num_samples, num_features).astype("float32")

# Create simple linear targets on CPU as a column, matching the model output shape.
true_w = np.arange(1, num_features + 1, dtype="float32").reshape(-1, 1)
y_cpu = x_cpu @ true_w + 0.5

# Wrap numpy arrays as TensorFlow tensors.
x_tensor = tf.convert_to_tensor(x_cpu)
y_tensor = tf.convert_to_tensor(y_cpu)

# Validate shapes before training begins.
assert x_tensor.shape[0] == y_tensor.shape[0]
assert x_tensor.shape[1] == num_features

# Build a tiny model inside chosen device.
with tf.device(device_name):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_features,)),
        tf.keras.layers.Dense(1)
    ])

# Compile model with mean squared error loss.
model.compile(optimizer="adam", loss="mse")

# Mixed precision is left disabled in this script: a global policy set after
# the model is built would not change its layers. See the next section.

# Create a small tf.data.Dataset from tensors.
dataset = tf.data.Dataset.from_tensor_slices((x_tensor, y_tensor))

# Shuffle and batch the dataset safely.
dataset = dataset.shuffle(buffer_size=num_samples, seed=seed_value)
dataset = dataset.batch(16)

# Define one manual training step with device placement.
@tf.function
def train_step(batch_x, batch_y):
    with tf.device(device_name):
        with tf.GradientTape() as tape:
            preds = model(batch_x, training=True)
            loss = tf.reduce_mean(tf.keras.losses.mse(batch_y, preds))
        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Run a short training loop over few epochs.
num_epochs = 3
for epoch in range(num_epochs):
    epoch_losses = []
    for batch_x, batch_y in dataset:
        loss_value = train_step(batch_x, batch_y)
        epoch_losses.append(float(loss_value.numpy()))
    mean_loss = float(np.mean(epoch_losses))
    print("Epoch", epoch + 1, "mean loss:", round(mean_loss, 4))

# Move a small batch explicitly to chosen device.
first_batch = next(iter(dataset))
with tf.device(device_name):
    batch_x_device = tf.identity(first_batch[0])
    batch_y_device = tf.identity(first_batch[1])

# Show that tensors now live on the same device.
print("Batch x device:", batch_x_device.device)
print("Batch y device:", batch_y_device.device)
print("Model variables device:", model.trainable_variables[0].device if hasattr(model.trainable_variables[0], "device") else "N/A")



### **3.2. Automatic Mixed Precision**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_B/image_03_02.jpg?v=1769705165" width="250">



>* Runs many operations in faster low precision
>* Keeps sensitive parts in full precision with minimal code changes

>* AMP keeps numerically sensitive ops in full precision
>* Other ops use half precision for speed and memory savings

>* Monitor metrics; mixed precision can change convergence
>* Keep sensitive parts full precision for reliability
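
In PyTorch, this behavior is exposed through the `torch.autocast` context manager. The sketch below is a minimal illustration, assuming float16 on CUDA and bfloat16 on CPU; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption for this sketch: float16 on CUDA, bfloat16 on CPU.
amp_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

torch.manual_seed(0)
model = nn.Linear(16, 4).to(device_type)
x = torch.randn(8, 16, device=device_type)

with torch.autocast(device_type=device_type, dtype=amp_dtype):
    # Inside the context, the matrix multiply runs in the low-precision dtype.
    out = model(x)

print("parameter dtype:", model.weight.dtype)  # parameters stay torch.float32
print("output dtype:", out.dtype)              # matches amp_dtype
```

The parameters remain in float32 and only the computation inside the context is cast, which is why the surrounding training code needs few changes.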



In [None]:
#@title Python Code - Automatic Mixed Precision

# This script shows TensorFlow automatic mixed precision usage.
# It compares float32 and mixed precision training on MNIST.
# Focus is on devices and safe performance improvements.

# !pip install tensorflow

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras utilities.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version and device information.
print("TensorFlow version:", tf.__version__)
print("GPU available:", tf.config.list_physical_devices("GPU") != [])

# Load MNIST dataset from Keras datasets.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Use a small subset for quick demonstration.
x_train = x_train[:8000]
y_train = y_train[:8000]
x_test = x_test[:2000]
y_test = y_test[:2000]

# Normalize images to range [0,1].
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Add channel dimension for convolutional layers.
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)

# Validate shapes before building models.
assert x_train.shape[1:] == (28, 28, 1)
assert x_test.shape[1:] == (28, 28, 1)

# Define a simple CNN model builder function.
def build_model():
    inputs = keras.Input(shape=(28, 28, 1))
    x = layers.Conv2D(16, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
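    # Keep the softmax output in float32 for numerical stability under mixed precision.
    outputs = layers.Dense(10, activation="softmax", dtype="float32")(x)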
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


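# Reset the global policy so the baseline is really float32,
# even if an earlier cell enabled mixed precision.
tf.keras.mixed_precision.set_global_policy("float32")

# Create a float32 baseline model on default device.
baseline_model = build_model()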

# Compile baseline model with standard settings.
baseline_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Train baseline model quietly for few epochs.
history_fp32 = baseline_model.fit(
    x_train,
    y_train,
    validation_split=0.1,
    epochs=2,
    batch_size=64,
    verbose=0,
)

# Evaluate baseline model on test data.
loss_fp32, acc_fp32 = baseline_model.evaluate(
    x_test,
    y_test,
    verbose=0,
)

# Enable mixed precision only when a GPU is present; otherwise stay in float32.
# Note: this checks GPU availability, not whether the GPU has fast float16 units.
policy_name = "mixed_float16" if tf.config.list_physical_devices("GPU") else "float32"
policy = tf.keras.mixed_precision.Policy(policy_name)
tf.keras.mixed_precision.set_global_policy(policy)

# Build a new model under the current policy.
mp_model = build_model()

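# Wrap the optimizer for loss scaling only when the mixed precision policy is active.
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
if tf.keras.mixed_precision.global_policy().name == "mixed_float16":
    optimizer = tf.keras.mixed_precision.LossScaleOptimizer(optimizer)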

# Compile mixed precision model with same loss and metrics.
mp_model.compile(
    optimizer=optimizer,
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Train mixed precision model quietly for few epochs.
history_mp = mp_model.fit(
    x_train,
    y_train,
    validation_split=0.1,
    epochs=2,
    batch_size=64,
    verbose=0,
)

# Evaluate mixed precision model on test data.
loss_mp, acc_mp = mp_model.evaluate(
    x_test,
    y_test,
    verbose=0,
)

# Print concise comparison of results.
print("Baseline float32 test accuracy:", round(acc_fp32, 4))
print("Mixed precision test accuracy:", round(acc_mp, 4))
print("Baseline float32 test loss:", round(loss_fp32, 4))
print("Mixed precision test loss:", round(loss_mp, 4))




### **3.3. Stable GradScaler Usage**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_03/Lecture_B/image_03_03.jpg?v=1769705247" width="250">



>* Gradient scaler rescales loss to protect small gradients
>* Prevents underflow in half precision, keeping training stable

>* Scaler detects non-finite (inf/NaN) gradients, skips that step, and lowers the scale
>* Defaults usually give stable, efficient mixed-precision training

>* Apply the scaler on every training step; occasional skipped steps are expected
>* Track skip frequency; frequent skips signal a need to tune hyperparameters
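The underflow problem described in the bullets above can be reproduced without any deep learning framework. The following is a minimal sketch using only Python's standard library: `to_float16` is a hypothetical helper that rounds a value to half precision, and both `tiny_grad` and `loss_scale` are illustrative values chosen for this example, not numbers taken from a real training run.

```python
import struct


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Hypothetical tiny gradient, below float16's smallest subnormal (about 6e-8).
tiny_grad = 1e-8

# Illustrative scale factor; real scalers adjust this value dynamically.
loss_scale = 2.0 ** 15

# Without scaling, the gradient is flushed to zero in half precision.
unscaled_fp16 = to_float16(tiny_grad)

# Scaling the loss multiplies every gradient by the same factor (chain rule),
# which lifts the value into float16's representable range.
scaled_fp16 = to_float16(tiny_grad * loss_scale)

# Dividing by the scale in full precision recovers the gradient
# before the optimizer update is applied.
recovered = scaled_fp16 / loss_scale

print("float16 without scaling:", unscaled_fp16)
print("float16 with scaling:   ", scaled_fp16)
print("recovered gradient:     ", recovered)
```

The unscaled value is lost entirely, while the scaled value survives the round trip with only a small rounding error. This is the reason the unscaling step is done in float32 rather than in half precision.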



In [None]:
#@title Python Code - Stable GradScaler Usage

# This script shows stable loss-scaling usage.
# TensorFlow's LossScaleOptimizer plays the role of PyTorch's GradScaler here.
# Focus on safe scaling and clear device placement.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras components.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Choose device string based on GPU availability.
physical_gpus = tf.config.list_physical_devices("GPU")
use_gpu = bool(physical_gpus)
device_name = "/GPU:0" if use_gpu else "/CPU:0"

# Enable mixed precision policy when GPU is available.
if use_gpu:
    from tensorflow.keras import mixed_precision
    policy = mixed_precision.Policy("mixed_float16")
    mixed_precision.set_global_policy(policy)

# Create a tiny synthetic regression dataset.
num_samples = 256
num_features = 8
x_data = np.random.randn(num_samples, num_features).astype("float32")

# Create targets with a simple linear relationship.
true_w = np.arange(1, num_features + 1, dtype="float32")
true_b = 0.5
y_data = x_data @ true_w + true_b

# Add small noise for realism.
noise = 0.1 * np.random.randn(num_samples).astype("float32")
y_data = y_data + noise

# Validate shapes before building dataset.
assert x_data.shape == (num_samples, num_features)
assert y_data.shape == (num_samples,)

# Build a small tf.data.Dataset for iteration.
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((x_data, y_data))
dataset = dataset.shuffle(num_samples, seed=seed_value).batch(batch_size)

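# Define a simple sequential regression model.
# keras.Input replaces the deprecated input_shape argument on the first layer.
with tf.device(device_name):
    model = keras.Sequential([
        keras.Input(shape=(num_features,)),
        layers.Dense(16, activation="relu"),
        # Keep the regression output in float32 for stable loss values.
        layers.Dense(1, dtype="float32"),
    ])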

# Choose optimizer and loss function.
optimizer = keras.optimizers.Adam(learning_rate=0.01)
loss_fn = keras.losses.MeanSquaredError()

# Create a gradient scaler only when using mixed precision.
if use_gpu:
    scaler = mixed_precision.LossScaleOptimizer(optimizer)
else:
    scaler = None

# Define one training step with optional scaling.
@tf.function
def train_step(inputs, targets):
    # Ensure shapes are as expected.
    tf.debugging.assert_shapes([(inputs, (None, num_features)), (targets, (None,))])

    # Use mixed precision path when scaler exists.
    if scaler is not None:
        with tf.GradientTape() as tape:
            predictions = model(inputs, training=True)
            # Squeeze only the last axis so a batch of size 1 keeps its batch dimension.
            loss = loss_fn(targets, tf.squeeze(predictions, axis=-1))
            # Keras 3 API (bundled with TensorFlow 2.16+): scale the loss
            # so small float16 gradients do not underflow.
            scaled_loss = scaler.scale_loss(loss)
        scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
        # The LossScaleOptimizer unscales the gradients internally and
        # skips the update when it finds non-finite values.
        scaler.apply_gradients(zip(scaled_grads, model.trainable_variables))
        return loss
    else:
        with tf.GradientTape() as tape:
            predictions = model(inputs, training=True)
            loss = loss_fn(targets, tf.squeeze(predictions, axis=-1))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

# Define a simple evaluation step without gradient tracking.
@tf.function
def eval_step(inputs, targets):
    predictions = model(inputs, training=False)
    loss = loss_fn(targets, tf.squeeze(predictions, axis=-1))
    return loss

# Run a short training loop with clear logging.
num_epochs = 3
for epoch in range(num_epochs):
    epoch_losses = []
    for batch_inputs, batch_targets in dataset:
        loss_value = train_step(batch_inputs, batch_targets)
        epoch_losses.append(loss_value)

    # Compute mean loss for the epoch.
    mean_loss = tf.reduce_mean(epoch_losses)

    # Evaluate once per epoch (on the training data here, since this demo has no held-out split).
    eval_loss = eval_step(x_data, y_data)

    # Print concise information about training stability.
    print(
        f"Epoch {epoch + 1}: train_loss={float(mean_loss):.4f}, "
        f"eval_loss={float(eval_loss):.4f}"
    )
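The loop above logs losses only, so it does not show how often the scaler skips a step. The following framework-free sketch illustrates the dynamic scaling policy and the skip-frequency tracking mentioned in the bullets. `DynamicLossScale` is a hypothetical toy class, not a TensorFlow or PyTorch API, and its `growth_interval` is shortened to 4 so the behavior is visible; real scalers typically wait on the order of thousands of clean steps before growing the scale.

```python
import math


class DynamicLossScale:
    """Toy model of a dynamic loss scale (hypothetical, not a library class)."""

    def __init__(self, initial_scale=2.0 ** 15, growth_interval=4):
        self.scale = initial_scale
        self.growth_interval = growth_interval
        self.good_steps = 0
        self.skipped_steps = 0
        self.total_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None when the step must be skipped."""
        self.total_steps += 1
        if not all(math.isfinite(g) for g in scaled_grads):
            # Overflow detected: skip the update and halve the scale.
            self.scale /= 2.0
            self.good_steps = 0
            self.skipped_steps += 1
            return None
        unscaled = [g / self.scale for g in scaled_grads]
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # A run of finite steps: try a larger scale again.
            self.scale *= 2.0
            self.good_steps = 0
        return unscaled

    @property
    def skip_rate(self):
        """Fraction of steps skipped so far."""
        return self.skipped_steps / max(self.total_steps, 1)


toy_scaler = DynamicLossScale()

# Hypothetical scaled gradients; the second batch overflowed to infinity.
batches = [
    [0.5, -1.0],
    [float("inf"), 2.0],
    [0.25, 0.75],
    [1.0, 1.0],
    [0.1, 0.2],
    [0.3, 0.4],
]

for grads in batches:
    result = toy_scaler.step(grads)
    status = "skipped" if result is None else "applied"
    print(f"{status:8s} scale={toy_scaler.scale:.0f}")

print(f"skip rate: {toy_scaler.skip_rate:.2f}")
```

A skip rate that stays near zero after the first few steps is the normal case. A persistently high rate suggests the loss or learning rate is producing overflows that halving the scale cannot fix.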




# <font color="#418FDE" size="6.5" uppercase>**Training Loop Design**</font>


In this lecture, you learned to:
- Implement a standard PyTorch training loop that iterates over DataLoader batches and updates model parameters. 
- Add evaluation and metric computation to the training workflow without leaking gradients or mixing modes. 
- Handle device placement and optional mixed precision to improve performance while maintaining correctness. 

In the next Module (Module 4), we will go over 'Data and Dataloaders'.