3. Explore the following optimization techniques that can be applied during training, and document your findings with basic example code snippets: (5 points)

- Tensor Creation (CPU vs GPU)
- Weight Initialization
- Activation Checkpointing
- Gradient Accumulation
- Mixed Precision Training

3.1 Tensor Creation (CPU vs GPU)

In [None]:
# Import Required Libraries
import pandas as pd
import torch
import time

# The dataset has 8 features and 1 target column
df = pd.read_csv('diabetes.csv', header=None)

# Separate features (X) and target (y)
X = df.iloc[:, :-1].values   # Take all columns except the last one as features
y = df.iloc[:, -1].values    # Take the last column as target variable

# CPU device
device_cpu = torch.device('cpu')

# GPU device if CUDA is available, otherwise fall back to CPU
device_gpu = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Print which GPU is available and its name
if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))
else:
    print("Using CPU only")

# Convert Data to PyTorch Tensors
# Convert feature and label arrays to PyTorch tensors
# `unsqueeze(1)` reshapes y to be a column vector
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).unsqueeze(1)

# We will be using large square matrices to simulate heavy computation
MATRIX_SIZE = 5000

# Create two large random matrices on CPU
a_cpu = torch.rand(MATRIX_SIZE, MATRIX_SIZE, device=device_cpu)
b_cpu = torch.rand(MATRIX_SIZE, MATRIX_SIZE, device=device_cpu)

# Start timing CPU matrix multiplication
start_time = time.time()
c_cpu = torch.matmul(a_cpu, b_cpu)  # Perform matrix multiplication
cpu_time = time.time() - start_time
print(f"\n[CPU] Matrix Multiplication Time: {cpu_time:.4f} seconds")

if torch.cuda.is_available():
    # Create two large random matrices on GPU
    a_gpu = torch.rand(MATRIX_SIZE, MATRIX_SIZE, device=device_gpu)
    b_gpu = torch.rand(MATRIX_SIZE, MATRIX_SIZE, device=device_gpu)

    # Warm-up run so one-time CUDA start-up cost is excluded from the timing
    _ = torch.matmul(a_gpu, b_gpu)

    # Use torch.cuda.synchronize() to ensure accurate timing
    torch.cuda.synchronize()
    start_time = time.time()
    c_gpu = torch.matmul(a_gpu, b_gpu)
    torch.cuda.synchronize()
    gpu_time = time.time() - start_time

    print(f"[GPU] Matrix Multiplication Time: {gpu_time:.4f} seconds")
else:
    print("[GPU] Skipped: CUDA-compatible GPU not available")


GPU Name: Tesla T4

[CPU] Matrix Multiplication Time: 2.6765 seconds
[GPU] Matrix Multiplication Time: 0.0819 seconds


Observation :

On a 5000×5000 matrix multiply, the CPU took 2.6765 s and the Tesla T4 GPU took 0.0819 s, a speedup of about 32.7× (2.6765 / 0.0819). Because CUDA kernels launch asynchronously, I ran one warm-up multiply and called torch.cuda.synchronize() immediately before starting and before stopping the timer, so the measurement covers the full GPU computation. For large dense tensor ops, running the compute on the GPU is far faster; if CUDA isn't available, the code falls back to CPU.
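To put the two timings on a common scale, the reported numbers can be converted into approximate throughput. This is only a rough sketch: it assumes a dense n×n matrix multiply costs about 2·n³ floating-point operations, and the helper name `matmul_gflops` is made up for illustration.

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """Approximate throughput in GFLOP/s, assuming ~2 * n**3 FLOPs per n x n matmul."""
    flops = 2 * n ** 3
    return flops / seconds / 1e9

MATRIX_SIZE = 5000
cpu_time = 2.6765   # seconds, taken from the run above
gpu_time = 0.0819   # seconds, taken from the run above

speedup = cpu_time / gpu_time
cpu_gflops = matmul_gflops(MATRIX_SIZE, cpu_time)
gpu_gflops = matmul_gflops(MATRIX_SIZE, gpu_time)

print(f"speedup: {speedup:.1f}x")
print(f"CPU: {cpu_gflops:.1f} GFLOP/s")
print(f"GPU: {gpu_gflops:.1f} GFLOP/s")
```

Under that FLOP-count assumption, the CPU run works out to roughly 93 GFLOP/s and the T4 run to roughly 3 TFLOP/s.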

3.2 Weight Initialization

In [None]:
# Importing libraries
import pandas as pd
import torch
import torch.nn as nn

# Fixing the seed makes initializations repeatable.
torch.manual_seed(42)

# The file has no header; last column is the binary target.
# Separate features (X) and target (y)
df = pd.read_csv('diabetes.csv', header=None)
X = df.iloc[:, :-1].values  # all columns except last
y = df.iloc[:, -1].values   # last column

# Convert to tensors (float32); reshape y to [N, 1] for a single-output model.
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).unsqueeze(1)


input_dim = X_tensor.shape[1]  # should be 8
hidden_dim1 = 32               # first hidden layer
hidden_dim2 = 16               # second hidden layer
output_dim = 1

# Defining Model
class DiabetesMLP(nn.Module):

    def __init__(self, in_dim, h1, h2, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, h1)
        self.fc2 = nn.Linear(h1, h2)
        self.fc3 = nn.Linear(h2, out_dim)
        self.act = nn.ReLU()

        # Initializations after creating layers.
        self._init_weights()

    def _init_weights(self):
        # Layer 1 uses Xavier
        # For layers followed by ReLU, use gain = sqrt(2).
        nn.init.xavier_uniform_(self.fc1.weight, gain=nn.init.calculate_gain('relu'))
        nn.init.zeros_(self.fc1.bias)

        # Layer 2 uses Kaiming (He) Normal
        # He initialization is designed for ReLU non-linearities (good forward variance).
        # Using mode='fan_in' preserves variance in the forward pass for ReLU.
        nn.init.kaiming_normal_(self.fc2.weight, mode='fan_in', nonlinearity='relu')
        nn.init.zeros_(self.fc2.bias)

        # Layer 3 uses a Normal distribution (mean=0, std=0.02)
        nn.init.normal_(self.fc3.weight, mean=0.0, std=0.02)
        nn.init.zeros_(self.fc3.bias)

    def forward(self, x):
        # Forward pass
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        x = self.fc3(x)  # final linear output (logit)
        return x

# Instantiate the Model
model = DiabetesMLP(input_dim, hidden_dim1, hidden_dim2, output_dim)

# summarize initial weights
# Printing mean/std to verify that different inits were applied.
with torch.no_grad():
    w1_mean, w1_std = model.fc1.weight.mean().item(), model.fc1.weight.std().item()
    w2_mean, w2_std = model.fc2.weight.mean().item(), model.fc2.weight.std().item()
    w3_mean, w3_std = model.fc3.weight.mean().item(), model.fc3.weight.std().item()

print("fc1 (Xavier Uniform)  -> weight mean/std: {:.6f} / {:.6f}".format(w1_mean, w1_std))
print("fc2 (Kaiming Normal)  -> weight mean/std: {:.6f} / {:.6f}".format(w2_mean, w2_std))
print("fc3 (Normal 0,0.02)   -> weight mean/std: {:.6f} / {:.6f}".format(w3_mean, w3_std))

fc1 (Xavier Uniform)  -> weight mean/std: -0.008326 / 0.305070
fc2 (Kaiming Normal)  -> weight mean/std: -0.002070 / 0.245711
fc3 (Normal 0,0.02)   -> weight mean/std: 0.004340 / 0.018771


Observation :    

For the 3-layer MLP I used a different initializer per layer: fc1 = Xavier-Uniform with ReLU gain, fc2 = Kaiming-Normal (ReLU), fc3 = Normal(0, 0.02). All weight means are near zero and the standard deviations are close to the theoretical values: fc1 std 0.3051 vs the Xavier-ReLU expectation √2·√(2/(8+32)) ≈ 0.316 (fan_in=8, fan_out=32); fc2 std 0.2457 vs the He expectation √(2/32) = 0.25; fc3 std 0.0188 vs the target 0.02. The small gaps are consistent with sampling noise from the finite number of weights. No training was run; this is only an initialization check.
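The expected standard deviations quoted above can be re-derived with a few lines of plain Python. This is a sketch of the arithmetic only, not of PyTorch internals; the helper names `xavier_uniform_std` and `kaiming_normal_std` are made up for illustration, and the observed values are copied from the printed output.

```python
import math

def xavier_uniform_std(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    # U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)) has std a / sqrt(3).
    bound = gain * math.sqrt(6.0 / (fan_in + fan_out))
    return bound / math.sqrt(3.0)

def kaiming_normal_std(fan_in: int, gain: float = math.sqrt(2.0)) -> float:
    # He normal with mode='fan_in': std = gain / sqrt(fan_in); gain = sqrt(2) for ReLU.
    return gain / math.sqrt(fan_in)

relu_gain = math.sqrt(2.0)
fc1_expected = xavier_uniform_std(fan_in=8, fan_out=32, gain=relu_gain)
fc2_expected = kaiming_normal_std(fan_in=32)

# Observed stds copied from the printed output above.
for name, expected, observed in [
    ("fc1", fc1_expected, 0.305070),
    ("fc2", fc2_expected, 0.245711),
    ("fc3", 0.02, 0.018771),
]:
    rel_gap = (observed - expected) / expected
    print(f"{name}: expected={expected:.4f} observed={observed:.4f} gap={rel_gap:+.1%}")
```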

3.3 Activation Checkpointing


In [None]:
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.utils.data import TensorDataset, DataLoader


# Checkpointed MLP Def
class CheckpointedMLP(nn.Module):

    def __init__(self, in_dim, h1, h2, out_dim, use_checkpoint=True):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, h1)
        self.fc2 = nn.Linear(h1, h2)
        self.fc3 = nn.Linear(h2, out_dim)
        self.act = nn.ReLU()
        self.use_checkpoint = use_checkpoint

        nn.init.xavier_uniform_(self.fc1.weight, gain=nn.init.calculate_gain('relu'))
        nn.init.zeros_(self.fc1.bias)

        nn.init.kaiming_normal_(self.fc2.weight, mode='fan_in', nonlinearity='relu')
        nn.init.zeros_(self.fc2.bias)

        nn.init.normal_(self.fc3.weight, mean=0.0, std=0.02)
        nn.init.zeros_(self.fc3.bias)

    # Define small functional blocks so they can be checkpointed.
    def _block1(self, x):
        # Block 1 = ReLU( fc1(x) )
        return self.act(self.fc1(x))

    def _block2(self, x):
        # Block 2 = ReLU( fc2(x) )
        return self.act(self.fc2(x))

    def forward(self, x):

        if self.training and self.use_checkpoint:
            # Ensure checkpoint precondition: some input must require grad.
            if not x.requires_grad:
                x = x.detach()              # detach to avoid modifying upstream graph
                x.requires_grad_(True)      # enable grad on input for checkpoint

            # Wrap each hidden block with checkpoint to save activation memory.
            x = checkpoint(self._block1, x)  # Checkpointed block 1
            x = checkpoint(self._block2, x)  # Checkpointed block 2
        else:
            # Standard forward without checkpointing
            x = self._block1(x)
            x = self._block2(x)

        # Final linear layer
        x = self.fc3(x)
        return x


# Instantiate model and training bits
# Dimensions reused from previous question:
model_ckpt = CheckpointedMLP(input_dim, hidden_dim1, hidden_dim2, output_dim, use_checkpoint=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_ckpt.to(device)

# Tiny DataLoader to demonstrate backward with checkpointing.
# Reuses X_tensor and y_tensor created earlier.
ds = TensorDataset(X_tensor, y_tensor)
loader = DataLoader(ds, batch_size=128, shuffle=True)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model_ckpt.parameters(), lr=1e-3)

model_ckpt.train()  # keep model in training mode so checkpointing is active
for step, (xb, yb) in enumerate(loader):
    xb = xb.to(device)  # move batch to device (CPU/GPU)
    yb = yb.to(device)  # move targets to device

    optimizer.zero_grad(set_to_none=True)  # reset grads for this step
    logits = model_ckpt(xb)                # forward pass (hidden blocks are checkpointed)
    loss = criterion(logits, yb)           # BCEWithLogitsLoss on raw logits
    loss.backward()                        # backward pass (recomputes checkpointed activations)
    optimizer.step()



Observation :     

I wrapped the two hidden ReLU blocks (fc1→ReLU and fc2→ReLU) with torch.utils.checkpoint, active only in training mode. I also set x.requires_grad_(True) on the input to satisfy the checkpoint requirement that at least one input requires grad. With checkpointing, the forward pass does not keep the intermediate activations inside each block and the backward pass recomputes them, so peak activation memory is expected to go down while each step takes a bit longer (neither was measured here). In eval mode (or with use_checkpoint=False) the model runs normally without checkpointing. I kept the same inits from 3.2 (Xavier for fc1, He for fc2, small normal for fc3). The loop prints nothing; it is a regular training pass with checkpointing turned on.
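The store-less, recompute-later trade-off can be illustrated without PyTorch. The sketch below is a toy: it assumes a chain of scalar tanh "layers" with hand-written derivatives, and the functions `grad_plain` and `grad_checkpointed` are hypothetical names, not part of torch.utils.checkpoint. It shows that both strategies produce the same gradient while the checkpointed one keeps fewer values from the forward pass and pays with extra forward evaluations.

```python
import math

def layer(w, x):
    """One toy 'layer': a scalar tanh(w * x)."""
    return math.tanh(w * x)

def layer_grad(w, x):
    """Derivative of tanh(w * x) with respect to x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_plain(weights, x0):
    """Keep every layer input during forward, then backpropagate."""
    saved = []
    x = x0
    for w in weights:
        saved.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(saved)):
        grad *= layer_grad(w, x_in)
    return grad, len(saved), 0

def grad_checkpointed(weights, x0, segment):
    """Keep only each segment's input; recompute the rest during backward."""
    saved = []
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            saved.append(x)
        x = layer(w, x)
    grad, recomputed = 1.0, 0
    for s in reversed(range(len(saved))):
        seg_weights = weights[s * segment:(s + 1) * segment]
        # Re-run the segment forward from its checkpoint to rebuild its inputs.
        seg_inputs, x = [], saved[s]
        for w in seg_weights:
            seg_inputs.append(x)
            x = layer(w, x)
            recomputed += 1
        for w, x_in in zip(reversed(seg_weights), reversed(seg_inputs)):
            grad *= layer_grad(w, x_in)
    return grad, len(saved), recomputed

weights = [0.9, -1.1, 0.7, 1.3, -0.8, 1.2, 0.6, -1.0]
g_plain, kept_plain, extra_plain = grad_plain(weights, 0.5)
g_ckpt, kept_ckpt, extra_ckpt = grad_checkpointed(weights, 0.5, segment=4)

print(f"plain:        grad={g_plain:.6f} kept={kept_plain} recomputed={extra_plain}")
print(f"checkpointed: grad={g_ckpt:.6f} kept={kept_ckpt} recomputed={extra_ckpt}")
```

In this toy, the forward pass keeps 2 values instead of 8. During backward, one segment's inputs are rebuilt temporarily, so the true peak sits between the two extremes; the saving grows with the depth of each checkpointed block.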

3.4 Gradient Accumulation

In [None]:
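# NOTE: accumulation_steps (micro-batches per optimizer step) must be defined
# before this cell runs; its value is not shown here.
# Put the model in training mode
model_ckpt.train()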

# Clear any existing gradients before starting accumulation
optimizer.zero_grad(set_to_none=True)

# Tracking how many micro-batches have contributed grads in the current window
_accum_counter = 0

# Counts how many times we've actually updated parameters (i.e., effective steps)
effective_step = 0

# Iterate over micro-batches from the DataLoader
for _step, (xb, yb) in enumerate(loader, start=1):
    # Move current micro-batch to the active device
    xb = xb.to(device)
    yb = yb.to(device)

    # Forward pass on the micro-batch
    logits = model_ckpt(xb)

    # Compute the loss for the current micro-batch and divide it by accumulation_steps.
    # Because loss.backward() sums gradients across micro-batches, this scaling makes
    # the accumulated gradient match the mean-loss gradient of one large batch
    # (for equally sized micro-batches).
    loss = criterion(logits, yb) / accumulation_steps

    # Backpropagate the SCALED loss
    loss.backward()

    # One more micro-batch has contributed to the gradient sum
    _accum_counter += 1

    # When we've seen 'accumulation_steps' micro-batches, apply one optimizer step
    if _accum_counter == accumulation_steps:
        # Update parameters ONCE using the accumulated gradients
        optimizer.step()

        # Clear gradients so the next accumulation window starts fresh
        optimizer.zero_grad(set_to_none=True)

        # Book-keeping: we performed one "effective" update
        effective_step += 1

        # 'loss' here is the SCALED value from the last micro-batch in the window.
        print(f"effective_step={effective_step} | loss={loss.item():.6f}")

        # Reset the counter for the next accumulation window
        _accum_counter = 0

if _accum_counter > 0:
    # Apply one final step using whatever has accumulated
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    effective_step += 1
    # Print Statement indicating we applied a final partial update
    print(f"effective_step={effective_step} | (final partial window)")


effective_step=1 | loss=0.171629
effective_step=2 | (final partial window)


Observation :     

With gradient accumulation, I summed gradients over multiple micro-batches before updating the weights, so there is one optimizer step per accumulation window instead of one per batch. The output shows exactly that: one full window produced effective_step=1 | loss=0.171629 (the scaled loss of the last micro-batch in that window, i.e. already divided by accumulation_steps), and the second line, effective_step=2 | (final partial window), means the number of micro-batches wasn't divisible by accumulation_steps, so one final update was applied with a partial window. This confirms accumulation is working: fewer optimizer steps, one loss line per effective step, and no increase in micro-batch size (so peak memory stays the same).
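The reason for dividing each micro-batch loss by accumulation_steps can be checked by hand on a tiny problem. The sketch below is plain Python rather than PyTorch: it assumes a one-parameter model w·x with mean-squared-error loss, and all data values, `micro_batch_size`, `accumulation_steps = 4`, and `num_micro_batches = 6` are hypothetical numbers chosen for illustration.

```python
import math

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data: 8 samples, split into 4 micro-batches of 2.
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 1.0, -2.0]
ys = [1.0, -2.5, 3.5, 3.0, -1.0, 6.5, 2.0, -4.5]
w = 0.3

micro_batch_size = 2
accumulation_steps = 4   # hypothetical value for this sketch

# Gradient of the mean loss over the whole "large batch".
full_grad = mse_grad(w, xs, ys)

# Sum of micro-batch gradients, each scaled like `loss / accumulation_steps`.
accumulated = 0.0
for k in range(accumulation_steps):
    start = k * micro_batch_size
    stop = start + micro_batch_size
    accumulated += mse_grad(w, xs[start:stop], ys[start:stop]) / accumulation_steps

print(f"full-batch grad:  {full_grad:.6f}")
print(f"accumulated grad: {accumulated:.6f}")

# Number of optimizer steps when the last window may be partial.
num_micro_batches = 6    # hypothetical loader length
optimizer_steps = math.ceil(num_micro_batches / accumulation_steps)
print(f"optimizer steps: {optimizer_steps}")
```

The equality holds only when the micro-batches are the same size. In a final partial window, the gradients are still divided by the full accumulation_steps, so that last update is smaller than a full-window update would be.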

3.5 Mixed Precision Training

In [None]:
import torch

try:
    device
except NameError:
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

try:
    accumulation_steps  # just to check existence
except NameError:
    accumulation_steps = 1  # default to no accumulation if not set earlier

if device.type == 'cuda':
    # CUDA path will use float16 autocast + real GradScaler
    scaler = torch.cuda.amp.GradScaler()

    def amp_autocast():
        # autocast returns a context manager
        return torch.cuda.amp.autocast(dtype=torch.float16)
else:
    # CPU path: define a no-op scaler and try CPU autocast with bfloat16
    class _NoOpScaler:
        def scale(self, x): return x
        def step(self, opt): opt.step()
        def update(self): pass
        def unscale_(self, opt): pass
    scaler = _NoOpScaler()

    def amp_autocast():
        # Using CPU autocast if available
        try:
            return torch.autocast(device_type='cpu', dtype=torch.bfloat16)
        except AttributeError:
            from contextlib import nullcontext
            return nullcontext()

model_ckpt.to(device)
model_ckpt.train()
optimizer.zero_grad(set_to_none=True)

# AMP training loop with one print per effective step
_accum_counter = 0
effective_step = 0

for _step, (xb, yb) in enumerate(loader, start=1):
    xb = xb.to(device)
    yb = yb.to(device)

    # Forward + loss under mixed precision autocast
    with amp_autocast():
        logits = model_ckpt(xb)             # raw logits
        loss = criterion(logits, yb)        # BCEWithLogitsLoss on logits
        loss = loss / accumulation_steps    # scale for gradient accumulation

    # Backward with GradScaler
    scaler.scale(loss).backward()
    _accum_counter += 1

    if _accum_counter == accumulation_steps:

        # torch.nn.utils.clip_grad_norm_(model_ckpt.parameters(), max_norm=1.0)

        scaler.step(optimizer)      # applies update
        scaler.update()             # adjusts scale on CUDA; no-op on CPU
        optimizer.zero_grad(set_to_none=True)

        effective_step += 1
        print(f"effective_step={effective_step} | loss={loss.item():.6f}")

        _accum_counter = 0

# Handle a final partial accumulation window
if _accum_counter > 0:
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
    effective_step += 1
    print(f"effective_step={effective_step} | (final partial window)")


  scaler = torch.cuda.amp.GradScaler()
  return torch.cuda.amp.autocast(dtype=torch.float16)


effective_step=1 | loss=0.170615
effective_step=2 | (final partial window)


Observation :

For mixed precision, I ran the training loop with AMP (autocast + GradScaler) combined with gradient accumulation. The console showed two things: (1) FutureWarning messages saying the torch.cuda.amp API is deprecated, so I should switch to torch.amp.GradScaler('cuda') and torch.amp.autocast('cuda') going forward, and (2) the step prints: effective_step=1 | loss=0.170615 followed by effective_step=2 | (final partial window). That means one full accumulation window produced an update with a scaled loss of ~0.1706, and then the dataloader ended with a partial window that triggered a final update. No NaNs or errors showed up, which suggests the GradScaler kept the float16 gradients in a representable range.
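Why a GradScaler is needed with float16 can be shown with the standard library alone, since `struct` supports IEEE half precision through the `'e'` format. This is a sketch of the numeric idea only, not of how PyTorch implements it: the gradient value 1e-8 is a hypothetical example, `to_fp16` is a made-up helper, and 65536 is used because 2**16 is GradScaler's default initial scale.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

tiny_grad = 1e-8       # hypothetical gradient, below fp16's smallest subnormal (~6e-8)
loss_scale = 65536.0   # 2**16, GradScaler's default initial scale

# Without scaling, the value underflows to zero in half precision.
unscaled_fp16 = to_fp16(tiny_grad)

# Scaling the loss scales every gradient by the same factor, so the value
# survives in fp16 and can be divided back out in higher precision.
scaled_fp16 = to_fp16(tiny_grad * loss_scale)
recovered = scaled_fp16 / loss_scale

print(f"fp16 without scaling: {unscaled_fp16}")
print(f"fp16 with scaling:    {scaled_fp16:.6e}")
print(f"recovered gradient:   {recovered:.6e}")
```

The recovered value agrees with the original to about three significant digits, which is the precision float16 offers. Without scaling, the gradient would have been lost entirely.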