# Week 13 — GPU Computing for Data Science

Complete the three short tasks below to practice moving workloads to a GPU (when available). Submit this notebook (`.ipynb`) only.


## Submission notes
- Keep the notebook runnable on CPU-only machines; wrap CUDA-specific code with availability checks.
- Fill in the TODO sections. Add brief commentary after each task describing what you observed.
- Do **not** submit additional files.


In [1]:
# Basic imports (feel free to add more if needed)
import time
import torch

print(f"PyTorch version: {torch.__version__}")


PyTorch version: 2.9.0+cpu


## Task 1 — Detect GPU and report device info
Implement helper functions that choose the appropriate device and report its details.

**What to do**
1) Implement `get_device()` that returns `torch.device("cuda")` when CUDA is available, otherwise CPU.
2) Implement `report_device_info(device)` that prints whether CUDA is available and, if so, the device name and total memory.
3) Call both functions and capture their output.


In [3]:
import torch

# TODO: implement Task 1 here
def get_device():
    """Return a torch.device object, preferring CUDA when available."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    else:
        return torch.device("cpu")

def report_device_info(device: torch.device) -> None:
    """Print CUDA availability and, when on GPU, the device name and total memory."""
    if device.type == "cuda":
        print("CUDA is available: Yes")
        print(f"Device Name: {torch.cuda.get_device_name(device)}")

        # total_memory is reported in bytes; dividing by 1024**3 gives GiB
        total_memory = torch.cuda.get_device_properties(device).total_memory
        print(f"Total Memory: {total_memory / (1024**3):.2f} GiB")
    else:
        print("CUDA is available: No")
        print("Using CPU.")

# Example usage
device = get_device()
report_device_info(device)

CUDA is available: No
Using CPU.
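
`report_device_info` divides `total_memory` (in bytes) by `1024**3`, which yields gibibytes (GiB) rather than decimal gigabytes (GB). The sketch below shows the gap between the two units. `bytes_to_gib`, `bytes_to_gb`, and the 16 GiB example value are hypothetical, chosen only for illustration.

```python
def bytes_to_gib(n_bytes: int) -> float:
    """Convert bytes to binary gibibytes (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3


def bytes_to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 1000**3 bytes)."""
    return n_bytes / 1000**3


# Hypothetical card with exactly 16 GiB of memory (not a measured value)
example_bytes = 16 * 1024**3

print(f"{bytes_to_gib(example_bytes):.2f} GiB")  # 16.00 GiB
print(f"{bytes_to_gb(example_bytes):.2f} GB")    # 17.18 GB
```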


## Task 2 — Matrix multiply benchmark (CPU vs GPU)
Compare matrix multiplication speed on CPU vs GPU for a moderately sized tensor. Keep the matrix size modest so it runs quickly even on CPU.

**What to do**
1) Create a function `bench_matmul(device, n=1024)` that creates two random `n x n` tensors on the chosen device and times a single matrix multiplication.
2) If using GPU, call `torch.cuda.synchronize()` immediately before starting and before stopping the timer, so that asynchronous kernel launches are fully measured.
3) Run the benchmark once on CPU and once on GPU when available. Print the elapsed times in milliseconds.
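
A single timed call, as step 1 asks for, is sensitive to one-off costs (on GPU, the first call typically also pays for CUDA initialization). A common refinement is to run a few untimed warm-up calls and report the median of several timed ones. The helper below is a minimal sketch of that idea, not part of the assignment. `time_median_ms` and its parameters are hypothetical names, and `sync` stands in for `torch.cuda.synchronize` on GPU.

```python
import statistics
import time
from typing import Callable, Optional


def time_median_ms(
    fn: Callable[[], object],
    sync: Optional[Callable[[], None]] = None,
    warmup: int = 3,
    repeats: int = 10,
) -> float:
    """Return the median wall-clock time (ms) of fn() over `repeats` runs."""
    # Untimed warm-up runs absorb one-off setup costs
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # make sure earlier asynchronous work has finished
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        samples.append((time.perf_counter() - start) * 1000)

    return statistics.median(samples)


# CPU-only demonstration with a plain Python workload
elapsed = time_median_ms(lambda: sum(range(100_000)))
print(f"median: {elapsed:.3f} ms")
```

With two tensors `a` and `b` already on the device, this could be called as `time_median_ms(lambda: torch.matmul(a, b), sync=torch.cuda.synchronize)` on GPU, or with the default `sync=None` on CPU.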


In [2]:
import torch
import time

# TODO: implement Task 2 here
def bench_matmul(device: torch.device, n: int = 1024) -> float:
    """Return elapsed time (ms) for one matrix multiplication on the given device."""
    
    # 1. Create random tensors on the specified device
    # We create them outside the timer so we only measure the multiplication itself
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    # 2. Prepare for timing
    # If using CUDA, we must synchronize before starting the timer to ensure 
    # the GPU isn't busy with the tensor creation above.
    if device.type == 'cuda':
        torch.cuda.synchronize()

    start_time = time.perf_counter()

    # 3. Perform the matrix multiplication
    # The result is bound to 'c' for readability; eager-mode PyTorch runs the matmul either way
    c = torch.matmul(a, b)

    # 4. Stop timing
    # Crucial: We must synchronize again. Otherwise, the timer stops 
    # immediately after *launching* the kernel, reporting near-zero time.
    if device.type == 'cuda':
        torch.cuda.synchronize()
        
    end_time = time.perf_counter()

    # Return elapsed time in milliseconds
    return (end_time - start_time) * 1000

# Example usage
print("Benchmarking with matrix size: 1024x1024")

cpu_time_ms = bench_matmul(torch.device("cpu"))
print(f"CPU matmul: {cpu_time_ms:.2f} ms")

if torch.cuda.is_available():
    gpu_time_ms = bench_matmul(torch.device("cuda"))
    print(f"GPU matmul: {gpu_time_ms:.2f} ms")
    print(f"Speedup: {cpu_time_ms / gpu_time_ms:.1f}x")
else:
    print("CUDA not available; skipped GPU benchmark.")

Benchmarking with matrix size: 1024x1024
CPU matmul: 24.06 ms
CUDA not available; skipped GPU benchmark.
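
One way to put the CPU timing above in context is to convert it to throughput. Multiplying two `n x n` matrices takes roughly `2 * n**3` floating-point operations (one multiply and one add per inner-product term), so the 24.06 ms measured for `n = 1024` works out as follows. `matmul_gflops` is a hypothetical helper name, and the result describes only the single run recorded above.

```python
def matmul_gflops(n: int, elapsed_ms: float) -> float:
    """Approximate throughput (GFLOP/s) of one n x n matrix multiplication."""
    flops = 2 * n**3               # n**3 multiplies plus about n**3 adds
    seconds = elapsed_ms / 1000
    return flops / seconds / 1e9


# Values taken from the CPU run recorded above
print(f"{matmul_gflops(1024, 24.06):.1f} GFLOP/s")  # 89.3 GFLOP/s
```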


## Task 3 — Train a tiny model on GPU when available
Train a simple logistic regression classifier on synthetic data and compare training time on CPU vs GPU (if available).

**What to do**
1) Generate a synthetic binary classification dataset (e.g., 10,000 samples, 50 features) using `torch.randn`.
2) Define a single-layer model (e.g., `torch.nn.Linear`) and a training loop using binary cross-entropy.
3) Run a short training loop (e.g., 10–20 epochs) on CPU and, if available, on GPU. Time each run and report final loss for each device.
4) Add a brief note explaining any speedup (or lack thereof) you observed.


In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
import time

# TODO: implement Task 3 here
def train_log_reg(device: torch.device, epochs: int = 15, n_samples: int = 10_000, n_features: int = 50):
    """Train a tiny logistic regression model on synthetic data and return (final_loss, elapsed_ms)."""
    
    # 1. Generate synthetic data
    # X: random noise, y: random binary targets (0 or 1)
    X = torch.randn(n_samples, n_features)
    y = torch.randint(0, 2, (n_samples, 1)).float()

    # 2. Move data to the target device
    # The model and its inputs must live on the same device; otherwise PyTorch raises a RuntimeError.
    X, y = X.to(device), y.to(device)

    # 3. Define Model, Loss, and Optimizer
    model = nn.Linear(n_features, 1)
    model.to(device)  # Move model to device

    criterion = nn.BCEWithLogitsLoss() # Combines Sigmoid + BCE for stability
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # 4. Prepare Timing
    if device.type == 'cuda':
        torch.cuda.synchronize()
    
    start_time = time.perf_counter()

    # 5. Training Loop
    for epoch in range(epochs):
        optimizer.zero_grad()           # Reset gradients
        outputs = model(X)              # Forward pass
        loss = criterion(outputs, y)    # Compute loss
        loss.backward()                 # Backward pass (gradients)
        optimizer.step()                # Update weights

    # 6. Stop Timing
    if device.type == 'cuda':
        torch.cuda.synchronize()
    
    end_time = time.perf_counter()
    elapsed_ms = (end_time - start_time) * 1000
    
    # Return the final loss (as a python float) and time
    return loss.item(), elapsed_ms

# Example usage
cpu_loss, cpu_ms = train_log_reg(torch.device("cpu"))
print(f"CPU -> loss: {cpu_loss:.4f}, time: {cpu_ms:.1f} ms")

if torch.cuda.is_available():
    gpu_loss, gpu_ms = train_log_reg(torch.device("cuda"))
    print(f"GPU -> loss: {gpu_loss:.4f}, time: {gpu_ms:.1f} ms")
    
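    # Optional: speedup check
    # Note: for a model this small, the GPU can be SLOWER. Each step launches a few tiny kernels,
    # so launch overhead outweighs the compute (the data transfer happens before the timer starts).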
    if gpu_ms > 0:
        print(f"Speedup: {cpu_ms / gpu_ms:.2f}x")
else:
    print("CUDA not available; skipped GPU training run.")

CPU -> loss: 0.7270, time: 65.3 ms
CUDA not available; skipped GPU training run.
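
The final loss of about 0.73 is close to what chance-level predictions would give. The labels come from `torch.randint` independently of `X`, so in expectation no weights can do better than a predicted probability of 0.5, whose binary cross-entropy is ln 2 ≈ 0.693. The pure-Python sketch below checks that arithmetic. `bce` is a hypothetical helper written for illustration, not the `torch.nn` implementation.

```python
import math


def bce(p: float, y: int) -> float:
    """Binary cross-entropy of predicted probability p against label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# A model that always predicts 0.5 scores ln(2) on either label
loss_pos = bce(0.5, 1)
loss_neg = bce(0.5, 0)
print(f"{loss_pos:.4f} {loss_neg:.4f} {math.log(2):.4f}")  # 0.6931 0.6931 0.6931
```

Training on labels that depend on `X` (for example, thresholding a fixed linear function of the features) would give the model something to learn and a loss that can fall below this baseline.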
