# Week 13 — GPU Computing for Data Science

Complete the three short tasks below to practice moving workloads to a GPU (when available). Submit this notebook (`.ipynb`) only.


## Submission notes
- Keep the notebook runnable on CPU-only machines; wrap CUDA-specific code with availability checks.
- Fill in the TODO sections. Add brief commentary after each task describing what you observed.
- Do **not** submit additional files.


In [None]:
# Basic imports (feel free to add more if needed)
import time
import torch

print(f"PyTorch version: {torch.__version__}")


## Task 1 — Detect GPU and report device info
Implement helper functions that choose the appropriate device and report its details.

**What to do**
1) Implement `get_device()` that returns `torch.device("cuda")` when CUDA is available, otherwise `torch.device("cpu")`.
2) Implement `report_device_info(device)` that prints whether CUDA is available and, if so, the device name and total memory.
3) Call both functions and keep their printed output in the submitted notebook.
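
As a rough illustration of the PyTorch calls involved, here is a minimal sketch. The names `pick_device` and `describe_device` are hypothetical (they are not the functions you are asked to implement), and returning a string rather than printing directly is just one possible design.

```python
import torch


def pick_device() -> torch.device:
    # Hypothetical helper: prefer CUDA when PyTorch can see a GPU, else use CPU.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def describe_device(device: torch.device) -> str:
    # Hypothetical helper: build a short report for the given device.
    lines = [f"CUDA available: {torch.cuda.is_available()}"]
    if device.type == "cuda":
        # Only query CUDA properties when the device really is a GPU,
        # so the code stays runnable on CPU-only machines.
        props = torch.cuda.get_device_properties(device)
        lines.append(f"Device name: {props.name}")
        lines.append(f"Total memory: {props.total_memory / 1024**3:.2f} GiB")
    else:
        lines.append("Using CPU")
    return "\n".join(lines)


summary = describe_device(pick_device())
print(summary)
```

Checking `device.type` before touching `torch.cuda` queries is what keeps the same cell working with and without a GPU.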


In [None]:
# TODO: implement Task 1 here
def get_device():
    """Return a torch.device object, preferring CUDA when available."""
    raise NotImplementedError()


def report_device_info(device: torch.device) -> None:
    """Print device details: CUDA availability and, if using GPU, the device name and total memory."""
    raise NotImplementedError()


# Example usage (replace the NotImplementedError placeholders above with your code first)
device = get_device()
report_device_info(device)


## Task 2 — Matrix multiply benchmark (CPU vs GPU)
Compare matrix multiplication speed on CPU vs GPU for a pair of moderately sized square matrices. Keep the matrix size modest so the benchmark runs quickly even on CPU.

**What to do**
1) Create a function `bench_matmul(device, n=1024)` that creates two random `n x n` tensors on the chosen device and times a single matrix multiplication.
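2) If using GPU, call `torch.cuda.synchronize()` immediately before starting the timer and again right after the multiplication. CUDA kernels launch asynchronously, so without the second call the timer would stop before the GPU has finished.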
3) Run the benchmark once on CPU and once on GPU when available. Print the elapsed times in milliseconds.
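
The timing pattern described above could look roughly like the sketch below. The name `time_matmul_ms` and the smaller default `n=256` are assumptions made for illustration; this is not the required `bench_matmul` implementation.

```python
import time

import torch


def time_matmul_ms(device: torch.device, n: int = 256) -> float:
    # Hypothetical helper: time one n x n matrix multiplication in milliseconds.
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    if device.type == "cuda":
        # Make sure tensor creation has finished before the clock starts.
        torch.cuda.synchronize(device)
    start = time.perf_counter()

    _ = a @ b

    if device.type == "cuda":
        # Wait for the matmul kernel to complete before the clock stops.
        torch.cuda.synchronize(device)
    return (time.perf_counter() - start) * 1000.0


elapsed_ms = time_matmul_ms(torch.device("cpu"))
print(f"CPU matmul (n=256): {elapsed_ms:.2f} ms")
```

Note that the first CUDA operation in a session typically includes one-time initialization overhead, so an untimed warm-up multiplication before the measured one may give a more representative GPU number.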


In [None]:
# TODO: implement Task 2 here
def bench_matmul(device: torch.device, n: int = 1024) -> float:
    """Return elapsed time (ms) for one matrix multiplication on the given device."""
    raise NotImplementedError()


# Example usage (after implementing bench_matmul)
cpu_time_ms = bench_matmul(torch.device("cpu"))
print(f"CPU matmul: {cpu_time_ms:.2f} ms")

# Only run GPU timing if available
if torch.cuda.is_available():
    gpu_time_ms = bench_matmul(torch.device("cuda"))
    print(f"GPU matmul: {gpu_time_ms:.2f} ms")
else:
    print("CUDA not available; skipped GPU benchmark.")


## Task 3 — Train a tiny model on GPU when available
Train a simple logistic regression classifier on synthetic data and compare training time on CPU vs GPU (if available).

**What to do**
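1) Generate a synthetic binary classification dataset (e.g., 10,000 samples, 50 features) using `torch.randn` for the features; derive the 0/1 labels from them, for example by thresholding a random linear combination of the features.
2) Define a single-layer model (e.g., `torch.nn.Linear`) and a training loop using binary cross-entropy (e.g., `torch.nn.BCEWithLogitsLoss`, which applies the sigmoid internally).
3) Run a short training loop (e.g., 10–20 epochs) on CPU and, if available, on GPU. Time each run and report the final loss for each device.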
4) Add a brief note explaining any speedup (or lack thereof) you observed.
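
One way the steps above might fit together is sketched below. The function name `fit_logistic_regression`, the fixed seed, the SGD optimizer with `lr=0.1`, and full-batch updates are all assumptions chosen for illustration, not requirements of the task.

```python
import time

import torch


def fit_logistic_regression(device, epochs=15, n_samples=10_000, n_features=50):
    # Hypothetical helper: returns (last_loss, elapsed_ms).
    torch.manual_seed(0)  # assumption: fixed seed for repeatable runs

    # Synthetic data created directly on the target device.
    X = torch.randn(n_samples, n_features, device=device)
    true_w = torch.randn(n_features, 1, device=device)
    y = (X @ true_w > 0).float()  # labels from a random linear rule

    model = torch.nn.Linear(n_features, 1).to(device)
    loss_fn = torch.nn.BCEWithLogitsLoss()  # expects raw logits
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    if device.type == "cuda":
        torch.cuda.synchronize(device)
    start = time.perf_counter()

    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)  # full-batch forward pass
        loss.backward()
        optimizer.step()

    if device.type == "cuda":
        torch.cuda.synchronize(device)
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    # The reported loss comes from the last forward pass, before the final update.
    return loss.item(), elapsed_ms


sketch_loss, sketch_ms = fit_logistic_regression(torch.device("cpu"))
print(f"CPU -> loss: {sketch_loss:.4f}, time: {sketch_ms:.1f} ms")
```

With a model and dataset this small, the work per step is modest, so kernel-launch overhead can offset much of the GPU's advantage; that is worth keeping in mind when writing the note for step 4.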


In [None]:
# TODO: implement Task 3 here
def train_log_reg(device: torch.device, epochs: int = 15, n_samples: int = 10_000, n_features: int = 50):
    """Train a tiny logistic regression model on synthetic data and return (final_loss, elapsed_ms)."""
    raise NotImplementedError()


# Example usage (after implementing train_log_reg)
cpu_loss, cpu_ms = train_log_reg(torch.device("cpu"))
print(f"CPU -> loss: {cpu_loss:.4f}, time: {cpu_ms:.1f} ms")

if torch.cuda.is_available():
    gpu_loss, gpu_ms = train_log_reg(torch.device("cuda"))
    print(f"GPU -> loss: {gpu_loss:.4f}, time: {gpu_ms:.1f} ms")
else:
    print("CUDA not available; skipped GPU training run.")
