# Supercomputer Usage 101

## Accessing Resources
Go to the Illinois [ICRN webpage](https://docs.ncsa.illinois.edu/systems/icrn/).

To use GPU resources, select "pytorch" + "A100 GPU 2CPU/8GB".



## Create an environment:

The point of a virtual environment is to isolate project dependencies so that different projects can use different package versions without conflict. This creates a "sandbox" for each project, containing its own specific Python interpreter and installed libraries, which makes development more organized and reproducible. 

1. First, shut down all kernels.
2. Run the command below in a terminal:
   - `~`: your home directory, equivalent to `/home/NET_ID`.
   - `--prefix`: tells mamba where to install your environment.
```
mamba create --prefix ~/myenv python tensorflow[and-cuda]=2.17 ipykernel pytorch pandas seaborn tqdm matplotlib pytorch-cuda -c pytorch -c nvidia -c conda-forge
```

This will take some time. We won't use TensorFlow today, but if you need it in the future, this is how you install it on ICRN.

3. Activate your environment:
```
source activate ~/myenv
```
4. Register the environment as a Jupyter kernel named `session`:
```
python -m ipykernel install --user --name=session 
```
Click "+" and you will see that your session has been added. You can also open the Kernel menu and select Change kernel. To learn more about Mamba, see its [documentation](https://mamba.readthedocs.io/en/latest/index.html).
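
To confirm that a notebook is really running inside the environment you created, you can inspect the interpreter from a cell. This is a minimal sketch: `interpreter_info` is a hypothetical helper name, and the `~/myenv` path assumes the prefix used in the `mamba create` command above.

```python
import sys
from pathlib import Path

def interpreter_info():
    """Return the path of the running Python and the environment it belongs to."""
    return {
        "executable": sys.executable,        # full path of the python binary
        "prefix": sys.prefix,                # root folder of the active environment
        "version": sys.version.split()[0],   # e.g. "3.11.9"
    }

info = interpreter_info()
print(info)

# Assumption: the environment was created with --prefix ~/myenv
expected = Path.home() / "myenv"
print("Running inside ~/myenv:", Path(info["prefix"]).resolve() == expected.resolve())
```

If the last line prints `False`, the notebook is probably still attached to the default kernel rather than the one you registered.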

### Common bash commands
- `pwd -P`: show the current absolute path.
- `cat [file]`: print the entire contents of a file.
- `ls [folder]`: list everything in a folder (the current folder if none is given).
- `nvidia-smi`: show the current GPU status.
- `rm [file]`: remove a file.
- `source [script]`: run a script in the current shell.
- `cp [src] [dst]`: copy a file from one location to another. To copy a folder, use `cp -r`.
- In JupyterLab, you can run these bash commands from a notebook cell by prefixing them with "!".
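
The "!" prefix only works inside IPython and Jupyter. In a plain Python script, the standard-library `subprocess` module can run the same commands. A minimal sketch, where `run` is a hypothetical helper name:

```python
import subprocess

def run(command):
    """Run a command given as a list of arguments and return its text output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(run(["pwd", "-P"]))   # current absolute path
print(run(["ls"]))          # contents of the current folder
```

Passing the command as a list avoids shell quoting problems, and `check=True` raises an exception if the command fails.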


## What Is a GPU, and Why Use One?



In [16]:
import time

import numpy as np
import pandas as pd
import torch
import torch.profiler

# Helper functions
def format_time(time_us):
    """Converts microseconds to a formatted ms or us string"""
    if time_us == 0:
        return "0.000us"
    if time_us > 1000 or time_us < -1000:
        return f"{time_us / 1000:.3f}ms"
    return f"{time_us:.3f}us"
    
def to_pd(prof):
    key_averages = prof.key_averages()
    total_self_cpu = prof.key_averages().self_cpu_time_total
    total_self_cuda = prof.key_averages().total_average().__dict__["self_device_time_total"]
    profiler_data = []
    for avg in key_averages:
        profiler_data.append({
            "Name": avg.key,
            
            # CPU Columns
            
            #"Self CPU %": f"{avg.self_cpu_time_total / total_self_cpu * 100:.2f}%" if total_self_cpu > 0 else "0.00%",
            "Self CPU": format_time(avg.self_cpu_time_total),
            "CPU total %": f"{avg.cpu_time_total / total_self_cpu * 100:.2f}%" if total_self_cpu > 0 else "0.00%", # Follows profiler's table logic
            "CPU total": format_time(avg.cpu_time_total),
            "CPU time avg": format_time(avg.cpu_time_total / avg.count),
            
            # CUDA Columns
            #"Self CUDA %": f"{avg.self_device_time_total / total_self_cuda * 100:.2f}%" if total_self_cuda > 0 else "0.00%",
            "Self CUDA": format_time(avg.self_device_time_total),
            "CUDA total": format_time(avg.device_time_total),
            "CUDA time avg": format_time(avg.device_time_total / avg.count),
            
            "# of Calls": avg.count,
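            "_self_cuda_raw": avg.self_device_time_total,  # Raw microseconds, used only for sorting
        })
    print(f"total cpu time:{format_time(total_self_cpu)}")
    print(f"total gpu time:{format_time(total_self_cuda)}")
    # Sort on the raw value; the formatted "Self CUDA" strings would sort alphabetically
    df = pd.DataFrame(profiler_data).sort_values(by="_self_cuda_raw", ascending=False)
    return df.drop(columns="_self_cuda_raw")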

In [4]:
# Create two large random tensors
a = torch.randn(5000, 5000)
b = torch.randn(5000, 5000)

# --- 1. CPU MatMul Test ---
start_time = time.time()
c_cpu = torch.matmul(a, b)
cpu_time = time.time() - start_time
print(f"CPU MatMul Time: {cpu_time:.6f} seconds")

# --- 2. GPU MatMul Test ---
a_gpu = a.to("cuda")
b_gpu = b.to("cuda")
torch.cuda.synchronize()

start_time = time.time()
c_gpu = torch.matmul(a_gpu, b_gpu)
torch.cuda.synchronize()
gpu_time = time.time() - start_time

print(f"GPU MatMul Time: {gpu_time:.6f} seconds")
print(f"GPU is {cpu_time/gpu_time:.2f}x faster for MatMul")

CPU MatMul Time: 2.311736 seconds
GPU MatMul Time: 0.014381 seconds
GPU is 160.75x faster for MatMul


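The GPU is faster because it has thousands of simple cores (built for throughput), while the CPU has a few complex cores (built for latency). A matrix multiplication splits into many independent multiply-adds, which is exactly the kind of work that many simple cores can share.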
- Jupyter Notebook Tip: You can also use %%timeit to time a cell.

In [99]:
%%timeit
a = np.arange(10**6) 
np.sum(a**2)

702 µs ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
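
`%%timeit` is a notebook magic. In a regular script, the standard-library `timeit` module gives a comparable measurement. The sketch below repeats the same NumPy computation; the repeat counts are arbitrary choices, not the ones `%%timeit` picked above.

```python
import timeit
import numpy as np

def sum_of_squares():
    a = np.arange(10**6)
    return np.sum(a**2)

CALLS_PER_RUN = 20

# 5 runs of 20 calls each; keep the best run to reduce noise from other processes
runs = timeit.repeat(sum_of_squares, repeat=5, number=CALLS_PER_RUN)
best_per_call = min(runs) / CALLS_PER_RUN
print(f"best of {len(runs)} runs: {best_per_call * 1e6:.1f} µs per call")
```

Note that neither `%%timeit` nor `timeit` synchronizes the GPU, so for CUDA code you still need `torch.cuda.synchronize()` inside the timed region.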


Let's get more information about the GPU.

In [100]:
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    
    print(f"\n--- Internal Specs for {props.name} ---")
    
    # This is the "factory count," the most important number for tuning.
    print(f"Streaming Multiprocessors (SMs): {props.multi_processor_count}")

    print(f"Max threads per Multiprocessors (SMs): {props.max_threads_per_multi_processor}")

else:
    print("CUDA device not found.")


--- Internal Specs for NVIDIA A100-SXM4-80GB ---
Streaming Multiprocessors (SMs): 108
Max threads per Multiprocessors (SMs): 2048
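
To put these numbers in perspective, a short calculation shows how much parallelism they imply. The SM count and thread limit come from the output above. The figure of 64 FP32 CUDA cores per SM is taken from NVIDIA's published A100 specifications and is an assumption here, since PyTorch does not report it.

```python
# Values printed by torch.cuda.get_device_properties above
sm_count = 108
max_threads_per_sm = 2048

# Assumption: 64 FP32 CUDA cores per SM (NVIDIA's A100 specification)
fp32_cores_per_sm = 64

resident_threads = sm_count * max_threads_per_sm
fp32_cores = sm_count * fp32_cores_per_sm

print(f"Threads that can be resident at once: {resident_threads:,}")  # 221,184
print(f"FP32 CUDA cores: {fp32_cores:,}")                              # 6,912
```

This is where "thousands of simple cores" comes from: far more threads can be in flight than there are cores, which lets the GPU switch to other threads while some wait on memory.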


In [101]:
!nvidia-smi

Sun Oct 26 01:12:34 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   28C    P0             86W /  500W |    5401MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

Your goal is to keep Volatile GPU-Util as high as possible without filling up the GPU memory (Memory-Usage).
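
You can also read the memory numbers from inside Python instead of parsing `nvidia-smi`. A minimal sketch using `torch.cuda.mem_get_info` and `torch.cuda.memory_allocated`; `gpu_memory_report` is a hypothetical helper name, and it returns `None` when no CUDA device is present.

```python
import torch

def gpu_memory_report(device=0):
    """Return GPU memory figures in MiB, or None if CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    mib = 1024 ** 2
    return {
        "total_MiB": total_bytes / mib,
        "used_MiB": (total_bytes - free_bytes) / mib,   # all processes on this GPU
        # Memory currently occupied by tensors in this Python process
        "allocated_by_this_process_MiB": torch.cuda.memory_allocated(device) / mib,
    }

print(gpu_memory_report())
```

Calling this before and after creating a large tensor is a quick way to see how much VRAM a step of your code really needs.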

## The Real Enemy: The PCIe Bus

In [122]:
# Create a tensor on the CPU
z_cpu = torch.randn(5000, 5000)
acts=[torch.profiler.ProfilerActivity.CPU,torch.profiler.ProfilerActivity.CUDA]
with torch.profiler.profile(
    activities=acts
) as prof:
    # This loop does almost no compute (one in-place add); it is mostly data transfer
    for _ in range(10):
        z_gpu = z_cpu.to("cuda")
        z_gpu+=z_gpu
        z_back = z_gpu.to("cpu")
to_pd(prof)

total cpu time:891.633ms
total gpu time:1753.928ms


| | Name | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | CUDA total | CUDA time avg | # of Calls |
|---|---|---|---|---|---|---|---|---|---|
| 3 | aten::copy_ | 379.340us | 99.80% | 889.817ms | 44.491ms | 875.846ms | 875.846ms | 43.792ms | 20 |
| 10 | Memcpy DtoH (Device -> Pageable) | 0.000us | 0.00% | 0.000us | 0.000us | 730.451ms | 730.451ms | 73.045ms | 10 |
| 5 | Memcpy HtoD (Pageable -> Device) | 0.000us | 0.00% | 0.000us | 0.000us | 145.395ms | 145.395ms | 14.540ms | 10 |
| 7 | aten::add_ | 230.053us | 0.06% | 530.498us | 53.050us | 1.118ms | 1.118ms | 111.795us | 10 |
| 9 | void at::native::vectorized_elementwise_kernel... | 0.000us | 0.00% | 0.000us | 0.000us | 1.118ms | 1.118ms | 111.795us | 10 |
| 0 | aten::to | 576.055us | 99.94% | 891.090ms | 44.554ms | 0.000us | 875.846ms | 43.792ms | 20 |
| 1 | aten::_to_copy | 240.812us | 99.87% | 890.514ms | 44.526ms | 0.000us | 875.846ms | 43.792ms | 20 |
| 2 | aten::empty_strided | 456.411us | 0.05% | 456.411us | 22.821us | 0.000us | 0.000us | 0.000us | 20 |
| 4 | cudaMemcpyAsync | 889.105ms | 99.72% | 889.105ms | 44.455ms | 0.000us | 0.000us | 0.000us | 20 |
| 6 | cudaStreamSynchronize | 332.207us | 0.04% | 332.207us | 16.610us | 0.000us | 0.000us | 0.000us | 20 |


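Almost all of the GPU time is spent in memory operations (`aten::copy_` and the `Memcpy` calls it launches); the `aten::add_` kernel takes only about 1 ms. The reported total counts each transfer twice, once under `aten::copy_` and once under the underlying `Memcpy`. This is the data moving across the PCIe bus between RAM and VRAM. Your kernel can be infinitely fast, but you'll still be slow if you're bottlenecked by data transfer.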
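
The profiler numbers above translate directly into an effective transfer bandwidth. The calculation below uses only figures from the table (copies of a 5000 x 5000 `float32` tensor). It is back-of-the-envelope arithmetic, not a benchmark.

```python
# One 5000 x 5000 float32 tensor: 4 bytes per element
tensor_bytes = 5000 * 5000 * 4

# Average time per copy, from the "CUDA time avg" column above
h2d_seconds = 14.540e-3   # Memcpy HtoD (Pageable -> Device)
d2h_seconds = 73.045e-3   # Memcpy DtoH (Device -> Pageable)

h2d_gb_per_s = tensor_bytes / h2d_seconds / 1e9
d2h_gb_per_s = tensor_bytes / d2h_seconds / 1e9

print(f"Tensor size: {tensor_bytes / 1e6:.0f} MB")
print(f"Host -> Device: {h2d_gb_per_s:.2f} GB/s")
print(f"Device -> Host: {d2h_gb_per_s:.2f} GB/s")
```

Both figures are tiny compared with the A100's on-board memory bandwidth, which NVIDIA quotes at roughly 2 TB/s for the 80 GB model. Data that stays on the GPU is cheap to reuse; data that crosses the bus is not.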


## Kernel Launch Overhead: Vectorization (batching).

A for loop in Python is not parallel. Writing 
```
for i in range(1000):
```
to process your data one item at a time is the wrong approach. The correct way is to feed the GPU one big batch and let it use all 108 SMs.
Every kernel launch on the GPU carries a fixed overhead, so we want to minimize the number of launches: keep the data in one big matrix rather than repeating the same small operation thousands of times.

In [5]:
import time

# --- BAD: 100,000 tiny kernels ---
a = torch.randn(1, device='cuda')
b = torch.randn(1, device='cuda')

torch.cuda.synchronize()
start = time.time()

for _ in range(100000):
    c = a + b  # A new kernel launch every loop!

torch.cuda.synchronize()
print(f"Time for 100,000 small launches: {time.time() - start:.6f}s")


Time for 100,000 small launches: 0.583946s


In [6]:
# --- GOOD: One big, vectorized kernel ---
a = torch.randn(100000, device='cuda')
b = torch.randn(100000, device='cuda')

torch.cuda.synchronize()
start = time.time()

c = a + b  # One single kernel launch

torch.cuda.synchronize()
print(f"Time for 1 big launch: {time.time() - start:.6f}s")

Time for 1 big launch: 0.000157s


The vectorized (one-launch) version is dramatically faster (roughly 3,700x in the run above), even though it does the same amount of math.
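
The same idea applies to any per-item loop. As a sketch, the example below scales each row of a matrix in a Python loop and then in one broadcasted operation, and checks that both give the same result. The function names and sizes are illustrative, and the code falls back to the CPU when no GPU is available.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def scale_rows_loop(x, weights):
    """Several tiny operations (on a GPU, separate kernel launches) per row."""
    out = torch.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] * weights[i]
    return out

def scale_rows_vectorized(x, weights):
    """A single broadcasted multiply over the whole matrix."""
    return x * weights[:, None]

x = torch.randn(1000, 64, device=device)
weights = torch.randn(1000, device=device)

same = torch.allclose(scale_rows_loop(x, weights), scale_rows_vectorized(x, weights))
print("Loop and vectorized results match:", same)
```

`weights[:, None]` turns the 1-D weight vector into a column, so broadcasting applies one weight to each row without any Python-level loop.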

## Data Loading (The "Starving" GPU)
You've proven the GPU is fast and the PCIe bus is slow. But what about getting data from your disk to your CPU RAM in the first place?
Your GPU is **starving**. It's spending all its time waiting for the CPU to load data (like JPEGs or CSVs) from the disk and prepare the next batch. Your 108 SMs (factories) are idle because the "delivery trucks" (the CPU) are stuck in traffic.

### The "Fix": `DataLoader` Knobs
When you use `torch.utils.data.DataLoader`, you have two magic knobs:

1.  **`num_workers=...`**: This is the *most important* knob. Setting `num_workers=8` (or `16`, `32`) launches that many parallel CPU worker processes (8 in this example). While your GPU is busy on `batch 1`, these workers are *already loading and preprocessing* `batch 2`, `3`, `4`, etc. This **hides** the data-loading latency. On ICRN, you only have 2 CPUs, so don't set this number above 2.
2.  **`pin_memory=True`**: This is a direct hardware optimization. It tells PyTorch to put the CPU-side data in a special "page-locked" (or "pinned") memory region. This makes the `HtoD` (Host-to-Device) copy over the PCIe bus **significantly faster**.
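
One way to keep `num_workers` within the CPU limit is to derive it from the number of CPUs the process is allowed to use. This is a sketch under assumptions: `pick_num_workers` and `make_loader` are hypothetical helpers, the cap of 2 mirrors the ICRN limit mentioned above, and `os.sched_getaffinity` exists only on Linux, so the code falls back to `os.cpu_count()` elsewhere.

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

def pick_num_workers(limit=2):
    """Return a worker count that never exceeds the usable CPUs or the given limit."""
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))   # CPUs this process may run on (Linux)
    else:
        usable = os.cpu_count() or 1
    return max(0, min(limit, usable))

def make_loader(dataset, batch_size=256, limit=2):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=pick_num_workers(limit),
        pin_memory=torch.cuda.is_available(),   # pinning only matters when copying to a GPU
    )

# Tiny in-memory dataset, used here only to show the call
dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 10, (1024,)))
loader = make_loader(dataset)
print("num_workers:", loader.num_workers, "| pin_memory:", loader.pin_memory)
```

Whether these settings actually help depends on the dataset, so it is worth timing an epoch with and without them.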

### What is pinned memory?
- Standard memory: By default, host (CPU) memory is "pageable," meaning the operating system can move it to disk to free up physical RAM for other applications. 
- Pinned memory: When memory is allocated as "pinned," the OS is instructed to lock it in place within physical RAM, creating a stable, predictable memory region. 
- GPU access: Because the memory is stable and its physical address is guaranteed to be constant, the GPU can use DMA to transfer data directly to and from it without CPU involvement. 
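
A minimal sketch of pinned memory in PyTorch: `Tensor.pin_memory()` returns a page-locked copy, `Tensor.is_pinned()` reports the state, and `non_blocking=True` lets the copy from pinned memory run asynchronously. The helper name `to_gpu` is hypothetical, and the function simply returns the tensor unchanged on a machine without CUDA.

```python
import torch

def to_gpu(batch):
    """Copy a CPU tensor to the GPU through pinned memory; no-op without CUDA."""
    if not torch.cuda.is_available():
        return batch
    pinned = batch.pin_memory()                 # page-locked copy in host RAM
    print("staging tensor pinned:", pinned.is_pinned())
    # non_blocking=True can only overlap with other work when the source is pinned
    return pinned.to("cuda", non_blocking=True)

x = torch.randn(256, 1024)
y = to_gpu(x)
print("result lives on:", y.device)
```

Note that `pin_memory()` itself costs time because it allocates and copies, so pinning pays off mainly when the pinned buffer is prepared ahead of time or reused.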

In [30]:
import time
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# --- 1. Load (CIFAR-10) ---
# This will force the loader to read real files from disk.
print("Downloading CIFAR-10 dataset...")
transform = transforms.Compose([transforms.ToTensor()])
# We'll use the training set for our test
train_dataset = torchvision.datasets.CIFAR10(
    root='./data', 
    train=True,
    download=True, 
    transform=transform
)
print("Dataset ready.")

# --- 2. Define a simple model to create a GPU workload ---
# This simulates the "compute" part of a training loop.
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A few simple layers to keep the GPU busy
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32*32*3, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )
    def forward(self, x):
        return self.net(x)

# --- 3. The "Bad" Way (pin_memory=False) ---
bad_loader = DataLoader(
    train_dataset, 
    batch_size=256,
    num_workers=2,      # 2 worker processes (ICRN provides 2 CPUs)
    pin_memory=False    # Batches stay in pageable memory
)

# --- 4. The "Good" Way (pin_memory=True) ---
good_loader = DataLoader(
    train_dataset, 
    batch_size=256,
    num_workers=2,      # 2 worker processes (ICRN provides 2 CPUs)
    pin_memory=True     # Speeds up the final CPU -> GPU copy
)

# --- 5. The Training Simulation ---
def run_epoch(loader, model):
    """Simulates one full epoch of training."""
    for inputs, labels in loader:
        # Move data to GPU
        inputs_gpu = inputs.to("cuda")
        labels_gpu = labels.to("cuda")
        
        # --- Simulate Compute ---
        # Forward pass
        outputs = model(inputs_gpu)
        # We'll just use the output as the "loss" for simplicity
        # Backward pass
        outputs.sum().backward()
        # --- End Compute ---

# --- Test 1: Time the Bad Loader ---
print("\nTesting Bad Loader (num_workers=0)...")
model = SimpleModel().to("cuda") # Reset model on GPU

torch.cuda.synchronize()
start_time = time.time()

run_epoch(bad_loader, model)

torch.cuda.synchronize()
bad_time = time.time() - start_time
print(f"  Time for Bad Loader: {bad_time:.6f}s")


# --- Test 2: Time the Good Loader ---
print("\nTesting Good Loader (pin_memory=True)...")
model = SimpleModel().to("cuda") # Reset model on GPU

torch.cuda.synchronize()
start_time = time.time()

run_epoch(good_loader, model)

torch.cuda.synchronize()
good_time = time.time() - start_time
print(f"  Time for Good Loader: {good_time:.6f}s")

print(f"\nSpeedup: {bad_time / good_time:.2f}x")

Downloading CIFAR-10 dataset...
Dataset ready.

Testing Bad Loader (num_workers=0)...
  Time for Bad Loader: 3.596519s

Testing Good Loader (pin_memory=True)...
  Time for Good Loader: 3.586690s

Speedup: 1.00x


In [27]:
import time
from torch.utils.data import DataLoader, TensorDataset

# --- Create a larger dataset to make the test obvious ---
# 500,000 samples, 1024 features
print("Creating large dummy dataset...")
dummy_dataset = TensorDataset(torch.randn(500000, 1024), torch.randn(500000))
print("Dataset created.")

# --- The "Bad" Way (Default) ---
bad_loader = DataLoader(
    dummy_dataset, 
    batch_size=256,
    num_workers=0,      # Main process builds every batch (the data is already in RAM)
    pin_memory=False
)

# --- The "Good" Hardware-Aware Way ---
good_loader = DataLoader(
    dummy_dataset, 
    batch_size=256,
    num_workers=0,      # Main process builds every batch (the data is already in RAM)
    pin_memory=True
)

# --- Test 1: Time the Bad Loader ---
print("\nTesting Bad Loader (num_workers=0)...")
torch.cuda.synchronize() # Wait for GPU to be idle
start_time = time.time()

for inputs, labels in bad_loader:
    # This loop simulates a training loop by moving data to the GPU
    inputs_gpu = inputs.to("cuda")
    labels_gpu = labels.to("cuda")

torch.cuda.synchronize() # Wait for all copies to finish
bad_time = time.time() - start_time
print(f"  Time for Bad Loader: {bad_time:.6f}s")


# --- Test 2: Time the Good Loader ---
print("\nTesting Good Loader (pin_memory=True)...")
torch.cuda.synchronize() # Wait for GPU to be idle
start_time = time.time()

for inputs, labels in good_loader:
    # This loop simulates a training loop by moving data to the GPU
    inputs_gpu = inputs.to("cuda")
    labels_gpu = labels.to("cuda")

torch.cuda.synchronize() # Wait for all copies to finish
good_time = time.time() - start_time
print(f"  Time for Good Loader: {good_time:.6f}s")

print(f"\nSpeedup: {bad_time / good_time:.2f}x")

Creating large dummy dataset...
Dataset created.

Testing Bad Loader (num_workers=0)...
  Time for Bad Loader: 2.561420s

Testing Good Loader (pin_memory=True)...
  Time for Good Loader: 109.136102s

Speedup: 0.02x


Activity:
Play around with different parameters (different `num_workers` values, `pin_memory=True`/`False`) for the two experiments above.

#### The Bottleneck: num_workers=0 + pin_memory=True
**What `num_workers=0` does:** This tells the DataLoader to fetch all data in the main process. No separate processes are spawned. Your for loop cannot continue until the data loading for the current batch is 100% complete.

**What `pin_memory=True` does:** This tells the DataLoader, "After you fetch a batch, please copy it from standard (pageable) RAM into pinned (page-locked) RAM."

**The catch:** Allocating new pinned memory is a very slow, expensive operation for the operating system.

**Your "Good" loader's slow loop:** For every single batch, your main thread is forced to:

1. Fetch the batch data (from RAM, which is fast).
2. Stall: explicitly allocate a new block of pinned memory (very, very slow).
3. Copy the batch data into that new pinned memory (also slow).
4. Finally, return the batch to your loop.

`inputs.to("cuda")` is now fast (a direct DMA transfer), but it doesn't matter: you already spent a massive amount of time on steps 2 and 3.

## The A100's "Superpower": Tensor Cores & Mixed Precision
Your NVIDIA A100 has specialized hardware called **Tensor Cores**. These are like special-purpose "mini-factories" inside your SMs that do matrix multiplication *insanely* fast.

There's one catch: they **need reduced-precision data** such as `float16` (half precision), `bfloat16`, or TF32. Full-precision `float32` math runs on the regular CUDA cores instead.

Using `float16` gives you two massive wins:
1.  **2-3x Speedup:** By using the fast Tensor Cores.
2.  **50% Memory Reduction:** `float16` tensors take half the VRAM of `float32`.

The problem? `float16` can be unstable for some operations (like `softmax`). The solution is **Automatic Mixed Precision (AMP)**.

### The Fix: `torch.cuda.amp.autocast`
PyTorch's `autocast` will automatically run "safe" operations in `float32` but switch to `float16` for "fast" operations (like `matmul` and `conv2d`) to use the Tensor Cores.
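
To see what `autocast` does to individual operations, you can inspect output dtypes inside the context. The sketch below uses the device-agnostic `torch.autocast` entry point. On a machine without CUDA it falls back to CPU autocast with `bfloat16`, which is an assumption made only so the example runs anywhere.

```python
import torch

if torch.cuda.is_available():
    device, low_precision = "cuda", torch.float16
else:
    device, low_precision = "cpu", torch.bfloat16   # fallback so the sketch runs without a GPU

a = torch.randn(64, 64, device=device)   # float32 inputs
b = torch.randn(64, 64, device=device)

with torch.autocast(device_type=device, dtype=low_precision):
    product = a @ b                        # matmul runs in the low-precision dtype
    probs = torch.softmax(product, dim=1)  # on CUDA, softmax is autocast back to float32

print("inputs :", a.dtype)
print("matmul :", product.dtype)
print("softmax:", probs.dtype)
```

The inputs stay `float32` throughout; only the operations inside the `with` block choose a different precision, which is why no manual `.half()` calls are needed.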

In [31]:
import torch
import time

# --- Configuration ---
# Use a large matrix size to make the difference obvious
# Tensor Cores are activated for specific dimensions (multiples of 8 or 64)
N = 4096
# Number of benchmark iterations
ITERATIONS = 500
# Warmup iterations
WARMUP = 50

def benchmark(fn, *args):
    """
    A simple benchmarking function with GPU warmup and synchronization.
    """
    # Warmup runs
    for _ in range(WARMUP):
        fn(*args)
    
    # Wait for all GPU operations to finish
    torch.cuda.synchronize()
    
    # Start timing
    start_time = time.time()
    
    # Main benchmark loop
    for _ in range(ITERATIONS):
        fn(*args)
        
    # Wait for all GPU operations to finish
    torch.cuda.synchronize()
    
    # Stop timing
    end_time = time.time()
    
    # Calculate average time per iteration
    avg_time = (end_time - start_time) / ITERATIONS
    return avg_time

# --- Main Comparison ---
if not torch.cuda.is_available():
    print("CUDA is not available. This benchmark requires an NVIDIA GPU.")
    exit()

if torch.cuda.get_device_capability()[0] < 8:
    print(f"Warning: Your GPU (Compute Capability {torch.cuda.get_device_capability()})")
    print("is not an Ampere architecture (A100) or newer.")
    print("Tensor Core results (TF32) may not be as dramatic.")

print(f"Running on GPU: {torch.cuda.get_device_name(0)}")
print(f"Matrix Size: {N}x{N}")
print(f"Iterations: {ITERATIONS} (after {WARMUP} warmup runs)\n")

# --- 1. Test Pure FP32 (Standard CUDA Cores) ---
# We force 'highest' precision to *disable* TF32 and ensure
# we are using the standard FP32 CUDA cores.
print("--- Test 1: Pure FP32 (Standard CUDA Cores) ---")
torch.set_float32_matmul_precision('highest')
print(f"Current FP32 Precision: {torch.get_float32_matmul_precision()}")

# Create standard float32 tensors
a_fp32 = torch.randn(N, N, device='cuda', dtype=torch.float32)
b_fp32 = torch.randn(N, N, device='cuda', dtype=torch.float32)

# Benchmark the matmul operation
fp32_time = benchmark(torch.matmul, a_fp32, b_fp32)
print(f"Average Time: {fp32_time * 1000:.4f} ms\n")


# --- 2. Test TF32 (Tensor Cores) ---
# We set precision to 'high' to *enable* TF32.
# The inputs are *still* float32, but PyTorch
# will use the Tensor Cores for the internal calculation.
print("--- Test 2: TF32 (Tensor Cores) ---")
try:
    torch.set_float32_matmul_precision('high')
    print(f"Current FP32 Precision: {torch.get_float32_matmul_precision()}")

    # We can re-use the *exact same* fp32 tensors
    tf32_time = benchmark(torch.matmul, a_fp32, b_fp32)
    print(f"Average Time: {tf32_time * 1000:.4f} ms\n")

    # --- 3. Test FP16 (Tensor Cores) ---
    # This is the fastest path. We use float16 (half-precision) data.
    # This fully leverages the A100's FP16 Tensor Core capabilities.
    print("--- Test 3: FP16 (Tensor Cores) ---")
    
    # Create float16 tensors
    a_fp16 = a_fp32.half() # .half() is a shortcut for .to(torch.float16)
    b_fp16 = b_fp32.half()
    
    # Benchmark the matmul operation
    fp16_time = benchmark(torch.matmul, a_fp16, b_fp16)
    print(f"Average Time: {fp16_time * 1000:.4f} ms\n")


    # --- Results ---
    print("--- Summary ---")
    print(f"Pure FP32 (CUDA Cores): {fp32_time * 1000:.4f} ms")
    print(f"TF32 (Tensor Cores):    {tf32_time * 1000:.4f} ms")
    print(f"FP16 (Tensor Cores):    {fp16_time * 1000:.4f} ms")
    print("-----------------")
    print(f"TF32 vs Pure FP32 Speedup: {fp32_time / tf32_time:.2f}x")
    print(f"FP16 vs Pure FP32 Speedup: {fp32_time / fp16_time:.2f}x")

except RuntimeError as e:
    print(f"Could not complete benchmark. Your GPU may not support a required setting.")
    print(f"Error: {e}")

Running on GPU: NVIDIA A100-SXM4-80GB
Matrix Size: 4096x4096
Iterations: 500 (after 50 warmup runs)

--- Test 1: Pure FP32 (Standard CUDA Cores) ---
Current FP32 Precision: highest
Average Time: 7.2240 ms

--- Test 2: TF32 (Tensor Cores) ---
Current FP32 Precision: high
Average Time: 1.0666 ms

--- Test 3: FP16 (Tensor Cores) ---
Average Time: 0.5358 ms

--- Summary ---
Pure FP32 (CUDA Cores): 7.2240 ms
TF32 (Tensor Cores):    1.0666 ms
FP16 (Tensor Cores):    0.5358 ms
-----------------
TF32 vs Pure FP32 Speedup: 6.77x
FP16 vs Pure FP32 Speedup: 13.48x


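You should see a substantial speedup from that one-line change (`torch.set_float32_matmul_precision('high')`): 6.77x for TF32 in the run above, and 13.48x after also switching the data to `float16`.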
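
The measured times can be converted into achieved throughput, which makes them easier to compare with the hardware's limits. An N x N matrix multiplication needs about 2N³ floating-point operations. The peak figures below are NVIDIA's published dense A100 numbers and are included as an outside reference, not something measured here.

```python
N = 4096
flops = 2 * N**3   # one multiply and one add per inner-product term

# Average times from the benchmark output above, in seconds
measured = {
    "FP32 (CUDA cores)":   7.2240e-3,
    "TF32 (Tensor Cores)": 1.0666e-3,
    "FP16 (Tensor Cores)": 0.5358e-3,
}

# Assumption: NVIDIA's quoted dense peak for the A100, in TFLOP/s
peak = {
    "FP32 (CUDA cores)":   19.5,
    "TF32 (Tensor Cores)": 156,
    "FP16 (Tensor Cores)": 312,
}

achieved = {name: flops / seconds / 1e12 for name, seconds in measured.items()}
for name, tflops in achieved.items():
    print(f"{name}: {tflops:6.1f} TFLOP/s ({tflops / peak[name]:.0%} of quoted peak)")
```

Seen this way, the FP32 run was already close to what the CUDA cores can deliver, so the only way to go faster was to move the work onto the Tensor Cores.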

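In a real training loop (like for your ANN homework), it looks like this.
(This sketch reuses `SimpleModel` and `train_dataset` from the data-loading section; the loss function, optimizer, and learning rate are placeholder choices.)

```python
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
# Newer PyTorch releases also offer torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda")
from torch.cuda.amp import autocast, GradScaler

train_loader = DataLoader(train_dataset, batch_size=256, num_workers=2, pin_memory=True)
model = SimpleModel().to("cuda")
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)  # placeholder optimizer settings

# Initialize the scaler
scaler = GradScaler()

# --- In your training loop ---
for inputs, labels in train_loader:
    inputs, labels = inputs.to("cuda"), labels.to("cuda")
    
    optimizer.zero_grad()
    
    # --- This is the magic ---
    # Run the forward pass and the loss computation under autocast
    with autocast():
        # Runs MatMul/Conv in float16
        # Runs Softmax/Loss in float32
        predictions = model(inputs)
        loss = loss_fn(predictions, labels)
    # --- End magic ---
    
    # Scale the loss so that small float16 gradients do not underflow to zero
    scaler.scale(loss).backward()
    # Unscale the gradients and step (the step is skipped if they contain inf/NaN)
    scaler.step(optimizer)
    # Adjust the scale factor for the next iteration
    scaler.update()

print("Training finished!")
```
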
## Resources

- [ICRN docs](https://docs.ncsa.illinois.edu/systems/icrn/en/latest/index.html)
- [Cornell GPU workshop](https://cvw.cac.cornell.edu/gpu-architecture/gpu-characteristics/index)
- [LeetGPU: a GPU version of LeetCode](https://leetgpu.com/challenges)