# Supercomputer Usage 101

## Accessing resources:
Go to the Illinois [ICRN webpage](https://docs.ncsa.illinois.edu/systems/icrn/).

To use GPU resources, select "A100 GPU 2CPU/8GB". 

## Create an environment:

The point of a virtual environment is to isolate project dependencies so that different projects can use different package versions without conflict. This creates a "sandbox" for each project, containing its own specific Python interpreter and installed libraries, which makes development more organized and reproducible. 
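
To see which sandbox a Python process is actually using, a minimal sketch with only the standard library may help. The helper name `describe_environment` is made up for this illustration; `sys.executable` and `sys.prefix` are standard attributes.

```python
import sys


def describe_environment():
    """Return where the running interpreter lives (hypothetical helper)."""
    return {
        "executable": sys.executable,        # path of the Python binary in use
        "prefix": sys.prefix,                # root folder of the active environment
        "version": sys.version.split()[0],   # e.g. "3.11.9"
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Once an environment is active, `prefix` should point at that environment's folder rather than at the system Python.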

1. First, shut down all kernels.
2. Run the command below in a terminal:
   - `~`: your home directory, equivalent to `/home/NET_ID`.
   - `--prefix`: tells mamba where to install your environment.
```
mamba create --prefix ~/myenv python tensorflow[and-cuda]=2.17 ipykernel pytorch pandas seaborn tqdm matplotlib pytorch-cuda -c pytorch -c nvidia -c conda-forge
```

This will take some time. 

3. Activate your environment (use the same path you passed to `--prefix`):
```
source activate ~/myenv
```
4. Register the environment as a Jupyter kernel named `session`:
```
python -m ipykernel install --user --name=session
```
Click "+" to open the launcher, and you will see that your `session` kernel has been added. You can also switch to it from the Kernel menu by selecting Change kernel. To learn more about Mamba, see the [Mamba documentation](https://mamba.readthedocs.io/en/latest/index.html).
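
After switching to the new kernel, it can be worth confirming that it sees the GPU. The sketch below is one way to do that; `gpu_report` is a hypothetical helper name, and the import is guarded so the cell also runs in a kernel where PyTorch is missing.

```python
import importlib.util


def gpu_report():
    """Describe whether this kernel can reach a CUDA GPU (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this kernel"
    import torch

    if not torch.cuda.is_available():
        return "torch is installed, but no CUDA device is visible"
    return f"CUDA device: {torch.cuda.get_device_name(0)}"


print(gpu_report())
```

If the report says no CUDA device is visible, check that the GPU server option was selected and that the notebook is running on the new kernel.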

### Common bash commands
- `pwd -P`: show the absolute path of the current directory.
- `cat [file]`: print the entire contents of a file.
- `ls [folder]`: list everything in a folder (the current folder if none is given).
- `nvidia-smi`: show the current GPU status.
- `rm [file]`: remove a file.
- `source [script]`: run a script in the current shell.
- `cp [src] [dest]`: copy a file from one location to another. To copy a folder, use `cp -r`.


## What Is a GPU, and Why Use One?

```python
import time

import numpy as np
import pandas as pd
import torch
```

```python
# Create two large random tensors
a = torch.randn(10000, 10000)
b = torch.randn(10000, 10000)

# --- 1. CPU Test ---
# perf_counter() is a monotonic, high-resolution clock meant for timing
start_time = time.perf_counter()
c_cpu = a + b
cpu_time = time.perf_counter() - start_time
print(f"CPU Time: {cpu_time:.6f} seconds")

# --- 2. GPU Test ---
# Move data to the GPU (over the PCIe bus)
a_gpu = a.to("cuda")
b_gpu = b.to("cuda")

# We must synchronize to get an accurate time!
# This waits for the GPU to finish its work.
torch.cuda.synchronize()

start_time = time.perf_counter()
c_gpu = a_gpu + b_gpu
torch.cuda.synchronize()
gpu_time = time.perf_counter() - start_time

print(f"GPU Time: {gpu_time:.6f} seconds")
print(f"GPU is {cpu_time/gpu_time:.2f}x faster")
```
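
A single timed run can be noisy, and the first GPU call often includes one-time setup cost. The sketch below wraps the same pattern (warm up, synchronize, time) in a small reusable function. `time_fn` and its parameters are hypothetical names for this illustration; the demo at the bottom is CPU-only so that it runs anywhere.

```python
import time


def time_fn(fn, sync=None, warmup=1, repeats=5):
    """Return the best wall-clock time of fn() in seconds (hypothetical helper).

    sync is an optional callable that blocks until queued device work is done;
    for PyTorch on a GPU, pass torch.cuda.synchronize.
    """
    for _ in range(warmup):              # warm-up runs are not timed
        fn()
    if sync is not None:
        sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()                       # wait for the device before stopping the clock
        best = min(best, time.perf_counter() - start)
    return best


# CPU-only demo so the sketch runs without a GPU.
data = list(range(100_000))
print(f"sum over 100,000 ints: {time_fn(lambda: sum(data)):.6f} s")
```

With the tensors from the cell above, `time_fn(lambda: a_gpu + b_gpu, sync=torch.cuda.synchronize)` would time the GPU addition in the same way.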

Tip: You can also use `%%timeit` to time a cell.

## GPU 

```python
%%timeit
a = np.arange(10**6)
np.sum(a**2)
```
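
`%%timeit` is an IPython magic, so it only works inside a notebook or IPython shell. In a plain Python script, the standard-library `timeit` module does a similar job. The sketch below uses a pure-Python stand-in function (`sum_of_squares`, a name chosen for this example) so that it has no dependencies.

```python
import timeit


def sum_of_squares(n=10**5):
    """Pure-Python stand-in for the NumPy expression above."""
    return sum(i * i for i in range(n))


# repeat() returns one total time per round; each round calls the function `number` times.
rounds = timeit.repeat(sum_of_squares, number=10, repeat=3)
print(f"best of 3 rounds: {min(rounds) / 10 * 1e3:.3f} ms per call")
```

As with manual timing, GPU code would still need `torch.cuda.synchronize()` inside the timed call to give a meaningful number.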

```python
!nvidia-smi
```

For data-parallel work like this, the GPU is faster because it has thousands of simple cores (optimized for throughput), while the CPU has a few complex cores (optimized for latency).
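
The throughput-versus-latency trade-off can be sketched with a toy model. All numbers below are invented for illustration and do not describe any real CPU or GPU: the "CPU" has few cores that finish one item quickly, and the "GPU" has many cores that are each slower per item.

```python
import math


def batch_time_ns(n_items, cores, ns_per_item):
    """Time to process n_items when each core handles one item at a time."""
    waves = math.ceil(n_items / cores)   # rounds needed to cover all items
    return waves * ns_per_item


# Hypothetical devices: numbers chosen for illustration only.
CPU = {"cores": 8, "ns_per_item": 1}        # few cores, fast per item
GPU = {"cores": 4096, "ns_per_item": 10}    # many cores, slower per item

for n in (1, 1_000, 1_000_000):
    cpu_ns = batch_time_ns(n, **CPU)
    gpu_ns = batch_time_ns(n, **GPU)
    print(f"n={n:>9,}: CPU {cpu_ns:>7,} ns | GPU {gpu_ns:>5,} ns")
```

In this toy model the CPU wins for a single item, and the GPU pulls ahead once there is enough independent work to keep its cores busy.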

## The Real Enemy: The PCIe Bus

```python
def format_time(time_us):
    """Converts microseconds to a formatted ms or us string"""
    if time_us == 0:
        return "0.000us"
    if abs(time_us) > 1000:
        return f"{time_us / 1000:.3f}ms"
    return f"{time_us:.3f}us"
```

```python
def to_pd(prof):
    """Summarize profiler results as a DataFrame sorted by total CUDA time."""
    key_averages = prof.key_averages()
    total_self_cpu = key_averages.self_cpu_time_total
    # Sum the self device (CUDA) time over all profiled ops.
    total_self_cuda = sum(avg.self_device_time_total for avg in key_averages)
    profiler_data = []
    for avg in key_averages:
        profiler_data.append({
            "Name": avg.key,

            # CPU Columns
            #"Self CPU %": f"{avg.self_cpu_time_total / total_self_cpu * 100:.2f}%" if total_self_cpu > 0 else "0.00%",
            "Self CPU": format_time(avg.self_cpu_time_total),
            "CPU total %": f"{avg.cpu_time_total / total_self_cpu * 100:.2f}%" if total_self_cpu > 0 else "0.00%", # Follows profiler's table logic
            "CPU total": format_time(avg.cpu_time_total),
            "CPU time avg": format_time(avg.cpu_time_total / avg.count),

            # CUDA Columns
            #"Self CUDA %": f"{avg.self_device_time_total / total_self_cuda * 100:.2f}%" if total_self_cuda > 0 else "0.00%",
            "Self CUDA": format_time(avg.self_device_time_total),
            "CUDA total": format_time(avg.device_time_total),
            "CUDA time avg": format_time(avg.device_time_total / avg.count),

            "# of Calls": avg.count,
            "_cuda_total_raw": avg.device_time_total # Internal column just for sorting
        })
    print(f"total self CPU time: {format_time(total_self_cpu)}")
    print(f"total self CUDA time: {format_time(total_self_cuda)}")
    return pd.DataFrame(profiler_data).sort_values(by="_cuda_total_raw", ascending=False)
```
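
For a quick look without building a DataFrame, the profiler can also print its own summary table. The sketch below wraps that call in a hypothetical helper, `print_top_ops`. It assumes a `prof` object produced by `torch.profiler.profile`, and `"cuda_time_total"` is the sort key that `table()` accepts in recent PyTorch releases.

```python
def print_top_ops(prof, row_limit=10):
    """Print the profiler's built-in summary, sorted by total CUDA time.

    Hypothetical helper; prof is expected to come from torch.profiler.profile.
    """
    table = prof.key_averages().table(sort_by="cuda_time_total", row_limit=row_limit)
    print(table)
    return table
```

Calling `print_top_ops(prof)` after a profiled block shows the most expensive operations first, which makes a cross-check against a custom summary straightforward.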

```python
import torch.profiler

# Create a tensor on the CPU
z_cpu = torch.randn(5000, 5000)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    # This loop does NO compute, just data transfer
    for _ in range(10):
        z_gpu = z_cpu.to("cuda")
        z_back = z_gpu.to("cpu")

# Show where the time went
to_pd(prof)
```

```python
import torch.profiler

a_cpu = torch.randn(2000, 2000)
b_cpu = torch.randn(2000, 2000)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    for _ in range(10):
        # --- BAD: Move to GPU inside the loop ---
        a_gpu = a_cpu.to("cuda")
        b_gpu = b_cpu.to("cuda")

        c_gpu = torch.matmul(a_gpu, b_gpu)

        # --- BAD: Move back to CPU inside the loop ---
        c_cpu = c_gpu.to("cpu")

to_pd(prof)
```

```python
# --- GOOD: Move data ONCE ---
a_gpu = a_cpu.to("cuda")
b_gpu = b_cpu.to("cuda")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    for _ in range(10):
        # --- All computation stays on the GPU ---
        c_gpu = torch.matmul(a_gpu, b_gpu)

# --- GOOD: Bring final result back ONCE ---
c_cpu = c_gpu.to("cpu")

to_pd(prof)
```


In the bad version, most of the time is spent in `cudaMemcpy`. This is the data moving across the PCIe bus between RAM and VRAM. Your kernel can be infinitely fast, but you'll still be slow if you're bottlenecked by data transfer.
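
A back-of-the-envelope estimate shows why the copies add up. The sketch below computes how long one transfer should take for the float32 tensor shapes used in the cells above. The default bandwidth of 25 GB/s is an assumption, not a measurement: it is in the range a PCIe 4.0 x16 link can deliver, and real copies from ordinary (pageable) memory are often slower. `transfer_time_ms` is a hypothetical helper name.

```python
def transfer_time_ms(shape, bytes_per_element=4, bandwidth_gb_per_s=25.0):
    """Estimate one host<->device copy time in milliseconds.

    bandwidth_gb_per_s is an assumed effective PCIe rate, not a measurement.
    """
    n_bytes = bytes_per_element
    for dim in shape:
        n_bytes *= dim
    return n_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# float32 tensor shapes from the cells above
for shape in [(2000, 2000), (5000, 5000)]:
    one_way = transfer_time_ms(shape)
    print(f"{shape}: {one_way:.2f} ms one way, {2 * one_way:.2f} ms round trip")
```

Under this assumption a 5000 x 5000 tensor (100 MB) costs about 4 ms per direction, so a loop that copies it back and forth pays that price on every iteration.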

## Kernel Launch Overhead

```python
import time

# --- BAD: 100,000 tiny kernels ---
a = torch.randn(1, device='cuda')
b = torch.randn(1, device='cuda')

torch.cuda.synchronize()
start = time.perf_counter()

for _ in range(100000):
    c = a + b  # A new kernel launch every loop!

torch.cuda.synchronize()
print(f"Time for 100,000 small launches: {time.perf_counter() - start:.6f}s")
```


```python
import time

# --- GOOD: One big, vectorized kernel ---
a = torch.randn(100000, device='cuda')
b = torch.randn(100000, device='cuda')

torch.cuda.synchronize()
start = time.perf_counter()

c = a + b  # One single kernel launch

torch.cuda.synchronize()
print(f"Time for 1 big launch: {time.perf_counter() - start:.6f}s")
```

The vectorized (one-launch) version will be dramatically faster, even though it's doing the same amount of math. Every kernel launch carries a fixed overhead, and the loop pays that overhead 100,000 times.
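
The effect can be reasoned about with a simple cost model: each launch pays a fixed overhead plus time proportional to the number of elements it processes. The constants below (5 microseconds per launch, 0.0001 microseconds per element) are placeholders chosen for illustration, not measurements of any GPU, and `total_time_us` is a hypothetical helper name.

```python
def total_time_us(n_launches, elements_per_launch,
                  launch_overhead_us=5.0, us_per_element=0.0001):
    """Toy cost model: each launch pays a fixed overhead plus per-element work."""
    per_launch = launch_overhead_us + elements_per_launch * us_per_element
    return n_launches * per_launch


many_small = total_time_us(n_launches=100_000, elements_per_launch=1)
one_big = total_time_us(n_launches=1, elements_per_launch=100_000)

print(f"100,000 launches of 1 element: {many_small / 1e6:.4f} s")
print(f"1 launch of 100,000 elements:  {one_big / 1e6:.6f} s")
print(f"ratio: {many_small / one_big:,.0f}x")
```

Both cases process 100,000 elements in total. In this model the difference comes almost entirely from paying the launch overhead 100,000 times instead of once.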

## Resources

- [ICRN docs](https://docs.ncsa.illinois.edu/systems/icrn/en/latest/index.html)
- [Cornell GPU workshop](https://cvw.cac.cornell.edu/gpu-architecture/gpu-characteristics/index)
- [LeetGPU: a GPU version of LeetCode](https://leetgpu.com/challenges)