# **DD2360 Project – Acceleration of LLM.c**

Group members:  
- Diogo Paulo  
- Hugo Dezerto  
- Maria Carolina Sebastião  

Platform: Google Colab (NVIDIA Tesla T4)

---
## Project Reporting Checklist

To fully meet the project instructions and grading criteria, make sure your report and code include the following:


### 1. **Problem Introduction**
- Briefly explain what GPT-2 is, why it is compute-intensive, and why GPU acceleration is beneficial.

### 2. **Initial Design**
- Summarize how the original code works (e.g., custom CUDA kernels for matmul, softmax, etc.).

### 3. **Profiling Method**
- State which profiler you used (e.g., nvprof, Nsight), how you ran it, and what metrics you collected.

### 4. **Profiling Results**
- Include a table or figure showing kernel time shares, step time, and throughput.

### 5. **Bottleneck Analysis & Optimization Plan**
- Clearly state which kernel is the main bottleneck and what optimization strategy you plan (e.g., replace custom matmul with cuBLAS).

### 6. **Results Visualization**
- Add figures or charts (e.g., bar chart of kernel time shares, before/after step time).

### 7. **Improved Metrics**
- After optimization, show the new step time, throughput, and kernel time share, and compare to the baseline.

### 8. **Code & Documentation**
- Ensure your code is well-commented.
- Include a README with clear instructions for compiling, running, and checking outputs.

### 9. **Presentation**
- Slides should clearly state objectives, methodology, results, and findings.
- Use visual aids (figures/charts) to present profiling and optimization results.
- Discuss findings and any limitations or trade-offs.


**Summary Table Example:**

| Section                | What to Include                                                                 |
|------------------------|--------------------------------------------------------------------------------|
| Problem Introduction   | What is GPT-2, why GPU, what is being accelerated                              |
| Initial Design         | How the original code works (kernels, data flow)                               |
| Profiling Method       | Which tool, how you ran it, what you measured                                  |
| Profiling Results      | Table/figure of kernel time shares, step time, throughput                      |
| Bottleneck Analysis    | Which kernel is slowest, why, and what you plan to do                          |
| Optimization           | What you changed (e.g., cuBLAS), how you did it                                |
| Improved Metrics       | New profiling results, comparison to baseline                                  |
| Visualization          | Figures/charts for before/after performance                                    |
| Code/README            | Clear instructions, comments                                                   |
| Presentation           | Slides with objectives, methods, results, and discussion                       |


**If you include all these points, you will fully meet the project requirements!**

---

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


---
## **Repository Setup**

In [2]:
!git clone https://github.com/hdezerto/applied_gpu_project.git

Cloning into 'applied_gpu_project'...
remote: Enumerating objects: 134, done.
remote: Counting objects: 100% (134/134), done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 134 (delta 19), reused 133 (delta 18), pack-reused 0 (from 0)
Receiving objects: 100% (134/134), 322.93 KiB | 1.75 MiB/s, done.
Resolving deltas: 100% (19/19), done.


In [3]:
%cd /content/applied_gpu_project

/content/applied_gpu_project


In [4]:
!git fetch --all


Fetching origin


In [5]:
!git branch -a

* main
  remotes/origin/HEAD -> origin/main
  remotes/origin/hugo
  remotes/origin/main
  remotes/origin/original


---
## **Baseline**

In [6]:
!git checkout original

Branch 'original' set up to track remote branch 'original' from 'origin'.
Switched to a new branch 'original'


In [7]:
%cd /content/applied_gpu_project/llm.c

/content/applied_gpu_project/llm.c


**Set up data + weights**

Download the TinyShakespeare dataset, the GPT-2 tokenizer, and the GPT-2 124M weights.

In [8]:
!bash dev/download_starter_pack.sh

Downloaded tiny_shakespeare_val.bin to /content/applied_gpu_project/llm.c/dev/data/tinyshakespeare/tiny_shakespeare_val.bin
Downloaded gpt2_tokenizer.bin to /content/applied_gpu_project/llm.c/dev/../gpt2_tokenizer.bin
Downloaded tiny_shakespeare_train.bin to /content/applied_gpu_project/llm.c/dev/data/tinyshakespeare/tiny_shakespeare_train.bin
Downloaded gpt2_124M_bf16.bin to /content/applied_gpu_project/llm.c/dev/../gpt2_124M_bf16.bin
Downloaded gpt2_124M.bin to /content/applied_gpu_project/llm.c/dev/../gpt2_124M.bin
Downloaded gpt2_124M_debug_state.bin to /content/applied_gpu_project/llm.c/dev/../gpt2_124M_debug_state.bin
Downloaded hellaswag_val.bin to /content/applied_gpu_project/llm.c/dev/data/hellaswag/hellaswag_val.bin
All files downloaded and saved in their respective directories


**Compile the FP32 training code:**

In [13]:
!make clean
!make train_gpt2fp32cu NVCC_FLAGS="-gencode arch=compute_75,code=sm_75"

---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✓ MPI enabled
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
rm -f train_gpt2 test_gpt2 train_gpt2cu test_gpt2cu train_gpt2fp32cu test_gpt2fp32cu 
rm -f build/*.o
---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✓ MPI enabled
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc -gencode arch=compute_75,code=sm_75 train_gpt2_fp32.cu -lcublas -lcublasLt -lnvidia-ml -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/usr/lib/x86_64-linux-gnu/openmpi/include/ -lnccl -lmpi -o train_gpt2fp32cu
  static cublasComputeType_t cublas_compute_type;
                             ^


**Compile the test:**

In [16]:
!make clean
!make test_gpt2fp32cu NVCC_FLAGS="-gencode arch=compute_75,code=sm_75"

---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✓ MPI enabled
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
rm -f train_gpt2 test_gpt2 train_gpt2cu test_gpt2cu train_gpt2fp32cu test_gpt2fp32cu 
rm -f build/*.o
---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✓ MPI enabled
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc -gencode arch=compute_75,code=sm_75 test_gpt2_fp32.cu -lcublas -lcublasLt -lnvidia-ml -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/usr/lib/x86_64-linux-gnu/openmpi/include/ -lnccl -lmpi -o test_gpt2fp32cu


**Run the Baseline**

- Execute the training binary:

In [14]:
!./train_gpt2fp32cu

+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log file       | NULL                                               |
| batch size B          | 4                                                  |
| sequence length T     | 1024                                               |
| learning rate         | 0.000300                                           |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                       

- Run the test for correctness:

In [17]:
!./test_gpt2fp32cu

[System]
Device 0: Tesla T4
enable_tf32: 0
[State]
batch_size: 4
seq_len: 64
allocated 221 MiB for activations
-43.431618, -43.431660
-39.836346, -39.836376
-43.065910, -43.065952
-42.828045, -42.828068
-43.529541, -43.529579
-44.318398, -44.318439
-41.227425, -41.227459
-41.270760, -41.270802
-42.541393, -42.541447
-42.394997, -42.395035
OK (LOGITS)
allocated 474 MiB for parameter gradients
allocated 4 MiB for activation gradients
LOSS OK: 5.270009 5.270009
grads
OK -0.002320 -0.002320
OK 0.002072 0.002072
OK 0.003717 0.003717
OK 0.001307 0.001307
OK 0.000632 0.000632
TENSOR OK
allocated 474 MiB for AdamW optimizer state m
allocated 474 MiB for AdamW optimizer state v
step 0: loss 5.270009 (took 60.079424 ms)
step 1: loss 4.059697 (took 64.541959 ms)
step 2: loss 3.375109 (took 105.270315 ms)
step 3: loss 2.800757 (took 91.734826 ms)
step 4: loss 2.315364 (took 78.139227 ms)
step 5: loss 1.849014 (took 78.619143 ms)
step 6: loss 1.394637 (took 78.369907 ms)
step 7: loss 0.999127 (took
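
The checklist asks for throughput as well as step time. As a minimal sketch, the step times printed by the test run can be converted to tokens per second, assuming one step processes B × T tokens (B = 4 and T = 64 for this test run, per the `[State]` output). The helper name `tokens_per_second` is our own and is not part of llm.c.

```python
# Step times (ms) copied from the test output above (steps 0-6).
step_ms = [60.079424, 64.541959, 105.270315, 91.734826,
           78.139227, 78.619143, 78.369907]

B, T = 4, 64  # batch_size and seq_len reported under [State]


def tokens_per_second(batch_size, seq_len, ms):
    """Tokens per second, assuming one step consumes batch_size * seq_len tokens."""
    return batch_size * seq_len / (ms / 1000.0)


mean_ms = sum(step_ms) / len(step_ms)
print(f"mean step time: {mean_ms:.2f} ms")
print(f"throughput:     {tokens_per_second(B, T, mean_ms):.0f} tokens/s")
```

The same helper applies to the training run by passing B = 4 and T = 1024 from its parameter table. The first few steps may include warm-up, so it can be fairer to exclude them from the mean.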

**Profiling**

Profile kernel timing with `nvprof`:

In [19]:
!nvprof ./train_gpt2fp32cu



Profile memory bandwidth and occupancy with Nsight Compute (`ncu`):

In [None]:
!ncu --set default ./train_gpt2fp32cu

Streaming output truncated to the last 5000 lines.
  volta_sgemm_128x64_tn (8, 16, 48)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
  softmax_forward_kernel5(float *, float, const float *, int, int) (6144, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
  volta_sgemm_64x64_nn (1, 16, 48)x(64, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
  unpermute_kernel(float *, float *, int, int, int, int) (12288, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
  matmul_forward_kernel4(float *, const float *, const float *, const float *, int, int) (32, 6, 1)x(16, 16, 1), Context 1, Stream 7, Device 0, CC 7.5
  residual_forward_kernel(float *, float *, float *, int) (12288, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
  layernorm_forward_kernel3(float *, float *, float *, const float *, const float *, const float *, int, int) (256, 1, 1)x(512, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
  matmul_forward_kernel4(float *, const float *, const float *,

### Profiling Results & Optimization Plan

**Top GPU time kernels:**
- `matmul_forward_kernel4` (58%): Custom matrix multiplication
- `volta_sgemm_*` (~27%): cuBLAS matrix multiplication (already optimized)
- `softmax_forward_kernel5` (2.6%): Custom softmax for attention
- `gelu_forward_kernel` (1.0%): Custom GELU activation
- `layernorm_forward_kernel3` (0.9%): Custom layer normalization
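
These shares also bound the gain from any single optimization. The sketch below is a rough Amdahl's-law estimate. It assumes the percentages are fractions of total step time (they are fractions of GPU kernel time, so the estimate is optimistic) and that the other kernels are unaffected. `overall_speedup` is an illustrative helper, and the kernel speedups are hypothetical, not measured.

```python
def overall_speedup(share, kernel_speedup):
    """Amdahl's law: overall speedup when a part taking `share` of the
    time becomes `kernel_speedup` times faster."""
    return 1.0 / ((1.0 - share) + share / kernel_speedup)


matmul_share = 0.58  # matmul_forward_kernel4 share from the list above

# Hypothetical kernel speedups, for illustration only (not measured).
for s in (2, 4, 8):
    print(f"matmul {s}x faster -> overall {overall_speedup(matmul_share, s):.2f}x")

# Upper bound if the kernel cost nothing at all.
print(f"upper bound: {1.0 / (1.0 - matmul_share):.2f}x")
```

By the same formula, softmax at 2.6% can contribute at most about 1.03x overall, which is why it ranks well below the matmul.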

**Optimization priorities:**
1. **Replace `matmul_forward_kernel4` with cuBLAS (`cublasSgemm`)**  
   _At 58% of GPU time, this kernel is the main bottleneck, so replacing it should give the biggest speedup._
2. **(Optional, after matmul) Optimize `softmax_forward_kernel5`**  
   _Consider cuDNN softmax or a fused kernel if it becomes a larger bottleneck._
3. **Other kernels (`gelu`, `layernorm`) are low priority**  
   _Small time share; optimize only if needed after main bottlenecks._
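
One detail to watch in the cuBLAS replacement is memory layout: cuBLAS assumes column-major storage, while the buffers in llm.c are row-major. The pure-Python sketch below emulates a column-major SGEMM and checks that a transposed call reproduces the row-major result `out = inp · weightᵀ`. `sgemm_colmajor` is a hypothetical stand-in that mirrors the `cublasSgemm` argument order without the handle, `alpha`, and `beta`. The sketch assumes `inp` has shape (N, C) and `weight` has shape (OC, C), and it leaves out the bias. The argument choice shown is one possibility, not necessarily what the `hugo` branch does.

```python
def sgemm_colmajor(transa, transb, m, n, k, A, lda, B, ldb, C, ldc):
    """Emulate C = op(A) * op(B) on flat column-major buffers (alpha=1, beta=0)."""
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                a = A[p + i * lda] if transa else A[i + p * lda]
                b = B[j + p * ldb] if transb else B[p + j * ldb]
                acc += a * b
            C[i + j * ldc] = acc


def matmul_rowmajor(inp, weight, N, Cdim, OC):
    """Reference: out[n][o] = sum_c inp[n][c] * weight[o][c], flat row-major."""
    return [sum(inp[n * Cdim + c] * weight[o * Cdim + c] for c in range(Cdim))
            for n in range(N) for o in range(OC)]


N, Cdim, OC = 3, 4, 2  # tiny illustrative sizes
inp = [float(i + 1) for i in range(N * Cdim)]
weight = [float((i % 5) - 2) for i in range(OC * Cdim)]

out = [0.0] * (N * OC)
# Column-major view: out^T (OC x N) = weight (OC x C) * inp^T (C x N).
sgemm_colmajor(True, False, OC, N, Cdim, weight, Cdim, inp, Cdim, out, OC)

assert out == matmul_rowmajor(inp, weight, N, Cdim, OC)
print("column-major call reproduces the row-major matmul")
```

Because a row-major (N, C) buffer read as column-major is the (C, N) transpose, no data has to be copied or transposed explicitly; only the operation flags and leading dimensions change.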

**Summary:**  
Focus first on replacing the custom matmul with cuBLAS. Re-profile after each change.

---
## **Optimizations**

**TODO:** Test the optimizations. First, switch to the `hugo` branch, where the optimization is implemented.

In [None]:
!git checkout hugo

Compile train and test:

In [None]:
!make clean
!make train_gpt2fp32cu NVCC_FLAGS="-gencode arch=compute_75,code=sm_75"

In [None]:
!make clean
!make test_gpt2fp32cu NVCC_FLAGS="-gencode arch=compute_75,code=sm_75"

Run test to check for errors:

In [None]:
!./test_gpt2fp32cu

Run profiling to check improvement:

In [None]:
!nvprof ./train_gpt2fp32cu