# DD2360 – Acceleration of LLM.c Inference
**Project Proposal Implementation**

Group members:  
- Diogo Paulo  
- Hugo Dezerto  
- Maria Carolina Sebastião  

Platform: Google Colab (NVIDIA Tesla T4)  


## 1. Environment Setup
This section sets up the Google Colab environment and verifies GPU availability.


In [1]:
!mkdir -p /content/DD2360_project
%cd /content/DD2360_project


/content/DD2360_project
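
The cell above only creates the working directory. A GPU check could look like the minimal sketch below, which assumes `nvidia-smi` is on the `PATH` (as it normally is on a Colab GPU runtime). The helper names `parse_gpu_lines` and `list_gpus` are hypothetical and not part of llm.c.

```python
import shutil
import subprocess


def parse_gpu_lines(text):
    """Parse 'name, memory' CSV lines, as printed by nvidia-smi, into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2 and parts[0]:
            gpus.append({"name": parts[0], "memory": parts[1]})
    return gpus


def list_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_lines(result.stdout)


gpus = list_gpus()
print(gpus if gpus else "No GPU visible; in Colab, select a GPU runtime and reconnect.")
```

An empty list here most likely means the notebook is attached to a CPU-only runtime, in which case the CUDA build steps below would not be expected to work.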


## 2. Repository Setup (llm.c)
Clone the llm.c repository and verify file structure.


In [2]:
!git clone https://github.com/karpathy/llm.c


Cloning into 'llm.c'...
remote: Enumerating objects: 6149, done.
remote: Total 6149 (delta 0), reused 0 (delta 0), pack-reused 6149 (from 1)
Receiving objects: 100% (6149/6149), 2.25 MiB | 5.44 MiB/s, done.
Resolving deltas: 100% (3963/3963), done.


In [3]:
%cd llm.c
!ls

/content/DD2360_project/llm.c
dev	  profile_gpt2.cu    test_gpt2.c	train_gpt2_fp32.cu
doc	  profile_gpt2cu.py  test_gpt2.cu	train_gpt2.py
LICENSE   README.md	     test_gpt2_fp32.cu	train_llama3.py
llmc	  requirements.txt   train_gpt2.c
Makefile  scripts	     train_gpt2.cu


## 3. Building the CUDA FP32 Baseline
Compile the FP32 CUDA baseline (the `test_gpt2fp32cu` target) using the provided Makefile.


In [4]:
!make test_gpt2fp32cu


---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✓ MPI enabled
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 -DMULTI_GPU -DUSE_MPI test_gpt2_fp32.cu -lcublas -lcublasLt -lnvidia-ml -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/usr/lib/x86_64-linux-gnu/openmpi/include/ -lnccl -lmpi -o test_gpt2fp32cu


In [5]:
!./test_gpt2fp32cu


[System]
Device 0: Tesla T4
enable_tf32: 0
Error: Failed to open file 'gpt2_124M.bin' at train_gpt2_fp32.cu:1119
Error details:
  File: train_gpt2_fp32.cu
  Line: 1119
  Path: gpt2_124M.bin
  Mode: rb
---> HINT 1: dataset files/code have moved to dev/data recently (May 20, 2024). You may have to mv them from the legacy data/ dir to dev/data/(dataset), or re-run the data preprocessing script. Refer back to the main README
---> HINT 2: possibly try to re-run `python train_gpt2.py`


## 4. Model and Data Preparation
Generate the required dataset and convert the pretrained GPT-2 weights. This step is needed because the run above failed on the missing file `gpt2_124M.bin`.
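
To avoid hitting the same "Failed to open file" error again, the expected files can be checked before the binary is launched. The following is a minimal sketch: `missing_files` and `REQUIRED_FILES` are hypothetical names, and the list only contains the two file names that appear in this notebook's output, so it may be incomplete for the test binary.

```python
from pathlib import Path

# gpt2_124M.bin is the file named in the error message; gpt2_tokenizer.bin is
# reported as written by train_gpt2.py. The list is an assumption and may need
# further entries depending on which binary is run.
REQUIRED_FILES = ("gpt2_124M.bin", "gpt2_tokenizer.bin")


def missing_files(root=".", names=REQUIRED_FILES):
    """Return the names from `names` that do not exist as files under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).is_file()]


missing = missing_files(".")
print("missing:", missing if missing else "none")
```

Running this from the `llm.c` directory after the two preparation cells should report nothing missing if both scripts completed.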

In [None]:
!python dev/data/tinyshakespeare.py


Downloading https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt to /content/DD2360_project/llm.c/dev/data/tinyshakespeare/tiny_shakespeare.txt...
/content/DD2360_project/llm.c/dev/data/tinyshakespeare/tiny_shakespeare.txt: 1.06MiB [00:00, 44.0MiB/s]
writing 32,768 tokens to /content/DD2360_project/llm.c/dev/data/tinyshakespeare/tiny_shakespeare_val.bin (66,560 bytes) in the gpt-2 format
writing 305,260 tokens to /content/DD2360_project/llm.c/dev/data/tinyshakespeare/tiny_shakespeare_train.bin (611,544 bytes) in the gpt-2 format


In [None]:
!python train_gpt2.py --write_tensors 1 --model gpt2

Running pytorch 2.9.0+cu126
using device: cuda
total desired batch size: 256
=> calculated gradient accumulation steps: 1
wrote gpt2_tokenizer.bin
2025-12-30 17:11:53.378779: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1767114713.399220    8083 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1767114713.405355    8083 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1767114713.423678    8083 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.

## 5. CUDA Architecture Compatibility Fix
Resolve the PTX toolchain mismatch by rebuilding specifically for the Tesla T4 (compute capability 7.5), passed to the Makefile as `GPU_COMPUTE_CAPABILITY=75`.
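
The Makefile variable takes the compute capability without the dot (7.5 becomes `75`). As a small sketch of that mapping, the hypothetical helper `make_command` below builds the same invocation used in this section from a capability string. It is only an illustration, not part of llm.c.

```python
def make_command(compute_capability, target="test_gpt2fp32cu"):
    """Build the make invocation for a capability string such as '7.5'."""
    # Unpacking raises ValueError if the string is not of the form 'major.minor'.
    major, minor = compute_capability.strip().split(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"unexpected compute capability: {compute_capability!r}")
    return ["make", target, f"GPU_COMPUTE_CAPABILITY={major}{minor}"]


print(" ".join(make_command("7.5")))
# prints: make test_gpt2fp32cu GPU_COMPUTE_CAPABILITY=75
```

On a different GPU, the capability string would have to be looked up for that device first. The value 7.5 used here is the one stated for the Tesla T4.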


In [None]:
!make clean
!make test_gpt2fp32cu GPU_COMPUTE_CAPABILITY=75

---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✓ MPI enabled
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
rm -f train_gpt2 test_gpt2 train_gpt2cu test_gpt2cu train_gpt2fp32cu test_gpt2fp32cu 
rm -f build/*.o
---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✓ MPI enabled
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 --generate-code arch=compute_75,code=[compute_75,sm_75] -DMULTI_GPU -DUSE_MPI test_gpt2_fp32.cu -lcublas -lcublasLt -lnvidia-ml -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/usr/lib/x86_64-linux-gnu/openmpi/include/ -lnccl -lmpi -o test_gpt2

## 6. Baseline Inference Validation (FP32)
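Run the FP32 CUDA binary and verify numerical correctness: the run reports `OK (LOGITS)`, `LOSS OK`, and `TENSOR OK` before timing the individual training steps.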
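
The binary prints pairs of logits, one pair per line, before reporting `OK (LOGITS)`. To quantify how closely the two columns agree, the pairs can be parsed from the captured output. The sketch below uses the first three pairs from the run in this section as sample input. `max_abs_diff` is a hypothetical helper, and the threshold llm.c applies internally is not shown in the output, so none is assumed here.

```python
import re

# Matches lines of the form "-43.431618, -43.431633".
PAIR = re.compile(r"^\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)\s*$")


def max_abs_diff(log_text):
    """Return the largest absolute difference over all 'a, b' pairs in the log."""
    diffs = []
    for line in log_text.splitlines():
        match = PAIR.match(line)
        if match:
            a, b = float(match.group(1)), float(match.group(2))
            diffs.append(abs(a - b))
    if not diffs:
        raise ValueError("no value pairs found in log text")
    return max(diffs)


# First three pairs copied from the output of this section.
SAMPLE = """\
-43.431618, -43.431633
-39.836346, -39.836357
-43.065903, -43.065926
OK (LOGITS)
"""

print(f"max |diff| = {max_abs_diff(SAMPLE):.1e}")
```

The same function could be reused after each optimization step to check that the difference between the two columns does not grow, assuming the output format stays unchanged.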


In [None]:
!./test_gpt2fp32cu

[System]
Device 0: Tesla T4
enable_tf32: 0
[State]
batch_size: 4
seq_len: 64
allocated 221 MiB for activations
-43.431618, -43.431633
-39.836346, -39.836357
-43.065903, -43.065926
-42.828041, -42.828056
-43.529537, -43.529564
-44.318390, -44.318409
-41.227409, -41.227428
-41.270756, -41.270779
-42.541397, -42.541420
-42.394989, -42.395012
OK (LOGITS)
allocated 474 MiB for parameter gradients
allocated 4 MiB for activation gradients
LOSS OK: 5.270009 5.270008
grads
OK -0.002320 -0.002320
OK 0.002072 0.002072
OK 0.003717 0.003717
OK 0.001307 0.001307
OK 0.000632 0.000632
TENSOR OK
allocated 474 MiB for AdamW optimizer state m
allocated 474 MiB for AdamW optimizer state v
step 0: loss 5.270009 (took 67.082373 ms)
step 1: loss 4.059695 (took 55.237933 ms)
step 2: loss 3.375108 (took 86.675907 ms)
step 3: loss 2.800755 (took 76.870364 ms)
step 4: loss 2.315362 (took 74.005729 ms)
step 5: loss 1.849014 (took 72.937497 ms)
step 6: loss 1.394637 (took 74.148315 ms)
step 7: loss 0.999126 (took 