State of the Union [April 22, 2024] #224

karpathy · 2024-04-22T19:14:00Z

karpathy
Apr 22, 2024
Maintainer

[April 22, 2024]
I will post here once in a while on where the code is, focusing especially on the mainline CUDA code. These results can be calculated running python profile_gpt2cu.py (if you get a crash, add sudo).

runtime, DRAM traffic, instructions:

Kernel type summaries:
  name                                       time   frac
  cutlass::Kernel                          161.29  75.94%
  softmax_autoregressive_backward_kernel     7.81   3.68%
  adamw_kernel2                              7.07   3.33%
  softmax_forward_kernel5                    5.61   2.64%
  fused_classifier_kernel3                   5.53   2.60%
  matmul_backward_bias_kernel4               5.07   2.39%
  layernorm_backward_kernel2                 4.55   2.14%
  gelu_backward_kernel                       3.55   1.67%
  gelu_forward_kernel                        2.40   1.13%
  layernorm_forward_kernel3                  2.25   1.06%
  permute_kernel                             1.92   0.90%
  permute_kernel_backward                    1.90   0.89%
  residual_forward_kernel                    1.85   0.87%
  unpermute_kernel_backward                  0.72   0.34%
  unpermute_kernel                           0.67   0.31%
  encoder_backward_kernel                    0.14   0.06%
  encoder_forward_kernel2                    0.08   0.04%

In total, a training step takes 212.4ms, distributed as:
  0.2ms (0.1%) in the encoder,
  54.8ms (25.8%) in forward blocks,
  48.9ms (23.0%) in the classifier part,
  101.4ms (47.8%) in backward blocks, and
  7.1ms (3.3%) in the optimizer.

We read 44.1GiB (207.8GB/s) and write 19.9GiB (93.5GB/s) to DRAM,
read 115.7GiB (544.6GB/s) and write 16.9GiB (79.4GB/s) to L2,
and execute 3.7 billion instructions (17.2 GInst/s).

Spending 76% in NVIDIA cutlass kernels, which is encouraging. This was run on an A10. On my A100 we are currently at ~73ms/iteration. PyTorch comparison (fp32, no flash attention, slightly stale PyTorch) is 78.2ms/iteration, so we are ~6.4% faster than PyTorch, in this constrained setting.

peak memory
On nvidia-smi we see a nice and constant 8753 MiB, this was heavily optimized by @ngc92 . In comparison, current PyTorch code comes up to 12879MiB. So we are 32% lower. To reproduce, run like:

python train_gpt2.py --write_tensors 0 --compile 1 --tensorcores 1 --num_iterations 100 --sequence_length 1024

lines of code
train_gpt2.cu is at 2097 of clean LOC

latency
nvcc compile latency: 2.4 s
run latency (ENTER to first step): 2.2s

"big stones" ongoing work:

optimize attention with kernel improvement/fusion (e.g. only doing lower triangular matmuls)
fp16/bf16 in the cleanest way: how
multi-gpu training with allreduce (MPI / NCCL)
experimentation harness (e.g. argparse, logging, sweeps, etc.) for bigger runs

major merged improvements last few days:

🧙‍♂️ kernels: @ngc92 @ademeure @ChrisDryden @al0vya
💎 tooling: @Ricardicus @dagelf @rosslwheeler @azret @lancerts

first notable forks appearing

Mojollm.🔥 by @dorjeduck: a Mojo port of this project
C# llm.cs by @azret: a C# port of this project
Rust llm.rs by @ToJen: a Rust port of this project
Metal llm.metal by @regrettable-username: LLM training in simple, raw C/Metal Shading Language

mandeeplearning · 2024-04-24T02:31:40Z

mandeeplearning
Apr 24, 2024

Where are the cutlass calls coming from? The current code seems to only use cublas.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

State of the Union [April 22, 2024] #224

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

State of the Union [April 22, 2024] #224

karpathy Apr 22, 2024 Maintainer

Replies: 1 comment

mandeeplearning Apr 24, 2024

karpathy
Apr 22, 2024
Maintainer

mandeeplearning
Apr 24, 2024