Let's reproduce GPT-2 (1.6B): one 8XH100 node, 24 hours, $672, in llm.c #677

karpathy · 2024-07-11T16:29:28Z

karpathy
Jul 11, 2024
Maintainer

In this post we are reproducing GPT-2 in llm.c. This is "the GPT-2", the full, 1558M parameter version that was introduced in OpenAI's blog post Better Language Models and their Implications in February 14, 2019. llm.c does so directly in C/CUDA (total of ~5,000 lines of code), without the typical training stack that would involve the Python interpreter and a significantly more complex deep learning library like PyTorch/JAX, huggingface/transformers, or etc. In 2019, training GPT-2 was an involved project from an entire team and considered a big model run but, ~5 years later, due to improvements in compute (H100 GPUs), software (CUDA, cuBLAS, cuDNN, FlashAttention) and data (e.g. the FineWeb-Edu dataset), we can reproduce this model on a single 8XH100 node in 24 hours, and for $672, which is quite incredible. There are some caveats and asterisks involved though - llm.c is still not perfectly tuned, sufficiently stabilized (we still see loss spikes and bad activation ranges now and then), and our evals are not comprehensive (e.g. we do not carefully eval multilingual, code, math); A lot of this work is still ongoing.

Unicorns. The natural place to start is probably with unicorns. In the GPT-2 blog post, OpenAI prompted GPT-2 with this unicorn prompt. Let's see what our model thinks about English-speaking unicorns in the Andes mountains:

Prompt: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
Completion: It may sound like a hoax, or just a dream of the mythical animals, but it appears this may be a real case of the unicorn language. The Andean area has an abundance of minerals and is also very rich in biodiversity. According to the researchers' study, the unicorns' language is called Elveseo. To produce it, the animal forms a consonant from the lips and makes a vowel sound with the teeth, then produces a consonant sound with the mouth and makes an 'ow' sound from its chest. "The question is, what caused this, and what does this tell us about intelligence and communication?" commented study co-author Paul Marek. Speaking in the video, the unicorns had the right way to speak, the animals were able to communicate and talk in perfect English.

Fun! :) The model is fairly coherent and qualitatively somewhere around the level of GPT-2. You can find 20 samples from both GPT-2 and the llm.c model here, or generate many more using instructions down below.

Training. Training a GPT-2 with llm.c is quite simple because it is written in C/CUDA, so there is no need for minconda, Python, PyTorch, etc. You will want an 8XH100 GPU box, I recommend spinning one up from Lambda labs. But llm.c is flexible on its compute - if you have only 1 GPU you can still get your GPT-2, you'll just have to wait 8 days instead of 1. If you have 16 GPUs (e.g. using the new Lambda 1 Click Clusters), you'll be able to train multinode and only have to wait 12 hours. Once you spin up your node, here are the complete instructions to train your GPT-2 (this only takes a ~minute from blank box to start stepping):

# install cudnn so we can use FlashAttention and run fast (optional)
# https://developer.nvidia.com/cudnn-downloads
# for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install libcudnn9-dev-cuda-12

# "install" cudnn-frontend to ~/
git clone https://github.com/NVIDIA/cudnn-frontend.git

# install MPI (optional, if you intend to use multiple GPUs)
# (you might also have to install NVIDIA NCCL if it doesn't come with your setup)
sudo apt -y install openmpi-bin openmpi-doc libopenmpi-dev

# download and enter llm.c repo
git clone https://github.com/karpathy/llm.c.git
cd llm.c

# download the "starter pack" (~1GB download)
# contains GPT2-124M weights (used in tests), tokenizer, eval data .bin s
./dev/download_starter_pack.sh

# download the training dataset (FineWeb-Edu 100B token) .bin data shards
# note: this is a total of 1001 data shards. If you only want to test things
# out and don't want to do an actual run, feel free to append the number of
# training shards to download (e.g. for just 10 shards: ./edu_fineweb.sh 10)
# the full dataset is ~200GB, we can store it here in dev/data directory.
cd dev/data
./edu_fineweb.sh

# compile (~1 min 1st time for cuDNN mostly, few sec from then on)
cd ../../
make train_gpt2cu USE_CUDNN=1

# and train! (wait 24 hours here)
mpirun -np 8 ./train_gpt2cu \
	-i "dev/data/edu_fineweb100B/edu_fineweb_train_*.bin" \
	-j "dev/data/edu_fineweb100B/edu_fineweb_val_*.bin" \
	-o "log_gpt2_1558M" \
	-v 250 -s 300000 -g 384 \
	-h 1 \
	-b 16 -t 1024 \
	-d 1048576 \
	-r 0 \
	-z 1 \
	-c 0.1 \
	-k "cosine" \
	-l 0.0006 \
	-q 0.1 \
	-u 700 \
	-n 2000 \
	-x 32000 \
	-ge 1 \
	-y 1 \
	-e "d48"

I will describe the args in a second. You'll see a bunch of prints scroll through and then the optimization will begin:

num_parameters: 1557686400 => bytes: 3115372800
allocated 2971 MiB for model parameters
batch_size B=16 * seq_len T=1024 * num_processes=8 and total_batch_size=1048576
=> setting grad_accum_steps=8
created directory: log_gpt2_1558M
allocating 40409 MiB for activations
val loss 11.129390
allocating 2971 MiB for parameter gradients
allocating 742 MiB for AdamW optimizer state m
allocating 742 MiB for AdamW optimizer state v
allocating 742 MiB for master copy of params
step    1/32000 | loss 11.133732 (+nanz)| norm 52.9732 (+nanz)| lr 8.57e-07 | 3056.36 ms | 42.6% bf16 MFU | 343080 tok/s
step    2/32000 | loss 10.539388 (+nanz)| norm 43.5996 (+nanz)| lr 1.71e-06 | 2747.19 ms | 47.4% bf16 MFU | 381690 tok/s
step    3/32000 | loss 9.894109 (+nanz)| norm 23.2229 (+nanz)| lr 2.57e-06 | 2753.25 ms | 47.3% bf16 MFU | 381259 tok/s
step    4/32000 | loss 9.566241 (+nanz)| norm 28.4920 (+nanz)| lr 3.43e-06 | 2741.47 ms | 47.5% bf16 MFU | 381690 tok/s
step    5/32000 | loss 9.482848 (+nanz)| norm 23.7817 (+nanz)| lr 4.29e-06 | 2752.07 ms | 47.3% bf16 MFU | 381507 tok/s
step    6/32000 | loss 9.332832 (+nanz)| norm 15.9113 (+nanz)| lr 5.14e-06 | 2751.01 ms | 47.3% bf16 MFU | 381431 tok/s
step    7/32000 | loss 9.165650 (+nanz)| norm 10.5941 (+nanz)| lr 6.00e-06 | 2753.03 ms | 47.3% bf16 MFU | 381327 tok/s
step    8/32000 | loss 9.132234 (+nanz)| norm 16.2733 (+nanz)| lr 6.86e-06 | 2748.91 ms | 47.3% bf16 MFU | 381348 tok/s
step    9/32000 | loss 9.097384 (+nanz)| norm 12.1342 (+nanz)| lr 7.71e-06 | 2748.73 ms | 47.3% bf16 MFU | 381367 tok/s
step   10/32000 | loss 9.072879 (+nanz)| norm 10.5923 (+nanz)| lr 8.57e-06 | 2749.40 ms | 47.3% bf16 MFU | 381369 tok/s
...

We can see that each step is about 2.75 seconds and there are 32,000 of them, so now we wait ~24 hours. At every step, this training run takes a chunk of ~1 million tokens of FineWeb-EDU (these are educational web pages from the internet), and updates the 1558 million weights of the model to be slightly better at predicting the next token in a sequence. By the end we'll have processed 32,000 * 1048576 = 33.6B tokens in total. The loss goes down as we do a better job predicting the next token. The norm will stabilize around 0.1-1, the learning rate is being warmed up over the first few steps. Our model flops utilization (MFU) is around 50%, i.e. quite efficient.

Now wait 24 hours for this to finish, then you can visualize the main.log log file using the dev/vislog.ipynb jupyter notebook. For this you will need to also have Python and matplotlib installed, and you will see the following:

Evals. On the left we are tracking the loss on FineWeb-EDU validation data. If you simply run the GPT-2 released by OpenAI and evaluate its loss on this split, you get the red horizontal line (loss 2.83). You see that our run outperforms this very very quickly, by step ~5,000. However, this is not a fair comparison because GPT-2 was trained on the never-released WebText dataset, so there is a possibly large distribution shift. So e.g. if you finetune the OpenAI model for 1,000 steps at LR 1e-4, the loss quickly plunges to the blue line (loss 2.61), because it's quickly adapting to the new data statistics. I like to look at the validation loss as a sanity check, but for the actual comparison we'd want to look at fixed, 3rd party evaluations. One of the well-behaved, smooth, common, often-cited evals that also offer early signal is the HellaSwag eval. These are simple common sense scenarios and the model has to pick the correct continuation. We evaluate HellaSwag on the right pane, where we see that we cross over the GPT-2 model around step ~25K (earlier than GPT-2, which is estimated to have been trained on ~100B tokens. This possibly has to do with increased data quality, as we also observed in our earlier 124M run). The green line is the GPT-3 model of the same size, which is pretty much the same model architecture as GPT-2 with minor differences (context length 1024 -> 2048) but trained for 300B tokens (i.e. ~10X more tokens than what we trained on here). I should say that even HellaSwag is not an ideal single point of comparison because it tests simple English and common sense, it does not test e.g. multilingual, math or code. It could have been that the WebText data mixture was a lot heavier on these, and these domains were "stealing" model capacity to some extent, we don't know because it was never released. Lastly, in general, good evals are harder at low model capability like GPT-2 because e.g. the models don't understand multiple choice, and their samples are not high enough quality to make above chance dent into standard math or code evals.

Args guide. Let's look at the args we passed into the training now in more detail. The GPT-2 release from OpenAI included model weights but very few details, while GPT-3 release had no weights but many details. So in many cases, we follow the GPT-3 paper hyperparameters because the GPT-2 paper has very very little information:

mpirun -np 8 ./train_gpt2cu \ the launch command: we're using mpi to launch 8 processes (each process runs training on 1 GPU, for 8 GPUs total on this example 8XH100 node). If you have 4 GPUs, use -np 4. If you have 1 GPU, you can skip mpi, i.e. simply change this to ./train_gpt2cu.
-i -j are training and validation splits token files, downloaded earlier with edu_fineweb.sh
-o is the output directory to write logs and checkpoints into
-v 250 asks to evaluate and log the validation loss every 250 steps
-s 300000 asks to sample some tokens every 300000 steps. Because the total number of steps will be less than this, this is hacky way to turn sampling off and we will only sample a single time at the very end.
-g 384 sets the number of tokens to be sampled at the end to be 384
-h 1 asks to evaluate the HellaSwag accuracy
-b 16 sets the micro-batch size to 16 . If you are running out of memory, decrease this value, e.g. try 8, 4, 2, all the way down to 1 potentially.
-t 1024 sets the maximum sequence length to 1024, as GPT-2 did
-d 1048576 asks that the total batch size be 2 to the power 20, following the GPT-3 paper hyperparameters table. The code will make sure to meet this desired total batch size and calculate the needed gradient accumulation "inner loop" steps of the optimization. For example up above, we saw that we have 8 GPUs each doing 16 X 1024 tokens, so that is 8 X 16 X 1024 = 131,072 tokens per micro-step (a single forward backward), so the code calculated gradient accumulation steps of 8 to meet the desired 1M batch size per step. i.e. it does forward+backward 8 times and then a single update.
-r 0 sets recompute to zero. Recompute is a way to trade off compute and memory. If -r 1, then we recompute a piece of the forward pass (the GeLU) during backward. This means we don't have to cache it and save memory, at the cost of some more compute. So if you're running out of memory, try -r 1, or -r 2 (also recompute layernorms).
-z 1 turns on ZeRO-1 (i.e. optimizer state sharding) across multiple GPUs. If you're training with > 1 GPU, this setting is a no-brainer and should basically always be on. On 1 GPU this setting is a no-op.
-c 0.1 sets the weight decay to 0.1. Only (2D) weights are decayed exactly as in GPT-2, and this number comes from the GPT-3 paper
-k "cosine" sets the cosine learning rate schedule, which is the default so this is a bit spurious.
-l 0.0006 sets the maximum learning rate to 6e-4. The GPT-3 paper says to use 2e-4 for this model size, but here we triple and it and seems to train faster and without any issues. This wasn't tuned very carefully yet.
-q 0.1 says that we will decay the learning rate to 10% of max LR over the course of training, following GPT-3 paper.
-u 700 says that we will ramp up the learning rate from 0 to max learning rate over the first 700 iterations, which at total batch size 0.5M is 350M tokens, following GPT-3 paper.
-n 2000 asks to save model checkpoints every 2000 steps.
-x 32000 asks for 32K steps in total. I chose this number because it is a nice number, and just fits into 24 hours.
-ge 1 sets a very recently merged gelu recompute setting for CublasLt (optional)
-y 1 sets the "resume" flag on. If your training for any reason crashes or hangs, you can CTRL+C and re-run this command, and it will attempt to resume the optimization. llm.c is bitwise-deterministic, so you'll get the identical result as if you didn't crash.
-e "d48" asks to initialize, a depth 48 GPT-2 model from scratch.

Memory guide. The biggest constraint most people will probably face is that their GPU doesn't have 80GB. That's okay you should still be able to run everything above if you are patient, it would just run slower. So if the model doesn't fit, what do you play with? The most important one is the micro batch size -b. Try to decrease it but keep it to nice numbers. So e.g. 16 -> 8 -> 4 -> 2 -> 1. From there, try to also play with the recompute setting -r which is 0 (fastest, a lot of memory), 1 (very slightly slower, but a huge memory saving), or 2 (slightly slower, smaller memory saving). The next thing you can do is disable master weights in fp32, which you can do with -w 0 (1 is default). We won't maintain fp32 copy of params. Empirically in a few runs before this seems to be okay, likely due to our use of stochastic rounding. If even that doesn't fit (that's unlikely right?), you could try to decrease the maximum sequence length with -t, default is 1024 you can take it down to 512, 256, etc., but now you are making your model worse because you're decreasing its maximum attention span.

Code. Certainly I feel biased but llm.c is quite beautiful:

It only requires basic CUDA dependencies to run.
It is a direct, minimal and readable implementation in C/CUDA. llm.c totals about 5,000 lines of C/CUDA code. We try to be mostly C, not C++ to keep it simple. Neural net training is just one while loop of the same, simple arithmetic operations (think +, -, *, /) on a single float array, it really shouldn't be that complicated.
It compiles and runs very quickly (few seconds), so you're doing more stepping and less waiting.
It allocates all of its GPU memory a single time at the start and from then on during training has an exactly constant memory footprint. So once you start stepping, you know you're good for the rest of the run and won't OOM.
It is bitwise deterministic.
It is efficient, at just below ~50% MFU.

The main entry point and the majority of the code is in the file train_gpt2.cu. It contains the GPT-2 model definition and the training loop in ~2,000 LOC, and it imports a bunch of helper files with various utilities and the individual layer implementations from the llmc directory. cloc llmc reports 23 files with 3170 LOC, and cloc train_gpt2.cu is 1353 LOC atm.

Multi-node training. If you are part of the privileged GPU-rich upper class, llm.c supports multi-node training and the most GPUs I've seen someone train llm.c with is ~500 GPUs. This biggest run I've done personally so far is on Lambda's new 1-click cluster feature with 16XH100 GPUs in 2 nodes. The downsides of unemployment. The lambda team has put up detailed instructions on how you can train llm.c models on their 1-click clusters. E.g. with the 512-GPU H100 cluster for $2,300/hr, you might be able to train your GPT-2 in ~30 minutes. You'd have to increase the total batch size (e.g. to ~8M) and possibly tune the hyperparameters a little. I haven't tried but it probably works and would be very cool :)

PyTorch comparison. A relatively comparable run in PyTorch would I think look something like this, using our parallel PyTorch implementation:

torchrun --standalone --nproc_per_node=8 train_gpt2.py \
    --input_bin "dev/data/edu_fineweb100B/edu_fineweb_train_*.bin" \
    --input_val_bin "dev/data/edu_fineweb100B/edu_fineweb_val_*.bin" \
    --write_tensors 0 \
    --model d48 \
    --batch_size 8 --sequence_length 1024 --total_batch_size 1048576 \
    --dtype bfloat16 \
    --compile 1 \
    --tensorcores 1 \
    --flash 1 \
    --num_iterations 32000 \
    --warmup_iters 700 \
    --weight_decay 0.1 \
    --overfit_single_batch 0 \
    --learning_rate 0.0006 \
    --zero_stage 1

The PyTorch code is meant as a testing reference not an actual implementation, so the training loop is a little bit different in some places (e.g. the dataloader doesn't permute the shards, etc.), but this is still possibly useful as a point of reference. I also hacked the default vocab size to be 50257 -> 50304 to get added efficiency, then the currently PyTorch nightly gives:

step   16/32000 | train loss 8.903997 | norm 8.3474 | lr 1.37e-05 | (3381.88 ms | 310057 tok/s)
step   17/32000 | train loss 8.870140 | norm 3.7936 | lr 1.46e-05 | (3381.95 ms | 310051 tok/s)
step   18/32000 | train loss 8.875732 | norm 9.4993 | lr 1.54e-05 | (3393.09 ms | 309033 tok/s)
step   19/32000 | train loss 8.817432 | norm 2.8345 | lr 1.63e-05 | (3379.75 ms | 310253 tok/s)
step   20/32000 | train loss 8.798056 | norm 4.1234 | lr 1.71e-05 | (3386.53 ms | 309631 tok/s)
step   21/32000 | train loss 8.777574 | norm 2.8010 | lr 1.80e-05 | (3386.05 ms | 309675 tok/s)
...

Now I wouldn't say I have full confidence that the PyTorch script is maximally tuned, but the following observations can be made. PyTorch seems to be taking a lot more memory (this run is ~80GB), while llm.c is at 57GB (29% improvement). Memory is important because it allows you to crank up the batch size (e.g. llm.c can go up to 24 microbatch here), which goes a bit faster. Second, we're seeing about 3386 vs. 2750ms per iteration, so llm.c is stepping ~19% faster. Some of the gains here have known origin, e.g. llm.c includes optimizations like the Fused classifier that kicks off the backward pass, which is something torch.compile does not do today afaik. But it's also possible that this script isn't fully maximally tuned, but in any case I'm showing the comparison in case 1) others would like to take a look, play with, compare, help tune and 2) to just say that llm.c is quite optimized and fast - in the specific case of GPT-2/3 training.

The final model. A few links that may be helpful, for posterity:

The main.log file.
The model_00032000.bin llm.c bin model file
The model converted to huggingface transformers GPT-2 model I uploaded here: karpathy/gpt2_1558M_final2_hf.
I've now also added a version of the model trained for 100K steps that achieves HellaSwag 57.7 and 330K steps that achieves 62.7.

Model export. The model export can be done as follows, for example:

python dev/eval/export_hf.py --input log_gpt2_128M/model_00032000.bin --output gpt2_1558M_export

This then lets you run the Eleuther eval harness, or run the huggingface sampling pipeline to get model samples:

# take model for spin
import torch

output = "./gpt2_1558M_final2_hf"

# set pytorch seeds
torch.manual_seed(42)
torch.cuda.manual_seed(42)

prompt = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(output)
model = AutoModelForCausalLM.from_pretrained(output, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16, device_map='cuda')
model.eval()
tokens = tokenizer.encode(prompt, return_tensors="pt")
tokens = tokens.to('cuda')

output = model.generate(tokens, max_new_tokens=500, pad_token_id=tokenizer.eos_token_id, do_sample=True, top_k=50, num_return_sequences=4)
samples = tokenizer.batch_decode(output)
for sample in samples:
    print('-'*30)
    print(sample)

Also have a look at dev/eval for instructions on how to run the Eleuther Evaluation Harness, the evals from the HuggingFace Open LLM Leaderboard, etc.

400B token run. I have also made the attempt to train GPT-2 for significantly longer than 33B tokens. In particular, I changed -x to 400,000 to train for 420B tokens (even more than GPT-3 model of this size, which was trained with 300B). This model run looked great until about step 330,000:

This model dramatically beats GPT-2 and GPT-3 of its size on HellaSwag (it gets up to ~61%), but sadly becomes unstable there on and explodes. There are more smaller spikes along the way but the code is configured to detect the more simple instantaneous instability and skips update (I used the flags -sl 5.0 -sg 5.0), which helps mitigate and defers issues. However, I think we're not yet being sufficiently careful with our initialization, activation ranges, and overall model training stability and there are deeper issues that gradually drift the model into instability, especially for larger models and over long training duration. To be continued. If you have ideas or recommendations for stabilizing LLM model training please contribute your experience in the discussion below.

FAQ:

Can I sample from the model in llm.c? kind of, but it's inefficient and a bit weird, and even more hacky if you'd like to prompt the model. Use the huggingface paths above for now.
Can I chat with it? no, this is currently only pretraining, not chat finetuning.
Can you train in fp8? No, we're currently mostly training in bf16, but early versions are very much work in progress.
I have a non-NVIDIA GPU can I run llm.c? No, llm.c supports C/CUDA only, but good forks exist (see main README). For example there is an actively maintained AMD fork by @anthonix that is quite good.

GPT-2 (124M). I wanted to also link to an earlier post on training the GPT-2 (124M) model in llm.c, which has some more related information to llm.c runs. 124M is a smaller model in the GPT-2 miniseries, only 124M parameters compared to 1558M parameters.

Authors

Substantial contributions to llm.c came from what now feels like the llm.c core dev team, in addition to self:

@ngc92 in all aspects of the code base
@ademeure in CUDA kernel optimization, low precision training, cudnn, cublas, ...
@gordicaleksa in all aspects of whatever is next on the TODO list, from algorithms to code to multi-node or etc.
@rosslwheeler in CI and Windows support. If you're happily running llm.c on Windows you should definitely thank Ross :)
Lambda labs for sponsoring the GPUs used in the development of llm.c. The history here is that I've happily used Lambda for several years and then a few months ago I pretty please asked if they are open to not charging my account for llm.c dev work and they agreed so here we are thank you for supporting llm.c!
Nvidia and Ubicloud (www.ubicloud.com) for providing the GitHub Nvidia GPU Runners for our CI.

Coming up. Some of the next big steps we are interested in and looking at these days:

Further optimize GPT-2 training hyperparameters. For some reason, the hyperparameters cited by OpenAI in the GPT-3 paper appear to be quite suboptimal, e.g. @Yuchenj_UW on X found that you can 3X the learning rate and get faster training with no apparent downsides. There might be other similar low-hanging fruit.
Improve training and scaling stability, e.g. more stable optimizers, schedulers, clipping, norming, muP. (Some of these PRs already exist, if you have tips on stabilizing LLM runs please reach out with ideas to try!).
Mixed precision++: training with fp8 (imminent!).
Model inference, e.g. KV cache is the low hanging fruit here.
Finetuning: SFT, RLHF
Multimodal extensions, VQVAE and friends
More modern architectures, support for Llama / Gemma model series.

The goal of llm.c remains to have a simple, minimal, clean training stack for a full-featured LLM agent, in direct C/CUDA, and companion educational materials to bring many people up to speed in this awesome field.

Please feel free to use the Discussions for any FAQ and related, or if you'd like something faster, #llmc on Discord, or #llmdotc on CUDA MODE Discord.

We'll see you next time!

gordicaleksa · 2024-07-11T17:38:40Z

gordicaleksa
Jul 11, 2024

Next up "this is how to train Llama 3 8B in 72 hours for 1500$"🫡🫡🫡

5 replies

YuchenJin Jul 11, 2024

I also wonder how hard it is to modify the current codebase to train llama3 8B

ngc92 Jul 11, 2024

RMSnorm is fairly easy, removing biases is easy (both do have a PR already), SwiGLU should also be straightforwards, I think the main challenge will be group-query attention with rope-encoding.

What is quite trivial is just scaling up the current model to 8B; in fact, I'm planning to make a PR that just adds a few more options for the model init. One question is how to continue the model series; two options would be:

    // deeper
    else if (depth == 60) { channels = 1920; num_heads = 30; } // 2.7B
    else if (depth == 72) { channels = 2880; num_heads = 30; } // 7.3B
    else if (depth == 84) { channels = 3456; num_heads = 36; } // 12.2B    
    
    // wider
    else if (depth == 56) { channels = 1920; num_heads = 30; } // 2.6B
    else if (depth == 64) { channels = 2880; num_heads = 30; } // 6.5B
    else if (depth == 72) { channels = 3840; num_heads = 30; } // 12.9B

This roughly matches the GPT3 series in paramter count, but both are much deeper.

YuchenJin Jul 11, 2024

Supporting the whole GPT3 series would be interesting!

VatsaDev Jul 12, 2024

@ngc92 I've looked at scaling like that, https://vatsadev.github.io/articles/Layers.html, theres a certain amount of balance thats best, but here layers are already wide enough to just focus on depth

cduk Jul 15, 2024

I would be interested to see a small model based on Llama 3 architecture and tokenizer: it would be useful as a draft model for speculative decoding.

MathiasSchindler · 2024-07-11T18:15:32Z

MathiasSchindler
Jul 11, 2024

400B token run: "This model dramatically beats GPT-2 and GPT-3 of its size on HellaSwag (it gets up to ~61%), but sadly becomes unstable there on and explodes. "

Would you be able to release the model right before the explosion? I would be interested to learn what instability and explosion look like in a model.

4 replies

karpathy Jul 11, 2024
Maintainer Author

Ok I'll upload it later today. I'm trying to revive it still and get the full 400B to complete. I took a close look at the weights last night and sadly I couldn't see any major issues. All of them look well behaved in good ranges, the worst I saw is the c_attn biases ranged -50 to 50, which seems rather broad. Everything else was in nice ranges.

michaelklachko Jul 11, 2024

@karpathy I'd check for overflows in layernorm layers: if input x if FP16, variance might overflow: https://github.com/karpathy/llm.c/blob/master/doc/layernorm/layernorm.py#L12

diegoasua Jul 13, 2024

Could it be a hardware fault? It seems so random I would be inclined to believe it is hardware. If you were to resume the training just before it happened, does it happen again?

tantara Jul 13, 2024

Does 350b dataset help to understand instability? I'm guessing a (elementary school) student doesn't want to read the same textbook 4 times. FindWeb Edu blog shows it achieves 59.7 on HellaSwag with 1.82B models trained on 350B dataset and it outperforms on other benchmarks (eg MMLU).

from https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1

greydanus · 2024-07-11T19:17:37Z

greydanus
Jul 11, 2024

I love the simplicity, power, and attention to detail. Well done! I hope to experiment with this code myself someday soon.

0 replies

ex3ndr · 2024-07-11T19:48:45Z

ex3ndr
Jul 11, 2024

What about new flash attention 3? Will it slash price in half?

2 replies

astelmach01 Jul 12, 2024

You'd have to rewrite the kernels, as flash-attention is hardware-dependent

ngc92 Jul 12, 2024

you don't have to rewrite the kernels, you'd "just" need to figure out how to integrate FA3 (including its dependencies) into the current codebase.
That being said, no, FA3 will not cut the cost in half. A significant amount of compute is spent in good old matrix multiplications (e.g., for the MLP part), which is completely unaffected by FA.

jonready · 2024-07-11T19:50:09Z

jonready
Jul 11, 2024

Thanks again for your work to democratize AI. I'm only part way through makemore now, but am blown away by how simple you can make these tough topics.

0 replies

rosslwheeler · 2024-07-11T20:11:16Z

rosslwheeler
Jul 11, 2024

Also want to thank Ubicloud (www.ubicloud.com) for providing the GitHub Nvidia GPU Runners for CI and Nvidia for sponsoring this. Thank you!!!

CC: @karpathy

1 reply

karpathy Jul 11, 2024
Maintainer Author

Bleh very good callout thank you @rosslwheeler and Ubicloud!

Daudsarfraz · 2024-07-11T23:12:19Z

Daudsarfraz
Jul 11, 2024

Hi, Andrej I'm watching your videos. Especially llama.c. Can you please share resources how you created own transformer.

0 replies

bil-ash · 2024-07-12T02:29:43Z

bil-ash
Jul 12, 2024

@karpathy I have an integrated GPU(ryzen) with 16GB VRAM. Would like to train a 256m gpt with 32k context length (using https://github.com/anthonix/llm.c).
Will 16GB VRAM permit doing so? Time is not an issue because I am ready to wait for weeks.
Also, what should be the size of dataset- 10B(fineweb-edu), 5B or anything less than that?

5 replies

ngc92 Jul 12, 2024

With an integrated GPU, I'm not sure if weeks are the order of magnitude you're looking at.
32k context with just 16GB also sounds very challenging, not sure if that is going to work with the current codebase. More recomptuation will help; not sure if it'll be enough, though.
IMO, the thing to do is download, build, and try it out. Start with standard context of 1k, and increase it until you hit OOM.
Not sure about the dataset size required, but if you want to train 32k context, you probably need to at least filter the dataset to have enough documents of sufficient length.

Edit: Maybe with a batch size of 1, the required activation memory isn't too big actually, and you still get sufficient tokens per update.

diegoasua Jul 13, 2024

Maybe with batch size 2 or 4 it might fit in memory. You can also play with the -r flag:

-r 0 sets recompute to zero. Recompute is a way to trade off compute and memory. If -r 1, then we recompute a piece of the forward pass (the GeLU) during backward. This means we don't have to cache it and save memory, at the cost of some more compute. So if you're running out of memory, try -r 1, or -r 2 (also recompute layernorms).

That being said... Even if you have the newest AMD Ryzen with a radeon 780 you are only going to get up to 16 TFLOPS (theoretically, in practice way less). This is the best case scenario by far, and I wouldn't count with it. I would assume more like 5 TFLOPS for a really good iGPU. Compared to an 8XH100 GPU box which was ~8,000 TFLOPS this setup is over 1,000x slower. This is not even accounting for memory bandwith which in H100s is 2TB/s and iGPUs would be much less. Even if your model is 10x smaller, you are still 100x short in compute, or need half a year to ingest 30B tokens. These are back of the envelope numbers if you want to exactly calculate how long it would take you can use Karpathy's notebooks like this one

Finally this repo trains on top of CUDA which does not compile to AMD. CUDA is specific to NVIDIA GPUs. AFAIK this project does not support AMD.

bil-ash Jul 13, 2024

Maybe with batch size 2 or 4 (almost stochastic gradient descent at this point) and no gradient accumulation it might fit in memory. You can use Adam instead of AdamW to reduce memory (AdamW keeps one or two extra copies of the graph) and play with the -r flag:

-r 0 sets recompute to zero. Recompute is a way to trade off compute and memory. If -r 1, then we recompute a piece of the forward pass (the GeLU) during backward. This means we don't have to cache it and save memory, at the cost of some more compute. So if you're running out of memory, try -r 1, or -r 2 (also recompute layernorms).

That being said... Even if you have the newest AMD Ryzen with a radeon 780 you are only going to get up to 16 TFLOPS (theoretically, in practice way less). This is the best case scenario by far, and I wouldn't count with it. I would assume more like 5 TFLOPS for a really good iGPU. Compared to an 8XH100 GPU box which was ~8,000 TFLOPS this setup is over 1,000x slower. This is not even accounting for memory bandwith which in H100s is 2TB/s and iGPUs would be much less. Even if your model is 10x smaller, you are still 100x short in compute, or need half a year to ingest 30B tokens. These are back of the envelope numbers if you want to exactly calculate how long it would take you can use Karpathy's notebooks like this one

Finally this repo trains on top of CUDA which does not compile to AMD. CUDA is specific to NVIDIA GPUs. AFAIK this project does not support AMD.

Thanks for pointing out about batch size and Adam. Also, thanks for pointing out the TFLOPS aspect, I can now have a rough estimate to the time required.
My GPU is indeed 780M and I guess it is supported by @anthonix fork of llm.c since that library has support for gfx1100(7900XTX) and 780M is gfx1103. Going by the two discussion in this repo- this one and the one about 124M, I guess training 256M model will take around 15000 TFLOPS-hours. Ignoring the plausible memory bottleneck and the fact that I would like to train on 32k context, time required would be 15000/(16*24) ~ 39 days . Also, I guess a 32k context length would be impossible on iGPU, I would have to stick with 1k context length.

ngc92 Jul 13, 2024

gradient accumulation does not increase the required amount of memory, and AdamW has the same memory footprint as Adam (2 additional buffers, one for first and one for second moment).

diegoasua Jul 13, 2024

correct, updated the comment

diegoasua · 2024-07-13T10:10:40Z

diegoasua
Jul 13, 2024

If I only have 1 GPU can I use ./train_gpt2cu directly without mpirun?

1 reply

karpathy Jul 13, 2024
Maintainer Author

Thank you for asking, I missed describing it. Yes exactly! I adjusted the docs above.

jyasir · 2024-08-04T15:44:46Z

jyasir
Aug 4, 2024

Currently running this on one H100 and it's on step 20654 after 20 days of training

I was having problems with memory so had to tweak the object size.

1 reply

jyasir Aug 4, 2024

Going to definitely share my inference experience once the model is completely trained.

jyasir · 2024-08-15T15:41:18Z

jyasir
Aug 15, 2024

Just completed training and it took one month on one H100.
`step 31999/32000 | loss 2.515356 (+1.10z)| norm 0.1049 (+0.02z)| lr 6.00e-05 | 79465.11 ms | 13.1% bf16 MFU | 13184 tok/s
step 32000/32000 | loss 2.489572 (-0.13z)| norm 0.1056 (+0.22z)| lr 6.00e-05 | 79564.45 ms | 13.1% bf16 MFU | 13184 tok/s
val loss 2.397445
HellaSwag: 5125/10042 = 0.510356
generating:

The United States Quality Assurance Act of 1998 (QAA) was one of the five major amendments to the Food, Drug, and Cosmetic Act and the Agricultural Act of 1949.
QAA provides regulations for product certification, pharmaceutical product certification, and food safety. Regulatory agencies conduct periodic re-approvals based on updated analytical procedures, and commonly, implementing the original QAA regulations are being challenged because of developing state laws that violate the original QAA specifications.
QAA were designed based on after market recalls or regulatory concerns relating to products. This is an artificial situation, which is not receiving the direct efficacy from the consumers.
Q AA states that a product is a quality assurance product to multiple specifications is an acceptable way to label the product, therefore the process of re-definition of "QAA" must be re-conceptualized to an acceptance process of a "quality evaluation" rather than that of a traditional inspection Check Point
2. Developing the Standard
3. Educating Employees
4. Consulting the General Public
5. Controlling Product Failures
6. Controlling Quality Assurance Costs
7. Evaluating QA Programs (QLA)<|endoftext|>An aging population is at greater risk of Alzheimer's disease. The annual cost of care for someone with the disease is projected to exceed $50 billion by 2050. In other words, care for those with Alzheimer's disease is expected to cost more than health care in all of U.S. history.
Researchers from David Geffen School of Medicine at UCLA have developed a blood test that can detect the starting point of Alzheimer's disease based on blood protein changes that occur long before symptoms begin to emerge. The test for Alzheimer's disease provides greater insight into the disease process and interpretation of human samples to determine at an early stage whether or not a person with risk for the disease will ultimately develop the disease.The study looked at

Writing checkpoint at step 32000
Writing model to log_gpt2_1558M/model_00032000.bin
Writing state to log_gpt2_1558M/state_00032000_00000.bin
total average iteration time: 79315.241121 ms
lab@dexter-pwr1:~/llm.c$ exit`

Is there a good inference example that i can use to test this model out?

1 reply

jyasir Aug 15, 2024

Actually never mind i see this code and will try it out
python dev/eval/export_hf.py --input log_gpt2_128M/model_00032000.bin --output gpt2_1558M_export

reducibly · 2024-11-10T21:43:27Z

reducibly
Nov 10, 2024

It's shocking to me how the demo System Prompt from the mentioned OpenAI blog post is not grammatically correct. We're building language models here and we can't even get our own basic grammar right?

Wrong: "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."

Right: "In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."

The typo fix is to change "scientist" to "scientists" (plural), because in the second sentence the prompt refers to the "researchers" (plural). Alternatively, one could say "a scientist", and change "the researchers" to "researcher".

For example of why this matters, I wonder if this research paper might have come to some different conclusions regarding "neural text degeneration" had they not used this very same grammatically-incorrect prompt. Perhaps not to a large degree (I don't know), but my point, I hope, is food for thought.

Garbage grammar in, garbage grammar out, no?

0 replies

Let's reproduce GPT-2 (1.6B): one 8XH100 node, 24 hours, $672, in llm.c #677

karpathy Jul 11, 2024 Maintainer

Replies: 12 comments · 20 replies

karpathy Jul 11, 2024 Maintainer Author

karpathy Jul 11, 2024 Maintainer Author

karpathy Jul 13, 2024 Maintainer Author

karpathy
Jul 11, 2024
Maintainer

Replies: 12 comments 20 replies

karpathy Jul 11, 2024
Maintainer Author

karpathy Jul 11, 2024
Maintainer Author

karpathy Jul 13, 2024
Maintainer Author