This repo contains a rough draft of a benchmark for language model training. It is similar in spirit to the "gpt3" task in the MLPerf Training benchmark, but scaled down. The `nanogpt_model_lib.py` and `nanogpt_trainer_lib.py` files are adapted from commit 9755682 of nanoGPT.
The agent working on this task should be provided with this codebase, the task description (see the "Task description" section below), a `data.bin` file, a `checkpoint0.bin` file, and an `expected_checkpoint20.bin` file in the Docker container created using `Dockerfile`.
The `data.bin` file included in this repo is the `val.bin` file generated by following nanoGPT's instructions for preparing the OpenWebText dataset.
To generate the `checkpoint0.bin` file, run:

```shell
python train.py \
  --trainer="nanogpt" \
  --max_num_steps=0 \
  --seed=123 \
  --final_checkpoint_file="checkpoint0.bin"
```
To generate the `expected_checkpoint20.bin` file, run:

```shell
python train.py \
  --trainer="nanogpt" \
  --init_checkpoint_file="checkpoint0.bin" \
  --data_file="data.bin" \
  --max_num_steps=20 \
  --final_checkpoint_file="expected_checkpoint20.bin"
```
To prevent the agent from cheating, two sets of `.bin` files can be generated with the same formats and similar sizes. One set can be provided to the agent so it can experiment; the other set can be hidden from the agent and used for scoring.
To run tests:

```shell
pytest
```
Modify `new_trainer_lib.py` to implement a trainer. Your trainer should adhere to the API defined in `trainer_lib.py`.
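The actual API is defined in `trainer_lib.py`, which is not reproduced here; as a rough orientation, a trainer for this task might look like the following skeleton. The class and method names below (other than `Trainer`, which the task fixes) and their signatures are assumptions for illustration only.

```python
# Hypothetical skeleton for new_trainer_lib.py. The real interface lives
# in trainer_lib.py; the method names and signatures here are guesses.

class Trainer:
    """Loads a checkpoint, trains for N steps, and saves a checkpoint."""

    def __init__(self, init_checkpoint_file: str, data_file: str):
        # Initialization is NOT counted toward the score, so expensive
        # one-time setup (loading data, building/compiling the model,
        # warming up kernels) belongs here.
        self.init_checkpoint_file = init_checkpoint_file
        self.data_file = data_file
        self.steps_done = 0

    def train(self, max_num_steps: int) -> None:
        # Only this part is timed. It must reproduce the reference
        # trainer's weights exactly after max_num_steps steps.
        for _ in range(max_num_steps):
            self.steps_done += 1  # placeholder for one optimizer step

    def save_checkpoint(self, final_checkpoint_file: str) -> None:
        # Write weights in the same format as expected_checkpoint20.bin.
        pass
```

The key design consequence of the scoring rule is that anything movable out of `train()` and into `__init__` is effectively free.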
You should NOT modify:

- any other code in this codebase
- the `.bin` files
- the name of `new_trainer_lib.py`
- the name of the `Trainer` class in `new_trainer_lib.py`
Your trainer will be scored by running:

```shell
python train.py \
  --trainer="new" \
  --init_checkpoint_file="checkpoint0.bin" \
  --expected_checkpoint_file="expected_checkpoint20.bin" \
  --data_file="data.bin" \
  --max_num_steps=20 \
  --score_file="score.txt"
```
The score is the time in seconds it takes for your trainer to perform 20 steps of training (excluding the initialization of the trainer).
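The timed region can be sketched as follows. This is a hypothetical illustration of the scoring rule, not the actual scoring code; `DummyTrainer` stands in for the real trainer.

```python
import time

# Sketch of how the score is defined: trainer construction is excluded,
# and only the 20 training steps are timed.

class DummyTrainer:
    def __init__(self):
        self.steps_done = 0  # expensive setup would happen here, untimed

    def train(self, max_num_steps):
        for _ in range(max_num_steps):
            self.steps_done += 1  # one optimizer step

trainer = DummyTrainer()             # initialization: not counted
start = time.perf_counter()          # timing starts after init
trainer.train(20)
score = time.perf_counter() - start  # score: seconds for 20 steps
```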
The model weights at the end of training should match the model weights in `expected_checkpoint20.bin`. Otherwise, an error is thrown and no score is generated.
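The exact-match requirement amounts to a check like the one below. The on-disk format of the `.bin` checkpoints is defined by the codebase and not specified here, so the flat-float32 layout in this sketch is an assumption for illustration only.

```python
import numpy as np

# Hypothetical exact-match check between two checkpoint files, assuming
# (for illustration only) each .bin file is a flat float32 weight array.

def checkpoints_match(file_a: str, file_b: str) -> bool:
    a = np.fromfile(file_a, dtype=np.float32)
    b = np.fromfile(file_b, dtype=np.float32)
    # Exact equality, not a tolerance: any numerical drift (e.g. from
    # reordered reductions) would fail this check.
    return a.shape == b.shape and bool(np.array_equal(a, b))
```

The practical implication is that optimizations must be bitwise-reproducible, which rules out many approximations that would merely keep the loss close.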
The goal is to get the lowest score possible.
This section contains benchmarking results from running the command in the "Task description" section with "nanogpt" instead of "new" for the trainer. The benchmarks were run with commit 86e0ba12 of this codebase in a Docker container generated using `Dockerfile` on a "gpu_1x_a100_sxm4" instance from Lambda Labs ("1x A100 (40 GB SXM4), 30 CPU cores, 205.4 GB RAM, 525.8 GB SSD") in the "Arizona, USA (us-west-2)" region on 9/8/2024. `checkpoint0.bin` and `expected_checkpoint20.bin` were generated with the default config.
| Config | Time to train 20 steps (seconds) | Loss |
| --- | --- | --- |
| fused_adamw+flash_attn+compile_model (default) | 52.35 | 9.8413 |
| fused_adamw+flash_attn | 66.77 | 9.8402 |
| fused_adamw | 176.29 | 9.8404 |
| - | 176.33 | 9.8404 |
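The three config names in the table correspond to standard PyTorch optimizations. The sketch below shows how each is typically enabled in PyTorch generally; the exact flags and wiring in this codebase may differ.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-in for the GPT model; the real model is defined in
# nanogpt_model_lib.py.
model = torch.nn.Linear(8, 8)

# fused_adamw: a single fused AdamW kernel instead of per-parameter ops
# (the fused path requires CUDA in most PyTorch builds).
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, fused=torch.cuda.is_available()
)

# flash_attn: memory-efficient attention via PyTorch's SDPA entry point,
# which dispatches to a FlashAttention kernel when available.
q = k = v = torch.randn(1, 4, 16, 8)  # (batch, heads, seq, head_dim)
attn_out = F.scaled_dot_product_attention(q, k, v)

# compile_model: JIT-compile the forward pass with torch.compile.
compiled_model = torch.compile(model)
```

The table suggests the attention kernel dominates (66.77 s vs. 176.29 s without it), with compilation giving a further ~20% speedup; the fused optimizer alone is roughly neutral at this step count.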