This repo contains a rough draft of a benchmark for language model training. It is similar in spirit to the "gpt3" task in the MLPerf Training benchmark, but scaled down. The `nanogpt_model_lib.py` and `nanogpt_trainer_lib.py` files are adapted from commit 9755682 of nanoGPT.
The agent working on this task should be provided with this codebase, the task description (see the "Task description" section below), a `data.bin` file, a `checkpoint0.bin` file, and an `expected_checkpoint20.bin` file in the Docker container created using `Dockerfile`.
The `data.bin` file included in this repo is the `val.bin` file generated by following nanoGPT's instructions for preparing the OpenWebText dataset.
To generate the `checkpoint0.bin` file, run:

```shell
python train.py \
  --trainer="nanogpt" \
  --max_num_steps=0 \
  --seed=123 \
  --final_checkpoint_file="checkpoint0.bin"
```
To generate the `expected_checkpoint20.bin` file, run:

```shell
python train.py \
  --trainer="nanogpt" \
  --init_checkpoint_file="checkpoint0.bin" \
  --data_file="data.bin" \
  --max_num_steps=20 \
  --final_checkpoint_file="expected_checkpoint20.bin"
```
To prevent the agent from cheating, two sets of `.bin` files can be generated with the same formats and similar sizes. One set can be provided to the agent so it can experiment; the other set can be hidden from the agent and used for scoring.
To run tests:

```shell
pytest
```
Modify `new_trainer_lib.py` to implement a trainer. Your trainer should adhere to the API defined in `trainer_lib.py`.
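The actual API is defined in `trainer_lib.py`, which is not reproduced here; as a rough orientation, a trainer for this task might look like the following skeleton. The class and method names below (other than `Trainer`, which the task fixes) and their signatures are assumptions for illustration only.

```python
# Hypothetical skeleton for new_trainer_lib.py. The real interface lives
# in trainer_lib.py; the method names and signatures here are guesses.

class Trainer:
    """Loads a checkpoint, trains for N steps, and saves a checkpoint."""

    def __init__(self, init_checkpoint_file: str, data_file: str):
        # Initialization is NOT counted toward the score, so expensive
        # one-time setup (loading data, building/compiling the model,
        # warming up kernels) belongs here.
        self.init_checkpoint_file = init_checkpoint_file
        self.data_file = data_file
        self.steps_done = 0

    def train(self, max_num_steps: int) -> None:
        # Only this part is timed. It must reproduce the reference
        # trainer's weights exactly after max_num_steps steps.
        for _ in range(max_num_steps):
            self.steps_done += 1  # placeholder for one optimizer step

    def save_checkpoint(self, final_checkpoint_file: str) -> None:
        # Write weights in the same format as expected_checkpoint20.bin.
        pass
```

The key design consequence of the scoring rule is that anything movable out of `train()` and into `__init__` is effectively free.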
You should NOT modify:

- any other code in this codebase
- the `.bin` files
- the name of `new_trainer_lib.py`
- the name of the `Trainer` class in `new_trainer_lib.py`
Your trainer will be scored by running:

```shell
python train.py \
  --trainer="new" \
  --init_checkpoint_file="checkpoint0.bin" \
  --expected_checkpoint_file="expected_checkpoint20.bin" \
  --data_file="data.bin" \
  --max_num_steps=20 \
  --score_file="score.txt"
```
The score is the time in seconds it takes for your trainer to perform 20 steps of training (excluding the initialization of the trainer).
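The timed region can be sketched as follows. This is a hypothetical illustration of the scoring rule, not the actual scoring code; `DummyTrainer` stands in for the real trainer.

```python
import time

# Sketch of how the score is defined: trainer construction is excluded,
# and only the 20 training steps are timed.

class DummyTrainer:
    def __init__(self):
        self.steps_done = 0  # expensive setup would happen here, untimed

    def train(self, max_num_steps):
        for _ in range(max_num_steps):
            self.steps_done += 1  # one optimizer step

trainer = DummyTrainer()             # initialization: not counted
start = time.perf_counter()          # timing starts after init
trainer.train(20)
score = time.perf_counter() - start  # score: seconds for 20 steps
```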
The model weights at the end of training should match the model weights in `expected_checkpoint20.bin`. Otherwise, an error is thrown and no score is generated.
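The exact-match requirement amounts to a check like the one below. The on-disk format of the `.bin` checkpoints is defined by the codebase and not specified here, so the flat-float32 layout in this sketch is an assumption for illustration only.

```python
import numpy as np

# Hypothetical exact-match check between two checkpoint files, assuming
# (for illustration only) each .bin file is a flat float32 weight array.

def checkpoints_match(file_a: str, file_b: str) -> bool:
    a = np.fromfile(file_a, dtype=np.float32)
    b = np.fromfile(file_b, dtype=np.float32)
    # Exact equality, not a tolerance: any numerical drift (e.g. from
    # reordered reductions) would fail this check.
    return a.shape == b.shape and bool(np.array_equal(a, b))
```

The practical implication is that optimizations must be bitwise-reproducible, which rules out many approximations that would merely keep the loss close.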
The goal is to get the lowest score possible.
This section contains benchmarking results from running the command in the "Task description" section with "nanogpt" instead of "new" for the trainer. The benchmarks were run with commit 86e0ba12 of this codebase in a Docker container generated using `Dockerfile` on a "gpu_1x_a100_sxm4" instance from Lambda Labs ("1x A100 (40 GB SXM4), 30 CPU cores, 205.4 GB RAM, 525.8 GB SSD") in the "Arizona, USA (us-west-2)" region on 9/8/2024. `checkpoint0.bin` and `expected_checkpoint20.bin` were generated with the default config.
| Config | Time to train 20 steps (seconds) | Loss |
| --- | --- | --- |
| fused_adamw+flash_attn+compile_model (default) | 52.35 | 9.8413 |
| fused_adamw+flash_attn | 66.77 | 9.8402 |
| fused_adamw | 176.29 | 9.8404 |
| - | 176.33 | 9.8404 |
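The three config names in the table correspond to standard PyTorch optimizations. The sketch below shows how each is typically enabled in PyTorch generally; the exact flags and wiring in this codebase may differ.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-in for the GPT model; the real model is defined in
# nanogpt_model_lib.py.
model = torch.nn.Linear(8, 8)

# fused_adamw: a single fused AdamW kernel instead of per-parameter ops
# (the fused path requires CUDA in most PyTorch builds).
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, fused=torch.cuda.is_available()
)

# flash_attn: memory-efficient attention via PyTorch's SDPA entry point,
# which dispatches to a FlashAttention kernel when available.
q = k = v = torch.randn(1, 4, 16, 8)  # (batch, heads, seq, head_dim)
attn_out = F.scaled_dot_product_attention(q, k, v)

# compile_model: JIT-compile the forward pass with torch.compile.
compiled_model = torch.compile(model)
```

The table suggests the attention kernel dominates (66.77 s vs. 176.29 s without it), with compilation giving a further ~20% speedup; the fused optimizer alone is roughly neutral at this step count.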