hacobe/gptperf

gptperf

Overview

This repo contains a rough draft of a benchmark for language model training. It is similar in spirit to the "gpt3" task in the MLPerf Training benchmark, but scaled down. The nanogpt_model_lib.py and nanogpt_trainer_lib.py files are adapted from commit 9755682 of nanoGPT.

Setup

The agent working on this task should be provided with this codebase, the task description (see the "Task description" section below), and three files: data.bin, checkpoint0.bin, and expected_checkpoint20.bin. All of these are placed in a Docker container created using Dockerfile.

The data.bin file included in this repo is the val.bin file generated by following nanoGPT's instructions for preparing the openwebtext dataset.
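Assuming data.bin keeps nanoGPT's format (val.bin is written as a flat, headerless array of uint16 GPT-2 BPE token ids via NumPy's `tofile`), it can be inspected with a memory map. This is a sketch based on that assumption, not part of the benchmark code:

```python
import numpy as np

def load_tokens(path):
    """Load a nanoGPT-style token file.

    Assumes the file is a flat array of uint16 GPT-2 BPE token ids
    with no header, as produced by nanoGPT's openwebtext prepare script.
    """
    return np.memmap(path, dtype=np.uint16, mode="r")

# Example usage:
#   tokens = load_tokens("data.bin")
#   print(len(tokens), int(tokens.max()))
```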

To generate the checkpoint0.bin file, run:

python train.py \
    --trainer="nanogpt" \
    --max_num_steps=0 \
    --seed=123 \
    --final_checkpoint_file="checkpoint0.bin"

To generate the expected_checkpoint20.bin file, run:

python train.py \
    --trainer="nanogpt" \
    --init_checkpoint_file="checkpoint0.bin" \
    --data_file="data.bin" \
    --max_num_steps=20 \
    --final_checkpoint_file="expected_checkpoint20.bin"

To prevent the agent from cheating, two sets of .bin files can be generated with the same formats and similar sizes: one set is provided to the agent to help it experiment, and the other is hidden from the agent and used for scoring.

To run tests:

pytest

Task description

Modify new_trainer_lib.py to implement a trainer.

Your trainer should adhere to the API defined in trainer_lib.py.

You should NOT modify:

  • any other code in this codebase
  • the .bin files
  • the name of new_trainer_lib.py
  • the name of the Trainer class in new_trainer_lib.py
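The actual interface is whatever trainer_lib.py defines; the skeleton below is only an illustration of a checkpoint-in, checkpoint-out trainer shape. The method names and signatures here (`step`, `save`, the constructor arguments) are hypothetical assumptions, not the real API:

```python
class Trainer:
    """Hypothetical sketch only; consult trainer_lib.py for the real API."""

    def __init__(self, init_checkpoint_file, data_file):
        # In a real implementation: load the starting weights from the
        # checkpoint and memory-map the training tokens. Here we only
        # record the paths.
        self.init_checkpoint_file = init_checkpoint_file
        self.data_file = data_file
        self.step_count = 0

    def step(self):
        # One optimization step. It must be numerically reproducible so
        # that the final weights match expected_checkpoint20.bin.
        self.step_count += 1

    def save(self, final_checkpoint_file):
        # Write the final weights for comparison against the expected file.
        pass
```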

Your trainer will be scored by running:

python train.py \
    --trainer="new" \
    --init_checkpoint_file="checkpoint0.bin" \
    --expected_checkpoint_file="expected_checkpoint20.bin" \
    --data_file="data.bin" \
    --max_num_steps=20 \
    --score_file="score.txt"

The score is the time in seconds it takes for your trainer to perform 20 steps of training (excluding the initialization of the trainer).

The model weights at the end of training must match the model weights in expected_checkpoint20.bin; otherwise, an error is raised and no score is generated.
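The timing convention (training steps measured, trainer initialization excluded) can be sketched as follows. The `trainer.step()` call is a hypothetical stand-in for however the real train.py drives a step:

```python
import time

def score_trainer(trainer_factory, max_num_steps=20):
    """Return the wall-clock seconds spent on training steps.

    Trainer construction happens before the clock starts, so
    initialization cost is excluded from the score.
    """
    trainer = trainer_factory()          # not timed
    start = time.perf_counter()          # clock starts here
    for _ in range(max_num_steps):
        trainer.step()                   # timed
    return time.perf_counter() - start
```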

The goal is to get the lowest score possible.

Results

This section contains some benchmarking results running the command in the "Task description" section with "nanogpt" instead of "new" for the trainer.

The benchmarks were run with commit 86e0ba12 of this codebase in a Docker container generated using Dockerfile on a "gpu_1x_a100_sxm4" instance from Lambda Labs ("1x A100 (40 GB SXM4), 30 CPU cores, 205.4 GB RAM, 525.8 GB SSD") in the "Arizona, USA (us-west-2)" region on 9/8/2024. checkpoint0.bin and expected_checkpoint20.bin were generated with the default config.

| Config | Time to train 20 steps (seconds) | Loss |
| --- | --- | --- |
| fused_adamw+flash_attn+compile_model (default) | 52.35 | 9.8413 |
| fused_adamw+flash_attn | 66.77 | 9.8402 |
| fused_adamw | 176.29 | 9.8404 |
| (none) | 176.33 | 9.8404 |
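The three optimizations in the table correspond to standard PyTorch features. The sketch below shows how each is typically enabled; it assumes a nanoGPT-style model on a CUDA GPU and is not a transcript of this repo's code:

```python
import torch

def build_fast_training(model):
    # fused_adamw: use the fused CUDA kernel for AdamW (requires CUDA tensors).
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, fused=True)
    # flash_attn: nanoGPT calls torch.nn.functional.scaled_dot_product_attention
    # inside the model, which dispatches to a FlashAttention kernel on
    # supported GPUs; there is no separate flag to set here.
    # compile_model: JIT-compile the model with TorchDynamo/Inductor.
    model = torch.compile(model)
    return model, optimizer
```

The table is consistent with what these features usually deliver: `torch.compile` gives the largest single speedup (66.77 s to 52.35 s here), flash attention dominates the remaining gap, and fused AdamW is a small effect at this scale.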
