# 7B based - 8 x 80GB VRAM benchmark

This notebook benchmarks 7B training on 8 x 80GB VRAM NVIDIA cards, with the following settings:
- 16k data pack size
- 4k training size
- microbatch 10

The following are the expected per-GPU throughput numbers, in kT/s (thousand tokens per second):

| GPU Model | Deepspeed 2 | Deepspeed 3 |
|-----------|-------------|-------------|
| H100 SXM  | 7 kT/s      | -           |
| H100 PCIe | 4.2 kT/s    | -           |
| A100 SXM  | 3 kT/s      | 2.6 kT/s    |
| A100 PCIe | 2.6 kT/s    | 2.3 kT/s    |
| H800 SXM* | 7 kT/s      | -           |

*H800 is the "China-safe export" edition of the H100. Its numbers come from the RWKV-LM repo, run with different settings (not the infctx repo), and are left here for reference.

Blanks (-) mean we didn't run them (yet?).
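To put the per-GPU numbers in context, the sketch below converts them into whole-node throughput and a rough wall-clock estimate. It is only an illustration: it assumes throughput scales linearly across the 8 GPUs, it maps the "Deepspeed 2" and "Deepspeed 3" columns to the `deepspeed_stage_2` / `deepspeed_stage_3` strategy names, and the helper names (`node_tokens_per_second`, `hours_for_tokens`) are hypothetical, not part of the trainer.

```python
# Per-GPU throughput in kT/s (thousand tokens per second), copied from the table above.
# Entries that were not run are simply left out.
PER_GPU_KTPS = {
    ("H100 SXM", "deepspeed_stage_2"): 7.0,
    ("H100 PCIe", "deepspeed_stage_2"): 4.2,
    ("A100 SXM", "deepspeed_stage_2"): 3.0,
    ("A100 SXM", "deepspeed_stage_3"): 2.6,
    ("A100 PCIe", "deepspeed_stage_2"): 2.6,
    ("A100 PCIe", "deepspeed_stage_3"): 2.3,
}

NUM_GPUS = 8  # this benchmark targets 8 x 80GB nodes


def node_tokens_per_second(gpu, strategy, num_gpus=NUM_GPUS):
    """Whole-node tokens/s, assuming linear scaling across GPUs (an assumption)."""
    return PER_GPU_KTPS[(gpu, strategy)] * 1000 * num_gpus


def hours_for_tokens(total_tokens, gpu, strategy, num_gpus=NUM_GPUS):
    """Rough wall-clock hours needed to train on `total_tokens` tokens."""
    return total_tokens / node_tokens_per_second(gpu, strategy, num_gpus) / 3600


if __name__ == "__main__":
    print(node_tokens_per_second("H100 SXM", "deepspeed_stage_2"))
    print(hours_for_tokens(1_000_000_000, "A100 SXM", "deepspeed_stage_2"))
```

Real runs will deviate from these estimates, since the figures in the table are expected values rather than guarantees.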

In [7]:
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="infctx-v5-benchmark"
DEEPSPEED_STRAT="deepspeed_stage_2"

print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

# The model file name and download URL
MODEL_NAME="RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth"
MODEL_URL="https://huggingface.co/RWKV/v5-Eagle-7B/resolve/main/RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth?download=true"

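# Compute the notebook directory and various paths
import os
# Note: "__file__" is a string literal here, so this resolves to the current working directory
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))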
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v5/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

ENABLE_WANDB: True
GPU_DEVICES: auto
NOTEBOOK_DIR: /workspace/picocreator/RWKV-infctx-trainer/notebook/trainer-v5-validation
TRAINER_DIR: /workspace/picocreator/RWKV-infctx-trainer/RWKV-v5
PROJECT_DIR: /workspace/picocreator/RWKV-infctx-trainer


In [8]:
# Let's wget the model files
!mkdir -p "{PROJECT_DIR}/model"
!cd "{PROJECT_DIR}/model" && \
    wget -O "{MODEL_NAME}" -nc "{MODEL_URL}"

File ‘RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth’ already there; not retrieving.


In [9]:
# Let's preload the required dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/config/enwiki_100k-world-16k-packing.yaml"

Filter (num_proc=160): 100%|█| 1000000/1000000 [00:05<00:00, 168022.72 examples/
Map (num_proc=160): 100%|█████| 120800/120800 [00:03<00:00, 32131.04 examples/s]
Map (num_proc=160): 100%|█████| 120800/120800 [00:06<00:00, 17412.67 examples/s]
Saving the dataset (4/4 shards): 100%|█| 18147/18147 [00:03<00:00, 5581.05 examp
Saving the dataset (1/1 shards): 100%|█| 13423/13423 [00:00<00:00, 30989.74 exam


# Actual training run
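
The training command in this section sets `--trainer.target_batch_size=640` and `--trainer.microbatch_size=8` on an 8-GPU node. As a rough sketch of how those values relate, the snippet below derives the number of gradient accumulation steps per optimizer update. It assumes the common convention that the target batch size is divided by (microbatch size x devices x nodes); the trainer's actual rounding logic is not shown in this notebook, and the function name here is hypothetical.

```python
def accumulate_grad_batches(target_batch_size, microbatch_size, num_devices, num_nodes=1):
    """Gradient accumulation steps needed to reach the target batch size.

    Assumption: each forward/backward pass processes
    microbatch_size * num_devices * num_nodes samples in total.
    """
    samples_per_pass = microbatch_size * num_devices * num_nodes
    # Never go below one accumulation step, even if the target is small.
    return max(1, target_batch_size // samples_per_pass)


if __name__ == "__main__":
    # Values taken from the training command in this section, on 8 GPUs.
    print(accumulate_grad_batches(target_batch_size=640, microbatch_size=8, num_devices=8))  # -> 10
```

Under that assumption, each optimizer step accumulates gradients over 10 passes of 64 samples.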

In [10]:
# Run with torch compile for "max" performance, at the cost of a slow (3+ minutes?) compile time
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    export RWKV_TORCH_COMPILE="1" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_100k-world-16k-packing.yaml" \
        --model.load_model="../model/{MODEL_NAME}" \
        --data.skip_datapath_setup=True \
        --trainer.callbacks.init_args.dirpath="../checkpoint/v5-7b-benchmark/baseline/" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - 7B - Baseline (packsize=16k, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.target_batch_size=640 \
        --trainer.microbatch_size=8 \
        --model.ctx_len=4096 \
        --trainer.devices="{GPU_DEVICES}"

[2024-02-11 07:56:25,736] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV infctx using 'torch-jit' with torch '2.1.1+cu121'
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['fit', '-c', '/workspace/picocreator/RWKV-infctx-trainer/notebook/trainer-v5-validation/config/enwiki_100k-world-16k-packing.yaml', '--model.load_model=../model/RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth', '--data.skip_datapath_setup=True', '--trainer.callbacks.init_args.dirpath=../checkpoint/v5-7b-benchmark/baseline/', '--trainer.logger.init_args.name=infctx-v5-benchmark - 7B - Baseline (packsize=16k, deepspeed_stage_2)', '--trainer.strategy=deepspeed_stage_2', '--trainer.target_batch_size=640', '--trainer.