# Deepspeed 1, 2 & 3 benchmark
The model being trained uses the same settings as the Raven 1B5 model:
- Layer count: 24
- Embed size: 2048

The goal is to validate the trainer across deepspeed 1, 2 & 3, with and without CPU offload, and to benchmark each configuration. All other training params remain constant.

## What do deepspeed 1, 2 & 3 do (with/without CPU offload)?

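Plain data parallelism (DDP) simply splits the dataset being trained and keeps a full copy of nearly everything on every GPU. Deepspeed 1 stays close to this: it keeps the full model weights and gradients on each GPU and only splits the optimizer states across GPUs.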

Deepspeed 2 keeps a full copy of the model weights on each GPU, but splits the gradient descent memory usage (gradients and optimizer states) across the GPUs, or offloads it into CPU memory (the CPU offload option).

Deepspeed 3 takes it a step further and also distributes the model weights across all the GPUs, drastically lowering the VRAM requirement while drastically increasing GPU-to-GPU traffic. Gradient descent memory is still split across the GPUs, with the option to offload it into CPU memory (same as deepspeed 2).
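
The split described above can be put into rough numbers. The sketch below is only an estimate under stated assumptions, not a measurement from this benchmark: it assumes mixed-precision Adam accounting (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer states), a nominal 1.5 billion parameters for the 1B5 model, and 8 GPUs. It ignores activations, buffers and framework overhead, so real VRAM usage will be higher. The function name `model_state_bytes_per_gpu` is hypothetical.

```python
# Rough per-GPU memory for model states under each deepspeed (ZeRO) stage.
# ASSUMPTIONS (not taken from the benchmark): mixed-precision Adam, i.e.
# 2 bytes/param for weights, 2 bytes/param for gradients and
# 12 bytes/param for optimizer states. Activations are ignored.
WEIGHT_BYTES = 2
GRAD_BYTES = 2
OPTIM_BYTES = 12


def model_state_bytes_per_gpu(n_params: int, n_gpus: int, stage: int) -> float:
    """Estimate bytes of model state held by one GPU (stage 0 = plain DDP)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    weights = float(n_params * WEIGHT_BYTES)
    grads = float(n_params * GRAD_BYTES)
    optim = float(n_params * OPTIM_BYTES)
    if stage >= 1:  # stage 1: optimizer states are split across GPUs
        optim /= n_gpus
    if stage >= 2:  # stage 2: gradients are split as well
        grads /= n_gpus
    if stage >= 3:  # stage 3: model weights are split as well
        weights /= n_gpus
    return weights + grads + optim


# Nominal parameter count and GPU count (assumptions for illustration).
N_PARAMS = 1_500_000_000
N_GPUS = 8

for stage in (0, 1, 2, 3):
    gib = model_state_bytes_per_gpu(N_PARAMS, N_GPUS, stage) / 2**30
    print(f"stage {stage}: ~{gib:.1f} GiB of model state per GPU")
```

CPU offload is not modelled here; it moves part of this per-GPU state into host RAM, which matches the lower VRAM and higher RAM figures in the results table.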

Finally, deepspeed 3 also introduces options to offload the model weights and gradient descent memory further, into CPU memory or NVMe. These options were not enabled or explored in the following benchmarks.
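
For reference, a minimal sketch of what such an NVMe offload setup could look like as a DeepSpeed config, written as a Python dict. This was not used in the benchmarks below. The `zero_optimization`, `offload_optimizer` and `offload_param` keys follow the DeepSpeed config JSON format; the `/local_nvme` path is a hypothetical placeholder, and how this config would be wired into the trainer is left open.

```python
import json

# Sketch of a DeepSpeed stage 3 config with NVMe offload.
# ASSUMPTION: "/local_nvme" is a placeholder path, not a path from this project.
NVME_PATH = "/local_nvme"

ds3_nvme_config = {
    "zero_optimization": {
        "stage": 3,
        # Move optimizer states off the GPU and onto NVMe storage
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
        # Move model parameters off the GPU and onto NVMe storage
        "offload_param": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
    }
}

# DeepSpeed configs are usually stored as JSON files
print(json.dumps(ds3_nvme_config, indent=2))
```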

See more here: https://huggingface.co/docs/transformers/main_classes/deepspeed

## Benchmark results

Benchmark was done on 20th Aug 2023, with Torch 2.0.1 and Cuda 11.8, on 8x3090 via vast.ai.
All benchmarks were done with a ctx length of 4096.

(@TODO - consolidate and update results)

---

| Deepspeed Strategy    | Time (A5000)            | Time (3090)             | VRAM Usage       | RAM Usage | Validation Loss |
| --------------------- | ----------------------- | ----------------------- | ---------------- | --------- | --------------- |
| Stage 2               | 24 mins : 55 sec        | 35 mins : 04 sec        | ~22.3 + 23.8 GB  | ~85 GB    | 6.173           |
| Stage 2 + CPU offload | 43 mins : 08 sec        | 59 mins : 04 sec        | ~9.7 + 10.3 GB   | ~128 GB   | 6.124           |
| Stage 3               | 29 mins : 12 sec        | 50 mins : 04 sec        | ~23.0 + 23.2 GB^ | ~85 GB    | 5.665           |
| Stage 3 + CPU offload | 1 hr : 42 mins : 38 sec | 1 hr : 29 mins : 15 sec | ~7.0 + 7.3 GB    | ~145 GB   | 5.668           |

---

> ^ Note: in theory deepspeed 3 uses less VRAM than deepspeed 2. However, it will also try to use more memory than it needs for "cache" items if possible, maxing out at the same level as deepspeed 2 here.
>
> Torch.JIT was enabled for deepspeed 2, but was disabled for deepspeed 3 (not compatible). Torch.compile was disabled.
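
To make the time cost of each option easier to compare, the 3090 timings from the table can be converted to seconds and expressed relative to Stage 2. This is a small sketch using only the numbers reported above; the helper names `to_seconds` and `slowdown_vs` are hypothetical.

```python
# 3090 timings from the results table, as (hours, minutes, seconds)
TIMES_3090 = {
    "Stage 2": (0, 35, 4),
    "Stage 2 + CPU offload": (0, 59, 4),
    "Stage 3": (0, 50, 4),
    "Stage 3 + CPU offload": (1, 29, 15),
}


def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an (h, m, s) duration into total seconds."""
    return hours * 3600 + minutes * 60 + seconds


def slowdown_vs(baseline: str, times: dict) -> dict:
    """Return each run's duration divided by the baseline run's duration."""
    base = to_seconds(*times[baseline])
    return {name: to_seconds(*hms) / base for name, hms in times.items()}


for name, ratio in slowdown_vs("Stage 2", TIMES_3090).items():
    print(f"{name}: {ratio:.2f}x the Stage 2 time")
```

Swapping in the A5000 column gives the same comparison for that hardware.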


## Configure and apply your preferred settings

Adjust your desired deepspeed settings and GPU device count.

Enable or disable WANDB here as well. The loss curve is needed for this experiment, so set `ENABLE_WANDB=True` when running the benchmark (the cell below has it set to `False`).

(Note: you will need to rerun this cell if you restart your env.)

In [None]:
GPU_DEVICES="auto"
ENABLE_WANDB=False
WANDB_PREFIX="infctx-v5-deepspeed-test"

print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

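# Compute the notebook directory and the project / trainer paths.
# Note: "__file__" is a plain string here (notebooks do not define __file__),
# so abspath resolves it against the current working directory and dirname
# returns that directory.
import os
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))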
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v5/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

In [None]:
# Init the model
!cd "{TRAINER_DIR}" && \
    python3 ./init_model.py \
        --n_layer 24 --n_embd 2048 \
        --vocab_size neox --skip-if-exists \
        "../model/L24-D2048-neox-v5base-init.pth"

In [None]:
# Let's preload the required dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/config/enwiki_100k-4096.yaml"

# Deepspeed 1

In [None]:
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_100k-4096.yaml" \
        --trainer.callbacks.init_args.dirpath="../checkpoint/v5-enwiki-100k-ds1/" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (deepspeed_stage_1, train-ctx=4096, data-ctx=4096)" \
        --trainer.strategy="deepspeed_stage_1" \
        --trainer.devices="{GPU_DEVICES}"

# Deepspeed 2

In [None]:
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_100k-4096.yaml" \
        --trainer.callbacks.init_args.dirpath="../checkpoint/v5-enwiki-100k-ds2/" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (deepspeed_stage_2, train-ctx=4096, data-ctx=4096)" \
        --trainer.strategy="deepspeed_stage_2" \
        --trainer.devices="{GPU_DEVICES}"

# Deepspeed 2 + Offload
Perform a full 1 epoch training run with training context size = 4096, using deepspeed 2 + offload.

In [None]:
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_100k-4096.yaml" \
        --trainer.callbacks.init_args.dirpath="../checkpoint/v5-enwiki-100k-ds2_o/" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (deepspeed_stage_2_offload, train-ctx=4096, data-ctx=4096)" \
        --trainer.strategy="deepspeed_stage_2_offload" \
        --trainer.devices="{GPU_DEVICES}"

# Deepspeed 3
Perform a full 1 epoch training run with training context size = 4096, using deepspeed 3.

In [None]:
!cd "{TRAINER_DIR}" && \
    export RWKV_JIT_ON=0 && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_100k-4096.yaml" \
        --trainer.callbacks.init_args.dirpath="../checkpoint/v5-enwiki-100k-ds3/" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (deepspeed_stage_3, train-ctx=4096, data-ctx=4096)" \
        --trainer.strategy="deepspeed_stage_3" \
        --trainer.devices="{GPU_DEVICES}"

# Deepspeed 3 + offload
Perform a full 1 epoch training run with training context size = 4096, using deepspeed 3 + offload.

In [None]:
!cd "{TRAINER_DIR}" && \
    export RWKV_JIT_ON=0 && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_100k-4096.yaml" \
        --trainer.callbacks.init_args.dirpath="../checkpoint/v5-enwiki-100k-ds3_o/" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (deepspeed_stage_3_offload, train-ctx=4096, data-ctx=4096)" \
        --trainer.strategy="deepspeed_stage_3_offload" \
        --trainer.devices="{GPU_DEVICES}"
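
The five training cells above differ only in the strategy name, the checkpoint directory suffix, and whether `RWKV_JIT_ON=0` is exported. As a sketch of that pattern, the helper below rebuilds the same shell command for any of the strategies. `build_command` is a hypothetical helper, not part of the trainer; it only assembles the string and does not run anything.

```python
# Checkpoint directory suffix used by each strategy in the cells above
CHECKPOINT_SUFFIX = {
    "deepspeed_stage_1": "ds1",
    "deepspeed_stage_2": "ds2",
    "deepspeed_stage_2_offload": "ds2_o",
    "deepspeed_stage_3": "ds3",
    "deepspeed_stage_3_offload": "ds3_o",
}

# Strategies that were run with JIT disabled (RWKV_JIT_ON=0)
JIT_DISABLED = {"deepspeed_stage_3", "deepspeed_stage_3_offload"}


def build_command(strategy: str, notebook_dir: str, wandb_prefix: str,
                  wandb_mode: str = "disabled", gpu_devices: str = "auto") -> str:
    """Assemble the trainer shell command for one deepspeed strategy."""
    suffix = CHECKPOINT_SUFFIX[strategy]  # KeyError on an unknown strategy
    env = [f'export WANDB_MODE="{wandb_mode}"']
    if strategy in JIT_DISABLED:
        env.insert(0, "export RWKV_JIT_ON=0")
    args = [
        "python3 lightning_trainer.py fit",
        f'-c "{notebook_dir}/config/enwiki_100k-4096.yaml"',
        f'--trainer.callbacks.init_args.dirpath="../checkpoint/v5-enwiki-100k-{suffix}/"',
        f'--trainer.logger.init_args.name="{wandb_prefix} ({strategy}, train-ctx=4096, data-ctx=4096)"',
        f'--trainer.strategy="{strategy}"',
        f'--trainer.devices="{gpu_devices}"',
    ]
    return " && ".join(env + [" ".join(args)])


# Print every command; run them from TRAINER_DIR, as the cells above do.
for name in CHECKPOINT_SUFFIX:
    print(build_command(name, "/path/to/notebook", "infctx-v5-deepspeed-test"))
    print()
```

The `/path/to/notebook` value is a placeholder; in the notebook it would be `NOTEBOOK_DIR` from the configuration cell.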