This is adapted from https://github.com/deepspeedai/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/README.md.

We only demonstrate the SFT step here, but feel free to try out the reward modeling and RLHF training!

Env setup: make sure you are using a GPU runtime (e.g. T4).

To train with DeepSpeed, the training script `main.py` needs a few modifications, outlined in the code cell below.

(Note: libraries such as Hugging Face Accelerate integrate with DeepSpeed; see https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed for more details!)

In [None]:
!pip install deepspeed

In [6]:
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
from deepspeed import get_accelerator

import argparse
import torch

# Helper bodies are skipped here; see the actual main.py for the full implementations.
def get_train_ds_config(offload,
                        dtype,
                        stage=2,
                        enable_hybrid_engine=False,
                        inference_tp_size=1,
                        release_inference_cache=False,
                        pin_parameters=True,
                        tp_gather_partition_size=8,
                        max_out_tokens=512,
                        enable_tensorboard=False,
                        enable_mixed_precision_lora=False,
                        tb_path="",
                        tb_name=""):
    pass


def parse_args():
    pass


# Takes the batch and the target device, matching the call in the training loop.
def to_device(batch, device):
    pass

def main():
    args = parse_args()

    if args.local_rank == -1:
        device = torch.device(get_accelerator().device_name())
    else:
        get_accelerator().set_device(args.local_rank)
        device = torch.device(get_accelerator().device_name(), args.local_rank)
        # Initializes the distributed backend, which takes care of synchronizing nodes/GPUs
        # torch.distributed.init_process_group(backend='nccl')
        deepspeed.init_distributed()
    ds_config = get_train_ds_config(offload=args.offload,
                                    dtype=args.dtype,
                                    stage=args.zero_stage,
                                    enable_tensorboard=args.enable_tensorboard,
                                    tb_path=args.tensorboard_path,
                                    tb_name="step1_model")
    # DeepSpeed provides a custom CPU Adam for when the optimizer is offloaded to the CPU
    AdamOptimizer = DeepSpeedCPUAdam if args.offload else FusedAdam
    # Initialize the model with DeepSpeed.
    # (model, optimizer, and lr_scheduler are created earlier in the actual main.py; omitted here.)
    # DeepSpeed model training is accomplished using the DeepSpeed engine.
    # The engine can wrap any arbitrary model of type torch.nn.Module and
    # has a minimal set of APIs for training and checkpointing the model.
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        args=args,
        config=ds_config,
        lr_scheduler=lr_scheduler,
        dist_init_required=True)
    # Placeholder: the actual main.py builds the training DataLoader here.
    train_dataloader = None
    # DeepSpeed automatically performs the operations required for distributed data-parallel training,
    # in mixed precision, with a pre-defined learning rate scheduler. No code change is needed.
    for epoch in range(args.num_train_epochs):
        model.train()
        import time
        for step, batch in enumerate(train_dataloader):
            start = time.time()
            batch = to_device(batch, device)
            outputs = model(**batch, use_cache=False)
            loss = outputs.loss
            if args.print_loss:
                print(
                    f"Epoch: {epoch}, Step: {step}, Rank: {torch.distributed.get_rank()}, loss = {loss}"
                )
            model.backward(loss)
            model.step()
            end = time.time()
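
The body of `get_train_ds_config` is skipped above. As a rough illustration of the kind of dictionary it returns, here is a minimal sketch that maps `offload`, `dtype`, and `stage` onto standard DeepSpeed config keys. The function name `sketch_ds_config`, the batch-size arguments, and the default values are assumptions made for this example; this is not the actual body from `main.py`.

```python
def sketch_ds_config(offload: bool,
                     dtype: str,
                     stage: int = 2,
                     micro_batch_size: int = 1,
                     grad_accum_steps: int = 4) -> dict:
    """Hypothetical, simplified stand-in for get_train_ds_config."""
    # "cpu" moves the partitioned state to host memory; "none" keeps it on the GPU.
    device = "cpu" if offload else "none"

    zero_opt = {
        "stage": stage,
        "offload_optimizer": {"device": device},
    }
    # Parameter offload only applies when parameters are partitioned (stage 3).
    if stage == 3:
        zero_opt["offload_param"] = {"device": device}

    config = {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": zero_opt,
    }

    # Enable exactly one mixed-precision mode, depending on the requested dtype.
    if dtype == "fp16":
        config["fp16"] = {"enabled": True}
    elif dtype == "bf16":
        config["bf16"] = {"enabled": True}
    return config


if __name__ == "__main__":
    import json
    print(json.dumps(sketch_ds_config(offload=True, dtype="bf16", stage=3), indent=2))
```

A dictionary like this can be passed as the `config` argument of `deepspeed.initialize`, which is how `ds_config` is used in the cell above.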

Let's see what is needed to run the training script with the `deepspeed` command-line launcher:
1. Specify your training script, `main.py`.
2. Specify args such as `model_name_or_path`, `zero_stage`, etc.

The script's own arguments are defined in `parse_args` in `main.py`. For the full list of DeepSpeed configuration options, see https://www.deepspeed.ai/docs/config-json/

Recall the three ZeRO stages, each of which builds on the previous one:
1. Optimizer state partitioning (ZeRO stage 1)
2. Gradient partitioning (ZeRO stage 2)
3. Parameter partitioning (ZeRO stage 3)
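
To get a feel for what each stage saves, the sketch below estimates per-GPU memory for model states only. It assumes the accounting commonly used for mixed-precision Adam in the ZeRO paper: 2 bytes per parameter for 16-bit weights, 2 bytes for 16-bit gradients, and 12 bytes for optimizer states (fp32 master weights, momentum, and variance). Activations, temporary buffers, and fragmentation are ignored, so real `nvidia-smi` readings will be higher. The function name and the 7B-parameter, 8-GPU example are illustrative assumptions.

```python
def zero_model_state_bytes(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU bytes of model states under mixed-precision Adam."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2.0 * num_params   # 16-bit weights
    grads = 2.0 * num_params    # 16-bit gradients
    optim = 12.0 * num_params   # fp32 master weights + Adam momentum + variance

    if stage >= 1:              # stage 1: partition optimizer states
        optim /= num_gpus
    if stage >= 2:              # stage 2: also partition gradients
        grads /= num_gpus
    if stage >= 3:              # stage 3: also partition parameters
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Illustrative setting: a 7B-parameter model on 8 GPUs.
    for stage in range(4):
        gb = zero_model_state_bytes(7e9, 8, stage) / 1e9
        print(f"ZeRO stage {stage}: ~{gb:.1f} GB of model states per GPU")
```

Stage 0 here means plain data parallelism, where every GPU holds a full copy of all model states.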

The bash script looks something like this:

In [None]:
"""
ZERO_STAGE=$1
OUTPUT=./output_llama2_7b
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=3
fi
mkdir -p $OUTPUT

deepspeed main.py \
   --data_split 2,4,4 \
   --model_name_or_path meta-llama/Llama-2-7b-hf \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 4 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0. \
   --num_train_epochs 3  \
   --gradient_accumulation_steps 4 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --gradient_checkpointing \
   --dtype bf16 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT
"""
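
The flags in the script above are consumed by `parse_args` in `main.py`, whose body is skipped in this notebook. The sketch below shows how a subset of them could be declared with `argparse`. The function name `parse_args_sketch` and all default values are assumptions for illustration; the real script defines more arguments and may register the `--deepspeed` flag through `deepspeed.add_config_arguments(parser)` rather than by hand.

```python
import argparse


def parse_args_sketch(argv=None):
    """Hypothetical subset of the arguments used by the launch script."""
    parser = argparse.ArgumentParser(description="SFT launch arguments (sketch)")
    parser.add_argument("--model_name_or_path", type=str, required=True)
    parser.add_argument("--data_split", type=str, default="2,4,4")
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--max_seq_len", type=int, default=512)
    parser.add_argument("--learning_rate", type=float, default=1e-5)
    parser.add_argument("--num_train_epochs", type=int, default=1)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    parser.add_argument("--seed", type=int, default=1234)
    parser.add_argument("--gradient_checkpointing", action="store_true")
    parser.add_argument("--dtype", type=str, default="fp16", choices=["fp16", "bf16"])
    parser.add_argument("--zero_stage", type=int, default=0, choices=[0, 1, 2, 3])
    parser.add_argument("--offload", action="store_true")
    parser.add_argument("--output_dir", type=str, default=None)
    # The deepspeed launcher passes the local rank to each worker process;
    # -1 is the single-process fallback checked in main().
    parser.add_argument("--local_rank", type=int, default=-1)
    # Declared by hand here so the sketch runs without DeepSpeed installed.
    parser.add_argument("--deepspeed", action="store_true")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args_sketch([
        "--model_name_or_path", "meta-llama/Llama-2-7b-hf",
        "--zero_stage", "3",
        "--dtype", "bf16",
        "--gradient_checkpointing",
        "--deepspeed",
    ])
    print(args.zero_stage, args.dtype, args.local_rank)
```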

Then we can run the command on a cluster and monitor `nvidia-smi` to see how much memory is saved!
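
To compare runs at different ZeRO stages without reading the `nvidia-smi` table by eye, memory usage can be polled from Python. The sketch below shells out to `nvidia-smi` with its CSV query options and parses the result. The helper names are made up for this example, and it assumes `nvidia-smi` is on the `PATH` of the machine running the training job.

```python
import subprocess
from typing import Dict, List

# Ask nvidia-smi for one CSV row per GPU, with values in MiB and no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> List[Dict[str, int]]:
    """Parse 'index, used, total' rows into a list of dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({
            "index": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return rows


def query_gpu_memory() -> List[Dict[str, int]]:
    """Run nvidia-smi once and return the per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` periodically while training runs, once per ZeRO stage, gives numbers that can be compared directly across stages.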