# Eagle 7B : Finetuning on capybara chat!

The following is an example of training the RWKV-v5 7B model on the iopair chat format
- https://huggingface.co/datasets/nampdn-ai/tiny-strange-textbooks

In this example, we train the model with a 16k packing size.

## Configure the env variables below
The default auto strategy should work on a single 4090 and scale all the way up to 8xH100s.

In [1]:
# -----------------------------------------------------------------
# Your configurable settings
# -----------------------------------------------------------------

# WANDB settings
ENABLE_WANDB=True
WANDB_PREFIX="RWKV-v5-Finetune"
WANDB_PROJECT="RWKV-v5-Finetune"

# Project directory offset (modify this if you move the notebook into another dir)
PROJECT_DIR_OFFSET="../../"

# Config dir (relative to the notebook, excluding ending slash)
# to use, with the config filename
CONFIG_FILE_DIR="."
CONFIG_FILE_NAME="Eagle-x-capybara-chat"

# The model to use
MODEL_NAME="RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth"
MODEL_URL="https://huggingface.co/RWKV/v5-Eagle-7B/resolve/main/RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth?download=true"

# GPU count to use
GPU_DEVICES="auto"

# -----------------------------------------------------------------
# Let's detect the GPU VRAM size, and suggest a reasonable default
# based on the detected VRAM size
# -----------------------------------------------------------------

# Default settings
# NOTE: If you're not using CUDA, you may want to change these manually
DEEPSPEED_STRAT="deepspeed_stage_2"
TRAINING_CTX_LEN=2048
MICROBATCH_SIZE=1

import torch
if torch.cuda is None or not torch.cuda.is_available() or torch.cuda.device_count() <= 0:
    print("No CUDA compatible GPU found, using default settings")
else:
    # -----------------------------------------------------------------
    # Auto select the strategy based on the detected VRAM size
    # -----------------------------------------------------------------

    GPU_COUNT=torch.cuda.device_count()
    GPU_0_VRAM_SIZE_GB=torch.cuda.get_device_properties(0).total_memory / 1024**3
    if GPU_DEVICES != "auto":
        GPU_COUNT=int(GPU_DEVICES)
    print("GPU_COUNT:", GPU_COUNT)
    print("GPU_0_VRAM_SIZE (GB):", GPU_0_VRAM_SIZE_GB)

    if GPU_0_VRAM_SIZE_GB < 17:
        assert False, "For the Eagle-7B model, you need at least 18GB vram"
    elif GPU_0_VRAM_SIZE_GB < 23:
        # This takes about 17.5GB vram on a single GPU
        # We DO NOT recommend training with ctx_len=128, as the training
        # quality will degrade noticeably. But it will work!
        DEEPSPEED_STRAT="deepspeed_stage_2_offload"
        TRAINING_CTX_LEN=128
        MICROBATCH_SIZE=1
    elif GPU_0_VRAM_SIZE_GB < 25:
        # This takes about 21GB vram on a single GPU
        DEEPSPEED_STRAT="deepspeed_stage_2_offload"
        TRAINING_CTX_LEN=2048
        MICROBATCH_SIZE=2
    elif GPU_0_VRAM_SIZE_GB < 78:
        # This takes about 23GB vram on a single GPU
        DEEPSPEED_STRAT="deepspeed_stage_2"
        TRAINING_CTX_LEN=4096
        MICROBATCH_SIZE=2
        if GPU_COUNT >= 8:
            MICROBATCH_SIZE=4
    else:
        # This is now the 80GB vram class
        DEEPSPEED_STRAT="deepspeed_stage_2"
        TRAINING_CTX_LEN=4096
        MICROBATCH_SIZE=4
        if GPU_COUNT >= 8:
            MICROBATCH_SIZE=8

# -----------------------------------------------------------------
# Training settings you can use to override the "auto" default above
# -----------------------------------------------------------------
# DEEPSPEED_STRAT="deepspeed_stage_1"
# TRAINING_CTX_LEN=4096
# MICROBATCH_SIZE=8

# ---
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)
print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("TRAINING_CTX_LEN:", TRAINING_CTX_LEN)
if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

# Compute the notebook directory and the other paths derived from it
import os
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, PROJECT_DIR_OFFSET))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v5/"))
print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

# Check if the directory exists
if not os.path.exists(TRAINER_DIR):
    raise Exception("The trainer directory does not exist. Did you move the notebook?")

GPU_COUNT: 8
GPU_0_VRAM_SIZE (GB): 79.10943603515625
ENABLE_WANDB: True
GPU_DEVICES: auto
DEEPSPEED_STRAT: deepspeed_stage_2
TRAINING_CTX_LEN: 4096
NOTEBOOK_DIR: /workspace/picocreator/RWKV-infctx-trainer/notebook/finetune-example
TRAINER_DIR: /workspace/picocreator/RWKV-infctx-trainer/RWKV-v5
PROJECT_DIR: /workspace/picocreator/RWKV-infctx-trainer
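
The tier selection in the cell above can also be written as a pure function, which makes the thresholds easy to check without a GPU. This is only a minimal sketch: `select_training_settings` is a hypothetical helper name (not part of the trainer), and the thresholds are copied from the cell above.

```python
def select_training_settings(vram_gb: float, gpu_count: int):
    """Return (deepspeed_strategy, ctx_len, microbatch_size) for a VRAM tier.

    Hypothetical helper: it mirrors the thresholds of the config cell above.
    """
    if vram_gb < 17:
        raise RuntimeError("For the Eagle-7B model, you need at least 18GB vram")
    if vram_gb < 23:
        # Smallest tier: offload plus a very short context length
        return ("deepspeed_stage_2_offload", 128, 1)
    if vram_gb < 25:
        # 24GB class (for example a single 4090)
        return ("deepspeed_stage_2_offload", 2048, 2)
    if vram_gb < 78:
        # 40GB to 48GB class
        return ("deepspeed_stage_2", 4096, 4 if gpu_count >= 8 else 2)
    # 80GB class
    return ("deepspeed_stage_2", 4096, 8 if gpu_count >= 8 else 4)


# The 8 x 80GB machine from the output above
print(select_training_settings(79.1, 8))
```

For the machine in the output above, this gives the same `DEEPSPEED_STRAT` and `TRAINING_CTX_LEN` values that the cell printed.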


## Let's download the model

In [2]:
!cd "{PROJECT_DIR}" && mkdir -p "./model" && \
    cd "./model" && \
    wget -nc "{MODEL_URL}" -O "{MODEL_NAME}"

File ‘RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth’ already there; not retrieving.
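
If `wget` is not available, the same "skip when the file is already there" behaviour can be approximated in plain Python. This is a rough sketch using only the standard library; `download_if_missing` is a hypothetical helper name, and it has no progress bar or resume support.

```python
import os
import urllib.request


def download_if_missing(url: str, dest_path: str) -> bool:
    """Download url to dest_path unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    Hypothetical helper, roughly comparable to `wget -nc URL -O FILE`.
    """
    if os.path.exists(dest_path):
        return False
    os.makedirs(os.path.dirname(os.path.abspath(dest_path)), exist_ok=True)
    # Download to a temporary name first, so an interrupted transfer
    # does not leave a partial file under the final name
    tmp_path = dest_path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, dest_path)
    return True


# Usage inside the notebook (uses the variables defined in the first cell):
# download_if_missing(MODEL_URL, os.path.join(PROJECT_DIR, "model", MODEL_NAME))
```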


In [3]:
# Let's preload the required dataset
!cd "{TRAINER_DIR}" && python3 preload_datapath.py "{NOTEBOOK_DIR}/{CONFIG_FILE_DIR}/{CONFIG_FILE_NAME}.yaml"

Downloading readme: 100%|██████████████████| 6.47k/6.47k [00:00<00:00, 37.4MB/s]
Downloading data: 100%|████████████████████| 74.0M/74.0M [00:02<00:00, 27.0MB/s]
Setting num_proc from 160 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 16006 examples [00:00, 102925.73 examples/s]
Map (num_proc=160): 100%|███████| 16006/16006 [00:00<00:00, 17207.62 examples/s]
Filter (num_proc=160): 100%|████| 16006/16006 [00:00<00:00, 26494.68 examples/s]
Map (num_proc=160): 100%|████████| 15844/15844 [00:03<00:00, 4949.65 examples/s]
Map (num_proc=160): 100%|███████| 15844/15844 [00:01<00:00, 13784.94 examples/s]
Map (num_proc=160): 100%|██████████| 1989/1989 [00:00<00:00, 3547.09 examples/s]
Map (num_proc=160): 100%|█████████████| 161/161 [00:01<00:00, 127.75 examples/s]
Saving the dataset (1/1 shards): 100%|█| 1989/1989 [00:00<00:00, 6849.96 example
Saving the dataset (1/1 shards): 100%|█| 161/161 [00:00<00:00, 6137.99 examples/


## Start the training run!

In [5]:
# Setup the checkpoint dir
!cd "{PROJECT_DIR}" && mkdir -p "./checkpoint/{CONFIG_FILE_NAME}/"

# Let's start the training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/{CONFIG_FILE_DIR}/{CONFIG_FILE_NAME}.yaml" \
        --model.load_model="../model/{MODEL_NAME}" \
        --data.skip_datapath_setup=True \
        --trainer.callbacks.init_args.dirpath="../checkpoint/{CONFIG_FILE_NAME}/" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - {CONFIG_FILE_NAME} (tctxlen={TRAINING_CTX_LEN}, {DEEPSPEED_STRAT})" \
        --trainer.logger.init_args.project="{WANDB_PROJECT}" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.target_batch_size=64 \
        --trainer.microbatch_size={MICROBATCH_SIZE} \
        --model.ctx_len={TRAINING_CTX_LEN} \
        --trainer.devices="{GPU_DEVICES}"

[2024-02-02 08:12:42,219] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV infctx using 'torch-jit' with torch '2.1.2'
/root/miniconda3/envs/rwkv-infctx/lib/python3.11/site-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['fit', '-c', '/workspace/picocreator/RWKV-infctx-trainer/notebook/finetune-example/./Eagle-x-capybara-chat.yaml', '--model.load_model=../model/RWKV-v5-Eagle-World-7B-v2-20240128-ctx4096.pth', '--data.skip_datapath_setup=True', '--trainer.callbacks.init_args.dirpath=../checkpoint/Eagle-x-capybara-chat/', '--trainer.logger.init_args.name=RWKV-v5-Finetune - Eagle-x-capybara-chat (tctxlen=4096, deepspeed_stage_2)', '--trainer.logger.init_args.project=RWKV-v5-Finetune', '--trainer.strategy=deepspeed_
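
The `--trainer.target_batch_size=64` and `--trainer.microbatch_size` flags interact with the GPU count. As a rough sketch, and assuming the trainer reaches the target batch size by accumulating gradients over roughly `target_batch_size / (gpu_count * microbatch_size)` steps (an assumption about the trainer's internals that this notebook does not confirm), the arithmetic looks like this. `accumulation_steps` is a hypothetical helper name.

```python
import math


def accumulation_steps(target_batch_size: int, gpu_count: int, microbatch_size: int) -> int:
    """Estimate gradient accumulation steps per optimizer update.

    Assumption: each GPU processes `microbatch_size` samples per step, and
    the count is rounded up so the target batch size is at least reached.
    """
    samples_per_step = gpu_count * microbatch_size
    return max(1, math.ceil(target_batch_size / samples_per_step))


# 8 GPUs with microbatch size 8 already cover the target of 64 in one step
print(accumulation_steps(64, 8, 8))
# A single GPU with microbatch size 2 needs many more accumulation steps
print(accumulation_steps(64, 1, 2))
```

Under this assumption, a smaller GPU setup trains towards the same effective batch size, just with more accumulation steps per update.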

## Export the model

In [6]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py "../checkpoint/{CONFIG_FILE_NAME}/last.ckpt" "../model/{CONFIG_FILE_NAME}.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/{CONFIG_FILE_NAME}.pth"

[2024-02-02 08:24:40,135] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/Eagle-x-capybara-chat/last.ckpt/checkpoint'
Detected checkpoint of type zero stage ZeroStageEnum.gradients, world_size: 8
Parsing checkpoint created by deepspeed==0.12.6
Reconstructed fp32 state dict with 710 params 7518044160 elements
Saving bf16 state dict to ../model/Eagle-x-capybara-chat.pth
-rw-r--r-- 1 nobody root 15G Feb  2 08:26 ../model/Eagle-x-capybara-chat.pth
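
As a quick plausibility check on the export, the file size can be estimated from the element count in the log above, since bf16 stores each element in 2 bytes. This is a minimal sketch; `expected_checkpoint_bytes` is a hypothetical helper name, and the estimate ignores any file format overhead.

```python
PARAM_ELEMENTS = 7_518_044_160  # "elements" count from the export log above
BYTES_PER_BF16 = 2              # bf16 uses 16 bits per element


def expected_checkpoint_bytes(elements: int, bytes_per_element: int = BYTES_PER_BF16) -> int:
    """Estimate the raw tensor payload of a checkpoint in bytes."""
    return elements * bytes_per_element


size_bytes = expected_checkpoint_bytes(PARAM_ELEMENTS)
print(f"{size_bytes} bytes, about {size_bytes / 1024**3:.2f} GiB")
```

The estimate comes out slightly above 14 GiB, which is consistent with the `15G` shown by `ls -alh` if `ls` rounds the human-readable size up.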


## Sanity check (that the model actually outputs something)

In [7]:
# Let's do a quick dragon prompt validation
!cd "{TRAINER_DIR}" && \
    python3 dragon_test.py "../model/{CONFIG_FILE_NAME}.pth" "cuda bf16"

[2024-02-02 08:26:36,377] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV infctx using 'torch-jit' with torch '2.1.2'
  return self.fget.__get__(instance, owner)()
---
[RWKV.TimeMix] Compiling CUDA kernel with HEAD_SIZE=64
Using /root/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu121/wkv5/build.ninja...
Building extension module wkv5...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv5...
[RWKV.TimeMix] CUDA kernel compiled & loaded globally
---
  batch_tokens = torch.tensor(
--- DRAGON PROMPT ---
In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dra