# RWKV Baseline C
This is the standard RWKV 1B5 model:

- 24 layers
- 2048 embedding size
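
As a sanity check on the "1B5" label, the parameter count can be estimated from the two numbers above. The sketch below assumes the standard RWKV-v4 block layout, a channel-mix hidden size of 4x the embedding size, and a vocabulary of 50277 tokens; none of these are stated in this notebook, so treat them as assumptions. `rwkv4_param_count` is a hypothetical helper name, not part of the trainer.

```python
def rwkv4_param_count(n_layer: int, n_embd: int, vocab_size: int = 50277) -> int:
    """Estimate the parameter count of an RWKV-v4 style model (assumed layout)."""
    dim_ffn = 4 * n_embd  # assumed channel-mix hidden size

    # Time-mix block: key, value, receptance and output matrices,
    # plus time_decay, time_first and three time_mix vectors.
    att = 4 * n_embd * n_embd + 5 * n_embd
    # Channel-mix block: key, value and receptance matrices,
    # plus two time_mix vectors.
    ffn = 2 * n_embd * dim_ffn + n_embd * n_embd + 2 * n_embd
    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * 2 * n_embd

    per_layer = att + ffn + norms
    # Embedding and output head, plus the input and output LayerNorms.
    shared = 2 * vocab_size * n_embd + 2 * 2 * n_embd
    return n_layer * per_layer + shared


print(rwkv4_param_count(n_layer=24, n_embd=2048))  # 1515106304
```

Under these assumptions the estimate is roughly 1.5 billion parameters, which agrees with the element count reported when the checkpoint is exported later in this notebook.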

It goes through the same memory training process as TokenShift.

See `./notes.md` for how the init model was initialized.

**Note:** This project assumes you have the rwkv-infctx conda env set up.

---

```bash
# ninja-build is required for the new trainer
sudo apt-get install ninja-build

# Update conda & its package listings
# The conda install script can be found here:
# https://docs.anaconda.com/free/anaconda/install/linux/#installation
conda update conda

# Virtual env, with python 3.10
# python 3.11 has issues with torch.compile / H100s;
# if you want to use 3.11, you will need to do a nightly build install
conda create -n rwkv-infctx python=3.10 pip
conda activate rwkv-infctx

# Install pytorch (>=2.0.1)
conda install -y pytorch==2.0.1 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Verify your pytorch version 
python -c "import torch; print(torch.__version__)"

# We use python -m pip instead of pip directly, as it resolves issues with the venv not loading the right pip
python -m pip install datasets transformers 
python -m pip install lightning==2.0.4 deepspeed==0.9.5
python -m pip install ninja numexpr jsonargparse 'jsonargparse[signatures]'
python -m pip install lm-dataformat ftfy sentencepiece tokenizers wandb
```
---
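
After the installs, it can help to confirm that the pinned packages actually landed in the active environment. The following is a minimal sketch using only the standard library; the `EXPECTED` table simply repeats the pins from the install commands above, and `check_versions` is a hypothetical helper, not something the trainer provides.

```python
from importlib import metadata

# Pins copied from the pip install commands above.
EXPECTED = {"lightning": "2.0.4", "deepspeed": "0.9.5"}


def check_versions(expected, lookup=metadata.version):
    """Return {package: (wanted, found)} for every pin that does not match."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


print(check_versions(EXPECTED) or "all pinned packages match")
```

An empty result means every pin matches; otherwise each entry shows the wanted version next to what was found (`None` if the package is missing).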

# Basic Setup

In [1]:
# First, let's set up the various directories and get the blank init model. This init model was generated
# using the original RWKV-LM repo (at the time of writing, this repo cannot init a model).
# As such, I have preinitialized these blank models and uploaded them to HF for convenience.
!mkdir -p ../../../../model/
!mkdir -p ../../../../datapath/
!mkdir -p ../../../../checkpoint/
!rm -rf ../../../../model/Echo-A-1B5-Init.pth
!cd ../../../../model/ && wget -nc https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/Echo-A-1B5-Init.pth
!ls -alh ../../../../model/Echo-A-1B5-Init.pth

# The various other stages, if you want to skip stuff

--2023-07-29 04:40:28--  https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/Echo-A-1B5-Init.pth
Resolving huggingface.co (huggingface.co)... 143.204.55.124, 143.204.55.75, 143.204.55.121, ...
Connecting to huggingface.co (huggingface.co)|143.204.55.124|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/cb/ef/cbef09abb2634a3375b28868bffa285226dfeabedec89b28c2fb302221164d66/0ec7214ed16737a6348254e6f96d8cdc04d3b5efbd5f53fe9337607ea42b5b9f?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27Echo-A-1B5-Init.pth%3B+filename%3D%22Echo-A-1B5-Init.pth%22%3B&Expires=1690864829&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTY5MDg2NDgyOX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy9jYi9lZi9jYmVmMDlhYmIyNjM0YTMzNzViMjg4NjhiZmZhMjg1MjI2ZGZlYWJlZGVjODliMjhjMmZiMzAyMjIxMTY0ZDY2LzBlYzcyMTRlZDE2NzM3YTYzNDgyNTRlNmY5NmQ4Y2RjMDRkM2I1ZWZ

In [2]:
DEEPSPEED_STRAT="deepspeed_stage_1"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="Baseline-C"

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

# Compute the notebook directory and the various related paths
import os
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v4neo/"))
INFERENCE_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v4neo/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("INFERENCE_DIR:", INFERENCE_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

DEEPSPEED_STRAT: deepspeed_stage_1
ENABLE_WANDB: True
GPU_DEVICES: auto
NOTEBOOK_DIR: /root/picocreator-memory-experiment/notebook/experiment/tokenshift-exp/Baseline-C
INFERENCE_DIR: /root/picocreator-memory-experiment/RWKV-v4neo
TRAINER_DIR: /root/picocreator-memory-experiment/RWKV-v4neo
PROJECT_DIR: /root/picocreator-memory-experiment
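
The `NOTEBOOK_DIR` line relies on a small quirk: `"__file__"` is passed as a string literal, not as the `__file__` variable (which is not defined inside a notebook kernel). `os.path.abspath` therefore resolves it against the current working directory, and `dirname` strips the placeholder file name again. A minimal sketch of that equivalence, assuming the kernel was started in the notebook's own directory:

```python
import os

# "__file__" is just a relative path here, so abspath() joins it onto the cwd.
via_idiom = os.path.dirname(os.path.abspath("__file__"))
via_getcwd = os.path.abspath(os.getcwd())

print(via_idiom == via_getcwd)  # True
```

If the kernel's working directory differs from the notebook's location, `NOTEBOOK_DIR` and every path derived from it will be off as well.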


## Stage 1 : Foundation model training

In [3]:
# Let's preload the required dataset (enwiki_100k)
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/Baseline-C-enwiki.yaml"

Downloading readme: 100%|██████████████████████| 433/433 [00:00<00:00, 3.25MB/s]
Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/teven___parquet/teven--enwiki_100k-1359e81b212c2dd6/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...
Downloading data files:   0%|                             | 0/1 [00:00<?, ?it/s]
Downloading data:   0%|                              | 0.00/261M [00:00<?, ?B/s]
Downloading data:   0%|                      | 52.2k/261M [00:00<15:06, 288kB/s]
Downloading data:   0%|                      | 288k/261M [00:00<03:47, 1.14MB/s]
Downloading data:   0%|                      | 662k/261M [00:00<02:32, 1.71MB/s]
Downloading data:   1%|▏                    | 1.88M/261M [00:00<00:53, 4.87MB/s]
Downloading data:   1%|▎                    | 3.63M/261M [00:00<00:29, 8.60MB/s]
Downloading data:   3%|▌                    | 7.21M/261M [00:00<00:15, 16.7MB/s]
Downloading data:   5%|▉                   

In [4]:
# Start the foundation model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/Baseline-C-enwiki.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Enwiki Foundation (ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" 

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 743309031
[34m[1mwandb[0m: Currently logged in as: [33mpicocreator[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.15.7 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m./wandb/run-20230729_044418-jo4wip9s[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mBaseline-C - Enwiki Foundation (ctx=4096, deepspeed_stage_1)[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/picocreator/RWKV-5X-Experiments[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/picocreator/RWKV-5X-Experiments/runs/jo4wip9s[0m
Using 
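
The `--trainer.*` flags in the training command above look like dotted-key overrides layered on top of the YAML file passed with `-c`, which is the convention used by jsonargparse-based CLIs. The sketch below only illustrates that idea with plain dictionaries; `apply_override` is a hypothetical helper and the starting `config` values are made up, so this is not the trainer's actual parsing code.

```python
def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Set config["a"]["b"]["c"] = value for dotted_key "a.b.c" (illustrative only)."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Hypothetical stand-in for the values loaded from the YAML file.
config = {"trainer": {"strategy": "auto", "devices": 1}}

apply_override(config, "trainer.strategy", "deepspeed_stage_1")
apply_override(config, "trainer.devices", "auto")
apply_override(config, "trainer.logger.init_args.name", "Baseline-C - Enwiki Foundation")

print(config["trainer"]["strategy"])  # deepspeed_stage_1
```

Read this way, the notebook variables `DEEPSPEED_STRAT`, `GPU_DEVICES` and `WANDB_PREFIX` take precedence over whatever the YAML file sets for the same keys.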

In [5]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py "../checkpoint/Baseline-C-enwiki/last.ckpt" "../model/Baseline-C-Stage1.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/Baseline-C-Stage1.pth"

Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/Baseline-C-enwiki/last.ckpt/checkpoint'
Detected checkpoint of type zero stage ZeroStageEnum.optimizer_states, world_size: 8
Parsing checkpoint created by deepspeed==0.9.3
Reconstructed fp32 state dict with 438 params 1515106304 elements
Saving fp32 state dict to ../model/Baseline-C-Stage1.pth
-rw-r--r-- 1 root root 5.7G Jul 29 11:14 ../model/Baseline-C-Stage1.pth
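
The exported file size can be cross-checked against the element count in the log above: an fp32 state dict stores 4 bytes per element. A short sketch of that arithmetic, where `fp32_size_gib` is a hypothetical helper and the small serialization overhead of the `.pth` file is ignored:

```python
def fp32_size_gib(n_elements: int) -> float:
    """Size in GiB of n_elements stored as 4-byte floats."""
    return n_elements * 4 / 2**30


# Element count taken from the export log above.
print(f"{fp32_size_gib(1515106304):.2f} GiB")  # 5.64 GiB
```

That is consistent with the 5.7G shown by `ls -alh`, which appears to round the human-readable size upward.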


In [6]:
# Let's do a quick dragon prompt validation
!cd "{INFERENCE_DIR}" && python3 dragon_test.py "../model/Baseline-C-Stage1.pth" "cuda fp32"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=wkv_1024_bf16 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.11/dist-packages/torch/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.11/dist-packages/torch/include/THC -isystem /usr/local/cud

In [14]:
# # Let's do a quick memory test
# # (We don't expect this to work, as we have not finetuned for memory recall, but it's a baseline)
# !python3 ../memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/Baseline-C-Stage1.pth"

## Stage 2 : Instruct Tuning

In [8]:
# Let's preload the required dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/Baseline-C-instruct.yaml"

Downloading readme: 100%|██████████████████| 7.79k/7.79k [00:00<00:00, 23.8MB/s]
Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/c-s-ale___parquet/c-s-ale--dolly-15k-instruction-alpaca-format-9dfbb23260d63d9d/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...
Downloading data files:   0%|                             | 0/1 [00:00<?, ?it/s]
Downloading data:   0%|                             | 0.00/7.80M [00:00<?, ?B/s]
Downloading data:   1%|▏                    | 52.2k/7.80M [00:00<00:26, 289kB/s]
Downloading data:   4%|▉                     | 313k/7.80M [00:00<00:07, 985kB/s]
Downloading data:   9%|█▉                   | 721k/7.80M [00:00<00:03, 1.98MB/s]
Downloading data:  26%|█████▏              | 2.02M/7.80M [00:00<00:01, 5.43MB/s]
Downloading data:  51%|██████████▏         | 3.98M/7.80M [00:00<00:00, 9.81MB/s]
Downloading data: 100%|████████████████████| 7.80M/7.80M [00:00<00:00, 10.3MB/s]
Downloading dat

In [9]:
# Start the instruct finetuning
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/Baseline-C-instruct.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Instruct (train-ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 417591938
[34m[1mwandb[0m: Currently logged in as: [33mpicocreator[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.15.7 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m./wandb/run-20230729_111557-759pr8ua[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mBaseline-C - Instruct (train-ctx=4096, deepspeed_stage_1)[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/picocreator/RWKV-5X-Experiments[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/picocreator/RWKV-5X-Experiments/runs/759pr8ua[0m
Using /ro

In [10]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py "../checkpoint/Baseline-C-instruct/last.ckpt" "../model/Baseline-C-Stage2.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/Baseline-C-Stage2.pth"

Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/Baseline-C-instruct/last.ckpt/checkpoint'
Detected checkpoint of type zero stage ZeroStageEnum.optimizer_states, world_size: 8
Parsing checkpoint created by deepspeed==0.9.3
Reconstructed fp32 state dict with 438 params 1515106304 elements
Saving fp32 state dict to ../model/Baseline-C-Stage2.pth
-rw-r--r-- 1 root root 5.7G Jul 29 12:01 ../model/Baseline-C-Stage2.pth


In [11]:
# Let's do a quick dragon prompt validation
!cd "{INFERENCE_DIR}" && python3 dragon_test.py "../model/Baseline-C-Stage2.pth" "cuda fp32"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_1024_bf16...
--- DRAGON PROMPT ---
In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese.# Answer:\nZhijanghünenzhuyen was a dragon, named Zisanghikers. He lived in the mountains, he was in the quest for Lord Vestris. He was born in China. He was a Buddhist dancer, poet and King of the Wolf. He was

In [13]:
# # Let's do a quick memory test
# # (We don't expect this to work, as we have not finetuned for memory recall, but it's a baseline)
# !python3 ../memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/Baseline-C-Stage2.pth"