# RWKV Token Shift Experiment B
This experiment uses a custom model with
- 24 layers
- 1024 embedding size

See `./notes.md` for how the init model was initialized.

**Note:** This project assumes you have the rwkv-infctx conda env set up.

---

```bash
# ninja-build is required for the new trainer
sudo apt-get install ninja-build

# Update conda & its package listings
conda update conda

# Virtual env, with python 3.10
# python 3.11 has issues with torch.compile / H100s;
# if you want to use 3.11, you will need to do a nightly build install
conda create -n rwkv-infctx python=3.10 pip
conda activate rwkv-infctx

# Install pytorch (>=2.0.1)
conda install -y pytorch==2.0.1 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Verify your pytorch version 
python -c "import torch; print(torch.__version__)"

# We use python -m pip instead of pip directly, as it resolves issues with venv not loading the right pip
python -m pip install datasets transformers 
python -m pip install lightning==2.0.4 deepspeed==0.9.5
python -m pip install ninja numexpr jsonargparse 'jsonargparse[signatures]'
python -m pip install lm-dataformat ftfy sentencepiece tokenizers wandb
```
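Once the environment is created, it can help to confirm which versions actually got installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the pinned `pip install` lines above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if it is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Packages pinned in the setup steps above
    for name, ver in installed_versions(["torch", "lightning", "deepspeed"]).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

A `None` entry means the package is not visible to the interpreter you ran this with, which is usually a sign that the wrong environment is active.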
---

# Basic Setup

In [1]:
# First, let's set up the various directories and get the blank init model. This init model was generated
# using the original RWKV-LM repo (at the time of writing, this repo cannot init a model).
# As such, I have preinitialized these blank models and uploaded them to HF for convenience
!mkdir -p ../../../../model/
!mkdir -p ../../../../datapath/
!mkdir -p ../../../../checkpoint/
!cd ../../../../model/ && wget -nc https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/L24-D1024-init.pth
!ls -alh ../../../../model/L24-D1024-init.pth

# The various other stages, if you want to skip stuff

--2023-07-20 06:07:53--  https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/L24-D1024-init.pth
Resolving huggingface.co (huggingface.co)... 143.204.55.121, 143.204.55.85, 143.204.55.75, ...
Connecting to huggingface.co (huggingface.co)|143.204.55.121|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/cb/ef/cbef09abb2634a3375b28868bffa285226dfeabedec89b28c2fb302221164d66/08948da228c8a8b7e5f77387e8f980dd20246fb375ee1c58a437b6bd075bf6e0?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27L24-D1024-init.pth%3B+filename%3D%22L24-D1024-init.pth%22%3B&Expires=1690092473&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTY5MDA5MjQ3M319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy9jYi9lZi9jYmVmMDlhYmIyNjM0YTMzNzViMjg4NjhiZmZhMjg1MjI2ZGZlYWJlZGVjODliMjhjMmZiMzAyMjIxMTY0ZDY2LzA4OTQ4ZGEyMjhjOGE4YjdlNWY3NzM4N2U4Zjk4MGRkMjAyNDZmYjM3NWV
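If you want to sanity-check the downloaded init model, one option is to hash it locally. This is a minimal sketch, not part of the original notebook: `sha256_of_file` is a hypothetical helper, and treating the 64-character hex segment in the redirect URL above as the file's SHA-256 is an assumption about how the CDN path is built.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large checkpoint need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (path taken from the cell above; uncomment after downloading):
# print(sha256_of_file("../../../../model/L24-D1024-init.pth"))
```

If the printed digest differs from the one you expect, re-download the file before starting a long training run.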

In [19]:
DEEPSPEED_STRAT="deepspeed_stage_1"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="TokenShift-Exp-B"

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

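# Compute the notebook directory and the various project paths
import os
# Note: "__file__" is a plain string here, so this resolves to the current working directory
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))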
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v4wavenet/"))
INFERENCE_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v4wavenet/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("INFERENCE_DIR:", INFERENCE_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

DEEPSPEED_STRAT: deepspeed_stage_1
ENABLE_WANDB: True
GPU_DEVICES: auto
NOTEBOOK_DIR: /root/rwkv5x-tokenshift-exp-A/notebook/experiment/tokenshift-exp
INFERENCE_DIR: /root/rwkv5x-tokenshift-exp-A/RWKV-v4wavenet
TRAINER_DIR: /root/rwkv5x-tokenshift-exp-A/RWKV-v4wavenet
PROJECT_DIR: /root/rwkv5x-tokenshift-exp-A
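The path setup relies on two small behaviours of `os.path`: `abspath("__file__")` resolves a literal string against the current working directory, and each `..` segment climbs one directory. A minimal sketch of both follows; the helper `resolve_project_dir` and the example directory layout are hypothetical, chosen only for illustration.

```python
import os


def resolve_project_dir(notebook_dir, relative="../../../../"):
    """Climb from the notebook directory to the project root."""
    return os.path.abspath(os.path.join(notebook_dir, relative))


# "__file__" is a string literal, not the module attribute, so abspath()
# joins it onto the current working directory; dirname() then strips it off.
notebook_dir = os.path.dirname(os.path.abspath("__file__"))
print("cwd-based notebook dir:", notebook_dir)

# Hypothetical layout: four levels below the project root
example = os.path.join(os.path.abspath(os.sep), "work", "repo", "notebook", "a", "b", "c")
print("project dir:", resolve_project_dir(example))
```

Because the result depends on the working directory, the notebook has to be launched from the directory it lives in for these paths to be correct.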


## Stage 1 : Foundation model training

In [3]:
# Let's preload the required dataset (enwiki_100k)
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/TokenShift-B-enwiki.yaml"

Found cached dataset parquet (/root/.cache/huggingface/datasets/teven___parquet/teven--enwiki_100k-1359e81b212c2dd6/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 71.82it/s]
                                                                                

In [4]:
# Start the foundation model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/TokenShift-B-enwiki.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Enwiki Foundation (ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" 

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 2193157053
wandb: Currently logged in as: picocreator. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.5 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in ./wandb/run-20230720_060952-on3b4cc0
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run TokenShift-Exp-B - Enwiki Foundation (ctx=4096, deepspeed_stage_1)
wandb: ⭐️ View project at https://wandb.ai/picocreator/RWKV-5X-Experiments
wandb: 🚀 View run at https://wandb.ai/picocreator/RWKV-5X-Experiments/runs/on3b4cc0

In [5]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py "../checkpoint/TokenShift-B-enwiki/last.ckpt" "../model/TokenShift-B-Stage1.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/TokenShift-B-Stage1.pth"

Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/TokenShift-B-enwiki/last.ckpt/checkpoint'
Detected checkpoint of type zero stage ZeroStageEnum.optimizer_states, world_size: 8
Parsing checkpoint created by deepspeed==0.9.3
Reconstructed fp32 state dict with 438 params 430397440 elements
Saving fp32 state dict to ../model/TokenShift-B-Stage1.pth
-rw-r--r-- 1 root root 1.7G Jul 20 07:59 ../model/TokenShift-B-Stage1.pth
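The exported file size can be cross-checked against the element count in the log above. A quick back-of-the-envelope sketch, assuming every element is stored as a 4-byte fp32 value and ignoring the small serialization overhead of the `.pth` container:

```python
ELEMENTS = 430_397_440   # "elements" reported by export_checkpoint.py above
BYTES_PER_FP32 = 4       # assumption: the state dict is saved as fp32

size_bytes = ELEMENTS * BYTES_PER_FP32
size_gib = size_bytes / 1024**3

print(f"{size_bytes:,} bytes = {size_gib:.2f} GiB")
# prints "1,721,589,760 bytes = 1.60 GiB"
```

That is roughly 1.72 GB in decimal units, or just over 1.6 GiB, which is consistent with the `1.7G` listed if `ls -h` rounds up to one decimal place.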


In [20]:
# Let's do a quick dragon prompt validation
!cd "{INFERENCE_DIR}" && python3 dragon_test.py ../model/TokenShift-B-Stage1.pth "cuda fp32"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_1024_bf16...
--- DRAGON PROMPT ---
In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese.

Lectures and signs

It was widely known that the Tibetan tribes in the Tibetan Plateau were a part of the Tibetan people. The Tibetan homeland was said to have been in the Tibet region. Tibetan-speaking Tibet

In [7]:
# Let's do a quick memory test
# (We don't expect this to work, as we have not finetuned for memory recall, but it's a baseline)
!python3 ../memory_script/eval_model_memory_guided.py "{PROJECT_DIR}/model/TokenShift-B-Stage1.pth"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=wkv_1024_bf16 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.11/dist-packages/torch/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.11/dist-packages/torch/include/THC -isystem /usr/local/cud

## Stage 2 : Instruct Tuning

In [8]:
# Let's preload the required dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/TokenShift-B-instruct.yaml"

Found cached dataset parquet (/root/.cache/huggingface/datasets/c-s-ale___parquet/c-s-ale--dolly-15k-instruction-alpaca-format-9dfbb23260d63d9d/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 679.24it/s]
                                                                                

In [13]:
# Start the instruct finetuning
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/TokenShift-B-instruct.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Instruct (train-ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 1409080610
wandb: Currently logged in as: picocreator. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.5 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in ./wandb/run-20230720_092037-um10rveg
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run TokenShift-Exp-B - Instruct (train-ctx=4096, deepspeed_stage_1)
wandb: ⭐️ View project at https://wandb.ai/picocreator/RWKV-5X-Experiments
wandb: 🚀 View run at https://wandb.ai/picocreator/RWKV-5X-Experiments/runs/um10rveg
Us
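The training cells rely on IPython expanding `{VAR}` placeholders inside `!` shell commands. The sketch below shows the same expansion in plain Python, which is handy if you want to launch the trainer from a script instead of a notebook. The helper `build_fit_command` is hypothetical; the flags are the ones used in the cells above, and the config path is shortened for illustration.

```python
import shlex


def build_fit_command(config_path, run_name, strategy, devices):
    """Assemble the lightning_trainer.py argument list used in the training cells."""
    return [
        "python", "lightning_trainer.py", "fit",
        "-c", config_path,
        f"--trainer.logger.init_args.name={run_name}",
        f"--trainer.strategy={strategy}",
        f"--trainer.devices={devices}",
    ]


WANDB_PREFIX = "TokenShift-Exp-B"
DEEPSPEED_STRAT = "deepspeed_stage_1"
run_name = f"{WANDB_PREFIX} - Instruct (train-ctx=4096, {DEEPSPEED_STRAT})"

cmd = build_fit_command("TokenShift-B-instruct.yaml", run_name, DEEPSPEED_STRAT, "auto")
print(shlex.join(cmd))  # shell-quoted, safe to paste into a terminal
```

The resulting `run_name` is the same string that wandb reports in its "Syncing run" line. Passing `cmd` as a list to `subprocess.run` would avoid shell quoting issues with the spaces and parentheses in the run name.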

In [14]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py "../checkpoint/TokenShift-B-instruct/last.ckpt" "../model/TokenShift-B-Stage2.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/TokenShift-B-Stage2.pth"

Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/TokenShift-B-instruct/last.ckpt/checkpoint'
Detected checkpoint of type zero stage ZeroStageEnum.optimizer_states, world_size: 8
Parsing checkpoint created by deepspeed==0.9.3
Reconstructed fp32 state dict with 438 params 430397440 elements
Saving fp32 state dict to ../model/TokenShift-B-Stage2.pth
-rw-r--r-- 1 root root 1.7G Jul 20 09:34 ../model/TokenShift-B-Stage2.pth


In [21]:
# Let's do a quick dragon prompt validation
!cd "{INFERENCE_DIR}" && python3 dragon_test.py "../model/TokenShift-B-Stage2.pth" "cuda fp32"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_1024_bf16...
--- DRAGON PROMPT ---
In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese. Soon after the discovery of the deer, a curious dune bug has been found, and the two Chinese dragons were invited to work at a zoo. The zoo was a big success in Taiwan, and was able to learn more about Tibetan

In [16]:
# Let's do a quick memory test
# (We don't expect this to work, as we have not finetuned for memory recall, but it's a baseline)
!python3 ../memory_script/eval_model_memory_guided.py "{PROJECT_DIR}/model/TokenShift-B-Stage2.pth"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_1024_bf16...
###
### Model validation start ###
###
## Model validation for 5 tokens : 20.0% similarity, with 1 matched token, and 4 token mismatch
## Model validation for 10 tokens : 10.0% similarity, with 1 matched token, and 9 token mismatch
## Model validation for 15 tokens : 6.666666666666667% similarity, with 1 matched token, and 14 token mismatch
## Model validation for 20 tokens : 5.0% similarity, with 1 matched token, and 19 token mismatch
## Mod
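The similarity figures in the validation output are consistent with a simple ratio of matched tokens to total tokens. A minimal sketch of that arithmetic follows; the formula is an assumption inferred from the numbers printed above, not taken from `eval_model_memory_guided.py`, and `similarity_percent` is a hypothetical name.

```python
def similarity_percent(matched, total):
    """Percentage of tokens reproduced correctly (assumed metric)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * matched / total


# One matched token at each test length, as in the output above
for total in (5, 10, 15, 20):
    print(f"{total} tokens: {similarity_percent(1, total)}% similarity")
```

With a single matched token at every length, the percentage falls only because the denominator grows, so these baseline numbers indicate essentially no recall yet.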