# RWKV Token Shift Experiment A
This experiment trains a custom model with:
- 12 layers
- 2560 embedding size

See `./notes.md` for how the init model was initialized.

**Note:** This project assumes you have the `rwkv-infctx` conda env set up, as shown below.

---

```bash
# ninja-build is required for the new trainer
sudo apt-get install ninja-build

# Update conda & its package listings
conda update conda

# Virtual env, with python 3.10
# python 3.11 has issues with torch.compile / H100s;
# if you want to use 3.11, you will need to do a nightly build install
conda create -n rwkv-infctx python=3.10 pip
conda activate rwkv-infctx

# Install pytorch (>=2.0.1)
conda install -y pytorch==2.0.1 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Verify your pytorch version 
python -c "import torch; print(torch.__version__)"

# We use python -m pip instead of pip directly, as it resolves issues with the venv not loading the right pip
python -m pip install datasets transformers 
python -m pip install lightning==2.0.4 deepspeed==0.9.5
python -m pip install ninja numexpr jsonargparse 'jsonargparse[signatures]'
python -m pip install lm-dataformat ftfy sentencepiece tokenizers wandb
```
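
Once the environment is installed, the pinned versions can be checked from Python. The following is a minimal sketch, not part of the original setup: the helper name `check_versions` is hypothetical, and it assumes the conda-installed PyTorch exposes package metadata under the distribution name `torch`.

```python
from importlib import metadata

# Pins copied from the install commands above.
EXPECTED = {"torch": "2.0.1", "lightning": "2.0.4", "deepspeed": "0.9.5"}


def check_versions(expected, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Builds may carry a local tag such as "+cu118"; compare the base version only.
        if found is None or found.split("+")[0] != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    # Prints nothing when every pinned package matches.
    for name, (want, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```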
---

# Basic Setup

In [1]:
# First, let's set up the various directories and get the blank init model. This init model was generated
# using the original RWKV-LM repo (at the time of writing, this repo cannot init a model).
# As such, I have preinitialized these blank models and uploaded them to HF for convenience.
!mkdir -p ../../../model/
!mkdir -p ../../../datapath/
!mkdir -p ../../../checkpoint/
!rm -rf ../../../model/L12-D2560-init.pth
!cd ../../../model/ && wget https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/L12-D2560-init.pth
!ls -alh ../../../model/L12-D2560-init.pth

--2023-07-15 01:12:48--  https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/L12-D2560-init.pth
Resolving huggingface.co (huggingface.co)... 13.224.249.10, 13.224.249.43, 13.224.249.44, ...
Connecting to huggingface.co (huggingface.co)|13.224.249.10|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/cb/ef/cbef09abb2634a3375b28868bffa285226dfeabedec89b28c2fb302221164d66/490b49cdae99030f402fa01a60817bb53c67b6164aa3858742ea6b1560b2c4ed?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27L12-D2560-init.pth%3B+filename%3D%22L12-D2560-init.pth%22%3B&Expires=1689613968&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTY4OTYxMzk2OH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy9jYi9lZi9jYmVmMDlhYmIyNjM0YTMzNzViMjg4NjhiZmZhMjg1MjI2ZGZlYWJlZGVjODliMjhjMmZiMzAyMjIxMTY0ZDY2LzQ5MGI0OWNkYWU5OTAzMGY0MDJmYTAxYTYwODE3YmI1M2M2N2I2MTY0YWEzO
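
After the download, it can be useful to confirm that the file really has the 12 layers and 2560 embedding size described at the top. The sketch below is an illustration only: `infer_model_dims` is a hypothetical helper, and it assumes the checkpoint uses RWKV-v4 style parameter names (`emb.weight` plus per-layer names prefixed with `blocks.<index>.`).

```python
import re


def infer_model_dims(shapes):
    """Infer (n_layer, n_embd) from a {parameter_name: shape} mapping.

    Assumes `emb.weight` has shape (vocab_size, n_embd) and that each
    layer owns parameters named `blocks.<index>.<...>`.
    """
    layer_ids = set()
    for name in shapes:
        match = re.match(r"blocks\.(\d+)\.", name)
        if match:
            layer_ids.add(int(match.group(1)))
    n_layer = max(layer_ids) + 1 if layer_ids else 0
    n_embd = shapes["emb.weight"][1]
    return n_layer, n_embd


# Hypothetical usage against the downloaded file (requires torch):
#   import torch
#   state = torch.load("../../../model/L12-D2560-init.pth", map_location="cpu")
#   print(infer_model_dims({k: tuple(v.shape) for k, v in state.items()}))
```

Working on a plain name-to-shape mapping keeps the check independent of how the weights are loaded.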

In [51]:
DEEPSPEED_STRAT="deepspeed_stage_2_offload"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="(EXPERIMENTAL) TokenShift-Exp-A"

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

# Compute the notebook directory and the various project paths
import os
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v4neo/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

DEEPSPEED_STRAT: deepspeed_stage_2_offload
ENABLE_WANDB: False
GPU_DEVICES: auto
NOTEBOOK_DIR: /home/picocreator/rwkv-proj/rwkv5-tokenshift-experiment/notebook/experiment/tokenshift-exp
TRAINER_DIR: /home/picocreator/rwkv-proj/rwkv5-tokenshift-experiment/RWKV-v4neo
PROJECT_DIR: /home/picocreator/rwkv-proj/rwkv5-tokenshift-experiment
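
The `NOTEBOOK_DIR` line in the cell above passes the literal string `"__file__"`, not the module attribute, which notebooks do not define. A minimal sketch of why this still resolves to the current working directory (which Jupyter typically sets to the notebook's own folder):

```python
import os

# "__file__" is just a relative path here, so abspath() joins it onto
# the current working directory ...
fake_path = os.path.abspath("__file__")

# ... and dirname() strips that last component again, leaving the cwd.
notebook_dir = os.path.dirname(fake_path)

print("fake_path:   ", fake_path)
print("notebook_dir:", notebook_dir)
```

The result therefore depends on where the kernel was started; launching it from another directory changes `NOTEBOOK_DIR` and every path derived from it.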


# Stage 1 : Foundation model training

In [52]:
# Let's preload the required dataset (enwiki_100k)
!cd "{TRAINER_DIR}" && \
    python3 preload_dataset.py "{NOTEBOOK_DIR}/TokenShift-A-enwiki.yaml"

Found cached dataset parquet (/home/picocreator/.cache/huggingface/datasets/teven___parquet/teven--enwiki_100k-1359e81b212c2dd6/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 98.92it/s]
Loading cached processed dataset at /home/picocreator/.cache/huggingface/datasets/teven___parquet/teven--enwiki_100k-1359e81b212c2dd6/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-197c10b1cc695da5_*_of_00016.arrow
Loading cached processed dataset at /home/picocreator/.cache/huggingface/datasets/teven___parquet/teven--enwiki_100k-1359e81b212c2dd6/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-aa09da794e8d7304_*_of_00016.arrow
                                                                                

In [54]:
# Start the foundation model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python new_train.py fit \
        -c "{NOTEBOOK_DIR}/TokenShift-A-enwiki.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Enwiki Foundation (ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" 

[2023-07-15 16:04:24,032] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 3961506885
Using /home/picocreator/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/picocreator/.cache/torch_extensions/py311_cu117/wkv_4096_bf16/build.ninja...
Building extension module wkv_4096_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_4096_bf16...
  rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


[RWKV.Trainer] Applying 'target_batch_size' with the following:
   - target_batch_size:       32
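
The captured log stops after `target_batch_size`, so the values the trainer derived from it are not shown. As a rough sketch of the usual relationship, and an assumption rather than the trainer's confirmed formula, a target batch size is commonly reached by accumulating gradients over several steps. The function names, device counts, and micro-batch size below are hypothetical.

```python
def accumulation_steps(target_batch_size, num_devices, microbatch_size=1, num_nodes=1):
    """Gradient-accumulation steps needed to approximate target_batch_size."""
    samples_per_step = num_nodes * num_devices * microbatch_size
    return max(1, target_batch_size // samples_per_step)


def effective_batch_size(target_batch_size, num_devices, microbatch_size=1, num_nodes=1):
    """Batch size actually reached after rounding the accumulation steps down."""
    steps = accumulation_steps(target_batch_size, num_devices, microbatch_size, num_nodes)
    return steps * num_nodes * num_devices * microbatch_size


# Hypothetical device counts; 32 is the target_batch_size from the log above.
for devices in (1, 3, 8):
    steps = accumulation_steps(32, devices)
    print(f"devices={devices} steps={steps} effective={effective_batch_size(32, devices)}")
```

When the device count does not divide the target evenly, the effective batch size ends up slightly below the target.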

In [None]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py "../checkpoint/TokenShift-A-enwiki/last.ckpt" "../model/TokenShift-A-Stage1.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/TokenShift-A-Stage1.pth"

In [None]:
# # Let's do a quick dragon prompt validation
# !cd "{TRAINER_DIR}" && python3 dragon_test.py ../model/TokenShift-A-Stage1.pth "cuda fp32"

In [None]:
# # Let's do a quick memory test
# # (We don't expect this to work, as we have not finetuned for memory recall, but it's a baseline)
# !python3 ./memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/TokenShift-A-Stage1.pth"

# Stage 2 : Instruct Tuning

In [None]:
# Let's preload the required dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_dataset.py "{NOTEBOOK_DIR}/TokenShift-A-instruct.yaml"

In [None]:
# Start the instruct finetuning
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python new_train.py fit \
        -c "{NOTEBOOK_DIR}/TokenShift-A-instruct.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Instruct (train-ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"
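
The training command loads its settings from the YAML file passed with `-c` and then overrides individual values with dotted keys such as `--trainer.strategy`. The sketch below only illustrates how a dotted key addresses a nested config. It is not how the CLI is implemented, and `apply_override` and the starting values are hypothetical.

```python
def apply_override(config, dotted_key, value):
    """Set config[a][b][c] = value for dotted_key 'a.b.c', creating dicts as needed."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


# Hypothetical stand-in for values loaded from the YAML file.
config = {"trainer": {"strategy": "auto", "devices": 1}}

apply_override(config, "trainer.strategy", "deepspeed_stage_2_offload")
apply_override(config, "trainer.devices", "auto")
apply_override(config, "trainer.logger.init_args.name", "Instruct run")

print(config)
```

Keeping run-specific values (strategy, devices, run name) on the command line lets the same YAML file be reused across different hardware setups.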

In [None]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py "../checkpoint/TokenShift-A-instruct/last.ckpt" "../model/TokenShift-A-Stage2.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/TokenShift-A-Stage2.pth"

In [None]:
# # Let's do a quick dragon prompt validation
# !cd "{TRAINER_DIR}" && python3 dragon_test.py "../model/TokenShift-A-Stage2.pth" "cuda fp32"

In [None]:
# # Let's do a quick memory test
# # (We don't expect this to work, as we have not finetuned for memory recall, but it's a baseline)
# !python3 ./memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/TokenShift-A-Stage2.pth"