# RWKV Token Shift - From an existing raven model
Because the weights overlap, what happens if we take an existing Raven model and finetune it to the token shift format?

**Note:** This project assumes you have the rwkv-infctx conda env set up
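
For readers new to the term, here is a minimal sketch of token shift as it appears in standard RWKV-4: each token's vector is blended, per channel, with the previous token's vector. This is only the baseline idea. The exact token shift variant trained in this notebook is not described here and may differ, and the function name `token_shift_mix` and the toy numbers are made up for illustration.

```python
def token_shift_mix(xs, mu):
    """Blend each token's vector with the previous token's vector.

    xs: list of token embeddings (each a list of floats), in sequence order.
    mu: per-channel mix weights in [0, 1]; 1.0 keeps only the current token.
    """
    prev = [0.0] * len(mu)  # the first token has no predecessor, so use zeros
    mixed = []
    for x in xs:
        mixed.append([xi * m + pi * (1.0 - m) for xi, pi, m in zip(x, prev, mu)])
        prev = x
    return mixed


# Toy example: 3 tokens, 2 channels (values are illustrative only)
tokens = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
shifted = token_shift_mix(tokens, mu=[1.0, 0.5])
print(shifted)
```

Channel 0 (`mu = 1.0`) passes through unchanged, while channel 1 (`mu = 0.5`) averages each token with the one before it.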

# Basic Setup

In [1]:
# Setup the various required folders
!mkdir -p ../../../../model/
!mkdir -p ../../../../datapath/
!mkdir -p ../../../../checkpoint/

# Initialize the model
!cd ../../../../model/ && wget -nc https://huggingface.co/BlinkDL/rwkv-4-raven/resolve/main/RWKV-4-Raven-1B5-v12-Eng98%25-Other2%25-20230520-ctx4096.pth

File ‘RWKV-4-Raven-1B5-v12-Eng98%-Other2%-20230520-ctx4096.pth’ already there; not retrieving.
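
The URL contains `%25`, while the file reported by `wget` contains a plain `%`. That is just percent-decoding of the last path segment. A small sketch using only the standard library shows how the local file name follows from the URL (the variable names are illustrative):

```python
import posixpath
from urllib.parse import unquote, urlsplit

MODEL_URL = (
    "https://huggingface.co/BlinkDL/rwkv-4-raven/resolve/main/"
    "RWKV-4-Raven-1B5-v12-Eng98%25-Other2%25-20230520-ctx4096.pth"
)

# Take the last path segment of the URL, then decode %25 back to a literal '%'
encoded_name = posixpath.basename(urlsplit(MODEL_URL).path)
local_name = unquote(encoded_name)
print(local_name)
```

This is handy when later cells need to reference the downloaded base model by its on-disk name.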



In [1]:
DEEPSPEED_STRAT="deepspeed_stage_2_offload"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="FrankenShift-1B5"

# Use for low VRAM / single GPU training
SUBSTEP_CUDA_CACHE_CLEAR=True

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)
print("SUBSTEP_CUDA_CACHE_CLEAR:", SUBSTEP_CUDA_CACHE_CLEAR)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

# Compute the notebook directory and the various paths derived from it
import os
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v4wavenet/"))
INFERENCE_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v4wavenet/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("INFERENCE_DIR:", INFERENCE_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

DEEPSPEED_STRAT: deepspeed_stage_2_offload
ENABLE_WANDB: True
GPU_DEVICES: auto
SUBSTEP_CUDA_CACHE_CLEAR: True
NOTEBOOK_DIR: /home/picocreator/rwkv-proj/picocreator-memory-experiment/notebook/experiment/tokenshift-exp/FrankenShift-1B5
INFERENCE_DIR: /home/picocreator/rwkv-proj/picocreator-memory-experiment/RWKV-v4wavenet
TRAINER_DIR: /home/picocreator/rwkv-proj/picocreator-memory-experiment/RWKV-v4wavenet
PROJECT_DIR: /home/picocreator/rwkv-proj/picocreator-memory-experiment
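
The path logic above relies on a small trick: `"__file__"` is passed as a literal string, so `os.path.abspath` resolves it against the current working directory, and `dirname` then returns that directory. The sketch below checks this, and replays the `../../../../` resolution with `posixpath` using the `NOTEBOOK_DIR` value printed above (the lowercase variable names are illustrative).

```python
import os
import posixpath

# abspath() of the literal string "__file__" is <cwd>/__file__,
# so its dirname is simply the current working directory.
notebook_dir = os.path.dirname(os.path.abspath("__file__"))
assert notebook_dir == os.getcwd()

# Replay the relative-path resolution on the directory printed in the output above
printed_notebook_dir = (
    "/home/picocreator/rwkv-proj/picocreator-memory-experiment"
    "/notebook/experiment/tokenshift-exp/FrankenShift-1B5"
)
project_dir = posixpath.normpath(posixpath.join(printed_notebook_dir, "../../../../"))
trainer_dir = posixpath.normpath(posixpath.join(project_dir, "./RWKV-v4wavenet/"))

print("project_dir:", project_dir)
print("trainer_dir:", trainer_dir)
```

Since the result depends on the working directory, the notebook has to be run from its own folder for these paths to line up.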


# Stage 1 : Foundation model training

In [3]:
# Let's preload the required dataset (enwiki_100k)
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/FrankenShift-1B5-enwiki.yaml"

[2023-08-02 17:10:34,241] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Found cached dataset parquet (/home/picocreator/.cache/huggingface/datasets/teven___parquet/teven--enwiki_100k-1359e81b212c2dd6/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 108.15it/s]
                                                                                

In [12]:
# Start the foundation model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/FrankenShift-1B5-enwiki.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Enwiki Retrain (train-ctx=2048, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --model.substep_cuda_cache_clear="{SUBSTEP_CUDA_CACHE_CLEAR}"

[RWKV.lightning_trainer.py]: Running with PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
[2023-08-02 22:54:09,624] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 3366454673
wandb: Currently logged in as: picocreator. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.8 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in ./wandb/run-20230802_225412-g7s2by82
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run FrankenShift-1B5 - Enwiki Retrain (train-ctx=2048, deepspeed_stage_2_offload)
wandb: ⭐️ View project at 

In [13]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py "../checkpoint/FrankenShift-1B5-enwiki/last.ckpt" "../model/FrankenShift-1B5-Stage1.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/FrankenShift-1B5-Stage1.pth"

[2023-08-04 05:43:20,490] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/FrankenShift-1B5-enwiki/last.ckpt/checkpoint'
Detected checkpoint of type zero stage ZeroStageEnum.gradients, world_size: 1
Parsing checkpoint created by deepspeed==0.9.5
Reconstructed fp32 state dict with 438 params 1515106304 elements
Saving fp32 state dict to ../model/FrankenShift-1B5-Stage1.pth
-rw-rw-r-- 1 picocreator picocreator 5.7G Aug  4 05:43 ../model/FrankenShift-1B5-Stage1.pth
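
As a sanity check on the export log, the tensor and element counts can be estimated from the model shape `L24-D2048-V50277`. The sketch below assumes the standard RWKV-4 tensor layout (including an FFN hidden size of 4x the embedding size); the source does not list the tensors, so treat the breakdown as an assumption. The function name `rwkv4_param_summary` is made up for this example.

```python
def rwkv4_param_summary(n_layer, n_embd, vocab_size):
    """Count tensors and elements for an assumed standard RWKV-4 layout."""
    d = n_embd
    per_layer_elements = (
        4 * d          # ln1 and ln2 (weight + bias each)
        + 5 * d        # att: time_decay, time_first, time_mix_k/v/r
        + 4 * d * d    # att: key, value, receptance, output
        + 2 * d        # ffn: time_mix_k, time_mix_r
        + 9 * d * d    # ffn: key (d x 4d), value (4d x d), receptance (d x d)
    )
    per_layer_tensors = 4 + 5 + 4 + 2 + 3
    shared_elements = 2 * vocab_size * d + 4 * d  # emb + head, ln0 + ln_out
    shared_tensors = 1 + 1 + 2 + 2
    tensors = n_layer * per_layer_tensors + shared_tensors
    elements = n_layer * per_layer_elements + shared_elements
    return tensors, elements


tensors, elements = rwkv4_param_summary(n_layer=24, n_embd=2048, vocab_size=50277)
fp32_gib = elements * 4 / 2**30  # 4 bytes per fp32 element
print(tensors, elements, round(fp32_gib, 2))
```

Under these assumptions the estimate comes out to 438 tensors and 1515106304 elements, the same figures as the export log, which suggests the finetuned model has the same weight shapes as a plain RWKV-4 1B5. At 4 bytes per element that is about 5.64 GiB, consistent with the 5.7G shown by `ls -alh`.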


In [14]:
# Let's do a quick dragon prompt validation
!cd "{INFERENCE_DIR}" && python3 dragon_test.py ../model/FrankenShift-1B5-Stage1.pth "cuda fp32"

[2023-08-04 05:43:47,016] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
[RWKV.model]: Preloading model from '../model/FrankenShift-1B5-Stage1.pth'
Using /home/picocreator/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/picocreator/.cache/torch_extensions/py311_cu117/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_1024_bf16...
[RWKV.model]: Loading model weights ( L24-D2048-V50277 )
[RWKV.model]: Finished initial model load
--- DRAGON PROMPT ---
In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researc

In [15]:
# Let's do a quick memory test
# (We don't expect this to work, as we have not finetuned for memory recall, but it's a baseline)
!python3 ../memory_script/eval_model_memory_guided.py "{PROJECT_DIR}/model/FrankenShift-1B5-Stage1.pth"

[2023-08-04 05:44:50,621] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
[RWKV.model]: Preloading model from '/home/picocreator/rwkv-proj/picocreator-memory-experiment/model/FrankenShift-1B5-Stage1.pth'
Using /home/picocreator/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/picocreator/.cache/torch_extensions/py311_cu117/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_1024_bf16...
[RWKV.model]: Loading model weights ( L24-D2048-V50277 )
[RWKV.model]: Finished initial model load
###
### Model validation start ###
###
## Model validation for 5 tokens : 80.0% similarity, with 4 matched token, and 1 toke
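
The validation line reads as a token-by-token comparison: 4 of 5 tokens matched, hence 80.0%. The internals of `eval_model_memory_guided.py` are not shown here, so the following is only a guess at the arithmetic behind that line. The function `token_match_stats` and the token IDs are invented for illustration.

```python
def token_match_stats(expected, generated):
    """Return (matched, missed, similarity %) for position-wise token comparison."""
    matched = sum(1 for e, g in zip(expected, generated) if e == g)
    total = len(expected)
    return matched, total - matched, 100.0 * matched / total


# Made-up token IDs: the last generated token differs from the expected one
expected_ids = [101, 102, 103, 104, 105]
generated_ids = [101, 102, 103, 104, 999]

matched, missed, similarity = token_match_stats(expected_ids, generated_ids)
print(f"{similarity}% similarity, with {matched} matched token, and {missed} token mismatch")
```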

# Stage 2 : Instruct Tuning

In [16]:
# Let's preload the required dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/FrankenShift-1B5-instruct.yaml"

[2023-08-04 05:49:45,370] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Found cached dataset parquet (/home/picocreator/.cache/huggingface/datasets/c-s-ale___parquet/c-s-ale--dolly-15k-instruction-alpaca-format-9dfbb23260d63d9d/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 35.93it/s]
                                                                                

In [17]:
# Start the instruct finetuning
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/FrankenShift-1B5-instruct.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Instruct Retrain (train-ctx=2048, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --model.substep_cuda_cache_clear="{SUBSTEP_CUDA_CACHE_CLEAR}"

[RWKV.lightning_trainer.py] Running with PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
[2023-08-04 05:50:02,631] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
  rank_zero_warn(
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 77600742
wandb: Currently logged in as: picocreator. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.8 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in ./wandb/run-20230804_055005-2s04ectf
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run FrankenShift-1B5 - Instruct Retrain (train-ctx=2048, deepspeed_stage_2_offload)
wandb: ⭐️ 

In [18]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py "../checkpoint/FrankenShift-1B5-instruct/last.ckpt" "../model/FrankenShift-1B5-Stage2.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/FrankenShift-1B5-Stage2.pth"

[2023-08-04 08:08:57,178] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/FrankenShift-1B5-instruct/last.ckpt/checkpoint'
Detected checkpoint of type zero stage ZeroStageEnum.gradients, world_size: 1
Parsing checkpoint created by deepspeed==0.9.5
Reconstructed fp32 state dict with 438 params 1515106304 elements
Saving fp32 state dict to ../model/FrankenShift-1B5-Stage2.pth
-rw-rw-r-- 1 picocreator picocreator 5.7G Aug  4 08:09 ../model/FrankenShift-1B5-Stage2.pth


In [19]:
# Let's do a quick dragon prompt validation
!cd "{INFERENCE_DIR}" && python3 dragon_test.py "../model/FrankenShift-1B5-Stage2.pth" "cuda fp32"

[2023-08-04 08:09:21,937] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
[RWKV.model]: Preloading model from '../model/FrankenShift-1B5-Stage2.pth'
Using /home/picocreator/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/picocreator/.cache/torch_extensions/py311_cu117/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_1024_bf16...
[RWKV.model]: Loading model weights ( L24-D2048-V50277 )
[RWKV.model]: Finished initial model load
--- DRAGON PROMPT ---
In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researc

In [20]:
# Let's do a quick memory test
# (We don't expect this to work, as we have not finetuned for memory recall, but it's a baseline)
!python3 ../memory_script/eval_model_memory_guided.py "{PROJECT_DIR}/model/FrankenShift-1B5-Stage2.pth"

[2023-08-04 08:10:26,234] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
[RWKV.model]: Preloading model from '/home/picocreator/rwkv-proj/picocreator-memory-experiment/model/FrankenShift-1B5-Stage2.pth'
Using /home/picocreator/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/picocreator/.cache/torch_extensions/py311_cu117/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_1024_bf16...
[RWKV.model]: Loading model weights ( L24-D2048-V50277 )
[RWKV.model]: Finished initial model load
###
### Model validation start ###
###
## Model validation for 5 tokens : 100.0% similarity, with 5 matched token, and 0 tok