# RWKV World Memory Finetune (Memory Finetune)

This notebook takes an existing RWKV world model and finetunes it specifically for the memory repeat task at various sizes.
The task is used to approximate the model's token memory size in the "worst-case scenario":

- Using randomized data, so prior learning does not help, nor is it possible to compress the data
- Using a variety of token lengths, to avoid overfitting to a single length
- Based on the pretrained model (rwkv world)
- This process does "destroy the model" but it helps quantify the model limits
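
To make the task concrete, here is a minimal sketch of how one randomized repeat sample could be built. It is not the repository's `gen_limited_prompt_completion_jsonl.py`, whose implementation is not shown in this notebook; the `make_repeat_sample` helper, the toy vocabulary, and the prompt wording are all assumptions made for illustration.

```python
import json
import random

# Hypothetical toy vocabulary; the real generator's word source is not shown here.
VOCAB = ["apple", "river", "stone", "cloud", "tiger", "lamp", "ocean", "pixel"]

def make_repeat_sample(max_words: int, rng: random.Random) -> dict:
    """Build one prompt/completion pair whose completion repeats random words."""
    n_words = rng.randint(1, max_words)
    words = " ".join(rng.choice(VOCAB) for _ in range(n_words))
    # Assumed prompt wording, for illustration only.
    prompt = f"Repeat the following text exactly:\n{words}\n\nAnswer:\n"
    return {"prompt": prompt, "completion": words}

rng = random.Random(0)
sample = make_repeat_sample(max_words=10, rng=rng)
print(json.dumps(sample))  # one JSONL line
```

Because the words are drawn at random, the only way to produce the completion is to carry the prompt's words forward in the model state.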

In practice, however, the model may show an "attention range" longer than what is benchmarked, as natural text is highly compressible, unlike the purely randomized data tested here.
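
As a rough analogy for the compressibility point, the sketch below compares how well `zlib` compresses a short English passage against random bytes of the same length. This only illustrates the information-content argument; it says nothing about how the model itself stores context, and the sample passage is made up for the example.

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)

natural = (
    "The model reads a passage of ordinary prose and is asked to repeat it. "
    "Ordinary prose reuses common words, common letters and common phrases, "
    "so a compressor can describe the passage in far fewer bytes than it takes "
    "to write the passage out in full."
).encode("utf-8")

rng = random.Random(0)
# Uniformly random bytes of the same length stand in for randomized data.
noise = bytes(rng.getrandbits(8) for _ in range(len(natural)))

natural_ratio = compression_ratio(natural)
noise_ratio = compression_ratio(noise)
print(f"natural text: {natural_ratio:.2f}, random bytes: {noise_ratio:.2f}")
```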

This runner has been optimized to run on 8 x 80GB VRAM nodes; you should allocate at least 1TB of disk space.
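
A quick pre-flight check for the disk requirement could look like the sketch below. The `has_free_space` helper is hypothetical, and reading "1TB" as decimal terabytes is an assumption.

```python
import shutil

def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes

REQUIRED_BYTES = 1 * 1000**4  # 1 TB, read as decimal terabytes (an assumption)

if not has_free_space(".", REQUIRED_BYTES):
    print("Warning: less than 1 TB free on the current filesystem")
```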

> This project assumes you have the rwkv-infctx conda env setup, and you are executing in that environment - see the main README.md for the conda env setup steps

## Configure your environment settings
(!Important: you will need to rerun the below cell, if you restart your kernel)

In [4]:
DEEPSPEED_STRAT="deepspeed_stage_2"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="[8xA100] RWKV-v5-7B-World"

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

# The model sizing
MODEL_NAME="RWKV-v5-7B-world.pth"
MODEL_URL="https://huggingface.co/BlinkDL/temp/resolve/2d905a2a30c778086a048e4f65ca75d9f7f9849d/RWKV-5-World-7B-v2-OnlyForTest_72%25_trained-20231204-ctx4096.pth?download=true"

# Computing the notebook, and various paths
import os
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v5/"))
MEMORY_SCRIPT_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./notebook/util-scripts/memory_script"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

DEEPSPEED_STRAT: deepspeed_stage_2
ENABLE_WANDB: True
GPU_DEVICES: auto
NOTEBOOK_DIR: /workspace/RWKV-infctx-trainer/notebook/rwkv-x-exp/v5-exp/memory-test
TRAINER_DIR: /workspace/RWKV-infctx-trainer/RWKV-v5
PROJECT_DIR: /workspace/RWKV-infctx-trainer
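
A note on the `NOTEBOOK_DIR` line above: `"__file__"` is a quoted string, not the module attribute, so the expression resolves to the kernel's current working directory. The sketch below shows that equivalence. It assumes the kernel was started in the notebook's own folder; if it was not, every derived path would point elsewhere.

```python
import os

# "__file__" is a plain string here, so abspath() resolves it against
# the current working directory rather than a real file location.
notebook_dir = os.path.dirname(os.path.abspath("__file__"))

# Four levels up, mirroring the "../../../../" used for PROJECT_DIR.
project_dir = os.path.abspath(os.path.join(notebook_dir, "../../../../"))

print(notebook_dir == os.path.abspath(os.getcwd()))
```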


## Download the pretrained model
(if you want to skip the base model train + instruct tune)


In [None]:
# Let's wget the model files
!cd "{PROJECT_DIR}" && mkdir -p "{PROJECT_DIR}/model"
!cd "{PROJECT_DIR}/model" && \
    wget -O "{MODEL_NAME}" -nc "{MODEL_URL}"

## Finetune 1 (0 -> 2*2k) : Dataset preparation

Stage 1 handles a total context size of 4096, meaning the model is tuned for a memory task of approximately 2k tokens.
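
The "half the context" relationship can be written down as simple arithmetic. This is a sketch of the reasoning only: it assumes the payload appears once in the prompt and once in the completion, and it ignores the few tokens taken by the instruction text. The `approx_memory_tokens` name is made up for the example.

```python
def approx_memory_tokens(total_ctx_len: int) -> int:
    """Half the context: the payload is seen once in the prompt and once in the completion."""
    return total_ctx_len // 2

# Context sizes quoted for the two stages in this notebook.
for stage, ctx_len in {"stage 1": 4096, "stage 2": 8192}.items():
    print(stage, ctx_len, approx_memory_tokens(ctx_len))
```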

In [None]:
# Folder and eval pip setup
!cp -r "{MEMORY_SCRIPT_DIR}/" "{NOTEBOOK_DIR}/"
!python3 -m pip install rwkv asyncio aiocsv aiofiles

In [None]:
%%script bash

########################################
# Generate the required jsonl dataset
########################################

# Reset the dataset dir
mkdir -p ./dataset
rm -rf ./dataset/*.jsonl

# Generate the various datasets
echo "## Generating word reptition dataset ##"

#
# Training set for <= 100 words
# This is used to fill in as many blanks as possible
#
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-2-count.jsonl 2 100 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-4-count.jsonl 4 100 &
for i in {5..100..5} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 150 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 100 & 
done

#
# Ramping up the 100+ - 200 words dataset
# 
for i in {110..200..10} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 125 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 100 & 
done

#
# Ramping up the 200+ - 4000 words dataset
# 
for i in {210..4000..10} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 100 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 100 & 
done

wait
echo "## Done ##"

ls -alh ./dataset/
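
The cell above starts every generator as a background job at once (822 processes by the count of its loops), which suits a large node. On a smaller machine, one option is to cap the number of concurrent processes. The sketch below mirrors the same loops in Python; `build_stage1_commands` and `run_bounded` are hypothetical helpers, and the worker limit of 16 is an arbitrary assumption.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

GEN = "./memory_script/gen_limited_prompt_completion_jsonl.py"
SHUFFLE = "./memory_script/shuffle_limited_prompt_completion_jsonl.py"

def build_stage1_commands() -> list[list[str]]:
    """Mirror the stage 1 bash loops as a list of argv lists."""
    cmds = [
        ["python", GEN, "./dataset/word-2-count.jsonl", "2", "100"],
        ["python", GEN, "./dataset/word-4-count.jsonl", "4", "100"],
    ]
    # (start, stop, step, gen sample count) for each bash loop
    loops = [(5, 100, 5, 150), (110, 200, 10, 125), (210, 4000, 10, 100)]
    for start, stop, step, gen_samples in loops:
        for i in range(start, stop + 1, step):
            cmds.append(["python", GEN, f"./dataset/gen-word-{i}-count.jsonl", str(i), str(gen_samples)])
            cmds.append(["python", SHUFFLE, f"./dataset/shuffle-word-{i}-count.jsonl", str(i), "100"])
    return cmds

def run_bounded(cmds: list[list[str]], max_workers: int = 16) -> None:
    """Run the commands with at most `max_workers` processes alive at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces iteration so any CalledProcessError is raised here.
        list(pool.map(lambda argv: subprocess.run(argv, check=True), cmds))

commands = build_stage1_commands()
print(len(commands))  # number of generator invocations
```

Calling `run_bounded(commands)` from the notebook folder would then produce the same files as the bash cell, just with a capped process count.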

In [None]:
# Let's pre-tokenize the required dataset
# and pack the data into 8k of length
#
# For the initial training, it seems to be better to do 4k chunks, batch size 16, with 8k datapacks
# than to do 8k chunks, batch size 8, with 16k datapacks. Why? I don't know.
#
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/stage-1-tune.yaml"

# Ensure the checkpoint directory exists
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/stage-1-memory-finetune/"

## Finetune 1 (0 -> 2*2k) : The actual tune!

In [None]:
# Start the finetune model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/stage-1-tune.yaml" \
        --model.load_model="../model/{MODEL_NAME}" \
        --trainer.callbacks.init_args.dirpath="../checkpoint/stage-1-memory-finetune/{MODEL_NAME}/" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Finetune-1 (bs=256, train-ctx=8192, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --trainer.microbatch_size=4 \
        --model.ctx_len=8192
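
The run name mentions `bs=256` while the command sets `--trainer.microbatch_size=4`. Assuming `bs` is the effective batch size, that the run uses 8 GPUs (as the `[8xA100]` prefix suggests), and that the effective batch is devices x microbatch x accumulation steps, the implied gradient accumulation can be worked out as below. The function name is made up for this sketch, and the trainer's own calculation may differ.

```python
def accumulate_grad_batches(target_batch_size: int, num_devices: int,
                            microbatch_size: int, num_nodes: int = 1) -> int:
    """Accumulation steps needed to reach the target effective batch size."""
    per_step = num_nodes * num_devices * microbatch_size
    if target_batch_size % per_step != 0:
        raise ValueError("target batch size is not a multiple of the per-step batch")
    return target_batch_size // per_step

steps = accumulate_grad_batches(target_batch_size=256, num_devices=8, microbatch_size=4)
print(steps)
```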

In [None]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/stage-1-memory-finetune/{MODEL_NAME}/last.ckpt" \
        "../model/Memory-Tune-Stage-1-{MODEL_NAME}"
!cd "{TRAINER_DIR}" && ls -alh "../model/Memory-Tune-Stage-1-{MODEL_NAME}"

In [None]:
# Let's do a memory eval!
!python3 ./memory_script/eval_v5_memory_guided.py "{PROJECT_DIR}/model/Memory-Tune-Stage-1-{MODEL_NAME}"
!python3 ./memory_script/eval_v5_memory_guided.py "{PROJECT_DIR}/model/Memory-Tune-Stage-1-{MODEL_NAME}" "none" 1000 3000

## Finetune 2 (0 -> 2*4k) : Dataset preparation

Stage 2 handles a total context size of 8k, meaning the model is tuned for a memory task of approximately 4k tokens.

In [None]:
%%script bash

########################################
# Generate the required jsonl dataset
########################################

# Reset the dataset dir
mkdir -p ./dataset
rm -rf ./dataset/*.jsonl

# Generate the various datasets
echo "## Generating word reptition dataset ##"

#
# Training set for <= 100 words
# This is used to fill in as many blanks as possible
#
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-2-count.jsonl 2 100 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-4-count.jsonl 4 100 &
for i in {5..100..5} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 100 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 100 & 
done

#
# Ramping up the 100+ - 3000 words dataset
# 
for i in {110..3000..10} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 75 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 75 & 
done

#
# Ramping up the 3000+ - 6000 words dataset
# 
for i in {3025..6000..25} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 100 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 100 & 
done

wait
echo "## Done ##"

ls -alh ./dataset/

In [None]:
# Let's pre-tokenize the required dataset
# and pack the data into 8k of length
#
# For the initial training, it seems to be better to do 4k chunks, batch size 16, with 8k datapacks
# than to do 8k chunks, batch size 8, with 16k datapacks. Why? I don't know.
#
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/stage-2-tune.yaml"

# Ensure the checkpoint directory exists
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/stage-2-memory-finetune/"

## Finetune 2 (0 -> 2*4k) : The actual tune!

In [None]:
# Start the finetune model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/stage-2-tune.yaml" \
        --model.load_model="../model/Memory-Tune-Stage-1-{MODEL_NAME}" \
        --trainer.callbacks.init_args.dirpath="../checkpoint/stage-2-memory-finetune/{MODEL_NAME}/" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Finetune-2 (bs=256, train-ctx=8192, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --trainer.microbatch_size=4 \
        --model.ctx_len=8192

In [None]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/stage-2-memory-finetune/{MODEL_NAME}/last.ckpt" \
        "../model/Memory-Tune-Stage-2-{MODEL_NAME}"
!cd "{TRAINER_DIR}" && ls -alh "../model/Memory-Tune-Stage-2-{MODEL_NAME}"

In [None]:
# Let's do a memory eval!
!python3 ./memory_script/eval_v5_memory_guided.py "{PROJECT_DIR}/model/Memory-Tune-Stage-2-{MODEL_NAME}"
!python3 ./memory_script/eval_v5_memory_guided.py "{PROJECT_DIR}/model/Memory-Tune-Stage-2-{MODEL_NAME}" "none" 1000 4000
!python3 ./memory_script/eval_v5_memory_guided.py "{PROJECT_DIR}/model/Memory-Tune-Stage-2-{MODEL_NAME}" "none" 4000 8000

## Finetune 2 (2x2k -> 2x4k) : Dataset preparation

Stage 2 handles a total context size of 8k, meaning the model is tuned for a memory task of approximately 4k tokens.

In [2]:
%%script bash

########################################
# Generate the required jsonl dataset
########################################

# Reset the dataset dir
mkdir -p ./dataset
rm -rf ./dataset/*.jsonl

# Generate the various datasets
echo "## Generating word reptition dataset ##"

#
# Training set for <= 100 words
# This is used to fill in as many blanks as possible
#
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-2-count.jsonl 2 100 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-4-count.jsonl 4 100 &
for i in {5..100..5} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 100 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 100 & 
done

#
# Ramping up the 100+ - 3000 words dataset
# 
for i in {110..3000..10} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 75 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 75 & 
done

#
# Ramping up the 3000+ - 6000 words dataset
# 
for i in {3025..6000..25} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 100 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 100 & 
done

wait
echo "## Done ##"

ls -alh ./dataset/

## Generating word reptition dataset ##
Generated JSONL file with - 15 max words, 100 samples - at ./dataset/gen-word-15-count.jsonl
Generated JSONL file with - 4 max words, 100 samples - at ./dataset/word-4-count.jsonl
Generated JSONL file with - 2 max words, 100 samples - at ./dataset/word-2-count.jsonl
Generated JSONL file with - 5 max words, 100 samples - at ./dataset/gen-word-5-count.jsonl
Generated JSONL file with - 10 max words, 100 samples - at ./dataset/gen-word-10-count.jsonl
Generated JSONL file with - 30 max words, 100 samples - at ./dataset/gen-word-30-count.jsonl
Generated JSONL file with - 35 max words, 100 samples - at ./dataset/gen-word-35-count.jsonl
Generated JSONL file with - 25 max words, 100 samples - at ./dataset/gen-word-25-count.jsonl
Generated JSONL file with - 20 max words, 100 samples - at ./dataset/gen-word-20-count.jsonl
Generated JSONL file with - 50 max words, 100 samples - at ./dataset/gen-word-50-count.jsonl
Generated JSONL file with - 40 max words, 10

In [5]:
# Let's pre-tokenize the required dataset
# and pack the data into 8k of length
#
# For the initial training, it seems to be better to do 4k chunks, batch size 16, with 8k datapacks
# than to do 8k chunks, batch size 8, with 16k datapacks. Why? I don't know.
#
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/stage-2-tune.yaml"

# Ensure the checkpoint directory exists
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/stage-2-memory-finetune/"

Resolving data files: 100%|███████████████| 862/862 [00:00<00:00, 107543.06it/s]
Saving the dataset (4/4 shards): 100%|█| 36851/36851 [00:01<00:00, 19869.65 exam
Saving the dataset (1/1 shards): 100%|█| 1547/1547 [00:00<00:00, 30397.64 exampl
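
The `862/862` data files reported above can be cross-checked against the loops in the stage 2 generation cell. The sketch below recomputes the expected number of `.jsonl` files from the brace ranges; it assumes every generator call succeeded and that `./dataset` contains nothing else.

```python
def stage2_word_counts() -> list[int]:
    """Word counts produced by the three bash loops in the stage 2 cell."""
    counts = list(range(5, 101, 5))        # {5..100..5}
    counts += list(range(110, 3001, 10))   # {110..3000..10}
    counts += list(range(3025, 6001, 25))  # {3025..6000..25}
    return counts

counts = stage2_word_counts()
# Each count yields one gen-word and one shuffle-word file,
# plus the standalone word-2 and word-4 files.
expected_files = 2 * len(counts) + 2
print(expected_files)
```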


## Finetune 2 (2x2k -> 2x4k) : The actual tune!

In [6]:
# Start the finetune model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/stage-2-tune.yaml" \
        --model.load_model="../model/Memory-Tune-Stage-1-{MODEL_NAME}" \
        --trainer.callbacks.init_args.dirpath="../checkpoint/stage-2-memory-finetune/{MODEL_NAME}/" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Finetune-2 (bs=256, train-ctx=8192, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --trainer.microbatch_size=4 \
        --model.ctx_len=8192

[2024-01-23 22:32:27,860] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV infctx using 'torch-jit' with torch '2.1.1+cu121'
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['fit', '-c', '/workspace/RWKV-infctx-trainer/notebook/rwkv-x-exp/v5-exp/memory-test/stage-2-tune.yaml', '--model.load_model=../model/Memory-Tune-Stage-1-RWKV-v5-7B-world.pth', '--trainer.callbacks.init_args.dirpath=../checkpoint/stage-2-memory-finetune/RWKV-v5-7B-world.pth/', '--trainer.logger.init_args.name=[8xA100] RWKV-v5-7B-World - Mem-Finetune-2 (bs=256, train-ctx=8192, deepspeed_stage_2)', '--trainer.strategy=deepspeed_stage_2', '--trainer.devices=auto', '--trainer.microbatch_size=4', '--model.ctx_len=8192

In [7]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/stage-2-memory-finetune/{MODEL_NAME}/last.ckpt" \
        "../model/Memory-Tune-Stage-2-{MODEL_NAME}"
!cd "{TRAINER_DIR}" && ls -alh "../model/Memory-Tune-Stage-2-{MODEL_NAME}"

[2024-01-24 02:30:37,480] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/stage-2-memory-finetune/RWKV-v5-7B-world.pth/last.ckpt/checkpoint'
Detected checkpoint of type zero stage 2, world_size: 8
Parsing checkpoint created by deepspeed==0.12.6
Reconstructed fp32 state dict with 710 params 7518044160 elements
Saving bf16 state dict to ../model/Memory-Tune-Stage-2-RWKV-v5-7B-world.pth
-rw-r--r-- 1 root root 15G Jan 24 02:31 ../model/Memory-Tune-Stage-2-RWKV-v5-7B-world.pth
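
As a sanity check on the export, the file size can be estimated from the element count in the log. At 2 bytes per bf16 element, the raw tensor data comes to just over 14 GiB, which is consistent with the `15G` shown by `ls -alh` if the listing rounds sizes up (GNU `ls -h` does this, as far as I know) and the file carries a small serialization overhead.

```python
import math

NUM_ELEMENTS = 7_518_044_160  # "7518044160 elements" from the export log
BYTES_PER_BF16 = 2

raw_bytes = NUM_ELEMENTS * BYTES_PER_BF16
raw_gib = raw_bytes / 1024**3
print(f"{raw_bytes} bytes = {raw_gib:.2f} GiB, rounded up: {math.ceil(raw_gib)}G")
```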


In [8]:
# Let's do a memory eval!
!python3 ./memory_script/eval_v5_memory_guided.py "{PROJECT_DIR}/model/Memory-Tune-Stage-2-{MODEL_NAME}"
!python3 ./memory_script/eval_v5_memory_guided.py "{PROJECT_DIR}/model/Memory-Tune-Stage-2-{MODEL_NAME}" "none" 1000 4000
!python3 ./memory_script/eval_v5_memory_guided.py "{PROJECT_DIR}/model/Memory-Tune-Stage-2-{MODEL_NAME}" "none" 4000 8000

SCRIPT_DIR:  /workspace/RWKV-infctx-trainer/notebook/rwkv-x-exp/v5-exp/memory-test/memory_script
PROJECT_DIR:  /workspace/RWKV-infctx-trainer
MODEL_CODE_DIR:  /workspace/RWKV-infctx-trainer/RWKV-v5
[2024-01-24 02:31:46,389] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV infctx using 'torch-jit' with torch '2.1.1+cu121'
  return self.fget.__get__(instance, owner)()
---
[RWKV.TimeMix] Compiling CUDA kernel with HEAD_SIZE=64
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/wkv5/build.ninja...
Building extension module wkv5...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv5...
[RWKV.TimeMix] CUDA kernel compiled & loaded globally
---
  batch_tokens = torch.tensor(
###
#