## The Prepared data

In [3]:
!head -n1 ft_data/alpaca_data.jsonl | python -m json.tool --json-lines

{
    "instruction": "You are an assistant that takes a piece of text that has been corrupted during OCR digitisation, and produce a corrected version of the same text.",
    "input": "Hunterd pursue killer great white utalking Austrnlimn death bench (A JP) A JP - A great white shark that killed a teenago aurfer off tho South Auatralian capital Adohuide will be hunted down and destroyed after it returned to stalk tho city's beaches, authorities said.",
    "output": "Hunters pursue killer great white stalking Australian death beach (AFP) AFP - A great white shark that killed a teenage surfer off the South Australian capital Adelaide will be hunted down and destroyed after it returned to stalk the city's beaches, authorities said."
}


## The Config

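Pay close attention to `datasets` and `train_on_inputs`: the first decides which file and prompt format are used, and the second decides whether the prompt tokens count toward the loss.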

In [5]:
!cat ax.yml

base_model: meta-llama/Meta-Llama-3-8B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

lora_fan_in_fan_out: false
data_seed: 49
seed: 49

datasets:
  - path: ft_data/alpaca_data.jsonl
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./qlora-alpaca-out
hub_model_id: pbevan11/llama-3-8b-ocr-correction

adapter: qlora
lora_model_dir:

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project: sncds/ocr-ft
wandb_entity: peterbevan

gradient_accumulation_steps: 4
micro_batch_size: 2 # was 16
eval_batch_size: 2 # was 16
num_epochs: 4
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_l
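
A quick way to double-check those two settings is to pull them out of the parsed config rather than eyeballing the YAML. This is a minimal sketch: `summarize_config` is a hypothetical helper (not part of axolotl), and the dict below is hand-written to mirror the config above so the snippet runs on its own.

```python
def summarize_config(cfg):
    """Return the settings that decide what text is loaded and what is trained on."""
    # Each dataset entry is reduced to its (path, type) pair.
    datasets = [(d.get("path"), d.get("type")) for d in cfg.get("datasets") or []]
    return {
        "datasets": datasets,
        "train_on_inputs": cfg.get("train_on_inputs"),
        "dataset_prepared_path": cfg.get("dataset_prepared_path"),
    }


# Hand-written stand-in for the parsed YAML (assumption: same keys as ax.yml above).
cfg = {
    "datasets": [{"path": "ft_data/alpaca_data.jsonl", "type": "alpaca"}],
    "dataset_prepared_path": "last_run_prepared",
    "train_on_inputs": False,
}
summary = summarize_config(cfg)
print(summary)
```

In the notebook, the same function could be called on the dict returned by `yaml.safe_load` for `ax.yml`.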

## Do the Preprocessing

In [3]:
! python -m axolotl.cli.preprocess llama3-ocr.yml

This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64

[2024-07-26 10:06:44,039] [INFO] [datasets.<module>:58] [PID:2312] PyTorch version 2.1.2+cu118 available.
[2024-07-26 10:06:44,801] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-26 10:06:44,878] [INFO] [root.spawn:38] [PID:2312] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -c /tmp/tmpqoao59kv/test.c -o /tmp/tmpqoao59kv/test.o
[2024-07-26 

## Debug

### See All The commands

In [7]:
! python -m axolotl.cli.preprocess ax.yml --help

[2024-07-25 18:29:45,092] [INFO] [datasets.<module>:58] [PID:1706] PyTorch version 2.1.2+cu118 available.
[2024-07-25 18:29:45,988] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-25 18:29:46,070] [INFO] [root.spawn:38] [PID:1706] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -c /tmp/tmppl35x243/test.c -o /tmp/tmppl35x243/test.o
[2024-07-25 

### Text Only

I often have problems with `debug_text_only`, so I inspect the prepared data manually.

In [8]:
!ls -lah last_run_prepared/

total 4.0K
drwxr-xr-x 5 root root  112 Jul 25 18:22 .
drwxr-xr-x 8 root root 4.0K Jul 25 18:28 ..
drwxr-xr-x 2 root root    6 Jul 15 21:48 .ipynb_checkpoints
drwxr-xr-x 2 root root   82 Jul 25 18:22 7c12a0bc491bbea735059d4fd220fce3
drwxr-xr-x 3 root root  108 Jul 15 22:15 f79576f2544449b45279c21970aa1125


In [9]:
import json, yaml
from transformers import AutoTokenizer
from datasets import load_from_disk

with open('ax.yml', 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)
ds = load_from_disk('last_run_prepared/f79576f2544449b45279c21970aa1125/')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Below is the assembled text in its flattened format. Notice the spaces that axolotl is adding; I will talk about this at the end.

This makes me paranoid, because the prompt assembled for training can differ from the prompt you build at inference. You just have to make sure it's the same at inference!


In [10]:
print(tok.decode(ds['input_ids'][111]))

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are an assistant that takes a piece of text that has been corrupted during OCR digitisation, and produce a corrected version of the same text.

### Input:
WHO Urges Science t. Keep World Healthy (AP) AI'Technological advances that help drus companies churn out highly profitable prescription medications must also be applied to improving public health care arouud tb. giob., a World Health Orgammtion report released Wednesday says.

### Response:
WHO Urges Science to Keep World Healthy (AP) AP - Technological advances that help drug companies churn out highly profitable prescription medications must also be applied to improving public health care around the globe, a World Health Organization report released Wednesday says.<|end_of_text|>
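
To keep inference consistent with the text above, one option is to build the prompt from a single template string. This is a sketch, not axolotl's own code: `ALPACA_TEMPLATE` and `build_prompt` are names I made up, and the template (including its whitespace) is an assumption transcribed from the decoded example, so compare the resulting token ids against the prepared data before trusting it. `<|begin_of_text|>` is left out on the assumption that the tokenizer adds it.

```python
# Assumption: transcribed from the decoded training example; verify the whitespace
# against the actual token ids, since rendered text can hide extra spaces.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction, input_text):
    """Fill the template; the model is expected to continue after '### Response:'."""
    return ALPACA_TEMPLATE.format(instruction=instruction, input=input_text)


prompt = build_prompt("Fix the OCR errors in this text.", "Helo wrld")
print(prompt)
```

Tokenizing `prompt` and comparing the ids with the start of a prepared row is the real test; matching strings alone does not guarantee matching tokens.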


### Other Notes

- Seeing the flattened version often helps you spot issues in your prompt that are hard to notice in JSONL format.
- Check multiple examples!
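
Checking multiple examples is easier with a small sampler. The sketch below uses a hypothetical helper, `sample_decoded`, which takes any indexable collection of rows with an `input_ids` field plus a decode function. The toy rows and the lambda are stand-ins so it runs without downloading a tokenizer.

```python
import random


def sample_decoded(rows, decode, k=3, seed=0):
    """Decode k randomly chosen rows so more than one example gets inspected."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    idxs = sorted(rng.sample(range(len(rows)), min(k, len(rows))))
    return [(i, decode(rows[i]["input_ids"])) for i in idxs]


# Toy stand-ins (assumption): real rows would come from the prepared dataset.
toy_rows = [{"input_ids": [n, n + 1]} for n in range(10)]
toy_decode = lambda ids: " ".join(str(i) for i in ids)

samples = sample_decoded(toy_rows, toy_decode, k=3)
for i, text in samples:
    print(f"--- row {i} ---\n{text}")
```

In the notebook this would be called as `sample_decoded(ds, tok.decode, k=5)`, using the `ds` and `tok` objects loaded earlier.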

### Verbose debugging

This helps you check things like:
1. Ignored inputs (`train_on_inputs: false`): notice the red color, which indicates tokens that are ignored.
2. Token ids (e.g., what are those spaces right before `##`?).
3. Special tokens: the logs tell you what they are.
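
The first check can also be done by hand, without relying on terminal colors. This sketch assumes the prepared dataset has a `labels` column in which ignored positions hold `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. I have not confirmed the column name from the output above, so treat it as an assumption. `split_by_mask` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def split_by_mask(input_ids, labels, ignore_index=IGNORE_INDEX):
    """Separate token ids into those excluded from the loss and those trained on."""
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    ignored = [t for t, lab in zip(input_ids, labels) if lab == ignore_index]
    trained = [t for t, lab in zip(input_ids, labels) if lab != ignore_index]
    return ignored, trained


# Toy row: the first three tokens play the prompt (masked), the last two the response.
ignored, trained = split_by_mask([11, 12, 13, 21, 22], [-100, -100, -100, 21, 22])
print(ignored, trained)
```

With a real row, decoding `ignored` and `trained` separately with `tok.decode` should show the prompt in the first and only the response in the second when `train_on_inputs` is false.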

In [11]:
! python -m axolotl.cli.preprocess ax.yml --debug

[2024-07-25 18:30:15,689] [INFO] [datasets.<module>:58] [PID:1763] PyTorch version 2.1.2+cu118 available.
[2024-07-25 18:30:16,670] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-25 18:30:16,753] [INFO] [root.spawn:38] [PID:1763] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -c /tmp/tmpvlbqtv5c/test.c -o /tmp/tmpvlbqtv5c/test.o
[2024-07-25 

## Look at special tokens

Ex: What is `<0x0A>`?

In [13]:
tok.decode([13])

'.'

**But where is the space coming from?**

In [15]:
tok.decode(774)

'eth'

**It's pretty confusing!  See [this blog post](https://hamel.dev/notes/llm/finetuning/05_tokenizer_gotchas.html)**
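
One way to chase down a stray space is to look at each id next to both its raw vocabulary entry and its decoded text. The sketch below uses a hypothetical helper, `token_table`, that takes the two lookups as functions. With a Hugging Face tokenizer those would be `tok.decode` and `tok.convert_ids_to_tokens`. The toy vocabulary is made up and follows the GPT-2 byte-level convention (`Ġ` for a leading space, `Ċ` for a newline); whether this tokenizer renders tokens that way is an assumption to check.

```python
def token_table(ids, decode, to_tokens):
    """Pair each id with its raw vocabulary entry and its decoded text."""
    return [(i, raw, decode([i])) for i, raw in zip(ids, to_tokens(ids))]


# Made-up three-token vocabulary, only here so the snippet runs on its own.
raw_vocab = {0: "Ġthe", 1: "Ċ", 2: "."}
rendered = {0: " the", 1: "\n", 2: "."}

table = token_table(
    [0, 1, 2],
    decode=lambda ids: "".join(rendered[i] for i in ids),
    to_tokens=lambda ids: [raw_vocab[i] for i in ids],
)
for token_id, raw, text in table:
    # repr() makes spaces and newlines visible
    print(token_id, repr(raw), repr(text))
```

In the notebook, `token_table(ds['input_ids'][111][:40], tok.decode, tok.convert_ids_to_tokens)` would show which ids around `###` carry the whitespace.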

What does Wing think?