## The Prepared data

In [18]:
!head -n1 sample_data/alpaca_synth_queries.jsonl | python -m json.tool --json-lines

{
    "conversations": [
        {
            "from": "system",
            "value": "Honeycomb is an observability platform that allows you to write queries to inspect trace data. You are an assistant that takes a natural language query (NLQ) and a list of valid columns and produce a Honeycomb query."
        },
        {
            "from": "human",
            "value": "\n\nNLQ: \"group by HTTP method\"\n\nColumns: ['query_string_num_tokens', 'query_string_length', 'data_queries', 'http.target', 'task.id', 'trace_root.http.target', 'topic', 'http.host', 'total_hits', 'db.user', 'domain_types', 'db.name', 'graphql.document', 'history', 'http.scheme', 'http.method', 'frontend.version', 'disposition_for_dBVVysC8x4Ymwg9rtjMckgw9', 'db.system', 'event_name', 'organization', 'auth.logout', 'organizations', 'name', 'net.transport', 'db.operation', 'disposition_for_UvsPPBVUn9FDuzDjsjYCqopq', 'disposition_for_1RUGSd7GdnP5tuKdgqBRZUm2', 'process.pid', 'disposition_for_6uyAoBc3PuvEcTTPFgPM3Rt

## The Config

Pay close attention to `datasets` and `train_on_inputs`

In [25]:
!cat hc.yml

base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

lora_fan_in_fan_out: false
data_seed: 49
seed: 49

datasets:
  - path: sample_data/alpaca_synth_queries.jsonl
    type: sharegpt
    conversation: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./qlora-alpaca-out
hub_model_id: hamel/hc-mistral-alpaca

adapter: qlora
lora_model_dir:

sequence_len: 896
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project: hc-axolotl-mistral
wandb_entity: hamelsmu

gradient_accumulation_steps: 4
micro_batch_size: 16
eval_batch_size: 16
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
max_grad_norm: 1.0
ada

### HF & WandB

These values point at my accounts. Change them in your config to your own Weights & Biases project and entity and your own Hugging Face Hub repo:

```yaml
wandb_project: hc-axolotl-mistral
wandb_entity: hamelsmu
hub_model_id: hamel/hc-mistral-alpaca
```

## Do the Preprocessing

In [26]:
! python -m axolotl.cli.preprocess hc.yml

[2024-05-16 21:32:44,934] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
                                 dP            dP   dP 
                                 88            88   88 
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88 
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88 
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88 
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP 
                                                       
                                                       

****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.26.1         
        peft: 0.9.1.dev0     
transformers: 4.39.0.dev0    
         trl: 0.7.9          
       torch: 2.0.1          
bitsandbytes: 0.41.3.post2   
****************************************
[2024-05-16 21:32:46,301] [INFO] [axolotl.normalize_config:182] [PID:751428] [RANK:0] GPU memory usage baseline: 0.000GB (+0.651GB

## Debug

### See all the commands

In [27]:
! python -m axolotl.cli.preprocess hc.yml --help

[2024-05-16 21:32:59,710] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
                                 dP            dP   dP 
                                 88            88   88 
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88 
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88 
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88 
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP 
                                                       
                                                       

****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.26.1         
        peft: 0.9.1.dev0     
transformers: 4.39.0.dev0    
         trl: 0.7.9          
       torch: 2.0.1          
bitsandbytes: 0.41.3.post2   
****************************************
[2024-05-16 21:33:00,878] [INFO] [axolotl.normalize_config:182] [PID:752112] [RANK:0] GPU memory usage baseline: 0.000GB (+0.651GB

### Text Only

I often have problems with `debug_text_only`, so I inspect the prepared data manually.

In [33]:
!ls -lah last_run_prepared/

total 12K
drwxrwxr-x 3 hamel hamel 4.0K May 16 21:29 .
drwxrwxr-x 8 hamel hamel 4.0K May 16 21:43 ..
drwxrwxr-x 2 hamel hamel 4.0K May 16 21:29 22cf9f5f00f9d3b9504fbaf9b68a2f75


In [34]:
import json, yaml
from transformers import AutoTokenizer
from datasets import load_from_disk

with open('hc.yml', 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)
ds = load_from_disk('last_run_prepared/22cf9f5f00f9d3b9504fbaf9b68a2f75/')
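
The directory name above looks like a hash, so it will probably differ on your machine. A minimal sketch for avoiding the hard-coded path, assuming the dataset you want is the most recently modified subdirectory of `dataset_prepared_path`. `latest_prepared_dir` is a hypothetical helper, not part of axolotl:

```python
from pathlib import Path

def latest_prepared_dir(root="last_run_prepared"):
    """Return the most recently modified subdirectory of `root` (hypothetical helper)."""
    subdirs = [p for p in Path(root).iterdir() if p.is_dir()]
    if not subdirs:
        raise FileNotFoundError(f"no prepared dataset found under {root!r}")
    # Assumption: the newest directory is the one written by the last preprocess run.
    return max(subdirs, key=lambda p: p.stat().st_mtime)

# Usage in the notebook (assumes preprocessing has already been run):
# ds = load_from_disk(str(latest_prepared_dir(cfg["dataset_prepared_path"])))
```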

Below is the assembled text in its flattened format. Notice the spaces that axolotl is adding. I'll talk about this at the end.

This makes me paranoid about differences between how the prompt is assembled for training and how it is assembled at inference. You have to make sure it's exactly the same at inference!
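
One way to check this is to tokenize the prompt you build at inference time and compare it, token by token, against the start of a training example. A minimal sketch: `first_mismatch` is a hypothetical helper, and the commented usage assumes an `inference_prompt` string of your own.

```python
def first_mismatch(train_ids, prompt_ids):
    """Return the index of the first token where the inference prompt diverges
    from the training example, or None if the prompt is an exact prefix."""
    for i, (a, b) in enumerate(zip(train_ids, prompt_ids)):
        if a != b:
            return i
    if len(prompt_ids) > len(train_ids):
        return len(train_ids)  # the prompt runs past the end of the training example
    return None

# Usage in the notebook (inference_prompt is whatever string you send at inference):
# prompt_ids = tok(inference_prompt)["input_ids"]
# i = first_mismatch(ds[0]["input_ids"], prompt_ids)
# if i is not None:
#     print(i, tok.convert_ids_to_tokens(ds[0]["input_ids"][max(0, i - 3): i + 3]))
```

A result of `None` means the inference prompt is tokenized exactly like the beginning of the training example. Any other value is the position to start looking at.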


In [35]:
print(tok.decode(ds['input_ids'][0]))

<s> Honeycomb is an observability platform that allows you to write queries to inspect trace data. You are an assistant that takes a natural language query (NLQ) and a list of valid columns and produce a Honeycomb query.

 ### Instruction: 

NLQ: "group by HTTP method"

Columns: ['query_string_num_tokens', 'query_string_length', 'data_queries', 'http.target', 'task.id', 'trace_root.http.target', 'topic', 'http.host', 'total_hits', 'db.user', 'domain_types', 'db.name', 'graphql.document', 'history', 'http.scheme', 'http.method', 'frontend.version', 'disposition_for_dBVVysC8x4Ymwg9rtjMckgw9', 'db.system', 'event_name', 'organization', 'auth.logout', 'organizations', 'name', 'net.transport', 'db.operation', 'disposition_for_UvsPPBVUn9FDuzDjsjYCqopq', 'disposition_for_1RUGSd7GdnP5tuKdgqBRZUm2', 'process.pid', 'disposition_for_6uyAoBc3PuvEcTTPFgPM3Rtk', 'exception.stacktrace', 'data_ingestion_individuals_count', 'disposition_for_qrnUBUz8YBfNX7Liekq6nKi3', 'task_type.type', 'disposition_for_

### Other Notes

- Seeing the flattened version often helps you spot issues in your prompt that are hard to notice in the JSONL format.
- Check multiple examples!
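
Checking several examples can be a small loop. A sketch, assuming the prepared dataset rows expose `input_ids` as in the cell above; `decode_examples` is a hypothetical helper:

```python
def decode_examples(ds, tok, n=3):
    """Decode the first `n` rows of a prepared dataset into flattened text."""
    texts = []
    for i in range(min(n, len(ds))):
        texts.append(tok.decode(ds[i]["input_ids"]))
    return texts

# Usage in the notebook:
# for text in decode_examples(ds, tok, n=5):
#     print(text)
#     print("=" * 80)
```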

### Verbose debugging

This helps you check things like:
1. Ignored inputs (`train_on_inputs: false`): notice the `red` color, which indicates tokens that are ignored.
2. Token ids (ex: what are those spaces right before `##`?).
3. The special tokens, which the logs list for you.
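
The masking can also be checked without relying on colors. A sketch, assuming the prepared dataset has a `labels` column alongside `input_ids` and that ignored positions carry the label `-100` (the default ignore index of PyTorch's cross-entropy loss). `split_by_mask` is a hypothetical helper:

```python
IGNORE_INDEX = -100  # assumption: label value for tokens excluded from the loss

def split_by_mask(input_ids, labels, ignore_index=IGNORE_INDEX):
    """Split one example into (ignored_ids, trained_ids) based on its labels."""
    ignored = [t for t, l in zip(input_ids, labels) if l == ignore_index]
    trained = [t for t, l in zip(input_ids, labels) if l != ignore_index]
    return ignored, trained

# Usage in the notebook:
# row = ds[0]
# ignored, trained = split_by_mask(row["input_ids"], row["labels"])
# print("TRAINED ON:", tok.decode(trained))
```

With `train_on_inputs: false`, the decoded `trained` tokens should contain only the response, not the system prompt or the instruction.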

In [32]:
! python -m axolotl.cli.preprocess hc.yml --debug

[2024-05-16 21:42:39,557] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
                                 dP            dP   dP 
                                 88            88   88 
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88 
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88 
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88 
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP 
                                                       
                                                       

****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.26.1         
        peft: 0.9.1.dev0     
transformers: 4.39.0.dev0    
         trl: 0.7.9          
       torch: 2.0.1          
bitsandbytes: 0.41.3.post2   
****************************************
[2024-05-16 21:42:40,813] [INFO] [axolotl.normalize_config:182] [PID:766298] [RANK:0] GPU memory usage baseline: 0.000GB (+0.651GB

## Look at special tokens

Ex: What is `<0x0A>`?

In [41]:
tok.decode([13])

'\n'

**But where is the space coming from?**

In [42]:
tok.decode(774)

'###'

**It's pretty confusing!  See [this blog post](https://hamel.dev/notes/llm/finetuning/05_tokenizer_gotchas.html)**
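
`decode` hides part of the picture because it turns vocabulary pieces back into text. A sketch that lists each id next to its raw piece via the tokenizer's `convert_ids_to_tokens` method; `token_table` is a hypothetical helper. In SentencePiece-style vocabularies a leading space is usually stored inside the piece as `▁`, which is one place a "missing" space can hide.

```python
def token_table(tok, ids):
    """Pair each token id with its raw vocabulary piece and its decoded text."""
    pieces = tok.convert_ids_to_tokens(ids)
    return [(i, piece, tok.decode([i])) for i, piece in zip(ids, pieces)]

# Usage in the notebook:
# for row in token_table(tok, ds[0]["input_ids"][:40]):
#     print(row)
```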

What does Wing think?