## Finetune the Model

## Environment Setup

This step uses the following libraries:

| Library | License |
|-|-|
| [PyTorch](https://github.com/pytorch/pytorch) | BSD 3-Clause |
| [python-dotenv](https://github.com/theskumar/python-dotenv) | BSD 3-Clause |
| [transformers](https://github.com/huggingface/transformers) | Apache 2.0 |
| [datasets](https://github.com/huggingface/datasets) | Apache 2.0 |
| [trl](https://github.com/huggingface/trl) | Apache 2.0 |
| [peft](https://github.com/huggingface/peft) | Apache 2.0 |
| [evaluate](https://github.com/huggingface/evaluate) | Apache 2.0 |
| [bert_score](https://github.com/Tiiiger/bert_score) | MIT |
| [numpy](https://numpy.org/about/) | Modified BSD |

In [5]:
import os
import json
from pathlib import Path
import numpy as np

import torch
from trl import DataCollatorForCompletionOnlyLM
from datasets import load_dataset
from transformers import (
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    AutoTokenizer,
    AutoModelForCausalLM,
    EarlyStoppingCallback)
from peft import LoraConfig, get_peft_model
import evaluate

In [6]:
DOCUMENT    = "FM5_0"
PDF_PATH    = Path("pdfs/raw/fm5-0.pdf")
BASE_MODEL  = Path("QuantFactory/Llama-3.2-1B-GGUF")
GGUF_FILE   = "Llama-3.2-1B.Q8_0.gguf"
CACHE_DIR   = "hf_cache"
DATA_DIR    = DOCUMENT / BASE_MODEL / "data"
MODEL_DIR   = DOCUMENT / BASE_MODEL / "lora"
CHUNKED_DATA = DATA_DIR / "chunked" / "chunked.jsonl"
QA_DATA      = DATA_DIR / "qa"       / "qa_pairs.jsonl"

os.environ["TOKENIZERS_PARALLELISM"] = "true"

Load the dataset and get the tokenizer ready.

In [7]:
raw_ds = load_dataset("json", data_files=QA_DATA.as_posix(), split="train")

Generating train split: 0 examples [00:00, ? examples/s]

Failed to load JSON from file '/home/pat/PycharmProjects/DunedainAssessment/notebooks/FM5_0/QuantFactory/Llama-3.2-1B-GGUF/data/qa/qa_pairs.jsonl' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column() changed from object to string in row 0


DatasetGenerationError: An error occurred while generating the dataset
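
The `ArrowInvalid` message above says that a value changed from `object` to `string` at row 0. One way to narrow this down before calling `load_dataset` again is to tally the JSON type of each line and of each field. The sketch below is only an illustration: `summarize_jsonl_types` is a hypothetical helper (not part of `datasets`), and the two example rows are made up.

```python
import json
from collections import Counter, defaultdict

def summarize_jsonl_types(lines):
    """Count the JSON type of every top-level value and of every field."""
    top_level = Counter()
    field_types = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value = json.loads(line)
        top_level[type(value).__name__] += 1
        if isinstance(value, dict):
            for key, field in value.items():
                field_types[key][type(field).__name__] += 1
    return top_level, dict(field_types)

# Hypothetical rows: the second line is a JSON string, not a JSON object.
example_lines = [
    '{"context": "c", "question": "q", "answer": "a"}',
    '"{\\"context\\": \\"c\\"}"',
]
top_level, field_types = summarize_jsonl_types(example_lines)
print(top_level)    # more than one top-level type hints at a mixed schema
print(field_types)  # more than one type under a key hints at a mixed column
```

To run it on the real file, pass an open file handle, e.g. `with open(QA_DATA, encoding="utf-8") as f: summarize_jsonl_types(f)`.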

In [4]:
tok              = AutoTokenizer.from_pretrained(MODEL_DIR)
# tok.pad_token    = "<|finetune_right_pad_id|>"
# tok.pad_token_id = tok.convert_tokens_to_ids(tok.pad_token)

Configure the model.

In [5]:
TEST_PORTION = 0.1
IGNORE_ID    = -100
MAX_LEN      = 1024

And set up the prompt with prompt builders.

In [6]:
# The leading space separates this instruction from the context it is appended to.
sys_prompt = " You are an FM-5-0 assistant. Concisely answer the following question."

In [7]:
sys_role = "system"
usr_role = "user"
bot_role = "assistant"

These special tokens are already in the tokenizer, but being able to reference them by name will come in handy.

In [8]:
bos_tok      = "<|begin_of_text|>"
eot_id_tok   = "<|eot_id|>"
start_hd_tok = "<|start_header_id|>"
end_hd_tok   = "<|end_header_id|>"
eot_tok      = "<|end_of_text|>"

Define some functions to process the data so we can train on it.

In [9]:
def build_prompt(sys, context, usr, ans=None):
    prompt  = f"{bos_tok}"
    prompt += f"{start_hd_tok}{sys_role}{end_hd_tok}{context}{sys}{eot_id_tok}"
    prompt += f"{start_hd_tok}{usr_role}{end_hd_tok}{usr}{eot_id_tok}"
    prompt += f"{start_hd_tok}{bot_role}{end_hd_tok}"

    if ans is not None:
        prompt += f"{ans}{eot_id_tok}{eot_tok}"

    return prompt

In [10]:
def row_to_prompt(row):
    return {"text": build_prompt(sys_prompt, row['context'], row['question'], ans=row['answer'])}
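
To make the two prompt shapes concrete, here is a small self-contained sketch that repeats the builder with the same token strings and runs it on a made-up row (the `CTX.`, `Q?`, `A.` and ` SYS.` values are placeholders, not data from the QA set). With `ans` set, the text ends with the answer and both end tokens. Without it, the text stops at the assistant header so the model continues from there.

```python
# Token strings copied from the cells above.
bos_tok, eot_id_tok, eot_tok = "<|begin_of_text|>", "<|eot_id|>", "<|end_of_text|>"
start_hd_tok, end_hd_tok = "<|start_header_id|>", "<|end_header_id|>"

def build_prompt(sys, context, usr, ans=None):
    prompt  = f"{bos_tok}"
    prompt += f"{start_hd_tok}system{end_hd_tok}{context}{sys}{eot_id_tok}"
    prompt += f"{start_hd_tok}user{end_hd_tok}{usr}{eot_id_tok}"
    prompt += f"{start_hd_tok}assistant{end_hd_tok}"
    if ans is not None:
        # Training text: append the answer and close the turn.
        prompt += f"{ans}{eot_id_tok}{eot_tok}"
    return prompt

# Made-up row, for illustration only.
row = {"context": "CTX.", "question": "Q?", "answer": "A."}
train_text = build_prompt(" SYS.", row["context"], row["question"], ans=row["answer"])
infer_text = build_prompt(" SYS.", row["context"], row["question"])

print(train_text)
print(infer_text)
```

The inference prompt is a strict prefix of the training text, which is what lets the collator mask everything before the answer.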

Now process the data. I'll start with one sample to see how it's handled through the collator and evaluations.

In [30]:
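# Build the prompt text for every row, then split.
text_ds = raw_ds.map(row_to_prompt, remove_columns=raw_ds.column_names)
splits  = text_ds.train_test_split(TEST_PORTION, seed=42)
sample  = splits["train"][100]
print(sample)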

{'text': "<|begin_of_text|><|start_header_id|>system<|end_header_id|>The  division  tactical  command  post  will  control  the  air assault').\n- b. (U) Signal. Describe the scheme of signal support, including location and movement of key signal nodes and critical electromagnetic spectrum considerations throughout the operation. State the primary, alternate, contingency, and emergency communications plan. Refer to Annex H (Signal) as required.\n\nACKNOWLEDGE: Include only if attachment is distributed separately from the base order.\n\n[Commander's last name]\n\n[Commander's rank]\n\nThe commander or authorized representative signs the original copy of the attachment. If the representative signs the original, add the phrase 'For the Commander.' The signed copy is the historical copy and remains in the headquarters' files.\n\n## OFFICIAL:\n\n[Authenticator's name]\n\n[Authenticator's position]\n\nUse only if the commander does not sign the original attachment. If the commander signs the

In [40]:
tok.add_bos_token = False
tokenised = tok(
    sample["text"],
    max_length=1024,
    truncation=True,
)

print("IDS IN   :", tokenised["input_ids"][:40])
print("MASK     :", tokenised["attention_mask"][:40])
print("TOKENS IN:", tok.convert_ids_to_tokens(tokenised["input_ids"][:40]))
print("TOKENS IN:", tok.decode(tokenised["input_ids"][:40], clean_up_tokenization_spaces=True))

IDS IN   : [128000, 128006, 9125, 128007, 791, 220, 13096, 220, 39747, 220, 3290, 220, 1772, 220, 690, 220, 2585, 220, 279, 220, 3805, 11965, 1861, 198, 12, 293, 13, 320, 52, 8, 28329, 13, 61885, 279, 13155, 315, 8450, 1862, 11, 2737]
MASK     : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
TOKENS IN: ['<|begin_of_text|>', '<|start_header_id|>', 'system', '<|end_header_id|>', 'The', 'Ġ', 'Ġdivision', 'Ġ', 'Ġtactical', 'Ġ', 'Ġcommand', 'Ġ', 'Ġpost', 'Ġ', 'Ġwill', 'Ġ', 'Ġcontrol', 'Ġ', 'Ġthe', 'Ġ', 'Ġair', 'Ġassault', "').", 'Ċ', '-', 'Ġb', '.', 'Ġ(', 'U', ')', 'ĠSignal', '.', 'ĠDescribe', 'Ġthe', 'Ġscheme', 'Ġof', 'Ġsignal', 'Ġsupport', ',', 'Ġincluding']
TOKENS IN: <|begin_of_text|><|start_header_id|>system<|end_header_id|>The  division  tactical  command  post  will  control  the  air assault').
- b. (U) Signal. Describe the scheme of signal support, including


So far so good: the desired prompt tokenizes and detokenizes properly. Next I'll check the collator. I'm looking for the entire prompt to be ignored, up to the actual assistant response.

In [44]:
collator = DataCollatorForCompletionOnlyLM(
    tokenizer            = tok,
    instruction_template = f"{start_hd_tok}{usr_role}{end_hd_tok}",
    response_template    = f"{start_hd_tok}{bot_role}{end_hd_tok}",
)
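
The idea behind completion-only masking can be sketched without `trl`: find the response template in the token ids and set every label up to and including it to the ignore index. This is a simplified illustration, not the collator's actual implementation. `mask_before_response` is a hypothetical helper, it handles a single turn only, and the ids are a short excerpt of the ones that appear in this notebook's outputs.

```python
IGNORE_ID = -100

def mask_before_response(input_ids, response_ids, ignore_id=IGNORE_ID):
    """Return labels where everything up to and including the last
    occurrence of response_ids is replaced by ignore_id."""
    n, m = len(input_ids), len(response_ids)
    start = None
    for i in range(n - m, -1, -1):           # scan right to left
        if input_ids[i:i + m] == response_ids:
            start = i + m                    # first token after the template
            break
    if start is None:
        return [ignore_id] * n               # template not found: mask everything
    return [ignore_id] * start + list(input_ids[start:])

# Tail of the user turn, then <|start_header_id|> assistant <|end_header_id|>,
# then two answer tokens followed by <|eot_id|> and <|end_of_text|>.
input_ids    = [3493, 30, 128009, 128006, 78191, 128007, 2028, 54368, 128009, 128001]
response_ids = [128006, 78191, 128007]

labels = mask_before_response(input_ids, response_ids)
print(labels)
```

Only the answer tokens and the two end tokens keep their ids, so only they contribute to the loss.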

In [45]:
batch = collator([tokenised])
for k,v in batch.items():
    print(k, v.shape, v[0][:40])

input_ids torch.Size([1, 583]) tensor([128000, 128006,   9125, 128007,    791,    220,  13096,    220,  39747,
           220,   3290,    220,   1772,    220,    690,    220,   2585,    220,
           279,    220,   3805,  11965,   1861,    198,     12,    293,     13,
           320,     52,      8,  28329,     13,  61885,    279,  13155,    315,
          8450,   1862,     11,   2737])
attention_mask torch.Size([1, 583]) tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
labels torch.Size([1, 583]) tensor([-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100])


The first part looks good: all of the injected context is ignored.

In [83]:
for k,v in batch.items():
    print(k, v.shape, v[0][-40:])

input_ids torch.Size([1, 583]) tensor([ 26777,    323,  68870,  16777,      8,   3493,     30, 128009, 128006,
         78191, 128007,   2028,  54368,   5825,  16188,  38864,     11,  20447,
            11,    323,  11470,    369,  11469,  89720,    358,    320,  26777,
           323,  68870,  16777,      8,    311,    279,   2385,   3197,    477,
          2015,     13, 128009, 128001])
attention_mask torch.Size([1, 583]) tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
labels torch.Size([1, 583]) tensor([  -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
          -100,   -100,   2028,  54368,   5825,  16188,  38864,     11,  20447,
            11,    323,  11470,    369,  11469,  89720,    358,    320,  26777,
           323,  68870,  16777,      8,    311,    279,   2385,   3197,    477,
          2015,     13, 128009, 128001])


It looks like the very end of this sample contains the actual tokens (non -100 values). I'll detokenize those to make sure the entire answer is included.

In [86]:
labels = batch["labels"][0].tolist()
last_mask_index = len(labels) - 1 - labels[::-1].index(IGNORE_ID)
masked_label = tok.decode(labels[last_mask_index + 1:], skip_special_tokens=True)
print(masked_label)

This annex provides fundamental considerations, formats, and instructions for developing Annex I (Air and Missile Defense) to the base plan or order.


With data processing nailed down, I can split the data into training and testing datasets and prepare for training. For a more robust training cycle that leverages all the data, I would use 10-fold cross-validation, with 2 folds held out as test data, while tuning the hyperparameters. Once I'm happy with the hyperparameters, I would train on all the data. For this notebook, though, I'm just going to use some typical, reasonable hyperparameter values.
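
One possible reading of that cross-validation plan, sketched with NumPy only: shuffle the row indices into 10 folds, hold 2 out as a fixed test set, and rotate the validation fold through the remaining 8. `kfold_with_holdout` is a hypothetical helper and `n_rows=100` is a made-up size. Nothing here is run in this notebook.

```python
import numpy as np

def kfold_with_holdout(n_rows, n_folds=10, n_test_folds=2, seed=42):
    """Yield (train_idx, val_idx, test_idx) for each cross-validation round."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_rows), n_folds)
    test_idx = np.concatenate(folds[:n_test_folds])   # fixed held-out test set
    cv_folds = folds[n_test_folds:]                   # folds used for tuning
    for i, val_idx in enumerate(cv_folds):
        train_idx = np.concatenate(
            [fold for j, fold in enumerate(cv_folds) if j != i]
        )
        yield train_idx, val_idx, test_idx

cv_splits = list(kfold_with_holdout(n_rows=100))
for train_idx, val_idx, test_idx in cv_splits[:2]:
    print(len(train_idx), len(val_idx), len(test_idx))
```

Each index array could then be passed to `Dataset.select` to build the per-round datasets.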

In [11]:
# Build the prompt text for every row before splitting; the tokenizer reads the "text" column.
text_ds     = raw_ds.map(row_to_prompt, remove_columns=raw_ds.column_names)
splits      = text_ds.train_test_split(TEST_PORTION, seed=42)
text_train  = splits["train"]
text_test   = splits["test"]

def tokenize(batch):
    # The prompt already contains the special tokens, so don't add them again.
    return tok(batch["text"], add_special_tokens=False, truncation=True, max_length=MAX_LEN, padding=False)

tok_test  = text_test.map(tokenize, batched=True)
tok_train = text_train.map(tokenize, batched=True)

Now we load the model and the LoRA adapter.

Ideally this would be dead simple with SFTTrainer, but it doesn't support custom metrics yet (https://github.com/huggingface/trl/issues/862) so we have to do everything manually. I'm using gradient checkpointing just because I ran out of memory while training on my personal GPU.

In [13]:
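base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL.as_posix(),  # Hub repo ids use forward slashes on every OS
    cache_dir=CACHE_DIR,
    gguf_file=GGUF_FILE,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Trade extra compute for lower activation memory during training.
base_model.gradient_checkpointing_enable()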

Converting and de-quantizing GGUF tensors...:   0%|          | 0/147 [00:00<?, ?it/s]

In [14]:
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM")
lora_model = get_peft_model(base_model, lora_cfg)
lora_model.print_trainable_parameters()  # sanity check

trainable params: 851,968 || all params: 1,236,666,368 || trainable%: 0.0689
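
The trainable-parameter count can be checked by hand. Each LoRA-wrapped linear layer adds `r * (d_in + d_out)` weights (the A and B matrices). The model dimensions below are assumptions about Llama-3.2-1B that are not stated anywhere in this notebook: 16 layers, hidden size 2048, and 8 key/value heads of dimension 64. Under those assumptions the arithmetic reproduces the printout.

```python
# Assumed Llama-3.2-1B dimensions (not stated in this notebook).
hidden_size  = 2048
num_layers   = 16
num_kv_heads = 8
head_dim     = 64
r            = 8   # LoRA rank from lora_cfg

q_out = hidden_size              # q_proj: 2048 -> 2048
v_out = num_kv_heads * head_dim  # v_proj: 2048 -> 512

def lora_params(d_in, d_out, rank):
    # A is (rank x d_in), B is (d_out x rank)
    return rank * (d_in + d_out)

per_layer = lora_params(hidden_size, q_out, r) + lora_params(hidden_size, v_out, r)
trainable = per_layer * num_layers
total     = 1_236_666_368        # "all params" from the printout above
pct       = f"{100 * trainable / total:.4f}"

print(trainable, pct)  # → 851968 0.0689
```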


Before training, I'll set up some metrics for evaluation.

 - F1: This is the span-wise F1 from SQuAD. It shows how well the prediction and the truth match when each is treated as a "bag of tokens".
 - Perplexity: I like to look at this over loss because you can interpret it as how "confident" the model is about the next token. E.g. a perplexity of ~2 means the model is, on average, as uncertain as if it were choosing between two equally likely tokens.
 - BERT Score: This helps show how close the meaning of the output is to the label. It compares the BERT embeddings of the prediction and the label, and the embeddings of similar words are more closely aligned than those of disparate words.

There are others I would like to use to gain as much insight as possible, but I omitted them here for simplicity.
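
As a rough illustration of the first two metrics, here is a library-free sketch. `token_f1` is a simplified stand-in for the SQuAD metric: it lowercases and splits on whitespace only, whereas the real metric also normalizes punctuation and articles. `perplexity` is the exponential of the mean per-token loss. Both function names and the example strings are made up for this sketch.

```python
import math
from collections import Counter

def token_f1(prediction, reference):
    """Bag-of-tokens F1 between two strings (simplified normalization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens  = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall    = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def perplexity(token_losses):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(sum(token_losses) / len(token_losses))

f1  = token_f1("the annex provides formats", "this annex provides formats")
ppl = perplexity([math.log(2)] * 3)   # loss of ln(2) per token

print(f1, ppl)
```

Three of the four tokens overlap, so precision and recall are both 0.75, and a constant loss of ln 2 corresponds to a perplexity of 2.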

In [16]:
bert_metric   = evaluate.load("bertscore", cache_dir=CACHE_DIR)
squad_metric  = evaluate.load("squad", cache_dir=CACHE_DIR)

In [17]:
def compute_metrics(eval_preds) -> dict:
    preds  = eval_preds.predictions
    labels = eval_preds.label_ids
    losses = eval_preds.losses

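    # -100 marks ignored positions; swap in a real token id so decoding works.
    # Fall back to EOS if the tokenizer has no pad token configured.
    pad_id = tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id
    cleaned_labels = np.where(labels != IGNORE_ID, labels, pad_id)
    cleaned_preds  = np.where(preds  != IGNORE_ID, preds,  pad_id)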

    decoded_preds  = tok.batch_decode(cleaned_preds.tolist(), skip_special_tokens=True)
    decoded_labels = tok.batch_decode(cleaned_labels.tolist(), skip_special_tokens=True)

    squad_preds = [
        {"id": str(i), "prediction_text": p}
        for i, p in enumerate(decoded_preds)
    ]
    squad_refs = [
        {
            "id": str(i),
            "answers": {"text": [decoded_labels[i]], "answer_start": [0]}
        }
        for i in range(len(decoded_labels))
    ]
    squad_results = squad_metric.compute(
        predictions=squad_preds,
        references=squad_refs
    )

    bert_results = bert_metric.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        lang="en"
    )

    return {
        # Perplexity is exp(mean loss), not the mean of exp(loss).
        "perplexity":      float(np.exp(np.mean(losses))),
        "bert_precision":  np.mean(bert_results["precision"]),
        "bert_recall":     np.mean(bert_results["recall"]),
        "bert_f1":         np.mean(bert_results["f1"]),
        "qa_f1":           squad_results["f1"],
        "exact_match":     squad_results["exact_match"],
    }

All that's left is to set up the training loop and train the model.

In [18]:
args = Seq2SeqTrainingArguments(
    output_dir                  = MODEL_DIR,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 32,
    num_train_epochs            = 10,
    learning_rate               = 2e-4,
    logging_steps               = 1,
    save_steps                  = 1,
    save_total_limit            = 10,
    neftune_noise_alpha         = 0.1,
    bf16                        = True,
    bf16_full_eval              = True,
    save_strategy               = "epoch",
    eval_strategy               = "epoch",
    report_to                   = "none",
    label_names                 = ["labels"],
    metric_for_best_model       = "eval_loss",
    load_best_model_at_end      = True,
    eval_on_start               = True,
    eval_accumulation_steps     = 10,
    include_for_metrics         = ["loss"],
    predict_with_generate       = True,
)

early_stopping = EarlyStoppingCallback(
    early_stopping_patience  = 1,
    early_stopping_threshold = 0.001,
)

trainer = Seq2SeqTrainer(
    model           = lora_model,
    args            = args,
    train_dataset   = tok_train,
    eval_dataset    = tok_test,
    data_collator   = collator,
    callbacks       = [early_stopping],
    compute_metrics = compute_metrics,
)
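
With `early_stopping_patience = 1` and `early_stopping_threshold = 0.001`, training halts once an evaluation fails to beat the best `eval_loss` by more than the threshold. The sketch below is a simplified, library-free illustration of that rule, not the `transformers` implementation. `first_stop_index` is a hypothetical helper and the loss values are made up.

```python
def first_stop_index(eval_losses, patience=1, threshold=0.001):
    """Return the index of the evaluation that triggers early stopping,
    or None if training would run to completion."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        improved = best is None or (loss < best and abs(loss - best) > threshold)
        if improved:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# Made-up losses: the third evaluation improves by only 0.0005.
stop_at = first_stop_index([1.0, 0.8, 0.7995, 0.6])
print(stop_at)  # → 2
```

Separately, `per_device_train_batch_size = 2` with `gradient_accumulation_steps = 32` gives an effective batch of 64 sequences per device for each optimizer step.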

In [19]:
trainer.train()

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Epoch,Training Loss,Validation Loss



KeyboardInterrupt: 

A quick test to see if the model internalized any of the data or just learned how to copy from the context.

In [None]:
usr = "What does CCIR stand for?"
context  = ""

prompt =  build_prompt(sys_prompt, context, usr)

inputs = tok(prompt, return_tensors="pt").to(lora_model.device)
# temperature only takes effect when sampling is enabled
out = lora_model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
print(tok.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False))

And now compare the LoRA model to the untrained one.

In [None]:
chunks = []
with open(CHUNKED_DATA, "r", encoding="utf-8") as f:
    for line in f:
        chunks.append(json.loads(line))

In [None]:
usr = "What does CCIR stand for?"
context  = chunks[17]

prompt =  build_prompt(sys_prompt, context, usr)

inputs = tok(prompt, return_tensors="pt").to(lora_model.device)

In [None]:
# temperature only takes effect when sampling is enabled
lora_out = lora_model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
print(tok.decode(lora_out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False))

In [None]:
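# get_peft_model() injects the adapter into base_model in place, so disable it to get true base outputs.
with lora_model.disable_adapter():
    base_out = lora_model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
print(tok.decode(base_out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False))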

In [4]:
final_model_path = MODEL_DIR / "final"
lora_model.save_pretrained(final_model_path.as_posix())
# tok.save_pretrained("ft-rag-qa")

NameError: name 'lora_model' is not defined