# Finetune an OSS model for our bot

We will use the [trl](https://github.com/huggingface/trl) library to make our life easier. Most of the code comes from the official [trl finetune example](https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py).

In [17]:
# !pip install accelerate transformers datasets bitsandbytes peft trl

In [18]:
from dataclasses import dataclass, field
from typing import Optional

import torch
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, HfArgumentParser, TrainingArguments

import wandb

from trl import SFTTrainer

from ft_utils import read_file, load_ds_from_artifact, LLMSampleCB, generate

What is really handy here is the data preprocessing that is baked into the `SFTTrainer` class. This trainer is a thin wrapper around the Transformers `Trainer`, but it adds the preprocessing needed to format and pack our instruction dataset.
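
To make "packing" concrete, here is a minimal sketch of the idea in plain Python: tokenized examples are concatenated with an end-of-sequence token between them, and the stream is cut into chunks of a fixed length. The function name `pack_sequences` and the toy token ids are made up for illustration, and TRL's actual implementation differs in its details.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    tokenized: Iterable[List[int]], seq_length: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate token lists (separated by EOS) and yield fixed-length chunks."""
    buffer: List[int] = []
    for ids in tokenized:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the boundary between two examples
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Leftover tokens that do not fill a whole chunk are dropped in this sketch.


# Toy token ids, not real tokenizer output.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(pack_sequences(examples, seq_length=4, eos_id=0))
print(packed)
```

Because chunks are cut at a fixed length, a packed sample can start or end in the middle of a prompt, and every sample has exactly `seq_length` tokens, so no compute is spent on padding.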

## Data

We will load the dataset we created previously.

In [19]:
training_data_path = "dataset/"

In [20]:
# by default the split is called train
ds = load_dataset("json", data_files=f"{training_data_path}/*.json")["train"].shuffle()

In [21]:
ds

Dataset({
    features: ['user', 'answer'],
    num_rows: 616
})

In [22]:
ds[0:3]

{'user': ["Bats are just regular old flying mammals. But don't tell anyone I said that, or they might start",
  'You know, Goldilocks and the Three Bears?',
  'Thanks.'],
 'answer': ['other()', 'other()', 'other()']}

In [23]:
splitted_ds = ds.train_test_split(test_size=0.1)
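
As a sanity check on the split sizes, the sketch below reproduces the 554/62 split from the 616 rows. It assumes the test size is rounded up and the train size is rounded down (scikit-learn-style rounding); `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test) for a fractional test size.

    Assumption: test size is rounded up and train size is rounded down.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor((1 - test_size) * n_samples)
    return n_train, n_test


n_train, n_test = split_sizes(616, 0.1)
print(n_train, n_test)
```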

Let's save this split in Hugging Face dataset format (fast parquet files under the hood).

In [24]:
splitted_ds.save_to_disk(f"{training_data_path}/split_dataset")

Saving the dataset (0/1 shards):   0%|          | 0/554 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/62 [00:00<?, ? examples/s]

Let's save this to W&B

In [25]:
# # You only need to do this once
# with wandb.init(project="otto", job_type="data_split"):
#     at = wandb.Artifact(name="split_dataset",
#                         type="dataset",
#                         description="The generated data splitted in 90/10")
#     at.add_dir(f"{training_data_path}/split_dataset")
#     wandb.log_artifact(at)

In [26]:
DATASET_ARTIFACT = 'capecape/otto/split_dataset:v2'

In [27]:
ds = load_ds_from_artifact(DATASET_ARTIFACT)
ds

[34m[1mwandb[0m:   7 of 7 files downloaded.  


DatasetDict({
    train: Dataset({
        features: ['user', 'answer'],
        num_rows: 554
    })
    test: Dataset({
        features: ['user', 'answer'],
        num_rows: 62
    })
})

## Prepare data for Training

> Depending on the model you will need to change this formatting function

We will fine-tune `mistralai/Mistral-7B-Instruct-v0.1` (the same recipe applies to Llama 2 from Meta AI). Depending on whether you use the `chat`/instruct version or the `vanilla` (base) version, you will need to format your instructions differently. My go-to places for finding these formats are the Hugging Face model card (though it is often missing there), the official paper (which can be hard to find), or the [Axolotl training library](https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py).
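
As a rough illustration of how these formats differ, the sketch below builds the same request in two styles. The helper names are hypothetical. The Llama 2 chat layout with a `<<SYS>>` block follows the publicly documented template as I understand it, and the Mistral helper mirrors the tight `[INST]...[/INST]` form used in this notebook. BOS/EOS tokens are left to the tokenizer. Always check the model card before relying on either.

```python
def format_mistral_instruct(instruction: str, answer: str = "") -> str:
    """Mistral-instruct style as used in this notebook (no spaces around the tags)."""
    return f"[INST]{instruction}[/INST]{answer}"


def format_llama2_chat(system: str, user: str, answer: str = "") -> str:
    """Llama 2 chat style: the system prompt sits in a <<SYS>> block inside [INST]."""
    prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
    return f"{prompt} {answer}" if answer else prompt


# Hypothetical system prompt and user request, for illustration only.
system = "Reply with the corresponding function call, be brief."
user = "What's the weather in Paris?"

print(format_mistral_instruct(f"{system}\nUSER_QUERY: {user}", 'weather(location="Paris")'))
print(format_llama2_chat(system, user, 'weather(location="Paris")'))
```

Passing an empty `answer` gives the inference-time prompt, which is the same trick `create_prompt` relies on later with its default argument.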

In [28]:
mistral_prompt = """[INST]You are AI that converts human request into api calls. 
You have a set of functions:
-news(topic="[topic]") asks for latest headlines about a topic.
-math(question="[question]") asks a math question in python format.
-notes(action="add|list", note="[note]") lets a user take simple notes.
-openai(prompt="[prompt]") asks openai a question.
-runapp(program="[program]") runs a program locally.
-story(description=[description]) lets a user ask for a story.
-timecheck(location="[location]") ask for the time at a location. If no location is given it's assumed to be the current location.
-timer(duration="[duration]") sets a timer for duration written out as a string.
-weather(location="[location]") ask for the weather at a location. If there's no location string the location is assumed to be where the user is.
-other() should be used when none of the other commands apply

Here is a user request, reply with the corresponding function call, be brief.
USER_QUERY: {user}[/INST]{answer}"""

In [29]:
def _create_mistral_instruct_prompt(user, answer=""):
    return mistral_prompt.format(user=user, answer=answer)

def create_prompt(row): return _create_mistral_instruct_prompt(**row)

In [30]:
print(create_prompt(ds["train"][0]))

[INST]You are AI that converts human request into api calls. 
You have a set of functions:
-news(topic="[topic]") asks for latest headlines about a topic.
-math(question="[question]") asks a math question in python format.
-notes(action="add|list", note="[note]") lets a user take simple notes.
-openai(prompt="[prompt]") asks openai a question.
-runapp(program="[program]") runs a program locally.
-story(description=[description]) lets a user ask for a story.
-timecheck(location="[location]") ask for the time at a location. If no location is given it's assumed to be the current location.
-timer(duration="[duration]") sets a timer for duration written out as a string.
-weather(location="[location]") ask for the weather at a location. If there's no location string the location is assumed to be where the user is.
-other() should be used when none of the other commands apply

Here is a user request, reply with the corresponding function call, be brief.
USER_QUERY: I'll get this flowy.[/INST]

In [31]:
from types import SimpleNamespace

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"

# Define and parse arguments.
script_args=SimpleNamespace(
    model_name=MODEL_NAME,               # "the model name"
    dataset_artifact=DATASET_ARTIFACT,   # "the W&B artifact holding the dataset"
    log_with="wandb",                    # "use 'wandb' to log with wandb"
    learning_rate=1.4e-5,                # "the learning rate"
    batch_size=2,                        # "the batch size", 24GB -> 2, 40GB -> 4
    seq_length=400,                      # "Input sequence length"
    gradient_accumulation_steps=16,      # "simulate larger batch sizes"
    load_in_x_bits=4,                    # "load the model in 4-bit or 8-bit precision"
    use_peft=True,                       # "Whether to use PEFT or not to train adapters"
    output_dir="output",                 # "the output directory"
    peft_lora_r=64,                      # "the rank of the matrix parameter of the LoRA adapters"
    peft_lora_alpha=16,                  # "the alpha parameter of the LoRA adapters"
    logging_steps=1,                     # "How often to log"
    use_auth_token=True,                 # "Use HF auth token to access the model"
    max_steps=500,                       # "the number of training steps"
)
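
To see what these numbers imply, here is a small back-of-the-envelope calculation of the effective batch size and the amount of data seen during training. It assumes a single GPU and that every packed sequence has exactly `seq_length` tokens.

```python
batch_size = 2
gradient_accumulation_steps = 16
seq_length = 400
max_steps = 500

# Assumption: one GPU, so the per-device batch size is the whole micro-batch.
sequences_per_step = batch_size * gradient_accumulation_steps
tokens_per_step = sequences_per_step * seq_length
total_sequences = sequences_per_step * max_steps
total_tokens = tokens_per_step * max_steps

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"tokens per optimizer step:    {tokens_per_step}")
print(f"sequences seen in training:   {total_sequences}")
print(f"tokens seen in training:      {total_tokens}")
```

Gradient accumulation lets a 24GB card behave like it has a batch of 32 sequences, at the cost of 16 forward/backward passes per optimizer step.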

## Model

We can load the model with all the bells and whistles from Transformers!

In [32]:
# Step 1: Load the model
if script_args.load_in_x_bits in [4,8]:
    quantization_config = BitsAndBytesConfig(
        load_in_8bit=script_args.load_in_x_bits==8, 
        load_in_4bit=script_args.load_in_x_bits==4
    )
else:
    quantization_config = None

model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    use_auth_token=script_args.use_auth_token,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
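
A rough estimate of why quantization matters on a 24GB card: the sketch below computes the memory taken by the weights alone at different precisions. The parameter count is rounded to 7 billion as an assumption, and the estimate ignores activations, gradients, optimizer state, and quantization overhead.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: round figure for a "7B" model

estimates = {bits: weight_memory_gb(N_PARAMS, bits) for bits in (16, 8, 4)}
for bits, gb in estimates.items():
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```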

In [33]:
# Step 3: Define the training arguments
training_args = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.batch_size,
    per_device_eval_batch_size=script_args.batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    learning_rate=script_args.learning_rate,
    logging_steps=script_args.logging_steps,
    # num_train_epochs=script_args.num_train_epochs,
    max_steps=script_args.max_steps,
    report_to=script_args.log_with,
)


# Step 4: Define the LoraConfig
if script_args.use_peft:
    peft_config = LoraConfig(
        r=script_args.peft_lora_r,
        lora_alpha=script_args.peft_lora_alpha,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "v_proj"],
    )
else:
    peft_config = None
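
To get a feel for how small the adapters are, the sketch below counts the LoRA parameters for this configuration. Each adapted linear layer gets two low-rank matrices, `A` (r x in) and `B` (out x r). The layer shapes are assumptions for Mistral-7B (hidden size 4096, key/value projection width 1024, 32 layers), so check the model config before trusting the total.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


r, lora_alpha = 64, 16

# Assumed Mistral-7B shapes; verify against the model's config.json.
hidden_size, kv_size, num_layers = 4096, 1024, 32

per_layer = (
    lora_param_count(hidden_size, hidden_size, r)  # q_proj
    + lora_param_count(hidden_size, kv_size, r)    # v_proj
)
total_lora_params = per_layer * num_layers
scaling = lora_alpha / r  # the adapter output is multiplied by alpha / r

print(f"trainable LoRA parameters: {total_lora_params:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapters add roughly 27 million trainable parameters, a small fraction of the 7 billion frozen ones.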

Now we need to instantiate the `SFTTrainer` with the correct preprocessing:
- We want to pack sequences to a certain length (longer means more memory usage)
- We want to tokenize
- We also want to apply our prompt

In [34]:
script_args.seq_length

400

In [35]:
training_args.eval_steps = training_args.max_steps // 5
training_args.evaluation_strategy = "steps"

In [36]:
wandb.init(project="otto", job_type="finetune")
    
ds = load_ds_from_artifact(DATASET_ARTIFACT)
    
# Step 5: Define the Trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    args=training_args,
    max_seq_length=script_args.seq_length,
    packing=True,
    formatting_func=create_prompt,
    peft_config=peft_config,
)

[34m[1mwandb[0m: Currently logged in as: [33mcapecape[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m:   7 of 7 files downloaded.  
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using pad_token, but it is not set yet.


To be sure, let's check the dataloader:

In [37]:
dl = trainer.get_train_dataloader()
b = next(iter(dl))
trainer.tokenizer.decode(b["input_ids"][0])

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


'.\n-weather(location="[location]") ask for the weather at a location. If there\'s no location string the location is assumed to be where the user is.\n-other() should be used when none of the other commands apply\n\nHere is a user request, reply with the corresponding function call, be brief.\nUSER_QUERY: Turn the brake![/INST]other()</s><s> [INST]You are AI that converts human request into api calls. \nYou have a set of functions:\n-news(topic="[topic]") asks for latest headlines about a topic.\n-math(question="[question]") asks a math question in python format.\n-notes(action="add|list", note="[note]") lets a user take simple notes.\n-openai(prompt="[prompt]") asks openai a question.\n-runapp(program="[program]") runs a program locally.\n-story(description=[description]) lets a user ask for a story.\n-timecheck(location="[location]") ask for the time at a location. If no location is given it\'s assumed to be the current location.\n-timer(duration="[duration]") sets a timer for durat

Let's sample from the model during training. To do this, we will add a custom `WandbCallback` that has access to the `Trainer` object (and therefore to the model and tokenizer). Normally, callbacks don't have access to these, which is why we need to add it to the already instantiated `Trainer`.

In [38]:
create_test_prompt = lambda row: {"text": create_prompt({"user": row["user"], "answer": ""})}  # remove output

test_dataset = ds["test"].map(create_test_prompt)

Map:   0%|          | 0/62 [00:00<?, ? examples/s]

In [39]:
prompt = test_dataset[0]["text"]
prompt

'[INST]You are AI that converts human request into api calls. \nYou have a set of functions:\n-news(topic="[topic]") asks for latest headlines about a topic.\n-math(question="[question]") asks a math question in python format.\n-notes(action="add|list", note="[note]") lets a user take simple notes.\n-openai(prompt="[prompt]") asks openai a question.\n-runapp(program="[program]") runs a program locally.\n-story(description=[description]) lets a user ask for a story.\n-timecheck(location="[location]") ask for the time at a location. If no location is given it\'s assumed to be the current location.\n-timer(duration="[duration]") sets a timer for duration written out as a string.\n-weather(location="[location]") ask for the weather at a location. If there\'s no location string the location is assumed to be where the user is.\n-other() should be used when none of the other commands apply\n\nHere is a user request, reply with the corresponding function call, be brief.\nUSER_QUERY: Cheers![/I

In [40]:
from transformers import GenerationConfig
gen_config = GenerationConfig.from_pretrained(script_args.model_name, max_new_tokens=256)

In [41]:
generate(prompt, trainer.model, trainer.tokenizer, gen_config)

'FUNCTION_CALL: other()'

This is already pretty good.

## Let's finetune to force it to reply with the function call only!

We add the `LLMSampleCB` to log examples during training. Let's pick the examples first:

In [31]:
from datasets import Dataset
hand_picked_ds = Dataset.from_list([test_dataset[0],])
for s in test_dataset:
    if s["answer"] not in [t["answer"] for t in hand_picked_ds]:
        hand_picked_ds = hand_picked_ds.add_item(s)
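
The loop above keeps the first test example for each distinct `answer`. The same selection can be sketched in plain Python with a `seen` set, which avoids rebuilding the list of answers on every iteration. The rows below are made up for illustration, and `one_example_per_answer` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def one_example_per_answer(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first row seen for each distinct value of row['answer']."""
    seen = set()
    picked = []
    for row in rows:
        if row["answer"] not in seen:
            seen.add(row["answer"])
            picked.append(row)
    return picked


# Made-up rows, standing in for test_dataset.
rows = [
    {"user": "Cheers!", "answer": "other()"},
    {"user": "Thanks.", "answer": "other()"},
    {"user": "Set a timer for 5 minutes", "answer": 'timer(duration="5 minutes")'},
    {"user": "Bye", "answer": "other()"},
]
picked = one_example_per_answer(rows)
print(picked)
```

The resulting list could then be turned back into a dataset with `Dataset.from_list(picked)`.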

In [32]:
wandb_cb = LLMSampleCB(trainer, test_dataset=hand_picked_ds, num_samples=8, max_new_tokens=256)
trainer.add_callback(wandb_cb)

In [33]:
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 1.395 | 1.446921 |
| 200 | 1.0173 | 1.112553 |
| 300 | 0.8671 | 0.965235 |
| 400 | 0.8992 | 0.931866 |
| 500 | 0.8274 | 0.925065 |



TrainOutput(global_step=500, training_loss=1.1087499910593033, metrics={'train_runtime': 19379.6649, 'train_samples_per_second': 0.826, 'train_steps_per_second': 0.026, 'total_flos': 2.550104850432e+17, 'train_loss': 1.1087499910593033, 'epoch': 28.88})
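
Two quick readings of these numbers, as a sketch: the final validation loss can be turned into a perplexity (assuming the loss is the mean token-level cross-entropy in nats), and the reported throughput can be cross-checked against the batch settings (assuming a single GPU).

```python
import math

final_eval_loss = 0.925065   # validation loss at step 500
train_runtime_s = 19379.6649
max_steps = 500
batch_size, gradient_accumulation_steps = 2, 16

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(final_eval_loss)

# Throughput: packed sequences processed per second over the whole run.
total_sequences = max_steps * batch_size * gradient_accumulation_steps
samples_per_second = total_sequences / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(f"validation perplexity: {perplexity:.2f}")
print(f"samples per second:    {samples_per_second:.3f}")
print(f"runtime:               {runtime_hours:.1f} h")
```

The computed throughput agrees with the `train_samples_per_second` value in `TrainOutput`, which suggests the run did use an effective batch of 32 packed sequences per step.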

In [36]:
import wandb
from pathlib import Path

def save_model(trainer, output_dir):
    "Save the model to a folder inside {output_dir}, named after the W&B run id and the model name"
    model_name = f"{wandb.run.id}-{trainer.model.name_or_path}-ft".replace("/","_")
    model_folder = Path(output_dir) / model_name
    model_folder.mkdir(parents=True, exist_ok=True)
    trainer.save_model(model_folder)
    return model_name, model_folder

def create_model_artifact(model_name, model_folder):
    "Creates a Weights & Biases artifact for the saved model"
    at = wandb.Artifact(
        name=model_name,
        type="model",
        description="Finetuned model on Otto dataset",
        metadata={"peft": peft_config,
                  "quantization": quantization_config,
                  "prompt_func": mistral_prompt,
                 },
    )
    at.add_dir(model_folder)
    wandb.log_artifact(at)
    print(f"Artifact {model_name} logged.")
    
def save_and_log(trainer, output_dir):    
    model_name, model_folder = save_model(trainer, output_dir)
    create_model_artifact(model_name, model_folder)

save_and_log(trainer, script_args.output_dir)

[34m[1mwandb[0m: Adding directory to artifact (./output/yfk0igjz-meta-llama_Llama-2-7b-hf-ft)... Done. 0.3s


Artifact yfk0igjz-meta-llama_Llama-2-7b-hf-ft logged.


In [37]:
wandb.finish()

| Metric | History |
|--------|---------|
| eval/loss | █▄▂▁▁ |
| eval/runtime | █▇▅▂▁ |
| eval/samples_per_second | ▁▂▄██ |
| eval/steps_per_second | ▁▂▄▇█ |
| train/epoch | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| train/global_step | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| train/learning_rate | ███▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▁▁▁ |
| train/loss | █▇▇▆▆▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| train/total_flos | ▁ |
| train/train_loss | ▁ |

| Metric | Final value |
|--------|-------------|
| eval/loss | 0.92507 |
| eval/runtime | 4.5587 |
| eval/samples_per_second | 13.6 |
| eval/steps_per_second | 6.8 |
| train/epoch | 28.88 |
| train/global_step | 500.0 |
| train/learning_rate | 0.0 |
| train/loss | 0.8274 |
| train/total_flos | 2.550104850432e+17 |
| train/train_loss | 1.10875 |
