# Fine-tune

As an extra exercise, we're going to fine-tune the model so it performs better on the type of conversations we're interested in. To do so, we use the [LoRA](https://huggingface.co/docs/peft/conceptual_guides/lora) technique. LoRA injects a small number of new trainable weights (low-rank matrices) into specific layers of the transformer architecture while keeping the original model checkpoint frozen, so only a small fraction of the parameters is trained.
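
To make the idea concrete, here is a minimal sketch of a LoRA-style update in plain Python. It is not the PEFT implementation: the helper names (`matmul`, `lora_forward`, `lora_param_count`) and the toy matrices are made up for illustration, and the 2560 x 2560 layer is a hypothetical square layer using the hidden size of `microsoft/phi-2`.

```python
# Minimal LoRA sketch in plain Python; all names here are illustrative.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W + (alpha / r) * (x @ A @ B); W is frozen, A and B are trainable."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    """Number of trainable parameters added by one rank-r adapter."""
    return r * (d_in + d_out)

# Toy 2x2 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5], [0.5]]   # shape d_in x r
B = [[0.0, 0.0]]     # shape r x d_out, zero-initialised so training starts from the base model
x = [[2.0, 4.0]]

out = lora_forward(x, W, A, B, alpha=32, r=1)
print(out)  # identical to x @ W because B is all zeros

# Hypothetical 2560 x 2560 layer with r=32.
full = 2560 * 2560
adapter = lora_param_count(2560, 2560, r=32)
print(adapter, full, adapter / full)
```

In this hypothetical layer the adapter holds 2.5% as many parameters as the frozen weight it modifies, which is where the savings come from.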

> 📝 **Note:** this notebook is based on [this post by Geronimo](https://medium.com/@geronimo7/phinetuning-2-0-28a2be6de110); the code was taken and adapted from their explanations.

## Create an artificial dataset

To fine-tune our model, we first need a dataset. We don't have real customer-agent conversations, so we're going to create artificial data instead. Ours is a tiny toy dataset (only 4 samples) created by asking [Google Bard](https://bard.google.com/chat) for sample conversations between a `customer` and an `agent` about imaginary issues (samples were generated only for the `support` class).

> 📝 **Note:** this dataset was created for the task of text/response generation. Similarly, we could create another dataset to train the model for the classification task (`support`, `sales` and `joke`).

The result is a tiny dataset of the form (here `content` was truncated to just the first sentence):

```json
{ "data": [
    { "messages" : [
      { "role": "customer", "content": "Having trouble tracking package" },
      { "role": "agent", "content": "Apologize for inconvenience, look into it" },
      { "role": "customer", "content": "Need package for important event" },
      { "role": "agent", "content": "Understand concern, package shipped on [date], in transit" },
      { "role": "customer", "content": "Thank you for help" },
      { "role": "agent", "content": "You're welcome, let me know if you have other questions" }
    ]},
    { "messages" : [
      { "role": "customer", "content": "Worry package, shipped on [date], haven't received it" },
      { "role": "agent", "content": "Apologize for delay, investigate further" },
      { "role": "customer", "content": "Thank you for looking into it" },
      { "role": "agent", "content": "Understand concern, file missing package report, update you" },
      { "role": "customer", "content": "Thank you, relieved you're taking it seriously" },
      { "role": "agent", "content": "You're welcome, let me know if you have other questions" }
    ]}
  ]
}
```
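
Before loading the file, it can help to sanity-check its structure. The sketch below is one possible check, not part of the original notebook: `validate_dataset` is a hypothetical helper, and it assumes each record stores its turns under a `messages` key (the feature name reported when the dataset is loaded) and that turns alternate starting with the `customer`.

```python
import json

ROLES = ("customer", "agent")  # assumed turn order: customer speaks first

def validate_dataset(payload, key="messages"):
    """Return a list of problems found; an empty list means the structure looks fine."""
    problems = []
    for i, record in enumerate(payload.get("data", [])):
        turns = record.get(key)
        if not turns:
            problems.append(f"record {i}: missing or empty '{key}'")
            continue
        for j, turn in enumerate(turns):
            expected = ROLES[j % 2]
            if turn.get("role") != expected:
                problems.append(f"record {i}, turn {j}: expected role '{expected}'")
            if not str(turn.get("content", "")).strip():
                problems.append(f"record {i}, turn {j}: empty content")
    return problems

sample = json.loads("""
{ "data": [
    { "messages": [
      { "role": "customer", "content": "Having trouble tracking package" },
      { "role": "agent", "content": "Apologize for inconvenience, look into it" }
    ]}
]}
""")
print(validate_dataset(sample))  # an empty list means no problems were found
```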



In [1]:
from datasets import load_dataset

dataset = load_dataset("json", data_files="finetune_dataset.json", field="data")



Let's print a sample conversation.

In [2]:
dataset['train'][0]

{'messages': [{'content': "I'm having trouble tracking my package. It was supposed to arrive yesterday, but I don't have any tracking information.",
   'role': 'customer'},
  {'content': 'I apologize for the inconvenience, Mr. Smith. Let me look into this for you. Please provide me with your order number, AB0002345.',
   'role': 'agent'},
  {'content': "Thank you. I'm concerned because I need this package for an important event.",
   'role': 'customer'},
  {'content': "I understand your concern. According to our records, your package was shipped on Jan 4th and is currently in transit. The expected delivery date is Jan 1st. I'll keep an eye on the tracking information for you and let you know if there are any updates.",
   'role': 'agent'},
  {'content': 'Thank you for your help. I appreciate it.', 'role': 'customer'},
  {'content': "You're welcome, Mr. Smith. Please let me know if you have any other questions.",
   'role': 'agent'}]}

Now create a test split. Since the dataset has only 4 samples, `test_size=0.25` holds out one randomly chosen conversation for testing.

In [3]:
dataset = dataset["train"].train_test_split(test_size=0.25)

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 3
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 1
    })
})

## Fine-tuning the model

Now we can start fine-tuning the model. The steps are:

+ Load the model
+ Load and adapt the tokenizer
+ Create the LoRA configuration and adapt the model
+ Tokenize the dataset
+ Define the collate function
+ Finally, train the model

In [5]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

Loading checkpoint shards: 100%|██████████████████| 2/2 [00:02<00:00,  1.03s/it]


Load the tokenizer and add a `<PAD>` token. The original code added extra tokens (`<|im_start|>` and `<|im_end|>`) because it used the [ChatML](https://cobusgreyling.medium.com/the-introduction-of-chat-markup-language-chatml-is-important-for-a-number-of-reasons-5061f6fe2a85) format. We're going to stick to the simpler [chat format](https://huggingface.co/microsoft/phi-2#chat-format) used by `microsoft/phi-2`, so we only need the `<PAD>` token.

```text
Alice: I don't know why, I'm struggling to maintain focus while studying. Any suggestions?
Bob: Well, have you tried creating a study schedule and sticking to it?
Alice: Yes, I have, but it doesn't seem to help much.
Bob: Hmm, maybe you should try studying in a quiet environment, like the library.
Alice: ...
```
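
As a rough sketch of how our own conversations map onto this format, the snippet below renders a list of messages with the same `{role}: {content}\n` template the tokenization step uses. `render_chat` and its `next_role` argument are hypothetical helpers added for illustration; appending the next speaker's name is one plausible way to prompt the model for a reply.

```python
TEMPLATE = "{role}: {content}\n"

def render_chat(messages, next_role=None):
    """Join messages into the 'Name: text' chat format, one turn per line."""
    text = "".join(TEMPLATE.format(**m) for m in messages)
    if next_role is not None:
        # Leave the next speaker's turn open so a model could complete it.
        text += f"{next_role}:"
    return text

messages = [
    {"role": "customer", "content": "I'm having trouble tracking my package."},
    {"role": "agent", "content": "Let me look into this for you."},
    {"role": "customer", "content": "Thank you."},
]
print(render_chat(messages, next_role="agent"))
```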

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", use_fast=False)   
tokenizer.add_tokens(["<PAD>"])
tokenizer.pad_token = "<PAD>"

model.resize_token_embeddings(
    new_num_tokens=len(tokenizer),
    pad_to_multiple_of=64)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Embedding(50304, 2560)
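
The embedding size of 50304 comes from `pad_to_multiple_of=64`, which rounds the vocabulary size up to the next multiple of 64. The arithmetic can be sketched as follows; `round_up_to_multiple` is a hypothetical helper, and the tokenizer length of 50296 is an assumed value used only for illustration (any length from 50241 to 50304 rounds to the same result).

```python
def round_up_to_multiple(n, multiple):
    """Round n up to the nearest multiple of `multiple` using ceiling division."""
    return -(-n // multiple) * multiple

assumed_vocab_size = 50296  # assumption: len(tokenizer) after adding <PAD>
print(round_up_to_multiple(assumed_vocab_size, 64))
```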

We create the LoRA adapters and add them to the model. Fortunately, we only need to define the `LoraConfig`; most of the work is done by the [PEFT](https://huggingface.co/docs/peft/v0.7.1/en/conceptual_guides/lora#lora) library.

In [7]:
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules = ['Wqkv','out_proj'],
    lora_dropout=0.1,
    bias="none",
    modules_to_save = ["lm_head", "embed_tokens"],
    task_type="CAUSAL_LM"
)

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing = False)
model = get_peft_model(model, lora_config)
model.config.use_cache = False

Now we can tokenize our dataset with `dataset.map()`. In addition to `input_ids` and `attention_mask`, we also produce `labels` to tell the model what the expected output is. We only want the model to learn the `agent` messages, so the tokens of the `customer` messages are labeled with `IGNORE_INDEX` and excluded from the loss.

In [8]:
import os
from functools import partial


IGNORE_INDEX = -100
template = "{role}: {content}\n"

def tokenize(example, max_length):
    """Render one conversation in the chat format and build per-token labels."""
    input_ids, attention_mask, labels = [], [], []

    for msg in example["messages"]:
        chat_msg = template.format(**msg)
        msg_tokenized = tokenizer(chat_msg, truncation=False, add_special_tokens=False)

        input_ids += msg_tokenized["input_ids"]
        attention_mask += msg_tokenized["attention_mask"]
        # Customer tokens are masked out of the loss; agent tokens are the targets.
        labels += [IGNORE_INDEX]*len(msg_tokenized["input_ids"]) \
                    if msg["role"] == "customer" \
                    else msg_tokenized["input_ids"]
    # Truncate the whole conversation to at most max_length tokens.
    return {
        "input_ids": input_ids[:max_length], 
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }

dataset_tokenized = dataset.map(
    partial(tokenize, max_length=1024), 
    batched = False,
    num_proc = os.cpu_count(),
    remove_columns = dataset["train"].column_names
)

num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
Map (num_proc=3): 100%|████████████████████| 3/3 [00:00<00:00, 20.67 examples/s]
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
Map: 100%|████████████████████████████████| 1/1 [00:00<00:00, 162.97 examples/s]
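
To see the effect of the label masking without loading the real tokenizer, here is a small stand-alone sketch. `toy_tokenize` is a stand-in that assigns one fake id per word (the word's length), so the ids are meaningless; only the pattern of `IGNORE_INDEX` entries matters.

```python
IGNORE_INDEX = -100
TEMPLATE = "{role}: {content}\n"

def toy_tokenize(text):
    """Stand-in tokenizer: one fake id (the word length) per whitespace-separated word."""
    return [len(word) for word in text.split()]

def build_labels(messages):
    """Concatenate all turns; mask customer tokens, keep agent tokens as targets."""
    input_ids, labels = [], []
    for msg in messages:
        ids = toy_tokenize(TEMPLATE.format(**msg))
        input_ids += ids
        labels += [IGNORE_INDEX] * len(ids) if msg["role"] == "customer" else ids
    return input_ids, labels

messages = [
    {"role": "customer", "content": "Where is my package"},
    {"role": "agent", "content": "Let me check"},
]
ids, labels = build_labels(messages)
print(ids)     # [9, 5, 2, 2, 7, 6, 3, 2, 5]
print(labels)  # [-100, -100, -100, -100, -100, 6, 3, 2, 5]
```

The two lists always have the same length; positions holding `-100` contribute nothing to the loss, so the model is only trained to reproduce the agent's reply.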


We define the collate function. It is in charge of putting samples together into batches. It also pads the inputs so they all have the same length.

In [9]:
def collate(elements):
    tokens = [e["input_ids"] for e in elements]
    tokens_maxlen = max([len(t) for t in tokens])

    for i, sample in enumerate(elements):
        input_ids = sample["input_ids"]
        labels = sample["labels"]
        attention_mask = sample["attention_mask"]

        pad_len = tokens_maxlen-len(input_ids)
        input_ids.extend( pad_len * [tokenizer.pad_token_id] )
        labels.extend( pad_len * [IGNORE_INDEX] )
        attention_mask.extend( pad_len * [0] )
    batch={
        "input_ids": torch.tensor( [e["input_ids"] for e in elements] ),
        "labels": torch.tensor( [e["labels"] for e in elements] ),
        "attention_mask": torch.tensor( [e["attention_mask"] for e in elements] ),
    }
    return batch
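
The padding logic can be checked on toy data without `torch`. The sketch below is a simplified variant under stated assumptions: `pad_batch` is a hypothetical helper, `pad_id=0` is an arbitrary placeholder for `tokenizer.pad_token_id`, and it returns plain lists and builds new ones instead of extending the samples in place.

```python
def pad_batch(samples, pad_id, ignore_index=-100):
    """Right-pad every sample to the length of the longest one in the batch."""
    max_len = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for s in samples:
        pad_len = max_len - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_id] * pad_len)
        # Padded positions are ignored by the loss and hidden from attention.
        batch["labels"].append(s["labels"] + [ignore_index] * pad_len)
        batch["attention_mask"].append(s["attention_mask"] + [0] * pad_len)
    return batch

samples = [
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7], "attention_mask": [1, 1, 1]},
    {"input_ids": [8], "labels": [8], "attention_mask": [1]},
]
batch = pad_batch(samples, pad_id=0)
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(batch["labels"])          # [[-100, 6, 7], [8, -100, -100]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```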

## Train

Finally we can train the model. As this is such a simple dataset, the training process ends very quickly. 

In [10]:
from transformers import TrainingArguments, Trainer

bs=1
ga_steps=16  # gradient acc. steps
epochs=5
lr=0.00002

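# With 3 samples and bs*ga_steps = 16 the integer division yields 0, so clamp to at least 1.
steps_per_epoch=max(1, len(dataset_tokenized["train"])//(bs*ga_steps))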

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    logging_steps=1,
    eval_steps=steps_per_epoch,
    save_steps=steps_per_epoch,
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",
    learning_rate=lr,
    group_by_length=False,
    bf16=True,
    ddp_find_unused_parameters=False,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
)

trainer.train()

Step,Training Loss,Validation Loss
1,0.2738,1.441532
2,0.2713,1.42066
3,0.2699,1.416498
4,0.2675,1.397182
5,0.2511,1.390604


TrainOutput(global_step=5, training_loss=0.26674823760986327, metrics={'train_runtime': 8.7028, 'train_samples_per_second': 1.724, 'train_steps_per_second': 0.575, 'total_flos': 45798332398080.0, 'train_loss': 0.26674823760986327, 'epoch': 5.0})
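
The `global_step=5` in the output follows from the batch settings. The sketch below reproduces the arithmetic as we understand it; `optimizer_steps` is a hypothetical helper that approximates how the `Trainer` counts optimizer updates (at least one per epoch), not the library's actual code.

```python
import math

def optimizer_steps(num_samples, batch_size, grad_accum, epochs):
    """Approximate number of optimizer updates for a whole training run."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    # Assumption: an epoch with fewer batches than grad_accum still yields one update.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)

print(3 // (1 * 16))                                                       # 0
print(optimizer_steps(num_samples=3, batch_size=1, grad_accum=16, epochs=5))  # 5
```

With only 3 training samples, each epoch is shorter than one full accumulation window, so every epoch produces a single update and the 5 epochs give the 5 steps shown in the table.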

## Discussion

However simple, this toy example contains all the pieces of code needed to fine-tune the `microsoft/phi-2` model on a custom dataset using the [LoRA](https://huggingface.co/docs/peft/conceptual_guides/lora) technique. A practical use case would require a real dataset with actual chat data, but this script can be easily adapted to train on any chat data, provided it's in the right format (see [section above](#create-an-artificial-dataset)).