# Mistral 7B parameter-efficient fine-tuning for dialogue summarization with LoRA


## Introduction

In this experiment I implemented parameter-efficient fine-tuning (PEFT) of the `Mistral-7B-Instruct-v0.2` base model for the dialogue summarization task, using the LoRA technique and the great `samsum` dataset (kudos to Samsung R&D Institute Poland).

Summarization is, traditionally speaking, a seq2seq task: it takes one sequence of tokens and transforms it into another. This group of problems is usually handled with an encoder-decoder architecture. Mistral 7B, as a decoder-only model, is specialized in autoregressive text generation rather than seq2seq transformation.

In this experiment I wanted to find out how well such a decoder-only model can learn a non-orthodox task like summarization. Spoiler alert: the fine-tuned model performs really well. After fine-tuning the Mistral base model on a fairly small number of dialogue summarization examples, it learned the task pretty well and showed great improvement both in terms of the ROUGE metric and in the human-level clarity of the generated summaries.

If you are mostly interested in the final model performance, scroll down to Section 3 of this notebook to see its full evaluation.

You can access the code directly in Colab (https://colab.research.google.com/drive/1EuAldJmYSHi9YwjCGQ4u8ZYpnf7dn1XC?usp=sharing) or clone it from the GitHub repository (https://github.com/msznajder/mistral-7b-samsum-dialogue-summary-finetune).


## Setup

As the base model I have used `Mistral-7B-Instruct-v0.2` model.

For fine-tuning I have used `samsum` dialogue summarization dataset (I used over 7k examples out of 14k available).

For GPU memory efficiency I have used LoRA technique for training.

I ran the training notebook on a Google Colab Pro A100 GPU. Model training ran for around 6h. Peak GPU RAM usage reached a little over 30 GB. Running this notebook on a smaller GPU will probably lead to the ugly out-of-memory error we all hate so much.

I evaluated the fine-tuned model using the ROUGE metric and by analyzing the actual generated summaries. I also compared the fine-tuned model's performance with that of the base model.

## Results summary

TL;DR: The fine-tuned model learned to generate pretty good and consistent summaries. You can see it in the ROUGE-L metric: 0.43 for the fine-tuned model vs. 0.22 for the original base model. Even more importantly, I think, when you read the actual generated summaries it is clear how much the fine-tuned model gained on this task. You can check them out right at the bottom of this notebook.


## Plan

In this notebook we will follow these steps:
1. Data preprocessing
2. Model fine-tuning
3. Evaluation
4. Summary and next steps

In [None]:
!pip install -q rouge_score
!pip install -q datasets
!pip install -q transformers
!pip install -q evaluate
!pip install -q accelerate
!pip install -q -i https://pypi.org/simple/ bitsandbytes
!pip install -q peft
!pip install -q trl
!pip install -q tqdm
!pip install -q pandas
!pip install -q huggingface_hub


In [None]:
import torch
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, TrainingArguments, Trainer, pipeline, BitsAndBytesConfig, DataCollatorForLanguageModeling, GenerationConfig
from peft import LoraConfig, get_peft_model, TaskType, PeftModel, PeftConfig, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
from trl import SFTTrainer
import evaluate

import pandas as pd

import time
from tqdm import tqdm

In [None]:
DATASET = "samsum"
MODEL_CHECKPOINT = 'mistralai/Mistral-7B-Instruct-v0.2'

## 1. Data preprocessing

We start by loading our `samsum` dialogue summarization training dataset. For training I decided to use half of the available data: over 7k examples, which seems enough for this kind of fine-tuning and saves us some of the pain of dealing with limited GPU memory. I can use the rest of the data for further fine-tuning if it is needed in the future.

In [None]:
data = load_dataset(DATASET)


In [None]:
# We will use half of the training data available in the dataset
data["train"] = data["train"].select(range(7366))

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 7366
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

Each data record contains:
* `dialogue` with one line per utterance, prefixed with the speaker's name
* `summary` which is a rather short description of the dialogue contents
* `id` which we ignore here

In [None]:
print(data["train"][0]["dialogue"])

Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)


In [None]:
print(data["train"][0]["summary"])

Amanda baked cookies and will bring Jerry some tomorrow.


Because we use the instruct version of the Mistral model, we need to convert the dialogue data to its predefined prompt format.
```
<s>[INST] What is your favourite condiment? [/INST]
Response.</s>
```
Usually we would just leave the response part empty (and we will do that when preparing the data for generating summaries with the fine-tuned model), but here we prepare training examples. That means the training prompts have to contain both the question/task and the response.

We do all of that with simple preprocessing function and the magical map function.

In [None]:
def preprocess_data(example):
  dialogue = example["dialogue"]
  summary = example["summary"]
  prompt = f"""<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
{dialogue}[/INST]
{summary}
</s>"""
  return {"text": prompt}

In [None]:
data_preprocessed = data.map(preprocess_data,remove_columns=["id", "dialogue", "summary"])


So we have transformed all the training examples and merged both dialogues and label summaries into a training-ready prompt-response format. We also removed the original fields and kept just the one we will use in training.
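The training prompt and the inference prompt differ only in whether the reference summary is appended after `[/INST]`. A minimal sketch of that idea is shown here; `build_prompt` is a hypothetical helper that is not part of this notebook, it mirrors the template used in `preprocess_data`, and for the inference case it assumes the response part is simply left empty.

```python
INSTRUCTION = (
    "You are a helpful assistant. "
    "Your task is to generate following dialogue summarization:"
)


def build_prompt(dialogue, summary=None):
    """Hypothetical helper: build a Mistral-instruct style prompt.

    With a summary it returns a full training example; without one it
    returns a prompt that leaves the response part empty.
    """
    prompt = f"<s>[INST] {INSTRUCTION}\n{dialogue}[/INST]"
    if summary is not None:
        # Training example: the expected response plus the end-of-sequence marker
        prompt += f"\n{summary}\n</s>"
    return prompt


dialogue = "Amanda: I baked  cookies. Do you want some?\nJerry: Sure!"
train_prompt = build_prompt(dialogue, "Amanda baked cookies.")
infer_prompt = build_prompt(dialogue)

print(train_prompt)
print("---")
print(infer_prompt)
```

Keeping both variants in one function makes it harder for the training and inference templates to drift apart.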

In [None]:
print(data_preprocessed["train"][0]["text"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)[/INST]
Amanda baked cookies and will bring Jerry some tomorrow.
</s>


In [None]:
data_preprocessed.set_format(type="torch")

We also load our base Mistral 7B model tokenizer.

The Mistral tokenizer does not define a padding token (`pad_token`), so we need to set it after the tokenizer is loaded.

In [None]:
MODEL_CHECKPOINT

'mistralai/Mistral-7B-Instruct-v0.2'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
tokenizer.pad_token = tokenizer.unk_token


We do not apply the tokenizer just yet. Of course we could, but because we will be using `SFTTrainer` later on, it is more elegant (and efficient) to let the trainer apply tokenization, truncation and padding on the fly during training.
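To illustrate what truncation and padding mean for a single batch, here is a small pure-Python sketch. It is not what `SFTTrainer` does internally: `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption used only for illustration.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Three made-up token id sequences of different lengths
batch = [[1, 733, 16289], [1, 733], [1, 733, 16289, 28793, 995]]
input_ids, attention_mask = pad_batch(batch, pad_id=0, max_length=4)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Padding per batch rather than to a global maximum means short dialogues do not pay the cost of the longest dialogue in the dataset.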

## 2. Model fine-tuning

We now move to the steps related to fine-tuning the base model. There are four steps to cover. First, we load and configure the Mistral base model with 4-bit quantization. Next, we prepare the LoRA configuration. Then we configure and actually run the LoRA-based training. Finally, we merge the LoRA adapter with the original Mistral base model and push the result to the Hugging Face hub.


### 2.1 Prepare base model in 4bit quantization

We first prepare the quantization configuration and load the base Mistral model in 4-bit precision. Loading the 4-bit version of the model saves memory at the cost of somewhat longer training time. Here, however, we are more memory-constrained than time-constrained. We will use the `bitsandbytes` library for this purpose.
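A rough back-of-the-envelope estimate shows why 4-bit loading matters. The sketch below assumes about 7.24 billion parameters for Mistral 7B and counts the weights only: it ignores quantization constants, activations, gradients and optimizer state, so real usage will be higher.

```python
N_PARAMS = 7.24e9  # assumed approximate parameter count of Mistral 7B

# Storage cost of a single weight in each format
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "nf4 (4-bit)": 0.5}


def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed to hold the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(N_PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}
for name, gb in estimates.items():
    print(f"{name:>12}: ~{gb:.1f} GB")
```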

In [None]:
compute_dtype = getattr(torch, "float16")
use_4bit = True

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

We then load our Mistral 7B base model as `AutoModelForCausalLM` - which it actually is - and pass the `BitsAndBytesConfig` object when initializing the model object.

In [None]:
MODEL_CHECKPOINT

'mistralai/Mistral-7B-Instruct-v0.2'

In [None]:
device_map = "auto"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_CHECKPOINT,
    quantization_config=bnb_config,  # loading in 4-bit quantization
    device_map=device_map,
)


We also need to update the model with the tokenizer's `pad_token_id` defined earlier, so that the tokenizer and the model stay synchronized in terms of the special tokens they use.

In [None]:
# Configure the pad token in the base Mistral model
model.config.pad_token_id = tokenizer.pad_token_id

### 2.2 Prepare the fine-tuning model LoRA config

And we finally arrive at the preparation of our LoRA fine-tuning configuration.

We will use PEFT (parameter-efficient fine-tuning), a family of techniques that greatly reduce the GPU memory and computation cost of fine-tuning. They do that by freezing most (or all) of the original model weights and training only a small set of parameters.

Specifically, we will use the LoRA (Low-Rank Adaptation) technique, which keeps the original weight matrices frozen and learns their updates as products of small low-rank matrices.

We do not perform full gradient descent on all parameters, but it turns out that the performance loss is usually negligible.

With LoRA the number of trainable parameters is typically around 1-2% of the base model size. In other words, while fine-tuning Mistral 7B for dialogue summarization the original base model weights stay unchanged, and the trained adapter weights amount to only about 1% of the base model parameter count. Even though it sounds brutal, it works. That is the magic of LoRA and similar techniques.

On the other hand the gains are huge: we can fine-tune a large model on a single, not very specialized GPU instead of some very expensive cluster.


Let's now move to specifics of the LoRA configuration we use in our model fine-tuning process.

`r` and `lora_alpha` are key parameters in LoRA configuration.

`r` is the rank of the LoRA update matrices. The lower the rank `r`, the fewer trainable parameters we add, which reduces the memory cost of training but can also limit model expressiveness.

`lora_alpha` is a scaling factor for the LoRA weights: the low-rank update is multiplied by `lora_alpha / r` before it is added to the output of the frozen layer. High values put more emphasis on the LoRA weights and low values put more emphasis on the base model weights.

A common rule of thumb is `lora_alpha = 2 * r`. Typical values of `r` are powers of two such as 8, 16, 32, 64, 128, 256 or 512. We select `r=32` and `lora_alpha=64`, which keeps the number of trainable parameters, and therefore memory usage, low.
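To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It is a toy illustration rather than the PEFT implementation: the matrices are tiny, made up, and use rank 1, and `matvec` and `lora_forward` are hypothetical helpers. The frozen weight `W` is never modified; only `A` and `B` would be trained.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, r, lora_alpha):
    """Output of a frozen linear layer plus the scaled low-rank update.

    W: (out x in) frozen base weight
    A: (r x in) and B: (out x r) trainable LoRA matrices
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # B @ (A @ x), never forms the full out x in matrix
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


x = [1.0, 2.0, 3.0]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2 x 3 base weight
A = [[1.0, 1.0, 1.0]]                   # rank r = 1
B_zero = [[0.0], [0.0]]                 # B starts at zero, so the adapter is a no-op
B_trained = [[0.5], [0.0]]              # made-up values standing in for a trained B

print(lora_forward(x, W, A, B_zero, r=1, lora_alpha=2))
print(lora_forward(x, W, A, B_trained, r=1, lora_alpha=2))
```

With `lora_alpha = 2 * r` the scale is always 2, which is why changing `r` under that rule of thumb does not change how strongly the update is weighted.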

`lora_dropout` controls the dropout rate applied to the LoRA layers during fine-tuning. We set it to 0.1.

We leave `bias` at its default value `"none"`.

We set the `task_type` parameter based on the type of model we fine-tune. Because the model we train here is a causal language model, we set it to `CAUSAL_LM`.

Finally, `target_modules` is the crucial parameter specifying which sets of base model weights we want to adapt with LoRA. The list is model specific and is defined by the internal structure of the model. The QLoRA paper (https://arxiv.org/pdf/2305.14314.pdf) says this about LoRA settings:

"We find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers are required to match full finetuning performance".

You can check the components we need to list as trainable weights by printing out the Mistral model object.

In [None]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )

So for Mistral 7B we select the weights of all linear layers of the model for fine-tuning:
* `q_proj`,
* `k_proj`,
* `v_proj`,
* `o_proj`,
* `gate_proj`,
* `up_proj`,
* `down_proj`,
* `lm_head`
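With the target modules fixed, the number of trainable LoRA parameters can be estimated from the layer shapes printed above. The sketch below is only an estimate under stated assumptions: the `lm_head` shape (4096 to 32000) is assumed from the embedding size, and the base model size of about 7.24 billion parameters is an approximation.

```python
R = 32          # LoRA rank used in this notebook
N_LAYERS = 32   # number of decoder layers in the printed model

# (in_features, out_features) of the linear layers in one decoder layer
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
LM_HEAD_SHAPE = (4096, 32000)  # assumption: hidden size -> vocabulary size
BASE_PARAMS = 7.24e9           # assumed approximate size of the base model


def lora_params(in_features, out_features, r):
    """A is (r x in_features) and B is (out_features x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in LAYER_SHAPES.values())
total = per_layer * N_LAYERS + lora_params(*LM_HEAD_SHAPE, R)
fraction = total / BASE_PARAMS

print(f"{total:,} trainable LoRA parameters (~{fraction:.2%} of the base model)")
```

Under these assumptions the adapter adds roughly 85 million parameters, which is in line with the "about 1%" figure usually quoted for LoRA.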

With all of that we land on our final LoRA configuration we will use in our model training.

In [None]:
peft_config = LoraConfig(
        r=32,
        lora_alpha=64,
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
            "lm_head",
        ],
    )

### 2.3 Model training

Before we get to the model training we need to specify quite a few training settings and pack them into a `TrainingArguments` object. In our case they mostly reflect the fact that we fine-tune the model using LoRA and want to optimize for efficient memory usage.

In [None]:
from transformers import TrainingArguments

run_name = "peft-dialogue-summary-training"
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=3,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit", # memory-efficient version of AdamW
    save_steps=500,
    logging_steps=500,
    learning_rate=3e-4,
    fp16=True,
    evaluation_strategy="steps",
    max_grad_norm=0.3,
    num_train_epochs=3,
    weight_decay=0.001,
    warmup_steps=50,
    lr_scheduler_type="linear",
    run_name=run_name
)

To train the model we will use `SFTTrainer` instead of the regular `Trainer`. It is designed for supervised fine-tuning of pre-trained models on smaller, task-specific datasets. It also helps with memory usage by integrating PEFT techniques (such as the LoRA we use here) directly in the trainer: we simply pass the `LoraConfig` object to `SFTTrainer` through the `peft_config` parameter.

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=data_preprocessed["train"],
    eval_dataset=data_preprocessed["validation"], # remove if you have low VRAM and are getting OOM errors
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

The next step is the model training itself. During training we will report only the training and validation loss instead of some additional metric, because we do not want to inflate memory usage here.

In [None]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


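| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 2.0063 | 2.141908 |
| 1000 | 2.1025 | 2.289965 |
| 1500 | 2.2834 | 2.483229 |
| 2000 | 2.3843 | 2.552625 |
| 2500 | 2.3842 | 2.520403 |
| 3000 | 2.2129 | 2.462871 |
| 3500 | 2.1996 | 2.406763 |
| 4000 | 2.1596 | 2.352579 |
| 4500 | 2.1044 | 2.279514 |
| 5000 | 1.9745 | 2.256618 |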




TrainOutput(global_step=7365, training_loss=2.0146010016229634, metrics={'train_runtime': 13722.3003, 'train_samples_per_second': 1.61, 'train_steps_per_second': 0.537, 'total_flos': 2.16570487977984e+17, 'train_loss': 2.0146010016229634, 'epoch': 3.0})

This specific training run on a Google Colab A100 GPU took around 6h. Peak GPU memory usage was a little over 30 GB. Looking at the validation loss, we can see the values were still decreasing towards the end, suggesting that the training could probably use one or two additional epochs. In this iteration of the experiment we will keep the model as it is to see how it performs in its current form.
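The `global_step=7365` value in the training output can be reproduced from the training arguments and the dataset size. The sketch below assumes a single GPU and assumes that leftover micro-batches which do not fill a whole accumulation window are not counted as an update; the floor division was chosen because it matches the logged step count, and other library versions may round differently.

```python
num_examples = 7366                # training examples after selecting half of the split
per_device_train_batch_size = 1
gradient_accumulation_steps = 3
num_train_epochs = 3

# Micro-batches seen in one epoch (single GPU assumed)
batches_per_epoch = num_examples // per_device_train_batch_size

# Optimizer updates per epoch (assumption: incomplete accumulation windows are dropped)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

# Number of examples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

print(f"updates per epoch:    {updates_per_epoch}")
print(f"total updates:        {total_updates}")
print(f"effective batch size: {effective_batch_size}")
```

This also explains the logging cadence: with `logging_steps=500`, each logged row covers 1500 training examples.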

Let's save the final fine-tuned model checkpoint for later use.

In [None]:
peft_model_path="./peft-dialogue-summary-mistral-checkpoint-local"

trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



('./peft-dialogue-summary-mistral-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-mistral-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-mistral-checkpoint-local/tokenizer.model',
 './peft-dialogue-summary-mistral-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-mistral-checkpoint-local/tokenizer.json')

### 2.4 Models merging and saving

The fine-tuned model is now trained but not yet ready to use for inference.

Because we used LoRA for fine-tuning, we de facto trained only a small set of adapter weights rather than the full model. We now need to merge these fine-tuned adapter parameters with the original `Mistral 7B` base model.

We can call the fine-tuned model an adapter model and the raw model a base model. In this setting as the next steps we need to:
* load the base model but with default - not 4-bit - precision,
* load and set the PEFT adapter on top of base model,
* merge the PEFT adapter with the base model,
* save the merged model checkpoint along with the used tokenizer.

In [None]:
model = AutoModelForCausalLM.from_pretrained(MODEL_CHECKPOINT)

model = PeftModel.from_pretrained(model, peft_model_path)

model = model.merge_and_unload()

model_dir = "./models/merged-peft-dialogue-summary-mistral/"
model.save_pretrained(model_dir, safe_serialization=True)
tokenizer.save_pretrained(model_dir)


('./models/merged-peft-dialogue-summary-mistral/tokenizer_config.json',
 './models/merged-peft-dialogue-summary-mistral/special_tokens_map.json',
 './models/merged-peft-dialogue-summary-mistral/tokenizer.model',
 './models/merged-peft-dialogue-summary-mistral/added_tokens.json',
 './models/merged-peft-dialogue-summary-mistral/tokenizer.json')

And now finally the fine-tuned model is both trained and ready to use for inference.

The final step we will take before model evaluation is sending it to the Hugging Face model hub for easy future use.

In [None]:
!huggingface-cli login



    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
model.push_to_hub("Mistral-7B-Instruct-v0.2-Samsum-DialSum-SFTT")


CommitInfo(commit_url='https://huggingface.co/msznajder/Mistral-7B-Instruct-v0.2-Samsum-DialSum-SFTT2/commit/f7b338382ebb1b897d60e38a1a2bcc76a047c44b', commit_message='Upload MistralForCausalLM', commit_description='', oid='f7b338382ebb1b897d60e38a1a2bcc76a047c44b', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub("Mistral-7B-Instruct-v0.2-Samsum-DialSum-SFTT")


CommitInfo(commit_url='https://huggingface.co/msznajder/Mistral-7B-Instruct-v0.2-Samsum-DialSum-SFTT2/commit/44def9f2c14b57573d0439b7ff03931e2963c83d', commit_message='Upload tokenizer', commit_description='', oid='44def9f2c14b57573d0439b7ff03931e2963c83d', pr_url=None, pr_revision=None, pr_num=None)

We are now fully ready to move on to the evaluation of the fine-tuned, Mistral 7B based dialogue summarization model.

## 3. Evaluation

We will now evaluate our fine-tuned dialogue summarization model against the original raw Mistral 7B model. We will compare them using both various flavours of the ROUGE metric (a metric dedicated to the summarization task) and manual inspection of summaries generated by both models.

Before running the evaluation it is best to restart the runtime to clear the memory cluttered in the model training phase. After the restart, just go to the top and run the two top cells with installs and imports.

After that we start below by loading the test dataset and the models we need for evaluation.

### 3.1 Data preprocessing and inference

We only need to load the test split of the `samsum` dataset for the evaluation; it contains 819 examples. As a reminder, we used over 7k examples for training.



In [None]:
data = load_dataset("samsum", split="test")

In [None]:
data

Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 819
})

In [None]:
print(data[0]["dialogue"])

Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye


In [None]:
print(data[0]["summary"])

Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


Again, we need to reformat the contents of the dataset to follow the Mistral prompt format. This time, however, because we will use these prompts for inference rather than training, the prompts will only contain the dialogue part. The summary part will have to be generated by the model.

In [None]:
def preprocess_data(example):
  dialogue = example["dialogue"]
  prompt = f"""<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
{dialogue}[/INST]
</s>"""
  return {"dialogue": prompt}


In [None]:
data_preprocessed = data.map(preprocess_data, batched=False, remove_columns=["id"])

In [None]:
data_preprocessed

Dataset({
    features: ['dialogue', 'summary'],
    num_rows: 819
})

In [None]:
print(data_preprocessed[0]["dialogue"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye[/INST]
</s>


In [None]:
print(data_preprocessed[0]["summary"])

Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


We now finally load the models from the Hugging Face hub along with their respective tokenizers.

In fact, in order not to hit the out-of-memory error on the A100 GPU, I had to load the fine-tuned model and run its evaluation separately, restart the runtime, and then run the code responsible for the base model evaluation. I kept the code flow in a more natural linear format for easier interpretation of the results.

In [None]:
# Fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("msznajder/Mistral-7B-Instruct-v0.2-Samsum-DialSum-SFTT")
model = AutoModelForCausalLM.from_pretrained("msznajder/Mistral-7B-Instruct-v0.2-Samsum-DialSum-SFTT")
model.generation_config.pad_token_id = model.generation_config.eos_token_id


In [None]:
# Base Mistral-7B-Instruct-v0.2 model
raw_tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
raw_model = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
raw_model.generation_config.pad_token_id = raw_model.generation_config.eos_token_id
raw_tokenizer.pad_token = raw_tokenizer.unk_token


### 3.2 ROUGE metric evaluation

In order to calculate ROUGE metric we first need to generate the summaries for all the example dialogues from the test dataset.

To generate the summaries we will use sampling rather than search methods, and we will set the `temperature` close to 0 because we just want concise summaries without introducing additional variability.
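Why does a temperature close to 0 remove variability? Sampling draws the next token from a softmax over the logits divided by the temperature, and as the temperature shrinks that distribution collapses onto the most likely token. The sketch below illustrates this with made-up logits; `softmax_with_temperature` is a hypothetical helper, not a `transformers` function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Numerically stable softmax over logits / temperature."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]  # subtract the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.1]  # made-up next-token scores

probs_default = softmax_with_temperature(logits, 1.0)
probs_low = softmax_with_temperature(logits, 0.0001)

print("temperature=1.0   ", [round(p, 3) for p in probs_default])
print("temperature=0.0001", [round(p, 3) for p in probs_low])
```

In practice this makes sampling at a very low temperature behave like greedy decoding, which `generate` also offers directly through `do_sample=False`.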

In [None]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
def summarize(tokenizer, model, dialogue):
    inputs = tokenizer(dialogue, return_tensors="pt").to(DEVICE)
    inputs_length = len(inputs["input_ids"][0])
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.0001)
    return tokenizer.decode(outputs[0][inputs_length:], skip_special_tokens=True)

In [None]:
# A plain for loop avoids the out-of-memory error that `map` caused here
finetuned_generated_summaries = []
for idx, row in enumerate(data_preprocessed["dialogue"]):
  finetuned_generated_summary = summarize(tokenizer, model.to(DEVICE), row).strip()
  finetuned_generated_summaries.append(finetuned_generated_summary)

In [None]:
data_preprocessed = data_preprocessed.add_column("finetuned_generated_summary", finetuned_generated_summaries)

In [None]:
# A plain for loop avoids the out-of-memory error that `map` caused here
raw_generated_summaries = []
for idx, row in enumerate(data_preprocessed["dialogue"]):
  raw_generated_summary = summarize(raw_tokenizer, raw_model.to(DEVICE), row).strip()
  raw_generated_summaries.append(raw_generated_summary)

In [None]:
data_preprocessed = data_preprocessed.add_column("raw_generated_summary", raw_generated_summaries)

As a sanity check let's print one example of a full dialogue, its ground-truth summary and the summaries generated by both models.

In [None]:
print(data_preprocessed[10]["dialogue"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Wanda: Let's make a party!
Gina: Why?
Wanda: beacuse. I want some fun!
Gina: ok, what do u need?
Wanda: 1st I need too make a list
Gina: noted and then?
Wanda: well, could u take yours father car and go do groceries with me?
Gina: don't know if he'll agree
Wanda: I know, but u can ask :)
Gina: I'll try but theres no promisess
Wanda: I know, u r the best!
Gina: When u wanna go
Wanda: Friday?
Gina: ok, I'll ask[/INST]
</s>


In [None]:
print(data_preprocessed[10]["summary"])

Wanda wants to throw a party. She asks Gina to borrow her father's car and go do groceries together. They set the date for Friday. 


In [None]:
print(data_preprocessed[10]["finetuned_generated_summary"])

Wanda wants to make a party. Gina will ask her father if he can go grocery shopping with Wanda. They want to meet on Friday.


In [None]:
print(data_preprocessed[10]["raw_generated_summary"])

Wanda suggested making a party and asked Gina for help. Gina inquired about the reason and Wanda explained that she just wanted some fun. Wanda then requested that they make a list of things needed for the party. Gina agreed and Wanda asked if Gina could borrow her father's car to go grocery shopping together. Gina expressed uncertainty about getting permission, but Wanda encouraged her to ask. Gina agreed to try and mentioned that there was no promise it would be granted. Wanda expressed confidence in Gina and asked if they could go shopping on Friday. Gina agreed to ask about the car and plan accordingly for the party.


All looks good. We are now finally ready to calculate the ROUGE metric for both models.

In [None]:
rouge = evaluate.load('rouge')

In [None]:
# Fine-tuned model ROUGE
model_rouge = rouge.compute(
    predictions=data_preprocessed["finetuned_generated_summary"],
    references=data_preprocessed["summary"][0:len(data_preprocessed["finetuned_generated_summary"])],
    use_aggregator=True,
    use_stemmer=True,
)
model_rouge

{'rouge1': 0.5157651537666942,
 'rouge2': 0.2650530320155057,
 'rougeL': 0.4295565331965113,
 'rougeLsum': 0.4294928872615915}

In [None]:
# Base model ROUGE
model_rouge = rouge.compute(
    predictions=data_preprocessed["raw_generated_summary"],
    references=data_preprocessed["summary"][0:len(data_preprocessed["raw_generated_summary"])],
    use_aggregator=True,
    use_stemmer=True,
)
model_rouge

{'rouge1': 0.2916679198398079,
 'rouge2': 0.09951049848424319,
 'rougeL': 0.21840199659647197,
 'rougeLsum': 0.22900310959298054}

Every ROUGE variant we computed shows a large improvement for the fine-tuned model over the base model on the dialogue summarization task. It looks as if the model went from the very basic performance of the base model to a fairly proficient level on this task.
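To put a number on that improvement, the two result dictionaries above can be compared directly. The snippet below is a minimal sketch: the scores are copied from the cell outputs, while the names `finetuned_rouge`, `base_rouge` and `relative_improvement` are illustrative and do not appear in the notebook (which reuses `model_rouge` for both results).

```python
# ROUGE scores copied from the two evaluation cell outputs above.
finetuned_rouge = {
    "rouge1": 0.5157651537666942,
    "rouge2": 0.2650530320155057,
    "rougeL": 0.4295565331965113,
    "rougeLsum": 0.4294928872615915,
}
base_rouge = {
    "rouge1": 0.2916679198398079,
    "rouge2": 0.09951049848424319,
    "rougeL": 0.21840199659647197,
    "rougeLsum": 0.22900310959298054,
}


def relative_improvement(new: float, old: float) -> float:
    """Return the relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


# Hypothetical helper dict: metric name -> relative improvement in percent.
improvements = {
    name: relative_improvement(finetuned_rouge[name], base_rouge[name])
    for name in base_rouge
}

for name, pct in improvements.items():
    print(f"{name}: {base_rouge[name]:.3f} -> {finetuned_rouge[name]:.3f} ({pct:+.1f}%)")
```

Relative change is only one way to read these numbers. The absolute differences (for example about 0.22 points of ROUGE-1) are just as valid a summary.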

I am especially happy with the improvement in ROUGE-L and ROUGE-Lsum, as they are based on the longest common subsequence between the generated and the reference summary. These subsequences convey more of the meaning of the sequences than simple unigram and bigram counts, and here we can see an improvement of almost 100% (roughly 97% for ROUGE-L and 88% for ROUGE-Lsum).
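To make the "longest common subsequence" idea concrete, here is a simplified sketch of a ROUGE-L style score for a single prediction/reference pair. It is an illustration only and makes some assumptions: it tokenizes by lowercasing and splitting on whitespace and does no stemming, so its numbers will not match what `evaluate.load('rouge')` reports. The function names are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    # prev[j] is the LCS length of the tokens of `a` processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    # ASSUMPTION: naive whitespace tokenization, no stemming, no punctuation handling.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Usage: tokens must appear in the same order to count, but need not be adjacent.
print(rouge_l_f1("a b c d", "a c d e"))
```

Because the subsequence has to respect word order, a summary that reuses the right words in a scrambled order scores lower here than it would on a pure unigram count.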

All of this was achieved even though we trained the model on limited data (only 7k examples), fine-tuned just a small fraction of the base model's weights (LoRA), and used a decoder-only base model for this seq2seq task.


### 3.3 Human evaluation

The ROUGE improvement is clear, but let's check how this improvement translates into the quality of the actual generated dialogue summaries.

To do that we will just manually inspect and compare a few randomly selected examples. Just to reiterate: these are test dataset examples, so neither model had a chance to see these specific dialogues.

#### Example 1

In this first example, the first thing we can see (and it applies to all examples) is that the base model generates VERY long summaries, which are basically the dialogue translated into linear text. It is some kind of summary, hence the ROUGE-L score of about 0.22, but it is not as condensed as the reference summary.

On the other hand, the fine-tuned model's summary is crisp and short. It also covers quite a large portion of the reference summary. However, it loses part of the initial context: "Hannah needs Betty's number". The rest is very good.

In [None]:
example = data_preprocessed[0]
print(example["dialogue"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye[/INST]
</s>


In [None]:
print(example["summary"])

Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


In [None]:
print(example["finetuned_generated_summary"])

Amanda doesn't have Betty's number. Hannah will text Larry.


In [None]:
print(example["raw_generated_summary"])

: Hannah asked if you had Betty's number, but you couldn't find it. You suggested she ask Larry instead, as he had spoken to Betty recently. Hannah expressed hesitance due to not knowing Larry well, but you reassured her that he was nice. Eventually, Hannah agreed to text him to get Betty's number.
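The length difference noted for this example can also be checked with a quick word count. This is a minimal sketch using the three texts printed above. The dictionary name `texts` and the helper `word_count` are illustrative, and splitting on whitespace is only a rough proxy for length.

```python
# The three summaries for Example 1, copied from the outputs above.
texts = {
    "reference": (
        "Hannah needs Betty's number but Amanda doesn't have it. "
        "She needs to contact Larry."
    ),
    "finetuned": "Amanda doesn't have Betty's number. Hannah will text Larry.",
    "base": (
        ": Hannah asked if you had Betty's number, but you couldn't find it. "
        "You suggested she ask Larry instead, as he had spoken to Betty recently. "
        "Hannah expressed hesitance due to not knowing Larry well, but you "
        "reassured her that he was nice. Eventually, Hannah agreed to text him "
        "to get Betty's number."
    ),
}


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a rough length proxy)."""
    return len(text.split())


counts = {name: word_count(text) for name, text in texts.items()}
for name, n in counts.items():
    print(f"{name}: {n} words")
```

On this example the base model output is several times longer than the reference, while the fine-tuned output is slightly shorter than it.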


#### Example 2

Here we can see how good the fine-tuned model's summary can get. It is short and full of meaning, while the base model summary is lengthy and very different in style.

In [None]:
example = data_preprocessed[10]
print(example["dialogue"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Wanda: Let's make a party!
Gina: Why?
Wanda: beacuse. I want some fun!
Gina: ok, what do u need?
Wanda: 1st I need too make a list
Gina: noted and then?
Wanda: well, could u take yours father car and go do groceries with me?
Gina: don't know if he'll agree
Wanda: I know, but u can ask :)
Gina: I'll try but theres no promisess
Wanda: I know, u r the best!
Gina: When u wanna go
Wanda: Friday?
Gina: ok, I'll ask[/INST]
</s>


In [None]:
print(example["summary"])

Wanda wants to throw a party. She asks Gina to borrow her father's car and go do groceries together. They set the date for Friday. 


In [None]:
print(example["finetuned_generated_summary"])

Wanda wants to make a party. Gina will ask her father if he can go grocery shopping with Wanda. They want to meet on Friday.


In [None]:
print(example["raw_generated_summary"])

Wanda suggested making a party and asked Gina for help. Gina inquired about the reason and Wanda explained that she just wanted some fun. Wanda then requested that they make a list of things needed for the party. Gina agreed and Wanda asked if Gina could borrow her father's car to go grocery shopping together. Gina expressed uncertainty about getting permission, but Wanda encouraged her to ask. Gina agreed to try and mentioned that there was no promise it would be granted. Wanda expressed confidence in Gina and asked if they could go shopping on Friday. Gina agreed to ask about the car and plan accordingly for the party.


#### Example 4

In this example we can see that the fine-tuned summary, even though it keeps the right short form, sometimes loses important context or misinterprets it. For example, the reference says "(Beth) wants to work at Deidre's beauty salon.", while the fine-tuned model says "Beth wants to try a beauty therapy at the salon where Deirdre works". These two sentences have very different meanings. The base model conveys the correct meaning here, but again not in the form of a real summary.

In [None]:
example = data_preprocessed[20]
print(example["dialogue"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Deirdre: Hi Beth, how are you love?
Beth: Hi Auntie Deirdre, I'm been meaning to message you, had a favour to ask.
Deirdre: Wondered if you had any thought about your Mum's 40th, we've got to do something special!
Beth: How about a girls weekend, just mum, me, you and the girls, Kira will have to come back from Uni, of course.
Deirdre: Sounds fab! Get your thinking cap on, it's only in 6 weeks! Bet she's dreading it, I remember doing that!
Beth: Oh yeah, we had a surprise party for you, you nearly had a heart attack! 
Deirdre: Well, it was a lovely surprise! Gosh, thats nearly 4 years ago now, time flies! What was the favour, darling?
Beth: Oh, it was just that I fancied trying a bit of work experience in the salon, auntie.
Deirdre: Well, I am looking for Saturday girls, are you sure about it? you could do well in the exams and go on to college or 6th form.
Beth: I know, but it's 

In [None]:
print(example["summary"])

Beth wants to organize a girls weekend to celebrate her mother's 40th birthday. She also wants to work at Deidre's beauty salon. Deidre offers her a few hours on Saturdays as work experience. They set up for a meeting tomorrow.


In [None]:
print(example["finetuned_generated_summary"])

Beth is planning a girls weekend for her mother's 40th birthday. Beth wants to try a beauty therapy at the salon where Deirdre works. Beth and Deirdre will meet tomorrow to discuss it.


In [None]:
print(example["raw_generated_summary"])

: Deirdre suggested a special plan for Beth's mother's 40th birthday, proposing a girls' weekend with Beth, her mother, Deirdre, and Kira. Beth agreed and mentioned she had a favor to ask. Deirdre recalled their own surprise party and reminisced about the past. Beth revealed her desire to try work experience at a salon, and Deirdre offered to help her explore opportunities there, suggesting meeting the beauty therapy manager, Maxine. They discussed a trial period with compensation for expenses and potential future employment. Beth expressed excitement about the opportunity and agreed to meet Maxine the following day. Deirdre expressed her support and love.


#### Example 5

Here the fine-tuned model again performs very well, producing a to-the-point summary, while the base model produced a rather nonsensical one (it reads "a few box" as literal boxes).

In [None]:
example = data_preprocessed[30]
print(example["dialogue"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Mary: hey, im kinda broke, lend me a few box
Carter: okay, give me an hour, im at the train station
Mary: cool, thanks[/INST]
</s>


In [None]:
print(example["summary"])

Mary ran out of money. Carter is going to lend her some in an hour.


In [None]:
print(example["finetuned_generated_summary"])

Carter will lend some money to Mary.


In [None]:
print(example["raw_generated_summary"])

: Mary asked Carter for a favor, requesting that he lend her some boxes. Carter agreed and mentioned that it would take him an hour to get to her, as he was currently at the train station. Mary expressed her gratitude for Carter's help.


#### Example 6

Again an example of a very good summary from the fine-tuned model. The base model also captures the meaning well, but its output is again very long and detailed.

In [None]:
example = data_preprocessed[40]
print(example["dialogue"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Sebastian: It's been already a year since we moved here.
Sebastian: This is totally the best time of my life.
Kevin: Really? 
Sebastian: Yeah! Totally maaan.
Sebastian: During this 1 year I learned more than ever. 
Sebastian: I learned how to be resourceful, I'm learning responsibility, and I literally have the power to make my dreams come true.
Kevin: It's great to hear that.
Kevin: It's great that you are satisfied with your decisions.
Kevin: And above all it's great to see that you have someone you love by your side :)
Sebastian: Exactly!
Sebastian: That's another part of my life that is going great.
Kevin: I wish I had such a person by my side.
Sebastian: Don't worry about it.
Sebastian: I have a feeling this day will come shortly.
Kevin: Haha. I don' think so, but thanks.
Sebastian: This one year proved to me that when you want something really badly, you can achieve it

In [None]:
print(example["summary"])

Sebastian is very happy with his life, and shares this happiness with Kevin.


In [None]:
print(example["finetuned_generated_summary"])

Sebastian moved to a new place a year ago. He is satisfied with his life. Kevin wishes he could win a lottery.


In [None]:
print(example["raw_generated_summary"])

Sabastian expresses his satisfaction with his life decisions and the past year since moving, sharing that he's learned valuable skills like resourcefulness and responsibility. He also mentions the importance of having someone he loves by his side. Kevin expresses his happiness for Sabastian and shares his own dream of winning the lottery, to which Sabastian suggests dedicating time and effort to achieving it. The conversation ends with Sabastian expressing gratitude for Kevin's support.


#### Example 7

Here both the base and the fine-tuned model fail, but the fine-tuned model fails less dramatically. The fine-tuned summary lost the context of the planned Skype call (Rick and Helen are in Cancun), whereas the base model generated an alternate version of the dialogue instead of a summary.

In [None]:
example = data_preprocessed[100]
print(example["dialogue"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Chris: Hi there! Where are you? Any chance of skyping?
Rick: Hi! Our last two days in Cancun before flying to Havana. Yeah, skyping is an idea. When would it suit you?
Rick: We don't have the best of connections in the room but I can get you pretty well in the lobby.
Chris: What's the time in your place now?
Rick: 6:45 pm
Chris: It's a quarter to one in the morning here. Am still in front of the box.
Rick: Gracious me! Sorry mate. You needn't have answered.
Chris: 8-D
Rick: Just tell me when we could skype.
Chris: Preferably in the evening. Just a few hours earlier than now. And not tomorrow.
Rick: Shute! Only tomorrow makes sense as there's no workable internet in Cuba.
Chris: Could you make it like 3 pm your time?
Rick: Sure.
Chris: Perfect. So talk to you tomorrow.
Chris: Give my love to Helen please.
Rick: I will. Thx.[/INST]
</s>


In [None]:
print(example["summary"])

Rick and Helen are in Cancun. They're flying to Havana in two days. Chris and Rick will talk on Skype at 3 PM in Mexico.


In [None]:
print(example["finetuned_generated_summary"])

Rick and Chris will skype tomorrow at 3 pm local time.


In [None]:
print(example["raw_generated_summary"])

: Hi Chris, I'm currently in Cancun and will be flying to Havana soon. I'd be happy to Skype with you, but our internet connection in the room isn't great. You can reach me better in the lobby. What's the current time with you?

Chris: It's almost 1:00 AM here.

Rick: I see, that's quite a time difference. I'm sorry for the inconvenience. Let me know when would be a good time for you to Skype, and I'll make it work.

Chris: Preferably in the evening, a few hours earlier than now, and not tomorrow as there's no reliable internet in Cuba.

Rick: Understood. How about 3 PM your time tomorrow?

Chris: That works for me. I'll talk to you then. Please give my best to Helen.

Rick: I will. Good night, Chris.


#### Example 8

A very good summary from the fine-tuned model and, again, a rephrased dialogue instead of a summary from the base model.

In [None]:
example = data_preprocessed[200]
print(example["dialogue"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Abdellilah: Where are you?
Sam: work
Abdellilah: What time you finish?
Sam: Not til 5
Abdellilah: Are your bringing him over tonight:
Sam: No in the morning:
Abdellilah: ok, what time?
Sam: About 9. Is that ok?
Abdellilah: ok - see you then[/INST]
</s>


In [None]:
print(example["summary"])

Sam won't finish work till 5. Sam is bringing him over about 9 am. Sam will see Abdellilah in the morning. 


In [None]:
print(example["finetuned_generated_summary"])

Sam is working. He finishes not before 5. Sam is bringing him over in the morning. Sam and Abdellilah will see each other at about 9.


In [None]:
print(example["raw_generated_summary"])

Abdellilah: What is your current location?
Sam: I'm at work.
Abdellilah: What time will you finish work?
Sam: Not until 5 p.m.
Abdellilah: Will you be bringing him over tonight?
Sam: No, it's in the morning instead.
Abdellilah: What time should I expect you in the morning?
Sam: Around 9 a.m. Is that convenient for you?
Abdellilah: Yes, I'll see you then.
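Outputs like this one, where the model rewrites the dialogue instead of summarizing it, could be flagged automatically before looking at them by hand. The sketch below is a hypothetical heuristic, not part of the notebook: it treats a line as a dialogue turn if it starts with a capitalized speaker name followed by a colon, and flags a text when at least half of its non-empty lines look like that. The pattern, the 0.5 threshold and the function names are all assumptions.

```python
import re

# A line looks like a dialogue turn if it starts with "Name: something".
TURN_PATTERN = re.compile(r"^\s*[A-Z][\w'-]*:\s+\S")


def dialogue_turn_ratio(text: str) -> float:
    """Fraction of non-empty lines that look like 'Speaker: utterance'."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    turns = sum(1 for line in lines if TURN_PATTERN.match(line))
    return turns / len(lines)


def looks_like_dialogue(text: str, threshold: float = 0.5) -> bool:
    """Heuristic flag: True if most lines are dialogue turns (threshold is arbitrary)."""
    return dialogue_turn_ratio(text) >= threshold


# Shortened versions of the two Example 8 generations shown above.
base_output = (
    "Abdellilah: What is your current location?\n"
    "Sam: I'm at work.\n"
    "Abdellilah: What time will you finish work?"
)
finetuned_output = (
    "Sam is working. He finishes not before 5. "
    "Sam is bringing him over in the morning."
)

print(looks_like_dialogue(base_output))
print(looks_like_dialogue(finetuned_output))
```

A check like this only catches the most obvious failure mode. It says nothing about whether a text that does look like a summary is actually correct.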


#### Example 9

The fine-tuned model produced an even better summary than the reference, which left out an important part of the discussion. The base model kind of gets it, but misses the meaning a bit and is again too lengthy.

In [None]:
example = data_preprocessed[300]
print(example["dialogue"])

<s>[INST] You are a helpful assistant. Your task is to generate following dialogue summarization:
Jerry: Hi sweetie :)
Janet: Hi sugar ;)
Jerry: I'm coming home
Janet: Can't wait ;)
Jerry: I should be there in 40 minutes
Janet: Ok, I'm waiting for you :)
Jerry: How was your day?
Janet: Oh, it was ok but my boss is a pain in the ass sometimes
Jerry: I know, she can be a bitch :P
Janet: Yes she can! ;)
Jerry: See you later darling
Janet: <3[/INST]
</s>


In [None]:
print(example["summary"])

Jerry will be home in 40 minutes. 


In [None]:
print(example["finetuned_generated_summary"])

Jerry will be home in 40 minutes. Janet had a hard day at work.


In [None]:
print(example["raw_generated_summary"])

: Jerry and Janet exchange greetings and express excitement for Jerry's imminent arrival home. Jerry asks about Janet's day, which she describes as okay but with a difficult boss. Jerry empathizes and they both share a light-hearted comment about the boss before signing off with endearments.


## 4. Summary and next steps

The new fine-tuned dialogue summarization model works well under human inspection. It also improved considerably on the ROUGE metrics in comparison with the base model.

We could still see some glitches and quirks in some of the generations, and the ROUGE scores still look like they could be improved.

I see a few options for next steps to improve the fine-tuned model:

1. The model was trained on about half of the training dataset (7k examples). Adding more training data will most likely make the model more robust and its outputs less random.

2. The validation loss was still decreasing in the last epoch of training, so training for more epochs would probably give a better model fit.

3. Use a higher LoRA rank to train more weights, at the cost of higher memory usage. The current configuration peaked at around 30 GB of GPU RAM and the A100 has 40 GB, so there is still some headroom to trade memory for model quality.
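For the third option, it helps to see how the number of trainable LoRA parameters grows with the rank. The sketch below is a back-of-the-envelope estimate under stated assumptions: LoRA adds `rank * (in_features + out_features)` parameters per adapted matrix, the shapes are those of the Mistral 7B attention projections (hidden size 4096, 32 layers, 8 key/value heads of dimension 128), and only `q_proj` and `v_proj` are adapted. The target modules actually used in this notebook may differ, and the function name is hypothetical.

```python
# (out_features, in_features) of the attention projections in one decoder layer.
ATTENTION_PROJECTIONS = {
    "q_proj": (4096, 4096),
    "k_proj": (1024, 4096),
    "v_proj": (1024, 4096),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32


def lora_param_count(rank, target_modules, num_layers=NUM_LAYERS):
    """Trainable parameters added by LoRA: rank * (in + out) per adapted matrix."""
    per_layer = sum(
        rank * (ATTENTION_PROJECTIONS[name][0] + ATTENTION_PROJECTIONS[name][1])
        for name in target_modules
    )
    return per_layer * num_layers


# ASSUMPTION: only q_proj and v_proj are adapted; adjust to the real configuration.
targets = ["q_proj", "v_proj"]
for rank in (8, 16, 32, 64):
    print(f"rank={rank}: {lora_param_count(rank, targets):,} trainable LoRA parameters")
```

The parameter count scales linearly with the rank. Actual GPU memory also depends on optimizer state, activations, batch size and sequence length, which this estimate ignores, so the real headroom should be measured rather than derived from this number.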