# Fine-tune Llama 2 for chat & dialogue summarization

Welcome!

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.

In this Jupyter Notebook, we will learn how to fine-tune meta-llama/Llama-2-7b-hf, the smallest model in the collection with 7B parameters, using an AI Studio Deep Learning workspace with a GPU, and MLflow for tracking the metrics!

## Setup Development Environment

Our first step is to install some extra required packages.

##### Installing extra libraries

In [1]:
!pip install datasets # This one is for downloading our samsum dataset directly from Hugging Face
!pip install peft # Both peft and trl are the libs that help us 
!pip install trl # to configure our training methods and params
!pip install bitsandbytes # This one will help us to quantize the model
!pip install mlflow==2.11.0


### Imports

The libraries used below are already installed by default inside our Deep Learning workspace (except for transformers, which was installed together with the extra libraries).

In [30]:
from datasets import load_dataset, Dataset, DatasetDict
import os
import json
import re
from pprint import pprint
import pandas as pd
import torch
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel, AutoPeftModelForCausalLM, PeftModelForCausalLM
from transformers import (
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer
import time
import mlflow

### Defining device and model

In [3]:
# 'cuda:0' means that we want to use our GPU, if available. If not, uses CPU.
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
# Model name defines that we are using llama 2 with 7B parameters
# MODEL_NAME = "meta-llama/Llama-2-7b-hf"

### Hugging Face token

In [4]:
# Read the token from an environment variable instead of hard-coding a secret in the notebook
HF_TOKEN = os.environ.get("HF_TOKEN", "")

## Dataset

In [5]:
dataset = load_dataset("samsum")

Downloading data: 100%|██████████| 6.06M/6.06M [00:02<00:00, 2.65MB/s]
Downloading data: 100%|██████████| 347k/347k [00:00<00:00, 351kB/s]
Downloading data: 100%|██████████| 335k/335k [00:00<00:00, 601kB/s]


Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

This dataset has 3 different splits:
- Train: has 14732 datapoints
- Test: has 819 datapoints
- Validation: has 818 datapoints

Let's get a sample from the training set and see what it looks like.

In [8]:
dataset['train'][0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

Each datapoint is a dict with 3 key-value pairs:
- The id of the conversation
- The dialogue of the conversation
- The summary of that dialogue

Here's the dialogue from this first datapoint:

In [9]:
dataset['train'][0]['dialogue']

"Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"

We can see that the speaker turns are separated by '\r\n'. We will clean this up by replacing each '\r\n' with a plain '\n', so that every turn starts on its own line with the speaker's name, like this:

Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)

Let's create a function that formats the dialogues for us.

In [10]:
def format_dialogue(dialogue):
    # Replace the '\r\n' with '\n' to match the desired output format.
    formatted_dialogue = re.sub(r'\r\n', '\n', dialogue)
    return formatted_dialogue

Now let's test that function on the first training dialogue and see what happens.

In [11]:
print(format_dialogue(dataset['train'][0]['dialogue']))

Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)


Perfect! It works pretty well!

Now it is time to build the input prompt for the model.

## Defining input prompt

In [12]:
DEFAULT_SYSTEM_PROMPT = """
Below is a conversation between friends in a chat. Write a summary of their conversation.
""".strip()

In [13]:
def generate_training_prompt(
    conversation: str, summary: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {system_prompt}

### Input:
{conversation.strip()}

### Response:
{summary}
""".strip()
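To make the template concrete, here is a small self-contained sketch that applies it to the first training example shown earlier. The function body mirrors `generate_training_prompt`; the `conversation` and `summary` values are simply copied from that sample, with the '\r\n' separators already replaced by '\n'.

```python
DEFAULT_SYSTEM_PROMPT = (
    "Below is a conversation between friends in a chat. "
    "Write a summary of their conversation."
)


def generate_training_prompt(conversation, summary, system_prompt=DEFAULT_SYSTEM_PROMPT):
    # Same template as above: instruction, input dialogue, expected response.
    return f"""### Instruction: {system_prompt}

### Input:
{conversation.strip()}

### Response:
{summary}
""".strip()


# First training example, already cleaned of '\r\n'.
conversation = (
    "Amanda: I baked  cookies. Do you want some?\n"
    "Jerry: Sure!\n"
    "Amanda: I'll bring you tomorrow :-)"
)
summary = "Amanda baked cookies and will bring Jerry some tomorrow."

prompt = generate_training_prompt(conversation, summary)
print(prompt)
```

During training the model sees the whole string, summary included. At inference time the same template is used without the summary, so the model has to continue the text after the response header.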

In [14]:
def generate_text(data_point):
    summary = data_point['summary']
    conversation_text = format_dialogue(data_point['dialogue'])
    return {
        "conversation": conversation_text,
        "summary": summary,
        "text": generate_training_prompt(conversation_text, summary),
    }

In [15]:
def process_dataset(data: Dataset):
    return (
        data.shuffle(seed=42)
        .map(generate_text))

## Processing the dataset

In [16]:
# dataset["train"] = process_dataset(dataset["train"])
dataset["train"] = process_dataset(dataset["train"].select(range(1500)))
dataset["test"] = process_dataset(dataset["test"])

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/819 [00:00<?, ? examples/s]

## Instantiating the model and its tokenizer

In [17]:
notebook_login()


## Quantization, Tokenizer and Model

In [20]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [21]:
def create_model_and_tokenizer():

    # BitsAndBytesConfig lets us configure the desired quantization (backed by the bitsandbytes library).
    bnb_config_8 = BitsAndBytesConfig(
        load_in_8bit=True, # Load the weights in 8-bit precision (the keyword is load_in_8bit, not load_in_8_bit)
        # The bnb_4bit_* options only apply together with load_in_4bit=True, so they are omitted here
    )

    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2b", # This run loads Gemma 2B; swap in "meta-llama/Llama-2-7b-hf" to use Llama 2 7B
        use_safetensors=True, # Use the safetensors format for storing and loading tensors
        quantization_config=bnb_config_8, # with the desired quantization
        trust_remote_code=True,
        device_map="auto" # and place each layer of the model depending on the available resources
    )

    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b") # Here we say we want the same tokenizer that the model uses
    tokenizer.pad_token = tokenizer.eos_token # Use the 'end of sentence' token as the padding token
    tokenizer.padding_side = "right" # and pad on the right side

    return model, tokenizer
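For intuition about what 8-bit quantization does to the weights, here is a minimal pure-Python sketch of absmax quantization: each weight is scaled into the int8 range, rounded, and later scaled back. This is a simplified illustration and not the actual bitsandbytes algorithm (which, among other things, treats outlier values separately). The helper names `quantize_absmax` and `dequantize` are made up for this example.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


# Made-up weights purely for illustration.
weights = [0.12, -0.5, 0.33, 1.27, -0.9]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print(q)
print([round(w, 3) for w in restored])

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(max_error <= scale / 2 + 1e-12)
```

Each weight now needs one byte instead of two (fp16) or four (fp32), at the cost of a small rounding error per value.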

Great! Let's instantiate both model and tokenizer:

Note: this may take a few minutes to run :)

In [22]:
model, tokenizer = create_model_and_tokenizer()
model.config.use_cache = False # The cache is only used for generation, not for training

config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/555 [00:00<?, ?B/s]

## LoRA

If you are not familiar with LoRA, here's a brief explanation from Hugging Face:

"To make fine-tuning more efficient, LoRA’s approach is to represent the weight updates with two smaller matrices (called update matrices) through low-rank decomposition. These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn’t receive any further adjustments. To produce the final results, both the original and the adapted weights are combined."

Source: https://huggingface.co/docs/peft/conceptual_guides/lora

Basically, LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters, and that's why this approach is so common when fine-tuning LLMs.
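The saving is easy to quantify. Updating a weight matrix of shape d_out x d_in directly means training d_out * d_in values, while the two low-rank update matrices together hold only r * (d_in + d_out) values. The sketch below works this out for one matrix; the 2048 x 2048 size and the rank of 16 are illustrative assumptions, not values read from the model.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in          # updating W directly
    lora = r * (d_in + d_out)    # A is (r x d_in), B is (d_out x r)
    return full, lora


# Illustrative sizes only: a square 2048 x 2048 projection and rank 16.
full, lora = lora_param_counts(d_out=2048, d_in=2048, r=16)
print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({lora / full:.2%} of full)")
```

For these sizes the low-rank update trains roughly 1.6% of the parameters that a full update of the same matrix would.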

In [23]:
lora_r = 16 # Rank of the update matrices
lora_alpha = 64 # Scaling factor
lora_dropout = 0.1
lora_target_modules = [ # Selecting which layers of the model LoRA is applied to
    "q_proj",
    "up_proj",
    "o_proj",
    "k_proj",
    "down_proj",
    "gate_proj",
    "v_proj",
]


peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=lora_target_modules,
    bias="none",
    task_type="CAUSAL_LM",
)
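To show how `r` and `lora_alpha` interact, here is a toy pure-Python sketch of a LoRA layer's forward pass, h = W x + (alpha / r) * B (A x). With the values above (lora_alpha=64, r=16) the scale alpha / r is 4; the toy example reaches the same scale with alpha=4 and r=1 to keep the matrices tiny. All matrices and the helper names `matvec` and `lora_forward` are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)                  # frozen pretrained path
    update = matvec(B, matvec(A, x))     # low-rank trainable path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: 2 outputs, 3 inputs, rank 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]          # frozen weights (2 x 3)
A = [[1.0, 1.0, 1.0]]          # trainable, r x d_in (1 x 3)
B_zero = [[0.0], [0.0]]        # trainable, d_out x r; all zeros before training
B_trained = [[0.5], [-0.25]]   # hypothetical values after some training
x = [1.0, 2.0, 3.0]

before = lora_forward(W, A, B_zero, x, alpha=4, r=1)
after = lora_forward(W, A, B_trained, x, alpha=4, r=1)
print(before)
print(after)
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one; only as B and A are trained does the output move away from W x.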

More details about the LoRA config in this amazing blog post: https://medium.com/@manyi.yim/more-about-loraconfig-from-peft-581cf54643db

## Training (Let's rock!)

### MLflow setup
We will be using MLflow for tracking our training metrics.

Let's define a name for our experiment so we can easily access it on the Monitor tab later :)

In [24]:
os.environ['MLFLOW_EXPERIMENT_NAME'] = 'gemma2b-summary_task-quant8-6k'

### Training setup
Here, we will define the values for epochs, optimizer, learning rate, evaluation, etc.

For more details, you can check Trainer docs: https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html

In [25]:
training_arguments = TrainingArguments(
    per_device_train_batch_size=4, # The batch size per GPU/TPU core/CPU for training.
    gradient_accumulation_steps=4, # Number of batches whose gradients are accumulated before each optimizer update.
    optim="paged_adamw_32bit", # Paged AdamW optimizer with 32-bit optimizer states
    logging_steps=20, # Number of update steps between two logs if logging_strategy="steps".
    save_steps=20, # Number of update steps between two checkpoint saves if save_strategy="steps".
    learning_rate=1e-3, # The initial learning rate for the AdamW optimizer
    fp16=True, # Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
    max_grad_norm=0.3, # Maximum gradient norm (used for gradient clipping)
    num_train_epochs=3, # Number of epochs
    evaluation_strategy="steps", # Evaluation is done (and logged) every eval_steps
    eval_steps=0.2, # A float below 1 is read as a ratio: evaluate every 20% of the total training steps
    warmup_ratio=0.05, # Ratio of total training steps used for a linear warmup from 0 to learning_rate
    save_strategy="steps", # A checkpoint is saved every save_steps
    group_by_length=True, # Whether or not to group together samples of roughly the same length in the training dataset
    output_dir='./gemma/gemma-8bit', # Path to save model checkpoints and bins
    report_to="mlflow", # Integration to report the results and logs to
    save_safetensors=True, # Use safetensors saving and loading for state dicts instead of default torch.load and torch.save
    lr_scheduler_type="cosine", # The scheduler type to use.
    seed=42, # Random seed that will be set at the beginning of training.
)
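It helps to translate these settings into concrete step counts. The sketch below does the arithmetic for the 1500-example training subset, assuming a single GPU and assuming the trainer floors the number of optimizer updates per epoch and rounds ratio-based settings up; the exact rounding may differ between transformers versions. Under those assumptions the evaluation points come out at steps 56, 112, 168 and 224, which matches the steps reported in the training log of this notebook.

```python
import math

num_examples = 1500          # size of the training subset selected above
per_device_batch_size = 4    # per_device_train_batch_size
grad_accum_steps = 4         # gradient_accumulation_steps
num_epochs = 3               # num_train_epochs
eval_ratio = 0.2             # eval_steps given as a ratio
warmup_ratio = 0.05

# Examples consumed per optimizer update on a single device.
effective_batch = per_device_batch_size * grad_accum_steps

# Batches per epoch, then optimizer updates per epoch (floored; an assumption).
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = updates_per_epoch * num_epochs

# Ratio-based settings converted to step counts (rounded up; an assumption).
eval_every = math.ceil(total_steps * eval_ratio)
warmup_steps = math.ceil(total_steps * warmup_ratio)
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(effective_batch, total_steps, eval_every, warmup_steps)
print(eval_points)
```

Note that with save_steps=20 a checkpoint is written far more often than the model is evaluated, which is worth keeping in mind for disk usage.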

In [26]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"], # Placing our dataset on train
    eval_dataset=dataset["test"], # and test part
    peft_config=peft_config, #LoRA
    dataset_text_field="text",
    max_seq_length=4096, # Specifies the maximum number of tokens of the input
    tokenizer=tokenizer, # Model's tokenizer we loaded before
    args=training_arguments,
)

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/819 [00:00<?, ? examples/s]

#### Start of training

In [27]:
start_time = time.time()
trainer.train()
end_time = time.time()
training_duration = end_time - start_time
mlflow.log_metric("training_time", training_duration)

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 56   | 1.924         | 2.004827        |
| 112  | 1.7712        | 2.105561        |
| 168  | 1.602         | 2.012573        |
| 224  | 1.1134        | 2.136538        |


In [28]:
trainer.save_model()

## Inference time 😎

Let's load our fine-tuned adapter using the PeftModelForCausalLM from_pretrained method, like so:

In [31]:
model = PeftModelForCausalLM.from_pretrained(model, './gemma/gemma-8bit')

Let's create a 5-row test dataset so we can easily try the model.

In [32]:
def generate_prompt(
    conversation: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {system_prompt}

### Input:
{conversation.strip()}

### Response:
""".strip()

In [33]:
examples = []

for data_point in dataset["test"].select(range(5)):
    summary = data_point['summary']
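    # Note: this reads the raw 'dialogue' (still containing '\r\n'); the processed 'conversation' column holds the cleaned text
    conversation = data_point['dialogue']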
    examples.append(
        {
            "summary": summary,
            "conversation": conversation,
            "prompt": generate_prompt(conversation),
        }
    )
test_df = pd.DataFrame(examples)

Let's take a look at the resulting dataframe. After that, we define a function that tokenizes the inputs and runs the model.

In [34]:
test_df

|   | summary | conversation | prompt |
|---|---------|--------------|--------|
| 0 | Both Claire and Linda are making curry for din... | Claire: <file_photo>\r\nKim: Looks delicious..... | ### Instruction: Below is a conversation betwe... |
| 1 | Derek and Alyssa make fun of Fergie's performa... | Alyssa: Have you seen Fergie’s national anthem... | ### Instruction: Below is a conversation betwe... |
| 2 | Ann wants to buy Josh's laptop for $200. Josh ... | Ann: Hi, is the laptop still available?\r\nJos... | ### Instruction: Below is a conversation betwe... |
| 3 | Matt and Tony want to go to the concert of Bon... | Matt: have you heard that Bon Jovi are coming ... | ### Instruction: Below is a conversation betwe... |
| 4 | Anastasia sent her new school photos to Darrell. | Anastasia: Our new school photos\r\nAnastasia:... | ### Instruction: Below is a conversation betwe... |


In [35]:
def summarize(model, text: str):
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
    inputs_length = len(inputs["input_ids"][0])
    with torch.inference_mode():
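        # Adjust temperature to a more stable value.
        # Note: temperature only affects decoding when sampling is enabled (do_sample=True);
        # otherwise generate() decodes greedily and ignores it.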
        outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7)
    return tokenizer.decode(outputs[0][inputs_length:], skip_special_tokens=True)
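The slicing in the last line of `summarize` is there because, for a causal language model, the generated sequence contains the prompt tokens followed by the newly generated ones. Dropping the first `inputs_length` tokens leaves only the summary. Here is a minimal sketch of that step with plain Python lists; the token ids and the helper name `strip_prompt_tokens` are made up for illustration.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Made-up token ids purely for illustration.
prompt_ids = [2, 101, 102, 103]
generated_ids = [201, 202, 1]             # 1 stands in for an end-of-sequence id
output_ids = prompt_ids + generated_ids   # prompt followed by the continuation

new_tokens = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_tokens)
```

Without this step, the decoded text would start with the full instruction and dialogue before reaching the summary.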

### Inferences

In [36]:
import numpy as np
np.random.seed(28)

In [43]:
example = test_df.iloc[0]
print(example.conversation)

Claire: <file_photo>
Kim: Looks delicious...
Linda: No way... Look what I'm cooking right now:
Linda: <file_photo>
Claire: hahahaha 
Kim: Curry dream team
Claire: Enjoy your dinner :*


In [44]:
print(example.summary)

Both Claire and Linda are making curry for dinner. 


In [45]:
print(example.prompt)

### Instruction: Below is a conversation between friends in a chat. Write a summary of their conversation.

### Input:
Claire: <file_photo>
Kim: Looks delicious...
Linda: No way... Look what I'm cooking right now:
Linda: <file_photo>
Claire: hahahaha 
Kim: Curry dream team
Claire: Enjoy your dinner :*

### Response:


In [46]:
%%time
summary = summarize(model, example.prompt)

CPU times: user 1min 12s, sys: 7.39 s, total: 1min 19s
Wall time: 1min 19s


In [47]:
print(summary)


Claire sends a photo of the food she's cooking. Linda is cooking curry. Kim is a fan of curry. Linda's dinner is dream-like. She enjoys eating it. Claire finds it delicious. Kim and Linda are cooking curry together. Claire sends a photo of the curry Linda's cooking. Kim and Linda are cooking dream-like curries. Linda enjoys eating her dream-like curry. Claire finds Linda's dream-like curry delicious. Linda and Kim are cooking curries together. Claire sends a photo of the curries they're cooking. Kim and Linda are cooking curries together. Linda's curry is dream-like. Claire finds Linda's dream-like curry delicious. Linda's curry is delicious. Linda and Claire are cooking curries together. Claire sends a photo of the curries they're cooking. Kim and Linda are cooking curries together. Linda's curry is dream-like. Claire finds Linda's dream-like curry delicious. Linda's curry is delicious. Linda and Claire are cooking curries together. Claire sends a photo of the curries they're cooking

You can choose other indexes as well and also try with other seeds!

Most of the outputs are not going to be very good, but sometimes there are some good ones.

### Inference using Hugging Face

The model is also available on the Hugging Face Hub :)

In [48]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM

config = PeftConfig.from_pretrained("morgana-rodrigues/gemma-2b-quant-8bit")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
model = PeftModel.from_pretrained(model, "morgana-rodrigues/gemma-2b-quant-8bit")

adapter_config.json:   0%|          | 0.00/688 [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/78.5M [00:00<?, ?B/s]

In [57]:
import os

def get_size(start_path):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)

    return total_size

# Paths of the model directories
model_8bit_dir = '/home/jovyan/.cache/huggingface/hub/models--morgana-rodrigues--gemma-2b-quant-8bit'

# Get the size of the 8-bit quantized model directory
size_quant_8bit = get_size(model_8bit_dir)

print(f"Size of the 8-bit quantized model: {size_quant_8bit} bytes")


Size of the 8-bit quantized model: 78480800 bytes
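A raw byte count is hard to read at a glance. As a small convenience, the sketch below converts the value printed above into binary units; `format_bytes` is a hypothetical helper written for this example, not part of any library used in the notebook.

```python
def format_bytes(num_bytes):
    """Render a byte count using binary units (KiB, MiB, GiB)."""
    size = float(num_bytes)
    for unit in ("bytes", "KiB", "MiB", "GiB"):
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we support.
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


print(format_bytes(78480800))
```

This is in line with the size of the adapter_model.safetensors file reported in the download progress above (78.5M, in decimal megabytes).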


In [58]:
import sys

# Example variable: note that this is the string holding the directory path, not the model itself
my_list = model_8bit_dir
# Checking the in-memory size of that Python object
size_in_bytes = sys.getsizeof(my_list)

print(f"The variable is using {size_in_bytes} bytes of memory.")

The variable is using 131 bytes of memory.
