# Fine-tuning LLMs

This section demonstrates how to fine-tune an LLM. You will need a GPU to run it: click "Runtime -> Change runtime type" and select a GPU.

Let's install and import the necessary libraries:

In [None]:
! pip install transformers[torch] comet-ml opik datasets evaluate sentencepiece --quiet

In [18]:
import warnings
warnings.filterwarnings("ignore")

# standard library
import json
import os
import pickle

# import comet_ml before torch and transformers so Comet can set up its automatic logging
import comet_ml
import evaluate
import numpy as np
import opik
import pandas as pd
import torch
import transformers
from datasets import Dataset, DatasetDict, Features, Value, load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    Trainer,
    TrainingArguments,
)

# fix the random seed for reproducibility
transformers.set_seed(35)

# use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


### Dataset Preparation

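The code below loads the training and validation data, splits the validation file in half to create the validation and test sets, and converts everything into a Hugging Face `DatasetDict`. By default it samples 50 examples per split to keep the run fast. You can change the sample size, or set `sample = False` to use the full data, to run different experiments. More training samples typically lead to a better-performing model.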

In [9]:
# load the data from the JSONL files
emotion_dataset_train = pd.read_json(path_or_buf="https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/merged_training_sample_prepared_train.jsonl", lines=True)
emotion_dataset_val_temp = pd.read_json(path_or_buf="https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/merged_training_sample_prepared_valid.jsonl", lines=True)

# split the validation file in half: first half -> validation, second half -> test
half = len(emotion_dataset_val_temp) // 2
emotion_dataset_val = emotion_dataset_val_temp.iloc[:half]
emotion_dataset_test = emotion_dataset_val_temp.iloc[half:]

# set to False to use the full data instead of a 50-example sample per split
sample = True

if sample:
    final_ds = DatasetDict({
        "train": Dataset.from_pandas(emotion_dataset_train.sample(50)),
        "validation": Dataset.from_pandas(emotion_dataset_val.sample(50)),
        "test": Dataset.from_pandas(emotion_dataset_test.sample(50))
    })
else:
    final_ds = DatasetDict({
        "train": Dataset.from_pandas(emotion_dataset_train),
        "validation": Dataset.from_pandas(emotion_dataset_val),
        "test": Dataset.from_pandas(emotion_dataset_test)
    })
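
The split-and-sample logic above can be sketched without pandas on a plain list of rows. This is a minimal illustration, not the notebook's code: `split_in_half` and `sample_rows` are hypothetical helpers, the rows are made up, and the seed simply reuses the value 35 passed to `transformers.set_seed` earlier.

```python
import random

def split_in_half(rows):
    """First half -> validation, second half -> test (the extra row of an odd-sized list goes to test)."""
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]

def sample_rows(rows, n, seed=35):
    """Draw n rows without replacement; cap n at the split size so small splits do not raise."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

# made-up rows standing in for the validation file
rows = [{"prompt": f"text {i}", "completion": " joy\n"} for i in range(101)]
val_rows, test_rows = split_in_half(rows)
print(len(val_rows), len(test_rows))  # 50 51

val_sample = sample_rows(val_rows, 50)
tiny_sample = sample_rows(val_rows[:10], 50)
print(len(val_sample), len(tiny_sample))  # 50 10
```

Unlike this sketch, `DataFrame.sample(50)` raises an error when a split has fewer than 50 rows, so a cap such as `min(n, len(rows))` may be worth adding if you experiment with smaller files.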

In [10]:
emotion_dataset_train.head()

| | prompt | completion |
|---|---|---|
| 0 | i also volunteered that if we were to marry th... | joy\n |
| 1 | i always feel a bit awkward doing this kind of... | sadness\n |
| 2 | i feel like this could be a long term romantic... | love\n |
| 3 | i couldnt help feeling a little dismayed as th... | sadness\n |
| 4 | i never feel your tender kiss again span style... | love\n |


### Tokenize Dataset

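The code below loads the Hugging Face tokenizer for the model checkpoint and uses it to tokenize the datasets. Each example is first merged into a single instruction-style prompt (instruction, text, and emotion label) and then converted to token IDs, which is the input format the model expects.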

In [11]:
# model checkpoint
model_checkpoint = "google/flan-t5-base"

# We'll create a tokenizer from model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

# We'll need padding to have same length sequences in a batch
tokenizer.pad_token = tokenizer.eos_token

# prefix
prefix_instruction = "Classify the provided piece of text into one of the following emotion labels.\n\nEmotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']"

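# Define a tokenization function that first concatenates the instruction, text, and target label
def tokenize_function(example):
    # str.strip treats its argument as a set of characters, so this removes
    # leading/trailing newlines and '#' characters (the "\n\n###\n\n" separator)
    merged = prefix_instruction + "\n\n" + "Text: " + example["prompt"].strip("\n\n###\n\n") + "\n\n" + "Emotion output:" + example["completion"].strip(" ").strip("\n")
    # pad or truncate every example to the tokenizer's maximum length
    batch = tokenizer(merged, padding='max_length', truncation=True)
    # the labels are a copy of the input IDs (the full merged prompt, including the label)
    batch["labels"] = batch["input_ids"].copy()
    return batch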

# Apply it on our dataset, and remove the text columns
tokenized_datasets = final_ds.map(tokenize_function, remove_columns=["prompt", "completion"])

Map: 100%|██████████| 50/50 [00:00<00:00, 1655.79 examples/s]
Map: 100%|██████████| 50/50 [00:00<00:00, 4119.57 examples/s]
Map: 100%|██████████| 50/50 [00:00<00:00, 4068.98 examples/s]
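
The prompt-merging step can be tried on its own, without a tokenizer. The sketch below mirrors the string handling in `tokenize_function`; `build_prompt` and `PREFIX` are hypothetical names introduced for illustration, and the input strings are made up. It also shows a caveat of `str.strip`: the argument is a set of characters, not a literal suffix.

```python
PREFIX = (
    "Classify the provided piece of text into one of the following emotion labels.\n\n"
    "Emotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']"
)

def build_prompt(prompt, completion):
    # "\n#" strips the same characters as "\n\n###\n\n": newlines and '#'
    text = prompt.strip("\n#")
    label = completion.strip(" ").strip("\n")
    return PREFIX + "\n\n" + "Text: " + text + "\n\n" + "Emotion output:" + label

merged = build_prompt("i feel fine today\n\n###\n\n", " joy\n")
print(merged.splitlines()[-3:])  # ['Text: i feel fine today', '', 'Emotion output:joy']

# caveat: a '#' at the very start or end of the text is stripped too
print("#1 fan\n\n###\n\n".strip("\n\n###\n\n"))  # 1 fan
```

For this dataset the difference is unlikely to matter, but text that legitimately starts or ends with `#` would lose that character.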


In [12]:
# Walk through the steps from the cell above on a single example
example = final_ds['train'][0]
merged = prefix_instruction + "\n\n" + "Text: " + example["prompt"].strip("\n\n###\n\n") + "\n\n" + "Emotion output:" + example["completion"].strip(" ").strip("\n")
batch = tokenizer(merged, padding='max_length', truncation=True)
batch["labels"] = batch["input_ids"].copy()
print("------------------------")
print(f"Example: {example}")
print("------------------------")
print(f"Merged: {merged}")
print("------------------------")
print(f"Batch: {batch}")
print("------------------------")

------------------------
Example: {'prompt': 'i dont know why most of my life ive been hurt i dont know why it continues to happen but i really am tired of it im tired of normal people having stomache problems when all i feel is my heart sinking and aching not to sound emo its actually true\n\n###\n\n', 'completion': ' sadness\n', '__index_level_0__': 470}
------------------------
Merged: Classify the provided piece of text into one of the following emotion labels.

Emotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']

Text: i dont know why most of my life ive been hurt i dont know why it continues to happen but i really am tired of it im tired of normal people having stomache problems when all i feel is my heart sinking and aching not to sound emo its actually true

Emotion output:sadness
------------------------
Batch: {'input_ids': [4501, 4921, 8, 937, 1466, 13, 1499, 139, 80, 13, 8, 826, 13868, 11241, 5, 262, 7259, 11241, 10, 784, 31, 9, 9369, 31, 6, 3, 31, 89, 2
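
Because every example is padded to the maximum length and `labels` is a straight copy of `input_ids`, the padded positions are treated as targets too. A common Hugging Face convention, which this notebook does not use, is to set label positions that should not contribute to the loss to `-100`. The sketch below shows the idea on plain lists. It relies on the attention mask rather than the pad token ID because the pad token was set equal to the EOS token above. `mask_padded_labels` is a hypothetical helper and the token IDs are toy values.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy losses skip by default

def mask_padded_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions (mask == 0) with IGNORE_INDEX."""
    return [
        token_id if keep == 1 else IGNORE_INDEX
        for token_id, keep in zip(input_ids, attention_mask)
    ]

# toy example: three real tokens, one real EOS token (ID 1), then two padding positions
toy_batch = {
    "input_ids": [4501, 4921, 8, 1, 1, 1],
    "attention_mask": [1, 1, 1, 1, 0, 0],
}
toy_batch["labels"] = mask_padded_labels(toy_batch["input_ids"], toy_batch["attention_mask"])
print(toy_batch["labels"])  # [4501, 4921, 8, 1, -100, -100]
```

Whether masking changes the results of this particular run is not something the notebook reports, so treat it as an option to experiment with.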

### Fine-tuning Model

Once the datasets have been tokenized, it's time to fine-tune the model. We use the Hugging Face `Seq2SeqTrainer` to simplify the fine-tuning code. The code below also initializes a Comet project, which is what allows the experiment results to be tracked in Comet. Setting the `COMET_LOG_ASSETS` environment variable to `True` additionally stores the training artifacts in Comet.

In [None]:
# initialize comet_ml
comet_ml.init(project_name="emotion-classification")

# load the pretrained sequence-to-sequence (encoder-decoder) model to fine-tune
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)

# set this to log HF results and assets to Comet
os.environ["COMET_LOG_ASSETS"] = "True"

# HF Trainer
model_name = model_checkpoint.split("/")[-1]
training_args = Seq2SeqTrainingArguments(
    num_train_epochs=1,
    output_dir="./results",
    overwrite_output_dir=True,
    logging_steps=1,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    learning_rate=1e-4,
    weight_decay=0.01,
    save_total_limit=5,
    save_steps=7,
    auto_find_batch_size=True
)

# instantiate HF Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

# run trainer
trainer.train()
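
The `save_steps=7` setting lines up with the size of the sampled training set. As a rough sketch, assuming a single device, no gradient accumulation, and that `auto_find_batch_size` keeps the default per-device batch size of 8 (it can lower it when memory runs out), one epoch over 50 examples takes 7 optimizer steps, so a checkpoint is written at the end of the epoch. `steps_per_epoch` is a hypothetical helper, not part of the Trainer API.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (the last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)

num_train_examples = 50   # size of the sampled training split
batch_size = 8            # assumed: the Trainer's default per-device batch size
epochs = 1

total_steps = steps_per_epoch(num_train_examples, batch_size) * epochs
print(total_steps)  # 7
```

With the full dataset or a different batch size the step count changes, and so does the name of the `checkpoint-<step>` folder that the Trainer writes.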

#### Fine-tuning results

![fine-tuning](imgs/finetuning.png)

The code below saves the fine-tuned model locally:

In [19]:
# save the model
trainer.save_model("./results")
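
During training, the Trainer also writes intermediate checkpoints under `output_dir` in folders named `checkpoint-<step>`, and the registration step later in this section refers to one of them by a fixed path. If the number of training steps changes, that folder name changes too. The sketch below picks the highest-numbered checkpoint folder using only the standard library. `latest_checkpoint` is a hypothetical helper written for illustration, and the temporary directory stands in for `./results`.

```python
import os
import re
import tempfile

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None if there is none."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# demo on a throwaway directory that mimics a Trainer output folder
with tempfile.TemporaryDirectory() as output_dir:
    for step in (7, 14, 100):
        os.mkdir(os.path.join(output_dir, f"checkpoint-{step}"))
    found = os.path.basename(latest_checkpoint(output_dir))
    print(found)  # checkpoint-100
```

The steps are compared as integers, so `checkpoint-100` ranks above `checkpoint-14`, which a plain string sort would get wrong. The `transformers.trainer_utils` module has a `get_last_checkpoint` function with a similar purpose, if you prefer not to maintain your own helper.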

---

### Register Model

The code below logs a training checkpoint to a Comet experiment and registers it in the Comet model registry.

In [29]:
# create a Comet experiment used to log and register the model
import os
from comet_ml import Experiment
os.environ["COMET_LOG_ASSETS"] = "True"

COMET_API_KEY = os.getenv("COMET_API_KEY")

experiment = Experiment(api_key=COMET_API_KEY, project_name="emotion-classification")
experiment.log_model("Emotion-T5-Base", "./results/checkpoint-7")
experiment.register_model("Emotion-T5-Base")
experiment.end()

COMET INFO: Experiment is live on comet.com https://www.comet.com/josephlyu/emotion-classification/5e6cbd6ad8964727a57227675aa0a17e

COMET INFO: The process of logging environment details (conda environment, git patch) is underway. Please be patient as this may take some time.
COMET INFO: Successfully registered 'Emotion-T5-Base', version None in workspace 'josephlyu'
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     name                  : impressive_quince_4975
COMET INFO:     url                   : https://www.comet.com/josephlyu/emotion-classification/5e6cbd6ad

---

#### Model registry

![Model registry](imgs/model_registry.png)

### Deploy Model

The code below downloads a specific version of the registered model from the Comet model registry into the environment you are deploying from.

In [31]:
from comet_ml import API

api = API(api_key=COMET_API_KEY)
COMET_WORKSPACE = os.getenv("COMET_WORKSPACE")

# model name
model_name = "Emotion-T5-Base"

# get the Model object from the registry
model = api.get_model(workspace=COMET_WORKSPACE, model_name=model_name)

# download version 1.0.0 of the registered model into ./deploy
model.download("1.0.0", "./deploy", expand=True)

COMET INFO: Remote Model 'josephlyu/Emotion-T5-Base:1.0.0' download has been started asynchronously.
COMET INFO: Still downloading 10 file(s), remaining 1.14 GB/1.14 GB
COMET INFO: Still downloading 2 file(s), remaining 1.13 GB/1.14 GB, Throughput 682.22 KB/s, ETA ~1740s
COMET INFO: Still downloading 2 file(s), remaining 1.12 GB/1.14 GB, Throughput 814.13 KB/s, ETA ~1443s
COMET INFO: Still downloading 2 file(s), remaining 1.11 GB/1.14 GB, Throughput 611.27 KB/s, ETA ~1907s
COMET INFO: Still downloading 2 file(s), remaining 1.10 GB/1.14 GB, Throughput 747.32 KB/s, ETA ~1545s
COMET INFO: Still downloading 2 file(s), remaining 1.08 GB/1.14 GB, Throughput 1.19 MB/s, ETA ~930s
COMET INFO: Still downloading 2 file(s), remaining 1.06 GB/1.14 GB, Throughput 1.33 MB/s, ETA ~822s
COMET INFO: Still downloading 2 file(s), remaining 1.04 GB/1.14 GB, Through