# Finetuning

## What is Finetuning?
Finetuning is the act of of "specializing" a pretrained AI Model on one or more specific tasks with given data.

A common use-case would be taking a pretrained model, and freezing some of the layers or all of them and adding additional layers to train on your specific task. This makes use of the pretrained understanding of the previous model and "transferes" (transfere learning) some of the knowledge to the new one while training on the new task.

In the following we will train a small language model on the simple task of sentiment analysis, as to get the sentiment of a given sentence or paragraph. Of course this is not a full fletched solution and we ignore that a sentence can have both positive and negative sentiments.

## How can I fine tune models?
There are multiple ways to fine tune models.

Something that is common amongst the ways is the preparation of "training data", and evaluating the new models performance. For this we will work with an existing model from the huggingface hub, and split it into a training, evaluation and testing set.

In a real project you would have to somehow gather this data, and preprocess into a common format, with labels (results that you expect for the input) that are consistent for all of them (so for example a scale of 1 - 5 with some rules for grading).

## Data Preparation
Let's first prepare our data in our train, eval and test datasets. The train dataset will be used to fine tune the model, the evaluation dataset will be used to compute some performance / quality metric (Like accuracy, F1 Score, etc.) (and the test dataset will be used to test the final result.) We won't be using a test dataset this time.

Normally you would be preprocessing your input data in the same way you expect your future data to be preprocessed. Standard preprocessing can also help with anonymizing for example username information, stripping unnecessary whitespaces and other "errors" in the data that might just cause the model to perform poorer than it would on clean data. We won't be touching on preprocessing data a lot and will only be anonymizing twitter usernames as an example.

We will be working with the following dataset: [carblacac/twitter-sentiment-analysis](https://huggingface.co/datasets/carblacac/twitter-sentiment-analysis)
It contains a bunch of tweets with the corresponding sentiment of the tweet.

In [1]:
DATASET_ID = "carblacac/twitter-sentiment-analysis"
MODEL_ID = "bert-base-multilingual-cased"

In [2]:
# Install huggingface datasets to easily load it from the hub and transformers for future use.
!pip install datasets -q
!pip install transformers -q
!pip install accelerate -U -q
!pip install evaluate -q

# Also install some basic data usage and ml packages for future use
!pip install numpy -q
!pip install pandas -q
!pip install matplotlib -q
!pip install scipy -q
!pip install scikit-learn -q

# And of course your preferred machine learning framework
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 -q
!pip install tqdm

from tqdm.auto import tqdm
tqdm.pandas()



In [3]:
from datasets import load_dataset

train_dataset = load_dataset(path=DATASET_ID, split="train[:10%]", cache_dir="./data/")
eval_dataset = load_dataset(path=DATASET_ID, split="validation[:10%]", cache_dir="./data/")

In [4]:
train_dataset

Dataset({
    features: ['text', 'feeling'],
    num_rows: 11999
})

In [5]:
eval_dataset

Dataset({
    features: ['text', 'feeling'],
    num_rows: 3000
})

In [6]:
import re

# clean usernames
def clean_username(tweet):
    tweet["text"] = re.sub("@[a-zA-Z0-9]*", "@username", tweet["text"])
    return tweet

train_dataset = train_dataset.map(clean_username)
eval_dataset = eval_dataset.map(clean_username)

Map:   0%|          | 0/11999 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

In [7]:
# example
train_dataset[0]

{'text': '@username so happy that salman won.  btw the 14sec clip is truely a teaser',
 'feeling': 0}

In [8]:
train_dataset = train_dataset.to_pandas()
train_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11999 entries, 0 to 11998
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     11999 non-null  object
 1   feeling  11999 non-null  int32 
dtypes: int32(1), object(1)
memory usage: 140.7+ KB


In [9]:
train_dataset.describe()

Unnamed: 0,feeling
count,11999.0
mean,0.497708
std,0.500016
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


In [10]:
eval_dataset = eval_dataset.to_pandas()
eval_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     3000 non-null   object
 1   feeling  3000 non-null   int32 
dtypes: int32(1), object(1)
memory usage: 35.3+ KB


In [11]:
eval_dataset.describe()

Unnamed: 0,feeling
count,3000.0
mean,0.479333
std,0.499656
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

In [13]:
tokenized_train_data = tokenizer(train_dataset["text"].tolist(), padding="max_length", truncation=True)
type(tokenized_train_data)

transformers.tokenization_utils_base.BatchEncoding

In [14]:
tokenized_eval_data =  tokenizer(eval_dataset["text"].tolist(), padding="max_length", truncation=True)

## Training / Actual Finetuning
Training in this case is inputting the text into the model, wanting to receive the "feeling" as the output, and calculating a "loss" or "error" for a wrong answer which will be used to update the model internal weights in hopefully the right direction over many iterations.

### With Huggingface Trainer
Using Huggingface Trainer provides a simple interface for training and evaluating transformers models, with specific trainers existing for certain use cases with and special ones (like SFTTrainer) existing for autoregressive models like Llama2

In [15]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# import model
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2) # This is an example trainer for classification on a two lable task as we have it (positive / negative)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="./", evaluation_strategy="epoch")

In [17]:
# Add evaluation metrics, we will be using accuracy and F1 Score
import numpy as np
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1_score = f1_metric.compute(predictions=predictions, references=labels)

    return {"accuracy": accuracy, "f1": f1_score}

In [18]:
from datasets import Dataset

trainer_train_dataset = Dataset.from_dict(tokenized_train_data)
trainer_train_dataset = trainer_train_dataset.add_column("labels", train_dataset["feeling"])

trainer_eval_dataset = Dataset.from_dict(tokenized_eval_data)
trainer_eval_dataset = trainer_eval_dataset.add_column("labels", eval_dataset["feeling"])

In [19]:
trainer_train_dataset[0]

{'input_ids': [101,
  137,
  29115,
  23920,
  10380,
  54214,
  10189,
  31119,
  10589,
  11367,
  119,
  170,
  76797,
  10105,
  10247,
  10341,
  10350,
  48545,
  10124,
  22024,
  10454,
  169,
  57675,
  12754,
  102,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,

In [20]:
from transformers import Trainer

training_args = TrainingArguments(output_dir="./", evaluation_strategy="epoch", per_device_train_batch_size=16, per_device_eval_batch_size=8, gradient_accumulation_steps=4, fp16=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=trainer_train_dataset,
    eval_dataset=trainer_eval_dataset,
    compute_metrics=compute_metrics,
)

In [21]:
trainer.evaluate()

Trainer is attempting to log a value of "{'accuracy': 0.5216666666666666}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.021813224267211995}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 0.6956017017364502,
 'eval_precision': {'accuracy': 0.5216666666666666},
 'eval_f1': {'f1': 0.021813224267211995},
 'eval_runtime': 38.9452,
 'eval_samples_per_second': 77.031,
 'eval_steps_per_second': 9.629}

In [22]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,F1
0,No log,0.462145,{'accuracy': 0.795},{'f1': 0.784739236961848}
2,0.411400,0.507519,{'accuracy': 0.7983333333333333},{'f1': 0.7827648114901256}


Trainer is attempting to log a value of "{'accuracy': 0.795}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.784739236961848}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.8003333333333333}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.7950735545672255}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.7983333333333333}" of type <class 'dict'> for key "eval/

TrainOutput(global_step=561, training_loss=0.3944740669400084, metrics={'train_runtime': 1415.8672, 'train_samples_per_second': 25.424, 'train_steps_per_second': 0.396, 'total_flos': 9446213109534720.0, 'train_loss': 0.3944740669400084, 'epoch': 2.99})

In [23]:
import torch

torch.save(model, "./trained_model.bin")

In [24]:
del model
del trainer
del tokenizer

# Advanced Fine Tuning
## Train a Causal LM like Llama on a new instruction
Some very big models are too costly to completely fine tune from the ground up, so techniques like PEFT and LoRA exist to reduce the amount of parameters having to be finetuned, while still getting goodish results.

Another bonus of this is that the resulting "LoRA Adapter" can be changed quite fast on the fly.

We will teach pretrained mistral 7B a summarization task based on the samsum dataset.

In [2]:
!pip install peft bitsandbytes loralib py7zr -q

In [3]:
!pip install accelerate bitsandbytes



In [4]:
# Example Dataset entry
{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
}

{'id': '13818513',
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.',
 'dialogue': "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"}

In [5]:
from datasets import load_dataset

dataset = load_dataset("samsum")

len(dataset["train"])

14732

In [6]:
len(dataset["test"])

819

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

In [8]:
def preprocess_func(examples):
  return tokenizer(["Summarize the following content:\n\n" + x for x in examples["dialogue"]])

tokenized = dataset.map(preprocess_func, batched=True, num_proc=4)

In [9]:
tokenizer.decode(tokenized["train"][0]["input_ids"])

"<s> Summarize the following content:\n\nAmanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"

In [10]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [12]:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=False,
)

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0})

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
from peft import LoraConfig, get_peft_model, TaskType


# Define LoRA Config
lora_config = LoraConfig(
 r=16,
 lora_alpha=32,
 target_modules=["q", "v"],
 lora_dropout=0.05,
 bias="none",
 task_type=TaskType.SEQ_2_SEQ_LM
)

model = get_peft_model(model, lora_config)
mode.print_trainable_parameters()