<a href="https://colab.research.google.com/github/AlexUmnov/genai_course/blob/main/week4_llm_customization/week4_practice_session.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this week's practice session we'll learn how to customize GenAI models. In particular, we'll discuss:

- Parameter efficient tuning through prompt tuning and LoRA,
- Fine-tuning Stable Diffusion using the DreamBooth approach.

# Efficient finetuning

In this section we'll learn how to efficiently fine-tune NLP generative models.

Modern LLMs are so large that fine-tuning all their parameters requires a cluster of enterprise-grade GPUs. For example, just to run inference with a 7-billion-parameter LLM, which is among the smallest ones right now, you need approximately 28 GB of memory (if you store the weights in 32-bit floating-point precision). Training needs about 2.5 times more on top of that, because you also have to store gradients and optimizer states.
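
The arithmetic behind these numbers can be sketched as follows. The helper name `weights_memory_gb` is ours, and the training estimate assumes plain fp32 AdamW, which keeps one gradient and two moment buffers per parameter. Under that assumption the overhead comes out somewhat above the rough figure quoted above; the exact multiplier depends on the optimizer and on the precision used, and activations are ignored here.

```python
def weights_memory_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


num_params = 7e9  # a 7-billion-parameter model

inference_gb = weights_memory_gb(num_params)  # fp32 weights only

# Assumption: fp32 AdamW stores, per parameter, one gradient and
# two moment estimates in addition to the weight itself.
extra_copies = 3
training_gb = inference_gb * (1 + extra_copies)

print(f"inference: ~{inference_gb:.0f} GB, training: ~{training_gb:.0f} GB")
```

Halving `bytes_per_param` to 2 (16-bit weights) halves the inference estimate, which is one reason reduced precision is popular.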

In this part we will explore several resource-efficient strategies for LLM fine-tuning.

In this tutorial, alongside the usual `transformers` library, we will use the `PEFT` library, whose name stands for **Parameter-Efficient Fine-Tuning**. It implements several fine-tuning methods that leave most of the model weights frozen.

To demonstrate how it works, we will use `tweet_eval`, which is a collection of datasets for text classification. In particular, we'll pick the `irony` subset, which contains tweets (the `X`'s) and annotations of whether each text contains irony (the `y`'s).

We'll preprocess the data so that each item looks like

`Tweet text: {text} Label : irony / non irony`

For example:

`Tweet text: Corny jokes are my absolute favorite Label : irony`
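
As a rough illustration of this template, here is a small formatting helper. The function name `format_example` and the `LABEL_NAMES` mapping are our own assumptions for the sketch; the actual preprocessing lives in the course's boilerplate file and may differ in detail.

```python
from typing import Optional

# Assumed mapping from integer labels to the text the model should generate.
LABEL_NAMES = {0: "non irony", 1: "irony"}


def format_example(text: str, label: Optional[int] = None) -> str:
    """Build a training string (with label) or an inference prompt (without)."""
    prompt = f"Tweet text: {text} Label : "
    if label is None:
        return prompt  # the model is expected to continue from here
    return prompt + LABEL_NAMES[label]


print(format_example("Corny jokes are my absolute favorite", 1))
print(repr(format_example("Corny jokes are my absolute favorite")))
```

The same function covers both uses: with a label it yields a training example, without one it yields the prompt that the model has to complete.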

**Using a generative model for classification**

Generative models can potentially generate anything, and strings `irony` and `non irony` are no exception. We can just take a text, add "Label:" to the end and expect that the model predicts irony. The trick is to make the model generate not random stuff, but exactly the irony labels, and moreover the correct ones.

As our backbone we'll use a model called `pythia`. It's an open-source model by EleutherAI, which is also known for GPT-Neo and GPT-J, which are open-source reproductions of original GPT models.

In its time, Pythia was often used as a baseline for comparison in papers. The paper also introduced an influential suite for benchmarking LLMs ([GitHub](https://github.com/EleutherAI/pythia)).

Our particular Pythia model is a 16-layer decoder-only transformer with about 1 billion parameters. It was trained on a deduplicated version of the open-source dataset [The Pile](https://pile.eleuther.ai/). As of the end of 2024 it is not very capable, but it's small (so you won't need to spend much time and money working with it) and it responds well to fine-tuning.

## Prompt tuning

The idea is that instead of training the actual model weights, we train the prompt for the model, or more precisely prompt parameters.
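
A toy sketch of the idea, in plain Python with made-up sizes: the model's embedding table stays frozen, while a few extra "virtual token" vectors are the only thing that would be trained. All names here (`embedding_table`, `soft_prompt`, `embed_with_soft_prompt`) are illustrative and not part of any library.

```python
import random

random.seed(0)
hidden_size = 4  # toy embedding width
vocab = {"tweet": 0, "text": 1, "label": 2}

# Frozen: the model's own embedding table, one vector per vocabulary token.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(hidden_size)] for _ in vocab
]

# Trainable: vectors that correspond to no real token at all.
num_virtual_tokens = 2
soft_prompt = [
    [random.uniform(-1, 1) for _ in range(hidden_size)]
    for _ in range(num_virtual_tokens)
]


def embed_with_soft_prompt(token_ids):
    """Look up real tokens, then prepend the learned prompt vectors."""
    token_vectors = [embedding_table[i] for i in token_ids]
    return soft_prompt + token_vectors


inputs = embed_with_soft_prompt([vocab["tweet"], vocab["text"]])
print(len(inputs), "vectors go into the transformer")
```

During training only `soft_prompt` would receive gradient updates, so the number of trainable parameters is `num_virtual_tokens * hidden_size`, regardless of how large the model itself is.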


First of all, we need to install the necessary libraries:

In [1]:
!pip install -q peft transformers datasets einops


In [3]:
import os

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    get_linear_schedule_with_warmup
)
from peft import (
    get_peft_config,
    get_peft_model,
    PromptTuningInit,
    PromptTuningConfig,
    TaskType,
    PeftType
)
from tqdm import tqdm
import torch

from huggingface_boilerplate import prepare_dataloaders

We decided to hide some of the boilerplate code in the `huggingface_boilerplate.py` file. In case you're curious: the hidden code transforms a dataset with "text" and "label" columns into single sentences with corresponding attention masks, so that a decoder-only model can work on them and predict labels. Additionally, we pad all the sentences to the maximum length. Don't worry if this sounds cryptic now; we'll make it clear in the second part of the course.
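
To give a feel for what "padding with an attention mask" means, here is a minimal pure-Python sketch. It is not the code from `huggingface_boilerplate.py`; the function `left_pad` and the token ids are invented for illustration. Padding goes on the left because a decoder-only model continues the text from the right end.

```python
from typing import List, Tuple


def left_pad(
    sequences: List[List[int]], pad_id: int, max_length: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-id sequences on the left and build matching attention masks."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]  # naive truncation, for illustration only
        n_pad = max_length - len(seq)
        input_ids.append([pad_id] * n_pad + seq)
        # 0 marks padding the model should ignore, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8]], pad_id=0, max_length=4)
print(ids)   # [[0, 5, 6, 7], [0, 0, 0, 8]]
print(mask)  # [[0, 1, 1, 1], [0, 0, 0, 1]]
```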

In [4]:
model_name = "EleutherAI/pythia-1b-deduped"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side='left'
)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

train_dataloader, eval_dataloader, dataset = prepare_dataloaders(
    tokenizer,
    dataset_path="tweet_eval",
    dataset_name="irony",
    text_column="text",
    label_names_column="label",
    max_length=64,
    batch_size=8
)



Dataset splits: 2862 train, 784 test, 955 validation examples.

Dataset_sample: {'text': 'seeing ppl walking w/ crutches makes me really excited for the next 3 weeks of my life', 'label': 1, 'text_label': 'irony'}



Let's first see what our base model generates.

In [6]:
from IPython.display import display

print("Samples")

input_text = [
    f"Tweet text: {text} Label : "
    for text in dataset['test'][:8]['text']
]

display(input_text)

tokenized = tokenizer(input_text, return_tensors='pt', padding=True)
tokenized = {k: v.cuda() for k, v in tokenized.items()}

model = model.cuda()

output = model.generate(
    **tokenized,
    max_new_tokens=10,
    eos_token_id=tokenizer.eos_token_id
)

print("\n\nGenerations\n\n")

display(tokenizer.batch_decode(output, skip_special_tokens=True))

Samples


['Tweet text: @user Can U Help?||More conservatives needed on #TSU + get paid 4 posting stuff like this!||YOU $ can go to Label : ',
 'Tweet text: Just walked in to #Starbucks and asked for a "tall blonde" Hahahaha #irony Label : ',
 'Tweet text: #NOT GONNA WIN Label : ',
 'Tweet text: @user He is exactly that sort of person. Weirdo! Label : ',
 "Tweet text: So much #sarcasm at work mate 10/10 #boring 100% #dead mate full on #shit absolutely #sleeping mate can't handle the #sarcasm Label : ",
 'Tweet text: Corny jokes are my absolute favorite Label : ',
 'Tweet text: People complain about my backround pic and all I feel is like "hey don\'t blame me, Albert E might have spoken those words" #sarcasm #life Label : ',
 'Tweet text: @user @user Darn, my sock joke needs fixing? Label : ']

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.




Generations




['Tweet text: @user Can U Help?||More conservatives needed on #TSU + get paid 4 posting stuff like this!||YOU $ can go to Label : ###||#TSU #TSU #TS',
 'Tweet text: Just walked in to #Starbucks and asked for a "tall blonde" Hahahaha #irony Label : ###\n\nI\'m not sure if this is',
 'Tweet text: #NOT GONNA WIN Label : \n#NOT GONNA WIN\n\n#',
 'Tweet text: @user He is exactly that sort of person. Weirdo! Label : \n#1\n\nA:\n\nYou',
 "Tweet text: So much #sarcasm at work mate 10/10 #boring 100% #dead mate full on #shit absolutely #sleeping mate can't handle the #sarcasm Label : xtc_tweet_text_1\n",
 'Tweet text: Corny jokes are my absolute favorite Label : \n#1: "I\'m a little bit',
 'Tweet text: People complain about my backround pic and all I feel is like "hey don\'t blame me, Albert E might have spoken those words" #sarcasm #life Label : xtian\n\nI\'m not sure if this',
 'Tweet text: @user @user Darn, my sock joke needs fixing? Label : \n#!/usr/bin/env python\n']

As you see, the model generates some random stuff. We need to teach it to respect the format of the answer, and we'll do it through fine-tuning.

A fine-tuning strategy is described in the `peft_config` variable. In this case we need `PromptTuningConfig`.

Note that the model we've chosen is quite large, but it should still fit in a T4's memory. If you have a better GPU, feel free to experiment.

In [7]:
device = "cuda"
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=8,
    prompt_tuning_init_text="Classify if the tweet contains irony:",
    tokenizer_name_or_path=model_name,
)

lr = 3e-2
num_epochs = 5

In [8]:
model = get_peft_model(model.cpu(), peft_config)
model.print_trainable_parameters()

trainable params: 16,384 || all params: 1,011,798,016 || trainable%: 0.0016


As you can see, we are only training a small fraction of our model's parameters. That's why it's called "Parameter efficient fine-tuning".
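
The printed numbers can be reproduced by hand. The sketch below assumes an embedding width of 2048 for this checkpoint; that value is not shown in the notebook, so treat it as an assumption. The total parameter count is copied from the output above.

```python
num_virtual_tokens = 8  # from the PromptTuningConfig above
hidden_size = 2048      # assumed embedding width of the model

# One trainable vector of size `hidden_size` per virtual token.
trainable_params = num_virtual_tokens * hidden_size

all_params = 1_011_798_016  # as printed by print_trainable_parameters()
trainable_pct = 100 * trainable_params / all_params

print(f"trainable params: {trainable_params:,} ({trainable_pct:.4f}%)")
```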

In [9]:
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)
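
For intuition, the learning-rate rule behind a linear schedule with warmup can be written in a few lines. This is our own pure-Python re-implementation of the rule as we understand it, not the library's code, and the function name `linear_schedule_multiplier` is hypothetical. It returns the factor by which the base learning rate is multiplied at a given step.

```python
import math


def linear_schedule_multiplier(
    step: int, num_warmup_steps: int, num_training_steps: int
) -> float:
    """Ramp up linearly during warmup, then decay linearly to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-2
# Assuming 2862 training examples in batches of 8, for 5 epochs.
total_steps = math.ceil(2862 / 8) * 5

for step in (0, total_steps // 2, total_steps):
    lr_now = base_lr * linear_schedule_multiplier(step, 0, total_steps)
    print(f"step {step:4d}: lr = {lr_now:.4f}")
```

With zero warmup steps, as in the cell above, the learning rate simply starts at its base value and falls linearly to zero by the last training step.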

Here's some more boilerplate PyTorch training code.

In [10]:
model = model.to(device)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(
                torch.argmax(outputs.logits, -1).detach().cpu().numpy(),
                skip_special_tokens=True
            )
        )

    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}:\n{train_ppl=}\n{train_epoch_loss=}\n{eval_ppl=}\n{eval_epoch_loss=}")

100%|██████████| 358/358 [00:51<00:00,  7.00it/s]
100%|██████████| 98/98 [00:07<00:00, 13.25it/s]


epoch=0:
train_ppl=tensor(1.4943, device='cuda:0')
train_epoch_loss=tensor(0.4016, device='cuda:0')
eval_ppl=tensor(1.2541, device='cuda:0')
eval_epoch_loss=tensor(0.2265, device='cuda:0')


100%|██████████| 358/358 [00:51<00:00,  7.01it/s]
100%|██████████| 98/98 [00:07<00:00, 13.36it/s]


epoch=1:
train_ppl=tensor(1.2481, device='cuda:0')
train_epoch_loss=tensor(0.2216, device='cuda:0')
eval_ppl=tensor(1.2136, device='cuda:0')
eval_epoch_loss=tensor(0.1936, device='cuda:0')


100%|██████████| 358/358 [00:51<00:00,  7.02it/s]
100%|██████████| 98/98 [00:07<00:00, 13.42it/s]


epoch=2:
train_ppl=tensor(1.2263, device='cuda:0')
train_epoch_loss=tensor(0.2040, device='cuda:0')
eval_ppl=tensor(1.2156, device='cuda:0')
eval_epoch_loss=tensor(0.1952, device='cuda:0')


100%|██████████| 358/358 [00:50<00:00,  7.02it/s]
100%|██████████| 98/98 [00:07<00:00, 13.35it/s]


epoch=3:
train_ppl=tensor(1.2066, device='cuda:0')
train_epoch_loss=tensor(0.1878, device='cuda:0')
eval_ppl=tensor(1.2113, device='cuda:0')
eval_epoch_loss=tensor(0.1917, device='cuda:0')


100%|██████████| 358/358 [00:50<00:00,  7.03it/s]
100%|██████████| 98/98 [00:07<00:00, 13.45it/s]

epoch=4:
train_ppl=tensor(1.1767, device='cuda:0')
train_epoch_loss=tensor(0.1627, device='cuda:0')
eval_ppl=tensor(1.2042, device='cuda:0')
eval_epoch_loss=tensor(0.1858, device='cuda:0')
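
The perplexity values in the log are just the exponent of the average cross-entropy loss, as in the training loop above. A short sketch makes the relation explicit; the helper `perplexity` and the toy probabilities are ours, added only for illustration.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that gives each correct token probability 0.5 is, on average,
# as uncertain as a fair coin flip: perplexity 2.
print(perplexity([0.5, 0.5, 0.5]))

# Final-epoch evaluation loss from the log above.
eval_epoch_loss = 0.1858
eval_ppl = math.exp(eval_epoch_loss)
print(f"{eval_ppl:.4f}")  # close to the eval_ppl value printed in the log
```

A perplexity near 1 means the model is almost certain about the tokens it is scored on.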





It's a good idea to save the fine-tuned model so that the results are easier to reproduce later.

In [11]:
model.save_pretrained("models/prompt_tuning")

We'll also show how to load the model from a saved checkpoint:

In [None]:
# This is a download link for our pretrained model, just in case
!gdown https://drive.google.com/drive/folders/13ClAKeOunxn7GyEexe_7JyZpVrdphL6c?usp=drive_link -O /content/models/prompt_tuning --folder

Retrieving folder contents
Processing file 1bXwEOCgqNHvhX5VTI-Rq8aigXK_tEQMm adapter_config.json
Processing file 1eYb2eEOgamgtgDd03Er4dikYwzPc6E8z adapter_model.bin
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1bXwEOCgqNHvhX5VTI-Rq8aigXK_tEQMm
To: /content/models/prompt_tuning/adapter_config.json
100% 493/493 [00:00<00:00, 3.19MB/s]
Downloading...
From: https://drive.google.com/uc?id=1eYb2eEOgamgtgDd03Er4dikYwzPc6E8z
To: /content/models/prompt_tuning/adapter_model.bin
100% 66.3k/66.3k [00:00<00:00, 2.80MB/s]
Download completed


To load a PEFT fine-tuned model, we first need to load the checkpoint of the base model and then apply our fine-tuned adapter on top of it.

In [12]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

config = PeftConfig.from_pretrained("models/prompt_tuning")
tokenizer = AutoTokenizer.from_pretrained(
    config.base_model_name_or_path,
    padding_side='left'
)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path
)
model = PeftModel.from_pretrained(model, "models/prompt_tuning")

Let's see what our fine-tuned model predicts.

In [13]:
# we add data loading part again,
# so you could restart runtime in case of CUDA memory problems

import datasets
from torch.utils.data import DataLoader

dataset = datasets.load_dataset("tweet_eval", "irony")


In [14]:
from IPython.display import display


print("Samples")

input_text = [
    f"Tweet text: {text} Label : "
    for text in dataset['test'][:8]['text']
]

display(input_text)

tokenized = tokenizer(input_text, return_tensors='pt', padding=True)
tokenized = {k: v.cuda() for k, v in tokenized.items()}

model = model.cuda()

output = model.generate(
    **tokenized,
    max_new_tokens=10,
    eos_token_id=tokenizer.eos_token_id
)

print("\n\nGenerations\n\n")

display(
    tokenizer.batch_decode(output, skip_special_tokens=True)
)

Samples


['Tweet text: @user Can U Help?||More conservatives needed on #TSU + get paid 4 posting stuff like this!||YOU $ can go to Label : ',
 'Tweet text: Just walked in to #Starbucks and asked for a "tall blonde" Hahahaha #irony Label : ',
 'Tweet text: #NOT GONNA WIN Label : ',
 'Tweet text: @user He is exactly that sort of person. Weirdo! Label : ',
 "Tweet text: So much #sarcasm at work mate 10/10 #boring 100% #dead mate full on #shit absolutely #sleeping mate can't handle the #sarcasm Label : ",
 'Tweet text: Corny jokes are my absolute favorite Label : ',
 'Tweet text: People complain about my backround pic and all I feel is like "hey don\'t blame me, Albert E might have spoken those words" #sarcasm #life Label : ',
 'Tweet text: @user @user Darn, my sock joke needs fixing? Label : ']

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.




Generations






['Tweet text: @user Can U Help?||More conservatives needed on #TSU + get paid 4 posting stuff like this!||YOU $ can go to Label : non irony',
 'Tweet text: Just walked in to #Starbucks and asked for a "tall blonde" Hahahaha #irony Label : irony',
 'Tweet text: #NOT GONNA WIN Label : non irony',
 'Tweet text: @user He is exactly that sort of person. Weirdo! Label : irony',
 "Tweet text: So much #sarcasm at work mate 10/10 #boring 100% #dead mate full on #shit absolutely #sleeping mate can't handle the #sarcasm Label : non irony",
 'Tweet text: Corny jokes are my absolute favorite Label : irony',
 'Tweet text: People complain about my backround pic and all I feel is like "hey don\'t blame me, Albert E might have spoken those words" #sarcasm #life Label : non irony',
 'Tweet text: @user @user Darn, my sock joke needs fixing? Label : non irony']

We can see that the model mastered the output format, but what's the actual accuracy?

Note that evaluation is not entirely straightforward, because the output is text. We'll have to look for the strings "irony" and "non irony" in the generated text.
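
One subtlety is worth spelling out: the string "non irony" itself contains "irony", so searching for the bare word would count every prediction as ironic. A minimal sketch of a safer check, using toy outputs of our own (the helper names `parse_label` and `accuracy` are illustrative):

```python
def parse_label(generated: str) -> int:
    """Return 1 for irony, 0 otherwise (including malformed outputs)."""
    return 1 if "Label : irony" in generated else 0


def accuracy(predictions, labels) -> float:
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# The naive check misfires on the negative class:
print("irony" in "Tweet text: b Label : non irony")  # True

outputs = [
    "Tweet text: a Label : irony",
    "Tweet text: b Label : non irony",
    "Tweet text: c Label : ###",  # malformed generation
]
preds = [parse_label(o) for o in outputs]
print(preds)  # [1, 0, 0]
print(accuracy(preds, [1, 0, 1]))
```

Note that malformed generations are silently counted as "non irony" here, and a tweet that itself contains the text "Label : irony" would fool the check; for a tutorial-sized evaluation this is usually acceptable.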

In [15]:
from tqdm.auto import tqdm

eval_texts = [
    f"Tweet text: {text} Label : "
    for text in dataset['test']['text']
]
eval_labels = dataset['test']['label']

eval_text_dataloader = DataLoader(
    eval_texts, shuffle=False, batch_size=8
)

model = model.cuda()

output_texts = []

for batch in tqdm(eval_text_dataloader):
    tokenized_batch = tokenizer(
        batch,
        return_tensors='pt',
        padding=True
    )
    tokenized_batch = {
        k: v.cuda() for k, v in tokenized_batch.items()
    }
    output = model.generate(
        **tokenized_batch,
        max_new_tokens=10,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )
    output_text = tokenizer.batch_decode(
        output,
        skip_special_tokens=True
    )
    output_texts.extend(output_text)

output_labels = [
    1 if "Label : irony" in text else 0
    for text in output_texts
]

accuracy = sum([
    1 if prediction == label else 0
    for label, prediction in zip(eval_labels, output_labels)
]) / len(eval_labels)

  0%|          | 0/98 [00:00<?, ?it/s]

In [16]:
accuracy

0.7104591836734694

This doesn't sound like much, but keep in mind that detecting irony is not an easy task even for humans.

## LoRA

LoRA doesn't actually change the weights of a model; instead, it trains a matrix that is added to the model's weights. So it doesn't have to work with the original weights directly or keep their gradients in memory. Furthermore, to reduce memory consumption, LoRA works in a much smaller dimension by decomposing this increment matrix into two transformations: one into a lower-rank space and one back out of it. Hence the name **Low-Rank Adaptation**.
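
The following toy sketch shows the mechanics in plain Python with tiny made-up dimensions. The frozen weight `W` is left untouched, and the update is the product of two small matrices `A` and `B`. Initializing `B` with zeros (so that training starts from the unmodified model) follows the common LoRA convention; all names are illustrative and not taken from the PEFT library.

```python
import random

random.seed(0)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


d_in, d_out, r = 8, 8, 2

# Frozen pretrained weight, shape (d_out, d_in).
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A maps d_in -> r, B maps r -> d_out.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: no change at the start


def lora_forward(x, scale=1.0):
    base = matvec(W, x)                # frozen path
    update = matvec(B, matvec(A, x))   # down to rank r, then back up
    return [b + scale * u for b, u in zip(base, update)]


x = [1.0] * d_in
print(lora_forward(x) == matvec(W, x))  # True while B is still all zeros

full_params = d_in * d_out
lora_params = r * (d_in + d_out)
print(full_params, lora_params)
```

The saving grows with the layer size: for a square layer of width `d`, the full update has `d * d` entries, while the low-rank factors have only `2 * r * d`.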


**Important**

Here we advise you to restart your runtime and rerun the data preparation function, because GPU memory might already be exhausted by the previous training run.



In [17]:
import os

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    get_linear_schedule_with_warmup
)
from peft import (
    get_peft_config,
    get_peft_model,
    LoraConfig,
    TaskType,
    PeftType
)
from tqdm import tqdm
import torch

from huggingface_boilerplate import prepare_dataloaders

In [18]:
model_name = "EleutherAI/pythia-1b-deduped"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side='left'
)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

train_dataloader, eval_dataloader, dataset = prepare_dataloaders(
    tokenizer,
    dataset_path="tweet_eval",
    dataset_name="irony",
    text_column="text",
    label_names_column="label",
    max_length=64,
    batch_size=8
)

Dataset_sample: {'text': 'seeing ppl walking w/ crutches makes me really excited for the next 3 weeks of my life', 'label': 1, 'text_label': 'irony'}



To initialise LoRA fine-tuning, we need to specify which layers we would like to fine-tune this way.

In [19]:
peft_config = LoraConfig(
    r=32,
    target_modules=[
        'query_key_value',
        'dense',
        'dense_h_to_4h',
        'dense_4h_to_h'
    ]
)

In [20]:
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()


lr = 1e-5
num_epochs = 5

trainable params: 16,777,216 || all params: 1,028,558,848 || trainable%: 1.6311


As you can see, we are still training just a tiny fraction of the model's parameters.
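
The trainable parameter count printed above can be reproduced with simple arithmetic. The layer shapes below (hidden size 2048, 16 transformer blocks, a fused query/key/value projection and a 4x MLP expansion) are our assumptions about the checkpoint loaded in the code; they are not printed anywhere in the notebook.

```python
hidden = 2048    # assumed hidden size of the model
num_layers = 16  # assumed number of transformer blocks
r = 32           # rank from the LoraConfig above

# (in_features, out_features) of each targeted linear layer; assumed shapes.
target_shapes = {
    "query_key_value": (hidden, 3 * hidden),
    "dense": (hidden, hidden),
    "dense_h_to_4h": (hidden, 4 * hidden),
    "dense_4h_to_h": (4 * hidden, hidden),
}

# Each adapted layer gets two factors: (r x in_features) and (out_features x r).
per_block = sum(r * (d_in + d_out) for d_in, d_out in target_shapes.values())
trainable_params = per_block * num_layers

all_params = 1_028_558_848  # as printed by print_trainable_parameters()
print(f"{trainable_params:,} ({100 * trainable_params / all_params:.4f}%)")
```

Doubling `r` doubles the number of trainable parameters, which makes the rank the main knob for trading adapter capacity against memory.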

In [21]:
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)

In [22]:
model = model.cuda()

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.cuda() for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.cuda() for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(
                torch.argmax(outputs.logits, -1).detach().cpu().numpy(),
                skip_special_tokens=True
            )
        )

    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}:\n{train_ppl=}\n{train_epoch_loss=}\n{eval_ppl=}\n{eval_epoch_loss=}")

100%|██████████| 358/358 [00:49<00:00,  7.24it/s]
100%|██████████| 98/98 [00:06<00:00, 14.20it/s]


epoch=0:
train_ppl=tensor(128.0153, device='cuda:0')
train_epoch_loss=tensor(4.8521, device='cuda:0')
eval_ppl=tensor(1.2784, device='cuda:0')
eval_epoch_loss=tensor(0.2456, device='cuda:0')


100%|██████████| 358/358 [00:49<00:00,  7.24it/s]
100%|██████████| 98/98 [00:06<00:00, 14.07it/s]


epoch=1:
train_ppl=tensor(1.2628, device='cuda:0')
train_epoch_loss=tensor(0.2333, device='cuda:0')
eval_ppl=tensor(1.2575, device='cuda:0')
eval_epoch_loss=tensor(0.2291, device='cuda:0')


100%|██████████| 358/358 [00:49<00:00,  7.23it/s]
100%|██████████| 98/98 [00:06<00:00, 14.08it/s]


epoch=2:
train_ppl=tensor(1.2313, device='cuda:0')
train_epoch_loss=tensor(0.2080, device='cuda:0')
eval_ppl=tensor(1.2278, device='cuda:0')
eval_epoch_loss=tensor(0.2052, device='cuda:0')


100%|██████████| 358/358 [00:49<00:00,  7.23it/s]
100%|██████████| 98/98 [00:06<00:00, 14.17it/s]


epoch=3:
train_ppl=tensor(1.2146, device='cuda:0')
train_epoch_loss=tensor(0.1944, device='cuda:0')
eval_ppl=tensor(1.2205, device='cuda:0')
eval_epoch_loss=tensor(0.1992, device='cuda:0')


100%|██████████| 358/358 [00:49<00:00,  7.23it/s]
100%|██████████| 98/98 [00:06<00:00, 14.17it/s]

epoch=4:
train_ppl=tensor(1.2068, device='cuda:0')
train_epoch_loss=tensor(0.1880, device='cuda:0')
eval_ppl=tensor(1.2108, device='cuda:0')
eval_epoch_loss=tensor(0.1913, device='cuda:0')





Let's again save the fine-tuned model:

In [23]:
model.save_pretrained("models/lora")

In [None]:
# once again a download link just in case
!gdown https://drive.google.com/drive/folders/11Aw4BPO73AwFUwsWt-6cu_Pmfx2U8cZ7?usp=sharing --folder -O /content/models/lora

Retrieving folder contents
Processing file 1NO9-yRScq5DbKHHxcetbbix3Hck1Eeag adapter_config.json
Processing file 1Rw4G91iXSu5mU2Q66EeJTf8qRJZOZ0a0 adapter_model.bin
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1NO9-yRScq5DbKHHxcetbbix3Hck1Eeag
To: /content/models/lora/adapter_config.json
100% 610/610 [00:00<00:00, 3.48MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Rw4G91iXSu5mU2Q66EeJTf8qRJZOZ0a0
To: /content/models/lora/adapter_model.bin
100% 67.2M/67.2M [00:01<00:00, 44.6MB/s]
Download completed


In [24]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

config = PeftConfig.from_pretrained("models/lora")
tokenizer = AutoTokenizer.from_pretrained(
    config.base_model_name_or_path,
    padding_side='left'
)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path
)
model = PeftModel.from_pretrained(model, "models/lora")

# we add data loading part again, so you could restart runtime in case of CUDA memory problems

import datasets
from torch.utils.data import DataLoader

dataset = datasets.load_dataset("tweet_eval", "irony")

In [25]:
from IPython.display import display

print("Samples")

input_text = [
    f"Tweet text: {text} Label : "
    for text in dataset['test'][:8]['text']
]

display(input_text)

tokenized = tokenizer(input_text, return_tensors='pt', padding=True)
tokenized = {k: v.cuda() for k, v in tokenized.items()}

model = model.cuda()

output = model.generate(
    **tokenized,
    max_new_tokens=10,
    eos_token_id=tokenizer.eos_token_id
)

print("\n\nGenerations\n\n")

display(tokenizer.batch_decode(output, skip_special_tokens=True))

Samples


['Tweet text: @user Can U Help?||More conservatives needed on #TSU + get paid 4 posting stuff like this!||YOU $ can go to Label : ',
 'Tweet text: Just walked in to #Starbucks and asked for a "tall blonde" Hahahaha #irony Label : ',
 'Tweet text: #NOT GONNA WIN Label : ',
 'Tweet text: @user He is exactly that sort of person. Weirdo! Label : ',
 "Tweet text: So much #sarcasm at work mate 10/10 #boring 100% #dead mate full on #shit absolutely #sleeping mate can't handle the #sarcasm Label : ",
 'Tweet text: Corny jokes are my absolute favorite Label : ',
 'Tweet text: People complain about my backround pic and all I feel is like "hey don\'t blame me, Albert E might have spoken those words" #sarcasm #life Label : ',
 'Tweet text: @user @user Darn, my sock joke needs fixing? Label : ']

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.




Generations




['Tweet text: @user Can U Help?||More conservatives needed on #TSU + get paid 4 posting stuff like this!||YOU $ can go to Label : non irony',
 'Tweet text: Just walked in to #Starbucks and asked for a "tall blonde" Hahahaha #irony Label : irony',
 'Tweet text: #NOT GONNA WIN Label : non irony',
 'Tweet text: @user He is exactly that sort of person. Weirdo! Label : irony',
 "Tweet text: So much #sarcasm at work mate 10/10 #boring 100% #dead mate full on #shit absolutely #sleeping mate can't handle the #sarcasm Label : irony",
 'Tweet text: Corny jokes are my absolute favorite Label : irony',
 'Tweet text: People complain about my backround pic and all I feel is like "hey don\'t blame me, Albert E might have spoken those words" #sarcasm #life Label : irony',
 'Tweet text: @user @user Darn, my sock joke needs fixing? Label : irony']

As you see, the model also has a nice grip on the output format.

In [26]:
from tqdm.auto import tqdm

eval_texts = [f"Tweet text: {text} Label : "  for text in dataset['test']['text']]
eval_labels = dataset['test']['label']

eval_text_dataloader = DataLoader(
    eval_texts, shuffle=False, batch_size=8
)

model = model.cuda()

output_texts = []

for batch in tqdm(eval_text_dataloader):
    tokenized_batch = tokenizer(
        batch,
        return_tensors='pt',
        padding=True
    )
    tokenized_batch = {
        k: v.cuda() for k, v in tokenized_batch.items()
    }
    output = model.generate(
        **tokenized_batch,
        max_new_tokens=10,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )
    output_text = tokenizer.batch_decode(
        output,
        skip_special_tokens=True
    )
    output_texts.extend(output_text)

output_labels = [
    1 if "Label : irony" in text else 0
    for text in output_texts
]

accuracy = sum([
    1 if prediction == label else 0
    for label, prediction in zip(eval_labels, output_labels)
]) / len(eval_labels)
accuracy

  0%|          | 0/98 [00:00<?, ?it/s]

0.6875

## PEFT takeaways

After finishing this section you probably have a question: why fine-tune a model at all if we can control it using prompts?

Here are some reasons why:

1. Prompting does not allow you to bring new information into the model.

  This means that if knowledge about a specific task is missing, you can still try to supply it via the prompt, but the result depends a lot on how well you can compress the task's knowledge into your prompt.

  Also, you can't really fit a whole dataset inside a prompt, because there's a token limit.

2. Prompting is a soft control.

  When you prompt a model to do something, it's still a soft control mechanism. For example, you cannot force a model to always answer in a specific tone of voice with a prompt alone. Depending on how well you crafted the prompt, some specific case might break the control mechanism you've set up.

  With fine-tuning you change the model's weights themselves, so it becomes much less likely to respond in a different way.

So, if you really need to fine-tune a model and you don't have sufficient resources for a full fine-tuning, PEFT is here to help!