# Lightweight Fine-Tuning Project

The choices made for this project:

* <b>PEFT technique:</b> LoRA 
* <b>Model</b>: GPT-2 
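* <b>Evaluation approach:</b> Accuracy, precision, recall, specificity, and negative predictive value (NPV) on the test split, before and after fine-tuning
* <b>Fine-tuning dataset:</b> Subset of carblacac/twitter-sentiment-analysis (20,000 training, 1,000 validation, and 10,000 test examples)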


## Load dataset
- <b>Link:</b> https://huggingface.co/datasets/carblacac/twitter-sentiment-analysis
- <b>Task class:</b> Sentiment Analysis (Text Classification)

In [1]:
from datasets import load_dataset

dataset = load_dataset("carblacac/twitter-sentiment-analysis", split=['train[:20000]', 'validation[:1000]', 'test[:10000]'])
dataset

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


[Dataset({
     features: ['text', 'feeling'],
     num_rows: 20000
 }),
 Dataset({
     features: ['text', 'feeling'],
     num_rows: 1000
 }),
 Dataset({
     features: ['text', 'feeling'],
     num_rows: 10000
 })]

In [2]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
tokenizer

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}

In [4]:
def tokenize_fn(examples):
    # Truncate only; padding is applied per batch later by DataCollatorWithPadding
    return tokenizer(examples['text'], truncation=True)

train_dataset = dataset[0].map(tokenize_fn, batched=True)
val_dataset = dataset[1].map(tokenize_fn, batched=True)
test_dataset = dataset[2].map(tokenize_fn, batched=True)

train_dataset = train_dataset.add_column('label', train_dataset['feeling'])
val_dataset = val_dataset.add_column('label', val_dataset['feeling'])
test_dataset = test_dataset.add_column('label', test_dataset['feeling'])

## Loading and Evaluating a Foundation Model

The cells below load the pre-trained GPT-2 model with a newly initialized classification head and evaluate it on the test split before any fine-tuning. This gives the baseline for the later comparison.

In [5]:
from transformers import AutoModelForSequenceClassification


model = AutoModelForSequenceClassification.from_pretrained('gpt2',
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"}, 
    label2id={"NEGATIVE": 0, "POSITIVE": 1})
model.config.pad_token_id = model.config.eos_token_id
model

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)

In [6]:
from transformers import Trainer, DataCollatorWithPadding
import numpy as np

# https://en.wikipedia.org/wiki/Precision_and_recall
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    # labels: 0 -> Negative 1 -> Positive
    cls_correctness = (predictions == labels)
    labels_bin = (labels == 1)
    # Confusion-matrix counts, with POSITIVE (label 1) as the positive class
    tp = ((cls_correctness) & (labels_bin)).sum()    # label 1, predicted 1
    tn = ((cls_correctness) & (~labels_bin)).sum()   # label 0, predicted 0
    fn = ((~cls_correctness) & (labels_bin)).sum()   # label 1, predicted 0
    fp = ((~cls_correctness) & (~labels_bin)).sum()  # label 0, predicted 1
    
    return {
        "accuracy": (tp+tn)/(tp+tn+fp+fn),
        "precision": tp/(tp+fp),
        "recall": tp/(tp+fn),
        "negative_predictive_value": tn/(tn+fn),
        "specificity": tn/(tn+fp),
    }


trainer = Trainer(
    model=model,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

trainer.evaluate()

{'eval_loss': 2.250124216079712,
 'eval_accuracy': 0.5057,
 'eval_precision': 0.5056505650565056,
 'eval_recall': 1.0,
 'eval_negative_predictive_value': 1.0,
 'eval_specificity': 0.0002022653721682848,
 'eval_runtime': 67.7049,
 'eval_samples_per_second': 147.7,
 'eval_steps_per_second': 18.462}
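
As a sanity check on the metric definitions, the following is a minimal pure-Python sketch that uses the standard confusion-matrix conventions: a false positive is a NEGATIVE sample predicted as POSITIVE, and a false negative is a POSITIVE sample predicted as NEGATIVE. The function name `binary_metrics` and the toy data are hypothetical and only serve as an illustration; they are not part of the notebook.

```python
def binary_metrics(predictions, labels):
    """Confusion-matrix metrics with label 1 (POSITIVE) as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is empty
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
    }


# Hypothetical toy data: a classifier that predicts POSITIVE almost always
toy_labels = [1, 1, 1, 0, 0, 0]
toy_predictions = [1, 1, 1, 1, 1, 0]
toy_metrics = binary_metrics(toy_predictions, toy_labels)
print(toy_metrics)
```

In this toy case recall is 1.0 and specificity is 1/3, which is the typical signature of a classifier biased toward the positive class.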

## Performing Parameter-Efficient Fine-Tuning

The cells below create a PEFT model from the loaded model, run a training loop, and save the PEFT model weights.

In [6]:
from transformers import AutoModelForSequenceClassification


model = AutoModelForSequenceClassification.from_pretrained(
    'gpt2', 
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"}, 
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
    device_map='auto'
)
model.config.pad_token_id = model.config.eos_token_id
model


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)

In [7]:
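from peft import LoraConfig, get_peft_model

# https://github.com/TimDettmers/bitsandbytes/issues/762
config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    target_modules=['c_attn', 'c_proj'],  # matches attn.c_attn, attn.c_proj and mlp.c_proj
    lora_alpha=32,                        # updates are scaled by lora_alpha / r = 4
    lora_dropout=0.05,                    # dropout on the input of the LoRA branch
    bias="none"                           # keep all bias terms frozen
)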

lora_model = get_peft_model(model, config)

lora_model.print_trainable_parameters()

lora_model

trainable params: 811,008 || all params: 125,252,352 || trainable%: 0.6474992182182735




PeftModel(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D()
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
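
The reported count of 811,008 trainable parameters can be reproduced by hand. The sketch below assumes that the `c_proj` entry in `target_modules` matches both the attention output projection and the MLP output projection in every block, and that the MLP hidden size is 4 × 768 = 3072, as in GPT-2 small. The helper `lora_params` is a hypothetical name used only for this illustration.

```python
# GPT-2 small dimensions, taken from the model printout
hidden = 768
n_layers = 12
rank = 8

# (in_features, out_features) of each module matched by target_modules
targets = {
    "attn.c_attn": (hidden, 3 * hidden),  # fused query/key/value projection
    "attn.c_proj": (hidden, hidden),      # attention output projection
    "mlp.c_proj": (4 * hidden, hidden),   # MLP output projection
}


def lora_params(fan_in, fan_out, r):
    # lora_A maps fan_in -> r and lora_B maps r -> fan_out, both without bias
    return r * fan_in + r * fan_out


per_layer = sum(lora_params(i, o, rank) for i, o in targets.values())
total = n_layers * per_layer
print(per_layer, total)  # 67584 811008
```

The total matches the `print_trainable_parameters()` output, which suggests that only the LoRA matrices are trainable in this configuration.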
      

In [8]:
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

trainer = Trainer(
    model=lora_model,
    args=TrainingArguments(
        output_dir="./gpt2-peft",
        learning_rate=8e-4,
        per_device_train_batch_size=12,
        per_device_eval_batch_size=12,
        num_train_epochs=4,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        optim="adamw_torch",
        label_names=["labels"]
    ),
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.4624,0.414775,0.82
2,0.3977,0.380856,0.838
3,0.3276,0.3927,0.824
4,0.2569,0.420718,0.84


TrainOutput(global_step=6668, training_loss=0.3658030051704503, metrics={'train_runtime': 2886.9274, 'train_samples_per_second': 27.711, 'train_steps_per_second': 2.31, 'total_flos': 1.2537938140987392e+16, 'train_loss': 0.3658030051704503, 'epoch': 4.0})
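
Two numbers in this output can be checked with simple arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept, which are the `TrainingArguments` defaults as far as I know. It also picks the best epoch by validation loss, on the assumption that `load_best_model_at_end=True` without an explicit `metric_for_best_model` falls back to the evaluation loss.

```python
import math

train_examples = 20_000
batch_size = 12
epochs = 4

# The last, partially filled batch of each epoch still counts as one step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 6668, matching global_step

# (epoch, validation loss, accuracy) copied from the training log
log = [
    (1, 0.414775, 0.82),
    (2, 0.380856, 0.838),
    (3, 0.3927, 0.824),
    (4, 0.420718, 0.84),
]
best_by_loss = min(log, key=lambda row: row[1])
best_by_accuracy = max(log, key=lambda row: row[2])
print(best_by_loss[0], best_by_accuracy[0])  # 2 4
```

Under that assumption, the checkpoint restored at the end of training would be the one from epoch 2, even though epoch 4 has slightly higher validation accuracy. The rising validation loss after epoch 2 alongside a falling training loss hints at the onset of overfitting.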

In [9]:
lora_model.save_pretrained("gpt2-peft")

## Performing Inference with a PEFT Model

The cells below load the saved PEFT model weights and evaluate the trained PEFT model on the same test split, so that the results can be compared with those from before fine-tuning.

In [7]:
from peft import AutoPeftModelForSequenceClassification

inference_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "gpt2-peft",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"}, 
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
    device_map='auto'
)
inference_model.config.pad_token_id = inference_model.config.eos_token_id
inference_model

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D()
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict

In [8]:
from transformers import Trainer, DataCollatorWithPadding
import numpy as np

# https://en.wikipedia.org/wiki/Precision_and_recall
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    # labels: 0 -> Negative 1 -> Positive
    cls_correctness = (predictions == labels)
    labels_bin = (labels == 1)
    # Confusion-matrix counts, with POSITIVE (label 1) as the positive class
    tp = ((cls_correctness) & (labels_bin)).sum()    # label 1, predicted 1
    tn = ((cls_correctness) & (~labels_bin)).sum()   # label 0, predicted 0
    fn = ((~cls_correctness) & (labels_bin)).sum()   # label 1, predicted 0
    fp = ((~cls_correctness) & (~labels_bin)).sum()  # label 0, predicted 1
    
    return {
        "accuracy": (tp+tn)/(tp+tn+fp+fn),
        "precision": tp/(tp+fp),
        "recall": tp/(tp+fn),
        "negative_predictive_value": tn/(tn+fn),
        "specificity": tn/(tn+fp),
    }



trainer = Trainer(
    model=inference_model,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

trainer.evaluate()

{'eval_loss': 0.4085777699947357,
 'eval_accuracy': 0.8221,
 'eval_precision': 0.8212115271515389,
 'eval_recall': 0.8285205696202531,
 'eval_negative_predictive_value': 0.8230251071647275,
 'eval_specificity': 0.8155339805825242,
 'eval_runtime': 73.7315,
 'eval_samples_per_second': 135.627,
 'eval_steps_per_second': 16.953}

## Comparison

| Model\Metric | Accuracy | Precision | Recall | Specificity | NPV |
| --- | --- | --- | --- | --- | --- |
| Before PEFT | 0.51 | 0.51 | 1.00 | 0.00 | 1.00 |
| After PEFT | 0.82 | 0.82 | 0.83 | 0.82 | 0.82 |

It can be observed that: 
- LoRA fine-tuning improved the classifier substantially: accuracy rose from 0.51 to 0.82.
- Before PEFT, the model labeled almost every sample as POSITIVE (recall of 1.00 with specificity near 0.00), which is consistent with a randomly initialized classification head.

Training on the full dataset, rather than on a subset that may be skewed toward one class, would likely improve results further.
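
Whether a subset is skewed toward one class can be checked by counting its labels. The following is a minimal sketch; `class_balance` is a hypothetical helper and the label list is made-up illustration data, not taken from the dataset. In the notebook, a column such as `train_dataset['label']` could be passed in instead.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of samples per class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Hypothetical labels: 0 -> NEGATIVE, 1 -> POSITIVE
example_labels = [0, 0, 0, 1]
balance = class_balance(example_labels)
print(balance)  # {0: 0.75, 1: 0.25}
```

A balance far from 0.5 for a binary task would be a reason to resample the subset or to rely on metrics other than accuracy.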

In [31]:
import torch

def predict(sentence: str, known_label: int) -> None:
    with torch.no_grad():
        inputs = tokenizer(sentence, return_tensors="pt").to(inference_model.device)
        logits = inference_model(**inputs).logits
        probabilities = torch.nn.functional.softmax(logits, dim=1)
        predicted_class_id = probabilities.argmax().item()
        output = inference_model.config.id2label[predicted_class_id]
        print(f'Sentence: {sentence}\t Response: {output} \t Known label: {inference_model.config.id2label[known_label]}')
        
for test_sentence, known_label in zip(test_dataset[:10]['text'], test_dataset[:10]['label']):
    predict(test_sentence, known_label)

Sentence: @justineville ...yeahhh. ) i'm 39 tweets from 1,600!	 Response: POSITIVE 	 Known label: POSITIVE
Sentence: @ApplesnFeathers aww. Poor baby! On your only REAL day off.	 Response: POSITIVE 	 Known label: NEGATIVE
Sentence: @joeymcintyre With my refunded $225 (Australian ticket price) I bought me a hot pair of brown boots  Woulda rathered seeing U any day	 Response: POSITIVE 	 Known label: NEGATIVE
Sentence: It's fine. Today sucks just because me those things. i dunno if i can see you	 Response: POSITIVE 	 Known label: NEGATIVE
Sentence: Im just chilling on psp and stuff, but sitting on pc now, also watching wimledon, getting ready for holiday @WhiteTigerNora Ahh poor you	 Response: POSITIVE 	 Known label: NEGATIVE
Sentence: @lisarinna very sad Lisa...she is freeeeeeeeeeee an Angel in Heaven xoxo	 Response: POSITIVE 	 Known label: NEGATIVE
Sentence: Comfortablity has won out	 Response: POSITIVE 	 Known label: NEGATIVE
Sentence: blaaah. I don't feel good aagain	 Response: POSITIV