# Lightweight Fine-Tuning Project

## Table of contents
0. [Decisions](#config)
1. [Loading and Evaluating a Foundation Model](#loadbase)
2. [Performing Parameter-Efficient Fine-Tuning](#peft)
3. [Performing Inference with a PEFT Model](#inference)

## 0. Decisions
<a id="config"></a>
TODO: In this cell, describe your choices for each of the following

* PEFT technique: LoRA, as recommended.
* Model: GPT-2 is better suited to causal language modeling than to classification, so I use DistilBERT (`distilbert-base-uncased`) instead of GPT-2.
* Evaluation approach: F1 score computed with the `evaluate` library through the `Trainer`, measured before and after fine-tuning.
* Fine-tuning dataset: Movie reviews from Rotten Tomatoes (`rotten_tomatoes`). It is similar to IMDb, but it was the first dataset I found with an appropriate size (5,331 records per sentiment, a 50/50 distribution). To speed things up, only every third record (33.3%) is used.

## 1. Loading and Evaluating a Foundation Model
<a id="loadbase"></a>
TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [3]:
import torch
import transformers as tf  # probably a bad idea, tf could be mistaken for tensorflow, will pick another abbrev next time
import datasets as ds
import numpy as np
import pandas as pd
import evaluate as eva
import torch.nn.functional as func
import peft

# config: pick one of the following models
# THE_MODEL_I_USE = "gpt2"                  # suggested, but not that suitable for classification
THE_MODEL_I_USE = "distilbert-base-uncased" # current selection

# config: select a metric
# THE_METRIC_I_USE = "accuracy"
THE_METRIC_I_USE = "f1"                     # should be better for imbalanced data sets

# config: train the classification head? If not, the initial evaluation is more or less random.
# If it is trained here, those weights are not saved and are lost.
# According to the guidelines, it should NOT be done
TRAIN_CH = False

# the name of the tuned model is the original one plus some suffix
NAME_OF_FINETUNED_MODEL = THE_MODEL_I_USE + "-loratuned"

In [4]:
# load data set
allsets = ds.load_dataset("rotten_tomatoes")
# to speed things up, we want to use 1 third of the data set only
only_a_third = lambda data, index: index % 3 == 0

# split into its parts
part_train      = allsets["train"]     .filter(only_a_third, with_indices=True)
part_validation = allsets["validation"].filter(only_a_third, with_indices=True)
part_test       = allsets["test"]      .filter(only_a_third, with_indices=True)

# print sizes
print(f"Count t/v/t is {len(part_train)}/{len(part_validation)}/{len(part_test)}")
# => awesome, some decent size!

# inspect it:
part_test
# => has text and label

Count t/v/t is 2844/356/356


Dataset({
    features: ['text', 'label'],
    num_rows: 356
})
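
The split sizes printed above can be checked by hand: keeping every index with `index % 3 == 0` retains one record out of three, rounded up. The sketch below assumes the full `rotten_tomatoes` splits hold 8,530 / 1,066 / 1,066 records (an assumption, though it is consistent with the 2 × 5,331 total mentioned in the decisions). `kept_by_every_third` is a hypothetical helper name, not part of the notebook.

```python
def kept_by_every_third(n_records: int) -> int:
    """Count the indices in range(n_records) that satisfy index % 3 == 0."""
    return sum(1 for index in range(n_records) if index % 3 == 0)

# Assumed full split sizes of rotten_tomatoes (train / validation / test).
full_sizes = {"train": 8530, "validation": 1066, "test": 1066}

# Apply the same rule as the only_a_third filter to each split size.
kept = {name: kept_by_every_third(n) for name, n in full_sizes.items()}
print(kept)
```

Under that assumption the counts come out as 2844 / 356 / 356, the same as the printed `Count t/v/t` line.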

In [5]:
def to_label(token):
    return "Bleh!" if token == 0 else "Yeah!"

original_tokenizer = tf.AutoTokenizer.from_pretrained(THE_MODEL_I_USE) # , return_tensors="pt")
original_model = tf.AutoModelForSequenceClassification.from_pretrained(THE_MODEL_I_USE, num_labels=2,
    id2label={0: "Bleh!", 1: "Yeah!"},
    label2id={"Bleh!": 0, "Yeah!": 1}
)

# set padding: GPT-2 ships without a pad token, so reuse its EOS token there;
# DistilBERT already defines [PAD] and has no EOS token, so it is left untouched
if original_tokenizer.pad_token is None:
    original_tokenizer.pad_token = original_tokenizer.eos_token
    original_model.config.pad_token_id = original_tokenizer.pad_token_id

print(original_model)

# test the tokenizer on some fixed text
sample_text = "This was a really bad movie"
sample_inputs = original_tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    sample_result = original_model(**sample_inputs)
# the predicted class id is the index of the largest logit
sample_result_label = to_label(sample_result.logits.argmax(dim=-1).item())
print(f'I obtained {sample_result_label} for {sample_text}')

# tokenize the stuff
def tokenize(dsin):
    return dsin.map(lambda x: original_tokenizer(x["text"], padding="max_length", truncation=True), batched=True)

tok_train      = tokenize(part_train)
tok_validation = tokenize(part_validation)
tok_test       = tokenize(part_test)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 


In [6]:

# standard metric used for classification (I've seen f1 as well...)
#def compute_metrics(eval_pred):
#    predictions, labels = eval_pred
#    predictions = np.argmax(predictions, axis=-1)
#    return { "accuracy": (predictions == labels).mean() } 

metric = eva.load(THE_METRIC_I_USE)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# determine how good the untrained original model is, by prompting it for the answers
for i in range(5):
    review_to_check = part_test["text"][i]
    # prompt = 'Do you think the following review is positive? ' + review_to_check # construct a prompt
    prompt = review_to_check
    tokens = original_tokenizer(prompt, return_tensors="pt")
    # print(f'Type of tokens is {type(tokens)} and tokens are {tokens}')
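    with torch.no_grad():  # inference only, no gradients needed
        answer = original_model(**tokens)
    logits = answer.logits
    # the predicted class id is the index of the largest logit
    result_label = to_label(logits.argmax(dim=-1).item())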
    # print(logits)
    print(f'I obtained {result_label} for {review_to_check}')
    # probabilities = func.softmax(logits[0], dim=-1)  # argmax?
    #print(f'The guess for {review_to_check} is')

# That does not yet look meaningful at all! :-(


I obtained Bleh! for lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
I obtained Bleh! for the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
I obtained Bleh! for throws in enough clever and unexpected twists to make the formula feel fresh .
I obtained Bleh! for generates an enormous feeling of empathy for its characters .
I obtained Bleh! for mostly , [goldbacher] just lets her complicated characters be unruly , confusing and , through it all , human .


In [7]:
# The classification head is already attached by AutoModelForSequenceClassification.

# Freeze all the parameters of the base model, so that only the head would be trained
for param in original_model.base_model.parameters():
    param.requires_grad = False

# trainer directly on the base model (training only runs if TRAIN_CH is set)
trainer_of_ch = tf.Trainer(
    model=original_model,
    args=tf.TrainingArguments(
        output_dir="./data/ch",
        learning_rate=2e-3,
        # Reduce the batch size if you don't have enough memory
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset = tok_train,
    eval_dataset  = tok_validation,
    tokenizer     = original_tokenizer,
    data_collator = tf.DataCollatorWithPadding(tokenizer=original_tokenizer),
    compute_metrics = compute_metrics,
)

print('===========================================')
if TRAIN_CH:
    trainer_of_ch.train()
print('===========================================')

trainer_of_ch.evaluate()





{'eval_loss': 0.7017835974693298,
 'eval_f1': 0.6666666666666666,
 'eval_runtime': 1.8136,
 'eval_samples_per_second': 196.294,
 'eval_steps_per_second': 49.073}
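
The baseline F1 of 0.6667 is the value a classifier gets when it predicts the positive label for every review in a perfectly balanced set: precision is 0.5, recall is 1.0, and their harmonic mean is 2/3. Whether the untrained head really behaves that way here is an assumption; the output above is only consistent with it. A minimal sketch with made-up labels and a hypothetical helper `binary_f1`:

```python
def binary_f1(predictions, references, positive=1):
    """F1 score for the positive class, computed from raw counts."""
    tp = sum(1 for p, r in zip(predictions, references) if p == positive and r == positive)
    fp = sum(1 for p, r in zip(predictions, references) if p == positive and r != positive)
    fn = sum(1 for p, r in zip(predictions, references) if p != positive and r == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up balanced references: 178 negative and 178 positive reviews.
references = [0] * 178 + [1] * 178
# A degenerate classifier that answers "positive" for everything.
always_positive = [1] * len(references)

f1 = binary_f1(always_positive, references)
print(f"F1 of the constant predictor: {f1:.4f}")
```

Seen this way, an F1 near 0.67 before fine-tuning is a chance-level baseline rather than evidence of learned behavior.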

## 2. Performing Parameter-Efficient Fine-Tuning
<a id="peft"></a>

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [9]:
# create a LoRA fine tuning

# generic config: without target_modules, PEFT falls back to its defaults for known architectures
lora_config = peft.LoraConfig(
    task_type="SEQ_CLS",
    r=4,           # wild guess, LoRA paper used values between 1 and 16...
    lora_alpha=4,  # same as r?
    lora_dropout=0.01,
)
if THE_MODEL_I_USE == "distilbert-base-uncased":
    lora_config = peft.LoraConfig(
        task_type="SEQ_CLS",
        r=4,
        lora_alpha=4,
        lora_dropout=0.01,
        # target_modules=["q_lin", "v_lin", "k_lin", "out_lin"],  # all attention projections; in theory fine, but stayed at 0.5 accuracy
        target_modules=["q_lin", "v_lin"],  # query and value projections (current selection)
        # target_modules=["pre_classifier"],  # throws exception despite also being of type Linear
        # target_modules=["classifier"],  # achieves 0.82
    )


# lora_model = peft.LoraModel(original_model, lora_config, "default")
lora_model = peft.get_peft_model(original_model, lora_config)
lora_model.print_trainable_parameters()

original_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

trainer_for_lora = tf.Trainer(
    model = lora_model,
    args  = tf.TrainingArguments(
        output_dir                  = "./train/", # weights saved to some subfolder
        learning_rate               = 0.002,
        per_device_train_batch_size = 4,
        per_device_eval_batch_size  = 4,
        evaluation_strategy         = "epoch",
        save_strategy               = "epoch",
        num_train_epochs            = 4,     # with GPU 4 epochs take 3 minutes, without... hours! but no improvement after epoch 2
        weight_decay                = 0.01,
        load_best_model_at_end      = True,
    ),
    train_dataset   = tok_train,
    eval_dataset    = tok_validation,
    tokenizer       = original_tokenizer,
    data_collator   = tf.DataCollatorWithPadding(tokenizer=original_tokenizer),
    compute_metrics = compute_metrics,
)

print("Now the work starts")
trainer_for_lora.train()
trainer_for_lora.evaluate()
lora_model.save_pretrained(NAME_OF_FINETUNED_MODEL)  # use current directory as required in rubric


trainable params: 665,858 || all params: 67,620,868 || trainable%: 0.9846930684178736
Now the work starts
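
The `trainable params` count printed above can be reproduced by hand. The sketch assumes that, for `task_type="SEQ_CLS"`, PEFT keeps the `pre_classifier` and `classifier` layers trainable in addition to the LoRA matrices; that assumption is what makes the numbers add up. The variable names are illustrative only.

```python
hidden = 768            # in_features and out_features of q_lin / v_lin
layers = 6              # transformer blocks in DistilBERT
rank = 4                # r in the LoraConfig
targets_per_layer = 2   # q_lin and v_lin
num_labels = 2

# Each LoRA-wrapped Linear adds A (rank x in) and B (out x rank), both without bias.
lora_per_module = rank * hidden + hidden * rank
lora_total = lora_per_module * targets_per_layer * layers

# Assumed to stay trainable as the classification head (weights plus biases).
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels

trainable = lora_total + pre_classifier + classifier
all_params = 67_620_868  # taken from the printed output

print(f"LoRA: {lora_total:,}  head: {pre_classifier + classifier:,}  total: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params}")
```

Most of the trainable parameters sit in the head, not in the adapters: the LoRA matrices account for only 73,728 of them.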




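| Epoch | Training Loss | Validation Loss | F1 |
|-------|---------------|-----------------|----------|
| 1 | 0.6180 | 0.406783 | 0.816901 |
| 2 | 0.4581 | 0.475466 | 0.826590 |
| 3 | 0.3329 | 0.641566 | 0.824859 |
| 4 | 0.1999 | 0.796972 | 0.810345 |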
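
The training results show validation loss rising after epoch 1 while F1 peaks at epoch 2, so which checkpoint counts as "best" depends on the metric. As far as the Transformers documentation describes it, `load_best_model_at_end=True` without a `metric_for_best_model` compares checkpoints by evaluation loss, which would restore the epoch 1 weights here. The sketch below only re-reads the numbers reported above; the variable names are illustrative.

```python
# Per-epoch results copied from the training log.
epochs = [
    {"epoch": 1, "train_loss": 0.6180, "val_loss": 0.406783, "f1": 0.816901},
    {"epoch": 2, "train_loss": 0.4581, "val_loss": 0.475466, "f1": 0.826590},
    {"epoch": 3, "train_loss": 0.3329, "val_loss": 0.641566, "f1": 0.824859},
    {"epoch": 4, "train_loss": 0.1999, "val_loss": 0.796972, "f1": 0.810345},
]

# Lower is better for loss, higher is better for F1.
best_by_loss = min(epochs, key=lambda row: row["val_loss"])
best_by_f1 = max(epochs, key=lambda row: row["f1"])

print(f"best by validation loss: epoch {best_by_loss['epoch']}")
print(f"best by F1:              epoch {best_by_f1['epoch']}")
```

Training loss keeps falling while validation loss climbs, which is the usual sign of overfitting; fewer epochs or selecting on F1 would be reasonable variations to try.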


## 3. Performing Inference with a PEFT Model
<a id="inference"></a>

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [11]:
lora_reloaded = peft.AutoPeftModelForSequenceClassification.from_pretrained(NAME_OF_FINETUNED_MODEL)
print(lora_reloaded)
# print(tokens)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): MultiHeadSelfAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.01, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=76

In [12]:
# just for visual checks
for i in range(5):
    review_to_check = part_test["text"][i]
    print(f'review is {review_to_check}')
    tokens_of_review = original_tokenizer(review_to_check, return_tensors="pt")
    # print(f'tokens are {tokens_of_review.input_ids}')
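    with torch.no_grad():  # inference only, no gradients needed
        answer_of_finetuned_model = lora_reloaded(input_ids=tokens_of_review.input_ids)
    # print(f'answer is {answer_of_finetuned_model}')
    logits = answer_of_finetuned_model.logits
    scores = logits[0]  # raw logits for [negative, positive], not probabilities
    result =  "Yay!" if scores[0] < scores[1] else "Nay!"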
    print(result)
    # print(type(logits))


review is lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
Yay!
review is the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
Yay!
review is throws in enough clever and unexpected twists to make the formula feel fresh .
Yay!
review is generates an enormous feeling of empathy for its characters .
Yay!
review is mostly , [goldbacher] just lets her complicated characters be unruly , confusing and , through it all , human .
Nay!
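
The loop above compares raw logits, which is enough to pick a label but says nothing about how confident the model is. A softmax turns the two logits into probabilities. The sketch uses plain Python and made-up logit values (not taken from the model); `softmax` is a hypothetical helper, and in the notebook `torch.nn.functional.softmax(logits, dim=-1)` should do the same job.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Made-up logits in the order [negative, positive].
example_logits = [-1.2, 0.8]
probs = softmax(example_logits)

label = "Yay!" if probs[0] < probs[1] else "Nay!"
print(f"{label} (p_negative={probs[0]:.3f}, p_positive={probs[1]:.3f})")
```

Softmax preserves the ordering of the logits, so the predicted label is the same either way; only the reported confidence is new.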


In [13]:
df = pd.DataFrame(tok_test)
df = df[["text", "label"]]

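# Evaluate the fine-tuned model on the test split.
# Trainer.predict runs the model over the dataset and returns predictions, label ids and metrics.
# Note: this uses the model held by trainer_for_lora, not lora_reloaded.
predictions = trainer_for_lora.predict(tok_test)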

print(predictions.metrics)


{'test_loss': 0.4298180043697357, 'test_f1': 0.8126801152737753, 'test_runtime': 2.011, 'test_samples_per_second': 177.029, 'test_steps_per_second': 44.257}
