<a href="https://colab.research.google.com/github/mdeevan/LightweightFineTuning/blob/main/LightweightFineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: <br>
**LoRA** (Low-Rank Adaptation). It freezes the pretrained weights and learns each weight update as the product of two small low-rank matrices, which sharply reduces the number of trainable parameters. This lowers memory use and speeds up fine-tuning.
<br>https://huggingface.co/docs/peft/developer_guides/lora
<br>

* Model: <br>
**distilbert/distilbert-base-uncased**: It is smaller and faster than BERT, and was distilled using the BERT base model as its teacher.
<br>https://huggingface.co/distilbert/distilbert-base-uncased
<br>

* Evaluation approach: <br>
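**seqeval** framework for sequence labeling evaluation. It reports precision, recall and F1 score. The cells in this notebook compute accuracy plus macro-averaged F1, precision and recall through the `evaluate` library.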
<br>https://huggingface.co/spaces/evaluate-metric/seqeval
<br>

* Fine-tuning dataset: <br>
**financial_phrasebank**: sentences from financial news, labeled for multi-class classification with three sentiments (positive, negative and neutral).
<br>https://huggingface.co/datasets/financial_phrasebank
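
To make the LoRA choice above more concrete, here is a minimal NumPy sketch of a low-rank update on a single 768×768 projection. The names `W`, `A`, `B` and the `alpha / r` scaling follow the usual LoRA formulation; they are illustrative assumptions and not a copy of what the PEFT library does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 768, 8, 32            # hidden size, LoRA rank, scaling (illustrative values)

W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero at the start

def lora_forward(x):
    """Base projection plus the scaled low-rank update: x W^T + (alpha / r) * x A^T B^T."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d))
full_params = W.size                # parameters in the frozen matrix
lora_params = A.size + B.size       # parameters that would actually be trained
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one, and only `A` and `B` would receive gradient updates.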


## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [None]:
!pip install accelerate




In [None]:
!kill -9 -1

In [None]:
!pip install transformers --upgrade
!pip install evaluate seqeval
!pip install peft

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)

In [None]:
# !kill -9 -1

#### IMPORTS

In [None]:
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorWithPadding,
                          TrainingArguments, Trainer)
from datasets     import load_dataset

import torch
import evaluate


#### Define Variables & Load dataset

In [None]:
checkpoint = "distilbert/distilbert-base-uncased"
data_file = "financial_phrasebank"
data_file_subset = "sentences_66agree"

In [None]:
raw_dataset = load_dataset(path=data_file,
                           name=data_file_subset,
                           split="train").train_test_split(test_size=0.2,
                                                           shuffle=True,
                                                           seed=42)

raw_train = raw_dataset.pop('train')
raw_train_valid = raw_train.train_test_split(test_size=.1, shuffle=True, seed=42)
raw_dataset['train'] = raw_train_valid.pop('train')
raw_dataset['eval'] = raw_train_valid.pop('test')
raw_dataset


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 844
    })
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 3035
    })
    eval: Dataset({
        features: ['sentence', 'label'],
        num_rows: 338
    })
})
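
The split sizes above can be checked by hand. The sketch below assumes the test share is rounded up to a whole example, which is consistent with the counts shown; `split_sizes` is a hypothetical helper, not part of the `datasets` API.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 4217                                         # examples in the loaded subset
n_train_full, n_test = split_sizes(n_total, 0.2)       # first split: 20% held out for test
n_train, n_eval = split_sizes(n_train_full, 0.1)       # second split: 10% of the rest for eval
print(n_train, n_eval, n_test)
```

The result is roughly a 72/8/20 train/eval/test partition of the original data.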

In [None]:
# raw_dataset

In [None]:
labels = raw_dataset["train"].features['label'].names
labels

['negative', 'neutral', 'positive']

In [None]:
label2id = {l:i for i, l in enumerate(labels)}
id2label = {i:l for i, l in enumerate(labels)}

In [None]:
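# Longest training sentence in characters, not tokens; the tokenized lengths further down are much shorter
input_max_length = max(len(s) for s in raw_dataset['train']['sentence'])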
input_max_length

315

In [None]:
print(label2id)
print(id2label)
print(len(label2id))

{'negative': 0, 'neutral': 1, 'positive': 2}
{0: 'negative', 1: 'neutral', 2: 'positive'}
3


In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels = len(label2id),
                                                           id2label=id2label,
                                                           label2id=label2id)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

print(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


cuda


In [None]:
print("model     = ", model)
print("tokenizer = ", tokenizer)

model     =  DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inp

In [None]:
# data_collator

In [None]:
def tokenize_function(data):
    # In this run the tokenizer reported no usable model_max_length, so truncation=True
    # alone does not truncate anything; cap inputs at 512, the size of the model's
    # position embeddings.
    return tokenizer(data['sentence'],
                     truncation=True,
                     max_length=512)


tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)

In [None]:
tokenized_dataset

DatasetDict({
    test: Dataset({
        features: ['sentence', 'label', 'input_ids', 'attention_mask'],
        num_rows: 844
    })
    train: Dataset({
        features: ['sentence', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3035
    })
    eval: Dataset({
        features: ['sentence', 'label', 'input_ids', 'attention_mask'],
        num_rows: 338
    })
})

In [None]:
# Pad each batch dynamically to its own longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

In [None]:
data_collator

DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_mul

In [None]:
[len(x) for x in tokenized_dataset['train'][:10]['sentence']]

[120, 75, 152, 46, 152, 90, 50, 167, 160, 49]

In [None]:
[len(x) for x in tokenized_dataset['train'][:10]['input_ids']]

[22, 19, 38, 12, 35, 20, 14, 57, 42, 16]

In [None]:
sample_start = 20
sample_count = 10

samples = tokenized_dataset['train'][sample_start : sample_start+sample_count]

samples = {k: v for k, v in samples.items() if k not in ['sentence', 'label']}
print([len(x) for x in samples['input_ids']])



[26, 26, 33, 28, 25, 29, 10, 31, 41, 17]


In [None]:
# data_collator([samples[i] for i in range(2)])

In [None]:
samples.keys()

dict_keys(['input_ids', 'attention_mask'])

In [None]:
batch = data_collator(samples ).to(device)
# batch
# {k: v.shape for k, v in batch.items()}


In [None]:
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([10, 41]), 'attention_mask': torch.Size([10, 41])}
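
The batch width of 41 comes from dynamic padding: every sequence is padded only up to the longest one in its batch. The following is a simplified pure-Python sketch of that idea. `pad_batch` and `fake_batch` are hypothetical names, the token ids are placeholders, and the real collator also converts the result to tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Placeholder token ids with the same lengths as the ten sampled sentences
lengths = [26, 26, 33, 28, 25, 29, 10, 31, 41, 17]
fake_batch = pad_batch([[101] * n for n in lengths])
shape = (len(fake_batch["input_ids"]), len(fake_batch["input_ids"][0]))
print(shape)
```

Padding per batch keeps short batches small, whereas padding everything to a fixed maximum length would spend compute on padding tokens.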

In [None]:
# Inference only, so skip gradient tracking
with torch.no_grad():
    output = model(**batch).logits

In [None]:
predictions=torch.argmax(output, dim=1).cpu().numpy()
predictions

array([2, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [None]:
references=tokenized_dataset['train']['label'][sample_start:sample_start+sample_count]
references

[1, 2, 2, 1, 2, 0, 1, 1, 1, 2]

In [None]:
batch.keys()

dict_keys(['input_ids', 'attention_mask'])

In [None]:
# https://www.evidentlyai.com/classification-metrics/multi-class-metrics

accuracy  = evaluate.load('accuracy')
f1        = evaluate.load('f1')
precision = evaluate.load('precision')
recall    = evaluate.load('recall')

In [None]:
import numpy as np
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    return accuracy.compute(predictions=predictions, references=labels)

In [None]:


# Macro averaging weights the three classes equally; zero_division=0 scores precision
# as 0 for classes that were never predicted in this small sample.
accuracy_metric  = accuracy.compute(predictions=predictions, references=references)
f1_metric        = f1.compute(predictions=predictions, references=references, average="macro")
precision_metric = precision.compute(predictions=predictions, references=references, average="macro", zero_division=0)
recall_metric    = recall.compute(predictions=predictions, references=references, average="macro")


print(accuracy_metric)
print(f1_metric)
print(precision_metric)
print(recall_metric)

{'accuracy': 0.4}
{'f1': 0.19047619047619047}
{'precision': 0.14814814814814814}
{'recall': 0.26666666666666666}
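
As a sanity check, the macro-averaged scores above can be reproduced by hand from the ten predictions and references. This is a plain-Python sketch; `macro_scores`, `sample_preds` and `sample_refs` are illustrative names, and undefined ratios are treated as 0.

```python
def macro_scores(preds, refs, n_classes):
    """Per-class precision, recall and F1, averaged with equal weight per class."""
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = sum(p == c and t == c for p, t in zip(preds, refs))
        fp = sum(p == c and t != c for p, t in zip(preds, refs))
        fn = sum(p != c and t == c for p, t in zip(preds, refs))
        prec = tp / (tp + fp) if tp + fp else 0.0   # class never predicted -> 0
        rec = tp / (tp + fn) if tp + fn else 0.0    # class never present -> 0
        f1_c = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1_c)
    return {
        "accuracy": sum(p == t for p, t in zip(preds, refs)) / len(refs),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }

sample_preds = [2, 1, 1, 1, 1, 1, 1, 1, 1, 1]
sample_refs = [1, 2, 2, 1, 2, 0, 1, 1, 1, 2]
scores = macro_scores(sample_preds, sample_refs, n_classes=3)
print(scores)
```

Only the neutral class contributes non-zero scores here, which is why the macro averages sit well below the accuracy.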


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [None]:
# !pip install peft

In [None]:
from peft import LoraConfig, get_peft_model, TaskType


In [None]:
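# LoRA on the attention query, key and value projections of every transformer block
peft_config = LoraConfig(task_type=TaskType.SEQ_CLS,
                         inference_mode=False,
                         r=8,             # rank of the low-rank update
                         lora_alpha=32,   # the update is scaled by lora_alpha / r
                         lora_dropout=0.1,
                         target_modules=['q_lin', 'v_lin', 'k_lin'],
                         )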


In [None]:
peft_model = get_peft_model(model, peft_config)

In [None]:
peft_model.print_trainable_parameters()

trainable params: 814,083 || all params: 67,769,862 || trainable%: 1.2012463593330027
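
The trainable-parameter count can be reconstructed with a little arithmetic. The sketch below assumes that, besides the LoRA matrices, the two head layers (`pre_classifier` and `classifier`) stay fully trainable, which is what the reported total suggests; the variable names are illustrative.

```python
hidden, rank, n_labels = 768, 8, 3
n_layers, n_targets = 6, 3                         # six transformer blocks; q_lin, k_lin, v_lin

lora_per_matrix = rank * hidden + hidden * rank    # A (r x d) plus B (d x r)
lora_total = lora_per_matrix * n_targets * n_layers

pre_classifier = hidden * hidden + hidden          # weight + bias
classifier = hidden * n_labels + n_labels          # weight + bias

trainable = lora_total + pre_classifier + classifier
all_params = 67_769_862                            # total reported by print_trainable_parameters
trainable_pct = 100 * trainable / all_params
print(trainable, trainable_pct)
```

Under this assumption the LoRA matrices account for only 221,184 of the trainable parameters; most of the remainder is the `pre_classifier` layer.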


In [None]:
#

In [None]:
Training_Arguments = TrainingArguments(
    per_device_train_batch_size = 32,
    per_device_eval_batch_size  = 32,
    output_dir                  = "my_Multi_label_Classifier",
    learning_rate               = 2e-5,
    num_train_epochs            = 20,
    weight_decay                = 0.01,
    save_strategy               = 'epoch',
    evaluation_strategy         = 'epoch',
    deepspeed                   = False,
    load_best_model_at_end      = True)

In [None]:
tokenized_dataset['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'neutral', 'positive'], id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [None]:
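# rename_column returns a new Dataset and the result is not assigned, so tokenized_dataset
# keeps its 'label' column; DataCollatorWithPadding renames 'label' to 'labels' per batch.
tokenized_dataset['train'].rename_column('label','labels')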

Dataset({
    features: ['sentence', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 3035
})

In [None]:
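# get_peft_model injected the LoRA layers into `model` in place, so passing `model`
# here trains the same adapter weights that `peft_model` wraps.
trainer = Trainer(
                  model=model,
                  args=Training_Arguments,
                  train_dataset=tokenized_dataset['train'],
                  eval_dataset =tokenized_dataset['eval'],
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer,
                  data_collator=data_collator
)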

In [None]:
trainer.train()

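| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | No log | 0.456047 | 0.810651 |
| 2 | No log | 0.418948 | 0.840237 |
| 3 | No log | 0.390349 | 0.83432 |
| 4 | No log | 0.369978 | 0.843195 |
| 5 | No log | 0.350361 | 0.860947 |
| 6 | 0.415200 | 0.337673 | 0.860947 |
| 7 | 0.415200 | 0.325087 | 0.863905 |
| 8 | 0.415200 | 0.314885 | 0.866864 |
| 9 | 0.415200 | 0.309716 | 0.863905 |
| 10 | 0.415200 | 0.30428 | 0.860947 |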

TrainOutput(global_step=1900, training_loss=0.32401989987022, metrics={'train_runtime': 293.466, 'train_samples_per_second': 206.838, 'train_steps_per_second': 6.474, 'total_flos': 1046962121692500.0, 'train_loss': 0.32401989987022, 'epoch': 20.0})
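
The step count and the "No log" entries in the training output follow from the batch arithmetic. This sketch assumes a single device, no gradient accumulation, and the Trainer's default `logging_steps` of 500, none of which are overridden in the arguments above.

```python
import math

n_train, batch_size, epochs = 3035, 32, 20
logging_steps = 500                                   # Trainer default, assumed unchanged

steps_per_epoch = math.ceil(n_train / batch_size)     # the last batch of an epoch is partial
total_steps = steps_per_epoch * epochs
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(steps_per_epoch, total_steps, first_logged_epoch)
```

The training loss is first reported at step 500, which falls inside epoch 6, so the earlier epochs show no logged value.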

In [None]:
# Save through the PEFT wrapper so that only the LoRA adapter and the retrained
# classifier head are written, together with the adapter config.
peft_model.save_pretrained("fineTunedMultiLabelClassifier")

In [None]:
trainer.save_model("ftMLC")

In [None]:
tokenized_dataset['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'neutral', 'positive'], id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [None]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

In [None]:
tokenizer2 = AutoTokenizer.from_pretrained('ftMLC')
inputs = tokenizer2(text, return_tensors='pt')

In [None]:
from peft import PeftModel

# Loading 'ftMLC' with AutoModelForSequenceClassification skips the LoRA and classifier
# weights, leaving the head untrained. Instead, rebuild the base classifier with the same
# label mapping and attach the saved adapter on top of it.
base_model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                                num_labels=len(label2id),
                                                                id2label=id2label,
                                                                label2id=label2id)
model2 = PeftModel.from_pretrained(base_model, "fineTunedMultiLabelClassifier").eval()

In [None]:
with torch.no_grad():
  logits = model2(**inputs).logits

In [None]:
prediction = torch.argmax(logits, dim=-1)
# id2label is keyed by plain Python ints, so unwrap the one-element tensor with .item()
print(model2.config.id2label[prediction.item()])

In [None]:
print(prediction)

tensor([2])
