# Text Classification Fine Tuning: DistilBERT

## Resources
- Verify the availability of notebook resources
- Fine-tuning requires a GPU or TPU to complete in a reasonable time

In [1]:
# display resources
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Wed Jun 12 21:53:22 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Install Libraries
+ Hugging Face
+ PyTorch
+ Standard Python data science libraries

In [None]:
%pip install transformers datasets evaluate accelerate bitsandbytes
%pip install torch torchdata
%pip install peft
%pip install loralib
%pip install huggingface_hub

In [8]:
import pandas as pd
import numpy as np
import random
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    GenerationConfig,
    TrainingArguments,
    Trainer,
    pipeline,
    BitsAndBytesConfig,
    DataCollatorForSeq2Seq,
    DataCollatorWithPadding
)
import torch
import evaluate
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    PeftModel,
    PeftConfig,
)
from huggingface_hub import login

### HuggingFace Authentication
+ Authenticate to pull models and datasets (read token required)
+ Authenticate to push models to the Hugging Face Hub (write token required)

In [None]:
login()

### Notebook Config
+ Define some useful constants
+ Device (CUDA when a GPU is available, otherwise CPU)
+ Model Paths (saving model checkpoints, adaptor weights)

In [5]:
# training directory
MNAME = 'sentiment'
DIR_MODEL = f"/content/drive/MyDrive/Colab Notebooks/fine-tuning-llm/{MNAME}/peft/models/"

# device
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE

device(type='cuda')

## DistilBERT
+ Distilled version of BERT (BERT was developed by Google; DistilBERT was released by Hugging Face)
+ [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert)
+ [distilBERT-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) - The base model selected for fine-tuning (case insensitive)

#### Why DistilBERT?
+ Smaller than BERT (40% fewer parameters than bert-base-uncased)
+ Runs 60% faster than BERT
+ Preserves over 95% of BERT's performance on the GLUE benchmark
+ Well suited for text classification

### Fine Tuning Dataset: IMDB
+ [imdb](https://huggingface.co/datasets/stanfordnlp/imdb) => available from HuggingFace
+ Movie review dataset (25k train, 25k test)
+ Consists of a review (text) and a human-assigned label (1=positive, 0=negative)
+ Steps:
  + The dataset ships with separate Train | Test splits
  + The Test split was further divided into Test | Validate; the Validate portion is used for evaluation during training
  + Each split was randomly subset to reduce the compute time required for fine-tuning (Train 1k, Test 1k, Validate 100)
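
The split sizes implied by these steps can be checked with a small standalone sketch. It uses only the standard library and stand-in integer IDs rather than the real reviews. The helper name `split_and_subset` and its arguments are hypothetical; the 30% validation share and the subset sizes are assumed to match the values used in this notebook's dataset cell.

```python
import random

def split_and_subset(n_test=25_000, val_fraction=0.3,
                     n_test_keep=1_000, n_val_keep=100, seed=1985):
    """Hypothetical helper: split test IDs into test/validate pools, then subsample each."""
    rng = random.Random(seed)
    ids = list(range(n_test))
    rng.shuffle(ids)

    # carve off the validation pool, keep the rest as the test pool
    n_val = round(n_test * val_fraction)
    val_pool, test_pool = ids[:n_val], ids[n_val:]

    # random subsets to reduce compute time
    test_ids = rng.sample(test_pool, n_test_keep)
    val_ids = rng.sample(val_pool, n_val_keep)
    return test_pool, val_pool, test_ids, val_ids

test_pool, val_pool, test_ids, val_ids = split_and_subset()
print(len(test_pool), len(val_pool), len(test_ids), len(val_ids))
```

Because the validation pool is carved out of the test split before subsetting, no review can appear in both the test and validation subsets.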

In [58]:
# classification dataset
data_imdb = load_dataset("imdb")

# the train split is used as-is
data_train = data_imdb['train']

# split the test split into test (70%) and validate (30%); seeded for reproducibility
data_test = data_imdb['test'].train_test_split(test_size=0.3, seed=1985)

# subset rows to reduce train time
train = data_train.shuffle(seed=1985).select(range(1000))
test = data_test['train'].shuffle(seed=1985).select(range(1000))
validate = data_test['test'].shuffle(seed=1985).select(range(100))


### Base Model
+ The distilBERT base model (case insensitive version) was fine-tuned with the imdb data to improve text classification (positive, negative)
+ Source: HuggingFace
+ Implementation: HuggingFace, Torch
+ Steps:
 + Download the pre-trained model
 + Create a tokenizer
 + Encode/decode the labels (1:positive, 0:negative)
 + Define the number of labels (binary classification)
 + Define the base model (for fine-tuning)
 + Define the original model (for evaluation)
 + Move the models to the DEVICE (cpu, cuda)



In [None]:
# DistilBERT Base Model
base_model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# classification mappings
id2label = {0:"Negative",1:"Positive"}
label2id = {"Negative":0, "Positive":1}

# base model for training
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
    torch_dtype=torch.bfloat16
    ).to(DEVICE)

# original model for evaluation
original_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
    torch_dtype=torch.bfloat16
    ).to(DEVICE)


### Preprocessing
+ Preprocessing is required to tokenize the inputs and standardize the length of each review.
+ Steps:
  + Tokenize each review
  + Standardize review length: reviews longer than the model's maximum input length are truncated, and shorter reviews are padded so that every sequence in a batch has the same length.
  + The DataCollatorWithPadding class from HuggingFace pads each batch to its longest sequence during training.
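
As a rough illustration of truncation plus per-batch padding, here is a plain-Python sketch. It is not the HuggingFace implementation: `truncate_and_pad` is a hypothetical helper, the token IDs are illustrative, and a real tokenizer also keeps the closing special token when it truncates.

```python
def truncate_and_pad(batch, max_length=512, pad_id=0):
    """Truncate each token-ID list to max_length, then pad to the longest in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)

    # pad with pad_id and record which positions hold real tokens
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# two "reviews" of different lengths (illustrative token IDs)
batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 2307, 102]]
padded = truncate_and_pad(batch, max_length=5)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to a fixed global length, keeps short batches small and reduces wasted computation.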

In [None]:
def preprocess(examples):
  """ Tokenize the input text """
  tokens = tokenizer(examples['text'], truncation=True)
  return tokens

# preprocess each review in the train, test and validate datasets
tokenized_train = train.map(preprocess, batched=True)
tokenized_test = test.map(preprocess, batched=True)
tokenized_val = validate.map(preprocess, batched=True)

### Generate Responses
+ Define a convenience function to generate a classification for a sample review from the dataset
+ Steps
	+ Tokenize the review
	+ Generate a response
	+ Extract the logits
	+ Infer the classification from the maximum logit value
	+ Optionally print the review, the decoded classification, and human labels
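
The step from logits to a class label can be sketched without a model. The logits below are made up for illustration, and `classify` is a hypothetical helper. The softmax is included only to show that the largest logit also has the largest probability, so taking the argmax of the logits is sufficient.

```python
import math

id2label = {0: "Negative", 1: "Positive"}

def classify(logits):
    """Return (predicted id, label, softmax probabilities) for one review's logits."""
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # the prediction is the index of the largest logit
    pred = max(range(len(logits)), key=lambda i: logits[i])
    return pred, id2label[pred], probs

pred, label, probs = classify([-1.2, 2.3])  # made-up logits
print(pred, label)
```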

In [150]:
def get_response(example, model, tokenizer, verbose=False):
  """ Generate a classification for a sample review """
  # tokenize the input text and move the tensors to DEVICE
  encoded_input = tokenizer(example['text'], return_tensors="pt", truncation=True, padding=True).to(DEVICE)

  # get the logits (gradients are not needed for inference)
  with torch.no_grad():
    logits = model(**encoded_input).logits

  # classify: index of the largest logit for this single review
  prediction = logits.argmax(dim=-1).item()

  # print a summary
  if verbose:
    # decode the prediction
    decoded_output = id2label[prediction]
    print("Input Text")
    print("="*100)
    print(example['text'])
    print("="*100)
    print(f"Prediction: {decoded_output} | Label: {id2label[example['label']]}")
  else:
    return prediction

### Training
+ Parameter-Efficient Fine-Tuning (PEFT)
+ The LoRA method is used to train a small number of adaptor weights while the base model weights stay frozen

#### Why PEFT/LoRA?
+ It is widely used by practitioners
+ It is effective at improving performance for task-specific fine-tuning
+ It uses far fewer resources than full fine-tuning
+ It only trains a small fraction of model weights (~1%)
+ It reduces the risk of catastrophic forgetting because the base model weights are unchanged. The trained adaptors are applied on top of (or merged into) the original base model weights
+ There is only a small loss of performance when compared to full fine-tuning



#### Calculate Training Metrics
+ A convenience function to calculate model performance during training
+ Performance is evaluated after each epoch of training
+ This is a supervised binary classification task (we have the ground truth labels). Therefore, a classification accuracy measure can be used. The F1 score was selected to balance precision and recall
+ Steps
	+ Load the F1 metric from the evaluate library
	+ Extract the logits from the prediction object
	+ Infer the classification from the logits
	+ Calculate the F1 score by comparing the predicted classification to the human-assigned classification
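
For reference, the binary F1 score can be computed by hand as the harmonic mean of precision and recall. The sketch below is a plain-Python illustration with made-up predictions and labels, not the `evaluate` implementation. `f1_binary` is a hypothetical helper and treats label 1 (positive) as the class of interest.

```python
def f1_binary(predictions, references, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

preds = [1, 0, 1, 1, 0, 1]   # made-up model predictions
labels = [1, 0, 0, 1, 1, 1]  # made-up human labels
score = f1_binary(preds, labels)
print(score)
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision and recall are both 0.75 and the F1 score is 0.75.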

In [124]:
def calc_training_metrics(pred):
  """ Calculate the evaluation metrics during training """

  # load the f1 metric from the evaluate library
  f1 = evaluate.load('f1')

  # get the logits and labels from the prediction object
  logits, labels = pred

  # classify by using the logit (assign using the largest value)
  predictions = np.argmax(logits, axis=-1)

  # calculate the score
  score = f1.compute(predictions=predictions, references=labels)['f1']

  return {'f1':score}


#### LoRA Configuration
+ **Key Parameters**
+ rank (r)
 + The inner dimension (rank) of the adaptor matrices to train
 + The values typically range from 3-32; empirically, the returns on performance diminish for values above roughly 10
 + The number of trainable adaptor parameters, and therefore the compute time required, grows in proportion to r
+ LoRA modules (target_modules)
 + Defines which layers the adaptors are added to in the base model
 + The options available depend on the architecture of the base model
 + The documentation for each base model must be researched to determine which layers are available
+ LoRA dropout (lora_dropout) - regularization parameter (dropout probability for the adaptor layers)
+ LoRA alpha (lora_alpha) - scaling factor for the adaptor weights. Some [articles](https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2) suggest this be set at 16 and not tuned
+ task_type - the task type (e.g. `SEQ_CLS` for sequence classification, `SEQ_2_SEQ_LM` for summarization or translation)

+ Steps
	+ Define the LoRA parameters in the LoraConfig object
	+ Prepare the PEFT model from the base model + LoRA config object
	+ View the number of trainable parameters in the PEFT model


In [65]:
# LoRA config
lora_config = LoraConfig(
    r = 8, # dimension of adaptors, rank
    target_modules = ["q_lin"], # add LoRA adaptors to these layers in the base model
    lora_alpha=16, # alpha scaling
    lora_dropout=0.05, # regularization, dropout probability
    task_type=TaskType.SEQ_CLS # text classification
)

# Create the PEFT model from the base model and LoRA config
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()

trainable params: 665,858 || all params: 67,620,868 || trainable%: 0.9847
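
One way to account for the reported trainable-parameter count is the arithmetic below. It assumes DistilBERT-base dimensions (6 transformer layers, hidden size 768), one LoRA pair (A and B) on each `q_lin` layer, and that the two classification-head layers (`pre_classifier` and `classifier`) remain trainable for a `SEQ_CLS` task. This is an accounting that matches the printed number, not a description of how PEFT computes it.

```python
# assumed DistilBERT-base dimensions and the LoRA rank from lora_config
n_layers, hidden, num_labels, r = 6, 768, 2, 8

# each q_lin gets A (r x hidden) and B (hidden x r)
lora_per_layer = r * hidden + hidden * r
lora_params = n_layers * lora_per_layer

# classification head: weights + biases (assumed to stay trainable)
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels

trainable = lora_params + pre_classifier + classifier
reported_total = 67_620_868  # "all params" from the output above

print(f"LoRA adaptor params: {lora_params:,}")
print(f"trainable params:    {trainable:,}")
print(f"trainable%:          {100 * trainable / reported_total:.4f}")
```

Under these assumptions the LoRA adaptors themselves account for only 73,728 parameters; most of the trainable weights sit in the classification head.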


#### Training
This project aimed to demonstrate how to fine-tune LLMs for specific tasks using public datasets. As the focus was not on performance, no hyperparameter tuning was attempted, and most hyperparameters were left at their default values.
+ **Key Parameters**
+ output_dir - location where training checkpoints are saved
+ learning_rate - set to 1e-3 (higher than the Trainer default of 5e-5)
+ auto_find_batch_size - set to True so the batch size is reduced automatically if it does not fit in memory
+ Evaluation occurs after each epoch; the training loss is logged at every step (logging_steps=1)
+ load_best_model_at_end - set to True to reload the best checkpoint once training finishes
+ The data collator is used to automatically pad the text to the longest sequence in each batch

In [67]:
# Data Collator: dynamically pads each batch during training
# so that all sequences in a batch have equal length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# training config
config_training = TrainingArguments(
    output_dir=DIR_MODEL, # training checkpoints are written here
    auto_find_batch_size=True,
    learning_rate=1e-3,
    logging_steps=1,
    num_train_epochs=10,
    eval_strategy='epoch',
    save_strategy='epoch', # must match eval_strategy when load_best_model_at_end=True
    load_best_model_at_end=True
)

# Trainer
trainer = Trainer(
    model=peft_model,
    args=config_training,
    data_collator = data_collator,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=calc_training_metrics
)

# train
trainer.train()

# save adaptor weights
trainer.save_model(DIR_MODEL)
# peft_model.push_to_hub('kconstable/sentiment-distlbert')

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | 0.1187 | 0.340488 | 0.868687 |
| 2 | 0.3301 | 0.256116 | 0.924731 |
| 3 | 0.0481 | 0.395238 | 0.868687 |
| 4 | 0.0164 | 0.400892 | 0.875 |
| 5 | 0.0013 | 0.437728 | 0.903226 |
| 6 | 0.0101 | 0.427545 | 0.903226 |
| 7 | 0.0 | 0.427249 | 0.903226 |
| 8 | 0.0056 | 0.505457 | 0.893617 |
| 9 | 0.0171 | 0.519009 | 0.893617 |
| 10 | 0.0128 | 0.510737 | 0.893617 |
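
The training log can be used to see which epoch `load_best_model_at_end=True` would be expected to restore. When `metric_for_best_model` is not set, the Trainer documentation states that evaluation loss is used, so the sketch below picks the epoch with the lowest validation loss and, for comparison, the one with the highest F1. The `history` list is copied by hand from the log above.

```python
# (epoch, validation loss, F1) copied from the training log
history = [
    (1, 0.340488, 0.868687),
    (2, 0.256116, 0.924731),
    (3, 0.395238, 0.868687),
    (4, 0.400892, 0.875),
    (5, 0.437728, 0.903226),
    (6, 0.427545, 0.903226),
    (7, 0.427249, 0.903226),
    (8, 0.505457, 0.893617),
    (9, 0.519009, 0.893617),
    (10, 0.510737, 0.893617),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_f1 = max(history, key=lambda row: row[2])
print("lowest validation loss: epoch", best_by_loss[0])
print("highest F1:             epoch", best_by_f1[0])
```

Both criteria point to epoch 2. After that the validation loss rises while the training loss keeps falling, which suggests the later epochs overfit the 1k training subset.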




#### Merge Base Model & Adapters
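+ The trained LoRA adaptors are loaded on top of the base model
+ The resulting model consists of the frozen base model weights plus the trained adaptors (PEFT's `merge_and_unload()` can fold the adaptors into the base weights if a single standalone model is needed)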


In [None]:
# load the trained LoRA adaptors on top of the base model
# (the base model was already loaded in bfloat16 above)
tuned_model = PeftModel.from_pretrained(
    base_model,
    DIR_MODEL, # LoRA adapters
    trust_remote_code=True,
    is_trainable=False
  )
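
To make the relationship between the adaptors and the base weights concrete, here is a small NumPy sketch of the LoRA update for a single linear layer. It uses toy sizes and random values instead of the real model weights, and it assumes the standard LoRA scaling factor `alpha / r`. It shows that applying the adaptor alongside the frozen weight gives the same output as folding the adaptor into the weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8           # toy hidden size, rank and alpha (illustrative)

W = rng.normal(size=(d, d))      # frozen base weight
A = rng.normal(size=(r, d))      # LoRA down-projection
B = rng.normal(size=(d, r))      # LoRA up-projection (nonzero, as if already trained)
x = rng.normal(size=(d,))        # one input vector

scale = alpha / r

# adaptor applied alongside the frozen base weight
y_adapter = W @ x + scale * (B @ (A @ x))

# adaptor folded into the base weight
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))
```

The merged matrix has the same shape as the original weight, so a merged model adds no extra parameters or latency at inference time.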

### Evaluate Model Performance
+ [Hugging Face evaluation metrics](https://huggingface.co/evaluate-metric)
+ This is a supervised binary classification task (we have the ground truth labels). Therefore, a classification accuracy measure can be used. The F1 score was selected to balance precision and recall
+ A function was defined to generate a classification for a list of samples from the test dataset
+ **Steps:**
	+ Randomly select 500 examples from the test dataset (out of sample)
	+ Compare the predictions to the human label for each example using the original base model and the fine-tuned model
	+ Calculate the overall F1 score across all 500 examples for each model
	+ PEFT/LoRA fine-tuning increased the F1 score from 65.67% to 89.20%. The original model's classification head is untrained, so its score should be read as a baseline rather than as a meaningful sentiment classifier


In [184]:
def evaluate_model(test_indexes, data, model, tokenizer):
  """ Generate classifications for each example in the test indexes """
  # accumulator
  results = []

  # loop through each test index in the dataset
  for idx in test_indexes:
    # get the human label and the generated classification
    example = data[idx]
    label = example['label']
    pred = get_response(example, model, tokenizer, verbose=False)

    # accumulate results
    results.append({'idx':idx,'label':label,'pred':pred})
  return pd.DataFrame(results)

In [185]:
# load the F1 metric (it was previously only loaded inside calc_training_metrics)
f1 = evaluate.load('f1')

# Select 500 examples from the test dataset
num_to_test = 500
test_indexes = random.sample(range(test.num_rows), num_to_test)


# Evaluate the Base Model
df_base = evaluate_model(test_indexes, test, original_model, tokenizer)
f1_base = f1.compute(predictions=df_base['pred'], references=df_base['label'])['f1']
print(f"Base Model F1 Score: {f1_base*100:,.2f}%")

# Evaluate the Tuned Model
df_tuned = evaluate_model(test_indexes, test, tuned_model, tokenizer)
f1_tuned = f1.compute(predictions=df_tuned['pred'], references=df_tuned['label'])['f1']
print(f"Tuned Model F1 Score: {f1_tuned*100:,.2f}%")


Base Model F1 Score: 65.67%
Tuned Model F1 Score: 89.20%
