# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* **Fine-tuning dataset**: 
    - The IMDb dataset is a well-known benchmark dataset for sentiment analysis tasks, containing movie reviews labeled as positive or negative.
    - It is small: 25,000 labeled training reviews and 25,000 labeled test reviews, far fewer samples than web-scale corpora such as Common Crawl or Wikipedia.
    - Training on the IMDb dataset requires less computational resources and time compared to larger datasets, making it suitable for lightweight fine-tuning projects.
    - The IMDb dataset provides a good balance between simplicity and complexity, making it ideal for initial experimentation and prototyping of NLP models.
    - Since it focuses on sentiment analysis, the task is straightforward and can be easily adapted for various downstream applications such as review classification or sentiment detection in social media posts.
    - The IMDb dataset is widely used and readily accessible (here it is loaded through the Hugging Face `datasets` library), making it convenient for researchers and practitioners to obtain and work with for fine-tuning projects.
    - The IMDb dataset often serves as a starting point for fine-tuning pre-trained language models like BERT or DistilBERT for sentiment analysis tasks, allowing for efficient transfer learning.

* **Model**: 
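    - DistilBERT-base-uncased is a pre-trained transformer-based language model. It has not been fine-tuned for any specific task, which makes it a suitable starting point for fine-tuning on a downstream task such as sentiment classification.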
    - It is trained on a large corpus of text data, which includes a diverse range of language patterns and nuances, making it well-suited for understanding the complexities of human language.
    - DistilBERT-base-uncased is a distilled version of BERT, which means it retains much of the performance of BERT while being smaller and faster, making it more practical for deployment and inference in real-world applications.
    - The "uncased" variant of DistilBERT means that it does not differentiate between uppercase and lowercase letters, which is appropriate for text classification tasks where the case of the letters may not carry significant semantic meaning.
    - Given the nature of the IMDB dataset, which consists of movie reviews that are primarily text-based, DistilBERT's ability to capture contextual information and semantic meaning from the text is particularly advantageous for sentiment analysis tasks such as classifying reviews as positive or negative.
    - DistilBERT has been widely adopted and benchmarked in various natural language processing tasks, including sentiment analysis, achieving competitive performance with relatively lower computational resources compared to larger models like BERT.

* **Evaluation approach**:
    - Accuracy is a straightforward and intuitive metric that measures the overall correctness of the classification model.
    - For sentiment analysis tasks like classifying IMDb reviews as positive or negative, accuracy provides a clear indication of how well the model performs in correctly predicting the sentiment of the reviews.
    - The labeled IMDb splits are balanced, meaning they contain equal numbers of positive and negative reviews. In such cases, accuracy is a suitable metric because it reflects the model's ability to correctly classify both positive and negative instances.
    - Accuracy is easy to interpret and communicate, making it accessible to stakeholders and non-technical audiences.
    - Since the goal of sentiment analysis is to accurately determine the sentiment expressed in text data, accuracy aligns well with the primary objective of the task.
    - While accuracy may not be the only metric to consider (other metrics like precision, recall, and F1 score can provide additional insights, especially in imbalanced datasets), it serves as a fundamental measure of model performance and is often used as a baseline metric for classification tasks.
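
To make the relationship between accuracy and the other metrics concrete, here is a minimal sketch that derives all four from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical balanced test set of 200 reviews (not real results).
balanced = classification_metrics(tp=80, fp=20, fn=20, tn=80)

# Hypothetical imbalanced set: 10 positives, 190 negatives,
# and a model that always predicts "neg".
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=190)

print(balanced)
print(always_negative)
```

On the balanced counts all four metrics agree, while on the imbalanced counts accuracy stays high even though recall and F1 drop to zero. That is why accuracy is reasonable for IMDb but should be complemented by other metrics on skewed data.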

* **PEFT technique**:
    - Low-rank adaptation is a technique used to adapt large pre-trained language models, like DistilBERT, to specific downstream tasks while reducing computational costs and memory requirements.
    - With low-rank adaptation, the pre-trained DistilBERT weights stay frozen and only small low-rank update matrices added to selected layers (plus the classification head) are trained. This retains the information captured during pre-training while adapting the model to the IMDb sentiment classification task.
    - Because far fewer parameters are trained, low-rank adaptation can help limit overfitting and may improve generalization to unseen data.
    - The IMDb sentiment classification task is a binary classification problem (positive or negative sentiment), so a small number of task-specific parameters is likely sufficient to focus the model on features relevant to sentiment.
    - Low-rank adaptation reduces the memory and compute needed for fine-tuning, enabling us to leverage the knowledge captured by DistilBERT on the IMDb dataset and achieve competitive performance with fewer computational resources.
    - Through low-rank adaptation, we strike a balance between model complexity and task-specific performance, making it a practical and effective technique for sentiment analysis tasks like classifying IMDb reviews.
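
The idea can be sketched in a few lines of NumPy. This is an illustration of the low-rank update, not the `peft` implementation: the matrices are random stand-ins, the names `W`, `A`, `B` and `lora_linear` are chosen for this sketch, and `r=8` and `alpha=32` mirror the values configured later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 768, 8, 32                  # hidden size, LoRA rank, scaling numerator
W = rng.normal(size=(d, d))               # frozen pre-trained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d))   # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero at the start


def lora_linear(x, W, A, B, alpha, r):
    """Frozen linear map plus the scaled low-rank update (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(4, d))               # batch of 4 hypothetical hidden vectors
out = lora_linear(x, W, A, B, alpha, r)

full_params = W.size                      # parameters a full update of W would train
lora_params = A.size + B.size             # parameters the low-rank update trains

print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the small matrices `A` and `B`.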

In [1]:
# import modules
import numpy as np
import pandas as pd

import torch
import torch.nn.functional as F

from datasets import load_dataset

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, TrainingArguments, Trainer)

from peft import (AutoPeftModelForSequenceClassification, LoraConfig, TaskType,
                  get_peft_model)

import evaluate

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [2]:
# load dataset
dataset = load_dataset("imdb")

In [3]:
# show the number of columns in each split
dataset.num_columns

{'train': 2, 'test': 2, 'unsupervised': 2}

In [4]:
# show dataset details
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [5]:
# show training set text and labels
print("TEXT", dataset["train"].features["text"])
print("LABEL", dataset["train"].features["label"])

TEXT Value(dtype='string', id=None)
LABEL ClassLabel(names=['neg', 'pos'], id=None)


In [6]:
# show example test set text and labels
print("TEXT:", dataset["test"]["text"][0])
print("LABEL:", dataset["test"]["label"][0])

TEXT: I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to

In [7]:
# set up labels, label ids and number of labels
labels = dataset["train"].features["label"].names

id2label = {i: name for i, name in enumerate(labels)}
label2id = {name: i for i, name in enumerate(labels)}


label_count = len(labels)

In [8]:
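# set up tokenizer and padding
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# DistilBERT's tokenizer already defines a pad token ([PAD]), so this fallback
# is only a safeguard for tokenizers that lack one
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token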

In [9]:
# set up model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=label_count, id2label=id2label, label2id=label2id
)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

print(model.config)


print(model)
print(tokenizer)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "neg",
    "1": "pos"
  },
  "initializer_range": 0.02,
  "label2id": {
    "neg": 0,
    "pos": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.38.1",
  "vocab_size": 30522
}

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Tr

In [10]:
# tokenize example text
tokenized_input = tokenizer(dataset["train"][0]["text"], truncation=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

print("RAW INPUT:", dataset["train"][0]["text"])
print("TOKENIZED OUTPUT:", tokens)
print("WORD_IDS:", tokenized_input.word_ids())

RAW INPUT: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

In [11]:
# show each token of the example text together with the index of the word it came from
for token, word_id in zip(tokens, tokenized_input.word_ids()):
    print(f"({token},{word_id})", end=", ")

([CLS],None), (i,0), (rented,1), (i,2), (am,3), (curious,4), (-,5), (yellow,6), (from,7), (my,8), (video,9), (store,10), (because,11), (of,12), (all,13), (the,14), (controversy,15), (that,16), (surrounded,17), (it,18), (when,19), (it,20), (was,21), (first,22), (released,23), (in,24), (1967,25), (.,26), (i,27), (also,28), (heard,29), (that,30), (at,31), (first,32), (it,33), (was,34), (seized,35), (by,36), (u,37), (.,38), (s,39), (.,40), (customs,41), (if,42), (it,43), (ever,44), (tried,45), (to,46), (enter,47), (this,48), (country,49), (,,50), (therefore,51), (being,52), (a,53), (fan,54), (of,55), (films,56), (considered,57), (",58), (controversial,59), (",60), (i,61), (really,62), (had,63), (to,64), (see,65), (this,66), (for,67), (myself,68), (.,69), (<,70), (br,71), (/,72), (>,73), (<,74), (br,75), (/,76), (>,77), (the,78), (plot,79), (is,80), (centered,81), (around,82), (a,83), (young,84), (swedish,85), (drama,86), (student,87), (named,88), (lena,89), (who,90), (wants,91), (to,92), (

In [12]:
# set up tokenizer for corpus
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_input = dataset.map(preprocess_function, batched=True)
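
What `padding="max_length"` and `truncation=True` do to a single sequence can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: the helper name `pad_and_truncate` is hypothetical, the ids are toy values, and the real tokenizer additionally keeps the closing special token when it truncates.

```python
def pad_and_truncate(input_ids, max_length=512, pad_id=0):
    """Cut to max_length, then right-pad with pad_id and build the attention mask."""
    ids = input_ids[:max_length]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


# Toy examples with a small max_length so the result is easy to read.
short_ids, short_mask = pad_and_truncate([101, 2023, 2003, 102], max_length=6)
long_ids, long_mask = pad_and_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions hold real tokens (1) and which are padding (0). Padding every review to the maximum length keeps all examples the same shape, at the cost of extra computation on short reviews.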

In [13]:
# check tokenizer
print(tokenized_input["train"][0]["text"])
print(tokenized_input["train"][0]["label"])
print(tokenized_input["train"][0]["input_ids"])
print(tokenized_input["train"][0]["attention_mask"])

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, eve

In [14]:
# Assigning the pad token ID from the tokenizer to the pad_token_id attribute of the model's configuration.
# This ensures consistency between the tokenizer and the model during tokenization and padding operations.
model.config.pad_token_id = tokenizer.pad_token_id

In [15]:
# Freeze the pre-trained DistilBERT encoder so that only the newly initialized
# classification head (pre_classifier and classifier) is updated during training.
# With requires_grad set to False, no gradients are computed for these parameters,
# so they remain unchanged.
for param in model.base_model.parameters():
    param.requires_grad = False
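
As a rough check of how little remains trainable after freezing, the size of the classification head can be worked out by hand. This sketch assumes the head consists of a 768 -> 768 `pre_classifier` layer and a 768 -> 2 `classifier` layer, both with biases; the helper `linear_params` is introduced here for illustration only.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of parameters in a dense layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


hidden = 768       # "dim" in the DistilBertConfig printed earlier
num_labels = 2     # neg / pos

pre_classifier = linear_params(hidden, hidden)    # assumed 768 -> 768 layer
classifier = linear_params(hidden, num_labels)    # 768 -> 2 layer

head_params = pre_classifier + classifier
print(head_params)
```

Under these assumptions, roughly 0.6 million parameters are trained in the head-only baseline, while the encoder stays fixed.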

In [16]:
# check model's architecture
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [17]:
# check number of class labels
model.classifier

Linear(in_features=768, out_features=2, bias=True)

In [18]:
# set up accuracy as metric function
accuracy = evaluate.load("accuracy")


def compute_accuracy(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
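
The same computation can be traced on a toy example without the `evaluate` library. The logits and labels below are hypothetical, and `accuracy_from_logits` is a stand-in written for this sketch.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the references."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == labels))


# Hypothetical logits for four reviews; columns are (neg, pos).
toy_logits = np.array([[2.0, -1.0],
                       [0.1, 0.3],
                       [-0.5, 1.5],
                       [1.2, 0.8]])
toy_labels = np.array([0, 1, 0, 0])

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```

The third review is the only misclassification (predicted `pos`, labeled `neg`), so three of the four predictions are correct.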

In [19]:
# set up training function
def train_model(model, output_dir, train_dataset, eval_dataset, tokenizer, compute_metrics):
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=output_dir,
            learning_rate=2e-5,
            per_device_train_batch_size=32,
            per_device_eval_batch_size=32,
            num_train_epochs=1,
            weight_decay=0.01,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
        ),
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        compute_metrics=compute_metrics
    )

    trainer.train()
    return trainer

In [20]:
# baseline: train only the classification head of the frozen foundation model
trainer_foundation = train_model(
    model, './foundation_model', tokenized_input["train"], tokenized_input["test"], tokenizer, compute_accuracy)


{'loss': 0.6288, 'grad_norm': 0.7739212512969971, 'learning_rate': 7.21227621483376e-06, 'epoch': 0.64}



{'eval_loss': 0.5526282787322998, 'eval_accuracy': 0.80684, 'eval_runtime': 114.0632, 'eval_samples_per_second': 219.177, 'eval_steps_per_second': 6.856, 'epoch': 1.0}
{'train_runtime': 232.9497, 'train_samples_per_second': 107.319, 'train_steps_per_second': 3.357, 'train_loss': 0.6069739456371883, 'epoch': 1.0}
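
The logged learning rate and epoch fraction can be reproduced by hand. This sketch assumes a single device, the `Trainer` defaults of logging every 500 steps, and a linear learning-rate decay without warmup; none of these are set explicitly in the training function above.

```python
import math

train_rows = 25_000      # size of the IMDb training split
batch_size = 32          # per_device_train_batch_size
epochs = 1               # num_train_epochs
base_lr = 2e-5           # learning_rate
logging_step = 500       # assumed: default logging interval

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

# Linear decay from base_lr to 0 over total_steps, assuming no warmup.
lr_at_log = base_lr * (total_steps - logging_step) / total_steps
epoch_at_log = logging_step / total_steps

print(total_steps, lr_at_log, round(epoch_at_log, 2))
```

Under these assumptions the values agree with the `learning_rate` and `epoch` entries in the training log, which suggests the single log line was written at step 500.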


In [21]:
# evaluate the head-only baseline on the test data
trainer_foundation.evaluate()


{'eval_loss': 0.5526282787322998,
 'eval_accuracy': 0.80684,
 'eval_runtime': 115.7068,
 'eval_samples_per_second': 216.063,
 'eval_steps_per_second': 6.758,
 'epoch': 1.0}

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [22]:
# set up LoRA (Low-Rank Adaptation)
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    target_modules=["q_lin", "k_lin", "v_lin"],
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
lora_model = get_peft_model(model, peft_config)
lora_model.print_trainable_parameters()

trainable params: 813,314 || all params: 67,768,324 || trainable%: 1.2001388731407907
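
One decomposition that is consistent with the reported trainable-parameter count is sketched below. It assumes that the only trainable weights are the LoRA matrices on `q_lin`, `k_lin` and `v_lin` in all six transformer blocks, plus the classification head (a 768 -> 768 `pre_classifier` and a 768 -> 2 `classifier`, both with biases).

```python
hidden, rank = 768, 8
n_layers = 6               # transformer blocks in DistilBERT
targets_per_layer = 3      # q_lin, k_lin, v_lin

# Each adapted layer gains A (rank x hidden) and B (hidden x rank).
lora_per_module = rank * hidden + hidden * rank
lora_total = lora_per_module * targets_per_layer * n_layers

# Assumed: the classification head also stays trainable.
head_total = (hidden * hidden + hidden) + (hidden * 2 + 2)

trainable = lora_total + head_total
all_params = 67_768_324    # total reported by print_trainable_parameters()
trainable_pct = 100 * trainable / all_params

print(lora_total, head_total, trainable, trainable_pct)
```

Under these assumptions the LoRA matrices account for only about a quarter of the trainable parameters; most of the rest belongs to the classification head.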


In [23]:
# train the LoRA model on the training data
trainer_finetuning = train_model(
    lora_model, './lora_model', tokenized_input["train"], tokenized_input["test"], tokenizer, compute_accuracy)


{'loss': 0.3667, 'grad_norm': 1.7551478147506714, 'learning_rate': 7.21227621483376e-06, 'epoch': 0.64}



{'eval_loss': 0.2818361818790436, 'eval_accuracy': 0.88072, 'eval_runtime': 118.1663, 'eval_samples_per_second': 211.566, 'eval_steps_per_second': 6.618, 'epoch': 1.0}
{'train_runtime': 379.7227, 'train_samples_per_second': 65.838, 'train_steps_per_second': 2.059, 'train_loss': 0.3405481616554358, 'epoch': 1.0}


In [24]:
# evaluate the LoRA fine-tuned model on the test data
trainer_finetuning.evaluate()


{'eval_loss': 0.2818361818790436,
 'eval_accuracy': 0.88072,
 'eval_runtime': 116.7894,
 'eval_samples_per_second': 214.06,
 'eval_steps_per_second': 6.696,
 'epoch': 1.0}
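
The two evaluation results can be compared directly. This short sketch uses only the accuracies reported above and the size of the test split; the variable names are chosen for illustration.

```python
test_rows = 25_000
head_only_acc = 0.80684    # frozen base model with a trained classification head
lora_acc = 0.88072         # LoRA fine-tuned model

gain_points = (lora_acc - head_only_acc) * 100
head_only_errors = round((1 - head_only_acc) * test_rows)
lora_errors = round((1 - lora_acc) * test_rows)
error_reduction = 1 - lora_errors / head_only_errors

print(f"accuracy gain: {gain_points:.2f} percentage points")
print(f"misclassified reviews: {head_only_errors} -> {lora_errors}")
print(f"relative error reduction: {error_reduction:.1%}")
```

Expressed this way, LoRA fine-tuning removes well over a third of the errors made by the head-only baseline after a single epoch for each model.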

In [25]:
# save the LoRA adapter weights
lora_model.save_pretrained("lora_model")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [26]:
# load the saved LoRA model
best_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "lora_model", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
# select a few unlabeled reviews for inference and move the model to the device
infer_data = dataset["unsupervised"]["text"][:5]
best_model.to(device)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): MultiHeadSelfAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=768

In [28]:
print(infer_data)

['This is just a precious little diamond. The play, the script are excellent. I cant compare this movie with anything else, maybe except the movie "Leon" wonderfully played by Jean Reno and Natalie Portman. But... What can I say about this one? This is the best movie Anne Parillaud has ever played in (See please "Frankie Starlight", she\'s speaking English there) to see what I mean. The story of young punk girl Nikita, taken into the depraved world of the secret government forces has been exceptionally over used by Americans. Never mind the "Point of no return" and especially the "La femme Nikita" TV series. They cannot compare the original believe me! Trash these videos. Buy this one, do not rent it, BUY it. BTW beware of the subtitles of the LA company which "translate" the US release. What a disgrace! If you cant understand French, get a dubbed version. But you\'ll regret later :)', 'When I say this is my favourite film of all time, that comment is not to be taken lightly. I probabl

In [29]:
tokenized_infer_data = tokenizer(infer_data, truncation=True)

In [30]:
print(tokenized_infer_data)

{'input_ids': [[101, 2023, 2003, 2074, 1037, 9062, 2210, 6323, 1012, 1996, 2377, 1010, 1996, 5896, 2024, 6581, 1012, 1045, 2064, 2102, 12826, 2023, 3185, 2007, 2505, 2842, 1010, 2672, 3272, 1996, 3185, 1000, 6506, 1000, 6919, 2135, 2209, 2011, 3744, 17738, 1998, 10829, 3417, 2386, 1012, 2021, 1012, 1012, 1012, 2054, 2064, 1045, 2360, 2055, 2023, 2028, 1029, 2023, 2003, 1996, 2190, 3185, 4776, 11968, 9386, 6784, 2038, 2412, 2209, 1999, 1006, 2156, 3531, 1000, 12784, 2732, 7138, 1000, 1010, 2016, 1005, 1055, 4092, 2394, 2045, 1007, 2000, 2156, 2054, 1045, 2812, 1012, 1996, 2466, 1997, 2402, 7196, 2611, 29106, 1010, 2579, 2046, 1996, 2139, 18098, 10696, 2094, 2088, 1997, 1996, 3595, 2231, 2749, 2038, 2042, 17077, 2058, 2109, 2011, 4841, 1012, 2196, 2568, 1996, 1000, 2391, 1997, 2053, 2709, 1000, 1998, 2926, 1996, 1000, 2474, 26893, 29106, 1000, 2694, 2186, 1012, 2027, 3685, 12826, 1996, 2434, 2903, 2033, 999, 11669, 2122, 6876, 1012, 4965, 2023, 2028, 1010, 2079, 2025, 9278, 2009, 1010, 4

In [31]:
predicted_classes = []

# make sure dropout is disabled, then classify one review at a time
best_model.eval()
with torch.no_grad():
    for input_ids in tokenized_infer_data["input_ids"]:
        batch = torch.tensor(input_ids).unsqueeze(0).to(device)
        logits = best_model(input_ids=batch).logits
        probabilities = F.softmax(logits, dim=-1)
        predicted_classes.append(torch.argmax(probabilities, dim=-1).item())

# Create DataFrame
df = pd.DataFrame(
    {"input_ids": infer_data, "predicted_class": predicted_classes})

# Display DataFrame
display(df.head())

|   | input_ids | predicted_class |
|---|---|---|
| 0 | This is just a precious little diamond. The pl... | 0 |
| 1 | When I say this is my favourite film of all ti... | 1 |
| 2 | I saw this movie because I am a huge fan of th... | 1 |
| 3 | Being that the only foreign films I usually li... | 0 |
| 4 | After seeing Point of No Return (a great movie... | 0 |
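
The class ids are easier to read once they are mapped back to label names. This small standalone sketch copies the five predictions shown above and the `neg`/`pos` mapping from the model configuration; `label_names` and `preds` are names introduced for this example.

```python
label_names = {0: "neg", 1: "pos"}    # same mapping as id2label in the model config
preds = [0, 1, 1, 0, 0]               # predictions shown in the table above

predicted_labels = [label_names[i] for i in preds]
print(predicted_labels)
```

These reviews come from the unsupervised split, which carries no sentiment labels, so the predictions can only be inspected by reading the reviews, not scored automatically.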
