<h2>Pre-Requisites</h2>

First, we install the required libraries:


In [None]:
# @title Installing Libraries

!pip install -q datasets
!pip install -q transformers
!pip install -q peft
!pip install -q evaluate
!pip install -q numpy

In [None]:
# @title Importing libraries

# Importing the data handling utilities
from datasets import load_dataset, DatasetDict, Dataset

# Importing the core transformers module for model setup and tokenization
from transformers import (
    AutoTokenizer,  # Handles the conversion from raw text to tokens
    AutoConfig,  # Automatically infers the model configuration
    AutoModelForSequenceClassification,  # Prebuilt model for sequence classification tasks
    DataCollatorWithPadding,  # Dynamically adds padding to batched examples to the same common length
    TrainingArguments,  # Contains all training hyperparameters
    Trainer  # Streamlines the model training process
)

# Importing Parameter Efficient Fine-Tuning module and its classes
from peft import (
    PeftModel,  # for defining the PEFT model
    PeftConfig,  # for defining the config of the created PEFT model
    get_peft_model,  # utility function in order to apply PEFT on a given model
    LoraConfig  # used to configure Low-Rank Adaptation on a PEFT model
)

# Importing module for evaluation of model performance
import evaluate

# Other essential modules
import torch
import numpy as np

In [None]:
# @title Initializing the Language Model

# Defining the model checkpoint to use 'distilbert', a smaller, faster, cheaper version of BERT
model_checkpoint = 'distilbert-base-uncased'

# Mapping labels to integers and vice versa for the classification task
# Negative sentiment is mapped to 0, and Positive to 1
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

# Loading the pre-trained DistilBERT model with a new, randomly initialized classification head
# sized for the number of labels in our task
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
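
The two label mappings passed to the model need to be exact inverses of each other, otherwise the labels printed at inference time will not match the ones used during training. A minimal sketch of deriving one from the other follows; the helper name `invert_mapping` is hypothetical and not part of transformers.

```python
# Hypothetical helper (not part of transformers): build label2id from id2label
# so the two dictionaries cannot drift apart.
def invert_mapping(id2label):
    label2id = {label: idx for idx, label in id2label.items()}
    # Duplicate label names would silently collapse entries, so check for that
    if len(label2id) != len(id2label):
        raise ValueError("label names must be unique")
    return label2id

id2label = {0: "Negative", 1: "Positive"}
label2id = invert_mapping(id2label)
print(label2id)
```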


In [None]:
# @title Loading the dataset

# The dataset is 'imdb-truncated' from 'shawhin', which is a truncated version of the IMDB reviews dataset
dataset = load_dataset("shawhin/imdb-truncated")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
})

In [None]:
# @title Preprocessing the Dataset

# Instantiate the tokenizer that matches the 'distilbert-base-uncased' model checkpoint.
# 'add_prefix_space=True' matters for byte-level BPE tokenizers such as GPT-2 and RoBERTa;
# DistilBERT's WordPiece tokenizer does not need it.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

# Define a tokenization function that will be applied to the text data.
# This function will be used to map over the entire dataset.
def tokenize_function(examples):

  # Extract text to be tokenized.
  text = examples["text"]

  # Tokenize and truncate the text to a max length of 512 tokens.
  # Truncation is set to 'left' to keep the end of the reviews which might be more informative.
  tokenizer.truncation_side = "left"
  tokenized_inputs = tokenizer(
      text,
      return_tensors="np",  # Request numpy tensors to be returned
      truncation=True,
      max_length=512
  )
  return tokenized_inputs

# Check if the tokenizer has a pad token, add one if not. This is essential for batched processing.
if tokenizer.pad_token is None:
  tokenizer.add_special_tokens({'pad_token': '[PAD]'})
  # Resize the model's token embeddings to account for the new pad token
  model.resize_token_embeddings(len(tokenizer))

# Apply the tokenization function to the training and validation datasets in a batched manner.
# 'batched=True' processes multiple texts at once for efficiency.
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})
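
To make the effect of the truncation side concrete, here is a minimal sketch on a plain list of words. It is illustrative only: the function `truncate` is hypothetical, and unlike the real tokenizer it works on whole words and ignores special tokens such as [CLS] and [SEP].

```python
# Illustrative only: plain-list truncation that ignores special tokens
def truncate(tokens, max_length, side="right"):
    if len(tokens) <= max_length:
        return list(tokens)
    if side == "left":
        # Drop tokens from the start and keep the end of the text
        return list(tokens[-max_length:])
    # Drop tokens from the end and keep the start of the text
    return list(tokens[:max_length])

review = "the film starts slowly but the ending is wonderful".split()
kept_right = truncate(review, 4, side="right")
kept_left = truncate(review, 4, side="left")
print(kept_right)  # ['the', 'film', 'starts', 'slowly']
print(kept_left)   # ['the', 'ending', 'is', 'wonderful']
```

With left truncation the verdict at the end of the review survives, which is the motivation given in the tokenization cell.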

In [None]:
# @title Create data collator

# Initialize a data collator that will dynamically pad the batched examples to the maximum length
# in the batch. This ensures that all the sequences in a batch have the same length,
# which is required for processing by the model.
# The tokenizer is passed to the collator, which it uses to pad the sequences with the appropriate token.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
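
As a rough picture of what dynamic padding does, the sketch below pads a batch of token id lists to the length of the longest one and builds the matching attention mask. It is a simplified stand-in, not the actual `DataCollatorWithPadding` implementation; the function `pad_batch` and the token ids are made up for illustration, and it assumes right-side padding with a pad id of 0.

```python
PAD_TOKEN_ID = 0  # assumed pad id for this sketch

def pad_batch(batch_input_ids, pad_token_id=PAD_TOKEN_ID):
    # The target length is the longest sequence in this batch, not a global maximum
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for two sequences of different lengths
batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length is decided per batch, short reviews batched together are not padded all the way to 512 tokens.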

In [None]:
# @title Evaluation Metrics

# Load the 'accuracy' metric from the 'evaluate' library, which provides a suite of evaluation metrics
accuracy = evaluate.load("accuracy")

# Define a function to compute metrics, which will be used by the Trainer object.
# This function takes the output of the model prediction and processes it to calculate the accuracy.
def compute_metrics(p):

  # Unpack the predictions and labels from the output tuple 'p'
  predictions, labels = p

  # Convert the predictions to actual label indices using 'np.argmax' across the logits
  predictions = np.argmax(predictions, axis=1)

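  # Compute accuracy by comparing predictions to the true labels.
  # 'accuracy.compute' already returns a dict of the form {"accuracy": value},
  # so return it directly to give the Trainer a scalar metric rather than a nested dict
  return accuracy.compute(predictions=predictions, references=labels)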

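
The metric computation boils down to two steps: take the argmax over each row of logits, then compare against the labels. A small NumPy sketch with made-up logits and labels (not model output) shows the arithmetic:

```python
import numpy as np

# Made-up logits for four examples, one column per class (0 = Negative, 1 = Positive)
logits = np.array([
    [2.1, -0.3],
    [0.2, 1.5],
    [-1.0, 0.4],
    [0.9, 0.1],
])
labels = np.array([0, 1, 0, 0])

# The predicted class is the index of the largest logit in each row
predictions = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the labels
accuracy_value = float((predictions == labels).mean())
print(predictions, accuracy_value)
```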
In [None]:
# @title Untrained Model Performance

# Define a list of example sentences for sentiment classification
text_list = [
    "This was good",
    "Not a fan honestly, I don't recommend it.",
    "I mean it's alright like, overrated in my opinion.",
    "Amazing product fr",
    "This one is a pass"
]

# Print a header for the untrained model performance section
print("Untrained Model Performance: ")
print("-----------------------------")

# Loop through each example sentence in the list
for text in text_list:

  # Tokenize the text using the previously initialized tokenizer
  inputs = tokenizer.encode(text, return_tensors="pt")

  # Get the logits (raw prediction scores) from the untrained model
  logits = model(inputs).logits

  # Convert logits to actual predictions (0 or 1) using the argmax function
  predictions = torch.argmax(logits)

  # Print the text along with the corresponding predicted label (Negative or Positive)
  print(f"{text}-{id2label[predictions.tolist()]}")

# Note: The classification head is still randomly initialized at this point, so these
# predictions are essentially arbitrary, hence the need for training.

Untrained Model Performance: 
-----------------------------
This was good-Negative
Not a fan honestly, I don't recommend it.-Negative
I mean it's alright like, overrated in my opinion.-Positive
Amazing product fr-Negative
This one is a pass-Negative


In [None]:
# @title Fine-Tuning with LoRA

# Configure the LoRA settings for fine-tuning the model
# LoRA stands for Low-Rank Adaptation, a parameter-efficient fine-tuning (PEFT) method
peft_config = LoraConfig(
    task_type="SEQ_CLS",  # Indicates the task is sequence classification
    r=4,  # Rank of the low-rank update matrices
    lora_alpha=32,  # Scaling factor: the low-rank update is multiplied by lora_alpha / r
    lora_dropout=0.01,  # Dropout applied to the LoRA layers during fine-tuning
    target_modules=['q_lin']  # Modules to adapt: the query projection ('q_lin') in each attention block
)

# Apply the PEFT method to the pre-trained model using the defined configuration
model = get_peft_model(model, peft_config)

model.print_trainable_parameters()

trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9306847223789819
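
The trainable parameter count above can be reproduced by hand. The sketch below assumes the standard DistilBERT-base dimensions (hidden size 768, 6 transformer blocks) and assumes that, for a sequence classification task, PEFT leaves the two classification head layers trainable alongside the LoRA matrices. That second assumption is inferred from the fact that the numbers add up, not taken from the PEFT source.

```python
hidden_size = 768  # DistilBERT-base hidden dimension
num_layers = 6     # DistilBERT-base transformer blocks
num_labels = 2
r = 4
lora_alpha = 32

# Each adapted q_lin gets two small matrices: A (r x hidden) and B (hidden x r)
lora_params = num_layers * (r * hidden_size + hidden_size * r)

# Classification head: weights plus biases of both linear layers
pre_classifier_params = hidden_size * hidden_size + hidden_size
classifier_params = hidden_size * num_labels + num_labels

trainable = lora_params + pre_classifier_params + classifier_params
all_params = 67_584_004  # total reported by print_trainable_parameters() above
trainable_pct = 100 * trainable / all_params

# The low-rank update is scaled by lora_alpha / r
scaling = lora_alpha / r

print(lora_params, pre_classifier_params, classifier_params)
print(trainable, trainable_pct, scaling)
```

Under these assumptions the LoRA matrices themselves account for only 36,864 parameters; most of the trainable budget is the classification head.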


In [None]:
# @title Hyperparameters and Training

# Set the learning rate, batch size, and number of epochs for the training process
lr = 1e-3  # Learning rate: size of the optimization step
batch_size = 4  # Batch size: number of examples processed per optimization step
num_epochs = 10  # Number of epochs: times the model will iterate over the entire training dataset

# Define the training arguments using the Hugging Face's `TrainingArguments` class
training_args = TrainingArguments(
    output_dir=model_checkpoint + "-lora-text-classification",  # Directory for saving model and checkpoints
    learning_rate=lr,  # Set the learning rate
    per_device_train_batch_size=batch_size,  # Set batch size for training
    per_device_eval_batch_size=batch_size,  # Set batch size for evaluation
    num_train_epochs=num_epochs,  # Set number of training epochs
    weight_decay=0.01,  # Set weight decay for regularization
    evaluation_strategy="epoch",  # Evaluate at the end of each epoch
    save_strategy="epoch",  # Save the model at the end of each epoch
    load_best_model_at_end=True  # Reload the best checkpoint at the end of training (lowest validation loss by default)
)

# Initialize the Trainer with the defined model and training arguments
trainer = Trainer(
    model=model,  # The PEFT model to be fine-tuned
    args=training_args,  # The training arguments
    train_dataset=tokenized_dataset["train"],  # training data
    eval_dataset=tokenized_dataset["validation"],  # validation data
    tokenizer=tokenizer,  # The tokenizer for preprocessing
    data_collator=data_collator,  # The data collator for dynamic padding
    compute_metrics=compute_metrics,  # The metric computation function
)

# Start the training process
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch   Training Loss   Validation Loss   Accuracy
1       No log          0.387929          0.871
2       0.436900        0.47437           0.881
3       0.436900        0.529533          0.898
4       0.213200        0.794916          0.865
5       0.213200        0.680445          0.898
6       0.082600        0.782713          0.888
7       0.082600        0.818277          0.894
8       0.016900        0.902025          0.9
9       0.016900        0.894379          0.897
10      0.015600        0.89657           0.896
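
The log shows validation loss rising after the first epoch while accuracy keeps improving slightly, so "best" depends on the metric used. The sketch below picks the best epoch both ways from the numbers in the log. It assumes the Trainer default, where `load_best_model_at_end=True` without an explicit `metric_for_best_model` compares checkpoints by validation loss.

```python
# (epoch, validation loss, accuracy) copied from the training log above
history = [
    (1, 0.387929, 0.871),
    (2, 0.47437, 0.881),
    (3, 0.529533, 0.898),
    (4, 0.794916, 0.865),
    (5, 0.680445, 0.898),
    (6, 0.782713, 0.888),
    (7, 0.818277, 0.894),
    (8, 0.902025, 0.9),
    (9, 0.894379, 0.897),
    (10, 0.89657, 0.896),
]

# Lower validation loss is better; higher accuracy is better
best_by_loss = min(history, key=lambda row: row[1])
best_by_accuracy = max(history, key=lambda row: row[2])

print("Best epoch by validation loss:", best_by_loss[0])
print("Best epoch by accuracy:", best_by_accuracy[0])
```

Under that default, the checkpoint reloaded at the end of training would be the one from epoch 1, not the one with the highest accuracy.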


TrainOutput(global_step=2500, training_loss=0.15301785106658936, metrics={'train_runtime': 473.9091, 'train_samples_per_second': 21.101, 'train_steps_per_second': 5.275, 'total_flos': 1113026652407424.0, 'train_loss': 0.15301785106658936, 'epoch': 10.0})
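
The figures in `TrainOutput` follow directly from the dataset size and the hyperparameters. The sketch below reproduces them, assuming a single device and no gradient accumulation. It also assumes the Trainer's default `logging_steps` of 500, which would explain why the first epoch shows "No log" for the training loss.

```python
import math

num_train_examples = 1000  # rows in the train split
batch_size = 4
num_epochs = 10
train_runtime = 473.9091   # seconds, from TrainOutput above

# One optimizer step per batch (assumes one device, no gradient accumulation)
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = round(total_steps / train_runtime, 3)
samples_per_second = round(num_train_examples * num_epochs / train_runtime, 3)

# With training loss logged every 500 steps (assumed default), the first
# logged value appears only once step 500 is reached
logging_steps = 500
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps)
print(steps_per_second, samples_per_second, first_logged_epoch)
```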

In [None]:
# @title Trained Model Performance

# Move the model to the GPU if one is available, otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Print the header for the trained model performance section.
print("Trained Model Performance: ")
print("---------------------------")

# Iterate over the list of test sentences to evaluate the trained model.
for text in text_list:

  # Tokenize the text and move tensors to the same device as the model.
  inputs = tokenizer.encode(text, return_tensors="pt").to(device)

  # Obtain the logits from the model.
  logits = model(inputs).logits

  # Convert logits to predicted class indices.
  predictions = torch.max(logits, 1).indices

  # Print the original text and its predicted sentiment label.
  print(f"{text} - {id2label[predictions.tolist()[0]]}")

# Note: This process showcases how the model performs on individual text examples after training.
# We can see the predicted sentiment label for each text, which should now be more accurate than before training.

Trained Model Performance: 
---------------------------
This was good - Positive
Not a fan honestly, I don't recommend it. - Negative
I mean it's alright like, overrated in my opinion. - Negative
Amazing product fr - Positive
This one is a pass - Negative
