<h1>Finetuning a Large Language Model</h1>

<h2>Overview</h2>

In this post, we will walk through finetuning a pretrained language model with the Hugging Face libraries. Previously, we explored the fundamentals of language models and how to build one. However, developing large, sophisticated language models requires immense computational resources and extensive datasets, which are not readily available to everyone. An efficient alternative is to build on an existing large language model (LLM) instead of creating one from scratch.

While this strategy saves a great deal of computation, there is a caveat: pretrained models are trained for general-purpose use and may not perform optimally on our own dataset and task. This is primarily because they may not have seen training examples that reflect the particular characteristics of our data. To tailor a pretrained language model to our needs while still capitalizing on the knowledge stored in its parameters, we can employ a strategy known as finetuning.

Finetuning allows us to specialize a model's abilities: we take an established model and continue training it, this time on our own dataset. This not only teaches the model the particularities of our dataset but also builds on what the model has already learned. In essence, we are not starting the learning process from scratch; we are standing on the shoulders of giants, benefiting from the pretrained model's extensive learning and adapting it to our specific task.

Nonetheless, it's crucial to acknowledge that finetuning, despite its advantages, isn't without its challenges. A primary concern is the risk of overfitting. Overfitting occurs when a model learns the training data too well, to the point where it captures noise and inaccuracies as patterns. This typically happens if the training data is too limited or the model is excessively complex relative to the task. As a result, while the model might perform exceptionally well on the training data, its ability to generalize to new, unseen data is compromised.

In the context of finetuning a language model, overfitting can occur if the model is trained for too long or too aggressively on a small, specific dataset. Although the pretrained model has learned from a vast amount of data, it risks becoming so specialized in the narrow task or dataset it is finetuned on that it loses its ability to perform well outside this context, an effect often called catastrophic forgetting.

There are strategies to mitigate overfitting, such as early stopping, which involves ending training when performance on a validation dataset starts to deteriorate, or employing regularization techniques. However, it remains a significant consideration when deciding the extent to which finetuning should be carried out.
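
As a rough illustration of early stopping, the sketch below tracks the validation loss after each epoch and signals a stop once it has failed to improve for a given number of epochs. The class name `EarlyStopping`, the helper `epochs_until_stop`, the `patience` and `min_delta` parameters, and the loss values are all assumptions made for this example; they are not part of any library used in this post.

```python
class EarlyStopping:
    """Signal a stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience        # number of non-improving epochs to tolerate
        self.min_delta = min_delta      # minimum decrease that counts as an improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss   # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1        # no improvement this epoch
        return self.bad_epochs >= self.patience


def epochs_until_stop(val_losses, patience=2):
    """Return the (1-based) epoch at which training would be stopped."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)


# Hypothetical validation losses: they improve for three epochs, then get worse
val_losses = [0.52, 0.41, 0.38, 0.40, 0.43, 0.47]
print("Stopped after epoch", epochs_until_stop(val_losses))  # Stopped after epoch 5
```

In a real training loop, `should_stop` would be called with the validation loss computed at the end of each epoch, and the best checkpoint seen so far would be kept.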

I aim to keep this content concise and focused. Therefore, I'll explore alternatives to finetuning in a subsequent post.

<h2>Task</h2>

The example task I will use to demonstrate finetuning is text classification for sentiment analysis.

As mentioned above, both the model and the dataset come from Hugging Face. Hugging Face is an AI platform that lets users share, access, and showcase their models. It hosts a wide range of models as well as a large collection of datasets, and it is particularly well known for its 'Transformers' library, which makes pretrained Transformer models easy to download, finetune, and use.

The dataset I will use is the Stanford Sentiment Treebank 2 (SST-2). This dataset is popular in natural language processing, particularly for sentiment analysis. It is the binary version of the original Stanford Sentiment Treebank (SST): each sentence, taken from a movie review, is labeled as either positive or negative. It is widely used for benchmarking sentiment analysis models and is also part of the GLUE benchmark, which is where we load it from below. You can find more information on this dataset <a href=https://huggingface.co/datasets/sst2>here</a>.

Let's load the dataset and look at a few examples below. Remember to install the 'transformers' and 'datasets' libraries if you haven't already done so.

In [1]:
#%pip install transformers datasets


In [2]:
from datasets import load_dataset
import torch


In [3]:
dataset = load_dataset("glue", "sst2")


In [4]:
for i in range(5):
    print("Sentence:", dataset["train"][i]["sentence"])
    print("Label:", "Positive" if dataset["train"][i]["label"] == 1 else "Negative")
    print()
    

Sentence: hide new secretions from the parental units 
Label: Negative

Sentence: contains no wit , only labored gags 
Label: Negative

Sentence: that loves its characters and communicates something rather beautiful about human nature 
Label: Positive

Sentence: remains utterly satisfied to remain the same throughout 
Label: Negative

Sentence: on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 
Label: Negative



<h2>Implementation</h2>

<h3>Dataset</h3>
Data preprocessing is the foundational step in building any deep learning or machine learning model. This is especially true for NLP tasks, where a crucial initial step is tokenization. Since we are using a publicly available dataset from Hugging Face, little preparation is required beyond loading it. For details on tokenization, you can refer to my post about it <a href=https://github.com/lsafarne/AIBites/blob/main/text_tokenization.ipynb>here</a>. For this project, we will use the DistilBertTokenizer from Hugging Face, which matches the model we are going to finetune.

In [5]:
from transformers import DistilBertTokenizer

# Load the tokenizer that matches the pretrained checkpoint
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Load SST-2 from the GLUE benchmark (the same dataset we inspected above).
# Note: the GLUE test split has no public labels (they are stored as -1),
# so the validation split should be used for evaluation.
raw_datasets = load_dataset("glue", "sst2")

# Tokenize the dataset: pad or truncate every sentence to the model's maximum length
def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
train_dataset = tokenized_datasets["train"]

# Format the dataset to output torch.Tensor
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
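
To make the effect of `padding="max_length"` and `truncation=True` more concrete, here is a toy sketch of what the tokenizer returns conceptually: a fixed-length list of token ids plus an attention mask that marks which positions are real tokens. The function `toy_encode`, the tiny vocabulary, and the id values are assumptions for illustration only; the real DistilBERT tokenizer uses WordPiece subwords, adds special tokens, and has a much larger vocabulary.

```python
def toy_encode(sentence, vocab, max_length, pad_id=0, unk_id=1):
    """Toy stand-in for a tokenizer call with padding='max_length' and truncation=True."""
    # Whitespace splitting is a simplification of real subword tokenization
    ids = [vocab.get(word, unk_id) for word in sentence.lower().split()]
    ids = ids[:max_length]               # truncation: drop tokens beyond max_length
    attention_mask = [1] * len(ids)      # 1 marks a real token
    n_pad = max_length - len(ids)
    ids = ids + [pad_id] * n_pad         # padding: fill up to max_length
    attention_mask = attention_mask + [0] * n_pad  # 0 marks padding
    return {"input_ids": ids, "attention_mask": attention_mask}


# Hypothetical vocabulary covering one sentence from the dataset
toy_vocab = {"contains": 2, "no": 3, "wit": 4, ",": 5, "only": 6, "labored": 7, "gags": 8}

encoded = toy_encode("contains no wit , only labored gags", toy_vocab, max_length=10)
print(encoded["input_ids"])       # [2, 3, 4, 5, 6, 7, 8, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside `input_ids` in the training loop.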


<h3>Model</h3>
The model that I will use is DistilBERT, a streamlined version of BERT. BERT (Bidirectional Encoder Representations from Transformers), developed by Google, is a transformer-based model known for learning contextual word embeddings. It uses a self-supervised learning paradigm, training on large volumes of text by masking words and predicting them, thereby gaining a deep contextual understanding of language. DistilBERT, developed by Hugging Face, is a distilled version of BERT: a smaller model trained to reproduce the behavior of the larger one. According to its authors, it is 40% smaller and about 60% faster while retaining 97% of BERT's language understanding capabilities. This makes DistilBERT an attractive choice for applications where model size and speed are crucial, without significantly compromising the quality of the results. I chose DistilBERT because I am running this notebook on my laptop and do not have access to a GPU.
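
The "masking words and predicting them" objective can be sketched in a few lines. The function below replaces each token with `[MASK]` with a fixed probability and records the original token as the prediction target. This is a simplified illustration: the function name `mask_tokens`, the whitespace tokenization, and the seed are assumptions, and actual BERT pretraining also sometimes substitutes a random token or keeps the original instead of always inserting `[MASK]`.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each token with [MASK] with probability `mask_prob`.

    Returns the masked sequence and a dict mapping each masked position
    to the original token the model would be trained to predict.
    """
    rng = random.Random(seed)  # local generator so the example is reproducible
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


tokens = ("that loves its characters and communicates something "
          "rather beautiful about human nature").split()
masked, targets = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(targets)
```

Because the targets come from the text itself, no human labeling is needed, which is what makes this kind of pretraining self-supervised.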

In [7]:
from transformers import DistilBertForSequenceClassification

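# Load the pre-trained model with a new classification head on top.
# num_labels=2 (the default) matches the two SST-2 classes: negative and positive.
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)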

# Define the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

<h3>Training</h3>

In [None]:
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW  # the AdamW exported by transformers is deprecated in recent versions
from torch.nn import CrossEntropyLoss
from tqdm import tqdm


def train(model, dataloader, optimizer, num_epochs):
    model.train()  # Put the model into training mode
    loss_fn = CrossEntropyLoss()

    for epoch in range(num_epochs):
        total_train_loss = 0.0  # Reset the running loss at the start of each epoch

        # Iterate over the batches
        for batch in tqdm(dataloader, desc=f"Training ({epoch+1}/{num_epochs})"):
            optimizer.zero_grad()  # Zero the gradients at the start of the iteration

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device).long()

            # Forward pass
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            loss = loss_fn(logits, labels)
            total_train_loss += loss.item()

            # Backward pass
            loss.backward()

            # Update weights
            optimizer.step()

        # Calculate the average loss over all the batches in this epoch
        avg_train_loss = total_train_loss / len(dataloader)
        print(f"Average training loss for epoch {epoch+1}: ", avg_train_loss)

    return avg_train_loss


# Prepare for training
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)

# The optimizer is created once here and passed into train()
optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 1

# Train the model
train(model, train_dataloader, optimizer, num_epochs)

# Save the model
model.save_pretrained("./sentiment_model")
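
To clarify what the cross-entropy loss in the training loop measures, the sketch below computes it by hand for a single example with two logits (negative, positive). It is a plain-Python illustration of the formula, not the PyTorch implementation; the function names and the logit values are made up for this example.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


# An undecided model (equal logits) gives a loss of ln(2), about 0.693
print(cross_entropy([0.0, 0.0], label=1))

# A confident prediction is cheap when it is right and expensive when it is wrong
confident_logits = [-2.0, 3.0]
print(cross_entropy(confident_logits, label=1))  # small loss
print(cross_entropy(confident_logits, label=0))  # large loss
```

A practical consequence is that an average training loss near 0.69 on a two-class problem suggests the model is doing no better than guessing.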

In [None]:
# Define the evaluation function
def evaluate(model, dataloader):
    model.eval()  # Put model in evaluation mode
    total_eval_loss = 0.0
    correct_predictions = 0
    total = 0

    criterion = torch.nn.CrossEntropyLoss()

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            # Extract data and labels
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device).long()

            # Forward pass
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            # Calculate loss and accuracy
            loss = criterion(logits, labels)
            predictions = torch.argmax(logits, dim=1)
            total_eval_loss += loss.item()
            correct_predictions += (predictions == labels).sum().item()
            total += labels.size(0)

        

    average_loss = total_eval_loss / len(dataloader)
    accuracy = correct_predictions / total

    return average_loss, accuracy


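# Prepare the evaluation dataloader.
# The validation split is used because the GLUE test split has no public labels (they are stored as -1).
eval_dataset = tokenized_datasets["validation"]
eval_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataloader = DataLoader(eval_dataset, batch_size=8)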

# Evaluate the model
eval_loss, eval_accuracy = evaluate(model, test_dataloader)

# Print final evaluation results
print(f"Final evaluation loss: {eval_loss}")
print(f"Final evaluation accuracy: {eval_accuracy}")


In addition to measuring the accuracy, let's perform inference on a few samples from the test split of the dataset:

In [8]:
model.eval()  # Ensure the model is in evaluation mode

for i in range(5):
    # Extract the sentence and its true label
    sentence = dataset["test"][i]["sentence"]
    true_label = "Positive" if dataset["test"][i]["label"] == 1 else "Negative"

    # Tokenize the sentence with the same tokenizer that was used to preprocess the dataset
    inputs = tokenizer(sentence, return_tensors="pt", max_length=512, truncation=True)
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    # Perform inference
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)

    # Extract the prediction
    prediction = torch.argmax(outputs.logits, dim=1).item()
    predicted_label = "Positive" if prediction == 1 else "Negative"

    # Print the sentence, the true label, and the predicted label
    print("Sentence:", sentence)
    print("True Label:", true_label)
    print("Predicted Label:", predicted_label)
    print()

Sentence: uneasy mishmash of styles and genres .
True Label: Negative
Predicted Label: Negative

Sentence: this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .
True Label: Negative
Predicted Label: Negative

Sentence: by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .
True Label: Negative
Predicted Label: Negative

Sentence: director rob marshall went out gunning to make a great one .
True Label: Negative
Predicted Label: Negative

Sentence: lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .
True Label: Negative
Predicted Label: Negative



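The predicted labels match the displayed labels for all five examples, but this result should be read with caution. The GLUE test split of SST-2 does not include public labels (they are stored as -1), so the code above prints "Negative" as the true label for every test sentence. The last sentence, for example, reads as clearly positive, yet both its displayed label and its prediction are "Negative". A meaningful spot check should draw its examples from the validation split, whose labels are available.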

<h3>Hugging Face AutoClasses</h3>

Instead of directly utilizing the DistilBertForSequenceClassification model, we have the option to employ the AutoModelForSequenceClassification and AutoTokenizer classes. These are components of the AutoClasses within the transformers library. The design of AutoClasses facilitates ease in working with various models, reducing the need for substantial code alterations. They effectively manage underlying complexities, offering a smoother experience across different model architectures. Further details on these classes can be accessed <a href=https://huggingface.co/docs/transformers/model_doc/auto>here</a>.



In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset


In [None]:
model_checkpoint = "distilbert-base-uncased" 
auto_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2) # SST-2 is binary classification: two labels


The above two lines demonstrate how the AutoClasses from the Transformers library can simplify the code. To use a different model, you only need to change the checkpoint name to that of any other compatible model available on the Hugging Face hub.
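
Conceptually, an Auto class reads the model type from the checkpoint's configuration and instantiates the matching architecture class. The toy registry below imitates that idea in plain Python. Everything in it (`ToyBertClassifier`, `ToyDistilBertClassifier`, `TOY_REGISTRY`, `toy_auto_from_config`) is hypothetical and only illustrates the dispatch pattern; it is not how the transformers library is implemented internally.

```python
class ToyBertClassifier:
    def __init__(self, num_labels):
        self.num_labels = num_labels


class ToyDistilBertClassifier:
    def __init__(self, num_labels):
        self.num_labels = num_labels


# Hypothetical mapping from a model type string to the class that implements it
TOY_REGISTRY = {
    "bert": ToyBertClassifier,
    "distilbert": ToyDistilBertClassifier,
}


def toy_auto_from_config(config, num_labels=2):
    """Pick the class registered for config['model_type'] and instantiate it."""
    model_type = config["model_type"]
    if model_type not in TOY_REGISTRY:
        raise ValueError(f"Unsupported model type: {model_type}")
    return TOY_REGISTRY[model_type](num_labels=num_labels)


# The calling code stays the same; only the configuration changes
toy_model = toy_auto_from_config({"model_type": "distilbert"})
print(type(toy_model).__name__)  # ToyDistilBertClassifier
```

This is why switching models with the AutoClasses usually requires changing only the checkpoint name rather than the surrounding code.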

Instead of the handwritten training loop shown earlier, we could also use the Trainer class, which makes the code shorter. I am more familiar with the explicit training loop, and I find it easier to debug when issues arise. For completeness, an example of how to use the Trainer class is included below.


In [None]:
# Set the device (Trainer also handles device placement on its own)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
auto_model.to(device)

# Train the model
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

trainer = Trainer(
    model=auto_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=tokenized_datasets["validation"],  # needed by trainer.evaluate()
)

trainer.train()

# Save the model that was just trained (auto_model, not the earlier `model`)
auto_model.save_pretrained("./sentiment_model")

# Evaluate the model on the validation split
results = trainer.evaluate()
print(results)
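
Without further configuration, `trainer.evaluate()` reports the evaluation loss but not the accuracy. Trainer accepts a `compute_metrics` function for this purpose, and the sketch below shows one possible version. The function name `compute_accuracy` and the example values are assumptions; the function is written in plain Python so that it can be tried out without running the Trainer.

```python
def compute_accuracy(eval_pred):
    """Turn (logits, labels) into an accuracy metric.

    Trainer calls this with an object that unpacks into the model's
    predictions (logits) and the true label ids.
    """
    logits, labels = eval_pred
    correct = 0
    for row, label in zip(logits, labels):
        row = list(row)
        predicted = max(range(len(row)), key=lambda i: row[i])  # index of the largest logit
        correct += int(predicted == int(label))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four examples, with columns (negative, positive)
example_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.0], [-0.2, 0.9]]
example_labels = [0, 1, 1, 1]
print(compute_accuracy((example_logits, example_labels)))  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_accuracy` when constructing the Trainer should add an `eval_accuracy` entry to the dictionary returned by `trainer.evaluate()`.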

In [None]:
# Recent versions of transformers require the accelerate package to use Trainer with PyTorch
#%pip install accelerate -U
