<a href="https://colab.research.google.com/github/juacardonahe/Curso_NLP/blob/main/1_FundamentosNLP/1.4_FoundationModels/1_4_4_FineTunning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://raw.githubusercontent.com/juacardonahe/Curso_NLP/refs/heads/main/data/UnFieldB.png" width="40%">

# **Natural Language Processing (NLP)**
### Departamento de Ingeniería Eléctrica, Electrónica y Computación
#### Universidad Nacional de Colombia - Sede Manizales

#### Created by: Juan José Cardona H.
#### Reviewed by: Diego A. Perez

# **1.4.4 - Fine-Tuning a Foundation Model**

In this example, we show how to fine-tune a foundation model for our spam detection task.

## **Loading Data**

In [None]:
url = "https://raw.githubusercontent.com/juacardonahe/Curso_NLP/refs/heads/main/data/SMSSpamCollection/SMSSpamCollection"

import pandas as pd
import urllib.request
data = urllib.request.urlopen(url)

# load the file directly from GitHub for compatibility with Colab
lines_split = [
    line.decode().strip().split("\t")
    for line in data
]
df = pd.DataFrame(lines_split, columns=["label", "text"])
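
Each line of the file holds a label and a message separated by a tab. The following minimal sketch shows the parsing step on a few in-memory lines, so it runs without a network connection. The sample messages and the helper `parse_line` are hypothetical, and `maxsplit=1` is a defensive variation on the cell above, not something the original code does.

```python
from collections import Counter

# Hypothetical sample in the same "label<TAB>text" layout as the downloaded file
sample_lines = [
    b"ham\tSee you at lunch?\n",
    b"spam\tWIN a free prize now! Reply YES\n",
    b"ham\tOk, call me later\n",
]


def parse_line(raw):
    """Decode one raw line and split it into [label, text]."""
    # maxsplit=1 keeps any further tab characters inside the message text
    return raw.decode("utf-8").strip().split("\t", 1)


rows = [parse_line(line) for line in sample_lines]

# Count how often each label occurs to get a feeling for the class balance
label_counts = Counter(label for label, _ in rows)
print(rows[1])
print(label_counts)
```

On the real data, `df["label"].value_counts()` gives the same kind of overview.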

## **Loading the Pre-Trained Model**
We will use the `transformers` library from Hugging Face for fine-tuning. The following cell will take some time to execute, because the model has to be downloaded.

In [None]:
from transformers import AutoModelForSequenceClassification

model_name = "google-bert/bert-base-cased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


You can use another model from [Hugging Face](https://huggingface.co/). We chose this model for the following reasons:
- BERT models were pre-trained on a "fill-in-the-gap" task (masked language modeling), so they use the context on both sides of a token. This tends to suit classification better than models trained only on a "predict the next word" task.
- We use a cased model (which means that it distinguishes between capital and small letters), since capitalization may be useful for predicting spam.
- The BERT model is a _classic_ - but note that many better models have been trained since.
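
To make the difference between the two pre-training tasks concrete, the following simplified sketch lists which token positions a model can use as context at a given position. The function `visible_positions` and the example sentence are hypothetical illustrations and not part of the `transformers` API.

```python
def visible_positions(seq_len, position, causal):
    """Return the token positions a model may use as context at `position`."""
    if causal:
        # "predict the next word": only the current and earlier tokens are visible
        return list(range(position + 1))
    # "fill-in-the-gap": every position of the sequence is visible
    return list(range(seq_len))


tokens = ["Win", "a", "free", "prize", "now"]
pos = 1  # look at the context available at the second token

bidirectional = [tokens[i] for i in visible_positions(len(tokens), pos, causal=False)]
left_to_right = [tokens[i] for i in visible_positions(len(tokens), pos, causal=True)]

print("fill-in-the-gap context:", bidirectional)
print("next-word context:      ", left_to_right)
```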


## **Tokenization**
For splitting the texts into tokens, we have to use the same tokenizer as was used when training the model. We download it as follows:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)


Let's look at what happens when we tokenize text:

In [None]:
some_text = df["text"].iloc[0]
print("Text:", some_text)
print("Token IDs:", tokenizer(some_text)["input_ids"])

Text: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Token IDs: [101, 3414, 1235, 179, 22497, 1403, 1553, 117, 4523, 119, 119, 11651, 8009, 2165, 1178, 1107, 15430, 1548, 183, 1632, 1362, 2495, 174, 171, 9435, 2105, 119, 119, 119, 140, 2042, 1175, 1400, 1821, 4474, 20049, 1204, 119, 119, 119, 102]


Like our word dictionary, our tokenizer turns a text into a sequence of IDs that can be fed to a network.

We tokenize all our datapoints:

In [None]:
df["tokens"] = df["text"].apply(lambda x: tokenizer(x, padding="max_length", max_length=64, truncation=True)["input_ids"])

In addition to what we did before, we make every tokenized text equally long: shorter texts are padded to 64 tokens, and longer texts are truncated so that only the first 64 tokens are retained. After 64 tokens, it should usually be clear whether a message is spam or not.
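
The effect of `padding="max_length"` and `truncation=True` can be sketched in plain Python. This is a simplified illustration for a single text, not the tokenizer's actual implementation. The helper names are hypothetical. The IDs 101 and 102 are the start and end markers visible in the token output above, and 0 is assumed to be the padding ID, which matches the attention mask used during fine-tuning.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP] and [PAD] (PAD_ID is an assumption)


def pad_or_truncate(ids, max_length=64):
    """Mimic padding="max_length" with truncation=True for one text.

    `ids` are token IDs without the special tokens.
    """
    body = ids[: max_length - 2]               # leave room for [CLS] and [SEP]
    out = [CLS_ID] + body + [SEP_ID]
    out += [PAD_ID] * (max_length - len(out))  # fill the rest with [PAD]
    return out


def attention_mask(ids):
    """1 for real tokens, 0 for padding."""
    return [int(i != PAD_ID) for i in ids]


short = pad_or_truncate([3414, 1235], max_length=6)
long = pad_or_truncate(list(range(1000, 1010)), max_length=6)

print(short, attention_mask(short))
print(long, attention_mask(long))
```

The short text is filled up with padding, while the long text loses everything that does not fit between the two special tokens.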

## **Preparing our Data**
We split our data into a training set and a test set as before.

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=.2, random_state=123)

We define our own dataset class.

In [None]:
import torch

class Data(torch.utils.data.Dataset):
    """Token IDs and binary spam labels, stored as tensors."""

    def __init__(self, df):
        # x: (n_samples, 64) token IDs; y: 1 for spam, 0 for ham
        self.x = torch.LongTensor(df["tokens"].tolist())
        self.y = torch.LongTensor((df["label"] == "spam").tolist())

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

    def __getitems__(self, idx):
        # Batched access: recent PyTorch data loaders call this with a list of
        # indices, so a whole batch is sliced out of the tensors at once.
        return self.x[idx], self.y[idx]

    def __len__(self):
        return len(self.x)


def collate_fn(batch):
    # __getitems__ already returns a batched (tokens, targets) pair,
    # so there is nothing left to stack here.
    return batch

# Create data and loader objects
train_data = Data(train)
test_data = Data(test)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True, collate_fn=collate_fn)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=256, shuffle=False, collate_fn=collate_fn)

## **Fine-Tuning**
Fine-tuning works in exactly the same way as training a model from scratch, except that we start from pre-trained weights instead of a random initialization. Only the classification head is newly initialized, as the warning printed when loading the model states. This means that we need far fewer epochs - a single epoch should be fine.

In [None]:
# redefine the model here, to make sure we "restart" fine-tuning
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

n_epochs = 1
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
ce = torch.nn.CrossEntropyLoss()

model.train()

losses = []

for epoch in range(n_epochs):
    print(f"Epoch {epoch + 1} of {n_epochs}")
    for x, y in train_loader:
        pred = model(x, attention_mask=(x!=0))
        loss = ce(pred.logits, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        losses.append(loss.detach().item())
        print(f"\rLoss {losses[-1]:2.2e}. Iteration {len(losses)} of {len(train_loader)}.", end="")




Epoch 1 of 1
Loss 1.70e-02. Iteration 77 of 140.
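
The number of iterations per epoch follows from the size of the training set and the batch size. The sketch below redoes this arithmetic. The row count of 5574 is an assumption about the dataset and is not stated in this notebook, and `iterations_per_epoch` is a hypothetical helper. The result is consistent with the "of 140" shown in the output above.

```python
import math


def iterations_per_epoch(n_rows, test_size, batch_size):
    """Number of batches the training loader yields in one epoch."""
    n_test = math.ceil(n_rows * test_size)  # train_test_split rounds the test share up
    n_train = n_rows - n_test
    return math.ceil(n_train / batch_size)  # the last, smaller batch still counts


# 5574 rows is an assumed dataset size; 0.2 and 32 are the values used above
n_iterations = iterations_per_epoch(5574, 0.2, 32)
print(n_iterations)
```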

## **Evaluation**

We iterate through the entire test set to create predictions.

In [None]:
# switch off dropout; the model is still in training mode from the loop above
model.eval()

with torch.no_grad():
    predictions = []
    for i, (x, _) in enumerate(test_loader):
        print(f"\r{i + 1:3d} of {len(test_loader)}", end="")
        pred = model(x, attention_mask=(x != 0))
        predictions.append(pred)

# get predicted labels: spam if the spam logit is at least as large as the ham logit
predictions_tensor = torch.concat([p.logits for p in predictions])
predicted_label = predictions_tensor[:, 1] >= predictions_tensor[:, 0]
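
Comparing the two logit columns is equivalent to thresholding the softmax probability of the spam class at 0.5. The following plain-Python sketch checks this on a few made-up logit pairs. The values and helper names are illustrative and are not outputs of the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_spam(logits):
    """Label as spam when the spam logit (index 1) is at least the ham logit."""
    return logits[1] >= logits[0]


example_logits = [[2.1, -1.3], [-0.4, 3.0], [0.5, 0.5]]  # made-up values

for row in example_logits:
    p_spam = softmax(row)[1]
    # both decision rules agree on these examples
    assert predict_spam(row) == (p_spam >= 0.5)

labels = [predict_spam(row) for row in example_logits]
print(labels)
```

Working with the probability instead of the raw comparison would allow a different threshold, for example to trade recall for precision.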


Again, we measure the quality with precision and recall:

In [None]:
from sklearn.metrics import precision_score, recall_score

print("Precision:", precision_score(test["label"] == "spam", predicted_label))
print("Recall:", recall_score(test["label"] == "spam", predicted_label))
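
As a reminder of what the two scores measure, here is a minimal hand-written version on made-up labels. It is a sketch for illustration and not a replacement for the `sklearn` functions. The helper `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for boolean labels (True = spam)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))        # spam, flagged as spam
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))  # ham, flagged as spam
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))  # spam that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: 4 spam messages and 4 ham messages
y_true = [True, True, True, True, False, False, False, False]
y_pred = [True, True, False, False, True, False, False, False]

precision, recall = precision_recall(y_true, y_pred)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}")
```

Here two of the three flagged messages are really spam, and two of the four spam messages are found.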

We observe that with a single pass through the data, our model performs very well. This shows the power of fine-tuning. However, we also see that applying our model is quite slow. For the spam detection use case, it is probably too slow if we want to detect spam in thousands of emails per second.
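
To put a number on "slow", one could measure the throughput in messages per second. The sketch below shows one possible way to do this. The function `messages_per_second` is a hypothetical helper, and `dummy_predict` is a stand-in so that the code runs without the model. For a real measurement, it would have to be replaced by a function that tokenizes a batch and calls the fine-tuned model.

```python
import time


def messages_per_second(predict_fn, batch, n_repeats=3):
    """Rough throughput estimate: processed messages divided by elapsed time."""
    start = time.perf_counter()
    for _ in range(n_repeats):
        predict_fn(batch)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero duration
    return n_repeats * len(batch) / elapsed


def dummy_predict(batch):
    # Hypothetical stand-in for the fine-tuned model: a trivial keyword rule
    return ["free" in text.lower() for text in batch]


batch = ["Free entry in a weekly competition", "See you tomorrow"] * 128
rate = messages_per_second(dummy_predict, batch)
print(f"{rate:,.0f} messages per second")
```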