This notebook is due to Margarida Campos (many thanks!!). I (Luísa) included some additional details. Any mistakes are my own.

# Fine-tuning

In this notebook you will use transfer learning to tackle a text classification task.

We will fine-tune a BERT-like model on our small dataset in order to take advantage of the knowledge acquired by the pretrained model.

**Attention:** Turn on GPU runtime to increase the speed of training!

In [None]:
import numpy as np
import pandas as pd

from sklearn import model_selection, metrics
from datasets import Dataset

import transformers

## Getting the Data
Let's continue our challenge of classifying spells:

In [None]:
data = pd.read_csv('/content/sample_data/P11_dataset_cast.csv',sep=';')
data.head()

Transforming our target (labels) into integers:

In [None]:
n_classes = data['category'].unique().shape[0]
data['category'] = data['category'].astype('category')
data['label'] = data['category'].cat.codes
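
To keep track of which integer stands for which category, the mapping can be read back from the categorical column. The following is a minimal sketch on a toy DataFrame (the category names `charm`, `curse`, and `hex` are made up for illustration, not taken from the dataset). `cat.codes` numbers the categories in the order given by `cat.categories`, which for strings is alphabetical by default.

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical category names)
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "category": ["hex", "charm", "curse", "charm"],
})

toy["category"] = toy["category"].astype("category")
toy["label"] = toy["category"].cat.codes

# Mapping between integer labels and category names
id2label = dict(enumerate(toy["category"].cat.categories))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
print(toy["label"].tolist())
```

The same two dictionaries can be built from `data['category'].cat.categories`, and they are handy later for turning predicted class ids back into category names.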

### Split into train and test

In [None]:
df_train, df_test = model_selection.train_test_split(data[['text','label']],
                                                     test_size=0.1,
                                                     random_state=42)

## Tokenizing

We will use [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert) - a lighter model that was trained to mimic BERT's outputs.
We will use the [HuggingFace](https://huggingface.co/) library - home to most of the state-of-the-art language models.

Different models were trained using different preprocessing steps, so we need to use the appropriate `tokenizer`:

In [None]:
model_name = 'distilbert-base-uncased'
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name,
                                          do_lower_case=True)

### What does the tokenizer do?
BERT tokenizers, in addition to tokenizing a sentence, add two special tokens:
 - [CLS] token used for text classification
 - [SEP] token used for the end of the sentence or separation of sentences when each observation is more than one sentence combined.

<img src="https://raw.githubusercontent.com/MargaridaMCampos/SL_data/main/Screenshot%202025-10-03%20at%2016.45.48.png" alt="drawing" width="300"/>

The outputs are:

 - `input_ids`: the ids of each token
 - `attention_mask`: binary vector indicating whether each token should be considered by the attention mechanism (`0` appears where a sentence was padded, for example: the model should ignore the padded positions!)
 - Check what happens if you use `tokenizer("I love NLP", padding='max_length', max_length=10)`
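
The special tokens and the attention mask can be illustrated without the real tokenizer. The sketch below is a toy whitespace tokenizer, not DistilBERT's WordPiece tokenizer: the function `toy_encode` and its string tokens are hypothetical and only mimic the layout of the real output (real `input_ids` are integers from the model's vocabulary).

```python
def toy_encode(text, max_length=None):
    """Mimic the layout of a BERT-style encoding using plain strings."""
    # Lowercase and split on whitespace (the real tokenizer uses WordPiece)
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Real tokens get attention 1
    attention_mask = [1] * len(tokens)

    if max_length is not None:
        # Truncate if too long, always keeping [SEP] as the final token
        if len(tokens) > max_length:
            tokens = tokens[:max_length - 1] + ["[SEP]"]
            attention_mask = attention_mask[:max_length]
        # Pad if too short; padded positions get attention 0
        n_pad = max_length - len(tokens)
        tokens += ["[PAD]"] * n_pad
        attention_mask += [0] * n_pad

    return {"tokens": tokens, "attention_mask": attention_mask}


encoded = toy_encode("I love NLP", max_length=10)
print(encoded["tokens"])
print(encoded["attention_mask"])
```

Comparing this with the output of the real tokenizer shows the same pattern: ones for actual tokens (including the special ones) and zeros for padding.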

In [None]:
sample_text = 'I love NLP'
tokens = tokenizer(sample_text, truncation=True)

# Test:
# tokens = tokenizer("I love NLP", truncation=True, padding='max_length', max_length=10)

print(tokens)
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"])) # to see which were the considered tokens


We will transform our data into the HuggingFace `Dataset` format, which will make our lives easier, especially with larger datasets!
This converts our raw data (text files, CSVs, JSON, etc.; here, a pandas DataFrame) into the standardized `Dataset` object provided by the Hugging Face `datasets` library.

In [None]:
# Convert data into Dataset structure
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)

Let's tokenize our data!
If you want to add preprocessing steps, you can do so inside the function, which we then map over our `Dataset` structure.

In [None]:
def tokenize_data(dataset):
    return tokenizer(dataset["text"], truncation=True)

tokenized_train = train_dataset.map(tokenize_data, batched=True)
tokenized_test = test_dataset.map(tokenize_data, batched=True)
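
As an example of adding a preprocessing step inside the mapped function, the sketch below collapses repeated whitespace before tokenizing. `clean_text` and `tokenize_with_cleaning` (including its `tokenizer` argument) are illustrative names, not part of the original notebook. With `batched=True`, the function receives a dictionary whose `"text"` entry is a list of strings.

```python
import re


def clean_text(text):
    """Hypothetical preprocessing step: collapse whitespace and strip the ends."""
    return re.sub(r"\s+", " ", text).strip()


def tokenize_with_cleaning(batch, tokenizer):
    """Clean every text in the batch, then tokenize the cleaned texts."""
    cleaned = [clean_text(t) for t in batch["text"]]
    return tokenizer(cleaned, truncation=True)


print(clean_text("  I   love NLP \n"))
```

It can be mapped in the same way as before, for instance with `train_dataset.map(tokenize_with_cleaning, batched=True, fn_kwargs={"tokenizer": tokenizer})`.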

## Training

Let's load our pretrained model! We can make use of HuggingFace's pre-built architectures: using `AutoModelForSequenceClassification` we are already importing the BERT architecture adapted for text classification:

In [None]:
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name,
                                                                        num_labels = n_classes)

### Metrics

The model outputs *logits*: for each input, a vector with one score per class.
Let's define some evaluation functions to monitor the progress of our model!

*Note:* Run the cell below if you're in Colab or you don't have the `evaluate` package installed.

In [None]:
!pip install evaluate
import evaluate

In [None]:
# load metrics
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1) # obtaining the class with the highest score
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1_macro": f1.compute(predictions=preds, references=labels, average="macro")["f1"],
    }
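
To make the computation inside `compute_metrics` concrete, here is a small NumPy sketch on made-up logits for 4 examples and 3 classes. It recomputes accuracy and macro F1 by hand instead of calling `evaluate`, so the values and the helper `macro_f1` are purely illustrative.

```python
import numpy as np

# Made-up logits: one row per example, one column per class
logits = np.array([
    [2.0, 0.5, -1.0],
    [0.1, 0.2, 3.0],
    [1.5, 1.6, 0.0],
    [0.3, 2.2, 0.1],
])
labels = np.array([0, 2, 0, 1])

# Predicted class = index of the highest score in each row
preds = np.argmax(logits, axis=-1)

# Softmax turns logits into probabilities without changing the argmax
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)

accuracy = float((preds == labels).mean())


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


f1_macro = macro_f1(labels, preds, n_classes=3)
print(preds, accuracy, round(f1_macro, 4))
```

Macro averaging gives every class the same weight regardless of how many examples it has, which is why it is a useful companion to accuracy on imbalanced data.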

Let us create a padding operator (a data collator) to give to the trainer. You can pass additional padding parameters to it!

In [None]:
# Prepare data collator for padding sequences
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)

# Note: it pads to the longest sequence in the batch, not to a fixed global max length, which is more efficient!

In [None]:
# Example:

samples = [
    tokenizer("I love NLP"),
    tokenizer("I love models as Transformers")
]

batch = data_collator(samples)
print(batch.keys())
print(batch["input_ids"]) # notice the 0s at the end of the first row of input_ids (try to explain them)
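
What the collator does can be sketched in plain Python. This is a toy re-implementation under simplifying assumptions (lists instead of tensors, a pad id of `0`, and made-up token ids); `pad_batch` is a hypothetical helper, not the library's code.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Pad every sequence to the longest one in the batch (or to max_length)."""
    target = max_length if max_length is not None else max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sentences of different lengths
batch_ids = [[101, 11, 12, 13, 102], [101, 11, 12, 21, 22, 23, 102]]

dynamic = pad_batch(batch_ids)                  # pad to the longest in the batch
fixed = pad_batch(batch_ids, max_length=512)    # pad to a fixed global length

n_pad_dynamic = sum(row.count(0) for row in dynamic["input_ids"])
n_pad_fixed = sum(row.count(0) for row in fixed["input_ids"])
print(dynamic["input_ids"])
print(n_pad_dynamic, n_pad_fixed)
```

The gap between the two pad counts is computation the model would spend on positions it is told to ignore, which is why padding per batch is usually preferred.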

### Defining training parameters

In HuggingFace we define our training hyperparameters in `TrainingArguments`.
Please spend some time here if these concepts are unfamiliar to you.

In [None]:
# Defining the training arguments

# Remember:
# a) Epoch: one full pass through the entire training dataset.
# b) Step (or iteration): one batch of data is passed through the model (forward + backward pass).

training_args = transformers.TrainingArguments(
    output_dir="results", # where to save the resulting trained model
    learning_rate=2e-5, # learning rate
    per_device_train_batch_size=16, # batch size for training
    per_device_eval_batch_size=16, # batch size for evaluation (affects speed and memory use, not the metrics)
    num_train_epochs=5, # epochs ("epoch" is also used internally to track which epoch you’re on)
    weight_decay=0.01, # penalizing very large weights
    eval_strategy="epoch", # evaluating at each epoch (it could be every N steps)
    logging_strategy="steps", # when to log information (like training loss, learning rate, etc.).
    logging_steps=50, # After every 50 batches (steps), record training metrics
    seed=42,      # global random seed, set at the beginning of training (for reproducibility)
    data_seed=42, # seed for the data samplers (e.g., the order in which training examples are shuffled)
    report_to="none") # avoid API requests
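
The relation between epochs, steps, and logging can be worked out by hand. The sketch below assumes a hypothetical training set of 810 examples (the real size depends on the dataset), a single device, and no gradient accumulation; the other numbers are the ones passed to `TrainingArguments`.

```python
import math

n_train_examples = 810   # hypothetical; replace with len(tokenized_train)
batch_size = 16          # per_device_train_batch_size
num_epochs = 5           # num_train_epochs
logging_steps = 50

# One step = one batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(n_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which the training loss is logged (every `logging_steps` steps)
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps, logged_at)
```

With a small dataset, `logging_steps=50` may produce only a handful of log entries over the whole run, so a smaller value can give a smoother picture of the training loss.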

### Defining our training process

We define our finetuning in the `Trainer` object: what is the model, what is the data, the tokenizer, the padder, the metrics to use...

In [None]:
# Split the training set again: 90% train, 10% validation
split = tokenized_train.train_test_split(test_size=0.1, seed=42)

tokenized_train = split['train']       # 90% for training
tokenized_validation = split['test']   # 10% for validation

# Define Trainer object for training the model
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_validation, # not tokenized_test,
    processing_class=tokenizer, # the tokenizer will handle preprocessing of text data
    data_collator=data_collator, # process and batch examples dynamically before feeding them to the model (pads, ...)
    compute_metrics=compute_metrics
)
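
Because the validation set is carved out of the training split, the final proportions relative to the full dataset are not 90/10. The following is plain arithmetic on the two `test_size=0.1` values used in this notebook; actual counts may differ by an example because of rounding.

```python
test_size = 0.1         # first split: train vs. test
validation_size = 0.1   # second split: train vs. validation (of the remaining data)

test_fraction = test_size
validation_fraction = (1 - test_size) * validation_size
train_fraction = (1 - test_size) * (1 - validation_size)

print(f"train={train_fraction:.2f}, validation={validation_fraction:.2f}, test={test_fraction:.2f}")
```

So the model is trained on roughly 81% of the data, monitored on 9%, and the remaining 10% is kept untouched for the final predictions.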

## Let's finetune!

In [None]:
# Train the model
trainer.train()

# Save the trained model
trainer.save_model('model')

### Evaluating

In [None]:
eval_metrics = trainer.evaluate()
print("Eval metrics:", eval_metrics)

## Making predictions

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Run inference on the tokenized test set
predictions = trainer.predict(tokenized_test)

# Get the raw logits
logits = predictions.predictions

# Convert to class IDs
y_pred = np.argmax(logits, axis=-1)

# True labels are here:
y_true = predictions.label_ids

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))

## Let's try to make it better!

 - Increase the number of epochs to `10`. Is the model still learning?
 - Do a little research on lighter language models for classification and experiment with a different one (always checking the HuggingFace documentation!)

 - We are using the tokenizer and the model with the default arguments. Check the [documentation](https://huggingface.co/docs/transformers/v4.14.1/en/model_doc/distilbert#transformers.DistilBertConfig) and play with some parameters of your choice - think about what they mean!


## Freezing parameters

Sometimes we do not have the compute resources to finetune the large number of parameters of the pretrained models. One common approach is to *freeze* the language model's weights - use them as they are - and train **only** the weights of the last layers - the classifier part: using the pretrained model essentially as a *feature provider*!

Add the following code after loading the model (that is, after `model = AutoModelForSequenceClassification.from_pretrained...`)

Notice that:
- All parameters, both in `model.distilbert` and in `model.classifier`, start with `requires_grad = True`

# Freeze all layers except the classifier (“freezes” the DistilBERT encoder)
for param in model.distilbert.parameters():
    param.requires_grad = False

# If you want to make explicit that only the classification head is trainable
for param in model.classifier.parameters():
    param.requires_grad = True

# If you want to unfreeze the last two layers
# 1) Freeze everything first
for param in model.distilbert.parameters():
    param.requires_grad = False

# 2) Unfreeze the last two transformer layers
for layer in model.distilbert.transformer.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True



In [None]:
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
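
As a sanity check on the printed number, the size of the part that stays trainable can be estimated by hand. The sketch assumes the usual `distilbert-base-uncased` layout: a hidden size of 768, with a `pre_classifier` linear layer (768 to 768) followed by the `classifier` layer (768 to `n_classes`), and a hypothetical `n_classes = 4`. Check `model.config` and `print(model)` to confirm these for your setup. Note that `pre_classifier` lives outside `model.distilbert`, so it remains trainable when only the encoder is frozen.

```python
def linear_params(in_features, out_features):
    """Number of parameters in a linear layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features


hidden_size = 768   # assumed hidden size of distilbert-base-uncased
n_classes = 4       # hypothetical number of categories

pre_classifier = linear_params(hidden_size, hidden_size)
classifier = linear_params(hidden_size, n_classes)

print("pre_classifier:", pre_classifier)
print("classifier:", classifier)
print("trainable with a frozen encoder:", pre_classifier + classifier)
```

If the number printed by the notebook is far larger than this estimate, some encoder layers are still trainable (for example after unfreezing the last two transformer layers).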