# Fine-tuning BERTimbau for Part-of-Speech Tagging

This notebook fine-tunes a BERTimbau model (`neuralmind/bert-base-portuguese-cased`, 110M parameters) on the Mac-Morpho dataset to perform part-of-speech tagging (POS tagging) in Portuguese.

The model's performance is evaluated by:
- Calculating three different metrics on the test dataset:
    - Macro precision
    - Weighted precision
    - Per-class precision
- Comparing these metrics with the ones obtained by a competitor model:
    - The competitor model is `lisaterumi/postagger-portuguese`
    - This model is also BERTimbau-based
    - This model is also fine-tuned to perform POS tagging on the Mac-Morpho dataset

In practice, the model we fine-tune here aims to reproduce the results of the `lisaterumi/postagger-portuguese` model.

In [1]:
dataset_name = "nilc-nlp/mac_morpho"
model_name = "neuralmind/bert-base-portuguese-cased"

# The dataset

From the official Mac-Morpho [web-page](http://nilc.icmc.usp.br/macmorpho/):

> Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003, and since then, two revisions have been made in order to improve the quality of the resource.

According to the [dataset manual](http://nilc.icmc.usp.br/macmorpho/macmorpho-manual.pdf), it is already split into train, validation and test subsets, and it contains a total of 27 possible classes/tags. The following table shows the classes in our dataset:

| Tag | Meaning (grammatical class in Portuguese) |
| ------------------- | ------------------- |
|  ADJ |  Adjetivo |
|  ADV |  Advérbio |
|  ADV-KS |  Advérbio conjuntivo subordinado  |
|  ADV-KS-REL |   Advérbio relativo subordinado |
|  ART |  Artigo  |
|  CUR |  Moeda  |
|  IN |  Interjeição |
|  KC |  Conjunção coordenativa |
|  KS |  Conjunção subordinativa |
|  N |  Substantivo |
|  NPROP | Substantivo próprio |
|  NUM |  Número |
|  PCP |  Particípio |
|  PDEN |  Palavra denotativa |
|  PREP |  Preposição |
|  PROADJ |  Pronome Adjetivo |
|  PRO-KS |  Pronome conjuntivo subordinado |
|  PRO-KS-REL |  Pronome relativo conectivo subordinado |
|  PROPESS |  Pronome pessoal |
|  PROSUB |  Pronome nominal |
|  V | Verbo |
|  VAUX  | Verbo auxiliar |

Next, we'll prepare the dataset for the fine-tuning task.


## Load dataset from HuggingFace

In [None]:
from datasets import load_dataset

dataset = load_dataset(dataset_name, trust_remote_code=True)

## Creating maps from label to ID and vice-versa

In [None]:
# Get unique labels
labels = dataset["train"].features["pos_tags"].feature.names

# Create the mappings
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

label2id

## Loading the BERTimbau tokenizer from HuggingFace

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)

In [5]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

## Tokenize and align dataset

Tokenization (converting words into token IDs) is a well-known step in NLP pipelines, so we won't explain it here. One additional step is required, though: **aligning** the labels with the tokens we want to classify.

Aligning, in this context, means *deciding which subword tokens of a word receive its tag*. A subword tokenizer (such as the one we're using) may split a single word into several tokens, so we need to ensure that each label is associated with the correct token. This is where label alignment comes into play.

In our dataset, each word has a single label, but tokenization may split a word into subwords, so the number of tokens often exceeds the number of original words. For example:
- Imagine our labels look something like: `label2id = {"NOUN": 0, "VERB": 1, ...}`
- Now imagine that we're trying to classify the word `playing` as a `VERB`
- When we tokenize the word, it is broken into **two tokens**: `["play", "ing"]`
- Which one should we tag as a `VERB`: `play` or `ing`?

We resolve this by "aligning" our dataset:
1. The first token of a word (in our case, `play`) is tagged with the original word's label (in our case, `VERB`, with value `1`).
2. The remaining tokens of the word are assigned the special value `-100`. This is the value that PyTorch's cross-entropy loss ignores by default, so these tokens are excluded from the loss calculation during training.
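The two alignment rules above can be sketched on a toy example that needs no tokenizer. The helper name `align_labels`, the sentence and the `word_ids` list are hypothetical and chosen only for illustration; a real tokenizer may split the words differently.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Keep a word's label on its first sub-token and mask every other position."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] belong to no word
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # First sub-token of a new word: keep the word's label
            aligned.append(word_labels[word_id])
        else:
            # Continuation sub-token of the same word: mask it
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned


# Hypothetical tokenization of ["dogs", "playing"] as
# [CLS] dogs play ##ing [SEP]
toy_word_ids = [None, 0, 1, 1, None]
toy_word_labels = [0, 1]  # NOUN=0, VERB=1, as in the example above
toy_aligned = align_labels(toy_word_ids, toy_word_labels)
print(toy_aligned)  # [-100, 0, 1, -100, -100]
```

Only `dogs` and `play` keep a label; `##ing` and the special tokens are masked.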




In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, padding=True, is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["pos_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        aligned_labels = []
        previous_word_id = None
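        for word_id in word_ids:
            if word_id is None:
                # Special tokens ([CLS], [SEP], padding) belong to no word: ignore them
                aligned_labels.append(-100)
            elif word_id != previous_word_id:
                # First subword of a word: keep the word's label
                aligned_labels.append(label[word_id])
            else:
                # Remaining subwords of the same word: ignore them in the loss
                aligned_labels.append(-100)
            previous_word_id = word_id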
        labels.append(aligned_labels)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
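To illustrate why positions labelled `-100` do not influence training, here is a minimal pure-Python sketch of a mean cross-entropy that skips them. It imitates the default `ignore_index=-100` behaviour of PyTorch's `CrossEntropyLoss`, but it is not the library implementation. The function name `masked_cross_entropy` and the toy logits are hypothetical.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Log-sum-exp with max subtraction for numerical stability
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[label])
    # Returning 0.0 when everything is masked is a choice made for this sketch
    return sum(losses) / len(losses) if losses else 0.0


# Five positions, two classes; only positions 1 and 2 carry a real label
toy_labels = [-100, 0, 1, -100, -100]
toy_logits = [[9.0, 0.0], [0.0, 0.0], [0.0, 9.0], [9.0, 0.0], [5.0, 5.0]]
loss_masked = masked_cross_entropy(toy_logits, toy_labels)

# Changing the logits at masked positions leaves the loss unchanged
other_logits = [[0.0, 9.0], [0.0, 0.0], [0.0, 9.0], [0.0, 0.0], [1.0, 7.0]]
loss_other = masked_cross_entropy(other_logits, toy_labels)
print(loss_masked == loss_other)  # True
```

Whatever the model predicts for continuation subwords or special tokens, only the first subword of each word drives the gradient.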

# Fine-tune our model

We'll fine-tune our model for only $5$ epochs with a learning rate of $5\mathrm{e}{-5}$. Our dataset is split into train (37948 samples), validation (1997 samples) and test (9987 samples) subsets.
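As a rough estimate of the training length, the number of optimizer steps can be derived from these figures. The batch size below is an assumption: it is the `TrainingArguments` default of 8 per device, and the sketch further assumes a single device and no gradient accumulation.

```python
import math

train_samples = 37948  # size of the train split, from the text above
num_epochs = 5         # from the text above
batch_size = 8         # assumption: default per-device batch size, single device

# The last, partially filled batch of each epoch still counts as one step
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 4744 23720
```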

## Download model from HuggingFace

In [None]:
from transformers import AutoModelForTokenClassification


model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label2id), id2label=id2label, label2id=label2id
)

model.to("cuda")

## Define fine-tuning routine

### Defining custom metrics to evaluate model

In [8]:
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

def compute_metrics(p):
    predictions, labels = p

    # Get predicted labels by taking argmax
    predictions = np.argmax(predictions, axis=-1)

    # Flatten and filter out ignored tokens (-100)
    true_labels = labels.flatten()
    pred_labels = predictions.flatten()
    mask = true_labels != -100
    true_labels = true_labels[mask]
    pred_labels = pred_labels[mask]

    # Compute metrics
    macro_precision = precision_score(
        true_labels, pred_labels, average="macro", zero_division=0
    )
    weighted_precision = precision_score(
        true_labels, pred_labels, average="weighted", zero_division=0
    )
    macro_recall = recall_score(
        true_labels, pred_labels, average="macro", zero_division=0
    )
    weighted_recall = recall_score(
        true_labels, pred_labels, average="weighted", zero_division=0
    )
    f1 = f1_score(true_labels, pred_labels, average="weighted", zero_division=0)
    per_class_precision = precision_score(
        true_labels, pred_labels, average=None, zero_division=0
    )

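    # Map class indices to precision values. With average=None, scikit-learn
    # returns one score per label in the sorted union of true and predicted
    # labels, so the tags must be built from both arrays to stay aligned.
    unique_tags = np.unique(np.concatenate([true_labels, pred_labels]))
    per_class_precision_dict = {
        id2label[int(tag)]: float(per_class_precision[i])
        for i, tag in enumerate(unique_tags)
    }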

    return {
        "macro_precision": macro_precision,
        "weighted_precision": weighted_precision,
        "macro_recall": macro_recall,
        "weighted_recall": weighted_recall,
        "f1_score": f1,
        "per_class_precision": per_class_precision_dict,
    }
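To make the difference between the macro and weighted averages concrete, here is a small pure-Python sketch on toy data. The helper names and the toy labels are hypothetical. The sketch is meant to mirror what `precision_score` computes with `average="macro"` and `average="weighted"`, not to replace it.

```python
from collections import Counter


def per_class_precision(y_true, y_pred):
    """Precision of class c = correct predictions of c / all predictions of c."""
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    classes = sorted(set(y_true) | set(y_pred))
    return {c: (correct[c] / predicted[c] if predicted[c] else 0.0) for c in classes}


def macro_precision(y_true, y_pred):
    """Unweighted mean: every class counts the same, however rare it is."""
    scores = per_class_precision(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_precision(y_true, y_pred):
    """Mean weighted by support, i.e. by how often each class occurs in y_true."""
    scores = per_class_precision(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy data: a frequent class (N) and a rare class (IN)
toy_true = ["N"] * 8 + ["IN"] * 2
toy_pred = ["N"] * 7 + ["IN"] + ["IN", "N"]

toy_macro = macro_precision(toy_true, toy_pred)
toy_weighted = weighted_precision(toy_true, toy_pred)
print(toy_macro, toy_weighted)  # 0.6875 0.8
```

The rare class has a precision of only 0.5, which pulls the macro average down much more than the weighted one. This is why both are worth reporting on an imbalanced tag set.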


In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=".results",
    eval_strategy="epoch",  # called `evaluation_strategy` in older transformers releases
    learning_rate=5e-5,
    num_train_epochs=5,
    save_strategy="epoch",
    use_cpu=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

# Model evaluation

In [11]:
import matplotlib.pyplot as plt
import numpy as np
import math


def plot_class_precisions(per_class_precision):
    """
    Plots a bar chart of precision scores for each class.

    Args:
    - per_class_precision (dict): Dictionary mapping class names (e.g., "NOUN", "VERB") to precision scores.

    """
    # Extract class names and precision scores
    class_names = list(per_class_precision.keys())
    precision_scores = list(per_class_precision.values())

    # Sort by precision (descending order)
    sorted_indices = np.argsort(precision_scores)[::-1]
    sorted_class_names = [class_names[i] for i in sorted_indices]
    sorted_precisions = [precision_scores[i] for i in sorted_indices]

    # Plot bar chart
    plt.figure(figsize=(10, 6))
    plt.bar(sorted_class_names, sorted_precisions, color="skyblue", edgecolor="black")

    # Add labels and title
    plt.xlabel("Classes", fontsize=12)
    plt.ylabel("Precision", fontsize=12)
    plt.title("Precision per Class (Sorted)", fontsize=12)
    plt.xticks(rotation=45, ha="right", fontsize=10)
    plt.ylim(0, 1)  # Precision values are between 0 and 1

    # Annotate the bars with precision values
    for i, precision in enumerate(sorted_precisions):
        precision_shown = math.floor(precision * 100) / 100
        plt.text(i, precision, f"{precision_shown:.2f}", ha="center", fontsize=8)

    # Display the plot
    plt.tight_layout()
    plt.show()

In [41]:
import matplotlib.pyplot as plt


def plot_class_distribution(dataset, label_column, id2label, threshold_percentage=1.0):
    """
    Plots a pie chart of the class distribution in a given dataset where the labels are lists of lists,
    showing the actual class names using `id2label`, with classes below a certain threshold percentage
    grouped into an "Other" category.

    Args:
    - dataset (Dataset): The dataset to analyze (HuggingFace Dataset).
    - label_column (str): The column name in the dataset that contains the class labels (lists of lists).
    - id2label (dict): A dictionary mapping class IDs (numeric values) to class names (text labels).
    - threshold_percentage (float): The percentage threshold below which classes will be grouped into the "Other" category.
    """
    # Extract the labels (lists of labels) from the dataset
    labels = dataset[label_column]

    # Flatten the list of lists to count all occurrences of the individual labels
    flat_labels = [label for sublist in labels for label in sublist]

    # Map class IDs to class names using `id2label`
    class_names = [id2label[label] for label in flat_labels]

    # Count the occurrences of each class
    class_counts = {
        class_name: class_names.count(class_name) for class_name in set(class_names)
    }

    # Calculate the total number of labels
    total_labels = len(flat_labels)

    # Sort classes by frequency (most frequent first)
    sorted_class_counts = dict(
        sorted(class_counts.items(), key=lambda item: item[1], reverse=True)
    )

    # Identify classes with percentage below the threshold
    threshold_count = total_labels * (threshold_percentage / 100)

    # Group classes below the threshold percentage into 'Other'
    other_count = 0
    filtered_class_counts = {}
    for class_name, count in sorted_class_counts.items():
        if count < threshold_count:
            other_count += count
        else:
            filtered_class_counts[class_name] = count

    # Add 'Other' category if there are any classes below the threshold
    if other_count > 0:
        filtered_class_counts["Other"] = other_count

    # Prepare data for pie chart
    sorted_class_names = list(filtered_class_counts.keys())
    sorted_counts = list(filtered_class_counts.values())

    # Plot pie chart
    plt.figure(figsize=(10, 10))
    wedges, texts, autotexts = plt.pie(
        sorted_counts,
        labels=sorted_class_names,
        autopct="%1.1f%%",
        startangle=90,
        colors=plt.cm.Paired.colors,
        textprops={"fontsize": 10},
        wedgeprops={"edgecolor": "black"},
    )

    # Make the labels and percentages more readable by adjusting the label distance and font size
    for text in texts + autotexts:
        text.set_fontsize(10)

    # Rotate the labels for better readability
    plt.setp(texts, rotation=45, ha="right")

    # Add a sorted legend with class names and percentages
    plt.legend(
        title="Classes",
        labels=[
            f"{name}: {count} ({count/total_labels*100:.1f}%)"
            for name, count in zip(sorted_class_names, sorted_counts)
        ],
        loc="upper left",
        fontsize=10,
        bbox_to_anchor=(1.0, 1.0),
    )

    # Set the title and aspect ratio
    plt.title("Class Distribution in Test Dataset", fontsize=14)
    plt.axis("equal")  # Equal aspect ratio ensures that pie chart is drawn as a circle.

    # Display the plot
    plt.tight_layout()  # Adjust layout to prevent overlap
    plt.show()

## Against test data

### Analysing class distribution

In [None]:
plot_class_distribution(
    tokenized_dataset["test"],
    label_column="pos_tags",
    id2label=id2label,
    threshold_percentage=2.5,
)
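The grouping controlled by `threshold_percentage` can be checked in isolation, without plotting. The sketch below reproduces only that step on hypothetical toy labels; `group_rare_classes` is an illustrative name and is not part of the notebook.

```python
from collections import Counter


def group_rare_classes(labels, threshold_percentage=1.0):
    """Merge classes whose share is below the threshold into an 'Other' bucket."""
    counts = Counter(labels)
    total = sum(counts.values())
    threshold_count = total * threshold_percentage / 100

    grouped = {}
    other_count = 0
    # most_common() iterates from the most to the least frequent class
    for name, count in counts.most_common():
        if count < threshold_count:
            other_count += count
        else:
            grouped[name] = count

    if other_count > 0:
        grouped["Other"] = other_count
    return grouped


# Hypothetical distribution over 100 tokens
toy_tags = ["N"] * 50 + ["V"] * 30 + ["PREP"] * 18 + ["IN"] + ["CUR"]
toy_grouped = group_rare_classes(toy_tags, threshold_percentage=2.5)
print(toy_grouped)  # {'N': 50, 'V': 30, 'PREP': 18, 'Other': 2}
```

With a threshold of 2.5%, the two classes that each cover 1% of the tokens are merged into a single slice.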

### Analyse model performance

In [None]:
model_results = trainer.evaluate(tokenized_dataset["test"])

model_results

In [None]:
plot_class_precisions(model_results["eval_per_class_precision"])

# Conclusions

## Overall model performance

### Comparing to the competitor model

The `lisaterumi/postagger-portuguese` model reports these metrics on its [HuggingFace page](https://huggingface.co/lisaterumi/postagger-portuguese):

```
              Precision  Recall      F1
accuracy                           0.98
macro avg          0.96    0.95    0.95
weighted avg       0.98    0.98    0.98

F1:  0.9826
Accuracy:  0.9826
```

For the model we trained, the corresponding metrics are:

```
              Precision  Recall      F1
accuracy                           0.
macro avg          0.    0.    0.
weighted avg       0.    0.    0.

F1:  0.
Accuracy:  0.
```


## Evaluating per-class precision