# Fine-tuning BERTimbau for Part-of-Speech Tagging

This notebook fine-tunes a BERTimbau model (`neuralmind/bert-base-portuguese-cased`, 110M parameters) on the Mac-Morpho dataset to perform Part-of-Speech Tagging (POS Tagging) in Portuguese.

The model's performance is evaluated by:
- Calculating three different metrics on the test dataset:
    - Macro precision
    - Weighted precision
    - Per-class precision
- Comparing these metrics with the ones obtained by a competitor model:
    - The competitor model is `lisaterumi/postagger-portuguese`
    - This model is also a BERTimbau-based model
    - This model is also fine-tuned to perform POS Tagging on the Mac-Morpho dataset

In practice, the model we'll fine tune aims to reproduce the results of the `lisaterumi/postagger-portuguese` model.
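
To make the three metrics listed above concrete, here is a minimal pure-Python sketch on toy data. The tag sequences and the `precision_report` helper are illustrative assumptions and are not part of the notebook, which relies on scikit-learn instead. The sketch shows why macro precision punishes mistakes on rare tags much more than weighted precision does.

```python
from collections import Counter


def precision_report(y_true, y_pred):
    """Return per-class, macro and weighted precision for two label lists."""
    classes = sorted(set(y_true) | set(y_pred))
    predicted = Counter(y_pred)  # how often each class was predicted
    support = Counter(y_true)    # how often each class really occurs
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Precision of a class: correct predictions / all predictions of that class.
    # A class that is never predicted gets 0.0 (same idea as zero_division=0).
    per_class = {
        c: (correct[c] / predicted[c] if predicted[c] else 0.0) for c in classes
    }
    # Macro: plain average over classes, so every class counts the same.
    macro = sum(per_class.values()) / len(classes)
    # Weighted: average over classes, weighted by how often each class occurs.
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, weighted


# Toy data (hypothetical): "N" is frequent, "IN" is rare and never predicted.
y_true = ["N", "N", "N", "N", "V", "V", "IN"]
y_pred = ["N", "N", "N", "V", "V", "V", "N"]

per_class, macro, weighted = precision_report(y_true, y_pred)
print(per_class)
print(round(macro, 4), round(weighted, 4))
```

In this toy case the rare `IN` tag drags the macro score down to roughly 0.47, while the weighted score stays near 0.62 because `IN` accounts for only one of the seven tokens.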

In [1]:
dataset_name = "nilc-nlp/mac_morpho"
model_name = "neuralmind/bert-base-portuguese-cased"
competitor_model_name = "lisaterumi/postagger-portuguese"

# The dataset

From the official Mac-Morpho [web page](http://nilc.icmc.usp.br/macmorpho/):

> Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003, and since then, two revisions have been made in order to improve the quality of the resource.

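According to the [dataset manual](http://nilc.icmc.usp.br/macmorpho/macmorpho-manual.pdf), the corpus is already split into train, validation and test subsets, and it contains a total of 27 possible classes/tags. The following table lists the main tags and their meanings. The version loaded in this notebook also uses `PU` (punctuation) and contracted tags such as `PREP+ART`, as the label mapping further down shows: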


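| Tag | Meaning (grammatical class in Portuguese) | English |
| --- | --- | --- |
| ADJ | Adjetivo | Adjective |
| ADV | Advérbio | Adverb |
| ADV-KS | Advérbio conjuntivo subordinado | Subordinating connective adverb |
| ADV-KS-REL | Advérbio relativo subordinado | Subordinating relative adverb |
| ART | Artigo | Article |
| CUR | Moeda | Currency |
| IN | Interjeição | Interjection |
| KC | Conjunção coordenativa | Coordinating conjunction |
| KS | Conjunção subordinativa | Subordinating conjunction |
| N | Substantivo | Noun |
| NPROP | Substantivo próprio | Proper noun |
| NUM | Número | Number |
| PCP | Particípio | Participle |
| PDEN | Palavra denotativa | Denotative word |
| PREP | Preposição | Preposition |
| PROADJ | Pronome Adjetivo | Adjective pronoun |
| PRO-KS | Pronome conjuntivo subordinado | Subordinating connective pronoun |
| PRO-KS-REL | Pronome relativo conectivo subordinado | Subordinating relative connective pronoun |
| PROPESS | Pronome pessoal | Personal pronoun |
| PROSUB | Pronome nominal | Nominal pronoun |
| V | Verbo | Verb |
| VAUX | Verbo auxiliar | Auxiliary verb |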

Next, we'll prepare the dataset for fine-tuning.


## Load dataset from HuggingFace

In [2]:
from datasets import load_dataset

dataset = load_dataset(dataset_name, trust_remote_code=True)



## Creating maps from label to ID and vice-versa

In [3]:
# Get unique labels
labels = dataset["train"].features["pos_tags"].feature.names

# Create the mappings
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

label2id

{'PREP+PROADJ': 0,
 'IN': 1,
 'PREP+PRO-KS': 2,
 'NPROP': 3,
 'PREP+PROSUB': 4,
 'KC': 5,
 'PROPESS': 6,
 'NUM': 7,
 'PROADJ': 8,
 'PREP+ART': 9,
 'KS': 10,
 'PRO-KS': 11,
 'ADJ': 12,
 'ADV-KS': 13,
 'N': 14,
 'PREP': 15,
 'PROSUB': 16,
 'PREP+PROPESS': 17,
 'PDEN': 18,
 'V': 19,
 'PREP+ADV': 20,
 'PCP': 21,
 'CUR': 22,
 'ADV': 23,
 'PU': 24,
 'ART': 25}

## Loading the BERTimbau tokenizer from HuggingFace

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)

In [6]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

## Tokenize and align dataset

Tokenization (turning text into integer token IDs) is a standard step in NLP pipelines, so we won't explain it here. One extra step we do need is **aligning** the labels with the tokens we want to classify.

Aligning, in this context, means *knowing what subwords/tokens of a word we want to tag*. When you tokenize a sentence, especially using subword tokenizers (which is the case with the tokenizer we're using), the sentence is split into subword tokens. Because of this, we need to ensure that the labels are correctly associated with these tokens. This is where label alignment comes into play.

In our dataset, each word has a single label, but tokenization may split a word into several subwords, so the number of tokens often exceeds the number of original words. For example:
- Imagine our labels look something like: `label2id = {"NOUN": 0, "VERB": 1, ...}`
- Now imagine that we want to classify the word `playing` as a `VERB`
- When we tokenize the word, it is broken into **two tokens**: `["play", "ing"]`
- Which one should we tag as a `VERB`: `play` or `ing`?

We resolve this by "aligning" the labels with the tokens:
1. The first token of a word (in our case, `play`) is tagged with the original word's label (in our case, `VERB`, with value `1`).
2. The remaining tokens of the word (in our case, `ing`) are assigned the special value `-100`, and so are special tokens such as `[CLS]` and `[SEP]`. This is the default `ignore_index` of PyTorch's cross-entropy loss, so these tokens are excluded from the loss calculation during training.
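
The alignment rule above can be sketched without any tokenizer. The snippet below is a minimal illustration, assuming a hypothetical two-word input and a hand-written `word_ids` list of the kind a fast tokenizer returns (`None` for special tokens, otherwise the index of the word each token came from). The `align_labels` helper is illustrative only.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels):
    """Give each word's label to its first token and IGNORE_INDEX to the rest."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word_id:
            # Special token, or a continuation subword of the previous word.
            aligned.append(IGNORE_INDEX)
        else:
            # First subword of a new word: keep the word-level label.
            aligned.append(word_labels[word_id])
        previous_word_id = word_id
    return aligned


# Hypothetical tokenization of the words ["dogs", "playing"]:
#   tokens:   [CLS]  dogs  play  ##ing  [SEP]
word_ids = [None, 0, 1, 1, None]
word_labels = [0, 1]  # NOUN, VERB in the toy label2id from the text

print(align_labels(word_ids, word_labels))  # [-100, 0, 1, -100, -100]
```

Only `dogs` and `play` carry a real label, so each word contributes exactly one term to the loss no matter how many subwords it is split into.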




In [5]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, padding=True, is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["pos_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        aligned_labels = []
        previous_word_id = None
        for word_id in word_ids:
            if word_id is None:
                aligned_labels.append(-100)
            elif word_id != previous_word_id:
                aligned_labels.append(label[word_id])
            else:
                aligned_labels.append(-100)
            previous_word_id = word_id
        labels.append(aligned_labels)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Map: 100%|██████████| 9987/9987 [00:01<00:00, 6937.79 examples/s]


# Fine-tuning our model

We'll fine-tune our model for only $5$ epochs, and we'll use a learning rate of $5\mathrm{e}{-5}$. Our dataset is split into train (37948 samples), validation (1997 samples) and test (9987 samples) subsets.

## Download model from HuggingFace

In [7]:
from transformers import AutoModelForTokenClassification


model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label2id), id2label=id2label, label2id=label2id
)

model.to("cuda")

Some weights of BertForTokenClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

## Define the fine-tuning routine

### Defining custom metrics to evaluate model

In [8]:
from sklearn.metrics import precision_score
import numpy as np


def compute_metrics(p):
    predictions, labels = p

    # Get predicted labels by taking argmax
    predictions = np.argmax(predictions, axis=-1)

    # Flatten and filter out ignored tokens (-100)
    true_labels = labels.flatten()
    pred_labels = predictions.flatten()
    mask = true_labels != -100
    true_labels = true_labels[mask]
    pred_labels = pred_labels[mask]

    # Compute precision
    macro_precision = precision_score(
        true_labels, pred_labels, average="macro", zero_division=0
    )
    weighted_precision = precision_score(
        true_labels, pred_labels, average="weighted", zero_division=0
    )
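    # Fix the class order explicitly: with average=None, scikit-learn returns
    # one score per label in `labels`, which must cover predicted classes too.
    class_ids = np.union1d(true_labels, pred_labels)
    per_class_precision = precision_score(
        true_labels, pred_labels, labels=class_ids, average=None, zero_division=0
    )

    # Map class indices to precision values (fall back to the raw ID if unknown)
    per_class_precision_dict = {
        id2label.get(int(tag), str(int(tag))): float(score)
        for tag, score in zip(class_ids, per_class_precision)
    }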

    return {
        "macro_precision": macro_precision,
        "weighted_precision": weighted_precision,
        "per_class_precision": per_class_precision_dict,
    }

In [9]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=".results",
    eval_strategy="epoch",  # called evaluation_strategy before transformers 4.41
    learning_rate=5e-5,
    num_train_epochs=5,
    save_strategy="epoch",
    use_cpu=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)



In [10]:
trainer.train()

Epoch,Training Loss,Validation Loss,Macro Precision,Weighted Precision,Per Class Precision
1,0.0649,0.061679,0.960881,0.982931,"{'PREP+PROADJ': 1.0, 'IN': 1.0, 'PREP+PRO-KS': 0.7692307692307693, 'NPROP': 0.9671410090556274, 'PREP+PROSUB': 0.8518518518518519, 'KC': 0.9891891891891892, 'PROPESS': 0.9976635514018691, 'NUM': 0.9796806966618288, 'PROADJ': 0.9783333333333334, 'PREP+ART': 0.992171405026782, 'KS': 0.9222462203023758, 'PRO-KS': 0.9321663019693655, 'ADJ': 0.9711751662971175, 'ADV-KS': 0.9285714285714286, 'N': 0.9854633555420957, 'PREP': 0.981325618095739, 'PROSUB': 0.9058823529411765, 'PREP+PROPESS': 1.0, 'PDEN': 0.933920704845815, 'V': 0.9967320261437909, 'PREP+ADV': 1.0, 'PCP': 0.9795673076923077, 'CUR': 0.991304347826087, 'ADV': 0.9409282700421941, 'PU': 0.9996508379888268, 'ART': 0.9886986301369863}"
2,0.0441,0.059354,0.953689,0.984068,"{'PREP+PROADJ': 1.0, 'IN': 0.7857142857142857, 'PREP+PRO-KS': 0.8333333333333334, 'NPROP': 0.976658798846053, 'PREP+PROSUB': 0.8518518518518519, 'KC': 0.9913419913419913, 'PROPESS': 0.9884259259259259, 'NUM': 0.9687943262411347, 'PROADJ': 0.978369384359401, 'PREP+ART': 0.9909502262443439, 'KS': 0.9296703296703297, 'PRO-KS': 0.93058568329718, 'ADJ': 0.9709270433351618, 'ADV-KS': 0.8529411764705882, 'N': 0.9835138387484957, 'PREP': 0.9859751256946282, 'PROSUB': 0.9300411522633745, 'PREP+PROPESS': 1.0, 'PDEN': 0.9432314410480349, 'V': 0.9969849246231156, 'PREP+ADV': 1.0, 'PCP': 0.98661800486618, 'CUR': 0.991304347826087, 'ADV': 0.9371069182389937, 'PU': 0.999825388510564, 'ART': 0.9917355371900827}"
3,0.0217,0.065789,0.961825,0.984399,"{'PREP+PROADJ': 1.0, 'IN': 1.0, 'PREP+PRO-KS': 0.8333333333333334, 'NPROP': 0.9726134585289515, 'PREP+PROSUB': 0.8518518518518519, 'KC': 0.9870828848223897, 'PROPESS': 0.9953488372093023, 'NUM': 0.9713876967095851, 'PROADJ': 0.9768211920529801, 'PREP+ART': 0.9917593737124022, 'KS': 0.9281045751633987, 'PRO-KS': 0.9281045751633987, 'ADJ': 0.9721006564551422, 'ADV-KS': 0.8529411764705882, 'N': 0.9858678584370093, 'PREP': 0.9867514573396926, 'PROSUB': 0.9338842975206612, 'PREP+PROPESS': 1.0, 'PDEN': 0.9356223175965666, 'V': 0.9964841788046208, 'PREP+ADV': 1.0, 'PCP': 0.9842424242424243, 'CUR': 0.991304347826087, 'ADV': 0.9389233954451346, 'PU': 0.999825388510564, 'ART': 0.9931010693342532}"


TrainOutput(global_step=14232, training_loss=0.05393620858103992, metrics={'train_runtime': 1076.2112, 'train_samples_per_second': 105.782, 'train_steps_per_second': 13.224, 'total_flos': 1.1014157339000544e+16, 'train_loss': 0.05393620858103992, 'epoch': 3.0})
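
The `global_step` reported in the `TrainOutput` above can be cross-checked with a little arithmetic. The sketch below assumes a per-device batch size of 8 (the `TrainingArguments` default, since the notebook does not set one), a single GPU and no gradient accumulation. These are assumptions about the run, not values shown in the notebook.

```python
import math

train_samples = 37948         # size of the train split, as stated earlier
batch_size = 8                # assumption: default per_device_train_batch_size
reported_global_step = 14232  # from the TrainOutput above

# One optimizer step per batch; the last, smaller batch still counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
epochs_completed = reported_global_step / steps_per_epoch

print(steps_per_epoch)   # 4744
print(epochs_completed)  # 3.0
```

Under these assumptions the reported step count works out to exactly three full epochs, in line with the `'epoch': 3.0` field and the three rows of the training log.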

# Model evaluation

## Against test data

In [11]:
model_results = trainer.evaluate(tokenized_dataset["test"])

model_results

{'eval_loss': 0.09108009189367294, 'eval_macro_precision': 0.9449901615904162, 'eval_weighted_precision': 0.9815311707574705, 'eval_per_class_precision': {'PREP+PROADJ': 0.9967637540453075, 'IN': 0.5241379310344828, 'PREP+PRO-KS': 0.9152542372881356, 'NPROP': 0.9776452119309262, 'PREP+PROSUB': 0.9038461538461539, 'KC': 0.9869005328596803, 'PROPESS': 0.9937194696441033, 'NUM': 0.9733492442322991, 'PROADJ': 0.9703315881326352, 'PREP+ART': 0.9931466614450754, 'KS': 0.9361532322426177, 'PRO-KS': 0.9279754062362758, 'ADJ': 0.9632103104862332, 'ADV-KS': 0.8558951965065502, 'N': 0.9814008272386118, 'PREP': 0.9811835935181338, 'PROSUB': 0.9385026737967914, 'PREP+PROPESS': 1.0, 'PDEN': 0.898458748866727, 'V': 0.9958356609618607, 'PREP+ADV': 0.9666666666666667, 'PCP': 0.9734953064605191, 'CUR': 0.9932659932659933, 'ADV': 0.9331128261668504, 'PU': 0.9996283771228957, 'ART': 0.9898645973552934}, 'eval_runtime': 25.1627, 'eval_samples_per_second': 396.897, 'eval_steps_per_second': 49.637, 'epoch': 

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

## Against the `lisaterumi/postagger-portuguese` model

In [16]:
from transformers import BertForTokenClassification

competitor_model = BertForTokenClassification.from_pretrained(competitor_model_name)
competitor_model.to("cuda")

competitor_tokenizer = AutoTokenizer.from_pretrained(competitor_model_name)

In [17]:
def tokenize_and_align_labels_for_comparison(examples):
    tokenized_inputs = competitor_tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        padding="max_length",
        max_length=128,
    )

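    # Align labels with tokenized inputs.
    # Read the word-level tags from "pos_tags": the "labels" column of
    # tokenized_dataset is already token-aligned, so indexing it by word_id
    # would pick the wrong entries.
    labels = []
    for i, label in enumerate(examples["pos_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        aligned_labels = [
            -100 if word_id is None else label[word_id] for word_id in word_ids
        ]
        labels.append(aligned_labels)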

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


tokenized_test_dataset_comparison = tokenized_dataset.map(
    tokenize_and_align_labels_for_comparison, batched=True
)

Map: 100%|██████████| 37948/37948 [00:06<00:00, 5451.28 examples/s]
Map: 100%|██████████| 9987/9987 [00:01<00:00, 6223.02 examples/s]
Map: 100%|██████████| 1997/1997 [00:00<00:00, 6049.05 examples/s]


In [18]:
# We don't need to train our model, so we'll use
# the Trainer API with dummy values
dummy_training_args = TrainingArguments(
    output_dir="./results",
    logging_dir="./logs",
    do_train=False,
    do_eval=True,
    eval_strategy="no",  # called evaluation_strategy before transformers 4.41
    use_cpu=False,
)

competitor_trainer = Trainer(
    model=competitor_model,
    args=dummy_training_args,
    eval_dataset=tokenized_test_dataset_comparison["validation"],
    processing_class=competitor_tokenizer,
    compute_metrics=compute_metrics,
)



In [20]:
eval_results_comparison = competitor_trainer.evaluate(
    tokenized_test_dataset_comparison["test"]
)

eval_results_comparison

{'eval_loss': 10.26021671295166, 'eval_model_preparation_time': 0.003, 'eval_macro_precision': 0.034122055046705034, 'eval_weighted_precision': 0.07739469736230463, 'eval_per_class_precision': {'PREP+PROADJ': 0.0, 'IN': 0.0005425935973955507, 'PREP+PRO-KS': 0.0, 'NPROP': 0.06532663316582915, 'PREP+PROSUB': 0.0005937654626422563, 'KC': 0.05448717948717949, 'PROPESS': 0.01650038372985418, 'NUM': 0.011442141623488774, 'PROADJ': 0.018442622950819672, 'PREP+ART': 0.05470727180698498, 'KS': 0.009498564170532362, 'PRO-KS': 0.022401433691756272, 'ADJ': 0.049019607843137254, 'ADV-KS': 0.002431681333950903, 'N': 0.1430054848500985, 'PREP': 0.08588692274492879, 'PROSUB': 0.08333333333333333, 'PREP+PROPESS': 0.0, 'PDEN': 0.0050009805844283195, 'V': 0.0761904761904762, 'PREP+ADV': 0.0, 'PCP': 0.006422018348623854, 'CUR': 0.0, 'ADV': 0.058823529411764705, 'PU': 0.12311680688710624, 'ART': 0.0}, 'eval_runtime': 21.593, 'eval_samples_per_second': 462.511, 'eval_steps_per_second': 57.843}
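
A macro precision of about 0.03 from a model fine-tuned on the same corpus is in the range one would expect from mismatched labels rather than from a weak model. One thing worth checking (an assumption, not verified here) is whether the competitor's `config.id2label` orders the tags the same way as our `label2id`. The sketch below shows how predictions could be remapped through tag names. The competitor mapping is made up for illustration, and `build_id_remap` is a hypothetical helper.

```python
def build_id_remap(source_id2label, target_label2id, unknown_id=-1):
    """Map each source class ID to the target ID that carries the same tag name."""
    return {
        source_id: target_label2id.get(name, unknown_id)
        for source_id, name in source_id2label.items()
    }


# Hypothetical competitor label space: same tags, different order.
competitor_id2label = {0: "ADJ", 1: "N", 2: "V"}
# Subset of the label2id mapping printed earlier in this notebook.
our_label2id = {"ADJ": 12, "N": 14, "V": 19}

remap = build_id_remap(competitor_id2label, our_label2id)

# Competitor predictions expressed in the competitor's own ID space...
competitor_predictions = [1, 2, 0, 1]
# ...translated into our ID space before computing any metric.
remapped_predictions = [remap[p] for p in competitor_predictions]

print(remapped_predictions)  # [14, 19, 12, 14]
```

Tags that exist in only one of the two label spaces map to `unknown_id` and would need to be handled separately before scoring.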
