# Variant Identifier

> Note: For the code to run faster, change the runtime of this notebook to use a GPU.
On the top tab, select "Runtime > Change runtime type > GPU". This change will make the code run in seconds rather than hours.

In an era where chatbots are becoming ubiquitous tools, a significant concern arises: how can we ensure these tools benefit all societies? One major hurdle to this inclusivity is the scarcity of models specific to a language variant. For example, while Portuguese is not considered a low-resource language, the majority of available content is in Brazilian Portuguese. Consequently, a language model trained on a Portuguese corpus (without any concern for its variants) is likely to be biased towards producing text in Brazilian Portuguese. What are the implications of such a bias? Countries like Portugal could find themselves at a disadvantage, particularly when deploying language-model-based systems in critical areas such as healthcare and the judiciary, where the distinct nuances of each variant are of great importance.

With this context in mind, your task in this assignment is to develop a model capable of classifying texts as either European Portuguese or Brazilian Portuguese. Below is the quick start code to guide you.

In [None]:
!pip install --quiet datasets


In [None]:
import torch

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device {DEVICE} available.")

Device cuda available.


In [None]:
from datasets import load_dataset

dataset = load_dataset("cc4051/pt_vid")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 11717
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2570
    })
})

In [None]:
print("Number of training examples: ", len(dataset["train"]))
print("Example: ", dataset["train"][0])

Number of training examples:  11717
Example:  {'text': 'Direita italiana inicia processo de coligação para as eleições de Março A alternativa «civilizada» Eduardo Tessler, em Roma O deputado reformador Mario Segni fez ontem um acordo com a Liga Norte para criar uma aliança eleitoral de centro-direita, para concorrer às legislativas italianas de Março. É a tentativa de constituir uma alternativa da «direita civilizada» à coligação das esquerdas, liderada pelo PDS. E ainda de «domesticar» a Liga. Mas falta saber a resposta do Partido Popular (ex-democracia Cristã) e a reacção do magnata Berlusconi.', 'label': 0}


In [None]:
dataset["train"][0]['text']

'Direita italiana inicia processo de coligação para as eleições de Março A alternativa «civilizada» Eduardo Tessler, em Roma O deputado reformador Mario Segni fez ontem um acordo com a Liga Norte para criar uma aliança eleitoral de centro-direita, para concorrer às legislativas italianas de Março. É a tentativa de constituir uma alternativa da «direita civilizada» à coligação das esquerdas, liderada pelo PDS. E ainda de «domesticar» a Liga. Mas falta saber a resposta do Partido Popular (ex-democracia Cristã) e a reacção do magnata Berlusconi.'

In the Hugging Face version of the dataset, the labels have the following mapping:

In [None]:
label_map = {
    0: "PT-PT",
    1: "PT-BR",
}

n_classes = len(label_map)

In [None]:
"ababab".lower() # to have all words in lower case (try with "ABABAB")

'ababab'

Let's train a simple model.

In [None]:
corpus = "".join(dataset["train"]["text"]) # one giant sentence

# the vocab is composed of every distinct lowercase character in the corpus
vocab = list(set(corpus.lower()))

# Special padding token
pad_token = "<pad>"
vocab.append(pad_token)

# Special unknown token
unk_token = "<unk>"
vocab.append(unk_token)

n_tokens = len(vocab)

print("Number of tokens in the vocab: ", n_tokens)

Number of tokens in the vocab:  96


In [None]:
tkn2id = {v: k for k, v in enumerate(vocab)}

pad_token_id = tkn2id[pad_token]
unk_token_id = tkn2id[unk_token]


def tokenize(text):
    return [tkn2id[c] if c in tkn2id else unk_token_id for c in text.lower()]


In [None]:
example = "Olá, tudo bem?"

print("Example: ", example)
print("Tokenized: ", tokenize(example))

Example:  Olá, tudo bem?
Tokenized:  [58, 86, 21, 82, 1, 33, 49, 7, 58, 1, 37, 8, 67, 60]
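One detail worth noting: `list(set(...))` does not guarantee a stable order for strings across Python sessions, so the token IDs printed above may differ from run to run. The following is a minimal sketch of a reproducible variant. It assumes a toy two-sentence corpus, and the helper names (`build_vocab`, `encode`, `decode`) are hypothetical and not part of the notebook.

```python
def build_vocab(texts, pad_token="<pad>", unk_token="<unk>"):
    """Return a sorted character vocabulary followed by the special tokens."""
    chars = sorted(set("".join(texts).lower()))  # sorting makes the IDs stable
    return chars + [pad_token, unk_token]


# Toy corpus, used only for illustration (not the pt_vid data).
toy_texts = ["Olá, tudo bem?", "Tudo óptimo."]
toy_vocab = build_vocab(toy_texts)
toy_tkn2id = {tok: i for i, tok in enumerate(toy_vocab)}
toy_id2tkn = {i: tok for tok, i in toy_tkn2id.items()}


def encode(text):
    """Map each lowercase character to its ID, falling back to <unk>."""
    unk_id = toy_tkn2id["<unk>"]
    return [toy_tkn2id.get(c, unk_id) for c in text.lower()]


def decode(ids):
    """Inverse of encode for in-vocabulary characters."""
    return "".join(toy_id2tkn[i] for i in ids)


print(decode(encode("Olá, tudo bem?")))
```

Because the tokenizer lowercases its input, the round trip recovers the lowercase text rather than the original casing.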


For the architecture, we will use an LSTM-based model, as it deals well with inputs of different lengths (a text can have 20 characters or 2,000).

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        hidden_size: int,
        n_classes: int,
        n_layers: int
    ):
        super(Model, self).__init__()
        self.embedding = nn.Embedding(
            vocab_size,
            hidden_size,
            padding_idx=pad_token_id
        )
        self.rnn = nn.LSTM(
            hidden_size,
            hidden_size,
            batch_first=True,
            num_layers=n_layers,
        )
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.rnn(x)
        x = F.relu(x)
        x = x[:, -1, :]  # get the last hidden state
        logits = self.fc(x)
        return logits

In [None]:
model = Model(
    vocab_size=n_tokens,
    hidden_size=120,
    n_classes=n_classes,
    n_layers=5
    ).to(DEVICE)
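The forward pass above reads the LSTM output at the last time step (`x[:, -1, :]`). When inputs are right-padded to a fixed length, as the dataset class in this notebook does, that position is usually a padding token, which may dilute the signal. One possible refinement, sketched here under the assumption of right padding, is to read the output at the last non-padding position instead. The helper name `last_non_pad_state` is hypothetical; `torch.nn.utils.rnn.pack_padded_sequence` is another option for the same problem.

```python
import torch


def last_non_pad_state(outputs: torch.Tensor, tokens: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Pick, for each sequence, the output at its last non-padding position.

    outputs: (batch, seq_len, hidden) tensor, e.g. the LSTM outputs.
    tokens:  (batch, seq_len) tensor of token IDs, right-padded with pad_id.
    """
    lengths = (tokens != pad_id).sum(dim=1)   # number of real tokens per row
    last_idx = (lengths - 1).clamp(min=0)     # guard against all-padding rows
    batch_idx = torch.arange(outputs.size(0), device=outputs.device)
    return outputs[batch_idx, last_idx]       # (batch, hidden)


# Toy check with pad_id=0: row 0 has 2 real tokens, row 1 has 3.
toy_tokens = torch.tensor([[5, 6, 0, 0], [7, 8, 9, 0]])
toy_outputs = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
selected = last_non_pad_state(toy_outputs, toy_tokens, pad_id=0)
print(selected.shape)
```

Inside `forward`, this would require keeping the token IDs around before the embedding lookup so they can be passed to the helper.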

Now we build a torch dataloader to facilitate training.

In [None]:
from torch.utils.data import DataLoader, Dataset


class VIdDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=512):
        """The max_length limits the number of tokens in the text.
        Texts that are longer are truncated, and shorter texts are padded."""
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.pad_token_id = pad_token_id

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        text = item["text"]
        tokens = self.tokenizer(text)
        if len(tokens) < self.max_length:  # padding
            padding = [self.pad_token_id] * (self.max_length - len(tokens))
            tokens += padding
        else:  # truncate
            tokens = tokens[: self.max_length]

        label = item["label"]
        return {
            "tokens": torch.tensor(tokens),
            "label": torch.tensor(label),
        }
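The padding and truncation step in `__getitem__` can be illustrated in isolation. This is a small sketch in plain Python; the function name `pad_or_truncate` and the toy values are assumptions made for the example.

```python
def pad_or_truncate(tokens, max_length, pad_id):
    """Return a new list of exactly max_length token IDs."""
    if len(tokens) < max_length:
        # Shorter texts are padded on the right.
        return tokens + [pad_id] * (max_length - len(tokens))
    # Longer texts keep only their first max_length tokens.
    return tokens[:max_length]


short = pad_or_truncate([1, 2, 3], max_length=5, pad_id=0)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4, pad_id=0)
print(short, long)
```

Fixing the length this way is what allows the `DataLoader` to stack examples into a single tensor per batch.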

In [None]:
train_dataset = VIdDataset(dataset["train"], tokenize)
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=64,
    shuffle=True,
    drop_last=True
  )

We now have the model and the data so we are ready to build our training loop.

In [None]:
from torch import optim
from tqdm import tqdm


def train(
    model: nn.Module,
    train_loader: DataLoader,
    n_epochs: int,
    lr: float,
  ):
    optimizer = optim.AdamW(model.parameters(), lr=lr)
    criteria = nn.CrossEntropyLoss()

    for epoch in range(n_epochs):
        model.train()

        n_steps = len(train_loader)
        progress_bar = tqdm(
            total=n_steps, desc=f"Epoch {epoch+1}", position=0, leave=True
        )
        total_loss = 0
        total_correct = 0

        for idx, batch in enumerate(train_loader):
            optimizer.zero_grad()
            x = batch["tokens"].to(DEVICE)
            y = batch["label"].to(DEVICE)
            logits = model(x)
            loss = criteria(logits, y)
            loss.backward()
            optimizer.step()

            # logging
            total_loss += loss.item()
            y_pred = logits.argmax(dim=1)
            total_correct += (y_pred == y).sum().item()
            progress_bar.set_postfix(
                {
                    "loss": total_loss / (idx + 1),
                    "accuracy": total_correct / (idx + 1) / x.size(0),
                }
            )
            progress_bar.update()

In [None]:
train(
    model=model,
    train_loader=train_loader,
    n_epochs=3,
    lr=1e-3,
)

Epoch 1: 100%|██████████| 183/183 [00:18<00:00,  9.93it/s, loss=0.605, accuracy=0.701]
Epoch 2: 100%|██████████| 183/183 [00:13<00:00, 13.12it/s, loss=0.597, accuracy=0.713]
Epoch 3: 100%|██████████| 183/183 [00:14<00:00, 12.95it/s, loss=0.596, accuracy=0.714]
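The accuracy shown in the progress bar divides the number of correct predictions by `(idx + 1) * x.size(0)`. That is exact here only because `drop_last=True` keeps every batch at 64 examples (183 batches of 64 cover 11,712 of the 11,717 training texts). A sketch of a counter that stays correct with uneven batch sizes, using a hypothetical `RunningAccuracy` helper:

```python
class RunningAccuracy:
    """Accumulate correct predictions and examples seen across batches."""

    def __init__(self):
        self.correct = 0
        self.seen = 0

    def update(self, n_correct, batch_size):
        self.correct += n_correct
        self.seen += batch_size

    @property
    def value(self):
        # Avoid dividing by zero before the first update.
        return self.correct / self.seen if self.seen else 0.0


acc = RunningAccuracy()
acc.update(n_correct=3, batch_size=4)  # a full batch
acc.update(n_correct=1, batch_size=2)  # a smaller final batch
print(acc.value)
```

In the training loop, `update` would be called with `(y_pred == y).sum().item()` and `x.size(0)`.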


To finish, let's evaluate the trained model on the test set.

In [None]:
@torch.no_grad()  # gradients are not needed at evaluation time
def evaluate(model: nn.Module, testset: Dataset):
    test_loader = DataLoader(testset, batch_size=32, shuffle=False)

    model.eval()
    y_true, y_pred = [], []
    for batch in test_loader:
        x = batch["tokens"].to(DEVICE)
        y_true += batch["label"].tolist()
        logits = model(x)
        y_pred += logits.argmax(dim=1).tolist()

    return y_true, y_pred

In [None]:
testset = VIdDataset(dataset["test"], tokenize)
y_true_test, y_pred_test = evaluate(model, testset)

In [None]:
print(f"First 10 true labels:{y_true_test[:10]}")
print(f"First 10 pred labels:{y_pred_test[:10]}")

First 10 true labels:[1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
First 10 pred labels:[0, 0, 0, 1, 1, 1, 0, 0, 1, 1]


In [None]:
from sklearn.metrics import classification_report

target_names = [label_map[i] for i in range(n_classes)]
report = classification_report(
    y_true_test,
    y_pred_test,
    target_names=target_names
)
print(report)

              precision    recall  f1-score   support

       PT-PT       0.73      0.71      0.72      1335
       PT-BR       0.69      0.72      0.70      1235

    accuracy                           0.71      2570
   macro avg       0.71      0.71      0.71      2570
weighted avg       0.71      0.71      0.71      2570
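To make the columns of the report concrete, the sketch below recomputes precision, recall and F1 by hand for the ten true and predicted labels printed earlier in the notebook. The function name `binary_report` is hypothetical, and ten examples are far too few to say anything about the model itself.

```python
def binary_report(y_true, y_pred, positive):
    """Return (precision, recall, f1), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# First 10 labels from the test set, as printed above.
y_true_10 = [1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_10 = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]

precision_br, recall_br, f1_br = binary_report(y_true_10, y_pred_10, positive=1)
print(f"PT-BR precision={precision_br:.2f} recall={recall_br:.2f} f1={f1_br:.2f}")
```

The macro average in the report is the unweighted mean of these per-class scores, while the weighted average uses each class's support as its weight.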





Training a relatively simple model for just three epochs yielded an F1 score of 71%. How might we improve on this? Below are some strategies to consider:

- **Train on more data** Check out the DSL-TL corpus on [GitHub](https://github.com/LanguageTechnologyLab/DSL-TL/tree/main/DSL-TL-Corpus/PT-DSL-TL) or on [Hugging Face](https://huggingface.co/datasets/LCA-PORVID/dsl_tl), and the [FRMT](https://huggingface.co/datasets/LCA-PORVID/frmt) dataset.

- **Increase the model capacity** Increase the number of layers or the hidden size.
- **Hyperparameter Optimization** For instance, which learning rate produces the best model?
- **Incorporate Pre-trained Components**
  - For **word embeddings**, [fastText](https://fasttext.cc/docs/en/crawl-vectors.html) or [GloVe](https://nlp.stanford.edu/projects/glove/) are good starting points.
  - For **language models**, [Albertina](https://huggingface.co/PORTULAN/albertina-100m-portuguese-ptpt-encoder) or [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) should yield interesting results.
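As an illustration of the hyperparameter optimization strategy, a learning-rate sweep can be sketched as a small loop. Everything here is an assumption made for the example: `sweep` is a hypothetical helper, and `toy_score` is a stand-in that does not train anything. In practice, the scoring function would train a fresh model with the given learning rate and return its F1 on a validation split held out from the training data, so that the test set is not used for model selection.

```python
import math


def sweep(candidates, score_fn):
    """Score every candidate and return the best one with all scores."""
    scores = {c: score_fn(c) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores


def toy_score(lr):
    """Placeholder objective that peaks at lr=1e-3; NOT a real training run."""
    return -abs(math.log10(lr) + 3)


best_lr, all_scores = sweep([1e-2, 1e-3, 1e-4], toy_score)
print(best_lr)
```

Learning rates are commonly searched on a logarithmic grid, as above, because useful values can span several orders of magnitude.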