# Domain adaptation

In this notebook, domain adaptation techniques were employed using a model trained on masked data to classify the OPArticles ADU dataset.

## Google colab setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install transformers
!pip install pandas
!pip install datasets
!pip install collections
!pip install numpy

## Base model

We chose the [neuralmind/bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) model as a base model to train with our masked corpus.

In [4]:
from transformers import AutoModelForMaskedLM

model_checkpoint = 'neuralmind/bert-base-portuguese-cased'
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/647 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/418M [00:00<?, ?B/s]

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [1]:
text = "Gosto muito de [MASK]."

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

## Preprocessing the data

After importing the model we load the dataset of the whole opinion articles, keeping only the body as this is the only feature we will need.

In [11]:
import pandas as pd

oparticles = pd.read_excel('/content/drive/Shareddrives/PLN/OpArticles.xlsx')
oparticles = oparticles[["body"]]
oparticles

Unnamed: 0,body
0,"O poeta espanhol António Machado escrevia, uns..."
1,“O mais excelente quadro posto a uma luz logo ...
2,1. As sociedades humanas parecem ser regidas p...
3,Este foi um Mundial incrível. Vimos actuações ...
4,O futebol sempre foi um jogo aparentemente sim...
...,...
368,"Era apenas mais um jogo da Lazio, em final de ..."
369,As eleições europeias no Reino Unido estão a s...
370,"Estava eu no Brasil, de férias, entretido (e d..."
371,Passaram mais de 300 dias desde que a Assemble...


Afterwards, we split the dataset into two subsets for training and testing.

In [12]:
from datasets import Dataset

opa_dataset = Dataset.from_pandas(oparticles)

opa_dataset = opa_dataset.train_test_split(test_size=0.3, shuffle=True, seed=42)

opa_dataset

DatasetDict({
    train: Dataset({
        features: ['body'],
        num_rows: 298
    })
    test: Dataset({
        features: ['body'],
        num_rows: 38
    })
    validation: Dataset({
        features: ['body'],
        num_rows: 37
    })
})

In [13]:
def tokenize_function(examples):
    return tokenizer(examples["body"])


tokenized_datasets = opa_dataset.map(
    tokenize_function, batched=True, remove_columns=["body"]
)
tokenized_datasets

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
        num_rows: 298
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
        num_rows: 38
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
        num_rows: 37
    })
})

After the dataset is tokenized we collate all the articles and divide them into equally sized chunks to allow for better compatibility with the models.

In [15]:
chunk_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

In [16]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 2611
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 341
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 308
    })
})

## Fine tuning with Trainer

With our data ready we will finetune the model using the Trainer API. Before we do so, a final preprocessing step is needed: masking some parts of the dataset.

The `transformers` library has functionality for this but it ignores word boundaries, so alternatively, we can define a masking function that masks whole words only. We have tried both approaches.

In [17]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [18]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

# builtin transformers collator
for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] Segundo um artigo que [MASK] a [MASK] página do [MASK]ÚBLICO de 13 de Abril, a [MASK] dos Médicos entende que os clínicos de família não [MASK] passar atestados para condução de veículos ou porte de arma aos [MASK]tentes que a eles [MASK]m porque a [MASK] de tais atestados deve estar isenta de “ impedimentos e suspeições ”, o que poderá não [MASK] em diversas circunstâncias. Infelizmente, o problema não se circunscreve aos médicos. Em várias outras profissões reguladas, os atestados [MASK] declarações e termos [MASK] responsabilidade servem de base a autor [MASK], licenças e outras decisões'

'>>> processuais cujos efeitos, tal como a condução de veículos e o uso e [MASK] de [MASK], podem [MASK] a segurança de pessoas e bens. É o [MASK] do licenciamento municipaḻ obras, em particular de reabilitação. De acordo com o respetivo regime jurídico, uma declaração do técnico autor do projeto [MASK] que são observadas as opções de [MASK] adequadas obtendo segurança estrutural e sí

In [19]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id

    return default_data_collator(features)

In [20]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

# whole word masking collator
for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] Segundo [MASK] artigo que fez a primeira [MASK] do PÚBLICO [MASK] 13 de Abril, a Ordem dos Médicos entende que os [MASK] [MASK] de família não devem passar [MASK] [MASK] [MASK] [MASK] condução de veículos [MASK] porte de [MASK] aos utentes que a eles recorrem porque a emissão de tais [MASK] [MASK] [MASK] deve [MASK] isenta de [MASK] impedimentos e suspeições ”, [MASK] que poderá não [MASK] em [MASK] [MASK]. Infelizmente [MASK] o problema não se circunscreve aos médicos [MASK] [MASK] várias outras profissões reguladas, os [MASK] [MASK] [MASK], declarações e termos de responsabilidade servem de base a autorizações, licenças e outras decisões'

'>>> processuais cujos efeitos, tal como a condução de veículos e o [MASK] e porte de arma, [MASK] envolver a segurança de pessoas e bens. [MASK] [MASK] caso do licenciamento municipal de obras [MASK] em particular de [MASK]. De [MASK] com [MASK] respetivo regime jurídico [MASK] uma declaração do técnico [MASK] do projeto [MASK] que [MA

## Training the masked model

With data prepped we can now create the trainer that will finetune the base model based on our masked data.

In [21]:
from transformers import TrainingArguments

batch_size = 64

training_args = TrainingArguments(
    output_dir=f"./results",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=10,
    learning_rate=2e-5,
    weight_decay=0.01,
    data_seed=42,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
)

In [23]:
from transformers import Trainer

trainer_dc = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
)

trainer_cc = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=whole_word_masking_data_collator,
)

In [24]:
trainer_dc.train()

The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2611
  Num Epochs = 10
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 410


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: ignored

In [None]:
trainer_dc.save_model('/content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_default')

Saving model checkpoint to /content/models/domain_adaptation/pretrained
Configuration saved in /content/models/domain_adaptation/pretrained/config.json
Model weights saved in /content/models/domain_adaptation/pretrained/pytorch_model.bin


In [None]:
trainer_cc.train()

In [None]:
trainer_cc.save_model('/content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_whole')

## Finetuning ADUs

After finetuning the language model based on the OpArticles dataset, we can now use it for the ADU classification task.

### Data preparation

Similar data preparation techniques are applied to the `OpArticles_ADUs` dataset, with the only difference being that now we split the dataset into three parts to allow for test data to be used while training the model (the `validation` set) and for evaluating the model (the `test` set).

In [25]:
adus = pd.read_excel('/content/drive/Shareddrives/PLN/OpArticles_ADUs.xlsx')
adus = adus[['tokens', 'label']]
adus['label'].replace(
    ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy'],
    [0,1,2,3,4],
    inplace=True
)
adus

Unnamed: 0,tokens,label
0,O facto não é apenas fruto da ignorância,0
1,havia no seu humor mais jornalismo (mais inves...,0
2,É tudo cómico na FIFA,0
3,o que todos nós permitimos que esta organizaçã...,0
4,não nos fazem rir à custa dos poderosos,0
...,...,...
16738,A única variável disponibilizada que pode ser ...,0
16739,esse número esconde informação muito pertinente,3
16740,bastante imperfeita,2
16741,esconde também a proporção de diplomados que e...,0


In [26]:
adus_dataset = Dataset.from_pandas(adus)

adus_dataset = adus_dataset.train_test_split(test_size=0.2, shuffle=True, seed=42)
adus_dataset_val = adus_dataset['test'].train_test_split(test_size=0.5, shuffle=True, seed=42)
adus_dataset['validation'], opa_dataset['test'] = adus_dataset_val['train'], adus_dataset_val['test']

adus_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'label'],
        num_rows: 13394
    })
    test: Dataset({
        features: ['tokens', 'label'],
        num_rows: 3349
    })
    validation: Dataset({
        features: ['tokens', 'label'],
        num_rows: 1674
    })
})

### Fine tuning with Trainer

After the ADU data is ready, we employ a similar strategy but now loading our pretrained model instead.

In [28]:
from transformers import AutoModelForSequenceClassification

model_name = '/content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_default'
adu_model_dc = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)
model_name = '/content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_whole'
adu_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

loading configuration file /content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained/config.json
Model config BertConfig {
  "_name_or_path": "/content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layer

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [29]:
def tokenize_adus(examples):
    return tokenizer(examples['tokens'], truncation=True, max_length=81, padding="max_length")

tokenized_adus = adus_dataset.map(tokenize_adus, batched=True, remove_columns=['tokens'])

  0%|          | 0/14 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

We also add metrics functions to keep track of model progress and evaluate it when it has finished executing.

In [31]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [32]:
adu_training_args = TrainingArguments(
    output_dir=f"./results",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=5,
    learning_rate=2e-5,
    weight_decay=0.01,
    data_seed=42,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [35]:
from transformers import DataCollatorWithPadding

adu_trainer_dc = Trainer(
    model=adu_model_dc,
    args=adu_training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_adus["train"],
    eval_dataset=tokenized_adus["validation"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

adu_trainer_cc = Trainer(
    model=adu_model_cc,
    args=adu_training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_adus["train"],
    eval_dataset=tokenized_adus["validation"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

In [None]:
adu_trainer.train()

In [None]:
adu_trainer.evaluate()

In [None]:
adu_trainer.predict(test_dataset=tokenized_adus["test"])

In [None]:
trainer.save_model('/content/drive/Shareddrives/PLN/models/domain_adaptation/finetuned')