# Domain adaptation

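This notebook applies domain adaptation in two stages. First, we continue masked-language-model training of a pretrained model on the OpArticles corpus. Then we fine-tune the adapted model to classify the ADUs in the OpArticles ADU dataset.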

## Google Colab setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install transformers
!pip install pandas
!pip install openpyxl
!pip install datasets
!pip install numpy
!pip install scikit-learn

## Base model

We chose [neuralmind/bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) as the base model and continue training it on our corpus with the masked-language-modelling objective.

In [None]:
from transformers import AutoModelForMaskedLM

model_checkpoint = 'neuralmind/bert-base-portuguese-cased'
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
text = "Gosto muito de [MASK]."

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> Gosto muito de ler.'
'>>> Gosto muito de escrever.'
'>>> Gosto muito de música.'
'>>> Gosto muito de futebol.'
'>>> Gosto muito de você.'
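
The ranking step above can be illustrated without the model. The sketch below is a minimal, pure-Python version of what the `torch.topk` call does, plus a softmax to turn logits into probabilities. The vocabulary and logit values are hypothetical and chosen only for illustration; they are not the model's real scores.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(values, k):
    """Return the indices of the k largest values, best first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

# Hypothetical vocabulary and logits for the [MASK] position
toy_vocab = ["ler", "escrever", "música", "futebol", "você", "mesa"]
toy_logits = [9.1, 8.7, 8.2, 7.9, 7.5, 1.3]

probs = softmax(toy_logits)
for idx in top_k(toy_logits, 5):
    print(f"{toy_vocab[idx]}: {probs[idx]:.3f}")
```

Ranking by logits and ranking by probabilities give the same order, because softmax is monotonic; the probabilities are only needed when a confidence score is wanted.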


## Preprocessing the data

After importing the model, we load the dataset of full opinion articles and keep only the `body` column, as it is the only feature we need.

In [None]:
import pandas as pd

oparticles = pd.read_excel('/content/drive/Shareddrives/PLN/OpArticles.xlsx')
oparticles = oparticles[["body"]]
oparticles

Unnamed: 0,body
0,"O poeta espanhol António Machado escrevia, uns..."
1,“O mais excelente quadro posto a uma luz logo ...
2,1. As sociedades humanas parecem ser regidas p...
3,Este foi um Mundial incrível. Vimos actuações ...
4,O futebol sempre foi um jogo aparentemente sim...
...,...
368,"Era apenas mais um jogo da Lazio, em final de ..."
369,As eleições europeias no Reino Unido estão a s...
370,"Estava eu no Brasil, de férias, entretido (e d..."
371,Passaram mais de 300 dias desde que a Assemble...


Afterwards, we split the dataset into two subsets: 70% of the articles for training and 30% for testing.

In [None]:
from datasets import Dataset

opa_dataset = Dataset.from_pandas(oparticles)

opa_dataset = opa_dataset.train_test_split(test_size=0.3, shuffle=True, seed=42)

opa_dataset

DatasetDict({
    train: Dataset({
        features: ['body'],
        num_rows: 261
    })
    test: Dataset({
        features: ['body'],
        num_rows: 112
    })
})

In [None]:
def tokenize_function(examples):
    result = tokenizer(examples["body"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = opa_dataset.map(
    tokenize_function, batched=True, remove_columns=["body"]
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
        num_rows: 261
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
        num_rows: 112
    })
})

Once the dataset is tokenized, we concatenate all the articles and split the result into equally sized chunks of 128 tokens, so that every training example has the same length and fits comfortably within the model's maximum input length.

In [None]:
chunk_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size tokens
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
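
To make the concatenate-then-chunk idea concrete, here is a minimal sketch on plain Python lists. The helper `group_into_chunks` and the toy token ids are hypothetical and stand in for the batched `group_texts` function above; a chunk size of 4 is used so the result is easy to read.

```python
def group_into_chunks(token_lists, size):
    """Concatenate token lists and cut the result into fixed-size chunks."""
    flat = [tok for tokens in token_lists for tok in tokens]
    usable = (len(flat) // size) * size  # drop the incomplete tail
    return [flat[i : i + size] for i in range(0, usable, size)]

# Three toy "articles" with 3, 5 and 2 tokens (10 tokens in total)
toy_articles = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
toy_chunks = group_into_chunks(toy_articles, size=4)
print(toy_chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that chunks can span article boundaries (the first chunk ends with the first token of the second article) and that the last two tokens are discarded because they do not fill a complete chunk.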

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 2306
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 953
    })
})

## Fine-tuning with Trainer

With our data ready, we fine-tune the model using the Trainer API. Before doing so, one final preprocessing step is needed: masking some of the tokens in the dataset.

The `transformers` library provides functionality for this, but it masks individual tokens and ignores word boundaries. Alternatively, we can define a masking function that masks whole words only. We tried both approaches.

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [None]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] 1. O debate no mundo ocidental vai - se cent [MASK] cada [MASK] mais na [MASK] popul [MASK]. Os populistas, acossados, entre outras reivindicações [MASK] pelo fenó [MASK] migratório oriundo de África, Ásia e América Central [MASK] [MASK] como os passado [MASK], [MASK] [MASK] com [MASK] imigração, dadas as votações crescentes nas urnas. Sim, os populistas, cient [MASK] dos desequilíbrios da globalização, agudiz [MASK] com conflitos nas regiões menos prósperas, têm explorado com mes coroado o desespero de parte [MASK] [MASK] [MASK], entregando - lhe o culpado fácil [MASK] o [MASK]nte / refugiado'

'>>> , perante [MASK] passividade inquietante dos moderados que preferem nãoⲟ no assunto, que [MASK] amplo [MASK] complexo [MASK] exigindo [MASK] por isso [MASK] [MASK] abordagem séria e cora [MASK] [MASK]. A indiferença é anuente com [MASK], a [MASK] vistos, perigosas! Por isso [MASK] importa - nos como [MASK], [MASK] parte implicada nesta problemática [MASK] tec Jackson [MASK] [MA
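
The whole-word alternative mentioned earlier relies on the `word_ids` column kept during tokenization. The following is a minimal sketch of that idea, not the exact function used in our experiments: `mask_whole_words`, the `MASK_ID` value and the toy ids are hypothetical, and for simplicity every selected token is replaced by `[MASK]` (the default collator also swaps in some random tokens, as visible in the output above).

```python
import random

MASK_ID = 103        # hypothetical id of the [MASK] token
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def mask_whole_words(input_ids, word_ids, probability=0.15, seed=0):
    """Mask every sub-token of randomly selected words."""
    rng = random.Random(seed)
    words = []       # each entry lists the token positions of one word
    previous = None
    for position, word_id in enumerate(word_ids):
        if word_id is None:      # special tokens such as [CLS] or [SEP]
            previous = None
            continue
        if word_id != previous:  # a new word starts here
            words.append([])
            previous = word_id
        words[-1].append(position)

    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for positions in words:
        if rng.random() < probability:
            for position in positions:
                labels[position] = input_ids[position]  # token to predict
                masked[position] = MASK_ID
    return masked, labels

# Toy sequence: [CLS], three words split into 1, 3 and 2 sub-tokens, [SEP]
toy_ids = [101, 2001, 2002, 2003, 2004, 2005, 2006, 102]
toy_word_ids = [None, 0, 1, 1, 1, 2, 2, None]
masked_ids, labels = mask_whole_words(toy_ids, toy_word_ids, probability=0.5, seed=1)
print(masked_ids)
print(labels)
```

Grouping by consecutive runs of the same word id, rather than by the id itself, matters here because the chunks built earlier can span several articles, and word ids restart at zero in each article.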

## Training the masked model

With the data prepared, we can now create the trainer that fine-tunes the base model on our masked data.

In [None]:
from transformers import TrainingArguments

batch_size = 64

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=10,
    learning_rate=2e-5,
    weight_decay=0.01,
    data_seed=42,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
)

In [None]:
from transformers import Trainer

trainer_dc = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
)

In [None]:
trainer_dc.train()

The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2306
  Num Epochs = 10
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 370


Epoch,Training Loss,Validation Loss
1,No log,1.91201
2,No log,1.888229
3,No log,1.851818
4,No log,1.834873
5,No log,1.799981
6,No log,1.804453
7,No log,1.782689
8,No log,1.764677
9,No log,1.802975
10,No log,1.791921


The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 953
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-37
Configuration saved in ./results/checkpoint-37/config.json
Model weights saved in ./results/checkpoint-37/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 953
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-74
Configuration saved in ./results/checkpoint-74/config.json
Model weights saved in ./results/checkpoint-74/pytorch_model.bin
The following columns in the evalu

TrainOutput(global_step=370, training_loss=1.8956955368454391, metrics={'train_runtime': 805.1293, 'train_samples_per_second': 28.641, 'train_steps_per_second': 0.46, 'total_flos': 1517362852853760.0, 'train_loss': 1.8956955368454391, 'epoch': 10.0})
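
A common way to read masked-language-model losses is perplexity, the exponential of the mean cross-entropy loss. The sketch below applies that formula to the validation losses from the log above and finds the epoch with the lowest loss, which is the checkpoint that `load_best_model_at_end=True` would be expected to restore. Because masking is random, these values can vary slightly from one evaluation to the next.

```python
import math

# Validation loss per epoch, copied from the training log above
validation_loss = {
    1: 1.91201, 2: 1.888229, 3: 1.851818, 4: 1.834873, 5: 1.799981,
    6: 1.804453, 7: 1.782689, 8: 1.764677, 9: 1.802975, 10: 1.791921,
}

# Epoch with the lowest validation loss
best_epoch = min(validation_loss, key=validation_loss.get)

# Perplexity is exp(cross-entropy loss)
perplexity = {epoch: math.exp(loss) for epoch, loss in validation_loss.items()}

print(f"best epoch: {best_epoch}")
print(f"perplexity: {perplexity[1]:.2f} -> {perplexity[best_epoch]:.2f}")
```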

In [None]:
trainer_dc.save_model('/content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_default')

Saving model checkpoint to /content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_default
Configuration saved in /content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_default/config.json
Model weights saved in /content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_default/pytorch_model.bin


## Fine-tuning on ADUs

After adapting the language model to the OpArticles corpus, we can now fine-tune it for the ADU classification task.

### Data preparation

We prepare the `OpArticles_ADUs` dataset in a similar way. The only difference is that we now split it into three parts: a `train` set, a `validation` set used to monitor the model during training, and a `test` set used for the final evaluation.

In [None]:
adus = pd.read_excel('/content/drive/Shareddrives/PLN/OpArticles_ADUs.xlsx')
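adus = adus[['tokens', 'label']].copy()
# Encode the five ADU classes as integer ids (assign back instead of
# using inplace=True on a column, which may act on a temporary copy)
adus['label'] = adus['label'].replace(
    ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy'],
    [0, 1, 2, 3, 4],
)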
adus

Unnamed: 0,tokens,label
0,O facto não é apenas fruto da ignorância,0
1,havia no seu humor mais jornalismo (mais inves...,0
2,É tudo cómico na FIFA,0
3,o que todos nós permitimos que esta organizaçã...,0
4,não nos fazem rir à custa dos poderosos,0
...,...,...
16738,A única variável disponibilizada que pode ser ...,0
16739,esse número esconde informação muito pertinente,3
16740,bastante imperfeita,2
16741,esconde também a proporção de diplomados que e...,0


In [None]:
adus_dataset = Dataset.from_pandas(adus)

adus_dataset = adus_dataset.train_test_split(test_size=0.2, shuffle=True, seed=42)
adus_dataset_val = adus_dataset['test'].train_test_split(test_size=0.5, shuffle=True, seed=42)
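# Keep both halves in adus_dataset so that validation and test stay disjoint.
# The recorded outputs below come from a run that wrote the test half to
# opa_dataset['test'], which is why they still report 3349 test rows.
adus_dataset['validation'], adus_dataset['test'] = adus_dataset_val['train'], adus_dataset_val['test']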

adus_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'label'],
        num_rows: 13394
    })
    test: Dataset({
        features: ['tokens', 'label'],
        num_rows: 3349
    })
    validation: Dataset({
        features: ['tokens', 'label'],
        num_rows: 1674
    })
})
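
The sizes of an 80/10/10 split can be checked with a little arithmetic. This sketch assumes that the held-out share is rounded up and the remainder goes to the training side, which is consistent with the row counts reported in this notebook; `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the held-out share up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test

n_total = 13394 + 3349  # rows in the ADU dataset, from the output above

# First split: 80% train, 20% held out
n_train, n_holdout = split_sizes(n_total, 0.2)

# Second split: the held-out rows are halved into validation and test
n_validation, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_validation, n_test)  # 13394 1674 1675
```

Under that assumption, halving the 3,349 held-out rows gives a 1,674-row validation set and a 1,675-row test set that do not overlap.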

### Fine-tuning with Trainer

With the ADU data ready, we follow a similar strategy, this time loading our domain-adapted model instead of the base checkpoint.

In [None]:
from transformers import AutoModelForSequenceClassification

model_name = '/content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_default'
adu_model_dc = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

loading configuration file /content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_default/config.json
Model config BertConfig {
  "_name_or_path": "/content/drive/Shareddrives/PLN/models/domain_adaptation/pretrained_default",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "poo

In [None]:
def tokenize_adus(examples):
    return tokenizer(examples['tokens'], truncation=True, max_length=81, padding="max_length")

tokenized_adus = adus_dataset.map(tokenize_adus, batched=True, remove_columns=['tokens'])

We also define a metrics function to track the model's progress during training and to evaluate it once training has finished.

In [None]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
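
To show what `average='macro'` means, the sketch below recomputes the macro-averaged scores by hand on a small imbalanced example. The helper `macro_scores` and the toy labels are hypothetical; it averages over a fixed number of classes, which may differ from scikit-learn when a class never appears in the labels or predictions.

```python
def macro_scores(labels, preds, num_classes):
    """Compute macro-averaged precision, recall and F1 from scratch."""
    precisions, recalls, f1s = [], [], []
    for cls in range(num_classes):
        pairs = list(zip(labels, preds))
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    # Macro averaging: every class counts equally, whatever its frequency
    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example with a dominant class 0
toy_labels = [0, 0, 0, 0, 1, 2]
toy_preds = [0, 0, 0, 1, 1, 2]

macro_p, macro_r, macro_f1 = macro_scores(toy_labels, toy_preds, num_classes=3)
accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
print(f"accuracy={accuracy:.3f} precision={macro_p:.3f} recall={macro_r:.3f} f1={macro_f1:.3f}")
```

Because each class contributes equally, macro F1 penalizes poor performance on rare ADU classes more than accuracy does, which is a reason to select the best checkpoint by F1 rather than accuracy.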

In [None]:
adu_training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=5,
    learning_rate=2e-5,
    weight_decay=0.01,
    data_seed=42,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
from transformers import DataCollatorWithPadding

adu_trainer = Trainer(
    model=adu_model_dc,
    args=adu_training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_adus["train"],
    eval_dataset=tokenized_adus["validation"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

In [None]:
adu_trainer.train()

***** Running training *****
  Num examples = 13394
  Num Epochs = 5
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 1050


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.938371,0.606332,0.567845,0.588144,0.555773
2,No log,0.94729,0.614695,0.597471,0.598757,0.61317
3,0.913500,0.983301,0.61589,0.595163,0.587012,0.616109
4,0.913500,1.016445,0.616487,0.58584,0.593585,0.581768
5,0.646100,1.043476,0.60693,0.589892,0.581542,0.600388


***** Running Evaluation *****
  Num examples = 1674
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-210
Configuration saved in ./results/checkpoint-210/config.json
Model weights saved in ./results/checkpoint-210/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-210/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-210/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1674
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-420
Configuration saved in ./results/checkpoint-420/config.json
Model weights saved in ./results/checkpoint-420/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-420/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-420/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1674
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-630
Configuration saved in ./results/checkpoint-630/config.json
Model w

TrainOutput(global_step=1050, training_loss=0.7699163128080823, metrics={'train_runtime': 978.6471, 'train_samples_per_second': 68.431, 'train_steps_per_second': 1.073, 'total_flos': 2787700746222540.0, 'train_loss': 0.7699163128080823, 'epoch': 5.0})
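
The step counts and checkpoint names in the logs follow directly from the dataset size, batch size and number of epochs. The sketch below reproduces them, assuming a single device, no gradient accumulation (the logs report 1 accumulation step) and a final partial batch that is kept; `optimization_steps` is a hypothetical helper.

```python
import math

def optimization_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps) for one device."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

# Masked-language-model run: 2306 chunks, batch size 64, 10 epochs
mlm_per_epoch, mlm_total = optimization_steps(2306, 64, 10)
print(mlm_per_epoch, mlm_total)  # 37 370

# ADU classification run: 13394 examples, batch size 64, 5 epochs
adu_per_epoch, adu_total = optimization_steps(13394, 64, 5)
print(adu_per_epoch, adu_total)  # 210 1050

# With save_strategy="epoch", checkpoint N is saved after epoch N / steps_per_epoch
best_f1_epoch = 2  # epoch with the highest F1 in the table above
print(f"checkpoint-{best_f1_epoch * adu_per_epoch}")  # checkpoint-420
```

This explains why the highest validation F1 (0.597471, reached at epoch 2) corresponds to `checkpoint-420`.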

In [None]:
adu_trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1674
  Batch size = 64


{'epoch': 5.0,
 'eval_accuracy': 0.6146953405017921,
 'eval_f1': 0.5974713902822115,
 'eval_loss': 0.947289764881134,
 'eval_precision': 0.5987571241217827,
 'eval_recall': 0.6131704193520849,
 'eval_runtime': 8.7117,
 'eval_samples_per_second': 192.154,
 'eval_steps_per_second': 3.099}

In [None]:
adu_trainer.predict(test_dataset=tokenized_adus["test"])

***** Running Prediction *****
  Num examples = 3349
  Batch size = 64


PredictionOutput(predictions=array([[ 1.1648511 , -1.0362136 , -1.0311818 , -1.1956606 ,  2.4912543 ],
       [ 0.835233  , -2.1810215 ,  3.1438503 ,  0.08276042, -2.446027  ],
       [ 1.0807153 , -0.6931774 , -1.3411509 , -1.3138272 ,  2.7049756 ],
       ...,
       [ 1.6494398 ,  2.1050043 , -1.8840888 , -0.03291372, -1.3592778 ],
       [ 1.0118313 , -2.2170367 ,  3.0757432 , -0.071308  , -2.127469  ],
       [ 2.505465  , -2.3087382 ,  2.1098032 ,  0.01614396, -2.5741115 ]],
      dtype=float32), label_ids=array([4, 0, 4, ..., 0, 0, 2]), metrics={'test_loss': 0.9477436542510986, 'test_accuracy': 0.6154075843535384, 'test_f1': 0.5879638735991607, 'test_precision': 0.5929712326521809, 'test_recall': 0.5986275793314899, 'test_runtime': 17.8294, 'test_samples_per_second': 187.836, 'test_steps_per_second': 2.973})
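
The `predictions` array holds one row of five logits per ADU. A minimal sketch of turning those rows back into class names is shown below, using the first three rows and true labels printed above and the label encoding defined during data preparation; the `argmax` helper is written in plain Python so the example has no dependencies.

```python
# Label encoding used when preparing the ADU dataset
label_names = ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy']

# First three rows of the prediction output above, with their true labels
logits = [
    [1.1648511, -1.0362136, -1.0311818, -1.1956606, 2.4912543],
    [0.835233, -2.1810215, 3.1438503, 0.08276042, -2.446027],
    [1.0807153, -0.6931774, -1.3411509, -1.3138272, 2.7049756],
]
true_ids = [4, 0, 4]

def argmax(row):
    """Return the index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

predicted_ids = [argmax(row) for row in logits]
predicted_names = [label_names[i] for i in predicted_ids]
true_names = [label_names[i] for i in true_ids]

print(predicted_names)  # ['Policy', 'Value(-)', 'Policy']
print(true_names)       # ['Policy', 'Value', 'Policy']
```

In this small excerpt the model gets both `Policy` examples right and labels a neutral `Value` ADU as `Value(-)`.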

In [None]:
adu_trainer.save_model('/content/drive/Shareddrives/PLN/models/domain_adaptation/finetuned')

Saving model checkpoint to /content/drive/Shareddrives/PLN/models/domain_adaptation/finetuned
Configuration saved in /content/drive/Shareddrives/PLN/models/domain_adaptation/finetuned/config.json
Model weights saved in /content/drive/Shareddrives/PLN/models/domain_adaptation/finetuned/pytorch_model.bin
tokenizer config file saved in /content/drive/Shareddrives/PLN/models/domain_adaptation/finetuned/tokenizer_config.json
Special tokens file saved in /content/drive/Shareddrives/PLN/models/domain_adaptation/finetuned/special_tokens_map.json
