<a href="https://colab.research.google.com/github/haeggee/error-detection-mt/blob/main/error_detection_in_mt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *Critical Error Detection in Machine Translation with BERT, XLM-RoBERTa and HuggingFace*
### Introduction 

In this notebook, we tackle the problem of critical error detection in machine translation. To that end, we show how to fine-tune different transformer-based language models, including BERT and the multilingual XLM-RoBERTa model, and how to evaluate them extensively. This PyTorch implementation uses the Hugging Face *transformers* library to download pre-trained models, enable quick research experiments, and access datasets and evaluation metrics.

This task is part of the WMT'21 [shared task on quality estimation](http://www.statmt.org/wmt21/quality-estimation-task.html).

---
### Task and Dataset
The goal of this task is to predict sentence-level binary scores indicating whether or not a translation contains (at least one) critical error. Translations with such errors are defined as translations that deviate in meaning as compared to the source sentence in such a way that they are misleading and may carry health, safety, legal, reputation, religious or financial implications. 

The data consists of Wikipedia comments in English extracted from two sources: the Jigsaw Toxic Comment Classification Challenge and the Wikipedia Comments Corpus, with translations generated by the ML50 multilingual translation model by FAIR. It contains instances in the following languages:

* English-Czech
* English-Japanese
* English-Chinese
* English-German

We prepared the dataset used in this notebook ourselves for the purpose of this task. It is provided as ```wmt21_multi_btr_{train,dev}.pkl``` in our [GitHub repository](https://github.com/haeggee/error-detection-mt) and should be uploaded to the ```dataset``` directory.

---

The main features of this notebook are: 
- End-to-end ML implementation (training, validation, prediction, evaluation)
- Easy adaptability
- Facilitation of quick experiments and extensions
- Quick training with limited computational resources (mixed-precision, gradient accumulation, ...)
- Threshold choice for the classification decision (not necessarily 0.5)
- Reproducible results with seed settings
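
To illustrate the threshold choice listed above, here is a minimal sketch in plain Python. It is not the notebook's own implementation: it sweeps candidate thresholds over predicted probabilities and keeps the one with the highest Matthews correlation coefficient (MCC) on held-out data. The function names `mcc` and `best_threshold` and the toy probabilities are hypothetical and chosen only for illustration.

```python
import math


def mcc(labels, preds):
    """Matthews correlation coefficient for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when the denominator vanishes
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


def best_threshold(probs, labels):
    """Return (threshold, MCC) with the highest MCC.

    Candidates are the observed probabilities; 0.5 is the starting point
    and is only replaced by a strictly better threshold.
    """
    best_t = 0.5
    best_score = mcc(labels, [int(p >= best_t) for p in probs])
    for t in sorted(set(probs)):
        score = mcc(labels, [int(p >= t) for p in probs])
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy validation probabilities and labels (made up for illustration)
val_probs = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.80, 0.90]
val_labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, score = best_threshold(val_probs, val_labels)
print(f"best threshold: {threshold:.2f}, MCC: {score:.3f}")
```

The threshold should be selected on validation data only and then applied unchanged to the test set; tuning it on the test set would give an optimistic estimate.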

Parts of this code have been taken and adapted from [NadirEM](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) and we thank the author for providing such a nice template. 

## Installation of libraries and imports

In [None]:
!pip install transformers==4.8.2 -q
!pip install sentencepiece -q
!pip install pickle5 -q



In [None]:
import torch
import torch.nn as nn
import os
import matplotlib.pyplot as plt
import copy
import torch.optim as optim
import random
import numpy as np
import pandas as pd
import pickle5 as pickle
from torch.utils.data import DataLoader, Dataset
from torch.cuda.amp import autocast, GradScaler
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel, AdamW, \
                         get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, balanced_accuracy_score, \
                            f1_score, precision_score, recall_score, \
                            confusion_matrix, matthews_corrcoef, \
                            precision_recall_curve
import gc # garbage collector
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# Check that the full GPU memory is available to us.
# Memory footprint support libraries/code
# from https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip -q install gputil
!pip -q install psutil
!pip -q install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# Note: Colab provides at most one GPU, and even that is not guaranteed
def printm():
  if torch.cuda.is_available():
    gpu = GPUs[0]
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + \
          humanize.naturalsize( psutil.virtual_memory().available ),\
          " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% \
          Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed,\
                                  gpu.memoryUtil*100, gpu.memoryTotal))
  else:
    print("No GPU in use.")
printm()

Gen RAM Free: 12.5 GB  | Proc size: 600.6 MB
GPU RAM Free: 16280MB | Used: 0MB | Util   0%           Total 16280MB


In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


If GPU utilisation (Util) is not at 0%, you can uncomment and run the following line to kill all processes and free the whole GPU. This deliberately crashes the notebook, so make sure to comment the line out again afterwards; otherwise the notebook crashes every time the cell is run.

In [None]:
#!kill -9 -1

## Loading the dataset

The easiest way to run the experiments on the WMT'21 dataset is to clone our [GitHub repository](https://github.com/haeggee/mt-error-detection/) and, inside the cloned directory, create a zip archive of the ```dataset/``` folder via

```
zip -q -r dataset dataset
```

Then upload the zip file to the instance of this Colab VM.

In [None]:
!unzip -qq -o dataset

In [None]:
# Load the WMT'21 critical error data (training and development sets)
filename_train = "dataset/wmt21_multi_btr_train.pkl"
with open(filename_train, 'rb') as f:
    dataset_train = pickle.load(f)
filename_dev = "dataset/wmt21_multi_btr_dev.pkl"
with open(filename_dev, 'rb') as f:
    dataset_dev = pickle.load(f)

In [None]:
# Length in characters of every entry in the src, mt and btr columns
# of the training set
df_length = pd.DataFrame({
    "src": [len(src) for src in dataset_train.src],
    "mt": [len(mt) for mt in dataset_train.mt],
    "btr": [len(btr) for btr in dataset_train.btr],
})
df_length.describe()

| | src | mt | btr |
|---|---|---|---|
| count | 29867.0 | 29867.0 | 29867.0 |
| mean | 90.432986 | 78.861519 | 85.427227 |
| std | 28.269123 | 35.620897 | 30.758577 |
| min | 45.0 | 12.0 | 13.0 |
| 25% | 66.0 | 52.0 | 61.0 |
| 50% | 87.0 | 72.0 | 81.0 |
| 75% | 113.0 | 102.0 | 107.0 |
| max | 148.0 | 402.0 | 984.0 |


In [None]:
len(dataset_train)

29867

In [None]:
# split the original training data for validation
df_train, df_val = train_test_split(dataset_train,
                                    test_size = 0.05,
                                    random_state=42)
df_test = dataset_dev

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)


In [None]:
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)

(28373, 8)
(1494, 8)
(3996, 8)


In [None]:
df_train.head()

| | id | src | mt | list_scores | avg_scores | critical | language_pair | btr |
|---|---|---|---|---|---|---|---|---|
| 0 | 2527 | REDIRECT Talk:Royal Canadian Mounted Police Pr... | リダイレクト トーク : カナダ 王立 騎兵 警察 の 保護 警察 | [0, 0, 0] | NOT | 0 | en-ja | Redirect Talk : Protective Police of the Royal... |
| 1 | 4131 | He mentions the apartheid controversy in great... | Zmiňuje spor o apartheid velmi podrobně ve své... | [0, 0, 0] | NOT | 0 | en-cs | He mentions the apartheid dispute in great det... |
| 2 | 9447 | As you may have noticed, I am waiting until th... | Jak jste si možná všiml , čekám , až budou mén... | [0, 0, 0] | NOT | 0 | en-cs | As you may have noticed, I'm waiting for them ... |
| 3 | 6168 | Niteshift you are one seriously ignorant indiv... | ナイト シフト あなた は 1 つ の 深刻 な 無知 の 個人 笑 で す 。 | [0, 0, 0] | NOT | 0 | en-ja | Night Shift You are one serious ignorant indiv... |
| 4 | 4888 | RR down with relentless recussant!!! the empir... | RR nach unten mit unermüdlichen wiederkehrende... | [0, 0, 0] | NOT | 0 | en-de | RR down with tireless recurring!!! the empire ... |


## Classes and functions

In [None]:
class CustomDataset(Dataset):

    def __init__(self, data, maxlen, bert_model='xlm-roberta-large',
                 with_labels=True, mask_prob=0.0, eval_set=False,
                 use_flip=False, mask_tokens=False):

        self.data = data  # pandas dataframe
        self.bert_model = bert_model
        # Initialize the tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(bert_model)  
        self.maxlen = maxlen
        self.with_labels = with_labels 
        self.mask_prob = mask_prob
        self.eval_set = eval_set
        self.flip = use_flip
        self.mask_tokens = mask_tokens

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        # Selecting sentence1 and sentence2 at the specified index
        # in the dataframe
        sent1 = str(self.data.loc[index, 'src'])
        if self.bert_model == 'bert-base-uncased':
            sent2 = str(self.data.loc[index, 'btr'])
        else:
            sent2 = str(self.data.loc[index, 'mt'])
        # Randomly swap the two sentences (training only); `flip` also
        # serves as the label for the auxiliary flip classifier
        flip = (torch.rand(1) <= 0.5).int().item()
        if self.flip and (flip and not self.eval_set):
            sent1, sent2 = sent2, sent1
        # Tokenize the pair of sentences to get token ids, attention masks
        # and token type ids
        encoded_pair = self.tokenizer(sent1, sent2, 
                                      padding='max_length', # Pad to max_length
                                      truncation=True, # Truncate to max_length
                                      max_length=self.maxlen, 
                                      return_token_type_ids=True,
                                      return_tensors='pt') # Return tensors
        # tensor of token ids
        token_ids = encoded_pair['input_ids'].squeeze(0)
        
        # replace a fraction mask_prob of the tokens with the
        # tokenizer's mask token (training only)
        if self.mask_tokens and (self.mask_prob > 0.0 and not self.eval_set):
            probability_matrix = torch.full(token_ids.shape, self.mask_prob)
            special_tokens_mask = self.tokenizer.get_special_tokens_mask(
                token_ids.tolist(), already_has_special_tokens=True)
            special_tokens_mask = torch.tensor(special_tokens_mask,
                                               dtype=torch.bool)
            # never mask special tokens
            probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
            masked_indices = torch.bernoulli(probability_matrix).bool()
            token_ids[masked_indices] = self.tokenizer.mask_token_id

        # binary tensor with "0" for padded values and "1" for the other values
        attn_masks = encoded_pair['attention_mask'].squeeze(0)

        # binary tensor with "0" for the 1st sentence tokens & "1"
        # for the 2nd sentence tokens. Note that XLM-RoBERTa does not
        # make use of it
        token_type_ids = encoded_pair['token_type_ids'].squeeze(0)
        
        if self.with_labels:  # True if the dataset has labels
            label = int(self.data.loc[index, 'critical'])
            return token_ids, attn_masks, token_type_ids, label, flip 
        else:
            return token_ids, attn_masks, token_type_ids, flip

In [None]:
class SentencePairClassifier(nn.Module):

    def __init__(self, bert_model="xlm-roberta-base", freeze_bert=False,
                 num_classes=1):
        super(SentencePairClassifier, self).__init__()
        
        self.bert_model = bert_model
        # Instantiating BERT-based model object
        self.bert_layer = AutoModel.from_pretrained(bert_model)
        # Fix the hidden-state size of the encoder outputs 
        if bert_model == "xlm-roberta-base":
            hidden_size = 768
        elif bert_model == "xlm-roberta-large":
            hidden_size = 1024
        elif bert_model == "bert-base-uncased":
            hidden_size = 768
        elif bert_model.startswith('TransQuest'):
            hidden_size = 1024
        else:
            raise Exception('Unsupported model for this notebook.')

        # Freeze bert layers and only train the classification layer weights
        if freeze_bert:
            for p in self.bert_layer.parameters():
                p.requires_grad = False

        # Classification layer for error
        self.cls_layer = nn.Linear(hidden_size, num_classes)
        # Classification layer for flipping
        self.cls_layer_flip = nn.Linear(hidden_size, num_classes)

        self.dropout = nn.Dropout(p=0.5)

    @autocast()  # run in mixed precision
    def forward(self, input_ids, attn_masks, token_type_ids=None):
        '''
        Inputs:
            -input_ids : Tensor containing token ids
            -attn_masks : Tensor containing attention masks to be used to
                          focus on non-padded values
            -token_type_ids : Tensor containing token type ids identifying
                              sentence1 and sentence2. Accepted so that all
                              models share one interface, but not forwarded
                              to the encoder below.
        Returns the logits of the error classifier and of the flip classifier.
        '''

        bert_output = self.bert_layer(input_ids, attn_masks)
        pooler_output = bert_output.pooler_output
        
        # Feeding to the classifier layer the last layer hidden-state of the
        # [CLS] token further processed by a Linear Layer and a Tanh activation.
        logits = self.cls_layer(self.dropout(pooler_output))
        logits_flip = self.cls_layer_flip(self.dropout(pooler_output))
        return logits, logits_flip

In [None]:
def get_probs_from_logits(logits):
    """
    Converts a tensor of logits into an array of probabilities by
    applying the sigmoid function
    """
    probs = torch.sigmoid(logits.unsqueeze(-1))
    return probs.detach().cpu().numpy()

def set_seed(seed):
    """ Set all seeds to make results reproducible """
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    

def evaluate_loss(net, device, criterion, dataloader):
    """ 
    Evaluate the loss for the critical error classification wrt. 
    a specific criterion and dataloader, averaged over all batches.
    Also returns the predicted probabilities for all samples.
    Does not collect gradients.
    """
    net.eval()

    mean_loss = 0
    count = 0
    probs_all = []
    with torch.no_grad():
        for it, (seq, attn_masks, token_type_ids, labels, labels_flip) in \
                enumerate(tqdm(dataloader)):
            seq, attn_masks, token_type_ids, labels = \
                seq.to(device), attn_masks.to(device), \
                token_type_ids.to(device), labels.to(device)
              
            logits, _ = net(seq, attn_masks, token_type_ids)
            probs = get_probs_from_logits(logits.squeeze(-1)).squeeze(-1)
            probs_all += probs.tolist()
            mean_loss += criterion(logits.squeeze(-1), labels.float()).item()
            count += 1

    return mean_loss / count, np.array(probs_all)

In [None]:
print("Creation of the models' folder...")
!mkdir -p models/TransQuest

Creation of the models' folder...


For background on mixed-precision training, gradient scaling and gradient accumulation, see the PyTorch AMP examples: https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples

If you would like to learn more about training neural networks with larger batches, we suggest reading this post by Thomas Wolf:
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
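
The idea behind gradient accumulation can be checked on a toy problem. The following is a minimal sketch in plain Python, assuming a one-parameter linear model with a mean-squared-error loss (both hypothetical and unrelated to the actual model): dividing each micro-batch loss by the number of accumulation steps and summing the gradients reproduces the gradient of one large batch, provided all micro-batches have the same size.

```python
def grad_mse(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error of the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data, made up for illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

# Reference: gradient of the mean loss over the full batch of 8 samples
full_grad = grad_mse(w, xs, ys)

# Accumulation: 4 micro-batches of size 2. Each micro-batch loss (and hence
# its gradient) is divided by the number of accumulation steps before summing.
micro_bs, iters_to_accumulate = 2, 4
accumulated = 0.0
for i in range(iters_to_accumulate):
    lo, hi = i * micro_bs, (i + 1) * micro_bs
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / iters_to_accumulate

print(full_grad, accumulated)  # the two values agree up to rounding
```

In the training loop of this notebook the same role is played by `loss = loss / iters_to_accumulate`, so the effective batch size is `bs * iters_to_accumulate`.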

In [None]:
def train_bert(bert_model, net, criterions, opti, lr, lr_scheduler, train_loader,
               minitrain_loader, labels_minitrain, val_loader, labels_val, 
               epochs, iters_to_accumulate, use_flip):
    """
    Train the model for a specified set of parameters.
    In particular, `net` is a SentencePairClassifier as specified above,
    and criterions is a tuple (crit_1, crit_2) where crit_1 is the 
    criterion of interest (here, BCE for error classification).
    
    Moreover, we evaluate the model after every epoch for the validation set
    as well as a small fraction of the training set and print selected metrics.
    """
    best_loss = np.inf
    best_ep = 1
    nb_iterations = len(train_loader)
    # print the training loss two times per epoch
    print_every = nb_iterations // 2
    iters = []
    train_losses = []
    val_losses = []
    
    # the two different tasks we train on
    criterion = criterions[0]
    criterion_flip = criterions[1]
    scaler = GradScaler()

    for ep in range(epochs):

        net.train()
        running_loss = 0.0
        probs_all = []
        for it, (seq, attn_masks, token_type_ids, labels, labels_flip) in \
            enumerate(tqdm(train_loader)):

            # Converting to cuda tensors
            seq, attn_masks, token_type_ids, labels, labels_flip = \
                seq.to(device), attn_masks.to(device), \
                token_type_ids.to(device), labels.to(device),\
                labels_flip.to(device)
            # Enables autocasting for the forward pass (model + loss)
            with autocast():
                # Obtaining the logits from the model
                logits, logits_flip = net(seq, attn_masks, token_type_ids)
                # Computing loss
                loss = criterion(logits.squeeze(-1),
                                    labels.float())
                if use_flip:
                  loss_flip = criterion_flip(logits_flip.squeeze(-1),
                                         labels_flip.float())
                  loss = (loss + loss_flip)
                # Normalize the loss because it is averaged
                loss = loss / iters_to_accumulate 
            # Backpropagating the gradients
            # Scales loss. Calls backward() on scaled loss
            # to create scaled gradients.
            scaler.scale(loss).backward()

            if (it + 1) % iters_to_accumulate == 0:
                # --- Optimization step
                # scaler.step() first unscales the gradients of the optimizer's
                # assigned params. If these gradients do not contain infs or
                # NaNs, opti.step() is then called. Otherwise, opti.step()
                # is skipped.
                # Importantly, we have to check if the optimizer step was
                # skipped in order to also skip the lr_sched update.
                scaler.step(opti)
                scale = scaler.get_scale()
                # Updates the scale for next iteration.
                scaler.update()
                # The scale is only reduced when infs/NaNs were found, i.e.
                # when opti.step() was skipped; it may also grow after a run
                # of successful steps, which must not skip the scheduler.
                skip_lr_schedule = (scale > scaler.get_scale())
                # Adjust the learning rate based on the number of iterations.
                if not skip_lr_schedule:
                  lr_scheduler.step()
                # Clear gradients
                opti.zero_grad()


            running_loss += loss.item() * iters_to_accumulate

            if (it + 1) % print_every == 0:  # Print training loss information
                print()
                print("Iteration {}/{} of epoch {} complete. Loss : {} "
                      .format(it+1, nb_iterations,
                              ep+1, running_loss / print_every))

                running_loss = 0.0

        # Compute validation loss
        val_loss, probs_val = evaluate_loss(net, device, criterion, val_loader)
        preds_val = probs_val >= 0.5
        print()
        print("Epoch {} complete!\nValidation Loss : {}".format(ep+1, val_loss))
        print("Validation F1 score: {:.4f}, Recall: {:.4f}, MCC: {:.4f}".format(
                f1_score(labels_val, preds_val),
                recall_score(labels_val, preds_val),
                matthews_corrcoef(labels_val, preds_val)))
        
        # Compute loss on small fraction of training set
        minitrain_loss, probs_minitrain = evaluate_loss(net,
                                                        device,
                                                        criterion,
                                                        minitrain_loader)
        preds_minitrain = probs_minitrain >= 0.5
        print()
        print("Minitrain F1 score: {:.4f}, Recall: {:.4f}, MCC: {:.4f}".format(
                f1_score(labels_minitrain, preds_minitrain),
                recall_score(labels_minitrain, preds_minitrain),
                matthews_corrcoef(labels_minitrain, preds_minitrain)))

        if val_loss < best_loss:
            print("Best validation loss improved from {} to {}"
                  .format(best_loss, val_loss))
            print()
            net_copy = copy.deepcopy(net)  # save a copy of the model
            best_loss = val_loss
            best_ep = ep + 1

        torch.cuda.empty_cache()
    
    # Saving the model
    path_to_model = 'models/{}_lr_{}_val_loss_{}_ep_{}.pt'.format(
        bert_model, lr, round(best_loss, 5), best_ep)
    torch.save(net_copy.state_dict(), path_to_model)
    print("The model has been saved in {}".format(path_to_model))

    del loss
    torch.cuda.empty_cache()
    return path_to_model

## Parameters

In [None]:
# bert_model = "xlm-roberta-base"
bert_model = "TransQuest/monotransquest-da-multilingual"
# freeze the encoder weights and only update the classification layer weights
freeze_bert = False
maxlen = 200  # 75% below
bs = 4 if bert_model.endswith('large') or bert_model.startswith('TransQuest') \
       else 16  # batch size

# the gradient accumulation adds gradients over an
# effective batch of size: bs * iters_to_accumulate.
# If set to "1", you get the usual batch size
iters_to_accumulate = 10
lr = 2e-5 # learning rate
epochs = 5  # number of training epochs
size_minitrain = 2000
weight_decay = 1e-2
mask_prob = 0.2 # percentage of tokens in sentence that we mask

# increase weight for pos label for data imbalance
pos_weight = ((df_train['critical'] == 0).sum() /
              (df_train['critical'] == 1).sum())
pos_weight = torch.Tensor([pos_weight.item()])
print('Positive weight for imbalance:', pos_weight)

Positive weight for imbalance: tensor([4.6262])
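
To make the effect of this weight concrete, the following minimal sketch re-implements a weighted binary cross-entropy on a single logit in plain Python. It assumes the usual form in which only the positive term is scaled by `pos_weight`; the function name `weighted_bce_with_logits` and the class counts are hypothetical.

```python
import math


def weighted_bce_with_logits(logit, label, pos_weight=1.0):
    """Binary cross-entropy on a logit; the positive term is scaled by pos_weight."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return -(pos_weight * label * math.log(p)
             + (1 - label) * math.log(1.0 - p))


# Hypothetical class counts, chosen only for illustration
n_neg, n_pos = 900, 100
pos_weight = n_neg / n_pos  # 9.0

# At logit 0 the model is undecided (p = 0.5)
loss_neg = weighted_bce_with_logits(0.0, 0, pos_weight)
loss_pos = weighted_bce_with_logits(0.0, 1, pos_weight)

# Summed over the dataset, both classes now contribute equally to the loss
print(n_neg * loss_neg, n_pos * loss_pos)
```

With the ratio printed above, a misclassified critical translation costs roughly 4.6 times as much as a misclassified non-critical one, which counteracts the class imbalance.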


## Training and validation

Documentation for the AdamW optimizer and the learning rate scheduler:
https://huggingface.co/transformers/main_classes/optimizer_schedules.html
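
The shape of a linear schedule with warmup can be sketched in a few lines of plain Python. This is an illustrative re-implementation, to our understanding, of the multiplier that such a schedule applies to the base learning rate; the function name `linear_warmup_decay` and the step counts are hypothetical.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear increase from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decrease from 1 to 0 over the remaining steps
    return max(0.0, (num_training_steps - step)
               / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps, warmup_steps = 1000, 100  # illustrative values
for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The steps counted here are optimizer steps. With gradient accumulation the optimizer steps only once every few batches, so both the total and the warmup length should be expressed in optimizer updates rather than in batches.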

Below we define a function that runs the full training. Its arguments select the pretrained model to use and whether to apply random sentence flipping and token masking. The other hyperparameters are defined above.
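
The two augmentations can be illustrated in isolation. The sketch below is a simplification that works on whitespace-separated words, whereas the dataset class in this notebook masks subword token ids and leaves special tokens untouched. The function name `augment_pair`, the `<mask>` string and the example sentences are hypothetical.

```python
import random


def augment_pair(src_tokens, mt_tokens, mask_prob=0.2,
                 mask_token="<mask>", rng=random):
    """Randomly swap the sentence order and mask a fraction of the tokens.

    Returns (first, second, flipped); flipped is 1 if the order was swapped
    and can serve as the label of an auxiliary flip classifier.
    """
    flipped = int(rng.random() <= 0.5)
    if flipped:
        first, second = mt_tokens, src_tokens
    else:
        first, second = src_tokens, mt_tokens

    def mask(tokens):
        # Replace each token independently with probability mask_prob
        return [mask_token if rng.random() < mask_prob else t for t in tokens]

    return mask(first), mask(second), flipped


rng = random.Random(0)  # fixed seed for a reproducible illustration
src = "this is a harmless comment".split()
mt = "das ist ein harmloser Kommentar".split()
first, second, flipped = augment_pair(src, mt, rng=rng)
print(first, second, flipped)
```

Both augmentations should be applied to training samples only, so that validation and test inputs stay deterministic.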

In [None]:
def train(bert_model, use_flip=False, mask_tokens=False, seed=1):
  #  Set all seeds to make reproducible results
  set_seed(seed)
  # Creating instances of training and validation set
  print("Reading training data...")
  train_set = CustomDataset(df_train, maxlen, bert_model,
                            use_flip=use_flip, mask_tokens=mask_tokens,
                            mask_prob=mask_prob)
  minitrain_set = CustomDataset(df_train[:size_minitrain], maxlen,
                                bert_model, eval_set=True)
  # Creating instances of the dataloaders
  train_loader = DataLoader(train_set, batch_size=bs, num_workers=2)
  minitrain_loader = DataLoader(minitrain_set, batch_size=bs, num_workers=2)

  print("Reading validation data...")
  val_set = CustomDataset(df_val, maxlen, bert_model, eval_set=True)
  val_loader = DataLoader(val_set, batch_size=bs, num_workers=2)

  labels_val = df_val['critical']  # true labels
  labels_minitrain = df_train[:size_minitrain]['critical']  # true labels
  net = SentencePairClassifier(bert_model, freeze_bert=freeze_bert)
  print("Total number of parameters:",
        sum(p.numel() for p in net.bert_layer.parameters()))

  net.to(device)

  criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight.to(device))
  criterion_flip = nn.BCEWithLogitsLoss()
  opti = AdamW(net.parameters(), lr=lr, weight_decay=weight_decay)
  # The total number of optimizer steps. With gradient accumulation, the
  # optimizer (and hence the scheduler) only steps once every
  # iters_to_accumulate batches.
  t_total = (len(train_loader) // iters_to_accumulate) * epochs
  # The number of steps for the warmup phase (10% of the optimizer steps).
  # num_warmup_steps = 0
  num_warmup_steps = int(0.1 * t_total)
  lr_scheduler = get_linear_schedule_with_warmup(optimizer=opti,
                num_warmup_steps=num_warmup_steps, num_training_steps=t_total)

  path_to_model = train_bert(bert_model, net, [criterion, criterion_flip], opti,
                            lr, lr_scheduler, train_loader, minitrain_loader,
                            labels_minitrain, val_loader, labels_val,
                            epochs, iters_to_accumulate, use_flip)
  return net, path_to_model

An example call to training, with parameters as defined above:

In [None]:
net, path_to_model = train(bert_model, use_flip=True, mask_tokens=True)

Reading training data...
Reading validation data...


Some weights of the model checkpoint at TransQuest/monotransquest-da-multilingual were not used when initializing XLMRobertaModel: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Total number of parameters: 559890432


 20%|█▉        | 1418/7094 [07:26<28:54,  3.27it/s]


Iteration 1418/7094 of epoch 1 complete. Loss : 1.8825710001066642 


 40%|███▉      | 2836/7094 [14:51<21:51,  3.25it/s]


Iteration 2836/7094 of epoch 1 complete. Loss : 1.8506279181589194 


 60%|█████▉    | 4254/7094 [22:16<14:45,  3.21it/s]


Iteration 4254/7094 of epoch 1 complete. Loss : 1.729325234065061 


 80%|███████▉  | 5672/7094 [29:41<07:34,  3.13it/s]


Iteration 5672/7094 of epoch 1 complete. Loss : 1.1164419859258934 


100%|█████████▉| 7090/7094 [37:06<00:01,  2.98it/s]


Iteration 7090/7094 of epoch 1 complete. Loss : 1.051784596499187 


100%|██████████| 7094/7094 [37:07<00:00,  3.18it/s]
100%|██████████| 374/374 [00:41<00:00,  8.99it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 1 complete!
Validation Loss : 1.1531039057249692
Validation F1 score: 0.2857, Recall: 0.1871, MCC: 0.2658


100%|██████████| 500/500 [00:55<00:00,  9.00it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.3471, Recall: 0.2319, MCC: 0.3396
Best validation loss improved from inf to 1.1531039057249692



 20%|█▉        | 1418/7094 [07:25<28:52,  3.28it/s]


Iteration 1418/7094 of epoch 2 complete. Loss : 0.9947473472522832 


 40%|███▉      | 2836/7094 [14:51<21:47,  3.26it/s]


Iteration 2836/7094 of epoch 2 complete. Loss : 0.9885288822255787 


 60%|█████▉    | 4254/7094 [22:16<14:45,  3.21it/s]


Iteration 4254/7094 of epoch 2 complete. Loss : 0.9821345390586448 


 80%|███████▉  | 5672/7094 [29:42<07:33,  3.14it/s]


Iteration 5672/7094 of epoch 2 complete. Loss : 0.9794645109286135 


100%|█████████▉| 7090/7094 [37:07<00:01,  2.99it/s]


Iteration 7090/7094 of epoch 2 complete. Loss : 0.9789623124511815 


100%|██████████| 7094/7094 [37:08<00:00,  3.18it/s]
100%|██████████| 374/374 [00:41<00:00,  8.97it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 2 complete!
Validation Loss : 1.0467503838998111
Validation F1 score: 0.4381, Recall: 0.3561, MCC: 0.3572


100%|██████████| 500/500 [00:55<00:00,  8.97it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.4723, Recall: 0.3826, MCC: 0.4071
Best validation loss improved from 1.1531039057249692 to 1.0467503838998111



 20%|█▉        | 1418/7094 [07:25<28:57,  3.27it/s]


Iteration 1418/7094 of epoch 3 complete. Loss : 0.9408336227210646 


 40%|███▉      | 2836/7094 [14:51<21:47,  3.26it/s]


Iteration 2836/7094 of epoch 3 complete. Loss : 0.9299991961781393 


 60%|█████▉    | 4254/7094 [22:16<14:42,  3.22it/s]


Iteration 4254/7094 of epoch 3 complete. Loss : 0.9443246898647235 


 80%|███████▉  | 5672/7094 [29:41<07:33,  3.13it/s]


Iteration 5672/7094 of epoch 3 complete. Loss : 0.9338609801410935 


100%|█████████▉| 7090/7094 [37:08<00:01,  2.98it/s]


Iteration 7090/7094 of epoch 3 complete. Loss : 0.9498097821387065 


100%|██████████| 7094/7094 [37:09<00:00,  3.18it/s]
100%|██████████| 374/374 [00:41<00:00,  8.94it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 3 complete!
Validation Loss : 1.0185071265394674
Validation F1 score: 0.4815, Recall: 0.4928, MCC: 0.3598


100%|██████████| 500/500 [00:55<00:00,  8.93it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5509, Recall: 0.5565, MCC: 0.4562
Best validation loss improved from 1.0467503838998111 to 1.0185071265394674



 20%|█▉        | 1418/7094 [07:26<28:59,  3.26it/s]


Iteration 1418/7094 of epoch 4 complete. Loss : 0.8997060141193829 


 40%|███▉      | 2836/7094 [14:53<21:53,  3.24it/s]


Iteration 2836/7094 of epoch 4 complete. Loss : 0.903159480230628 


 60%|█████▉    | 4254/7094 [22:20<14:40,  3.22it/s]


Iteration 4254/7094 of epoch 4 complete. Loss : 0.9027999686602666 


 80%|███████▉  | 5672/7094 [29:46<07:36,  3.12it/s]


Iteration 5672/7094 of epoch 4 complete. Loss : 0.9263979160582859 


100%|█████████▉| 7090/7094 [37:13<00:01,  2.97it/s]


Iteration 7090/7094 of epoch 4 complete. Loss : 0.9154447317038031 


100%|██████████| 7094/7094 [37:14<00:00,  3.17it/s]
100%|██████████| 374/374 [00:41<00:00,  8.91it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 4 complete!
Validation Loss : 1.3275114418988558
Validation F1 score: 0.4422, Recall: 0.3993, MCC: 0.3340


100%|██████████| 500/500 [00:56<00:00,  8.92it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5664, Recall: 0.5130, MCC: 0.4909


 20%|█▉        | 1418/7094 [07:26<28:54,  3.27it/s]


Iteration 1418/7094 of epoch 5 complete. Loss : 0.8442859423224656 


 40%|███▉      | 2836/7094 [14:53<21:52,  3.24it/s]


Iteration 2836/7094 of epoch 5 complete. Loss : 0.8645519381651736 


 60%|█████▉    | 4254/7094 [22:20<14:46,  3.20it/s]


Iteration 4254/7094 of epoch 5 complete. Loss : 0.8431654430896948 


 80%|███████▉  | 5672/7094 [29:47<07:35,  3.12it/s]


Iteration 5672/7094 of epoch 5 complete. Loss : 0.864568694585958 


100%|█████████▉| 7090/7094 [37:13<00:01,  2.97it/s]


Iteration 7090/7094 of epoch 5 complete. Loss : 0.8807951407848591 


100%|██████████| 7094/7094 [37:14<00:00,  3.17it/s]
100%|██████████| 374/374 [00:42<00:00,  8.90it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 5 complete!
Validation Loss : 1.2350023027887955
Validation F1 score: 0.4656, Recall: 0.4388, MCC: 0.3535


100%|██████████| 500/500 [00:56<00:00,  8.93it/s]



Minitrain F1 score: 0.6174, Recall: 0.6058, MCC: 0.5397
The model has been saved in models/TransQuest/monotransquest-da-multilingual_lr_2e-05_val_loss_1.01851_ep_3.pt


You can download the model saved in the "models" folder by browsing the files in the panel on the left of the Colab notebook.

Please note that you might run out of CUDA memory. As we cannot control the GPU assignments on Colab, it is possible to have GPUs that do not have enough RAM to handle the size of the pretrained models. In our experiments, at least 15-16 GB was used.

In [None]:
# If you encounter a CUDA out of memory error: 
# - uncomment the "kill" command below, run it (and comment it out again)
# - reduce the batch size or maxlen parameters
# - then run all cells from the beginning

# If you get an ugly print of tqdm (all iterations are shown), follow the
# first and last steps above
printm()
# !kill -9 -1

Gen RAM Free: 11.1 GB  | Proc size: 6.3 GB
GPU RAM Free: 16280MB | Used: 0MB | Util   0%           Total 16280MB


## Prediction

In [None]:
print("Creation of the results' folder...")
!mkdir -p results

Creation of the results' folder...


In [None]:
def test_prediction(net, device, dataloader,
                    with_labels=True, result_file="results/output.txt"):
    """
    Predict the probabilities on a dataset with
    or without labels and write them to a file, one per line.
    `with_labels` is kept for backward compatibility; only the first
    three entries of each batch are used in either case.
    """
    net.eval()
    probs_all = []

    with torch.no_grad():
        for batch in tqdm(dataloader):
            # Each batch starts with (seq, attn_masks, tt_ids); any
            # trailing entries (e.g. labels) are not needed for prediction
            seq, attn_masks, tt_ids = (t.to(device) for t in batch[:3])

            logits, _ = net(seq, attn_masks, tt_ids)
            probs = get_probs_from_logits(logits.squeeze(-1)).squeeze(-1)
            probs_all += probs.tolist()

    # The context manager closes the file even if writing fails
    with open(result_file, 'w') as w:
        w.writelines(str(prob) + '\n' for prob in probs_all)

In [None]:
def load_and_predict(bert_model, test_loader, path_to_model, path_to_output):
  """
  Create a model and load weights from previous training, then
  predict the probabilities for a test_loader.
  """
  model = SentencePairClassifier(bert_model)
  if torch.cuda.device_count() > 1:  # if multiple GPUs
      print("Let's use", torch.cuda.device_count(), "GPUs!")
      model = nn.DataParallel(model)

  print()
  print("Loading the weights of the model...")
  model.load_state_dict(torch.load(path_to_model))

  model.to(device)

  print("Predicting on test data...")
  test_prediction(net=model, device=device, dataloader=test_loader,
                  with_labels=True, 
                  result_file=path_to_output)
  print()
  print("Predictions are available in : {}".format(path_to_output))

In [None]:
def evaluate_pred_test(labels_test, preds_test):
  accuracy = accuracy_score(labels_test, preds_test)
  bac = balanced_accuracy_score(labels_test, preds_test, adjusted=True)
  f1 = f1_score(labels_test, preds_test)
  precision = precision_score(labels_test, preds_test)
  recall = recall_score(labels_test, preds_test)
  cnf = confusion_matrix(labels_test, preds_test)
  mcc = matthews_corrcoef(labels_test, preds_test)
  print("-----Evaluation-----")
  if cnf.shape == (2,2):
    print("TP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}".
          format(tp=cnf[1][1], tn=cnf[0][0], fp=cnf[0][1], fn=cnf[1][0]))
    print("F1: ", f1)
    print("Precision: ", precision)
    print("Recall: ", recall)
  print("MCC: ", mcc)
  print("Accuracy: ", accuracy)

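def best_threshold(labels, probs, use_f1=True):
  if use_f1:
    precision, recall, thresholds = precision_recall_curve(labels, probs)
    # precision and recall have one more entry than thresholds (the final
    # point precision=1, recall=0); drop it so that indices stay aligned
    precision, recall = precision[:-1], recall[:-1]
    # 0/0 gives NaN where precision + recall == 0; nanargmax skips those
    with np.errstate(divide='ignore', invalid='ignore'):
      fscore = (2 * precision * recall) / (precision + recall)
    # locate the index of the largest f score
    ix = np.nanargmax(fscore)
    print('Best Threshold=%f, F-Score=%.3f' % (thresholds[ix], fscore[ix]))
    threshold = thresholds[ix]
  else:
    # coarse grid of 20 candidate thresholds in [0, 1]
    thresholds = np.linspace(0, 1, 20)
    mcc_thres = np.array([matthews_corrcoef(labels,
                        (probs >= t).astype('uint8')) for t in thresholds])
    ix = np.nanargmax(mcc_thres)
    print('Best Threshold=%f, MCC=%.3f' % (thresholds[ix], mcc_thres[ix]))
    threshold = thresholds[ix]
  return threshold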

We share our pretrained models below so that the results can be reproduced.

We use the model parameters that achieved the best validation loss (saved to ```path_to_model```). You can also load other pretrained models here for a direct comparison.

To do so, download them, upload the ```model.pt``` files to the *models* folder and edit the ```path_to_model``` variable. Link:

https://drive.google.com/drive/folders/1Akt6PmfMejVW2nI-DIrdJjTGOPkCzmJD?usp=sharing

---
The results included in the output of the cells below belong to the model as defined above. That is, we use
* TransQuest pretrained weights from Direct Assessment (DA) training in a transfer-learning fashion
* random flipping of the training sentences
* random masking of 20\% of tokens
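As a rough illustration of the two augmentations listed above, here is a toy sketch. It is not the notebook's ```CustomDataset``` implementation: tokens are plain strings, the ```"<mask>"``` string, the ```flip_prob``` parameter and the reading of "flipping" as swapping the order of source and translation are all assumptions made for the example.

```python
import random

MASK_TOKEN = "<mask>"  # assumption: XLM-RoBERTa-style mask token string

def mask_tokens(tokens, mask_prob=0.2, rng=None):
    """Replace each token independently with MASK_TOKEN with probability
    mask_prob, so about 20% of tokens are masked on average for 0.2."""
    rng = rng or random.Random()
    return [MASK_TOKEN if rng.random() < mask_prob else tok for tok in tokens]

def maybe_flip(source, translation, flip_prob=0.5, rng=None):
    """Swap the order of the sentence pair with probability flip_prob
    (assumed meaning of "flipping"; flip_prob is a hypothetical parameter)."""
    rng = rng or random.Random()
    if rng.random() < flip_prob:
        return translation, source
    return source, translation

rng = random.Random(0)  # fixed seed so the toy run is repeatable
tokens = "this comment should not be removed from the page".split()
masked = mask_tokens(tokens, mask_prob=0.2, rng=rng)
pair = maybe_flip("a source sentence", "ein Quellsatz", rng=rng)
print(masked)
print(pair)
```

Both augmentations are meant for training only; the evaluation datasets in the cells below are created with ```eval_set=True```.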


In [None]:
# If True, choose the threshold that maximizes F1 (via the
# precision-recall curve); otherwise choose it to maximize MCC
use_f1 = True

In [None]:
predict_on_train_set = False
if predict_on_train_set:
  print("Reading training data without masking...")
  train_set = CustomDataset(df_train, maxlen, bert_model, mask_prob=mask_prob,
                            eval_set=True)
  train_loader = DataLoader(train_set, batch_size=bs, num_workers=2)

  print("Predicting on train data...")
  test_prediction(net=net, device=device, dataloader=train_loader,
                  with_labels=True, result_file='results/output_train.txt')
  print()
  labels_train = df_train['critical']
  probs_train = pd.read_csv('results/output_train.txt', header=None)[0]
  threshold = best_threshold(labels_train, probs_train, use_f1)
  preds_train = (probs_train >= threshold).astype('uint8')
  print("Eval on Training data:")
  evaluate_pred_test(labels_train, preds_train)

In [None]:
# path_to_model = .... # other models here
path_to_output_file = 'results/output.txt'
print("Reading test data...")
test_set = CustomDataset(df_test, maxlen, bert_model, eval_set=True)
test_loader = DataLoader(test_set, batch_size=bs, num_workers=2)
load_and_predict(bert_model, test_loader, path_to_model, path_to_output_file)

Reading test data...


Some weights of the model checkpoint at TransQuest/monotransquest-da-multilingual were not used when initializing XLMRobertaModel: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Loading the weights of the model...


  0%|          | 0/999 [00:00<?, ?it/s]

Predicting on test data...


100%|██████████| 999/999 [01:49<00:00,  9.15it/s]


Predictions are available in : results/output.txt





You can download the predictions saved in the folder "results" by browsing the files on the left of the Colab Notebook.

## Evaluation

In [None]:
def eval_lang_pairs(preds_test, df_test):
  language_pairs = ['en-cs', 'en-ja', 'en-zh', 'en-de']
  for lang in language_pairs:
    df_test_lang = df_test[df_test['language_pair'] == lang]
    if len(df_test_lang)==0:
      continue
    labels_test_lang = df_test_lang['critical'] 
    preds_test_lang = preds_test[df_test_lang.index.tolist()]
    print("\n For language pair {}".format(lang))
    evaluate_pred_test(labels_test_lang, preds_test_lang)

You can also work directly with predicted probabilities that were saved to a file, as we did above. The next cell loads the predictions and computes the threshold that maximizes the F1 score.

Our other predictions can be found here:

https://drive.google.com/drive/folders/1x1_T9s1VAqab9ZVtI9YyILo9MFSr9rks?usp=sharing

In [None]:
labels_test = df_test['critical']  # true labels
#### evaluation
probs_test = pd.read_csv(path_to_output_file, header=None)[0]
# best threshold
threshold = best_threshold(labels_test, probs_test, use_f1)
preds_test = (probs_test>=threshold).astype('uint8')

Best Threshold=0.420166, F-Score=0.494


For background on choosing the decision threshold for imbalanced classification, see: https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/
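The idea behind threshold moving can be sketched without any library: try each candidate threshold, compute F1 for the resulting predictions and keep the best one. This is a simplified stand-in for ```best_threshold```, not its implementation; the function names and the toy labels and scores below are made up for illustration.

```python
def f1_at_threshold(labels, probs, threshold):
    """F1 of the positive class when predicting 1 for probs >= threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def sweep_threshold(labels, probs):
    """Try every distinct predicted probability as threshold, keep the best F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        f1 = f1_at_threshold(labels, probs, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy, made-up scores for an imbalanced problem (2 positives out of 8)
toy_labels = [0, 0, 0, 0, 0, 1, 0, 1]
toy_probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60]

best_t, best_f1 = sweep_threshold(toy_labels, toy_probs)
default_f1 = f1_at_threshold(toy_labels, toy_probs, 0.5)
print(best_t, best_f1, default_f1)
```

On this toy data the default threshold of 0.5 misses the positive scored 0.40, so lowering the threshold trades one extra false positive for a higher F1.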

In [None]:
evaluate_pred_test(labels_test, preds_test)

-----Evaluation-----
TP: 401, TN: 2772, FP: 546, FN: 277
F1:  0.49353846153846154
Precision:  0.42344244984160506
Recall:  0.5914454277286135
MCC:  0.37680248358097124
Accuracy:  0.794044044044044
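The reported scores follow directly from the four confusion-matrix counts. The sketch below recomputes them from the counts printed above using the standard definitions; ```metrics_from_counts``` is a hypothetical helper, not a function of this notebook.

```python
import math

def metrics_from_counts(tp, tn, fp, fn):
    """Precision, recall, F1, MCC and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1,
            "mcc": mcc, "accuracy": accuracy}

# Counts from the evaluation output above
m = metrics_from_counts(tp=401, tn=2772, fp=546, fn=277)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Accuracy is high mainly because of the many true negatives, which is why F1 and MCC are the more informative scores on this imbalanced test set.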


Evaluation for each language pair

In [None]:
eval_lang_pairs(preds_test, df_test)


 For language pair en-cs
-----Evaluation-----
TP: 88, TN: 737, FP: 102, FN: 72
F1:  0.5028571428571429
Precision:  0.4631578947368421
Recall:  0.55
MCC:  0.4003745260589834
Accuracy:  0.8258258258258259

 For language pair en-ja
-----Evaluation-----
TP: 32, TN: 814, FP: 89, FN: 64
F1:  0.2949308755760369
Precision:  0.2644628099173554
Recall:  0.3333333333333333
MCC:  0.2120741129705655
Accuracy:  0.8468468468468469

 For language pair en-zh
-----Evaluation-----
TP: 85, TN: 686, FP: 172, FN: 56
F1:  0.42713567839195976
Precision:  0.33073929961089493
Recall:  0.6028368794326241
MCC:  0.3204874383384705
Accuracy:  0.7717717717717718

 For language pair en-de
-----Evaluation-----
TP: 196, TN: 535, FP: 183, FN: 85
F1:  0.5939393939393939
Precision:  0.5171503957783641
Recall:  0.697508896797153
MCC:  0.4101521033038911
Accuracy:  0.7317317317317318


## Comparison of different methods


In [None]:
freeze_bert = False
maxlen = 200  # 75% below
bs = 4 # batch size

# the gradient accumulation adds gradients over an
# effective batch of size: bs * iters_to_accumulate.
# If set to "1", you get the usual batch size
iters_to_accumulate = 10
lr = 2e-5 # learning rate
epochs = 5  # number of training epochs
size_minitrain = 2000
weight_decay = 1e-2
mask_prob = 0.2 # fraction of tokens in a sentence that we mask

# increase weight for pos label for data imbalance
pos_weight = ((df_train['critical'] == 0).sum() /
              (df_train['critical'] == 1).sum())
pos_weight = torch.Tensor([pos_weight.item()])
print('Positive weight for imbalance:', pos_weight)

Positive weight for imbalance: tensor([4.6262])
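The weight is the ratio of negative to positive training examples. The sketch below shows how such a weight scales the loss of positive examples in a binary cross-entropy on logits, following the formula documented for the ```pos_weight``` argument of PyTorch's ```BCEWithLogitsLoss```. That the training loop uses this loss is an assumption here, since the loss is defined outside this section; the helper names and toy labels are hypothetical.

```python
import math

def pos_weight_from_labels(labels):
    """Ratio of negatives to positives, as computed in the cell above."""
    neg = sum(1 for y in labels if y == 0)
    pos = sum(1 for y in labels if y == 1)
    return neg / pos

def weighted_bce_with_logits(logit, label, pos_weight):
    """loss = -[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]"""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p)
             + (1 - label) * math.log(1.0 - p))

toy_labels = [0, 0, 0, 0, 1]              # made-up: 4 negatives, 1 positive
w = pos_weight_from_labels(toy_labels)

# For the same uninformative logit, a positive example costs w times more
loss_pos = weighted_bce_with_logits(0.0, 1, w)
loss_neg = weighted_bce_with_logits(0.0, 0, w)
print(w, loss_pos, loss_neg)
```

With a weight of about 4.63, as printed above, a missed critical error is penalized roughly 4.6 times as much as a false alarm, which pushes the model toward higher recall.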


In [None]:
# to free up as much memory as possible for the GPU,
# garbage collect + empty cache
gc.collect()
torch.cuda.empty_cache()
# tqdm._instances.clear()

In [None]:
use_f1 = True

# Configs for experiments: (model, use_flip, mask_tokens)
models = [
          ('bert-base-uncased', False, False),
          ('xlm-roberta-large', False, False),
          ('TransQuest/monotransquest-da-multilingual', False, False),
          ('TransQuest/monotransquest-da-multilingual', True, False),
          ('TransQuest/monotransquest-da-multilingual', False, True),
          # this was already done above:
          #('TransQuest/monotransquest-da-multilingual', True, True),
          
]

In [None]:
def full_experiment(model_name, use_flip, mask_tokens): 
  #### training
  print("--------------- Using Model:", model_name)
  print("--------------- Flipping sentences:", use_flip,
        ", Masking tokens:", mask_tokens)
  _, path_to_model = train(model_name,
                          use_flip=use_flip, mask_tokens=mask_tokens)
  path_to_output = "predictions_"
  path_to_output += model_name + ("-flip" if use_flip else "-noflip") + \
                                 ("-masked" if mask_tokens else "-unmasked") + \
                                 ".txt"
  path_to_output = path_to_output.replace("/", "-")
  #### testing
  test_set = CustomDataset(df_test, maxlen, model_name, eval_set=True)
  test_loader = DataLoader(test_set, batch_size=bs, num_workers=2)
  load_and_predict(model_name, test_loader, path_to_model, path_to_output)
  labels_test = df_test['critical']  # true labels
  #### evaluation
  probs_test = pd.read_csv(path_to_output, header=None)[0]
  # best threshold
  threshold = best_threshold(labels_test, probs_test, use_f1)
  preds_test = (probs_test>=threshold).astype('uint8')
  print("Predictions for ", model_name, "flip:", use_flip, ", masking:", mask_tokens)
  evaluate_pred_test(labels_test, preds_test)
  eval_lang_pairs(preds_test, df_test)
  

Please note that if you get CUDA memory errors, the best procedure is to restart the runtime, rerun the definitions and repeat only the experiment you want to execute.

In [None]:
## BERT with backtranslations
bert_exp = models[0]
full_experiment(bert_exp[0], bert_exp[1], bert_exp[2])
gc.collect()
torch.cuda.empty_cache()
tqdm._instances.clear()

# XLM-RoBERTa
xlm = models[1]
full_experiment(xlm[0], xlm[1], xlm[2])
gc.collect()
torch.cuda.empty_cache()
tqdm._instances.clear()

--------------- Using Model: bert-base-uncased
--------------- Flipping sentences: False , Masking tokens: False
Reading training data...




Reading validation data...






Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Total number of parameters: 109482240


 50%|█████     | 3550/7094 [02:28<02:28, 23.86it/s]


Iteration 3547/7094 of epoch 1 complete. Loss : 1.1578311427192585 


100%|██████████| 7094/7094 [04:54<00:00, 24.05it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 1 complete. Loss : 1.0893193235469092 


100%|██████████| 374/374 [00:07<00:00, 53.14it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 1 complete!
Validation Loss : 1.0703547351500566
Validation F1 score: 0.3939, Recall: 0.5108, MCC: 0.2243


100%|██████████| 500/500 [00:09<00:00, 53.63it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.4127, Recall: 0.5826, MCC: 0.2636
Best validation loss improved from inf to 1.0703547351500566



 50%|█████     | 3550/7094 [02:28<02:30, 23.54it/s]


Iteration 3547/7094 of epoch 2 complete. Loss : 1.0466000386063468 


100%|██████████| 7094/7094 [04:56<00:00, 23.92it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 2 complete. Loss : 1.0160568246341168 


100%|██████████| 374/374 [00:07<00:00, 52.58it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 2 complete!
Validation Loss : 1.056592470184367
Validation F1 score: 0.4094, Recall: 0.4712, MCC: 0.2555


100%|██████████| 500/500 [00:08<00:00, 56.18it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.4704, Recall: 0.5652, MCC: 0.3446
Best validation loss improved from 1.0703547351500566 to 1.056592470184367



 50%|█████     | 3550/7094 [02:28<02:30, 23.52it/s]


Iteration 3547/7094 of epoch 3 complete. Loss : 0.9779357711656147 


100%|██████████| 7094/7094 [04:57<00:00, 23.86it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 3 complete. Loss : 0.9567903976363382 


100%|██████████| 374/374 [00:06<00:00, 56.90it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 3 complete!
Validation Loss : 1.0958022334996391
Validation F1 score: 0.3939, Recall: 0.3741, MCC: 0.2649


100%|██████████| 500/500 [00:08<00:00, 56.29it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5036, Recall: 0.5043, MCC: 0.4000


 50%|█████     | 3550/7094 [02:30<02:35, 22.85it/s]


Iteration 3547/7094 of epoch 4 complete. Loss : 0.8911970589747892 


100%|██████████| 7094/7094 [05:00<00:00, 23.59it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 4 complete. Loss : 0.8443090076694039 


100%|██████████| 374/374 [00:06<00:00, 53.92it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 4 complete!
Validation Loss : 1.231507798965602
Validation F1 score: 0.3884, Recall: 0.3849, MCC: 0.2501


100%|██████████| 500/500 [00:09<00:00, 54.65it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5984, Recall: 0.6522, MCC: 0.5088


 50%|█████     | 3550/7094 [02:31<02:37, 22.45it/s]


Iteration 3547/7094 of epoch 5 complete. Loss : 0.7347058861775437 


100%|██████████| 7094/7094 [05:06<00:00, 23.11it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 5 complete. Loss : 0.6496495280841752 


100%|██████████| 374/374 [00:07<00:00, 50.43it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 5 complete!
Validation Loss : 1.4987382744165028
Validation F1 score: 0.3596, Recall: 0.4353, MCC: 0.1852


100%|██████████| 500/500 [00:09<00:00, 51.31it/s]



Minitrain F1 score: 0.6793, Recall: 0.8841, MCC: 0.6202
The model has been saved in models/bert-base-uncased_lr_2e-05_val_loss_1.05659_ep_2.pt


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Loading the weights of the model...


  0%|          | 0/999 [00:00<?, ?it/s]

Predicting on test data...


100%|██████████| 999/999 [00:18<00:00, 54.88it/s]



Predictions are available in : predictions_bert-base-uncased-noflip-unmasked.txt
Best Threshold=0.449707, F-Score=0.431
Predictions for  bert-base-uncased flip: False , masking: False
-----Evaluation-----
TP: 428, TN: 2436, FP: 882, FN: 250
F1:  0.43058350100603626
Precision:  0.3267175572519084
Recall:  0.6312684365781711
MCC:  0.29220469158627965
Accuracy:  0.7167167167167167

 For language pair en-cs
-----Evaluation-----
TP: 110, TN: 638, FP: 201, FN: 50
F1:  0.46709129511677283
Precision:  0.3536977491961415
Recall:  0.6875
MCC:  0.3547936337648813
Accuracy:  0.7487487487487487

 For language pair en-ja
-----Evaluation-----
TP: 55, TN: 627, FP: 276, FN: 41
F1:  0.2576112412177986
Precision:  0.1661631419939577
Recall:  0.5729166666666666
MCC:  0.16734994337929757
Accuracy:  0.6826826826826827

 For language pair en-zh
-----Evaluation-----
TP: 79, TN: 636, FP: 222, FN: 62
F1:  0.35746606334841624
Precision:  0.26245847176079734
Recall:  0.5602836879432624
MCC:  0.22881825553077423




Reading validation data...






Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Total number of parameters: 559890432


 50%|█████     | 3548/7094 [05:46<05:32, 10.66it/s]


Iteration 3547/7094 of epoch 1 complete. Loss : 1.151241079069441 


100%|██████████| 7094/7094 [11:37<00:00, 10.17it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 1 complete. Loss : 1.04463134993898 


100%|██████████| 374/374 [00:13<00:00, 27.81it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 1 complete!
Validation Loss : 1.0416231850251794
Validation F1 score: 0.4521, Recall: 0.4245, MCC: 0.3378


100%|██████████| 500/500 [00:17<00:00, 28.60it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.4545, Recall: 0.4348, MCC: 0.3475
Best validation loss improved from inf to 1.0416231850251794



 50%|█████     | 3548/7094 [05:51<05:29, 10.76it/s]


Iteration 3547/7094 of epoch 2 complete. Loss : 0.9574102019339229 


100%|██████████| 7094/7094 [11:41<00:00, 10.11it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 2 complete. Loss : 0.9369427818476246 


100%|██████████| 374/374 [00:13<00:00, 28.49it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 2 complete!
Validation Loss : 1.1336892571519404
Validation F1 score: 0.3865, Recall: 0.3489, MCC: 0.2665


100%|██████████| 500/500 [00:17<00:00, 29.28it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.3919, Recall: 0.3652, MCC: 0.2772


 50%|█████     | 3549/7094 [05:53<05:34, 10.60it/s]


Iteration 3547/7094 of epoch 3 complete. Loss : 0.9053326700899402 


100%|██████████| 7094/7094 [11:42<00:00, 10.09it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 3 complete. Loss : 0.8791594927479868 


100%|██████████| 374/374 [00:12<00:00, 29.32it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 3 complete!
Validation Loss : 1.071153827649068
Validation F1 score: 0.4841, Recall: 0.4640, MCC: 0.3728


100%|██████████| 500/500 [00:16<00:00, 29.42it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5301, Recall: 0.5101, MCC: 0.4372


 50%|█████     | 3548/7094 [05:52<05:31, 10.68it/s]


Iteration 3547/7094 of epoch 4 complete. Loss : 0.8337596579304388 


100%|██████████| 7094/7094 [11:42<00:00, 10.10it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 4 complete. Loss : 0.836537120797366 


100%|██████████| 374/374 [00:12<00:00, 30.38it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 4 complete!
Validation Loss : 1.0914521220135178
Validation F1 score: 0.5027, Recall: 0.5036, MCC: 0.3887


100%|██████████| 500/500 [00:16<00:00, 30.11it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5771, Recall: 0.5913, MCC: 0.4866


 50%|█████     | 3549/7094 [05:52<05:33, 10.62it/s]


Iteration 3547/7094 of epoch 5 complete. Loss : 0.7335243701516284 


100%|██████████| 7094/7094 [11:48<00:00, 10.02it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 5 complete. Loss : 0.7312807366221751 


100%|██████████| 374/374 [00:13<00:00, 28.10it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 5 complete!
Validation Loss : 1.0987707770723392
Validation F1 score: 0.5073, Recall: 0.5647, MCC: 0.3834


100%|██████████| 500/500 [00:16<00:00, 29.54it/s]



Minitrain F1 score: 0.6856, Recall: 0.8029, MCC: 0.6185
The model has been saved in models/xlm-roberta-large_lr_2e-05_val_loss_1.04162_ep_1.pt


Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Loading the weights of the model...


  0%|          | 0/999 [00:00<?, ?it/s]

Predicting on test data...


100%|██████████| 999/999 [00:34<00:00, 29.27it/s]



Predictions are available in : predictions_xlm-roberta-large-noflip-unmasked.txt
Best Threshold=0.438721, F-Score=0.469
Predictions for  xlm-roberta-large flip: False , masking: False
-----Evaluation-----
TP: 360, TN: 2820, FP: 498, FN: 318
F1:  0.46875000000000006
Precision:  0.4195804195804196
Recall:  0.5309734513274337
MCC:  0.34815653689006015
Accuracy:  0.7957957957957958

 For language pair en-cs
-----Evaluation-----
TP: 82, TN: 741, FP: 98, FN: 78
F1:  0.4823529411764706
Precision:  0.45555555555555555
Recall:  0.5125
MCC:  0.37759178703547336
Accuracy:  0.8238238238238238

 For language pair en-ja
-----Evaluation-----
TP: 16, TN: 876, FP: 27, FN: 80
F1:  0.2302158273381295
Precision:  0.37209302325581395
Recall:  0.16666666666666666
MCC:  0.1986074452746329
Accuracy:  0.8928928928928929

 For language pair en-zh
-----Evaluation-----
TP: 50, TN: 757, FP: 101, FN: 91
F1:  0.34246575342465757
Precision:  0.33112582781456956
Recall:  0.3546099290780142
MCC:  0.23026155459142109
A

In [None]:
## XLM-RoBERTa with TransQuest transfer-learning
tq = models[2]
full_experiment(tq[0], tq[1], tq[2])
gc.collect()
torch.cuda.empty_cache()
tqdm._instances.clear()

--------------- Using Model: TransQuest/monotransquest-da-multilingual
--------------- Flipping sentences: False , Masking tokens: False
Reading training data...
Reading validation data...


Some weights of the model checkpoint at TransQuest/monotransquest-da-multilingual were not used when initializing XLMRobertaModel: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.dense.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Total number of parameters: 559890432


 50%|█████     | 3548/7094 [06:54<06:34,  9.00it/s]


Iteration 3547/7094 of epoch 1 complete. Loss : 1.1563235866538666 


100%|██████████| 7094/7094 [13:51<00:00,  8.53it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 1 complete. Loss : 1.014421609297819 


100%|██████████| 374/374 [00:15<00:00, 24.11it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 1 complete!
Validation Loss : 0.9772487200677076
Validation F1 score: 0.5246, Recall: 0.5360, MCC: 0.4133


100%|██████████| 500/500 [00:20<00:00, 24.33it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.4755, Recall: 0.4783, MCC: 0.3654
Best validation loss improved from inf to 0.9772487200677076



 50%|█████     | 3549/7094 [06:16<05:59,  9.87it/s]


Iteration 3547/7094 of epoch 2 complete. Loss : 0.9273506626948895 


100%|██████████| 7094/7094 [12:29<00:00,  9.46it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 2 complete. Loss : 0.8923516051425232 


100%|██████████| 374/374 [00:15<00:00, 24.57it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 2 complete!
Validation Loss : 1.0720392961632759
Validation F1 score: 0.4991, Recall: 0.5000, MCC: 0.3843


100%|██████████| 500/500 [00:20<00:00, 24.30it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5603, Recall: 0.5652, MCC: 0.4678


 50%|█████     | 3548/7094 [06:13<06:16,  9.43it/s]


Iteration 3547/7094 of epoch 3 complete. Loss : 0.8560445661857947 


100%|██████████| 7094/7094 [12:26<00:00,  9.50it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 3 complete. Loss : 0.8271839013856197 


100%|██████████| 374/374 [00:15<00:00, 24.26it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 3 complete!
Validation Loss : 1.2557736942195956
Validation F1 score: 0.4807, Recall: 0.4245, MCC: 0.3855


100%|██████████| 500/500 [00:19<00:00, 25.00it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5907, Recall: 0.5333, MCC: 0.5204


 50%|█████     | 3548/7094 [06:13<06:04,  9.73it/s]


Iteration 3547/7094 of epoch 4 complete. Loss : 0.7769241928634154 


100%|██████████| 7094/7094 [12:26<00:00,  9.51it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 4 complete. Loss : 0.7337825059570285 


100%|██████████| 374/374 [00:15<00:00, 24.02it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 4 complete!
Validation Loss : 1.341143104441982
Validation F1 score: 0.4903, Recall: 0.4568, MCC: 0.3857


100%|██████████| 500/500 [00:19<00:00, 25.03it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.6556, Recall: 0.6319, MCC: 0.5877


 50%|█████     | 3549/7094 [06:13<05:58,  9.90it/s]


Iteration 3547/7094 of epoch 5 complete. Loss : 0.6978571637072014 


100%|██████████| 7094/7094 [12:27<00:00,  9.50it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 5 complete. Loss : 0.6433640699735169 


100%|██████████| 374/374 [00:14<00:00, 24.95it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 5 complete!
Validation Loss : 1.412142237100531
Validation F1 score: 0.5082, Recall: 0.5000, MCC: 0.3981


100%|██████████| 500/500 [00:19<00:00, 25.23it/s]



Minitrain F1 score: 0.7145, Recall: 0.7362, MCC: 0.6533
The model has been saved in models/TransQuest/monotransquest-da-multilingual_lr_2e-05_val_loss_0.97725_ep_1.pt


Some weights of the model checkpoint at TransQuest/monotransquest-da-multilingual were not used when initializing XLMRobertaModel: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.dense.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Loading the weights of the model...


  0%|          | 0/999 [00:00<?, ?it/s]

Predicting on test data...


100%|██████████| 999/999 [00:32<00:00, 30.54it/s]



Predictions are available in : predictions_TransQuest-monotransquest-da-multilingual-noflip-unmasked.txt
Best Threshold=0.458252, F-Score=0.487
Predictions for  TransQuest/monotransquest-da-multilingual flip: False , masking: False
-----Evaluation-----
TP: 370, TN: 2845, FP: 473, FN: 308
F1:  0.48652202498356345
Precision:  0.4389086595492289
Recall:  0.5457227138643068
MCC:  0.37090458010328053
Accuracy:  0.8045545545545546

 For language pair en-cs
-----Evaluation-----
TP: 84, TN: 756, FP: 83, FN: 76
F1:  0.5137614678899083
Precision:  0.5029940119760479
Recall:  0.525
MCC:  0.4187981813100839
Accuracy:  0.8408408408408409

 For language pair en-ja
-----Evaluation-----
TP: 28, TN: 834, FP: 69, FN: 68
F1:  0.2901554404145078
Precision:  0.28865979381443296
Recall:  0.2916666666666667
MCC:  0.21426094292254913
Accuracy:  0.8628628628628628

 For language pair en-zh
-----Evaluation-----
TP: 58, TN: 735, FP: 123, FN: 83
F1:  0.36024844720496896
Precision:  0.32044198895027626
Recall:  0
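
The evaluation blocks above print F1, precision, recall, MCC and accuracy next to the raw confusion counts. As a minimal sketch, assuming the standard definitions of these metrics (the notebook's own evaluation function is not shown in this output), every value can be recomputed from TP, TN, FP and FN alone, which is useful when a printed block is cut off. The helper name `confusion_metrics` is hypothetical; the counts are copied from the log above.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
        "accuracy": accuracy,
    }


# Counts taken from the no-flip, unmasked TransQuest run logged above.
overall = confusion_metrics(tp=370, tn=2845, fp=473, fn=308)
en_zh = confusion_metrics(tp=58, tn=735, fp=123, fn=83)

for name, metrics in [("overall", overall), ("en-zh", en_zh)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Because F1 and MCC ignore or balance the large number of true negatives, they are more informative here than accuracy, which stays above 0.80 even when fewer than half of the critical errors are found.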

In [None]:
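# XLM-RoBERTa with TransQuest transfer learning + flipping
tq_flip = models[3]
full_experiment(tq_flip[0], tq_flip[1], tq_flip[2])
# Free Python-side and cached CUDA memory before the next experiment.
gc.collect()
torch.cuda.empty_cache()
# Drop tqdm's registry of live bars so stale bars do not garble later output.
tqdm._instances.clear()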

--------------- Using Model: TransQuest/monotransquest-da-multilingual
--------------- Flipping sentences: True , Masking tokens: False
Reading training data...
Reading validation data...




Total number of parameters: 559890432


 50%|█████     | 3548/7094 [05:52<05:39, 10.44it/s]


Iteration 3547/7094 of epoch 1 complete. Loss : 1.8602215504777375 


100%|██████████| 7094/7094 [11:43<00:00, 10.09it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 1 complete. Loss : 1.1431681461162784 


100%|██████████| 374/374 [00:12<00:00, 29.30it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 1 complete!
Validation Loss : 1.027298344249394
Validation F1 score: 0.4502, Recall: 0.4065, MCC: 0.3436


100%|██████████| 500/500 [00:17<00:00, 28.55it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.4518, Recall: 0.4145, MCC: 0.3518
Best validation loss improved from inf to 1.027298344249394



 50%|█████     | 3548/7094 [05:55<05:56,  9.94it/s]


Iteration 3547/7094 of epoch 2 complete. Loss : 0.9572139230914138 


100%|██████████| 7094/7094 [11:51<00:00,  9.86it/s]


Iteration 7094/7094 of epoch 2 complete. Loss : 0.9390217162067953 


100%|██████████| 374/374 [00:13<00:00, 27.17it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 2 complete!
Validation Loss : 1.146680568867826
Validation F1 score: 0.4694, Recall: 0.4281, MCC: 0.3647


100%|██████████| 500/500 [00:17<00:00, 28.25it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.4969, Recall: 0.4638, MCC: 0.4024


 50%|█████     | 3549/7094 [05:58<05:40, 10.42it/s]


Iteration 3547/7094 of epoch 3 complete. Loss : 0.9093068431866465 


100%|██████████| 7094/7094 [11:52<00:00,  9.95it/s]


Iteration 7094/7094 of epoch 3 complete. Loss : 0.8770470802895651 



100%|██████████| 374/374 [00:12<00:00, 29.77it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 3 complete!
Validation Loss : 1.1505264977321905
Validation F1 score: 0.4954, Recall: 0.4856, MCC: 0.3830


100%|██████████| 500/500 [00:16<00:00, 30.56it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5527, Recall: 0.5246, MCC: 0.4663


 50%|█████     | 3549/7094 [05:49<05:30, 10.73it/s]


Iteration 3547/7094 of epoch 4 complete. Loss : 0.8257685527129323 


100%|██████████| 7094/7094 [11:37<00:00, 10.17it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 4 complete. Loss : 0.794012410243123 


100%|██████████| 374/374 [00:12<00:00, 30.35it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 4 complete!
Validation Loss : 1.3510101900540572
Validation F1 score: 0.4454, Recall: 0.3597, MCC: 0.3683


100%|██████████| 500/500 [00:16<00:00, 30.94it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5352, Recall: 0.4406, MCC: 0.4774


 50%|█████     | 3549/7094 [05:53<05:30, 10.73it/s]


Iteration 3547/7094 of epoch 5 complete. Loss : 0.7441929144501762 


100%|██████████| 7094/7094 [11:46<00:00, 10.04it/s]
  0%|          | 0/374 [00:00<?, ?it/s]


Iteration 7094/7094 of epoch 5 complete. Loss : 0.7282481979208384 


100%|██████████| 374/374 [00:13<00:00, 28.00it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 5 complete!
Validation Loss : 1.254249317959987
Validation F1 score: 0.4754, Recall: 0.4173, MCC: 0.3807


100%|██████████| 500/500 [00:17<00:00, 28.95it/s]



Minitrain F1 score: 0.6473, Recall: 0.5826, MCC: 0.5886
The model has been saved in models/TransQuest/monotransquest-da-multilingual_lr_2e-05_val_loss_1.0273_ep_1.pt





Loading the weights of the model...


  0%|          | 0/999 [00:00<?, ?it/s]

Predicting on test data...


100%|██████████| 999/999 [00:33<00:00, 30.03it/s]



Predictions are available in : predictions_TransQuest-monotransquest-da-multilingual-flip-unmasked.txt
Best Threshold=0.415771, F-Score=0.503
Predictions for  TransQuest/monotransquest-da-multilingual flip: True , masking: False
-----Evaluation-----
TP: 399, TN: 2809, FP: 509, FN: 279
F1:  0.5031525851197982
Precision:  0.43942731277533037
Recall:  0.588495575221239
MCC:  0.389718012185608
Accuracy:  0.8028028028028028

 For language pair en-cs
-----Evaluation-----
TP: 90, TN: 752, FP: 87, FN: 70
F1:  0.5341246290801187
Precision:  0.5084745762711864
Recall:  0.5625
MCC:  0.44070369358817907
Accuracy:  0.8428428428428428

 For language pair en-ja
-----Evaluation-----
TP: 24, TN: 856, FP: 47, FN: 72
F1:  0.2874251497005988
Precision:  0.3380281690140845
Recall:  0.25
MCC:  0.22705686090165325
Accuracy:  0.8808808808808809

 For language pair en-zh
-----Evaluation-----
TP: 69, TN: 729, FP: 129, FN: 72
F1:  0.4070796460176991
Precision:  0.3484848484848485
Recall:  0.48936170212765956
MC
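
Each run prints a `Best Threshold` together with the F-score reached at that threshold. How the notebook picks this value is not visible in this output. One common approach, sketched here purely as an assumption, is to sweep candidate cut-offs over the predicted probabilities and keep the one with the highest F1. The function names and the toy data are hypothetical and do not come from the notebook.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 of the positive class when predicting 1 for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(probs, labels):
    """Try every distinct predicted probability as a cut-off; keep the best F1."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(probs)):
        f1 = f1_at_threshold(probs, labels, t)
        if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy probabilities and gold labels, for illustration only.
probs = [0.10, 0.35, 0.40, 0.46, 0.55, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1]

best_t, best_f1 = best_f1_threshold(probs, labels)
print(f"Best Threshold={best_t:f}, F-Score={best_f1:.3f}")
```

If the threshold is tuned on the same split that is then reported, the resulting F-score is optimistic; tuning it on the validation split and applying it unchanged to the test split avoids this.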

In [None]:
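# XLM-RoBERTa with TransQuest transfer learning + no flipping + masking
# Assumes the (no flipping, masking) configuration is stored at index 4 of `models`;
# index 3 is the flipping configuration run in the previous cell.
tq_mask = models[4]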
full_experiment(tq_mask[0], tq_mask[1], tq_mask[2])
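# Free Python-side and cached CUDA memory before the next experiment.
gc.collect()
torch.cuda.empty_cache()
# Drop tqdm's registry of live bars so stale bars do not garble later output.
tqdm._instances.clear()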

--------------- Using Model: TransQuest/monotransquest-da-multilingual
--------------- Flipping sentences: False , Masking tokens: True
Reading training data...
Reading validation data...




Total number of parameters: 559890432


 50%|█████     | 3548/7094 [05:54<05:44, 10.28it/s]


Iteration 3547/7094 of epoch 1 complete. Loss : 1.1586007820234892 


100%|██████████| 7094/7094 [11:50<00:00,  8.83it/s]


Iteration 7094/7094 of epoch 1 complete. Loss : 1.0592990099736794 


100%|██████████| 374/374 [00:14<00:00, 26.38it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 1 complete!
Validation Loss : 1.0732884786345742
Validation F1 score: 0.3595, Recall: 0.2554, MCC: 0.3152


100%|██████████| 500/500 [00:18<00:00, 26.93it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.3944, Recall: 0.2870, MCC: 0.3539
Best validation loss improved from inf to 1.0732884786345742



 50%|█████     | 3549/7094 [05:59<05:42, 10.36it/s]


Iteration 3547/7094 of epoch 2 complete. Loss : 0.9697638544003779 


100%|██████████| 7094/7094 [12:01<00:00,  9.87it/s]


Iteration 7094/7094 of epoch 2 complete. Loss : 0.9808795727539439 


100%|██████████| 374/374 [00:13<00:00, 27.31it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 2 complete!
Validation Loss : 0.9895021450073324
Validation F1 score: 0.4675, Recall: 0.4137, MCC: 0.3691


100%|██████████| 500/500 [00:17<00:00, 28.55it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.4884, Recall: 0.4261, MCC: 0.4060
Best validation loss improved from 1.0732884786345742 to 0.9895021450073324



 50%|█████     | 3548/7094 [06:01<05:50, 10.12it/s]


Iteration 3547/7094 of epoch 3 complete. Loss : 0.9209895802623442 


100%|██████████| 7094/7094 [11:56<00:00,  9.80it/s]


Iteration 7094/7094 of epoch 3 complete. Loss : 0.922638288057641 


100%|██████████| 374/374 [00:13<00:00, 27.38it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 3 complete!
Validation Loss : 0.9596643502460444
Validation F1 score: 0.4893, Recall: 0.4928, MCC: 0.3715


100%|██████████| 500/500 [00:18<00:00, 27.76it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5467, Recall: 0.5594, MCC: 0.4498
Best validation loss improved from 0.9895021450073324 to 0.9596643502460444



 50%|█████     | 3548/7094 [06:02<05:40, 10.40it/s]


Iteration 3547/7094 of epoch 4 complete. Loss : 0.9134597431493895 


100%|██████████| 7094/7094 [12:02<00:00,  9.93it/s]


Iteration 7094/7094 of epoch 4 complete. Loss : 0.8958285536697386 


100%|██████████| 374/374 [00:13<00:00, 27.69it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 4 complete!
Validation Loss : 1.0837493802216602
Validation F1 score: 0.4945, Recall: 0.4856, MCC: 0.3816


100%|██████████| 500/500 [00:17<00:00, 28.00it/s]
  0%|          | 0/7094 [00:00<?, ?it/s]


Minitrain F1 score: 0.5669, Recall: 0.5710, MCC: 0.4759


 50%|█████     | 3548/7094 [05:54<05:51, 10.08it/s]


Iteration 3547/7094 of epoch 5 complete. Loss : 0.8605265771122852 


100%|██████████| 7094/7094 [11:51<00:00, 10.21it/s]


Iteration 7094/7094 of epoch 5 complete. Loss : 0.8479663545720296 


100%|██████████| 374/374 [00:12<00:00, 29.25it/s]
  0%|          | 0/500 [00:00<?, ?it/s]


Epoch 5 complete!
Validation Loss : 1.129438603386522
Validation F1 score: 0.4658, Recall: 0.4784, MCC: 0.3399


100%|██████████| 500/500 [00:17<00:00, 29.20it/s]



Minitrain F1 score: 0.5976, Recall: 0.6435, MCC: 0.5083
The model has been saved in models/TransQuest/monotransquest-da-multilingual_lr_2e-05_val_loss_0.95966_ep_3.pt





Loading the weights of the model...


  0%|          | 0/999 [00:00<?, ?it/s]

Predicting on test data...


100%|██████████| 999/999 [00:37<00:00, 26.85it/s]



Predictions are available in : predictions_TransQuest-monotransquest-da-multilingual-noflip-masked.txt
Best Threshold=0.474609, F-Score=0.496
Predictions for  TransQuest/monotransquest-da-multilingual flip: False , masking: True
-----Evaluation-----
TP: 378, TN: 2849, FP: 469, FN: 300
F1:  0.49573770491803276
Precision:  0.4462809917355372
Recall:  0.5575221238938053
MCC:  0.38220623632355705
Accuracy:  0.8075575575575575

 For language pair en-cs
-----Evaluation-----
TP: 89, TN: 746, FP: 93, FN: 71
F1:  0.52046783625731
Precision:  0.489010989010989
Recall:  0.55625
MCC:  0.4232024010570228
Accuracy:  0.8358358358358359

 For language pair en-ja
-----Evaluation-----
TP: 33, TN: 840, FP: 63, FN: 63
F1:  0.34375
Precision:  0.34375
Recall:  0.34375
MCC:  0.27398255813953487
Accuracy:  0.8738738738738738

 For language pair en-zh
-----Evaluation-----
TP: 68, TN: 704, FP: 154, FN: 73
F1:  0.37465564738292007
Precision:  0.3063063063063063
Recall:  0.48226950354609927
MCC:  0.253569329223
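
The training log above keeps the checkpoint with the lowest validation loss (see the "Best validation loss improved from ... to ..." lines and the saved file name ending in `val_loss_0.95966_ep_3.pt`). The following is a minimal sketch of that bookkeeping, using the per-epoch values copied from the masking run. The helper `track_best` is hypothetical, and selecting by validation F1 is shown only as a contrasting alternative, not as something the notebook does.

```python
# Per-epoch validation metrics copied from the masking run logged above.
val_loss = {
    1: 1.0732884786345742,
    2: 0.9895021450073324,
    3: 0.9596643502460444,
    4: 1.0837493802216602,
    5: 1.129438603386522,
}
val_f1 = {1: 0.3595, 2: 0.4675, 3: 0.4893, 4: 0.4945, 5: 0.4658}


def track_best(metric_by_epoch, mode="min"):
    """Return (epoch, value) of the best metric; 'min' for losses, 'max' for scores."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    best_epoch = None
    best_value = float("inf") if mode == "min" else float("-inf")
    for epoch in sorted(metric_by_epoch):
        value = metric_by_epoch[epoch]
        improved = value < best_value if mode == "min" else value > best_value
        if improved:
            # In a training loop, this is where the checkpoint would be saved.
            best_epoch, best_value = epoch, value
    return best_epoch, best_value


loss_epoch, loss_value = track_best(val_loss, mode="min")
f1_epoch, f1_value = track_best(val_f1, mode="max")
print(f"lowest validation loss: epoch {loss_epoch} ({loss_value:.5f})")
print(f"highest validation F1:  epoch {f1_epoch} ({f1_value:.4f})")
```

For this run the two criteria disagree by one epoch, so the choice of selection metric can change which checkpoint is evaluated on the test data.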