# Assignment 3 - Natural Language Processing - Part 3 Multilingual

In this exploration of multilingual and cross-lingual modeling, we examine how well BERT-based transformer models adapt and learn across languages.

First, we determine how the Dutch texts can be incorporated into the existing English texts. Because the Dutch texts are translations of the English texts, they share the same label values and can therefore be added in two ways. First, the Dutch texts can be concatenated to the English texts, yielding instances that contain both the English and the Dutch text. This would give the (multilingual) transformer more information per entry to learn from. Second, the Dutch texts can be appended below the English texts as new entries. This doubles the length of the dataset, in essence adding new samples.
While the first approach is interesting, it is not feasible with transformers because of the token limit. Concatenating the texts increases the length of each entry and, as already identified in parts 1 and 2, crucial information is then not processed because of the 512-token input limit. If the chunking approach were used instead, the Dutch part would end up in separate chunks and thus be treated as new entries, which makes it similar to the second approach. For this reason, we opt for appending the Dutch texts below the English texts as new entries. Moreover, we adopt the modified token input approach developed in part 1 to handle longer input sequences.
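
The difference between the two options can be illustrated on a toy example. This is only a sketch: the rows below are made up, and the two-element labels stand in for the five label columns of the real dataset.

```python
# Toy illustration with hypothetical rows; the real data has five label columns.
english = [
    {"text": "I just woke up.", "label": [0, 1]},
    {"text": "Here we go again.", "label": [1, 0]},
]
dutch = [
    {"text": "Ik werd net wakker.", "label": [0, 1]},
    {"text": "Daar gaan we weer.", "label": [1, 0]},
]

# Option 1: concatenate each Dutch translation to its English source text.
# The number of entries stays the same, but every entry becomes about twice as long.
concatenated = [
    {"text": en["text"] + " " + nl["text"], "label": en["label"]}
    for en, nl in zip(english, dutch)
]

# Option 2 (the one chosen here): append the Dutch texts as new entries.
# Entry length is unchanged, but the dataset becomes twice as long.
appended = english + dutch

print(len(concatenated), len(appended))
```

With option 2, every text stays as short as it was, which matters because of the 512-token limit discussed above.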

The plan we used for this assignment consisted of the following:

- Multilingual BERT Trained on English, Tested on Dutch: We used mBERT, known for its multilingual capability, training it solely on English texts and then assessing its performance on Dutch. This approach tests mBERT's ability to transfer what it learns in English to Dutch and serves as a baseline for its multilingual design.
- Multilingual BERT Trained on Dutch, Tested on English: This approach is the same as above with the languages swapped, to check whether the training language makes a difference.
- Multilingual BERT with Mixed Dutch and English Training and Testing: This step aimed to simulate a real-world multilingual environment. By training and testing mBERT on a balanced mix of Dutch and English texts, we examined its capacity to process and learn from both languages at the same time.
- Monolingual BERT with Mixed-Language Training and Testing: Here, we experimented with the original BERT, pre-trained predominantly on English, using a mix of Dutch and English texts. The objective was to explore whether a monolingual model could adapt to a new language (Dutch) alongside a familiar one (English), given that the Dutch samples have the same labels as their English counterparts.
- Monolingual BERT Trained on Dutch, Tested on English: This unconventional setup was used to understand how a primarily English-trained model like BERT responds to Dutch texts during training and how it subsequently performs on English.
- Monolingual BERT Trained on English, Tested on Dutch: This final approach was chosen to determine whether the monolingual BERT, fine-tuned on English, could apply its learned English-language understanding to Dutch texts.

We expect the first three (mBERT) configurations to perform similarly to previous findings, given that mBERT's multilingual design and pre-training include Dutch. The fourth phase is more experimental: it explores language adaptability in NLP models, in particular whether a monolingual model can extend its proficiency to a language outside its original training scope.
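
The six configurations listed above can be written down as a small table of settings, which makes it easy to see which runs are truly cross-lingual. This is a sketch, not the code used for the runs: the notebook below loads `bert-base-cased` for the monolingual model, while the mBERT checkpoint name `bert-base-multilingual-cased` is an assumption, as are the language codes and the helper `is_cross_lingual`.

```python
# Checkpoint names: 'bert-base-cased' appears in the notebook;
# 'bert-base-multilingual-cased' is ASSUMED for the mBERT runs.
MBERT = "bert-base-multilingual-cased"
BERT = "bert-base-cased"

EXPERIMENTS = [
    {"model": MBERT, "train": ["en"], "test": ["nl"]},
    {"model": MBERT, "train": ["nl"], "test": ["en"]},
    {"model": MBERT, "train": ["en", "nl"], "test": ["en", "nl"]},
    {"model": BERT, "train": ["en", "nl"], "test": ["en", "nl"]},
    {"model": BERT, "train": ["nl"], "test": ["en"]},
    {"model": BERT, "train": ["en"], "test": ["nl"]},
]


def is_cross_lingual(exp):
    """True when no test language was seen during fine-tuning."""
    return not set(exp["train"]) & set(exp["test"])


for exp in EXPERIMENTS:
    kind = "cross-lingual" if is_cross_lingual(exp) else "mixed"
    print(f'{exp["model"]}: train={exp["train"]} test={exp["test"]} ({kind})')
```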

In [None]:
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from sklearn.datasets import make_multilabel_classification
import numpy as np


In [None]:
# Install necessary packages
!pip install nvidia-ml-py3
!pip install scikit-multilearn
!pip install evaluate
!pip install datasets
!pip install -U accelerate
!pip install -U transformers
!pip install tensorboard


In [None]:
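from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    """Print the memory currently in use on GPU 0."""
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    """Print runtime and throughput of a finished training run, plus GPU memory."""
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()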

In [None]:
data = pd.read_csv("drive/MyDrive/Levy/english_dutch_texts.csv")

In [None]:
# Split the data: the first 2963 rows are English, the remaining rows are the Dutch translations
english_df = data.iloc[:2963]
dutch_df = data.iloc[2963:]

In [None]:
dutch_df

Unnamed: 0.1,Unnamed: 0,TEXT,cEXT,cNEU,cAGR,cCON,cOPN
2963,2963,"Nou, op dit moment werd ik net wakker na een m...",0,1,1,0,1
2964,2964,"Nou, hier gaan we met de stroom van bewustzijn...",0,0,1,0,0
2965,2965,Een open toetsenbord en knoppen om in te drukk...,0,1,0,1,1
2966,2966,Ik kan het niet geloven! Het gebeurt echt! Mij...,1,0,1,1,0
2967,2967,"Welnu, hier ga ik weer met de goede oude stroo...",1,0,1,0,1
...,...,...,...,...,...,...,...
5921,5921,Ik word dagelijks gemotiveerd door de noodzaak...,1,0,0,1,1
5922,5922,Mijn zoon is het grootste deel van mijn leven ...,1,1,0,0,0
5923,5923,Mijn kinderen en kleinkinderen houden me elke ...,1,0,1,1,0
5924,5924,Mijn grootste drijfveer is om geld te verdiene...,0,0,0,0,0


In [None]:
# #prepare datasets for splitting ENGLISH

# from skmultilearn.model_selection import iterative_train_test_split

# x_eng = english_df['TEXT']
# y_eng = english_df[['cEXT',	'cNEU',	'cAGR',	'cCON',	'cOPN']]


# # Convert the labels DataFrame to a numpy array
# y_eng_array = y_eng.to_numpy()

# # Iterative stratification to split the dataset
# x_eng_array = x_eng.to_numpy().reshape(-1, 1)
# x_eng_train_it, y_eng_train_it, x_eng_test_it, y_eng_test_it = iterative_train_test_split(x_eng_array, y_eng_array, test_size = 0.2)

# # Display the shapes of the train and test sets after iterative stratification
# print(x_eng_train_it.shape, x_eng_test_it.shape, y_eng_train_it.shape, y_eng_test_it.shape)


In [None]:
# #prepare datasets for splitting DUTCH

# x_dutch = dutch_df['TEXT']
# y_dutch = dutch_df[['cEXT',	'cNEU',	'cAGR',	'cCON',	'cOPN']]


# # Convert the labels DataFrame to a numpy array
# y_dutch_array = y_dutch.to_numpy()

# # Iterative stratification to split the dataset
# x_dutch_array = x_dutch.to_numpy().reshape(-1, 1)
# x_dutch_train_it, y_dutch_train_it, x_dutch_test_it, y_dutch_test_it = iterative_train_test_split(x_dutch_array, y_dutch_array, test_size = 0.2)

# # Display the shapes of the train and test sets after iterative stratification
# print(x_dutch_train_it.shape, x_dutch_test_it.shape, y_dutch_train_it.shape, y_dutch_test_it.shape)


In [None]:
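# Build the records by column name rather than by position; this run trains on English and tests on Dutch
label_cols = ['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']
train_data = [{'label': y, 'text': x} for x, y in zip(english_df['TEXT'], english_df[label_cols].values.tolist())]
test_data = [{'label': y, 'text': x} for x, y in zip(dutch_df['TEXT'], dutch_df[label_cols].values.tolist())]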


# Creating the final data structure
multilabel_df = {
    'train': train_data,
    'test': test_data
}

In [None]:
# #combining both datasets into training and testing
# x_test_it = np.concatenate((x_eng_test_it, x_dutch_test_it))
# x_train_it = np.concatenate((x_eng_train_it, x_dutch_train_it))
# y_test_it = np.concatenate((y_eng_test_it, y_dutch_test_it))
# y_train_it = np.concatenate((y_eng_train_it, y_dutch_train_it))

In [None]:
# # Creating a list of dictionaries for the test set
# test_data = [{'label': label, 'text': text} for label, text in zip(y_test_it, x_test_it)]
# train_data = [{'label': label, 'text': text} for label, text in zip(y_train_it, x_train_it)]

# # Creating the final data structure
# multilabel_df = {
#     'train': train_data,
#     'test': test_data
# }

In [None]:
multilabel_df['train'][-1]

{'label': [1, 1, 0, 0, 0],
 'text': "People who never give up, cause life can be very cruel and a strong enemy for the most of the time and a lot of people give up, the people who don't are a real motivation for me because is not easy to don't give up. I want a job without bosses over me. My ideal job is to be an entrepreneur so i don't have to respond to anyone and i can reach financial freeedom. I will work a variable number of hours at my discretion doing some financial tasks, but i will love doing that. The rest of the time is freetime Very calm, sociable and sympathetic, as well as generous and humble, at the same time worthy of respect. easily short-tempered, greedy and unscrupulous, judicious and prejudiced enjoy the sea and the walks, stay outdoors and in the evening go to nice clubs with music and good shows. With constance, resourcefulness, humility, determination, willpower and resilience We are failing in many aspects such as the carelessness of the planet and corruption in

In [None]:
def label_distribution(data):
    # Initialize counts for each label
    label_counts = {i: {'1': 0, '0': 0} for i in range(5)}

    # Iterate over each entry and count the label occurrences
    for entry in data:
        labels = entry['label']
        for i, label in enumerate(labels):
            label_str = str(int(label))  # Convert label to string (either '1' or '0')
            label_counts[i][label_str] += 1

    return label_counts

# Calculating label distributions for train and test sets
train_label_distribution = label_distribution(multilabel_df['train'])
test_label_distribution = label_distribution(multilabel_df['test'])

train_label_distribution, test_label_distribution

({0: {'1': 1491, '0': 1472},
  1: {'1': 1482, '0': 1481},
  2: {'1': 1537, '0': 1426},
  3: {'1': 1518, '0': 1445},
  4: {'1': 1431, '0': 1532}},
 {0: {'1': 1491, '0': 1472},
  1: {'1': 1482, '0': 1481},
  2: {'1': 1537, '0': 1426},
  3: {'1': 1518, '0': 1445},
  4: {'1': 1431, '0': 1532}})

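The classes are relatively well balanced in both splits. The two distributions are identical because the Dutch texts are translations of the English texts and carry the same labels. Performance will be monitored to see whether this balance also results in well-behaved models.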
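
To quantify the balance, the share of positive examples per label can be computed from the counts printed above. A small sketch; the counts are copied from that output and the label names from the dataset columns.

```python
# Counts of positive ('1') and negative ('0') examples per label, copied from the output above.
label_names = ["cEXT", "cNEU", "cAGR", "cCON", "cOPN"]
counts = {
    0: {"1": 1491, "0": 1472},
    1: {"1": 1482, "0": 1481},
    2: {"1": 1537, "0": 1426},
    3: {"1": 1518, "0": 1445},
    4: {"1": 1431, "0": 1532},
}

# Share of positive examples per label; 0.5 would be perfectly balanced.
positive_rate = {
    label_names[i]: c["1"] / (c["1"] + c["0"]) for i, c in counts.items()
}

for name, rate in positive_rate.items():
    print(f"{name}: {rate:.3f}")
```

All five rates lie within a few percentage points of 0.5.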

In [None]:
# Tokenizer for the multi-label task; this run uses the monolingual cased BERT
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [None]:
import torch

def tokenize_and_chunk(texts, labels, tokenizer, chunk_size=510):
    input_id_chunks = []
    mask_chunks = []
    chunked_labels = []

    for text, label in zip(texts, labels):
        # Ensure text is a string
        text = str(text)

        # Tokenize without special tokens
        tokens = tokenizer.encode_plus(text, add_special_tokens=False,
                                       return_tensors='pt', truncation=False)

        # Split into chunks of `chunk_size`
        input_ids = tokens['input_ids'][0]
        attention_mask = tokens['attention_mask'][0]
        num_chunks = (len(input_ids) // chunk_size) + int(len(input_ids) % chunk_size != 0)

        for i in range(num_chunks):
            # Define the start and end of the chunk
            start = i * chunk_size
            end = start + chunk_size

            # Extract chunks for input IDs and attention mask
            input_ids_chunk = input_ids[start:end]
            attention_mask_chunk = attention_mask[start:end]

            # Add [CLS] and [SEP] tokens (IDs taken from the tokenizer instead of hard-coding 101/102)
            input_ids_chunk = torch.tensor([tokenizer.cls_token_id] + input_ids_chunk.tolist() + [tokenizer.sep_token_id])
            attention_mask_chunk = torch.tensor([1] + attention_mask_chunk.tolist() + [1])

            # Pad the sequences to chunk_size + 2 tokens (512 for the default chunk_size of 510)
            padding_length = chunk_size + 2 - len(input_ids_chunk)
            input_ids_chunk = torch.cat([input_ids_chunk, torch.full((padding_length,), tokenizer.pad_token_id, dtype=torch.long)])
            attention_mask_chunk = torch.cat([attention_mask_chunk, torch.zeros(padding_length, dtype=torch.long)])

            # Store the chunks
            input_id_chunks.append(input_ids_chunk)
            mask_chunks.append(attention_mask_chunk)
            chunked_labels.append(torch.tensor(label, dtype=torch.float))  # Convert label list to tensor of floats

    # Convert lists to tensors
    input_ids_tensor = torch.stack(input_id_chunks)
    attention_mask_tensor = torch.stack(mask_chunks)
    labels_tensor = torch.stack(chunked_labels)  # Stack label tensors

    return input_ids_tensor, attention_mask_tensor, labels_tensor


# Training data
train_texts = [item['text'] for item in multilabel_df['train']]
train_labels = [item['label'] for item in multilabel_df['train']]
input_ids_tensor_train, attention_mask_tensor_train, labels_tensor_train = tokenize_and_chunk(train_texts, train_labels, tokenizer)

# Preparing the training dictionary
input_dict_train = {
    'input_ids': input_ids_tensor_train.long(),
    'attention_mask': attention_mask_tensor_train.int(),
    'labels': labels_tensor_train
}

# Testing data
test_texts = [item['text'] for item in multilabel_df['test']]
test_labels = [item['label'] for item in multilabel_df['test']]
input_ids_tensor_test, attention_mask_tensor_test, labels_tensor_test = tokenize_and_chunk(test_texts, test_labels, tokenizer)

# Preparing the testing dictionary
input_dict_test = {
    'input_ids': input_ids_tensor_test.long(),
    'attention_mask': attention_mask_tensor_test.int(),
    'labels': labels_tensor_test
}
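
The number of chunks a text produces follows directly from its token count: with 510 content tokens per chunk (512 minus `[CLS]` and `[SEP]`), a text of `n` tokens yields `ceil(n / 510)` chunks. Each chunk inherits the label of its source text, so training and evaluation operate on chunks rather than on whole texts. The helper `num_chunks` below is hypothetical and only mirrors the arithmetic used in `tokenize_and_chunk`.

```python
import math


def num_chunks(num_tokens, chunk_size=510):
    """Number of chunks produced for a text with `num_tokens` tokens."""
    return math.ceil(num_tokens / chunk_size)


for n in (300, 510, 511, 1200):
    print(n, "tokens ->", num_chunks(n), "chunk(s)")
```

Note that a text with zero tokens would yield no chunk at all and would therefore silently disappear from the dataset.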

In [None]:
from datasets import Dataset

# Convert a 2-D tensor to a list of per-row Python lists.
# Note: `dtype` is kept for call compatibility but is not used; element types follow the tensor's dtype.
def tensor_to_list(tensor, dtype=int):
    return tensor.tolist()

# Convert tensors to lists
input_ids_list_train = tensor_to_list(input_ids_tensor_train, dtype=int)
attention_mask_list_train = tensor_to_list(attention_mask_tensor_train, dtype=int)
labels_list_train = tensor_to_list(labels_tensor_train, dtype=float)

input_ids_list_test = tensor_to_list(input_ids_tensor_test, dtype=int)
attention_mask_list_test = tensor_to_list(attention_mask_tensor_test, dtype=int)
labels_list_test = tensor_to_list(labels_tensor_test, dtype=float)

# Create DataFrame
df_train = pd.DataFrame({
    'input_ids': input_ids_list_train,
    'attention_mask': attention_mask_list_train,
    'labels': labels_list_train
})

df_test = pd.DataFrame({
    'input_ids': input_ids_list_test,
    'attention_mask': attention_mask_list_test,
    'labels': labels_list_test
})

# Create Hugging Face Dataset
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)


In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
labels = ['cEXT',	'cNEU',	'cAGR',	'cCON',	'cOPN']
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased",
                                  problem_type="multi_label_classification",
                                  num_labels=len(labels),
                                  id2label=id2label,
                                  label2id=label2id)

In [None]:
[id2label[idx] for idx, value in enumerate(train_dataset['labels'][0]) if value == 1.0]


['cNEU', 'cAGR', 'cOPN']

In [None]:
output_dir = '/content/drive/MyDrive/Levy/part3/bert_mono_english_to_dutch'


In [None]:
from transformers import TrainingArguments, Trainer, logging

logging.set_verbosity_error()

In [None]:
print_gpu_utilization()


GPU memory occupied: 4971 MB.


In [None]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch

def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
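    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # note: ROC AUC is computed on the thresholded predictions here, not on the probabilities
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    # note: for multi-label input this is exact-match (subset) accuracy: all 5 labels must be correct
    accuracy = accuracy_score(y_true, y_pred)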
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions,
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result
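
To make the thresholding step and the accuracy definition concrete, here is a dependency-free sketch of what `multi_label_metrics` does per sample. The logits and labels are hypothetical, and `predict` is an illustrative helper, not part of the notebook.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(logits, threshold=0.5):
    """Turn one row of logits into 0/1 predictions, one per label."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


# Hypothetical logits for two samples with five labels each.
logits = [[0.3, -1.2, 2.0, 0.0, -0.1], [1.5, 0.4, -0.7, -2.0, 0.9]]
y_true = [[1, 0, 1, 1, 0], [1, 1, 0, 1, 1]]
y_pred = [predict(row) for row in logits]

# Exact-match accuracy: a sample counts only if all five labels are right.
subset_acc = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

# Per-label accuracy, for comparison.
n_labels_total = sum(len(t) for t in y_true)
label_acc = sum(
    pi == ti for p, t in zip(y_pred, y_true) for pi, ti in zip(p, t)
) / n_labels_total

print(y_pred, subset_acc, label_acc)
```

The second sample has a single wrong label, yet it counts as entirely incorrect under exact-match accuracy. This is why the accuracy column in the training logs is much lower than the F1 score.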

In [None]:
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_steps = 20,
    logging_steps = 8,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    disable_tqdm = False,
    load_best_model_at_end=True,
    push_to_hub=False,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    fp16=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



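| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.6881        | 0.695777        | 0.560198 | 0.498633 | 0.030163 |
| 2     | 0.6611        | 0.745043        | 0.43395  | 0.510399 | 0.063242 |
| 4     | 0.6442        | 0.758931        | 0.410734 | 0.513603 | 0.063744 |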




TrainOutput(global_step=460, training_loss=0.6688788626504981, metrics={'train_runtime': 799.819, 'train_samples_per_second': 37.002, 'train_steps_per_second': 0.575, 'total_flos': 7745145641570304.0, 'train_loss': 0.6688788626504981, 'epoch': 4.97})

# Modeling

To be able to compare the different models, the hyperparameters are kept the same. The learning rate is set to 1e-5, which part 2 identified as the best choice. Both models are the cased variants, so that capitalization nuances in the input texts are preserved. The effective batch size is 64 (a per-device batch size of 8 with 8 gradient-accumulation steps), and each model is trained for 5 epochs.
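
The effective batch size follows from the training arguments shown earlier. A short worked example; the single-GPU setting is an assumption based on the Colab environment.

```python
# Values taken from the TrainingArguments and TrainOutput shown earlier.
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_devices = 1  # assumption: a single GPU (Colab)

# Number of samples (chunks) that contribute to one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)

# The English-to-Dutch BERT run reported global_step=460 for 5 epochs.
global_step, num_epochs = 460, 5
updates_per_epoch = global_step // num_epochs
print(updates_per_epoch)
```

Each epoch therefore consists of 92 optimizer updates of 64 chunks each.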

A particularly striking result came from the experiment in which mBERT was trained on Dutch texts and tested on English. This configuration outperformed the other mBERT configurations as well as the previous deep learning approaches: its micro-averaged F1 score was close to 0.66, a level not reached before. Moreover, the validation loss decreased steadily over time and more strongly than for the other mBERT models (in the mixed-language run it even started to rise again). Several factors might explain this improvement. The Dutch training data could be of higher quality or more relevant to the task, enabling the model to learn more effectively and to perform better when applied to a different language. In addition, mBERT's extensive multilingual pre-training may make it particularly good at exploiting the linguistic nuances of Dutch. Combined with the Germanic commonalities shared by Dutch and English, this might facilitate an efficient transfer of learned features and patterns from one language to the other. This finding highlights how complex and sometimes surprising cross-lingual behavior can be, and it reinforces the importance of investigating a range of language combinations to understand how language-specific characteristics affect model behavior and performance.

Training the monolingual BERT model on Dutch texts and testing it on English also produced an interesting outcome, especially considering that the model was pre-trained predominantly on English data. After 5 epochs (a logging error again prevented some of the intermediate epochs from being recorded), the model obtained a micro-averaged F1 score of 0.53. With roughly balanced classes, an F1 score of about 0.5 corresponds to random guessing, so this result is only slightly above chance. Even so, reaching this level in a cross-lingual setting is interesting, as it may suggest some degree of linguistic transferability between Dutch and English, possibly due to their shared Germanic roots. It also hints at the generalization capabilities of the BERT architecture, which seems to capture some patterns across languages despite its English-centric training.
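
The chance-level baseline can be checked with a small simulation. This is a sketch on synthetic data, not on the notebook's dataset: it draws balanced random labels and independent random predictions for five labels and computes the micro-averaged F1 and the exact-match accuracy by hand.

```python
import random

random.seed(0)
n_samples, n_labels = 20000, 5

# Balanced random ground truth and independent random predictions.
y_true = [[random.randint(0, 1) for _ in range(n_labels)] for _ in range(n_samples)]
y_pred = [[random.randint(0, 1) for _ in range(n_labels)] for _ in range(n_samples)]

# Pool true positives, false positives and false negatives over all labels (micro averaging).
tp = fp = fn = 0
for t_row, p_row in zip(y_true, y_pred):
    for t, p in zip(t_row, p_row):
        tp += (t == 1 and p == 1)
        fp += (t == 0 and p == 1)
        fn += (t == 1 and p == 0)

micro_f1 = 2 * tp / (2 * tp + fp + fn)

# Exact-match accuracy: all five labels of a sample must be correct.
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples

print(f"micro F1: {micro_f1:.3f}")
print(f"exact-match accuracy: {subset_accuracy:.3f} (chance level is 1/2**5 = {1 / 2**5:.3f})")
```

Random guessing lands near a micro F1 of 0.5 and an exact-match accuracy of about 0.03, which puts the accuracy values of 0.03 to 0.11 in the result tables into perspective.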

Contrasting the monolingual BERT's performance across training setups, training on English and testing on Dutch resulted in a considerably lower F1 score than training on Dutch and testing on English (0.41 versus 0.53 in the final epoch), while the ROC AUC was nearly identical (about 0.51 in both cases). This difference suggests that the model transfers better from Dutch to English, possibly because it already understands English and can relate Dutch nuances to English ones when predicting the personality traits. Remarkably, the mixed-language approach, with training and testing on a combination of English and Dutch texts, yielded moderately better results (micro-averaged F1 of 0.59 and ROC AUC of 0.55). This is marginally better than mBERT under the same mixed-language setup (final-epoch F1 of 0.586 versus 0.592) and points to the potential benefit of mixing a language the model knows with one it does not. It suggests that even a monolingually designed model can be nudged towards better generalization through exposure to translated inputs. These outcomes underscore the complex dynamics of language processing within NLP models.
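
For a quick side-by-side view, the final logged epoch of each run can be ranked by F1. The numbers are copied from the result tables in this notebook; the short configuration names are only labels chosen for this sketch.

```python
# Final logged epoch of each run, copied from the result tables in this notebook.
final_results = {
    "mBERT en->nl": {"f1": 0.608146, "roc_auc": 0.570895},
    "mBERT nl->en": {"f1": 0.659783, "roc_auc": 0.630236},
    "mBERT en+nl": {"f1": 0.585903, "roc_auc": 0.553307},
    "BERT en->nl": {"f1": 0.410734, "roc_auc": 0.513603},
    "BERT nl->en": {"f1": 0.533342, "roc_auc": 0.514245},
    "BERT en+nl": {"f1": 0.592000, "roc_auc": 0.553130},
}

# Sort the runs from highest to lowest micro-averaged F1.
ranked = sorted(final_results.items(), key=lambda kv: kv[1]["f1"], reverse=True)

for name, m in ranked:
    print(f"{name:<13} F1={m['f1']:.3f}  ROC AUC={m['roc_auc']:.3f}")
```

Because several runs are missing intermediate epochs, this compares final epochs only and not the best epoch of each run.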

Link to the drive folder with screenshots of runs: https://drive.google.com/drive/folders/1ftR9k9Ze3NJUv332_PJTyZmnJtrWBoF0?usp=sharing

mBERT English training to Dutch testing

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.693300      | 0.691523        | 0.611802 | 0.523768 | 0.039733 |
| 1     | 0.681400      | 0.685460        | 0.582379 | 0.555086 | 0.055684 |
| 2     | 0.667200      | 0.680633        | 0.601897 | 0.564168 | 0.066850 |
| 3     | 0.657500      | 0.677306        | 0.603342 | 0.567819 | 0.068010 |
| 4     | 0.662800      | 0.676688        | 0.608146 | 0.570895 | 0.067575 |

mBERT Dutch training to English testing

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.678300      | 0.656062        | 0.658555 | 0.606031 | 0.082491 |
| 1     | 0.671000      | 0.651009        | 0.637774 | 0.615455 | 0.093740 |
| 2     | 0.654700      | 0.645095        | 0.671791 | 0.618826 | 0.099283 |
| 4     | 0.641800      | 0.638074        | 0.659783 | 0.630236 | 0.109553 |

mBERT both languages at the same time

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 1     | 0.687600      | 0.688543        | 0.592471 | 0.530410 | 0.031410 |
| 2     | 0.672800      | 0.685085        | 0.596622 | 0.548984 | 0.046329 |
| 3     | 0.674600      | 0.687769        | 0.577742 | 0.547794 | 0.047114 |
| 4     | 0.643000      | 0.690588        | 0.603624 | 0.552349 | 0.051040 |
| 5     | 0.635300      | 0.692506        | 0.585903 | 0.553307 | 0.051826 |

BERT English training to Dutch testing

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.688100      | 0.695777        | 0.560198 | 0.498633 | 0.030163 |
| 2     | 0.661100      | 0.745043        | 0.433950 | 0.510399 | 0.063242 |
| 4     | 0.644200      | 0.758931        | 0.410734 | 0.513603 | 0.063744 |

BERT Dutch training to English testing

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.691000      | 0.693861        | 0.609132 | 0.511146 | 0.033958 |
| 2     | 0.685900      | 0.694748        | 0.537271 | 0.511109 | 0.038858 |
| 4     | 0.668000      | 0.696009        | 0.533342 | 0.514245 | 0.040547 |

BERT both languages at the same time

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.693500      | 0.690956        | 0.587281 | 0.527415 | 0.034338 |
| 1     | 0.682000      | 0.686498        | 0.596081 | 0.544879 | 0.055718 |
| 2     | 0.676200      | 0.685076        | 0.580311 | 0.548045 | 0.048591 |
| 4     | 0.653800      | 0.688330        | 0.592000 | 0.553130 | 0.046971 |