# Domain Adaptation
Domain adaptation consists of fine-tuning a pretrained language model on in-domain data, so that its predictions better reflect the vocabulary of the task at hand. This is done through masked language modeling: the model is trained to predict missing ("masked") words from their surrounding context.

This notebook was created using Google Colab.

### Installing Libraries

In [None]:
#Installing libraries for the first run
!pip install datasets
!pip install transformers
!pip install --upgrade accelerate
# accelerate was buggy at the time of writing; reinstalling both packages together is necessary
!pip uninstall -y transformers accelerate
!pip install transformers accelerate
!pip install evaluate

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from huggingface_hub import notebook_login
notebook_login()

### Loading Data

In [None]:
import pandas as pd

path = "/content/drive/My Drive/Colab Notebooks/train_dev_test_splits"

dataset = {}
dataset["de"] = {}
dataset["fr"] = {}

dataset["de"]["train"] = pd.read_csv(path + "/de.train.csv", sep="\t")
dataset["de"]["val"] = pd.read_csv(path + "/de.valid.csv", sep="\t")
dataset["de"]["test"] = pd.read_csv(path + "/de.test.csv", sep="\t")

dataset["fr"]["train"] = pd.read_csv(path + "/fr.train.csv", sep="\t")
dataset["fr"]["val"] = pd.read_csv(path + "/fr.valid.csv", sep="\t")
dataset["fr"]["test"] = pd.read_csv(path + "/fr.test.csv", sep="\t")

### Dataset Preparation
Prepares the data by dropping unnecessary columns and gathering the train, validation and test splits into a single dataset dictionary.

In [None]:
def prepare_dataset_task(lang_dataset, task):
  drop_cols = list(lang_dataset["train"].columns)
  drop_cols.remove("content")
  drop_cols.remove(task)
  for split in ["train", "val", "test"]:
    lang_dataset[split].drop(columns=drop_cols, inplace=True)
    lang_dataset[split].rename(columns={task :'label'}, inplace = True)
  return lang_dataset

In [None]:
from datasets import Dataset, DatasetDict

def create_dict(lang_dataset):
  for split in ["train", "val", "test"]:
    lang_dataset[split] = Dataset.from_pandas(lang_dataset[split])
  # gather everyone if you want to have a single DatasetDict
  return DatasetDict({
      'train': lang_dataset["train"],
      'validation': lang_dataset["val"],
      'test': lang_dataset["test"]
  })

### Tokenization and Preprocessing

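In addition to tokenizing, we store the `word_ids` of each sample so that whole word masking can be applied later. Truncation is not enabled, so no text is discarded at this stage; long samples are instead split into chunks in the next step.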

In [None]:
#!pip install transformers
from transformers import AutoTokenizer

def tokenize_dataset(model_name, dataset_dict):
  tokenizer = AutoTokenizer.from_pretrained(model_name)

  def preprocess_function(sample):
    result = tokenizer(sample["content"])
    if tokenizer.is_fast:
      result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

  return dataset_dict.map(preprocess_function, batched=True, remove_columns=["content", "label"]), tokenizer
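
To illustrate why the `word_ids` are worth keeping, here is a minimal sketch of how sub-word tokens can be grouped back into words, which is the first step of whole word masking. The `word_ids` list is a made-up example in the format a fast tokenizer returns (`None` for special tokens), and `group_tokens_by_word` is a hypothetical helper that is not part of this notebook.

```python
def group_tokens_by_word(word_ids):
    """Group token positions that belong to the same word.

    Consecutive positions with the same word id form one group, so the
    function still works when several samples have been concatenated and
    word ids restart from 0.
    """
    groups = []
    previous = None
    for position, word_id in enumerate(word_ids):
        if word_id is None:  # special tokens such as [CLS] / [SEP]
            previous = None
            continue
        if word_id != previous:
            groups.append([position])    # first token of a new word
        else:
            groups[-1].append(position)  # continuation of the current word
        previous = word_id
    return groups


# Hypothetical word ids for a three-word sentence split into six sub-word tokens
example_word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_groups = group_tokens_by_word(example_word_ids)
print(word_groups)  # [[1, 2], [3], [4, 5, 6]]
```

With such groups available, a whole word masking collator could mask all positions of a selected group at once instead of masking single tokens.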

The samples are concatenated and split into equal-sized chunks of 64 tokens. This size was chosen according to the size of our data and GPU memory restrictions (for context, the tokenizer's maximum length is 512). A new `labels` column is created to hold the ground truth for training.

In [None]:
chunk_size = 64
def group_texts(examples):

    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}

    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size

    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }

    # Create a new labels column
    result["labels"] = result["input_ids"].copy()

    return result
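
As a rough illustration of the concatenate-and-chunk step, the sketch below applies the same idea to a toy list of token ids with a chunk size of 4. The function name `chunk_token_ids` and the toy values are assumptions made for the example; the notebook itself uses `group_texts` with `chunk_size = 64`.

```python
def chunk_token_ids(samples, chunk_size):
    """Concatenate tokenized samples and split them into fixed-size chunks."""
    # Flatten all samples into one long list of token ids
    concatenated = [token for sample in samples for token in sample]
    # Drop the tail that does not fill a complete chunk
    usable_length = (len(concatenated) // chunk_size) * chunk_size
    return [
        concatenated[i : i + chunk_size]
        for i in range(0, usable_length, chunk_size)
    ]


toy_samples = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
chunks = chunk_token_ids(toy_samples, chunk_size=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that chunks can span sample boundaries and that the last two tokens are dropped. This is also why the number of rows changes after grouping: rows no longer correspond to original samples.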


### Training



We loaded `AutoModelForMaskedLM` from the same pretrained checkpoints used for classification. We trained for 20 epochs with a learning rate of 2e-5 and a batch size of 16, to accommodate GPU restrictions. The data collator was `DataCollatorForLanguageModeling` with a masking probability of 15%, which applies random token masking.

In [None]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding
from transformers import AutoModelForMaskedLM
import torch.nn.functional as F
import numpy as np
import torch


def create_trainer(model_name, tokenized_dataset, tokenizer, results_folder):
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    logging_steps = len(tokenized_dataset["train"]) // 16


    training_args = TrainingArguments(
        output_dir=f"./{results_folder}",
        overwrite_output_dir=True,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        weight_decay=0.01,
        num_train_epochs=20,
        fp16=True,
        evaluation_strategy="epoch", # run validation at the end of each epoch
        push_to_hub=True,
        logging_steps = logging_steps
    )

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15) #DataCollatorWithPadding(tokenizer=tokenizer) 

    return Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["validation"],
        tokenizer=tokenizer,
        data_collator=data_collator,
    )
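
The effect of random masking can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: every selected token is replaced by the mask token, whereas `DataCollatorForLanguageModeling` also replaces a share of the selected tokens with random tokens or leaves them unchanged. The helper `mask_tokens`, the `mask_token_id` value and the token ids are hypothetical.

```python
import random


def mask_tokens(input_ids, mask_token_id, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    masked_ids, labels = [], []
    for token_id in input_ids:
        if rng.random() < mlm_probability:
            masked_ids.append(mask_token_id)  # hide the token from the model
            labels.append(token_id)           # the model must recover it
        else:
            masked_ids.append(token_id)       # token stays visible
            labels.append(-100)               # -100 is ignored by the loss
    return masked_ids, labels


token_ids = list(range(100, 164))  # a hypothetical chunk of 64 token ids
masked_ids, labels = mask_tokens(token_ids, mask_token_id=4)
num_masked = sum(label != -100 for label in labels)
print(f"{num_masked} of {len(token_ids)} tokens masked")
```

Only the masked positions contribute to the loss, so the model is scored purely on how well it reconstructs the hidden tokens.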


This is the training and evaluation loop: it prepares and preprocesses the dataset, trains the model, and evaluates perplexity before and after training. We chose perplexity as the metric because it measures how well a language model predicts words: the lower the perplexity, the more accurate and confident the model is in its predictions.

In [None]:
from transformers import DataCollatorForLanguageModeling
import math 

def train_test(lang, task, model_name):
  #Load data and prepare for task
  lang_dataset = dataset[lang]
  lang_dataset_task = prepare_dataset_task(lang_dataset, task)
  train_valid_test_dataset = create_dict(lang_dataset_task)
  
  #Tokenize dataset
  tokenized_dataset, tokenizer = tokenize_dataset(model_name, train_valid_test_dataset)
  lm_dataset = tokenized_dataset.map(lambda examples : group_texts(examples), batched=True)

  print(tokenized_dataset, lm_dataset)

  #Training
  results_folder = f"{lang}-{task}-{model_name}-finetuned"
  trainer = create_trainer(model_name, lm_dataset, tokenizer, results_folder)

  eval_results = trainer.evaluate()
  print(f">>> Perplexity Before: {math.exp(eval_results['eval_loss'])}")

  trainer.train()

  eval_results = trainer.evaluate()
  print(f">>> Perplexity After: {math.exp(eval_results['eval_loss'])}")

  trainer.push_to_hub()
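
Perplexity is the exponential of the mean cross-entropy loss, which is why the loop above computes `math.exp(eval_results['eval_loss'])`. The following sketch makes the relationship concrete using hypothetical probabilities that a model assigns to the correct tokens; `perplexity_from_probs` is an illustrative helper, not part of the notebook.

```python
import math


def perplexity_from_probs(true_token_probs):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


# A model that gives the correct token 50% probability at every position
confident = perplexity_from_probs([0.5, 0.5, 0.5, 0.5])
# A model that gives the correct token only 1% probability at every position
uncertain = perplexity_from_probs([0.01, 0.01, 0.01, 0.01])
print(confident, uncertain)  # roughly 2 and roughly 100
```

A perplexity of about 2 can be read as the model hesitating between two equally likely candidates per masked token, while a perplexity of about 100 corresponds to hesitating between a hundred.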
  

The models chosen are the same ones used in the classification task and in the original paper.

In [None]:
model_name_de =  "bert-base-german-cased"
model_name_fr = "camembert-base"

## Results
The domain-adapted models showed a decrease in perplexity after training, as expected. They were then used for classification; the results obtained can be found in the accompanying notebook. Both models were pushed to the Hugging Face Hub and are available for the community to use.

### French

In [None]:
train_test("fr", "e1", model_name_fr)

Token indices sequence length is longer than the specified maximum sequence length for this model (524 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 2178
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1000
    })
}) DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1974
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 440
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 927
    })
})


Cloning https://huggingface.co/rodrigotuna/fr-e1-camembert-base-finetuned into local empty directory.


You're using a CamembertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


>>> Perplexity Before: 279.68091084805786




| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 3.2487 | 2.816881 |
| 2 | 2.6851 | 2.637446 |
| 3 | 2.6175 | 2.643066 |
| 4 | 2.5366 | 2.447523 |
| 5 | 2.4529 | 2.464737 |
| 6 | 2.4002 | 2.484573 |
| 7 | 2.3914 | 2.418699 |
| 8 | 2.3439 | 2.481911 |
| 9 | 2.2797 | 2.376931 |
| 10 | 2.3065 | 2.453571 |


>>> Perplexity After: 10.279166047945553


To https://huggingface.co/rodrigotuna/fr-e1-camembert-base-finetuned
   61c30d7..8d236a5  main -> main

To https://huggingface.co/rodrigotuna/fr-e1-camembert-base-finetuned
   8d236a5..8f7c618  main -> main



### German

In [None]:
train_test("de", "e1", model_name_de)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
        num_rows: 2806
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
        num_rows: 1000
    })
}) DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 3770
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 664
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1348
    })
})


Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Cloning https://huggingface.co/rodrigotuna/de-e1-bert-base-german-cased-finetuned into local empty directory.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


>>> Perplexity Before: 250.74822509940302




| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 3.3222 | 2.591698 |
| 2 | 2.7069 | 2.500781 |
| 3 | 2.5649 | 2.41643 |
| 4 | 2.4289 | 2.454169 |
| 5 | 2.3706 | 2.334143 |
| 6 | 2.3106 | 2.347946 |
| 7 | 2.2636 | 2.255965 |
| 8 | 2.2039 | 2.280394 |
| 9 | 2.1828 | 2.270794 |
| 10 | 2.1414 | 2.329805 |


>>> Perplexity After: 9.833652881575313


To https://huggingface.co/rodrigotuna/de-e1-bert-base-german-cased-finetuned
   d2572c2..2ad50fb  main -> main

To https://huggingface.co/rodrigotuna/de-e1-bert-base-german-cased-finetuned
   2ad50fb..d3c9b1e  main -> main

