# Domain Adaptation

In this notebook, we perform domain adaptation on the DistilBERT model using our dataset. Instead of fine-tuning directly on the downstream task, we first continue training the model on our own text with the masked language modeling (MLM) objective.

The idea is to adapt the model to the domain in this way, and then check whether it performs better than a model that only went through regular fine-tuning.

[Reference](https://towardsdatascience.com/fine-tuning-for-domain-adaptation-in-nlp-c47def356fd6)
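
To make the MLM objective concrete, here is a minimal NumPy sketch of the masking step that `DataCollatorForLanguageModeling` applies to each batch. It is an illustration under stated assumptions, not the library's implementation: the function name `mask_tokens`, its parameters, and the token ids in the usage lines are hypothetical. It follows the usual BERT recipe of selecting about 15% of the non-special tokens, replacing 80% of those with the mask token, 10% with a random token, and leaving 10% unchanged.

```python
import numpy as np


def mask_tokens(token_ids, special_mask, mask_id, vocab_size,
                mlm_probability=0.15, seed=0):
    """Illustrative MLM masking (hypothetical helper, not the library code)."""
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    special = np.asarray(special_mask, dtype=bool)
    inputs = token_ids.copy()
    labels = token_ids.copy()

    # Select roughly mlm_probability of the tokens, never the special ones.
    selected = (rng.random(token_ids.shape) < mlm_probability) & ~special

    # Unselected positions get label -100, which the loss ignores.
    labels[~selected] = -100

    # One draw per position decides between the 80/10/10 cases.
    roll = rng.random(token_ids.shape)
    to_mask = selected & (roll < 0.8)
    to_random = selected & (roll >= 0.8) & (roll < 0.9)

    inputs[to_mask] = mask_id
    inputs[to_random] = rng.integers(0, vocab_size, size=int(to_random.sum()))
    # The remaining 10% of selected positions keep their original token.
    return inputs, labels


# Assumed ids for illustration: 101/102 as [CLS]/[SEP], 103 as [MASK].
ids = [101, 7592, 2088, 2003, 2307, 102]
special = [1, 0, 0, 0, 0, 1]
inputs, labels = mask_tokens(ids, special, mask_id=103, vocab_size=30522)
print(inputs)
print(labels)
```

The model is trained to predict the original token only at positions whose label is not -100, so the loss comes entirely from the selected tokens.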

In [None]:
import multiprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from datasets import Dataset

from transformers import TrainingArguments, Trainer
from transformers import DistilBertForMaskedLM
from transformers import DistilBertTokenizer
from transformers import DataCollatorForLanguageModeling

In [None]:
df = pd.read_pickle("data/data_original.pkl")

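# Keep only a small random sample of the data to shorten training.
down_sample_percentage = 5

# A fixed random_state makes the sample reproducible.
df = df.sample(frac=down_sample_percentage / 100, random_state=1)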

print(df.shape)
df.head()

In [None]:
dataset = Dataset.from_pandas(df)

dataset = dataset.train_test_split(test_size=0.2, seed=42)

train_dataset = dataset['train']
test_dataset = dataset['test']
print(train_dataset)
print(test_dataset)

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')

In [None]:
def tokenize_function(examples):
    # Pad or truncate every example to the tokenizer's maximum length and keep
    # the special-tokens mask so the collator never masks special or padding tokens.
    return tokenizer(examples['text'], padding='max_length', truncation=True, return_special_tokens_mask=True)

# Drop the original columns so only the tokenizer outputs remain.
column_names = train_dataset.column_names

train_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=multiprocessing.cpu_count(),
    remove_columns=column_names
)

test_dataset = test_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=multiprocessing.cpu_count(),
    remove_columns=column_names
)

In [None]:
# Masks 15% of the tokens on the fly each time a batch is built.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./domain-model",
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # newer transformers releases rename this argument to eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()
trainer.save_model("./domain-model/distilbert-emotions")
tokenizer.save_pretrained("./domain-model/distilbert-emotions")

In [None]:
trainer.evaluate()
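
The evaluation loss of a masked language model is often easier to read as perplexity, which is the exponential of the mean cross-entropy loss. The sketch below assumes the dictionary returned by `trainer.evaluate()` contains an `eval_loss` entry. The helper name `perplexity_from_loss` and the example loss value are hypothetical and only illustrate the conversion.

```python
import math


def perplexity_from_loss(eval_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)


# Hypothetical metrics dict, shaped like the output of trainer.evaluate().
metrics = {"eval_loss": 2.0}
print(f"perplexity: {perplexity_from_loss(metrics['eval_loss']):.2f}")
```

Because the collator draws a new random mask for every batch, the loss and the perplexity can differ slightly between evaluation runs. Comparing this value before and after training gives a rough indication of how much the model has adapted to the domain.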