# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 5: Decoder-only Models</font>

# <font color="#003660">Notebook 2: Domain Adaptation of a Masked Language Model</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... are able to adapt a pre-trained masked language model to your own data, which is a useful first step before fine-tuning it for a downstream task.
    </font>
</div>
</center>
</p>

The content of this notebook is heavily inspired by these excellent sources:


*   Tunstall et al. (2021): Natural Language Processing with Transformers. O'Reilly. https://www.oreilly.com/library/view/natural-language-processing/9781098103231/
*   Hugging Face (2021): Transformer Models - Hugging Face Course. https://huggingface.co/course/



# How to Fine-tune a Masked Language Model?

For many NLP applications, you can simply take a pre-trained model from the Hugging Face Hub and fine-tune it directly on your data for the task at hand (e.g., sentiment analysis). This approach will usually produce good results, provided that the corpus used for pretraining is not too different from the corpus used for fine-tuning.

However, if your dataset is very different from the dataset used for pre-training, this approach might not be optimal. In such cases, you can often improve performance on downstream tasks by first adapting the *language model* (not the model for the actual task of interest!) to in-domain data.

The figure below illustrates this process, which was first proposed by [Howard and Ruder in 2018](https://arxiv.org/abs/1801.06146).

<center><img width=600 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/ulmfit.png"/><br></center>

In this notebook, we go through this process for domain adaptation of a [masked language model](https://youtu.be/mqElG5QJWUg).

# Import Packages

In [1]:
# Quote the extras specifier so that shells such as zsh do not treat the brackets as a glob pattern
!pip install "transformers[sentencepiece]"
!pip install datasets
!pip install accelerate -U


In [2]:
import pandas as pd
import numpy as np
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling
from transformers import TrainingArguments
from transformers import Trainer

# Load Pre-trained Model

First, we load a model for masked language modeling and the corresponding tokenizer from the model hub.

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'

In [4]:
model_name = "distilbert-base-uncased"

In [5]:
model = AutoModelForMaskedLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Testdrive the Model 🚗

Let's see what missing words the pre-trained model generates.

In [6]:
text = "This is a [MASK] car."

In [7]:
input_ids = tokenizer(text, return_tensors="pt").to(device)
input_ids

{'input_ids': tensor([[ 101, 2023, 2003, 1037,  103, 2482, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [8]:
token_logits = model(**input_ids).logits
token_logits

tensor([[[ -5.5816,  -5.5736,  -5.5748,  ...,  -4.9019,  -4.7247,  -2.9021],
         [ -9.7927,  -9.8047,  -9.8061,  ...,  -8.9532,  -7.9770,  -5.9507],
         [-11.8055, -11.8006, -11.6644,  ...,  -9.4562,  -7.6977,  -7.7933],
         ...,
         [-10.0483, -10.1558, -10.1271,  ...,  -7.8649,  -8.1916,  -6.7797],
         [-11.8359, -11.7799, -11.8815,  ...,  -9.4028,  -9.7312,  -6.9738],
         [ -9.0610,  -9.0381,  -8.9830,  ...,  -8.2114,  -7.9935,  -4.0813]]],
       grad_fn=<ViewBackward0>)

In [9]:
token_logits.shape

torch.Size([1, 8, 30522])

Next, we locate the [MASK] token in the input and retrieve its logits (one score per vocabulary entry). We then pick the [MASK] candidates with the highest logits.

In [10]:
mask_token_index = torch.where(input_ids["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
mask_token_logits

tensor([[-4.2953, -4.3350, -4.3144,  ..., -3.4014, -4.6056, -3.8526]],
       grad_fn=<IndexBackward0>)

In [11]:
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
top_5_tokens

[4145, 9233, 5830, 2998, 9542]

Replace the [MASK] with each of the top candidates.

In [12]:
for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a concept car.'
'>>> This is a compact car.'
'>>> This is a cable car.'
'>>> This is a sports car.'
'>>> This is a luxury car.'
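
The logits above are unnormalized scores. To read them as probabilities, we can pass them through a softmax before ranking the candidates. The following is a minimal, dependency-free sketch of that idea; the five-word vocabulary and the logit values are made up for illustration and do not come from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, vocab, k):
    """Return the k most probable (token, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and made-up logits for the [MASK] position
toy_vocab = ["sports", "luxury", "banana", "compact", "the"]
toy_logits = [2.0, 1.5, -3.0, 1.8, 0.1]

for token, prob in top_k(toy_logits, toy_vocab, k=3):
    print(f"{token:>8}: {prob:.3f}")
```

Because the softmax is monotonic, ranking by probability gives the same order as ranking by logit. This is why the notebook can apply `torch.topk` directly to the logits.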


# Prepare a Dataset for Domain Adaptation

Now let's adapt the model to domain-specific texts. We will use the well-known IMDB movie reviews dataset for this purpose.

In [13]:
imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [14]:
imdb_dataset["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

Tokenize the texts and remove unneeded columns.

In [15]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    return result

In [16]:
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors



DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [17]:
tokenized_datasets["train"][0]

{'input_ids': [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026,
  2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009,
  2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036,
  2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012,
  1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023,
  2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000,
  6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870,
  1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996,
  5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315,
  14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166,
  1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015,
  2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779,
  25430, 14728, 2245,


For masked language modeling, a [common preprocessing step](https://youtu.be/8PmhEIXhBvI) is to concatenate all the tokenized samples and then split the result into chunks of equal length. This way, we avoid the usual padding and truncation of individual samples, and the model sees the same amount of context in every training example.

The function below, taken from https://huggingface.co/course/chapter7/3?fw=pt, does exactly this, and some other preprocessing steps.

In [18]:
chunk_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size tokens
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

In [19]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 122957
    })
})
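
To see what the concatenate-and-chunk step does, it helps to run the same logic on a tiny input. The sketch below re-implements the idea with a hypothetical helper `group_texts_toy` and three made-up "reviews"; the token IDs are placeholders, not real vocabulary entries.

```python
def group_texts_toy(examples, chunk_size):
    """Concatenate tokenized samples and cut them into equal-sized chunks."""
    # Flatten the list of token lists for every column
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # The labels start out as a copy of the inputs
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result

# Three made-up samples with 3, 5, and 2 token IDs (10 tokens in total)
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

chunks = group_texts_toy(toy_batch, chunk_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that a chunk can span the boundary between two samples, and that the leftover tokens (here 9 and 10) are discarded. This is also why the number of rows changes: the 25,000 training reviews above turn into 61,291 chunks of 128 tokens.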

During the above preprocessing, we added a new column `labels` to the dataset. At this point, the labels are simply a copy of the token IDs of the input sequence. As you will see shortly, during training we will replace some IDs of the input sequences with [MASK]. After the replacement, the labels column will still contain the "truth".

In [20]:
lm_datasets["train"][1]["input_ids"][0:10]

[2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142]

In [21]:
lm_datasets["train"][1]["labels"][0:10]

[2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142]

# Domain Adaptation with Trainer API

To replace some input tokens with [MASK], we can use the `DataCollatorForLanguageModeling` class, which performs the replacement on the fly during training.

In [22]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
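
Conceptually, the collator selects roughly 15% of the tokens as prediction targets. To our knowledge, it follows the BERT recipe by default: most selected tokens become [MASK], a few are swapped for a random token, and a few are left unchanged. The following pure-Python sketch illustrates this rule under those assumptions; it is not the library's actual implementation. The IDs 101, 102, and 103 and the vocabulary size 30522 are taken from the outputs earlier in this notebook, and `mask_tokens` is a hypothetical helper.

```python
import random

MASK_ID = 103             # [MASK], as seen in the tokenizer output above
SPECIAL_IDS = {101, 102}  # [CLS] and [SEP] are never selected
VOCAB_SIZE = 30522        # last dimension of the logits tensor above
IGNORE_INDEX = -100       # label value that the loss function skips

def mask_tokens(input_ids, mlm_probability=0.15, seed=0):
    """Simplified sketch of BERT-style masking (assumed 80/10/10 rule)."""
    rng = random.Random(seed)
    masked_inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, token_id in enumerate(input_ids):
        if token_id in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue  # position is not selected as a prediction target
        labels[pos] = token_id  # remember the true token
        roll = rng.random()
        if roll < 0.8:
            masked_inputs[pos] = MASK_ID                   # replace with [MASK]
        elif roll < 0.9:
            masked_inputs[pos] = rng.randrange(VOCAB_SIZE)  # random token
        # otherwise keep the original token
    return masked_inputs, labels

ids = [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 102]
masked, labels = mask_tokens(ids, mlm_probability=0.5, seed=1)
print(masked)
print(labels)
```

Only positions whose label differs from `-100` contribute to the loss, so the model is trained to recover the selected tokens and nothing else. Since the selection is random, the same chunk is masked differently every time it is drawn.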

In [23]:
samples = [lm_datasets["train"][i] for i in range(2)]

for chunk in data_collator(samples)["input_ids"]:
  print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] i rented i [MASK] curious - yellow from my video [MASK] because of all the planets that surrounded it when it was first released in 1967. i also heard that [MASK] first it was seized by u. s. customs if it ever [MASK] to enter this country, therefore being a fan of filmsculus [MASK] controversial [MASK] i really had to see this for myself. [MASK] br / > [MASK] [MASK] / > [MASK] plot is centered around a young swedish drama [MASK] named [MASK] [MASK] wants to learn everything she can about life [MASK] in particular she wants to focus [MASK] attentions to [MASK] some sort of documentary on what the average swede thought about certain political issues such'

'>>> as the vietnam war and race issues in the united states. [MASK] between asking [MASK] and ordinary denizens of stockholm about their opinions on politics, she has sex with her [MASK] teacher, [MASK], and married men. < br / > < br / > what kills [MASK] about i am curious - yellow is that 40 years ago, [MASK] [MASK] co

Let's downsample our dataset so that we don't have to wait too long.

In [24]:
train_size = 10000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)

downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

Now we can finally start fitting our model with the Trainer API.

In [25]:
batch_size = 128
logging_steps = len(downsampled_dataset["train"]) // batch_size

training_args = TrainingArguments(
    output_dir=f"{model_name}-mlm-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=torch.cuda.is_available(),  # mixed precision requires a CUDA GPU
    logging_steps=logging_steps,
)

In [26]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
)

Before we start, we calculate the original model's (pre-trained, but not domain-adapted) [perplexity](https://youtu.be/NURcDHhYe98) as a benchmark. 

In [None]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
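
Perplexity is the exponential of the average cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. The small sketch below computes it by hand from the probabilities a model assigns to the correct tokens; the probability values are made up for illustration.

```python
import math

def perplexity(true_token_probs):
    """Exponential of the average negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    mean_loss = sum(nll) / len(nll)  # plays the role of eval_loss
    return math.exp(mean_loss)

# A model that guesses uniformly among 4 tokens is "4-ways confused"
print(f"{perplexity([0.25, 0.25, 0.25, 0.25]):.2f}")  # 4.00

# A more confident model gets a perplexity closer to 1
print(f"{perplexity([0.9, 0.8, 0.95]):.2f}")
```

Lower is better, and a perplexity of 1 would mean that the model predicts every masked token with certainty. Because the masking is random, the value can vary slightly from one call of `trainer.evaluate()` to the next.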

In addition, let's generate the top-5 most probable words for a given context (the code below is a copy&paste from above).

In [None]:
text = "This [MASK] is simply great."
input_ids = tokenizer(text, return_tensors="pt").to(model.device)  # the Trainer may have moved the model to another device
token_logits = model(**input_ids).logits
mask_token_index = torch.where(input_ids["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

Perform the training!

In [None]:
trainer.train()

Calculate perplexity again.

In [None]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

And let's see what missing tokens our adapted model predicts.

In [None]:
text = "This [MASK] is simply great."
input_ids = tokenizer(text, return_tensors="pt").to(model.device)  # the Trainer may have moved the model to another device
token_logits = model(**input_ids).logits
mask_token_index = torch.where(input_ids["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")