# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 7: Generating Texts with Transformers</font>

# <font color="#003660">Notebook 3: Domain Adaptation of a Causal Language Model</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... will be able to fine-tune a causal language model on your own data, which is useful for training decoder models.
    </font>
</div>
</center>
</p>

The content of this notebook is heavily inspired by these excellent sources:


*   Tunstall et al. (2021): Natural Language Processing with Transformers. O'Reilly. https://www.oreilly.com/library/view/natural-language-processing/9781098103231/
*   Hugging Face (2021): Transformer Models - Hugging Face Course. https://huggingface.co/course/



# Recall: What is a Causal Language Model?

In the previous notebook we domain-adapted a **masked language model**, where the task is to predict a missing token in a sequence of tokens. This training task is useful **for training encoder models**.

<center><img width=300 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/mlm.png"/><br></center>

In this notebook, we will domain-adapt a **causal language model**, whose task is to predict the next token in a sequence of tokens. This training task is useful **for training decoder models**.

<center><img width=400 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/clm.png"/><br></center>
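
As a minimal sketch of this objective, the snippet below turns a token sequence into (context, next token) training pairs. The helper name `next_token_pairs` and the whitespace split are illustrative assumptions; the actual model works on subword token IDs, not on words.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: predict tokens[t + 1] from tokens[: t + 1]."""
    return [(tokens[: t + 1], tokens[t + 1]) for t in range(len(tokens) - 1)]


# Toy example with whitespace "tokens" (an assumption made for readability).
tokens = "Bob and Clara are great".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Every position in the sequence yields one prediction target, so a single sequence of `n` tokens provides `n - 1` training signals.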

# Import Packages

In [None]:
!pip install transformers[sentencepiece]
!pip install datasets
!pip install accelerate -U

In [None]:
import pandas as pd
import numpy as np
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from transformers import TrainingArguments
from transformers import Trainer

# Load Pre-trained Model

First, we load a model for causal language modeling and a corresponding tokenizer from the model hub.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

In [None]:
model_name = "distilgpt2"

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Testdrive the Model 🚗

In [None]:
input_txt = "Bob and Clara are great"

In [None]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_length=64, do_sample=True)
print(tokenizer.decode(output[0]))

# Prepare a Dataset for Domain Adaptation

The following data preparation steps are the same as for masked language modeling.

In [None]:
imdb_dataset = load_dataset("imdb")
imdb_dataset

In [None]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    return result

In [None]:
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

In [None]:
chunk_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
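
To see what this grouping does, here is a simplified sketch on a toy input. The helper `chunk_token_ids` is a hypothetical stand-in that handles only the token IDs (not the attention mask) and uses a tiny chunk size so the result is easy to inspect.

```python
def chunk_token_ids(sequences, chunk_size):
    """Concatenate token-ID lists and cut them into equal-sized chunks."""
    # Flatten all sequences into one long list of token IDs
    flat = [tok for seq in sequences for tok in seq]
    # Drop the remainder that does not fill a complete chunk
    usable = (len(flat) // chunk_size) * chunk_size
    chunks = [flat[i : i + chunk_size] for i in range(0, usable, chunk_size)]
    # Labels start out as a copy of the inputs
    return {"input_ids": chunks, "labels": [c.copy() for c in chunks]}


# Three "documents" of different lengths (made-up token IDs)
toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
grouped = chunk_token_ids(toy, chunk_size=4)
print(grouped["input_ids"])
print(grouped["labels"])
```

Note that chunks can span document boundaries and that the trailing token `9` is dropped because it does not fill a whole chunk.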

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

In [None]:
lm_datasets["train"][1]["input_ids"][0:10]

In [None]:
lm_datasets["train"][1]["labels"][0:10]

# Domain-Adapt with Trainer API

Let's draw a sample of the original dataset so that training doesn't take too long.

In [None]:
train_size = 10000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)

downsampled_dataset

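For causal language modeling, we don't need a special data collator: all chunks already have the same length, so the `Trainer`'s default collator is sufficient. The labels are a plain copy of the `input_ids`. When computing the loss, the model shifts them by one position internally, so that the task is to predict the token at timestep `t+1` using all tokens up to `t`.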
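
The shift can be sketched in plain Python as follows. This is an illustration of the idea, not the library's actual implementation; the function name `shifted_next_token_loss` and the toy logits are assumptions.

```python
import math


def shifted_next_token_loss(logits, labels):
    """Mean cross-entropy where the logits at position t are scored against labels[t + 1]."""
    shift_logits = logits[:-1]  # the last position has no next token to predict
    shift_labels = labels[1:]   # the first token is never a prediction target
    total = 0.0
    for scores, target in zip(shift_logits, shift_labels):
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]  # equals -log softmax(scores)[target]
    return total / len(shift_labels)


# Toy vocabulary of 3 tokens; the labels are simply a copy of the input IDs.
input_ids = [0, 2, 1]
labels = list(input_ids)
uniform_logits = [[0.0, 0.0, 0.0] for _ in input_ids]
loss = shifted_next_token_loss(uniform_logits, labels)
print(f"loss = {loss:.4f}")
```

With uniform logits over three tokens, the loss equals `log(3)`, the value expected from a model that guesses at random.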

In [None]:
batch_size = 32
logging_steps = len(downsampled_dataset["train"]) // batch_size

training_args = TrainingArguments(
    output_dir=f"{model_name}-clm-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=logging_steps,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"]
)

First, compute the perplexity of the pre-trained model before domain adaptation.

In [None]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
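
Perplexity is the exponential of the mean per-token cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. The following sketch computes it directly from the probabilities a model assigned to the correct next tokens; the function name `perplexity` and the probability values are made up for illustration.

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the correct tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities assigned to the correct next tokens
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(perplexity([0.5, 0.1, 0.9]))
```

A model that always assigns probability 0.25 to the correct token has a perplexity of about 4: it is as uncertain as if it were choosing uniformly among four tokens. Lower values are better.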

Perform domain adaptation!

In [None]:
trainer.train()

Calculate the perplexity of the domain-adapted model.

In [None]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

# Testdrive the Domain-adapted Model 🛫

Let's see if the text generated by the domain-adapted model differs from the text generated by the original model.

In [None]:
input_txt = "Bob and Clara are great"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_length=64, do_sample=True)
print(tokenizer.decode(output[0]))