#### CS6493 - Tutorial 4

## Language modeling with the pre-trained models

In this tutorial, we will show how to build your own language models on top of pre-trained models such as GPT and BERT. Specifically, we will discuss two popular language modeling schemes: **Causal Language Modeling (CLM)** and **Masked Language Modeling (MLM)**.

For CLM, the model predicts the next token in a given sentence considering only *the words that occur to its left*. So, we say CLM is `unidirectional`. We formally present it as,

$$p(w_t) = p(w_t|w_0, ..., w_{t-1}),$$

where $w_t$ is the token to be predicted. Models like GPT are pre-trained with CLM.

For MLM, we typically mask a portion of the tokens in a given sentence, and the model is expected to predict those masked tokens based on *all the other tokens* in the sentence. So, we say MLM is `bidirectional`. It is formally presented as,

$$p(w_t) = p(w_t|w_0, ..., w_{t-1}, w_{t+1}, ..., w_N),$$

where $N$ is the sentence length. The models like BERT are pre-trained with MLM.
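
To make the difference between the two conditioning sets concrete, here is a minimal sketch in plain Python. The helper names `clm_context` and `mlm_context` are made up for illustration and are not part of any library; words stand in for tokens.

```python
def clm_context(tokens, t):
    """Tokens a causal LM may condition on when predicting position t."""
    return tokens[:t]


def mlm_context(tokens, t):
    """Tokens a masked LM may condition on when position t is masked."""
    return tokens[:t] + tokens[t + 1:]


tokens = ["the", "cat", "sat", "on", "the", "mat"]
t = 2  # we want to predict "sat"

print(clm_context(tokens, t))  # ['the', 'cat']
print(mlm_context(tokens, t))  # ['the', 'cat', 'on', 'the', 'mat']
```

The causal model only sees the left context, while the masked model sees both sides of the hidden position.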

The key points of this tutorial are listed below,

- Different language modeling schemes, i.e., CLM and MLM;
- Introduction of `Huggingface` libraries;
- Build language models with different schemes;
- The evaluation metrics, i.e., `perplexity`.

### Huggingface - The AI community building the future.

**Huggingface** libraries such as `transformers` and `datasets` are among the most popular toolkits in the deep learning NLP community. With them, we can easily load publicly released datasets and pre-trained models. More details and examples can be found [here](https://huggingface.co/).

First, we need to install the `transformers` and `datasets` packages (we also upgrade `scipy` beforehand).

In [1]:
!pip install -U scipy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scipy
  Downloading scipy-1.10.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.7.3
    Uninstalling scipy-1.7.3:
      Successfully uninstalled scipy-1.7.3
Successfully installed scipy-1.10.0


In [2]:
!pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)

In [3]:
import os
os.environ["WANDB_DISABLED"] = "true"

### Preparing the dataset

We use the WikiText-2 dataset in this tutorial. You can load it directly with the `datasets` library.

In [4]:
from datasets import load_dataset
wiki_data = load_dataset('wikitext', 'wikitext-2-raw-v1')

Downloading and preparing dataset wikitext/wikitext-2-raw-v1 to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...

Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.

In [5]:
print(wiki_data)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


In [6]:
wiki_data["train"][3]

{'text': ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n'}

In [7]:
wiki_data['train'].features.items()

dict_items([('text', Value(dtype='string', id=None))])

To get a sense of what the data looks like, the following function shows some examples picked randomly from the dataset.

In [8]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [9]:
show_random_elements(wiki_data["train"])

Unnamed: 0,text
0,"The RN 's Operational Intelligence Centre selected the route and timing for the raid based on intelligence about the location of minefields and German recognition signals sourced from Enigma decrypts and knowledge of Luftwaffe patrols compiled by the Air Ministry 's Air Intelligence Branch . When all the plans had been pulled together and the timing worked out , the raid was expected to last no longer than two hours . The commandos and crew from Campbeltown would board the motor launches at the Old Mole jetty and then return to base . \n"
1,= = = Selection of artist = = = \n
2,"In 2005 the Abbey Theatre , Dublin , produced the play with an all @-@ male cast ; it also featured Wilde as a character — the play opens with him drinking in a Parisian café , dreaming of his play . The Melbourne Theatre Company staged a production in December 2011 with Geoffrey Rush as Lady Bracknell . \n"
3,"When Carey and Boyz II Men got together to record "" One Sweet Day , "" they didn 't have enough time to re @-@ unite and film a video . For this reason , a filming crew was present during the song 's recording , and filmed bits of Carey and Boyz recording the song . In an interview with Fred Bronson , Walter Afanasieff made the following statements regarding the video for "" One Sweet Day "" : \n"
4,"Severe hurricanes in 1926 and 1928 caused catastrophic damage and flooding from Lake Okeechobee that prompted the Army Corps of Engineers to build a dike around the lake . Further floods in 1947 prompted an unprecedented construction of canals throughout southern Florida . Following another population boom after World War II , and the creation of the Central and Southern Florida Flood Control Project , the Everglades was divided into sections separated by canals and water control devices that delivered water to agricultural and newly developed urban areas . However , in the late 1960s , following a proposal to construct a massive airport next to Everglades National Park , national attention turned from developing the land to restoring the Everglades . \n"
5,"Galveston is the seat and second @-@ largest city ( after League City , Texas ) of Galveston County in population . The Galveston County Justice Center , which houses all the county 's judicial functions as well as jail , is located on 59th street . The Galveston County Administrative Courthouse , the seat of civil and administrative functions , is located near the city 's downtown . Galveston is within the County Precinct 1 ; as of 2008 Patrick Doyle serves as the Commissioner of Precinct 1 . The Galveston County Sheriff 's Office operates its law enforcement headquarters and jail from the Justice Center . The Galveston County Department of Parks and Senior Services operates the Galveston Community Center . Galveston is located in District 23 of the Texas House of Representatives . As of 2008 , Craig Eiland represents the district . Most of Galveston is within District 17 of the Texas Senate ; as of 2008 Joan Huffman represents the district . A portion of Galveston is within District 11 of the Texas Senate ; as of 2008 Mike Jackson represents the district . Galveston is in Texas 's 14th congressional district and is represented by Republican Randy Weber as of 2012 . \n"
6,"Throughout the campaign , the Rio de Janeiro bid committee introduced its plans to the General Assemblies of all Associations of National Olympic Committees ( ANOC ) , making the bid 's first official presentation on October 11 , 2008 , to the Pan American Sports Organization ( PASO ) , in Acapulco , Mexico . On October 21 , the vision was presented to the Olympic Council of Asia ( OCA ) in Bali , Indonesia , followed by the European Olympic Committees ( EOC ) on November 21 , in Istanbul , Turkey . On March 26 , 2009 , Rio officials made a praised presentation during the 2009 SportAccord Convention in Denver , United States . For the first time , a world map of the past Olympic host cities was displayed , subsequently becoming an icon of Rio 's campaign due to the void in South America . On March 31 , 2009 , the Rio de Janeiro bid committee made its plea to the Oceania National Olympic Committees ( ONOC ) in Queenstown , New Zealand ; and on July 7 , to the Association of National Olympic Committees of Africa ( ANOCA ) in Abuja , Nigeria . The bid committee also attended many sporting events , such as the Australian and European Youth Olympic Festivals , the Commonwealth Youth Games , the Asian Youth Games and the Mediterranean Games , as well as the Aquatics , Athletics , Rowing and Judo World Championships . The three @-@ year campaign culminated with the beginning of the 13th Olympic Congress in Copenhagen , Denmark , which was officially opened in a ceremony held at the city 's Opera House , and after a lunch offered by Margrethe II , Queen of Denmark , to the heads of state of the four Candidate cities at the Amalienborg Palace . \n"
7,= = Mother @-@ in @-@ law to a queen = = \n
8,
9,


As we can see, some of the texts are a full paragraph of a Wikipedia article while others are just titles or empty lines.

### Causal Language Modeling
For causal language modeling, we are going to take all the texts in our dataset and concatenate them after they are tokenized. 



```
This is text A. this is text B. this is text C. this is text D....
```



Then we will split the concatenated text into examples of a fixed sequence length. With a chunk length of 3 (counting words as tokens for illustration), the model will receive chunks of contiguous text that may look like:

```
example 1: [This is text]
example 2: [A. this is]
example 3: [text B. this]
...
```

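The labels will be the same as the inputs, shifted one position to the left, so that each input token is trained to predict the token that follows it (`I` is the input, `O` the expected output):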

```
I: [BOS] This is text
O: This is text [EOS]
```
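
The chunking and shifting described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: whitespace-separated words stand in for real tokens, and the helpers `chunk` and `next_token_pairs` are hypothetical names, not library functions.

```python
def chunk(tokens, block_size):
    """Split a token list into full blocks, dropping the incomplete remainder."""
    usable = (len(tokens) // block_size) * block_size
    return [tokens[i:i + block_size] for i in range(0, usable, block_size)]


def next_token_pairs(block):
    """Pair every token with the token that follows it (input -> target)."""
    return list(zip(block[:-1], block[1:]))


words = "This is text A. this is text B. this is text C.".split()
blocks = chunk(words, block_size=3)

print(blocks[0])                    # ['This', 'is', 'text']
print(blocks[1])                    # ['A.', 'this', 'is']
print(next_token_pairs(blocks[0]))  # [('This', 'is'), ('is', 'text')]
```

Note that a block can start in the middle of one text and end in the middle of the next, which is exactly what happens with the real dataset.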


We use the `distilgpt2` model here for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=causal-lm) instead.

In [10]:
model_checkpoint = 'distilgpt2'

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `GPT2Tokenizer` class:

In [11]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(model_checkpoint, use_fast=True)


We can now call the tokenizer on all our texts. This is very simple, using the [map](https://huggingface.co/docs/datasets/process.html#map) function from the Datasets library. First we define a function that calls the tokenizer on our texts. Then we apply it to all the splits in our `wiki_data` object, using `batched=True` and 4 processes to speed up the preprocessing.

In [12]:
def tokenize_function(examples):
    return tokenizer(examples["text"])
tokenized_datasets = wiki_data.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

       


If we now look at an element of our datasets, we will see the texts have been replaced by the `input_ids` the model will need:

In [13]:
tokenized_datasets["train"][1]

{'input_ids': [796, 569, 18354, 7496, 17740, 6711, 796, 220, 198],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now for the harder part: we need to concatenate all our texts together then split the result in small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.

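We could use the maximum length the model was pretrained with (`tokenizer.model_max_length`) as the block size, but here we use a much smaller value, 64, to keep memory use and training time low.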

In [14]:
block_size = 64

def group_texts(examples):
    # Concatenate all texts in the batch, separately for each key
    # (input_ids and attention_mask).
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder that does not fill a whole block. We could pad it
    # instead if the model supported padding; customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # The labels are a copy of the inputs; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=10,
    num_proc=4,
)

        


In [15]:
lm_datasets

DatasetDict({
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 4214
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 35544
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3684
    })
})

First, note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library shift the labels by one position internally when computing the loss, so we don't need to do it manually.

Then we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [16]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable. Released in January 2011 in Japan, it is the third game in the Valkyria series. Employing the same fusion of tactical and real @-@ time gameplay as its predecessors, the story runs parallel to the first'
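
To see why a batched `map` can return a different number of rows than it receives, the following sketch runs the same grouping logic on a tiny hand-made batch. It is a toy stand-in, not the cell used above: `group_texts_toy` is a hypothetical name, the token ids are invented, and `block_size` is passed as an argument so that the snippet is self-contained.

```python
def group_texts_toy(examples, block_size):
    # Concatenate the lists of every column in the batch.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Keep only as many tokens as fill whole blocks.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [v[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }
    result["labels"] = [ids[:] for ids in result["input_ids"]]
    return result


# Four "texts" of different lengths, including an empty one.
batch = {
    "input_ids": [[1, 2, 3], [], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [], [1, 1, 1, 1, 1], [1, 1]],
}
out = group_texts_toy(batch, block_size=4)

print(len(batch["input_ids"]), "->", len(out["input_ids"]))  # 4 -> 2
print(out["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Four input rows become two output rows, and the last two tokens are dropped because they do not fill a block. The same effect explains why the row counts of `lm_datasets` differ from those of `wiki_data`.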

Now that the data has been prepared, we're ready to instantiate our `TrainingArguments` and `Trainer` (you can find more details about training arguments [here](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/trainer#transformers.TrainingArguments)). We first load the model, then define the training arguments and build the trainer:

In [17]:
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=100,
    per_device_eval_batch_size=100,
    save_strategy='epoch'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [None]:
trainer.train()

***** Running training *****
  Num examples = 35544
  Num Epochs = 3
  Instantaneous batch size per device = 100
  Total train batch size (w. parallel, distributed & accumulation) = 100
  Gradient Accumulation steps = 1
  Total optimization steps = 1068
  Number of trainable parameters = 81912576


Epoch,Training Loss,Validation Loss
1,No log,3.873383


***** Running Evaluation *****
  Num examples = 3684
  Batch size = 100
Saving model checkpoint to distilgpt2-finetuned-wikitext2/checkpoint-356
Configuration saved in distilgpt2-finetuned-wikitext2/checkpoint-356/config.json
Configuration saved in distilgpt2-finetuned-wikitext2/checkpoint-356/generation_config.json
Model weights saved in distilgpt2-finetuned-wikitext2/checkpoint-356/pytorch_model.bin
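
The step counts in the log above can be checked by hand. The sketch below assumes what the log reports (a single device, no gradient accumulation) and that the last, smaller batch of each epoch is kept rather than dropped.

```python
import math

num_examples = 35544   # "Num examples" in the training log
batch_size = 100       # per-device batch size, one device
num_epochs = 3         # "Num Epochs" in the training log

# The last batch of each epoch is smaller but still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 356 1068
```

This matches both the reported total of 1068 optimization steps and the name of the first checkpoint, `checkpoint-356`, which is saved at the end of the first epoch.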


Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:



In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
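
Perplexity is the exponential of the average negative log-likelihood per token, which is why taking `math.exp` of the evaluation loss (a mean cross-entropy) gives it directly. The following sketch illustrates the definition on hand-made probabilities; the helper `perplexity` is a hypothetical name used only here.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that always spreads its belief evenly over 4 choices
# is as "perplexed" as a fair 4-way guess.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0

# A more confident model has a lower perplexity.
print(perplexity([0.9, 0.8, 0.95]) < perplexity([0.25, 0.25, 0.25]))  # True
```

As a rough reading of the training log, the validation loss of 3.873383 after the first epoch corresponds to a perplexity of roughly 48.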

### Masked Language Modeling
For masked language modeling (MLM) we are going to use the same preprocessing as before for our dataset with one additional step: we will randomly mask some tokens (by replacing them with **[MASK]**) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens).



We will use the `distilroberta-base` model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=masked-lm) instead:



In [None]:
model_checkpoint = "distilroberta-base"

We need to update our tokenizer to use the checkpoint we just picked, and this time we apply the tokenization function with `truncation=True`:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenized_datasets = wiki_data.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

And like before, we group texts together and chunk them into samples of length `block_size`.


In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

The rest is very similar to what we had, with two exceptions. First we use a model suitable for masked LM:


In [None]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

We redefine our TrainingArguments:

In [None]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.05,
    # per_device_train_batch_size=64,
    save_strategy='epoch'
)

Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them into tensors. In the previous example, we had nothing special to do, so we just used the default for this argument. Here we want to do the random masking. We could do it as a pre-processing step (like the tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.



To do this masking for us, the library provides a DataCollatorForLanguageModeling. We can adjust the probability of the masking:

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
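
To illustrate what random masking at batching time means, here is a simplified sketch in plain Python. It is not the library's implementation: `mask_tokens` and `MASK_ID` are hypothetical, every selected token is replaced by the mask id (the real collator, to our knowledge, also leaves some selected tokens unchanged or swaps them for random tokens), and the label `-100` is used as the "ignore this position" value that the loss skips.

```python
import random

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss
MASK_ID = 0          # hypothetical id of the mask token


def mask_tokens(input_ids, mask_token_id, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) with each token masked independently."""
    masked, labels = [], []
    for tok in input_ids:
        if rng.random() < mlm_probability:
            masked.append(mask_token_id)   # hide the token from the model
            labels.append(tok)             # ...and ask the model to recover it
        else:
            masked.append(tok)             # token stays visible
            labels.append(IGNORE_INDEX)    # no prediction needed here
    return masked, labels


ids = list(range(100, 120))  # invented token ids
rng = random.Random(0)

# Calling the function again (as happens at every epoch) draws a fresh mask.
first_inputs, first_labels = mask_tokens(ids, MASK_ID, rng=rng)
second_inputs, second_labels = mask_tokens(ids, MASK_ID, rng=rng)
print(first_inputs)
print(second_inputs)
```

Because the mask is drawn when the batch is built, the same sentence is generally masked at different positions each time it is seen.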

Then we just have to pass everything to `Trainer`; calling `trainer.train()` on the result, as in the causal example, begins training:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

### **Practice**

*   Please train your MLM model and try more experimental settings and hyperparameters in `TrainingArguments`.

*   Please evaluate your model on the validation set.

*   **Is there a way to unify the unidirectional, bidirectional, and sequence to sequence language modeling?**