Measure the perplexity (PPL) of GPT-2 models on the test split of the [wikitext-2](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/) dataset. Dataset license: [Creative Commons Attribution-ShareAlike 3.0 Unported](https://en.wikipedia.org/w/index.php?title=Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License&ref=blog.salesforceairesearch.com)

Perplexity metric for gpt-2 and other models:
https://paperswithcode.com/sota/language-modelling-on-wikitext-2

Language Models are Unsupervised Multitask Learners:
https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf


Let's download the wikitext-2 dataset:
```bash
python ../dataprep/prepare_wikitext2.py
```

This will download and create these files in the ../data/wikitext-2-raw/ folder:
- wiki.train.raw
- wiki.test.raw
- wiki.valid.raw

Although the extension is .raw, these are plain UTF-8 text files. We'll use wiki.test.raw, the test split.

In [13]:
from gptbench import Sample, GPT2TokensDataset

# load the model
ben = Sample(seed=0xA1BED0)

# init with the pretrained gpt-2 smallest model - 124M params
ben.init_pretrained('gpt2')

Initializing model from gpt2
Dataset: dummy 0 tokens
Dataset: loading uint16 tokens
Expanding initial dataset size of 1 (less than block_size+1) by 1025 times to size of 1025
Dataset train_path: dummy empty dataset, val_path: None, train_split: 0.9, vocab_size: 50257
Model params: 124.44M


In [2]:
test_dataset = GPT2TokensDataset(ben.model.block_size, data_path='../data/wikitext-2-raw/wiki.test.raw')

Dataset: encoding utf-8 to tokens


In [3]:
# measure model's perplexity from the test_dataset - the test split of wikitext-2,
# stride=-1 means we'll measure along non-overlapping blocks of block_size tokens
# if you get an out-of-memory exception, lower the max_batch_size param
ben.measure_perplexity(test_dataset, stride=-1, max_batch_size=16)

29.047134441702813

The GPT-2 paper, 'Language Models are Unsupervised Multitask Learners', lists PPL=29.41 for the smallest model, so our result of 29.05 is close.
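
For reference, here is a minimal sketch of how a perplexity value relates to per-token losses, assuming the standard definition: the exponential of the mean negative log-likelihood per token, measured in nats. The function name `perplexity` and its input list are illustrative only; how gptbench computes the value internally is not shown here.

```python
import math


def perplexity(token_nlls):
    """Perplexity as exp(mean negative log-likelihood per token).

    token_nlls: per-token losses in nats (natural log), one per scored token.
    """
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Sanity check: a model that spreads probability uniformly over the
# 50257-token GPT-2 vocabulary has a perplexity equal to the vocabulary size.
uniform_nll = math.log(50257)
print(perplexity([uniform_nll] * 10))
```

Under this definition, a perplexity of 29.05 corresponds to a mean loss of about 3.37 nats per token.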

Let's now try data that GPT-2 was not trained on: a public conference given in Portuguese (DIOGO ROSA MACHADO, year 1900):

In [14]:
pt_dataset = GPT2TokensDataset(ben.model.block_size, data_path='../data/alexherc1900.txt')

Dataset: encoding utf-8 to tokens


In [15]:
ben.measure_perplexity(pt_dataset, stride=-1, max_batch_size=16)

299.6833340903949

We get a much higher perplexity, as GPT-2 was not trained on anything similar to this text.

Let's now try with a bigger model:

In [4]:
# empty the CUDA cache memory
import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()

In [6]:
# init with the pretrained gpt-2 large model - 774M params
ben.init_pretrained('gpt2-large')

Initializing model from gpt2-large
Dataset: dummy 0 tokens
Dataset: loading uint16 tokens
Expanding initial dataset size of 1 (less than block_size+1) by 1025 times to size of 1025
Dataset train_path: dummy empty dataset, val_path: None, train_split: 0.9, vocab_size: 50257
Model params: 774.03M


In [11]:
# if you get an out-of-memory exception, lower the max_batch_size param
ben.measure_perplexity(test_dataset, stride=-1, max_batch_size=16)

18.08091968456662

The GPT-2 paper lists PPL=19.93 for the gpt2-large model and we obtained 18.08. Why the difference?

According to this post from one of the authors, the GPT-2 paper results were computed with a stride of 32, whereas we are using a stride of 1024 (the block size, selected by the stride=-1 param). That does not explain the gap, though: using stride=32 should give us an even lower perplexity, because the high losses from the beginning of each context window are not counted.

https://www.reddit.com/r/MachineLearning/comments/oye64h/r_struggling_to_reproduce_perplexity_benchmarks/

If the stride is not the cause, the difference may come from dataset formatting. For example, dataset entries are sometimes joined with "\n\n", and lines holding a single space in the original raw text are sometimes treated as empty lines (both happen [here](https://huggingface.co/docs/transformers/perplexity)).

PPL scores are hard to compare unless the evaluation conditions are exactly the same.
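
To make the formatting point concrete, here is a small sketch showing that the same lines can turn into three different evaluation texts depending on how they are assembled. The sample lines and the `text_variants` helper are made up for illustration; they are not taken from the dataset or from any library.

```python
def text_variants(raw_lines):
    """Build three plausible evaluation texts from the same raw lines."""
    # 1. Concatenate the lines exactly as they appear in the file.
    as_is = "".join(raw_lines)
    # 2. Turn whitespace-only lines (e.g. a single space) into empty lines.
    blank_as_empty = "".join(
        "\n" if line.strip() == "" else line for line in raw_lines
    )
    # 3. Treat every line as a dataset entry and join entries with "\n\n".
    joined = "\n\n".join(raw_lines)
    return {"as_is": as_is, "blank_as_empty": blank_as_empty, "joined": joined}


# Hypothetical lines in the style of a raw text file, including a line
# that holds only a single space.
sample = [" = Title = \n", " \n", " Some text . \n"]
for name, text in text_variants(sample).items():
    print(name, len(text), repr(text))
```

Each variant is a different character sequence, so it tokenizes differently and the model is scored on a different token stream.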

In [12]:
# let's measure with a 512 stride:
ben.measure_perplexity(test_dataset, stride=512, max_batch_size=16)

15.317235740722698

The perplexity is lower. The post quoted below explains why:

GPT-2 was evaluated with a small stride: 32. The reason it gives lower perplexity is because transformer LMs (...) have a finite context size so when you do eval stride length = context length your model is always having to predict some subset of tokens with little to no context (the ones at the beginning of each stride / eval window). It's much harder to predict these tokens (since you have no context!) and empirically they have much higher loss. By using a full context sized window that slides by a smaller stride you're then only evaling the tokens at the end which have ~ full context.
https://www.reddit.com/r/MachineLearning/comments/oye64h/r_struggling_to_reproduce_perplexity_benchmarks/
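
The effect described in the quote can be sketched by listing the evaluation windows for a given stride. This is a minimal illustration, assuming a scheme like the one in the Hugging Face perplexity guide, where each window scores only the tokens that no earlier window has scored. The `sliding_windows` helper is hypothetical and is not gptbench's implementation.

```python
def sliding_windows(n_tokens, block_size, stride):
    """Return (begin, end, n_scored, min_context) for each evaluation window.

    Assumed scheme: a window covers tokens [begin, end) and scores only the
    tokens not scored by an earlier window. min_context is the number of
    tokens available before the first newly scored token of the window.
    """
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + block_size, n_tokens)
        n_scored = end - prev_end        # tokens scored for the first time
        min_context = prev_end - begin   # context before the first of them
        windows.append((begin, end, n_scored, min_context))
        prev_end = end
        if end == n_tokens:
            break
    return windows


# Compare strides on an illustrative 4096-token stream with 1024-token blocks.
for stride in (1024, 512, 32):
    wins = sliding_windows(n_tokens=4096, block_size=1024, stride=stride)
    later_context = min(w[3] for w in wins[1:])
    print(f"stride={stride}: {len(wins)} windows, "
          f"min context after the first window: {later_context}")
```

With stride equal to the block size, every window starts scoring with no context at all. With stride 32, each window after the first scores only 32 new tokens, and each of them has at least 992 tokens of context. The price is many more forward passes: 97 windows instead of 4 in this example.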