# Question 5

## Loading Data

In [1]:
import re

In [2]:
with open("data/wiki2.train.txt", "r") as file:
    wiki_train = file.read()

with open("data/examples.txt", "r") as file:
    example_test = file.read()

In [3]:
split_examples_test = [example.strip() for example in re.split(r"\n\n", example_test)]

## Tokenization

In [4]:
import spacy
from utils.tokenization import chunked_tokenization

In [5]:
nlp = spacy.load("xx_ent_wiki_sm")

In [6]:
spacy_train = chunked_tokenization(wiki_train, nlp)

spacy_ex_test = [
    chunked_tokenization(example, nlp) for example in split_examples_test
]

In [7]:
from transformers import GPT2TokenizerFast
from utils.tokenization import chunked_tokenization_gpt2

In [8]:
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

In [9]:
gpt2_train = chunked_tokenization_gpt2(wiki_train, gpt2_tokenizer)

gpt2_test = [
    chunked_tokenization_gpt2(example, gpt2_tokenizer)
    for example in split_examples_test
]

Token indices sequence length is longer than the specified maximum sequence length for this model (1134371 > 1024). Running this sequence through the model will result in indexing errors
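
The `chunked_tokenization` helpers come from `utils.tokenization`, and their bodies are not shown in this notebook. Chunking of this kind is commonly needed because spaCy rejects texts longer than `nlp.max_length` characters (1,000,000 by default). As a rough sketch of the general idea only, a long text can be cut at whitespace boundaries and tokenized piece by piece. The name `tokenize_in_chunks`, the `chunk_size` default, and the whitespace rule are all assumptions made for illustration, not the actual implementation used here.

```python
from typing import Callable, List


def tokenize_in_chunks(
    text: str,
    tokenize: Callable[[str], List[str]],
    chunk_size: int = 100_000,
) -> List[str]:
    """Hypothetical helper: tokenize `text` piece by piece.

    Each piece is at most `chunk_size` characters long and, where possible,
    ends just after a space or newline so that no word is split in two.
    """
    tokens: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last space or newline inside the window, if any.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1
        tokens.extend(tokenize(text[start:end]))
        start = end
    return tokens


# Toy usage with whitespace splitting standing in for a real tokenizer.
toy_tokens = tokenize_in_chunks("a bb ccc dddd", str.split, chunk_size=5)
```

With spaCy, the `tokenize` argument could be something like `lambda s: [t.text for t in nlp(s)]`.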


## Testing Laplace-Smoothed Models

In [10]:
from models.ngrams.laplace_ngrams import (
    calculate_laplace_perplexities,
)

In [11]:
for gpt2_test_example in gpt2_test:
    print(f"Testing Example:\n{gpt2_test_example}\n")
    gpt2_perplexities = calculate_laplace_perplexities(gpt2_train, gpt2_test_example)
    print("GPT-2 Perplexities:")
    for n_gram, perplexity in gpt2_perplexities.items():
        print(f"{n_gram}: {perplexity:.2f}")
    print("-" * 30)

Testing Example:
['01', '.', 'ĠBest', 'Ġknown', 'Ġfor', 'Ġdeveloping', 'Ġthe', 'Ġtheory', 'Ġof', 'Ġrelativity', ',', 'ĠEinstein', 'Ġalso', 'Ġmade', 'Ġimportant', 'Ġcontributions', 'Ġto', 'Ġquantum', 'Ġmechanics', ',', 'Ġand', 'Ġwas', 'Ġthus', 'Ġa', 'Ġcentral', 'Ġfigure', 'Ġin', 'Ġthe', 'Ġrevolutionary', 'Ġresh', 'aping', 'Ġof', 'Ġthe', 'Ġscientific', 'Ġunderstanding', 'Ġof', 'Ġnature', 'Ġthat', 'Ġmodern', 'Ġphysics', 'Ġaccomplished', 'Ġin', 'Ġthe', 'Ġfirst', 'Ġdecades', 'Ġof', 'Ġthe', 'Ġtwentieth', 'Ġcentury', '.']

GPT-2 Perplexities:
1-gram: 2227.85
2-gram: 4384.27
3-gram: 366052.29
7-gram: 2217862.66
------------------------------
Testing Example:
['02', '.', 'ĠEveryone', 'Ġlooks', 'Ġto', 'Ġthe', 'ĠStark', 'Ġbanners', ',', 'Ġwith', 'Ġtheir', 'Ġdire', 'wolf', 'Ġcrest', '-', 'of', '-', 'arms', '.', 'ĠWe', 'Ġsee', 'Ġtheir', 'Ġopinions', 'Ġabout', 'Ġthe', 'Ġp', 'ups', 'Ġchange', ',', 'Ġas', 'Ġthey', 'Ġcome', 'Ġto', 'Ġunderstand', 'Ġthe', 'Ġimport', 'Ġof', 'Ġthis', 'Ġo', 'men', '.']

GPT

In [12]:
for spacy_test_example in spacy_ex_test:
    print(f"Testing Example:\n{spacy_test_example}\n")
    spacy_perplexities = calculate_laplace_perplexities(spacy_train, spacy_test_example)
    print("SpaCy Perplexities:")
    for n_gram, perplexity in spacy_perplexities.items():
        print(f"{n_gram}: {perplexity:.2f}")
    print("-" * 30)

Testing Example:
['01', '.', 'Best', 'known', 'for', 'developing', 'the', 'theory', 'of', 'relativity', ',', 'Einstein', 'also', 'made', 'important', 'contributions', 'to', 'quantum', 'mechanics', ',', 'and', 'was', 'thus', 'a', 'central', 'figure', 'in', 'the', 'revolutionary', 'reshaping', 'of', 'the', 'scientific', 'understanding', 'of', 'nature', 'that', 'modern', 'physics', 'accomplished', 'in', 'the', 'first', 'decades', 'of', 'the', 'twentieth', 'century', '.']

SpaCy Perplexities:
1-gram: 1480.44
2-gram: 3564.39
3-gram: 311870.04
7-gram: 2032440.71
------------------------------
Testing Example:
['02', '.', 'Everyone', 'looks', 'to', 'the', 'Stark', 'banners', ',', 'with', 'their', 'direwolf', 'crest', '-', 'of', '-', 'arms', '.', 'We', 'see', 'their', 'opinions', 'about', 'the', 'pups', 'change', ',', 'as', 'they', 'come', 'to', 'understand', 'the', 'import', 'of', 'this', 'omen', '.']

SpaCy Perplexities:
1-gram: 2610.69
2-gram: 6350.61
3-gram: 441522.43
7-gram: 2099033.00
--
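
`calculate_laplace_perplexities` lives in `models.ngrams.laplace_ngrams` and is not reproduced in this notebook. For reference, the textbook add-one estimate is P(w | h) = (c(h, w) + 1) / (c(h) + V), where h is the preceding n-1 tokens and V is the vocabulary size, and perplexity is the exponential of the mean negative log-probability. The sketch below illustrates only that formula. The function name `laplace_perplexity`, the lack of padding symbols, and the choice of V as the number of distinct training tokens (with no unknown-token handling) are assumptions and may differ from the module used above.

```python
import math
from collections import Counter
from typing import Sequence


def laplace_perplexity(train: Sequence[str], test: Sequence[str], n: int) -> float:
    """Add-one smoothed n-gram perplexity of `test` (illustrative sketch)."""
    # Count every n-gram in the training tokens, plus its (n-1)-token context.
    positions = range(len(train) - n + 1)
    ngram_counts = Counter(tuple(train[i : i + n]) for i in positions)
    context_counts = Counter(tuple(train[i : i + n - 1]) for i in positions)
    vocab_size = len(set(train))  # assumption: V = distinct training tokens

    log_prob_sum = 0.0
    num_predictions = 0
    for i in range(len(test) - n + 1):
        ngram = tuple(test[i : i + n])
        context = ngram[:-1]
        # Add-one smoothing: unseen n-grams still get a non-zero probability.
        prob = (ngram_counts[ngram] + 1) / (context_counts[context] + vocab_size)
        log_prob_sum += math.log(prob)
        num_predictions += 1

    if num_predictions == 0:
        raise ValueError("test sequence is shorter than n")
    return math.exp(-log_prob_sum / num_predictions)
```

Called as `laplace_perplexity(gpt2_train, gpt2_test[0], 2)`, this would give a bigram perplexity under the assumptions above; it is not expected to reproduce the figures printed by the notebook.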

## Testing Fine-Tuned GPT-2 Model

In [13]:
import math

import torch
from tqdm import tqdm
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [14]:
tokenizer = GPT2Tokenizer.from_pretrained("models/gpt2/")
model = GPT2LMHeadModel.from_pretrained("models/gpt2/")

In [15]:
model.eval()
# device = torch.device("cuda")
# model.to(device)
print(model.device)

cpu


In [16]:
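# Longest slice handed to the tokenizer, and the step between slice starts.
# Both are applied to the raw string below, so they count characters, not tokens.
max_length = model.config.n_positions - 1
stride = 16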

In [17]:
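for split_example in split_examples_test:
    print(f"\nProcessing example: {split_example[:50]}...")
    # Reset the running totals for each example.
    total_loss = 0
    n = 0

    # Slide over the raw string: i, stride and max_length index characters,
    # not tokens, so each slice starts `stride` characters after the last one.
    for i in tqdm(range(0, len(split_example), stride), desc="Processing chunks"):
        encoded_chunk = tokenizer.encode(
            split_example[i : i + max_length], return_tensors="pt"
        )
        encoded_chunk = encoded_chunk.to(model.device)

        # Passing the inputs as labels makes the model return its mean
        # next-token cross-entropy loss for this slice.
        with torch.no_grad():
            outputs = model(encoded_chunk, labels=encoded_chunk)
            total_loss += outputs.loss.item()
            n += 1

    # Unweighted mean of the per-slice losses, then exponentiate.
    average_loss = total_loss / n
    perplexity = torch.exp(torch.tensor(average_loss)).item()
    print(f"\nPerplexity for the example: {perplexity:.2f}\n")
    print("-" * 60)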


Processing example: 01. Best known for developing the theory of relati...


Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:07<00:00,  2.58it/s]



Perplexity for the example: 42.22

------------------------------------------------------------

Processing example: 02. Everyone looks to the Stark banners, with thei...


Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:03<00:00,  3.46it/s]



Perplexity for the example: 223.58

------------------------------------------------------------

Processing example: 03. I do not like them in a house. I do not like t...


Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:03<00:00,  4.09it/s]



Perplexity for the example: 92.33

------------------------------------------------------------

Processing example: 04. I am Sam.  Sam I am.  I do not like green eggs...


Processing chunks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.40it/s]



Perplexity for the example: 508.01

------------------------------------------------------------

Processing example: 05. I am Sam.       Sam I am.       I do not like ...


Processing chunks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.08it/s]



Perplexity for the example: 291.43

------------------------------------------------------------

Processing example: 06. <s> I am Sam. </s> <s> Sam I am. </s> <s> I do...


Processing chunks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:02<00:00,  2.68it/s]



Perplexity for the example: 139.15

------------------------------------------------------------

Processing example: 07. <s>...


Processing chunks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.30it/s]



Perplexity for the example: 492.24

------------------------------------------------------------

Processing example: 08. Natural Language Processing (NLP) is a branch ...


Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:02<00:00,  5.62it/s]



Perplexity for the example: 176.98

------------------------------------------------------------

Processing example: 09. Common NLP tasks include question answering, t...


Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:05<00:00,  3.17it/s]



Perplexity for the example: 576.90

------------------------------------------------------------

Processing example: 10. The Steelers, whose history may be traced to a...


Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:10<00:00,  1.99it/s]



Perplexity for the example: 40.07

------------------------------------------------------------

Processing example: 11. Northwestern University (NU) is a private rese...


Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00,  2.46it/s]


Perplexity for the example: 47.82

------------------------------------------------------------
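
The loop above takes a plain mean of the per-slice losses, so a short slice counts as much as a long one. Because `outputs.loss` is itself a mean over the predicted tokens of a slice, one alternative is to weight each slice's loss by its number of predicted tokens before exponentiating. The sketch below shows only that aggregation step on plain numbers. `token_weighted_perplexity` and its arguments are hypothetical names, and this is not the method that produced the figures above.

```python
import math
from typing import Sequence


def token_weighted_perplexity(
    chunk_losses: Sequence[float], chunk_token_counts: Sequence[int]
) -> float:
    """Perplexity from per-chunk mean losses, weighted by predicted-token count."""
    if len(chunk_losses) != len(chunk_token_counts):
        raise ValueError("need one token count per chunk loss")
    total_tokens = sum(chunk_token_counts)
    if total_tokens == 0:
        raise ValueError("no predicted tokens")
    # Undo each chunk's mean to recover its summed negative log-likelihood.
    total_nll = sum(
        loss * count for loss, count in zip(chunk_losses, chunk_token_counts)
    )
    return math.exp(total_nll / total_tokens)


# Two made-up chunks: a long one with low loss and a short one with high loss.
losses = [2.0, 5.0]
counts = [90, 10]
weighted = token_weighted_perplexity(losses, counts)  # exp(230 / 100)
unweighted = math.exp(sum(losses) / len(losses))      # exp(7 / 2)
```

In the loop above, the count for a slice would be `encoded_chunk.size(1) - 1`, since the model predicts every token except the first.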





* The fine-tuned GPT-2 model performs significantly worse here than it did on the wiki2 data.
    * The examples differ in style or content from the training data.
    * The transformer might have memorized the structure of the wiki2 data, which would explain why its performance degrades here.
* The test set contains a variety of text samples, ranging from scientific discussion and literary excerpts to children's books and sports.
* Even so, fine-tuned GPT-2 reaches far lower perplexity than the n-gram models (42.22 on example 01, against 2227.85 for the best GPT-2-tokenized n-gram model), which suggests that it generalizes across contexts and styles better than they do.
* Both the SpaCy- and GPT-2-tokenized n-gram models exhibit very high perplexity, and it grows with n (on example 01 with GPT-2 tokens, from 2227.85 for 1-grams to 2217862.66 for 7-grams).
    * This is likely due to the sparsity of higher-order n-grams: long n-grams from the test examples rarely occur in the training data.
    * N-gram models are also limited in their ability to capture long-range dependencies.
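
The sparsity point can be checked directly by measuring what fraction of a test sequence's n-grams never occur in the training tokens. The helper below is a small sketch: `unseen_ngram_rate` is a hypothetical name, and the toy token lists stand in for `gpt2_train` or `spacy_train` and the tokenized examples.

```python
from typing import Sequence


def unseen_ngram_rate(train: Sequence[str], test: Sequence[str], n: int) -> float:
    """Fraction of n-grams in `test` that do not appear anywhere in `train`."""
    seen = {tuple(train[i : i + n]) for i in range(len(train) - n + 1)}
    test_ngrams = [tuple(test[i : i + n]) for i in range(len(test) - n + 1)]
    if not test_ngrams:
        raise ValueError("test sequence is shorter than n")
    unseen = sum(1 for ngram in test_ngrams if ngram not in seen)
    return unseen / len(test_ngrams)


# Toy data, invented for illustration only.
train_tokens = "i am sam . sam i am . i do not like green eggs".split()
test_tokens = "i do not like sam . i am green".split()
for n in (1, 2, 3):
    print(n, unseen_ngram_rate(train_tokens, test_tokens, n))
```

On real data the same call, for example `unseen_ngram_rate(gpt2_train, gpt2_test[0], 7)`, would show how much of the 7-gram perplexity comes from n-grams the model has never seen.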