# Question 1

## Reading Wiki Data

In [1]:
with open("data/wiki2.train.txt", "r") as file:
    wiki_train = file.read()

with open("data/wiki2.test.txt", "r") as file:
    wiki_test = file.read()

with open("data/wiki2.valid.txt", "r") as file:
    wiki_valid = file.read()

In [2]:
# first 100 characters
wiki_train[0:100]

' \n = Valkyria Chronicles III = \n \n Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 ,'

## spaCy Tokenizer

In [3]:
import spacy
from utils.tokenization import chunked_tokenization

In [4]:
nlp = spacy.load("xx_ent_wiki_sm")

`xx_ent_wiki_sm` is spaCy's small multi-language pipeline. It is trained on Wikipedia and supports named entity recognition across multiple languages.

In [5]:
spacy_train = chunked_tokenization(wiki_train, nlp)
spacy_test = chunked_tokenization(wiki_test, nlp)
spacy_valid = chunked_tokenization(wiki_valid, nlp)
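
The helper `chunked_tokenization` lives in `utils.tokenization` and is not shown in this notebook. Chunking is likely needed because spaCy pipelines refuse texts longer than `nlp.max_length` (1,000,000 characters by default). The following is only a minimal sketch of how such a helper could work, not the actual implementation: the names `iter_chunks` and `chunked_tokenization_sketch`, the `tokenize` callable, and the default chunk size are all assumptions.

```python
from typing import Callable, Iterator, List


def iter_chunks(text: str, chunk_size: int) -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most `chunk_size` characters.

    Where possible the cut is placed just after a space so words stay intact.
    """
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1
        yield text[start:end]
        start = end


def chunked_tokenization_sketch(
    text: str,
    tokenize: Callable[[str], List[str]],
    chunk_size: int = 100_000,  # hypothetical default
) -> List[str]:
    """Tokenize `text` chunk by chunk and concatenate the token lists."""
    tokens: List[str] = []
    for chunk in iter_chunks(text, chunk_size):
        tokens.extend(tokenize(chunk))
    return tokens


# Stand-in tokenizer for illustration; with spaCy one might pass something like
# lambda s: [t.text for t in nlp.tokenizer(s)] instead.
demo = chunked_tokenization_sketch("a b c d e f", str.split, chunk_size=4)
print(demo)
```

One consequence of this design is that whitespace tokens spanning a chunk boundary can come out differently than they would for the unchunked text.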

After and before tokenization (the first 20 tokens, then the first 100 raw characters):

In [6]:
spacy_train[0:20]

[' \n ',
 '=',
 'Valkyria',
 'Chronicles',
 'III',
 '=',
 '\n \n ',
 'Senjō',
 'no',
 'Valkyria',
 '3',
 ':',
 '<',
 'unk',
 '>',
 'Chronicles',
 '(',
 'Japanese',
 ':',
 '戦場のヴァルキュリア3']

In [7]:
wiki_train[0:100]

' \n = Valkyria Chronicles III = \n \n Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 ,'

## Pre-trained `GPT2TokenizerFast`

In [8]:
from transformers import GPT2TokenizerFast
from utils.tokenization import chunked_tokenization_gpt2


In [9]:
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

In [10]:
gpt2_train = chunked_tokenization_gpt2(wiki_train, gpt2_tokenizer)
gpt2_valid = chunked_tokenization_gpt2(wiki_valid, gpt2_tokenizer)
gpt2_test = chunked_tokenization_gpt2(wiki_test, gpt2_tokenizer)

Token indices sequence length is longer than the specified maximum sequence length for this model (1134371 > 1024). Running this sequence through the model will result in indexing errors


In [11]:
gpt2_train[0:20]

['Ġ',
 'Ċ',
 'Ġ=',
 'ĠV',
 'alky',
 'ria',
 'ĠChronicles',
 'ĠIII',
 'Ġ=',
 'Ġ',
 'Ċ',
 'Ġ',
 'Ċ',
 'ĠSen',
 'j',
 'Åį',
 'Ġno',
 'ĠV',
 'alky',
 'ria']

* `Ġ` marks a space before the token in the original text. GPT-2's byte-level byte-pair encoding (BPE) displays the space byte this way, which distinguishes tokens that start a word from subwords that occur in the middle of a word.
* `Ċ` represents a newline character in the text.
* Rarer words are split into subword pieces: "Valkyria" becomes `ĠV`, `alky`, `ria`, while the more common "Chronicles" stays whole as `ĠChronicles`. Non-ASCII characters are shown as their UTF-8 bytes, which is why "ō" in "Senjō" appears as `Åį`.
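
The odd-looking characters in the GPT-2 tokens come from a fixed byte-to-character table. The sketch below rebuilds that table from scratch instead of importing it from `transformers`, so the function name `bytes_to_unicode_sketch` and the helper `show` are assumptions made for illustration. It should reproduce the symbols seen in the output above.

```python
def bytes_to_unicode_sketch():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character:
    # '!'..'~', then 0xA1..0xAC and 0xAE..0xFF from Latin-1.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte (space, newline, control bytes, ...) is shifted
    # to the code points starting at 256, in ascending byte order.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


table = bytes_to_unicode_sketch()


def show(text):
    """Render `text` the way byte-level BPE tokens are displayed."""
    return "".join(table[b] for b in text.encode("utf-8"))


print(repr(show(" ")), repr(show("\n")), repr(show("Senj\u014d")))
```

The space byte lands on `Ġ`, the newline byte on `Ċ`, and the two UTF-8 bytes of "ō" on `Å` and `į`.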

## Differences

In [12]:
untokenized_test = wiki_test[0:1000].split(" ")[0:200]

In [13]:
print(f"{'Untokenized':<30} | {'Spacy Tokens':<30} | {'GPT-2 Tokens':<30}")
print(f"{'-'*30}-+-{'-'*30}-+-{'-'*30}")

for i in range(200):
    untokenized = repr(untokenized_test[i]) if i < len(untokenized_test) else ""
    spacy_token = repr(spacy_test[i]) if i < len(spacy_test) else ""
    gpt2_token = repr(gpt2_test[i]) if i < len(gpt2_test) else ""

    untokenized = untokenized.strip("'\"")
    spacy_token = spacy_token.strip("'\"")
    gpt2_token = gpt2_token.strip("'\"")

    print(f"{untokenized:<30} | {spacy_token:<30} | {gpt2_token:<30}")

Untokenized                    | Spacy Tokens                   | GPT-2 Tokens                  
-------------------------------+--------------------------------+-------------------------------
                               |  \n                            | Ġ                             
\n                             | =                              | Ċ                             
=                              | Robert                         | Ġ=                            
Robert                         | <                              | ĠRobert                       
<unk>                          | unk                            | Ġ<                            
=                              | >                              | unk                           
\n                             | =                              | >                             
\n                             | \n \n                          | Ġ=                            
Robert                        

Some key differences we can see:

1. **Granularity**:
    1. spaCy produces word-like tokens that align closely with the words and punctuation in the text. This fits its design for tasks that operate at the word level, such as part-of-speech tagging, named entity recognition, and dependency parsing.
    2. GPT-2 breaks the text into subword units using byte-level byte-pair encoding (BPE). Rare words are assembled from smaller pieces, so a fixed-size vocabulary can cover a wide range of text, including neologisms and morphologically rich languages.
2. **Special Characters and Whitespace**:
    1. spaCy emits whitespace runs (such as `' \n '`) and punctuation marks as separate tokens, which can be useful for syntactic parsing and sentence boundary detection.
    2. GPT-2 folds whitespace into its tokens: `Ġ` is how the byte-level vocabulary displays a space, usually attached to the front of the following word, and `Ċ` is how it displays a newline. This retains the textual structure without needing a large vocabulary for whitespace variations.
3. **Unknown Tokens**:
    1. The `<unk>` marker is already present in the raw Wiki files, where it stands in for rare words. spaCy does not treat it as a special symbol and splits it into `<`, `unk`, and `>`.
    2. GPT-2 splits the marker as well, into `Ġ<`, `unk`, and `>`. Because its byte-level subword vocabulary can encode any string, GPT-2 does not need an out-of-vocabulary (OOV) token of its own.
4. **Purpose**:
    1. spaCy is optimized for NLP tasks that require understanding word forms and syntactic structure in context, e.g., NER, part-of-speech tagging, and dependency parsing.
    2. GPT-2 is designed for language generation, where subword units allow a more flexible word representation and let the model handle a wide variety of text.

# Question 2

## Testing Sample Data

In [14]:
from models.ngrams.ngrams import get_ngram_model, test_ngram_model

In [15]:
sample_train_tokens = [
    "this",
    "is",
    "a",
    "sample",
    "text",
    "this",
    "is",
    "another",
    "example",
    "text",
]
# test also contains OOV
sample_test_tokens = ["this", "is", "a", "test", "text"]
n = 2

In [16]:
sample_bigram_counts, sample_bi_minus_1_gram_counts = get_ngram_model(
    sample_train_tokens, n
)

In [17]:
sample_bigram_counts

Counter({('this', 'is'): 2,
         ('is', 'a'): 1,
         ('a', 'sample'): 1,
         ('sample', 'text'): 1,
         ('text', 'this'): 1,
         ('is', 'another'): 1,
         ('another', 'example'): 1,
         ('example', 'text'): 1})

In [18]:
sample_bi_minus_1_gram_counts

Counter({('this',): 2,
         ('is',): 2,
         ('text',): 2,
         ('a',): 1,
         ('sample',): 1,
         ('another',): 1,
         ('example',): 1})

In [19]:
test_ngram_model(
    sample_test_tokens, sample_bigram_counts, sample_bi_minus_1_gram_counts, n
)

Vocabulary Size: 7
Number of OOV instances: 2


10.91152048559648
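
`get_ngram_model` and `test_ngram_model` are imported from `models.ngrams.ngrams` and are not shown here. The sketch below is one way the printed numbers could arise, under two assumptions: probabilities use the estimate $\frac{c(w_{i-1}, w_i) + \varepsilon}{c(w_{i-1}) + \varepsilon \cdot |V|}$ with $\varepsilon = 10^{-3}$ (the value quoted in the Question 3 comments), and $|V|$ is the number of distinct contexts. The function names are hypothetical.

```python
import math
from collections import Counter


def count_ngrams(tokens, n):
    """Count n-grams and their (n-1)-gram contexts in a token list."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(
        tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2)
    )
    return ngrams, contexts


def perplexity_sketch(test_tokens, ngrams, contexts, n, eps=1e-3):
    """Perplexity of the test n-grams under an eps-smoothed n-gram model."""
    vocab_size = len(contexts)  # assumption: the printed "Vocabulary Size"
    test_ngrams = [
        tuple(test_tokens[i:i + n]) for i in range(len(test_tokens) - n + 1)
    ]
    log_prob = 0.0
    for gram in test_ngrams:
        # Counter returns 0 for n-grams or contexts never seen in training.
        prob = (ngrams[gram] + eps) / (contexts[gram[:-1]] + eps * vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / len(test_ngrams))


train = "this is a sample text this is another example text".split()
test = "this is a test text".split()
bigrams, unigrams = count_ngrams(train, 2)
sample_perplexity = perplexity_sketch(test, bigrams, unigrams, 2)
print(sample_perplexity)
```

Under these assumptions the sample comes out at roughly 10.91, in line with the value printed above, and two of the four test bigrams, `('a', 'test')` and `('test', 'text')`, are unseen.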

Testing it on the Dr. Seuss data from class.

In [20]:
dr_suess_test = [
    "<s>",
    "I",
    "am",
    "Sam",
    "</s>",
    "<s>",
    "Sam",
    "I",
    "am",
    "</s>",
    "<s>",
    "I",
    "do",
    "not",
    "like",
    "green",
    "eggs",
    "and",
    "ham",
    "</s>",
]

In [21]:
get_ngram_model(dr_suess_test, n)

(Counter({('<s>', 'I'): 2,
          ('I', 'am'): 2,
          ('</s>', '<s>'): 2,
          ('am', 'Sam'): 1,
          ('Sam', '</s>'): 1,
          ('<s>', 'Sam'): 1,
          ('Sam', 'I'): 1,
          ('am', '</s>'): 1,
          ('I', 'do'): 1,
          ('do', 'not'): 1,
          ('not', 'like'): 1,
          ('like', 'green'): 1,
          ('green', 'eggs'): 1,
          ('eggs', 'and'): 1,
          ('and', 'ham'): 1,
          ('ham', '</s>'): 1}),
 Counter({('<s>',): 3,
          ('I',): 3,
          ('</s>',): 3,
          ('am',): 2,
          ('Sam',): 2,
          ('do',): 1,
          ('not',): 1,
          ('like',): 1,
          ('green',): 1,
          ('eggs',): 1,
          ('and',): 1,
          ('ham',): 1}))

$P(\text{I} \mid \text{<s>}) = \dfrac{c(\text{<s>}, \text{I})}{c(\text{<s>})}$

```python
('<s>',): 3
('<s>', 'I'): 2
```

$2/3 \approx 0.67$
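
The same check can be scripted. This is a small sketch that rebuilds the counts with `collections.Counter` instead of calling `get_ngram_model`; the helper name `mle_bigram_prob` is hypothetical.

```python
from collections import Counter

tokens = (
    "<s> I am Sam </s> <s> Sam I am </s> "
    "<s> I do not like green eggs and ham </s>"
).split()

bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)


def mle_bigram_prob(prev, word):
    """Maximum-likelihood estimate c(prev, word) / c(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


p_i_given_start = mle_bigram_prob("<s>", "I")
print(round(p_i_given_start, 2))
```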

## Training and Testing n-gram models

In [22]:
from models.ngrams.ngrams import calculate_perplexities

In [23]:
print("GPT-2 vocab size:", len(set(gpt2_train)))

GPT-2 vocab size: 27103


In [24]:
gpt2_perplexities = calculate_perplexities(gpt2_train, gpt2_test)
print("GPT-2 Perplexities:")
print(gpt2_perplexities)

Vocabulary Size: 27103
Number of OOV instances: 0
Vocabulary Size: 27103
Number of OOV instances: 47727
Vocabulary Size: 602318
Number of OOV instances: 138553
Vocabulary Size: 2253078
Number of OOV instances: 274857
GPT-2 Perplexities:
{'1-gram': 706.4458638503493, '2-gram': 170.32919159928596, '3-gram': 4365.838282596132, '7-gram': 1137052.2144194297}


In [25]:
print("SpaCy vocab size:", len(set(spacy_train)))

SpaCy vocab size: 33240


In [26]:
spacy_perplexities = calculate_perplexities(spacy_train, spacy_test)
print("SpaCy Perplexities:")
print(spacy_perplexities)

Vocabulary Size: 33240
Number of OOV instances: 0
Vocabulary Size: 33240
Number of OOV instances: 50345
Vocabulary Size: 619631
Number of OOV instances: 139466
Vocabulary Size: 2099033
Number of OOV instances: 263516
SpaCy Perplexities:
{'1-gram': 684.212695073679, '2-gram': 243.0552487411126, '3-gram': 6843.949376282952, '7-gram': 1363436.7859243387}


**Comments:**

* spaCy has a larger vocabulary than GPT-2 (33,240 vs. 27,103 distinct training tokens).
    * This could be one reason for its higher perplexity, especially for higher-order n-grams.
    * A larger vocabulary spreads the same training data over more distinct tokens, so the n-gram counts are sparser and accurate prediction becomes harder.
* uni-gram:
    * the two perplexities are close: 706.4 for GPT-2 and 684.2 for spaCy
    * GPT-2 has the slightly higher perplexity
    * the output reports 0 OOV instances for both tokenizers, so the training split covers the single-token distribution of the test split
    * suggests that spaCy's tokenization results in a distribution of tokens that reflects the test corpus slightly better
* bi-gram:
    * GPT-2 shows a much lower perplexity than spaCy (170.3 vs. 243.1)
    * suggests that GPT-2's tokenization aligns better with common two-token sequences in the Wiki data
    * or that GPT-2's subword pieces make the local structure of the Wiki text easier for a bigram model to capture
* tri-gram and 7-gram:
    * For higher orders the perplexity increases dramatically for both tokenizers, and it is higher for spaCy at both orders (6,843.9 vs. 4,365.8 for tri-grams; about 1.36 million vs. 1.14 million for 7-grams).
    * This increase is expected: longer n-grams are rarer, so many test n-grams never occur in the training data and receive only a tiny probability.
    * The higher perplexity for spaCy suggests that its tokenization might produce less frequent n-grams in the Wiki data,
    * or that word-level tokens make longer sequences harder to match against the training data.
    * The sparsity shows in the OOV counts: for GPT-2 they grow from 0 for unigrams to 274,857 for 7-grams.
* Overall, GPT-2's tokenization gives lower perplexity for every order above unigrams. The comparison is approximate, because the two tokenizers split the test text into different numbers of tokens and perplexity is averaged per token.
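
To make the comparison easier to read, the perplexities printed by the two cells above can be put side by side as ratios. The numbers are copied from the outputs; only the variable names are new.

```python
gpt2_ppl = {
    "1-gram": 706.4458638503493,
    "2-gram": 170.32919159928596,
    "3-gram": 4365.838282596132,
    "7-gram": 1137052.2144194297,
}
spacy_ppl = {
    "1-gram": 684.212695073679,
    "2-gram": 243.0552487411126,
    "3-gram": 6843.949376282952,
    "7-gram": 1363436.7859243387,
}

# A ratio above 1 means the spaCy tokenization has the higher perplexity.
ratios = {order: spacy_ppl[order] / gpt2_ppl[order] for order in gpt2_ppl}
for order, ratio in ratios.items():
    print(f"{order:>6}: spaCy / GPT-2 = {ratio:.2f}")
```

Only the unigram ratio falls below 1; the gap is widest for tri-grams.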

# Question 3

## Adding Laplace Smoothing

In [27]:
from models.ngrams.laplace_ngrams import calculate_laplace_perplexities

In [28]:
gpt2_perplexities = calculate_laplace_perplexities(gpt2_train, gpt2_test)
print("GPT-2 Perplexities:")
print(gpt2_perplexities)

Vocabulary Size: 27103
Number of OOV instances: 0
Vocabulary Size: 27103
Number of OOV instances: 47727
Vocabulary Size: 602318
Number of OOV instances: 138553
Vocabulary Size: 2253078
Number of OOV instances: 274857
GPT-2 Perplexities:
{'1-gram': 707.6526358520499, '2-gram': 591.5902646528112, '3-gram': 60813.012878479945, '7-gram': 1795022.9119593713}


In [29]:
spacy_perplexities = calculate_laplace_perplexities(spacy_train, spacy_test)
print("SpaCy Perplexities:")
print(spacy_perplexities)

Vocabulary Size: 33240
Number of OOV instances: 0
Vocabulary Size: 33240
Number of OOV instances: 50345
Vocabulary Size: 619631
Number of OOV instances: 139466
Vocabulary Size: 2099033
Number of OOV instances: 263516
SpaCy Perplexities:
{'1-gram': 686.648104026337, '2-gram': 863.2694763969148, '3-gram': 81545.93484314007, '7-gram': 1838737.680698908}


**Comments:**

* After Laplace smoothing, GPT-2 still gives lower perplexity than spaCy for every order above unigrams; for unigrams spaCy remains slightly lower.
* uni-gram: perplexities barely changed after smoothing, rising slightly for both tokenizers (706.4 to 707.7 for GPT-2, 684.2 to 686.6 for spaCy)
* bi-gram:
    * this worsened, i.e. perplexity increased for both tokenizers after smoothing (170.3 to 591.6 for GPT-2, 243.1 to 863.3 for spaCy)
    * a likely cause is that Laplace smoothing moves a large share of probability mass from observed bigrams to the many unseen ones, so the bigrams that do occur in the test set become less likely
    * the relative increase is similar for both tokenizers, roughly 3.5x
* 3-gram and 7-gram:
    * Substantially increased for both tokenizers.
    * The increase is dramatic for 3-grams (roughly 14x for GPT-2 and 12x for spaCy) and smaller for 7-grams (roughly 1.6x and 1.3x). With so many possible continuations, smoothing likely leaves little probability for the n-grams that were actually observed.
* It becomes more apparent that the perplexity depends heavily on the probability we assign to unseen n-grams.
    * If we assign a low probability to OOV n-grams, the perplexity is high.
    * This is why perplexity is already high for the n-gram models from Question 2, which use only a small constant $\varepsilon$. Their OOV probability is calculated as:
      $$\textbf{P}_n(\text{OOV}) = \frac{\varepsilon}{c(w_{i-1}) + \varepsilon \cdot |V|}$$
    * The numerator is just $\varepsilon$ because $c(w_{i-1}, w_i) = 0$. Furthermore, when $c(w_{i-1}) = 0$, the probability reduces to $\frac{1}{|V|}$.
    * In our case $\varepsilon = 10^{-3}$ and $\log(10^{-3}) \approx -6.9$, which is a poor score.
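
The effect of the smoothing constant can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `add_k_prob` is a hypothetical helper implementing the additive estimate $\frac{c + k}{c(w_{i-1}) + k \cdot |V|}$, Laplace smoothing is taken to mean $k = 1$, and the counts 50 and 100 are invented for illustration. The actual `calculate_laplace_perplexities` is not shown and may differ.

```python
import math


def add_k_prob(ngram_count, context_count, vocab_size, k):
    """Additively smoothed estimate (c + k) / (c_context + k * |V|)."""
    return (ngram_count + k) / (context_count + k * vocab_size)


V = 27103  # GPT-2 vocabulary size reported for the training split

# Unseen n-gram after an unseen context: 1 / |V| regardless of k.
unseen_eps = add_k_prob(0, 0, V, k=1e-3)
unseen_laplace = add_k_prob(0, 0, V, k=1.0)

# Hypothetical seen bigram: 50 occurrences after a context seen 100 times.
seen_eps = add_k_prob(50, 100, V, k=1e-3)
seen_laplace = add_k_prob(50, 100, V, k=1.0)

print(f"unseen: eps={unseen_eps:.2e}  laplace={unseen_laplace:.2e}")
print(f"seen:   eps={seen_eps:.4f}  laplace={seen_laplace:.4f}")
print(f"log(1e-3) = {math.log(1e-3):.2f}")
```

With a vocabulary this large, $k = 1$ shrinks the probability of the seen bigram by about two orders of magnitude, which is consistent with the higher bigram perplexities after smoothing.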