# Question 1

## Reading Wiki Data

In [None]:
with open("data/wiki2.train.txt", "r") as file:
    wiki_train = file.read()

with open("data/wiki2.test.txt", "r") as file:
    wiki_test = file.read()

with open("data/wiki2.valid.txt", "r") as file:
    wiki_valid = file.read()

In [None]:
# first 100 characters
wiki_train[0:100]

## spaCy Tokenizer

In [None]:
import spacy

In [None]:
nlp = spacy.load("xx_ent_wiki_sm")

`xx_ent_wiki_sm` is a multi-language model trained on Wikipedia that supports named entity recognition for multiple languages. Only the token text it produces is used below.

In [None]:
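def chunked_tokenization(text, tokenizer, chunk_size=1000000):
    """Tokenize text in fixed-size character chunks and return the token strings.

    Chunking keeps each call within spaCy's default `nlp.max_length` of
    1,000,000 characters. A word that straddles a chunk boundary is split.
    """
    tokens = []
    for i in range(0, len(text), chunk_size):
        text_chunk = text[i : i + chunk_size]
        tokens.extend([token.text for token in tokenizer(text_chunk)])
    return tokens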
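
Slicing at fixed character offsets can cut a word in two at a chunk boundary, which produces two spurious tokens per boundary. A minimal sketch of one way around this, assuming it is acceptable for chunks to be slightly shorter than `chunk_size`: back up to the last space or newline before cutting. `iter_chunks` is a hypothetical helper, not part of spaCy or of the notebook.

```python
def iter_chunks(text, chunk_size):
    """Yield consecutive chunks of at most chunk_size characters.

    Each chunk ends just after a space or newline when one is available,
    so words are not cut in half. Falls back to a hard cut otherwise.
    """
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Position of the last whitespace character inside the window.
            split = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split > start:
                end = split + 1
        yield text[start:end]
        start = end


example_chunks = list(iter_chunks("alpha beta gamma delta", 8))
print(example_chunks)
```

Joining the chunks reproduces the original text, so the helper could replace the `text[i : i + chunk_size]` slicing without losing characters.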

In [None]:
spacy_train = chunked_tokenization(wiki_train, nlp)
spacy_test = chunked_tokenization(wiki_test, nlp)
spacy_valid = chunked_tokenization(wiki_valid, nlp)

After and before tokenization (the first 20 tokens, then the first 100 raw characters):

In [None]:
spacy_train[0:20]

In [None]:
wiki_train[0:100]

## Pre-trained `GPT2TokenizerFast`

In [None]:
from transformers import GPT2TokenizerFast

In [None]:
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

In [None]:
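def chunked_tokenization_gpt2(text, tokenizer, chunk_size=5000000):
    """Tokenize text into GPT-2 BPE token strings, one character chunk at a time."""
    tokens = []
    for i in range(0, len(text), chunk_size):
        text_chunk = text[i : i + chunk_size]
        # tokenize() returns the subword strings rather than integer ids
        tokens.extend(tokenizer.tokenize(text_chunk))
    return tokens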

In [None]:
gpt2_train = chunked_tokenization_gpt2(wiki_train, gpt2_tokenizer)
gpt2_valid = chunked_tokenization_gpt2(wiki_valid, gpt2_tokenizer)
gpt2_test = chunked_tokenization_gpt2(wiki_test, gpt2_tokenizer)

In [None]:
gpt2_train[0:20]

* `Ġ` indicates a space before the word in the original text (part of GPT-2's byte pair encoding to differentiate between words that start after a space and subwords that occur in the middle of words)
* `Ċ` represents a newline character in the text.
* Rare words like "Valkyria" are split into several subword pieces (`V`, `alky`, `ria`), while a more common word like "Chronicles" stays a single token (`Chronicles`). All of these pieces are entries in the tokenizer's vocabulary.
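
The `Ġ` and `Ċ` markers come from the byte-to-character table that GPT-2's byte-level BPE applies before merging. The sketch below is a re-implementation of that table written for illustration (it is not imported from `transformers`): printable bytes keep their own character, and every other byte is shifted to a code point starting at 256.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a visible Unicode character."""
    # Bytes that are already printable map to themselves.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}

    # Remaining bytes (whitespace, control characters) get code points 256, 257, ...
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


byte_map = bytes_to_unicode()
space_marker = byte_map[ord(" ")]     # byte 32 is the 33rd shifted byte
newline_marker = byte_map[ord("\n")]  # byte 10 is the 11th shifted byte
print(space_marker, newline_marker)
```

The space byte lands on code point 288 (`Ġ`) and the newline byte on 266 (`Ċ`), which is why those two characters show up in the token strings.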

## Differences

In [None]:
untokenized_test = wiki_test[0:1000].split(" ")[0:200]

In [None]:
print(f"{'Untokenized':<30} | {'Spacy Tokens':<30} | {'GPT-2 Tokens':<30}")
print(f"{'-'*30}-+-{'-'*30}-+-{'-'*30}")

for i in range(200):
    untokenized = repr(untokenized_test[i]) if i < len(untokenized_test) else ""
    spacy_token = repr(spacy_test[i]) if i < len(spacy_test) else ""
    gpt2_token = repr(gpt2_test[i]) if i < len(gpt2_test) else ""

    untokenized = untokenized.strip("'\"")
    spacy_token = spacy_token.strip("'\"")
    gpt2_token = gpt2_token.strip("'\"")

    print(f"{untokenized:<30} | {spacy_token:<30} | {gpt2_token:<30}")

Some key differences we can see:

1. **Granularity**:
    1. Spacy produces more word-like tokens, closely aligning with the actual words and punctuation in the text. This could be because Spacy is designed for tasks that require understanding the text at the word level, such as part-of-speech tagging, entity recognition, and dependency parsing.
    2. GPT-2 breaks down the text into subword units, represented as byte-pair encodings. This method captures the internal structure of words, allowing the model to handle a wide range of vocabulary, including neologisms and morphologically rich languages, with a fixed-size vocabulary.
2. **Special Characters and Whitespace**:
    1. Spacy emits newlines, runs of extra whitespace, and punctuation marks as separate tokens, which can be useful for syntactic parsing and sentence boundary detection.
    2. GPT-2 uses marker characters such as `Ġ` to indicate that a token follows a space, and `Ċ` for newlines. These come from its byte-level encoding rather than being separate special tokens, and they retain the textual structure without needing a large vocabulary for whitespace variations.
3. **Unknown Tokens**:
    1. The `<unk>` marker is not generated by Spacy: it is already present in the raw corpus text as a placeholder for unknown or out-of-vocabulary (OOV) words, and Spacy simply tokenizes it as it appears.
    2. GPT-2 rarely encounters OOV tokens due to its subword tokenization. This allows it to piece together unfamiliar terms from known subword components, which is why we see pieces like `Ġ<` and `unk`.
4. **Purpose**:
    1. Spacy is optimized for NLP tasks requiring understanding of word forms and syntactic structures in context, e.g., NER, part-of-speech tagging, and dependency parsing.
    2. GPT-2 is designed for language generation and comprehension tasks, where subword units allow for more flexible word representation. This allows it to handle a wide variety of text.

# Question 2

## Testing Sample Data

In [None]:
import math
from collections import Counter

In [None]:
def generate_ngram_tokens(tokens, n):
    """
    Generate a Counter of n-gram tuples from a list of tokens.

    Args:
        tokens (list of str): The list of tokens from which to generate n-grams.
        n (int): The number of tokens in each n-gram.

    Returns:
        Counter: A Counter object mapping each n-gram tuple to its frequency.
    """
    return Counter([tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)])

In [None]:
def get_ngram_model(train_tokens, n):
    """
    Generate n-gram and (n-1)-gram models from training tokens.

    Args:
        train_tokens (list of str): The list of tokens to train the model.
        n (int): The n-gram size.

    Returns:
        tuple: A tuple containing two Counter objects for n-gram and (n-1)-gram counts.
    """
    ngram_counts = generate_ngram_tokens(train_tokens, n)
    n_minus_1_gram_counts = generate_ngram_tokens(train_tokens, n - 1)

    return ngram_counts, n_minus_1_gram_counts

In [None]:
def test_ngram_model(test_tokens, ngram_counts, n_minus_1_gram_counts, n, epsilon=1e-6):
    """
    Calculate the perplexity of a test dataset using an n-gram model.

    Args:
        test_tokens (list of str): The list of tokens to test the model.
        ngram_counts (Counter): The n-gram counts from the training data.
        n_minus_1_gram_counts (Counter): The (n-1)-gram counts from the training data.
        n (int): The n-gram size.
        epsilon (float): A small value added to the counts so that unseen n-grams do not get zero probability (which would make the log undefined).

    Returns:
        float: The calculated perplexity score.
    """
    log_likelihood = 0.0
    N = 0  # total n-grams in the test data

    for i in range(len(test_tokens) - n + 1):
        test_ngram = tuple(test_tokens[i : i + n])
        test_n_minus_1_gram = test_ngram[:-1]

        # Calculate the probability of the n-gram
        numerator = ngram_counts.get(test_ngram, 0) + epsilon
        denominator = n_minus_1_gram_counts.get(test_n_minus_1_gram, 0) + (
            epsilon * len(n_minus_1_gram_counts)
        )
        prob = numerator / denominator

        # total log likelihood
        log_likelihood += math.log(prob)

        N += 1

    # Calculate perplexity
    perplexity = math.exp(-log_likelihood / N)
    return perplexity
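
Written out, the function above scores each test n-gram with an additively smoothed estimate and then exponentiates the average negative log-probability. The symbols are introduced here for illustration and do not appear in the code: $h_i$ is the $(n-1)$-token context of the $i$-th test n-gram, $C(\cdot)$ is a training count, $K$ stands for `len(n_minus_1_gram_counts)`, and $N$ is the number of test n-grams.

$$
P_\epsilon(w_i \mid h_i) = \frac{C(h_i, w_i) + \epsilon}{C(h_i) + \epsilon K},
\qquad
\mathrm{PP} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \ln P_\epsilon(w_i \mid h_i)\right)
$$

For a context that never occurs in training, $C(h_i) = 0$ and the estimate reduces to $\epsilon / (\epsilon K) = 1/K$.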

In [None]:
sample_train_tokens = [
    "this",
    "is",
    "a",
    "sample",
    "text",
    "this",
    "is",
    "another",
    "example",
    "text",
]
# test also contains OOV
sample_test_tokens = ["this", "is", "a", "test", "text"]
n = 2

In [None]:
sample_bigram_counts, sample_bi_minus_1_gram_counts = get_ngram_model(
    sample_train_tokens, n
)

In [None]:
sample_bigram_counts

In [None]:
sample_bi_minus_1_gram_counts

In [None]:
test_ngram_model(
    sample_test_tokens, sample_bigram_counts, sample_bi_minus_1_gram_counts, n
)
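
To see what the smoothed estimate does on this toy data, the four test bigrams can be scored one by one. The sketch below recomputes them independently of the functions above, using plain `Counter` objects and `zip`; the names `context_types`, `probs`, and `sample_perplexity` are illustrative.

```python
import math
from collections import Counter

train = ["this", "is", "a", "sample", "text",
         "this", "is", "another", "example", "text"]
test = ["this", "is", "a", "test", "text"]
eps = 1e-6

bigram_counts = Counter(zip(train, train[1:]))
unigram_counts = Counter(train)
context_types = len(unigram_counts)  # number of distinct contexts for n = 2

probs = []
for prev, word in zip(test, test[1:]):
    p = (bigram_counts[(prev, word)] + eps) / (unigram_counts[prev] + eps * context_types)
    probs.append(p)
    print(f"P({word!r} | {prev!r}) = {p:.6g}")

sample_perplexity = math.exp(-sum(math.log(p) for p in probs) / len(probs))
print("perplexity:", sample_perplexity)
```

Two cases are worth noticing: the unseen bigram `("a", "test")` gets a probability of roughly `eps`, while `("test", "text")`, whose context `"test"` never occurs in training, falls back to roughly `1 / context_types`.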

Testing it on the Dr. Seuss data from class.

In [None]:
dr_suess_test = [
    "<s>",
    "I",
    "am",
    "Sam",
    "</s>",
    "<s>",
    "Sam",
    "I",
    "am",
    "</s>",
    "<s>",
    "I",
    "do",
    "not",
    "like",
    "green",
    "eggs",
    "and",
    "ham",
    "</s>",
]

In [None]:
get_ngram_model(dr_suess_test, n)

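The bigram probability of `I` following `<s>` is the bigram count divided by the count of its context:

$P(\text{I} \mid \text{<s>}) = \dfrac{C(\text{<s>}, \text{I})}{C(\text{<s>})}$

```python
('<s>',): 3
('<s>', 'I'): 2
```

$P(\text{I} \mid \text{<s>}) = 2/3 \approx 0.67$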
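
The same ratio can be checked directly from the counts. A minimal sketch that forms bigrams with `zip` rather than the notebook's helper; `seuss_tokens` and `p_i_given_start` are illustrative names.

```python
from collections import Counter

seuss_tokens = [
    "<s>", "I", "am", "Sam", "</s>",
    "<s>", "Sam", "I", "am", "</s>",
    "<s>", "I", "do", "not", "like", "green", "eggs", "and", "ham", "</s>",
]

unigram_counts = Counter(seuss_tokens)
bigram_counts = Counter(zip(seuss_tokens, seuss_tokens[1:]))

# Unsmoothed (maximum likelihood) estimate of P(I | <s>)
p_i_given_start = bigram_counts[("<s>", "I")] / unigram_counts["<s>"]
print(round(p_i_given_start, 2))
```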

## Training and Testing n-gram models

In [None]:
def calculate_perplexities(train_data, test_data):
    perplexities = {}
    for n in [1, 2, 3, 7]:
        ngram_counts, n_minus_1_gram_counts = get_ngram_model(train_data, n)
        perplexity = test_ngram_model(test_data, ngram_counts, n_minus_1_gram_counts, n)
        perplexities[f"{n}-gram"] = perplexity
    return perplexities

In [None]:
print("GPT-2 vocab size:", len(set(gpt2_train)))

In [None]:
gpt2_perplexities = calculate_perplexities(gpt2_train, gpt2_test)
print("GPT-2 Perplexities:")
print(gpt2_perplexities)

In [None]:
print("SpaCy vocab size:", len(set(spacy_train)))

In [None]:
spacy_perplexities = calculate_perplexities(spacy_train, spacy_test)
print("SpaCy Perplexities:")
print(spacy_perplexities)

**Comments:**

* SpaCy has a larger vocabulary size than GPT-2.
    * This could be a reason for its higher perplexity, especially for higher-order n-grams.
    * More unique tokens also mean more rare tokens, so the n-gram model struggles to predict less frequent or more diverse sequences of words.
    * A larger vocabulary leads to sparser counts (especially for higher-order n-grams), making accurate predictions more difficult.
* Unigram:
    * Perplexity is relatively low for both the GPT-2 and SpaCy tokens.
    * GPT-2 has a slightly higher perplexity.
    * Both unigram models capture the single-token distribution of the Wiki data corpus well.
    * The gap suggests that SpaCy's tokenization yields a token distribution that reflects the test corpus slightly better.
* Bigram:
    * GPT-2 shows a much lower perplexity than SpaCy.
    * This suggests that GPT-2's tokenization aligns better with common two-token sequences in the Wiki data,
    * or that bigrams over GPT-2 tokens capture more of the local structure of the "Wiki data language".
* Trigram and 7-gram:
    * For higher-order n-grams, the perplexity increases dramatically for both tokenizations, and much more so for SpaCy.
    * This increase is expected: longer n-grams are rarer, so the training data carries less information about them and accurate prediction is harder.
    * The significantly higher perplexity for SpaCy suggests that its tokenization produces less frequent n-grams in the Wiki data,
    * or that n-grams over SpaCy tokens are less effective at capturing the language's structure over longer sequences.
* Overall, n-gram models over GPT-2 tokens seem to capture the patterns of the Wiki data corpus more effectively.
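
One caveat for this comparison: the perplexities above are averaged per token, and the two tokenizers split the same text into different numbers of tokens. A common way to put them on the same footing is to spread the total negative log-likelihood over a shared unit such as whitespace-separated words. The sketch below assumes that approach; `renormalize_perplexity` is a hypothetical helper and the numbers in the example call are made up, not results from this notebook.

```python
import math


def renormalize_perplexity(token_perplexity, num_tokens, num_units):
    """Convert a per-token perplexity into a perplexity per shared unit.

    The total negative log-likelihood (num_tokens * ln(perplexity)) is kept
    fixed and divided by the number of shared units instead of tokens.
    """
    total_nll = num_tokens * math.log(token_perplexity)
    return math.exp(total_nll / num_units)


# Made-up example: 200 subword tokens covering 100 words.
per_word = renormalize_perplexity(4.0, num_tokens=200, num_units=100)
print(per_word)
```

Because a subword tokenizer usually emits more tokens per word, its per-token perplexity tends to look lower than a word-level one for the same text, so the renormalized figures are the fairer ones to compare.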

# Question 3

## Adding Laplace Smoothing

In [None]:
def test_laplace_ngram_model(test_tokens, ngram_counts, n_minus_1_gram_counts, n):
    """
    Calculate the perplexity of a test dataset using an n-gram model with Laplace smoothing.

    Args:
        test_tokens (list of str): The list of tokens to test the model.
        ngram_counts (Counter): The n-gram counts from the training data.
        n_minus_1_gram_counts (Counter): The (n-1)-gram counts from the training data.
        n (int): The n-gram size.

    Returns:
        float: The calculated perplexity score.
    """
    log_likelihood = 0.0
    N = 0  # total n-grams in the test data

    # Smoothing constant V: the number of distinct (n-1)-grams seen in training.
    # This equals the vocabulary size only when n = 2.
    V = len(n_minus_1_gram_counts)

    for i in range(len(test_tokens) - n + 1):
        test_ngram = tuple(test_tokens[i : i + n])
        test_n_minus_1_gram = test_ngram[:-1]

        # Calculate the probability of the n-gram
        numerator = ngram_counts.get(test_ngram, 0) + 1
        denominator = n_minus_1_gram_counts.get(test_n_minus_1_gram, 0) + V
        prob = numerator / denominator

        # total log likelihood
        log_likelihood += math.log(prob)

        N += 1

    # Calculate perplexity
    perplexity = math.exp(-log_likelihood / N)
    return perplexity
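
In textbook add-one smoothing, $V$ is the number of distinct tokens for every $n$, so that the smoothed probabilities over the vocabulary sum to 1 for each context. The function above uses the number of distinct $(n-1)$-grams instead, which matches that definition only for bigrams. A minimal sketch of the textbook variant follows for comparison; `laplace_perplexity` is a hypothetical name, and it takes raw tokens rather than precomputed counters so that it is self-contained.

```python
import math
from collections import Counter


def laplace_perplexity(train_tokens, test_tokens, n):
    """Perplexity under add-one smoothing with V = number of distinct tokens."""
    vocab_size = len(set(train_tokens))
    ngram_counts = Counter(
        tuple(train_tokens[i : i + n]) for i in range(len(train_tokens) - n + 1)
    )
    # Context counts are summed from the n-grams, so each context count equals
    # the total count of the n-grams that start with it.
    context_counts = Counter()
    for ngram, count in ngram_counts.items():
        context_counts[ngram[:-1]] += count

    log_likelihood = 0.0
    num_ngrams = 0
    for i in range(len(test_tokens) - n + 1):
        ngram = tuple(test_tokens[i : i + n])
        prob = (ngram_counts[ngram] + 1) / (context_counts[ngram[:-1]] + vocab_size)
        log_likelihood += math.log(prob)
        num_ngrams += 1
    return math.exp(-log_likelihood / num_ngrams)


toy_train = ["this", "is", "a", "sample", "text",
             "this", "is", "another", "example", "text"]
toy_test = ["this", "is", "a", "test", "text"]
print(laplace_perplexity(toy_train, toy_test, 2))
```

Swapping this in would change the reported perplexities for every $n$ other than 2, so the comments that follow describe the original function, not this variant.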

In [None]:
def calculate_laplace_perplexities(train_data, test_data):
    perplexities = {}
    for n in [1, 2, 3, 7]:
        ngram_counts, n_minus_1_gram_counts = get_ngram_model(train_data, n)
        perplexity = test_laplace_ngram_model(test_data, ngram_counts, n_minus_1_gram_counts, n)
        perplexities[f"{n}-gram"] = perplexity
    return perplexities

In [None]:
gpt2_perplexities = calculate_laplace_perplexities(gpt2_train, gpt2_test)
print("GPT-2 Perplexities:")
print(gpt2_perplexities)

In [None]:
spacy_perplexities = calculate_laplace_perplexities(spacy_train, spacy_test)
print("SpaCy Perplexities:")
print(spacy_perplexities)

**Comments:**

* GPT-2 still performs consistently better than SpaCy after Laplace smoothing.
* Unigram: perplexities improved for both tokenizations after smoothing.
* Bigram:
    * Perplexity worsened (i.e. increased) for both tokenizations after smoothing.
    * This indicates that the smoothing had a larger impact here: unseen bigrams now receive a full pseudo-count of 1 instead of a tiny epsilon, which takes probability mass away from the bigrams that were actually observed.
    * The increase is more pronounced for GPT-2, possibly due to its smaller vocabulary size (about 10 percentage points worse).
* Trigram and 7-gram:
    * Perplexity increased substantially for both tokenizations.
    * The increase is dramatic, indicating that add-one smoothing penalizes the model heavily on rare n-grams, which dominate at higher orders.
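
A small worked case shows why add-one smoothing hurts most at higher orders, where many contexts occur only once. Take a context $h$ seen exactly once in training and followed by the word $w$; $V$ denotes the smoothing constant, and the value used below is hypothetical rather than taken from this notebook.

$$
P_{\text{MLE}}(w \mid h) = \frac{1}{1} = 1,
\qquad
P_{\text{Laplace}}(w \mid h) = \frac{1 + 1}{1 + V} = \frac{2}{1 + V}
$$

With a hypothetical $V = 10^5$ the smoothed probability is about $2 \times 10^{-5}$, so even an n-gram that was observed in training contributes a large negative log-probability to the perplexity.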