# N-gram Language Models

We will use the Penn Treebank dataset to train an N-gram language model and then generate text with it.

In [None]:
import re
from typing import Any, Callable, List

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from nltk.tokenize import word_tokenize
import numpy as np
from torchtext.datasets import PennTreebank

In [None]:
train, valid, test = PennTreebank()

## Preprocessing

Let's start by converting the data into a simple list of sentences, and then clean them by removing special characters.

In [None]:
def iter_to_item_dataset(dataset) -> List[Any]:
    """
    Extract the dataset as a simple list.
    """
    return list(dataset)


train = iter_to_item_dataset(train)
valid = iter_to_item_dataset(valid)
test = iter_to_item_dataset(test)

In [None]:
# Raw strings avoid invalid-escape warnings for \W and \s.
NON_CHAR_RE = re.compile(r"\W+")
MULTI_SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """
    Remove special characters and lower-case the text
    """
    txt = NON_CHAR_RE.sub(" ", text.lower())
    txt = MULTI_SPACE_RE.sub(" ", txt)
    return txt.strip()


def preprocessing(text: str) -> List[str]:
    """
    Tokenize the raw text after cleaning it. The special start and end
    tokens are added later by `padded_everygram_pipeline`.
    """
    return word_tokenize(clean_text(text))
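
To see what the cleaning step does, here is a small self-contained check. The input lines are made up for illustration, and `str.split` stands in for `word_tokenize` (an assumption to keep the sketch free of NLTK; the two can differ on some words).

```python
import re

NON_CHAR_RE = re.compile(r"\W+")
MULTI_SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lower-case the text and replace runs of non-word characters by a space."""
    txt = NON_CHAR_RE.sub(" ", text.lower())
    txt = MULTI_SPACE_RE.sub(" ", txt)
    return txt.strip()


# Hypothetical input lines, made up for illustration.
samples = [
    "The company's N-gram model, v2.0!",
    "the <unk> said it lost $ N million",
]
cleaned = [clean_text(s) for s in samples]
# str.split is a simplified stand-in for word_tokenize.
tokens = [c.split() for c in cleaned]

for line in cleaned:
    print(line)
```

Note that the regex also strips the angle brackets from placeholder tokens such as `<unk>`, so they end up as ordinary words in the vocabulary.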

## Training
The model is a maximum likelihood estimator (MLE). It simply counts the occurrences of N-grams and estimates probabilities from their relative frequencies.
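
To make the counting concrete, here is a minimal pure-Python sketch of the maximum likelihood estimate for bigrams. It is independent of NLTK; the toy corpus and the helper name `mle_score` are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration), already tokenized and padded.
sentences = [
    ["<s>", "the", "company", "said", "</s>"],
    ["<s>", "the", "company", "reported", "</s>"],
    ["<s>", "the", "market", "fell", "</s>"],
]

# counts[context][word] = number of times `word` follows `context`.
counts = defaultdict(Counter)
for sentence in sentences:
    for context, word in zip(sentence, sentence[1:]):
        counts[context][word] += 1


def mle_score(word: str, context: str) -> float:
    """Relative frequency estimate: count(context, word) / count(context)."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0


print(mle_score("company", "the"))  # 2 of the 3 words following "the"
print(mle_score("bank", "the"))     # never seen after "the"
```

A word that never follows a given context gets probability zero, which is the main weakness of the plain MLE.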

In [None]:
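def train_n_gram_model(
    texts: List[str], preprocessing_pipeline: Callable[[str], List[str]], n: int
) -> MLE:
    """
    Fit an MLE language model of order `n` on the given texts.
    """
    model = MLE(n)
    tokenized_texts = [preprocessing_pipeline(text) for text in texts]
    # Returns the everygrams (orders 1..n) of each padded sentence and the flat
    # stream of padded tokens used to build the vocabulary.
    train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_texts)
    model.fit(train_data, padded_vocab)
    return model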

In [None]:
N = 3
model = train_n_gram_model(train, preprocessing, N)

In [None]:
len(model.vocab)

## Text generation

Let's see what we can generate with this model.

In [None]:
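# Prepare the test set the same way as the training data; it is not needed for
# generation but is ready for evaluating the model.
tokenized_test = [preprocessing(text) for text in test]
test_data, padded_test = padded_everygram_pipeline(N, tokenized_test)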

In [None]:
model.generate(25, text_seed=["<s>", "the", "company"])
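
Conceptually, generation repeatedly samples the next word from the counts observed after the last N-1 tokens. The sketch below illustrates that idea for trigrams in pure Python. It is a simplified stand-in under stated assumptions (toy corpus, no backoff to shorter contexts), not NLTK's implementation.

```python
import random
from collections import Counter, defaultdict

# Toy padded corpus (an assumption for illustration).
sentences = [
    ["<s>", "<s>", "the", "company", "said", "it", "will", "buy", "</s>"],
    ["<s>", "<s>", "the", "company", "said", "it", "will", "sell", "</s>"],
]

# counts[(w1, w2)][w3] = number of times w3 follows the pair (w1, w2).
counts = defaultdict(Counter)
for sent in sentences:
    for a, b, c in zip(sent, sent[1:], sent[2:]):
        counts[(a, b)][c] += 1


def generate(seed, max_words, rng):
    """Sample words one at a time, conditioned on the last two tokens."""
    output = list(seed)
    for _ in range(max_words):
        followers = counts.get(tuple(output[-2:]))
        if not followers:  # unseen context: nothing to sample from
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        next_word = rng.choices(words, weights=weights)[0]
        if next_word == "</s>":  # end-of-sentence token stops generation
            break
        output.append(next_word)
    return output[len(seed):]


print(generate(["the", "company"], 10, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.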

## Now it's your turn

* Try computing perplexity on the test data ([this tutorial](https://www.kaggle.com/alvations/n-gram-language-model-with-nltk) can help)
* Look into [other models](https://www.nltk.org/api/nltk.lm.html#module-nltk.lm.models)
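
As a starting point for the first exercise, perplexity is 2 raised to the average negative log2 probability the model assigns to each test N-gram. The sketch below is pure Python and uses hypothetical probabilities rather than real model scores; the function name `perplexity` is an illustrative choice.

```python
import math


def perplexity(probabilities):
    """Return 2 ** (average negative log2 probability) of the scored n-grams."""
    if any(p == 0.0 for p in probabilities):
        # A single zero-probability n-gram makes the perplexity infinite.
        return math.inf
    entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2.0 ** entropy


# Hypothetical per-n-gram probabilities, chosen only for illustration.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform choice among 4 words
print(perplexity([0.5, 0.25, 0.0]))          # contains an unseen n-gram
```

With an MLE model, any test N-gram that never occurred in training has probability zero, so the perplexity on held-out text is typically infinite. This is one reason to look at the smoothed models linked above.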