# Text Prediction with nltk's Kneser-Ney Interpolated Model

We'll use the `nltk.lm` module to see how n-gram language modelling works in practice.

**Source:** The content in this notebook is largely based on [N-gram Language Model with NLTK by Liling Tan](https://www.kaggle.com/alvations/n-gram-language-model-with-nltk/notebook)

See also: [language model tutorial in NLTK documentation by Ilia Kurenkov](https://github.com/nltk/nltk/blob/develop/nltk/lm/__init__.py)


## Imports

In [94]:
import pandas as pd

from nltk.lm.api import LanguageModel, Smoothing

from nltk.lm import MLE  # maximum-likelihood n-gram language model
from nltk.lm import KneserNeyInterpolated  # interpolated Kneser-Ney language model
from nltk.lm.smoothing import KneserNey  # smoothing class used by KneserNeyInterpolated

from nltk import word_tokenize, sent_tokenize
from nltk.util import pad_sequence, bigrams, ngrams, everygrams
from nltk.lm.preprocessing import pad_both_ends, flatten, padded_everygram_pipeline
from nltk.tokenize.treebank import TreebankWordDetokenizer

## Load Data

In [53]:
# Import Lyrics Dataframe
df = pd.read_csv('lyrics_df.csv').set_index('TrackID')
df.head()


| TrackID | Track Name | Artists | Lyrics | Tokens |
|---|---|---|---|---|
| 54640449 | Jingle Bell Rock | Bobby Helms | Jingle bell, jingle bell, jingle bell rock Jin... | ['rock', 'swing', 'ring', 'snowing', 'blowing'... |
| 568180108 | Fancy Like | Walker Hayes | Ayy My girl is bangin' She's so low maintenanc... | ['ayy', 'girl', 'bangin', 'shes', 'low', 'main... |
| 52815945 | Enchanted | Taylor Swift | There I was again tonight Forcing laughter, fa... | ['tonight', 'forcing', 'laughter', 'faking', '... |
| 463807349 | Old Town Road (Remix) | Lil Nas X Feat. Billy Ray Cyrus | Oh, oh-oh Oh Yeah, I'm gon' take my horse to ... | ['yeah', 'take', 'horse', 'old', 'town', 'road... |
| 559530353 | Thinking 'Bout You | Dustin Lynch Feat. MacKenzie Porter | Well, look who it is Last call I thought I'd g... | ['well', 'look', 'last', 'call', 'thought', 'h... |


## Language Modeling with MLE

In [None]:
# Tokenize lyrics corpus
lyrics_corpus = list(df['Lyrics'].apply(word_tokenize))

In [73]:
# Preprocess the tokenized text for 4-gram language modelling.
# Both return values are lazy iterators that can be consumed only once.
n = 4
train_data, padded_sents = padded_everygram_pipeline(n, lyrics_corpus)
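
To see what `padded_everygram_pipeline` produces, here is a minimal sketch on a made-up one-sentence corpus. The `toy_corpus` name and its contents are assumptions for illustration, not part of the lyrics data, and order 2 is used only to keep the output small. It also shows that the vocabulary stream can be read only once.

```python
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus: one pre-tokenized "sentence".
toy_corpus = [["a", "b"]]

toy_train, toy_vocab = padded_everygram_pipeline(2, toy_corpus)

# The vocabulary stream is the padded sentence, flattened.
vocab_tokens = list(toy_vocab)
print(vocab_tokens)

# The training stream yields, per sentence, every 1-gram and 2-gram
# of the padded sentence (ordering can differ between nltk versions).
toy_ngrams = {gram for sent in toy_train for gram in sent}
print(sorted(toy_ngrams))

# Both streams are lazy and single-use: a second pass finds nothing.
second_pass = list(toy_vocab)
print(second_pass)
```

Because the streams are single-use, the pipeline has to be rebuilt for every model that is fitted.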

In [74]:
# Fit a 4-gram MLE model (fitting consumes `train_data` and `padded_sents`)
country_model = MLE(n)
country_model.fit(train_data, padded_sents)
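
As a rough illustration of what a fitted `MLE` model estimates, the sketch below fits a bigram model on a made-up two-sentence corpus (`toy_corpus` and `toy_model` are assumptions, not the lyrics data) and checks that the score of a word given its context is a relative frequency of counts.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus: "a" is followed once by "b" and once by "c".
toy_corpus = [["a", "b"], ["a", "c"]]

toy_train, toy_vocab = padded_everygram_pipeline(2, toy_corpus)
toy_model = MLE(2)
toy_model.fit(toy_train, toy_vocab)

# MLE score = count(context + word) / count(context followed by anything).
p_b_given_a = toy_model.score("b", ["a"])
print(p_b_given_a)

# An unseen continuation gets probability zero, since MLE does no smoothing.
p_a_given_b = toy_model.score("a", ["b"])
print(p_a_given_b)
```

The zero probability for unseen continuations is the gap that smoothed models such as Kneser-Ney are meant to fill.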

In [71]:
# Detokenize the generated tokens into human-readable text.

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed):
    """
    :param model: An n-gram language model from `nltk.lm.models`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)
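
The padding and stopping logic of `generate_sent` can be checked without a trained model. The sketch below repeats the function so that it runs on its own and feeds it a stand-in object; `FixedModel` is a hypothetical stub, not an `nltk` class, and it always emits the same tokens.

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed):
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue  # skip start-of-sentence padding
        if token == '</s>':
            break  # stop at the first end-of-sentence marker
        content.append(token)
    return detokenize(content)

class FixedModel:
    """Hypothetical stand-in for a fitted model; always emits the same tokens."""
    def generate(self, num_words, random_seed=None):
        return ["<s>", "hello", "world", "</s>", "ignored"][:num_words]

sentence = generate_sent(FixedModel(), num_words=5, random_seed=0)
print(sentence)
```

Since generation stops at the first `</s>`, `num_words` acts as an upper bound on the output length rather than an exact count.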

In [80]:
# Predict text
print(generate_sent(country_model, num_words=280, random_seed=32))

, at the end of the day, y'all I'm just tryna keep my daughters off the pole And my sons out of jail (sons out of jail (sons out of jail Tryna get to church so I don't go to hell (I don't know why, don't give a fuck if it's your b-day I hand it off like a relay In Beverly Hills straight covered in grease I'm a lady Men's shirts, short skirts Oh, oh, I wannacome back as a country boy No, there ain't no better life if you ask me If my neck don't come out red, then Lord, just keep me dead 'Cause a country boy is all that I know how to be Yeah, a country boy's all that I'm like a Marlboro man so I kick on back Wish I could quit you, but I lost all control And I need you now And I don't know how I can do without I just need you now (Whoa-whoa) Guess I rather hurt than feel nothing at all It's a rich man's game No matter what they call it And you spend your last dime to put a rock on her hand I hope she's wilder than your wildest Dreams, she's walkin' back to me 'Cause don't nothing taste be

## Language Modeling with Kneser-Ney Interpolated

In [103]:
""" NOT WORKING YET """
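# Fit a 4-gram Kneser-Ney model
country_model_kn = KneserNeyInterpolated(n)

# No extra counter/vocab arguments are needed here. `train_data` and
# `padded_sents` are single-use iterators that `country_model.fit` already
# exhausted, so the fit below sees no data and the vocabulary stays empty.
# Rebuild them with `padded_everygram_pipeline(n, lyrics_corpus)` first.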

country_model_kn.fit(train_data, padded_sents)
len(country_model_kn.vocab)

0

In [98]:
# Predict text
print(generate_sent(country_model_kn, num_words=10, random_seed=3))

ValueError: Can't choose from empty population
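
The empty vocabulary and the `ValueError` are consistent with `train_data` and `padded_sents` having been exhausted by the earlier `MLE` fit, since `padded_everygram_pipeline` returns single-use iterators. A minimal sketch of the fix is to rebuild the pipeline immediately before each `fit`. The corpus here is made up: `toy_corpus`, `order = 2`, and the model names are assumptions chosen to keep the example small.

```python
from nltk.lm import MLE, KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus standing in for `lyrics_corpus`.
toy_corpus = [["a", "b"], ["a", "c"], ["b", "c"]]
order = 2

# The first fit consumes both iterators.
train, vocab = padded_everygram_pipeline(order, toy_corpus)
mle_model = MLE(order)
mle_model.fit(train, vocab)

# Reusing the spent iterators reproduces the empty vocabulary.
broken_kn = KneserNeyInterpolated(order)
broken_kn.fit(train, vocab)
print(len(broken_kn.vocab))

# Rebuilding the pipeline gives the second model fresh data.
train, vocab = padded_everygram_pipeline(order, toy_corpus)
kn_model = KneserNeyInterpolated(order)
kn_model.fit(train, vocab)
print(len(kn_model.vocab))

generated = kn_model.generate(5, random_seed=3)
print(generated)
```

In the notebook itself, the equivalent change would presumably be to call `padded_everygram_pipeline(n, lyrics_corpus)` again right before `country_model_kn.fit(...)`.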