## N-gram Language Model with NLTK

In [1]:
from nltk.util import ngrams
from nltk.util import everygrams

- Before we train our `ngram` models, we need to make sure the data we put in them is in the right format.

- Let's say we have a text that is a list of sentences, where each sentence is a list of strings. For simplicity we just consider a text consisting of characters instead of words.

In [2]:
text = [['a', 'b', 'c'], 
        ['a', 'c', 'd', 'c', 'e', 'f']]

If we want to train a `bigram` model, we need to turn this text into bigrams. Here's what the first sentence of our text would look like if we use a function from NLTK for this.

In [3]:
from nltk.util import bigrams

In [4]:
list(bigrams(text[0]))

[('a', 'b'), ('b', 'c')]

In [5]:
list(ngrams(text[0], n=2))

[('a', 'b'), ('b', 'c')]

In [6]:
list(ngrams(text[1], n=2))

[('a', 'c'), ('c', 'd'), ('d', 'c'), ('c', 'e'), ('e', 'f')]

In [7]:
list(ngrams(text[1], n=3))

[('a', 'c', 'd'), ('c', 'd', 'c'), ('d', 'c', 'e'), ('c', 'e', 'f')]
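To make the sliding-window idea concrete, here is a minimal pure-Python sketch of what an n-gram extractor does. The helper name `sliding_ngrams` is made up for illustration and is not part of NLTK; `nltk.util.ngrams` may well be implemented differently.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    tokens = list(tokens)
    # A window of width n can start at positions 0 .. len(tokens) - n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = ['a', 'c', 'd', 'c', 'e', 'f']
bigram_list = sliding_ngrams(sentence, 2)
trigram_list = sliding_ngrams(sentence, 3)

print(bigram_list)
print(trigram_list)
```

A sentence shorter than `n` simply yields an empty list, because no full window fits.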

But how do we record how often sentences start with "a" and end with "c"? The ngrams above carry no information about sentence boundaries.

A standard way to deal with this is to add special "padding" symbols to the sentence before splitting it into ngrams. Fortunately, NLTK also has a function for that. Let's see what it does to the first sentence.

#### bigrams padding

In [7]:
from nltk.util import pad_sequence

In [8]:
list(pad_sequence(text[0],
                  pad_left=True, 
                  left_pad_symbol="<s>",
                  pad_right=True, 
                  right_pad_symbol="</s>",
                  n=2)) 

# n is the order of the n-grams: for 2-grams each side is padded once, for 3-grams twice, etc.

['<s>', 'a', 'b', 'c', '</s>']

Note the `n` argument, which tells the function we need padding for bigrams.

In [9]:
padded_sent = list(pad_sequence(text[0], 
                                pad_left=True, 
                                left_pad_symbol="<s>", 
                                pad_right=True, 
                                right_pad_symbol="</s>", 
                                n=2))

list(ngrams(padded_sent, n=2))

[('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]

#### trigrams padding

In [10]:
list(pad_sequence(text[0],
                  pad_left=True, 
                  left_pad_symbol="<s>",
                  pad_right=True, 
                  right_pad_symbol="</s>",
                  n=3)) 

# n is the order of the n-grams: for 2-grams each side is padded once, for 3-grams twice, etc.

['<s>', '<s>', 'a', 'b', 'c', '</s>', '</s>']

In [11]:
padded_sent = list(pad_sequence(text[0], 
                                pad_left=True, 
                                left_pad_symbol="<s>", 
                                pad_right=True, 
                                right_pad_symbol="</s>", n=3))

list(ngrams(padded_sent, n=3))

[('<s>', '<s>', 'a'),
 ('<s>', 'a', 'b'),
 ('a', 'b', 'c'),
 ('b', 'c', '</s>'),
 ('c', '</s>', '</s>')]
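The padding rule seen in the outputs above can be sketched in plain Python: each side receives `n - 1` padding symbols. `pad_sentence` is a hypothetical helper written for illustration, not NLTK's implementation.

```python
def pad_sentence(tokens, n, left="<s>", right="</s>"):
    """Add n - 1 padding symbols to each end of a sentence (hypothetical helper)."""
    return [left] * (n - 1) + list(tokens) + [right] * (n - 1)


padded_for_bigrams = pad_sentence(['a', 'b', 'c'], n=2)
padded_for_trigrams = pad_sentence(['a', 'b', 'c'], n=3)

print(padded_for_bigrams)
print(padded_for_trigrams)
```

With `n=1` nothing is added, which matches the intuition that unigrams need no boundary context.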

The `nltk.lm` module provides a convenience function, `pad_both_ends`, that has all these arguments already set, while the remaining arguments stay the same as for `pad_sequence`.

In [12]:
from nltk.lm.preprocessing import pad_both_ends

In [13]:
list(pad_both_ends(text[0], n=2))

['<s>', 'a', 'b', 'c', '</s>']

In [14]:
list(bigrams(pad_both_ends(text[0], n=2)))

[('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]

To make our model more robust we could also train it on `unigrams` (single words) as well as `bigrams`, its main source of information.

NLTK once again helpfully provides a function called `everygrams`.

In [15]:
from nltk.util import everygrams

In [16]:
padded_bigrams = list(pad_both_ends(text[0], n=2))

list(everygrams(padded_bigrams, max_len=2))

[('<s>',),
 ('<s>', 'a'),
 ('a',),
 ('a', 'b'),
 ('b',),
 ('b', 'c'),
 ('c',),
 ('c', '</s>'),
 ('</s>',)]
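As a rough illustration of what "every gram up to a maximum length" means, the sketch below enumerates all n-grams of length 1 to `max_len` at each position. `all_grams` is a hypothetical name, and the ordering of the results is an assumption of this sketch: NLTK's own ordering can differ between versions, so compare results as multisets rather than as lists.

```python
from collections import Counter


def all_grams(tokens, max_len):
    """Collect n-grams of every length from 1 to max_len (hypothetical helper)."""
    tokens = list(tokens)
    grams = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            # Only keep windows that fit entirely inside the sentence.
            if start + length <= len(tokens):
                grams.append(tuple(tokens[start:start + length]))
    return grams


padded = ['<s>', 'a', 'b', 'c', '</s>']
grams = all_grams(padded, max_len=2)
gram_counts = Counter(grams)

print(grams)
```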

During training and evaluation our model will rely on a vocabulary that defines which words are "known" to the model.
To create this vocabulary we need to pad our sentences (just like for counting ngrams) and then combine the sentences into one flat stream of words.

In [17]:
from nltk.lm.preprocessing import flatten

In [18]:
list(flatten(pad_both_ends(sent, n=2) for sent in text))

['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']
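The same flat stream can be produced with the standard library alone, assuming that flattening simply means concatenating the padded sentences in order. This sketch pads by hand for bigrams (one symbol per side) and uses `itertools.chain.from_iterable`; it does not call NLTK.

```python
from itertools import chain

text = [['a', 'b', 'c'],
        ['a', 'c', 'd', 'c', 'e', 'f']]

# Pad each sentence for bigrams, lazily, one sentence at a time.
padded_sentences = (['<s>'] + sent + ['</s>'] for sent in text)

# Concatenate the padded sentences into one flat stream of tokens.
flat_stream = list(chain.from_iterable(padded_sentences))

print(flat_stream)
```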

Now that we understand what this means for our preprocessing, we can simply import a function that does everything for us.

In [19]:
from nltk.lm.preprocessing import padded_everygram_pipeline

In [24]:
train, vocab = padded_everygram_pipeline(2, text)

In [25]:
list(vocab)

['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']

So as to avoid re-creating the text in memory, both train and vocab are lazy iterators. They are evaluated on demand at training time.

For the sake of understanding the output of `padded_everygram_pipeline`, we'll "materialize" the lazy iterators by casting them into a list.

In [23]:
training_ngrams, padded_sentences = padded_everygram_pipeline(2, text)

for ngramlize_sent in training_ngrams:
    print(list(ngramlize_sent))
    print()
    
print('#############')
list(padded_sentences)

[('<s>',), ('a',), ('b',), ('c',), ('</s>',), ('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]

[('<s>',), ('a',), ('c',), ('d',), ('c',), ('e',), ('f',), ('</s>',), ('<s>', 'a'), ('a', 'c'), ('c', 'd'), ('d', 'c'), ('c', 'e'), ('e', 'f'), ('f', '</s>')]

#############


['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']
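One practical consequence of lazy iterators is that they can be consumed only once, which is presumably why the pipeline has to be called again before fitting a model once its outputs have been materialized. The sketch below shows the effect with a hypothetical generator, `lazy_padded_stream`, standing in for the pipeline's second output; it is not NLTK code.

```python
def lazy_padded_stream(sentences, n):
    """Yield padded tokens one at a time without building a list (hypothetical helper)."""
    for sent in sentences:
        yield from ['<s>'] * (n - 1)
        yield from sent
        yield from ['</s>'] * (n - 1)


text = [['a', 'b', 'c'],
        ['a', 'c', 'd', 'c', 'e', 'f']]

stream = lazy_padded_stream(text, n=2)
first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # nothing left to yield

print(first_pass)
print(second_pass)
```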

#### Training
Having prepared our data we are ready to start training a model. 

As a simple example, let us train a `Maximum Likelihood Estimator` (MLE).

We only need to specify the highest `ngram` order to instantiate it. 

In [20]:
from nltk.lm import MLE

In [21]:
train, vocab = padded_everygram_pipeline(2, text)

In [22]:
lm = MLE(2)

This automatically creates an empty vocabulary...

In [23]:
len(lm.vocab)

0

... which gets filled as we fit the model.

In [24]:
lm.fit(train, vocab)

In [25]:
print(lm.vocab)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 9 items>


In [26]:
len(lm.vocab)

9

In [27]:
lm.vocab.lookup(text[0]), lm.vocab.lookup(text[1])

(('a', 'b', 'c'), ('a', 'c', 'd', 'c', 'e', 'f'))

In [34]:
lm.vocab.lookup(["b", "from", "Mars"])

('b', '<UNK>', '<UNK>')

#### Using a Trained Model

When it comes to `ngram` models the training boils down to counting up the `ngrams` from the training corpus.


In [28]:
print(lm.counts)

<NgramCounter with 2 ngram orders and 24 ngrams>


This provides a convenient interface to access counts for unigrams...

In [29]:
lm.counts['a']

2

...and bigrams (in this case "a b")

In [30]:
lm.counts[['a']]['b']

1

In [31]:
lm.score("a")

0.15384615384615385

Items that are not seen during training are mapped to the vocabulary's
"unknown label" token. This is "<UNK>" by default.

Here's how you get the score for a word given some preceding context. For example, we want to know the probability that "b" follows "a".

In [39]:
lm.score("b", ["a"])

0.5
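These two scores can be reproduced by hand, which makes the "relative frequency" idea explicit. The sketch below recounts the toy corpus in plain Python; `mle_unigram` and `mle_bigram` are illustrative names, and the code is a simplified stand-in for what the model does, not its actual implementation.

```python
from collections import Counter

text = [['a', 'b', 'c'],
        ['a', 'c', 'd', 'c', 'e', 'f']]
padded = [['<s>'] + sent + ['</s>'] for sent in text]

# Count single tokens and adjacent token pairs over the padded sentences.
unigram_counts = Counter(tok for sent in padded for tok in sent)
bigram_counts = Counter(pair for sent in padded for pair in zip(sent, sent[1:]))


def mle_unigram(word):
    """Relative frequency of `word` among all tokens."""
    return unigram_counts[word] / sum(unigram_counts.values())


def mle_bigram(word, context):
    """Relative frequency of `word` among all tokens that follow `context`."""
    context_total = sum(c for (prev, _), c in bigram_counts.items() if prev == context)
    return bigram_counts[(context, word)] / context_total if context_total else 0.0


p_a = mle_unigram('a')              # 2 occurrences out of 13 tokens
p_b_given_a = mle_bigram('b', 'a')  # "a b" seen once, "a" followed by something twice

print(p_a, p_b_given_a)
```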

To avoid underflow when working with many small score values it makes sense to
take their logarithm.
For convenience this can be done with the `logscore` method.

In [38]:
lm.logscore("a")

-2.700439718141092
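The value above is consistent with a base-2 logarithm of the unigram score 2/13. The sketch below checks that arithmetic and illustrates the underflow problem with made-up probabilities: multiplying many small numbers collapses to zero, while summing their logarithms stays finite.

```python
import math

# Base-2 log of the unigram score for "a" (2 occurrences out of 13 tokens).
log_p_a = math.log2(2 / 13)

# Illustrative values only: 200 words, each with probability 0.001.
probs = [0.001] * 200

product = 1.0
for p in probs:
    product *= p  # eventually underflows to exactly 0.0

log_total = sum(math.log2(p) for p in probs)  # stays a finite negative number

print(log_p_a)
print(product, log_total)
```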

#### Let's get some real data and tokenize it

In [32]:
try: # Use the default NLTK tokenizer.
    from nltk import word_tokenize, sent_tokenize 
    
    # Testing whether it works. 
    # Sometimes it doesn't work on some machines because of setup issues.
    word_tokenize(sent_tokenize("This is a foobar sentence. Yes it is.")[0])
    
except Exception:  # Fall back to a naive sentence tokenizer and toktok.
    import re
    from nltk.tokenize import ToktokTokenizer
    
    sent_tokenize = lambda x: re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', x)
    
    # Use the toktok tokenizer that requires no dependencies.
    toktok = ToktokTokenizer()
    word_tokenize = toktok.tokenize
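To see what the fallback regular expression does, here it is applied to a made-up two-sentence string using only the standard library. It splits on spaces that follow a period or question mark and precede a capital letter, unless the character two positions before the punctuation is itself a capital, which keeps short abbreviations such as "Dr." intact. The sample text is an assumption for illustration.

```python
import re


def naive_sent_tokenize(text):
    """Split text into sentences with the fallback regex from the cell above."""
    return re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', text)


sentences = naive_sent_tokenize("Dr. Smith arrived. He sat down.")
print(sentences)
```

This heuristic is deliberately crude: it will miss sentence boundaries after exclamation marks and after sentences that end in a two-letter capitalized word.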

In [33]:
import os
import requests
import io #codecs

In [34]:
location = r'D:\MYLEARN\DATASETS\language-never-random.txt'

In [35]:
with io.open(location, encoding='utf8') as fin:
    text = fin.read()

In [36]:
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) for sent in sent_tokenize(text)]

In [37]:
len(tokenized_text)

235

In [38]:
print(tokenized_text[0])

['language', 'is', 'never', ',', 'ever', ',', 'ever', ',', 'random', 'adam', 'kilgarriff', 'abstract', 'language', 'users', 'never', 'choose', 'words', 'randomly', ',', 'and', 'language', 'is', 'essentially', 'non-random', '.']


In [39]:
print(tokenized_text[1])

['statistical', 'hypothesis', 'testing', 'uses', 'a', 'null', 'hypothesis', ',', 'which', 'posits', 'randomness', '.']


In [40]:
print(tokenized_text[2])

['hence', ',', 'when', 'we', 'look', 'at', 'linguistic', 'phenomena', 'in', 'cor-', 'pora', ',', 'the', 'null', 'hypothesis', 'will', 'never', 'be', 'true', '.']


- Preprocess the tokenized text for 3-grams language modelling

In [42]:
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

#### Training an N-gram Model

In [43]:
from nltk.lm import MLE
model = MLE(n)  # Let's train a 3-gram model; n was set to 3 above.

Initializing the MLE model creates an empty vocabulary.

In [44]:
len(model.vocab)

0

In [45]:
model.fit(train_data, padded_sents)
print(model.vocab)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 1391 items>


In [46]:
len(model.vocab)

1391

The vocabulary helps us handle words that have not occurred during training.

In [47]:
print(model.vocab.lookup(tokenized_text[0]))

('language', 'is', 'never', ',', 'ever', ',', 'ever', ',', 'random', 'adam', 'kilgarriff', 'abstract', 'language', 'users', 'never', 'choose', 'words', 'randomly', ',', 'and', 'language', 'is', 'essentially', 'non-random', '.')


In [48]:
# If we look up unseen sentences that are not from the training data,
# the vocabulary automatically replaces words it does not know with `<UNK>`.
print(model.vocab.lookup('language is never random blah blah blah.'.split()))

('language', 'is', 'never', 'random', '<UNK>', '<UNK>', '<UNK>')


In some cases we want to `ignore` words that we did see during training but that `didn't occur frequently enough` to provide us with useful information.

You can tell the vocabulary to ignore such words with the `unk_cutoff` argument of the `Vocabulary` class.
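The effect of a frequency cutoff can be sketched without NLTK. The assumption here is that a word counts as "known" when it occurs at least `unk_cutoff` times, and is otherwise mapped to the unknown label; `build_vocab` and `lookup` are hypothetical helpers, so check the `Vocabulary` documentation for the exact behaviour.

```python
from collections import Counter


def build_vocab(tokens, unk_cutoff=1):
    """Keep only words seen at least `unk_cutoff` times (hypothetical helper)."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= unk_cutoff}


def lookup(words, vocab, unk_label="<UNK>"):
    """Replace every word outside the vocabulary with the unknown label."""
    return tuple(word if word in vocab else unk_label for word in words)


tokens = ['a', 'b', 'c', 'a', 'c', 'd', 'c', 'e', 'f']

vocab_all = build_vocab(tokens, unk_cutoff=1)       # every seen word is known
vocab_frequent = build_vocab(tokens, unk_cutoff=2)  # only 'a' and 'c' survive

known_all = lookup(['a', 'b', 'c'], vocab_all)
known_frequent = lookup(['a', 'b', 'c'], vocab_frequent)

print(known_all)
print(known_frequent)
```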

#### Using the N-gram Language Model

When it comes to ngram models the training boils down to `counting up` the ngrams from the training corpus.

In [49]:
print(model.counts)

<NgramCounter with 3 ngram orders and 19611 ngrams>


In [50]:
model.counts['language'] # i.e. Count('language')

25

...and bigrams for the phrase "language is"

In [51]:
model.counts[['language']]['is'] # i.e. Count('is'|'language')

11

In [52]:
model.counts[['language']]['model'] # i.e. Count('model'|'language')

0

... and trigrams for the phrase "language is never"

In [67]:
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')

7

The real purpose of training a language model is to have it `score` how probable words are in certain contexts.

Being an MLE, the model returns the item's `relative frequency` as its score.

In [68]:
model.score('language') # P('language')

0.003691671588895452

In [69]:
model.score('is', 'language'.split())  # P('is'|'language')

0.44

In [70]:
model.score('never', 'language is'.split())  # P('never'|'language is')

0.6363636363636364

Items that are not seen during training are mapped to the vocabulary's "unknown label" token. This is "<UNK>" by default.

In [71]:
model.score("<UNK>") == model.score("lah")

True

In [72]:
model.score("<UNK>") == model.score("lor")

True

To `avoid underflow` when working with many small score values it makes sense to take their logarithm.

For convenience this can be done with the logscore method.

In [73]:
model.logscore("never", "language is".split())

-0.6520766965796932

#### Generation using N-gram Language Model

One cool feature of ngram models is that they can be used to generate text.

In [53]:
print(model.generate(20, random_seed=7))

['and', 'carroll', 'used', 'hypothesis', 'testing', 'has', 'been', 'used', ',', 'and', 'a', 'half', '.', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']
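Conceptually, generation repeatedly samples the next word in proportion to how often it followed the current context in training. The sketch below does this for a bigram model over the toy character corpus, in plain Python; `sample_sentence` is a hypothetical helper, and the way NLTK's `generate` draws its samples may differ, so the outputs are not expected to match NLTK's for the same seed.

```python
import random
from collections import Counter, defaultdict

text = [['a', 'b', 'c'],
        ['a', 'c', 'd', 'c', 'e', 'f']]

# For every token, count which tokens followed it in the padded sentences.
followers = defaultdict(Counter)
for sent in text:
    padded = ['<s>'] + sent + ['</s>']
    for prev, nxt in zip(padded, padded[1:]):
        followers[prev][nxt] += 1


def sample_sentence(max_words, seed):
    """Sample up to max_words tokens, stopping at the end-of-sentence symbol."""
    rng = random.Random(seed)
    word, output = '<s>', []
    for _ in range(max_words):
        candidates = followers[word]
        # Draw the next word with probability proportional to its count.
        word = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if word == '</s>':
            break
        output.append(word)
    return output


sample = sample_sentence(10, seed=7)
print(sample)
```

Fixing the seed makes the sample reproducible, which is the same role `random_seed` plays in the calls above.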


In [75]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

In [76]:
detokenize = TreebankWordDetokenizer().detokenize

In [79]:
def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        
        if token == '<s>':
            continue
            
        if token == '</s>':
            break
            
        content.append(token)
        
    return detokenize(content)

In [80]:
generate_sent(model, 20, random_seed=7)

'and carroll used hypothesis testing has been used, and a half.'

In [81]:
print(model.generate(28, random_seed=0))

['the', 'scf-verb', 'link', 'is', 'motivated', '.', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [82]:
generate_sent(model, 28, random_seed=0)

'the scf-verb link is motivated.'

In [58]:
generate_sent(model, 20, random_seed=1)

'237⫺246.'

In [59]:
generate_sent(model, 20, random_seed=30)

'hypothesis is ever a useful construct.'

In [60]:
generate_sent(model, 20, random_seed=42)

'more (or cold) weather, or on saturday nights, or by people in (or poorer )'

## Let's try some generation with Donald Trump data!

In [83]:
import pandas as pd

In [86]:
location = r'D:\MYLEARN\datasets\Donald-Tweets!.csv'

In [87]:
df = pd.read_csv(location)

In [88]:
df.head()

| Unnamed: 0 | Date | Time | Tweet_Text | Type | Media_Type | Hashtags | Tweet_Id | Tweet_Url | twt_favourites_IS_THIS_LIKE_QUESTION_MARK | Retweets | Unnamed: 10 | Unnamed: 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16-11-11 | 15:26:37 | Today we express our deepest gratitude to all ... | text | photo | ThankAVet | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 127213 | 41112 | | |
| 1 | 16-11-11 | 13:33:35 | Busy day planned in New York. Will soon be mak... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 141527 | 28654 | | |
| 2 | 16-11-11 | 11:14:20 | Love the fact that the small groups of protest... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 183729 | 50039 | | |
| 3 | 16-11-11 | 2:19:44 | Just had a very open and successful presidenti... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/796... | 214001 | 67010 | | |
| 4 | 16-11-11 | 2:10:46 | A fantastic day in D.C. Met with President Oba... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/796... | 178499 | 36688 | | |


In [89]:
trump_corpus = list(df['Tweet_Text'].apply(word_tokenize))

In [90]:
# Preprocess the tokenized text for 3-grams language modelling
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, trump_corpus)

In [91]:
from nltk.lm import MLE
trump_model = MLE(n)  # Let's train a 3-gram model; n was set to 3 above.
trump_model.fit(train_data, padded_sents)

In [92]:
generate_sent(trump_model, num_words=20, random_seed=42)

'call!'

In [93]:
generate_sent(trump_model, num_words=10, random_seed=0)

'picks it up! Democrats numbers are down big in'

In [94]:
generate_sent(trump_model, num_words=50, random_seed=10)

'\\" @ ajbruno14: @ realDonaldTrump beautiful family! Best #SNL with @ realDonaldTrump You are a total joke . No clue on immigration now because he REPLACED his LEGAL cellphone?'

In [95]:
print(generate_sent(trump_model, num_words=100, random_seed=52))

will MAKE AMERICA GREAT AGAIN! https: /_
