# Introduction to N-Grams with NLTK

## Simple API overview

In [1]:
text = [['I','want','to','go','home'], ['This', 'file', 'contains', 'a', 'critical', 'bug']]

Getting all bigrams

In [2]:
from nltk.util import bigrams
list(bigrams(text[0]))

[('I', 'want'), ('want', 'to'), ('to', 'go'), ('go', 'home')]
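For intuition, a bigram is simply each token paired with its successor. The following is a minimal plain-Python sketch of that idea; the helper name `bigrams_sketch` is made up for illustration and is not part of NLTK.

```python
def bigrams_sketch(tokens):
    """Pair every token with the token that follows it."""
    tokens = list(tokens)
    return list(zip(tokens, tokens[1:]))

sentence = ['I', 'want', 'to', 'go', 'home']
pairs = bigrams_sketch(sentence)
print(pairs)
```

A sentence of five tokens gives four bigrams, one fewer than the number of tokens.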

Padding the sentence on both ends to mark its beginning and end

In [3]:
from nltk.lm.preprocessing import pad_both_ends
paddedSent=list(pad_both_ends(text[0], n=2))
print(paddedSent)

['<s>', 'I', 'want', 'to', 'go', 'home', '</s>']
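With n=2 the sentence receives one boundary symbol on each side; in general, n-1 symbols are added per side so that the first and last real words each appear in a full n-gram. A plain-Python sketch of that rule follows (the function name `pad_both_ends_sketch` is hypothetical, and this is not NLTK's own code):

```python
def pad_both_ends_sketch(tokens, n, left="<s>", right="</s>"):
    """Add n-1 start symbols before and n-1 end symbols after the tokens."""
    return [left] * (n - 1) + list(tokens) + [right] * (n - 1)

words = ['I', 'want', 'to', 'go', 'home']
padded_bigram = pad_both_ends_sketch(words, n=2)
padded_trigram = pad_both_ends_sketch(words, n=3)
print(padded_bigram)
print(padded_trigram)
```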


Bigrams again, this time over the padded sentence

In [4]:
list(bigrams(paddedSent))

[('<s>', 'I'),
 ('I', 'want'),
 ('want', 'to'),
 ('to', 'go'),
 ('go', 'home'),
 ('home', '</s>')]

Everygrams: every n-gram of length 1 up to max_len

In [5]:
from nltk import everygrams
list(everygrams(paddedSent, max_len=3))

[('<s>',),
 ('<s>', 'I'),
 ('<s>', 'I', 'want'),
 ('I',),
 ('I', 'want'),
 ('I', 'want', 'to'),
 ('want',),
 ('want', 'to'),
 ('want', 'to', 'go'),
 ('to',),
 ('to', 'go'),
 ('to', 'go', 'home'),
 ('go',),
 ('go', 'home'),
 ('go', 'home', '</s>'),
 ('home',),
 ('home', '</s>'),
 ('</s>',)]
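The listing above can be reproduced with two nested loops: for each start position, emit the n-grams of length 1, 2, ... up to max_len that still fit in the sentence. This is only a sketch under that assumption; `everygrams_sketch` is a made-up name, and the ordering of NLTK's own output may differ between versions.

```python
def everygrams_sketch(tokens, max_len):
    """Collect every n-gram of length 1..max_len, grouped by start position."""
    tokens = list(tokens)
    grams = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length <= len(tokens):
                grams.append(tuple(tokens[start:start + length]))
    return grams

padded = ['<s>', 'I', 'want', '</s>']
grams = everygrams_sketch(padded, max_len=3)
print(grams)
```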

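To train on all our sentences at once, padded_everygram_pipeline pads each sentence, produces its everygrams up to the given order, and also returns a flattened stream of the padded words from which the vocabulary is built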

In [6]:
from nltk.lm.preprocessing import padded_everygram_pipeline
train, vocab = padded_everygram_pipeline(2, text)

In [7]:
text

[['I', 'want', 'to', 'go', 'home'],
 ['This', 'file', 'contains', 'a', 'critical', 'bug']]

In [8]:
list(train)

[<generator object everygrams at 0x10eb647b0>,
 <generator object everygrams at 0x10eb646a0>]
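The output shows generators rather than n-grams because the pipeline is lazy. A lazy generator can be consumed only once, so listing it for inspection leaves nothing for a later fit, and the pipeline has to be called again before training. The sketch below imitates this behaviour in plain Python; `pad`, `everygrams_sketch` and `pipeline_sketch` are hypothetical stand-ins, not NLTK functions.

```python
from itertools import chain

def pad(tokens, n):
    """Add n-1 boundary symbols on each side (plain-Python stand-in)."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)

def everygrams_sketch(tokens, max_len):
    """Lazily yield every n-gram of length 1..max_len."""
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length <= len(tokens):
                yield tuple(tokens[start:start + length])

def pipeline_sketch(order, sentences):
    """Return (per-sentence n-gram generators, flat stream of padded words)."""
    train = (everygrams_sketch(pad(s, order), order) for s in sentences)
    vocab = chain.from_iterable(pad(s, order) for s in sentences)
    return train, vocab

sentences = [["I", "want", "to", "go", "home"], ["a", "critical", "bug"]]
train, vocab = pipeline_sketch(2, sentences)

# Materialize once if the n-grams need to be inspected.
train_lists = [list(sentence_grams) for sentence_grams in train]
print(train_lists[1][:3])

# The generator is now exhausted: iterating it again yields nothing.
leftover = list(train)
print(leftover)
```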

Importing the Maximum Likelihood Estimator and initializing it. The parameter is the maximum n-gram order the model will handle

In [9]:
from nltk.lm import MLE
MyModel = MLE(2)

In [10]:
MyModel

<nltk.lm.models.MLE at 0x10ac86d80>

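Fitting the model. train is a one-shot generator and list(train) above consumed it, so the pipeline is rebuilt first; otherwise the model would be fitted on no n-grams at all

In [11]:
train, vocab = padded_everygram_pipeline(2, text)
MyModel.fit(train, vocab)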

In [12]:
MyModel

<nltk.lm.models.MLE at 0x10ac86d80>

In [13]:
print(MyModel.vocab)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 14 items>


How big is its vocabulary?

In [14]:
len(MyModel.vocab)

14

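Looking up a word in the vocabulary: known words are returned unchanged, while words that are not in the vocabulary come back as the unknown label '<UNK>'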

In [15]:
MyModel.vocab.lookup("critical")

'critical'
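The lookup rule can be sketched in plain Python: a word counts as known if it was seen at least cutoff times, and anything else maps to the unknown label. This is an illustration only; `make_lookup` is a hypothetical helper, not NLTK's Vocabulary class.

```python
from collections import Counter

def make_lookup(words, unk_cutoff=1, unk_label="<UNK>"):
    """Build a lookup function that maps rare or unseen words to unk_label."""
    counts = Counter(words)

    def lookup(word):
        # Counter returns 0 for words it has never seen.
        return word if counts[word] >= unk_cutoff else unk_label

    return lookup

lookup = make_lookup(['<s>', 'I', 'want', 'to', 'go', 'home', '</s>'])
known = lookup("home")
unknown = lookup("spaceship")
print(known, unknown)
```

Raising the cutoff above 1 would also send words seen only a few times to the unknown label, which is a common way to keep the vocabulary small.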

## Training a model from a non-trivial corpus

Let's use Trump's tweets from this archive: https://github.com/MarkHershey/CompleteTrumpTweetsArchive

In [16]:
with open("tweets.csv") as f:
    lines=f.readlines()

Cleanup function

In [29]:
import re
from nltk import sent_tokenize, word_tokenize
def cleanUp(text):
    #strip surrounding whitespace, including the trailing newline
    text=text.strip()
    #remove @mentions and #hashtags (raw strings avoid invalid-escape warnings)
    text=re.sub(r"[@#]\S+","",text)
    #remove URLs
    text=re.sub(r"https?://\S+","",text)
    text=re.sub(r"pic\.twitter\.com\S+","",text)
    #tokenize the tweet into sentences
    sentences=sent_tokenize(text)
    corpus=[]
    for sentence in sentences:
        words=word_tokenize(sentence)
        #keep only tokens containing a word character, lowercased
        cleaned=[w.lower() for w in words if re.search(r"\w+",w)]
        corpus.append(cleaned)
    return corpus
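To see what the three substitutions do, here is a standard-library sketch run on a made-up tweet. The sample text and the helper name `strip_noise` are invented for illustration, and a plain whitespace split stands in for the NLTK tokenizers, so punctuation stays attached to words here.

```python
import re

def strip_noise(tweet):
    """Remove mentions, hashtags and links from a single tweet."""
    tweet = tweet.strip()
    tweet = re.sub(r"[@#]\S+", "", tweet)               # @mentions and #hashtags
    tweet = re.sub(r"https?://\S+", "", tweet)          # http and https links
    tweet = re.sub(r"pic\.twitter\.com\S+", "", tweet)  # embedded picture links
    return tweet

sample = "Thank you @FoxNews! #MAGA https://t.co/abc123 See you soon.\n"
stripped = strip_noise(sample)

# Whitespace split used instead of word_tokenize (an assumption of this sketch).
words = [w.lower() for w in stripped.split() if re.search(r"\w+", w)]
print(words)
```

Note that the pattern for mentions and hashtags runs to the next whitespace, so punctuation glued to a tag (the "!" after the mention in this sample) is removed along with it.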


Cleaning up all sentences

In [18]:
allSentences=[]
for line in lines:
    allSentences.extend(cleanUp(line))

Creating the training set and vocabulary, up to 3-grams

In [19]:
train, vocab = padded_everygram_pipeline(3, allSentences)

In [20]:
TrumpModel = MLE(3)
TrumpModel.fit(train, vocab)

Let's experiment with the model

In [21]:
TrumpModel.generate(2, ["i","will"],random_seed=3)

['be', 'interviewed']

In [22]:
TrumpModel.generate(2, ["make","america"],random_seed=5)

['great', 'again']
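Conceptually, generating from a maximum likelihood model means drawing the next word in proportion to how often it followed the context in the training data. The sketch below shows that idea for one-word contexts on a tiny invented corpus; it is a simplified illustration, not the sampling code NLTK uses, and all names in it are hypothetical.

```python
import random
from collections import Counter, defaultdict

sentences = [
    ["make", "america", "great", "again"],
    ["make", "america", "safe", "again"],
    ["make", "america", "great"],
]

# followers[prev][nxt] = how often nxt came directly after prev
followers = defaultdict(Counter)
for sentence in sentences:
    for prev, nxt in zip(sentence, sentence[1:]):
        followers[prev][nxt] += 1

def sample_next(prev, rng):
    """Draw one follower of prev, weighted by its observed count."""
    candidates = sorted(followers[prev])
    weights = [followers[prev][w] for w in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(3)  # fixed seed so repeated runs give the same draw
next_word = sample_next("america", rng)
print(next_word)
```

Here "great" follows "america" twice and "safe" once, so "great" is drawn about two times out of three.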

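Let's evaluate the perplexity of some short n-gram sequences. Each inner list is one n-gram whose last word is scored given the preceding words as context; the lower the perplexity, the less surprising the model finds the sequence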

In [23]:
TrumpModel.perplexity([['make','america'],['great','again']])

6.7922304393542845

In [24]:
TrumpModel.perplexity([['make','america'],['healthy']])

382.68996985395387
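Perplexity is 2 raised to the cross-entropy, where the cross-entropy is the negative average of the base-2 log probabilities of the scored n-grams. A worked sketch with invented probabilities (the function name is hypothetical, and all probabilities are assumed to be positive):

```python
import math

def perplexity_sketch(probabilities):
    """Return 2 ** (negative mean log2 probability)."""
    log_scores = [math.log2(p) for p in probabilities]
    entropy = -sum(log_scores) / len(log_scores)
    return 2 ** entropy

likely = perplexity_sketch([0.5, 0.25])     # both n-grams fairly probable
unlikely = perplexity_sketch([0.5, 0.001])  # second n-gram is rare
print(likely, unlikely)
```

A single rare n-gram is enough to push the value up sharply, which matches the jump seen when 'great again' is swapped for 'healthy'.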

In [25]:
TrumpModel.score("america")

0.0015395876571475454

In [26]:
TrumpModel.counts[['great']]['again']

273
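The last two cells are closely related: a maximum likelihood score is just a ratio of counts. Without context it is the word's share of all tokens; with a context it is the count of (context, word) divided by the count of everything that followed that context. A small sketch on invented data, with hypothetical names throughout:

```python
from collections import Counter

tokens = ["make", "america", "great", "again", "make", "america", "safe"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def mle_score(word, context=None):
    """Relative frequency of word, optionally given a one-word context."""
    if context is None:
        return unigram_counts[word] / sum(unigram_counts.values())
    # Total number of times the context was followed by any word.
    context_total = sum(
        count for (prev, _), count in bigram_counts.items() if prev == context
    )
    if context_total == 0:
        return 0.0  # context never seen with a follower
    return bigram_counts[(context, word)] / context_total

p_america = mle_score("america")
p_great_given_america = mle_score("great", "america")
print(p_america, p_great_given_america)
```

Because nothing is smoothed, any word that never followed the context gets probability 0.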