# **N-Gram Language Modelling (LM) with Smoothing Using the NLTK Library**
NLTK's language-modelling tools work on tokens, so they can be used for any language: English, Hindi, Chinese, and so on.

NLTK does not scale well to very large amounts of data. For a huge corpus, an alternative is the [KenLM library](https://kheafield.com/code/kenlm/), which estimates models with modified Kneser-Ney smoothing and is fast and efficient.

KenLM supports only a few kinds of LM. Another alternative that also works well with large corpora and offers many more LM types is SRILM: [SRILM - The SRI Language Modeling Toolkit](http://www.speech.sri.com/projects/srilm/), [SRILM Python package](https://srilm-python.readthedocs.io/en/latest/)

In [2]:
# downloading some required libraries because they are not present in the root kernel of google colab by default
!pip install -U pip
!pip install -U dill
!pip install -U nltk==3.4


## **PreRequisites**

In [3]:
from nltk.util import bigrams
from nltk.util import ngrams

To train a bigram model, we first need to turn the text into bigrams. Here is what the sentences of a small toy text look like after applying the `bigrams` and `ngrams` functions from NLTK.

In [4]:
text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

In [5]:
list(bigrams(text[0]))

[('a', 'b'), ('b', 'c')]

In [6]:
list(ngrams(text[1], n=3))

[('a', 'c', 'd'), ('c', 'd', 'c'), ('d', 'c', 'e'), ('c', 'e', 'f')]

Notice how "b" occurs both as the first and second member of different bigrams but "a" and "c" don't? 

Wouldn't it be nice to somehow indicate how often sentences start with "a" and end with "c"?


A standard way to deal with this is to add special "padding" symbols to the sentence before splitting it into ngrams. The padding simply marks where a sentence starts and where it ends. Fortunately, NLTK also has a function for that; let's see what it does to the first sentence.


In [7]:
from nltk.util import pad_sequence
list(pad_sequence(text[0],
                  pad_left=True, left_pad_symbol="<start>",
                  pad_right=True, right_pad_symbol="</end>",
                  n=3)) # The n order of n-grams, if it's 2-grams, you pad once, 3-grams pad twice, etc. 

['<start>', '<start>', 'a', 'b', 'c', '</end>', '</end>']

In [8]:
list(pad_sequence(text[0],
                  pad_left=True, left_pad_symbol="<s>",
                  pad_right=True, right_pad_symbol="</s>",
                  n=3)) # The n order of n-grams, if it's 2-grams, you pad once, 3-grams pad twice, etc. 

['<s>', '<s>', 'a', 'b', 'c', '</s>', '</s>']

In [9]:
padded_sent = list(pad_sequence(text[0], pad_left=True, left_pad_symbol="<s>", 
                                pad_right=True, right_pad_symbol="</s>", n=2))
list(ngrams(padded_sent, n=2))

[('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]

In [10]:
padded_sent = list(pad_sequence(text[0], pad_left=True, left_pad_symbol="<s>", 
                                pad_right=True, right_pad_symbol="</s>", n=3))
list(ngrams(padded_sent, n=3))

[('<s>', '<s>', 'a'),
 ('<s>', 'a', 'b'),
 ('a', 'b', 'c'),
 ('b', 'c', '</s>'),
 ('c', '</s>', '</s>')]

Note the `n` argument: it tells the function which n-gram order to pad for. With `n=2` one padding symbol is added on each side; with `n=3`, two.

Now, passing all these parameters every time is tedious and in most cases they can be safely assumed as defaults anyway.

Thus the `nltk.lm` module provides a convenience function that has all these arguments already set while the other arguments remain the same as for `pad_sequence`.

In [11]:
from nltk.lm.preprocessing import pad_both_ends
list(pad_both_ends(text[0], n=2))

['<s>', 'a', 'b', 'c', '</s>']

Combining the two parts discussed so far we get the following preparation steps for one sentence.

In [12]:
list(bigrams(pad_both_ends(text[0], n=2)))

[('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]
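
With padding in place, the earlier question about how often sentences start with "a" and end with "c" becomes a matter of counting bigrams. The following is a minimal pure-Python sketch of that idea; the names `toy_text` and `pad` are illustrative and not part of NLTK.

```python
from collections import Counter

# Same toy corpus as above, renamed so it does not clobber `text`.
toy_text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

def pad(sentence):
    """Add one start and one end symbol, as needed for bigrams."""
    return ['<s>'] + list(sentence) + ['</s>']

bigram_counts = Counter()
for sentence in toy_text:
    padded = pad(sentence)
    # zip pairs each token with its successor, which yields the bigrams.
    bigram_counts.update(zip(padded, padded[1:]))

starts_with_a = bigram_counts[('<s>', 'a')]  # sentences that start with "a"
ends_with_c = bigram_counts[('c', '</s>')]   # sentences that end with "c"
print(starts_with_a, ends_with_c)
```

Both toy sentences start with "a", but only the first one ends with "c", and the padded bigram counts make that visible.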

To make our model more robust we could also train it on unigrams (single words) as well as bigrams, its main source of information.
NLTK once again helpfully provides a function called `everygrams`.

While not the most efficient, it is conceptually simple.

In [13]:
from nltk.util import everygrams
padded_bigrams = list(pad_both_ends(text[0], n=2))
list(everygrams(padded_bigrams, max_len=2))

[('<s>',),
 ('a',),
 ('b',),
 ('c',),
 ('</s>',),
 ('<s>', 'a'),
 ('a', 'b'),
 ('b', 'c'),
 ('c', '</s>')]

During training and evaluation our model will rely on a vocabulary that defines which words are "known" to the model.

To create this vocabulary we need to pad our sentences (just like for counting ngrams) and then combine the sentences into one flat stream of words.


In [14]:
from nltk.lm.preprocessing import flatten
list(flatten(pad_both_ends(sent, n=2) for sent in text))

['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']

In most cases we want to use the same text as the source for both vocabulary and ngram counts.

Now that we understand what this means for our preprocessing, we can simply import a function that does everything for us.

In [15]:
from nltk.lm.preprocessing import padded_everygram_pipeline
train, vocab = padded_everygram_pipeline(2, text)

So as to avoid re-creating the text in memory, both `train` and `vocab` are lazy iterators. They are evaluated on demand at training time.

For the sake of understanding the output of `padded_everygram_pipeline`, we'll "materialize" the lazy iterators by casting them into a list.

In [16]:
training_ngrams, padded_sentences = padded_everygram_pipeline(2, text)
for ngramlize_sent in training_ngrams:
    print(list(ngramlize_sent))
    print()
print('#############')
list(padded_sentences)

[('<s>',), ('a',), ('b',), ('c',), ('</s>',), ('<s>', 'a'), ('a', 'b'), ('b', 'c'), ('c', '</s>')]

[('<s>',), ('a',), ('c',), ('d',), ('c',), ('e',), ('f',), ('</s>',), ('<s>', 'a'), ('a', 'c'), ('c', 'd'), ('d', 'c'), ('c', 'e'), ('e', 'f'), ('f', '</s>')]

#############


['<s>', 'a', 'b', 'c', '</s>', '<s>', 'a', 'c', 'd', 'c', 'e', 'f', '</s>']
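
To make the behaviour of the pipeline more concrete, here is a rough pure-Python sketch of what it appears to do for the toy corpus. It is not NLTK's implementation: the helper names (`pad_both`, `all_grams`, `everygram_pipeline_sketch`) are hypothetical, and the n-gram ordering simply follows the output shown above.

```python
from itertools import chain

def pad_both(sentence, n):
    """Pad with n-1 start and n-1 end symbols."""
    return ['<s>'] * (n - 1) + list(sentence) + ['</s>'] * (n - 1)

def all_grams(tokens, max_len):
    """Yield every n-gram of length 1..max_len, shortest first."""
    for size in range(1, max_len + 1):
        for i in range(len(tokens) - size + 1):
            yield tuple(tokens[i:i + size])

def everygram_pipeline_sketch(order, sentences):
    """Return two lazy iterators: n-grams per sentence, and a flat word stream."""
    ngram_stream = (all_grams(pad_both(s, order), order) for s in sentences)
    vocab_stream = chain.from_iterable(pad_both(s, order) for s in sentences)
    return ngram_stream, vocab_stream

toy_text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
grams, words = everygram_pipeline_sketch(2, toy_text)

first_sentence_grams = list(next(grams))  # consume the first sentence only
flat_words = list(words)                  # consume the whole word stream
leftover_words = list(words)              # a second pass finds nothing left

print(first_sentence_grams)
print(flat_words)
print(leftover_words)
```

The empty `leftover_words` list illustrates a practical consequence of lazy iterators: they can be consumed only once, so the pipeline has to be called again before fitting another model.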

## **Example 1**

### **Let's get some real data and tokenize it**

In [23]:
try:  # Use the default NLTK tokenizers.
    from nltk import word_tokenize, sent_tokenize
    # Check that they work. This can fail on some machines because of setup issues,
    # for example when the tokenizer data has not been downloaded.
    word_tokenize(sent_tokenize("This is a foobar sentence. Yes it is.")[0])
except Exception:  # Fall back to a naive sentence splitter and toktok.
    import re
    from nltk.tokenize import ToktokTokenizer
    # See https://stackoverflow.com/a/25736515/610569
    sent_tokenize = lambda x: re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', x)
    # Use the toktok tokenizer, which requires no dependencies.
    toktok = ToktokTokenizer()
    word_tokenize = toktok.tokenize

In [24]:
import os
import requests
import io 

# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

In [25]:
print(text[:500])

                       Language is never, ever, ever, random

                                                               ADAM KILGARRIFF




Abstract
Language users never choose words randomly, and language is essentially
non-random. Statistical hypothesis testing uses a null hypothesis, which
posits randomness. Hence, when we look at linguistic phenomena in cor-
pora, the null hypothesis will never be true. Moreover, where there is enough
data, we shall (almost) always be able to establish 


In [26]:
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                  for sent in sent_tokenize(text)]

In [27]:
tokenized_text[0]

['language',
 'is',
 'never',
 ',',
 'ever',
 ',',
 'ever',
 ',',
 'random',
 'adam',
 'kilgarriff',
 'abstract',
 'language',
 'users',
 'never',
 'choose',
 'words',
 'randomly',
 ',',
 'and',
 'language',
 'is',
 'essentially',
 'non-random',
 '.']

In [28]:
# Preprocess the tokenized text for 3-grams language modelling
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

### **Training an N-gram Model**

Having prepared our data we are ready to start training a model. As a simple example, let us train a Maximum Likelihood Estimator (MLE).

We only need to specify the highest ngram order to instantiate it.

In [29]:
from nltk.lm import MLE
model = MLE(n)  # Let's train a 3-gram model; we set n = 3 above.
# nltk.lm offers several language models; MLE is one of them. Others include:
# `Lidstone`: Provides Lidstone-smoothed scores.
# `Laplace`: Implements Laplace (add-one) smoothing.
# `InterpolatedLanguageModel`: Logic common to all interpolated language models (Chen & Goodman 1995).
# `WittenBellInterpolated`: Interpolated version of Witten-Bell smoothing.

# These models share the same interface; only the way the probabilities are
# calculated (that is, the smoothing) differs.
# For more details, see `nltk.lm.models`: https://github.com/nltk/nltk/blob/develop/nltk/lm/models.py
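
As a small illustration of how a smoothed model differs from MLE, the sketch below fits `MLE` and `Laplace` on the two toy sentences from the prerequisites and scores a bigram that never occurs. The names `toy_text`, `fit_toy`, `mle_toy` and `laplace_toy` are made up for this example and are kept separate from the `model` used in the rest of this section.

```python
from nltk.lm import MLE, Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

toy_text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

def fit_toy(model_class, order=2):
    """Fit a fresh model; the pipeline iterators can only be used once."""
    train, vocab = padded_everygram_pipeline(order, toy_text)
    lm = model_class(order)
    lm.fit(train, vocab)
    return lm

mle_toy = fit_toy(MLE)
laplace_toy = fit_toy(Laplace)

# The bigram ("c", "b") never occurs in the toy corpus.
mle_unseen = mle_toy.score('b', ['c'])
laplace_unseen = laplace_toy.score('b', ['c'])
print(mle_unseen, laplace_unseen)
```

MLE assigns the unseen bigram a probability of zero, whereas add-one smoothing reserves a small amount of probability mass for it. The remainder of this example continues with the unsmoothed `MLE` model.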



Initializing the MLE model creates an empty vocabulary

In [30]:
len(model.vocab) # before training, the vocabulary size is zero

0

... which gets filled as we fit the model.

In [31]:
model.fit(train_data, padded_sents)
print(model.vocab)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 1429 items>


In [32]:
len(model.vocab) # hence after training we get a non zero vocabulary size

1429

The vocabulary helps us handle words that have not occurred during training.

In [33]:
print(model.vocab.lookup(tokenized_text[0])) #looking at the vocabulary

('language', 'is', 'never', ',', 'ever', ',', 'ever', ',', 'random', 'adam', 'kilgarriff', 'abstract', 'language', 'users', 'never', 'choose', 'words', 'randomly', ',', 'and', 'language', 'is', 'essentially', 'non-random', '.')


In [34]:
# If we look up unseen sentences that are not from the training data, words missing from the vocabulary are automatically replaced with `<UNK>`.
print(model.vocab.lookup('language is never random lah .'.split()))

('language', 'is', 'never', 'random', '<UNK>', '.')


*Moreover*, in some cases we want to ignore words that we did see during training but that didn't occur frequently enough to provide useful information.

You can tell the vocabulary to ignore such words with the `unk_cutoff` argument, which is set when the vocabulary is created. To find out how that works, check out the docs for the [`nltk.lm.vocabulary.Vocabulary` class](https://github.com/nltk/nltk/blob/develop/nltk/lm/vocabulary.py).
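
A minimal sketch of the cutoff in action, using a made-up token stream rather than the Kilgarriff text. The variable names `words`, `strict_vocab` and `looked_up` are illustrative.

```python
from nltk.lm import Vocabulary

# Hypothetical toy token stream: "-", "b" and "r" occur only once.
words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']

# Tokens seen fewer than 2 times are treated as unknown.
strict_vocab = Vocabulary(words, unk_cutoff=2)

looked_up = strict_vocab.lookup(['a', 'b', 'c', 'd', 'zebra'])
print(looked_up)
```

Here "b" was seen once, which is below the cutoff, so it is mapped to `<UNK>` just like the never-seen "zebra", while "d" (seen twice) is kept.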

### **Using the N-gram Language Model**

When it comes to ngram models the training boils down to counting up the ngrams from the training corpus.

In [35]:
print(model.counts) # the vocabulary has about 1,400 distinct items, but the counter holds 18,687 n-grams because it tallies every occurrence of every unigram, bigram and trigram

<NgramCounter with 3 ngram orders and 18687 ngrams>


This provides a convenient interface to access counts for unigrams...

In [36]:
model.counts['language'] # i.e. Count('language'): "language" occurred 25 times

25

...and bigrams for the phrase "language is"

In [37]:
model.counts[['language']]['is'] # i.e. Count('is' | 'language'): "language is" occurred 11 times

11

... and trigrams for the phrase "language is never"

In [38]:
model.counts[['language', 'is']]['never'] # i.e. Count('never' | 'language is'): "language is never" occurred 7 times

7

And so on. However, the real purpose of training a language model is to have it score how probable words are in certain contexts.

This being MLE, the model returns the item's relative frequency as its score.

In [39]:
model.score('language') # P('language') # score is a probability value while count is not a probability value

0.003916040100250626

In [40]:
model.score('is', 'language'.split())  # P('is'|'language')

0.44

In [41]:
model.score('never', 'language is'.split())  # P('never'|'language is')

0.6363636363636364
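
The conditional scores above can be reproduced by hand from the counts reported earlier, since an MLE score is simply the n-gram count divided by the count of its context. The variable names below are illustrative.

```python
# Counts reported by model.counts earlier in this example.
count_language = 25           # Count('language')
count_language_is = 11        # Count('language is')
count_language_is_never = 7   # Count('language is never')

# MLE conditional probability = n-gram count / context count.
p_is_given_language = count_language_is / count_language
p_never_given_language_is = count_language_is_never / count_language_is

print(p_is_given_language)
print(p_never_given_language_is)
```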

Items that are not seen during training are mapped to the vocabulary's "unknown label" token.  This is "<UNK>" by default.


In [42]:
model.score("<UNK>") == model.score("lah")

True

In [43]:
model.score("<UNK>") == model.score("leh")

True

In [44]:
model.score("<UNK>") == model.score("lor")

True

To avoid underflow when working with many small score values it makes sense to take their logarithm. 

For convenience this can be done with the `logscore` method.


In [45]:
model.logscore("never", "language is".split()) # log base 2

-0.6520766965796932
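
The value above is the base-2 logarithm of the score 7/11 computed earlier. The short sketch below checks this with the standard library and also shows why log scores help: the product of many small probabilities underflows to zero, while the sum of their logarithms does not. The list `tiny_probs` is a made-up example.

```python
import math

p_never = 7 / 11                  # P('never' | 'language is') from above
log_p_never = math.log2(p_never)  # base-2 logarithm of that score

# Multiplying many small probabilities underflows to 0.0 ...
tiny_probs = [1e-5] * 80
product = math.prod(tiny_probs)

# ... while summing their base-2 logarithms stays finite.
log_sum = sum(math.log2(p) for p in tiny_probs)

print(log_p_never)
print(product, log_sum)
```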

### **Generation using N-gram Language Model**

One cool feature of ngram models is that they can be used to generate text.

In [46]:
print(model.generate(20, random_seed=7))

['ate', 'inferences', 'are', 'drawn.', '2', '.', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [47]:
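# Note: text_seed is expected to be a sequence of tokens. A plain string is read character by character,
# so pass "the problem is".split() instead to condition the output on these three words.
print(model.generate(20,text_seed="the problem is", random_seed=7))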

['exact', 'method', 'can', 'be', 'applied', 'to', 'em-', 'pirical', 'linguistics', 'in', 'gale', 'and', 'sampson', '(', '1995', ')', ',', '33⫺46.', 'brent', ',']


We can clean up the generated tokens to make the output more readable.

In [48]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.models`.
    :param num_words: Maximum number of words to generate.
    :param random_seed: Seed value for the random number generator.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

In [49]:
generate_sent(model, 20, random_seed=7)

'ate inferences are drawn. 2.'

In [50]:
print(model.generate(28, random_seed=0))

['the', 'trouble', 'with', 'quantitative', 'studies', '.', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [51]:
generate_sent(model, 28, random_seed=0)

'the trouble with quantitative studies.'

In [52]:
generate_sent(model, 20, random_seed=1)

'29⫺50. manning, christopher and hinrich schütze 1999 foundations of statistical independence.'

In [53]:
generate_sent(model, 20, random_seed=30)

'information glut, is inappropriate, particularly where counts are low.'

In [54]:
generate_sent(model, 20, random_seed=42)

'not random, or to refer to items that are more common or more salient in the last paragraph is'

### **Saving the model** 

Python's built-in `pickle` may fail to serialize the lambda functions in the model, so we use the `dill` library in its place to save and load the language model.


In [55]:
import dill as pickle 

#saving the model
with open('kilgariff_ngram_model.pkl', 'wb') as fout:
    pickle.dump(model, fout)

In [56]:
# Load the saved model
with open('kilgariff_ngram_model.pkl', 'rb') as fin:
    model_loaded = pickle.load(fin)

In [57]:
# The loaded model produces the same output as the original
generate_sent(model_loaded, 20, random_seed=42)

'not random, or to refer to items that are more common or more salient in the last paragraph is'

## **Example 2**


In [None]:
import pandas as pd
filepath = "/content/drive/MyDrive/ACM NLP Summer School 2021/Day 3 - Language Modeling/Data/Donald-Tweets!.csv"
df = pd.read_csv(filepath)
df.head()

Unnamed: 0,Date,Time,Tweet_Text,Type,Media_Type,Hashtags,Tweet_Id,Tweet_Url,twt_favourites_IS_THIS_LIKE_QUESTION_MARK,Retweets,Unnamed: 10,Unnamed: 11
0,16-11-11,15:26:37,Today we express our deepest gratitude to all ...,text,photo,ThankAVet,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,127213,41112,,
1,16-11-11,13:33:35,Busy day planned in New York. Will soon be mak...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,141527,28654,,
2,16-11-11,11:14:20,Love the fact that the small groups of protest...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,183729,50039,,
3,16-11-11,2:19:44,Just had a very open and successful presidenti...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/796...,214001,67010,,
4,16-11-11,2:10:46,A fantastic day in D.C. Met with President Oba...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/796...,178499,36688,,


In [None]:
trump_corpus = list(df['Tweet_Text'].apply(word_tokenize))

In [None]:
# Preprocess the tokenized text for 3-grams language modelling
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, trump_corpus)

In [None]:
from nltk.lm import MLE
trump_model = MLE(n) # Let's train a 3-gram model; n = 3 was set in the previous cell
trump_model.fit(train_data, padded_sents)

In [None]:
generate_sent(trump_model, num_words=20, random_seed=42)

'do so many people on television. Just another desperate move by the media pile on against me in Rome ,'

In [None]:
generate_sent(trump_model, num_words=10, random_seed=0)

'pretty sad situation. Go Jeb! You made winning MAJORS'

In [None]:
generate_sent(trump_model, num_words=50, random_seed=10)

'and many other subjects! Bad times for divided USA! +Israel2 "'

In [None]:
print(generate_sent(trump_model, num_words=100, random_seed=52))

with you sir! Your friend and supporter tonight at Trump Winery Beautiful fall foliage


# **N-Gram Language Modelling (LM) without Smoothing, Without Using a Library**

In [None]:
# importing required libraries
import random  # used by NgramModel.next_token_selection
import string
from typing import List

In [None]:
# A simple tokenizer. Ideally we would use one of the smarter tokenization methods discussed earlier, but this keeps things simple.
# It takes a sentence (its only parameter) and returns the list of tokens in that sentence.

def tokenize(text: str) -> List[str]:
    # Surround each punctuation mark with spaces so that it becomes its own token.
    for punct in string.punctuation:
        text = text.replace(punct, ' '+punct+' ')
    t = text.split()
    return t
    

In [None]:
# Takes the n-gram order n and the list of tokens in a sentence, and returns a list of n-grams
# as tuples of the form ((previous words), target word), with suitable start padding.

def get_ngrams(n: int, tokens: list) -> list:
    # tokens.append('<END>')
    tokens = (n-1)*['<START>']+tokens
    # The context is the n-1 tokens immediately before position i.
    l = [(tuple(tokens[i-n+1:i]), tokens[i]) for i in range(n-1, len(tokens))]
    return l


In [None]:
# Attributes: `context` and `ngram_counter` are dictionaries that store, respectively, the candidate words for each context and the n-gram counts
class NgramModel(object):

  #constructor
  def __init__(self, n):
    self.n = n
    self.context = {}     # dictionary that keeps list of candidate words given context
    self.ngram_counter = {}     # keeps track of how many times ngram has appeared in the text before

  # Takes a sentence, iterates over its n-grams, and updates the counts in both dictionaries
  def update(self, sentence: str) -> None:
    n = self.n
    ngrams = get_ngrams(n, tokenize(sentence))
    for ngram in ngrams:
      if ngram in self.ngram_counter:
        self.ngram_counter[ngram] += 1.0
      else:
        self.ngram_counter[ngram] = 1.0

      prev_words, target_word = ngram
      if prev_words in self.context:
        self.context[prev_words].append(target_word)
      else:
        self.context[prev_words] = [target_word]
  
  # Returns the conditional probability of a candidate token given a context (0.0 if the pair was never seen)
  def prob(self, context, token):
    try:
      count_of_token = self.ngram_counter[(context, token)]
      count_of_context = float(len(self.context[context]))
      result = count_of_token / count_of_context

    except KeyError:
      result = 0.0

    return result

  # Given a context, randomly select the next word in proportion to its conditional probability
  def next_token_selection(self, context):
    r = random.random()
    map_to_probs = {}
    token_of_interest = self.context[context]
    for token in token_of_interest:
      map_to_probs[token] = self.prob(context, token)

    summ = 0
    for token in sorted(map_to_probs):
      summ += map_to_probs[token]
      if summ > r:
        return token
    # Floating-point rounding can leave summ marginally below r; fall back to the last token
    return token

  # Generates token_count tokens, starting from a context of n-1 padded <START> tokens
  def generate_text(self, token_count: int):
    n = self.n
    context_queue = (n - 1) * ['<START>']
    result = []
    for _ in range(token_count):
      obj = self.next_token_selection(tuple(context_queue))
      result.append(obj)
      if n > 1:
        context_queue.pop(0)
        if obj == '.':
          context_queue = (n - 1) * ['<START>']
        else:
          context_queue.append(obj)
    return ' '.join(result)
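
`next_token_selection` picks a token by walking the cumulative distribution until it passes a uniform random draw. The standalone sketch below shows the same idea on a hypothetical distribution and, for comparison, `random.choices` from the standard library, which also draws in proportion to the given weights. The names `sample_token` and `toy_probs` are illustrative and not part of the class above.

```python
import random
from collections import Counter

def sample_token(probs, rng):
    """Draw one token by walking the cumulative distribution."""
    r = rng.random()
    cumulative = 0.0
    token = None
    for token in sorted(probs):
        cumulative += probs[token]
        if cumulative > r:
            return token
    # Rounding can leave the total just below r; return the last token.
    return token

toy_probs = {'cat': 0.5, 'dog': 0.3, 'emu': 0.2}
rng = random.Random(0)  # fixed seed so the run is repeatable

draws = Counter(sample_token(toy_probs, rng) for _ in range(10_000))
print(draws.most_common())

# Standard-library alternative that also samples according to the weights.
tokens, weights = zip(*toy_probs.items())
print(rng.choices(tokens, weights=weights, k=5))
```

Over many draws the tokens appear roughly in proportion to their probabilities, which is what makes the generated text varied rather than always following the single most likely continuation.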

In [None]:
def create_ngram_model(n, path):
  m = NgramModel(n)
  with open(path, 'r') as f:
    text = f.read()
    text = text.split('.')
    for sentence in text:
      # add back the fullstop
      sentence += '.'
      m.update(sentence)
  return m

In [None]:
m = create_ngram_model(6, "/content/drive/MyDrive/001 My Skills/002 CS Engineering   Automated Math (BPHC)/004 Data Science (DS)   Artificial Intelligence (AI)/005 Textual Data (Unstructured Data) (Sequential Data)/001 Mono Lingual Language /001 English/001 Language Modelling (Predicting Words Phrases Sentences)/001 Next Word Phrase Sentence Prediction/001 Probability Based Algorithms/Frankenstein.txt")

In [None]:
print("Generated text:\n")
print(m.generate_text(20)) # generate 20 tokens from the model trained on the text file above

Generated text:

What could I do ? He meant to please , and he tormented me . Suddenly , as I gazed


**Exercise** 

1.   Modify NgramModel.prob() to implement any smoothing technique.
2.   Modify NgramModel.next_token_selection() to return the token with the maximum count/highest likelihood.