# Language Model

- Create a traditional ngram-based language model.
- Code adapted from [A comprehensive guide to build your own language model in python](https://medium.com/analytics-vidhya/a-comprehensive-guide-to-build-your-own-language-model-in-python-5141b3917d6d)

## Training a Trigram Language Model using Reuters

In [1]:
import nltk
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
%%time

# code courtesy of https://nlpforhackers.io/language-models/

from nltk.corpus import reuters
from nltk import trigrams
from collections import defaultdict

# Nested mapping: model[(w1, w2)][w3] holds a count, later normalized to a probability
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count how often each word w3 follows the two-word context (w1, w2)
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

CPU times: user 11.5 s, sys: 699 ms, total: 12.2 s
Wall time: 12.7 s


## Check Language Model

In [3]:
sorted(dict(model["the","news"]).items(), key=lambda x:-1*x[1])

[('conference', 0.25),
 ('of', 0.125),
 ('.', 0.125),
 ('with', 0.08333333333333333),
 (',', 0.08333333333333333),
 ('agency', 0.08333333333333333),
 ('that', 0.08333333333333333),
 ('brought', 0.041666666666666664),
 ('about', 0.041666666666666664),
 ('broke', 0.041666666666666664),
 ('on', 0.041666666666666664)]
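
The probabilities above are maximum-likelihood estimates: each trigram count is divided by the total count of its two-word context. A minimal sketch on a made-up three-sentence corpus makes the arithmetic easy to check by hand. The `toy_sentences` data and the `train_trigram_model` helper are illustrative assumptions, not part of the original notebook, and the padding is written out by hand to mirror what `trigrams(..., pad_left=True, pad_right=True)` does.

```python
from collections import Counter, defaultdict

def train_trigram_model(sentences):
    """Return {(w1, w2): {w3: P(w3 | w1, w2)}} using relative frequencies."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad with two None tokens on each side, as nltk does for trigrams
        padded = [None, None] + list(sentence) + [None, None]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1

    model = {}
    for context, next_words in counts.items():
        total = sum(next_words.values())
        model[context] = {w: c / total for w, c in next_words.items()}
    return model

# Hypothetical toy corpus: "conference" follows "the news" twice, "agency" once
toy_sentences = [
    ["the", "news", "conference"],
    ["the", "news", "conference"],
    ["the", "news", "agency"],
]
toy_model = train_trigram_model(toy_sentences)
print(toy_model[("the", "news")])
```

Here "conference" receives 2/3 of the probability mass and "agency" 1/3, and the probabilities within each context sum to 1.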

## Text Generation Using the Trigram Model

- Use the trigram model to predict the next word from the previous two words.
- The next word is sampled from the predicted probability distribution: a random threshold is drawn from [0, 1), and the first word whose cumulative probability reaches that threshold is selected.
- The text generator stops when two consecutive `None`s are predicted (signaling the end of the sentence).

In [11]:
# code courtesy of https://nlpforhackers.io/language-models/
import random
# https://alvinntnu.github.io/python-notes/nlp/language-model.html

# starting words
text = ["the", "news"]
sentence_finished = False

while not sentence_finished:
    # draw a random threshold in [0, 1)
    r = random.random()
    accumulator = 0.0
    context = tuple(text[-2:])

    for word, probability in model[context].items():
        accumulator += probability
        # pick the first word whose cumulative probability reaches the threshold
        if accumulator >= r:
            text.append(word)
            break

    # two consecutive None tokens are the right padding, i.e. the end of the sentence
    if text[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in text if t]))

the news , dipping on its enforcement staff .
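
The cumulative-probability loop above is a form of weighted random sampling, which the standard library also offers through `random.choices` and its `weights` argument. The sketch below is one possible way to wrap the generator in a reusable function. The `generate` helper, its `max_words` cap, and the hand-written `toy_model` are assumptions made for illustration; the real model trained on Reuters could be passed in instead.

```python
import random

def generate(model, seed_words, max_words=50, rng=None):
    """Sample words from a trigram model until the (None, None) padding appears."""
    rng = rng or random.Random()
    text = list(seed_words)
    while len(text) < max_words:
        distribution = model.get(tuple(text[-2:]))
        if not distribution:
            break  # unseen context: there is nothing to sample from
        words = list(distribution)
        weights = [distribution[w] for w in words]
        # Weighted draw, equivalent in effect to the accumulator loop
        text.append(rng.choices(words, weights=weights, k=1)[0])
        if text[-2:] == [None, None]:
            break  # end-of-sentence padding reached
    return " ".join(w for w in text if w is not None)

# Hypothetical hand-written model standing in for the trained one
toy_model = {
    ("the", "news"): {"conference": 0.5, "agency": 0.5},
    ("news", "conference"): {None: 1.0},
    ("news", "agency"): {None: 1.0},
    ("conference", None): {None: 1.0},
    ("agency", None): {None: 1.0},
}
generated = generate(toy_model, ["the", "news"], rng=random.Random(0))
print(generated)
```

Passing a seeded `random.Random` instance makes a run repeatable, and the `max_words` cap and the unseen-context check keep the loop from running indefinitely.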


In [12]:
import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

text = "This is a sample sentence, showing off the stop words filtration."

tokens = word_tokenize(text)
# Strip punctuation from each token and drop tokens that end up empty
cleaned_tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens if re.sub(r'[^\w\s]', '', token)]
print(cleaned_tokens)

['This', 'is', 'a', 'sample', 'sentence', 'showing', 'off', 'the', 'stop', 'words', 'filtration']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Issues of Ngram Language Model

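- The ngram size is of key importance. A higher-order ngram conditions on more context and can give sharper predictions, but it increases the computational and memory cost and worsens data sparseness, because most long word sequences never occur in the training corpus.
- Unseen ngrams are always a concern: the count-based model above assigns them a probability of zero.
- Smoothing is needed to reserve some probability mass for unseen ngrams, and choosing a smoothing method is a problem in its own right.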
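
One common remedy for unseen ngrams is add-k (Laplace) smoothing, which adds a small constant to every count before normalizing. The following is a minimal sketch under the assumption that raw counts are kept instead of normalized probabilities; `laplace_probability`, `toy_counts`, and `vocabulary` are illustrative names that do not appear in the original notebook.

```python
from collections import Counter

def laplace_probability(counts, context, word, vocab_size, k=1.0):
    """P(word | context) with add-k smoothing, so unseen words get a non-zero share."""
    context_counts = counts.get(context, {})
    total = sum(context_counts.values())
    return (context_counts.get(word, 0) + k) / (total + k * vocab_size)

# Hypothetical raw counts for one context and a five-word vocabulary
toy_counts = {("the", "news"): Counter({"conference": 2, "agency": 1})}
vocabulary = {"the", "news", "conference", "agency", "broke"}

seen = laplace_probability(toy_counts, ("the", "news"), "conference", len(vocabulary))
unseen = laplace_probability(toy_counts, ("the", "news"), "broke", len(vocabulary))
print(seen, unseen)  # 0.375 0.125
```

With add-one smoothing the seen word drops from 2/3 to 3/8, and the never-observed continuation "broke" rises from 0 to 1/8. Add-one smoothing tends to move too much mass to unseen events on large vocabularies, which is one reason other smoothing methods exist.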


## Neural Language Model

- Neural language models based on deep learning may provide a better alternative for modeling the probabilistic relationships between linguistic units.

In [13]:
def count_n_grams(data, n, start_token="<s>", end_token="<e>") -> dict:

  # Empty dict for n-grams
  n_grams = {}

  # Iterate over all sentences in the dataset
  for sentence in data:

    # Append n start tokens and a single end token to the sentence
    sentence = [start_token]*n + sentence + [end_token]

    # Convert the sentence into a tuple
    sentence = tuple(sentence)

    # Number of complete n-grams (windows of length n) in the padded sentence
    m = len(sentence) - n + 1

    # Iterate over every starting position of a complete n-gram
    for i in range(m):

      # Get the n-gram
      n_gram = sentence[i:i+n]

      # Increment the count for this n-gram, starting from 0 if it is new
      n_grams[n_gram] = n_grams.get(n_gram, 0) + 1

  return n_grams
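
Counts like these are typically turned into conditional probabilities by dividing the count of an extended ngram by the count of its prefix. The sketch below shows that step on a made-up two-sentence dataset. To stay self-contained it uses a compact `Counter`-based counter, `ngram_counts`, with the same padding scheme as the function above; that helper, `conditional_probability`, and `toy_data` are all hypothetical names introduced for illustration.

```python
from collections import Counter

def ngram_counts(sentences, n, start_token="<s>", end_token="<e>"):
    """Count n-grams after padding each sentence with n start tokens and one end token."""
    counts = Counter()
    for sentence in sentences:
        padded = [start_token] * n + list(sentence) + [end_token]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

def conditional_probability(word, previous, context_counts, extended_counts):
    """Estimate P(word | previous) as count(previous + word) / count(previous)."""
    previous = tuple(previous)
    context_count = context_counts.get(previous, 0)
    if context_count == 0:
        return 0.0  # unseen context; a smoothed estimate would avoid this
    return extended_counts.get(previous + (word,), 0) / context_count

# Hypothetical toy dataset of tokenized sentences
toy_data = [["i", "like", "tea"], ["i", "like", "coffee"]]
unigram_counts = ngram_counts(toy_data, 1)
bigram_counts = ngram_counts(toy_data, 2)

p_like = conditional_probability("like", ["i"], unigram_counts, bigram_counts)
p_tea = conditional_probability("tea", ["like"], unigram_counts, bigram_counts)
print(p_like, p_tea)  # 1.0 0.5
```

In this toy data "like" always follows "i", while "tea" follows "like" in one of two sentences.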