<a href="https://colab.research.google.com/github/jmbanda/CSC8980_NLP_Spring2021/blob/main/Class06_Language_Models_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class 6 - Language Models II

In this Google Colab notebook, we will go over a very basic example of a language model implementation using the Reuters corpus that comes with the NLTK NLP Python library (we will cover NLTK in more detail in a later class session).

The Reuters corpus is a collection of 10,788 news documents totaling 1.3 million words.

To use some of NLTK's built-in functionality, we need to import the library (already installed in Colab) and download the Reuters corpus and the Punkt tokenizer models so that we can process the data more easily.

In [None]:
import nltk
nltk.download('reuters')
nltk.download('punkt')

In the following piece of code, we import both the bigrams and trigrams functions from NLTK (only trigrams is used here), count how often each word follows each pair of preceding words, and convert those counts into probabilities.

In [4]:
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

# Create a placeholder for model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
 
# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count
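
To see what the two loops above compute, it can help to run the same steps on a corpus small enough to check by hand. The sketch below is only an illustration: `toy_sentences` is made-up data, and `padded_trigrams` is a hypothetical pure-Python helper that is assumed to mimic the padding of `trigrams(sentence, pad_left=True, pad_right=True)` by adding two `None` tokens on each side.

```python
from collections import defaultdict

def padded_trigrams(tokens):
    """Yield trigrams with two None pads on each side.

    Hypothetical helper, assumed to mirror
    nltk.trigrams(tokens, pad_left=True, pad_right=True).
    """
    padded = [None, None] + list(tokens) + [None, None]
    for i in range(len(padded) - 2):
        yield tuple(padded[i:i + 3])

# Tiny made-up corpus, for illustration only
toy_sentences = [
    ["the", "price", "rose"],
    ["the", "price", "fell"],
    ["the", "price", "rose"],
]

# Count how often w3 follows the pair (w1, w2)
toy_model = defaultdict(lambda: defaultdict(int))
for sentence in toy_sentences:
    for w1, w2, w3 in padded_trigrams(sentence):
        toy_model[(w1, w2)][w3] += 1

# Divide each count by the total for its context to get probabilities
for context, followers in toy_model.items():
    total = sum(followers.values())
    for w3 in followers:
        followers[w3] /= total

print(dict(toy_model[("the", "price")]))
```

In this toy corpus "rose" follows "the price" in two of three sentences, so its estimated probability is 2/3 and that of "fell" is 1/3. The `None` padding is also why the model learns which words tend to start and end a sentence.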

Let's now evaluate what we have built, starting with two simple words: “today the”. We want the model to tell us which words are likely to come next:

In [None]:
dict(model["today","the"])

Let's try something that could be more common now:

In [None]:
dict(model["the","price"])
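
The dictionaries printed above can be long and are not sorted, so it may be easier to look at only the most probable continuations. The following is a minimal sketch of one way to do that; `top_next_words` is a hypothetical helper and `demo_model` is a hand-written stand-in whose probabilities are made up.

```python
def top_next_words(model, w1, w2, k=5):
    """Return the k most probable next words after (w1, w2), best first."""
    # .get avoids creating an empty entry for unseen contexts in a defaultdict
    followers = model.get((w1, w2), {})
    ranked = sorted(followers.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hand-written stand-in for the trained model (values are made up)
demo_model = {
    ("the", "price"): {"of": 0.5, "for": 0.25, "was": 0.125, "is": 0.125},
}

print(top_next_words(demo_model, "the", "price", k=2))  # [('of', 0.5), ('for', 0.25)]
print(top_next_words(demo_model, "unseen", "context"))  # []
```

With the model trained on Reuters, the equivalent call would be `top_next_words(model, "the", "price")`.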

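Now that we have our model trained, let's try to generate text with it. The idea is to draw a random threshold between 0 and 1, then walk through the candidate next words while accumulating their probabilities, and append the first word at which the running total reaches the threshold. This picks each word in proportion to its probability. We repeat until the model emits two padding tokens (None) in a row, which marks the end of a sentence.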

In [None]:
import random

# starting words
text = ["today", "the"]
sentence_finished = False
 
while not sentence_finished:
    # draw a random threshold in [0, 1)
    r = random.random()
    accumulator = 0.0
    context = tuple(text[-2:])

    # walk through the candidates, accumulating probability mass
    for word, prob in model[context].items():
        accumulator += prob
        # append the first word whose cumulative probability reaches the threshold
        if accumulator >= r:
            text.append(word)
            break

    # two consecutive padding tokens mark the end of the sentence
    if text[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in text if t]))
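
The cumulative-sum loop above is a hand-written form of weighted random sampling. As an alternative sketch (not the approach used in the original source), the same idea can be expressed with the standard library's `random.choices`. The `generate` function, its `max_words` and `seed` parameters, and the `demo_model` data are all assumptions made for this illustration. The sketch also stops when the current pair of words was never seen in training, a case in which the loop above would never terminate (for example, with an unseen pair of starting words).

```python
import random

def generate(model, start, max_words=50, seed=None):
    """Sample words from a trigram model until end padding or max_words."""
    rng = random.Random(seed)
    text = list(start)
    while len(text) < max_words:
        followers = model.get(tuple(text[-2:]))
        if not followers:
            # unseen context: stop instead of looping forever
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        # draw one word in proportion to its probability
        text.append(rng.choices(words, weights=weights, k=1)[0])
        if text[-2:] == [None, None]:
            break
    return " ".join(t for t in text if t)

# Made-up model with a single possible path, so the output is predictable
demo_model = {
    ("today", "the"): {"price": 1.0},
    ("the", "price"): {"rose": 1.0},
    ("price", "rose"): {None: 1.0},
    ("rose", None): {None: 1.0},
}

print(generate(demo_model, ["today", "the"]))    # today the price rose
print(generate(demo_model, ["hello", "world"]))  # hello world
```

With the Reuters model, `generate(model, ["today", "the"])` would play the same role as the loop above, and passing a fixed `seed` makes a run repeatable.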

## Additional practice assignments:

1) The code here uses trigrams. Can you adapt it to use bigrams? How about quadrigrams?

2) Can you use the Shakespeare file from HW1, or his complete works, to generate some text?




Code Sources:

https://nlpforhackers.io/language-models/

