### Practice 2 - Building a 2nd-order language model from text databases

Markos Flavio B. G. O.

__Context: Markov Models (MMs).__

__Course: Unsupervised Machine Learning Hidden Markov Models in Python (Udemy, LazyProgrammer)__

This code is a practice study about MMs. It's an adaptation of the code found at https://github.com/lazyprogrammer/machine_learning_examples/tree/master/hmm_class/frost.py
    
__Specific objectives__

     1. Build a 2nd-order language model from a text database
     2. Generate new phrases from the learned model
     3. Calculate the probability of a given phrase

In [1]:
import numpy as np
import pandas as pd
import string
import random

In particular, we want to count:
1. The initial distribution of words (the probability of each word appearing at the beginning of a sentence).
2. The second-word distribution. There are no two previous words here, so it is conditioned on the first word only (alternatively, we could put a None in the first position, w(t-2)).
3. The end-of-sentence distribution.
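Put together, these counts feed the usual second-order Markov factorization of a sentence. As a sketch, in notation that is assumed here rather than taken from the course material ($\pi$ for the initial distribution, $\text{END}$ for the end-of-sentence marker), a sentence $w_1, \dots, w_T$ with $T \ge 2$ would be scored as:

$$
p(w_1, \dots, w_T) = \pi(w_1)\, p(w_2 \mid w_1) \left[ \prod_{t=3}^{T} p(w_t \mid w_{t-2}, w_{t-1}) \right] p(\text{END} \mid w_{T-1}, w_T)
$$

The three special cases in the list correspond to the first, second and last factors; the product in brackets holds the regular second-order transitions.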

#### Build a 2nd-order language model from a text database

In [18]:
# looking at the data
loc = './Raw repo/hmm_class/robert_frost.txt'
for i, line in enumerate(open(loc)):
    if i < 10:
        print(i, line)

0 Two roads diverged in a yellow wood,

1 And sorry I could not travel both

2 And be one traveler, long I stood

3 And looked down one as far as I could

4 To where it bent in the undergrowth; 

5 

6 Then took the other, as just as fair,

7 And having perhaps the better claim

8 Because it was grassy and wanted wear,

9 Though as for that the passing there



In [3]:
# stripping trailing whitespace and converting all characters to lower case
for i, line in enumerate(open(loc)):
    if i < 10 and line.strip(): # removing empty lines
        print(i, line.rstrip().lower())       

0 two roads diverged in a yellow wood,
1 and sorry i could not travel both
2 and be one traveler, long i stood
3 and looked down one as far as i could
4 to where it bent in the undergrowth;
6 then took the other, as just as fair,
7 and having perhaps the better claim
8 because it was grassy and wanted wear,
9 though as for that the passing there


In [4]:
# removing punctuation
for i, line in enumerate(open(loc)):
    if i < 10 and line.strip():
        print(i, line.rstrip().lower().translate(str.maketrans('','',string.punctuation)))
        # translate() returns a string where each character is mapped to its corresponding character as per the translation table.
        # the translation table is created by the maketrans() method.

0 two roads diverged in a yellow wood
1 and sorry i could not travel both
2 and be one traveler long i stood
3 and looked down one as far as i could
4 to where it bent in the undergrowth
6 then took the other as just as fair
7 and having perhaps the better claim
8 because it was grassy and wanted wear
9 though as for that the passing there
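
To make the `maketrans()`/`translate()` step more concrete, here is a minimal standalone sketch applied to the first line of the poem. The variable names `table`, `line` and `cleaned` are only illustrative.

```python
import string

# The third argument of str.maketrans() lists characters to delete;
# the two empty strings mean "no character-to-character replacements".
table = str.maketrans('', '', string.punctuation)

line = "Two roads diverged in a yellow wood,\n"  # first line of the poem
cleaned = line.rstrip().lower().translate(table)
print(cleaned)          # two roads diverged in a yellow wood
print(cleaned.split())  # list of tokens, ready for counting
```

Because `string.punctuation` includes the apostrophe, contractions such as "you'd" collapse to "youd", which is visible in the sentences generated further down.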


Let's now build, for every pair of consecutive words in the dataset, a list of all the words observed right after that pair.

Example: if the dataset has four sentences, "I love dogs", "I love cats", "I love dogs" and "I love", we would get the following dictionary entry:

{('I', 'love'): ['dogs', 'cats', 'dogs', 'END']}
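
The toy example can be reproduced with a few lines of Python. This is only a sketch of the idea, not the notebook's actual gathering code; the helper name `build_toy_transitions` is hypothetical and preprocessing (lower-casing, punctuation removal) is skipped.

```python
from collections import defaultdict

def build_toy_transitions(sentences):
    """Map each pair of consecutive words to the list of observed next words."""
    transitions = defaultdict(list)
    for sentence in sentences:
        tokens = sentence.split()
        # regular second-order transitions: (w[t-2], w[t-1]) -> w[t]
        for j in range(2, len(tokens)):
            transitions[(tokens[j - 2], tokens[j - 1])].append(tokens[j])
        # the last two words of the sentence are followed by the END marker
        if len(tokens) >= 2:
            transitions[tuple(tokens[-2:])].append('END')
    return dict(transitions)

toy = build_toy_transitions(["I love dogs", "I love cats", "I love dogs", "I love"])
print(toy[('I', 'love')])  # ['dogs', 'cats', 'dogs', 'END']
```

The full toy dictionary also has entries for `('love', 'dogs')` and `('love', 'cats')`, each followed only by `'END'`.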

In [148]:
initial = {} # counts how many times each word is the first one in a sentence
second_word = {} # for each first word, holds the words seen as the 2nd one (they only have one previous word)
transitions = {}  # holds all 2nd-order transitions

In [149]:
def add2dict(d, k, v):
    if k not in d:
        d[k] = []
    d[k].append(v)

In [150]:
# gathering data
for i, line in enumerate(open(loc)):
    tokens = line.rstrip().lower().translate(str.maketrans('','',string.punctuation)).split()
    T = len(tokens)
    
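    for j in range(T):
        t = tokens[j]
        if j == 0:
            # first word of the line: count it for the initial distribution
            initial[t] = initial.get(t, 0.) + 1
        else:
            t_1 = tokens[j-1]
            if j == T - 1:
                # last word of the line: the pair (previous, last) is followed by END
                add2dict(transitions, (t_1, t), 'END')
            if j == 1:
                # second word: only one previous word is available
                add2dict(second_word, t_1, t)
            else:
                # regular case: two previous words are available
                t_2 = tokens[j-2]
                add2dict(transitions, (t_2, t_1), t)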
    
    if i == 0: # looking at a sample
        print(tokens)
        print('Example of transition:')
        print(transitions)

['two', 'roads', 'diverged', 'in', 'a', 'yellow', 'wood']
Example of transition:
{('two', 'roads'): ['diverged'], ('roads', 'diverged'): ['in'], ('diverged', 'in'): ['a'], ('in', 'a'): ['yellow'], ('yellow', 'wood'): ['END'], ('a', 'yellow'): ['wood']}


In [152]:
# looking at a random sample of the data
def print_sample(d):
    # random.sample() needs a sequence (dict views are rejected in Python 3.11+)
    print(dict(random.sample(list(d.items()), 1)))
print_sample(transitions)
print_sample(second_word)
print_sample(initial)

{('stood', 'the'): ['strain']}
{'said': ['that']}
{'brown': 2.0}


Turning each list into a probability distribution dictionary.

In the earlier example, the list ['dogs', 'cats', 'dogs', 'END'] becomes {'cats': 1/4, 'dogs': 1/2, 'END': 1/4}

In [153]:
# normalizing the 'initial' dictionary is easy: divide each count by the total
initial_total = sum(initial.values())
for t, c in initial.items():
    initial[t] = c / initial_total

In [154]:
print_sample(initial)

{'sitting': 0.0006963788300835655}


In [155]:
def list2pdict(ts):
    # function to turn each list of possibilities into a dictionary of probabilities
    d = {}
    n = len(ts)
    # counting
    for t in ts:
        d[t] = d.get(t, 0.) + 1
    # normalizing
    for t, c in d.items():
        d[t] = c / n
    return d
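
As a cross-check, the same conversion can be sketched with `collections.Counter` and run on the toy list from the earlier example. The name `list_to_pdict` is hypothetical and only used for this illustration.

```python
from collections import Counter

def list_to_pdict(tokens):
    """Turn a list of observed next words into {word: relative frequency}."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}

print(list_to_pdict(['dogs', 'cats', 'dogs', 'END']))
# {'dogs': 0.5, 'cats': 0.25, 'END': 0.25}
```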

In [156]:
for k, v in second_word.items():
    second_word[k] = list2pdict(v)
for k, v in transitions.items():
    transitions[k] = list2pdict(v)

In [157]:
print_sample(transitions)
print_sample(second_word)

{('hard', 'with'): {'john': 1.0}}
{'making': {'allowance': 1.0}}


#### Generating new phrases given the learned model

Sample a random word from a dictionary according to the probability distribution learned from the data.

In [158]:
def sample_word(d):
    p0 = np.random.random()
    cumulative = 0
    for t, p in d.items():
        cumulative += p
        if p0 < cumulative:
            return t
    assert(False)
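
The loop in `sample_word` is inverse-CDF sampling: draw a uniform number in [0, 1) and return the first word whose cumulative probability exceeds it. As a hedged sanity check, the sketch below re-implements the same idea with the standard-library `random` module and compares empirical frequencies with the target distribution. The name `sample_word_checked`, the `rng` argument and the fallback return are assumptions added for this illustration, not part of the notebook.

```python
import random
from collections import Counter

def sample_word_checked(d, rng):
    """Draw one key of d with probability equal to its value (inverse-CDF sampling)."""
    p0 = rng.random()          # uniform in [0, 1)
    cumulative = 0.0
    last = None
    for token, p in d.items():
        cumulative += p
        last = token
        if p0 < cumulative:
            return token
    # Assumption: if rounding leaves the cumulative sum slightly below 1,
    # fall back to the last key instead of failing.
    return last

dist = {'dogs': 0.5, 'cats': 0.25, 'END': 0.25}
rng = random.Random(0)         # fixed seed so the check is repeatable
draws = Counter(sample_word_checked(dist, rng) for _ in range(10000))
freqs = {token: draws[token] / 10000 for token in dist}
print(freqs)                   # each frequency should be close to dist[token]
```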

In [159]:
def generating_sentences(n):
    for i in range(n):
        sentence  = []
        # initial word
        w0 = sample_word(initial)
        sentence.append(w0)
        
        # sample second word
        w1 = sample_word(second_word[w0])
        sentence.append(w1)

        # second-order transitions until END
        while True:
            w2 = sample_word(transitions[(w0, w1)])
            if w2 == 'END':
                break
            sentence.append(w2)
            w0 = w1
            w1 = w2
        print(' '.join(sentence))

In [160]:
generating_sentences(5)

in snow and mist
to be you said youd seen the stone baptismal font outdoors
and this is the ideals
i made him keep on gnawing till he whined
it ought to know


#### Calculating the probability of a given sentence

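The function below applies no smoothing, so any word or transition that was not seen in the training data gives the whole sentence a probability of zero. To include smoothing, the probability distributions would have to be recomputed from the smoothed counts. It would also be better to check first whether the sentence contains only words present in the database.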
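
The notebook does not say which smoothing scheme is intended. Assuming add-one (Laplace) smoothing, recomputing a distribution from the new counts could look like the sketch below. The function name `smoothed_pdict`, the `vocab` argument and the `alpha` parameter are hypothetical, and every observed token is assumed to belong to `vocab`.

```python
from collections import Counter

def smoothed_pdict(tokens, vocab, alpha=1.0):
    """Additive smoothing: every word in vocab gets alpha extra pseudo-counts."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}

vocab = ['dogs', 'cats', 'birds', 'END']   # toy vocabulary, including the END marker
probs = smoothed_pdict(['dogs', 'cats', 'dogs', 'END'], vocab)
print(probs)  # 'birds' was never observed but still gets a non-zero probability
```

Here the counts 2, 1, 0 and 1 become 3, 2, 1 and 2 out of a total of 8, so the distribution still sums to one.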

In [161]:
def prob_seq_no_smoothing(sentence):
    
    # preprocessing
    sentence = sentence.rstrip().lower().translate(str.maketrans('','',string.punctuation))
    # splitting
    seq = sentence.split()

    # evaluating the probabilities
    try:
        prob1 = initial[seq[0]] # prob. of 1st word
        print(prob1)
    except KeyError: # not found in the database as 1st word
        return 0
    try:
        prob2 = second_word[seq[0]][seq[1]] # prob. of 2nd word
        print(prob2)
    except KeyError:
        return 0
    
    if len(seq) > 2: # compare the number of words, not the number of characters
        w2 = seq[0]; w1 = seq[1]
        prob = 1
        for word in seq[2:]:
            try:
                prob = prob*transitions[(w2, w1)][word]
                print(prob)
            except KeyError:
                return 0
            w2 = w1
            w1 = word
        return prob1*prob2*prob
    else:
        return prob1*prob2
    
prob_seq_no_smoothing('and having perhaps')

0.08983286908077995
0.007751937984496124
1.0


0.0006963788300835655
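
For reference, the returned value is the product of the three factors printed above: the initial probability of "and", the probability of "having" given "and", and the probability of "perhaps" given ("and", "having"). With the printed values rounded:

$$
p(\text{and having perhaps}) \approx 0.08983 \times 0.00775 \times 1.0 \approx 0.000696
$$

Note that, as the function is written, the final END transition is not included in this product.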