---
Exercises: Language Modeling
====

![](https://www.sri.com/sites/default/files/uploads/brains_languagetrans_istockphoto.jpg)

Bag of Words Model
----

We take the words in a corpus to make a __generative model__ of language. 

We know that language is very complicated, but we can create a simplified model of language that captures part of the complexity. 

In the __bag of words__ model, we ignore the order of the words and just count how often each one occurs.  

Think of it this way: take all the words from the text, and throw them into a bag.  Shake the bag, and then generating a sentence consists of pulling words out of the bag one at a time.  

Chances are it won't be grammatical or sensible, but it will have words in roughly the right proportions.  

Let's write a function to sample an *n*-word sentence from a bag of words:

In [44]:
with open("../../corpora/the_kama_sutra.txt") as f:
    text = f.read()

In [45]:
import re 

def words(text):
    "List all the word tokens (consecutive letters) in a text. Normalize to lowercase."
    return re.findall('[a-z]+', text.lower()) 

In [46]:
karma = words(text)
karma[:5]

['the', 'kama', 'sutra', 'of', 'vatsyayana']

In [47]:
import random 

random.shuffle(karma)
bag = karma
bag[:5]

['on', 'and', 'lovers', 'occasion', 'the']

In [48]:
from numpy.random import choice
# random.choice from the standard library would work here instead of numpy.random.choice

In [49]:
def sample(bag, n=10):
    "Sample a random n-word sentence from the model described by the bag of words."
    return [choice(bag) for _ in range(n)]
    
    

In [50]:
# XXX: The function is random, so there can't be a deterministic unit test. The output should look something like...
sample(bag) #=> 'intercourse attached he between his with things as phut to'

['nails',
 'pass',
 'obey',
 'woman',
 'movements',
 'king',
 'assuring',
 'pleasure',
 'and',
 'though']
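
The random output makes `sample` hard to test exactly. One possible workaround, sketched here with the standard library's `random.Random` rather than `numpy.random.choice`, is to draw from a seeded generator so that the same seed always yields the same sentence. The name `sample_seeded`, the toy bag, and the seed value are assumptions made for illustration.

```python
import random

def sample_seeded(bag, n=10, seed=0):
    "Sample n words from the bag using a private, seeded random generator."
    rng = random.Random(seed)  # same seed -> same sequence of draws
    return [rng.choice(bag) for _ in range(n)]

toy_bag = ['the', 'the', 'the', 'of', 'and', 'woman']  # hypothetical bag

first = sample_seeded(toy_bag, n=5, seed=42)
second = sample_seeded(toy_bag, n=5, seed=42)
print(first == second)  # the two draws match because the seed is the same
```

A test can then compare two seeded calls, or check properties such as the length of the result, without depending on any particular random draw.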

What type of ngram model is this? Why?

1. unigram
2. bigram
3. trigram
4. none of the above

In [86]:
 #Unigram model - each word is drawn on its own, so words that occur more often in the bag have a higher chance of being picked. 
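
To see the "roughly the right proportions" claim in action, the sketch below draws many words from a small, made-up bag and compares the observed share of each word with its share of the bag. The bag contents, the seed, and the number of draws are all illustrative assumptions.

```python
import random
from collections import Counter

toy_bag = ['the'] * 6 + ['of'] * 3 + ['lovers'] * 1  # hypothetical bag of 10 tokens

rng = random.Random(0)  # seeded so the experiment is repeatable
draws = [rng.choice(toy_bag) for _ in range(10000)]
observed = Counter(draws)

for word, count in Counter(toy_bag).items():
    expected = count / len(toy_bag)
    print('{0:10} expected {1:.2f}  observed {2:.2f}'.format(
        word, expected, observed[word] / len(draws)))
```

With enough draws, the observed shares settle close to the shares in the bag, which is what makes this a unigram model.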

-----

In [52]:
len(karma)

58751

In [53]:
from collections import Counter
from pprint import pprint

counts = Counter(karma)
pprint(counts.most_common(10))

[('the', 4487),
 ('of', 2957),
 ('and', 2224),
 ('to', 1560),
 ('a', 1478),
 ('her', 1081),
 ('is', 999),
 ('in', 980),
 ('should', 812),
 ('with', 717)]


In [54]:
len(counts)

5249

In [55]:
#verify that frequency for neverseen is zero
counts['neverseen']

0

In [56]:
print('{0:20}  {1}'.format("word", "count"))
for word in words('there are common and neverseen words'):
    print('{0:20}  {1}'.format(word, counts[word]))

word                  count
there                 120
are                   425
common                11
and                   2224
neverseen             0
words                 41


In [57]:
# TODO: Calculate the frequency of the words

print('{0:20}  {1}'.format("word", "frequency"))
for word in words('the are common and neverseen words'):
    print('{0:20}  {1:.2}'.format(word, counts[word]/len(karma))) #divide by total number of words

word                  frequency
the                   0.076
are                   0.0072
common                0.00019
and                   0.038
neverseen             0.0
words                 0.0007


In [58]:
sum(counts.values())

58751

In [59]:
# TODO: Turn that into a function
def calc_prob_distibution(counter):
    "Calculate a probability distribution based on evidence from a Counter."
    total = sum(counter.values())  # compute the normalizer once, not on every lookup

    def prob_distribution(string):
        "Relative frequency of string in the counter (0 if never seen)."
        return counter[string] / total
    return prob_distribution


prob_distrubtion = calc_prob_distibution(counts)

In [60]:
prob_distrubtion("the")

0.07637316811628739

In [61]:
assert round(prob_distrubtion("the"), 4) == 0.0764
assert round(prob_distrubtion("with"), 4) == 0.0122

Now, what is the probability of a *sequence* of words?  Use the chain rule for joint probability:

$P(w_1 \ldots w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1 w_2) \times \ldots \times P(w_n \mid w_1 \ldots w_{n-1})$

The *bag of words* model assumes that each word is drawn from the bag *independently* of the others.  This gives us an approximation that is, strictly speaking, wrong:

$P(w_1 \ldots w_n) = P(w_1) \times P(w_2) \times P(w_3) \times \ldots \times P(w_n)$

It is wrong because real words depend on their neighbors, but it is good enough to move forward.

In [62]:
from functools import reduce
import operator

def product(iterable):
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    return reduce(operator.mul, iterable, 1)

def prob_words(words):
    "Probability of words, assuming each word is independent of others."
    return product(prob_distrubtion(w) for w in words)

In [63]:
reduce?

In [64]:
phrases = ['the the the',
         'the son',
         'the son of a Brahman', 
         'this is a neverbeforeseen word']

for phrase in phrases:
    print('{0:30}  {1:.6}'.format(phrase, prob_words(words(phrase))))

the the the                     0.000445474
the son                         7.79968e-06
the son of a Brahman            3.02572e-12
this is a neverbeforeseen word  0.0
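
Multiplying many probabilities below 1 shrinks the product quickly: the five-word phrase above is already near 1e-12, and a long enough text would eventually underflow to 0.0. A common workaround, which this notebook does not use, is to add logarithms instead of multiplying. A minimal sketch, assuming a made-up probability table (`toy_probs`) and a hypothetical helper `log_prob_words`:

```python
import math

toy_probs = {'the': 0.076, 'son': 0.0001, 'of': 0.05}  # hypothetical unigram probabilities

def log_prob_words(words, probs):
    "Sum of log-probabilities; returns -inf if any word has zero probability."
    total = 0.0
    for w in words:
        p = probs.get(w, 0.0)
        if p == 0.0:
            return float('-inf')  # log(0) is undefined, so report -infinity
        total += math.log(p)
    return total

phrase = ['the', 'son', 'of', 'the']
log_p = log_prob_words(phrase, toy_probs)
direct = 0.076 * 0.0001 * 0.05 * 0.076
print(math.isclose(math.exp(log_p), direct))  # both routes give the same probability
```

Comparing two phrases by log-probability gives the same ranking as comparing their products, since the logarithm is monotonic.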


TODO: Why is `the the the` so likely? What would we have to add to our model to reduce the likelihood of nonsense phrases?

In [65]:
# The word "the" occurs very frequently on its own. Since the model treats every word independently,
# the probability of "the the the" is the probability of "the" cubed (about 0.0764 ** 3), which is fairly high.
# A bigram model would also account for how rarely "the" follows "the", so the phrase would become much less likely.

TODO: Why is there zero probability for a phrase with a neverbeforeseen word?

In [66]:
# The word "neverbeforeseen" does not occur in our training corpus, so it gets a probability of zero.
# Multiplying by that zero makes the probability of the whole phrase zero.
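
One common remedy, which this notebook does not use, is add-one (Laplace) smoothing: pretend every word was observed once more than it really was, and reserve one extra slot for any word never seen. A minimal sketch; `smoothed_prob` and the toy counts are assumptions made for illustration.

```python
from collections import Counter

toy_counts = Counter({'the': 6, 'of': 3, 'lovers': 1})  # hypothetical counts

def smoothed_prob(word, counter):
    "Add-one estimate: (count + 1) / (total + vocabulary size + 1)."
    vocab_size = len(counter) + 1  # +1 reserves a slot for any unseen word
    total = sum(counter.values())
    return (counter[word] + 1) / (total + vocab_size)

print(smoothed_prob('the', toy_counts))        # (6 + 1) / (10 + 4)
print(smoothed_prob('neverseen', toy_counts))  # (0 + 1) / (10 + 4), no longer zero
```

The price is that every seen word gives up a little probability mass, so frequent words end up slightly less likely than their raw frequency suggests.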

----
Challenge Exercises
-----

Mo' Data, Mo' Better
----

Let's move up from tens of thousands to *billions and billions* of words.  

Once we have that amount of data, we can start to look at two-word sequences without the counts being too sparse.  

We happen to have data files available in the format of `"word \t count"`, and bigram data in the form of `"word1 word2 \t count"`.  Let's arrange to read them in:

In [67]:
def load_counts(filename, sep='\t'):
    """Return a Counter initialized from key-value pairs, 
    one on each line of filename."""
    C = Counter()
    with open(filename) as f:  # make sure the file is closed when we are done
        for line in f:
            key, count = line.split(sep)
            C[key] = int(count)
    return C

In [68]:
counts_unigram = load_counts('../../corpora/unigram_word_counts.txt')
counts_bigram = load_counts('../../corpora/bigram_word_counts.txt')

print(len(counts_unigram))
print(len(counts_bigram))

333333
286358


TODO: What is the total number of unigrams and bigrams?

In [69]:
# There are 333,333 distinct unigrams and 286,358 distinct bigrams, for a total of 619,691 entries

In [70]:
333333+286358

619691

In [71]:
counts_unigram.most_common(15)

[('the', 23135851162),
 ('of', 13151942776),
 ('and', 12997637966),
 ('to', 12136980858),
 ('a', 9081174698),
 ('in', 8469404971),
 ('for', 5933321709),
 ('is', 4705743816),
 ('on', 3750423199),
 ('that', 3400031103),
 ('by', 3350048871),
 ('this', 3228469771),
 ('with', 3183110675),
 ('i', 3086225277),
 ('you', 2996181025)]

In [72]:
counts_bigram.most_common(15)

[('of the', 2766332391),
 ('in the', 1628795324),
 ('to the', 1139248999),
 ('on the', 800328815),
 ('for the', 692874802),
 ('and the', 629726893),
 ('to be', 505148997),
 ('is a', 476718990),
 ('with the', 461331348),
 ('from the', 428303219),
 ('by the', 417106045),
 ('at the', 416201497),
 ('of a', 387060526),
 ('in a', 364730082),
 ('will be', 356175009)]

-----
Bigram Model
-----

A less-wrong approximation:
    
$P(w_1 \ldots w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \ldots \times P(w_n \mid w_{n-1})$

This is called the *bigram* model. It is equivalent to taking a text, cutting it up into slips of paper with two words on each, and putting each slip into a bag labelled with the first word on the slip, so that there is one bag per first word. Then, to generate language, we choose the first word from the original single bag of words, and choose each subsequent word from the bag labelled with the previously chosen word.
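
The slips-and-bags picture can be sketched directly in code. This is only an illustration on a made-up eleven-word corpus: `toy_text`, `bags`, and `generate` are hypothetical names, and each bag stores just the second word of every slip, since the first word is the bag's label.

```python
import random
from collections import defaultdict

toy_text = 'the son of a brahman met the son of a king'.split()  # hypothetical corpus

# One bag per first word: each bag holds the words that followed it in the text.
bags = defaultdict(list)
for first, second in zip(toy_text, toy_text[1:]):
    bags[first].append(second)

def generate(n, rng):
    "Generate up to n words: the first from the single bag, the rest from labelled bags."
    word = rng.choice(toy_text)
    sentence = [word]
    # Stop early if the current word never appeared as the first word of a slip.
    while len(sentence) < n and bags.get(word):
        word = rng.choice(bags[word])
        sentence.append(word)
    return sentence

sentence = generate(6, random.Random(1))
print(' '.join(sentence))
```

Every adjacent pair in the generated sentence is a pair that occurred in the corpus, which is why bigram output tends to read more naturally than output from the single bag.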

Let's start by defining the probability of a single discrete event, given evidence stored in a Counter:

Recall that the less-wrong bigram model approximation to English is:
    
$P(w_1 \ldots w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \ldots \times P(w_n \mid w_{n-1})$

where the conditional probability of a word given the previous word is defined as:

$P(w_n \mid w_{n-1}) = P(w_{n-1}w_n) / P(w_{n-1}) $
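
As a worked instance of this definition, the sketch below plugs in figures reported in this notebook's own outputs: the unigram count of "of", the sum of all unigram counts, and the bigram probability of "of the". The variable names are illustrative.

```python
count_of = 13151942776           # unigram count of 'of'
total_unigrams = 588124220187    # sum of all unigram counts
p_of_the = 0.012242832903921587  # P('of the') from the bigram distribution

p_of = count_of / total_unigrams
p_the_given_of = p_of_the / p_of  # P(the | of) = P(of the) / P(of)
print(round(p_the_given_of, 4))
```

The result is roughly 0.55: under this definition, about half of the time the word after "of" is "the".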

In [73]:
prob_dist_unigram = calc_prob_distibution(counts_unigram)
prob_dist_bigram = calc_prob_distibution(counts_bigram)


In [74]:
sum(counts_unigram.values())

588124220187

In [75]:
prob_dist_unigram('the')

0.03933837507090547

In [76]:
prob_dist_bigram('of the')

0.012242832903921587

In [77]:
# Change function to use the bigger dictionary
def prob_words(words):
    "Probability of words, assuming each word is independent of others."
    return product(prob_dist_unigram(w) for w in words)

In [78]:
# test 
prob_words(['the','the','the'])


6.087644042127257e-05

In [79]:
def cond_prob_word(word, prev):
    "Conditional probability of word, given previous word."
    bigram = prev + ' ' + word
    if prob_dist_bigram(bigram) > 0 and prob_dist_unigram(prev) > 0:
        
        return prob_dist_bigram(bigram) / prob_dist_unigram(prev)
    else: # Average the back-off value and zero.
        return prob_dist_unigram(word) / 2

# TODO: Finish the prob_words_bigram function
def prob_words_bigram(words, prev='<S>'):
    "The probability of a sequence of words, using bigram data, given prev word."
    cond_prob = []
    for word in words:
        # Condition each word on the one before it (the start marker for the first word).
        cond_prob.append(cond_prob_word(word, prev))
        prev = word
    # Return the product rather than printing it, so callers can use the value.
    return product(cond_prob)


In [80]:
#test
cond_prob_word('the','of')

0.5474709460900279

In [81]:
prob_words_bigram(words('the the the'))

7.2594323333854714e-09


In [82]:
print([i for i in range(3)])

[0, 1, 2]


In [83]:
print(prob_words(words('the the the'))) #=> 6.087644042127257e-05
print(prob_words_bigram(words('the the the'))) #=> 7.2594323333854714e-09

6.087644042127257e-05
7.2594323333854714e-09


TODO: Why does the probability of this phrase go down with the bigram model?

In [84]:
print(prob_words(words('of the same'))) #=> 3.5124331077955047e-07
print(prob_words_bigram(words('of the same')))  #=> 0.00012825557171799635

3.5124331077955047e-07
0.00012825557171799635


TODO: Why does the probability of this phrase go way up with the bigram model?

In [85]:
# The bigram model takes the probabilities of word pairs into account, and "of the" is the most common pair in the data.
# The unigram model only considers the probability of individual words, not of pairs of words.
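
The same effect shows up on a corpus small enough to check by hand. The sketch below is not the notebook's implementation: it estimates conditional probabilities as count(prev, word) / count(prev) from a single made-up text, and it has no back-off, so an unseen pair gets probability zero. All names and the toy text are assumptions.

```python
from collections import Counter

toy_text = ('of the same kind and of the same age '
            'the king and the son of the king').split()  # hypothetical corpus

unigrams = Counter(toy_text)
bigrams = Counter(zip(toy_text, toy_text[1:]))
n = len(toy_text)

def p_unigram(words):
    "Product of independent word probabilities."
    p = 1.0
    for w in words:
        p *= unigrams[w] / n
    return p

def p_bigram(words):
    "P(w1) times the product of count(prev, w) / count(prev)."
    p = unigrams[words[0]] / n
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

for phrase in (['of', 'the', 'same'], ['the', 'the', 'the']):
    print(' '.join(phrase), p_unigram(phrase), p_bigram(phrase))
```

Here "of the same" scores higher under the bigram estimate because "the" always follows "of" in the toy text, while "the the the" drops to zero because "the" never follows "the".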

---