---
Exercises: Language Modeling
====

![](https://www.sri.com/sites/default/files/uploads/brains_languagetrans_istockphoto.jpg)

Bag of Words Model
----

We take the words in a corpus to make a __generative model__ of language. 

We know that language is very complicated, but we can create a simplified model of language that captures part of the complexity. 

In the __bag of words__ model, we ignore the order of words and just count how often each one occurs.

Think of it this way: take all the words from the text and throw them into a bag. Shake the bag; generating a sentence then consists of pulling words out of the bag one at a time.

Chances are it won't be grammatical or sensible, but it will have words in roughly the right proportions.  

Let's write a function to sample an *n*-word sentence from a bag of words:

In [12]:
with open("../../corpora/the_kama_sutra.txt") as f:
    text = f.read()

In [13]:
import re 

def words(text):
    "List all the word tokens (consecutive letters) in a text. Normalize to lowercase."
    return re.findall('[a-z]+', text.lower()) 

In [14]:
karma = words(text)
karma[:5]

['the', 'kama', 'sutra', 'of', 'vatsyayana']

In [15]:
import random 

random.shuffle(karma)
bag = karma
bag[:5]

['engaged', 'get', 'nor', 'to', 'like']

In [16]:
def sample(bag, n=10):
    "Sample a random n-word sentence from the model described by the bag of words."
    pass

In [17]:
# XXX: There can't be a deterministic unit test, since this is a random function. The output should look something like...
sample(bag) #=> 'intercourse attached he between his with things as phut to'
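
One possible way to think about the sampling step is sketched below on a toy bag. The names `sample_sentence`, `toy_bag`, and the `rng` parameter are made up for illustration, and the sketch assumes words are drawn with replacement, which is only one reading of "pulling words out of the bag".

```python
import random

def sample_sentence(bag, n=10, rng=random):
    """Draw n words from the bag, with replacement, and join them with spaces."""
    return ' '.join(rng.choice(bag) for _ in range(n))

# Hypothetical toy bag, standing in for the shuffled corpus words.
toy_bag = ['the', 'the', 'the', 'of', 'of', 'and', 'wife']

# A seeded generator makes the draw repeatable while experimenting.
sentence = sample_sentence(toy_bag, n=5, rng=random.Random(0))
print(sentence)
```

Because `the` appears three times in the toy bag, it should turn up in samples about three times as often as `and`, which is the "roughly the right proportions" property described above.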

What type of n-gram model is this? Why?

1. unigram
2. bigram
3. trigram
4. none of the above

-----

In [18]:
from collections import Counter
from pprint import pprint

counts = Counter(karma)
pprint(counts.most_common(10))

[('the', 4487),
 ('of', 2957),
 ('and', 2224),
 ('to', 1560),
 ('a', 1478),
 ('her', 1081),
 ('is', 999),
 ('in', 980),
 ('should', 812),
 ('with', 717)]


In [19]:
print('{0:20}  {1}'.format("word", "count"))
for word in words('there are common and neverseen words'):
    print('{0:20}  {1}'.format(word, counts[word]))

word                  count
there                 120
are                   425
common                11
and                   2224
neverseen             0
words                 41


In [20]:
# TODO: Calculate the relative frequency of each word (replace the None placeholder)

print('{0:20}  {1}'.format("word", "frequency"))
for word in words('there are common and neverseen words'):
    print('{0:20}  {1:.2}'.format(word, None))

word                  frequency


TypeError: non-empty format string passed to object.__format__

In [None]:
# TODO: Turn that into a function
def calc_prob_distibution(counter):
    "Calculate a probability distribution based on evidence from a Counter."
    pass

prob_distrubtion = calc_prob_distibution(counts)

In [None]:
assert round(prob_distrubtion("the"), 4) == 0.0764
assert round(prob_distrubtion("with"), 4) == 0.0122
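
As a hedged illustration of the idea (not necessarily the intended solution), a probability distribution can be built from a `Counter` by dividing each count by the total. The helper name `relative_frequency` and the toy sentence are assumptions made for this sketch.

```python
from collections import Counter

def relative_frequency(counter):
    """Return a function mapping a word to count / total (0.0 for unseen words)."""
    total = sum(counter.values())
    return lambda word: counter[word] / total

# Toy corpus: six tokens, 'the' occurs twice.
toy_counts = Counter('the cat sat on the mat'.split())
p = relative_frequency(toy_counts)

print(p('the'))  # two out of six tokens
print(p('dog'))  # never seen, so the Counter reports a count of 0
```

Returning a function rather than a dictionary matches the way `prob_distrubtion("the")` is called in the assertions above.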

Now, what is the probability of a *sequence* of words?  Use the chain rule for joint probability:

$P(w_1 \ldots w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1 w_2) \times \ldots \times P(w_n \mid w_1 \ldots w_{n-1})$

The *bag of words* model assumes that each word is drawn from the bag *independently* of the others.  This gives us the following (wrong) approximation:

$P(w_1 \ldots w_n) = P(w_1) \times P(w_2) \times P(w_3) \times \ldots \times P(w_n)$

It is wrong, but good enough to move forward...
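
As a rough worked example, take the two rounded probabilities that appear in the assertions above, $P(\text{the}) \approx 0.0764$ and $P(\text{with}) \approx 0.0122$. Under the independence assumption:

$P(\text{the with}) \approx P(\text{the}) \times P(\text{with}) \approx 0.0764 \times 0.0122 \approx 0.00093$

The reversed phrase gets exactly the same value, since the model ignores word order.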

In [None]:
from functools import reduce
import operator

def product(iterable):
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    return reduce(operator.mul, iterable, 1)

def prob_words(words):
    "Probability of words, assuming each word is independent of others."
    return product(prob_distrubtion(w) for w in words)

In [None]:
phrases = ['the the the',
           'the son',
           'the son of a Brahman',
           'this is a neverbeforeseen word']

for phrase in phrases:
    print('{0:30}  {1:.6}'.format(phrase, prob_words(words(phrase))))

TODO: Why is `the the the` so likely? What would we have to add to our model to reduce the likelihood of nonsense phrases?

TODO: Why is there zero probability for a phrase with a neverbeforeseen word?

----
Challenge Exercises
-----

Mo' Data, Mo' Better
----

Let's move up from tens of thousands of words to *billions and billions* of words.

Once we have that amount of data, we can start to look at two-word sequences without them being too sparse.

We happen to have unigram data files available in the format `"word \t count"`, and bigram data in the format `"word1 word2 \t count"`.  Let's arrange to read them in:

In [None]:
def load_counts(filename, sep='\t'):
    """Return a Counter initialized from key-value pairs,
    one on each line of filename."""
    C = Counter()
    with open(filename) as f:  # make sure the file is closed when done
        for line in f:
            key, count = line.split(sep)
            C[key] = int(count)
    return C
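
To see the file format in action without the real data files, here is a small sketch that parses a few made-up lines from memory. The function name `parse_counts` and the sample counts are assumptions for illustration only; they are not taken from the corpora.

```python
from collections import Counter
import io

def parse_counts(lines, sep='\t'):
    """Build a Counter from an iterable of 'key<sep>count' lines."""
    counts = Counter()
    for line in lines:
        key, count = line.rstrip('\n').split(sep)
        counts[key] = int(count)
    return counts

# Made-up sample data in the bigram file format (not from the real files).
sample_file = io.StringIO('of the\t12\nin the\t7\nthe same\t3\n')
toy_bigrams = parse_counts(sample_file)

print(len(toy_bigrams))           # number of distinct bigrams
print(sum(toy_bigrams.values()))  # total number of bigram occurrences
```

Note the difference between the two printed numbers: `len` counts distinct keys, while the sum of the values counts every occurrence.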

In [None]:
counts_unigram = load_counts('../../corpora/unigram_word_counts.txt')
counts_bigram = load_counts('../../corpora/bigram_word_counts.txt')

print(len(counts_unigram))
print(len(counts_bigram))

How much data do we have?

TODO: What is the sum of all the counts in the unigram data, i.e. the size of the corpus?

TODO: What is the sum of all the counts in the bigram data?

In [None]:
counts_unigram.most_common(15)

In [None]:
counts_bigram.most_common(15)

-----
Bigram Model
-----

A less-wrong approximation:
    
$P(w_1 \ldots w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \ldots \times P(w_n \mid w_{n-1})$

This is called the *bigram* model. It is equivalent to taking a text, cutting it up into slips of paper with two words on each, and putting each slip into a bag labelled with the first word on the slip, so that there are many bags. Then, to generate language, we choose the first word from the original single bag of words, and choose each subsequent word from the bag labelled with the previously chosen word.
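
The slips-and-bags picture can be sketched in a few lines on a toy text. The names `build_bags`, `generate`, and `toy_tokens` are hypothetical, and this sketch stores only the second word of each slip, since the first word is already the bag's label.

```python
import random
from collections import defaultdict

def build_bags(tokens):
    """Map each word to the list of words that follow it in the text."""
    bags = defaultdict(list)
    for first, second in zip(tokens, tokens[1:]):
        bags[first].append(second)
    return bags

def generate(tokens, bags, n=8, rng=random):
    """Pick a first word from the single bag, then follow the labelled bags."""
    word = rng.choice(tokens)
    out = [word]
    # Stop early if the current word has no bag (it only occurs at the very end).
    while len(out) < n and bags.get(word):
        word = rng.choice(bags[word])
        out.append(word)
    return out

toy_tokens = 'the cat sat on the mat and the cat slept'.split()
toy_bags = build_bags(toy_tokens)

print(toy_bags['the'])
print(' '.join(generate(toy_tokens, toy_bags, rng=random.Random(1))))
```

Every adjacent pair in the generated output is a pair that actually occurred in the toy text, which is what separates this from the single-bag model.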

To use this model we need the probability of a single discrete event, given evidence stored in a Counter, and the conditional probability of a word given the previous word, which is defined as:

$P(w_n \mid w_{n-1}) = P(w_{n-1} w_n) / P(w_{n-1})$
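
A minimal sketch of this definition on a toy text follows. The names `toy_cond_prob`, `unigram_counts`, and `bigram_counts` are assumptions for illustration, and the sketch assumes the previous word has been seen at least once (otherwise the division fails).

```python
from collections import Counter

toy_tokens = 'the cat sat on the mat and the cat slept'.split()
unigram_counts = Counter(toy_tokens)
# Bigram keys are 'word1 word2' strings, mirroring the data file format.
bigram_counts = Counter(' '.join(pair) for pair in zip(toy_tokens, toy_tokens[1:]))

def toy_cond_prob(word, prev):
    """P(word | prev) = P(prev word) / P(prev), each estimated from its own counts."""
    p_bigram = bigram_counts[prev + ' ' + word] / sum(bigram_counts.values())
    p_prev = unigram_counts[prev] / sum(unigram_counts.values())
    return p_bigram / p_prev

print(toy_cond_prob('cat', 'the'))  # 'the cat' occurs twice, 'the' three times
print(toy_cond_prob('sat', 'the'))  # 'the sat' never occurs
```

Because the bigram and unigram probabilities are normalized by different totals (9 bigrams versus 10 tokens here), the result is close to, but not exactly, the ratio of raw counts.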

In [None]:
prob_dist_unigram = calc_prob_distibution(counts_unigram)
prob_dist_bigram = calc_prob_distibution(counts_bigram)

In [None]:
# Redefine prob_words to use the larger unigram data
def prob_words(words):
    "Probability of words, assuming each word is independent of others."
    return product(prob_dist_unigram(w) for w in words)

In [None]:
def cond_prob_word(word, prev):
    "Conditional probability of word, given previous word."
    bigram = prev + ' ' + word
    if prob_dist_bigram(bigram) > 0 and prob_dist_unigram(prev) > 0:
        return prob_dist_bigram(bigram) / prob_dist_unigram(prev)
    else: # Average the back-off value and zero.
        return prob_dist_unigram(word) / 2

# TODO: Finish the prob_words_bigram function
def prob_words_bigram(words, prev='<S>'):
    "The probability of a sequence of words, using bigram data, given prev word."
    pass
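
One way to picture the chaining step, sketched here with a made-up table of conditional probabilities rather than the real data: pair each word with the word before it, starting from the `'<S>'` marker, and multiply. The names `sequence_prob`, `toy_table`, and `toy_cond_prob` are hypothetical.

```python
from functools import reduce
import operator

def sequence_prob(words, cond_prob, prev='<S>'):
    """Multiply P(word | previous word) along the sequence, starting from prev."""
    previous_words = [prev] + list(words[:-1])
    factors = (cond_prob(w, p) for w, p in zip(words, previous_words))
    return reduce(operator.mul, factors, 1)

# Hypothetical conditional probabilities, made up for illustration only.
toy_table = {('<S>', 'the'): 0.5, ('the', 'cat'): 0.4, ('cat', 'sat'): 0.25}

def toy_cond_prob(word, prev):
    """Look up P(word | prev) in the toy table; unseen pairs get 0.0."""
    return toy_table.get((prev, word), 0.0)

print(sequence_prob(['the', 'cat', 'sat'], toy_cond_prob))
print(sequence_prob(['cat', 'the'], toy_cond_prob))
```

Unlike this toy lookup, `cond_prob_word` above backs off to half the unigram probability for unseen pairs, so it would not return zero for the second phrase.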

In [None]:
print(prob_words(words('the the the'))) #=> 6.087644042127257e-05
print(prob_words_bigram(words('the the the'))) #=> 7.2594323333854714e-09

TODO: Why does the probability of this phrase go down with the bigram model?

In [None]:
print(prob_words(words('of the same')))
print(prob_words_bigram(words('of the same')))  #=> 0.00012825557171799635

TODO: Why does the probability of this phrase go way up with the bigram model?

---