In [1]:
reset -fs

Write a function to sample an *n*-word sentence from a bag of words:

In [2]:
with open("../../../corpora/the_kama_sutra.txt") as f:
    text = f.read()

In [3]:
import re 

def words(text):
    "List all the word tokens (consecutive letters) in a text. Normalize to lowercase."
    return re.findall('[a-z]+', text.lower()) 

In [4]:
karma = words(text)
karma[:5]

import random 

random.shuffle(karma)
bag = karma
bag[:5]

['be', 'he', 'page', 'whose', 'a']

In [5]:
import random 

def sample(bag, n=10):
    "Sample a random n-word sentence from the model described by the bag of words."
    return ' '.join(random.choice(bag) for _ in range(n))

In [6]:
[sample(bag) for _ in range(10)]

['and expressing but should love tell are arise any and',
 'the esprit her no the the of condoling with to',
 'wife to respect for husband only kinds the the her',
 'of the many he be good same once of tenth',
 'should the out women at her the festival show young',
 'return who now gained inarticulately infant marks is get attending',
 'known or lip and distress the his moreover weary common',
 'of has a these and her amusements by bite expressive',
 'about salt who the wife he superintend secrets men union',
 'may with her time diversions plain ministers the draupadi excessive']

What type of ngram model is this? Why?

1. unigram
2. bigram
3. trigram
4. none of the above

This is a unigram model. It generates data based on single-token frequencies: each word is drawn independently of the words around it.
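Drawing uniformly from the bag of tokens is the same as drawing each word with probability proportional to its count, which is what makes this a unigram model. A minimal sketch on a toy bag (the names `toy_bag` and `sample_by_frequency` are illustrative and not part of the notebook):

```python
import random
from collections import Counter

# Hypothetical toy bag of tokens; the notebook's real bag comes from the corpus.
toy_bag = ['the', 'the', 'the', 'of', 'of', 'wife']

def sample_by_frequency(bag, n=10, seed=None):
    "Sample n words, each drawn independently with probability count/total."
    rng = random.Random(seed)
    counts = Counter(bag)
    vocab = list(counts)
    weights = [counts[w] for w in vocab]
    # Each draw ignores the previous words, so only unigram frequencies matter.
    return ' '.join(rng.choices(vocab, weights=weights, k=n))

sentence = sample_by_frequency(toy_bag, n=5, seed=0)
print(sentence)
```

Passing a `seed` only makes the toy run repeatable; the sampling itself has no memory of earlier words.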

-----

In [7]:
from collections import Counter
from pprint import pprint

counts = Counter(karma)
pprint(counts.most_common(10)) 

[('the', 4487),
 ('of', 2957),
 ('and', 2224),
 ('to', 1560),
 ('a', 1478),
 ('her', 1081),
 ('is', 999),
 ('in', 980),
 ('should', 812),
 ('with', 717)]


In [8]:
print('{0:20}  {1}'.format("word", "count"))
for word in words('there are common and neverseen words'):
    print('{0:20}  {1}'.format(word, counts[word]))

word                  count
there                 120
are                   425
common                11
and                   2224
neverseen             0
words                 41


In [9]:
# TODO: calculate the frequency of the words

for word in words('there are common and neverseen words'):
    print('{0:20}  {1:.2}'.format(word, counts[word]/sum(counts.values())))


there                 0.002
are                   0.0072
common                0.00019
and                   0.038
neverseen             0.0
words                 0.0007


In [10]:
# TODO: Turn that into a function
def calc_prob_distibution(counter):
    "Calculate a probability distribution based on evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N

prob_distrubtion = calc_prob_distibution(counts)

In [11]:
assert round(prob_distrubtion("the"), 4) == 0.0764
assert round(prob_distrubtion("with"), 4) == 0.0122

In [12]:
print('{0:20}  {1}'.format("word", "frequency"))
for word in words('there are common and neverseen words'):
    print('{0:20}  {1:.2}'.format(word, prob_distrubtion(word)))

word                  frequency
there                 0.002
are                   0.0072
common                0.00019
and                   0.038
neverseen             0.0
words                 0.0007


----

Now, what is the probability of a *sequence* of words? Use the chain rule, which follows from the definition of joint probability:

$P(w_1 \ldots w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1 w_2) \times \cdots \times P(w_n \mid w_1 \ldots w_{n-1})$

The *bag of words* model assumes that each word is drawn from the bag *independently* of the others. This gives us an approximation that is, strictly speaking, wrong:

$P(w_1 \ldots w_n) = P(w_1) \times P(w_2) \times P(w_3) \times \cdots \times P(w_n)$

It is wrong, but good enough to move forward.

In [13]:
from functools import reduce
import operator

def product(iterable):
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    return reduce(operator.mul, iterable, 1)

def prob_words(words):
    "Probability of words, assuming each word is independent of others."
    return product(prob_distrubtion(w) for w in words)

In [14]:
phrases = ['the the the',
         'the son',
         'the son of a Brahman', 
         'this is a neverbeforeseen word']

for phrase in phrases:
    print('{0:30}  {1:.6}'.format(phrase, prob_words(words(phrase))))

the the the                     0.000445474
the son                         7.79968e-06
the son of a Brahman            3.02572e-12
this is a neverbeforeseen word  0.0


TODO: Why is `the the the` so likely? What would we have to add to our model to reduce the likelihood of nonsense phrases?

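- `the` is the most frequent word, and the unigram model ignores word order, so repeating it scores highly. Bigrams (or a grammar) would add the missing context.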

TODO: Why is there zero probability for a sentence with a `neverbeforeseen` word?
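One way to explore this question: in a product of independent word probabilities, a single factor of zero makes the whole product zero. A small sketch with made-up probabilities (the table `toy_probs` and the function `toy_prob_words` are hypothetical, for illustration only):

```python
from functools import reduce
import operator

# Hypothetical probabilities for illustration only; unseen words map to 0.
toy_probs = {'this': 0.01, 'is': 0.02, 'a': 0.03, 'word': 0.001}

def toy_prob_words(words):
    "Product of per-word probabilities; words missing from the table get 0."
    return reduce(operator.mul, (toy_probs.get(w, 0.0) for w in words), 1)

seen = toy_prob_words(['this', 'is', 'a', 'word'])
unseen = toy_prob_words(['this', 'is', 'a', 'neverbeforeseen', 'word'])
print(seen, unseen)
```

The second phrase differs only by one unseen word, yet its probability collapses to zero no matter how likely the other words are.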

------
Let's move up from tens of thousands of words to *billions and billions* of words.

Once we have that amount of data, we can start to look at two-word sequences (bigrams) without them being too sparse.

We happen to have data files available in the format of `"word \t count"`, and bigram data in the form of `"word1 word2 \t count"`.  Let's arrange to read them in:

In [15]:
def load_counts(filename, sep='\t'):
    """Return a Counter initialized from key-value pairs,
    one on each line of filename."""
    counter = Counter()
    with open(filename) as f:  # Close the file once it has been read.
        for line in f:
            key, count = line.split(sep)
            counter[key] = int(count)  # int() ignores the trailing newline.
    return counter
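The file layout can be tried out without the real data. A small sketch that parses made-up lines in the `key \t count` format from an in-memory buffer (the sample lines and the name `parse_counts` are illustrative assumptions, not the contents of the actual files):

```python
import io
from collections import Counter

# Made-up bigram lines in the "word1 word2 \t count" layout.
sample_file = io.StringIO("of the\t25\nin the\t15\n")

def parse_counts(lines, sep='\t'):
    "Build a Counter from lines of the form key<sep>count."
    counter = Counter()
    for line in lines:
        # The key may contain a space (bigrams); only the tab separates fields.
        key, count = line.rstrip('\n').split(sep)
        counter[key] = int(count)
    return counter

toy_counts = parse_counts(sample_file)
print(toy_counts['of the'], toy_counts['never seen'])
```

Because the result is a `Counter`, looking up a key that never appeared returns 0 instead of raising a `KeyError`.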

In [26]:
counts_unigram = load_counts('../../../corpora/unigram_word_counts.txt')
counts_bigram = load_counts('../../../corpora/bigram_word_counts.txt')

print("{:,}".format(len(counts_unigram)))
print("{:,}".format(len(counts_bigram)))

333,333
286,358


In [27]:
print("{:,}".format(sum(counts_unigram.values())))

588,124,220,187


In [28]:
print("{:,}".format(sum(counts_bigram.values())))

225,955,251,755


In [17]:
counts_unigram.most_common(15)

[('the', 23135851162),
 ('of', 13151942776),
 ('and', 12997637966),
 ('to', 12136980858),
 ('a', 9081174698),
 ('in', 8469404971),
 ('for', 5933321709),
 ('is', 4705743816),
 ('on', 3750423199),
 ('that', 3400031103),
 ('by', 3350048871),
 ('this', 3228469771),
 ('with', 3183110675),
 ('i', 3086225277),
 ('you', 2996181025)]

In [18]:
counts_bigram.most_common(50)

[('of the', 2766332391),
 ('in the', 1628795324),
 ('to the', 1139248999),
 ('on the', 800328815),
 ('for the', 692874802),
 ('and the', 629726893),
 ('to be', 505148997),
 ('is a', 476718990),
 ('with the', 461331348),
 ('from the', 428303219),
 ('by the', 417106045),
 ('at the', 416201497),
 ('of a', 387060526),
 ('in a', 364730082),
 ('will be', 356175009),
 ('that the', 333393891),
 ('do not', 326267941),
 ('is the', 306482559),
 ('to a', 279146624),
 ('is not', 276753375),
 ('for a', 274112498),
 ('with a', 271525283),
 ('as a', 270401798),
 ('<S> and', 261891475),
 ('of this', 258707741),
 ('<S> the', 258483382),
 ('it is', 245002494),
 ('can be', 230215143),
 ('If you', 210252670),
 ('has been', 196769958),
 ('the same', 186235801),
 ('does not', 180844574),
 ('can not', 180466484),
 ('and a', 178504444),
 ('in this', 174166565),
 ('one of', 173898508),
 ('have been', 172884791),
 ('you can', 172007760),
 ('may be', 171738006),
 ('as the', 169662690),
 ('on a', 167105962),
 ...]

In [19]:
prob_dist_unigram = calc_prob_distibution(counts_unigram)
prob_dist_bigram = calc_prob_distibution(counts_bigram)

In [20]:
# Redefine prob_words to use prob_dist_unigram (the bigger dictionary) instead of prob_distrubtion
def prob_words(words):
    "Probability of words, assuming each word is independent of others."
    return product(prob_dist_unigram(w) for w in words)

In [30]:
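def cond_prob_word(word, prev):
    "Conditional probability of word, given previous word."
    bigram = prev + ' ' + word
    if prob_dist_bigram(bigram) > 0 and prob_dist_unigram(prev) > 0:
        # Note: the two distributions are normalized by different totals, so this
        # ratio is count(bigram) / count(prev) scaled by N_unigram / N_bigram.
        # Dividing the raw counts would give the unscaled estimate.
        return prob_dist_bigram(bigram) / prob_dist_unigram(prev)
    else: # Average the back-off value and zero.
        return prob_dist_unigram(word) / 2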
    
def prob_words_improved(words, prev='<S>'):
    "The probability of a sequence of words, using bigram data, given prev word."
    return product(cond_prob_word(w, (prev if (i == 0) else words[i-1]) )
                                       for (i, w) in enumerate(words))
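A conditional probability is usually estimated from raw counts as count(prev word) / count(prev). A minimal sketch of that estimate with the same back-off idea, on made-up counts (the names `toy_unigrams`, `toy_bigrams`, and `toy_cond_prob` are hypothetical and not part of the notebook):

```python
from collections import Counter

# Hypothetical toy counts for illustration; not taken from the data files.
toy_unigrams = Counter({'of': 50, 'the': 100, 'same': 10})
toy_bigrams = Counter({'of the': 20, 'the same': 5})
N_UNIGRAMS = sum(toy_unigrams.values())

def toy_cond_prob(word, prev):
    "Estimate P(word | prev) as count(prev word) / count(prev), with back-off."
    bigram = prev + ' ' + word
    if toy_bigrams[bigram] > 0 and toy_unigrams[prev] > 0:
        return toy_bigrams[bigram] / toy_unigrams[prev]
    # Unseen bigram: fall back to half the unigram probability of the word.
    return toy_unigrams[word] / N_UNIGRAMS / 2

print(toy_cond_prob('the', 'of'))    # seen bigram: 20 / 50
print(toy_cond_prob('the', 'the'))   # back-off: (100 / 160) / 2
```

Working from counts keeps the numerator and denominator on the same scale, so no separate normalization totals are involved.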



In [31]:
print(prob_words(words('the the the')))
print(prob_words_improved(words('the the the')))

6.087644042127257e-05
7.2594323333854714e-09


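TODO: Why does the probability of the phrase go down with the bigram model?

The chances of 'the the the' are vanishingly small: the bigram `the the` is rare relative to `the` on its own, so conditioning each word on the previous one penalizes the repetition that the unigram model rewards.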
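As a sanity check, the unigram value printed above for 'the the the' can be reproduced from two numbers already shown in this notebook: the count of `the` and the total number of unigram tokens. A short sketch (the variable names are illustrative):

```python
# Counts copied from the outputs shown earlier in this notebook.
count_the = 23_135_851_162          # from counts_unigram.most_common(15)
total_unigrams = 588_124_220_187    # from sum(counts_unigram.values())

# Under the independence assumption, P(the the the) = P(the) ** 3.
p_the = count_the / total_unigrams
p_the_the_the = p_the ** 3
print(p_the_the_the)
```

The result agrees with the unigram figure to within floating-point rounding, which confirms that the unigram model simply multiplies the single-word frequency three times.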

In [32]:
print(prob_words(words('of the same'))) #=> 3.5124331077955047e-07
print(prob_words_improved(words('of the same'))) #=> 0.00012825557171799635

3.5124331077955047e-07
0.00012825557171799635
