In [2]:
import nltk
from nltk.corpus import gutenberg

### Language modeling exercises

Go through the NLTK documentation and check how the bigram, trigram, and general n-gram helpers work.

In [3]:
help(nltk.bigrams)

Help on function bigrams in module nltk.util:

bigrams(sequence, **kwargs)
    Return the bigrams generated from a sequence of items, as an iterator.
    For example:
    
        >>> from nltk.util import bigrams
        >>> list(bigrams([1,2,3,4,5]))
        [(1, 2), (2, 3), (3, 4), (4, 5)]
    
    Use bigrams for a list version of this function.
    
    :param sequence: the source data to be converted into bigrams
    :type sequence: sequence or iter
    :rtype: iter(tuple)
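
NLTK also provides `nltk.trigrams` and `nltk.ngrams` for longer windows. To make the idea concrete, here is a minimal plain-Python sketch of what such a helper computes. The function name `ngrams_sketch` is hypothetical and this is not NLTK's own implementation; it only illustrates the sliding-window idea.

```python
def ngrams_sketch(tokens, n):
    """Return every run of n consecutive tokens as a tuple.

    Hypothetical helper for illustration only; it expects a list
    (it uses slicing), whereas the NLTK helpers accept any iterable.
    """
    # tokens[0:], tokens[1:], ..., tokens[n-1:] zipped together yield the windows.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "<s> I am Sam </s>".split()
bigrams = ngrams_sketch(tokens, 2)
trigrams = ngrams_sketch(tokens, 3)

print(bigrams)
print(trigrams)
```

A sequence of length `k` yields `k - n + 1` n-grams, so the five tokens above give four bigrams and three trigrams.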



Now let's take the example from the slides, and let's work with it.

In [1]:
text1 = """<s> I am Sam </s>
<s> Sam I am </s>
<s> I like eggs </s>
<s> I do not like green eggs and ham </s>"""
text1

'<s> I am Sam </s>\n<s> Sam I am </s>\n<s> I like eggs </s>\n<s> I do not like green eggs and ham </s>'

We split the corpus into individual tokens and then compute the vocabulary size, that is, the number of distinct tokens.

In [4]:
words=text1.split()
words

['<s>',
 'I',
 'am',
 'Sam',
 '</s>',
 '<s>',
 'Sam',
 'I',
 'am',
 '</s>',
 '<s>',
 'I',
 'like',
 'eggs',
 '</s>',
 '<s>',
 'I',
 'do',
 'not',
 'like',
 'green',
 'eggs',
 'and',
 'ham',
 '</s>']

In [5]:
len(set(words))

12

Now create a list of bigrams from our corpus and print it out.

In [6]:
# Your code here:
list(nltk.bigrams(words))


[('<s>', 'I'),
 ('I', 'am'),
 ('am', 'Sam'),
 ('Sam', '</s>'),
 ('</s>', '<s>'),
 ('<s>', 'Sam'),
 ('Sam', 'I'),
 ('I', 'am'),
 ('am', '</s>'),
 ('</s>', '<s>'),
 ('<s>', 'I'),
 ('I', 'like'),
 ('like', 'eggs'),
 ('eggs', '</s>'),
 ('</s>', '<s>'),
 ('<s>', 'I'),
 ('I', 'do'),
 ('do', 'not'),
 ('not', 'like'),
 ('like', 'green'),
 ('green', 'eggs'),
 ('eggs', 'and'),
 ('and', 'ham'),
 ('ham', '</s>')]

### Conditional Frequencies

Now we want to see which words commonly follow a given word. To do that, we build a conditional frequency distribution over bigrams by combining `nltk.bigrams` and `nltk.ConditionalFreqDist`.

Please check the documentation, and see below how we get the conditional frequencies for the words in `text1`.

In [8]:
help(nltk.ConditionalFreqDist)

Help on class ConditionalFreqDist in module nltk.probability:

class ConditionalFreqDist(collections.defaultdict)
 |  ConditionalFreqDist(cond_samples=None)
 |  
 |  A collection of frequency distributions for a single experiment
 |  run under different conditions.  Conditional frequency
 |  distributions are used to record the number of times each sample
 |  occurred, given the condition under which the experiment was run.
 |  For example, a conditional frequency distribution could be used to
 |  record the frequency of each word (type) in a document, given its
 |  length.  Formally, a conditional frequency distribution can be
 |  defined as a function that maps from each condition to the
 |  FreqDist for the experiment under that condition.
 |  
 |  Conditional frequency distributions are typically constructed by
 |  repeatedly running an experiment under a variety of conditions,
 |  and incrementing the sample outcome counts for the appropriate
 |  conditions.  For example, the foll

In [11]:
cfreq_text1_bigrams= nltk.ConditionalFreqDist(nltk.bigrams(words))
cfreq_text1_bigrams

<ConditionalFreqDist with 12 conditions>

 What words occur after the word "I"? How often do those words occur? Check that!

In [12]:
#YOUR CODE HERE
cfreq_text1_bigrams['I']


FreqDist({'am': 2, 'like': 1, 'do': 1})
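
As a rough picture of what the conditional frequency distribution stores, the same counts can be reproduced with the standard library alone. This is a sketch under the assumption that a nested counter (condition, then following word) is an adequate stand-in; the name `cfreq` is illustrative and not part of NLTK.

```python
from collections import Counter, defaultdict

text1 = """<s> I am Sam </s>
<s> Sam I am </s>
<s> I like eggs </s>
<s> I do not like green eggs and ham </s>"""
words = text1.split()

# Map each word (the condition) to a counter of the words that follow it.
cfreq = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    cfreq[prev][nxt] += 1

print(cfreq["I"])   # counts of the words seen right after "I"
print(len(cfreq))   # number of distinct conditions
```

Each bigram `(prev, nxt)` increments one cell, so the counter for `"I"` holds exactly the follow-up words and counts listed in the output above.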

Raw frequencies alone are hard to interpret, because some words (usually function words such as connectors) appear much more often than others.

To make the counts comparable across conditions we need to normalize them, so next we calculate conditional probabilities.

### Conditional probabilities

Instead of looking only at frequency distributions, you will now build a probability distribution for each word in the corpus. The approach is similar, but uses the `nltk.ConditionalProbDist` class. Check the documentation and experiment with it. You can also consult section 2.2 of the NLTK book (Bird et al., 2009).

In [28]:
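#YOUR CODE HERE
#help(nltk.ConditionalProbDist)
# Build one smoothed probability distribution per condition (preceding word).
# Extra arguments are forwarded to the estimator, so 10 is ELEProbDist's `bins` value.
cprob_text1_bigrams = nltk.ConditionalProbDist(cfreq_text1_bigrams, nltk.ELEProbDist, 10)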
for el in cprob_text1_bigrams.items():
    print(el)

('<s>', <ELEProbDist based on 4 samples>)
('I', <ELEProbDist based on 4 samples>)
('am', <ELEProbDist based on 2 samples>)
('Sam', <ELEProbDist based on 2 samples>)
('</s>', <ELEProbDist based on 3 samples>)
('like', <ELEProbDist based on 2 samples>)
('eggs', <ELEProbDist based on 2 samples>)
('do', <ELEProbDist based on 1 samples>)
('not', <ELEProbDist based on 1 samples>)
('green', <ELEProbDist based on 1 samples>)
('and', <ELEProbDist based on 1 samples>)
('ham', <ELEProbDist based on 1 samples>)
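
To see what the normalization does, it can help to compute the numbers by hand. The sketch below compares the unsmoothed maximum likelihood estimate, $c / N$, with the expected likelihood estimate, which adds 0.5 to every count: $(c + 0.5) / (N + 0.5 \cdot B)$, where $c$ is the bigram count, $N$ the total count for the condition, and $B$ the number of bins. The helper names are hypothetical, and `bins=10` simply mirrors the value passed in the cell above.

```python
from collections import Counter, defaultdict

text1 = """<s> I am Sam </s>
<s> Sam I am </s>
<s> I like eggs </s>
<s> I do not like green eggs and ham </s>"""
words = text1.split()

cfreq = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    cfreq[prev][nxt] += 1


def mle_prob(prev, word):
    """Unsmoothed estimate: count(prev, word) / count(prev, anything)."""
    total = sum(cfreq[prev].values())
    return cfreq[prev][word] / total if total else 0.0


def ele_prob(prev, word, bins=10, gamma=0.5):
    """Expected likelihood estimate: add gamma to each of `bins` outcomes."""
    total = sum(cfreq[prev].values())
    return (cfreq[prev][word] + gamma) / (total + bins * gamma)


print(mle_prob("I", "am"), ele_prob("I", "am"))
print(mle_prob("I", "ham"), ele_prob("I", "ham"))
```

Smoothing moves some probability mass from seen bigrams to unseen ones, so a pair such as `("I", "ham")` no longer gets probability zero. Note that the vocabulary here has 12 distinct tokens, so a `bins` value of 12 may be the more natural choice.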


### Sentence probability

The probability of a sentence is calculated by multiplying the conditional probabilities of its words. For instance, let us take the following sentence:

`I like green eggs`

We want to calculate the whole sentence probability,

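$P(\langle s \rangle, I, \text{like}, \text{green}, \text{eggs}, \langle /s \rangle)$

which, by using the Markov property, we can model by multiplying bigram probabilities:

$P(\langle s \rangle, I, \text{like}, \text{green}, \text{eggs}, \langle /s \rangle) = $

$P(I \mid \langle s \rangle) \cdot P(\text{like} \mid I) \cdot P(\text{green} \mid \text{like}) \cdot P(\text{eggs} \mid \text{green}) \cdot P(\langle /s \rangle \mid \text{eggs})$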
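
One possible way to carry out this product, sketched with the standard library and unsmoothed (maximum likelihood) bigram estimates, is shown below. The function `sentence_prob` is a hypothetical helper, and exact fractions are used only to keep the arithmetic visible; with a smoothed estimator such as `ELEProbDist` the numbers would differ.

```python
from collections import Counter, defaultdict
from fractions import Fraction

text1 = """<s> I am Sam </s>
<s> Sam I am </s>
<s> I like eggs </s>
<s> I do not like green eggs and ham </s>"""
words = text1.split()

cfreq = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    cfreq[prev][nxt] += 1


def sentence_prob(tokens):
    """Multiply the unsmoothed bigram probabilities along the sentence."""
    prob = Fraction(1)
    for prev, word in zip(tokens, tokens[1:]):
        total = sum(cfreq[prev].values())
        if total == 0:
            return Fraction(0)  # the context word never occurs in the corpus
        prob *= Fraction(cfreq[prev][word], total)
    return prob


p = sentence_prob("<s> I like green eggs </s>".split())
print(p, float(p))
```

The five factors are 3/4, 1/4, 1/2, 1 and 1/2, which multiply to 3/64. For longer sentences, summing log probabilities instead of multiplying avoids numerical underflow.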


In [12]:
#YOUR CODE HERE


### Final exercise

You have now learned how to create simple language models. The goal now is to see how we can use them in practice. In recent years, language models have been used for a wide variety of tasks.

In this last exercise we will do something similar. You will need two texts from Project Gutenberg:

 - Carroll's Alice's Adventures in Wonderland: 'carroll-alice.txt'
 
 - Austen's Sense and Sensibility: 'austen-sense.txt'
 
Calculate a separate bigram language model for each book, and then find some word sequences that are more probable for one author than for the other. You can find the books at http://www.gutenberg.org/files/11/11-h/11-h.htm and https://www.gutenberg.org/files/161/161-h/161-h.htm, respectively.
Do you think it would be easy, or even possible, to find one or more whole sentences that have a nonzero probability in both novels?
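
As a starting point, the comparison could be organized as in the sketch below. It uses only the standard library and unsmoothed estimates, and the two token lists are tiny stand-ins (the lowercased opening words of each novel) so that the block runs without downloading anything. In the notebook you would pass the full token lists instead, for example `gutenberg.words('carroll-alice.txt')`. The helper names `bigram_counts` and `sequence_prob` are hypothetical.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def sequence_prob(counts, tokens):
    """Product of unsmoothed bigram probabilities along `tokens`."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values()) if prev in counts else 0
        if total == 0:
            return 0.0  # unseen context: the unsmoothed model assigns zero
        prob *= counts[prev][word] / total
    return prob


# Tiny stand-ins for the two novels (assumption: replace with the real corpora).
alice_tokens = "alice was beginning to get very tired of sitting by her sister".split()
austen_tokens = "the family of dashwood had long been settled in sussex".split()

alice_model = bigram_counts(alice_tokens)
austen_model = bigram_counts(austen_tokens)

phrase = ["very", "tired", "of"]
p_alice = sequence_prob(alice_model, phrase)
p_austen = sequence_prob(austen_model, phrase)
print(p_alice, p_austen)
```

Without smoothing, a single unseen bigram drives the whole product to zero, which is worth keeping in mind when thinking about whether a whole sentence can have nonzero probability under both models.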

In [24]:
#YOUR CODE HERE

