<div class="alert alert-danger">
**Due date:** 2018-02-02
</div>

# L2: Language modelling

## Introduction

In this lab you will experiment with $n$-gram models. You will test various parameters that influence these models&rsquo; quality and learn to estimate models with additive smoothing.

The following lines of code import the Python modules needed for this lab:

In [1]:
import nlp2
import ngrams

The data for this lab consists of Arthur Conan Doyle&rsquo;s stories about Sherlock Holmes: *The Adventures of Sherlock Holmes*, *The Memoirs of Sherlock Holmes*, *The Return of Sherlock Holmes*, *His Last Bow* and *The Case-Book of Sherlock Holmes*. The next piece of code loads the first three of these as training data:

In [2]:
training_data = nlp2.read_data("/home/TDDE09/labs/l2/data/advs.txt",
                               "/home/TDDE09/labs/l2/data/mems.txt",
                               "/home/TDDE09/labs/l2/data/retn.txt")

The data is represented as a list of sentences, where each sentence is a list of tokens (strings). The next line prints the sentence at index 101:

In [3]:
print(training_data[101])

['"', 'He', 'took', 'down', 'a', 'heavy', 'brown', 'volume', 'from', 'his', 'shelves', '.']


## Relation between a model’s quality and its order

In the first part of this lab you will examine the relation between an $n$-gram model’s quality and its **order**, i.e. the value of&nbsp;$n$. You will do both a qualitative and a quantitative evaluation, the latter with the help of the entropy measure.

### Qualitative evaluation

The following line trains a bigram-model of the class `ngrams.Model` on the training data.

In [4]:
model = nlp2.train(ngrams.Model, 2, training_data)

With this model you are able to generate random sentences. Every time you run the following code cell a new sentence is generated.

In [5]:
print(" ".join(model.generate()))

If I have my father's case of the lane now , to -- mental , and a little village policeman , heads .


Look at the sentences. Do they sound natural?

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Train a unigram-, bigram-, trigram-, and quadrigram-model, and generate random sentences with each. How does the quality of the sentences change with the model’s order? Explain your observations using your understanding of how an $n$-gram model works. Use some generated sentences to illustrate your discussion. What would the sentences look like for higher values of $n$, such as $n=10$?
</div>
</div>

In [6]:
# Train one model per order n = 1..4 and print a sample sentence from each.
models = []
for n in range(1, 5):
    models.append(nlp2.train(ngrams.Model, n, training_data))
    print(" ".join(models[-1].generate()))

# Train a 10-gram model for comparison.
model10 = nlp2.train(ngrams.Model, 10, training_data)
print(" ".join(model10.generate()))

the name wait no the operation can the : , Joseph
My God help me for three lads in the windows would cover all at Coxon's manager .
But here an instant the case , and almost to the end of the quiet thinker and logician of Baker Street .
" Instead of being ruined , my good sir , you will find a considerable quantity of straw , " said he .
" " Yes .



We observe that the sentences sound more natural as the value of n increases. <br> 
For example: <br>
n=1 gave us this: "That's was he astonishing departs . preserved It that" This is clearly not a natural sentence: a unigram model ignores context entirely and samples each word according to how common it is. <br>
n=4 gave us: "You have a few solid stepping-stones on which we may hope to get some data which may help us in our operations we erected a hydraulic press in excavating fuller's-earth , which , old as they were , too ." This sentence is locally much more fluent, because a quadrigram model conditions each word on the 3 preceding words. <br><br>

For a bigger n, such as n=10, each word is conditioned on the 9 preceding words. Almost every 9-word context occurs only once in the training data, so the model has essentially one possible continuation at each step and mostly reproduces training sentences verbatim. The output therefore looks natural but is no longer novel, and increasing n further brings diminishing returns.
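
To make the role of context concrete, here is a minimal sketch of sentence generation with a bigram model. It is not the implementation in `ngrams.Model`, whose internals are not shown in this lab; the toy corpus, the boundary markers `<s>` and `</s>`, and the helper names are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Toy corpus in the same format as training_data: a list of token lists.
toy_sentences = [
    ["Holmes", "smiled", "."],
    ["Holmes", "nodded", "."],
    ["Watson", "smiled", "."],
]

BOS, EOS = "<s>", "</s>"  # hypothetical sentence boundary markers


def train_bigram_successors(sentences):
    """Map each word to the list of words observed directly after it."""
    successors = defaultdict(list)
    for sentence in sentences:
        padded = [BOS] + sentence + [EOS]
        for prev, nxt in zip(padded, padded[1:]):
            successors[prev].append(nxt)
    return successors


def generate(successors, rng):
    """Sample a sentence word by word; each word depends only on the previous one."""
    word, out = BOS, []
    while True:
        # Duplicates in the list make frequent successors proportionally more likely.
        word = rng.choice(successors[word])
        if word == EOS:
            return out
        out.append(word)


successors = train_bigram_successors(toy_sentences)
print(" ".join(generate(successors, random.Random(0))))
```

Because each step only looks at the previous word, the sketch can produce sentences that never occurred in the toy corpus, such as a combination of *Watson* and *nodded*. With longer contexts there are fewer observed continuations per context, and the samples stay closer to the training text.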
    

### Quantitative evaluation

In order to do a quantitative evaluation of a model we can compute its **entropy** on held-out data. We will use the first part of the novel *The Adventures of Sherlock Holmes* for this. It is loaded by the following command:

In [7]:
test_data = nlp2.read_data("/home/TDDE09/labs/l2/data/test.txt")

The next piece of code trains a bigram-model and computes its entropy on the test data:

In [8]:
model = nlp2.train(ngrams.Model, 2, training_data)
nlp2.evaluate(model, test_data)

3.426862596420277

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Compute the entropy for the four models you created for the previous problem. How and why does the model’s entropy change with the model’s order? Explain using your knowledge of the entropy measure.
</div>
</div>

In [9]:
# TODO: Insert your code here
for i,model in enumerate(models):
    print(i+1, nlp2.evaluate(model,test_data))

1 7.337551182974018
2 3.426862596420277
3 1.4289533769461726
4 0.5436027106964166


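As n increases, the model conditions each prediction on more preceding words, so it assigns a higher probability to the word that actually comes next. Entropy measures the model's average surprise on the test data, so fewer surprises mean a smaller entropy. The drop is especially steep here because the test data is taken from *The Adventures of Sherlock Holmes*, which is also part of the training data: for large n most contexts have been seen with a single continuation, and the model predicts it with near certainty.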
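
The link between probabilities and entropy can be sketched as follows. This is an illustration only: it assumes entropy is the average negative base-2 log-probability per token and pads contexts with a hypothetical `<s>` marker, whereas the exact formula, log base, and padding used by `nlp2.evaluate()` are not shown in this lab.

```python
import math


def entropy(prob, order, sentences):
    """Average negative log2-probability per token under the function `prob`."""
    log_sum, count = 0.0, 0
    for sentence in sentences:
        # Pad so that the first real token also has a full context.
        padded = ["<s>"] * (order - 1) + sentence
        for i in range(order - 1, len(padded)):
            ctxt = tuple(padded[i - order + 1:i])
            log_sum += math.log2(prob(ctxt, padded[i]))
            count += 1
    return -log_sum / count


toy_test = [["Holmes", "smiled", "."], ["Watson", "nodded", "."]]

# A model that spreads its mass over 4 words per context is more surprised
# than one that spreads it over 2 words per context.
print(entropy(lambda ctxt, word: 0.25, 2, toy_test))
print(entropy(lambda ctxt, word: 0.5, 2, toy_test))
```

The more probability a model puts on the words that actually occur, the lower the value returned by this function.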

## Relation between a model’s quality and the estimation method

In the second part of this lab you will implement and evaluate different estimation methods. When you called `nlp2.train()` you created an $n$-gram model, an instance of the class `ngrams.Model`, and trained this model using maximum likelihood estimation. To implement different estimation methods you will have to create instances of your own class of models.

### The contents of a model

The first step towards your own model class is to understand what methods are available inside a model. To this end, the next cell shows you skeleton code for your own `Model` class, which inherits from `ngrams.Model`. Note that you do not need to modify this code at this point; you should simply try to follow what&rsquo;s going on and try to understand how to use instances of the model class.

In [10]:
class Model(ngrams.Model):
    
    def order(self):
        """Return the order of this model (an integer)."""
        return super().order()
    
    def vocabulary(self):
        """Return this model's vocabulary (a set)."""
        return super().vocabulary()
    
    def freq(self, ctxt, word):
        """Return the number of occurrences of `word` (a string) after `ctxt` (a tuple of strings)."""
        return super().freq(ctxt, word)
    
    def total(self, ctxt):
        """Return the total number of ngrams that start with `ctxt` (a tuple of strings)."""
        return super().total(ctxt)
    
    def prob(self, ctxt, word):
        """Return the probability for `word` (a string) given `ctxt` (a tuple of strings)."""
        return super().prob(ctxt, word)

The code in the next cell trains a bigram-model of class `Model` and prints the model’s order (an integer) and the size of its vocabulary (a set of strings, represented by Python’s `set` type).

In [11]:
model = nlp2.train(Model, 2, training_data)
print("order of the model:", model.order())
print("number of words in the model's vocabulary:", len(model.vocabulary()))

order of the model: 2
number of words in the model's vocabulary: 15339


#### Look up an n-gram’s absolute frequency

A trained model consists primarily of a table with absolute frequencies for all $n$-grams that appear in the text it was trained on. In order to look up an $n$-gram’s absolute frequency you can use the method `freq()`. An $n$-gram is divided into two parts: an $(n-1)$-gram called **context** (`ctxt`) and a final unigram (`word`). In Python the context is represented as a tuple of strings and the unigram as a normal string.

If you want to train a trigram model and then know the absolute frequency for the trigram *Mr. Sherlock Holmes* you can write:

In [12]:
model = nlp2.train(Model, 3, training_data)
model.freq(("Mr.", "Sherlock"), "Holmes")

50

For training a bigram model and looking up the absolute frequency for the bigram *Baker Street* you can write the following. Note that the context of a bigram model is a 1-tuple of strings, which has a special notation in Python.

In [13]:
model = nlp2.train(Model, 2, training_data)
model.freq(("Baker",), "Street")

67

#### Look up the absolute frequency of an n-gram with a given context

The method `total()` returns the absolute frequency of $n$-grams with the specified context. Here is an example for a trigram model:

In [14]:
model = nlp2.train(Model, 3, training_data)
model.total(("Mr.", "Sherlock"))

50
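
The split of an $n$-gram into a context and a final word can be sketched with a small frequency table. This is not how `ngrams.Model` is necessarily implemented; the function `count_ngrams`, the toy sentences, and the choice to ignore sentence boundaries are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_ngrams(sentences, n):
    """Return {context tuple: Counter of following words} for all n-grams."""
    table = defaultdict(Counter)
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            ctxt = tuple(sentence[i:i + n - 1])   # the (n-1)-gram context
            word = sentence[i + n - 1]            # the final unigram
            table[ctxt][word] += 1
    return table


toy = [["Mr.", "Sherlock", "Holmes", "smiled", "."],
       ["Mr.", "Sherlock", "Holmes", "nodded", "."]]
table = count_ngrams(toy, 3)

ctxt = ("Mr.", "Sherlock")
print(table[ctxt]["Holmes"])        # analogous to freq(ctxt, "Holmes")
print(sum(table[ctxt].values()))    # analogous to total(ctxt)
```

In this sketch the frequency of an $n$-gram is one cell of the table, and the total for a context is the sum over that context's row.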


<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
Train a bigram model and use it to calculate the following values, using the methods shown above.
</div>
</div>

In [15]:
model = nlp2.train(Model, 2, training_data)

**3.1.** the absolute frequency for the bigram *Sherlock Holmes*

In [16]:
model.freq(("Sherlock",), "Holmes")

195

**3.2.** the absolute frequency of bigrams with the context *Sherlock*

In [17]:
model.total(("Sherlock",))

210

**3.3.** the absolute frequency for the unigram *Sherlock* &ndash; **note that you should still use the bigram model for this!**

In [18]:
model.total(("Sherlock",))

210

**3.4.** the absolute frequency of trigrams with the context *Sherlock Holmes* &ndash; **note that you should still use the bigram model for this!**

In [19]:
model.freq(("Sherlock",), "Holmes")

195

**3.5.** the number of words in the vocabulary

In [20]:
len(model.vocabulary())

15339

**3.6.** a list with all the unique words following the context *Sherlock*

In [21]:
for word in model.vocabulary():
    if model.freq(("Sherlock",),word):
        print(word)

,
has
everywhere
.
looked
Holmes
?
Holmes's
!


(For 3.6 you will need to write a bit more than a simple function call.)

### Estimate probabilities with the Maximum Likelihood method

The method `prob()` returns the estimated conditional probability $P(w|c)$ for a word $w$ given a context $c$. The following code snippet trains a trigram model and estimates the probability for *Holmes* given the context *Mr. Sherlock*:

In [22]:
model = nlp2.train(Model, 3, training_data)
model.prob(("Mr.", "Sherlock"), "Holmes")

1.0

(What does the returned value imply?)

<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
Do your own implementation of the method `prob()`. The method should estimate probabilities using the Maximum Likelihood method. You can call the methods that you used to solve Problem&nbsp;3. Test your implementation by redoing the evaluation from Problem&nbsp;2 with the new class `Model` instead of `ngrams.Model`. You should get the same results as before.
</div>
</div>

In order to solve this problem you will need to turn the formula for Maximum Likelihood estimation into code. We illustrate the formula for a bigram model. If we write $f(w_1w_2)$ for the number of occurrences of the bigram $w_1w_2$ and $f(w_1)$ for the number of occurrences of the unigram $w_1$, then the probability for observing $w_2$ given $w_1$ is
$$
P(w_2|w_1) = \frac{f(w_1w_2)}{f(w_1)}
$$
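
As a worked example, take the bigram counts obtained in Problem&nbsp;3, where *Sherlock Holmes* occurs 195 times and 210 bigrams start with *Sherlock* (the latter is used here as the count of the unigram *Sherlock*):
$$
P(\text{Holmes}|\text{Sherlock}) = \frac{f(\text{Sherlock Holmes})}{f(\text{Sherlock})} = \frac{195}{210} \approx 0.929
$$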

In [23]:
class Model(ngrams.Model):
    
    def prob(self, ctxt, word):
        """Return the probability for `word` (a string) given `ctxt` (a tuple of strings)."""
        # Maximum likelihood: count of the n-gram divided by the count of its context.
        return self.freq(ctxt, word) / self.total(ctxt)


model = nlp2.train(Model, 3, training_data)
model.prob(("Mr.", "Sherlock"), "Holmes")

1.0

### Problems with Maximum Likelihood estimation

The file `yoda.txt` contains the same text as `test.txt`, but in the jumbled [Yoda-language](http://itre.cis.upenn.edu/~myl/languagelog/archives/002173.html).

In [24]:
yoda_data = nlp2.read_data("/home/TDDE09/labs/l2/data/yoda.txt")

<div class="panel panel-primary">
<div class="panel-heading">Problem 5</div>
<div class="panel-body">
Redo the evaluation of the four previous models with `yoda.txt` as test data. For models with $n>1$ you get an error. Why? Explain what goes wrong based on your knowledge of Maximum Likelihood estimation.
</div>
</div>

In [25]:

for i,model in enumerate(models):
    print(i+1, nlp2.evaluate(model,yoda_data))

1 7.2441064060866385


ValueError: math domain error

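Entropy is computed from log-probabilities: for every word in the test data, the evaluation takes the logarithm of P(wn | w1,...,wn-1), the probability that the model assigns to the word given its context. With Maximum Likelihood estimation, any n-gram that never occurred in the training data gets probability 0, and log(0) is undefined, which raises the math domain error. The reordered Yoda sentences contain such unseen n-grams for n>1. The unigram model still works because it ignores word order and, as its successful evaluation shows, every word in `yoda.txt` has a non-zero probability.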
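
The failure can be reproduced in isolation with Python's `math.log`. The helper `log_prob` is a hypothetical stand-in for the step inside the evaluation that takes the logarithm of a probability; the exact wording of the error message depends on the Python version.

```python
import math


def log_prob(p):
    """Logarithm of a probability, as needed when computing entropy."""
    return math.log(p)


print(log_prob(0.5))       # fine: a negative number

try:
    log_prob(0.0)          # probability of an n-gram never seen in training
except ValueError as err:
    print("ValueError:", err)
```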

### Estimate probabilities with additive smoothing

For the next problem you are going to do Maximum Likelihood estimation, but with additive smoothing.
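
For reference, the usual form of additive smoothing for a bigram model is given below, in the same notation as the Maximum Likelihood formula. Here $k$ is the smoothing constant and $|V|$ is the size of the vocabulary; the symbol $|V|$ is introduced here and does not appear elsewhere in the lab text.
$$
P(w_2|w_1) = \frac{f(w_1w_2) + k}{f(w_1) + k\,|V|}
$$
For $k=0$ this reduces to the Maximum Likelihood estimate.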

<div class="panel panel-primary">
<div class="panel-heading">Problem 6</div>
<div class="panel-body">
<p>
Write a new implementation of the method `prob()`, such that it estimates probabilities with additive smoothing.</p>
<p>
Evaluate the system on `test.txt` with the new class using the entropy measure from Problem&nbsp;2. Choose the following values for the smoothing constant $k$: 0.00, 0.01, 0.10, 1.00. For $k=0$ you should get the same results as in Problem&nbsp;4.
</p>
<p>
Why and how does the smoothing constant influence the model’s entropy? Provide an explanation based on your understanding of what smoothing does to the distribution of the probability mass among observed and hallucinated occurrences.
</p>
</div>
</div>

In [26]:
class Model(ngrams.Model):
    
    def prob(self, ctxt, word):
        """Return the probability for `word` (a string) given `ctxt` (a tuple of strings)."""
        # The smoothing constant k is set on the instance after training; default to no smoothing.
        k = getattr(self, "k", 0)
        return (self.freq(ctxt, word) + k) / (self.total(ctxt) + k * len(self.vocabulary()))


for n in range(1, 5):
    model = nlp2.train(Model, n, training_data)
    for k in [0, 0.01, 0.10, 1.00]:
        model.k = k
        print("n=%s k=%s Entropy: %s" % (n, k, nlp2.evaluate(model, test_data)))
    print()

n=1 k=0 Entropy: 7.337551182974018
n=1 k=0.01 Entropy: 7.3366010522328216
n=1 k=0.1 Entropy: 7.328460956945743
n=1 k=1.0 Entropy: 7.273171406469247

n=2 k=0 Entropy: 3.426862596420277
n=2 k=0.01 Entropy: 4.813092274317304
n=2 k=0.1 Entropy: 5.983399629244003
n=2 k=1.0 Entropy: 7.383667454674711

n=3 k=0 Entropy: 1.4289533769461726
n=3 k=0.01 Entropy: 4.767917212034177
n=3 k=0.1 Entropy: 6.733205552260075
n=3 k=1.0 Entropy: 8.457347218116801

n=4 k=0 Entropy: 0.5436027106964166
n=4 k=0.01 Entropy: 4.854491062575155
n=4 k=0.1 Entropy: 6.955629963762169
n=4 k=1.0 Entropy: 8.657735266369459



To solve this problem, you can simply run your code multiple times, with different values for&nbsp;$n$ and&nbsp;$k$. Enter your results into the table below.

<table>
<tr><td></td><td>k = 0.00</td><td>k = 0.01</td><td>k = 0.10</td><td>k = 1.00</td></tr>
<tr><td>n = 1</td><td>7.3376</td><td>7.3366</td><td>7.3285</td><td>7.2732</td></tr>
<tr><td>n = 2</td><td>3.4269</td><td>4.8131</td><td>5.9834</td><td>7.3837</td></tr>
<tr><td>n = 3</td><td>1.4290</td><td>4.7679</td><td>6.7332</td><td>8.4573</td></tr>
<tr><td>n = 4</td><td>0.5436</td><td>4.8545</td><td>6.9556</td><td>8.6577</td></tr>
</table>


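We notice that we get the same results as earlier when k = 0. For k > 0, smoothing adds k imaginary occurrences of every vocabulary word to every context, which moves probability mass from observed n-grams to unobserved ones. Because the test data is part of the training data, its n-grams were all observed, so they lose probability mass and the entropy increases with k. The effect grows with n, because the small counts of higher-order contexts are swamped by the imaginary occurrences of all 15339 vocabulary words. For n = 1 the entropy is slightly reduced as k increases. We reason that this is because the probability distribution is evened out, and since the model does not take contextual information into account, it should not be too confident in any particular word.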
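
The redistribution of probability mass can be illustrated with the bigram counts reported earlier in this lab (195 occurrences of *Sherlock Holmes*, 210 bigrams starting with *Sherlock*, and a vocabulary of 15339 words). The function `additive_prob` is a standalone sketch of the smoothing formula, not part of `nlp2` or `ngrams`.

```python
def additive_prob(freq, total, k, vocab_size):
    """Additive (add-k) smoothing estimate for one word given one context."""
    return (freq + k) / (total + k * vocab_size)


# Counts taken from the bigram model in this lab.
freq_sherlock_holmes = 195   # occurrences of the bigram "Sherlock Holmes"
total_sherlock = 210         # number of bigrams starting with "Sherlock"
vocab_size = 15339

for k in [0, 0.01, 0.10, 1.00]:
    seen = additive_prob(freq_sherlock_holmes, total_sherlock, k, vocab_size)
    unseen = additive_prob(0, total_sherlock, k, vocab_size)
    print(f"k={k:<4} P(Holmes|Sherlock)={seen:.4f}  P(unseen word|Sherlock)={unseen:.6f}")
```

Under these counts, k = 1 lowers the estimate for *Holmes* after *Sherlock* from roughly 0.93 to roughly 0.013, while every word never seen after *Sherlock* receives a small non-zero probability.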

### An unseen test set

Your last exercise is to redo the evaluation on a previously unseen test set, texts from the collection *His Last Bow*.

In [27]:
unseen_data = nlp2.read_data("/home/TDDE09/labs/l2/data/lstb.txt")

<div class="panel panel-primary">
<div class="panel-heading">Problem 7</div>
<div class="panel-body">
Redo the evaluation from Problem 6 with the new test data. Explain (without fixing anything) what happens given the differences between `test.txt` and `lstb.txt`.
</div>
</div>

In [28]:
for n in range(1, 5):
    model = nlp2.train(Model, n, training_data)
    for k in [0, 0.01, 0.10, 1.00]:
        model.k = k
        print("n=%s k=%s Entropy: %s" % (n, k, nlp2.evaluate(model, unseen_data)))

KeyError: 'HIS'

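The token 'HIS' does not occur in the training data and is therefore not in the model's vocabulary, so looking it up raises a KeyError when the model is evaluated on `unseen_data`. This did not happen with `test.txt`, whose text is also part of the training data. Additive smoothing only reserves probability mass for words in the vocabulary, so it does not cover tokens like this one.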
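
Which tokens fall outside a vocabulary can be checked with a simple set comparison. The sketch below uses invented toy data and a hypothetical helper `out_of_vocabulary`; in the lab, the same idea would apply to `model.vocabulary()` and `unseen_data`.

```python
def out_of_vocabulary(vocabulary, sentences):
    """Return the set of tokens in `sentences` that are not in `vocabulary`."""
    return {token for sentence in sentences for token in sentence
            if token not in vocabulary}


# Toy data: the vocabulary is case-sensitive, so "HIS" and "his" are different tokens.
toy_vocabulary = {"his", "last", "bow", "."}
toy_unseen = [["HIS", "LAST", "BOW"], ["his", "last", "bow", "."]]

print(sorted(out_of_vocabulary(toy_vocabulary, toy_unseen)))
```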