# Assignment #2: Language models
### Author: Hicham Mohamad (hi8826mo-s)

 Table of Contents
 ==
1. [Collecting and analyzing a corpus](#t1)
2. [Segmenting a corpus](#t2)
3. [Counting unigrams and bigrams](#t3)
4. [Computing the likelihood of a sentence](#t4)
5. [Online prediction of words](#t5)
6. [Check answers](#t6)
7. [Submission](#t7)
8. [Reading](#t8) 

## Objectives

The objectives of this assignment are to:
* Write a program to find n-gram statistics
* Compute the probability of a sentence
* Know what a language model is
* Write a short report of 1 to 2 pages on the assignment
* Optionally read a short article on the importance of corpora


## Submission

Once you have written all the missing code and run all the cells, you will submit your notebook to an automatic marking system. Do not erase the content of the cells as we will possibly check your programs manually.
The submission instructions are at the bottom of the notebook.

## Organization

* Each group will have to write Python programs to count **unigrams**, **bigrams**, and **trigrams** in a corpus of approximately one million words and to determine the probability of a sentence.
* You can test your regular expressions using the **regex101.com** site
* Each student will have to write a short report of one to two pages and briefly comment on the results. In your report, you must produce the tabulated results of your analysis as described below.

## Programming

### Imports

Some imports you may need. Add others as needed.

In [20]:
import bz2
import math
import os
import regex as re
import requests
import sys
from zipfile import ZipFile

### Collecting and analyzing a corpus <a name="t1"/>

In [21]:
# You may have to adjust the path
#corpus = open('../../../corpus/Selma.txt', encoding='utf8').read()
corpus = open('Selma.txt', encoding='utf8').read()

- Run the <a href="https://github.com/pnugues/ilppp/tree/master/programs/ch02/python">concordance
program </a> to print the lines containing a specific word, for instance <i>Nils</i>.

In [22]:
pattern = 'Nils Holgersson'
width = 25

#### Concordance program

In [23]:
# spaces match tabs and newlines
pattern = re.sub(' ', '\\s+', pattern)
# Replaces newlines with spaces in the text
clean_corpus = re.sub(r'\s+', ' ', corpus)
concordance = ('(.{{0,{width}}}{pattern}.{{0,{width}}})'
               .format(pattern=pattern, width=width))
for match in re.finditer(concordance, clean_corpus):
    print(match.group(1))
# print the string with 0..width characters on either side

Selma Lagerlöf Nils Holgerssons underbara resa genom Sv
! Se på Tummetott! Se på Nils Holgersson Tummetott!» Genast vände
r,» sade han. »Jag heter Nils Holgersson och är son till en husma
lden. »Inte är det värt, Nils Holgersson, att du är ängslig eller
 i dem. På den tiden, då Nils Holgersson drog omkring med vildgäs
ulle allt visa honom vad Nils Holgersson från Västra Vemmenhög va
om ägde rum det året, då Nils Holgersson for omkring med vildgäss
m vad det kan kosta dem. Nils Holgersson hade inte haft förstånd 
de det inte mer sägas om Nils Holgersson, att han inte tyckte om 
 Rosenbom?» För där stod Nils Holgersson mitt uppe på Rosenboms n
 Med ens fingo de syn på Nils Holgersson, och då sköt den store v
vila. När vildgässen och Nils Holgersson äntligen hade letat sig 
 slags arbetare. Men vad Nils Holgersson inte såg, det var, att s
nde han fråga, och om då Nils Holgersson sade nej, började han ge
de lille Mats, och om nu Nils Holgersson också hade tegat, så had
åg så försmädlig ut,
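
The way the concordance pattern is assembled can be easier to follow on a short string. The sketch below is a minimal illustration, not the course program: the `concordance` function name and the sample sentence are assumptions made for this example, and it uses the standard-library `re` module rather than `regex`. The doubled braces in the format string produce literal braces, so `.{{0,{width}}}` becomes `.{0,10}` when `width` is 10.

```python
import re

def concordance(pattern, text, width=10):
    """Return each match of pattern with up to `width` characters of context on either side."""
    # Let a space in the pattern match any run of whitespace (tabs, newlines)
    pattern = pattern.replace(' ', r'\s+')
    # Collapse all whitespace runs in the text into single spaces
    text = re.sub(r'\s+', ' ', text)
    # '{{' and '}}' are literal braces for str.format, giving '.{0,10}' for width=10
    regex = '(.{{0,{width}}}{pattern}.{{0,{width}}})'.format(pattern=pattern, width=width)
    return [match.group(1) for match in re.finditer(regex, text)]

# Hypothetical sample with a line break inside the name
sample = 'Jag heter Nils\nHolgersson och är son till en husman.'
lines = concordance('Nils Holgersson', sample, width=10)
for line in lines:
    print(line)
```

The match is found even though the name is split across two lines, because the whitespace is normalized before matching.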

- Run a simple <a href="https://github.com/pnugues/ilppp/tree/master/programs/ch05/python">tokenization
program</a> on your corpus.

In [24]:
def tokenize(text):
    words = re.findall(r'\p{L}+', text)
    return words

In [25]:
words = tokenize(corpus)
words[:10]

['Selma',
 'Lagerlöf',
 'Nils',
 'Holgerssons',
 'underbara',
 'resa',
 'genom',
 'Sverige',
 'Första',
 'bandet']

- Count the number of **unique words** in the original corpus, and then again after setting all the words in **lowercase**

Original text

In [26]:
# Write your code here
# Count the words and store them in a dictionary.
# It scans the words list and 
# increments the frequency of the words as they occur.
def count_unigrams(words):
    frequency = {}
    for word in words:
        if word in frequency:
            frequency[word] += 1
        else:
            frequency[word] = 1
            
    return frequency

In [27]:
frequency = count_unigrams(words)
print('vocabulary/words: ', len(words))
print('unique words: ', len(frequency))

vocabulary/words:  923485
unique words:  44256


Lowercased text

In [28]:
# Write your code here
lowerwords = []
for word in words:
    lowerwords.append(word.lower())
    
#print(len(lowerwords))    
lowerwords[:10]

['selma',
 'lagerlöf',
 'nils',
 'holgerssons',
 'underbara',
 'resa',
 'genom',
 'sverige',
 'första',
 'bandet']
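
The comparison asked for above (unique words before and after lowercasing) can be sketched with `collections.Counter`, which does the same bookkeeping as the dictionary loop in `count_unigrams`. The token list here is a hypothetical toy example, not the Selma corpus, so the numbers say nothing about the real vocabulary size.

```python
from collections import Counter

# Hypothetical toy token list standing in for tokenize(corpus)
tokens = ['Nils', 'och', 'nils', 'Och', 'gåsen', 'och']

# Case-sensitive counts: 'Nils' and 'nils' are different keys
case_sensitive = Counter(tokens)
# Case-insensitive counts: every token is lowercased before counting
lowercased = Counter(token.lower() for token in tokens)

print('unique words (original): ', len(case_sensitive))
print('unique words (lowercase):', len(lowercased))
```

On the real corpus, the same two `len()` calls applied to `words` and `lowerwords` would give the figures needed for the report.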

### Segmenting a corpus <a name="t2"/>

You will write a program to tokenize your text, insert `<s>` and `</s>` tags to delimit sentences, and set all the words in lowercase letters. In the end, you will only keep the words.

#### Normalizing 

- Write a **regular expression** that matches all the characters that are neither a letter nor a punctuation sign. The **punctuation signs** are the following: `.;:?!`. In your regex, use the same order. For the definition of a letter, use a Unicode regex. You will call the regex string `nonletter`

**NOTE:** A string of characters enclosed in square **brackets** ([ ]) matches any one character in that string. If the first character in the brackets is a **caret** (^), it matches any character except those in the string. 

In [29]:
# Write your code
nonletter = r'([^\p{L}.;:?!]+)'
#re.findall(nonletter, 'En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa')

- Write a `clean()` function that replaces all the characters that are neither a letter nor a punctuation sign with a **space**. The punctuation signs are the following: `.;:?!`. For the sentence:

_En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa._

the result will be:

`En gång hade de på Mårbacka en barnpiga som hette Back Kajsa.`

In [30]:
# Write your code here
def clean(text):
    return re.sub(nonletter, ' ', text)

In [31]:
clean('En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa.')

'En gång hade de på Mårbacka en barnpiga som hette Back Kajsa.'
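
For readers without the third-party `regex` module, the same cleaning step can be approximated with the standard-library `re` module. This is a sketch under an assumption: since `re` has no `\p{L}`, "letter" is expressed as "word character that is neither a digit nor an underscore". The names `NONLETTER_STDLIB` and `clean_stdlib` are introduced only for this example.

```python
import re

# Runs of characters that are neither letters nor one of . ; : ? !
# [^\w.;:?!] matches anything that is not a word character or a kept punctuation sign;
# [\d_] additionally matches digits and underscores, which \w alone would protect.
NONLETTER_STDLIB = r'(?:[^\w.;:?!]|[\d_])+'

def clean_stdlib(text):
    """Replace each run of non-letter, non-punctuation characters with one space."""
    return re.sub(NONLETTER_STDLIB, ' ', text)

cleaned = clean_stdlib('En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa.')
print(cleaned)
```

Because the pattern ends with `+`, a comma followed by a space is replaced by a single space rather than two.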

In [32]:
test_para = 'En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa. \
Hon var nog sina tre alnar lång, hon hade ett stort, grovt ansikte med stränga, mörka drag, \
hennes händer voro hårda och fulla av sprickor, som barnens hår fastnade i, \
när hon kammade dem, och till humöret var hon dyster och sorgbunden.'

In [33]:
test_para = clean(test_para)
test_para

'En gång hade de på Mårbacka en barnpiga som hette Back Kajsa. Hon var nog sina tre alnar lång hon hade ett stort grovt ansikte med stränga mörka drag hennes händer voro hårda och fulla av sprickor som barnens hår fastnade i när hon kammade dem och till humöret var hon dyster och sorgbunden.'

#### Segmenter

In this section, you will write a **sentence segmenter** that will delimit each sentence with `</s>` and `<s>` symbols. For example the sentence:

_En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa._

will be bracketed as:

`<s> En gång hade de på Mårbacka en barnpiga som hette Back-Kajsa </s>`

As the algorithm, you will use a simple heuristic to detect the **sentence boundaries**: *A sentence starts with a capital letter and ends with a period-equivalent punctuation sign*. You will write a **regex** to match these boundaries and insert `</s>\n<s>` symbols with a **substitution** function.

##### Detecting sentence boundaries

- Write a **regular expression** that matches a punctuation, a sequence of spaces, and an uppercase letter. Call this regex string `sentence_boundaries`. In the regex, you will remember the value of the uppercase letter using a **backreference**. Use the **Unicode regexes** for the letters and the spaces.

Backreference examples

text = "From the beginning of the world. There was a ship. The man with no tail."

def normalization(text):
    return re.sub('([\p{S}\p{P}])', r' \1 ', text)

normalization(text)

re.sub(r'(\b[a-z]+) \1', r'\1', 'cat in the the hat')

- The '`r`' at the start of the pattern string designates a **python "raw" string** which passes through backslashes without change.
- Square brackets can be used to indicate a **set of chars**, so `[abc]` matches 'a' or 'b' or 'c'. 
- It is often useful to be able to refer to a particular subpart of the string matching the first pattern. For example, suppose we wanted to put angle brackets around all integers in a text, changing e.g. *the 35 boxes* to *the <35> boxes*. We'd like a way to *refer back* to the integer we've found so that we can easily add the brackets. To do this, we put *parentheses* `(` and `)` around the first pattern, and use the **number operator** `\1` in the second pattern to refer back. Here is how it looks:

re.sub(r'([0-9]+) \1', r'<\1>', 'the the 35 35 boxes')

re.sub(r'([a-z]+) \1', r'\1', 'the the 35 35 boxes')
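
The backreference calls above can be checked with a short runnable sketch. It uses the standard-library `re` module, which is enough here because none of the patterns rely on Unicode properties; the variable names are introduced only for this example. The first call is the bracketing case described in the text, and the others show `\1` used inside the pattern itself to detect a repeated token.

```python
import re

# \1 in the replacement: wrap every integer in angle brackets
bracketed = re.sub(r'([0-9]+)', r'<\1>', 'the 35 boxes')

# \1 in the pattern: match a word followed by a space and the same word again
deduplicated = re.sub(r'(\b[a-z]+) \1', r'\1', 'cat in the the hat')

# Both uses combined: a repeated integer is collapsed and bracketed
repeated_number = re.sub(r'([0-9]+) \1', r'<\1>', 'the the 35 35 boxes')

print(bracketed)        # the <35> boxes
print(deduplicated)     # cat in the hat
print(repeated_number)  # the the <35> boxes
```

The same mechanism carries the uppercase letter across the substitution in the sentence segmenter: the letter is captured in the boundary pattern and re-emitted with `\1` in the markup.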

In [34]:
# Write your code here
# \p{Lu}  an uppercase letter that has a lowercase variant.
# sentence_boundaries = r'\p{P}\s+(\p{Lu})' 
sentence_boundaries = r'[.;:?!]+\p{Z}+(\p{Lu})'
#sentence_boundaries = r'(\p{Lu})'


##### Replacement markup

- Write a regex string to replace the **matched boundaries** with the sentence **boundary markup**. Remember that a sentence ends with `</s>` and starts with `<s>` and that there is one sentence per line. Hint: The markup is `</s>\n<s>`. Remember also that the first letter of your sentence is in a **regex backreference**. Call the regex string `sentence_markup`.

In [35]:
# Write your code here
sentence_markup = r' </s>\n<s> \1'

##### Applying the substitution

- Use your regexes to segment your text. Use the string `sentence_boundaries`, `sentence_markup`, and `test_para` as input and `text` as output.

In [36]:
# Write your code here
text = re.sub(sentence_boundaries, sentence_markup, test_para)

In [37]:
print(text)

En gång hade de på Mårbacka en barnpiga som hette Back Kajsa </s>
<s> Hon var nog sina tre alnar lång hon hade ett stort grovt ansikte med stränga mörka drag hennes händer voro hårda och fulla av sprickor som barnens hår fastnade i när hon kammade dem och till humöret var hon dyster och sorgbunden.


The output should look like this:

`En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa </s>
<s> Hon var nog sina tre alnar lång, hon hade ett stort, grovt ansikte med stränga, mörka drag, hennes händer voro hårda och fulla av sprickor, som barnens hår fastnade i, när hon kammade dem, och till humöret var hon dyster och sorgbunden.`

- Insert **markup codes** in the beginning and end of the text

In [38]:
# Write your code here
#text = re.sub(r'^', r' <s> ', text)
#text = re.sub(r'$', r' </s>', text)
text = '<s> ' + text + ' </s>'
print(text)

<s> En gång hade de på Mårbacka en barnpiga som hette Back Kajsa </s>
<s> Hon var nog sina tre alnar lång hon hade ett stort grovt ansikte med stränga mörka drag hennes händer voro hårda och fulla av sprickor som barnens hår fastnade i när hon kammade dem och till humöret var hon dyster och sorgbunden. </s>


The output should look like this:

`<s> En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa </s>
<s> Hon var nog sina tre alnar lång, hon hade ett stort, grovt ansikte med stränga, mörka drag, hennes händer voro hårda och fulla av sprickor, som barnens hår fastnade i, när hon kammade dem, och till humöret var hon dyster och sorgbunden. </s>`

- Replace the **space duplicates** with one space and remove the **punctuation signs**. For the spaces, use the Unicode regex.

In [39]:
# Write your code here
# Replace the space duplicates with one space
# \p{Z} is the Unicode category of separators (spaces, line and paragraph separators)
text = re.sub(r'\p{Z}+', r' ', text)

# remove the punctuation signs
text = re.sub(r'[.;:?!]', r'', text)

In [40]:
print(text)

<s> En gång hade de på Mårbacka en barnpiga som hette Back Kajsa </s>
<s> Hon var nog sina tre alnar lång hon hade ett stort grovt ansikte med stränga mörka drag hennes händer voro hårda och fulla av sprickor som barnens hår fastnade i när hon kammade dem och till humöret var hon dyster och sorgbunden </s>


The output should look like this:
    
`<s> En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa </s>
<s> Hon var nog sina tre alnar lång, hon hade ett stort, grovt ansikte med stränga, mörka drag, hennes händer voro hårda och fulla av sprickor, som barnens hår fastnade i, när hon kammade dem, och till humöret var hon dyster och sorgbunden </s>`

- Write a `segment_sentences(text)` function to gather the code in the **Segmenter** section and set the text in **lowercase**

In [41]:
# Write your code here
def segment_sentences(text):
    # matches a punctuation, a sequence of spaces, and an uppercase letter
    sentence_boundaries = r'[.;:?!]\p{Z}+(\p{Lu})'
    
    # sentence boundary markup
    sentence_markup = r' </s>\n<s> \1'
    
    # Substitution: replace the matched boundaries with the sentence boundary markup
    text = re.sub(sentence_boundaries, sentence_markup, text)
    
    # Insert markup codes in the beginning and end of the text
    text = '<s> ' + text + ' </s>'
    
    # Replace the space duplicates with one space
    # \p{Z} is the Unicode category of separators (spaces, line and paragraph separators)
    text = re.sub(r'\p{Z}+', r' ', text)
    
    # remove the punctuation signs
    text = re.sub(r'[.;:?!]', r'', text)
    #text = re.sub(r'^', r'<s> ', text)
    #text = re.sub(r'$', r' </s>', text)
        
    text = text.lower()
    return text
    

In [42]:
print(segment_sentences(test_para))

<s> en gång hade de på mårbacka en barnpiga som hette back kajsa </s>
<s> hon var nog sina tre alnar lång hon hade ett stort grovt ansikte med stränga mörka drag hennes händer voro hårda och fulla av sprickor som barnens hår fastnade i när hon kammade dem och till humöret var hon dyster och sorgbunden </s>


- Estimate roughly the **accuracy** of your program.
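
One way to approach the accuracy estimate is to run the heuristic on a few sentences whose boundaries are known and count the errors by hand. The sketch below is only an illustration under stated assumptions: the sample text is invented, the `split_sentences` name is hypothetical, and the standard-library `re` module is used with an explicit Swedish uppercase class `[A-ZÅÄÖ]` instead of `\p{Lu}`.

```python
import re

# Punctuation, whitespace, then an uppercase letter (checked with a lookahead
# so that the letter stays at the start of the next sentence)
BOUNDARY = re.compile(r'[.;:?!]+\s+(?=[A-ZÅÄÖ])')

def split_sentences(text):
    """Split text at every position the heuristic considers a sentence boundary."""
    return BOUNDARY.split(text)

# Invented sample with three true sentences
sample = 'Han träffade dr. Berg i Lund. »Var är du?» sade han. Sedan gick han hem.'
segments = split_sentences(sample)
for segment in segments:
    print(segment)
```

On this sample the heuristic splits wrongly after the abbreviation `dr.` and misses the boundary before `»Var`, because the sentence starts with a quotation mark rather than an uppercase letter. Counting such false and missed boundaries on a sample of the real corpus gives a rough accuracy figure.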

#### Tokenizing the corpus

- Clean and segment the corpus

In [43]:
# Write your code here
#corpus = open('Selma.txt', encoding='utf8').read()
#print(corpus[-557:])

In [44]:
corpus = clean(corpus)
corpus = segment_sentences(corpus)

In [45]:
print(corpus[-557:])

<s> hon hade fått större kärlek av sina föräldrar än någon annan han visste och sådan kärlek måste vändas i välsignelse </s>
<s> då prästen sade detta kom alla människor att se bort mot klara gulla och de förundrade sig över vad de såg </s>
<s> prästens ord tycktes redan ha gått i uppfyllelse </s>
<s> där stod klara fina gulleborg ifrån skrolycka hon som var uppkallad efter själva solen vid sina föräldrars grav och lyste som en förklarad </s>
<s> hon var likaså vacker som den söndagen då hon gick till kyrkan i den röda klänningen om inte vackrare </s>


The result should be a **normalized text** without punctuation signs where all the sentences are delimited with `<s>` and `</s>` tags. The five last lines of the text should look like this:

```
<s> hon hade fått större kärlek av sina föräldrar än någon annan han visste och sådan kärlek måste vändas i välsignelse </s> 
<s> då prästen sade detta kom alla människor att se bort mot klara gulla och de förundrade sig över vad de såg </s>
<s> prästens ord tycktes redan ha gått i uppfyllelse </s>
<s> där stod klara fina gulleborg ifrån skrolycka hon som var uppkallad efter själva solen vid sina föräldrars grav och lyste som en förklarad </s>
<s> hon var likaså vacker som den söndagen då hon gick till kyrkan i den röda klänningen om inte vackrare </s>
```

- You will now create a **list of words** from your string. You will consider that a space or a carriage return is an item **separator**

In [46]:
# Write your code here
wordsOnLine = re.sub(r'\s+', r'\n', corpus)
#print(type(wordsOnLine))
words = wordsOnLine.split()
#words.remove('')

In [47]:
print(words[-101:])

['<s>', 'hon', 'hade', 'fått', 'större', 'kärlek', 'av', 'sina', 'föräldrar', 'än', 'någon', 'annan', 'han', 'visste', 'och', 'sådan', 'kärlek', 'måste', 'vändas', 'i', 'välsignelse', '</s>', '<s>', 'då', 'prästen', 'sade', 'detta', 'kom', 'alla', 'människor', 'att', 'se', 'bort', 'mot', 'klara', 'gulla', 'och', 'de', 'förundrade', 'sig', 'över', 'vad', 'de', 'såg', '</s>', '<s>', 'prästens', 'ord', 'tycktes', 'redan', 'ha', 'gått', 'i', 'uppfyllelse', '</s>', '<s>', 'där', 'stod', 'klara', 'fina', 'gulleborg', 'ifrån', 'skrolycka', 'hon', 'som', 'var', 'uppkallad', 'efter', 'själva', 'solen', 'vid', 'sina', 'föräldrars', 'grav', 'och', 'lyste', 'som', 'en', 'förklarad', '</s>', '<s>', 'hon', 'var', 'likaså', 'vacker', 'som', 'den', 'söndagen', 'då', 'hon', 'gick', 'till', 'kyrkan', 'i', 'den', 'röda', 'klänningen', 'om', 'inte', 'vackrare', '</s>']


The last five lines of the corpus should look like this:

`['<s>', 'hon', 'hade', 'fått', 'större', 'kärlek', 'av', 'sina', 'föräldrar', 'än', 'någon', 'annan', 'han', 'visste', 'och', 'sådan', 'kärlek', 'måste', 'vändas', 'i', 'välsignelse', '</s>', '<s>', 'då', 'prästen', 'sade', 'detta', 'kom', 'alla', 'människor', 'att', 'se', 'bort', 'mot', 'klara', 'gulla', 'och', 'de', 'förundrade', 'sig', 'över', 'vad', 'de', 'såg', '</s>', '<s>', 'prästens', 'ord', 'tycktes', 'redan', 'ha', 'gått', 'i', 'uppfyllelse', '</s>', '<s>', 'där', 'stod', 'klara', 'fina', 'gulleborg', 'ifrån', 'skrolycka', 'hon', 'som', 'var', 'uppkallad', 'efter', 'själva', 'solen', 'vid', 'sina', 'föräldrars', 'grav', 'och', 'lyste', 'som', 'en', 'förklarad', '</s>', '<s>', 'hon', 'var', 'likaså', 'vacker', 'som', 'den', 'söndagen', 'då', 'hon', 'gick', 'till', 'kyrkan', 'i', 'den', 'röda', 'klänningen', 'om', 'inte', 'vackrare', '</s>']`



### Counting unigrams and bigrams <a name="t3"/>

Read and try programs to compute the **frequency** of unigrams and bigrams of the training set: [<a
            href="https://github.com/pnugues/ilppp/tree/master/programs/ch05/python">Program folder</a>].

#### NOTE: 
Knowing the frequency of words and sequences of words is crucial in many fields
of language processing. 
- We first normalized the text: we created a file with one sentence per line. 
- We automatically inserted the delimiters `<s>` and `</s>`. 
- We removed the punctuation, parentheses, quotes, stars, dashes, tabulations and double white spaces. 
- We set all the words in lowercase letters. 
- We counted the words, and we produced a file with the unigram and bigram counts.

#### Unigrams

In [48]:
def unigrams(words):
    frequency = {}
    for i in range(len(words)):
        if words[i] in frequency:
            frequency[words[i]] += 1
        else:
            frequency[words[i]] = 1
    return frequency

We compute the frequencies.

In [49]:
frequency = unigrams(words)
list(frequency.items())[:20]

[('<s>', 59047),
 ('selma', 52),
 ('lagerlöf', 270),
 ('nils', 87),
 ('holgerssons', 6),
 ('underbara', 23),
 ('resa', 317),
 ('genom', 688),
 ('sverige', 56),
 ('</s>', 59047),
 ('första', 525),
 ('bandet', 6),
 ('bokutgåva', 11),
 ('albert', 15),
 ('bonniers', 11),
 ('förlag', 11),
 ('stockholm', 77),
 ('den', 11624),
 ('kristliga', 2),
 ('dagvisan', 2)]

#### Bigrams

In [50]:
def bigrams(words):
    bigrams = []
    for i in range(len(words) - 1):
        bigrams.append((words[i], words[i + 1]))
    frequency_bigrams = {}
    for i in range(len(words) - 1):
        if bigrams[i] in frequency_bigrams:
            frequency_bigrams[bigrams[i]] += 1
        else:
            frequency_bigrams[bigrams[i]] = 1
    return frequency_bigrams

In [51]:
frequency_bigrams = bigrams(words)
list(frequency_bigrams.items())[:20]

[(('<s>', 'selma'), 8),
 (('selma', 'lagerlöf'), 11),
 (('lagerlöf', 'nils'), 1),
 (('nils', 'holgerssons'), 6),
 (('holgerssons', 'underbara'), 4),
 (('underbara', 'resa'), 4),
 (('resa', 'genom'), 6),
 (('genom', 'sverige'), 5),
 (('sverige', '</s>'), 17),
 (('</s>', '<s>'), 59046),
 (('<s>', 'första'), 11),
 (('första', 'bandet'), 1),
 (('bandet', 'bokutgåva'), 2),
 (('bokutgåva', 'albert'), 11),
 (('albert', 'bonniers'), 11),
 (('bonniers', 'förlag'), 11),
 (('förlag', 'stockholm'), 10),
 (('stockholm', '</s>'), 24),
 (('<s>', 'den'), 1375),
 (('den', 'kristliga'), 2)]
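
The two loops in `bigrams()` can also be written in a single pass. The sketch below is an alternative formulation, not the course solution: it pairs each word with its successor using `zip` and counts the pairs with `collections.Counter`. The `count_bigrams` name and the toy word list are assumptions made for this example.

```python
from collections import Counter

def count_bigrams(words):
    """Count adjacent word pairs; zip stops at the shorter list, giving len(words) - 1 pairs."""
    return Counter(zip(words, words[1:]))

# Toy list of two short, already segmented sentences
toy = ['<s>', 'en', 'gång', '</s>', '<s>', 'en', 'katt', '</s>']
toy_bigrams = count_bigrams(toy)
for pair, count in toy_bigrams.items():
    print(pair, count)
```

As in the corpus output above, the pair `('</s>', '<s>')` appears once per sentence boundary. It is a by-product of counting over one flat word list and not a real word sequence.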

- In the report, state the **possible number** of bigrams and their **real number**. Explain why there is such a difference. 
- What would be the possible number of 4-grams?
- Propose a **solution** to cope with bigrams **unseen** in the corpus. This topic will be discussed during the lab session.
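
The questions above can be explored on a toy example. With a vocabulary of $V$ word types there are $V^2$ possible bigrams and $V^4$ possible 4-grams, while the number actually observed is bounded by the corpus length. The sketch below also shows add-one (Laplace) smoothing as one possible way to give unseen bigrams a nonzero probability. This is an assumption chosen for illustration, not necessarily the solution expected in the lab, and the function name `laplace_bigram_prob` is hypothetical.

```python
from collections import Counter

# Toy corpus, not the Selma corpus
toy = ['<s>', 'en', 'gång', '</s>', '<s>', 'en', 'katt', '</s>']
unigram_counts = Counter(toy)
bigram_counts = Counter(zip(toy, toy[1:]))

V = len(unigram_counts)                # number of distinct word types
possible_bigrams = V ** 2              # every type can follow every type
possible_4grams = V ** 4
observed_bigrams = len(bigram_counts)  # distinct bigrams actually seen

def laplace_bigram_prob(w_prev, w_next):
    """Add-one estimate: (C(w_prev, w_next) + 1) / (C(w_prev) + V)."""
    return (bigram_counts[(w_prev, w_next)] + 1) / (unigram_counts[w_prev] + V)

print('V:', V, 'possible bigrams:', possible_bigrams, 'observed:', observed_bigrams)
print('possible 4-grams:', possible_4grams)
print('P(katt | en), seen bigram:    ', laplace_bigram_prob('en', 'katt'))
print('P(gång | katt), unseen bigram:', laplace_bigram_prob('katt', 'gång'))
```

Add-one smoothing moves a large share of the probability mass to unseen events when $V$ is large, which is one reason other schemes, such as backing off to the unigram probability, are often preferred.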

The bigram model approximates the probability of a word given all the previous words by using only the **conditional probability** of the preceding word. When we use a bigram model to predict the conditional probability of the next word, we are thus making the following approximation:
\begin{equation*}
P(w_n|w_1^{n-1})  \approx P(w_n|w_{n-1})
\end{equation*}
With this approximation, we can compute the probability of a complete word sequence as
\begin{equation*}
P(w_1^n)  \approx \prod_{k=1}^{n} P(w_k|w_{k-1})
\end{equation*}
The assumption that the probability of a word depends only on the previous word is called a **Markov** assumption. **Markov models** are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. 

We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) 
\begin{equation*}
P(w_n|w_1^{n-1})  \approx P(w_n|w_{n-2}w_{n-1})
\end{equation*}
and thus to the n-gram (which looks n−1 words into the past). For example, 4-gram models condition on the previous three words rather than the previous word.
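
The generalization from bigrams to n-grams also applies to the counting step: an n-gram is a window of n consecutive words. A minimal sketch, with a hypothetical `ngrams` helper that is not part of the course programs:

```python
def ngrams(words, n):
    """Return the list of all windows of n consecutive words as tuples."""
    # words[0:], words[1:], ..., words[n-1:] are zipped position by position;
    # zip stops at the shortest slice, so only complete windows are produced.
    return list(zip(*(words[i:] for i in range(n))))

sentence = ['<s>', 'det', 'var', 'en', '</s>']
print(ngrams(sentence, 2))  # bigrams
print(ngrams(sentence, 3))  # trigrams
```

A sequence of $N$ words yields $N - n + 1$ n-grams, so higher orders produce fewer and sparser counts on the same corpus.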

### Computing the likelihood of a sentence <a name="t4"/>

In practice we don’t use **raw probability** as our metric for evaluating language models, but a variant called **perplexity**. The perplexity (sometimes called PP for short) of a language model on a test set is the **inverse probability** of the test set, normalized by the number of words. For a test set $W = w_1 w_2 \cdots w_N$, if we are computing the perplexity of $W$ with a bigram language model,
we get:
\begin{equation*}
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i|w_{i-1})}}
\end{equation*}
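
The perplexity formula can be evaluated in log space, which avoids multiplying many small probabilities: the entropy rate is the average of $-\log_2 P(w_i|w_{i-1})$, and the perplexity is 2 raised to that value. The sketch below is a minimal illustration with a hypothetical `perplexity` helper. The counts are copied from the bigram table that this notebook prints further down for _Det var en gång en katt som hette Nils_, with the unseen pair _hette nils_ replaced by the unigram estimate 87/1041560.

```python
import math

def perplexity(probabilities):
    """Perplexity of a sequence given the conditional probability of each word."""
    # Entropy rate in bits per word: average negative log2 probability
    entropy_rate = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** entropy_rate

# (bigram count, count of the first word) pairs from the notebook's bigram table;
# the ninth entry is the unigram backoff for the unseen bigram 'hette nils'
counts = [(5672, 59047), (3839, 21108), (712, 12090), (706, 13514),
          (20, 1332), (6, 13514), (2, 16), (45, 16288),
          (87, 1041560), (2, 87)]
probabilities = [c / total for c, total in counts]

pp = perplexity(probabilities)
print('Perplexity:', pp)
```

The result agrees with the value of about 72.85 reported in the bigram section, which is the $N$-th root of the inverse of the sentence probability.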

#### Unigrams

- Write a program to compute a **sentence's probability** using unigrams. You may find the **dictionaries** that we saw in the **mutual information** program useful: [<a href="https://github.com/pnugues/ilppp/tree/master/programs/ch05/python">Program folder</a>]. Your function will return the **perplexity**.

Your function should **print and tabulate** the results as in the examples below with the sentence _Det var en gång en katt som hette Nils_. 

```
=====================================================
wi 	 C(wi) 	 #words 	 P(wi)
=====================================================
det 	 21108 	 1041631 	 0.0202643738521607
var 	 12090 	 1041631 	 0.01160679741674355
en 	 13514 	 1041631 	 0.01297388422579589
gång 	 1332 	 1041631 	 0.001278763784871994
en 	 13514 	 1041631 	 0.01297388422579589
katt 	 16 	 1041631 	 1.5360525944408337e-05
som 	 16288 	 1041631 	 0.015637015411407686
hette 	 97 	 1041631 	 9.312318853797554e-05
nils 	 87 	 1041631 	 8.352285982272032e-05
</s> 	 59047 	 1041631 	 0.056687060964967444
=====================================================
Prob. unigrams:	 5.361459667285409e-27
Geometric mean prob.: 0.0023600885848765307
Entropy rate:	 8.726943273141258
Perplexity:	 423.71290908655254
```

In [52]:
# Write your code
def unigram_lm(freq_unigrams, sent_words):
    """Print a unigram probability table for sent_words and return the perplexity."""
    rule = '====================================================='
    print(rule)
    print('wi      C(wi)      #words      P(wi)')
    print(rule)

    # Total number of tokens in the training corpus, derived from the counts
    # so that the function does not depend on the global `words` list.
    N = sum(freq_unigrams.values())

    # sent_words is expected to end with '</s>': the end symbol is needed
    # for the model to define a true probability distribution over sentences
    # of all lengths, not one distribution per sentence length.
    prob_unigrams = 1
    for w in sent_words:
        # Maximum likelihood estimate P(w) = C(w) / N.
        # A word absent from the corpus raises a KeyError.
        prob = freq_unigrams[w] / N
        prob_unigrams *= prob
        print(w, '\t', freq_unigrams[w], '\t', N, '\t', prob)

    # Entropy of the sentence in bits, then normalized by the number of words
    entropy = -math.log(prob_unigrams, 2)
    entropy_rate = entropy / len(sent_words)
    geom_mean_prob = math.pow(prob_unigrams, 1 / len(sent_words))

    # Perplexity is 2 raised to the entropy rate
    PP = math.pow(2, entropy_rate)

    print(rule)
    print('Prob. unigrams: ' + str(prob_unigrams))
    print('Geometric mean prob.: ' + str(geom_mean_prob))
    print('Entropy rate: ' + str(entropy_rate))
    print('Perplexity: ' + str(PP))

    return PP

In [53]:
sentence = 'det var en gång en katt som hette nils </s>'
sent_words = sentence.split()
sent_words

['det', 'var', 'en', 'gång', 'en', 'katt', 'som', 'hette', 'nils', '</s>']

In [54]:
perplexity_unigrams = unigram_lm(frequency, sent_words)

wi      C(wi)      #words      P(wi)
det 	 21108 	 1041560 	 0.020265755213333847
var 	 12090 	 1041560 	 0.011607588617074388
en 	 13514 	 1041560 	 0.01297476861630631
gång 	 1332 	 1041560 	 0.001278850954337724
en 	 13514 	 1041560 	 0.01297476861630631
katt 	 16 	 1041560 	 1.5361573025077767e-05
som 	 16288 	 1041560 	 0.015638081339529167
hette 	 97 	 1041560 	 9.312953646453396e-05
nils 	 87 	 1041560 	 8.352855332386037e-05
</s> 	 59047 	 1041560 	 0.056690925150735434
Prob. unigrams: 5.3651155337425844e-27
Geometric mean prob.: 0.0023602494649885993
Entropy rate: 8.726844932328587
Perplexity: 423.68402782577465


In [55]:
perplexity_unigrams = int(perplexity_unigrams)
perplexity_unigrams

423

#### Bigrams

- Write a program to compute the **sentence probability** using bigrams. Your function will tabulate and print the results as below. It will return the **perplexity**.

```
=====================================================
wi 	 wi+1 	 Ci,i+1 	 C(i) 	 P(wi+1|wi)
=====================================================
<s>	 det 	 5672 	 59047 	 0.09605907158704083
det 	 var 	 3839 	 21108 	 0.1818741709304529
var 	 en 	 712 	 12090 	 0.058891645988420185
en 	 gång 	 706 	 13514 	 0.052242119283705785
gång 	 en 	 20 	 1332 	 0.015015015015015015
en 	 katt 	 6 	 13514 	 0.0004439840165754033
katt 	 som 	 2 	 16 	 0.125
som 	 hette 	 45 	 16288 	 0.002762770137524558
hette 	 nils 	 0 	 97 	 0.0 	 *backoff: 	 8.352285982272032e-05
nils 	 </s> 	 2 	 87 	 0.022988505747126436
=====================================================
Prob. bigrams:	 2.376007803503683e-19
Geometric mean prob.: 0.013727289294133601
Entropy rate:	 6.186809422848149
Perplexity:	 72.84759420254609
```

In [56]:
# Write your code
def bigram_lm(freq_unigrams, freq_bigrams, sent_words):
    """Print a bigram probability table for sent_words and return the perplexity."""
    rule = '====================================================='
    print(rule)
    print('wi      wi+1      Ci,i+1      C(i)      P(wi+1|wi)')
    print(rule)

    # Total number of tokens in the training corpus, derived from the counts
    # so that the function does not depend on the global `words` list.
    N = sum(freq_unigrams.values())

    # sent_words is expected to start with '<s>' and end with '</s>'.
    # A sentence of n tokens contains n - 1 bigrams.
    n_bigrams = len(sent_words) - 1
    prob_bigrams = 1
    for i in range(n_bigrams):
        wi, wnext = sent_words[i], sent_words[i + 1]
        ci = freq_unigrams[wi]
        if (wi, wnext) in freq_bigrams:
            # Maximum likelihood estimate P(wnext | wi) = C(wi, wnext) / C(wi)
            ci_inext = freq_bigrams[(wi, wnext)]
            prob = ci_inext / ci
            print(wi, '\t', wnext, '\t',
                  str(ci_inext), '\t', str(ci), '\t', prob)
        else:
            # Unseen bigram: back off to the unigram probability of the second word
            prob = freq_unigrams[wnext] / N
            print(wi, '\t', wnext, '\t',
                  '0', '\t', str(ci), '\t',
                  '0.0 *backoff: ', str(prob))

        prob_bigrams *= prob

    # Entropy of the sentence in bits, then normalized by the number of bigrams
    entropy = -math.log(prob_bigrams, 2)
    entropy_rate = entropy / n_bigrams
    geom_mean_prob = math.pow(prob_bigrams, 1 / n_bigrams)

    # Perplexity is 2 raised to the entropy rate
    PP = math.pow(2, entropy_rate)

    print(rule)
    print('Prob. bigrams: ' + str(prob_bigrams))
    print('Geometric mean prob.: ' + str(geom_mean_prob))
    print('Entropy rate: ' + str(entropy_rate))
    print('Perplexity: ' + str(PP))

    return PP

In [57]:
sentence = '<s> det var en gång en katt som hette nils </s>'
sent_words = sentence.split()
sent_words
#len(sent_words)

['<s>',
 'det',
 'var',
 'en',
 'gång',
 'en',
 'katt',
 'som',
 'hette',
 'nils',
 '</s>']

In [58]:
perplexity_bigrams = bigram_lm(frequency, frequency_bigrams, sent_words)

wi      wi+1      Ci,i+1      C(i)      P(wi+1|wi)
<s> 	 det 	 5672 	 59047 	 0.09605907158704083
det 	 var 	 3839 	 21108 	 0.1818741709304529
var 	 en 	 712 	 12090 	 0.058891645988420185
en 	 gång 	 706 	 13514 	 0.052242119283705785
gång 	 en 	 20 	 1332 	 0.015015015015015015
en 	 katt 	 6 	 13514 	 0.0004439840165754033
katt 	 som 	 2 	 16 	 0.125
som 	 hette 	 45 	 16288 	 0.002762770137524558
hette 	 nils 	 0 	 97 	 0.0 *backoff:  8.352855332386037e-05
nils 	 </s> 	 2 	 87 	 0.022988505747126436
Prob. bigrams: 2.376169768780815e-19
Geometric mean prob.: 0.013727382866049192
Entropy rate: 6.186799588766882
Perplexity: 72.84709764111103


In [59]:
perplexity_bigrams = int(perplexity_bigrams)
perplexity_bigrams

72

- In addition to this sentence, _Det var en gång en katt som hette Nils_, write five other sentences that will form your **test set** and run your programs on them. You will insert them in your report.

### Five sentences - Unigrams

In [60]:
# A little knowledge is a dangerous thing
sentence1 = 'lite kunskap är en farlig sak </s>'
sent_words1 = sentence1.split()
sent_words1

['lite', 'kunskap', 'är', 'en', 'farlig', 'sak', '</s>']

In [61]:
perplexity_unigrams1 = unigram_lm(frequency, sent_words1)

wi      C(wi)      #words      P(wi)
lite 	 45 	 1041560 	 4.320442413303122e-05
kunskap 	 14 	 1041560 	 1.3441376396943047e-05
är 	 6290 	 1041560 	 0.006039018395483697
en 	 13514 	 1041560 	 0.01297476861630631
farlig 	 40 	 1041560 	 3.840393256269442e-05
sak 	 205 	 1041560 	 0.0001968201543838089
</s> 	 59047 	 1041560 	 0.056690925150735434
Prob. unigrams: 1.9498300024612508e-23
Geometric mean prob.: 0.0005697886857557694
Entropy rate: 10.777285405072845
Perplexity: 1755.0366039185164


In [62]:
# Early to bed and early to rise, makes a man healthy, wealthy and wise
sentence2 = 'tidigt till sängs och tidigt att stiga gör en man frisk </s>'
sent_words2 = sentence2.split()
sent_words2

['tidigt',
 'till',
 'sängs',
 'och',
 'tidigt',
 'att',
 'stiga',
 'gör',
 'en',
 'man',
 'frisk',
 '</s>']

In [63]:
perplexity_unigrams2 = unigram_lm(frequency, sent_words2)

wi      C(wi)      #words      P(wi)
tidigt 	 38 	 1041560 	 3.64837359345597e-05
till 	 9139 	 1041560 	 0.008774338492261608
sängs 	 18 	 1041560 	 1.728176965321249e-05
och 	 36356 	 1041560 	 0.03490533430623296
tidigt 	 38 	 1041560 	 3.64837359345597e-05
att 	 28020 	 1041560 	 0.02690195476016744
stiga 	 90 	 1041560 	 8.640884826606244e-05
gör 	 355 	 1041560 	 0.000340834901493913
en 	 13514 	 1041560 	 0.01297476861630631
man 	 2322 	 1041560 	 0.002229348285264411
frisk 	 69 	 1041560 	 6.624678367064787e-05
</s> 	 59047 	 1041560 	 0.056690925150735434
Prob. unigrams: 6.063662305241324e-37
Geometric mean prob.: 0.0009591677848954893
Entropy rate: 10.025929175084222
Perplexity: 1042.570461339003


In [64]:
# The bigger they are, the harder they fall
sentence3 = 'ju större de är desto hårdare faller de </s>'
sent_words3 = sentence3.split()
sent_words3

['ju', 'större', 'de', 'är', 'desto', 'hårdare', 'faller', 'de', '</s>']

In [65]:
perplexity_unigrams3 = unigram_lm(frequency, sent_words3)

wi      C(wi)      #words      P(wi)
ju 	 1250 	 1041560 	 0.0012001228925842006
större 	 150 	 1041560 	 0.00014401474711010407
de 	 11942 	 1041560 	 0.011465494066592419
är 	 6290 	 1041560 	 0.006039018395483697
desto 	 44 	 1041560 	 4.224432581896386e-05
hårdare 	 8 	 1041560 	 7.680786512538884e-06
faller 	 47 	 1041560 	 4.5124620761165944e-05
de 	 11942 	 1041560 	 0.011465494066592419
</s> 	 59047 	 1041560 	 0.056690925150735434
Prob. unigrams: 1.1389004736737127e-28
Geometric mean prob.: 0.0007855341786154558
Entropy rate: 10.314038330960237
Perplexity: 1273.019083348546


In [66]:
# How beautiful would be the world if there were a rule for going round in labyrinths
sentence4 = 'hur vacker skulle världen vara om det fanns en regel för att gå runt i labyrinter </s>'
sent_words4 = sentence4.split()
sent_words4

['hur',
 'vacker',
 'skulle',
 'världen',
 'vara',
 'om',
 'det',
 'fanns',
 'en',
 'regel',
 'för',
 'att',
 'gå',
 'runt',
 'i',
 'labyrinter',
 '</s>']

In [67]:
perplexity_unigrams4 = unigram_lm(frequency, sent_words4)

wi      C(wi)      #words      P(wi)
hur 	 1996 	 1041560 	 0.0019163562348784515
vacker 	 209 	 1041560 	 0.00020066054764007835
skulle 	 5433 	 1041560 	 0.00521621414032797
världen 	 363 	 1041560 	 0.00034851568800645187
vara 	 1803 	 1041560 	 0.001731057260263451
om 	 8075 	 1041560 	 0.007752793886093936
det 	 21108 	 1041560 	 0.020265755213333847
fanns 	 702 	 1041560 	 0.0006739890164752871
en 	 13514 	 1041560 	 0.01297476861630631
regel 	 2 	 1041560 	 1.920196628134721e-06
för 	 9443 	 1041560 	 0.009066208379738086
att 	 28020 	 1041560 	 0.02690195476016744
gå 	 1590 	 1041560 	 0.0015265563193671032
runt 	 154 	 1041560 	 0.0001478551403663735
i 	 16508 	 1041560 	 0.015849302968623986
labyrinter 	 1 	 1041560 	 9.600983140673605e-07
</s> 	 59047 	 1041560 	 0.056690925150735434
Prob. unigrams: 1.5161591885734908e-49
Geometric mean prob.: 0.001343628187008172
Entropy rate: 9.539650318399216
Perplexity: 744.2535142305092


In [68]:
# See no evil, hear no evil, speak no evil
sentence5 = 'se inget ont hör inget ont tala inget ont </s>'
sent_words5 = sentence5.split()
sent_words5

['se', 'inget', 'ont', 'hör', 'inget', 'ont', 'tala', 'inget', 'ont', '</s>']

In [69]:
perplexity_unigrams5 = unigram_lm(frequency, sent_words5)

wi      C(wi)      #words      P(wi)
se 	 1989 	 1041560 	 0.00190963554667998
inget 	 34 	 1041560 	 3.264334267829026e-05
ont 	 150 	 1041560 	 0.00014401474711010407
hör 	 222 	 1041560 	 0.00021314182572295404
inget 	 34 	 1041560 	 3.264334267829026e-05
ont 	 150 	 1041560 	 0.00014401474711010407
tala 	 845 	 1041560 	 0.0008112830753869196
inget 	 34 	 1041560 	 3.264334267829026e-05
ont 	 150 	 1041560 	 0.00014401474711010407
</s> 	 59047 	 1041560 	 0.056690925150735434
Prob. unigrams: 1.9449565461801393e-36
Geometric mean prob.: 0.00026846704976591837
Entropy rate: 11.862967349276994
Perplexity: 3724.85189848035


### Five sentences - Bigrams

In [70]:
# A little knowledge is a dangerous thing
sentence1 = '<s> lite kunskap är en farlig sak </s>'
sent_words1 = sentence1.split()
sent_words1

['<s>', 'lite', 'kunskap', 'är', 'en', 'farlig', 'sak', '</s>']

In [72]:
perplexity_bigrams1 = bigram_lm(frequency, frequency_bigrams, sent_words1)

wi      wi+1      Ci,i+1      C(i)      P(wi+1|wi)
<s> 	 lite 	 0 	 59047 	 0.0 *backoff:  4.320442413303122e-05
lite 	 kunskap 	 0 	 45 	 0.0 *backoff:  1.3441376396943047e-05
kunskap 	 är 	 0 	 14 	 0.0 *backoff:  0.006039018395483697
är 	 en 	 304 	 6290 	 0.04833068362480127
en 	 farlig 	 6 	 13514 	 0.0004439840165754033
farlig 	 sak 	 0 	 40 	 0.0 *backoff:  0.0001968201543838089
sak 	 </s> 	 34 	 205 	 0.16585365853658537
Prob. bigrams: 2.4565364591697616e-21
Geometric mean prob.: 0.0011369999855527747
Entropy rate: 9.78055204876345
Perplexity: 879.5074869889572


In [73]:
# Early to bed and early to rise, makes a man healthy, wealthy and wise
sentence2 = '<s> tidigt till sängs och tidigt att stiga gör en man frisk </s>'
sent_words2 = sentence2.split()
sent_words2

['<s>',
 'tidigt',
 'till',
 'sängs',
 'och',
 'tidigt',
 'att',
 'stiga',
 'gör',
 'en',
 'man',
 'frisk',
 '</s>']

In [74]:
perplexity_bigrams2 = bigram_lm(frequency, frequency_bigrams, sent_words2)

wi      wi+1      Ci,i+1      C(i)      P(wi+1|wi)
<s> 	 tidigt 	 3 	 59047 	 5.080698426677054e-05
tidigt 	 till 	 1 	 38 	 0.02631578947368421
till 	 sängs 	 18 	 9139 	 0.001969580916949338
sängs 	 och 	 1 	 18 	 0.05555555555555555
och 	 tidigt 	 1 	 36356 	 2.750577621300473e-05
tidigt 	 att 	 2 	 38 	 0.05263157894736842
att 	 stiga 	 28 	 28020 	 0.0009992862241256246
stiga 	 gör 	 0 	 90 	 0.0 *backoff:  0.000340834901493913
gör 	 en 	 6 	 355 	 0.016901408450704224
en 	 man 	 117 	 13514 	 0.008657688323220364
man 	 frisk 	 0 	 2322 	 0.0 *backoff:  6.624678367064787e-05
frisk 	 </s> 	 10 	 69 	 0.14492753623188406
Prob. bigrams: 1.0134118069871207e-31
Geometric mean prob.: 0.002613056679059623
Entropy rate: 8.580045866582262
Perplexity: 382.69357416306656


In [75]:
# The bigger they are, the harder they fall
sentence3 = '<s> ju större de är desto hårdare faller de </s>'
sent_words3 = sentence3.split()
sent_words3

['<s>', 'ju', 'större', 'de', 'är', 'desto', 'hårdare', 'faller', 'de', '</s>']

In [76]:
perplexity_bigrams3 = bigram_lm(frequency, frequency_bigrams, sent_words3)

wi      wi+1      Ci,i+1      C(i)      P(wi+1|wi)
<s> 	 ju 	 21 	 59047 	 0.00035564888986739377
ju 	 större 	 0 	 1250 	 0.0 *backoff:  0.00014401474711010407
större 	 de 	 0 	 150 	 0.0 *backoff:  0.011465494066592419
de 	 är 	 59 	 11942 	 0.004940545972198961
är 	 desto 	 0 	 6290 	 0.0 *backoff:  4.224432581896386e-05
desto 	 hårdare 	 0 	 44 	 0.0 *backoff:  7.680786512538884e-06
hårdare 	 faller 	 0 	 8 	 0.0 *backoff:  4.5124620761165944e-05
faller 	 de 	 0 	 47 	 0.0 *backoff:  0.011465494066592419
de 	 </s> 	 102 	 11942 	 0.008541282867191425
Prob. bigrams: 4.160060663585748e-30
Geometric mean prob.: 0.0005438204290664426
Entropy rate: 10.844582031130408
Perplexity: 1838.842284238317


In [77]:
# How beautiful would be the world if there were a rule for going round in labyrinths
sentence4 = '<s> hur vacker skulle världen vara om det fanns en regel för att gå runt i labyrinter </s>'
sent_words4 = sentence4.split()
sent_words4

['<s>',
 'hur',
 'vacker',
 'skulle',
 'världen',
 'vara',
 'om',
 'det',
 'fanns',
 'en',
 'regel',
 'för',
 'att',
 'gå',
 'runt',
 'i',
 'labyrinter',
 '</s>']

In [78]:
perplexity_bigrams4 = bigram_lm(frequency, frequency_bigrams, sent_words4)

wi      wi+1      Ci,i+1      C(i)      P(wi+1|wi)
<s> 	 hur 	 238 	 59047 	 0.004030687418497129
hur 	 vacker 	 3 	 1996 	 0.001503006012024048
vacker 	 skulle 	 0 	 209 	 0.0 *backoff:  0.00521621414032797
skulle 	 världen 	 0 	 5433 	 0.0 *backoff:  0.00034851568800645187
världen 	 vara 	 0 	 363 	 0.0 *backoff:  0.001731057260263451
vara 	 om 	 5 	 1803 	 0.0027731558513588465
om 	 det 	 558 	 8075 	 0.06910216718266254
det 	 fanns 	 253 	 21108 	 0.011985976880803486
fanns 	 en 	 74 	 702 	 0.10541310541310542
en 	 regel 	 1 	 13514 	 7.399733609590054e-05
regel 	 för 	 0 	 2 	 0.0 *backoff:  0.009066208379738086
för 	 att 	 2932 	 9443 	 0.3104945462247167
att 	 gå 	 360 	 28020 	 0.01284796573875803
gå 	 runt 	 3 	 1590 	 0.0018867924528301887
runt 	 i 	 13 	 154 	 0.08441558441558442
i 	 labyrinter 	 0 	 16508 	 0.0 *backoff:  9.600983140673605e-07
labyrinter 	 </s> 	 1 	 1 	 1.0
Prob. bigrams: 1.88910284270759e-39
Geometric mean prob.: 0.005273909614054819
Entropy rate: 7.5669

In [79]:
# See no evil, hear no evil, speak no evil
sentence5 = '<s> se inget ont hör inget ont tala inget ont </s>'
sent_words5 = sentence5.split()
sent_words5

['<s>',
 'se',
 'inget',
 'ont',
 'hör',
 'inget',
 'ont',
 'tala',
 'inget',
 'ont',
 '</s>']

In [80]:
perplexity_bigrams5 = bigram_lm(frequency, frequency_bigrams, sent_words5)

wi      wi+1      Ci,i+1      C(i)      P(wi+1|wi)
<s> 	 se 	 196 	 59047 	 0.003319389638762342
se 	 inget 	 0 	 1989 	 0.0 *backoff:  3.264334267829026e-05
inget 	 ont 	 6 	 34 	 0.17647058823529413
ont 	 hör 	 0 	 150 	 0.0 *backoff:  0.00021314182572295404
hör 	 inget 	 0 	 222 	 0.0 *backoff:  3.264334267829026e-05
inget 	 ont 	 6 	 34 	 0.17647058823529413
ont 	 tala 	 0 	 150 	 0.0 *backoff:  0.0008112830753869196
tala 	 inget 	 0 	 845 	 0.0 *backoff:  3.264334267829026e-05
inget 	 ont 	 6 	 34 	 0.17647058823529413
ont 	 </s> 	 23 	 150 	 0.15333333333333332
Prob. bigrams: 1.682429136522422e-26
Geometric mean prob.: 0.002646023385115938
Entropy rate: 8.561958472689598
Perplexity: 377.92560928413127
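
The `*backoff` entries in the tables above appear to replace a zero bigram probability with the unigram probability of the second word. A minimal sketch of that rule follows. The function name `backoff_probability` is hypothetical, and the counts are the ones printed for _katt som_ and _hette nils_. This simple fallback is not normalized, so the resulting values no longer sum to one over the vocabulary.

```python
def backoff_probability(unigram_counts, bigram_counts, total_words, previous, word):
    """Return P(word | previous), falling back to P(word) for unseen bigrams.

    Hypothetical helper illustrating the fallback suggested by the output.
    """
    bigram_count = bigram_counts.get((previous, word), 0)
    if bigram_count > 0:
        return bigram_count / unigram_counts[previous]
    # Unseen bigram: use the unigram relative frequency instead
    return unigram_counts[word] / total_words

# Counts copied from the tables printed in this notebook
unigram_counts = {'katt': 16, 'som': 16288, 'hette': 97, 'nils': 87}
bigram_counts = {('katt', 'som'): 2}
total_words = 1041560

p_seen = backoff_probability(unigram_counts, bigram_counts, total_words, 'katt', 'som')
p_unseen = backoff_probability(unigram_counts, bigram_counts, total_words, 'hette', 'nils')
print(p_seen, p_unseen)
```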


### Online prediction of words <a name="t5"/>

You will now carry out an online prediction of words. You will consider two cases:
1. Prediction of the **current word** a user is typing;
2. Prediction of the **next word**.

Ideally, you would write a **loop** that reads the words and applies the models while the user is typing. As Jupyter notebooks are not designed for **interactive** input and output, we will simplify the experimental setting and use **constant strings** that represent the input at a given time.  

We will assume the user is typing the phrase: _Det var en gång_. 

#### Trigrams

- To get more accurate predictions, you will use a **trigram counting** function. Write it on the model of the bigram counting function.

In [80]:
# Write your code
def trigrams(words):
    # Slide a window of three words over the token list
    trigram_list = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

    # Map each trigram to its number of occurrences
    frequency_trigrams = {}
    for gram in trigram_list:
        frequency_trigrams[gram] = frequency_trigrams.get(gram, 0) + 1
    return frequency_trigrams

In [81]:
frequency_trigrams = trigrams(words)
frequency_trigrams[('det', 'var', 'en')]

330
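
The same counts can be obtained more compactly with `collections.Counter` and `zip`, for any n-gram order. This is only an alternative sketch: `count_ngrams` and the toy word list are illustrative names and data, not part of the notebook.

```python
from collections import Counter

def count_ngrams(words, n):
    """Count the n-grams of a token list (hypothetical helper)."""
    # zip stops at the shortest slice, which yields exactly len(words) - n + 1 tuples
    return Counter(zip(*(words[i:] for i in range(n))))

# Toy data, not taken from the corpus
toy_words = 'det var en gång en katt det var en hund'.split()
toy_trigrams = count_ngrams(toy_words, 3)
print(toy_trigrams[('det', 'var', 'en')])
```

A `Counter` returns 0 for an unseen n-gram instead of raising a `KeyError`, which is convenient when looking up trigrams that may not occur in the corpus.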

#### Prediction

The user starts typing _Det var en gång_. After the 2nd character, your program tries to help the user with suggested words.

In [82]:
starting_text = 'De'.lower()

- Write a program to rank the **first five candidates** at this point. Assign these predictions to a **list** that you will call `current_word_predictions_1`. Note that you are starting a sentence, so you can use the **bigram frequencies**.

In [83]:
cand_nbr = 5

In [84]:
# Write your code here
predictions = []
# Visit the bigrams from the most frequent to the least frequent
for bigram in sorted(frequency_bigrams.keys(),
                     key=frequency_bigrams.get, reverse=True):
    # Keep the sentence-initial words that begin with the typed characters
    if bigram[0] == '<s>' and bigram[1].startswith(starting_text):
        predictions.append(bigram[1])
# Slicing avoids an IndexError when fewer than cand_nbr candidates exist
current_word_predictions_1 = predictions[:cand_nbr]
    

In [85]:
current_word_predictions_1

['det', 'de', 'den', 'detta', 'denna']
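
The cell above can be generalized into a reusable function that filters the candidates first and sorts only the survivors. This is a sketch under assumptions: `complete_word` is a hypothetical name, the counts are toy values, and ties are broken alphabetically, which may differ from the order produced by the notebook code.

```python
def complete_word(bigram_counts, previous, prefix, n=5):
    """Rank the words that follow `previous` and start with `prefix`.

    Hypothetical helper; `bigram_counts` maps (w1, w2) tuples to counts.
    """
    candidates = [
        (count, word)
        for (first, word), count in bigram_counts.items()
        if first == previous and word.startswith(prefix)
    ]
    # Highest count first, then alphabetical order for ties
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in candidates[:n]]

# Toy counts, not taken from the corpus
toy_bigrams = {
    ('<s>', 'det'): 50,
    ('<s>', 'de'): 30,
    ('<s>', 'den'): 20,
    ('<s>', 'han'): 40,
    ('en', 'del'): 10,
}
suggestions = complete_word(toy_bigrams, '<s>', 'de')
print(suggestions)
```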

- Let us now suppose that the user has typed: _Det var en_. After detecting a **space**, your program starts predicting a next possible word.

In [86]:
current_text = "Det var en ".lower()

Tokenize this text and return a list of tokens. Call it `tokens`.

In [87]:
# Write your code here
tokens = tokenize(current_text)

In [88]:
tokens

['det', 'var', 'en']

Write a program to propose the five next possible words ranked by frequency using a **trigram model**. Assign these predictions to a variable that you will call `next_word_predictions`.

A trigram model conditions on the two previous words, here _var en_. A 4-gram model would look three words into the past.

In [89]:
# Write your code here
tri_predictions = []
# Visit the trigrams from the most frequent to the least frequent
for trigram in sorted(frequency_trigrams.keys(),
                      key=frequency_trigrams.get, reverse=True):
    # The context consists of the last two typed tokens
    if trigram[0:2] == tuple(tokens[-2:]):
        tri_predictions.append(trigram[2])
# Slicing avoids an IndexError when fewer than cand_nbr candidates exist
next_word_predictions = tri_predictions[:cand_nbr]
    

In [90]:
next_word_predictions

['stor', 'liten', 'gammal', 'god', 'sådan']
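
Sorting every trigram in the corpus only to keep five of them is more work than needed. A possible alternative, sketched below with toy counts, first collects the words that follow the two-word context and then uses `heapq.nlargest` to extract the top candidates. The name `predict_next` is hypothetical.

```python
import heapq

def predict_next(trigram_counts, context, n=5):
    """Return the n most frequent words following the last two words of `context`.

    Hypothetical helper; `trigram_counts` maps (w1, w2, w3) tuples to counts.
    """
    key = tuple(context[-2:])
    followers = {
        trigram[2]: count
        for trigram, count in trigram_counts.items()
        if trigram[:2] == key
    }
    # nlargest returns the keys with the highest counts, in decreasing order
    return heapq.nlargest(n, followers, key=followers.get)

# Toy counts, not taken from the corpus
toy_trigrams = {
    ('var', 'en', 'stor'): 7,
    ('var', 'en', 'liten'): 5,
    ('var', 'en', 'gång'): 3,
    ('det', 'var', 'en'): 9,
}
next_words = predict_next(toy_trigrams, ['det', 'var', 'en'], n=2)
print(next_words)
```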

- Finally, let us suppose that the user has typed _Det var en g_. Rank the first five candidates and assign these predictions to a list that you will call `current_word_predictions_2`.

In [91]:
current_text = "Det var en g".lower()
#current_text[-1]

In [92]:
# Write your code here
current_tri_predictions = []
# `tokens` still holds ['det', 'var', 'en'] from the previous step
for trigram in sorted(frequency_trigrams.keys(),
                      key=frequency_trigrams.get, reverse=True):
    # Keep the words that follow the two-word context
    # and start with the last typed character
    if (trigram[0:2] == tuple(tokens[-2:])
            and trigram[2].startswith(current_text[-1])):
        current_tri_predictions.append(trigram[2])

# Slicing avoids an IndexError when fewer than cand_nbr candidates exist
current_word_predictions_2 = current_tri_predictions[:cand_nbr]
    

In [93]:
current_word_predictions_2

['gammal', 'god', 'gång', 'ganska', 'grann']
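
The three fixed snapshots above can be folded into one function that is called on whatever the user has typed so far, which approximates the loop mentioned at the start of this section. The sketch below is an assumption-laden illustration: `suggest` is a hypothetical name, the counts are toy values, and it supposes that the n-gram counts include the `<s>` marker. A trailing space triggers next-word prediction; otherwise the last partial word is completed.

```python
def suggest(bigram_counts, trigram_counts, typed, n=3):
    """Suggest up to n words for the text typed so far (hypothetical helper)."""
    words = typed.lower().split()
    # A trailing space (or empty input) means a new word is about to start
    prefix = '' if typed.endswith(' ') or not words else words.pop()
    context = ['<s>'] + words
    if len(context) >= 2:
        key, counts = tuple(context[-2:]), trigram_counts
    else:
        key, counts = tuple(context[-1:]), bigram_counts
    candidates = {
        gram[-1]: count
        for gram, count in counts.items()
        if gram[:-1] == key and gram[-1].startswith(prefix)
    }
    return sorted(candidates, key=candidates.get, reverse=True)[:n]

# Toy counts, not taken from the corpus
toy_bigrams = {('<s>', 'det'): 5, ('<s>', 'de'): 3, ('<s>', 'han'): 4}
toy_trigrams = {
    ('<s>', 'det', 'var'): 4,
    ('det', 'var', 'en'): 3,
    ('var', 'en', 'stor'): 3,
    ('var', 'en', 'gång'): 2,
    ('var', 'en', 'god'): 1,
}

# Simulate the successive states of the input
results = {}
for typed in ('De', 'Det var en ', 'Det var en g'):
    results[typed] = suggest(toy_bigrams, toy_trigrams, typed)
    print(repr(typed), results[typed])
```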

## Checked answers <a name="t6"/>

The system will check these answers: `(perplexity_unigrams, perplexity_bigrams, current_word_predictions_1, next_word_predictions, current_word_predictions_2)`

In [94]:
(perplexity_unigrams, perplexity_bigrams, current_word_predictions_1, next_word_predictions, current_word_predictions_2)

(423,
 72,
 ['det', 'de', 'den', 'detta', 'denna'],
 ['stor', 'liten', 'gammal', 'god', 'sådan'],
 ['gammal', 'god', 'gång', 'ganska', 'grann'])

## Submission <a name="t7"/>

When you have written all the code and run all the cells, fill in your ID as well as the name of the notebook.

In [95]:
STIL_ID = ["hi8826mo-s"] # Write your stil ids as a list
CURRENT_NOTEBOOK_PATH = os.path.join(os.getcwd(), 
                                     "2-language_models_HichamMohamad.ipynb") # Write the name of your notebook

The submission code will send your answer. It consists of the perplexities and predictions.

In [96]:
ANSWER = str((perplexity_unigrams, perplexity_bigrams, current_word_predictions_1, next_word_predictions, current_word_predictions_2))
ANSWER

"(423, 72, ['det', 'de', 'den', 'detta', 'denna'], ['stor', 'liten', 'gammal', 'god', 'sådan'], ['gammal', 'god', 'gång', 'ganska', 'grann'])"

Now the moment of truth:
1. Save your notebook and
2. Run the cells below

In [97]:
SUBMISSION_NOTEBOOK_PATH = CURRENT_NOTEBOOK_PATH + ".submission.bz2"

In [98]:
ASSIGNMENT = 2
API_KEY = "f581ba347babfea0b8f2c74a3a6776a7"

# Copy and compress current notebook
with bz2.open(SUBMISSION_NOTEBOOK_PATH, mode="wb") as fout:
    with open(CURRENT_NOTEBOOK_PATH, "rb") as fin:
        fout.write(fin.read())

In [99]:
res = requests.post("https://vilde.cs.lth.se/edan20checker/submit", 
                    files={"notebook_file": open(SUBMISSION_NOTEBOOK_PATH, "rb")}, 
                    data={
                        "stil_id": STIL_ID,
                        "assignment": ASSIGNMENT,
                        "answer": ANSWER,
                        "api_key": API_KEY,
                    },
                   verify=True)

# from IPython.display import display, JSON
res.json()

{'msg': None,
 'status': 'correct',
 'signature': '4e1e0f6dcf3c7b185cb3460ba5e0e24f5155363685baa53dbdad76008a8eed1785be88917315df56c998a17f3b514b67024d4e7321231c6aeab661f1f5c11b29',
 'submission_id': '801a964c-dc25-4cb1-8316-0515f94e985b'}

## Reading <a name="t8"/>

<p>As an application of <b>n-grams</b>, execute the Jupyter notebook by Peter Norvig <a
        href="http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb">
    here</a>. Just run all the cells and be sure that you understand the code.
    You will find the data <a href="http://norvig.com/ngrams/">here</a>.</p>
<p>In your report, you will also <b>describe one experiment with a long string of words</b>
    that you will create yourself or copy from a text you like. You will remove all the punctuation and
    white space from this string and convert it to lowercase letters.</p>
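
The string preparation described above can be sketched as follows. The function name `prepare_for_segmentation` and the sample sentence are illustrative assumptions; the result is the kind of unsegmented lowercase string that can then be passed to Norvig's segmentation functions.

```python
def prepare_for_segmentation(text):
    """Remove punctuation and white space, then lowercase (hypothetical helper)."""
    # str.isalnum keeps letters and digits, including non-ASCII letters such as å
    return ''.join(ch for ch in text if ch.isalnum()).lower()

# Illustrative input only
prepared = prepare_for_segmentation('Det var en gång, en katt!')
print(prepared)
```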
<p>You will just add a cell at the end of Sect. 7 in Norvig's notebook, where you will use your string and
    run the notebook cell with the <b>segment()</b> and <b>segment2()</b> functions. </p>
<p>You will <b>comment on the segmentation results</b> you obtain with unigram and bigram models.
</p>