# Assignment #2: Language models

## Objectives

The objectives of this assignment are to:
* Write a program to find n-gram statistics
* Compute the probability of a sentence
* Know what a language model is
* Write a short report of 1 to 2 pages on the assignment
* Optionally read a short article on the importance of corpora


## Organization

* Each group will have to write Python programs to count unigrams, bigrams, and trigrams in a corpus of approximately one million words and to determine the probability of a sentence.
* You can test your regular expressions using the regex101.com site.
* Each student will have to write a short report of one to two pages and comment briefly on the results. In your report, you must include the tabulated results of your analysis as described below.

## Programming

### Collecting a corpus

<ol>
            <li>Retrieve a corpus of novels by Selma Lagerl&ouml;f from this URL:
                <a href="https://github.com/pnugues/ilppp/blob/master/programs/corpus/Selma.txt">
                    <tt>https://github.com/pnugues/ilppp/blob/master/programs/corpus/Selma.txt</tt>
                </a>. The text of these novels was extracted
                from <a href="https://litteraturbanken.se/forfattare/LagerlofS/titlar">Lagerlöf arkivet</a> at
                <a href="https://litteraturbanken.se/">Litteraturbanken</a>.
            </li>
            <li>Alternatively, you can collect a corpus of at least 750,000 words. You will check the number of words using the Unix
                command <tt>wc -w</tt>.
            </li>
            <li>Run the <a href="https://github.com/pnugues/ilppp/tree/master/programs/ch02/python">concordance
                program
            </a> to print the lines containing a specific word, for instance <i>Nils</i>.
            </li>
            <li>Run the <a href="https://github.com/pnugues/ilppp/tree/master/programs/ch05/python">tokenization
                program
            </a> on your corpus and count the words using the Unix <tt>sort</tt> and <tt>uniq</tt> commands.
            </li>
        </ol>

### Normalizing a corpus

Write a program to insert `<s>` and `</s>` tags to delimit sentences. You can start from the tokenization program and modify it. Use a simple heuristic such as: a sentence starts with a capital letter and ends with a period. Give a rough estimate of the accuracy of your program.
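
The heuristic above can be sketched with a single regular expression. This is a minimal illustration, not the expected solution: the name `tag_sentences` and the hand-written uppercase class for Swedish are assumptions, and the pattern will miss sentences ending in `?` or `!` and will split on abbreviations.

```python
import re

# Heuristic from the text: a sentence starts with a capital letter and
# ends with a period. The uppercase class is spelled out for Swedish
# (an assumption); the standard re module has no \p{Lu}.
SENTENCE = re.compile(r'[A-ZÅÄÖ][^.]*\.')

def tag_sentences(text):
    """Return one '<s> ... </s>' line per sentence found by the heuristic."""
    text = re.sub(r'\s+', ' ', text)  # collapse line breaks and runs of spaces
    return '\n'.join(f'<s> {m.group()} </s>' for m in SENTENCE.finditer(text))

print(tag_sentences('Det var en gång en katt. Den hette\nNils.'))
```

Comparing the output with a few dozen hand-checked sentences is one simple way to estimate the accuracy.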

In [None]:
# Write your code here

Modify your program to remove the punctuation signs and convert all the text to lowercase.
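
One possible way to normalize a single sentence is shown below. The function name `normalize` is hypothetical, and keeping only runs of letters (so digits are dropped and hyphenated words are split) is an assumption about the desired output.

```python
import re

def normalize(sentence):
    """Lowercase a raw sentence, drop punctuation, and wrap it in tags."""
    # [^\W\d_]+ matches runs of Unicode letters only; digits and
    # underscores are discarded (an assumption, see the lead-in).
    words = re.findall(r'[^\W\d_]+', sentence.lower())
    return '<s> ' + ' '.join(words) + ' </s>'

print(normalize('Prästens ord tycktes redan ha gått i uppfyllelse.'))
```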

In [2]:
# Write your code here

The result should be a normalized text without punctuation signs where all the sentences are delimited with `<s>` and `</s>` tags. The last five lines of the text should look like this:

```
<s> hon hade fått större kärlek av sina föräldrar än någon annan han visste och sådan kärlek måste vändas i välsignelse </s> 
<s> då prästen sade detta kom alla människor att se bort mot klara gulla och de förundrade sig över vad de såg </s>
<s> prästens ord tycktes redan ha gått i uppfyllelse </s>
<s> där stod klara fina gulleborg ifrån skrolycka hon som var uppkallad efter själva solen vid sina föräldrars grav och lyste som en förklarad </s>
<s> hon var likaså vacker som den söndagen då hon gick till kyrkan i den röda klänningen om inte vackrare </s>
```

### Counting unigrams and bigrams

<ol>
    <li>Read and try programs to compute the frequency of unigrams and bigrams of the training set: [<a
            href="https://github.com/pnugues/ilppp/tree/master/programs/ch05/python">Program folder</a>].
    </li>
    <li>What is the possible number of bigrams, and how many distinct bigrams actually occur in the corpus? Explain
        why the two numbers differ. What would be the possible number of 4-grams?
    </li>
    <li>Propose a solution to cope with bigrams unseen in the corpus. This topic will be discussed during the
        lab session.
    </li>
</ol>
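
As a rough illustration of the counting step, n-grams can be collected with `collections.Counter`. The helper `count_ngrams` and the toy sentence are assumptions made for this sketch; the programs in the course folder may be organized differently. The possible number of n-grams is taken here as the vocabulary size raised to the power n.

```python
from collections import Counter

def count_ngrams(words, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Toy input; in the assignment, words would be the normalized corpus.
words = '<s> det var en gång en katt </s>'.split()
unigrams = count_ngrams(words, 1)
bigrams = count_ngrams(words, 2)

vocabulary_size = len(unigrams)
possible_bigrams = vocabulary_size ** 2   # every ordered pair of word types
possible_4grams = vocabulary_size ** 4
observed_bigrams = len(bigrams)

print(vocabulary_size, possible_bigrams, observed_bigrams, possible_4grams)
```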

### Computing the likelihood of a sentence

Write a program to compute the probability of a sentence using unigrams. The dictionaries used in the mutual information program may be useful: [<a
                href="https://github.com/pnugues/ilppp/tree/master/programs/ch05/python">Program
            folder</a>]

In [None]:
# Write your code

Write a program to compute the sentence probability using bigrams.

In [None]:
# Write your code

Select five sentences in your test set and run your programs on them. Tabulate your results as in the examples below, which use the sentence <i>Det var en gång en katt som hette
        Nils</i>.

Unigram model

```
=====================================================
wi 	 C(wi) 	 #words 	 P(wi)
=====================================================
det 	 21108 	 1041631 	 0.0202643738521607
var 	 12090 	 1041631 	 0.01160679741674355
en 	 13514 	 1041631 	 0.01297388422579589
gång 	 1332 	 1041631 	 0.001278763784871994
en 	 13514 	 1041631 	 0.01297388422579589
katt 	 16 	 1041631 	 1.5360525944408337e-05
som 	 16288 	 1041631 	 0.015637015411407686
hette 	 97 	 1041631 	 9.312318853797554e-05
nils 	 87 	 1041631 	 8.352285982272032e-05
</s> 	 59047 	 1041631 	 0.056687060964967444
=====================================================
Prob. unigrams:	 5.361459667285409e-27
Geometric mean prob.: 0.0023600885848765307
Entropy rate:	 8.726943273141258
Perplexity:	 423.71290908655254
```
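
The summary figures in the unigram table can be reproduced from the counts it lists. The sketch below assumes that the sentence length counts the closing `</s>` but not `<s>` (ten words here), that the entropy rate is the negative mean of the log2 probabilities, and that the perplexity is 2 raised to the entropy rate. Variable and function names are illustrative.

```python
import math

# Counts copied from the unigram table above.
unigram_counts = {'det': 21108, 'var': 12090, 'en': 13514, 'gång': 1332,
                  'katt': 16, 'som': 16288, 'hette': 97, 'nils': 87,
                  '</s>': 59047}
total_words = 1041631

def unigram_logprob(words, counts, total):
    """Sum of log2 P(w) over the words; raises KeyError on unseen words."""
    return sum(math.log2(counts[w] / total) for w in words)

sentence = 'det var en gång en katt som hette nils </s>'.split()
log_prob = unigram_logprob(sentence, unigram_counts, total_words)

prob = 2 ** log_prob                          # product of the P(wi)
entropy_rate = -log_prob / len(sentence)      # bits per word
perplexity = 2 ** entropy_rate
geometric_mean = 2 ** (-entropy_rate)         # equals 1 / perplexity

print(prob, geometric_mean, entropy_rate, perplexity)
```

Summing logarithms instead of multiplying probabilities avoids underflow on long sentences.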

Bigram model

```
=====================================================
wi 	 wi+1 	 Ci,i+1 	 C(i) 	 P(wi+1|wi)
=====================================================
<s>	 det 	 5672 	 59047 	 0.09605907158704083
det 	 var 	 3839 	 21108 	 0.1818741709304529
var 	 en 	 712 	 12090 	 0.058891645988420185
en 	 gång 	 706 	 13514 	 0.052242119283705785
gång 	 en 	 20 	 1332 	 0.015015015015015015
en 	 katt 	 6 	 13514 	 0.0004439840165754033
katt 	 som 	 2 	 16 	 0.125
som 	 hette 	 45 	 16288 	 0.002762770137524558
hette 	 nils 	 0 	 97 	 0.0 	 *backoff: 	 8.352285982272032e-05
nils 	 </s> 	 2 	 87 	 0.022988505747126436
=====================================================
Prob. bigrams:	 2.376007803503683e-19
Geometric mean prob.: 0.013727289294133601
Entropy rate:	 6.186809422848149
Perplexity:	 72.84759420254609
```
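
The bigram table can be checked in the same way. In this sketch, an unseen bigram such as *hette nils* falls back to the plain unigram probability of the second word, which is what the `*backoff` column shows. No discounting is applied, so the resulting values do not form a proper probability distribution. The names are illustrative.

```python
import math

# Counts copied from the unigram and bigram tables above.
unigram_counts = {'<s>': 59047, 'det': 21108, 'var': 12090, 'en': 13514,
                  'gång': 1332, 'katt': 16, 'som': 16288, 'hette': 97,
                  'nils': 87, '</s>': 59047}
bigram_counts = {('<s>', 'det'): 5672, ('det', 'var'): 3839,
                 ('var', 'en'): 712, ('en', 'gång'): 706,
                 ('gång', 'en'): 20, ('en', 'katt'): 6,
                 ('katt', 'som'): 2, ('som', 'hette'): 45,
                 ('nils', '</s>'): 2}
total_words = 1041631

def bigram_prob(prev, word):
    """P(word | prev), falling back to P(word) when the bigram is unseen."""
    count = bigram_counts.get((prev, word), 0)
    if count:
        return count / unigram_counts[prev]
    return unigram_counts[word] / total_words  # undiscounted fallback

words = '<s> det var en gång en katt som hette nils </s>'.split()
prob = math.prod(bigram_prob(p, w) for p, w in zip(words, words[1:]))
n = len(words) - 1                 # <s> is context only, not predicted
perplexity = prob ** (-1 / n)

print(prob, perplexity)
```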

## Testing

Write a loop that reads the words as you type them and applies the unigram model to predict the next word.

In [4]:
# Write your code here

Write a loop that reads the words as you type them and applies the bigram model to predict the next word.
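
One reading of this task is next-word prediction: after each line typed, propose the most frequent successors of the last word. The sketch below follows that reading. The functions `predict_next` and `prediction_loop`, the `read` parameter, and the toy counts are assumptions for illustration.

```python
from collections import Counter

def predict_next(previous_word, bigram_counts, k=3):
    """Return the k most frequent words following previous_word."""
    followers = Counter({w2: c for (w1, w2), c in bigram_counts.items()
                         if w1 == previous_word})
    return [w for w, _ in followers.most_common(k)]

def prediction_loop(bigram_counts, read=input):
    """Read lines until an empty one; print predictions after each line."""
    while True:
        typed = read('> ').lower().split()
        if not typed:
            break
        print(predict_next(typed[-1], bigram_counts))

# Toy counts taken from the bigram table of this assignment.
toy_counts = {('en', 'gång'): 706, ('en', 'katt'): 6, ('det', 'var'): 3839}
print(predict_next('en', toy_counts))
```

Scanning the whole dictionary on every call is slow on a full corpus; indexing the bigrams by their first word beforehand would avoid this.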

In [5]:
# Write your code here

## Reading

<p>As an application of n-grams, execute the Jupyter notebook by Peter Norvig <a
        href="http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb">
    here</a>. Just run all the cells and be sure that you understand the code.
    You will find the data <a href="http://norvig.com/ngrams/">here</a>.</p>
<p>In your report, you will also describe one experiment with a long string of words
    that you will create yourself or copy from a text you like. Remove all the punctuation and
    white space from this string, and convert it to lowercase letters.</p>
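
Preparing the string can be done in one line. This is a small sketch: the function name `prepare` and the example sentence are made up, and the result is what would then be passed to the segmentation functions in the notebook.

```python
import re

def prepare(text):
    """Strip punctuation, underscores, and white space, then lowercase."""
    return re.sub(r'[\W_]+', '', text).lower()

print(prepare('It was a bright, cold day in April!'))
```
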
<p>You will just add a cell at the end of Sect. 7 in Norvig's notebook, where you will use your string and
    run the notebook cell with the <tt>segment()</tt> and <tt>segment2()</tt> functions. </p>
<p>You will comment on the segmentation results you obtain with unigram and bigram models.
</p>