# MT Praktikum - Text processing and Language Models

In the directory `/opt/data/nc19/` you will find the raw text files from
the News-Commentary corpus which we will use today for preprocessing
and for the language models (the examples below work on
`europarl-v9.en`, which lives in the same directory). The corresponding
source data can be found in  
`/opt/data/wmt10-xlats/ref/wmt10-newssyscombtest2010-src.de.sgm`

Preprocessing
=============

Tokenization
------------

Tokenization, in brief terms, is the task of breaking down the text
stream into discrete units, called *tokens*. Before looking at
tokenization, let's first take a look at the data itself:


In [1]:
export LANGUAGE=C.UTF-8
export LC_ALL=C.UTF-8

In [2]:
head /opt/data/nc19/europarl-v9.en

Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday, 15 December 2000.
Statements by the President
Ladies and gentlemen, on Saturday, as you know, an earthquake struck Central America once again, with tragic consequences.
This is an area which has already been seriously affected on a number of occasions since the beginning of the twentieth century.
The latest, provisional, figures for victims in El Salvador are already very high.
There are 350 people dead, 1 200 people missing, the area is completely devastated and thousands of homes have been destroyed throughout the country.
The European Union has already shown its solidarity by sending a rescue team to the area, whilst financial assistance from the Union and Member States has been, or is in the process of being, released and I am able to inform you that some groups in the European Parliament have requested that this issue be included in the debate on topical and urgent subjects of m

In [3]:
wc -l /opt/data/nc19/europarl-v9.en

2295044 /opt/data/nc19/europarl-v9.en


First, let's manually try to extract the unique words in this corpus.
The following command will extract a (sorted and de-duplicated) list of
tokens, as well as an occurrence count for each of them, from the text
file:

In [4]:
cat /opt/data/nc19/europarl-v9.en | tr ' ' '\n' | sort | uniq -c > /tmp/europarl.hist
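
For readers who prefer Python, the same histogram can be sketched with `collections.Counter`. This is only an illustration on two made-up lines; the helper name `token_histogram` is an assumption, and unlike the shell pipeline it skips the empty items produced by consecutive spaces.

```python
from collections import Counter


def token_histogram(lines):
    # Count items separated by the space character, similar to
    # `tr ' ' '\n' | sort | uniq -c` (empty items are skipped here).
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in line.rstrip("\n").split(" ") if tok)
    return counts


# Toy input, not taken from the corpus.
sample = [
    "for example, this is an example.\n",
    "Another example: an Example\n",
]
hist = token_histogram(sample)
for token, count in sorted(hist.items()):
    print(f"{count:7d} {token}")
print(len(hist), "distinct items")
```

Even on this tiny input, `example,`, `example.`, `example:` and `Example` are counted as four different items.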

Then read the file *europarl.hist*. Try to answer the following
questions:

-   How many distinct items (strings separated by spaces) are there in the
    original europarl-v9.en data?

In [5]:
wc -l < /tmp/europarl.hist

329276


-   Is each one of the items totally unique, or can you spot some
    obvious redundancies?

In [6]:
cat /tmp/europarl.hist | grep -i ' *[0-9]*  *example'

      2 Example
    162 Examples
      1 Examples,
  11829 example
     12 example!
      4 example'
      1 example',
      1 example'.
     23 example)
     20 example),
     11 example).
      1 example)?
  16499 example,
   2328 example.
      1 example.4.If
      1 example.I
    379 example:
     46 example;
     91 example?
      1 example?!
   2263 examples
      2 examples)
      2 examples),
      3 examples).
    254 examples,
    438 examples.
    144 examples:
      9 examples;
      6 examples?
      1 examplesas
      1 example…


-   Does the phenomenon in question affect count-based statistical models
    (such as $n$-gram models) or neural language models?
  - Yes, it affects both. With a larger vocabulary and fewer samples per word, both the $n$-gram probabilities (maximum-likelihood estimates) and the learned word embeddings are estimated less reliably.

Tokenization aims at solving the problems that we observed. For
languages such as English and German, the tools are often implemented
with rule-based approaches. A standard tool for such tokenization is
`tokenizer.perl` from the `Moses` SMT project:

In [7]:
echo 'This is an example, (it shows how "tokenization" works).' |
    tokenizer.perl -l en  2>/dev/null

This is an example , ( it shows how &quot; tokenization &quot; works ) .
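
To get a feeling for what a rule-based tokenizer does, here is a deliberately tiny sketch with a single regular expression. It is not how `tokenizer.perl` is implemented: the pattern `TOKEN_RE` and the function `simple_tokenize` are assumptions made for illustration, and there is no escaping of special characters (such as `&quot;`) and no handling of abbreviations.

```python
import re

# Hypothetical, much simplified rule: a token is either a run of word
# characters (optionally joined by internal hyphens or apostrophes) or a
# single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(text):
    return TOKEN_RE.findall(text)


sentence = 'This is an example, (it shows how "tokenization" works).'
print(" ".join(simple_tokenize(sentence)))
```

With such a rule, `example,` and `example.` both contribute to the count of the single token `example`.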


You can run the tool to tokenize your input file:

In [8]:
tokenizer.perl -l en < /opt/data/nc19/europarl-v9.en > /tmp/europarl-v9.tok.en

Tokenizer Version 1.1
Language: en
Number of threads: 1


Note: The file name follows a common convention in the Natural Language
Processing (NLP) community: the `tok` infix simply signals that the file
is the tokenized version of the input file.

Now you can try to extract the vocabulary again.

In [9]:
cat /tmp/europarl-v9.tok.en | tr ' ' '\n' | sort | uniq -c > /tmp/europarl.tok.hist

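The tokenized text file should have the same number of lines as the
original one, but more whitespace-separated tokens, since punctuation is
split off into tokens of its own. By using the `wc` command you can
verify whether your command ran correctly.
How many words do you now have in this vocabulary?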

In [10]:
wc -l < /tmp/europarl.tok.hist

140473


True-Casing
-----------

When you look at the vocabulary file, you will probably find that some
words still occur twice, once in upper-case form and once in
lower-case form:

In [11]:
cat /tmp/europarl.tok.hist | grep -i ' *[0-9]*  *example$'

      2 Example
  31252 example


The more duplicates we can remove the better, so after tokenization we
will use a true-casing tool to strip even more redundancy.  
We apply true-casing in a two-step procedure:

1.  train a true-casing model to get the "true" case of each vocabulary
    word using  
    `$ train-truecaser.perl --model truecase-model.en --corpus europarl-v9.tok.en`

2.  apply the model to the data to convert upper-cased words at the
    beginning of the sentence to their respective "true" case:  
    `$ truecase.perl --model truecase-model.en < europarl-v9.tok.en > europarl-v9.true.en`  
    (it may take a few minutes to complete)

In [12]:
train-truecaser.perl --model /tmp/truecase-model.en --corpus /tmp/europarl-v9.tok.en
truecase.perl        --model /tmp/truecase-model.en        < /tmp/europarl-v9.tok.en > /tmp/europarl-v9.true.en

Note that we need the tokenized text file to train the model (Why? What
would happen if we use the original file?).  

If you check the model file
contents, you will see it simply contains statistics about upper-case
and lower-case occurrences for each word.

In [13]:
head -n128 /tmp/truecase-model.en

round-up (5/6) Round-up (1)
over-emphasised (12/12)
cicadas (1/1)
wrap (43/43)
instead (6602/6612) Instead (10)
improvised (49/49)
Christer (3/3)
firebrands (1/1)
emeritus (1/1)
sleeps (2/2)
al-Mabhouh (1/1)
ná (4/4)
C4-0661 (2/2)
christening (4/4)
compacted (5/5)
complacency (176/176)
Beitenu (1/1)
rum-producing (1/1)
in-transit (4/4)
IBAN (16/16)
conquering (21/21)
Balázs (5/5)
C5-0226 (1/1)
befell (20/20)
B5-0163 (2/2)
opposed (4185/4185)
non-transparently (1/1)
pallets (3/3)
root-andbranch (1/1)
H5N1 (18/18)
Kajumulo (1/1)
-B4-0764 (2/2)
Joel (4/4)
risksall (1/1)
C4-0122 (1/1)
berthing (3/3)
hobby-horses (3/3)
CO2-polluted (1/1)
Jangchub (2/2)
unido (1/2) UNIDO (1)
sector (27517/27538) Sector (21)
expectantly (9/9)
Pan-Orthodox (1/1)
place-based (3/3)
weapons-related (1/1)
EKO (1/1)
administratively (60/60)
TSIs (9/9)
Apalina (1/1)
SeaFrance (2/2)
hooks (5/5)
Saint-Barthélemy (1/1)
abnormally (31/31)
Euro-Jus (1/1)
definitive (944/951) Definitive (7)
O-0048 (2/2)
Bouler (2/2)
left-

In [14]:
echo 'Listen Potato Word Peter Germany USA' |
    truecase.perl --model /tmp/truecase-model.en
echo
grep -E -i '^(Listen|Potato|Word|Peter|Germany|USA) ' /tmp/truecase-model.en

listen potato Word Peter Germany USA

Peter (243/250) peter (7)
USA (3044/3044)
potato (245/245)
Germany (6193/6193)
word (4988/4996) WORD (1) Word (7)
listen (2362/2367) Listen (5)
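
The idea behind the two steps can be sketched in a few lines of Python. This is a simplified illustration, not the Moses implementation: the function names are made up, the toy corpus is invented, and only the first token of a sentence is rewritten (the real tool also handles other cases, as the `Potato` example above shows).

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    # Count surface forms per lowercased word. Sentence-initial tokens
    # are skipped because they are capitalised regardless of the word.
    forms = defaultdict(Counter)
    for sent in sentences:
        for tok in sent.split()[1:]:
            forms[tok.lower()][tok] += 1
    # The "true" case is the most frequent surface form.
    return {low: c.most_common(1)[0][0] for low, c in forms.items()}


def truecase(sentence, model):
    toks = sentence.split()
    if toks:
        # Only the first token is rewritten; unknown words stay unchanged.
        toks[0] = model.get(toks[0].lower(), toks[0])
    return " ".join(toks)


# Toy tokenized corpus, invented for this sketch.
corpus = [
    "We listen to Peter in Germany .",
    "Please listen to the word of Peter .",
    "Germany and the USA listen .",
]
model = train_truecaser(corpus)
print(truecase("Listen to Peter .", model))
print(truecase("Peter is here .", model))
```

The sketch also hints at why the model is trained on tokenized text: without tokenization, `Peter.` and `Peter` would be counted as different words.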


With the true-cased text you can now try to extract the vocabulary again.

In [15]:
cat /tmp/europarl-v9.true.en | tr ' ' '\n' | sort | uniq -c > /tmp/europarl.true.hist

If we did everything correctly the vocabulary size should have further
decreased.  
What is the vocabulary size at the moment?

In [16]:
wc -l < /tmp/europarl.true.hist

134462


Try to look at the histogram file a little bit more.  
Notice that many words share the same root and differ in suffixes or prefixes. 

In [17]:
cat /tmp/europarl.true.hist | grep -i ' *[0-9]*  *listen'

      1 Listen
      2 Listening
   2407 listen
   1718 listened
     10 listener
     29 listeners
   1438 listening
     92 listens


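Also, a large share of the items in the vocabulary appear only once in
the data (among them many numbers). Note that the patterns below are not
anchored to the start of the line, so they also match counts that merely
end in 1 (such as 11 or 21); the numbers are therefore upper bounds, and
an anchored pattern such as `'^ *1 '` gives the exact count.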

In [18]:
cat /tmp/europarl.true.hist | grep ' *1 ' | wc -l
cat /tmp/europarl.true.hist | grep ' *1 [0-9][0-9]*$' | wc -l

58699
1334
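
The same statistic can be computed without regular-expression pitfalls by comparing the counts directly. The following is a small sketch on a toy sentence; `singleton_stats` is a hypothetical helper name.

```python
from collections import Counter


def singleton_stats(counts):
    # Returns (number of types seen exactly once,
    #          number of those that consist of ASCII digits only).
    singletons = [tok for tok, c in counts.items() if c == 1]
    numeric = [tok for tok in singletons if tok.isascii() and tok.isdigit()]
    return len(singletons), len(numeric)


# Toy data, not taken from the corpus.
counts = Counter("the year 1987 and the year 2003 and 11 more years".split())
print(singleton_stats(counts))  # (5, 3)
```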


What could be the problem for algorithms that learn embeddings, or for
statistical/probabilistic models in general? (This is an open question
with several possible answers, but in general the problems come from
data sparsity and the curse of dimensionality.)

Byte-Pair Encoding
------------------

Byte-Pair Encoding (BPE) is an algorithm that helps us automatically
split words into smaller components. Since BPE is also a statistical
algorithm, first we need to extract the statistics from our data:

In [19]:
subword-nmt learn-bpe -s 32000 < /tmp/europarl-v9.true.en > /tmp/code.en


In [20]:
wc -l /tmp/code.en

32001 /tmp/code.en


In [21]:
tail /tmp/code.en

sea food</w>
scre ened</w>
sco t-free</w>
sa y
s as</w>
ru de</w>
ron ique</w>
restra ining</w>
reservo ir</w>
res a</w>
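
Each line of the codes file is one merge operation. How such merges are learned can be sketched as follows. This is a minimal version in the spirit of the original BPE-for-NMT algorithm, not the `subword-nmt` code itself: the function names are assumptions, `</w>` is treated as a separate end-of-word symbol, and the word frequencies are a toy example.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    # vocab maps a space-separated symbol sequence to its frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    # Replace every occurrence of the symbol pair by the merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, w): f for w, f in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    vocab = {" ".join(w) + " </w>": f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
for left, right in learn_bpe(toy_freqs, 4):
    print(left, right)
```

The number of merges plays the role of the `-s 32000` option above: more merges produce longer subword units and a larger vocabulary.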


Similar to true-casing, after learning the BPE codes we have to apply
them to the data:

In [22]:
subword-nmt apply-bpe -c /tmp/code.en < /tmp/europarl-v9.true.en > /tmp/europarl-v9.bpe.en

Now you can check the vocabulary size once more:

In [23]:
cat /tmp/europarl-v9.bpe.en | tr ' ' '\n' | sort | uniq -c > /tmp/europarl.bpe.hist

In [24]:
wc -l < /tmp/europarl.bpe.hist
echo
cat /tmp/europarl.bpe.hist | grep -i ' *[0-9]*  *listen'
echo
echo 'listeners andaverylongword' | subword-nmt apply-bpe -c /tmp/code.en

31520

   2407 listen
     39 listen@@
   1719 listened
   1438 listening
     92 listens

listen@@ ers an@@ da@@ very@@ long@@ word
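
Applying the codes to a single word can be sketched as replaying the learned merges in order. The merge list below is hypothetical (it is not read from `/tmp/code.en`), and the real tool is more elaborate, but the `@@ ` rendering follows the convention visible in the output above.

```python
def apply_bpe(word, merges):
    # Start from single characters plus an end-of-word symbol.
    symbols = list(word) + ["</w>"]
    for a, b in merges:  # replay merges in the order they were learned
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Drop the end-of-word marker and mark word-internal boundaries.
    if symbols[-1] == "</w>":
        symbols = symbols[:-1]
    else:
        symbols[-1] = symbols[-1][: -len("</w>")]
    return "@@ ".join(symbols)


# Hypothetical merge operations chosen for this example.
merges = [("l", "i"), ("s", "t"), ("e", "n"), ("li", "st"), ("list", "en"),
          ("e", "r"), ("er", "s"), ("ers", "</w>")]
print(apply_bpe("listeners", merges))  # listen@@ ers
print(apply_bpe("listen", merges))     # listen
```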


Statistical Language Model
==========================

In the following we will use n-gram language modeling to look at domain,
an important consideration when training NMT models.  
All of the data are provided at `/opt/data/lmdom/`.  
To train an order-n language model use
```bash
lmplz -o {order} < {training_data} > {lm_name.arpa}
```

In [25]:
lmplz -o 2 < /opt/data/lmdom/wiki.en.txt > /tmp/wiki.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /opt/data/lmdom/wiki.en.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 42681165 types 1114998
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:13379976 2:216187715584
Statistics:
1 1114998 D1=0.731948 D2=1.03082 D3+=1.2829
2 9551015 D1=0.754579 D2=1.0721 D3+=1.31276
Memory estimate for binary LM:
type     MB
probing 191 assuming -p 1.5
probing 195 assuming -r models -p 1.5
trie     84 without quantization
trie     58 assuming -q 8 -b 8 quantization 
trie     84 assuming -a 22 array pointer compression
trie     58 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:13379976 2:152816240
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---

To get the perplexity of a dataset using an arpa file use
```bash
python3 perp.py {lm_name.arpa} {text_data}
```
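
What such a script might do internally can be sketched as follows, assuming that `perp.py` wraps the `kenlm` Python module; how `perp.py` actually normalises the score is not shown in this notebook, so the token counting below is an assumption. `kenlm.Model.score` returns a base-10 log-probability of the whole sentence.

```python
def perplexity_from_log10(total_log10_prob, num_tokens):
    # Perplexity = 10 ** (-(1/N) * sum of log10 probabilities).
    return 10.0 ** (-total_log10_prob / num_tokens)


def corpus_perplexity(arpa_path, text_path):
    # Imported lazily so that the helper above also works without kenlm.
    import kenlm

    model = kenlm.Model(arpa_path)
    total, n = 0.0, 0
    with open(text_path, encoding="utf-8") as fh:
        for line in fh:
            words = line.split()
            # log10 P(sentence), including the end-of-sentence token
            total += model.score(" ".join(words), bos=True, eos=True)
            n += len(words) + 1  # assumption: count </s> as a token
    return perplexity_from_log10(total, n)


# Ten tokens with probability 1/10 each have a total log10 of -10.
print(perplexity_from_log10(-10.0, 10))  # 10.0
```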

In [26]:
perp.py /tmp/wiki.arpa /opt/data/lmdom/wiki.en.txt
perp.py /tmp/wiki.arpa /tmp/wiki.arpa

Loading the LM will be faster if you build a binary file.
Reading /tmp/wiki.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
212.9959476604841
Loading the LM will be faster if you build a binary file.
Reading /tmp/wiki.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
353676.242629899


-   First, train a bigram ($n=2$) LM on the English UDHR. Use this model
    on `dev` data. What is the perplexity?   

In [27]:
lmplz -o 2 < /opt/data/lmdom/english.udhr > /tmp/udhr.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /opt/data/lmdom/english.udhr
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 1778 types 627
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:7524 2:216201084928
Statistics:
1 627 D1=0.786408 D2=1.03486 D3+=1.95146
2 1341 D1=0.8418 D2=1.35127 D3+=1.19614
Memory estimate for binary LM:
type    kB
probing 39 assuming -p 1.5
probing 41 assuming -r models -p 1.5
trie    21 without quantization
trie    18 assuming -q 8 -b 8 quantization 
trie    21 assuming -a 22 array pointer compression
trie    18 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:7524 2:21456
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
##########

In [28]:
head /opt/data/lmdom/english.udhr

Universal Declaration of Human Rights
Preamble
Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,
Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,
Whereas it is essential, if man is not to be compelled to have recourse, as a last resort, to rebellion against tyranny and oppression, that human rights should be protected by the rule of law,
Whereas it is essential to promote the development of friendly relations between nations,
Whereas the peoples of the United Nations have in the Charter reaffirmed their faith in fundamental human rights, in the dignity and worth of the human person and in the equ

In [29]:
perp.py /tmp/udhr.arpa /opt/data/lmdom/dev

Loading the LM will be faster if you build a binary file.
Reading /tmp/udhr.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
613.303224701251


-   Now, train a bigram LM on the Wikipedia data. Use this model to
    calculate the perplexity of the `dev` data. What is the perplexity?

In [30]:
perp.py /tmp/wiki.arpa /opt/data/lmdom/dev

Loading the LM will be faster if you build a binary file.
Reading /tmp/wiki.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
2455.2919557159653


-   Finally, train a bigram LM on the novel chapter. Use this model to
    calculate the perplexity of the `dev` data. What is the perplexity?    

In [31]:
head /opt/data/lmdom/hpchapter1.txt

Mr. and Mrs. Dursley , of number four , Privet Drive , were proud to say that they were perfectly normal , thank you very much . They were the last people you &apos;d expect to be involved in anything strange or mysterious , because they just didn &apos;t hold with such nonsense .

Mr. Dursley was the director of a firm called Grunnings , which made drills . He was a big , beefy man with hardly any neck , although he did have a very large mustache . Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck , which came in very useful as she spent so much of her time craning over garden fences , spying on the neighbors . The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere .

The Dursleys had everything they wanted , but they also had a secret , and their greatest fear was that somebody would discover it . They didn &apos;t think they could bear it if anyone found out about the Potters . Mrs. Potter was Mrs. Dursley &apos;s

In [32]:
lmplz -o 2 < /opt/data/lmdom/hpchapter1.txt > /tmp/hpchapter1.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /opt/data/lmdom/hpchapter1.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 5722 types 1251
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:15012 2:216201084928
Statistics:
1 1251 D1=0.637427 D2=1.27739 D3+=1.75624
2 4003 D1=0.808964 D2=1.09147 D3+=1.53722
Memory estimate for binary LM:
type     kB
probing 102 assuming -p 1.5
probing 107 assuming -r models -p 1.5
trie     49 without quantization
trie     39 assuming -q 8 -b 8 quantization 
trie     49 assuming -a 22 array pointer compression
trie     39 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:15012 2:64048
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95-

In [33]:
perp.py /tmp/hpchapter1.arpa /opt/data/lmdom/dev

Loading the LM will be faster if you build a binary file.
Reading /tmp/hpchapter1.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
516.76552634914


-   How was the perplexity different with each of the models? Look at
    the first few lines of each training dataset, and the `dev` data.
    Why might this be?

In [34]:
head -n128 /opt/data/lmdom/dev

A Long-Expected Party
When Mr. Bilbo Baggins of Bag End announced that he would shortly be
celebrating his eleventy-first birthday with a party of special
magnificence , there was much talk and excitement in Hobbiton .
Bilbo was very rich and very peculiar , and had been the
wonder of the Shire for sixty years , ever since his remarkable
disappearance and unexpected return . The riches he had brought back
from his travels had now become a local legend , and it was popularly
believed , whatever the old folk might say , that the Hill at Bag End
was full of tunnels stuffed with treasure . And if that was not enough
for fame , there was also his prolonged vigour to marvel at . Time wore
on , but it seemed to have little effect on Mr. Baggins . At ninety he
was much the same as at fifty . At ninety-nine they began to call him
well-preserved ; but unchanged would have been nearer the mark . There
were some that shook their heads and thought this was too much of a
good thing ; it seemed unfai

Mr. Bilbo when he came back , a matter of sixty years ago , when I was
a lad . I &quot; d not long come prentice to old Holman ( him being my dad &quot; s
cousin ) , but he had me up at Bag End helping him to keep folks from
trampling and trapessing all over the garden while the sale was on .


-   How do you think this would change if you used a larger order
    (e.g. 5) LM?   
    N-grams can be good models, but only if the test corpus looks like
    the training corpus. In reality, it often does not. We need to come
    up with adaptation and smoothing methods to account for this!
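
One simple way to quantify how much a test corpus "looks like" the training corpus is the share of test tokens that never occur in the training data (the out-of-vocabulary rate). The sketch below uses invented toy sentences and a hypothetical helper `oov_rate`; on real data one would read the tokens from the files in `/opt/data/lmdom/`.

```python
def oov_rate(train_tokens, test_tokens):
    # Fraction of test tokens that do not occur in the training data.
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy sentences, invented for illustration.
train = "all human beings are born free and equal in dignity and rights".split()
in_domain = "all human beings are free".split()
out_of_domain = "Bilbo was very rich and very peculiar".split()

print(oov_rate(train, in_domain))      # 0.0
print(oov_rate(train, out_of_domain))  # 6 of 7 tokens are unseen
```

The same comparison can be made for bigrams or longer n-grams, where a mismatch in domain shows up even more strongly.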

In [35]:
lmplz -o 4 < /opt/data/lmdom/hpchapter1.txt > /tmp/hp_o4.arpa
perp.py /tmp/hp_o4.arpa /opt/data/lmdom/dev

=== 1/5 Counting and sorting n-grams ===
Reading /opt/data/lmdom/hpchapter1.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 5722 types 1251
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:15012 2:36800184320 3:69000347648 4:110400552960
Statistics:
1 1251 D1=0.637427 D2=1.27739 D3+=1.75624
2 4003 D1=0.82817 D2=1.23715 D3+=1.60199
3 5247 D1=0.934414 D2=1.50625 D3+=1.67374
4 5494 D1=0.976564 D2=1.54928 D3+=2.60937
Memory estimate for binary LM:
type     kB
probing 345 assuming -p 1.5
probing 404 assuming -r models -p 1.5
trie    155 without quantization
trie     92 assuming -q 8 -b 8 quantization 
trie    149 assuming -a 22 array pointer compression
trie     85 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1

In [36]:
lmplz -o 4 < /opt/data/lmdom/english.udhr > /tmp/udhr_o4.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /opt/data/lmdom/english.udhr
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 1778 types 627
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:7524 2:36800184320 3:69000347648 4:110400552960
Statistics:
1 627 D1=0.786408 D2=1.03486 D3+=1.95146
2 1341 D1=0.872832 D2=1.43464 D3+=1.52997
3 1551 D1=0.956688 D2=1.4091 D3+=0.813285
4 1543 D1=0.950096 D2=1.41533 D3+=0.62476
Memory estimate for binary LM:
type     kB
probing 110 assuming -p 1.5
probing 130 assuming -r models -p 1.5
trie     52 without quantization
trie     36 assuming -q 8 -b 8 quantization 
trie     50 assuming -a 22 array pointer compression
trie     34 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:752

In [37]:
perp.py /tmp/udhr_o4.arpa /opt/data/lmdom/dev

Loading the LM will be faster if you build a binary file.
Reading /tmp/udhr_o4.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************

-   Perplexity is often used as intrinsic evaluation for language
    models. Let's think about what these values mean more closely.
    Suppose a 'sentence' consists of random digits. What is the
    perplexity of this sentence, if our model assigns probability
    $p=1/10$ to each digit?   
  - \begin{align*}
      H &= -\sum_{i=0}^{9}{\frac{1}{10} \log \left( \frac{1}{10} \right) } \\
        &= \log(10) \\
      \implies ppl &= e^{H} = 10
    \end{align*}
  - "intuitively": perplexity is the "average branching factor". Given a uniform distribution, each decoding step branches into 10 equally probable decisions  
     $\implies$ average branching factor 10

-   Consider now a natural language sentence. What is the maximum
    perplexity of a sentence with 10 tokens? With 100 tokens?   
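  - Taking the uniform distribution as the worst case (see above), each of the $N$ tokens gets probability $\frac{1}{V}$ for vocabulary size $V$, so the perplexity does not depend on the sentence length:
    $$ ppl = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{1}{V}\right)\right) = \exp\left(\log\left(V\right)\right) = V $$ 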
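
A quick numeric check of the random-digit case, for two sentence lengths. The function `perplexity` is a small helper written for this sketch; it takes the per-token probabilities that a model assigns to a sentence.

```python
import math


def perplexity(probs):
    # exp of the average negative log-probability of the tokens
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


print(perplexity([1 / 10] * 10))   # approximately 10
print(perplexity([1 / 10] * 100))  # approximately 10
```

Because the log-probabilities are averaged over the tokens, the length of the sentence cancels out when every token has the same probability.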

-   Let's return to our language models. Pick one of the three datasets,
    and train trigram and 4-gram LMs as well. Evaluate the perplexity of
    the `dev` data. How is it different between the bigram, trigram, and
    4-gram models? Why might this be?   

-   One major problem with models is generalization. If `dev` contains a
    bigram that we have never seen in training, an unsmoothed model will
    assign probability 0 to the sentence and we can't compute the
    perplexity (can't divide by 0!). Not good.  
    In order to do something about this, people typically use smoothing
    methods. The simplest is called add-one or Laplace smoothing. This
    is as simple as it sounds: we increment the count of every word
    type by 1 and, to keep the probabilities normalized, increase the
    denominator by the size of the vocabulary $V$ (the number of unique
    words), where $N$ is the total number of tokens.  
    Now, there is a small probability allocated for unknown words:
    unseen n-grams have $\frac{1}{N+V}$ instead of 0!
    $$\mathcal{P}_{Laplace}(w_i) = \frac{count_i + 1}{N + V}$$

    A basic bigram language model has been coded for you in python,
    `bigram_lm.py`. Use this script to train a bigram LM on
    hpchapter1.txt. It will print the model entropy.   
    In this last exercise, modify this script to use add-one smoothing.
    Now train a bigram LM. How has the entropy changed?   
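
Independently of `bigram_lm.py`, whose internals are not shown here, add-one smoothing can be sketched on a toy sentence. The helper names are made up, and one extra vocabulary slot for unknown words is an assumption added so that the unigram probabilities sum to one.

```python
from collections import Counter


def laplace_unigram(tokens):
    counts = Counter(tokens)
    n = len(tokens)
    v = len(counts) + 1  # assumption: one extra slot for unknown words

    def prob(word):
        # (count + 1) / (N + V); unseen words get 1 / (N + V)
        return (counts[word] + 1) / (n + v)

    return prob


def laplace_bigram(tokens):
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])
    v = len(set(tokens)) + 1  # same assumption as above

    def prob(prev, word):
        # (count(prev, word) + 1) / (count(prev) + V)
        return (bigrams[(prev, word)] + 1) / (history[prev] + v)

    return prob


tokens = "the cat sat on the mat".split()
p1 = laplace_unigram(tokens)
p2 = laplace_bigram(tokens)
print(p1("the"), p1("dog"))                # 0.25 and 1/12
print(p2("the", "cat"), p2("the", "dog"))  # 0.25 and 0.125
```

Every unseen event now has a non-zero probability, at the price of taking probability mass away from the events that were actually observed.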

In [38]:
bigram_lm.py /opt/data/lmdom/hpchapter1.txt

The entropy of the bigram model for this file is: 3.052 bits.


In [39]:
cp /usr/local/bin/bigram_lm.py bigram_lm.py
patch bigram_lm.py <<EOF
--- /usr/local/bin/bigram_lm.py	2021-07-05 12:54:44.000000000 +0000
+++ bigram_lm.py	2022-06-20 12:38:37.247364696 +0000
@@ -39,8 +39,8 @@
         bigramsplit = k.split("_")
         hist = bigramsplit[0]
         if hist==j:
-            numer = bigramfreqs[k]
-            denom = bigramhcs[j]
+            numer = bigramfreqs[k]+1
+            denom = bigramhcs[j]+bigramnum
             frac = numer/denom
             y = math.log(frac,2)
             z = numer*y
EOF

patching file bigram_lm.py


In [40]:
./bigram_lm.py /opt/data/lmdom/hpchapter1.txt

The entropy of the bigram model for this file is: 18.98 bits.
