# MT Praktikum 2: Text processing and Language Models

The directory `/opt/data/nc19/` contains the raw text files from
the News-Commentary corpus, which we will use today for preprocessing
as well as for the language model. You can find the corresponding
source data in  
`/opt/data/wmt10-xlats/ref/wmt10-newssyscombtest2010-src.de.sgm`.

Preprocessing
=============

Tokenization
------------

Tokenization, in brief terms, is the task of breaking down the text
stream into discrete units, called *tokens*. Before looking at
tokenization, let's first take a look at the data itself:


In [1]:
head /opt/data/nc19/europarl-v9.en

Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday, 15 December 2000.
Statements by the President
Ladies and gentlemen, on Saturday, as you know, an earthquake struck Central America once again, with tragic consequences.
This is an area which has already been seriously affected on a number of occasions since the beginning of the twentieth century.
The latest, provisional, figures for victims in El Salvador are already very high.
There are 350 people dead, 1 200 people missing, the area is completely devastated and thousands of homes have been destroyed throughout the country.
The European Union has already shown its solidarity by sending a rescue team to the area, whilst financial assistance from the Union and Member States has been, or is in the process of being, released and I am able to inform you that some groups in the European Parliament have requested that this issue be included in the debate on topical and urgent subjects of m

First, let's manually try to extract the unique words in this corpus.
The following command will extract a (sorted and de-duplicated) list of
tokens, as well as an occurrence count for each of them, from the text
file:

In [2]:
cat /opt/data/nc19/europarl-v9.en | tr ' ' '\n' | sort | uniq -c > /tmp/europarl.hist

Then read the file *europarl.hist*. Try to answer the following
questions:

-   How many distinct items (strings separated by spaces) are there in
    the original europarl-v9.en data?

In [3]:
wc -l < /tmp/europarl.hist

329276


-   Is each one of the items totally unique, or can you spot some
    obvious redundancies?

In [4]:
cat /tmp/europarl.hist | grep -i ' *[0-9]*  *example'

[01;31m[K      2 Example[m[K
[01;31m[K    162 Example[m[Ks
[01;31m[K      1 Example[m[Ks,
[01;31m[K  11829 example[m[K
[01;31m[K     12 example[m[K!
[01;31m[K      4 example[m[K'
[01;31m[K      1 example[m[K',
[01;31m[K      1 example[m[K'.
[01;31m[K     23 example[m[K)
[01;31m[K     20 example[m[K),
[01;31m[K     11 example[m[K).
[01;31m[K      1 example[m[K)?
[01;31m[K  16499 example[m[K,
[01;31m[K   2328 example[m[K.
[01;31m[K      1 example[m[K.4.If
[01;31m[K      1 example[m[K.I
[01;31m[K    379 example[m[K:
[01;31m[K     46 example[m[K;
[01;31m[K     91 example[m[K?
[01;31m[K      1 example[m[K?!
[01;31m[K   2263 example[m[Ks
[01;31m[K      2 example[m[Ks)
[01;31m[K      2 example[m[Ks),
[01;31m[K      3 example[m[Ks).
[01;31m[K    254 example[m[Ks,
[01;31m[K    438 example[m[Ks.
[01;31m[K    144 example[m[Ks:
[01;31m[K      9 example[m[Ks;
[01;31m[K      6 example[m

-   Does the phenomenon in question affect statistical models (such as
    $n$-gram models) or probabilistic models such as neural language
    models?
  - Yes, it affects both. With more vocabulary items and fewer samples per item, the maximum-likelihood estimates for the word embeddings, as well as for the $n$-gram probabilities, become less reliable.
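
The redundancy observed above can be reproduced on a tiny scale. The following is a minimal Python sketch of the same idea as the `tr | sort | uniq -c` pipeline; the example sentence is invented for illustration and is not taken from the corpus.

```python
from collections import Counter

# Toy text (invented for illustration, not taken from the lab data).
text = "For example, this is an example. Another Example: the last example"

# Same idea as `tr ' ' '\n' | sort | uniq -c`: split on spaces and count.
hist = Counter(text.split(" "))

# Every punctuation or casing variant of "example" is its own item.
variants = sorted(item for item in hist if "example" in item.lower())
print(variants)   # the four surface forms of "example"
print(len(hist))  # number of distinct items in the toy text
```

Without tokenization, the four variants share no statistics, although they are the same word to a human reader.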

Tokenization aims at solving the problems that we observed. For
languages such as English and German, the tools are often implemented
with rule-based approaches. A standard tool for such tokenization is
`tokenizer.perl` from the `Moses` SMT project:

In [5]:
echo 'This is an example, (it shows how "tokenization" works).' |
    tokenizer.perl -l en  2>/dev/null

This is an example , ( it shows how &quot; tokenization &quot; works ) .


You can run the tool to tokenize your input file:

In [6]:
tokenizer.perl -l en < /opt/data/nc19/europarl-v9.en > /tmp/europarl-v9.tok.en

Tokenizer Version 1.1
Language: en
Number of threads: 1


Note: The file name follows a common convention in the Natural Language
Processing (NLP) community: the `tok` suffix indicates that
tokenization has been applied to the input file.
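
To get a feeling for what a rule-based tokenizer does, here is a deliberately rough Python sketch. It is not the Moses implementation: the function name `simple_tokenize` and the single regular expression are assumptions made for illustration, and it ignores abbreviations, hyphenated words, and the HTML-style escaping (`&quot;`) that `tokenizer.perl` performs.

```python
import re

def simple_tokenize(line: str) -> str:
    """Very rough rule-based tokenizer: split punctuation off from words."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", line)
    return " ".join(tokens)

print(simple_tokenize('This is an example, (it shows how "tokenization" works).'))
```

The output separates commas, brackets, quotes, and the final period from the neighbouring words, so that `example,` and `example` collapse into the same vocabulary item.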

Now you can try to extract the vocabulary again.

In [7]:
cat /tmp/europarl-v9.tok.en | tr ' ' '\n' | sort | uniq -c > /tmp/europarl.tok.hist

The tokenized text file has the same number of lines as the original
one, but more space-separated items, because punctuation is split off
into separate tokens. You can use the `wc` command to verify that your
command ran correctly.
How many distinct items do you now have in this vocabulary?

In [8]:
wc -l < /tmp/europarl.tok.hist

140473


True-Casing
-----------

When you look at the vocabulary file, you will probably still find
some duplicate words, once in upper-case form and once in
lower-case form: 

In [9]:
cat /tmp/europarl.tok.hist | grep -i ' *[0-9]*  *example$'

[01;31m[K      2 Example[m[K
[01;31m[K  31252 example[m[K


The more duplication we can remove the better, so after tokenization we
will use a true-casing tool to strip even more redundancy.  
We apply true-casing in a two-step procedure:

1.  train a true-casing model to get the "true" case of each vocabulary
    word using  
    `$ train-truecaser.perl --model truecase-model.en --corpus europarl-v9.tok.en`

2.  apply the model to the data to convert upper-cased words at the
    beginning of the sentence to their respective "true" case:  
    `$ truecase.perl --model truecase-model.en < europarl-v9.tok.en > europarl-v9.true.en`  
    (it may take a few minutes to complete)

In [10]:
train-truecaser.perl --model /tmp/truecase-model.en --corpus /tmp/europarl-v9.tok.en
truecase.perl        --model /tmp/truecase-model.en        < /tmp/europarl-v9.tok.en > /tmp/europarl-v9.true.en

Note that we need the tokenized text file to train the model (Why? What
would happen if we use the original file?).  

If you check the model file
contents, you will see it simply contains statistics about upper-case
and lower-case occurrences for each word.
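
The idea behind such a model can be sketched in a few lines of Python. This is a simplified illustration, not the Moses scripts: the function names `train_truecaser` and `truecase`, the toy corpus, and the choice to skip the sentence-initial token during training are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Count surface forms per lowercased word; keep the most frequent form."""
    forms = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the sentence-initial token: its capitalization is uninformative.
        for token in tokens[1:]:
            forms[token.lower()][token] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in forms.items()}

def truecase(sentence, model):
    """Rewrite only the first token to its most frequent form, if known."""
    tokens = sentence.split()
    if tokens:
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)

# Toy tokenized corpus (invented for illustration).
corpus = [
    "We listen to Peter .",
    "They listen to the word of Peter .",
    "Germany and Peter listen .",
]
model = train_truecaser(corpus)
print(truecase("Listen to Peter .", model))   # first word is lowercased
print(truecase("Peter did listen .", model))  # proper noun keeps its capital
```

Because the counts are collected per token, the sketch only works on tokenized input; otherwise `Peter` and `Peter.` would be counted as different words.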

In [11]:
head -n128 /tmp/truecase-model.en

drinking-water (10/11) Drinking-Water (1)
forceps (1/1)
Vals (1/1)
Stercks (3/3)
unwinding (6/6)
vergine (1/1)
legend (15/15)
magazine (130/138) Magazine (8)
ISD (36/36)
EVP (4/4)
tonnes (1954/1954)
gradings (1/1)
non-costly (1/1)
self-supporting (19/19)
weakest (521/521)
rudely (10/10)
TGVs (1/1)
command (365/377) Command (12)
mid-season (2/2)
waking (34/34)
V2 (1/1)
ìýñéá (1/1)
Romaphobia (4/4)
signing-on (1/1)
welfare-promoting (1/1)
impede (228/228)
drei (1/1)
Ganleys (2/2)
heat-and-power (1/1)
symbolized (7/7)
hydrology (1/1)
analgesic (6/6)
besmirch (6/6)
Childers (3/3)
turning (1723/1726) Turning (3)
avenger (1/1)
H-0843 (1/1)
Luang (1/1)
soapboxes (4/4)
implicitly (148/148)
blood-curdling (2/2)
Jalal-Abad (2/2)
climate-sceptics (1/1)
Sapir (18/18)
people-led (1/1)
BEF (12/12)
consumer-led (6/6)
opportunely (20/20)
German-only (1/1)
postenlargement (1/1)
Americanization (3/3)
peck (1/1)
-secondly (5/5)
dropped (467/467)
row (263/264) Row (1)
Brasov (3/3)
centimetre (10/10)
fish-

In [12]:
echo 'Listen Potato Word Peter Germany USA' |
    truecase.perl --model /tmp/truecase-model.en
echo
grep -E -i '^(Listen|Potato|Word|Peter|Germany|USA) ' /tmp/truecase-model.en

listen potato Word Peter Germany USA

[01;31m[KPeter [m[K(243/250) peter (7)
[01;31m[KUSA [m[K(3044/3044)
[01;31m[KGermany [m[K(6193/6193)
[01;31m[Kpotato [m[K(245/245)
[01;31m[Klisten [m[K(2362/2367) Listen (5)
[01;31m[Kword [m[K(4988/4996) Word (7) WORD (1)


With the true-cased text you can now try to extract the vocabulary again.

In [13]:
cat /tmp/europarl-v9.true.en | tr ' ' '\n' | sort | uniq -c > /tmp/europarl.true.hist

If we did everything correctly the vocabulary size should have further
decreased.  
What is the vocabulary size at the moment?

In [14]:
wc -l < /tmp/europarl.true.hist

134463


Try to look at the histogram file a little bit more.  
Notice that many words share the same root and differ in suffixes or prefixes. 

In [15]:
cat /tmp/europarl.true.hist | grep -i ' *[0-9]*  *listen'

[01;31m[K      1 Listen[m[K
[01;31m[K      2 Listen[m[King
[01;31m[K   2407 listen[m[K
[01;31m[K   1718 listen[m[Ked
[01;31m[K     10 listen[m[Ker
[01;31m[K     29 listen[m[Kers
[01;31m[K   1438 listen[m[King
[01;31m[K     92 listen[m[Ks


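Also, a large share of the items in the vocabulary appear only once in the data
(numbers, for example). Note that the `grep` patterns below are not anchored to the
start of the line, so they also match counts that merely end in 1 (such as 11 or 21);
the printed numbers are therefore upper bounds.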

In [16]:
cat /tmp/europarl.true.hist | grep ' *1 ' | wc -l
cat /tmp/europarl.true.hist | grep ' *1 [0-9][0-9]*$' | wc -l

58704
1334
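
A stricter count of the singletons can be sketched in Python. The histogram lines below are invented for illustration and only mimic the `uniq -c` layout; the point of the sketch is the `^` anchor, which ensures that the whole count is 1.

```python
import re

# A few lines in the layout written by `uniq -c` (counts and words invented).
hist_lines = [
    "      1 1334",
    "     11 listener",
    "      1 soapbox",
    "     21 2000",
    "   2407 listen",
]

# Anchoring at the start of the line (^) makes sure the whole count is 1,
# so counts such as 11 or 21 are not matched by accident.
singleton = re.compile(r"^ *1 ")
singleton_number = re.compile(r"^ *1 [0-9]+$")

n_singletons = sum(1 for line in hist_lines if singleton.search(line))
n_singleton_numbers = sum(1 for line in hist_lines if singleton_number.search(line))
# For comparison: the unanchored pattern also matches counts ending in 1.
n_unanchored = sum(1 for line in hist_lines if re.search(r" *1 ", line))
print(n_singletons, n_singleton_numbers, n_unanchored)
```

On this toy input the anchored pattern finds two singletons, one of which is a number, while the unanchored pattern matches four lines.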


What problems could this cause for algorithms that learn embeddings, or
for statistical/probabilistic models in general? (This is an open
question with several possible answers, but in general they come down
to the curse of dimensionality.)

Byte-Pair Encoding
------------------

Byte-Pair Encoding (BPE) is an algorithm that helps us automatically
split words into smaller components. Since BPE is also a statistical
algorithm, first we need to extract the statistics from our data:

In [17]:
subword-nmt learn-bpe -s 32000 < /tmp/europarl-v9.true.en > /tmp/code.en

In [18]:
wc -l /tmp/code.en

32001 /tmp/code.en


Similar to true-casing, after training the BPE codes we have to apply them
to the data:

In [19]:
subword-nmt apply-bpe -c /tmp/code.en < /tmp/europarl-v9.true.en > /tmp/europarl-v9.bpe.en

Now you can check the vocabulary size once more:

In [20]:
cat /tmp/europarl-v9.bpe.en | tr ' ' '\n' | sort | uniq -c > /tmp/europarl.bpe.hist

In [21]:
wc -l < /tmp/europarl.bpe.hist
echo
cat /tmp/europarl.bpe.hist | grep -i ' *[0-9]*  *listen'
echo
echo 'listeners andaverylongword' | subword-nmt apply-bpe -c /tmp/code.en

31520

[01;31m[K   2407 listen[m[K
[01;31m[K     39 listen[m[K@@
[01;31m[K   1719 listen[m[Ked
[01;31m[K   1438 listen[m[King
[01;31m[K     92 listen[m[Ks

listen@@ ers an@@ da@@ very@@ long@@ word
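
The statistics that BPE collects are counts of adjacent symbol pairs; the most frequent pair is merged into a new symbol, and the process repeats. The following Python sketch shows this learning loop on a toy word-frequency table. It is not the `subword-nmt` code: the function name `learn_bpe`, the `</w>` end-of-word marker, the tie-breaking rule, and the toy frequencies are assumptions made for illustration.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy version)."""
    # Represent every word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by comparing the symbols.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair by the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Toy word frequencies (invented for illustration).
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
print(sorted(vocab))
```

After three merges the frequent suffix `est` has become a single symbol, while the rarer words are still spelled out character by character. The number of merges plays the role of the `-s 32000` option above.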


Statistical Language Model
==========================

In the following we will use n-gram language modeling to look at domain,
an important consideration when training NMT models.  
All of the data are provided at `/opt/data/lmdom/`.  
To train an order-n language model use
```bash
lmplz -o {order} < {training_data} > {lm_name.arpa}
```

To get the perplexity of a dataset using an arpa file use
```bash
python3 perp.py {lm_name.arpa} {text_data}
```

-   First, train a bigram ($n=2$) LM on the English UDHR. Use this model
    on `dev` data. What is the perplexity?   

In [24]:
lmplz -o 2 < /opt/data/lmdom/english.udhr > /tmp/udhr.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /opt/data/lmdom/english.udhr
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 1778 types 627
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:7524 2:216371812761
Statistics:
1 627 D1=0.786408 D2=1.03486 D3+=1.95146
2 1341 D1=0.8418 D2=1.35127 D3+=1.19614
Memory estimate for binary LM:
type    kB
probing 39 assuming -p 1.5
probing 41 assuming -r models -p 1.5
trie    21 without quantization
trie    18 assuming -q 8 -b 8 quantization 
trie    21 assuming -a 22 array pointer compression
trie    18 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:7524 2:21456
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
##########

In [26]:
perp.py /tmp/udhr.arpa /opt/data/lmdom/dev

Loading the LM will be faster if you build a binary file.
Reading /tmp/udhr.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
613.303224701251


-   Now, train a bigram LM on the wikipedia data. Use this model to
    calculate the perplexity of the `dev` data. What is the perplexity?

In [32]:
lmplz -o 2 < /opt/data/lmdom/wiki.en.txt > /tmp/wiki.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /opt/data/lmdom/wiki.en.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 42681165 types 1114998
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:13379976 2:216358436864
Statistics:
1 1114998 D1=0.731948 D2=1.03082 D3+=1.2829
2 9551015 D1=0.754579 D2=1.0721 D3+=1.31276
Memory estimate for binary LM:
type     MB
probing 191 assuming -p 1.5
probing 195 assuming -r models -p 1.5
trie     84 without quantization
trie     58 assuming -q 8 -b 8 quantization 
trie     84 assuming -a 22 array pointer compression
trie     58 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:13379976 2:152816240
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---

In [27]:
perp.py /tmp/wiki.arpa /opt/data/lmdom/dev

Loading the LM will be faster if you build a binary file.
Reading /tmp/wiki.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
2455.2919557159653


-   Finally, train a bigram LM on the novel chapter. Use this model to
    calculate the perplexity of the `dev` data. What is the perplexity?    

In [29]:
lmplz -o 2 < /opt/data/lmdom/hpchapter1.txt > /tmp/hpchapter1.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /opt/data/lmdom/hpchapter1.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 5722 types 1251
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:15012 2:216371789824
Statistics:
1 1251 D1=0.637427 D2=1.27739 D3+=1.75624
2 4003 D1=0.808964 D2=1.09147 D3+=1.53722
Memory estimate for binary LM:
type     kB
probing 102 assuming -p 1.5
probing 107 assuming -r models -p 1.5
trie     49 without quantization
trie     39 assuming -q 8 -b 8 quantization 
trie     49 assuming -a 22 array pointer compression
trie     39 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:15012 2:64048
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95-

In [30]:
perp.py /tmp/hpchapter1.arpa /opt/data/lmdom/dev

Loading the LM will be faster if you build a binary file.
Reading /tmp/hpchapter1.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
516.76552634914


-   How was the perplexity different with each of the models? Look at
    the first few lines of each training dataset, and the `dev` data.
    Why might this be?

| Data set      | `dev` perplexity (bigram LM) |
|---------------|------------------------------|
| UDHR          | 613.3                        |
| Wikipedia     | 2455.3                       |
| Novel chapter | 516.8                        |

In [33]:
head /opt/data/lmdom/english.udhr

Universal Declaration of Human Rights
Preamble
Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,
Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,
Whereas it is essential, if man is not to be compelled to have recourse, as a last resort, to rebellion against tyranny and oppression, that human rights should be protected by the rule of law,
Whereas it is essential to promote the development of friendly relations between nations,
Whereas the peoples of the United Nations have in the Charter reaffirmed their faith in fundamental human rights, in the dignity and worth of the human person and in the equ

In [39]:
head -1 /opt/data/lmdom/wiki.en.txt

" Handog ng Pilipino sa Mundo " ( lit. " The Gift of the Filipinos to the World " ) is a 1986 song recorded in Filipino by a supergroup composed of 15 Filipino artists. The song became the anthem of the bloodless People Power Revolution. The lyrics of the song are inscribed on a wall of Our Lady of EDSA Shrine , the center of the revolution. Songwriter Jim Paredes wrote the song in three minutes , with no revisions , using the success of the 1986 EDSA People Power Revolution as his inspiration. After finishing the composition , he sent it to WEA Records , who at that time is compiling an album of patriotic songs. The song eventually became its carrier single. [ 1 ] [ 2 ] A music video was also made for the song. Paredes then invited artists who were involved with the EDSA Revolution. Kris Aquino , then a teenager , also appeared in the music video. National heroes since the Spanish period like Jose Rizal and Andres Bonifacio , prominent anti-Marcos figures and scenes from the revolutio

In [34]:
head /opt/data/lmdom/hpchapter1.txt

Mr. and Mrs. Dursley , of number four , Privet Drive , were proud to say that they were perfectly normal , thank you very much . They were the last people you &apos;d expect to be involved in anything strange or mysterious , because they just didn &apos;t hold with such nonsense .

Mr. Dursley was the director of a firm called Grunnings , which made drills . He was a big , beefy man with hardly any neck , although he did have a very large mustache . Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck , which came in very useful as she spent so much of her time craning over garden fences , spying on the neighbors . The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere .

The Dursleys had everything they wanted , but they also had a secret , and their greatest fear was that somebody would discover it . They didn &apos;t think they could bear it if anyone found out about the Potters . Mrs. Potter was Mrs. Dursley &apos;s

In [36]:
head /opt/data/lmdom/dev

A Long-Expected Party
When Mr. Bilbo Baggins of Bag End announced that he would shortly be
celebrating his eleventy-first birthday with a party of special
magnificence , there was much talk and excitement in Hobbiton .
Bilbo was very rich and very peculiar , and had been the
wonder of the Shire for sixty years , ever since his remarkable
disappearance and unexpected return . The riches he had brought back
from his travels had now become a local legend , and it was popularly
believed , whatever the old folk might say , that the Hill at Bag End
was full of tunnels stuffed with treasure . And if that was not enough


- **UDHR** = Universal Declaration of Human Rights
  - formal text
  - "[legalese](https://www.merriam-webster.com/dictionary/legalese)"
- **Wikipedia**: each line is a paragraph from a Wikipedia article
- the **dev** data turns out to be a novel chapter as well
  - same domain as **hpchapter1**
  - lowest perplexity with the model trained on **hpchapter1**

-   How do you think this would change if you used a larger order
    (e.g. 4) LM?

In [47]:
lmplz -o 4 < /opt/data/lmdom/english.udhr > /tmp/udhr_o4.arpa 2> /dev/null
perp.py /tmp/udhr_o4.arpa /opt/data/lmdom/dev 2> /dev/null

591.2512898622641


In [49]:
lmplz -o 4 < /opt/data/lmdom/wiki.en.txt > /tmp/wiki_o4.arpa 2> /dev/null
perp.py /tmp/wiki_o4.arpa /opt/data/lmdom/dev 2> /dev/null

1952.3559672598092


In [50]:
lmplz -o 4 < /opt/data/lmdom/hpchapter1.txt > /tmp/hp_o4.arpa 2> /dev/null
perp.py /tmp/hp_o4.arpa /opt/data/lmdom/dev 2> /dev/null

503.7123702342283


N-grams can be good models, but only if the test corpus looks like
the training corpus. In reality, it often does not. We need to come
up with adaptation and smoothing methods to account for this!

-   Perplexity is often used as intrinsic evaluation for language
    models. Let's think about what these values mean more closely.
    Suppose a 'sentence' consists of random digits. What is the
    perplexity of this sentence, if our model assigns probability
    $p=1/10$ to each digit?  
  - intuitive explanation of perplexity:
    - on average, for each word in the sequence, how many different vocabulary words does the model have to choose between?
  - a uniform distribution (e.g. $p_w = \frac{1}{|V|} \text{ for } w \in V$) is the case of a model that knows nothing about the data:
    - the model has to consider every single vocabulary word $w$ at every step (since all words are equally probable)
    - perplexity becomes $|V|$
    - for random digits with $p = 1/10$ we have $|V| = 10$, so the perplexity is $10$

-   Consider now a natural language sentence. What is the maximum
    perplexity of a sentence with 10 tokens? With 100 tokens?
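  - perplexity is normalized by the number of tokens and doesn't differentiate between sequences of natural language words and sequences of digits, so the above explanation holds for 10 tokens and for 100 tokens alike
  - under a uniform model the perplexity is the vocabulary size $|V|$; a model that systematically assigns the observed words less than $\frac{1}{|V|}$ scores even higher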
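
The answers above can be checked numerically. With per-token probabilities $p(w_1), \dots, p(w_N)$, perplexity is $2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 p(w_i)}$. The helper below is a minimal sketch written for this handout (the name `perplexity` is an assumption and this is not the code of `perp.py`).

```python
import math

def perplexity(token_probs):
    """Perplexity = 2 ** (average negative log2 probability per token)."""
    avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_neg_log2

# Uniform model over the ten digits, for a short and a long 'sentence'.
print(perplexity([0.1] * 10))
print(perplexity([0.1] * 100))
```

Both calls print a value of approximately 10 (up to floating-point rounding): the sentence length cancels out, and only the per-token probabilities matter.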

-   Let's return to our language models. Pick one of the three datasets,
    and train trigram and 4-gram LMs as well. Evaluate the perplexity of
    the `dev` data. How is it different between the bigram, trigram, and
    4-gram models? Why might this be?   

-   One major problem with models is generalization. If `dev` contains
    a bigram that never occurred in the training data, our model will
    assign a probability of 0 to the sentence and we can't compute the
    perplexity (can't divide by 0!). Not good.  
    In order to do something about this, people typically use smoothing
    methods. The simplest is called add-one or Laplace smoothing. This
    is as simple as it sounds: we add 1 to the count of every word
    type in the vocabulary, and add the vocabulary size $V$ (the number
    of unique words) to the total count $N$ in the denominator.  
    Now, there is a small probability allocated for unknown words:
    unseen n-grams have $\frac{1}{N+V}$ instead of 0!
    $$\mathcal{P}_{Laplace}(w_i) = \frac{count_i + 1}{N + V}$$

    A basic bigram language model has been coded for you in python,
    `bigram_lm.py`. Use this script to train a bigram LM on
    hpchapter1.txt. It will print the model entropy.   
    In this last exercise, modify this script to use add-one smoothing.
    Now train a bigram LM. How has the entropy changed?   
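
As a point of comparison for your own modification, here is a self-contained Python sketch of add-one smoothing for bigrams. It does not reproduce `bigram_lm.py`; the function name `bigram_probs` and the toy sentence are assumptions made for illustration. For bigrams, the usual analogue of the formula above replaces $N$ by the count of the history word, giving $\frac{count(h, w) + 1}{count(h) + V}$.

```python
from collections import Counter

def bigram_probs(tokens, add_one=False):
    """Return a function p(word | history) estimated from a token list."""
    vocab = set(tokens)
    history_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(history, word):
        numer = bigram_counts[(history, word)]
        denom = history_counts[history]
        if add_one:
            numer += 1           # every bigram gets one extra count
            denom += len(vocab)  # one extra count per possible next word
        return numer / denom if denom else 0.0

    return prob

# Toy training text (invented for illustration).
tokens = "the cat sat on the mat".split()
mle = bigram_probs(tokens)
smoothed = bigram_probs(tokens, add_one=True)
print(mle("the", "cat"), mle("the", "sat"))            # unseen bigram gets 0
print(smoothed("the", "cat"), smoothed("the", "sat"))  # unseen bigram gets > 0
```

With smoothing, the probabilities for a given history still sum to 1 over the vocabulary, but the seen bigrams give up part of their probability mass to the unseen ones.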

In [52]:
bigram_lm.py /opt/data/lmdom/hpchapter1.txt

The entropy of the bigram model for this file is: 3.052 bits.


In [53]:
# patch the file and write the patched file to /tmp/bigram_lm.py
patch -d /usr/local/bin -o /tmp/bigram_lm.py << EOF
--- /usr/local/bin/bigram_lm.py	2021-07-05 12:54:44.000000000 +0000
+++ /tmp/bigram_lm.py	2021-07-30 11:24:14.340705608 +0000
@@ -39,8 +39,8 @@
         bigramsplit = k.split("_")
         hist = bigramsplit[0]
         if hist==j:
-            numer = bigramfreqs[k]
-            denom = bigramhcs[j]
+            numer = bigramfreqs[k]+1
+            denom = bigramhcs[j]+bigramnum
             frac = numer/denom
             y = math.log(frac,2)
             z = numer*y
EOF

File bigram_lm.py is read-only; trying to patch anyway
patching file /tmp/bigram_lm.py (read from bigram_lm.py)


In [54]:
/tmp/bigram_lm.py /opt/data/lmdom/hpchapter1.txt

The entropy of the bigram model for this file is: 18.98 bits.


- a lot of probability mass is now assigned to all of the unseen bigrams
- thus the seen bigrams have lower probability
- proper English sentences become much less probable