# Language models

#### Language models estimate the probability of word sequences, and in particular the probability of the next word given the words before it. This can be used to improve the accuracy of predictive text, among other applications. A sequence of *n* consecutive words is known as an n-gram, and n-gram language models can be built for different values of *n*. This lab demonstrates some basic capabilities of language models using two language modelling toolkits: KenLM (2011) and SRILM (2002). The focus is on estimating probabilities and on evaluating perplexity, a measure of how well a model predicts held-out text (lower values are better).
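
##### To make the idea of an n-gram concrete, here is a minimal Python sketch that lists the n-grams of a toy sentence. The function name `ngrams` and the example sentence are illustrative assumptions and are not part of either toolkit.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# A toy sentence, split on whitespace (an assumption made for illustration).
sentence = "the cat sat on the mat".split()

bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

##### A sentence of six tokens yields five bigrams and four trigrams; asking for n-grams longer than the sentence yields an empty list.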

## 1. Mikolov's data

#### This first step of the lab is purely setup for the next two steps — all that must be done here is ensuring that the necessary data is downloaded and in the correct location.

##### Download the dataset. This data has been preprocessed from the Penn Treebank (PTB) by Mikolov (2012), with out-of-vocabulary (OOV) words being given a special label \<unk\>.

##### Mapping rare words to a single label keeps the vocabulary fixed and prevents evaluations such as perplexity from being dominated by words that might appear only once in the entire dataset.
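
##### The following sketch shows one possible way of replacing rare words with an \<unk\> label. The frequency threshold `min_count` and the function name are assumptions made for illustration; this is not necessarily how the PTB data was actually prepared.

```python
from collections import Counter


def replace_rare_words(sentences, min_count=2, unk="<unk>"):
    """Replace every word seen fewer than min_count times with the unk label."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if counts[word] >= min_count else unk for word in sentence]
        for sentence in sentences
    ]


# Toy corpus: "cat" and "dog" each occur only once.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
cleaned = replace_rare_words(corpus)
print(cleaned)
```

##### After the replacement, both toy sentences contain only words that the model has seen more than once, plus the \<unk\> label.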

In [1]:
!wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

--2021-10-02 15:52:45--  http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
Resolving www.fit.vutbr.cz (www.fit.vutbr.cz)... 147.229.9.23
Connecting to www.fit.vutbr.cz (www.fit.vutbr.cz)|147.229.9.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34869662 (33M) [application/x-gtar]
Saving to: 'simple-examples.tgz'

2021-10-02 15:52:55 (4.20 MB/s) - 'simple-examples.tgz' saved [34869662/34869662]



##### Extract the compressed archive:

In [2]:
!tar xfz simple-examples.tgz

##### List all contents of the active directory:

In [3]:
!ls -l

total 84168
drwxr-xr-x   3 marcellmaitinsky  staff        96  2 oct 00:42 data
-rw-r--r--   1 marcellmaitinsky  staff    142573  1 oct 00:22 en_ewt-ud-dev-alt.txt
-rw-r--r--   1 marcellmaitinsky  staff    126870  2 oct 00:41 en_ewt-ud-dev.txt
-rw-r--r--   1 marcellmaitinsky  staff    126137  2 oct 09:48 en_ewt-ud-dev.txt.sbd_splitta
drwxr-xr-x   3 marcellmaitinsky  staff        96 22 sep 18:55 lab1
-rw-r--r--   1 marcellmaitinsky  staff     23802 22 sep 18:55 lab1.ipynb
-rw-r--r--   1 marcellmaitinsky  staff     21477  2 oct 09:50 lab2.ipynb
drwxr-xr-x   2 marcellmaitinsky  staff        64  2 oct 15:52 lab3
-rw-r--r--   1 marcellmaitinsky  staff     27896  2 oct 15:52 lab3.ipynb
drwxr-xr-x   3 marcellmaitinsky  staff        96 29 sep 18:01 lab4
drwxr-xr-x   8 marcellmaitinsky  staff       256  1 oct 23:26 scripts
-rw-r--r--   1 marcellmaitinsky  staff      4927 29 déc  2007 scripts.tgz
drwx------  15 marcellmaitinsky  staff       4

##### Change directory to enter the folder containing the dataset:

In [4]:
%cd simple-examples/

/Users/marcellmaitinsky/LING242/computational-tools-for-linguistic-analysis-ubc/labs/simple-examples


##### List the contents of the active directory recursively: the -R flag makes ls also list the contents of every subdirectory (whereas -l, used earlier, prints a detailed listing of a single directory).

In [12]:
!ls -R

1-train                    8-direct
2-nbest-rescore            9-char-based-lm
3-combination              data
4-data-generation          models
5-one-iter                 rnnlm-0.2b
6-recovery-during-training temp
7-dynamic-evaluation

./1-train:
README   test.sh  train.sh

./2-nbest-rescore:
README      getbest.c   gettext.c   makenbest
getbest     gettext     lattices    makenbest.c

./2-nbest-rescore/lattices:
AMI-3E0501_u3005_127040_127488.lat.gz AMI-3E0501_u3005_128490_129032.lat.gz
AMI-3E0501_u3005_127513_127835.lat.gz latlist
AMI-3E0501_u3005_127865_128175.lat.gz nbest
AMI-3E0501_u3005_128188_128447.lat.gz nbest.sh

./2-nbest-rescore/lattices/nbest:

./3-combination:
README   test.sh  train.sh

./4-data-generation:
REA

##### Change to parent directory:

In [5]:
%cd ..

/Users/marcellmaitinsky/LING242/computational-tools-for-linguistic-analysis-ubc/labs


## 2. KenLM

#### KenLM is a toolkit for estimating, filtering, and querying n-gram language models. This section of the lab gives an introduction to KenLM's estimation and querying capabilities in particular.

##### Download kenlm and then unpack it, in a single command.

In [6]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz  | tar xz

--2021-10-02 15:54:24--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491090 (480K) [application/x-gzip]
Saving to: 'STDOUT'

2021-10-02 15:54:25 (1.15 MB/s) - written to stdout [491090/491090]



##### Create a folder within your existing kenlm folder. This is where you will compile KenLM.

In [7]:
!mkdir kenlm/build

##### Change directory to make sure you are in your newly-created "build" folder.

In [2]:
%cd kenlm/build

/Users/marcellmaitinsky/LING242/computational-tools-for-linguistic-analysis-ubc/labs/kenlm/build


##### Configure the KenLM build with cmake, which generates the Makefiles used in the next step:

In [3]:
!cmake ..

-- Could NOT find Eigen3 (missing: Eigen3_DIR)
-- Found Boost: /usr/local/lib/cmake/Boost-1.76.0/BoostConfig.cmake (found suitable version "1.76.0", minimum required is "1.41.0") found components: program_options system thread unit_test_framework 
-- Found ZLIB: /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk/usr/lib/libz.tbd (found version "1.2.11") 
-- Found BZip2: /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk/usr/lib/libbz2.tbd (found version "1.0.6") 
-- Looking for BZ2_bzCompressInit
-- Looking for BZ2_bzCompressInit - found
-- Looking for lzma_auto_decoder in /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk/usr/lib/liblzma.tbd
-- Looking for lzma_auto_decoder in /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk/usr/lib/liblzma.tbd - found
-- Looking for lzma_easy_encoder in /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk/usr/lib/liblzma.tbd
-- Looking for lzma_easy_encoder in /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk/usr/lib/li

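##### Notice that Eigen3 could not be found. Eigen3 is only needed for KenLM's optional interpolation tool, so the configuration can still go ahead without it. Note also that cmake does not compile anything itself: it is a cross-platform build-system generator that writes Makefiles suited to the current machine, which is why it is usable on multiple operating systems. The actual compilation is done by running make on those generated Makefiles: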

In [4]:
!make

[  1%] Building CXX object util/CMakeFiles/kenlm_util.dir/double-conversion/bignum-dtoa.cc.o
[  2%] Building CXX object util/CMakeFiles/kenlm_util.dir/double-conversion/bignum.cc.o
[  3%] Building CXX object util/CMakeFiles/kenlm_util.dir/double-conversion/cached-powers.cc.o
[  5%] Building CXX object util/CMakeFiles/kenlm_util.dir/double-conversion/diy-fp.cc.o
[  6%] Building CXX object util/CMakeFiles/kenlm_util.dir/double-conversion/double-conversion.cc.o
[  7%] Building CXX object util/CMakeFiles/kenlm_util.dir/double-conversion/fast-dtoa.cc.o
[  8%] Building CXX object util/CMakeFiles/kenlm_util.dir/double-conversion/fixed-dtoa.cc.o
[ 10%] Building CXX object util/CMakeFiles/kenlm_util.dir/double-conversion/strtod.cc.o
[ 11%] Building CXX object util/CMakeFiles/kenlm_util.dir/stream/chain.cc.o
[ 12%] Building CXX object util/CMakeFiles/kenlm_util.dir/stream/count_records.cc.o
[ 13%] Buil

##### Write the first 10 lines of ptb.train.txt to a text file (head prints 10 lines by default when no number is given). Use sed to replace every instance of the "\<unk\>" label for OOV words with a plain "UNK" in the new file, because lmplz treats \<unk\> as a reserved token and rejects training text that contains it.

In [7]:
!head ~/LING242/computational-tools-for-linguistic-analysis-ubc/labs/simple-examples/data/ptb.train.txt | sed 's/<unk>/UNK/g' > text

##### Estimate n-gram probabilities using lmplz, where "-o 5" sets the order of the model, that is, the maximum value of n. Here probabilities are requested for unigrams, 2-grams, 3-grams, 4-grams, and 5-grams.

In [8]:
!bin/lmplz -o 5 <text >text.arpa

=== 1/5 Counting and sorting n-grams ===
Reading stdin
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 213 types 143
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1716 2:1340867584 3:2514126848 4:4022602752 5:5866296320
/Users/marcellmaitinsky/LING242/computational-tools-for-linguistic-analysis-ubc/labs/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::(anonymous namespace)::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig &) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 2-grams with adjusted count 4 because we didn't observe any 2-grams with adjusted count 3; Is this small or artificial data?
Try deduplicating the input.  To override this error for e.g. a class-based model, rerun with --discount_fallback
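
##### The error above comes from the discount calculation, which relies on how many distinct n-grams occur exactly once, twice, three times, and four times. The sketch below illustrates this with the closed-form discount estimates of modified Kneser-Ney smoothing (Chen and Goodman). It is a simplification under stated assumptions: it works on raw counts of a toy table, whereas KenLM applies the calculation to adjusted counts for lower orders, and the function names are hypothetical.

```python
from collections import Counter


def counts_of_counts(ngram_counts, max_count=4):
    """Return [n_1, ..., n_max_count]: how many distinct n-grams occur exactly k times."""
    freq = Counter(ngram_counts.values())
    return [freq[k] for k in range(1, max_count + 1)]


def modified_kn_discounts(n):
    """Estimate the discounts D1, D2, D3+ from the counts of counts n_1..n_4."""
    n1, n2, n3, n4 = n
    # Each discount divides by n_1, n_2 or n_3, so none of them may be zero.
    for k, n_k in enumerate((n1, n2, n3), start=1):
        if n_k == 0:
            raise ValueError(
                f"no n-grams with count {k}; cannot compute the discount for count {k + 1}"
            )
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3_plus = 3 - 4 * y * n4 / n3
    return d1, d2, d3_plus


# Toy bigram counts (made up for illustration).
toy_counts = {
    ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("d", "e"): 1,
    ("e", "f"): 2, ("f", "g"): 2,
    ("g", "h"): 3,
}
n = counts_of_counts(toy_counts)
discounts = modified_kn_discounts(n)
print(n, discounts)
```

##### If no n-gram occurs exactly three times, the last discount would require a division by zero. That is the situation the error message describes for 2-grams with adjusted count 3.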



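##### Note the question "Is this small or artificial data?" in the error message. In this case the data is indeed small: the Kneser-Ney discounts are computed from how many n-grams occur exactly once, twice, three times, and so on, and ten lines of text leave some of those counts at zero. This is why aiming for 5-grams does not work. Reducing the order to 2 is more feasible for this dataset: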

In [9]:
!bin/lmplz -o 2 <text >text.arpa

=== 1/5 Counting and sorting n-grams ===
Reading stdin
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 213 types 143
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1716 2:13743893504
Statistics:
1 143 D1=0.771812 D2=1.7276 D3+=1.45638
2 208 D1=0.881818 D2=1.7965 D3+=3
Memory estimate for binary LM:
type       B
probing 7472 assuming -p 1.5
probing 8048 assuming -r models -p 1.5
trie    4499 without quantization
trie    4930 assuming -q 8 -b 8 quantization 
trie    4499 assuming -a 22 array pointer compression
trie    4930 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:1716 2:3328
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
#############################

##### The command used above has taken data from the file "text", calculated probabilities, and written a language model to "text.arpa". It is not shown here, but ARPA files store base-10 log probabilities rather than raw probabilities, which avoids excessively long decimals for n-grams with low probabilities. Because every probability is at most 1, its logarithm is zero or negative. Such a value is not a negative probability; it is simply the exponent to which 10 must be raised to recover the actual probability.
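
##### As a small illustration of the base-10 log probabilities described above, the sketch below converts between the two representations. The helper names are assumptions; the value -1.9787562 is the log probability that the query step later in this section reports for the word "in".

```python
import math


def to_probability(log10_prob):
    """Convert a base-10 log probability back into a probability."""
    return 10 ** log10_prob


def to_log10(prob):
    """Convert a probability into a base-10 log probability."""
    return math.log10(prob)


p_in = to_probability(-1.9787562)
print(p_in)  # roughly 0.0105, i.e. about 1%

# Adding log probabilities corresponds to multiplying probabilities.
joint = to_probability(-1.0 + -2.0)
print(joint, 0.1 * 0.01)
```

##### Working in log space also means that the probability of a whole sentence is obtained by summing, rather than multiplying, the values for its words.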

##### Convert the language model to binary to make querying more efficient:

In [10]:
!bin/build_binary text.arpa text.binary

Reading text.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


##### Write the end of the ptb.train.txt file to a new file called "data", once again replacing \<unk\> with "UNK" so that the query text matches the text the model was trained on. Unlike the head command from earlier, this tail command does specify a number of lines: "-n 1" keeps only the very last line, which keeps the query fast. Ideally far more text would be used for evaluation.

In [16]:
!tail -n 1 /Users/marcellmaitinsky/LING242/computational-tools-for-linguistic-analysis-ubc/labs/simple-examples/data/ptb.train.txt   | sed 's/<unk>/UNK/g' > data

##### Complete the querying for the binary version of the language model, using the data file from the previous step to help calculate perplexity:

In [17]:
!bin/query text.binary <data

This binary file contains probing hash tables.
in=110 1 -1.9787562	los=0 1 -2.4069233	angeles=0 1 -2.3523023	for=0 1 -2.3523023	example=0 1 -2.3523023	central=0 1 -2.3523023	has=71 1 -2.2564688	had=0 1 -2.4069233	a=37 1 -1.5073059	strong=0 1 -2.4049046	market=0 1 -2.3523023	position=0 1 -2.3523023	while=0 1 -2.3523023	unilab=0 1 -2.3523023	's=121 1 -2.2564688	presence=0 1 -2.4069233	has=71 1 -2.2564688	been=0 1 -2.4069233	less=0 1 -2.3523023	prominent=0 1 -2.3523023	according=0 1 -2.3523023	to=66 1 -1.6679683	mr.=41 1 -2.3078644	UNK=28 2 -0.76690423	</s>=2 1 -1.4424298	Total: -54.348557 OOV: 16
Perplexity including OOVs:	149.25959564199744
Perplexity excluding OOVs:	67.10224231949275
OOVs:	16
Tokens:	25
RSSMax:1220608 kB	user:0.005163	sys:0.009904	CPU:0.015094	real:0.000893


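##### Each entry in the output shows a word, then '=' followed by that word's vocabulary ID in the model (0 marks an OOV word), then the length of the longest n-gram the model matched ending at that word, and finally the log probability. Almost every match here is a unigram; only "UNK" following "mr." was matched as a 2-gram. As explained earlier, the log probabilities are negative because they are the exponent of 10 needed to recover the actual probability. The output also includes the perplexity, the number of OOVs, and the total number of tokens. The perplexity is lower when OOVs are excluded because the low-probability unknown words are left out of the average, not because the model itself has become more accurate.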
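
##### The two perplexity figures can be reproduced from the per-word entries in the query output. The sketch below copies those entries, parses them, and recomputes both values as 10 raised to the negative average log probability. The parsing helper is an assumption written for this lab's output and is not part of KenLM.

```python
# Per-word entries copied from the query output: word=vocab_id ngram_length log10_prob
QUERY_OUTPUT = """
in=110 1 -1.9787562      los=0 1 -2.4069233       angeles=0 1 -2.3523023
for=0 1 -2.3523023       example=0 1 -2.3523023   central=0 1 -2.3523023
has=71 1 -2.2564688      had=0 1 -2.4069233       a=37 1 -1.5073059
strong=0 1 -2.4049046    market=0 1 -2.3523023    position=0 1 -2.3523023
while=0 1 -2.3523023     unilab=0 1 -2.3523023    's=121 1 -2.2564688
presence=0 1 -2.4069233  has=71 1 -2.2564688      been=0 1 -2.4069233
less=0 1 -2.3523023      prominent=0 1 -2.3523023 according=0 1 -2.3523023
to=66 1 -1.6679683       mr.=41 1 -2.3078644      UNK=28 2 -0.76690423
</s>=2 1 -1.4424298
"""


def parse_query_output(text):
    """Turn whitespace-separated 'word=id length logprob' triples into tuples."""
    fields = text.split()
    entries = []
    for i in range(0, len(fields), 3):
        word, _, vocab_id = fields[i].rpartition("=")
        entries.append((word, int(vocab_id), int(fields[i + 1]), float(fields[i + 2])))
    return entries


def perplexity(log10_probs):
    """Perplexity is 10 to the power of the negative mean log10 probability."""
    return 10 ** (-sum(log10_probs) / len(log10_probs))


entries = parse_query_output(QUERY_OUTPUT)
all_logprobs = [logprob for _, _, _, logprob in entries]
known_logprobs = [logprob for _, vocab_id, _, logprob in entries if vocab_id != 0]

num_oovs = len(all_logprobs) - len(known_logprobs)
ppl_including_oovs = perplexity(all_logprobs)
ppl_excluding_oovs = perplexity(known_logprobs)
print(num_oovs, ppl_including_oovs, ppl_excluding_oovs)
```

##### The recomputed values agree with the reported perplexities to within rounding, which confirms that the figure excluding OOVs averages over the 9 known tokens only.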

## 3. SRILM

##### SRILM is another language modelling toolkit, somewhat older than KenLM. The two toolkits have similar applications, such as estimating probabilities and querying.

**TODO** complete commands for _srilm_ in jupyter except for downloading. It should include
* compile, 
* estimate probabilities, and
* query 

Use `ptb.train.txt` for training (estimating) and `ptb.test.txt` for querying to calculate perplexities. 
When you estimate probabilities, we use the standard LM option. See [[link1](https://www.statmt.org/wmt09/baseline.html)] and [[link2](https://kheafield.com/code/kenlm/estimation/)]. For querying, see [[link](http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)]

##### Change into the correct directory for srilm:

In [31]:
%cd /Users/marcellmaitinsky/LING242/computational-tools-for-linguistic-analysis-ubc/labs/srilm

/Users/marcellmaitinsky/LING242/computational-tools-for-linguistic-analysis-ubc/labs/srilm


##### Compile srilm:

In [32]:
!make

mkdir -p include lib bin
/Library/Developer/CommandLineTools/usr/bin/make init
for subdir in misc dstruct lm flm lattice utils zlib; do \
		(cd $subdir/src; /Library/Developer/CommandLineTools/usr/bin/make SRILM=/Users/marcellmaitinsky/LING242/computational-tools-for-linguistic-analysis-ubc/labs/srilm MACHINE_TYPE=macosx OPTION= MAKE_PIC= init) || exit 1; \
	done
cd ..; /Users/marcellmaitinsky/LING242/computational-tools-for-linguistic-analysis-ubc/labs/srilm/sbin/make-standard-directories
/Library/Developer/CommandLineTools/usr/bin/make ../obj/macosx/STAMP ../bin/macosx/STAMP
make[3]: `../obj/macosx/STAMP' is up to date.
make[3]: `../bin/macosx/STAMP' is up to date.
cd ..; /Users/marcellmaitinsky/LING242/computational-tools-for-linguistic-analysis-ubc/labs/srilm/sbin/make-standard-directories
/Library/Developer/CommandLineTools/usr/bin/make ../obj/macosx/STAMP ../bin/macosx/STAMP
make[3]: `../obj/macosx/STAMP' is up to date.
make[3]: `../bin/macosx/STAMP' is up to date.
cd ..; /Users/

##### Check that the ngram-count binary has been built:

In [33]:
!ls -l bin/macosx/ngram-count

-r-xr-xr-x  1 marcellmaitinsky  staff  1699400 22 sep 17:39 bin/macosx/ngram-count


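##### Estimate n-gram probabilities. First name the text file you are using as training data (-text), then the file that you are writing the language model to (-lm). Since no -order option is given, ngram-count uses its default order of 3 and builds a trigram model.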

In [92]:
!./bin/macosx/ngram-count -text ./simple-examples/data/ptb.train.txt -lm text.lm



##### The warning that ngram-count prints at this step (not reproduced above) can be ignored for the purposes of this assignment, and does not affect the final step. All it tells us is that the dataset we are working with is very small. This step has no other required output, so if nothing appears apart from this warning, the estimation has been successful.

##### Query (evaluate) your text.lm language model. This time the language model is named first with -lm, followed by -ppl and the test data on which the model's perplexity is computed ("ppl" is short for perplexity).

In [93]:
!./bin/macosx/ngram -lm text.lm -ppl ./simple-examples/data/ptb.test.txt

file ./simple-examples/data/ptb.test.txt: 3761 sentences, 78669 words, 4794 OOVs
0 zeroprobs, logprob= -176327.4 ppl= 186.727 ppl1= 243.6885


##### The output above shows the results of querying: the number of sentences, words, and OOVs, followed by the log probability (see the KenLM section for an explanation) and two perplexities. "ppl" is normalised over all scored words plus one end-of-sentence token per sentence, while "ppl1" leaves the end-of-sentence tokens out of the normalisation, which is why it is the higher of the two.
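
##### Both perplexity values in the SRILM output can be recomputed from the other numbers it reports. The sketch below follows the definitions documented for SRILM, in which "ppl" counts one end-of-sentence token per sentence in the denominator and "ppl1" does not. The function name is an assumption made for illustration.

```python
def srilm_perplexities(logprob, sentences, words, oovs, zeroprobs=0):
    """Recompute SRILM's ppl and ppl1 from the counts in its -ppl report."""
    # OOVs and zero-probability words are not scored, so they are not counted.
    scored_words = words - oovs - zeroprobs
    # ppl also counts one end-of-sentence token for each sentence.
    ppl = 10 ** (-logprob / (scored_words + sentences))
    # ppl1 is normalised over the scored words only.
    ppl1 = 10 ** (-logprob / scored_words)
    return ppl, ppl1


# Values copied from the output above.
ppl, ppl1 = srilm_perplexities(
    logprob=-176327.4, sentences=3761, words=78669, oovs=4794
)
print(ppl, ppl1)
```

##### The results match the reported 186.727 and 243.6885 up to the rounding of the printed log probability.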

##### Because the KenLM and SRILM sections of this lab worked with different data (the KenLM model was trained on 10 lines and queried on a single line, while the SRILM model was trained on the full training set and queried on the full test set), this demonstration cannot show which of the two is better for which purpose. Measures like perplexity could in principle be compared if both toolkits were given exactly the same training and test data.