# Lab block 1: Text level

Jordi Armengol Estapé, Joan Llop Palao

## Imports

In [1]:
import nltk
import re
from math import inf, log
from nltk.collocations import TrigramCollocationFinder
from nltk.collocations import BigramCollocationFinder
import time

## Preprocessing

We define a function for preprocessing the corpus. For each line, it extracts the sentence, removes digits, converts the text to lowercase, and collapses consecutive whitespace into a single space.

In [2]:
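def preprocessing(text_lines, l=None):
    # `l` (language code) is accepted for call-site compatibility but unused.
    preprocessed = []
    for line in text_lines:
        # Each line has the form "<number>\t<sentence>".
        num, sentence = line.split('\t')
        # Remove digits (raw string avoids an invalid-escape warning) and lowercase.
        new_line = re.sub(r"\d", "", sentence).lower()
        # Collapse consecutive whitespace (this also strips the trailing newline).
        new_line = ' '.join(new_line.split())
        preprocessed.append(new_line)
    return preprocessed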
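To make the effect of these steps concrete, here is a minimal standalone sketch that applies the same operations to a single line. The sample line is hypothetical (it is not taken from the corpus); only the "number, tab, sentence" layout is assumed from the function above.

```python
import re

# Hypothetical input line in the "<number>\t<sentence>" layout.
line = "42\tThe  3 cats   sat on 2 mats.\n"

_, sentence = line.split('\t')
no_digits = re.sub(r"\d", "", sentence).lower()  # drop digits, lowercase
clean = ' '.join(no_digits.split())              # collapse whitespace runs

print(repr(clean))  # 'the cats sat on mats.'
```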

We will apply the aforementioned function to the whole corpus, including both train and test sets for all languages.

Notice that for the train set we concatenate all the sentences (separated by a double space, denoting end of sentence), while for the test set the sentences remain separate, because evaluation proceeds sentence by sentence. However, we also pad each test sentence with a double space on both sides, so that sentence boundaries look the same as in the training text.

In [4]:
langs = ['eng', 'deu', 'fra', 'ita', 'nld', 'spa']
dataset = {}
for l in langs:
    dataset[l] = {}
    with open('langId/' + l + '_trn.txt', 'r') as file:
        dataset[l]['trn'] = file.readlines()
    with open('langId/' + l + '_tst.txt', 'r') as file:
        dataset[l]['tst'] = file.readlines()
    dataset[l]['trn'] = '  ' + '  '.join(preprocessing(dataset[l]['trn'],l)) + '  '
    dataset[l]['tst'] = ['  ' + sentence + '  ' for sentence in preprocessing(dataset[l]['tst'],l)]
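Since each corpus is passed to the collocation finder as a plain string, it is iterated character by character, so the "words" are characters. The following sketch shows, in plain Python, which character trigrams the double-space padding produces at a sentence boundary. The helper `char_trigrams` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
def char_trigrams(text):
    # Slide a window of three characters over the string.
    return list(zip(text, text[1:], text[2:]))

# Hypothetical sentence, padded the same way as the test sentences.
padded = '  ' + 'hi there' + '  '
trigrams = char_trigrams(padded)

print(trigrams[:3])
# [(' ', ' ', 'h'), (' ', 'h', 'i'), ('h', 'i', ' ')]
```

The first trigram, two spaces followed by a letter, captures which characters tend to start a sentence; the last one does the same for sentence endings.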

## Language models and training

We will define a generic trigram LanguageModel class, which will be initialized with the cleaned corpus of its corresponding language.

### Initialization

We will extract all the trigrams (with their respective frequencies) and filter out the ones with fewer than 5 occurrences (as required by the assignment). To approximate the probabilities, as described in Jurafsky’s book, we will need bigram counts as well.

### Infer

As described in Jurafsky’s book, instead of computing exact probabilities over whole sequences, we approximate the probability of a trigram by dividing its frequency (ci) by the frequency of its leading bigram (N). We apply Laplace smoothing with the trigram vocabulary size (V) and lambda = 1.
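Written out with the names used in the code, the smoothed estimate reads as follows. This is only a restatement of the paragraph above, where $c_i$ is the count of the trigram $t_1 t_2 t_3$, $N$ is the count of the bigram $t_1 t_2$, and $V$ is the number of distinct trigrams kept after the frequency filter:

$$
P(t_3 \mid t_1 t_2) \approx \frac{c_i + \lambda}{N + \lambda V}, \qquad \lambda = 1
$$

For an unseen trigram whose leading bigram is also unseen, $c_i = N = 0$ and the estimate reduces to $1 / V$.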

### Eval

Given a sentence, the function eval(sentence) computes the log probability of the sentence by summing the logarithms of the probabilities of its trigrams (using the infer(trigram) function). We use log probabilities because multiplying the raw probabilities often produced 0 due to floating-point underflow. We could have used perplexity as well.

In [5]:
class LanguageModel:
    def __init__(self, clean_corpus, lang):
        self.trigrams = TrigramCollocationFinder.from_words(clean_corpus)
        self.trigrams.apply_freq_filter(5)
        self.bigrams = self.trigrams.bigram_finder()
        self.lambd = 1
        self.V = len(self.trigrams.ngram_fd)
        self.lang = lang
            
    def infer(self, trigram):
        # N: bigram_freq, V: train trigrams vocabulary size, ci: trigram frequency
        ci = self.trigrams.ngram_fd[trigram]
        (t1, t2, t3) = trigram
        N = self.bigrams.ngram_fd[(t1, t2)]
        return (ci + self.lambd) / (N + self.lambd * self.V)
        
        
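    def eval(self, sentence):
        # Log probability, not perplexity.
        p = 0  # log(1) = 0: neutral starting value for a sum of logs
        trigrams = TrigramCollocationFinder.from_words(sentence)
        # Note: iterating ngram_fd visits each distinct trigram once,
        # so repeated trigrams are not weighted by their count.
        for trigram in trigrams.ngram_fd:
            p += log(self.infer(trigram))
        return p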
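The reason for summing logarithms instead of multiplying probabilities can be reproduced in a few lines. The sketch below uses made-up probabilities (100 trigrams with probability 1e-4 each), chosen only to trigger the effect:

```python
from math import log

# Hypothetical per-trigram probabilities for one sentence.
probs = [1e-4] * 100

product = 1.0
for p in probs:
    product *= p          # true value 1e-400 is below the float64 range

log_sum = sum(log(p) for p in probs)

print(product)   # 0.0 (underflow)
print(log_sum)   # roughly -921.03
```

Once the product has underflowed to 0.0 for every model, the models can no longer be ranked, whereas the sums of logs remain comparable.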

For each language, we instantiate the corresponding LanguageModel object. Notice that instead of storing a table with the computed probabilities, we compute them on demand from the stored counts (because we believe this is more efficient). Training is therefore fast, since it only extracts the trigram and bigram counts.

In [6]:
lt0 = time.time()
print('Training English ...')
engModel = LanguageModel(dataset['eng']['trn'], 'eng')
print('Training Deutsch ...')
deuModel = LanguageModel(dataset['deu']['trn'], 'deu')
print('Training Français ...')
fraModel = LanguageModel(dataset['fra']['trn'], 'fra')
print('Training Italiano ...')
itaModel = LanguageModel(dataset['ita']['trn'], 'ita')
print('Training Nederlands ...')
nldModel = LanguageModel(dataset['nld']['trn'], 'nld')
print('Training Español ...')
spaModel = LanguageModel(dataset['spa']['trn'], 'spa')
lt1 = time.time()
print('Training time:', lt1 - lt0)

Training English ...
Training Deutsch ...
Training Français ...
Training Italiano ...
Training Nederlands ...
Training Español ...
Training time: 75.50163626670837


## Metamodel

The general idea of our classifier is the following: once we have trained a language model for each language in the corpus, we can define a meta-model that evaluates every language model on the given sentence and returns the language whose model assigns the highest probability (i.e. the one that would have the lowest perplexity). Recall that we compute log probabilities.

Once the language models are built, it would be easy to define alternative meta-models, as in ensemble learning. For instance, instead of just taking the language with the lowest perplexity, we could train a classifier that outputs the identified language given the perplexities of all the language models. That could be useful if there were many "draws": the meta-model could learn that in case of a draw between, say, English and Spanish, Spanish should have priority. However, we believe that this is not necessary.

In [7]:
def metamodel(models, sentence):
    # Return the language whose model assigns the highest log probability.
    best_p = -inf
    res = None
    for model in models:
        p = model.eval(sentence)
        if p > best_p:
            best_p = p
            res = model.lang
    return res

models = [engModel, deuModel, fraModel, itaModel, nldModel, spaModel]
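The selection rule can be illustrated in isolation with stand-in models that return fixed scores. `StubModel`, `pick_language` and the scores below are hypothetical and exist only for this sketch; the real models compute their scores from trigram counts.

```python
class StubModel:
    """Stand-in for a language model that returns a fixed log probability."""
    def __init__(self, lang, score):
        self.lang = lang
        self.score = score

    def eval(self, sentence):
        return self.score

def pick_language(models, sentence):
    # Log probabilities are at most 0; the one closest to 0 is the most likely.
    return max(models, key=lambda m: m.eval(sentence)).lang

stubs = [StubModel('eng', -120.5), StubModel('spa', -87.2)]
print(pick_language(stubs, 'hola'))  # spa
```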

## Test

We run the metamodel on all the sentences in the test set and build the confusion matrix. Inference is relatively fast, but slower than training, because we compute the probabilities dynamically instead of storing a large table, which would not be very memory-efficient. Recall that the test sentences, which have not been used for training (in order to have a fair evaluation), are also labeled, so we can evaluate our model.

In [8]:
tt0 = time.time()
confusion_matrix = {}
for l in langs:
    print('Evaluating on ...', l)
    confusion_matrix[l] = {}
    for sentence in dataset[l]['tst']:
        res = metamodel(models, sentence)
        if res in confusion_matrix[l]:
            confusion_matrix[l][res] += 1
        else:
            confusion_matrix[l][res] = 1
tt1 = time.time()

print('Inference time:', tt1-tt0)
        

Evaluating on ... eng
Evaluating on ... deu
Evaluating on ... fra
Evaluating on ... ita
Evaluating on ... nld
Evaluating on ... spa
Inference time: 192.2320203781128


Finally, we can extract the accuracy and print a prettified version of our confusion matrix, in which rows are the ground truth (i.e. what the model should have output) and columns are what it actually output. Ideally, we would like a diagonal matrix, which would imply an accuracy of 1.0.

In [159]:
def prettify(conf_mat):
    print('***', end = '')
    for l in langs:
        print('|*' + l, end = '')
    print('|')
    for l in langs:
        print(l, end = '|')
        for l2 in langs:
            if l2 in conf_mat[l]:
                print(str(conf_mat[l][l2]).rjust(4), end = '|')
            else:
                print(str('   .'), end = '|')
        print()
    print('-'*34)

def get_accuracy(conf_mat):
    total = 0
    right = 0
    for l in langs:
        right += conf_mat[l][l]
        total += sum(conf_mat[l].values())
    return right/total
    
prettify(confusion_matrix)
print('Accuracy', get_accuracy(confusion_matrix))
print('Training time:', lt1 - lt0)
print('Inference time:', tt1-tt0)
        

***|*eng|*deu|*fra|*ita|*nld|*spa|
eng|9985|   .|   1|   .|   1|   .|
deu|  11|9971|   .|   1|   6|   1|
fra|  11|   .|9980|   5|   3|   1|
ita|   8|   .|   1|9987|   .|   4|
nld|  28|   7|   2|   4|9957|   2|
spa|   4|   .|   1|   3|   .|9992|
----------------------------------
Accuracy 0.9982493289094153
Training time: 70.53643941879272
Inference time: 209.0808641910553
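Beyond overall accuracy, per-language recall and precision can be read off the same matrix. The sketch below copies the counts from the table above into the nested-dict layout used by the notebook; the helper names `recall` and `precision` are our own additions for illustration.

```python
langs = ['eng', 'deu', 'fra', 'ita', 'nld', 'spa']

# Rows: ground truth, columns: predicted language (counts from the table above).
confusion = {
    'eng': {'eng': 9985, 'fra': 1, 'nld': 1},
    'deu': {'eng': 11, 'deu': 9971, 'ita': 1, 'nld': 6, 'spa': 1},
    'fra': {'eng': 11, 'fra': 9980, 'ita': 5, 'nld': 3, 'spa': 1},
    'ita': {'eng': 8, 'fra': 1, 'ita': 9987, 'spa': 4},
    'nld': {'eng': 28, 'deu': 7, 'fra': 2, 'ita': 4, 'nld': 9957, 'spa': 2},
    'spa': {'eng': 4, 'fra': 1, 'ita': 3, 'spa': 9992},
}

def recall(conf_mat, lang):
    # Fraction of sentences of `lang` that were labeled as `lang`.
    row = conf_mat[lang]
    return row.get(lang, 0) / sum(row.values())

def precision(conf_mat, lang):
    # Fraction of sentences labeled as `lang` that really are `lang`.
    predicted = sum(conf_mat[truth].get(lang, 0) for truth in conf_mat)
    return conf_mat[lang].get(lang, 0) / predicted

for l in langs:
    print(l, round(recall(confusion, l), 4), round(precision(confusion, l), 4))
```

In this table the lowest precision belongs to English, because the eng column collects most of the misclassified sentences, and the lowest recall belongs to Dutch.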


## Conclusions

The confusion matrix is almost diagonal and the accuracy is almost perfect. Accuracy is the right metric in this case because the dataset is balanced.

Inspecting the test set and the errors, we can see that there are some special cases, such as English sentences containing Japanese characters, which can fool the classifier, but in general it is solid for all languages.

We can see that most errors occur when a non-English sentence is tagged as English. We believe that this happens because English words are used in other languages more often than the other way around.

With regard to execution times, the model is fast: in the runs above, training took roughly 70 to 75 seconds and inference roughly 190 to 210 seconds. Perhaps inference could be faster, but there is a trade-off between training and inference time (e.g. we could have stored a big table with pre-computed probabilities).

As future work, we suggest using memoization (dynamic programming) to avoid recomputing certain probabilities (we did not implement it because we were not sure the measured inference time would still be fair in that case) and building an ensemble classifier. It should also be investigated whether the special sentences mentioned above could be detected and treated in some way.
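As a rough illustration of the memoization idea, the following sketch caches the log probability of each trigram the first time it is requested. It is not the notebook's implementation: the class `CachedTrigramScorer` and the toy counts are assumptions, and plain `Counter` objects stand in for the NLTK frequency distributions.

```python
from collections import Counter
from math import log

class CachedTrigramScorer:
    """Laplace-smoothed trigram log probabilities with a per-trigram cache."""
    def __init__(self, trigram_counts, bigram_counts, lambd=1):
        self.trigram_counts = trigram_counts
        self.bigram_counts = bigram_counts
        self.lambd = lambd
        self.V = len(trigram_counts)   # number of distinct trigrams
        self._cache = {}

    def log_prob(self, trigram):
        if trigram not in self._cache:
            ci = self.trigram_counts[trigram]      # Counter gives 0 if unseen
            N = self.bigram_counts[trigram[:2]]
            self._cache[trigram] = log(
                (ci + self.lambd) / (N + self.lambd * self.V))
        return self._cache[trigram]

# Toy counts, for illustration only.
tri = Counter({('t', 'h', 'e'): 8, ('t', 'h', 'a'): 2})
bi = Counter({('t', 'h'): 10})
scorer = CachedTrigramScorer(tri, bi)

first = scorer.log_prob(('t', 'h', 'e'))    # computed: log(9 / 12)
second = scorer.log_prob(('t', 'h', 'e'))   # served from the cache
```

The cache grows only with the distinct trigrams actually seen at inference time, which is a middle ground between recomputing everything and storing a full table up front.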