# TueSNLP - Assignment 3

## Language identification
The assignment is available here: https://snlp2018.github.io/assignments.html.

### Exercise 1
This exercise is about creating a dataset of sentences in different languages starting from ids of tweets collected during the class. We don't have access to the private repo where the tweets had been saved, so we try to keep the spirit of the exercise using data from http://tatoeba.org a crowd-sourced collection of sentences and translations.

In particular, http://downloads.tatoeba.org/exports/sentences.tar.bz2 contains ~8 milion sentences each with corresponding language code. In order to mimic the dataset originally provided for the assignment, we select 30 languages at random and pick a certain number of random sentences for each language. We use the majority of sentences to build the development set and the remainder to build an evaluation set which will be used to evaluate our model(s). Precise quantities and ratio will be determined after we explore the dataset.

In [107]:
# libraries
import pandas as pd
import numpy as np
import random
import string
import progressbar

In [108]:
# read data
full_data_raw = pd.read_csv("data/sentences.csv", sep = "\t", names = ["id", "lang", "sentence"])
full_data_raw.head()

Unnamed: 0,id,lang,sentence
0,1,cmn,我們試試看！
1,2,cmn,我该去睡觉了。
2,3,cmn,你在干什麼啊？
3,4,cmn,這是什麼啊？
4,5,cmn,今天是６月１８号，也是Muiriel的生日！


We don't need sentence id, we can drop the column:

In [109]:
full_data_raw = full_data_raw.drop("id", axis = 1)

In [110]:
# how many unique languages?
full_data_raw.nunique()

lang            349
sentence    8112291
dtype: int64

Let's pick 30 languages at random, then filter the df:

In [111]:
random.seed(2)
languages = random.sample(set(full_data_raw["lang"]), 30)
print(languages)

['wln', 'ban', 'ary', 'mah', 'vro', 'kaa', 'mwl', 'dsb', 'tel', 'kam', 'hrx', 'dws', 'ldn', 'urh', 'tlh', 'pol', 'lmo', 'tsn', 'mnw', 'enm', '\\N', 'lou', 'kha', 'mhr', 'prg', 'arz', 'dng', 'srd', 'nav', 'cor']


The `\\N` is suspicious, let's see what it corresponds to in the df:

In [112]:
# filter df
full_data_raw[full_data_raw["lang"] == "\\N"].head()

Unnamed: 0,lang,sentence
6095954,\N,"Sābuku mamayamin, niyaꞋ takiteku manaꞋul magla..."
6104067,\N,Kataau kano koson i Ama' min pana mataau!
6310672,\N,Нуӈан дэмэрипчут дылви амаскиви донӈорочон.
6310678,\N,"Чикчакун дочадяран, мудана ачинди иргикэндиви ..."
6310683,\N,Том сома дэмэр куӈакан бичэн.


It appears to be a placeholder for missing language codes. Let's filter out the corresponding rows:

(eventually, if our classifier work well enough, we will be able to use it to infer the missing language codes...)

In [113]:
filtered_data = full_data_raw[full_data_raw["lang"] != "\\N"]
filtered_data.nunique()

lang            348
sentence    8112201
dtype: int64

We don't know how many sentences there are for each language. Before sampling 30 languages, let's remove those with less than 100 sentences:

In [114]:
sentence_count = filtered_data.groupby("lang").nunique() # group by language and count sentences per group
sentence_count.head()

Unnamed: 0_level_0,lang,sentence
lang,Unnamed: 1_level_1,Unnamed: 2_level_1
abk,1,26
acm,1,49
ady,1,31
afb,1,133
afh,1,79


In [115]:
# extract only languages with >99 sentences
languages = sentence_count[sentence_count["sentence"] > 99].index.values
len(languages)

178

Among these, we sample 30:

In [116]:
random.seed(2)
languages = random.sample(list(languages), 30)
print(languages)

['bar', 'cbk', 'bul', 'ldn', 'eus', 'wuu', 'kab', 'hrx', 'tha', 'got', 'vol', 'ast', 'swe', 'eng', 'nds', 'mar', 'prg', 'lit', 'sah', 'nob', 'pol', 'ido', 'urd', 'arz', 'lfn', 'ori', 'kaz', 'lvs', 'mya', 'rom']


Next, let's pick 50 sentences at random for each language:

In [117]:
sampled_data = filtered_data[filtered_data["lang"].isin(languages)] # filter based on list of sampled languages
# we apply a sampling function groupwise
sampled_data = sampled_data.groupby("lang").apply(lambda x : x.sample(50, random_state = 2)).reset_index(drop=True)
sampled_data.head()

Unnamed: 0,lang,sentence
0,arz,كل الناس بتحب المكان ده.
1,arz,كله محصل بعضه.
2,arz,كارلوس طلع الجبل.
3,arz,مفيش أي مشاكل.
4,arz,معدش بيشوف.


In [118]:
sampled_data.shape[0]

1500

Finally, let's shuffle this and save as tsv:

In [119]:
sampled_data = sampled_data.sample(n = sampled_data.shape[0], random_state = 2).reset_index(drop=True)
sampled_data.head()

Unnamed: 0,lang,sentence
0,lit,"Ačiū, kad tu atėjai manęs gelbėti."
1,vol,Ävilob jonön omes buki olik.
2,swe,"De sade att de skulle göra läxor, men i ställe..."
3,ast,Tien la zuna de escargatiar nes ñarres.
4,eng,You had a week to do this.


Write to disk:

In [120]:
sampled_data.to_csv("data/assignment3-data.tsv", sep = "\t", header = False, index = False)

### Exercise 2

This exercise is about feature extraction. Each sentence is tokenized at the level of character bigrams. Then each sentence is represented as an array of counts of bigrams, for each bigram in the dataset (so a very sparse array, the vast majority of entries will be 0).

In [78]:
# define a function to extract character-level bigrams
def sentence_tokenizer(in_string): # takes a string as input
    tknzd_string = [] # initialize empty output
    for i in range(0, len(in_string) - 1):
        tknzd_string.append(in_string[i] + in_string[i+1]) # each bigram is a character followed by the next one
    tknzd_string.insert(0, "<BOS>" + tknzd_string[0][0]) # first bigram is always <BOS>+first character
    tknzd_string.append(tknzd_string[-1][-1] + "<EOS>") # last bigram is always last character+<EOS>
    return(tknzd_string)

For example:

In [79]:
in_string = "ti che ti tachi i tac tacam i tac a mi"
out_string = sentence_tokenizer(in_string)
print(out_string)

['<BOS>t', 'ti', 'i ', ' c', 'ch', 'he', 'e ', ' t', 'ti', 'i ', ' t', 'ta', 'ac', 'ch', 'hi', 'i ', ' i', 'i ', ' t', 'ta', 'ac', 'c ', ' t', 'ta', 'ac', 'ca', 'am', 'm ', ' i', 'i ', ' t', 'ta', 'ac', 'c ', ' a', 'a ', ' m', 'mi', 'i<EOS>']


In [80]:
# define a function to count bigrams in a sentence
def bigram_counter(in_string):
    tknzd_string = sentence_tokenizer(in_string)
    bigrams, counts = np.unique(tknzd_string, return_counts=True)
    return(bigrams, counts)

For example:

In [81]:
in_string = "ti che ti tachi i tac tacam i tac a mi"
bigram_counter(in_string)

(array([' a', ' c', ' i', ' m', ' t', '<BOS>t', 'a ', 'ac', 'am', 'c ',
        'ca', 'ch', 'e ', 'he', 'hi', 'i ', 'i<EOS>', 'm ', 'mi', 'ta',
        'ti'], dtype='<U6'),
 array([1, 1, 2, 1, 5, 1, 1, 4, 1, 2, 1, 2, 1, 1, 1, 5, 1, 1, 1, 4, 2]))

(we don't worry about the order of bigrams)

Let's strip punctuation from the sentences, then tokenize all the sentences in the dataset:

In [121]:
sampled_data.loc[:, "clean_sentence"] = sampled_data.apply(lambda row : row.sentence.translate(str.maketrans('', '', string.punctuation)),
                                               axis=1)
sampled_data.head()

Unnamed: 0,lang,sentence,clean_sentence
0,lit,"Ačiū, kad tu atėjai manęs gelbėti.",Ačiū kad tu atėjai manęs gelbėti
1,vol,Ävilob jonön omes buki olik.,Ävilob jonön omes buki olik
2,swe,"De sade att de skulle göra läxor, men i ställe...",De sade att de skulle göra läxor men i stället...
3,ast,Tien la zuna de escargatiar nes ñarres.,Tien la zuna de escargatiar nes ñarres
4,eng,You had a week to do this.,You had a week to do this


In [122]:
sampled_data.loc[:, "tokenized"] = sampled_data.apply(lambda row : bigram_counter(row.clean_sentence)[0], axis=1)

Next, count bigrams in each sentence:

In [123]:
sampled_data.loc[:, "counts"] = sampled_data.apply(lambda row : bigram_counter(row.clean_sentence)[1], axis=1)

In [124]:
sampled_data.head()

Unnamed: 0,lang,sentence,clean_sentence,tokenized,counts
0,lit,"Ačiū, kad tu atėjai manęs gelbėti.",Ačiū kad tu atėjai manęs gelbėti,"[ a, g, k, m, t, <BOS>A, Ač, ad, ai, an, a...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,vol,Ävilob jonön omes buki olik.,Ävilob jonön omes buki olik,"[ b, j, o, <BOS>Ä, b , bu, es, i , ik, il, j...","[1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,swe,"De sade att de skulle göra läxor, men i ställe...",De sade att de skulle göra läxor men i stället...,"[ a, b, d, g, i, l, m, p, s, <BOS>D, D...","[1, 1, 2, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, ..."
3,ast,Tien la zuna de escargatiar nes ñarres.,Tien la zuna de escargatiar nes ñarres,"[ d, e, l, n, z, ñ, <BOS>T, Ti, a , ar, a...","[1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 1, 1, ..."
4,eng,You had a week to do this.,You had a week to do this,"[ a, d, h, t, w, <BOS>Y, Yo, a , ad, d , d...","[1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


Next, let's collect the "vocabulary" of bigrams of the dataset:

In [125]:
vocab = np.unique([bigram for tknzd_sentence in sampled_data["tokenized"] for bigram in tknzd_sentence])

In [126]:
len(vocab)

5485

Now that we have the complete vocabulary, we can split the dataset into development (90%) and evaluation (10%) set:

In [None]:
dev_df = sampled_data.head(int(sampled_data.shape[0]*9/10))
eval_df = sampled_data.tail(int(sampled_data.shape[0]*1/10)).reset_index(drop=True)

For each set we want to initialize (then fill) a sparse matrix using `scipy` with as many rows as the number of sentences in the development or evaluationset and as many columns as the number of bigrams in the full vocab:

In [88]:
from scipy.sparse import dok_matrix

In [129]:
def data_count_vectorizer(df, vocab): # df must have a 'lang' column 
    matrix = dok_matrix((df.shape[0], len(vocab)), dtype = np.int16) # initialize matrix
    
    # it might take a while, so let's add a progressbar to have an idea of the running time
    bar = progressbar.ProgressBar(maxval = matrix.shape[0], \
        widgets = [progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])

    bar.start()
    for i in range(matrix.shape[0]): # for each sentence,
        for j in range(matrix.shape[1]): # for each bigram in the vocabulary,
            if vocab[j] in dev_df["tokenized"][i]: # if the bigram is found in the current sentence...
                matrix[i, j] = dev_df["counts"][i][dev_df["tokenized"][i] == vocab[j]] # ...the value of the matrix is the count of that bigram in that sentence...
            else:
                matrix[i, j] = 0 # ...otherwise is 0
        bar.update(i+1)
    bar.finish()
    
    return(matrix, df['lang'])

Finally, let's fill the matrix for the development set:

In [130]:
dev_matrix, dev_tags = data_count_vectorizer(dev_df, vocab)



### Exercise 3
Logistic regression using `sklearn.linear_model.LogisticRegression`:

In [131]:
from sklearn import linear_model as lm

In [132]:
logReg = lm.LogisticRegression() # initialize model

X_train = dev_matrix # counts of bigrams as features
y_train = dev_tags # language label as target

In [138]:
logReg.fit(X_train, y_train) # train

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [139]:
logReg.score(X_train, y_train) # print score (mean accuracy)

1.0

How about score on evaluation set?

In [140]:
eval_matrix, eval_tags = data_count_vectorizer(eval_df, vocab)



In [136]:
X_eval = eval_matrix
y_eval = eval_tags
logReg.score(X_eval, y_eval) # print score (mean accuracy)

0.03333333333333333

Pretty bad generalization.

### Exercise 4
We write a function that takes two sequences of labels (read: gold-standard labels and predicted labels) and a specific label as input and returns precision, recall and F1-score with respect to the input specific label.

In [201]:
def get_scores(true_labels, predicted_labels, label):
    true_indices = [i for i, x in enumerate(true_labels) if x == label] # indices of true occurrences of "label"
    predicted_indices = [i for i, x in enumerate(predicted_labels) if x == label] # predictions of true occ. of "label"
    neg_true_indices = [i for i, x in enumerate(true_labels) if x != label] # indices of true occurrences of "NOTlabel"
    neg_predicted_indices = [i for i, x in enumerate(predicted_labels) if x != label] # predictions of true occ. of "NOTlabel"

    tp_indices = list(set(true_indices).intersection(predicted_indices)) # indices of true positives
    fp_indices = [i for i in predicted_indices if not i in tp_indices] # indices of false positives
    tn_indices = list(set(neg_true_indices).intersection(neg_predicted_indices)) # indices of true negatives
    fn_indices = [i for i in neg_predicted_indices if not i in tn_indices] # indices of false negatives

    tp = len(tp_indices) # how many?
    fp = len(fp_indices)
    fn = len(fn_indices)
    
    ###TODO: add check for division by zero :)
    precision = tp/(tp+fp) # scores
    recall = tp/(tp+fn)
    f1_score = (2*precision*recall)/(precision+recall)

    return(precision, recall, f1_score)

For example:

In [231]:
true_labels = ["ita", "fra", "eng", "deu", "esp", "por", "swe", 
               "ita", "fra", "eng", "deu", "esp", "por", "swe",
               "ita", "fra", "eng", "deu", "esp", "por", "swe", 
               "ita", "fra", "eng", "deu", "esp", "por", "swe", 
               "ita", "fra", "eng", "deu", "esp", "por", "swe",
               "ita", "fra", "eng", "deu", "esp", "por", "swe",
               "ita", "fra", "eng", "deu", "esp", "por", "swe"]

predicted_labels = ["ita", "fra", "fra", "eng", "por", "por", "esp", 
                    "ita", "fra", "deu", "deu", "esp", "por", "swe",
                    "ita", "fra", "deu", "deu", "por", "por", "swe", 
                    "ita", "fra", "deu", "deu", "esp", "por", "swe", 
                    "ita", "fra", "deu", "deu", "esp", "por", "swe",
                    "ita", "fra", "deu", "deu", "por", "esp", "swe",
                    "ita", "fra", "eng", "swe", "esp", "por", "swe"]

In [204]:
pr, re, f1 = get_scores(true_labels, predicted_labels, "ita")
print("Precision: {}, Recall: {}, F1-score: {}".format(pr, re, f1))

Precision: 1.0, Recall: 1.0, F1-score: 1.0


In [205]:
pr, re, f1 = get_scores(true_labels, predicted_labels, "fra")
print("Precision: {}, Recall: {}, F1-score: {}".format(pr, re, f1))

Precision: 1.0, Recall: 0.5714285714285714, F1-score: 0.7272727272727273


In [206]:
pr, re, f1 = get_scores(true_labels, predicted_labels, "swe")
print("Precision: {}, Recall: {}, F1-score: {}".format(pr, re, f1))

Precision: 0.8571428571428571, Recall: 0.8571428571428571, F1-score: 0.8571428571428571


Next, we write a function to compute averaged scores over a list of specific labels (instead of just one label as before):

In [215]:
def avg_scores(true_labels, predicted_labels, labels):
    precisions = [] # initialize empty lists
    recalls = []
    f1_scores = []
    for label in list(set(labels)): # cycle through unique entries in label list
        pr, re, f1 = get_scores(true_labels, predicted_labels, label)
        precisions.append(pr)
        recalls.append(re)
        f1_scores.append(f1)
    avg_precision = sum(precisions)/len(precisions) # compute average scores
    avg_recall = sum(recalls)/len(recalls)
    avg_f1_score = sum(f1_scores)/len(f1_scores)
    
    return(avg_precision, avg_recall, avg_f1_score)

For example:

In [219]:
labels = ["ita", "fra", "eng", "deu", "esp", "por", "swe"]
avg_pr, avg_re, avg_f1 = avg_scores(true_labels, predicted_labels, labels)
print("Average scores. Precision: {}, Recall: {}, F1-score: {}".format(avg_pr, avg_re, avg_f1))

Average scores. Precision: 0.7414965986394557, Recall: 0.7346938775510203, F1-score: 0.7282108647654866


In [232]:
labels = ["ita", "fra"]
avg_pr, avg_re, avg_f1 = avg_scores(true_labels, predicted_labels, labels)
print("Average scores. Precision: {}, Recall: {}, F1-score: {}".format(avg_pr, avg_re, avg_f1))

Average scores. Precision: 0.9375, Recall: 1.0, F1-score: 0.9666666666666667


Finally, let's copmute these average scores for the model trained above.

In [235]:
true_labels = dev_tags
predicted_labels = logReg.predict(X_train)
labels = list(set(dev_df['lang']))

In [236]:
avg_pr, avg_re, avg_f1 = avg_scores(true_labels, predicted_labels, labels)
print("Average scores. Precision: {}, Recall: {}, F1-score: {}".format(avg_pr, avg_re, avg_f1))

Average scores. Precision: 1.0, Recall: 1.0, F1-score: 1.0


Double check with `sklearn.metrics`?