# TueSNLP - Assignment 3

## Language identification
The assignment is available here: https://snlp2018.github.io/assignments.html.

### Exercise 1
This exercise is about creating a dataset of sentences in different languages, starting from the IDs of tweets collected during the class. We don't have access to the private repo where the tweets were saved, so we try to keep the spirit of the exercise by using data from http://tatoeba.org, a crowd-sourced collection of sentences and translations.

In particular, http://downloads.tatoeba.org/exports/sentences.tar.bz2 contains ~8 million sentences, each with its corresponding language code. To mimic the dataset originally provided for the assignment, we select 30 languages at random and pick a fixed number of random sentences for each language. We use the majority of the sentences to build the development set and the remainder to build an evaluation set, which will be used to evaluate our model(s). The precise quantities and the split ratio will be determined after we explore the dataset.

In [1]:
# libraries
import pandas as pd
import numpy as np
import random
import string

In [2]:
# read data
full_data_raw = pd.read_csv("data/sentences.csv", sep = "\t", names = ["id", "lang", "sentence"])
full_data_raw.head()

Unnamed: 0,id,lang,sentence
0,1,cmn,我們試試看！
1,2,cmn,我该去睡觉了。
2,3,cmn,你在干什麼啊？
3,4,cmn,這是什麼啊？
4,5,cmn,今天是６月１８号，也是Muiriel的生日！


We don't need the sentence id, so we can drop that column:

In [3]:
full_data_raw = full_data_raw.drop("id", axis = 1)

In [4]:
# how many unique languages?
full_data_raw.nunique()

lang            349
sentence    8112291
dtype: int64

Let's pick 30 languages at random, then filter the df:

In [5]:
random.seed(2)
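# random.sample no longer accepts a set (Python 3.11+), so we sample from a list of the unique codes
languages = random.sample(list(full_data_raw["lang"].dropna().unique()), 30)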
print(languages)

['sna', 'pap', 'bzt', 'ngt', 'ksh', 'lou', 'rus', 'hun', 'dws', 'oss', 'nch', 'que', 'ara', 'bel', 'ldn', 'fuv', 'otk', 'ile', 'jav', 'mah', 'ssw', 'tgk', 'cho', 'grc', 'gbm', 'sun', 'mfe', 'mar', 'nau', 'avk']


The language column also contains the code `\N`, which looks suspicious. Let's see what it corresponds to in the df:

In [6]:
# inspect the rows with the suspicious language code
full_data_raw[full_data_raw["lang"] == "\\N"].head()

Unnamed: 0,lang,sentence
6095954,\N,"Sābuku mamayamin, niyaꞋ takiteku manaꞋul magla..."
6104067,\N,Kataau kano koson i Ama' min pana mataau!
6310672,\N,Нуӈан дэмэрипчут дылви амаскиви донӈорочон.
6310678,\N,"Чикчакун дочадяран, мудана ачинди иргикэндиви ..."
6310683,\N,Том сома дэмэр куӈакан бичэн.


It appears to be a placeholder for missing language codes. Let's filter out the corresponding rows:

(Eventually, if our classifier works well enough, we will be able to use it to infer the missing language codes.)

In [7]:
filtered_data = full_data_raw[full_data_raw["lang"] != "\\N"]
filtered_data.nunique()

lang            348
sentence    8112201
dtype: int64
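A related pitfall is worth a quick check. By default, `pd.read_csv` parses strings such as `NA`, `null` and `nan` as missing values, and `nan` is also the ISO 639-3 code for Min Nan Chinese. If the export contains that code, it would silently become `NaN` in the `lang` column. The sketch below is one possible way to handle both issues at read time, using the `keep_default_na` and `na_values` parameters of `pd.read_csv`. The three rows are made-up stand-ins for `sentences.csv`, not real Tatoeba data.

```python
import io
import pandas as pd

# Toy stand-in for sentences.csv (hypothetical rows: tab-separated, no header)
toy_tsv = "1\teng\tHello.\n2\tnan\tA sentence.\n3\t\\N\tUnknown language.\n"

toy = pd.read_csv(
    io.StringIO(toy_tsv),
    sep="\t",
    names=["id", "lang", "sentence"],
    keep_default_na=False,  # keep literal strings such as "nan" as they are
    na_values=["\\N"],      # treat only the \N placeholder as missing
)

# Rows without a language code can then be dropped in one step
toy = toy.dropna(subset=["lang"])
print(toy["lang"].tolist())
```

With this approach the `\N` rows disappear through `dropna`, while a language whose code happens to look like a missing-value marker is preserved.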

We don't know how many sentences there are for each language. Before sampling 30 languages, let's remove those with fewer than 100 sentences:

In [8]:
sentence_count = filtered_data.groupby("lang").nunique() # group by language and count unique sentences per group
sentence_count.head()

lang (index),lang,sentence
abk,1,26
acm,1,49
ady,1,31
afb,1,133
afh,1,79


In [9]:
# extract only languages with >99 sentences
languages = sentence_count[sentence_count["sentence"] > 99].index.values
len(languages)

178
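The same filter can also be expressed on a single Series, which avoids the extra `lang` column in the grouped output. This is a minimal sketch on made-up data, with a hypothetical threshold of 2 instead of 100 so that the effect is visible on six rows.

```python
import pandas as pd

# Toy data (hypothetical); note the duplicated sentence "d" for "bar"
toy = pd.DataFrame({
    "lang": ["eng", "eng", "eng", "bar", "bar", "vol"],
    "sentence": ["a", "b", "c", "d", "d", "e"],
})

min_sentences = 2  # hypothetical threshold; the notebook uses 100

# Number of unique sentences per language, as a Series indexed by language
per_lang = toy.groupby("lang")["sentence"].nunique()

# Keep only the languages that reach the threshold
kept = per_lang[per_lang >= min_sentences].index.tolist()
print(kept)
```

Because `nunique` counts distinct sentences, the duplicated sentence for `bar` counts only once, so `bar` falls below the threshold here.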

Among these, we sample 30:

In [10]:
random.seed(2)
languages = random.sample(list(languages), 30)
print(languages)

['bar', 'cbk', 'bul', 'ldn', 'eus', 'wuu', 'kab', 'hrx', 'tha', 'got', 'vol', 'ast', 'swe', 'eng', 'nds', 'mar', 'prg', 'lit', 'sah', 'nob', 'pol', 'ido', 'urd', 'arz', 'lfn', 'ori', 'kaz', 'lvs', 'mya', 'rom']


Next, let's pick 100 sentences at random for each language:

In [11]:
sampled_data = filtered_data[filtered_data["lang"].isin(languages)] # filter based on list of sampled languages
# we apply a sampling function groupwise
sampled_data = sampled_data.groupby("lang").apply(lambda x : x.sample(100, random_state = 2)).reset_index(drop=True)
sampled_data.head()

Unnamed: 0,lang,sentence
0,arz,كل الناس بتحب المكان ده.
1,arz,كله محصل بعضه.
2,arz,كارلوس طلع الجبل.
3,arz,مفيش أي مشاكل.
4,arz,معدش بيشوف.


In [12]:
sampled_data.shape[0]

3000

Finally, let's shuffle the rows and set aside the last 10% as the evaluation set; the remaining 90% form the development set.

In [13]:
sampled_data = sampled_data.sample(n = sampled_data.shape[0], random_state = 2).reset_index(drop=True)
sampled_data.head()

Unnamed: 0,lang,sentence
0,ldn,Meháya rul woho wa.
1,eng,I want to borrow some money.
2,bar,Ohne Schmäh?
3,vol,Fat Tomasa äbinom sanel.
4,ldn,"Bíid thad dóyom le leyóoth wa, Thom."


In [53]:
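# take copies, so that adding columns later does not trigger pandas' SettingWithCopyWarning
dev_df = sampled_data.head(int(sampled_data.shape[0]*9/10)).copy()
eval_df = sampled_data.tail(int(sampled_data.shape[0]*1/10)).copy()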
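Since the split takes the head and tail of a shuffled frame, the number of sentences per language in the evaluation set can vary. If an exactly balanced evaluation set were preferred, one alternative (not what this notebook does) would be a stratified split with `DataFrameGroupBy.sample`, available in pandas 1.1 and later. The data and the names `toy`, `eval_toy` and `dev_toy` below are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): 3 languages with 10 sentences each
toy = pd.DataFrame({
    "lang": ["eng"] * 10 + ["bar"] * 10 + ["vol"] * 10,
    "sentence": [f"sentence {i}" for i in range(30)],
})

# Draw 10% of the rows from every language for the evaluation set...
eval_toy = toy.groupby("lang").sample(frac=0.1, random_state=2)

# ...and keep all remaining rows for development
dev_toy = toy.drop(eval_toy.index)

print(eval_toy["lang"].value_counts().to_dict())
print(len(dev_toy), len(eval_toy))
```

Either way, `eval_df["lang"].value_counts()` is a quick check of how balanced the actual split turned out to be.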

Write both to disk for later use:

In [15]:
dev_df.to_csv("data/assignment3-dev.tsv", sep = "\t", header = False, index = False)
eval_df.to_csv("data/assignment3-eval.tsv", sep = "\t", header = False, index = False)

### Exercise 2

This exercise is about feature extraction. Each sentence is tokenized into character bigrams and then represented as an array of counts, with one entry for each bigram that occurs anywhere in the dataset. The array is very sparse: the vast majority of its entries are 0.
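To make the representation concrete, here is a small standalone illustration on two made-up strings. It uses `collections.Counter` and dense Python lists purely for readability; the helper `char_bigrams` is a hypothetical stand-in written for this sketch.

```python
from collections import Counter

def char_bigrams(text):
    """Character bigrams of `text`, with <BOS>/<EOS> boundary markers."""
    chars = ["<BOS>"] + list(text) + ["<EOS>"]
    return [a + b for a, b in zip(chars, chars[1:])]

sentences = ["abab", "ba"]  # toy "dataset"

# Count the bigrams of each sentence
counts = [Counter(char_bigrams(s)) for s in sentences]

# The vocabulary is every bigram seen in any sentence
vocab = sorted(set().union(*counts))

# One count vector per sentence, one entry per vocabulary bigram
vectors = [[c[b] for b in vocab] for c in counts]

print(vocab)    # ['<BOS>a', '<BOS>b', 'a<EOS>', 'ab', 'b<EOS>', 'ba']
print(vectors)  # [[1, 0, 0, 2, 1, 1], [0, 1, 1, 0, 0, 1]]
```

Even in this tiny case some entries are 0. With thousands of bigrams in the vocabulary and only a handful per sentence, almost all of them are.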

In [16]:
# define a function to extract character-level bigrams
def sentence_tokenizer(in_string): # takes a string as input
    # pad with boundary markers, so that empty and one-character strings are handled too
    symbols = ["<BOS>"] + list(in_string) + ["<EOS>"]
    # each bigram is a symbol followed by the next one:
    # the first is always <BOS>+first character, the last is always last character+<EOS>
    tknzd_string = [symbols[i] + symbols[i+1] for i in range(len(symbols) - 1)]
    return tknzd_string

For example:

In [17]:
in_string = "ti che ti tachi i tac tacam i tac a mi"
out_string = sentence_tokenizer(in_string)
print(out_string)

['<BOS>t', 'ti', 'i ', ' c', 'ch', 'he', 'e ', ' t', 'ti', 'i ', ' t', 'ta', 'ac', 'ch', 'hi', 'i ', ' i', 'i ', ' t', 'ta', 'ac', 'c ', ' t', 'ta', 'ac', 'ca', 'am', 'm ', ' i', 'i ', ' t', 'ta', 'ac', 'c ', ' a', 'a ', ' m', 'mi', 'i<EOS>']
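The same idea extends to longer character n-grams, which could be worth trying as features later. The function below is a sketch with a hypothetical name and signature (`char_ngrams`, `n`, `bos`, `eos`); for `n=2` it is meant to produce the same bigrams as the tokenizer above for non-empty input.

```python
def char_ngrams(text, n=2, bos="<BOS>", eos="<EOS>"):
    """Return the character n-grams of `text`, padded with n-1 boundary markers per side."""
    padded = [bos] * (n - 1) + list(text) + [eos] * (n - 1)
    return ["".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]

print(char_ngrams("tac"))
# ['<BOS>t', 'ta', 'ac', 'c<EOS>']
print(char_ngrams("tac", n=3))
# ['<BOS><BOS>t', '<BOS>ta', 'tac', 'ac<EOS>', 'c<EOS><EOS>']
```

Padding before slicing means that very short strings need no special handling: an empty string simply yields the single bigram `<BOS><EOS>`.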


In [18]:
# define a function to count bigrams in a sentence
def bigram_counter(in_string):
    tknzd_string = sentence_tokenizer(in_string)
    bigrams, counts = np.unique(tknzd_string, return_counts=True)
    return(bigrams, counts)

For example:

In [19]:
in_string = "ti che ti tachi i tac tacam i tac a mi"
bigram_counter(in_string)

(array([' a', ' c', ' i', ' m', ' t', '<BOS>t', 'a ', 'ac', 'am', 'c ',
        'ca', 'ch', 'e ', 'he', 'hi', 'i ', 'i<EOS>', 'm ', 'mi', 'ta',
        'ti'], dtype='<U6'),
 array([1, 1, 2, 1, 5, 1, 1, 4, 1, 2, 1, 2, 1, 1, 1, 5, 1, 1, 1, 4, 2]))

(we don't worry about the order of bigrams)

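Let's strip punctuation from the sentences (note that `string.punctuation` only covers ASCII punctuation, so marks such as `！` or `。` are left in place), then tokenize all the sentences in the dataset: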

In [54]:
dev_df.loc[:, "clean_sentence"] = dev_df.apply(lambda row : row.sentence.translate(str.maketrans('', '', string.punctuation)),
                                               axis=1)
dev_df.head()

Unnamed: 0,lang,sentence,clean_sentence
0,ldn,Meháya rul woho wa.,Meháya rul woho wa
1,eng,I want to borrow some money.,I want to borrow some money
2,bar,Ohne Schmäh?,Ohne Schmäh
3,vol,Fat Tomasa äbinom sanel.,Fat Tomasa äbinom sanel
4,ldn,"Bíid thad dóyom le leyóoth wa, Thom.",Bíid thad dóyom le leyóoth wa Thom
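`string.punctuation` contains only ASCII characters. If non-ASCII punctuation should be removed as well, one option is to additionally drop every character whose Unicode general category starts with `P`. The helper `strip_punctuation` below is a hypothetical sketch of that idea, not the cleaning step used in this notebook.

```python
import string
import unicodedata

def strip_punctuation(text):
    """Drop ASCII punctuation and any character in a Unicode punctuation category."""
    return "".join(
        ch for ch in text
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("Ohne Schmäh?"))  # Ohne Schmäh
print(strip_punctuation("我們試試看！"))    # 我們試試看
```

The ASCII check is kept because `string.punctuation` also includes symbols such as `$` and `+`, which Unicode classifies as symbols rather than punctuation.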


In [55]:
# unique bigrams of each sentence (sorted, without repetitions)
dev_df.loc[:, "tokenized"] = dev_df.apply(lambda row : bigram_counter(row.clean_sentence)[0], axis=1)
dev_df.head()

Unnamed: 0,lang,sentence,clean_sentence,tokenized
0,ldn,Meháya rul woho wa.,Meháya rul woho wa,"[ r, w, <BOS>M, Me, a , a<EOS>, eh, ho, há, l..."
1,eng,I want to borrow some money.,I want to borrow some money,"[ b, m, s, t, w, <BOS>I, I , an, bo, e , e..."
2,bar,Ohne Schmäh?,Ohne Schmäh,"[ S, <BOS>O, Oh, Sc, ch, e , h<EOS>, hm, hn, m..."
3,vol,Fat Tomasa äbinom sanel.,Fat Tomasa äbinom sanel,"[ T, s, ä, <BOS>F, Fa, To, a , an, as, at, b..."
4,ldn,"Bíid thad dóyom le leyóoth wa, Thom.",Bíid thad dóyom le leyóoth wa Thom,"[ T, d, l, t, w, <BOS>B, Bí, Th, a , ad, d..."


Next, count bigrams in each sentence:

In [57]:
dev_df.loc[:, "counts"] = dev_df.apply(lambda row : bigram_counter(row.clean_sentence)[1], axis=1)

In [58]:
dev_df.head()

Unnamed: 0,lang,sentence,clean_sentence,tokenized,counts
0,ldn,Meháya rul woho wa.,Meháya rul woho wa,"[ r, w, <BOS>M, Me, a , a<EOS>, eh, ho, há, l...","[1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,eng,I want to borrow some money.,I want to borrow some money,"[ b, m, s, t, w, <BOS>I, I , an, bo, e , e...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,bar,Ohne Schmäh?,Ohne Schmäh,"[ S, <BOS>O, Oh, Sc, ch, e , h<EOS>, hm, hn, m...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
3,vol,Fat Tomasa äbinom sanel.,Fat Tomasa äbinom sanel,"[ T, s, ä, <BOS>F, Fa, To, a , an, as, at, b...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,ldn,"Bíid thad dóyom le leyóoth wa, Thom.",Bíid thad dóyom le leyóoth wa Thom,"[ T, d, l, t, w, <BOS>B, Bí, Th, a , ad, d...","[1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, ..."


Next, let's collect the "vocabulary" of bigrams of the dataset:

In [31]:
vocab = np.unique([bigram for tknzd_sentence in dev_df["tokenized"] for bigram in tknzd_sentence])

In [33]:
len(vocab)

7011

We now initialize a sparse matrix with `scipy`, with as many rows as there are sentences in the dataset and as many columns as there are bigrams in the vocabulary:

In [43]:
from scipy.sparse import dok_matrix

In [78]:
dev_matrix = dok_matrix((dev_df.shape[0], len(vocab)), dtype = np.int16)

In [79]:
vocab_index = {bigram: j for j, bigram in enumerate(vocab)} # map each bigram to its column
for i, (bigrams, counts) in enumerate(zip(dev_df["tokenized"], dev_df["counts"])): # for each sentence
    for bigram, count in zip(bigrams, counts): # visit only the bigrams found in the current sentence...
        dev_matrix[i, vocab_index[bigram]] = count # ...and store their counts; all other entries stay 0
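As an alternative to filling a `dok_matrix` entry by entry, the matrix can be assembled in one go from (row, column, value) triplets with `scipy.sparse.csr_matrix`, which sums duplicate entries when given data in this form. The sketch below is standalone and uses two made-up strings. The helper `bigrams` and the names `col_of`, `rows` and `cols` are hypothetical and not part of the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bigrams(text):
    """Character bigrams of `text`, with <BOS>/<EOS> boundary markers."""
    padded = ["<BOS>"] + list(text) + ["<EOS>"]
    return [a + b for a, b in zip(padded, padded[1:])]

sentences = ["tac", "tata"]  # toy "dataset"
tokenized = [bigrams(s) for s in sentences]

# Vocabulary and a bigram -> column lookup
vocab = sorted({b for sent in tokenized for b in sent})
col_of = {b: j for j, b in enumerate(vocab)}

# One (row, col) pair per bigram occurrence; repeated pairs are summed by scipy
rows, cols = [], []
for i, sent in enumerate(tokenized):
    for b in sent:
        rows.append(i)
        cols.append(col_of[b])

data = np.ones(len(rows), dtype=np.int16)
matrix = csr_matrix((data, (rows, cols)), shape=(len(sentences), len(vocab)))

print(vocab)             # ['<BOS>t', 'a<EOS>', 'ac', 'at', 'c<EOS>', 'ta']
print(matrix.toarray())  # rows: [1 0 1 0 1 1] and [1 1 0 1 0 2]
```

The work here is proportional to the number of bigram occurrences rather than to sentences times vocabulary size, and CSR is a convenient format for row slicing and matrix products.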