# Task 0 problem - Lucas Despin & Théo Fagnoni

In this notebook, we present methods to solve the SIGMORPHON 2020 Task 0 problem, described in this [GitHub repository](http://github.com/sigmorphon2020/task0-data).
We propose some ideas to tackle this problem, together with implementations and the related results on different datasets.

## Imports

In [1]:
import numpy as np
import pandas as pd
import os
import yaml
import codecs
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from evaluate import distance

#To access data
from functions import tags_file, categories, categories_df
from functions import extracting_test_sets_per_classes
from functions import extracting_train_input_models
from functions import charac_dict, read, dict_to_frame

#To encode lemmas
from functions import get_longest, encode_data

#To encode tags
from functions import encode_tag, encode_tags_dataset

#To classify test bundles
from functions import indicator_non_zero, assign_class_test

#To train
from functions import train_linear_reg, train_len_words, decode_vector, convert_pred, convert_pred_limit
from functions import predict_and_save, predict_len, encode_len, encode_sets

## Access data

We start by choosing a specific language. We can then build a character dictionary that maps each possible character of this language to an index.

In [2]:
rep = "DEVELOPMENT-LANGUAGES"
region = "austronesian"
lang = "ceb"

In [3]:
char_dict = charac_dict(lang)

In [4]:
char_dict

{'-': 0,
 'a': 1,
 'b': 2,
 'd': 3,
 'e': 4,
 'g': 5,
 'h': 6,
 'i': 7,
 'k': 8,
 'l': 9,
 'm': 10,
 'n': 11,
 'o': 12,
 'p': 13,
 'r': 14,
 's': 15,
 't': 16,
 'u': 17,
 'w': 18,
 'y': 19,
 ' ': 20}
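
The internals of `charac_dict` are not shown in this notebook. As a minimal sketch of how such a mapping could be built from a list of words, assuming the space character is simply given the last index (as in the output above), one might write the following. The function name `build_char_dict` and the toy word list are hypothetical.

```python
def build_char_dict(words, last_char=" "):
    """Map each distinct character to an index (hypothetical helper).

    Characters are sorted, and `last_char` is appended with the last index,
    mirroring the layout of the dictionary displayed above.
    """
    alphabet = sorted({ch for word in words for ch in word if ch != last_char})
    char_to_index = {ch: i for i, ch in enumerate(alphabet)}
    char_to_index[last_char] = len(char_to_index)
    return char_to_index


# Toy word list, invented for illustration only.
demo_dict = build_char_dict(["mobalik", "nibalik", "nag-balik"])
print(demo_dict)
```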

Training set

In [5]:
Dict = read(f"task0-data/{rep}/{region}/{lang}.trn")
df = dict_to_frame(Dict, 'train')

In [6]:
# Sample 15 distinct rows: randint draws with replacement and can repeat an index
test_index = np.random.choice(len(df), size=15, replace=False)

In [7]:
train_df = df.drop(index=test_index)
new_indexes_train = np.arange(len(train_df))
train_df = train_df.set_index(new_indexes_train)

In [8]:
test_df = df.loc[test_index]
new_indexes_test = np.arange(len(test_df))
test_df = test_df.set_index(new_indexes_test)

## Encoding forms and lemmas

In order to prepare our training, we encode both the lemmas and the forms of the training dataset as one-hot matrices. Each word becomes a matrix of dimension $M \times T$, where $M$ is the length of the longest word in the training dataset and $T$ is the size of the alphabet (i.e. the length of `char_dict`). If the $i^{th}$ character of a word is the $j^{th}$ character of the alphabet, we put a $1.0$ at coordinate $(i,j)$ and zeros everywhere else. Applying this scheme to each of the $N$ inputs of the training dataset yields encoded tensors $X_{train}$ and $Y_{train}$ of shape $N \times M \times T$.

In [9]:
M = get_longest(train_df)
T = len(char_dict)
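
The encoding scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the code of `encode_data`, which is not shown here: the helper names `one_hot_encode` and `one_hot_decode` and the toy alphabet are hypothetical, and rows beyond the end of the word are assumed to stay all-zero.

```python
import numpy as np


def one_hot_encode(word, char_to_index, max_len):
    """Return an (max_len, alphabet size) one-hot matrix for `word`."""
    encoded = np.zeros((max_len, len(char_to_index)))
    for i, ch in enumerate(word[:max_len]):
        encoded[i, char_to_index[ch]] = 1.0
    return encoded


def one_hot_decode(encoded, char_to_index):
    """Invert the encoding, skipping all-zero rows (positions past the word)."""
    index_to_char = {j: ch for ch, j in char_to_index.items()}
    return "".join(
        index_to_char[int(row.argmax())] for row in encoded if row.any()
    )


# Toy alphabet, invented for illustration only.
toy_dict = {"a": 0, "b": 1, "k": 2}
toy_encoded = one_hot_encode("bak", toy_dict, max_len=5)
toy_decoded = one_hot_decode(toy_encoded, toy_dict)
```

Stacking one such matrix per word gives an array of shape $N \times M \times T$.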

## Encoding tags

In [10]:
categories_df = pd.DataFrame(categories, columns=["categories"])

categories_to_drop = []

In [11]:
tags_file['categories']

{'Aktionsart': ['STAT',
  'DYN',
  'TEL',
  'ATEL',
  'PCT',
  'DUR',
  'ACH',
  'ACCMP',
  'SEMEL',
  'ACTY',
  'DUR+SEMEL',
  'DUR+STAT'],
 'Animacy': ['ANIM', 'INAN', 'HUM', 'NHUM'],
 'Argument Marking': ['NO1S',
  'NO1P',
  'NO2S',
  'NO2P',
  'NO3S',
  'NO3SA',
  'NO3SI',
  'NO3PA',
  'NO3P',
  'AC1S',
  'AC1P',
  'AC2S',
  'AC2P',
  'AC3S',
  'AC3P',
  'AC1',
  'AC2',
  'AC3',
  'AB1S',
  'AB1P',
  'AB2S',
  'AB2P',
  'AB3S',
  'AB3P',
  'ER1S',
  'ER1P',
  'ER2S',
  'ER2P',
  'ER3S',
  'ER3P',
  'DA1S',
  'DA1P',
  'DA2S',
  'DA2P',
  'DA3S',
  'DA3P',
  'BE1S',
  'BE1P',
  'BE2S',
  'BE2P',
  'BE3S',
  'BE3P'],
 'Aspect': ['ITER',
  'IPFV',
  'PFV',
  'PRF',
  'PROG',
  'PFV+PROG',
  'PRF+PROG',
  'PROSP',
  'HAB',
  'HAB+PROG',
  'HAB+PRF',
  'HAB+IPFV'],
 'Case': ['NOM',
  'ACC',
  'ACC+COMPV',
  'LOC',
  'ERG',
  'ABS',
  'NOMS',
  'DAT',
  'DAT+COMPV',
  'BEN',
  'PRP',
  'GEN',
  'REL',
  'PRT',
  'INS',
  'INS+COMPV',
  'INS+DAT',
  'COM',
  'COM+TERM',
  'COM+ACC',
  'VO

Encoding of a given language

In [12]:
encoded_train_df = encode_tags_dataset(train_df)

In [13]:
encoded_train_df

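| | bundle | form | lemma | Part of Speech | Aktionsart | Animacy | Argument Marking | Aspect | Case | Comparison | ... | Number | New | Person | Polarity | Politeness | Possession | Switch-Reference | Tense | Valency | Voice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | V;PST | nibalik | mobalik | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 1 | V;PRS | nagbalik | mobalik | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | V;NFIN | mobalik | mobalik | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | V;PROG;PRS | ningbalik | mobalik | 9 | 0 | 0 | 0 | 5 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | V;FUT | mobalik | mobalik | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 397 | V;PST | daw | daw | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 398 | V;PRF;PST | daw | daw | 9 | 0 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 399 | V;FUT | daw | daw | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 |
| 400 | V;PRS | nagbarog | mobarog | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |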


### Bundle analysis

Classes of different bundles.
`Values`, `Indexes` and `train_sets_per_classes` are dictionaries whose keys are integers referring to classes.

`Values[i]` provides the reference projected bundle defining class `i`, `Indexes[i]` the corresponding sample indexes in the training set, and `train_sets_per_classes[i]` the training subset on which class `i`'s model is trained.

Categories may be dropped when defining the classes, i.e. set to 0 (cf. the sparsity analysis).
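
To make the class structure concrete, here is a minimal sketch of how identical encoded bundles could be grouped into classes, and how a new bundle could be matched against them. It is not the code of `extracting_train_input_models` or `assign_class_test`, which are not shown in this notebook: the helper names `group_by_bundle` and `lookup_class`, the exact-match rule, and the toy data are all assumptions.

```python
import pandas as pd


def group_by_bundle(encoded_df, category_columns):
    """Build class -> reference bundle and class -> row indexes dictionaries."""
    values, indexes = {}, {}
    grouped = encoded_df.groupby(category_columns, sort=False)
    for class_id, (key, group) in enumerate(grouped):
        key = key if isinstance(key, tuple) else (key,)
        values[class_id] = tuple(int(v) for v in key)
        indexes[class_id] = list(group.index)
    return values, indexes


def lookup_class(bundle, values):
    """Return the class whose reference bundle equals `bundle`, else None."""
    target = tuple(int(v) for v in bundle)
    for class_id, reference in values.items():
        if reference == target:
            return class_id
    return None


# Toy encoded bundles restricted to two categories, for illustration only.
toy_df = pd.DataFrame({"Part of Speech": [9, 9, 9], "Tense": [2, 1, 2]})
toy_values, toy_indexes = group_by_bundle(toy_df, ["Part of Speech", "Tense"])
```

With this toy data, rows 0 and 2 share the bundle `(9, 2)` and fall into the same class, while a bundle never seen in training returns `None` and would need a fallback rule.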

In [14]:
train_Values = extracting_train_input_models(train_df)[0]

In [15]:
categories_to_drop = []

## Test set classification

In [16]:
test_bundle = np.array([9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0])

In [17]:
encoded_test_df = encode_tags_dataset(test_df)
encoded_test_df

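| | bundle | form | lemma | Part of Speech | Aktionsart | Animacy | Argument Marking | Aspect | Case | Comparison | ... | Number | New | Person | Polarity | Politeness | Possession | Switch-Reference | Tense | Valency | Voice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | V;PST | niabli | moabli | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 1 | V;PRF;PST | nalupad | molupad | 9 | 0 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 2 | V;FUT | momaneho | momaneho | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 |
| 3 | V;PRS | naggamit | mogamit | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | V;PST | nitubag | motubag | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 5 | V;NFIN | mamatay | mamatay | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | V;PROG;PRS | ningpakita | mopakita | 9 | 0 | 0 | 0 | 5 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | V;PST | nitubag | motubag | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 8 | V;PROG;PRS | ningsulti | mosulti | 9 | 0 | 0 | 0 | 5 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9 | V;PST | nipakita | mopakita | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |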


In [18]:
encoded_test_df['model_class'] = encoded_test_df.apply(
    lambda x: assign_class_test(x[categories].values, train_Values, []), axis=1
)

In [19]:
encoded_test_df

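| | bundle | form | lemma | Part of Speech | Aktionsart | Animacy | Argument Marking | Aspect | Case | Comparison | ... | New | Person | Polarity | Politeness | Possession | Switch-Reference | Tense | Valency | Voice | model_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | V;PST | niabli | moabli | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 1 | V;PRF;PST | nalupad | molupad | 9 | 0 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 5 |
| 2 | V;FUT | momaneho | momaneho | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 4 |
| 3 | V;PRS | naggamit | mogamit | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | V;PST | nitubag | motubag | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 5 | V;NFIN | mamatay | mamatay | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 6 | V;PROG;PRS | ningpakita | mopakita | 9 | 0 | 0 | 0 | 5 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 |
| 7 | V;PST | nitubag | motubag | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 8 | V;PROG;PRS | ningsulti | mosulti | 9 | 0 | 0 | 0 | 5 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 |
| 9 | V;PST | nipakita | mopakita | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |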


Sparsity analysis in training set bundles

In [20]:
X = encoded_train_df[categories]

In [21]:
D = {}
for cat in categories:
    values = X[cat].unique()
    indexes_values = {}
    for val in values:
        indexes_values[val] = list(X.index[X[cat] == val])
    D[cat] = indexes_values

In [22]:
E = pd.DataFrame(columns=['categories', 'number_unique'])
for i in range(len(categories)):
    cat = categories[i]
    values = X[cat].unique()
    E.loc[i] = [cat, len(values)]

E

| | categories | number_unique |
|---|---|---|
| 0 | Part of Speech | 1 |
| 1 | Aktionsart | 1 |
| 2 | Animacy | 1 |
| 3 | Argument Marking | 1 |
| 4 | Aspect | 3 |
| 5 | Case | 1 |
| 6 | Comparison | 1 |
| 7 | Definiteness | 1 |
| 8 | Deixis | 1 |
| 9 | Evidentiality | 1 |


In [23]:
A = E.sort_values(by='number_unique', axis=0, ignore_index=True)
A[A.number_unique > 1]

| | categories | number_unique |
|---|---|---|
| 22 | Finiteness | 2 |
| 23 | Aspect | 3 |
| 24 | Tense | 4 |


# Training

## Linear regression

In [24]:
extract_train = extracting_train_input_models(train_df)
train_Values, Non_encoded_train_Values, train_sets_per_classes = extract_train[0], extract_train[1], extract_train[3]
models, X_train, Y_train = train_linear_reg(train_sets_per_classes, char_dict, M)

In [25]:
Y_hat_train = {}

# Predict the forms for each class-specific training set
for i in models:
    Y_hat_train[i] = models[i].predict(X_train[i])

# Reshape the encoded arrays to shape (N, M, T)
for i in X_train:
    (N,_) = X_train[i].shape
    X_train[i] = X_train[i].reshape((N,M,T))
    Y_train[i] = Y_train[i].reshape((N,M,T))
    Y_hat_train[i] = Y_hat_train[i].reshape((N,M,T))

In [26]:
# Write one line per sample: lemma, bundle, true form, predicted form
with open("results/results_train.txt", "w") as f:
    for i in models:
        lemma = decode_vector(X_train[i], char_dict)
        form = decode_vector(Y_train[i], char_dict)
        Y_hat_train[i] = convert_pred(Y_hat_train[i])
        form_hat = decode_vector(Y_hat_train[i], char_dict)
        bundle = Non_encoded_train_Values[i]

        for j in range(len(lemma)):
            f.write(f"{lemma[j]}\t{bundle}\t{form[j]}\t{form_hat[j]}\n")

Linear regression intuition:
Input space: each lemma is represented by the distribution of the alphabet's characters over the positions of the word (its flattened one-hot encoding).
Target space: the same representation, applied to the inflected form.
The parameter $\theta$ learned by linear regression is the linear map linking these two spaces.
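
As an illustration of this intuition, the sketch below fits a linear map between flattened one-hot encodings on a toy class where every form is the lemma prefixed with "ni". It uses `np.linalg.lstsq` as a stand-in for whatever `train_linear_reg` does internally, and the threshold-based decoding is an assumption (the behaviour of `convert_pred` is not shown here). The alphabet and word pairs are invented.

```python
import numpy as np

# Toy alphabet and (lemma, form) pairs, invented for illustration only.
alphabet = {"a": 0, "b": 1, "i": 2, "n": 3}
pairs = [("ab", "niab"), ("ba", "niba"), ("aa", "niaa"), ("bb", "nibb")]
M, T = 4, len(alphabet)


def encode(word):
    """Flattened one-hot encoding of length M * T."""
    x = np.zeros((M, T))
    for i, ch in enumerate(word):
        x[i, alphabet[ch]] = 1.0
    return x.ravel()


def decode(matrix, threshold=0.5):
    """Per position, keep the best-scoring character if it exceeds the threshold."""
    index_to_char = {j: ch for ch, j in alphabet.items()}
    return "".join(
        index_to_char[int(row.argmax())] for row in matrix if row.max() > threshold
    )


X = np.stack([encode(lemma) for lemma, _ in pairs])
Y = np.stack([encode(form) for _, form in pairs])

# Least-squares fit of Y against [X, 1]; the column of ones plays the intercept.
X_bias = np.hstack([X, np.ones((len(pairs), 1))])
theta, *_ = np.linalg.lstsq(X_bias, Y, rcond=None)

Y_hat = (X_bias @ theta).reshape(len(pairs), M, T)
predicted_forms = [decode(y) for y in Y_hat]
print(predicted_forms)
```

Here the intercept learns the constant prefix, while the rest of `theta` copies each lemma character two positions to the right. Real inflection classes are rarely this regular, which is one reason the predictions can contain spurious letters.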

# To tackle the problem of too many predicted letters

1. Predict word lengths per class using linear regression, then truncate the predicted form accordingly.
2. Estimate the probability of a letter being present or not, given the norm, by counting over the training set, per class.
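
The first option can be sketched as follows. This is not the code of `train_len_words` or `convert_pred_limit`, which are not shown in this notebook: the helper names, the choice of the lemma length as the only regressor, and the use of `np.polyfit` are assumptions made for illustration.

```python
import numpy as np


def fit_length_model(lemmas, forms):
    """Fit len(form) as a linear function of len(lemma) for one class."""
    x = np.array([len(w) for w in lemmas], dtype=float)
    y = np.array([len(w) for w in forms], dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


def truncate_prediction(predicted_form, lemma, slope, intercept):
    """Cut a decoded prediction to the length predicted for its lemma."""
    predicted_len = int(round(slope * len(lemma) + intercept))
    return predicted_form[: max(predicted_len, 0)]


# Toy V;PST class built from word pairs of the same shape as the data above.
toy_lemmas = ["mobalik", "mobarog", "mogamit", "molupad", "moabli"]
toy_forms = ["nibalik", "nibarog", "nigamit", "nilupad", "niabli"]
toy_slope, toy_intercept = fit_length_model(toy_lemmas, toy_forms)

# "nitubagik" stands for a hypothetical prediction with two spurious letters.
toy_trimmed = truncate_prediction("nitubagik", "motubag", toy_slope, toy_intercept)
```

In this toy class the form always has the same length as the lemma, so the over-long prediction is cut back to seven characters.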

In [27]:
models_len, X_train_len, Y_train_len = train_len_words(train_sets_per_classes)

In [28]:
X_train, Y_train = encode_sets(train_sets_per_classes, char_dict, M)

In [29]:
predict_and_save(X_train, Y_train, train_sets_per_classes, models, models_len, M, T, char_dict, Non_encoded_train_Values, "results/results_train.txt")

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 4,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 2,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 14,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 2,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 14,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 3,
 0,
 4,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 2,
 0,
 0,
 2,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 2,
 1,
 0,
 0,
 2,
 0,
 0,
 1,
 1,
 14,
 2,
 0,
 

# Predict on test set

In [30]:
test_sets_per_classes = extracting_test_sets_per_classes(test_df, train_Values)

In [31]:
X_test, Y_test = encode_sets(test_sets_per_classes, char_dict, M)

In [32]:
predict_and_save(X_test, Y_test, test_sets_per_classes, models, models_len, M, T, char_dict, Non_encoded_train_Values, "results/results_test.txt")

[1, 1, 1, 5, 4, 3, 0, 0, 1, 1, 4, 1, 5, 0, 2]

# Iterating over the languages (for the future Python script)

In [39]:
results_train = pd.DataFrame(columns = ['size', 'correctness%', 'error_min', 'error_max', 'error_mean', 'error_variance', 'error_median', 'error_75_percentile', 'error_95_percentile'])
results_test = pd.DataFrame(columns = ['size', 'correctness%', 'error_min', 'error_max', 'error_mean', 'error_variance', 'error_median', 'error_75_percentile', 'error_95_percentile'])

In [41]:
languages = {
   'austronesian':['ceb', 'hil', 'mao', 'mlg', 'tgl'],
    'germanic':['ang', 'dan', 'deu', 'eng', 'frr', 'gmh', 'isl', 'nld', 'swe'],
    'niger-congo':['aka', 'gaa', 'kon', 'lin', 'lug', 'nya', 'sot', 'swa', 'zul'],
    'uralic':[ 'fin']
}

In [35]:
characters_file = open("task0-data/alpha.all",'r')

In [45]:
languages = {
       'germanic':[ 'gmh', 'isl', 'nld', 'swe'],
        'niger-congo':['aka', 'gaa', 'kon', 'lin', 'lug', 'nya', 'sot', 'swa', 'zul'],
        'uralic':[ 'fin']
}

In [48]:
lng = ['ceb', 'hil', 'mao', 'mlg', 'tgl', 'ang', 'dan', 'deu', 'eng', 'gmh', 'isl', 'nld', 'swe', 'aka', 'gaa', 'kon', 'lin', 'lug', 'nya', 'sot', 'swa', 'zul', 'fin']

In [46]:
for region in languages:
    for lang in languages[region]:
        char_dict = charac_dict(lang)
        Dict = read(f"task0-data/{rep}/{region}/{lang}.trn")
        df = dict_to_frame(Dict, 'train')
        size = len(df)
        test_size = size // 5
        train_size = size - test_size
        # Draw distinct rows so that the split sizes match test_size and train_size
        test_index = np.random.choice(len(df), size=test_size, replace=False)
        
        train_df = df.drop(index=test_index)
        new_indexes_train = np.arange(len(train_df))
        train_df = train_df.set_index(new_indexes_train)
        
        test_df = df.loc[test_index]
        new_indexes_test = np.arange(len(test_df))
        test_df = test_df.set_index(new_indexes_test)
        
        M = get_longest(train_df)
        T = len(char_dict)
        
        extract_train = extracting_train_input_models(train_df)
        train_Values, Non_encoded_train_Values, train_sets_per_classes = extract_train[0], extract_train[1], extract_train[3]
        models, X_train, Y_train = train_linear_reg(train_sets_per_classes, char_dict, M)
        
        models_len, X_train_len, Y_train_len = train_len_words(train_sets_per_classes)
        X_train, Y_train = encode_sets(train_sets_per_classes, char_dict, M) 
        error_train = predict_and_save(X_train, Y_train, train_sets_per_classes, models, models_len, M, T, char_dict, Non_encoded_train_Values, f"results/results_train_{lang}.txt")
        results_train.loc[lang] = [
            train_size,
            (1 - np.count_nonzero(error_train) / train_size)*100,
            np.min(error_train),
            np.max(error_train),
            np.mean(error_train),
            np.var(error_train),
            np.percentile(error_train, 50),
            np.percentile(error_train, 75),
            np.percentile(error_train, 95),
        ]
        test_sets_per_classes = extracting_test_sets_per_classes(test_df, train_Values)
        X_test, Y_test = encode_sets(test_sets_per_classes, char_dict, M)
        error_test = predict_and_save(X_test, Y_test, test_sets_per_classes, models, models_len, M, T, char_dict, Non_encoded_train_Values, f"results/results_test_{lang}.txt")
        results_test.loc[lang] = [
            test_size,
            (1 - np.count_nonzero(error_test) / test_size)*100,
            np.min(error_test),
            np.max(error_test),
            np.mean(error_test),
            np.var(error_test),
            np.percentile(error_test, 50),
            np.percentile(error_test, 75),
            np.percentile(error_test, 95),
        ]
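
The two summary rows built in the loop above follow the same recipe, which could be factored into a small helper. The sketch below is one possible way to do so; the name `summarize_errors` is hypothetical, and unlike the loop it takes the sample count from the error list itself rather than from the nominal split size. The demo list is the one displayed after cell `In [32]`.

```python
import numpy as np


def summarize_errors(errors):
    """Summary statistics of a list of per-sample errors (hypothetical helper)."""
    errors = np.asarray(errors, dtype=float)
    n = errors.size
    return {
        "size": n,
        "correctness%": (1 - np.count_nonzero(errors) / n) * 100,
        "error_min": errors.min(),
        "error_max": errors.max(),
        "error_mean": errors.mean(),
        "error_variance": errors.var(),
        "error_median": np.percentile(errors, 50),
        "error_75_percentile": np.percentile(errors, 75),
        "error_95_percentile": np.percentile(errors, 95),
    }


demo_errors = [1, 1, 1, 5, 4, 3, 0, 0, 1, 1, 4, 1, 5, 0, 2]
demo_summary = summarize_errors(demo_errors)
```

A row could then be stored with `results_test.loc[lang] = list(demo_summary.values())`, since the dictionary keys follow the column order used in this notebook.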

In [58]:
columns = ['size', 'correctness%', 'error_min', 'error_max', 'error_mean', 'error_variance', 'error_median', 'error_75_percentile', 'error_95_percentile']

In [59]:
results_train.columns = columns
results_test.columns = columns

In [49]:
results_train = results_train.loc[lng].sort_values(by='size', ascending=False)

In [50]:
results_test = results_test.loc[lng].sort_values(by='size', ascending=False)

In [51]:
results_train['size'] = results_train['size'].astype(int)
results_test['size'] = results_test['size'].astype(int)

In [60]:
results_train

| | size | correctness% | error_min | error_max | error_mean | error_variance | error_median | error_75_percentile | error_95_percentile |
|---|---|---|---|---|---|---|---|---|---|
| deu | 79524 | 68.024747 | 0.0 | 46.0 | 0.379916 | 0.435113 | 0.0 | 1.0 | 2.0 |
| fin | 79523 | 47.96474 | 0.0 | 34.0 | 0.745434 | 0.928387 | 1.0 | 1.0 | 2.95 |
| eng | 63983 | 71.48305 | 0.0 | 28.0 | 0.346443 | 0.449312 | 0.0 | 1.0 | 1.0 |
| swe | 43911 | 57.812393 | 0.0 | 41.0 | 0.545689 | 0.664162 | 0.0 | 1.0 | 2.0 |
| isl | 42649 | 41.039649 | 0.0 | 26.0 | 0.842187 | 0.999272 | 1.0 | 1.0 | 3.0 |
| nld | 31061 | 62.351502 | 0.0 | 21.0 | 0.510799 | 0.639342 | 0.0 | 1.0 | 2.0 |
| ang | 20652 | 43.022468 | 0.0 | 18.0 | 0.82371 | 0.971143 | 1.0 | 1.0 | 3.0 |
| dan | 11929 | 54.97527 | 0.0 | 28.0 | 0.62867 | 0.90529 | 0.0 | 1.0 | 2.0 |
| swa | 2700 | 99.703704 | 0.0 | 16.0 | 0.04631 | 0.738811 | 0.0 | 0.0 | 0.0 |
| lug | 2626 | 74.790556 | 0.0 | 17.0 | 0.299926 | 0.42003 | 0.0 | 0.0 | 1.0 |


In [53]:
results_test

| | size | correctness% | error_min | error_max | error_mean | error_variance | error_median | error_75_percentile | error_95_percentile |
|---|---|---|---|---|---|---|---|---|---|
| deu | 19881 | 65.087269 | 0.0 | 27.0 | 0.661536 | 3.539735 | 0.0 | 1.0 | 2.0 |
| fin | 19880 | 36.433602 | 0.0 | 34.0 | 1.762626 | 9.30145 | 1.0 | 2.0 | 8.0 |
| eng | 15995 | 71.078462 | 0.0 | 28.0 | 0.4005 | 1.00759 | 0.0 | 1.0 | 2.0 |
| swe | 10977 | 52.546233 | 0.0 | 27.0 | 1.046916 | 5.007273 | 0.0 | 1.0 | 4.0 |
| isl | 10662 | 33.905459 | 0.0 | 22.0 | 1.488651 | 5.109935 | 1.0 | 2.0 | 5.0 |
| nld | 7765 | 58.042498 | 0.0 | 19.0 | 0.857051 | 3.183171 | 0.0 | 1.0 | 3.0 |
| ang | 5163 | 19.42669 | 0.0 | 15.0 | 2.257021 | 5.909342 | 2.0 | 3.0 | 8.0 |
| dan | 2982 | 48.222669 | 0.0 | 21.0 | 0.963112 | 3.360812 | 1.0 | 1.0 | 3.0 |
| swa | 674 | 54.005935 | 0.0 | 16.0 | 0.70178 | 1.10246 | 0.0 | 1.0 | 2.0 |
| lug | 656 | 28.963415 | 0.0 | 17.0 | 1.914634 | 4.270152 | 1.0 | 3.0 | 6.0 |


In [57]:
results_train.to_latex(escape=False)

'\\begin{tabular}{lrrrrrrrrr}\n\\toprule\n{} &   size &  correctness% &  error_min &  error_max &  error_mean &  error_variance &  error_median &  error_75_percentile &  error_75_percentile \\\\\n\\midrule\ndeu &  79524 &     68.024747 &        0.0 &       46.0 &    0.379916 &        0.435113 &           0.0 &                  1.0 &                 2.00 \\\\\nfin &  79523 &     47.964740 &        0.0 &       34.0 &    0.745434 &        0.928387 &           1.0 &                  1.0 &                 2.95 \\\\\neng &  63983 &     71.483050 &        0.0 &       28.0 &    0.346443 &        0.449312 &           0.0 &                  1.0 &                 1.00 \\\\\nswe &  43911 &     57.812393 &        0.0 &       41.0 &    0.545689 &        0.664162 &           0.0 &                  1.0 &                 2.00 \\\\\nisl &  42649 &     41.039649 &        0.0 &       26.0 &    0.842187 &        0.999272 &           1.0 &                  1.0 &                 3.00 \\\\\nnld &  31061 &    