# Task 0 problem - Lucas Despin & Théo Fagnoni

In this notebook, we present methods to solve the task 0 problem described in this [GitHub repository](https://github.com/sigmorphon2020/task0-data).
We propose several ideas to tackle the problem, along with implementations and the results obtained on different datasets.

## Imports

In [129]:
import numpy as np
import pandas as pd
import os
import yaml
import codecs
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from evaluate import distance

#To access data
from functions import tags_file, categories, categories_df
from functions import extracting_test_sets_per_classes
from functions import charac_dict, read, dict_to_frame

#To encode lemmas
from functions import get_longest, encode_data

#To encode tags
from functions import encode_tag, encode_tags_dataset

#To classify test bundles
from functions import indicator_non_zero, assign_class_test

#To train
from functions import train_linear_reg, train_len_words, decode_vector, convert_pred, convert_pred_limit

## Access data

We start by choosing a specific language. We can then build a character dictionary that maps each possible character of this language to an index.
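As a rough illustration of what such a dictionary looks like, here is a minimal sketch that builds one from a list of words. The helper `build_char_dict` is hypothetical and is not the notebook's `charac_dict`, which reads the alphabet from `task0-data/alpha.all` and may order characters differently.

```python
def build_char_dict(words):
    """Map every distinct character found in `words` to a unique integer index."""
    # Hypothetical helper: characters are simply sorted, then numbered from 0.
    alphabet = sorted({char for word in words for char in word})
    return {char: index for index, char in enumerate(alphabet)}

toy_dict = build_char_dict(["daw", "nag-abli"])
print(toy_dict)
```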

In [130]:
rep = "DEVELOPMENT-LANGUAGES"
region = "austronesian"
lang = "ceb"

In [131]:
with open("task0-data/alpha.all", 'r') as characters_file:
    char_dict = charac_dict(characters_file, lang)

In [132]:
char_dict

{'-': 0,
 'a': 1,
 'b': 2,
 'd': 3,
 'e': 4,
 'g': 5,
 'h': 6,
 'i': 7,
 'k': 8,
 'l': 9,
 'm': 10,
 'n': 11,
 'o': 12,
 'p': 13,
 'r': 14,
 's': 15,
 't': 16,
 'u': 17,
 'w': 18,
 'y': 19,
 ' ': 20}

Training set

In [133]:
Dict = read(f"task0-data/{rep}/{region}/{lang}.trn")
df = dict_to_frame(Dict, 'train')

In [134]:
# Hold out 15 distinct rows as a test set
test_index = np.random.choice(len(df), 15, replace=False)

In [135]:
train_df = df.drop(index=test_index).reset_index(drop=True)

In [136]:
test_df = df.loc[test_index].reset_index(drop=True)

## Encoding forms and lemmas

To prepare training, we encode both the lemmas and the forms of our training dataset as one-hot matrices. Each word is encoded as a matrix of dimension $M \times T$, where $M$ is the length of the longest word in the training dataset and $T$ is the size of the alphabet (i.e. the length of char_dict). If the $i^{th}$ character of a word is the $j^{th}$ character of the alphabet, we put a $1.0$ at coordinate $(i,j)$, and zeros everywhere else. Applying this scheme to each of the $N$ inputs of the training dataset yields encoded arrays $X_{train}$ and $Y_{train}$ of shape $N \times M \times T$.
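The encoding described above can be sketched as follows. This is only an illustration on a toy alphabet: `one_hot_word` is a hypothetical helper, and the sketch assumes that positions beyond the end of a word stay all-zero, which may differ from what `encode_data` actually does.

```python
import numpy as np

def one_hot_word(word, char_dict, M):
    """Encode one word as an (M, T) one-hot matrix; unused positions stay all-zero."""
    T = len(char_dict)
    encoded = np.zeros((M, T))
    for i, char in enumerate(word[:M]):
        encoded[i, char_dict[char]] = 1.0
    return encoded

# Toy alphabet and words, for illustration only.
toy_char_dict = {"a": 0, "b": 1, "k": 2}
toy_words = ["ab", "kab"]
M_toy = max(len(w) for w in toy_words)

X_toy = np.stack([one_hot_word(w, toy_char_dict, M_toy) for w in toy_words])
print(X_toy.shape)  # (2, 3, 3), i.e. (N, M, T)
```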

In [137]:
M = get_longest(train_df)
T = len(char_dict)

## Encoding tags

In [138]:
categories_df = pd.DataFrame(categories, columns=["categories"])

categories_to_drop = []

In [139]:
tags_file['categories']

{'Aktionsart': ['STAT',
  'DYN',
  'TEL',
  'ATEL',
  'PCT',
  'DUR',
  'ACH',
  'ACCMP',
  'SEMEL',
  'ACTY',
  'DUR+SEMEL',
  'DUR+STAT'],
 'Animacy': ['ANIM', 'INAN', 'HUM', 'NHUM'],
 'Argument Marking': ['NO1S',
  'NO1P',
  'NO2S',
  'NO2P',
  'NO3S',
  'NO3SA',
  'NO3SI',
  'NO3PA',
  'NO3P',
  'AC1S',
  'AC1P',
  'AC2S',
  'AC2P',
  'AC3S',
  'AC3P',
  'AC1',
  'AC2',
  'AC3',
  'AB1S',
  'AB1P',
  'AB2S',
  'AB2P',
  'AB3S',
  'AB3P',
  'ER1S',
  'ER1P',
  'ER2S',
  'ER2P',
  'ER3S',
  'ER3P',
  'DA1S',
  'DA1P',
  'DA2S',
  'DA2P',
  'DA3S',
  'DA3P',
  'BE1S',
  'BE1P',
  'BE2S',
  'BE2P',
  'BE3S',
  'BE3P'],
 'Aspect': ['ITER',
  'IPFV',
  'PFV',
  'PRF',
  'PROG',
  'PFV+PROG',
  'PRF+PROG',
  'PROSP',
  'HAB',
  'HAB+PROG',
  'HAB+PRF',
  'HAB+IPFV'],
 'Case': ['NOM',
  'ACC',
  'ACC+COMPV',
  'LOC',
  'ERG',
  'ABS',
  'NOMS',
  'DAT',
  'DAT+COMPV',
  'BEN',
  'PRP',
  'GEN',
  'REL',
  'PRT',
  'INS',
  'INS+COMPV',
  'INS+DAT',
  'COM',
  'COM+TERM',
  'COM+ACC',
  'VO

Encoding of a given language

In [140]:
encoded_train_df = encode_tags_dataset(train_df)

In [141]:
encoded_train_df

Unnamed: 0,bundle,form,lemma,Part of Speech,Aktionsart,Animacy,Argument Marking,Aspect,Case,Comparison,...,Number,New,Person,Polarity,Politeness,Possession,Switch-Reference,Tense,Valency,Voice
0,V;PST,nibalik,mobalik,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
1,V;PRS,nagbalik,mobalik,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,V;NFIN,mobalik,mobalik,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,V;FUT,mabuhi,mabuhi,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,5,0,0
4,V;PRS,nabuhi,mabuhi,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
396,V;PST,daw,daw,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
397,V;PRF;PST,daw,daw,9,0,0,0,4,0,0,...,0,0,0,0,0,0,0,2,0,0
398,V;FUT,daw,daw,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,5,0,0
399,V;PRS,nagbarog,mobarog,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


### Bundle analysis

Classes of different bundles.
"Values", "Indexes" and "train_sets_per_classes" are dictionaries whose keys are integers referring to classes.

Values[ i ] is the reference projected bundle defining class i, Indexes[ i ] gives the corresponding sample indexes in the training set, and train_sets_per_classes[ i ] is the training subset on which class i's model is trained.

Categories might be dropped when defining the classes, i.e. set to 0 (cf. the sparsity analysis).
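The grouping idea can be sketched on plain tuples, without the notebook's helpers. The function `group_by_bundle` and the toy data are hypothetical. The sketch only illustrates how identical encoded bundles end up in the same class, and how setting a category to 0 merges classes.

```python
def group_by_bundle(rows, categories, categories_to_drop=()):
    """Return (values, indexes): one reference bundle and its row indexes per class."""
    values, indexes, class_of = {}, {}, {}
    for idx, row in enumerate(rows):
        # Dropped categories are set to 0 so they no longer separate classes.
        key = tuple(0 if cat in categories_to_drop else code
                    for cat, code in zip(categories, row))
        if key not in class_of:
            class_of[key] = len(class_of)
            values[class_of[key]] = key
            indexes[class_of[key]] = []
        indexes[class_of[key]].append(idx)
    return values, indexes

toy_categories = ["Part of Speech", "Aspect", "Tense"]
toy_rows = [(9, 0, 2), (9, 0, 1), (9, 4, 2), (9, 0, 2)]

values, indexes = group_by_bundle(toy_rows, toy_categories)
print(indexes)  # {0: [0, 3], 1: [1], 2: [2]}

_, merged = group_by_bundle(toy_rows, toy_categories, categories_to_drop=("Aspect",))
print(merged)   # {0: [0, 2, 3], 1: [1]}
```

As in the notebook's function, classes are numbered in order of first appearance in the data.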

In [142]:
def extracting_train_input_models(df: pd.DataFrame):
    '''
    Split a non-encoded dataframe into classes of identical encoded bundles.
    Return four dictionaries keyed by class: the encoded bundle values,
    the non-encoded bundles, the sample indexes and the corresponding subsets.
    '''
    # Encoding
    encoded_df = encode_tags_dataset(df)
    
    # Extracting    
    Values = {}
    Non_encoded_Values = {}
    Indexes = {}
    sets_per_classes = {}

    Y = encoded_df[categories].copy()
    i = 0
    while len(Y) != 0:
        # The first remaining row defines the reference bundle of class i
        value = Y.values[0]
        Values[i] = value

        # Rows whose encoded bundle matches the reference on every category
        matches = (Y == value).all(axis=1)
        index = matches[matches].index
        Indexes[i] = index

        sets_per_classes[i] = df.loc[index].copy()
        Non_encoded_Values[i] = sets_per_classes[i]['bundle'].unique()[0]

        # Remove the rows that have just been assigned to class i
        Y = Y.drop(index)
        i += 1
    
    return Values, Non_encoded_Values, Indexes, sets_per_classes

In [143]:
train_Values = extracting_train_input_models(train_df)[0]

In [144]:
categories_to_drop = []

## Test set classification

In [145]:
test_bundle = np.array([9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0])

In [146]:
encoded_test_df = encode_tags_dataset(test_df)
encoded_test_df

Unnamed: 0,bundle,form,lemma,Part of Speech,Aktionsart,Animacy,Argument Marking,Aspect,Case,Comparison,...,Number,New,Person,Polarity,Politeness,Possession,Switch-Reference,Tense,Valency,Voice
0,V;PRS,nag-abli,moabli,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,V;PRS,naghitabo,mahitabo,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,V;FUT,mobalik,mobalik,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,5,0,0
3,V;PRS,nahimo nga,mahimo nga,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,V;PROG;PRS,ningsugod,mosugod,9,0,0,0,5,0,0,...,0,0,0,0,0,0,0,1,0,0
5,V;FUT,magtinguha,magtinguha,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,5,0,0
6,V;PRF;PST,napakita,mopakita,9,0,0,0,4,0,0,...,0,0,0,0,0,0,0,2,0,0
7,V;PROG;PRS,nagluto,magluto,9,0,0,0,5,0,0,...,0,0,0,0,0,0,0,1,0,0
8,V;PST,nibutang,mobutang,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
9,V;PST,nilakaw,molakaw,9,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [147]:
encoded_test_df['model_class'] = encoded_test_df.apply(
    lambda x: assign_class_test(x[categories].values, train_Values, []), axis=1
)

In [148]:
encoded_test_df

Unnamed: 0,bundle,form,lemma,Part of Speech,Aktionsart,Animacy,Argument Marking,Aspect,Case,Comparison,...,New,Person,Polarity,Politeness,Possession,Switch-Reference,Tense,Valency,Voice,model_class
0,V;PRS,nag-abli,moabli,9,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
1,V;PRS,naghitabo,mahitabo,9,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
2,V;FUT,mobalik,mobalik,9,0,0,0,0,0,0,...,0,0,0,0,0,0,5,0,0,3
3,V;PRS,nahimo nga,mahimo nga,9,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
4,V;PROG;PRS,ningsugod,mosugod,9,0,0,0,5,0,0,...,0,0,0,0,0,0,1,0,0,4
5,V;FUT,magtinguha,magtinguha,9,0,0,0,0,0,0,...,0,0,0,0,0,0,5,0,0,3
6,V;PRF;PST,napakita,mopakita,9,0,0,0,4,0,0,...,0,0,0,0,0,0,2,0,0,5
7,V;PROG;PRS,nagluto,magluto,9,0,0,0,5,0,0,...,0,0,0,0,0,0,1,0,0,4
8,V;PST,nibutang,mobutang,9,0,0,0,0,0,0,...,0,0,0,0,0,0,2,0,0,0
9,V;PST,nilakaw,molakaw,9,0,0,0,0,0,0,...,0,0,0,0,0,0,2,0,0,0


Sparsity analysis in training set bundles

In [149]:
X = encoded_train_df[categories]

In [150]:
# For each category, map every value it takes to the indexes of the rows holding that value
D = {}
for cat in categories:
    values = X[cat].unique()
    indexes_values = {}
    for val in values:
        indexes_values[val] = list(X.index[X[cat] == val])
    D[cat] = indexes_values

In [151]:
E = pd.DataFrame(columns=['categories', 'number_unique'])
for i in range(len(categories)):
    cat = categories[i]
    values = X[cat].unique()
    E.loc[i] = [cat, len(values)]

E

Unnamed: 0,categories,number_unique
0,Part of Speech,1
1,Aktionsart,1
2,Animacy,1
3,Argument Marking,1
4,Aspect,3
5,Case,1
6,Comparison,1
7,Definiteness,1
8,Deixis,1
9,Evidentiality,1


In [152]:
A = E.sort_values(by='number_unique', axis=0, ignore_index=True)
A[A.number_unique > 1]

Unnamed: 0,categories,number_unique
22,Finiteness,2
23,Aspect,3
24,Tense,4
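
The counting step of this sparsity analysis can be sketched with NumPy alone. The toy bundle matrix below is made up for illustration. The sketch simply counts distinct codes per category and keeps the categories that take more than one value, which are the only ones able to separate classes.

```python
import numpy as np

# Toy encoded bundles: one row per sample, one column per category.
toy_categories = ["Part of Speech", "Aspect", "Tense"]
toy_bundles = np.array([
    [9, 0, 2],
    [9, 0, 1],
    [9, 4, 2],
    [9, 5, 1],
    [9, 0, 5],
])

# Number of distinct codes observed in each category.
n_unique = {cat: len(np.unique(toy_bundles[:, k]))
            for k, cat in enumerate(toy_categories)}

# Categories with a single value cannot distinguish two bundles.
informative = [cat for cat, n in n_unique.items() if n > 1]
print(n_unique)
print(informative)
```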


# Training

## Linear regression

In [165]:
train_df

Unnamed: 0,bundle,form,lemma
0,V;PST,nibalik,mobalik
1,V;PRS,nagbalik,mobalik
2,V;NFIN,mobalik,mobalik
3,V;FUT,mabuhi,mabuhi
4,V;PRS,nabuhi,mabuhi
...,...,...,...
396,V;PST,daw,daw
397,V;PRF;PST,daw,daw
398,V;FUT,daw,daw
399,V;PRS,nagbarog,mobarog


In [162]:
extract_train = extracting_train_input_models(train_df)
train_Values, Non_encoded_train_Values, train_sets_per_classes = extract_train[0], extract_train[1], extract_train[3]
models, X_train, Y_train = train_linear_reg(train_sets_per_classes, char_dict, M)

In [158]:
X_train, Y_train = encode_sets(train_sets_per_classes, M)

In [163]:
train_sets_per_classes

{0:     bundle        form       lemma
 0    V;PST     nibalik     mobalik
 7    V;PST      nabuhi      mabuhi
 11   V;PST      nibati      mobati
 19   V;PST     nigunit     mogunit
 23   V;PST      niadto      moadto
 ..     ...         ...         ...
 384  V;PST     nitubag     motubag
 389  V;PST     nigawas     mogawas
 392  V;PST    nimaneho    momaneho
 394  V;PST  nakadungog  makadungog
 396  V;PST         daw         daw
 
 [74 rows x 3 columns],
 1:     bundle        form    lemma
 1    V;PRS    nagbalik  mobalik
 4    V;PRS      nabuhi   mabuhi
 12   V;PRS  nagpaminaw  maminaw
 20   V;PRS    nag-adto   moadto
 32   V;PRS    nagpatay  mopatay
 ..     ...         ...      ...
 360  V;PRS     nagkaon   mokaon
 365  V;PRS    naghuman  mohuman
 371  V;PRS     nagbasa   mobasa
 380  V;PRS    naghalok  mohalok
 399  V;PRS    nagbarog  mobarog
 
 [66 rows x 3 columns],
 2:      bundle      form     lemma
 2    V;NFIN   mobalik   mobalik
 5    V;NFIN    mabuhi    mabuhi
 14   V;NFIN

In [167]:
X_train

{0: array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 1: array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 2: array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 3: array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0.,

In [168]:
predict_and_save(X_train, Y_train, train_sets_per_classes, models, models_len, "results/results_train.txt")

0.800498753117207

Linear regression intuition:
The input space represents how the characters of the alphabet are distributed over the positions of a lemma.
The target space represents the same thing for the form.
The parameter $\theta$ trained during linear regression links these two spaces.
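
To make this intuition concrete, here is a minimal sketch on synthetic data. It does not reproduce `train_linear_reg`: it uses `np.linalg.lstsq` without an intercept, and the toy "inflection" (a cyclic shift of the characters) is an assumption chosen because it is exactly linear in the one-hot representation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 6, 3, 4  # toy sizes: samples, word length, alphabet size

# Toy "lemmas": one character index per position, one-hot encoded to (N, M, T).
lemma_chars = rng.integers(0, T, size=(N, M))
X = np.eye(T)[lemma_chars]

# Toy "inflection": every character moves one position to the right (cyclically).
form_chars = np.roll(lemma_chars, 1, axis=1)
Y = np.eye(T)[form_chars]

# Flatten to (N, M*T) and solve X_flat @ theta ~= Y_flat in the least-squares sense.
X_flat = X.reshape(N, M * T)
Y_flat = Y.reshape(N, M * T)
theta, *_ = np.linalg.lstsq(X_flat, Y_flat, rcond=None)

# Decode the predictions: keep the most activated character at each position.
Y_hat = (X_flat @ theta).reshape(N, M, T)
pred_chars = Y_hat.argmax(axis=-1)
print((pred_chars == form_chars).all())
```

Here $\theta$ is an $(M \cdot T) \times (M \cdot T)$ matrix: each of its columns says how strongly every (position, character) pair of the lemma contributes to one (position, character) pair of the form.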

# Tackling the problem of too many predicted letters

1. Predict word lengths per class using linear regression, then truncate the predicted form accordingly.
2. Estimate the probability of having a letter or not, given the norm, by counting over the training set, per class.
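
The first option can be sketched on a few V;PRS pairs taken from the training set. The sketch fits a line with `np.polyfit` and works directly on strings. The helpers `predict_length` and `truncate` are hypothetical and only approximate what `train_len_words` and `convert_pred_limit` are meant to do.

```python
import numpy as np

# A few (lemma, form) pairs of the V;PRS class.
lemmas = ["mobalik", "mokaon", "mobasa", "mohuman"]
forms = ["nagbalik", "nagkaon", "nagbasa", "naghuman"]

# Fit form length as a linear function of lemma length.
lemma_len = np.array([len(w) for w in lemmas], dtype=float)
form_len = np.array([len(w) for w in forms], dtype=float)
slope, intercept = np.polyfit(lemma_len, form_len, deg=1)

def predict_length(lemma):
    """Rounded predicted length of the form, given the lemma."""
    return int(round(slope * len(lemma) + intercept))

def truncate(raw_prediction, lemma):
    """Cut a decoded prediction to the predicted form length."""
    return raw_prediction[:predict_length(lemma)]

# A prediction with two spurious trailing letters is cut back to 8 characters.
print(truncate("nagbarogog", "mobarog"))  # nagbarog
```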

In [28]:
models_len, X_train_len, Y_train_len = train_len_words(train_sets_per_classes)

In [29]:
Y_train_len_hat = {}

for i in models_len:
    Y_train_len_hat[i] = models_len[i].predict(X_train_len[i].reshape(-1, 1)).round()

In [30]:
# Y_hat_train is assumed to hold the per-class predictions computed beforehand
with open("results_train_limited.txt", "w") as f:
    for i in train_sets_per_classes:
        M_dict = Y_train_len_hat[i]
        lemma = decode_vector(X_train[i], char_dict)
        form = decode_vector(Y_train[i], char_dict)
        # Cut each prediction to the predicted form length before decoding
        Y_hat_train[i] = convert_pred_limit(Y_hat_train[i], M_dict)
        form_hat = decode_vector(Y_hat_train[i], char_dict)
        bundle = Non_encoded_train_Values[i]

        for j in range(len(lemma)):
            f.write(f"{lemma[j]}\t{bundle}\t{form[j]}\t{form_hat[j]}\n")

# Predict on test set

In [95]:
test_sets_per_classes = extracting_test_sets_per_classes(test_df, train_Values)

In [96]:
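def encode_sets(sets_per_classes, M):
    '''
    One-hot encode the lemmas (X) and forms (Y) of every class,
    flattened to shape (N, M*T). Uses the notebook global char_dict.
    '''
    X = {}
    Y = {} 

    # Iterate over the classes of the argument, not over a global dictionary
    for i in sets_per_classes:
        subset = sets_per_classes[i]
        new_index = np.arange(len(subset))
        subset = subset.set_index(new_index)
        X_temp, Y_temp = encode_data(subset, char_dict, M)
        (N,_,T) = X_temp.shape
        
        X_temp = X_temp.reshape((N,M*T))
        Y_temp = Y_temp.reshape((N,M*T))

        X[i] = X_temp
        Y[i] = Y_temp
        
    return X, Y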

In [107]:
def encode_len(sets_per_classes):
    
    X_len = {}
    Y_len = {}
    
    for i in sets_per_classes:
        subset = sets_per_classes[i]
        new_index = np.arange(len(subset))
        subset = subset.set_index(new_index)
        
        X_len[i] = np.array([len(lemma) for lemma in subset["lemma"].values])
        Y_len[i] = np.array([len(form) for form in subset["form"].values])
    
    return X_len, Y_len

In [106]:
X_test = {}
Y_test = {}
X_test_len = {} 

for i in test_sets_per_classes:
    subset = test_sets_per_classes[i]
    new_index = np.arange(len(subset))
    subset = subset.set_index(new_index)
    X_test_temp, Y_test_temp = encode_data(subset, char_dict, M)
    (N,M,T) = X_test_temp.shape
    X_test_temp = X_test_temp.reshape((N,M*T))
    Y_test_temp = Y_test_temp.reshape((N,M*T))
        
    X_test[i] = X_test_temp
    X_test_len[i] = np.array([len(lemma) for lemma in subset["lemma"].values])
    Y_test[i] = Y_test_temp

In [108]:
def predict_len(X_len, sets_per_classes, models_len):
    Y_len_hat = {}
    for i in sets_per_classes:
        Y_len_hat[i] = models_len[i].predict(X_len[i].reshape(-1, 1)).round()
    return Y_len_hat

In [157]:
def predict_and_save(X, Y, sets_per_classes, models, models_len, saving_file="results/results.txt"):
    '''
    Predict the forms of every class, write one "lemma, bundle, form, predicted form"
    line per sample to saving_file and return the mean distance between forms and predictions.
    Uses the notebook globals M, T, char_dict and Non_encoded_train_Values.
    X and Y are reshaped in place to (N, M, T).
    '''
    (X_len, _) = encode_len(sets_per_classes)
    Y_len_hat = predict_len(X_len, sets_per_classes, models_len)
    
    Y_hat = {}
    
    # Predict the forms of each class with the model trained for that class
    for i in sets_per_classes:
        Y_hat[i] = models[i].predict(X[i])
        
    # Reshape the encoded vectors to shape (N, M, T)
    for i in sets_per_classes:
        (N,_) = X[i].shape
        X[i] = X[i].reshape((N,M,T))
        Y[i] = Y[i].reshape((N,M,T))
        Y_hat[i] = Y_hat[i].reshape((N,M,T))
        
    distances = []
    
    with open(saving_file, "w") as f:
        for i in sets_per_classes:
            M_dict = Y_len_hat[i]
            lemma = decode_vector(X[i], char_dict)
            form = decode_vector(Y[i], char_dict)
            # Cut each prediction to the predicted form length before decoding
            Y_hat[i] = convert_pred_limit(Y_hat[i], M_dict)
            form_hat = decode_vector(Y_hat[i], char_dict)
            bundle = Non_encoded_train_Values[i]

            for j in range(len(lemma)):
                distances.append(distance(form[j], form_hat[j]))
                f.write(f"{lemma[j]}\t{bundle}\t{form[j]}\t{form_hat[j]}\n")
    return np.mean(distances)

In [119]:
X_test, Y_test = encode_sets(test_sets_per_classes, M)

In [120]:
predict_and_save(X_test, Y_test, test_sets_per_classes, models, models_len, "results/results_test.txt")

1.7333333333333334

# For the future Python script iterating over the languages

In [74]:
languages = {
    'austronesian':['ceb', 'hil', 'mao', 'mlg', 'tgl'],
    'germanic':['ang', 'dan', 'deu', 'eng', 'frr', 'gmh', 'isl', 'nld', 'nob', 'swe'],
    'niger-congo':['aka', 'gaa', 'kon', 'lin', 'lug', 'nya', 'sot', 'swa', 'zul'],
    'oto-manguean':['azg', 'cly', 'cpa', 'ctp', 'czn', 'ote', 'otm', 'pei', 'xty', 'zpv'],
    'uralic':['est', 'fin', 'izh', 'ktl', 'liv', 'mdf', 'mhr', 'myv', 'sme', 'vep', 'vot']
}
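
A possible starting point for that script is sketched below. The generator `training_files` is hypothetical, the dictionary is a shortened copy of the one above, and the sketch assumes that every language sits under the same `rep` folder with the `task0-data/{rep}/{region}/{lang}.trn` layout used earlier in this notebook.

```python
from pathlib import Path

# Shortened copy of the languages dictionary, for illustration only.
toy_languages = {
    'austronesian': ['ceb', 'hil'],
    'germanic': ['ang', 'dan'],
}

def training_files(languages, rep="DEVELOPMENT-LANGUAGES", root="task0-data"):
    """Yield (region, lang, path of the .trn file) for every language."""
    for region, langs in languages.items():
        for lang in langs:
            yield region, lang, Path(root) / rep / region / f"{lang}.trn"

for region, lang, path in training_files(toy_languages):
    # The per-language pipeline (read, encode, train, predict) would go here.
    print(region, lang, path.as_posix())
```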