---
### **<p style="text-align: center; text-decoration: underline;">Natural Language Processing</p>**
# **<p style="text-align: center;">Practical: A bit of Morphological Warm up</p>**
---

> Realized by: *Zakaria Boulkhir* & *Omar Iken*.

> Master 2, Data Science, Lille University.

---

### ■ __Overview__
In this notebook, we propose models for the SIGMORPHON 2020 shared task 0. Using the development data, we design two models (one neural and one non-neural) that, given a lemma and a set of morphological attributes, return the corresponding inflected form.

### ■ **<a name="content">Contents</a>**

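- [1. Dataset](#warmup)

- [2. First Approach: Bag of words (Characters)](#section2)

- [3. Second Approach: Prefix-suffix-based approach](#section3)

- [4. Third Approach: Beyond Prefix & Suffix](#section4)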

### ■ **Libraries**

In [1]:
## numpy to handle arrays & matrices
import numpy as np

## matplotlib & Seaborn to plot figures
import matplotlib.pyplot as plt
import seaborn as sns

## pandas to handle dataframes
import pandas as pd
from scipy import stats
from tqdm import tqdm
import codecs
import os
from utils.utils import *
from utils.eval import *
import unidecode

## sklearn dependencies
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

from sklearn.ensemble import RandomForestClassifier

In [2]:
#-----------< Setting >------------#
## set plots text font size & style
sns.set(font_scale=1.2, style='whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

### ■ **<a name="warmup">1. Dataset</a>** [(&#8593;)](#content)
The objective of this section is to build a very simple system that predicts the inflected form of a lemma from its morphological attributes. To do so, we use the data available [here](https://github.com/sigmorphon2020/task0-data). We pick a language we do not speak, for instance *Akan (aka)*, and explore its data. Of course, our models should be applicable to other languages.

>**Note:**  The choice of the language *Akan* is due to the fact that it covers most of the cases we will deal with in this notebook. 

In [4]:
## file path
train_file = 'data/DEVELOPMENT-LANGUAGES/niger-congo/aka.trn'
test_file = 'data/GOLD-TEST/aka.tst'

## dataframe
df_train = read_file(train_file)
df_test = read_file(test_file)

## number of training & testing samples
n_train = df_train.shape[0]
n_test = df_test.shape[0]
print(f'Number of training samples: {n_train}')
print(f'Number of testing samples : {n_test}')

## display some samples
df_train.head()

Number of training samples: 2793
Number of testing samples : 763


Unnamed: 0,lemma,form,attributes
0,boro,bɛboroee,V;PST+IMMED
1,tow .. mu,tow .. mu,V;HAB;PRS
2,dum,dumee,V;HAB;PST
3,ba,ammbɛba,V;NEG;PST+IMMED
4,yi ayɛw,nnkɔyi ayɛwee,V;PRF;NEG;PRS;LGSPEC1


**Unique Characters:** Get the unique characters of the language.

In [5]:
## concatenate all lemmas & forms into a single string
text = ''.join(df_train[['lemma', 'form']].to_numpy().flatten())

## get (number of) unique characters
unique_chars = sorted(set(text))
n_chars = len(unique_chars)

print(f"> Number of unique characters: {n_chars}\n> Characters: {', '.join(unique_chars)}")

> Number of unique characters: 23
> Characters:  , ., a, b, d, e, f, g, h, i, k, m, n, o, p, r, s, t, u, w, y, ɔ, ɛ


**Morphological attributes:** Collect the distinct morphological attributes; a histogram of their frequencies would show which ones dominate.

In [6]:
## get unique morphological attributes
morph_attrs = ';'.join(df_train['attributes'].to_list()).split(';')
morph_attrs = np.asarray(morph_attrs)
unique_attrs = sorted(set(morph_attrs)) ## sort to keep same order
n_attrs = len(unique_attrs)

print(f"> Number of unique morphological attributes: {n_attrs}\n> Morphological attributes: {', '.join(unique_attrs)}")

> Number of unique morphological attributes: 16
> Morphological attributes: FUT, HAB, HAB+PRF, HAB+PROG, IMP, LGSPEC1, NEG, NFIN, PRF, PROG, PRS, PRS+IMMED, PST, PST+IMMED, SBJV, V
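
To see which attributes dominate, one can count how often each tag occurs. The snippet below is a minimal sketch, not part of the original pipeline: `sample_attributes` is a hypothetical list copied from the first five training rows displayed earlier, and `count_attributes` is an illustrative helper. With the real data, one would pass `df_train['attributes']` instead.

```python
from collections import Counter

## hypothetical sample: attribute strings of the first five training rows
sample_attributes = [
    'V;PST+IMMED',
    'V;HAB;PRS',
    'V;HAB;PST',
    'V;NEG;PST+IMMED',
    'V;PRF;NEG;PRS;LGSPEC1',
]

def count_attributes(attribute_strings):
    """count how many times each attribute tag occurs"""
    counts = Counter()
    for attrs in attribute_strings:
        counts.update(attrs.split(';'))
    return counts

attr_counts = count_attributes(sample_attributes)

## print the tags from most to least frequent
for attr, count in attr_counts.most_common():
    print(f'{attr:<10} {count}')
```

On this tiny sample, `V` appears in every row, which is expected since all Akan entries shown here are verbs.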


### ■ **<a name="section2">2. First Approach: Bag of words (Characters)</a>** [(&#8593;)](#content)
For this first simple approach, we use the bag-of-words method.
Its particularity is that we apply it to characters instead of words (bag-of-characters).

#### **2.1. Modeling the data**
A bag-of-characters gives a feature space of fixed size, since the set of characters of a language is fixed, unlike its set of words.
Also, unlike a bag-of-words over words, this method can represent any word, including unseen words of the test set, as long as its characters occur in the training data.
An important step of this method is to bring all lemmas and forms to a fixed size (number of characters): we compute the maximum number of characters of a lemma (and of a form) in the training data, and pad every shorter word with zeros up to that length.
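
The encoding described above can be illustrated on a toy alphabet. This is only a sketch under stated assumptions: `toy_chars`, `encode_word` and `decode_word` are hypothetical names, and the alphabet is a small subset of the Akan characters. Index 0 is reserved for padding, as in the cells that follow.

```python
## toy alphabet (assumption: a small subset of the Akan characters)
toy_chars = sorted(set('boroɛe'))
toy_char_dict = {char: i for i, char in enumerate(toy_chars, start=1)}
toy_inv_dict = {i: char for char, i in toy_char_dict.items()}

def encode_word(word, char_dict, max_length):
    """map each character to its index and right-pad with zeros"""
    indices = [char_dict[char] for char in word]
    return indices + [0] * (max_length - len(indices))

def decode_word(vector, inv_char_dict):
    """drop the zero padding and map indices back to characters"""
    return ''.join(inv_char_dict[i] for i in vector if i)

encoded = encode_word('boro', toy_char_dict, max_length=6)
print(encoded)
print(decode_word(encoded, toy_inv_dict))
```

Every word ends up as a vector of the same length, and decoding simply skips the zeros.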

In [7]:
## bag of words for characters
char_dict = dict(zip(unique_chars, range(1, len(unique_chars)+1)))
inv_char_dict = {n:char for char, n in char_dict.items()}

## bag of words for attributes
attr_dict = {attr:i for i, attr in enumerate(unique_attrs, start=1)}
attr_dict

{'FUT': 1,
 'HAB': 2,
 'HAB+PRF': 3,
 'HAB+PROG': 4,
 'IMP': 5,
 'LGSPEC1': 6,
 'NEG': 7,
 'NFIN': 8,
 'PRF': 9,
 'PROG': 10,
 'PRS': 11,
 'PRS+IMMED': 12,
 'PST': 13,
 'PST+IMMED': 14,
 'SBJV': 15,
 'V': 16}

In [8]:
## compute maximum possible characters in a lemma and a form
max_lemma_length = df_train['lemma'].apply(list).apply(len).max()
max_form_length = df_train['form'].apply(lambda x: len(list(x))).max()

## compute maximum possible number of attributes
max_n_attrs = df_train['attributes'].apply(lambda x: len(x.split(';'))).max()

## display the results
print(f'The maximum possible number of characters in a lemma: {max_lemma_length}')
print(f'The maximum possible number of characters in a form : {max_form_length}')
print(f'The maximum possible number of attributes for a given lemma & form: {max_n_attrs}')

The maximum possible number of characters in a lemma: 9
The maximum possible number of characters in a form : 19
The maximum possible number of attributes for a given lemma & form: 5


In [9]:
def pad(array, n, val):
    """to pad a given vector to size n with value val"""
    return np.append(array, np.full(n - len(array), val))

def vect2word(vect, char_dict):
    """convert a vector of character indices back to a word (zeros are padding)"""
    word = ''.join([char_dict[i] for i in vect if i])
    return word

def create_trainset(lemmas, forms, attributes, max_lemma, max_form, max_attrs):
    """encode lemmas & attributes as inputs and forms as targets (zero-padded index vectors)"""
    X_train, y_train = [], []
    for lemma, form, set_attrs in zip(lemmas, forms, attributes):
        ## map characters & attributes to their integer indices
        lemma_idx = [char_dict[char] for char in lemma]
        form_idx = [char_dict[char] for char in form]
        attr_idx = [attr_dict[attr] for attr in set_attrs]

        ## input: padded lemma followed by padded attributes
        x = np.append(pad(lemma_idx, max_lemma, 0), pad(attr_idx, max_attrs, 0))
        X_train.append(x)

        ## target: padded form
        y_train.append(pad(form_idx, max_form, 0))

    return np.array(X_train), np.array(y_train)

In [10]:
## get training & test set
X_train, y_train = create_trainset(df_train['lemma'].values,
                                   df_train['form'].values, 
                                   df_train['attributes'].apply(lambda x:x.split(';')).values,
                                   max_lemma_length, max_form_length, max_n_attrs)

X_test, y_test = create_trainset(df_test['lemma'].values, 
                                 df_test['form'].values,
                                 df_test['attributes'].apply(lambda x:x.split(';')).values, 
                                 max_lemma_length, max_form_length, max_n_attrs)

X_train.shape

(2793, 14)

#### **2.2. Train & Evaluate the model**
We mainly rely on Random Forests because they have shown promising results and suit our case, which we frame as a classification problem: each position of the target vector is a class to predict. Another very useful property of Random Forests is that they support multi-output classification, i.e. a single model predicts several target columns at once.

**Training**

In [11]:
## create & fit an RF model
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

## make predictions
y_pred = clf.predict(X_test)

## vects to words
words_prediction = [vect2word(vect, inv_char_dict) for vect in y_pred]
words_test = [vect2word(vect, inv_char_dict) for vect in y_test]

## compute Levenshtein distance
dist = [distance(word1, word2) for word1, word2 in zip(words_test, words_prediction)]
df_pred = pd.DataFrame([words_test, words_prediction, dist], 
                       index=['true', 'predicted', 'Levenshtein distance']).T
df_pred.head()

Unnamed: 0,true,predicted,Levenshtein distance
0,rekɔbisa,rekɔbaaa,2
1,kɔyiee,kɔyie,1
2,nna abɛkyiri,nna abɛhyina,3
3,akɔtwe,akɔttw,2
4,nna renyam,nna re,4
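
The `distance` function used above comes from the project's `utils` module, whose code is not shown here. Assuming it computes the standard Levenshtein (edit) distance, which is consistent with the values in the table, a minimal pure-Python sketch could look as follows (`levenshtein` is a hypothetical name, not the notebook's function):

```python
def levenshtein(source, target):
    """minimum number of insertions, deletions & substitutions turning source into target"""
    ## previous[j] holds the distance between the processed part of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

print(levenshtein('rekɔbisa', 'rekɔbaaa'))
print(levenshtein('kɔyiee', 'kɔyie'))
```

Only two rows of the table are kept in memory at any time, so the memory cost grows with the length of `target` rather than with the product of both lengths.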


**Evaluation**

In [12]:
print(f'- The word by word accuracy          : {word_accuracy(words_prediction, words_test)}')
print(f'- The character by character accuracy: {character_accuracy(words_prediction, words_test)}')

- The word by word accuracy          : 0.06684141546526867
- The character by character accuracy: 0.7578841743119266
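
The metrics above come from `utils.eval`, which is not shown in this notebook. Assuming that the word accuracy is the fraction of predictions that exactly match the reference form, it could be sketched as below (`exact_match_accuracy` is a hypothetical name; the definition of the character accuracy is not given, so it is left out):

```python
def exact_match_accuracy(predictions, references):
    """fraction of predicted words identical to their reference"""
    if len(predictions) != len(references):
        raise ValueError('predictions and references must have the same length')
    if not references:
        return 0.0
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)

## the five test rows displayed earlier: none of them is predicted exactly
true_forms = ['rekɔbisa', 'kɔyiee', 'nna abɛkyiri', 'akɔtwe', 'nna renyam']
pred_forms = ['rekɔbaaa', 'kɔyie', 'nna abɛhyina', 'akɔttw', 'nna re']
print(exact_match_accuracy(pred_forms, true_forms))
```

A single wrong character makes the whole word count as an error, which explains how a model can reach a high character accuracy and a low word accuracy at the same time.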


>**Comments:**
> - The model was able to learn the prefixes of most forms.
> - The model is not as accurate on the body of the lemma.
> - An interesting idea is to create a model that predicts the prefixes and suffixes of the form, and then puts them together with the lemma to get the actual form.

### ■ **<a name="section3">3. Second Approach: Prefix-suffix-based approach</a>** [(&#8593;)](#content)
For this second approach, we propose a new method that predicts the prefix and the suffix to attach to a given lemma; combining them with the lemma yields the actual form.
To do this, we extract the prefix and suffix of each form relative to its lemma, and treat each distinct prefix and suffix as a class label to be predicted by the model.

#### **3.1. Modeling the data**

In [13]:
def get_prefix(form, lemma):
    """return the prefix from the given form and lemma"""
    if lemma in form:
        idx = form.index(lemma)
        return form[:idx]
    return ''

def get_suffix(form, lemma):
    """return the suffix from the given lemma and form"""
    if lemma in form:
        idx = form.index(lemma)
        return form[idx + len(lemma):]
    return ''

def remove_prefix(form, prefix):
    """remove the prefix from the form"""
    if form.startswith(prefix):
        return form[len(prefix):]
    return form

def remove_suffix(form, suffix):
    """remove the suffix from the form"""
    if suffix and form.endswith(suffix):
        return form[:-len(suffix)]
    return form

def get_lemma(form, prefix, suffix):
    """return the lemma from the form given the prefix and the suffix"""
    lemma = remove_suffix(form, suffix)
    lemma = remove_prefix(lemma, prefix)
    return lemma

In [14]:
df_train['prefix'] = df_train.apply(lambda col: get_prefix(col.form, col.lemma), axis=1)
df_train['suffix'] = df_train.apply(lambda col: get_suffix(col.form, col.lemma), axis=1)
df_train.head()

Unnamed: 0,lemma,form,attributes,prefix,suffix
0,boro,bɛboroee,V;PST+IMMED,bɛ,ee
1,tow .. mu,tow .. mu,V;HAB;PRS,,
2,dum,dumee,V;HAB;PST,,ee
3,ba,ammbɛba,V;NEG;PST+IMMED,ammbɛ,
4,yi ayɛw,nnkɔyi ayɛwee,V;PRF;NEG;PRS;LGSPEC1,nnkɔ,ee
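
The extraction can also be written compactly with `str.partition`. The sketch below uses a hypothetical helper, `split_affixes`, which is not part of the notebook; it reproduces the first and fourth rows of the table and shows the case where the lemma does not occur verbatim in the form, so no affix can be extracted.

```python
def split_affixes(form, lemma):
    """return (prefix, suffix) around the first occurrence of lemma in form"""
    if not lemma or lemma not in form:
        ## the lemma is empty or does not appear verbatim in the form
        return '', ''
    prefix, _, suffix = form.partition(lemma)
    return prefix, suffix

prefix, suffix = split_affixes('bɛboroee', 'boro')
print(repr(prefix), repr(suffix))

## the form is recovered by gluing the parts back together
print(prefix + 'boro' + suffix == 'bɛboroee')

## failure case: the lemma itself changes inside the form
print(split_affixes('wada', 'waba'))
```

In the failure case both affixes are empty, so the predicted form would be the lemma itself.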


In [15]:
## create target array
y_train = df_train[['prefix', 'suffix']].to_numpy()
y_train.shape

(2793, 2)

#### **3.2. Training & Evaluating the model**

In [16]:
## create & fit an RF model
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

## make predictions
y_pred = clf.predict(X_test)

In [17]:
def get_predictions(X_test, y_pred):
    pred_forms = []
    for lemma, y in zip(X_test, y_pred):
        prefix, suffix = y
        form = prefix + lemma + suffix
        pred_forms.append(form)
    return pred_forms

In [18]:
## vector to word
words_test = [vect2word(vect, inv_char_dict) for vect in y_test]
words_prediction = get_predictions(df_test['lemma'].to_numpy(), y_pred)

## compute Levenshtein distance
dist = [distance(word1, word2) for word1, word2 in zip(words_test, words_prediction)]
df_pred = pd.DataFrame([words_test, words_prediction, dist], 
                       index=['true', 'predicted', 'Levenshtein distance']).T
df_pred.head()

Unnamed: 0,true,predicted,Levenshtein distance
0,rekɔbisa,rekɔbisa,0
1,kɔyiee,kɔyiee,0
2,nna abɛkyiri,nna abɛkyiri,0
3,akɔtwe,akɔtwe,0
4,nna renyam,nna renyam,0


In [19]:
print(f'- The word by word accuracy          : {word_accuracy(words_prediction, words_test)}')
print(f'- The character by character accuracy: {character_accuracy(words_prediction, words_test)}')

- The word by word accuracy          : 0.8636959370904325
- The character by character accuracy: 0.9683562428407789


#### **3.3. Improving further**

**Encode attributes**

In [20]:
## bag of words for attributes
attributes_dict = dict(zip(unique_attrs, range(n_attrs)))

def encode_attributes(attributes, max_n_attrs, attributes_dict):
    """(multi-)onehot encoding for attributes"""
    encoded = np.zeros(len(attributes_dict))
    encoded[[attributes_dict[attr] for attr in attributes]] = 1
    return encoded
    
## encode attributes   
encoded_attributes = np.array([encode_attributes(attrs.split(';'), max_n_attrs, attributes_dict) for attrs in df_train['attributes'].to_list()])
encoded_test_attributes = np.array([encode_attributes(attrs.split(';'), max_n_attrs, attributes_dict) for attrs in df_test['attributes'].to_list()])
print(f'encoded attributes shape: {encoded_attributes.shape}')

pd.DataFrame(encoded_attributes, columns=unique_attrs).astype('int').head()

encoded attributes shape: (2793, 16)


Unnamed: 0,FUT,HAB,HAB+PRF,HAB+PROG,IMP,LGSPEC1,NEG,NFIN,PRF,PROG,PRS,PRS+IMMED,PST,PST+IMMED,SBJV,V
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1
2,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1
3,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1
4,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,1
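
The multi-hot encoding can be reproduced without NumPy on a reduced tag inventory. This is a sketch under stated assumptions: `toy_attrs` and `multi_hot` are hypothetical names, and the inventory is a subset of the 16 Akan tags.

```python
## assumption: a reduced tag inventory, kept in sorted order
toy_attrs = ['HAB', 'NEG', 'PRS', 'PST', 'V']
toy_attr_index = {attr: i for i, attr in enumerate(toy_attrs)}

def multi_hot(attributes, attr_index):
    """set one position to 1 for every attribute that is present"""
    vector = [0] * len(attr_index)
    for attr in attributes:
        vector[attr_index[attr]] = 1
    return vector

print(multi_hot('V;HAB;PRS'.split(';'), toy_attr_index))

## the order of the tags in the input string does not matter
print(multi_hot(['PRS', 'V', 'HAB'], toy_attr_index))
```

Unlike the padded list of attribute indices used in the first approach, each column here always refers to the same tag, whatever the order in which the tags are written.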


**Create new training data**

In [21]:
## new training data
new_X_train = np.concatenate((X_train[:, :max_lemma_length], encoded_attributes), axis=1)
new_X_test = np.concatenate((X_test[:, :max_lemma_length], encoded_test_attributes), axis=1)
new_X_train.shape

(2793, 25)

**Training**

In [22]:
## create & fit an RF model
clf = RandomForestClassifier(random_state=0)
clf.fit(new_X_train, y_train)

## make predictions
y_pred = clf.predict(new_X_test)

## vects to words
words_prediction = get_predictions(df_test['lemma'].to_numpy(), y_pred)
words_test = [vect2word(vect, inv_char_dict) for vect in y_test]

## compute Levenshtein distance
dist = [distance(word1, word2) for word1, word2 in zip(words_test, words_prediction)]
df_pred = pd.DataFrame([words_test, words_prediction, dist], 
                       index=['true', 'predicted', 'Levenshtein distance']).T
df_pred.head()

Unnamed: 0,true,predicted,Levenshtein distance
0,rekɔbisa,rekɔbisa,0
1,kɔyiee,kɔyie,1
2,nna abɛkyiri,nna abɛkyiri,0
3,akɔtwe,akɔtwe,0
4,nna renyam,nna renyam,0


In [23]:
print(f'- The word by word accuracy          : {word_accuracy(words_prediction, words_test)}')
print(f'- The character by character accuracy: {character_accuracy(words_prediction, words_test)}')

- The word by word accuracy          : 0.927916120576671
- The character by character accuracy: 0.9887624261633771


>**Comments:**
> - The improved version works pretty well.
> - However, what about languages or forms where the lemma itself changes to produce the form, for instance `lemma=waba` and `form=wada`?
> - We need a model that can capture this kind of information. It will allow us to predict the form of a lemma even if their structures differ.

### ■ **<a name="section4">4. Third Approach: Beyond Prefix & Suffix</a>** [(&#8593;)](#content)
The previously proposed method suffers when the lemma is not part of the body of the form. 
To solve this problem, we propose a new approach, which is in a way an improvement of the previous one.
This new approach takes into account the root of a given lemma and form and tries to capture what changes between the two representations (lemma and form).
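
To make the idea concrete, the decomposition can be sketched with the standard library. This is an alternative illustration, not the notebook's own root extraction: `longest_common_substring`, `split_around` and `describe_change` are hypothetical helpers, and `difflib` may break ties between equally long substrings differently.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """longest contiguous block shared by a and b ('' if there is none)"""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

def split_around(word, root):
    """return (prefix, suffix) of word around the first occurrence of root"""
    if not root:
        return '', word
    prefix, _, suffix = word.partition(root)
    return prefix, suffix

def describe_change(lemma, form):
    """decompose lemma and form around their shared root"""
    root = longest_common_substring(lemma, form)
    lemma_prefix, lemma_suffix = split_around(lemma, root)
    form_prefix, form_suffix = split_around(form, root)
    return {'root': root,
            'lemma_prefix': lemma_prefix, 'lemma_suffix': lemma_suffix,
            'form_prefix': form_prefix, 'form_suffix': form_suffix}

print(describe_change('waba', 'wada'))
```

For `waba` and `wada`, the shared root is `wa`, and the change is captured as the lemma suffix `ba` being replaced by the form suffix `da`.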

#### **4.1. Modeling the data**

In [24]:
def get_root(string_1, string_2):
    """return the root (longest common substring) of two strings"""
    
    if len(string_1) > len(string_2):
        larger_s = string_1 
        smaller_s = string_2
    else:
        larger_s = string_2
        smaller_s = string_1
        
    inter = ''
    for i in range(len(larger_s)):
        for j in range(i, len(larger_s)+1):
            if j - i < len(inter):
                continue
            part = larger_s[i:j]
            
            if part in smaller_s and len(part) > len(inter):
                inter = part
        
    return inter

In [25]:
## extract root
df_train['root'] = df_train.apply(lambda col: get_root(col.lemma, col.form), axis=1)

## extract prefixes
df_train['lemma_prefix'] = df_train.apply(lambda col: get_prefix(col.lemma, col.root), axis=1)
df_train['form_prefix'] = df_train.apply(lambda col: get_prefix(col.form, col.root), axis=1)

## extract suffixes
df_train['lemma_suffix'] = df_train.apply(lambda col: get_suffix(col.lemma, col.root), axis=1)
df_train['form_suffix'] = df_train.apply(lambda col: get_suffix(col.form, col.root), axis=1)

df_train.head()

Unnamed: 0,lemma,form,attributes,prefix,suffix,root,lemma_prefix,form_prefix,lemma_suffix,form_suffix
0,boro,bɛboroee,V;PST+IMMED,bɛ,ee,boro,,bɛ,,ee
1,tow .. mu,tow .. mu,V;HAB;PRS,,,tow .. mu,,,,
2,dum,dumee,V;HAB;PST,,ee,dum,,,,ee
3,ba,ammbɛba,V;NEG;PST+IMMED,ammbɛ,,ba,,ammbɛ,,
4,yi ayɛw,nnkɔyi ayɛwee,V;PRF;NEG;PRS;LGSPEC1,nnkɔ,ee,yi ayɛw,,nnkɔ,,ee


In [26]:
def bag_of_chars(lemmas, max_lemma):
    """create a bag of words training set"""
    ## create X and y train
    X_train = []
    for lemma in lemmas:
        
        x = []
        for char in lemma:
            x.append(char_dict[char])
            
        X_train.append(pad(x, max_lemma, 0))
        
    return np.array(X_train)

## get training & test set
X_train = bag_of_chars(df_train['lemma'].to_numpy(), max_lemma_length)
X_test  = bag_of_chars(df_test['lemma'].to_numpy(), max_lemma_length)
X_train.shape

(2793, 9)

In [27]:
## bag of words
lemma_prefix_dict = dict(zip(df_train['lemma_prefix'].unique(), range(1, len(df_train['lemma_prefix'].unique())+1)))
lemma_suffix_dict = dict(zip(df_train['lemma_suffix'].unique(), range(1, len(df_train['lemma_suffix'].unique())+1)))
form_prefix_dict  = dict(zip(df_train['form_prefix'].unique(), range(1, len(df_train['form_prefix'].unique())+1)))
form_suffix_dict  = dict(zip(df_train['form_suffix'].unique(), range(1, len(df_train['form_suffix'].unique())+1)))

inv_form_pref_dict = {n:pref for pref, n in form_prefix_dict.items()}
inv_form_suff_dict = {n:suff for suff, n in form_suffix_dict.items()}

## encode roots, suffixes & prefixes
max_length_root = df_train['root'].apply(len).max()
encoded_roots = bag_of_chars(df_train['root'].to_list(), max_length_root)
encoded_lemma_suffix = df_train['lemma_suffix'].apply(lemma_suffix_dict.get).to_numpy().reshape(-1, 1)
encoded_form_suffix  = df_train['form_suffix'].apply(form_suffix_dict.get).to_numpy().reshape(-1, 1)
encoded_lemma_prefix = df_train['lemma_prefix'].apply(lemma_prefix_dict.get).to_numpy().reshape(-1, 1)
encoded_form_prefix  = df_train['form_prefix'].apply(form_prefix_dict.get).to_numpy().reshape(-1, 1)

## create target labels
y_train = np.concatenate((encoded_roots, encoded_lemma_prefix, encoded_form_prefix, encoded_lemma_suffix, encoded_form_suffix), axis=1)
y_train.shape

(2793, 13)

In [28]:
## create & fit the model
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

## make predictions
y_pred = clf.predict(X_test)

In [29]:
def get_predictions(X_test, y_pred):
    """rebuild the predicted forms from the predicted root, prefix & suffix codes"""
    predictions = []
    for x, y in zip(X_test, y_pred):
        ## the first max_lemma_length entries encode the root, the last four the affix codes
        root, lemma_pref, form_pref, lemma_suff, form_suff = y[:max_lemma_length], *y[max_lemma_length:]
        pref = inv_form_pref_dict[form_pref] 
        suff = inv_form_suff_dict[form_suff]
        form = pref + vect2word(root, inv_char_dict) + suff
        predictions.append(form)
    return predictions

In [30]:
## vects to words
words_prediction = get_predictions(X_test, y_pred)
words_test = [vect2word(vect, inv_char_dict) for vect in y_test]

## compute Levenshtein distance
dist = [distance(word1, word2) for word1, word2 in zip(words_test, words_prediction)]
df_pred = pd.DataFrame([words_test, words_prediction, dist], 
                       index=['true', 'predicted', 'Levenshtein distance']).T
df_pred.head()

Unnamed: 0,true,predicted,Levenshtein distance
0,rekɔbisa,bisa,4
1,kɔyiee,yi,4
2,nna abɛkyiri,kyiri,7
3,akɔtwe,twe,3
4,nna renyam,nnnyam,4


In [31]:
print(f'- The word by word accuracy          : {word_accuracy(words_prediction, words_test)}')
print(f'- The character by character accuracy: {character_accuracy(words_prediction, words_test)}')

- The word by word accuracy          : 0.027522935779816515
- The character by character accuracy: 0.4822241183162685


>**Comments:**
> - This third approach performs much worse than the previous ones: most predicted forms are missing their prefix and suffix.
> - Note that the input used here contains only the encoded lemma (9 features), without the morphological attributes, so the model has no information from which to choose the right prefix and suffix.


### **References** 
Vylomova, E., White, J., Salesky, E., Mielke, S. J., Wu, S., Ponti, E., Maudslay, R. H., Zmigrod, R., Valvoda, J., Toldova, S., et al. (2020). SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection. *SIGMORPHON 2020*, p. 1.

---
<p style="text-align: center;">Copyright © 2021 Omar Ikne & Zakaria Boulkhir</p>