# Cross-lingual processing and transfer learning using multilingual embeddings

In this notebook, we will work on a multilingual dataset containing sentences in four languages: English, Dutch, Spanish and Russian. Every sentence in every language comes with a sentiment label indicating positive or negative content. No sentence appears in more than one language.

Working with the LASER multilingual representation, we directly provide the sentence embeddings for all languages. Every sentence is represented by a 1024-dimensional vector giving its position in the LASER embedding space.

# Loading data from GitHub

In [0]:
# Download the dataset (if not done already)
!git clone https://github.com/ioannispartalas/CrossLingual-NLP-AMLD2020.git
# After cloning, the data lives under the following path
workdir = './CrossLingual-NLP-AMLD2020/'
path_to_data = workdir + 'data/laser/'
# Check that this listing shows the data files; otherwise correct path_to_data
!ls ./CrossLingual-NLP-AMLD2020/data/laser/

The dataset is made of NumPy files:
```
'en_laser_train.npy'
'en_laser_test.npy'
'nl_laser_train.npy'
'nl_laser_test.npy'
...
```
containing the training and test sets, respectively, for every language.

The corresponding labels are stored in
```
en_train_labels_adan.txt
en_test_labels_adan.txt
...
```
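
To make the file layout concrete, here is a minimal sketch of how one split could be read from disk. The helper `load_split` is hypothetical (the notebook itself relies on `load_language` from the repository, whose implementation is not shown here), and the label files are assumed to hold one integer per line. The sketch writes a tiny synthetic dataset to a temporary directory so that it runs without the real data.

```python
import os
import tempfile

import numpy as np


def load_split(data_dir, lang, split):
    """Hypothetical stand-in for load_language: returns (embeddings, labels)."""
    x = np.load(os.path.join(data_dir, f"{lang}_laser_{split}.npy"))
    # Assumption: the label file contains one integer label per line.
    y = np.loadtxt(os.path.join(data_dir, f"{lang}_{split}_labels_adan.txt"), dtype=int)
    return x, y


# Build a tiny synthetic dataset that follows the naming pattern above.
tmp_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
np.save(os.path.join(tmp_dir, "en_laser_train.npy"), rng.normal(size=(6, 1024)))
np.savetxt(os.path.join(tmp_dir, "en_train_labels_adan.txt"), [0, 1, 0, 1, 1, 0], fmt="%d")

x_demo, y_demo = load_split(tmp_dir, "en", "train")
print(x_demo.shape, y_demo.shape)  # one 1024-dimensional row per sentence, one label per row
```

With the real data, the first dimension of the embeddings should match the number of labels for the same language and split; checking this is a cheap way to confirm that `path_to_data` is correct.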


In [0]:
import sys
import random as rn

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# Make the cloned repository importable, then load its data helper
sys.path.insert(1, workdir)
from src.utils import load_language


To evaluate model performance, we will use the [F1](https://en.wikipedia.org/wiki/F1_score) score, as it is better suited than plain accuracy to imbalanced datasets. The [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) is also useful for analysing model outputs in detail.
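
As a small illustration of why F1 can be more informative than accuracy on imbalanced data, the sketch below computes both by hand on a toy example. The function `binary_scores` and the toy labels are illustrative assumptions, not part of the notebook, which uses `sklearn.metrics.f1_score`.

```python
def binary_scores(y_true, y_pred, pos_label=1):
    """Return (accuracy, F1) for a binary problem, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy data: 9 negatives and 1 positive; the "model" always predicts negative.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

acc, f1 = binary_scores(y_true, y_pred)
print("accuracy =", acc, "F1 =", f1)  # accuracy = 0.9 F1 = 0.0
```

A classifier that never predicts the positive class still reaches 90% accuracy here, while its F1 score for the positive class is zero.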

In [0]:
from sklearn.metrics import f1_score, confusion_matrix
def cross_lang_evaluation(model):
  """
  Measure the F1 score and confusion matrix of the provided model on the test data of the 4 languages
  """
  languages = ['en','nl','es','ru']
  EVAL = {}
  for lang in languages:
    x_test,y_test = load_language(path_to_data, lang, 'test')
    y_pred = model.predict(x_test)

    # pos_label must match the value that encodes the positive class in the label files
    F1 = f1_score(y_test,y_pred, pos_label = 3)
    # Rows are the true classes, columns the predicted classes
    CONF = pd.DataFrame(confusion_matrix(y_test,y_pred),index = ['TRUE NEGATIVE','TRUE POSITIVE'],columns = ['PRED NEGATIVE','PRED POSITIVE'])

    EVAL[lang] = (F1,CONF)
  # Print the per-language results
  for lang, metric in EVAL.items():
    print(lang,': F1= ', metric[0],'\n', metric[1],'\n')
  return EVAL

Let's train a logistic regression model on a single language, for instance English.

In [0]:
from sklearn.linear_model import LogisticRegression
x_train,y_train = load_language(path_to_data,'en', 'train')  
lr = LogisticRegression(C = 10,random_state = 1).fit(x_train,y_train)


And reuse this English model to predict sentiment in all languages.

In [0]:
_ = cross_lang_evaluation(lr)

Although the model is most accurate on English (the language it was trained on), it is able to predict sentiment with fairly good accuracy in the other languages.

Let's now try a different combination: train on all languages except English, then predict on English.

In [0]:
x_train,y_train = [],[]

# Collect the training split of every language except English
languages = ['nl','es','ru']
for lang in languages:
  x_tr,y_tr = load_language(path_to_data,lang,'train')
  x_train.append(x_tr)
  y_train.append(y_tr)

# Stack the per-language arrays into a single training set
x_train = np.vstack(x_train)
y_train = np.hstack(y_train)
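
The stacking step can be checked on synthetic arrays. This is only a sketch: the array sizes and values below are made up, and the names `per_lang_x`, `per_lang_y`, `x_all` and `y_all` are illustrative. It shows that `np.vstack` concatenates the per-language embedding matrices row-wise, while `np.hstack` concatenates the one-dimensional label vectors end to end.

```python
import numpy as np

# Synthetic stand-ins for three languages with different numbers of sentences.
per_lang_x = [np.zeros((3, 1024)), np.ones((5, 1024)), np.full((2, 1024), 2.0)]
per_lang_y = [np.zeros(3, dtype=int), np.ones(5, dtype=int), np.zeros(2, dtype=int)]

x_all = np.vstack(per_lang_x)  # rows are appended, the 1024 columns are kept
y_all = np.hstack(per_lang_y)  # 1-D label vectors are joined into one vector

print(x_all.shape, y_all.shape)  # (10, 1024) (10,)
```

Because the languages are appended in the same order in both lists, row `i` of the stacked embeddings still lines up with label `i`.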


In [0]:
lr_2 = LogisticRegression(C = 10, max_iter = 200, random_state = 1).fit(x_train,y_train)
_ = cross_lang_evaluation(lr_2)

Surprisingly, the new model is able to predict sentiment polarity in English with the same accuracy as before, without ever seeing a single English sentence!
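
The same experiment could be repeated with each language held out in turn. The generator below is a hypothetical helper, not part of the notebook; it only enumerates which languages to train on and which one to hold out, leaving loading and model fitting to the cells above.

```python
def leave_one_language_out(languages):
    """Yield (training languages, held-out language) for every language in turn."""
    for held_out in languages:
        train_langs = [lang for lang in languages if lang != held_out]
        yield train_langs, held_out


splits = list(leave_one_language_out(['en', 'nl', 'es', 'ru']))
for train_langs, held_out in splits:
    print("train on", train_langs, "-> hold out", held_out)
```

The first split reproduces the setting used here: train on `nl`, `es` and `ru`, and hold out `en`.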

Could we do better? Let's try more complex models, such as a [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (MLP)

In [0]:
from sklearn.neural_network import MLPClassifier
# early_stopping and validation_fraction only take effect with the 'sgd' or
# 'adam' solvers, so they are left out here since solver='lbfgs' is used
mlp = MLPClassifier(solver='lbfgs',
                    hidden_layer_sizes=(128,128),
                    activation = 'relu',
                    alpha=1e-8,
                    max_iter = 500,
                    random_state=1)

_ = cross_lang_evaluation(mlp.fit(x_train,y_train))

or [extreme gradient boosting](https://en.wikipedia.org/wiki/XGBoost) (XGBoost)

In [0]:
import xgboost as xgb
boost = xgb.XGBClassifier(objective="binary:logistic", max_depth=3, random_state=42)
_ = cross_lang_evaluation(boost.fit(x_train,y_train))

What can we conclude from the above results?