<a href="https://colab.research.google.com/github/ioannispartalas/CrossLingual-NLP-AMLD2020/blob/master/sklearn_LASER_cross_language_embd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross-lingual processing and transfer learning using multilingual embeddings

Facebook AI has released a multilingual toolkit called LASER (Language-Agnostic SEntence Representations), which relies on a sequence-to-sequence encoder-decoder model. A sequence encoder has the advantage of processing whole sentences directly and can thus capture their internal structure.

LASER has been trained on more than 100 languages, which makes it possible to project any sentence from those languages through the encoder into a common representation called the multilingual embedding space.
<figure>
<img src="https://engineering.fb.com/wp-content/uploads/2019/01/CodeBlog_embedding_space_v4.png" style="width:45%">
 <figcaption>Fig.1 - Multilingual embedding.</figcaption>
</figure>
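
To make the idea of a shared space more concrete, the sketch below compares vectors by cosine similarity, a common way to measure closeness between sentence embeddings. The three vectors are random stand-ins, not real LASER outputs, and `cosine_similarity` is a helper written only for this illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Hypothetical stand-ins for 1024-dimensional sentence embeddings
sent_en = rng.normal(size=1024)
sent_es = sent_en + 0.1 * rng.normal(size=1024)  # slightly perturbed copy, mimicking a translation that lands nearby
sent_other = rng.normal(size=1024)                # unrelated sentence

print(cosine_similarity(sent_en, sent_es))     # close to 1
print(cosine_similarity(sent_en, sent_other))  # close to 0
```

In a well-aligned multilingual space, a sentence and its translation would be expected to behave like the first pair, whatever their languages.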

In this notebook, we work with a multilingual dataset containing sentences in four languages: English, Dutch, Spanish and Russian. Every sentence comes with a sentiment label indicating positive or negative content. The sentences do not overlap across languages. We directly provide the sentence embeddings for all languages: every sentence is represented by a 1024-dimensional vector giving its position in the LASER embedding space.

In [2]:
# Let's download the dataset
!git clone https://github.com/ioannispartalas/CrossLingual-NLP-AMLD2020.git
# The data are copied to ./CrossLingual-NLP-AMLD2020/data/laser/ on your Colab local filesystem
!ls ./CrossLingual-NLP-AMLD2020/data/laser/

Cloning into 'CrossLingual-NLP-AMLD2020'...
remote: Enumerating objects: 132, done.
remote: Counting objects: 100% (132/132), done.
remote: Compressing objects: 100% (102/102), done.
remote: Total 132 (delta 57), reused 66 (delta 22), pack-reused 0
Receiving objects: 100% (132/132), 36.67 MiB | 16.16 MiB/s, done.
Resolving deltas: 100% (57/57), done.
en_laser_test.npy	  es_test_labels_adan.txt   ru_laser_test.npy
en_laser_train.npy	  es_train_labels_adan.txt  ru_laser_train.npy
en_test_labels_adan.txt   nl_laser_test.npy	    ru_test_labels_adan.txt
en_train_labels_adan.txt  nl_laser_train.npy	    ru_train_labels_adan.txt
es_laser_test.npy	  nl_test_labels_adan.txt
es_laser_train.npy	  nl_train_labels_adan.txt


The dataset consists of NumPy files:
```
'en_laser_train.npy'
'en_laser_test.npy'
'nl_laser_test.npy'
...
```
containing the training and test sets for each language.

The corresponding labels are stored in:
```
en_train_labels_adan.txt
en_test_labels_adan.txt
nl_train_labels_adan.txt
...
```


In [0]:
import sys
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
import random as rn

sys.path.insert(1, './CrossLingual-NLP-AMLD2020/')


In [0]:
# set seeds for reproducibility
np.random.seed(12)
rn.seed(1236)


First, let's define an I/O function.

In [0]:
def load_language(language='en', train_or_test='train'):
    """
    Load the LASER embeddings and labels for one language and one split ('train' or 'test').
    """
    path = './CrossLingual-NLP-AMLD2020/data/laser/'
    feat_fn = path + language + '_laser_' + train_or_test + '.npy'
    label_fn = path + language + '_' + train_or_test + '_labels_adan.txt'
    labels = np.loadtxt(label_fn)
    # Drop samples labelled 2 so that two classes remain; a boolean mask keeps
    # features and labels aligned even if a single sample is left
    keep = labels != 2
    feat = np.load(feat_fn)[keep]
    labels = labels[keep]
    return feat, labels
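
The filtering step in `load_language` can be checked on a toy array. This is a minimal sketch with made-up features and labels: only the exclusion of label 2 comes from the function above, while the other label values are assumptions chosen for illustration.

```python
import numpy as np

# Toy data: 5 samples with 3 features each (the real embeddings have 1024)
feat = np.arange(15).reshape(5, 3)
labels = np.array([1, 2, 3, 2, 1])  # hypothetical label values

keep = labels != 2                  # boolean mask, one entry per sample
feat_kept, labels_kept = feat[keep], labels[keep]

print(feat_kept.shape)   # (3, 3)
print(labels_kept)       # [1 3 1]
```

Applying the same mask to both arrays keeps each feature row paired with its label.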

For model evaluation, we use the [F1](https://en.wikipedia.org/wiki/F1_score) score, which is better suited than plain accuracy to imbalanced datasets. The [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) is also useful for analysing model outputs in detail.

In [0]:
from sklearn.metrics import f1_score, confusion_matrix
def cross_lang_evaluation(model):
  """
  Measure F1 score and confusion matrix for the provided model over test data for the 4 languages
  """
  languages = ['en', 'nl', 'es', 'ru']
  EVAL = {}
  for lang in languages:
    x_test, y_test = load_language(lang, 'test')
    y_pred = model.predict(x_test)
    # f1_score defaults to average='binary' with pos_label=1
    F1 = f1_score(y_test, y_pred)
    # rows are true classes, columns are predicted classes
    CONF = confusion_matrix(y_test, y_pred)
    EVAL[lang] = (F1, CONF)
  # print the F1 score, then the confusion matrix, for each language
  for lang, metric in EVAL.items():
    print(lang, ': ', metric[0], '\n', metric[1], '\n')
  return EVAL

Let's train a logistic regression on one language, English for instance.

In [0]:
from sklearn.linear_model import LogisticRegression
x_train,y_train = load_language('en', 'train')  
lr = LogisticRegression().fit(x_train,y_train)


And reuse this English model to predict sentiment for all languages.

In [30]:
_ = cross_lang_evaluation(lr)

en :  0.8875878220140515 
 [[379  42]
 [ 54  81]] 

nl :  0.839041095890411 
 [[245  15]
 [ 79  90]] 

es :  0.8922155688622755 
 [[447  13]
 [ 95  90]] 

ru :  0.89358372456964 
 [[571  34]
 [102 127]] 



Although the model was trained only on English, its F1 scores on the other languages are comparable to the English one: the classifier transfers to Dutch, Spanish and Russian without any retraining.
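
As a cross-check on how the printed F1 relates to the confusion matrix, the sketch below recomputes a per-class F1 from the English matrix using $F_1 = 2TP / (2TP + FP + FN)$. `per_class_f1` is a helper written for this illustration, and it assumes scikit-learn's convention of true classes in rows and predicted classes in columns.

```python
import numpy as np

def per_class_f1(conf):
    """F1 for each class from a confusion matrix (rows: true, columns: predicted)."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp   # predicted as the class but actually another
    fn = conf.sum(axis=1) - tp   # actually the class but predicted as another
    return 2 * tp / (2 * tp + fp + fn)

# English confusion matrix printed above
conf_en = [[379, 42],
           [54, 81]]
print(per_class_f1(conf_en))   # first entry ≈ 0.8876, second ≈ 0.6279
```

The first value matches the score reported for `en`, which suggests that the printed F1 refers to the class in the first row of the matrix. The second class, which is the smaller one, scores noticeably lower, so it is worth reading the confusion matrices alongside the F1 values.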

Let's now try a different combination: train on all languages except English, then predict on English.

In [0]:
x_train, y_train = [], []

# collect the training split of every language except English
languages = ['nl', 'es', 'ru']
for lang in languages:
  x_tr, y_tr = load_language(lang, 'train')
  x_train.append(x_tr)
  y_train.append(y_tr)

# stack into one feature matrix and one label vector
x_train = np.vstack(x_train)
y_train = np.hstack(y_train)
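
The stacking step can be illustrated on small placeholder arrays. The shapes and labels below are hypothetical; only the use of `np.vstack` for features and `np.hstack` for labels mirrors the cell above.

```python
import numpy as np

# Hypothetical per-language arrays: (n_sentences, embedding_dim) features, (n_sentences,) labels
x_parts = [np.zeros((4, 1024)), np.ones((2, 1024))]
y_parts = [np.array([0, 1, 0, 0]), np.array([1, 0])]

x_all = np.vstack(x_parts)   # stacks rows: shape (6, 1024)
y_all = np.hstack(y_parts)   # concatenates 1-D arrays: shape (6,)

print(x_all.shape, y_all.shape)   # (6, 1024) (6,)
```

Because every language lives in the same 1024-dimensional space, the feature matrices can simply be stacked row-wise without any alignment step.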


In [48]:
lr_2 = LogisticRegression().fit(x_train,y_train)
_ = cross_lang_evaluation(lr_2)

en :  0.8875878220140515 
 [[379  42]
 [ 54  81]] 

nl :  0.839041095890411 
 [[245  15]
 [ 79  90]] 

es :  0.8922155688622755 
 [[447  13]
 [ 95  90]] 

ru :  0.89358372456964 
 [[571  34]
 [102 127]] 



Surprisingly, the new model predicts sentiment polarity in English with the same F1 score as before, without ever seeing an English sentence!
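
The same experiment could be repeated with each language held out in turn. The sketch below is one possible way to do it, not part of the original notebook: `leave_one_language_out` is a hypothetical helper, the data are synthetic, and for brevity it uses a single set per language rather than separate train and test splits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def leave_one_language_out(data):
    """
    data: dict mapping language code -> (features, labels).
    For each language, train on all the others and return the F1 on the held-out one.
    """
    scores = {}
    for held_out in data:
        x_tr = np.vstack([x for lang, (x, _) in data.items() if lang != held_out])
        y_tr = np.hstack([y for lang, (_, y) in data.items() if lang != held_out])
        model = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        x_te, y_te = data[held_out]
        scores[held_out] = f1_score(y_te, model.predict(x_te))
    return scores

# Synthetic stand-in data: the label depends on the first feature in every "language"
rng = np.random.default_rng(0)
toy = {}
for lang in ['en', 'nl', 'es', 'ru']:
    x = rng.normal(size=(200, 8))
    toy[lang] = (x, (x[:, 0] > 0).astype(int))

print(leave_one_language_out(toy))
```

With the real data, the dictionary would be filled from `load_language`, training on the `'train'` splits and scoring on the `'test'` split of the held-out language.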

Could we do better? Let's try more complex models, such as a [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (MLP)

In [106]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(solver='lbfgs',
                    hidden_layer_sizes=(128, 128),
                    activation='relu',
                    alpha=1e-3,
                    max_iter=50,
                    # early_stopping and validation_fraction only take effect
                    # with solver='sgd' or 'adam'; they are ignored with 'lbfgs'
                    early_stopping=True,
                    validation_fraction=0.1,
                    random_state=1)

_ = cross_lang_evaluation(mlp.fit(x_train, y_train))

en :  0.8801020408163265 
 [[345  76]
 [ 18 117]] 

nl :  0.8927875243664717 
 [[229  31]
 [ 24 145]] 

es :  0.9293361884368309 
 [[434  26]
 [ 40 145]] 

ru :  0.9191836734693877 
 [[563  42]
 [ 57 172]] 



or [extreme gradient boosting](https://en.wikipedia.org/wiki/XGBoost) (XGBoost)

In [103]:
import xgboost as xgb
boost = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
_ = cross_lang_evaluation(boost.fit(x_train, y_train))

en :  0.8709677419354838 
 [[351  70]
 [ 34 101]] 

nl :  0.8871595330739299 
 [[228  32]
 [ 26 143]] 

es :  0.9224318658280922 
 [[440  20]
 [ 54 131]] 

ru :  0.9129373474369405 
 [[561  44]
 [ 63 166]] 



What can we conclude from the above results?