# Setup

First, let's install and set up the necessary libraries.

Start by cloning the GitHub repository with the following command:

In [0]:
import os

!git clone https://github.com/ioannispartalas/CrossLingual-NLP-AMLD2020.git
# With this command, the repository (and its data) lives under this path
workdir = './CrossLingual-NLP-AMLD2020/'
os.environ["WORKDIR"] = workdir
# Check that the listing below succeeds; otherwise correct workdir
!ls $WORKDIR/data/laser
!mkdir -p $WORKDIR/data/raw

Download the data to your local file system and upload it to the Colab file system with the cell below.

In [0]:
from google.colab import files
uploaded = files.upload()
!tar -jxf *.bz2 -C  $WORKDIR/data/raw/
!rm ./semeveal15_sentiment_datasets.tar.bz2


Install LASER and ConceptNet.




In [0]:
%cd CrossLingual-NLP-AMLD2020/
!bash install_laser.sh
!bash download_conceptNet.sh

Restart the runtime environment.

In [0]:
os.kill(os.getpid(), 9)

Set environment variables and load modules.

In [0]:
import sys
import os
import importlib

os.environ.setdefault("LASER","/root/projects/LASER/")
assert os.environ.get('LASER'), 'Please set the environment variable LASER'
LASER = os.environ['LASER']
sys.path.append(LASER + 'source/lib')
sys.path.append(LASER+"source/")

workdir = './CrossLingual-NLP-AMLD2020/'
os.environ["WORKDIR"] = workdir
sys.path.insert(1, workdir)

from src.models import *

If everything went well the following should not print any errors.

In [0]:
import sys
sys.path.append("..")
from src.models import *

print(Doc2Laser.__doc__)

# Introduction to Text Classification

## Language is hard! 

Take a look at the following sentences: 
1. Jane went to the store
2. went the to Jane store 
3. Jane went store 
4. Jane goed store 

They (try to) express similar meanings, but some feel unnatural!

Several things to handle: 
- Morphology
- Syntax <- touch on this 
- Semantics/World Knowledge <- touch on this but mostly shallow semantics
- Discourse 
- Pragmatics 
- Multilinguality <- focus on this

## Sentiment classification
- binary (positive, negative)
- ternary (positive, neutral, negative)
- ordinal (image below!)

<img src="../data/images/sentiment-5class.png?raw=1" width="600">

*Input* (x): a text span 

*Output* (y): a class/category (sentiment polarity in the sentiment classification example)

**Goal**: Train a function $f(x) \rightarrow y$

- How to represent text? 
- What functions can we use for the task? 
- How to evaluate performance?


## Machine learning workflow
1. Get data
1. Inspect the data
1. Preprocess/Clean/Normalize the data
1. Vector Representation
1. Modeling 
1. Evaluation

<img src="../data/images/pipeline.png?raw=1" width="600">

## Text representation: traditional bag-of-words
Given a corpus, extract the vocabulary $V$ and represent each text as a vector of dimension $|V|$, whose non-zero entries correspond to the words that appear in the text.
<img src="../data/images/textVectorization.png?raw=1" width="600">


In [0]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
print(X.toarray())

- Words are identified by their ids.
- A non-zero entry means the word occurs.
- The value is the number of times the word occurs in the sentence.

In [0]:
vectorizer.transform(['This is the first document', 'is document the first this']).toarray()

- Order does not matter! Recall the example with Jane ;-)
- Words like 'and, the' count as much as words like 'super, great, ..'. This is a limitation.
- tf-idf (term frequency, inverse document frequency) is a heuristic that can get us far!

$tf_{i,j}\times\log\frac{N}{df_i}$

where

$tf_{i,j}$ is the number of times term $i$ appears in document $j$, $df_i$ is the document frequency of term $i$ (the number of documents in the collection that contain it) and $N$ is the total number of documents.
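To make the formula concrete, here is a minimal pure-Python sketch that computes the weights exactly as written above. The function name `tf_idf` and the whitespace tokenization are illustrative assumptions. Note that scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2-normalizes each row by default, so its numbers will differ from this textbook version.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook tf-idf: tf(i, j) * log(N / df(i)) for tokenized documents."""
    n_docs = len(docs)
    # df(i): number of documents that contain term i at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(i, j): raw count of term i in document j
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

toy_docs = [s.lower().split() for s in [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
]]
toy_weights = tf_idf(toy_docs)
# "this" occurs in every document, so its weight is log(3/3) = 0
print(toy_weights[0]["this"])
# "second" occurs once, in a single document: 1 * log(3/1)
print(toy_weights[1]["second"])
```

A word that appears everywhere carries no weight, while a word that is frequent in one document and rare elsewhere is emphasized.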

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
print(X.toarray())

Other tricks and tips: 
- Recall that text is a sequence of symbols. We may care about characters instead of words (think typos).
- We may care about sequences longer than single words: New York, not great, ...

In [0]:
# Character grams
vectorizer = CountVectorizer(analyzer='char', ngram_range=(1,1)) # This creates character-grams of size 1
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
print(X.toarray())

N-grams are sequences of *objects*. Here, the objects are either characters or words. For character sequences, for example:

<img src="https://github.com/ioannispartalas/CrossLingual-NLP-AMLD2020/blob/master/data/images/ngrams.png?raw=1">

In this figure, notice the sliding window of size 3. While moving from left to right, it generates the possible sequences that will be used to populate the vector representations. Because the window is of size 3, the method generates character 3-grams. If we were using words instead of the characters in the figure, we would be generating word 3-grams.

**Question**: can you think of a limitation of word 3-grams, 4-grams, 5-grams etc.?

**Exercise**: how to get these sequences in Python (in an elegant way)?
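
One possible answer to the exercise, as a sketch: zipping a sequence with shifted copies of itself yields the sliding window. The helper name `ngrams` is an assumption made for illustration.

```python
def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence (string or list of tokens)."""
    # zip stops at the shortest shifted copy, so windows never run past the end
    return list(zip(*(sequence[i:] for i in range(n))))

# Character 3-grams: each tuple is one position of the sliding window
print(["".join(g) for g in ngrams("hotel", 3)])   # ['hot', 'ote', 'tel']
# Word 2-grams over a tokenized sentence
print(ngrams("Jane went to the store".split(), 2))
```

The same function works for characters and words because both strings and lists support slicing.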

In [0]:
# N-grams (can be either char-grams or word-grams)
vectorizer = CountVectorizer(analyzer='char', ngram_range=(2,3)) # This creates character-grams of sizes 2 and 3
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
print(X.toarray())

## Dense Representations

All of the above techniques have a common limitation: they do not encode semantics!
This means that the vector for `amazing` is as dissimilar from the vector of `great` as it is from the vector of `Lausanne`.
Can we do better?
The answer is yes! Enter word embeddings:
dense word representations that can encode meaning!
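
The contrast can be illustrated with a small sketch. The one-hot vectors mimic a bag-of-words vocabulary of three words, while the 2-d dense vectors are made up by hand purely for illustration (they are not real embeddings).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# One-hot (bag-of-words style) vectors over the vocabulary [amazing, great, Lausanne]
one_hot = {"amazing": [1, 0, 0], "great": [0, 1, 0], "Lausanne": [0, 0, 1]}
# Made-up 2-d dense vectors, chosen by hand only to illustrate the idea
dense = {"amazing": [0.9, 0.1], "great": [0.8, 0.2], "Lausanne": [0.1, 0.9]}

for name, space in [("one-hot", one_hot), ("dense", dense)]:
    print(name,
          round(cosine(space["amazing"], space["great"]), 2),
          round(cosine(space["amazing"], space["Lausanne"]), 2))
```

With one-hot vectors every pair of distinct words has similarity 0, whereas a dense space can place related words close together and unrelated words far apart.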

<img src="../data/images/word2vec.png" width="600">

In [0]:
import numpy as np
from sklearn.manifold import TSNE
# For more information of TSNE: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
# For more information on GloVe: https://nlp.stanford.edu/projects/glove/
with open(workdir + '/data/glove_excerpt.txt') as f:
    lines = f.read().strip().split('\n')
# Map each word to its embedding vector
vectors = {line.split()[0]: np.array(line.split()[1:]).astype(float) for line in lines}

In [0]:
import matplotlib.pyplot as plt
# Let's visualize this using t-SNE, a method that reduces the dimensionality of the vectors
labels = list(vectors.keys())
tokens = list(vectors.values())

tsne_model = TSNE(perplexity=1.5, n_components=2, init='pca', n_iter=2500, random_state=23)
new_values = tsne_model.fit_transform(tokens)

x = new_values[:,0]
y = new_values[:,1]

plt.figure(figsize=(7, 6)) 
for i in range(len(x)):
    plt.scatter(x[i],y[i])
    plt.annotate(labels[i],
                 xy=(x[i], y[i]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')


This is a great result for several reasons:
- Families of similar words are close to each other.
- Some of them encode some syntax: adjectives such as `magnificent` and `amazing` need similar vector offsets to reach their adverbs!

## Learning Representations with Deep-NN Models

More recently, deep neural models for building text representations have given us better language understanding and, in turn, made text-related tasks easier to solve. Specifically, we can group the so-called Transformers into three classes with respect to the objective they optimize:

- Language Model: estimate the probability of a word given the previous words.
- Machine Translation: predict the words of the target sentence in a sequence-to-sequence mode.
- Masked Language Model: predict the masked token.
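
To make two of these objectives tangible, the sketch below builds toy training examples for the language-model and masked-language-model cases. The sentence, the helper `mask_at` and the `[MASK]` placeholder are assumptions for illustration; the translation objective is left out because it needs a parallel sentence pair.

```python
toy_tokens = "the hotel was very clean".split()

# Language model: predict each word from the words that precede it
lm_pairs = [(toy_tokens[:i], toy_tokens[i]) for i in range(1, len(toy_tokens))]

def mask_at(tokens, position, mask_token="[MASK]"):
    """Masked language model: hide one position, predict it from both sides."""
    masked = list(tokens)          # copy so the input is left untouched
    target = masked[position]
    masked[position] = mask_token
    return masked, target

print(lm_pairs[1])            # context on the left only
print(mask_at(toy_tokens, 2)) # context on both sides of the hidden word
```

The difference lies in the context: the language model only sees what comes before the target word, while the masked model sees the words on both sides.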

<img src="../data/images/lm_models.png?raw=1">

# A Brief Introduction to Cross-Lingual Word Embeddings

We give here a brief introduction to the concepts and methods that can learn joint word embeddings for multiple languages.

In the previous notebook we described how one can learn word embeddings from mono-lingual corpora. These mono-lingual word embeddings capture the particularities of the specific language, but they cannot be used in other languages. For example, we would like to leverage high-resource languages like English in order to enable downstream tasks in low-resource languages. To do so, one should learn word embeddings of different languages in a common space.

Let's see what word embeddings learned in a common space look like. For this purpose we will use the ConceptNet multilingual embeddings for English and French.

In [0]:
import sys
sys.path.append('..')
from src.utils import load_embeddings,emb2numpy
from IPython.display import Image
import numpy as np
from umap import UMAP
import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
# We select some English and French words
english_words = ["room","hotel","towel","book","coffee","chair","glass","pen","shoe","two","amazing"]
french_words = ["hôtel","chambre","livre","café","chaise","serviette","verre","stylo","chaussure","deux","fantastique"]

In [0]:
en_emb = load_embeddings(path=workdir+"/concept_net_1706.300.en", dimension=300,skip_header=False,vocab=english_words)
fr_emb = load_embeddings(path=workdir+"/concept_net_1706.300.fr", dimension=300,skip_header=False,vocab=french_words)

In [0]:
# Put the vectors in arrays
words_en,V_en = emb2numpy(en_emb)
words_fr,V_fr = emb2numpy(fr_emb)

In [0]:
vectors = np.concatenate((V_en,V_fr))
all_words  = words_en+words_fr

In [0]:
# We project the 300d vectors to a 2d space for visualization
V_umap = UMAP(n_neighbors=3,min_dist=0.6).fit_transform(vectors)


In [0]:
sns.set_context("talk")

fig= plt.figure(figsize=(10,6))

plt.scatter(V_umap[:, 0], V_umap[:, 1])
for i, word in enumerate(all_words):
    plt.annotate(word, xy=(V_umap[i, 0], V_umap[i, 1]))
plt.show()

You can observe that words from the two languages are close to each other in the embedding space.

Several methods for learning such cross-lingual embeddings have been proposed recently. The most straightforward ones try to align mono-lingual embeddings that have been learned separately. In order to do so, some sort of supervision is required, for example in the form of a bilingual dictionary or sentence-aligned data.

The following figure presents schematically the approach of mono-lingual mapping.

![](https://github.com/ioannispartalas/CrossLingual-NLP-AMLD2020/blob/master/data/images/alignment.png?raw=1)

The type of supervision can vary from parallel sentences, for example human translations, to cheaper signals such as bilingual dictionaries. In the case of bilingual lexicons, the respective methods use the dictionary to learn linear projections from the target to the source embeddings.


![](https://github.com/ioannispartalas/CrossLingual-NLP-AMLD2020/blob/master/data/images/bilingual_alignement.png?raw=1)
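
As a rough illustration of learning a linear projection from a dictionary, the sketch below solves the orthogonal Procrustes problem with NumPy on synthetic data. Constraining the map to be orthogonal is one common choice and an assumption here, not necessarily what any specific method cited in this section does. The random matrices stand in for real embeddings of dictionary word pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: row i of X_tgt and row i of X_src embed the two sides of
# dictionary entry i. The "source" space is a hidden rotation of the target one.
n_pairs, dim = 50, 4
X_tgt = rng.normal(size=(n_pairs, dim))
true_map, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
X_src = X_tgt @ true_map

# Orthogonal Procrustes: W = argmin ||X_tgt W - X_src||_F subject to W^T W = I,
# obtained from the SVD of X_tgt^T X_src.
U, _, Vt = np.linalg.svd(X_tgt.T @ X_src)
W = U @ Vt

# After projection, the target vectors land on their source counterparts
print(np.allclose(X_tgt @ W, X_src))
```

With real embeddings the fit is never exact, and the quality of the learned map depends on the size and quality of the dictionary.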

Other recent methods do not require any seed dictionary: they induce one through an iterative procedure and use it to learn the projections. For a comprehensive study, one can refer to [1]. Following the surge of work in this area, there is an effort to define more rigorous methodologies for evaluating cross-lingual embeddings [5], as well as to use more robust tools for comparing them in tasks like classification or retrieval [4].

## Sentence Embeddings

One issue with cross-lingual word embeddings is that they may not be able to capture salient information, as they neglect linguistic dependencies. One should use sequence models, like RNNs, on top of these representations in order to learn a vector representation of longer textual segments such as sentences. In recent work by Facebook Research, a multilingual model was developed that learns vector representations of sentences for 93 languages. [LASER](https://github.com/facebookresearch/LASER) (Language-Agnostic Sentence Representations) is essentially a translation model that leverages a seq2seq architecture (the figure is taken from [2]).

![](https://github.com/ioannispartalas/CrossLingual-NLP-AMLD2020/blob/master/data/images/laser.png?raw=1)

The model, unlike state-of-the-art translation models that use attention, uses a BiLSTM encoder with a max pooling operation which gives us the sentence embedding.
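
The pooling step can be sketched in a few lines. The matrix below is a made-up stand-in for the encoder outputs (one row per token); it only illustrates the operation and is not LASER's actual code.

```python
import numpy as np

# Made-up encoder outputs for a 4-token sentence: one row per token,
# one column per hidden dimension (real encoder states are much wider).
hidden_states = np.array([
    [0.1, -0.3, 0.5],
    [0.7,  0.2, -0.1],
    [-0.2, 0.9, 0.0],
    [0.4,  0.1, 0.3],
])

# Max pooling over the time axis keeps, for each dimension, the largest
# activation seen anywhere in the sentence, giving a fixed-size vector.
sentence_embedding = hidden_states.max(axis=0)
print(sentence_embedding)  # [0.7 0.9 0.5]
```

Whatever the sentence length, the result has one value per hidden dimension, which is what makes sentences of different lengths comparable.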

Let's use LASER and see how well it can embed a few parallel sentences in English, French and Greek. For this, we will use the Doc2Laser class.

In [0]:
import sys
sys.path.append("..")
from src.models import *

In [0]:
print(Doc2Laser.__doc__)

In [0]:
en_sentences = ["This is a nice hotel.",
                "The bathroom was clean",
                "The dog is brown",
                "I will call you",
               "Not very far from the center"]

# define a transformer
doc2laser_transformer = Doc2Laser("en")

# Get the representation of the sentences
X_en = doc2laser_transformer.transform(en_sentences)

In [0]:
fr_sentences = ["Celui-ci était un hôtel magnifique",
                "La salle de bain était propre",
                "Le chien est brun",
                "Je t'appelle",
               "Pas très loin du centre"]

# Change the language in the transformer
doc2laser_transformer.set_params(lang="fr")
X_fr = doc2laser_transformer.transform(fr_sentences)

In [0]:
 # Change the language in the transformer
doc2laser_transformer.set_params(lang="el")
gr_sentences = ["Το ξενοδοχείο ήταν υπέροχο",
                "Η τουαλέτα ήταν καθαρή",
                "Ο σκύλος είναι καφέ",
                "Σε παίρνω τηλέφωνο",
                "Όχι πολύ μακριά από το κέντρο"]
X_gr = doc2laser_transformer.transform(gr_sentences)

Let's project the sentence representations now in a 2d space and check if the parallel sentences in the three languages are close.

In [0]:
V_umap = UMAP(n_neighbors=5,min_dist=0.2).fit_transform(np.concatenate((X_en,X_fr,X_gr)))


In [0]:
fig= plt.figure(figsize=(12,6))

plt.scatter(V_umap[:, 0], V_umap[:, 1])
for i, word in enumerate(en_sentences+fr_sentences+gr_sentences):
    plt.annotate(word, xy=(V_umap[i, 0], V_umap[i, 1]))
plt.show()

We can observe that the parallel sentences are close to each other in the embedding space, which means that the model can capture semantics in a single multilingual latent space.
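
The visual impression can also be checked numerically. The sketch below uses small made-up vectors in place of the real sentence embeddings and finds, for each row of one matrix, its most cosine-similar row in the other. With well-aligned parallel sentences, row `i` should retrieve row `i`. The helper name `nearest_rows` is hypothetical.

```python
import numpy as np

def nearest_rows(A, B):
    """For each row of A, return the index of the most cosine-similar row of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return (A_unit @ B_unit.T).argmax(axis=1)

# Made-up stand-ins for two sets of parallel sentence embeddings
toy_a = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]])
toy_b = np.array([[0.9, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]])

matches = nearest_rows(toy_a, toy_b)
print(matches)                                     # best match per row
print((matches == np.arange(len(toy_a))).mean())   # fraction retrieved correctly
```

Replacing the toy matrices with the embeddings of two languages would give a retrieval accuracy for that language pair.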

***Exercise:*** Try adding a few more parallel sentences in other languages and project them in the same way.

## References

[1. Ruder et al., A Survey Of Cross-lingual Word Embedding Models](https://arxiv.org/abs/1706.04902)

[2. Artetxe and Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://arxiv.org/abs/1812.10464)

[3. Lena Voita et al., Evolution of Representations in the Transformer](https://arxiv.org/abs/1909.01380)

[4. Balikas and Partalas, Wasserstein distances for evaluating cross-lingual embeddings](https://arxiv.org/abs/1910.11005)

[5. Artetxe et al., A Call for More Rigor in Unsupervised Cross-lingual Learning, ACL 2020](https://arxiv.org/abs/2004.14958)


# Cross-lingual Document Classification

## Problem description

Cross-lingual document classification (CLDC) is the text mining problem where we are given:
- labeled documents for training in a source language $\ell_1$, and 
- test documents written in a target language $\ell_2$. 

For example, the training documents are written in English, and the test documents are written in French. 


CLDC is an interesting problem. The hope is that we can use resource-rich languages to train models that can be applied to resource-deprived languages. This would result in transferring knowledge from one language to another. 
There are several methods that can be used in this context. In this workshop we start from naive approaches and progressively introduce more complex solutions. 

The most naive solution is to ignore the fact that the training and test documents are written in different languages.

In [0]:
import pandas as pd
from sklearn.metrics import accuracy_score,f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y
from sklearn.utils.multiclass import unique_labels
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
from prettytable import PrettyTable


import sys
sys.path.append("..")

from collections import Counter
from src.models import *
from src.utils import *
from src.dataset import *


<img src="https://github.com/ioannispartalas/CrossLingual-NLP-AMLD2020/blob/master/data/images/classes.png?raw=1" width=600>

1. Dataset: holds the data of the source and target languages.
2. System: a set of steps that implements `fit` and `predict`. It can also take the form of a pipeline.
3. Experiment: given a Dataset and a System, it fits, predicts and reports evaluation scores.
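
The division of responsibilities can be sketched as follows. All names here (`ToyDataset`, `MajoritySystem`, `run_experiment`) are hypothetical and much simpler than the workshop's own classes; the sketch only shows how the three roles fit together.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ToyDataset:
    """Holds labeled source-language data and labeled target-language test data."""
    train: list
    y_train: list
    test: list
    y_test: list

class MajoritySystem:
    """A trivial system: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_] * len(X)

def run_experiment(system, dataset):
    """Fit on the source language, predict on the target language, return accuracy."""
    predictions = system.fit(dataset.train, dataset.y_train).predict(dataset.test)
    correct = sum(p == t for p, t in zip(predictions, dataset.y_test))
    return correct / len(dataset.y_test)

toy = ToyDataset(
    train=["great hotel", "nice room", "dirty towels"],
    y_train=["positive", "positive", "negative"],
    test=["hotel fantástico", "habitación sucia"],
    y_test=["positive", "negative"],
)
print(run_experiment(MajoritySystem(), toy))  # 0.5
```

Because the experiment only relies on `fit` and `predict`, any system exposing those two methods can be swapped in without touching the rest.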

For this workshop we will use a dataset from the [SemEval](http://alt.qcri.org/semeval2015/) workshop for the Sentiment Analysis task. While the task has three classes, that is **Positive, Negative, Neutral**, we will use only two of them in order to simplify it. So, let's load the data for a pair of languages and check a few statistics.

In [0]:
dataset = Dataset(workdir+"/data/raw/","en", "es")

dataset.load_data()
#To check the arguments of the function
#print(dataset.load_cl_embeddings.__doc__)
dataset.load_cl_embeddings(workdir+"/",300,False)


In [0]:
# Plot the counts on the classes for the source language
sns.countplot(x=dataset.y_train, order=["negative","positive"])

In [0]:
# And for the Spanish dataset
sns.countplot(x=dataset.y_test, order=["negative","positive"])

Observe that the datasets are unbalanced, as we have many more positive comments than negative ones. We will start by establishing a few baselines and see how we can improve over them by leveraging cross-lingual word embeddings. We will start with a dummy classifier that generates random predictions respecting the class distribution.

In [0]:
# Let's keep the scores of all the experiments in a table
x = PrettyTable()

x.field_names = ["Model", "f-score"]

# Dummy baseline: random predictions that follow the class distribution
pipeline = Pipeline([('vectorizer', CountVectorizer()), 
                     ('classifier', DummyClassifier(strategy="stratified"))])
runner = Runner(pipeline, dataset)
score = runner.eval_system()
x.add_row(["Dummy", format_score(score)])
print(x)

We start with a model that just uses term frequencies to represent the documents. We expect that in cases where the source and target languages share part of their vocabulary, for example among Latin languages, this approach can potentially give decent results. We will just use unigrams for this exercise, but of course you can alter this baseline to leverage character n-grams.

In [0]:
# Logistic Regression on words
pipeline = Pipeline([('vectorizer', CountVectorizer(lowercase=True)), 
                     ('classifier', LogisticRegression(solver="lbfgs"))])
runner = Runner(pipeline, dataset)
score = runner.eval_system()
x.add_row(["LR unigrams",format_score(score)])
print(x)

Let's now see how we can leverage the cross-lingual word embeddings to perform zero-shot learning. A simple but effective baseline consists of averaging the word embeddings in each document to come up with a document (or sentence) representation. We will do that by using a look-up table to pull the appropriate cross-lingual word embeddings for each document, as shown in the diagram:

![](https://github.com/ioannispartalas/CrossLingual-NLP-AMLD2020/blob/master/data/images/vec_average.png?raw=1)

As we saw during the introduction, we use a binary representation of the document terms to perform a look-up in the embedding matrix of size $V\times d$, where $V$ is the size of the vocabulary and $d$ the dimension of the latent space, and pull the corresponding vectors. In the example, we pull three vectors. Finally, we compute the document vector by simply averaging them. We repeat this operation for each document in both the source and the target languages. Then, following the zero-shot learning framework, we train a classifier on the source language and predict on the target language.
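
The look-up and averaging can be written as one matrix product. The sketch below uses a tiny made-up vocabulary and embedding matrix; it illustrates the idea and is not a description of how the workshop's classifier is implemented internally.

```python
import numpy as np

# Toy embedding matrix of size V x d (V = 4 vocabulary words, d = 2); values are made up
toy_vocab = ["clean", "hotel", "room", "very"]
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5],
              [0.2, 0.8]])

# Binary bag-of-words rows: 1 if the word occurs in the document, else 0
B = np.array([[1, 1, 0, 1],    # "very clean hotel"
              [0, 0, 1, 0]])   # "room"

# B @ E sums the embeddings of the words present; dividing by the number of
# words present turns the sum into an average.
n_words = B.sum(axis=1, keepdims=True)
doc_vectors = (B @ E) / np.maximum(n_words, 1)  # guard against empty documents
print(doc_vectors)
```

Each row of `doc_vectors` is a document representation of dimension $d$, regardless of how many words the document contains.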

In [0]:
for name, myclf in zip(['Knn-nBow', 'LR-nBow'],[KNeighborsClassifier(n_neighbors=2), LogisticRegression(C=10, solver="lbfgs")]):

    avg_baseline = nBowClassifier(myclf,dataset.source_embeddings,dataset.target_embeddings)

    pipeline = Pipeline([('vectorizer', CountVectorizer(lowercase=True,vocabulary=dataset.vocab_)), 
                         ('classifier', avg_baseline)])

    runner = Runner(pipeline, dataset)
    x.add_row([name, format_score(runner.eval_system())])
    

In [0]:
print(x)

In the following model we will use the LASER representations in order to train the classifiers within the same framework.

In [0]:
for name, myclf in zip(['Knn-laser', 'LR-laser'],[KNeighborsClassifier(n_neighbors=2), LogisticRegression(C=10, solver="lbfgs")]):
    laser_clf = LASERClassifier(myclf, dataset.source_lang, dataset.target_lang)
    pipeline = Pipeline([("doc2laser",Doc2Laser()),('classifier', laser_clf)])
    pipeline.set_params(doc2laser__lang=dataset.source_lang)
    pipeline.fit(dataset.train,dataset.y_train)
    runner = Runner(pipeline, dataset)

    pipeline.set_params(doc2laser__lang=dataset.target_lang)
    x.add_row([name, format_score(runner.eval_system(prefit=True))])

In [0]:
print(x)

We observe that the zero-shot learning using LASER representations can achieve state-of-the-art results in this pair of languages. 

***Exercises:*** 

* Use other pairs of languages and compare the performance. For example, you can try to transfer from more distant languages like Russian.
* Write a function that evaluates all (source, target) language pairs and compare the results.
* Tune the classifier or use other types of models.

# Few-shot Learning

In this notebook, we will work on a multilingual dataset containing sentences in six languages: English, Dutch, Spanish, Russian, Arabic and Turkish. Every sentence in every language comes with a sentiment label indicating *positive* or *negative* content. There is no sentence overlap between languages.

Since we work with the LASER multilingual representation, we directly provide the sentence embeddings for all languages. Every sentence is represented by a 1024-dimensional vector indicating its position in the LASER space.

In [0]:
import sys
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
sys.path.insert(1, workdir)

from src.utils import load_training_languages, model_evaluation, get_statistics

The following three utility functions will be used in this notebook:

- `model_evaluation(model, [languages])`: evaluates the `model` over the list of `languages`. Returns the [F1](https://en.wikipedia.org/wiki/F1_score) score, which is better suited to imbalanced datasets than accuracy, and the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix), to analyse the model outputs in detail.
- `x_train, y_train = load_training_languages([languages])`: returns the concatenated features and labels for the languages specified in `languages`.
- `get_statistics([languages])`: prints the class population for the languages specified in `languages`.
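
To see why F1 is preferred on imbalanced data, here is a small sketch that derives it from confusion-matrix counts. The helper `f1_from_counts` and the numbers are illustrative assumptions; how `model_evaluation` aggregates F1 over classes is not specified here.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score of one class from its confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Imbalanced toy case: 90 examples of class A, 10 of class B, and a model that
# predicts A for everything. Accuracy is 0.9, yet F1 for class B is 0.
print(f1_from_counts(tp=0, fp=0, fn=10))
# A model that finds 8 of the 10 class-B examples at the cost of 4 false alarms
print(f1_from_counts(tp=8, fp=4, fn=2))
```

A model that ignores the minority class can look good in terms of accuracy, but its F1 on that class reveals the problem.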

# Dataset statistics

The multilingual dataset consists of 6 different languages: English (`en`), Spanish (`es`), Dutch (`nl`), Russian (`ru`), Arabic (`ar`) and Turkish (`tr`).

In [None]:
all_languages = ['en','es','nl','ru','ar','tr']

get_statistics(all_languages)

# Few Shot Learning
While learning a classification model for a given language generally requires an abundance of training material, some languages are frequently under-represented, leading to poor prediction performance.

In that situation, a common language representation such as LASER makes it possible to enlarge the training data by adding (possibly larger) datasets from other languages to the initial (small) set.

As shown in the figure below, populating the training space increases the chances of accurately determining the decision function.

![Few Shot Learning](https://upload.wikimedia.org/wikipedia/commons/d/d0/Example_of_unlabeled_data_in_semisupervised_learning.png)

In the following, we are going to experiment with few-shot learning by training and testing classifiers on different combinations of languages.

Let's train a [Logistic Regression](https://fr.wikipedia.org/wiki/R%C3%A9gression_logistique) (a linear classifier) on Russian, and look at the model's performance.


In [0]:
x_train,y_train = load_training_languages(['ru'])
lr = LogisticRegression(C = 10,max_iter = 200,random_state = 1).fit(x_train,y_train)
_ = model_evaluation(lr, ['ru'])

The overall performance is not fantastic. Could we do better? Let's add more languages to the training data.


In [0]:
x_train,y_train = load_training_languages(all_languages)
lr = LogisticRegression(C = 10,max_iter = 200,random_state = 1).fit(x_train,y_train)
_ = model_evaluation(lr, ['ru'])

The F1 score has improved by 0.1! Quite impressive.

Let's repeat the same operation with Turkish.



In [0]:

x_train,y_train = load_training_languages(['tr'])
lr = LogisticRegression(C = 10,random_state = 1).fit(x_train,y_train)
_ = model_evaluation(lr, ['tr'])


The F1 score is now quite low. A small dataset, data quality or language complexity may explain the poor performance.

Fair enough, let's use all available languages to improve our model.


In [0]:

x_train,y_train = load_training_languages(all_languages)
lr = LogisticRegression(C = 10,max_iter = 200,random_state = 1).fit(x_train,y_train)
_ = model_evaluation(lr, ['tr'])


No improvement... Maybe another combination of languages leads to different results. What happens if we remove Spanish and Russian from the training set?


In [0]:

x_train,y_train = load_training_languages(['ar','tr','nl','en'])
lr = LogisticRegression(C = 10,max_iter = 200,random_state = 1).fit(x_train,y_train)
_ = model_evaluation(lr, ['tr'])


Better! Apparently, Spanish and Russian were hurting the model for Turkish.

Could we imagine a more systematic source language selection to optimize performance on a specific target language? (Beware that the test set of the target language cannot be used to perform this selection.)
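
One way to approach this question, sketched under assumptions: try every subset of the other languages and keep the one with the best score on held-out validation data from the target language. The function `best_source_subset` and the scorer `toy_score` (with made-up numbers) are hypothetical; in practice the scorer would train on the subset and compute F1 on a validation split carved out of the target language's training data.

```python
from itertools import combinations

def best_source_subset(candidates, target, score_fn):
    """Try every subset of candidate languages (always including the target
    language) and return the subset with the best validation score."""
    best_subset, best_score = None, float("-inf")
    for size in range(0, len(candidates) + 1):
        for extra in combinations(candidates, size):
            subset = (target,) + extra
            score = score_fn(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Hypothetical validation scorer with made-up numbers: it stands in for
# "train on `subset`, then compute F1 on held-out target-language data".
def toy_score(subset):
    gain = {"tr": 0.60, "en": 0.05, "nl": 0.03, "ar": 0.02, "es": -0.04, "ru": -0.03}
    return sum(gain[lang] for lang in subset)

print(best_source_subset(["en", "es", "nl", "ru", "ar"], "tr", toy_score))
```

The exhaustive search grows exponentially with the number of languages, which is fine for five candidates but would call for a greedy strategy with many more.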

## Non linear model

Until now we have used Logistic Regression. However, more complex models are also possible, such as a [multi layer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (MLP):

In [0]:
from sklearn.neural_network import MLPClassifier

# Note: early_stopping and validation_fraction only take effect with the
# 'sgd' and 'adam' solvers; they are ignored by 'lbfgs'.
mlp = MLPClassifier(solver='lbfgs',
                    hidden_layer_sizes=(16,),  # one hidden layer of 16 units
                    activation='relu',
                    alpha=1e-3,
                    max_iter=50,
                    early_stopping=True,
                    validation_fraction=0.2,
                    random_state=1)

_ = model_evaluation(mlp.fit(x_train, y_train), ['ru'])

Another option is [extreme gradient boosting](https://en.wikipedia.org/wiki/XGBoost) (XGBoost):

In [0]:
import xgboost as xgb
boost = xgb.XGBClassifier(objective="binary:logistic", max_depth=5, random_state=42)
_ = model_evaluation(boost.fit(x_train,y_train),['ru'])

What can we conclude from the above results?