# Problem description

Cross-lingual document classification (CLDC) is the text mining problem where we are given:
- labeled documents for training in a source language $\ell_1$, and 
- test documents written in a target language $\ell_2$. 

For example, the training documents may be written in English and the test documents in French.


CLDC is interesting because it lets us train models on resource-rich languages and apply them to low-resource languages, effectively transferring knowledge from one language to another.
Several methods can be used in this context. In this workshop we start from naive approaches and progressively introduce more complex solutions.

The most naive solution is to ignore the fact that the training and test documents are written in different languages.
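
To make the limitation of this naive approach concrete, here is a minimal sketch on invented toy sentences (they are not taken from the workshop data). It fits a bag-of-words model on English text and checks which tokens of the Spanish test sentences the model has seen before. Only cognates that are spelled identically in both languages survive; a document with no shared token becomes an all-zero vector, so the classifier can rely on nothing but its intercept.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration: English for training, Spanish for testing.
train_docs = ["great food and friendly staff", "terrible service and cold food",
              "excellent hotel", "horrible room"]
train_labels = ["positive", "negative", "positive", "negative"]
test_docs = ["comida excelente y personal amable",
             "servicio terrible y habitación horrible"]

# Naive approach: fit on English, then predict on Spanish as if nothing changed.
naive = make_pipeline(CountVectorizer(), LogisticRegression())
naive.fit(train_docs, train_labels)

vectorizer = naive.named_steps["countvectorizer"]
analyzer = vectorizer.build_analyzer()   # same tokenization the vectorizer uses
known_vocab = set(vectorizer.vocabulary_)

def shared_tokens(doc):
    """Tokens of `doc` that also occur in the training vocabulary."""
    return [token for token in analyzer(doc) if token in known_vocab]

for doc, prediction in zip(test_docs, naive.predict(test_docs)):
    print(doc, "| shared tokens:", shared_tokens(doc), "| predicted:", prediction)
```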

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y
from sklearn.utils.multiclass import unique_labels

from collections import Counter
from models import *
from utils import *
from dataset import *


/home/ipartalas/projects/LASER/
/home/ipartalas/projects/LASER/


The code for this workshop is organized around three components:

1. Dataset: holds the data for the source and target languages.
2. System: a sequence of steps that implements fit and predict; it can also take the form of a pipeline.
3. Experiment: given a Dataset and a System, it fits the system, predicts, and reports evaluation scores.
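
The three roles listed above can be sketched as follows. This is only an illustration under stated assumptions: `ToyDataset` and `ToyExperiment` are hypothetical names, the sentences are invented, and macro-averaged F1 is an arbitrary choice of score. It is not the implementation behind the `Dataset` and `Runner` classes used in this workshop.

```python
from dataclasses import dataclass
from typing import List

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline


@dataclass
class ToyDataset:
    """Hypothetical Dataset: holds source-language and target-language data."""
    source_lang: str
    target_lang: str
    train: List[str]      # documents in the source language
    y_train: List[str]
    test: List[str]       # documents in the target language
    y_test: List[str]


class ToyExperiment:
    """Hypothetical Experiment: fits a system, predicts, and reports a score."""

    def __init__(self, system, dataset):
        self.system = system
        self.dataset = dataset

    def run(self):
        self.system.fit(self.dataset.train, self.dataset.y_train)
        predictions = self.system.predict(self.dataset.test)
        # Macro-averaged F1 is an assumption made for this sketch.
        return f1_score(self.dataset.y_test, predictions,
                        average="macro", zero_division=0)


data = ToyDataset(
    "en", "es",
    train=["good food", "bad food", "good hotel", "bad hotel"],
    y_train=["positive", "negative", "positive", "negative"],
    test=["hotel terrible", "hotel excelente"],
    y_test=["negative", "positive"],
)

# The System is any object with fit and predict, here a scikit-learn pipeline.
system = make_pipeline(CountVectorizer(), LogisticRegression())
score = ToyExperiment(system, data).run()
print(score)
```

Keeping the three roles separate means the same dataset can be reused with any system that exposes `fit` and `predict`.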

For this workshop we will use a dataset from the Sentiment Analysis task of the [SemEval](http://alt.qcri.org/semeval2015/) workshop. The task has three classes (**Positive, Negative, Neutral**), but to simplify it we keep only two, positive and negative. Let's load the data for a pair of languages and check a few statistics.

In [2]:
exp = Dataset("en", "es")
exp.load_data()
exp.load_cl_embeddings("./",300,False)



Training data
Training Data Shape:  (1635, 2)
Class distribution:  {'positive': 1114, 'negative': 521}

Training data
Training Data Shape:  (644, 2)
Class distribution:  {'positive': 455, 'negative': 189}
Loaded 3315 vectors
Loaded 1287 vectors


Observe that the datasets are unbalanced: there are many more positive comments than negative ones (1114 vs. 521 in the first set and 455 vs. 189 in the second). We will start by establishing a few baselines and then see how we can improve over them by leveraging cross-lingual word embeddings.
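
Because of this imbalance, a score is only meaningful relative to what a trivial predictor achieves. The sketch below takes the class counts of the 644-document set from the output above and computes what always predicting the majority class would score. Accuracy and macro-averaged F1 are used purely as examples; they are not necessarily the metric reported by `eval_system`.

```python
from sklearn.metrics import accuracy_score, f1_score

# Class counts of the 644-document set, copied from the output above.
counts = {"positive": 455, "negative": 189}

y_true = ["positive"] * counts["positive"] + ["negative"] * counts["negative"]

# A trivial predictor that always answers with the most frequent class.
majority = max(counts, key=counts.get)
y_pred = [majority] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.3f}  macro-F1={macro_f1:.3f}")
# prints: accuracy=0.707  macro-F1=0.414
```

Accuracy rewards the trivial predictor for the skewed class distribution, while the macro average penalizes it for never predicting the minority class.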

In [3]:
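# Dummy baseline, intended as a majority-class predictor.
# Note: DummyClassifier's default strategy depends on the scikit-learn version.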
pipeline = Pipeline([('vectorizer', CountVectorizer()), 
                     ('classifier', DummyClassifier())])
runner = Runner(pipeline, exp)
runner.eval_system()



0.3088235294117647

In [4]:
# Logistic Regression on words
pipeline = Pipeline([('vectorizer', CountVectorizer(lowercase=True)), 
                     ('classifier', LogisticRegression(solver="lbfgs"))])
runner = Runner(pipeline, exp)
runner.eval_system()

0.3515625

In [5]:
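# Embedding-based baseline: nBowClassifier wraps a k-NN classifier and receives
# the source-language and target-language cross-lingual embeddings
avg_baseline = nBowClassifier(KNeighborsClassifier(n_neighbors=2), exp.source_embeddings, exp.target_embeddings)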

pipeline = Pipeline([('vectorizer', CountVectorizer(lowercase=True,vocabulary=exp.vocab_)), 
                     ('classifier', avg_baseline)])

runner = Runner(pipeline, exp)
runner.eval_system()

0.5748031496062992
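
The internals of `nBowClassifier` are not shown here. A common way to build such a baseline is to represent each document by the average of its word vectors, which works across languages when both vocabularies are embedded in the same space. The sketch below illustrates that idea with invented two-dimensional embeddings; the helpers `average_embedding` and `cosine` are hypothetical and not part of the workshop code.

```python
import numpy as np

# Toy 2-d "cross-lingual" embeddings with invented values:
# translations of the same word lie close together in the shared space.
source_embeddings = {"good": np.array([0.9, 0.1]),
                     "bad": np.array([-0.8, 0.2]),
                     "food": np.array([0.0, 1.0])}
target_embeddings = {"buena": np.array([0.8, 0.2]),
                     "mala": np.array([-0.75, 0.25]),
                     "comida": np.array([0.05, 0.95])}

def average_embedding(tokens, embeddings, dim=2):
    """Mean of the vectors of the known tokens; zero vector if none is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

en_pos = average_embedding("good food".split(), source_embeddings)
en_neg = average_embedding("bad food".split(), source_embeddings)
es_doc = average_embedding("comida buena".split(), target_embeddings)

# The Spanish document is closer to the positive English document,
# which is what a nearest-neighbour classifier would exploit.
print(cosine(es_doc, en_pos), cosine(es_doc, en_neg))
```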

In [6]:
# Compare two classifiers on top of the same embedding-based document representation
classifiers = [('Knn', KNeighborsClassifier(n_neighbors=2)),
               ('Log. Regression', LogisticRegression(C=10, solver="lbfgs"))]

for name, myclf in classifiers:
    avg_baseline = nBowClassifier(myclf, exp.source_embeddings, exp.target_embeddings)

    pipeline = Pipeline([('vectorizer', CountVectorizer(lowercase=True, vocabulary=exp.vocab_)),
                         ('classifier', avg_baseline)])

    runner = Runner(pipeline, exp)
    print(name, runner.eval_system())

Knn 0.5748031496062992
Log. Regression 0.6322188449848024


In [7]:
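# k-NN classifier on top of LASER document embeddings
laser_clf = LASERClassifier(KNeighborsClassifier(n_neighbors=2), exp.source_lang, exp.target_lang)

pipeline = Pipeline([("doc2laser", Doc2Laser()), ('classifier', laser_clf)])

# Encode the training documents in the source language and fit the pipeline
pipeline.set_params(doc2laser__lang=exp.source_lang)
pipeline.fit(exp.train, exp.y_train)

runner = Runner(pipeline, exp)

# Switch the encoder to the target language, then evaluate the already fitted pipeline
pipeline.set_params(doc2laser__lang=exp.target_lang)
runner.eval_system(prefit=True)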

 - Encoder: loading /home/ipartalas/projects/LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer: temp_in_docs.txt in language en  
 - fast BPE: processing tok
 - Encoder: bpe to out.raw
 - Encoder: 1635 sentences in 10s
 - Encoder: loading /home/ipartalas/projects/LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer: temp_in_docs.txt in language es  
 - fast BPE: processing tok
 - Encoder: bpe to out.raw
 - Encoder: 644 sentences in 7s


0.6493506493506493

In [9]:
# Compare two classifiers on top of the LASER document embeddings
classifiers = [('Knn', KNeighborsClassifier(n_neighbors=2)),
               ('Log. Regression', LogisticRegression(C=10, solver="lbfgs"))]

for name, myclf in classifiers:
    laser_clf = LASERClassifier(myclf, exp.source_lang, exp.target_lang)
    pipeline = Pipeline([("doc2laser", Doc2Laser()), ('classifier', laser_clf)])

    # Fit on the source language
    pipeline.set_params(doc2laser__lang=exp.source_lang)
    pipeline.fit(exp.train, exp.y_train)
    runner = Runner(pipeline, exp)

    # Evaluate the fitted pipeline on the target language
    pipeline.set_params(doc2laser__lang=exp.target_lang)
    print(name, runner.eval_system(prefit=True))

 - Encoder: loading /home/ipartalas/projects/LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer: temp_in_docs.txt in language en  
 - fast BPE: processing tok
 - Encoder: bpe to out.raw
 - Encoder: 1635 sentences in 7s
 - Encoder: loading /home/ipartalas/projects/LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer: temp_in_docs.txt in language es  
 - fast BPE: processing tok
 - Encoder: bpe to out.raw
 - Encoder: 644 sentences in 6s
Knn 0.6493506493506493
 - Encoder: loading /home/ipartalas/projects/LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer: temp_in_docs.txt in language en  
 - fast BPE: processing tok
 - Encoder: bpe to out.raw
 - Encoder: 1635 sentences in 7s
 - Encoder: loading /home/ipartalas/projects/LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer: temp_in_docs.txt in language es  
 - fast BPE: processing tok
 - Encoder: bpe to out.raw
 - Encoder: 644 sentences in 6s
Log. Regression 0.7714285714285715
