# Problem description

Cross-lingual document classification (CLDC) is the text mining problem where we are given:
- labeled documents for training in a source language $\ell_1$, and 
- test documents written in a target language $\ell_2$. 

For example, the training documents may be written in English and the test documents in French.


CLDC is of practical interest: the hope is that models trained on resource-rich languages can be applied to resource-poor languages, transferring knowledge from one language to another.
Several methods can be used in this setting. In this workshop we start from naive approaches and progressively introduce more complex solutions.

The most naive solution is to ignore the fact that the training and test documents are written in different languages.

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score,f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y
from sklearn.utils.multiclass import unique_labels

from collections import Counter
from utils import *
from dataset import *
from models import *

/home/ipartalas/projects/LASER/
/home/ipartalas/projects/LASER/


1. Dataset: holds the data for the source and target languages.
2. System: a sequence of steps that supports `fit` and `predict`. It can also take the form of a pipeline.
3. Experiment: given a Dataset and a System, it fits the system, predicts, and reports evaluation scores.
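
The three roles above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `ToyDataset`, `MajoritySystem`, and `run_experiment` are hypothetical names, not the classes imported from `dataset`, `models`, or `utils`, and accuracy is an arbitrary choice of score.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ToyDataset:
    """Hypothetical container: source-language training docs, target-language test docs."""
    train_docs: List[str]
    y_train: List[str]
    test_docs: List[str]
    y_test: List[str]


class MajoritySystem:
    """Hypothetical system with fit/predict; always predicts the most frequent training label."""

    def fit(self, docs, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, docs):
        return [self.label_ for _ in docs]


def run_experiment(system, data):
    """Fit on the source language, predict on the target language, return accuracy."""
    system.fit(data.train_docs, data.y_train)
    predictions = system.predict(data.test_docs)
    correct = sum(p == y for p, y in zip(predictions, data.y_test))
    return correct / len(data.y_test)


# Made-up toy data, for illustration only.
toy = ToyDataset(
    train_docs=["good film", "great plot", "bad acting"],
    y_train=["positive", "positive", "negative"],
    test_docs=["buena película", "mala actuación"],
    y_test=["positive", "negative"],
)
accuracy = run_experiment(MajoritySystem(), toy)
print(accuracy)
```

Any object that exposes `fit` and `predict`, including a scikit-learn `Pipeline`, could be passed as `system` in this sketch.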

In [11]:
exp = Dataset("./data/datasets/","en", "es")
exp.load_data()
exp.load_cl_embeddings("../NLP/sentiment_classification/embeddings/",300,False)



Training data
Training Data Shape:  (1635, 2)
Class distribution:  {'positive': 1114, 'negative': 521}

Training data
Training Data Shape:  (644, 2)
Class distribution:  {'positive': 455, 'negative': 189}
Loaded 3315 vectors
Loaded 1139 vectors


In [12]:
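# Dummy baseline. DummyClassifier is used with its default strategy, which depends on the
# scikit-learn version; pass strategy="most_frequent" for a strict majority-class baseline.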
pipeline = Pipeline([('vectorizer', CountVectorizer()), 
                     ('classifier', DummyClassifier())])
runner = Runner(pipeline, exp)
runner.eval_system()



0.2544529262086514

In [13]:
# Logistic Regression on words
pipeline = Pipeline([('vectorizer', CountVectorizer(lowercase=True)), 
                     ('classifier', LogisticRegression(solver="lbfgs"))])
runner = Runner(pipeline, exp)
runner.eval_system()

0.3515625
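
One likely reason a bag-of-words model transfers poorly across languages is that a vectorizer fitted on the source language ignores every target-language token it has not seen. The sketch below measures how much of a target vocabulary is shared with the source vocabulary. The function names and the toy sentences are made up for illustration, and whitespace tokenization is a simplifying assumption.

```python
def vocabulary(docs):
    """Set of lowercased whitespace tokens found in a list of documents."""
    return {token for doc in docs for token in doc.lower().split()}


def shared_fraction(source_docs, target_docs):
    """Fraction of the target vocabulary that also occurs in the source vocabulary."""
    source_vocab = vocabulary(source_docs)
    target_vocab = vocabulary(target_docs)
    if not target_vocab:
        return 0.0
    return len(source_vocab & target_vocab) / len(target_vocab)


# Made-up toy documents, for illustration only.
source_docs = ["the hotel was excellent", "terrible service in the hotel"]
target_docs = ["el hotel era excelente", "servicio terrible"]

overlap = shared_fraction(source_docs, target_docs)
print(round(overlap, 3))
```

In this toy case only `hotel` and `terrible` are shared, so two of the six target tokens would carry any signal for a classifier trained on the source documents.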

In [14]:
# k-NN (k=2) baseline on bag-of-words documents mapped through the cross-lingual embeddings
params = {"n_neighbors":2}
avg_baseline = nBowClassifier("knn",exp.source_embeddings,exp.target_embeddings,params)

pipeline = Pipeline([('vectorizer', CountVectorizer(lowercase=True,vocabulary=exp.vocab_)), 
                     ('classifier', avg_baseline)])

runner = Runner(pipeline, exp)
runner.eval_system()

0.5617529880478088
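
The idea behind an embedding-based baseline can be sketched as follows, assuming that each document is represented by the average of its word vectors and that source and target vectors live in a shared space. This is not the `nBowClassifier` implementation, which is not shown here: the 2-d vectors are made up, the helper names are hypothetical, and a single nearest neighbour with cosine similarity stands in for the k-NN classifier used above.

```python
import math

# Made-up 2-d vectors standing in for aligned cross-lingual embeddings.
source_embeddings = {"good": [1.0, 0.1], "bad": [-1.0, 0.2], "film": [0.0, 1.0]}
target_embeddings = {"buena": [0.9, 0.0], "mala": [-0.9, 0.1], "película": [0.1, 1.0]}


def average_vector(doc, embeddings):
    """Average the vectors of the known tokens; zero vector if none is known."""
    vectors = [embeddings[t] for t in doc.lower().split() if t in embeddings]
    if not vectors:
        return [0.0, 0.0]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.hypot(u[0], u[1]) * math.hypot(v[0], v[1])
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def predict_1nn(doc, train_docs, train_labels):
    """Label of the source-language training document closest to the target-language doc."""
    query = average_vector(doc, target_embeddings)
    similarities = [
        cosine(query, average_vector(d, source_embeddings)) for d in train_docs
    ]
    return train_labels[similarities.index(max(similarities))]


train_docs = ["good film", "bad film"]
train_labels = ["positive", "negative"]
prediction = predict_1nn("mala película", train_docs, train_labels)
print(prediction)
```

Because both languages are mapped into the same vector space, the classifier never needs the source and target documents to share surface tokens.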

In [16]:
params = {"n_neighbors":2}
laser_clf = LASERClassifier("knn",exp.source_lang,exp.target_lang,params)

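pipeline = Pipeline([("doc2laser",Doc2Laser()),('classifier', laser_clf)])
# Encode the training documents as source-language text, then fit.
pipeline.set_params(doc2laser__lang=exp.source_lang)
pipeline.fit(exp.train,exp.y_train)

runner = Runner(pipeline, exp)

# Switch the encoder to the target language before evaluating the already fitted pipeline.
pipeline.set_params(doc2laser__lang=exp.target_lang)
runner.eval_system(prefit=True)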

 - Encoder: loading /home/ipartalas/projects/LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer: temp_in_docs.txt in language en  
 - fast BPE: processing tok
 - Encoder: bpe to out.raw
 - Encoder: 1635 sentences in 6s
 - Encoder: loading /home/ipartalas/projects/LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer: temp_in_docs.txt in language es  
 - fast BPE: processing tok
 - Encoder: bpe to out.raw
 - Encoder: 644 sentences in 6s


0.6493506493506493