# Problem description

Cross-lingual document classification (CLDC) is the text mining problem where we are given:
- labeled documents for training in a source language $\ell_1$, and 
- test documents written in a target language $\ell_2$. 

For example, the training documents are written in English, and the test documents are written in French. 


CLDC matters because it lets us train models on resource-rich languages and apply them to resource-poor ones, transferring knowledge from one language to another. 
Several methods can be used in this setting. In this workshop we start from naive approaches and progressively introduce more complex solutions. 

The most naive solution is to ignore the fact that the training and test documents are written in different languages.
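
To see why this approach is naive, consider what a bag-of-words vectorizer fitted on the source language does to target-language text: every token outside the training vocabulary is silently dropped. The sketch below uses made-up English and Spanish sentences (they are illustrative assumptions, not taken from the workshop data) and counts how many tokens of each target document survive.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences, for illustration only.
english_train = ["the food was great", "the service was terrible"]
spanish_test = ["la comida fue excelente", "el servicio fue terrible"]

# The vocabulary is learned from the source language only.
vectorizer = CountVectorizer()
vectorizer.fit(english_train)

# Target-language tokens that are not in the vocabulary are ignored.
X_test = vectorizer.transform(spanish_test)
tokens_kept = np.asarray(X_test.sum(axis=1)).ravel()

for doc, kept in zip(spanish_test, tokens_kept):
    print(f"{doc!r}: {kept} token(s) found in the English vocabulary")
```

Only words that happen to be spelled the same in both languages (here, "terrible") carry any signal, so the classifier sees mostly empty feature vectors at test time.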

In [1]:
import pandas as pd
from ast import literal_eval
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.dummy import DummyClassifier

The code is organized around three components:
1. Dataset: holds the data for the source and target languages.
2. System: a sequence of steps that supports fit and predict. It can also take the form of a pipeline.
3. Experiment: given a Dataset and a System, it fits, predicts, and reports evaluation scores.

In [2]:
class Dataset:
    """Dataset class that reads the data in raw format and prints stats."""
    def __init__(self, source_lang, target_lang):
        self.source_lang = source_lang
        self.target_lang = target_lang
        self.tr_path = "./semeveal15_sentiment_datasets/semeval15.%s.train.csv" % source_lang
        self.te_path = "./semeveal15_sentiment_datasets/semeval15.%s.test.csv" % target_lang
    
    @staticmethod
    def read_csv(path):
        df = pd.read_csv(path)
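        # Parse the stringified list of polarities into a Python list.
        df['polarities'] = df['polarities'].apply(literal_eval)
        # Keep rows with at least one polarity; copy to avoid SettingWithCopyWarning.
        df = df.loc[df.polarities.astype(bool)].copy()
        # Use the first polarity as the document-level sentiment label.
        df['sentiment'] = df['polarities'].apply(lambda l: l[0])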
        return df[['text', 'sentiment']]

    def load_data(self):
        training = self.read_csv(self.tr_path)
        test = self.read_csv(self.te_path)
        print("\nTraining data\n==========")
        self.calculate_stats(training)
        print("\nTest data\n==========")
        self.calculate_stats(test)
        self.train, self.y_train = training.text.values, training.sentiment.values
        self.test, self.y_test = test.text.values, test.sentiment.values

    def calculate_stats(self, df):
        print("Training Data Shape: ", df.shape)
        print("Class distribution: ", df.sentiment.value_counts().to_dict())

        
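class Runner:
    """Fits a pipeline on a Dataset's training split and scores it on the test split."""
    def __init__(self, pipeline, experiment):
        self.pipeline = pipeline
        self.experiment = experiment
        self.experiment.load_data()
        
    def score(self, preds):
        # Accuracy of the predictions against the target-language labels.
        return accuracy_score(self.experiment.y_test, preds)
    
    def eval_system(self):
        # Use the objects stored on the instance rather than notebook globals.
        self.pipeline.fit(self.experiment.train, self.experiment.y_train)
        preds = self.pipeline.predict(self.experiment.test)
        return self.score(preds)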
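
The parsing done in `read_csv` can be tried in isolation. The following is a minimal sketch on an in-memory CSV; the column names `text` and `polarities` come from the code above, while the rows and the exact file layout are assumptions made for illustration.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical rows: each `polarities` cell is a stringified Python list.
toy_csv = io.StringIO('''text,polarities
Great food,"['positive']"
No opinion here,[]
Slow but tasty,"['negative', 'positive']"
''')

df = pd.read_csv(toy_csv)
# Turn "['positive']" into the list ['positive'].
df["polarities"] = df["polarities"].apply(literal_eval)
# Keep rows with a non-empty list (same effect as the truthiness filter in read_csv).
df = df.loc[df["polarities"].map(len) > 0].copy()
# The first polarity becomes the document-level label.
df["sentiment"] = df["polarities"].apply(lambda labels: labels[0])

result = df[["text", "sentiment"]]
print(result)
```

Documents without any polarity are dropped, and documents with several polarities are labeled with the first one only, which is a simplification worth keeping in mind when reading the scores.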

In [3]:
exp = Dataset("en", "es")


In [4]:
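# Dummy baseline. DummyClassifier() is used with its default strategy, which
# depends on the scikit-learn version; pass strategy="most_frequent" to get a
# true majority-class baseline.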
pipeline = Pipeline([('vectorizer', CountVectorizer()), 
                     ('classifier', DummyClassifier())])
runner = Runner(pipeline, exp)
runner.eval_system()


Training data
Training Data Shape:  (1708, 2)
Class distribution:  {'positive': 1114, 'negative': 516, 'neutral': 78}

Test data
Training Data Shape:  (677, 2)
Class distribution:  {'positive': 456, 'negative': 188, 'neutral': 33}


0.5494830132939439
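
The score above is lower than what always predicting the most frequent class would achieve. This suggests that the default `DummyClassifier` strategy in the scikit-learn version used here draws random, class-proportional predictions; that is an inference from the numbers, not something the notebook states. The sketch below recomputes both reference values from the class distributions printed above.

```python
# Class counts copied from the printed statistics above.
train_counts = {"positive": 1114, "negative": 516, "neutral": 78}
test_counts = {"positive": 456, "negative": 188, "neutral": 33}

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())

# Accuracy when always predicting the most frequent training class.
majority_label = max(train_counts, key=train_counts.get)
majority_accuracy = test_counts[majority_label] / n_test

# Expected accuracy when labels are drawn at random with the training
# class proportions: sum over classes of P_train(c) * P_test(c).
stratified_accuracy = sum(
    (train_counts[label] / n_train) * (test_counts[label] / n_test)
    for label in train_counts
)

print(f"majority-class accuracy:      {majority_accuracy:.3f}")
print(f"expected stratified accuracy: {stratified_accuracy:.3f}")
```

A majority-class predictor would reach about 0.674 on this test set, while the expected value for class-proportional random guessing is about 0.525. Any real system should therefore be compared against the former.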

In [5]:
# Logistic Regression on words
# Build the pipeline first, then hand it to the Runner.
pipeline = Pipeline([('vectorizer', CountVectorizer()), 
                     ('classifier', LogisticRegression())])
runner = Runner(pipeline, exp)
runner.eval_system()


Training data
Training Data Shape:  (1708, 2)
Class distribution:  {'positive': 1114, 'negative': 516, 'neutral': 78}

Test data
Training Data Shape:  (677, 2)
Class distribution:  {'positive': 456, 'negative': 188, 'neutral': 33}




0.7031019202363368