## spaCy Lemma + MVP

This model adds spaCy lemmatization of the questions to see whether it improves on the MVP model. Either way, the results will be examined for patterns among the pairs that are significantly misclassified.

**Pipeline**:
1. Stack questions
2. Lemmatize questions
3. TF-IDF
4. NMF (5 topics)
5. Unstack questions
6. XGBoost classifier
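The stack and unstack steps are what let a single TF-IDF/NMF model see both questions of a pair. The `utils` helpers are not shown in this notebook, so the sketch below is only one plausible shape for them: `stack_pairs` and `unstack_pairs` are hypothetical names, and the layout (all first questions, then all second questions) is an assumption.

```python
import numpy as np


def stack_pairs(pairs):
    """Turn n (question1, question2) pairs into one 1-D array of 2n documents.

    Assumed layout: all first questions, followed by all second questions.
    """
    pairs = np.asarray(pairs, dtype=object)
    return np.concatenate([pairs[:, 0], pairs[:, 1]])


def unstack_pairs(features):
    """Turn a (2n, k) feature matrix back into (n, 2k), one row per pair."""
    features = np.asarray(features)
    n = features.shape[0] // 2
    # Top half holds question1 features, bottom half question2 features.
    return np.hstack([features[:n], features[n:]])


pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?"),
    ("Is the sky blue?", "Why is grass green?"),
]
docs = stack_pairs(pairs)               # 4 documents
topics = np.arange(8.0).reshape(4, 2)   # stand-in for a (2n, k) NMF output
wide = unstack_pairs(topics)            # shape (2, 4): one row per pair
```

With 5 NMF topics, this layout would give the classifier 10 features per question pair.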

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [3]:
try:
    # reuse the cached lemmatized questions when they exist
    X_train_lemma = utils.load('X_train_lemma')
except Exception:
    # cache miss: lemmatizing the documents takes about 10 minutes
    pipe = Pipeline(
        [
            ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
            ('lemma', FunctionTransformer(utils.cleanup_text, validate=False))
        ]
    )

    # both steps are stateless, so fit_transform is equivalent to transform here
    X_train_lemma = pipe.fit_transform(X_train)

    utils.save(pipe, 'lemma_pipeline_only')
    utils.save(X_train_lemma, 'X_train_lemma')

In [4]:
# cross_validate checks that X and y have the same number of rows, so the stacked
# questions are transformed and unstacked back to one row per pair up front,
# even though a full pipeline would also return X with the same length as y.

pipe_transform = Pipeline(
    [
        ('tfidf', TfidfVectorizer()),
        ('nmf', NMF(n_components = 5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

X_train_transform = pipe_transform.fit_transform(X_train_lemma)

In [5]:
# random_state has no effect without shuffle=True (recent scikit-learn versions
# raise a ValueError for that combination), so it is omitted; the folds are unchanged.
skf = StratifiedKFold(n_splits=3)
cv = cross_validate(
    XGBClassifier(n_estimators=500, random_state=42),
    X_train_transform,
    y_train,
    cv=skf,
    n_jobs=-1,
    scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'),
)
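The dictionary returned by `cross_validate` holds one array of per-fold scores under each `test_<scorer>` key. `utils.log_scores` is not shown in this notebook, so the following is only a sketch of how such a summary row could be built: `summarize_cv` is a hypothetical name, the population standard deviation (`ddof=0`) is an assumption, and the fold scores in the example are made up for illustration.

```python
import numpy as np
import pandas as pd


def summarize_cv(cv, model_name):
    """Collapse cross_validate output into one row of mean/std per metric."""
    # scorer name -> column suffix used in the results table
    names = {
        'accuracy': 'accuracy', 'precision': 'precision', 'recall': 'recall',
        'f1': 'f1', 'roc_auc': 'auc', 'neg_log_loss': 'log_loss',
    }
    row = {}
    for scorer, label in names.items():
        scores = np.asarray(cv['test_' + scorer], dtype=float)
        if scorer.startswith('neg_'):
            scores = -scores  # report log loss as a positive number
        row['avg_' + label] = scores.mean()
        row['std_' + label] = scores.std()
    return pd.DataFrame([row], index=[model_name])


# Made-up fold scores, for illustration only.
fake_cv = {'test_' + s: np.array([0.70, 0.80, 0.90])
           for s in ('accuracy', 'precision', 'recall', 'f1', 'roc_auc')}
fake_cv['test_neg_log_loss'] = np.array([-0.4, -0.5, -0.6])

demo_row = summarize_cv(fake_cv, 'demo')
```

Returning a one-row DataFrame indexed by model name keeps each experiment easy to add to, or drop from, a shared results table.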

In [6]:
results_df = utils.load('results')
results_df

| | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc | avg_log_loss | std_log_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.700345 | 0.000466 | 0.661571 | 0.000461 | 0.385736 | 0.002493 | 0.487325 | 0.001983 | 0.740593 | 0.001647 | 0.568958 | 0.001288 |


In [7]:
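results_df = results_df.drop(index='mvp (+ lemma)', errors='ignore')
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
# (assumes utils.log_scores returns a one-row DataFrame indexed by model name).
results_df = pd.concat([results_df, utils.log_scores(cv, 'mvp (+ lemma)')])
results_df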

| | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc | avg_log_loss | std_log_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.700345 | 0.000466 | 0.661571 | 0.000461 | 0.385736 | 0.002493 | 0.487325 | 0.001983 | 0.740593 | 0.001647 | 0.568958 | 0.001288 |
| mvp (+ lemma) | 0.696787 | 0.001055 | 0.649977 | 0.003057 | 0.387424 | 0.00323 | 0.485464 | 0.002485 | 0.738037 | 0.001362 | 0.572483 | 0.000815 |


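The results are very similar to the MVP: lemmatization leaves recall essentially unchanged (0.387 vs. 0.386) and is marginally worse on accuracy (0.697 vs. 0.700), precision (0.650 vs. 0.662), F1, AUC, and log loss. The next step is to analyze the pairs that are difficult to classify and decide how to proceed from there.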
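One way to find the hardest pairs is to rank them by their individual log loss, given a predicted probability of "duplicate" for each pair. The sketch below assumes those probabilities are already available (for example, out-of-fold predictions from scikit-learn's `cross_val_predict` with `method='predict_proba'`). `rank_hardest` is a hypothetical helper, not part of `utils`, and the labels and probabilities are made up.

```python
import numpy as np


def rank_hardest(y_true, proba_duplicate, top_n=5, eps=1e-15):
    """Return indices and losses of the top_n pairs with the highest log loss."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for fully confident predictions.
    p = np.clip(np.asarray(proba_duplicate, dtype=float), eps, 1 - eps)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    order = np.argsort(-loss, kind='stable')[:top_n]
    return order, loss[order]


# Made-up labels and predicted probabilities, for illustration only.
y_demo = [1, 0, 1, 0]
proba_demo = [0.9, 0.8, 0.1, 0.3]
hard_idx, hard_loss = rank_hardest(y_demo, proba_demo, top_n=2)
```

The returned indices can then be used to pull the original question text for the worst offenders and look for shared patterns.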

In [9]:
utils.save(results_df, 'results')