## spaCy Lemma + MVP

This model adds spaCy lemmatization of the questions to see whether it improves on the MVP model. Regardless of the outcome, the results will be examined for patterns among the pairs that are significantly misclassified.

**Pipeline**:
1. Stack questions
2. Lemmatize questions
3. TF-IDF
4. NMF (5 topics)
5. Unstack questions
6. XGBoost classifier
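
The stack and unstack steps can be pictured with a minimal sketch. The implementations of `utils.stack_questions` and `utils.unstack_questions` are not shown in this notebook, so the helper names, the stacking order (all first questions, then all second questions), and the word-count stand-in for TF-IDF + NMF below are assumptions made purely for illustration.

```python
# Hypothetical helpers: the real utils functions may order or shape data differently.

def stack_pairs(pairs):
    """Turn [(q1, q2), ...] into one list: all first questions, then all second."""
    firsts = [q1 for q1, _ in pairs]
    seconds = [q2 for _, q2 in pairs]
    return firsts + seconds


def unstack_rows(rows):
    """Turn 2n per-question feature rows back into n per-pair rows.

    Row i (question 1) is concatenated with row i + n (question 2).
    """
    n = len(rows) // 2
    return [list(rows[i]) + list(rows[i + n]) for i in range(n)]


pairs = [
    ("how do i learn python", "how can i learn python"),
    ("what is nmf", "who won the game"),
]

# Step 1: stack, so every question becomes its own document.
docs = stack_pairs(pairs)

# Stand-in for steps 2-4 (lemmatize, TF-IDF, NMF): one toy feature per question.
features = [[len(doc.split())] for doc in docs]

# Step 5: unstack, so each pair gets the features of both of its questions.
pair_features = unstack_rows(features)
print(pair_features)
```

Stacking lets the vectorizer and topic model see each question as a separate document, while unstacking restores one row per pair so that the classifier's input lines up with `y_train`.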

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [3]:
try:
    # reuse the cached lemmatized questions if they were saved by an earlier run
    X_train_lemma = utils.load('X_train_lemma')
except Exception:  # cache miss: rebuild (a bare `except:` would also swallow KeyboardInterrupt)
    pipe = Pipeline(
        [
            ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
            ('lemma', FunctionTransformer(utils.cleanup_text, validate=False))
        ]
    )

    # lemmatizing the documents takes about 10 minutes
    X_train_lemma = pipe.transform(X_train)

    utils.save(pipe, 'lemma_pipeline_only')
    utils.save(X_train_lemma, 'X_train_lemma')

In [4]:
pipe = Pipeline(
    [
        ('tfidf', TfidfVectorizer()),
        ('nmf', NMF(n_components = 5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False)),
        ('xgb', XGBClassifier(n_estimators=500, random_state=42))
    ]
)

pipe.fit(X_train_lemma, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ate=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1))])

In [5]:
results_df = utils.load('results')
results_df

| model | accuracy | precision | recall | f1 | auc | log_loss |
|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.70555 | 0.673297 | 0.393392 | 0.49662 | 0.749459 | 0.562185 |
| mvp (+ lemma) | 0.700616 | 0.657091 | 0.39558 | 0.493852 | 0.744419 | 0.56762 |


In [6]:
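# drop any earlier '+ lemma' row so that re-running this cell does not duplicate it
results_df = results_df.drop(index='mvp (+ lemma)', errors='ignore')
# note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement there
results_df = results_df.append(utils.log_scores(pipe, X_train_lemma, y_train, 'mvp (+ lemma)'))
results_df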

| model | accuracy | precision | recall | f1 | auc | log_loss |
|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.70555 | 0.673297 | 0.393392 | 0.49662 | 0.749459 | 0.562185 |
| mvp (+ lemma) | 0.701107 | 0.65811 | 0.396411 | 0.494788 | 0.744856 | 0.567334 |


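Lemmatization gives results very similar to the MVP: accuracy (0.701 vs. 0.706) and AUC (0.745 vs. 0.749) are marginally lower, while recall is marginally higher (0.396 vs. 0.393). The next step is to analyze the pairs that are difficult to classify and decide how to proceed.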
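
One possible way to find the hardest pairs is to rank them by their individual contribution to the log loss. The sketch below is an assumption about how that analysis could start, not the method used later in this project; the function names and toy labels and probabilities are hypothetical.

```python
import math


def per_pair_log_loss(y_true, p_duplicate, eps=1e-15):
    """Log loss of each pair given its predicted probability of being a duplicate."""
    losses = []
    for y, p in zip(y_true, p_duplicate):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return losses


def hardest_pairs(y_true, p_duplicate, top_k=3):
    """Indices of the top_k pairs with the largest individual log loss."""
    losses = per_pair_log_loss(y_true, p_duplicate)
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:top_k]


# Toy data: 1 = duplicate pair, 0 = not a duplicate.
y_true = [1, 0, 1, 0]
p_duplicate = [0.9, 0.8, 0.1, 0.2]

worst = hardest_pairs(y_true, p_duplicate, top_k=2)
print(worst)
```

In this notebook the probabilities could come from `pipe.predict_proba(X_train_lemma)[:, 1]`, and the returned indices could then be used to look up the original question pairs. Probabilities predicted on the same data the model was fitted on tend to be optimistic, so out-of-fold predictions would likely give a more honest ranking.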

In [7]:
utils.save(results_df, 'results')
utils.save(pipe, 'mvp_lemma_model')