# MVP Model

A simple MVP model to test the pipeline end to end and establish a baseline score. The pipeline has five steps:

1. Stack the question pairs into a single list of documents.
2. Build a tf-idf document-term matrix with the default settings, apart from spaCy's English stop-word list.
3. Fit an NMF topic model with 5 topics.
4. Unstack the per-document topic weights back into question pairs.
5. Fit an `XGBClassifier` to predict whether the two questions in a pair are duplicates.
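
The stacking and unstacking steps depend on helpers in `utils` that are not shown in this notebook. The sketch below is one way they might work, assuming the input has two text columns and that all first questions are placed before all second questions. The names `stack_pairs` and `unstack_pairs` are hypothetical stand-ins, not the actual `utils` functions.

```python
import numpy as np
import pandas as pd


def stack_pairs(X):
    """Turn n question pairs into a flat list of 2n documents.

    Assumes X has two columns, one per question. All first questions
    come first, followed by all second questions.
    """
    X = np.asarray(X, dtype=object)
    return np.concatenate([X[:, 0], X[:, 1]]).tolist()


def unstack_pairs(features):
    """Turn a (2n, k) feature matrix back into n rows of 2k features."""
    features = np.asarray(features)
    n = features.shape[0] // 2
    return np.hstack([features[:n], features[n:]])


# Toy data; the column names are an assumption.
pairs = pd.DataFrame({
    "question1": ["How do I learn Python?", "What is NMF?"],
    "question2": ["How can I learn Python?", "Who invented tf-idf?"],
})

docs = stack_pairs(pairs)

# Stand-in for the NMF output: one row of 5 topic weights per document.
fake_topics = np.arange(len(docs) * 5, dtype=float).reshape(len(docs), 5)
pair_features = unstack_pairs(fake_topics)

print(len(docs), pair_features.shape)
```

With this layout, each pair ends up as a single row holding the topic weights of both questions side by side, which is the shape a classifier such as `XGBClassifier` expects.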

In [1]:
# data manipulation
import utils
import numpy as np
import pandas as pd

# text manipulation
import spacy

nlp = spacy.load('en_core_web_lg')
# TfidfVectorizer expects stop_words as a list in recent scikit-learn releases
spacy_stopwords = list(spacy.lang.en.stop_words.STOP_WORDS)

import string
punctuations = string.punctuation

# modeling
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn import metrics
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [3]:
pipeline = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('tf', TfidfVectorizer(stop_words=spacy_stopwords)),
        ('nmf', NMF(n_components=5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False)),
        ('xgb', XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42))
    ]
)

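# random_state only takes effect when shuffle=True (recent scikit-learn
# releases raise an error otherwise), so the folds stay unshuffled here.
skf = StratifiedKFold(n_splits=3)
cv = cross_validate(
    pipeline,
    X_train,
    y_train,
    cv=skf,
    n_jobs=-1,
    scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'),
    return_train_score=False,
)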
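
To make the tf-idf and NMF steps more concrete, here is a minimal sketch on four toy documents. It uses scikit-learn's built-in English stop words and 2 topics instead of the spaCy list and 5 topics used in the pipeline. The documents and parameter values are illustrative assumptions only.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (assumption), standing in for the stacked questions.
docs = [
    "How do I learn Python quickly?",
    "What is the fastest way to learn Python?",
    "How do I cook rice?",
    "What is the best way to cook rice?",
]

tfidf = TfidfVectorizer(stop_words="english")
doc_term = tfidf.fit_transform(docs)      # sparse (n_docs, n_terms) matrix

nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
doc_topic = nmf.fit_transform(doc_term)   # dense (n_docs, 2) topic weights

print(doc_term.shape, doc_topic.shape)
```

Each row of `doc_topic` holds non-negative topic weights for one document. These weights are what the unstacking step regroups into per-pair features.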

In [4]:
# Baseline accuracy when every pair is predicted as not a duplicate
1 - len(y_train[y_train == 1]) / len(y_train)

0.6307804445265321
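
The same baseline can be checked with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The labels below are toy values standing in for `y_train`, so the resulting number is not the one reported above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels standing in for y_train (assumption): 1 = duplicate, 0 = not.
y_toy = np.array([0, 0, 0, 1, 1, 0, 1, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))

# Same quantity computed directly. This shortcut only matches the dummy
# model when class 0 (not a duplicate) is the majority class.
majority_share = 1 - y_toy.mean()

print(baseline_accuracy, majority_share)
```

Any model worth keeping should beat this number on accuracy, which is why it is a useful reference point next to the cross-validated scores.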

In [7]:
results_df = utils.log_scores(cv, 'mvp (tf-idf, nmf(5), xgboost)')
results_df

| model | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc | avg_log_loss | std_log_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.700345 | 0.000466 | 0.661571 | 0.000461 | 0.385736 | 0.002493 | 0.487325 | 0.001983 | 0.740593 | 0.001647 | 0.568958 | 0.001288 |

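`utils.log_scores` is not shown in this notebook. The sketch below is one plausible way to collapse the per-fold output of `cross_validate` into a single row of `avg_`/`std_` columns like the one above. The function name `summarize_cv`, the column mapping, the sign flip for `neg_log_loss`, and the use of the population standard deviation are all assumptions.

```python
import numpy as np
import pandas as pd


def summarize_cv(cv_results, model_name):
    """Collapse per-fold test scores into one row of avg_/std_ columns."""
    # Hypothetical mapping from cross_validate keys to short column names.
    names = {
        "test_accuracy": "accuracy",
        "test_precision": "precision",
        "test_recall": "recall",
        "test_f1": "f1",
        "test_roc_auc": "auc",
        "test_neg_log_loss": "log_loss",
    }
    row = {}
    for key, short in names.items():
        scores = np.asarray(cv_results[key], dtype=float)
        if key.startswith("test_neg_"):
            scores = -scores  # scikit-learn negates losses so higher is better
        row[f"avg_{short}"] = scores.mean()
        row[f"std_{short}"] = scores.std()
    return pd.DataFrame([row], index=[model_name])


# Made-up fold scores, only to show the shape of the output.
fake_cv = {
    "test_accuracy": [0.70, 0.71, 0.69],
    "test_precision": [0.66, 0.67, 0.65],
    "test_recall": [0.38, 0.39, 0.40],
    "test_f1": [0.48, 0.49, 0.50],
    "test_roc_auc": [0.74, 0.75, 0.73],
    "test_neg_log_loss": [-0.57, -0.56, -0.58],
}

summary = summarize_cv(fake_cv, "toy model")
print(summary.T)
```

Keeping one row per model makes it easy to append later experiments to the same table and compare them against this baseline.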

In [9]:
utils.save(pipeline, 'mvp_model')
utils.save(results_df, 'results')

## Improvements

1. Add a lemmatizer and replace the default tokenizer with a spaCy-based one.
2. Analyze the data to determine whether further cleaning is needed.
   * Look at questions that do not end in a question mark.
3. Build a pipeline using the GloVe vectors.
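
As a starting point for the data-cleaning check, the sketch below flags questions that do not end in a question mark. It runs on a toy DataFrame; the column names `question1` and `question2` are an assumption about the layout of `X_train`.

```python
import pandas as pd

# Toy question pairs standing in for X_train (assumption: two text columns).
pairs = pd.DataFrame({
    "question1": ["How do I learn Python?", "Best way to learn SQL", "What is NMF? "],
    "question2": ["How can I learn Python?", "How should I learn SQL?", "Explain NMF."],
})

# True where a question does not end in "?" after trimming whitespace.
missing_qmark = pairs.apply(lambda col: ~col.str.strip().str.endswith("?"))

# Share of flagged questions per column, then the rows worth inspecting.
print(missing_qmark.mean())
suspect_rows = pairs[missing_qmark.any(axis=1)]
print(suspect_rows)
```

Reading a sample of the flagged rows should show whether they are truncated questions, statements, or formatting noise, which in turn suggests what cleaning is worth adding.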