# MVP Model

A simple MVP model to test the pipeline end to end and establish a baseline score. The pipeline has five steps:

1. Stack the question pairs into a single list of documents.
2. Build a tf-idf document-term matrix (default settings, with spaCy stop words).
3. Fit an NMF topic model with 5 topics.
4. Unstack the documents back into question pairs.
5. Fit an `XGBClassifier` to predict whether each pair of questions is a duplicate.
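
The stacking and unstacking in steps 1 and 4 are done by `utils.stack_questions` and `utils.unstack_questions`, whose source is not shown in this notebook. The sketch below is one plausible way such helpers could work, not the actual implementation: it assumes the pairs arrive as an `(n, 2)` array, that all first questions are stacked ahead of all second questions, and that the features of each pair are concatenated side by side afterwards.

```python
import numpy as np


def stack_questions(pairs):
    """Hypothetical helper: flatten an (n, 2) array of pairs into 2n documents.

    All first questions come first, followed by all second questions.
    """
    pairs = np.asarray(pairs, dtype=object)
    return list(pairs[:, 0]) + list(pairs[:, 1])


def unstack_questions(features):
    """Hypothetical helper: reshape (2n, k) document features to (n, 2k)."""
    features = np.asarray(features)
    n = features.shape[0] // 2
    # Row i holds the features of question 1 followed by those of question 2.
    return np.hstack([features[:n], features[n:]])


pairs = [
    ["How do I learn Python?", "What is the best way to learn Python?"],
    ["Is the sky blue?", "Why is the ocean salty?"],
]
docs = stack_questions(pairs)

# Stand-in for the tf-idf + NMF output: one 5-dimensional vector per document.
topic_vectors = np.arange(len(docs) * 5, dtype=float).reshape(len(docs), 5)
pair_features = unstack_questions(topic_vectors)
print(len(docs), pair_features.shape)
```

With this layout, fitting tf-idf and NMF on the stacked documents gives both questions in a pair the same vocabulary and topic space, and the classifier then sees 10 topic weights per pair.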

In [1]:
# data manipulation
import utils
import numpy as np
import pandas as pd

# text manipulation
import spacy

nlp = spacy.load('en_core_web_lg')
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

import string
punctuations = string.punctuation

# modeling
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn import metrics

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [3]:
pipeline = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('tf', TfidfVectorizer(stop_words=list(spacy_stopwords))),  # newer scikit-learn versions require a list, not a set
        ('nmf', NMF(n_components=5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False)),
        ('xgb', XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42))
    ]
)

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('stack', FunctionTransformer(accept_sparse=False, check_inverse=True,
          func=<function stack_questions at 0x1a1bc0c158>,
          inv_kw_args=None, inverse_func=None, kw_args=None,
          pass_y='deprecated', validate=False)), ('tf', TfidfVectorizer(analyzer='word', binary=False,...ate=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1))])

In [4]:
# Baseline accuracy if every pair is predicted as not a duplicate
1 - len(y_train[y_train == 1]) / len(y_train)

0.6307804445265321
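
Since the model is also scored on log loss, a matching baseline is the log loss of a constant predictor that always outputs the training duplicate rate. The sketch below is an illustration rather than part of the original pipeline; `constant_log_loss` is a hypothetical helper name, and the duplicate rate is derived from the majority-class accuracy printed above.

```python
import math


def constant_log_loss(positive_rate):
    """Log loss of always predicting `positive_rate` as P(duplicate).

    Evaluated on data whose true duplicate rate is also `positive_rate`.
    """
    p = positive_rate
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


# Duplicate rate implied by the majority-class accuracy above.
positive_rate = 1 - 0.6307804445265321
baseline = constant_log_loss(positive_rate)
print(round(baseline, 3))
```

This comes out at roughly 0.66 on the training set, so the training log loss of 0.562 reported for the MVP model in this notebook is a real, if modest, improvement over predicting the class prior.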

In [5]:
results_df = utils.log_scores(pipeline, X_train, y_train, 'mvp (tf-idf, nmf(5), xgboost)')
results_df

| model | accuracy | precision | recall | f1 | auc | log_loss |
|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.70555 | 0.673297 | 0.393392 | 0.49662 | 0.749459 | 0.562185 |


In [6]:
utils.save(pipeline, 'mvp_model')
utils.save(results_df, 'results')

In [7]:
# utils.generate_submissions(pipeline, 'mvp_model')

The MVP model scored 0.47989 on Kaggle, ranking 2417 out of 3307. The training log loss was 0.562, which is higher than the Kaggle score, so the model does not appear to have overfit. The gap may instead indicate that the training data is more challenging than the test data.

## Improvements

1. Add a lemmatizer and fully incorporate a spaCy tokenizer.
2. Analyze the data and determine whether any further cleaning is needed.
   * Look at questions that do not end in a question mark.
3. Build a pipeline using the GloVe vectors.
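
As a starting point for the data-cleaning check in item 2, a minimal sketch of how questions without a trailing question mark could be pulled out for inspection. The function name `questions_missing_mark` and the sample questions are illustrative assumptions, not part of `utils` or the dataset.

```python
def questions_missing_mark(questions):
    """Return the questions that do not end in '?' after trimming whitespace."""
    return [q for q in questions if not str(q).strip().endswith("?")]


# Illustrative sample only; the real check would run over both question columns.
sample = [
    "How do I learn Python?",
    "Best way to learn Python",
    "Is the sky blue? ",
    "Explain recursion.",
]
flagged = questions_missing_mark(sample)
print(flagged)
```

Casting each entry with `str` keeps the check from failing on missing values, which would then show up in the flagged list for review.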