## Cosine similarity with TF-IDF

The cosine similarity between the 5-topic NMF vectors was not relating the two questions in a pair as strongly as I would like. I am first going to add a cleaning step to strip out what appears to be LaTeX or MathJax in some of the Quora questions. I will then calculate the cosine similarity using the TF-IDF vectors and combine it with the NMF vectors.
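As a rough illustration of the cleaning step, here is a minimal sketch. It assumes the math appears wrapped in `[math]...[/math]` tags; that tag format and the `strip_math` helper are assumptions made for this example, and this is not the code behind `utils.clean_questions`.

```python
import re

# Assumption: math is wrapped in [math]...[/math] tags (hypothetical format for this sketch).
MATH_TAG = re.compile(r"\[math\].*?\[/math\]", flags=re.IGNORECASE | re.DOTALL)


def strip_math(text):
    """Remove [math]...[/math] spans and collapse the leftover whitespace."""
    without_math = MATH_TAG.sub(" ", text)
    return " ".join(without_math.split())


cleaned = strip_math("What is the derivative of [math]x^2[/math] with respect to x?")
print(cleaned)
```

The non-greedy `.*?` keeps two separate math spans in one question from being merged into a single match.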

**Pipeline**
1. Stack questions
2. Clean questions
3. Lemmatize
4. TF-IDF
5. UNION
    1. TF-IDF -> NMF(5 topic) -> Unstack
    2. TF-IDF -> Unstack -> cos sim
6. XGBClassifier
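The cosine-similarity branch (stack, TF-IDF, unstack, cos sim) can be sketched roughly as below. The function name `paired_cosine_similarity` is hypothetical and this is not the implementation of `utils.calc_cos_sim`; it only shows one way the row-wise similarity between paired questions could be computed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def paired_cosine_similarity(questions_1, questions_2):
    """Return an (n_pairs, 1) array of cosine similarities between paired questions."""
    n_pairs = len(questions_1)

    # Stack: fit one vocabulary over both question columns.
    stacked = list(questions_1) + list(questions_2)
    tfidf = TfidfVectorizer().fit_transform(stacked)

    # Unstack: split the rows back into the two question columns.
    q1, q2 = tfidf[:n_pairs], tfidf[n_pairs:]

    # TfidfVectorizer L2-normalizes rows by default, so the row-wise dot
    # product is the cosine similarity (all-zero rows simply give 0).
    sims = np.asarray(q1.multiply(q2).sum(axis=1)).ravel()
    return sims.reshape(-1, 1)


sims = paired_cosine_similarity(
    ["how do I learn python", "what is nmf"],
    ["how do I learn python", "best pizza in rome"],
)
print(sims)
```

Returning a single column keeps the result easy to place next to the NMF topic columns in a `FeatureUnion`.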

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [10]:
nmf_pipe = Pipeline(
    [
        ('nmf', NMF(n_components=5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=True))
    ]
)

cos_pipe = Pipeline(
    [
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False)),
        ('cos', FunctionTransformer(utils.calc_cos_sim, validate=True))
    ]
)

pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False))#,
#         ('tf', TfidfVectorizer()),
#         ('feats', FeatureUnion(
#             [
#                 ('nmf_pipe', nmf_pipe),
#                 ('cos_pipe', cos_pipe)
#             ]
#         )),
#         ('xgb', XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42))
    ]
)

In [None]:
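# random_state only takes effect when shuffle=True; recent scikit-learn raises an error otherwise
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv = cross_validate(
    pipe,
    X_train,
    y_train,
    cv=skf,
    n_jobs=-1,
    scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'),
)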

In [11]:
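# the steps are stateless FunctionTransformers, so fitting is a no-op that avoids not-fitted warnings
X_new = pipe.fit_transform(X_train)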

In [12]:
utils.save(X_new, 'X_train_lemma_clean')

In [None]:
results_df = utils.load('results')

results_df = results_df.drop(index='cos_sim_tfidf_model', errors='ignore')
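# DataFrame.append was removed in pandas 2.0; this assumes utils.log_scores returns a DataFrame row
results_df = pd.concat([results_df, utils.log_scores(cv, 'cos_sim_tfidf_model')])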
results_df

In [None]:
utils.save(results_df, 'results')