## Cos sim with TF-IDF

The cosine similarity between the NMF 5-topic vectors was not relating the two sentences as much as I would like. I am first going to add a cleaning step to strip out what appears to be LaTeX or MathJax in some of the Quora questions. I will then calculate the cosine similarity using the TF-IDF vectors and combine it with the NMF vectors.
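As an illustration of the cleaning step, here is a minimal sketch of stripping math markup from a question. The patterns and the `strip_math` name are assumptions for illustration only; the actual `utils.clean_questions` is not shown in this notebook and may work differently.

```python
import re

# Hypothetical patterns (assumptions, not taken from utils.clean_questions).
MATH_TAG = re.compile(r"\[math\].*?\[/math\]", flags=re.IGNORECASE | re.DOTALL)
INLINE_DOLLAR = re.compile(r"\$[^$]+\$")  # caution: also matches "$5 and $"
LATEX_COMMAND = re.compile(r"\\[a-zA-Z]+(\{[^{}]*\})*")
WHITESPACE = re.compile(r"\s+")


def strip_math(text):
    """Remove [math]...[/math] blocks, $...$ spans and LaTeX commands."""
    text = MATH_TAG.sub(" ", text)
    text = INLINE_DOLLAR.sub(" ", text)
    text = LATEX_COMMAND.sub(" ", text)
    # Collapse the gaps left behind by the removed markup.
    return WHITESPACE.sub(" ", text).strip()


cleaned = strip_math("What is [math]x^2 + 1[/math] when x is 3?")
```

The `$...$` pattern is deliberately naive and would also eat text between two currency amounts, so a real cleaning step would likely need a stricter rule.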

**Pipeline**
1. Stack questions
2. Clean questions
3. Lemmatize
4. TF-IDF
5. UNION
    1. TF-IDF -> NMF(5 topic) -> Unstack
    2. TF-IDF -> cos sim
6. XGBClassifier
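To make the "TF-IDF -> cos sim" branch concrete, here is a dependency-free sketch of a pairwise cosine similarity over stacked questions. It is not the code behind `utils.calc_cos_sim_stack`: the function names, the whitespace tokenizer, and the stacking order (all first questions, then all second questions) are assumptions. The IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's default.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_cos_sim(q1_list, q2_list):
    """One similarity per question pair (assumed stacking: q1s, then q2s)."""
    # Stack so IDF is computed over all questions, then split back into pairs.
    vectors = tfidf_vectors(q1_list + q2_list)
    n = len(q1_list)
    return [cosine(vectors[i], vectors[n + i]) for i in range(n)]


q1 = ["why is mathematics so tough", "where can i get funding", "how do i learn python"]
q2 = ["why is mathematics so hard", "best pizza in rome", "how do i learn python"]
sims = pairwise_cos_sim(q1, q2)
```

Pairs with no shared terms score exactly 0.0 and identical questions score approximately 1.0, which is why this single column can be appended to the NMF features as a direct overlap signal.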

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [3]:
nmf_pipe = Pipeline(
    [
        ('nmf', NMF(n_components=5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=True))
    ]
)

cos_pipe = Pipeline(
    [
        ('cos', FunctionTransformer(utils.calc_cos_sim_stack, validate=False))
    ]
)

pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
        ('tf', TfidfVectorizer()),
        ('feats', FeatureUnion(
            [
                ('nmf_pipe', nmf_pipe),
                ('cos_pipe', cos_pipe)
            ]
        ))
    ]
)
X_transform = pipe.fit_transform(X_train)

In [6]:
xgb = XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42)
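# random_state only takes effect with shuffle=True; recent scikit-learn
# versions raise an error if it is set while shuffle is left at False.
skf = StratifiedKFold(n_splits=3)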
cv = cross_validate(xgb, 
               X_transform, 
               y_train, 
               cv=skf, 
               n_jobs=1, 
               scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'))

In [7]:
results_df = utils.load('results')

results_df = results_df.drop(index='cos_sim_tfidf_model', errors='ignore')
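# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
# (this assumes utils.log_scores returns a one-row DataFrame).
results_df = pd.concat([results_df, utils.log_scores(cv, 'cos_sim_tfidf_model')])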
results_df

| model | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc | avg_log_loss | std_log_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.700345 | 0.000466 | 0.661571 | 0.000461 | 0.385736 | 0.002493 | 0.487325 | 0.001983 | 0.740593 | 0.001647 | 0.568958 | 0.001288 |
| mvp (+ lemma) | 0.696787 | 0.001055 | 0.649977 | 0.003057 | 0.387424 | 0.00323 | 0.485464 | 0.002485 | 0.738037 | 0.001362 | 0.572483 | 0.000815 |
| cos_sim_model | 0.7102 | 0.00083 | 0.658748 | 0.002578 | 0.446336 | 0.002215 | 0.53212 | 0.001306 | 0.746769 | 0.001279 | 0.56525 | 0.000963 |
| feat_eng_model | 0.743614 | 0.002021 | 0.664102 | 0.003502 | 0.6184 | 0.001553 | 0.640434 | 0.002281 | 0.82107 | 0.001428 | 0.489465 | 0.001141 |
| feat_eng_model_lemma_fix | 0.744356 | 0.002107 | 0.664513 | 0.004333 | 0.621357 | 0.000901 | 0.642201 | 0.001609 | 0.822197 | 0.00171 | 0.488131 | 0.001342 |
| feat_eng_model_lemma_clean | 0.763927 | 0.002404 | 0.676166 | 0.003904 | 0.692113 | 0.001128 | 0.684044 | 0.002549 | 0.846923 | 0.001643 | 0.456929 | 0.00141 |
| rf_feat_eng_model_lemma_clean | 0.783667 | 0.00226 | 0.708853 | 0.003681 | 0.702725 | 0.001666 | 0.705774 | 0.002658 | 0.868202 | 0.001148 | 0.436197 | 0.00064 |
| ensemble_rf_xgb | 0.779 | 0.00274 | 0.697794 | 0.004357 | 0.708157 | 0.001912 | 0.702935 | 0.003148 | 0.863334 | 0.001438 | 0.441784 | 0.001107 |
| cos_sim_tfidf_model | 0.729511 | 0.001216 | 0.66168 | 0.002219 | 0.547188 | 0.001744 | 0.59901 | 0.001703 | 0.800271 | 0.001291 | 0.512085 | 0.001299 |


In [8]:
utils.save(results_df, 'results')

## Results

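Overall, the cosine similarity between the TF-IDF vectors (together with cleaning the questions) seems to be the best of the TF-IDF/NMF-based models so far, with an average AUC of 0.80 and a log loss of 0.51, up from an AUC of 0.75 and a log loss of 0.57 for `cos_sim_model`. The feature-engineered rows in the table still score higher.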

Let's take a look at the worst false positives and false negatives.

In [9]:
xgb.fit(X_transform, y_train)
utils.save(xgb, 'cos_sim_tfidf_model')

y_probs = xgb.predict_proba(X_transform)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()

| | gt | prob | diff |
|---|---|---|---|
| 0 | 0 | 0.664762 | -0.664762 |
| 1 | 0 | 0.376845 | -0.376845 |
| 2 | 0 | 0.212445 | -0.212445 |
| 3 | 0 | 0.002759 | -0.002759 |
| 4 | 0 | 0.166415 | -0.166415 |
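The rows above are consistent with `diff = gt - prob`: a large positive `diff` marks a confident false negative, and a large negative `diff` marks a confident false positive. The sketch below shows that ranking in plain Python. The `error_table` helper and the sample values are hypothetical, and `utils.ground_truth_analysis` may compute its output differently.

```python
def error_table(y_true, y_prob):
    """One record per sample, with diff = ground truth - predicted probability."""
    return [
        {"idx": i, "gt": gt, "prob": prob, "diff": gt - prob}
        for i, (gt, prob) in enumerate(zip(y_true, y_prob))
    ]


# Made-up labels and probabilities, for illustration only.
rows = error_table([0, 1, 1, 0], [0.66, 0.01, 0.90, 0.05])

# Largest positive diff first: duplicates the model scored close to 0.
worst_false_negatives = sorted(rows, key=lambda r: r["diff"], reverse=True)
# Most negative diff first: non-duplicates the model scored close to 1.
worst_false_positives = sorted(rows, key=lambda r: r["diff"])
```

Sorting the same column in both directions gives the two error lists without needing a classification threshold.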


In [9]:
pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False))
    ]
)
X_train_lemma = pipe.transform(X_train)

## Top false negative examples

In [11]:
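# Largest positive diff (gt - prob) first: true duplicates scored near 0.
fn_idx = class_errors_df.sort_values('diff', ascending=False).head().index
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    # NOTE: X_train_lemma is still stacked (one row per question), so rows
    # idx and idx+1 are not guaranteed to belong to this pair; the right
    # offsets depend on how utils.stack_questions orders the rows.
    print(X_train_lemma[idx])
    print(X_train_lemma[idx+1])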
    print()
    print('Cos sim------')
    print(X_transform[idx, -1])
    print('-------------------------------------------')
    print()

Prob: 0.0030294533

How can I see if my boyfriend is on a dating website?
How can I see what apps and dating sites my husband uses?

Lemma--------
mean breast sore pregnant
breast sore mean pregnant

Cos sim------
0.0
-------------------------------------------

Prob: 0.0037913064

Where can I get funding for my idea?
How can I find funding for a startup business?

Lemma--------
s science firework
science jallikattu

Cos sim------
0.0
-------------------------------------------

Prob: 0.004062539

What are the top 200 ranking signals Google uses?
What are Google's 200 ranking factors?

Lemma--------
way travel money
travel world money

Cos sim------
0.0
-------------------------------------------

Prob: 0.004815716

Why is mathematics so tough?
Why is Mathematics so hard?

Lemma--------
share ethernet internet connection mobile wifi laptop
effect valley storage

Cos sim------
0.0
-------------------------------------------

Prob: 0.004822677

I'm 15 right now. What can I do to become a