## Cos sim with TF-IDF

The cosine similarity between the 5-topic NMF vectors was not relating the two questions in each pair as strongly as I would like. I am first going to add a cleaning step to strip out what appears to be LaTeX or MathJax in some of the Quora questions. I will then calculate the cosine similarity using the TF-IDF vectors and combine it with the NMF vectors.
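
The cleaning helper `utils.clean_questions` is not shown in this notebook. As a rough sketch of what stripping math markup could look like, the snippet below assumes the markup is wrapped in `[math]...[/math]` tags (an assumption about the data, not something confirmed here) and uses a hypothetical `strip_math` function:

```python
import re

# Assumption: math markup is wrapped in [math]...[/math] tags.
MATH_TAG = re.compile(r'\[math\].*?\[/math\]', flags=re.IGNORECASE | re.DOTALL)
WHITESPACE = re.compile(r'\s+')


def strip_math(question):
    """Remove [math]...[/math] spans and collapse leftover whitespace."""
    without_math = MATH_TAG.sub(' ', question)
    return WHITESPACE.sub(' ', without_math).strip()


print(strip_math('What is the integral of [math]x^2 \\, dx[/math] from 0 to 1?'))
```

The non-greedy `.*?` keeps two separate math spans in one question from being merged into a single match.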

**Pipeline**
1. Stack questions
2. Clean questions
3. Lemmatize
4. TF-IDF
5. UNION
    1. TF-IDF -> NMF(5 topic) -> Unstack
    2. TF-IDF -> cos sim
6. XGBClassifier
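
Step 5.2 compares the two TF-IDF rows that belong to each question pair. The helper `utils.calc_cos_sim_stack` is not shown, so the following is only a minimal sketch of the idea. It assumes the stacked matrix interleaves the questions (row `2*i` is `question1` and row `2*i + 1` is `question2` of pair `i`), and `pairwise_row_cos_sim` is a hypothetical name:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize


def pairwise_row_cos_sim(stacked):
    """Cosine similarity between rows 2*i and 2*i+1 of a sparse matrix.

    Assumes an interleaved layout: question1, question2, question1, ...
    Returns an (n_pairs, 1) array so it can be used as a feature column.
    """
    normed = normalize(stacked, norm='l2', axis=1)  # unit-length rows
    q1 = normed[0::2]
    q2 = normed[1::2]
    # For unit vectors, the dot product is the cosine similarity.
    sims = np.asarray(q1.multiply(q2).sum(axis=1))
    return sims.reshape(-1, 1)


questions = [
    'why is mathematics so tough',
    'why is mathematics so hard',
    'how do i learn python',
    'what is the capital of france',
]
tfidf = TfidfVectorizer().fit_transform(questions)
cos_sim = pairwise_row_cos_sim(tfidf)
print(cos_sim)
```

The first pair shares most of its tokens and gets a similarity well above zero, while the second pair shares none and gets exactly zero.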

In [2]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [3]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [4]:
nmf_pipe = Pipeline(
    [
        ('nmf', NMF(n_components=5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=True))
    ]
)

cos_pipe = Pipeline(
    [
        ('cos', FunctionTransformer(utils.calc_cos_sim_stack, validate=False))
    ]
)

pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
        ('tf', TfidfVectorizer()),
        ('feats', FeatureUnion(
            [
                ('nmf_pipe', nmf_pipe),
                ('cos_pipe', cos_pipe)
            ]
        ))
    ]
)
X_transform = pipe.fit_transform(X_train)
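
The `unstack` step has to turn two stacked rows per pair back into one row per pair so the features line up with `y_train`. The helper `utils.unstack_questions` is not shown; the sketch below is one way this could work, assuming an interleaved layout (if the helper instead places all `question1` rows before all `question2` rows, the reshape would differ). `unstack_interleaved` is a hypothetical name:

```python
import numpy as np


def unstack_interleaved(stacked):
    """Turn a (2n, k) array of interleaved question vectors into (n, 2k).

    Row i of the result is the question1 vector followed by the
    question2 vector of pair i.
    """
    stacked = np.asarray(stacked)
    n_rows, n_feats = stacked.shape
    if n_rows % 2 != 0:
        raise ValueError('expected an even number of rows (two per pair)')
    return stacked.reshape(n_rows // 2, 2 * n_feats)


# Four stacked rows with three features each, i.e. two question pairs.
topic_vectors = np.arange(12).reshape(4, 3)
pairs = unstack_interleaved(topic_vectors)
print(pairs)
```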

In [5]:
xgb = XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42)
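# random_state only takes effect with shuffle=True; newer scikit-learn
# versions raise an error if it is set while shuffle is False.
skf = StratifiedKFold(n_splits=3)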
cv = cross_validate(xgb, 
               X_transform, 
               y_train, 
               cv=skf, 
               n_jobs=-1, 
               scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'))

KeyboardInterrupt: 

In [None]:
results_df = utils.load('results')

results_df = results_df.drop(index='cos_sim_tfidf_model', errors='ignore')
results_df = results_df.append(utils.log_scores(cv, 'cos_sim_tfidf_model'))
results_df

In [None]:
utils.save(results_df, 'results')

## Results

Overall, the cosine similarity between the TF-IDF vectors (together with cleaning the questions) seems to be the best model yet, with an average AUC of 0.79 and a log loss of 0.51.

Let's take a look at the worst false positives and false negatives.

In [None]:
xgb.fit(X_transform, y_train)
utils.save(xgb, 'cos_sim_tfidf_model')

y_probs = xgb.predict_proba(X_transform)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()
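
The helper `utils.ground_truth_analysis` is not shown. Since the cells that follow sort by a `diff` column in descending order to find false negatives, one plausible reading is that `diff` is the true label minus the predicted probability. A minimal sketch under that assumption, with a hypothetical `rank_errors` function:

```python
import numpy as np
import pandas as pd


def rank_errors(y_true, y_prob):
    """Build a frame with diff = y_true - y_prob.

    Assumed convention: diff near +1 is a confident false negative,
    diff near -1 is a confident false positive.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return pd.DataFrame(
        {'y_true': y_true, 'y_prob': y_prob, 'diff': y_true - y_prob}
    )


errors = rank_errors([1, 0, 1, 0], [0.05, 0.90, 0.80, 0.10])
worst_fn = errors.sort_values('diff', ascending=False).head(1)
worst_fp = errors.sort_values('diff', ascending=True).head(1)
print(worst_fn)
print(worst_fp)
```

Because the probabilities here are computed on the training data itself, errors ranked this way are the ones the model could not fit at all, which makes them useful for spotting feature problems.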

In [9]:
pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False))
    ]
)
X_train_lemma = pipe.fit_transform(X_train)

## Top false negative examples

In [11]:
fn_idx = class_errors_df.sort_values('diff', ascending=False).head().index
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
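    print('Lemma--------')
    # NOTE: X_train_lemma is stacked (two rows per question pair), so idx and
    # idx+1 may not be the rows for this pair. If stack_questions interleaves
    # the questions, the matching rows would be 2*idx and 2*idx+1.
    print(X_train_lemma[idx])
    print(X_train_lemma[idx+1])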
    print()
    print('Cos sim------')
    print(X_transform[idx, -1])
    print('-------------------------------------------')
    print()

Prob: 0.0030294533

How can I see if my boyfriend is on a dating website?
How can I see what apps and dating sites my husband uses?

Lemma--------
mean breast sore pregnant
breast sore mean pregnant

Cos sim------
0.0
-------------------------------------------

Prob: 0.0037913064

Where can I get funding for my idea?
How can I find funding for a startup business?

Lemma--------
s science firework
science jallikattu

Cos sim------
0.0
-------------------------------------------

Prob: 0.004062539

What are the top 200 ranking signals Google uses?
What are Google's 200 ranking factors?

Lemma--------
way travel money
travel world money

Cos sim------
0.0
-------------------------------------------

Prob: 0.004815716

Why is mathematics so tough?
Why is Mathematics so hard?

Lemma--------
share ethernet internet connection mobile wifi laptop
effect valley storage

Cos sim------
0.0
-------------------------------------------

Prob: 0.004822677

I'm 15 right now. What can I do to become a