## Cosine Similarity Feature

This model adds a cosine similarity feature on top of the NMF features. The previous model showed that some misclassified pairs had a very high cosine similarity.

**Pipeline**
1. Stack questions
2. Lemmatize questions
3. Tfidf
4. NMF
5. Unstack questions
6. Add cosine similarity
7. XGBClassifier

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [3]:
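try:
    # reuse the cached lemmatized corpus if it exists
    X_train_lemma = utils.load('X_train_lemma')
except Exception:  # no cached copy yet (the exact exception depends on utils.load)
    pipe_cos_sim = Pipeline(
        [
            ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
            ('lemma', FunctionTransformer(utils.cleanup_text, validate=False)),
        ]
    )

    # both steps are stateless, so fit_transform only runs the two functions
    X_train_lemma = pipe_cos_sim.fit_transform(X_train)
    utils.save(X_train_lemma, 'X_train_lemma')  # cache it: lemmatizing the entire corpus can take 10 minutes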

In [4]:
pipe_transform = Pipeline(
    [
        ('tfidf', TfidfVectorizer()),
        ('nmf', NMF(n_components=5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=True)),
        ('cos_sim', FunctionTransformer(utils.calc_cos_sim, validate=True))
    ]
)

X_train_transform = pipe_transform.fit_transform(X_train_lemma)
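
The `unstack` and `cos_sim` steps live in `utils`, which is not shown here. As a rough sketch of what they might do, assuming the stacked corpus interleaves each pair (`q1_0, q2_0, q1_1, q2_1, ...`) and that the similarity is appended as the last column, the following reimplements both steps with NumPy. The function bodies are hypothetical stand-ins, not the actual `utils` code. The two input vectors are the topic vectors printed for the first false negative pair later in this notebook.

```python
import numpy as np

def unstack_questions(stacked):
    """(2n, k) interleaved rows [q1_0, q2_0, q1_1, ...] -> (n, 2k) pair rows."""
    stacked = np.asarray(stacked, dtype=float)
    return stacked.reshape(stacked.shape[0] // 2, -1)

def calc_cos_sim(pairs):
    """Append the cosine similarity of each row's two halves as a last column."""
    pairs = np.asarray(pairs, dtype=float)
    k = pairs.shape[1] // 2
    q1, q2 = pairs[:, :k], pairs[:, k:]
    dot = np.sum(q1 * q2, axis=1)
    norms = np.linalg.norm(q1, axis=1) * np.linalg.norm(q2, axis=1)
    # An all-zero topic vector has no direction; use 0 instead of dividing by zero.
    sim = np.divide(dot, norms, out=np.zeros_like(dot), where=norms > 0)
    return np.column_stack([pairs, sim])

# NMF topic vectors of one question pair, copied from this notebook's output.
q1 = [0.0, 0.0, 0.00169648, 0.01552974, 0.00111447]
q2 = [0.01473608, 0.00013005, 0.0, 0.0, 0.00031791]

features = calc_cos_sim(unstack_questions([q1, q2]))
print(features.shape)    # (1, 11): 5 + 5 topic weights plus the similarity
print(features[0, -1])   # roughly 0.0015, in line with the value printed below
```

Under these assumptions the layout matches the `(303199, 11)` shape reported further down: five topic weights per question plus one similarity column.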

In [5]:
skf = StratifiedKFold(n_splits=3)  # shuffle is off, so random_state has no effect (recent scikit-learn raises if it is set)
cv = cross_validate(XGBClassifier(n_estimators=500, random_state=42, n_jobs=-1), 
               X_train_transform, 
               y_train, 
               cv=skf, 
               n_jobs=-1, 
               scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'))

In [6]:
results_df = utils.load('results')

results_df = results_df.drop(index='cos_sim_model', errors='ignore')
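# DataFrame.append was removed in pandas 2.0; this assumes utils.log_scores returns a DataFrame
results_df = pd.concat([results_df, utils.log_scores(cv, 'cos_sim_model')])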
results_df

| model | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc | avg_log_loss | std_log_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.700345 | 0.000466 | 0.661571 | 0.000461 | 0.385736 | 0.002493 | 0.487325 | 0.001983 | 0.740593 | 0.001647 | 0.568958 | 0.001288 |
| mvp (+ lemma) | 0.696787 | 0.001055 | 0.649977 | 0.003057 | 0.387424 | 0.00323 | 0.485464 | 0.002485 | 0.738037 | 0.001362 | 0.572483 | 0.000815 |
| cos_sim_model | 0.7102 | 0.00083 | 0.658748 | 0.002578 | 0.446336 | 0.002215 | 0.53212 | 0.001306 | 0.746769 | 0.001279 | 0.56525 | 0.000963 |
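
The `avg_*` and `std_*` columns come from `utils.log_scores`, whose source is not part of this notebook. A minimal sketch of how such a helper could summarize the dictionary returned by `cross_validate` is shown below. The renaming of `roc_auc` to `auc`, the sign flip for `neg_log_loss`, and the use of the population standard deviation are all assumptions made to match the column names above, not the confirmed implementation.

```python
import numpy as np
import pandas as pd

def log_scores(cv, name):
    """Summarize cross_validate output as a one-row DataFrame indexed by `name`."""
    renames = {'roc_auc': 'auc', 'neg_log_loss': 'log_loss'}  # assumed naming
    row = {}
    for key, scores in cv.items():
        if not key.startswith('test_'):
            continue  # skip fit_time and score_time
        metric = key[len('test_'):]
        scores = np.asarray(scores, dtype=float)
        if metric.startswith('neg_'):
            scores = -scores  # scikit-learn negates losses so that higher is better
        label = renames.get(metric, metric)
        row['avg_' + label] = scores.mean()
        row['std_' + label] = scores.std()
    return pd.DataFrame(row, index=[name])

# Made-up fold scores, only to show the shape of the result.
fake_cv = {
    'fit_time': [1.0, 1.2],
    'test_accuracy': [0.70, 0.72],
    'test_neg_log_loss': [-0.50, -0.60],
}
row = log_scores(fake_cv, 'demo_model')
print(row)
```

Returning a one-row DataFrame indexed by the model name keeps the `drop(index=...)` followed by re-adding pattern in the cell above idempotent when the notebook is re-run.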


In [8]:
utils.save(results_df, 'results')

### Results

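Adding the cosine similarity feature gave a modest improvement in the cross-validated scores. Relative to the MVP, accuracy rose from 0.700 to 0.710, recall from 0.386 to 0.446, and F1 from 0.487 to 0.532, while precision stayed roughly flat (0.662 versus 0.659). The model is possibly a good candidate for hyperparameter tuning via cross-validation.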

Let's now take a look at where the classifier was wrong.

In [10]:
xgb = XGBClassifier(n_estimators=500, random_state=42, n_jobs=-1)
xgb.fit(X_train_transform, y_train)
y_probs = xgb.predict_proba(X_train_transform)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()

| index | gt | prob | diff |
|---|---|---|---|
| 0 | 0 | 0.22341 | -0.22341 |
| 1 | 0 | 0.172906 | -0.172906 |
| 2 | 0 | 0.320049 | -0.320049 |
| 3 | 0 | 0.156677 | -0.156677 |
| 4 | 0 | 0.373449 | -0.373449 |
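
The rows above are consistent with `diff = gt - prob`, which is why sorting `diff` in descending order surfaces the most confident false negatives and sorting in ascending order surfaces the most confident false positives. The sketch below assumes that definition; it is a hypothetical stand-in for `utils.ground_truth_analysis`, and the labels and probabilities are made up for illustration.

```python
import numpy as np
import pandas as pd

def ground_truth_analysis(y_true, y_probs):
    """Pair each label with its predicted probability and their signed gap."""
    df = pd.DataFrame({'gt': np.asarray(y_true), 'prob': np.asarray(y_probs)})
    df['diff'] = df['gt'] - df['prob']  # near +1: missed duplicate, near -1: false alarm
    return df

# Toy labels and probabilities (not taken from the real data).
errors = ground_truth_analysis([0, 1, 1, 0], [0.22, 0.02, 0.90, 0.96])

worst_fn = errors.sort_values('diff', ascending=False).index[0]  # gt=1, prob=0.02
worst_fp = errors.sort_values('diff').index[0]                   # gt=0, prob=0.96
print(worst_fn, worst_fp)
```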


In [15]:
X_train_transform.shape

(303199, 11)

## Top false negative examples

In [22]:
fn_idx = class_errors_df.sort_values('diff', ascending = False).head().index
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma-------')
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Cos sim------')
    print(X_train_transform[idx, -1])
    print()
    print('Vecs-----------')
    print(X_train_transform[idx, :5])
    print(X_train_transform[idx, 5:-1])
    print('-------------------------------------------')
    print()

Prob: 0.023252033

How can I get unlimited Ola Credits? Please help. I know there's a hack for that.
What is the best Ola hack to get unlimited Ola Credits?

Lemma-------
good ola hack unlimited ola credit
unlimited ola credit help know hack

Cos sim------
0.0015347087514404256

Vecs-----------
[0.         0.         0.00169648 0.01552974 0.00111447]
[0.01473608 0.00013005 0.         0.         0.00031791]
-------------------------------------------

Prob: 0.024024863

What is the best Ola hack to get unlimited Ola Credits?
How can I get unlimited Ola credits? I know there's a hack for that.

Lemma-------
unlimited ola credit know hack
good ola hack unlimited ola credit

Cos sim------
0.001481936862938256

Vecs-----------
[0.01473608 0.00013005 0.         0.         0.00031791]
[0.         0.         0.00018338 0.01586184 0.00109255]
-------------------------------------------

Prob: 0.03743049

How do I train myself to build mental visualisation skills like Nikola Tesla?
How do I trai

## Top false positive examples

In [21]:
fp_idx = class_errors_df.sort_values('diff').head().index
for idx in fp_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma-------')
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Cos sim------')
    print(X_train_transform[idx, -1])
    print()
    print('Vecs-----------')
    print(X_train_transform[idx, :5])
    print(X_train_transform[idx, 5:-1])
    print('-------------------------------------------')
    print()

Can I make money online?
I am willing to work hard for money but how I can make money?
Prob: 0.96103376

Lemma-------
willing work hard money money
money online

Cos sim------
0.9993590913043952

Vecs-----------
[0.        0.        0.1067102 0.        0.       ]
[0.         0.         0.06381616 0.00228588 0.        ]
-------------------------------------------

How can you earn money from Quora as a user?
How does Quora make money?
Prob: 0.95245737

Lemma-------
quora money
earn money quora user

Cos sim------
0.9873872317045038

Vecs-----------
[0.         0.04005779 0.06285799 0.         0.        ]
[0.         0.06401524 0.07205664 0.         0.        ]
-------------------------------------------

How do I earn money through Quora?
How can I use Quora to make money?
Prob: 0.9491555

Lemma-------
earn money quora
use quora money

Cos sim------
0.9810484046375674

Vecs-----------
[0.         0.05729245 0.06246081 0.         0.00039522]
[0.         0.04775823 0.07836525 0.         0

## Summary

The cosine similarity feature seems to be working as intended. However, computing the similarity on the NMF vectors may not be the right approach: the top false negatives are near-identical questions whose topic vectors are almost orthogonal, giving a cosine similarity of about 0.0015. A good next step would be to at least calculate the cosine similarity from the document vector of each lemmatized question. Some of the lemmatized questions in the false positive category are very similar to each other, though, so it may be necessary to add a similarity for both the lemmatized and the non-lemmatized pair of questions.
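
One possible reading of that next step, sketched under assumptions: treat the TF-IDF row of each lemmatized question as its document vector and compare the two rows of every pair directly, before NMF compresses them to five topics. The function name `tfidf_pair_cos_sim` is hypothetical, the corpus is assumed to be interleaved (`q1_0, q2_0, q1_1, ...`), and the four documents are lemmatized strings copied from the outputs above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_pair_cos_sim(stacked_docs):
    """Cosine similarity between the TF-IDF rows of each interleaved pair."""
    # TfidfVectorizer L2-normalizes rows by default, so a dot product is a cosine.
    tfidf = TfidfVectorizer().fit_transform(stacked_docs)
    q1, q2 = tfidf[0::2], tfidf[1::2]
    return np.asarray(q1.multiply(q2).sum(axis=1)).ravel()

docs = [
    'good ola hack unlimited ola credit',    # pair 0, question 1
    'unlimited ola credit help know hack',   # pair 0, question 2
    'quora money',                           # pair 1, question 1
    'willing work hard',                     # pair 1, question 2 (no shared terms)
]
sims = tfidf_pair_cos_sim(docs)
print(sims)
```

On this toy corpus the Ola pair, which shares four of its terms, scores far higher than it did on the NMF vectors, while the pair with no shared terms scores zero. The IDF weights here come from only four documents, so the exact values would differ on the full corpus.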