## Cosine Similarity Feature

This model adds a cosine similarity feature on top of the NMF features. The previous model showed that some misclassified pairs had a very high cosine similarity.

**Pipeline**
1. Stack questions
2. Lemmatize questions
3. Tfidf
4. NMF
5. Unstack questions
6. Add cosine similarity
7. XGBClassifier
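
Steps 5 and 6 can be sketched as follows. This is a minimal illustration, not the actual `utils` code: `unstack_pairs` and `add_cos_sim` are hypothetical stand-ins for `utils.unstack_questions` and `utils.calc_cos_sim`. It assumes the stacked matrix interleaves each pair (row `2i` is question1, row `2i+1` is question2), which is consistent with how this notebook indexes `X_train_lemma` with `idx*2` and `idx*2+1`.

```python
import numpy as np

def unstack_pairs(stacked):
    """Turn (2n, k) interleaved rows into (n, 2k): [q1 features | q2 features]."""
    stacked = np.asarray(stacked, dtype=float)
    return np.hstack([stacked[0::2], stacked[1::2]])

def add_cos_sim(unstacked):
    """Append one column holding the cosine similarity of each pair's two halves."""
    k = unstacked.shape[1] // 2
    q1, q2 = unstacked[:, :k], unstacked[:, k:]
    dot = np.sum(q1 * q2, axis=1)
    norms = np.linalg.norm(q1, axis=1) * np.linalg.norm(q2, axis=1)
    # An all-zero topic vector has no direction; define its similarity as 0.
    cos = np.divide(dot, norms, out=np.zeros_like(dot), where=norms > 0)
    return np.hstack([unstacked, cos[:, None]])

# Toy NMF output for two question pairs (4 stacked rows, 2 topics).
nmf_out = np.array([[1.0, 0.0], [2.0, 0.0],   # pair 0: same direction
                    [1.0, 0.0], [0.0, 3.0]])  # pair 1: orthogonal
features = add_cos_sim(unstack_pairs(nmf_out))
print(features)
```

The classifier then sees the topic weights of both questions plus a single similarity column per pair.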

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer

from xgboost import XGBClassifier

In [6]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [7]:
try:
    X_train_lemma = utils.load('X_train_lemma')
except Exception:  # no cached copy available: rebuild it from X_train
    pipe_lemma = Pipeline(
        [
            ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
            ('lemma', FunctionTransformer(utils.cleanup_text, validate=False)),
        ]
    )

    X_train_lemma = pipe_lemma.transform(X_train)
    utils.save(X_train_lemma, 'X_train_lemma') # save as it can take 10 minutes to lemmatize the entire corpus

In [8]:
pipe_cos_sim = Pipeline(
    [
        ('tfidf', TfidfVectorizer()),
        ('nmf', NMF(n_components=5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=True)),
        ('cos_sim', FunctionTransformer(utils.calc_cos_sim, validate=True)),
        ('xgb', XGBClassifier(n_estimators=500, random_state=42))
    ]
)

pipe_cos_sim.fit(X_train_lemma, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ate=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1))])

In [9]:
y_probs = pipe_cos_sim.predict_proba(X_train_lemma)[:, 1]

In [10]:
results_df = utils.load('results')

results_df = results_df.drop(index='cos_sim_model', errors='ignore')
results_df = results_df.append(utils.log_scores(pipe_cos_sim, X_train_lemma, y_train, 'cos_sim_model'))
results_df

| model | accuracy | precision | recall | f1 | auc | log_loss |
|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.631325 | 0.823529 | 0.001876 | 0.003743 | 0.571099 | 0.654121 |
| mvp (+ lemma) | 0.631466 | 0.819018 | 0.002385 | 0.004756 | 0.571259 | 0.654228 |
| all_neg | 0.63078 | 0.0 | 0.0 | 0.0 | 0.5 | 12.752392 |
| cos_sim_model | 0.631381 | 0.811644 | 0.002117 | 0.004223 | 0.573368 | 0.654084 |


In [12]:
utils.save(results_df, 'results')
utils.save(pipe_cos_sim, 'cos_sim_model')

### Results

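Adding the cosine similarity feature made only a marginal improvement in the training metrics: relative to the lemmatized MVP, AUC rose from 0.571259 to 0.573368 and log loss fell from 0.654228 to 0.654084, while accuracy, precision, recall, and F1 were all slightly lower. The model is still a possible candidate for hyperparameter tuning via cross-validation.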

Let's now take a look at where the classifier was wrong.

In [11]:
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()

|   | gt | prob | diff |
|---|---|---|---|
| 0 | 0 | 0.374594 | -0.374594 |
| 1 | 0 | 0.363202 | -0.363202 |
| 2 | 1 | 0.364842 | 0.635158 |
| 3 | 0 | 0.367989 | -0.367989 |
| 4 | 1 | 0.364302 | 0.635698 |
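
The `diff` column above is consistent with `gt - prob`. Under that reading, large positive values are duplicates that received a low probability (false negatives), and large negative values would be non-duplicates that received a high probability (false positives). A minimal sketch under that assumption follows; `ground_truth_diff` is a hypothetical stand-in, not the actual `utils.ground_truth_analysis`.

```python
import numpy as np
import pandas as pd

def ground_truth_diff(y_true, y_prob):
    """Hypothetical helper: signed error between label and predicted probability."""
    df = pd.DataFrame({'gt': np.asarray(y_true), 'prob': np.asarray(y_prob)})
    df['diff'] = df['gt'] - df['prob']
    return df

# First rows of the output above, used as sample data.
errors = ground_truth_diff([0, 0, 1, 0, 1],
                           [0.374594, 0.363202, 0.364842, 0.367989, 0.364302])

# Sorting descending surfaces the worst false negatives first;
# sorting ascending surfaces the worst false positives first.
worst_fn = errors.sort_values('diff', ascending=False).head(2).index.tolist()
worst_fp = errors.sort_values('diff', ascending=True).head(2).index.tolist()
print(worst_fn, worst_fp)
```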


In [23]:
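# largest positive diff first: true duplicates that got the lowest probabilities
fn_idx = class_errors_df.sort_values('diff', ascending=False).head().index
for idx in fn_idx:
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print('Prob:', y_probs[idx])
    # stacked array holds each pair as two consecutive rows
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()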

Where can I buy online products?
Where do I buy online products?
Prob: 0.16664948
money home
way money online

What would a bedroom in the year 1980 look like but it would have to be an older man's theme?
What items were in an older man's bedroom circa 1980?
Prob: 0.1992089
question quora
question quora

What should I do if I find out that my dad is having an affair?
My dad is having an affair. What should I do?
Prob: 0.22508842
programming language learn
order learn programming language

What can natrully change your eye color?
Can subminals audio change eye colour?
Prob: 0.24299282
question quora
ask question quora

What are the prospects and challenges for pulses for sustainable food security?
What are the prospects for pulses for sustainable food security?
Prob: 0.24651669
thing people know
thing people know



In [24]:
X_train_lemma[:10]

array(['step step guide invest share market india',
       'step step guide invest share market',
       'story kohinoor koh noor diamond',
       'happen indian government steal kohinoor koh noor diamond',
       'increase speed internet connection use vpn',
       'internet speed increase hack dns', 'mentally lonely solve',
       'find remainder math]23^{24}[/math divide 24,23',
       'dissolve water quikly sugar salt methane carbon di oxide',
       'fish survive salt water'], dtype='<U535')

In [26]:
X_train.sort_values('id').iloc[:10]

|   | id | question1 | question2 |
|---|---|---|---|
| 0 | 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... |
| 1 | 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... |
| 2 | 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... |
| 3 | 3 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... |
| 4 | 4 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? |
| 5 | 5 | Astrology: I am a Capricorn Sun Cap moon and c... | I'm a triple Capricorn (Sun, Moon and ascendan... |
| 8 | 8 | When do you use シ instead of し? | When do you use "&" instead of "and"? |
| 9 | 9 | Motorola (company): Can I hack my Charter Moto... | How do I hack Motorola DCX3400 for free internet? |
| 10 | 10 | Method to find separation of slits using fresn... | What are some of the things technicians can te... |
| 11 | 11 | How do I read and find my YouTube comments? | How can I see all my Youtube comments? |
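
The two outputs above suggest that `X_train_lemma` follows `id` order, since its first ten entries match the first five pairs of `X_train.sort_values('id')`. If `X_train` itself is stored in a different row order (an assumption; the notebook does not show it), then `X_train.iloc[idx]` and `X_train_lemma[idx*2]` would refer to different pairs, which would explain the mismatched printouts in the false-negative listing. A minimal sketch of the effect with toy data:

```python
import numpy as np
import pandas as pd

# Toy frame in a shuffled row order; index labels equal the ids.
df = pd.DataFrame({'id': [2, 0, 1],
                   'question1': ['c1', 'a1', 'b1'],
                   'question2': ['c2', 'a2', 'b2']}, index=[2, 0, 1])

# Stacked array built from the id-sorted frame (interleaved q1, q2).
ordered = df.sort_values('id')
stacked = ordered[['question1', 'question2']].to_numpy().ravel()

pos = 0
# Positional lookup on the shuffled frame picks a different pair...
mismatch = (df.iloc[pos].question1, stacked[pos * 2])
# ...while the id-sorted frame lines up with the stacked array.
aligned = (ordered.iloc[pos].question1, stacked[pos * 2])
print(mismatch, aligned)
```

If this is what happened, indexing the raw questions through the same id-sorted frame that produced the stacked array would keep the two views in step.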
