## Enhanced feature engineering model

Two types of features were engineered:

1. n_gram similarity between each pair of questions
2. min/max/avg distance between words in a single question, currently using the following metrics:
  * euclidean
  * cosine
  * city block (Manhattan)
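
As a rough illustration of the second feature type, the sketch below computes the min/max/avg pairwise distance between a question's word vectors with `scipy.spatial.distance.pdist`. The helper name `min_max_avg_distances`, the toy 2-d vectors, and the zero fill for questions with fewer than two words are assumptions for illustration only; the real `utils.add_min_max_avg_distance_features` and its embedding source are not shown in this notebook and may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Metric names as understood by scipy; "cityblock" is the Manhattan distance.
METRICS = ("euclidean", "cosine", "cityblock")


def min_max_avg_distances(word_vectors):
    """Return [min, max, mean] pairwise distance for each metric, in METRICS order.

    Hypothetical helper: `word_vectors` is an (n_words, dim) array of embeddings.
    """
    vectors = np.asarray(word_vectors, dtype=float)
    feats = []
    for metric in METRICS:
        if len(vectors) < 2:
            # Assumption: a question with fewer than two words has no pairs, use zeros.
            feats.extend([0.0, 0.0, 0.0])
            continue
        d = pdist(vectors, metric=metric)  # condensed vector of all pairwise distances
        feats.extend([d.min(), d.max(), d.mean()])
    return np.array(feats)


# Toy "word vectors" standing in for real embeddings.
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = min_max_avg_distances(toy_vectors)
print(features)
```

Three metrics times three statistics gives nine values per question, so a question pair contributes 18. Together with three n_gram similarities, that would be consistent with the 21-value feature vectors printed in the error analysis.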
  
**Pipeline**
1. Stack questions
2. Clean questions (now lowercases all words so that proper nouns lemmatize better)
3. Lemmatize questions
4. UNION
    1. n_gram similarity
    2. min/max/avg distances
5. XGBClassifier
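
For step 4.1, one common way to score n_gram similarity is the Jaccard overlap of word n-grams, sketched here. This is an assumption about the general technique, not the formula used by `utils.calc_ngram_similarity`, which is not shown in this notebook; `word_ngrams` and `ngram_similarity` are hypothetical names.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in a whitespace-tokenized string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(q1, q2, n_grams=(1, 2, 3)):
    """Jaccard overlap of word n-grams for each n (one possible similarity definition)."""
    sims = []
    for n in n_grams:
        a, b = word_ngrams(q1, n), word_ngrams(q2, n)
        union = a | b
        # Two questions shorter than n words share no n-grams; score them 0.
        sims.append(len(a & b) / len(union) if union else 0.0)
    return sims


# Lemmatized pair taken from the false positive examples in this notebook.
sims = ngram_similarity(
    "people ask question quora easily answer google",
    "people use quora ask question google wikipedia sufficient",
)
print(sims)
```

The values will not match the feature vectors printed in the error analysis, since the notebook's own similarity function may normalize differently.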

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [3]:
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pre_process_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
        ('feats', FeatureUnion(
            [
                ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False)),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

xgb = XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42)
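
The `stack` and `unstack` steps bracket the per-question work: each pair is flattened into single questions, features are computed per question, and the rows are then folded back into one row per pair. The lemma lookups in the error analysis (`idx*2` and `idx*2+1`) suggest the stacking interleaves the two questions. A minimal sketch under that assumption, with hypothetical helper names `stack_pairs` and `unstack_pairs` (the real `utils.stack_questions` and `utils.unstack_questions` may be implemented differently):

```python
import numpy as np
import pandas as pd


def stack_pairs(df):
    """Flatten question pairs into one 1-d array: q1_0, q2_0, q1_1, q2_1, ..."""
    # Row-major ravel interleaves the two columns, so pair i lands at 2*i and 2*i+1.
    return df[["question1", "question2"]].to_numpy().ravel()


def unstack_pairs(stacked_feats):
    """Fold (2n, k) per-question features back into (n, 2k) per-pair features."""
    feats = np.asarray(stacked_feats)
    return feats.reshape(len(feats) // 2, -1)


toy_df = pd.DataFrame({"question1": ["a", "c"], "question2": ["b", "d"]})
stacked = stack_pairs(toy_df)

# Pretend each stacked question produced two features.
per_question_feats = np.arange(8).reshape(4, 2)
per_pair_feats = unstack_pairs(per_question_feats)
print(stacked, per_pair_feats, sep="\n")
```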

In [4]:
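# fit_transform rather than transform: every step is a stateless FunctionTransformer,
# but recent scikit-learn releases warn when transform is called on an unfitted Pipeline
X_transform = pre_process_pipe.fit_transform(X_train)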

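# random_state only has an effect with shuffle=True, and newer scikit-learn versions
# raise an error when it is set without shuffling, so it is dropped here
skf = StratifiedKFold(n_splits=3)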
cv = cross_validate(xgb, 
               X_transform, 
               y_train, 
               cv=skf, 
               n_jobs=-1, 
               scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'))

In [5]:
results_df = utils.load('results')

results_df = results_df.drop(index='feat_eng_model_lemma_fix', errors='ignore')
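# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
# (assumes utils.log_scores returns a one-row DataFrame indexed by model name)
results_df = pd.concat([results_df, utils.log_scores(cv, 'feat_eng_model_lemma_fix')])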
results_df

| model | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc | avg_log_loss | std_log_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.700345 | 0.000466 | 0.661571 | 0.000461 | 0.385736 | 0.002493 | 0.487325 | 0.001983 | 0.740593 | 0.001647 | 0.568958 | 0.001288 |
| mvp (+ lemma) | 0.696787 | 0.001055 | 0.649977 | 0.003057 | 0.387424 | 0.00323 | 0.485464 | 0.002485 | 0.738037 | 0.001362 | 0.572483 | 0.000815 |
| cos_sim_model | 0.7102 | 0.00083 | 0.658748 | 0.002578 | 0.446336 | 0.002215 | 0.53212 | 0.001306 | 0.746769 | 0.001279 | 0.56525 | 0.000963 |
| cos_sim_tfidf_model | 0.728261 | 0.001248 | 0.659662 | 0.00224 | 0.545419 | 0.00137 | 0.597124 | 0.001666 | 0.799173 | 0.001407 | 0.513172 | 0.001191 |
| feat_eng_model | 0.743614 | 0.002021 | 0.664102 | 0.003502 | 0.6184 | 0.001553 | 0.640434 | 0.002281 | 0.82107 | 0.001428 | 0.489465 | 0.001141 |
| feat_eng_model_lemma_fix | 0.744356 | 0.002107 | 0.664513 | 0.004333 | 0.621357 | 0.000901 | 0.642201 | 0.001609 | 0.822197 | 0.00171 | 0.488131 | 0.001342 |
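
The `avg_*` and `std_*` columns summarize the per-fold scores returned by `cross_validate`. The sketch below shows one way such a row could be built; it is not the actual `utils.log_scores`, whose implementation is not shown here. The helper name `summarize_cv`, the column renaming, and the sign flip for `neg_*` scorers (inferred from the positive log loss values in the table) are assumptions.

```python
import numpy as np
import pandas as pd


def summarize_cv(cv, name):
    """Collapse a cross_validate result dict into one row of avg_/std_ columns."""
    rename = {"roc_auc": "auc", "neg_log_loss": "log_loss"}  # assumed column names
    row = {}
    for key, scores in cv.items():
        if not key.startswith("test_"):
            continue  # skip fit_time / score_time
        metric = key[len("test_"):]
        scores = np.asarray(scores, dtype=float)
        if metric.startswith("neg_"):
            scores = -scores  # scikit-learn reports losses negated; flip back
        label = rename.get(metric, metric)
        row[f"avg_{label}"] = scores.mean()
        row[f"std_{label}"] = scores.std()
    return pd.DataFrame([row], index=[name])


# Made-up fold scores, only to exercise the helper.
fake_cv = {
    "fit_time": np.array([1.0, 1.1, 0.9]),
    "test_accuracy": np.array([0.70, 0.72, 0.74]),
    "test_neg_log_loss": np.array([-0.50, -0.48, -0.52]),
}
summary = summarize_cv(fake_cv, "toy_model")
print(summary)
```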


In [9]:
utils.save(results_df, 'results')

## Results

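The engineered features give a clear jump in AUC, from 0.799 (cos_sim_tfidf_model) to 0.821 (feat_eng_model). The lemma fix adds only a marginal further gain, to 0.822, which is within one standard deviation across folds.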

In [6]:
xgb.fit(X_transform, y_train)
y_probs = xgb.predict_proba(X_transform)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()

|  | gt | prob | diff |
|---|---|---|---|
| 0 | 0 | 0.376427 | -0.376427 |
| 1 | 0 | 0.498056 | -0.498056 |
| 2 | 0 | 0.306907 | -0.306907 |
| 3 | 0 | 0.008354 | -0.008354 |
| 4 | 0 | 0.020841 | -0.020841 |
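
From the output above, `diff` appears to be `gt - prob`: large positive values are confident misses on true duplicates (false negatives), and large negative values are confident hits on non-duplicates (false positives). A minimal sketch of such a table, assuming that definition; `ground_truth_table` is a hypothetical stand-in for `utils.ground_truth_analysis`, and the labels and probabilities are made up.

```python
import numpy as np
import pandas as pd


def ground_truth_table(y_true, y_prob):
    """Build a per-row error table with diff = ground truth - predicted probability."""
    df = pd.DataFrame({
        "gt": np.asarray(y_true),
        "prob": np.asarray(y_prob, dtype=float),
    })
    df["diff"] = df["gt"] - df["prob"]
    return df


# Toy labels and probabilities, for illustration only.
toy_errors = ground_truth_table([0, 1, 1, 0], [0.9, 0.1, 0.8, 0.2])

# Most positive diff: true duplicate scored low (worst false negative).
worst_fn = toy_errors.sort_values("diff", ascending=False).index[0]
# Most negative diff: non-duplicate scored high (worst false positive).
worst_fp = toy_errors.sort_values("diff").index[0]
print(toy_errors, worst_fn, worst_fp, sep="\n")
```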


In [7]:
lemma_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
    ]
)
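# fit_transform for the same reason as above: the steps are stateless, but
# recent scikit-learn releases warn about transform on an unfitted Pipeline
X_train_lemma = lemma_pipe.fit_transform(X_train)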

## Top false negative errors

In [8]:
fn_idx = class_errors_df.sort_values('diff', ascending = False).head().index
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

Prob: 0.0038068118

Is it right that males are more sexual than females?
Do men have more sex drive than women?

Lemma--------

man sex drive woman
right male sexual female

Feature Space------
[0.00000000e+00 0.00000000e+00 0.00000000e+00 2.65893855e+00
 8.09540605e+00 6.68566402e+00 6.57226864e-02 7.81242170e-01
 5.30844962e-01 3.68800914e+01 1.10758113e+02 9.17549343e+01
 4.80341970e+00 9.19854137e+00 7.41691832e+00 2.59825547e-01
 7.92403629e-01 5.83720550e-01 6.58726917e+01 1.27245393e+02
 1.01846836e+02]
-------------------------------------------

Prob: 0.0038958814

Is it possible that everyone else is just an imitation of consciousness, and I am the only real conscious being in existence? What would this imply?
How do you prove that I am not the only person in the Matrix and everyone else is just a computer program?

Lemma--------

prove person matrix computer program
possible imitation consciousness real conscious existence imply

Feature Space------
[  0.           0.       

## Top false positive errors

In [10]:
fp_idx = class_errors_df.sort_values('diff').head().index
for idx in fp_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

Prob: 0.98156744

How 2000 rupee note stops black money?
Do you think the 2000 rupee notes will increase black money?

Lemma--------

think  rupee note increase black money
 rupee note stop black money

Feature Space------
[  0.76923077   0.61538462   0.46153846   6.44314494  10.03362521
   7.9075828    0.57044912   0.98469754   0.78002878  86.52876364
 137.55217186 107.61267939   5.47033709  10.03362521   7.62770309
   0.53674589   0.98469754   0.73902437  73.17100963 137.55217186
 102.76879175]
-------------------------------------------

Prob: 0.9380506

Why do people ask questions on Quora that can easily be answered by Google?
Why do people use Quora to ask questions when Google or Wikipedia would be sufficient?

Lemma--------

people use quora ask question google wikipedia sufficient
people ask question quora easily answer google

Feature Space------
[6.66666667e-01 1.33333333e-01 0.00000000e+00 3.54939540e+00
 1.02873433e+01 6.97978315e+00 2.03129739e-01 1.19576709e+00
 6.794305

## Next Steps

1. Adding the same set of features for the non-lemmatized but cleaned questions may help with some of these examples. This will be the next model.
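
Structurally, this next step could be a `FeatureUnion` with one branch that lemmatizes before computing features and one that works on the cleaned text directly. The sketch below only shows that wiring: `toy_clean`, `toy_lemma`, and `toy_features` are deliberately trivial, hypothetical stand-ins for the `utils` functions used in this notebook.

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def toy_clean(texts):
    # Stand-in for cleaning: lowercase and drop a trailing question mark.
    return [t.lower().rstrip("?") for t in texts]


def toy_lemma(texts):
    # Crude stand-in for lemmatization: strip a trailing "s" from longer tokens.
    return [
        " ".join(tok[:-1] if tok.endswith("s") and len(tok) > 3 else tok
                 for tok in t.split())
        for t in texts
    ]


def toy_features(texts):
    # Stand-in feature: character count, one column per question.
    return np.array([[len(t)] for t in texts], dtype=float)


union_pipe = Pipeline([
    ("clean", FunctionTransformer(toy_clean, validate=False)),
    ("feats", FeatureUnion([
        # Branch 1: lemmatize, then compute features.
        ("lemma_feats", Pipeline([
            ("lemma", FunctionTransformer(toy_lemma, validate=False)),
            ("feats", FunctionTransformer(toy_features, validate=False)),
        ])),
        # Branch 2: same features on the cleaned, non-lemmatized text.
        ("raw_feats", FunctionTransformer(toy_features, validate=False)),
    ])),
])

combined = union_pipe.fit_transform(["How do notes stop money?", "Cats"])
print(combined)  # column 0: lemmatized branch, column 1: cleaned-only branch
```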