## Enhanced feature engineering model

This model adds engineered features for the original (cleaned) questions, in addition to the lemmatized questions.

Two types of features are engineered:

1. n_gram similarity between each pair of questions
2. min/max/avg distance between words in a single question. The following metrics are currently used:
  * euclidean
  * cosine
  * city block or manhattan
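
As a rough illustration of the two feature types, here is a minimal sketch. It assumes Jaccard overlap for the n-gram similarity and plain NumPy arrays of word vectors for the distances; the actual `utils.calc_ngram_similarity` and `utils.add_min_max_avg_distance_features` are not shown in this notebook and may differ. The function names below are hypothetical.

```python
import numpy as np


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(q1, q2, n):
    """Jaccard overlap of the n-gram sets of two whitespace-tokenized questions.

    Assumption: Jaccard is used here for illustration only.
    """
    a, b = ngrams(q1.split(), n), ngrams(q2.split(), n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def min_max_avg_distances(vectors):
    """Min, max and mean pairwise distance between word vectors, per metric.

    Assumes non-zero vectors (cosine distance is undefined for a zero vector).
    Questions with fewer than two words have no pairs, so they get zeros.
    """
    v = np.asarray(vectors, dtype=float)
    if len(v) < 2:
        return {m: (0.0, 0.0, 0.0) for m in ('euclidean', 'cosine', 'cityblock')}
    i, j = np.triu_indices(len(v), k=1)  # every unordered pair of words
    diff = v[i] - v[j]
    norms = np.linalg.norm(v, axis=1)
    dists = {
        'euclidean': np.linalg.norm(diff, axis=1),
        'cosine': 1.0 - np.sum(v[i] * v[j], axis=1) / (norms[i] * norms[j]),
        'cityblock': np.abs(diff).sum(axis=1),
    }
    return {name: (d.min(), d.max(), d.mean()) for name, d in dists.items()}
```

With three n-gram sizes and three metrics times three statistics per question, a layout like this would give 3 + 2 × 9 = 21 columns per text version.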
  
**Pipeline**
1. Stack questions
2. Clean questions - now lowercases all words so that proper nouns lemmatize better
3. UNION
    1. n_gram similarity
    2. min/max/avg distance
4. Lemmatize questions
5. UNION
    1. n_gram similarity
    2. min/max/avg distances
6. UNION together both sets of features
7. XGBClassifier
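
The UNION steps rely on scikit-learn's `FeatureUnion`, which runs each transformer on the same input and concatenates the resulting columns side by side. A toy sketch of that behavior follows; `word_counts` and `char_len_diff` are made-up stand-ins, not the features used in this notebook.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def word_counts(X):
    """Two columns: number of whitespace tokens in each question of the pair."""
    return np.array([[len(q1.split()), len(q2.split())] for q1, q2 in X])


def char_len_diff(X):
    """One column: absolute difference in character length."""
    return np.array([[abs(len(q1) - len(q2))] for q1, q2 in X])


union = FeatureUnion([
    ('counts', FunctionTransformer(word_counts, validate=False)),
    ('len_diff', FunctionTransformer(char_len_diff, validate=False)),
])

pairs = np.array([
    ['how are you', 'how are you doing'],
    ['what is ai', 'why'],
])
features = union.fit_transform(pairs)
print(features.shape)  # (2, 3): two count columns, then one length-difference column
```

Nesting one `FeatureUnion` inside another, as in step 6, concatenates in the same way, so the column order follows the order in which the transformers are listed.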

In [2]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [3]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [4]:
# feature engineering pipes
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pair_question_pipe = Pipeline(
    [
        ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False))
    ]
)

# clean text pipe
clean_text_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# lemma pipe
lemma_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# pre-process pipe
pre_process_pipe = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_features', clean_text_pipe),
                ('lemma_pipe', lemma_pipe)
            ]
        ))
    ]
)

xgb = XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42)

In [5]:
X_transform = pre_process_pipe.transform(X_train)

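# random_state only takes effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set without shuffling
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv = cross_validate(
    xgb,
    X_transform,
    y_train,
    cv=skf,
    n_jobs=-1,
    scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'),
)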

KeyboardInterrupt: 

In [None]:
results_df = utils.load('results')

results_df = results_df.drop(index='feat_eng_model_lemma_clean', errors='ignore')
results_df = results_df.append(utils.log_scores(cv, 'feat_eng_model_lemma_clean'))
results_df

In [None]:
utils.save(results_df, 'results')
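
`cross_validate` returns a dict of per-fold arrays, with one `test_<scorer>` key for each requested metric plus timing keys. The sketch below shows one way such a dict could be collapsed into a single row of mean scores. `summarize_cv` is a hypothetical helper; `utils.log_scores` is not shown here and may work differently.

```python
from statistics import mean


def summarize_cv(cv_results, model_name):
    """Average each test_<metric> entry of a cross_validate result across folds."""
    row = {'model': model_name}
    for key, values in cv_results.items():
        if key.startswith('test_'):
            # strip the 'test_' prefix so columns read 'accuracy', 'roc_auc', ...
            row[key[len('test_'):]] = mean(float(v) for v in values)
    return row
```

Note that `neg_log_loss` is reported as a negative number, so values closer to zero are better.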

## Results

Wow! The feature engineering gives a noticeable jump in AUC, from 0.80 to 0.82.

In [None]:
xgb.fit(X_transform, y_train)
utils.save(xgb, 'xgb_feat_eng_model')

y_probs = xgb.predict_proba(X_transform)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()
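
The error listings in the following sections sort on a `diff` column. The sketch below is a guess at what that column holds, assuming `diff` is the true label minus the predicted probability; `utils.ground_truth_analysis` is not shown, and `rank_errors` is a hypothetical stand-in.

```python
import pandas as pd


def rank_errors(y_true, y_prob):
    """Per-example gap between the true label and the predicted probability.

    Assumption: diff = y_true - y_prob, so values near +1 are confident
    false negatives and values near -1 are confident false positives.
    """
    df = pd.DataFrame({'y_true': list(y_true), 'y_prob': list(y_prob)})
    df['diff'] = df['y_true'] - df['y_prob']
    return df
```

Under this assumption, sorting `diff` in descending order surfaces the worst false negatives, and sorting in ascending order surfaces the worst false positives.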

In [8]:
lemma_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
    ]
)
X_train_lemma = lemma_pipe.transform(X_train)

## Top false negative errors

In [9]:
fn_idx = class_errors_df.sort_values('diff', ascending = False).head().index
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

Prob: 0.001027111

Is this move of banning 500 & 1000 Rupee notes right?
What's Balaji Vishwanathan's take on banning 500 and 1000 Rs. currency?

Lemma--------

ban  rupee note right
s balaji vishwanathan ban   rs currency

Feature Space------
[1.00000000e-01 0.00000000e+00 0.00000000e+00 4.20957322e+00
 1.07570888e+01 7.72370876e+00 3.34027115e-01 1.19045726e+00
 8.35664798e-01 5.76906718e+01 1.45533761e+02 1.03886527e+02
 4.46231557e+00 9.78955353e+00 7.04691247e+00 3.60066705e-01
 9.95033675e-01 7.27225144e-01 6.11183202e+01 1.32570303e+02
 9.57032626e+01 1.66666667e-01 0.00000000e+00 0.00000000e+00
 7.22822451e+00 1.07570888e+01 9.17083311e+00 7.78746179e-01
 1.10782110e+00 9.21600294e-01 9.70216655e+01 1.45533761e+02
 1.23956212e+02 5.51670254e+00 9.80649784e+00 8.05885513e+00
 5.78655325e-01 9.95033675e-01 8.13570311e-01 7.30594916e+01
 1.30713938e+02 1.08378265e+02]
-------------------------------------------

Prob: 0.0016096202

Is there any Nano technology GPS tracking feature

## Top false positive errors

In [10]:
fp_idx = class_errors_df.sort_values('diff').head().index
for idx in fp_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

Prob: 0.9608003

Why do people ask questions on Quora that can easily be answered by Google?
Why do people use Quora to ask questions when Google or Wikipedia would be sufficient?

Lemma--------

people use quora ask question google wikipedia sufficient
people ask question quora easily answer google

Feature Space------
[5.51724138e-01 2.06896552e-01 6.89655172e-02 3.30257131e+00
 1.05098422e+01 6.48271009e+00 2.22047507e-01 1.25827333e+00
 6.44444557e-01 4.56209202e+01 1.38667841e+02 8.77513434e+01
 3.35628163e+00 1.00165953e+01 6.17317732e+00 2.23288568e-01
 1.25827333e+00 6.11561284e-01 4.57775741e+01 1.30679687e+02
 8.26270183e+01 6.66666667e-01 1.33333333e-01 0.00000000e+00
 3.54939540e+00 1.02873433e+01 6.97978315e+00 2.03129739e-01
 1.19576709e+00 6.79430514e-01 4.94086112e+01 1.38667841e+02
 9.37634244e+01 4.57339153e+00 1.00165953e+01 7.14255837e+00
 3.35699168e-01 1.14565285e+00 7.17501372e-01 6.20160041e+01
 1.29971439e+02 9.49136603e+01]
------------------------------------

## Next Steps

1. The false negative errors appear to be related to numbers.
  * Either stop removing numbers in the clean text process, or create another set of features.
2. The false positives seem very tricky. Found the Facebook InferSent model, which embeds sentences into 4096-dimensional vectors.
  * Could see if the distances between some of the false positive examples are different in this space.
3. Could look at alignments of the two questions (nltk may have tools for this).
4. Add different word embeddings to capture different nuances, or even use the vectors-only data set from spaCy.
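
For the first item, one option is a separate feature computed on the raw text before cleaning strips the digits. The following is only a sketch of that idea; `number_overlap` is a hypothetical helper and the regular expression is an assumption about what should count as a number.

```python
import re

# Assumption: integers and simple decimals are the numbers worth comparing
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?')


def number_overlap(q1, q2):
    """Jaccard overlap of the numeric tokens in two raw questions.

    Returns NaN when neither question contains a number, so "no numbers"
    stays distinct from "no numbers in common".
    """
    a, b = set(NUMBER_RE.findall(q1)), set(NUMBER_RE.findall(q2))
    if not a and not b:
        return float('nan')
    return len(a & b) / len(a | b)
```

On the first false negative above, both questions mention 500 and 1000, so this feature would be 1.0. XGBoost treats NaN as a missing value by default, which is why NaN is used for pairs without numbers.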