## Enhanced feature engineering model

Two types of features were engineered:

1. n-gram similarity between the two questions in each pair
2. min/max/avg distance between the words within a single question, currently computed with the following metrics:
  * euclidean
  * cosine
  * city block or manhattan
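
To make the second feature type concrete, here is a minimal sketch of how the min/max/avg statistics could be derived from the pairwise distances between a question's word vectors. The implementation of `utils.add_min_max_avg_distance_features` is not shown in this notebook, so the function name `min_max_avg_distances`, the toy 2-d vectors, and the choice to return zeros for questions with fewer than two words are all assumptions. The metrics are written in plain Python to keep the example dependency-free, and the cosine distance assumes non-zero vectors.

```python
from itertools import combinations
from math import sqrt


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    # Cosine distance = 1 - cosine similarity (assumes non-zero vectors).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cityblock(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


METRICS = (euclidean, cosine, cityblock)


def min_max_avg_distances(word_vectors):
    """Return [min, max, avg] of all pairwise word distances, per metric (9 values)."""
    pairs = list(combinations(word_vectors, 2))
    features = []
    for metric in METRICS:
        if not pairs:
            # Assumption: a question with fewer than two words gets zeros.
            features.extend([0.0, 0.0, 0.0])
            continue
        dists = [metric(u, v) for u, v in pairs]
        features.extend([min(dists), max(dists), sum(dists) / len(dists)])
    return features


# Made-up 2-d "embeddings" for a three-word question.
toy_question = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(min_max_avg_distances(toy_question))
```

With three metrics and three statistics, each question contributes nine values, so a question pair contributes eighteen distance features.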
  
**Pipeline**
1. Stack questions
2. Clean questions (now lowercases all words so that proper nouns lemmatize better)
3. Lemmatize questions
4. Feature union of:
    1. n_gram similarity
    2. min/max/avg distances
5. XGBClassifier
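
Step 4.1 of the pipeline could be computed roughly as follows. The source of `utils.calc_ngram_similarity` is not included here, so treating the similarity as a Jaccard overlap of n-gram sets is an assumption, as are the helper names `ngrams` and `ngram_similarity`. The second example pair reuses lemmatized strings from the error analysis output in this notebook.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(q1, q2, n_grams=(1, 2, 3)):
    """Jaccard overlap of the two questions' n-gram sets, one value per n."""
    tokens1, tokens2 = q1.split(), q2.split()
    sims = []
    for n in n_grams:
        a, b = ngrams(tokens1, n), ngrams(tokens2, n)
        union = a | b
        # Assumption: if neither question has any n-grams of this size, score 0.
        sims.append(len(a & b) / len(union) if union else 0.0)
    return sims


print(ngram_similarity("how learn python", "how learn python fast"))
# [0.75, 0.6666666666666666, 0.5]

print(ngram_similarity("cherub differ seraph", "difference cherubim seraphim"))
# [0.0, 0.0, 0.0]
```

Under this definition, two questions whose lemmas never match exactly score zero at every n-gram size, however close they are in meaning.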

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [9]:
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pre_process_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
        ('feats', FeatureUnion(
            [
                ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False)),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

xgb = XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42)

In [10]:
X_transform = pre_process_pipe.transform(X_train)

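# random_state only has an effect when shuffle=True, and recent scikit-learn
# versions raise a ValueError if it is set without shuffling, so it is dropped here.
skf = StratifiedKFold(n_splits=3)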
cv = cross_validate(xgb, 
               X_transform, 
               y_train, 
               cv=skf, 
               n_jobs=-1, 
               scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'))
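
`cross_validate` returns a dictionary with one array of per-fold scores per metric, stored under keys of the form `test_<scorer>`, alongside `fit_time` and `score_time`. A helper such as `utils.log_scores` presumably condenses that into the avg/std columns of the results table. Its source is not shown, so the sketch below is only one plausible shape: the function name `summarize_cv`, the column naming, the use of the population standard deviation, and the made-up fold scores are all assumptions.

```python
from statistics import mean, pstdev


def summarize_cv(cv_results, model_name):
    """Collapse per-fold scores into avg_/std_ entries for one results row."""
    row = {"model": model_name}
    for key, scores in cv_results.items():
        if not key.startswith("test_"):
            continue  # skip fit_time and score_time
        name = key[len("test_"):]
        values = [float(s) for s in scores]
        if name.startswith("neg_"):
            # Scorers such as neg_log_loss are negated losses; flip the sign back.
            name = name[len("neg_"):]
            values = [-v for v in values]
        row[f"avg_{name}"] = mean(values)
        row[f"std_{name}"] = pstdev(values)
    return row


# Made-up scores in the same layout that cross_validate produces.
fake_cv = {
    "fit_time": [1.0, 1.1, 0.9],
    "score_time": [0.1, 0.1, 0.1],
    "test_accuracy": [0.70, 0.72, 0.74],
    "test_neg_log_loss": [-0.50, -0.52, -0.48],
}
print(summarize_cv(fake_cv, "example_model"))
```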

In [5]:
results_df = utils.load('results')

results_df = results_df.drop(index='feat_eng_model_lemma_fix', errors='ignore')
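# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
# (this assumes utils.log_scores returns a DataFrame indexed by model name).
results_df = pd.concat([results_df, utils.log_scores(cv, 'feat_eng_model_lemma_fix')])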
results_df

| model | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc | avg_log_loss | std_log_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mvp (tf-idf, nmf(5), xgboost) | 0.700345 | 0.000466 | 0.661571 | 0.000461 | 0.385736 | 0.002493 | 0.487325 | 0.001983 | 0.740593 | 0.001647 | 0.568958 | 0.001288 |
| mvp (+ lemma) | 0.696787 | 0.001055 | 0.649977 | 0.003057 | 0.387424 | 0.00323 | 0.485464 | 0.002485 | 0.738037 | 0.001362 | 0.572483 | 0.000815 |
| cos_sim_model | 0.7102 | 0.00083 | 0.658748 | 0.002578 | 0.446336 | 0.002215 | 0.53212 | 0.001306 | 0.746769 | 0.001279 | 0.56525 | 0.000963 |
| cos_sim_tfidf_model | 0.728261 | 0.001248 | 0.659662 | 0.00224 | 0.545419 | 0.00137 | 0.597124 | 0.001666 | 0.799173 | 0.001407 | 0.513172 | 0.001191 |
| feat_eng_model | 0.743614 | 0.002021 | 0.664102 | 0.003502 | 0.6184 | 0.001553 | 0.640434 | 0.002281 | 0.82107 | 0.001428 | 0.489465 | 0.001141 |


## Results

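The engineered features raise AUC from 0.799 (`cos_sim_tfidf_model`) to 0.821 (`feat_eng_model`), a gain far larger than the fold-to-fold standard deviation of about 0.001. Recall improves the most, from 0.545 to 0.618, and log loss falls from 0.513 to 0.489.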

In [6]:
# Refit on the full training set; the probabilities below are therefore in-sample.
xgb.fit(X_transform, y_train)
y_probs = xgb.predict_proba(X_transform)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()

| | gt | prob | diff |
|---|---|---|---|
| 0 | 0 | 0.401812 | -0.401812 |
| 1 | 0 | 0.454141 | -0.454141 |
| 2 | 0 | 0.323676 | -0.323676 |
| 3 | 0 | 0.007815 | -0.007815 |
| 4 | 0 | 0.02645 | -0.02645 |
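
In the output above, `diff` equals `gt - prob`, so large positive values mark confident false negatives and large negative values mark confident false positives. The sketch below illustrates that ranking idea with the standard library only. It does not reproduce `utils.ground_truth_analysis`, whose source is not shown: the function name `rank_errors` and the toy labels and probabilities are hypothetical.

```python
def rank_errors(y_true, y_prob, top=3):
    """Return the `top` worst false negatives and false positives by signed error."""
    rows = [
        {"idx": i, "gt": gt, "prob": prob, "diff": gt - prob}
        for i, (gt, prob) in enumerate(zip(y_true, y_prob))
    ]
    by_diff = sorted(rows, key=lambda r: r["diff"])
    false_positives = by_diff[:top]        # most negative diff: gt=0, high prob
    false_negatives = by_diff[::-1][:top]  # most positive diff: gt=1, low prob
    return false_negatives, false_positives


# Made-up labels and predicted probabilities.
y_true = [0, 1, 1, 0, 1]
y_prob = [0.40, 0.05, 0.90, 0.95, 0.30]

worst_fn, worst_fp = rank_errors(y_true, y_prob, top=2)
print([r["idx"] for r in worst_fn])  # [1, 4]
print([r["idx"] for r in worst_fp])  # [3, 0]
```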


In [8]:
lemma_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
    ]
)
X_train_lemma = lemma_pipe.transform(X_train)

## Top false negative errors

In [9]:
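# diff = gt - prob, so the largest positive values are the positive-class
# examples the model was most confident were negative.
fn_idx = class_errors_df.sort_values('diff', ascending=False).head().index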
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

Prob: 0.002060809

What is the difference between Cherubim and Seraphim?
How do the cherubim differ from the seraphim?

Lemma--------

cherub differ seraph
difference cherubim seraphim

Feature Space------
[  0.           0.           0.           6.55570976  10.24933783
   8.72280699   0.30975165   1.08674129   0.79675768  92.10777702
 138.4221953  119.12508946   6.84490638   8.78600106   8.04551011
   0.5055403    1.0963308    0.86568997  94.37095225 119.35377098
 109.56823112]
-------------------------------------------

Prob: 0.0024182333

How is microeconomics similar to macroeconomics?
What are the similarities between Microeconomics and Macroeconomics?

Lemma--------

microeconomic similar macroeconomic
similarity microeconomics macroeconomics

Feature Space------
[  0.           0.           0.           3.63063221   9.03622263
   7.16236384   0.1319848    0.92905191   0.66066538  50.7361753
 127.33005978 100.53301635   5.83172982   8.52832234   7.49574089
   0.30714057   0.957

## Next Steps

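1. Lowercase all text before lemmatization. In the examples above, the capitalized "Cherubim" and "Microeconomics" come out as "cherubim" and "microeconomics", while their lowercase counterparts are reduced to "cherub" and "microeconomic", so the two questions in each pair no longer share lemmas.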