## Enhanced feature engineering model

This model adds engineered features computed on the original (cleaned) questions, in addition to the same features computed on the lemmatized questions.

Two different types of features are engineered:

1. n-gram similarity between each pair of questions
2. min/max/avg distance between words in a single question, currently using the following metrics:
  * euclidean
  * cosine
  * city block (Manhattan)
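
As a rough illustration of the two feature types, here is a minimal sketch. It is not the code in `utils` (which is not shown here): the use of Jaccard overlap on word n-grams, the helper names, and the toy word vectors are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np


def word_ngrams(text, n):
    """Set of word n-grams in a lower-cased, whitespace-tokenized question."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(q1, q2, n):
    """Assumed similarity: Jaccard overlap of the two questions' n-gram sets."""
    a, b = word_ngrams(q1, n), word_ngrams(q2, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# The three metrics listed above, written out with plain NumPy.
METRICS = {
    "euclidean": lambda u, v: float(np.linalg.norm(u - v)),
    "cosine": lambda u, v: float(
        1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    ),
    "cityblock": lambda u, v: float(np.abs(u - v).sum()),
}


def min_max_avg_distances(word_vectors):
    """[min, max, mean] of all pairwise word distances, for each metric in turn."""
    pairs = list(combinations(word_vectors, 2))
    feats = []
    for metric in METRICS.values():
        if not pairs:
            # Fewer than two words: no pairwise distance exists, fall back to zeros.
            feats.extend([0.0, 0.0, 0.0])
            continue
        dists = [metric(u, v) for u, v in pairs]
        feats.extend([min(dists), max(dists), sum(dists) / len(dists)])
    return feats


# Toy example: two orthogonal 2-d "word vectors" (made up for illustration).
toy_vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
toy_feats = min_max_avg_distances(toy_vectors)
sim = ngram_jaccard("is unifunds legit", "is unifundscom legit", 1)
```

With only two words there is a single pairwise distance, so min, max and average coincide for every metric.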
  
**Pipeline**
1. Stack questions
2. Clean questions (now lowercases all words so that proper nouns lemmatize better)
3. UNION
    1. n_gram similarity
    2. min/max/avg distance
4. Lemmatize questions
5. UNION
    1. n_gram similarity
    2. min/max/avg distances
6. UNION together both sets of features
7. Average ensemble
    1. Random Forest
    2. XGBClassifier
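
The averaging step can be sketched as follows. scikit-learn's `VotingClassifier` with `voting='soft'` predicts the class with the highest average predicted probability; the snippet below reproduces that arithmetic by hand on made-up probabilities, and `soft_vote` is a hypothetical helper name.

```python
import numpy as np


def soft_vote(prob_matrices, weights=None):
    """Average per-class probabilities from several models (soft voting)."""
    stacked = np.stack([np.asarray(p, dtype=float) for p in prob_matrices])
    return np.average(stacked, axis=0, weights=weights)


# Made-up predict_proba outputs for two samples and two classes.
rf_probs = [[0.8, 0.2], [0.3, 0.7]]
xgb_probs = [[0.6, 0.4], [0.5, 0.5]]

avg_probs = soft_vote([rf_probs, xgb_probs])
predicted = avg_probs.argmax(axis=1)
```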

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

In [3]:
# feature engineering pipes
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pair_question_pipe = Pipeline(
    [
        ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False))
    ]
)

# clean text pipe
clean_text_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# lemma pipe
lemma_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# pre-process pipe
pre_process_pipe = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_features', clean_text_pipe),
                ('lemma_pipe', lemma_pipe)
            ]
        ))
    ]
)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
xgb = XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42)
vc = VotingClassifier([('rf', rf), ('xgb', xgb)], voting='soft', n_jobs=-1)

In [4]:
X_transform = pre_process_pipe.transform(X_train)

# without shuffle=True, random_state has no effect (and newer scikit-learn versions reject it)
skf = StratifiedKFold(n_splits=3)
cv = cross_validate(vc,
                    X_transform,
                    y_train,
                    cv=skf,
                    n_jobs=-1,
                    scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'))

In [5]:
results_df = utils.load('results')

results_df = results_df.drop(index='ensemble_rf_xgb', errors='ignore')
results_df = results_df.append(utils.log_scores(cv, 'ensemble_rf_xgb'))
results_df

model,avg_accuracy,std_accuracy,avg_precision,std_precision,avg_recall,std_recall,avg_f1,std_f1,avg_auc,std_auc,avg_log_loss,std_log_loss
"mvp (tf-idf, nmf(5), xgboost)",0.700345,0.000466,0.661571,0.000461,0.385736,0.002493,0.487325,0.001983,0.740593,0.001647,0.568958,0.001288
mvp (+ lemma),0.696787,0.001055,0.649977,0.003057,0.387424,0.00323,0.485464,0.002485,0.738037,0.001362,0.572483,0.000815
cos_sim_model,0.7102,0.00083,0.658748,0.002578,0.446336,0.002215,0.53212,0.001306,0.746769,0.001279,0.56525,0.000963
cos_sim_tfidf_model,0.728261,0.001248,0.659662,0.00224,0.545419,0.00137,0.597124,0.001666,0.799173,0.001407,0.513172,0.001191
feat_eng_model,0.743614,0.002021,0.664102,0.003502,0.6184,0.001553,0.640434,0.002281,0.82107,0.001428,0.489465,0.001141
feat_eng_model_lemma_fix,0.744356,0.002107,0.664513,0.004333,0.621357,0.000901,0.642201,0.001609,0.822197,0.00171,0.488131,0.001342
feat_eng_model_lemma_clean,0.763927,0.002404,0.676166,0.003904,0.692113,0.001128,0.684044,0.002549,0.846923,0.001643,0.456929,0.00141
rf_feat_eng_model_lemma_clean,0.783667,0.00226,0.708853,0.003681,0.702725,0.001666,0.705774,0.002658,0.868202,0.001148,0.436197,0.00064
ensemble_rf_xgb,0.779,0.00274,0.697794,0.004357,0.708157,0.001912,0.702935,0.003148,0.863334,0.001438,0.441784,0.001107
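
For readers without access to `utils.log_scores`, the avg/std columns can be derived from the dictionary returned by `cross_validate` roughly as below. This is a sketch under assumptions: `summarize_cv` is a hypothetical helper, the fold scores are made up, and the output keys follow the scorer names, so they may differ from the column names used in the table.

```python
import numpy as np


def summarize_cv(cv_results, metrics):
    """Collapse per-fold test scores into avg_/std_ entries (hypothetical helper)."""
    row = {}
    for metric in metrics:
        scores = np.asarray(cv_results[f"test_{metric}"], dtype=float)
        if metric.startswith("neg_"):
            # scikit-learn negates losses so that higher is always better;
            # flip the sign back and drop the prefix for reporting.
            scores, metric = -scores, metric[len("neg_"):]
        row[f"avg_{metric}"] = float(scores.mean())
        row[f"std_{metric}"] = float(scores.std())
    return row


# Made-up fold scores, shaped like the dict returned by cross_validate.
fake_cv = {
    "test_accuracy": [0.70, 0.80, 0.90],
    "test_neg_log_loss": [-0.50, -0.40, -0.60],
}
summary = summarize_cv(fake_cv, ["accuracy", "neg_log_loss"])
```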


In [6]:
utils.save(results_df, 'results')

## Results

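The feature engineering gave a clear jump in AUC, from 0.80 (`cos_sim_tfidf_model`) to 0.82 (`feat_eng_model`). The averaged RF + XGBoost ensemble reaches an AUC of 0.863, slightly below the random forest on its own (0.868 for `rf_feat_eng_model_lemma_clean`).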

In [9]:
vc = VotingClassifier([('rf', rf), ('xgb', xgb)], voting='soft')
vc.fit(X_transform, y_train)

VotingClassifier(estimators=[('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_we...ate=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1))],
         flatten_transform=None, n_jobs=None, voting='soft', weights=None)

In [10]:
y_probs = vc.predict_proba(X_transform)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()

index,gt,prob,diff
0,0,0.271811,-0.271811
1,0,0.331749,-0.331749
2,0,0.158016,-0.158016
3,0,0.000237,-0.000237
4,0,0.034689,-0.034689
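
The output above is consistent with `diff = gt - prob`, so large positive values are confident false negatives and large negative values are confident false positives. A minimal sketch of that analysis, assuming this definition (the real `utils.ground_truth_analysis` is not shown, and `ground_truth_errors` is a hypothetical stand-in):

```python
import pandas as pd


def ground_truth_errors(y_true, y_prob):
    """Table of label, predicted probability, and their signed difference."""
    df = pd.DataFrame({"gt": list(y_true), "prob": list(y_prob)})
    df["diff"] = df["gt"] - df["prob"]
    return df


# Made-up labels and probabilities for four question pairs.
errors = ground_truth_errors([1, 0, 1, 0], [0.1, 0.9, 0.8, 0.2])

# Worst false negative: duplicate pair that received a low probability.
top_fn = errors.sort_values("diff", ascending=False).index[0]
# Worst false positive: non-duplicate pair that received a high probability.
top_fp = errors.sort_values("diff").index[0]
```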


In [11]:
lemma_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
    ]
)
X_train_lemma = lemma_pipe.transform(X_train)

## Top false negative errors

In [12]:
fn_idx = class_errors_df.sort_values('diff', ascending=False).head().index
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

Prob: 0.06066200464556905

Is unifunds.com legit?
Is unifunds legit?

Lemma--------

unifund legit
unifundscom legit

Feature Space------
[ 0.66666667  0.33333333  0.33333333  6.57793499  6.57793499  6.57793499
  0.732846    0.732846    0.732846   88.33192412 88.33192412 88.33192412
  6.57793499  6.57793499  6.57793499  0.732846    0.732846    0.732846
 88.33192412 88.33192412 88.33192412  0.5         0.5         0.5
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.        ]
-------------------------------------------

Prob: 0.14889085861018544

What is the remainder when 2^100 is divided by 101?
What is the remainder when [math]2^{100}[/math] is divided by 101?

Lemma--------

remainder  divide
remainder  divide

Feature Space------
[  1.           1.           1.           0.           7.61300637
   5.48716164   0.           0.83449638   0.5

## Top false positive errors

In [13]:
# most negative diff first: non-duplicates that received a high probability
fp_idx = class_errors_df.sort_values('diff').head().index
for idx in fp_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

Prob: 0.8096086367161561

What was the significance of the battle of Somme, and how did this battle compare and contrast to the Battle of Taiyuan?
What was the significance of the battle of Somme, and how did this battle compare and contrast to the Battle of Eslands River?

Lemma--------

significance battle somme battle compare contrast battle esland river
significance battle somme battle compare contrast battle taiyuan

Feature Space------
[  0.90322581   0.87804878   0.8372093    0.           8.94758213
   5.63428436   0.           1.14395334   0.58597778   0.
 119.74790659  76.10832393   0.           9.47086476   5.85348526
   0.           1.14395334   0.60099757   0.         123.61142434
  79.3040757    0.76923077   0.70588235   0.58823529   0.
   8.94758213   6.58424623   0.           1.07835326   0.69907214
   0.         119.74790659  88.10770493   0.           9.47086476
   7.08213208   0.           1.07835326   0.72237867   0.
 123.61142434  95.47296559]
----------------------

## Next Steps

1. The false negative errors appear to be related to numbers.
  * Either stop removing numbers in the text-cleaning process, or create another set of features.
2. The false positives seem very tricky. Facebook's InferSent model, which embeds sentences into a 4096-dimensional space, might help.
  * Could check whether the distances between some of the false positive examples differ in this space.
3. Could look at alignments of the two questions (possibly with nltk).
4. Add different word embeddings to capture different nuances, or even use the vectors-only dataset from spaCy.
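
One possible shape for the number-based features in step 1 is sketched below. It has not been tried on this data; the regex, the helper names, and the choice to score two number-free questions as 1.0 are all assumptions. It works on the raw question text, before cleaning strips the digits.

```python
import re

# Integers and simple decimals; an assumed definition of "number" for this sketch.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def number_tokens(question):
    """Set of numeric tokens found in the raw (uncleaned) question text."""
    return set(NUMBER_RE.findall(question))


def number_overlap(q1, q2):
    """Jaccard overlap of the numeric tokens in two questions."""
    a, b = number_tokens(q1), number_tokens(q2)
    if not a and not b:
        # Neither question mentions a number, so there is nothing to disagree on.
        return 1.0
    return len(a & b) / len(a | b)


q1 = "What is the remainder when 2^100 is divided by 101?"
q2 = "What is the remainder when [math]2^{100}[/math] is divided by 101?"
same_numbers = number_overlap(q1, q2)
different_numbers = number_overlap("What is 2^100 mod 101?", "What is 2^100 mod 7?")
```

On the false negative example shown earlier, both questions contain the numbers 2, 100 and 101, so the overlap is complete even though the lemmatized text dropped them.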