## Enhanced feature engineering model

This model adds engineered features computed on the original questions, in addition to those computed on the lemmatized questions.

Two types of features are engineered:

1. n_gram similarity between each pair of questions
2. min/max/avg distance between words in a single question. Currently using the following metrics:
  * euclidean
  * cosine
  * city block (Manhattan)
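
As a rough illustration of the first feature type, the sketch below scores a question pair by the Jaccard overlap of its word n-grams for n = 1, 2, 3. This is an assumption about what an n-gram similarity could look like; the actual `utils.calc_ngram_similarity` is not shown in this notebook and may use a different measure. The names `word_ngrams` and `ngram_jaccard` are hypothetical.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(q1, q2, n):
    """Jaccard overlap of the two questions' n-gram sets (0.0 if both are empty)."""
    a, b = word_ngrams(q1, n), word_ngrams(q2, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# toy question pair (hypothetical, not taken from the data set)
q1 = "how do i learn python"
q2 = "how do i learn java"

# one similarity feature per n-gram size
features = [ngram_jaccard(q1, q2, n) for n in (1, 2, 3)]
print(features)
```

Longer n-grams penalize a single differing word more heavily, so the three values together say something about where the two questions diverge.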
  
**Pipeline**
1. Stack questions
2. Clean questions (now lowercases all words so that proper nouns lemmatize better)
3. UNION
    1. n_gram similarity
    2. min/max/avg distance
4. Lemmatize questions
5. UNION
    1. n_gram similarity
    2. min/max/avg distances
6. UNION together both sets of features
7. Average Ensemble
    1. Random Forest
    2. XGBClassifier
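
The stack and unstack steps can be pictured with the sketch below. It assumes the two questions of each pair are interleaved (`q1` at row `2*i`, `q2` at row `2*i + 1`), which matches the `idx*2` / `idx*2+1` indexing used in the error analysis later in this notebook. The functions here are hypothetical stand-ins, not the real `utils.stack_questions` and `utils.unstack_questions`.

```python
import numpy as np


def stack_questions_sketch(pairs):
    """Flatten an (n_pairs, 2) array into [q1_0, q2_0, q1_1, q2_1, ...]."""
    return np.asarray(pairs, dtype=object).reshape(-1)


def unstack_questions_sketch(X):
    """Put the features of q1 and q2 side by side: (2n, k) -> (n, 2k)."""
    X = np.asarray(X)
    return X.reshape(X.shape[0] // 2, -1)


# toy question pairs (hypothetical)
pairs = np.array([["a b", "a c"], ["d", "e f"]], dtype=object)

stacked = stack_questions_sketch(pairs)

# a per-question feature computed on the stacked rows: word count
word_counts = np.array([[len(q.split())] for q in stacked])

# back to one row per pair, with q1 and q2 features as separate columns
unstacked = unstack_questions_sketch(word_counts)
print(stacked)
print(unstacked)
```

Stacking lets single-question transformers run once over all questions, and unstacking restores one row per pair for the classifier.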

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')
MEM_PATH = '../data/transform_memory'

## Pre-Process pipes

In [3]:
clean_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False))
    ],
    memory = MEM_PATH
)

lemma_pipe = Pipeline(
    [
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False))
    ],
    memory = MEM_PATH
)

## Cosine similarity of TF-IDF vectors plus NMF topics

In [4]:
nmf_pipe = Pipeline(
    [
        ('nmf', NMF(n_components=5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=True))
    ]
)

cos_pipe = Pipeline(
    [
        ('cos', FunctionTransformer(utils.calc_cos_sim_stack, validate=False))
    ]
)

nmf_cos_pipe = Pipeline(
    [
        ('clean', clean_pipe),
        ('lemma', lemma_pipe),
        ('tf', TfidfVectorizer()),
        ('feats', FeatureUnion(
            [
                ('nmf_pipe', nmf_pipe),
                ('cos_pipe', cos_pipe)
            ]
        )),
        ('xgb', XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42))
    ]
)
# X_transform = pipe.fit_transform(X_train)

## Feature Engineering Pipes

In [5]:
# feature engineering pipes
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pair_question_pipe = Pipeline(
    [
        ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False))
    ]
)

# clean text pipe
clean_text_pipe = Pipeline(
    [
        ('clean', clean_pipe),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# lemma pipe
lemma_text_pipe = Pipeline(
    [
        ('clean', clean_pipe),
        ('lemma', lemma_pipe),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)
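
To make the `dist` step more concrete, here is a minimal sketch of min/max/avg distance features for a single question. It assumes each word is already mapped to a vector; the embedding source and the exact behaviour of `utils.add_min_max_avg_distance_features` are not shown in this notebook, so `pairwise_word_distances` and `min_max_avg_distance` are hypothetical names and the toy vectors are made up.

```python
import numpy as np


def pairwise_word_distances(vectors, metric):
    """Distances between every unordered pair of rows in `vectors`."""
    i, j = np.triu_indices(len(vectors), k=1)
    a, b = vectors[i], vectors[j]
    if metric == "euclidean":
        return np.linalg.norm(a - b, axis=1)
    if metric == "cityblock":
        return np.abs(a - b).sum(axis=1)
    if metric == "cosine":
        # assumes no all-zero word vectors, otherwise this divides by zero
        norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
        return 1.0 - (a * b).sum(axis=1) / norms
    raise ValueError(f"unknown metric: {metric}")


def min_max_avg_distance(vectors):
    """Return [min, max, mean] for each metric: 9 features per question."""
    feats = []
    for metric in ("euclidean", "cosine", "cityblock"):
        d = pairwise_word_distances(vectors, metric)
        # questions with fewer than two words have no pairs; fall back to zeros
        feats.extend([d.min(), d.max(), d.mean()] if d.size else [0.0, 0.0, 0.0])
    return np.array(feats)


# three toy 2-d "word vectors" for one question
word_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = min_max_avg_distance(word_vectors)
print(features)
```

After unstacking, each pair would then carry these nine values for each of its two questions.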

## XGBoost

In [6]:

XGB_pipe = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_features', clean_text_pipe),
                ('lemma_pipe', lemma_text_pipe)
            ]
        )),
        ('xgb', XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42))
    ]
)

## Random Forest

In [7]:
RF_pipe = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_features', clean_text_pipe),
                ('lemma_pipe', lemma_text_pipe)
            ]
        )),
        ('rf', RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42))
    ]
)

## Ensemble

In [10]:
# weighting based on individual AUC
## cos_sim = 0.799173
## xgboost = 0.846923
## rf = 0.868202

total = 0.799173 + 0.846923 + 0.868202
cos_sim_weight = 0.799173 / total
xgb_weight = 0.846923 / total
rf_weight = 0.868202 / total

weights = [cos_sim_weight, xgb_weight, rf_weight]

# nmf_cos = utils.load('cos_sim_tfidf_model')
# xgb = utils.load('xgb_feat_eng_model')
# rf = utils.load('rf_feat_eng_model')

estimators = [('cos_sim', nmf_cos_pipe), ('xgb', XGB_pipe), ('rf', RF_pipe)]
vc = VotingClassifier(estimators, voting='soft', n_jobs=1, weights=weights) 
## don't think this works with what I want to do
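
For reference, soft voting with weights amounts to a weighted average of each model's predicted probabilities. The sketch below reproduces the AUC-proportional weights from the cell above and applies them to made-up class-1 probabilities; the `probas` values are hypothetical and only there to show the arithmetic.

```python
import numpy as np

# individual AUC scores quoted in the cell above
aucs = {"cos_sim": 0.799173, "xgb": 0.846923, "rf": 0.868202}

# normalise so the weights sum to 1
weights = np.array(list(aucs.values())) / sum(aucs.values())

# hypothetical class-1 probabilities, shape (n_models, n_samples)
probas = np.array([
    [0.40, 0.10],  # cos_sim
    [0.35, 0.05],  # xgb
    [0.30, 0.02],  # rf
])

# weighted average over the model axis, one ensemble probability per sample
ensemble = np.average(probas, axis=0, weights=weights)
print(weights)
print(ensemble)
```

Because the three AUCs are close, the resulting weights are all near 1/3, so this ensemble behaves almost like a plain average.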

## Cross-validate

In [None]:
# X_transform = pre_process_pipe.transform(X_train)

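# random_state only takes effect when shuffle=True (recent scikit-learn versions
# raise a ValueError if it is set without shuffling), so it is omitted here
skf = StratifiedKFold(n_splits=3)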
cv = cross_validate(vc, 
               X_train[:10000], 
               y_train[:10000], 
               cv=skf, 
               n_jobs=1, 
               scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'),
               verbose=100)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  ................................................................
[CV]  , accuracy=0.7366526694661067, precision=0.6607929515418502, recall=0.6033789219629928, f1=0.6307821698906645, roc_auc=0.8161449694568877, neg_log_loss=-0.4969755618973226, total= 4.5min
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 11.0min remaining:    0.0s
[CV]  ................................................................
[CV]  , accuracy=0.7314731473147315, precision=0.6494401378122309, recall=0.607085346215781, f1=0.6275488972118186, roc_auc=0.8166788729552542, neg_log_loss=-0.49579440384491175, total= 4.5min
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 21.9min remaining:    0.0s
[CV]  ................................................................


## Log results

In [None]:
results_df = utils.load('results')

results_df = results_df.drop(index='ensemble_rf_xgb_cos_sim', errors='ignore')
results_df = results_df.append(utils.log_scores(cv, 'ensemble_rf_xgb_cos_sim'))
results_df

In [None]:
utils.save(results_df, 'results')

## Results

Wow! Feature engineering gives a clear jump in cross-validated AUC, from about 0.80 to about 0.82.

In [9]:
vc = VotingClassifier([('rf', rf), ('xgb', xgb)], voting='soft')
vc.fit(X_transform, y_train)

VotingClassifier(estimators=[('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_we...ate=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1))],
         flatten_transform=None, n_jobs=None, voting='soft', weights=None)

In [10]:
y_probs = vc.predict_proba(X_transform)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()

|   | gt | prob | diff |
|---|----|----------|-----------|
| 0 | 0 | 0.271811 | -0.271811 |
| 1 | 0 | 0.331749 | -0.331749 |
| 2 | 0 | 0.158016 | -0.158016 |
| 3 | 0 | 0.000237 | -0.000237 |
| 4 | 0 | 0.034689 | -0.034689 |


In [None]:
lemma_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
    ]
)
X_train_lemma = lemma_pipe.transform(X_train)

## Top false negative errors

In [None]:
fn_idx = class_errors_df.sort_values('diff', ascending = False).head().index
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

## Top false positive errors

In [None]:
# most negative diff (gt - prob): true label 0 but high predicted probability
fp_idx = class_errors_df.sort_values('diff').head().index
for idx in fp_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

## Next Steps

1. The false negative errors appear to be related to numbers.
  * Either stop removing numbers in the clean-text process, or create another set of features for them.
2. The false positives seem very tricky. Facebook's InferSent model embeds sentences into a 4096-dimensional space.
  * Could check whether the distances between some of the false positive examples are different in this space.
3. Could look at alignments of the two questions (I think nltk supports this).
4. Add different word embeddings to capture different nuances, or even use the vectors-only data set from spaCy.
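
One possible shape for the number-based features in step 1 is sketched below: extract the numeric tokens from each raw question and compare them. This is only an illustration of the idea, not something the current pipeline does; `number_overlap` and the regular expression are hypothetical choices.

```python
import re

# integers and simple decimals, e.g. "1945" or "2.5"
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def number_overlap(q1, q2):
    """Jaccard overlap of the numeric tokens in two questions.

    Returns 1.0 when neither question contains a number, so that pairs
    without numbers are not treated as mismatches.
    """
    a, b = set(NUMBER_RE.findall(q1)), set(NUMBER_RE.findall(q2))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# toy examples (hypothetical, not taken from the data set)
different_years = number_overlap("What happened in 1945?", "What happened in 1939?")
same_size = number_overlap("Is 2.5 GB enough?", "Is 2.5 GB of RAM enough?")
no_numbers = number_overlap("How do I learn Python?", "How can I learn Python?")
print(different_years, same_size, no_numbers)
```

A feature like this would have to run on the text before numbers are stripped by the cleaning step.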

In [18]:
nmf_cos_proba = nmf_cos.predict_proba(X_nmf_cos_feat)[:, 1]
xgb_proba = xgb.predict_proba(X_feat_eng_transform)[:, 1]
rf_proba = rf.predict_proba(X_feat_eng_transform)[:, 1]

In [13]:
# weight each model's probabilities without overwriting the raw ones,
# so the per-model metrics are still computed on unscaled probabilities
ensemble_proba = (nmf_cos_proba * cos_sim_weight
                  + xgb_proba * xgb_weight
                  + rf_proba * rf_weight)
ensemble_proba[:10]

array([0.39567994, 0.34479889, 0.17470645, 0.00103656, 0.07662025,
       0.45168266, 0.29587881, 0.14138014, 0.0013542 , 0.68504003])

In [15]:
from sklearn import metrics

In [20]:
# scored on the training data the models were fit on, so these values are optimistic
print(f'Ensemble: {metrics.roc_auc_score(y_train, ensemble_proba)}')
print(f'Nmf_cos: {metrics.roc_auc_score(y_train, nmf_cos_proba)}')
print(f'XGB: {metrics.roc_auc_score(y_train, xgb_proba)}')
print(f'RF: {metrics.roc_auc_score(y_train, rf_proba)}')

Ensemble: 0.9846300420170292
Nmf_cos: 0.8056886449194024
XGB: 0.8540229521257536
RF: 0.9999764283543164


In [21]:
print(f'Ensemble: {metrics.log_loss(y_train, ensemble_proba)}')
print(f'Nmf_cos: {metrics.log_loss(y_train, nmf_cos_proba)}')
print(f'XGB: {metrics.log_loss(y_train, xgb_proba)}')
print(f'RF: {metrics.log_loss(y_train, rf_proba)}')

Ensemble: 0.32102219836966456
Nmf_cos: 0.5069449937264294
XGB: 0.4495672382949828
RF: 0.12308486086868513
