## Enhanced feature engineering model

This model adds engineered features computed from the original question text, in addition to the lemmatized question text.

Two types of features are engineered:

1. n_gram similarity between each pair of questions
2. min/max/avg distance between words in a single question, currently using the following metrics:
  * euclidean
  * cosine
  * city block (Manhattan)
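
The distance features can be sketched as follows. This is a minimal illustration, not the code in `utils.add_min_max_avg_distance_features` (which isn't shown here). It assumes each word has already been mapped to a vector; the function names and the toy vectors are hypothetical.

```python
import numpy as np

def pairwise_distances(vectors, metric):
    """Return distances between every unordered pair of row vectors."""
    i, j = np.triu_indices(len(vectors), k=1)
    a, b = vectors[i], vectors[j]
    if metric == "euclidean":
        return np.sqrt(((a - b) ** 2).sum(axis=1))
    if metric == "cityblock":
        return np.abs(a - b).sum(axis=1)
    if metric == "cosine":
        norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
        return 1.0 - (a * b).sum(axis=1) / norms
    raise ValueError(f"unknown metric: {metric}")

def min_max_avg_distances(vectors, metrics=("euclidean", "cosine", "cityblock")):
    """Hypothetical helper: min, max and mean distance per metric for one question."""
    feats = []
    for metric in metrics:
        d = pairwise_distances(vectors, metric)
        feats.extend([d.min(), d.max(), d.mean()])
    return np.array(feats)

# Toy 2-d "word vectors" for a three-word question (made up for illustration).
word_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = min_max_avg_distances(word_vectors)   # 3 metrics x 3 statistics
```

A question with fewer than two words has no word pairs, so a real implementation would need a fallback value for that case.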
  
**Pipeline**
1. Stack questions
2. Clean questions (now lowercases all words so that proper nouns lemmatize better)
3. UNION
    1. n_gram similarity
    2. min/max/avg distance
4. Lemmatize questions
5. UNION
    1. n_gram similarity
    2. min/max/avg distances
6. UNION together both sets of features
7. Average Ensemble
    1. Random Forest
    2. XGBClassifier
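
The stacking step and the later unstacking of per-question features can be pictured with the sketch below. It assumes the interleaved layout implied by the error-analysis cells (question 1 of pair `i` at row `2*i`, question 2 at row `2*i + 1`). The `stack` and `unstack` functions here are illustrative stand-ins, not the actual `utils` helpers.

```python
import numpy as np

def stack(pairs):
    """Flatten (n_pairs, 2) questions into a 1-d array of length 2 * n_pairs."""
    return np.asarray(pairs, dtype=object).reshape(-1)

def unstack(features):
    """Put the features of both questions of a pair side by side in one row."""
    features = np.asarray(features)
    return features.reshape(features.shape[0] // 2, -1)

pairs = [["how do i learn python", "how to learn python"],
         ["what is nmf", "what is tf-idf"]]
stacked = stack(pairs)            # 4 questions, pair i at rows 2*i and 2*i + 1

# Pretend each question yields 3 features (here: just made-up numbers).
per_question = np.arange(12).reshape(4, 3)
per_pair = unstack(per_question)  # shape (2, 6): [q1 features | q2 features]
```

Stacking lets single-question transformers run once over all questions, while unstacking restores one row per pair so the classifier sees both questions together.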

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')
MEM_PATH = '../data/transform_memory'

## Cosine similarity of TF-IDF vectors plus NMF topics

In [3]:
nmf_pipe = Pipeline(
    [
        ('nmf', NMF(n_components=5)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=True))
    ]
)

cos_pipe = Pipeline(
    [
        ('cos', FunctionTransformer(utils.calc_cos_sim_stack, validate=False))
    ]
)

nmf_cos_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
        ('tf', TfidfVectorizer()),
        ('feats', FeatureUnion(
            [
                ('nmf_pipe', nmf_pipe),
                ('cos_pipe', cos_pipe)
            ]
        )),
        ('xgb', XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42))
    ],
    memory = MEM_PATH
)
# X_transform = pipe.fit_transform(X_train)
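
The `cos` step compares the two questions of each pair in TF-IDF space. The source of `utils.calc_cos_sim_stack` isn't shown, so the following is only a sketch of one way to do it, assuming a dense stacked matrix in which rows `2*i` and `2*i + 1` belong to the same pair. The name `cos_sim_stacked` is hypothetical.

```python
import numpy as np

def cos_sim_stacked(X):
    """Cosine similarity between rows 2*i and 2*i + 1 of a stacked matrix."""
    X = np.asarray(X, dtype=float)
    q1, q2 = X[0::2], X[1::2]
    norms = np.linalg.norm(q1, axis=1) * np.linalg.norm(q2, axis=1)
    # Guard against all-zero rows (e.g. a question with no known terms).
    safe = np.where(norms == 0, 1.0, norms)
    sims = (q1 * q2).sum(axis=1) / safe
    return sims.reshape(-1, 1)   # one feature column, one row per pair

# Toy TF-IDF-like rows for two question pairs (values are made up).
X = [[1, 0, 1], [1, 0, 1],    # identical questions
     [1, 0, 0], [0, 1, 0]]    # no shared terms
sims = cos_sim_stacked(X)
```

`TfidfVectorizer` returns a sparse matrix, so real code would either use sparse operations or densify in batches.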

## Feature Engineering Pipes

In [4]:
# feature engineering pipes
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pair_question_pipe = Pipeline(
    [
        ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False))
    ]
)

# clean text pipe
clean_text_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# lemma pipe
lemma_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)
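
The `ngram_sim` step delegates to `utils.calc_ngram_similarity` with `n_grams=[1, 2, 3]`. That helper's source isn't included here, so the sketch below shows just one plausible definition, assuming Jaccard overlap of word n-gram sets. The function names are hypothetical.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_similarity(q1, q2, n_grams=(1, 2, 3)):
    """Jaccard overlap of the two questions' n-gram sets, one value per n."""
    sims = []
    for n in n_grams:
        a, b = word_ngrams(q1, n), word_ngrams(q2, n)
        union = a | b
        # Two questions that are both too short for n-grams get similarity 0.
        sims.append(len(a & b) / len(union) if union else 0.0)
    return sims

sims = ngram_similarity("how do i learn python", "how can i learn python")
```

Higher-order n-grams penalize word-order changes more heavily, which is why the similarity typically drops as `n` grows.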

## XGBoost

In [5]:

XGB_pipe = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_features', clean_text_pipe),
                ('lemma_pipe', lemma_pipe)
            ]
        )),
        ('xgb', XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42))
    ],
    memory = MEM_PATH
)

## Random Forest

In [6]:
RF_pipe = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_features', clean_text_pipe),
                ('lemma_pipe', lemma_pipe)
            ]
        )),
        ('rf', RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42))
    ],
    memory = MEM_PATH
)

## Ensemble

In [7]:
# weighting based on individual AUC
## cos_sim = 0.799173
## xgboost = 0.846923
## rf = 0.868202

total = 0.799173 + 0.846923 + 0.868202
cos_sim_weight = 0.799173 / total
xgb_weight = 0.846923 / total
rf_weight = 0.868202 / total

weights = [cos_sim_weight, xgb_weight, rf_weight]
estimators = [('cos_sim', nmf_cos_pipe), ('xgb', XGB_pipe), ('rf', RF_pipe)]
vc = VotingClassifier(estimators, voting='soft', n_jobs=-1, weights=weights)
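
To make the weighting concrete, the sketch below reproduces the AUC-proportional weights and applies them the way soft voting does, as a weighted average of per-model probabilities. The probabilities are made up for illustration; only the AUC values come from the cell above.

```python
import numpy as np

# Individual AUC scores quoted in the cell above.
aucs = np.array([0.799173, 0.846923, 0.868202])   # cos_sim, xgb, rf
weights = aucs / aucs.sum()

# Made-up positive-class probabilities from the three models for two samples.
probas = np.array([[0.30, 0.90],    # cos_sim
                   [0.40, 0.80],    # xgb
                   [0.60, 0.70]])   # rf

# Soft voting averages the probabilities, weighted per model.
ensemble = np.average(probas, axis=0, weights=weights)
predictions = (ensemble >= 0.5).astype(int)
```

Because the three AUCs are close, the resulting weights (roughly 0.32, 0.34 and 0.35) are nearly uniform, so this ensemble behaves much like a plain average.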

## Cross-validate

In [8]:
# X_transform = pre_process_pipe.transform(X_train)

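# shuffle=True is required for random_state to take effect
# (recent scikit-learn versions raise an error otherwise)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)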
cv = cross_validate(vc, 
               X_train, 
               y_train, 
               cv=skf, 
               n_jobs=-1, 
               scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'))

KeyboardInterrupt: 

## Log results

In [None]:
results_df = utils.load('results')

results_df = results_df.drop(index='ensemble_rf_xgb_cos_sim', errors='ignore')
results_df = results_df.append(utils.log_scores(cv, 'ensemble_rf_xgb_cos_sim'))
results_df

In [None]:
utils.save(results_df, 'results')

## Results

Wow! Feature engineering lifts the AUC from 0.80 to 0.82.

In [9]:
vc = VotingClassifier([('rf', rf), ('xgb', xgb)], voting='soft')
vc.fit(X_transform, y_train)

VotingClassifier(estimators=[('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_we...ate=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1))],
         flatten_transform=None, n_jobs=None, voting='soft', weights=None)

In [10]:
y_probs = vc.predict_proba(X_transform)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()

| | gt | prob | diff |
|---|---|---|---|
| 0 | 0 | 0.271811 | -0.271811 |
| 1 | 0 | 0.331749 | -0.331749 |
| 2 | 0 | 0.158016 | -0.158016 |
| 3 | 0 | 0.000237 | -0.000237 |
| 4 | 0 | 0.034689 | -0.034689 |
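
Judging from the output, `diff` appears to be the ground-truth label minus the predicted probability (every row shown has `gt = 0` and `diff = -prob`). Under that assumption, the largest positive values are the worst false negatives and the most negative values are the worst false positives. The sketch below uses made-up data, and `ground_truth_diff` is a hypothetical stand-in for `utils.ground_truth_analysis`.

```python
import numpy as np

def ground_truth_diff(y_true, y_prob):
    """Signed error per sample: positive for misses, negative for false alarms."""
    return np.asarray(y_true, dtype=float) - np.asarray(y_prob, dtype=float)

# Made-up labels and predicted probabilities for four question pairs.
y_true = [1, 0, 1, 0]
y_prob = [0.10, 0.95, 0.80, 0.20]
diff = ground_truth_diff(y_true, y_prob)

worst_false_negative = int(np.argmax(diff))   # positive pair scored low
worst_false_positive = int(np.argmin(diff))   # negative pair scored high
```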


In [None]:
lemma_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False)),
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False)),
    ]
)
X_train_lemma = lemma_pipe.transform(X_train)

## Top false negative errors

In [None]:
fn_idx = class_errors_df.sort_values('diff', ascending = False).head().index
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

## Top false positive errors

In [None]:
# most negative diff first: ground truth 0 but high predicted probability
fp_idx = class_errors_df.sort_values('diff').head().index
for idx in fp_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

## Next Steps

1. The false negative errors appear to be related to numbers.
  * Either stop removing numbers in the clean text process, or create another set of features.
2. The false positives seem very tricky. Facebook's InferSent model, which embeds sentences into a 4096-dimensional space, might help.
  * Could check whether the distances between some of the false positive examples are different in this space.
3. Could look at alignments of the two questions (possibly available in nltk).
4. Add different word embeddings to capture different nuances, or even use the vectors-only dataset from spaCy.
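
For the first step, one option is a small set of number-aware features computed on the raw questions, before cleaning strips the digits. The sketch below is purely illustrative: `number_features`, the regular expression and the feature names are assumptions, not part of the existing `utils` module.

```python
import re

# Matches integers and simple decimals such as "1995" or "3.5".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

def number_features(q1, q2):
    """Hypothetical features comparing the numbers mentioned in two questions."""
    n1, n2 = set(NUMBER_RE.findall(q1)), set(NUMBER_RE.findall(q2))
    union = n1 | n2
    return {
        "both_have_numbers": int(bool(n1) and bool(n2)),
        "numbers_match": int(n1 == n2),
        # Questions without any numbers are treated as fully matching.
        "number_jaccard": len(n1 & n2) / len(union) if union else 1.0,
    }

feats = number_features("What happened in 1995?", "What happened in 1996?")
```

Two questions that differ only in a year or quantity would then get a low `number_jaccard` even when their n_gram similarity is high.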