## Enhanced feature engineering model

Some question pairs appear to differ only in the numbers they contain, so this pipeline no longer strips numbers from the questions.
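
To illustrate why keeping numbers matters, here is a minimal sketch. The `clean_question` helper below is a hypothetical stand-in for `utils.clean_questions` (whose implementation is not shown in this notebook); only the `excl_num` flag name is taken from the pipeline. The two example questions are made up.

```python
import re

def clean_question(text, excl_num=True):
    """Lowercase and keep only letters (plus digits when excl_num is False)."""
    pattern = r"[^a-z\s]" if excl_num else r"[^a-z0-9\s]"
    text = re.sub(pattern, " ", text.lower())
    return " ".join(text.split())  # collapse repeated whitespace

# made-up question pair that differs only in its numbers
q1 = "How do I lose 5 kg in 2 weeks?"
q2 = "How do I lose 20 kg in 6 weeks?"

same_without_numbers = clean_question(q1) == clean_question(q2)
same_with_numbers = (
    clean_question(q1, excl_num=False) == clean_question(q2, excl_num=False)
)
print(same_without_numbers, same_with_numbers)
```

With numbers stripped, the two questions collapse to the same string, so no downstream feature can tell them apart.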

Two types of features are engineered:

1. n_gram similarity between each pair of questions
2. min/max/avg distance between words in a single question. The following metrics are currently used:
   * euclidean
   * cosine
   * city block (manhattan)
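
As a rough sketch of both feature types: the functions below are illustrative assumptions, not the code behind `utils.calc_ngram_similarity` or `utils.add_min_max_avg_distance_features`, which this notebook does not show. In particular, Jaccard overlap is an assumed similarity measure, and the toy 2-d word vectors are made up (the notebook does not say how words are vectorized).

```python
import math
from itertools import combinations

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_similarity(q1, q2, n):
    """Jaccard overlap of the two questions' n-gram sets (assumed measure)."""
    a, b = ngrams(q1.split(), n), ngrams(q2.split(), n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cityblock(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))

def cosine(u, v):
    # cosine distance = 1 - cosine similarity; assumes non-zero vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def min_max_avg(vectors, metric):
    """Min, max and mean distance over all pairs of word vectors."""
    dists = [metric(u, v) for u, v in combinations(vectors, 2)]
    if not dists:  # fewer than two words: no pairwise distance exists
        return 0.0, 0.0, 0.0
    return min(dists), max(dists), sum(dists) / len(dists)

# pair feature: similarity for n = 1, 2, 3 (same n_grams as the pipeline)
pair_feats = [
    ngram_similarity("what is the best phone", "what is the best laptop", n)
    for n in (1, 2, 3)
]

# single-question features: toy vectors standing in for one question's words
word_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
single_feats = [min_max_avg(word_vectors, m) for m in (euclidean, cosine, cityblock)]
print(pair_feats, single_feats)
```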
  
**Pipeline**
1. Stack questions
2. Clean questions - **include numbers**
3. UNION
    1. n_gram similarity
    2. min/max/avg distance
4. Lemmatize questions
5. UNION
    1. n_gram similarity
    2. min/max/avg distances
6. UNION together both sets of features
7. XGBoost classifier (`XGBClassifier`)
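
The data flow of the steps above can be sketched in plain Python. This is a simplified assumption about what `utils.stack_questions` and `utils.unstack_questions` do, not their actual code. The interleaved layout (question 1 at index `2*i`, question 2 at `2*i + 1`) matches how the error analysis cells later index the stacked data. The feature functions are placeholders.

```python
def stack_questions(pairs):
    """Flatten [(q1, q2), ...] into [q1, q2, q1, q2, ...]."""
    return [q for pair in pairs for q in pair]

def unstack_features(rows):
    """Join the two per-question feature rows of each pair into one row."""
    return [rows[i] + rows[i + 1] for i in range(0, len(rows), 2)]

def single_question_features(question):
    # placeholder per-question features: token count and character count
    return [len(question.split()), len(question)]

def pair_features(q1, q2):
    # placeholder pair feature: number of shared tokens
    return [len(set(q1.split()) & set(q2.split()))]

pairs = [("what is ai", "what is ml"), ("how old is earth", "age of earth")]

stacked = stack_questions(pairs)
single = unstack_features([single_question_features(q) for q in stacked])
pair = [
    pair_features(stacked[i], stacked[i + 1]) for i in range(0, len(stacked), 2)
]

# "UNION": concatenate the feature blocks column-wise, one row per pair
features = [p + s for p, s in zip(pair, single)]
print(features)
```

Stacking lets the per-question steps (cleaning, lemmatizing, distance features) run once over a flat list, while unstacking restores one feature row per question pair for the classifier.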

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')
MEM_PATH = '../data/transform_memory/'
model_name = 'xgb_feat_eng_incl_nums'

In [9]:
clean_text_pipe = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False, kw_args={'excl_num':False})),
    ]
)

lemma_text_pipe = Pipeline(
    [
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False))
    ],
    memory=MEM_PATH
)

# feature engineering pipes
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pair_question_pipe = Pipeline(
    [
        ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False))
    ]
)

# clean text pipe
clean_text_features = Pipeline(
    [
        ('clean', clean_text_pipe),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# lemma pipe
lemma_features = Pipeline(
    [
        ('clean', clean_text_pipe),
        ('lemma', lemma_text_pipe),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# pre-process pipe
feature_union_pipe = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_features', clean_text_features),
                ('lemma_pipe', lemma_features)
            ]
        )),
        ('xgb', XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42))
    ]
)

In [4]:
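# random_state only takes effect when shuffle=True (recent scikit-learn versions
# raise an error otherwise), so it is omitted; the folds are unshuffled either way
skf = StratifiedKFold(n_splits=3)
cv = cross_validate(feature_union_pipe,
                    X_train[:100000],
                    y_train[:100000],
                    cv=skf,
                    n_jobs=1,
                    scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'),
                    verbose=100)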

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  ................................................................
[CV]  , accuracy=0.7656446871062579, precision=0.6808711376680556, recall=0.6976556835575606, f1=0.6891612287123985, roc_auc=0.8503861836558442, neg_log_loss=-0.4523489168091895, total=23.6min
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 58.8min remaining:    0.0s
[CV]  ................................................................
[CV]  , accuracy=0.7692976929769297, precision=0.6858762399622107, recall=0.7019013857557203, f1=0.6937962889225133, roc_auc=0.8532734780564172, neg_log_loss=-0.44961479599386456, total=24.8min
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 118.5min remaining:    0.0s
[CV]  ................................................................
[CV]  , accuracy=0.7663876638766388, precision=0.6798910929599378, recall=0.7041572671608122, f1=0.6918114536747536, roc_auc=0.8522108756876363, neg_log_loss=

In [7]:
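results_df = utils.load('results')

# replace any earlier entry for this model with the fresh scores
results_df = results_df.drop(index=model_name, errors='ignore')
# note: DataFrame.append was removed in pandas 2.0; on newer pandas, use pd.concat instead
results_df = results_df.append(utils.log_scores(cv, model_name))
results_df.sort_values('avg_auc', ascending=False)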

| model | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc | avg_log_loss | std_log_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rf_feat_eng_model_lemma_clean | 0.783667 | 0.00226 | 0.708853 | 0.003681 | 0.702725 | 0.001666 | 0.705774 | 0.002658 | 0.868202 | 0.001148 | 0.436197 | 0.00064 |
| ensemble_rf_xgb | 0.779 | 0.00274 | 0.697794 | 0.004357 | 0.708157 | 0.001912 | 0.702935 | 0.003148 | 0.863334 | 0.001438 | 0.441784 | 0.001107 |
| xgb_feat_eng_incl_nums | 0.76711 | 0.001576 | 0.682213 | 0.002621 | 0.701238 | 0.002695 | 0.69159 | 0.001899 | 0.851957 | 0.001192 | 0.450099 | 0.001675 |
| feat_eng_model_lemma_clean | 0.763927 | 0.002404 | 0.676166 | 0.003904 | 0.692113 | 0.001128 | 0.684044 | 0.002549 | 0.846923 | 0.001643 | 0.456929 | 0.00141 |
| feat_eng_model_lemma_fix | 0.744356 | 0.002107 | 0.664513 | 0.004333 | 0.621357 | 0.000901 | 0.642201 | 0.001609 | 0.822197 | 0.00171 | 0.488131 | 0.001342 |
| feat_eng_model | 0.743614 | 0.002021 | 0.664102 | 0.003502 | 0.6184 | 0.001553 | 0.640434 | 0.002281 | 0.82107 | 0.001428 | 0.489465 | 0.001141 |
| ensemble_rf_xgb_cos_sim | 0.7387 | 0.007359 | 0.66129 | 0.010948 | 0.612827 | 0.009669 | 0.636128 | 0.009994 | 0.819987 | 0.005193 | 0.493703 | 0.003901 |
| lstm_dropout_50 | 0.751849 | 0.0 | 0.6904 | 0.0 | 0.59451 | 0.0 | 0.638877 | 0.0 | 0.802315 | 0.0 | 8.570912 | 0.0 |
| lstm_mvp | 0.74976 | 0.0 | 0.685627 | 0.0 | 0.595133 | 0.0 | 0.637183 | 0.0 | 0.801019 | 0.0 | 8.643059 | 0.0 |
| cos_sim_tfidf_model | 0.729511 | 0.001216 | 0.66168 | 0.002219 | 0.547188 | 0.001744 | 0.59901 | 0.001703 | 0.800271 | 0.001291 | 0.512085 | 0.001299 |
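
The `avg_*` and `std_*` columns can be derived from the per-fold scores that `cross_validate` returns. The sketch below is a guess at what `utils.log_scores` computes, since that helper is not shown. `summarize_cv` and `METRICS` are hypothetical names. It assumes the population standard deviation and a sign flip for `neg_log_loss`. Fed the three accuracy and ROC AUC fold values from the log above, it gives values that round to the `xgb_feat_eng_incl_nums` row (0.76711 / 0.001576 and 0.851957 / 0.001192).

```python
from statistics import mean, pstdev

# cross_validate result key -> column suffix used in the results table
METRICS = {
    "test_accuracy": "accuracy",
    "test_precision": "precision",
    "test_recall": "recall",
    "test_f1": "f1",
    "test_roc_auc": "auc",
    "test_neg_log_loss": "log_loss",
}

def summarize_cv(cv, name):
    """Collapse per-fold scores into one row of avg_/std_ values."""
    row = {"model": name}
    for key, label in METRICS.items():
        if key not in cv:
            continue
        scores = [float(s) for s in cv[key]]
        if key.startswith("test_neg_"):
            scores = [-s for s in scores]  # report log loss as a positive number
        row["avg_" + label] = mean(scores)
        row["std_" + label] = pstdev(scores)  # population std (assumption)
    return row

# per-fold values copied from the cross-validation log in this notebook
cv_example = {
    "test_accuracy": [0.7656446871062579, 0.7692976929769297, 0.7663876638766388],
    "test_roc_auc": [0.8503861836558442, 0.8532734780564172, 0.8522108756876363],
}
row = summarize_cv(cv_example, "xgb_feat_eng_incl_nums")
print({k: round(v, 6) for k, v in row.items() if k != "model"})
```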


In [8]:
utils.save(results_df, 'results')

## Results

In [10]:
feature_union_pipe.fit(X_train, y_train)
utils.save(feature_union_pipe, model_name)

y_probs = feature_union_pipe.predict_proba(X_train)[:, 1]
class_errors_df = utils.ground_truth_analysis(y_train, y_probs)
class_errors_df.head()

KeyboardInterrupt: 

In [None]:
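lemma_pipe = Pipeline(
    [
        ('clean', clean_text_pipe),
        ('lemma', lemma_text_pipe),
    ]
)
X_train_lemma = lemma_pipe.transform(X_train)

# engineered feature matrix (output of the fitted FeatureUnion step),
# printed as "Feature Space" in the error analysis cells
X_transform = feature_union_pipe.named_steps['feats'].transform(X_train)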

## Top false negative errors

In [None]:
fn_idx = class_errors_df.sort_values('diff', ascending=False).head().index
for idx in fn_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()

## Top false positive errors

In [None]:
fp_idx = class_errors_df.sort_values('diff').head().index
for idx in fp_idx:
    print('Prob:', y_probs[idx])
    print()
    print(X_train.iloc[idx].question1)
    print(X_train.iloc[idx].question2)
    print()
    print('Lemma--------')
    print()
    print(X_train_lemma[idx*2])
    print(X_train_lemma[idx*2+1])
    print()
    print('Feature Space------')
    print(X_transform[idx])
    print('-------------------------------------------')
    print()
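
Both error cells rank rows by a `diff` column. A plausible reading, assumed here because `utils.ground_truth_analysis` is not shown, is that `diff` is the true label minus the predicted probability. Under that reading, the largest values are confident false negatives and the smallest (most negative) are confident false positives. `rank_errors` is a hypothetical helper and the labels and probabilities are made up.

```python
def rank_errors(y_true, y_probs, top=2):
    """Return (false_negative_idx, false_positive_idx) ranked by confidence."""
    # diff = label - predicted probability (assumed definition)
    diffs = [(t - p, i) for i, (t, p) in enumerate(zip(y_true, y_probs))]
    false_negatives = [i for _, i in sorted(diffs, reverse=True)[:top]]
    false_positives = [i for _, i in sorted(diffs)[:top]]
    return false_negatives, false_positives

# made-up labels and predicted probabilities
y_true_demo = [1, 0, 1, 0, 1]
y_probs_demo = [0.9, 0.8, 0.1, 0.2, 0.4]

fn_demo, fp_demo = rank_errors(y_true_demo, y_probs_demo)
print(fn_demo, fp_demo)
```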

## Next Steps

1. The set of false negative and false positive errors appears to differ from that of the XGBoost model.
   * Run an ensemble model with random forest and XGBoost to smooth out the bias.
2. Short questions with numbers and stop words are a common theme among the false positive errors.
   * Generate features that include stop words and numbers.
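
One simple way to build the ensemble from step 1 is soft voting: average the positive-class probabilities of the two models and threshold the result. The sketch below is an illustration of that idea, not the notebook's ensemble code. `soft_vote` is a hypothetical helper and the probabilities are made up. scikit-learn's `VotingClassifier(voting='soft')` offers the same behavior for fitted estimators.

```python
def soft_vote(prob_sets, weights=None):
    """Weighted average of per-model positive-class probabilities."""
    if weights is None:
        weights = [1.0] * len(prob_sets)
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*prob_sets)  # one row of model outputs per sample
    ]

# made-up predict_proba(...)[:, 1] outputs for three samples
rf_probs = [0.9, 0.2, 0.6]
xgb_probs = [0.7, 0.4, 0.2]

avg_probs = soft_vote([rf_probs, xgb_probs])
preds = [int(p >= 0.5) for p in avg_probs]
print(avg_probs, preds)
```

Averaging helps most when the two models make different mistakes, which is what the error comparison in step 1 suggests.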