## Enhanced feature engineering model

This model adds engineered features computed from the original (cleaned) question text, in addition to those computed from the lemmatized question.

Two types of features are engineered:

1. n-gram similarity between the two questions in each pair
2. min/max/avg distance between the words within a single question, currently using the following metrics:
   * euclidean
   * cosine
   * city block (Manhattan)
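
The implementations in `utils` are not shown in this notebook, so the following is only a minimal sketch of what the two feature types could look like. It assumes Jaccard overlap of word n-grams for the pair similarity and uses made-up 2-d word vectors for the distances; the function names here are hypothetical and are not the `utils` API.

```python
import math
from itertools import combinations


def word_ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(q1, q2, n):
    """Jaccard overlap of the two questions' n-gram sets (0.0 if both are empty)."""
    a, b = word_ngrams(q1, n), word_ngrams(q2, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cityblock(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))


def cosine(u, v):
    # Cosine distance = 1 - cosine similarity; assumes non-zero vectors.
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def min_max_avg(vectors, metric):
    """Min, max, and mean of `metric` over all pairs of word vectors."""
    dists = [metric(u, v) for u, v in combinations(vectors, 2)]
    if not dists:  # fewer than two words: no pairs to compare
        return 0.0, 0.0, 0.0
    return min(dists), max(dists), sum(dists) / len(dists)


# Toy 2-d word vectors, made up for illustration only.
toy_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

unigram_sim = ngram_jaccard("how do I learn python", "how can I learn python", n=1)
dist_stats = {m.__name__: min_max_avg(toy_vectors, m)
              for m in (euclidean, cosine, cityblock)}
```

With three n-gram sizes per pair and three statistics for each of three metrics per question, a layout like this would give 3 + 2 × 9 = 21 features per text version, which matches the 42 columns produced further down.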
  
**Pipeline**
1. Stack questions
2. Clean questions (all words are now lowercased so that proper nouns lemmatize better)
3. UNION
    1. n_gram similarity
    2. min/max/avg distance
4. Lemmatize questions
5. UNION
    1. n_gram similarity
    2. min/max/avg distances
6. UNION together both sets of features
7. XGBClassifier

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')
model_name = 'xgb_hypertuned_dup_features'

In [3]:
# text transformation pipes
clean_text = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False))

    ]
)

lemma_text = Pipeline(
    [
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False))
    ]
)

# feature engineering pipes
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pair_question_pipe = Pipeline(
    [
        ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False))
    ]
)

# build features on the cleaned text only
clean_text_features = Pipeline(
    [
        ('clean', clean_text),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# build features on the cleaned and lemmatized text
lemma_text_features = Pipeline(
    [
        ('clean', clean_text),
        ('lemma', lemma_text),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# pre-process pipe
feature_transformation = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_text_features', clean_text_features),
                ('lemma_text_features', lemma_text_features)
            ]
        ))
    ]
)
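
A `FeatureUnion` concatenates the outputs of its transformers column-wise, in the order they are listed. The toy sketch below illustrates that ordering with two stand-in transformers (hypothetical, not the `utils` functions); by the same rule, the output of `feature_transformation` should hold the `clean_text_features` columns first and the `lemma_text_features` columns second.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Two hypothetical stand-in transformers: each maps the input to 2 columns.
first = FunctionTransformer(lambda X: X * 1.0, validate=False)
second = FunctionTransformer(lambda X: X * 10.0, validate=False)

union = FeatureUnion([("first", first), ("second", second)])

X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])
out = union.fit_transform(X_toy)
# Columns 0-1 come from `first`, columns 2-3 from `second`.
```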


In [4]:
%%time
try:
    X_train_transform = utils.load('X_train_transform')
except Exception:  # cache miss (or unreadable cache): rebuild the features
    X_train_transform = feature_transformation.transform(X_train)  # this takes a really long time
    utils.save(X_train_transform, 'X_train_transform')

CPU times: user 32 ms, sys: 64 ms, total: 96 ms
Wall time: 873 ms
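
The cell above follows a load-or-compute caching pattern, which is why it finishes in under a second here. How `utils.load` and `utils.save` store objects is not shown, so this is only a sketch of the same idea using `pickle`; `load_or_compute` is a hypothetical helper. It catches only `FileNotFoundError`, so unrelated failures still surface.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the cached object at `path`, computing and caching it on a miss."""
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        result = compute()
        with open(path, "wb") as fh:
            pickle.dump(result, fh)
        return result


# Usage with a stand-in for the slow transform step.
calls = []


def slow_transform():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


cache_path = os.path.join(tempfile.mkdtemp(), "X_train_transform.pkl")
first = load_or_compute(cache_path, slow_transform)   # computes and writes the cache
second = load_or_compute(cache_path, slow_transform)  # reads the cache
```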


In [5]:
X_train_transform.shape

(303199, 42)

In [7]:
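# NOTE: feature_transformation unions clean_text_features, then lemma_text_features,
# so columns 0-20 may be the clean-text block and 21-41 the lemma-text block rather
# than per-question features; verify the column layout before relying on these names.
first_question_dist_features = X_train_transform[:, :21]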
first_question_dist_features.shape

(303199, 21)

In [8]:
second_question_dist_features = X_train_transform[:, 21:]
second_question_dist_features.shape

(303199, 21)

In [12]:
X_train_transform = np.vstack([X_train_transform, 
                               np.hstack([second_question_dist_features, first_question_dist_features])])

In [13]:
X_train_transform.shape

(606398, 42)

In [21]:
y_train = np.hstack([y_train, y_train])
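
The cells above double the training set by appending a copy of every row with its two column blocks swapped, and then repeat the labels to match. A minimal sketch of the same mechanics on a toy matrix (values made up for illustration):

```python
import numpy as np

# Toy matrix: 2 rows, 4 columns; the first 2 columns stand for block A,
# the last 2 for block B.
X = np.array([[1, 2, 10, 20],
              [3, 4, 30, 40]])
y = np.array([0, 1])

half = X.shape[1] // 2
block_a, block_b = X[:, :half], X[:, half:]

# Append a copy of every row with the two blocks swapped.
X_aug = np.vstack([X, np.hstack([block_b, block_a])])
# Each swapped row keeps the label of the row it was built from.
y_aug = np.hstack([y, y])
```

Row `i` and row `i + n` of the augmented matrix come from the same question pair, which is worth keeping in mind when the data is later split into folds.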

In [22]:
search_cv = utils.load('tuned_models/xgb_hypertune_0.884651')

In [23]:
search_cv.best_params_

{'gamma': 0.1497064614824524,
 'learning_rate': 0.22505353861797678,
 'max_depth': 7,
 'n_estimators': 734,
 'reg_lambda': 0.7046261327596275}

In [24]:
xgb_params = search_cv.best_params_
xgb_params['n_jobs'] = 4
xgb_params['random_state'] = 42
xgb_params

{'gamma': 0.1497064614824524,
 'learning_rate': 0.22505353861797678,
 'max_depth': 7,
 'n_estimators': 734,
 'reg_lambda': 0.7046261327596275,
 'n_jobs': 4,
 'random_state': 42}

In [25]:
xgb = XGBClassifier(**xgb_params)

In [26]:
%%time
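# random_state only takes effect with shuffle=True; recent scikit-learn versions
# raise an error if it is set without shuffling, so it is omitted here.
skf = StratifiedKFold(n_splits=3)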
cv = cross_validate(xgb, 
               X_train_transform, 
               y_train, 
               cv=skf, 
               n_jobs=3, 
               scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'))

CPU times: user 236 ms, sys: 160 ms, total: 396 ms
Wall time: 26min 20s
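
`cross_validate` returns a dict of per-fold arrays keyed as `test_<scorer>` (plus timing entries). `utils.log_scores` is not shown, so the following is only a sketch of how such a dict could be collapsed into avg/std columns like those in the results table below; `summarize_cv` and the fold scores are hypothetical, and the output column names differ slightly from the table's.

```python
from statistics import mean, pstdev


def summarize_cv(cv_results):
    """Collapse per-fold `test_*` scores into avg/std pairs."""
    summary = {}
    for key, scores in cv_results.items():
        if not key.startswith("test_"):
            continue  # skip fit_time / score_time
        name = key[len("test_"):]
        values = [float(s) for s in scores]
        if name.startswith("neg_"):
            # scikit-learn negates losses so that higher is better; undo that here.
            name = name[len("neg_"):]
            values = [-v for v in values]
        summary["avg_" + name] = mean(values)
        summary["std_" + name] = pstdev(values)
    return summary


# Made-up fold scores, for illustration only.
toy_cv = {
    "fit_time": [1.0, 1.1, 0.9],
    "test_roc_auc": [0.80, 0.82, 0.84],
    "test_neg_log_loss": [-0.50, -0.40, -0.45],
}
summary = summarize_cv(toy_cv)
```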


In [27]:
results_df = utils.load('results')

results_df = results_df.drop(index=model_name, errors='ignore')
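# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
# (assumes utils.log_scores returns a DataFrame indexed by model name)
results_df = pd.concat([results_df, utils.log_scores(cv, model_name)])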
results_df.sort_values('avg_auc', ascending=False)

| model | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc | avg_log_loss | std_log_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| xgb_hypertuned | 0.800791 | 0.001007 | 0.73202 | 0.001803 | 0.726379 | 0.001644 | 0.729187 | 0.001261 | 0.884651 | 0.000787 | 0.406161 | 0.00138 |
| xgb_hypertuned_dup_features | 0.78852 | 0.007471 | 0.714135 | 0.01098 | 0.712489 | 0.008232 | 0.713308 | 0.009598 | 0.873554 | 0.006916 | 0.420222 | 0.010238 |
| rf_feat_eng_model_lemma_clean | 0.783667 | 0.00226 | 0.708853 | 0.003681 | 0.702725 | 0.001666 | 0.705774 | 0.002658 | 0.868202 | 0.001148 | 0.436197 | 0.00064 |
| ensemble_rf_xgb | 0.779 | 0.00274 | 0.697794 | 0.004357 | 0.708157 | 0.001912 | 0.702935 | 0.003148 | 0.863334 | 0.001438 | 0.441784 | 0.001107 |
| xgb_feat_eng_incl_nums | 0.76711 | 0.001576 | 0.682213 | 0.002621 | 0.701238 | 0.002695 | 0.69159 | 0.001899 | 0.851957 | 0.001192 | 0.450099 | 0.001675 |
| feat_eng_model_lemma_clean | 0.763927 | 0.002404 | 0.676166 | 0.003904 | 0.692113 | 0.001128 | 0.684044 | 0.002549 | 0.846923 | 0.001643 | 0.456929 | 0.00141 |
| feat_eng_model_lemma_fix | 0.744356 | 0.002107 | 0.664513 | 0.004333 | 0.621357 | 0.000901 | 0.642201 | 0.001609 | 0.822197 | 0.00171 | 0.488131 | 0.001342 |
| feat_eng_model | 0.743614 | 0.002021 | 0.664102 | 0.003502 | 0.6184 | 0.001553 | 0.640434 | 0.002281 | 0.82107 | 0.001428 | 0.489465 | 0.001141 |
| ensemble_rf_xgb_cos_sim | 0.7387 | 0.007359 | 0.66129 | 0.010948 | 0.612827 | 0.009669 | 0.636128 | 0.009994 | 0.819987 | 0.005193 | 0.493703 | 0.003901 |
| lstm_Bidrectional | 0.752968 | 0.0 | 0.702084 | 0.0 | 0.5749 | 0.0 | 0.632158 | 0.0 | 0.80354 | 0.0 | 8.532243 | 0.0 |


In [28]:
utils.save(results_df, 'results')

## Fit entire training data

Cross-validated AUC for this model is 0.87 (0.88 for `xgb_hypertuned` without the duplicated features). We will now fit on the entire training data and then score against the test data.

In [17]:
%%time
xgb.fit(X_train_transform, y_train)
utils.save(xgb, 'xgb_ht_best_model_feature_expansion')

CPU times: user 38min 9s, sys: 200 ms, total: 38min 9s
Wall time: 9min 32s


## Score the test data set

In [19]:
X_test = utils.load('X_test')
y_test = utils.load('y_test')

In [21]:
%%time
X_test_transform = feature_transformation.transform(X_test)

CPU times: user 22min 21s, sys: 2min 15s, total: 24min 36s
Wall time: 7min 40s


In [24]:
test_probs = xgb.predict_proba(X_test_transform)[:, 1]

In [25]:
from sklearn import metrics

In [26]:
metrics.roc_auc_score(y_test, test_probs)

0.8904411591506582

In [27]:
metrics.log_loss(y_test, test_probs)

0.39466975303115776
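
For reference when reading the number above, binary log loss is the mean negative log-likelihood of the true labels under the predicted probabilities, and a model that always predicts 0.5 scores ln 2 ≈ 0.693. The sketch below computes it by hand on toy inputs; `binary_log_loss` is a hypothetical helper, and the clipping constant is an assumption rather than scikit-learn's exact behaviour.

```python
import math


def binary_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under `probs`."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


uninformed = binary_log_loss([1, 0], [0.5, 0.5])  # equals ln 2
confident = binary_log_loss([1, 0], [0.9, 0.1])   # lower, since both guesses lean the right way
```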

## Summary

The test scores (AUC 0.890, log loss 0.395) are close to the cross-validated scores (AUC 0.874, log loss 0.420), so the model should generalize well. Next, the model is retrained on the full data set for the Slack bot app.

In [32]:
X_full_transform = np.vstack([X_train_transform, X_test_transform])
y_full_transform = np.vstack([y_train.reshape(-1, 1), y_test.reshape(-1, 1)])

In [35]:
%%time
xgb.fit(X_full_transform, y_full_transform.reshape(-1,))

CPU times: user 36min 35s, sys: 9.26 s, total: 36min 44s
Wall time: 9min 11s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0.1497064614824524,
       learning_rate=0.22505353861797678, max_delta_step=0, max_depth=7,
       min_child_weight=1, missing=None, n_estimators=734, n_jobs=4,
       nthread=None, objective='binary:logistic', random_state=42,
       reg_alpha=0, reg_lambda=0.7046261327596275, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)

In [37]:
utils.save(xgb, 'xgb_FINAL_model')
utils.save(feature_transformation, 'feature_pipe')
utils.save(X_full_transform, 'X_full_transform')
utils.save(y_full_transform, 'y_full_transform')
utils.save(X_test_transform, 'X_test_transform')