## Enhanced featrue engineer model

This model will add engineered features for the original question, in addition to the lemmatized question.

Engineered two different types of features,

1. n_gram similarity between each pair of questions
2. min/max/avg distance between words in a single question. Currently using the following metrics,
  * euclidean
  * cosine
  * city block or manhattan
  
**Pipeline**
1. Stack questions
2. Clean questions - now lower cases all words to better lemmatize proper nouns
3. UNION
    1. n_gram similarity
    2. min/max/avg distance
4. Lemmatize questions
5. UNION
    1. n_gram similarity
    2. min/max/avg distances
6. UNION together both sets of features
7. XGBClassifier

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold

from xgboost import XGBClassifier

In [51]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')
model_name = 'xgb_hypertuned_dup_features_upsample'
X_train.shape

(303199, 3)

## Up-sample duplicate questions

I have found with implementing the Slack bot, the current model does not do well in prediciting duplicate questions as having the same intent.

The sampling process will be,

1. Randomly sample questions from training data, and create duplicate pairs.
2. Sample enough questions to ensure there is a 50/50 split between the similar and not similar classes. 

In [52]:
# Need to generate this many duplicate samples
dup_samples = (((1 - y_train.mean()) * y_train.shape[0]) - (y_train.mean() * y_train.shape[0])).astype(int)
dup_samples

79305

In [53]:
question_population = pd.concat([X_train.question1, X_train.question2]).drop_duplicates()
dup_questions = pd.concat([question_population.sample(n=dup_samples, random_state=42), 
                           question_population.sample(n=dup_samples, random_state=42)],
                          axis = 1).reset_index(drop=True)
dup_questions['id'] = dup_questions.index+1 * -1
dup_questions = dup_questions.rename(columns={0:'question1', 1:'question2'})
X_train = pd.concat([X_train, dup_questions], sort=False)

Unnamed: 0,id,question1,question2
0,0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...
1,1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...
2,2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...
3,3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...
4,4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?


In [67]:
y_train = np.concatenate([y_train, np.repeat([1], dup_samples)])

In [55]:
# text transformation pipes
clean_text = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False))

    ]
)

lemma_text = Pipeline(
    [
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False))
    ]
)

# feature engineering pipes
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pair_question_pipe = Pipeline(
    [
        ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False))
    ]
)

# build features on the cleaned text only
clean_text_features = Pipeline(
    [
        ('clean', clean_text),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ],
            n_jobs = -1
        ))
    ]
)

# build features on the cleanned and lemmatized text features
lemma_text_features = Pipeline(
    [
        ('clean', clean_text),
        ('lemma', lemma_text),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ],
            n_jobs = -1
        ))
    ]
)

# pre-process pipe
feature_transformation = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_text_features', clean_text_features),
                ('lemma_text_features', lemma_text_features)
            ],
            n_jobs = -1
        ))
    ]
)


In [None]:
%%time
try:
    X_train_upsample = utils.load('X_train_upsample_transform')
except:
    X_train_upsample = feature_transformation.transform(X_train) ## this takes a really long time
    utils.save(X_train_upsample, 'X_train_upsample_transform')

In [59]:
X_train_upsample.shape

(382504, 42)

In [60]:
first_question_dist_features = X_train_upsample[:, :21]
first_question_dist_features.shape

(382504, 21)

In [61]:
second_question_dist_features = X_train_upsample[:, 21:]
second_question_dist_features.shape

(382504, 21)

In [62]:
X_train_upsample = np.vstack([X_train_upsample, 
                               np.hstack([second_question_dist_features, first_question_dist_features])])

X_train_upsample.shape

(765008, 42)

In [68]:
y_train = np.hstack([y_train, y_train])
y_train.shape

(765008,)

In [69]:
search_cv = utils.load('tuned_models/xgb_hypertune_0.884651')

In [70]:
search_cv.best_params_

{'gamma': 0.1497064614824524,
 'learning_rate': 0.22505353861797678,
 'max_depth': 7,
 'n_estimators': 734,
 'reg_lambda': 0.7046261327596275}

In [72]:
xgb_params = search_cv.best_params_
xgb_params['n_jobs'] = 7
xgb_params['random_state'] = 42
xgb_params

{'gamma': 0.1497064614824524,
 'learning_rate': 0.22505353861797678,
 'max_depth': 7,
 'n_estimators': 734,
 'reg_lambda': 0.7046261327596275,
 'n_jobs': 7,
 'random_state': 42}

In [86]:
xgb = XGBClassifier(n_estimators=500, n_jobs=4, random_state=42)

In [87]:
%%time
skf = StratifiedKFold(n_splits=3, random_state=42)
cv = cross_validate(xgb, 
               X_train_upsample, 
               y_train, 
               cv=skf, 
               n_jobs=3, 
               scoring=('accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'neg_log_loss'))

CPU times: user 196 ms, sys: 212 ms, total: 408 ms
Wall time: 7min 32s


In [88]:
results_df = utils.load('results')

results_df = results_df.drop(index=model_name, errors='ignore')
results_df = results_df.append(utils.log_scores(cv, model_name))
results_df.sort_values('avg_auc', ascending=False)

Unnamed: 0,avg_accuracy,std_accuracy,avg_precision,std_precision,avg_recall,std_recall,avg_f1,std_f1,avg_auc,std_auc,avg_log_loss,std_log_loss
xgb_hypertuned,0.800791,0.001007,0.73202,0.001803,0.726379,0.001644,0.729187,0.001261,0.884651,0.000787,0.406161,0.00138
xgb_hypertuned_dup_features,0.78852,0.007471,0.714135,0.01098,0.712489,0.008232,0.713308,0.009598,0.873554,0.006916,0.420222,0.010238
rf_feat_eng_model_lemma_clean,0.783667,0.00226,0.708853,0.003681,0.702725,0.001666,0.705774,0.002658,0.868202,0.001148,0.436197,0.00064
ensemble_rf_xgb,0.779,0.00274,0.697794,0.004357,0.708157,0.001912,0.702935,0.003148,0.863334,0.001438,0.441784,0.001107
xgb_feat_eng_incl_nums,0.76711,0.001576,0.682213,0.002621,0.701238,0.002695,0.69159,0.001899,0.851957,0.001192,0.450099,0.001675
feat_eng_model_lemma_clean,0.763927,0.002404,0.676166,0.003904,0.692113,0.001128,0.684044,0.002549,0.846923,0.001643,0.456929,0.00141
feat_eng_model_lemma_fix,0.744356,0.002107,0.664513,0.004333,0.621357,0.000901,0.642201,0.001609,0.822197,0.00171,0.488131,0.001342
feat_eng_model,0.743614,0.002021,0.664102,0.003502,0.6184,0.001553,0.640434,0.002281,0.82107,0.001428,0.489465,0.001141
ensemble_rf_xgb_cos_sim,0.7387,0.007359,0.66129,0.010948,0.612827,0.009669,0.636128,0.009994,0.819987,0.005193,0.493703,0.003901
lstm_Bidrectional,0.752968,0.0,0.702084,0.0,0.5749,0.0,0.632158,0.0,0.80354,0.0,8.532243,0.0


In [28]:
utils.save(results_df, 'results')

## Fit entire training data

Validation AUC is 0.88. We will now fit on the entire train data to then score against the test data.

In [76]:
xgb_params["n_jobs"] = 7
xgb = XGBClassifier(**xgb_params)
xgb

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0.1497064614824524,
       learning_rate=0.22505353861797678, max_delta_step=0, max_depth=7,
       min_child_weight=1, missing=None, n_estimators=734, n_jobs=7,
       nthread=None, objective='binary:logistic', random_state=42,
       reg_alpha=0, reg_lambda=0.7046261327596275, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)

In [77]:
%%time
xgb.fit(X_train_upsample, y_train)
utils.save(xgb, 'xgb_ht_best_model_question_swapped_upsample')

CPU times: user 1h 17min 50s, sys: 408 ms, total: 1h 17min 50s
Wall time: 11min 13s


In [2]:
xgb = utils.load('xgb_ht_best_model_question_swapped')

## Score the test data set

In [78]:
X_test = utils.load('X_test')
y_test = utils.load('y_test')

In [79]:
%%time
try:
    X_test_transform = utils.load('X_test_transform')
except:
    X_test_transform = feature_transformation.transform(X_test) ## this takes a really long time
    utils.save(X_test_transform, 'X_test_transform')
# X_test_transform = feature_transformation.transform(X_test)

CPU times: user 16 ms, sys: 16 ms, total: 32 ms
Wall time: 220 ms


In [80]:
X_test_transform.shape

(101067, 42)

In [81]:
first_question_dist_features = X_test_transform[:, :21]
first_question_dist_features.shape

second_question_dist_features = X_test_transform[:, 21:]
second_question_dist_features.shape

X_test_transform = np.vstack([X_test_transform, 
                               np.hstack([second_question_dist_features, first_question_dist_features])])

X_test_transform.shape

(202134, 42)

In [82]:
y_test = np.hstack([y_test, y_test])
y_test.shape

(202134,)

In [83]:
test_probs = xgb.predict_proba(X_test_transform)[:, 1]

In [84]:
from sklearn import metrics

In [85]:
metrics.roc_auc_score(y_test, test_probs)

0.5597347300695545

In [11]:
metrics.log_loss(y_test, test_probs)

0.40297918872871447

## Summary

The test score is very similar to the validation score, and thus the model should genearlize well. Next, the full data set will be trained for the Slack bot app.

In [33]:
X_full_transform = np.vstack([X_train_transform, X_test_transform])
y_full_transform = np.hstack([y_train, y_test])
print('X_full shape:', X_full_transform.shape)
print('y_full shape:', y_full_transform.shape)

X_full shape: (808532, 42)
y_full shape: (808532,)


In [34]:
%%time
xgb.fit(X_full_transform, y_full_transform)

CPU times: user 1h 27min 24s, sys: 536 ms, total: 1h 27min 25s
Wall time: 12min 39s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0.1497064614824524,
       learning_rate=0.22505353861797678, max_delta_step=0, max_depth=7,
       min_child_weight=1, missing=None, n_estimators=734, n_jobs=7,
       nthread=None, objective='binary:logistic', random_state=42,
       reg_alpha=0, reg_lambda=0.7046261327596275, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)

In [35]:
utils.save(xgb, 'xgb_FINAL_model_question_swapped')

In [37]:
utils.save(feature_transformation, 'feature_pipe')
utils.save(X_full_transform, 'X_full_transform')
utils.save(y_full_transform, 'y_full_transform')
utils.save(X_test_transform, 'X_test_transform')