## Hyperparameter tuning - RF model

The feature space described below, combined with a `RandomForestClassifier`, gives the best validation AUC of all the models tried so far. We will now use `BayesSearchCV` to tune the classifier's hyperparameters on this feature space.

Two types of features were engineered:

1. n-gram similarity between the two questions in each pair
2. min/max/avg distance between the words within a single question, currently computed with the following metrics:
  * euclidean
  * cosine
  * city block or manhattan
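
To make the two feature families concrete, here is a minimal pure-Python sketch. It is not the code in `utils`: the Jaccard overlap used for n-gram similarity is an assumption (the notebook does not say which measure `utils.calc_ngram_similarity` uses), and the 2-d word vectors are made-up toy values standing in for real embeddings.

```python
import math
from itertools import combinations


def ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(q1, q2, n):
    """Jaccard overlap of the two n-gram sets (an assumed similarity measure)."""
    a, b = ngrams(q1, n), ngrams(q2, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cityblock(u, v):
    return float(sum(abs(x - y) for x, y in zip(u, v)))


def cosine(u, v):
    """Cosine distance; undefined for all-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def min_max_avg_distance(word_vectors, metric):
    """Min, max and mean pairwise distance between the word vectors of one question."""
    dists = [metric(u, v) for u, v in combinations(word_vectors, 2)]
    if not dists:  # fewer than two words: no pairs to compare
        return 0.0, 0.0, 0.0
    return min(dists), max(dists), sum(dists) / len(dists)


# Toy 2-d "embeddings" for a three-word question (made-up values).
vectors = [(1.0, 0.0), (4.0, 4.0), (1.0, 4.0)]
print(ngram_similarity("how do i learn python", "how do i learn java", n=1))
print(min_max_avg_distance(vectors, euclidean))  # (3.0, 5.0, 4.0)
print(min_max_avg_distance(vectors, cosine))
```

Each question contributes three numbers per metric, so three metrics give nine single-question features, while each value of `n` gives one pair-level similarity.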
  
**Pipeline**
1. Stack questions
2. Clean questions - now lowercases all words so that proper nouns lemmatize better
3. UNION
    1. n_gram similarity
    2. min/max/avg distance
4. Lemmatize questions
5. UNION
    1. n_gram similarity
    2. min/max/avg distances
6. UNION together both sets of features
7. Random Forest
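
The nested UNION steps can be sketched with scikit-learn's `FeatureUnion`, which concatenates the column outputs of its branches. The transformer functions below (`lowercase`, `strip_plural`, `word_count`, `char_count`) are hypothetical toy stand-ins for the real `utils` functions, chosen only so the example runs on two short strings.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def lowercase(questions):
    """Toy stand-in for the cleaning step."""
    return np.array([q.lower() for q in questions])


def strip_plural(questions):
    """Crude stand-in for lemmatization: drop trailing 's' characters from each word."""
    return np.array([" ".join(w.rstrip("s") for w in q.split()) for q in questions])


def word_count(questions):
    return np.array([[len(q.split())] for q in questions])


def char_count(questions):
    return np.array([[len(q)] for q in questions])


def make_features():
    """Build a fresh union of the two toy feature extractors."""
    return FeatureUnion([
        ("words", FunctionTransformer(word_count, validate=False)),
        ("chars", FunctionTransformer(char_count, validate=False)),
    ])


# Branch 1: features on the cleaned text.
clean_branch = Pipeline([
    ("clean", FunctionTransformer(lowercase, validate=False)),
    ("feats", make_features()),
])

# Branch 2: the same features on the cleaned, then "lemmatized" text.
lemma_branch = Pipeline([
    ("clean", FunctionTransformer(lowercase, validate=False)),
    ("lemma", FunctionTransformer(strip_plural, validate=False)),
    ("feats", make_features()),
])

# Outer union: columns from both branches side by side.
all_features = FeatureUnion([("clean", clean_branch), ("lemma", lemma_branch)])

X = all_features.fit_transform(["Cats chase dogs", "Why"])
print(X.shape)  # (2, 4): two features from each branch
print(X)
```

The column order follows the order of the branches, so the first half of the matrix comes from the cleaned text and the second half from the lemmatized text.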

**Changes**
* Fix `n_estimators` at 500 and search the other parameters

In [1]:
# data manipulation
import utils
import pandas as pd
import numpy as np

# modeling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# parameter search
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

## Load data

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')
model_name = 'rf_hypertune'

## Text transformation and feature engineering pipes

In [3]:
# text transformation pipes
clean_text = Pipeline(
    [
        ('stack', FunctionTransformer(utils.stack_questions, validate=False)),
        ('clean', FunctionTransformer(utils.clean_questions, validate=False))

    ]
)

lemma_text = Pipeline(
    [
        ('lemma', FunctionTransformer(utils.apply_lemma, validate=False))
    ]
)

# feature engineering pipes
single_question_pipe = Pipeline(
    [
        ('dist', FunctionTransformer(utils.add_min_max_avg_distance_features, validate=False)),
        ('unstack', FunctionTransformer(utils.unstack_questions, validate=False))
    ]
)

pair_question_pipe = Pipeline(
    [
        ('ngram_sim', FunctionTransformer(utils.calc_ngram_similarity, kw_args={'n_grams':[1, 2, 3]}, validate=False))
    ]
)

# build features on the cleaned text only
clean_text_features = Pipeline(
    [
        ('clean', clean_text),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# build features on the cleaned and lemmatized text
lemma_text_features = Pipeline(
    [
        ('clean', clean_text),
        ('lemma', lemma_text),
        ('feats', FeatureUnion(
            [
                ('pair', pair_question_pipe),
                ('single', single_question_pipe)
            ]
        ))
    ]
)

# pre-process pipe
feature_transformation = Pipeline(
    [
        ('feats', FeatureUnion(
            [
                ('clean_text_features', clean_text_features),
                ('lemma_text_features', lemma_text_features)
            ]
        ))
    ]
)


In [4]:
%%time
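try:
    X_train_transform = utils.load('X_train_transform')
except Exception:  # no cached copy yet, so compute the features and cache them
    # fit_transform is equivalent to transform here because every step is a stateless FunctionTransformer
    X_train_transform = feature_transformation.fit_transform(X_train)  # this takes a really long time
    utils.save(X_train_transform, 'X_train_transform')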

CPU times: user 68 ms, sys: 28 ms, total: 96 ms
Wall time: 453 ms
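
The 453 ms wall time suggests the cached copy was loaded rather than recomputed. The same load-or-compute pattern can be sketched with the standard library alone. `load_or_compute` and `expensive_transform` are hypothetical names for illustration, and pickle is only an assumed storage format; the notebook does not show how `utils.load` and `utils.save` are implemented.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it on a cache miss."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def expensive_transform():
    calls.append(1)  # record each real computation
    return [[0.5, 1.0], [0.25, 0.0]]


cache_file = Path(tempfile.mkdtemp()) / "X_train_transform.pkl"
first = load_or_compute(cache_file, expensive_transform)   # computes and saves
second = load_or_compute(cache_file, expensive_transform)  # served from disk
print(len(calls), first == second)  # 1 True
```

Checking for the file explicitly, instead of catching every exception, keeps unrelated errors (for example a corrupt cache file) visible.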


## Configure the search

In [16]:
'''
{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
'''


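# random_state only matters when shuffle=True (recent scikit-learn versions reject it otherwise);
# unshuffled folds are already deterministic
skf = StratifiedKFold(n_splits=3)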

# fixed params
rf_params = {
#     'n_estimators': 584,
    'n_jobs': 4,
    'random_state': 42,
    'verbose': 0
}

# tuning parameters -- n_estimators starts at 500, which is already known to give a very good AUC
rf_search_params = {
    'min_samples_split': Integer(2, 10),
    'min_samples_leaf': Integer(1, 4),
    'n_estimators': Integer(500, 2000),
    'max_depth': Integer(10, 100),
    'bootstrap': Categorical([True, False])
}

bayes_params = {
    'estimator': RandomForestClassifier(**rf_params),
    'scoring': 'roc_auc',
    'search_spaces': rf_search_params,
    'n_iter': 50,
    'cv': skf,
    'n_jobs': 3
}

search_cv = BayesSearchCV(**bayes_params)
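
To put the search budget in perspective, the arithmetic below compares an exhaustive search over the reference grid noted at the top of this cell with the 50 iterations given to `BayesSearchCV`. It only counts model fits; it says nothing about which approach finds better parameters.

```python
from math import prod

# Reference grid copied from the note at the top of the configuration cell.
reference_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    "max_features": ["auto", "sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}
n_splits = 3  # StratifiedKFold(n_splits=3)
n_iter = 50   # BayesSearchCV n_iter

grid_candidates = prod(len(values) for values in reference_grid.values())
grid_fits = grid_candidates * n_splits
bayes_fits = n_iter * n_splits

print(grid_candidates, grid_fits, bayes_fits)  # 3960 11880 150
```

At roughly 250 to 300 seconds per fit, as reported in the `mean_fit_time` column of the results, the full grid would be far out of reach, which is the motivation for a sampled search.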

## Callbacks

In [42]:
def print_score_progress(optim_results):
    ''' Prints the best score, current score, and current iteration
    '''
    current_results = pd.DataFrame(search_cv.cv_results_)
    print(f'Iteration: {current_results.shape[0]}')
    print(f'Current AUC: {current_results.tail(1).mean_test_score.values[0]:.6f}')
    print(f'Best AUC: {search_cv.best_score_:.6f}')
    print()

def save_best_estimator(optim_results):
    ''' Saves best estimator
    '''
    current_results = pd.DataFrame(search_cv.cv_results_)
    best_score = search_cv.best_score_
    current_score = current_results.tail(1).mean_test_score.values[0]
    if current_score == best_score:
        model = f'tuned_models/{model_name}_{best_score:.6f}'
        print(f'Saving: {model}')
        print()
        utils.save(search_cv, model)

In [43]:
%%time
search_cv_results = search_cv.fit(X_train_transform, y_train, callback=[print_score_progress, save_best_estimator])

Iteration: 1
Current AUC: 0.872195
Best AUC: 0.872195

Saving: tuned_models/rf_hypertune_0.872195

Iteration: 2
Current AUC: 0.869368
Best AUC: 0.872195

Iteration: 3
Current AUC: 0.856825
Best AUC: 0.872195

Iteration: 4
Current AUC: 0.865939
Best AUC: 0.872195

Iteration: 5
Current AUC: 0.871559
Best AUC: 0.872195

Iteration: 6
Current AUC: 0.865283
Best AUC: 0.872195

Iteration: 7
Current AUC: 0.865462
Best AUC: 0.872195

Iteration: 8
Current AUC: 0.871194
Best AUC: 0.872195

KeyboardInterrupt: 
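
The run above was stopped by hand. As far as I know, scikit-optimize ends a search early when any callback returns `True`, so a time budget can be expressed as one more callback (the library also ships `skopt.callbacks.DeadlineStopper` for this purpose). The `TimeBudget` class below is a hypothetical, dependency-free sketch of that idea, exercised here with a fake clock instead of a real search.

```python
import time


class TimeBudget:
    """Callback that asks the search to stop once `seconds` have elapsed."""

    def __init__(self, seconds, clock=time.monotonic):
        self.seconds = seconds
        self.clock = clock
        self.started = clock()  # the budget starts when the callback is created

    def __call__(self, optim_result=None):
        # Returning True signals "stop"; returning False lets the search continue.
        return (self.clock() - self.started) >= self.seconds


# Simulated run: a fake clock that advances 300 s between calls.
ticks = iter(range(0, 3000, 300))
budget = TimeBudget(seconds=900, clock=lambda: next(ticks))
stopped_at = next(i for i in range(1, 11) if budget())
print(stopped_at)  # 3: the third iteration is the first to reach 900 s
```

In the notebook it could be appended to the existing list, for example `callback=[print_score_progress, save_best_estimator, TimeBudget(3600)]`, so that the best model so far is still saved before the search stops.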

In [32]:
pd.DataFrame(search_cv_results.cv_results_).sort_values('mean_test_score', ascending=False)

| | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_min_samples_leaf | param_min_samples_split | param_n_estimators | params |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.865546 | 0.863866 | 0.866661 | 0.865357 | 0.001149 | 1 | 299.439168 | 0.613496 | 6.363648 | 0.003194 | 2 | 8 | 692 | {'min_samples_leaf': 2, 'min_samples_split': 8... |
| 2 | 0.864886 | 0.863294 | 0.866188 | 0.864789 | 0.001184 | 1 | 256.508619 | 0.726135 | 5.458807 | 0.001653 | 2 | 9 | 596 | {'min_samples_leaf': 2, 'min_samples_split': 9... |
| 1 | 0.863357 | 0.861639 | 0.864745 | 0.863247 | 0.001271 | 1 | 263.271633 | 5.00294 | 5.527277 | 0.050183 | 4 | 9 | 619 | {'min_samples_leaf': 4, 'min_samples_split': 9... |


In [28]:
search_cv_results.best_estimator_.get_params()  # AUC 0.868429

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 584,
 'n_jobs': 4,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

Best `n_estimators`: 584 (**AUC 0.868429**). The full parameter set is the one printed above.

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
X_t, X_v, y_t, y_v = train_test_split(X_train_transform, y_train, stratify=y_train, random_state=42, test_size = 0.33)

In [36]:
rf = RandomForestClassifier(n_estimators=500, n_jobs=4, random_state=42, verbose=1)

In [37]:
rf.fit(X_t, y_t)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   18.3s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:  1.3min
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:  3.1min
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:  3.5min finished


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=4,
            oob_score=False, random_state=42, verbose=1, warm_start=False)

In [38]:
y_v_probs = rf.predict_proba(X_v)[:, 1]

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    2.0s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    4.4s
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:    4.9s finished


In [39]:
from sklearn import metrics

In [40]:
metrics.roc_auc_score(y_v, y_v_probs)

0.8689592428199138
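
For intuition about the number above: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on four made-up examples. It is quadratic in the number of samples, so it is meant for understanding the metric, not as a replacement for `metrics.roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Read this way, a validation AUC of about 0.869 means the forest ranks a duplicate pair above a non-duplicate pair roughly 87% of the time.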