https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

# Tuning Hyperparameters

Each estimator (linear regression, nearest neighbors, support vector machines, XGBoost, etc.) has its own set of hyperparameters to select and tune. [Further Reading](https://scikit-learn.org/stable/modules/grid_search.html)

A search consists of:
1. An estimator/model
2. A parameter space
3. A method for searching or sampling candidates
4. A cross-validation scheme (see [k-fold cross-validation](https://machinelearningmastery.com/k-fold-cross-validation/))
5. A score function
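
The five components above map directly onto the arguments of a scikit-learn search object. The following is a minimal sketch under stated assumptions: the data is synthetic, `DecisionTreeRegressor` is a lightweight stand-in for the XGBoost model used later, and the names `X_demo`, `y_demo`, and `demo_search` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the notebook's dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

estimator = DecisionTreeRegressor(random_state=0)             # 1. estimator/model
param_space = {"max_depth": [1, 2, 3, 4]}                     # 2. parameter space
cv_scheme = KFold(n_splits=5, shuffle=True, random_state=0)   # 4. cross-validation scheme
score = "neg_mean_squared_error"                              # 5. score function

# 3. search method: GridSearchCV evaluates every candidate in the space.
demo_search = GridSearchCV(estimator, param_space, cv=cv_scheme, scoring=score)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_, demo_search.best_score_)
```

scikit-learn scorers follow a "higher is better" convention, which is why the mean squared error is passed in negated form.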

We search the parameter space for the combination that yields the best model. "Better" is quantified by a score such as **mean squared error** (for regression) or **accuracy** (for classification). [Further Reading](https://scikit-learn.org/stable/modules/model_evaluation.html)
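
To make the two scores concrete, here is a small worked example on hand-written values. The arrays are made up for illustration; only `mean_squared_error` and `accuracy_score` come from scikit-learn.

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Regression: average of the squared differences between truth and prediction.
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.0, 5.0, 4.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)  # (1 + 0 + 4) / 3

# Classification: fraction of labels predicted correctly.
y_true_cls = [1, 0, 1, 1]
y_pred_cls = [1, 0, 0, 1]
acc = accuracy_score(y_true_cls, y_pred_cls)  # 3 of 4 correct

print(mse, acc)
```

A lower mean squared error is better, whereas a higher accuracy is better, so the direction of "better" depends on the score chosen.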

In [28]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
data = pd.read_csv("https://raw.githubusercontent.com/CodeOp-tech/DA-ML-regression-predictive/master/datasets/scores_synth.csv?token=AHM3F3AXEA6FMNHMZGL3TH3APGLKC")
data

| Unnamed: 0 | income | score | internet_connection |
|---|---|---|---|
| 0 | 69.454075 | 635.305372 | 1 |
| 1 | 47.632800 | 743.301322 | 1 |
| 2 | 22.905094 | 673.037833 | 1 |
| 3 | 4.465032 | 442.894112 | 0 |
| 4 | 19.360381 | 627.178633 | 1 |
| ... | ... | ... | ... |
| 995 | 8.801915 | 464.993872 | 0 |
| 996 | 15.317348 | 641.288260 | 1 |
| 997 | 25.411924 | 641.858088 | 1 |
| 998 | 4.898013 | 447.408180 | 0 |


In [29]:
from sklearn.model_selection import train_test_split
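# Hold out 40% of the rows as a test set. No random_state is set, so the split differs between runs.
X_train, X_test, y_train, y_test = train_test_split(data[['income']], data['score'], test_size=0.4, shuffle=True)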

In [30]:
from xgboost import XGBRegressor

xgb = XGBRegressor()
xgb.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

## Exhaustive Grid Search

In [31]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

In [32]:
# Candidate tree depths to try: 1, 2, 3, 4
param_grid = {'max_depth': list(range(1, 5))}

In [33]:
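# No `scoring` is given, so candidates are ranked by the estimator's own
# score method (R^2 for a regressor), using the default 5-fold cross-validation.
# refit=True retrains the best candidate on the whole training set.
gsearch = GridSearchCV(XGBRegressor(),
                       param_grid,
                       refit=True,
                       verbose=3,
                       n_jobs=-1)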


In [34]:
gsearch.fit(X_train, y_train) 


Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of  20 | elapsed:    0.0s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done  11 out of  20 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    0.2s finished


GridSearchCV(estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, gamma=None,
                                    gpu_id=None, importance_type='gain',
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_constraints=None,
                                    n_estimators=100, n_jobs=None,
                                    num_parallel_tree=None, random_state=None,
                                    reg_alpha=None, reg_lambda=None,
                                    scale_pos_weight=None, subsample=None,
                                    tree_method=None, validate_parameters=None,
      

In [42]:
print(gsearch.best_params_) 


{'max_depth': 1}


In [43]:
losses = {}
losses['GridSearch'] = mean_squared_error(y_test, gsearch.predict(X_test))
print(losses)

{'GridSearch': 3516.356934158832}
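
Because the search above does not set `scoring`, the candidates are ranked by R^2 rather than by the mean squared error reported on the test set. The sketch below shows one way to rank by mean squared error directly and to inspect the per-candidate results through `cv_results_`. It rests on assumptions: the data is synthetic, `DecisionTreeRegressor` stands in for `XGBRegressor`, and the names `X_syn`, `y_syn`, `mse_search`, and `results` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic single-feature regression data (illustration only).
rng = np.random.default_rng(1)
X_syn = rng.uniform(0, 70, size=(300, 1))
y_syn = 400 + 60 * np.log1p(X_syn[:, 0]) + rng.normal(0, 30, size=300)

mse_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),  # stand-in estimator
    {"max_depth": list(range(1, 5))},       # same grid shape as above
    scoring="neg_mean_squared_error",       # rank candidates by (negated) MSE
    cv=5,
)
mse_search.fit(X_syn, y_syn)

# One row per candidate; flip the sign to read the score as a plain MSE.
results = pd.DataFrame(mse_search.cv_results_)[
    ["param_max_depth", "mean_test_score", "rank_test_score"]
].copy()
results["mean_cv_mse"] = -results["mean_test_score"]
print(results.sort_values("rank_test_score"))
```

Looking at the full table, instead of only `best_params_`, shows how close the candidates are to each other, which helps judge whether the chosen depth is a clear winner.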


## Randomized Search

In [44]:
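from sklearn.model_selection import RandomizedSearchCV

# Note: param_grid holds only 4 candidates, fewer than the default n_iter=10,
# so this search evaluates all of them (the log reports 4 candidates)
# and ends up equivalent to the grid search above.
rsearch = RandomizedSearchCV(XGBRegressor(),
                             param_grid,
                             refit=True,
                             verbose=3,
                             n_jobs=-1)
rsearch.fit(X_train, y_train)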


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of  20 | elapsed:    0.0s remaining:    0.4s


Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Done  11 out of  20 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  18 out of  20 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    0.2s finished


RandomizedSearchCV(estimator=XGBRegressor(base_score=None, booster=None,
                                          colsample_bylevel=None,
                                          colsample_bynode=None,
                                          colsample_bytree=None, gamma=None,
                                          gpu_id=None, importance_type='gain',
                                          interaction_constraints=None,
                                          learning_rate=None,
                                          max_delta_step=None, max_depth=None,
                                          min_child_weight=None, missing=nan,
                                          monotone_constraints=None,
                                          n_estimators=100, n_jobs=None,
                                          num_parallel_tree=None,
                                          random_state=None, reg_alpha=None,
                                          reg_lambda=None,
     

In [45]:
print(rsearch.best_params_) 


{'max_depth': 1}


In [46]:
losses['RandomizedSearch'] = mean_squared_error(y_test, rsearch.predict(X_test))
print(losses)

{'GridSearch': 3516.356934158832, 'RandomizedSearch': 3516.356934158832}
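
The two losses are identical because the randomized search was given the same four-value grid and therefore tried every candidate. Randomized search starts to differ when the space is larger than the sampling budget, for example when parameters are drawn from distributions. The following is a hedged sketch of that setup, not the notebook's method: the data is synthetic, `GradientBoostingRegressor` stands in for `XGBRegressor`, and the ranges and the names `X_syn`, `y_syn`, `param_distributions`, and `rand_search` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic single-feature regression data (illustration only).
rng = np.random.default_rng(2)
X_syn = rng.uniform(0, 70, size=(300, 1))
y_syn = 400 + 60 * np.log1p(X_syn[:, 0]) + rng.normal(0, 30, size=300)

# Distributions instead of fixed lists: each candidate is sampled at random.
param_distributions = {
    "max_depth": randint(1, 6),            # integers 1..5 (upper bound is exclusive)
    "learning_rate": uniform(0.01, 0.3),   # floats in [0.01, 0.31] (loc, scale)
}

rand_search = RandomizedSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),  # stand-in estimator
    param_distributions,
    n_iter=8,                              # sampling budget: 8 candidates
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=0,                        # makes the sampled candidates repeatable
)
rand_search.fit(X_syn, y_syn)

print(rand_search.best_params_)
```

The cost of a randomized search is set by `n_iter` rather than by the size of the space, so adding more parameters does not multiply the number of fits the way it does in a grid search.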
