## XGBoost Grid and Random Search: Hypertuning Multiple Parameters Simultaneously
### GridSearchCV()
    -Search exhaustively over a given grid of hyperparameter values, training and evaluating one model per combination of values
    -Number of models = the product of the number of distinct values supplied for each hyperparameter
    -Pick the final model's hyperparameter values as those that give the best cross-validated evaluation metric
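
The counting rule above can be checked by hand. The following is a minimal sketch using only the standard library, not scikit-learn's own implementation; it uses the same grid and the same `cv = 4` as the grid search example in this section, and the variable names `cv_folds`, `n_candidates`, `candidates`, and `n_fits` are illustrative.

```python
from itertools import product
from math import prod

# Same grid and fold count as the grid search example in this section.
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5],
}
cv_folds = 4

# Number of candidates = product of the number of values per hyperparameter.
n_candidates = prod(len(values) for values in gbm_param_grid.values())

# Enumerate every combination an exhaustive search would have to try.
names = list(gbm_param_grid)
candidates = [dict(zip(names, combo))
              for combo in product(*gbm_param_grid.values())]

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", n_fits, "fits")
```

With 2 x 1 x 2 values this gives 4 candidates and 16 cross-validation fits, which agrees with the "Fitting 4 folds for each of 4 candidates, totalling 16 fits" line in the output.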

In [8]:
# May be required on Windows if importing xgboost throws errors
# import os
# mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-7.2.0-posix-seh-rt_v5-rev1\\mingw64\\bin'
# os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']

In [9]:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.model_selection import GridSearchCV

In [10]:
# Load the Ames, Iowa housing dataset from DataCamp's AWS URL
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

In [11]:
# Split into features (all columns but the last) and target (last column): X, y
X, y = housing_data.iloc[:,:-1], housing_data.iloc[:,-1]

In [12]:
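# Create your housing DMatrix: housing_dmatrix
# (not used below: GridSearchCV works with the scikit-learn wrapper XGBRegressor and takes X, y directly)
housing_dmatrix = xgb.DMatrix(data=X, label=y)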

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, param_grid = gbm_param_grid, scoring='neg_mean_squared_error', cv = 4, verbose = 1)

# Fit grid_mse to the data
grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
# (best_score_ is the negated MSE, so take its absolute value before the square root)
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

Fitting 4 folds for each of 4 candidates, totalling 16 fits
Best parameters found:  {'colsample_bytree': 0.3, 'max_depth': 5, 'n_estimators': 50}
Lowest RMSE found:  29655.3369735


[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:    0.8s finished


### RandomizedSearchCV
    -Create a (possibly infinite) range of values for each hyperparameter that you would like to search over
    -Set the number of iterations the random search should run for
    -During each iteration, randomly draw a value from the specified range for each hyperparameter being searched, then train and evaluate a model with those hyperparameters
    -After reaching the maximum number of iterations, select the hyperparameter configuration with the best evaluated score

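GridSearchCV can be very time-consuming because it fits every combination in the grid. When compute is limited (for example, in the absence of GPUs), RandomizedSearchCV may be a better option, since it evaluates only a fixed number of sampled combinations. In terms of the API, the key differences are that you pass a param_distributions argument instead of param_grid, and that you set n_iter, the number of combinations to sample.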
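
The four steps listed above can be sketched in plain Python. This is only an illustration of the idea, not how RandomizedSearchCV is implemented: `random_search` and `toy_score` are hypothetical names, `toy_score` is an arbitrary stand-in for a cross-validated score, and this sketch draws with replacement for simplicity.

```python
import random

def random_search(param_distributions, score_fn, n_iter, seed=0):
    """Draw n_iter random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float('-inf')
    for _ in range(n_iter):
        # Randomly draw one value per hyperparameter.
        params = {name: rng.choice(list(values))
                  for name, values in param_distributions.items()}
        # Train/evaluate a model with those hyperparameters (stubbed out here).
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Hypothetical score (higher is better); the peak at max_depth == 7 is arbitrary.
    return -abs(params['max_depth'] - 7)

# Same search space shape as the example in this section.
space = {'n_estimators': [25], 'max_depth': range(2, 12)}
best_params, best_score = random_search(space, toy_score, n_iter=5)
print(best_params, best_score)
```

In a real search the stubbed score would be replaced by a cross-validated metric such as the negated mean squared error used in this section.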

In [13]:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

In [14]:
# Load the Ames, Iowa housing dataset from DataCamp's AWS URL
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

In [15]:
# Split into features (all columns but the last) and target (last column): X, y
X, y = housing_data.iloc[:,:-1], housing_data.iloc[:,-1]

In [16]:
# Create the parameter grid: gbm_param_grid
# (RandomizedSearchCV draws n_iter combinations from these values)
gbm_param_grid = {
    'n_estimators': [25],
    'max_depth': range(2, 12)
}

# Instantiate the regressor: gbm
# (n_estimators=10 is overridden by the search, which sets n_estimators=25)
gbm = xgb.XGBRegressor(n_estimators=10)

# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(estimator = gbm, param_distributions = gbm_param_grid, 
                                    n_iter = 5, scoring = 'neg_mean_squared_error', cv = 4, verbose = 1)

# Fit randomized_mse to the data
randomized_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))


Fitting 4 folds for each of 5 candidates, totalling 20 fits
Best parameters found:  {'n_estimators': 25, 'max_depth': 6}
Lowest RMSE found:  36909.9821397


[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    1.7s finished


### Limitations to Grid and Random Search (Hypertuning XGBoost Parameters)

### Grid Search
The number of models to build grows exponentially with each additional hyperparameter, because it is the product of the number of values tried for each one.
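
To make the growth concrete, here is a small sketch that assumes, purely for illustration, that every hyperparameter is given 3 candidate values. The names `values_per_param` and `grid_sizes` are hypothetical.

```python
# Assumption for illustration: every hyperparameter has 3 candidate values.
values_per_param = 3

# Grid size when tuning 1, 2, ..., 6 hyperparameters at once.
grid_sizes = [values_per_param ** n_params for n_params in range(1, 7)]

for n_params, n_models in enumerate(grid_sizes, start=1):
    print(n_params, "hyperparameters ->", n_models, "models per CV fold")
```

Going from 1 to 6 hyperparameters takes the grid from 3 to 729 models, and each of those is fitted once per cross-validation fold.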

### Random Search
The parameter space to explore can be massive.
Randomly jumping around the space looking for a "best" result becomes a waiting game, with no guarantee that the sampled configurations include a good one.
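
A rough way to quantify the waiting game is to ask what fraction of a finite search space a given number of iterations can cover. The sketch below assumes the space is a finite set of candidates sampled without replacement (which is, to my understanding, how RandomizedSearchCV behaves when every entry in param_distributions is a list of values); under that assumption the same fraction is also the chance that the single best candidate gets drawn. The function name `coverage` is hypothetical.

```python
def coverage(n_iter, n_candidates):
    """Fraction of a finite search space visited by n_iter distinct draws."""
    return min(n_iter, n_candidates) / n_candidates

# The example in this section: 10 max_depth values, n_iter = 5.
small_space = coverage(5, 10)

# A hypothetical large space with one million combinations, same n_iter.
large_space = coverage(5, 10**6)

print(small_space, large_space)
```

In the small example, half the space is visited; in the hypothetical million-candidate space, the same five iterations visit five in a million, so many more iterations are needed before a near-best configuration is likely to appear.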