# A probability trick

Let's say that we have this grid. It has 100 cells, so 100 different models: 10 different values for each of two hyperparameters. Let us say that 5 of these models are the best, highlighted in green. How many models would we need to run with random search to have a 95% chance of getting at least one of the green squares?

![image](image.png)


Let's consider how likely it is that we continue to completely miss the good models if we select hyperparameter combinations uniformly at random, treating each trial as an independent draw.

- On our first trial we have a 5% chance of getting one of these squares, as it is 5 squares out of 100. Therefore we have a `(1 - 0.05)` chance of missing these squares.
- If we do a second trial, we now have a `(1 - 0.05) * (1 - 0.05)` chance of missing them both times.
- For a third trial we have a `(1 - 0.05) * (1 - 0.05) * (1 - 0.05)` chance of missing them every time.

In fact, with n trials we have a `(1-0.05)^n` chance that every single trial misses all the good models.

So how many trials do we need to have a high chance of landing in the good region?

- We know that the probability of missing everything is `(1-0.05)^n`.
- So the probability of getting at least one model in that area must be `1-(miss everything)`, which is `1-(1-0.05)^n`.

Requiring this to be at least 0.95 gives `(1-0.05)^n <= 0.05`, so `n >= ln(0.05) / ln(0.95) ≈ 58.4`. Therefore, the answer is `n>=59`.
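The same threshold can be checked numerically. The sketch below assumes independent, uniform draws, as in the derivation above; the function name `trials_needed` is just an illustrative choice.

```python
import math

def trials_needed(p_good: float, target: float) -> int:
    """Smallest n with 1 - (1 - p_good)**n >= target, assuming independent uniform draws."""
    return math.ceil(math.log(1 - target) / math.log(1 - p_good))

n = trials_needed(p_good=0.05, target=0.95)
print(n)

# Probability of hitting at least one good model with 58 and with n trials
print(1 - (1 - 0.05) ** 58)   # just under 0.95
print(1 - (1 - 0.05) ** n)    # just over 0.95
```

Note that the required number of trials depends only on the fraction of good cells (5%), not on the size of the grid.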

# Random Search in Scikit-Learn

Steps:

1. Choose an algorithm whose hyperparameters will be tuned (sometimes called an 'estimator')
2. Define which hyperparameters we will tune
3. Define a range of values for each hyperparameter
4. Set a cross-validation scheme
5. Define a score function so we can decide which square on our grid was 'the best'
6. Include extra useful information or functions
7. Decide how many samples to take (then sample)

## Key differences from Grid Search
- `n_iter` is the number of samples for the random search to take from your grid.
- `param_distributions` is slightly different from `param_grid`: it optionally lets you set a distribution to sample from for each hyperparameter, instead of a fixed list of values.
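As a rough illustration of that second point, the sketch below draws a few candidate settings from continuous and integer distributions instead of fixed lists. It uses `scipy.stats.uniform` and `scipy.stats.randint` together with scikit-learn's `ParameterSampler`; the dictionary name `param_dist`, the ranges, and the seed are assumptions chosen to mirror the ranges used later in this notebook.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical distributions (illustrative ranges, not tuned recommendations)
param_dist = {
    'learning_rate': uniform(loc=0.1, scale=1.9),   # continuous on [0.1, 2.0]
    'min_samples_leaf': randint(20, 65),            # integers from 20 to 64
}

# Draw 5 candidate settings from the distributions
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=42))
for s in samples:
    print(s)
```

A dictionary like `param_dist` could be passed as `param_distributions` to `RandomizedSearchCV` in place of lists of values.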

In [1]:
import pandas as pd

# Load dataset
credits = pd.read_csv('datasets/credit-card-full.csv')
credits.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [2]:
# Features and labels
X = credits.drop(['ID','default payment next month'], axis=1)
y = credits['default payment next month']

# Split into Train-Test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size=0.3,
                                                   random_state=42)

In [3]:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score

# Create the parameter grid
param_grid = {'learning_rate': np.linspace(0.1,2,150), 'min_samples_leaf': list(range(20,65))} 

# Create a random search object
random_GBM_class = RandomizedSearchCV(
    estimator = GradientBoostingClassifier(),
    param_distributions = param_grid,
    n_iter = 10,
    scoring='accuracy', n_jobs=4, cv = 5, refit=True, return_train_score = True)

# Fit to the training data
random_GBM_class.fit(X_train, y_train)

# Print the values used for both hyperparameters
print(random_GBM_class.cv_results_['param_learning_rate'])
print(random_GBM_class.cv_results_['param_min_samples_leaf'])

[1.6557046979865773 0.2912751677852349 1.1328859060402685
 0.2657718120805369 1.5919463087248322 1.5281879194630874
 1.5919463087248322 1.2604026845637584 1.7959731543624162
 0.2912751677852349]
[25 47 62 49 41 37 58 28 63 59]


# Analysing the output

In [4]:
cv_results = pd.DataFrame(random_GBM_class.cv_results_)
cv_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_min_samples_leaf,param_learning_rate,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,11.632708,0.094932,0.005024,0.000272,25,1.655705,"{'min_samples_leaf': 25, 'learning_rate': 1.65...",0.8,0.76881,0.779048,0.789762,0.767381,0.781,0.012455,9,0.784048,0.763333,0.781369,0.802143,0.768929,0.779964,0.013483
1,11.614731,0.143881,0.006379,0.00016,47,0.291275,"{'min_samples_leaf': 47, 'learning_rate': 0.29...",0.825238,0.818571,0.821667,0.817143,0.814286,0.819381,0.003772,3,0.834405,0.834881,0.83506,0.835833,0.834583,0.834952,0.000496
2,11.731306,0.095643,0.006755,0.000391,62,1.132886,"{'min_samples_leaf': 62, 'learning_rate': 1.13...",0.802381,0.800714,0.799762,0.801667,0.797143,0.800333,0.001823,4,0.849226,0.85244,0.852857,0.852976,0.85619,0.852738,0.00221
3,11.682813,0.158112,0.016839,0.02056,49,0.265772,"{'min_samples_leaf': 49, 'learning_rate': 0.26...",0.825238,0.818571,0.821905,0.818571,0.819524,0.820762,0.002548,1,0.832143,0.832143,0.832917,0.833095,0.833333,0.832726,0.000494
4,11.622179,0.162805,0.016949,0.020882,41,1.591946,"{'min_samples_leaf': 41, 'learning_rate': 1.59...",0.784762,0.797143,0.796905,0.786905,0.801429,0.793429,0.006443,5,0.809524,0.80381,0.818393,0.806429,0.812024,0.810036,0.005017
5,11.652706,0.119954,0.006199,0.00028,37,1.528188,"{'min_samples_leaf': 37, 'learning_rate': 1.52...",0.790952,0.793095,0.79119,0.756905,0.796905,0.78581,0.014609,8,0.788036,0.801845,0.814286,0.767202,0.806548,0.795583,0.016563
6,11.690367,0.120727,0.006932,0.000712,58,1.591946,"{'min_samples_leaf': 58, 'learning_rate': 1.59...",0.805238,0.812381,0.76881,0.785714,0.775714,0.789571,0.016755,7,0.80875,0.82125,0.794881,0.804286,0.790119,0.803857,0.010917
7,11.692964,0.056694,0.006492,0.000181,28,1.260403,"{'min_samples_leaf': 28, 'learning_rate': 1.26...",0.798095,0.79619,0.790952,0.786667,0.786429,0.791667,0.00479,6,0.859821,0.835595,0.85619,0.858036,0.857381,0.853405,0.008982
8,11.604131,0.10583,0.024203,0.024407,63,1.795973,"{'min_samples_leaf': 63, 'learning_rate': 1.79...",0.769524,0.770714,0.773095,0.760714,0.771667,0.769143,0.004374,10,0.769643,0.772202,0.764107,0.768393,0.78375,0.771619,0.006607
9,9.568081,2.484347,0.006445,0.000215,59,0.291275,"{'min_samples_leaf': 59, 'learning_rate': 0.29...",0.824286,0.818571,0.82,0.816905,0.817619,0.819476,0.002619,2,0.831905,0.833274,0.834048,0.835357,0.834167,0.83375,0.001138
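To compare candidates more easily, it can help to keep only a few columns and sort by test score. The sketch below is self-contained: rather than reusing `cv_results`, it rebuilds a small DataFrame by hand from four columns of the output above (values copied as printed), and the names `summary` and `train_test_gap` are just illustrative.

```python
import pandas as pd

# Values copied from the cv_results output above (rounded as printed)
summary = pd.DataFrame({
    'param_learning_rate': [1.655705, 0.291275, 1.132886, 0.265772, 1.591946,
                            1.528188, 1.591946, 1.260403, 1.795973, 0.291275],
    'param_min_samples_leaf': [25, 47, 62, 49, 41, 37, 58, 28, 63, 59],
    'mean_test_score': [0.781, 0.819381, 0.800333, 0.820762, 0.793429,
                        0.78581, 0.789571, 0.791667, 0.769143, 0.819476],
    'mean_train_score': [0.779964, 0.834952, 0.852738, 0.832726, 0.810036,
                         0.795583, 0.803857, 0.853405, 0.771619, 0.83375],
})

# Gap between train and test accuracy: a rough indicator of overfitting
summary['train_test_gap'] = summary['mean_train_score'] - summary['mean_test_score']

# Best candidates first
top3 = summary.sort_values('mean_test_score', ascending=False).head(3)
print(top3)
```

In this run the three best candidates all have a learning rate below 0.3. With the real search results, the same sort can be applied to `cv_results` directly.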


In [6]:
# Best model
print(random_GBM_class.best_estimator_)
print()

# Index (row in cv_results_) of the model ranked 1
print(random_GBM_class.best_index_)
print()

# Best hyper-parameters 
print(random_GBM_class.best_params_)
print()

# Best overall score
print(random_GBM_class.best_score_)

GradientBoostingClassifier(learning_rate=0.2657718120805369,
                           min_samples_leaf=49)

3

{'min_samples_leaf': 49, 'learning_rate': 0.2657718120805369}

0.8207619047619048


In [7]:
# Seconds used for refitting the best model on the whole training set
random_GBM_class.refit_time_

7.199084043502808

# Predict using the best model from RandomizedSearchCV

In [8]:
predictions = random_GBM_class.predict(X_test)
predictions[:5]

array([0, 0, 0, 0, 0])
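`accuracy_score` was imported earlier but not used. As a minimal sketch of how held-out accuracy could be checked, the example below runs the same workflow on a small synthetic dataset from `make_classification` (an assumption, so that the snippet runs without the credit-card file); the variable names, the reduced `n_estimators`, the smaller `n_iter` and `cv`, and the seeds are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the credit-card data (assumption, for a quick run)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_distributions={'learning_rate': np.linspace(0.1, 2, 150),
                         'min_samples_leaf': list(range(20, 65))},
    n_iter=3, scoring='accuracy', cv=3, refit=True, random_state=0)
search.fit(X_tr, y_tr)

# predict() delegates to the best estimator refitted on the training data
test_accuracy = accuracy_score(y_te, search.predict(X_te))
print(test_accuracy)
```

On the notebook's own data, the equivalent call would be `accuracy_score(y_test, predictions)`.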