# [Grid vs Random Search Hyperparameter Tuning using Python](https://www.youtube.com/watch?v=Ah4wsTXghwI)

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline 
import warnings
warnings.filterwarnings("ignore") 

from sklearn import metrics, preprocessing, tree 
from sklearn.metrics import f1_score, make_scorer 
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV 
from sklearn.model_selection import train_test_split 
import time

##### [GitHub decision_tree_grid_search](https://github.com/bhattbhavesh91/decision_tree_grid_search/blob/master/GridSearch_Vs_RandomSearch.ipynb)

In [2]:
# Decorator that measures the wall-clock time of a single function call.
# Used below to compare how long GridSearchCV and RandomizedSearchCV take.
def timeit(method):
    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()
        if 'log_time' in kw:
            name = kw.get('log_name', method.__name__.upper())
            kw['log_time'][name] = int((te - ts) * 1000)
        else:
            print('%r  %2.2f ms' % \
                  (method.__name__, (te - ts) * 1000))
        return result
    return timed
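
A minimal usage sketch of the decorator, which is repeated here so the block runs on its own. `slow_square` and `timings` are hypothetical names used only for illustration, and `functools.wraps` is an addition that is not in the original cell.

```python
import time
from functools import wraps

def timeit(method):
    """Same timing logic as the cell above, repeated for a self-contained example."""
    @wraps(method)  # addition: preserves the wrapped function's name and docstring
    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()
        if 'log_time' in kw:
            # Store the elapsed milliseconds in the caller's dict
            name = kw.get('log_name', method.__name__.upper())
            kw['log_time'][name] = int((te - ts) * 1000)
        else:
            print('%r  %2.2f ms' % (method.__name__, (te - ts) * 1000))
        return result
    return timed

@timeit
def slow_square(x, **kwargs):
    # **kwargs absorbs log_time / log_name, which the decorator forwards unchanged
    time.sleep(0.01)
    return x * x

timings = {}
value = slow_square(4, log_time=timings, log_name='square')  # elapsed ms goes into timings
printed = slow_square(5)                                      # elapsed ms is printed instead
print(value, printed, timings)
```

Note that the decorator forwards `log_time` and `log_name` to the wrapped function, so only functions that accept extra keyword arguments can use the dictionary path. `generate_clf_from_search` further down does not, which is why its timing is printed.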

### Classification problem

In [3]:
file_loc = 'loan_prediction.csv'
df = pd.read_csv(file_loc)
df.head()

| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Loan_Status |
|---|---|---|---|---|---|---|
| 0 | 5849 | 0.0 | 0.0 | 360.0 | 1.0 | 1 |
| 1 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 |
| 2 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 1 |
| 3 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 1 |
| 4 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 1 |


In [4]:
df.shape

(614, 6)

In [5]:
from sklearn.tree import DecisionTreeClassifier as dt
clf = dt()

In [6]:
clf # default parameters
X = df.iloc[:,0:len(df.columns)-1].values
Y = df.iloc[:,-1].values
X.shape

(614, 5)

In [7]:
Y.shape

(614,)

In [8]:
X_train,X_test,Y_train,Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
X_train.shape

(460, 5)

In [9]:
X_test.shape

(154, 5)

Applying k-fold cross-validation

In [10]:
scores = cross_val_score(clf, X_train, Y_train, cv=5, scoring='f1_macro') # 5 folds
scores.mean() # mean macro-F1 over the 5 folds; fairly low for the default tree

0.6379641979774977
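
To show what `cross_val_score(..., cv=5, scoring='f1_macro')` does, here is a sketch that reproduces it with an explicit fold loop. It runs on synthetic data from `make_classification` because the loan CSV is not bundled with this text; `X_demo`, `y_demo` and the other names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 460 rows and 5 features, like the training split
X_demo, y_demo = make_classification(n_samples=460, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)

# For a classifier, an integer cv is turned into an unshuffled StratifiedKFold
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    model = DecisionTreeClassifier(random_state=0)        # fresh model per fold
    model.fit(X_demo[train_idx], y_demo[train_idx])       # train on 4 folds
    preds = model.predict(X_demo[val_idx])                # predict the held-out fold
    manual_scores.append(f1_score(y_demo[val_idx], preds, average='macro'))

library_scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X_demo, y_demo, cv=5, scoring='f1_macro')
print(np.mean(manual_scores), library_scores.mean())
```

With a fixed `random_state` the two means should agree, since the library call performs the same split, fit and score steps.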

In [11]:
# Fit the model
clf.fit(X_train, Y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [12]:
# Make predictions
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)
clf

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [13]:
train_cols = df.columns[0:len(df.columns)-1]
target_cols = df.columns[-1]
# f1_score expects the true labels first, then the predictions
print('The Training F1 Score is', f1_score(Y_train, train_predictions))
print('The Testing F1 Score is', f1_score(Y_test, test_predictions))

The Training F1 Score is 1.0
The Testing F1 Score is 0.743119266055046


The training F1 score (1.0) is much higher than the testing F1 score (about 0.74), which is a clear sign that the model is overfitting. We want a model that scores well on both the training and the testing data. A training score close to 1 combined with a noticeably lower testing score generally signals overfitting, whereas a somewhat lower training score with a comparable testing score indicates a model that generalizes well. To get such a model, we will use GridSearchCV or RandomizedSearchCV to search for the hyperparameters that perform best under cross-validation.
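
The train/test gap can be illustrated by limiting the tree depth. The sketch below uses synthetic data with deliberately noisy labels (`flip_y=0.2`) instead of the loan data, so the printed numbers are illustrative only; all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 614 rows, 5 features, 20% randomly assigned labels
X_demo, y_demo = make_classification(n_samples=614, n_features=5, n_informative=3,
                                     n_redundant=0, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

scores_by_depth = {}
for depth in [1, 3, 5, None]:          # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_f1 = f1_score(y_tr, model.predict(X_tr))
    test_f1 = f1_score(y_te, model.predict(X_te))
    scores_by_depth[depth] = (train_f1, test_f1)
    print(f"max_depth={depth}: train F1={train_f1:.3f}, test F1={test_f1:.3f}")
```

The unrestricted tree memorizes the training rows, noise included, so its training score sits at or near 1.0 while its testing score does not follow. Shallower trees trade some training score for a smaller gap.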

In [14]:
parameters = {'max_depth':[1,2,3,4,5], 'min_samples_leaf':[1,2,3,4,5], 'min_samples_split':[2,3,4,5],
              'criterion' : ['gini','entropy']}
scorer = make_scorer(f1_score)

###### [Sklearn make-scorer](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)
Make a scorer from a performance metric or loss function.  

This factory function wraps scoring functions for use in GridSearchCV and cross_val_score. It takes a score function, such as `accuracy_score`, `mean_squared_error`, `adjusted_rand_score` or `average_precision_score`, and returns a callable that scores an estimator’s output.
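
A short sketch of how `make_scorer` can be used, on synthetic data with illustrative names. It also shows that extra keyword arguments are passed through to the metric, so `make_scorer(f1_score, average='macro')` should behave like the built-in `'f1_macro'` string.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=0)

binary_f1 = make_scorer(f1_score)                   # F1 of the positive class only
macro_f1 = make_scorer(f1_score, average='macro')   # unweighted mean of per-class F1

scores_binary = cross_val_score(tree_demo, X_demo, y_demo, cv=5, scoring=binary_f1)
scores_macro = cross_val_score(tree_demo, X_demo, y_demo, cv=5, scoring=macro_f1)
scores_named = cross_val_score(tree_demo, X_demo, y_demo, cv=5, scoring='f1_macro')

print(scores_binary.mean(), scores_macro.mean(), scores_named.mean())
```

In this notebook the searches are scored with `make_scorer(f1_score)` (binary F1) while the `cross_val_score` calls use `'f1_macro'`, so the cross-validation means and the printed train/test F1 scores measure slightly different things and are not directly comparable.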

In [15]:
@timeit   # decorator that reports how long the function takes to execute
def generate_clf_from_search(grid_or_random, clf, parameters, scorer, X, y):
    if grid_or_random == "Grid":
        # Evaluates every combination in `parameters`
        search_obj = GridSearchCV(clf, parameters, scoring=scorer)
    elif grid_or_random == "Random":
        # Evaluates a random sample of combinations (n_iter=10 by default)
        search_obj = RandomizedSearchCV(clf, parameters, scoring=scorer)
    else:
        raise ValueError("grid_or_random must be 'Grid' or 'Random'")
    fit_obj = search_obj.fit(X, y)
    best_clf = fit_obj.best_estimator_
    return best_clf

In [16]:
best_clf_grid = generate_clf_from_search("Grid", clf, parameters, scorer, X_train, Y_train)
scores = cross_val_score(best_clf_grid, X_train, Y_train, cv=5, scoring='f1_macro')
scores.mean()

'generate_clf_from_search'  2587.12 ms


0.7058924321624135

Now make predictions with the best estimator found by GridSearchCV.

In [17]:
best_clf_grid.fit(X_train, Y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=1, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [18]:
best_train_predictions = best_clf_grid.predict(X_train)
best_test_predictions = best_clf_grid.predict(X_test)

In [19]:
# Calculate the F1 score of the new model (true labels first, then predictions)
print("The training f1 score is: ",f1_score(Y_train, best_train_predictions))
print("The testing f1 score is: ",f1_score(Y_test, best_test_predictions))

The training f1 score is:  0.8360902255639098
The testing f1 score is:  0.8620689655172413


The model tuned with GridSearchCV scores about 0.84 on the training set and 0.86 on the testing set, so the large gap seen with the default tree is gone.

In [20]:
del(best_train_predictions, best_test_predictions, best_clf_grid)

### RandomizedSearchCV
Random search differs from grid search in that it evaluates only a random sample of the 200 possible hyperparameter combinations we supplied, rather than all of them. RandomizedSearchCV draws a fixed number of combinations (`n_iter`, 10 by default), scores each one with cross-validation, and keeps the best. The draws do not depend on earlier results: the search does not start from an initial guess and refine it, and no optimization runs behind the scenes. Because every entry in `parameters` is a list, the combinations are sampled without replacement.
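
To make the candidate sets concrete, the sketch below enumerates the full grid with `ParameterGrid` and draws a sample with `ParameterSampler`, both from `sklearn.model_selection`. The `random_state=0` seed is an arbitrary choice for reproducibility and is not used in the notebook.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {'max_depth': [1, 2, 3, 4, 5],
              'min_samples_leaf': [1, 2, 3, 4, 5],
              'min_samples_split': [2, 3, 4, 5],
              'criterion': ['gini', 'entropy']}

# Every combination: the candidates a grid search would evaluate
all_candidates = list(ParameterGrid(parameters))

# A random draw of 10 combinations: the kind of sample a randomized search evaluates
sampled = list(ParameterSampler(parameters, n_iter=10, random_state=0))

print(len(all_candidates))   # 5 * 5 * 4 * 2 = 200
print(len(sampled))          # 10
for candidate in sampled[:3]:
    print(candidate)
```

Each sampled dictionary is one of the 200 grid entries, and since all values are given as lists no combination is drawn twice.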

In [21]:
best_clf_random = generate_clf_from_search('Random', clf, parameters, scorer, X_train, Y_train)

'generate_clf_from_search'  160.07 ms


Note how much less time RandomizedSearchCV takes: about 160 ms, compared with about 2587 ms for GridSearchCV.
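
The timing gap follows from the number of model fits. The arithmetic below assumes the defaults `n_iter=10` and `cv=5`; the `ccp_alpha` field in the printed estimators suggests scikit-learn 0.22 or later, where 5-fold cross-validation is the default.

```python
n_grid_candidates = 5 * 5 * 4 * 2   # all combinations in `parameters`
n_random_candidates = 10            # assumed default n_iter of RandomizedSearchCV
cv_folds = 5                        # assumed default cv

grid_fits = n_grid_candidates * cv_folds       # cross-validation fits for grid search
random_fits = n_random_candidates * cv_folds   # cross-validation fits for random search

expected_ratio = grid_fits / random_fits
observed_ratio = 2587.12 / 160.07              # timings reported in the notebook

print(grid_fits, random_fits)                  # 1000 50
print(expected_ratio, round(observed_ratio, 1))
```

Each search also refits the best candidate once on the full training set. The observed ratio (about 16) is somewhat below the fit-count ratio of 20, which is plausible given fixed overhead per search, though a single timing run is only a rough indication.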

In [22]:
scores = cross_val_score(best_clf_random, X_train, Y_train, cv=5, scoring='f1_macro')
scores.mean()

0.7058924321624135

Now make predictions with the best estimator found by RandomizedSearchCV.

In [23]:
best_clf_random.fit(X_train, Y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=1, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=3,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [24]:
# Make predictions using the new model
best_train_predictions = best_clf_random.predict(X_train)
best_test_predictions = best_clf_random.predict(X_test)
# Calculate the F1 score of the new model (true labels first, then predictions)
print("The training f1 score is: ",f1_score(Y_train, best_train_predictions))
print("The testing f1 score is: ",f1_score(Y_test, best_test_predictions))

The training f1 score is:  0.8360902255639098
The testing f1 score is:  0.8620689655172413


So we get the same scores as with GridSearchCV (the selected trees differ only in `min_samples_leaf` and `min_samples_split`), but in much less computation time.
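
Because the outcome of a randomized search depends on which candidates happen to be drawn, it can help to set the budget and the seed explicitly. This is a sketch on synthetic data; `n_iter=20` and `random_state=0` are illustrative choices, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=460, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)

parameters = {'max_depth': [1, 2, 3, 4, 5],
              'min_samples_leaf': [1, 2, 3, 4, 5],
              'min_samples_split': [2, 3, 4, 5],
              'criterion': ['gini', 'entropy']}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    parameters,
    n_iter=20,                       # number of sampled candidates (the search budget)
    scoring=make_scorer(f1_score),
    cv=5,
    random_state=0,                  # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)                   # the winning combination
print(search.best_score_)                    # its mean cross-validated F1
print(len(search.cv_results_['params']))     # 20 candidates were evaluated
```

Raising `n_iter` moves the search closer to an exhaustive grid search in both cost and coverage.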

# [Why Is Random Search Better Than Grid Search For Machine Learning](https://analyticsindiamag.com/why-is-random-search-better-than-grid-search-for-machine-learning/)
Optimising hyperparameters is considered to be the trickiest part of building machine learning and artificial intelligence models. That is why we usually end up experimenting with the hyperparameters by hand to optimise them. However, this does not scale to high-dimensional data, because the number of iterations increases, which in turn lengthens the training time. If these parameters are not set well, training might take ages to complete, or the model may never even reach a local minimum. Various methods have been introduced to find good hyperparameters, so let us look at two that are widely used and easy to implement.


## Hyperparameter Tuning Methods
Hyperparameter tuning is the search for the hyperparameter values that give the model high precision and accuracy. Two common methods are:
- Grid search
- Random search

Grid search is a technique that tries to find the right set of hyperparameters for a particular model. Unlike model parameters, which are learned during training when we optimise a loss function using something like gradient descent, hyperparameters are set before training. In this tuning technique, we simply build a model for every combination of the candidate hyperparameter values and evaluate each model. The candidate values are laid out like a grid, or matrix, in which each cell is one combination. Each combination is taken into consideration and its accuracy is noted. Once all the combinations have been evaluated, the one that gives the highest accuracy is considered the best. Below is a visual description of the uniform search pattern of grid search.

![](g1.PNG)

![](g2.PNG)
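
The "build a model for every combination and keep the best" procedure can be sketched as a plain loop and compared with `GridSearchCV`. The data is synthetic and `small_grid` is a reduced, illustrative grid; F1 is used as the score to match the notebook above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)
small_grid = {'max_depth': [1, 2, 3], 'criterion': ['gini', 'entropy']}

# Manual grid search: score every combination, remember the best one
best_score, best_params = -1.0, None
for params in ParameterGrid(small_grid):
    model = DecisionTreeClassifier(random_state=0, **params)
    score = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1').mean()
    if score > best_score:
        best_score, best_params = score, params

# The library version of the same procedure
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid,
                    cv=5, scoring='f1').fit(X_demo, y_demo)

print(best_params, best_score)
print(grid.best_params_, grid.best_score_)
```

Both approaches evaluate the same 6 combinations with the same folds, so the best mean score should match.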



One drawback of grid search is that it suffers from the curse of dimensionality: the number of combinations to evaluate grows exponentially with the number of hyperparameters. Moreover, there is no guarantee that the search will produce the perfect solution, since the grid contains only fixed values and the search usually ends up aliasing around the right set.
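
The exponential growth is easy to check numerically. The sketch assumes, for illustration, 5 candidate values per hyperparameter.

```python
from itertools import product

values_per_hyperparameter = 5   # illustrative assumption

# Number of grid points for 1 to 6 hyperparameters: 5, 25, 125, 625, 3125, 15625
growth = {d: values_per_hyperparameter ** d for d in range(1, 7)}
for n_hyperparameters, n_combinations in growth.items():
    print(n_hyperparameters, n_combinations)

# Cross-check one entry by explicit enumeration of a 4-dimensional grid
enumerated = len(list(product(range(values_per_hyperparameter), repeat=4)))
print(enumerated)
```

Every added hyperparameter multiplies the number of models to train by the number of values tried for it, and cross-validation multiplies that again by the number of folds.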

### Random Search

Random search is a technique where random combinations of the hyperparameters are tried to find the best solution for the model. It is similar to grid search, yet it has often been found to yield comparatively better results. The drawback of random search is that its results have high variance: since the selection of parameters is completely random and no intelligence is used to sample the combinations, luck plays its part.

Below is a visual description of search pattern of the random search:

![](g3.PNG)

Because random values are selected at every iteration, the search is likely to reach across the whole search space, whereas grid search needs a huge amount of time to cover every combination. This works best under the assumption that not all hyperparameters are equally important. In this search pattern, a random combination of parameters is considered in every iteration. The chances of finding the optimal parameters are comparatively higher in random search, because the random pattern means the model might end up being trained on the optimised parameters without any aliasing.
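
How much luck is involved can be estimated with a simple probability argument. Assume, purely for illustration, that some fraction `top_fraction` of the search space counts as "good" and that each trial is an independent uniform draw. The chance that at least one of `n_trials` draws lands in that region is then 1 - (1 - top_fraction)^n_trials.

```python
import numpy as np

def chance_of_hit(top_fraction, n_trials):
    """Probability that at least one of n independent uniform draws is in the good region."""
    return 1.0 - (1.0 - top_fraction) ** n_trials

for n in (10, 20, 60):
    print(n, round(chance_of_hit(0.05, n), 3))

# Monte Carlo cross-check for n_trials=10: treat [0, 0.05) as the good region
rng = np.random.default_rng(0)
draws = rng.random((20000, 10))                      # 20000 simulated searches of 10 trials
simulated = np.mean((draws < 0.05).any(axis=1))
print(round(float(simulated), 3))
```

Under these assumptions roughly 60 random trials give about a 95% chance of landing in the best 5% of the space, regardless of how many hyperparameters there are. With only 10 trials the chance is closer to 40%, which is the variance the text refers to.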

![](g4.PNG)

Let us say we have a 3×3 set of parameters, so each search technique evaluates 9 parameter sets. In the example above, random search works best for lower-dimensional data, since it takes less time and fewer iterations to find the right set. In grid search, however, the optimal parameter is not found because it is not on our grid, and time is spent working through to the last sample only to find a near-best solution. This example suggests that random search is the better parameter search technique when there are fewer dimensions.  
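
One way to see why 9 random points can beat a 3×3 grid is to count how many distinct values each hyperparameter actually gets tested at. The sketch below uses two hypothetical hyperparameters scaled to the range [0, 1]; it is an illustration of the idea, not a reproduction of the figures.

```python
from itertools import product

import numpy as np

# 3 x 3 grid: 9 points, but only 3 distinct values along each axis
grid_axis = [0.0, 0.5, 1.0]
grid_points = np.array(list(product(grid_axis, grid_axis)))

# 9 random points: each axis gets 9 different values
rng = np.random.default_rng(0)
random_points = rng.random((9, 2))

grid_distinct = [len(np.unique(grid_points[:, d])) for d in range(2)]
random_distinct = [len(np.unique(random_points[:, d])) for d in range(2)]

print(grid_distinct)     # [3, 3]
print(random_distinct)   # [9, 9]
```

If only one of the two hyperparameters really affects the score, the grid has effectively tried 3 settings of it while the random sample has tried 9, for the same budget of 9 model fits.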

Any information you can provide about the search helps it converge to the minimum faster. Examples include properties of the cost function, such as whether it is continuous or discrete, the type of error correction, such as stochastic, and any heuristics. The bottom line for finding the highest accuracy is that the more information you provide, the faster the optimised parameters are found.

### [A Comparison of Grid Search and Randomized Search Using Scikit Learn](https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85)

While it’s possible that RandomizedSearchCV will not find as accurate a result as GridSearchCV, it surprisingly picks the best result more often than not, and in a fraction of the time GridSearchCV would have taken. Given the same resources, randomized search can even outperform grid search.

