<a href="https://colab.research.google.com/github/kundyyy/100-Days-Of-ML-Code/blob/master/AfterWork_Data_Science_Hyperparameter_Tuning_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="blue">To use this notebook on Google Colaboratory, you will need to make a copy of it. Go to **File** > **Save a Copy in Drive**. You can then use the new copy that will appear in the new tab.</font>

# AfterWork Data Science: Hyperparameter Tuning with Python

### Pre-requisites

In [None]:
# We will start by running this cell which will import the necessary libraries
# ---
# 
import pandas as pd                # Pandas for data manipulation
import numpy as np                 # Numpy for scientific computations
import matplotlib.pyplot as plt    # Matplotlib for visualisation (optional, in case you decide to plot)
%matplotlib inline

## 1. Manual Search

### Example 

In [None]:
# Example 
# ---
# Question: Will John, who is 40 years old and has a salary of 2500, buy a car?
# ---
# Dataset url = http://bit.ly/SocialNetworkAdsDataset
# ---
#

In [None]:
# Step 1
# ---
# Loading our dataset 
social_df = pd.read_csv('http://bit.ly/SocialNetworkAdsDataset') 

# Data preparation: Encoding Gender as 1 for Male and 0 for Female
social_df["Gender"] = np.where(social_df["Gender"] == "Male", 1, 0)

# Defining our predictor and label variable
X = social_df.iloc[:, [1, 2 ,3]].values  # Independent/predictor variables
y = social_df.iloc[:, 4].values          # Dependent/label variable

# Splitting our dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)


# Performing scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler() 
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [None]:
# Step 2
# ---
# Defining our classifier
from sklearn.tree import DecisionTreeClassifier  

# We will get to see the values of the decision tree classifier hyperparameters in the output below.
# The decision tree has quite a number of hyperparameters that require fine-tuning in order 
# to get the best possible model, one that reduces the generalization error. 
# To explore other decision tree hyperparameters, we can consult the scikit-learn documentation 
# by following this link: https://bit.ly/3eu3XIh
# ---
# We will focus on two specific hyperparameters:
# 1. Max depth: This is the maximum depth of the tree, i.e. the largest number 
# of splits on any path from the root to a leaf. 
# For example, if this is set to 3, then no path makes more than three successive splits 
# and the tree is cut off before it can grow any deeper. 
# 2. Min samples leaf: This is the minimum number of samples, or data points, 
# that are required to be present in every leaf node. A split is only considered 
# if it leaves at least this many training samples in each branch.
# ---
#
decision_classifier = DecisionTreeClassifier()

# Fitting our data
decision_classifier.fit(X_train, y_train)
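
Depending on the scikit-learn version, the output of the cell above may only list hyperparameters that differ from their defaults. As a minimal sketch, the `get_params()` method can be used to read the current values explicitly. The variable names `clf` and `params` are only for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# An untuned classifier: every hyperparameter keeps its library default
clf = DecisionTreeClassifier()

# get_params() returns a dict mapping hyperparameter names to their current values
params = clf.get_params()
for name in ("max_depth", "min_samples_leaf", "criterion"):
    print(name, "=", params[name])
```

By default `max_depth` is `None` (the tree grows until its leaves are pure or too small to split) and `min_samples_leaf` is 1, which is why an untuned tree can overfit.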

In [None]:
# Step 3
# ---
# Making our predictions
decision_y_prediction = decision_classifier.predict(X_test) 

# Determining the Accuracy
from sklearn.metrics import accuracy_score 
print(accuracy_score(y_test, decision_y_prediction))   # accuracy_score expects (y_true, y_pred)
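
The example question asks about one specific person, so here is a rough sketch of how a single prediction could be made once a scaler and classifier are fitted. It does not download the real dataset: `X_demo` and `y_demo` are made-up stand-in data, and the column order (gender flag, age, salary) and the labelling rule are assumptions chosen only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; assumed column order: [gender (1 = male), age, salary]
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.integers(0, 2, size=200),         # gender flag
    rng.integers(18, 61, size=200),       # age in years
    rng.integers(1500, 15001, size=200),  # salary
]).astype(float)
# Made-up labelling rule so that the demo model has something to learn
y_demo = ((X_demo[:, 1] > 40) | (X_demo[:, 2] > 9000)).astype(int)

# Fit the scaler and the classifier on the stand-in data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
model = DecisionTreeClassifier(random_state=42).fit(X_scaled, y_demo)

# John: male, 40 years old, salary 2500
# The new row is transformed with the already fitted scaler (no refitting)
john = np.array([[1, 40, 2500]], dtype=float)
john_prediction = model.predict(scaler.transform(john))
print("Predicted class for John:", john_prediction[0])
```

With the notebook's own objects, the same idea would be `decision_classifier.predict(sc_X.transform(...))`, provided the row uses the same column order as `X`.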

In [None]:
# Repeating Step 2
# ---
# Let's now perform hyperparameter tuning by setting 
# the hyperparameters max_depth = 2 and min_samples_leaf = 50
# and see how our output changes.
# ---
# 
decision_classifier = DecisionTreeClassifier(max_depth = 2, min_samples_leaf = 50)

# Fitting our data
decision_classifier.fit(X_train, y_train)

In [None]:
# Repeating Step 3
# ---
# Making our predictions
decision_y_prediction = decision_classifier.predict(X_test) 

# Determining the Accuracy
from sklearn.metrics import accuracy_score 
print(accuracy_score(y_test, decision_y_prediction))   # accuracy_score expects (y_true, y_pred)

Can you get a better accuracy, either by tuning the same hyperparameters or by tuning other ones?

To read more about hyperparameter tuning for decision trees, you can refer to this reading: [Link](https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680)
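
Manual search quickly becomes repetitive, so one possible way to organise it is a small loop that records the accuracy for each setting tried. The sketch below uses synthetic data from `make_classification` as a stand-in for the notebook's dataset, and the candidate values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's dataset (400 rows, 3 features)
X_demo, y_demo = make_classification(n_samples=400, n_features=3,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=42)

# Try each hand-picked combination and record its test accuracy
results = {}
for depth in (2, 4, None):
    for leaf in (1, 10, 50):
        clf = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=leaf,
                                     random_state=42)
        clf.fit(X_tr, y_tr)
        results[(depth, leaf)] = accuracy_score(y_te, clf.predict(X_te))

# Print the settings from best to worst accuracy
for (depth, leaf), acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"max_depth={depth}, min_samples_leaf={leaf}: accuracy={acc:.3f}")
```

Picking hyperparameters by looking at test accuracy tends to give an optimistic estimate, which is one reason the later sections rely on cross-validation instead.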

### <font color="green">Challenge</font>

In [None]:
# Challenge 1
# ---
# Using the given dataset above, create a logistic regression classifier 
# then tune its hyperparameters to get the best possible accuracy.
# Compare your results with those of your fellows in your breakout room.
# Hint: Use the following documentation to tune the hyperparameters.
# Scikit-learn documentation: https://bit.ly/2YZR4iP
# ---
# Dataset url = http://bit.ly/SocialNetworkAdsDataset
# 

## 2. Grid Search

### Example

In [None]:
# Example 
# ---
# Question: Will John, who is 40 years old and has a salary of 2500, buy a car?
# ---
# Dataset url = http://bit.ly/SocialNetworkAdsDataset
# ---
#

In [None]:
# Recap of Step 2 from Manual Search
# ---
# Defining our classifier 

# We will get to see the values of the decision tree classifier hyperparameters in the output below.
# The decision tree has quite a number of hyperparameters that require fine-tuning in order 
# to get the best possible model, one that reduces the generalization error. 
# To explore other decision tree hyperparameters, we can consult the scikit-learn documentation 
# by following this link: https://bit.ly/3eu3XIh
# ---
# Again we will focus on the same two specific hyperparameters:
# 1. Max depth: This is the maximum depth of the tree, i.e. the largest number 
# of splits on any path from the root to a leaf. 
# For example, if this is set to 3, then no path makes more than three successive splits 
# and the tree is cut off before it can grow any deeper. 
# 2. Min samples leaf: This is the minimum number of samples, or data points, 
# that are required to be present in every leaf node.
# ---
#
decision_classifier = DecisionTreeClassifier()

In [None]:
# Step 1: Hyperparameters: Getting Started with Grid Search
# ---
# We will continue from where we left off in the previous example.
# We create a dictionary of all the hyperparameters and the corresponding 
# set of values that we want to test for best performance. 
# Each key of the dictionary is a hyperparameter name 
# and each value is the list of candidate values for that hyperparameter.
# The grid_param dictionary below has two hyperparameters: max_depth and min_samples_leaf. 
# The values that we want to try out are passed in the lists. 
# For instance, we want to find which value of max_depth 
# (out of 2, 3, 4, 10 and 15) provides the highest accuracy. 
# Similarly, we want to find which value of min_samples_leaf 
# (out of 10, 20, 30, 40 and 50) results in the highest performance. 
# The grid search algorithm tries all possible combinations 
# of hyperparameter values and returns the combination with the highest accuracy. 
# In this case, the algorithm will check all 5 x 5 = 25 combinations.
# ---
# 
grid_param = {
    'max_depth': [2, 3, 4, 10, 15],
    'min_samples_leaf': [10, 20, 30, 40, 50]
}
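
To make the size of the search space concrete, the combinations in the grid can be listed before any model is fitted. This is a small sketch using scikit-learn's `ParameterGrid` helper; the name `combinations` is only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the step above
grid_param = {
    'max_depth': [2, 3, 4, 10, 15],
    'min_samples_leaf': [10, 20, 30, 40, 50]
}

# ParameterGrid expands the dictionary into every combination of values
combinations = list(ParameterGrid(grid_param))
print(len(combinations))   # 5 x 5 = 25 candidate settings
print(combinations[:3])    # a few of the generated dictionaries
```

With 5-fold cross-validation, each of the 25 candidates is fitted 5 times, so the search trains 125 models before the final refit.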

In [None]:
# Step 2: Instantiating GridSearchCV object
# ---
# Once the parameter dictionary is created, the next step 
# is to create an instance of the GridSearchCV class. 
# The estimator parameter takes the algorithm that we want to tune. 
# The param_grid parameter takes the parameter dictionary 
# that we just created, the scoring parameter 
# takes the performance metric, the cv parameter corresponds 
# to the number of cross-validation folds, which we set to 5 in our case, and finally 
# the n_jobs parameter refers to the number of CPU cores that we want to use for execution. 
# A value of -1 for the n_jobs parameter means that all available cores are used.
# You can refer to the GridSearchCV documentation 
# if you want to find out more: https://bit.ly/2Yr0qVC
# ---
# 
from sklearn.model_selection import GridSearchCV
gd_sr_cl = GridSearchCV(estimator = decision_classifier,
                     param_grid = grid_param,
                     scoring = 'accuracy',
                     cv = 5,
                     n_jobs =-1)

In [None]:
# Step 3: Calling the fit method
# ---
# Once the GridSearchCV class is initialized, we call the fit method of the class 
# and pass it the training set, as shown in the following code.
# The method might take a bit of time to execute. 
# This is the drawback: GridSearchCV fits a model for every combination 
# of hyperparameters on every fold, which makes grid search computationally very expensive.
# ---
# 
gd_sr_cl.fit(X_train, y_train)

In [None]:
# Step 4: Checking the parameters that return the highest accuracy
# --- 
# To do so, we print the best_params_ attribute of the GridSearchCV object, as shown below:
# ---
# 
best_parameters = gd_sr_cl.best_params_
print(best_parameters)

# The result shows the combination of max_depth and min_samples_leaf 
# that achieved the highest mean cross-validated accuracy. 
# If the best value lies at the edge of the range we searched 
# (for example, max_depth = 15 or min_samples_leaf = 10), 
# it would be a good idea to extend the grid in that direction 
# and see if performance further increases.
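
Beyond the single best combination, it can help to look at how every candidate scored. The following sketch loads the `cv_results_` attribute of a fitted `GridSearchCV` into a DataFrame. It runs on synthetic stand-in data from `make_classification`, not the notebook's dataset, and the names `search` and `results_df` are only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=3,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)

search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                      param_grid={'max_depth': [2, 3, 4, 10, 15],
                                  'min_samples_leaf': [10, 20, 30, 40, 50]},
                      scoring='accuracy',
                      cv=5)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per candidate combination
results_df = pd.DataFrame(search.cv_results_)
columns = ['param_max_depth', 'param_min_samples_leaf',
           'mean_test_score', 'std_test_score', 'rank_test_score']
print(results_df[columns].sort_values('rank_test_score').head())
```

If several candidates have nearly the same mean score, the simpler tree (smaller `max_depth`, larger `min_samples_leaf`) is often a reasonable choice.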

In [None]:
# Step 5: Finding the obtained accuracy
# ---
# The last step of the grid search procedure is 
# to find the accuracy obtained using the best parameters. 
# The best_score_ attribute holds the mean cross-validated accuracy 
# of the best combination, which we can compare with the accuracy from the manual search.
# To find the best accuracy achieved, we execute the following code:
# ---
# 
best_result = gd_sr_cl.best_score_
print(best_result)

# Compare the printed score with the accuracy obtained in the manual search. 
# Note that best_score_ is a cross-validated score computed on the training data, 
# not an accuracy on the test set. 
# To improve this further, it would be good to test values for other hyperparameters 
# of the decision tree algorithm, such as max_features, max_leaf_nodes, criterion, etc. 
# to see if the accuracy further improves or not.
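
Because `best_score_` is computed on cross-validation folds of the training data, a separate check on held-out data is a useful complement. This is a minimal sketch, assuming the default `refit=True`, in which case `best_estimator_` is the best candidate refitted on the whole training set. The data is a synthetic stand-in and the smaller grid is an arbitrary choice to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split the same way as in the notebook
X_demo, y_demo = make_classification(n_samples=400, n_features=3,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=42)

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid={'max_depth': [2, 3, 4],
                                  'min_samples_leaf': [10, 20, 30]},
                      scoring='accuracy',
                      cv=5)
search.fit(X_tr, y_tr)

# Mean cross-validated accuracy of the best candidate (training data only)
cv_accuracy = search.best_score_
# Accuracy of the refitted best model on data it has never seen
test_accuracy = accuracy_score(y_te, search.best_estimator_.predict(X_te))
print(f"CV accuracy:   {cv_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

In the notebook itself, the equivalent check would use `gd_sr_cl.best_estimator_` together with `X_test` and `y_test`.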

Can you get a better accuracy? Refer to the decision tree documentation, choose additional appropriate hyperparameters, and add their values to the grid search space in an effort to get a better accuracy.

### <font color="green">Challenge</font>

In [None]:
# Challenge
# ---
# In this challenge, we will still be required to use grid search while using 
# the logistic regression classifier we created earlier to get the best possible accuracy. 
# Hint: Use the following documentation to tune the hyperparameters.
# Scikit-learn documentation: https://bit.ly/2YZR4iP
# ---
# Dataset url = http://bit.ly/SocialNetworkAdsDataset
# 

## 3. Random Search

### Example

In [None]:
# Example 
# ---
# Question: Will John, who is 40 years old and has a salary of 2500, buy a car?
# ---
# Dataset url = http://bit.ly/SocialNetworkAdsDataset
# ---
#

In [None]:
# Step 1: Hyperparameters: Getting Started with Random Search
# ---
# Random search differs from grid search in that we no longer have to 
# provide a discrete set of values to explore for each hyperparameter; rather, 
# we can provide a statistical distribution for each hyperparameter 
# from which values may be randomly sampled 
# (a plain list, such as the one for max_depth below, is sampled uniformly).
# We'll define a sampling distribution for each hyperparameter.
# ---
# 

# specify parameters and distributions to sample from
from scipy.stats import randint as sp_randint
param_dist = {"max_depth": [3, None], 
              "min_samples_leaf": sp_randint(1, 50)}

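To see what random search will actually try, a few candidates can be drawn from the distributions before running the search. This sketch uses scikit-learn's `ParameterSampler`; the values `n_iter=5` and `random_state=42` are arbitrary choices for illustration.

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterSampler

# Same search space as in the step above
param_dist = {"max_depth": [3, None],
              "min_samples_leaf": sp_randint(1, 50)}

# Draw 5 random candidates; random_state makes the draw repeatable
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=42))
for candidate in samples:
    print(candidate)
```

Note that `sp_randint(1, 50)` excludes its upper bound, so `min_samples_leaf` is drawn from the integers 1 to 49.
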
In [None]:
# Step 2: Instantiating RandomizedSearchCV object 
# ---
# Documentation: https://bit.ly/2V9Xhri
# 
from sklearn.model_selection import RandomizedSearchCV 
random_sr = RandomizedSearchCV(decision_classifier, param_dist, cv = 5) 

In [None]:
# Step 3: Calling the fit method
# ---
#
random_sr.fit(X_train, y_train)

In [None]:
# Step 4: Checking the parameters that return the highest accuracy
# ---
#
best_parameters = random_sr.best_params_
print(best_parameters)

In [None]:
# Step 5: Finding the obtained accuracy
# ---
# 
best_result = random_sr.best_score_
print(best_result)
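
The `RandomizedSearchCV` call in Step 2 keeps the default of `n_iter=10` sampled candidates and sets no `random_state`, so its results can change from run to run. The sketch below shows one possible way to control both. It uses synthetic stand-in data, and `n_iter=20` and `random_state=42` are arbitrary values chosen for illustration.

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=400, n_features=3,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=42)

param_dist = {"max_depth": [3, None],
              "min_samples_leaf": sp_randint(1, 50)}

search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42),
                            param_distributions=param_dist,
                            n_iter=20,          # number of sampled candidates
                            scoring='accuracy',
                            cv=5,
                            random_state=42)    # repeatable sampling
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy:    {search.score(X_te, y_te):.3f}")
```

Raising `n_iter` explores more of the search space at the cost of more model fits (`n_iter` times the number of folds).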

Can you get a better accuracy? Refer to the decision tree documentation, choose additional appropriate hyperparameters, and add their values or distributions to the random search space in an effort to get a better accuracy.

### <font color="green">Challenge</font>

In [None]:
# Challenge
# ---
# Again, we will be required to use random search while using 
# the logistic regression classifier we created earlier to get the best possible accuracy. 
# Hint: Use the following documentation to tune the hyperparameters.
# Scikit-learn documentation: https://bit.ly/2YZR4iP
# ---
# Dataset url = http://bit.ly/SocialNetworkAdsDataset
# 