# Hyperparameter Optimization using GridSearchCV and RandomizedSearchCV

We're revisiting the Titanic data set, which is located [here](https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv). On this data set:
1. You will perform a supervised classification on the column "survived", using the decision tree classifier. You must clearly show the __accuracy score of the test set__.
2. With this same classifier, you will run a __GridSearch__ with the values:
    - all integers from 1 to 50 of the parameter `max_depth`
    - all integers from 1 to 15 of the parameter `min_samples_leaf`
    - (2, 5, 7, 10, 15, 30) of the `min_samples_split` parameter
3. With this search via __GridSearch__, what are the best values of the hyperparameters if we want to maximize the accuracy score?
4. With this same classifier, you will run a __RandomSearch__ on the same parameter values, limited to 200 iterations.
5. With this search via __RandomSearch__, what are the best values of the hyperparameters if we want to maximize the accuracy score?


## Task 1

* You will perform a supervised classification on the column "survived", using the decision tree classifier. You must clearly show the __accuracy score of the test set__.

In [12]:
# Importing the necessary modules
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

In [4]:
df = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")

df.head(5)

|   | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |


In [13]:
# Selecting features
# "Name" is dropped: sex is already available as its own feature,
# and guessing age from the title in "Name" is skipped to keep this exercise simple.
# The name and surname themselves are not used as predictors of survival here.
features = ["Pclass", "Sex", "Age", "Siblings/Spouses Aboard", "Parents/Children Aboard", "Fare"]
target   = "Survived"

# Making the train-test-split
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.20)

# Getting the sizes
print(f"Shape of training set = {X_train.shape}")
print(f"Shape of testing set  = {X_test.shape}")

# Since we need to do some preprocessing too,
# let's make a column transformer including the OrdinalEncoder and
# later a pipeline

# Column transformer passes through everything
# except "Sex", which is ordinally encoded
column_transformer = make_column_transformer(
    ("passthrough", ["Pclass"]),
    (OrdinalEncoder(), ["Sex"]),
    remainder="passthrough")

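# The pipeline consists of the column transformer and a random forest classifier (RFC).
# Note: the task names a decision tree; an RFC (an ensemble of decision trees) is used here.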
pipe = make_pipeline(column_transformer, RandomForestClassifier())

# Let's fit the pipe
pipe.fit(X_train, y_train)

# Let's compute the prediction scores (default = "accuracy")
scores_train = pipe.score(X_train, y_train)
scores_test = pipe.score(X_test, y_test)

# Output
print(f"RFC accuracy on the training set = {round(scores_train * 100, 2)}%")
print(f"RFC accuracy on the testing set  = {round(scores_test * 100, 2)}%")

Shape of training set = (709, 6)
Shape of testing set  = (178, 6)
RFC accuracy on the training set = 98.45%
RFC accuracy on the testing set  = 78.65%
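
The task statement names a decision tree classifier, so it may help to see the same pipeline with `DecisionTreeClassifier` in place of the random forest. The following is a minimal sketch, not the notebook's own cell: it runs on a small synthetic frame (made-up values and a made-up target, with only some of the Titanic column names) so that it is self-contained, and the `random_state=0` settings are assumptions added for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic frame (hypothetical values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.uniform(1, 80, size=n).round(1),
    "Fare": rng.uniform(5, 100, size=n).round(2),
})
# Made-up target that depends on sex and class, for illustration only
toy["Survived"] = ((toy["Sex"] == "female") | (toy["Pclass"] == 1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="Survived"), toy["Survived"], test_size=0.20, random_state=0
)

# Same preprocessing idea as above: encode "Sex", pass the rest through
preprocess = make_column_transformer(
    (OrdinalEncoder(), ["Sex"]),
    remainder="passthrough",
)

# A single decision tree instead of a random forest
tree_pipe = make_pipeline(preprocess, DecisionTreeClassifier(random_state=0))
tree_pipe.fit(X_train, y_train)

test_accuracy = tree_pipe.score(X_test, y_test)
print(f"Decision tree accuracy on the testing set = {round(test_accuracy * 100, 2)}%")
```

With this variant, `make_pipeline` names the final step `decisiontreeclassifier`, so the search keys in the later tasks would start with `decisiontreeclassifier__` rather than `randomforestclassifier__`.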


## Task 2

* With this same classifier, you will run a __GridSearch__ with the values:
    - all integers from 1 to 50 of the parameter `max_depth`
    - all integers from 1 to 15 of the parameter `min_samples_leaf`
    - (2, 5, 7, 10, 15, 30) of the `min_samples_split` parameter

In [16]:
param_grid = {
    "randomforestclassifier__max_depth"         : range(1,51),
    "randomforestclassifier__min_samples_leaf"  : range(1,16),
    "randomforestclassifier__min_samples_split" : [2, 5, 7, 10, 15, 30]
    }

model_grid = GridSearchCV(pipe, param_grid=param_grid)

model_grid.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('passthrough',
                                                                         'passthrough',
                                                                         ['Pclass']),
                                                                        ('ordinalencoder',
                                                                         OrdinalEncoder(),
                                                                         ['Sex'])])),
                                       ('randomforestclassifier',
                                        RandomForestClassifier())]),
             param_grid={'randomforestclassifier__max_depth': range(1, 51),
                         'randomforestclassifier__min_samples_leaf': range(1, 16),
                         'random
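
An exhaustive grid search fits every combination of the listed values, which explains why this cell takes a while to run. The short sketch below counts the candidates for the grid above using only the standard library. The fold count of 5 is an assumption: it is scikit-learn's default when `cv` is not passed, as in the call above.

```python
from itertools import product

# The three value lists from the task statement
max_depth = range(1, 51)
min_samples_leaf = range(1, 16)
min_samples_split = [2, 5, 7, 10, 15, 30]

# Every combination is one candidate for the grid search
n_candidates = sum(1 for _ in product(max_depth, min_samples_leaf, min_samples_split))

cv_folds = 5  # assumption: scikit-learn's default when cv is not passed
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
# prints "4500 candidates x 5 folds = 22500 fits"
```

Passing `n_jobs=-1` to `GridSearchCV` runs these fits in parallel on all available cores, which can shorten the wall-clock time considerably.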

## Task 3

* With this search via __GridSearch__, what are the best values of the hyperparameters if we want to maximize the accuracy score?

In [21]:
print(f"Best model parameters are: {model_grid.best_params_}")
# best_score_ is the mean cross-validated accuracy of the best candidate
print(f"The best accuracy = {round(model_grid.best_score_ * 100, 2)}%")

Best model parameters are: {'randomforestclassifier__max_depth': 10, 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__min_samples_split': 5}
The best accuracy = 84.48%
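
The accuracy reported by `best_score_` is the mean cross-validated score on the training data, not the accuracy on the held-out test set. Since the search refits the best candidate on the full training set by default, the fitted search object can be scored on the test split directly. The sketch below illustrates this on synthetic data from `make_classification` with a deliberately small grid and a single decision tree; the data, the grid, and the `random_state` values are assumptions chosen to keep the example fast and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the Titanic features)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Small illustrative grid, not the full grid from the task
small_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    small_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

# Mean cross-validated accuracy of the best candidate (training data only)
cv_accuracy = search.best_score_
# Accuracy of the refitted best estimator on the held-out test set
test_accuracy = search.score(X_test, y_test)

print(f"Best parameters       = {search.best_params_}")
print(f"Mean CV accuracy      = {round(cv_accuracy * 100, 2)}%")
print(f"Test-set accuracy     = {round(test_accuracy * 100, 2)}%")
```

In the notebook above, the equivalent call would be `model_grid.score(X_test, y_test)`.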


## Task 4


* With this same classifier, you will run a __RandomSearch__ on the same parameter values, limited to 200 iterations.


In [23]:
random_grid = RandomizedSearchCV(pipe, param_grid, n_iter=200)

random_grid.fit(X_train, y_train)

RandomizedSearchCV(estimator=Pipeline(steps=[('columntransformer',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('passthrough',
                                                                               'passthrough',
                                                                               ['Pclass']),
                                                                              ('ordinalencoder',
                                                                               OrdinalEncoder(),
                                                                               ['Sex'])])),
                                             ('randomforestclassifier',
                                              RandomForestClassifier())]),
                   n_iter=200,
                   param_distributions={'randomforestclassifier__max_depth': range(1, 51),
         

## Task 5

* With this search via __RandomSearch__, what are the best values of the hyperparameters if we want to maximize the accuracy score?

In [24]:
print(f"Best model parameters are: {random_grid.best_params_}")
# best_score_ is the mean cross-validated accuracy of the best candidate
print(f"The best accuracy = {round(random_grid.best_score_ * 100, 2)}%")

Best model parameters are: {'randomforestclassifier__min_samples_split': 5, 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__max_depth': 46}
The best accuracy = 83.92%
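
The random search reached 83.92% against 84.48% for the grid search while trying only 200 candidates. To see how small a share of the grid that is, the sketch below uses scikit-learn's `ParameterSampler`, the helper that draws candidates for a randomized search. The parameter names are written without the pipeline prefix for brevity, and `random_state=0` is an assumption added so that the draw is repeatable.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same value lists as in the task, without the pipeline step prefix
param_grid = {
    "max_depth": list(range(1, 51)),
    "min_samples_leaf": list(range(1, 16)),
    "min_samples_split": [2, 5, 7, 10, 15, 30],
}

# Size of the full grid
grid_size = len(ParameterGrid(param_grid))

# Draw 200 candidates; with list-valued parameters the sampler
# draws without replacement, so no candidate is repeated
sampled = list(ParameterSampler(param_grid, n_iter=200, random_state=0))
unique = {tuple(sorted(candidate.items())) for candidate in sampled}

coverage = len(sampled) / grid_size
print(f"{len(sampled)} of {grid_size} candidates ({coverage:.1%})")
# prints "200 of 4500 candidates (4.4%)"
```

Because neither `train_test_split` nor `RandomizedSearchCV` was given a `random_state` in the cells above, rerunning the notebook will generally produce different best parameters and scores.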
