**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.** 



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import time
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
from class_utils.download import download_file_maybe_extract
download_file_maybe_extract("https://www.dropbox.com/s/u8u7vcwy3sosbar/titanic.zip?dl=1",
                            directory="data/titanic")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Optimizing Hyperparameters Using Grid Search

In this notebook we will experiment with another simple method for hyperparameter optimization: grid search.

### Loading and Preprocessing the Data

The loading and preprocessing of data will be identical to that from the previous notebook.



In [None]:
#@title -- Loading and preprocessing: X_train, Y_train, X_test, Y_test -- { display-mode: "form" }
df = pd.read_csv("data/titanic/train.csv")
df_train, df_test = train_test_split(df, test_size=0.25,
                     stratify=df["Survived"], random_state=4)

# we split the columns into categorical and numeric inputs and the output
categorical_inputs = ["Pclass", "Sex", "Embarked"]
numeric_inputs = ["Age", "SibSp", 'Parch', 'Fare']
output = ["Survived"]

# we create our preprocessing pipeline
input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder(categories='auto')),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

# we fit the pipeline on the train set and then apply it to both train and test
X_train = input_preproc.fit_transform(df_train[categorical_inputs + numeric_inputs])
Y_train = df_train[output]

X_test = input_preproc.transform(df_test[categorical_inputs + numeric_inputs])
Y_test = df_test[output]

### Grid Search

When using grid search, we define a "grid" of all hyperparameter combinations; the resulting space is then searched systematically and in full.
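
To make the idea of searching "systematically and in full" concrete, here is a minimal sketch that enumerates every combination of a small grid using only the standard library. The dictionary `toy_grid` and its values are illustrative assumptions, not the grid used in this notebook.

```python
from itertools import product

# hypothetical toy grid, for illustration only
toy_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8],
}

# the Cartesian product of all value lists yields every configuration
names = list(toy_grid)
combinations = [dict(zip(names, values))
                for values in product(*toy_grid.values())]

for params in combinations:
    print(params)

print("number of configurations:", len(combinations))
```

Each printed dictionary is one point of the grid; grid search trains and scores a model for every one of them.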

Let us again display the docstring of the `DecisionTreeClassifier` class to refresh our memory of which hyperparameters we will be tuning:



In [None]:
?DecisionTreeClassifier

---
### Task 1: Setting up the Search Space

**Use the next cell to define the search space `grid` of decision tree hyperparameters.** 

---
We will again need to set up the search space. In the case of grid search, the space needs to be discrete and, if possible, relatively small, since we are going to search it fully, testing all possible configurations.

We will be using the `GridSearchCV` class from the `sklearn` package. There, the search space is defined as a dictionary in which the keys are the hyperparameter names and the values are lists of possible hyperparameter values, e.g.:

```
grid = {
    # categorical variable:
    'cat_var': ["opt1", "opt2", "opt3"],

    # numeric variable: needs to be discretized
    'num_var': [0.1, 0.5, 1.0]
}
```
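
Because every configuration is evaluated, the cost of the search grows multiplicatively with each added hyperparameter. The following rough sketch shows the arithmetic. The parameter names are real `DecisionTreeClassifier` hyperparameters, but the value lists in `example_grid` are hypothetical, and the fold count assumes the 10-fold cross-validation used later in this notebook.

```python
from math import prod

# hypothetical grid: real hyperparameter names, illustrative values
example_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8, 16],
    "min_samples_leaf": [1, 5, 10],
}

# number of configurations = product of the list lengths
n_configs = prod(len(values) for values in example_grid.values())

n_folds = 10                    # one model is trained per fold
n_fits = n_configs * n_folds    # total fits during cross-validation

print("configurations:", n_configs)
print("model fits:", n_fits)
```

Adding one more hyperparameter with five candidate values would multiply both numbers by five, which is why the grid should stay small.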


In [None]:
grid = {


    # ---

    
}

### Running the Optimization

Next we can run the optimization using the `GridSearchCV` class.



In [None]:
start = time.time()

model = DecisionTreeClassifier()
grid_search = GridSearchCV(model, grid, n_jobs=-1, cv=10,
                           scoring='f1_macro', verbose=True)
grid_search.fit(X_train, Y_train)

end = time.time()
print("Elapsed time: {:.1f} s".format(end - start))

We extract the best hyperparameters:



In [None]:
best_params = grid_search.best_params_
best_params
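
Besides the single best configuration, a fitted `GridSearchCV` object records the cross-validated score of every configuration in its `cv_results_` attribute. The sketch below shows one way to inspect it as a table. It runs on synthetic stand-in data (`X_demo`, `y_demo`) with a small hypothetical grid, so the numbers do not correspond to the Titanic data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption: not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     random_state=0)

# small hypothetical grid
demo_grid = {"max_depth": [2, 4, 8], "criterion": ["gini", "entropy"]}

demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           demo_grid, cv=5, scoring="f1_macro")
demo_search.fit(X_demo, y_demo)

# one row per configuration, sorted from best to worst rank
results = pd.DataFrame(demo_search.cv_results_)
results = results.sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score",
               "rank_test_score"]])
print("best score:", demo_search.best_score_)
```

Looking at the whole table, rather than only at `best_params_`, shows how close the runner-up configurations are and how much the score varies across folds.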

### Retraining the Model with the Best Hyperparameters

Now that we have identified the best set of hyperparameters, we will use them to retrain the model, this time on the entire training set rather than on the individual cross-validation folds.



In [None]:
model = DecisionTreeClassifier(**best_params)
model = model.fit(X_train, Y_train)
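
As a side note, `GridSearchCV` has a `refit` parameter that defaults to `True`; in that case the search object itself refits the best configuration on all the data passed to `fit` and exposes the result as `best_estimator_`. The sketch below checks that this matches manual retraining. It uses synthetic stand-in data and a hypothetical grid, and it fixes `random_state` so that the two trees are built deterministically.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption: not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     random_state=0)

# refit=True is the default: the best configuration is refit on all data
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 4, 8]},   # hypothetical grid
                      cv=5, scoring="f1_macro")
search.fit(X_demo, y_demo)

# manual retraining with the best hyperparameters
manual = DecisionTreeClassifier(random_state=0, **search.best_params_)
manual.fit(X_demo, y_demo)

same = np.array_equal(manual.predict(X_demo),
                      search.best_estimator_.predict(X_demo))
print("identical predictions:", same)
```

Retraining by hand, as done in this notebook, makes the step explicit; using `best_estimator_` simply saves one call to `fit`.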

### Testing

And finally, we are ready to test the model on the test set. We will display the confusion matrix and our standard metrics.



In [None]:
y_test = model.predict(X_test)

In [None]:
cm = pd.crosstab(Y_test.values.reshape(-1), y_test,
                 rownames=['actual'],
                 colnames=['predicted'])
print(cm)

In [None]:
print("Accuracy = {}".format(accuracy_score(Y_test, y_test)))
print("Precision = {}".format(precision_score(Y_test, y_test)))
print("Recall = {}".format(recall_score(Y_test, y_test)))
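
To show how these metrics relate to the confusion matrix, here is a small sketch that derives them by hand from the four cell counts. It also computes the macro-averaged F1 score, which is the `f1_macro` criterion used during the grid search. The label lists are hypothetical and serve only as an illustration.

```python
# hypothetical labels, for illustration only (1 = positive, 0 = negative)
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# the four cells of the binary confusion matrix
pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
tn = sum(a == 0 and p == 0 for a, p in pairs)  # true negatives
fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives
fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # how many predicted positives are correct
recall = tp / (tp + fn)      # how many actual positives are found

def f1(prec, rec):
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)

# macro F1: compute F1 for each class separately, then average
f1_pos = f1(precision, recall)
f1_neg = f1(tn / (tn + fn), tn / (tn + fp))
f1_macro = (f1_pos + f1_neg) / 2

print("Accuracy = {}".format(accuracy))
print("Precision = {}".format(precision))
print("Recall = {}".format(recall))
print("F1 (macro) = {}".format(f1_macro))
```

Note that `precision_score` and `recall_score` report the values for the positive class by default, whereas the macro F1 treats both classes equally.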