The dataset below contains online food delivery order records extracted from the original dataset [here](https://www.kaggle.com/benroshan/online-food-delivery-preferencesbangalore-region).

The `Output` column is the target value; it indicates whether the delivery was satisfactory.

#### Please first run the code from start to end, then complete the quiz questions at the end of the notebook.

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

df = pd.read_csv("./onlinedelivery-short.csv")

In [2]:
model_features = df.columns.drop('Output')
model_target = 'Output'
numerical_features_all = df[model_features].select_dtypes(include=np.number).columns
catagorical_features_all = df[model_features].select_dtypes(include='object').columns

Create two preprocessing pipelines, one for numerical features and one for categorical features, and combine them with `ColumnTransformer`.

In [3]:
numerical_processor = Pipeline([
    ('num_imputer', SimpleImputer(strategy='mean')),
    ('num_scaler', MinMaxScaler())
])

categorical_processor = Pipeline([
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_encoder', OneHotEncoder(handle_unknown='ignore')) 
])

data_preprocessor = ColumnTransformer([
    ('numerical_pre', numerical_processor, numerical_features_all),
    ('categorical_pre', categorical_processor, catagorical_features_all)
]) 
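To make the effect of the two preprocessing branches concrete, here is a minimal sketch on a hypothetical four-row frame. The column names `age` and `meal` and all values are invented for illustration and are not taken from the delivery dataset. It mirrors the structure above: mean imputation plus min-max scaling for the numerical column, and constant imputation plus one-hot encoding for the categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: one numerical and one categorical column, each with a gap
toy = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "meal": pd.Series(["Lunch", np.nan, "Dinner", "Lunch"], dtype=object),
})

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),   # NaN -> column mean (30.0)
        ("scaler", MinMaxScaler()),                    # rescale to [0, 1]
    ]), ["age"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["meal"]),
])

toy_transformed = toy_preprocessor.fit_transform(toy)
# Convert to a dense array in case the transformer returned a sparse matrix
if hasattr(toy_transformed, "toarray"):
    toy_transformed = toy_transformed.toarray()

print(toy_transformed.shape)
print(toy_transformed)
```

The result has one scaled numerical column followed by one indicator column per category, where the imputed `missing` placeholder counts as its own category.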

In [4]:
# Final pipeline with a decision tree estimator
pipeline = Pipeline([
    ('data_preprocessing', data_preprocessor),
    ('dt', DecisionTreeClassifier(random_state=42))
])

# Render pipelines as diagrams when they are displayed in the notebook
from sklearn import set_config
set_config(display='diagram')

In [5]:
train_data, test_data = train_test_split(df, test_size=0.3, shuffle=True, random_state=42)
X_train = train_data[model_features]
y_train = train_data[model_target]
pipeline.fit(X_train, y_train)

Get the hyperparameters for the fitted decision tree in the pipeline. 

In [6]:
pipeline['dt'].get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 42,
 'splitter': 'best'}

Evaluate the model on the test data.

In [7]:
X_test = test_data[model_features]
y_test = test_data[model_target]
test_predictions = pipeline.predict(X_test)

print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))


[[11 11]
 [17 78]]
              precision    recall  f1-score   support

          No       0.39      0.50      0.44        22
         Yes       0.88      0.82      0.85        95

    accuracy                           0.76       117
   macro avg       0.63      0.66      0.64       117
weighted avg       0.79      0.76      0.77       117
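As a check on how the report relates to the confusion matrix, the per-class numbers can be recomputed by hand. The sketch below uses only the four counts printed above and assumes the scikit-learn convention that rows are true labels and columns are predicted labels, in sorted label order (`No`, `Yes`).

```python
# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels, in the order [No, Yes]
cm = [[11, 11],
      [17, 78]]
labels = ["No", "Yes"]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

metrics = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(cm[r][i] for r in range(len(labels)))  # column sum
    support = sum(cm[i])                                            # row sum
    precision = cm[i][i] / predicted_as_label
    recall = cm[i][i] / support
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1, support)
    print(f"{label:>3}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")

print(f"accuracy={accuracy:.2f}")
```

The rounded values match the report: the minority `No` class is predicted far less reliably than the `Yes` class, which the overall accuracy alone does not reveal.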



Use `GridSearchCV` to optimize two hyperparameters of the decision tree model: `max_depth` and `min_samples_leaf`.

In [8]:
# Parameter grid for GridSearch
param_grid = {
    'dt__max_depth': [20, 50, 100],
    'dt__min_samples_leaf': [1, 5, 10],
}

grid_search = GridSearchCV(
    pipeline,    # Base model
    param_grid,  # Parameters to try
    cv=5,        # Apply 5-fold cross-validation
    verbose=1,   # Print summary
    n_jobs=-1,   # Use all available processors
)

# Fit the GridSearch to our training data
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
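The grid keys use the `<step name>__<parameter>` convention, so `dt__max_depth` targets the `max_depth` parameter of the pipeline step named `dt`. The sketch below shows one way to inspect which combination a search selected and how each candidate scored. It runs on synthetic data from `make_classification` with a simplified pipeline and a different grid, all of which are stand-ins chosen for illustration rather than the notebook's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the delivery dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("dt", DecisionTreeClassifier(random_state=42)),
])

# Keys follow the "<step name>__<parameter>" convention
demo_grid = {
    "dt__max_depth": [2, 4, 8],
    "dt__min_samples_leaf": [1, 5, 10],
}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# Winning combination and its mean cross-validated accuracy
print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))

# Mean cross-validated accuracy of every candidate
results = demo_search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, round(score, 3))
```

Looking at `best_params_` and `cv_results_` shows whether the candidates actually differ in cross-validated score, which a single test-set report cannot tell you.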


Evaluate the tuned model on test data

In [9]:
final_model = grid_search.best_estimator_
test_predictions = final_model.predict(X_test)

print('Model evaluation on the test data: \n')
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))

Model evaluation on the test data: 

[[11 11]
 [17 78]]
              precision    recall  f1-score   support

          No       0.39      0.50      0.44        22
         Yes       0.88      0.82      0.85        95

    accuracy                           0.76       117
   macro avg       0.63      0.66      0.64       117
weighted avg       0.79      0.76      0.77       117



In [10]:
final_model['dt'].get_depth()

12

The tuned model produces exactly the same test results as the original model, so the grid search above did not help. As shown in the code above, the tuned tree has a depth of 12, which is well below every candidate value we tried for the `max_depth` hyperparameter (20, 50, 100), so none of those limits actually constrained the tree. Let's try some smaller values.
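The point about non-binding depth limits can be checked on a toy problem. The following sketch uses synthetic data (an assumption made for illustration, not the delivery dataset) to compare an unconstrained tree with one capped above its natural depth and one capped below it.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the delivery dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Depth the tree reaches when nothing limits it
unconstrained = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
natural_depth = unconstrained.get_depth()

# A cap above the natural depth never takes effect
loose = DecisionTreeClassifier(max_depth=natural_depth + 10, random_state=42)
loose.fit(X_demo, y_demo)

# A cap below the natural depth forces a shallower tree
tight_cap = max(1, natural_depth // 2)
tight = DecisionTreeClassifier(max_depth=tight_cap, random_state=42)
tight.fit(X_demo, y_demo)

print("natural depth:", natural_depth)
print("depth with loose cap:", loose.get_depth())
print("depth with tight cap:", tight.get_depth())
```

Only candidates at or below the natural depth can change the fitted tree, which is why the next search should use smaller values.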

You are asked to add code that tunes the `max_depth` hyperparameter using grid search, and then to determine:

1. Which of the three candidate values [8, 10, 12] for the `max_depth` hyperparameter improves the model performance? (You may keep the candidates for `min_samples_leaf` unchanged.)

2. What is the tree depth of the decision tree in the new best model after tuning?  

__Bonus:__ In the [sklearn decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) models, the uncertainty (impurity) of a node can also be measured by entropy during tree construction. Use `GridSearchCV` (rather than setting this option directly on the model object) to determine whether the entropy criterion improves the performance of our model, given the same train/test split and other parameter settings.
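For intuition about the two impurity measures, here is a small standalone sketch of the textbook Gini and entropy formulas applied to class counts. The helper names `gini` and `entropy` are illustrative and are not scikit-learn functions. The base-2 logarithm is assumed here; a different base would only rescale the values. The counts `[22, 95]` are the test-set class counts from the reports above, and the other two are made-up extremes.

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits: sum(-p_k * log2(p_k)); empty classes contribute 0."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Perfectly mixed node, the test-set class balance, and a pure node
for counts in ([50, 50], [22, 95], [0, 117]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))
```

Both measures are zero for a pure node and largest for an evenly mixed one, so they often rank splits similarly. Whether switching the criterion changes this model's performance is what the grid search in the bonus question is meant to find out.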

__-------------------Important: Add your code below this block; Answers inserted above will not be graded --------------__