# Introduction to Data Science PC Lab 09: Tree-based Models 
# Demo Notebook

Author: Jan Verwaeren - Arne Deloose

Course: Introduction to Data Science
    
Welcome back!

This notebook contains Python code for the lecture on Tree-based methods in the course *Introduction to data science* and includes a set of exercises as well.

## Import Libraries

To add functionality to your Python session, a series of libraries is imported (most importantly scikit-learn).

In [1]:
# Basic libraries
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", RuntimeWarning)
warnings.simplefilter("ignore", DeprecationWarning)

# Sklearn
## Data
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

## Models
from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV
from sklearn.manifold import TSNE

## Model Explanation
from sklearn.inspection import permutation_importance
from sklearn.inspection import PartialDependenceDisplay

## Metrics
from sklearn.metrics import accuracy_score, confusion_matrix

# XGBoost
# import xgboost

# Plotting
import graphviz
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display



## Loading a toy dataset

To illustrate the concepts of this class, we will use the Wisconsin Breast Cancer dataset, which contains measurements derived from microscopic images of tumors. The goal is to predict whether a tumor is *benign* or *malignant*.

In [2]:
# Load dataset
breast_cancer_data = load_breast_cancer()
predictors = breast_cancer_data['data']
labels = breast_cancer_data['target']

# Print description of the dataset (in case you want some more info)
# print(breast_cancer_data['DESCR'])

We will make the usual split into training and test data.

In [3]:
# Parameters
seed = 0

# Train - Test Split
X_train, X_test, y_train, y_test = train_test_split(predictors, 
                                                    labels, 
                                                    random_state=seed)
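
As a quick check of the split above, the following sketch prints the size of both parts and the fraction of benign tumors in each. It assumes the same dataset and seed as the cell above; since no ``test_size`` is passed, scikit-learn falls back to its default of holding out 25% of the rows.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=0
)

# With no test_size given, 25% of the rows end up in the test set
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

# Label 1 means benign, so the mean of the labels is the benign fraction
print("Benign fraction (train):", round(float(y_train.mean()), 3))
print("Benign fraction (test):", round(float(y_test.mean()), 3))
```

If the class fractions differ a lot between the two parts, passing ``stratify=labels`` to ``train_test_split`` is one way to keep them aligned.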

## 1. Decision Trees

Tree-based methods are implemented in the submodule ``tree``, and classification trees are implemented by the ``DecisionTreeClassifier`` class of that submodule.

### 1.1 Decision trees with default parameters

Decision tree classifiers come (as most classifiers in sklearn) with a set of default settings for the hyperparameters. The code sample below shows how such a default tree can be built and tested.

NOTE: the parameter ``random_state`` sets the seed of the random number generator used by the ``DecisionTreeClassifier`` instance. On rare occasions, two potential splits can be equally good, and in that case the tree induction algorithm decides randomly which split to use. Because this is a random process, different runs can lead to different trees. Fixing the seed makes the result reproducible.

In [4]:
# Create decision tree classifier object
decision_tree_classifier = tree.DecisionTreeClassifier(random_state=seed)

# Fit the training data to the classifier
decision_tree_classifier = decision_tree_classifier.fit(X_train, y_train)

# Calculate accuracy of the train and test sets
train_predictions = decision_tree_classifier.predict(X_train)
test_predictions = decision_tree_classifier.predict(X_test)
print("Train set accuracy is: {} and test set accuracy is: {}".format(round(accuracy_score(y_train, train_predictions), 4),
                                                                      round(accuracy_score(y_test, test_predictions), 4)))

Train set accuracy is: 1.0 and test set accuracy is: 0.8811
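
To make the note about ``random_state`` concrete, the sketch below fits two trees with the same seed and compares their internal structure. It assumes the same data split as above; the names ``tree_a`` and ``tree_b`` are illustrative only.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=0
)

# Two trees built with the same seed
tree_a = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_b = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# tree_.feature stores the feature index used at each node,
# tree_.threshold stores the corresponding split value
same_features = np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature)
same_thresholds = np.array_equal(tree_a.tree_.threshold, tree_b.tree_.threshold)
print("Identical trees:", same_features and same_thresholds)
```

Leaving ``random_state`` unset (or varying it) may or may not change the tree, depending on whether ties between splits occur in the data.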


### 1.2 Decision trees with cost complexity pruning

Cost-complexity pruning is a form of model tuning that focuses on finding an optimal value for the cost-complexity parameter $\alpha$. Unlike the hyperparameters we have tuned in the past (such as the regularization parameter $\alpha$ for ridge regression), the candidate values do not have to be chosen by hand: sklearn provides a method (``cost_complexity_pruning_path``) that generates the series of $\alpha$'s to search during a grid search. One can show that including additional values in the grid search is not relevant, because between two consecutive values of the series the pruned tree fitted to the training data does not change.

In the following code fragment ``path.ccp_alphas`` is an array of relevant $\alpha$'s to try.

In [None]:
# Call built-in method to compute the pruning path during Minimal Cost-Complexity Pruning.
path = decision_tree_classifier.cost_complexity_pruning_path(X_train, y_train)
path.ccp_alphas
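
The effect of $\alpha$ on tree size can be inspected directly. The sketch below, assuming the same data split as above, refits one tree per value in ``path.ccp_alphas`` and records the number of leaves. The clipping step is a precaution added here against tiny negative values caused by floating-point rounding; it is not part of the original notebook.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=0
)

# Compute the pruning path on the training data
full_tree = tree.DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Precaution: ccp_alpha must be non-negative
alphas = np.clip(path.ccp_alphas, 0, None)

# Fit one pruned tree per alpha and record its number of leaves
leaf_counts = []
for alpha in alphas:
    pruned = tree.DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"alpha={alpha:.5f}  leaves={leaf_counts[-1]}")
```

Larger values of $\alpha$ penalize complexity more strongly, so the number of leaves never increases along the path.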

Next, a grid search with cross-validation can be used to find the optimal value for $\alpha$.

In [14]:
# create GridSearchCV instance
mdl_cv = GridSearchCV(decision_tree_classifier,
                      param_grid = {'ccp_alpha' : path.ccp_alphas},
                      cv = 10)

# perform the grid search
mdl_cv.fit(X_train, y_train)

Look at the best $\alpha$ found.

In [None]:
mdl_cv.best_params_

Make predictions on the test set and compute accuracy.

In [None]:
# make predictions with the tuned model and compute accuracy (using a built-in function this time)
predictions = mdl_cv.predict(X_test)
print(accuracy_score(y_test, predictions))

# compute the confusion matrix for the same predictions
confusion_matrix(y_test, predictions)
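
The raw confusion matrix is easier to read with labeled rows and columns. The sketch below is one way to do that, assuming the same data split; to stay self-contained it uses a single pruned tree with a hypothetical ``ccp_alpha=0.01`` rather than the grid-search result. In sklearn, rows correspond to true classes and columns to predicted classes, and in this dataset label 0 is malignant and label 1 is benign.

```python
import pandas as pd
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=0
)

# Stand-in model: ccp_alpha=0.01 is an illustrative value, not a tuned one
model = tree.DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
model.fit(X_train, y_train)

cm = confusion_matrix(y_test, model.predict(X_test))

# target_names is ordered by label: index 0 = malignant, index 1 = benign
names = [str(name) for name in data["target_names"]]
cm_df = pd.DataFrame(cm,
                     index=[f"true {name}" for name in names],
                     columns=[f"predicted {name}" for name in names])
print(cm_df)
```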

Visualize the tree.

In [None]:
tree.plot_tree(mdl_cv.best_estimator_)
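
When a plot is hard to read, a text rendering of the splitting rules can be a useful alternative. The sketch below uses ``tree.export_text`` on a shallow tree; the ``max_depth=2`` setting is an assumption made here only to keep the output short.

```python
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=0
)

# A deliberately shallow tree so the printed rules stay compact
small_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(X_train, y_train)

# Render the tree as indented if/else rules with readable feature names
feature_names = [str(name) for name in data["feature_names"]]
rules = tree.export_text(small_tree, feature_names=feature_names)
print(rules)
```

The same ``feature_names`` list can also be passed to ``tree.plot_tree`` to label the nodes of the plot.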

### 1.3 Regression trees

Regression trees can be built in the same way. They are implemented by the ``DecisionTreeRegressor`` class of the ``tree`` submodule of ``sklearn``.
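
As a minimal sketch of a regression tree, the example below fits a ``DecisionTreeRegressor`` on the diabetes dataset that ships with sklearn. The dataset and the ``max_depth=3`` setting are choices made here for illustration; they do not come from the lecture.

```python
from sklearn import tree
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative regression dataset with a continuous target
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same workflow as for classification: create, fit, predict
regressor = tree.DecisionTreeRegressor(max_depth=3, random_state=0)
regressor.fit(X_train, y_train)
test_predictions = regressor.predict(X_test)

# For regressors, score() returns R^2 instead of accuracy
print("Test MSE:", round(mean_squared_error(y_test, test_predictions), 2))
print("Test R^2:", round(regressor.score(X_test, y_test), 4))
```

Cost-complexity pruning works the same way for regressors, through the ``ccp_alpha`` parameter and ``cost_complexity_pruning_path``.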

## 2. Random forests

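Random forests are a simple but powerful extension to classification and regression trees: they combine the predictions of many trees, each grown on a bootstrap sample of the training data and restricted to a random subset of the features at each split.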

Random forests are implemented by the ``RandomForestClassifier`` and ``RandomForestRegressor`` classes in the ``ensemble`` submodule.

In [None]:
# Create random forest classifier object
random_forest_classifier = ensemble.RandomForestClassifier(oob_score=True,
                                                           max_features='sqrt',
                                                           random_state=seed)

# Fit the training data to the classifier
random_forest_classifier = random_forest_classifier.fit(X_train, y_train)

# Calculate accuracy of the train and test sets
train_predictions = random_forest_classifier.predict(X_train)
test_predictions = random_forest_classifier.predict(X_test)
print("Train set accuracy is: {} and test set accuracy is: {}".format(round(accuracy_score(y_train, train_predictions), 6),
                                                                      round(accuracy_score(y_test, test_predictions), 6)))
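
The forest above is created with ``oob_score=True``, which makes sklearn compute an out-of-bag (OOB) estimate during fitting: each training sample is scored using only the trees whose bootstrap sample did not contain it. The sketch below, assuming the same data split, shows how to read that estimate from the ``oob_score_`` attribute and compare it with the test accuracy.

```python
from sklearn import ensemble
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=0
)

forest = ensemble.RandomForestClassifier(oob_score=True,
                                         max_features="sqrt",
                                         random_state=0)
forest.fit(X_train, y_train)

# oob_score_ is an accuracy estimate obtained without touching the test set
print("OOB accuracy estimate:", round(forest.oob_score_, 4))
print("Test set accuracy:", round(forest.score(X_test, y_test), 4))
```

Because the OOB estimate only uses training data, it can serve as a cheap alternative to cross-validation when comparing forest settings.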

As with most methods, the hyperparameters can be tuned using ``GridSearchCV``.

In [None]:
# Define parameter space to search
param_grid = { 
    'max_features': ['sqrt', None],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],}

# Create random forest classifier object
random_forest_classifier = ensemble.RandomForestClassifier(n_estimators=300,
                                                           oob_score=True, 
                                                           n_jobs=-1,
                                                           random_state=seed,) 

# Perform grid search in the defined parameter space with 5-fold cross-validation
# -> 2*4*3 = 24 parameter combinations x 5 folds = 120 model fits (plus one final refit on all training data)
CV_random_forest_classifier = GridSearchCV(estimator=random_forest_classifier, param_grid=param_grid, cv=5)
CV_random_forest_classifier.fit(X_train, y_train)
print('Best Parameters:', CV_random_forest_classifier.best_params_)
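
Besides ``best_params_``, the fitted search object keeps the score of every parameter combination in ``cv_results_``. The sketch below shows one way to inspect it as a table. To keep the run short it assumes a smaller grid, fewer trees and 3 folds than the cell above; these values are illustrative.

```python
import pandas as pd
from sklearn import ensemble
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=0
)

# Reduced grid (2 x 2 = 4 combinations) for a quick demonstration
param_grid = {"max_features": ["sqrt", None],
              "max_depth": [3, None]}

forest = ensemble.RandomForestClassifier(n_estimators=50, random_state=0)
search = GridSearchCV(forest, param_grid=param_grid, cv=3)
search.fit(X_train, y_train)

# One row per parameter combination, sorted from best to worst
results = pd.DataFrame(search.cv_results_)
columns = ["param_max_features", "param_max_depth",
           "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))

# best_estimator_ is refit on the full training set and can be evaluated directly
test_accuracy = search.best_estimator_.score(X_test, y_test)
print("Test accuracy of best model:", round(test_accuracy, 4))
```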

## 3. Feature Importances

### 3.1 Permutation importance

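*Permutation importance* is a generic approach to computing importances of variables for a model. It measures how much the score of a fitted model drops when the values of a single feature are randomly shuffled, which breaks the link between that feature and the target.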

In [None]:
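# Perform permutation feature importance using the best random forest model

# GridSearchCV fits clones of the estimator, so take the refitted best model from the search
random_forest_classifier = CV_random_forest_classifier.best_estimator_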

permutation_importance_result = permutation_importance(random_forest_classifier, 
                                                       X_test, 
                                                       y_test, 
                                                       n_repeats=10, 
                                                       random_state=seed,)

# Extract the mean and standard deviation of the feature importances from the results and create Pandas Dataframe
forest_importances = pd.DataFrame({"importances" : permutation_importance_result.importances_mean, 
                                   "stdev" : permutation_importance_result.importances_std }, 
                                   index=breast_cancer_data['feature_names']).sort_values("importances", ascending=False).iloc[:8]

# Plot the feature importances
plt.bar(x = forest_importances.index,
        height = forest_importances["importances"],
        yerr=forest_importances["stdev"])

plt.title("Feature importances using permutation on test data")
plt.ylabel("Mean accuracy decrease")
plt.ylim(bottom=0)
plt.show()
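
To see what ``permutation_importance`` measures, the computation can be sketched by hand for a single feature: shuffle that column in the test set, predict again, and record the drop in accuracy. The sketch below assumes the same data split and a default-sized forest; the choice of ``worst radius`` and of 10 repeats is illustrative.

```python
import numpy as np
from sklearn import ensemble
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=0
)

forest = ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
baseline = accuracy_score(y_test, forest.predict(X_test))

# Column index of the feature to permute (illustrative choice)
feature_index = list(data["feature_names"]).index("worst radius")

rng = np.random.default_rng(0)
drops = []
for _ in range(10):
    # Shuffle one column, leaving all other features untouched
    X_permuted = X_test.copy()
    X_permuted[:, feature_index] = rng.permutation(X_permuted[:, feature_index])
    permuted_accuracy = accuracy_score(y_test, forest.predict(X_permuted))
    drops.append(baseline - permuted_accuracy)

print("Baseline accuracy:", round(baseline, 4))
print("Mean accuracy decrease:", round(float(np.mean(drops)), 4))
print("Std of decrease:", round(float(np.std(drops)), 4))
```

When features are strongly correlated, shuffling one of them can leave the accuracy almost unchanged because the model still has access to the others, so small importances should be interpreted with care.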

### 3.2 Partial Dependence and ICE

In [None]:
# Create Pandas DataFrame for the test predictors
X_test_df = pd.DataFrame(X_test, columns=breast_cancer_data['feature_names'])

# Plot individual partial dependency of selected features 
fig, ax = plt.subplots(figsize=(20, 8))
PartialDependenceDisplay.from_estimator(random_forest_classifier, 
                                        X_test_df, 
                                        features = ['worst radius', 'worst perimeter', 'mean concave points'], 
                                        kind='both', 
                                        ax=ax)
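
The curves in such a plot can be reproduced by hand, which may help in reading them. For each value on a grid, the chosen feature is set to that value for every test sample and the predicted probability is recorded: one curve per sample is an ICE curve, and their average is the partial dependence. The sketch below assumes a 20-point grid between the minimum and maximum of the feature, which differs from the percentile-based grid that sklearn uses by default.

```python
import numpy as np
from sklearn import ensemble
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=0
)

forest = ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Feature to vary and the grid of values to try (illustrative choices)
feature_index = list(data["feature_names"]).index("worst radius")
grid = np.linspace(X_test[:, feature_index].min(),
                   X_test[:, feature_index].max(), 20)

ice_curves = []
for value in grid:
    # Force the feature to the same value for every sample
    X_modified = X_test.copy()
    X_modified[:, feature_index] = value
    # Probability of class 1 (benign) for each sample
    ice_curves.append(forest.predict_proba(X_modified)[:, 1])

# Shape (n_samples, n_grid_points): one ICE curve per row
ice_curves = np.array(ice_curves).T

# The partial dependence is the average of the ICE curves
partial_dependence_curve = ice_curves.mean(axis=0)
print(np.round(partial_dependence_curve, 3))
```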

**EXERCISE**: Tune a random forest regressor to predict the toxicity of molecules using the ``QSAR dataset``.
- Evaluate the performance of the resulting model.
- Which features are the most important?

In [1]:
# complete ...