<center> <h1> <span style="color:black"> Hands-on Machine Learning with Python </span> </h1> </center>
<center> <h2> <span style="color:red"> Module 2: Tree-based machine learning methods </span> </h2> </center>
<center> <h3> <span style="color:red"> Session 1: Decision trees </span> </h3> </center>

# Structure of the notebook

* [Chapter 1 - Introduction](#one)
    + [1.1 Objectives of the notebook](#one-one)
    + [1.2 Library requirements](#one-two)

* [Chapter 2 - Regression tree](#two)
    + [2.1 Toy example](#two-one)
    + [2.2 Parameter settings](#two-two)
    + [2.3 Grid search cross-validation](#two-three)
        
* [Chapter 3 - Classification tree](#three)
    + [3.1 Toy example](#three-one)
    + [3.2 Grid search cross-validation](#three-two)

* [Chapter 4 - Actuarial tree](#four)
    + [4.1 MTPL data](#four-one)
    + [4.2 Claim frequency](#four-two)
    + [4.3 Claim severity](#four-three)

* [Chapter 5 - Interpretation tools](#five)
    + [5.1 Feature importance](#five-one)
    + [5.2 Partial dependence](#five-two)


# Chapter 1 - Introduction <a name="one"></a>

## 1.1 Objectives of the notebook <a name="one-one"></a>
The objectives of this notebook are to:
1. Build decision trees for typical regression, classification and actuarial problems.
1. Tune the parameters of decision trees to obtain optimal performance.
1. Inspect decision trees to gain insight into the underlying decision process.

## 1.2 Library requirements <a name="one-two"></a>
We start by importing all the required Python packages for this notebook.

In [None]:
# import packages
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import ggplot, geom_point, geom_line, aes, theme_set, theme_bw
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import permutation_importance, partial_dependence, PartialDependenceDisplay

# set the black and white theme for ggplot to get rid of gray backgrounds
theme_set(theme_bw())

# Chapter 2 - Regression tree <a name="two"></a>
A `scikit-learn` regression tree is implemented in the `sklearn.tree.DecisionTreeRegressor`: [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).

## 2.1 Toy example <a name="two-one"></a>
We start by fitting a simple regression tree to a toy example with simulated data. The goal is to understand the principles of how regression trees model the underlying data.

We simulate data from a sinusoidal pattern with some normally distributed noise on top of it:

In [None]:
# set a seed for reproducibility
np.random.seed(5678)
# generate an x array from 0 to 2*pi
x = np.linspace(start=0, stop=2*math.pi, num=500)
# generate the true model m as the sin of x
m = np.sin(x)
# generate the observed y by adding normal noise to m
y = m + np.random.normal(loc=0, scale=0.5, size = len(m))
# collect the arrays in a dataframe
dfr = pd.DataFrame.from_dict({'x':x, 'm':m, 'y':y})
# print the dataframe
dfr

The simulated data (gray points) and the underlying true model (green line) look as follows:

In [None]:
# plot simulated data
ggplot(dfr, aes(x = 'x')) + geom_point(aes(y = 'y'), alpha = 0.3) + geom_line(aes(y = 'm'), colour = 'darkgreen', size = 1.5)

Before fitting our first model, we need to reshape the feature vector `x` to a feature matrix `X` because `sklearn` expects a 2D array:

In [None]:
# print the shape of x
print(x.shape)
# reshape x to a matrix
X = x.reshape(-1, 1)
# print the shape of X
print(X.shape)

As a first try, we fit a decision stump (a tree with only one split and two leaf nodes) to our data:

In [None]:
# initialize a DecisionTreeRegressor with max depth equal to 1
tree_reg1 = DecisionTreeRegressor(criterion='squared_error', max_depth=1)
# fit the tree to our data
tree_reg1 = tree_reg1.fit(X, y)
# print the tree object
tree_reg1
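
To make the idea of "one split" concrete, the sketch below reproduces by hand what a squared-error stump is looking for: the threshold that minimizes the summed squared error of the two resulting groups. The function `best_stump_split` and the small demo arrays are illustrative names introduced here, not part of `scikit-learn`, and the exhaustive loop is a simplification of the optimized search that the library performs.

```python
import numpy as np

def best_stump_split(x, y):
    """Exhaustive search for the threshold that minimizes the summed squared error."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        # identical neighbouring x values cannot be separated by a threshold
        if x_sorted[i] == x_sorted[i - 1]:
            continue
        left, right = y_sorted[:i], y_sorted[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            # candidate threshold: midpoint between the two neighbouring x values
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse

# tiny hypothetical dataset with an obvious jump between x = 3 and x = 10
x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_demo = np.array([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])
threshold_demo, sse_demo = best_stump_split(x_demo, y_demo)
print(threshold_demo, sse_demo)
```

Each leaf then predicts the mean of the `y` values that fall on its side of the threshold.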

Printing the tree object only shows the estimator and its settings, not the fitted structure, so let's dig a little deeper. The `scikit-learn` page contains a guide on how to [understand the tree structure](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py) from `tree_` attributes like `children_left`, `children_right`, `feature` and `threshold`:

In [None]:
# function to explore the tree structure
def print_tree(tree_obj): 
  n_nodes = tree_obj.tree_.node_count
  children_left = tree_obj.tree_.children_left
  children_right = tree_obj.tree_.children_right
  feature = tree_obj.tree_.feature
  threshold = tree_obj.tree_.threshold

  node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
  is_leaves = np.zeros(shape=n_nodes, dtype=bool)
  stack = [(0, 0)]  # start with the root node id (0) and its depth (0)
  while len(stack) > 0:
      # `pop` ensures each node is only visited once
      node_id, depth = stack.pop()
      node_depth[node_id] = depth

      # If the left and right child of a node is not the same we have a split
      # node
      is_split_node = children_left[node_id] != children_right[node_id]
      # If a split node, append left and right children and depth to `stack`
      # so we can loop through them
      if is_split_node:
          stack.append((children_left[node_id], depth + 1))
          stack.append((children_right[node_id], depth + 1))
      else:
          is_leaves[node_id] = True

  print(
      "The binary tree structure has {n} nodes and has "
      "the following tree structure:\n".format(n=n_nodes)
  )
  for i in range(n_nodes):
      if is_leaves[i]:
          print(
              "{space}node={node} is a leaf node.".format(
                  space=node_depth[i] * "\t", node=i
              )
          )
      else:
          print(
              "{space}node={node} is a split node: "
              "go to node {left} if X[:, {feature}] <= {threshold} "
              "else to node {right}.".format(
                  space=node_depth[i] * "\t",
                  node=i,
                  left=children_left[i],
                  feature=feature[i],
                  threshold=threshold[i],
                  right=children_right[i],
              )
          )

Applied to our tree example, the result looks as follows:

In [None]:
# print the tree structure
print_tree(tree_reg1)

We can use the `sklearn.tree.plot_tree` function to visualize the tree structure, along with node counts, predictions and error metrics.

In [None]:
# plot the tree structure
plt.figure(figsize=(5, 5), dpi=100)
plot_tree(tree_reg1);
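
The numbers shown in each box of the plot can also be read directly from the fitted object. The sketch below uses a small simulated dataset (the names `x_demo`, `y_demo` and `stump` are assumptions made for this example) and compares the root node values stored in `tree_` with a manual calculation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# small simulated dataset, similar in spirit to the toy example
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 2 * np.pi, 200)
y_demo = np.sin(x_demo) + rng.normal(scale=0.5, size=x_demo.size)
stump = DecisionTreeRegressor(max_depth=1).fit(x_demo.reshape(-1, 1), y_demo)

# per-node arrays stored on the fitted tree; index 0 is the root node
samples = stump.tree_.n_node_samples
values = stump.tree_.value[:, 0, 0]
impurities = stump.tree_.impurity

for node_id in range(stump.tree_.node_count):
    print(node_id, samples[node_id], round(values[node_id], 3), round(impurities[node_id], 3))

# the same quantities for the root node, computed by hand
root_value = y_demo.mean()
root_mse = np.mean((y_demo - root_value) ** 2)
```

For a squared-error tree, *value* is the mean of the target in the node and the impurity is the mean squared deviation from that mean.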

We can generate the predicted values ourselves via the `predict()` method on our fitted tree model object:

In [None]:
# predict from the fitted tree
tree_reg1_pred = tree_reg1.predict(X)
tree_reg1_pred

To make this a bit more visually appealing, we define a function to show predictions together with the underlying data:

In [None]:
# function to plot data and predictions for the regression example
def plot_reg(dfr, pred=None):
  # work on a copy so the caller's dataframe is not modified
  dfr = dfr.copy()
  if pred is not None:
    dfr['pred'] = pred
  ggout = ggplot(dfr, aes(x = 'x')) + geom_point(aes(y = 'y'), alpha = 0.3) + geom_line(aes(y = 'm'), colour = 'darkgreen', size = 1.5)
  if pred is not None:
    ggout = ggout + geom_line(aes(y = 'pred'), colour = 'darkred', size = 1.5)
  return ggout

When we now plot the predictions we can indeed see that there are two leaf nodes present:

In [None]:
# plot predictions
plot_reg(dfr, tree_reg1_pred)

This was a nice first step, but the decision stump is clearly too simplistic to capture the full sinusoidal pattern. Let's add one depth level to the tree:

In [None]:
# initialize a DecisionTreeRegressor with max depth equal to 2
tree_reg2 = DecisionTreeRegressor(criterion='squared_error', max_depth=2)
# fit the tree to our data
tree_reg2 = tree_reg2.fit(X, y)
# plot the tree structure
plt.figure(figsize=(8, 5), dpi=100)
plot_tree(tree_reg2);

Compare this with the tree structure from before. What do you notice?

The predictions of the depth-2 tree with its four leaf nodes now look as follows:

In [None]:
# make predictions
tree_reg2_pred = tree_reg2.predict(X)
# plot the predictions
plot_reg(dfr, tree_reg2_pred)

**Your turn!**

* Pick one of the four leaf nodes of the tree with depth two and replicate the numbers for *samples*, *value* and *squared_error*.
* Build a tree of depth equal to 6 and check the resulting tree structure and predictions.

In [None]:
# add your code here
node_obs = dfr.loc[dfr['x'] > 3.62]
print(node_obs.shape[0])
print(np.mean(node_obs.y))
print(np.mean((node_obs.y - (np.mean(node_obs.y)))**2))

In [None]:
# add your code here
tree_reg6 = DecisionTreeRegressor(criterion='squared_error', max_depth=6)
tree_reg6 = tree_reg6.fit(X, y)
tree_reg6_pred = tree_reg6.predict(X)
plot_reg(dfr, tree_reg6_pred)

## 2.2 Parameter settings <a name="two-two"></a>
The different parameters involved in a `sklearn.tree.DecisionTreeRegressor` can be seen from the [function header](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html):

*class sklearn.tree.DecisionTreeRegressor(criterion='squared_error', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, ccp_alpha=0.0)*

We now fit a regression tree with parameters:
* max leaf nodes = 100
* max depth = 10
* min samples split = 10
* min samples leaf = 5
* ccp alpha = 0

With no pruning penalty, these settings lead to a severe overfit to the training data:

In [None]:
# initialize, fit, predict and plot a DecisionTreeRegressor with alpha 0
tree_reg_alpha0 = DecisionTreeRegressor(criterion='squared_error', max_leaf_nodes = 100, max_depth=10, min_samples_split=10, min_samples_leaf=5, ccp_alpha=0)
tree_reg_alpha0 = tree_reg_alpha0.fit(X, y)
tree_reg_alpha0_pred = tree_reg_alpha0.predict(X)
plot_reg(dfr, tree_reg_alpha0_pred)

Setting the `ccp_alpha` parameter equal to one leads to the trivial root node tree with a constant prediction for all observations:

In [None]:
# initialize, fit, predict and plot a DecisionTreeRegressor with alpha 1
tree_reg_alpha1 = DecisionTreeRegressor(criterion='squared_error', max_leaf_nodes = 100, max_depth=10, min_samples_split=10, min_samples_leaf=5, ccp_alpha=1)
tree_reg_alpha1 = tree_reg_alpha1.fit(X, y)
tree_reg_alpha1_pred = tree_reg_alpha1.predict(X)
plot_reg(dfr, tree_reg_alpha1_pred)

There are different approaches to prevent a decision tree from overfitting:
1. One is to limit the size of the tree via parameters like `max_leaf_nodes`, `max_depth` or `min_samples_leaf`.
1. Another option is the `ccp_alpha` parameter, which performs [minimal cost complexity pruning](https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning). In short, minimal cost complexity pruning finds the subtree that minimizes a penalized loss function where $\alpha \geq 0$ determines the penalty strength:
  * $\alpha = 0$ applies no penalty and gives the biggest tree possible
  * a sufficiently large $\alpha$ gives the root node tree (in our example, $\alpha = 1$ is large enough)
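
The penalized loss in question is $R_\alpha(T) = R(T) + \alpha |\widetilde{T}|$, where $R(T)$ is the total impurity of the leaves and $|\widetilde{T}|$ the number of leaves. The sketch below illustrates the effect of the penalty on a small simulated dataset: the number of leaves can only shrink as $\alpha$ grows, and a penalty well above the largest effective alpha leaves only the root node. The data and the variable names (`X_demo`, `leaf_counts`, `root_only`) are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_demo = np.linspace(0, 2 * np.pi, 300).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=300)

# effective alphas at which the unpruned tree loses a subtree
path = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).cost_complexity_pruning_path(X_demo, y_demo)
# guard against tiny negative values caused by floating point rounding
alphas = np.clip(path.ccp_alphas, 0, None)

# refit the tree for each alpha and record its number of leaves
leaf_counts = []
for alpha in alphas:
    tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0, ccp_alpha=alpha).fit(X_demo, y_demo)
    leaf_counts.append(tree.get_n_leaves())

# a penalty far above the largest effective alpha prunes everything but the root
root_only = DecisionTreeRegressor(min_samples_leaf=5, random_state=0, ccp_alpha=10 * alphas[-1]).fit(X_demo, y_demo)
print(leaf_counts[0], leaf_counts[-1], root_only.get_n_leaves())
```

Note that the relevant range of $\alpha$ depends on the scale of the target, so the alpha needed to reach the root node tree differs from one dataset to the next.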

We can extract the `ccp_alpha` values from different subtrees via the `cost_complexity_pruning_path` method:

In [None]:
# get the ccp alpha path
cp_path = tree_reg_alpha0.cost_complexity_pruning_path(X, y)
cp_path

The last item in these arrays corresponds to the trivial root node tree with the following overall MSE:

In [None]:
# calculate overall MSE
print(np.mean((y - (np.mean(y)))**2))

We therefore obtain all elements from the path, excluding this last element, and plot the training MSE versus the alpha values:

In [None]:
# get the alphas and MSEs
ccp_alphas, ccp_mse = cp_path.ccp_alphas[:-1], cp_path.impurities[:-1]

# plot the MSE with respect to the different values for alpha
plt.figure(figsize=(8, 5), dpi=100)
plt.plot(ccp_alphas, ccp_mse, marker="o", drawstyle="steps-post")
plt.xlabel("alpha")
plt.ylabel("MSE")
plt.title("Total MSE vs effective alpha for training set")
plt.show()

It is clear that smaller alphas lead to lower MSEs on the training set and therefore to more overfitting.

We now fit a regression tree to each distinct alpha value from the path as follows:

In [None]:
# initialize an empty list to save the tree models
tree_list = []
# loop over the different alpha values
for ccp_alpha in ccp_alphas:
  # fit a regression tree
  tree_reg = DecisionTreeRegressor(criterion='squared_error', max_leaf_nodes = 100, max_depth=10, min_samples_split=10, min_samples_leaf=5, ccp_alpha=ccp_alpha).fit(X, y)
  # append the model to the list
  tree_list.append(tree_reg)
# print the length of the tree list
len(tree_list)

We will now plot the number of nodes and the depth for the trees with different alpha values. This shows that smaller alphas lead to deeper trees with more nodes compared to those with larger values for alpha:

In [None]:
# get the node counts and depth for each tree in the list
node_counts = [tree_reg.tree_.node_count for tree_reg in tree_list]
depth = [tree_reg.tree_.max_depth for tree_reg in tree_list]

# plot the number of nodes and depth of the trees as a function of alpha
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

We now plot the predictions made by trees for different values of alpha, again showing how complexity grows for smaller alphas:

In [None]:
# initialize a plot object
plt.figure(figsize=(20, 20),dpi=100)
plt.subplots_adjust(hspace=0.5)
plt.suptitle("Tree predictions for different alphas", fontsize=18, y=0.95)
# iterate over the tree list and take every 4th item
for i, indx in enumerate(list(range(0, len(tree_list), 4))): 
#for i, indx in enumerate(list(range(len(tree_list) - 12, len(tree_list), 1))): # to zoom in on the largest alphas
  # make a prediction for this tree
  pred = tree_list[indx].predict(X)
  # plot the predictions in a subplot
  ax = plt.subplot(4, 4, i + 1)
  ax.scatter(x,y, color='gray', s=2)
  ax.plot(x,m,color='green')
  ax.plot(x,pred,color='red')
  ax.set_title(ccp_alphas[indx])
  ax.set_xlabel('x')
  ax.set_ylabel('y')

**Your turn!**

* Use what you have seen so far and try to obtain a nice fit for this dataset by manually tweaking some parameters.

In [None]:
# add your code here
tree_reg_pred = DecisionTreeRegressor(max_leaf_nodes=10, min_samples_leaf = 20).fit(X,y).predict(X)
plot_reg(dfr, tree_reg_pred)

## 2.3 Grid search cross-validation <a name="two-three"></a>
The exercise from the previous section showed us that it is possible to obtain a good model fit by manually tweaking some parameters. However, there are two very big drawbacks with that approach:
1. This manual tweaking is time-consuming work and not fun to do.
1. Validation of what counts as a good fit happened visually, not quantitatively.

A parameter grid search via cross-validation remedies both issues: it automatically tries different settings and returns a quantifiable loss metric on which to base the decision of what a "good" fit is. In `scikit-learn`, grid search CV is implemented in the class `sklearn.model_selection.GridSearchCV`: [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

For illustration purposes, we perform a grid search on the `ccp_alpha` parameter and keep the other settings fixed:

In [None]:
# define a parameter grid as a dict
param_grid = {'ccp_alpha': ccp_alphas}
# initialize the model
tree_reg = DecisionTreeRegressor(criterion='squared_error', max_leaf_nodes = 100, max_depth=10, min_samples_split=10, min_samples_leaf=5) # note that the ccp_alpha param is not included here
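# initialize the 5-fold CV, scored by negative MSE so that flipping the sign afterwards yields the MSE
tree_reg_cv = GridSearchCV(tree_reg, param_grid, cv=5, scoring='neg_mean_squared_error')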
# fit the CV
tree_reg_cv.fit(X,y)
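
To see what the grid search does for a single parameter value, the sketch below computes a 5-fold cross-validated MSE by hand and compares it with `cross_val_score`. The simulated data, the helper `cv_mse` and the chosen `ccp_alpha` of 0.01 are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 2 * np.pi, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=300)

def cv_mse(ccp_alpha, n_splits=5):
    """Average out-of-fold MSE of a tree with the given ccp_alpha."""
    fold_mse = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X_demo):
        tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=ccp_alpha, random_state=0)
        tree.fit(X_demo[train_idx], y_demo[train_idx])
        resid = y_demo[test_idx] - tree.predict(X_demo[test_idx])
        fold_mse.append(np.mean(resid ** 2))
    return np.mean(fold_mse)

manual = cv_mse(0.01)
# the same quantity via scikit-learn, which reports the negative MSE
reference = -cross_val_score(
    DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=0.01, random_state=0),
    X_demo, y_demo, cv=5, scoring="neg_mean_squared_error",
).mean()
print(manual, reference)
```

With an integer `cv`, the folds for a regressor are consecutive blocks of rows without shuffling. Because the toy `x` in this notebook is sorted, each fold then holds out a contiguous range of `x`; passing `KFold(n_splits=5, shuffle=True, random_state=...)` as `cv` would be one way to avoid that.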

We can collect the results from the `cv_results_` attribute:

In [None]:
# collect results
results_cv = tree_reg_cv.cv_results_
# store in a dataframe
results_pd = pd.DataFrame.from_dict({'alpha':ccp_alphas,'score':-results_cv['mean_test_score'],'rank':results_cv['rank_test_score']}).sort_values('rank')
# show the top results
results_pd.iloc[0:6]

We now plot the predictions for the optimal alpha value according to our grid search:

In [None]:
# obtain the optimal alpha from the CV results
opt_alpha = results_pd[results_pd['rank'] == 1]['alpha'].mean()
# calculate the predictions for this alpha value
pred = DecisionTreeRegressor(criterion='squared_error', max_leaf_nodes = 100, max_depth=10, min_samples_split=10, min_samples_leaf=5, ccp_alpha=opt_alpha).fit(X,y).predict(X)
# plot the predictions
plot_reg(dfr,pred)
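
As an alternative to collecting the ranks in a dataframe, a fitted `GridSearchCV` object also exposes the winning setting directly. The sketch below shows the `best_params_`, `best_score_` and `best_estimator_` attributes on simulated data; the small alpha grid and the variable names are assumptions chosen for this example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 2 * np.pi, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=300)

# hypothetical grid of pruning penalties
grid = {"ccp_alpha": [0.0, 0.001, 0.005, 0.01, 0.05]}
search = GridSearchCV(
    DecisionTreeRegressor(min_samples_leaf=5, random_state=0),
    grid, cv=5, scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# the best setting, its cross-validated MSE, and the tree refitted on all data
best_alpha = search.best_params_["ccp_alpha"]
best_cv_mse = -search.best_score_
pred_demo = search.best_estimator_.predict(X_demo)
print(best_alpha, best_cv_mse)
```

By default (`refit=True`) the best estimator is refitted on the full dataset, so it can be used for prediction without fitting a new tree by hand.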

# Chapter 3 - Classification tree <a name="three"></a>
A `scikit-learn` classification tree is implemented in the `sklearn.tree.DecisionTreeClassifier`: [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier).

## 3.1 Toy example <a name="three-one"></a>
We start by fitting a simple classification tree to a toy example with simulated data. The goal is to understand the principles of how classification trees model the underlying data.

We simulate binary data in a two-dimensional plane, with two decision boundaries and some normally distributed noise on top:

In [None]:
# set seed for reproducibility
np.random.seed(54321)

# generate the x1 and x2 feature vectors
x1 = np.repeat(np.arange(0.1, 10.1, 0.1), 100)
x2 = np.tile(np.arange(0.1, 10.1, 0.1), 100)

# generate the target vector y
y = np.zeros(len(x1), dtype=int)
y += (x1 + 2*x2 < 8).astype(int)
y += (3*x1 + x2 > 30).astype(int)
y += np.round(np.random.normal(loc=0, scale=0.3, size=len(y))).astype(int)
y = np.clip(y, 0, 1)

# collect the arrays in a dataframe
dfc = pd.DataFrame.from_dict({'x1':x1,'x2':x2,'y':y})
# transform the y column to a category
dfc['y'] = dfc['y'].astype("category")

# print the data
dfc

Our simulated data looks as follows:

In [None]:
ggplot(dfc, aes(x = 'x1', y = 'x2')) + geom_point(aes(color = 'y'))

Before we can start modeling, we need to create a feature matrix `X` from the arrays `x1` and `x2`:

In [None]:
# stack the 1d arrays in a 2d matrix
X = np.stack([x1,x2], axis=1)
X.shape

We now fit a classification tree of depth two using the `DecisionTreeClassifier` class and the `fit` method:

In [None]:
# fit a DecisionTreeClassifier of depth 2
tree_clf2 = DecisionTreeClassifier(criterion = 'log_loss', max_depth=2)
tree_clf2.fit(X,y)

We can plot the resulting tree structure via the `sklearn.tree.plot_tree` function:

In [None]:
# plot the tree structure
plt.figure(figsize=(8, 5), dpi=100)
plot_tree(tree_clf2);
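
The impurity printed in each node of a classification tree grown with the entropy criterion is the Shannon entropy of the class fractions, measured in bits (base-2 logarithm). The sketch below checks this for the root node on simulated data. It uses `criterion="entropy"`, which should be equivalent to `'log_loss'` in recent `scikit-learn` versions; the data and the helper `entropy_bits` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = (3 * X_demo[:, 0] + X_demo[:, 1] > 30).astype(int)

clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0).fit(X_demo, y_demo)

def entropy_bits(labels):
    """Shannon entropy (base 2) of the class fractions in an integer label array."""
    fractions = np.bincount(labels) / len(labels)
    fractions = fractions[fractions > 0]  # skip empty classes to avoid log2(0)
    return -np.sum(fractions * np.log2(fractions))

root_entropy = entropy_bits(y_demo)
print(root_entropy, clf.tree_.impurity[0])
```

A node containing only one class has entropy 0, while a 50/50 node in a binary problem has entropy 1.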

The fitted classification tree can return two kinds of predictions:
* the predicted class label via `predict()`
* the predicted class probabilities via `predict_proba()`.

Below we compute both and assign them to the `dfc` dataframe.

In [None]:
# predict the probabilities
dfc['phat'] = tree_clf2.predict_proba(X)[:,1]
# predict the class labels
dfc['yhat'] = tree_clf2.predict(X)
# convert both to categories for plotting
dfc['yhat'] = dfc['yhat'].astype('category')
dfc['phat'] = dfc['phat'].astype('category')
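
The two kinds of predictions are closely linked: the predicted probability in a leaf is the fraction of training observations of each class in that leaf, and the predicted label is the class with the highest probability. The sketch below verifies both statements on simulated data; the data and names such as `first_leaf` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X_demo = rng.uniform(0, 10, size=(500, 2))
# one linear decision boundary plus roughly 10% randomly flipped-on labels
y_demo = ((X_demo[:, 0] + 2 * X_demo[:, 1] < 8) | (rng.uniform(size=500) < 0.1)).astype(int)

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)
labels = clf.predict(X_demo)

# select all training observations that share a leaf with the first observation
leaf_ids = clf.apply(X_demo)
first_leaf = leaf_ids == leaf_ids[0]
# fraction of class 1 in that leaf, computed by hand
leaf_fraction = y_demo[first_leaf].mean()
print(leaf_fraction, proba[0, 1])
```

This also explains why a depth-two tree produces at most four distinct probability values: one per leaf.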

The predicted class labels look like this:

In [None]:
# plot yhat
ggplot(dfc, aes(x = 'x1', y = 'x2')) + geom_point(aes(color = 'yhat'))

And the predicted class probabilities look like this:

In [None]:
# plot phat
ggplot(dfc, aes(x = 'x1', y = 'x2')) + geom_point(aes(color = 'phat'))

**Your turn!**

* Pick one of the four leaf nodes of the tree with depth two and replicate the numbers for *samples*, *value* and *log_loss*.
* Bonus: can you also explain the predicted probability *phat* for this group?
* Build a tree of depth equal to 20 and check the resulting tree structure and predictions.

In [None]:
# add your code here
node_obs = dfc.query('x1 > 8.15 & x2 > 4.55')
print(node_obs.shape[0])
print(node_obs.query('y == 0').shape[0])
print(node_obs.query('y == 1').shape[0])
print(node_obs.query('y == 1').shape[0] / node_obs.shape[0])

In [None]:
# add your code here
from sklearn.metrics import log_loss
y_obs = node_obs.y.to_numpy()
X_obs = np.stack([node_obs.x1.to_numpy(),node_obs.x2.to_numpy()], axis=1)
p_obs = tree_clf2.predict_proba(X_obs)

log_loss0 = log_loss(y_obs,p_obs[:,1])
log_loss1 = -np.mean((y_obs * np.log(p_obs[:,1])) + ((1-y_obs) * np.log(p_obs[:,0])))
log_loss2 = -np.mean((y_obs * np.log2(p_obs[:,1])) + ((1-y_obs) * np.log2(p_obs[:,0])))
print(log_loss0)
print(log_loss1)
print(log_loss2)

In [None]:
# add your code here
count_0 = node_obs.query('y == 0').shape[0] / node_obs.shape[0]
count_1 = node_obs.query('y == 1').shape[0] / node_obs.shape[0]
-(count_0 * np.log2(count_0) + count_1 * np.log2(count_1))

In [None]:
# add your code here
tree_clf20 = DecisionTreeClassifier(criterion='log_loss', max_depth=20).fit(X, y)
dfc['yhat'] = tree_clf20.predict(X)
dfc['yhat'] = dfc['yhat'].astype('category')
ggplot(dfc, aes(x = 'x1', y = 'x2')) + geom_point(aes(color = 'yhat'))

## 3.2 Grid search cross-validation <a name="three-two"></a>
We now perform a simple grid search over one parameter with cross-validation to find the optimal classification tree.

In [None]:
# define a grid for the max leaf nodes
param_grid_clf = {'max_leaf_nodes': list(range(2,100))}
# initialize the classifier
tree_clf = DecisionTreeClassifier()
# initialize the grid search
tree_clf_cv = GridSearchCV(tree_clf, param_grid_clf, cv=5, scoring='f1')
# fit the cross-validation
tree_clf_cv.fit(X,y)

In [None]:
# obtain the CV results
results_clf_cv = tree_clf_cv.cv_results_
# collect in a dataframe
results_clf_pd = pd.DataFrame.from_dict({'size':param_grid_clf['max_leaf_nodes'],'score':results_clf_cv['mean_test_score'],'rank':results_clf_cv['rank_test_score']}).sort_values('rank')
# inspect the top results
results_clf_pd.iloc[0:6]

In [None]:
# get the optimal tuning parameter value
opt_size_clf = results_clf_pd[results_clf_pd['rank'] == 1]['size'].min().astype(int)
# fit and predict from the optimal tree
pred_clf = DecisionTreeClassifier(max_leaf_nodes=opt_size_clf).fit(X,y).predict(X)
# plot the predicted class labels
dfc['yhat'] = pred_clf
dfc['yhat'] = dfc['yhat'].astype('category')
ggplot(dfc, aes(x = 'x1', y = 'x2')) + geom_point(aes(color = 'yhat'))

**Your turn!**

* What do you think of the "optimal" classifier?
* Expand the grid search via the code above to include more tuning parameters and obtain a better classifier.

In [None]:
# add your code here
dfc['yhat'] = DecisionTreeClassifier(max_leaf_nodes=75).fit(X,y).predict(X)
dfc['yhat'] = dfc['yhat'].astype('category')
ggplot(dfc, aes(x = 'x1', y = 'x2')) + geom_point(aes(color = 'yhat'))

# Chapter 4 - Actuarial tree <a name="four"></a>
So far we have seen how decision trees can be applied to classical regression and classification problems. Now it is time to tackle more actuarial applications with decision trees.

## 4.1 MTPL data <a name="four-one"></a>
We will use a Belgian motor third party liability (MTPL) dataset to illustrate how decision trees can be used for insurance pricing applications. Let's start by reading and preparing the data in a `pandas` dataframe:

In [None]:
# read the MTPL data
mtpl = pd.read_csv("https://katrienantonio.github.io/hands-on-machine-learning-R-module-1/data/PC_data.txt", delimiter = "\t", usecols=list(range(1,14)))
# transform the column names to lowercase
mtpl.columns = mtpl.columns.str.lower()
# rename the exp column to expo
mtpl = mtpl.rename(columns= {'exp': 'expo'})
# print the shape
print(mtpl.shape)
# show the first observations
mtpl.head(100)

Our columns have different types, namely integers, floats and objects (which can be seen as strings in our case):

In [None]:
# get types of all the columns
mtpl.dtypes

For ML applications, we need to transform all our columns to a numerical input. We will transform our categorical features to integers:

In [None]:
# map string values to integers for certain columns
mtpl['coverage'] = mtpl['coverage'].map({'TPL':0, 'PO':1, 'FO':2})
mtpl['fleet'] = mtpl['fleet'].map({'N':0, 'Y':1})
mtpl['fuel'] = mtpl['fuel'].map({'gasoline':0, 'diesel':1})
mtpl['use'] = mtpl['use'].map({'private':0, 'work':1})
mtpl['sex'] = mtpl['sex'].map({'male':0, 'female':1})

# check whether the types have changed
mtpl.dtypes
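
One caveat of `map` with a dictionary is that any value missing from the dictionary silently becomes `NaN`. The sketch below shows a quick check on a small hypothetical column (the series `fuel_demo` with a deliberately misspelled level is made up for this example, it is not taken from the MTPL data).

```python
import pandas as pd

# hypothetical column with one level ("Diesel") that is not in the mapping
fuel_demo = pd.Series(["gasoline", "diesel", "Diesel", "gasoline"])
fuel_map = {"gasoline": 0, "diesel": 1}

encoded = fuel_demo.map(fuel_map)

# levels that were not covered by the mapping end up as NaN
unmapped = fuel_demo[encoded.isna()].unique()
n_unmapped = int(encoded.isna().sum())
print(n_unmapped, list(unmapped))
```

Running a check like this after each mapping helps to confirm that every level was encoded before the data is passed to a model.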

Our MTPL data in numerical format now looks like this:

In [None]:
# show the first observations
mtpl.head(100)

## 4.2 Claim frequency <a name="four-two"></a>
We start our modeling efforts by building a claim frequency `DecisionTreeRegressor` model for the MTPL dataset. The [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) teaches us that there is a `poisson` criterion which uses the reduction in Poisson deviance to find splits, which sounds great for our use case.

First, we create our feature matrix with all the MTPL features:

In [None]:
# cols to retain as features
feat_cols = ['bm','ageph','agec','power','coverage','fuel','sex','fleet','use']
# subset the data
X_mtpl_freq = mtpl[feat_cols]
# print the shape
print(X_mtpl_freq.shape)
# show the features
X_mtpl_freq

Next, we create our target and weight arrays from the number of claims and exposure:

In [None]:
# claim frequency (nclaims/expo) as target
y_mtpl_freq = np.array(mtpl.nclaims/mtpl.expo)
# exposure as weights
w_mtpl_freq = np.array(mtpl.expo)

Finally, we fit our claim frequency regression tree with the `poisson` criterion:

In [None]:
# initialize a tree of depth 2
tree_freq = DecisionTreeRegressor(criterion='poisson', max_depth=2, min_samples_split=10000, min_samples_leaf=5000)
# fit the tree to our target with weights
tree_freq.fit(X=X_mtpl_freq, y=y_mtpl_freq, sample_weight=w_mtpl_freq)
# print the tree
tree_freq

We plot the tree structure and make it more readable by supplying feature names via the `feature_names` parameter:

In [None]:
# plot the tree structure
plt.figure(figsize=(8, 5), dpi=100)
plot_tree(tree_freq, feature_names=feat_cols);

Let's take a step back and see whether the prediction of 0.139 makes sense for the root node. This would be the overall claim frequency for the entire portfolio, which we can calculate as:

In [None]:
# empirical claim frequency portfolio
np.sum(mtpl.nclaims) / np.sum(mtpl.expo)

This checks out, nice! Given that we are doing weighted regression, this is calculated as follows by `sklearn`:

In [None]:
# weighted overall prediction
np.sum(y_mtpl_freq * w_mtpl_freq)/np.sum(w_mtpl_freq)

This shows that our specification of the target and weights makes sense from a portfolio perspective.

Let's now calculate the numbers for the bottom right node with `bm > 10.5`, starting with the number of samples:

In [None]:
# subset the mtpl data
mtpl_subset = mtpl.query('bm > 10.5')
# get the number of samples
mtpl_subset.shape[0]

Next, we will calculate the prediction value via the `predict` method:

In [None]:
# predict for the subset of policyholders
tree_freq_pred = tree_freq.predict(mtpl_subset[feat_cols])
# take the unique values
np.unique(tree_freq_pred)

Note that this prediction is expressed per unit of exposure, i.e., as an **annual claim frequency** for someone in this group. It equals the total number of claims divided by the total exposure in the node:

In [None]:
np.sum(mtpl_subset['nclaims'])/np.sum(mtpl_subset['expo'])

Given this prediction, we can now calculate the Poisson deviance with `sklearn.metrics.mean_poisson_deviance`:

In [None]:
# calculate the Poisson deviance
from sklearn.metrics import mean_poisson_deviance
mean_poisson_deviance(mtpl_subset.eval('nclaims/expo'), tree_freq_pred, sample_weight = mtpl_subset.eval('expo'))/2

There are three important things to note here:
1. both `y_true` and `y_pred` are expressed on an annual basis
1. the exposure metric is supplied as a weight
1. the `DecisionTreeRegressor` class implements half the Poisson deviance as an impurity measure ([GitHub code](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/tree/_criterion.pyx))
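
The three points above can be checked on a handful of policies. The sketch below computes the exposure-weighted mean Poisson deviance by hand and compares it with `mean_poisson_deviance`; the six policies in `nclaims_demo` and `expo_demo` are made-up numbers for illustration.

```python
import numpy as np
from sklearn.metrics import mean_poisson_deviance

# hypothetical mini portfolio: claim counts and exposures (in years)
nclaims_demo = np.array([0, 1, 0, 2, 0, 1])
expo_demo = np.array([1.0, 0.5, 0.25, 1.0, 0.75, 1.0])
freq_demo = nclaims_demo / expo_demo  # annual claim frequency per policy

# node prediction: exposure-weighted mean frequency = total claims / total exposure
mu = np.sum(nclaims_demo) / np.sum(expo_demo)

# unit deviance 2 * (y * log(y / mu) - y + mu), where y * log(y) is taken as 0 for y == 0
safe_y = np.where(freq_demo > 0, freq_demo, 1.0)
unit_dev = 2 * (freq_demo * np.log(safe_y / mu) - freq_demo + mu)
manual_dev = np.sum(expo_demo * unit_dev) / np.sum(expo_demo)

sklearn_dev = mean_poisson_deviance(freq_demo, np.full_like(freq_demo, mu), sample_weight=expo_demo)
print(manual_dev, sklearn_dev, manual_dev / 2)
```

The last printed number, half the mean deviance, is the quantity to compare with the impurity shown in the tree plot.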

**Your turn!**

* Experiment with the settings to obtain a more detailed idea of which features are driving the claim frequency
* At what point is the tree not interpretable enough? We'll tackle this issue in the next section, stay tuned!

In [None]:
# add your code here
param_grid_freq = {'max_leaf_nodes': list(range(3,20))}
tree_freq = DecisionTreeRegressor(criterion='poisson')
tree_freq_cv = GridSearchCV(tree_freq, param_grid_freq, cv=5, scoring='neg_mean_poisson_deviance')
tree_freq_cv.fit(X_mtpl_freq, y=y_mtpl_freq, sample_weight=w_mtpl_freq)

In [None]:
# add your code here
results_freq_cv = tree_freq_cv.cv_results_
results_freq_pd = pd.DataFrame.from_dict({'size':param_grid_freq['max_leaf_nodes'],'score':results_freq_cv['mean_test_score'],'rank':results_freq_cv['rank_test_score']}).sort_values('rank')
results_freq_pd.iloc[0:6]

In [None]:
# add your code here
opt_size_freq = results_freq_pd[results_freq_pd['rank'] == 1]['size'].min().astype(int)
tree_freq = DecisionTreeRegressor(criterion='poisson', max_leaf_nodes=opt_size_freq).fit(X_mtpl_freq, y_mtpl_freq, w_mtpl_freq)
plt.figure(figsize=(20, 10), dpi=100)
plot_tree(tree_freq, feature_names=feat_cols);

## 4.3 Claim severity <a name="four-three"></a>
We continue our modeling efforts by building a claim severity `DecisionTreeRegressor` model for the MTPL dataset. The [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) teaches us, unfortunately, that there is no suitable criterion for long-tailed distributions like claim severity. So what can we do?

We will first try to model the log-transformed version of claim severity with an MSE criterion and exponentiate the results afterwards, similar to how we fit log-normal GLMs.

We first subset the data keeping only the observations with actual claims (within a certain range for simplicity):

In [None]:
# subset the data based on claim amount
mtpl_sev = mtpl.query('amount > 1 & amount < 100000')
mtpl_sev.shape

Next we extract the features, target (average claim size) and weights (the number of claims):

In [None]:
# features
X_mtpl_sev = mtpl_sev[feat_cols]
# target and log-transformed version
y_mtpl_sev = np.array(mtpl_sev.avg)
y_mtpl_sev_log = np.log(y_mtpl_sev)
# weights
w_mtpl_sev = np.array(mtpl_sev.nclaims)

We fit an MSE regression tree of depth one to the log-transformed target:

In [None]:
# fit a tree
tree_sev_log = DecisionTreeRegressor(criterion='squared_error', max_depth=1).fit(X_mtpl_sev, y_mtpl_sev_log, w_mtpl_sev)
# plot the tree
plot_tree(tree_sev_log, feature_names=feat_cols);

We can observe that the root node indeed predicts the weighted claim severity on the log scale for the entire portfolio:

In [None]:
# weighted mean of the portfolio log severity
np.sum(w_mtpl_sev * y_mtpl_sev_log) / np.sum(w_mtpl_sev)

How about the claim severity predictions on the actual or non-log scale? When we exponentiate the prediction result of the root node, we notice we are closer to the empirical median than to the mean severity:

In [None]:
# exponent of the root node prediction
print(np.exp(6.15))
# overall mean
print(np.mean(mtpl_sev.avg))
# overall median
print(np.median(mtpl_sev.avg))

This can be explained in two different ways:
1. The mean of the log-normal distribution is equal to $\exp(\mu + \sigma^2/2)$ and we are calculating $\exp(\mu)$, which is the median. But unlike GLMs, we do not get a proper estimate for $\sigma$ in a decision tree, only an estimate for $\mu$.
2. The exponential of an average is not equal to the average of the exponentials: $\exp\left(\frac{1}{n}\sum_{i=1}^n x_i\right) \neq \frac{1}{n}\sum_{i=1}^n \exp(x_i)$. By exponentiating the root node prediction we compute the former, while the mean severity requires the latter.
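
Both explanations can be illustrated with simulated log-normal severities. In the sketch below, the parameters `mu = 6` and `sigma = 1` and all variable names are assumptions chosen for the illustration, not estimates from the MTPL data.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = 6.0, 1.0
severity_demo = rng.lognormal(mean=mu, sigma=sigma, size=100_000)
log_sev = np.log(severity_demo)

# exponentiating the mean of the logs lands near the median, exp(mu)
exp_of_mean_log = np.exp(log_sev.mean())
# the arithmetic mean is close to exp(mu + sigma^2 / 2) instead
arithmetic_mean = severity_demo.mean()
# adding half the variance of the logs before exponentiating closes the gap
corrected = np.exp(log_sev.mean() + log_sev.var() / 2)

print(exp_of_mean_log, np.median(severity_demo), arithmetic_mean, corrected)
```

The correction in the last step relies on the log-normal assumption and on a variance estimate, which a single regression tree on the log scale does not provide per leaf.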

Another approach is to model the claim severity directly with an MSE regression tree:

In [None]:
# fit a tree
tree_sev = DecisionTreeRegressor(criterion='squared_error', max_depth=1).fit(X_mtpl_sev, y_mtpl_sev, w_mtpl_sev)
# plot the tree
plot_tree(tree_sev, feature_names=feat_cols);

Even though this is typically not a valid statistical assumption for claim severities, it leads to a correct estimate of the average claim severity in the root node:

In [None]:
# overall claim severity
np.sum(mtpl_sev.amount) / np.sum(mtpl_sev.nclaims)

In [None]:
# weighted overall prediction
np.sum(y_mtpl_sev * w_mtpl_sev)/np.sum(w_mtpl_sev)
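
The two numbers agree because of simple arithmetic, assuming that `avg` is defined per record as `amount / nclaims` (the column names suggest this, but the definition is not shown in this part of the notebook). Writing $a_i$ for the claim amount and $n_i$ for the number of claims of record $i$ (notation introduced here for illustration), the weighted mean of the average severities is:

$$\frac{\sum_i n_i \, (a_i / n_i)}{\sum_i n_i} = \frac{\sum_i a_i}{\sum_i n_i}$$

The weights cancel the denominators of the individual averages, so the root node prediction equals the total claim amount divided by the total number of claims.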

Unfortunately, there is no ideal way to model claim severities with a decision tree in `scikit-learn`, because the relevant distributional loss functions are not implemented (yet?). Later on, we will see some ensemble methods that do support them, so stay tuned.

# Chapter 5 - Interpretation tools <a name="five"></a>
Decision trees are typically considered explainable models given their simple structure. However, interpretation can become difficult for complex trees with a deep structure. We therefore introduce some interpretation tools that can help you understand your model better. These are model-agnostic tools, meaning that they can be applied to any type of ML model. Let's use the following claim frequency tree as an example:

In [None]:
# fit a moderately sized frequency tree and plot the structure
tree_freq = DecisionTreeRegressor(criterion='poisson', max_leaf_nodes=25, min_samples_leaf = 1000).fit(X_mtpl_freq, y_mtpl_freq, w_mtpl_freq)
plt.figure(figsize=(20, 10), dpi=100)
plot_tree(tree_freq, feature_names=feat_cols);

Can you tell which features are driving the claim frequency predictions? And can you explain how certain features relate to the prediction target? If not, no worries, the following two tools will assist you with exactly those questions.

## 5.1 Feature importance <a name="five-one"></a>
The feature importance metric explains how important each feature is in your ML model. Simple, right? In `sklearn` these values are available in the `feature_importances_` attribute of your fitted model object:

In [None]:
# obtain the feature importance values
tree_freq.feature_importances_

We can see that the bonus-malus feature accounts for over 80% of the predictive power in the model, followed by the age of the policyholder and power of the car with respectively 10% and 5%. The features `sex`, `fleet` and `use` are not used in the tree and have zero importance:

In [None]:
# collect the feature names and importance scores
tree_freq_fi = pd.DataFrame({'feature':tree_freq.feature_names_in_, 'importance':tree_freq.feature_importances_}).sort_values('importance', ascending=False)
# inspect the results
tree_freq_fi

The importance of a feature is computed as the total reduction of the criterion, in our case the Poisson deviance, brought by that feature. In the end, these values are normalized to sum to one over all features.
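
As a minimal sketch of this computation, the following recomputes the importances of a small toy tree from its `tree_` attribute and compares them with `feature_importances_`. The simulated data, the name `toy_tree` and the squared error criterion are assumptions made for illustration; the same bookkeeping should apply to the Poisson criterion used for `tree_freq`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# small simulated regression problem (illustrative data only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] + np.sin(2 * X[:, 1]) + rng.normal(scale=0.1, size=500)

toy_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
t = toy_tree.tree_

# accumulate the weighted impurity reduction of every split per feature
manual_fi = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, no contribution
        continue
    reduction = (t.weighted_n_node_samples[node] * t.impurity[node]
                 - t.weighted_n_node_samples[left] * t.impurity[left]
                 - t.weighted_n_node_samples[right] * t.impurity[right])
    manual_fi[t.feature[node]] += reduction

# normalize so that the importances sum to one
manual_fi /= manual_fi.sum()

print(manual_fi)
print(toy_tree.feature_importances_)
```

Each split contributes the impurity of the parent node minus the impurities of its two children, each weighted by the (weighted) number of samples in the node.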

This approach can however be misleading for high cardinality features with many unique values, as these have more split options to start with. Impurity-based feature importance scores can therefore give biased results with an over/underestimation of the importance for high/low cardinality features.

An alternative is a permutation-based importance score, which is calculated as follows. First, a baseline performance metric is evaluated on a benchmark dataset. Next, a single feature column of that dataset is randomly permuted (shuffled) and the metric is evaluated again. The permutation importance is defined as the difference between the baseline metric and the metric obtained after permuting the feature column. Shuffling breaks the relationship between the feature and the target, so the drop in model score indicates how much the model depends on that feature. The larger the drop, the more important the feature.
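
The procedure can be sketched by hand in a few lines. This is only an illustration under stated assumptions: the data are simulated, `predict` is a hypothetical stand-in for a fitted model, and the metric is a loss (MSE), so the importance is measured as the increase in loss rather than the drop in a score.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

def predict(X):
    # hypothetical stand-in for a fitted model; it ignores the third feature
    return 2.0 * X[:, 0] + 0.5 * X[:, 1]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# step 1: baseline loss on the unshuffled data
baseline = mse(y, predict(X))

perm_importance = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    X_perm = X.copy()
    # step 2: shuffle one column to break its link with the target
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    # step 3: increase in loss relative to the baseline
    perm_importance[j] = mse(y, predict(X_perm)) - baseline

print(perm_importance)
```

The feature that the stand-in model never uses gets an importance of zero, since shuffling it leaves every prediction unchanged.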

Let's test this approach out via the `sklearn.inspection.permutation_importance` function ([docs](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html)):


In [None]:
# calculate the permutation importance
perm_imp = permutation_importance(tree_freq, X_mtpl_freq, y_mtpl_freq, sample_weight=w_mtpl_freq, scoring='neg_mean_poisson_deviance', n_repeats=5, random_state=0, max_samples=1.0)
perm_imp

We get multiple importance scores per feature (as many as `n_repeats`), so we extract the average score per feature and normalize these averages to sum to one over all features:

In [None]:
# extract the average
pi = perm_imp['importances_mean']
# normalize and add to results
tree_freq_fi['permutation'] = pi / np.sum(pi)
# show results
tree_freq_fi

We observe that both importance measures are very similar, but that the higher cardinality features like `bm`, `ageph`, `power` and `agec` lose some importance to the lower cardinality features like `fuel` and `coverage`.

We now know which features are driving the predictions for our claim frequency tree, but how do they do it?

## 5.2 Partial dependence <a name="five-two"></a>
A partial dependence plot (PDP) quantifies and shows the relation between one or more features and the prediction target. The partial dependence corresponds to the average prediction of a model for each possible value of the feature. In `sklearn` we have two options to generate PDPs:
1. Create a built-in display graph directly from the fitted model object via the `sklearn.inspection.PartialDependenceDisplay.from_estimator` function ([docs](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.PartialDependenceDisplay.html#sklearn.inspection.PartialDependenceDisplay.from_estimator)).
2. Calculate PD data via the `sklearn.inspection.partial_dependence` function ([docs](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.partial_dependence.html#sklearn.inspection.partial_dependence)) and create a custom plot.
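
Before turning to these two options, the definition above can be sketched by hand: fix the feature at a grid value for every observation, predict, and average. The simulated data, the stand-in `predict` function and the helper name `partial_dependence_1d` are hypothetical and only serve as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1_000, 2))

def predict(X):
    # hypothetical stand-in for a fitted model
    return 0.1 * X[:, 0] + np.where(X[:, 1] > 5, 1.0, 0.0)

def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when column `feature` is fixed at each grid value."""
    pd_values = []
    for value in grid:
        X_fixed = X.copy()
        X_fixed[:, feature] = value  # overwrite the feature for every row
        pd_values.append(predict(X_fixed).mean())
    return np.array(pd_values)

grid = np.linspace(0, 10, 5)
pd_x0 = partial_dependence_1d(predict, X, feature=0, grid=grid)
print(pd_x0)
```

The other features keep their observed values, so the result averages out their effect over the data.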

Let's start with the first approach and create a built-in PDP for some features:

In [None]:
# create pdps for a couple of features
fig, ax = plt.subplots(figsize=(15, 10))
PartialDependenceDisplay.from_estimator(tree_freq, X_mtpl_freq, features = ['bm','ageph','power','fuel','agec','coverage'], categorical_features=['fuel','coverage'], kind='average', ax=ax);

It is also possible to visualize interaction effects of two features by supplying a tuple to the `features` parameter:

In [None]:
# create 2D PDP
PartialDependenceDisplay.from_estimator(tree_freq, X_mtpl_freq, features = [('ageph','power')], kind='average');

Next we calculate the partial dependence data to do some custom plots:

In [None]:
# calculate the pd for bm
tree_pd_bm = partial_dependence(tree_freq, X_mtpl_freq, features = ['bm'], percentiles=(0.05, 0.95), grid_resolution=100, kind='average')
tree_pd_bm

In [None]:
# transform the result to a pandas dataframe
# note: scikit-learn >= 1.3 stores the grid under 'grid_values' instead of 'values'
grid_key = 'grid_values' if 'grid_values' in tree_pd_bm else 'values'
tree_pd_bm = pd.DataFrame({'bm': tree_pd_bm[grid_key][0], 'pd': tree_pd_bm['average'][0]})
tree_pd_bm

In [None]:
# custom PD ggplot
ggplot(tree_pd_bm, aes(x = 'bm')) + geom_line(aes(y = 'pd'), colour = 'darkblue', size = 1)

**Your turn!**

* Feel free to experiment with the interpretation tools to explore the tree or create a cool custom graph.
* Try to replicate the PD effect for a specific value of a feature of choice.

In [None]:
# partial dependence of bm, computed with the brute-force method
partial_dependence(tree_freq, X_mtpl_freq, features = ['bm'], method='brute', kind='average')

In [None]:
# copy the data and fix bm at 1 for every observation
X_mtpl_freq_adj = X_mtpl_freq.copy()
X_mtpl_freq_adj['bm'] = 1
X_mtpl_freq_adj

In [None]:
# average prediction with bm fixed at 1, to compare with the PD value at bm = 1
tree_freq.predict(X_mtpl_freq_adj).mean()