There are 2 ways to regularize Decision Trees:
1. Pre-pruning: setting some stopping criteria before model fitting
2. Post-pruning: build a full tree, analyze its structure and prune it to some reduced version

Cost-complexity pruning (CCP) simplifies the tree structure by minimizing a cost function augmented with a complexity regularizer.

Let's denote
- $T_n$ = subtree with a root in node n
- $T - T_n$ = pruned tree (n becomes leaf node)
- $R(T)$ = cost function of a tree
- $|T|$ = number of leaves in a tree 
- $R(n)$ = cost function of a node
- $w(n)$ = size of a partition (node weight)
- $p(n)$ = impurity measure in the node

The cost function of a tree is computed over its leaves:
$$R(T) = \sum_{n \in leaves} R(n) = \sum_{n \in leaves} w(n) \cdot p(n) $$

Adding a complexity regularizer to the cost function gives:
$$R_\alpha(T) = \sum_{n \in leaves} R(n) + \alpha \cdot |T|$$
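
The two formulas above can be sketched on a toy tree. This is a minimal illustration, not a library implementation: the `Node` class, the helper names and all the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    weight: float                    # w(n): fraction of samples that reach the node
    impurity: float                  # p(n): impurity measure in the node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaves(node):
    """Yield every leaf of the subtree rooted at `node`."""
    if node.left is None and node.right is None:
        yield node
    else:
        yield from leaves(node.left)
        yield from leaves(node.right)


def cost(tree):
    """R(T): sum of w(n) * p(n) over the leaves."""
    return sum(n.weight * n.impurity for n in leaves(tree))


def regularized_cost(tree, alpha):
    """R_alpha(T) = R(T) + alpha * |T|, where |T| is the number of leaves."""
    return cost(tree) + alpha * sum(1 for _ in leaves(tree))


# Hypothetical tree with 3 leaves (illustrative numbers only).
tree = Node(1.0, 0.5,
            left=Node(0.4, 0.0),
            right=Node(0.6, 0.4,
                       left=Node(0.3, 0.0),
                       right=Node(0.3, 0.2)))

print(cost(tree))                    # approximately 0.06 (only the last leaf is impure)
print(regularized_cost(tree, 0.01))  # approximately 0.06 + 0.01 * 3 = 0.09
```

Internal nodes carry a weight and an impurity too, but only the leaves enter $R(T)$.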

How does impurity change without $\alpha$?
- When we partition a node, impurity goes down
- When we prune a node, impurity goes up

Once we add the regularizer $\alpha$ to the impurity, this is no longer the case:
- if we use a small $\alpha$, the cost will still be higher after pruning
- if we use a large $\alpha$, the cost after pruning will be lower

Let's compute the effect of pruning tree $T$ by removing a subtree $T_n$:

$$R_\Delta(n) = R_{\alpha}(T-T_n) - R_{\alpha}(T) $$
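
As a quick check of the small-versus-large $\alpha$ behaviour, the sketch below evaluates this difference directly from the definition. The leaf costs are made up for illustration, and a tree is represented simply by the list of its leaf costs $R(n)$, which is an assumption of this example rather than anything sklearn does.

```python
def regularized_cost(leaf_costs, alpha):
    """R_alpha(T) for a tree described by the list of its leaf costs R(n)."""
    return sum(leaf_costs) + alpha * len(leaf_costs)


# Hypothetical tree T with three leaves; the last two leaves form the subtree T_n.
full_tree = [0.00, 0.00, 0.06]
# T - T_n: the subtree is collapsed into a single leaf n with cost R(n) = 0.24.
pruned_tree = [0.00, 0.24]


def pruning_delta(alpha):
    """R_Delta(n) = R_alpha(T - T_n) - R_alpha(T)."""
    return regularized_cost(pruned_tree, alpha) - regularized_cost(full_tree, alpha)


for alpha in (0.05, 0.5):
    delta = pruning_delta(alpha)
    verdict = "cost goes up" if delta > 0 else "cost goes down"
    print(f"alpha={alpha}: delta={delta:+.2f} ({verdict})")
```

With these numbers the delta is $0.18 - \alpha$, so pruning raises the regularized cost at $\alpha = 0.05$ and lowers it at $\alpha = 0.5$.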

Plugging in the definition of the regularized cost:
$$R_\Delta(n) = \bigg( R(T-T_n) + \alpha \cdot |T-T_n| \bigg) - \bigg(R(T) + \alpha \cdot |T|\bigg) $$

Let's simplify this expression a bit.

We removed the $|T_n|$ leaves of the subtree $T_n$ and replaced them with 1 leaf node. So the change in <u>complexity</u> is $|T - T_n| - |T| = 1 - |T_n| = -(|T_n| - 1)$.

Using the same logic, the change in <u>impurity</u> is $R(T - T_n) - R(T) = R(n) - R(T_n)$, because the leaves outside $T_n$ are untouched and their terms cancel.

<img src="img/ccp1.png" width=500>

So the cost delta becomes:
$$R_\Delta(n) = R(n) - R(T_n) - \alpha \cdot (|T_n| - 1)$$

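The choice of $\alpha$ determines whether the regularized cost increases or decreases when a candidate node is pruned. Pruning does not increase the cost ($R_\Delta(n) \le 0$) exactly when $\alpha$ reaches the threshold:

$$\alpha \ge \frac{R(n) - R(T_n)}{|T_n| - 1}$$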

We don't need to set $\alpha$ beforehand. We can first build a sequence of pruned trees and then deduce the optimal value.

Pruning process (weakest-link pruning):
- for each candidate (non-leaf) node, compute the cost gap, i.e. the value of $\alpha$ at which pruning that node breaks even
    $$g(n) = \frac{R(n) - R(T_n)}{|T_n| - 1}$$
- select the node with the lowest gap
    $$n^* = \arg\min_{n \in \{n_1, \dots, n_t\}} g(n)$$
- prune the tree at that node and continue to iterate
    $$T := T - T_{n^*}$$
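
The loop above can be sketched in plain Python. This is a toy illustration under stated assumptions, not sklearn's implementation: the `Node` class, the helper names and the node costs are hypothetical, and the gap is computed as the increase in leaf cost per leaf removed when a node is collapsed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    cost: float                      # R(n) = w(n) * p(n)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaves(node):
    """Leaves of the subtree rooted at `node` (a node has two children or none)."""
    if node.left is None:
        return [node]
    return leaves(node.left) + leaves(node.right)


def internal_nodes(node):
    """All non-leaf nodes of the subtree: the pruning candidates."""
    if node.left is None:
        return []
    return [node] + internal_nodes(node.left) + internal_nodes(node.right)


def gap(node):
    """g(n): increase in leaf cost per leaf removed if `node` is collapsed."""
    sub = leaves(node)
    return (node.cost - sum(leaf.cost for leaf in sub)) / (len(sub) - 1)


def pruning_path(root):
    """Collapse the weakest link until only the root is left.

    Mutates the tree in place and returns a list of (alpha, n_leaves, R(T)).
    """
    path = [(0.0, len(leaves(root)), sum(leaf.cost for leaf in leaves(root)))]
    while root.left is not None:
        weakest = min(internal_nodes(root), key=gap)
        alpha = gap(weakest)
        weakest.left = weakest.right = None          # T := T - T_n
        path.append((alpha, len(leaves(root)), sum(leaf.cost for leaf in leaves(root))))
    return path


# Hypothetical tree: root -> (a, b), b -> (c, d); costs are illustrative only.
root = Node("root", 0.50,
            left=Node("a", 0.00),
            right=Node("b", 0.24, left=Node("c", 0.00), right=Node("d", 0.06)))

path = pruning_path(root)
for alpha, n_leaves, total_cost in path:
    print(f"alpha={alpha:.2f}  leaves={n_leaves}  R(T)={total_cost:.2f}")
```

On this toy tree node `b` is collapsed first (gap 0.18) and the root second (gap 0.26), so the recorded alphas grow as the tree shrinks.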




## Scikit-learn Implementation

In sklearn, the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn-tree-decisiontreeclassifier) class implements some of the functionality described above:
- __ccp_alpha__ sets the regularization strength
- __cost_complexity_pruning_path()__ returns the sequence of effective alphas together with the total leaf impurity of the pruned tree at each alpha

Note that the alpha path here is computed on Train data, not Test data.

In [None]:
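from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
path = clf.cost_complexity_pruning_path(X_train, y_train)  # returns a Bunch, so read its fields by name
alphas, impurities = path.ccp_alphas, path.impurities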

We can visualize this alpha path which gives us some understanding of the alpha scale:
<img src="img/ccp2.png" width=500>
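
To connect the `impurities` returned by the path with the cost $R(T)$ defined earlier, the sketch below recomputes $R(T)$ from the leaves of a fitted tree and compares it with the first entry of the path. The iris dataset and `max_depth=3` are assumptions made only to keep the example small and the leaves impure. The `tree_` attributes used here (`children_left`, `impurity`, `weighted_n_node_samples`) are part of sklearn's low-level tree structure.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for the notebook's own training data.
X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree = clf.tree_

is_leaf = tree.children_left == -1                                        # leaves have no children
weights = tree.weighted_n_node_samples / tree.weighted_n_node_samples[0]  # w(n)
cost = np.sum(weights[is_leaf] * tree.impurity[is_leaf])                  # R(T)

# The path is computed for a tree with the same parameters and no pruning applied yet.
path = clf.cost_complexity_pruning_path(X, y)
print(cost, path.impurities[0])
```

The two printed values should agree, since the first point of the path corresponds to the unpruned tree.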

To evaluate the pruned trees on Test data, we need to construct this sequence manually by fitting one tree per alpha:

In [1]:
clfs = []
for alpha in alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)


Once we have done that, we can evaluate those trees any way we want, including finding the optimal $\alpha$ in the sequence.

In [None]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]

<img src="img/ccp3.png" width=500>
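
Picking the optimal $\alpha$ from the sequence can be sketched end to end as follows. The breast cancer dataset, the split and the tie-breaking rule are assumptions of this example. It scores the trees on a held-out validation split rather than the final test set, since choosing $\alpha$ on the test set would make the reported test score optimistic.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a stand-in dataset and a single train/validation split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Alpha path computed on the training data only.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas

# One pruned tree per alpha in the path.
clfs = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    for alpha in alphas
]
val_scores = [clf.score(X_val, y_val) for clf in clfs]

# Highest validation score; ties are broken toward the larger alpha (simpler tree).
best = max(range(len(alphas)), key=lambda i: (val_scores[i], alphas[i]))
print(f"best alpha={alphas[best]:.5f}  "
      f"val accuracy={val_scores[best]:.3f}  "
      f"nodes={clfs[best].tree_.node_count}")
```

A single split gives a noisy estimate, so cross-validating over the same `alphas` (for example with `GridSearchCV` on `ccp_alpha`) is a reasonable alternative.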

## CCP for Random Forests

[paper](https://www.researchgate.net/publication/315116126_Cost-Complexity_Pruning_of_Random_Forests), 2017

