# DecisionTreeClassifier with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Code
```python
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(<parameters>)
model.fit(X, y)
y_new = model.predict(X_test)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
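
As a concrete instance of the template above, here is a minimal sketch on a made-up one-feature dataset. The arrays `X`, `y`, and `X_test` below are illustrative assumptions and are not part of the exercises.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy data: one feature, class 0 for small values and class 1 for large values
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# predict the class of two unseen points
X_test = np.array([[0.5], [2.5]])
y_new = model.predict(X_test)
print(y_new)
```

Setting `random_state` only fixes the tie-breaking inside the algorithm, so repeated runs give the same tree.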

## Parameters
- `criterion`: `"gini"` or `"entropy"`  
the function used to measure the quality of a split
- `max_depth`: an integer, the maximum depth of the tree
- `min_samples_split`: an integer, the minimum number of samples a node must contain in order to be split further

and many others.
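
To see what a parameter such as `max_depth` does, the following sketch compares a tree limited to a single split with a tree that is allowed to grow freely. The toy data and the names `shallow` and `full` are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy data whose labels alternate, so a single split cannot separate the classes
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 1, 0, 1])

# same criterion, different depth limits
shallow = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=0).fit(X, y)
full = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# depth and training accuracy of each tree
print(shallow.get_depth(), shallow.score(X, y))
print(full.get_depth(), full.score(X, y))
```

The depth-limited tree cannot fit the training data perfectly, while the unrestricted tree keeps splitting until every leaf is pure.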

## Attributes
- `classes_`: an array of shape `(n_classes,)`  
(Usually `0, ..., n_classes-1`)
- `feature_importances_`: an array of shape `(n_features,)`  
the total importance (impurity reduction) of each feature
- `tree_`: the constructed decision tree

For `model.tree_`  
- `children_left[i]`: id of the left child of node i or -1 if leaf node
- `children_right[i]`: id of the right child of node i or -1 if leaf node
- `feature[i]`: feature used for splitting node i
- `threshold[i]`: threshold value at node i
- `n_node_samples[i]`: the number of training samples reaching node i
- `impurity[i]`: the impurity at node i

(Source: [Understanding the decision tree structure](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html))
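
The arrays listed above can be read node by node. The sketch below, on an assumed toy dataset, prints one line per node and uses `children_left[i] == -1` to tell leaves from split nodes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy data (an assumption for illustration): one split at 1.5 separates the classes
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(X, y)

T = model.tree_
for i in range(T.node_count):
    if T.children_left[i] == -1:
        # leaf node: it has no children and no meaningful feature/threshold
        print(f"node {i}: leaf, samples={T.n_node_samples[i]}, "
              f"impurity={T.impurity[i]:.3f}")
    else:
        # split node: samples with X[:, feature] <= threshold go to the left child
        print(f"node {i}: feature {T.feature[i]} <= {T.threshold[i]:.2f}, "
              f"left={T.children_left[i]}, right={T.children_right[i]}")
```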

## Sample data

##### Exercise 1
Let  
```python
mu1 = np.array([1,1])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-1,-1])
cov2 = np.array([[1.1,-1],
                [-1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
y = np.array([0]*100 + [1]*100)
```

###### 1(a)
Plot the points (rows) in `X` with `c=y` .  

In [None]:
### your answer here

###### 1(b)
Draw 1000 random points uniformly on the region $-5\leq x\leq 5$ and $-5\leq y\leq 5$.  
Train a decision tree model on `X` and `y`, then use the trained model to make a prediction `y_new` for these points.  
Then plot these 1000 points with `c=y_new` .

In [None]:
### your answer here

###### 1(c)
Train a decision tree model (if you have not already done so).  
Then run  
```python
from sklearn.tree import plot_tree
plot_tree(model)
```
Then answer the following questions:
- Is the number of samples in a node equal to the sum of those in its two children?
- What can you say about the `gini` value at each leaf node?
- What can you say about the `value` distribution at each leaf node?
- How many samples satisfy the criterion at the root node?  It should be the same as the number of samples in the left child (of the root).

In [None]:
### your answer here

##### Exercise 2
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
```

###### 2(a)
Apply the decision tree classification algorithm to `X` and `y` .  
Make a prediction on the training data.  
What is the accuracy?

In [None]:
### your answer here

###### 2(b)
Plot the decision tree.  
If a bag of $n$ balls contains $n_i$ balls of color $i$ (for colors $i=0,\ldots,c-1$), the **Gini impurity** of this bag (distribution) is  
$$\sum_{i=0}^{c-1} p_i(1 - p_i),$$
where $p_i = n_i/n$ is the probability of getting a ball of color $i$.

Pick a node and check if this formula is correct.
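
The formula can be evaluated directly from the class counts. The helper `gini_impurity` below is a hypothetical name introduced for this sketch, and the counts are made-up examples; if your plot shows `value` as fractions rather than counts, the same function works because it normalizes its input.

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity of a bag, given the number of balls of each color."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()      # p_i = n_i / n
    return np.sum(p * (1 - p))     # sum of p_i * (1 - p_i)

print(gini_impurity([50, 50, 50]))  # three equally likely colors
print(gini_impurity([0, 49, 5]))    # mostly one color
print(gini_impurity([0, 0, 43]))    # a pure bag
```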

In [None]:
### your answer here

###### 2(c)
Change the model setting to `criterion="entropy"` and plot the decision tree again.  
If a bag of $n$ balls contains $n_i$ balls of color $i$ (for colors $i=0,\ldots,c-1$), the **entropy** of this bag (distribution) is  
$$\sum_{i=0}^{c-1} -p_i\log_2(p_i),$$
where $p_i = n_i/n$ is the probability of getting a ball of color $i$.

Pick a node and check if this formula is correct.
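
As with the Gini impurity, the entropy can be computed from the class counts. The helper name `entropy` and the counts are assumptions for this sketch; colors with $p_i = 0$ are dropped, using the convention that $p\log_2 p$ is $0$ when $p = 0$.

```python
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a bag, given the number of balls of each color."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()      # p_i = n_i / n
    p = p[p > 0]                   # skip empty colors to avoid log2(0)
    return -np.sum(p * np.log2(p))

print(entropy([50, 50]))       # two equally likely colors
print(entropy([50, 50, 50]))   # three equally likely colors
print(entropy([0, 49, 5]))     # mostly one color
```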

In [None]:
### your answer here

##### Exercise 3
Let  
```python
from sklearn.datasets import load_digits
digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X = digits.data[mask]
y = digits.target[mask]
```

###### 3(a)
Train a decision tree classification model.  
What is its accuracy score?

In [None]:
### your answer here

###### 3(b)
Use any software or online app to draw a picture of 0 or 1.  
Save it as a file, e.g., `my_digit.png` .  
Use the following code to load it.  
```python
from PIL import Image
img = Image.open("my_digit.png").resize((8,8))  # resize expects a (width, height) tuple
```
Does the model give you the right answer?  
Each of you can do 5 pictures.  
Let's see what the accuracy score is.
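
Before the model can use the picture, it has to be turned into a row of 64 numbers on the same scale as `digits.data`, whose entries range from 0 to 16 with larger values meaning more ink. The sketch below is one possible conversion, under the assumption that the drawing is dark ink on a light background; the in-memory image stands in for your own file, and the name `x_new` is hypothetical.

```python
import numpy as np
from PIL import Image, ImageDraw

# stand-in for Image.open("my_digit.png"): a white canvas with a black vertical stroke
img = Image.new("L", (64, 64), color=255)
ImageDraw.Draw(img).line([(32, 8), (32, 56)], fill=0, width=8)

# convert to grayscale, shrink to 8x8, and read the pixels as floats in [0, 255]
small = np.asarray(img.convert("L").resize((8, 8)), dtype=float)

# rescale to [0, 16] and invert, so that ink is large and the background is 0
x_new = (16 - small / 255 * 16).reshape(1, -1)
print(x_new.shape)
```

With a trained `model`, `model.predict(x_new)` then gives the predicted digit. If your picture is light ink on a dark background, skip the inversion and use `small / 255 * 16` instead.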

In [None]:
### your answer here

## Experiments

##### Exercise 4
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
```  
and `model` be your decision tree classification model.

###### 4(a)
Plot the decision tree with the keyword `node_ids=True` .  
(If necessary, you may use `plt.figure(figsize=(15,15))` to change the figure size.)

In [None]:
### your answer here

###### 4(b)
Let `T = model.tree_` .  
Print `T.children_left` and `T.children_right` .  
Compare these two arrays with the decision tree you printed.  
What do they mean?

In [None]:
### your answer here

###### 4(c)
Print `T.feature` and `T.threshold` .  
Compare these two arrays with the decision tree you printed.  
What do they mean?

In [None]:
### your answer here

###### 4(d)
Print `T.n_node_samples` and `T.impurity` .  
Compare these two arrays with the decision tree you printed.  
What do they mean?

In [None]:
### your answer here

###### 4(e)
For each `i = 0,1,2,3`, count how many nodes use feature `i` for splitting.

In [None]:
### your answer here

###### 4(f)
Suppose there are $N$ sample points in the training data.  
Suppose a node contains $n_s$ sample points.  
Within these sample points, there is a chance of $p_i$ to get class $i$.  
One may calculate the impurity $H$ (Gini or entropy) at this point.  

Suppose the "information" at each node is  
$$I = \frac{n_s}{N}\cdot H.$$
Calculate the information at each node.
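
To make the formula concrete, here is a sketch on a hand-made tree with three nodes. The arrays below are made-up numbers that mimic `T.n_node_samples` and `T.impurity`; they do not come from the iris model.

```python
import numpy as np

# hand-made tree: a root (node 0) and its two leaf children (nodes 1 and 2)
n_node_samples = np.array([10, 5, 5])     # n_s for each node
impurity = np.array([0.48, 0.0, 0.32])    # H for each node (Gini impurity here)

N = n_node_samples[0]                     # the root contains every training sample
info = n_node_samples / N * impurity      # I = (n_s / N) * H, computed for all nodes at once
print(info)
```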

In [None]:
### your answer here

###### 4(g)
Suppose $I$ is the information at one node, while $I_\ell$ and $I_r$ are the information at its left and right children, respectively.  
The **information gain** at this node is $I - I_\ell - I_r$.  
Calculate the information gain at each node.
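
The sketch below computes the gain on a hand-made three-node tree. It takes the gain to be the information at the parent minus the information at its two children, so a split that reduces impurity has positive gain, and it assigns gain 0 to leaves; the arrays are made-up stand-ins for `T.children_left`, `T.children_right`, and the information computed earlier.

```python
import numpy as np

# hand-made tree: node 0 is the root, nodes 1 and 2 are its leaf children
children_left = np.array([1, -1, -1])
children_right = np.array([2, -1, -1])
info = np.array([0.48, 0.0, 0.16])        # information I at each node

gain = np.zeros_like(info)
for i in range(len(info)):
    left, right = children_left[i], children_right[i]
    if left != -1:                        # leaves have no children; their gain stays 0
        gain[i] = info[i] - info[left] - info[right]
print(gain)
```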

In [None]:
### your answer here

###### 4(h)
Let $W_i$ be the sum of information gain among nodes using feature $i$ for splitting.  
Calculate an array `W` whose entries are $W_i$ for each feature $i$.  
Let `W = W / W.sum()` .  
Compare `W` with `model.feature_importances_` .

In [None]:
### your answer here

##### Exercise 5
Let  
```python 
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths < 2)
band2 = (lengths > 3) & (lengths < 4)
keep = band1 | band2
X = X[keep, :]
# label 0 for points in band1 and 1 for points in band2, in the same order as the rows of X
y = band2[keep].astype(int)
```

###### 5(a)
Go through the split-train-test process.  
What is the accuracy score?
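
The split-train-test process can be sketched as follows on an assumed toy dataset. The names `X_toy` and `y_toy`, the 25% test size, and the fixed `random_state` are choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy data: class 1 exactly when the single feature is at least 50
X_toy = np.arange(100, dtype=float).reshape(-1, 1)
y_toy = (X_toy[:, 0] >= 50).astype(int)

# split: hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

# train: fit the model on the training part only
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# test: accuracy on samples the model has never seen
acc = model.score(X_test, y_test)
print(acc)
```

Evaluating on held-out data gives a fairer estimate than the training accuracy, which a fully grown tree can push to 1 by memorizing the samples.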

In [None]:
### your answer here

###### 5(b)
Use some random points to plot the regions for each class.  
(Just as we did in Exercise 1.)

In [None]:
### your answer here