# DecisionTreeClassifier with scikit-learn

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier ### added by Jephian

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

## Code
```python
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(<parameters>)
model.fit(X, y)
y_new = model.predict(X_test)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

## Parameters
- `criterion`: `"gini"` or `"entropy"`  
the function used to measure the quality of a split
- `max_depth`: an integer, the maximum depth of the tree
- `min_samples_split`: the minimum number of samples a node must contain to be split further

and many others.
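
As a rough illustration of how `max_depth` and `min_samples_split` limit tree growth, the sketch below fits three trees on toy data. The data and the names `X_demo`, `y_demo`, `full`, `shallow`, and `coarse` are assumptions made for this example and are not part of the exercises.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration): two overlapping Gaussian blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(1, 1.5, size=(100, 2)),
                    rng.normal(-1, 1.5, size=(100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

# With default settings the tree is not restricted in depth or node size.
full = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Limiting the depth, or requiring more samples before a split, restricts growth.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
coarse = DecisionTreeClassifier(min_samples_split=50, random_state=0).fit(X_demo, y_demo)

for name, m in [("default", full), ("max_depth=2", shallow), ("min_samples_split=50", coarse)]:
    print(name, "-> depth:", m.get_depth(), ", leaves:", m.get_n_leaves())
```

Comparing the printed depths and leaf counts gives a feel for how strongly each parameter prunes the tree.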

## Attributes
- `classes_`: an array of shape `(n_classes,)`  
(Usually `0, ..., n_classes-1`)
- `feature_importances_`: an array of shape `(n_features,)`  
the total importance (impurity reduction) of each feature
- `tree_`: the constructed decision tree

For `model.tree_`  
- `children_left[i]`: id of the left child of node i or -1 if leaf node
- `children_right[i]`: id of the right child of node i or -1 if leaf node
- `feature[i]`: feature used for splitting node i
- `threshold[i]`: threshold value at node i
- `n_node_samples[i]`: the number of training samples reaching node i
- `impurity[i]`: the impurity at node i

(Source: [Understanding the decision tree structure](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html))
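
The arrays listed above are enough to walk through a fitted tree by hand. The following is a minimal sketch, assuming a small depth-2 tree on the iris data; the helper `describe` and the names `iris_demo`, `demo_model`, and `T_demo` are hypothetical and introduced only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption for illustration: a small depth-2 tree on the iris data.
iris_demo = load_iris()
demo_model = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_model.fit(iris_demo.data, iris_demo.target)
T_demo = demo_model.tree_

def describe(node=0, depth=0):
    """Return one text line per node, visiting the tree depth-first."""
    pad = "  " * depth
    left, right = T_demo.children_left[node], T_demo.children_right[node]
    if left == -1:  # a leaf has no children
        return [f"{pad}node {node}: leaf, samples={T_demo.n_node_samples[node]}, "
                f"impurity={T_demo.impurity[node]:.3f}"]
    line = (f"{pad}node {node}: x[{T_demo.feature[node]}] <= {T_demo.threshold[node]:.2f}, "
            f"samples={T_demo.n_node_samples[node]}, impurity={T_demo.impurity[node]:.3f}")
    return [line] + describe(left, depth + 1) + describe(right, depth + 1)

lines = describe()
print("\n".join(lines))
```

Each indented line is a child of the nearest less-indented line above it, so the output can be compared directly with the picture drawn by `plot_tree`.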

## Sample data

##### Exercise 1
Let  
```python
mu = np.array([1,1])
cov = np.array([[1.1,-1],
                [-1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu, cov, 100), 
               np.random.multivariate_normal(-mu, cov, 100)])
y = np.array([0]*100 + [1]*100)
```

###### 1(a)
Plot the points (rows) in `X` with `c=y` .  

In [None]:
### your answer here
mu = np.array([1,1])
cov = np.array([[1.1,-1],
                [-1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu, cov, 100), 
               np.random.multivariate_normal(-mu, cov, 100)])
y = np.array([0]*100 + [1]*100)

# plot
plt.scatter(X[:,0], X[:,1], c=y)

###### 1(b)
Draw 1000 random points uniformly on the region $-5\leq x\leq 5$ and $-5\leq y\leq 5$.  
Use the trained model to make a prediction `y_new` .  
Then plot these 1000 points with `c=y_new` .

In [None]:
### your answer here
X_test = (np.random.rand(1000, 2)*10)-5

# model predict
model = DecisionTreeClassifier()
model.fit(X, y)
y_new = model.predict(X_test)

# plot
plt.scatter(X_test[:,0], X_test[:,1], c=y_new)


###### 1(c)
Train a decision tree model.  
Then run  
```python
from sklearn.tree import plot_tree
plot_tree(model)
```
Try to understand the following questions:
- Check if the number of samples in a node is equal to the sum of those in its two children
- What can you say about the `gini` value at each leaf node?
- What can you say about the `value` distribution at each leaf node?
- Check how many samples satisfy the criterion at the root node.  It should be the same as the number of samples in the left child (of the root).

In [None]:
### your answer here
from sklearn.tree import plot_tree
plot_tree(model, class_names=['0','1'])

# ANS:
# 1. Yes, the number of samples in a node equals the sum of those in its two children (e.g., 108+92=200, 90+18=108).

# 2. "gini" measures how impure (mixed) the samples in a node are. For example, the root holds all 200 data points,
#    100 labeled 0 and 100 labeled 1, so its gini is 1 - (0.5^2 + 0.5^2) = 0.5
#    (formula: https://www.datacamp.com/community/tutorials/decision-tree-classification-python).
#    The model stops splitting a node only when gini=0, which means that the node is pure.

# 3. "value" is the number of data points of each class in the node (here, how many points are labeled 0 and how many
#    are labeled 1). Take the left child at the second level as an example: value=[18,90] means that the node holds
#    18 points labeled 0 and 90 points labeled 1, so predicting 1 there would mislabel the 18 points.
#    The model stops splitting only when value=[0, xx] or [xx, 0], which means that no points are mislabeled.

# 4. 108 points satisfy the criterion at the root node and 92 do not.

##### Jephian:
Nice answers.  
For 2, my intention was for you to observe that every leaf node has Gini impurity equal to zero.  
For 4, you may use the code below and see if the output is the same as the number of samples in the left child of the root.  
```python
j = model.tree_.feature[0]
t = model.tree_.threshold[0]
### x[j] <= t
(X[:,j] <= t).sum()
```
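
The observations in Exercise 1(c) can also be checked for every node at once instead of reading them off the plot. The sketch below regenerates data like that of Exercise 1 with a seeded generator, which is an assumption made for reproducibility; the names ending in `_ex1` are likewise hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Data as in Exercise 1, but drawn from a seeded generator (an assumption).
rng = np.random.default_rng(0)
mu = np.array([1, 1])
cov = np.array([[1.1, -1],
                [-1, 1.1]])
X_ex1 = np.vstack([rng.multivariate_normal(mu, cov, 100),
                   rng.multivariate_normal(-mu, cov, 100)])
y_ex1 = np.array([0] * 100 + [1] * 100)

T_ex1 = DecisionTreeClassifier().fit(X_ex1, y_ex1).tree_
internal = T_ex1.children_left != -1   # nodes that were split
leaf = ~internal

# The samples of each internal node are divided between its two children.
child_sum = (T_ex1.n_node_samples[T_ex1.children_left[internal]]
             + T_ex1.n_node_samples[T_ex1.children_right[internal]])
print("parent = left + right:", np.array_equal(T_ex1.n_node_samples[internal], child_sum))

# Without depth or size limits, every leaf of this tree is expected to be pure.
print("all leaves pure:", np.allclose(T_ex1.impurity[leaf], 0))

# Samples satisfying the root criterion x[j] <= t go to the left child.
j, t = T_ex1.feature[0], T_ex1.threshold[0]
n_left = (X_ex1[:, j] <= t).sum()
print(n_left, T_ex1.n_node_samples[T_ex1.children_left[0]])
```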

##### Exercise 2
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
```

###### 2(a)
Apply the decision tree classification algorithm to `X` and `y` .  
Make a prediction on the training data.  
What is the accuracy?

In [None]:
### your answer here
from sklearn.datasets import load_iris
import sklearn.model_selection
iris = load_iris()
X = iris.data
y = iris.target


X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_new = model.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_new))
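
Exercise 2(a) asks about the accuracy on the training data, which can differ from the accuracy on held-out data. A minimal sketch that reports both, assuming the same 70/30 split; the names `Xtr`, `Xte`, `ytr`, `yte`, and `acc_model` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
Xtr, Xte, ytr, yte = train_test_split(iris_demo.data, iris_demo.target,
                                      test_size=0.3, random_state=1)
acc_model = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)

# score() returns the mean accuracy on the given data and labels
train_acc = acc_model.score(Xtr, ytr)
test_acc = acc_model.score(Xte, yte)
print("training accuracy:", train_acc)
print("test accuracy:", test_acc)
```

An unrestricted tree usually fits its training data very closely, so the test accuracy tends to be the more informative of the two numbers.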


###### 2(b)
Plot the decision tree.  
If a bag of $n$ balls contains $n_i$ balls of color $i$ (for colors $i=0,\ldots,c-1$), the **Gini impurity** of this bag (distribution) is  
$$\sum_{i=0}^{c-1} p_i(1 - p_i),$$
where $p_i = n_i/n$ is the probability of getting a ball of color $i$.

Pick a node and check if this formula is correct.

In [None]:
### your answer here
from sklearn.tree import plot_tree
plot_tree(model)
(36/105)*(1-36/105) + (32/105)*(1-32/105) + (37/105)*(1-37/105)
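
Rather than checking one node by hand, the Gini formula can be compared with `tree_.impurity` at every node. This is a sketch under the assumption that the tree is fitted on the full iris data; `gini_model`, `T_gini`, and `p` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
gini_model = DecisionTreeClassifier(criterion="gini", random_state=0)
gini_model.fit(iris_demo.data, iris_demo.target)
T_gini = gini_model.tree_

# value has shape (n_nodes, 1, n_classes). Normalizing each row gives the class
# proportions p_i whether the installed version stores counts or fractions.
p = T_gini.value[:, 0, :]
p = p / p.sum(axis=1, keepdims=True)

# Gini impurity: sum over classes of p_i * (1 - p_i), one value per node
gini = (p * (1 - p)).sum(axis=1)
print(np.allclose(gini, T_gini.impurity))
```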

###### 2(c)
Change the model setting to `criterion="entropy"` and plot the decision tree again.  
If a bag of $n$ balls contains $n_i$ balls of color $i$ (for colors $i=0,\ldots,c-1$), the **entropy** of this bag (distribution) is  
$$\sum_{i=0}^{c-1} -p_i\log_2(p_i),$$
where $p_i = n_i/n$ is the probability of getting a ball of color $i$.

Pick a node and check if this formula is correct.

In [None]:
### your answer here
import math
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion="entropy")
model.fit(X_train, y_train)
y_new = model.predict(X_test)

plot_tree(model)

# entropy at the root node, whose class counts are [36, 32, 37]
-(36/105)*math.log(36/105, 2) - (32/105)*math.log(32/105, 2) - (37/105)*math.log(37/105, 2)

In [None]:
from math import log
-(36/105)*log(36/105,2) - (32/105)*log(32/105,2) - (37/105)*log(37/105,2)

##### Jephian:
You may use `np.log2` instead.
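
Following that suggestion, the entropy formula can be evaluated with `np.log2` for all nodes and compared with `tree_.impurity`. The sketch assumes a tree fitted on the full iris data; `ent_model`, `T_ent`, and `p` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
ent_model = DecisionTreeClassifier(criterion="entropy", random_state=0)
ent_model.fit(iris_demo.data, iris_demo.target)
T_ent = ent_model.tree_

# class proportions at each node (normalizing works for counts and fractions)
p = T_ent.value[:, 0, :]
p = p / p.sum(axis=1, keepdims=True)

# Treat 0 * log2(0) as 0 by replacing zero proportions with 1 inside the log.
log_p = np.log2(np.where(p > 0, p, 1.0))
entropy = -(p * log_p).sum(axis=1)
print(np.allclose(entropy, T_ent.impurity))
```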

##### Exercise 3
Let  
```python
from sklearn.datasets import load_digits
digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X = digits.data[mask]
y = digits.target[mask]
```

###### 3(a)
Train a decision tree classification model.  
What is its accuracy score?

In [None]:
### your answer here
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn import metrics

digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X = digits.data[mask]
y = digits.target[mask]
model = DecisionTreeClassifier()
model.fit(X, y)
y_new = model.predict(X)
print("Training data accuracy:",metrics.accuracy_score(y, y_new))

###### 3(b)
Use any software or online app to draw a picture of 0 or 1.  
Save it as a file, e.g., `my_digit.png` .  
Use the following code to load it.  
```python
from PIL import Image
img = Image.open("my_digit.png").resize((8,8))
```
Does the model give you the right answer?  
Each of you can do 5 pictures.  
Let's see what the accuracy score is.

In [None]:
import os ### added by Jephian
path = "DecisionTreeClassifier-with-scikit-learn-2021S-data" ### added by Jephian

### your answer here
from PIL import Image
def ImageToMatrix(filename):
    # read the image
    im = Image.open(filename).resize((8,8))
 
    width,height = im.size
    im = im.convert("L") 
    data = im.getdata()
    data = np.array(data,dtype='float')/255.0
    #new_data = np.reshape(data,(width,height))
    new_data = np.reshape(data,(height,width))
    return new_data

filename = os.path.join(path, "my_digit.png") ### added by Jephian
data = ImageToMatrix(filename) ### added by Jephian
# data = ImageToMatrix("my_digit.png")

data[data < 1] = 0

mask0 = (data[:,6] == 0)
mask1 = (data[:,6] == 1)
X = data[:,mask0]
y = data[:,mask1]

model = DecisionTreeClassifier()
model.fit(X, y)
y_new = model.predict(X)
print("Testing data accuracy:",metrics.accuracy_score(y, y_new))


def MatrixToImage(data):
    data = data*255
    new_im = Image.fromarray(data.astype(np.uint8))
    return new_im


new_im = MatrixToImage(data)

new_im.show()

##### Jephian:
The question is asking you to use the model you trained in 3(a) to predict the digit you drew.  
You may run 3(a) again and run the following code.  
```python
import os 
from PIL import Image
path = "DecisionTreeClassifier-with-scikit-learn-2021S-data"
filename = os.path.join(path, "my_digit.png")
img = Image.open(filename).resize((8,8)).convert("L")

arr = np.array(img, dtype=float).ravel()  # float avoids uint8 overflow in the rescaling below
arr = (255 - arr) * 16 / 255
X_sample = arr[np.newaxis, :]
model.predict(X_sample)
```
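
The rescaling step is needed because `digits.data` stores ink as values from 0 (background) to 16, while a grayscale drawing is typically dark ink (near 0) on a white page (near 255). The sketch below shows the conversion without an image file: the array `page` is a hypothetical stand-in for `np.array(img)`, and `digit_model` is trained as in Exercise 3(a).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
digit_model = DecisionTreeClassifier(random_state=0)
digit_model.fit(digits.data[mask], digits.target[mask])

# Hypothetical 8x8 grayscale drawing: white page (255) with a black vertical stroke (0).
page = np.full((8, 8), 255, dtype=np.uint8)
page[1:7, 3:5] = 0

# Convert to float first: doing this arithmetic in uint8 would wrap around.
arr = page.astype(float).ravel()
arr = (255 - arr) * 16 / 255      # invert, then map 0..255 to 0..16
X_sample = arr[np.newaxis, :]     # a single sample with 64 features

pred = digit_model.predict(X_sample)
print(pred)
```

Whether the predicted label matches the intended digit depends on the fitted tree, so the result is worth checking against several drawings.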

## Experiments

##### Exercise 4
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
```  
and `model` be your decision tree classification model.

###### 4(a)
Plot the decision tree with the keyword `node_ids=True` .  
(If necessary, you may use `plt.figure(figsize=(15,15))` to change the figure size.)

In [None]:
### your answer here
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_new = model.predict(X_test)

plt.figure(figsize=(15,15))
plot_tree(model,node_ids=True)


###### 4(b)
Let `T = model.tree_` .  
Print `T.children_left` and `T.children_right` .  
Compare these two arrays with the decision tree you printed.  
What do they mean?

In [None]:
### your answer here
T = model.tree_
print(T.children_left)
print(T.children_right)
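# The i-th entries of children_left and children_right are
# the ids of the left child and the right child of node i;
# -1 means that the node has no child (it is a leaf)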

###### 4(c)
Print `T.feature` and `T.threshold` .  
Compare these two arrays with the decision tree you printed.  
What do they mean?

In [None]:
### your answer here
print(T.feature)
# T.feature[i] means that
# node i is split using feature number feature[i]
print(T.threshold)
# the threshold value used to split each node

###### 4(d)
Print `T.n_node_samples` and `T.impurity` .  
Compare these two arrays with the decision tree you printed.  
What do they mean?

In [None]:
### your answer here
print(T.n_node_samples)
# T.n_node_samples[i] is
# the total number of training samples in node i
print(T.impurity)
# the larger the value, the more mixed the classes in that node are,
# so a feature must be found to split it further

###### 4(e)
For each `i = 0,1,2,3`, count how many nodes use feature `i` for splitting.

In [None]:
### your answer here
for i in range(4):
    print("Feature", i, "is used", np.sum(T.feature == i), "times")

###### 4(f)
Suppose there are $N$ sample points in the training data.  
Suppose a node contains $n_s$ sample points.  
Within these sample points, there is a chance of $p_i$ to get class $i$.  
One may calculate the impurity $H$ (Gini or entropy) at this point.  

Suppose the "information" at each node is  
$$I = \frac{n_s}{N}\cdot H.$$
Calculate the information at each node.

In [None]:
### your answer here
ns = T.n_node_samples
H = T.impurity
N = ns[0]  # the root node contains all N training samples
I = (ns/N)*H
for i in range(T.node_count):
    print("The information of node ",i,"is ",I[i])

###### 4(g)
Suppose $I$ is the information at one node, while $I_\ell$ and $I_r$ are the information at its left and right children, respectively.  
The **information gain** at this node is $I - I_\ell - I_r$, that is, the decrease in information caused by the split.  
Calculate the information gain at each node.

In [None]:
### your answer here
inf_g = np.zeros_like(I)
for i in range(T.node_count):
    left = T.children_left[i]
    right = T.children_right[i]
    if left == -1:
        # a leaf is not split, so its information gain is zero
        inf_g[i] = 0
    else:
        inf_g[i] = I[i] - I[left] - I[right]
    print("The information gain of node ",i," is: ",inf_g[i])

###### 4(h)
Let $W_i$ be the sum of information gain among nodes using feature $i$ for splitting.  
Calculate an array `W` whose entries are $W_i$ for each feature $i$.  
Let `W = W / W.sum()` .  
Compare `W` with `model.feature_importances_` .

In [None]:
### your answer here
feature = T.feature
W = np.zeros(4)

for i in range(4):
    # sum the information gain over the nodes that split on feature i
    W[i] = inf_g[feature == i].sum()

W = W / W.sum()
print(model.feature_importances_)
print(W)
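
The steps of Exercise 4 can be combined into one vectorized check. The sketch below assumes a tree fitted on the full iris data without sample weights and takes the gain as the decrease $I - I_\ell - I_r$; an overall sign cancels in the normalization `W / W.sum()`. The names `fi_model`, `T_fi`, `info`, `gain`, and `W_check` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
fi_model = DecisionTreeClassifier(random_state=0).fit(iris_demo.data, iris_demo.target)
T_fi = fi_model.tree_

# information I = (n_s / N) * H at every node; the root holds all N samples
info = T_fi.n_node_samples / T_fi.n_node_samples[0] * T_fi.impurity

# gain at each internal node; leaves are not split and keep a gain of 0
internal = np.flatnonzero(T_fi.children_left != -1)
gain = np.zeros(T_fi.node_count)
gain[internal] = (info[internal]
                  - info[T_fi.children_left[internal]]
                  - info[T_fi.children_right[internal]])

# add each node's gain to the feature it splits on, then normalize
W_check = np.zeros(iris_demo.data.shape[1])
np.add.at(W_check, T_fi.feature[internal], gain[internal])
W_check = W_check / W_check.sum()

print(W_check)
print(fi_model.feature_importances_)
```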

##### Exercise 5
Let  
```python 
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)  
X = np.vstack([X[band1], X[band2]])
y = np.array([0]*band1.sum() + [1]*band2.sum())
```

###### 5(a)
Go through the split-train-test process.  
What is the accuracy score?

In [None]:
from sklearn.model_selection import train_test_split
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)
X = np.vstack([X[band1], X[band2]])
y = np.array([0]*band1.sum() + [1]*band2.sum())

score = np.array([])
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    y_new = model.predict(X_test)
    
    scr = np.sum(y_new==y_test)/y_test.shape[0]
    score = np.append(score,scr)


In [None]:
for i in range(10):
    print("test",i+1,": score =",score[i])
print("average score =",np.mean(score))
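
As an alternative to repeating the random split by hand, `cross_val_score` from scikit-learn evaluates the model on several folds. This swaps repeated random splits for 5-fold cross-validation, so the numbers are not directly comparable with the loop above. The seeded generator and the names ending in `_cv` are assumptions for this sketch.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Ring-shaped data as in Exercise 5, drawn from a seeded generator (an assumption).
rng = np.random.default_rng(0)
pts = 5 * rng.standard_normal((1000, 2))
lengths = np.linalg.norm(pts, axis=1)
band1 = (lengths > 1) & (lengths < 2)
band2 = (lengths > 3) & (lengths < 4)
X_cv = np.vstack([pts[band1], pts[band2]])
y_cv = np.array([0] * band1.sum() + [1] * band2.sum())

# Each fold is held out once while the tree is trained on the remaining folds.
scores_cv = cross_val_score(DecisionTreeClassifier(random_state=0), X_cv, y_cv, cv=5)
print("fold scores:", scores_cv)
print("average score =", scores_cv.mean())
```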

###### 5(b)
Use some random points to plot the regions for each class.  
(Just as we did in Exercise 1.)

In [None]:
### your answer here
X_test=np.random.rand(1000,2)*10-5
y_new = model.predict(X_test)
plt.axis('equal')
plt.scatter(X_test.T[0],X_test.T[1],c=y_new)