# DecisionTreeClassifier with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

## Code
```python
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(<parameters>)
model.fit(X, y)
y_new = model.predict(X_test)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

## Parameters
- `criterion`: `"gini"` or `"entropy"`  
the function used to measure the quality of a split
- `max_depth`: an integer, the maximum depth of the tree
- `min_samples_split`: the minimum number of samples a node must have to be split further (default `2`)

and many others.
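
As a rough illustration of how these parameters constrain the tree, the sketch below fits two trees and compares their depth and number of leaves. The iris data and `random_state=0` are assumptions made here only to get a small, reproducible example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# default settings: keep splitting until the leaves are pure or cannot be split
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# restricted settings: entropy criterion, at most 2 levels of splits,
# and nodes with fewer than 10 samples are never split
small = DecisionTreeClassifier(criterion="entropy", max_depth=2,
                               min_samples_split=10, random_state=0).fit(X, y)

print("full tree :", full.get_depth(), "levels,", full.get_n_leaves(), "leaves")
print("small tree:", small.get_depth(), "levels,", small.get_n_leaves(), "leaves")
```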

## Attributes
- `classes_`: an array of shape `(n_classes,)`  
(Usually `0, ..., n_classes-1`)
- `feature_importances_`: an array of shape `(n_features,)`  
the total importance (impurity reduction) of each feature
- `tree_`: the constructed decision tree

For `model.tree_`  
- `children_left[i]`: id of the left child of node i or -1 if leaf node
- `children_right[i]`: id of the right child of node i or -1 if leaf node
- `feature[i]`: feature used for splitting node i
- `threshold[i]`: threshold value at node i
- `n_node_samples[i]`: the number of training samples reaching node i
- `impurity[i]`: the impurity at node i

(Source: [Understanding the decision tree structure](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html))
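
The arrays listed above can be read together by walking over the node ids. The following is a minimal sketch, assuming the iris data and a shallow tree (`max_depth=2`, `random_state=0`) so that the printout stays short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
T = model.tree_

for i in range(T.node_count):
    left, right = T.children_left[i], T.children_right[i]
    if left == -1:
        # leaf node: no children, no split
        print(f"node {i}: leaf, samples={T.n_node_samples[i]}, impurity={T.impurity[i]:.3f}")
    else:
        # internal node: samples with x[feature] <= threshold go left
        print(f"node {i}: x[{T.feature[i]}] <= {T.threshold[i]:.3f}, children=({left}, {right})")

# the samples of each internal node are divided between its two children
is_internal = T.children_left != -1
child_sum = (T.n_node_samples[T.children_left[is_internal]]
             + T.n_node_samples[T.children_right[is_internal]])
samples_match = np.array_equal(T.n_node_samples[is_internal], child_sum)
print(samples_match)
```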

## Sample data

##### Exercise 1
Let  
```python
mu = np.array([1,1])
cov = np.array([[1.1,-1],
                [-1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu, cov, 100), 
               np.random.multivariate_normal(-mu, cov, 100)])
y = np.array([0]*100 + [1]*100)
```

###### 1(a)
Plot the points (rows) in `X` with `c=y` .  

In [None]:
### your answer here
mu = np.array([1,1])
cov = np.array([[1.1,-1],
                [-1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu, cov, 100), 
               np.random.multivariate_normal(-mu, cov, 100)])
y = np.array([0]*100 + [1]*100)

plt.scatter(*X.T, c=y)   ## first plot the original data X, y

###### 1(b)
Draw 1000 random points uniformly on the region $-5\leq x\leq 5$ and $-5\leq y\leq 5$.  
Use the trained model to make a prediction `y_new` .  
Then plot these 1000 points with `c=y_new` .
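
As an alternative to random points, the same region can be covered with a regular grid, which makes the axis-aligned shape of the decision regions easier to see. This is only a sketch: the seeded generator `rng`, the grid resolution of 200, and the use of `plt.contourf` are choices made here, not part of the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)  # seeded only for reproducibility
mu = np.array([1, 1])
cov = np.array([[1.1, -1],
                [-1, 1.1]])
X = np.vstack([rng.multivariate_normal(mu, cov, 100),
               rng.multivariate_normal(-mu, cov, 100)])
y = np.array([0]*100 + [1]*100)

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# regular grid on the square -5 <= x <= 5, -5 <= y <= 5
xs = np.linspace(-5, 5, 200)
xx, yy = np.meshgrid(xs, xs)
grid = np.column_stack([xx.ravel(), yy.ravel()])

# predict every grid point and reshape back to the grid layout
zz = model.predict(grid).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)   # predicted regions
plt.scatter(*X.T, c=y, s=10)          # training points
plt.axis("equal")
```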

In [None]:
### your answer here
# build the model
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X, y)

X_test = np.random.rand(1000,2) * 10 - np.array([5,5])  ## create the test data
y_new = model.predict(X_test)
plt.scatter(*X.T, c=y)  ## the darker points are the original data
plt.scatter(*X_test.T, c=y_new, s=10, alpha=0.3)  ## predictions of the decision tree on the test points, drawn lighter and smaller

###### 1(c)
Train a decision tree model.  
Let  
```python
from sklearn.tree import plot_tree
plot_tree(model)
```
Try to understand the following questions:
- Check if the number of samples in a node is equal to the sum of those in its two children
- What can you say about the `gini` value at each leaf node?
- What can you say about the `value` distribution at each leaf node?
- Check how many samples satisfy the criterion at the root node.  It should be the same as the number of samples in the left child (of the root).

In [None]:
### your answer here
from sklearn.tree import plot_tree
plot_tree(model,class_names=['a','b'])
#
# ANS:
# 1. Yes.
#    The number of samples in a node equals the sum of those in its two children
#    (e.g. 110 + 90 = 200, 93 + 17 = 110, ...).
# 2. Gini:
#    gini measures the impurity of the samples in a node.
#    For example, the root node holds all 200 data points:
#    100 labeled "a" and 100 labeled "b",
#    so gini = 0.5(1-0.5) + 0.5(1-0.5) = 0.5.
#    As another example, the left child of the root holds 110 samples:
#    17 labeled "a" and 93 labeled "b",
#    so gini = 17/110(1 - 17/110) + 93/110(1 - 93/110) = 0.261.
#    After enough splits gini reaches 0 at every leaf:
#    with the default settings a node stops splitting only when it is pure (gini = 0).

# 3. Value:
#    value gives the number of data points of each class in the node,
#    here how many points are labeled "a" and how many are labeled "b".
#    Taking the left child of the root as an example, value=[17,93] means that
#    the node holds 17 points of class "a" and 93 points of class "b".
#    At every leaf value=[0, xx] or [xx, 0],
#    because splitting continues until no node mixes the two classes.

# 4. 110 points satisfy the criterion at the root node.
j = model.tree_.feature[0]
t = model.tree_.threshold[0]
### x[j] <= t
(X[:,j] <= t).sum()

##### Exercise 2
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
```

###### 2(a)
Apply the decision tree classification algorithm to `X` and `y` .  
Make a prediction of the training data.  
How is the accuracy?

In [None]:
### your answer here
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

import sklearn.model_selection
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.2)
# 80% training and 20% test

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_new = model.predict(X_test)

print("Accuracy = ",metrics.accuracy_score(y_test, y_new))

###### 2(b)
Plot the decision tree.  
If a bag of $n$ balls contains $n_i$ balls of color $i$ (for colors $i=0,\ldots,c-1$), the **Gini impurity** of this bag (distribution) is  
$$\sum_{i=0}^{c-1} p_i(1 - p_i),$$
where $p_i = n_i/n$ is the probability of getting a ball of color $i$.

Pick a node and check if this formula is correct.
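
One way to check the formula without reading numbers off the plot is to compare it with `model.tree_.impurity`. The sketch below does this for the root node; the helper name `gini` and the choice to fit on the full iris data with `random_state=0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def gini(counts):
    """Gini impurity sum_i p_i (1 - p_i) of a bag with the given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return np.sum(p * (1 - p))

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# every training sample reaches the root, so its class counts are those of y
root_counts = np.bincount(y)
print("formula   :", gini(root_counts))
print("tree_ says:", model.tree_.impurity[0])
```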

In [None]:
### your answer here
from sklearn.tree import plot_tree
plt.figure(figsize = (12,12))
plot_tree(model)
(39/120)*(1-39/120) + (37/120)*(1-37/120) + (44/120)*(1-44/120)
## the result, 0.66486, agrees with gini = 0.665 shown at the root node

###### 2(c)
Change the model setting to `criterion="entropy"` and plot the decision tree again.  
If a bag of $n$ balls contains $n_i$ balls of color $i$ (for colors $i=0,\ldots,c-1$), the **entropy** of this bag (distribution) is  
$$\sum_{i=0}^{c-1} -p_i\log_2(p_i),$$
where $p_i = n_i/n$ is the probability of getting a ball of color $i$.

Pick a node and check if this formula is correct.
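
The entropy formula can be checked against `model.tree_.impurity` in the same spirit. This is a minimal sketch for the root node; the helper name `entropy`, the full iris data, and `random_state=0` are assumptions made for illustration. Classes with zero count are dropped, following the usual convention that $0\log_2 0 = 0$.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def entropy(counts):
    """Entropy sum_i -p_i log2(p_i) of a bag with the given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # skip empty classes to avoid log2(0)
    return np.sum(-p * np.log2(p))

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# every training sample reaches the root, so its class counts are those of y
root_counts = np.bincount(y)
print("formula   :", entropy(root_counts))
print("tree_ says:", model.tree_.impurity[0])
```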

In [None]:
### your answer here
import math
from sklearn import metrics

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion="entropy")
model.fit(X_train, y_train)
y_new = model.predict(X_test)

plt.figure(figsize = (12,12))
plot_tree(model)

-(39/120)*np.log2(39/120) - (37/120)*np.log2(37/120) - (44/120)*np.log2(44/120)
## the result, 1.58109, agrees with entropy = 1.581 shown at the root node

##### Exercise 3
Let  
```python
from sklearn.datasets import load_digits
digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X = digits.data[mask]
y = digits.target[mask]
```

###### 3(a)
Train a decision tree classification model.  
How is its accuracy score?

In [None]:
### your answer here
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_digits
from sklearn import metrics

digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X = digits.data[mask]
y = digits.target[mask]

model = DecisionTreeClassifier()
model.fit(X, y)
y_new = model.predict(X)
print("Training data accuracy:",metrics.accuracy_score(y, y_new))

###### 3(b)
Use any software or online app to draw a picture of 0 or 1.  
Save it as a file, e.g., `my_digit.png` .  
Use the following code to load it.  
```python
from PIL import Image
img = Image.open("my_digit.png").resize((8,8))
```
Does the model give you the right answer?  
Each of you can do 5 pictures.  
Let's see what the accuracy score is.

In [None]:
### your answer here
import os 
path = "DecisionTreeClassifier-with-scikit-learn-2021S-data"

### your answer here
from PIL import Image
def ImageToMatrix(filename):
    # read the image
    im = Image.open(filename).resize((8,8))
 
    width,height = im.size
    im = im.convert("L") 
    data = im.getdata()
    data = np.array(data,dtype='float')/255.0
    #new_data = np.reshape(data,(width,height))
    new_data = np.reshape(data,(height,width))
    return new_data

filename = os.path.join(path, "my_digit.png") 
data = ImageToMatrix(filename) 
# data = ImageToMatrix("my_digit.png")

data[data < 1] = 0

mask0 = (data[:,6] == 0)
mask1 = (data[:,6] == 1)
X = data[:,mask0]
y = data[:,mask1]

model = DecisionTreeClassifier()
model.fit(X, y)
y_new = model.predict(X)
print("Testing data accuracy:",metrics.accuracy_score(y, y_new))


def MatrixToImage(data):
    data = data*255
    new_im = Image.fromarray(data.astype(np.uint8))
    return new_im


new_im = MatrixToImage(data)

new_im.show()
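# A window pops up showing the image; it looks similar to the original drawing, and it was indeed predicted as 0 (the original was a rectangular frame)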

#### TA:
You are supposed to predict the digit you drew by using the model trained in 3(a), not to retrain a new model.  
You may run the following code instead.

```python
import os 
from PIL import Image
path = "DecisionTreeClassifier-with-scikit-learn-2021S-data"
# TA note: "2021" in the path above should be "2022"
filename = os.path.join(path, "my_digit.png")
img = Image.open(filename).resize((8,8)).convert("L")

arr = np.array(img).ravel()
arr = (255 - arr) * 16 / 255
X_sample = arr[np.newaxis, :]
model.predict(X_sample)
```
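
The rescaling step `(255 - arr) * 16 / 255` is there because the `load_digits` images store ink as large values on a 0 to 16 scale, while a typical drawing saved as a grayscale PNG has dark ink (near 0) on a white background (255). The sketch below illustrates the mapping on a synthetic 8x8 array; `fake_img` is a hypothetical stand-in for a drawn digit, so no image file is needed, and `random_state=0` is added only for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X, y = digits.data[mask], digits.target[mask]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# hypothetical drawing: a black vertical stroke on a white background
fake_img = np.full((8, 8), 255, dtype=np.uint8)
fake_img[1:7, 3:5] = 0

arr = fake_img.ravel().astype(float)
arr = (255 - arr) * 16 / 255      # white (255) -> 0, black (0) -> 16
X_sample = arr[np.newaxis, :]      # shape (1, 64): one sample with 64 features

print("training pixel range:", X.min(), X.max())
print("rescaled pixel range:", arr.min(), arr.max())
print("prediction:", model.predict(X_sample))
```

Without the inversion and rescaling, the sample would lie far outside the range of the training pixels and the thresholds learned by the tree would not be meaningful for it.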

## Experiments

##### Exercise 4
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
```  
and `model` be your decision tree classification model.

###### 4(a)
Plot the decision tree with the keyword `node_ids=True` .  
(If necessary, you may use `plt.figure(figsize=(15,15))` to change the figure size.)

In [None]:
### your answer here
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

import sklearn.model_selection
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=1)
# 80% training and 20% test

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_new = model.predict(X_test)

plt.figure(figsize=(12,12))
plot_tree(model,node_ids=True)

###### 4(b)
Let `T = model.tree_` .  
Print `T.children_left` and `T.children_right` .  
Compare these two arrays with the decision tree you printed.  
What do they mean?

In [None]:
### your answer here
T = model.tree_
print(T.children_left)
print(T.children_right)
# T.children_left[i] and T.children_right[i] are the ids of the left and right children of node i
# (the ids are the ones shown in the plot with node_ids=True)
# -1 means that node i is a leaf and has no children

###### 4(c)
Print `T.feature` and `T.threshold` .  
Compare these two arrays with the decision tree you printed.  
What do they mean?

In [None]:
### your answer here
print(T.feature)
# T.feature[i] is the index of the feature that node i uses for its split
# (leaf nodes do not split and are marked with -2)

print(T.threshold)
# T.threshold[i] is the cut value at node i: samples with x[T.feature[i]] <= T.threshold[i] go to the left child

###### 4(d)
Print `T.n_node_samples` and `T.impurity` .  
Compare these two arrays with the decision tree you printed.  
What do they mean?

In [None]:
### your answer here
print(T.n_node_samples)
# T.n_node_samples[i] is the number of training samples reaching node i
print(T.impurity)
# T.impurity[i] is the impurity (gini here) at node i;
# a nonzero value means the samples in the node are still mixed,
# so the node is split further on some feature

###### 4(e)
For each `i = 0,1,2,3`, count how many nodes use feature `i` for splitting.
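
A compact way to do this count is `np.bincount` on the features of the internal nodes. The following is a sketch under the assumption that the tree is fitted on the full iris data with `random_state=0`; leaves are filtered out first because `T.feature` stores a negative placeholder for them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
T = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

# keep only internal nodes; leaves carry a negative placeholder in T.feature
split_features = T.feature[T.children_left != -1]

# counts[i] = number of nodes that split on feature i
counts = np.bincount(split_features, minlength=X.shape[1])
print(counts)
```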

In [None]:
### your answer here
for i in range(4):
    print("Feature", i, "is used", np.sum(T.feature == i), "times")

###### 4(f)
Suppose there are $N$ sample points in the training data.  
Suppose a node contains $n_s$ sample points.  
Within these sample points, there is a chance of $p_i$ to get class $i$.  
One may calculate the impurity $H$ (Gini or entropy) at this point.  

Suppose the "information" at each node is  
$$I = \frac{n_s}{N}\cdot H.$$
Calculate the information at each node.

In [None]:
### your answer here
ns = T.n_node_samples
H = T.impurity
N = ns[0]   # the root node contains all training samples (120 here)
I = (ns/N)*H
for i in range(T.node_count):
    print("The information of node ",i,"is ",I[i])

###### 4(g)
Suppose $I$ is the information at one node, while $I_\ell$ and $I_r$ are the information at its left and right children, respectively.  
The **information gain** at this node is $I_\ell + I_r -I$.  
Calculate the information gain at each node.

In [None]:
### your answer here
inf_g = np.zeros_like(I)
for i in range(T.node_count):
    left = T.children_left[i]
    right = T.children_right[i]
    # a node has either two children or none (-1 marks a leaf)
    if left != -1 and right != -1:
        inf_g[i] = I[left] + I[right] - I[i]
    # a leaf is not split, so its information gain stays 0
    print("The information gain of node ",i," is: ",inf_g[i])

###### 4(h)
Let $W_i$ be the sum of information gain among nodes using feature $i$ for splitting.  
Calculate an array `W` whose entries are $W_i$ for each feature $i$.  
Let `W = W / W.sum()` .  
Compare `W` with `model.feature_importances_` .
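
Steps 4(f) to 4(h) can be chained into one vectorized computation. The sketch below follows the definitions given in the text (including the sign convention $I_\ell + I_r - I$, which cancels after normalization); fitting on the full iris data with `random_state=0` is an assumption made for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)
T = model.tree_

# 4(f): information I = (n_s / N) * H at every node
N = T.n_node_samples[0]
I = T.n_node_samples / N * T.impurity

# 4(g): information gain I_l + I_r - I at internal nodes, 0 at leaves
internal = np.flatnonzero(T.children_left != -1)
gain = np.zeros(T.node_count)
gain[internal] = (I[T.children_left[internal]]
                  + I[T.children_right[internal]]
                  - I[internal])

# 4(h): add up the gains per splitting feature, then normalize
W = np.zeros(X.shape[1])
np.add.at(W, T.feature[internal], gain[internal])
W = W / W.sum()

print(W)
print(model.feature_importances_)
```

`np.add.at` is used instead of `W[T.feature[internal]] += ...` because the same feature can appear at several nodes, and plain fancy-index assignment would count it only once.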

In [None]:
### your answer here
feature = T.feature
W = np.zeros(4)

for i in range(4):
    # add up the information gain over the nodes that split on feature i
    W[i] = inf_g[feature == i].sum()

W = W / W.sum()
print(model.feature_importances_)
print(W)

##### Exercise 5
Let  
```python 
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)
X = np.vstack([X[band1], X[band2]])
y = np.array([0]*band1.sum() + [1]*band2.sum())
```

###### 5(a)
Go through the split-train-test process.  
What is the accuracy score?

In [None]:
### your answer here
from sklearn.model_selection import train_test_split
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)
X = np.vstack([X[band1], X[band2]])
y = np.array([0]*band1.sum() + [1]*band2.sum())

score = np.array([])
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)  # fit on the training part only, so the test score is honest
    y_new = model.predict(X_test)
    
    scr = np.sum(y_new==y_test)/y_test.shape[0]
    score = np.append(score,scr)
for i in range(10):
    print("test",i+1,": score =",score[i])
print("average score =",np.mean(score))

###### 5(b)
Use some random points to plot the regions for each class.  
(Just as what we did in Exercise 1.)

In [None]:
### your answer here
# use the two-band data of Exercise 5 (not the Gaussian data of Exercise 1)
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)
X = np.vstack([X[band1], X[band2]])
y = np.array([0]*band1.sum() + [1]*band2.sum())

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X, y)

X_test = np.random.rand(1000,2) * 10 - np.array([5,5])  ## create the test points
y_new = model.predict(X_test)

plt.axis('equal')
plt.scatter(X_test.T[0],X_test.T[1],c=y_new)

In [None]:
#end#

In [None]:
## the impurity line drawn in class

In [None]:
criteria = "gini"

def impurity(arr):
    dtrib = np.unique(arr, return_counts=True)[1]
    dtrib = dtrib / dtrib.sum()
    if criteria == "gini":
        return np.sum(dtrib * (1 - dtrib))
    if criteria == "entropy":
        return np.sum(-dtrib * np.log2(dtrib))

In [None]:
X = np.zeros((10,2), dtype=float)
X[:,0] = np.arange(10)
y = np.array([0,0,0,1,1,0,1,1,1,1])

plt.scatter(*X.T, c=y)

Iprime = np.zeros_like(X)
### for illustration, just focus on x-coordinate
j = 0 
N,d = X.shape

for i in range(N):
    mask = (X[:,j] <= X[i,j])
    NL, NR = mask.sum(), (~mask).sum()
    HL = impurity(y[mask])
    HR = impurity(y[~mask])
    Iprime[i,j] = (NL * HL + NR * HR) / N
    
plt.plot(X[:,j], Iprime[:,j])
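
The lowest point of this impurity line is where a decision tree would place its first cut. As a hedged check, the sketch below recomputes the weighted Gini impurity for the same ten points and compares the best cut with the threshold chosen by a depth-1 tree; the helper `gini`, the one-column input, and `random_state=0` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini impurity of an array of class labels."""
    counts = np.unique(labels, return_counts=True)[1]
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 1, 1, 0, 1, 1, 1, 1])
n = len(y)

# weighted impurity (NL * HL + NR * HR) / N for the cut "x <= x[i]";
# the last point is skipped because it would leave the right side empty
weighted = np.array([
    ((i + 1) * gini(y[:i + 1]) + (n - i - 1) * gini(y[i + 1:])) / n
    for i in range(n - 1)
])
best = int(np.argmin(weighted))
print("best cut after x =", x[best], "with weighted impurity", weighted[best])

# a depth-1 tree on the same data; it reports the midpoint between neighbors
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(x[:, np.newaxis], y)
print("tree threshold:", stump.tree_.threshold[0])
```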