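The decision tree algorithm can be applied to both classification and regression. Learning proceeds in three steps: **feature selection, tree generation, and tree pruning**. The main algorithms are ID3, C4.5, and CART.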

# 1. Feature selection

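Feature selection means choosing the features that have classification power over the sample data, which makes the classification decisions more efficient. If classifying by a feature gives results no different from random classification, the feature is said to have no classification power, and discarding it does little harm.
The criterion used here to select features is **information gain**.
## 1. Information gain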
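Given a dataset D:<br>
its number of samples is |D|, i.e. the sample size;<br>
the dataset has K classes $C_k$, k = 1, 2, ..., K;<br>
$|C_k|$ is the number of samples belonging to class $C_k$;<br>
$\sum_{k=1}^{K}{|C_k|}=|D|$;<br>
feature A partitions D into n subsets $D_1, D_2, ..., D_n$, one per value of A;<br>
$|D_i|$ is the number of samples in $D_i$, and $\sum_{i=1}^{n}{|D_i|}=|D|$;<br>
the set $D_{ik}$ contains the samples of subset $D_i$ that belong to class $C_k$, i.e. $D_{ik} = D_i \cap C_k$;<br>
$|D_{ik}|$ is the number of samples in $D_{ik}$.<br>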
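
As a minimal sketch of this notation, the counts can be read off a contingency table. The toy frame and the column names `A` and `label` are made up for illustration; `pd.crosstab` puts the values of A on the rows and the classes on the columns, so each cell holds $|D_{ik}|$.

```python
import pandas as pd

# Hypothetical toy dataset: one feature column "A" and a class column "label".
toy = pd.DataFrame({
    "A":     ["x", "x", "y", "y", "y", "z"],
    "label": ["c1", "c2", "c1", "c1", "c2", "c2"],
})

counts = pd.crosstab(toy["A"], toy["label"])  # cell (i, k) holds |D_ik|
size_D = len(toy)                             # |D|
size_Di = counts.sum(axis=1)                  # |D_i|, one entry per value of A
size_Ck = counts.sum(axis=0)                  # |C_k|, one entry per class

print(counts)
print(size_Di.to_dict(), size_Ck.to_dict(), size_D)
```

Both the row sums and the column sums add up to |D|, which matches the two summation identities above.
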
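**Information gain algorithm:**<br>
Input: training dataset D and feature A;<br>
Output: the information gain Gain(D, A) of feature A on training dataset D.<br>
Going one step further, output the feature $A_v$ that maximizes the information gain Gain(D, A).<br>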

(1) Compute the empirical entropy $H(D)$ of dataset D
$$
H(D) = -\sum_{k=1}^{K}{p_k\log_2 p_k} = -\sum_{k=1}^{K}{\frac{|C_k|}{|D|}\log_2\frac{|C_k|}{|D|}}
$$
(2) Compute the empirical conditional entropy $H(D|A)$ of D given feature A, with $p_i = |D_i|/|D|$ and the convention $0\log_2 0 = 0$
$$
H(D|A) = \sum_{i=1}^{n}{p_i H(D_i)} = \sum_{i=1}^{n}{\frac{|D_i|}{|D|}H(D_i)} = -\sum_{i=1}^{n}{\frac{|D_{i}|}{|D|}\sum_{k=1}^{K}{\frac{|D_{ik}|}{|D_i|}\log_2\frac{|D_{ik}|}{|D_i|}}}
$$
(3) Compute the information gain
$$
Gain(D,A) = H(D)-H(D|A)
$$
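
As a small worked example (the numbers are made up for illustration), take $|D| = 8$ with two classes of four samples each, and a feature A that splits D into $D_1$ (three samples of $C_1$, one of $C_2$) and $D_2$ (one of $C_1$, three of $C_2$):
$$
H(D) = -\tfrac{4}{8}\log_2\tfrac{4}{8} - \tfrac{4}{8}\log_2\tfrac{4}{8} = 1
$$
$$
H(D_1) = H(D_2) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811
$$
$$
H(D|A) = \tfrac{4}{8}H(D_1) + \tfrac{4}{8}H(D_2) \approx 0.811, \qquad Gain(D,A) \approx 1 - 0.811 = 0.189
$$
A feature whose subsets all keep the same class mix as D would have a gain of 0, and one that produces pure subsets would have a gain of $H(D)$, so the gain always lies between these two values.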



In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
# Generate random data: 4 features with values 0-3 and a label with 6 classes
X = np.random.randint(4,size=(100,4))
y = np.random.randint(6,size=(100,1))
D = pd.DataFrame(np.column_stack([X,y]))
D.shape

(100, 5)

In [5]:
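# Empirical entropy H(D), stored in ED

# The label is the last column: count the samples per class and normalize to p_k
pk = D.groupby(by=D.shape[1]-1).count()[0].values/len(D)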
ED = -np.dot(pk,np.log2(pk))
ED

2.529321538303107

In [6]:
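# Empirical conditional entropy H(D|A) (EDK) and information gain
# Note: the inner loop sums the subset entropies without the weights |D_i|/|D|,
# and unstack() leaves NaN for (feature value, class) pairs that never occur,
# so the values printed below are not valid information gains.
A = []
for v in range(D.shape[1]-1):
    # Rows: values of feature v; columns: classes; cells: counts |D_ik|
    Av = D.groupby([v,D.shape[1]-1])[1].count().unstack().values.astype(float)
    Av_sum = Av.sum(1)                    # row sums, intended as |D_i|
    EDK = 0
    for ll in range(len(Av_sum)):
        Av[ll] = Av[ll]/(Av_sum[ll])      # class proportions within subset D_i
        EDK += np.dot(Av[ll],np.log2(Av[ll]))
    EDK = -EDK
    print(ED -  EDK)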


-7.2959804738655745
nan
nan
-7.079603770490115
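
Information gain always lies between 0 and H(D), so the negative and nan values above cannot be gains. Two things in the loop seem to cause this: the subset entropies are summed without the weights $|D_i|/|D|$, and `unstack()` leaves NaN where a (feature value, class) pair never occurs. The following is a minimal sketch of a weighted version, not the notebook's original code. The helper names `entropy` and `information_gain` are hypothetical, and the block regenerates random data with an arbitrary fixed seed so that it runs on its own.

```python
import numpy as np
import pandas as pd

def entropy(counts):
    """Entropy in bits of a vector of class counts, using 0*log2(0) = 0."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(df, feature, target):
    """Gain(D, A) = H(D) - sum_i |D_i|/|D| * H(D_i) for column `feature`."""
    table = pd.crosstab(df[feature], df[target]).to_numpy()  # rows: D_i, cols: C_k
    weights = table.sum(axis=1) / table.sum()                # |D_i| / |D|
    cond = sum(w * entropy(row) for w, row in zip(weights, table))
    return entropy(table.sum(axis=0)) - cond

# Same shape as the random data above; the seed is an arbitrary choice.
rng = np.random.default_rng(0)
D_toy = pd.DataFrame(np.column_stack([rng.integers(4, size=(100, 4)),
                                      rng.integers(6, size=(100, 1))]))
label = D_toy.shape[1] - 1
gains = [information_gain(D_toy, v, label) for v in range(label)]
print(gains)
```

Since the labels are random, the gains should be small relative to H(D), and in every case they stay within [0, H(D)]. `pd.crosstab` fills missing combinations with 0, and `entropy` drops zero counts before taking the logarithm, which avoids the nan.
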


Switching to a different dataset

In [22]:
def div5(data):
    return np.rint((data - np.min(data))/(np.max(data)-np.min(data))*5)
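
To illustrate what `div5` does: it min-max scales a column to [0, 5] and rounds to the nearest integer, so a continuous feature is replaced by at most six discrete levels that can be grouped like a categorical one. A small check on made-up ages (the values are hypothetical):

```python
import numpy as np

def div5(data):
    # Rescale to [0, 5], then round to the nearest integer level.
    return np.rint((data - np.min(data)) / (np.max(data) - np.min(data)) * 5)

ages = np.array([18, 30, 45, 60, 87])  # hypothetical values
levels = div5(ages)
print(levels)  # [0. 1. 2. 3. 5.]
```

The levels are equal-width in value rather than equal-frequency, so a skewed column can end up with most of its samples in one or two levels.
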

In [29]:
# Load the data and discretize the numeric columns into levels 0-5
D = pd.read_csv('bank.csv', sep=';')
D.age = div5(D.age)
D.balance = div5(D.balance)
D.day = div5(D.day)
D.duration = div5(D.duration)
D.campaign = div5(D.campaign)
D.pdays = div5(D.pdays)
D.previous = div5(D.previous)
D.shape

(4521, 17)

In [38]:
n = D.shape[1]      # number of columns (features plus the label)
m = D.shape[0]      # number of samples
D.columns = range(n)
pk = D.groupby(by=n-1).count()[0].values/len(D)
ED = -np.dot(pk,np.log2(pk))
ED


0.5155217894467282

In [40]:
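# Empirical conditional entropy H(D|A) (EDK) and information gain
# Note: the subset entropies are still summed without the weights |D_i|/|D|, and
# with the sign flip below commented out, ED - EDK is H(D) plus that sum,
# so the values printed below are not valid information gains.
A = []
for v in range(D.shape[1]-1):
    # Rows: values of feature v; columns: classes; cells: counts |D_ik|
    Av = D.groupby([v,D.shape[1]-1])[1].count().unstack().values.astype(float)
    Av_sum = Av.sum(1)                    # row sums, intended as |D_i|
    EDK = 0
    for ll in range(len(Av_sum)):
        Av[ll] = Av[ll]/(Av_sum[ll])      # class proportions within subset D_i
        EDK += np.dot(Av[ll],np.log2(Av[ll]))
    #EDK = -EDK
    print(ED -  EDK)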

4.716651116094781
7.127428836824998
2.164038659853548
2.5209850278748576
1.555688753041983
nan
1.556719151904955
1.3947652767999186
1.9789001662516825
3.60128421338871
8.529812570873268
nan
nan
5.27590146027835
nan
3.1559896334373962
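
The values printed above exceed H(D) ≈ 0.5155, which a true information gain cannot do: with the sign flip commented out, `ED - EDK` equals H(D) plus the unweighted sum of the subset entropies. As a hedged sketch of the final step of the algorithm (picking the feature with the largest gain), the snippet below weights each subset by its share of the data and ranks the columns. The tiny frame only imitates the layout of the bank data; its rows and the helper names `H` and `gain` are made up for illustration.

```python
import numpy as np
import pandas as pd

def H(labels):
    """Empirical entropy in bits of a pandas Series of class labels."""
    p = labels.value_counts(normalize=True).to_numpy()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def gain(df, feature, target):
    """H(D) minus the |D_i|/|D|-weighted entropies of the groups of `feature`."""
    cond = sum(len(g) / len(df) * H(g[target]) for _, g in df.groupby(feature))
    return H(df[target]) - cond

# Hypothetical rows, not taken from bank.csv.
toy = pd.DataFrame({
    "housing": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
    "contact": ["cellular", "unknown", "cellular", "unknown",
                "cellular", "unknown", "cellular", "unknown"],
    "y":       ["no", "no", "yes", "yes", "no", "yes", "no", "no"],
})

gains = {col: gain(toy, col, "y") for col in toy.columns.drop("y")}
best = max(gains, key=gains.get)  # feature A_v with the largest gain
print(gains, best)
```

In this toy frame `housing` separates the classes better than `contact`, so it is the feature selected. Grouping with `groupby` only visits values that actually occur, so no empty cells enter the logarithm.
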


In [42]:
np.unique(D[15])

array(['failure', 'other', 'success', 'unknown'], dtype=object)

In [41]:
D.head()

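| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | unemployed | married | primary | no | 0.0 | no | no | cellular | 3.0 | oct | 0.0 | 0.0 | 0.0 | 0.0 | unknown | no |
| 1 | 1.0 | services | married | secondary | no | 1.0 | yes | yes | cellular | 2.0 | may | 0.0 | 0.0 | 2.0 | 1.0 | failure | no |
| 2 | 1.0 | management | single | tertiary | no | 0.0 | yes | no | cellular | 2.0 | apr | 0.0 | 0.0 | 2.0 | 0.0 | failure | no |
| 3 | 1.0 | management | married | tertiary | no | 0.0 | yes | yes | unknown | 0.0 | jun | 0.0 | 0.0 | 0.0 | 0.0 | unknown | no |
| 4 | 3.0 | blue-collar | married | secondary | no | 0.0 | yes | no | unknown | 1.0 | may | 0.0 | 0.0 | 0.0 | 0.0 | unknown | no |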
