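Decision tree algorithms can be used for both classification and regression. Learning consists of three steps: **feature selection, tree generation, and tree pruning**. The main algorithms are ID3, C4.5, and CART.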

# 1. Feature selection

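Feature selection means choosing the features that are able to classify the sample data, which makes the classification decisions more efficient. If classifying by a feature gives results no different from random classification, that feature has no classification ability, and discarding it does no harm.
The selection criterion is the **information gain** of each feature.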
## 1. Information gain
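Given a data set $D$:<br>
$|D|$ is the number of samples, i.e. the sample size;<br>
the data set has $K$ classes $C_k$, $k=1,2,...,K$;<br>
$|C_k|$ is the number of samples belonging to class $C_k$;<br>
$\sum_{k=1}^{K}{|C_k|}=|D|$;<br>
feature $A$ partitions $D$ into $n$ subsets $D_1, D_2, ..., D_n$;<br>
$|D_i|$ is the number of samples in $D_i$, and $\sum_{i=1}^{n}{|D_i|}=|D|$;<br>
the set $D_{ik}$ contains the samples of subset $D_i$ that belong to class $C_k$, i.e. $D_{ik} = D_i \cap C_k$;<br>
$|D_{ik}|$ is the number of samples in $D_{ik}$.<br>
**Information gain algorithm:**<br>
Input: training data set $D$ and feature $A$;<br>
Output: the information gain $Gain(D,A)$ of feature $A$ on training data set $D$.<br>
Further output: the feature $A_v$ that maximizes the information gain $Gain(D,A)$.<br>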
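
The counts defined above can be read off a contingency table. The following is a minimal sketch on a made-up toy data set; the column names `A` and `label` and the values are illustrative assumptions, not taken from the data used later.

```python
import pandas as pd

# Hypothetical toy data: one feature A and a class label (illustration only).
toy = pd.DataFrame({
    "A":     ["x", "x", "y", "y", "y", "z"],
    "label": ["c1", "c2", "c1", "c1", "c2", "c2"],
})

# Rows are the subsets D_i (one per value of A), columns are the classes C_k;
# each cell holds |D_ik| = |D_i ∩ C_k|.
counts = pd.crosstab(toy["A"], toy["label"])

size_D = len(toy)               # |D|
size_Ck = counts.sum(axis=0)    # |C_k| for every class
size_Di = counts.sum(axis=1)    # |D_i| for every value of A

print(counts)
print(size_Ck.to_dict(), size_Di.to_dict())
```

Both the row sums and the column sums add up to $|D|$, which matches the two summation identities in the definitions.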

(1) Compute the empirical entropy $H(D)$ of data set $D$
$$
H(D) = -\sum_{k=1}^{K}{p_k\log_2(p_k)} = -\sum_{k=1}^{K}{\frac{|C_k|}{|D|}\log_2\left(\frac{|C_k|}{|D|}\right)}
$$
(2) Compute the empirical conditional entropy $H(D|A)$ of data set $D$ given feature $A$
$$
H(D|A) = \sum_{i=1}^{n}{\frac{|D_i|}{|D|}H(D_i)} = -\sum_{i=1}^{n}{\frac{|D_{i}|}{|D|}\sum_{k=1}^{K}{\frac{|D_{ik}|}{|D_i|}\log_2\left(\frac{|D_{ik}|}{|D_i|}\right)}}
$$
(3) Compute the information gain
$$
Gain(D,A) = H(D)-H(D|A)
$$
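
As a sanity check on the three steps, here is a small sketch that evaluates them on a hand-made contingency table. The helper names `entropy` and `info_gain` and the table itself are assumptions introduced for illustration only.

```python
import numpy as np

def entropy(counts):
    """Empirical entropy in bits of a vector of class counts (0*log2(0) taken as 0)."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(table):
    """Information gain from a contingency table: rows are D_i, columns are C_k."""
    table = np.asarray(table, dtype=float)
    n = table.sum()                                   # |D|
    h_d = entropy(table.sum(axis=0))                  # step (1): H(D)
    h_d_a = sum(row.sum() / n * entropy(row)          # step (2): H(D|A)
                for row in table)
    return h_d - h_d_a                                # step (3): Gain(D, A)

# Hypothetical table: 2 feature values (rows) and 2 classes (columns).
table = [[3, 1],
         [1, 3]]
gain = info_gain(table)
print(round(gain, 4))   # 0.1887
```

Here $H(D)=1$ bit because the classes are balanced, each subset has entropy of about 0.8113 bits, and the gain is the difference. Because $H(D|A)$ is a weighted average with weights $|D_i|/|D|$, the gain can never be negative.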



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
# Generate random data: 4 feature columns and 1 label column
X = np.random.randint(4,size=(100,4))
y = np.random.randint(6,size=(100,1))
D = pd.DataFrame(np.column_stack([X,y]))
D.shape

(100, 5)

In [4]:
# Compute the empirical entropy of D (ED)

pk = D.groupby(by=D.shape[1]-1).count()[0].values/len(D)
ED = -np.dot(pk,np.log2(pk))
ED

2.507391515336053

In [5]:
# Compute the empirical conditional entropy of D given each feature (EDK) and the information gain
label = D.shape[1]-1
for v in range(label):
    # contingency table: rows are the values of feature v, columns are the classes
    Av = D.groupby([v,label]).size().unstack(fill_value=0)
    Av = Av.values.astype(float)
    
    Av_sum = Av.sum(1)
    EDK = 0
    for ll in range(len(Av_sum)):
        Av[ll] = Av[ll]* (1/(Av_sum[ll]))
        # weight the entropy of each subset by |D_i|/|D|
        EDK -= Av_sum[ll]/len(D) * np.dot(Av[ll],np.log2(Av[ll]))
    print(ED -  EDK)


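### Analysis
A `nan` appears in the results when some $|D_{ik}|=0$: then $\frac{|D_{ik}|}{|D_i|}=0$ and the computation evaluates $\log_2(\frac{|D_{ik}|}{|D_i|})=\log_2(0)$.<br>
$|D_{ik}|=0$ occurs when some feature value $i$ has no samples of class $k$. Take the example from the "watermelon book" (西瓜书): all watermelons are classified as "good" or "bad", and we split them further by the feature "rind thickness", which takes the two values "thick" and "thin". Many melons turn out to be "thin" and "good", but no melon is both "thick" and "good". So for i = ("rind thickness", "thick") and k = "good", the computation evaluates $\log_2(\frac{|D_{ik}|}{|D_i|})=\log_2(0)$.<br>
Since $p\log_2 p \to 0$ as $p \to 0$, such a term contributes nothing to the entropy, so in this computation we set $\log_2(0) = 0$.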

In [6]:
label = D.shape[1]-1
for v in range(label):
    Av = D.groupby([v,label]).size().unstack(fill_value=0)
    Av = Av.values.astype(float)
    
    Av_sum = Av.sum(1)
    EDK = 0
    for ll in range(len(Av_sum)):
        Av[ll] = Av[ll]* (1/(Av_sum[ll]))
        tempAv = np.log2(Av[ll])
        tempAv[np.isinf(tempAv)] = 0     # treat log2(0) as 0
        # weight the entropy of each subset by |D_i|/|D|
        EDK -= Av_sum[ll]/len(D) * np.dot(Av[ll],tempAv)
    print(ED -  EDK)


The code still reports a warning, but the results no longer contain `nan`.
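
The remaining warning comes from evaluating `np.log2` at 0 before the value is overwritten. One possible way to avoid it, sketched here with a hypothetical helper `plogp` that is not part of the notebook, is to evaluate the logarithm only where the probability is positive.

```python
import numpy as np

def plogp(p):
    """Elementwise p * log2(p), with the convention 0 * log2(0) = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = p > 0
    out[mask] = p[mask] * np.log2(p[mask])   # log2 is only evaluated where p > 0
    return out

row = np.array([0.5, 0.5, 0.0])   # class proportions inside one subset D_i
h = -plogp(row).sum()             # entropy of that subset in bits
print(h)                          # 1.0
```

The zero entry never reaches `np.log2`, so no divide-by-zero warning is raised and no `nan` or `inf` has to be cleaned up afterwards.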

Now switch to a different data set.

In [7]:
def div5(data):
    return np.rint((data - np.min(data))/(np.max(data)-np.min(data))*5)
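
To make the effect of `div5` concrete, the sketch below applies the same function to a few made-up ages. The input values are an assumption for illustration and are not taken from `bank.csv`.

```python
import numpy as np

def div5(data):
    # min-max scale to [0, 5], then round to the nearest integer level
    return np.rint((data - np.min(data))/(np.max(data)-np.min(data))*5)

ages = np.array([18, 30, 45, 60, 87])   # hypothetical values
levels = div5(ages)
print(levels)   # [0. 1. 2. 3. 5.]
```

Each numeric column is therefore reduced to at most six discrete levels (0 to 5), which lets it be treated like a categorical feature when counting $|D_{ik}|$.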

In [8]:
# Load the data and discretize the numeric columns
D = pd.read_csv('bank.csv', sep=';')
D.age = div5(D.age)
D.balance = div5(D.balance)
D.day = div5(D.day)
D.duration = div5(D.duration)
D.campaign = div5(D.campaign)
D.pdays = div5(D.pdays)
D.previous = div5(D.previous)
D.shape

(4521, 17)

n = D.shape[1]      # number of columns (features plus label)
m = D.shape[0]      # number of samples
D.columns = range(n)
pk = D.groupby(by=n-1).count()[0].values/len(D)
ED = -np.dot(pk,np.log2(pk))
ED


# Compute the empirical conditional entropy of D given each feature (EDK) and the information gain
A = []
label = D.shape[1]-1
for v in range(label):
    Av = D.groupby([v,label]).size().unstack(fill_value=0).values.astype(float)
    Av_sum = Av.sum(1)
    EDK = 0
    for ll in range(len(Av_sum)):
        Av[ll] = Av[ll]/(Av_sum[ll])
        tempAv = np.log2(Av[ll])
        tempAv[np.isinf(tempAv)] = 0     # treat log2(0) as 0
        # weight the entropy of each subset by |D_i|/|D|
        EDK -= Av_sum[ll]/len(D) * np.dot(Av[ll],tempAv)
    A.append([v,(ED -  EDK)])
listGain = np.array(A)
# maxGainIndex is the index of the feature with the largest gain; the next step splits on this feature.
maxGainIndex = int(listGain[listGain[:,1].argmax(),0])



In [202]:
class DecisionTree(object):
    def __init__(self):
        self.myTree = {}
        self.dep = 0
    
    def GainCal(self,inData):
        '''
        Input: DataFrame whose last column is the class label
            Computes the information gain of every feature column
        Output:
            maxGainIndex : column label of the feature with the largest information gain
            maxGain : the largest information gain
        '''
        label = inData.columns[-1]          # label column
        m_samples = inData.shape[0]         # number of samples
        pk = inData.groupby(label).size().values/m_samples
        ED = -np.dot(pk,np.log2(pk))
        A = []
        for v in inData.columns[:-1]:
            Av = inData.groupby([v,label]).size().unstack(fill_value=0).values.astype(float)
            Av_sum = Av.sum(1)
            EDK = 0
            for ll in range(len(Av_sum)):
                p = Av[ll]/Av_sum[ll]
                p = p[p > 0]                # 0*log2(0) is treated as 0
                # weight the entropy of each subset by |D_i|/|D|
                EDK -= Av_sum[ll]/m_samples * np.dot(p,np.log2(p))
            A.append([v,(ED -  EDK)])
        maxGainIndex,maxGain = max(A, key=lambda g: g[1])
        return maxGainIndex,maxGain
    
    def CountSamples(self,inLata):
        '''
        Input: one-dimensional sequence (the label column)
        Output: summary of the distinct labels
            nSample: number of distinct labels
            samples : the distinct labels themselves
        '''
        samples = np.unique(inLata)
        nSample = len(samples)
        return nSample,samples
    
    def CreateTree(self,inData,thisTree, minGain):
        '''
        Builds the decision tree for the input DataFrame by filling
        the dict thisTree in place. The root dict is self.myTree.
        Internal nodes look like {feature: {value: subtree}},
        leaves look like {'leaf': label}.
        '''
        label = inData.columns[-1]
        nSample,samples = self.CountSamples(inData[label])
        if nSample == 0:
            thisTree['leaf'] = 0
            return
        if nSample == 1:
            thisTree['leaf'] = samples[0]
            return
        majority = inData[label].mode().iloc[0]     # most frequent label
        if inData.shape[1] == 1:                    # no features left to split on
            thisTree['leaf'] = majority
            return
        maxGainIndex, maxGain = self.GainCal(inData)
        if maxGain <= minGain:                      # gain too small: stop splitting
            thisTree['leaf'] = majority
            return
        thisTree[maxGainIndex] = {}
        for i_feature in np.unique(inData[maxGainIndex]):
            # recurse on the subset of inData (not the global D) with this feature value
            nD = inData[inData[maxGainIndex]==i_feature].drop(maxGainIndex,axis = 1)
            thisTree[maxGainIndex][i_feature] = {}
            self.CreateTree(nD,thisTree[maxGainIndex][i_feature],minGain)
    
    def fitTree(self,inData, minGain=1e-2):
        self.myTree = {}
        self.CreateTree(inData, self.myTree, minGain)
        print(self.myTree)
    
    def fit(self,X,y):
        pass
    
    def predict(self,X):
        pass


In [203]:
dt = DecisionTree()
dt.fitTree(D)
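
The `predict` method is left as a stub above. As a rough sketch of how prediction could work, assume the tree is stored as nested dicts of the form `{feature: {value: subtree}}` with `{'leaf': label}` at the leaves. The function `predict_one`, its `default` parameter, and the toy tree below are hypothetical and only illustrate the idea.

```python
def predict_one(tree, sample, default=None):
    """Walk a nested-dict decision tree for one sample (a mapping keyed by feature)."""
    node = tree
    while 'leaf' not in node:
        if not node:                      # empty node: nothing was learned here
            return default
        feature = next(iter(node))        # the single feature this node splits on
        branches = node[feature]
        value = sample[feature]
        if value not in branches:         # feature value not seen during training
            return default
        node = branches[value]
    return node['leaf']

# Hypothetical hand-written tree: split on feature 0, then on feature 1.
toy_tree = {0: {'a': {'leaf': 'yes'},
                'b': {1: {0.0: {'leaf': 'no'},
                          1.0: {'leaf': 'yes'}}}}}

print(predict_one(toy_tree, {0: 'b', 1: 1.0}))   # yes
```

Returning a default for unseen feature values is one design choice among several; another option would be to fall back to the majority label stored at the node.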