<font color=#0099ff size=4 face="黑体">
Classification and Regression Trees (CART) implementation<br>
@Author: Ge Chen<br>
@Time: 2019-8-24<br>
</font>


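1.2.1. Appendix: how common tree-building algorithms choose splits
One more point: three algorithms are commonly used to build decision trees: ID3, C4.5, and CART.
They differ in the criterion used to split a node into branches:

ID3 splits on information gain.
C4.5 splits on information gain ratio.
CART splits on the Gini index for classification; for regression it picks the split that minimizes the variance (squared error) of the target within the resulting subsets.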
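The three criteria above can be compared on a tiny example. The following is a minimal sketch, not code from this notebook: the helper names (`entropy`, `gini`, `weighted`, `information_gain`, `gain_ratio`, `total_sq_error`) and the toy labels are assumptions made for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def weighted(impurity, parts):
    """Size-weighted average impurity of the subsets produced by a split."""
    n = sum(len(part) for part in parts)
    return sum(len(part) / n * impurity(part) for part in parts)

def information_gain(labels, parts):
    """ID3 criterion: entropy before the split minus weighted entropy after."""
    return entropy(labels) - weighted(entropy, parts)

def gain_ratio(labels, parts):
    """C4.5 criterion: information gain divided by the split information."""
    sizes = np.array([len(part) for part in parts]) / len(labels)
    split_info = float(-np.sum(sizes * np.log2(sizes)))
    return information_gain(labels, parts) / split_info

def total_sq_error(parts):
    """CART regression criterion: summed squared error of the subsets."""
    return sum(np.var(part) * len(part) for part in parts)

# Hypothetical toy data: a split that separates the two classes perfectly.
labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])
parts = [labels[:3], labels[3:]]

print(information_gain(labels, parts))           # ID3
print(gain_ratio(labels, parts))                 # C4.5
print(gini(labels) - weighted(gini, parts))      # CART, classification

# Hypothetical regression targets, split into two constant groups.
y = np.array([1.0, 1.0, 3.0, 3.0])
print(total_sq_error([y]), total_sq_error([y[:2], y[2:]]))  # before, after
```

In each case the split with the largest reduction in impurity (or squared error) is the one chosen; only the impurity measure changes between the algorithms.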
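In practice:

The main differences between CART and C4.5 lie in the kind of output and the shape of the tree. CART handles both regression and classification, while C4.5 handles classification only. A C4.5 node can split into more than two children, whereas CART always splits a node into exactly two children, so the result is a binary tree.

These trees are the building blocks of tree ensembles: Random forest is built on CART trees, and GBDT is built on regression trees.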

In [1]:
import numpy as np

In [86]:
class regTree():
    def loadDataSet(self, filename):
        dataMat = []
        with open(filename) as fr:
            for line in fr.readlines():
                curLine = line.strip().split('\t')
                fltLine = [float(x) for x in curLine]
                dataMat.append(fltLine)
        return dataMat
    
    def leafType(self, dataSet):
        # leaf value: mean of the target column in this subset
        return np.mean(dataSet[:, -1])
    
    def regErr(self, dataSet):
        # total squared error: variance of the target times the number of samples
        return np.var(dataSet[:, -1]) * np.shape(dataSet)[0]
    

    def binSplitDataSet(self, dataSet, feature, value):
        """
        Desc:
            Split the dataset into two subsets by thresholding one feature
            (a binary split).
        Args:
            dataSet -- dataset (np.matrix, target in the last column)
            feature -- index of the feature to split on
            value -- threshold value for that feature
        Returns:
            mat0 -- rows whose feature value is <= value
            mat1 -- rows whose feature value is > value
        """

        # np.nonzero() returns the indices of the elements that are True:
        # >>> test1 = np.array([1,2,3,4,5,6,7,8,9,10])
        # >>> testRes = np.nonzero(test1 < 5)
        # >>> testRes
        # (array([0, 1, 2, 3], dtype=int64),)
        # >>> testRes[0]
        # array([0, 1, 2, 3], dtype=int64)
        mat0 = dataSet[np.nonzero(dataSet[:, feature] <= value)[0], :]
        mat1 = dataSet[np.nonzero(dataSet[:, feature] > value)[0], :]
        return mat0, mat1
    
    def chooseBestSplit(self, dataSet, ops = (1, 4)):
        """
        Desc:
            Find the best binary split of the dataset, or decide that the
            node should become a leaf. The best split is the (feature, value)
            pair that maximizes the error reduction after splitting.
        Args:
            dataSet -- data set (np.matrix, target in the last column)
            ops -- (minimum error reduction, minimum set size)
        Returns:
            bestIndex -- index of the best feature, or None for a leaf
            bestValue -- best split value, or the leaf value when bestIndex is None
        """
        # Minimum error reduction.
        # If the best split reduces the error by less than this value,
        # splitting stops.
        tolS = ops[0]
        
        # minimum set size,
        # If the set size is smaller than this value, stop splitting.
        tolN = ops[1]
        
        # if all target values in this dataset are the same, stop splitting
        if len(set(dataSet[:, -1].T.tolist()[0])) == 1: 
            return None, self.leafType(dataSet)        
        m, n = np.shape(dataSet)
        
        # error before splitting
        S = self.regErr(dataSet)
        print(S)  # debug output: total squared error of the current node
        # initialization
        bestS, bestIndex, bestValue = np.inf, 0, 0
        # start loop to find the best value
        for featIndex in range(n - 1):
            # dataSet[:, featIndex].T.tolist()[0] -- the feature column as a flat list
            for splitVal in set(dataSet[:, featIndex].T.tolist()[0]):
                # try every distinct value of this feature as the threshold
                mat0, mat1 = self.binSplitDataSet(dataSet, featIndex, splitVal)
                
                # Ending condition I: a subset is smaller than the minimum set size;
                # skip this candidate and try the next one
                if (np.shape(mat0)[0] < tolN) or (np.shape(mat1)[0] < tolN):
                    continue
                newS = self.regErr(mat0) + self.regErr(mat1)
                if newS < bestS:
                    bestIndex = featIndex
                    bestValue = splitVal
                    bestS = newS
        # after the loop, check whether a stopping condition is met
        # Ending condition II: the error reduction < minimum error reduction
        if (S - bestS) < tolS:
            return None, self.leafType(dataSet)
        
        # split dataset with best feature index and value
        mat0, mat1 = self.binSplitDataSet(dataSet, bestIndex, bestValue)
        
        # Ending condition I: dataset size after splitting < minimum set size
        if (np.shape(mat0)[0] < tolN) or (np.shape(mat1)[0] < tolN):
            return None, self.leafType(dataSet)
        
        # return the best feature index and best feature value for the next splitting
        return bestIndex, bestValue
    
    def createTree(self, dataSet, ops=(1,4)):
        """
        Desc:
            Build the tree recursively.
        Args:
            dataSet -- data set (np.matrix, target in the last column)
            ops -- (minimum error reduction, minimum set size)
        Returns:
            retTree -- the tree as a nested dict, or a leaf value
        """
        
        # choose the best way to split the dataset, get the best feature and best feature value
        feat, val = self.chooseBestSplit(dataSet, ops)
        
        # if a stopping condition was reached, stop splitting and return the leaf value
        if feat is None:
            return val
        
        # if the splitting does not reach the ending conditions, 
        # store the best value and feature
        retTree = {}
        retTree['spInd'] = feat
        retTree['spVal'] = val
        
        # split the data set with the best feature and value
        lSet, rSet = self.binSplitDataSet(dataSet, feat, val)
        # use recursion to generate the left child node and right child node
        retTree['left'] = self.createTree(lSet, ops)
        retTree['right'] = self.createTree(rSet, ops)
        
        return retTree
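
The roles of `tolS` (minimum error reduction) and `tolN` (minimum set size) are easier to see on data where the answer is known. The sketch below is not part of the class above: it restates the same search with plain `ndarray` inputs, and the names `total_sq_error`, `best_split`, `min_gain`, `min_size` and the step-shaped toy data are assumptions made for illustration.

```python
import numpy as np

def total_sq_error(y):
    """Same quantity as regErr: variance times the number of samples."""
    return np.var(y) * len(y)

def best_split(data, min_gain=1.0, min_size=4):
    """Return (feature index, threshold), or (None, leaf mean) if no split qualifies."""
    y = data[:, -1]
    base = total_sq_error(y)
    best_err, best_feat, best_thr = np.inf, None, None
    for feat in range(data.shape[1] - 1):
        for thr in np.unique(data[:, feat]):
            left = data[data[:, feat] <= thr]
            right = data[data[:, feat] > thr]
            # plays the role of tolN: skip splits that leave a subset too small
            if len(left) < min_size or len(right) < min_size:
                continue
            err = total_sq_error(left[:, -1]) + total_sq_error(right[:, -1])
            if err < best_err:
                best_err, best_feat, best_thr = err, feat, thr
    # plays the role of tolS: the improvement is too small, so make a leaf
    if base - best_err < min_gain:
        return None, float(np.mean(y))
    return best_feat, float(best_thr)

# Hypothetical toy data: y jumps from 0 to 10 between x = 4 and x = 5.
x = np.arange(10, dtype=float)
y = np.where(x < 5, 0.0, 10.0)
data = np.column_stack([x, y])

feat, thr = best_split(data)
print(feat, thr)

# A constant target cannot be improved by any split, so a leaf is returned.
print(best_split(np.column_stack([x, np.ones(10)])))
```

Because `np.var(y) * len(y)` equals the sum of squared deviations from the mean, minimizing it over candidate thresholds is the "minimum variance" criterion described earlier for CART regression.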
                
                

In [87]:
# Regression tree
RT = regTree()
myDat = RT.loadDataSet(r'data\Ch09\data1.txt')
# myDat = RT.loadDataSet('data/9.RegTrees/data2.txt')
# print('myDat=', myDat)
# np.asmatrix is equivalent to np.mat and is still available in NumPy 2.x
myMat = np.asmatrix(myDat)
# print('myMat=', myMat)

myTree = RT.createTree(myMat)

63.380639385992986
3.382419701775143
4.972326818084716


In [88]:
myTree

{'spInd': 0,
 'spVal': 0.48813,
 'left': -0.04465028571428572,
 'right': 1.0180967672413792}
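
The notebook stops at building the tree. As a minimal sketch of how such a nested dict could be used for prediction, the function below walks the tree with the same rule as `binSplitDataSet` (`<=` goes left, `>` goes right). The name `predict` and the sample inputs are assumptions, not part of the original code; the tree literal is copied from the output above.

```python
def predict(tree, x):
    """Walk the nested dict until a leaf (a plain number) is reached."""
    node = tree
    while isinstance(node, dict):
        # same convention as binSplitDataSet: <= goes left, > goes right
        if x[node['spInd']] <= node['spVal']:
            node = node['left']
        else:
            node = node['right']
    return node

# Tree copied from the output of the previous cell.
my_tree = {'spInd': 0,
           'spVal': 0.48813,
           'left': -0.04465028571428572,
           'right': 1.0180967672413792}

print(predict(my_tree, [0.3]))   # feature 0 is below the split value
print(predict(my_tree, [0.9]))   # feature 0 is above the split value
```

With this tree every input is mapped to one of two constants, which is what a regression tree with a single split and mean-valued leaves produces.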