# Decision Trees
## Decision Trees for Discrete Data

### Discrete Data

|No surfacing|Flippers|Is fish|
|--|--|--|
|1|1|yes|
|1|1|yes|
|1|0|no|
|0|1|no|
|0|1|no|

In [1]:
from math import log

In [2]:
def create_dataset():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    names = ['no surfacing', 'flippers']  # feature names: no surfacing, has flippers
    return dataSet, names

In [3]:
dataset, names = create_dataset()
dataset

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

In [4]:
names

['no surfacing', 'flippers']

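### Building the Decision Tree Recursively
`create_tree` returns a nested dictionary of the form:
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

The code has four parts:

1. Recursion termination conditions
    1. All samples belong to a single class
    2. No features remain to split on
2. Choose the feature with the largest information gain
3. Split the dataset on that feature
4. Recurse into the left and right subtrees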

In [9]:
def create_tree(dataset, names, features, tree):
    """Return a decision tree as a nested dictionary.

    dataset: the dataset or a subset of it
    names: feature names
    features: the full set or a subset of feature indices;
              features = {0, 1} means the first and second columns
    tree: the decision tree dictionary or a subtree dictionary
    """
    classes = set(sample[-1] for sample in dataset)  # e.g. classes = {'yes', 'no'}
    if len(classes) <= 1:  # only one class left: return it as a leaf
        tree = classes.pop()
        return tree
    if len(features) == 0:  # no feature left: fall back to the majority class
        tree = majority_count(dataset, classes)
        return tree

    # Best feature and the left and right subsets it produces
    (best_feature, best_value, another,
     best_left, best_right) = choose_feature(dataset, features, classes)
    if best_value is None:  # no split improves purity: return the majority class
        return majority_count(dataset, classes)
    best_feature_name = names[best_feature]
    tree = {best_feature_name: {best_value: {}, another: {}}}
    # Build a new set instead of mutating `features`, so that the left and
    # right subtrees each see the same remaining features
    remaining = features - {best_feature}
    print('current feature => ', best_feature_name)

    if best_left:
        sub_tree = create_tree(best_left, names, remaining, tree[best_feature_name][best_value])
        tree[best_feature_name][best_value] = sub_tree
    if best_right:
        sub_tree = create_tree(best_right, names, remaining, tree[best_feature_name][another])
        tree[best_feature_name][another] = sub_tree
    return tree

In [5]:
def majority_count(dataset, classes):
    """
    Return the class with the most samples in a leaf node.
    """
    max_cnt = 0
    max_c = None
    labels = [sample[-1] for sample in dataset]
    for c in classes:
        cnt = labels.count(c)
        if cnt > max_cnt:
            max_cnt = cnt
            max_c = c
    return max_c  # the majority class, not the last class visited by the loop
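
The same majority vote can also be written with the standard library. The sketch below is an alternative, not the notebook's own code; `majority_class` is a hypothetical name, and ties are resolved in favor of the label that appears first in the data.

```python
from collections import Counter


def majority_class(dataset):
    """Return the most frequent class label (last column) in dataset."""
    labels = [sample[-1] for sample in dataset]
    # most_common(1) returns a list with one (label, count) pair
    return Counter(labels).most_common(1)[0][0]


print(majority_class([[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]))  # no
```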

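### Splitting the Dataset
The dataset is split by information gain, alternating between two steps:
    
    Split the dataset on a given feature (=> `split_dataset`)
    Choose the best split, i.e. the feature with the largest information gain (=> `choose_feature`)
    Split the dataset on the chosen feature (=> `split_dataset`)
    ……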
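
For reference, the standard definitions of entropy and information gain for a binary split can be written as below. The symbols are notation introduced here, not names from the code: $D$ is the current dataset, $p_k$ the share of samples in $D$ with class $k$, $D_L$ the samples whose feature $f$ equals value $v$, and $D_R$ all the others.

```latex
H(D) = -\sum_{k} p_k \log_2 p_k
\qquad
\mathrm{Gain}(D, f, v) = H(D) - \left( \frac{|D_L|}{|D|}\, H(D_L) + \frac{|D_R|}{|D|}\, H(D_R) \right)
```

Each subset's entropy is weighted by its share of the samples, so a small pure subset cannot outweigh a large impure one.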

In [6]:
def split_dataset(dataset, feature, value):
    """
    Return (left, right): left holds the samples where sample[feature] == value,
    right holds all the others.
    """
    left, right = list(), list()
    for sample in dataset:
        if sample[feature] == value:
            left.append(sample)
        else:
            right.append(sample)
    return left, right

In [7]:
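def calc_entropy(classes, *datasets):
    """Entropy of one dataset, or the size-weighted entropy of several subsets."""
    total = sum(len(d) for d in datasets)
    entropy = 0.0
    for d in datasets:
        if d:
            n_samples = len(d)
            weight = n_samples / float(total)  # this subset's share of all samples
            labels = [sample[-1] for sample in d]
            for c in classes:
                proportion = labels.count(c) / float(n_samples)
                if proportion != 0:
                    entropy -= weight * proportion * log(proportion, 2)
    return entropy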
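
To make the numbers concrete, here is a minimal self-contained sketch that computes the entropy of the fish dataset and the size-weighted information gain of each feature. The helpers `entropy` and `information_gain` are hypothetical names used only for this illustration; they are not part of the notebook's code.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))


def information_gain(rows, feature, value):
    """Gain from splitting rows into feature == value versus the rest."""
    left = [r[-1] for r in rows if r[feature] == value]
    right = [r[-1] for r in rows if r[feature] != value]
    # Weight each non-empty subset by its share of the samples
    weighted = sum(len(part) / len(rows) * entropy(part)
                   for part in (left, right) if part)
    return entropy([r[-1] for r in rows]) - weighted


rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(round(entropy([r[-1] for r in rows]), 3))  # 0.971
print(round(information_gain(rows, 0, 1), 3))    # 0.42  ('no surfacing')
print(round(information_gain(rows, 1, 1), 3))    # 0.171 ('flippers')
```

'no surfacing' yields the larger gain, which is why it ends up at the root of the tree.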

In [8]:
def choose_feature(dataset, features, classes):
    """Return the (feature, value) split with the largest information gain."""
    best_gain = 0.0
    best_feature = 0
    best_value = None
    another = None
    best_left, best_right = None, None
    org_entropy = calc_entropy(classes, dataset)  # entropy before splitting
    for f in features:
        values = set(sample[f] for sample in dataset)
        for value in values:
            # split the dataset
            left, right = split_dataset(dataset, f, value)
            # calculate information gain
            new_entropy = calc_entropy(classes, left, right)
            gain = org_entropy - new_entropy
            if gain > best_gain:
                best_gain = gain
                best_feature = f
                best_value = value
                best_left, best_right = left, right
                # The other branch value, taken from the same feature
                # (this assumes binary features)
                another = next((v for v in values if v != value), None)
    return best_feature, best_value, another, best_left, best_right

### main

In [10]:
tree = dict()
dataset, names = create_dataset()
features = set(range(len(dataset[0]) - 1))  # features = {0, 1}
create_tree(dataset, names, features, tree)

current feature =>  no surfacing
current feature =>  flippers


{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
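
Once the tree is built, a new sample can be classified by walking the nested dictionary from the root to a leaf. The following is a minimal sketch; `classify` is a hypothetical helper that does not appear in the notebook, and it assumes every feature value in the sample was seen during training.

```python
def classify(tree, names, sample):
    """Follow the branches of a nested-dict tree until a class label is reached."""
    while isinstance(tree, dict):
        feature_name = next(iter(tree))            # each node holds exactly one feature
        branches = tree[feature_name]
        value = sample[names.index(feature_name)]  # the sample's value for that feature
        tree = branches[value]                     # raises KeyError for an unseen value
    return tree


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
names = ['no surfacing', 'flippers']
print(classify(tree, names, [1, 1]))  # yes
print(classify(tree, names, [1, 0]))  # no
print(classify(tree, names, [0, 1]))  # no
```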