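# Problem 5 (optional): Implement a decision tree with post-pruning

## Tasks
1. Implement a decision tree with post-pruning.
2. Use any dataset you like.
3. Compare the pruned tree against the unpruned tree.

**(Optional exercise: the solution below is for reference only.)**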

In [19]:
# Import libraries
import pandas as pd
from sklearn.utils import shuffle
import numpy as np
import math
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [2]:
loans = pd.read_csv('data/lendingclub/lending-club-data.csv', low_memory=False)

# Preprocess the data: use safe_loans as the label (+1 = safe loan, -1 = bad loan)
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
del loans['bad_loans']

features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

loans = shuffle(loans, random_state = 34)

split_line1 = int(len(loans) * 0.6)
split_line2 = int(len(loans) * 0.8)
train_data = loans.iloc[: split_line1]
validation_data = loans.iloc[split_line1: split_line2]
test_data = loans.iloc[split_line2:]

In [3]:
def one_hot_encoding(data, features_categorical):
    '''
    Parameter
    ----------
    data: pd.DataFrame
    
    features_categorical: list(str)
    '''
    
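    # Iterate over all categorical features
    for cat in features_categorical:
        
        # One-hot encode this column, using the feature name as the prefix
        one_encoding = pd.get_dummies(data[cat], prefix = cat)
        
        # Concatenate the one-hot columns onto the existing DataFrame
        data = pd.concat([data, one_encoding],axis=1)
        
        # Drop the original categorical column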
        del data[cat]
    
    return data

In [4]:
train_data = one_hot_encoding(train_data, features)

one_hot_features = train_data.columns.tolist()
one_hot_features.remove(target)

validation_tmp = one_hot_encoding(validation_data, features)
validation_data = pd.DataFrame(columns = train_data.columns)
for feature in train_data.columns:
    if feature in validation_tmp:
        validation_data[feature] = validation_tmp[feature].copy()
    else:
        validation_data[feature] = np.zeros(len(validation_tmp), dtype = 'uint8')
        
test_data_tmp = one_hot_encoding(test_data, features)
test_data = pd.DataFrame(columns = train_data.columns)
for feature in train_data.columns:
    if feature in test_data_tmp.columns:
        test_data[feature] = test_data_tmp[feature].copy()
    else:
        test_data[feature] = np.zeros(test_data_tmp.shape[0], dtype = 'uint8')

print(train_data.shape, validation_data.shape, test_data.shape)

        
target_values = train_data[target]
print(len(target_values[target_values==1]))
print(len(target_values[target_values==-1]))

(73564, 25) (24521, 25) (24522, 25)
59622
13942
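
The two alignment loops above make the validation and test frames share the training columns, filling missing dummy columns with zeros. As a minimal sketch, the same idea can be expressed with `DataFrame.reindex`. The toy frames and the names `train_encoded`, `valid_encoded`, and `valid_aligned` are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself uses the LendingClub splits.
train = pd.DataFrame({"grade": ["A", "B", "C"], "safe_loans": [1, -1, 1]})
valid = pd.DataFrame({"grade": ["A", "A"], "safe_loans": [1, -1]})

train_encoded = pd.get_dummies(train, columns=["grade"], dtype="uint8")
valid_encoded = pd.get_dummies(valid, columns=["grade"], dtype="uint8")

# Reindex to the training columns: dummy columns missing from the validation
# split are created and filled with 0, and the column order follows training.
valid_aligned = valid_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(valid_aligned.columns.tolist())
```

Here `grade_B` and `grade_C` never occur in the toy validation split, so they appear in `valid_aligned` as all-zero columns.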


In [17]:
# All functions below are the same as in the previous problems
def information_entropy(labels_in_node):
    '''
    Compute the information entropy of the current node.
    
    Parameter
    ----------
    labels_in_node: np.ndarray, e.g. [-1, 1, -1, 1, 1]
    
    Returns
    ----------
    float: information entropy
    '''
    
    # Total number of samples
    num_of_samples = labels_in_node.shape[0]
    
    if num_of_samples == 0:
        return 0
    
    # Number of samples labeled 1
    num_of_positive = len(labels_in_node[labels_in_node == 1])
    
    # Number of samples labeled -1
    num_of_negative = len(labels_in_node[labels_in_node == -1])
    
    # Probability of the positive class
    prob_positive = num_of_positive / num_of_samples
    
    # Probability of the negative class
    prob_negative = num_of_negative / num_of_samples
    
    if prob_positive == 0:
        positive_part = 0
    else:
        positive_part = prob_positive * np.log2(prob_positive)
    
    if prob_negative == 0:
        negative_part = 0
    else:
        negative_part = prob_negative * np.log2(prob_negative)
    
    return - ( positive_part + negative_part )


def compute_information_gain_ratios(data, features, target, annotate = False):
    '''
    Compute and store the information gain ratio of every feature.
    
    Parameter
    ----------
    data: pd.DataFrame, data containing both features and the label
    
    features: list(str), list of feature names
    
    target: str, name of the label column
    
    annotate: boolean, default False, whether to print each feature's gain ratio
    
    Returns
    ----------
    gain_ratios: dict, key: str, feature name
                       value: float, information gain ratio
    '''
    
    gain_ratios = dict()
    
    # Iterate over all features and evaluate the binary split on each one
    for feature in features:
        
        # Left subtree: samples whose value for this feature is 0
        left_split_target = data[data[feature] == 0][target]
        
        # Right subtree: samples whose value for this feature is 1
        right_split_target =  data[data[feature] == 1][target]
            
        # Entropy of the left subtree
        left_entropy = information_entropy(left_split_target)
        
        # Weight of the left subtree
        left_weight = len(left_split_target) / (len(left_split_target) + len(right_split_target))

        # Entropy of the right subtree
        right_entropy = information_entropy(right_split_target)
        
        # Weight of the right subtree
        right_weight = len(right_split_target) / (len(left_split_target) + len(right_split_target))
        
        # Entropy of the current node
        current_entropy = information_entropy(data[target])
        
        # Information gain of the split
        gain = current_entropy - (left_weight * left_entropy + right_weight * right_entropy)
        
        # Term of the IV (intrinsic value) formula for feature value 0
        if left_weight == 0:
            left_IV = 0
        else:
            left_IV = left_weight * np.log2(left_weight)
        
        # Term of the IV formula for feature value 1
        if right_weight == 0:
            right_IV = 0
        else:
            right_IV = right_weight * np.log2(right_weight)
        
        # IV is the negated sum of the per-branch terms
        IV = - (left_IV + right_IV)
            
        # Gain ratio of splitting on the current feature
        # A tiny epsilon is added to the denominator so that IV == 0 does not cause a division by zero
        gain_ratio = gain / (IV + np.finfo(np.longdouble).eps)
        
        # Store the gain ratio
        gain_ratios[feature] = gain_ratio
        
        if annotate:
            print(" ", feature, gain_ratio)
            
    return gain_ratios


def best_splitting_feature(data, features, target, criterion = 'gain_ratio', annotate = False):
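    '''
    Given a splitting criterion and the data, find the best feature to split on.
    
    Parameters
    ----------
    data: pd.DataFrame, data containing both features and the label
    
    features: list(str), list of feature names
    
    target: str, name of the label column
    
    criterion: str, which metric to use; one of 'information_gain', 'gain_ratio', 'gini'
               (only 'gain_ratio' is implemented in this notebook)
    
    annotate: boolean, default False, whether to print progress messages
    
    Returns
    ----------
    best_feature: str, name of the best splitting feature
    
    '''
    if criterion == 'information_gain':
        if annotate:
            print('using information gain')
        # Not implemented in this notebook; fail loudly instead of returning None
        raise NotImplementedError("criterion 'information_gain' is not implemented here")

    elif criterion == 'gain_ratio':
        if annotate:
            print('using information gain ratio')
        
        # Compute the gain ratio of every candidate feature
        gain_ratios = compute_information_gain_ratios(data, features, target, annotate)
    
        # Pick the feature with the largest gain ratio
        best_feature = max(gain_ratios.items(), key = lambda x: x[1])[0]

        return best_feature
    
    elif criterion == 'gini':
        if annotate:
            print('using gini')
        # Not implemented in this notebook; fail loudly instead of returning None
        raise NotImplementedError("criterion 'gini' is not implemented here")
    else:
        raise Exception("Invalid criterion!", criterion)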
        

def intermediate_node_num_mistakes(labels_in_node):
    '''
    Count the samples of the minority class in a node, e.g. the input [1, 1, -1, -1, 1] returns 2.
    
    Parameter
    ----------
    labels_in_node: np.ndarray, pd.Series
    
    Returns
    ----------
    int: the count
    
    '''
    # Return 0 if the input array is empty
    if len(labels_in_node) == 0:
        return 0
    
    # Count the samples labeled 1
    num_of_one = len(labels_in_node[labels_in_node == 1])
    
    # Count the samples labeled -1
    num_of_minus_one = len(labels_in_node[labels_in_node == -1])
    
    return min(num_of_one, num_of_minus_one)


def majority_class(labels_in_node):
    '''
    Return the majority class among the samples in a node.
    Ties are resolved as -1; an empty node returns 0.
    '''
    # Return 0 if the input array is empty
    if len(labels_in_node) == 0:
        return 0
    
    # Count the samples labeled 1
    num_of_one = len(labels_in_node[labels_in_node == 1])
    
    # Count the samples labeled -1
    num_of_minus_one = len(labels_in_node[labels_in_node == -1])
    
    return 1 if num_of_minus_one < num_of_one else -1


def create_leaf(target_values):
    '''
    Determine the label of the current leaf node and store the leaf's information in a dict.
    
    Parameter:
    ----------
    target_values: pd.Series, labels of the samples in the current leaf

    Returns:
    ----------
    leaf: dict representing a leaf node,
            leaf['splitting_feature'], None, a leaf has no splitting feature
            leaf['left'], None, a leaf has no left subtree
            leaf['right'], None, a leaf has no right subtree
            leaf['is_leaf'], True, marks the node as a leaf
            leaf['prediction'], int, the label predicted by this leaf
    '''
    # Create the leaf node
    leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf': True}

    # The leaf predicts by majority vote: the label of the class with more samples,
    # stored in leaf['prediction']
    leaf['prediction'] = majority_class(target_values)

    # Return the leaf node
    return leaf


def classify(tree, x, annotate = False):
    '''
    Predict recursively; handles one sample at a time.
    
    Parameters
    ----------
    tree: dict
    
    x: pd.Series, the sample to classify
    
    annotate, boolean, whether to print the decision path
    
    Returns
    ----------
    the predicted label
    '''
    if tree['is_leaf']:
        if annotate:
            print ("At leaf, predicting %s" % tree['prediction'])
        return tree['prediction']
    else:
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
             print ("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)
        

def predict(tree, data):
    '''
    Iterate over the rows of data, predict each sample, and return the results as an np.ndarray.
    
    Parameter
    ----------
    tree, dict, the model
    
    data, pd.DataFrame, the data
    
    Returns
    ----------
    predictions, np.ndarray, the model's predictions for these samples
    '''
    predictions = np.zeros(len(data))
    
    for i in range(len(data)):
        predictions[i] = classify(tree, data.iloc[i])
    
    return predictions
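
The entropy and gain-ratio computations in `information_entropy` and `compute_information_gain_ratios` can be checked by hand on a small node. The following is a minimal sketch that works on class counts rather than DataFrames. The helper names `entropy` and `gain_ratio` and the toy counts are assumptions for illustration, and it returns 0 when the intrinsic value is 0 instead of adding an epsilon to the denominator.

```python
import math

def entropy(pos, neg):
    """Binary entropy in bits for a node with `pos` positive and `neg` negative samples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result

def gain_ratio(left, right):
    """Gain ratio of a binary split; `left` and `right` are (pos, neg) count pairs."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    parent = entropy(left[0] + right[0], left[1] + right[1])
    children = n_left / n * entropy(*left) + n_right / n * entropy(*right)
    gain = parent - children
    # The intrinsic value has the same form as entropy, taken over branch sizes
    intrinsic_value = entropy(n_left, n_right)
    return gain / intrinsic_value if intrinsic_value > 0 else 0.0

# Hypothetical node with 8 samples, split into (3 positive, 1 negative)
# and (1 positive, 3 negative).
print(entropy(4, 4))
print(gain_ratio((3, 1), (1, 3)))
```

For this split the parent entropy is 1 bit, each child has entropy of about 0.811 bits, and the intrinsic value is 1, so the gain ratio is about 0.189.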

In [10]:
def find_majority_class(labels_in_node):
    '''
    Return the majority class among the samples in a node.
    Ties are resolved as -1; an empty node returns 0.
    '''
    # Return 0 if the input array is empty
    if len(labels_in_node) == 0:
        return 0
    
    # Count the samples labeled 1
    num_of_one = len(labels_in_node[labels_in_node == 1])
    
    # Count the samples labeled -1
    num_of_minus_one = len(labels_in_node[labels_in_node == -1])
    
    return 1 if num_of_minus_one < num_of_one else -1

In [11]:
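def no_pruning_decision_tree_create(data,validation_data, features, target,index_tree, criterion = 'gain_ratio', current_depth = 0, max_depth = 10, annotate = False):
    '''
    Build the unpruned decision tree recursively. Every internal node also records the
    majority class of the validation samples that reach it, for use by post-pruning.

    Parameter:
    ----------
    data: pd.DataFrame, training data

    validation_data: pd.DataFrame, validation samples that reach the current node

    features: iterable, the features, e.g. a list

    target: str, name of the label column

    index_tree: dict (int -> list), internal nodes grouped by depth; each new internal node is appended to index_tree[current_depth]

    criterion: str, splitting criterion; one of 'information_gain', 'gain_ratio', 'gini' (only 'gain_ratio' is implemented in this notebook)

    current_depth: int, current depth, tracked across the recursion

    max_depth: int, maximum depth of the tree; recursion stops when it is reached

    Returns:
    ----------
    dict, dict['is_leaf']          : False, the current node is not a leaf
          dict['prediction']       : None, internal nodes have no prediction
          dict['majority_class']   : int, majority label of the validation samples at this node
          dict['splitting_feature']: splitting_feature, the feature this node splits on
          dict['left']             : dict
          dict['right']            : dict
    '''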
    
    if criterion not in ['information_gain', 'gain_ratio', 'gini']:
        raise Exception("Invalid criterion!", criterion)
    
    # Copy the feature list; each feature used for a split is removed from the copy
    remaining_features = features[:]
    
    # Extract the labels
    target_values = data[target]
    validation_values = validation_data[target]
    # Majority label of the validation samples at this node (0 if none reach it)
    majority_class = find_majority_class(validation_values)
    print("-" * 50)
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))

    # Stopping condition 1
    # All samples in the node belong to one class, i.e. the minority class count is 0
    # Checked with intermediate_node_num_mistakes defined above
    if intermediate_node_num_mistakes(target_values) == 0:
        print("Stopping condition 1 reached.")
        return create_leaf(target_values)   # create a leaf node
    
    # Stopping condition 2
    # No features are left to split on, i.e. remaining_features is empty
    
    if len(remaining_features) == 0:
        print("Stopping condition 2 reached.")
        return create_leaf(target_values)   # create a leaf node
    
    # Stopping condition 3
    # The current depth has reached the maximum depth
    
    if current_depth == max_depth:
        print("Reached maximum depth. Stopping for now.")
        return create_leaf(target_values)   # create a leaf node

    # Find the best splitting feature
    # using best_splitting_feature
    
    splitting_feature = best_splitting_feature(data, remaining_features, target, criterion)
    
    # Split the data in two using the best feature
    # Data for the left subtree
    left_split = data[data[splitting_feature] == 0]
    
    # Data for the right subtree
    right_split = data[data[splitting_feature] == 1]
    
    # Route the validation data the same way
    validation_left_split = validation_data[validation_data[splitting_feature] == 0]
    validation_right_split = validation_data[validation_data[splitting_feature] == 1]
    
    # The split is done, so remove this feature from the remaining features
    remaining_features.remove(splitting_feature)
    
    # Print the splitting feature and the sample counts of the left and right subtrees
    print("Split on feature %s. (%s, %s)" % (\
                      splitting_feature, len(left_split), len(right_split)))
    
    # If the split sends every sample to one side, turn this node into a leaf directly
    # Check whether the left split is "perfect"
    if len(left_split) == len(data):
        print("Creating leaf node.")
        return create_leaf(left_split[target])
    
    # Check whether the right split is "perfect"
    if len(right_split) == len(data):
        print("Creating leaf node.")
        return create_leaf(right_split[target])

    # Recursively build the left subtree
    left_tree = no_pruning_decision_tree_create(left_split,validation_left_split, remaining_features, target,index_tree, criterion, current_depth + 1, max_depth, annotate)
    
    # Recursively build the right subtree
    
    right_tree = no_pruning_decision_tree_create(right_split,validation_right_split, remaining_features, target,index_tree, criterion, current_depth + 1, max_depth, annotate)
    
    # Return the internal (non-leaf) node
    thisnode = {'is_leaf'          : False, 
            'prediction'       : None,
            'majority_class'   :majority_class,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}
    
    index_tree[current_depth].append(thisnode)
    
    return thisnode

In [12]:
max_depth = 6
index_tree = dict()
for i in range(0,max_depth+1):
    index_tree[i] = list()
tree_without_pruning = no_pruning_decision_tree_create(train_data,validation_data, one_hot_features, target,index_tree, 'gain_ratio', max_depth = max_depth, annotate = False)

--------------------------------------------------
Subtree, depth = 0 (73564 data points).
Split on feature grade_F. (71229, 2335)
--------------------------------------------------
Subtree, depth = 1 (71229 data points).
Split on feature grade_A. (57869, 13360)
--------------------------------------------------
Subtree, depth = 2 (57869 data points).
Split on feature grade_G. (57232, 637)
--------------------------------------------------
Subtree, depth = 3 (57232 data points).
Split on feature grade_E. (51828, 5404)
--------------------------------------------------
Subtree, depth = 4 (51828 data points).
Split on feature grade_D. (40326, 11502)
--------------------------------------------------
Subtree, depth = 5 (40326 data points).
Split on feature term_ 36 months. (5760, 34566)
--------------------------------------------------
Subtree, depth = 6 (5760 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (345

Split on feature emp_length_8 years. (460, 18)
--------------------------------------------------
Subtree, depth = 4 (460 data points).
Split on feature emp_length_4 years. (433, 27)
--------------------------------------------------
Subtree, depth = 5 (433 data points).
Split on feature home_ownership_MORTGAGE. (287, 146)
--------------------------------------------------
Subtree, depth = 6 (287 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (146 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 5 (27 data points).
Split on feature home_ownership_OWN. (25, 2)
--------------------------------------------------
Subtree, depth = 6 (25 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (2 data points).
Stopping condition 1 reached.
------------------------------------

In [13]:
def post_pruning(tree,index_tree,max_depth,validation_data,target):
    # Visit internal nodes bottom-up, from the deepest level to the root
    for i in range(max_depth-1,-1,-1):
        for j in range(len(index_tree[i])):
            # Look up the current node and the majority labels of its children
            node = index_tree[i][j]
            node_label = node['majority_class']
            left_label = 1
            right_label = 1
            if node['left']['is_leaf']==True :
                left_label = node['left']['prediction']
            else: left_label = node['left']['majority_class']
            if node['right']['is_leaf']==True :
                right_label = node['right']['prediction']
            else: right_label = node['right']['majority_class']
            # If the node's majority label matches the majority labels of both children,
            # replacing it with a leaf is assumed not to change accuracy, so the check is skipped
            # (this holds exactly when both children are leaves; otherwise it is a heuristic)
            if node_label == left_label and node_label == right_label:
                print("Node and both children share the same majority label; skipping validation")
                continue
                
            # Shallow copy so the node can be restored if pruning does not help
            node_info = dict(node)
            prediction = predict(tree,validation_data)
            np_score = accuracy_score(validation_data[target],prediction)
            print('Before replacement, validation accuracy:',np_score)
            
            # Temporarily turn the node into a leaf that predicts its majority class
            majority_class = node['majority_class']
            node['splitting_feature'] = None
            node['left'] = None
            node['right'] = None
            node['is_leaf'] = True
            node['prediction'] = majority_class
            
            #print(tree)
            
            prediction = predict(tree,validation_data)
            p_score = accuracy_score(validation_data[target],prediction)
            print('After replacement, validation accuracy:',p_score)
            
            # Keep the replacement only if validation accuracy strictly improves
            if p_score>np_score :
                print('Pruning: turning %s into a leaf node'%(node_info['splitting_feature']))
            else:
                print('Not pruning')
                node['splitting_feature'] = node_info['splitting_feature']
                node['left'] = node_info['left']
                node['right'] = node_info['right']
                node['is_leaf'] = node_info['is_leaf']
                node['prediction'] = node_info['prediction']
                node['majority_class'] = node_info['majority_class']
    return tree
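
For comparison, here is a minimal, self-contained sketch of recursive reduced-error pruning on a toy tree that uses the same dict layout. It is not the notebook's method. It compares errors only on the validation rows that reach each node, which gives the same decision as re-scoring the whole validation set because rows routed elsewhere are unaffected. It also collapses a node when the leaf is no worse (`<=`), whereas `post_pruning` above requires a strict improvement. The helpers `make_leaf`, `errors`, and `prune`, the features `a` and `b`, and the toy data are all hypothetical.

```python
def make_leaf(label):
    """Build a leaf in the same dict layout the notebook uses."""
    return {"is_leaf": True, "prediction": label,
            "splitting_feature": None, "left": None, "right": None}

def classify(tree, x):
    """Route one sample (a dict of 0/1 features) down to a leaf."""
    while not tree["is_leaf"]:
        tree = tree["left"] if x[tree["splitting_feature"]] == 0 else tree["right"]
    return tree["prediction"]

def errors(tree, rows):
    """Number of (features, label) rows that the subtree misclassifies."""
    return sum(classify(tree, x) != y for x, y in rows)

def prune(tree, rows):
    """Bottom-up pruning using only the validation rows that reach `tree`."""
    if tree["is_leaf"]:
        return tree
    feature = tree["splitting_feature"]
    tree["left"] = prune(tree["left"], [(x, y) for x, y in rows if x[feature] == 0])
    tree["right"] = prune(tree["right"], [(x, y) for x, y in rows if x[feature] != 0])
    candidate = make_leaf(tree["majority_class"])
    # Collapse when the leaf makes no more validation errors than the subtree
    return candidate if errors(candidate, rows) <= errors(tree, rows) else tree

# Hypothetical tree: the root splits on "a"; its right child splits on "b".
toy_tree = {
    "is_leaf": False, "prediction": None, "majority_class": 1,
    "splitting_feature": "a",
    "left": make_leaf(-1),
    "right": {"is_leaf": False, "prediction": None, "majority_class": 1,
              "splitting_feature": "b",
              "left": make_leaf(-1), "right": make_leaf(1)},
}
validation_rows = [({"a": 0, "b": 0}, -1), ({"a": 0, "b": 1}, -1),
                   ({"a": 1, "b": 0}, 1), ({"a": 1, "b": 1}, 1)]

pruned = prune(toy_tree, validation_rows)
print(pruned["is_leaf"], pruned["right"]["is_leaf"], errors(pruned, validation_rows))
```

In this toy case the split on `b` misclassifies one validation row, so it collapses into a leaf predicting 1. The root split is kept, because replacing it with a leaf would misclassify two rows.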

In [14]:
def generate_echarts_data(tree,count):
    
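    # dict describing the current node
    value = dict()
    
    # If the tree passed in is already a leaf
    if tree['is_leaf'] == True:
        # count['left'] tallies the leaf nodes
        count['left']=count['left']+1
        # Its value is the predicted label
        value['value'] = tree['prediction']
        
        # Its name is "label: <predicted label>"
        value['name'] = 'label: %s'%(tree['prediction'])
        
        # Return the dict directly
        return value
    
    # Otherwise the node is named after its splitting feature and its children form a list
    # Add the left and right subtrees to children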
    value['name'] = tree['splitting_feature']
    count['bifurcation']=count['bifurcation']+1
    value['children'] = [generate_echarts_data(tree['left'],count), generate_echarts_data(tree['right'],count)]
    return value

In [15]:
# from pyecharts.charts import Tree
from pyecharts import Tree
np_count = dict()
np_count['left']=0
np_count['bifurcation'] = 0
np_data = generate_echarts_data(tree_without_pruning,np_count)
tree = Tree()
tree.add("",
         [np_data],
         collapse_interval=5,
         pos_top="5%",
         pos_left="0%",
         symbol = 'rect',
         symbol_size = 20
         )
print(np_count)

ERROR:lml.utils:failed to import pyecharts_snapshot
Traceback (most recent call last):
  File "D:\Users\11979\Anaconda3\envs\tens\lib\site-packages\lml\utils.py", line 43, in do_import
    plugin_module = __import__(plugin_module_name)
ImportError: No module named 'pyecharts_snapshot'


{'bifurcation': 44, 'left': 45}
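
The counts printed above, 44 internal nodes (`bifurcation`) and 45 leaves (stored under the key `left`), fit the usual property of a binary tree in which every internal node has exactly two children: the number of leaves is the number of internal nodes plus one. A minimal sketch of the same count without a mutable counter dict follows. The function `count_nodes` and the toy tree are hypothetical.

```python
def count_nodes(tree):
    """Return (leaves, internal_nodes) for a tree of nested dicts."""
    if tree["is_leaf"]:
        return 1, 0
    left_leaves, left_internal = count_nodes(tree["left"])
    right_leaves, right_internal = count_nodes(tree["right"])
    return left_leaves + right_leaves, left_internal + right_internal + 1

# Hypothetical toy tree with two internal nodes and three leaves
toy = {"is_leaf": False,
       "left": {"is_leaf": True},
       "right": {"is_leaf": False,
                 "left": {"is_leaf": True},
                 "right": {"is_leaf": True}}}

leaves, internal = count_nodes(toy)
print(leaves, internal)
```

The same check, `leaves == internal + 1`, can be used as a quick sanity test on the counts of the notebook's trees.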


In [20]:
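# Use a fresh index_tree so that post_pruning only visits nodes of the new tree
# (otherwise the nodes of tree_without_pruning registered earlier would be checked again)
index_tree = dict()
for i in range(0,max_depth+1):
    index_tree[i] = list()
tree_with_pruning = no_pruning_decision_tree_create(train_data,validation_data, one_hot_features, target,index_tree, 'gain_ratio', max_depth = max_depth, annotate = False)
tree_with_pruning = post_pruning(tree_with_pruning,index_tree,max_depth,validation_data,target)
p_count =dict() 
p_count['left']=0
p_count['bifurcation'] = 0
p_data = generate_echarts_data(tree_with_pruning,p_count)
print(p_count)
if np_count['bifurcation']!=p_count['bifurcation']:
    print("The internal-node counts of the two trees differ, so pruning did take place (the trees are too large to compare by eye in the plot)")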
tree.add("",
         [p_data],
         collapse_interval=5,
         pos_top="55%",
         pos_left="0%",
         symbol = 'rect',
         symbol_size = 20
         )
tree.render()

--------------------------------------------------
Subtree, depth = 0 (73564 data points).
Split on feature grade_F. (71229, 2335)
--------------------------------------------------
Subtree, depth = 1 (71229 data points).
Split on feature grade_A. (57869, 13360)
--------------------------------------------------
Subtree, depth = 2 (57869 data points).
Split on feature grade_G. (57232, 637)
--------------------------------------------------
Subtree, depth = 3 (57232 data points).
Split on feature grade_E. (51828, 5404)
--------------------------------------------------
Subtree, depth = 4 (51828 data points).
Split on feature grade_D. (40326, 11502)
--------------------------------------------------
Subtree, depth = 5 (40326 data points).
Split on feature term_ 36 months. (5760, 34566)
--------------------------------------------------
Subtree, depth = 6 (5760 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (345

Split on feature home_ownership_RENT. (73, 67)
--------------------------------------------------
Subtree, depth = 6 (73 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 6 (67 data points).
Reached maximum depth. Stopping for now.
--------------------------------------------------
Subtree, depth = 4 (2 data points).
Stopping condition 1 reached.
--------------------------------------------------
Subtree, depth = 3 (478 data points).
Split on feature emp_length_8 years. (460, 18)
--------------------------------------------------
Subtree, depth = 4 (460 data points).
Split on feature emp_length_4 years. (433, 27)
--------------------------------------------------
Subtree, depth = 5 (433 data points).
Split on feature home_ownership_MORTGAGE. (287, 146)
--------------------------------------------------
Subtree, depth = 6 (287 data points).
Reached maximum depth. Stopping for now.
----------------------------------

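Before replacement, validation accuracy: 0.814852575343583
After replacement, validation accuracy: 0.8147710126014437
Not pruning
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Before replacement, validation accuracy: 0.814852575343583
After replacement, validation accuracy: 0.814852575343583
Not pruning
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Before replacement, validation accuracy: 0.814852575343583
After replacement, validation accuracy: 0.814852575343583
Not pruning
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Before replacement, validation accuracy: 0.814852575343583
After replacement, validation accuracy: 0.814852575343583
Not pruning
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Before replacement, validation accuracy: 0.814852575343583
After replacement, validation accuracy: 0.814852575343583
Not pruning
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Before replacement, validation accuracy: 0.814852575343583
After replacement, validation accuracy: 0.814852575343583
Not pruning
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Node and both children share the same majority label; skipping validation
Before replacement, validation accuracy: 0.814852575343583
After replacement, validation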

In [21]:
prediction = predict(tree_without_pruning,test_data)
print("Without pruning")
print(accuracy_score(test_data['safe_loans'],prediction),precision_score(test_data['safe_loans'],prediction)
      ,recall_score(test_data['safe_loans'],prediction),f1_score(test_data['safe_loans'],prediction))

print("With post-pruning")
prediction = predict(tree_with_pruning,test_data)
print(accuracy_score(test_data['safe_loans'],prediction),precision_score(test_data['safe_loans'],prediction)
      ,recall_score(test_data['safe_loans'],prediction),f1_score(test_data['safe_loans'],prediction))

Without pruning
0.8088247288149417 0.8097397556890141 0.9984383658253992 0.8942429164410755
With post-pruning
0.8091917461870973 0.8095821090434214 0.9993451211525868 0.8945102017810844



Model|Accuracy|Precision|Recall|F1
-|-|-|-|-
Without post-pruning| 0.8088247288149417  | 0.8097397556890141  | 0.9984383658253992  | 0.8942429164410755
With post-pruning| 0.8091917461870973   | 0.8095821090434214   | 0.9993451211525868   | 0.8945102017810844
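
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so the last column can be recomputed from the two columns before it. The sketch below uses a hypothetical helper `f1` with values copied from the table.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the table above
unpruned = f1(0.8097397556890141, 0.9984383658253992)
pruned = f1(0.8095821090434214, 0.9993451211525868)

print(round(unpruned, 6), round(pruned, 6))
```

Both recomputed values agree with the F1 column to six decimal places. The accuracy gap between the two rows is about 0.0004, so on this dataset post-pruning changes the test metrics only marginally.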