# Tree Lab

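## Preparation
### Environment setup
Make sure the following dependencies are installed, then run the code below to import them and verify the installation.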

In [71]:
import pandas as pd
import numpy as np

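### Dataset preparation
We will build the decision tree from the following dataset. It contains 7 features and one label, "suitable for a PhD or not". The features describe conditions relevant to doing a PhD, such as `I love doing research` and `I absolutely want to be a college professor`.

Run the code below to load the dataset.

(Attribution note) Adapted from https://zhuanlan.zhihu.com/p/372884253. The dataset was generated with GPT4, and the resulting decision tree is not necessarily identical to the one in the reference.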

In [72]:
# load the training and test sets
train_data = pd.read_csv('train_phd_data.csv')
test_data = pd.read_csv('test_phd_data.csv')


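## Building the decision tree (10 points)
In this part you will learn about and complete the construction of a decision tree. Note: pruning is not considered. Construction stops when all instances in the data belong to the same class, or when the features cannot be split any further (that is, every feature takes a single value).

We use the information gain ratio as the splitting criterion; other measures, such as the Gini index, are also allowed.

Complete the following functions:

1. **Compute the information entropy of the data** `getInfoEntropy()`
2. **Split the data on the selected feature** `split_data()`
3. **Find the best feature according to the splitting criterion** `find_best_feature()`

You may need `pandas` library functions; see the [official pandas documentation](https://pandas.pydata.org/docs/reference/).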
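To make the splitting criterion concrete, here is a minimal sketch of the information gain ratio on a tiny made-up table. The helper names `entropy` and `gain_ratio` and the `toy` data are assumptions for illustration only; they are not the functions you are asked to complete, and they omit the small smoothing constants used in the solution cells.

```python
import numpy as np
import pandas as pd


def entropy(labels: pd.Series) -> float:
    """Shannon entropy (base 2) of a label column."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())


def gain_ratio(df: pd.DataFrame, feature: str, label: str) -> float:
    """Information gain of `feature` divided by its intrinsic value."""
    # share of rows taking each value of the feature
    weights = df[feature].value_counts(normalize=True)
    # weighted average entropy of the label within each subset
    cond = sum(w * entropy(df.loc[df[feature] == v, label]) for v, w in weights.items())
    gain = entropy(df[label]) - cond
    # intrinsic value: the entropy of the split itself
    iv = float(-(weights * np.log2(weights)).sum())
    return gain / iv if iv > 0 else 0.0


# hypothetical toy data: "a" determines the label, "b" is unrelated to it
toy = pd.DataFrame({
    "a": [1, 1, 0, 0],
    "b": [1, 0, 1, 0],
    "label": [1, 1, 0, 0],
})
ratio_a = gain_ratio(toy, "a", "label")
ratio_b = gain_ratio(toy, "b", "label")
print(ratio_a, ratio_b)
```

In this toy table `a` separates the two classes perfectly, so its gain ratio is 1, while `b` leaves each subset evenly mixed and gets a gain ratio of 0.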

In [73]:
def getInfoEntropy(data):
    ''' 
        calculate the information entropy of the data

    Args:
        data: the data set, the last column is the label, the other columns are the features

    Returns:
        Entropy: float, the information entropy of the data
    '''
    
    Entropy = 0.0
    
    # TODO: 1. count the number of samples in each class -- count_class
    ## hint: use value_counts() on the label column to count the samples per class
    ## hint: use data.iloc[:,-1] to get the last column of the data
    count_class = data.iloc[:, -1].value_counts()
    # on train_data: label 0 -> 20 samples, label 1 -> 7 samples
    
    # TODO: 2. calculate the number of data
    data_count = len(data)

    # NOTE: iterate over count_class.index (the label values) instead of positions, so the lookup works for any label values
    for class_label in count_class.index:
        # TODO: 3. calculate the probability of each class
        p = count_class[class_label] / data_count
        # TODO: 4. accumulate the entropy term of each class
        # (the 1e-5 guards against log2(0); it lowers the result very slightly)
        Entropy += -p * np.log2(p + 1e-5)
    return Entropy


In [74]:
## test getInfoEntropy
print(getInfoEntropy(train_data))

0.8255976726326857
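As a sanity check on the printed value, the entropy can be recomputed by hand from the class counts noted in the comment inside `getInfoEntropy` (20 samples with label 0 and 7 with label 1). The sketch below assumes those counts and compares the exact entropy with the version smoothed by `1e-5`; the variable names are illustrative.

```python
import math

# class counts taken from the comment in getInfoEntropy (assumed to describe train_data)
counts = [20, 7]
total = sum(counts)

# exact Shannon entropy in bits
exact_entropy = -sum(c / total * math.log2(c / total) for c in counts)

# same formula with the 1e-5 smoothing term used in getInfoEntropy
smoothed_entropy = -sum(c / total * math.log2(c / total + 1e-5) for c in counts)

print(exact_entropy, smoothed_entropy)
```

The smoothed value agrees with the notebook output above, and the exact value is larger by roughly 3e-5, which is the bias the smoothing term introduces.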


In [75]:
def split_data(data, column):
    ''' 
        split the data set according to the feature column

        Args:
            data: the data set, the last column is the label, the other columns are the features
            column: the feature column
        Returns:
            splt_datas: Series, the data set after splitting
    '''
    # 1. construct an empty Series to hold the subsets
    #    (an explicit dtype avoids the warning that a bare pd.Series() emits in some pandas versions)
    splt_datas = pd.Series(dtype="object")
    # 2. get the unique values of the feature column
    str_values = data.iloc[:, column].unique()
    # 3. collect the rows matching each unique value
    for i, value in enumerate(str_values):
        df = data.loc[data.iloc[:, column] == value]
        splt_datas[str(i)] = df
    return splt_datas
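
For comparison, the same split can be expressed with `DataFrame.groupby`. This is only a sketch of an alternative, not the function the lab asks for: the name `split_by_feature` is hypothetical, and it returns a plain dict keyed by feature value instead of a Series.

```python
import pandas as pd


def split_by_feature(data: pd.DataFrame, column: int) -> dict:
    """Map each distinct value of the feature at position `column` to its rows."""
    name = data.columns[column]
    # groupby yields (value, sub-DataFrame) pairs; sort=False keeps first-seen order
    return {value: subset for value, subset in data.groupby(name, sort=False)}


# hypothetical toy data
toy = pd.DataFrame({"f": [1, 0, 1], "label": [1, 0, 0]})
parts = split_by_feature(toy, 0)
for value, subset in parts.items():
    print(value, len(subset))
```

Keying the subsets by the feature value itself means the caller does not have to recover the value from the subset afterwards.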


In [76]:
def find_best_feature(data):

    '''  
        find the best feature to split the data set

        Args:
            data: the data set, the last column is the label, the other columns are the features
        Returns:
            best_feature: the best feature
            best_Series: Series, the data set after splitting
    '''
    best_feature_index = 0
    baseEnt = getInfoEntropy(data)
    # start below any attainable ratio so a feature is always selected,
    # even when no split yields a positive gain
    bestInfoGain_ratio = float("-inf")
    numFeatures = data.shape[1] - 1
    InfoGain = 0.0

    # NOTE: the subsets are kept in an object-dtype Series (one DataFrame per feature value), not in a DataFrame
    best_Series = pd.Series(dtype="object")
    # Loop through each feature to calculate information gain ratio.
    for i in range(numFeatures):
        newEnt = 0.0
        # start IV slightly above zero to avoid a division-by-zero error
        IV = 1e-5
        # TODO: 1. split the data set according to the feature column
        series = split_data(data, i)
        # 2. calculate the information entropy of each subset, and the weighted average information entropy
        for j in range(len(series)):
            # positional access: the Series is indexed by the strings '0', '1', ...
            df = series.iloc[j]
            # TODO: 3. calculate the probability of each subset
            prob = len(df) / len(data)
            # TODO: 4. accumulate the weighted average information entropy
            newEnt += prob * getInfoEntropy(df)
            # TODO: 5. accumulate the intrinsic value IV (the entropy of the split itself, not of the class labels)
            IV += -prob * np.log2(prob + 1e-10)

        # TODO: 6. calculate the information gain
        InfoGain = baseEnt - newEnt

        # TODO: 7. calculate the information gain ratio
        InfoGain_ratio = InfoGain / (IV + 1e-5)

        # keep the feature with the highest gain ratio seen so far
        if InfoGain_ratio > bestInfoGain_ratio:
            bestInfoGain_ratio = InfoGain_ratio
            best_feature_index = i
            best_Series = series

    return data.columns[best_feature_index], best_Series


In [77]:
#create decision tree
def creat_Tree(data):

    '''
        create decision tree

        Args:
            data: the data set, the last column is the label, the other columns are the features
        Returns:
            Tree: dict, the decision tree
    '''
    # get the class labels of the data set
    y_values = data.iloc[:, -1].unique()

    # TODO: 1. If there is only one class label, stop splitting and return the class label.
    if len(y_values) == 1:
        return y_values[0]

    # 2. Check whether every feature takes a single value.
    flag = 0
    for i in range(data.shape[1] - 1):
        if len(data.iloc[:, i].unique()) != 1:
            flag = 1
            break

    # TODO: 3. If all features are identical, return the class label with the most samples.
    # On a tie, idxmax() returns the first of the tied labels.
    if flag == 0:
        value_count = data.iloc[:, -1].value_counts()
        return value_count.idxmax()

    # TODO: 4. Find the best feature to split the data set.
    best_feature, best_Series = find_best_feature(data)
    best_feature = str(best_feature)
    Tree = {best_feature: {}}
    # 5. Build the tree recursively.
    for j in range(len(best_Series)):
        # named `subset` so the local variable does not shadow the split_data() function
        subset = best_Series.iloc[j]

        # every row of the subset shares one value of the best feature; use it as the branch key
        value = subset.loc[:, best_feature].unique()[0]
        # delete the best feature
        subset = subset.drop(best_feature, axis=1)

        # TODO: 6. recursively call the function to build the tree
        Tree[best_feature][value] = creat_Tree(subset)
    return Tree


In [78]:
Tree = creat_Tree(train_data)
print(f"Tree:{Tree}")

Tree:{'I absolutely want to be a college professor': {0: {'I work 9-5 Mon-Fri': {0: {'I am OK being with judged all the time': {0: 0, 1: {'I need a clear target and immediate feedback': {0: 1, 1: 0}}}}, 1: 0}}, 1: {'I love doing research': {1: {'I am OK being with judged all the time': {1: 1, 0: {'I can deal with extreme stress and competition': {1: 0, 0: 1}}}}, 0: 0}}}}
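
The nested dict is hard to read on one line. As a rough aid, the sketch below renders a tree of the same shape (`{feature: {value: subtree_or_label}}`) as indented text. The helper `format_tree` and the `toy_tree` are assumptions made for illustration; passing the `Tree` built above should work the same way.

```python
def format_tree(tree, indent=0):
    """Return the lines of an indented text rendering of a nested-dict tree."""
    if not isinstance(tree, dict):
        # leaf: the predicted class label
        return [" " * indent + f"-> {tree}"]
    lines = []
    for feature, branches in tree.items():
        for value, child in branches.items():
            lines.append(" " * indent + f"{feature} = {value}")
            lines.extend(format_tree(child, indent + 2))
    return lines


# hypothetical tree with the same structure as the one printed above
toy_tree = {"f": {0: 0, 1: {"g": {0: 1, 1: 0}}}}
rendered = format_tree(toy_tree)
print("\n".join(rendered))
```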



In [79]:
# visualize decision tree
from graphviz import Digraph

# NOTE: I ran this on a server where I have no sudo/root access, so I installed graphviz offline following https://blog.csdn.net/a19990412/article/details/115674086
# NOTE: `brew install graphviz` is another option, but it failed on my server; see https://docs.brew.sh/Homebrew-on-Linux#requirements


def plot_tree(tree, parent_name, node_id=0, graph=None, edge_label=''):
    
    
    if graph is None:
        graph = Digraph(comment='Decision Tree')

    
    if not isinstance(tree, dict):
        current_node_name = f'node{node_id}' 
        graph.node(current_node_name, label=str(tree))
        graph.edge(parent_name, current_node_name, label=edge_label)
        node_id += 1
        return node_id
    
    for k, v in tree.items():
        current_node_name = f'node{node_id}'
        # graphviz labels must be strings
        graph.node(current_node_name, label=str(k))
        graph.edge(parent_name, current_node_name, label=str(edge_label))

        if isinstance(v, dict):
            for key in v:
                # label each branch with its feature value (here '0' or '1')
                node_id += 1
                node_id = plot_tree(v[key], current_node_name, node_id, graph, edge_label=str(key))

    return node_id

# plot decision tree
tree_graph = Digraph(comment='Decision Tree')
plot_tree(Tree, 'Root', 0, tree_graph)


# NOTE: the output format is SVG because my version of graphviz does not support PNG
# NOTE: see decision_tree.svg in this working folder
tree_graph.render('decision_tree', format='svg', cleanup=True)

'decision_tree.svg'

The output format is SVG because my version of graphviz does not support PNG. \
See [decision_tree.svg] in this working folder.

In [80]:
# classify test data
def classify(tree, test_data):
    ''' 
        classify one test sample

        Args:
            tree: dict, the decision tree
            test_data: one sample (e.g. a row of the test set as a Series), indexed by feature name
        Returns:
            class_label: the predicted class label
    '''
    ## get the feature checked at the root of this (sub)tree
    first_str = list(tree.keys())[0]

    # value of that feature in the sample
    key = test_data[first_str]
    # TODO: get the branches of the feature; sub_tree[key] is the branch matching the sample's value
    sub_tree = tree[first_str]
    # TODO: recursively call the function to classify the test data
    # hint: if the branch is not a dict, the tree has reached a leaf node, and the branch itself is the predicted class label
    if isinstance(sub_tree[key], dict):
        return classify(sub_tree[key], test_data)
    else:
        return sub_tree[key]

    

def test(test_data, tree):
    # NOTE: rewritten as a single generator expression to keep it compact
    # sample.iloc[-1] is the label (last column), accessed by position
    correct_predictions = sum(classify(tree, sample) == sample.iloc[-1] for _, sample in test_data.iterrows())
    accuracy = correct_predictions / len(test_data)
    print('accuracy: ', accuracy)
    

In [81]:
test(test_data, Tree)
test(train_data, Tree)

accuracy:  0.8571428571428571
accuracy:  1.0
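
One limitation worth knowing: `classify` indexes `sub_tree[key]` directly, so a sample whose feature value never appeared on that branch during training raises a `KeyError`. The sketch below shows one possible way to handle this, returning a caller-supplied fallback instead. The function name `classify_with_default` and its `default` parameter are hypothetical, as is the toy tree.

```python
def classify_with_default(tree, sample, default=None):
    """Walk a nested-dict tree; return `default` when a feature value is unseen."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))   # the feature tested at this node
        branches = node[feature]     # maps feature values to subtrees or labels
        value = sample[feature]
        if value not in branches:
            return default           # no branch was built for this value
        node = branches[value]
    return node                      # a leaf: the predicted class label


# hypothetical tree and samples
toy_tree = {"f": {0: 0, 1: {"g": {0: 1, 1: 0}}}}
print(classify_with_default(toy_tree, {"f": 1, "g": 0}))
print(classify_with_default(toy_tree, {"f": 2, "g": 0}, default=-1))
```

A natural choice for `default` would be the majority class of the training data, though that is a design decision this lab does not prescribe.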


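Next comes a small quiz meant to assess whether you are suited to pursuing a PhD.

Please note! This is only a hypothetical model and cannot accurately predict reality. Treat it as a lighthearted experiment for entertainment only, and do not let it replace your own reflection and decisions about your situation.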

In [82]:
# You can input your profile to predict your phd admission result
# input your profile about "I love doing research,I absolutely want to be a college professor,Money is important to me,I can deal with extreme stress and competition,I am OK being with judged all the time,I need a clear target and immediate feedback,I work 9-5 Mon-Fri"
loving = input('Do you love research? (1/0)')
professor = input('Do you want to be a professor? (1/0)')
money = input('Is money important to you? (1/0)')
stress = input('Can you deal with stress? (1/0)')
judge = input('Can you deal with being judged all the time? (1/0)')
feedback = input('Do you need a clear target and immediate feedback? (1/0)')
work = input('Do you work 9-5 Mon-Fri? (1/0)')

# Combine the user's responses into a single Series (one sample, indexed by feature name)
test_data = pd.Series({
    'I love doing research': int(loving),
    'I absolutely want to be a college professor': int(professor),
    'Money is important to me': int(money),
    'I can deal with extreme stress and competition': int(stress),
    'I am OK being with judged all the time': int(judge),
    'I need a clear target and immediate feedback': int(feedback),
    'I work 9-5 Mon-Fri': int(work)
})
# Use the decision tree to predict the result 
result = classify(Tree, test_data)
# Print the result to the user
if result == 1:
    print("Congratulations! According to the model, you are likely to gain admission for Ph.D.")
elif result == 0: 
    print("Unfortunately, according to the model, you are unlikely to gain admission for Ph.D.")


Congratulations! According to the model, you are likely to gain admission for Ph.D.


Finished at 12-21 @Boyuan  
