### Assignment goal: implementing tree-based models

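In this course we implemented a decision tree model that computes information gain with entropy. Entropy is not the only option: information gain can also be computed with Gini impurity. This assignment therefore asks the reader to implement information gain based on Gini, and to build a random forest model on top of the decision tree from the course.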

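The file `decision_tree_functions.py` in the assignment folder contains all the functions implemented for the assignment; make full use of these ready-made functions while working on it.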

### Q1: Computing information gain with Gini

$$
Gini = \sum_{i=1}^cp(i)(1-p(i)) = 1 - \sum_{i=1}^cp(i)^2
$$
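
As a quick check of the formula, the sketch below evaluates it on the label column of the toy fruit data used later in this notebook (3 Apple, 4 Grape, 3 Lemon). The helper name `gini_impurity` is hypothetical and uses only the standard library; it is not the function from `decision_tree_functions.py`.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Label column of the toy fruit data used in this notebook.
labels = ["Apple", "Apple", "Grape", "Grape", "Lemon",
          "Lemon", "Apple", "Grape", "Lemon", "Grape"]

# p = (0.3, 0.4, 0.3), so Gini = 1 - (0.09 + 0.16 + 0.09)
print(round(gini_impurity(labels), 4))  # 0.66
print(gini_impurity(["Grape"] * 4))     # 0.0 for a pure node
```

A node containing a single class has impurity 0, and the value grows as the classes become more evenly mixed.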

In [1]:
import pandas as pd
import numpy as np
from icecream import ic
from decision_tree_functions import decision_tree, train_test_split

In [2]:
# Use the same toy data as in the course

training_data = [
    ['Green', 3.1, 'Apple'],
    ['Red', 3.2, 'Apple'],
    ['Red', 1.2, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3.3, 'Lemon'],
    ['Yellow', 3.1, 'Lemon'],
    ['Green', 3, 'Apple'],
    ['Red', 1.1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
    ['Red', 1.2, 'Grape'],
]

header = ["color", "diameter", "label"]

df = pd.DataFrame(data=training_data, columns=header)
df.head()

Unnamed: 0,color,diameter,label
0,Green,3.1,Apple
1,Red,3.2,Apple
2,Red,1.2,Grape
3,Red,1.0,Grape
4,Yellow,3.3,Lemon


In [3]:
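# Gini impurity
def calculate_gini(data):
    
    # take the label column (last column) of the data
    label = data[:, -1]
    
    # count how many times each distinct class appears
    _, counts = np.unique(label, return_counts=True)
    
    # turn the counts into class probabilities
    proba = counts / counts.sum()
    
    # compute the Gini impurity: 1 - sum(p_i^2)
    gini = 1 - np.sum(proba ** 2)
    return gini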
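
To connect the impurity to information gain, here is a minimal sketch of the usual impurity-decrease computation: the parent impurity minus the size-weighted impurity of the two children. Since `decision_tree_functions.py` is not shown here, whether its tree uses exactly this formula is an assumption; the names `gini`, `gini_gain`, and the split threshold are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_gain(parent, left, right):
    """Impurity decrease: parent impurity minus size-weighted child impurity."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# (diameter, label) pairs from the toy data in this notebook
rows = [(3.1, "Apple"), (3.2, "Apple"), (1.2, "Grape"), (1.0, "Grape"),
        (3.3, "Lemon"), (3.1, "Lemon"), (3.0, "Apple"), (1.1, "Grape"),
        (3.0, "Lemon"), (1.2, "Grape")]

threshold = 2.0  # hypothetical split point on diameter
parent = [label for _, label in rows]
left = [label for d, label in rows if d <= threshold]   # 4 Grape
right = [label for d, label in rows if d > threshold]   # 3 Apple, 3 Lemon

# 0.66 - (0.4 * 0.0 + 0.6 * 0.5)
print(round(gini_gain(parent, left, right), 4))  # 0.36
```

The split isolates all grapes in one pure child, which is why it removes more than half of the parent's impurity.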

In [4]:
# split the dataset
train_df, test_df = train_test_split(df, test_size=0.1)

# train a decision tree with Gini impurity as the metric_function
tree = decision_tree(calculate_gini)

tree.fit(train_df)

# predict with the fitted tree
sample = test_df.iloc[0]
tree.pred(sample, tree.sub_tree)

'Lemon'

In [5]:
sample

color       Yellow
diameter       3.3
label        Lemon
Name: 4, dtype: object

### Q2: 實作隨機森林
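Use the decision tree to implement a random forest model; the reader may refer to the random forest lecture notes.

This assignment only requires the reader to implement random sampling of the training data. For random sampling of the features used in training, see `get_potential_splits` and `decision_tree` in `decision_tree_functions.py` (which add a `random_features` parameter).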
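
The feature-sampling half of a random forest can be sketched independently of the course code. The helper below is hypothetical (the actual signature of `get_potential_splits` is not shown here): it assumes the label is the last column and picks a random subset of the remaining column indices for a tree to consider.

```python
import random

def sample_feature_indices(n_columns, n_features, rng=random):
    """Pick n_features distinct feature-column indices (label column excluded)."""
    feature_indices = list(range(n_columns - 1))  # assumes the last column is the label
    if n_features is None or n_features >= len(feature_indices):
        return feature_indices  # no subsampling requested or possible
    return sorted(rng.sample(feature_indices, n_features))

rng = random.Random(0)  # seeded only to make the sketch repeatable
# hypothetical 4-column table: three features plus the label
print(sample_feature_indices(4, 2, rng))     # two of the indices 0, 1, 2
print(sample_feature_indices(4, None, rng))  # all features: [0, 1, 2]
```

Restricting each tree to a random subset of features is what decorrelates the trees beyond what row sampling alone achieves.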

In [6]:
from decision_tree_functions import get_potential_splits

In [42]:
class random_forest():
    '''Random forest model
    Parameters
    ----------
    n_bootstrap: int
        number of rows sampled to train each individual decision tree
    n_trees: int
        number of trees that form the forest
    '''
    
    def __init__(self, n_bootstrap, n_trees, metric_function, task_type='classification', counter=0, min_samples=2, max_depth=5, random_features=None, n_features=None):
        self.n_bootstrap = n_bootstrap
        self.n_trees = n_trees
        self.task_type = task_type
        self.min_samples = min_samples
        self.max_depth = max_depth
        self.metric_function = metric_function
        self.n_features = n_features
    
    def bootstrapping(self, train_df, n_bootstrap):
        
        # sample the rows used to train an individual tree;
        # replace=True draws with replacement, as bootstrapping requires
        ###<your code>###
        df_bootstrapped = train_df.sample(n_bootstrap, replace=True)
        
        # resample if every drawn row has the same label
        ###<your code>###
        while len(df_bootstrapped.label.unique()) == 1:
            df_bootstrapped = train_df.sample(n_bootstrap, replace=True)
        
        return df_bootstrapped
    
    def fit(self, train_df):
        
        self.forest = []
        
        ###<your code>###
        for _ in range(self.n_trees):
            df_bootstrapped = self.bootstrapping(train_df, self.n_bootstrap)
            tree = decision_tree(self.metric_function)
            tree.fit(df_bootstrapped)
            self.forest.append(tree)
            
        return self.forest
    
    def pred(self, test_df):
        predictions = {}
        
        ###<your code>###
        for i, tree in enumerate(self.forest):
            predictions[f'tree_{i}'] = test_df.apply(lambda x: tree.pred(x, tree.sub_tree), axis=1)
            
        df_predictions = pd.DataFrame(predictions)
        
        #majority voting
        ###<your code>###
        random_forest_predictions = df_predictions.mode(axis=1)
        return random_forest_predictions
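
Bootstrapping conventionally means drawing rows with replacement, so the same row can appear more than once in a tree's training sample. The stdlib sketch below illustrates only that index-drawing step; `bootstrap_indices` is a hypothetical helper. In pandas the corresponding call would be `DataFrame.sample(n, replace=True)`.

```python
import random

def bootstrap_indices(n_rows, n_bootstrap, rng):
    """Draw n_bootstrap row indices uniformly at random, with replacement."""
    return rng.choices(range(n_rows), k=n_bootstrap)

rng = random.Random(42)  # seeded only to make the sketch repeatable
idx = bootstrap_indices(8, 8, rng)
print(idx)  # indices may repeat
print(len(set(idx)), "distinct rows out of", len(idx))
```

Sampling 8 rows out of 8 without replacement would simply return the whole training set, so every tree would see identical data; drawing with replacement is what makes the trees differ.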

In [43]:
train_df, test_df = train_test_split(df, 0.2)

# build the random forest model
###<your code>###
forest = random_forest(n_bootstrap=2, n_trees=3, metric_function=calculate_gini)
forest.fit(train_df)
forest.pred(test_df)

Unnamed: 0,0
2,Lemon
0,Lemon


In [44]:
test_df  # low accuracy: neither prediction above matches its label

Unnamed: 0,color,diameter,label
2,Red,1.2,Grape
0,Green,3.1,Apple


In [45]:
# raising n_bootstrap improved accuracy in this run; the bootstrap sample size is raised to 8
forest = random_forest(n_bootstrap=8, n_trees=3, metric_function=calculate_gini)
forest.fit(train_df)
forest.pred(test_df)

Unnamed: 0,0
2,Grape
0,Apple


In [46]:
test_df   # high accuracy: both predictions above match their labels

Unnamed: 0,color,diameter,label
2,Red,1.2,Grape
0,Green,3.1,Apple


In [51]:
# raising n_trees also improved accuracy; 5 trees were still inaccurate, 7 trees did better
forest = random_forest(n_bootstrap=2, n_trees=7, metric_function=calculate_gini)
forest.fit(train_df)
forest.pred(test_df)

Unnamed: 0,0
2,Grape
0,Apple


In [52]:
test_df

Unnamed: 0,color,diameter,label
2,Red,1.2,Grape
0,Green,3.1,Apple
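
One way to see why more trees can help is a simplified voting model. The sketch below assumes a two-class problem in which each tree is correct independently with the same probability `p`; neither assumption holds exactly for the three-class forest above, whose trees share training data, so the numbers are illustrative only.

```python
from math import comb

def majority_vote_accuracy(n_trees, p):
    """P(more than half of n_trees independent voters are correct); n_trees odd."""
    need = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n_trees, k) * p**k * (1 - p)**(n_trees - k)
               for k in range(need, n_trees + 1))

for n in (1, 3, 7):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
# 1 0.6
# 3 0.648
# 7 0.7102
```

Under these assumptions the ensemble improves with more trees only when each tree is better than chance (`p > 0.5`); with `p < 0.5` adding trees makes the vote worse.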


---

In [20]:
# without voting: pred returns the per-tree predictions (random_forest_predictions = df_predictions)
forest.pred(test_df)

Unnamed: 0,tree_0,tree_1,tree_2,tree_3,tree_4
4,Apple,Apple,Lemon,Lemon,Lemon
7,Grape,Apple,Grape,Grape,Grape


In [23]:
# voting: take the mode (most frequent label) of each row
forest.pred(test_df).mode(axis=1)

Unnamed: 0,0
4,Lemon
7,Grape


In [22]:
test_df

Unnamed: 0,color,diameter,label
4,Yellow,3.3,Lemon
7,Red,1.1,Grape
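
The voting step can also be reproduced without pandas. The sketch below applies a majority vote to the two rows of per-tree predictions shown above; `majority_vote` is a hypothetical helper built on `collections.Counter`.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]

# per-tree predictions for test rows 4 and 7, copied from the output above
per_tree_predictions = {
    4: ["Apple", "Apple", "Lemon", "Lemon", "Lemon"],
    7: ["Grape", "Apple", "Grape", "Grape", "Grape"],
}
for row_id, votes in per_tree_predictions.items():
    print(row_id, majority_vote(votes))
# 4 Lemon
# 7 Grape
```

Unlike this helper, `DataFrame.mode(axis=1)` returns every tied label in extra columns, so an even number of trees can yield more than one prediction column per row.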
