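### Assignment goal: implement tree-based models

In this lesson we implemented a decision tree that uses entropy to compute information gain. Entropy is not the only option: Gini impurity can be used as well. This assignment therefore asks you to compute information gain with Gini impurity and to build a random forest on top of the decision tree from the lesson.

The file `decision_tree_functions.py` in the assignment folder contains all the functions already implemented for this assignment; make full use of them when writing your solution.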

### Q1: Compute information gain with Gini

$$
Gini = \sum_{i=1}^{c} p(i)\,(1-p(i)) = 1 - \sum_{i=1}^{c} p(i)^2
$$
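
As a quick sanity check of the identity above, the following minimal sketch evaluates both forms of the formula on a small list of labels. The helper names (`class_probabilities`, `gini_sum_form`, `gini_complement_form`) are made up for illustration and are not part of `decision_tree_functions.py`.

```python
from collections import Counter

def class_probabilities(labels):
    """Relative frequency p(i) of each distinct class in `labels`."""
    total = len(labels)
    return [n / total for n in Counter(labels).values()]

def gini_sum_form(labels):
    """Gini as sum_i p(i) * (1 - p(i))."""
    return sum(p * (1 - p) for p in class_probabilities(labels))

def gini_complement_form(labels):
    """Gini as 1 - sum_i p(i)^2."""
    return 1 - sum(p ** 2 for p in class_probabilities(labels))

mixed = ["Apple", "Apple", "Grape", "Grape"]   # two equally likely classes
pure = ["Grape", "Grape", "Grape", "Grape"]    # a single class

print(gini_sum_form(mixed), gini_complement_form(mixed))
print(gini_sum_form(pure), gini_complement_form(pure))
```

A node containing a single class has impurity 0, and the impurity grows as the classes become more evenly mixed.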

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [1]:
import os

# Current directory
print(os.getcwd())

# change directory
os.chdir('/content/drive/MyDrive/python_training/NLP100Days/day_27-Tree_Base_model_practice/')
print(os.getcwd())

/content
/content/drive/MyDrive/python_training/NLP100Days/day_27-Tree_Base_model_practice


In [2]:
import pandas as pd
import numpy as np
from decision_tree_functions import decision_tree, train_test_split

In [3]:
# Use the same toy data as in the lesson

training_data = [
    ['Green', 3.1, 'Apple'],
    ['Red', 3.2, 'Apple'],
    ['Red', 1.2, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3.3, 'Lemon'],
    ['Yellow', 3.1, 'Lemon'],
    ['Green', 3, 'Apple'],
    ['Red', 1.1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
    ['Red', 1.2, 'Grape'],
]

header = ["color", "diameter", "label"]

df = pd.DataFrame(data=training_data, columns=header)
df.head()

Unnamed: 0,color,diameter,label
0,Green,3.1,Apple
1,Red,3.2,Apple
2,Red,1.2,Grape
3,Red,1.0,Grape
4,Yellow,3.3,Lemon


In [4]:
# Gini impurity
def calculate_gini(data):

    # Take the label column (last column) of the data
    label_column = data[:, -1]
    
    # Count how many times each distinct class appears
    _, counts = np.unique(label_column, return_counts=True)

    # Turn the counts into class probabilities
    probabilities = counts / counts.sum()
    
    # Compute the Gini impurity
    gini = 1 - np.sum(probabilities**2)

    return gini
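
To show how an impurity function such as `calculate_gini` can turn into an information gain, here is a minimal sketch that subtracts the size-weighted impurity of two child nodes from the impurity of their parent. It works on plain lists of labels and uses hypothetical helpers (`gini`, `gini_gain`); how `decision_tree_functions.py` combines the metric internally may differ.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Labels of the ten toy fruits, split on the question "diameter <= 1.2"
left = ["Grape"] * 4                     # rows that satisfy the question
right = ["Apple"] * 3 + ["Lemon"] * 3    # rows that do not
parent = left + right

gain = gini_gain(parent, left, right)
print(round(gain, 2))
```

The split isolates every grape in a pure node, so the weighted child impurity drops well below the parent's, and the gain is positive.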

In [5]:
# Split the dataset
train_df, test_df = train_test_split(df, 0.2)

# Train a decision tree with Gini impurity as the metric_function
tree = decision_tree(calculate_gini, 'classification', 0, min_samples=2, max_depth=5)
tree.fit(train_df)


{'diameter <= 1.2': ['Grape', {'color = Yellow': ['Lemon', 'Apple']}]}

In [6]:
# Predict with the fitted tree
sample = test_df.iloc[0]
tree.pred(sample, tree.sub_tree)


'Lemon'

In [7]:
sample

color       Yellow
diameter         3
label        Lemon
Name: 8, dtype: object
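
The fitted tree printed above is a nested dictionary: each key is a question and its value is a two-element list of branches. The sketch below walks such a structure to classify one sample. It assumes the first branch is taken when the question is true and that questions use only the `<=` and `=` forms seen in the output; `predict` is a hypothetical stand-in, not the `pred` method from `decision_tree_functions.py`.

```python
def predict(sample, tree):
    """Follow the questions in a nested-dict tree until a leaf label is reached."""
    if not isinstance(tree, dict):
        return tree  # a leaf is just the predicted label

    question = next(iter(tree))          # e.g. "diameter <= 1.2"
    yes_branch, no_branch = tree[question]
    feature, operator, value = question.split(" ", 2)

    if operator == "<=":                 # numeric feature
        answer = float(sample[feature]) <= float(value)
    else:                                # categorical feature ("=")
        answer = str(sample[feature]) == value

    return predict(sample, yes_branch if answer else no_branch)

tree = {'diameter <= 1.2': ['Grape', {'color = Yellow': ['Lemon', 'Apple']}]}
sample = {"color": "Yellow", "diameter": 3, "label": "Lemon"}

prediction = predict(sample, tree)
print(prediction)
```

For this sample the diameter test fails, the color test succeeds, and the walk ends at the `Lemon` leaf, which matches the label of the sample shown above.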

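### Q2: Implement a random forest
Use the decision tree to implement a random forest model; see the random forest lecture notes for reference.

This assignment only requires you to implement random sampling of the training data. For random sampling of the features used in training, refer to `get_potential_splits` and `decision_tree` in `decision_tree_functions.py` (which add the `random_features` parameter).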
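
The two sources of randomness mentioned above can be sketched with the standard library alone: drawing row indices with replacement (bootstrapping) and picking a random subset of feature names. The function names and the seeded `random.Random` instance are illustrative assumptions; the actual feature sampling lives in `decision_tree_functions.py` and may be implemented differently.

```python
import random

def bootstrap_indices(n_rows, n_bootstrap, rng):
    """Draw `n_bootstrap` row indices from range(n_rows) with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_bootstrap)]

def sample_features(feature_names, n_features, rng):
    """Pick `n_features` distinct features, or keep all of them if None."""
    if n_features is None or n_features >= len(feature_names):
        return list(feature_names)
    return rng.sample(list(feature_names), n_features)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable

rows = bootstrap_indices(n_rows=8, n_bootstrap=20, rng=rng)
features = sample_features(["color", "diameter"], n_features=1, rng=rng)

print(rows)
print(features)
```

Because rows are drawn with replacement, the bootstrap sample can be larger than the training set and will usually contain repeated rows.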

In [8]:
class random_forest():
    '''Random forest model
    Parameters
    ----------
    n_bootstrap: int
        number of samples drawn (with replacement) to train each individual decision tree
    n_trees: int
        number of trees in the forest
    '''
    
    def __init__(self, n_bootstrap, n_trees, task_type, min_samples, max_depth, metric_function, n_features=None):
        self.n_bootstrap = n_bootstrap
        self.n_trees = n_trees
        self.task_type = task_type
        self.min_samples = min_samples
        self.max_depth = max_depth
        self.metric_function = metric_function
        self.n_features = n_features
    
    def bootstrapping(self, train_df, n_bootstrap):
        #sample data to be used to train individual tree
        bootstrap_indices = np.random.randint(low=0, high=len(train_df), size=n_bootstrap)
        df_bootstrapped = train_df.iloc[bootstrap_indices]
        
        #resample while every bootstrapped row has the same label
        while len(df_bootstrapped['label'].unique()) == 1:
            bootstrap_indices = np.random.randint(low=0, high=len(train_df), size=n_bootstrap)
            df_bootstrapped = train_df.iloc[bootstrap_indices]
        
        
        return df_bootstrapped
    
    def fit(self, train_df):        
        self.forest = []
        
        for i in range(self.n_trees):
            df_bootstrapped = self.bootstrapping(train_df, self.n_bootstrap)
            tree = decision_tree(self.metric_function, self.task_type,
                                 0, self.min_samples, self.max_depth,
                                 self.n_features)
            
            tree.fit(df_bootstrapped)
            self.forest.append(tree)
            
        return self.forest
    
    def pred(self, test_df):
        df_predictions = {}
        
        for i in range(len(self.forest)):
            # get the predictions of every tree
            col_name = f"tree_{i}"
            predictions = list(test_df.apply(self.forest[i].pred, args=(self.forest[i].sub_tree,), axis=1))
            df_predictions[col_name] = predictions
             
        df_predictions = pd.DataFrame(df_predictions)
        
        #majority voting
        random_forest_predictions = df_predictions.mode(axis=1)[0]
         
        return random_forest_predictions
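
The majority-voting step can also be illustrated without pandas. In this minimal sketch, each inner list holds one tree's predictions for all test samples, and `majority_vote` is a hypothetical helper. Ties are broken by first-seen order here, which may not match the tie-breaking of `DataFrame.mode`.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that appears most often among `votes`."""
    return Counter(votes).most_common(1)[0][0]

# Made-up predictions of four trees for two test samples
per_tree = [
    ["Apple", "Lemon"],
    ["Apple", "Lemon"],
    ["Grape", "Lemon"],
    ["Apple", "Apple"],
]

# zip(*per_tree) regroups the predictions per sample instead of per tree
forest_predictions = [majority_vote(sample_votes) for sample_votes in zip(*per_tree)]
print(forest_predictions)
```

Using an odd number of trees reduces the chance of ties in a two-class problem, although ties remain possible with three or more classes.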

In [9]:
train_df, test_df = train_test_split(df, 0.2)

# Build the random forest model

forest = random_forest(n_bootstrap=20, n_trees=4, task_type='classification',
                       min_samples=2, max_depth=5, metric_function=calculate_gini,
                       n_features=None)

forest.fit(train_df)

[<decision_tree_functions.decision_tree at 0x7f42058436d8>,
 <decision_tree_functions.decision_tree at 0x7f420588b400>,
 <decision_tree_functions.decision_tree at 0x7f42058a22b0>,
 <decision_tree_functions.decision_tree at 0x7f42058a2128>]

In [10]:
forest.pred(test_df)

0    Apple
1    Lemon
Name: 0, dtype: object

In [11]:
test_df

Unnamed: 0,color,diameter,label
0,Green,3.1,Apple
4,Yellow,3.3,Lemon
