Mount Google Drive (optional)

In [37]:
# from google.colab import drive
# drive.mount('/content/drive')

# **Lab 2 : Decision Tree and Random Forest**
In *Lab 2*, you need to complete:

1. Basic Part :
  Implement a Decision Tree model and predict whether patients in the validation set survived.

  > * Section 1: Function Implementation and Testing
  > * Section 2: Building the Decision Tree Model


2. Advanced Part : Build a **Random Forest** model to make predictions


❗ **Important** ❗
Please follow the template and the instructions.
**Do not** change any code outside the code brackets shown below.
```
### START CODE HERE ###
...
### END CODE HERE ###
```



We'll be using **pandas** frequently in this template, so we've provided a link to help you get familiar with its usage: https://pandas.pydata.org/docs/user_guide/10min.html


## Import Packages

> Note : You **cannot** import any other packages in either the basic or the advanced part


In [38]:
import numpy as np
import pandas as pd
import math
import random
from numpy import sqrt
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

# **Basic Part** (30%)

## Section 1: Function Implementation and Testing
You will implement five functions that are necessary for building a decision tree model. After implementing each function, you must run it with the given input variables to verify its correctness. Save the results of each function to a CSV file for submission.
> * Step 1: Calculate the Entropy
> * Step 2: Calculate the Information Gain
> * Step 3: Find the Best Split
> * Step 4: Split the data into two branches
> * Step 5: Build the decision tree
> * Step 6: Save answers


## Section 2: Build a Decision Tree Model and make Predictions
After implementing the functions, you will use them to build a decision tree model and make predictions. Follow the steps below to train your model and evaluate its performance.
> * Step 1: Split the data into training set and validation set
> * Step 2: Train a decision tree model with the training set
> * Step 3: Predict the cases in the validation set by using the model trained in Step 2
> * Step 4: Calculate the f1-score of your predictions in Step 3
> * Step 5: Save answer

## Load the input data
Let's load the input file **lab2_basic_input.csv**.

> Note: you will use this input data in both section 1 and section 2

In [39]:
input_data = pd.read_csv('lab2_basic_input.csv')
input_data

Unnamed: 0,age,bmi,gender,height,weight,pre_icu_los_days,glucose_apache,heart_rate_apache,resprate_apache,sodium_apache,hospital_death
0,28.0,26.596278,1,173.0,79.6,0.0,199.0,52.0,29.0,140.0,0
1,51.0,36.267895,0,180.3,117.9,0.141667,88.0,104.0,31.0,143.0,0
2,81.0,24.196007,1,162.0,63.5,1.988194,285.0,178.0,4.0,138.0,1
3,83.0,21.105377,1,162.6,55.8,0.211111,189.0,115.0,18.0,158.0,0
4,76.0,20.470093,0,167.6,57.5,14.493056,278.0,93.0,8.0,134.0,1
5,60.0,46.111111,0,180.0,149.4,0.027778,186.0,146.0,34.0,139.0,1
6,70.0,17.361111,1,168.0,49.0,0.156944,181.0,111.0,12.0,158.0,1
7,79.0,33.274623,0,165.1,90.7,0.004861,56.0,37.0,44.0,141.0,0
8,81.0,30.462306,0,177.8,96.3,0.002083,113.0,62.0,4.0,142.0,0
9,54.0,25.843929,0,177.8,81.7,0.007639,112.0,110.0,24.0,136.0,1


## Global attributes
Define the global attributes
> Note : You **cannot** modify the provided values of these attributes in the basic part

In [40]:
max_depth = 2
depth = 0
min_samples_split = 2
n_features = input_data.shape[1] - 1

> You can add your own global attributes here

## Section 1: Function Implementation and Testing

### Step 1 & 2: Calculate the Entropy and Information Gain
In these steps, you will implement functions to calculate entropy and information gain. These metrics are crucial for determining the best way to split the dataset at each node in the decision tree.

If you need some help on Entropy and Information Gain, please refer to
* https://codingnomads.com/decision-tree-information-gain-entropy#what-is-entropy
* https://www.mldawn.com/decision-trees-entropy/#:~:text=In%20a%20binary%20classification%20problem%2C%20when%20Entropy%20hits%200%20it,state%20of%20purity%20and%20certainty.
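
As a quick reference, the entropy of a binary label with class proportions p and n is H = -p*log2(p) - n*log2(n), and the information gain of a split is the parent's entropy minus the size-weighted average of the children's entropies. The sketch below illustrates both on plain Python lists instead of a DataFrame; `binary_entropy` and `gain` are illustrative names, not part of the template.

```python
import math

def binary_entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    p_one = sum(labels) / total
    p_zero = 1.0 - p_one
    # A pure node (all 0s or all 1s) has zero entropy; this also avoids log2(0)
    if p_one == 0 or p_zero == 0:
        return 0.0
    return -p_one * math.log2(p_one) - p_zero * math.log2(p_zero)

def gain(parent, left, right):
    """Information gain of splitting `parent` into `left` and `right`."""
    weighted = (len(left) / len(parent)) * binary_entropy(left) \
             + (len(right) / len(parent)) * binary_entropy(right)
    return binary_entropy(parent) - weighted

print(binary_entropy([0, 1, 0, 1]))          # 1.0: a 50/50 mix is maximally uncertain
print(gain([0, 0, 1, 1], [0, 0], [1, 1]))    # 1.0: a perfect split removes all uncertainty
```

Unlike the graded functions, this sketch does not round its results.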


In [41]:
def entropy(data):
    """
    This function measures the amount of uncertainty in a probability distribution
    args:
    * data(type: DataFrame): the data you're calculating for the entropy
    return:
    * entropy_value(type: float): the data's entropy
    """
    p = 0 # to count the number of cases that survived
    n = 0 # to count the number of cases that passed away

    ### START CODE HERE ###
    # Hint 1: what is the equation for calculating entropy?
    # Hint 2: consider the case when p == 0 or n == 0, what should entropy be?
    
    # calculate number of p and n
    p = np.sum(data['hospital_death'] == 0)
    n = np.sum(data['hospital_death'] == 1)
    
    # probability of p and n
    if len(data):
        probP = p / len(data)
        probN = n / len(data)
    else:
        probP = 0
        probN = 0
    
    # calculate entropy
    if probP == 0 or probN == 0:
        entropy_value = 0
    else:
        entropy_value = round(-probP * math.log2(probP) - probN * math.log2(probN), 4)

    ### END CODE HERE ###

    return entropy_value

# [Note] You have to save the value of "ans_entropy" into the output file
# Please round your answer to 4 decimal places
ans_entropy = entropy(input_data)
print("ans_entropy = ", ans_entropy)

ans_entropy =  0.9928


Expected output:
> ans_entropy =  0.9928

In [42]:
def information_gain(data, mask):
    """
    This function will calculate the information gain
    args:
    * data(type: DataFrame): the data you're calculating for the information gain
    * mask(type: Series): partition information(left/right) of current input data,
      - boolean 1(True) represents split to left subtree
      - boolean 0(False) represents split to right subtree
    return:
    * ig(type: float): the information gain you can obtain by classifying the data with this given mask
    """
    ### START CODE HERE ###
    # Hint: you should use mask to split the data into two, then recall what is the equation for calculating information gain
    data = data.reset_index(drop=True)
    left = data[mask]
    right = data[~pd.Series(mask)]
    
    
    entropybefore = entropy(data)
    entropyafter = (len(left) / len(data)) * entropy(left) + (len(right) / len(data)) * entropy(right)
    ig = round(entropybefore - entropyafter, 4)
    ### END CODE HERE ###

    return ig

# [Note] You have to save the value of "ans_informationGain" into your output file
# Here, let's assume that we split the input_data with 2/3 of the data in the left subtree and 1/3 in the right subtree
# Please round your answer to 4 decimal places
temp1 = np.zeros((int(input_data.shape[0]/3), 1), dtype=bool)
temp2 = np.ones(((input_data.shape[0]-int(input_data.shape[0]/3), 1)), dtype=bool)
temp_mask = np.concatenate((temp1, temp2))
df_mask = pd.DataFrame(temp_mask, columns=['mask'])
ans_informationGain = information_gain(input_data, df_mask['mask'])
print("ans_informationGain = ", ans_informationGain)

ans_informationGain =  0.0385


Expected output:
> ans_informationGain = 0.0385

### Step 3: Find the Best Split
In this step, you will use the information gain calculated for each feature to find the best split. The best split is the point where the dataset is divided into two subgroups (left and right subtrees) in a way that maximizes the reduction of entropy.


> Method: Evaluate **every possible split** for each feature in the dataset. After sorting a feature's values, compute each candidate split point as the **median (that is, the midpoint) of two consecutive values** that differ. This value becomes the threshold for splitting the data into two branches. The split with the highest information gain is selected as the best split.

> Note: The method we have provided is a straightforward and basic approach. Please use this method to complete the basic part of the assignment. However, for the advanced part, you are welcome to explore and use other methods to fine-tune your model.
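
The threshold generation described above can be sketched in isolation. This is a minimal illustration on a plain list of values; `candidate_thresholds` is a hypothetical helper name and is not required by the template.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values, in ascending order."""
    distinct = sorted(set(values))          # drop duplicates, then sort
    # Pair each value with its successor and take the midpoint
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(candidate_thresholds([3, 1, 2, 2]))   # [1.5, 2.5]
```

A feature with k distinct values yields k - 1 candidate thresholds, so a constant feature yields none.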

In [43]:
def find_best_split(data, impl_part):
    """
    This function will find the best split combination of data
    args:
    * data(type: DataFrame): the input data
    * impl_part(type: string): 'basic' or 'advanced' to specify which implementation to use
    return
    * best_ig(type: float): the best information gain you obtain
    * best_threshold(type: float): the value that splits data into 2 branches
    * best_feature(type: string): the feature that splits data into 2 branches
    """
    best_ig = -1e9
    best_threshold = 0
    best_feature = ''

    if(impl_part == 'basic'):
        # Implement this part of the function using the method we provided
        ### START CODE HERE ###
        features = list(data.columns[:-1])
        for feature in features:
            # make mask for information gain
            values = sorted(list(set(data[feature])))
            # judge if values consist of 0 and 1s
            if set(values).issubset({0, 1}):
                thresholds = [0, 1]
            else:
                thresholds = list((values[i] + values[i+1]) / 2 for i in range(len(values)-1))
            
            for threshold in thresholds:
                mask = list(data[feature] <= threshold)
                cal_ig = information_gain(data, mask)
                # update ig if having more information
                if cal_ig > best_ig:
                    best_ig = cal_ig
                    best_threshold = threshold
                    best_feature = feature
        ### END CODE HERE ###
    else:
        # You can implement another method here for the advanced part
        ### START CODE HERE ###
        pass
        ### END CODE HERE ###

    return best_ig, best_threshold, best_feature


# [Note] You have to save the value of "ans_ig", "ans_value", and "ans_name" into the output file
# Here, let's try to find the best split for the input_data
# Please round your answer to 4 decimal places
ans_ig, ans_value, ans_name = find_best_split(input_data, 'basic')
print("ans_ig = ", ans_ig)
print("ans_value = ", ans_value)
print("ans_name = ", ans_name)

ans_ig =  0.2146
ans_value =  99.5
ans_name =  glucose_apache


Expected output:
> ans_ig =  0.2146

> ans_value =  99.5

> ans_name =  glucose_apache

### Step 4: Split into 2 branches

When you are building a decision tree, after identifying the best split, you will divide the dataset into two branches: a left branch and a right branch. Each branch represents a subset of the data based on the chosen **feature** and **split point**.

* The left branch will contain the data points that meet the condition of the split (e.g., values less than or equal to the split threshold).
* The right branch will contain the remaining data points (e.g., values greater than the split threshold).

This step is essential because it creates the subgroups that the decision tree will continue to split in subsequent steps. By repeatedly splitting the data into smaller and more homogenous branches, the tree becomes more capable of accurately classifying new data points.
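
To make the left/right rule concrete, here is a small sketch that uses a list of dictionaries as a stand-in for DataFrame rows. `partition_rows` is an illustrative name; the graded function works on a pandas DataFrame instead.

```python
def partition_rows(rows, feature, threshold):
    """Split rows into (left, right) by comparing one feature to a threshold."""
    left = [r for r in rows if r[feature] <= threshold]    # meets the condition
    right = [r for r in rows if r[feature] > threshold]    # everything larger
    return left, right

rows = [{'bmi': 17.4}, {'bmi': 21.0}, {'bmi': 26.6}]
left, right = partition_rows(rows, 'bmi', 21.0)
print(len(left), len(right))   # 2 1: a value equal to the threshold goes left
```

Note that a missing value (NaN) satisfies neither comparison, so it would land in neither branch.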

In [44]:
def make_partition(data, feature, threshold):
    """
    This function will split the data into 2 branches
    args:
    * data(type: DataFrame): the input data
    * feature(type: string): the attribute(column name)
    * threshold(type: float): the threshold for splitting the data
    return:
    * left(type: DataFrame): the divided data that matches(less than or equal to) the assigned feature's threshold
    * right(type: DataFrame): the divided data that doesn't match the assigned feature's threshold
    """
    ### START CODE HERE ###
    left = data[data[feature] <= threshold]
    right = data[data[feature] > threshold]
    ### END CODE HERE ###

    return left, right


# [Note] You have to save the value of "ans_left" into the output file
# Here, let's assume the best split is when we choose bmi as the feature and threshold as 21.0
left, right = make_partition(input_data, 'bmi', 21.0)
ans_left = left.shape[0]
print("ans_left = ", ans_left)

ans_left =  7


Expected output:
> ans_left = 7

### Step 5: Build the Decision Tree
Hang in there... we are almost done with this section!

Now, you need to use the above functions to complete a build_tree function.

> Method:
1. If the current depth < max_depth and the number of remaining samples > min_samples_split, continue splitting those samples
2. Use function *find_best_split()* to find the best split combination
3. If the obtained information gain is **greater than 0**, the tree can grow one level deeper (increase the depth by 1)
4. Use function *make_partition()* to split the data into two parts
5. Save the features and corresponding thresholds (starting from the root) used by the decision tree into *ans_features[]* and *ans_thresholds[]* respectively
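
The stopping rule in step 1, and the leaf that results when it fails, can be sketched as two small helpers. The names `should_split` and `majority_label` are assumptions made for illustration. Returning the majority label at a leaf is one common choice, and breaking ties in favour of the smaller label mirrors what `np.unique` followed by `np.argmax` would do.

```python
def should_split(depth, max_depth, n_samples, min_samples_split):
    """True while the node is shallow enough and holds enough samples."""
    return depth < max_depth and n_samples > min_samples_split

def majority_label(labels):
    """Most frequent label; ties go to the smaller label."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # Sort key: highest count first, then smallest label
    return min(counts, key=lambda label: (-counts[label], label))

print(should_split(0, 2, 10, 2))    # True: the root can still be split
print(should_split(2, 2, 10, 2))    # False: maximum depth reached
print(majority_label([0, 1, 1]))    # 1
```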

In [45]:
def build_tree(data, max_depth, min_samples_split, depth):
    """
    This function will build the decision tree
    args:
    * data(type: DataFrame): the data you want to apply to the decision tree
    * max_depth: the maximum depth of a decision tree
    * min_samples_split: the minimum number of instances required to do partition
    * depth: the depth of the current node (0 at the root)
    return:
    * subtree: the decision tree structure including root, branch, and leaf (with the attributes and thresholds)
    """
    ### START CODE HERE ###
    # check the condition of current depth and the remaining number of samples
    if depth < max_depth and len(data) > min_samples_split:
        # call find_best_split() to find the best combination
        best_ig, threshold, feature = find_best_split(data, 'basic')
        
        # check the value of information gain is greater than 0 or not
        if best_ig > 0:
            # update the depth
            depth += 1
            # call make_partition() to split the data into two parts
            left, right = make_partition(data, feature, threshold)
            # If there is no data split to the left tree OR no data split to the right tree
            if left.shape[0] == 0 or right.shape[0] == 0:
                # return the label of the majority
                unique_labels, counts = np.unique(list(data['hospital_death']), return_counts=True)
                label = unique_labels[np.argmax(counts)]
                return label
            else:
                # add the feature and threshold to the list
                ans_features.append(feature)
                ans_thresholds.append(threshold)
                
                question = "{} {} {}".format(feature, "<=", threshold)
                subtree = {question: []}

                # call function build_tree() to recursively build the left subtree and right subtree
                left_subtree = build_tree(left, max_depth, min_samples_split, depth)
                right_subtree = build_tree(right, max_depth, min_samples_split, depth)

                if left_subtree == right_subtree:
                    # this feature is unused and should be removed
                    ans_features.pop()
                    ans_thresholds.pop()
                    
                    subtree = left_subtree
                else:
                    subtree[question].append(left_subtree)
                    subtree[question].append(right_subtree)
        else:
            # return the label of the majority
            unique_labels, counts = np.unique(list(data['hospital_death']), return_counts=True)
            label = unique_labels[np.argmax(counts)]
            return label
    else:
        # return the label of the majority
        unique_labels, counts = np.unique(list(data['hospital_death']), return_counts=True)
        label = unique_labels[np.argmax(counts)]
        return label
    ### END CODE HERE ###

    return subtree

An example of the output from *build_tree()*
```
{'bmi <= 33.5': [1, {'age <= 68.5': [0, 1]}]}
```
Therefore,
```
ans_features = ['bmi', 'age']
ans_thresholds = [33.5, 68.5]
```
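
One way to check that the recorded lists match the tree is to walk the nested dictionary root-first and read the feature and threshold out of each question string. The helper below is a hypothetical sketch, assuming every question has the form `"<feature> <= <threshold>"` as in the example above.

```python
def collect_splits(tree):
    """Return (features, thresholds) in root-first (pre-order) order."""
    features, thresholds = [], []
    if not isinstance(tree, dict):      # a leaf label carries no split
        return features, thresholds
    question, (left, right) = next(iter(tree.items()))
    feature, _, threshold = question.split()
    features.append(feature)
    thresholds.append(float(threshold))
    for branch in (left, right):        # left subtree first, then right
        sub_features, sub_thresholds = collect_splits(branch)
        features += sub_features
        thresholds += sub_thresholds
    return features, thresholds

example = {'bmi <= 33.5': [1, {'age <= 68.5': [0, 1]}]}
print(collect_splits(example))   # (['bmi', 'age'], [33.5, 68.5])
```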

In [46]:
# Here, let's build a decision tree using the input_data

ans_features = []
ans_thresholds = []

decisionTree = build_tree(input_data, max_depth, min_samples_split, depth)
decisionTree

{'glucose_apache <= 99.5': [{'height <= 184.15': [0, 1]}, 1]}

Expected output:
> decisionTree = {'glucose_apache <= 99.5': [{'height <= 184.15': [0, 1]}, 1]}

In [47]:
# [Note] You have to save the features in the "decisionTree" structure (from root to branch and leaf) into the output file
ans_features

['glucose_apache', 'height']

Expected output:
> ans_features = ['glucose_apache', 'height']

In [48]:
# [Note] You have to save the corresponding thresholds for the features in the "ans_features" list into the output file
ans_thresholds

[99.5, 184.15]

Expected output:
> ans_thresholds = [99.5, 184.15]

### Step 6: Save answers

In [49]:
basic = []
basic.append(ans_entropy)
basic.append(ans_informationGain)
basic.append([ans_ig, ans_value, ans_name])
basic.append(ans_left)
basic.append(ans_features + ans_thresholds)

## Section 2: Build a Decision Tree Model

Congrats! You have completed all 5 crucial functions. Now, we will use the functions above to implement a simple decision tree. You will train the decision tree using a training set and make predictions using a validation set.

### Step 1: Split data into training set and validation set
> Note: We have split the data into a training set and a validation set. You **cannot** change the distribution of the data.

In [50]:
num_train = 30
num_validation = 10

training_data = input_data.iloc[:num_train]
validation_data = input_data.iloc[-num_validation:]

y_train = training_data[['hospital_death']]
x_train = training_data.drop(['hospital_death'], axis=1)

y_validation = validation_data[['hospital_death']]
x_validation = validation_data.drop(['hospital_death'], axis=1)
y_validation = y_validation.values.flatten()

print(input_data.shape)
print(training_data.shape)
print(validation_data.shape)

(40, 11)
(30, 11)
(10, 11)


### Step 2 to 4 : Make predictions with a decision tree

Define the attributes of the decision tree
> You **cannot** modify the values of these attributes in this part

In [51]:
max_depth = 2
depth = 0
min_samples_split = 2
n_features = x_train.shape[1]

We have already implemented the function *classify_data()* below; however, you may modify it if you prefer to complete it in your own way.

In [52]:
def classify_data(instance, tree):
    """
    This function will predict/classify the input instance
    args:
    * instance: an instance (case) to be predicted
    * tree: the decision tree (nested dict), or the subtree currently being walked
    return:
    * answer: the prediction result (the classification result)
    """
    equation = list(tree.keys())[0]
    if equation.split()[1] == '<=':
        temp_feature = equation.split()[0]
        temp_threshold = equation.split()[2]
        if instance[temp_feature] > float(temp_threshold):
            answer = tree[equation][1]
        else:
            answer = tree[equation][0]
    else:
        if instance[equation.split()[0]] in (equation.split()[2]):
            answer = tree[equation][0]
        else:
            answer = tree[equation][1]

    if not isinstance(answer, dict):
        return answer
    else:
        return classify_data(instance, answer)


def make_prediction(tree, data):
    """
    This function will use your pre-trained decision tree to predict the labels of all instances in data
    args:
    * tree: the decision tree
    * data: the data to predict
    return:
    * y_prediction: the predictions
    """
    ### START CODE HERE ###
    # [Note] You can call the function classify_data() to predict the label of each instance
    y_prediction = []
    idx = data.index.tolist()
    for i in idx:
        y_prediction.append(classify_data(data.loc[i], tree))

    ### END CODE HERE ###

    return y_prediction


def calculate_score(y_true, y_pred):
    """
    This function will calculate the f1-score of the predictions
    args:
    * y_true: the ground truth
    * y_pred: the predictions
    return:
    * score: the f1-score
    """
    score = f1_score(y_true, y_pred)

    return score
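
To illustrate how a single prediction walks the nested dictionary, here is a compact iterative sketch. It assumes every question has the form `"<feature> <= <threshold>"` and uses a plain dict as a stand-in for one DataFrame row; `predict_one` is an illustrative name and does not replace *classify_data()*.

```python
def predict_one(instance, tree):
    """Follow the questions from the root until a leaf label is reached."""
    while isinstance(tree, dict):                  # a non-dict node is a leaf label
        question, (left, right) = next(iter(tree.items()))
        feature, _, threshold = question.split()
        # Go right only when the value exceeds the threshold, otherwise left
        tree = right if instance[feature] > float(threshold) else left
    return tree

tree = {'glucose_apache <= 99.5': [{'height <= 184.15': [0, 1]}, 1]}
print(predict_one({'glucose_apache': 88.0, 'height': 180.3}, tree))   # 0
print(predict_one({'glucose_apache': 199.0, 'height': 173.0}, tree))  # 1
```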

In [53]:
decision_tree = build_tree(training_data, max_depth, min_samples_split, depth)

y_pred = make_prediction(decision_tree, x_validation)

# [Note] You have to save the value of "ans_f1score" into your output file
# Please round your answer to 4 decimal places
ans_f1score = calculate_score(y_validation, y_pred)
print("ans_f1score = ", ans_f1score)

ans_f1score =  0.4444444444444445


Expected output:
> ans_f1score =  0.4444
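
For reference, with binary labels `f1_score` treats label 1 as the positive class by default and reports the harmonic mean of precision and recall. The sketch below recomputes that quantity by hand on made-up labels (not the validation set of this lab); `f1_binary` is an illustrative name.

```python
def f1_binary(y_true, y_pred):
    """F1 score for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0                       # no true positives: F1 is 0
    precision = tp / (tp + fp)           # how many predicted positives are right
    recall = tp / (tp + fn)              # how many actual positives were found
    return 2 * precision * recall / (precision + recall)

# Toy labels: 2 true positives, 1 false positive, 1 false negative
score = f1_binary([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
print(round(score, 4))   # 0.6667
```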

In [54]:
# This is just for you to check your predictions
y_pred

[1, 1, 0, 1, 0, 0, 0, 0, 0, 1]

Expected output:
> y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1]

### Step 5: Save answer

In [55]:
basic.append(ans_f1score)

## Write to Output File
Save all of your answers into a csv file named **lab2_basic.csv**
> Note: Please do not touch the code in this step, we have made sure this outputs the correct file format.

In [56]:
basic_path = 'lab2_basic.csv'

basic_df = pd.DataFrame({'Id': range(len(basic)), 'Ans': basic})
basic_df.set_index('Id', inplace=True)
basic_df

Id,Ans
0,0.9928
1,0.0385
2,"[0.2146, 99.5, glucose_apache]"
3,7
4,"[glucose_apache, height, 99.5, 184.15]"
5,0.444444


In [57]:
basic_df.to_csv(basic_path, header = True, index = True)

# **Advanced Part** (65%)

In the advanced section of this lab, you will enhance your prediction capabilities by implementing a more powerful and complex machine learning model: the Random Forest. Random Forests are an ensemble learning method that builds multiple decision trees and combines their outputs to improve prediction accuracy and model robustness.
> * Step 1: Load training and testing data
> * Step 2: Split training data into training and validation set
> * Step 3: Build a Random Forest
> * Step 4: Make predictions with the random forest
> * Step 5: Write the Output File
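
The randomness in a random forest comes from giving each tree its own subset of rows and features. The following is a minimal sketch of that selection step, assuming sampling without replacement for both; classic random forests usually bootstrap the rows with replacement, so treat this as one possible variant. `draw_subset` and the feature names are hypothetical.

```python
import random

def draw_subset(n_rows, feature_names, n_samples, n_features, rng):
    """Pick row indices and feature names for one tree (no repeats)."""
    rows = rng.sample(range(n_rows), n_samples)
    features = rng.sample(feature_names, n_features)
    return rows, features

rng = random.Random(0)                       # fixed seed for reproducibility
names = ['age', 'bmi', 'height', 'weight']   # illustrative feature names
rows, features = draw_subset(10, names, n_samples=8, n_features=2, rng=rng)
print(sorted(rows), features)
```

Each tree is then trained only on its own rows and features, and the trees' predictions are combined afterwards.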

> ❗ **Important** ❗ You are allowed to create new functions to fine tune your random forest, but please make sure to **complete the functions provided**.



We have attached some references if you need help:
> https://medium.com/chung-yi/ml%E5%85%A5%E9%96%80-%E5%8D%81%E4%B8%83-%E9%9A%A8%E6%A9%9F%E6%A3%AE%E6%9E%97-random-forest-6afc24871857

> https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/



### Step 1: Load training and testing data
First, load **lab2_advanced_training.csv**. You will use this to **train** the random forest.

In [58]:
advanced_training_data = pd.read_csv('lab2_advanced_training.csv')
advanced_training_data

Unnamed: 0,age,bmi,gender,height,weight,pre_icu_los_days,arf_apache,bun_apache,creatinine_apache,gcs_eyes_apache,...,temp_apache,ventilated_apache,wbc_apache,apache_4a_hospital_death_prob,apache_4a_icu_death_prob,aids,cirrhosis,diabetes_mellitus,leukemia,hospital_death
0,79.0,25.616497,1,168.0,72.3,0.305556,0.0,20.0,0.92,4.0,...,36.3,1.0,7.20,0.28,0.07,0.0,0.0,0.0,0.0,0
1,43.0,23.494409,0,171.0,68.7,0.011806,0.0,9.0,0.70,1.0,...,39.5,1.0,21.20,0.53,0.48,0.0,0.0,0.0,0.0,0
2,62.0,29.145882,0,182.9,97.5,0.006250,0.0,54.0,3.59,1.0,...,35.0,1.0,19.20,0.62,0.45,0.0,0.0,0.0,0.0,1
3,72.0,41.183318,1,170.2,119.3,1.945139,0.0,53.0,2.25,4.0,...,37.1,0.0,10.40,0.11,0.02,0.0,0.0,1.0,0.0,1
4,87.0,22.914211,0,170.1,66.3,0.085417,0.0,33.0,1.60,4.0,...,36.1,1.0,16.50,0.16,0.08,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8495,55.0,33.201250,0,165.1,90.5,0.052083,0.0,15.0,1.09,4.0,...,37.7,1.0,7.60,0.06,0.04,0.0,0.0,0.0,0.0,0
8496,87.0,29.756001,1,142.0,60.0,0.014583,0.0,17.0,0.80,4.0,...,35.7,0.0,11.70,0.05,0.02,0.0,0.0,0.0,0.0,0
8497,80.0,17.630854,1,165.0,48.0,0.102778,0.0,30.0,0.90,3.0,...,35.6,1.0,45.80,0.25,0.12,0.0,0.0,0.0,0.0,1
8498,74.0,19.199423,0,175.3,59.0,0.460417,0.0,39.0,2.19,1.0,...,33.7,1.0,3.20,0.79,0.72,0.0,0.0,0.0,0.0,1


Next, load **lab2_advanced_testing.csv**. You will make predictions on this testing data using the pre-trained random forest model.

In [59]:
advanced_testing_data = pd.read_csv('lab2_advanced_testing.csv')
advanced_testing_data

Unnamed: 0,age,bmi,gender,height,weight,pre_icu_los_days,arf_apache,bun_apache,creatinine_apache,gcs_eyes_apache,...,sodium_apache,temp_apache,ventilated_apache,wbc_apache,apache_4a_hospital_death_prob,apache_4a_icu_death_prob,aids,cirrhosis,diabetes_mellitus,leukemia
0,82,38.733847,1,158.23,96.82,0.232639,0.0,50,3.32,1.0,...,135,33.0,1.0,14.8,0.84,0.71,0.0,0.0,0.0,0.0
1,65,22.692476,0,173.67,69.40,0.121528,0.0,33,1.40,1.0,...,133,32.1,1.0,12.5,0.70,0.54,0.0,0.0,1.0,0.0
2,72,33.702285,0,177.47,105.70,0.143750,0.0,17,1.71,1.0,...,143,33.9,1.0,17.8,0.49,0.29,0.0,0.0,0.0,0.0
3,81,20.274075,0,171.74,61.10,0.664583,0.0,35,2.09,3.0,...,136,36.4,1.0,9.0,0.40,0.31,0.0,0.0,0.0,0.0
4,41,29.027749,1,175.75,90.00,0.004167,0.0,3,0.41,1.0,...,149,32.3,1.0,24.0,0.71,0.64,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
895,61,22.840598,0,174.96,71.20,0.009722,0.0,22,0.90,4.0,...,134,36.5,0.0,5.2,0.04,0.03,0.0,0.0,0.0,0.0
896,74,28.843833,1,169.31,82.90,0.389583,0.0,12,0.74,3.0,...,144,36.4,1.0,9.9,0.02,0.00,0.0,0.0,0.0,0.0
897,68,22.744572,1,170.02,67.05,2.172917,0.0,36,1.87,3.0,...,137,35.8,1.0,29.7,0.16,0.11,0.0,0.0,1.0,0.0
898,55,25.356784,0,169.90,73.90,4.186111,0.0,12,1.38,4.0,...,139,35.9,1.0,15.3,0.01,-0.02,0.0,0.0,0.0,0.0


### Step 2: Split training data into training and validation set (Optional)
> You may split the training data into training and validation sets; this is up to you.

In [60]:
### START CODE HERE ###
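# Hold out the last 20% of the rows as a validation set.
# Note: training_data keeps ALL rows (hence the "_full" forest file name used later),
# so the validation rows are also available during training and the validation
# score is optimistic.
idx = int(advanced_training_data.shape[0] * 0.8)
training_data = advanced_training_data
validation_data = advanced_training_data.iloc[idx:]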

x_validation = validation_data.drop(['hospital_death'], axis=1)
y_validation = validation_data[['hospital_death']]
### END CODE HERE ###

### Step 3: Build a Random Forest

Define the attributes of the random forest
> * You **can** modify the values of these attributes in advanced part
> * Each tree can have different attribute values
> * Must use function *build_tree()* to build a random forest model
> * Must print out the *selected_datas* and *selected_features*


In [61]:
### START CODE HERE ###
# Define the attributes
max_depth = 2
depth = 0
min_samples_split = 2

# total number of trees in a random forest
n_trees = 500

# number of features to train a decision tree
n_features = int(sqrt(advanced_training_data.shape[1] - 1))

# the ratio to select the number of instances
sample_size = 0.8
n_samples = int(training_data.shape[0] * sample_size)
### END CODE HERE ###

In [62]:
def build_forest(data, n_trees, n_features, n_samples):
    """
    This function will build a random forest.
    args:
    * data: all data that can be used to train a random forest
    * n_trees: total number of trees
    * n_features: number of features used to train each tree
    * n_samples: number of instances used to train each tree
    return:
    * forest: a list of 'n_trees' decision trees
    """
    ### START CODE HERE ###
    data_len = data.shape[0]
    feature_list = list(data.columns[:-1])
    forest = []
    ### END CODE HERE ###

    # Create 'n_trees' number of trees and store each into the 'forest' list
    for i in range(n_trees):

        ### START CODE HERE ###
        # Select 'n_samples' number of samples and 'n_features' number of features
        # (you can select randomly or use any other techniques)
        
        # randomly select index of data and features
        selected_features = np.random.choice(feature_list, n_features, replace=False)
        selected_datas = data.loc[np.random.choice(data.index, size=n_samples, replace=False)]

        ### END CODE HERE ###

        # print(f"selected_datas = {selected_datas}")
        print(f"selected_features = {selected_features}")

        ### START CODE HERE ###
        # Store the rows in 'selected_datas' from 'data' into a new DataFrame
        tree_data = pd.DataFrame(selected_datas, columns=data.columns)

        # Filter the DataFrame for specific 'selected_features' (columns)
        tree_data = tree_data[selected_features.tolist() + ['hospital_death']]
        # print(tree_data)

        ### END CODE HERE ###

        # Then use the new data and 'build_tree' function to build a tree
        tree = build_tree(tree_data, max_depth, min_samples_split, depth)
        print(tree)

        # Save your tree
        forest.append(tree)

    return forest

In [63]:
predictions_path = 'predictions.csv'
forest_path = f'{n_features}_{max_depth}_{n_trees}_full.pkl'

In [64]:
# Build the trees in parallel with joblib
from joblib import Parallel, delayed

def build_single_tree(data, n_features, n_samples):
    """Build one tree of the forest from a random subset of rows and features."""
    # Randomly select rows and features (both without replacement)
    feature_list = data.columns[:-1]
    selected_features = np.random.choice(feature_list, n_features, replace=False)
    selected_datas = data.loc[np.random.choice(data.index, size=n_samples, replace=False)]

    # print(f"selected_datas = {selected_datas}")
    print(f"selected_features = {selected_features}")

    # Store the rows in 'selected_datas' from 'data' into a new DataFrame
    tree_data = pd.DataFrame(selected_datas, columns=data.columns)

    # Filter the DataFrame for specific 'selected_features' (columns)
    tree_data = tree_data[selected_features.tolist() + ['hospital_death']]

    # max_depth, min_samples_split and depth are the global attributes defined above
    tree = build_tree(tree_data, max_depth, min_samples_split, depth)
    return tree

# forest = Parallel(n_jobs=-1)(delayed(build_single_tree)(training_data, n_features, n_samples) for _ in range(n_trees))

In [None]:
import pickle

# Reuse a cached forest if one exists; otherwise train it and cache it
try:
    with open(forest_path, 'rb') as f:
        forest = pickle.load(f)
except FileNotFoundError:
    forest = Parallel(n_jobs=-1)(delayed(build_single_tree)(training_data, n_features, n_samples) for _ in range(n_trees))
    with open(forest_path, 'wb') as f:
        pickle.dump(forest, f)
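
The caching pattern in the cell above (load the forest if the file exists, otherwise build and save it) can be tested on its own. This sketch uses a temporary directory and a toy "forest"; `load_or_build` and `build_toy_forest` are hypothetical names introduced for illustration.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the pickled object at `path`, building and saving it if absent."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        obj = build()
        with open(path, 'wb') as f:
            pickle.dump(obj, f)
        return obj

calls = []                                    # records how often we really build

def build_toy_forest():
    calls.append(1)
    return [{'bmi <= 33.5': [1, 0]}]          # stand-in for a trained forest

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'forest.pkl')
    first = load_or_build(cache, build_toy_forest)    # builds and saves
    second = load_or_build(cache, build_toy_forest)   # loads from disk

print(len(calls))   # 1: the second call reused the cached file
```

Because the cache is keyed only by file name, a stale file is reused whenever the hyperparameters encoded in the name are unchanged, even if the training code itself has changed.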

In [30]:
# forest = build_forest(training_data, n_trees, n_features, n_samples)

### Step 4: Make predictions with the random forest

In [31]:
def make_prediction_forest(forest, data):
    """
    This function will use the pre-trained random forest to make the predictions
    args:
    * forest: the random forest
    * data: the data used to predict
    return:
    * y_prediction: the predicted results
    """
    y_prediction = []
    predictions = []

    ### START CODE HERE ###
    # Loop through each tree in the forest
    predictions = [make_prediction(tree, data) for tree in forest]
    
    # Here, each tree has made its predictions.
    # We can use majority vote in which the final prediction is determined by the mode (most frequent prediction) across all the trees.
    # Feel free to use any other method to determine the final prediction

    # Loop through each column of 'predictions': transposing once with zip
    # yields, for every instance, a tuple with each tree's prediction
    for column_predictions in zip(*predictions):
        # Majority vote; a tie (possible when n_trees is even) resolves to 0
        if column_predictions.count(1) > column_predictions.count(0):
            y_prediction.append(1)
        else:
            y_prediction.append(0)
    ### END CODE HERE ###

    return y_prediction
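
The voting rule is easy to check by hand on a tiny example. In the sketch below each inner list holds one tree's predictions for three instances; `majority_vote` is an illustrative name. Note the tie behaviour: with an even number of trees, an equal split counts as 0 under this rule.

```python
def majority_vote(per_tree_predictions):
    """Combine per-tree 0/1 predictions into one prediction per instance."""
    final = []
    for votes in zip(*per_tree_predictions):     # one tuple of votes per instance
        final.append(1 if votes.count(1) > votes.count(0) else 0)
    return final

print(majority_vote([[1, 0, 1],
                     [1, 1, 0],
                     [0, 0, 0]]))    # [1, 0, 0]
print(majority_vote([[1], [0]]))     # [0]: a 1-1 tie falls back to 0
```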

Validation (Optional)
> If you split the data into training and validation sets in step 2, you can assess the accuracy of the forest here.

In [32]:
### START CODE HERE ###
pred_validation = make_prediction_forest(forest, x_validation)
score = calculate_score(y_validation, pred_validation)
print(score)

# append the validation score and the forest file name to the log CSV
store_df = pd.DataFrame({'score': [score], 'forest': [forest_path]})
store_df.to_csv(predictions_path, mode='a', header = False, index = False)

### END CODE HERE ###

0.7501711156741958


After you have completed fine-tuning and validating the forest, you can proceed to make predictions on the test data.

In [33]:
y_pred_test = make_prediction_forest(forest, advanced_testing_data)

### Step 5: Write the Output File
Save your predictions from the **random forest** in a csv file named **lab2_advanced.csv**
> Note: Please do not touch the code in this step, we have made sure this outputs the correct file format.

In [34]:
advanced = []
for i in range(len(y_pred_test)):
    advanced.append(y_pred_test[i])

In [35]:
advanced_path = 'lab2_advanced.csv'

advanced_df = pd.DataFrame({'Id': range(len(advanced)), 'hospital_death': advanced})
advanced_df.set_index('Id', inplace=True)
advanced_df

Id,hospital_death
0,1
1,1
2,1
3,1
4,1
...,...
895,0
896,0
897,1
898,0


In [36]:
advanced_df.to_csv(advanced_path, header = True, index = True)