# Assignment 1: Decision Trees (10 marks)

Student Name: Tianhong Wang

Student ID: 1436415

## General info

<b>Due date</b>: 

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -10% per day up to 5 days (both weekdays and weekends count)
<ul>
    <li>one day late, -1.0 ;</li>
    <li>two days late, -2.0;</li>
    <li>three days late, -3.0;</li>
    <li>four days late, -4.0;</li>
    <li>five days late, -5.0;</li>
</ul>

<b>Marks</b>: This assignment will be marked out of 10, and make up 10% of your overall mark for this subject.

<b>Materials</b>: See <a href="https://canvas.lms.unimelb.edu.au/courses/151131/pages/python-and-jupyter-notebooks?module_item_id=4532241">[Using Jupyter Notebook and Python page]</a> on Canvas (under Modules> Coding Resources) for information on the basic setup required for this class, including an iPython notebook viewer and the python packages `numpy` and `pprint`. You can use any Python built-in packages, but do not use any other 3rd party packages; if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  


<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parenthesis after the instructions. 

You will be marked not only on the correctness of your methods and answers.

<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on Canvas > Assignments > Assignment1. We recommend you check it regularly.

<b>Academic misconduct</b>: This assignment is an individual task, and so reuse of code or other instances of clear influence will be considered cheating. Please check the <a href="https://canvas.lms.unimelb.edu.au/courses/124196/modules#module_662096">CIS Academic Honesty training</a> for more information. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where collusion or plagiarism are deemed to have taken place. Content produced by an AI (including, but not limited to ChatGPT) is not your own work, and submitting such content will be treated as a case of academic misconduct, in line with the <a href="https://academicintegrity.unimelb.edu.au/plagiarism-and-collusion/artificial-intelligence-tools-and-technologies"> University's policy</a>.

**IMPORTANT**

Please carefully read and fill out the <b>Authorship Declaration</b> form at the bottom of the page. Failure to fill out this form results in the following deductions: 
<UL TYPE="square">
<LI>Missing Authorship Declaration at the bottom of the page, -1.0
<LI>Incomplete or unsigned Authorship Declaration at the bottom of the page, -0.5
</UL>

## Overview

In this assignment, you will apply the Decision Tree (DT) classification algorithm to a real-world machine learning dataset. Specifically, you will predict the type of a given bridge using a diverse set of features, including its material, length, number of lanes, and other structural properties. We use a modified version of the publicly available <a href="https://archive.ics.uci.edu/ml/datasets/Pittsburgh+Bridges">Pittsburgh Bridges Data Set</a> (click the link for more information on all features and values).

In the **Data Preparation** section, you will read the dataset into a data frame and perform data preprocessing (Q1.1). You will also need to develop code to convert the features to numeric values (Q1.2). Then, the data will be divided into two parts: Train and Test.

In the **Model Training** section, we provide you with a number of helper functions that will be incorporated into the main Decision Tree function (Q2-8). Your task is to read and use these functions and explain the exact role of each helper function in developing a Decision Tree. You must provide a meaningful and descriptive name for each function that matches its specific role in training a Decision Tree model (e.g., count_labels, calculate_entropy, etc.).

You can observe the use of each of these functions in the Main Decision Tree Algorithm, which you can use to print your trained tree. We have also provided the code for navigating a trained tree and using it to predict the labels for the test dataset.

In the **Classification and Evaluation** section, you will test and evaluate your model. You will implement functions to calculate the accuracy of the model predictions (Q9). Finally, you will train a tree and use it to predict the labels for the test set and see the accuracy of your model (Q10).

Q11 concludes with analytical questions in which you will critically analyze the behavior of your decision tree in different situations.


In [1]:
import numpy as np
import pandas as pd

from collections import Counter
from pprint import pprint

## Data Preparation [1 mark]

The Bridges dataset consists of 108 bridges. Each bridge is defined by 11 features and 1 label. These features are `Index`, `River`, `Erected`, `Purpose`, `Length`, `Lanes`, `Clear`, `T-OR-D`, `Material`, `Span` and `Rel-L`. Note that the first feature is only an index and not useful as a feature for classification. 


#### Q1.1 Reading the input file and data preprocessing (0.25 marks)

This code should read the input file into a `pandas` dataframe. You should remove the first feature because it is only an id and perform some preprocessing. In the preprocessing phase you are going to replace all missing values (`?`) with the most frequent (most common) value of that feature. 

In [2]:
# read the input file
df = pd.read_csv("bridges.data.csv")
#print(df.head())

############ YOUR CODE HERE ##############

# First, drop the index column
df = df.iloc[:, 1:]


# Second, replace all missing values with the most frequent value of each feature
for col in ['Length', 'Lanes', 'Clear', 'T-OR-D', 'Material', 'Span', 'Rel-L']:
    # mode() returns a Series, so take its first entry as the replacement value
    df[col] = df[col].replace('?', df[col].mode()[0])
    
############## TEST IT YOURSELF ###############

assert df['Length'][0] == 'MEDIUM'
assert df['Lanes'][24] == '2'
df


Unnamed: 0,River,Erected,Purpose,Length,Lanes,Clear,T-OR-D,Material,Span,Rel-L,Label
0,M,CRAFTS,HIGHWAY,MEDIUM,2,N,THROUGH,WOOD,SHORT,S,WOOD
1,A,CRAFTS,AQUEDUCT,MEDIUM,1,N,DECK,WOOD,MEDIUM,S,WOOD
2,O,MODERN,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,F,ARCH
3,O,MATURE,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,LONG,S-F,CANTILEV
4,O,MODERN,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,LONG,F,CONT-T
...,...,...,...,...,...,...,...,...,...,...,...
103,M,MATURE,RR,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,S,SIMPLE-T
104,M,MODERN,RR,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T
105,M,MODERN,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,S,ARCH
106,M,MODERN,HIGHWAY,SHORT,4,G,THROUGH,STEEL,MEDIUM,F,CONT-T


#### Q1.2 Convert Categorical features to Numeric (0.5 marks)
In order to simplify our calculations, we are going to transform the features (columns) with categorical values to numbers. For example, for the feature `River` there are 3 possible values: 'A', 'M' and 'O'. 

**NOTE:** Sort them alphabetically and assign sequential integer values starting from 0. For `River` we will have: 'A' = 0, 'M' = 1, 'O' = 2.
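
As a small illustration of the encoding rule above, here is a pure-Python sketch. The helper name `encode_sorted` and the sample values are hypothetical, chosen only for this example; they are not part of the assignment code.

```python
def encode_sorted(values):
    """Map each distinct value to its rank in alphabetical order, starting at 0."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Hypothetical sample of River values, for illustration only
codes, mapping = encode_sorted(['M', 'A', 'O', 'O', 'M'])

print(mapping)  # {'A': 0, 'M': 1, 'O': 2}
print(codes)    # [1, 0, 2, 2, 1]
```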


In [3]:
############ YOUR CODE HERE ##############

for col_name in df.columns[:-1]:
    df[col_name] = pd.factorize(df[col_name],sort=True)[0]
df
    
##########################################

#print(df.head(10))

Unnamed: 0,River,Erected,Purpose,Length,Lanes,Clear,T-OR-D,Material,Span,Rel-L,Label
0,1,0,1,1,1,1,1,2,2,1,WOOD
1,0,0,0,1,0,1,0,2,1,1,WOOD
2,2,3,1,1,1,0,1,1,1,0,ARCH
3,2,2,1,1,1,0,1,1,0,2,CANTILEV
4,2,3,1,1,1,0,1,1,0,0,CONT-T
...,...,...,...,...,...,...,...,...,...,...,...
103,1,2,2,1,1,0,1,1,1,1,SIMPLE-T
104,1,3,2,1,1,0,1,1,1,0,SIMPLE-T
105,1,3,1,1,1,0,1,1,1,1,ARCH
106,1,3,1,2,2,0,1,1,1,0,CONT-T


#### Q1.3 Split the data into a Train and Test Set (0.25 marks)

The first 85 instances should be used for training and the last 23 instances for testing. NO SHUFFLING IS ALLOWED!

In [4]:
# the first 85 instances should be used for training and the last 23 instances for testing
# Don't shuffle the data!
train_df = df.iloc[0:85]
test_df = df.iloc[85:]

assert(len(train_df)==85)

## Training a Decision Tree [5 marks]

### Helper Functions

For answers to the questions Q2--Q7 we expect about 3-5 sentences, and for Q8 about 6-10 sentences.


**Q2. Provide a descriptive name for `func1` and explain the role of this function in training/building a Decision Tree. (0.5 marks)**

func1: is_only_one_label. This function returns a boolean indicating whether every instance in `data` has the same value in the `Label` column: `True` if there is exactly one unique label, `False` otherwise. In other words, it checks whether a node is pure. The main decision tree algorithm calls it as a stopping condition: once a node is pure, there is nothing left to separate and the node becomes a leaf.
</br>
</br>
</br>





In [5]:
# This function receives a 2D array of values. The passed instances should have a format such as:
#    [[1 0 1 1 1 1 1 2 2 1 'WOOD'] 
#     [0 0 0 1 0 1 0 2 1 1 'WOOD']
#     [2 3 1 1 1 0 1 1 1 0 'ARCH']
#     [2 2 1 1 1 0 1 1 0 2 'CANTILEV'] 
#     [2 3 1 1 1 0 1 1 0 0 'CONT-T']
#      ...]

def func1(data): 
    # the input `data` is a 2D array as illustrated above
    # the output `answer` is a BOOLEAN (True or False)
    
    column = data[:, -1]
    values = np.unique(column)

    if len(values) == 1:
        return True
    else:
        return False
    

**Q3. Provide a descriptive name for `func2` and explain the role of this function in training/building a Decision Tree. (0.5 marks)**


*func2: get_most_frequent_label. This function counts the occurrences of each unique value in the `Label` column, finds the index of the largest count, and returns the corresponding label. The main decision tree algorithm calls it whenever a node cannot be split any further (the node is pure, has fewer than `min_samples` instances, or the maximum depth has been reached) and assigns the returned label to that leaf. It therefore determines the prediction made at each leaf.*
</br>
</br>
</br>





In [6]:
def func2(data):
    # the input `data` is a 2D array as illustrated above
    # the output is a STRING 
    
    column = data[:, -1] 
    
    values, counts = np.unique(column, return_counts=True)
    index = counts.argmax()
    name = values[index]
    
    return name

**Q4. (1) Provide a descriptive name for `func3` and two parameters `name` and `value`. You need to explain what is the role of this function (and these two parameters) in training/building a Decision Tree.  
(2) What are `data1` and `data2` in the context of a Decision Tree? (1 mark)**  

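*(1) func3: partition_data_by_threshold; name: feature_name; value: threshold. This function splits `data` into two separate datasets using the feature called `feature_name` and the given `threshold`: instances whose feature value is below the threshold go into one dataset, and instances whose value is equal to or above it go into the other. The main decision tree algorithm uses it to divide a node's instances between its left and right children.*

*(2) `data1` and `data2` are the instances sent to the left and right child nodes of a split: `data1` holds the instances with feature value < threshold, and `data2` those with feature value >= threshold. The algorithm recurses on each of them until the node is as pure as possible or a stopping condition such as the maximum depth is met.*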
</br>
</br>
</br>
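
To make the `<` / `>=` partition concrete, here is a minimal sketch using NumPy boolean masks. The toy array, `feature_index` and `threshold` are hypothetical and stand in for the encoded training data and the `name`/`value` arguments.

```python
import numpy as np

# Hypothetical encoded instances: two feature columns plus a label column
data = np.array([[0, 1, 'WOOD'],
                 [1, 0, 'ARCH'],
                 [2, 1, 'ARCH']], dtype=object)

feature_index, threshold = 0, 1

# Boolean masks select the rows below the threshold and the rows at or above it
left = data[data[:, feature_index] < threshold]
right = data[data[:, feature_index] >= threshold]

print(len(left), len(right))  # 1 2
```

Every row lands in exactly one of the two partitions, so `len(left) + len(right)` always equals `len(data)`.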





In [7]:
def func3(data, array, name, value):
    # the input `data` is a 2D array 
    # the input `array` is the list of names
    # the input `name` specifies a member in the array
    # the input `value` specifies a value for that array member
    # the outputs `data1` and `data2` are 2D arrays 

    index = array.get_loc(name)
    data1 = data[data[:,index]<value]
    data2 = data[data[:,index]>=value]
    
    return data1, data2

**Q5. Provide a descriptive name for `func4` and explain what is the output of this function? (0.5 marks)**


*func4: get_entropy. This function counts how many times each unique label occurs in the `Label` column and converts the counts into proportions. It then multiplies each proportion by its self-information, -log2(proportion), and sums the results. The output `y` is the entropy of the label distribution in `data`, a single non-negative number that is 0 when all labels are identical. It is the building block for the mean information calculation.*
</br>
</br>
</br>
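
A short worked example of the entropy calculation, written with the standard library only. The function name `entropy` and the label lists are illustrative assumptions, not the notebook's own code.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(['WOOD'] * 4))                              # 0.0 (pure node)
print(entropy(['WOOD', 'WOOD', 'ARCH', 'ARCH']))          # 1.0 (two equally likely labels)
print(entropy(['WOOD', 'ARCH', 'SUSPEN', 'CANTILEV']))    # 2.0 (four equally likely labels)
```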



In [8]:
def func4(data):
    # the input `data` is a 2D array 
    # the output is a single real-valued number
    
    column = data[:, -1]
    _, counts = np.unique(column, return_counts=True)

    x = counts / counts.sum()  
    y = sum(x * -np.log2(x))
    
    return y

**Q6. Provide a descriptive name for `func5` and explain what is the output of this function? (0.5 marks)**

*func5: get_mean_information. This function computes the fraction of instances that fall into each of the two datasets, then calls get_entropy on each dataset and weights the result by that fraction. The output is the mean information of the split, i.e. the weighted average entropy of `data1` and `data2`. It is used in turn to calculate the information gain.*
</br>
</br>
</br>
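
Written out, the weighted average computed by `func5` is the following, where $H(\cdot)$ denotes the entropy returned by `func4` and $|D|$ the number of instances in $D$ (these symbols are introduced here for illustration and do not appear in the notebook):

$$\text{MeanInfo}(D_1, D_2) = \frac{|D_1|}{|D_1| + |D_2|}\,H(D_1) + \frac{|D_2|}{|D_1| + |D_2|}\,H(D_2)$$

For example, if `data1` has 2 instances with entropy 0 and `data2` has 2 instances with entropy 1, the mean information is $0.5 \cdot 0 + 0.5 \cdot 1 = 0.5$.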



In [9]:
def func5(data1, data2):
    # the inputs `data1` and `data2` are both 2D arrays
    # the output is a single real-valued number

    n = len(data1) + len(data2)
    x1 = len(data1) / n
    x2 = len(data2) / n

    y =  (x1 * func4(data1) + x2 * func4(data2))
    
    return y

**Q7. Provide a descriptive name for `func6` and explain what is the output of this function? (0.5 marks)**

*func6: get_information_gain. This function computes the entropy of the parent dataset `data0` and subtracts the mean information of the two child datasets `data1` and `data2`. The output is the information gain of the split, i.e. how much the split reduces entropy. `func7` uses this value to compare candidate splits.*
</br>
</br>
</br>
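
The sketch below works through information gain on two toy splits, using the standard library only. The function names and label lists are hypothetical and only mirror the roles of `func4`, `func5` and `func6`.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(left) + len(right)
    mean_info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - mean_info

parent = ['WOOD', 'WOOD', 'ARCH', 'ARCH']

# A split that separates the labels perfectly removes all of the parent's entropy
print(information_gain(parent, ['WOOD', 'WOOD'], ['ARCH', 'ARCH']))  # 1.0

# A split whose children look just like the parent gains nothing
print(information_gain(parent, ['WOOD', 'ARCH'], ['WOOD', 'ARCH']))  # 0.0
```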



In [10]:
def func6(data0, data1, data2):
    # the inputs `data0`, `data1` and `data2` are all 2D arrays
    # the output is a single real-valued number
    
    x0 = func4(data0)
    y = x0 - func5(data1, data2)
    
    return y

**Q8. Explain the following variables in `func7` and provide a descriptive name for them. (2 marks)</br>**
**1. x </br> 2. Y </br> 3. y1 </br> 4. y2**


x: information_gain. The information gain of the candidate split currently being evaluated: the entropy of `data` minus the mean information of `data1` and `data2`.
</br>
Y: largest_information_gain. The largest information gain found so far. Each new `x` is compared against it, and if `x` is greater than or equal to `Y`, `Y` is updated to `x`.
</br>
y1: best_feature_name. The name of the feature that produced the largest information gain `Y`.
</br>
y2: best_feature_threshold. The threshold value of that feature that produced the largest information gain `Y`.
</br>



In [11]:
def func7(data,array):
    
    first_iteration = True
    
    for a in array[:-1]:

        values = np.unique(data[:,array.get_loc(a)])
        
        for v in values:
            
            data1, data2 =  func3(data, array, a, v)
            
            x = func6(data, data1, data2)  
            
            if first_iteration or x >= Y:
                first_iteration = False
                Y = x
                y1 = a
                y2 = v
    
    return y1, y2
            

### Main Decision Tree Algorithm
This is the recursive part of developing the Decision Tree. If you have developed all the previous parts correctly, this function will build a Decision Tree for the passed dataset. We can set the maximum depth of the tree (`max_depth`) and the minimum number of samples a node must contain to be split further (`min_samples`). <br>


In [12]:
def decision_tree_algorithm(data, array, max_depth, counter=0, min_samples=2):
    
    # We first write the case where the recursive algorithm stops.
    # The algorithm stops if func1 returns `True`
    #    or the number of instances left in the node is less than the defined minimum samples (e.g., 2)
    #    or we have reached the maximum depth of the tree
    
#     print(counter, max_depth, func1(data), len(data) < min_samples, counter == max_depth)
       
    if (func1(data)) or (len(data) < min_samples) or (counter == max_depth): 
#         print('break')
        y = func2(data)
        return y
    
    # recursive part
    else:
#         print('recursive')
        
        counter += 1
        
        name, value = func7(data,array)
        data1, data2 = func3(data, array, name, value)
        
        if (len(data1) == 0) or (len(data2) == 0):
            y = func2(data)
            return y
    
        #instantiate the tree
        node = "{} < {}".format(name, value)
        sub_tree = {node: []}
              
        #develop the sub-trees (recursion)
        left_child = decision_tree_algorithm(data=data1, array=array, max_depth=max_depth, counter=counter)
        right_child = decision_tree_algorithm(data=data2, array=array, max_depth=max_depth, counter=counter)
        
        
        if left_child == right_child:
            sub_tree = right_child
        else:
            sub_tree[node].append(left_child)
            sub_tree[node].append(right_child)
        
        return sub_tree   
         

In [13]:
############## TEST IT YOURSELF ###############
data = train_df
tree = decision_tree_algorithm(data=data.values, array=data.columns, max_depth=3, counter=0)

pprint(tree)

{'Material < 2': [{'Purpose < 2': [{'Rel-L < 2': ['SIMPLE-T', 'SUSPEN']},
                                   'SIMPLE-T']},
                  'WOOD']}


With the max_depth of 3, your tree would be:

- `Material < 2`
  - true: `Purpose < 2`
    - true: `Rel-L < 2`
      - true: **SIMPLE-T**
      - false: **SUSPEN**
    - false: **SIMPLE-T**
  - false: **WOOD**
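
To show how the nested dictionary above is read, here is a minimal iterative sketch. The helper `walk` and the example instances are hypothetical; the first list entry under a question is taken when the condition holds, the second otherwise.

```python
# The depth-3 tree printed above
tree = {'Material < 2': [{'Purpose < 2': [{'Rel-L < 2': ['SIMPLE-T', 'SUSPEN']},
                                          'SIMPLE-T']},
                         'WOOD']}

def walk(example, node):
    """Follow the questions from the root until a leaf label is reached."""
    while isinstance(node, dict):
        question = next(iter(node))              # e.g. 'Material < 2'
        feature, _, value = question.split(' ')
        branch = 0 if example[feature] < float(value) else 1
        node = node[question][branch]
    return node

# Hypothetical encoded instances, for illustration only
print(walk({'Material': 1, 'Purpose': 1, 'Rel-L': 2}, tree))  # SUSPEN
print(walk({'Material': 2, 'Purpose': 0, 'Rel-L': 0}, tree))  # WOOD
```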


## Classification and Evaluation [1 mark]

Now that we have built our decision tree, we are ready to use it to predict a label for unseen instances. We provided the code to do this, below.

In [14]:
def predict_one_instance(example, tree):
    
    # tree is just a root node
    if not isinstance(tree, dict):
        return tree
    
    question = list(tree.keys())[0]
    feature_name, comparison_operator, value = question.split(" ")

    # ask question
    if (example[feature_name] < float(value)):
        answer = tree[question][0]
    else:
        answer = tree[question][1]

    # base case (the answer is not a dictionary)
    if not isinstance(answer, dict):
        return answer
    
    # recursive part
    else:
        residual_tree = answer
        return predict_one_instance(example, residual_tree)


In [15]:
def make_predictions(df, tree):
    
    if len(df) != 0:
        predictions = df.apply(predict_one_instance, args=(tree,), axis=1)
    else:
        # "df.apply()" with an empty dataframe returns an empty dataframe,
        # but "predictions" should be a series instead
        predictions = pd.Series(dtype=object)
        
    return predictions

**Q9. Write a function to calculate accuracy (0.5 marks)**

This function receives a dataframe and the predictions of a trained tree (on that dataset) and should calculate the accuracy of the predictions for the passed dataframe.

**NOTE:** You are NOT allowed to use Python accuracy methods here. 
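
Accuracy here is simply the fraction of instances whose predicted label equals the true label. A minimal sketch on plain lists, where the function name `accuracy` and the labels are made up for illustration:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction matches the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

true_labels = ['WOOD', 'ARCH', 'ARCH', 'SUSPEN']
predicted_labels = ['WOOD', 'ARCH', 'WOOD', 'SUSPEN']

print(accuracy(true_labels, predicted_labels))  # 0.75 (3 of 4 correct)
```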

In [16]:
def calculate_accuracy(df, predictions):
    
    ########### YOUR CODE HERE ##############
    correct = sum(df['Label'] == predictions)
    accuracy = correct / len(df)
    
    
    return accuracy


**Q10. Use the Tree (0.5 marks)**

Using all the developed functions above build a tree using the **Training set** (train_df) (max_depth = 2) and then use it to predict labels for the **Test set** (test_df). Then print the accuracy of your tree for the test set.

In [22]:
#############  YOUR CODE HERE ##############


data = train_df
tree = decision_tree_algorithm(data=data.values, array=data.columns, max_depth=2, counter=0)
test_data = test_df
prediction=make_predictions(test_data, tree)

accuracy = calculate_accuracy(test_data, prediction)
accuracy = accuracy * 100
print("Test set accuracy is: {:.2f}%".format(accuracy))

Test set accuracy is: 34.78%


## Q11. Analytical questions [2 marks]

Answer each of the following subquestions with a text answer of 3-5 sentences, using the results you obtained in the previous sections. You don't need to develop new functions. You might need to re-run and reuse some of the functions you already implemented in the ### CODE ### cells above.

### Q11.1 Manipulating the features (1 mark)

Let's see if changing the number of features changes the tree results.

**(A)** Train a tree (T1) with max_depth = 2, using the train_df. Use T1 to predict the labels for both train_df and test_df and print the accuracy for both datasets.

**(B)** For both train_df and test_df remove (drop) the `Material` feature. Then train another tree (T2) with max_depth = 2 using the new train set (no_mat_train_df). Use T2 to predict labels for new test and train set and check the accuracy of the predictions. Are the results different from what they were in part A? If yes, explain why. 

In [18]:
### CODE 11.1 (A) ###
data = train_df
T1 = decision_tree_algorithm(data=data.values, array=data.columns, max_depth=2, counter=0)
train_set = train_df
train_prediction=make_predictions(train_set, T1)
train_accuracy = calculate_accuracy(train_set, train_prediction)
print("Train dataset's accuracy is: {:.2f}%".format(train_accuracy * 100))

test_set = test_df
test_prediction=make_predictions(test_set, T1)
test_accuracy = calculate_accuracy(test_set, test_prediction)
print("Test dataset's accuracy is: {:.2f}%".format(test_accuracy * 100))


Train dataset's accuracy is: 61.18%
Test dataset's accuracy is: 34.78%


In [19]:
### CODE 11.1 (B) ###
no_mat_train_df = train_df.drop(columns='Material')
no_mat_test_df = test_df.drop(columns='Material')
T2 = decision_tree_algorithm(data=no_mat_train_df.values, array=no_mat_train_df.columns, max_depth=2, counter=0)
no_mat_train_prediction=make_predictions(no_mat_train_df, T2)
no_mat_train_accuracy = calculate_accuracy(no_mat_train_df, no_mat_train_prediction)
print("New train dataset's accuracy is {:.2f}%". format(no_mat_train_accuracy * 100))

no_mat_test_prediction = make_predictions(no_mat_test_df, T2)
no_mat_test_accuracy = calculate_accuracy(no_mat_test_df, no_mat_test_prediction)
print("New Test dataset's accuracy is: {:.2f}%".format(no_mat_test_accuracy * 100))
#train_df
#test_df


New train dataset's accuracy is 58.82%
New Test dataset's accuracy is: 43.48%



Yes, the results are different. The training set accuracy decreases from 61.18% to 58.82%. The tree built on the full feature set splits on `Material < 2` at the root, which means `Material` gave the largest information gain on the training data; once it is dropped, T2 has to split on a less informative feature and fits the training data slightly worse. The test set accuracy increases from 34.78% to 43.48%. A likely reason is that the `Material` split captures a pattern in the training set that does not carry over to the last 23 bridges, so removing it reduces overfitting. However, with only 23 test instances each prediction is worth about 4.35 percentage points, so this increase corresponds to just two additional correct predictions and may partly be due to chance.
</br>
</br>
</br>
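
The reported test accuracies are consistent with 8 and 10 correct predictions out of 23. These counts are inferred from the percentages rather than printed by the notebook, so the sketch below is only a sanity check; `pct` is a hypothetical helper.

```python
def pct(correct, total):
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)

n_test = 23

print(pct(8, n_test))   # 34.78
print(pct(10, n_test))  # 43.48
print(pct(1, n_test))   # 4.35, the change caused by a single test instance
```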



### Q11.2 Decision tree complexity (1 mark)

**(A)** Using the tree you generated in Q10 (name it `little_tree`), find the accuracy of the tree for predicting the labels for the **training set**. Do you notice a difference between train and test accuracy? If so, discuss possible reasons. 

**(B)** Now change the max_depth of the decision tree to 10 and train another tree (name it `big_tree`). Now use this new tree to predict the labels for test and train sets. Describe and explain any change in the results you notice compared to your tree of depth 2.</br>


In [20]:
### CODE 11.2 (A) ###
data = train_df
little_tree = decision_tree_algorithm(data=data.values, array=data.columns, max_depth=2, counter=0)
lt_prediction=make_predictions(train_df, little_tree)
lt_accuracy = calculate_accuracy(train_df, lt_prediction)
print("Little tree's training set accuracy is: {:.2f}%".format(lt_accuracy * 100))

Little tree's training set accuracy is: 61.18%


In [21]:
### CODE 11.2 (B) ###
big_tree = decision_tree_algorithm(data=data.values, array=data.columns, max_depth=10, counter=0)
bt_train_prediction = make_predictions(train_df, big_tree)
bt_train_accuracy = calculate_accuracy(train_df, bt_train_prediction)
print("Big tree's training set accuracy is {:.2f}%".format(bt_train_accuracy * 100))

bt_test_prediction = make_predictions(test_df, big_tree)
bt_test_accuracy = calculate_accuracy(test_df, bt_test_prediction)
print("Big tree's testing set accuracy is: {:.2f}%".format(bt_test_accuracy * 100))


Big tree's training set accuracy is 94.12%
Big tree's testing set accuracy is: 34.78%



*(A) Yes. The training set accuracy (61.18%) is higher than the test set accuracy (34.78%). There are a few possible reasons. First, the tree was built from the training set, so its splits and leaf labels are chosen to fit those 85 instances, and accuracy on data the tree has already seen is expected to be higher than on unseen data. Second, the test set has only 23 instances, so its accuracy estimate is noisy. Third, the data was split without shuffling, so the last 23 instances may have a different label distribution from the first 85, or contain more noise.*

*(B) For the training set, accuracy rises from 61.18% (depth 2) to 94.12% (depth 10), because a deeper tree can make more splits and so capture more complicated patterns in the training data.*

*For the test set, accuracy stays at 34.78% when the depth increases from 2 to 10. The additional splits therefore do not help on unseen instances: they most likely fit patterns specific to the training set (overfitting) rather than relationships that generalise. A deeper tree improves the fit to the training data here without improving test accuracy.*

# Authorship Declaration:

   (1) I certify that the program contained in this submission is completely
   my own individual work, except where explicitly noted by comments that
   provide details otherwise.  I understand that work that has been developed
   by another student, or by me in collaboration with other students,
   or by non-students as a result of request, solicitation, or payment,
   may not be submitted for assessment in this subject. The same holds for AI models including, but not limited to, ChatGPT. I understand that
   submitting for assessment work developed by or in collaboration with
   other students or non-students constitutes Academic Misconduct, and
   may be penalized by mark deductions, or by other penalties determined
   via the University of Melbourne Academic Honesty Policy, as described
   at https://academicintegrity.unimelb.edu.au.

   (2) I also certify that I have not provided a copy of this work in either
   softcopy or hardcopy or any other form to any other student, and nor will
   I do so until after the marks are released. I understand that providing
   my work to other students, regardless of my intention or any undertakings
   made to me by that other student, is also Academic Misconduct.

   (3) I further understand that providing a copy of the assignment
   specification to any form of code authoring or assignment tutoring
   service, or drawing the attention of others to such services and code
   that may have been made available via such a service, may be regarded
   as Student General Misconduct (interfering with the teaching activities
   of the University and/or inciting others to commit Academic Misconduct).
   I understand that an allegation of Student General Misconduct may arise
   regardless of whether or not I personally make use of such solutions
   or sought benefit from such actions.

   <b>Signed by</b>: [Tianhong Wang 1436415]
   
   <b>Dated</b>: [17/03/2023]