## Student #1 ID: 207047150

## Student #2 ID: 206727638

# Exercise 1: Decision Trees

In this assignment you will implement a Decision Tree algorithm as learned in class.

## Read the following instructions carefully:

1. This jupyter notebook contains all the step by step instructions needed for this exercise.
1. Write **efficient vectorized** code whenever possible. Some calculations in this exercise take several minutes when implemented efficiently, and might take much longer otherwise. Unnecessary loops will result in point deduction.
1. You are responsible for the correctness of your code and should add as many tests as you see fit. Those tests will not be graded nor checked.
1. You are free to add code and markdown cells as you see fit.
1. Write your functions in this jupyter notebook only. Do not create external python modules and import from them.
1. You are allowed to use functions and methods from the [Python Standard Library](https://docs.python.org/3/library/) and [numpy](https://www.numpy.org/devdocs/reference/) only, unless otherwise mentioned.
1. Your code must run without errors. It is a good idea to restart the notebook and run it from end to end before you submit your exercise.
1. Answers to qualitative questions should be written in **markdown cells (with $\LaTeX$ support)**.
1. Submit this jupyter notebook only, using your ID as the filename. **Do not use ZIP or RAR**. For example, your submission should look like this: `123456789.ipynb` if you worked by yourself or `123456789_987654321.ipynb` if you worked in pairs.

## In this exercise you will perform the following:
1. Practice OOP in python.
2. Implement two impurity measures: Gini and Entropy.
3. Construct a decision tree algorithm.
4. Prune the tree to achieve better results.
5. Visualize your results.

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# make matplotlib figures appear inline in the notebook
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Make the notebook automatically reload external python modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Warmup - OOP in python

Our decision tree will be implemented using a dedicated python class. Take a minute and practice your object oriented skills. Create a tree with some nodes and make sure you understand how objects in python work.

In [24]:
class Node(object):
    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, node):
        self.children.append(node)

In [25]:
n = Node(5)
p = Node(6)
q = Node(7)
n.add_child(p)
n.add_child(q)
n.children

[<__main__.Node at 0x1e5142684f0>, <__main__.Node at 0x1e514110c10>]

## Data preprocessing

We will use the breast cancer dataset that is available as a part of sklearn. In this example, our dataset will be a single matrix with the **labels on the last column**. Notice that you are not allowed to use additional functions from sklearn.

In [26]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

# load dataset
X, y = datasets.load_breast_cancer(return_X_y = True)
X = np.column_stack([X,y]) # the last column holds the labels

# split dataset
X_train, X_test = train_test_split(X, random_state=99)

print("Training dataset shape: ", X_train.shape)
print("Testing dataset shape: ", X_test.shape)



Training dataset shape:  (426, 31)
Testing dataset shape:  (143, 31)


In [27]:

# count how many training instances carry each label
labels = X_train[:, -1]
label_values, label_counts = np.unique(labels, return_counts=True)
print(label_counts)
print(label_values)

[168 258]
[0. 1.]


## Impurity Measures (10 points)

Impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. Implement the functions `calc_gini` (5 points) and `calc_entropy` (5 points). You are encouraged to test your implementation.
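
For reference, with $p_i$ denoting the fraction of instances in a set $S$ that belong to class $i$, the two measures are commonly written as below. The base-2 logarithm is an assumption here (it gives entropy in bits); a different base only rescales the entropy and does not change which split is best.

$$
\text{Gini}(S) = 1 - \sum_i p_i^2
\qquad\qquad
\text{Entropy}(S) = -\sum_i p_i \log_2 p_i
$$

As a rough sanity check, the training set above has 168 instances of one class and 258 of the other out of 426, so $p \approx (0.394, 0.606)$, which gives a Gini impurity of about $0.478$ and a base-2 entropy of about $0.968$.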

In [28]:
def calc_gini(data):
    """
    Calculate gini impurity measure of a dataset.
 
    Input:
    - data: any dataset where the last column holds the labels.
 
    Returns the gini impurity.    
    """
    gini = 1
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
    labels = data[:, -1]
    _, label_counts = np.unique(labels, return_counts=True)
    # vectorized: 1 minus the sum of squared class probabilities
    probabilities = label_counts / label_counts.sum()
    gini = 1 - np.sum(probabilities ** 2)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return gini

In [29]:
def calc_entropy(data):
    """
    Calculate the entropy of a dataset.

    Input:
    - data: any dataset where the last column holds the labels.

    Returns the entropy of the dataset.    
    """
    entropy = 0.0
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
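    labels = data[:, -1]
    _, label_counts = np.unique(labels, return_counts=True)
    # vectorized: -sum(p * log2(p)); base 2 gives the entropy in bits.
    # np.unique only returns labels that occur, so every p is > 0.
    probabilities = label_counts / label_counts.sum()
    entropy = -np.sum(probabilities * np.log2(probabilities))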
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return entropy


In [30]:
rows, cols = 5, 4

# Create a matrix with random numbers for all columns except the last
matrix = np.random.rand(rows, cols - 1)

# Add a column of random 0s and 1s as the last column
last_col = np.random.choice([0, 1], size=(rows, 1))

# Combine the matrix and the last column
matrix = np.hstack((matrix, last_col))
matrix

array([[0.00269196, 0.65094996, 0.04399517, 1.        ],
       [0.81910496, 0.18517757, 0.00783152, 1.        ],
       [0.60516867, 0.44703712, 0.20934001, 1.        ],
       [0.36774798, 0.2615676 , 0.35297144, 0.        ],
       [0.53076609, 0.56451058, 0.17239323, 0.        ]])

## Building a Decision Tree (50 points)

Use a Python class to construct the decision tree. Your class should support the following functionality:

1. Initializing a node for a decision tree. You will need to use several class methods and class attributes and you are free to use them as you see fit. We recommend that every node hold the **feature** and **value** used for the split and the **children** of that node. In addition, it might be a good idea to store the **prediction** in that node, the **height** of the tree for that node and whether or not that node is a **leaf** in the tree.
2. Your code should support both Gini and Entropy as impurity measures.
3. The provided data includes continuous data. For this exercise, create at most a **single split** for each node of the tree (your tree will be binary). Determine the threshold for splitting by checking all possible features and the values available for splitting. When considering the values, take the average of each consecutive pair. For example, for the values [1,2,3,4,5] you should test possible splits on the values [1.5, 2.5, 3.5, 4.5].
4. After you complete building the class for a decision node in the tree, complete the function `build_tree`. This function takes as input the training dataset and the impurity measure. Then, it initializes a root for the decision tree and constructs the tree according to the procedure you saw in class.
5. Once you are finished, construct two trees: one with Gini as an impurity measure and the other using Entropy.
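
The threshold rule in the list above can be sketched on a single feature as follows. This is only an illustration: the helper names `candidate_thresholds`, `gini` and `weighted_gain` are hypothetical, the toy labels are made up, and dropping duplicate values before taking midpoints is an assumption that the assignment does not state.

```python
import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values (hypothetical helper)."""
    distinct = np.unique(values)  # sorted, duplicates removed
    return (distinct[:-1] + distinct[1:]) / 2

def gini(labels):
    """Gini impurity of a 1-D label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gain(values, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two sides."""
    left = labels[values <= threshold]
    right = labels[values > threshold]
    n = labels.size
    return gini(labels) - (left.size / n) * gini(left) - (right.size / n) * gini(right)

values = np.array([1, 2, 3, 4, 5])   # the feature values from the example above
labels = np.array([0, 0, 1, 1, 1])   # toy labels, assumed for illustration
thresholds = candidate_thresholds(values)
gains = np.array([weighted_gain(values, labels, t) for t in thresholds])
best = thresholds[np.argmax(gains)]
print(thresholds)  # [1.5 2.5 3.5 4.5]
print(best)        # 2.5
```

With these toy labels the split at 2.5 separates the two classes perfectly, so its gain equals the parent impurity.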

In [33]:
class DecisionNode:
    '''
    This class will hold everything you need to construct a node in a DT. You are required to
    support basic functionality as previously described. It is highly recommended that you
    first read and understand the entire exercise before diving into this class.
    You are allowed to change the structure of this class as you see fit.
    '''

    def __init__(self, data):
        # you should take more arguments as inputs when initiating a new node
        self.data = data
        self.children = []
        self.parent = None  # set by add_child when this node is attached

        self.feature = None
        self.splitting_value = None
        self.score = None

        self.leaf = True
        self.height = 1

    def add_child(self, node):
        # ignore None so callers may pass the result of a failed build
        if node is not None:
            self.children.append(node)
            node.parent = self
            self.leaf = False
            node.height += self.height
     
    def check_split(self, feature, value):
        # this function divides the data according to a specific feature and value
        # you should use this function while testing for the optimal split
        group_A = self.data[self.data[:,feature] <= value]
        group_B = self.data[self.data[:,feature] > value]
        return group_A, group_B

    def info_gain(self, impurity_measure, group_A, group_B):
        '''
        Compute the impurity reduction achieved by a split.

        :param impurity_measure: either calc_entropy or calc_gini
        :param group_A: one part of the data after splitting
        :param group_B: the other part of the data after splitting
        :return: the parent impurity minus the size-weighted impurity of the two groups
        '''
        n = self.data.shape[0]
        weighted_impurity = (1 / n) * (group_A.shape[0] * impurity_measure(group_A)
                                       + group_B.shape[0] * impurity_measure(group_B))
        return impurity_measure(self.data) - weighted_impurity


    def split(self, impurity_measure):
        # this function goes over all possible features and values and finds
        # the optimal split according to the impurity measure. Note: you can
        # send a function as an argument

        # best split found so far: score, feature index and threshold value
        best_split = {'score': -np.inf, 'feature': None, 'threshold': None}

        num_of_features = self.data.shape[1] - 1
        for feature in range(num_of_features):
            thresholds = self.set_thresholds(feature)
            for threshold in thresholds:
                group_A, group_B = self.check_split(feature, threshold)

                split_score = self.info_gain(impurity_measure, group_A, group_B)

                # the score is an information gain, so higher is better
                if split_score > best_split['score']:
                    best_split = {'score': split_score, 'feature': feature, 'threshold': threshold}

        self.splitting_value = best_split['threshold']
        self.feature = best_split['feature']
        self.score = best_split['score']

        print(best_split)  # debug output

        return self.check_split(self.feature, self.splitting_value)

    def set_thresholds(self, feature):
        """
        Set a threshold between every two consecutive sorted values of the given feature.
        :param feature: index of the feature to compute thresholds for
        :return: an array of candidate values to split the data by
        """
        sorted_values = np.sort(self.data[:, feature])
        # vectorized midpoints of consecutive values
        return (sorted_values[:-1] + sorted_values[1:]) / 2


In [22]:
node = DecisionNode(matrix)
node.split(calc_entropy)

{'score': 1, 'feature': None, 'threshold': None}


TypeError: '<=' not supported between instances of 'float' and 'NoneType'

In [46]:
def build_tree(data, impurity):
    """
    Build a tree using the given impurity measure and training dataset. 
    You are required to fully grow the tree until all leaves are pure. 
 
    Input:
    - data: the training dataset.
    - impurity: the chosen impurity measure. Notice that you can send a function
                as an argument in python.
 
    Output: the root node of the tree.
    """
    root = None
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
    if data.shape[0] > 0:
        # every non-empty subset becomes a node; pure subsets stay leaves
        root = DecisionNode(data)
        if impurity(data) != 0:
            print(data.shape)  # debug output
            group_A, group_B = root.split(impurity)
            # guard against a split that fails to separate the data
            # (e.g. identical feature rows with different labels)
            if group_A.shape[0] > 0 and group_B.shape[0] > 0:
                root.add_child(build_tree(group_A, impurity))
                root.add_child(build_tree(group_B, impurity))
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return root

In [42]:
X_train

array([[1.200e+01, 2.823e+01, 7.677e+01, ..., 2.447e-01, 8.194e-02,
        1.000e+00],
       [1.157e+01, 1.904e+01, 7.420e+01, ..., 3.035e-01, 8.284e-02,
        1.000e+00],
       [1.646e+01, 2.011e+01, 1.093e+02, ..., 3.054e-01, 9.519e-02,
        0.000e+00],
       ...,
       [1.747e+01, 2.468e+01, 1.161e+02, ..., 2.160e-01, 9.300e-02,
        0.000e+00],
       [1.008e+01, 1.511e+01, 6.376e+01, ..., 2.933e-01, 7.697e-02,
        1.000e+00],
       [1.674e+01, 2.159e+01, 1.101e+02, ..., 4.863e-01, 8.633e-02,
        0.000e+00]])

In [47]:

# Python supports passing a function as an argument to another function.
tree_gini = build_tree(data=X_train, impurity=calc_gini)
tree_entropy = build_tree(data=X_train, impurity=calc_entropy)

(426, 31)
{'score': 0.342981396141687, 'feature': 27, 'threshold': 0.14235}
(271, 31)
{'score': 0.07003408275389443, 'feature': 3, 'threshold': 696.25}
(248, 31)
{'score': 0.010695659688335954, 'feature': 27, 'threshold': 0.1349}
(237, 31)
{'score': 0.0027065755314435672, 'feature': 10, 'threshold': 0.62555}
(234, 31)
{'score': 0.00043830813061569544, 'feature': 21, 'threshold': 33.349999999999994}
(18, 31)
{'score': 0.1049382716049383, 'feature': 21, 'threshold': 33.56}
(3, 31)
{'score': 0.4444444444444444, 'feature': 1, 'threshold': 18.630000000000003}
(11, 31)
{'score': 0.3173553719008265, 'feature': 15, 'threshold': 0.02744}
(5, 31)
{'score': 0.31999999999999984, 'feature': 0, 'threshold': 13.225000000000001}
(23, 31)
{'score': 0.2688510817055241, 'feature': 1, 'threshold': 16.375}
(18, 31)
{'score': 0.09295570079883814, 'feature': 19, 'threshold': 0.0015485}
(17, 31)
{'score': 0.05190311418685119, 'feature': 1, 'threshold': 18.6}
(2, 31)
{'score': 0.5, 'feature': 0, 'threshold': 1

## Tree evaluation (10 points)

Complete the functions `predict` and `calc_accuracy`.

After building both trees using the training set (using Gini and Entropy as impurity measures), you should calculate the accuracy on the test set and print the measure that gave you the best test accuracy. For the rest of the exercise, use that impurity measure. (10 points)
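
One possible shape for the two functions is sketched below on a hand-built two-leaf tree. It assumes a hypothetical `SketchNode` class (not the notebook's `DecisionNode`) whose leaves store a `prediction`, and it assumes that instances with a feature value `<=` the threshold go to the left child, matching `check_split`.

```python
import numpy as np

class SketchNode:
    """Hypothetical minimal node used only for this illustration."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, prediction=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left              # subtree for instance[feature] <= threshold
        self.right = right            # subtree for instance[feature] > threshold
        self.prediction = prediction  # label returned when this node is a leaf

def sketch_predict(node, instance):
    # walk down the tree until a node without children is reached
    while node.left is not None and node.right is not None:
        node = node.left if instance[node.feature] <= node.threshold else node.right
    return node.prediction

def sketch_accuracy(node, dataset):
    # the last column holds the labels, as in the rest of the notebook
    predictions = np.array([sketch_predict(node, row) for row in dataset])
    return 100.0 * np.mean(predictions == dataset[:, -1])

tree = SketchNode(feature=0, threshold=2.5,
                  left=SketchNode(prediction=0.0),
                  right=SketchNode(prediction=1.0))
data = np.array([[1.0, 0.0],
                 [2.0, 0.0],
                 [3.0, 1.0],
                 [4.0, 0.0]])   # toy data: one feature column plus the label column
print(sketch_accuracy(tree, data))  # 75.0
```

Three of the four toy rows are classified correctly, hence 75%.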

Node: Feature = 27, Split Value = 0.14235
  Node: Feature = 3, Split Value = 696.25
    Node: Feature = 27, Split Value = 0.1349
      Node: Feature = 10, Split Value = 0.62555
        Node: Feature = 21, Split Value = 33.349999999999994
          Leaf: Prediction = 0.1049382716049383
        Leaf: Prediction = 0.4444444444444444
      Node: Feature = 15, Split Value = 0.02744
        Leaf: Prediction = 0.31999999999999984
    Node: Feature = 1, Split Value = 16.375
      Node: Feature = 19, Split Value = 0.0015485
        Node: Feature = 1, Split Value = 18.6
          Leaf: Prediction = 0.5
  Node: Feature = 13, Split Value = 21.924999999999997
    Node: Feature = 21, Split Value = 29.0
      Leaf: Prediction = 0.19753086419753085
    Leaf: Prediction = 0.01408379860167999


In [None]:
def predict(node, instance):
    """
    Predict the label of a given instance using the decision tree.
 
    Input:
    - node: the root of the decision tree.
    - instance: a row vector from the dataset.
 
    Output: the prediction of the instance.
    """
    pred = None
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
    
    
    
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return pred

In [None]:
def calc_accuracy(node, dataset):
    """
    Calculate the accuracy of the decision tree on a given dataset.
 
    Input:
    - node: a node in the decision tree.
    - dataset: the dataset on which the accuracy is evaluated
 
    Output: the accuracy of the decision tree on the given dataset (%).
    """
    accuracy = 0
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
    
    
    
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return accuracy

## Print the tree (10 points)

Complete the function `print_tree`. Your code should do something like this (10 points):
```
[X0 <= 1],
  [X1 <= 2],
    [X2 <= 3],
      leaf: [{1.0: 10}]
      leaf: [{0.0: 10}]
    [X4 <= 5],
      leaf: [{1.0: 5}]
      leaf: [{0.0: 10}]
  leaf: [{1.0: 50}]
```
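
A recursive traversal is one way to produce output of this kind. The sketch below is an assumption-laden illustration: it represents nodes as plain dictionaries with hypothetical keys (`feature`, `threshold`, `children`, `label_counts`) rather than using the notebook's `DecisionNode`, and it collects the lines in a list so they can be inspected before printing.

```python
def sketch_tree_lines(node, depth=0):
    """Return the printed lines for a tree of nested dicts (hypothetical layout)."""
    indent = "  " * depth
    if "label_counts" in node:                       # leaf: show the label counts
        return [f"{indent}leaf: [{node['label_counts']}]"]
    lines = [f"{indent}[X{node['feature']} <= {node['threshold']}],"]
    for child in node["children"]:                   # recurse one level deeper
        lines.extend(sketch_tree_lines(child, depth + 1))
    return lines

toy_tree = {
    "feature": 0, "threshold": 1,
    "children": [
        {"feature": 1, "threshold": 2,
         "children": [{"label_counts": {1.0: 10}},
                      {"label_counts": {0.0: 10}}]},
        {"label_counts": {1.0: 50}},
    ],
}
lines = sketch_tree_lines(toy_tree)
print("\n".join(lines))
```

Indenting by depth keeps each leaf visually under the split that produced it.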

In [None]:
def print_tree(node):
    """
    Prints the tree similar to the example above.
    As long as the print is clear, any printing scheme will be fine
    
    Input:
    - node: a node in the decision tree.
 
    Output: This function has no return value.
    """
    
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
    
    
    
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return
    


## Post pruning (20 points)

Construct a decision tree and perform post pruning: For each leaf in the tree, calculate the test accuracy of the tree assuming no split occurred on the parent of that leaf and find the best such parent (in the sense that not splitting on that parent results in the best testing accuracy among possible parents). Make that parent into a leaf and repeat this process until you are left with the root. On a single plot, draw the training and testing accuracy as a function of the number of internal nodes in the tree. Explain and visualize the results and print your tree (20 points).
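
The pruning loop described above might be organized roughly as follows. This is a sketch under stated assumptions, not a reference solution: `PNode`, `predict`, `accuracy`, `internal_nodes`, `leaf_parents` and `prune_once` are hypothetical names, every node is assumed to store the majority label of its training data in `prediction`, and the tree and test rows are toy values.

```python
import numpy as np

class PNode:
    """Hypothetical node for this sketch; not the notebook's DecisionNode."""
    def __init__(self, prediction, feature=None, threshold=None, children=()):
        self.prediction = prediction    # majority label of the node's training data
        self.feature = feature
        self.threshold = threshold
        self.children = list(children)  # [left, right], or [] for a leaf

def predict(node, row):
    while node.children:
        node = node.children[0] if row[node.feature] <= node.threshold else node.children[1]
    return node.prediction

def accuracy(root, data):
    return 100.0 * np.mean([predict(root, row) == row[-1] for row in data])

def internal_nodes(node):
    if not node.children:
        return []
    found = [node]
    for child in node.children:
        found.extend(internal_nodes(child))
    return found

def leaf_parents(root):
    # internal nodes with at least one leaf child are the pruning candidates
    return [n for n in internal_nodes(root) if any(not c.children for c in n.children)]

def prune_once(root, test_data):
    """Turn the candidate whose removal gives the best test accuracy into a leaf."""
    best_node, best_acc = None, -1.0
    for node in leaf_parents(root):
        saved, node.children = node.children, []  # temporarily prune
        acc = accuracy(root, test_data)
        node.children = saved                     # restore
        if acc > best_acc:
            best_node, best_acc = node, acc
    best_node.children = []                       # make the best candidate a leaf
    return best_acc

# toy tree: the right subtree makes a mistake that pruning removes
right = PNode(1.0, feature=1, threshold=0.5, children=[PNode(1.0), PNode(0.0)])
root = PNode(1.0, feature=0, threshold=2.5, children=[PNode(0.0), right])
test = np.array([[1.0, 0.0, 0.0],
                 [2.0, 1.0, 0.0],
                 [3.0, 0.0, 1.0],
                 [4.0, 1.0, 1.0]])

history = [(len(internal_nodes(root)), accuracy(root, test))]
while root.children:
    acc = prune_once(root, test)
    history.append((len(internal_nodes(root)), acc))
print(history)
```

Each entry of `history` pairs the number of internal nodes with the test accuracy at that stage, which is the data needed for the requested plot; the training accuracy could be recorded alongside it in the same loop.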

In [None]:
###########################################################################
# TODO: Implement the function.                                           #
###########################################################################



###########################################################################
#                             END OF YOUR CODE                            #
###########################################################################