# Decision Trees - Edward King

In this section I will explore the Decision Tree Model of machine learning.

### Contents:

* Graph theory
* Decision trees
* Choosing optimal splitting conditions
* The code
* Potential improvements

Chapter 1 - Graph theory
---

### 1.1 - Graph theory basics ###

Graph theory is the field of mathematics dedicated to studying the structure and properties of graphs. In order to understand Decision Trees, we first need to know what a tree is. In this section, I will provide the definitions needed to rigorously define a tree.

>#### Definition 1.1: Graphs, Nodes, Arcs and Direction
>A graph is an ordered pair of sets $(V,\ E)$ where:
>* $V$ is a set of nodes (or vertices)
>* $E$ is a set of arcs (or edges) which are ordered pairs of vertices, i.e. $E\subseteq\{(x,\ y):x,y\in V\}$.
>
>It is important to note that arcs are directional: $(x,\ y)\neq(y,\ x)$.  
>We say a graph is directed if $\exists (x,\ y)\in E\ s.t.\ (y,\ x)\notin E$.  
>A graph is undirected if $\forall (x,\ y)\in E,\ (y,\ x)\in E$.

![alt text](images/LetteredGraph.png)  
_Figure 1.2 - an undirected graph._

By convention, if a graph is directed it is shown using arrows on the arcs. Therefore, we may assume that this graph is undirected.
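
Definition 1.1 translates almost directly into code. The sketch below stores a graph as a pair of Python sets and checks the directed/undirected conditions; the node names and the helper names `is_undirected` and `is_directed` are my own assumptions for illustration and are not taken from figure 1.2.

```python
# A graph as an ordered pair (V, E): a set of nodes and a set of ordered pairs.
# The nodes and arcs here are hypothetical, chosen only for illustration.
V = {"A", "B", "C"}
E = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "B")}


def is_undirected(arcs):
    """True if every arc (x, y) has its reverse (y, x) in the set."""
    return all((y, x) in arcs for (x, y) in arcs)


def is_directed(arcs):
    """True if at least one arc has no reverse arc."""
    return not is_undirected(arcs)


print(is_undirected(E))               # True
print(is_directed(E | {("A", "C")}))  # True: ("C", "A") is missing
```

Storing both $(x,\ y)$ and $(y,\ x)$ for every undirected arc mirrors the definition exactly, at the cost of holding each arc twice.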

Next we need a notion of cycles in a graph. To construct this, we need a series of definitions.

>#### Definition 1.3: Walks, Trails, Paths, Closure, Cycles and Connectedness
>* A walk is an ordered collection of nodes, $(x_1,\ x_2,\ x_3,\ldots,\ x_n)$, in which each consecutive pair of nodes is joined by an arc.  
>Every walk has a corresponding unique collection of arcs, $(e_1,\ e_2,\ldots,\ e_{n-1})$ where $e_i = (x_i,\ x_{i+1})\in E$.  
>N.B. we only consider walks of finite length.
>
>* A trail is a walk where the collection of arcs has no repeats i.e. $i\neq j\Rightarrow e_i\neq e_j,\ \forall i,j: 1\leq i,j\leq n-1$.
>
>* A path is a trail where the collection of nodes has no repeats i.e. $i\neq j\Rightarrow x_i\neq x_j,\ \forall i,j: 1\leq i,j\leq n$.
>
>* A walk is closed if $x_1=x_n$.
>
>* A cycle is a closed trail whose only repeated node is $x_1=x_n$. In an undirected graph we also require at least three distinct nodes, so that travelling along an arc and straight back does not count as a cycle.
>
>* A graph is cyclic if it has a cycle.
>
>* A graph is acyclic if it has no cycles.
>
>* Two nodes, $x_a,\ x_b$, are connected if there is a walk of the form $(x_a,\ x_i,\ x_{i+1}, \ldots,\ x_b)$.
>
>* A graph is connected if each pair of nodes ($x,\ y\in V$) is connected.

Visually we can see that there are multiple cycles in the graph in figure 1.2, e.g. $(A,\ B,\ C,\ E,\ A)$ and $(A,\ D,\ E,\ A)$, meaning that this graph is cyclic.
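
The definitions of walks, trails and paths can be checked mechanically. The following is a minimal sketch, assuming that consecutive nodes of a walk must be joined by an arc in $E$; the function names and the small example graph are hypothetical and do not reproduce figure 1.2.

```python
def arcs_of(walk):
    """Arcs (e_1, ..., e_{n-1}) of a walk (x_1, ..., x_n)."""
    return list(zip(walk, walk[1:]))


def is_walk(walk, arcs):
    """Every consecutive pair of nodes must be joined by an arc of the graph."""
    return all(e in arcs for e in arcs_of(walk))


def is_trail(walk, arcs):
    """A walk whose arcs are all different."""
    es = arcs_of(walk)
    return is_walk(walk, arcs) and len(es) == len(set(es))


def is_path(walk, arcs):
    """A trail whose nodes are all different."""
    return is_trail(walk, arcs) and len(walk) == len(set(walk))


def is_closed(walk):
    """A walk that ends where it started."""
    return walk[0] == walk[-1]


# Hypothetical undirected graph: a triangle A-B-C with a tail C-D.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
E = set(edges) | {(y, x) for (x, y) in edges}

print(is_path(("A", "B", "C", "D"), E))   # True
print(is_trail(("A", "B", "C", "A"), E))  # True: no arc is repeated
print(is_path(("A", "B", "C", "A"), E))   # False: node A is repeated
print(is_trail(("A", "B", "A", "B"), E))  # False: arc (A, B) is repeated
```

The closed trail $(A,\ B,\ C,\ A)$ in this example repeats only its first node, which is what makes it a cycle.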

>#### Definition 1.4: Trees, Roots, Parents, Children, Leaves and Binary Trees
>* A tree is a connected, acyclic graph.
>
>* In a tree, we designate one node to be the root.
>
>* Given two nodes $x,\ y\in V$ joined by an arc, if the path from $x$ to the root passes through $y$ then we say that $y$ is the parent node of $x$, and $x$ is a child node of $y$.
>
>* If a node has no children then it is a leaf node.
>
>* If each node in a tree has no more than two children then it is a binary tree.

![alt text](images/trees.png)  
_Figure 1.5 - three trees (the one on the right is a binary tree)._

N.B. Only nodes with at most two arcs can be designated as the root of a binary tree; however, any node can be designated the root of a non-binary tree.
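
Whether an undirected graph is a tree can also be tested in code. The sketch below relies on a standard fact that is not stated above: a connected undirected graph on $n$ nodes is acyclic exactly when it has $n-1$ edges. The function names and example graphs are assumptions made for illustration.

```python
from collections import deque


def is_connected(nodes, arcs):
    """Breadth-first search from an arbitrary node, ignoring arc direction."""
    nodes = set(nodes)
    if not nodes:
        return True
    neighbours = {v: set() for v in nodes}
    for x, y in arcs:
        neighbours[x].add(y)
        neighbours[y].add(x)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in neighbours[v] - seen:
            seen.add(w)
            queue.append(w)
    return seen == nodes


def is_tree(nodes, arcs):
    """Connected with exactly n - 1 undirected edges, hence acyclic."""
    undirected_edges = {frozenset(arc) for arc in arcs}
    return is_connected(nodes, arcs) and len(undirected_edges) == len(set(nodes)) - 1


V = {"A", "B", "C", "D"}
tree_arcs = {("A", "B"), ("A", "C"), ("C", "D")}

print(is_tree(V, tree_arcs))                   # True
print(is_tree(V, tree_arcs | {("D", "A")}))    # False: contains a cycle
print(is_tree(V, {("A", "B"), ("C", "D")}))    # False: not connected
```

Using `frozenset` means that an arc and its reverse count as a single undirected edge, so the check works whether or not both directions are stored.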

At last we have all the pieces needed to start on decision trees.

Chapter 2 - Decision Trees
---

### 2.1 - Decision tree basics
A decision tree acts on a set of data where each data point is given by an ordered set $(a_1,\ a_2,\ \ldots ,\ a_n,\ c)$ where each $a_i$ is a feature and $c$ is the classification of this data point. One of the main advantages of using a decision tree is that very little pre-processing of the data is needed, as each feature can be of any data type (e.g. continuous or discrete), though for simplicity I will assume that the data features are all continuous in my code.

A decision tree is a tree (or tree-like structure) where each node in the set $V$ describes a condition $C$ on a data feature (for example: $a_1\leq 1$, or $a_2 =$ "Honda") which indicates how data should be split as you traverse the tree (with the given examples: feature $a_1$ is less than or equal to $1$ or $a_2$ is "Honda"). If the node is a leaf node then the condition states what classification should be assigned to the data point.

![alt text](images/DecisionTree.png)  
_Figure 2.1 - A simple decision tree_
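
To make the idea concrete, here is a small hand-written decision tree stored as nested dictionaries. It is only a sketch: it reuses the two example conditions from the text, but the class labels, the dictionary keys and the `classify` helper are hypothetical and do not correspond to figure 2.1.

```python
# Internal nodes hold a condition on the features; leaf nodes hold a class.
tree = {
    "condition": lambda a: a[0] <= 1,            # a_1 <= 1
    "true": {"cls": "class X"},
    "false": {
        "condition": lambda a: a[1] == "Honda",  # a_2 = "Honda"
        "true": {"cls": "class Y"},
        "false": {"cls": "class Z"},
    },
}


def classify(node, a):
    """Follow the conditions from the root until a leaf is reached."""
    while "cls" not in node:
        node = node["true"] if node["condition"](a) else node["false"]
    return node["cls"]


print(classify(tree, (0.5, "Ford")))   # class X
print(classify(tree, (2.0, "Honda")))  # class Y
print(classify(tree, (2.0, "Ford")))   # class Z
```

Note that the second feature is a string while the first is a number, which illustrates that the conditions can mix data types.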

Consider the well known model, the Galton board, shown in figure 2.2 below. Consider each ball to be a data point which traverses the decision tree by falling under gravity. The pocket it falls into at the bottom determines which class it is assigned to. Each peg in the board represents a node in the decision tree that decides which path the data point should take down the board.

![alt text](images/GaltonBoard.png)  
_Figure 2.2 - Galton board_

### 2.2 - Decision trees in the wild
This definition allows for a wide range of use cases beyond the one detailed later. This concept is used frequently in game development to design the behaviour of non-player characters and enemies. This is clear in the board game Gloomhaven where players are given a decision tree to determine how the enemies move and attack the players. Each enemy has its own set of information (such as whether or not it is stunned, its position on the board, etc.) which can be considered as the features $(a_1,\ a_2,\ \ldots ,\ a_n)$ and the classification, $c$, is the action the enemy will take. The image below could be formalised to look more like figure 2.1 but I shall leave this as an exercise to the reader.

![alt text](images/gloomhaven_behaviour_tree.png)  
_Figure 2.3 - Gloomhaven behaviour tree_

### 2.3 - Back to machine learning ###
Now we return to the meat of this section: how decision trees are applied to machine learning. I will explore the "learning" as it applies here in the next section but first I will discuss some reasons why one might use decision trees over other models.

Advantages:
* Unlike other models (particularly neural networks), it is very easy to see how a decision tree comes to its conclusions.
* As mentioned in §2.1, the data features can be either discrete or continuous.
* Can be used in both classification and regression problems (though I'm only considering classification problems here).
* Fast query time - on average this has complexity $O(\log{n_{samples}})$<sup>[[1]](https://scikit-learn.org/stable/modules/tree.html)</sup>.

Disadvantages:
* Use of a greedy approach (the best choice is chosen at each stage without considering the effect on future choices). This can lead to a non-optimal solution but will be faster than a more comprehensive search.
* Small variations in the input data can lead to wildly different trees.
* Slow construction time - on average this has complexity $O(n_{samples}n_{features}\log(n_{samples}))$<sup>[[1]](https://scikit-learn.org/stable/modules/tree.html)</sup>. This issue can be mitigated since we only construct the tree once then query it from then on.
* Cannot add data to the tree once constructed - unlike KNN, our model cannot be improved once constructed.
* Prone to overfitting - (see Tobey's section for a more detailed description of the issue). Some remedies to this are discussed at the end of §2.4.


### 2.4 - Algorithm for constructing a decision tree ###
Now that we know what a decision tree looks like, can we construct one? Below I detail the algorithm for constructing a decision tree; italicised lines will be expanded upon later:
>1. Receive a dataset. _Select stopping requirements_.
>
>2. Create a root node. Mark it and the whole dataset as active.
>
>3. If the stopping requirements have been met, proceed to step 10, otherwise proceed to step 4.
>
>4. _Choose a condition on which to split the active dataset_. If the condition has non-positive information gain, proceed to step 10.
>
>5. Assign the splitting condition to the active node and split the active dataset into left and right subsets.
>
>6. Create a left child node and mark the left subset as active.
>
>7. Mark the left node as active and proceed to step 3.
>
>8. Create a right child node and mark the right subset as active.
>
>9. Mark the right node as active and proceed to step 3.
>
>10. Assign the mode class among the active dataset to the active node.
>
>11. If the active node is a left node, proceed to step 8. Otherwise proceed to step 12.
>
>12. _If there are incomplete nodes, proceed to step 3 with the first incomplete node as active._ Otherwise, proceed to step 13.
>
>13. Terminate the algorithm.


This algorithm can be optimised using recursion (which I use in my code), though for clarity I use this slightly more inefficient version here (this optimisation is relatively small as the majority of the complexity comes from step 4 as will be discussed later).

As promised, I will expand on steps 1 and 12 now. Step 4 will be left for its own section.

Step 1 is used as a method of preventing overfitting as well as enabling termination of the algorithm. There are three requirements, any one of which stops a branch from growing further:
1. Maximum distance from the root - if the active node is too deep in the tree.
2. Minimum sample size - if the data subset assigned to a node has too few elements.
3. Non-positive information gain - if the best splitting condition has zero or negative information gain (to be explained in the next chapter).

If any of these conditions is satisfied, the active node becomes a leaf (step 10) and the algorithm moves on to the next incomplete node.
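
The three stopping requirements can be gathered into a single check. This is a sketch only: the function name `should_stop` is hypothetical, and the direction of each inequality and the default values are assumptions chosen to mirror the `DecisionTree` class in Chapter 4.

```python
def should_stop(depth, num_samples, best_gain, max_depth=3, min_samples=4):
    """Return True if the active node should become a leaf."""
    return (
        depth >= max_depth             # 1. too far from the root
        or num_samples <= min_samples  # 2. too few data points
        or best_gain <= 0              # 3. non-positive information gain
    )


print(should_stop(depth=1, num_samples=50, best_gain=0.2))  # False
print(should_stop(depth=3, num_samples=50, best_gain=0.2))  # True
print(should_stop(depth=1, num_samples=4, best_gain=0.2))   # True
print(should_stop(depth=1, num_samples=50, best_gain=0.0))  # True
```

Lowering `max_depth` or raising `min_samples` gives a smaller tree, which is the lever used against overfitting.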

With regard to step 12, there is an algorithm to determine the next node (depth-first search), but it is not needed here because the recursive implementation visits the nodes in that order automatically.

I will now explain step 4 of the algorithm.

Chapter 3 - Choosing Optimal Splitting Conditions
---

### 3.1 - Information theory ###
Step 4 says: "Choose a condition on which to split the data assigned to the active node". We want to choose the "best" condition to do this so one might compare all possible conditions to find the best one. However, given two or more choices of splitting conditions, how do we evaluate which is more effective at splitting the data? What does "effective" even look like? For this problem, we need two concepts from information theory: the study of communication, storage and quantification of information.

### 3.2 - Impurity ###

Impurity is the measure of uncertainty in an event. For our purposes, impurity measures the uncertainty of picking a class from the dataset.

The first piece we need is to know the impurity of our dataset. There are two measures of impurity that I will use here: entropy and Gini index. The respective formulae used to calculate them are<sup>[[2]](https://www.ibm.com/topics/decision-trees)</sup>:
>
> \begin{align*}  
> \text{E}(D) = \sum_{c} -p_c\ \log(p_c)\\
> \text{G}(D) = 1 - \sum_{c} p_c^2
> \end{align*}

Where:
* $D$ is the dataset that is having its impurity calculated
* $p_c$ is the proportion of class $c$ in the dataset

N.B. The choice of base for the logarithm only changes the units in which entropy is measured. For machine learning we use $\log_2$, which measures entropy in bits. In other fields it is common to use the natural logarithm (nats) or base $10$.
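
A short worked example may help to build intuition for the two measures. The sketch below uses only the Python standard library; the function names and the two toy datasets are my own assumptions for illustration.

```python
from collections import Counter
from math import log2


def proportions(classes):
    """Proportion p_c of each class c present in the dataset."""
    counts = Counter(classes)
    total = len(classes)
    return [n / total for n in counts.values()]


def entropy(classes):
    """E(D) = sum over c of -p_c * log2(p_c), measured in bits."""
    return sum(-p * log2(p) for p in proportions(classes))


def gini(classes):
    """G(D) = 1 - sum over c of p_c squared."""
    return 1 - sum(p ** 2 for p in proportions(classes))


balanced = ["A", "A", "B", "B"]  # as uncertain as two classes can be
pure = ["A", "A", "A", "A"]      # no uncertainty at all

print(entropy(balanced), gini(balanced))  # 1.0 0.5
print(entropy(pure), gini(pure))          # 0.0 0.0
```

Both measures are zero for a dataset containing a single class and are largest when the classes are evenly mixed.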

### 3.3 - Information gain ###

Information gain is the measure of reduction in impurity of a dataset.

>The formula to calculate information gain is<sup>[[2]](https://www.ibm.com/topics/decision-trees)</sup>:  
>\begin{align*}
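>\text{IG}(C) = \text{Im}(P) - \sum_{\chi}\frac{|\chi|}{|P|}\,\text{Im}(\chi)
>\end{align*}

Where:
* $C$ is the condition that is having its information gain calculated
* $\text{Im}$ is the impurity measure being used
* $P$ is the parent dataset
* $\chi$ ranges over the child datasets produced by splitting $P$ on $C$
* $|\cdot|$ is the number of data points in a dataset, so each child's impurity is weighted by its share of the parent (this is what `calculateInfoGain` does in Chapter 4)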


Whichever splitting condition has the highest information gain is the one that should be used.
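
As a worked example, the sketch below computes the information gain of two candidate splits of the same parent dataset. It weights each child's impurity by its share of the parent, as the `calculateInfoGain` method in Chapter 4 does; the function names and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(classes):
    """Entropy of a list of class labels, in bits."""
    total = len(classes)
    return sum(-(n / total) * log2(n / total) for n in Counter(classes).values())


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = ["A", "A", "B", "B"]

# A split that separates the classes perfectly removes all impurity.
print(information_gain(parent, [["A", "A"], ["B", "B"]]))  # 1.0

# A split that leaves both children as mixed as the parent gains nothing.
print(information_gain(parent, [["A", "B"], ["A", "B"]]))  # 0.0
```

The first split would be preferred, since it has the higher information gain.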

### 3.4 - Step 4 algorithm ###

>1. Order the data features and find all unique values taken by the dataset for each feature (i.e. discard repeated values within a single feature).
>
>2. Mark all data features and unique values as unconsidered.
>
>3. Set the highest information gain to $-\infty$.
>
>4. Select an unconsidered data feature.
>
>5. Select an unconsidered value for the current feature.
>
>6. Calculate the information gain resulting from splitting the current data feature at the current value.
>
>7. If the gain is greater than the highest information gain, proceed to step 8. Otherwise proceed to step 11.
>
>8. Change the mark on the feature, value pair that was marked highest to discounted.
>
>9. Mark the current feature and value as highest.
>
>10. Set the highest information gain to the calculated value.
>
>11. If there are unconsidered values for the current feature, proceed to step 5. Otherwise proceed to step 12.
>
>12. If there are unconsidered features, proceed to step 4. Otherwise proceed to step 13.
>
>13. Return the feature and value marked highest, together with the highest information gain, and terminate the algorithm.
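
The search can be condensed into two nested loops. The following is a minimal standard-library sketch run on a tiny made-up dataset; it uses the Gini index, splits with $\leq$ as the code in Chapter 4 does, and the name `best_split` and the data are hypothetical.

```python
from collections import Counter


def gini(classes):
    """Gini index of a list of class labels."""
    total = len(classes)
    return 1 - sum((n / total) ** 2 for n in Counter(classes).values())


def best_split(rows):
    """rows: list of tuples (a_1, ..., a_n, c). Returns (feature, value, gain)."""
    labels = [row[-1] for row in rows]
    best = (None, None, float("-inf"))
    for feature in range(len(rows[0]) - 1):
        # Only the unique values of a feature need to be considered.
        for value in sorted({row[feature] for row in rows}):
            left = [row[-1] for row in rows if row[feature] <= value]
            right = [row[-1] for row in rows if row[feature] > value]
            if not left or not right:
                continue  # a split must produce two non-empty children
            gain = (gini(labels)
                    - len(left) / len(rows) * gini(left)
                    - len(right) / len(rows) * gini(right))
            if gain > best[2]:
                best = (feature, value, gain)
    return best


rows = [(1.0, 5.0, "A"), (2.0, 3.0, "A"), (3.0, 6.0, "B"), (4.0, 4.0, "B")]
print(best_split(rows))  # (0, 2.0, 0.5)
```

Here the first feature separates the classes perfectly at the value $2$, so that split wins over every split on the second feature.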

Chapter 4 - The Code
---

Firstly I import the libraries that I will use.  
* Numpy for manipulating arrays<sup>[[3]](https://numpy.org/doc/stable/)</sup>
* Pandas for importing large datasets<sup>[[4]](https://pandas.pydata.org/docs/)</sup>


N.B. I do use the scikit learn library later on<sup>[[5]](https://scikit-learn.org/stable/modules/classes.html)</sup>. Specifically, I use it to get a random sample from my dataset and to check how accurate my model was.

Next I import the datasets to be used on the model.

I have included both the iris dataset<sup>[[6]](http://archive.ics.uci.edu/dataset/53/iris)</sup> and the dry bean dataset<sup>[[7]](http://archive.ics.uci.edu/dataset/602/dry+bean+dataset)</sup>.

In [1]:
import numpy as np
import pandas as pd

In [2]:
''' If using the Iris data set use below: '''

# col_names = ['sepal_length',
#              'sepal_width',
#              'petal_length',
#              'petal_width',
#              'type']

# data = pd.read_csv('iris.data', sep=',', header=None, names=col_names)



''' If using the dry bean dataset, use below: '''

col_names = ['Area',
             'Perimeter',
             'MajorAxisLength',
             'MinorAxisLength',
             'AspectRation',
             'Eccentricity',
             'ConvexArea', 
             'EquivDiameter',
             'Extent',
             'Solidity',
             'Roundness',
             'Compactness',
             'ShapeFactor1',
             'ShapeFactor2',
             'ShapeFactor3',
             'ShapeFactor4',
             'Class']

data = pd.read_csv('Dry_Bean_Dataset.data', sep=',', header=None, names=col_names)


''' Leave this uncommented regardless of the dataset used. '''
data.head(10)

| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | Roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28395 | 610.291 | 208.178117 | 173.888747 | 1.197191 | 0.549812 | 28715 | 190.141097 | 0.763923 | 0.988856 | 0.958027 | 0.913358 | 0.007332 | 0.003147 | 0.834222 | 0.998724 | SEKER |
| 1 | 28734 | 638.018 | 200.524796 | 182.734419 | 1.097356 | 0.411785 | 29172 | 191.27275 | 0.783968 | 0.984986 | 0.887034 | 0.953861 | 0.006979 | 0.003564 | 0.909851 | 0.99843 | SEKER |
| 2 | 29380 | 624.11 | 212.82613 | 175.931143 | 1.209713 | 0.562727 | 29690 | 193.410904 | 0.778113 | 0.989559 | 0.947849 | 0.908774 | 0.007244 | 0.003048 | 0.825871 | 0.999066 | SEKER |
| 3 | 30008 | 645.884 | 210.557999 | 182.516516 | 1.153638 | 0.498616 | 30724 | 195.467062 | 0.782681 | 0.976696 | 0.903936 | 0.928329 | 0.007017 | 0.003215 | 0.861794 | 0.994199 | SEKER |
| 4 | 30140 | 620.134 | 201.847882 | 190.279279 | 1.060798 | 0.33368 | 30417 | 195.896503 | 0.773098 | 0.990893 | 0.984877 | 0.970516 | 0.006697 | 0.003665 | 0.9419 | 0.999166 | SEKER |
| 5 | 30279 | 634.927 | 212.560556 | 181.510182 | 1.171067 | 0.520401 | 30600 | 196.347702 | 0.775688 | 0.98951 | 0.943852 | 0.923726 | 0.00702 | 0.003153 | 0.85327 | 0.999236 | SEKER |
| 6 | 30477 | 670.033 | 211.050155 | 184.03905 | 1.146768 | 0.489478 | 30970 | 196.988633 | 0.762402 | 0.984081 | 0.85308 | 0.933374 | 0.006925 | 0.003242 | 0.871186 | 0.999049 | SEKER |
| 7 | 30519 | 629.727 | 212.996755 | 182.737204 | 1.165591 | 0.51376 | 30847 | 197.12432 | 0.770682 | 0.989367 | 0.967109 | 0.92548 | 0.006979 | 0.003158 | 0.856514 | 0.998345 | SEKER |
| 8 | 30685 | 635.681 | 213.534145 | 183.157146 | 1.165852 | 0.514081 | 31044 | 197.659696 | 0.771561 | 0.988436 | 0.95424 | 0.925658 | 0.006959 | 0.003152 | 0.856844 | 0.998953 | SEKER |
| 9 | 30834 | 631.934 | 217.227813 | 180.897469 | 1.200834 | 0.553642 | 31120 | 198.139012 | 0.783683 | 0.99081 | 0.970278 | 0.912125 | 0.007045 | 0.003008 | 0.831973 | 0.999061 | SEKER |


To construct a tree, we need to know what its nodes look like.  
In the following block, I define the `Node` class which stores information about:
* The data feature which is being split by this node.
* The threshold value which is applied to the feature.
* The node's left and right children (which may be `None`).
* The class to which data points should be assigned (`None` if not a leaf node).

In [3]:
class Node():
    def __init__(self, feature_id: int = None, threshold = None, cls: str = None):        
        self.feature_id = feature_id
        self.threshold = threshold
        self.left = None
        self.right = None

        self.cls = cls

Now, I write the class which will handle construction and querying of the decision tree.

It has four attributes:
* `max_depth` and `min_samples`: used to determine the stopping conditions as explained in §2.4.
* `root`: used as a base to construct the tree from.
* `imp_mode`: used to decide which impurity measure should be used.


and twelve methods:
* `__init__`: a constructor for the class.
* `constructTree`: initialisation of the construction algorithm laid out in §2.4.
* `setChildNodes`: the recursive function used to construct the tree.
* `getBestSplit`: runs the algorithm from §3.4.
* `calculateInfoGain`, `getEntropy` and `getGini`: all apply the respective formulae from §3.2 and §3.3.
* `splitDataset`: splits the dataset into left and right subsets according to a given feature and threshold.
* `getLeafValue`: gets the mode class in a dataset.
* `printTree`: displays the tree in a readable format.
* `predict`: runs `makePrediction` on each datapoint in a set.
* `makePrediction`: searches the tree to determine the class of a given datapoint.

N.B. When a variable must be defined but will not be used, I use an underscore: `_`.

In [4]:
class DecisionTree():
    def __init__(self, max_depth: int = 3, min_samples: int = 4, imp_mode: str = 'entropy'):
        self.max_depth = max_depth
        self.min_samples = min_samples
    
        self.root = None

        self.imp_mode = imp_mode



    # This is the function that initiates the algorithm.
    def constructTree(self, dataset):
        # Split the root
        _, num_features = np.shape(dataset[:,:-1])
        feature_id, threshold, _ = self.getBestSplit(dataset, num_features)
        self.root = Node(feature_id, threshold)


        # Find the root's children
        self.setChildNodes(self.root, dataset, 0)



    def setChildNodes(self, node, dataset, depth):
        num_samples, num_features = np.shape(dataset[:,:-1])
        depth += 1

        # Stopping conditions
        if (depth < self.max_depth) and (num_samples > self.min_samples):
            # Split child nodes
            left_dataset, right_dataset = self.splitDataset(node.feature_id, node.threshold, dataset)

            left_feature_id, left_threshold, left_info_gain = self.getBestSplit(left_dataset, num_features)
            right_feature_id, right_threshold, right_info_gain = self.getBestSplit(right_dataset, num_features)

            if left_info_gain > 0:
                # Set left node
                node.left = Node(left_feature_id, left_threshold)
                
                # Determine children of left node
                self.setChildNodes(node.left, left_dataset, depth)

            else:
                # Set leaf value
                node.left = Node(cls=self.getLeafValue(left_dataset[:, -1]))

            if right_info_gain > 0:
                # Set right node
                node.right = Node(right_feature_id, right_threshold)

                # Determine children of right node
                self.setChildNodes(node.right, right_dataset, depth)

            else:
                # Set leaf value
                node.right = Node(cls=self.getLeafValue(right_dataset[:, -1]))

        else:
            # Set leaf value if the algorithm stops
            node.cls = self.getLeafValue(dataset[:, -1])
    


    def getBestSplit(self, dataset, num_features):
        
        best_feat = None
        best_thresh = None
        best_info = -float("inf")
        
        # iterate over all features
        for feature_id in range(num_features):
            feature_values = dataset[:, feature_id]

            # find the unique values for the chosen feature
            possible_thresholds = np.unique(feature_values)

            # iterate over all unique values
            for threshold in possible_thresholds:
                # get current split
                dataset_left, dataset_right = self.splitDataset(feature_id, threshold, dataset)
                
                # check children are not null
                if len(dataset_left)>0 and len(dataset_right)>0:
                    class_list, class_list_left, class_list_right = dataset[:, -1], dataset_left[:, -1], dataset_right[:, -1]
                    
                    # compute information gain
                    curr_info_gain = self.calculateInfoGain(class_list, class_list_left, class_list_right)
                    
                    # update the best split if necessary
                    if curr_info_gain>best_info:
                        best_feat = feature_id
                        best_thresh = threshold
                        best_info = curr_info_gain
                        
        # return best split
        return best_feat, best_thresh, best_info
    


    # Implements the formulae from §3.3
    def calculateInfoGain(self, class_list, class_list_left, class_list_right):
        if self.imp_mode == 'entropy':
            parent_imp = self.getEntropy(class_list)
            left_imp = self.getEntropy(class_list_left)
            right_imp = self.getEntropy(class_list_right)

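        elif self.imp_mode == 'gini':
            parent_imp = self.getGini(class_list)
            left_imp = self.getGini(class_list_left)
            right_imp = self.getGini(class_list_right)

        else:
            # Fail early rather than hitting an undefined variable below.
            raise ValueError("imp_mode must be 'entropy' or 'gini'")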

        left_weight = len(class_list_left) / len(class_list)
        right_weight = len(class_list_right) / len(class_list)

        return parent_imp - left_weight * left_imp - right_weight * right_imp
    


    # Implement the formulae from §3.2
    def getEntropy(self, class_list):
        unique_classes = np.unique(class_list)
        entropy = 0
        for cls in unique_classes:
            p_cls = len(class_list[class_list == cls]) / len(class_list)
            entropy += -p_cls * np.log2(p_cls)
        return entropy
    
    def getGini(self, class_list):
        unique_classes = np.unique(class_list)
        gini = 0
        for cls in unique_classes:
            p_cls = len(class_list[class_list == cls]) / len(class_list)
            gini += p_cls ** 2
        return 1 - gini



    # This splits the data according to a given condition.
    def splitDataset(self, feature_id, threshold, dataset):
        dataset_left = np.array([row for row in dataset if row[feature_id]<=threshold])
        dataset_right = np.array([row for row in dataset if row[feature_id]>threshold])
        return dataset_left, dataset_right
    


    # Gets the mode class in a dataset.
    def getLeafValue(self, Y):
        Y = list(Y)
        return max(Y, key=Y.count)
    


    # Outputs the tree in a readable format.
    def printTree(self, tree = None, indent: str = " ", feat_names = None):
        
        # On the first call, starts at the root.
        if tree is None:
            tree = self.root

        # If the node is a leaf node return its value.
        if tree.cls is not None:
            print(tree.cls)

        else:
            # Determines the output format.
            if not feat_names:
                print("X_" + str(tree.feature_id), "<=", tree.threshold)

            else:
                print(str(feat_names[tree.feature_id]), "<=", tree.threshold)

            # Continues to the child nodes.
            print("%sleft:" % (indent), end="")
            self.printTree(tree.left, indent + indent, feat_names)
            
            print("%sright:" % (indent), end="")
            self.printTree(tree.right, indent + indent, feat_names)



    # Predicts the outputs of a dataset.
    def predict(self, X):
        predictions = [self.makePrediction(x, self.root) for x in X]
        return predictions



    # Predicts the class of a single datapoint.
    def makePrediction(self, x, tree):

        if tree.cls is not None:
            return tree.cls
        
        feature_val = x[tree.feature_id]

        if feature_val <= tree.threshold:
            return self.makePrediction(x, tree.left)
        
        else:
            return self.makePrediction(x, tree.right)
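
As an aside, the row-by-row list comprehension in `splitDataset` could be replaced with NumPy boolean masks. The sketch below is a possible alternative rather than the code used in this section; the standalone function name `split_dataset` and the small numeric dataset are hypothetical.

```python
import numpy as np


def split_dataset(feature_id, threshold, dataset):
    """Split rows on dataset[:, feature_id] <= threshold using boolean masks."""
    column = dataset[:, feature_id]
    left_mask = column <= threshold
    right_mask = column > threshold
    return dataset[left_mask], dataset[right_mask]


# Hypothetical data: one feature column followed by a numeric class column.
dataset = np.array([[1.0, 0.0],
                    [2.0, 0.0],
                    [3.0, 1.0]])

left, right = split_dataset(0, 2.0, dataset)
print(left.shape, right.shape)  # (2, 2) (1, 2)

# An empty side keeps its two-dimensional shape.
_, empty = split_dataset(0, 5.0, dataset)
print(empty.shape)  # (0, 2)
```

Masking avoids a Python-level loop over the rows, and an empty subset is still a two-dimensional array, so indexing such as `[:, -1]` remains valid.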

Next I do some reshaping of the dataset and get a random selection of datapoints to construct the tree from.

This is the first of two uses of the sklearn library in my section. As you can see, I do not use it to implement any of the model.

In [5]:
X = data.iloc[:, :-1].values
Y = data.iloc[:, -1].values.reshape(-1,1)

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=1000, test_size=100)

Now I run the code to construct the tree from the chosen dataset. In this example I use the gini index measure. I then display the tree with the `printTree` method.

In [6]:
tree = DecisionTree(max_depth=5, min_samples=5, imp_mode="gini")

dataset = np.concatenate((X_train, Y_train), axis=1)

tree.constructTree(dataset)

tree.printTree(feat_names=col_names)

Perimeter <= 749.909
 left:AspectRation <= 1.3435720617876916
  left:MinorAxisLength <= 181.66087987226535
    left:Roundness <= 0.925011277935379
        left:SIRA
        right:DERMASON
    right:ShapeFactor4 <= 0.9938135901063284
        left:DERMASON
        right:SEKER
  right:ConvexArea <= 37483.0
    left:MajorAxisLength <= 280.69309975606564
        left:DERMASON
        right:HOROZ
    right:Roundness <= 0.9220680534983396
        left:DERMASON
        right:SIRA
 right:AspectRation <= 1.8514302867459431
  left:MinorAxisLength <= 210.48639000515425
    left:Roundness <= 0.8353305514534717
        left:HOROZ
        right:SIRA
    right:Area <= 107911.0
        left:CALI
        right:BOMBAY
  right:MinorAxisLength <= 215.48975027642743
    left:ShapeFactor4 <= 0.998228338038105
        left:HOROZ
        right:HOROZ
    right:Area <= 87055.0
        left:CALI
        right:BOMBAY


Finally, I use the sklearn library once more to determine the accuracy of my model.

In [7]:
Y_pred = tree.predict(X_test) 
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, Y_pred)

0.75
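
The accuracy score is simply the fraction of predictions that match the true class. A minimal sketch of the same calculation without sklearn is given below; the function name `accuracy` is hypothetical and the labels are made-up examples rather than real model output.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only.
y_true = ["SEKER", "SIRA", "HOROZ", "CALI"]
y_pred = ["SEKER", "SIRA", "DERMASON", "CALI"]

print(accuracy(y_true, y_pred))  # 0.75
```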

Chapter 5 - Potential Improvements
---

Aside from the inefficiencies resulting from my incompetence, there are a few methods that could be employed to improve my code.

As mentioned in §2.3, decision trees can be used in both classification and regression and their inputs can be both discrete and continuous. My code currently only works with continuous data in a classification problem.

In order to improve the effectiveness of the model, I could make use of concepts such as random forests (see Stavros' section) and XGBoost.

References:
---

[[1]](https://scikit-learn.org/stable/modules/tree.html) scikit-learn documentation for decision trees  
[[2]](https://www.ibm.com/topics/decision-trees) IBM decision tree page  
[[3]](https://numpy.org/doc/stable/) NumPy library  
[[4]](https://pandas.pydata.org/docs/) pandas library  
[[5]](https://scikit-learn.org/stable/modules/classes.html) scikit-learn library  
[[6]](http://archive.ics.uci.edu/dataset/53/iris) Iris dataset  
[[7]](http://archive.ics.uci.edu/dataset/602/dry+bean+dataset) Dry bean dataset  