## Decision tree implementations

This notebook discusses the theoretical foundations and two implementations of the **Classification and Regression Trees (CART)** algorithm for supervised learning. The method is also known as **decision trees**, because the way it makes predictions mimics how humans make decisions in real life: by applying a sequence of rules to the inputs with some goal in mind.

The objective of this project is to review *how decision trees are built* and *how they produce a prediction from an input vector*, and, most importantly, to strengthen this understanding by analyzing *Python code that implements decision tree predictive modeling*.

Decision trees are important for two main reasons. First, they are among the best learning methods in terms of **interpretability and explainability**. Decision trees can be easily plotted, so their structure and construction can be explained both in words and with visualizations. Their functioning is also easy to follow on the graphical representation: an input vector $x$ can be traced through the nodes, so the final prediction is immediately interpretable by the user.

The second reason is that some of the most widely used and most effective learning algorithms are **tree-based**. Since decision trees are relatively simple in their structure, the trade-off between interpretability and performance implies that a single tree is not expected to achieve competitive predictive performance in most empirical applications. However, methods built upon decision trees, such as *bagging, random forest, and gradient boosting*, try to reduce the variance or the bias of the estimates by adding more structure to the models, even though this negatively affects their interpretability.

The structure of a decision tree derives from binary splits that sequentially lead a data point from the root node, which contains all observations, to a leaf (or terminal) node, which contains only the subset of observations that satisfy all the rules along the path. A decision tree is composed of the following elements:
1. **Node:** gathers a subset of observations. It can be described by a dataset with features and a label, its size (i.e., the number of observations in the node), and, for a classification task, the distribution of these observations across the classes of the response variable (from which a measure of the quality of the split can be calculated). This distribution (or the average value of the response, for a regression task) is crucial for terminal nodes, since it gives rise to the prediction for a given input vector $x$.
2. **Splits:** every node other than a leaf holds one binary split, which depends on an input variable $X_j$, a relational operator ($<$, $\leq$, $>$, $\geq$ or $=$), and a reference value $x_j$. Splits are implemented through rules such as $X_j \leq x_j$ (for a continuous variable), which justifies the name *decision trees*. Consequently, apart from leaf nodes, all nodes are also described by an input variable (e.g., $X_j$) and by a rule (e.g., $X_j \leq x_j$). When a split is applied to a node, its observations are divided into two regions: one where the rule holds and its complement. Graphically, the tree receives two branches leading to two additional (child) nodes, which can be described in the same way as their parent node. The splits can also be depicted as disjoint regions that are sequentially created in the feature space. Given a complete tree, an input vector belongs to exactly one terminal node and to exactly one region of the feature space.
3. **Levels:** each node sits at a level (depth) that indicates how many splits have occurred on the path from the root. Restricting the maximum depth of a tree controls its complexity: the higher the maximum depth, the more splits are made, and the more specific (pure) the leaf nodes are.
4. **Cost function:** for a split to be made, a feature and a feature value must be chosen, so some criterion must guide the splitting process, i.e., what the algorithm learns from the training data. For regression tasks, the mean squared error (MSE) is the most common cost function to be minimized at each split. For classification tasks, there are some alternatives, such as the Gini index, the cross-entropy, and the misclassification error rate. While the latter may be appropriate for tuning the tree, the first two are the most common choices for growing it. Both follow the same principle: the purer a node is (i.e., the less diverse it is in terms of distinct classes), the better. The Gini index can be expressed by:
\begin{equation}
  \displaystyle G(node) = \sum_{k=1}^C p_k(1 - p_k)
\end{equation}

where $p_k$ is the proportion of observations in the node that belong to class $k$, and $C$ is the number of classes. The lower $G(node)$, the purer the node, since its observations are more concentrated in one class. The cross-entropy, in turn, is given by:
\begin{equation}
  \displaystyle Entropy(node) = -\sum_{k=1}^C p_k\log(p_k)
\end{equation}
The lower the entropy (the chaos), the more concentrated the node will be in one class.
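The two impurity measures above can be checked numerically. The following is a minimal sketch, not taken from either article: the helper names (`class_proportions`, `gini`, `entropy`) are hypothetical, the node is assumed to be non-empty, and the logarithm is taken in base 2, which is an assumption since the formula above does not fix the base.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the proportion p_k of each class present in a non-empty node."""
    counts = Counter(labels)
    n = len(labels)
    return [count / n for count in counts.values()]


def gini(labels):
    """Gini index: sum over classes of p_k * (1 - p_k)."""
    return sum(p * (1 - p) for p in class_proportions(labels))


def entropy(labels):
    """Cross-entropy: sum over classes of -p_k * log2(p_k) (base 2 assumed)."""
    return sum(-p * log2(p) for p in class_proportions(labels))


pure_node = [1, 1, 1, 1]   # all observations in one class
mixed_node = [0, 0, 1, 1]  # observations evenly spread over two classes

print(gini(pure_node), entropy(pure_node))
print(gini(mixed_node), entropy(mixed_node))
```

Both measures are zero for the pure node and reach their two-class maximum (0.5 for Gini, 1 bit for entropy) for the evenly mixed node.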

Given the structure and components of a decision tree, the **tree construction** consists of a splitting process in which nodes are sequentially split until a termination criterion is met. The process of growing a tree optimizes the choice of feature and value locally: a split is chosen without considering what the best choices for subsequent splits would be. In practice, a *greedy algorithm*, known as **binary recursive splitting**, minimizes the cost function at each split. The search iterates over all available input variables and, for each candidate splitting variable, over all values that it takes in the training data (or, alternatively, over the averages of consecutive sorted values). Once the best pair of input variable and split value is found, the dataset is divided into two disjoint subsets.
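The greedy search described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the code of either article: `best_split` and its return format are hypothetical, the rule is assumed to be `feature <= threshold`, candidate thresholds are the averages of consecutive sorted values, and the cost is the size-weighted Gini impurity of the two children.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, y):
    """Greedy search over every feature and every candidate threshold."""
    best = {"feature": None, "threshold": None, "cost": float("inf")}
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: averages of consecutive sorted values
        for t in ((a + b) / 2 for a, b in zip(values, values[1:])):
            left = [yi for row, yi in zip(rows, y) if row[j] <= t]
            right = [yi for row, yi in zip(rows, y) if row[j] > t]
            # Size-weighted impurity of the two child nodes
            cost = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if cost < best["cost"]:
                best = {"feature": j, "threshold": t, "cost": cost}
    return best


# Toy data: the first feature separates the classes, the second does not
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 9.0], [8.0, 4.0], [9.0, 8.0]]
y = [0, 0, 0, 1, 1]

best = best_split(rows, y)
print(best)
```

The search only looks at the current node: it never evaluates how a choice affects later splits, which is what makes the algorithm greedy.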

Binary recursive splitting stops when at least one of the following two conditions holds for every current leaf node:
1. The depth of the node has reached a predefined value $J$, the **maximum tree depth**.
2. The number of observations that fall into the node is lower than a predefined value $n_{min}$, the **minimum node size** (or minimum samples split).

Both **hyper-parameters** control the complexity of the tree: the higher the maximum tree depth, the more splits can be made, since more levels can be created; the lower the minimum node size, the more splits can be made, since the data can be fragmented into smaller and smaller nodes. The more complex a tree is, however, the more likely it is to overfit, because the model becomes too dependent on the data it sees during fitting. Of course, a tree that is too shallow may lead to the opposite problem: underfitting, as a consequence of the weak patterns found in the data (reflected in nodes with a low degree of purity).
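The two stopping conditions can be written as a single check. The sketch below is only illustrative: `can_split` and `max_leaves` are hypothetical helpers, the root is assumed to sit at depth 0, and the hyper-parameter values are arbitrary.

```python
def can_split(depth, n_obs, max_depth, min_samples_split):
    """A node is split further only if it is shallow enough and large enough."""
    return depth < max_depth and n_obs >= min_samples_split


def max_leaves(max_depth):
    """Upper bound on the number of leaves when the root sits at depth 0."""
    return 2 ** max_depth


# Hypothetical hyper-parameter values, chosen only for illustration
max_depth, min_samples_split = 3, 50

shallow_and_large = can_split(2, 60, max_depth, min_samples_split)
too_deep = can_split(3, 60, max_depth, min_samples_split)
too_small = can_split(1, 40, max_depth, min_samples_split)

print(shallow_and_large, too_deep, too_small)
print(max_leaves(max_depth))
```

Raising the maximum depth by one level doubles the upper bound on the number of leaves, which is why this hyper-parameter has such a direct effect on model complexity.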

The functioning of decision trees is also explained by how the **prediction** occurs for a given input vector. Since binary splits sequentially create disjoint regions in the feature space, the prediction is the average value of the response variable in the region where the input vector is located. For classification tasks, the average refers to the proportion of observations in each class, which provides a probability estimate and points to the most frequent class (for a label prediction). In terms of the tree structure, an input vector moves down the tree until it reaches a terminal node: there, the average value of a numerical response variable is calculated, or the proportion of observations in each class is computed for classification tasks. At this point, it is worth highlighting one aspect of the transition from decision trees to ensemble methods (such as bagging, random forest, and gradient boosting): the probability estimate of the ensemble is not the share of trees voting for each class, but the average of the probability estimates of the individual trees.
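The difference between averaging votes and averaging probabilities can be illustrated with a small example. This is a sketch under assumptions: the three leaves are made-up data standing for the terminal nodes reached by one input vector in three different trees, and the helper names are hypothetical.

```python
from collections import Counter


def leaf_probabilities(labels, classes):
    """Class proportions among the training observations of a leaf."""
    counts = Counter(labels)
    return {c: counts.get(c, 0) / len(labels) for c in classes}


def leaf_label(probabilities):
    """Most frequent class of the leaf."""
    return max(probabilities, key=probabilities.get)


classes = [0, 1]
# Hypothetical leaves reached by the same input vector in three trees
leaves = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # proportion of class 1: 0.9
    [0, 0, 0, 1, 1],                 # proportion of class 1: 0.4
    [0, 0, 0, 1, 1],                 # proportion of class 1: 0.4
]

probs = [leaf_probabilities(leaf, classes) for leaf in leaves]

# Average of the per-tree probability estimates for class 1
avg_prob_1 = sum(p[1] for p in probs) / len(probs)
# Share of trees whose predicted label is class 1
vote_share_1 = sum(leaf_label(p) == 1 for p in probs) / len(probs)

print(round(avg_prob_1, 3), round(vote_share_1, 3))
```

Here only one tree out of three votes for class 1, yet the averaged probability for class 1 is above 0.5, so the two aggregation rules can lead to different predicted labels.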

Below are two implementations of decision trees that rely only on common Python libraries (pandas and NumPy, plus the standard library, in the first; the standard library alone in the second). The first comes from [this](https://towardsdatascience.com/decision-tree-algorithm-in-python-from-scratch-8c43f0e40173) web article, while the second comes from [another](https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/) article. The first implementation creates a Python class named "Node", which has several methods: one that calculates the Gini impurity index, another that finds the best split, and the main method, which grows the tree. The second implementation works with functions: the main one builds a tree (using a dedicated function) and returns predictions for a test set, while other functions calculate the Gini index, apply splits, and search for the best split.

The splitting process is central to both implementations. In the first, it occurs in the "grow_tree" method, where new nodes (a left and a right one) are created sequentially and new subtrees (i.e., new splits) grow from each of them. In the second, splits are made by the "split" function, which again creates nodes from previous ones. It is worth noticing that both implementations are recursive. In the first approach, the "grow_tree" method calls itself on the child nodes, which are created with the same "Node" class in which the method is defined. In the second approach, the "split" function also calls itself. This recursive nature reflects the fact that new nodes are created from splits applied to parent nodes.

Besides the object-oriented versus function-based design, the implementations differ in how they use the cost function: the first maximizes the Gini gain, i.e., the reduction in Gini impurity relative to the parent node, while the second minimizes the weighted Gini impurity of the split. Since the impurity of the parent node is the same for every candidate split of that node, both criteria rank the candidate splits in the same order. Another difference is the grid of candidate values for splitting numerical variables: the first takes the averages of consecutive sorted values, while the second tries all values observed in the training data.
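Both differences can be seen on a toy variable. The sketch below is illustrative only: the helper names and the data are hypothetical, and the rule is assumed to send observations with `x < t` to the left, as in the second implementation.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def weighted_gini(x, y, t):
    """Size-weighted Gini impurity of the split x < t versus x >= t."""
    left = [yi for xi, yi in zip(x, y) if xi < t]
    right = [yi for xi, yi in zip(x, y) if xi >= t]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)


x = [2.0, 4.0, 6.0, 8.0]
y = [0, 0, 1, 1]

values = sorted(set(x))
# Grid 1: averages of consecutive sorted values
midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]

parent = gini(y)
# Maximizing the Gini gain and minimizing the weighted Gini pick the same value
by_gain = max(midpoints, key=lambda t: parent - weighted_gini(x, y, t))
by_cost = min(midpoints, key=lambda t: weighted_gini(x, y, t))

# Grid 2: the observed values themselves
by_cost_observed = min(values, key=lambda t: weighted_gini(x, y, t))

print(by_gain, by_cost, by_cost_observed)
```

The two grids return different thresholds (5.0 and 6.0), but on this training data they induce the same partition; they only differ for new inputs that fall between the two values.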

Finally, some possible extensions to the implementations presented here should be mentioned. Categorical inputs can be used for splitting directly, without transformations. Cost functions other than the Gini impurity can be used. If entropy is used instead of the Gini index, the objective can again be stated either as a minimization (of the entropy) or as a maximization (of the information gain). Terminal nodes can be constructed for a regression task instead of a classification task, and, for a classification task, the class probabilities can be calculated together with the predicted label. Cost-complexity pruning can tune model performance by selecting the best subtree of a large grown tree.
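Two of these extensions can be sketched in the style of the second implementation, where each row is a list whose last element is the response. The functions below are hypothetical and are not part of either article: a terminal node that returns the mean (regression), a terminal node that returns class probabilities, and an MSE cost for a candidate split.

```python
def to_terminal_regression(group):
    """Terminal value for regression: mean of the response in the group."""
    outcomes = [row[-1] for row in group]
    return sum(outcomes) / len(outcomes)


def to_terminal_proba(group, classes):
    """Terminal value for classification: proportion of each class."""
    outcomes = [row[-1] for row in group]
    return {c: outcomes.count(c) / len(outcomes) for c in classes}


def mse_cost(groups):
    """MSE of a split when each group predicts its own mean."""
    n = sum(len(group) for group in groups)
    sse = 0.0
    for group in groups:
        if not group:
            continue  # an empty group contributes no error
        mean = to_terminal_regression(group)
        sse += sum((row[-1] - mean) ** 2 for row in group)
    return sse / n


# Toy groups: [feature, response]
left = [[1.0, 10.0], [2.0, 12.0]]
right = [[8.0, 30.0], [9.0, 34.0]]

print(to_terminal_regression(left), to_terminal_regression(right))
print(mse_cost([left, right]))
print(to_terminal_proba([[1.0, 0], [2.0, 0], [3.0, 1]], [0, 1]))
```

Under these assumptions, swapping the Gini-based cost for `mse_cost` and the majority-class terminal value for the mean would turn the classifier into a regression tree, while `to_terminal_proba` keeps the information needed for probability estimates.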

**References**
<br>
[Decision tree algorithm in python from scratch](https://towardsdatascience.com/decision-tree-algorithm-in-python-from-scratch-8c43f0e40173).
<br>
[Implement decision tree algorithm scratch python](https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/).
<br>
[Classification and regression trees for machine learning](https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/).
<br>
[Decision tree classifier explained in real-life: picking a vacation destination](https://towardsdatascience.com/decision-tree-classifier-explained-in-real-life-picking-a-vacation-destination-6226b2b60575).
<br>
[The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf).

----------------

The notebook first imports all the libraries needed. Then, it presents the code and demonstrates the first and second implementations.

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [First implementation](#first_implementation)<a href='#first_implementation'></a>.
3. [Second implementation](#second_implementation)<a href='#second_implementation'></a>.

<a id='libraries'></a>

## Libraries

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
cd "/content/gdrive/MyDrive/Studies/tree_based/Codes"

/content/gdrive/MyDrive/Studies/tree_based/Codes


In [None]:
import pandas as pd
import numpy as np
from collections import Counter
from random import seed, randrange
from csv import reader

<a id='first_implementation'></a>

## First implementation

This first implementation follows from this [article](https://towardsdatascience.com/decision-tree-algorithm-in-python-from-scratch-8c43f0e40173) ([Github](https://github.com/Eligijus112/decision-tree-python) page of reference).

<a id='first_implementation_codes'></a>

### Codes

In [None]:
class Node:
    """
    Class for creating the nodes for a decision tree 
    """
    def __init__(
        self, 
        Y: list,
        X: pd.DataFrame,
        min_samples_split=None,
        max_depth=None,
        depth=None,
        node_type=None,
        rule=None
    ):
        # Saving the data to the node 
        self.Y = Y
        self.X = X

        # Saving the hyper parameters
        self.min_samples_split = min_samples_split if min_samples_split else 20
        self.max_depth = max_depth if max_depth else 5

        # Default current depth of node
        self.depth = depth if depth else 0

        # Extracting all the features
        self.features = list(self.X.columns)

        # Type of node 
        self.node_type = node_type if node_type else 'root'

        # Rule for splitting
        self.rule = rule if rule else ""

        # Calculating the counts of Y in the node
        self.counts = Counter(Y)

        # Getting the GINI impurity based on the Y distribution
        self.gini_impurity = self.get_GINI()

        # Sorting the counts and saving the final prediction of the node 
        counts_sorted = list(sorted(self.counts.items(), key=lambda item: item[1]))

        # Getting the last item
        yhat = None
        if len(counts_sorted) > 0:
            yhat = counts_sorted[-1][0]

        # Saving to object attribute. This node will predict the most frequent class
        self.yhat = yhat

        # Saving the number of observations in the node 
        self.n = len(Y)

        # Initiating the left and right nodes as empty nodes
        self.left = None 
        self.right = None 

        # Default values for splits
        self.best_feature = None 
        self.best_value = None 

    @staticmethod
    def GINI_impurity(y1_count: int, y2_count: int) -> float:
        """
        Given the observations of a binary class calculate the GINI impurity
        """
        # Ensuring the correct types
        if y1_count is None:
            y1_count = 0

        if y2_count is None:
            y2_count = 0

        # Getting the total observations
        n = y1_count + y2_count
        
        # If n is 0 then we return the lowest possible gini impurity
        if n == 0:
            return 0.0

        # Getting the probability to see each of the classes
        p1 = y1_count / n
        p2 = y2_count / n
        
        # Calculating GINI 
        gini = 1 - (p1 ** 2 + p2 ** 2)
        
        # Returning the gini impurity
        return gini

    @staticmethod
    def ma(x: np.array, window: int) -> np.array:
        """
        Calculates the moving average of the given list. 
        """
        return np.convolve(x, np.ones(window), 'valid') / window

    def get_GINI(self):
        """
        Function to calculate the GINI impurity of a node 
        """
        # Getting the 0 and 1 counts
        y1_count, y2_count = self.counts.get(0, 0), self.counts.get(1, 0)

        # Getting the GINI impurity
        return self.GINI_impurity(y1_count, y2_count)

    def best_split(self) -> tuple:
        """
        Given the X features and Y targets calculates the best split 
        for a decision tree
        """
        # Creating a dataset for splitting
        df = self.X.copy()
        df['Y'] = self.Y

        # Getting the GINI impurity for the base input
        GINI_base = self.get_GINI()

        # Finding which split yields the best GINI gain 
        max_gain = 0

        # Default best feature and split
        best_feature = None
        best_value = None

        for feature in self.features:
            # Dropping missing values
            Xdf = df.dropna().sort_values(feature)

            # Sorting the values and getting the rolling average
            xmeans = self.ma(Xdf[feature].unique(), 2)

            for value in xmeans:
                # Splitting the dataset
                left_counts = Counter(Xdf[Xdf[feature]<value]['Y'])
                right_counts = Counter(Xdf[Xdf[feature]>=value]['Y'])

                # Getting the Y distribution from the dicts
                y0_left, y1_left, y0_right, y1_right = left_counts.get(0, 0), left_counts.get(1, 0), right_counts.get(0, 0), right_counts.get(1, 0)

                # Getting the left and right gini impurities
                gini_left = self.GINI_impurity(y0_left, y1_left)
                gini_right = self.GINI_impurity(y0_right, y1_right)

                # Getting the obs count from the left and the right data splits
                n_left = y0_left + y1_left
                n_right = y0_right + y1_right

                # Calculating the weights for each of the nodes
                w_left = n_left / (n_left + n_right)
                w_right = n_right / (n_left + n_right)

                # Calculating the weighted GINI impurity
                wGINI = w_left * gini_left + w_right * gini_right

                # Calculating the GINI gain 
                GINIgain = GINI_base - wGINI

                # Checking if this is the best split so far 
                if GINIgain > max_gain:
                    best_feature = feature
                    best_value = value 

                    # Setting the best gain to the current one 
                    max_gain = GINIgain

        return (best_feature, best_value)

    def grow_tree(self):
        """
        Recursive method to create the decision tree
        """
        # Making a df from the data 
        df = self.X.copy()
        df['Y'] = self.Y

        # First and second conditions to continue the tree growth from a given node:
        if (self.depth < self.max_depth) and (self.n >= self.min_samples_split):

            # Getting the best split 
            best_feature, best_value = self.best_split()

            # Third condition to continue the tree growth from a given node:
            if best_feature is not None: # If there is GINI to be gained, we split further.
                # Saving the best split to the current node 
                self.best_feature = best_feature
                self.best_value = best_value

                # Getting the left and right nodes
                left_df, right_df = df[df[best_feature]<=best_value].copy(), df[df[best_feature]>best_value].copy()

                # Creating the left and right nodes
                left = Node(
                    left_df['Y'].values.tolist(), 
                    left_df[self.features], 
                    depth=self.depth + 1, 
                    max_depth=self.max_depth, 
                    min_samples_split=self.min_samples_split, 
                    node_type='left_node',
                    rule=f"{best_feature} <= {round(best_value, 3)}"
                    )

                self.left = left 
                self.left.grow_tree()

                right = Node(
                    right_df['Y'].values.tolist(), 
                    right_df[self.features], 
                    depth=self.depth + 1, 
                    max_depth=self.max_depth, 
                    min_samples_split=self.min_samples_split,
                    node_type='right_node',
                    rule=f"{best_feature} > {round(best_value, 3)}"
                    )

                self.right = right
                self.right.grow_tree()

    def print_info(self, width=4):
        """
        Method to print the information about the tree
        """
        # Defining the number of spaces 
        const = int(self.depth * width ** 1.5)
        spaces = "-" * const
        
        if self.node_type == 'root':
            print("Root")
        else:
            print(f"|{spaces} Split rule: {self.rule}")
        print(f"{' ' * const}   | GINI impurity of the node: {round(self.gini_impurity, 2)}")
        print(f"{' ' * const}   | Class distribution in the node: {dict(self.counts)}")
        print(f"{' ' * const}   | Predicted class: {self.yhat}")   

    def print_tree(self):
        """
        Prints the whole tree from the current node to the bottom
        """
        self.print_info() 
        
        if self.left is not None: 
            self.left.print_tree()
        
        if self.right is not None:
            self.right.print_tree()

    def predict(self, X:pd.DataFrame):
        """
        Batch prediction method
        """
        predictions = []

        for _, x in X.iterrows():
            values = {}
            for feature in self.features:
                values.update({feature: x[feature]})
        
            predictions.append(self.predict_obs(values))
        
        return predictions

    def predict_obs(self, values: dict) -> int:
        """
        Method to predict the class given a set of features
        """
        cur_node = self
        # Traversing the nodes until a leaf is reached. A node is a leaf when
        # no split was saved to it (max depth reached, too few observations,
        # or no GINI gain), in which case best_feature is None
        while cur_node.best_feature is not None:
            # Same rule used in grow_tree: left if feature <= value, right otherwise
            if values.get(cur_node.best_feature) <= cur_node.best_value:
                cur_node = cur_node.left
            else:
                cur_node = cur_node.right

        return cur_node.yhat

<a id='first_implementation_demo'></a>

### Demonstration

In [None]:
# Loading the data:
d = pd.read_csv('../Datasets/train.csv')

# Dropping missing values:
dtree = d[['Survived', 'Age', 'Fare']].dropna().copy()

# Defining the X and Y matrices:
Y = dtree['Survived'].values
X = dtree[['Age', 'Fare']]

# Saving the features list:
features = list(X.columns)

In [None]:
# Hyper-parameters for growing the tree:
hp = {
 'max_depth': 3,
 'min_samples_split': 50
}

# Initializing the tree:
root = Node(Y, X, **hp)

# Growing the tree:
root.grow_tree()

In [None]:
# Assessing the constructed tree:
root.print_tree()

Root
   | GINI impurity of the node: 0.48
   | Class distribution in the node: {0: 424, 1: 290}
   | Predicted class: 0
|-------- Split rule: Fare <= 52.277
           | GINI impurity of the node: 0.44
           | Class distribution in the node: {0: 389, 1: 195}
           | Predicted class: 0
|---------------- Split rule: Fare <= 10.481
                   | GINI impurity of the node: 0.32
                   | Class distribution in the node: {0: 192, 1: 47}
                   | Predicted class: 0
|------------------------ Split rule: Age <= 32.5
                           | GINI impurity of the node: 0.37
                           | Class distribution in the node: {0: 134, 1: 43}
                           | Predicted class: 0
|------------------------ Split rule: Age > 32.5
                           | GINI impurity of the node: 0.12
                           | Class distribution in the node: {0: 58, 1: 4}
                           | Predicted class: 0
|---------------- Split rule

<a id='second_implementation'></a>

## Second implementation

The second implementation of decision trees follows from this [article](https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/).

<a id='gini_index'></a>

### Gini index

#### Codes

In [None]:
# Calculate the Gini index for a split dataset
def gini_index(groups, classes):
	# count all samples at split point
	n_instances = float(sum([len(group) for group in groups]))
	# sum weighted Gini index for each group
	gini = 0.0
	for group in groups:
		size = float(len(group))
		# avoid divide by zero
		if size == 0:
			continue
		score = 0.0
		# score the group based on the score for each class
		for class_val in classes:
			p = [row[-1] for row in group].count(class_val) / size
			score += p * p
		# weight the group score by its relative size
		gini += (1.0 - score) * (size / n_instances)
	return gini

#### Demonstration

In [None]:
# test Gini values
print(gini_index([[[1, 1], [1, 0]], [[1, 1], [1, 0]]], [0, 1]))
print(gini_index([[[1, 0], [1, 0]], [[1, 1], [1, 1]]], [0, 1]))
print(gini_index([[[1, 0], [1, 0], [1,1]], [[1, 1], [1, 1], [1,0]]], [0, 1]))
print(gini_index([[[1, 0], [1, 0], [1,1]], [[1, 1], [1, 1], [1,1]]], [0, 1]))

0.5
0.0
0.4444444444444444
0.2222222222222222


<a id='splits'></a>

### Creating splits

#### Codes

In [None]:
# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
	left, right = list(), list()
	for row in dataset:
		if row[index] < value:
			left.append(row)
		else:
			right.append(row)
	return left, right

In [None]:
# Select the best split point for a dataset
def get_split(dataset):
	class_values = list(set(row[-1] for row in dataset))
	b_index, b_value, b_score, b_groups = 999, 999, 999, None
	for index in range(len(dataset[0])-1):
		for row in dataset:
			groups = test_split(index, row[index], dataset)
			gini = gini_index(groups, class_values)
			print('X%d < %.3f Gini=%.3f' % ((index+1), row[index], gini))
			if gini < b_score:
				b_index, b_value, b_score, b_groups = index, row[index], gini, groups
	return {'index':b_index, 'value':b_value, 'groups':b_groups}

#### Demonstration

In [None]:
dataset = [[2.771244718,1.784783929,0],
	[1.728571309,1.169761413,0],
	[3.678319846,2.81281357,0],
	[3.961043357,2.61995032,0],
	[2.999208922,2.209014212,0],
	[7.497545867,3.162953546,1],
	[9.00220326,3.339047188,1],
	[7.444542326,0.476683375,1],
	[10.12493903,3.234550982,1],
	[6.642287351,3.319983761,1]]
split = get_split(dataset)
print('Split: [X%d < %.3f]' % ((split['index']+1), split['value']))

X1 < 2.771 Gini=0.444
X1 < 1.729 Gini=0.500
X1 < 3.678 Gini=0.286
X1 < 3.961 Gini=0.167
X1 < 2.999 Gini=0.375
X1 < 7.498 Gini=0.286
X1 < 9.002 Gini=0.375
X1 < 7.445 Gini=0.167
X1 < 10.125 Gini=0.444
X1 < 6.642 Gini=0.000
X2 < 1.785 Gini=0.500
X2 < 1.170 Gini=0.444
X2 < 2.813 Gini=0.320
X2 < 2.620 Gini=0.417
X2 < 2.209 Gini=0.476
X2 < 3.163 Gini=0.167
X2 < 3.339 Gini=0.444
X2 < 0.477 Gini=0.500
X2 < 3.235 Gini=0.286
X2 < 3.320 Gini=0.375
Split: [X1 < 6.642]


<a id='build_tree'></a>

### Building a tree

#### Codes

In [None]:
# Create a terminal node value
def to_terminal(group):
	outcomes = [row[-1] for row in group]
	return max(set(outcomes), key=outcomes.count)

In [None]:
# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
	left, right = node['groups']
	del(node['groups'])
	# check for a no split
	if not left or not right:
		node['left'] = node['right'] = to_terminal(left + right)
		return
	# check for max depth
	if depth >= max_depth:
		node['left'], node['right'] = to_terminal(left), to_terminal(right)
		return
	# process left child
	if len(left) <= min_size:
		node['left'] = to_terminal(left)
	else:
		node['left'] = get_split(left)
		split(node['left'], max_depth, min_size, depth+1)
	# process right child
	if len(right) <= min_size:
		node['right'] = to_terminal(right)
	else:
		node['right'] = get_split(right)
		split(node['right'], max_depth, min_size, depth+1)

In [None]:
# Build a decision tree
def build_tree(train, max_depth, min_size):
	root = get_split(train)
	split(root, max_depth, min_size, 1)
	return root

In [None]:
# Print a decision tree
def print_tree(node, depth=0):
	if isinstance(node, dict):
		print('%s[X%d < %.3f]' % ((depth*' ', (node['index']+1), node['value'])))
		print_tree(node['left'], depth+1)
		print_tree(node['right'], depth+1)
	else:
		print('%s[%s]' % ((depth*' ', node)))

#### Demonstration

In [None]:
dataset = [[2.771244718,1.784783929,0],
	[1.728571309,1.169761413,0],
	[3.678319846,2.81281357,0],
	[3.961043357,2.61995032,0],
	[2.999208922,2.209014212,0],
	[7.497545867,3.162953546,1],
	[9.00220326,3.339047188,1],
	[7.444542326,0.476683375,1],
	[10.12493903,3.234550982,1],
	[6.642287351,3.319983761,1]]
tree = build_tree(dataset, 1, 3)
print_tree(tree)

X1 < 2.771 Gini=0.444
X1 < 1.729 Gini=0.500
X1 < 3.678 Gini=0.286
X1 < 3.961 Gini=0.167
X1 < 2.999 Gini=0.375
X1 < 7.498 Gini=0.286
X1 < 9.002 Gini=0.375
X1 < 7.445 Gini=0.167
X1 < 10.125 Gini=0.444
X1 < 6.642 Gini=0.000
X2 < 1.785 Gini=0.500
X2 < 1.170 Gini=0.444
X2 < 2.813 Gini=0.320
X2 < 2.620 Gini=0.417
X2 < 2.209 Gini=0.476
X2 < 3.163 Gini=0.167
X2 < 3.339 Gini=0.444
X2 < 0.477 Gini=0.500
X2 < 3.235 Gini=0.286
X2 < 3.320 Gini=0.375
[X1 < 6.642]
 [0]
 [1]


In [None]:
tree = build_tree(dataset, 3, 3)
print_tree(tree)

X1 < 2.771 Gini=0.444
X1 < 1.729 Gini=0.500
X1 < 3.678 Gini=0.286
X1 < 3.961 Gini=0.167
X1 < 2.999 Gini=0.375
X1 < 7.498 Gini=0.286
X1 < 9.002 Gini=0.375
X1 < 7.445 Gini=0.167
X1 < 10.125 Gini=0.444
X1 < 6.642 Gini=0.000
X2 < 1.785 Gini=0.500
X2 < 1.170 Gini=0.444
X2 < 2.813 Gini=0.320
X2 < 2.620 Gini=0.417
X2 < 2.209 Gini=0.476
X2 < 3.163 Gini=0.167
X2 < 3.339 Gini=0.444
X2 < 0.477 Gini=0.500
X2 < 3.235 Gini=0.286
X2 < 3.320 Gini=0.375
X1 < 2.771 Gini=0.000
X1 < 1.729 Gini=0.000
X1 < 3.678 Gini=0.000
X1 < 3.961 Gini=0.000
X1 < 2.999 Gini=0.000
X2 < 1.785 Gini=0.000
X2 < 1.170 Gini=0.000
X2 < 2.813 Gini=0.000
X2 < 2.620 Gini=0.000
X2 < 2.209 Gini=0.000
X1 < 2.771 Gini=0.000
X1 < 3.678 Gini=0.000
X1 < 3.961 Gini=0.000
X1 < 2.999 Gini=0.000
X2 < 1.785 Gini=0.000
X2 < 2.813 Gini=0.000
X2 < 2.620 Gini=0.000
X2 < 2.209 Gini=0.000
X1 < 7.498 Gini=0.000
X1 < 9.002 Gini=0.000
X1 < 7.445 Gini=0.000
X1 < 10.125 Gini=0.000
X1 < 6.642 Gini=0.000
X2 < 3.163 Gini=0.000
X2 < 3.339 Gini=0.000
X2 < 0.4

<a id='predictions'></a>

### Predictions

#### Codes

In [None]:
# Make a prediction with a decision tree
def predict(node, row):
	if row[node['index']] < node['value']:
		if isinstance(node['left'], dict):
			return predict(node['left'], row)
		else:
			return node['left']
	else:
		if isinstance(node['right'], dict):
			return predict(node['right'], row)
		else:
			return node['right']

#### Demonstration

In [None]:
dataset = [[2.771244718,1.784783929,0],
	[1.728571309,1.169761413,0],
	[3.678319846,2.81281357,0],
	[3.961043357,2.61995032,0],
	[2.999208922,2.209014212,0],
	[7.497545867,3.162953546,1],
	[9.00220326,3.339047188,1],
	[7.444542326,0.476683375,1],
	[10.12493903,3.234550982,1],
	[6.642287351,3.319983761,1]]
 
#  predict with a stump
stump = {'index': 0, 'right': 1, 'value': 6.642287351, 'left': 0}
for row in dataset:
	prediction = predict(stump, row)
	print('Expected=%d, Got=%d' % (row[-1], prediction))

Expected=0, Got=0
Expected=0, Got=0
Expected=0, Got=0
Expected=0, Got=0
Expected=0, Got=0
Expected=1, Got=1
Expected=1, Got=1
Expected=1, Got=1
Expected=1, Got=1
Expected=1, Got=1


<a id='second_implementation_app'></a>

### Application

#### Codes

In [None]:
# Load a CSV file
def load_csv(filename):
	# Read all non-empty rows and close the file when done
	with open(filename, "rt") as file:
		dataset = [row for row in reader(file) if row]
	return dataset
 
# Convert string column to float
def str_column_to_float(dataset, column):
	for row in dataset:
		row[column] = float(row[column].strip())

In [None]:
# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
	dataset_split = list()
	dataset_copy = list(dataset)
	fold_size = int(len(dataset) / n_folds)
	for i in range(n_folds):
		fold = list()
		while len(fold) < fold_size:
			index = randrange(len(dataset_copy))
			fold.append(dataset_copy.pop(index))
		dataset_split.append(fold)
	return dataset_split
 
# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
	correct = 0
	for i in range(len(actual)):
		if actual[i] == predicted[i]:
			correct += 1
	return correct / float(len(actual)) * 100.0
 
# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
	folds = cross_validation_split(dataset, n_folds)
	scores = list()
	for fold in folds:
		train_set = list(folds)
		train_set.remove(fold)
		train_set = sum(train_set, [])
		test_set = list()
		for row in fold:
			row_copy = list(row)
			test_set.append(row_copy)
			row_copy[-1] = None
		predicted = algorithm(train_set, test_set, *args)
		actual = [row[-1] for row in fold]
		accuracy = accuracy_metric(actual, predicted)
		scores.append(accuracy)
	return scores

In [None]:
# Classification and Regression Tree Algorithm
def decision_tree(train, test, max_depth, min_size):
	tree = build_tree(train, max_depth, min_size)
	predictions = list()
	for row in test:
		prediction = predict(tree, row)
		predictions.append(prediction)
	return(predictions)

#### Demonstration

In [None]:
# load and prepare data
seed(1)
filename = '../Datasets/data_banknote_authentication.csv'
dataset = load_csv(filename)
# convert string attributes to floats
for i in range(len(dataset[0])):
	str_column_to_float(dataset, i)

# evaluate algorithm
n_folds = 5
max_depth = 5
min_size = 10
scores = evaluate_algorithm(dataset, decision_tree, n_folds, max_depth, min_size)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

Streaming output truncated to the last 5000 lines.
X3 < 1.613 Gini=0.119
X3 < -0.188 Gini=0.115
X3 < -0.822 Gini=0.114
X3 < -4.046 Gini=0.121
X3 < -2.424 Gini=0.120
X3 < 2.257 Gini=0.119
X3 < -4.031 Gini=0.122
X3 < 0.276 Gini=0.115
X3 < -3.557 Gini=0.122
X3 < 0.353 Gini=0.115
X3 < 0.661 Gini=0.116
X3 < -2.548 Gini=0.120
X3 < -1.943 Gini=0.116
X3 < 0.419 Gini=0.115
X3 < -0.588 Gini=0.115
X3 < 0.465 Gini=0.116
X3 < 2.281 Gini=0.119
X3 < 0.486 Gini=0.116
X3 < 0.706 Gini=0.117
X3 < 1.705 Gini=0.118
X3 < 0.183 Gini=0.115
X3 < -1.388 Gini=0.115
X3 < -2.260 Gini=0.118
X3 < 3.090 Gini=0.120
X3 < 6.599 Gini=0.121
X3 < 1.456 Gini=0.119
X3 < -0.793 Gini=0.115
X3 < -2.123 Gini=0.117
X3 < 6.010 Gini=0.121
X3 < -1.430 Gini=0.114
X3 < 0.757 Gini=0.117
X3 < -4.358 Gini=0.122
X3 < 0.511 Gini=0.116
X3 < -1.361 Gini=0.115
X3 < 2.033 Gini=0.118
X3 < 1.545 Gini=0.119
X3 < 1.951 Gini=0.118
X3 < 2.071 Gini=0.118
X3 < 3.994 Gini=0.120
X3 < -2.429 Gini=0.120
X3 < -1.969 Gini=0.116
X3 < 7.