# Decision Tree Classification in Python

- Easy to interpret for both practitioners and domain experts: decision trees are *white box models*.
- They can explain exactly why a specific prediction was made.
- They require very little data preparation.
    - They do not require feature scaling or centering at all.
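
The claim about feature scaling can be checked with a small sketch. It assumes threshold splits of the form `x <= threshold` and uses made-up values; `partition` and `min_max_scale` are hypothetical helpers, not part of any library. Because a linear rescaling preserves the ordering of the values, the same samples end up on the left side of the split before and after scaling.

```python
# Hypothetical toy data: one numeric feature (values are made up).
values = [4.9, 5.8, 6.1, 6.7, 7.2]

def partition(xs, threshold):
    """Return the indices of the samples that fall on the left side (x <= threshold)."""
    return [i for i, x in enumerate(xs) if x <= threshold]

def min_max_scale(xs):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

threshold = 6.0
scaled_values = min_max_scale(values)
# Apply the same linear transformation to the threshold itself.
scaled_threshold = (threshold - min(values)) / (max(values) - min(values))

left_raw = partition(values, threshold)
left_scaled = partition(scaled_values, scaled_threshold)
print(left_raw == left_scaled)  # True: the split groups the same samples
```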

Trees:
- The representation of a Classification Decision Tree is a binary tree.
- Each node can have zero, one or two child nodes.
- A node represents a single input variable and a split point on that variable, assuming the variable is numeric.
- The leaf nodes of the tree contain an output variable (y), which is used to make a prediction.
- Among the candidate splits, the one with the best (lowest) cost is selected.
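
The representation described in the list above could be sketched as follows. This is only an illustration: the `Node` class, its field names, and the `predict` helper are assumptions made for this example, and the stump at the end uses made-up values.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    """One node of a binary decision tree (hypothetical structure for illustration)."""
    feature_index: Optional[int] = None   # column tested at an internal node
    threshold: Optional[float] = None     # split point on that column
    left: Optional["Node"] = None         # subtree for rows with value <= threshold
    right: Optional["Node"] = None        # subtree for rows with value > threshold
    prediction: Optional[int] = None      # class label stored at a leaf

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def predict(node: Node, row: Sequence[float]) -> int:
    """Walk from the root to a leaf and return the leaf's class label."""
    while not node.is_leaf():
        if row[node.feature_index] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction

# A hand-built stump with made-up values: split on column 0 at 5.5.
stump = Node(feature_index=0, threshold=5.5,
             left=Node(prediction=0), right=Node(prediction=1))
print(predict(stump, [5.0, 3.4]))  # 0
print(predict(stump, [6.3, 2.9]))  # 1
```

Following the path taken by `predict` is what makes it possible to explain why a specific prediction was made.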

#### Regression:

   > The cost function that is minimized to choose split points is the sum of squared errors across all training samples that fall within the rectangle (the region of feature space defined by the split).
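
A minimal sketch of this cost, assuming each group predicts the mean of its target values. The function names and the target values are made up for illustration.

```python
def sum_squared_error(targets):
    """Sum of squared deviations from the group mean (assumed to be the prediction)."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)

def split_cost(left_targets, right_targets):
    """Cost of a candidate split: SSE of the left group plus SSE of the right group."""
    return sum_squared_error(left_targets) + sum_squared_error(right_targets)

# Made-up target values for two candidate splits of the same six samples.
good_split = split_cost([1.0, 1.2, 0.8], [5.0, 5.2, 4.8])
poor_split = split_cost([1.0, 5.0, 0.8], [1.2, 5.2, 4.8])
print(good_split < poor_split)  # True: similar targets grouped together cost less
```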
    
#### Classification:

   > The Gini cost function is used, which provides an indication of how pure the nodes are.





## Decision Tree Algorithm from Scratch

In [42]:
# Load the data: we are going to use the iris dataset
import pandas as pd
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# Use only the first two features for simplicity
X = pd.DataFrame(iris.data[:, :2], columns=['Sepal length', 'Sepal width'])
y = iris.target
# Combine X and y into a single dataset to work with throughout the exercise
dataset = X.copy()
dataset['Species'] = y

### Gini index

- The Gini index is our cost function; we are going to use it to evaluate the splits in the dataset.

- A split in the dataset involves one input attribute and one value for that attribute. It can be used to divide training patterns into two groups of rows.

- Node purity refers to how mixed the training data assigned to each node is. A node is pure (`gini = 0`) if all training instances it applies to belong to the same class.

- Gini impurity is calculated as follows:
$$G_i = 1 - \sum_{k=1}^n p_{i,k}^2,$$

> where $p_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i^{th}$ node, and $n$ is the number of classes.
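
As a quick check of the formula, here is a small sketch that computes the impurity of a single node from its class labels. The helper name `node_gini` and the label lists are made up for illustration.

```python
from collections import Counter

def node_gini(labels):
    """Gini impurity of a single node: 1 minus the sum of squared class ratios."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(node_gini([0, 0, 0, 0]))      # 0.0 -> pure node
print(node_gini([0, 0, 1, 1]))      # 0.5 -> evenly mixed, two classes
mixed = [1] * 49 + [2] * 5          # 49 instances of one class, 5 of another
print(round(node_gini(mixed), 3))   # 0.168
```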

In [25]:
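# Let's start by defining our gini_index function
def gini_index(groups, classes):
    # Count all samples at the split point
    n_instances = sum(len(group) for group in groups)
    # Sum the weighted Gini impurity of each group
    gini = 0.0
    for group in groups:
        size = len(group)
        # Skip empty groups to avoid dividing by zero
        if size == 0:
            continue
        score = 0.0
        # Score the group from the ratio of each class (the class is the last value of a row)
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # Weight the group's impurity by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini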

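# Toy examples with illustrative inputs: each row is [feature, class], with classes 0 and 1
print(gini_index([[[1, 1], [1, 0]], [[1, 1], [1, 0]]], [0, 1]))
print(gini_index([[[1, 0], [1, 0]], [[1, 1], [1, 1]]], [0, 1]))
print(gini_index([[[1, 1], [1, 0]], []], [0, 1]))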

0.5
0.0
0.5
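
The `gini_index` function above does not return the impurity of a single node. It returns the impurity of each group weighted by the share of samples that the group receives. Using symbols introduced here only for illustration ($n_g$ for the size of group $g$, $N$ for the total number of samples, and $G_g$ for the impurity of group $g$), this can be written as:

$$G_{split} = \sum_{g} \frac{n_g}{N} \, G_g$$

As a worked example, take two groups of two rows each, where every group contains one row of each class. Each group has $G_g = 1 - (0.5^2 + 0.5^2) = 0.5$, so:

$$G_{split} = \frac{2}{4} \cdot 0.5 + \frac{2}{4} \cdot 0.5 = 0.5$$

If instead each group contains rows of a single class, both groups have $G_g = 0$ and the split cost is $0$, which is the best possible value.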


In [44]:
def split_dataset(attribute, value, dataset):
    # String values split on equality; any other value is treated as a numeric threshold
    if isinstance(value, str):
        left = dataset[dataset[attribute] == value]
        right = dataset[dataset[attribute] != value]
    else:
        left = dataset[dataset[attribute] <= value]
        right = dataset[dataset[attribute] > value]
    return left, right

In [52]:
# Example of a split
print('Dataset variables:', list(dataset.columns[:2]))
print('Dataset shape:', dataset.shape)
# Let's split the dataset
variable_ = 'Sepal length'
value_ = 6
left, right = split_dataset(attribute=variable_, value=value_, dataset=dataset)
print(f'Num. of observations with "{variable_}" <= {value_}: {len(left)}')
print(f'Num. of observations with "{variable_}" > {value_}: {len(right)}')

Dataset variables: ['Sepal length', 'Sepal width']
Dataset shape: (150, 3)
Num. of observations with "Sepal length" <= 6: 89
Num. of observations with "Sepal length" > 6: 61
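
Combining the two pieces, selecting the split with the lowest cost could be sketched as an exhaustive search over every attribute and every observed value. This is a simplified illustration that works on plain lists of rows rather than on the DataFrame used above. The `best_split` function and the sample rows are assumptions made for this example, and `gini_index` is repeated in compact form so the block runs on its own.

```python
def gini_index(groups, classes):
    """Weighted Gini impurity of a split; each row stores its class in the last position."""
    n_instances = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        size = len(group)
        if size == 0:
            continue
        labels = [row[-1] for row in group]
        score = sum((labels.count(c) / size) ** 2 for c in classes)
        gini += (1.0 - score) * (size / n_instances)
    return gini

def best_split(rows):
    """Try every (feature, value) pair and keep the one with the lowest Gini index."""
    classes = sorted({row[-1] for row in rows})
    best = {"feature": None, "value": None, "gini": float("inf")}
    n_features = len(rows[0]) - 1
    for feature in range(n_features):
        for candidate in rows:
            value = candidate[feature]
            left = [r for r in rows if r[feature] <= value]
            right = [r for r in rows if r[feature] > value]
            cost = gini_index((left, right), classes)
            if cost < best["gini"]:
                best = {"feature": feature, "value": value, "gini": cost}
    return best

# Made-up rows: [feature_0, feature_1, class]
rows = [[2.0, 7.1, 0], [3.0, 6.5, 0], [4.0, 7.3, 0],
        [6.0, 6.8, 1], [7.0, 7.0, 1], [8.0, 6.6, 1]]
print(best_split(rows))  # {'feature': 0, 'value': 4.0, 'gini': 0.0}
```

The search costs one pass over the data per candidate value, which is fine for a small dataset like iris but would need a smarter strategy (for example, sorting each feature once) on larger data.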
