##### Made by Josiah Coad, CoadBros-Tutoring
- Date: Jun 12, 2018

# Decision Trees

#### When do you use this model? 
- You have labeled data (a DT is a supervised machine learning model)
- You want a highly interpretable model (easy and intuitive to see how the computer is making decisions)
- Your problem is a classification problem (although there are also regression applications)
- You have a large data set (thousands of samples per label)
    - (A decision tree is a non-linear model, so it can capture complex, non-linear relationships in your data, but it also tends to overfit, and more data helps keep that in check)
- You need a model set up quickly
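
The overfitting tendency mentioned above is easy to see in a quick experiment. This is only a sketch on made-up data: the synthetic dataset from `make_classification`, the noise level, and the `max_depth=3` comparison are all assumptions chosen for illustration, not part of the iris walkthrough below.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data where 20% of the labels are reassigned at random (noise)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# one tree grown without limits, one tree kept shallow
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    print(name,
          "train:", round(tree.score(X_train, y_train), 3),
          "test:", round(tree.score(X_test, y_test), 3))
```

The unrestricted tree memorizes the training set, noise included, so its training accuracy says very little about how it will do on new data.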

#### Difference between random forest and decision trees
- A random forest is a collection of decision trees, each built from a random sample of the data and allowed to use only a random subset of the features.
- The feature subsets are chosen at random (scikit-learn's implementation draws a fresh random subset at every split rather than once per tree).
- As a result, each tree becomes something like an "expert" in the features it was allowed to use.
- Each tree casts a "vote", and the votes are combined into the final decision.
- Example: As the president, you want many people who are around you who all have different backgrounds to give you input on the decisions you should make.
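
To make the "experts" and "votes" idea concrete, here is a hand-rolled sketch. It is a simplification, not how scikit-learn's `RandomForestClassifier` works internally: each tree gets one fixed random pair of features and sees all the rows. The names `n_trees`, `n_features_per_tree`, and `forest` are made up for this example, and the data comes from scikit-learn's bundled copy of iris.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n_trees, n_features_per_tree = 15, 2
forest = []
for _ in range(n_trees):
    # each "expert" only gets to look at a random pair of features
    cols = rng.choice(X.shape[1], size=n_features_per_tree, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X[:, cols], y)
    forest.append((cols, tree))

# every tree votes on every sample: votes has shape (n_trees, n_samples)
votes = np.array([tree.predict(X[:, cols]) for cols, tree in forest])
# the class with the most votes wins for each sample
majority = np.array([np.bincount(sample_votes).argmax() for sample_votes in votes.T])

print("share of samples where the majority vote is right:", (majority == y).mean())
```

Note that this scores the vote on the same rows the trees were trained on, so it only illustrates the mechanics; it is not an honest accuracy estimate.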

## Visualizing a Decision Tree
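#### You will need to install the following Python packages using Anaconda or pip:
- scipy, scikit-learn (imported as sklearn), numpy, pandas
- (statistics is part of the Python standard library, so it needs no install)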

#### The tree graph file is a .dot file. I suggest either of the following ways to view the .dot:
1. Using xdot
    - brew install xdot
2. Using graphviz:
    - brew install graphviz

In [1]:
import sklearn.tree
import pandas as pd
import numpy as np
import sklearn.model_selection
import statistics

In [2]:
# Let's load the iris dataset...
iris_filename = 'iris.csv'
feature_columns = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
label_column = ["class"]
# read in the dataframe
iris_df = pd.read_csv(iris_filename)
# separate the features and labels
iris_data = iris_df[feature_columns]
# get the labels as a Series
iris_labels = iris_df[label_column].iloc[:,0]
iris_data.head()

|   | sepal-length | sepal-width | petal-length | petal-width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [3]:
dtclassifier = sklearn.tree.DecisionTreeClassifier(
    max_depth=None, max_leaf_nodes=None, min_impurity_decrease=0.0
)

##### Extra
- start with setting __min_impurity_decrease=.5__, __max_depth=5__, __max_leaf_nodes=8__ 
- see how your tree changes and how your cross-validation score changes
- you can look into additional parameters to pass to DecisionTreeClassifier [link here](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
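
One way to run that experiment is to loop over a few parameter settings and compare tree size against the cross-validation score. This is a sketch, not a tuning recipe: it loads scikit-learn's bundled copy of iris instead of `iris.csv` so it runs on its own, and the `settings` list is just an example selection.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

settings = [
    {},  # defaults: keep splitting until the leaves are pure
    {"max_depth": 5, "max_leaf_nodes": 8},
    {"max_depth": 5, "max_leaf_nodes": 8, "min_impurity_decrease": 0.5},
    {"max_depth": 1},
]

results = []
for params in settings:
    clf = DecisionTreeClassifier(random_state=0, **params)
    # mean accuracy over 10 folds (cross_val_score fits clones of clf)
    score = cross_val_score(clf, X, y, cv=10).mean()
    # fit on everything just to count how many leaves the tree ends up with
    n_leaves = clf.fit(X, y).get_n_leaves()
    results.append((params, n_leaves, score))
    print(params, "| leaves:", n_leaves, "| cv accuracy:", round(score, 3))
```

Watch how the number of leaves reacts to each setting; a tree that stops splitting too early can score much worse than one that is merely a bit shallow.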

In [4]:
# This is where the magic happens...
iris_dtree = dtclassifier.fit(iris_data, iris_labels)

In [5]:
# Now let's look at the output of fitting our model!
# class_names must follow the sorted class order the fitted model uses
sklearn.tree.export_graphviz(iris_dtree,
    feature_names=feature_columns,
    class_names=list(iris_dtree.classes_),
    out_file='iris_tree.dot')

Once you run the above code, type in your terminal (based on the install method you chose): 
1. xdot iris_tree.dot
2. dot -Tpng iris_tree.dot -o iris_tree.png && open iris_tree.png

#### In case you weren't able to do those steps, here is the output:
![Decision tree for Iris](iris_tree.png)
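
If installing xdot or graphviz is a hassle, recent versions of scikit-learn can also print a tree as plain text with `sklearn.tree.export_text`. The sketch below assumes scikit-learn 0.21 or newer, uses the bundled copy of iris rather than `iris.csv`, and limits the tree to `max_depth=2` only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# same column order as feature_columns in the notebook
feature_names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
tree_text = export_text(clf, feature_names=feature_names)
print(tree_text)
```

Each indented line is one test on a feature, and the `class:` lines are the leaves.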

In [6]:
# Now see how well your iris model performed!
statistics.mean(sklearn.model_selection.cross_val_score(iris_dtree, iris_data, iris_labels, cv=10))

0.95333333333333337

## Decision Tree Algorithm From Scratch

In [7]:
from random import seed
from random import randrange
from csv import reader
from math import sqrt

In [8]:
# Calculate the weighted Gini index for a split dataset.
# Each bucket is a list of rows, and each row stores its class label last.
# The index ranges from 0 up to 1 - 1/k for k classes (0.5 for two classes).
# A gini_index of 0 is the best: every bucket is homogeneous,
# i.e. each bucket holds data from a single class only.
def gini_index(buckets, classes):
    # count all samples at split point
    n_instances = float(sum(len(bucket) for bucket in buckets))
    # sum weighted Gini index for each bucket
    gini = 0.0
    for bucket in buckets:
        size = float(len(bucket))
        # avoid divide by zero for buckets with no data in them
        if size == 0:
            continue
        score = 0.0
        # score the bucket based on the score for each class
        for class_val in classes:
            p = [row[-1] for row in bucket].count(class_val) / size
            score += p * p
        # weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini
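
A few tiny hand-made splits show how the number behaves. The helper below, `weighted_gini`, is a compact restatement of the same formula written so this snippet runs on its own; the toy rows (one feature value plus a label `"a"` or `"b"`) are made up for illustration.

```python
def weighted_gini(buckets, classes):
    """Weighted Gini index of a split; each row's class label is its last element."""
    n_instances = sum(len(bucket) for bucket in buckets)
    gini = 0.0
    for bucket in buckets:
        if not bucket:
            continue  # an empty bucket contributes nothing
        labels = [row[-1] for row in bucket]
        # sum of squared class proportions inside this bucket
        score = sum((labels.count(c) / len(bucket)) ** 2 for c in classes)
        gini += (1.0 - score) * (len(bucket) / n_instances)
    return gini

classes = ["a", "b"]
# every bucket holds a single class
pure_split = ([[1.0, "a"], [1.2, "a"]], [[3.1, "b"], [3.3, "b"]])
# every bucket is half "a", half "b"
mixed_split = ([[1.0, "a"], [3.1, "b"]], [[1.2, "a"], [3.3, "b"]])
# left bucket is pure, right bucket is 1/3 "a" and 2/3 "b"
uneven_split = ([[1.0, "a"]], [[1.2, "a"], [3.1, "b"], [3.3, "b"]])

print(weighted_gini(pure_split, classes))    # 0.0
print(weighted_gini(mixed_split, classes))   # 0.5
print(weighted_gini(uneven_split, classes))  # about 0.333
```

For the uneven split, the right bucket has Gini 1 - (1/9 + 4/9) = 4/9 and holds 3 of the 4 rows, so the weighted result is 4/9 * 3/4 = 1/3.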

In [9]:
# Split a dataset (a list of rows) into two buckets based
# on a split point (feature, feature value).
# Return the left/right children buckets.
def test_split(splitfeature, splitvalue, dataframe):
    leftbucket, rightbucket = list(), list()
    for row in dataframe:
        if row[splitfeature] < splitvalue:
            leftbucket.append(row)
        else:
            rightbucket.append(row)
    return leftbucket, rightbucket

# Select the best split point (feature, feature value) for a dataset
# Iterate through each feature and each corresponding feature value,
# and "test" what would happen if we split the data on that split point.
# Save the split point where the "test results" AKA gini value
# is closest to 0.
def get_split(dataframe, label_column):
    # `dataframe` is a list of rows (plain lists); label_column is the
    # position of the class label, which gini_index expects to be last (-1)
    classes = list(set(row[label_column] for row in dataframe))
    # every column except the last one (the label) is a feature
    features = range(len(dataframe[0]) - 1)
    # set defaults
    best_gini, bestfeature_index, bestsample_value, best_buckets = 99999, 99999, 99999, None

    for feature in features:
        for row in dataframe:
            buckets = test_split(feature, row[feature], dataframe)
            gini = gini_index(buckets, classes)
            if gini < best_gini:
                bestfeature_index, bestsample_value, best_gini, best_buckets = feature, row[feature], gini, buckets
    return {'index':bestfeature_index, 'value':bestsample_value, 'buckets':best_buckets}

In [10]:
# Create a terminal/leaf node
def to_terminal(bucket, label_column):
    outcomes = [row[label_column] for row in bucket]
    return max(set(outcomes), key=outcomes.count)

# Create child splits for a node or make terminal
# Call recursively 
def split(node, max_depth, min_size, depth, label_column):
    left, right = node['buckets']
    del(node['buckets'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right, label_column)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left, label_column), to_terminal(right, label_column)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left, label_column)
    else:
        node['left'] = get_split(left, label_column)
        split(node['left'], max_depth, min_size, depth+1, label_column)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right, label_column)
    else:
        node['right'] = get_split(right, label_column)
        split(node['right'], max_depth, min_size, depth+1, label_column)

In [11]:
# Build a decision tree by splitting the data many times
def build_tree(dataframe, label_column, max_depth, min_size):
    root = get_split(dataframe, label_column)
    split(root, max_depth, min_size, 1, label_column)
    return root

In [12]:
# Now using our own .fit function that we made!
# build_tree expects a list of rows with the class label in the last column
iris_rows = iris_df[feature_columns + label_column].values.tolist()
sonar_dtree = build_tree(iris_rows, -1, max_depth=10, min_size=1)

In [None]:
# These packages are just to display the json in an interactive way
import uuid
from IPython.display import display_javascript, display_html, display
import json

In [None]:
# Render the tree as an interactive json object
class RenderJSON(object):
    def __init__(self, json_data):
        if isinstance(json_data, dict):
            self.json_str = json.dumps(json_data)
        else:
            self.json_str = json_data
        self.uuid = str(uuid.uuid4())

    def _ipython_display_(self):
        display_html('<div id="{}" style="height: 600px; width:100%;"></div>'.format(self.uuid), raw=True)
        display_javascript("""
        require(["https://rawgit.com/caldwell/renderjson/master/renderjson.js"], function() {
        document.getElementById('%s').appendChild(renderjson(%s))
        });
        """ % (self.uuid, self.json_str), raw=True)

RenderJSON(sonar_dtree)

In [None]:
# In case the above didn't work...
print(json.dumps(sonar_dtree, indent=2))
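
The printed tree is a nested dict, so making a prediction just means walking it from the root to a leaf. The `predict` helper below is not part of the notebook; it is a minimal sketch that assumes the node layout used above (`'index'`, `'value'`, `'left'`, `'right'`, with leaves stored as plain class labels). The `toy_tree` and its thresholds are hand-written for illustration, not taken from a fitted model.

```python
def predict(node, row):
    """Follow the splits until we land on a leaf (anything that is not a dict)."""
    while isinstance(node, dict):
        # same rule as test_split: values below the split value go left
        branch = 'left' if row[node['index']] < node['value'] else 'right'
        node = node[branch]
    return node

# hand-written toy tree with made-up thresholds
toy_tree = {
    'index': 2, 'value': 2.5,
    'left': 'setosa',
    'right': {
        'index': 3, 'value': 1.8,
        'left': 'versicolor',
        'right': 'virginica',
    },
}

print(predict(toy_tree, [5.1, 3.5, 1.4, 0.2]))
print(predict(toy_tree, [6.0, 2.9, 4.5, 1.5]))
print(predict(toy_tree, [6.5, 3.0, 5.8, 2.2]))
```

The same function should work on a tree returned by `build_tree`, since it only relies on those four keys.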

### Recap what is important
- The tree is built from the root down to its leaves/terminals by calling split recursively, which in turn calls get_split on each child node.
- At each step/node, the computer tries to get the lowest gini possible.
- The lowest gini is achieved when the left/right buckets are as homogeneous (data from only one class) as possible.
- At each node, the computer iterates through each feature and each value of that feature, splits the data at that split point (feature, feature value), and checks how good a gini index it gets.
- If left to itself, the computer would keep growing the tree until every leaf had a gini index of 0, but that would probably overfit the data, so we set a max_depth (and a min_size).
- The code from scratch was very inefficient. What is the time complexity of this algorithm? How could you improve it?
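
As a hint for that last question: for every feature, get_split re-splits and re-scores the whole dataset once per row, so a node with n rows and f features costs roughly f * n * n row visits. One common improvement is to sort the rows once per feature and sweep through them while keeping running class counts. The sketch below shows that idea for a single feature; `gini_from_counts` and `best_split_sorted` are names made up for this example, and rows are assumed to be lists with the class label last.

```python
from collections import Counter

def gini_from_counts(counts, size):
    """Gini impurity of one bucket, given its per-class counts."""
    if size == 0:
        return 0.0
    return 1.0 - sum((n / size) ** 2 for n in counts.values())

def best_split_sorted(rows, feature):
    """Best (split value, weighted gini) for one feature with a single sorted sweep.

    A row goes left when row[feature] < split value, the same rule as test_split.
    """
    ordered = sorted(rows, key=lambda row: row[feature])
    total = len(ordered)
    left_counts = Counter()
    right_counts = Counter(row[-1] for row in ordered)
    best_value, best_gini = None, float('inf')
    for i, row in enumerate(ordered):
        # only score the first row of each distinct value; at that point
        # the left bucket is exactly the i rows with a smaller value
        if i == 0 or row[feature] != ordered[i - 1][feature]:
            gini = ((i / total) * gini_from_counts(left_counts, i)
                    + ((total - i) / total) * gini_from_counts(right_counts, total - i))
            if gini < best_gini:
                best_value, best_gini = row[feature], gini
        # move this row from the right bucket to the left bucket
        left_counts[row[-1]] += 1
        right_counts[row[-1]] -= 1
    return best_value, best_gini

rows = [[1.0, 'a'], [2.0, 'a'], [3.0, 'b'], [4.0, 'b']]
print(best_split_sorted(rows, 0))
```

Sorting costs about n log n per feature and the sweep touches each row once, so the per-node work drops from quadratic in n to roughly n log n per feature.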

### What's next...
- A single decision tree has probably overfit your data.
- Try a random forest (which is an ensemble method) next to try to mitigate the overfitting!
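
As a starting point, here is a minimal sketch that puts a single tree and a random forest through the same 10-fold cross-validation. It uses scikit-learn's bundled copy of iris so it runs on its own, and `n_estimators=100` is just an example setting.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# mean accuracy over 10 folds for each model
tree_score = cross_val_score(tree, X, y, cv=10).mean()
forest_score = cross_val_score(forest, X, y, cv=10).mean()

print("single tree  :", round(tree_score, 3))
print("random forest:", round(forest_score, 3))
```

On a dataset as small and clean as iris the two scores may come out very close; the benefit of the forest tends to show up on larger, noisier data.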