##### Made by Josiah Coad, CoadBros-Tutoring
- Date: Jun 12, 2018
- Notes to self:
    - Look more into how this can be used for feature engineering? regression? handle missing values?
    - change sonar_dataset to something else

# Decision Trees

#### When do you use this model? 
- You have labeled data (a DT is a supervised machine learning model)
- You want a highly interpretable model (easy and intuitive to see how the computer is making decisions)
- Your problem is a classification problem (although there are also regression applications)
- You have a large data set (thousands of samples per label)
    - (A decision tree is a non-linear algorithm which means that it can capture complex relationships and non-linearities in your data but has a tendency to overfit your data)
- You need a model set up quickly

#### Difference between random forest and decision trees
- A random forest is a collection of decision trees where each tree builds from a random subset of the features in the dataset. (In common implementations such as scikit-learn's, each tree also trains on a random bootstrap sample of the rows, and the random feature subset is drawn again at every split.)
- The subset of features for each tree is chosen at random.
- As a result, each tree acts like an "expert" in the features it was allowed to use.
- Each tree casts a "vote", and the votes are combined into the final decision.
- Example: As the president, you want many people around you who all have different backgrounds to give you input on the decisions you should make.
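
To make the comparison concrete, here is a minimal sketch that cross-validates a single tree and a forest on the iris data. The choices of `n_estimators=100`, `cv=10`, and `random_state=0` are assumptions made for illustration (the fixed seed only keeps repeated runs identical); your numbers will depend on your data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# random_state is fixed only so that repeated runs give the same numbers
single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold cross-validation for each model on the same data
tree_scores = cross_val_score(single_tree, iris.data, iris.target, cv=10)
forest_scores = cross_val_score(forest, iris.data, iris.target, cv=10)

print("single tree mean accuracy:", tree_scores.mean())
print("random forest mean accuracy:", forest_scores.mean())
```

On a small, easy dataset like iris the two may score about the same; the difference tends to matter more on noisier data with many features.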

## Visualizing a Decision Tree
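#### You will need to install the following python packages using Anaconda or pip:
- scipy, scikit-learn (imported as `sklearn`), numpy, pandas
- `statistics` is part of the Python standard library, so it does not need to be installed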

#### The tree graph file is a .dot file. I suggest either of the following ways to view the .dot:
1. Using xdot
    - brew install xdot
2. Using graphviz:
    - brew install graphviz

In [1]:
from sklearn.datasets import load_iris
import sklearn.tree
import pandas as pd
import numpy as np
import sklearn.model_selection
import statistics

###### Extra
- look into additional parameters to pass to DecisionTreeClassifier [link here](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
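- start with adding __max_depth=5__ or __min_impurity_decrease=.1__ as a parameter (older scikit-learn versions also accepted __min_impurity_split__, but it was removed in version 1.0)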
- see how your tree changes and how your cross_validation changes
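
As a starting point for that experiment, here is a small sketch that tries a few `max_depth` values on the iris data and prints the cross-validated accuracy for each. The particular depths and `random_state=0` are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

depth_scores = {}
# None means no depth limit: the tree keeps splitting until the leaves are pure
# (or too small to split further)
for depth in [1, 2, 3, 5, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, iris.data, iris.target, cv=10)
    depth_scores[depth] = scores.mean()
    print("max_depth =", depth, "-> mean accuracy", round(scores.mean(), 3))
```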

In [2]:
dtclassifier = sklearn.tree.DecisionTreeClassifier()

In [3]:
# first an example using the iris dataset
iris = load_iris()
iris_dtree = dtclassifier.fit(iris.data, iris.target)

In [4]:
sklearn.tree.export_graphviz(iris_dtree,
    feature_names=["Sepal Length", "Sepal Width", "Petal Length","Petal Width"],
    class_names=["Setosa", "Versicolour", "Virginica"],
    out_file='iris_tree.dot')

Once you run the above code, type in your terminal (based on the install method you chose): 
1. xdot iris_tree.dot
2. dot -Tpng iris_tree.dot -o iris_tree.png && open iris_tree.png

#### In case you weren't able to do those steps, here is the output:
![Decision tree for Iris](iris_tree.png)
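
If installing a .dot viewer is not an option, one possible alternative is a plain-text dump of the tree. This sketch assumes scikit-learn 0.21 or newer, where `sklearn.tree.export_text` is available; `max_depth=2` and `random_state=0` are only there to keep the printed tree short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text returns the tree's rules as an indented string
tree_text = export_text(
    clf,
    feature_names=["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"],
)
print(tree_text)
```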

In [5]:
# Now see how well your iris model performs!
statistics.mean(sklearn.model_selection.cross_val_score(iris_dtree, iris.data, iris.target, cv=10))

0.95333333333333337

In [6]:
# In case you wanted an example using pandas
sonar_filename = 'sonar.all-data.csv'
# sonar dataset doesn't have headers but if yours does, remove header=None
sonar_df = pd.read_csv(sonar_filename, header=None)
# label index in sonar dataset is column 60
# if using your own dataset, use the column index or column name of the label column
labelIndex = 60
sonar_data = sonar_df
sonar_labels = sonar_data.pop(labelIndex)
# Make sure none of the data types are "objects"
print(*sonar_data.dtypes, sep=", ")
sonar_data.head()

float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64, float64


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.02 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0232 | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.018 | 0.0084 | 0.009 | 0.0032 |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0125 | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.014 | 0.0049 | 0.0052 | 0.0044 |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.228 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0033 | 0.0232 | 0.0166 | 0.0095 | 0.018 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 |
| 3 | 0.01 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 | ... | 0.0241 | 0.0121 | 0.0036 | 0.015 | 0.0085 | 0.0073 | 0.005 | 0.0044 | 0.004 | 0.0117 |
| 4 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.059 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 | ... | 0.0156 | 0.0031 | 0.0054 | 0.0105 | 0.011 | 0.0015 | 0.0072 | 0.0048 | 0.0107 | 0.0094 |


In [7]:
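# Note: fit returns dtclassifier itself, so sonar_dtree and iris_dtree
# now refer to the same object, refit on the sonar data
sonar_dtree = dtclassifier.fit(sonar_data, sonar_labels)
# Now see how well your sonar model performs!
# (cross_val_score refits a fresh copy of the model on each fold)
statistics.mean(sklearn.model_selection.cross_val_score(sonar_dtree, sonar_data, sonar_labels, cv=10))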

0.59192640692640697

In [8]:
# check out the tree picture after running this code, following the same steps as before
# class_names must follow the sorted order of the labels: 'M' (metal) sorts before 'R' (rock)
sklearn.tree.export_graphviz(sonar_dtree,
    class_names=["Metal", "Rock"],
    out_file='sonar_tree.dot')

## Naive Decision Tree Algorithm From Scratch

In [9]:
from random import seed
from random import randrange
from csv import reader
from math import sqrt

In [10]:
# Calculate the Gini index for a split dataset
# gini_index ranges from 0 up to 1 - 1/k for k classes (so at most 0.5 for two classes)
# A gini_index of 0 is the best, meaning that each bucket
# has homogeneous data: every row in a bucket belongs
# to a single class and no bucket mixes 2 or more classes.
def gini_index(buckets, classes):
    # count all samples at split point
    n_instances = float(sum([len(bucket) for bucket in buckets]))
    # sum weighted Gini index for each bucket
    gini = 0.0
    for bucket in buckets:
        size = float(len(bucket))
        # avoid divide by zero for buckets with no data in them
        if size == 0:
            continue
        score = 0.0
        # score the bucket based on the score for each class
        for class_val in classes:
            p = [row[-1] for row in bucket].count(class_val) / size
            score += p * p
        # weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini
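
To see the two extremes, here is a small worked example. It repeats a compact version of `gini_index` so it can run on its own; the toy rows (`[feature value, class label]`) are made up for illustration.

```python
def gini_index(buckets, classes):
    # total number of rows across all buckets
    n_instances = float(sum(len(bucket) for bucket in buckets))
    gini = 0.0
    for bucket in buckets:
        size = float(len(bucket))
        if size == 0:
            continue  # an empty bucket contributes nothing
        labels = [row[-1] for row in bucket]
        # sum of squared class proportions inside this bucket
        score = sum((labels.count(c) / size) ** 2 for c in classes)
        # weight the bucket's impurity by its share of the rows
        gini += (1.0 - score) * (size / n_instances)
    return gini

# each row is [feature value, class label]
pure_split = [[[1, 0], [2, 0]], [[8, 1], [9, 1]]]   # every bucket holds one class
mixed_split = [[[1, 0], [8, 1]], [[2, 0], [9, 1]]]  # every bucket is half and half

pure_gini = gini_index(pure_split, [0, 1])
mixed_gini = gini_index(mixed_split, [0, 1])
print(pure_gini)   # 0.0
print(mixed_gini)  # 0.5
```

With two classes, 0.5 is the worst possible value: each bucket is an even mix, so the split tells you nothing about the label.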

In [11]:
# Split a dataset into two buckets based
# on a split point (feature, feature value).
# Return the left/right children buckets.
def test_split(splitfeature_index, splitvalue, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[splitfeature_index] < splitvalue:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Select the best split point (feature, feature value) for a dataset
# Iterate through each feature and each corresponding feature value,
# and "test" what would happen if we split on that split point.
# save the split point where the "test results" AKA gini value
# is closest to 0.
def get_split(dataset):
    # Note here that the class label is assumed to be the last column of the dataset 
    class_values = list(set(row[-1] for row in dataset))
    best_gini, best_feature_index, best_value, best_groups = float('inf'), None, None, None
    features = range(len(dataset[0]) - 1)
    for index in features:
        for row in dataset:
            # try every observed value of this feature as a candidate threshold
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < best_gini:
                best_feature_index, best_value, best_gini, best_groups = index, row[index], gini, groups
    return {'index': best_feature_index, 'value': best_value, 'groups': best_groups}

In [12]:
# Create a terminal/leaf node
def to_terminal(bucket):
    outcomes = [row[-1] for row in bucket]
    return max(set(outcomes), key=outcomes.count)

# Create child splits for a node or make terminal
# Call recursively 
def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth+1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth+1)

In [13]:
# Build a decision tree by splitting the data many times
def build_tree(train, max_depth, min_size):
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root
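
The notebook builds the tree but does not show how to classify a new row with it. Here is a minimal sketch of that step; `predict` and the hand-written `toy_tree` are hypothetical names added for illustration, and the tree only mimics the dictionary layout that `build_tree` produces (`index`, `value`, `left`, `right`).

```python
def predict(node, row):
    """Walk the tree from this node down and return the label for one row."""
    # same rule as test_split: values below the threshold go left
    branch = 'left' if row[node['index']] < node['value'] else 'right'
    child = node[branch]
    if isinstance(child, dict):
        # the child is another decision node, so keep walking
        return predict(child, row)
    # otherwise the child is a terminal: the class label itself
    return child

# hand-written tree in the same shape that build_tree returns
toy_tree = {
    'index': 0, 'value': 5.0,
    'left': 'R',
    'right': {'index': 1, 'value': 2.0, 'left': 'R', 'right': 'M'},
}

print(predict(toy_tree, [3.0, 9.9]))  # R
print(predict(toy_tree, [7.0, 1.0]))  # R
print(predict(toy_tree, [7.0, 4.0]))  # M
```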

In [14]:
# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

In [15]:
# load sonar data
sonar_filename = 'sonar.all-data.csv'
sonar_dataset = load_csv(sonar_filename)
# Note here that the class label is assumed to be in the last column of the dataset 
sonar_dtree = build_tree(sonar_dataset, max_depth=10, min_size=1)
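
One thing to watch: `load_csv` leaves every cell as a string, so `test_split` compares strings rather than numbers (the tree printed further down shows values such as "0.1989" in quotes). That is only safe when every value shares the same fixed format. A possible fix is to convert the feature columns to floats before building the tree; `columns_to_float` below is a hypothetical helper, not part of the original notebook, and it assumes the label sits in the last column.

```python
def columns_to_float(dataset, label_index=-1):
    """Return a copy of the dataset with every non-label cell converted to float."""
    converted = []
    for row in dataset:
        new_row = list(row)
        # translate a negative label index (like -1) to its real position
        label_pos = label_index % len(new_row)
        for i, cell in enumerate(new_row):
            if i == label_pos:
                continue  # leave the class label untouched
            new_row[i] = float(cell.strip())
        converted.append(new_row)
    return converted

# made-up rows in the same shape load_csv returns: strings everywhere
rows = [["0.0200", "10.5", "R"], ["0.0453", "9.25", "M"]]
numeric_rows = columns_to_float(rows)
print(numeric_rows)

# why it matters: strings compare character by character
print("10.5" < "9.25", 10.5 < 9.25)  # True False
```

Note that running the from-scratch tree on converted data would print the split values as numbers rather than quoted strings.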

In [16]:
# These packages are only used to display the json in an interactive way
import uuid
from IPython.display import display_javascript, display_html, display
import json

In [17]:
# Render the tree as an interactive json object
class RenderJSON(object):
    def __init__(self, json_data):
        if isinstance(json_data, dict):
            self.json_str = json.dumps(json_data)
        else:
            self.json_str = json_data
        self.uuid = str(uuid.uuid4())

    def _ipython_display_(self):
        display_html('<div id="{}" style="height: 600px; width:100%;"></div>'.format(self.uuid), raw=True)
        display_javascript("""
        require(["https://rawgit.com/caldwell/renderjson/master/renderjson.js"], function() {
        document.getElementById('%s').appendChild(renderjson(%s))
        });
        """ % (self.uuid, self.json_str), raw=True)

RenderJSON(sonar_dtree)

In [18]:
# In case the above didn't work...
print(json.dumps(sonar_dtree, indent=2))

{
  "index": 10,
  "value": "0.1989",
  "left": {
    "index": 3,
    "value": "0.0525",
    "left": {
      "index": 0,
      "value": "0.0412",
      "left": {
        "index": 27,
        "value": "0.9956",
        "left": {
          "index": 2,
          "value": "0.0024",
          "left": "M",
          "right": {
            "index": 27,
            "value": "0.0832",
            "left": "M",
            "right": {
              "index": 43,
              "value": "0.0337",
              "left": "M",
              "right": {
                "index": 0,
                "value": "0.0200",
                "left": {
                  "index": 0,
                  "value": "0.0100",
                  "left": {
                    "index": 0,
                    "value": "0.0039",
                    "left": "R",
                    "right": "R"
                  },
                  "right": {
                    "index": 0,
                    "value": "0.0100",
                   

### Recap what is important
- The tree is built from the root down to its leaves/terminals: split calls itself recursively on each child node, and get_split picks the split point for each one.
- At each step/node, the computer is trying to get the lowest gini possible.
- The lowest gini is achieved when the left/right buckets are as homogeneous (data from only one class) as possible.
- At each node, the computer iterates through each feature and each value corresponding to that feature, tries to split the data at that split point (feature, feature value), and sees how good of a gini index it can get.
- If left to itself, the computer would grow the tree until every leaf had a gini index of 0 (or no further split was possible)... but that would probably be overfitting the data, so we set a max_depth.
- The code from scratch was very inefficient. What is the time complexity of this algorithm? How could you improve it?

### What's next...
- A decision tree has probably overfit your data.
- Try a random forest (which is an ensemble method) next to try to mitigate the overfitting!