# DSCI6003 Practicum I: Random Forests

Your study of tree classifiers begins with random forests. 

## Implement Decision Trees

In order to build a random forest you must first master building decision trees.

1. If you have not yet completed working code for decision trees, start by getting a complete implementation working from the annotated code stubs `DecisionTree.py` and `TreeNode.py` provided in the `/code` directory.

2. Use the `run_decision_tree.py` and `test_decision_tree.py` code stubs from the command line to check that your implementation is correct. Use PyCharm or Sublime Text as your development environment.

3. Once your tree produces correct results, continue with the `RandomForest.py` stub, discussed below.

4. You can check the performance of both your trees and your forest against the executable provided in the practicum directory.

import numpy as np
import math
from collections import Counter
from TreeNode import TreeNode


class DecisionTree(object):
    '''
    A decision tree class.
    '''

    def __init__(self, impurity_criterion='entropy'):
        '''
        Initialize an empty DecisionTree.
        '''

        self.root = None  # root Node
        self.feature_names = None  # string names of features (for interpreting
                                   # the tree)
        self.categorical = None  # Boolean array of whether variable is
                                 # categorical (or continuous)
        self.impurity_criterion = self._entropy \
                                  if impurity_criterion == 'entropy' \
                                  else self._gini

    def fit(self, X, y, feature_names=None):
        '''
        INPUT:
            - X: 2d numpy array
            - y: 1d numpy array
            - feature_names: numpy array of strings
        OUTPUT: None
        Build the decision tree.
        X is a 2 dimensional array with each column being a feature and each
        row a data point.
        y is a 1 dimensional array with each value being the corresponding
        label.
        feature_names is an optional list containing the names of each of the
        features.
        '''


        # This piece of code is used to provide feature names to the Decision tree
        if feature_names is None or len(feature_names) != X.shape[1]:
            # if the user has not provided feature names, just give them numbers
            self.feature_names = np.arange(X.shape[1])
        else:
            # otherwise, these are the names
            self.feature_names = feature_names
            
        

        # * Create True/False array of whether the variable is categorical
        # use a lambda function called is_categorical to determine if the variable is an instance
        # of str, bool or unicode - in that case is_categorical will be true
        # otherwise False. Look up the function isinstance()

        #is_categorical = lambda x: ?

        # Each variable (organized by index) is given a label categorical or not
        self.categorical = np.vectorize(is_categorical)(X[0])

        # Call the build_tree function
        self.root = self._build_tree(X, y)

    def _build_tree(self, X, y):
        '''
        INPUT:
            - X: 2d numpy array
            - y: 1d numpy array
        OUTPUT:
            - TreeNode
        Recursively build the decision tree. Return the root node.
        '''

        #  * initialize a root TreeNode

        # * set index, value, splits as the output of self._choose_split_index(X,y)

        # if no index is returned from the split index or we cannot split
        if index is None or len(np.unique(y)) == 1:
            # * set the node to be a leaf

            # * set the classes attribute to the number of classes
            # * we have in this leaf with Counter()

            # * set the name of the node to be the most common class in it

        else:  # otherwise we can split (again, this comes out of _choose_split_index)
            # * set X1, y1, X2, y2 to be the splits

            # * the node column should be set to the index coming from split_index

            # * the node name is the feature name as determined by
            #   the index (column name)

            # * set the node value to be the value of the split

            # * set the categorical flag of the node to be the category of the column

            # * now continue recursing down both branches of the split

        return node

    def _entropy(self, y):
        '''
        INPUT:
            - y: 1d numpy array
        OUTPUT:
            - float
        Return the entropy of the array y.
        '''

        total = 0
        # * for each unique class C in y
            # * count up the number of times the class C appears and divide by
            # * the total length of y. This is the p(C)
            # * add the entropy p(C) ln p(C) to the total
        return -total

    def _gini(self, y):
        '''
        INPUT:
            - y: 1d numpy array
        OUTPUT:
            - float
        Return the gini impurity of the array y.
        '''

        total = 0
        # * for each unique class C in y
            # * count up the number of times the class C appears and divide by
            # * the size of y. This is the p(C)
            # * add p(C)**2 to the total
        return 1 - total

    def _make_split(self, X, y, split_index, split_value):
        '''
        INPUT:
            - X: 2d numpy array
            - y: 1d numpy array
            - split_index: int (index of feature)
            - split_value: int/float/bool/str (value of feature)
        OUTPUT:
            - X1: 2d numpy array (feature matrix for subset 1)
            - y1: 1d numpy array (labels for subset 1)
            - X2: 2d numpy array (feature matrix for subset 2)
            - y2: 1d numpy array (labels for subset 2)
        Return the two subsets of the dataset achieved by the given feature and
        value to split on.
        Call the method like this:
        X1, y1, X2, y2 = self._make_split(X, y, split_index, split_value)
        X1, y1 is a subset of the data.
        X2, y2 is the other subset of the data.
        '''

        # * slice the split column from X with the split_index
        # * if the variable of this column is categorical
            # * select the indices of the rows in the column
            #   with the split_value (T/F) into one set of indices (call them A)
            # * select the indices of the rows in the column
            #   that don't have the split_value into another
            #   set of indices (call them B)
        # * else if the variable is not categorical
            # * select the indices of the rows in the column
            #   less than the split value into one set of indices (call them A)
            # * select the indices of the rows in the column
            #   greater than or equal to the split value into
            #   another set of indices (call them B)
        return X[A], y[A], X[B], y[B]

    def _information_gain(self, y, y1, y2):
        '''
        INPUT:
            - y: 1d numpy array
            - y1: 1d numpy array (labels for subset 1)
            - y2: 1d numpy array (labels for subset 2)
        OUTPUT:
            - float
        Return the information gain of making the given split.
        Use self.impurity_criterion(y) rather than calling _entropy or _gini
        directly.
        '''
        # * set total equal to the impurity_criterion

        # * for each of the possible splits y1 and y2
            # * calculate the impurity_criterion of the split
            # * subtract this value from the total, multiplied by split_size/y_size
        return total

    def _choose_split_index(self, X, y):
        '''
        INPUT:
            - X: 2d numpy array
            - y: 1d numpy array
        OUTPUT:
            - index: int (index of feature)
            - value: int/float/bool/str (value of feature)
            - splits: (2d array, 1d array, 2d array, 1d array)
        Determine which feature and value to split on. Return the index and
        value of the optimal split along with the split of the dataset.
        Return None, None, None if there is no split which improves information
        gain.
        Call the method like this:
        index, value, splits = self._choose_split_index(X, y)
        X1, y1, X2, y2 = splits
        '''

        # set these initial variables to None
        split_index, split_value, splits = None, None, None
        # we need to keep track of the maximum entropic gain
        max_gain = 0

        # * for each column in X
            # * set an array called values to be the
            # unique values in that column (use np.unique)

            # if there are fewer than 2 values, move on to the next column
            if len(values) < 2:
                continue

            # * for each value V in the values array

                # * make a temporary split (using the column index and V) with make_split

                # * calculate the information gain between the original y, y1 and y2

                # * if this gain is greater than the max_gain
                    # * set max_gain, split_index, and split_value to be equal
                    # to the current gain, column and value

                    # * set the output splits to the current split setup (X1, y1, X2, y2)
        return split_index, split_value, splits

    def predict(self, X):
        '''
        INPUT:
            - X: 2d numpy array
        OUTPUT:
            - y: 1d numpy array
        Return an array of predictions for the feature matrix X.
        '''

        return np.apply_along_axis(self.root.predict_one, axis=1, arr=X)

    def __str__(self):
        '''
        Return a string representation of the Decision Tree. This allows you
        to call print(tree).
        '''
        return str(self.root)

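Note: run as-is, the stub above raises `IndentationError: expected an indented block`, because several blocks (for example the `if`/`else` branches in `_build_tree`) contain only comments until you fill in the starred steps.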
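
As a reference point for the `_entropy`, `_gini`, and `_information_gain` stubs, here is a minimal sketch of the three quantities their comments describe: entropy as $-\sum_C p(C)\ln p(C)$, Gini impurity as $1 - \sum_C p(C)^2$, and information gain as the parent impurity minus the size-weighted impurity of the two children. It assumes standalone functions rather than methods; the names `entropy`, `gini`, and `information_gain` are illustrative and are not part of the provided stubs.

```python
import math
from collections import Counter


def entropy(y):
    """Shannon entropy (natural log) of a sequence of class labels."""
    n = float(len(y))  # float so the division also works under Python 2
    return -sum((c / n) * math.log(c / n) for c in Counter(y).values())


def gini(y):
    """Gini impurity of a sequence of class labels."""
    n = float(len(y))
    return 1 - sum((c / n) ** 2 for c in Counter(y).values())


def information_gain(y, y1, y2, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = float(len(y))
    children = sum(len(part) / n * impurity(part) for part in (y1, y2))
    return impurity(y) - children
```

A split that separates the classes perfectly recovers the full parent impurity as gain, while a split that leaves both children with the parent's class mix has zero gain.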

from collections import Counter
import numpy as np

class TreeNode(object):
    '''
    A node class for a decision tree.
    '''
    def __init__(self):
        self.column = None  # (int)    index of feature to split on
        self.value = None  # value of the feature to split on
        self.categorical = True  # (bool) whether or not node is split on
                                 # categorical feature
        self.name = None    # (string) name of feature (or name of class in the
                            #          case of a leaf)
        self.left = None    # (TreeNode) left child
        self.right = None   # (TreeNode) right child
        self.leaf = False   # (bool)   true if node is a leaf, false otherwise
        self.classes = Counter()  # (Counter) only necessary for leaf node:
                                  #           key is class name and value is
                                  #           count of data points
                                  #           that terminate at this leaf

    def predict_one(self, x):
        '''
        INPUT:
            - x: 1d numpy array (single data point)
        OUTPUT:
            - y: predicted label
        Return the predicted label for a single data point.
        '''
        if self.leaf:
            return self.name
        col_value = x[self.column]

        if self.categorical:
            if col_value == self.value:
                return self.left.predict_one(x)
            else:
                return self.right.predict_one(x)
        else:
            if col_value < self.value:
                return self.left.predict_one(x)
            else:
                return self.right.predict_one(x)

    # This is for visualizing your tree. You don't need to look into this code.
    def as_string(self, level=0, prefix=""):
        '''
        INPUT:
            - level: int (amount to indent)
            - prefix: str (to start the line with)
        OUTPUT:
            - str
        Return a string representation of the tree rooted at this node.
        '''
        result = ""
        if prefix:
            indent = "  |   " * (level - 1) + "  |-> "
            result += indent + prefix + "\n"
        indent = "  |   " * level
        result += indent + "  " + str(self.name) + "\n"
        if not self.leaf:
            if self.categorical:
                left_key = str(self.value)
                right_key = "no " + str(self.value)
            else:
                left_key = "< " + str(self.value)
                right_key = ">= " + str(self.value)
            result += self.left.as_string(level + 1, left_key + ":")
            result += self.right.as_string(level + 1, right_key + ":")
        return result

    def __repr__(self):
        return self.as_string().strip()
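
The branching rule in `predict_one` (equality for a categorical feature, `<` for a continuous one) is the same rule the `_make_split` stub describes for training. Below is a minimal sketch of that rule using boolean masks. It assumes a standalone function with an explicit `categorical` flag, which is a simplification for illustration; the method in the stub would look the flag up from `self.categorical` instead.

```python
import numpy as np


def make_split(X, y, split_index, split_value, categorical):
    """Split (X, y) into two subsets on one feature and one value."""
    column = X[:, split_index]
    if categorical:
        # subset A: rows equal to the split value
        mask = np.asarray(column == split_value, dtype=bool)
    else:
        # subset A: rows strictly less than the split value
        mask = np.asarray(column < split_value, dtype=bool)
    # subset B is everything not in A
    return X[mask], y[mask], X[~mask], y[~mask]
```

Because subset A uses the same comparison as the left branch of `predict_one`, data that went left during training also goes left at prediction time.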

## Build a Random Forest

You will be using our implementation of Decision Trees to implement a Random Forest.

You can use the `DecisionTree` class from `DecisionTree.py` with the following code:

```python
dt = DecisionTree()
dt.fit(X_train, y_train)
predicted_y = dt.predict(X_test)
```

You can also visualize a Decision Tree by printing it. This may be helpful for understanding your Random Forest.

```python
print(dt)
```

While you're getting your code to work, use the play golf data set that we used for implementing Decision Trees.

There's a file called `RandomForest.py` which contains a skeleton of the code. Your goal is to fill it in so that you can run it with the following lines of code:

```python
from RandomForest import RandomForest
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

df = pd.read_csv('data/playgolf.csv')
y = df.pop('Result').values
X = df.values
X_train, X_test, y_train, y_test = train_test_split(X, y)

rf = RandomForest(num_trees=10, num_features=5)
rf.fit(X_train, y_train)
y_predict = rf.predict(X_test)
print("score: {}".format(rf.score(X_test, y_test)))
```

### A. Implement *Tree Bagging*

Bagging, or *bootstrap aggregating*, is taking several random samples *with replacement* from the data set and building a model for each sample. Each of these models gets a vote on the prediction.

Sampling with replacement means that we can repeat data points. In the basic random forest, we will always use a sample size that is the same as the size of the original data set. Many data points will not be included in each sample and many will be repeated.

1. Implement the `build_forest` method. For now, ignore the `num_features` parameter. Here is the pseudocode:

      Repeat num_trees times:
          Create a random sample of the data with replacement
          Build a decision tree with that sample
      Return the list of the decision trees created
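
The "random sample with replacement" step in the pseudocode above can be sketched as drawing `n` row indices uniformly from `0..n-1`. This is a minimal sketch, assuming `X` and `y` are numpy arrays; the helper name `bootstrap_sample` and the choice to also return the indices are illustrative, not part of the `RandomForest.py` skeleton.

```python
import numpy as np


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows uniformly at random, with replacement."""
    n = X.shape[0]
    idx = rng.randint(0, n, size=n)  # indices may repeat
    # the same indices select rows of X and entries of y, so they stay aligned
    return X[idx], y[idx], idx


rng = np.random.RandomState(0)
X = np.arange(20).reshape(10, 2)  # toy data: row i is [2*i, 2*i + 1]
y = np.arange(10)
X_sample, y_sample, idx = bootstrap_sample(X, y, rng)
```

Each data point is left out of a given sample with probability $(1 - 1/n)^n$, which approaches $1/e \approx 0.37$ for large $n$, so roughly a third of the data is unseen by each tree.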


### B. Implement random feature selection

1. Modify the `DecisionTree` class so that it takes an additional parameter: `num_features`. This is the number of features to consider at each node when choosing the best split. The features to consider are chosen at random at each node. You will need to modify the `__init__` method to take a `num_features` parameter. In `_choose_split_index`, randomly select `num_features` of the potential features, then calculate and compare splits only on those, so that the feature you split on is always one of the randomly chosen ones.

2. Modify `build_forest` in your `RandomForest` class to pass the `num_features` parameter to the Decision Trees.
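
The per-node feature selection described above amounts to sampling column indices without replacement. A minimal sketch, assuming a hypothetical helper `candidate_features` that `_choose_split_index` could call before looping over columns; capping at the number of available columns is an added guard, not something the assignment specifies.

```python
import numpy as np


def candidate_features(n_features, num_features, rng):
    """Pick the column indices to consider at one node, without replacement."""
    k = min(num_features, n_features)  # guard against asking for too many
    return rng.choice(n_features, size=k, replace=False)


rng = np.random.RandomState(1)
columns = candidate_features(16, 5, rng)  # 5 distinct indices in 0..15
```

Calling this once per node, rather than once per tree, is what makes each split see a fresh random subset.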


### C. Implement classification and scoring

1. In the `predict` method, have each Decision Tree classify each data point. For each data point, choose the label predicted by the most trees. Break ties by choosing one of the tied labels arbitrarily.

2. In the `score` method, first classify the data points, then compute the percentage of them that match the given labels.
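
The voting and scoring steps above can be sketched independently of the trees themselves. This minimal sketch assumes the per-tree predictions have already been stacked into an array of shape `(num_trees, num_samples)`; the function names `majority_vote` and `accuracy` are illustrative.

```python
from collections import Counter

import numpy as np


def majority_vote(tree_predictions):
    """Most common label per data point; rows are trees, columns are points."""
    votes = np.asarray(tree_predictions)
    # most_common(1) returns the top label; ties resolve arbitrarily
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])


def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))


predictions = [['yes', 'no', 'no'],   # tree 1
               ['yes', 'yes', 'no'],  # tree 2
               ['no', 'yes', 'no']]   # tree 3
voted = majority_vote(predictions)
```

`accuracy` returns a fraction between 0 and 1; multiply by 100 if you want a percentage.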


### D. Try a bigger data set

You won't be able to get great results cross-validating with the play golf data set since it's so small. In the data folder, there's a dataset called `congressional_voting.csv`. It contains each member of Congress's party and how they voted on different issues.

Here is what the 17 columns refer to:

* Class Name: 2 (democrat, republican)
* handicapped-infants: 2 (y,n)
* water-project-cost-sharing: 2 (y,n)
* adoption-of-the-budget-resolution: 2 (y,n)
* physician-fee-freeze: 2 (y,n)
* el-salvador-aid: 2 (y,n)
* religious-groups-in-schools: 2 (y,n)
* anti-satellite-test-ban: 2 (y,n)
* aid-to-nicaraguan-contras: 2 (y,n)
* mx-missile: 2 (y,n)
* immigration: 2 (y,n)
* synfuels-corporation-cutback: 2 (y,n)
* education-spending: 2 (y,n)
* superfund-right-to-sue: 2 (y,n)
* crime: 2 (y,n)
* duty-free-exports: 2 (y,n)
* export-administration-act-south-africa: 2 (y,n)

The dataset came from UCI [here](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records).

1. Based on the votes on the 16 issues, predict the party using your implementation of Random Forest. Start with 10 trees and a maximum of 5 features.

2. Compare how well the Random Forest does versus the Decision Tree.

3. Try modifying the number of trees and see how it affects your accuracy.

4. Calculate the accuracy for each of your decision trees on the test set and compare it to the accuracy of the random forest on the test set.

5. Predict how the congressmen will vote on a particular issue given the remaining columns.
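
For the last task above, the only change is which column plays the role of the label: one vote becomes `y`, and the party joins the other votes as a feature. A minimal sketch with numpy, assuming the data is already loaded as a 2d array; the helper `split_target` and the two-row toy array are made up for illustration and are not taken from the real file.

```python
import numpy as np


def split_target(data, target_col):
    """Return (X, y) with column target_col as the label, the rest as features."""
    y = data[:, target_col]
    X = np.delete(data, target_col, axis=1)  # drop the label column from X
    return X, y


# toy rows: party, vote on issue 1, vote on issue 2
toy = np.array([['democrat', 'y', 'n'],
                ['republican', 'n', 'y']], dtype=object)
X_issue, y_issue = split_target(toy, 1)  # predict issue 1 from party and issue 2
```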


### Extra Credit: out-of-bag error and feature importance

1. Out-of-bag error is a clever way of validating your model by testing individual trees on samples that weren't included in their training set. It is described in the lecture notes, [Applied Data Science](http://columbia-applied-data-science.github.io/appdatasci.pdf) (9.4.3) and [Breiman's notes](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr).

2. Feature importance is a way of determining which features contribute the most to being able to predict the result. It is discussed in the lecture notes and [Breiman's notes](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp). You can compare which features you get with Breiman's method vs. [sklearn](http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation).
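
Both extra-credit ideas can be sketched without reference to a particular tree implementation. The sketch below makes two assumptions that are not part of the provided skeleton: `oob_error` expects you to have recorded, for every tree, its predictions on the full training set and a boolean mask of which samples were in its bootstrap sample; `permutation_importance` takes any `predict` callable. The importance function is also a simplified variant: Breiman's method permutes a feature within each tree's out-of-bag samples, whereas this version permutes it once on a single evaluation set.

```python
from collections import Counter

import numpy as np


def oob_error(tree_predictions, in_bag, y):
    """Error rate when each sample is voted on only by trees that never saw it."""
    tree_predictions = np.asarray(tree_predictions)  # (num_trees, num_samples)
    in_bag = np.asarray(in_bag, dtype=bool)          # True if tree trained on sample
    wrong, counted = 0, 0
    for i in range(len(y)):
        votes = tree_predictions[~in_bag[:, i], i]   # out-of-bag trees for sample i
        if len(votes) == 0:
            continue  # sample was in every bag, so it has no OOB vote
        counted += 1
        wrong += int(Counter(votes).most_common(1)[0][0] != y[i])
    return wrong / float(counted) if counted else float('nan')


def permutation_importance(predict, X, y, rng):
    """Drop in accuracy when each column of X is shuffled in turn."""
    baseline = np.mean(predict(X) == y)
    drops = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
        drops.append(baseline - np.mean(predict(X_perm) == y))
    return np.array(drops)
```

A feature the model never uses has an importance of exactly zero under this scheme, since shuffling it cannot change any prediction.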