# DSCI6003 Practicum I: Random Forests

Your study of tree classifiers begins with random forests. 

## Implement Decision Trees

In order to build a random forest you must first master building decision trees.

1. If you have not yet completed working code for decision trees, start with getting a complete implementation using the annotated code stub DecisionTree.py and TreeNode.py provided to you in the /code directory. 

2. Use the run_decision_tree.py and test_decision_tree.py code stubs (with the command line) to ensure that your construction is correct. Use pycharm or sublime for a develop environment.

3. Once your tree is capable of producing correct results, continue with the RandomForest.py stub, discussed below.

4. You can check your performance of both the forest and trees against the setup of the executable in the practicum directory.

Modify the DecisionTree class so that it takes an additional parameter: num_features. This is the number of features to consider at each node in choosing the best split. Which features to consider is randomly chosen at each node. You will need to modify the __init__, method to take a num_features parameter. In _choose_split_index, you should randomly select num_features of the potential features to consider. Only calculate and compare the features that were randomly chosen, so that the feature you choose is one of the randomly chosen features

In [1]:
import numpy as np
import math
from collections import Counter
from sklearn.model_selection import train_test_split
import pandas as pd


In [2]:

#from TreeNode import TreeNode


class DecisionTree(object):
    '''
    A decision tree class.
    '''

    def __init__(self, impurity_criterion='entropy',number_features=None):
        '''
        Initialize an empty DecisionTree.
        '''

        self.root = None  # root Node
        self.feature_names = None  # string names of features (for interpreting
                                   # the tree)
        self.categorical = None  # Boolean array of whether variable is
                                 # categorical (or continuous)
        self.impurity_criterion = self._entropy \
                                  if impurity_criterion == 'entropy' \
                                  else self._gini
        self.num_features = number_features
        

    def fit(self, X, y, feature_names=None):
        '''
        INPUT:
            - X: 2d numpy array
            - y: 1d numpy array
            - feature_names: numpy array of strings
        OUTPUT: None
        Build the decision tree.
        X is a 2 dimensional array with each column being a feature and each
        row a data point.
        y is a 1 dimensional array with each value being the corresponding
        label.
        feature_names is an optional list containing the names of each of the
        features.
        '''
        
        if self.num_features!=None:
            if self.num_features>np.shape(X)[1]:
                return "You do not have this many features. Please select fewer features."


        # This piece of code is used to provide feature names to the Decision tree
        if feature_names is None or len(feature_names) != X.shape[1]:
            # if the user has not provided feature names, just give them numbers
            self.feature_names = np.arange(X.shape[1])
        else:
            # otherwise, these are the names
            self.feature_names = feature_names
            
        

        # * Create True/False array of whether the variable is categorical
        # use a lambda function called is_categorical to determine if the variable is an instance
        # of str, bool or unicode - in that case is_categorical will be true
        # otherwise False. Look up the function isinstance()
        

        is_categorical = lambda x: isinstance(x,str) or isinstance(x,bool)

        # Each variable (organized by index) is given a label categorical or not
        self.categorical = np.vectorize(is_categorical)(X[0])

        # Call the build_tree function
        self.root = self._build_tree(X, y)

    def _build_tree(self, X, y):
        '''
        INPUT:
            - X: 2d numpy array
            - y: 1d numpy array
        OUTPUT:
            - TreeNode
        Recursively build the decision tree. Return the root node.
        '''
        
        node = TreeNode()

        #  * initialize a root TreeNode
        ####### use num_features here
        
        # * set index, value, splits as the output of self._choose_split_index(X,y)
        index, value, splits = self._choose_split_index(X,y)
        # splits is a tuple here

        # if no index is returned from the split index or we cannot split
        if index is None or len(np.unique(y)) == 1:
            # * set the node to be a leaf
            node.leaf = True

            # * set the classes attribute to the number of classes
            # * we have in this leaf with Counter()
            node.classes=Counter(y)
            #Counter) only necessary for leaf node:
                                  #           key is class name and value is
                                  #           count of the count of data points
                                  #           that terminate at this leaf
                        

            # * set the name of the node to be the most common class in it
            node.name = node.classes.most_common(1)[0][0] # return the first element, the first part of tuple will be class
            
            

        else: # otherwise we can split (again this comes out of choose_split_index
            # * set X1, y1, X2, y2 to be the splits
            
            X1, y1, X2, y2  = splits
            
            # * the node column should be set to the index coming from split_index
            node.column=index

            # * the node name is the feature name as determined by
            #   the index (column name)
            node.name = self.feature_names[index]

            # * set the node value to be the value of the split
            node.value = value

            # * set the categorical flag of the node to be the category of the column
            ## find the category of the columns
            #self.categorical[index]
            node.categorical = self.categorical[index]

            # * now continue recursing down both branches of the split
            node.left =self._build_tree(X1,y1)
            node.right=self._build_tree(X2,y2)
            

        return node

    def _entropy(self, y):
        '''
        INPUT:
            - y: 1d numpy array
        OUTPUT:
            - float
        Return the entropy of the array y.
        '''

        total = 0
        unique_y = set(y)
        number_of_items = len(y)
        counter_of_class = Counter(y)
        

        for unique_class in unique_y:

            
            total += counter_of_class[unique_class]/number_of_items* np.log(counter_of_class[unique_class]/number_of_items)

                
        # * for each unique class C in y
            # * count up the number of times the class C appears and divide by
            # * the total length of y. This is the p(C)
            # * add the entropy p(C) ln p(C) to the total
        return -total

    def _gini(self, y):
        '''
        INPUT:
            - y: 1d numpy array
        OUTPUT:
            - float
        Return the gini impurity of the array y.
        '''

        total = 0
        unique_y = set(y)
        number_of_items = len(y)
        counter_of_class = Counter(y)
        
        for key,value in counter_of_class.items():
            
            total += (counter_of_class[key]/number_of_items)**2
        
        # * for each unique class C in y
            # * count up the number of times the class C appears and divide by
            # * the size of y. This is the p(C)
            # * add p(C)**2 to the total
        return 1 - total

    def _make_split(self, X, y, split_index, split_value):
        '''
        INPUT:
            - X: 2d numpy array
            - y: 1d numpy array
            - split_index: int (index of feature)
            - split_value: int/float/bool/str (value of feature)
        OUTPUT:
            - X1: 2d numpy array (feature matrix for subset 1)
            - y1: 1d numpy array (labels for subset 1)
            - X2: 2d numpy array (feature matrix for subset 2)
            - y2: 1d numpy array (labels for subset 2)
        Return the two subsets of the dataset achieved by the given feature and
        value to split on.
        Call the method like this:
        X1, y1, X2, y2 = self._make_split(X, y, split_index, split_value)
        X1, y1 is a subset of the data.
        X2, y2 is the other subset of the data.
        '''
        
        new_split = X[:,split_index]
        try:
            if self.categorical[split_index] == True:
                A = np.where(new_split==split_value)
                B = np.where(new_split!=split_value)
            else:# variable is not categorical
                A = np.where(new_split<split_value)
                B = np.where(new_split>=split_value)
            #print("Made a sccessful split of data")
        except:

            if self.categorical[split_index] == True:
                A = np.where(new_split==split_value)
                B = np.where(new_split!=split_value)
            else:# variable is not categorical, 
               
                A = np.where(new_split<int(split_value))
                B = np.where(new_split>=int(split_value))

        # * slice the split column from X with the split_index
        # * if the variable of this column is categorical
            # * select the indices of the rows in the column
            #  with the split_value (T/F) into one set of indices (call them A)
            # * select the indices of the rows in the column
            # that don't have the split_value into another
            #  set of indices (call them B)
        # * else if the variable is not categorical
             # * select the indices of the rows in the column
            #  less than the split value into one set of indices (call them A)
            # * select the indices of the rows in the column
            #  greater or equal to  the split value into
            # another set of indices (call them B)
        return X[A], y[A], X[B], y[B]

    def _information_gain(self, y, y1, y2):
        '''
        INPUT:
            - y: 1d numpy array
            - y1: 1d numpy array (labels for subset 1)
            - y2: 1d numpy array (labels for subset 2)
        OUTPUT:
            - float
        Return the information gain of making the given split.
        Use self.impurity_criterion(y) rather than calling _entropy or _gini
        directly.
        '''
        # * set total equal to the impurity_criterion
        total = self.impurity_criterion(y)

        len_y1 = len(y1)
        len_y2 = len(y2)
        
        total -= self.impurity_criterion(y1)*len_y1/len(y) # split one
        total -=self.impurity_criterion(y2)*len_y2/len(y)
        
        # * for each of the possible splits y1 and y2
            # * calculate the impurity_criterion of the split
            # * subtract this value from the total, multiplied by split_size/y_size
        return total

    def _choose_split_index(self, X, y):
        # you should randomly select num_features of the potential features to consider. Only calculate and compare
        #the features that were randomly chosen, so that the feature you choose is one of the randomly chosen features
        # self.num_features
        '''
        INPUT:
            - X: 2d numpy array
            - y: 1d numpy array
        OUTPUT:
            - index: int (index of feature)
            - value: int/float/bool/str (value of feature)
            - splits: (2d array, 1d array, 2d array, 1d array)
        Determine which feature and value to split on. Return the index and
        value of the optimal split along with the split of the dataset.
        Return None, None, None if there is no split which improves information
        gain.
        Call the method like this:
        index, value, splits = self._choose_split_index(X, y)
        X1, y1, X2, y2 = splits
        '''
        
        
        # set these initial variables to None
        split_index, split_value, splits = None, None, None
        # we need to keep track of the maximum entropic gain
        max_gain = 0
        
        
        
        if self.num_features !=None: ## building a random forest 
        
           
            index_of_columns = np.arange(np.shape(X)[1])
            index_selection = np.random.choice(index_of_columns,self.num_features)
     
            
            for col_index in index_selection: ##go the col indexes that were chosen above
                values = np.unique(X[:,col_index])
            
                


            # * for each column in X
                # * set an array called values to be the
                # unique values in that column (use np.unique)

                # if there are less than 2 values, move on to the next column
                if len(values) < 2:
                    continue


                # * for each value V in the values array
                for  val_index,value in enumerate(values): # each feature

                    X_yes, y_yes, X_no, y_no =self._make_split(X,y,col_index,value) 
                    #_make_split(self, X, y, split_index, split_value):
                    
                    #make split returns  X[A], y[A], X[B], y[B]

                    # * make a temporary split (using the column index and V) with make_split

                    # * calculate the information gain between the original y, y1 and y2
                    # _information_gain(self, y, y1, y2):
                    info_gain  = self._information_gain(y,y_yes,y_no)
                    if info_gain>max_gain:
                        max_gain = info_gain
                        split_index = col_index
                        split_value = value
                        splits = (X_yes, y_yes, X_no, y_no)


                    # * if this gain is greater than the max_gain
                        # * set max_gain, split_index, and split_value to be equal
                        # to the current max_gain, column and value

                        # * set the output splits to the current split setup (X1, y1, X2, y2)
        

        
        else: ## building a regular decision tree
          
            
            for col_index,col in enumerate(X.T):
                values = np.unique(col)

            # * for each column in X
                # * set an array called values to be the
                # unique values in that column (use np.unique)

                # if there are less than 2 values, move on to the next column
                if len(values) < 2:
                    continue


                # * for each value V in the values array
                for  val_index,value in enumerate(values): # each feature

                    X_yes, y_yes, X_no, y_no =self._make_split(X,y,col_index,value) #_make_split(self, X, y, split_index, split_value):
                    #make split returns  X[A], y[A], X[B], y[B]


                    # * make a temporary split (using the column index and V) with make_split

                    # * calculate the information gain between the original y, y1 and y2
                    # _information_gain(self, y, y1, y2):
                    info_gain  = self._information_gain(y,y_yes,y_no)
                    if info_gain>max_gain:
                        max_gain = info_gain
                        split_index = col_index
                        split_value = value
                        splits = (X_yes, y_yes, X_no, y_no)


                    # * if this gain is greater than the max_gain
                        # * set max_gain, split_index, and split_value to be equal
                        # to the current max_gain, column and value

                        # * set the output splits to the current split setup (X1, y1, X2, y2)
                        
        return split_index, split_value, splits

    def predict(self, X):
        '''
        INPUT:
            - X: 2d numpy array
        OUTPUT:
            - y: 1d numpy array
        Return an array of predictions for the feature matrix X.
        '''

        #return np.array(map(lambda x: self.root.predict_one(x) ,X)).reshape((1,-1))
        


        return np.apply_along_axis(self.root.predict_one, axis=1, arr=X)

    def __str__(self):
        '''
        Return string representation of the Decision Tree. This will allow you to $:print tree
        '''
        return str(self.root)

In [3]:
p = np.array([[1,2],
            [3,4]])

In [4]:
p.sum(axis=1)

array([3, 7])

In [5]:


class TreeNode(object):
    '''
    A node class for a decision tree.
    '''
    def __init__(self):
        self.column = None  # (int)    index of feature to split on
        self.value = None  # value of the feature to split on
        self.categorical = True  # (bool) whether or not node is split on
                                 # categorial feature
        self.name = None    # (string) name of feature (or name of class in the
                            #          case of a list)
        self.left = None    # (TreeNode) left child
        self.right = None   # (TreeNode) right child
        self.leaf = False   # (bool)   true if node is a leaf, false otherwise
        self.classes = Counter()  # (Counter) only necessary for leaf node:
                                  #           key is class name and value is
                                  #           count of the count of data points
                                  #           that terminate at this leaf

    def predict_one(self, x):
        '''
        INPUT:
            - x: 1d numpy array (single data point)
        OUTPUT:
            - y: predicted label
        Return the predicted label for a single data point.
        '''
        if self.leaf:
            return self.name
        col_value = x[self.column]

        if self.categorical:
            if col_value == self.value:
                return self.left.predict_one(x)
            else:
                return self.right.predict_one(x)
        else:
            if col_value < self.value:
                return self.left.predict_one(x)
            else:
                return self.right.predict_one(x)

    # This is for visualizing your tree. You don't need to look into this code.
    def as_string(self, level=0, prefix=""):
        '''
        INPUT:
            - level: int (amount to indent)
        OUTPUT:
            - prefix: str (to start the line with)
        Return a string representation of the tree rooted at this node.
        '''
        result = ""
        if prefix:
            indent = "  |   " * (level - 1) + "  |-> "
            result += indent + prefix + "\n"
        indent = "  |   " * level
        result += indent + "  " + str(self.name) + "\n"
        if not self.leaf:
            if self.categorical:
                left_key = str(self.value)
                right_key = "no " + str(self.value)
            else:
                left_key = "< " + str(self.value)
                right_key = ">= " + str(self.value)
            result += self.left.as_string(level + 1, left_key + ":")
            result += self.right.as_string(level + 1, right_key + ":")
        return result

    def __repr__(self):
        return self.as_string().strip()

In [6]:
## import pandas as pd
#from DecisionTree import DecisionTree


def test_tree(filename):
    df = pd.read_csv(filename)
    y = df.pop('Result').values
    X = df.values
    print(X)
    
    tree = DecisionTree()
    tree.fit(X, y, df.columns)
    print(tree)
    print

    y_predict = tree.predict(X)
    print('%26s   %10s   %10s' % ("FEATURES", "ACTUAL", "PREDICTED"))
    print('%26s   %10s   %10s' % ("----------", "----------", "----------"))
    for features, true, predicted in zip(X, y, y_predict):
        print('%26s   %10s   %10s' % (str(features), str(true), str(predicted)))

In [7]:
test_tree('data/playgolf.csv')

[['sunny' 85 85 False]
 ['sunny' 80 90 True]
 ['overcast' 83 78 False]
 ['rain' 70 96 False]
 ['rain' 68 80 False]
 ['rain' 65 70 True]
 ['overcast' 64 65 True]
 ['sunny' 72 95 False]
 ['sunny' 69 70 False]
 ['rain' 75 80 False]
 ['sunny' 75 70 True]
 ['overcast' 72 90 True]
 ['overcast' 81 75 False]
 ['rain' 71 80 True]]
Outlook
  |-> overcast:
  |     Play
  |-> no overcast:
  |     Temperature
  |     |-> < 80:
  |     |     Temperature
  |     |     |-> < 75:
  |     |     |     Temperature
  |     |     |     |-> < 71:
  |     |     |     |     Temperature
  |     |     |     |     |-> < 68:
  |     |     |     |     |     Don't Play
  |     |     |     |     |-> >= 68:
  |     |     |     |     |     Play
  |     |     |     |-> >= 71:
  |     |     |     |     Don't Play
  |     |     |-> >= 75:
  |     |     |     Play
  |     |-> >= 80:
  |     |     Don't Play
                  FEATURES       ACTUAL    PREDICTED
                ----------   ----------   ----------
     ['sunn

In [8]:
import nose.tools as n
import numpy as np
#from DecisionTree import DecisionTree as DT
#from TreeNode import TreeNode as TN


def test_entropy():
    array = [1, 1, 2, 1, 2]
    result = DecisionTree()._entropy(np.array(array))
    actual = 0.67301
    message = 'Entropy value for %r: Got %.2f. Should be %.2f' \
              % (array, result, actual)
    n.assert_almost_equal(result, actual, 4, message)


def test_gini():
    array = [1, 1, 2, 1, 2]
    result = DecisionTree()._gini(np.array(array))
    actual = 0.48
    message = 'Gini value for %r: Got %.2f. Should be %.2f' \
              % (array, result, actual)
    n.assert_almost_equal(result, actual, 4, message)


def fake_data():
    X = np.array([[1, 'bat'], [2, 'cat'], [2, 'rat'], [3, 'bat'], [3, 'bat']])
    y = np.array([1, 0, 1, 0, 1])
    X1 = np.array([[1, 'bat'], [3, 'bat'], [3, 'bat']])
    y1 = np.array([1, 0, 1])
    X2 = np.array([[2, 'cat'], [2, 'rat']])
    y2 = np.array([0, 1])
    return X, y, X1, y1, X2, y2


def test_make_split():
    X, y, X1, y1, X2, y2 = fake_data()
    split_index, split_value = 1, 'bat'
    dt = DecisionTree()
    dt.categorical = np.array([False, True])
    result = dt._make_split(X, y, split_index, split_value)
    try:
        X1_result, y1_result, X2_result, y2_result = result
    except ValueError:
        n.assert_true(False, 'result not in correct form: (X1, y1, X2, y2)')
    actual = (X1, y1, X2, y2)
    message = '_make_split got results\n%r\nShould be\n%r' % (result, actual)
    n.ok_(np.array_equal(X1, X1_result), message)
    n.ok_(np.array_equal(y1, y1_result), message)
    n.ok_(np.array_equal(X2, X2_result), message)
    n.ok_(np.array_equal(y2, y2_result), message)


def test_information_gain():
    X, y, X1, y1, X2, y2 = fake_data()
    result = DecisionTree()._information_gain(y, y1, y2)
    actual = 0.01384
    message = 'Information gain for:\n%r, %r, %r:\nGot %.3f. Should be %.3f' \
              % (y, y1, y2, result, actual)
    n.assert_almost_equal(result, actual, 4, message)


def test_choose_split_index():
    X, y, X1, y1, X2, y2 = fake_data()
    index, value = 1, 'cat'
    dt = DecisionTree()
    dt.categorical = np.array([False, True])
    result = dt._choose_split_index(X, y)
    try:
        split_index, split_value, splits = result
    except ValueError:
        message = 'result not in correct form. Should be:\n' \
                  '    split_index, split_value, splits'
        n.assert_true(False, message)
    message = 'choose split for data:\n%r\n%r\n' \
              'split index, split value should be: %r, %r\n' \
              'not: %r, %r' \
              % (X, y, index, value, split_index, split_value)
    n.eq_(split_index, index, message)
    n.eq_(split_value, value, message)

def test_predict():
    root = TreeNode()
    root.column = 1
    root.name = 'column 1'
    root.value = 'bat'
    root.left = TreeNode()
    root.left.leaf = True
    root.left.name = "one"
    root.right = TreeNode()
    root.right.leaf = True
    root.right.name = "two"
    data = [10, 'cat']
    result = root.predict_one(data)
    actual = "two"
    message = 'Predicted %r. Should be %r.\nTree:\n%r\ndata:\n%r' \
              % (result, actual, root, data)
    n.eq_(result, actual, message)




In [9]:
test_entropy()
test_make_split()

test_gini()
test_make_split()
test_information_gain()
test_choose_split_index()
test_predict()

## Build a Random Forest

You will be using our implementation of Decision Trees to implement a Random Forest.

You can use the `DecisionTree` class from `DecisionTree.py` with the following code:

```python
dt = DecisionTree()
dt.fit(X_train, y_train)
predicted_y = dt.predict(X_test)
```

You can also visualize a Decision Tree by printing it. This may be helpful for understanding your Random Forest.

```python
print dt
```

While you're getting your code to work, use the play golf data set that we used for implementing Decision Trees.

There's a file called `RandomForest.py` which contains a skeleton of the code. Your goal is to fill it in so that you can run it with the following lines of code:

```python
from RandomForest import RandomForest
from sklearn.cross_validation import train_test_split
import numpy as np
import pandas as pd

df = pd.read_csv('data/playgolf.csv')
y = df.pop('Result').values
X = df.values
X_train, X_test, y_train, y_test = train_test_split(X, y)

rf = RandomForest(num_trees=10, num_features=5)
rf.fit(X_train, y_train)
y_predict = rf.predict(X_test)
print "score:", rf.score(X_test, y_test)
```

### A. Implement *Tree Bagging*

Bagging, or *bootstrap aggregating*, is taking several random samples *with replacement* from the data set and building a model for each sample. Each of these models gets a vote on the prediction.

Sampling with replacement means that we can repeat data points. In the basic random forest, we will always use a sample size that is the same as the size of the original data set. Many data points will not be included in each sample and many will be repeated.

1. Implement the `build_forest` method. For right now, we will be ignoring the `num_features` parameter. Here is the pseudocode:

      Repeat num_trees times:
          Create a random sample of the data with replacement
          Build a decision tree with that sample
      Return the list of the decision trees created


### B. Implement random feature selection

1. Modify the `DecisionTree` class so that it takes an additional parameter: `num_features`. This is the number of features to consider at each node in choosing the best split. Which features to consider is randomly chosen at each node. You will need to modify the `__init__`, method to take a `num_features` parameter. In `_choose_split_index`, you should randomly select `num_features` of the potential features to consider. Only calculate and compare the features that were randomly chosen, so that the feature you choose is one of the randomly chosen features.

2. Modify `build_forest` in your `RandomForest` class to pass the `num_features` parameter to the Decision Trees.


### C. Implement classification and scoring

1. In the `predict` method, you should have each Decision Tree classify each data point. Choose the label with the majority of trees. Break ties by choosing one of the labels arbitrarily.

2. In the `score` method, you should first classify the data points and count the percent of them which match the given labels.


### D. Try a bigger data set

You won't be able to get great results cross validating with the play golf data set since it's so small. In the data folder, there's a dataset called 'congressional_voting.csv'. This contains congressman, how they voted on different issues and their party.

Here are what the 17 columns refer to:

* Class Name: 2 (democrat, republican)
* handicapped-infants: 2 (y,n)
* water-project-cost-sharing: 2 (y,n)
* adoption-of-the-budget-resolution: 2 (y,n)
* physician-fee-freeze: 2 (y,n)
* el-salvador-aid: 2 (y,n)
* religious-groups-in-schools: 2 (y,n)
* anti-satellite-test-ban: 2 (y,n)
* aid-to-nicaraguan-contras: 2 (y,n)
* mx-missile: 2 (y,n)
* immigration: 2 (y,n)
* synfuels-corporation-cutback: 2 (y,n)
* education-spending: 2 (y,n)
* superfund-right-to-sue: 2 (y,n)
* crime: 2 (y,n)
* duty-free-exports: 2 (y,n)
* export-administration-act-south-africa: 2 (y,n)

The dataset came from UCI [here](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records).

1. Based on the votes on the 16 issues, predict the party using your implementation of Random Forest. Start with 10 trees and a maximum of 5 features.

2. Compare how well the Random Forest does versus the Decision Tree.

3. Try modifying the number of trees and see how it affects your accuracy.

4. Calculate the accuracy for each of your decision trees on the test set and compare it to the accuracy of the random forest on the test set.

5. Predict how the congressmen will vote on a particular issue given the remaining columns.


### Extra Credit: out-of-bag error and feature importance

1. Out-of-bag error is a clever way of validating your model by testing individual trees based on samples that weren't including in their training set. It is described in the lecture notes, [Applied Data Science](http://columbia-applied-data-science.github.io/appdatasci.pdf) (9.4.3) and [Breiman's notes](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr).

2. Feature importance is a way of determining which features contribute the most to being able to predict the result. It is discussed in the lecture notes and [Breiman's notes](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp). You can compare what features you get with Breiman's method vs [sklearn](http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation).

def test_tree(filename):

    df = pd.read_csv(filename)
    y = df.pop('Result').values
    X = df.values
    print(X)
    
    tree = DecisionTree()
    tree.fit(X, y, df.columns)
    print(tree)
    print

    y_predict = tree.predict(X)
    print('%26s   %10s   %10s' % ("FEATURES", "ACTUAL", "PREDICTED"))
    print('%26s   %10s   %10s' % ("----------", "----------", "----------"))
    for features, true, predicted in zip(X, y, y_predict):
        print('%26s   %10s   %10s' % (str(features), str(true), str(predicted)))

In [10]:
np.random.choice(np.arange(10),3)

array([2, 1, 0])

In [11]:
#from DecisionTree import DecisionTree

class RandomForest(object):
    '''A Random Forest class'''

    def __init__(self, num_trees, num_features):
        '''
           num_trees:  number of trees to create in the forest:
        num_features:  the number of features to consider when choosing the
                           best split for each node of the decision trees
        '''
        self.num_trees = num_trees
        self.num_features = num_features
        self.forest = None

    def fit(self, X, y):
        '''
        X:  two dimensional numpy array representing feature matrix
                for test data
        y:  numpy array representing labels for test data
        '''
        self.forest = self.build_forest(X, y, X.shape[0])

    def build_forest(self, X, y, num_samples): #num_samples is the number of rows

        # * Return a list of num_trees DecisionTrees.
#         index_of_columns=np.arange(np.shape(X)[1]) # get the number of columns
        
#         subset_selection = np.random.choice(index_of_columns,num_features)

        trees=[]
        list_of_indexes =np.arange(np.shape(X)[0])
        
        for tree_num in range(self.num_trees):
            
            indexes= np.random.choice(list_of_indexes,num_samples)
            
            tree = DecisionTree(number_features=self.num_features)
            
            tree.fit(X[indexes],y[indexes])
            trees.append(tree)
            #print(tree,'__________________')
            
            
            
        return trees


        #In the predict method, you should have each Decision Tree classify each data 
        #point. Choose the label with the majority of trees. Break ties by choosing one of the labels arbitrarily.
#In the score method, you should first classify the data 
#points and count the percent of them which match the given labels.

    def predict(self, X):

        '''
        Return a numpy array of the labels predicted for the given test data.
        '''
        
        
        results={} # this is now a list of lists
        tree_results=[] # prediction from each tree
        final_results=[] # this is from counter most common
        
        for tree_count,tree in enumerate(self.forest):
            tree_results.append(tree.predict(X)) # this is a list

            for prediction in range(np.shape(X)[0]):
                if prediction in results:


                    #print(final_results[predic])
                    results[prediction].append(tree_results[tree_count][prediction])

                else:
                    results[prediction]=[tree_results[tree_count][prediction]]

        
        for k,v in results.items():
            final_results.append(Counter(v).most_common()[0][0]) # return the value of the most common index
            
        # * Each one of the trees is allowed to predict on the same row of input data. The majority vote
        # is the output of the whole forest. This becomes a single prediction.
        #print(results,'results')
        return final_results

    def score(self, X, y):

        '''
        Return the accuracy of the Random Forest for the given test data.
        '''

        # * In this case you simply compute the accuracy formula as we have defined in class. Compare predicted y to
        # the actual input y.

        return sum(self.predict(X)==y)/len(y)
  


In [12]:
#Test the tree


df = pd.read_csv('data/playgolf.csv')
y = df.pop('Result').values
X = df.values
X_train, X_test, y_train, y_test = train_test_split(X, y)

dt = RandomForest(9,2)
dt.fit(X_train, y_train)
#predicted_y = dt.predict(X_test)
#dt.build_forest(X_train,y_train,5)
dt.predict(X)
dt.score(X,y)

0.6428571428571429

In [13]:
X[:,2]


array([85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80], dtype=object)

In [14]:
r = ["Don't Play", "Play","Don't Play", "Don't Play"]

In [15]:
Counter(r).most_common()[0][0]

"Don't Play"

In [16]:
df = pd.read_csv('data/playgolf.csv')
y = df.pop('Result').values
X = df.values
X_train, X_test, y_train, y_test = train_test_split(X, y)

dt = RandomForest(500,3) ## 500 tree , 3 features
dt.fit(X_train, y_train)
#predicted_y = dt.predict(X_test)
dt.predict(X)
dt.score(X,y)

0.7857142857142857

# We see accuracy of ~85% 500 trees in our random forest.

## Larger Dataset -Voting

In [17]:
cols = ['class_name',\
   'handicapped-infants',\
   'water-project-cost-sharing',\
   'adoption-of-the-budget-resolution',\
   'physician-fee-freeze',\
   'el-salvador-aid',\
   'religious-groups-in-schools',\
   'anti-satellite-test-ban',\
   'aid-to-nicaraguan-contras',\
  'mx-missile',\
  'immigration',\
  'synfuels-corporation-cutback',\
  'education-spending',\
  'superfund-right-to-sue',\
  'crime',\
  'duty-free-exports',\
  'export-administration-act-south-africa']

In [18]:
voting_records = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data\
    ',header=None,names=cols)


In [19]:
voting_records.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 17 columns):
class_name                                435 non-null object
handicapped-infants                       435 non-null object
water-project-cost-sharing                435 non-null object
adoption-of-the-budget-resolution         435 non-null object
physician-fee-freeze                      435 non-null object
el-salvador-aid                           435 non-null object
religious-groups-in-schools               435 non-null object
anti-satellite-test-ban                   435 non-null object
aid-to-nicaraguan-contras                 435 non-null object
mx-missile                                435 non-null object
immigration                               435 non-null object
synfuels-corporation-cutback              435 non-null object
education-spending                        435 non-null object
superfund-right-to-sue                    435 non-null object
crime                      

In [20]:
# remove the ? marks in the data

In [21]:
voting_records.columns

Index(['class_name', 'handicapped-infants', 'water-project-cost-sharing',
       'adoption-of-the-budget-resolution', 'physician-fee-freeze',
       'el-salvador-aid', 'religious-groups-in-schools',
       'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile',
       'immigration', 'synfuels-corporation-cutback', 'education-spending',
       'superfund-right-to-sue', 'crime', 'duty-free-exports',
       'export-administration-act-south-africa'],
      dtype='object')

In [22]:
df1 = voting_records[~((voting_records['class_name'] == '?' ) | \
                       (voting_records['handicapped-infants'] == '?' ) | (voting_records['water-project-cost-sharing'] == '?' ) |\
                      (voting_records['adoption-of-the-budget-resolution'] == '?' ) | (voting_records['physician-fee-freeze'] == '?' ) |\
                      (voting_records['el-salvador-aid'] == '?' ) | (voting_records['religious-groups-in-schools'] == '?' ) |\
                      (voting_records['anti-satellite-test-ban'] == '?' ) | (voting_records['aid-to-nicaraguan-contras'] == '?' ) | \
                     (voting_records['mx-missile'] == '?' ) | (voting_records['immigration'] == '?' ) | \
                      (voting_records['synfuels-corporation-cutback'] == '?' ) | (voting_records['education-spending'] == '?' ) |  \
                      (voting_records['superfund-right-to-sue'] == '?' ) | (voting_records['crime'] == '?' ) | \
                      (voting_records['duty-free-exports'] == '?' ) | (voting_records['export-administration-act-south-africa'] == '?' ))]


In [23]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 232 entries, 5 to 431
Data columns (total 17 columns):
class_name                                232 non-null object
handicapped-infants                       232 non-null object
water-project-cost-sharing                232 non-null object
adoption-of-the-budget-resolution         232 non-null object
physician-fee-freeze                      232 non-null object
el-salvador-aid                           232 non-null object
religious-groups-in-schools               232 non-null object
anti-satellite-test-ban                   232 non-null object
aid-to-nicaraguan-contras                 232 non-null object
mx-missile                                232 non-null object
immigration                               232 non-null object
synfuels-corporation-cutback              232 non-null object
education-spending                        232 non-null object
superfund-right-to-sue                    232 non-null object
crime                      

The dataset came from UCI here.
Based on the votes on the 16 issues, predict the party using your implementation of Random Forest. Start with 10 trees and a maximum of 5 features.
Compare how well the Random Forest does versus the Decision Tree.
Try modifying the number of trees and see how it affects your accuracy.
Calculate the accuracy for each of your decision trees on the test set and compare it to the accuracy of the random forest on the test set.
Predict how the congressmen will vote on a particular issue given the remaining columns.


> Test on the original data frame with ? marks, and the reduced DF.

In [24]:
X = np.array(voting_records.iloc[:,1:])
y=np.array(voting_records.iloc[:,0])

In [25]:

X_train, X_test, y_train, y_test = train_test_split(X, y)

dt = RandomForest(10,5) ## 500 tree , 3 features
dt.fit(X_train, y_train)
#predicted_y = dt.predict(X_test)
dt.predict(X_test)
dt.score(X_test,y_test)

0.93577981651376152

In [26]:
#Increase the number of trees


X_train, X_test, y_train, y_test = train_test_split(X, y)

dt = RandomForest(50,9) ## 50 tree , 9 features
dt.fit(X_train, y_train)
#predicted_y = dt.predict(X_test)
dt.predict(X_test)
dt.score(X_test,y_test)

0.64220183486238536

- We achieve an 56% accuracy with the original data frame (including the ? marks) to predict the party of the senator.
- Next, try the reduced DF.

In [27]:
X_red = np.array(df1.iloc[:,1:])
y_red=np.array(df1.iloc[:,0])

X_train, X_test, y_train, y_test = train_test_split(X_red, y_red)

dt = RandomForest(10,5) ## 10 tree , 5 features
dt.fit(X_train, y_train)
#predicted_y = dt.predict(X_test)
dt.predict(X_test)
dt.score(X_test,y_test)

0.5

- If we remove the question marks, our accuracy increases to over 91%.


In [28]:
# more trees on the reduced df and more features


X_red = np.array(df1.iloc[:,1:])
y_red=np.array(df1.iloc[:,0]) #predict the party

X_train, X_test, y_train, y_test = train_test_split(X_red, y_red)

dt = RandomForest(50,9) ## 50 tree , 9 features
dt.fit(X_train, y_train)
#predicted_y = dt.predict(X_test)
dt.predict(X_test)
dt.score(X_test,y_test)

0.98275862068965514

#### More trees and more features slightly increases our accuracy (depending on when you run the cells this value might change).

## Predict how the congressmen will vote on a particular issue given the remaining columns.
- change the target classification to predict how people will vote on the last four columns (superfund-right-to-sue                    232 non-null object
crime                                     232 non-null object
duty-free-exports                         232 non-null object
export-administration-act-south-africa )

In [29]:
voting_records.iloc[:,:-4].head()

Unnamed: 0,class_name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending
0,republican,n,y,n,y,y,y,n,n,n,y,?,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?


In [30]:
X = np.array(voting_records.iloc[:,:-3]) ## not including last three
y = np.array(voting_records.iloc[:,16]) # predict export-administration-act-south-africa

X_train, X_test, y_train, y_test = train_test_split(X, y)

dt = RandomForest(200,10) ## 200 tree , 10 features
dt.fit(X_train, y_train)
#predicted_y = dt.predict(X_test)
print('Predicted vote, Actual Vote')
print([(pred_vot,actual_vote) for pred_vot,actual_vote in zip(dt.predict(X_test),y_test)])
dt.score(X_test,y_test)

Predicted vote, Actual Vote
[('y', '?'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('n', 'n'), ('n', 'y'), ('n', 'y'), ('y', 'y'), ('y', '?'), ('y', '?'), ('y', 'y'), ('n', 'y'), ('y', 'y'), ('?', '?'), ('n', 'y'), ('?', 'y'), ('y', 'n'), ('y', '?'), ('y', '?'), ('n', '?'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('y', '?'), ('y', 'y'), ('y', '?'), ('n', 'n'), ('y', '?'), ('y', '?'), ('y', 'n'), ('n', 'n'), ('y', 'y'), ('n', 'y'), ('n', 'y'), ('?', 'y'), ('y', 'n'), ('y', 'y'), ('n', 'y'), ('n', 'n'), ('y', 'y'), ('n', 'y'), ('y', 'y'), ('n', 'n'), ('y', '?'), ('n', '?'), ('y', 'y'), ('y', 'y'), ('n', '?'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', '?'), ('y', '?'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('?', '?'), ('y', '?'), ('n', 'n'), ('n', '?'), ('y', 'n'), ('y', 'y'), ('n', 'y'), ('y', 'y'), ('y', 'n'), ('?', '?'), ('y', '?'), ('n', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'n'),

0.59633027522935778

- Only ~58% chance of predicting the ourcomt of export administration act in south america.
- Next, predict duty-free-exports 

In [31]:
X = np.array(voting_records.iloc[:,:-3]) ## not including last three
y = np.array(voting_records.iloc[:,15]) # predict duty free exports

X_train, X_test, y_train, y_test = train_test_split(X, y)

dt = RandomForest(200,10) ## 200 tree s, 10 features
dt.fit(X_train, y_train)
#predicted_y = dt.predict(X_test)
print('Predicted vote, Actual Vote')
print([(pred_vot,actual_vote) for pred_vot,actual_vote in zip(dt.predict(X_test),y_test)])
dt.score(X_test,y_test)

Predicted vote, Actual Vote
[('n', 'y'), ('y', 'n'), ('n', 'n'), ('y', 'n'), ('n', '?'), ('n', 'n'), ('n', 'y'), ('n', 'n'), ('y', 'n'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('n', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('n', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('n', 'y'), ('y', 'n'), ('n', 'y'), ('y', 'n'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('n', 'y'), ('n', 'n'), ('n', 'n'), ('y', 'n'), ('n', 'n'), ('n', 'n'), ('n', '?'), ('n', 'n'), ('n', '?'), ('y', '?'), ('n', 'n'), ('n', '?'), ('y', '?'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('y', 'n'), ('n', 'n'), ('?', '?'), ('n', 'n'), ('y', 'y'), ('n', 'y'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('n', '?'), ('y', 'n'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'y'), ('y', 'n'), ('n', 'n'), ('y', 'n'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('y', '?'), ('n', 'n'), ('y', 'y'),

0.68807339449541283

- Slightly easier to predict duty free exports.

In [32]:
X = np.array(voting_records.iloc[:,:-3]) ## not including last three
y = np.array(voting_records.iloc[:,14]) # predict crime

X_train, X_test, y_train, y_test = train_test_split(X, y)

dt = RandomForest(200,3) ## 200 trees , 3 features
dt.fit(X_train, y_train)
#predicted_y = dt.predict(X_test)
print('Predicted vote, Actual Vote')
print([(pred_vot,actual_vote) for pred_vot,actual_vote in zip(dt.predict(X_test),y_test)])
dt.score(X_test,y_test)

Predicted vote, Actual Vote
[('y', 'y'), ('y', 'y'), ('y', 'n'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('n', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'y'), ('n', 'y'), ('y', '?'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('n', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'n'), ('n', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('?', '?'), ('n', 'n'), ('y', 'y'), ('n', 'y'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'y'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('n', '?'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('y', 'n'),

0.84403669724770647

# This easiest thing to predict crime with ~80%

In [33]:
#test crime with only yes or no options available in the data (not ?)

X = np.array(df1.iloc[:,:-3]) ## not including last three
y = np.array(df1.iloc[:,14]) # predict crime

X_train, X_test, y_train, y_test = train_test_split(X, y)

dt = RandomForest(200,3) ## 200 trees , 3 features
dt.fit(X_train, y_train)
#predicted_y = dt.predict(X_test)
print('Predicted vote, Actual Vote')
print([(pred_vot,actual_vote) for pred_vot,actual_vote in zip(dt.predict(X_test),y_test)])
dt.score(X_test,y_test)

Predicted vote, Actual Vote
[('y', 'n'), ('n', 'y'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('n', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'y'), ('n', 'y'), ('n', 'n'), ('y', 'y'), ('y', 'n'), ('y', 'y'), ('y', 'y'), ('n', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'n'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('n', 'n'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('y', 'y'), ('n', 'n'), ('n', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('y', 'y'), ('n', 'n'), ('n', 'n'), ('n', 'n')]


0.82758620689655171

- As expected, if there are fewer options to predict (two versus three) the prediction accuracy increases (usually).

In [39]:
#test crime with only yes or no options available in the data (not ?), print out the tree

X = np.array(df1.iloc[:,:-3]) ## not including last three
y = np.array(df1.iloc[:,14]) # predict crime

X_train, X_test, y_train, y_test = train_test_split(X, y)

dt = RandomForest(5,10) ## 5 trees , 10 features
dt.fit(X_train, y_train)

tree = DecisionTree()
tree.fit(X_train, y_train, df1.columns[:-3])
print(tree)

    
#predicted_y = dt.predict(X_test)
print('Predicted vote, Actual Vote')
print([(pred_vot,actual_vote) for pred_vot,actual_vote in zip(dt.predict(X_test),y_test)])
dt.score(X_test,y_test)

physician-fee-freeze
  |-> n:
  |     aid-to-nicaraguan-contras
  |     |-> n:
  |     |     superfund-right-to-sue
  |     |     |-> n:
  |     |     |     adoption-of-the-budget-resolution
  |     |     |     |-> n:
  |     |     |     |     y
  |     |     |     |-> no n:
  |     |     |     |     el-salvador-aid
  |     |     |     |     |-> n:
  |     |     |     |     |     n
  |     |     |     |     |-> no n:
  |     |     |     |     |     anti-satellite-test-ban
  |     |     |     |     |     |-> n:
  |     |     |     |     |     |     n
  |     |     |     |     |     |-> no n:
  |     |     |     |     |     |     y
  |     |     |-> no n:
  |     |     |     y
  |     |-> no n:
  |     |     adoption-of-the-budget-resolution
  |     |     |-> y:
  |     |     |     religious-groups-in-schools
  |     |     |     |-> n:
  |     |     |     |     water-project-cost-sharing
  |     |     |     |     |-> n:
  |     |     |     |     |     immigration
  |     |     |     |   

0.86206896551724133