# Lab 3 - Decision Trees

This assignment uses a dataset from the JSE Data Archive. The goal is to identify whether there is a statistically significant dependence between gender and biological or activity traits in humans.

FEATURE DESCRIPTIONS: 

<ul>
<li>Color (Blue, Brown, Green, Hazel, Other)
<li>Age (in years)
<li>YearinSchool (First, Second, Third, Fourth, Other)
<li>Height (in inches)
<li>Miles (distance from home town of student to Ames, IA) 
<li>Brothers (number of brothers)
<li>Sisters (number of sisters)
<li>CompTime (number of hours spent on computer per week)
<li>Exercise (whether the student exercises Yes or No)
<li>ExerTime (number of hours spent exercising per week)
<li>MusicCDs (number of music CDs student owns)
<li>PlayGames (number of hours spent playing games per week)
<li>WatchTV (number of hours spent watching TV per week)
</ul>

https://ww2.amstat.org/publications/jse/jse_data_archive.htm

In [65]:
from collections import Counter, defaultdict
from itertools import combinations 
import pandas as pd
import numpy as np
import operator

In [66]:
df = pd.read_csv('Eye_Color.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2068 entries, 0 to 2067
Data columns (total 14 columns):
gender           2068 non-null object
age              2068 non-null int64
year             2068 non-null object
eyecolor         2068 non-null object
height           2051 non-null float64
miles            2052 non-null float64
brothers         2068 non-null int64
sisters          2068 non-null int64
computertime     2061 non-null float64
exercise         2068 non-null object
exercisehours    2068 non-null float64
musiccds         2024 non-null float64
playgames        2067 non-null float64
watchtv          2067 non-null float64
dtypes: float64(7), int64(3), object(4)
memory usage: 226.3+ KB


In [67]:
# remove NA's and reset the index
df = df.dropna(axis=0, how='any')  # recent pandas versions reject passing both how= and thresh=
df = df.reset_index(drop=True)
df.info()
df.head()
len(df.height.unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1988 entries, 0 to 1987
Data columns (total 14 columns):
gender           1988 non-null object
age              1988 non-null int64
year             1988 non-null object
eyecolor         1988 non-null object
height           1988 non-null float64
miles            1988 non-null float64
brothers         1988 non-null int64
sisters          1988 non-null int64
computertime     1988 non-null float64
exercise         1988 non-null object
exercisehours    1988 non-null float64
musiccds         1988 non-null float64
playgames        1988 non-null float64
watchtv          1988 non-null float64
dtypes: float64(7), int64(3), object(4)
memory usage: 217.5+ KB


30

# Calculating Gini Index 



**Question 1: How many rows are there in the dataset for males? For females?**



In [121]:
grouped=df.groupby('gender').size()
grouped
print('male: ',grouped['male'],'female: ',grouped['female'])

male:  910 female:  1078


**Question 2: What is the Gini Index of this dataset, using males and females as the target classes?**

In [108]:
ratio=grouped['female']/(grouped['female']+grouped['male'])

def gini(p):
    return 1-p*p-(1-p)*(1-p)
print(gini(ratio))

0.496429279905
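
For reference, the `gini` function above is the two-class case of the general Gini index. The symbols $p_k$ (fraction of rows in class $k$) and $K$ (number of classes) are notation introduced here for illustration and do not come from the notebook:

$$
G = 1 - \sum_{k=1}^{K} p_k^2, \qquad \text{and for } K = 2: \quad G = 1 - p^2 - (1-p)^2 = 2p(1-p)
$$

With 1078 female rows out of 1988, $p \approx 0.5423$, so $G \approx 2 \times 0.5423 \times 0.4577 \approx 0.4964$, which agrees with the printed value.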


## Best Split of a Numeric Feature

**Question 3: What is the best split point of the 'height' feature?**

In [125]:
sorted_heights = sorted(df["height"].unique())
print(sorted_heights)

# potential split points: midpoints between consecutive unique heights
psp = [(low + high) / 2 for low, high in zip(sorted_heights, sorted_heights[1:])]
print(psp)

def male_fraction(counts):
    # counts.get() returns 0 when a gender is missing from a partition,
    # which avoids the KeyError raised by counts['female']
    males = counts.get('male', 0)
    total = males + counts.get('female', 0)
    return males / total, total

best_split, best_gini = None, float('inf')
for x in psp:
    topcount = df[df['height'] <= x].groupby('gender').size()
    bottomcount = df[df['height'] > x].groupby('gender').size()

    topsplitprob, tsplitcount = male_fraction(topcount)
    bottomsplitprob, bsplitcount = male_fraction(bottomcount)

    # Gini Index of the split: size-weighted average of the two partitions,
    # using gini() from the Question 2 cell
    totalcount = tsplitcount + bsplitcount
    final_gini = (gini(topsplitprob) * tsplitcount / totalcount
                  + gini(bottomsplitprob) * bsplitcount / totalcount)

    # keep the split with the lowest Gini Index seen so far
    if final_gini < best_gini:
        best_split, best_gini = x, final_gini

[44.0, 52.0, 54.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 82.0, 85.0]
[48.0, 53.0, 55.0, 56.5, 57.5, 58.5, 59.5, 60.5, 61.5, 62.5, 63.5, 64.5, 65.5, 66.5, 67.5, 68.5, 69.5, 70.5, 71.5, 72.5, 73.5, 74.5, 75.5, 76.5, 77.5, 78.5, 79.5, 81.0, 83.5]


**Question 4: What is the Gini Index of this best split?**

**Question 5: How much does this partitioning reduce the Gini Index over that of the overall dataset?**

**Question 6: How many 'female' rows are below your best split point? 'male' rows?**

**Question 7: How many 'female' rows are above your best split point? 'male' rows?**

Recall that, to calculate the best split of this numeric field, you'll need to order your data by 'height', then consider the midpoint between each pair of consecutive heights as a potential split point, then calculate the Gini Index for that partitioning. You'll want to keep track of the best split point and its Gini Index (remember that you are trying to minimize the Gini Index). 

There are many ways to do this; some are very fast, others very slow. One tip for speed: as you step through the sorted data and calculate the Gini Index of each possible split point, keep a running total of the number of rows of each class ('male' and 'female') located above and below the split point.

Some Python tips: 

* Counter(), from the collections module, is a dictionary subclass for counting how often each key occurs
* zip() pairs up the elements of several lists into tuples (for example, if we have a list of heights and a list of genders, zip(heights, genders) gives us (height, gender) pairs; in Python 3, wrap it in list() to get an actual list)
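
The running-total tip and the two Python tips above can be combined roughly as follows. This is a minimal sketch, not the required solution: the function name `best_numeric_split` and the toy lists are hypothetical, and it assumes the values and class labels are plain Python lists of equal length.

```python
from collections import Counter

def best_numeric_split(values, labels):
    """Return (split_point, weighted_gini) with the lowest weighted Gini index."""
    def gini_from_counts(counts, n):
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    rows = sorted(zip(values, labels))   # (value, label) pairs ordered by value
    total = len(rows)
    below = Counter()                    # class counts at or below the split
    above = Counter(labels)              # class counts above the split
    best_split, best_gini = None, float("inf")

    for i in range(total - 1):
        value, label = rows[i]
        below[label] += 1                # move one row across the split
        above[label] -= 1
        next_value = rows[i + 1][0]
        if value == next_value:
            continue                     # no midpoint between equal values
        n_below = i + 1
        n_above = total - n_below
        weighted = ((n_below / total) * gini_from_counts(below, n_below)
                    + (n_above / total) * gini_from_counts(above, n_above))
        if weighted < best_gini:
            best_split, best_gini = (value + next_value) / 2, weighted
    return best_split, best_gini

# Hypothetical toy data, not taken from Eye_Color.csv
toy_heights = [60, 62, 64, 66, 70, 72]
toy_genders = ["female", "female", "female", "male", "male", "male"]
print(best_numeric_split(toy_heights, toy_genders))  # (65.0, 0.0)
```

Because the counts are updated one row at a time, the data is scanned once after sorting instead of being re-filtered for every candidate split point.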

## Best Split of a Categorical Variable

**Question 8: How many possible splits are there of the eyecolor feature?**

**Question 9: Which split of eyecolor best splits the female and male rows, as measured by the Gini Index?**

**Question 10: What is the Gini Index of this best split?**

**Question 11: How much does this partitioning reduce the Gini Index over that of the overall data set?**

**Question 12: How many 'female' rows and 'male' rows are in your first partition? How many 'female' rows and 'male' rows are in your second partition?**

Python tip: the combinations function of the itertools module allows you to enumerate combinations of a list. You might want to Google 'power set'.
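
One possible way to enumerate the two-way splits of a categorical feature with `combinations` is sketched below. The helper name `binary_partitions` and the three-colour toy list are hypothetical. The sketch assumes a split means dividing the categories into two non-empty groups, with a split and its mirror image counted once.

```python
from itertools import combinations

def binary_partitions(categories):
    """Yield each way to split the categories into two non-empty groups, once."""
    categories = sorted(categories)
    anchor, rest = categories[0], categories[1:]
    # Keeping `anchor` on the left side avoids listing a split and its mirror image.
    for size in range(len(rest)):        # never take all of `rest`, so the right side is non-empty
        for extra in combinations(rest, size):
            left = {anchor, *extra}
            right = set(categories) - left
            yield left, right

# Hypothetical toy list with three categories
toy_colors = ["Blue", "Brown", "Green"]
for left, right in binary_partitions(toy_colors):
    print(sorted(left), "|", sorted(right))
```

Under these assumptions, a feature with k categories has 2^(k-1) - 1 such splits, so the three toy colours give three. Each pair of groups can then be scored with the same weighted Gini index used for the numeric split.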

## Training a decision tree

**Question 13: Using all of the features in the original dataframe read in at the top of this notebook, train a decision tree classifier that has a depth of three (including the root node and leaf nodes). What is the accuracy of this classifier on the training data?**

Scikit-learn classifiers require class labels and features to be in numeric arrays. As such, you will need to turn your categorical features into numeric arrays using DictVectorizer. This is a helpful notebook for understanding how to do this: http://nbviewer.ipython.org/gist/sarguido/7423289. You can turn a pandas dataframe of features into a dictionary of the form needed by DictVectorizer by using df.to_dict('records'). Make sure you remove the class label first (in this case, gender). If you use the class label as a feature, your classifier will have a training accuracy of 100%! The example notebook link also shows how to turn your class labels into a numeric array using sklearn.preprocessing.LabelEncoder().
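
The preprocessing steps described above might look like the following on a small example. This is only a sketch: the `toy` dataframe is made-up data standing in for `df`, and `max_depth=2` reflects one reading of "depth of three (including the root node and leaf nodes)", since scikit-learn's `max_depth` counts levels below the root.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows standing in for the lab's dataframe
toy = pd.DataFrame({
    "gender":   ["female", "male", "female", "male", "female", "male"],
    "eyecolor": ["Blue", "Brown", "Green", "Blue", "Brown", "Hazel"],
    "height":   [63.0, 71.0, 65.0, 70.0, 62.0, 73.0],
})

# Remove the class label before turning the features into dictionaries
feature_dicts = toy.drop(columns=["gender"]).to_dict("records")
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(feature_dicts)   # string values become one-hot columns such as 'eyecolor=Blue'

# Encode the class labels as integers (classes are sorted, so 'female' -> 0, 'male' -> 1)
y = LabelEncoder().fit_transform(toy["gender"])

# Assumption: max_depth=2 gives three levels of nodes once the root is counted
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(vec.feature_names_)
print("training accuracy:", clf.score(X, y))
```

`vec.feature_names_` lists the columns of `X` in order, which is useful later when labelling the nodes of the exported tree.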

In [None]:
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer #to turn categorical variables into numeric arrays
from sklearn import preprocessing #to transform the class labels

**Question 14: Using the following code snippet, export your decision tree to graphviz and visualize it. In your write-up, write down the interpretation of the rule at each node which is used to perform the splitting.**

To install graphviz, you may need to download the tool from [this website](https://graphviz.gitlab.io), and then pip3/conda install the Python libraries you do not have.

Mac users can use ```brew install graphviz``` instead of following the link, and Linux users can do the same using their favourite package manager (for example, Ubuntu users can use ```sudo apt-get install graphviz```), followed by the necessary pip3/conda installations.

In [10]:
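from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from IPython.display import Image
import pydotplus
from sklearn import tree  # provides tree.export_graphviz used below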

#
# clf = your classifier

dotfile = StringIO() 
tree.export_graphviz(clf, out_file=dotfile,
#                     feature_names=df.columns,  
#                          class_names=['Female', 'Male'],  
                         filled=True, rounded=True,  
                         special_characters=True)
                    
graph = pydotplus.graph_from_dot_data(dotfile.getvalue())
Image(graph.create_png())

**Question 15 (Extra Credit): For each of your leaf nodes, specify the percentage of 'female' rows in that node (out of the total number of rows at that node).**

See this notebook for the basics of training a decision tree in scikit-learn and exporting the outputs to view in graphviz: http://nbviewer.ipython.org/gist/tebarkley/b68c04d9b31e64ce6023