### Measuring Entropy

Entropy is a measure of a data set's disorder: how similar or different its members are.

- Suppose we classify a data set into N different classes (for example, a data set of animal attributes and their species).
 - The entropy is 0 (zero) if every item in the data belongs to the same class (everyone is an iguana).
 - The entropy is high if they're all different.
 
 
#### Computing Entropy

$$H(S) = -p_1 \ln p_1 - \dots - p_n \ln p_n$$

Here $p_i$ is the proportion of the data labeled with class $i$, for $i = 1, \dots, n$.
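As a rough illustration of the formula above, here is a minimal sketch in plain Python. The function name `entropy` and the sample labels are made up for this example; it uses the natural logarithm, matching the formula.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (natural log) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # c / total is the proportion of samples carrying a given class label
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

# Hypothetical label lists, made up for illustration
all_same = ["iguana"] * 4
mixed = ["iguana", "cat", "dog", "bird"]

print(entropy(all_same))  # every label identical: no disorder
print(entropy(mixed))     # every label different: ln(4) for four classes
```

With a single class the only proportion is 1 and ln 1 = 0, which is why the entropy vanishes in the "everyone is an iguana" case.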

## Decision Tree

- With machine learning, you can construct a flowchart that helps you decide on a classification for something.
- This is called a decision tree.
- It is another form of supervised learning.
 - Give it some sample data and the resulting classifications, and out comes a tree.

### Decision Tree Examples

- You want to build a resume filter.
- It makes its choices based on historical data.

### How Decision Trees Work

- At each step, find the attribute we can use to partition the data set to minimize the entropy of the data at the next step.
- The fancy term for this simple algorithm is ID3.
- It is a greedy algorithm: as it goes down the tree, it just picks the decision that reduces entropy the most at that stage.
 - This may not give an optimal tree, but it works.
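The greedy step described above can be sketched roughly as follows. This is not a full ID3 implementation, only the "pick the attribute that leaves the least entropy" choice made at a single node. The helper names (`split_entropy`, `best_attribute`) and the toy rows are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy (natural log) of a sequence of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log(c / total)
               for c in Counter(labels).values())

def split_entropy(rows, attribute, target):
    """Weighted average entropy of the target after partitioning by attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

def best_attribute(rows, attributes, target):
    """Greedy choice: the attribute whose partition leaves the least entropy."""
    return min(attributes, key=lambda a: split_entropy(rows, a, target))

# Hypothetical toy rows, made up for this illustration
rows = [
    {"interned": 1, "top_tier": 0, "hired": 1},
    {"interned": 1, "top_tier": 1, "hired": 1},
    {"interned": 0, "top_tier": 1, "hired": 0},
    {"interned": 0, "top_tier": 0, "hired": 0},
]

best = best_attribute(rows, ["interned", "top_tier"], "hired")
print(best)
```

In these toy rows, splitting on `interned` separates the hires from the non-hires completely, while splitting on `top_tier` leaves each group evenly mixed, so the greedy step picks `interned`. A full tree builder would repeat this choice on each resulting partition.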

### An Example of a Decision Tree

First we'll load some fake data on past hires I made up. Note how we use pandas to convert a csv file into a DataFrame:

In [17]:
import numpy as np
import pandas as pd
from sklearn import tree

cand = pd.read_csv('PastHires.csv', header = 0)

cand.head()

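| | Years Experience | Employed? | Previous employers | Level of Education | Top-tier school | Interned | Hired |
|---|---|---|---|---|---|---|---|
| 0 | 10 | Y | 4 | BS | N | N | Y |
| 1 | 0 | N | 0 | BS | Y | Y | Y |
| 2 | 7 | N | 6 | BS | N | N | N |
| 3 | 2 | Y | 1 | MS | Y | N | Y |
| 4 | 20 | N | 2 | PhD | Y | N | N |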


In [19]:
d = {'Y' :1, 'N' : 0}

cand['Hired'] = cand['Hired'].map(d)
cand['Employed?'] = cand['Employed?'].map(d)
cand['Interned'] = cand['Interned'].map(d)
cand['Top-tier school'] = cand['Top-tier school'].map(d)

# Encode education levels as ordered integers
da = {'BS': 0, 'MS': 1, 'PhD': 2}
cand['Level of Education'] = cand['Level of Education'].map(da)

cand

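| | Years Experience | Employed? | Previous employers | Level of Education | Top-tier school | Interned | Hired |
|---|---|---|---|---|---|---|---|
| 0 | 10 | 1 | 4 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 2 | 7 | 0 | 6 | 0 | 0 | 0 | 0 |
| 3 | 2 | 1 | 1 | 1 | 1 | 0 | 1 |
| 4 | 20 | 0 | 2 | 2 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 2 | 1 | 1 | 1 |
| 6 | 5 | 1 | 2 | 1 | 0 | 1 | 1 |
| 7 | 3 | 0 | 1 | 0 | 0 | 1 | 1 |
| 8 | 15 | 1 | 5 | 0 | 0 | 0 | 1 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |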


Next we need to separate the features from the target column that we're trying to build a decision tree for.

In [21]:
features = list(cand.columns[:6])
features

['Years Experience',
 'Employed?',
 'Previous employers',
 'Level of Education',
 'Top-tier school',
 'Interned']

Now actually construct the decision tree:

In [25]:
y = cand['Hired']
x = cand[features]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(x,y)

... and display it. Note that you need pydotplus installed for this to work (!pip install pydotplus). pydotplus in turn relies on the Graphviz binaries to render the image.

To read this decision tree, each condition branches left for "true" and right for "false". When you end up at a value, the value array represents how many samples exist in each target value. So value = [0. 5.] means there are 0 "no hires" and 5 "hires" by the time we get to that point. value = [3. 0.] means 3 no-hires and 0 hires.

In [26]:
from IPython.display import Image  
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
import pydotplus

dot_data = StringIO() 

tree.export_graphviz(clf, out_file = dot_data, feature_names = features)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  

In the run captured here, this cell failed with `ModuleNotFoundError: No module named 'pydotplus'` because pydotplus was not installed.
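If pydotplus or Graphviz is not available, one alternative is to print the tree as text with scikit-learn's `tree.export_text`. The sketch below makes several assumptions. It rebuilds the data by hand from the ten encoded rows displayed earlier (the actual PastHires.csv may contain more rows). It passes `criterion='entropy'` so that splits are scored by entropy as described in these notes, since the classifier's default is Gini impurity. The `random_state=0` argument is only there to make the output repeatable.

```python
from sklearn import tree

features = ['Years Experience', 'Employed?', 'Previous employers',
            'Level of Education', 'Top-tier school', 'Interned']

# Encoded rows copied from the table shown earlier; last column is 'Hired'
rows = [
    [10, 1, 4, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [7, 0, 6, 0, 0, 0, 0],
    [2, 1, 1, 1, 1, 0, 1],
    [20, 0, 2, 2, 1, 0, 0],
    [0, 0, 0, 2, 1, 1, 1],
    [5, 1, 2, 1, 0, 1, 1],
    [3, 0, 1, 0, 0, 1, 1],
    [15, 1, 5, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
X = [r[:6] for r in rows]  # feature columns
y = [r[6] for r in rows]   # target column

# criterion and random_state are assumptions made for this sketch
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf = clf.fit(X, y)

# export_text returns the fitted tree as an indented plain-text string
text = tree.export_text(clf, feature_names=features)
print(text)
```

Each indented line is a condition on one feature, and each leaf line reports the predicted class, so the output can be read the same way as the graphical tree.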

## Ensemble learning: using a random forest

We'll use a random forest of 10 decision trees to predict whether specific candidate profiles would be hired:

In [30]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(x, y)

#Predict employment of an employed 10-year veteran
print(clf.predict([[10, 1, 4, 0, 0, 0]]))

#...and an unemployed 10-year veteran
print(clf.predict([[10, 0, 4, 0, 0, 0]]))

[1]
[0]
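Because each tree in the forest is trained on a random resampling of a very small data set, the predictions above can change from run to run. The sketch below shows one way to make a run repeatable and to see how confident the forest is. It assumes the ten encoded rows displayed earlier are the whole data set (the actual PastHires.csv may contain more), and the `random_state=42` seed is an arbitrary choice for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Encoded rows copied from the table shown earlier; last column is 'Hired'
rows = [
    [10, 1, 4, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [7, 0, 6, 0, 0, 0, 0],
    [2, 1, 1, 1, 1, 0, 1],
    [20, 0, 2, 2, 1, 0, 0],
    [0, 0, 0, 2, 1, 1, 1],
    [5, 1, 2, 1, 0, 1, 1],
    [3, 0, 1, 0, 0, 1, 1],
    [15, 1, 5, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
X = [r[:6] for r in rows]  # feature columns
y = [r[6] for r in rows]   # target column

# Fixing random_state makes the forest, and so its predictions, repeatable
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# An employed 10-year veteran, same profile as in the cell above
profile = [[10, 1, 4, 0, 0, 0]]
print(clf.predict(profile))        # predicted class: 0 = not hired, 1 = hired
print(clf.predict_proba(profile))  # averaged class probabilities across the trees
```

A probability close to 0.5 suggests the trees disagree, which is a hint that the prediction could flip with a different seed.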


## Activity

Modify the test data to create an alternate universe where I hire everyone I normally wouldn't have, and vice versa. Compare the resulting decision tree to the one from the original data.