# Decision Trees

### Example 1: 

An $x$ means he is not windsurfing and an $o$ means he is. The data in this image is not linearly separable ($LS$).

<img src="dt_images/example_1.png" alt = "Wind-Surfing" style = "width: 500px;"/>

*Decision Trees* let you ask multiple questions, one after another. For example: is it windy (yes or no)? If it is windy, we need to make another decision. This leads to a tree-shaped graph of questions, as shown below.

### Example 2: 

<img src="dt_images/example_2.png" alt = "Wind-Surfing-Decision" style = "width: 500px;"/>

**Decision Trees** help us classify data by characterizing it as a series of questions and decisions.
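To make the idea of chained questions concrete, here is a minimal hand-written sketch of such a tree. The function name `can_windsurf` and the second question (is it sunny?) are assumptions made for illustration only; the actual questions are the ones shown in the figure.

```python
# Hypothetical hand-written decision tree for the windsurfing example.
# The second question (sunny?) is an assumption, not taken from the figure.
def can_windsurf(windy, sunny):
    if not windy:      # first question: is it windy?
        return False   # no wind, so no windsurfing
    if sunny:          # second question, asked only when it is windy
        return True
    return False


print(can_windsurf(windy=True, sunny=True))
print(can_windsurf(windy=False, sunny=True))
```

A learned decision tree has the same shape, except that the questions and their order are chosen from the training data rather than written by hand.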

### Example 3: 

<img src="dt_images/example_3.png" alt = "Decisions-Equalities" style = "width: 400px;"/>

In [6]:
# Decision tree code example.
from sklearn import tree  # import the decision tree module
X = [[0, 0], [1, 1]]  # training features
Y = [0, 1]  # training labels
clf = tree.DecisionTreeClassifier()  # create the classifier
clf = clf.fit(X, Y)  # fit the classifier to the training data
pred = clf.predict([[2., 2.]])  # predict the label of a new point

print({"Classifier and Prediction": (clf, pred)})  # output



{'Classifier and Prediction': (DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'), array([1]))}


# min_samples_split 

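Essentially, a split takes one node and divides its samples into child nodes. min_samples_split ($n$) is the minimum number of samples a node must contain before it can be split again. In the example below
$n = 2$, so nodes keep splitting until they are down to a single sample.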

### Example 4: 
<img src="dt_images/example_4.png" alt = "mss" style = "width: 400px;"/>

The more splits you have, the more tightly the tree fits the training data, which causes overfitting and can be problematic in some cases.
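The stopping rule can be sketched with a toy tree in plain Python. This is not how scikit-learn builds its trees: the hypothetical helper `count_leaves` always cuts a node in half instead of searching for the best split, and the alternating labels are synthetic. It only shows how the threshold limits the number of splits.

```python
def count_leaves(labels, min_samples_split):
    """Count the leaves of a toy tree that always splits a node in half."""
    is_pure = len(set(labels)) <= 1
    # Stop when the node is pure or too small to be split again.
    if is_pure or len(labels) < min_samples_split:
        return 1
    mid = len(labels) // 2
    return (count_leaves(labels[:mid], min_samples_split)
            + count_leaves(labels[mid:], min_samples_split))


# Synthetic worst case: 64 samples whose labels alternate 0, 1, 0, 1, ...
labels = [i % 2 for i in range(64)]
leaves_2 = count_leaves(labels, 2)
leaves_50 = count_leaves(labels, 50)
print("min_samples_split=2: {} leaves".format(leaves_2))
print("min_samples_split=50: {} leaves".format(leaves_50))
```

With a threshold of 2 the toy tree ends up with one leaf per sample, while a threshold of 50 stops after a single split.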

# min_samples_split and Overfitting

### Example 5: 

<img src="dt_images/example_5.png" alt = "mss-overfitting" style = "width: 500px;"/>

One of the two classifiers uses min_samples_split $= 2$ and the other uses min_samples_split $= 50$.

In [7]:
#example code

"""from sklearn import tree 
from sklearn.metrics import accuracy_score
clf_50 = tree.DecisionTreeClassifier(min_samples_split =50) 
clf_50 = clf_50.fit(features_train,labels_train) 
pred = clf_50.predict(features_test)
acc_min_samples_split_50 = accuracy_score(labels_test, pred)

clf_2 = tree.DecisionTreeClassifier(min_samples_split =2) 
clf_2 = clf_2.fit(features_train,labels_train) 
pred = clf_2.predict(features_test)
acc_min_samples_split_2 = accuracy_score(labels_test, pred)


def submitAccuracies():
  return {"acc_min_samples_split_2":round(acc_min_samples_split_2,3),
          "acc_min_samples_split_50":round(acc_min_samples_split_50,3)} 


Here's your output:
{'acc_min_samples_split_50': 0.912, 'acc_min_samples_split_2': 0.908}
          
          """


# Entropy 

Def: Entropy is a measure of impurity in a bunch of examples. It controls how a DT decides where to split the data.

### Example 6: 
<img src="dt_images/example_6.png" alt = "entropy" style = "width: 500px;"/>

Essentially, what we would like to achieve is more purity within the examples of the data set. The tree looks for variables and split points that create subsets that are as pure as possible.

Entropy $E$ is mathematically defined as:
$$ E = -\sum_{i}p_{i}\log_{2}(p_{i}) $$

* $p_{i}$ is the fraction of examples in class $i$
* $\sum_{i}$ is the sum over all classes available 
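The definition above translates almost directly into code. The following is a minimal sketch; the function name `entropy` and the placeholder labels `"a"` and `"b"` are chosen here for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = float(len(labels))
    counts = Counter(labels)
    # n / total is p_i, the fraction of examples in class i
    return sum(-(n / total) * math.log(n / total, 2)
               for n in counts.values())


mixed = entropy(["a", "a", "b", "b"])  # two classes, evenly mixed
pure = entropy(["a", "a", "a", "a"])   # a single class
print(mixed)
print(pure)
```

An evenly mixed two-class list gives the maximum entropy of 1.0, and a list with a single class gives an entropy of 0.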

### Example 7: 
<img src="dt_images/example_7.png" alt = "intuition" style = "width: 500px;"/>


### Example 8: 
<img src="dt_images/example_8.png" alt = "example_e" style = "width: 500px;"/>

* $p_{slow} = $ the fraction of slow examples = $\frac{ss}{ssff}$ or $\frac{1}{2}$ 
* $p_{fast} = $ the fraction of fast examples = $\frac{ff}{ssff}$ or $\frac{1}{2}$

Putting this all together we should have:

$$ E = -(p_{slow}\log_{2}(p_{slow}) + p_{fast}\log_{2}(p_{fast})) $$

or simply put:

$$ E_{ssff} = -(0.5*\log_{2}(0.5) + 0.5*\log_{2}(0.5)) = 1.0$$


In [19]:
# how to use the entropy formula
import math as m
E_ssff = -(0.5*m.log(0.5, 2) + 0.5*m.log(0.5, 2))
print(E_ssff)



1.0



# Information Gain 

### Example 9: 
<img src="dt_images/example_9.png" alt = "example_e" style = "width: 500px;"/>

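The decision tree algorithm chooses the splits that **maximize information gain**, which is the entropy of the parent minus the weighted average entropy of the children.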

* entropy of parent = 1.0
* entropy of children = $\frac{3}{4}(0.9183) + \frac{1}{4}(0)$ (weighted average over the child nodes)
* information gain = $1.0 - [\frac{3}{4}(0.9183) + \frac{1}{4}(0)] = 0.3113$

This is the information gain we get if we split on the grade. 
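The entropy of the larger child can be checked by hand. That node holds three of the four examples and is not pure, so its class fractions must be $\frac{2}{3}$ and $\frac{1}{3}$:

$$ E = -\left(\frac{2}{3}\log_{2}\left(\frac{2}{3}\right) + \frac{1}{3}\log_{2}\left(\frac{1}{3}\right)\right) \approx 0.918 $$

The smaller child holds a single example, so it is pure and its entropy is 0.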

If we split on bumpiness instead:

* entropy of parent = 1.0
* entropy of bumpy = $E_{bumpy} = -(p_{bs}*\log_{2}(p_{bs}) + p_{bf}*\log_{2}(p_{bf})) = 1.0$
* entropy of smooth = $E_{smooth} = -(p_{ss}*\log_{2}(p_{ss}) + p_{sf}*\log_{2}(p_{sf})) = 1.0$
* information gain = $E_{parent} - [1/2*E_{bumpy} + 1/2*E_{smooth}] = 0$

If we split on the speed limit:

* entropy of parent = 1.0
* entropy of yes = $E_{yes} = -(p_{ys}*\log_{2}(p_{ys}) + p_{yf}*\log_{2}(p_{yf})) = 0$
* entropy of no = $E_{no} = -(p_{ns}*\log_{2}(p_{ns}) + p_{nf}*\log_{2}(p_{nf})) = 0$
* information gain = $E_{parent} - [1/2*E_{yes} + 1/2*E_{no}] = 1.0$
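The three information gain calculations above can be reproduced with a short sketch. The helper names `entropy` and `information_gain` are hypothetical, and the contents of each child node are assumptions chosen to match the fractions quoted above (for the grade split, which class forms the majority of the larger child is a guess and does not affect the result).

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = float(len(labels))
    return sum(-(n / total) * math.log(n / total, 2)
               for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = float(len(parent))
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["s", "s", "f", "f"]  # s = slow, f = fast
# Assumed child contents for each candidate split.
gain_grade = information_gain(parent, [["s", "s", "f"], ["f"]])
gain_bumpy = information_gain(parent, [["s", "f"], ["s", "f"]])
gain_speed = information_gain(parent, [["s", "s"], ["f", "f"]])
print(round(gain_grade, 4), round(gain_bumpy, 4), round(gain_speed, 4))
```

The split with the largest gain, here the speed limit, is the one the tree would pick first.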

So if a split takes us from a parent entropy of 1.0 to child entropies of 0, giving an information gain of 1.0, that is the split we want to use for the decision tree. This is very important: calculations like this are what the decision tree does for every candidate split in the data set.

#### gini index 
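scikit-learn's DecisionTreeClassifier uses the gini index by default (criterion='gini' in the classifier output above) and also supports criterion='entropy' for information gain.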


# Conclusion 

Decision trees are prone to overfitting, especially when the data has a lot of features, so you need to be careful with parameters such as min_samples_split. Decision trees can also be combined into bigger classifiers with ensemble methods. Next, we will use decision trees to decide who wrote an email based on its text.