# Machine learning with Decision Trees

We will use the scikit-learn module, which is imported as sklearn.

In [117]:
import pandas as pd
from sklearn import tree
from treeviz import tree_print
from sklearn.metrics import accuracy_score

## Example: Iris Data

**Important:** for decision trees in sklearn the feature matrix has to be numerical!
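
If a data set contains non-numeric feature columns, one common way to make them numeric is one-hot encoding. The sketch below is only an illustration: the toy data frame and its column names (`outlook`, `humidity`) are made up for this example and are not part of the iris data.

```python
import pandas as pd

# Hypothetical toy data: 'outlook' is categorical, 'humidity' is already numeric.
toy_df = pd.DataFrame({
    "outlook": ["sunny", "rainy", "overcast", "sunny"],
    "humidity": [85, 70, 65, 90],
})

# One-hot encode the categorical column so that every feature is numeric.
numeric_df = pd.get_dummies(toy_df, columns=["outlook"], dtype=int)

# Every column should now report a numeric dtype.
print(numeric_df.dtypes)
```

The iris features used below are already numeric, so no encoding step is needed there.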

Let's read the data into a data frame: df

In [118]:
df = pd.read_csv("assets/iris.csv")
df.head()

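| | id | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |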


Set up the data according to sklearn specs: **feature matrix** and **target vector**:

In [119]:
features_df = df.drop(['id','Species'],axis=1)
features_df.head()

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [120]:
target_df = pd.DataFrame(df['Species'])
target_df.head()

| | Species |
|---|---|
| 0 | setosa |
| 1 | setosa |
| 2 | setosa |
| 3 | setosa |
| 4 | setosa |


We are ready to build our decision tree. Set up the model:

In [121]:
dtree = tree.DecisionTreeClassifier(criterion='entropy')

Build the model:

In [122]:
dtree.fit(features_df,target_df)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Show the actual model.  Decision trees are transparent models so we can just look at them:

In [123]:
tree_print(dtree,features_df)

if Petal.Length =< 2.450000047683716: 
  |then setosa
  |else if Petal.Width =< 1.75: 
  |  |then if Petal.Length =< 4.949999809265137: 
  |  |  |then if Petal.Width =< 1.6500000953674316: 
  |  |  |  |then versicolor
  |  |  |  |else virginica
  |  |  |else if Petal.Width =< 1.5499999523162842: 
  |  |  |  |then virginica
  |  |  |  |else if Sepal.Length =< 6.949999809265137: 
  |  |  |  |  |then versicolor
  |  |  |  |  |else virginica
  |  |else if Petal.Length =< 4.850000381469727: 
  |  |  |then if Sepal.Length =< 5.949999809265137: 
  |  |  |  |then versicolor
  |  |  |  |else virginica
  |  |  |else virginica
<------------->
Tree Depth:  5
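
For readers who do not have the course's treeviz.py at hand, scikit-learn ships its own text renderer, `tree.export_text`. The sketch below is an alternative, not the method used above, and it makes two assumptions: it loads scikit-learn's bundled copy of the iris data through `load_iris` instead of assets/iris.csv, so feature names look like `petal length (cm)` and classes are printed as integers, and it fixes `random_state=0` only to make the run repeatable.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for assets/iris.csv.
iris = load_iris()

clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(iris.data, iris.target)

# Render the fitted tree as indented if/else style text rules.
rules = tree.export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The exact thresholds and tie-breaking choices may differ from the tree_print output shown above.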


Let's **evaluate** our model.  Does it make any mistakes when we apply it back to the training data?  Recall that 'target_df' holds the vector with the original labels.  The idea is to apply our model to the training data 'features_df' and obtain **predicted** labels:

> A correct model is a model where the predicted labels equal the original training labels

In [124]:
predict_array = dtree.predict(features_df)      # produces an array of labels
predicted_labels = pd.DataFrame(predict_array)  # turn it into a DF
predicted_labels.columns = ['Species']          # name the column - same name as in target!

In [125]:
predicted_labels.head()

| | Species |
|---|---|
| 0 | setosa |
| 1 | setosa |
| 2 | setosa |
| 3 | setosa |
| 4 | setosa |


In [126]:
target_df.head()

| | Species |
|---|---|
| 0 | setosa |
| 1 | setosa |
| 2 | setosa |
| 3 | setosa |
| 4 | setosa |


Now see if the predicted labels equal the labels in target_df:

In [127]:
predicted_labels.equals(target_df)

True

### Our model is 100% correct on the training data!

# Model Accuracy

We can do better than a plain true-or-false answer to the question of how good our model is: **model accuracy**

Model accuracy is defined as:

> $\mbox{accuracy} = 1 - \frac{\mbox{number of errors}}{\mbox{size of data}}$

sklearn has a function for that: **accuracy_score**

In [128]:
from sklearn.metrics import accuracy_score

print("Our model accuracy is: {}".format(accuracy_score(target_df, predicted_labels)))

Our model accuracy is: 1.0
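
To make the accuracy formula concrete, here is a small sketch that computes it by hand. The two label vectors are hypothetical and were chosen so that exactly one prediction is wrong; they are not taken from the iris run above.

```python
import pandas as pd

# Hypothetical labels for illustration only.
actual = pd.Series(["setosa", "versicolor", "virginica", "virginica", "setosa"])
predicted = pd.Series(["setosa", "versicolor", "versicolor", "virginica", "setosa"])

# An error is any position where the predicted label differs from the actual one.
num_errors = int((actual != predicted).sum())

# accuracy = 1 - (number of errors) / (size of data)
accuracy = 1 - num_errors / len(actual)

print("errors:", num_errors)
print("accuracy:", accuracy)
```

With one error in five rows the formula gives 1 - 1/5, which is the same value accuracy_score would report for these two vectors.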


# Model Parameters

The decision tree model has many *hyperparameters* that we can change.  

```
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
```
The one parameter that we did set in our previous model was **criterion**, which we set to 'entropy':
```
dtree = tree.DecisionTreeClassifier(criterion='entropy')
```
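
With criterion='entropy' the tree prefers splits that reduce the entropy of the class labels. As a rough illustration of what that quantity measures, the sketch below computes the Shannon entropy of a label list. The helper name `label_entropy` and the label lists are made up for this example; this is not how sklearn implements the criterion internally.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    probs = [c / total for c in counts.values()]
    return sum(-p * math.log2(p) for p in probs)

# Hypothetical label lists: three equally frequent classes vs. a single class.
balanced = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
pure = ["setosa"] * 50

print(label_entropy(balanced))  # highest possible value for three classes
print(label_entropy(pure))      # a node with a single class has zero entropy
```

A split is attractive when the nodes it produces are closer to the pure case than the node it started from.
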
Another important parameter for decision trees is the **max_depth** parameter.  It helps us to control **model complexity**.

Let's build another model where we restrict the complexity...

In [129]:
dtree2 = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
dtree2.fit(features_df,target_df)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [130]:
tree_print(dtree2,features_df)

if Petal.Width =< 0.800000011920929: 
  |then setosa
  |else if Petal.Width =< 1.75: 
  |  |then versicolor
  |  |else virginica
<---->
Tree Depth:  2


In [131]:
predict_array2 = dtree2.predict(features_df)      # produces an array of labels
predicted_labels2 = pd.DataFrame(predict_array2)  # turn it into a DF
predicted_labels2.columns = ['Species']           # name the column - same name as in target!

print("Our model accuracy is: {}".format(accuracy_score(target_df, predicted_labels2)))

Our model accuracy is: 0.96


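**Observation**: by restricting the complexity of the model we often obtain very readable and
understandable models without sacrificing a lot of accuracy! Here the tree shrinks from depth 5 to depth 2 while the accuracy on the training data only drops from 1.0 to 0.96.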
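
The trade-off between depth and accuracy can be explored by fitting trees at several values of max_depth. The following is a sketch under assumptions: it uses scikit-learn's bundled iris data through `load_iris` rather than assets/iris.csv, fixes `random_state=0` for repeatability, and reads the fitted depth from the `tree_.max_depth` attribute. As in the rest of this notebook, accuracy is measured on the training data.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Assumption: scikit-learn's bundled iris data stands in for assets/iris.csv.
iris = load_iris()

results = {}
for depth in (1, 2, 3, None):  # None means "no depth limit"
    clf = tree.DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    )
    clf.fit(iris.data, iris.target)
    acc = accuracy_score(iris.target, clf.predict(iris.data))
    # Record the depth the tree actually reached and its training accuracy.
    results[depth] = (clf.tree_.max_depth, acc)
    print("max_depth={}: actual depth={}, accuracy={:.3f}".format(
        depth, clf.tree_.max_depth, acc))
```

Each row of the printout pairs a depth limit with the accuracy the resulting tree achieves, which makes it easy to pick the smallest tree that is still accurate enough.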

# Reading

* 2.1 [Understanding Data Types in Python](https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html)
* 3.0 [Data Manipulation with Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html)
* 3.1 [Introducing Pandas Objects](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)
* 4.0 [Visualization with Matplotlib](https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html)
* 4.2 [Simple Scatter Plots](https://jakevdp.github.io/PythonDataScienceHandbook/04.02-simple-scatter-plots.html)
* 5.0 [Machine Learning](https://jakevdp.github.io/PythonDataScienceHandbook/05.00-machine-learning.html)
* 5.1 [What Is Machine Learning?](https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html)
* 5.2 [Introducing Scikit-Learn](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html)


# Team Exercise

* Download the numeric play tennis data set from github: https://github.com/lutzhamel/ds/tree/master/assets
* Download the treeviz.py file from github: https://github.com/lutzhamel/ds/tree/master/assets
* Build a decision tree and print it using treeviz.py
* Try to determine whether the tree models the data set completely.
* Then find another data set with only **numeric features** and with **target labels**; **NO** regression data sets.
* Build a tree, visualize it, and then evaluate the model using *accuracy_score*.

The following are all the *import* statements you will need.
```
import pandas as pd
from sklearn import tree
from treeviz import tree_print
from sklearn.metrics import accuracy_score
```


## Teams

```
Team 0:	Cory Alexander Susallin 
Team 1:	Matt Aakash Harout 
Team 2:	Geron Gabe Kevin 
Team 3:	Baez Shehjar Aguilar 
Team 4:	Ben David_P Evelyn 
Team 5:	Kermalyn David_M Shamal 
Team 6:	Joe Maurice Peter 
Team 7:	Christopher Alber Najib 
```