In [None]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it
!test ! -e ds-assets && git clone https://github.com/lutzhamel/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/"
import sys
sys.path.append(home)      # add home folder to module search path

Already up to date.


# Machine learning with Decision Trees

We will use the scikit-learn library, which is imported as `sklearn`.

In [None]:
import pandas as pd
from sklearn import tree
from treeviz import tree_print
from sklearn.metrics import accuracy_score

## Example: Iris Data

**Important:** for decision trees in sklearn the feature matrix has to be numerical!
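
If a data set contains categorical feature columns, they have to be turned into numbers before the tree can be fitted. A minimal sketch of one common way to do this, one-hot encoding with `pd.get_dummies`, is shown here. The `toy_df` frame and its column names are made up for illustration and are not part of the iris data.

```python
import pandas as pd

# hypothetical toy data with one categorical feature column
toy_df = pd.DataFrame({
    "length": [1.4, 4.7, 6.0],
    "color": ["red", "blue", "red"],   # not numerical, so it needs encoding first
})

# one-hot encode the categorical column; numeric columns pass through unchanged
toy_features = pd.get_dummies(toy_df, columns=["color"], dtype=int)

print(toy_features.dtypes)
```

The iris features used below are already numerical, so no such step is needed there.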

Let's read the data into a data frame: df

In [None]:
df = pd.read_csv(home+"iris.csv")
df.head()

Unnamed: 0,id,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,1,5.1,3.5,1.4,0.2,setosa
1,2,4.9,3.0,1.4,0.2,setosa
2,3,4.7,3.2,1.3,0.2,setosa
3,4,4.6,3.1,1.5,0.2,setosa
4,5,5.0,3.6,1.4,0.2,setosa


Set up the data according to sklearn specs: **feature matrix** and **target vector**:

In [None]:
features_df = df.drop(columns=['id','Species'])
features_df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [None]:
target_df = df[['Species']]
target_df.head()

Unnamed: 0,Species
0,setosa
1,setosa
2,setosa
3,setosa
4,setosa


We are ready to build our decision tree. Set up the model:

In [None]:
dtree = tree.DecisionTreeClassifier(criterion='entropy')

Build the model:

In [None]:
dtree.fit(features_df,target_df)

Show the actual model.  Decision trees are transparent models so we can just look at them:

In [None]:
tree_print(dtree,features_df)

if Petal.Length =< 2.449999988079071: 
  |then setosa
  |else if Petal.Width =< 1.75: 
  |  |then if Petal.Length =< 4.950000047683716: 
  |  |  |then if Petal.Width =< 1.6500000357627869: 
  |  |  |  |then versicolor
  |  |  |  |else virginica
  |  |  |else if Petal.Width =< 1.550000011920929: 
  |  |  |  |then virginica
  |  |  |  |else if Sepal.Length =< 6.949999809265137: 
  |  |  |  |  |then versicolor
  |  |  |  |  |else virginica
  |  |else if Petal.Length =< 4.8500001430511475: 
  |  |  |then if Sepal.Length =< 5.950000047683716: 
  |  |  |  |then versicolor
  |  |  |  |else virginica
  |  |  |else virginica
<------------->
Tree Depth:  5
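
scikit-learn also ships its own text renderer for fitted trees, `sklearn.tree.export_text`, which can serve as an alternative when the `treeviz` helper is not available. The sketch below makes two assumptions: it loads the iris copy bundled with scikit-learn (`load_iris`) rather than `iris.csv`, and it limits the depth to 2 to keep the output short. Because no class names are passed, the leaves are reported as class indices rather than species names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# assumption: use the iris copy bundled with scikit-learn instead of iris.csv
iris = load_iris()

model = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# render the fitted tree as indented text rules
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```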


Let's **evaluate** our model. Does it make any mistakes when we apply it back to the training data? Recall that `target_df` holds the original labels. The idea is to apply our model to the training data `features_df` and obtain **predicted** labels:

> A correct model is a model where the predicted labels equal the original training labels.

We use the `predict` method of the tree model to compute a label for each row in the `features_df` dataframe.

In [None]:
predict_array = dtree.predict(features_df)      # produces an array of labels
type(predict_array)

numpy.ndarray

The result is an array rather than a dataframe, so we convert it into a dataframe that we can compare to `target_df`.

In [None]:
predict_df = pd.DataFrame(predict_array, columns=['Species'])  # turn it into a DF
type(predict_df)

pandas.core.frame.DataFrame

In [None]:
type(target_df)

pandas.core.frame.DataFrame

Now we can compare the two dataframes.

In [None]:
target_df.equals(predict_df)

True

Our model makes no mistakes on the training data!

Let's try a smaller tree by restricting the model to a maximum depth of 2.

In [None]:
dtree2 = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
dtree2.fit(features_df,target_df)
tree_print(dtree2,features_df)

if Petal.Length =< 2.449999988079071: 
  |then setosa
  |else if Petal.Width =< 1.75: 
  |  |then versicolor
  |  |else virginica
<---->
Tree Depth:  2


In [None]:
predict_df2 = pd.DataFrame(dtree2.predict(features_df), columns=['Species'])
target_df.equals(predict_df2)

False

The smaller tree makes some mistakes.
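
`equals` only tells us that the two dataframes differ, not where. One possible way to locate the misclassified rows is an element-wise comparison of the label columns. The sketch below uses small made-up frames (`true_df` and `pred_df` are hypothetical stand-ins for `target_df` and `predict_df2`) so that it runs on its own.

```python
import pandas as pd

# hypothetical labels standing in for target_df and predict_df2
true_df = pd.DataFrame({"Species": ["setosa", "versicolor", "virginica", "virginica"]})
pred_df = pd.DataFrame({"Species": ["setosa", "versicolor", "versicolor", "virginica"]})

# element-wise comparison: True where the predicted label differs
mismatch = true_df["Species"] != pred_df["Species"]

n_errors = int(mismatch.sum())

# side-by-side view of the misclassified rows only
errors = pd.DataFrame({
    "actual": true_df.loc[mismatch, "Species"],
    "predicted": pred_df.loc[mismatch, "Species"],
})
print(n_errors)
print(errors)
```

The same comparison should work on the notebook's own dataframes, since both use the default row index.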

# Model Accuracy

We can do better than a true/false answer to the question of how good our model is by computing the **model accuracy**.

Model accuracy is defined as:

> $ \mbox{accuracy} = 1 - \frac{(\mbox{number of errors})}{(\mbox{size of data})}$
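
To make the formula concrete, here is a small sketch that applies it by hand. The two label lists are made up for illustration; with one error in five rows the formula gives $1 - 1/5 = 0.8$.

```python
# hypothetical labels, used only to illustrate the formula
actual    = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]

# an error is a row where the predicted label differs from the original one
n_errors = sum(a != p for a, p in zip(actual, predicted))

accuracy = 1 - n_errors / len(actual)
print(n_errors, accuracy)
```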

sklearn has a function for that: **accuracy_score**

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
print("The accuracy of our first model is: {}".format(accuracy_score(target_df, predict_df)))

The accuracy of our first model is: 1.0


In [None]:
print("The accuracy of our second model is: {}".format(accuracy_score(target_df, predict_df2)))

The accuracy of our second model is: 0.96


# Model Parameters

The decision tree model has many *hyperparameters* that we can change.  

```
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
```
The parameters that we set in our previous models were **criterion** (set to `'entropy'`) and **max_depth**.
The **max_depth** parameter helps us control **model complexity**.


**Observation**: by restricting the complexity of the model we often obtain very readable and
understandable models without sacrificing a lot of accuracy!
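
One way to explore this trade-off is to fit trees of increasing depth and record their size and training accuracy. The sketch below is an illustration under stated assumptions: it uses the iris copy bundled with scikit-learn (`load_iris`) instead of `iris.csv`, fixes `random_state=0` for repeatability, and the `results` dictionary is a helper name introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# assumption: the iris copy bundled with scikit-learn stands in for iris.csv
X, y = load_iris(return_X_y=True)

results = {}
for depth in range(1, 6):
    model = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    model.fit(X, y)
    acc = accuracy_score(y, model.predict(X))   # accuracy on the training data
    results[depth] = (model.get_n_leaves(), acc)
    print(depth, model.get_n_leaves(), acc)
```

Each printed row shows the depth limit, the number of leaves, and the training accuracy, which makes it easy to see how much accuracy each extra level of complexity buys. Note that these are training accuracies, so they say nothing about how the trees perform on unseen data.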

# Reading

* 5.0 [Machine Learning](https://jakevdp.github.io/PythonDataScienceHandbook/05.00-machine-learning.html)
* 5.1 [What Is Machine Learning?](https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html)
* 5.2 [Introducing Scikit-Learn](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html)
