In [1]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it
!test ! -e ds-assets && git clone https://github.com/lutzhamel/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/"
import sys
sys.path.append(home)      # add home folder to module search path

Cloning into 'ds-assets'...
remote: Enumerating objects: 204, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 204 (delta 18), reused 34 (delta 12), pack-reused 164
Receiving objects: 100% (204/204), 9.57 MiB | 19.88 MiB/s, done.
Resolving deltas: 100% (78/78), done.


In [2]:
# we need the ID3 tree machine learning package
!pip3 install decision-tree-id-fork

Collecting decision-tree-id-fork
  Downloading decision_tree_id_fork-0.0.15-py3-none-any.whl (16 kB)
Installing collected packages: decision-tree-id-fork
Successfully installed decision-tree-id-fork-0.0.15


In [3]:
import pandas as pd

# Machine learning with Decision Trees



# ID3 Decision Trees

We introduced the ID3 decision tree algorithm in the previous slides.  The advantage of ID3 is that it is a very simple and straightforward algorithm.  The disadvantage is that it can **only deal with categorical attributes**.

Let's apply ID3 to our tennis dataset and evaluate the results.

In [4]:
tennis_df = pd.read_csv(home+"tennis.csv")
tennis_df.head()

Unnamed: 0,outlook,temp,humidity,windy,play
0,sunny,hot,high,weak,no
1,sunny,hot,high,strong,no
2,overcast,hot,high,weak,yes
3,rainy,mild,high,weak,yes
4,rainy,cool,normal,weak,yes


We have to split our dataset into a **feature matrix** and a **target vector**.

In [5]:
X = tennis_df.drop(columns=['play'])   # feature matrix
y = tennis_df[['play']]                # target vector

Next we have to instantiate our decision tree object and train it on the data or, in statistical jargon, "fit it to the data".

In [6]:
import id3 # Id3Estimator, export_text

In [7]:
tennis_tree = id3.Id3Estimator()  # instantiate object
tennis_tree.fit(X, y['play'])  # train it -- needs a series

Let's see what the decision tree looks like.

In [8]:
feature_names = list(X.columns)
print(id3.export_text(tennis_tree.tree_, feature_names))


outlook overcast: yes (4) 
outlook rainy
|   windy strong: no (2) 
|   windy weak: yes (3) 
outlook sunny
|   humidity high: no (3) 
|   humidity normal: yes (2) 



The above tree is just a different representation of the original tree we looked at.

![tennis-tree.png](https://raw.githubusercontent.com/lutzhamel/ds-assets/main/assets/tennis-tree.png)
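
The leaf counts in the printed tree are enough to check the entropy calculation behind the root split. The following is a minimal sketch, assuming the standard entropy and information-gain definitions used by ID3; the helper name `entropy` and the `branches` dictionary are introduced here for illustration and are not part of the `id3` package.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a label distribution given as class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# (yes, no) counts per outlook value, read off the leaves of the printed tree
branches = {"overcast": (4, 0), "rainy": (3, 2), "sunny": (2, 3)}

n_yes = sum(yes for yes, _ in branches.values())
n_no = sum(no for _, no in branches.values())
n = n_yes + n_no

# entropy of the labels before splitting
root_entropy = entropy((n_yes, n_no))

# weighted average entropy of the labels after splitting on outlook
remainder = sum((yes + no) / n * entropy((yes, no)) for yes, no in branches.values())

# information gain of the outlook split
gain = root_entropy - remainder
print("entropy before split: {:.3f}".format(root_entropy))
print("information gain of outlook: {:.3f}".format(gain))
```

The overcast branch is already pure (entropy 0), which is why the tree stops there and only splits further under rainy and sunny.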

To test the model, we apply it to the feature matrix and have it predict the label for each row.  We can then compare the predicted labels to the original labels to see whether our model made any mistakes.  We say that

> A correct model is a model where the predicted labels equal the original training labels

In [9]:
predict_df = pd.DataFrame(tennis_tree.predict(X), columns=['play'])
predict_df

Unnamed: 0,play
0,no
1,no
2,yes
3,yes
4,yes
5,no
6,yes
7,no
8,yes
9,yes


Let's compare the original labels in 'y' with the predicted labels in 'predict_df'.

In order to do this we put both vectors into a single dataframe using the Pandas **concat** function.

In [10]:
COLUMNS = 1
compare_df = pd.concat([y,predict_df],axis=COLUMNS)
compare_df.columns = ['original','predicted']
compare_df

Unnamed: 0,original,predicted
0,no,no
1,no,no
2,yes,yes
3,yes,yes
4,yes,yes
5,no,no
6,yes,yes
7,no,no
8,yes,yes
9,yes,yes


**Observation**: Every predicted label matches its original label.  That means **our model makes no mistakes** on the training data.
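
Scanning the table by eye does not scale to larger datasets. As a sketch, the same comparison can be done by filtering for rows where the two columns differ. The labels below are hypothetical and contain one deliberate error so that the filter has something to find.

```python
import pandas as pd

# Hypothetical labels for illustration; the second prediction is deliberately wrong
compare_df = pd.DataFrame({
    "original":  ["no", "no", "yes", "yes"],
    "predicted": ["no", "yes", "yes", "yes"],
})

# keep only the rows where the prediction differs from the original label
mismatches = compare_df[compare_df["original"] != compare_df["predicted"]]
print("number of mistakes:", len(mismatches))
print(mismatches)
```

An empty `mismatches` dataframe means the model reproduces every label it was compared against.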

# Model Accuracy

We can do a little bit better than manually comparing original vs. predicted labels using **model accuracy**.

Model accuracy is defined as:

> $ \mbox{accuracy} = \big(1 - \frac{(\mbox{number of errors})}{(\mbox{size of data})}\big) \times 100\%$

Sklearn has a function for that, **accuracy_score**. Note, however, that this function returns the score as a fraction between 0 and 1.  We need to multiply that number by 100 in order to obtain a percentage-based accuracy.

In [11]:
from sklearn.metrics import accuracy_score

In [12]:
print("The accuracy of our model is: {}%".format(accuracy_score(y, predict_df)*100))

The accuracy of our model is: 100.0%
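
The same number can be computed by hand from the accuracy definition above. This is a minimal sketch in plain Python; `accuracy_percent` is a helper name made up for this example, and the labels are hypothetical.

```python
def accuracy_percent(original, predicted):
    """Accuracy as a percentage: (1 - errors / size of data) * 100."""
    if len(original) != len(predicted):
        raise ValueError("label lists must have the same length")
    errors = sum(o != p for o, p in zip(original, predicted))
    return (1 - errors / len(original)) * 100

# Hypothetical labels: one error in four rows
original = ["no", "no", "yes", "yes"]
predicted = ["no", "yes", "yes", "yes"]

result = accuracy_percent(original, predicted)
print("The accuracy is: {}%".format(result))   # prints 75.0%
```

For the same inputs, `accuracy_score(original, predicted) * 100` should give the same value.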


# Sklearn CART Trees

The problem with ID3 is that it can only handle categorical data.  The **Classification And Regression Tree (CART) model** in Sklearn can handle numerical data (actually, all its independent variables have to be numerical -- more on that later).
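
As a preview of what that numerical requirement means in practice, here is a sketch of one common workaround: one-hot encoding categorical columns with `pd.get_dummies` before handing them to the sklearn tree. This is an assumption about how one might proceed, not necessarily the approach taken later in the course. The data is the five tennis rows shown by `tennis_df.head()` earlier.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# First five rows of the tennis data, as printed by tennis_df.head()
tennis_head = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "rainy", "rainy"],
    "temp":     ["hot", "hot", "hot", "mild", "cool"],
    "humidity": ["high", "high", "high", "high", "normal"],
    "windy":    ["weak", "strong", "weak", "weak", "weak"],
    "play":     ["no", "no", "yes", "yes", "yes"],
})
X_cat = tennis_head.drop(columns=["play"])   # categorical feature matrix
y_cat = tennis_head["play"]                  # target vector

# one 0/1 column per (attribute, value) pair
X_num = pd.get_dummies(X_cat, dtype=int)
print(list(X_num.columns))

# the numerical matrix can now be used to train a CART tree
cart = DecisionTreeClassifier(criterion="entropy", random_state=0)
cart.fit(X_num, y_cat)
print(list(cart.predict(X_num)))
```

Fitting the sklearn tree directly on `X_cat` would fail, because the string values cannot be converted to numbers.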

Let's try the sklearn tree model using the iris dataset.

In [13]:
iris_df = pd.read_csv(home+"iris.csv")
iris_df.head()

Unnamed: 0,id,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,1,5.1,3.5,1.4,0.2,setosa
1,2,4.9,3.0,1.4,0.2,setosa
2,3,4.7,3.2,1.3,0.2,setosa
3,4,4.6,3.1,1.5,0.2,setosa
4,5,5.0,3.6,1.4,0.2,setosa


Set up the data according to sklearn specs: **feature matrix** and **target vector**:

In [14]:
features_df = iris_df.drop(columns=['id','Species'])
features_df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


**Note**: all features are numerical, as required by sklearn.

In [15]:
target_df = iris_df[['Species']]
target_df.head()

Unnamed: 0,Species
0,setosa
1,setosa
2,setosa
3,setosa
4,setosa


We are ready to build our decision tree.

In [16]:
from sklearn import tree
from treeviz import tree_print  # treeviz is a module from ds-assets

Instantiate the decision tree object.

In [17]:
iris_tree = tree.DecisionTreeClassifier(criterion='entropy')

Train the model.

In [18]:
iris_tree.fit(features_df,target_df)

Show the actual model.  Decision trees are transparent models so we can just look at them:

In [19]:
tree_print(iris_tree,features_df)

if Petal.Length =< 2.449999988079071: 
  |then setosa
  |else if Petal.Width =< 1.75: 
  |  |then if Petal.Length =< 4.950000047683716: 
  |  |  |then if Petal.Width =< 1.6500000357627869: 
  |  |  |  |then versicolor
  |  |  |  |else virginica
  |  |  |else if Petal.Width =< 1.550000011920929: 
  |  |  |  |then virginica
  |  |  |  |else if Sepal.Length =< 6.949999809265137: 
  |  |  |  |  |then versicolor
  |  |  |  |  |else virginica
  |  |else if Petal.Length =< 4.8500001430511475: 
  |  |  |then if Sepal.Length =< 5.950000047683716: 
  |  |  |  |then versicolor
  |  |  |  |else virginica
  |  |  |else virginica
<------------->
Tree Depth:  5


We use the `predict` method to compute a predicted label for each row in the `features_df` dataframe.

In [20]:
predict_df = pd.DataFrame(iris_tree.predict(features_df), columns=['Species'])

Let's compute the accuracy of our model.

In [21]:
from sklearn.metrics import accuracy_score
print("The accuracy of our model is: {}%".format(accuracy_score(target_df, predict_df)*100))

The accuracy of our model is: 100.0%


Let's try a smaller tree by restricting our tree model to a maximum depth of 2.

In [22]:
iris_tree2 = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
iris_tree2.fit(features_df,target_df)
tree_print(iris_tree2,features_df)

if Petal.Length =< 2.449999988079071: 
  |then setosa
  |else if Petal.Width =< 1.75: 
  |  |then versicolor
  |  |else virginica
<---->
Tree Depth:  2


In [23]:
predict_df2 = pd.DataFrame(iris_tree2.predict(features_df), columns=['Species'])

In [24]:
print("The accuracy of our model is: {}%".format(accuracy_score(target_df, predict_df2)*100))

The accuracy of our model is: 96.0%


**Observation**: by restricting the complexity of the model we often obtain very readable and
understandable models without sacrificing a lot of accuracy!
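
To see the trade-off between tree depth and accuracy in one place, the following sketch trains trees at several depth limits and reports the training accuracy of each. It assumes that the iris data bundled with sklearn (`load_iris`) is an acceptable stand-in for `iris.csv` from ds-assets, and it fixes `random_state` only to make runs repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: sklearn's bundled iris data stands in for ds-assets' iris.csv
X_iris, y_iris = load_iris(return_X_y=True)

scores = {}
for depth in (1, 2, 3, None):   # None means no depth limit
    model = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    model.fit(X_iris, y_iris)
    # accuracy on the same data the tree was trained on, as a percentage
    scores[depth] = accuracy_score(y_iris, model.predict(X_iris)) * 100
    print("max_depth={}: {:.1f}% training accuracy".format(depth, scores[depth]))
```

A depth-1 tree can only separate one species from the other two, so its accuracy is low; most of the gain comes from the second level.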

# Model Parameters

The sklearn decision tree model has many *hyperparameters* that we can change.  The exact list depends on the installed sklearn version.

```
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
```
The parameters that we set in our previous models were **criterion** (set to `'entropy'`) and **max_depth**.
The **max_depth** parameter helps us to control **model complexity**.
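
Because the parameter list changes between sklearn releases, it can be useful to ask the installed version directly. A short sketch using the estimator's `get_params` method:

```python
from sklearn.tree import DecisionTreeClassifier

# same settings as the restricted iris tree above
model = DecisionTreeClassifier(criterion="entropy", max_depth=2)

# get_params returns a dict of hyperparameter names and their current values
params = model.get_params()
for name in sorted(params):
    print("{} = {!r}".format(name, params[name]))
```

Parameters we did not set explicitly appear with their default values.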


# Reading

* 5.0 [Machine Learning](https://jakevdp.github.io/PythonDataScienceHandbook/05.00-machine-learning.html)
* 5.1 [What Is Machine Learning?](https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html)
* 5.2 [Introducing Scikit-Learn](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html)
