In [1]:
###### Set Up #####
import sys
import os
import platform

colab = 'google.colab' in sys.modules  # True when running in Google Colab
system = platform.system() # "Windows", "Linux", "Darwin"

if colab:
  # running in google colab
  # update/clone ds-assets repo
  !test -e ds-assets && cd ds-assets && git pull && cd ..
  !test ! -e ds-assets && git clone https://github.com/lutzhamel/ds-assets.git
  home = "ds-assets/assets/"
else:
  # running on local machine
  # set this to the folder containing the DS assets
  home = "ds-assets/assets/"

sys.path.append(home)      # add home folder to module search path

In [2]:
# for this notebook we need the ID3 tree machine learning package
!pip3 install decision-tree-id-fork # installs ID3


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
# notebook level imports
import pandas as pd
import id3                  # Id3Estimator, export_text
from sklearn import tree    # DecisionTreeClassifier, export_text
from sklearn import metrics # accuracy_score

# Machine learning with Decision Trees



# ID3 Decision Trees

We introduced the ID3 decision tree algorithm in the previous slides.  The advantage of ID3 is that it is a very simple and straightforward algorithm.  The disadvantage is that it can **only deal with categorical attributes**.
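As a reminder of how ID3 chooses the attribute to split on, here is a minimal sketch of entropy and information gain, the criterion ID3 uses. The helper names (`entropy`, `information_gain`) and the four-row toy dataset are made up for illustration; they are not part of the `id3` package or the tennis data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one categorical attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# hypothetical toy data, not the tennis dataset
toy_rows = [
    {"windy": "weak",   "humidity": "high",   "play": "yes"},
    {"windy": "weak",   "humidity": "normal", "play": "yes"},
    {"windy": "strong", "humidity": "high",   "play": "no"},
    {"windy": "strong", "humidity": "normal", "play": "no"},
]

gains = {a: information_gain(toy_rows, a, "play") for a in ("windy", "humidity")}
best_attribute = max(gains, key=gains.get)   # attribute ID3 would split on first
print(gains, best_attribute)
```

In this toy data `windy` separates the two classes perfectly while `humidity` tells us nothing, so `windy` has the larger gain and would become the root of the tree.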

Let's apply ID3 to our tennis dataset and evaluate the results.

In [4]:
tennis_df = pd.read_csv(home+"tennis.csv")
tennis_df.head()

Unnamed: 0,outlook,temp,humidity,windy,play
0,sunny,hot,high,weak,no
1,sunny,hot,high,strong,no
2,overcast,hot,high,weak,yes
3,rainy,mild,high,weak,yes
4,rainy,cool,normal,weak,yes


We have to split our dataset into a **feature matrix** and a **target vector**.

In [5]:
X = tennis_df.drop(columns=['play']) # feature matrix
y = tennis_df[['play']]              # target vector

Next we have to instantiate our decision tree object and train it on the data or, in statistical jargon, "fit it to the data".

In [6]:
# instantiate ID3 object
tennis_tree = id3.Id3Estimator()  

# train - ID3 requires the target to be a series
tennis_tree.fit(X, y['play'])     
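The difference between a one-column dataframe and a series is easy to miss, so here is a small sketch of the two indexing forms used above. The two-row `toy_df` is a hypothetical stand-in for the tennis data.

```python
import pandas as pd

# hypothetical two-row frame standing in for the tennis data
toy_df = pd.DataFrame({"windy": ["weak", "strong"], "play": ["yes", "no"]})

as_frame = toy_df[["play"]]   # list of column names -> DataFrame, shape (2, 1)
as_series = toy_df["play"]    # single column name   -> Series, shape (2,)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

This is why the cell above selects `y['play']` from the one-column dataframe `y` before passing it to `fit`.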

Let's see what the decision tree looks like.

In [7]:
print(id3.export_text(tennis_tree.tree_, 
                      feature_names=list(X.columns)))


outlook overcast: yes (4) 
outlook rainy
|   windy strong: no (2) 
|   windy weak: yes (3) 
outlook sunny
|   humidity high: no (3) 
|   humidity normal: yes (2) 



The above tree is just a different representation of the original tree we looked at.

![tennis-tree.png](https://raw.githubusercontent.com/lutzhamel/ds-assets/main/assets/tennis-tree.png)

To test the model, we apply it to the feature matrix and have it predict a label for each row.  We then compare the predicted labels to the original labels to see whether the model made any mistakes.  We say that

> A correct model is a model where the predicted labels equal the original training labels

In [8]:
predict_df = pd.DataFrame(tennis_tree.predict(X), 
                          columns=['predicted'])
predict_df

Unnamed: 0,predicted
0,no
1,no
2,yes
3,yes
4,yes
5,no
6,yes
7,no
8,yes
9,yes


Let's compare the original labels in 'y' with the predicted labels in 'predict_df'.

In order to do this we put both vectors into a single dataframe using the Pandas **concat** function.

In [9]:
COLUMNS = 1
compare_df = pd.concat([y,predict_df],axis=COLUMNS)
compare_df

Unnamed: 0,play,predicted
0,no,no
1,no,no
2,yes,yes
3,yes,yes
4,yes,yes
5,no,no
6,yes,yes
7,no,no
8,yes,yes
9,yes,yes


**Observation**: There is a 1-to-1 correspondence between the original labels and the predicted labels.  That means **our model makes no mistakes**.
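For larger datasets, inspecting the comparison table by eye does not scale. A minimal sketch of letting Pandas find the mismatches, assuming a comparison dataframe with the same two columns as above; `toy_compare_df` and its labels are invented so that there is one mistake to find.

```python
import pandas as pd

# hypothetical labels with one deliberate mistake in row 2
toy_compare_df = pd.DataFrame({
    "play":      ["no", "yes", "yes", "no"],
    "predicted": ["no", "yes", "no",  "no"],
})

# boolean Series: True where the prediction differs from the original label
is_error = toy_compare_df["play"] != toy_compare_df["predicted"]

n_errors = int(is_error.sum())
print("number of errors:", n_errors)
print(toy_compare_df[is_error])   # only the misclassified rows
```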

# Model Accuracy

We can do a little bit better than manually comparing original vs. predicted labels using **model accuracy**.

Model accuracy is defined as:

> $\text{accuracy} = \left(1 - \frac{\#\,\text{errors}}{\#\,\text{observations}}\right) \times 100\%$

Sklearn has a function for that, **accuracy_score**. Note, however, that this function returns the score as a fraction between 0 and 1.  We need to multiply that number by 100 in order to obtain a percentage-based accuracy.
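To make the accuracy formula concrete, here is a small sketch that computes it by hand in plain Python. The function name `accuracy_percent` and the labels are hypothetical; in the notebook itself we use `metrics.accuracy_score`.

```python
def accuracy_percent(actual, predicted):
    """Accuracy as a percentage: (1 - errors / observations) * 100."""
    if len(actual) != len(predicted):
        raise ValueError("label lists must have the same length")
    errors = sum(a != p for a, p in zip(actual, predicted))
    return (1 - errors / len(actual)) * 100

# hypothetical labels: one mistake in four observations
actual    = ["yes", "no", "yes", "no"]
predicted = ["yes", "no", "no",  "no"]

result = accuracy_percent(actual, predicted)
print("The accuracy is: {}%".format(result))
```

With one error in four observations the formula gives (1 - 1/4) x 100% = 75%.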

In [10]:
# y is the original target vector and predict_df 
# is our computed target
print("The accuracy of our model is: {}%"
      .format(metrics.accuracy_score(y, predict_df)*100))

The accuracy of our model is: 100.0%


# The SciKit-Learn Machine Learning Package

The [SciKit-Learn package](https://scikit-learn.org/stable/) (sklearn for short) is one of the most popular and mature Python machine learning packages.  We will use it almost exclusively for our work here.  We will start out with decision trees known as **C**lassification **A**nd **R**egression **T**rees (CART).

# Sklearn CART Trees

The problem with ID3 is that it can only handle categorical data.  The sklearn [CART](https://scikit-learn.org/stable/api/sklearn.tree.html) model can handle numerical data. More specifically, the independent variables of CART need to be numerical and the target can be categorical (classification) or numerical (regression).  We will look at converting categorical independent attributes to numerical attributes later.

Let's try the sklearn tree model using the iris dataset (for the necessary imports see the beginning of the notebook).

In [11]:
iris_df = pd.read_csv(home+"iris.csv")
iris_df.head()

Unnamed: 0,id,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,1,5.1,3.5,1.4,0.2,setosa
1,2,4.9,3.0,1.4,0.2,setosa
2,3,4.7,3.2,1.3,0.2,setosa
3,4,4.6,3.1,1.5,0.2,setosa
4,5,5.0,3.6,1.4,0.2,setosa


Set up the data according to sklearn specs: **feature matrix** and **target vector**:

In [12]:
features_df = iris_df.drop(columns=['id','Species'])
features_df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


**Note**: all features are numerical, as required by sklearn.

In [13]:
target_df = iris_df[['Species']]
target_df.head()

Unnamed: 0,Species
0,setosa
1,setosa
2,setosa
3,setosa
4,setosa


**Note**: the target is a categorical attribute, which makes this a classification problem.

We are ready to build our decision tree.  Instantiate the decision tree object.

In [14]:
iris_tree = tree.DecisionTreeClassifier()

Train the model.

In [15]:
iris_tree.fit(features_df,target_df)

Show the actual model.  Decision trees are transparent models so we can just look at them:

In [16]:
print(tree.export_text(iris_tree,
                       feature_names=list(features_df.columns)))

|--- Petal.Length <= 2.45
|   |--- class: setosa
|--- Petal.Length >  2.45
|   |--- Petal.Width <= 1.75
|   |   |--- Petal.Length <= 4.95
|   |   |   |--- Petal.Width <= 1.65
|   |   |   |   |--- class: versicolor
|   |   |   |--- Petal.Width >  1.65
|   |   |   |   |--- class: virginica
|   |   |--- Petal.Length >  4.95
|   |   |   |--- Petal.Width <= 1.55
|   |   |   |   |--- class: virginica
|   |   |   |--- Petal.Width >  1.55
|   |   |   |   |--- Petal.Length <= 5.45
|   |   |   |   |   |--- class: versicolor
|   |   |   |   |--- Petal.Length >  5.45
|   |   |   |   |   |--- class: virginica
|   |--- Petal.Width >  1.75
|   |   |--- Petal.Length <= 4.85
|   |   |   |--- Sepal.Width <= 3.10
|   |   |   |   |--- class: virginica
|   |   |   |--- Sepal.Width >  3.10
|   |   |   |   |--- class: versicolor
|   |   |--- Petal.Length >  4.85
|   |   |   |--- class: virginica



We use the `predict` function  to compute a new label for each row in the `features_df` dataframe.

In [17]:
predict_df = pd.DataFrame(iris_tree.predict(features_df),
                          columns=['Species'])

Let's compute the accuracy of our model.

In [18]:
print("The accuracy of our model is: {}%".format(metrics.accuracy_score(target_df, predict_df)*100))

The accuracy of our model is: 100.0%


Let's try a smaller tree by restricting our tree model to a maximum depth of 2.

In [19]:
iris_tree2 = tree.DecisionTreeClassifier(max_depth=2)
iris_tree2.fit(features_df,target_df)
print(tree.export_text(iris_tree2,
                       feature_names=list(features_df)))

|--- Petal.Length <= 2.45
|   |--- class: setosa
|--- Petal.Length >  2.45
|   |--- Petal.Width <= 1.75
|   |   |--- class: versicolor
|   |--- Petal.Width >  1.75
|   |   |--- class: virginica



In [20]:
predict_df2 = pd.DataFrame(iris_tree2.predict(features_df), columns=['Species'])

In [21]:
print("The accuracy of our model is: {}%".format(metrics.accuracy_score(target_df, predict_df2)*100))

The accuracy of our model is: 96.0%


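**Observation**: by restricting the complexity of the model we often obtain very readable and
understandable models without sacrificing a lot of accuracy!  Here the depth-2 tree has only three leaves, and its accuracy on the data drops only from 100% to 96%.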
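The trade-off between tree size and accuracy can be explored for several depth limits at once. The following is a sketch under a few assumptions: it uses sklearn's bundled copy of the iris data (`load_iris`) rather than the course CSV, fixes `random_state=0` so that repeated runs give the same trees, and measures accuracy on the training data just as we did above.

```python
from sklearn import metrics, tree
from sklearn.datasets import load_iris

# sklearn's bundled copy of iris, used here instead of the course CSV
features, target = load_iris(return_X_y=True)

results = {}
for depth in (1, 2, 3, None):          # None lets the tree grow fully
    model = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(features, target)
    accuracy = metrics.accuracy_score(target, model.predict(features)) * 100
    # record actual depth, number of leaves, and training accuracy
    results[depth] = (model.get_depth(), model.get_n_leaves(), accuracy)
    print("max_depth={}: depth={}, leaves={}, training accuracy={:.1f}%"
          .format(depth, *results[depth]))
```

The number of leaves is a rough measure of how much there is to read in the printed tree, so the output shows how quickly readability is traded for accuracy as the depth limit is relaxed.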

# Model Parameters

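The sklearn decision tree model has many *hyperparameters* that we can change.  The listing below was printed by an older sklearn release; current releases no longer have `min_impurity_split` and `presort`, and the default `criterion` is `'gini'`.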

```
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
```
One of the parameters that we set in our previous model was **max_depth**.
The **max_depth** parameter helps us to control **model complexity**.
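Because the exact set of hyperparameters depends on the installed sklearn version, it can be useful to ask the model itself. A minimal sketch using the estimator's `get_params` method; the variable names are only illustrative.

```python
from sklearn import tree

model = tree.DecisionTreeClassifier(max_depth=2)
params = model.get_params()           # dict: hyperparameter name -> current value

for name in sorted(params):
    print("{} = {!r}".format(name, params[name]))
```

The printed list shows `max_depth = 2` alongside the defaults of all the parameters we did not set.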


# Reading

* 5.0 [Machine Learning](https://jakevdp.github.io/PythonDataScienceHandbook/05.00-machine-learning.html)
* 5.1 [What Is Machine Learning?](https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html)
* 5.2 [Introducing Scikit-Learn](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html)
