# Coding a simple decision tree
---

In this worksheet we are going to work with a data set, using the idea of a decision tree class.  We are going to simplify the model and use Python code to make a simple decision tree classification model.  We will do this for two reasons:
*   writing the code is often good for helping to understand what is going on under the bonnet of a library function
*   it is a good coding exercise for practice as it mostly depends on calculations and if..elif..else statements

In this worksheet we are going to code a decision tree which will use the calculated probabilities to make decisions about wheter a row of given data would be classified as Iris-virginica, or not, based on sepal and petal dimensions.  It is easier to classify between two values (Iris-virginica or not).  Later, using this information, species would be further predicted by probabilities of error.

![Iris-petals and sepals](https://www.math.umd.edu/~petersd/666/html/iris_with_labels.jpg)

The workflow is:
*  divide the data set into 70% of the rows for training and 30% for testing  (we can increase the size of the train set later)
*  find the median for each of the 4 size columns
*  calculate the proportion of each column that are on or above median that are of a species (ie proportion of petal-lengths on or above median that are Iris-virginica)
*  infer the proportion of each that are not of that species (using 1 - proportion above).  In both cases we are looking to find if either of these is 1, which could be infered as definitely not that species. 
*  calculate a Gini Index that will indicate the probability that a prediction will be incorrect
*  use the results of the Gini Index to model a decision tree
*  code the decision tree model into a function that will return whether or not a row in the test set is predicted to be of species Iris-virginica
*  use the decision tree function to predict, for each row in the test set, if the species will be Iris-virginicia or not, using a set of nested if statements to classify
*  compare the predicted values against the actual values in the test set - what proportion were predicted correctly?


### Exercise 1 - investigate the iris data set
---
Let's start by looking at the data.  We are going to use a data set that contains data on iris flowers.

Read the data at this location: https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv into a dataframe called iris_data

The columns in the CSV file do not have headings, when you read the file, add column headings like this:
```
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
iris_data = pd.read_csv(url, name=names)
```
*  Take a look at the column info (how many columns, what type of data, any missing data?)
*  Take a look at the data values in the first 10 and the last 10 records to get an idea of the type of values included
*  Find out how many unique values there are in the species column
*  Find out the maximum, minimum, median and upper and lower quartile values in each of the columns


In [20]:
import pandas as pd
import numpy as np
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
iris_data = pd.read_csv(url, names= names)

display(iris_data)
iris_data.info
iris_data.dropna()
print(iris_data["species"].value_counts())

def q1(x):
    return x.quantile(0.25)
def q3(x):
    return x.quantile(0.75)

print(iris_data.agg(
     {      "sepal-length": ["min", "max", "median", q1, q3],
            "sepal-width": ["min", "max", "median", q1, q3],
            "petal-length": ["min", "max", "median", q1, q3],
            "petal-width": ["min", "max", "median", q1, q3]
     }
))

Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


Iris-virginica     50
Iris-setosa        50
Iris-versicolor    50
Name: species, dtype: int64
        sepal-length  sepal-width  petal-length  petal-width
min              4.3          2.0          1.00          0.1
max              7.9          4.4          6.90          2.5
median           5.8          3.0          4.35          1.3
q1               5.1          2.8          1.60          0.3
q3               6.4          3.3          5.10          1.8


### Exercise 2 - split the data into train and test sets
---

Split the data set into and 70% train, 30% test, split.  Run the cell below. Add code to inspect the `train` data set.


In [109]:
# import the train_test_split function
from sklearn.model_selection import train_test_split


# create the classification variables from the all columns
train, test = train_test_split(iris_data, test_size=0.20)

display(train)
display(test)
print(train.agg(
     {      "sepal-length": ["min", "max", "median", q1, q3],
            "sepal-width": ["min", "max", "median", q1, q3],
            "petal-length": ["min", "max", "median", q1, q3],
            "petal-width": ["min", "max", "median", q1, q3]
     }
))
print(test.agg(
     {      "sepal-length": ["min", "max", "median", q1, q3],
            "sepal-width": ["min", "max", "median", q1, q3],
            "petal-length": ["min", "max", "median", q1, q3],
            "petal-width": ["min", "max", "median", q1, q3]
     }
))
train.loc[train['species'] == 'Iris-virginica']

Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,species
28,5.2,3.4,1.4,0.2,Iris-setosa
54,6.5,2.8,4.6,1.5,Iris-versicolor
118,7.7,2.6,6.9,2.3,Iris-virginica
59,5.2,2.7,3.9,1.4,Iris-versicolor
137,6.4,3.1,5.5,1.8,Iris-virginica
...,...,...,...,...,...
147,6.5,3.0,5.2,2.0,Iris-virginica
109,7.2,3.6,6.1,2.5,Iris-virginica
5,5.4,3.9,1.7,0.4,Iris-setosa
16,5.4,3.9,1.3,0.4,Iris-setosa


Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,species
23,5.1,3.3,1.7,0.5,Iris-setosa
34,4.9,3.1,1.5,0.1,Iris-setosa
85,6.0,3.4,4.5,1.6,Iris-versicolor
101,5.8,2.7,5.1,1.9,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
13,4.3,3.0,1.1,0.1,Iris-setosa
148,6.2,3.4,5.4,2.3,Iris-virginica
55,5.7,2.8,4.5,1.3,Iris-versicolor
89,5.5,2.5,4.0,1.3,Iris-versicolor
47,4.6,3.2,1.4,0.2,Iris-setosa


        sepal-length  sepal-width  petal-length  petal-width
min            4.400        2.200           1.0          0.1
max            7.700        4.400           6.9          2.5
median         5.800        3.000           4.3          1.3
q1             5.100        2.800           1.5          0.3
q3             6.425        3.325           5.1          1.8
        sepal-length  sepal-width  petal-length  petal-width
min            4.300          2.0          1.10        0.100
max            7.900          4.0          6.70        2.500
median         5.800          3.0          4.45        1.300
q1             5.025          2.7          2.10        0.625
q3             6.275          3.3          5.10        1.975


Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,species
118,7.7,2.6,6.9,2.3,Iris-virginica
137,6.4,3.1,5.5,1.8,Iris-virginica
115,6.4,3.2,5.3,2.3,Iris-virginica
113,5.7,2.5,5.0,2.0,Iris-virginica
134,6.1,2.6,5.6,1.4,Iris-virginica
136,6.3,3.4,5.6,2.4,Iris-virginica
121,5.6,2.8,4.9,2.0,Iris-virginica
140,6.7,3.1,5.6,2.4,Iris-virginica
132,6.4,2.8,5.6,2.2,Iris-virginica
104,6.5,3.0,5.8,2.2,Iris-virginica


### Exercise 3 - assumptions
---

Let's make some assumptions based on the data

1.  Iris-setosa, Iris-versicolor, Iris-virginica are the full range of types of iris to be analysed
2.  Although this is a small data set, the means are fairly representative

With these in mind, let's start by classifying sepal/petal size into long/short and wide/narrow with values on or above the median taken as long or wide and those below as short or narrow.

This is a starting point.  We will be trying to find a value (indicator) for each column where rows on or above do not contain any of a particular species, this might indicate that this column is a good (if not rough) indicator of species.

Calculate, and store the medians of the four columns

**Test**:
Display train.describe() to see the value of the medians of the training set
Print the four medians and compare to the output of train.describe() to check that they have been calculated correctly

In [110]:
sl_indicator = 5.8
sw_indicator = 3.1
pl_indicator = 4.0
pw_indicator = 1.3

print(sl_indicator, sw_indicator, pl_indicator, pw_indicator)

5.8 3.1 4.0 1.3


### Exercise 4 - Calculate the proportion of values on or above the indicator that are of species `Iris-virginica`

We are going to focus on the `Iris-virginica` species.

First we will calculate, for each dimension column (`sepal-length, sepal-width, petal-length, petal-width`) what proportion of values in that column, where the value is on or above the median, are classified as `Iris-virginica`.

We will do this by filtering all the records in each column of the the `train` set that are on or above the median and match the species .  Then use the outcome to calculate the proportion of the full `train` set for which a value on or above the median that are of species `Iris-virginica`.

*  filter for values in the `sepal-length` column being on or above the median (`sl_indicator`) and the species column being `Iris-virginica`.  Then divide the count of rows in this filtered dataset by the count of rows in a second data set, filtered for just the value being on or above the median.

*  Do this for all four columns, for `Iris-virginica`  (4 operations).

Print the results to see which columns look like they might predict the species as `Iris-virginica` (the result is 1).  The highest numbers may be the most, but we will do some more before coming to this conclusion.

*  By definition, those on or above the median that are NOT Iris_virginica will be `1 - the proportion of those that are.  Calculate these

The first one has been done for you.

*  We will also need the proportion of those BELOW the median that are NOT Iris-virginica.  Calculate these in the same way



In [111]:
# calculate the proportion of results where the value is on or above median that are of the species Iris-virginica
sl_vi_above = train[(train['sepal-length'] >= sl_indicator) & (train['species'] == 'Iris-virginica')]['sepal-length'].count() / train[train['sepal-length'] >= sl_indicator]['sepal-length'].count()
sw_vi_above = train[(train['sepal-length'] >= sw_indicator) & (train['species'] == 'Iris-virginica')]['sepal-length'].count() / train[train['sepal-length'] >= sw_indicator]['sepal-length'].count()
pl_vi_above = train[(train['petal-length'] >= pl_indicator) & (train['species'] == 'Iris-virginica')]['petal-length'].count() / train[train['petal-length'] >= pl_indicator]['petal-length'].count()
pw_vi_above = train[(train['petal-width'] >= pw_indicator) & (train['species'] == 'Iris-virginica')]['petal-width'].count() / train[train['petal-width'] >= pw_indicator]['petal-width'].count()

print(sl_vi_above, sw_vi_above, pl_vi_above, pw_vi_above)

# calculate the proportion of results where the column is above median that are NOT of the species Iris-virginica

no_sl_vi_above = train[(train['sepal-length'] >= sl_indicator) & (train['species'] != 'Iris-virginica')]['sepal-length'].count() / train[train['sepal-length'] >= sl_indicator]['sepal-length'].count()
no_sw_vi_above = train[(train['sepal-width'] >= sw_indicator) & (train['species'] != 'Iris-virginica')]['sepal-width'].count() / train[train['sepal-width'] >= sw_indicator]['sepal-width'].count()
no_pl_vi_above = train[(train['petal-length'] >= pl_indicator) & (train['species'] != 'Iris-virginica')]['petal-length'].count() / train[train['petal-length'] >= pl_indicator]['petal-length'].count()
no_pw_vi_above = train[(train['petal-width'] >= pw_indicator) & (train['species'] != 'Iris-virginica')]['petal-width'].count() / train[train['petal-width'] >= pw_indicator]['petal-width'].count()

print(no_sl_vi_above, no_sw_vi_above, no_pl_vi_above, no_pw_vi_above)

0.5737704918032787 0.31666666666666665 0.5428571428571428 0.5757575757575758
0.4262295081967213 0.7962962962962963 0.45714285714285713 0.42424242424242425


### Exercise 5 - Calculate the proportion of each column where the value is below median that are of species `Iris-virginica`

Repeat the code above, this time looking for values below the mean

In [112]:
# calculate the proportion of results where the value is below median that are of the species Iris-virginica

sl_vi_below = train[(train['sepal-length'] < sl_indicator) & (train['species'] == 'Iris-virginica')]['sepal-length'].count() / train[train['sepal-length'] < sl_indicator]['sepal-length'].count()
sw_vi_below = train[(train['sepal-width'] < sw_indicator) & (train['species'] == 'Iris-virginica')]['sepal-width'].count() / train[train['sepal-width'] < sw_indicator]['sepal-width'].count()
pl_vi_below = train[(train['sepal-length'] < pl_indicator) & (train['species'] == 'Iris-virginica')]['petal-length'].count() / train[train['petal-length'] < sl_indicator]['petal-length'].count()
pw_vi_below = train[(train['sepal-width'] < pw_indicator) & (train['species'] == 'Iris-virginica')]['sepal-width'].count() / train[train['petal-width'] < sl_indicator]['petal-width'].count()

print(sl_vi_below, sw_vi_below, pl_vi_below, pw_vi_below)

# calculate the proportion of results where the column is below median that are NOT of the species Iris-virginica

no_sl_vi_below = train[(train['sepal-length'] < sl_indicator) & (train['species'] != 'Iris-virginica')]['sepal-length'].count() / train[train['sepal-length'] < sl_indicator]['sepal-length'].count()
no_sw_vi_below = train[(train['sepal-width'] < sw_indicator) & (train['species'] != 'Iris-virginica')]['sepal-width'].count() / train[train['sepal-width'] < sw_indicator]['sepal-width'].count()
no_pl_vi_below = train[(train['sepal-length'] < pl_indicator) & (train['species'] != 'Iris-virginica')]['petal-length'].count() / train[train['petal-length'] < sl_indicator]['petal-length'].count()
no_pw_vi_below = train[(train['sepal-width'] < pw_indicator) & (train['species'] != 'Iris-virginica')]['sepal-width'].count() / train[train['petal-width'] < sl_indicator]['petal-width'].count()

print(no_sl_vi_below, no_sw_vi_below, no_pl_vi_below, no_pw_vi_below)

0.05084745762711865 0.4090909090909091 0.0 0.0
0.9491525423728814 0.5909090909090909 0.0 0.0


### Exercise 5 - calculate Gini index for `Iris-virginica`
---

Each time you split the data set into `train` and `test`, you will get a slightly different mix and so your `train` data set will be slightly different.  We are going to try to look at how well we might predict a particular species from the 3 columns.   Let's use the `Iris-virginica` species and try to predict if a row would be that species or not, based on the four dimensions columns.

A Gini Index is a measure of the probability of a randomly chosen prediction being incorrect.  The most influential column will have the lowest Gini Index and that will be put at the top of our decision tree.

The formula for the Gini Index is:

*Gini Index = 1 - (the sum of the squares of the proportion values calculated above)*

To calculate the Gini Index for , use the following example:

`gini_sl_vi = 1 - (sl_vi_above**2 + no_sl_vi_above**2)`

The first one has been done for you.






In [113]:
# calculate the Gini Index for the proportion of those below median which are Iris_virginica, for all four columns

gini_sl_vi_above = 1 - (sl_vi_above**2 + no_sl_vi_above**2)
gini_sw_vi_above = 1 - (sw_vi_above**2 + no_sw_vi_above**2)
gini_pl_vi_above = 1 - (pl_vi_above**2 + no_pl_vi_above**2)
gini_pw_vi_above = 1 - (pw_vi_above**2 + no_pw_vi_above**2)

print(gini_sl_vi_above, gini_sw_vi_above, gini_pl_vi_above, gini_pw_vi_above)

# calculate the Gini Index for the proportion of those below median which are Iris_virginica, for all four columns

gini_sl_vi_below = 1 - (sl_vi_below**2 + no_sl_vi_below**2)
gini_sw_vi_below = 1 - (sw_vi_below**2 + no_sw_vi_below**2)
gini_pl_vi_below = 1 - (pl_vi_below**2 + no_pl_vi_below**2)
gini_pw_vi_below = 1 - (pw_vi_below**2 + no_pw_vi_below**2)

print(gini_sl_vi_below, gini_sw_vi_below, gini_pl_vi_below, gini_pw_vi_below)


0.48911582907820483 0.2656344307270233 0.49632653061224496 0.4885215794306703
0.09652398735995393 0.48347107438016523 1.0 1.0


### Exercise 6 - add weights to the index
---

Lastly, we are going to weight the calculation by applying the proportion of those that are and aren't `Iris_virginica`  

This is the calculation for the sepal-length column:
1.  Use the proportion of values in **whole sepal-length column** that are on or above median:  
`sl_vi_above_indicator = train[train['sepal-length'] >= sl_indicator]['sepal-length'].count() / train['sepal-length'].count()`

2.  Do the same to calculate the proportion of values below the median

3.  Calculate weightings using the formula:  
`weighted_gini_sl_vi = sl_vi_above_indicator * gini_sl_vi_above + sl_vi_below_indicator * gini_sl_vi_below`

Do this for each of the four columns

In [114]:
# calculate the proportion of values in sepal-length column that are on or above mean, then calculate the weighted Gini Index

sl_vi_above_indicator = train[train['sepal-length'] >= sl_indicator]['sepal-length'].count() / train['sepal-length'].count()
sl_vi_below_indicator = train[train['sepal-length'] <= sl_indicator]['sepal-length'].count() / train['sepal-length'].count()
weighted_gini_sl_vi = sl_vi_above_indicator * gini_sl_vi_above + sl_vi_below_indicator * gini_sl_vi_below
print(weighted_gini_sl_vi)

# calculate the weighted Gini Index for sepal-width

sw_vi_above_indicator =  train[train['sepal-width'] >= sw_indicator]['sepal-width'].count() / train['sepal-width'].count()
sw_vi_below_indicator = train[train['sepal-width'] <= sw_indicator]['sepal-width'].count() / train['sepal-width'].count()
weighted_gini_sw_vi =  sw_vi_above_indicator * gini_sw_vi_above + sw_vi_below_indicator * gini_sw_vi_below
print(weighted_gini_sw_vi)

# calculate the weighted Gini Index for petal_length

pl_vi_above_indicator = train[train['petal-length'] >= pl_indicator]['petal-length'].count() / train['petal-length'].count()
pl_vi_below_indicator = train[train['petal-length'] <= pl_indicator]['petal-length'].count() / train['petal-length'].count()
weighted_gini_pl_vi = pl_vi_above_indicator * gini_pl_vi_above + pl_vi_below_indicator * gini_pl_vi_below
print(weighted_gini_pl_vi)

# calculate the weighted Gini Index for petal-width

pw_vi_above_indicator =  train[train['petal-width'] >= sw_indicator]['petal-width'].count() / train['petal-width'].count()
pw_vi_below_indicator = train[train['petal-width'] <= sw_indicator]['petal-width'].count() / train['petal-width'].count()
weighted_gini_pw_vi = pw_vi_above_indicator * gini_pw_vi_above + pw_vi_below_indicator * gini_pw_vi_below
print(weighted_gini_pw_vi)

0.29930897314539656
0.4257338409345985
0.7395238095238096
1.0


### Exercise 6 - Make a decision tree
---

Use pencil and paper or a graphical application to create a decision tree using the following rules (**use the picture below as a guide only - yours will be different**):

*  the column with a 0.0 initial Gini Index and the lowest weighted Gini Index is placed at the top
*  other columns with a 0.0 initial Gini Index are placed in order below
*  the rest of the columns are placed in order below these

Any column where one branch (on or above median OR below median) has an initial Gini Index of 0, could be classified as a strong indicator of Iris_virginica being the species.  Anything else doesn't have enough certainty.

Let's code the decision tree using the following logic for this decision tree (yours might be slightly different):

![Decision tree](https://drive.google.com/uc?id=1CTo23EHwR2IPCRjcfSyCQsT_oQ5Exwso)

In the decision tree above, there is no certainty below petal-length so our decision tree will only include petal-width and petal-length.




In [115]:
def get_species(row):
  # ADD CODE HERE TO RETURN None if petal-width is below pw_decision_line or if petal-length is below pl_decision_line, otherwise return 'Iris-virginia'
    if row['petal-width'] <= pw_indicator:
        val = 'None'
    elif row['petal-length'] <= pl_indicator:
        val = 'None'
    else:
        val = 'Iris-virginia'
    return val

# use the get_species(df) function to predict the species, count how many are predicted correct and use this to calculate the proportion correct
correct = 0
test_size = test.shape[0]
for i in range(0, test_size):
  species = get_species(test.iloc[i])
  if species == test.iloc[i]['species']:
      correct += 1

print ("Proportion correctly identified", correct / test_size) 


Proportion correctly identified 0.0


### Exercise 7 - expand the training set
---

Let's make the training set a bit bigger so that there are more rows to use in the analysis.

*  go back and change the train/test proportions to 80% and 20%
*  save the notebook and run all the code cells again

# SUM UP what you have found out so far
---
#Not correctly predicting any!? 

### Exercise 8 - change the measure for sepal-length

We are currently using the median to act as the indicator line for all 4 columns.  This is helping us to identify petal-length and petal-width as good indicators around the median.  We can use the decision tree with different indicators.

Change the `sepal-length` indicator so that you are instead using the lower quartile.  The code should not need changing except for where you calculated the value for `sl_indicator`.

Run all the code again.  Is the proportion of correct values better or worse this time?   Is the decision tree still appropriate?


WHAT DO YOU NOTICE? (write your answer here)
---




### Exercise 9 - try different indicators for sepal-length
---

Do the same again trying different indicators.  If upper or lower quartiles don't help, try using another percentile (e.g. .quantile(0.2). Is it making any difference?  What indicators give the best looking results?

WHAT DO YOU NOTICE? 

write your answer here


### Exercise 10 - try a different species

Run the median test again for the Iris-versicolor species.  Again, try some different indicators.

What are the results.  Record them in the text cell below:

RESULTS FOR Iris-versicolor
---  
write your findings here

# New logic introduced in this worksheet:

1.  Adding headings to a CSV if none currently exist
2.  Splitting a data set into train and test sets

# Reflection
----

## What skills have you demonstrated in completing this notebook?

Your answer: How decision trees work

## What caused you the most difficulty?

Your answer: Writing the function to return the value none or iris type, and understanding where the gini index is incorporated