# Implement Decision Trees for Iris Classification

We are going to continue working with the iris data set and will build a Decision Tree to classify iris flowers.



### Import Libraries


Before you get started, you need to import a few libraries. You can do this by executing the following code. Remember, run code in a cell by selecting the cell, holding the shift key, and pressing enter/return.

We will import the scikit-learn `DecisionTreeClassifier` class, the `train_test_split` function for splitting the data into training and test sets, and the `accuracy_score` metric to evaluate our model. In this exercise we will also perform k-fold cross-validation on the model, so we import scikit-learn's `cross_val_score` function as well.

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

### Step 1: Load the Iris Data Set

Next we will load the data from the iris data set and store it in a dataframe named `dfiris`.

In [2]:
dfiris = pd.read_csv('Iris_Data.csv')

### Step 2: Create labeled examples from our data set for the training phase

Let's extract variables from our data set to create labeled examples. This time, every example will be using all of the features in the iris dataset.

The code cell below carries out the following steps:

* Extracts all four features from `dfiris` and assigns them to the variable `x`.
* Extracts the `species` label from `dfiris` and assigns it to the variable `y`.
* Prints the values of `x` and `y`.

Execute the code cell below and inspect the results. You will see that we have 150 labeled examples. Each example contains four features and one label.


In [3]:
x = dfiris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = dfiris['species'].to_frame()
print(x)
print(y)

     sepal_length  sepal_width  petal_length  petal_width
0             5.1          3.5           1.4          0.2
1             4.9          3.0           1.4          0.2
2             4.7          3.2           1.3          0.2
3             4.6          3.1           1.5          0.2
4             5.0          3.6           1.4          0.2
..            ...          ...           ...          ...
145           6.7          3.0           5.2          2.3
146           6.3          2.5           5.0          1.9
147           6.5          3.0           5.2          2.0
148           6.2          3.4           5.4          2.3
149           5.9          3.0           5.1          1.8

[150 rows x 4 columns]
            species
0       Iris-setosa
1       Iris-setosa
2       Iris-setosa
3       Iris-setosa
4       Iris-setosa
..              ...
145  Iris-virginica
146  Iris-virginica
147  Iris-virginica
148  Iris-virginica
149  Iris-virginica

[150 rows x 1 columns]


### Step 3: Create Training & Test Data Sets

Now that we have specified examples, we will need to split them into a training set and a test set.

We will refer to the training feature vectors as `x_train` with labels `y_train`. 

Our testing vectors are `x_test` with labels `y_test`. 


In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
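
As a quick sanity check on the split sizes, the sketch below applies `train_test_split` with the same `test_size` and `random_state` to a stand-in list of 150 row indices. The list `row_ids` is hypothetical and not part of the exercise; it only illustrates that `test_size=0.2` holds out 30 of 150 examples and that a fixed `random_state` makes the split reproducible.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 150 iris examples: just their row indices.
row_ids = list(range(150))

# Same settings as the exercise: hold out 20% and fix the random seed.
train_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=0)

print(len(train_ids), len(test_ids))

# Re-running with the same random_state reproduces the same split.
train_again, test_again = train_test_split(row_ids, test_size=0.2, random_state=0)
print(test_ids == test_again)
```
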

### Step 4: Fit a Decision Tree with the Training Set

The code cell below:

1. Creates a `DecisionTreeClassifier()` object and assigns it to the variable `tree_model`.

2. Calls the `.fit()` method on `tree_model` to fit the model to the training data. The first argument is `x_train` and the second is `y_train`.

3. Calls the `.predict()` method on `tree_model` with the argument `x_test` to predict labels for the test data, and stores the result in the variable `pred`. We will compare these values to `y_test` later.

Execute the code cell below to train the model and generate predictions for the test set.

In [5]:
# Initialize the model
tree_model = DecisionTreeClassifier(random_state=0)

# Train the model using the training sets
tree_model.fit(x_train, y_train)

# Make predictions using the test set
pred = tree_model.predict(x_test)

### Step 5: Check the accuracy of your model

Execute the code cell below to see the accuracy score of your model.

In [6]:
score = accuracy_score(y_test, pred)
print(score)

1.0
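
For intuition, `accuracy_score` reports the fraction of predictions that match the true labels. Here is a minimal pure-Python sketch of that calculation, using small made-up label lists. The helper `accuracy` and the lists `true_labels` and `predicted_labels` are illustrative and not the exercise's data.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that equal the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both sequences must have the same length.")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Illustrative labels only: three of the four predictions are correct.
true_labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-virginica"]
predicted_labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-versicolor"]

example_accuracy = accuracy(true_labels, predicted_labels)
print(example_accuracy)  # 0.75
```

By the same definition, a score of 1.0 means every example in the test set was classified correctly.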


### Step 6: Perform k-Fold Cross-Validation on the Model

The code cell below uses scikit-learn's `cross_val_score` function to perform cross-validation on the model.

You will recall that k-fold cross-validation splits the training data into k equally sized subsets, called folds. We train and test k times, each time training on k-1 folds and testing on the remaining fold, so every fold serves as the test set exactly once. We then average the accuracies obtained over the k iterations to estimate the accuracy of the model.
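
The fold bookkeeping can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the hypothetical helper `k_fold_indices` uses contiguous, unshuffled folds, whereas `cross_val_score` uses stratified folds for classifiers when `cv` is an integer. The sizes assume 120 training examples (80% of 150) and 4 folds.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    fold_size = n_examples // k  # assumes n_examples is divisible by k
    indices = list(range(n_examples))
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test_indices = indices[start:stop]
        train_indices = indices[:start] + indices[stop:]
        yield train_indices, test_indices

# 120 examples and 4 folds, matching the sizes used in this exercise.
splits = list(k_fold_indices(n_examples=120, k=4))
for train_indices, test_indices in splits:
    print(len(train_indices), len(test_indices))
```

Each iteration trains on 90 examples and tests on 30, and every example appears in a test fold exactly once.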

The `cross_val_score` function requires that you pass your model as an argument using the `estimator` parameter. It allows you to specify the value of `k` using the `cv` parameter. It returns `k` accuracy scores, one for each training/test iteration.

We will perform a k-fold cross-validation using 4 folds. Execute the code cell below and inspect the results. 

You'll notice that the four accuracy scores are all high (between 0.90 and 0.97, with a mean of about 0.93) and that the standard deviation among them is low (about 0.02), indicating that our model performs consistently well across folds.


In [7]:
all_accuracies = cross_val_score(estimator=tree_model, X=x_train, y=y_train, cv=4)

# Print the accuracy scores
print('Accuracies for the four training/test iterations:')
print(all_accuracies)

# Print the average using the mean() method
print('The mean accuracy score across the four iterations:')
print(all_accuracies.mean())

# Print the standard deviation of the accuracy scores using the std() method to see the degree of variance.
print('The standard deviation of the accuracy score across the four iterations:')
print(all_accuracies.std())


Accuracies for the four training/test iterations:
[0.96666667 0.9        0.93333333 0.93333333]
The mean accuracy score across the four iterations:
0.9333333333333333
The standard deviation of the accuracy score across the four iterations:
0.02357022603955158
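
The summary statistics can be cross-checked with Python's standard library. The sketch below copies the four printed scores into a list (the names `scores`, `mean_score`, and `std_score` are illustrative) and shows that NumPy's `std()` corresponds to the population standard deviation, not the sample standard deviation.

```python
import statistics

# Accuracy scores copied from the printed output above.
scores = [0.96666667, 0.9, 0.93333333, 0.93333333]

mean_score = statistics.mean(scores)
# NumPy's std() defaults to the population standard deviation (ddof=0),
# which corresponds to statistics.pstdev rather than statistics.stdev.
std_score = statistics.pstdev(scores)

print(round(mean_score, 4), round(std_score, 4))  # 0.9333 0.0236
```
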


### Step 7: Perform a Grid Search on the Model

For this decision tree, we used the default hyperparameters. To optimize the hyperparameters of the model, we can use scikit-learn's `GridSearchCV` class, which searches over combinations of candidate hyperparameter values to find the set that yields the best cross-validation (CV) score. You can find the names of the `DecisionTreeClassifier` hyperparameters in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). You can also find more information on `GridSearchCV` in the corresponding [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

The code cell below demonstrates how to use `GridSearchCV`.


In [8]:
# Import the GridSearchCV class
from sklearn.model_selection import GridSearchCV

# Dictionary of different hyperparameters to try.
# GridSearchCV will choose which max_depth hyperparameter is best
hyperparameters = {'max_depth': [2,3,4,5,6,7,8]}

# Run a Grid Search with 4-fold cross-validation using our decision tree model
grid = GridSearchCV(tree_model, param_grid=hyperparameters, cv=4)

grid.fit(x_train, y_train)

# Print best hyperparameter to use
print(grid.best_params_)


{'max_depth': 3}
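
To make the search less of a black box, the following sketch does by hand roughly what a grid search over `max_depth` does: it scores each candidate with 4-fold cross-validation and keeps the one with the highest mean accuracy. As an assumption, it loads scikit-learn's bundled copy of the iris data through `load_iris` instead of `Iris_Data.csv`, and it uses its own variable names so that it does not overwrite the ones above. Its result may differ from the output of the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: use scikit-learn's bundled iris data in place of Iris_Data.csv.
features, labels = load_iris(return_X_y=True)
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

candidate_depths = [2, 3, 4, 5, 6, 7, 8]
mean_scores = {}
for depth in candidate_depths:
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Mean accuracy over 4 cross-validation folds of the training set.
    mean_scores[depth] = cross_val_score(
        candidate, train_features, train_labels, cv=4
    ).mean()

# Pick the depth with the highest mean score (the first one wins ties).
best_depth = max(candidate_depths, key=lambda depth: mean_scores[depth])
print(best_depth, mean_scores[best_depth])
```

Beyond this loop, `GridSearchCV` also refits the best configuration on the whole training set by default (`refit=True`), so the fitted `grid` object can be used directly for prediction.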
