#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Decision Trees and Random Forests

In this lab we will apply decision trees and random forests to perform machine learning tasks. These two model types are relatively easy to understand, but they are very powerful tools.

Random forests build upon decision tree models, so we'll start by creating a decision tree and then move to random forests.

## Load Data

Let's start by loading some data. We'll use the familiar iris dataset from scikit-learn.

In [None]:
import pandas as pd

from sklearn.datasets import load_iris

iris_bunch = load_iris()

feature_names = iris_bunch.feature_names
target_name = 'species'

iris_df = pd.DataFrame(
    iris_bunch.data,
    columns=feature_names
)

iris_df[target_name] = iris_bunch.target

iris_df.head()

## Decision Trees

Decision trees are models that build a tree structure with a condition at each non-terminal (internal) node. The condition determines which branch to follow down the tree.

Let's see what this would look like with a simple example.

Let's say we want to determine if a piece of fruit is a lemon, lime, orange, or grapefruit. We might have a tree that looks like:

```txt
                      ----------
           -----------| color? |-----------
          |           ----------           |
          |               |                |
       <green>         <orange>        <yellow>
          |               |                |
          |               |                |
       ========           |            =========
       | lime |           |            | lemon |
       ========       ---------        =========
                 -----| size? |-----
                 |    ---------    |
                 |                 |
              <small>           <large>
                 |                 |
                 |                 |
            ==========       ==============
            | orange |       | grapefruit |
            ==========       ==============
```

This would roughly translate to the following code:

```python

def fruit_type(fruit):
  if fruit.color == "green":
    return "lime"
  if fruit.color == "yellow":
    return "lemon"
  if fruit.color == "orange":
    if fruit.size == "small":
      return "orange"
    if fruit.size == "large":
      return "grapefruit"
```

As you can see, the decision tree is very easy to interpret. If you use a decision tree to make predictions and then need to determine why the tree made the decision that it did, it is very easy to inspect.

Also, decision trees don't benefit from scaling or normalizing your data, which is different from many types of models.

### Create a Decision Tree

Now that we have the data loaded, we can create a decision tree. We'll use the [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) from scikit-learn to perform this task.

Note that there is also a [`DecisionTreeRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) that can be used for regression models. In practice, you'll typically see decision trees applied to classification problems more than regression.

To build and train the model, we create an instance of the classifier and then call the `fit()` method that is used for all scikit-learn models.

In [None]:
from sklearn import tree

dt = tree.DecisionTreeClassifier()

dt.fit(
    iris_df[feature_names],
    iris_df[target_name]
)

If this were a real application, we'd keep some data to the side for testing.
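As a minimal sketch of what holding out test data could look like, the block below splits the iris data with `train_test_split` and compares training and holdout accuracy. The 80/20 split, the `stratify` option, the fixed `random_state`, and the name `holdout_dt` are illustrative choices, not part of the lab.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the features and targets directly as arrays.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows. stratify=y keeps the class proportions
# the same in the training and testing splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit only on the training split.
holdout_dt = DecisionTreeClassifier(random_state=0)
holdout_dt.fit(X_train, y_train)

# score() reports mean accuracy for classifiers.
train_accuracy = holdout_dt.score(X_train, y_train)
test_accuracy = holdout_dt.score(X_test, y_test)

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

A large gap between the two numbers would suggest that the tree has memorized the training data rather than learned a pattern that generalizes.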

### Visualize the Tree

We now have a decision tree and can use it to make predictions. But before we do that, let's take a look at the tree itself.

To do this we create a [`StringIO`](https://docs.python.org/3/library/io.html) object that we can export DOT data to. [DOT](https://www.graphviz.org/doc/info/lang.html) is a graph description language, and Python libraries such as `pydotplus` can render it as an image.


In [None]:
import io
import pydotplus

from IPython.display import Image  

dot_data = io.StringIO()  

tree.export_graphviz(
    dt,
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  
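If Graphviz or `pydotplus` is not available in your environment, a plain-text rendering is one possible fallback. The sketch below uses scikit-learn's `export_text` on a separately fitted tree; the names `text_dt` and `tree_text` are illustrative and not part of the lab.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Fit an unrestricted tree, mirroring the default settings used above.
text_dt = DecisionTreeClassifier(random_state=0)
text_dt.fit(iris.data, iris.target)

# export_text returns the split rules as an indented string.
tree_text = export_text(text_dt, feature_names=list(iris.feature_names))
print(tree_text)
```

Each level of indentation in the output corresponds to one level of depth in the tree, so deeply nested rules are a quick visual cue for complexity.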

That tree looks pretty complex. A large number of branches is a sign that we may have overfit the model. Let's create the tree again; this time we'll limit the depth.

In [None]:
from sklearn import tree

dt = tree.DecisionTreeClassifier(max_depth=2)

dt.fit(
    iris_df[feature_names],
    iris_df[target_name]
)

And plot to see the branching.

In [None]:
import io
import pydotplus

from IPython.display import Image  

dot_data = io.StringIO()  

tree.export_graphviz(
    dt,
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  

This tree is less likely to be overfitting since we forced it to have a depth of 2. Holding out a test sample and performing validation would be a good way to check.

What are the `gini`, `samples`, and `value` items shown in the tree?

`gini` is the *Gini impurity*. This is a measure of the chance that you'll misclassify a random sample at this node if you label it at random according to the class distribution at the node. Smaller `gini` is better, and a value of 0 means every sample at the node belongs to the same class.

`samples` is a count of the number of samples that have met the criteria to reach this node.

Within `value` is the count of each class of data that has made it to this node. Summing `value` should equal `samples`.
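To make these numbers concrete, here is a small sketch that computes the Gini impurity from a list of class counts. For the iris data the root node holds all 150 samples, 50 per class. The helper `gini_impurity` is a hypothetical function written for illustration, not a scikit-learn API.

```python
import numpy as np


def gini_impurity(class_counts):
    """Return 1 - sum(p_k^2), where p_k is the proportion of class k."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))


# Class counts at the root of a tree trained on the full iris dataset.
root_value = [50, 50, 50]
root_samples = sum(root_value)
root_gini = gini_impurity(root_value)

# A node that contains a single class is perfectly pure.
pure_gini = gini_impurity([50, 0, 0])

print(f"samples at root: {root_samples}")
print(f"gini at root:    {root_gini:.3f}")
print(f"gini, pure node: {pure_gini:.3f}")
```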

### Hyperparameters

There are many hyperparameters you can tweak in your decision tree models. One of those is `criterion`. `criterion` determines the quality measure that the model will use to determine the shape of the tree.

The possible `criterion` values are `gini` and `entropy`. `gini` is the [Gini Impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) while `entropy` is a measure of [Information Gain](https://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain).

In the example below, we switch the classifier to use "entropy" for `criterion`. You'll see in the resultant tree that we now see "entropy" instead of "gini", but the resultant trees are the same. For more complex models, though, it may be worthwhile to test the different criteria.

In [None]:
import io
import pydotplus

from IPython.display import Image  
from sklearn import tree

dt = tree.DecisionTreeClassifier(
    max_depth=2, 
    criterion="entropy"
)

dt.fit(
    iris_df[feature_names],
    iris_df[target_name]
)

dot_data = io.StringIO()  

tree.export_graphviz(
    dt,
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  
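The `entropy` value shown at each node can be reproduced by hand. The sketch below computes it from class counts using a base-2 logarithm. The helper `entropy` is a hypothetical function written for illustration, not a scikit-learn API.

```python
import numpy as np


def entropy(class_counts):
    """Return -sum(p_k * log2(p_k)) over the classes present at a node."""
    counts = np.asarray(class_counts, dtype=float)
    # Drop empty classes so that log2(0) is never evaluated.
    proportions = counts[counts > 0] / counts.sum()
    return float(-np.sum(proportions * np.log2(proportions)))


# Root of a tree trained on the full iris dataset: 50 samples per class.
root_entropy = entropy([50, 50, 50])

# A node containing a single class has no uncertainty.
pure_entropy = entropy([50, 0, 0])

print(f"entropy at root:    {root_entropy:.3f}")
print(f"entropy, pure node: {pure_entropy:.3f}")
```

With three equally likely classes the entropy is the base-2 logarithm of 3, which is the largest value possible for a three-class node.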

We've limited the depth of the tree using `max_depth`. We can also limit the number of samples required to be present in a node for it to be considered for splitting using `min_samples_split`. We can also limit the minimum size of a leaf node using `min_samples_leaf`. All of these hyperparameters help you to prevent your model from overfitting.

There are many other hyperparameters that can be found in the [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) documentation.
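As a rough illustration of how these limits shape a tree, the sketch below fits the iris data under a few settings and reports the resulting depth and leaf count. The specific values (2, 20, 10) and the names `settings` and `summary` are arbitrary choices for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each entry maps a label to the keyword arguments for one tree.
settings = {
    "unrestricted": {},
    "max_depth=2": {"max_depth": 2},
    "min_samples_split=20": {"min_samples_split": 20},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
}

summary = {}
for label, params in settings.items():
    model = DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    # get_depth() and get_n_leaves() describe the size of the fitted tree.
    summary[label] = (model.get_depth(), model.get_n_leaves())
    print(f"{label:>22}: depth={summary[label][0]}, leaves={summary[label][1]}")
```

Comparing the rows gives a feel for how strongly each constraint prunes the tree before you commit to a full grid search.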

### Exercise 1: Tuning Decision Tree Hyperparameters

In this exercise we will use a decision tree to classify wine quality in the [Red Wine Quality dataset](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009).

The target column in the dataset is `quality`. Quality is an integer value between 1 and 10 (inclusive). You'll use the other columns in the dataset to build a decision tree to predict wine quality.

For this exercise:

* Hold out some data for final testing of model generalization.
* Use [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to compare some hyperparameters for your model. You can choose which parameters to test.
* Print the hyperparameters of the best performing model.
* Print the accuracy of the best performing model on the holdout dataset.
* Visualize the best performing tree.

Use as many text and code cells as you need to perform this exercise. We'll get you started with the code to authenticate and download the dataset.

First upload your `kaggle.json` file, and then run the code block below.

In [None]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'

Next, download the wine quality dataset.

In [None]:
! kaggle datasets download uciml/red-wine-quality-cortez-et-al-2009
! ls

##### **Student Solution**

###### Unzip file

In [None]:
! unzip red-wine-quality-cortez-et-al-2009.zip
! ls

###### Load data into Dataframe

In [None]:
import pandas as pd

wine_df = pd.read_csv('winequality-red.csv')

wine_df

###### Features and Targets for Model

In [None]:
features = wine_df.drop(columns='quality')
target = wine_df['quality']

###### Train-Test-Split

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    features,
    target,
    test_size=0.2
)

print(len(x_train), len(x_test), len(y_train), len(y_test))

# we put these into our model for training
  # x_train is the training features
  # y_train is the training target

# we put these into our model for predictions
  # x_test is the testing features
  # y_test is the testing target

###### Make Decision Tree Model

In [None]:
from sklearn import tree

dt = tree.DecisionTreeClassifier()

###### Find the best hyperparameters for the model

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(dt, {
  #DecisionTreeClassifier parameters to check
  'criterion' : ['gini', 'entropy'],
  'splitter'  : ['best', 'random'],
  'max_depth' : [i for i in range(1,10)],
})

search.fit(x_train, y_train)

print(search.best_estimator_)

In [None]:
predictions = search.predict(x_test)

# scikit-learn metrics expect the true labels first, then the predictions.
print('Accuracy: ', round(accuracy_score(y_test, predictions), 3))
print('Precision: ', round(precision_score(y_test, predictions, average='micro'), 3))
print('Recall: ', round(recall_score(y_test, predictions, average='micro'), 3))
print('F1: ', round(f1_score(y_test, predictions, average='micro'), 3))

###### Visualize the best tree

In [None]:
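from sklearn import tree

# GridSearchCV refits the best parameter combination on the full training
# set (refit=True by default), so we reuse that fitted model instead of
# hard-coding hyperparameters that may differ from run to run.
best_dt = search.best_estimator_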

In [None]:
import io
import pydotplus

from IPython.display import Image  

dot_data = io.StringIO()  

tree.export_graphviz(
    best_dt,
    out_file=dot_data,  
    feature_names=list(features.columns)
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  

---

## Random Forests

Random forests are a simple yet powerful machine learning tool based on decision trees. Random forests are easy to understand, yet they touch upon many advanced machine learning concepts, such as ensemble learning and bagging. These models can be used for both classification and regression. Also, since they are built from decision trees, they are not sensitive to unscaled data.

You can think of a random forest as a group decision made by a number of decision trees. For classification problems, the random forest creates multiple decision trees with different subsets of the data. When it is asked to classify a data point, it will ask all of the trees what they think and then take the majority decision.

For regression problems, the random forest will again use the opinions of multiple decision trees, but it will take the mean (or some other aggregate) of the responses and use that as the regression value.
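The averaging behavior can be checked directly. The sketch below fits a `RandomForestRegressor` on scikit-learn's bundled diabetes dataset and compares the forest's prediction with the mean of its individual trees. The dataset, the 25 trees, and the variable names are illustrative assumptions, not part of the lab.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Ask every tree in the forest for its own prediction on a few rows.
sample = X[:5]
per_tree = np.array([est.predict(sample) for est in forest.estimators_])

# Average across trees (axis 0) and compare with the forest's answer.
mean_of_trees = per_tree.mean(axis=0)
forest_prediction = forest.predict(sample)

print(np.round(mean_of_trees, 2))
print(np.round(forest_prediction, 2))
```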

This type of modeling, where one model consists of other models, is called *ensemble learning*. Ensemble learning can often lead to better models because taking the combined, differing opinions of a group of models can reduce overfitting.
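To show the idea of a group decision, here is a hand-rolled sketch that trains several decision trees on different random samples of the iris data and takes a majority vote. It is a simplification: it is not how `RandomForestClassifier` is implemented (scikit-learn also randomizes the features considered at each split and averages predicted probabilities rather than counting hard votes). The number of trees and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Train each tree on its own random sample of rows, drawn with replacement.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Collect one row of votes per tree: shape (n_trees, n_samples).
votes = np.array([t.predict(X) for t in trees]).astype(int)

# For each sample (one column of votes), pick the most common class.
majority = np.array(
    [np.bincount(column, minlength=3).argmax() for column in votes.T]
)

ensemble_accuracy = (majority == y).mean()
print(f"majority-vote accuracy on the training data: {ensemble_accuracy:.3f}")
```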

### Create a Random Forest

Creating a random forest is as easy as creating a decision tree.

scikit-learn provides a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and a [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), which can be used to combine the predictive power of multiple decision trees.

In [None]:
import pandas as pd

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_bunch = load_iris()

feature_names = iris_bunch.feature_names
target_name = 'species'

iris_df = pd.DataFrame(
    iris_bunch.data,
    columns=feature_names
)

iris_df[target_name] = iris_bunch.target

rf = RandomForestClassifier()
rf.fit(
    iris_df[feature_names],
    iris_df[target_name]
)

You can look at different trees in the random forest to see how their decision branching differs. By default there are `100` decision trees created for the model.

Let's view a few.

Run the code below a few times, and see if you notice a difference in the trees that are shown.

In [None]:
import io
import pydotplus
import random

from IPython.display import Image  
from sklearn import tree

# sklearn.externals.six was removed from scikit-learn; use the standard library.
dot_data = io.StringIO()  

tree.export_graphviz(
    random.choice(rf.estimators_),
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  

### Make Predictions

Just like any other scikit-learn model, you can use the `predict()` method to make predictions.

In [None]:
# Select the row as a one-row DataFrame so the feature names match training.
print(rf.predict(iris_df[feature_names].iloc[[121]]))
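Beyond a single predicted label, a forest can also report how confident it is. The sketch below uses `predict_proba` on a separately fitted model; the names `proba_rf` and `row` are illustrative and not part of the lab.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

proba_rf = RandomForestClassifier(random_state=0)
proba_rf.fit(iris.data, iris.target)

# Double brackets keep the row two-dimensional, as predict() expects.
row = iris.data[[121]]

# One probability per class, in the order given by proba_rf.classes_.
probabilities = proba_rf.predict_proba(row)
predicted_class = proba_rf.predict(row)[0]

print(dict(zip(iris.target_names, probabilities[0].round(2))))
print("predicted:", iris.target_names[predicted_class])
```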

### Hyperparameters

Many of the hyperparameters available in decision trees are also available in random forest models. There are, however, some hyperparameters that are only available in random forests.

The two most important are `bootstrap` and `oob_score`. These two hyperparameters are relevant to ensemble learning.

`bootstrap` determines if the model will use [bootstrap sampling](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). When you bootstrap, each tree in the forest is trained on a random sample of the dataset. The full dataset is the source of the sampling for every tree, but each sample will have a different set of data points. The sampling is done with "replacement," which means a data point can appear more than once in the same sample and can also appear in the samples of more than one tree.
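A bootstrap sample is easy to simulate with NumPy. The sketch below draws row indices with replacement and counts how many distinct rows end up in the sample. The dataset size of 150 and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 150

# Draw n_rows indices with replacement from the range [0, n_rows).
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Rows that were drawn at least once, and rows that were never drawn.
unique_rows = np.unique(bootstrap_idx)
out_of_bag = np.setdiff1d(np.arange(n_rows), unique_rows)

print(f"rows drawn:        {len(bootstrap_idx)}")
print(f"distinct rows:     {len(unique_rows)}")
print(f"rows never drawn:  {len(out_of_bag)}")
```

Because some rows are drawn several times, others are left out entirely. For large datasets roughly 63% of rows appear in any one bootstrap sample on average.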

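`oob_score` stands for "out-of-bag score." A bootstrap sample is referred to as a *bag* in machine learning parlance, and the data points that were not drawn for a given tree are *out of bag* for that tree. When `oob_score` is set to true, each data point is scored using only the trees that did not see it during training, which gives an estimate of how well the model generalizes without a separate validation set. This option requires `bootstrap` to be enabled.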
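Here is a minimal sketch of requesting the out-of-bag score on the iris data. The fitted model exposes the result as the `oob_score_` attribute; the names `oob_rf` and `oob_accuracy` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True only works when bootstrap sampling is enabled.
oob_rf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
oob_rf.fit(X, y)

# Accuracy estimated from rows each tree did not see during training.
oob_accuracy = oob_rf.oob_score_
print(f"out-of-bag accuracy: {oob_accuracy:.3f}")
```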

### Exercise 2: Feature Importance

In this exercise we will use the [UCI Abalone dataset](https://www.kaggle.com/hurshd0/abalone-uci) to determine the age of sea snails.

The target feature in the dataset is `rings`, which is a proxy for age in the snails. This is a numeric value, but it is stored as an integer and has a biological limit. So we can think of this as a classification problem and use a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

You will download the dataset and train a random forest classifier. After you have fit the classifier, the `feature_importances_` attribute of the model will be populated. Use the importance scores to print the least important feature.

*Note that some of the features are categorical string values. You'll need to convert these to numeric values to use them in the model.*

Use as many text and code blocks as you need to perform this exercise.

#### **Student Solution**

##### Load dataset

In [None]:
! kaggle datasets download hurshd0/abalone-uci
! unzip abalone-uci.zip
! ls

###### Load data into Dataframe

In [None]:
import pandas as pd

snail_df = pd.read_csv('abalone_original.csv')

snail_df

##### EDA

###### Check for missing data

In [None]:
snail_df.isna().describe() # no missing values

###### Features and Target

In [None]:
feature_names = snail_df.columns[:-1].tolist()
target = snail_df['rings']

###### One-hot Encoding

In [None]:
# sex
for op in sorted(snail_df['sex'].unique()):
  op_col = op.lower().replace(' ', '_').replace('<', '')
  snail_df[op_col] = (snail_df['sex'] == op).astype(int)
  feature_names.append(op_col)

feature_names.remove('sex')

features = snail_df[feature_names]

In [None]:
features
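An alternative to the manual loop is `pandas.get_dummies`, which builds the indicator columns in one call. The sketch below runs on a tiny made-up DataFrame; the column values are hypothetical stand-ins, not rows from the abalone file.

```python
import pandas as pd

# Toy data for illustration only; values are invented.
toy_df = pd.DataFrame({
    "sex": ["M", "F", "I", "M"],
    "length": [0.455, 0.530, 0.330, 0.440],
})

# Replace the "sex" column with one 0/1 indicator column per category.
encoded = pd.get_dummies(toy_df, columns=["sex"], dtype=int)

print(encoded)
```

The generated columns are named with the original column as a prefix, which keeps them easy to trace back when reading feature importances later.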

##### Train-Test-Split

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    features,
    target,
    test_size=0.2
)

print(len(x_train), len(x_test), len(y_train), len(y_test))

# we put these into our model for training
  # x_train is the training features
  # y_train is the training target

# we put these into our model for predictions
  # x_test is the testing features
  # y_test is the testing target

##### Make Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(
    x_train,
    y_train
)

###### Print least important feature

In [None]:
importance = rf.feature_importances_
importance

In [None]:
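# Look up the position of the smallest score instead of hard-coding an index,
# since the ranking can change between runs.
least_idx = importance.argmin()

print(importance.min())
print(importance[least_idx])

In [None]:
print(f'The least important feature is "{feature_names[least_idx]}"')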
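To see the full ranking rather than only the minimum, one option is to pair each score with its feature name in a pandas Series. The sketch below does this on the iris data so that it runs without the Kaggle download; the names `imp_rf` and `ranked` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

imp_rf = RandomForestClassifier(random_state=0)
imp_rf.fit(iris.data, iris.target)

# Index the importance scores by feature name and sort ascending,
# so the least important feature comes first.
ranked = pd.Series(
    imp_rf.feature_importances_,
    index=iris.feature_names,
).sort_values()

least_important = ranked.index[0]

print(ranked)
print(f'The least important feature is "{least_important}"')
```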

---