<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/v2/06_other_models/00_random_forests/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Decision Trees and Random Forests

In this lab we will apply decision trees and random forests to perform machine learning tasks. These two model types are relatively easy to understand, yet they are very powerful tools.

Random forests build upon decision tree models, so we'll start by creating a decision tree and then move to random forests.

## Load Data

Let's start by loading some data. We'll use the familiar iris dataset from scikit-learn.

In [0]:
import pandas as pd

from sklearn.datasets import load_iris

iris_bunch = load_iris()

feature_names = iris_bunch.feature_names
target_name = 'species'

iris_df = pd.DataFrame(
    iris_bunch.data,
    columns=feature_names
)

iris_df[target_name] = iris_bunch.target

iris_df.head()

## Decision Trees

Decision trees are models that create a tree structure with a condition at each internal (non-leaf) node. The condition determines which branch to follow down the tree.

Let's see what this would look like with a simple example.

Say that we want to determine if a piece of fruit is a lemon, lime, orange, or grapefruit. We might have a tree that looks like:

```txt
                      ----------
           -----------| color? |-----------
          |           ----------           |
          |               |                |
       <green>         <orange>        <yellow>
          |               |                |
          |               |                |
       ========           |            =========
       | lime |           |            | lemon |
       ========       ---------        =========
                 -----| size? |-----
                 |    ---------    |
                 |                 |
              <small>           <large>
                 |                 |
                 |                 |
            ==========       ==============
            | orange |       | grapefruit |
            ==========       ==============
```

This would roughly translate to the following code:

```python

def fruit_type(fruit):
  if fruit.color == "green":
    return "lime"
  if fruit.color == "yellow":
    return "lemon"
  if fruit.color == "orange":
    if fruit.size == "small":
      return "orange"
    if fruit.size == "large":
      return "grapefruit"
```

As you can see, the decision tree is very easy to interpret. If you use a decision tree to make predictions and then need to determine why the tree made the decision that it did, it is very easy to inspect.

Also, unlike many other types of models, decision trees aren't sensitive to features that haven't been scaled or normalized.
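
To make the point about scaling concrete, here is a minimal sketch that fits the same tree on raw and on standardized iris features and compares the predictions. The use of `StandardScaler`, the fixed `random_state`, and the variable names are choices made for this illustration and are not part of the lab.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Fix random_state so both trees break ties between candidate splits the same way.
raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
scaled_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

raw_pred = raw_tree.predict(X)
scaled_pred = scaled_tree.predict(X_scaled)

# Scaling preserves the ordering of values within each feature, so the
# two trees should partition the samples in the same way.
agreement = (raw_pred == scaled_pred).mean()
print(agreement)
```

The split thresholds differ between the two trees because the feature values differ, but the fraction of matching predictions should be at or very near 1.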

### Create a Decision Tree

Now that we have the data loaded, we can create a decision tree. We'll use the [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) from scikit-learn to perform this task.

Note that there is also a [`DecisionTreeRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) that can be used for regression models. In practice you'll typically see decision trees applied to classification problems more than regression.
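
For completeness, here is a short sketch of the regression variant. The noisy sine-wave data is synthetic and invented for this illustration. It is not part of the lab.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine wave.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, y)

# A regression tree predicts one constant value per leaf, so a tree of
# depth 3 can output at most 2**3 = 8 distinct values.
predictions = reg.predict(X)
n_levels = len(np.unique(predictions))
r_squared = reg.score(X, y)
print(n_levels, r_squared)
```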

To build and train the model, we create an instance of the classifier and then call the `fit()` method that is used for all scikit-learn models.

In [0]:
from sklearn import tree

dt = tree.DecisionTreeClassifier()

dt.fit(
    iris_df[feature_names],
    iris_df[target_name]
)

Note that if this were a real application we'd keep some data to the side for testing.
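
As a rough sketch of what holding data aside could look like, the example below splits the iris data with `train_test_split` and scores a tree on both portions. The 20% test size and the `random_state` values are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows; stratify keeps the class proportions
# the same in the training and testing portions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

holdout_tree = DecisionTreeClassifier(random_state=0)
holdout_tree.fit(X_train, y_train)

# Compare accuracy on data the tree has seen with data it has not.
train_accuracy = holdout_tree.score(X_train, y_train)
test_accuracy = holdout_tree.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

A large gap between the two numbers would suggest that the tree has overfit the training data.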

### Visualize the Tree

We now have a decision tree and can use it to make predictions. But before we do that, let's take a look at the tree itself.

To do this we create a [`StringIO`](https://docs.python.org/3/library/io.html) object that we can export DOT data to. DOT is a graph description language that we can render with Python graphing utilities.


In [0]:
import io
import pydotplus

from IPython.display import Image  

dot_data = io.StringIO()  

tree.export_graphviz(
    dt,
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  

That tree looks pretty complex. A large number of branches is a sign that we may have overfit the model. Let's create the tree again, only this time we'll limit the depth.

In [0]:
from sklearn import tree

dt = tree.DecisionTreeClassifier(max_depth=2)

dt.fit(
    iris_df[feature_names],
    iris_df[target_name]
)

And plot to see the branching.

In [0]:
import io
import pydotplus

from IPython.display import Image  

dot_data = io.StringIO()  

tree.export_graphviz(
    dt,
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  

This tree is less likely to be overfitting, since we forced it to have a depth of 2. Holding out a test sample and performing validation would be a good way to check.

What are the `gini`, `samples`, and `value` items shown in the tree?

`gini` is the *Gini impurity*. This is a measure of the chance that you'll misclassify a randomly chosen sample at this node if you label it at random according to the node's class distribution. A smaller `gini` means a purer node.

`samples` is the number of samples that met the criteria to reach this node.

`value` is the count of each class of data that reached this node. Summing `value` should equal `samples`.
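
The relationship between `value` and `gini` can be checked by hand. The sketch below uses the standard definition of Gini impurity, one minus the sum of the squared class proportions. The helper name `gini_impurity` and the counts in the last call are made up for this illustration.

```python
def gini_impurity(class_counts):
    """Return 1 - sum(p_k ** 2) over the class proportions p_k."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# The full iris dataset: 50 samples of each of the three classes.
print(round(gini_impurity([50, 50, 50]), 3))  # 0.667
# A pure node that contains a single class.
print(round(gini_impurity([50, 0, 0]), 3))    # 0.0
# A mostly pure node with hypothetical counts.
print(round(gini_impurity([0, 49, 5]), 3))    # 0.168
```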

### Hyperparameters

There are many hyperparameters that you can tweak in your decision tree models. One of those is `criterion`, which determines the quality measure that the model will use to determine the shape of the tree.

Two of the available values are `gini` and `entropy`. `gini` is the [Gini impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity), while `entropy` is a measure of [information gain](https://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain).
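
As a worked example of the entropy-based measure, the sketch below computes the entropy of a node from its class counts and the information gain of a split. The helper names and the example split are assumptions made for this illustration. The formulas are the standard base-2 definitions.

```python
import math


def entropy(class_counts):
    """Return -sum(p_k * log2(p_k)) over the non-empty classes."""
    total = sum(class_counts)
    proportions = [count / total for count in class_counts if count > 0]
    return -sum(p * math.log2(p) for p in proportions)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the sample-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(
        sum(child) / total * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted


parent = [50, 50, 50]
# An illustrative split that isolates the first class completely.
children = [[50, 0, 0], [0, 50, 50]]

print(round(entropy(parent), 3))                     # 1.585
print(round(information_gain(parent, children), 3))  # 0.918
```

A split with higher information gain leaves the child nodes purer on average, which is what the tree looks for when `criterion="entropy"`.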

In the example below we switch the classifier to use "entropy" for `criterion`. You'll see in the resultant tree that we now see "entropy" instead of "gini", but the resultant trees are the same. For more complex models, you might want to test the different criteria though.

In [0]:
import io
import pydotplus

from IPython.display import Image  
from sklearn import tree

dt = tree.DecisionTreeClassifier(
    max_depth=2, 
    criterion="entropy"
)

dt.fit(
    iris_df[feature_names],
    iris_df[target_name]
)

dot_data = io.StringIO()  

tree.export_graphviz(
    dt,
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  

We've limited the depth of the tree using `max_depth`. We can also limit the number of samples required to be present in a node for it to be considered for splitting using `min_samples_split` and can limit the minimum size of a leaf node using `min_samples_leaf`. All of these hyperparameters help you to prevent your model from overfitting.
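
Here is a small sketch of how these limits change a tree. The specific values `20` and `10` are arbitrary and chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)
constrained = DecisionTreeClassifier(
    min_samples_split=20,  # a node needs at least 20 samples to be split
    min_samples_leaf=10,   # every leaf must keep at least 10 samples
    random_state=0,
).fit(X, y)

# Leaves are the nodes without children; scikit-learn marks them with -1.
is_leaf = constrained.tree_.children_left == -1
smallest_leaf = constrained.tree_.n_node_samples[is_leaf].min()

print(unconstrained.get_n_leaves(), constrained.get_n_leaves())
print(smallest_leaf)
```

The constrained tree cannot carve out tiny leaves to memorize individual samples, which is how these settings help against overfitting.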

There are many other hyperparameters that can be found in the [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) documentation.

### Exercise 1: Tuning Decision Tree Hyperparameters

In this exercise we will use a decision tree to classify wine quality in the [Red Wine Quality dataset](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009).

The target column in the dataset is `quality`. Quality is an integer score between 0 and 10 (inclusive). You'll use the other columns in the dataset to build a decision tree to predict wine quality.

For this exercise:

* Hold out some data for final testing of model generalization.
* Use [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to compare some hyperparameters for your model. You can choose which parameters to test.
* Print the hyperparameters of the best performing model.
* Print the accuracy of the best performing model on the hold-out dataset.
* Visualize the best performing tree.

Use as many text and code cells as you need to perform this exercise. We'll get you started with the code to authenticate and download the dataset.

First, upload your `kaggle.json` file and then run the code block below.

In [0]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'

Next, download the wine quality dataset.

In [0]:
! kaggle datasets download uciml/red-wine-quality-cortez-et-al-2009
! ls

**Student Solution**

In [0]:
# Your Code Goes Here

---

##### Answer Key

Get authentication credentials in place and then download the dataset.

In [0]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'
! kaggle datasets download uciml/red-wine-quality-cortez-et-al-2009
! ls

Load the data, split the data, and then create train and test `DataFrame` objects.

In [0]:
import pandas as pd

from sklearn.model_selection import train_test_split


df = pd.read_csv('red-wine-quality-cortez-et-al-2009.zip')

feature_names = df.columns.values[:-1]
target_name = df.columns.values[-1]

X_train, X_test, y_train, y_test = train_test_split(
    df[feature_names],
    df[target_name],
    stratify=df[target_name],
    test_size=0.2,  # hold out 20% of the rows for final testing
)

train_df = pd.DataFrame(
    X_train,
    columns=feature_names
)
train_df[target_name] = y_train

train_df

Perform a cross-validation grid search on the training data.

In [0]:
from sklearn import tree
from sklearn.model_selection import GridSearchCV


dt = tree.DecisionTreeClassifier()
grid_search = GridSearchCV(dt, {
    'max_depth': [None, 2, 4, 6],
    'min_samples_split': [2, 4, 16],
    'min_samples_leaf': [1, 4, 8],
})

grid_search.fit(
    train_df[feature_names],
    train_df[target_name]
)

grid_search.cv_results_

Print the best parameters and the best score.

In [0]:
print(grid_search.best_params_)
print(grid_search.best_score_)

Print the hold-out testing accuracy.

In [0]:
from sklearn.metrics import accuracy_score

# Score the best model on the held-out split created by train_test_split.
accuracy_score(
    y_test,
    grid_search.best_estimator_.predict(X_test)
)

Visualize the best tree.

In [0]:
import io
import pydotplus

from IPython.display import Image  

dot_data = io.StringIO()  

tree.export_graphviz(
    grid_search.best_estimator_,
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  

---

## Random Forests

Random forests are a simple yet powerful machine learning tool based on decision trees. Random forests are easy to understand, yet they touch upon many advanced machine learning concepts such as ensemble learning and bagging. These models can be used for both classification and regression. Also, since they are built from decision trees, they are not sensitive to unscaled data.

You can think of a random forest as a group decision made by a number of decision trees. For classification problems, the random forest creates multiple decision trees with different subsets of the data. When it is asked to classify a data point, it will ask all of the trees what they think and then take the majority decision.

For regression problems, the random forest will again use the opinions of multiple decision trees, but will take the mean (or some other aggregate) of the responses and use that as the regression value.

This type of modeling, where one model consists of other models, is called *ensemble learning*. Ensemble learning can often lead to better models because combining the differing opinions of a group of models can reduce overfitting.
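
The idea of a group decision can be sketched by hand with a few shallow trees and a majority vote. This is a simplified illustration and not how scikit-learn implements its forests: a real random forest also considers a random subset of features at each split, and scikit-learn's classifier averages the trees' predicted probabilities rather than counting hard votes. The number of trees and the depth limit are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Train each tree on its own random sample of rows, drawn with replacement.
trees = []
for _ in range(25):
    rows = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_depth=2).fit(X[rows], y[rows]))

# votes has shape (n_trees, n_samples): one predicted class per tree per sample.
votes = np.array([t.predict(X) for t in trees])

# Majority vote: the most frequent class in each column of votes.
majority = np.array(
    [np.bincount(column, minlength=3).argmax() for column in votes.T]
)

ensemble_accuracy = (majority == y).mean()
print(ensemble_accuracy)
```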

### Create a Random Forest

Creating a random forest is as easy as creating a decision tree.

scikit-learn provides a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and a [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), which can be used to combine the predictive power of multiple decision trees.

In [0]:
import pandas as pd

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


iris_bunch = load_iris()

feature_names = iris_bunch.feature_names
target_name = 'species'

iris_df = pd.DataFrame(
    iris_bunch.data,
    columns=feature_names
)

iris_df[target_name] = iris_bunch.target

rf = RandomForestClassifier()
rf.fit(
    iris_df[feature_names],
    iris_df[target_name]
)

You can look at different trees in the random forest to see how their decision branching differs. By default there are `100` decision trees created for the model.

Let's view a few.

Run the code below a few times and see if you notice a difference in the trees that are shown.

In [0]:
import io
import pydotplus
import random

from IPython.display import Image
from sklearn import tree

# sklearn.externals.six has been removed from scikit-learn; use io.StringIO.
dot_data = io.StringIO()

tree.export_graphviz(
    random.choice(rf.estimators_),
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  

### Make Predictions

Just like any other scikit-learn model you can use the `predict()` method to make predictions.

In [0]:
# Select row 121 as a one-row DataFrame so the feature names are preserved.
print(rf.predict(iris_df.loc[[121], feature_names]))
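
Classifiers in scikit-learn that estimate class probabilities, including random forests, also provide `predict_proba()`. The sketch below is self-contained and retrains a forest on the iris data. The variable names and `random_state` are choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

# Index with a list to keep the 2D shape (1, n_features).
sample = iris.data[[121]]

# One probability per class, averaged over the trees in the forest.
probabilities = forest.predict_proba(sample)
predicted_class = forest.predict(sample)[0]

print(probabilities)
print(iris.target_names[predicted_class])
```

The predicted class is the one with the highest averaged probability, so `predict()` and `predict_proba()` always agree.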

### Hyperparameters

Many of the hyperparameters available in decision trees are also available in random forest models. There are however some hyperparameters that are only available in random forests.

The two most important are `bootstrap` and `oob_score`. These two hyperparameters are relevant to ensemble learning.

`bootstrap` determines whether the model uses [bootstrap sampling](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). With bootstrapping, each tree in the forest is trained on a random sample drawn from the full dataset rather than on the full dataset itself. The sample is drawn with replacement, so a data point can appear more than once in the sample for one tree, and it can appear in the samples for more than one tree.
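
Sampling with replacement can be sketched with NumPy alone. The example below draws one bootstrap sample of row indices and measures how much of the dataset it covers. The seed and variable names are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 150  # the number of rows in the iris dataset

# Draw n_samples row indices with replacement to form one sample.
bag = rng.integers(0, n_samples, size=n_samples)

# Some rows are drawn several times, so others are never drawn at all.
unique_fraction = len(np.unique(bag)) / n_samples
print(unique_fraction)
```

For large datasets a bootstrap sample of the same size as the data is expected to contain about 63% of the distinct rows (1 - 1/e), which leaves roughly a third of the rows unseen by any given tree.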

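`oob_score` stands for "out-of-bag score". A bootstrap sample is referred to as a *bag* in machine learning parlance, and the data points that were not drawn for a given tree are *out of bag* for that tree. When `oob_score` is set to `True`, the forest is scored using only each tree's out-of-bag data points, which gives an estimate of how well the model generalizes without a separate validation set. The result is stored in the `oob_score_` attribute.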
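
A minimal sketch of requesting the out-of-bag score on the iris data follows. The `n_estimators` and `random_state` values are arbitrary choices for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True only works together with bootstrap=True (the default).
oob_forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
oob_forest.fit(X, y)

# Accuracy estimated from rows that each tree did not see during training.
oob_accuracy = oob_forest.oob_score_
print(oob_accuracy)
```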

### Exercise 2: Feature Importance

In this exercise we will use the [UCI Abalone dataset](https://www.kaggle.com/hurshd0/abalone-uci) to determine the age of sea snails.

The target feature in the dataset is `rings`, which is a proxy for age in the snails. This is a numeric value, but it is stored as an integer and has a biological limit, so we can think of this as a classification problem and use a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

You will download the dataset and train a random forest classifier. After you have fit the classifier, the `feature_importances_` attribute of the model will be populated. Use the importance scores to print the least important feature.

*Note that some of the features are categorical string values. You'll need to convert these to numeric values to use them in the model.*

Use as many text and code blocks as you need below to perform this exercise.

**Student Solution**

In [0]:
# Your Code Goes Here

##### Answer Key

First we download the data.

In [0]:
! kaggle datasets download hurshd0/abalone-uci
! ls

And load it into a `DataFrame`.

In [0]:
import pandas as pd

df = pd.read_csv('abalone-uci.zip')
df

We check for missing values... there are none.

In [0]:
df.isna().any()

However, sex was represented as a string. We convert it into an integer value.

In [0]:
sexes = sorted(df['sex'].unique())
sex_lookup = {sexes[i]: i for i in range(len(sexes))}
df['sex'] = df['sex'].map(sex_lookup)

Next we capture the feature and target names.

In [0]:
feature_names = df.columns.values[:-1]
target_name = df.columns.values[-1]

target_name, feature_names

And then fit the model.

In [0]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

model.fit(df[feature_names], df[target_name])

Only to learn that `sex` is the least important feature.

In [0]:
import numpy as np

feature_names[np.argmin(model.feature_importances_)]
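
To see the full ranking rather than only the minimum, the importance scores can be paired with their feature names. The sketch below uses the iris data so that it runs without the Kaggle download. The variable names are choices made for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

# Pair each importance score with its feature name and sort ascending.
importances = pd.Series(
    forest.feature_importances_,
    index=iris.feature_names,
).sort_values()

print(importances)
print(importances.index[0])  # the least important feature
```

The scores are normalized to sum to 1, so each value can be read as a share of the total importance.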

---