#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Random Forests

Decision Trees and Random Forests are powerful machine learning algorithms capable of performing both classification and regression tasks.

## Overview

### Learning Objectives

* Create and apply a decision tree algorithm for classification.
* Perform ensemble learning using random forests.
* Apply limits to depth and split size to reduce overfitting.

### Prerequisites

* Introduction to scikit-learn
* Classification
* Visualizations

### Estimated Duration

60 minutes

### Grading Criteria

Each exercise is worth 3 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Attempted exercise, but code does not run |
| 2      | Attempted exercise, code runs, but produces incorrect answer |
| 3      | Exercise completed successfully |

There is 1 exercise in this Colab, so there are 3 points available. The grading scale will be 3 points.

## Load Data

Let's start by loading some data. We'll use the familiar iris dataset from scikit-learn.

In [0]:
import pandas as pd

from sklearn.datasets import load_iris

iris_bunch = load_iris()

feature_names = iris_bunch.feature_names
target_name = 'species'

iris_df = pd.DataFrame(
    iris_bunch.data,
    columns=feature_names
) 

iris_df[target_name] = iris_bunch.target

iris_df.head()
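
The `species` column holds integer class labels rather than names. As a small sketch (the variable names `iris`, `species`, and `counts` are chosen here for illustration), the labels can be mapped back to species names and counted to confirm how the classes are distributed:

```python
import pandas as pd

from sklearn.datasets import load_iris

iris = load_iris()

# Map each integer label (0, 1, 2) to its species name.
label_to_name = dict(enumerate(iris.target_names))
species = pd.Series(iris.target).map(label_to_name)

# Count how many samples belong to each species.
counts = species.value_counts()
print(counts)
```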

## Create a Decision Tree

Now that we have the data loaded, we can create a decision tree.

Remember that in a real application, we'd set some data aside for testing.

In [0]:
from sklearn import tree

dt = tree.DecisionTreeClassifier()

dt.fit(
    iris_df[feature_names],
    iris_df[target_name]
)

## Visualize the Tree

We now have a decision tree and can use it to make predictions. But before we do that, let's take a look at the tree itself.

To do this, we create a `StringIO` object that we can export DOT data to. DOT is a graph description language that we can plot with Python graphing utilities.


In [0]:
import pydotplus

from IPython.display import Image  
from io import StringIO

dot_data = StringIO()  

tree.export_graphviz(
    dt,
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  
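
If Graphviz or `pydotplus` isn't available, the splits can also be inspected as plain text. The following is a minimal sketch using `sklearn.tree.export_text`, assuming a scikit-learn version that includes it (0.21 or later). It refits its own tree (`clf`, a name chosen for this example) so that it runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# random_state is fixed only so repeated runs print the same rules.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the learned splits as indented, rule-like text.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indentation level in the output corresponds to one level of depth in the tree.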

That tree looks pretty complex, so there is a good chance that it overfits the data. Let's create the tree again, only this time we'll limit the depth.

In [0]:
from sklearn import tree

dt = tree.DecisionTreeClassifier(max_depth=2)

dt.fit(
    iris_df[feature_names],
    iris_df[target_name]
)

And plot to see what branching was performed.

In [0]:
import pydotplus

from IPython.display import Image  
from io import StringIO

dot_data = StringIO()  

tree.export_graphviz(
    dt,
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  

This tree seems less likely to be overfitting. Holding out a test sample and performing validation would be a good way to check.
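
One way to run that check is sketched below. It assumes a 70/30 stratified split and a fixed `random_state`, both chosen arbitrarily for this example. It trains an unrestricted tree and a depth-limited tree on the same training data, then compares their accuracy on the training and test sets:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Hold out 30% of the samples for testing (split size is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.3,
    stratify=iris.target,
    random_state=42,
)

# Maps max_depth -> (training accuracy, test accuracy).
scores = {}
for depth in (None, 2):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    scores[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(
        f"max_depth={depth}: "
        f"train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}"
    )
```

A large gap between training and test accuracy is a common sign of overfitting.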

## Create a Random Forest

Another way to help prevent overfitting and to create a better overall model is to use a random forest, which trains many decision trees on random samples of the data and combines their predictions.

In [0]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=10, max_depth=5)
rf.fit(
    iris_df[feature_names],
    iris_df[target_name]
)
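
Whether the forest actually generalizes better than a single tree can be estimated with cross-validation. This is a rough sketch assuming 5 folds and fixed seeds, both arbitrary choices. On a dataset as small as iris, the two scores may be close:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(
        n_estimators=10, max_depth=5, random_state=0
    ),
}

# Mean accuracy across 5 cross-validation folds for each model.
cv_means = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, iris.data, iris.target, cv=5)
    cv_means[name] = fold_scores.mean()
    print(f"{name}: mean accuracy = {cv_means[name]:.3f}")
```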

You can look at different trees in the random forest to see how their decision branching differs.

In [0]:
import pydotplus

from IPython.display import Image  
from io import StringIO

dot_data = StringIO()  

tree_to_view = 5

tree.export_graphviz(
    rf.estimators_[tree_to_view],
    out_file=dot_data,  
    feature_names=feature_names
)  

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  

Image(graph.create_png())  
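
Instead of plotting each tree in turn, the trees can be summarized in a loop. The sketch below refits a forest with a fixed seed (an assumption made for repeatability) and prints the depth, leaf count, and root split feature of each tree. It assumes a scikit-learn version that provides `get_depth` and `get_n_leaves` (0.21 or later):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

forest = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
forest.fit(iris.data, iris.target)

# Collect (index, depth, leaf count, root split feature) for every tree.
tree_summaries = []
for index, estimator in enumerate(forest.estimators_):
    # tree_.feature[0] is the index of the feature used at the root node.
    root_feature = iris.feature_names[estimator.tree_.feature[0]]
    summary = (index, estimator.get_depth(), estimator.get_n_leaves(), root_feature)
    tree_summaries.append(summary)
    print(
        f"tree {index}: depth={summary[1]}, "
        f"leaves={summary[2]}, root split on {root_feature}"
    )
```

Because each tree sees a different bootstrap sample and a random subset of features at each split, the root features and tree shapes usually vary from tree to tree.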

## Make Predictions

And now we can make predictions, either with our decision tree or our random forest.

In [0]:
rf.predict(iris_df.iloc[[121]][feature_names])
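
The prediction is an integer class label. As a sketch, the label can be translated back to a species name, and `predict_proba` shows how the forest's trees divide their votes among the classes. The forest is refit here with a fixed seed so the example runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

forest = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
forest.fit(iris.data, iris.target)

# Double brackets keep the sample two-dimensional: one row, four features.
sample = iris.data[[121]]

predicted_index = forest.predict(sample)[0]
predicted_name = iris.target_names[predicted_index]

# One probability per class, averaged over the trees in the forest.
probabilities = forest.predict_proba(sample)[0]

print(f"predicted species: {predicted_name}")
for name, probability in zip(iris.target_names, probabilities):
    print(f"  {name}: {probability:.2f}")
```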

# Exercises



## Exercise 1

scikit-learn also has a [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) class that can be used to make regression predictions using decision trees. Download data about [suicide rates around the world](https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016) and build a model to predict those rates.

You'll need to:

*   [Download](https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016) the data
*   Load the data in Colab
*   Convert the string features into numbers (decision trees like to work in numbers)
*   Shuffle the data and split off some for testing
*   Train a model using the training data
*   Visualize the tree
*   Test the model and find the RMSE

If you have time, try adjusting the depth and other parameters of the tree. Also try replacing the decision tree with a [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) and see what effect that has on your RMSE.
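
As a hint, the encode, split, train, and score steps are sketched below on a small made-up DataFrame. The column names (`region`, `group`, `year`, `rate`) and all values are hypothetical and are not taken from the Kaggle dataset. One-hot encoding with `pd.get_dummies` is only one of several ways to convert strings into numbers:

```python
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up toy data; every column name and value here is hypothetical.
rng = np.random.default_rng(0)
n_rows = 200
toy_df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east"], size=n_rows),
    "group": rng.choice(["a", "b"], size=n_rows),
    "year": rng.integers(2000, 2010, size=n_rows),
})
toy_df["rate"] = (
    toy_df["region"].map({"north": 5.0, "south": 10.0, "east": 15.0})
    + toy_df["group"].map({"a": 0.0, "b": 3.0})
    + rng.normal(0.0, 1.0, size=n_rows)
)

# Convert the string columns into numeric indicator columns.
features = pd.get_dummies(
    toy_df.drop(columns="rate"), columns=["region", "group"], dtype=float
)
target = toy_df["rate"]

# Shuffle and hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, shuffle=True, random_state=0
)

regressor = DecisionTreeRegressor(max_depth=4, random_state=0)
regressor.fit(X_train, y_train)

# RMSE is the square root of the mean squared error on the test set.
rmse = np.sqrt(mean_squared_error(y_test, regressor.predict(X_test)))
print(f"RMSE: {rmse:.3f}")
```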

### Student Solution

In [0]:
# Your solution goes here.

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO