Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Random forests: Problem solving

In this session, we'll use the `boston` dataset, which has been used to examine the relationship between clean air and house prices:


| Variable | Type | Description |
|:----|:-----|:----------|
|crim | Ratio | per capita crime rate by town | 
|zn | Ratio | proportion of residential land zoned for lots over 25,000 sq. ft. | 
|indus | Ratio | proportion of non-retail business acres per town | 
|chas | Nominal (binary) | Charles River dummy variable (=1 if tract bounds river, =0 otherwise) | 
|nox | Ratio | nitrogen oxides concentration (parts per 10 million) | 
|rm | Ratio | average number of rooms per dwelling | 
|age | Ratio | proportion of owner-occupied units built prior to 1940 | 
|dis | Ratio | weighted mean of distances to five Boston employment centers | 
|rad | Ordinal | index of accessibility to radial highways | 
|tax | Ratio | full-value property tax rate per \$10,000 | 
|ptratio | Ratio | pupil-teacher ratio by town | 
|lstat | Ratio | percent lower status of population (defined as non-high school graduate, manual labor) | 
|medv | Ratio | median value of owner-occupied homes in \$1000s | 

<div style="text-align:center;font-size: smaller">
    <b>Source:</b> This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
</div>
<br>
    
As before, we'll try to predict `medv` using the rest of the variables.

**Because `medv` is a ratio variable, we will use random forests of regression trees, not classification (decision) trees.**

Additionally, we will compare the performance of three models on this problem:

- Regression trees
- Bagged regression trees
- Random forest regression trees

## Load data

Import `pandas` to load a dataframe.

Load the dataframe.
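
As a rough sketch of the loading step: the location of the `boston` file is not given here, so the example below reads a tiny inline CSV instead. The column subset and every value in `csv_text` are made-up placeholders, and the name `dataframe` is just an illustrative choice.

```python
import io

import pandas as pd

# Placeholder CSV text standing in for the real boston file (values are made up).
csv_text = """crim,rm,lstat,medv
0.10,6.0,10.0,22.0
0.50,5.5,18.0,17.0
0.05,7.0,5.0,33.0
"""

# With the real data, pass the path to the CSV file instead of the StringIO object.
dataframe = pd.read_csv(io.StringIO(csv_text))
print(dataframe.head())
```
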

## Explore data

Some of these steps we've done before with these data, so we'll skip the normal interpretation steps on those parts.

Describe the data.
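
A minimal sketch of describing a dataframe, assuming a synthetic stand-in for the `boston` data (the three columns and their values below are randomly generated, not real):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the boston dataframe (assumption: made-up values).
rng = np.random.default_rng(1)
n = 200
dataframe = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, n),
    "lstat": rng.uniform(2, 35, n),
})
dataframe["medv"] = 5 * dataframe["rm"] - 0.5 * dataframe["lstat"] + rng.normal(0, 2, n)

# Count, mean, standard deviation, min, quartiles, and max for each column.
summary = dataframe.describe()
print(summary)
```
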

Make a correlation heatmap.

First import `plotly.express`.

Create a correlation matrix.

And show the correlation heatmap with row/column labels.

Because these variables are highly correlated (and numeric), a scatterplot matrix would make a lot of sense.

Use `plotly` to make a `scatter_matrix` of the dataframe.
If you have a hard time reading the labels, you can give it something like `width=1000` and `height=1000` to make it bigger.

-----------
**QUESTION:**

Remembering that a perfect correlation is a line, and no correlation is a uniform random scattering of datapoints, what would you say about the pattern of these scatterplots overall?
What about the scatterplots in the last row (i.e. those involving `medv`) in particular?

**ANSWER: (click here to edit)**


<hr>

Ultimately we want to predict median house value (`medv`), so make a histogram of that.

------------------
**QUESTION:**

Do you think we need to transform `medv` to make it more normal? Why or why not?

**ANSWER: (click here to edit)**


<hr>

## Prepare train/test sets

If we were just using bagging or random forests, we could use OOB performance instead of splitting the data into training and testing sets.

However, splitting is necessary if we want to compare to regression trees.
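
For reference, OOB performance can be sketched as below. This is an illustration on synthetic data (made-up columns and values), not part of the exercise itself; `oob_score=True` asks `sklearn` to score each observation using only the trees that did not see it during training.

```python
import numpy as np
import pandas as pd
from sklearn import ensemble

# Synthetic stand-in for the boston data (assumption: made-up values).
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, n),
    "lstat": rng.uniform(2, 35, n),
})
y = 5 * X["rm"] - 0.5 * X["lstat"] + rng.normal(0, 2, n)

# No train/test split: the out-of-bag observations act as the held-out data.
forest = ensemble.RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
forest.fit(X, y)

# For a regressor, oob_score_ is an r^2 computed from out-of-bag predictions.
print(forest.oob_score_)
```
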

Start by dropping the response variable, `medv`, to make a new dataframe called `X`.

Save a dataframe with just `medv` in `Y`.

Import `sklearn.model_selection` to split the data into train/test sets.

And do the actual split.
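
The preparation steps above could be sketched as follows, assuming a synthetic stand-in dataframe. The `random_state=1` argument and the default 75/25 split are assumptions chosen for reproducibility, not settings stated in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Synthetic stand-in for the boston dataframe (assumption: made-up values).
rng = np.random.default_rng(1)
n = 200
dataframe = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, n),
    "lstat": rng.uniform(2, 35, n),
})
dataframe["medv"] = 5 * dataframe["rm"] - 0.5 * dataframe["lstat"] + rng.normal(0, 2, n)

# Predictors: everything except the response variable.
X = dataframe.drop(columns=["medv"])

# Response: double brackets keep Y as a one-column dataframe.
Y = dataframe[["medv"]]

# By default, 25% of the rows are held out for testing.
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, random_state=1)
```
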

## Fit models

Fit three models in turn:

- Regression tree
- Bagged regression tree
- Random forest regression tree

Import the `sklearn.tree` and `sklearn.ensemble` libraries.

### Regression tree

Create the regression tree model.
Go ahead and create it with a freestyle `random_state=1` so we all get the same results.

Fit the regression tree model and get predictions.
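
A minimal sketch of creating and fitting the regression tree, assuming synthetic stand-in data; the variable names (`regressionTree`, `predictions`) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection, tree

# Synthetic stand-in for the boston data (assumption: made-up values).
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, n),
    "lstat": rng.uniform(2, 35, n),
})
Y = pd.DataFrame({"medv": 5 * X["rm"] - 0.5 * X["lstat"] + rng.normal(0, 2, n)})
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, random_state=1)

# random_state=1 makes the tree reproducible.
regressionTree = tree.DecisionTreeRegressor(random_state=1)
regressionTree.fit(X_train, Y_train)

# Predicted medv for each row of the test set.
predictions = regressionTree.predict(X_test)
```
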

### Bagged regression tree

Next create the bagged regression tree model by using `BaggingRegressor`.
Just as `BaggingClassifier` uses a decision tree by default, `BaggingRegressor` uses a regression tree by default.
Use the same parameters as the random forest notebook (e.g. 100 trees, etc).

Interestingly, for this model, `sklearn` requires us to use `ravel` on `Y` when fitting the model, so import `numpy`.

Fit the bagged regression tree using `ravel` on `Y` and get predictions.
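
The bagging steps might look like the sketch below on synthetic stand-in data. Only the 100 trees come from the text above; `random_state=1` is an assumption, and any other parameters from the random forest notebook are not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn import ensemble, model_selection

# Synthetic stand-in for the boston data (assumption: made-up values).
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, n),
    "lstat": rng.uniform(2, 35, n),
})
Y = pd.DataFrame({"medv": 5 * X["rm"] - 0.5 * X["lstat"] + rng.normal(0, 2, n)})
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, random_state=1)

# The default base estimator is a regression tree; n_estimators sets how many are bagged.
baggedRegressionTree = ensemble.BaggingRegressor(n_estimators=100, random_state=1)

# np.ravel flattens the one-column dataframe Y_train into a 1-d array.
baggedRegressionTree.fit(X_train, np.ravel(Y_train))
predictions = baggedRegressionTree.predict(X_test)
```
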

### Random forest regression tree

Next create the random forest regression tree model by using `RandomForestRegressor`, which also uses a regression tree by default.
Use the same parameters as before.

Fit the random forest regression tree using `ravel` on `Y` and get predictions.
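
A comparable sketch for the random forest, on synthetic stand-in data. As with bagging, `n_estimators=100` follows the text and `random_state=1` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn import ensemble, model_selection

# Synthetic stand-in for the boston data (assumption: made-up values).
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, n),
    "lstat": rng.uniform(2, 35, n),
})
Y = pd.DataFrame({"medv": 5 * X["rm"] - 0.5 * X["lstat"] + rng.normal(0, 2, n)})
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, random_state=1)

# A forest of 100 regression trees, each grown on a bootstrap sample.
randomForest = ensemble.RandomForestRegressor(n_estimators=100, random_state=1)
randomForest.fit(X_train, np.ravel(Y_train))
predictions = randomForest.predict(X_test)
```
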

## Evaluate the models

### Regression tree

- Get the $r^2$ on the *training* set

- Get the $r^2$ on the *testing* set

### Bagged regression tree

- Get the $r^2$ on the *training* set

- Get the $r^2$ on the *testing* set

### Random forest regression tree

- Get the $r^2$ on the *training* set

- Get the $r^2$ on the *testing* set
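
The evaluation steps for all three models can be sketched in one loop, as below. This assumes synthetic stand-in data and uses `metrics.r2_score` as one way to get $r^2$; the dictionary names `models` and `scores` are illustrative, and the numbers it prints say nothing about the real `boston` results.

```python
import numpy as np
import pandas as pd
from sklearn import ensemble, metrics, model_selection, tree

# Synthetic stand-in for the boston data (assumption: made-up values).
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, n),
    "lstat": rng.uniform(2, 35, n),
})
Y = pd.DataFrame({"medv": 5 * X["rm"] - 0.5 * X["lstat"] + rng.normal(0, 2, n)})
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, random_state=1)

models = {
    "regression tree": tree.DecisionTreeRegressor(random_state=1),
    "bagged regression tree": ensemble.BaggingRegressor(n_estimators=100, random_state=1),
    "random forest": ensemble.RandomForestRegressor(n_estimators=100, random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, np.ravel(Y_train))
    # r2_score takes the true values first, then the predictions.
    scores[name] = {
        "train": metrics.r2_score(Y_train, model.predict(X_train)),
        "test": metrics.r2_score(Y_test, model.predict(X_test)),
    }
    print(name, scores[name])
```
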

------------------
**QUESTION:**

Compare the three models with respect to their *training data performance*. Which is better?
Now compare the three models with respect to their *testing data performance*. Which is better?
What do these differences tell you?

**ANSWER: (click here to edit)**


<hr>

## Feature importance

Calculate the feature importance for the three models and plot it as a `plotly` bar chart.

To get the column names for `x=` in the bar chart, you can use `from X get columns` as a shortcut.

### Regression tree

### Bagged regression tree

For some reason, `sklearn` does not implement `feature_importances_` for `BaggingRegressor`, so use the following freestyle for the `y=` part of your plot: 

`np.mean([ tree.feature_importances_ for tree in baggedRegressionTree.estimators_ ], axis=0)`

**You will need to change `baggedRegressionTree` to whatever you called this model above.**
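
To show what that expression computes, here is a minimal sketch on synthetic stand-in data (made-up columns and values). It relies on the default `BaggingRegressor` settings, where every tree sees all features, so each tree's importances line up with the columns of `X`.

```python
import numpy as np
import pandas as pd
from sklearn import ensemble

# Synthetic stand-in for the boston data (assumption: made-up values).
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, n),
    "lstat": rng.uniform(2, 35, n),
    "nox": rng.uniform(0.4, 0.9, n),
})
y = 5 * X["rm"] - 0.5 * X["lstat"] + rng.normal(0, 2, n)

baggedRegressionTree = ensemble.BaggingRegressor(n_estimators=100, random_state=1)
baggedRegressionTree.fit(X, y)

# Stack the per-tree importances (100 rows, one column per feature) and average over trees.
importances = np.mean(
    [t.feature_importances_ for t in baggedRegressionTree.estimators_], axis=0
)
print(dict(zip(X.columns, importances)))
```

The resulting array has one value per column of `X`, so it can be passed as `y=` in the bar chart in the same way as `feature_importances_`.
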

### Random forest regression tree

This uses `feature_importances_`, so you can make the plot exactly like you would for the regression tree model.

-----------
**QUESTION:**

Look carefully at the three feature importance plots, hovering your mouse over each bar.
What are the major differences between them?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

What other tool(s) can you think of that we haven't tried that we could use to compare these models?

**ANSWER: (click here to edit)**


<hr>
