Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Regression trees: Problem solving

In this session we will look at the `mpg` dataset, which contains measurements of fuel economy and other properties of cars from the 1970s.

Our goal is to predict miles per gallon.

| Variable     | Type     | Description                                       |
|:-------------|:---------|:--------------------------------------------------|
| mpg          | Ratio    | Miles per gallon; fuel economy                    |
| cylinders    | Ordinal  | Number of cylinders in engine                     |
| displacement | Ratio    | Volume inside cylinders (likely cubic inches)     |
| horsepower   | Ratio    | Unit of power                                     |
| weight       | Ratio    | Weight of car (likely pounds)                     |
| acceleration | Ratio    | Acceleration of car (likely in seconds to 60 MPH) |
| model_year   | Interval | Year of car manufacture; last two digits          |
| origin       | Nominal  | Numeric code corresponding to continent           |
| name         | Nominal  | Car model name (ID)                               |

<div style="text-align:center;font-size: smaller">
    <b>Source:</b> This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
</div>

## Import libraries

We need to load our data into a dataframe and do some plots, so import `pandas` and `plotly.express` below.

Load `"datasets/mpg.csv"` into a dataframe, using `index_col=` to tell it to use `"name"` as an ID.

The last time we looked at this data, we used the `mpg-nona` dataset, from which I'd already removed the `NaN` values.

This `mpg` dataset still has `NaN` values, so remove them.

Get the descriptive statistics of the data (which include the five-number summary) using `describe`.
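The loading, cleaning, and summary steps above can be sketched as follows. This is only an illustration: the three-row CSV text is invented stand-in data so the cell runs without the file, and the variable name `dataframe` is an assumption. In the notebook, you would pass `"datasets/mpg.csv"` to `read_csv` instead.

```python
import io

import pandas as pd

# Invented stand-in for "datasets/mpg.csv" (these rows are NOT real data).
# In the notebook, use: pd.read_csv("datasets/mpg.csv", index_col="name")
csv_text = io.StringIO(
    "mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name\n"
    "18.0,8,300.0,130.0,3500,12.0,70,1,example car a\n"
    "25.0,4,100.0,,2000,19.0,71,1,example car b\n"
    "30.0,4,90.0,60.0,1900,20.0,72,3,example car c\n"
)

# index_col="name" makes the car name the row ID instead of a feature
dataframe = pd.read_csv(csv_text, index_col="name")

# Drop every row that contains at least one NaN
dataframe = dataframe.dropna()

# count, mean, std, min, quartiles, and max for each numeric column
summary = dataframe.describe()
print(summary)
```

Note that `dropna()` returns a new dataframe, so the result has to be assigned back (or to a new name) for the rows to actually be removed.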

------------------
**QUESTION:**

Do the min/max/mean of these values look OK to you?

**ANSWER: (click here to edit)**


<hr>

Ultimately we want to predict miles per gallon (`mpg`), so make a histogram of that.

------------------
**QUESTION:**

Do you think we need to transform `mpg` to make it more normal? Why or why not?

**ANSWER: (click here to edit)**


<hr>

Use `plotly` to make a `scatter_matrix` of the dataframe.
If you have a hard time reading the labels, you can give it something like `width=1000` and `height=1000` to make it bigger.

-----------------
**QUESTION:**

Looking at the scatterplot matrix, which variables have curved (nonlinear) relationships with `mpg`?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

What do you think would be better for this data, linear regression or a regression tree? Why?

**ANSWER: (click here to edit)**


<hr>

## Prepare train/test sets

You need to split the data into train/test sets.

Start by dropping the label, `mpg`, to make a new dataframe called `X`.

Save a dataframe with just `mpg` in `Y`.

Import `sklearn.model_selection` to split the data into train/test sets.

And do the actual split.
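The three steps above can be sketched as follows. The dataframe is randomly generated stand-in data, and `random_state=1` and the name `splits` are assumptions chosen for reproducibility and convenience, not requirements.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection as model_selection

# Random stand-in data (NOT the real dataset)
rng = np.random.default_rng(0)
n = 100
weight = rng.uniform(1600, 5000, n)
dataframe = pd.DataFrame({
    "mpg": 46 - 0.007 * weight + rng.normal(0, 2, n),
    "weight": weight,
    "model_year": rng.integers(70, 83, n),
})

# X: every column except the label
X = dataframe.drop(columns=["mpg"])

# Y: a one-column dataframe holding only the label
Y = dataframe[["mpg"]]

# train_test_split returns [X_train, X_test, Y_train, Y_test];
# by default 25% of the rows go to the test set
splits = model_selection.train_test_split(X, Y, random_state=1)
X_train, X_test, Y_train, Y_test = splits
```

Keeping the result in a single list means `splits[0]` and `splits[2]` are the training features and training labels, while `splits[1]` and `splits[3]` are the test counterparts.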

## Fit models

Fit two models, a linear regression model and a regression tree model.

Import the `sklearn.tree` and `sklearn.linear_model` libraries.

### Linear regression model

Create the linear regression model.

Fit the linear regression model and get predictions.

### Regression tree model

Create the regression tree model.

Fit the regression tree model and get predictions.
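Both fits can be sketched together, since they follow the same create, fit, predict pattern. The data below is a random stand-in, and the names `lm` and `dt` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import sklearn.linear_model as linear_model
import sklearn.model_selection as model_selection
import sklearn.tree as tree

# Random stand-in data (NOT the real dataset)
rng = np.random.default_rng(0)
n = 100
weight = rng.uniform(1600, 5000, n)
X = pd.DataFrame({"weight": weight, "model_year": rng.integers(70, 83, n)})
Y = pd.DataFrame({"mpg": 46 - 0.007 * weight + rng.normal(0, 2, n)})
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, Y, random_state=1
)

# Linear regression: create, fit on training data, predict on test data
lm = linear_model.LinearRegression()
lm.fit(X_train, Y_train)
lm_predictions = lm.predict(X_test)

# Regression tree: same pattern; random_state makes tie-breaking repeatable
dt = tree.DecisionTreeRegressor(random_state=0)
dt.fit(X_train, Y_train)
dt_predictions = dt.predict(X_test)
```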

## Evaluate the models

For the linear regression model:

- Get the $r^2$ on the *training* set

- Get the $r^2$ on the *testing* set

For the regression tree model:

- Get the $r^2$ on the *training* set

- Get the $r^2$ on the *testing* set
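One way to get these four numbers is the `score` method, which for scikit-learn regressors returns $r^2$. The sketch below uses random stand-in data, so the printed values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
import sklearn.linear_model as linear_model
import sklearn.model_selection as model_selection
import sklearn.tree as tree

# Random stand-in data (NOT the real dataset)
rng = np.random.default_rng(0)
n = 100
weight = rng.uniform(1600, 5000, n)
X = pd.DataFrame({"weight": weight, "model_year": rng.integers(70, 83, n)})
Y = pd.DataFrame({"mpg": 46 - 0.007 * weight + rng.normal(0, 2, n)})
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, Y, random_state=1
)

lm = linear_model.LinearRegression().fit(X_train, Y_train)
dt = tree.DecisionTreeRegressor(random_state=0).fit(X_train, Y_train)

# score(features, labels) computes r^2 of the model's predictions
lm_train_r2 = lm.score(X_train, Y_train)
lm_test_r2 = lm.score(X_test, Y_test)
dt_train_r2 = dt.score(X_train, Y_train)
dt_test_r2 = dt.score(X_test, Y_test)

print("linear regression:", lm_train_r2, lm_test_r2)
print("regression tree:  ", dt_train_r2, dt_test_r2)
```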

------------------
**QUESTION:**

Compare the two models with respect to their *training data performance*. Which is better?
Now compare the two models with respect to their *testing data performance*. Which is better?
What do these differences tell you?

**ANSWER: (click here to edit)**


<hr>

## Penalize the regression tree

Start with `ccp_alpha=0.004` and look at performance.
Try a few values less than this and greater than this (max of 1.0).
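Trying several penalties can be sketched as a loop that refits the tree for each value. The data is a random stand-in, and the particular list of `ccp_alpha` values is only an example; useful values depend on the scale of the real `mpg` data.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection as model_selection
import sklearn.tree as tree

# Random stand-in data (NOT the real dataset)
rng = np.random.default_rng(0)
n = 100
weight = rng.uniform(1600, 5000, n)
X = pd.DataFrame({"weight": weight, "model_year": rng.integers(70, 83, n)})
Y = pd.DataFrame({"mpg": 46 - 0.007 * weight + rng.normal(0, 2, n)})
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, Y, random_state=1
)

# Example penalties around the suggested starting value of 0.004
alphas = [0.001, 0.004, 0.01, 0.1, 1.0]
results = []
for alpha in alphas:
    # Larger ccp_alpha prunes more aggressively, giving a smaller tree
    dt = tree.DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    dt.fit(X_train, Y_train)
    results.append({
        "ccp_alpha": alpha,
        "train_r2": dt.score(X_train, Y_train),
        "test_r2": dt.score(X_test, Y_test),
        "nodes": dt.tree_.node_count,
    })

print(pd.DataFrame(results))
```

Collecting the node count alongside the two $r^2$ values makes it easier to see how much of the tree each penalty removes.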

------------------
**QUESTION:**

What values did you try? Did any of them do better than linear regression on the test set?

**ANSWER: (click here to edit)**


<hr>

## Visualize the model

If you have time, try to do this for whatever value of `ccp_alpha` you liked the best.

Import `graphviz`.

Create the graph and display it.

To save you time, here are the feature names: `"cylinders","displacement","horsepower","weight","acceleration","model_year","origin"`
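A sketch of the visualization step, using `export_graphviz` to produce DOT text and `graphviz.Source` to wrap it for display. The features here are random stand-in numbers under the real column names, and `ccp_alpha=0.004` is just the starting value from the previous section; substitute your own fitted model.

```python
import graphviz
import numpy as np
import pandas as pd
import sklearn.tree as tree

feature_names = ["cylinders", "displacement", "horsepower", "weight",
                 "acceleration", "model_year", "origin"]

# Random stand-in data under the real column names (NOT the real dataset)
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame(rng.uniform(0, 1, (n, 7)), columns=feature_names)
Y = pd.DataFrame({"mpg": 30 - 10 * X["weight"] + rng.normal(0, 1, n)})

dt = tree.DecisionTreeRegressor(random_state=0, ccp_alpha=0.004)
dt.fit(X, Y)

# out_file=None returns the DOT description as a string instead of writing a file
dot_data = tree.export_graphviz(
    dt, out_file=None, feature_names=feature_names, filled=True
)

# Wrap the DOT text; displaying it requires the Graphviz binaries to be installed
graph = graphviz.Source(dot_data)

# In a notebook, the last expression in a cell is displayed automatically
graph
```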

------------------
**QUESTION:**

Explain the top three nodes in your tree.

**ANSWER: (click here to edit)**


<hr>


-------------------

**QUESTION:**

Which model do you prefer, linear regression or regression trees, in this situation, and why?

**ANSWER: (click here to edit)**


<hr>


<!-- path = model.cost_complexity_pruning_path(splits[0], splits[2])
ccp_alphas, impurities = path.ccp_alphas, path.impurities

clfs = []
for ccp_alpha in ccp_alphas:
    clf = tree.DecisionTreeRegressor(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(splits[0], splits[2])
    clfs.append(clf)
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
px.scatter(x=ccp_alphas,y=node_counts) -->