Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Gradient boosting: Problem solving


This session will use a dataset of video game sales for games that sold at least 100,000 copies.
Because the dataset is so large, only 1000 randomly sampled rows are included.

| Variable     | Type    | Description                                                                                 |
|:--------------|:---------|:---------------------------------------------------------------------------------------------|
| Rank         | Ordinal    | Ranking of overall sales                                                                    |
| Name         | Nominal   | The game's name                                                                             |
| Platform     | Nominal   | Platform of the game's release (e.g., PC, PS4)                                              |
| Year         | Ratio   | Year of the game's release                                                                  |
| Genre        | Nominal   | Genre of the game                                                                           |
| Publisher    | Nominal   | Publisher of the game                                                                       |
| NA_Sales     | Ratio   | Sales in North America (in millions)                                                        |
| EU_Sales     | Ratio   | Sales in Europe (in millions)                                                               |
| JP_Sales     | Ratio   | Sales in Japan (in millions)                                                                |
| Other_Sales  | Ratio   | Sales in the rest of the world (in millions)                                                |
| Global_Sales | Ratio   | Total worldwide sales (in millions)                                                         |

<div style="text-align:center;font-size: smaller">
    <b>Source:</b> This dataset was taken from <a href="https://www.kaggle.com/gregorut/videogamesales">Kaggle</a>.
</div>
<br>

The goal is to predict `Global_Sales` using the other non-sales variables in the data.


## Load data

Import `pandas` for dataframes.

Load the dataframe with `datasets/vgsales-1000.csv`, using `index_col="Name"`.
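
A minimal sketch of the load step, assuming the column layout in the table above. The two rows in `csv_text` are made up so the block runs without the file; in the notebook you would pass the real path to `read_csv`.

```python
import io

import pandas as pd

# Made-up rows standing in for datasets/vgsales-1000.csv (same column names).
csv_text = """Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Game A,PC,2005,Action,Pub X,1.0,0.5,0.2,0.1,1.8
2,Game B,PS4,2015,Sports,Pub Y,0.6,0.4,0.0,0.1,1.1
"""

# In the notebook, read the file itself:
# dataframe = pd.read_csv("datasets/vgsales-1000.csv", index_col="Name")
dataframe = pd.read_csv(io.StringIO(csv_text), index_col="Name")

# Name is now the row index rather than a regular column.
print(dataframe.head())
```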

## Explore data

### Describe and drop missing

Describe the data.

-----------
**QUESTION:**

Does the min/mean/max of each variable make sense to you?

**ANSWER: (click here to edit)**


<hr>

Drop the rows with missing values and compare the row counts before and after to see if any rows are incomplete.
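
The describe and drop steps might look like the sketch below. The three-row dataframe is made up for illustration (one row is missing `Year`), and `cleaned` and `n_dropped` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows; "Game B" is missing its Year.
dataframe = pd.DataFrame(
    {"Year": [2005.0, np.nan, 2015.0], "Global_Sales": [1.8, 0.4, 1.1]},
    index=pd.Index(["Game A", "Game B", "Game C"], name="Name"),
)

# Count, mean, std, min, quartiles, and max for each numeric column.
print(dataframe.describe())

# dropna() removes every row that has at least one missing value.
cleaned = dataframe.dropna()

# The difference in length is the number of incomplete rows.
n_dropped = len(dataframe) - len(cleaned)
print(n_dropped)
```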

-----------
**QUESTION:**

How many rows had missing values?

**ANSWER: (click here to edit)**


<hr>

### Visualize

Import `plotly.express`.

And create a correlation matrix heatmap.

-----------
**QUESTION:**

What's going on with `Rank` and the `*_Sales` variables?

**ANSWER: (click here to edit)**


<hr>

Do a scatterplot matrix to see the relationships between these variables.

-----------
**QUESTION:**

Take a look at the scatterplots of the nominal variables against the `Global_Sales`. 
Is there any obvious pattern?

**ANSWER: (click here to edit)**


<hr>

Make a histogram of `Global_Sales` so we can see how it is distributed.

------------------
**QUESTION:**

Do you think we need to transform `Global_Sales` to make it more normal? Why or why not?

**ANSWER: (click here to edit)**


<hr>

## Prepare train/test sets

### X, Y, and dummies

Make a new dataframe called `X` by either dropping all the sales related variables or creating a dataframe with just the columns you want to keep.

Import `numpy` to square root transform `Y`.

Save a dataframe with just `Global_Sales` in `Y`, but use `numpy` to square root transform it in a freestyle block: `np.sqrt(dataframe[["Global_Sales"]])`.

Replace the nominal variables with dummies and save in `X`.
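
The three steps above could be sketched as follows, using made-up rows with the same columns as the table at the top. `sales_columns` is a hypothetical helper name, and the sketch drops only the `*_Sales` columns; `Rank` is derived from sales, so you may decide to drop it as well.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same columns as the real data.
dataframe = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Platform": ["PC", "PS4", "PC"],
    "Year": [2005, 2015, 2010],
    "Genre": ["Action", "Sports", "Action"],
    "Publisher": ["Pub X", "Pub Y", "Pub X"],
    "NA_Sales": [1.0, 0.6, 0.3],
    "EU_Sales": [0.5, 0.4, 0.2],
    "JP_Sales": [0.2, 0.0, 0.1],
    "Other_Sales": [0.1, 0.1, 0.0],
    "Global_Sales": [1.8, 1.1, 0.6],
}, index=pd.Index(["Game A", "Game B", "Game C"], name="Name"))

# X: everything except the sales columns.
sales_columns = [c for c in dataframe.columns if c.endswith("_Sales")]
X = dataframe.drop(columns=sales_columns)

# Y: square root of Global_Sales, kept as a one-column dataframe.
Y = np.sqrt(dataframe[["Global_Sales"]])

# Replace each nominal column with one 0/1 column per category.
X = pd.get_dummies(X)
print(X.columns.tolist())
```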

### Train/test splits

Import `sklearn.model_selection`.

Create the data splits. 
Make sure to use `random_state=1` so we get the same answers.
Don't bother stratifying.
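
A sketch of the split, assuming `X` and `Y` are dataframes like the ones built earlier. The random data here is a stand-in, and the default 75/25 split is an assumption since the text does not specify a test size.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection as model_selection

# Random stand-ins for the real X and Y.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Year": rng.integers(1990, 2016, size=20),
    "Platform_PC": rng.integers(0, 2, size=20),
})
Y = pd.DataFrame({"Global_Sales": np.sqrt(rng.uniform(0.1, 5.0, size=20))})

# random_state=1 makes the shuffle reproducible; no stratify argument is passed.
Xtrain, Xtest, Ytrain, Ytest = model_selection.train_test_split(
    X, Y, random_state=1
)
print(len(Xtrain), len(Xtest))
```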

## Fit model

Since the response/target variable is numeric, we need to use a gradient boosting regressor rather than a classifier.

Import `sklearn.ensemble`.

Create the gradient boosting regressor, using `random_state=1` and `subsample=.5`.

`fit` the regressor.

Get and save predictions.
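
The create, fit, and predict steps might look like this sketch. The data is synthetic, and `gbr` and `predictions` are hypothetical names; only `random_state=1` and `subsample=.5` come from the instructions above.

```python
import numpy as np
import sklearn.ensemble as ensemble
import sklearn.model_selection as model_selection

# Synthetic stand-in data: a nonlinear target with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

Xtrain, Xtest, Ytrain, Ytest = model_selection.train_test_split(
    X, Y, random_state=1
)

# subsample=.5 fits each tree on a random half of the training rows.
gbr = ensemble.GradientBoostingRegressor(random_state=1, subsample=.5)

# If Ytrain is a one-column dataframe, np.ravel(Ytrain) gives the 1-D target
# that sklearn expects.
gbr.fit(Xtrain, np.ravel(Ytrain))

# Save the test-set predictions for evaluation.
predictions = gbr.predict(Xtest)
```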

## Evaluate the model

Because this is regression, not classification, you can't use classification metrics like accuracy, precision, recall, and F1.
Instead, you'll use $r^2$.
Some examples are in the `Regression-trees-PS` notebook.

- Get the $r^2$ on the *training* set

- Get the $r^2$ on the *testing* set
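
Both scores can be computed with `sklearn.metrics.r2_score`, as in this sketch on synthetic data. The names `r2_train` and `r2_test` are hypothetical.

```python
import numpy as np
import sklearn.ensemble as ensemble
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection

# Synthetic stand-in data and a fitted regressor.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)
Xtrain, Xtest, Ytrain, Ytest = model_selection.train_test_split(
    X, Y, random_state=1
)
gbr = ensemble.GradientBoostingRegressor(random_state=1, subsample=.5)
gbr.fit(Xtrain, Ytrain)

# r2_score takes the true values first, then the predictions.
r2_train = metrics.r2_score(Ytrain, gbr.predict(Xtrain))
r2_test = metrics.r2_score(Ytest, gbr.predict(Xtest))
print(r2_train, r2_test)
```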

------------------
**QUESTION:**

Compare the *training data performance* to the *testing data performance*. Which is better?
What do these differences tell you?

**ANSWER: (click here to edit)**


<hr>

## Visualizing

### Feature importance

Visualize feature importance using a bar chart.

------------------
**QUESTION:**

Hover over the bars to see the corresponding predictor and value. 
What are the most important features?

**ANSWER: (click here to edit)**


<hr>

### Overfit

Use the OOB error to test if the model is overfit.

Import `plotly.graph_objects`.

Create an empty figure to draw lines on.

And add the two lines, one for training deviance and one for testing deviance.

------------------
**QUESTION:**

Do you think it would help our test data performance if we stopped training earlier? Why?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

Now that you are familiar with this data and how gradient boosting performed with it, what other models would you try?

**ANSWER: (click here to edit)**


<hr>