## Classical ML regression

We will build a decision-tree-based model (a gradient boosted regressor) and fit it to the data that we cleaned in the first exercise.

```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
```

## Load up the data

We use pandas to read in the data. In this case the first six columns (0-5) are the features and the final column (6) is the label. The code to do this is:
```
df = pd.read_pickle('./week1-cleaned-data.pickle')
```
We then explore the data and look for highly correlated columns and drop these from the dataset.

```
import seaborn as sns
corrmat = df.corr()
hm = sns.heatmap(corrmat,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 10},
                 yticklabels=df.columns,
                 xticklabels=df.columns,
                 cmap="Spectral_r")
plt.show()
```

```
df_train = df.drop([<list>], axis=1)
```
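The standardisation step that follows works on separate feature and label arrays. A minimal sketch of one way to split them is shown here, assuming the label is the last column of the DataFrame; `df_demo` and its column names are hypothetical stand-ins for `df_train`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train: feature columns followed by one label column
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(5, 4)), columns=["f0", "f1", "f2", "label"])

# Every column except the last is a feature; the last column is the label
x = df_demo.iloc[:, :-1].to_numpy()
y = df_demo.iloc[:, -1].to_numpy()

print(x.shape, y.shape)
```

Converting to NumPy arrays with `to_numpy()` means that `y.reshape(-1, 1)` can be applied directly later on.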

We then standardise the data using `StandardScaler`. **Note:** `StandardScaler` expects a 2D array, so if the data has only one column, like the label data `y`, we need to reshape it before standardising. The code to do this is:
```
from sklearn.preprocessing import StandardScaler

# Standardise the features to zero mean and unit variance
scaler_x = StandardScaler()
x = scaler_x.fit_transform(x)

# y must be a NumPy array here; reshape(-1, 1) turns it into a single column
scaler_y = StandardScaler()
y = scaler_y.fit_transform(y.reshape(-1, 1))
```

We then use the `train_test_split` function to make separate training and test sets, using 80% of the data to train and 20% to test. You should be able to do this without a code hint.

## Train a model

We will train a gradient boosted regressor, using the default hyperparameters. The code to do this is:
```
from sklearn.ensemble import GradientBoostingRegressor

regr = GradientBoostingRegressor()
# ravel() flattens the (n, 1) label array to the 1D shape that fit() expects
regr.fit(x_train, y_train.ravel())
```

## Visually inspect the performance of the model

Get the predictions of the model on the test set. Use `predictions = regr.predict(x_test)` to make predictions.

Use a scatter plot from matplotlib to plot `predictions` versus `y_test`. Don't forget to label your axes.
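As a rough guide, a plot of this kind could be sketched as follows. The arrays `y_test_demo` and `predictions_demo` are synthetic stand-ins for `y_test` and `predictions`, and the dashed diagonal is an optional addition that marks where perfect predictions would fall.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for y_test and predictions (not the real data)
rng = np.random.default_rng(0)
y_test_demo = rng.normal(size=50)
predictions_demo = y_test_demo + rng.normal(scale=0.2, size=50)

fig, ax = plt.subplots()
ax.scatter(y_test_demo, predictions_demo, alpha=0.7)

# Diagonal y = x: points on this line are perfect predictions
lims = [y_test_demo.min(), y_test_demo.max()]
ax.plot(lims, lims, linestyle="--", color="grey")

ax.set_xlabel("True value (standardised)")
ax.set_ylabel("Predicted value (standardised)")
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.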

## Calculate the metrics for the model performance

Use the same kind of performance metrics that you used in the exercise for linear regression to compare the `predictions` to `y_test`.
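The metrics used in the linear regression exercise are not listed here, so the sketch below assumes three common choices: mean absolute error, root mean squared error and R². The arrays `y_true` and `y_pred` are small made-up stand-ins for `y_test` and `predictions`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up stand-ins for y_test and predictions
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.1, 0.9, 2.2, 2.8])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of the MSE
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R^2:  {r2:.3f}")
```

Because the labels were standardised, the errors are in units of the label's standard deviation rather than its original units.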

## Perform hyper-parameter tuning

Here we will use `GridSearchCV` from `sklearn.model_selection` to search for the best combination of hyperparameters.
Specifically, we will tune `n_estimators` and `learning_rate` together; other hyperparameters, such as `max_leaf_nodes`, can be added to the grid in the same way.

To get the names of the hyperparameters use `regr.get_params().keys()`
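For illustration, the snippet below lists the hyperparameter names of a freshly created `GradientBoostingRegressor`; the keys of `param_grid` must match these names exactly.

```python
from sklearn.ensemble import GradientBoostingRegressor

regr = GradientBoostingRegressor()

# get_params() returns a dict mapping hyperparameter names to their current values
param_names = sorted(regr.get_params().keys())
print(param_names)
```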

Set up the grid of values to optimise over:
```
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200, 500, 1000],
    "learning_rate": [0.098, 0.099, 0.1, 0.101, 0.102]
}
```
Set up the cross-validation based search:
```
search_cv = GridSearchCV(
    regr,
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",
    n_jobs=2,
    cv=10,
)
```
Perform the search:
```
search_cv.fit(x_train, y_train.ravel())
```
This will take a few minutes to complete, because every combination in the grid is fitted once per cross-validation fold.

## Check to see what the optimal set of parameters is

You can get this by printing `search_cv.best_params_`.
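To show what the fitted search object exposes, here is a small self-contained sketch on synthetic data from `make_regression`. The grid, the number of folds and the names `x_demo`, `y_demo` and `demo_search` are illustrative assumptions, kept deliberately small so that the example runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for x_train and y_train
x_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

demo_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [20, 50], "learning_rate": [0.05, 0.1]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
demo_search.fit(x_demo, y_demo)

# Best combination found, and its mean cross-validated score
print(demo_search.best_params_)
# The scorer is the negative MAE, so flip the sign to report the MAE itself
print(-demo_search.best_score_)
```

Scikit-learn scorers follow a "higher is better" convention, which is why the mean absolute error is reported as a negative number.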

## Look at the performance of the best model

You can select the best model using:
```
best_regr = search_cv.best_estimator_
```
Now use `best_regr` in the same way that you used `regr` above: plot its predictions against `y_test` and calculate the same performance metrics.

How does the tuned model compare to the default model?
