# Exercises 19: scikit-learn
Let's have a look at the basics of scikit-learn.

Below I've made the basic imports, loaded the data and made a copy of both the features and the target. I've defined `n_learn` as the number of data points we will use for learning and split the data into a training set `train_data` and testing set `test_data` (and `train_target` and `test_target`) using `sklearn.model_selection.train_test_split`.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
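try:
    # newer scikit-learn versions can return the raw, unscaled feature values
    diabetes = load_diabetes(scaled=False)
except TypeError:
    # older versions do not accept `scaled` and return standardized features;
    # here we only restore the sex column to a 1/2 coding
    diabetes = load_diabetes()
    diabetes.data[:,1] = np.where(diabetes.data[:,1] > 0, 2, 1)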

# We make a copy of the diabetes data to work on
data = np.copy(diabetes.data)
target = np.copy(diabetes.target)

# now we split the data into a training and a test set
n_learn = 280
train_data, test_data, train_target, test_target = train_test_split(
    data, target, train_size=n_learn)

Preprocessors and estimators from scikit-learn work in the same way as the ones we defined in Exercises 18 on classes.

## Exercise 19.1
In this exercise we will use `scikit-learn` to train a linear model on the `diabetes` data and then test the quality of the model we have just fitted. 

*Note that this is the exact same model we have fitted in Supplementary Ex. 18.4 with the same quality of fit and the estimator also functions very similarly.*

### Exercise 19.1.1
* Create an instance of a linear estimator (use `sklearn.linear_model.LinearRegression`)
* Use it to fit the data (We only use `train_data` and `train_target` for training.)

In [None]:
from sklearn import linear_model
lin_reg = linear_model.LinearRegression()


# we fit the regressor to the training data
lin_reg.fit(train_data, train_target)

### Exercise 19.1.2
Now let's look at the predictions from our trained estimator. For that we will make a plot of the predicted disease progression vs the real disease progression (target values).

* Use the trained estimator to predict the disease progression for the training data (call the prediction `train_predicted_target`).
* Use the trained estimator to predict the disease progression for the test data (call the prediction `test_predicted_target`).
* Make a plot of the predictions vs the real disease progression (so one series of points is `train_predicted_target` vs `train_target` and the second series is `test_predicted_target` vs `test_target`)

Can you tell whether the model is overfitted?

In [None]:
# we use the regressor to predict the disease progression for the training data
train_predicted_target = lin_reg.predict(train_data)
# and for the test data
test_predicted_target = lin_reg.predict(test_data)

In [None]:
plt.figure()

# plot the actual disease progression versus the predicted progression for the test set
plt.plot(test_target, test_predicted_target, "bx", label="test")

# plot the actual disease progression versus the predicted progression for the training set
plt.plot(train_target, train_predicted_target, "rx", label="training")

# Add a line representing the perfect prediction
plt.plot(plt.xlim(), plt.xlim(), "--")

# Add labels and such
plt.xlabel("Target")
plt.ylabel("Prediction")
plt.legend(loc="best")
plt.show()
plt.close()

The model is not overfitted as the quality of the fit is similar for the training and the test data.
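
The visual comparison can be backed up with numbers by scoring the same estimator on both sets. The following is a minimal, self-contained sketch rather than part of the exercise solution: it reloads the data under fresh names (`X`, `y`, `model`) so that nothing in the notebook is overwritten, and it fixes `random_state=0`, which is an assumption made here only to make the split reproducible.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Reload the data under new names (the notebook variables stay untouched)
X, y = load_diabetes(return_X_y=True)

# random_state=0 is an assumption for reproducibility; the notebook split is random
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=280, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# R^2 on the data the model has seen and on the data it has not seen
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print("R2 on training data:", train_r2)
print("R2 on test data:    ", test_r2)
```

A training score far above the test score would point to overfitting; similar scores suggest the model generalizes about as well as it fits.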

### Exercise 19.1.3
Let's now evaluate the quality of our model numerically. We do this by calculating the average error that our estimator's predictions make compared to the real disease progression (on the test data). We do this in two different ways:
* Calculate the mean squared error using the metric provided by scikit-learn, namely `mean_squared_error` from the `sklearn.metrics` module (it takes two arguments, the target values and their predictions: `mean_squared_error(target, prediction)`)
* Estimator objects also have a `score` method, which gives a default evaluation for a given class of estimators. For regression this returns the $R^2$ coefficient of determination. Calculate it and print it
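
To see what these two numbers measure, they can be computed by hand with NumPy. This is a sketch on made-up toy values; the helper `mse_and_r2` and the `toy_*` arrays are hypothetical names that do not appear in the notebook. The mean squared error is the average of the squared residuals, and $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target.

```python
import numpy as np

def mse_and_r2(target, prediction):
    """Return (mean squared error, R^2) for two 1-D arrays of equal length."""
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    # sum of squared differences between target and prediction
    residual_ss = np.sum((target - prediction) ** 2)
    # sum of squared differences between target and its own mean
    total_ss = np.sum((target - target.mean()) ** 2)
    mse = residual_ss / target.size
    r2 = 1.0 - residual_ss / total_ss
    return mse, r2

# Toy values, for illustration only
toy_target = np.array([1.0, 2.0, 3.0, 4.0])
toy_prediction = np.array([1.0, 2.0, 3.0, 6.0])

mse, r2 = mse_and_r2(toy_target, toy_prediction)
print("MSE:", mse)
print("R2: ", r2)
```

A perfect prediction gives an MSE of 0 and an $R^2$ of 1, while always predicting the mean of the target gives an $R^2$ of 0.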

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
error = mean_squared_error(test_target, test_predicted_target)
print("error for the linear model", error)
r2 = lin_reg.score(test_data, test_target)
print("r2 for the linear model", r2)

## Exercise 19.2
Now we redo the same thing as in the previous exercise but with a support-vector machine. Here we are doing a regression, so use `sklearn.svm.SVR` and repeat the previous exercise, except that we start by normalizing the data. For this we use a preprocessor from scikit-learn, `sklearn.preprocessing.StandardScaler`, which removes the average and divides by the standard deviation (this class functions exactly as the scaler we have defined in Ex 18.2).
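
As a rough illustration of what the scaler does, the same normalization can be written out with NumPy and compared with `StandardScaler`. The `toy_train` and `toy_test` arrays are made-up values used only for this sketch. Note that the mean and standard deviation are taken from the training data alone and then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, for illustration only: 3 training samples and 1 test sample, 2 features
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
toy_test = np.array([[2.0, 30.0]])

# Per-feature mean and (population, ddof=0) standard deviation of the training data
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

# Manual normalization: the training statistics are applied to both sets
toy_train_normed = (toy_train - mean) / std
toy_test_normed = (toy_test - mean) / std

# The scikit-learn scaler, fitted on the training data only
toy_scaler = StandardScaler().fit(toy_train)
same_train = np.allclose(toy_scaler.transform(toy_train), toy_train_normed)
same_test = np.allclose(toy_scaler.transform(toy_test), toy_test_normed)
print("manual and StandardScaler results agree:", same_train and same_test)
```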

### Exercise 19.2.1
* Import the necessary classes (`StandardScaler` and `SVR`)
* Instantiate a `StandardScaler`, then fit it to the training data
* Normalize the data (both training and test) using the normalizer
* Then, as above, create an estimator, fit it to the normalized data, calculate the mean squared error and the $R^2$ coefficient (`score` method) and print the results
* Make a plot of disease progression vs predicted progression

In [None]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
norm = StandardScaler()

# we fit the normaliser to the training data
norm.fit(train_data)

In [None]:
# We use the normaliser to transform the training and test data
train_data_normed = norm.transform(train_data)
test_data_normed = norm.transform(test_data)

In [None]:
# We now fit a support vector machine regressor to our normalised data
svr = SVR()
svr.fit(train_data_normed, train_target)

# We calculate the error and print it
error = mean_squared_error(test_target, svr.predict(test_data_normed))
print("error for the svm", error)

# We score the estimator
r2 = svr.score(test_data_normed, test_target)
print("r2 for the svm",r2)

In [None]:
plt.figure()

# plot the actual disease progression versus the predicted progression for the test set
plt.plot(test_target, svr.predict(test_data_normed), "bx", label="test")

# plot the actual disease progression versus the predicted progression for the training set
plt.plot(train_target, svr.predict(train_data_normed), "rx", label="training")

# Add a line representing the perfect prediction
plt.plot(plt.xlim(),plt.xlim(), "--")

# Add labels and such
plt.xlabel("Target")
plt.ylabel("Prediction")
plt.legend(loc="best")
plt.show()
plt.close()

### Exercise 19.2.2
The results obtained with the SVR above are pretty bad. Let's tweak the estimator a bit. Set `kernel="sigmoid"` and `C=10`, then test the estimator as above, i.e.:
- Initialize the estimator with `SVR(kernel="sigmoid", C=10)`
- Fit the model to the data
- Predict the target for the training and test data
- Plot the prediction against the actual target
- Calculate the mean squared error and $R^2$

In [None]:
# We now fit a support vector machine regressor to our normalised data
svr = SVR(kernel="sigmoid", C=10)
svr.fit(train_data_normed, train_target)

# We calculate the error and print it
error = mean_squared_error(test_target, svr.predict(test_data_normed))
print("error for the svm", error)

# We score the estimator
r2 = svr.score(test_data_normed, test_target)
print("r2 for the svm",r2)

In [None]:
plt.figure()

# plot the actual disease progression versus the predicted progression for the test set
plt.plot(test_target, svr.predict(test_data_normed), "bx", label="test")

# plot the actual disease progression versus the predicted progression for the training set
plt.plot(train_target, svr.predict(train_data_normed), "rx", label="training")

# Add a line representing the perfect prediction
plt.plot(plt.xlim(),plt.xlim(), "--")

# Add labels and such
plt.xlabel("Target")
plt.ylabel("Prediction")
plt.legend(loc="best")
plt.show()
plt.close()

## Exercise 19.3
Finally we try with a regression tree. 
### Exercise 19.3.1
Repeat Exercise 19.2.1 but using `sklearn.tree.DecisionTreeRegressor`. Again use the normalized data.

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(train_data_normed, train_target)
error = mean_squared_error(test_target, tree_reg.predict(test_data_normed))
print("error for the tree", error)
r2 = tree_reg.score(test_data_normed, test_target)
print("r2 for the tree",r2)

In [None]:
plt.figure()
plt.plot(test_target, tree_reg.predict(test_data_normed), "bx", label="test")
plt.plot(train_target, tree_reg.predict(train_data_normed), "rx", label="training")
plt.plot(plt.xlim(),plt.xlim(), "--")
plt.xlabel("Target")
plt.ylabel("Prediction")
plt.legend(loc="best")
plt.show()
plt.close()

# Supplementary

### Exercise 19.3.2
Our estimator from Exercise 19.3.1 is clearly overfitted. This is because by default the depth of the tree is unlimited and the tree is expanded until all leaves are pure. The `max_depth` parameter controls the depth of the tree. One way to determine the optimal depth is to look at how the prediction error on the training and testing data sets evolves with the depth of the tree, and to pick the depth for which the prediction on the testing data set is best.
- Calculate the Mean Squared Error (for the training and testing data) for `DecisionTreeRegressor`s fitted with `max_depth` ranging from 1 to 20.
- Plot the resulting errors as a function of the tree depth
- Determine the optimal `max_depth` to use
- Make the plot of target vs prediction for the `DecisionTreeRegressor` fitted with the optimal `max_depth`

In [None]:
train_error = []
test_error = []
for i in range(1, 21):
    tree_reg = DecisionTreeRegressor(max_depth=i)
    tree_reg.fit(train_data_normed, train_target)
    test_error.append(mean_squared_error(test_target, tree_reg.predict(test_data_normed)))
    train_error.append(mean_squared_error(train_target, tree_reg.predict(train_data_normed)))

In [None]:
plt.figure()
plt.plot(range(1, 21), test_error, "rx", label="test error")
plt.plot(range(1, 21), train_error, "go",label="train error")
plt.xlabel("max_depth")
plt.ylabel("Mean squared error")
plt.legend(loc="best")
plt.show()
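
Instead of reading the best depth off the plot, it can also be picked programmatically as the depth with the smallest test error. The sketch below uses a hypothetical helper `best_depth` and made-up error values (`toy_test_error`) so that it runs on its own; neither name is part of the notebook.

```python
import numpy as np

def best_depth(depths, errors):
    """Return the depth with the smallest error (the first one in case of ties)."""
    depths = list(depths)
    return depths[int(np.argmin(errors))]

# Hypothetical error values for depths 1 to 5, for illustration only
toy_test_error = [4000.0, 3600.0, 3500.0, 3900.0, 4500.0]
chosen_depth = best_depth(range(1, 6), toy_test_error)
print("depth with the lowest test error:", chosen_depth)
```

With the lists computed above, the corresponding call would be `best_depth(range(1, 21), test_error)`. Because the train/test split is random, the result can change from run to run.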

In [None]:
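# max_depth=3 is the depth with the lowest test error in the plot above;
# the split is random, so the best depth may differ between runs
tree_reg = DecisionTreeRegressor(max_depth=3)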
tree_reg.fit(train_data_normed, train_target)

plt.figure()
plt.plot(test_target, tree_reg.predict(test_data_normed), "bx", label="test")
plt.plot(train_target, tree_reg.predict(train_data_normed), "rx", label="training")
plt.plot(plt.xlim(),plt.xlim(), "--")
plt.xlabel("Target")
plt.ylabel("Prediction")
plt.legend(loc="best")
plt.show()
plt.close()

error = mean_squared_error(test_target, tree_reg.predict(test_data_normed))
print("error for the tree", error)
r2 = tree_reg.score(test_data_normed, test_target)
print("r2 for the tree",r2)

## Exercise 19.4: feature engineering
In this exercise we will add more features to our dataset. We will do this by generating all the terms of degree 2 in our features (the squares of the features and the products of pairs of features), using the `PolynomialFeatures` class from `sklearn.preprocessing`.
- Use the `PolynomialFeatures` class to create a new feature set containing also all terms of degree 2
- Then repeat the steps from the other exercises, using a linear model (`LinearRegression`), i.e.:
  - Fit the model to the new data
  - Predict the target for the training and test data
  - Plot the prediction against the actual target
  - Calculate the mean squared error and $R^2$
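
To get a feeling for what `PolynomialFeatures` produces, here is a small sketch on a single made-up sample with two features (`toy` and `poly_toy` are illustrative names, not part of the exercise). With degree 2 the output contains a constant column, the original features, and all squares and pairwise products.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy sample with two features, x0 = 2 and x1 = 3 (illustration only)
toy = np.array([[2.0, 3.0]])

poly_toy = PolynomialFeatures(2)
expanded = poly_toy.fit_transform(toy)

# Columns are ordered as: 1, x0, x1, x0^2, x0*x1, x1^2
print(expanded)
print("number of output features:", expanded.shape[1])
```

For the 10 features of the diabetes data the same expansion yields 66 columns: 1 constant, 10 linear terms and 55 terms of degree 2.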

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
# fit the transformer on the training data, then apply the same transformation to the test data
augmented_train_data = poly.fit_transform(train_data)
augmented_test_data = poly.transform(test_data)

In [None]:
# Fit a linear model to the augmented data
lin_reg = linear_model.LinearRegression()
lin_reg.fit(augmented_train_data, train_target)

# Use the model to predict the disease progression for the training and test data
train_predicted_target = lin_reg.predict(augmented_train_data)
test_predicted_target = lin_reg.predict(augmented_test_data)

In [None]:
# Plot the prediction against the target values
plt.figure()
plt.plot(test_target, test_predicted_target, "bx", label="test")
plt.plot(train_target, train_predicted_target, "rx", label="training")
plt.plot(plt.xlim(),plt.xlim(), "--")

plt.xlabel("Target")
plt.ylabel("Prediction")
plt.legend(loc="best")
plt.show()
plt.close()

In [None]:
error = mean_squared_error(test_target, test_predicted_target)
print("error for the linear model", error)
r2 = lin_reg.score(augmented_test_data, test_target)
print("r2 for the linear model", r2)