# Machine Learning in Python - Workshop 4

## 1. Setup

### 1.1 Packages

First, the version of scikit-learn on noteable is slightly out of date so we will update it if necessary (this may take a minute or two),

In [None]:
import pkg_resources
if pkg_resources.get_distribution("scikit-learn").version != '0.22.1':
    !conda install --yes scikit-learn

In the cell below we will load the core libraries we will be using for this workshop and set some sensible defaults for our plot size and resolution.

In [None]:
# Display plots inline
%matplotlib inline

# Data libraries
import pandas as pd
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting defaults
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 80

# sklearn modules
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline

### 1.2 Data

We will again use the data set from Workshop 3, which was generated via a random draw from a Gaussian Process model and represents an unknown smooth function $y = f(x) + \epsilon$. The data have been randomly thinned to include only 100 observations.

We can read the data in from `gp2.csv` and plot the data,

In [None]:
d = pd.read_csv("gp2.csv")
n = d.shape[0] # number of rows

# Pass x and y by keyword; recent seaborn releases no longer accept them positionally
sns.scatterplot(x='x', y='y', data=d, color="black")

# 2. Cross validation

In this section we will explore some of the tools that sklearn provides for cross validation for the purpose of model evaluation and selection. The most basic form of CV is to split the data into a testing and a training set. This can be achieved using `train_test_split` from the `model_selection` submodule. Here we provide the function with our model matrix $X$ and outcome vector $y$ to obtain a test and train split of both.

In [None]:
from sklearn.model_selection import train_test_split

X = np.c_[d.x]
y = d.y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Of the additional arguments, `test_size` determines the proportion of data to include in the test set and `random_state` is the seed used when determining the partition (keeping the seed the same will result in the same partition(s) each time the cell is rerun).
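
The roles of `test_size` and `random_state` described above can be checked with a small sketch. The arrays `X_toy` and `y_toy` are hypothetical stand-ins (they are not the workshop data) chosen so that the split sizes are easy to read off.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical, not gp2.csv): 100 rows and a single feature
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100)

# Two calls with the same seed should give the same partition
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# A different seed will generally give a different partition
c_train, c_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

same_seed_matches = np.array_equal(a_test, b_test)
different_seed_matches = np.array_equal(a_test, c_test)

print("train / test shapes:", a_train.shape, a_test.shape)
print("same seed, same test set:", same_seed_matches)
print("different seed, same test set:", different_seed_matches)
```

With `test_size=0.2` and 100 rows, 20 rows go to the test set and the remaining 80 to the training set.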

We can check the dimensions of the original and new objects using the shape attribute,

In [None]:
print("orig sizes :", X.shape, y.shape)
print("train sizes:", X_train.shape, y_train.shape)
print("test sizes :", X_test.shape, y_test.shape)

With these new objects we can try several polynomial regression models, with different values of `M`, and compare their performance. Our goal is to fit the models using the training data and then evaluate their performance using the test data.

We will assess the models' performance using root mean squared error, 

$$ \text{rmse} = \left( \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \right)^{1/2} $$

with sklearn this is calculated using the `mean_squared_error` function with the argument `squared=False`.
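
As a sanity check on the formula above, the rmse can also be written out directly with NumPy. This is a minimal sketch: the helper `rmse` and the toy vectors are hypothetical and not part of the workshop code. Taking the square root of `mean_squared_error` is used for the comparison because it does not rely on the `squared` argument, which is not available in every scikit-learn release.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error, written out term by term from the formula."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy values (hypothetical): only the last prediction is off, by 2
y_obs = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 6.0]

manual = rmse(y_obs, y_hat)
via_sklearn = np.sqrt(mean_squared_error(y_obs, y_hat))

print(manual, via_sklearn)
```

Here the squared errors are $0, 0, 0, 4$, so the mean is $1$ and the rmse is $1$.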

The following code uses a `for` loop to fit 30 polynomial regression models with $M = [1,2,\ldots,30]$ and calculates the rmse of the training data and the testing data.

In [None]:
degree = []
train_rmse = []
test_rmse = []

M = 30

for i in np.arange(1,M+1):
    m = make_pipeline(
        PolynomialFeatures(degree=i),
        LinearRegression(fit_intercept=False)
    ).fit(X_train, y_train)
    
    degree.append(i)
    train_rmse.append( mean_squared_error(y_train, m.predict(X_train), squared=False) )
    test_rmse.append( mean_squared_error(y_test, m.predict(X_test), squared=False) )

fit = pd.DataFrame(data = {"degree": degree, "train_rmse": train_rmse, "test_rmse": test_rmse})

sns.lineplot(x="degree", y="value", hue="variable", data = pd.melt(fit,id_vars=["degree"]))

---

### &diams; Exercise 1

Based on these results, what value of $M$ produces the best model? Explain.

---

### &diams; Exercise 2

Try adjusting the proportion of the data in the test vs training data. How does this change the training and testing rmse curves?

---

## 2.1 k-fold cross validation

The previous approach was relatively straightforward, but it required a fair bit of bookkeeping code to implement and we only examined a single test / train split. If we would like to perform k-fold cross validation we can use `cross_val_score` from the `model_selection` submodule. This function is passed our model or pipeline (any object implementing `fit`) and then our full model matrix $X$ and response $y$. The argument `cv` is the integer number of folds to use and `scoring` determines the scoring metric to use. Alternatively, if you want more control over the cross validation process, the `cv` argument can be any cross-validation generator or iterable object (e.g. `KFold`).

In [None]:
from sklearn.model_selection import cross_val_score

m = make_pipeline(
    PolynomialFeatures(degree=1),
    LinearRegression(fit_intercept=False)
)

cross_val_score(m, X, y, cv=5, scoring="neg_root_mean_squared_error")

Here we have used `"neg_root_mean_squared_error"` as our scoring metric, which returns the negative of the root mean squared error. The sign is flipped because sklearn always treats larger scores as better, so the model with the largest negative rmse (the value closest to zero) will be the "best". To get a list of all available scoring metrics for `cross_val_score` you can run the following code.

In [None]:
sorted(sklearn.metrics.SCORERS.keys())
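
The sign convention and the two forms of the `cv` argument can be illustrated with a short sketch. The data here are synthetic and hypothetical (a noisy straight line), and the names `X_toy`, `y_toy` and `model` are not part of the workshop code. The sketch assumes that, for a regression model, an integer `cv` behaves like a `KFold` object with the same number of splits and otherwise default settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical): a noisy straight line
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(60, 1))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=60)

model = LinearRegression()

# Scores are negated rmse values, so each entry is <= 0
neg_scores = cross_val_score(model, X_toy, y_toy, cv=5,
                             scoring="neg_root_mean_squared_error")

# Flip the sign to recover the usual (non-negative) rmse for each fold
rmse_per_fold = -neg_scores

# Passing an explicit KFold object instead of an integer
kf_scores = cross_val_score(model, X_toy, y_toy, cv=KFold(n_splits=5),
                            scoring="neg_root_mean_squared_error")

print("negated scores:", neg_scores)
print("rmse per fold :", rmse_per_fold)
print("KFold scores  :", kf_scores)
```

Using a `KFold` object becomes useful when its other arguments need to be changed from their defaults.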

To obtain these 5-fold CV estimates of rmse for our models we slightly modify our original code as follows,

In [None]:
degree = []
test_mean_rmse = []
test_rmse = []

M = 30

for i in np.arange(1,M+1):
    m = make_pipeline(
        PolynomialFeatures(degree=i),
        LinearRegression(fit_intercept=False)
    )
    cv = -1 * cross_val_score(m, X, y, cv=5, scoring="neg_root_mean_squared_error")
    degree.append(i)
    test_mean_rmse.append(np.mean(cv))
    test_rmse.append(cv)

cv = pd.DataFrame(
    data = np.c_[degree, test_mean_rmse, test_rmse],
    columns = ["degree", "mean_rmse"] + ["fold" + str(i) for i in range(1,6) ]
)

cv.head(n=15)

---

### &diams; Exercise 3

Do these CV rmse's agree with the results we obtained when using `train_test_split`? How do they differ, is it only a single fold that differs or several?

*The rmse values are much larger than our previous values, particularly for larger $M$. Folds 1, 2 and 5 are primarily responsible.*

---

We will now repeat the 5-fold CV model fitting but first we will shuffle our original data frame.

In [None]:
d_shuf = d.sample(frac=1) # Shuffle rows
Xs = np.c_[d_shuf.x]
ys = d_shuf.y

In [None]:
degree = []
test_mean_rmse = []
test_rmse = []

M = 30

for i in np.arange(1,M+1):
    m = make_pipeline(
        PolynomialFeatures(degree=i),
        LinearRegression(fit_intercept=False)
    )
    cv = -1 * cross_val_score(m, Xs, ys, cv=5, scoring="neg_root_mean_squared_error")
    degree.append(i)
    test_mean_rmse.append(np.mean(cv))
    test_rmse.append(cv)

cv = pd.DataFrame(
    data = np.c_[degree, test_mean_rmse, test_rmse],
    columns = ["degree", "mean_rmse"] + ["fold" + str(i) for i in range(1,6) ]
)

cv.head(n=15)

We can also plot these results using a scatter plot for the individual folds and a line plot for the mean of the rmse's.

In [None]:
sns.lineplot(x="degree", y="mean_rmse", data = cv, color="black")
sns.scatterplot(x="degree", y="value", hue="variable", data = pd.melt(cv,id_vars=["degree", "mean_rmse"]))

---

### &diams; Exercise 4

Do these CV rmse's agree with the results we obtained when using `train_test_split`?

---

### &diams; Exercise 5

Based on these results, what value of $M$ do you think produces the best model? Explain.

---

### &diams; Exercise 6

Explain why "shuffling" our original data "fixed" the strange / large rmse's we observed previously. *Hint* - review the documentation of `cross_val_score` and `KFold` and closely examine the original data frame `d`.

---

## 2.2 CV Grid Search


We can further reduce the amount of code needed if there is a specific set of parameter values we would like to explore using cross validation. This is done using the `GridSearchCV` class from the `model_selection` submodule. It works similarly to `cross_val_score` with the addition of the `param_grid` argument, which is a dictionary containing parameter names as keys and lists of parameter settings to try as values. Since we are using a pipeline, our parameter name is the name of the pipeline step, `polynomialfeatures`, followed by `__`, and then the parameter name, `degree`. So for our pipeline the parameter is named `polynomialfeatures__degree`.
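
The naming rule described above can be checked directly on a pipeline before any grid search is run. This is a minimal sketch; the variable names `pipe`, `step_names` and `param_names` are hypothetical.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

pipe = make_pipeline(
    PolynomialFeatures(),
    LinearRegression(fit_intercept=False)
)

# make_pipeline names each step after its class, in lower case
step_names = [name for name, _ in pipe.steps]

# All tunable parameters, including the nested "<step>__<parameter>" names
param_names = sorted(pipe.get_params().keys())

# The same name can be used to change the parameter on the pipeline
pipe.set_params(polynomialfeatures__degree=4)
new_degree = pipe.named_steps["polynomialfeatures"].degree

print(step_names)
print([p for p in param_names if p.startswith("polynomialfeatures__")])
print(new_degree)
```

Any key used in `param_grid` must appear among the names returned by `get_params`.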

In [None]:
from sklearn.model_selection import GridSearchCV

m = make_pipeline(
        PolynomialFeatures(),
        LinearRegression(fit_intercept=False)
    )

parameters = {
    'polynomialfeatures__degree': np.arange(1,31,1)
}

# Fit to the shuffled data
grid_search = GridSearchCV(m, parameters, cv=5, scoring="neg_root_mean_squared_error").fit(Xs, ys)

Once fit, we can determine the optimal hyperparameter value by accessing `grid_search`'s attributes,

In [None]:
print("best index: ", grid_search.best_index_)
print("best param: ", grid_search.best_params_)
print("best score: ", grid_search.best_score_)

Additional useful details from the CV process are available in the `cv_results_` attribute, which provides CV and scoring details,

In [None]:
grid_search.cv_results_["mean_test_score"]

In [None]:
grid_search.cv_results_["split0_test_score"]

In [None]:
grid_search.cv_results_["rank_test_score"]

and the `best_estimator_` attribute, which gives direct access to the "best" model or pipeline object.

In [None]:
grid_search.best_estimator_

In [None]:
grid_search.best_estimator_.named_steps['linearregression'].coef_
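
Since `cv_results_` is a dictionary of equal-length arrays, one convenient way to browse it is to convert it to a data frame. The sketch below does this for a small grid search on synthetic, hypothetical data (a noisy cubic); the names `gs`, `results` and `summary` are not part of the workshop code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (hypothetical): a noisy cubic
rng = np.random.default_rng(1)
X_toy = rng.uniform(-1, 1, size=(80, 1))
y_toy = X_toy[:, 0] ** 3 - X_toy[:, 0] + rng.normal(scale=0.05, size=80)

pipe = make_pipeline(PolynomialFeatures(), LinearRegression(fit_intercept=False))

gs = GridSearchCV(
    pipe,
    {"polynomialfeatures__degree": np.arange(1, 6)},
    cv=5,
    scoring="neg_root_mean_squared_error"
).fit(X_toy, y_toy)

# One row per parameter setting, one column per recorded quantity
results = pd.DataFrame(gs.cv_results_)

summary = results[[
    "param_polynomialfeatures__degree",
    "mean_test_score", "std_test_score", "rank_test_score"
]].sort_values("rank_test_score")

print(summary)
```

The row with `rank_test_score` equal to 1 corresponds to `best_index_`, and its `mean_test_score` is the value reported by `best_score_`.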

---

# 3. More dimensions

Let us now consider a regression problem of the following form,

$$ y = f(x_1) + g(x_2) + h(x_3) + \epsilon $$

where $f()$, $g()$, and $h()$ are polynomials with fixed degrees. We will assume they are linear, quadratic and cubic respectively, with the following coefficients:

$$
\begin{align}
f(x) &= 1.2 x + 1.1 \\
g(x) &= 2.5 x^2 - 0.9 x - 3.2  \\
h(x) &= 2 x^3 + 0.4 x^2 - 5.2 x + 2.7 \\
\end{align}
$$

We generate values for $x_1$, $x_2$, $x_3$, and $\epsilon$ and then use these values to calculate observations of $y$ using the following code.


In [None]:
np.random.seed(1234)
n = 500

f = lambda x: 1.2 * x + 1.1
g = lambda x: 2.5 * x**2 - 0.9 * x - 3.2 
h = lambda x: 2 * x**3 + 0.4 * x**2 - 5.2 * x + 2.7

ex2 = pd.DataFrame({
    "x1": np.random.rand(n),
    "x2": np.random.rand(n),
    "x3": np.random.rand(n)
}).assign(
   y = lambda d: f(d.x1) + g(d.x2) + h(d.x3) + np.random.randn(n) # epsilon
)

ex2

---

### &diams; Exercise 7

Create a pairs plot of these data. From this alone, is it possible to identify the polynomial relationships between $y$ and the $x$s?

---

### &diams; Exercise 8

Assume that we know that each of the functions $f()$, $g()$, and $h()$ is at most of degree 3, and fit a polynomial model to these data. What are the coefficients you obtain and how do they compare to the "correct" values used above? *Hint* - the `powers_` attribute of the `PolynomialFeatures` transformer shows which `coef_` value maps to which term in the model.

In [None]:
X = ex2.drop(columns=['y'])
y = ex2.y

In [None]:
m = make_pipeline(
    PolynomialFeatures(degree=3),
    LinearRegression(fit_intercept=False)
)

fit = m.fit(X, y)

print( fit.named_steps['linearregression'].coef_ )
print( fit.named_steps['polynomialfeatures'].powers_ )
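
Each row of `powers_` gives the exponent of every input feature for one output column, in the same order as the entries of `coef_`. The sketch below shows this on a smaller case (two features, degree 2) so that the mapping is easy to read; the helper `term_name` and the toy input are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single toy observation (hypothetical) with x1 = 2 and x2 = 3
X_toy = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2).fit(X_toy)

def term_name(powers, names=("x1", "x2")):
    """Turn one row of powers_ into a readable term such as 'x1*x2' or 'x1^2'."""
    parts = [f"{n}^{p}" if p > 1 else n for n, p in zip(names, powers) if p > 0]
    return "*".join(parts) if parts else "1"

# One label per output column, in column order
terms = [term_name(p) for p in poly.powers_]

# The transformed values for the toy observation, in the same order
values = poly.transform(X_toy)[0]

for t, v in zip(terms, values):
    print(f"{t:>6} = {v}")
```

The same labelling can be applied to the three-feature, degree 3 model above by passing `names=("x1", "x2", "x3")`.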

---

### &diams; Exercise 9

Calculate the rmse of this model using 5-fold cross validation.

---

## 3.2 Column Transformers

Often we do not want to apply a single transformation to all of the features of a model at the same time. This particular example is one such case, as we might prefer individual polynomial transformations of each of the three $x$'s rather than the polynomial transformations and their interactions. To do this we will use sklearn's `ColumnTransformer` and the `make_column_transformer` helper function from the `compose` submodule.

In [None]:
from sklearn.compose import ColumnTransformer, make_column_transformer

In [None]:
ind_poly = make_column_transformer(
    (PolynomialFeatures(degree=3, include_bias=False), ['x1']),
    (PolynomialFeatures(degree=3, include_bias=False), ['x2']),
    (PolynomialFeatures(degree=3, include_bias=False), ['x3']),
)

trans = ind_poly.fit_transform(X,y)

pd.DataFrame(trans) # printing as a DataFrame makes the array more readable

`ColumnTransformer`s are like pipelines, but each transformer is paired with the specific column or columns it should be applied to. With this transformer each feature gets its own polynomial feature transformer of degree 3, excluding the intercept (bias) column, resulting in 9 total features as output (3 for each input feature). We can check that these values make sense by examining them along with the original values of the $x$s. Here we are using `include_bias=False` to avoid creating a rank deficient model matrix, which would result if all three polynomial feature transformers included the same intercept column.

In [None]:
pd.concat([X, pd.DataFrame(trans)], axis=1)
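
The same check can be made programmatically. The sketch below applies the column transformer to a tiny, hypothetical data frame and compares the output with powers computed by hand; the names `toy`, `ct` and `expected` are not part of the workshop code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import PolynomialFeatures

# Tiny data frame (hypothetical) with the same column names as ex2
toy = pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0], "x3": [5.0, 6.0]})

ct = make_column_transformer(
    (PolynomialFeatures(degree=3, include_bias=False), ["x1"]),
    (PolynomialFeatures(degree=3, include_bias=False), ["x2"]),
    (PolynomialFeatures(degree=3, include_bias=False), ["x3"]),
)

out = ct.fit_transform(toy)

# Expected layout: x1, x1^2, x1^3, x2, x2^2, x2^3, x3, x3^2, x3^3
expected = np.column_stack(
    [toy[col] ** k for col in ["x1", "x2", "x3"] for k in (1, 2, 3)]
)

print(out.shape)
print(np.allclose(out, expected))
```

The output columns follow the order in which the transformers were listed, with no interaction terms between the features.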

A `ColumnTransformer` is like any other transformer and can therefore be included in a pipeline. This enables us to create a pipeline for fitting our desired polynomial regression model (with no interaction terms). Since the polynomial features no longer include an intercept, we can add this back to the model with `fit_intercept=True` in the linear regression step.

In [None]:
m2 = make_pipeline(
    make_column_transformer(
        (PolynomialFeatures(degree=3, include_bias=False), ['x1']),
        (PolynomialFeatures(degree=3, include_bias=False), ['x2']),
        (PolynomialFeatures(degree=3, include_bias=False), ['x3']),
    ),
    LinearRegression(fit_intercept=True)
)

fit = m2.fit(X, y)

We can examine the fitted values of the coefficients by accessing the `linearregression` step and its `coef_` and `intercept_` attributes.

In [None]:
fit.named_steps['linearregression'].coef_

In [None]:
fit.named_steps['linearregression'].intercept_

Instead of directly fitting, we can also use this pipeline with cross validation functions like `cross_val_score` to obtain a more reliable estimate of our model's rmse.

In [None]:
cv = cross_val_score(m2, X, y, cv=5, scoring="neg_root_mean_squared_error")

print(cv)
print(cv.mean())

---

### &diams; Exercise 10

Is this rmse better or worse than the rmse calculated for the original model that included interactions? Explain why you think this is.

---

## 3.3 Column Transformers & CV Grid Search

Finally we will see if we can come close to recovering the original forms of $f()$, $g()$, and $h()$ using `GridSearchCV`. This builds on our previous use of this class, but now we need to optimize over the degree parameter of all three of the polynomial feature transformers. We can find the names of these transformers by examining the `named_transformers_` attribute associated with the `columntransformer` step,

In [None]:
m2.named_steps['columntransformer'].named_transformers_

This gives us the transformer names `polynomialfeatures-1`, `polynomialfeatures-2`, and `polynomialfeatures-3`, which are referenced in the same way as before: by combining the step name, the transformer name, and the parameter name, separated by `__`. As such, the degree parameter for the first transformer will be `columntransformer__polynomialfeatures-1__degree`. It is also possible to view all of the parameters for a model or pipeline by looking at the keys returned by the `get_params` method.

In [None]:
m2.get_params().keys()
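
To see that these long names really do address the individual transformers, the sketch below picks out the degree parameters and changes one of them with `set_params`. The pipeline is rebuilt here so that the example is self-contained, and the names `pipe`, `degree_keys` and `toy` are hypothetical.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

pipe = make_pipeline(
    make_column_transformer(
        (PolynomialFeatures(degree=3, include_bias=False), ["x1"]),
        (PolynomialFeatures(degree=3, include_bias=False), ["x2"]),
        (PolynomialFeatures(degree=3, include_bias=False), ["x3"]),
    ),
    LinearRegression(fit_intercept=True)
)

# Keep only the parameter names that end in "__degree"
degree_keys = sorted(k for k in pipe.get_params() if k.endswith("__degree"))

# The hyphen means the name is not a valid keyword, so unpack a dictionary
pipe.set_params(**{"columntransformer__polynomialfeatures-1__degree": 2})

# Tiny data frame (hypothetical) used only to count the output columns
toy = pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0], "x3": [5.0, 6.0]})
n_cols = pipe.named_steps["columntransformer"].fit_transform(toy).shape[1]

print(degree_keys)
print(n_cols)
```

After lowering the first degree to 2 the transformer produces 2 + 3 + 3 = 8 columns rather than 9.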

To keep the space of parameters being explored reasonable we will restrict the possible values of each degree parameter to $[1,\ldots,4]$ (note that `np.arange(1,5,1)` excludes its upper bound).

In [None]:
parameters = {
    'columntransformer__polynomialfeatures-1__degree': np.arange(1,5,1),
    'columntransformer__polynomialfeatures-2__degree': np.arange(1,5,1),
    'columntransformer__polynomialfeatures-3__degree': np.arange(1,5,1),
}

grid_search = GridSearchCV(m2, parameters, cv=5, scoring="neg_root_mean_squared_error").fit(X, y)

---

### &diams; Exercise 11

How many models have been fit and scored by `GridSearchCV`?

---

Once fit, we can determine the optimal parameter value by accessing `grid_search`'s attributes,

In [None]:
print("best index: ", grid_search.best_index_)
print("best param: ", grid_search.best_params_)
print("best score: ", grid_search.best_score_)

---

### &diams; Exercise 12

Based on these results, have we done a good job of recovering the general structure of the functions $f()$, $g()$, and $h()$? For example, have we correctly recovered the degrees of these functions?

---

We can directly access the properties of the "best" model, according to our scoring method, using the `best_estimator_` attribute. From this we can access the `linearregression` step of the pipeline to recover the model coefficients.

In [None]:
grid_search.best_estimator_.named_steps["linearregression"].intercept_

In [None]:
grid_search.best_estimator_.named_steps["linearregression"].coef_

---

### &diams; Exercise 13

Compare the coefficient values we obtained via `GridSearchCV` to the true values used to generate the $y$ observations. How well have we recovered the true values of the coefficients?

---

### &diams; Bonus Exercise

Repeat the analysis above but use only 200 observations of `ex2` instead of all 500. How does your resulting "best" model change? What about 100 or 50 observations? How dependent are the results on the original sample size?