# Exercise 2 - Regression

In this lab we run some regression experiments to develop an understanding of Model Selection.

## Getting started

From the Regression lecture we have seen that Linear Regression can easily be extended to Polynomial Regression just by augmenting the original data with extra nonlinear terms. Scikit-learn adopts this strategy: both regression techniques use the same Linear Regression engine, with a pre-processing pipeline step added for the polynomial augmentation. This is shown in the following code cell:

In [None]:
# import useful packages
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# define a pipeline for polynomial regression
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))
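
To make the augmentation step concrete, here is a minimal sketch of what `PolynomialFeatures` does to a single input column. The array `X_demo` is made up for illustration and is not part of the lab data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# three made-up inputs, shaped (n_samples, 1) as scikit-learn expects
X_demo = np.array([[1.0], [2.0], [3.0]])

# degree-3 augmentation: each row x becomes [1, x, x^2, x^3]
X_poly = PolynomialFeatures(degree=3).fit_transform(X_demo)
print(X_poly)
```

The linear regression step then fits one coefficient per augmented column, which is why the same engine can produce a polynomial curve.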

Now we make some simple toy data. This generates the X and y arrays that we use in the later experiments:

In [None]:
# make some example data
def make_data(N=30, err=0.8, rseed=1):
    # randomly sample the data
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2           
    y = 10 - 1. / (X.ravel() + 0.1)   
    if err > 0:
        y += err * rng.randn(N)
    return X, y

X, y = make_data()
xf = np.linspace(0.0, 1.0, 1000)[:, None]       # use 1000 points in [0.0, 1.0] to plot out predictions

Now we use Pyplot to do some simple plotting:

In [None]:
plt.plot(X, y, 'b.')     # plot the data points 
plt.grid()

As our first modeling attempt, let us try linear regression, i.e., a PolynomialRegression of degree 1. Let's see how the fitted model predicts over the specified range:

In [None]:
model = PolynomialRegression(1).fit(X, y)
plt.plot(X, y, 'b.') 
plt.plot(xf.ravel(), model.predict(xf), color='gray')
plt.grid()
plt.show()

**TO-DO.** Copy and paste the code above; modify it so that it generates predictions using degree 3.

In [None]:
# code for degree=3; fitting is much better!
model = PolynomialRegression(3).fit(X, y)
plt.plot(X, y, 'b.') 
plt.plot(xf.ravel(), model.predict(xf), color='gray')
plt.grid()

There are three major methods for us to focus on:
- First we call **fit(X, y)** to *train* the model on the training data X, y.
- Then we call **predict()** to produce predictions.
- Or, we can call **score(X, y)** to check the performance on data X and y.

In our examples, to plot out the prediction curves we use a range of input values in [0.0, 1.0] (using numpy's linspace() function). 
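
As a rough illustration of what **score()** reports for a regression model, the sketch below computes $R^2$ by hand and compares it with the value returned by `score()`. The data here (`X_demo`, `y_demo`) are made up for this sketch and are not the lab's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# small made-up data set: a noisy straight line
rng = np.random.RandomState(0)
X_demo = rng.rand(20, 1)
y_demo = 3.0 * X_demo.ravel() + 0.1 * rng.randn(20)

model_demo = LinearRegression().fit(X_demo, y_demo)   # train
y_pred = model_demo.predict(X_demo)                   # predict

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# the two printed values should agree
print(r2_manual, model_demo.score(X_demo, y_demo))
```

A score of 1 means a perfect fit; a model that always predicts the mean of the scored targets would get 0.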

## Just Fit It
Here's an example to show multiple prediction results using subplots. For polynomial orders from 1 to 8, we generate the fitting models, visualize them, and report the $R^2$ scores. It is easy to see which model gives the best $R^2$ score...

In [None]:
fig, ax = plt.subplots(2, 4, figsize=(16, 6))  # Create a 2x4 grid of subplots (2 rows, 4 columns) showing results
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.2, hspace=0.3) # adjust subplot layout a bit

for deg in range(1,9):
    model = PolynomialRegression(deg).fit(X, y)
    r = (deg-1)//4      # subplot row: degrees 1-4 on the first row, 5-8 on the second
    c = (deg-1)%4       # subplot column
    ax[r, c].scatter(X.ravel(), y, s=40)
    ax[r, c].plot(xf.ravel(), model.predict(xf), color='gray')
    ax[r, c].axis([-0.1, 1.0, -2, 14])
    ax[r, c].set_title('Degr.={}, $R^2$={:.3f}'.format(deg, model.score(X, y)), size=14)

<font color="red">*Your Answer*:</font> 
- The $R^2$ score keeps increasing as the degree of the polynomial model increases.
- The score is measured on the *training* data only. 

## Do It Properly
Now, generate twenty testing data instances and re-assess the score trend with the new testing data. 

In [None]:
X2, y2 = make_data(20, rseed=42)

**TO-DO.** Follow the example given above. Plot out the prediction curve and display the $R^2$ score for each model being tested on the X2,y2 data. In other words, copy and paste the code, keep ".fit(X, y)" unchanged, but change ".score(X, y)" to ".score(X2, y2)". What is now revealed by the new results? 

In [None]:
# copy and modify...
fig, ax = plt.subplots(2, 4, figsize=(16, 6))  # Create a 2x4 grid of subplots (2 rows, 4 columns) showing results
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.2, hspace=0.3) # adjust subplot layout a bit

for deg in range(1,9):
    model = PolynomialRegression(deg).fit(X, y)
    r = (deg-1)//4      # subplot row: degrees 1-4 on the first row, 5-8 on the second
    c = (deg-1)%4       # subplot column
    ax[r, c].scatter(X2.ravel(), y2, s=40)    # note: plotting out testing data instead
    ax[r, c].plot(xf.ravel(), model.predict(xf), color='gray')
    ax[r, c].axis([-0.1, 1.0, -2, 14])
    ax[r, c].set_title('Degr.={}, $R^2$={:.3f}'.format(deg, model.score(X2, y2)), size=14)

*Note*: when the fitted curves are plotted against the testing data points, quite a few *misses* are revealed. Looking at the scores, those of the polynomial models with degrees higher than 6 actually start to drop!

We now collect all test scores for display and comparison. 
For model degrees ranging from 1 to 11 (inclusive), we retrain a model (using X and y) and test it (using X2 and y2). Two lists are used to record the training and testing scores of the models.

In [None]:
# Code with testing score plotted out
tr_score=[]      # create two empty lists to collect training and testing scores
te_score=[]
degrange=range(1,12)
for deg in degrange:
    model = PolynomialRegression(deg).fit(X, y)     # training on X, y
    tr_score.append(model.score(X, y))              # score on training data  
    te_score.append(model.score(X2, y2))         # score on testing data

plt.xlabel('Degree'); plt.ylabel("$R^2$")
plt.plot(degrange, tr_score, 'g-', label='Training score')
# also plotting test scores 
plt.plot(degrange, te_score, 'r-', label='Testing score')
plt.legend(); plt.grid(True)

**TO-DO.** The testing score array is missing in the plot above. Add it in and re-run the code. 

What we get is a so-called "[run chart](https://en.wikipedia.org/wiki/Run_chart)" frequently used in machine learning practice. 
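
One way to read such a chart programmatically is to pick the degree with the highest testing score. The sketch below repeats the lab's `make_data` and `PolynomialRegression` definitions so that it runs on its own; the names `best_idx` and `best_deg` are introduced here for illustration. Note that choosing a model by its score on the testing data is a simplification: in practice a separate validation set or cross-validation is usually used for the selection.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# same helper definitions as in the cells above
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

def make_data(N=30, err=0.8, rseed=1):
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 10 - 1. / (X.ravel() + 0.1)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

X, y = make_data()                   # training data
X2, y2 = make_data(20, rseed=42)     # testing data

# train on (X, y), score on (X2, y2) for each degree
degrange = range(1, 12)
te_score = [PolynomialRegression(deg).fit(X, y).score(X2, y2)
            for deg in degrange]

# index and degree of the highest testing score
best_idx = int(np.argmax(te_score))
best_deg = degrange[best_idx]
print('Degree with the best testing score:', best_deg)
```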

**END**
