# Exercise: Polynomial regression

In this exercise we train polynomial regressors. Given data $(x_1,y_1),\ldots, (x_n,y_n)$ we search for polynomial regressors $f$ such that $$\sum_{i=1}^n (y_i-f(x_i))^2$$ becomes small.

First we do some of the necessary imports.

In [None]:
import numpy as np  # for general scientific computing
import matplotlib.pyplot as plt # plotting
import sklearn.linear_model # linear regression 
import sklearn.preprocessing # for polynomial regression (see below)

We generate a small training and test set. Both sets come from the quadratic function $2.5x^2-2x+1$, with a little bit of random noise added. To illustrate both sets we plot them, as well as the quadratic function.

You do not need to understand the code below in full detail. There is just one detail I'd like to point out. The estimators of *scikit learn* expect the training set to have a particular format: a list (or rather an array) of (multidimensional) data vectors. Because of that the estimators will raise an error if the training set is 1-dimensional. Consequently, we need to <code>reshape</code> the data. That is, we turn 1-dimensional data into 2-dimensional data. Let me demonstrate.

In [None]:
one_dim=np.array([1.,2.,42.])
one_dim

In [None]:
one_dim.reshape(-1,1)
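
To make the effect of <code>reshape</code> more tangible, here is a small sketch that prints the shapes before and after. The names <code>as_column</code> and <code>back_again</code> are made up for this illustration.

```python
import numpy as np

one_dim = np.array([1., 2., 42.])
as_column = one_dim.reshape(-1, 1)  # -1 lets NumPy infer the number of rows

print(one_dim.shape)    # (3,)
print(as_column.shape)  # (3, 1)

# Flattening the column again recovers the original 1-dimensional array
back_again = as_column.ravel()
print(back_again.shape)  # (3,)
```

So <code>reshape(-1,1)</code> turns three numbers into three data points with one feature each, which is the format the estimators expect.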

Now let's generate the training set consisting of data <code>x_train</code> and target values <code>y_train</code>, and the test set consisting of <code>x_test</code> and <code>y_test</code>.

In [None]:
def true_function(x):
    return 2.5*x**2-2*x+1

def draw_points(number):
    np.random.seed(42*number)
    x = np.random.random(number)
    noise = np.random.normal(scale=0.05,size=number)
    y = true_function(x) + noise
    return x.reshape(-1,1),y # x reshaped because training set cannot be 1-dim

training_size = 10
x_train,y_train=draw_points(training_size)

test_size = 5
x_test,y_test = draw_points(test_size)

xx=np.linspace(0,1,200)
yy_true=true_function(xx)
fig,ax=plt.subplots(figsize=(5,5))
ax.plot(xx,yy_true,"b-",label="true curve",linewidth=4)
ax.plot(x_test.flat,y_test,"rx",label="test set")
ax.plot(x_train.flat,y_train,"go",label="training set")
ax.legend()
plt.show()

Next, we train a linear regression. For this, we first instantiate a <code>LinearRegression</code> object, which we find in the package <code>sklearn.linear_model</code>. Then it needs to be trained. All classifiers and regressors in *scikit learn* have a method <code>fit</code> for this.

In [None]:
lin_reg=sklearn.linear_model.LinearRegression()
lin_reg.fit(x_train,y_train)

Once trained, we can <code>predict</code> new values. The method <code>predict</code> expects the $x$-values in the same format as the training set, ie, a list of (multidimensional) data points. That means that even if you just need the prediction for a single data point, you need to wrap it in square brackets: <code>estimator.predict([single_x])</code>.

Because we have 1-dimensional $x$ we even need to put the data into two square brackets. Let's predict two values.

In [None]:
lin_reg.predict([[1],[2]])
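
For a single data point the same rule applies. The following is a minimal sketch on toy data (the names <code>x_toy</code>, <code>y_toy</code> and <code>toy_reg</code> are assumptions for this illustration, not part of the exercise): the points lie exactly on the line $y=2x+1$, so the prediction for $x=4$ should come out close to $9$.

```python
import numpy as np
import sklearn.linear_model

# Toy data (an assumption for illustration): points lying exactly on y = 2x + 1
x_toy = np.array([0., 1., 2., 3.]).reshape(-1, 1)
y_toy = 2 * x_toy.ravel() + 1

toy_reg = sklearn.linear_model.LinearRegression()
toy_reg.fit(x_toy, y_toy)

# A single 1-dimensional data point still needs two pairs of brackets
single_prediction = toy_reg.predict([[4.]])
print(single_prediction)        # array with one entry, close to 9
print(single_prediction.shape)  # (1,)
```

Note that the result is again an array, even though we asked for only one prediction.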

Linear regression is not very interesting. Let's do quadratic regression. We already know that quadratic regression is just linear regression with a modified training set. In *scikit learn* we can easily compute features of any degree with <code>PolynomialFeatures</code>. Just like an estimator, <code>PolynomialFeatures</code> needs to be fit first. <code>fit</code> does not do much here: It simply memorises the dimension of the data. Instead of a <code>predict</code> method, we have here a <code>transform</code> method. Let's have a look.

In [None]:
quad_features=sklearn.preprocessing.PolynomialFeatures(degree=2)
x_demo=np.array([0,1,2,3]).reshape(-1,1) # reshape because estimator expects multidim training set
quad_features.fit(x_demo)
quad_features.transform(x_demo)

We see that the data <code>[2]</code> turns into <code>[1,2,4]</code>, ie, constant term, linear term and quadratic term, as expected.
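
As a side note, here is a small sketch of two conveniences of <code>PolynomialFeatures</code>: <code>fit_transform</code> does the fitting and transforming in a single call, and the parameter <code>include_bias=False</code> leaves out the constant column. The variable names are made up for this illustration.

```python
import numpy as np
import sklearn.preprocessing

x_demo = np.array([0, 1, 2, 3]).reshape(-1, 1)

# Two separate steps: first fit, then transform
two_step = sklearn.preprocessing.PolynomialFeatures(degree=2)
two_step.fit(x_demo)
features_two_step = two_step.transform(x_demo)

# The same in one call
features_one_step = sklearn.preprocessing.PolynomialFeatures(degree=2).fit_transform(x_demo)

# include_bias=False drops the constant column of ones
features_no_bias = sklearn.preprocessing.PolynomialFeatures(
    degree=2, include_bias=False
).fit_transform(x_demo)

print(features_one_step[2])  # row for x=2: constant, linear, quadratic term
print(features_no_bias[2])   # row for x=2 without the constant term
```

In the rest of the exercise we keep the default, ie, the constant column is included.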

Next, we can plug the new, transformed training set into a linear regression, and obtain a quadratic regressor.

In [None]:
reg=sklearn.linear_model.LinearRegression()
reg.fit(quad_features.transform(x_train),y_train)

What is a bit annoying: If we now want to do predictions, we first need to <code>transform</code> the $x$-values via <code>quad_features</code> and then call <code>predict</code> on <code>reg</code>:

In [None]:
reg.predict(quad_features.transform([[1],[3]]))
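
If you want to convince yourself that quadratic features plus linear regression really amount to a least-squares fit of a quadratic polynomial, you can compare against <code>np.polyfit</code>. The following is only a sketch on toy data; the data and all variable names are assumptions for this illustration.

```python
import numpy as np
import sklearn.linear_model
import sklearn.preprocessing

# Toy data (an assumption for illustration): a noisy quadratic on [0, 1]
rng = np.random.RandomState(0)
x = rng.random_sample(20)
y = 2.5 * x**2 - 2 * x + 1 + rng.normal(scale=0.05, size=20)

# Route 1: quadratic features followed by ordinary linear regression
features = sklearn.preprocessing.PolynomialFeatures(degree=2)
features.fit(x.reshape(-1, 1))
reg_check = sklearn.linear_model.LinearRegression()
reg_check.fit(features.transform(x.reshape(-1, 1)), y)

# Route 2: NumPy's least-squares polynomial fit of degree 2
coefficients = np.polyfit(x, y, deg=2)  # highest power first

# Both routes should give the same predictions up to rounding
x_new = np.array([0.1, 0.5, 0.9])
pred_sklearn = reg_check.predict(features.transform(x_new.reshape(-1, 1)))
pred_numpy = np.polyval(coefficients, x_new)
print(np.allclose(pred_sklearn, pred_numpy))
```

Both routes minimise the same sum of squared errors, so their predictions agree.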

Because transforming by hand before every prediction is so annoying, *scikit learn* has the class <code>Pipeline</code> that allows us to chain any number of transformers and a final predictor. Once the pipeline is defined, it can be used like a normal regressor, ie, we can call <code>fit</code> and also <code>predict</code> on it. Here's how that works.

In [None]:
import sklearn.pipeline
quad_features=sklearn.preprocessing.PolynomialFeatures(degree=2)
reg=sklearn.linear_model.LinearRegression()
### this is what we want to chain:
steps=[('quadratic features',quad_features),('linear regression',reg)] # always pairs (name, estimator)
quad_pipe=sklearn.pipeline.Pipeline(steps)
quad_pipe.fit(x_train,y_train)

The whole pipeline is now trained and can be used for prediction.

In [None]:
quad_pipe.predict([[1],[3]])
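
The individual steps of a trained pipeline remain accessible through the attribute <code>named_steps</code>, using the names given in the list of steps. Here is a minimal sketch on noise-free toy data (data and variable names are assumptions for this illustration), in which the fitted coefficients should come out close to those of the polynomial that generated the data.

```python
import numpy as np
import sklearn.linear_model
import sklearn.pipeline
import sklearn.preprocessing

# Toy data (an assumption for illustration): noise-free points on 2.5x^2 - 2x + 1
x_exact = np.linspace(0, 1, 10).reshape(-1, 1)
y_exact = 2.5 * x_exact.ravel()**2 - 2 * x_exact.ravel() + 1

demo_pipe = sklearn.pipeline.Pipeline([
    ('quadratic features', sklearn.preprocessing.PolynomialFeatures(degree=2)),
    ('linear regression', sklearn.linear_model.LinearRegression()),
])
demo_pipe.fit(x_exact, y_exact)

# Each step can be looked up by the name it was given in the pipeline
fitted_reg = demo_pipe.named_steps['linear regression']
print(fitted_reg.intercept_)  # constant term of the fitted polynomial
print(fitted_reg.coef_)       # one weight per feature column: constant, x, x^2
```

The constant term ends up in <code>intercept_</code> because <code>LinearRegression</code> fits its own intercept by default.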

### Task: Train degree 10 regressor ###

In the same way as above, define a regressor of degree 10 with a pipeline and train it on the training set. The <code>Pipeline</code> object that encapsulates the regressor should be called <code>ten_pipe</code>.

In [None]:
### insert your code here ###
### end of insert ###

ten_pipe.predict([[1],[3]]) # this should run without any errors

Let's plot the result. (This will not work if you messed up the regressors, for instance if the degree 10 pipeline is not called <code>ten_pipe</code>.)

In [None]:
fig,ax=plt.subplots(figsize=(10,5))
ax.set_ylim(0,2)
xx=np.linspace(0,1,400)
estimators=[(1,lin_reg),(2,quad_pipe),(10,ten_pipe)]
estimator_degrees=[1,2,10]
for degree,estimator in estimators:
    ax.plot(xx,estimator.predict(xx.reshape(-1,1)),linewidth=3,label="degree {}".format(degree))
ax.plot(x_test.flat,y_test,"rx",label="test set")
ax.plot(x_train.flat,y_train,"go",label="training set")
ax.legend()

Next, let's compute errors. For regression, we often use the *mean square error*, or mse for friends:
$$
\text{mse}(y,y')=\frac{1}{n}\sum_{i=1}^n(y_i-y'_i)^2
$$

The mean square error is implemented as the function <code>mean_squared_error</code> in the package <code>sklearn.metrics</code>. Let's try it out.

In [None]:
import sklearn.metrics

mse=sklearn.metrics.mean_squared_error
y1=[1,1,1]
y2=[1.1,0.9,1.1]
mse(y1,y2)
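
To connect the function with the formula, here is a small sketch that computes the same quantity by hand with *numpy*. The names <code>mse_by_hand</code> and <code>mse_sklearn</code> are made up for this illustration.

```python
import numpy as np
import sklearn.metrics

y1 = np.array([1, 1, 1])
y2 = np.array([1.1, 0.9, 1.1])

# Mean of the squared differences, exactly as in the formula
mse_by_hand = np.mean((y1 - y2) ** 2)
mse_sklearn = sklearn.metrics.mean_squared_error(y1, y2)

print(mse_by_hand, mse_sklearn)  # both approximately 0.01
```

Each of the three differences is $\pm 0.1$, so every squared difference is $0.01$, and so is their mean.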

### Task: errors ###

Compute the training error and test error for the three regressors we've trained above. Output the errors with <code>print</code>. To print out text and a variable at the same time, you can do the following: <code>print("this {} is the value of the variable".format(some_variable))</code>

In [None]:
### insert your code here ###

### Task: interpretation ###

Interpret the results, give an explanation for the training and test errors (at most three sentences).

Write your answer here:
