Hooke's Law
==========

We are going to "learn" Hooke's law from data.  I have provided you
with some experiments, where I attached a weight $x$ to a spring
and measured the elongation $y$ of the spring.

You should follow the machine learning procedure to train a linear model
$$
    \hat y = f(x; \theta) = \theta x
$$
and validate it against the data.  You should follow the step-by-step
procedure outlined in the lecture.

Step 1: Get data
---------------

You should find a text file `spring.dat` in your directory.  This file
has a set of rows, each row containing `x y` ($x$ and associated $y$
value).

Use numpy's `loadtxt` function to load that data as a matrix and then
assign the first column to `x` and the second column to `y`.
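
As a rough illustration of the `loadtxt` pattern (not the solution for `spring.dat` itself), the sketch below parses three made-up rows from an in-memory string instead of a file. The names `demo_text`, `data`, `col_a` and `col_b` are hypothetical, and the numbers are invented for the example.

```python
import io
import numpy as np

# Three made-up rows in the same "x y" layout (whitespace-separated columns)
demo_text = io.StringIO("0.1 0.2\n0.5 1.1\n0.9 1.7\n")

data = np.loadtxt(demo_text)   # 2-D array, one row per line of text
col_a = data[:, 0]             # first column
col_b = data[:, 1]             # second column
print(data.shape, col_a, col_b)
```

Passing a file name instead of the `StringIO` object should work the same way.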

In [None]:
import numpy as np

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
x

In [None]:
assert x is not None and y is not None
assert x.shape == y.shape == (100,)
np.testing.assert_allclose(x.sum(), 49.26, atol=1e-12, rtol=1e-12)
np.testing.assert_allclose(y.sum(), 92.868096, atol=1e-12, rtol=1e-12)

Let's plot the data to get a feel:

Make a plot of $y$ over $x$, but please **do not** connect
the points with lines: either pass a marker such as `+` as the
plot style, or use `pl.scatter`.

Remember: plots have x and y axis labels!

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Step 2: Split data into training and validation sets
--------------------------------------------------------------

Partition your data into two, about equally large, sets:

 - a training set with `xtrain` and corresponding `ytrain`
 - a validation set with `xtest` and corresponding `ytest`
 
To avoid biasing your results, it is important to do this
**randomly**.  For this we are going to use a random number
generator (created in the first code cell below).

The simplest way of splitting up the data is then the following:

 * for each data point, generate a random number between 0 and 1.
 If that number is greater than 0.5, put the corresponding `x`
 and `y` into the validation set, otherwise put it in the training
 set.  Be careful not to separate pairs of `x` and `y`.
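
One possible way to realize this recipe is a boolean mask, sketched here on made-up data. The generator `demo_rng` and the arrays `a`, `b` are hypothetical stand-ins, kept separate from the notebook's own `random`, `x` and `y`. The sketch draws all random numbers in one call, which is an assumption about style, not a requirement of the exercise.

```python
import numpy as np

# Made-up paired data; demo_rng is a separate generator used only here
demo_rng = np.random.default_rng(0)
a = np.arange(10.0)
b = 2.0 * a

# One uniform draw per data point; the same mask indexes both arrays,
# so each (a, b) pair ends up in the same set
draws = demo_rng.uniform(size=a.size)
mask = draws > 0.5

a_val, b_val = a[mask], b[mask]      # validation part
a_trn, b_trn = a[~mask], b[~mask]    # training part
print(a_val.size, a_trn.size)
```

Because both arrays are indexed with the same `mask`, the pairing between the two columns cannot get lost.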

In [None]:
# Creates a new pseudo-random number generator
random = np.random.default_rng(4711)

In [None]:
# Get a random number uniformly distributed between 0 and 1
random.uniform()

In [None]:
# Another random number
random.uniform()

In [None]:
# You should fill the variables xtest, ytest, xtrain, ytrain here
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert all(v is not None for v in (xtest, ytest, xtrain, ytrain))
assert xtest.size == ytest.size
assert xtrain.size == ytrain.size
assert xtest.size > 25 and xtest.size < 75

dfull = np.sort(np.rec.fromarrays([x,y], names="x,y"))
dsplit = np.sort(np.rec.fromarrays(
    [np.hstack([xtrain,xtest]), np.hstack([ytrain,ytest])],
    names="x,y"))
np.testing.assert_allclose(
    dfull["x"], dsplit["x"], err_msg='some x values are missing/incorrect')
np.testing.assert_allclose(
    dfull["y"], dsplit["y"], err_msg='some y values are missing/incorrect')
del dfull, dsplit

Let's again plot the training and validation set to make sure we haven't biased either set in any way.

You should modify the plot above to plot x and y values in the training set as a point cloud
and (in another color) the x and y values in the validation set.

Be sure to give each set of points a **label** and also show a **legend**.  You can
do so with `pl.plot(..., label='some label text')` and `pl.legend()`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Step 3: Formulate and train linear model
--------------------------

Let us now formulate our model and train it on the data.
First let us define a model function `fmodel` that depends on
a feature vector `x` and some parameters `theta` and
predicts a label `y`.

(Hint: `a.T` gives the transpose of a numpy array `a`)

In [None]:
def fmodel(x, theta):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert fmodel(np.array([2]), np.array([8])) == 16
assert fmodel(np.array([2,4]), np.array([8,2])) == 24
assert fmodel(np.array([1.0,1.0,1.0]), np.array([8.0,2.0,5.0])) == 15.0

Let us train the parameter `theta` now on the training set data.

For this, first take the label vector of all the $y$ values in the training set
and also create the design matrix $X$, i.e., a matrix where the rows correspond to
observations and columns correspond to features in $x$.

Then solve the **normal equations** to get the fitted value of theta.

(Hint: `np.reshape`)
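
For orientation, the normal equations for a design matrix $X$ and label vector $y$ read $X^T X \theta = X^T y$. The sketch below solves them on made-up, noise-free data with slope 3, so the fitted parameter should come out as 3. The names `feat`, `labels`, `X_demo` and `theta_demo` are hypothetical, and using `np.linalg.solve` is one choice among several.

```python
import numpy as np

# Made-up noiseless data with slope 3, so the fit should recover 3
feat = np.array([1.0, 2.0, 3.0, 4.0])
labels = 3.0 * feat

# Design matrix: one row per observation, one column per feature
X_demo = np.reshape(feat, (-1, 1))

# Normal equations: (X^T X) theta = X^T y
theta_demo = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ labels)
print(theta_demo)
```

With a single feature, $X^T X$ is a 1-by-1 matrix, so the solution reduces to $\theta = \sum_n x_n y_n / \sum_n x_n^2$.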

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
theta

In [None]:
assert theta is not None
assert theta > 1.2 and theta < 2.5


Step 4: Validate results
----------------------------
Now that we have trained our model, we should check if the model gives
useful results for the validation set.  **This is the crucial step in learning
that separates it from pure model fitting**.

First, since our data is so low-dimensional, we can actually plot the
model prediction together with the validation set.  Modify the plot above
to (1) only plot the points in the validation set and (2) plot a line
$\hat y = f(x)$ corresponding to the linear model. Do not forget
labels and legend!

Hint: you can use `np.linspace` to get a vector of equally spaced points.
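
A small sketch of how such a grid might be used to evaluate a line. The values `slope` and `grid` are made up for illustration and are not the fitted parameter or the data range.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 5)   # 5 equally spaced points, endpoints included
slope = 2.0                       # made-up parameter value
line = slope * grid               # model prediction at each grid point
print(grid, line)
```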

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

We are striving for a more quantitative approach: to do so, we are going to
define the **loss function**, which takes a vector of `x` points, a vector
of `y` points and the model parameter `theta`, and returns the loss:
$$
    \operatorname{loss}(x,y,\theta) = \sum_n |y_n - f(x_n; \theta)|^2
$$
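
To make the formula concrete, here is the sum of squared residuals worked out on three made-up points with an assumed parameter value of 2. The names `xs`, `ys`, `theta_val` and `total` are hypothetical. Only the last point deviates from the line, by 1, so the loss is $0^2 + 0^2 + 1^2 = 1$.

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 7.0])
theta_val = 2.0                          # assumed parameter value

residuals = ys - theta_val * xs          # differences y_n - theta * x_n
total = np.sum(np.abs(residuals) ** 2)   # sum of squared residuals
print(total)
```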

In [None]:
def loss(x, y, theta):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert loss(xtrain, ytrain, theta) > 0


Now we compute the loss for the training set and the validation set
(called the "in"-error and the "out"-error, respectively).

In [None]:
E_in = loss(xtrain, ytrain, theta)
print("Training ('in') error:    %.4g" % E_in)

In [None]:
E_out = loss(xtest, ytest, theta)
print("Validation ('out') error: %.4g" % E_out)

Observe that these values are similar in magnitude.

 a) What does this mean for the "learning procedure"?

 b) What would we have to conclude if the validation loss had been much greater than the training loss?
 

YOUR ANSWER HERE

Epilogue
---------

I lied to you.

You did not actually fit Hooke's law.  You instead fitted a normalized expected salary $y$ of
a job seeker based on a proprietary aptitude score $x$, calculated at the Department of Public Employment Service and Obsolescence Support (PESOS).

The top brass at PESOS are quite impressed with the model and want it to form the basis of a more
efficient distribution of funds to apt job seekers.  Reasons given:

 1. The model is fair, because it does not rely on human input.
 
 2. The model is based on cutting-edge technology, therefore it is trustworthy.
 
Discuss (briefly, I have a meeting later with them).

YOUR ANSWER HERE