# Exercise Sheet 2

## Exercise 2-1: The Perceptron in more than two Dimensions

### a)
The digits 0 through 9 were represented by pixel arrays in the lecture; the corresponding data matrix can be found in the file _numberMatrix.RData_. Use this data to train a perceptron that distinguishes between odd and even numbers. Vary w and η. Additionally, answer the question: does the perceptron learning rule terminate for the problem "is a multiple of 3"?
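A minimal sketch of the perceptron learning rule, under stated assumptions: labels are encoded as -1/+1, a bias input of 1 is prepended, and the lecture's exact update rule may differ in detail. The function name `train_perceptron` and the toy data are hypothetical stand-ins, since the digit matrix from _numberMatrix.RData_ is not reproduced here. The `converged` flag shows the termination behaviour: the rule only stops on its own when the two classes are linearly separable.

```python
import numpy as np

def train_perceptron(X, t, w=None, eta=1.0, max_epochs=1000):
    """Perceptron learning rule. X has shape (N, M), t holds labels in {-1, +1}."""
    N, M = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])  # prepend the bias input
    w = np.zeros(M + 1) if w is None else np.asarray(w, dtype=float).copy()
    for epoch in range(max_epochs):
        errors = 0
        for x_i, t_i in zip(Xb, t):
            if t_i * np.dot(w, x_i) <= 0:  # misclassified or on the boundary
                w += eta * t_i * x_i       # move the boundary towards x_i
                errors += 1
        if errors == 0:                    # a full pass without mistakes
            return w, epoch + 1, True
    return w, max_epochs, False            # gave up: no error-free pass found

# Hypothetical toy inputs (NOT the digit data)
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Logical AND is linearly separable, so the rule terminates
t_and = np.array([-1, -1, -1, 1])
w_and, epochs_and, converged_and = train_perceptron(X_toy, t_and, eta=0.5)

# XOR is not linearly separable, so the rule never reaches an error-free pass
t_xor = np.array([-1, 1, 1, -1])
w_xor, epochs_xor, converged_xor = train_perceptron(X_toy, t_xor, eta=0.5)

print(converged_and, converged_xor)
```

For the exercise, the rows of the digit matrix would take the place of `X_toy`, with labels derived from "odd/even" or "multiple of 3"; varying `w` and `eta` changes the number of epochs needed, not whether a separable problem terminates.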

### b)
What is the complexity of training a perceptron on an M-dimensional dataset with N input patterns?
What is the cost of a prediction after the perceptron has been trained?

## Exercise 2-2: Linear Regression

Let $x$ be an input variable and $y$ the corresponding observed values:

x | 3 | 4 | 5 | 6 | 7 | 8
--|---|---|---|---|---|---
y | 150 | 155 | 150 | 170 | 160 | 175

### a)
Presume the model exhibits the following linear relation:

$$
y_i = \beta_0 + \beta_1x_i = x^Tw
$$

Use the least-squares estimator introduced in the lecture to determine $w$.

Least squares: $cost(w) = \sum_i ( y_i - \hat{y}_i )^2$

(Here $\hat{y}_i = f(x_i,w)$ is the model prediction for $x_i$; stacking all predictions gives $\hat{y} = Xw$, so the squared error becomes:)

$cost(w) = (y-Xw)^T(y-Xw)$

With the values given for $x$ we get:

$X = \begin{pmatrix} 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \\ 1 & 7 \\ 1 & 8 \end{pmatrix}$

(The first column of ones corresponds to the bias term and is multiplied by $\beta_0$.)

We take the first derivative with respect to $w$ and set it to $0$ (i.e. we minimize the error):

$$
\frac{\partial cost(w)}{\partial w} = -2X^T(y-Xw) = 0
$$

When solving for $w$ (i.e. $w_{ls}$) we get:

$$
\begin{align}
-2X^T(y-Xw) &= 0 \\
-2X^Ty + 2X^TXw &= 0 \\
2X^TXw &= 2X^Ty \\
X^TXw &= X^Ty \\
w &= (X^TX)^{-1}X^Ty
\end{align}
$$

The last step multiplies from the left by $(X^TX)^{-1}$, which exists here because the columns of $X$ are linearly independent.
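The derivation can be checked numerically. The following is a sketch, not part of the original solution: it uses plain `np.array` instead of `np.matrix`, solves the normal equations with `np.linalg.solve` rather than forming the inverse, and compares the gradient formula against central finite differences at an arbitrarily chosen point `w0`. The helper names `cost` and `grad` are introduced here for illustration.

```python
import numpy as np

X = np.array([[1, 3], [1, 4], [1, 5], [1, 6], [1, 7], [1, 8]], dtype=float)
y = np.array([150, 155, 150, 170, 160, 175], dtype=float)

def cost(w):
    """Squared error (y - Xw)^T (y - Xw)."""
    r = y - X @ w
    return r @ r

def grad(w):
    """Gradient from the derivation: -2 X^T (y - Xw)."""
    return -2 * X.T @ (y - X @ w)

# Solve the normal equations X^T X w = X^T y without an explicit inverse
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
grad_at_min = grad(w_ls)  # should vanish up to rounding error

# Compare the analytic gradient with central finite differences at w0
w0 = np.array([1.0, 2.0])
eps = 1e-4
grad_numeric = np.array([
    (cost(w0 + eps * e) - cost(w0 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(w_ls)
print(np.allclose(grad_at_min, 0.0, atol=1e-6))
print(np.allclose(grad_numeric, grad(w0), rtol=1e-4))
```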

In [38]:
import numpy as np
import matplotlib.pyplot as plt

X = np.matrix([[1,3], [1,4], [1,5], [1,6], [1,7], [1,8]])

y = np.matrix([[150], [155], [150], [170], [160], [175]])

# Watch out: matrices cannot be divided!
# Multiply by the inverse instead (for np.matrix, **-1 is the matrix inverse
# and * is the matrix product).

ws_ls = ((X.T * X)**-1) * X.T * y

print(ws_ls)

[[ 134.85714286]
 [   4.57142857]]


so we get:

$\hat{y_i} \approx 134.857 + 4.571x_i$

In [37]:
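# Flatten the x column of the np.matrix into a plain 1-D array so it can be plotted
x_vals = np.asarray(X[:, 1]).ravel()
y_vals = 134.857 + 4.571 * x_vals

plt.scatter(x_vals, np.asarray(y).ravel(), label="data")
plt.plot(x_vals, y_vals, label="linear fit")
plt.legend()
plt.show()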

### b)
Now, presume the non-linear relation

$$
y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 = x^Tw
$$

and, again, determine $w$.

In [19]:
# only X has changed (the squares of x are added as a third column):

X = np.matrix([[1,3,9], [1,4,16], [1,5,25], [1,6,36], [1,7,49], [1,8,64]])

ws_ls2 = ((X.T*X)**-1 ) * X.T * y

print(ws_ls2)

[[ 149.5       ]
 [  -1.32142857]
 [   0.53571429]]


so we get:

$\hat{y}_i \approx 149.5 - 1.321x_i + 0.536x_i^2$
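As a cross-check of these coefficients, the design matrix can also be generated instead of typed by hand. This sketch is an alternative to the cell above, not the original solution: it assumes `np.vander` with `increasing=True` to produce the columns $1, x, x^2$ and uses `np.linalg.lstsq` as the solver.

```python
import numpy as np

x = np.array([3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([150, 155, 150, 170, 160, 175], dtype=float)

# Columns 1, x, x^2 in increasing powers, the same layout as X above
X_quad = np.vander(x, N=3, increasing=True)

# Least-squares solution; lstsq avoids forming (X^T X)^-1 explicitly
w_quad, _, rank, _ = np.linalg.lstsq(X_quad, y, rcond=None)

print(w_quad)
```

The result should agree with `ws_ls2` up to rounding.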

In [None]:
# TODO: plot

### c)
How could the empirical quadratic error between model and data be visualized? Explain and sketch your suggestion in two as well as in three dimensions using arbitrary data.

### d)
Which of the models a) and b) is better? Compute the average quadratic error and evaluate the models. How could a better model be realized?
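For the numerical part of this question, the average quadratic error of both fits could be computed along the following lines. This is a sketch with a hypothetical helper `fit_and_mse`; it refits both models with `np.linalg.lstsq` and evaluates them on the training data only. Because model a) is the special case $\beta_2 = 0$ of model b), the training error of b) can never be larger than that of a), so this number alone does not settle which model generalizes better.

```python
import numpy as np

x = np.array([3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([150, 155, 150, 170, 160, 175], dtype=float)

def fit_and_mse(degree):
    """Fit a polynomial of the given degree by least squares.

    Returns the weights and the mean squared error on the training data.
    """
    X = np.vander(x, N=degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ w
    return w, np.mean(residuals ** 2)

w_lin, mse_lin = fit_and_mse(1)    # model a)
w_quad, mse_quad = fit_and_mse(2)  # model b)

print(mse_lin, mse_quad)
```

On this data the average quadratic error comes out at roughly 30.7 for a) and 28.9 for b), a small improvement bought with one additional parameter.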

## Exercise 2-3: Regularisation / Overfitting

### a)
What is _overfitting_ and how does it occur?

### b)
How can a model be identified as "overfitted"?

### c)
How can overfitting be avoided?

## Exercise 2-4: Curse of Dimensionality vs. Kernel Trick

### a)
Explain the term _curse of dimensionality_.
When does it occur, how can it be avoided?

### b)
Explain the _Kernel Trick_.
How can it be used, what is its connection to the _curse of dimensionality_?

## Exercise 2-5: Basis Functions of Neural Networks

Given a test vector $x_i$, the output of a neural network is defined as
$$
f(x_i) = \sum_{h=0}^{M_\phi-1}w_h\phi_h(x_i,v_h).
$$
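The formula above can be sketched in a few lines. The code assumes sigmoid basis functions of the form $\phi_h(x_i, v_h) = \mathrm{sig}(v_h^T x_i)$ and an input vector whose first component is the constant 1; the function names, array shapes and numbers are hypothetical and only serve to make the sum concrete.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def network_output(x, V, w):
    """f(x) = sum_h w_h * phi_h(x, v_h) with phi_h = sigmoid(v_h^T x).

    x: input of shape (M + 1,), assumed to carry a leading 1 for the bias
    V: hidden-layer weights of shape (M_phi, M + 1), one row v_h per neuron
    w: output weights of shape (M_phi,)
    """
    z = V @ x         # activations z_h = v_h^T x
    phi = sigmoid(z)  # basis function outputs
    return w @ phi    # weighted sum over all basis functions

# Hypothetical numbers: with V = 0 every sigmoid outputs 0.5
x_example = np.array([1.0, 0.3, -0.7])
V_example = np.zeros((3, 3))
w_example = np.array([1.0, 2.0, 3.0])

f_example = network_output(x_example, V_example, w_example)
print(f_example)
```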

The weights of the neurons can be learned by employing the back-propagation rule with sample-based gradient descent. In the lecture, neural networks with sigmoid neurons were introduced, but it is possible to employ different basis functions:

### a)
Which properties do these basis functions have to fulfill?

### b)
Can a linear combination $\phi(x_i, v_h) = z_h = \sum_{j=0}^M v_{h,j}x_{i,j}$ be suitable for this?

### c)
Is the number of parameters for $\phi(x_i,v_h$