# Exercise 1: A perceptron learning rule

Consider a perceptron with activity rule ${g}(h)={1\over 1+e^{-h}}$, where $h=\sum_{i=1}^N w_ix_i$ is the activation for an N-dimensional input pattern $x_i$. The training set includes $p$ patterns $\{x_i^\mu,\zeta^\mu\}_{\mu=1,\ldots,p}$, where $\zeta^\mu$ are the targets.

**Question 1**: Derive the perceptron learning rule starting from the following cost function:
\begin{align}
E={1\over 2}\sum_{\mu=1}^p\left[\zeta^\mu-{g}(h^\mu)\right]^2 \ ,\label{cost}\tag{1}
\end{align}
by applying gradient descent. Note that $\partial_h{g}={g}(1-{g})$.

**Question 2**: Add a weight decay term to the cost function and derive the new perceptron learning rule. Explain in words what the Bayesian interpretation of the weight decay term is, and what issue with training it may alleviate.

**Question 3**: What is the learning rule if you use a linear activity rule ${g}(h)=h$ instead of the sigmoid activity rule as above?




### Question 1
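First we differentiate the cost function with respect to the weight $w_i$. The factor of 2 from the square cancels the $\frac{1}{2}$, and the minus sign from differentiating $-g(h^\mu)$ cancels the leading minus sign, so the chain rule gives
\begin{align*}
-\frac{\partial E}{\partial w_i}&=\sum_{\mu=1}^p\left[\zeta^\mu-{g}(h^\mu)\right]\frac{\partial g(h^\mu)}{\partial w_i}
\end{align*}
Applying the chain rule once more, $\frac{\partial g(h^\mu)}{\partial w_i}=g'(h^\mu)\,\frac{\partial h^\mu}{\partial w_i}$, where $g'=\partial_h g=g(1-g)$ and $\frac{\partial h^\mu}{\partial w_i}=x_i^\mu$. Therefore,
\begin{align*}
-\frac{\partial E}{\partial w_i}&=\sum_{\mu=1}^p\left[\zeta^\mu-{g}(h^\mu)\right]g'(h^\mu)\,x_i^\mu\\
&=\sum_{\mu=1}^p\left[\zeta^\mu-{g}(h^\mu)\right]g(h^\mu)\left(1-g(h^\mu)\right)x_i^\mu
\end{align*}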
Therefore our perceptron learning rule can be written in terms of a per-pattern error term
\begin{align*}
\delta^\mu&=g'(h^\mu)\left(\zeta^\mu-{g}(h^\mu)\right)\\
&=\left(\zeta^\mu-{g}(h^\mu)\right)g(h^\mu)\left(1-g(h^\mu)\right)
\end{align*}
so that each pattern contributes $\delta^\mu x_i^\mu$ to the change of weight $w_i$.
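Finally, each weight is updated online (one pattern $\mu$ at a time) according to
\begin{align*}
w_i^{n+1}=w_i^{n}+\eta\left(\zeta^\mu-{g}(h^\mu)\right)g(h^\mu)\left(1-g(h^\mu)\right)x_i^\mu
\end{align*}
where $\eta$ is our learning rate. The batch version sums the second term over all $p$ patterns before updating.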
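
The derivation above can be checked numerically. The following is a minimal sketch, not part of the original exercise: it compares the analytic expression $\sum_\mu \delta^\mu x_i^\mu$ for $-\partial E/\partial w_i$ with a central finite-difference estimate and then takes one batch gradient step. The toy patterns, targets, initial weights, and all function names are hypothetical choices made for illustration.

```python
import math

def sigmoid(h):
    """Logistic activity rule g(h) = 1 / (1 + exp(-h))."""
    return 1.0 / (1.0 + math.exp(-h))

def activation(w, x):
    """Activation h = sum_i w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

def cost(w, patterns, targets):
    """Quadratic cost E = 1/2 * sum_mu (zeta^mu - g(h^mu))^2."""
    return sum(0.5 * (zeta - sigmoid(activation(w, x))) ** 2
               for x, zeta in zip(patterns, targets))

def neg_gradient(w, patterns, targets):
    """Analytic -dE/dw_i = sum_mu delta^mu * x_i^mu."""
    grad = [0.0] * len(w)
    for x, zeta in zip(patterns, targets):
        g = sigmoid(activation(w, x))
        delta = (zeta - g) * g * (1.0 - g)
        for i, xi in enumerate(x):
            grad[i] += delta * xi
    return grad

def numeric_neg_gradient(w, patterns, targets, eps=1e-6):
    """Central finite-difference estimate of -dE/dw_i."""
    estimate = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        diff = cost(w_plus, patterns, targets) - cost(w_minus, patterns, targets)
        estimate.append(-diff / (2.0 * eps))
    return estimate

def batch_step(w, patterns, targets, eta):
    """One batch update w_i <- w_i + eta * (-dE/dw_i)."""
    return [wi + eta * gi
            for wi, gi in zip(w, neg_gradient(w, patterns, targets))]

# Hypothetical toy data: three 2-dimensional patterns with 0/1 targets.
patterns = [[1.0, 0.5], [-1.0, 2.0], [0.3, -0.7]]
targets = [1.0, 0.0, 1.0]
w = [0.2, -0.4]

analytic = neg_gradient(w, patterns, targets)
numeric = numeric_neg_gradient(w, patterns, targets)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))

cost_before = cost(w, patterns, targets)
cost_after = cost(batch_step(w, patterns, targets, eta=0.1), patterns, targets)
print("largest analytic/numeric gap:", max_gap)
print("cost before and after one step:", cost_before, cost_after)
```

If the two gradients agree to within the finite-difference error and the cost drops after a small step, the sign and the $x_i^\mu$ factor in the rule are consistent with the cost function.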

### Question 2
We add the weight decay term $D(w)=\frac{\alpha}{2}\sum_{i=1}^N w_i^2$, where $\alpha\geq 0$ is a constant (the weight decay rate) that sets the strength of this term relative to the data error. Our updated cost function takes the form
\begin{align*}
E_{decay}&=\frac{1}{2}\sum_{\mu=1}^p\left[\zeta^\mu-{g}(h^\mu)\right]^2+\alpha\frac{1}{2}\sum_{i=1}^N w_i^2.\\
\end{align*}
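Similar to Question 1, we apply gradient descent to find our new learning rule. The data term is differentiated as before, and the decay term gives $\frac{\partial}{\partial w_i}\left(\frac{\alpha}{2}\sum_{j=1}^N w_j^2\right)=\alpha w_i$. Hence,
\begin{align*}
-\frac{\partial E_{decay}}{\partial w_i} &= \sum_{\mu=1}^p\left[\zeta^\mu-{g}(h^\mu)\right]g'(h^\mu)\,x_i^\mu - \alpha w_i
\end{align*}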
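Finally, each weight is updated online according to
\begin{align*}
w_i^{n+1}=w_i^{n}+\eta\left[\left(\zeta^\mu-{g}(h^\mu)\right)g(h^\mu)\left(1-g(h^\mu)\right)x_i^\mu -\alpha w_i^{n}\right]
\end{align*}
where $\eta$ is our learning rate. Equivalently, each step shrinks the weight by a factor $(1-\eta\alpha)$ and then adds the error-driven term from Question 1.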
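
As a rough illustration of the decayed update, the sketch below applies one online step with and without the decay term and isolates the difference between the two. The function name `decay_update` and the numerical values are hypothetical and chosen only for this example.

```python
import math

def sigmoid(h):
    """Logistic activity rule g(h) = 1 / (1 + exp(-h))."""
    return 1.0 / (1.0 + math.exp(-h))

def decay_update(w, x, zeta, eta, alpha):
    """One online step: w_i <- w_i + eta * (delta * x_i - alpha * w_i)."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    g = sigmoid(h)
    delta = (zeta - g) * g * (1.0 - g)  # error term from Question 1
    return [wi + eta * (delta * xi - alpha * wi) for wi, xi in zip(w, x)]

# Hypothetical values for a single pattern.
w = [1.0, -2.0]
x = [0.5, 1.5]

with_decay = decay_update(w, x, zeta=1.0, eta=0.1, alpha=0.5)
no_decay = decay_update(w, x, zeta=1.0, eta=0.1, alpha=0.0)

# The two results differ only by the shrink term -eta * alpha * w_i.
shrink = [a - b for a, b in zip(with_decay, no_decay)]
print("update with decay:   ", with_decay)
print("update without decay:", no_decay)
print("difference:          ", shrink)
```

Setting `alpha=0.0` recovers the rule from Question 1, which makes it easy to compare the two variants on the same data.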

#### Bayesian interpretation of this weight decay term:
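This weight decay term acts as a prior on the weights: a zero-mean Gaussian whose variance is inversely proportional to $\alpha$, expressing the belief, held before any data are seen, that small weights are more plausible than large ones. Minimizing the cost with the decay term then corresponds to finding the most probable weights given both the data and the prior (the maximum a posteriori estimate), rather than the weights that fit the data best on their own.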

Without the weight decay term, the weights can keep growing as training continues, in particular when the training patterns are linearly separable. Large weights make the sigmoid steep, so the predictions become overconfident and the perceptron overfits the training data. The weight decay term penalizes large weights and thereby alleviates this overfitting.
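
The Bayesian reading can be sketched in one line under two assumptions that the exercise does not state: the targets carry Gaussian noise of unit variance, so the likelihood is $P(\mathcal{D}\mid w)\propto e^{-E}$, and each weight has an independent zero-mean Gaussian prior with variance $1/\alpha$, so $P(w)\propto e^{-\frac{\alpha}{2}\sum_i w_i^2}$. Then
\begin{align*}
-\ln P(w\mid \mathcal{D})&=-\ln P(\mathcal{D}\mid w)-\ln P(w)+\text{const}\\
&=\frac{1}{2}\sum_{\mu=1}^p\left[\zeta^\mu-{g}(h^\mu)\right]^2+\frac{\alpha}{2}\sum_{i=1}^N w_i^2+\text{const}
\end{align*}
so minimizing $E_{decay}$ is equivalent to maximizing the posterior probability of the weights.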

### Question 3
Here we differentiate the cost function as in Question 1, but now with the linear activity rule ${g}(h)=h$, for which $\partial_h g=1$. Hence,
\begin{align*}
-\frac{\partial E}{\partial w_i}&=\sum_{\mu=1}^p\left[\zeta^\mu-{g}(h^\mu)\right]g'(h^\mu)\,\frac{\partial h^\mu}{\partial w_i}\\
&=\sum_{\mu=1}^p\left[\zeta^\mu-{g}(h^\mu)\right](1)(x_i^\mu)\\
&=\sum_{\mu=1}^p\left[\zeta^\mu-h^\mu\right]x_i^\mu
\end{align*}
Therefore the per-pattern error term of our learning rule reduces to the plain output error,
\begin{align*}
\delta^\mu=\zeta^\mu-h^\mu
\end{align*}
and each pattern contributes $\delta^\mu x_i^\mu$ to the change of weight $w_i$.
Finally, each weight is updated online according to
\begin{align*}
w_i^{n+1}=w_i^{n}+\eta\left(\zeta^\mu-h^\mu\right)x_i^\mu
\end{align*}
where $\eta$ is our learning rate.
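
The linear rule is small enough to try out directly. The sketch below is an illustration under stated assumptions, not part of the original solution: the targets are generated without noise from a hypothetical weight vector `true_w`, and repeated online passes of the update are expected to recover it. The function name `linear_epoch`, the patterns, and the learning rate are all illustrative choices.

```python
def linear_epoch(w, patterns, targets, eta):
    """One online pass of w_i <- w_i + eta * (zeta - h) * x_i over all patterns."""
    w = list(w)
    for x, zeta in zip(patterns, targets):
        h = sum(wi * xi for wi, xi in zip(w, x))  # linear output g(h) = h
        error = zeta - h
        w = [wi + eta * error * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical noise-free data generated from known weights.
true_w = [2.0, -1.0]
patterns = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [sum(wi * xi for wi, xi in zip(true_w, x)) for x in patterns]

w = [0.0, 0.0]
for _ in range(500):
    w = linear_epoch(w, patterns, targets, eta=0.1)

print("learned weights:", w)
print("true weights:   ", true_w)
```

Because the output is linear in the weights, the cost is a quadratic bowl, so a sufficiently small learning rate is enough for this toy problem to converge.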