
## Neural Networks and Gradient Descent

Main concepts: 

* $y(x,w) = h(w^Tx)$. Know how to draw the network from a description.

A useful simplification is to write $v = w^Tx$, so that $y(x,w) = h(v)$, where $h(v)$ is the activation function.
* $E(w) = \frac{1}{2} \sum_{n=1}^{N} || y_n - t_n || ^2 $  

* Updating the weights: $w^{new} = w^{old} - \eta \frac{dE}{dw}$

* For a single datum $n$ (online learning), $\frac{dE}{dw} = (y_n-t_n)\frac{dy}{dw}$

* Recall that $\bbox[5px,border:1px solid black]{
\frac{dy}{dw} = y' =  \frac{dh}{dw} = \frac{dh}{dv}\frac{dv}{dw}
}$
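A quick way to build confidence in the chain-rule expression above is to compare it with a finite-difference estimate of the gradient. The sketch below assumes $h = \tanh$ (so $h'(v) = 1 - h(v)^2$) and uses made-up values for $w$, $x$ and $t$; the function names are illustrative only.

```python
import math

def forward(w, x):
    """Single-unit network y = h(w^T x), assuming h = tanh."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    return math.tanh(v)

def error(w, x, t):
    """Squared error for one datum: E = 0.5 * (y - t)^2."""
    return 0.5 * (forward(w, x) - t) ** 2

def analytic_grad(w, x, t):
    """dE/dw = (y - t) * h'(v) * x, with h'(v) = 1 - tanh(v)^2."""
    y = forward(w, x)
    return [(y - t) * (1.0 - y * y) * xi for xi in x]

def numeric_grad(w, x, t, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((error(w_plus, x, t) - error(w_minus, x, t)) / (2 * eps))
    return grad

# Hypothetical values, chosen only for illustration.
w = [0.1, -0.2, 0.5]
x = [1.0, 0.7, -1.3]  # x[0] = 1 is the bias input x_0
t = 0.4

g_analytic = analytic_grad(w, x, t)
g_numeric = numeric_grad(w, x, t)
max_diff = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))
print("analytic:", g_analytic)
print("numeric: ", g_numeric)
print("max abs difference:", max_diff)
```

If the two gradients agree to several decimal places, the derivation is very likely right; a large mismatch usually points to a missing factor such as $\frac{dh}{dv}$.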





### Q2 expanded

The online gradient descent algorithm is used to train a single layer perceptron given by 

$y(x, w) = h(∑^2_{i=0}w_ix_i)$ where $x = [x_1, x_2]^T$, $x_0 = 1$ and $w = [w_0, w_1, w_2]^T$

$h(v) = tanh(v) = \frac{exp(v)−exp(−v)} {exp(v)+exp(−v)} $

(Note: $h'(v) = (1 − h(v)^2)$). 

Assume that the current weight vector is $w = [1, 0.3, 0.4]^T$. Calculate the updated weight vector given a new training datum
$[x, t] = [2, 1, 0.2]^T$ (that is, $x = [2, 1]^T$ and $t = 0.2$), using the learning rate $η = 0.02$.


#### Approach:


To update the weights: $w^{new} = w^{old} - \eta \frac{dE}{dw}$ 

* We have $w^{old} =  [1, 0.3, 0.4]^T$, $\eta = 0.02$, $x =  [1, 2, 1]^T$ (the input $[2, 1]^T$ with $x_0 = 1$ prepended) and $t_n = 0.2$. 


* We need $\frac{dE}{dw} = (y_n-t_n)\frac{dy}{dw}$  where $\frac{dy}{dw} = y' =  \frac{dh}{dw} = \frac{dh}{dv}\frac{dv}{dw}$


* We can calculate $ y_n = h(v), v = w^Tx $. 


* For $y' =  \frac{dh}{dw} = \frac{dh}{dv}\frac{dv}{dw}$ we have $\frac{dh}{dv} = h'(v) = 1 − h(v)^2 = 1 − y_n^2$. Since $v = w^Tx$, the remaining factor is $\frac{dv}{dw} = x$, the input vector including the $x_0 = 1$ component.


* Pulling all the terms together we get \\
$\frac{dE}{dw} = (y_n-t_n)(1 − y_n^2)[1,  x_n^T]^T  $ \\
where $[1, x_n^T]^T$ is the input $x_n$ with the $x_0 = 1$ component prepended, which here gives $[1, 2, 1]^T$.
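The steps above can be checked numerically. The following is a minimal sketch that plugs the Q2 values into the update rule; the variable names are illustrative, and the printed values are approximate.

```python
import math

# Values taken from the question; x has x_0 = 1 prepended.
w_old = [1.0, 0.3, 0.4]
x = [1.0, 2.0, 1.0]
t = 0.2
eta = 0.02

# Forward pass: v = w^T x, y = tanh(v).
v = sum(wi * xi for wi, xi in zip(w_old, x))
y = math.tanh(v)

# Gradient: dE/dw = (y - t) * (1 - y^2) * x.
delta = (y - t) * (1.0 - y ** 2)
grad = [delta * xi for xi in x]

# Update: w_new = w_old - eta * dE/dw.
w_new = [wi - eta * gi for wi, gi in zip(w_old, grad)]

print("v     =", v)                           # about 2.0
print("y     =", round(y, 5))                 # about 0.96403
print("grad  =", [round(g, 5) for g in grad]) # about [0.05398, 0.10796, 0.05398]
print("w_new =", [round(wi, 5) for wi in w_new])  # about [0.99892, 0.29784, 0.39892]
```

Because $y_n$ is already close to 1, the factor $(1 − y_n^2)$ is small and the weights barely move; this is the saturation effect of $\tanh$.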


## Q4 

The mathematical form for a two layer MLP is 
 $$ y(x, w) = σ(∑^2_{j=0}w^{(2)}_j h(∑^2_{i=0}w^{(1)}_{ji} x_i))$$
 
where $x_0 = 1, x ∈ ℜ^D$. The superscripts (1) and (2) indicate whether a weight belongs to the first or second layer. 

For $σ(v) = h(v) = 1$ if $v ≥ 0$ and $0$ if $v < 0 $: 

Calculate the network outputs for each of the inputs $x = [−1, 3]^T, x = [−3, 1]^T, x = [−2, 2]^T$
in the cases that the network weights are
$w^{(1)}_{ji} = 1, ∀i, j$ and $w^{(2)} = [0, 0.5, −0.5]^T, w^{(2)} = [0, 1.5, −0.5]^T, w^{(2)} =[0, 0.5, −1.5]^T$  respectively.

### Approach
Separate out the layers: 

Layer 1: \\
Inputs: $x_{(3 \times 1)}$ where $x_0 = 1$ \\
Weights: $w^{(1)}_{ji}$ dim $3 \times 3$ where $w^{(1)}_{ji} = 1, ∀i, j$ \\
Activation function: $h(v) = 1$ if $v ≥ 0$ and $0$ if $v < 0 $ \\
Output: the hidden vector $h$ with components $h_j = h(∑^2_{i=0}w^{(1)}_{ji}x_i)$ and dim $3\times1$

Layer 2: \\
Inputs: $h_{(3 \times 1)}$  \\
Weights: $w^{(2)}_{j} $ dim $3 \times 1$ with values as defined above \\
Activation function: $σ(v) = 1$ if $v ≥ 0$ and $0$ if $v < 0 $ \\
Output: $y = σ(w^{(2)T}h)$ with dim $1\times1$
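The two layers can be sketched in a few lines of Python. This is an illustrative forward pass only, under two assumptions: the second weight vector is read as $[0, 1.5, −0.5]^T$ (three components, matching the $3 \times 1$ dimension), and every input is evaluated against every second-layer weight vector. The function names are made up for this sketch.

```python
def step(v):
    """Threshold activation: 1 if v >= 0, else 0 (used for both h and sigma)."""
    return 1 if v >= 0 else 0

def mlp_output(x, w1, w2):
    """Forward pass of the two-layer network for one input x = [x_1, x_2]."""
    x_aug = [1.0] + list(x)  # prepend x_0 = 1
    # Layer 1: h_j = h(sum_i w1[j][i] * x_i) for j = 0, 1, 2.
    hidden = [step(sum(wji * xi for wji, xi in zip(row, x_aug))) for row in w1]
    # Layer 2: y = sigma(sum_j w2[j] * h_j).
    return step(sum(wj * hj for wj, hj in zip(w2, hidden)))

w1 = [[1.0, 1.0, 1.0] for _ in range(3)]  # w^(1)_ji = 1 for all i, j
inputs = [[-1, 3], [-3, 1], [-2, 2]]
# Second vector assumed to be [0, 1.5, -0.5].
w2_cases = [[0, 0.5, -0.5], [0, 1.5, -0.5], [0, 0.5, -1.5]]

# outputs[k][n] is the network output for weight case k and input n.
outputs = [[mlp_output(x, w1, w2) for x in inputs] for w2 in w2_cases]
for w2, row in zip(w2_cases, outputs):
    print("w2 =", w2, "->", row)
# Under these assumptions the rows are [1, 1, 1], [1, 1, 1] and [0, 1, 0].
```

Since all first-layer weights are 1, the three hidden units always agree: they output 1 when $1 + x_1 + x_2 ≥ 0$ and 0 otherwise, so only the second layer distinguishes the cases.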



---





## Radial Basis Function Networks

Using Gaussians as basis functions, $\phi(x)$: 

$$ y(x) = \sum^M_{i=1} w_i exp (−\frac{(∥x − c_i∥)^2}{2σ^2}) = \sum^M_{i=1} w_i\phi_i(x) $$

which is still linear in the weights $w_i$. Here $σ$ is the width of the Gaussians, not the output activation from Q4. 


### Using as discriminant function for classification

$$ g(x) = \sum^M_{i=1} w_i exp (−\frac{(∥x − c_i∥)^2}{2σ^2})$$

Where there are $M$ basis functions with $M$ centroids $c_i$ and $M$ weights $w_i$ attached to the basis functions. 

* $g(x)$ is a non-linear function of $x$, unlike the linear discriminant from earlier lectures
* Centroids may be found using k-means 
* Weights may be found using least squares
$$w = (Φ^TΦ)^{−1}Φ^T t$$

Where $\Phi$ is $N \times M$ with entries $\Phi_{ni} = \phi_i(x_n)$, $N$ is the number of data points $(x_n,t_n)$ and $M$ is the number of weights and centroids. 

* Once $g(x)$ is known, plug in new data points and classify them by the sign of $g$: $g>0$ for one class and $g<0$ for the other.
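To make the recipe above concrete, here is a minimal sketch on a hypothetical 1-D toy problem where one class sits in the middle and the other on both sides, so no single threshold on $x$ separates them. The data, the three centres (fixed by hand rather than found with k-means) and $σ = 1$ are all assumptions for illustration. The weights come from solving the normal equations $(Φ^TΦ)w = Φ^Tt$ with a small hand-written Gaussian elimination.

```python
import math

def phi(x, c, sigma):
    """Gaussian basis function for scalar x and centre c."""
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][k] * w[k] for k in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def fit_weights(Phi, t):
    """Least squares via the normal equations (Phi^T Phi) w = Phi^T t."""
    m = len(Phi[0])
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(m)] for i in range(m)]
    b = [sum(row[i] * tn for row, tn in zip(Phi, t)) for i in range(m)]
    return solve(A, b)

def g(x, w, centres, sigma):
    """Discriminant g(x) = sum_i w_i * phi_i(x)."""
    return sum(wi * phi(x, c, sigma) for wi, c in zip(w, centres))

# Hypothetical toy data: t = +1 near 0, t = -1 near -2 and +2.
xs = [-2.5, -2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0, 2.5]
ts = [-1, -1, -1, 1, 1, 1, -1, -1, -1]
centres = [-2.0, 0.0, 2.0]  # assumed, not computed by k-means
sigma = 1.0

Phi = [[phi(x, c, sigma) for c in centres] for x in xs]  # N x M design matrix
w = fit_weights(Phi, ts)

print("weights:", w)
for x_new in (-2.0, 0.0, 2.0):
    label = 1 if g(x_new, w, centres, sigma) > 0 else -1
    print("x =", x_new, "-> class", label)
```

Forming $Φ^TΦ$ explicitly mirrors the formula in the notes; for larger or ill-conditioned problems a dedicated least-squares routine is usually the safer choice.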


#### Q7 
A radial basis function network has the form $y(x) = ∑^3_{i=1}w_i exp (−\frac{(∥x − c_i∥)^2}{2}) $ \\
The centers are $c_1 = [−1, 3]^T, c_2 = [−3, 1]^T, c_3 = [−2, 2]^T$. \\
Calculate the network outputs for input $x$ equal to each center $c_1, c_2, c_3$ respectively, in the cases that the network weights are \\
$w = [1/3, 1/3, 1/3]^T,w = [2, 1, −1]^T,w = [1, 0, 0]^T.$


Approach: 

* Write out the full expression for $y(x)$ filling in the centroids and the first set of weights 

$y(x) = \frac{1}{3} exp (−\frac{(∥x − c_1∥)^2}{2}) + \frac{1}{3} exp (−\frac{(∥x − c_2∥)^2}{2}) + \frac{1}{3} exp (−\frac{(∥x − c_3∥)^2}{2}) $ 

* Plug in the value of $x = c_1$ and calculate. 

$y(x) = \frac{1}{3} exp (−\frac{(∥c_1 − c_1∥)^2}{2}) + \frac{1}{3} exp (−\frac{(∥c_1 − c_2∥)^2}{2}) + \frac{1}{3} exp (−\frac{(∥c_1 − c_3∥)^2}{2}) $ 

Where $∥c_1 − c_2∥^2 = (c_{11}-c_{21})^2 +(c_{12}-c_{22})^2  $

* Repeat for $x = c_2$ then $x = c_3$
* Repeat this for the other 2 sets of weights. 
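The approach above can be checked with a short script. This sketch evaluates $y(x)$ at each centre for each weight vector from the question; the function name is illustrative, and the values in the comment are rounded.

```python
import math

def rbf_output(x, w, centres):
    """y(x) = sum_i w_i * exp(-||x - c_i||^2 / 2)."""
    total = 0.0
    for wi, c in zip(w, centres):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, c))
        total += wi * math.exp(-sq_dist / 2.0)
    return total

centres = [[-1, 3], [-3, 1], [-2, 2]]
weight_sets = [[1 / 3, 1 / 3, 1 / 3], [2, 1, -1], [1, 0, 0]]

# outputs[k][n] is y(c_n) for weight set k.
outputs = [[rbf_output(c, w, centres) for c in centres] for w in weight_sets]
for w, row in zip(weight_sets, outputs):
    print("w =", w, "->", [round(y, 4) for y in row])
# Approximately:
#   [0.4621, 0.4621, 0.5786]
#   [1.6504, 0.6688, 0.1036]
#   [1.0, 0.0183, 0.3679]
```

Only three distinct basis values appear: $exp(0) = 1$ when $x$ is the centre itself, $exp(−1)$ for the pairs at squared distance 2 ($c_1, c_3$ and $c_2, c_3$), and $exp(−4)$ for $c_1, c_2$ at squared distance 8.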

## Resources

* NN [playground](http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.19909&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false): use this to train NN models and build an intuition of how they work. 

* Really nice explanation of [backprop](http://cs231n.github.io/optimization-2/). 

* [Tutorial](http://mccormickml.com/2013/08/15/radial-basis-function-network-rbfn-tutorial/) that explains a little more about RBFNs. 