# Feed-Forward Neural Network with 1 Hidden Layer
## The Model
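
For reference, the forward pass implemented further down corresponds to the equations below. The shapes are an assumption about notation, chosen to match the indices used in the gradient derivations: $N$ samples, $D$ input features, $M$ hidden units and $K$ classes, with $X \in \mathbb{R}^{N \times D}$, $W \in \mathbb{R}^{D \times M}$, $b \in \mathbb{R}^{M}$, $V \in \mathbb{R}^{M \times K}$ and $c \in \mathbb{R}^{K}$.

$$\alpha_{nm} = \sum_{d=1}^D x_{nd} W_{dm} + b_m, \qquad z_{nm} = \sigma(\alpha_{nm}) = \frac{1}{1+e^{-\alpha_{nm}}}$$

$$a_{nk} = \sum_{m=1}^M z_{nm} V_{mk} + c_k, \qquad y_{nk} = \frac{e^{a_{nk}}}{\sum_{k'=1}^K e^{a_{nk'}}}$$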
<img src="pics/ff_nn.png">

## The Data

## Implementation of the Forward-Pass

In [None]:
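import numpy as np

def forward(X, W, b, V, c):
    # Hidden layer: sigmoid activation, Z has one row per sample
    Z = 1 / (1 + np.exp(-X.dot(W) - b))
    # Output layer: activations before the softmax
    A = Z.dot(V) + c
    # Softmax; subtracting the row-wise max avoids overflow in np.exp
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    Y = expA / expA.sum(axis=1, keepdims=True)
    return Y, Z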
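
As a quick sanity check, the forward pass can be run on random inputs to confirm the output shapes and that each row of `Y` is a probability distribution. This is only a sketch: the sizes `N`, `D`, `M`, `K` and the random parameters are made up for illustration and are not part of the original notebook.

```python
import numpy as np

def forward(X, W, b, V, c):
    # Same computation as the forward pass above
    Z = 1 / (1 + np.exp(-X.dot(W) - b))
    A = Z.dot(V) + c
    expA = np.exp(A)
    Y = expA / expA.sum(axis=1, keepdims=True)
    return Y, Z

# Hypothetical sizes: N samples, D features, M hidden units, K classes
N, D, M, K = 5, 2, 3, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, M))
b = rng.standard_normal(M)
V = rng.standard_normal((M, K))
c = rng.standard_normal(K)

Y, Z = forward(X, W, b, V, c)
print(Y.shape, Z.shape)  # (5, 4) (5, 3)
print(Y.sum(axis=1))     # every row sums to 1 up to rounding
```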

## Parameter Optimisation
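Objective: the log-likelihood, which is maximised (this avoids carrying around the negative sign of the cross-entropy loss): $J=\sum_{n=1}^N\sum_{k=1}^K t_{nk}\log y_{nk}$

$n$ indexes the samples (rows of $X$) and $k$ indexes the classes. $t_{nk}$ is the target and $y_{nk}$ the predicted probability for sample $n$ and class $k$. In the derivations below, $J_{nk} = t_{nk}\log y_{nk}$ denotes a single term of this sum.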
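
The objective can be evaluated directly from the targets and the network output. The sketch below assumes one-hot encoded targets `T`; the helper name `log_likelihood` and the example numbers are hypothetical.

```python
import numpy as np

def log_likelihood(T, Y):
    # J = sum_n sum_k t_nk * log(y_nk); larger is better, the maximum is 0
    return np.sum(T * np.log(Y))

# Hypothetical example: 3 samples, 2 classes
T = np.array([[1, 0], [0, 1], [1, 0]])              # one-hot targets (assumption)
Y = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])  # predicted probabilities

J = log_likelihood(T, Y)
print(J)  # equals log(0.9) + log(0.8) + log(0.6)
```

Only the probability assigned to the correct class of each sample contributes, because all other $t_{nk}$ are zero.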

### Gradients with Respect to $V$ and $c$

$$\frac{\partial J}{\partial V_{mk}} = \sum_{n=1}^N \sum_{k'=1}^K \frac{\partial J_{nk'}}{\partial y_{nk'}} \frac{\partial y_{nk'}}{\partial a_{nk}} \frac{\partial a_{nk}}{\partial V_{mk}}$$

$$\frac{\partial J}{\partial c_k} = \sum_{n=1}^N \sum_{k'=1}^K \frac{\partial J_{nk'}}{\partial y_{nk'}} \frac{\partial y_{nk'}}{\partial a_{nk}} \frac{\partial a_{nk}}{\partial c_k}$$

Copying over $\frac{\partial J_{nk'}}{\partial y_{nk'}}$ and $\frac{\partial y_{nk'}}{\partial a_{nk}}$ from logistic regression and realising that
$$\frac{\partial a_{nk}}{\partial V_{mk}} = z_{nm}$$
and
$$\frac{\partial a_{nk}}{\partial c_k} = 1$$

leads to 

$$\frac{\partial J}{\partial V_{mk}} = \sum_{n=1}^N (t_{nk}-y_{nk})z_{nm}$$

$$\frac{\partial J}{\partial c_k} = \sum_{n=1}^N (t_{nk}-y_{nk})$$

in matrix form:

$$\nabla_VJ=Z^T(T-Y)$$
and the gradient with respect to $c$ directly in code:

```python
grad_c = np.sum(T-Y, axis=0)
```
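
One way to gain confidence in these expressions is to compare them against a central finite difference of $J$. The sketch below does this for one entry of $V$ and one entry of $c$; the data, sizes and the helper `log_likelihood` are hypothetical, and the targets are assumed to be one-hot.

```python
import numpy as np

def forward(X, W, b, V, c):
    Z = 1 / (1 + np.exp(-X.dot(W) - b))
    A = Z.dot(V) + c
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True), Z

def log_likelihood(T, Y):
    return np.sum(T * np.log(Y))

rng = np.random.default_rng(0)
N, D, M, K = 6, 2, 3, 4                      # hypothetical sizes
X = rng.standard_normal((N, D))
T = np.eye(K)[rng.integers(0, K, size=N)]    # one-hot targets (assumption)
W, b = rng.standard_normal((D, M)), rng.standard_normal(M)
V, c = rng.standard_normal((M, K)), rng.standard_normal(K)

# Analytic gradients from the formulas above
Y, Z = forward(X, W, b, V, c)
grad_V = Z.T.dot(T - Y)
grad_c = np.sum(T - Y, axis=0)

# Central finite differences for V[m, k] and c[k]
eps = 1e-6
m, k = 1, 2
V_plus, V_minus = V.copy(), V.copy()
V_plus[m, k] += eps
V_minus[m, k] -= eps
num_V = (log_likelihood(T, forward(X, W, b, V_plus, c)[0])
         - log_likelihood(T, forward(X, W, b, V_minus, c)[0])) / (2 * eps)

c_plus, c_minus = c.copy(), c.copy()
c_plus[k] += eps
c_minus[k] -= eps
num_c = (log_likelihood(T, forward(X, W, b, V, c_plus)[0])
         - log_likelihood(T, forward(X, W, b, V, c_minus)[0])) / (2 * eps)

# Both differences should be close to zero
print(abs(num_V - grad_V[m, k]), abs(num_c - grad_c[k]))
```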

### Gradients with Respect to $W$ and $b$

$$\frac{\partial J}{\partial W_{dm}} = \sum_{k=1}^K \sum_{n=1}^N \sum_{k'=1}^K \frac{\partial J_{nk'}}{\partial y_{nk'}} \frac{\partial y_{nk'}}{\partial a_{nk}} \frac{\partial a_{nk}}{\partial z_{nm}}\frac{\partial z_{nm}}{\partial \alpha_{nm}} \frac{\partial \alpha_{nm}}{\partial W_{dm}}$$

Here $\alpha_{nm}=\sum_{d} x_{nd}W_{dm}+b_m$ is the input to hidden unit $m$ before the activation function is applied.

Copying over $\frac{\partial J_{nk'}}{\partial y_{nk'}}$, $\frac{\partial y_{nk'}}{\partial a_{nk}}$ from logistic regression and realising that
$$\frac{\partial a_{nk}}{\partial z_{nm}} = V_{mk}$$
and 
$$\frac{\partial \alpha_{nm}}{\partial W_{dm}}=x_{nd}$$

$\frac{\partial z_{nm}}{\partial \alpha_{nm}}$ depends on the choice of activation function in the hidden layer:

$$\text{for sigmoid} \; \; \rightarrow \; \; \frac{\partial z_{nm}}{\partial \alpha_{nm}}=z_{nm}(1-z_{nm})$$
$$\text{for tanh} \; \; \rightarrow \; \; \frac{\partial z_{nm}}{\partial \alpha_{nm}}=1-z_{nm}^2$$
$$\text{for relu} \; \; \rightarrow \; \; \frac{\partial z_{nm}}{\partial \alpha_{nm}}=\text{step}(z_{nm})$$
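
All three derivatives are written in terms of the hidden activation $z$ itself, so they can be computed from the stored output of the forward pass. A minimal sketch, with hypothetical helper names `dsigmoid`, `dtanh` and `drelu`:

```python
import numpy as np

def dsigmoid(Z):
    # z * (1 - z), where Z is the sigmoid output
    return Z * (1 - Z)

def dtanh(Z):
    # 1 - z^2, where Z is the tanh output
    return 1 - Z ** 2

def drelu(Z):
    # Step function: 1 where the relu output is positive, 0 elsewhere
    return (Z > 0).astype(float)

alpha = np.array([-2.0, 0.5, 3.0])   # example hidden-layer inputs
Z_sig = 1 / (1 + np.exp(-alpha))
Z_tanh = np.tanh(alpha)
Z_relu = np.maximum(alpha, 0)

print(dsigmoid(Z_sig))
print(dtanh(Z_tanh))
print(drelu(Z_relu))  # [0. 1. 1.]
```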

With the sigmoid activation:
$$\frac{\partial J}{\partial W_{dm}} = \sum_{k=1}^K \sum_{n=1}^N (t_{nk}-y_{nk})V_{mk}z_{nm}(1-z_{nm}) x_{nd}$$

With respect to the bias term:
$$\frac{\partial J}{\partial b_{m}} = \sum_{k=1}^K \sum_{n=1}^N (t_{nk}-y_{nk})V_{mk}z_{nm}(1-z_{nm})$$

In matrix form:
$$\nabla_WJ=X^T \{[(T-Y)V^T] \odot Z \odot (1-Z)\}$$
and the gradient with respect to $b$ directly in code:

```python
grad_b = np.sum((T-Y).dot(V.T) * Z * (1-Z), axis=0)
```
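
Putting the four gradients together, the parameters can be fitted by gradient ascent, since $J$ is a log-likelihood to be maximised. The sketch below is one possible training loop, not the notebook's own: the synthetic data (three Gaussian clouds), the helper `gradients`, the learning rate and the number of epochs are all assumptions made for illustration.

```python
import numpy as np

def forward(X, W, b, V, c):
    Z = 1 / (1 + np.exp(-X.dot(W) - b))
    A = Z.dot(V) + c
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True), Z

def gradients(X, T, Y, Z, V):
    # Gradients of the log-likelihood, using the matrix forms derived above
    delta_out = T - Y                                # shape (N, K)
    delta_hidden = delta_out.dot(V.T) * Z * (1 - Z)  # shape (N, M)
    grad_W = X.T.dot(delta_hidden)
    grad_b = delta_hidden.sum(axis=0)
    grad_V = Z.T.dot(delta_out)
    grad_c = delta_out.sum(axis=0)
    return grad_W, grad_b, grad_V, grad_c

# Hypothetical data: three Gaussian clouds in 2D, one per class
rng = np.random.default_rng(0)
N, D, M, K = 150, 2, 4, 3
centers = np.array([[0.0, -2.0], [2.0, 2.0], [-2.0, 2.0]])
labels = rng.integers(0, K, size=N)
X = rng.standard_normal((N, D)) + centers[labels]
T = np.eye(K)[labels]                                # one-hot targets

W, b = rng.standard_normal((D, M)), np.zeros(M)
V, c = rng.standard_normal((M, K)), np.zeros(K)

learning_rate = 1e-3   # assumed value, not tuned
history = []
for epoch in range(500):
    Y, Z = forward(X, W, b, V, c)
    history.append(np.sum(T * np.log(Y)))            # log-likelihood J
    grad_W, grad_b, grad_V, grad_c = gradients(X, T, Y, Z, V)
    # Gradient ascent: step in the direction of the gradient
    W += learning_rate * grad_W
    b += learning_rate * grad_b
    V += learning_rate * grad_V
    c += learning_rate * grad_c

print(history[0], history[-1])  # the log-likelihood should have increased
```

All four gradients are computed before any parameter is updated, because the gradients for $W$ and $b$ depend on the current value of $V$.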


In [1]:
def classification_rate(Y, P):
    # Fraction of predictions P that match the true labels Y:
    # number correct / number total
    n_total = len(Y)
    n_correct = 0
    for i in range(n_total):
        if Y[i] == P[i]:
            n_correct += 1
    return float(n_correct) / n_total
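
The function expects class labels rather than probabilities, so the softmax output is typically converted with `argmax` first. A small usage sketch with made-up probabilities and labels:

```python
import numpy as np

def classification_rate(Y, P):
    # Same function as above: number correct / number total
    n_total = len(Y)
    n_correct = 0
    for i in range(n_total):
        if Y[i] == P[i]:
            n_correct += 1
    return float(n_correct) / n_total

# Hypothetical softmax outputs for 4 samples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.5, 0.4, 0.1]])
labels = np.array([0, 1, 2, 1])          # true class indices (made up)
predictions = np.argmax(probs, axis=1)   # most probable class per sample

print(classification_rate(labels, predictions))  # 0.75
```

For NumPy arrays, `np.mean(labels == predictions)` gives the same value without an explicit loop.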