# Deep Feedforward Fully Connected Networks


## Neural network representation

![Representation](images/representation.PNG)

## Neural network representation - vector notation

![Representation](images/representation2.PNG)

Let

$\mathbf X =\begin{bmatrix} | & | &   & | \\ 
                \mathbf x^{(1)} & \mathbf x^{(2)} & \cdots & \mathbf x^{(m)} \\
                          | & | &   & | \end{bmatrix},\qquad$
$\mathbf x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_{n^{[0]}}^{(i)} \end{bmatrix},\qquad$
$\mathbf X\in\mathbb R^{n^{[0]} \times m},\qquad$
$\mathbf x^{(i)}\in\mathbb R^{n^{[0]}}$

$\mathbf A^{[l]}=\begin{bmatrix} | & | &   & | \\ 
                \mathbf a^{[l](1)} & \mathbf a^{[l](2)} & \cdots & \mathbf a^{[l](m)} \\
                          | & | &   & | \end{bmatrix},\qquad$
$\mathbf a^{[l](i)} = \begin{bmatrix} a_1^{[l](i)} \\ a_2^{[l](i)} \\ \vdots \\ a_{n^{[l]}}^{[l](i)} \end{bmatrix},\qquad$
$\mathbf A^{[l]}\in\mathbb R^{n^{[l]} \times m},\qquad$
$\mathbf a^{[l](i)}\in\mathbb R^{n^{[l]}}$
                          
$\mathbf W^{[l]}=\begin{bmatrix} -\mathbf w_{1}^{[l]}- \\ 
                                 -\mathbf w_{2}^{[l]}- \\ 
                                 \vdots \\ 
                                 -\mathbf w_{n^{[l]}}^{[l]}- \\ 
                 \end{bmatrix},\qquad$
$\mathbf w_{i}^{[l]}=[w_{i1}^{[l]},w_{i2}^{[l]}, \cdots, w_{i\,n^{[l-1]}}^{[l]}],\qquad$
$\mathbf b^{[l]}=\begin{bmatrix} b_{1}^{[l]} \\ 
                                 b_{2}^{[l]} \\ 
                                 \vdots \\ 
                                 b_{n^{[l]}}^{[l]} \\ 
                 \end{bmatrix},\qquad$ 
$\mathbf W^{[l]}\in\mathbb R^{n^{[l]} \times n^{[l-1]}},\qquad$
$\mathbf b^{[l]}\in\mathbb R^{n^{[l]}}$
                          
then               

$\mathbf Z^{[1]}=\mathbf W^{[1]}\mathbf X + \mathbf b^{[1]},\qquad$
$\mathbf A^{[1]}=\sigma(\mathbf Z^{[1]})$

$\mathbf Z^{[2]}=\mathbf W^{[2]}\mathbf A^{[1]} + \mathbf b^{[2]},\qquad$
$\mathbf A^{[2]}=\sigma(\mathbf Z^{[2]})$

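In general, for layer $l$, with $\mathbf A^{[0]}=\mathbf X$:

$\mathbf Z^{[l]}=\mathbf W^{[l]}\mathbf A^{[l-1]} + \mathbf b^{[l]},\qquad$
$\mathbf A^{[l]}=\sigma(\mathbf Z^{[l]})$

Here $\mathbf b^{[l]}$ is added to every column of $\mathbf W^{[l]}\mathbf A^{[l-1]}$ (one column per example), and $\sigma$ is applied elementwise.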
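The layer-by-layer recurrence above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names, the layer sizes `[3, 4, 2]`, the batch size `m = 5`, and the random initialization are all assumptions made for the example. The bias is stored with shape `(n, 1)` so that NumPy broadcasting adds it to every column.

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic function.
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Run the forward pass for all m examples at once.

    X      : array of shape (n0, m), one example per column.
    params : list of (W, b) pairs, W of shape (n_l, n_{l-1}),
             b of shape (n_l, 1).
    """
    A = X  # A^[0] = X
    for W, b in params:
        Z = W @ A + b   # b broadcasts across the m columns
        A = sigmoid(Z)  # activation applied elementwise
    return A

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2]  # n^[0], n^[1], n^[2]
m = 5

X = rng.normal(size=(layer_sizes[0], m))
params = [
    (0.1 * rng.normal(size=(n_out, n_in)), np.zeros((n_out, 1)))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

A_out = forward(X, params)
print(A_out.shape)  # (2, 5)
```

The output has one column per example and one row per unit in the last layer, matching $\mathbf A^{[l]}\in\mathbb R^{n^{[l]} \times m}$.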



## Activation functions


![Activation functions](images/activations.PNG)

$\qquad\qquad a=\dfrac{1}{1+e^{-z}}\qquad\qquad\qquad\qquad\qquad$
$a=\dfrac{e^z-e^{-z}}{e^z+e^{-z}}\qquad\qquad\qquad\qquad$
$a=\max(0,z)\qquad\qquad\qquad\qquad$
$a=\max(\alpha z,z)$
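The four formulas above (in order: sigmoid, tanh, ReLU, leaky ReLU) translate directly into NumPy. The sketch below is illustrative only; the function names and the default `alpha=0.01` are assumptions, since the notes do not fix a value for $\alpha$.

```python
import numpy as np

def sigmoid(z):
    # a = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # a = (e^z - e^{-z}) / (e^z + e^{-z})
    return np.tanh(z)

def relu(z):
    # a = max(0, z), applied elementwise
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # a = max(alpha * z, z); alpha is assumed to satisfy 0 < alpha < 1
    return np.maximum(alpha * z, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))
print(leaky_relu(z))
```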

## Derivatives of activation functions

__Sigmoid__

$g(z)=\dfrac{1}{1+e^{-z}} \quad \Longrightarrow \quad \dfrac{d}{dz}g(z)=g(z)(1-g(z))$

__Tanh__

$g(z)=\dfrac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \quad \Longrightarrow \quad \dfrac{d}{dz}g(z)=1-(g(z))^2$

__ReLU__

$g(z)=\max(0,z) \quad \Longrightarrow \quad \dfrac{d}{dz}g(z)=\begin{cases} 0 & \text{if } z \lt 0 \\ 1 & \text{if } z \geq 0 \end{cases}$

Technically, $\dfrac{d}{dz}g(z)$ is not defined at $z=0$, but in practice this can be ignored: the derivative is simply assigned a value at that point (here 1).
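One way to sanity-check these derivative formulas is to compare them against a central finite difference. The sketch below does this in NumPy; the helper names, the step size `eps`, and the test points are assumptions chosen for illustration. The test points deliberately avoid $z=0$, where ReLU has no derivative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def d_sigmoid(z):
    g = sigmoid(z)
    return g * (1.0 - g)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # Convention from the notes: derivative is 1 for z >= 0, else 0.
    return (z >= 0).astype(float)

def numeric_derivative(f, z, eps=1e-6):
    # Central difference approximation of f'(z).
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z = np.array([-1.5, -0.3, 0.7, 2.0])  # no point at the ReLU kink

for name, f, df in [
    ("sigmoid", sigmoid, d_sigmoid),
    ("tanh", np.tanh, d_tanh),
    ("relu", relu, d_relu),
]:
    max_err = np.max(np.abs(df(z) - numeric_derivative(f, z)))
    print(name, "max abs error:", max_err)
```

The reported errors should be tiny compared with the derivative values themselves, which indicates that the closed-form expressions agree with the numerical slope.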


## Why deep representation

Results from circuit theory show that there are functions that a "narrow" (relatively few hidden units per layer) but deep (many layers) network can compute, whereas a shallower network requires exponentially more hidden units to compute the same function.