# Deep Feedforward Networks 

Also called **multilayer perceptrons** (MLPs). The goal of a feedforward network is to approximate some function $f^*$. To do that, the network defines a mapping $y = f(x;\theta)$ and learns the parameters $\theta$ that result in the best function approximation. It is called **feedforward** because information flows through the function from the input $x$ to the output $y$, with no feedback connections.

The model is associated with a directed acyclic graph that describes how the functions are composed together. For example, we can form the chain $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$. In this case, $f^{(i)}$ is called the $i$-th layer. The overall length of the chain gives the **depth** of the model. During training, we drive $f(x)$ to match $f^{*}(x)$. Because the training data does not show the desired output for the internal layers, they are called **hidden** layers.
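The chain structure can be sketched in a few lines of Python. This is only an illustration: the three layer functions `f1`, `f2`, `f3` and the helper `feedforward` are hypothetical names, and the layers are arbitrary toy functions rather than learned ones.

```python
import numpy as np

# Three toy "layers" (hypothetical, chosen only for illustration).
def f1(x):
    return 2.0 * x

def f2(x):
    return x + 1.0

def f3(x):
    return x ** 2

layers = [f1, f2, f3]

def feedforward(x, layers):
    """Apply the layers in order: f3(f2(f1(x)))."""
    for layer in layers:
        x = layer(x)
    return x

depth = len(layers)  # the length of the chain is the depth
y = feedforward(np.array([1.0, 2.0]), layers)
print(depth, y)
```

Information only moves forward through the list of layers, which is the sense in which the network is feedforward.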

The dimensionality of these hidden layers determines the **width** of the model. Each element of a hidden layer's vector plays a role analogous to a neuron.

Linear models are limited to representing linear functions. To extend them to represent nonlinear functions of $x$, we can apply the linear model not to $x$ itself but to a transformed input $\phi(x)$, where $\phi$ is a nonlinear transformation. The strategy of deep learning is to learn $\phi$. In this approach, we have a model

$$
y = f(x;\theta, w) = \phi(x;\theta)^Tw
$$

We parametrize the representation $\phi$ and find the $\theta$ that corresponds to a good representation, while $w$ maps $\phi(x)$ to the desired output.
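As a rough sketch of this model, the code below evaluates $\phi(x;\theta)^Tw$ with NumPy. The specific form of `phi` (an affine map followed by `tanh`), the layer sizes, and the random initial values are all assumptions made for illustration; the text does not fix any of them, and no training is performed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, theta):
    """Parametrized feature map (assumed form: tanh of an affine map)."""
    W, c = theta
    return np.tanh(W.T @ x + c)

def f(x, theta, w):
    """Linear model applied to the learned features: phi(x; theta)^T w."""
    return phi(x, theta) @ w

n_in, n_hidden = 3, 4  # illustrative sizes
theta = (rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden))
w = rng.normal(size=n_hidden)

x = rng.normal(size=n_in)
y = f(x, theta, w)  # a scalar prediction
print(y)
```

In training, both `theta` and `w` would be adjusted by an optimizer; here they are only initialized.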

## Simple example

![model1](images/model-xor.png)

To compute the hidden layer $h$, we must use a nonlinear function. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function, that is

$$
h = g(W^Tx + c)
$$

In modern neural networks, the default recommendation is to use the **rectified linear unit**, or ReLU, defined by $g(z) = \max\{0, z\}$.
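The following sketch puts the hidden layer and ReLU together. It assumes, based only on the figure's filename, that the example is the XOR problem, and it assumes a linear output layer $y = w^Th + b$. The weights are hand-picked values known to solve XOR exactly, not learned ones, and the names `relu` and `xor_net` are hypothetical.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: g(z) = max{0, z}, applied elementwise."""
    return np.maximum(0.0, z)

# Hand-picked parameters (an assumption for illustration, not learned).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

def xor_net(X):
    """Each row of X is one input x; X @ W is the batched form of W^T x."""
    h = relu(X @ W + c)   # hidden layer: h = g(W^T x + c)
    return h @ w + b      # assumed linear output layer

X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
y = xor_net(X)
print(y)  # [0. 1. 1. 0.]
```

Without the ReLU, the two affine maps would collapse into a single linear function of $x$, which cannot represent XOR.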