# Chap 6: Deep Feedforward Networks

## Introduction
- Names: deep feedforward networks / feedforward neural networks / multilayer perceptrons (MLPs)
- Goal: approximate some function $f^*$
- Why "feedforward", "neural", "network" ?
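- It is composed of many different functions chained together, e.g. $f(\mathbf{x}) = f^{(3)}(f^{(2)}(f^{(1)}(\mathbf{x})))$ -> each function is a layer -> each layer consists of units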
- Depth of the model = the length of the chain function
-> Hidden layers -> the dimensionality of these hidden layers determines the width of the model.
- Significance: basis for convolutional networks, and a conceptual stepping stone to recurrent networks

### Extending linear models
To extend linear models to represent nonlinear functions of $\mathbf{x}$ -> apply the linear model to a transformed input $\phi(\mathbf{x})$, where $\phi$ is a nonlinear transformation.

-> The strategy of deep learning is to learn $\phi$ from a broad class of functions $\phi(\mathbf{x}; \theta)$ parameterized by $\theta$.
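
To make "apply the linear model to a transformed input" concrete, here is a minimal sketch with a hand-designed $\phi$ instead of a learned one. The feature map `phi` (appending the product $x_1 x_2$) and the weights are hypothetical choices for illustration only; they let a plain linear model reproduce XOR, the example of section 6.1.

```python
def phi(x):
    """Hand-designed nonlinear feature map (hypothetical): append x1 * x2."""
    x1, x2 = x
    return [x1, x2, x1 * x2]


def linear_model(features, w, b):
    """Plain linear model: w^T features + b."""
    return sum(wi * fi for wi, fi in zip(w, features)) + b


# Hand-picked weights for this illustration (not learned).
w, b = [1.0, 1.0, -2.0], 0.0
inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
outputs = [linear_model(phi(x), w, b) for x in inputs]
print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```

Deep learning replaces the fixed `phi` above with a parameterized one and learns its parameters together with `w` and `b`.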

## 6.1 Example: Learning XOR
- A linear model can't learn the XOR function -> use a deep feedforward network with one hidden layer
-> this introduces the ReLU activation function

- Complete model: $f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T \max\{0, \mathbf{W}^T \mathbf{x} + \mathbf{c}\} + b$

-> a gradient-based optimization algorithm can find parameters that produce very little error.
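
A minimal sketch of the complete model evaluated on the four XOR inputs. The parameter values are one hand-picked solution and are an assumption of this sketch (they are not stated in these notes); a gradient-based optimizer would typically find different values.

```python
def relu(v):
    """Element-wise rectified linear unit: max{0, v_i}."""
    return [max(0.0, vi) for vi in v]


def xor_net(x, W, c, w, b):
    """f(x) = w^T max{0, W^T x + c} + b for a single input vector x."""
    # Hidden pre-activation: (W^T x + c)_j = sum_i W[i][j] * x[i] + c[j]
    z = [sum(W[i][j] * x[i] for i in range(len(x))) + c[j]
         for j in range(len(c))]
    h = relu(z)
    return sum(wj * hj for wj, hj in zip(w, h)) + b


# Hand-picked parameters (assumed for illustration, not learned here).
W = [[1.0, 1.0], [1.0, 1.0]]
c = [0.0, -1.0]
w = [1.0, -2.0]
b = 0.0

inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
outputs = [xor_net(x, W, c, w, b) for x in inputs]
print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```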

## 6.2 Gradient-Based Learning
- Designing and training a neural network needs:
    - an optimization procedure
    - a cost function
    - a model family
    
- The nonlinearity of a neural network causes loss functions to become **non-convex**.
![][convex_cost_function]
-> use **iterative, gradient-based optimizers** rather than the linear equation solvers used for linear regression or the convex optimization algorithms used for logistic regression and SVMs

### Cost functions
- We use the cross-entropy between the training data and the model's predictions as the cost function.

#### Learning Conditional Distributions with Maximum Likelihood
- Most modern neural networks are trained using **maximum likelihood** -> the cost function is simply the **negative log-likelihood**.
- $J(\theta) = -\mathbb{E}_{\mathbf{x},\mathbf{y} \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\mathbf{y} \mid \mathbf{x})$
- The specific form of the cost function changes from model to model, depending on the form of $\log p_{\text{model}}$
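- Advantages:
    - removes the burden of designing a cost function for each model.
    - helps avoid saturation problems: the $\log$ undoes the $\exp$ in some output units, so the gradient stays large enough to learn from.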
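
A minimal sketch of the negative log-likelihood cost for a classifier, averaged over a small batch (the empirical version of the expectation above). The function name, the probability table `probs`, and the labels `targets` are hypothetical values chosen for illustration.

```python
import math


def negative_log_likelihood(probs, targets):
    """Average of -log p_model(y | x) over a batch.

    probs[n][k] is the model's probability of class k for example n;
    targets[n] is the index of the true class of example n.
    """
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


# Toy model outputs for three examples and two classes (made-up numbers).
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
targets = [0, 1, 1]
loss = negative_log_likelihood(probs, targets)
print(loss)
```

The loss is small when the model puts high probability on the true class and grows without bound as that probability approaches zero.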

#### Learning Conditional Statistics
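- If we use a sufficiently powerful NN,... -> think of learning as choosing a function rather than choosing parameters -> to solve an optimization problem with respect to a function, use **calculus of variations**
- It gives us the following two results:
    - minimizing mean squared error yields a function that predicts the **mean** of $\mathbf{y}$ for each value of $\mathbf{x}$.
    - minimizing mean absolute error yields a function that predicts the **median** of $\mathbf{y}$ for each value of $\mathbf{x}$.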

- Mean squared error and mean absolute error often lead to poor results with gradient-based optimization (saturating output units produce very small gradients) -> the cross-entropy cost function is more popular.
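
The mean/median result can be checked numerically on toy data. This is only a sketch: the targets `ys` (values of y observed for one fixed x, including an outlier) and the grid of candidate predictions are made up for illustration.

```python
def mse(c, ys):
    """Mean squared error of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)


def mae(c, ys):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(y - c) for y in ys) / len(ys)


ys = [1.0, 2.0, 3.0, 4.0, 100.0]               # mean 22.0, median 3.0
candidates = [i / 10 for i in range(0, 1001)]  # 0.0, 0.1, ..., 100.0

best_mse = min(candidates, key=lambda c: mse(c, ys))
best_mae = min(candidates, key=lambda c: mae(c, ys))
print(best_mse, best_mae)  # 22.0 3.0
```

The squared-error minimizer is pulled toward the outlier (the mean), while the absolute-error minimizer stays at the median.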

### Output Units
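- Any kind of unit that can be used as an output unit can also be used as a hidden unit.
- Role: provide an additional transformation from the hidden features $\mathbf{h}$ to complete the task.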

#### Linear Units for Gaussian Output Distributions
- In: $\mathbf{h}$ -> Out: $\hat{y} = \mathbf{W}^T \mathbf{h} + \mathbf{b}$.
- Often used to produce the mean of a conditional Gaussian distribution.
- Linear units do not saturate -> they pose little difficulty for gradient-based optimization algorithms.
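
Why a linear output pairs naturally with squared error: assuming $p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y}; \hat{\mathbf{y}}, \mathbf{I})$ (the identity covariance is an assumption of this sketch) and writing $n$ for the dimension of $\mathbf{y}$, the negative log-likelihood is

$-\log p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{2} \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert^2 + \frac{n}{2} \log(2\pi)$

-> the second term does not depend on the parameters, so maximizing the log-likelihood is equivalent to minimizing the mean squared error.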

#### Sigmoid Units for Bernoulli Output Distributions
Used for tasks that require predicting the value of a binary variable $y$.

Out: $\hat{y} = \sigma(\mathbf{w}^T \mathbf{h} + b)$
- first, uses a linear layer to compute $z = \mathbf{w}^T \mathbf{h} + b$
- second, uses sigmoid to convert z into probability.
![][sigmoid-unit]
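
A minimal sketch of the two steps plus the matching maximum-likelihood loss. Writing the loss as $\text{softplus}((1 - 2y)z)$ works directly on $z$, so it does not saturate when the model is confidently wrong. The function names are this sketch's own, not from the notes.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(a):
    """log(1 + exp(a)), computed without overflow for large |a|."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def bernoulli_nll(z, y):
    """-log P(y | x) for y in {0, 1}, given the linear-layer output z."""
    return softplus((1 - 2 * y) * z)


z = 2.0
print(sigmoid(z))           # P(y = 1 | x)
print(bernoulli_nll(z, 1))  # small: confident and correct
print(bernoulli_nll(z, 0))  # larger: confident and wrong
```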

#### Softmax Units for Multinoulli Output Distributions
- Generalization of the sigmoid function -> probability dist over a discrete variable with $n$ possible values.

$\mathbf{z} = \mathbf{W}^T \mathbf{h} + \mathbf{b}$

$\text{softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
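
A minimal sketch of softmax. Subtracting $\max_j z_j$ before exponentiating does not change the result (softmax is invariant to adding a constant to every input) but avoids overflow; this stabilization is a common implementation choice and is not stated in the notes above.

```python
import math


def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    m = max(z)                             # shift so the largest input is 0
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


p = softmax([1.0, 2.0, 3.0])
q = softmax([1001.0, 1002.0, 1003.0])  # same result, no overflow
print(p)
print(q)
```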

## 6.3 Hidden Units
- Good default choice: **ReLU**
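- Many other hidden unit types exist, but choosing the best one usually takes trial and error.
- A function is differentiable at $z$ only if its left derivative and right derivative are both defined and equal. ReLU fails this at $z = 0$, but in practice this is not a problem for gradient-based learning.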

### ReLU and their Generalizations
$g(z) = \max\{0, z\}$

$\mathbf{h} = g(\mathbf{W}^T \mathbf{x} + \mathbf{b})$

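All elements of $\mathbf{b}$ should be initialized to a small positive value, such as $0.1$, so that the units start out active for most inputs.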
- Most generalizations:
    - perform comparably to ReLU, and occasionally better.
    - use a non-zero slope for $z < 0$, so the unit receives gradient everywhere.
- Ex: leaky ReLU, parametric ReLU (PReLU).
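
A minimal sketch of ReLU and leaky ReLU, using the form $\max(0, z) + \alpha \min(0, z)$. The default `alpha=0.01` is an assumed typical value; a parametric ReLU would treat `alpha` as a learned parameter instead of a constant.

```python
def relu(z):
    """g(z) = max{0, z}."""
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    """max(0, z) + alpha * min(0, z), with a small fixed slope alpha for z < 0."""
    return max(0.0, z) + alpha * min(0.0, z)


def relu_grad(z):
    """Derivative of ReLU (taking the value 0 at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, alpha=0.01):
    """Derivative of leaky ReLU: never exactly zero."""
    return 1.0 if z > 0 else alpha


for z in (-2.0, 3.0):
    print(z, relu(z), leaky_relu(z), relu_grad(z), leaky_relu_grad(z))
```

For negative inputs ReLU passes back no gradient at all, while the leaky variant still passes back `alpha`.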

#### Maxout units
- Divide $\mathbf{z}$ into groups of $k$ values; each maxout unit outputs the maximum element of one group.
![][maxout-unit]
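
A minimal sketch of the grouping step, assuming consecutive groups of `k` values (the grouping order and the function name are assumptions of this sketch). In a real maxout layer, `z` would be the output of a learned affine transformation.

```python
def maxout(z, k):
    """Split z into consecutive groups of k values and keep each group's maximum."""
    if len(z) % k != 0:
        raise ValueError("len(z) must be a multiple of k")
    return [max(z[i:i + k]) for i in range(0, len(z), k)]


h = maxout([0.5, -1.0, 2.0, 3.0, -4.0, 1.5], k=3)
print(h)  # [2.0, 3.0]
```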

### Logistic Sigmoid and Hyperbolic Tangent
- Logistic sigmoid: $g(z) = \sigma(z)$
- Hyperbolic tangent: $g(z) = \tanh(z)$
- Their use as hidden units in feedforward networks is now discouraged.
- Sigmoidal activation functions are common in settings other than feedforward networks.
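
Two facts behind these bullets can be checked with a short sketch: the two functions are closely related, $\tanh(z) = 2\sigma(2z) - 1$, and the sigmoid saturates, so its gradient becomes tiny for large $|z|$. The test points are arbitrary.

```python
import math


def sigmoid(z):
    """Logistic sigmoid (fine for the moderate z used here)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


zs = [-3.0, -0.5, 0.0, 0.5, 3.0]
max_gap = max(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) for z in zs)
print(max_gap)             # close to 0: the identity holds numerically
print(sigmoid_grad(0.0))   # 0.25, the largest possible gradient
print(sigmoid_grad(10.0))  # tiny: the unit has saturated
```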

## 6.4 Architecture Design
Architecture: 
- how many units it should have 
- how these units should be connected to each other

![][nn]
Each layer is a function of the layer that preceded it.

- main architectural considerations: to choose the **depth** of the network and the **width** of each layer

Deeper networks -> can often use far fewer units per layer and far fewer parameters, but tend to be harder to optimize.
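
To get a feel for the depth/width trade-off in parameter count, here is a minimal sketch that counts weights and biases of a fully connected network. The layer sizes are arbitrary illustrations, and the comparison says nothing about whether the two networks can represent the same functions.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes = [n_inputs, hidden_1, ..., n_outputs].
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


wide = count_parameters([100, 1000, 10])           # one hidden layer of 1000 units
deep = count_parameters([100, 100, 100, 100, 10])  # three hidden layers of 100 units
print(wide, deep)  # 111010 31310
```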

### Universal Approximation Properties and Depth
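**Universal approximation theorem** -> a feedforward network with a linear output layer and at least one hidden layer with a "squashing" activation function (e.g. the logistic sigmoid) can approximate any Borel measurable function to any desired non-zero error, given enough hidden units. Any continuous function on a closed and bounded subset of $\mathbb{R}^n$ is Borel measurable -> it can be approximated by a neural network (the network is able to represent the function, but training may not be able to learn it).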

-> Deeper models can often reduce both the number of units required and the amount of generalization error.

### Other Architectural Considerations
- how to connect a pair of layers to each other.
Ex: convolutional networks for computer vision, recurrent neural networks for sequence processing.

These strategies for reducing the number of connections: 
- reduce the number of parameters and the amount of computation required to evaluate the network 
- often highly problem-dependent

## 6.5 Back-Propagation and Other Differentiation Algorithms
**Back-propagation** algorithm: allows the information from the cost to flow backwards through the network -> compute the gradient.

### Computational Graphs
-> To describe the back-propagation algorithm more precisely.

An **operation** is a simple function of one or more variables. 

Our graph language is accompanied by a set of allowable operations.

![][computational-graph]
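
A minimal sketch of a computational graph for $\hat{y} = \sigma(\mathbf{x}^T \mathbf{w} + b)$: each node is a variable, and each non-leaf node is produced by one allowable operation applied to its inputs. The node names, the dictionary layout, and the three operations are assumptions of this sketch.

```python
import math

# Non-leaf nodes: name -> (operation name, names of input nodes).
graph = {
    "u1": ("dot", ["x", "w"]),
    "u2": ("add", ["u1", "b"]),
    "y_hat": ("sigmoid", ["u2"]),
}

# The set of allowable operations for this graph language.
operations = {
    "dot": lambda a, b: sum(ai * bi for ai, bi in zip(a, b)),
    "add": lambda a, b: a + b,
    "sigmoid": lambda a: 1.0 / (1.0 + math.exp(-a)),
}


def evaluate(graph, leaves, order):
    """Forward pass: compute every node in a valid topological order."""
    values = dict(leaves)
    for name in order:
        op, inputs = graph[name]
        values[name] = operations[op](*(values[i] for i in inputs))
    return values


leaves = {"x": [1.0, 2.0], "w": [0.5, -0.25], "b": 0.0}
values = evaluate(graph, leaves, order=["u1", "u2", "y_hat"])
print(values["y_hat"])  # 0.5
```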

### Chain Rule of Calculus
![][chain-rule]
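
For scalars with $y = g(x)$ and $z = f(y)$, the chain rule states $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$. A minimal sketch that applies it to two hypothetical functions and compares the result with a finite-difference estimate:

```python
def g(x):
    """Inner function y = g(x) (arbitrary choice for illustration)."""
    return 3.0 * x + 1.0


def f(y):
    """Outer function z = f(y) (arbitrary choice for illustration)."""
    return y ** 2


def dz_dx(x):
    """Chain rule: dz/dx = (dz/dy) * (dy/dx)."""
    y = g(x)
    dz_dy = 2.0 * y  # derivative of y^2 with respect to y
    dy_dx = 3.0      # derivative of 3x + 1 with respect to x
    return dz_dy * dy_dx


def numeric_dz_dx(x, eps=1e-6):
    """Central finite-difference estimate of dz/dx, used as a sanity check."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)


print(dz_dx(2.0))          # 42.0
print(numeric_dz_dx(2.0))  # approximately 42
```

Back-propagation applies this same rule repeatedly, reusing each intermediate derivative such as `dz_dy` instead of recomputing it.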

[convex_cost_function]: convex_cost_function.jpg
[sigmoid-unit]: sigmoid-unit.png
[maxout-unit]: maxout.jpg
[nn]: nn.png
[computational-graph]: computational-graph.png
[chain-rule]: chain-rule.jpg