# Chapter 6 Deep Feedforward Networks

## 6.1 Example: Learning XOR

## 6.2 Gradient-Based Learning

### 6.2.1 Cost Function

- We can view the cost function as being a **functional** (a mapping from functions to real numbers) rather than just a function. We can thus think of learning as choosing a function rather than merely choosing a set of parameters.

- The **calculus of variations** can be used to derive:
  - If we could train on infinitely many samples from the true data-generating distribution, minimizing MSE would give a function that predicts the *mean* of $y$ for each value of $x$.
  - Minimizing the **mean absolute error** cost function gives a function that predicts the *median* value of $y$ for each value of $x$.

- MSE and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than MSE or mean absolute error.
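
The mean and median claims above can be checked on a tiny sample. The following is a minimal sketch, not a derivation: it assumes a single fixed input $x$, so the predictor reduces to one constant `c`, and it searches a coarse grid of candidate constants for the one that minimizes each empirical cost. The sample values, the grid, and the helper names are all made up for illustration.

```python
import statistics

# Hypothetical targets y observed for one fixed input x (note the outlier).
ys = [1.0, 2.0, 2.0, 3.0, 10.0]

def mse(c, ys):
    """Mean squared error of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

def mae(c, ys):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(y - c) for y in ys) / len(ys)

# Candidate constants 0.0, 0.1, ..., 12.0 (a coarse grid, for illustration only).
grid = [i / 10 for i in range(121)]

best_mse = min(grid, key=lambda c: mse(c, ys))
best_mae = min(grid, key=lambda c: mae(c, ys))

print(best_mse, statistics.mean(ys))    # minimizer of MSE vs. sample mean
print(best_mae, statistics.median(ys))  # minimizer of MAE vs. sample median
```

The outlier pulls the MSE minimizer (the mean) upward, while the MAE minimizer (the median) stays with the bulk of the data.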

### 6.2.2 Output Units

- The choice of cost function is tightly coupled with the choice of output unit.

#### 6.2.2.1 Linear Units for Gaussian Output Distributions

- Maximizing the log-likelihood is equivalent to minimizing the MSE.
- The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too.
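
To make the equivalence concrete, here is a short derivation under the assumption (not stated above) that the conditional distribution is $p(y \mid x) = \mathcal{N}(y; \hat{y}, I)$, where $\hat{y}$ denotes the output of the linear unit and $n$ the dimension of $y$:

$$
\begin{align}
-\log p(y \mid x) & = -\log \left[ (2\pi)^{-n/2} \exp\left(-\tfrac{1}{2} \lVert y - \hat{y} \rVert^2\right) \right] \\
                  & = \tfrac{1}{2} \lVert y - \hat{y} \rVert^2 + \tfrac{n}{2} \log(2\pi)
\end{align}
$$

The second term does not depend on the model parameters, so minimizing the negative log-likelihood over the training set selects the same parameters as minimizing the squared error.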

#### 6.2.2.2 Sigmoid Units for Bernoulli Output Distributions

- If we begin with the assumption that the unnormalized log probabilities $\log \tilde{P}(y)$ are linear in $y$ and $z$, we can exponentiate to obtain the unnormalized probabilities and then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of $z$.

$$
\begin{align}
\log \tilde{P}(y) & = yz \\
\tilde{P}(y) & = \exp(yz) \\
P(y) & = \frac{\exp(yz)}{\sum_{y' = 0}^1 \exp(y'z)} \\
P(y) & = \sigma((2y-1)z)
\end{align}
$$

- The loss function for maximum likelihood learning of a Bernoulli distribution parametrized by a sigmoid, where $\zeta$ is the softplus function $\zeta(a) = \log(1 + \exp(a))$, is:

$$
\begin{align}
J(\theta) & = -\log P(y|x)  \\
          & = -\log \sigma((2y-1)z) \\
          & = \zeta((1-2y)z)
\end{align}
$$

- Saturation thus occurs only when the model already has the right answer: when $y=1$ and $z$ is very positive, or $y=0$ and $z$ is very negative.

- The cost function used with maximum likelihood is $-\log P(y|x)$; the $\log$ in the cost function undoes the $\exp$ of the sigmoid.

- When we use other loss functions, such as MSE, the loss can saturate anytime $\sigma(z)$ saturates. So **maximum likelihood** is **almost always the preferred approach** to training sigmoid output units.
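
The saturation argument above can be illustrated numerically. This is a minimal sketch under stated assumptions: scalar $z$, a label $y \in \{0, 1\}$, and an MSE loss taken to be $\tfrac{1}{2}(\sigma(z) - y)^2$ for comparison. The function names are illustrative and do not come from any particular library.

```python
import math

def softplus(a):
    """Numerically stable zeta(a) = log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def sigmoid(a):
    """Numerically stable logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def nll_loss(y, z):
    """J = zeta((1 - 2y) z), the negative log-likelihood for y in {0, 1}."""
    return softplus((1 - 2 * y) * z)

def nll_grad(y, z):
    """dJ/dz = sigma(z) - y."""
    return sigmoid(z) - y

def mse_grad(y, z):
    """d/dz of 0.5 * (sigma(z) - y)^2 = (sigma(z) - y) * sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)

# Confidently wrong (y = 1, z = -20) versus confidently right (y = 1, z = 20).
for z in (-20.0, 20.0):
    print(z, nll_loss(1, z), nll_grad(1, z), mse_grad(1, z))
```

For the confidently wrong case the maximum likelihood gradient stays close to $-1$, so learning can correct the mistake, whereas the MSE gradient is nearly zero in both cases.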

#### 6.2.2.3 Softmax Units for Multinoulli Output Distributions

- As with the logistic sigmoid, the use of the $\exp$ function works very well when training the softmax to output a target value $y$ using maximum log-likelihood.

$$\operatorname{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$

$$\log \operatorname{softmax}(z)_i = z_i - \log \sum_j \exp(z_j)$$

- The first term shows that the input $z_i$ always has a direct contribution to the cost function.

- To gain some **intuition** for the second term, $\log \sum_j \exp(z_j)$, observe that it can be roughly approximated by $\max_j z_j$. The negative log-likelihood cost therefore always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, then the two terms roughly cancel.

- $\operatorname{softmax}(z)_i$ saturates to 1 when the corresponding input is maximal ($z_i = \max_j z_j$) and $z_i$ is much greater than all the other inputs. It saturates to 0 when $z_i$ is not maximal and the maximum is much greater.

- In the extreme, the outputs behave as **winner-take-all**: one of the outputs is nearly 1 and the others are nearly 0.

- **softmax** provides a "softened" version (continuous and differentiable) of the **argmax**. The corresponding soft version of the maximum function is $\operatorname{softmax}(z)^\top z$.
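
The properties listed above can be seen on a small example. This is a minimal sketch using plain Python lists; the logits `z` are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability choice rather than something prescribed in these notes.

```python
import math

def log_sum_exp(z):
    """Stable log(sum_j exp(z_j)): subtract the max before exponentiating."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def log_softmax(z):
    """log softmax(z)_i = z_i - log sum_j exp(z_j)."""
    lse = log_sum_exp(z)
    return [v - lse for v in z]

def softmax(z):
    """softmax(z)_i, computed by exponentiating the stable log-softmax."""
    return [math.exp(v) for v in log_softmax(z)]

z = [1.0, 2.0, 10.0]  # hypothetical logits; the last one dominates
p = softmax(z)
soft_max_value = sum(pi * zi for pi, zi in zip(p, z))  # softmax(z)^T z

print(log_sum_exp(z), max(z))  # the second term is close to max_j z_j
print(p)                       # nearly one-hot: winner-take-all
print(soft_max_value)          # close to max(z)

# Negative log-likelihood if the target is class 2 (correct and dominant)
# versus class 0 (the dominant logit is then an incorrect prediction).
print(-log_softmax(z)[2], -log_softmax(z)[0])
```

When the target is the dominant class the two terms of the log-softmax nearly cancel and the loss is close to zero; when the target is class 0 the loss is roughly $\max_j z_j - z_0$.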

#### 6.2.2.4 Other Output Types

> TODO(leguo)

## 6.3 Hidden Units

### 6.3.1 Rectified Linear Units and Their Generalizations

$$g(z) =\max\{0,z\}$$

- One drawback of ReLUs is that they cannot learn via gradient-based methods on examples for which their activation is zero.

- Generalizations:

$$h_i = g(z,\alpha)_i = \max(0, z_i) + \alpha_i \min(0, z_i)$$

- **Absolute value rectification**: fixes $\alpha_i = -1$ to obtain $g(z) = |z|$.
- **Leaky ReLU**: fixes $\alpha_i$ to a small value like 0.01.
- **Parametric ReLU (PReLU)**: treats $\alpha_i$ as a learnable parameter.
- **Maxout**: a maxout layer can learn to approximate any convex function with arbitrary fidelity.
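
The generalizations above can be written out in a few lines. This is a minimal scalar sketch with illustrative function names; `maxout` assumes its argument is a group of pre-activations belonging to a single unit, and in PReLU the value of `alpha` would be learned during training rather than passed in by hand.

```python
def generalized_relu(z, alpha):
    """g(z, alpha) = max(0, z) + alpha * min(0, z) for a scalar pre-activation z."""
    return max(0.0, z) + alpha * min(0.0, z)

def relu(z):
    return generalized_relu(z, 0.0)     # alpha = 0 recovers the plain ReLU

def abs_rectification(z):
    return generalized_relu(z, -1.0)    # alpha = -1 gives |z|

def leaky_relu(z, alpha=0.01):
    return generalized_relu(z, alpha)   # small fixed slope for z < 0

def maxout(z_group):
    """A maxout unit outputs the maximum over a group of pre-activations."""
    return max(z_group)

for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), abs_rectification(z), leaky_relu(z))

# Maxout over the group (0, z) reproduces the ReLU;
# over the group (z, -z) it reproduces |z|.
print(maxout([0.0, -2.0]), maxout([-2.0, 2.0]))
```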

>TODO(leguo)

- ReLU and all of these generalizations are based on the principle that models are easier to optimize if their behavior is closer to linear.

### 6.3.2 Logistic Sigmoid and Hyperbolic Tangent

$$\tanh(z) = 2\sigma(2z)-1$$

- The widespread saturation of sigmoidal units can make gradient-based learning very difficult. Their use **as hidden units** in feedforward networks is now **discouraged**. Their use **as output units** is **compatible** with the use of gradient-based learning when an **appropriate cost function** can undo the saturation of the sigmoid in the output layer.

- tanh typically performs better than the logistic sigmoid when a sigmoidal activation function must be used.

- Recurrent networks, many probabilistic models, and some autoencoders sometimes use sigmoidal units.
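
The identity relating tanh to the sigmoid, and the saturation of both units, can be checked numerically. This is a minimal sketch with illustrative helper names; the test points are arbitrary.

```python
import math

def sigmoid(z):
    """Logistic sigmoid (adequate for the moderate |z| used here)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """d sigma / dz = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """d tanh / dz = 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

for z in (-3.0, 0.0, 0.5, 3.0):
    # tanh(z) and 2 * sigma(2z) - 1 should agree up to rounding.
    print(z, math.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

# Both gradients shrink toward zero as |z| grows (saturation) ...
print(sigmoid_grad(10.0), tanh_grad(10.0))
# ... but at z = 0 tanh has slope 1 while the sigmoid has slope 0.25.
print(sigmoid_grad(0.0), tanh_grad(0.0))
```

Near zero, tanh passes through the origin with unit slope and so resembles the identity function more closely than the sigmoid does, which is one commonly given reason it tends to be easier to train.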

### 6.3.3 Other Hidden Units

- Other types of hidden units are used less frequently.

> TODO(leguo)

## 6.4 Architecture Design

### 6.4.1 Universal Approximation Properties and Depth

- a feedforward network with a linear output layer and at least

## 6.5 Back-Propagation and Other Differentiation Algorithms

## 6.6 Historical Notes
