## Overview
Designing and training a neural network is not much different from training any
other ML model with gradient descent. It requires specifying:
- an optimization procedure,
- a cost function,
- a model family.

The nonlinearity of a neural network causes most interesting loss functions to become non-convex.

-> Neural networks are usually trained with **iterative, gradient-based optimizers** rather than the linear equation solvers used for linear regression or the convex optimization algorithms used for logistic regression and SVMs.

Stochastic gradient descent applied to non-convex loss functions has no global convergence guarantee, and is sensitive to the values of the initial parameters. 

![non-convex vs convex][convex_cost_function]
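The sensitivity to initial parameters can be illustrated with a toy example. This is only a sketch: it uses plain (non-stochastic) gradient descent on a hand-picked one-dimensional non-convex function, and the function, learning rate, and step count are all assumptions chosen for illustration.

```python
def loss(x):
    # Illustrative non-convex function with two local minima (not from the notes).
    return x**4 - 3 * x**2 + x


def grad(x):
    # Exact derivative of loss(x).
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    # Repeatedly step against the gradient, starting from x0.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Two different starting points settle in two different minima.
x_left = gradient_descent(-2.0)
x_right = gradient_descent(2.0)
print("start -2.0 ->", x_left, "loss", loss(x_left))
print("start +2.0 ->", x_right, "loss", loss(x_right))
```

Both runs converge, but only the one started on the left reaches the lower of the two minima.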

-> For feedforward neural networks, it is important to: 
- initialize all weights to small random values. 
- initialize the biases to zero or to small positive values.
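A minimal sketch of this initialization scheme for a single layer, using only the standard library. The helper name `init_layer`, the Gaussian distribution, and the scale `0.01` are assumptions for illustration; the notes only say "small random values".

```python
import random


def init_layer(n_in, n_out, scale=0.01, bias_value=0.0, seed=0):
    # Hypothetical helper: weights drawn from a zero-mean Gaussian with a
    # small standard deviation, biases set to a constant (zero by default).
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]
    biases = [bias_value] * n_out
    return weights, biases


W, b = init_layer(n_in=4, n_out=3)
print(W)
print(b)
```

Passing a small positive `bias_value` (for example `0.1`) gives the "small positive biases" variant.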

Computing the gradient is slightly more complicated for a neural network, but can still be done efficiently and exactly.

## Cost function
The cost functions for neural networks are more or less the same as those for other parametric models.

[](What is this ?)
In most cases, our parametric model defines a distribution $p(\mathbf{y}\ |\ \mathbf{x};\mathbf{\theta})$ and we simply use the principle of **maximum likelihood**. 

-> We use the **cross-entropy** between the training data and the model’s predictions as the cost function.

### Learning Conditional Distributions with Maximum Likelihood
Most modern neural networks are trained using maximum likelihood.

-> The cost function is simply the **negative log-likelihood**, equivalently described as the cross-entropy between the training data and the model distribution.

$J(\mathbf{\theta}) = -\mathbb{E}_{\mathbf{x},\mathbf{y} \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\mathbf{y} \mid \mathbf{x})$

The specific form of the cost function changes from model to model, depending on the specific form of $\log p_{\text{model}}$. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded.
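As an illustration of discarded terms, assume the model is a Gaussian with unit variance whose mean is the prediction. This choice and the toy numbers are assumptions, not something the notes fix. The negative log-likelihood then equals half the squared error plus a constant that does not depend on the parameters:

```python
import math


def gaussian_nll(y, y_hat):
    # -log N(y; mean=y_hat, variance=1) = 0.5 * (y - y_hat)^2 + 0.5 * log(2*pi)
    return 0.5 * (y - y_hat) ** 2 + 0.5 * math.log(2 * math.pi)


def mean(values):
    return sum(values) / len(values)


# Toy targets and predictions, made up for the example.
targets = [1.0, 2.0, 3.0]
predictions = [1.5, 1.5, 2.0]

nll = mean([gaussian_nll(y, p) for y, p in zip(targets, predictions)])
mse = mean([(y - p) ** 2 for y, p in zip(targets, predictions)])
constant = 0.5 * math.log(2 * math.pi)  # parameter-independent term

print(nll, 0.5 * mse + constant)
```

Minimizing the negative log-likelihood here is the same as minimizing the mean squared error, since the constant has no effect on the gradient.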

An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model.

-> Specifying a model $p(\mathbf{y} \mid \mathbf{x})$ automatically determines a cost function $-\log p(\mathbf{y} \mid \mathbf{x})$.

One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm.

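Functions that saturate (become very flat) undermine this goal because they make the gradient very small. The negative log-likelihood helps avoid this problem for many models: the log in the cost undoes the exp of output units such as sigmoid and softmax, which would otherwise saturate.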

### Learning Conditional Statistics
Instead of learning a full probability distribution $p(\mathbf{y} | \mathbf{x}; \mathbf{\theta})$ we often want to learn just one conditional statistic of $\mathbf{y}$ given $\mathbf{x}$.

Ex: we may have a predictor $f(\mathbf{x}; \mathbf{\theta})$ that we wish to use to predict the mean
of $\mathbf{y}$.

We can view the cost function as being a **functional** rather than just a function. A functional is a mapping from functions to real numbers. We can thus think of learning as choosing a function rather than merely choosing a set of parameters.

Solving an optimization problem with respect to a function requires a mathematical tool called **calculus of variations**.

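Different cost functions give different statistics.
-> Minimizing the mean squared error yields a function that predicts the mean of $\mathbf{y}$ for each $\mathbf{x}$; minimizing the mean absolute error yields a function that predicts the median of $\mathbf{y}$ for each $\mathbf{x}$.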
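The link between cost function and statistic can be checked numerically. The sketch below drops the dependence on $\mathbf{x}$ and fits a single constant prediction to made-up data by brute-force search over a grid; the data and the grid are assumptions for illustration.

```python
import statistics

# Toy targets with one outlier, so that mean and median differ clearly.
data = [1.0, 2.0, 3.0, 4.0, 100.0]


def mse(c):
    # Mean squared error of the constant prediction c.
    return sum((y - c) ** 2 for y in data) / len(data)


def mae(c):
    # Mean absolute error of the constant prediction c.
    return sum(abs(y - c) for y in data) / len(data)


# Candidate constants 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_mse = min(candidates, key=mse)
best_mae = min(candidates, key=mae)

print("MSE minimizer:", best_mse, "mean:", statistics.mean(data))
print("MAE minimizer:", best_mae, "median:", statistics.median(data))
```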

#### Conclusion
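**Mean squared error** and **mean absolute error** often lead to poor results when used with gradient-based optimization, because some output units that saturate produce very small gradients when combined with these cost functions.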

-> The cross-entropy cost function is more popular, even when it is not necessary to estimate an
entire distribution $p(\mathbf{y} | \mathbf{x})$.
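A small sketch of why this matters, assuming a single sigmoid output with pre-activation `z` and a target `y = 1`. The value `z = -10` is an arbitrary choice for a confidently wrong prediction. The gradient of the squared error with respect to `z` almost vanishes, while the gradient of the cross-entropy stays close to 1 in magnitude.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_mse(z, y):
    # d/dz of 0.5 * (sigmoid(z) - y)^2, by the chain rule.
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)


def grad_cross_entropy(z, y):
    # d/dz of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z); simplifies to s - y.
    return sigmoid(z) - y


z, y = -10.0, 1.0  # the unit is saturated and the prediction is wrong
print("MSE gradient:          ", grad_mse(z, y))
print("cross-entropy gradient:", grad_cross_entropy(z, y))
```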

[](This part is not well understood yet)
## Output units
Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function.

### Linear Units for Gaussian Output Distributions
A linear output unit is based on an affine transformation with no nonlinearity.

Given features $\mathbf{h}$, a layer of linear output units produces a vector $\hat{\mathbf{y}} = \mathbf{W}^T \mathbf{h}+\mathbf{b}$.

Linear output layers are often used to produce the mean of a conditional Gaussian distribution.

Because linear units do not saturate, they pose little difficulty for gradient-based optimization and may be used with a wide variety of optimization algorithms.
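The affine map $\mathbf{W}^T \mathbf{h}+\mathbf{b}$ can be written out directly. This is a plain-Python sketch with made-up numbers; the helper name `linear_output` and the storage of `W` as a list of rows (one row per feature) are assumptions.

```python
def linear_output(W, h, b):
    # W has one row per feature in h and one column per output unit,
    # so output j is sum_i W[i][j] * h[i] + b[j], i.e. (W^T h + b)[j].
    return [
        sum(W[i][j] * h[i] for i in range(len(h))) + b[j]
        for j in range(len(b))
    ]


# Three features, two output units (toy values).
W = [[1.0, 0.0],
     [0.5, -1.0],
     [0.0, 2.0]]
h = [2.0, 4.0, 1.0]
b = [0.1, -0.2]

y_hat = linear_output(W, h, b)
print(y_hat)
```

Under a Gaussian model, `y_hat` would be read as the predicted conditional mean.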

### Sigmoid Units for Bernoulli Output Distributions
-> Used for predicting the value of a binary variable $y$.

A Bernoulli distribution is defined by just a single number. The neural net needs to predict only $P(y = 1 \mid \mathbf{x})$. For this number to be a valid probability, it must lie in the interval $[0, 1]$.

A sigmoid output unit is defined by

$\hat{y} = \sigma (\mathbf{w}^T \mathbf{h} + b)$

We can think of the sigmoid output unit as having two components.
- First, it uses a linear layer to compute $z = \mathbf{w}^T \mathbf{h} + b$.
- Next, it uses the sigmoid activation function to convert $z$ into a probability.
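The two components can be sketched as follows, together with the matching negative log-likelihood. For $y \in \{0, 1\}$ the loss can be written as $\text{softplus}((1 - 2y)z)$, which avoids taking the log of a probability that has underflowed to zero. The function names are illustrative, not from the notes.

```python
import math


def sigmoid(z):
    # Branching on the sign of z keeps exp() from overflowing.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(a):
    # Numerically stable log(1 + exp(a)).
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def bernoulli_nll(z, y):
    # -log P(y | x) for y in {0, 1}, computed from the pre-activation z.
    return softplus((1 - 2 * y) * z)


z = 2.0
print("P(y=1|x):", sigmoid(z))
print("loss if y=1:", bernoulli_nll(z, 1))
print("loss if y=0:", bernoulli_nll(z, 0))
```

Working from `z` rather than from `sigmoid(z)` keeps the loss finite even for a very wrong prediction such as `z = -1000` with `y = 1`.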

### Softmax Units for Multinoulli Output Distributions
Any time we wish to represent a probability distribution over a discrete variable with $n$ possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function, which was used to represent a probability distribution over a binary variable.
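A minimal softmax sketch in plain Python. Subtracting the maximum input before exponentiating is a common stabilization trick (an addition here, not something stated in the notes); it leaves the result unchanged because softmax is invariant to adding a constant to every input.

```python
import math


def softmax(z):
    # Shift by the maximum so the largest exponent is exp(0) = 1.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Large inputs that would overflow a naive exp() are handled fine.
probs = softmax([1000.0, 1001.0, 1002.0])
print(probs, sum(probs))

# With two classes and inputs [0, z], the second output equals sigmoid(z).
z = 0.7
print(softmax([0.0, z])[1], 1.0 / (1.0 + math.exp(-z)))
```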















[convex_cost_function]: convex_cost_function.jpg