## 6.1 Example: Learning XOR

### pp. 177 don't expect a super clean result from gradient descent.

> In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.

## 6.2 Gradient-Based Learning

One central theme the authors return to over and over again is that for learning to work, the gradient must be large enough to be informative. Thus, sigmoid units are not a good default, since they saturate easily and their gradient then becomes very small.
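
To make the saturation point concrete, here is a minimal sketch (mine, not from the book) that evaluates the derivative of the logistic sigmoid at a few pre-activation values. The helper names `sigmoid` and `sigmoid_grad` are just illustrative.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at z = 0) and shrinks quickly as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}   d sigma / dz = {sigmoid_grad(z):.2e}")
```

Any gradient that has to pass through a saturated sigmoid gets multiplied by this small factor.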


### pp. 179 boundary probabilities can't be represented by many models.

> For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so.

This is because (most, I guess) exponential families cannot represent distributions on the boundary. See Theorem 3.3 of [Graphical Models, Exponential Families, and Variational Inference](http://dx.doi.org/10.1561/2200000001), as well as my special topic note on exponential families.

Therefore, if the training data is perfectly separable, some regularization is needed: otherwise maximum likelihood keeps pushing the predicted probabilities toward 0 or 1, and the weights grow without bound.
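
A small sketch of why separable data needs regularization, assuming a one-dimensional logistic regression without bias, a made-up four-point dataset, and plain gradient descent (none of this is from the book). Without weight decay the weight keeps growing with more steps; with a small L2 penalty it settles.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Perfectly separable toy data: negative inputs have label 0, positive inputs label 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0]

def fit(weight_decay, steps, lr=0.5):
    """Gradient descent on mean negative log-likelihood plus 0.5 * weight_decay * w**2."""
    w = 0.0
    for _ in range(steps):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad += weight_decay * w
        w -= lr * grad
    return w

for steps in (100, 10_000):
    print(steps, "no decay:", fit(0.0, steps), " decay 0.1:", fit(0.1, steps))
```

The unregularized weight never converges, it only slows down, which is the finite-sample face of "cannot represent a probability of zero or one, but can come arbitrarily close".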

### pp. 181 you need to match output units' type with cost function.

> Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions.
This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution $p(y \mid x)$.

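I think this is also connected to the canonical link function in generalized linear models: with a canonical link, the gradient of the negative log-likelihood with respect to the linear predictor is just the residual (prediction minus target), so a saturating output unit does not by itself kill the gradient.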

### pp. 184 for sigmoid, mean squared error loss gives too little gradient, compared to maximum likelihood.

> When we use other loss functions, such as mean squared error, the loss can saturate anytime $\sigma(z)$ saturates. ... The gradient can shrink too small to be useful for learning whenever this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.

Again, I think this is also connected to the canonical link function in generalized linear models.
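
A minimal sketch comparing the gradient with respect to the logit $z$ under the two losses, for a single sigmoid output. The function names are illustrative; the formulas are just the chain rule applied to $(\sigma(z) - y)^2$ and to the Bernoulli negative log-likelihood.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad(z, y):
    """d/dz of (sigmoid(z) - y)**2: carries the saturating factor s * (1 - s)."""
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)

def nll_grad(z, y):
    """d/dz of -[y log s + (1 - y) log(1 - s)] with s = sigmoid(z): just the residual."""
    return sigmoid(z) - y

# A confidently wrong prediction: the target is 1 but the logit is very negative.
z, y = -10.0, 1.0
print("MSE gradient:", mse_grad(z, y))
print("NLL gradient:", nll_grad(z, y))
```

With the maximum-likelihood loss the gradient stays close to -1 for the confidently wrong example, while the squared-error gradient is nearly zero even though the prediction is wrong.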

### pp. 185 some intuition about softmax, about Eq. (6.30)

> To gain some intuition for the second term, $\log \sum_j \exp( z_j )$, observe that this term can be roughly approximated by $\max_j z_j$. This approximation is based on the idea that $\exp(z_k)$ is insignificant for any $z_k$ that is noticeably less than $\max_j z_j$. ... This example will then contribute little to the overall training cost, which will be dominated by other examples that are not yet correctly classified.
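
A quick numerical check of the approximation, as a sketch with made-up logits. `log_sum_exp` is an illustrative helper using the usual max-subtraction trick for stability.

```python
import math

def log_sum_exp(zs):
    """Stable log(sum_j exp(z_j)): factor out the maximum before exponentiating."""
    m = max(zs)
    return m + math.log(sum(math.exp(z - m) for z in zs))

zs = [10.0, 2.0, 1.0, -3.0]  # one logit clearly dominates
lse = log_sum_exp(zs)
print("max:", max(zs), " log-sum-exp:", lse)

# In general: max(z) <= log_sum_exp(z) <= max(z) + log(len(z)).
```

When the correct class already has the largest logit by a wide margin, $z_i - \log \sum_j \exp(z_j) \approx z_i - \max_j z_j = 0$, which is why such an example contributes little to the cost.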

### pp. 186 again, why using the right cost function matters.

> When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activating function.

This inversion is exactly what is done in (standard) generalized linear models.

### pp. 187 overparameterized softmax, neuroscience connection, and naming of softmax

> In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.

> From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it.

> The name “softmax” can be somewhat confusing. The function is more closely related to the arg max function than the max function. ... It would perhaps be better to call the softmax function “softargmax,” but the current name is an entrenched convention.
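
On the overparametrized vs. restricted versions: a small sketch (my own, with made-up logits) showing that pinning one logit to zero by subtracting it from all the others leaves the softmax output unchanged, so the extra degree of freedom carries no information.

```python
import math

def softmax(zs):
    """Numerically stable softmax over a list of logits."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, -1.0, 0.5]                      # overparametrized: 3 free logits
restricted = [zi - z[-1] for zi in z]     # restricted: last logit pinned to 0

print(softmax(z))
print(softmax(restricted))
```

Both calls describe the same distribution because softmax is invariant to adding a constant to every logit.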

### pp. 188-190 learning Gaussian output units and other extensions, such as GMM

Essentially, keeping the covariance positive definite is important when learning it, and numerical stability is also important. Therefore,

1. the precision (matrix) is preferred as the parameter, since the log-likelihood then needs no division by the variance, which keeps the gradient well behaved.
2. some parameterization tricks are needed to guarantee positive definiteness.

On pp. 189, they talk about learning a full covariance matrix, but I think their method ($\Sigma = BB^\top$) only guarantees positive semi-definiteness. In practice, I think a small diagonal matrix can always be added. The same goes for learning a GMM.

On pp. 190, it says GMM optimization can be unreliable due to division. But I think this can definitely be avoided by using the precision.
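
A sketch of the diagonal case, assuming the network emits an unconstrained value per dimension that is mapped to a positive precision with softplus. The small `min_precision` floor plays the role of the "small diagonal matrix" mentioned above and is my own addition, not something the book prescribes.

```python
import math

def softplus(a):
    """Stable log(1 + exp(a)); always non-negative."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def diag_gaussian_nll(y, mu, raw_precision, min_precision=1e-6):
    """Negative log-likelihood of y under N(mu, diag(1 / beta)).

    beta = softplus(raw) + min_precision keeps every precision strictly positive.
    The expression uses only multiplication by beta and log(beta), no division.
    """
    nll = 0.0
    for yi, mi, ai in zip(y, mu, raw_precision):
        beta = softplus(ai) + min_precision
        nll += 0.5 * (beta * (yi - mi) ** 2 - math.log(beta) + math.log(2.0 * math.pi))
    return nll

print(diag_gaussian_nll(y=[0.3, -1.2], mu=[0.0, -1.0], raw_precision=[0.5, 2.0]))
```

The floor also guards against softplus underflowing to exactly zero for very negative raw values, where `log(beta)` would otherwise fail.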

## 6.3 Hidden Units

### pp. 192 non-differentiability may not be a problem as long as it occurs at only a few points.

> Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. ... The important point is that in practice one can safely disregard the non-differentiability of the hidden unit activation functions described below.

### pp. 193 set bias for ReLU units to be positive

> When initializing the parameters of the affine transformation, it can be a good practice to set all elements of $b$ to a small, positive value, such as 0.1. This makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allow the derivatives to pass through.
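
A rough simulation of the effect of a positive bias, under assumptions that are entirely mine: standard normal inputs, weights drawn from a zero-mean Gaussian with standard deviation 0.01, and a hypothetical helper `fraction_active`. It only illustrates the direction of the effect, not any particular initialization scheme.

```python
import random

def fraction_active(bias, n_units=200, n_inputs=10, n_examples=200,
                    weight_std=0.01, seed=0):
    """Fraction of (example, unit) pairs whose pre-activation w.x + b is positive."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, weight_std) for _ in range(n_inputs)]
               for _ in range(n_units)]
    active = 0
    for _ in range(n_examples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
        for w in weights:
            pre = sum(wi * xi for wi, xi in zip(w, x)) + bias
            active += pre > 0.0
    return active / (n_units * n_examples)

print("bias 0.0:", fraction_active(0.0))
print("bias 0.1:", fraction_active(0.1))
```

With zero bias roughly half the units are off for any given input; with small weights, a bias of 0.1 turns nearly all of them on, so derivatives can pass through at the start of training.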

### pp. 194 maxout units

Check Eq. (6.37). I think it's just using the fact that the pointwise maximum of affine functions is a convex function.
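
A one-dimensional sketch of that fact, with a hypothetical `maxout` helper that takes a list of (slope, intercept) pieces. ReLU and absolute value both fall out as special cases, and the midpoint inequality for convexity can be checked numerically.

```python
def maxout(x, pieces):
    """Pointwise maximum of affine functions; pieces is a list of (slope, intercept)."""
    return max(w * x + b for w, b in pieces)

relu_pieces = [(0.0, 0.0), (1.0, 0.0)]     # max(0, x)
abs_pieces = [(-1.0, 0.0), (1.0, 0.0)]     # max(-x, x) = |x|
bowl_pieces = [(-2.0, -1.0), (0.0, 0.0), (0.5, -0.25), (3.0, -4.0)]

def midpoint_convex(pieces, a, b):
    """Check f((a + b) / 2) <= (f(a) + f(b)) / 2 for one pair of points."""
    mid = maxout(0.5 * (a + b), pieces)
    return mid <= 0.5 * (maxout(a, pieces) + maxout(b, pieces)) + 1e-12

print(maxout(-3.0, relu_pieces), maxout(-3.0, abs_pieces))
print(all(midpoint_convex(bowl_pieces, a, b)
          for a in (-4.0, -1.0, 0.5) for b in (0.0, 2.0, 5.0)))
```

With enough pieces, such a unit can approximate any convex function of its input, which is how I read the text around Eq. (6.37).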

### pp. 194 a nonlinear unit that is close to linear is easier to optimize

> Rectified linear units and all of these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear.

### pp. 195 tanh is better than sigmoid, but sigmoid can be used in other contexts, such as gates in LSTM.

> When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid. It resembles the identity function more closely. ... Sigmoidal activation functions are more common in settings other than feed-forward networks.
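
A small numerical sketch of "resembles the identity function more closely": near zero, tanh has value 0 and slope 1, while the logistic sigmoid has value 0.5 and slope 0.25. The two are related by $\tanh(z) = 2\sigma(2z) - 1$. The helper `slope_at_zero` is an illustrative central-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def slope_at_zero(f, h=1e-6):
    """Central-difference estimate of f'(0)."""
    return (f(h) - f(-h)) / (2.0 * h)

print("tanh:    value", math.tanh(0.0), " slope", slope_at_zero(math.tanh))
print("sigmoid: value", sigmoid(0.0), " slope", slope_at_zero(sigmoid))

# tanh is a rescaled, recentred sigmoid.
for z in (-2.0, 0.3, 1.5):
    print(z, math.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)
```

So as long as activations stay small, a tanh layer behaves like a linear layer, which fits the "closer to linear is easier to optimize" principle above.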

### pp. 196 hidden units are roughly all the same in some sense; linear hidden layers can be used to reduce the number of parameters, and softmax can be used in some fancy memory-based models.

> New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting.

> Linear hidden units thus offer an effective way of reducing the number of parameters in a network.
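
The parameter saving is simple arithmetic: a dense layer from $n$ inputs to $p$ outputs has $np$ weights, while inserting a linear hidden layer of width $q$ gives $(n + p)q$. A tiny sketch with made-up sizes (biases ignored):

```python
def dense_params(n_in, n_out):
    """Weights in a single n_in x n_out matrix."""
    return n_in * n_out

def factored_params(n_in, n_out, q):
    """Weights when the matrix is factored through a linear layer of width q."""
    return n_in * q + q * n_out

n, p, q = 1000, 1000, 50
print(dense_params(n, p))        # 1000000
print(factored_params(n, p, q))  # 100000
```

The price is that the factored map is constrained to rank at most $q$.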

> Softmax units are another kind of unit that is usually used as an output (as described in section 6.2.2.3) but may sometimes be used as a hidden unit. ... These kinds of hidden units are usually only used in more advanced architectures that explicitly learn to manipulate memory

### pp. 197 RBF units are difficult to optimize; and the performance of hidden units can be counterintuitive, for example softplus vs. ReLU.

> Because it saturates to 0 for most $x$, it can be difficult to optimize.

> The use of the softplus is generally discouraged. The softplus demonstrates that the performance of hidden unit types can be very counterintuitive—one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.

## 6.4 Architecture Design

### pp. 198 universal approximation theorem even works for ReLU units

> universal approximation theorems have also been proved for a wider class of activation functions, which includes the now commonly used rectified linear unit (Leshno et al., 1993).

### pp. 199-201 deep models can represent certain types of functions more efficiently; but favoring these types of functions is just a belief, one that has been shown to work well in practice.

> There exist families of functions which can be approximated efficiently by an architecture with depth greater than some value $d$, but which require a much larger model if depth is restricted to be less than or equal to $d$.

Figure 6.5 gives a good example where a complex curve (leftmost) can be reduced to a simpler curve (rightmost) using the mirroring (folding) property of absolute-value rectification units.
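
A one-dimensional toy version of the folding idea, not the construction in the figure itself: the hypothetical map $g(x) = 2|x| - 1$ sends $[-1, 1]$ onto itself and identifies each point with its mirror image. Composing it `depth` times gives a sawtooth whose number of zero crossings doubles with every extra layer.

```python
def fold(x, depth):
    """Apply g(x) = 2|x| - 1 repeatedly; each application folds [-1, 1] onto itself."""
    for _ in range(depth):
        x = 2.0 * abs(x) - 1.0
    return x

def count_zero_crossings(depth, n=999):
    """Count sign changes of fold(., depth) on a grid over [-1, 1].

    n is odd so that no grid point lands exactly on a zero of the function.
    """
    xs = [-1.0 + 2.0 * i / n for i in range(n + 1)]
    ys = [fold(x, depth) for x in xs]
    return sum(1 for a, b in zip(ys, ys[1:]) if a * b < 0)

for depth in (1, 2, 3, 4):
    print(depth, count_zero_crossings(depth))
```

Each layer uses a constant number of units, yet the number of linear pieces grows exponentially with depth; a single hidden layer would need a unit per kink.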

> Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property. ... Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions.

> Empirically, greater depth does seem to result in better generalization for a wide variety of tasks. ... See figure 6.6 and figure 6.7 for examples of some of these empirical results. This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.

### pp. 201 skip connections make gradient flow easier.

> skip connections going from layer $i$ to layer $i + 2$ or higher. These skip connections make it easier for the gradient to flow from output layers to layers nearer the input.
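
A scalar sketch of the gradient-flow argument. The assumptions are mine: every layer is $\tanh(w h)$ with a small weight, and the skip connection is an additive identity shortcut around each layer (residual style), which is one particular way to wire a skip connection, not necessarily the one the book has in mind.

```python
import math

def input_gradient(x, weight, depth, skip):
    """d(output)/d(input) for a chain of scalar tanh layers, via the chain rule."""
    h, grad = x, 1.0
    for _ in range(depth):
        a = math.tanh(weight * h)
        local = weight * (1.0 - a * a)      # d tanh(w h) / dh
        if skip:
            h, local = h + a, 1.0 + local   # y = h + tanh(w h)
        else:
            h = a                           # y = tanh(w h)
        grad *= local
    return grad

print("plain:", input_gradient(0.5, 0.1, 20, skip=False))
print("skip: ", input_gradient(0.5, 0.1, 20, skip=True))
```

In the plain chain the gradient is a product of 20 factors of at most 0.1; with the shortcut every factor is at least 1, so the signal from the output reaches the input.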

## 6.5 Back-Propagation and Other Differentiation Algorithms

In general, this section gives a very good explanation of back-propagation.

### pp. 214 two approaches to back-propagation in implementation.

Caffe and Torch:

> Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach “symbol-to-number” differentiation.

Theano and TensorFlow: 

> Another approach is to take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives.

So the graph for the derivative is given explicitly, and we can do fancier things with it, such as differentiating it again.

> It is possible to run back-propagation again, differentiating the derivatives in order to obtain higher derivatives.

> The key difference is that the symbol-to-number approach does not expose the graph.
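
A toy sketch of why exposing the derivative as a graph is useful. To keep it short it uses plain symbolic differentiation on an expression tree (sum and product rules) rather than back-propagation proper, and the classes `Const`, `Var`, `Add`, `Mul` are hypothetical, not any framework's API. The point is only that `grad` returns another expression, so it can be differentiated again.

```python
from dataclasses import dataclass

class Expr:
    def __add__(self, other): return Add(self, other)
    def __mul__(self, other): return Mul(self, other)

@dataclass(frozen=True)
class Const(Expr):
    value: float
    def eval(self, env): return self.value
    def grad(self, wrt): return Const(0.0)

@dataclass(frozen=True)
class Var(Expr):
    name: str
    def eval(self, env): return env[self.name]
    def grad(self, wrt): return Const(1.0 if self.name == wrt else 0.0)

@dataclass(frozen=True)
class Add(Expr):
    left: Expr
    right: Expr
    def eval(self, env): return self.left.eval(env) + self.right.eval(env)
    def grad(self, wrt): return self.left.grad(wrt) + self.right.grad(wrt)

@dataclass(frozen=True)
class Mul(Expr):
    left: Expr
    right: Expr
    def eval(self, env): return self.left.eval(env) * self.right.eval(env)
    def grad(self, wrt):
        # Product rule; the result is itself an expression graph.
        return self.left.grad(wrt) * self.right + self.left * self.right.grad(wrt)

x = Var("x")
f = x * x * x           # x^3
df = f.grad("x")        # symbolic graph for 3x^2
d2f = df.grad("x")      # differentiate the derivative graph again: 6x

env = {"x": 4.0}
print(f.eval(env), df.eval(env), d2f.eval(env))
```

A symbol-to-number implementation would hand back only the number for `df` at `x = 4`, with nothing left to differentiate.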

### pp. 216-218 some notes on details

* Eq. (6.54): here $X$, $G$ are (flattened) tensors, and the summation just expands the dot product notation $g(X)^T G$, where $g(X)$ denotes the gradient of this op's output w.r.t. $X$.
* Algorithm 6.5: I think $G'$ keeps only the nodes relevant to computing the gradients.
* Algorithm 6.6: among the input arguments, $G$ is only the graph being modified (new gradient nodes are added to it), while $G'$ is what is actually used to work out the required gradient graph.

### pp. 216 general strategy for parameter sharing

> The op.bprop method should always pretend that all of its inputs are distinct from each other, even if they are not. For example, if the mul operator is passed two copies of $x$ to compute $x^2$, the op.bprop method should still return $x$ as the derivative with respect to both inputs. The back-propagation algorithm will later add both of these arguments together to obtain $2x$, which is the correct total derivative on $x$.

This is just the [multivariable chain rule](https://www.math.hmc.edu/calculus/tutorials/multichainrule/). For $y=x^2$, think of $y=u(x)v(x)$ with $u(x)=x$ and $v(x)=x$; then $dy/dx = u'(x)v(x) + u(x)v'(x) = x + x = 2x$.
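
The same bookkeeping as a sketch in code. `mul_bprop` and `backprop_square` are hypothetical names; the first treats its two input slots as distinct, and the second plays the role of the back-propagation algorithm that sums the contributions landing on the same variable.

```python
def mul_bprop(inputs, upstream):
    """Gradient of a * b w.r.t. each input slot, pretending the slots are distinct."""
    a, b = inputs
    return [upstream * b, upstream * a]

def backprop_square(x):
    """Compute d(x * x)/dx by accumulating per-slot gradients by variable name."""
    inputs = [("x", x), ("x", x)]          # mul is passed two copies of x
    slot_grads = mul_bprop([value for _, value in inputs], upstream=1.0)
    total = {}
    for (name, _), g in zip(inputs, slot_grads):
        total[name] = total.get(name, 0.0) + g
    return total

print(backprop_square(3.0))  # {'x': 6.0}
```

Parameter sharing in general works the same way: each use of a shared parameter gets its own gradient, and the uses are added up.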

### pp. 221 Section 6.5.8 discusses some implementation issues of back-propagation.

### pp. 222 in general, finding the most efficient way to compute the gradient is difficult.

> In general, determining the order of evaluation that results in the lowest computational cost is a difficult problem.

See <https://en.wikipedia.org/wiki/Automatic_differentiation#Beyond_forward_and_reverse_accumulation> for more on this topic.

### pp. 223 Forward vs. backward gradient computation as matrix multiplication.

See Eq. (6.58) and the text around it. Note that the analysis there only counts the cost of the multiplications and ignores the storage of intermediate variables. Taking storage into account would make the analysis more difficult.
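
A sketch of the cost argument for a product $ABCD$ of Jacobians, counting scalar multiplications only. The shapes are made up: three $100 \times 100$ matrices and a final $100 \times 1$ column, so multiplying from the right needs only matrix-vector products. `chain_cost` is an illustrative helper.

```python
def chain_cost(shapes, right_to_left):
    """Scalar multiplications to multiply matrices with the given (rows, cols) shapes,
    accumulating strictly from one end of the chain."""
    shapes = list(shapes)
    cost = 0
    if right_to_left:
        acc = shapes[-1]
        for rows, cols in reversed(shapes[:-1]):
            assert cols == acc[0], "inner dimensions must match"
            cost += rows * cols * acc[1]
            acc = (rows, acc[1])
    else:
        acc = shapes[0]
        for rows, cols in shapes[1:]:
            assert acc[1] == rows, "inner dimensions must match"
            cost += acc[0] * rows * cols
            acc = (acc[0], cols)
    return cost

shapes = [(100, 100), (100, 100), (100, 100), (100, 1)]   # A, B, C, D
print("left to right:", chain_cost(shapes, right_to_left=False))  # 2010000
print("right to left:", chain_cost(shapes, right_to_left=True))   # 30000
```

Starting from the end with the narrow factor keeps every intermediate result a vector, which is the situation the text associates with the backward direction; if the narrow factor sat at the other end, the opposite order would win.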

## 6.6 Historical Notes

### pp. 227 neuroscience justifications of ReLU.

> Glorot et al. (2011a) motivate rectified linear units from biological considerations.