## 15.0

### pp. 527 tradeoff in representation learning

> Most representation learning problems face a tradeoff between preserving as much information about the input as possible and attaining nice properties (such as independence).

### pp. 527-528 humans have a very strong ability to learn from few examples

> Humans and animals are able to learn from very few labeled examples. We do not yet know how this is possible. Many factors could explain improved human performance—for example, the brain may use very large ensembles of classifiers or Bayesian inference techniques.

## 15.1 Greedy Layer-Wise Unsupervised Pretraining

### pp. 528 pretraining used to help, but is no longer required.

> The deep learning renaissance of 2006 began with the discovery that this greedy learning procedure could be used to find a good initialization for a joint learning procedure over all the layers, and that this approach could be used to successfully train even fully connected architectures ... Today, we now know that greedy layer-wise pretraining is not required to train fully connected deep architectures, but the unsupervised pretraining approach was the first method to succeed.

### pp. 529 supervised pretraining

> As discussed in section 8.7.4, it is also possible to have greedy layer-wise supervised pretraining. This builds on the premise that training a shallow network is easier than training a deep one, which seems to have been validated in several contexts (Erhan et al., 2010).

### pp. 530 two phases in pretraining can be combined together

> It is also possible to train an autoencoder or generative model at the same time as the supervised model. Examples of this single-stage approach include the discriminative RBM (Larochelle and Bengio, 2008) and the ladder network (Rasmus et al., 2015), in which the total objective is an explicit sum of the two terms (one using the labels and one only using the input).

Probably the best-known work in this line is the discriminative RBM (Larochelle and Bengio, 2008). <http://swoh.web.engr.illinois.edu/courses/IE598/handout/fall2016_slide20.pdf> gives a good tutorial on it. I am not sure how such ideas have evolved for today's large datasets.

### pp. 531 for better control of the learning process, it's better to do unsupervised and supervised learning simultaneously.

There are two reasons for this, one for each of the two motivations for using unsupervised learning.

1. People used to think that pretraining helps find a good initialization. But this effect is not well characterized, and with two-phase pretraining we cannot tell how much the unsupervised phase still contributes during supervised learning. It may therefore be easier to optimize a weighted sum of the supervised and unsupervised costs, where the weight gives us explicit control (with two phases there is no such knob).
2. People hope that unsupervised learning helps learn features that are useful for the supervised cost. But different supervised algorithms (linear regression, logistic regression, etc.) may need different types of features, and without supervision it is difficult to explicitly make the unsupervised features appropriate. They may turn out appropriate by luck, but why not just optimize the supervised and unsupervised costs together?
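
To make the "weighted sum" idea concrete, here is a minimal sketch of a joint objective. Everything in it is my own assumption for illustration (a tiny tanh encoder, a logistic-regression head, a linear decoder, and the names `joint_loss` and `lam`); it is not the book's formulation, only one way a single coefficient could control the strength of the unsupervised term.

```python
import numpy as np

def joint_loss(x, y, W_enc, W_dec, w_clf, lam):
    """Supervised cost plus lam times an unsupervised (reconstruction) cost."""
    h = np.tanh(x @ W_enc)                      # shared representation
    p = 1.0 / (1.0 + np.exp(-(h @ w_clf)))      # supervised head: binary classifier
    eps = 1e-12                                 # avoids log(0)
    supervised = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    recon = h @ W_dec                           # unsupervised head: linear decoder
    unsupervised = np.mean((recon - x) ** 2)
    return supervised + lam * unsupervised, supervised, unsupervised

# Hypothetical toy data and randomly initialized parameters.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = (x[:, 0] > 0).astype(float)
W_enc = rng.normal(scale=0.1, size=(4, 3))
W_dec = rng.normal(scale=0.1, size=(3, 4))
w_clf = rng.normal(scale=0.1, size=3)

total, sup, unsup = joint_loss(x, y, W_enc, W_dec, w_clf, lam=0.5)
print(total, sup, unsup)
```

Setting `lam=0` recovers purely supervised training, and increasing it makes the reconstruction term regularize the shared representation more strongly.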

### pp. 532 two cases where unsupervised learning can be helpful.

> From the point of view of unsupervised pretraining as learning a representation, we can expect unsupervised pretraining to be more effective when the initial representation is poor.
> 
> From the point of view of unsupervised pretraining as a regularizer, we can expect unsupervised pretraining to be most helpful when the number of labeled examples is very small. Because the source of information added by unsupervised pretraining is the unlabeled data, we may also expect unsupervised pretraining to perform best when the number of unlabeled examples is very large.

### pp. 534 Erhan et al. (2010) paper on unsupervised pretraining.

Essentially, pretraining finds a good initialization.

> Both improvements to training error and improvements to test error may be explained in terms of unsupervised pretraining taking the parameters into a region that would otherwise be inaccessible.
>
> The region where pretrained networks arrive is smaller, suggesting that pretraining reduces the variance of the estimation process, which can in turn reduce the risk of severe over-fitting. In other words, unsupervised pretraining initializes neural network parameters into a region that they do not escape, and the results following this initialization are more consistent and less likely to be very bad than without this initialization.

However, these experiments may not be relevant now.

> Keep in mind that these experiments were performed before the invention and popularization of modern techniques for training very deep networks (rectified linear units, dropout and batch normalization) so less is known about the effect of unsupervised pretraining in conjunction with contemporary approaches.

### pp. 534-535 why having two phases is bad.

> Compared to other forms of unsupervised learning, unsupervised pretraining has the disadvantage that it operates with two separate training phases. ... Unsupervised pretraining does not offer a clear way to adjust the strength of the regularization arising from the unsupervised stage. ... When we perform unsupervised and supervised learning simultaneously, instead of using the pretraining strategy, there is a single hyperparameter, usually a coefficient attached to the unsupervised cost, that determines how strongly the unsupervised objective will regularize the supervised model.
>
> Another disadvantage of having two separate training phases is that each phase has its own hyperparameters. The performance of the second phase usually cannot be predicted during the first phase, so there is a long delay between proposing hyperparameters for the first phase and being able to update them using feedback from the second phase. 
>
> Today, unsupervised pretraining has been largely abandoned, except in the field of natural language processing, where the natural representation of words as one-hot vectors conveys no similarity information and where very large unlabeled sets are available.

### pp. 535 pure supervision plus regularization can beat unsupervised on medium-sized datasets

> These same techniques outperform unsupervised pretraining on medium-sized datasets such as CIFAR-10 and MNIST, which have roughly 5,000 labeled examples per class.

### pp. 536 supervised pretraining is popular.

> The idea of pretraining has been generalized to supervised pretraining discussed in section 8.7.4, as a very common approach for transfer learning. Supervised pretraining for transfer learning is popular (Oquab et al., 2014; Yosinski et al., 2014) for use with convolutional networks pretrained on the ImageNet dataset. Practitioners publish the parameters of these trained networks for this purpose, just like pretrained word vectors are published for natural language tasks (Collobert et al., 2011a; Mikolov et al., 2013a).

## 15.2 Transfer Learning and Domain Adaptation

### pp. 536 two types of transfer learning, sharing input or output.

> In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in P1 are relevant to the variations that need to be captured for learning P2. This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature. ... Many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, etc.
>
> However, sometimes, what is shared among the different tasks is not the semantics of the input but the semantics of the output. (see Figure 15.2).

### pp. 538 basic assumption of transfer learning

> In all of these cases, the objective is to take advantage of data from the first setting to extract information that may be useful when learning or even when directly making predictions in the second setting. The core idea of representation learning is that the same representation may be useful in both settings. Using the same representation in both settings allows the representation to benefit from the training data that is available for both tasks.

### pp. 538 why is one-shot learning possible

> One-shot learning (Fei-Fei et al., 2006) is possible because the representation learns to cleanly separate the underlying classes during the first stage. During the transfer learning stage, only one labeled example is needed to infer the label of many possible test examples that all cluster around the same point in representation space. This works to the extent that the factors of variation corresponding to these invariances have been cleanly separated from the other factors, in the learned representation space, and we have somehow learned which factors do and do not matter when discriminating objects of certain categories.

### pp. 539 why is zero-shot learning possible

> Zero-data learning (Larochelle et al., 2008) and zero-shot learning (Palatucci et al., 2009; Socher et al., 2013b) are only possible because additional information has been exploited during training. 
>
> zero-data learning scenario as including three random variables: the traditional inputs x, the traditional outputs or targets y, and an additional random variable describing the task, T.
>
> If we have a training set containing unsupervised examples of objects that live in the same space as T, we may be able to infer the meaning of unseen instances of T.
>
> Zero-shot learning requires T to be represented in a way that allows some sort of generalization. For example, T cannot be just a one-hot code indicating an object category.

Figure 15.3 explains this pretty well.
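
A minimal sketch of why T must be a generalizable description rather than a one-hot code. The attribute vectors, class names, and the function `zero_shot_predict` are all hypothetical, and I simply assume that some attribute predictor has already been trained on the seen classes; the point is only that an unseen class can be recognized when its description lives in the same space as the predicted attributes.

```python
import numpy as np

# Hypothetical task descriptions T: [has_stripes, has_hooves, can_fly].
class_descriptions = {
    "horse": np.array([0.0, 1.0, 0.0]),
    "tiger": np.array([1.0, 0.0, 0.0]),
    "eagle": np.array([0.0, 0.0, 1.0]),
    "zebra": np.array([1.0, 1.0, 0.0]),   # assumed to have no labeled training images
}

def zero_shot_predict(predicted_attributes, descriptions):
    """Return the class whose description is closest to the predicted attributes."""
    names = list(descriptions)
    dists = [np.linalg.norm(predicted_attributes - descriptions[n]) for n in names]
    return names[int(np.argmin(dists))]

# Assumed output of an attribute predictor for a new image:
# clearly striped, clearly hooved, not flying.
attrs = np.array([0.9, 0.8, 0.1])
print(zero_shot_predict(attrs, class_descriptions))   # zebra
```

With one-hot class codes there would be no notion of distance between "zebra" and the seen classes, so nothing learned from horses and tigers could transfer.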

## 15.3 Semi-Supervised Disentangling of Causal Factors

### pp. 541 ideal representation learning -- separating causes

> An important question about representation learning is “what makes one representation better than another?” One hypothesis is that an ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another.
>
> This hypothesis motivates approaches in which we first seek a good representation for $p(x)$. Such a representation may also be a good representation for computing $p(y \mid x)$ if $y$ is among the most salient causes of $x$.

### pp. 541 easy to model representation learning

A representation that is easy to model and one that captures the true causes may not be the same, but people assume the two coincide.

> In other approaches to representation learning, we have often been concerned with a representation that is easy to model—for example, one whose entries are sparse, or independent from each other.
>
> A representation that cleanly separates the underlying causal factors may not necessarily be one that is easy to model.
>
> However, a further part of the hypothesis motivating semi-supervised learning via unsupervised representation learning is that for many AI tasks, these two properties coincide.

### pp. 541-543 an example where unsupervised learning works, and one where it doesn't.

failed: because the (trivial) structure in $x$ has nothing to do with $y$.

> First, let us see how semi-supervised learning can fail because unsupervised learning of $p(x)$ is of no help to learn $p(y \mid x)$. Consider for example the case where $p(x)$ is uniformly distributed and we want to learn $f(x) = E[y \mid x]$. Clearly, observing a training set of $x$ values alone gives us no information about $p(y \mid x)$.

succeed: because $y$ is relevant to how $x$ is generated, and our model captures this generation process correctly.

> Next, let us see a simple example of how semi-supervised learning can succeed. Consider the situation where $x$ arises from a mixture, with one mixture component per value of $y$, as illustrated in figure 15.4. If the mixture components are well-separated, then modeling $p(x)$ reveals precisely where each component is, and a single labeled example of each class will then be enough to perfectly learn $p(y \mid x)$.
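
The mixture example can be sketched numerically. This is a toy illustration under my own assumptions (1-D data, two Gaussian components at -5 and 5, a few k-means iterations standing in for "modeling $p(x)$"), not the setup of figure 15.4 itself.

```python
import numpy as np

rng = np.random.default_rng(0)
# Unlabeled x drawn from two well-separated components, one per value of y.
x = np.concatenate([rng.normal(-5.0, 0.5, 200), rng.normal(5.0, 0.5, 200)])
true_y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

# Step 1 (unsupervised): locate the two components with a few k-means iterations.
centers = np.array([x.min(), x.max()])
for _ in range(10):
    assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    centers = np.array([x[assign == k].mean() for k in range(2)])

# Step 2 (supervised): a single labeled example per class names each cluster.
labeled_x = np.array([-4.8, 5.3])
labeled_y = np.array([0, 1])
cluster_of_label = np.argmin(np.abs(labeled_x[:, None] - centers[None, :]), axis=1)
cluster_to_class = np.zeros(2, dtype=int)
cluster_to_class[cluster_of_label] = labeled_y

pred = cluster_to_class[assign]
accuracy = np.mean(pred == true_y)
print(centers, accuracy)
```

Because the components barely overlap, the two labeled points are enough to label every cluster member correctly; with heavily overlapping components the same procedure would degrade.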

However, I don't agree with some of the arguments. Below Eq. (15.3), the book says

> Thus the marginal $p(x)$ is intimately tied to the conditional $p(y \mid x)$ and knowledge of the structure of the former should be helpful to learn the latter. Therefore, in situations respecting these assumptions, ...

I think this observation holds for arbitrary $x$ and $y$, so every situation satisfies it. That is why I find "Therefore, in situations respecting these assumptions, ..." confusing.


### pp. 543 determining which causes are important to model

> An important research frontier in semi-supervised learning is determining what to encode in each situation. Currently, two of the main strategies for dealing with a large number of underlying causes are to use a supervised learning signal at the same time as the unsupervised learning signal so that the model will choose to capture the most relevant factors of variation, or to use much larger representations if using purely unsupervised learning.

### pp. 543-544 changing the emphasis on different features by using different cost functions, which motivates GANs (generative adversarial networks)

Fig. 15.6 is a prime example.

> An emerging strategy for unsupervised learning is to modify the definition of which underlying causes are most salient. ... For example, mean squared error applied to the pixels of an image implicitly specifies that an underlying cause is only salient if it significantly changes the brightness of a large number of pixels. This can be problematic if the task we wish to solve involves interacting with small objects.
>
> Other definitions of salience are possible. For example, if a group of pixels follow a highly recognizable pattern, even if that pattern does not involve extreme brightness or darkness, then that pattern could be considered extremely salient. One way to implement such a definition of salience is to use a recently developed approach called generative adversarial networks (Goodfellow et al., 2014c). ... For the purposes of the present discussion, it is sufficient to understand that they learn how to determine what is salient
>
> Lotter et al. (2015) showed that models trained to generate images of human heads will often neglect to generate the ears when trained with mean squared error, but will successfully generate the ears when trained with the adversarial framework.
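
A small numeric sketch of how mean squared error defines salience. The 32x32 image, the 2x2 "object", and the 0.1 brightness offset are made-up values; the only point is that a reconstruction which drops a small object entirely can score better under MSE than one that keeps it but is slightly off everywhere.

```python
import numpy as np

def mse(a, b):
    """Mean squared error over all pixels."""
    return float(np.mean((a - b) ** 2))

# Hypothetical 32x32 grayscale scene: dark background with one small bright object.
target = np.zeros((32, 32))
target[10:12, 20:22] = 1.0             # 2x2 object

drops_object = np.zeros((32, 32))      # reconstruction that omits the object
slightly_bright = target + 0.1         # keeps the object, every pixel slightly too bright

mse_drop = mse(target, drops_object)       # 4 / 1024
mse_shift = mse(target, slightly_bright)   # 0.1 ** 2
print(mse_drop, mse_shift)
```

Here `mse_drop` is about 0.0039 and `mse_shift` about 0.01, so under this cost the object is less "salient" than a faint global brightness change.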

### pp. 545 learning a causal model is more generalizable and stable than learning an anti-causal (inverse) model

> A benefit of learning the underlying causal factors, as pointed out by Schölkopf et al. (2012), is that if the true generative process has $x$ as an effect and $y$ as a cause, then modeling $p(x \mid y)$ is robust to changes in $p(y)$. If the cause-effect relationship was reversed, this would not be true, since by Bayes’ rule, $p(y \mid x)$ would be sensitive to changes in $p(y)$. Very often, when we consider changes in distribution due to different domains, temporal non-stationarity, or changes in the nature of the task, the causal mechanisms remain invariant (the laws of the universe are constant) while the marginal distribution over the underlying causes can change.

I already fixed a typo when quoting this: in the book, the second $p(x \mid y)$ should be $p(y \mid x)$.

I think the essential idea is that the causal model (given the cause, model the effect) is very stable across many domains, and thus learning it may make transfer learning easier.
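
A tiny discrete check of this argument. The probability tables are hypothetical numbers I chose; the sketch keeps the mechanism $p(x \mid y)$ fixed, changes only the prior $p(y)$, and recomputes $p(y \mid x)$ with Bayes' rule.

```python
import numpy as np

# Fixed causal mechanism p(x | y): rows index y, columns index x (made-up values).
p_x_given_y = np.array([[0.9, 0.1],
                        [0.2, 0.8]])

def posterior_y_given_x(p_y):
    """Bayes' rule: p(y | x) is proportional to p(y) p(x | y)."""
    joint = p_y[:, None] * p_x_given_y
    return joint / joint.sum(axis=0, keepdims=True)   # each column sums to 1

post_a = posterior_y_given_x(np.array([0.5, 0.5]))   # domain A: balanced causes
post_b = posterior_y_given_x(np.array([0.9, 0.1]))   # domain B: shifted p(y)
print(post_a[:, 0], post_b[:, 0])
```

The table `p_x_given_y` is identical in both domains, while $p(y \mid x{=}0)$ moves from roughly (0.82, 0.18) to roughly (0.98, 0.02), so a model of the anti-causal direction would have to be relearned.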

## 15.4 Distributed Representation

### pp. 546 essence of distributed representation

> Distributed representations are powerful because they can use $n$ features with $k$ values to describe $k^n$ different concepts.
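
A quick count to make the $k^n$ claim tangible, using small values I picked ($n = 3$ features, $k = 2$ values each) and comparing against a one-hot code of the same length.

```python
from itertools import product

n, k = 3, 2                                   # 3 features, each taking 2 values
distributed_codes = list(product(range(k), repeat=n))
one_hot_codes = [tuple(int(i == j) for j in range(n)) for i in range(n)]

print(len(distributed_codes))                 # 8, i.e. k ** n
print(len(one_hot_codes))                     # 3, i.e. n
```

The same three entries distinguish 8 concepts when every entry can vary independently, but only 3 when exactly one entry may be active.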

### pp. 546-548 essence of non-distributed representation is that there is no significant separate control over each entry in the representation vector

> A symbolic representation is a specific example of the broader class of non-distributed representations, which are representations that may contain many entries but without significant meaningful separate control over each entry.

> For some of these non-distributed algorithms, the output is not constant by parts but instead interpolates between neighboring regions. The relationship between the number of parameters (or examples) and the number of regions they can define remains linear.

For example, when doing inference with a GMM, the posterior over which cluster a vector comes from is soft. However, its elements must sum to one, and they are subject to many other constraints: among all nonnegative vectors that sum to 1, many cannot be produced by any possible $x$.
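
A small sketch of the GMM case, with parameters I made up (three unit-variance 1-D components at -2, 0, 2 with equal weights). It shows that the soft posterior always sums to one, and that with these particular parameters the middle component is never the least likely, so a vector like (0.5, 0, 0.5) cannot be produced by any $x$.

```python
import numpy as np

def gmm_posterior(x, means, stds, weights):
    """Posterior p(component | x) for a 1-D Gaussian mixture."""
    dens = weights * np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return dens / dens.sum()

means = np.array([-2.0, 0.0, 2.0])
stds = np.array([1.0, 1.0, 1.0])
weights = np.array([1 / 3, 1 / 3, 1 / 3])

r = gmm_posterior(0.5, means, stds, weights)
print(r, r.sum())

# Scan many inputs: the middle entry is always strictly above the smallest entry.
middle_never_smallest = all(
    gmm_posterior(x, means, stds, weights)[1] > gmm_posterior(x, means, stds, weights).min()
    for x in np.linspace(-10, 10, 201)
)
print(middle_never_smallest)
```

So the entries of this representation are soft but cannot be controlled separately, which is the sense in which it is non-distributed.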

I don't understand some of the examples given. For example, I don't get why a language model based on n-grams is non-distributed. Maybe this is because I don't have experience with language models.

### pp. 548 distributed representation allows more flexible generalization, over unseen configurations.

> An important related concept that distinguishes a distributed representation from a symbolic one is that generalization arises due to shared attributes between different concepts.

> Neural language models that operate on distributed representations of words generalize much better than other models that operate directly on one-hot representations of words, as discussed in section 12.4. Distributed representations induce a rich similarity space, in which semantically close concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations.

### pp. 549 Figure 15.8 pros and cons of non-distributed vs distributed

> The advantage of a non-distributed approach is that, given enough parameters, it can fit the training set without solving a difficult optimization algorithm, because it is straightforward to choose a different output independently for each region.

> The disadvantage is that such non-distributed models generalize only locally via the smoothness prior, making it difficult to learn a complicated function with more peaks and troughs than the available number of examples.

### pp. 548-550 when distributed works.

Non-distributed representations only assume local smoothness. But many domains have more interesting structure, and distributed representations can exploit it.

> Distributed representations can have a statistical advantage when an apparently complicated structure can be compactly represented using a small number of parameters. 
>
> If we are lucky, there may be some regularity in the target function, besides being smooth. For example, a convolutional network with max-pooling can recognize an object regardless of its location in the image, even though spatial translation of the object may not correspond to smooth transformations in the input space. ...  Using fewer parameters to represent the model means that we have fewer parameters to fit, and thus require far fewer training examples to generalize well.
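
A 1-D toy version of the convolution plus max-pooling example. The signal length, the three-tap kernel, and the function name `detect` are my own choices; the sketch only shows that the pooled response is identical for two shifted copies of a pattern even though the two inputs are far apart in input space.

```python
import numpy as np

def detect(signal, kernel):
    """1-D cross-correlation followed by global max pooling."""
    responses = np.correlate(signal, kernel, mode="valid")
    return responses.max()

kernel = np.array([1.0, -1.0, 1.0])    # hypothetical pattern detector
pattern = np.array([1.0, -1.0, 1.0])

left = np.zeros(12)
left[1:4] = pattern                    # pattern near the left edge
right = np.zeros(12)
right[7:10] = pattern                  # same pattern shifted to the right

print(detect(left, kernel), detect(right, kernel))   # equal responses
print(np.linalg.norm(left - right))                  # yet the inputs differ a lot
```

The invariance comes from sharing one small kernel across all positions, which is also why so few parameters are needed.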

While a distributed representation can seemingly divide the input space in more complex ways, that does not mean it overfits easily: because the number of parameters is limited, many disjoint regions in a distributed representation are actually correlated with each other.

> A further part of the argument for why models based on distributed representations generalize well is that their capacity remains limited despite being able to distinctly encode so many different regions. ... The use of a distributed representation combined with a linear classifier thus expresses a prior belief that the classes to be recognized are linearly separable as a function of the underlying causal factors captured by h. We will typically want to learn categories such as the set of all images of all green objects or the set of all images of cars, but not categories that require nonlinear, XOR logic.
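
A brute-force sketch of the "no XOR" prior. I assume two binary factors in h and search a coarse grid of linear classifiers (the grid and the function `linearly_separable` are mine). A failed grid search is not a proof in general, but XOR is known to be not linearly separable, so the result matches the theory.

```python
import numpy as np
from itertools import product

# All configurations of two binary factors captured by h.
H = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def linearly_separable(labels, grid=np.linspace(-2, 2, 9)):
    """Search a small grid of (w1, w2, b) for a linear rule that reproduces labels."""
    for w1, w2, b in product(grid, repeat=3):
        pred = (H @ np.array([w1, w2]) + b > 0).astype(int)
        if np.array_equal(pred, labels):
            return True
    return False

print(linearly_separable(np.array([0, 0, 1, 1])))   # depends on one factor: True
print(linearly_separable(np.array([0, 0, 0, 1])))   # AND of the factors: True
print(linearly_separable(np.array([0, 1, 1, 0])))   # XOR of the factors: False
```

Categories that depend on individual factors or simple conjunctions are reachable by a linear classifier on h, while XOR-style categories are excluded by construction.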

"The use of a distributed representation combined with a linear classifier thus expresses a prior belief". this prior belief is true in many tasks.

### pp. 551-552 experimental evidence that deep learning learns distributed representations

> one could imagine learning about each of them without having to see all the configurations of all the others. Radford et al. (2015) demonstrated that a generative model can learn a representation of images of faces, with separate directions in representation space capturing different underlying factors of variation.

See Fig. 15.9

> We can learn about the distinction between male and female, or about the presence or absence of glasses, without having to characterize all of the configurations of the n − 1 other features by examples covering all of these combinations of values. This form of statistical separability is what allows one to generalize to new configurations of a person’s features that have never been seen during training.

The most important property of a distributed representation is that we can control its different dimensions separately.

## 15.5 Exponential Gains from Depth

Here the book again argues that deep architectures can represent certain distributions (rather than functions, as earlier in the book in Section 6.4.1) more efficiently than shallow architectures.

## 15.6 Providing Clues to Discover Underlying Causes

This section gives some general ideas on how to encourage the learning of useful underlying causes.

> Results such as the no free lunch theorem show that regularization strategies are necessary to obtain good generalization. While it is impossible to find a universally superior regularization strategy, one goal of deep learning is to find a set of fairly generic regularization strategies that are applicable to a wide variety of AI tasks, similar to the tasks that people and animals are able to solve.