Many information processing tasks can be very easy or very difficult depending on how the data is represented.  

We can think of feedforward networks trained for classification as performing a kind of representation learning.  
The last layer is a linear classifier, and the rest of the network provides a representation to this classifier. The representation learns properties that make the classification task easier.  

Other kinds of models are explicitly designed to shape the representation in a particular way. For example, if we want to learn a representation that makes density estimation easier, we could design a criterion that encourages the elements of $h$ to be independent. Supervised and unsupervised models have a main objective and learn a specific representation as a side effect.  
This representation can then be used for one or several other tasks.  

Most representation learning models face a trade-off between preserving as much information about the input as possible and obtaining a representation with nice properties.  
Representation learning can help reduce overfitting when we have a large amount of unlabelled data and little labelled data: we can learn a good representation from the unlabelled data, then use it to solve the supervised task.

# Greedy Layer-Wise Unsupervised Pretraining

This procedure played a key role in the revival of deep learning.  
It relies on a single-layer representation learning method such as an RBM, a single-layer autoencoder, or a sparse coding model.  
Each layer is pretrained on the output of the previous layer and produces a new representation of the data.  
This procedure gives a good initialization for training deep fully connected networks.  
Before that, only deep networks based on convolution or recurrence (backpropagation through time) could be trained.  
Today, we know other methods for training deep models.  

This is called greedy layer-wise because the layers are trained one at a time, each with the previous layers held fixed, which gives a greedy and possibly suboptimal solution compared to training all layers jointly.  

The supervised learning phase can consist of training a classifier on top of the learned features, or of fine-tuning the whole network.  
Greedy layer-wise pretraining can also be used as initialization for other unsupervised learning models, such as deep autoencoders, deep belief networks and deep Boltzmann machines.
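
The procedure can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the algorithm as published: the single-layer learner is assumed to be a sigmoid autoencoder with tied weights, trained by full-batch gradient descent on squared reconstruction error. The names `train_autoencoder` and `greedy_pretrain`, the layer sizes, the step count and the learning rate are all hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, n_steps=200, lr=0.1, seed=0):
    """Fit a tied-weight autoencoder on X and return the encoder (W, b).

    Loss: 0.5 * squared reconstruction error, averaged over examples.
    """
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
    b = np.zeros(n_hidden)   # encoder bias
    c = np.zeros(n_in)       # decoder bias
    for _ in range(n_steps):
        H = sigmoid(X @ W + b)            # encode
        R = H @ W.T + c                   # linear decode, weights tied to the encoder
        err = (R - X) / X.shape[0]        # gradient of the loss w.r.t. R
        dZ = (err @ W) * H * (1.0 - H)    # backpropagate through decoder and sigmoid
        W -= lr * (X.T @ dZ + err.T @ H)  # encoder term + decoder term (tied weights)
        b -= lr * dZ.sum(axis=0)
        c -= lr * err.sum(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Train one autoencoder per layer, each on the previous layer's output."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # earlier layers stay fixed; H feeds the next layer
    return params, H

rng = np.random.default_rng(42)
X = rng.random((64, 20))                       # toy unlabelled data
params, features = greedy_pretrain(X, [10, 5])
```

In the supervised phase, `features` could feed a classifier directly, or `params` could initialize the lower layers of a network that is then fine-tuned end to end.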

## When and why does it work?

Greedy layer-wise unsupervised pretraining often yields a lower test error for classification. But on many tasks it may provide no benefit or even cause harm.  

Unsupervised pretraining combines two ideas:
- the choice of initial parameters can have a regularizing effect on the model and, to a lesser extent, can improve optimization.  
- learning about the input distribution can help with learning a mapping from inputs to outputs.  

The first idea is not well understood. The second idea is better understood: some features useful for the unsupervised task may also be useful for the supervised one.  
Training on the supervised and unsupervised objectives simultaneously can be preferable.  

Unsupervised pretraining is more effective when the initial representation is poor (e.g. word embeddings replacing one-hot vectors).  

Unsupervised pretraining is also useful when the number of labelled examples is small or the number of unlabelled examples is large.

Unsupervised pretraining helps most when the function to be learned is complicated. Unlike regularizers such as weight decay, which bias the model towards learning a simpler function, it biases the model towards discovering feature functions that are useful for the unsupervised task.

Unsupervised pretraining can also help optimization, and can improve both training and test error for deep autoencoders.

Improvements in training and test error can be explained by the method taking the parameters into a region that would otherwise be inaccessible.  
Networks trained from different random initializations halt in different regions, whereas networks that receive unsupervised pretraining consistently halt in the same region.  
Pretraining therefore seems to reduce the variance of the estimation process, which can reduce overfitting.  

One hypothesis is that pretraining acts as a regularizer because it encourages the model to discover features that relate to the underlying causes that generated the data.  
Unlike other regularization methods, there is no single hyperparameter that controls the strength of the regularization. When unsupervised and supervised learning are performed at the same time, there is usually a single hyperparameter that controls how strongly the unsupervised objective regularizes the supervised model.  

Hyperparameters of the unsupervised phase can be set using the validation error of the supervised phase, but this is quite slow.  

Today unsupervised pretraining has been mostly abandoned except for NLP. One can pretrain on a huge unlabelled dataset to learn a good representation of words, and then use or fine-tune it for supervised tasks.  

New regularization techniques (dropout, batch normalization) are more effective on large and medium-sized datasets, and Bayesian methods perform better on small datasets, so unsupervised pretraining has fallen out of use.  

The idea of pretraining has been generalized to supervised pretraining, which is especially popular for transfer learning.

# Transfer Learning and Domain Adaptation

In transfer learning and domain adaptation, what has been learned in one setting (distribution $P_1$) is exploited to learn about another setting (distribution $P_2$).  

In transfer learning, the factors that explain the variations in $P_1$ are assumed to be relevant to the variations in $P_2$. For example, the inputs are the same, but the targets are different.  

Transfer learning can be achieved via representation learning when there exist features useful for several tasks. Some tasks may share the lower layers, others the upper layers, depending on the problem.  

In domain adaptation, the task remains the same between settings, but the input distribution changes.  

Concept drift is a form of transfer learning in which the data distribution changes gradually over time.  

Using the same representation for several tasks allows the representation to benefit from the training data of all the tasks.  
When a representation is learned without supervision on task 1 and we train a classifier on top of it for task 2 with few labelled examples, the deeper the representation learned on task 1, the better the classifier, even with very little labelled data.  

Two extreme forms of transfer learning are zero-shot learning (no labelled example of the transfer task is given) and one-shot learning (only one labelled example is given).  

One-shot learning is possible because the representation learns to cleanly separate the underlying classes. Only one labelled example is then needed to infer the label of many test examples that all cluster around the same point in representation space.  
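
A toy sketch of this idea, assuming the representation has already been learned and places each class in its own tight cluster. The cluster centres, noise level and variable names are hypothetical, and nearest-neighbour labelling is just one simple way to exploit the single labelled example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned representation: two well-separated class clusters.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
y_test = rng.integers(0, 2, size=200)
h_test = centers[y_test] + rng.normal(0.0, 1.0, size=(200, 2))

# One labelled example per class (the "one shot").
h_shot = centers + rng.normal(0.0, 1.0, size=(2, 2))
y_shot = np.array([0, 1])

# Give every test point the label of its nearest labelled example.
dists = np.linalg.norm(h_test[:, None, :] - h_shot[None, :, :], axis=-1)
y_pred = y_shot[dists.argmin(axis=1)]
accuracy = float((y_pred == y_test).mean())
```

The accuracy is high here only because the clusters were constructed to be far apart; with an entangled representation the same single example would generalize poorly.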

Zero-shot learning is only possible because additional information is available. There is another random variable, $T$, which describes the task the model must perform.  
The model learns to estimate $p(y|x,T)$. With training instances for different values of $T$, we may be able to generalize to unseen $T$.  

Multi-modal learning captures a representation of $x$ in one modality, a representation of $y$ in another, and the relationship between them (in general, the joint distribution $p(x,y)$).  
Like zero-shot learning, multi-modal learning can generalize to unseen pairs of data.

# Semi-Supervised Disentangling of Causal Factors

An example of a good representation is one whose features correspond to the underlying causes of the data, with separate features for each cause, so that the representation disentangles the causes from one another.  

We can first find a good representation for $p(x)$; if $y$ is among the causes of $x$, this representation may also make it easy to compute $p(y|x)$.  

Semi-supervised learning may fail when learning $p(x)$ is of no help in learning $p(y|x)$ (e.g. $x$ is uniformly distributed).  
It may succeed, for example, when $p(x)$ is a mixture with one component per value of $y$. If the model of $p(x)$ clearly separates the components, it is easy to predict $y$.  
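
As a small numerical illustration of the mixture case, assume a 1-D Gaussian mixture with one component per class. The function names, means, standard deviations and priors below are hypothetical; the only thing used is Bayes' rule.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def posterior(x, means, stds, priors):
    """p(y | x) for a 1-D Gaussian mixture with one component per class y."""
    joint = [p * gaussian_pdf(x, m, s) for m, s, p in zip(means, stds, priors)]
    p_x = sum(joint)                 # the mixture density p(x)
    return [j / p_x for j in joint]

# Well-separated components: class 0 around -5, class 1 around +5.
post_near_class0 = posterior(-4.0, means=[-5.0, 5.0], stds=[1.0, 1.0], priors=[0.5, 0.5])
# Halfway between the two components the posterior is undecided.
post_midpoint = posterior(0.0, means=[-5.0, 5.0], stds=[1.0, 1.0], priors=[0.5, 0.5])
```

When the components barely overlap, almost every $x$ has a posterior close to 0 or 1, so knowing the structure of $p(x)$ leaves very little for the labels to resolve.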

Let's suppose $y$ is one of the causal factors of $x$, and $h$ represents all those factors. The true generative process can be defined as:
$$p(h,x) = p(x|h)p(h)$$
$$p(x) = \mathbb{E}_h p(x|h)$$  

The best possible model for $x$ is one that uncovers the latent variables $h$ that explain the observed variations in $x$.  

Suppose most examples are formed by a large number of underlying causes, and one of them is $y$: $y=h_i$. The unsupervised model doesn't know which one, so a solution is to learn all the factors in $h$, making it easy for the supervised model to predict $y$.  
In practice, it's not possible to capture all or most factors of variation. Which ones should we capture?
- We can use a supervised signal to help capture the most relevant factors.
- We can use much larger representations.

Models are trained using a criterion that implicitly defines which factors are considered salient. For example, mean squared error (MSE) applied to the pixels of an image considers a factor salient only if it changes the brightness of a large number of pixels, so it tends to ignore small objects.  
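
A toy comparison makes the MSE behaviour concrete. The image size, object size and brightness shift are hypothetical values chosen only for illustration.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# A grey 64x64 image containing a small, bright 2x2 "object".
image = np.full((64, 64), 0.5)
image[30:32, 30:32] = 1.0

no_object = np.full((64, 64), 0.5)   # reconstruction that drops the object entirely
brighter = image + 0.05              # reconstruction that keeps it but is slightly too bright

err_missing_object = mse(image, no_object)   # 4 pixels off by 0.5
err_brightness = mse(image, brighter)        # all 4096 pixels off by 0.05
```

Under MSE, losing the object costs about ten times less than a barely visible global brightness shift, so a model with limited capacity has little incentive to represent the object.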


Another definition of salience is that a group of pixels is salient if it follows a highly recognizable pattern.  
Such a criterion can be learned using Generative Adversarial Networks (GANs). A generative model is trained to fool a discriminative classifier. The discriminator tries to recognize all samples from the generator as fake, and all samples from the training data as real.  
Any structured pattern that the discriminator can recognize is considered salient.  
GANs are only one step toward determining which factors should be represented.  

If the true generative process has $x$ as an effect and $y$ as a cause, modelling $p(x|y)$ is robust to changes in $p(y)$.  
Very often, changes in distribution due to different domains or a change of task leave the causal mechanisms unchanged.
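
A short worked derivation shows why. It uses only the product rule and Bayes' rule; $q(y)$ is a symbol introduced here for the new marginal over causes, and $\tilde{y}$ is a summation variable. Suppose only the distribution of the cause changes, from $p(y)$ to $q(y)$:

$$p(x,y) = p(x|y)\,p(y) \quad\longrightarrow\quad q(x,y) = p(x|y)\,q(y)$$
$$q(y|x) = \frac{p(x|y)\,q(y)}{\sum_{\tilde{y}} p(x|\tilde{y})\,q(\tilde{y})}$$

The factor $p(x|y)$ appears unchanged in the new joint, so a model of it remains valid. The posterior $q(y|x)$ depends on $q(y)$, so a model that learned $p(y|x)$ directly would have to be re-estimated.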

# Distributed Representation

A distributed representation is composed of many elements that can be set separately from each other. $n$ features with $k$ values each can represent $k^n$ different concepts.  
Distributed representations are a natural fit for learning the underlying causal factors of the data.  

This is different from a symbolic representation, where each input is associated with a single symbol. For example, in a one-hot representation there are $n$ features but only one is active at a time, so only $n$ different concepts can be represented.  
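
The difference in counting can be checked by enumeration. The values of `n` and `k` below are arbitrary small numbers chosen for illustration.

```python
from itertools import product

n, k = 4, 2   # 4 features, 2 values each

# Distributed: every feature can be set independently of the others.
distributed_codes = list(product(range(k), repeat=n))

# One-hot: exactly one of the n features is active.
one_hot_codes = [tuple(int(i == j) for j in range(n)) for i in range(n)]

n_distributed = len(distributed_codes)   # k ** n
n_one_hot = len(one_hot_codes)           # n
```

Every one-hot code is also a valid binary distributed code, but it uses only $n$ of the $k^n$ available configurations.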

In distributed representations, attributes shared between different concepts lead to better generalization.  
In NLP, two words may seem very different from each other, but with a meaningful distributed representation, shared attributes can be recognized between them. This can lead to better generalization than using one-hot representations of words.  

Distributed representations induce a rich similarity space, in which semantically close inputs are close in distance.  

Non-distributed algorithms suffer from the curse of dimensionality. In order to learn an estimator that differs across many regions, we need at least as many training examples as the number of regions, and there may be exponentially many regions.  

Let's suppose we have a learning algorithm that produces a binary distributed representation with $n$ features. Each feature separates $\mathbb{R}^d$ into 2 half-spaces. With $n$ hyperplanes in general position, the total number of regions is $\sum_{j=0}^{d} \binom{n}{j} = O(n^d)$. The model has only $O(nd)$ parameters.  
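
The gap between regions and parameters can be computed directly. This sketch uses the standard count for $n$ hyperplanes in general position in $\mathbb{R}^d$, namely $\sum_{j=0}^{d} \binom{n}{j}$; the function names are hypothetical.

```python
from math import comb

def max_regions(n, d):
    """Number of regions cut out of R^d by n hyperplanes in general position."""
    return sum(comb(n, j) for j in range(d + 1))

def n_parameters(n, d):
    """Each hyperplane needs d weights and one bias."""
    return n * (d + 1)

regions_2d = max_regions(3, 2)     # three lines in the plane
regions_3d = max_regions(10, 3)    # ten planes in 3-D space
params_3d = n_parameters(10, 3)
```

Three lines in general position cut the plane into 7 regions. Ten planes in 3-D give 176 regions from only 40 parameters, and the gap widens quickly as $d$ grows.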

If a parametric transformation with $k$ parameters can learn $r$ regions ($k \ll r$) and obtain a useful representation, we could possibly generalize much better than with a non-distributed representation.  
Fewer parameters to fit means fewer training examples are needed to generalize well.  

Distributed representations also generalize well because their capacity remains limited: we cannot use all of the code space, and we cannot learn an arbitrary mapping from $h$ to $y$ when we use a linear classifier.  
Using a distributed representation with a linear classifier expresses a prior belief that the classes to be recognized are linearly separable as a function of $h$.  

A GAN trained on images of faces learned a representation that disentangles factors of variation such as gender or wearing glasses. If we take the representation of "man with glasses", subtract "man without glasses", and add "woman without glasses", the GAN can generate images of a woman with glasses.
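
The arithmetic itself can be sketched with toy vectors. This assumes an idealized latent space in which each attribute corresponds to its own additive direction; the vectors below are hypothetical and are not taken from any trained GAN.

```python
import numpy as np

# Hypothetical latent code and attribute directions (idealized, perfectly disentangled).
base = np.array([0.2, -0.1, 0.7, 0.0])
woman = np.array([1.0, 0.0, 0.0, 0.0])
glasses = np.array([0.0, 1.0, 0.0, 0.0])

man_without_glasses = base
man_with_glasses = base + glasses
woman_without_glasses = base + woman
woman_with_glasses = base + woman + glasses

# "man with glasses" - "man without glasses" + "woman without glasses"
result = man_with_glasses - man_without_glasses + woman_without_glasses
```

In a real model the result would be passed through the generator to produce an image, and the identity holds only approximately because the learned directions are not perfectly additive.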

# Exponential Gains from Depth

Multilayer perceptrons are universal approximators, and some functions can be represented by deep networks that are exponentially smaller than the shallow networks needed for the same function. Similar results apply to distributed representations.  

In many tasks, the underlying factors are likely to be very high-level and a nonlinear function of the input. This demands deep distributed representations, where the factors are obtained through many nonlinearities.  

Many structured probabilistic models with a single hidden layer of latent variables (RBM, deep belief net) are universal approximators of probability distributions. As with deep feedforward networks, a sufficiently deep model can have an exponential advantage over one that is too shallow when estimating a probability distribution.

# Providing Clues to Discover Underlying Causes

Most strategies provide clues to the learner to help it disentangle the underlying factors.  
A very strong clue is the label, which may directly specify the value of one of the factors.  

There are also less direct hints, usually imposed as implicit prior beliefs. They are implemented as regularization strategies:

- Smoothness: $f(x + \epsilon d) \approx f(x)$ for a unit direction $d$ and small $\epsilon$. Helps to generalize to points near training examples.


- Linearity: Assumes that relationships between some variables are linear, which helps to make predictions very far from the training data.


- Multiple explanatory factors: The assumption is that the data is generated by underlying explanatory factors, and that most tasks can be solved easily given those factors. Learning the structure of $p(x)$ makes it easier to model $p(y|x)$.


- Causal factors: The model treats the factors of variation $h$ as the causes of $x$.


- Depth: A deep architecture can express high-level concepts by forming a hierarchy of simpler concepts. A deep architecture can also express the belief that the task is a multi-step process, where each step uses the output of the previous step.


- Shared factors across tasks: When several tasks share the same input $x$ but each task has a different output $y_i = f^{(i)}(x)$, they may each be associated with a different subset of the underlying factors $h$. We can learn all $P(y_i | x)$ via a shared representation $P(h|x)$.


- Manifolds: The probability mass of the data concentrates in regions that are locally connected and occupy a tiny volume. These regions can be approximated by low-dimensional manifolds. Some models perform better when given this low-dimensional representation; others try to learn the structure of the manifold.


- Natural clustering: Many models assume that each connected manifold in the input space may be assigned to a single class.


- Temporal and spatial coherence: Some models make the assumption that the most important explanatory factors change slowly over time.


- Sparsity: Most features should not be relevant to describing a given input. We can impose a prior that any feature should be absent most of the time.


- Simplicity of factor dependencies: In high level representations, the factors are related to each other through simple dependencies.  
    The simplest is marginal independence: $P(h) = \prod_i P(h_i)$, but linear dependencies are also reasonable assumptions.
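
Two of these priors translate directly into penalty terms added to a training loss. The sketch below is one common way to write them, not the only one: an L1 penalty for sparsity and a squared-difference penalty for temporal coherence. The function names and the weights `lam_sparse` and `lam_slow` are hypothetical.

```python
import numpy as np

def sparsity_penalty(H):
    """L1 penalty: small when most feature activations are zero or near zero."""
    return float(np.mean(np.abs(H)))

def slowness_penalty(H_seq):
    """Mean squared change of the features between consecutive time steps (rows)."""
    return float(np.mean((H_seq[1:] - H_seq[:-1]) ** 2))

def regularized_loss(task_loss, H_seq, lam_sparse=0.1, lam_slow=0.1):
    """Task loss plus weighted sparsity and slowness penalties."""
    return task_loss + lam_sparse * sparsity_penalty(H_seq) + lam_slow * slowness_penalty(H_seq)

# Features for two consecutive time steps: sparse and unchanged over time.
H_seq = np.array([[0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0]])
total = regularized_loss(1.0, H_seq)
```

Here the slowness term is zero because the features do not change between the two steps, and the sparsity term is small because only one feature in three is active.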