We can build a model of the input:
$$p_\text{model}(x)$$

The model may also have a latent variable $h$:
$$p_\text{model}(x) = \mathbb{E}_h p_\text{model}(x|h)$$
$h$ is another way to represent the data.

A linear factor model uses a stochastic, linear decoder that generates $x$ by adding noise to a linear transformation of $h$.  
First, we sample the explanatory factors $h$ from a factorial distribution:
$$h \sim p(h)$$
$$p(h) = \prod_i p(h_i)$$
Next we use the decoder to generate $x$:
$$x = Wh + b + \text{noise}$$
Usually the noise is Gaussian with a diagonal covariance matrix, i.e. independent across dimensions.
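
The two-step generative process above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `sample_linear_factor_model`, the shapes, and the choice of a unit Gaussian for the factorial prior $p(h)$ are assumptions made for the example.

```python
import numpy as np

def sample_linear_factor_model(W, b, noise_std, n_samples, rng):
    """Draw samples x = W h + b + noise (hypothetical helper).

    Assumption: the factorial prior p(h) is a unit Gaussian.
    """
    n_visible, n_latent = W.shape
    # Step 1: sample the explanatory factors from the factorial prior p(h).
    h = rng.standard_normal((n_samples, n_latent))
    # Step 2: decode linearly, then add independent (diagonal) Gaussian noise.
    noise = rng.standard_normal((n_samples, n_visible)) * noise_std
    x = h @ W.T + b + noise
    return x, h

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))   # decoder weights, 5 observed dims, 2 factors
b = np.arange(5.0)                # decoder bias
noise_std = np.full(5, 0.1)       # one noise scale per observed dimension
x, h = sample_linear_factor_model(W, b, noise_std, n_samples=1000, rng=rng)
```

Each row of `x` is one generated example and the matching row of `h` holds the factors that produced it.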

# Probabilistic PCA and Factor analysis

In factor analysis, the latent variable prior is a unit Gaussian:
$$h \sim \mathcal{N}(0, I)$$

The noise is drawn from a Gaussian distribution with a diagonal covariance matrix:
$$\psi = \text{diag}(\sigma^2_1,\dots,\sigma^2_n)$$

The role of the latent variables is to capture the dependencies between the $x_i$. Marginally, $x$ is a multivariate normal random variable:
$$x \sim \mathcal{N}(b, WW^T + \psi)$$
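
The mean and covariance follow from the linearity of the decoder. As a short worked step, writing the noise as $\epsilon \sim \mathcal{N}(0, \psi)$ (the symbol $\epsilon$ is introduced here only for the derivation) and assuming it is independent of $h$:

$$\mathbb{E}[x] = W\,\mathbb{E}[h] + b + \mathbb{E}[\epsilon] = b$$
$$\text{Cov}(x) = W\,\text{Cov}(h)\,W^T + \text{Cov}(\epsilon) = WW^T + \psi$$

Since $x$ is an affine function of jointly Gaussian variables, it is itself Gaussian.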

For probabilistic PCA, we use the same variance $\sigma^2$ for every variable:
$$x = Wh + b + \sigma z$$
$$z \sim \mathcal{N}(0,I)$$
$$x \sim \mathcal{N}(b, WW^T + \sigma^2 I)$$
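
As a numerical sanity check of the marginal covariance, the sketch below samples from the probabilistic PCA generative process and compares the empirical covariance with $WW^T + \sigma^2 I$. The dimensions, the random $W$ and $b$, and the value of $\sigma$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_latent, n_samples = 4, 2, 200_000
W = rng.standard_normal((n_visible, n_latent))
b = rng.standard_normal(n_visible)
sigma = 0.5

# Generative process: h ~ N(0, I), z ~ N(0, I), x = W h + b + sigma * z.
h = rng.standard_normal((n_samples, n_latent))
z = rng.standard_normal((n_samples, n_visible))
x = h @ W.T + b + sigma * z

# Compare the sample covariance with the covariance predicted by the model.
empirical_cov = np.cov(x, rowvar=False)
model_cov = W @ W.T + sigma**2 * np.eye(n_visible)
max_abs_error = np.abs(empirical_cov - model_cov).max()
```

With this many samples, `max_abs_error` should be small relative to the entries of `model_cov`, and it shrinks further as `n_samples` grows.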

An iterative EM algorithm can estimate the parameters $W$ and $\sigma^2$.

Probabilistic PCA supposes that most variation in the data can be captured by the latent variables $h$, up to a small residual reconstruction error $\sigma^2$. As $\sigma \to 0$, probabilistic PCA becomes PCA.

# Independent Component Analysis (ICA)


ICA is an approach to modeling linear factors that separates an observed signal into many underlying signals that are intended to be fully independent (not just decorrelated).  

We train a fully parametric generative model.  
We fix the prior $p(h)$, and define the generating process:
$$x = Wh$$

Because $x$ is a deterministic function of $h$, a change of variables gives $p(x)$, and the model is then learned by maximum likelihood.  
By choosing an independent prior, we recover underlying factors that are as close to independent as possible.  
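
As a worked step, assuming $W$ is square and invertible, the density of $x$ under this model follows from the change-of-variables formula applied to $h = W^{-1}x$:

$$p(x) = p_h(W^{-1}x)\,|\det W^{-1}| = |\det W^{-1}| \prod_i p_{h_i}\big((W^{-1}x)_i\big)$$

Here $p_h$ denotes the fixed factorial prior on $h$, a notation introduced only for this step. Maximizing $\sum_t \log p(x^{(t)})$ over $W$ is then a standard maximum likelihood problem.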

This is used to recover signals that have been mixed together. For example, if $n$ people speak simultaneously and are recorded with $n$ microphones, ICA can separate their voices. Each example is a moment in time, with each feature $x_i$ corresponding to one microphone. After separation, each $h_i$ contains only one person speaking clearly.  
It is also used in neuroscience, to separate brain signals from the other electric signals measured with electrodes.  

There are many approaches to the problem; most aim to make the elements of $h = W^{-1}x$ independent from each other.  
All require $p(h)$ to be non-Gaussian, otherwise $W$ is not identifiable.  
Typical choices for these distributions have larger peaks near $0$ than a Gaussian, so most implementations of ICA learn sparse features.

Many variants of ICA are not generative models: they do not represent $p(h)$ or $p(x)$, and only find a way to transform between $x$ and $h$.  
For example, they aim to increase the sample kurtosis of $h=W^{-1}x$, which pushes $p(h)$ to be non-Gaussian.  
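
To make the kurtosis criterion concrete, here is a toy sketch for two sources. It is not a production ICA algorithm: it whitens the data and then grid-searches a single rotation angle that maximizes the total sample excess kurtosis. The function names, the mixing matrix, and the Laplace sources are all assumptions chosen for the example.

```python
import numpy as np

def excess_kurtosis(y):
    """Sample excess kurtosis of each column (about 0 for a Gaussian)."""
    y = y - y.mean(axis=0)
    return (y**4).mean(axis=0) / (y**2).mean(axis=0) ** 2 - 3.0

def whiten(x):
    """Center x and linearly transform it to have identity covariance."""
    x = x - x.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
    return x @ eigvecs / np.sqrt(eigvals)

def ica_2d_by_kurtosis(x, n_angles=180):
    """Rotate whitened 2-D data to maximize the total excess kurtosis."""
    x_white = whiten(x)
    best_score, best_h = -np.inf, None
    # Angles in [0, pi/2) suffice: larger rotations only permute or flip signs.
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        rotation = np.array([[c, -s], [s, c]])
        h = x_white @ rotation
        score = excess_kurtosis(h).sum()
        if score > best_score:
            best_score, best_h = score, h
    return best_h

rng = np.random.default_rng(0)
sources = rng.laplace(size=(20_000, 2))       # independent, heavy-tailed sources
mixing = np.array([[1.0, 0.6], [0.4, 1.0]])   # hypothetical mixing matrix
x = sources @ mixing.T                        # each row is one observation
h = ica_2d_by_kurtosis(x)

# Absolute correlation between each recovered component and each true source.
corr = np.abs(np.corrcoef(h.T, sources.T)[:2, 2:])
```

The recovered components match the sources only up to permutation, sign, and scale, which is why the comparison uses absolute correlations.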

ICA can be generalized to a nonlinear model, where we use a nonlinear function $f$ to generate the data.  
Another approach, Nonlinear Independent Components Estimation (NICE), stacks a series of invertible encoders. They transform the data into a space where it has a factorized marginal distribution.  
It is more likely to succeed because the encoders are nonlinear.  

Another generalization of ICA learns groups of features, where dependence is allowed within a group but discouraged between groups.

# Slow Feature Analysis

SFA is a regularization technique that can be applied to any differentiable model to learn features that change slowly over time.  
We add a regularization term to the cost function:
$$\lambda \sum_t L(f(x^{(t+1)}), f(x^{(t)}))$$

with $f$ the feature extractor, and $L$ a loss function measuring the distance between the features, usually the mean squared error.  

With a linear feature extractor $f(x; \theta)$, we can get a closed form solution.  
The problem is defined as:
$$\min_\theta \mathbb{E}_t[(f(x^{(t+1)})_i - f(x^{(t)})_i)^2]$$
$$\text{s.t. } \mathbb{E}_t[f(x^{(t)})_i] = 0$$
$$\text{s.t. } \mathbb{E}_t[f(x^{(t)})_i^2] = 1$$

Constraining the features to have $0$ mean ensures we have a unique solution, and constraining them to have unit variance prevents the features from all collapsing to $0$.  

To learn multiple features, we must add the constraint that the features are decorrelated from each other, otherwise all features would capture the same slowest signal:
$$\forall i < j, \space \mathbb{E}_t[f(x^{(t)})_i f(x^{(t)})_j] = 0$$
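
One common way to obtain the closed-form linear solution is to whiten the data, which enforces the zero-mean, unit-variance, and decorrelation constraints, and then keep the directions along which the time differences have the smallest variance. The sketch below follows that recipe; the function name `linear_sfa` and the toy sinusoidal signals are assumptions made for illustration.

```python
import numpy as np

def linear_sfa(x, n_features):
    """Closed-form linear SFA sketch on a time series x of shape (T, D)."""
    x = x - x.mean(axis=0)                        # zero-mean constraint
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
    z = x @ (eigvecs / np.sqrt(eigvals))          # whitened: unit covariance
    dz = np.diff(z, axis=0)                       # z[t+1] - z[t]
    # eigh sorts eigenvalues in ascending order, so the first eigenvectors
    # are the directions with the slowest temporal variation.
    _, diff_vecs = np.linalg.eigh(dz.T @ dz / len(dz))
    return z @ diff_vecs[:, :n_features]

# Toy data: a slow and a fast sinusoid, observed only through linear mixtures.
t = np.linspace(0.0, 4 * np.pi, 2000)
slow, fast = np.sin(t), np.sin(15 * t)
x = np.stack([slow + 0.5 * fast, 0.3 * slow - fast], axis=1)

features = linear_sfa(x, n_features=2)
slow_corr = abs(np.corrcoef(features[:, 0], slow)[0, 1])
```

Because the final projection is an orthogonal rotation of whitened data, the returned features stay decorrelated with unit variance, and the first one should track the slow sinusoid up to sign.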

SFA can learn nonlinear features by applying a nonlinear basis expansion first. We can compose several SFA modules, applying a nonlinear basis expansion to the outputs of the previous module and then learning SFA on top of it, recursively. This way it can learn deep nonlinear slow features.

Deep SFA has been used to learn features for object recognition and pose estimation, but it has not become the basis of any state-of-the-art application.  
The slowness prior may be too strong: it encourages the model to ignore the position of objects with high velocity. It would be better to impose a prior that features should be easy to predict from one time step to the next.

# Sparse Coding

Sparse coding is a linear factor model that performs unsupervised feature learning.

$$p(x|h) = \mathcal{N}(Wh + b, \frac{1}{\beta}I)$$

The prior $p(h)$ is chosen to have sharp peaks near $0$.  
It can be a Laplace:

$$p(h_i) = \text{Laplace}(0,\frac{2}{\lambda}) = \frac{\lambda}{4} \exp (-\frac{1}{2} \lambda |h_i|)$$

Or a Student-t:

$$p(h_i) \propto \frac{1}{(1 + \frac{h_i^2}{\nu})^{\frac{\nu + 1}{2}}}$$

Training with maximum likelihood is intractable. Instead, we alternate between two steps.  
The first step is an optimization problem that finds the single most likely code value for $h$:
$$h^* = \arg \max_h p(h|x)$$

With a Laplace prior, it simplifies to:

$$h^* = \arg \min_h \lambda ||h||_1 + \beta ||x - Wh||_2^2$$

with $W$ fixed. 
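
As a worked step, the simplification follows from Bayes' rule and the definitions of $p(x|h)$ and the Laplace prior given earlier. The derivation assumes the bias $b$ has been absorbed (for example by centering $x$), as the objective above does implicitly:

$$h^* = \arg \max_h \left[ \log p(x|h) + \log p(h) \right]$$
$$= \arg \min_h \left[ \frac{\beta}{2} ||x - Wh||_2^2 + \frac{\lambda}{2} \sum_i |h_i| \right]$$
$$= \arg \min_h \lambda ||h||_1 + \beta ||x - Wh||_2^2$$

The first line drops $\log p(x)$, which does not depend on $h$; the second drops additive constants; the third multiplies by $2$, which does not change the minimizer.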

The second step updates $W$ to minimize the reconstruction error, with $h^*$ fixed.

The whole procedure yields a sparse $h^*$.
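
The inference step can be sketched with ISTA (iterative shrinkage-thresholding), one standard solver for this kind of $L^1$-regularized least-squares problem. The notes do not prescribe a solver, so the choice of ISTA, the function names, and the toy dictionary below are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, threshold):
    """Proximal operator of the L1 norm: shrink toward 0, clip small values to 0."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def objective(h, x, W, lam, beta):
    """lam * ||h||_1 + beta * ||x - W h||_2^2."""
    return lam * np.abs(h).sum() + beta * np.sum((x - W @ h) ** 2)

def infer_code_ista(x, W, lam, beta, n_steps=500):
    """Approximate h* = argmin_h objective(h) with W held fixed."""
    # Lipschitz constant of the gradient of the quadratic term.
    lipschitz = 2.0 * beta * np.linalg.norm(W, ord=2) ** 2
    h = np.zeros(W.shape[1])
    for _ in range(n_steps):
        grad = 2.0 * beta * W.T @ (W @ h - x)     # gradient of the quadratic term
        h = soft_threshold(h - grad / lipschitz, lam / lipschitz)
    return h

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 50))                 # overcomplete toy dictionary
h_true = np.zeros(50)
h_true[[3, 17, 42]] = [1.5, -2.0, 1.0]            # only three active factors
x = W @ h_true + 0.01 * rng.standard_normal(20)

h_star = infer_code_ista(x, W, lam=1.0, beta=1.0)
n_nonzero = int(np.count_nonzero(h_star))
```

The soft-thresholding step sets small coordinates exactly to zero, which is where the sparsity of $h^*$ comes from. A full training loop would alternate this inference with an update of `W`.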

Sparse coding with a non-parametric encoder can, in principle, minimize the combination of reconstruction error and log-prior better than any specific parametric encoder.  
Another advantage is that there is no generalization error in the encoder, unlike with parametric models. In most formulations the inference problem is convex, so the optimization finds the optimal code. When used as a feature extractor for a classifier, it may lead to better generalization than a parametric encoder.  

The disadvantage is that computing $h$ given $x$ requires an iterative algorithm, which takes much more time than a parametric encoder.  
Moreover, backpropagating through the encoder is not straightforward, which makes it complicated to fine-tune it with labelled data.  

Sparse coding, like other linear factor models, often generates poor samples, even when the model reconstructs the data well. This is because the prior on $h$ is factorial, so each generated sample includes a random subset of all the features.

# Manifold Interpretation of PCA

PCA and factor analysis can be seen as learning a manifold. Probabilistic PCA defines a Gaussian distribution that is very narrow along some axes and elongated along others. PCA aligns this shape with a linear manifold in a higher-dimensional space.  

The encoder computes a low-dimensional representation $h$:
$$h = f(x) = W^T(x - \mu)$$
The decoder reconstructs $x$:
$$\hat{x} = g(h) = Vh + b$$

The goal is to minimise the reconstruction error:
$$\mathbb{E}[||x - \hat{x}||^2]$$

The optimal encoder and decoder correspond to $V = W$, $\mu = b = \mathbb{E}[x]$, and the columns of $W$ forming an orthonormal basis that spans the same subspace as the principal eigenvectors of the covariance matrix:
$$C = \mathbb{E}[(x - \mu)(x - \mu)^T]$$

The eigenvalue $\lambda_i$ of $C$ corresponds to the variance of $x$ in the direction of the eigenvector $v^{(i)}$.

If $x \in \mathbb{R}^D$ and $h \in \mathbb{R}^d$, the reconstruction error is:
$$\sum_{i=d+1}^D \lambda_i$$

Hence if the covariance has rank $d$, the reconstruction error is $0$.  
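
The identity between the reconstruction error and the discarded eigenvalues can be checked numerically. The sketch below uses arbitrary synthetic data and the empirical covariance normalized by the number of samples, so that both sides are averages over the same data; all variable names are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 6, 2, 5000
x = rng.standard_normal((n, D)) @ rng.standard_normal((D, D))

mu = x.mean(axis=0)
C = (x - mu).T @ (x - mu) / n                 # empirical covariance
eigvals, eigvecs = np.linalg.eigh(C)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
W = eigvecs[:, :d]                            # top-d principal directions

h = (x - mu) @ W                              # encoder: h = W^T (x - mu)
x_hat = h @ W.T + mu                          # decoder: x_hat = W h + mu
reconstruction_error = np.mean(np.sum((x - x_hat) ** 2, axis=1))
discarded_variance = eigvals[d:].sum()        # sum of lambda_i for i > d
```

Up to floating-point error, `reconstruction_error` and `discarded_variance` agree.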

Maximizing the variance of the elements of $h$, with $W$ constrained to be orthogonal, yields the same solution.