## Energy Based Models (EBM)

By Yann LeCun (ICLR 2020, and [Original Paper](http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf))

#### The following section assumes the reader is familiar with the manifold hypothesis.
##### Problem of Integrals

Much of the deep learning literature uses probabilistic models
as loss functions, but the main problem with probability-based models is that they
must be properly normalised, which sometimes requires evaluating intractable integrals
over the space of all variable configurations.

An <strong>EBM</strong> has only an energy (loss) function that gives a "happy" value.
If it is happy, it gives a low value. If it is not happy,
the value blows up.

##### So how is it any different from the probability based models?

An EBM can be anything that loosely fits the earlier definition. An inverse probability measure
is also an EBM.

<strong> Every probability measure is an EBM, and every EBM can be turned into a probability measure if it is normalised.</strong>
The normalisation goes through the Gibbs distribution (see Probabilistic Graphical Models).

Normalising is often a tedious or impossible task because we have to deal with
<strong>intractable integrals</strong>. The idea is that we have to weigh the probability of one
instance against every other instance that could exist, which is not possible in practice.
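
To make the normalisation step concrete, here is a minimal sketch for a finite set of configurations, where the partition function is a plain sum. The function name `gibbs_probabilities`, the parameter `beta`, and the energy values are illustrative assumptions, not part of the original notes. With continuous variables the sum becomes the integral that is usually intractable.

```python
import math

def gibbs_probabilities(energies, beta=1.0):
    """Turn a list of energies into probabilities via the Gibbs distribution."""
    # Subtracting the minimum energy keeps exp() stable; it cancels in the ratio.
    e_min = min(energies)
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    # The partition function needs the weight of *every* configuration.
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical energies of three configurations (lower means "happier").
energies = [0.5, 2.0, 4.0]
probs = gibbs_probabilities(energies)
print(probs)
```

Note that the energies alone already rank the configurations; the sum over all of them is only needed to obtain calibrated probabilities.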

<strong> Independent explanatory factors of variation </strong>

Every input (an image or a sequence of words) with a semantic meaning has an encoded form
that lies somewhere in a high-dimensional space. Inputs that are semantically related
lie on a manifold in that space, and traversing the manifold yields variations of
these inputs. We achieve this by changing the values of latent variables that represent
independent explanatory factors of variation.

 <strong> What is Energy in this context? </strong>

If a well-trained encoder projects (say) a face image on or near the face manifold (in the high-dimensional space), then we say that
the energy associated with that input is low. If the encoder projects it outside the manifold, the energy is high. This forces
the neural network to have less tolerance towards unexpected and unseen outputs.

 #### Inference

For a given input $x$, find the value of the output label $y$ that makes
$F(x,y)$ small:

$y' = \arg\min_y F(x,y)$

   If $y$ is continuous, apply gradient descent to find the optimal $y$.

   If $y$ is discrete, how?
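
Both cases can be sketched with a toy energy. Everything here is an assumption for illustration: the quadratic `energy`, its hand-written gradient, and the step size. For the discrete case the sketch uses exhaustive search over a small candidate set, which is one possible answer when the set is small enough to enumerate; the notes themselves leave the question open.

```python
def energy(x, y):
    # Hypothetical energy: lowest when y equals 2 * x.
    return (y - 2.0 * x) ** 2

def energy_grad_y(x, y):
    # Analytic derivative of the energy above with respect to y.
    return 2.0 * (y - 2.0 * x)

def infer_continuous(x, y0=0.0, lr=0.1, steps=200):
    """Gradient descent on y while x stays fixed."""
    y = y0
    for _ in range(steps):
        y -= lr * energy_grad_y(x, y)
    return y

def infer_discrete(x, candidates):
    """Exhaustive search: score every candidate label and keep the lowest energy."""
    return min(candidates, key=lambda y: energy(x, y))

y_cont = infer_continuous(x=1.5)
y_disc = infer_discrete(x=1.5, candidates=[0, 1, 2, 3, 4])
print(y_cont, y_disc)
```

Exhaustive search stops being practical once the label space is large or structured (for example, whole sentences), where some form of approximate search would be needed.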

   #### Exploring EBMs

   Machine Learning is learning about the distribution of data.

  <strong> K-means vs GMM </strong>

  We create a 'happy' function that is happy with any point in the
   data distribution. In K-means, we associate each point with its closest cluster center. We
   do not normalize anything, so it is an energy-based method and not a probability-based
   method.

  However, in GMMs we have to consult all the clusters (or Gaussians) to learn about our datapoint.
   We have to find the membership of this point in every Gaussian. Hence, there inherently exists
   a normalization term in the cost function of a GMM.
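
The contrast can be sketched in one dimension. The centers, mixture weights, and shared `sigma` below are made-up values, and the function names are hypothetical. The K-means energy looks only at the nearest center, while the GMM cost sums over every Gaussian and carries explicit normalization constants.

```python
import math

centers = [0.0, 5.0]   # hypothetical cluster centers / Gaussian means
weights = [0.5, 0.5]   # hypothetical mixture weights (sum to 1)
sigma = 1.0            # shared standard deviation, assumed for simplicity

def kmeans_energy(x):
    # Only the closest center matters; nothing is normalized.
    return min((x - c) ** 2 for c in centers)

def gmm_neg_log_likelihood(x):
    # Every Gaussian contributes, and each density carries the
    # normalization constant 1 / (sigma * sqrt(2 * pi)).
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    density = sum(w * norm * math.exp(-0.5 * ((x - c) / sigma) ** 2)
                  for w, c in zip(weights, centers))
    return -math.log(density)

def gmm_responsibilities(x):
    # Soft membership of x in each Gaussian; the values sum to 1.
    scores = [w * math.exp(-0.5 * ((x - c) / sigma) ** 2)
              for w, c in zip(weights, centers)]
    total = sum(scores)
    return [s / total for s in scores]

print(kmeans_energy(1.0), gmm_neg_log_likelihood(1.0), gmm_responsibilities(1.0))
```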


   $F(x,y) = -\frac{1}{\beta} \log \int_z e^{-\beta E(x,y,z)} \, dz$
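
When the latent variable $z$ takes only finitely many values, the integral in this free energy becomes a sum, which is easy to compute. The sketch below assumes that discrete case; `free_energy` and the energy values are illustrative names and numbers. It also shows that a large $\beta$ makes the free energy approach the minimum over $z$, which is the K-means style "pick the closest" behaviour.

```python
import math

def free_energy(energies_over_z, beta):
    """F = -(1/beta) * log(sum_z exp(-beta * E_z)) for a finite set of z."""
    # Factor out the minimum energy so the exponentials cannot overflow.
    e_min = min(energies_over_z)
    total = sum(math.exp(-beta * (e - e_min)) for e in energies_over_z)
    return e_min - math.log(total) / beta

# Hypothetical values of E(x, y, z) for three settings of z.
energies_over_z = [1.0, 1.5, 4.0]
soft = free_energy(energies_over_z, beta=1.0)     # all z contribute
hard = free_energy(energies_over_z, beta=100.0)   # close to min over z
print(soft, hard)
```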


#### Recent surge of energy based models

SimCLR and other contrastive approaches leverage energy functions as their
loss functions. Such models are trained on pairs and shape a manifold by forcing the NN
to yield low energy for a positive-positive pair and high energy for a positive-negative pair.
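
A margin-based version of this idea can be sketched as follows. This is not SimCLR's actual loss (SimCLR uses a softmax-style objective); it is a simpler hinge formulation chosen for illustration, and `pair_energy`, `contrastive_loss`, and `margin` are assumed names.

```python
def pair_energy(a, b):
    # Hypothetical energy: squared Euclidean distance between two embeddings.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def contrastive_loss(anchor, positive, negative, margin=1.0):
    """Pull the positive pair's energy down; push the negative pair's energy up to a margin."""
    e_pos = pair_energy(anchor, positive)
    e_neg = pair_energy(anchor, negative)
    # The hinge stops pushing once the negative pair's energy reaches `margin`.
    return e_pos + max(0.0, margin - e_neg)

# Negative is far away: nothing left to push, loss comes only from the positive pair.
print(contrastive_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0]))
# Negative is too close: the hinge term is active.
print(contrastive_loss([0.0, 0.0], [0.0, 0.0], [0.5, 0.0]))
```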

In a way, we can never really describe the manifold, but what we can do is build
energy functions that tell us how far we are from the manifold.


##### Language Model: BERT

1. Take an input.
2. Corrupt it.
3. Build a system to distinguish between clean vs corrupted version.

The idea is that we create a noisy/corrupted form of a semantic sequence, so that the corrupted version
can be turned back into the original one. We do this by encoding the corrupted sequence somewhere off the manifold,
and the model then learns to drag it back onto the manifold.

Datapoints (encoded versions of semantic sequences) that lie off the
manifold are just noised versions of ones that are on the manifold.
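
The corruption step can be sketched as random masking of tokens. The `MASK` placeholder, the `corrupt` helper, and the masking probability are assumptions for illustration; the model that maps the noisy sequence back to the clean one is not shown.

```python
import random

MASK = "[MASK]"   # placeholder token, named here for illustration

def corrupt(tokens, mask_prob=0.15, seed=0):
    """Return a corrupted copy of `tokens` plus the positions that were masked."""
    rng = random.Random(seed)   # seeded so the corruption is reproducible
    corrupted, masked_positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            masked_positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, masked_positions

clean = "the cat sat on the mat".split()
noisy, positions = corrupt(clean, mask_prob=0.5)
# A model would be trained to recover `clean` from `noisy` at `positions`.
print(noisy, positions)
```

A higher `mask_prob` pushes the corrupted sequence further from the clean one.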

<strong> So how far should we push off from the manifold? If we push too far,
will that at some point become problematic? </strong>

We might need a smooth energy function rather than one shaped like a canyon. If the depressions are small and the problem space is too sparse,
we can never really learn anything about the manifold.

 ##### Problem with Maximum Likelihood Estimation

It wants to make the difference between the energy on the data manifold and the energy just outside of it infinitely large.
It wants to turn the data manifold into an infinitely deep and infinitely narrow canyon.
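
A short derivation makes this behaviour visible. It is a sketch under stated assumptions: the model is the Gibbs distribution of an energy $E_w$, where $w$ denotes the model parameters, and $x$ is dropped for brevity (both are notational choices made here, not taken from the original notes).

$$p_w(y) = \frac{e^{-\beta E_w(y)}}{\int e^{-\beta E_w(y')} \, dy'} \quad\Rightarrow\quad -\log p_w(y) = \beta E_w(y) + \log \int e^{-\beta E_w(y')} \, dy'$$

$$\frac{\partial}{\partial w}\left[-\log p_w(y)\right] = \beta \, \frac{\partial E_w(y)}{\partial w} - \beta \, \mathbb{E}_{y' \sim p_w}\left[\frac{\partial E_w(y')}{\partial w}\right]$$

The first term pushes the energy of the observed data point down, and the second pushes up the energy of points the model currently considers likely. Nothing in this objective limits how far apart the two end up.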


Therefore, the loss must be regularized to keep the energy smooth (e.g., Wasserstein GAN). This is
done so that gradient-based inference works.

If we don't regularize properly, we end up with something like a Dirac distribution at every data point and nothing in between. By throwing away the probabilistic framework,
 we lose the ability to make numerical predictions about how likely a datapoint
 is; we can simply compare it to others.

 - Probabilistic models will always put a point into global space and compare.
 - An energy function is a scalar-valued function that just gives us a number.


 Some algorithms are about representation learning.

 Some algorithms are about manifold learning.

 Here we are learning to rectify corrupted input data. We can either modify our manifold, by twisting
 and cutting it, or we can learn the manifold in the data and ensure our network
 projects onto it.

  ##### What about perturbations in the visual domain?

  One possible reason could be that learned feature invariances make it hard to perturb
  the input images. Maybe we don't have good ways to throw faces off the manifold.