# Understanding Energy Functions

This document exists to sort out all the confusion in my head about energy, free energy, etc.


## Energy and Boltzmann Machines
The "energy" of a [Boltzmann Machine](http://en.wikipedia.org/wiki/Boltzmann_machine) with state vector $x$, weights $w$ and biases $b$ is defined as
$$
E(x) = -\sum_{i<j}x_i w_{ij} x_j - \sum_i b_i x_i
$$

A Boltzmann Distribution is a distribution over possible states of the Boltzmann Machine.  It is defined as:
$$
\begin{align}
p(x) &= \mathrm{Boltzmann}(w, b) \\
&\equiv \frac{e^{-E(x)}}{\sum_{x'}e^{-E(x')}}
\end{align}
$$
Here the term $\sum_{x'}e^{-E(x')}$ is a sum over all possible values of the vector $x$ (we assume $x$ is discrete), which, if we have $N$ binary units, means $2^N$ terms.  This term is often denoted $Z$ and called the "normalizing constant" (or partition function).  Because the number of terms grows exponentially with $N$, computing $Z$ exactly is intractable for all but small machines, so it must be estimated if we want the probability of a given state.
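
To make the definitions concrete, here is a minimal sketch that evaluates the energy and the full Boltzmann distribution by brute-force enumeration. It only works because the example has 3 units (8 states); the weights and biases are made-up values, and the function names `energy` and `boltzmann_distribution` are my own, not from any library.

```python
import itertools
import math

def energy(x, w, b):
    """E(x) = -sum_{i<j} x_i w_ij x_j - sum_i b_i x_i for a binary state x."""
    n = len(x)
    pairwise = sum(x[i] * w[i][j] * x[j] for i in range(n) for j in range(i + 1, n))
    bias = sum(b[i] * x[i] for i in range(n))
    return -pairwise - bias

def boltzmann_distribution(w, b):
    """Return {state: p(state)} by enumerating all 2^N binary states."""
    n = len(b)
    states = list(itertools.product([0, 1], repeat=n))
    unnormalized = [math.exp(-energy(x, w, b)) for x in states]
    z = sum(unnormalized)  # the normalizing constant Z
    return {x: u / z for x, u in zip(states, unnormalized)}

# Hypothetical 3-unit machine: symmetric weights with a zero diagonal.
w = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.3],
     [-0.5, 0.3, 0.0]]
b = [0.1, -0.2, 0.0]

p = boltzmann_distribution(w, b)
for state, prob in sorted(p.items()):
    print(state, round(prob, 4))
```

The loop over `itertools.product` is exactly the $2^N$-term sum, which is why this approach stops being usable beyond a few dozen units.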

In Boltzmann Machines, we generally restrict units to be binary.  So, given some configuration $x$, the total energy differs by $\Delta E_i(x) = E(x_i=0) - E(x_i=1) = \sum_j w_{ji} x_j+b_i$ depending on whether unit $i$ is on or not.  From the above equations we can derive (see [ref](http://en.wikipedia.org/wiki/Boltzmann_machine#Probability_of_a_unit.27s_state)) the function relating the probability of a unit being on, given the states of the other units, to the difference in energy associated with that unit:

$$
p(x_i=1 \mid x_{j \neq i}) = \frac{1}{1+e^{-\left(\sum_j w_{ji} x_j+b_i\right)}} \equiv \mathrm{sigmoid}(w_{: i}\cdot x + b_i)
$$
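
As a sanity check on this result, the sketch below compares the sigmoid formula against the conditional probability computed directly from the two energies (unit $i$ on versus off). The 3-unit weights are made-up values and the helper names are my own; the check assumes symmetric weights with a zero diagonal.

```python
import itertools
import math

def energy(x, w, b):
    """E(x) = -sum_{i<j} x_i w_ij x_j - sum_i b_i x_i."""
    n = len(x)
    return (-sum(x[i] * w[i][j] * x[j] for i in range(n) for j in range(i + 1, n))
            - sum(b[i] * x[i] for i in range(n)))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def conditional_on(i, x, w, b):
    """p(x_i = 1 | other units), computed from the energies of the two states."""
    x_on = list(x)
    x_on[i] = 1
    x_off = list(x)
    x_off[i] = 0
    e_on = math.exp(-energy(x_on, w, b))
    e_off = math.exp(-energy(x_off, w, b))
    return e_on / (e_on + e_off)

# Hypothetical 3-unit machine: symmetric weights with a zero diagonal.
w = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.3],
     [-0.5, 0.3, 0.0]]
b = [0.1, -0.2, 0.0]

# Largest disagreement between the two formulas over every state and unit.
max_gap = 0.0
for x in itertools.product([0, 1], repeat=3):
    for i in range(3):
        activation = sum(w[j][i] * x[j] for j in range(3)) + b[i]
        gap = abs(conditional_on(i, x, w, b) - sigmoid(activation))
        max_gap = max(max_gap, gap)
print("largest disagreement:", max_gap)
```

The normalizing constant cancels in the ratio, which is why the conditional is cheap to compute even when $Z$ is not.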

Good, so we've derived the update equation.  Now let's just work in terms of log-probability.


## Gibbs Sampling

One way to estimate the marginal distribution of a unit (that is, its probability of being on under the Boltzmann distribution) is through Gibbs sampling.  Here, we update units one-by-one through the previous equation:

$$
X_i \sim Bernoulli\big(sigmoid(w_{: i}\cdot x + b_i)\big)
$$
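
A minimal sketch of this procedure, using only the standard library, might look as follows. The sweep order, burn-in length, number of sweeps, and the example weights are all arbitrary choices of mine, and `gibbs_sweep` and `estimate_marginals` are hypothetical names.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gibbs_sweep(x, w, b, rng):
    """Resample every unit once, in order, from its conditional given the rest."""
    n = len(x)
    for i in range(n):
        activation = sum(w[j][i] * x[j] for j in range(n) if j != i) + b[i]
        x[i] = 1 if rng.random() < sigmoid(activation) else 0
    return x

def estimate_marginals(w, b, n_sweeps=5000, burn_in=500, seed=0):
    """Estimate p(x_i = 1) for each unit by averaging states after burn-in."""
    rng = random.Random(seed)
    n = len(b)
    x = [rng.randint(0, 1) for _ in range(n)]  # random initial state
    counts = [0] * n
    for sweep in range(n_sweeps):
        gibbs_sweep(x, w, b, rng)
        if sweep >= burn_in:
            for i in range(n):
                counts[i] += x[i]
    kept = n_sweeps - burn_in
    return [c / kept for c in counts]

# Hypothetical 3-unit machine: symmetric weights with a zero diagonal.
w = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.3],
     [-0.5, 0.3, 0.0]]
b = [0.1, -0.2, 0.0]

marginals = estimate_marginals(w, b)
print([round(m, 3) for m in marginals])
```

Nothing here needs $Z$: each update only uses the local conditional, at the cost of the estimate being noisy and dependent on how long the chain runs.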

## Learning weights to Minimize Energy

In a Boltzmann Machine, we want to build a model of the data.  To do this, we try to maximize the probability (minimize the energy) of the data ("wake" phase), while minimizing the probability (maximizing the energy) of the samples generated from the model ("sleep" phase). 

Taking the derivative of the energy with respect to the parameters, for a given data point, we get:
$$
\frac{\partial E}{\partial w_{ij}} = -x_i x_j \qquad
\frac{\partial E}{\partial b_i} = -x_i
$$

We can then use these gradients to do parameter updates:

$$
\Delta w_{ij} = \eta \big({<x_i x_j>_{wake} - <x_i x_j>_{sleep}}\big) \qquad
\Delta b_{i} = \eta \big({<x_i>_{wake} - <x_i>_{sleep}}\big)
$$

Here $<\cdot>_{wake}$ means "average over all data samples" and $<\cdot>_{sleep}$ means "average over samples from the model".  We average over samples because we're trying to maximize the probability of all the data samples together, which is the product of the probabilities of the individual data points.  Since we work with the log-probability, $\log p(x) = -E(x) - \log Z$, this product becomes a sum in log-space.  The wake term is the gradient of $-E$ at the data, and the sleep term is the gradient of $-\log Z$, which works out to an average over the model's own distribution.
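
The sketch below applies these updates to a tiny, fully visible machine. To keep it deterministic, it computes the sleep averages exactly by enumerating all states instead of sampling from the model; that substitution is my own simplification and is only possible for very small $N$. The data set, learning rate, and function names are made up for illustration.

```python
import itertools
import math

def energy(x, w, b):
    """E(x) = -sum_{i<j} x_i w_ij x_j - sum_i b_i x_i."""
    n = len(x)
    return (-sum(x[i] * w[i][j] * x[j] for i in range(n) for j in range(i + 1, n))
            - sum(b[i] * x[i] for i in range(n)))

def model_distribution(w, b):
    """List of (state, probability) pairs over all 2^N states."""
    states = list(itertools.product([0, 1], repeat=len(b)))
    unnorm = [math.exp(-energy(x, w, b)) for x in states]
    z = sum(unnorm)
    return [(x, u / z) for x, u in zip(states, unnorm)]

def pair_and_unit_means(weighted_states, n):
    """Return (<x_i x_j>, <x_i>) under a list of (state, weight) pairs."""
    pair = [[0.0] * n for _ in range(n)]
    unit = [0.0] * n
    for x, p in weighted_states:
        for i in range(n):
            unit[i] += p * x[i]
            for j in range(n):
                pair[i][j] += p * x[i] * x[j]
    return pair, unit

def mean_log_likelihood(data, w, b):
    p = dict(model_distribution(w, b))
    return sum(math.log(p[tuple(x)]) for x in data) / len(data)

def train(data, n, eta=0.1, steps=200):
    w = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    # Wake statistics depend only on the data, so compute them once.
    wake_pair, wake_unit = pair_and_unit_means([(x, 1.0 / len(data)) for x in data], n)
    for _ in range(steps):
        # Sleep statistics: exact model averages (a stand-in for sampling).
        sleep_pair, sleep_unit = pair_and_unit_means(model_distribution(w, b), n)
        for i in range(n):
            b[i] += eta * (wake_unit[i] - sleep_unit[i])
            for j in range(n):
                if i != j:  # keep the diagonal at zero; w stays symmetric
                    w[i][j] += eta * (wake_pair[i][j] - sleep_pair[i][j])
    return w, b

# Hypothetical data set of 3-bit patterns.
data = [(1, 1, 0), (1, 1, 1), (0, 0, 1), (0, 0, 0), (1, 1, 0)]
zeros_w = [[0.0] * 3 for _ in range(3)]
ll_before = mean_log_likelihood(data, zeros_w, [0.0] * 3)
w, b = train(data, 3)
ll_after = mean_log_likelihood(data, w, b)
print("mean log-likelihood before:", ll_before, "after:", ll_after)
```

In a realistic setting the exact `model_distribution` call would be replaced by samples from a Gibbs chain, which makes the sleep term noisy but avoids the $2^N$ enumeration.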

## Hidden Units

Our Boltzmann machine can model more complex distributions if we add "hidden" units.  We now separate our nodes x into two types of units, visible (v) and hidden (h).  Visible units correspond to the components of a data point, and hidden units can be freely sampled.  Hidden units can affect the probability distribution over "visible" units, without being constrained to match the probabilities of the data points.  The probability of a data point is now the sum of the probabilities of each hidden configuration in which the visible units have the given value.

$$
\begin{align}
p(v) &= \frac{\sum_h e^{-E(v, h)}}{\sum_{v', h'}e^{-E(v', h')}} \\
&\equiv \mathcal{F}(v)
\end{align}
$$

For convenience, we refer to the term XxXXXX
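
As an illustration of the marginal $p(v)$ defined above, this sketch sums over every hidden configuration for a tiny machine. Treating the first `n_visible` units of the state vector as visible is a convention I chose for the example; the weights and the name `visible_marginal` are likewise made up.

```python
import itertools
import math

def energy(x, w, b):
    """E(x) = -sum_{i<j} x_i w_ij x_j - sum_i b_i x_i."""
    n = len(x)
    return (-sum(x[i] * w[i][j] * x[j] for i in range(n) for j in range(i + 1, n))
            - sum(b[i] * x[i] for i in range(n)))

def visible_marginal(w, b, n_visible):
    """Return {v: p(v)}, treating the first n_visible units as visible."""
    n_hidden = len(b) - n_visible
    totals = {}
    for v in itertools.product([0, 1], repeat=n_visible):
        # Numerator of p(v): sum over all hidden configurations h.
        totals[v] = sum(math.exp(-energy(v + h, w, b))
                        for h in itertools.product([0, 1], repeat=n_hidden))
    z = sum(totals.values())  # equals the sum over all (v', h') pairs
    return {v: t / z for v, t in totals.items()}

# Hypothetical machine: units 0 and 1 are visible, unit 2 is hidden.
w = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.3],
     [-0.5, 0.3, 0.0]]
b = [0.1, -0.2, 0.0]

p_v = visible_marginal(w, b, n_visible=2)
for v, prob in sorted(p_v.items()):
    print(v, round(prob, 4))
```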


## Restricted Boltzmann Machines

In an RBM, the nodes x form a bipartite graph over two types of units, visible (v) and hidden (h), with connections only between the two groups.  We can then rewrite the energy function as:

$$
E=-\sum_{ij}v_i w_{ij} h_j - \sum_i a_i v_i - \sum_j b_j h_j
$$

This means that 

$$
\frac{\partial E}{\partial w_{ij}} = -v_i h_j \qquad
\frac{\partial E}{\partial a_i} = -v_i \qquad
\frac{\partial E}{\partial b_j} = -h_j
$$

which leads to the update equations

$$
\Delta w_{ij} = \eta \big({<v_i h_j>_{wake} - <v_i h_j>_{sleep}}\big) \qquad
\Delta a_{i} = \eta \big({<v_i>_{wake} - <v_i>_{sleep}}\big) \qquad
\Delta b_{j} = \eta \big({<h_j>_{wake} - <h_j>_{sleep}}\big)
$$
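
Here is a rough sketch of one way to apply these RBM updates. Because the graph is bipartite, all hidden units can be sampled at once given the visible units, and vice versa. For the sleep statistics the sketch uses a single Gibbs step starting from the data (the contrastive divergence, CD-1, shortcut); that is an approximation I chose for brevity, not something the equations above prescribe. The data, sizes, learning rate, and function names are all made up.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, w, b, rng):
    """Sample each h_j from p(h_j = 1 | v) = sigmoid(sum_i v_i w_ij + b_j)."""
    return [1 if rng.random() < sigmoid(sum(v[i] * w[i][j] for i in range(len(v))) + b[j])
            else 0 for j in range(len(b))]

def sample_visible(h, w, a, rng):
    """Sample each v_i from p(v_i = 1 | h) = sigmoid(sum_j w_ij h_j + a_i)."""
    return [1 if rng.random() < sigmoid(sum(w[i][j] * h[j] for j in range(len(h))) + a[i])
            else 0 for i in range(len(a))]

def cd1_update(batch, w, a, b, eta, rng):
    """One in-place parameter update using CD-1 estimates of the sleep terms."""
    n_v, n_h = len(a), len(b)
    dw = [[0.0] * n_h for _ in range(n_v)]
    da = [0.0] * n_v
    db = [0.0] * n_h
    for v in batch:
        h = sample_hidden(v, w, b, rng)              # wake: hidden given data
        v_model = sample_visible(h, w, a, rng)       # sleep: one Gibbs step
        h_model = sample_hidden(v_model, w, b, rng)
        for i in range(n_v):
            da[i] += v[i] - v_model[i]
            for j in range(n_h):
                dw[i][j] += v[i] * h[j] - v_model[i] * h_model[j]
        for j in range(n_h):
            db[j] += h[j] - h_model[j]
    scale = eta / len(batch)  # turn sums into averages, then apply the step
    for i in range(n_v):
        a[i] += scale * da[i]
        for j in range(n_h):
            w[i][j] += scale * dw[i][j]
    for j in range(n_h):
        b[j] += scale * db[j]

# Hypothetical setup: 4 visible units, 2 hidden units, a tiny data set.
rng = random.Random(0)
data = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
w = [[0.0, 0.0] for _ in range(4)]
a = [0.0] * 4
b = [0.0] * 2
for _ in range(20):
    cd1_update(data, w, a, b, eta=0.1, rng=rng)
print(w)
```

Running the chain for more steps before collecting the sleep statistics gets closer to true model samples, at a proportionally higher cost per update.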

## So then what's all this Free-Energy business?

The above updates will work, but we can do better.  
