Restricted Boltzmann Machines
=================

A [Restricted Boltzmann machine (RBM)](https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. RBMs are a particular form of [Boltzmann Machines](https://en.wikipedia.org/wiki/Boltzmann_machine) subject to a **restriction**: there are no connections between units within the same group, so the network forms a bipartite graph (see below).

The RBM is made of two layers, each one having a certain number of units. The input units are called the **visible units** of the RBM because their states are observed. The feature detectors correspond to non-observed **hidden units**. The hidden units are often referred to as **latent variables**, as they do not result directly from the observed data. The two layers are connected through a matrix of weights. Units within the same layer are not connected to each other, meaning that the network forms a [bipartite graph](https://en.wikipedia.org/wiki/Bipartite_graph).

<p align="center">
<img src="../etc/img/rbm_architecture.png" width="400">
</p>


RBMs were predominantly used in an unsupervised setting to extract better
features from the input data.



Energy of a configuration
-----------------------------

The visible and hidden units are often organised as vectors, and a pair of visible-hidden vectors is called a **configuration**. A joint configuration of the visible and hidden units has an **energy** (see Hopfield, 1982) given by:

$$E(v,h) = -a^{T}v -b^{T}h -v^{T} Wh $$

where $W$ is the matrix of weights (size $m \times n$, for $m$ visible and $n$ hidden units) whose entry $W_{ij}$ is associated with the connection between visible unit $v_i$ and hidden unit $h_j$, $a$ is the vector of bias weights for the visible units, and $b$ is the vector of bias weights for the hidden units.
This definition of energy is the same as the one used in [Hopfield Networks](https://en.wikipedia.org/wiki/Hopfield_network).
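
The energy formula above translates almost literally into NumPy. The following is a minimal sketch rather than a reference implementation: the function name `energy` and the toy values of `W`, `a` and `b` are assumptions chosen only for illustration.

```python
import numpy as np

def energy(v, h, W, a, b):
    """Energy E(v, h) = -a^T v - b^T h - v^T W h of one joint configuration."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Toy RBM with m = 3 visible and n = 2 hidden units (arbitrary values).
W = np.array([[0.5, -0.2],
              [0.1,  0.3],
              [-0.4, 0.2]])
a = np.array([0.1, -0.1, 0.0])   # visible biases, shape (m,)
b = np.array([0.2, -0.3])        # hidden biases, shape (n,)

v = np.array([1, 0, 1])          # one visible vector
h = np.array([1, 0])             # one hidden vector
print(energy(v, h, W, a, b))     # lower energy means a more probable configuration
```

Note that the configuration with all units switched off always has zero energy, whatever the parameters are.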


Probability distributions over units
------------------------------------------

Using the energy, it is possible to define several probability distributions over the visible and hidden units. Here we consider the components of both $v$ and $h$ as Bernoulli random variables, meaning that the units are binary (however, there are ways to generalize to real-valued units).

The [joint probability](https://en.wikipedia.org/wiki/Joint_probability_distribution) of every possible pair of a visible and a hidden vector can be defined as follows:

$$ P(v,h) = \frac{1}{Z} e^{-E(v,h)}$$

where the **partition function** $Z$ is a normalisation factor, given by summing over all possible pairs of visible and hidden vectors:

$$ Z = \sum_{v,h} e^{-E(v,h)}$$

An important problem arises when $Z$ has to be computed. Since both $v$ and $h$ are binary, the number of values they can take is exponential in the number of units. To see this, consider $v$ and $h$ as bit arrays: an array with $8$ elements has $2^8 = 256$ possible combinations, and every additional bit doubles that number. For an RBM with $m$ visible and $n$ hidden units, the sum in $Z$ runs over $2^{m+n}$ terms, which makes the exact computation of the partition function **intractable** for all but very small models. The intractability of $Z$ can be managed using different forms of approximation (introduced later).
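
To make the cost concrete, the sketch below computes $Z$ by brute force for a tiny RBM by enumerating every binary configuration. It is only an illustration under assumed names (`partition_function`) and randomly chosen parameters: the double loop visits $2^{m+n}$ configurations, so it is hopeless beyond a few dozen units.

```python
import itertools
import numpy as np

def partition_function(W, a, b):
    """Brute-force Z: sum of exp(-E(v, h)) over all 2^(m+n) binary configurations."""
    m, n = W.shape
    Z = 0.0
    for v_bits in itertools.product([0, 1], repeat=m):
        v = np.array(v_bits)
        for h_bits in itertools.product([0, 1], repeat=n):
            h = np.array(h_bits)
            E = -(a @ v) - (b @ h) - (v @ W @ h)
            Z += np.exp(-E)
    return Z

rng = np.random.default_rng(0)
m, n = 4, 3                                # 2^(4+3) = 128 configurations
W = rng.normal(scale=0.1, size=(m, n))
a = np.zeros(m)
b = np.zeros(n)

Z = partition_function(W, a, b)
print(f"{2 ** (m + n)} configurations, Z = {Z:.4f}")

# With Z available, the joint probability of any single configuration follows.
v = np.array([1, 0, 1, 0])
h = np.array([0, 1, 1])
p_joint = np.exp(a @ v + b @ h + v @ W @ h) / Z
print(f"P(v, h) = {p_joint:.6f}")
```

A useful sanity check: when all weights and biases are zero every configuration has energy $0$, so $Z$ is simply the number of configurations.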

Using the energy it is also possible to compute the [marginal probability distribution](https://en.wikipedia.org/wiki/Marginal_distribution) on the visible units:

$$ P(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}$$

and on the hidden units:

$$ P(h) = \frac{1}{Z} \sum_{v} e^{-E(v,h)}$$

Since both $P(v)$ and $P(h)$ require the partition function, we cannot directly estimate them.
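
It is worth noting that only $Z$ is problematic, not the sum in the numerator. For binary hidden units the exponent is a sum of per-unit terms in $h$, so the sum over all $2^n$ hidden vectors factorises into a product of $n$ two-term sums. A short derivation from the energy defined above (the index notation $W_{ij}$, $a_i$, $b_j$ is used here for illustration):

$$ \sum_{h} e^{-E(v,h)} = e^{a^{T}v} \sum_{h} \prod_{j=1}^{n} e^{h_{j} \left(b_{j} + \sum_{i=1}^{m} v_{i} W_{ij}\right)} = e^{a^{T}v} \prod_{j=1}^{n} \left(1 + e^{b_{j} + \sum_{i=1}^{m} v_{i} W_{ij}}\right) $$

The unnormalised probability of a visible vector therefore costs only on the order of $mn$ operations; it is the division by $Z$ that puts $P(v)$ itself out of reach.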

Using the energy we can also obtain the [conditional probability distribution](https://en.wikipedia.org/wiki/Conditional_probability_distribution) of the $m$ visible units given the hidden units:


$$ P(v|h) = \prod_{i=1}^{m} P(v_{i} | h) $$

and of the $n$ hidden units given the visible units:

$$ P(h|v) = \prod_{j=1}^{n} P(h_{j} | v) $$

In both cases we used the fact that the visible units are conditionally independent given the hidden units (and vice versa), which follows from the absence of connections within a layer. Unlike the joint and marginal probability distributions, the conditional distributions are **tractable** because the partition function appears in neither $P(v|h)$ nor $P(h|v)$.
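
For binary units each factor in the products above takes a logistic form, $P(h_{j}=1|v) = \sigma(b_{j} + \sum_{i} v_{i} W_{ij})$ and $P(v_{i}=1|h) = \sigma(a_{i} + \sum_{j} W_{ij} h_{j})$ with $\sigma(x) = 1/(1+e^{-x})$. This is a standard result that the text above does not spell out. The sketch below shows how cheap these conditionals are to evaluate and sample from; the function names and toy parameters are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """P(h_j = 1 | v) for every hidden unit j, shape (n,)."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """P(v_i = 1 | h) for every visible unit i, shape (m,)."""
    return sigmoid(a + W @ h)

def sample_bernoulli(p, rng):
    """Draw independent binary units, each equal to 1 with probability p."""
    return (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))   # m = 3 visible, n = 2 hidden
a = np.zeros(3)
b = np.zeros(2)

v = np.array([1.0, 0.0, 1.0])
ph = p_h_given_v(v, W, b)        # one probability per hidden unit
h = sample_bernoulli(ph, rng)    # binary hidden vector drawn from P(h | v)
pv = p_v_given_h(h, W, a)        # one probability per visible unit
print(ph, h, pv)
```

Alternating these two sampling steps (hidden given visible, then visible given hidden) is one step of block Gibbs sampling, which is possible precisely because all units in a layer can be sampled in parallel.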


RBM as an undirected graphical model (Markov Random Field)
-------------------------------------------------------------

An RBM can be described in terms of graphical models, in particular as an undirected graphical model, also known as a **Markov Random Field (MRF)**.

Training objective
---------------------



Contrastive divergence
-------------------------

To train an RBM despite the intractable partition function, we can use a method proposed by Hinton called **Contrastive Divergence (CD)**.
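
As a rough illustration, one common variant (CD-1, with a single Gibbs step) can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `cd1_update`, the learning rate `lr`, the batch layout (one example per row) and the choice to use hidden probabilities rather than samples in the update statistics are all choices made here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr, rng):
    """One CD-1 step on a batch v0 of shape (batch, m); returns updated (W, a, b)."""
    # Positive phase: hidden probabilities and samples driven by the data.
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step produces a "reconstruction" of the data.
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # Approximate gradient: data statistics minus reconstruction statistics.
    batch = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / batch
    a = a + lr * (v0 - v1).mean(axis=0)
    b = b + lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

# Toy data: two binary patterns repeated, m = 4 visible and n = 2 hidden units.
rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 8, dtype=float)
W = rng.normal(scale=0.01, size=(4, 2))
a = np.zeros(4)
b = np.zeros(2)

for epoch in range(200):
    W, a, b = cd1_update(data, W, a, b, lr=0.1, rng=rng)

print(W)
```

The key point is that the update never touches $Z$: it only needs the tractable conditionals $P(h|v)$ and $P(v|h)$.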


Resources
------------

- A Practical Guide to Training RBMs by Geoffrey Hinton [[pdf]](https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf)
- Notes on Contrastive Divergence [[pdf]](http://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf)
- Contrastive Divergence, the original article by Hinton [[web]](https://www.mitpressjournals.org/doi/abs/10.1162/089976602760128018)