# Restricted Boltzmann Machines and Deep Belief Networks

## Restricted Boltzmann Machines

### Theory

A Restricted Boltzmann Machine (RBM) is a probabilistic graphical model that consists of visible random variables (or units) $\mathbf{v}$ and hidden random variables (or units) $\mathbf{h}$, where:

$$\mathbf{v}=\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_N \end{bmatrix}$$

And:

$$\mathbf{h}=\begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_M \end{bmatrix}$$

See [section 16.5](https://www.deeplearningbook.org/contents/graphical_models.html) in the book titled "Deep Learning" by Ian Goodfellow et al. for why the hidden vector $\mathbf{h}$ is necessary.

The joint distribution represented by all [Boltzmann Machines](https://en.wikipedia.org/wiki/Boltzmann_machine) (not just an RBM) is:

$$p(\mathbf{v},\mathbf{h};\mathbf{W},\mathbf{a},\mathbf{b}) = \frac{1}{Z} e^{-E(\mathbf{v},\mathbf{h};\mathbf{W},\mathbf{a},\mathbf{b})}$$

Where $\mathbf{W}$ is a parameter matrix, $\mathbf{a}$ and $\mathbf{b}$ are parameter vectors, and:

$$E(\mathbf{v},\mathbf{h};\mathbf{W},\mathbf{a},\mathbf{b}) = -\mathbf{a}^T\mathbf{v} - \mathbf{b}^T\mathbf{h} - \mathbf{v}^T\mathbf{W}\mathbf{h}$$

$$Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{W},\mathbf{a},\mathbf{b})}$$

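$E$ is known as the **energy function** and $Z$ is known as the **partition function**. To make parameter estimation (learning) and inference tractable, a **Restricted** Boltzmann Machine allows no direct visible-visible or hidden-hidden interactions (the energy function above contains only the cross term $\mathbf{v}^T\mathbf{W}\mathbf{h}$), which gives the following conditional independence properties: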

$$ p(\mathbf{v}|\mathbf{h}) = \prod_{i=1}^N p(v_i|\mathbf{h})$$
$$p(\mathbf{h}|\mathbf{v}) = \prod_{j=1}^M p(h_j|\mathbf{v})$$

In words, it is assumed that the individual visible random variables (units) are conditionally independent given all the hidden units $\mathbf{h}$. Also, it is assumed that the individual hidden units are conditionally independent given all the visible units $\mathbf{v}$.
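As a sanity check on the definitions above, the following sketch enumerates every state of a very small binary RBM, computes the partition function $Z$ by brute force, and compares $p(\mathbf{h}|\mathbf{v})$ obtained from the joint distribution with the product of per-unit probabilities. The parameter values and the helper names (`energy`, `joint`, `conditional_h_given_v`, `factorized_h_given_v`) are made up for illustration, and it assumes binary units, for which $p(h_j = 1|\mathbf{v}) = \sigma(b_j + \sum_i v_i W_{ij})$.

```python
import itertools
import math

# Hypothetical tiny binary RBM: 3 visible units, 2 hidden units.
W = [[0.5, -0.3],
     [0.1, 0.8],
     [-0.7, 0.2]]          # shape: visible x hidden
a = [0.1, -0.2, 0.3]       # visible biases
b = [-0.4, 0.6]            # hidden biases


def energy(v, h):
    """E(v, h) = -a^T v - b^T h - v^T W h."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - interaction


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


visible_states = list(itertools.product([0, 1], repeat=3))
hidden_states = list(itertools.product([0, 1], repeat=2))

# Partition function: sum over every joint configuration.
Z = sum(math.exp(-energy(v, h)) for v in visible_states for h in hidden_states)


def joint(v, h):
    return math.exp(-energy(v, h)) / Z


def conditional_h_given_v(h, v):
    """p(h | v) computed directly from the joint distribution."""
    p_v = sum(joint(v, h_other) for h_other in hidden_states)
    return joint(v, h) / p_v


def factorized_h_given_v(h, v):
    """Product of the per-unit Bernoulli probabilities p(h_j | v)."""
    prob = 1.0
    for j, h_j in enumerate(h):
        p_on = sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
        prob *= p_on if h_j == 1 else 1.0 - p_on
    return prob


max_gap = max(abs(conditional_h_given_v(h, v) - factorized_h_given_v(h, v))
              for v in visible_states for h in hidden_states)
total_probability = sum(joint(v, h) for v in visible_states for h in hidden_states)
print(max_gap, total_probability)
```

Enumeration like this is only feasible for a handful of units, since the number of joint states grows as $2^{N+M}$, which is why sampling is needed for realistic models.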

It is desirable to estimate the marginal probability distribution $p(\mathbf{v})$ of all the visible units $\mathbf{v}$, as this reveals relationships between the visible units. This marginal probability distribution can be obtained by marginalizing the joint distribution $p(\mathbf{v},\mathbf{h})$ over all the hidden units $\mathbf{h}$:

$$p(\mathbf{v};\mathbf{W},\mathbf{a},\mathbf{b}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{W},\mathbf{a},\mathbf{b})}$$

This marginal probability distribution can therefore be estimated using maximum likelihood estimation. Assuming that the visible units $\mathbf{v}$ have been observed, then the equation above also represents the likelihood function. The log-likelihood function is therefore:

$$ln(p(\mathbf{v};\mathbf{\theta})) = ln\left(\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right) - ln\left(\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right)$$

Where, for brevity, all of the parameters are collected into $\mathbf{\theta}$:

$$\theta = \begin{bmatrix} \mathbf{W} \\ \mathbf{a} \\ \mathbf{b} \end{bmatrix}$$

Since the parameters that maximize the log-likelihood function cannot be found in closed form (that is, they cannot be solved for analytically), gradient ascent is used to maximize it. The gradient of the log-likelihood function is:

$$\frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} = -\frac{1}{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} \sum_{\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right] +
\frac{1}{\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} \sum_{\mathbf{v},\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right]$$

Since:

$$p(\mathbf{h}|\mathbf{v}) = \frac{p(\mathbf{v},\mathbf{h})}{p(\mathbf{v})} = \frac{\frac{1}{Z} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{\frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} = \frac{e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}$$

Then:

$$\frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} = -\sum_{\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot p(\mathbf{h}|\mathbf{v}) \right] +
\frac{1}{\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} \sum_{\mathbf{v},\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}\right]$$

Also, since:

$$\frac{e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{Z} = \frac{e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}}{\sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h};\mathbf{\theta})}} = p(\mathbf{v},\mathbf{h})$$

Then:

$$\begin{align}
\frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} &= -\sum_{\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot p(\mathbf{h}|\mathbf{v}) \right] +
\sum_{\mathbf{v},\mathbf{h}} \left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \cdot p(\mathbf{v},\mathbf{h})\right] \\
&= -\mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] + \mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right]
\end{align}$$
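The identity above can be checked numerically on a model small enough to enumerate. The sketch below assumes a binary RBM with 2 visible and 2 hidden units and arbitrary parameter values, uses $\partial E / \partial W_{ij} = -v_i h_j$, and compares the difference of the two exact expectations with a central finite-difference estimate of the gradient of $ln(p(\mathbf{v}))$. The function names are illustrative only.

```python
import itertools
import math

# Hypothetical tiny binary RBM (2 visible, 2 hidden) so every sum can be enumerated.
W = [[0.5, -0.3], [0.1, 0.8]]
a = [0.1, -0.2]
b = [-0.4, 0.6]
states = list(itertools.product([0, 1], repeat=2))


def energy(v, h, weights):
    bias_terms = sum(a[i] * v[i] for i in range(2)) + sum(b[j] * h[j] for j in range(2))
    interaction = sum(v[i] * weights[i][j] * h[j] for i in range(2) for j in range(2))
    return -bias_terms - interaction


def log_likelihood(v, weights):
    numerator = sum(math.exp(-energy(v, h, weights)) for h in states)
    partition = sum(math.exp(-energy(u, h, weights)) for u in states for h in states)
    return math.log(numerator) - math.log(partition)


def exact_gradient(v, i, j):
    """-E_{p(h|v)}[dE/dW_ij] + E_{p(v,h)}[dE/dW_ij], with dE/dW_ij = -v_i h_j."""
    numerator = sum(math.exp(-energy(v, h, W)) for h in states)
    partition = sum(math.exp(-energy(u, h, W)) for u in states for h in states)
    # First term: expectation under p(h | v) with v fixed to the observation.
    data_term = sum(-(-v[i] * h[j]) * math.exp(-energy(v, h, W)) / numerator
                    for h in states)
    # Second term: expectation under the joint p(v, h).
    model_term = sum((-u[i] * h[j]) * math.exp(-energy(u, h, W)) / partition
                     for u in states for h in states)
    return data_term + model_term


def numerical_gradient(v, i, j, step=1e-6):
    """Central finite difference of ln p(v) with respect to W_ij."""
    plus = [row[:] for row in W]
    minus = [row[:] for row in W]
    plus[i][j] += step
    minus[i][j] -= step
    return (log_likelihood(v, plus) - log_likelihood(v, minus)) / (2 * step)


v_observed = (1, 0)
max_error = max(abs(exact_gradient(v_observed, i, j) - numerical_gradient(v_observed, i, j))
                for i in range(2) for j in range(2))
print(max_error)
```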



Since both terms of this gradient are expectations, they can be approximated using [Monte Carlo integration](http://people.duke.edu/~ccc14/sta-663-2019/notebook/S14D_Monte_Carlo_Integration.html#Intuition-behind-Monte-Carlo-integration):

$$\frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} \approx -\frac{1}{N} \sum_{i = 1}^{N} \left[\frac{\partial E(\mathbf{v},\mathbf{h}_i;\mathbf{\theta})}{\partial \mathbf{\theta}} \right] +
\frac{1}{M} \sum_{j=1}^{M} \left[\frac{\partial E(\mathbf{v}_j,\mathbf{h}_j;\mathbf{\theta})}{\partial \mathbf{\theta}} \right]$$

The first term can be computed because it is easy to sample from $p(\mathbf{h}|\mathbf{v})$. It is difficult to sample from $p(\mathbf{v},\mathbf{h})$ directly, but because it is also easy to sample from $p(\mathbf{v}|\mathbf{h})$, Gibbs sampling can alternate between $p(\mathbf{h}|\mathbf{v})$ and $p(\mathbf{v}|\mathbf{h})$ to produce approximate samples from $p(\mathbf{v},\mathbf{h})$. Note that in these approximations $N$ and $M$ denote the numbers of Monte Carlo samples, not the numbers of visible and hidden units. Also, notice that:

$$\begin{align}
\frac{\partial ln(p(\mathbf{v};\mathbf{\theta}))}{\partial \mathbf{\theta}} &= -\mathbb{E}_{p(\mathbf{h}|\mathbf{v})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] + \mathbb{E}_{p(\mathbf{v},\mathbf{h})}\left[\frac{\partial E(\mathbf{v},\mathbf{h};\mathbf{\theta})}{\partial \mathbf{\theta}} \right] \\
&\approx - \frac{1}{N} \sum_{i = 1}^{N} \left[\frac{\partial E(\mathbf{v},\mathbf{h}_i;\mathbf{\theta})}{\partial \mathbf{\theta}} \right] + \frac{1}{M} \sum_{j=1}^{M} \left[\frac{\partial E(\mathbf{v}_j,\mathbf{h}_j;\mathbf{\theta})}{\partial \mathbf{\theta}} \right] \\
&\approx \frac{\partial}{\partial \mathbf{\theta}} \left(\frac{1}{M} \sum_{j=1}^{M} \left[E(\mathbf{v}_j,\mathbf{h}_j;\mathbf{\theta}) \right] - \frac{1}{N} \sum_{i = 1}^{N} \left[E(\mathbf{v},\mathbf{h}_i;\mathbf{\theta}) \right] \right)
\end{align}$$

In words, this means that the gradient of the log-likelihood function can be approximated by the gradient of the difference of two sample means. Once the gradient of the log-likelihood function is estimated for a particular set of observed visible units $\mathbf{v}$, then the parameters $\mathbf{W},\mathbf{a},$ and $\mathbf{b}$ can be updated using the following gradient ascent update rules:

$$\mathbf{W}_{t+1} = \mathbf{W}_{t} + \epsilon \frac{\partial ln(p(\mathbf{v};\mathbf{W},\mathbf{a},\mathbf{b}))}{\partial \mathbf{W}}$$

$$\mathbf{a}_{t+1} = \mathbf{a}_{t} + \epsilon \frac{\partial ln(p(\mathbf{v};\mathbf{W},\mathbf{a},\mathbf{b}))}{\partial \mathbf{a}}$$

$$\mathbf{b}_{t+1} = \mathbf{b}_{t} + \epsilon \frac{\partial ln(p(\mathbf{v};\mathbf{W},\mathbf{a},\mathbf{b}))}{\partial \mathbf{b}}$$

Where $\epsilon$ is the learning rate and:

$$\frac{\partial ln(p(\mathbf{v};\mathbf{W},\mathbf{a},\mathbf{b}))}{\partial \mathbf{W}} \approx \frac{\partial}{\partial \mathbf{W}} \left(\frac{1}{M} \sum_{j=1}^{M} \left[E(\mathbf{v}_j,\mathbf{h}_j;\mathbf{W},\mathbf{a},\mathbf{b}) \right] - \frac{1}{N} \sum_{i = 1}^{N} \left[E(\mathbf{v},\mathbf{h}_i;\mathbf{W},\mathbf{a},\mathbf{b}) \right] \right)$$

$$\frac{\partial ln(p(\mathbf{v};\mathbf{W},\mathbf{a},\mathbf{b}))}{\partial \mathbf{a}} \approx \frac{\partial}{\partial \mathbf{a}} \left(\frac{1}{M} \sum_{j=1}^{M} \left[E(\mathbf{v}_j,\mathbf{h}_j;\mathbf{W},\mathbf{a},\mathbf{b}) \right] - \frac{1}{N} \sum_{i = 1}^{N} \left[E(\mathbf{v},\mathbf{h}_i;\mathbf{W},\mathbf{a},\mathbf{b}) \right] \right)$$

$$\frac{\partial ln(p(\mathbf{v};\mathbf{W},\mathbf{a},\mathbf{b}))}{\partial \mathbf{b}} \approx \frac{\partial}{\partial \mathbf{b}} \left(\frac{1}{M} \sum_{j=1}^{M} \left[E(\mathbf{v}_j,\mathbf{h}_j;\mathbf{W},\mathbf{a},\mathbf{b}) \right] - \frac{1}{N} \sum_{i = 1}^{N} \left[E(\mathbf{v},\mathbf{h}_i;\mathbf{W},\mathbf{a},\mathbf{b}) \right] \right)$$
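A minimal sketch of one such update is given below, assuming a binary RBM, a single observed visible vector, one sample per expectation, and a short Gibbs chain started at the data (a choice usually called contrastive divergence, which is an assumption here rather than something prescribed above). The gradient is obtained by letting PyTorch autograd differentiate the difference of the two sample means of the energy. The sizes, learning rate, and variable names are hypothetical.

```python
import torch

torch.manual_seed(0)
num_visible, num_hidden = 6, 3

# Hypothetical parameters of a binary RBM, tracked by autograd.
W = (0.1 * torch.randn(num_visible, num_hidden)).requires_grad_()
a = torch.zeros(num_visible, requires_grad=True)
b = torch.zeros(num_hidden, requires_grad=True)


def energy(v, h):
    """E(v, h) = -a^T v - b^T h - v^T W h, evaluated for each row of a batch."""
    return -(v @ a) - (h @ b) - ((v @ W) * h).sum(dim=1)


v_data = torch.tensor([[1., 0., 1., 1., 0., 0.]])   # one observed visible vector
num_gibbs_steps = 1

with torch.no_grad():
    # Sample for the first expectation: h ~ p(h | v_data).
    h_data = torch.bernoulli(torch.sigmoid(v_data @ W + b))
    # Approximate sample from p(v, h) by alternating the two conditionals.
    v_model, h_model = v_data, h_data
    for _ in range(num_gibbs_steps):
        v_model = torch.bernoulli(torch.sigmoid(h_model @ W.T + a))
        h_model = torch.bernoulli(torch.sigmoid(v_model @ W + b))

# Difference of the two sample means of the energy; its gradient
# approximates the gradient of the log-likelihood.
objective = energy(v_model, h_model).mean() - energy(v_data, h_data).mean()
objective.backward()
grad_W = W.grad.clone()

learning_rate = 0.1
with torch.no_grad():
    for param in (W, a, b):
        param += learning_rate * param.grad   # gradient ascent step
        param.grad.zero_()
```

For this energy function the autograd result coincides with the familiar closed form $\mathbf{v}\mathbf{h}^T$ evaluated at the data sample minus $\mathbf{v}\mathbf{h}^T$ evaluated at the model sample, so the same update could also be written without autograd.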

### Implementation

This implementation of an RBM is inspired by the paper titled [Restricted Boltzmann Machines for Collaborative Filtering](https://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf) by Ruslan Salakhutdinov et al., which will from now on be referenced as [1].

More precisely, one of the RBM architectures that the authors implemented involved categorical (or softmax) visible units and Bernoulli (or binary) hidden units. Also, in the paper, the authors use data from the [Netflix Prize](https://en.wikipedia.org/wiki/Netflix_Prize) competition that consists of approximately 100 million user ratings. 480,000 users gave approximately 18,000 movies a rating from 1 to 5. Therefore, the main objective of this implementation is to use these user ratings as the categorical visible units with Bernoulli hidden units and estimate the corresponding parameters using maximum likelihood estimation. However, to keep the implementation simple, only 50 movies per user will be considered.

This RBM will be implemented using PyTorch.

Before considering 50 movies per user, to make the categorical-Bernoulli RBM easier to understand, suppose instead that there are only 4 movie ratings per user and that the relationships between these movie ratings will be encoded in only 3 hidden Bernoulli units. Note that since user ratings range from 1 to 5, they will need to be expressed in one-hot encoding format (see [here](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) for an explanation).

This means that given that there are 4 movie ratings per user ranging from 1 to 5, then the visible units will be expressed in a $4 \times 5$ matrix, denoted $\mathbf{V}$, where the $i^{th}$ row corresponds to the $i^{th}$ movie per user, and the $j^{th}$ column of the $i^{th}$ row corresponds to whether that user gave a rating of $j$ to the $i^{th}$ movie. For example, $\mathbf{V}$ could be:

$$
\mathbf{V} =
\begin{bmatrix} 
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
\end{bmatrix}
$$

This means that this user gave the first movie a rating of 2, the second movie a rating of 3, the third movie a rating of 1, and the fourth movie a rating of 5.
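The encoding of this example can be reproduced with `torch.nn.functional.one_hot`. The snippet is a small sketch: the `ratings` tensor simply restates the four ratings from the example, and the shift by one is needed because `one_hot` expects zero-based class indices.

```python
import torch

# Ratings from the example above: movies 1-4 rated 2, 3, 1 and 5.
ratings = torch.tensor([2, 3, 1, 5])
num_categories = 5

# one_hot expects zero-based class indices, so shift ratings 1..5 to 0..4.
V = torch.nn.functional.one_hot(ratings - 1, num_classes=num_categories)
print(V)

# Recover the ratings from the one-hot matrix.
recovered = V.argmax(dim=1) + 1
print(recovered)
```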

More generally, given $M$ movie ratings per user, with each rating ranging from 1 to $N$, $\mathbf{V}$ is an $M \times N$ matrix:

$$
\mathbf{V} = 
\begin{bmatrix}
v_{11} & v_{12} & v_{13} & \dots & v_{1N} \\
v_{21} & v_{22} & v_{23} & \dots & v_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
v_{M1} & v_{M2} & v_{M3} & \dots & v_{MN} \\
\end{bmatrix}
$$


In code, one example of a $\mathbf{V}$ matrix is: 

In [88]:
import torch

num_visible = 4
num_categories = 5
V = torch.randn(num_visible,num_categories)
# make sure each row is a categorical distribution
V = torch.nn.functional.softmax(V,dim = 1)
# sample
V = torch.distributions.one_hot_categorical.OneHotCategorical(probs = V).sample().to(torch.int)
V

tensor([[0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0]], dtype=torch.int32)

Given that the hidden units are Bernoulli (binary), then it is sufficient to model the hidden units as a $K \times 1$ column vector:

$$
\mathbf{h} = 
\begin{bmatrix}
h_1 \\
h_2 \\
h_3 \\
\vdots \\
h_K
\end{bmatrix}
$$

For example, in the case of 3 hidden units, $\mathbf{h}$ could be:

$$
\mathbf{h} = 
\begin{bmatrix}
1 \\
0 \\
1
\end{bmatrix}
$$

In code, one example of $\mathbf{h}$ is:

In [89]:
num_hidden = 3
h = torch.randn(num_hidden)
# make sure each element is a Bernoulli distribution
h = torch.sigmoid(h)
# sample
h = torch.distributions.bernoulli.Bernoulli(probs = h).sample().to(torch.int)
h

tensor([1, 0, 0], dtype=torch.int32)

According to equation (1) in section 2.1 of [1], the conditional probability that the element in the $i^{th}$ row and $j^{th}$ column of $\mathbf{V}$ is 1, given the hidden vector $\mathbf{h}$, is modelled with a softmax (multinomial logistic) model:

$$
p(v_{ij} = 1|\mathbf{h}) = \frac{\exp(\mathbf{w}_{ij}^T \mathbf{h} + b_{ij})}{\sum_{l=1}^{N} \exp(\mathbf{w}_{il}^T \mathbf{h} + b_{il})}
$$

Where:
* $\mathbf{w}_{ij}$ is the weight vector associated with the $i^{th}$ row and $j^{th}$ column of $\mathbf{V}$.
* $b_{ij}$ is the scalar bias associated with the $i^{th}$ row and $j^{th}$ column of $\mathbf{V}$.

In other words, each row of $\mathbf{V}$ is sampled from a categorical distribution then converted to one-hot encoding.

Note that since an inner product is equivalent to an element-wise multiplication followed by a summation, and if:

$$
\mathbf{w}_{ij} =
\begin{bmatrix}
w_{ij}^1 \\
w_{ij}^2 \\
w_{ij}^3 \\
\vdots \\
w_{ij}^K
\end{bmatrix}
$$

Then the above conditional probability distribution can also be expressed as:

$$
p(v_{ij} = 1|\mathbf{h}) = \frac{\exp(\sum_p [w_{ij}^p h_p] + b_{ij})}{\sum_{l=1}^{N} \exp(\sum_p [w_{il}^p h_p] + b_{il})}
$$

Which is how this conditional probability distribution is expressed in equation (1) in section 2.1 in [1].

Since each element of $\mathbf{V}$ is associated with a weight vector $\mathbf{w}_{ij}$ and bias $b_{ij}$, then it is helpful to express these weights and biases as matrices. The weight vectors $\mathbf{w}_{ij}$ can be stored in a 3-dimensional matrix, denoted $\mathbf{W}$:

$$
\mathbf{W} = 
\begin{bmatrix}
\mathbf{w}_{11} & \mathbf{w}_{12} & \mathbf{w}_{13} & \dots & \mathbf{w}_{1N} \\
\mathbf{w}_{21} & \mathbf{w}_{22} & \mathbf{w}_{23} & \dots & \mathbf{w}_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathbf{w}_{M1} & \mathbf{w}_{M2} & \mathbf{w}_{M3} & \dots & \mathbf{w}_{MN} \\
\end{bmatrix}
$$

This means that $\mathbf{W}$ has the dimensions $M \times N \times K$, where $K$ is the dimensionality of the hidden vector $\mathbf{h}$. In the case of 4 movie ratings per user, 5 possible ratings, and 3 hidden units, then $\mathbf{W}$ has a size of $4 \times 5 \times 3$, and each element is a vector of size $3$.

Similarly, the scalar biases can be stored in a 2-dimensional matrix, denoted $\mathbf{B}$:

$$
\mathbf{B} = 
\begin{bmatrix}
b_{11} & b_{12} & b_{13} & \dots & b_{1N} \\
b_{21} & b_{22} & b_{23} & \dots & b_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
b_{M1} & b_{M2} & b_{M3} & \dots & b_{MN} \\
\end{bmatrix}
$$

In the case of 4 movie ratings per user and 5 possible ratings, then $\mathbf{B}$ has a size of $4 \times 5$.

Now let:

$$
\mathbf{Y} = \sum_p [\mathbf{W}\mathbf{h}] + \mathbf{B} = 
\begin{bmatrix}
\sum_p [w_{11}^p h_p] + b_{11} & \sum_p [w_{12}^p h_p] + b_{12} & \dots & \sum_p [w_{1N}^p h_p] + b_{1N} \\
\sum_p [w_{21}^p h_p] + b_{21} & \sum_p [w_{22}^p h_p] + b_{22} & \dots & \sum_p [w_{2N}^p h_p] + b_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_p [w_{M1}^p h_p] + b_{M1} & \sum_p [w_{M2}^p h_p] + b_{M2} & \dots & \sum_p [w_{MN}^p h_p] + b_{MN} \\
\end{bmatrix}
$$

Where $\mathbf{W}\mathbf{h}$ represents [broadcasting](https://numpy.org/devdocs/user/theory.broadcasting.html) $\mathbf{h}$ over $\mathbf{W}$, such that:

$$
y_{ij} = \sum_p [w_{ij}^p h_p] + b_{ij}
$$

Therefore:

$$
p(v_{ij} = 1|\mathbf{h}) = \frac{\exp(\sum_p [w_{ij}^p h_p] + b_{ij})}{\sum_{l=1}^{N} \exp(\sum_p [w_{il}^p h_p] + b_{il})} = \frac{\exp(y_{ij})}{\sum_{l=1}^{N} \exp(y_{il})}
$$

The expression:

$$
p(v_{ij} = 1|\mathbf{h}) = \frac{\exp(y_{ij})}{\sum_{l=1}^{N} \exp(y_{il})}
$$

is exactly the softmax function evaluated over the columns of the $i^{th}$ row of $\mathbf{Y}$, so applying a row-wise softmax to $\mathbf{Y}$ is an efficient way of computing $p(v_{ij} = 1|\mathbf{h})$ for all $i$ and $j$.

An example of this is shown below:


In [90]:
num_visible = 4
num_categories = 5
num_hidden = 3

# W has a size of 4 x 5 x 3, as shown above
W = torch.randn(num_visible,num_categories,num_hidden)
# h has a size of 3, as shown above
h = torch.randn(num_hidden)
# B has a size of 4 x 5, as shown above
B = torch.randn(num_visible,num_categories)

# compute Wh, which involves broadcasting h over W. Note that broadcasting h over W
# returns the same shape as W, as shown below
Wh = torch.multiply(W,h)
Wh.shape

torch.Size([4, 5, 3])

In [91]:
# compute Y = \sum_p [Wh] + B: sum the broadcast product over the hidden dimension
# (dim = 2), then add the bias matrix B element-wise
Y = torch.sum(Wh,dim = 2) + B
Y

tensor([[-1.1063, -2.7987,  2.4970,  2.7969, -0.4388],
        [-0.9202,  1.0963,  1.0452, -0.2078,  1.0426],
        [ 0.0595,  2.1118, -1.3849, -0.0870, -0.7365],
        [ 1.0304, -0.0254, -0.8133,  0.0376, -3.9434]])

In [92]:
# compute the softmax of each row of Y (across its columns) to compute p(v_ij = 1|h)
p_V_given_h = torch.nn.functional.softmax(Y,dim = 1)
p_V_given_h

tensor([[0.0112, 0.0021, 0.4107, 0.5543, 0.0218],
        [0.0403, 0.3028, 0.2877, 0.0822, 0.2870],
        [0.0967, 0.7532, 0.0228, 0.0836, 0.0436],
        [0.5309, 0.1847, 0.0840, 0.1967, 0.0037]])

In [93]:
# check that each row is a categorical probability distribution by checking that
# each row sums to 1
p_V_given_h.sum(dim = 1)

tensor([1.0000, 1.0000, 1.0000, 1.0000])

In [94]:
# sample a matrix V given the categorical probability distributions
V = torch.distributions.one_hot_categorical.OneHotCategorical(probs = p_V_given_h).sample().to(torch.int)
V

tensor([[0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0]], dtype=torch.int32)
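The broadcast-and-sum computation of $\mathbf{Y}$ in the cells above can also be written as a single contraction over the hidden index $p$. The sketch below is an alternative formulation, not the method used in [1]; it assumes the same shapes as above, uses `torch.einsum` and the `@` operator, and checks that all three versions agree.

```python
import torch

torch.manual_seed(0)
num_visible, num_categories, num_hidden = 4, 5, 3

W = torch.randn(num_visible, num_categories, num_hidden)
B = torch.randn(num_visible, num_categories)
h = torch.bernoulli(torch.full((num_hidden,), 0.5))   # a binary hidden vector

# Broadcast-and-sum version, as in the cells above.
Y_broadcast = torch.sum(W * h, dim=2) + B

# Contraction over the hidden index p: y_ij = sum_p w_ij^p h_p + b_ij.
Y_einsum = torch.einsum("ijp,p->ij", W, h) + B

# A tensor-vector product contracts the last dimension of W with h as well.
Y_matmul = W @ h + B

p_V_given_h = torch.softmax(Y_einsum, dim=1)
print(torch.allclose(Y_broadcast, Y_einsum, atol=1e-6))
print(torch.allclose(Y_broadcast, Y_matmul, atol=1e-6))
```

The contraction avoids materializing the intermediate $M \times N \times K$ product, which may matter once the number of movies and hidden units grows.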

According to equation (2) in section 2.1 in [1], the conditional probability that the $p^{th}$ row of $\mathbf{h}$ is equal to 1 given that the matrix $\mathbf{V}$ is observed is also modelled using the logistic model:

$$
p(h_p = 1|\mathbf{V}) = \sigma\left(\sum_{i,j} \left[\mathbf{W}^{[p]} \odot \mathbf{V}\right] + c_p\right)
$$

Where:
* $\sigma(\cdot)$ is the [logistic function](https://en.wikipedia.org/wiki/Logistic_function).
* $\mathbf{W}^{[p]} \odot \mathbf{V}$ is the element-wise product of the $p^{th}$ $M \times N$ matrix of $\mathbf{W}$ and the matrix of one-hot encoded visible units $\mathbf{V}$. Since $\mathbf{W}$ has a size of $M \times N \times K$, where $K$ is the number of hidden units, then there are $K$ possible $M \times N$ weight matrices, denoted as $\mathbf{W}^{[1]},\mathbf{W}^{[2]},...,\mathbf{W}^{[K]}$. The concatenation of these weight matrices in the third dimension results in the original $\mathbf{W}$ matrix.
* $\sum_{i,j} \left[\mathbf{W}^{[p]} \odot \mathbf{V}\right]$ is the summation over all $M$ rows and $N$ columns of the resulting $M \times N$ matrix from the element-wise product of $\mathbf{W}^{[p]}$ and $\mathbf{V}$, which results in a scalar.
* $c_p$ is the scalar bias associated with the $p^{th}$ row of $\mathbf{h}$.

Note that the above conditional probability distribution is also equivalent to:

$$
p(h_p = 1|\mathbf{V}) = \sigma\left(\sum_{i,j} \left[w_{ij}^{[p]} v_{ij}\right] + c_p\right)
$$

Which is how this conditional probability distribution is expressed in equation (2) in section 2.1 in [1].

Since each element of $\mathbf{h}$ is associated with a weight matrix $\mathbf{W}^{[p]}$ and a scalar bias $c_p$, it is helpful to express these weights and biases in matrix and vector form. The weight matrices $\mathbf{W}^{[1]},\mathbf{W}^{[2]},...,\mathbf{W}^{[K]}$ can be concatenated together in the third dimension to form the original 3-dimensional weight matrix $\mathbf{W}$. The scalar biases $c_1,c_2,...,c_K$ can be stored in a vector $\mathbf{c}$:

$$
\mathbf{c} = 
\begin{bmatrix}
c_1 \\
c_2 \\
\vdots \\
c_K
\end{bmatrix}
$$

Now let:

$$
\mathbf{W} \odot \mathbf{V}
$$

be an $M \times N \times K$ 3-dimensional matrix that represents broadcasting element-wise multiplication of $\mathbf{V}$ over $\mathbf{W}$, such that $\mathbf{V}$ is multiplied individually and element-wise with each $M \times N$ weight matrix in $\mathbf{W}$.

Then let:

$$
\mathbf{q} = \sum_{i,j} [\mathbf{W} \odot \mathbf{V}] + \mathbf{c}
$$

be a $K$-dimensional vector such that:

$$
\mathbf{q} = \begin{bmatrix} \sum_{i,j} \left[\mathbf{W}^{[1]} \odot \mathbf{V}\right] + c_1 & \sum_{i,j} \left[\mathbf{W}^{[2]} \odot \mathbf{V}\right] + c_2 & \dots & \sum_{i,j} \left[\mathbf{W}^{[K]} \odot \mathbf{V}\right] + c_K \end{bmatrix}
$$

Where $\sum_{i,j} \left[\mathbf{W}^{[p]} \odot \mathbf{V}\right]$ represents the summation over all rows and columns of the $p^{th}$ $M \times N$ matrix $\mathbf{W}^{[p]} \odot \mathbf{V}$.

This means that:

$$
p(h_p = 1|\mathbf{V}) = \sigma\left(q_p\right)
$$

This derivation provides an efficient way of computing $p(h_p = 1|\mathbf{V})$ for all $p$. An example of this is shown below:

In [95]:
num_visible = 4
num_categories = 5
num_hidden = 3

# W has a size of 4 x 5 x 3, as shown above
W = torch.randn(num_visible,num_categories,num_hidden)
# V has a size of 4 x 5, as shown above (random values are used here only to
# illustrate the shapes; an actual V is one-hot encoded)
V = torch.randn(num_visible,num_categories)
# c has a size of 3
c = torch.randn(num_hidden)

# compute W \odot V, which involves broadcasting V over W. Also, the permute() method is
# needed so that broadcasting will occur. Since W has a shape of 4 x 5 x 3 and V has a
# shape of 4 x 5, W needs to be permuted to 3 x 4 x 5 so that the V matrix is
# individually multiplied by each of the three 4 x 5 matrices in W. The result has the
# permuted shape 3 x 4 x 5, as shown below
WV = torch.multiply(W.permute(2,0,1),V)
WV.shape

torch.Size([3, 4, 5])

In [96]:
# compute q = \sum_{i,j} W \odot V + c, which involves individually summing all rows and
# columns together in each of the three 4 x 5 matrices followed by the addition of the
# bias vector c
q = torch.sum(WV,dim = (1,2)) + c
q

tensor([ 6.9808, -0.0680, -1.2550])

In [97]:
# compute the individual probabilities p(h_p = 1|V)
p_h_given_V = torch.sigmoid(q)
p_h_given_V

tensor([0.9991, 0.4830, 0.2218])

In [98]:
# sample a hidden vector h given the three Bernoulli distributions
h = torch.distributions.bernoulli.Bernoulli(probs = p_h_given_V).sample().to(torch.int)
h

tensor([1, 1, 0], dtype=torch.int32)
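When $\mathbf{V}$ is actually one-hot encoded, only the weights belonging to the chosen ratings survive the element-wise product, so $q_p$ reduces to a sum of one weight per movie plus $c_p$. The sketch below illustrates this under the same shapes as above; the `ratings` tensor is a hypothetical example, and the indexing version is an alternative formulation rather than the approach taken in [1].

```python
import torch

torch.manual_seed(0)
num_visible, num_categories, num_hidden = 4, 5, 3

W = torch.randn(num_visible, num_categories, num_hidden)
c = torch.randn(num_hidden)

ratings = torch.tensor([2, 3, 1, 5])                      # hypothetical ratings
V = torch.nn.functional.one_hot(ratings - 1, num_categories).to(W.dtype)

# Full contraction over movies i and ratings j: q_p = sum_ij w_ij^p v_ij + c_p.
q_einsum = torch.einsum("ijp,ij->p", W, V) + c

# With one-hot V only the weights of the chosen ratings survive, so the same
# result is obtained by indexing W with (movie, chosen rating) pairs.
movie_index = torch.arange(num_visible)
q_indexed = W[movie_index, ratings - 1].sum(dim=0) + c

p_h_given_V = torch.sigmoid(q_einsum)
print(torch.allclose(q_einsum, q_indexed, atol=1e-6))
```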

Finally, the energy function associated with the categorical-Bernoulli RBM is:

$$
E(\mathbf{V},\mathbf{h}) = - \sum_{i,j,p} \left[(\mathbf{W}\mathbf{h}) \odot \mathbf{V}\right] - \sum_{i,j} [\mathbf{V} \odot \mathbf{B}] - \mathbf{c}^T \mathbf{h}
$$

Where:
* $\mathbf{W}\mathbf{h}$ is a $M \times N \times K$ 3-dimensional matrix that represents broadcasting $\mathbf{h}$ over $\mathbf{W}$, as discussed above.
* $(\mathbf{W}\mathbf{h}) \odot \mathbf{V}$ is a $M \times N \times K$ 3-dimensional matrix that represents the element-wise and individual multiplication of $\mathbf{V}$ with each $M \times N$ matrix in $\mathbf{W}\mathbf{h}$.
* $\sum_{i,j,p} \left[(\mathbf{W}\mathbf{h}) \odot \mathbf{V}\right]$ is a scalar which is equal to the summation over all three dimensions of the matrix $(\mathbf{W}\mathbf{h}) \odot \mathbf{V}$.
* $\mathbf{V} \odot \mathbf{B}$ is a $M \times N$ matrix that represents the element-wise product of $\mathbf{V}$ and $\mathbf{B}$, as discussed above.
* $\sum_{i,j} [\mathbf{V} \odot \mathbf{B}]$ is a scalar that represents the summation over all rows and columns of the $M \times N$ matrix $\mathbf{V} \odot \mathbf{B}$.
* $\mathbf{c}^T \mathbf{h}$ is a scalar that represents the inner product between $\mathbf{c}$ and $\mathbf{h}$.

Here is how the energy function is computed in code:

In [99]:
num_visible = 4
num_categories = 5
num_hidden = 3

V = torch.randn(num_visible,num_categories)
h = torch.randn(num_hidden)
W = torch.randn(num_visible,num_categories,num_hidden)
B = torch.randn(num_visible,num_categories)
c = torch.randn(num_hidden)

# first term of energy function

Wh = torch.multiply(W,h)
WhV = torch.multiply(Wh.permute(2,0,1),V)
first_term = torch.sum(WhV)
print(first_term)

# second term of energy function

VB = torch.multiply(V,B)
second_term = torch.sum(VB)
print(second_term)

# third term of energy function

third_term = torch.dot(c,h)
print(third_term)

# energy

energy = -first_term - second_term - third_term
print(energy)

tensor(1.1643)
tensor(5.9659)
tensor(-0.7572)
tensor(-6.3729)
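One way to see that this energy function is consistent with the conditional distribution of the hidden units is to note that switching $h_p$ from 0 to 1 lowers the energy by exactly $q_p$, so that $p(h_p = 1|\mathbf{V}) = \sigma(E(\mathbf{V}, h_p = 0) - E(\mathbf{V}, h_p = 1)) = \sigma(q_p)$. The following sketch checks this numerically with randomly chosen parameters and a hypothetical one-hot $\mathbf{V}$; the `energy` helper and the use of `torch.einsum` are illustrative choices.

```python
import torch

torch.manual_seed(0)
num_visible, num_categories, num_hidden = 4, 5, 3

W = torch.randn(num_visible, num_categories, num_hidden)
B = torch.randn(num_visible, num_categories)
c = torch.randn(num_hidden)
# Hypothetical one-hot visible matrix (zero-based rating indices 1, 2, 0, 4).
V = torch.nn.functional.one_hot(torch.tensor([1, 2, 0, 4]), num_categories).to(W.dtype)


def energy(V, h):
    """E(V, h) = -sum_ijp w_ij^p h_p v_ij - sum_ij v_ij b_ij - c^T h."""
    interaction = torch.einsum("ijp,ij,p->", W, V, h)
    return -interaction - torch.sum(V * B) - torch.dot(c, h)


# Pre-activations of the hidden units, as derived earlier.
q = torch.einsum("ijp,ij->p", W, V) + c

# For each hidden unit, compare the energy with the unit off and on while
# the remaining hidden units are held fixed.
h = torch.tensor([1., 0., 1.])
energy_gaps = []
for p in range(num_hidden):
    h_off, h_on = h.clone(), h.clone()
    h_off[p], h_on[p] = 0.0, 1.0
    energy_gaps.append(energy(V, h_off) - energy(V, h_on))
energy_gaps = torch.stack(energy_gaps)

print(torch.allclose(energy_gaps, q, atol=1e-4))
```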
