# 03. Gibbs Sampler for Factorization Machine

$\DeclareMathOperator*{\argmin}{arg~min}$

```python
import numpy as np
from typing import Optional
```

## From ALS to Gibbs Sampler

The ordinary least squares (OLS) solution coincides with the maximum a posteriori (MAP) estimate of a linear regression model with normally distributed errors and a uniform prior on the parameter.

That is, when the likelihood of $\theta$ is given by

$$
\begin{aligned}
p(y^{(d)} | \theta, x^{(d)}) &= \mathcal{N}(y^{(d)} | \theta x^{(d)}, \sigma_n^2),
\end{aligned}
$$

where $\sigma_n^2$ is the noise variance, and the prior on $\theta$ is uniform, the posterior distribution of $\theta$ is given by

$$
\begin{aligned}
p(\theta | y, x)
&\propto
p(y | \theta, x) p(\theta)
\\

&\propto
\prod_{d=1}^D \mathcal{N}(y^{(d)} | \theta x^{(d)}, \sigma_n^2) \times 1
\\

&\propto
\exp\left(-\frac{1}{2\sigma_n^2} \sum_{d=1}^D (y^{(d)} - \theta x^{(d)})^2\right)
\\

&\propto
\exp\left(-\frac{1}{2\sigma_n^2} \sum_{d=1}^D x^{(d)2} \left( \theta - \left( \sum_{d=1}^D x^{(d)2} \right)^{-1} \sum_{d=1}^D x^{(d)}y^{(d)} \right)^2 \right)
\\

&\propto
\mathcal{N}\left(\theta \middle| \mu_\theta^\star, \sigma_\theta^{\star 2} \right),
\end{aligned}
$$

where

$$
\begin{aligned}
\mu_\theta^\star
&=
\left( \sum_{d=1}^D x^{(d)2} \right)^{-1} \sum_{d=1}^D x^{(d)}y^{(d)},
\\

\sigma_\theta^{\star 2}
&=
\sigma_n^2 \left( \sum_{d=1}^D x^{(d)2} \right)^{-1}.
\end{aligned}
$$

Its MAP estimate, which coincides with the maximum likelihood estimate (MLE) because the uniform prior leaves only the likelihood to maximize,

$$
\hat \theta = \mu_\theta^\star
$$

is equivalent to the ordinary least squares solution.
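This equivalence can be checked numerically. The following is a minimal sketch on synthetic data; the variable names, the seed, and the values of `theta_true` and `sigma_n` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scalar regression data: y = theta_true * x + noise (illustrative values).
theta_true, sigma_n = 1.5, 0.3
x = rng.normal(size=200)
y = theta_true * x + sigma_n * rng.normal(size=200)

# Posterior mean and variance of theta under a uniform prior.
mu_star = np.sum(x * y) / np.sum(x ** 2)
var_star = sigma_n ** 2 / np.sum(x ** 2)

# Ordinary least squares without an intercept, for comparison.
theta_ols = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

print(mu_star, theta_ols, var_star)
```

The two estimates agree up to floating-point error, while `var_star` shrinks as more data points are added.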

In a similar way, L2-regularized least squares is equivalent to the MAP estimation of a linear regression model with

- a normal likelihood $p(y^{(d)} | \theta, x^{(d)}) = \mathcal{N}(y^{(d)} | \theta x^{(d)}, \sigma_n^2)$ and
- a Gaussian prior $p(\theta) = \mathcal N(\theta | \mu_\theta, \sigma_\theta^2)$.
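As a rough illustration of this correspondence, the sketch below assumes a zero prior mean and the regularization strength $\lambda = \sigma_n^2 / \sigma_\theta^2$ (both are assumptions made for the example, as are the data and variable names). It compares the Gaussian-posterior MAP estimate with the ridge solution and checks that the ridge objective has zero gradient at the MAP estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.5 * rng.normal(size=100)

sigma_n2, sigma_theta2 = 0.25, 4.0  # noise and prior variances (illustrative)
mu_theta = 0.0                      # zero prior mean corresponds to plain ridge

# MAP estimate = mean of the Gaussian posterior.
precision = np.sum(x ** 2) / sigma_n2 + 1.0 / sigma_theta2
theta_map = (np.sum(x * y) / sigma_n2 + mu_theta / sigma_theta2) / precision

# Ridge objective: sum (y - theta x)^2 + lam * theta^2, with lam = sigma_n2 / sigma_theta2.
lam = sigma_n2 / sigma_theta2
theta_ridge = np.sum(x * y) / (np.sum(x ** 2) + lam)

# Gradient of the ridge objective evaluated at the MAP estimate.
grad_at_map = -2.0 * np.sum(x * (y - theta_map * x)) + 2.0 * lam * theta_map

print(theta_map, theta_ridge, grad_at_map)
```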

What if, at each ALS iteration, we sample each parameter $\theta$ from its posterior distribution instead of computing its MAP estimate?

In this case, we are drawing samples from the posterior distribution of $\theta$ given $x, y$ and the other parameters $\Theta \setminus \{ \theta \}$, that is,

$$
\begin{aligned}
\tilde \theta | x, y, \Theta \setminus \{ \theta \} \sim p(\theta | x, y, \Theta \setminus \{ \theta \}) \quad \forall \theta \in \Theta.
\end{aligned}
$$

Sampling each parameter in turn from its distribution conditioned on all the others is precisely **Gibbs sampling**, a type of Markov chain Monte Carlo (MCMC) method.
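To make the idea concrete before moving to the FM, here is a minimal sketch of Gibbs sampling for a two-coefficient linear regression with a known noise variance and uniform priors. The function `gibbs_linear`, the data, and all constants are hypothetical and chosen only for illustration. Each coordinate is drawn from the conditional Gaussian derived above, with the residual of the other coordinate playing the role of $y$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with two features (illustrative values).
D, sigma_n = 500, 0.3
X = rng.normal(size=(D, 2))
theta_true = np.array([1.0, -2.0])
y = X @ theta_true + sigma_n * rng.normal(size=D)

def gibbs_linear(X, y, sigma_n, n_iter, rng):
    """Coordinate-wise Gibbs sampler under uniform priors and known noise."""
    n_features = X.shape[1]
    theta = np.zeros(n_features)
    samples = np.empty((n_iter, n_features))
    for t in range(n_iter):
        for j in range(n_features):
            # Residual with the contribution of coordinate j removed.
            r = y - X @ theta + X[:, j] * theta[j]
            sxx = np.sum(X[:, j] ** 2)
            mean = np.sum(X[:, j] * r) / sxx
            std = sigma_n / np.sqrt(sxx)
            theta[j] = rng.normal(mean, std)
        samples[t] = theta
    return samples

samples = gibbs_linear(X, y, sigma_n, n_iter=2000, rng=rng)
posterior_mean = samples[500:].mean(axis=0)  # discard the first 500 as burn-in
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(posterior_mean, theta_ols)
```

Under these assumptions the average of the samples approaches the OLS solution, while the spread of the samples reflects the posterior uncertainty that a point estimate discards.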

## Factorization Machine as Hierarchical Bayesian Model

Following [Rendle+ (2012)], we build a hierarchical Bayesian model for FM, which is known as the **Bayesian factorization machine** (BFM).

The hierarchical structure of the BFM is given by the graphical model below.

![alt text](figures/FM_03_hierarchical.svg)

Here,

- $\mathcal D = \{ (x^{(d)}, y^{(d)}) \}_{d=1, \dots, D}$ (denoted by $x, y$) is the given dataset,
- $\Theta_H = \{ \mu_b, \sigma_b^2, m_\theta, \lambda_\theta, a_\theta, b_\theta, a_n, b_n \}$ are given hyperparameters, and
- $\Theta = \{ b, w, v, \sigma_n^2 \} \cup \{ \mu_\theta, \sigma_\theta^2 \}_{\theta \in \{ w, v \}}$ are the parameters to be sampled during training.

The distributions of the parameters and outputs are given by

$$
\begin{aligned}
p(\{ y_\theta^{(d)} \} | \{ x_\theta^{(d)} \}, \Theta, \sigma_n^2)
&= \prod_{d=1}^D \mathcal{N}(y_\theta^{(d)} | \theta x_\theta^{(d)}, \sigma_n^2),
&& \theta \in \{ b, w, v \},
\\

p(\sigma_n^2 | a_n, b_n)
&= \mathcal{IG}(\sigma_n^2 | a_n, b_n),
\\

p(\theta | \mu_\theta, \sigma_\theta^2)
&= \mathcal{N}(\theta | \mu_\theta, \sigma_\theta^2),
&& \theta \in \{ b, w, v \},
\\

p(\mu_\theta | m_\theta, \lambda_\theta, \sigma_\theta^2)
&= \mathcal{N}(\mu_\theta | m_\theta, \sigma_\theta^2 / \lambda_\theta),
&& \theta \in \{ w, v \},
\\

p(\sigma_\theta^2 | a_\theta, b_\theta)
&= \mathcal{IG}(\sigma_\theta^2 | a_\theta, b_\theta),
&& \theta \in \{ w, v \}.
\end{aligned}
$$

Bayes' theorem states that the posterior distribution is proportional to the product of the prior distribution and likelihood for each parameter, that is,

$$
\begin{aligned}
p(a | b, x) &\propto p(b | a, x) p(a | x).
\end{aligned}
$$

The conjugacy of the assumed priors allows the posterior distribution to be calculated analytically by keeping only the relevant parameter term and discarding the rest.
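As a worked example of this step, consider a single parameter $\theta \in \{ b, w, v \}$. This is a sketch that uses only the likelihood and prior defined above, where "const" collects every term that does not depend on $\theta$. Completing the square in the log-posterior gives

$$
\begin{aligned}
\log p(\theta | \mathcal D, \Theta \setminus \{ \theta \}, \Theta_H)
&= -\frac{1}{2\sigma_n^2} \sum_{d=1}^D \left( y_\theta^{(d)} - \theta x_\theta^{(d)} \right)^2
- \frac{1}{2\sigma_\theta^2} (\theta - \mu_\theta)^2 + \mathrm{const}
\\
&= -\frac{1}{2} \left( \frac{1}{\sigma_n^2} \sum_{d=1}^D x_\theta^{(d)2} + \frac{1}{\sigma_\theta^2} \right) \theta^2
+ \left( \frac{1}{\sigma_n^2} \sum_{d=1}^D x_\theta^{(d)} y_\theta^{(d)} + \frac{\mu_\theta}{\sigma_\theta^2} \right) \theta + \mathrm{const}
\\
&= -\frac{1}{2 \sigma_\theta^{\star 2}} \left( \theta - \mu_\theta^\star \right)^2 + \mathrm{const},
\end{aligned}
$$

so the conditional posterior is Gaussian, with precision $1 / \sigma_\theta^{\star 2}$ equal to the coefficient of $\theta^2$ and mean $\mu_\theta^\star$ equal to $\sigma_\theta^{\star 2}$ times the coefficient of $\theta$.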

The resulting posterior distributions are

$$
\begin{aligned}
p(\sigma_n^2 | \mathcal D, \Theta \setminus \{ \sigma_n^2 \}, \Theta_H)
&= \mathcal{IG}(\sigma_n^2 | a_n^\star, b_n^\star),
\\

p(\theta | \mathcal D, \Theta \setminus \{ \theta \}, \Theta_H)
&= \mathcal{N}(\theta | \mu_\theta^\star, \sigma_\theta^{\star 2}),
&& \theta \in \{ b, w, v \},
\\

p(\mu_\theta | \Theta \setminus \{ \mu_\theta \}, \Theta_H)
&= \mathcal{N}(\mu_\theta | m_\theta^\star, \sigma_\theta^2 / \lambda_\theta^\star),
&& \theta \in \{ w, v \},
\\

p(\sigma_\theta^2 | \Theta \setminus \{ \sigma_\theta^2 \}, \Theta_H)
&= \mathcal{IG}(\sigma_\theta^2 | a_\theta^\star, b_\theta^\star),
&& \theta \in \{ w, v \},
\end{aligned}
$$

where

$$
\begin{aligned}
a_n^\star
&= a_n + \frac{D}{2},
\\

b_n^\star
&= b_n + \frac{1}{2} \sum_{d=1}^D (y_\theta^{(d)} - x_\theta^{(d)} \theta )^2,
\\

\mu_\theta^\star
&= \sigma_\theta^{\star 2}
\left( \frac{1}{\sigma_n^2} \sum_{d=1}^D x_\theta^{(d)} y_\theta^{(d)} + \frac{1}{\sigma_\theta^2} \mu_\theta \right),
&& \theta \in \{ b, w, v \},
\\

\sigma_\theta^{\star 2}
&= \left( \frac{1}{\sigma_n^2} \sum_{d=1}^D x_\theta^{(d)2} + \frac{1}{\sigma_\theta^2} \right)^{-1},
&& \theta \in \{ b, w, v \}
\\

m_\theta^\star
&= \frac{ 1 }{ \lambda_\theta^\star }
\left( \sum_{i=1}^N \theta_i + \lambda_\theta m_\theta \right),
&& \theta \in \{ w, v \}
\\

\lambda_\theta^\star
&= N + \lambda_\theta,
&& \theta \in \{ w, v \}
\\

a_\theta^\star &=
a_\theta + \frac{N + 1}{2},
&& \theta \in \{ w, v \}
\\

b_\theta^\star &=
b_\theta + \frac{1}{2} \left( \sum_{i=1}^N (\theta_i - \mu_\theta)^2 + \lambda_\theta(\mu_\theta - m_\theta)^2 \right),
&& \theta \in \{ w, v \}.

\end{aligned}
$$
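The updates for the noise variance and the prior hyperparameters translate almost directly into code. The sketch below is one possible rendering, not the notebook's own implementation: the function names are hypothetical, and the inverse-gamma draw assumes the shape/scale parameterization $\mathcal{IG}(a, b)$, obtained as the reciprocal of a gamma variate with shape $a$ and scale $1/b$.

```python
import numpy as np

def sample_inverse_gamma(a, b, rng):
    """Draw from IG(a, b) as the reciprocal of a Gamma(shape=a, scale=1/b) draw."""
    return 1.0 / rng.gamma(shape=a, scale=1.0 / b)

def sample_noise_variance(residual, a_n, b_n, rng):
    """sigma_n^2 ~ IG(a_n + D/2, b_n + 0.5 * sum of squared residuals)."""
    a_star = a_n + residual.size / 2
    b_star = b_n + 0.5 * np.sum(residual ** 2)
    return sample_inverse_gamma(a_star, b_star, rng)

def sample_prior_mean(theta, sigma2, m, lam, rng):
    """mu_theta ~ N(m_star, sigma2 / lam_star) for theta in {w, v}."""
    lam_star = theta.size + lam
    m_star = (np.sum(theta) + lam * m) / lam_star
    return rng.normal(m_star, np.sqrt(sigma2 / lam_star))

def sample_prior_variance(theta, mu, m, lam, a, b, rng):
    """sigma_theta^2 ~ IG(a + (N + 1)/2, b + 0.5 * (...)) for theta in {w, v}."""
    a_star = a + (theta.size + 1) / 2
    b_star = b + 0.5 * (np.sum((theta - mu) ** 2) + lam * (mu - m) ** 2)
    return sample_inverse_gamma(a_star, b_star, rng)

rng = np.random.default_rng(0)
w = rng.normal(loc=2.0, scale=0.5, size=50)  # stand-in for the current weights
mu_w, sigma2_w = 0.0, 1.0
for _ in range(200):
    # Alternate the two hyperparameter updates with w held fixed.
    mu_w = sample_prior_mean(w, sigma2_w, m=0.0, lam=1.0, rng=rng)
    sigma2_w = sample_prior_variance(w, mu_w, m=0.0, lam=1.0, a=1.0, b=1.0, rng=rng)

print(mu_w, sigma2_w)
```

With the weights held fixed as here, `mu_w` settles near their empirical mean, slightly shrunk toward `m`.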

Substituting

$$
\begin{aligned}
x_\theta^{(d)} &\coloneqq h_\theta^{(d)}, \\
y_\theta^{(d)} &\coloneqq y^{(d)} - g_\theta^{(d)}
\end{aligned}
$$

into the above distributions gives the Gibbs sampler algorithm for the BFM.

As in the ALS case, we can speed up the algorithm by pre-computing and retaining the values of $f^{(d)}$ and $q^{(d)}_k$ which are given by

$$
\begin{aligned}
f^{(d)} &\coloneqq f(x^{(d)}),
\\
q^{(d)}_k &\coloneqq \sum_{j=1}^N v_{jk} x^{(d)}_j,
\end{aligned}
$$

and updating them incrementally, each time a parameter is resampled, according to the following rules:

$$
\begin{aligned}
\Delta f^{(d)}
&= (\theta^{\rm new} - \theta) h_\theta^{(d)},
\\

\Delta q_k^{(d)}
&= x_i^{(d)} (v_{ik}^{\rm new} - v_{ik}),
\\

h_\theta^{(d)}
&=
\left\{\begin{aligned}
& 1,
&& \theta=b, \\
& x_i^{(d)},
&& \theta=w_i, \\
& x_i^{(d)} \left( q_k^{(d)} - v_{ik} x_i^{(d)} \right),
&& \theta = v_{ik}.
\end{aligned}\right.
\end{aligned}
$$
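Putting the pieces together, the following is a simplified sketch of a Gibbs sweep with cached $f^{(d)}$ and $q_k^{(d)}$. It makes several assumptions that are not taken from the notebook: the standard second-order FM output is used for `fm_predict`, the prior means and variances are held fixed at `mu=0` and `sigma2=1` rather than sampled, and all function names, data, and constants are hypothetical. It uses $y_\theta^{(d)} = y^{(d)} - g_\theta^{(d)} = y^{(d)} - f^{(d)} + \theta h_\theta^{(d)}$.

```python
import numpy as np

def fm_predict(X, b, w, v):
    """Second-order FM output computed from scratch (assumed model form)."""
    q = X @ v
    return b + X @ w + 0.5 * np.sum(q ** 2 - (X ** 2) @ (v ** 2), axis=1)

def draw(h, y_theta, sigma_n2, mu, sigma2, rng):
    """Sample theta from N(mu_star, sigma_star^2) with x_theta = h."""
    var = 1.0 / (np.sum(h ** 2) / sigma_n2 + 1.0 / sigma2)
    mean = var * (np.sum(h * y_theta) / sigma_n2 + mu / sigma2)
    return rng.normal(mean, np.sqrt(var))

def gibbs_sweep(X, y, b, w, v, f, q, sigma_n2, rng, mu=0.0, sigma2=1.0):
    """One sweep over b, w, v; w, v, f and q are updated in place."""
    N, K = v.shape
    # Bias: h = 1, so y_theta = y - f + b.
    b_new = draw(np.ones_like(y), y - f + b, sigma_n2, mu, sigma2, rng)
    f += b_new - b
    b = b_new
    for i in range(N):
        h = X[:, i]
        w_new = draw(h, y - f + w[i] * h, sigma_n2, mu, sigma2, rng)
        f += (w_new - w[i]) * h
        w[i] = w_new
    for k in range(K):
        for i in range(N):
            h = X[:, i] * (q[:, k] - v[i, k] * X[:, i])
            v_new = draw(h, y - f + v[i, k] * h, sigma_n2, mu, sigma2, rng)
            f += (v_new - v[i, k]) * h               # Delta f
            q[:, k] += X[:, i] * (v_new - v[i, k])   # Delta q_k
            v[i, k] = v_new
    return b

rng = np.random.default_rng(0)
D, N, K = 200, 5, 2
X = rng.normal(size=(D, N))
y = fm_predict(X, 0.5, rng.normal(size=N), 0.5 * rng.normal(size=(N, K)))
y = y + 0.1 * rng.normal(size=D)

b, w, v = 0.0, np.zeros(N), 0.1 * rng.normal(size=(N, K))
q = X @ v
f = fm_predict(X, b, w, v)
rmse_before = np.sqrt(np.mean((y - f) ** 2))

a_n, b_n, sigma_n2 = 1.0, 1.0, 1.0
for _ in range(30):
    b = gibbs_sweep(X, y, b, w, v, f, q, sigma_n2, rng)
    # Noise variance: IG(a_n + D/2, b_n + 0.5 * sum of squared residuals).
    sigma_n2 = 1.0 / rng.gamma(a_n + D / 2, 1.0 / (b_n + 0.5 * np.sum((y - f) ** 2)))

rmse_after = np.sqrt(np.mean((y - f) ** 2))
print(rmse_before, rmse_after)
```

Because the FM output is linear in each individual parameter, the cached `f` and `q` stay equal to a full recomputation after every update, which is what makes the incremental rule exact rather than approximate.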