# Bayesian Matrix Factorization + Document Embedding Prior

## Intuition

Sentence or document embeddings capture the contextual meaning of a document by representing documents with similar meanings as similar vectors. In theory, if two documents share similar words, they should have similar meanings and therefore be represented by similar vectors. However, the association between word co-occurrence and document similarity is only implicit.

On the other hand, count-based techniques such as matrix factorization and LDA explicitly model the similarity of documents through the co-occurrence of words by introducing latent variables, called topics, that are shared by both word generation and document decomposition.

To marry these two concepts, let's start by factorizing the probability of an entry $w_{dn}$ of the word-document matrix:

\begin{equation}
\Pr(w_{dn}) = \sum_k \Pr(w_n \mid k) \Pr(k \mid d)
\end{equation}

where $\Pr(w_n \mid k)$ is the likelihood of seeing word $w_n$ in topic $k$, and $\Pr(k \mid d)$ is the likelihood that topic $k$ is present in document $d$. In topic modeling, we are interested in inferring the topic mix of a document, i.e. calculating $\Pr(k \mid d)$. The place where we infuse the document embedding is $\Pr(w_n \mid k)$. Even if clustering the document embeddings assigns a document to cluster/topic $k$, the document can still include another topic $k'$, because it also contains words that show up in topic $k'$.
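
The factorization above is a matrix product between a document-topic matrix and a topic-word matrix. The following is a minimal sketch with made-up numbers; the names `vocab`, `phi`, `theta`, and `word_given_doc` are hypothetical and only serve to illustrate the sum over $k$.

```python
# Hypothetical toy values: K = 2 topics, 3 vocabulary words, D = 2 documents.
vocab = ["cat", "dog", "stock"]
phi = [               # phi[k][w] = Pr(word w | topic k); each row sums to 1
    [0.5, 0.5, 0.0],  # topic 0
    [0.0, 0.1, 0.9],  # topic 1
]
theta = [             # theta[d][k] = Pr(topic k | document d); each row sums to 1
    [1.0, 0.0],       # document 0 uses only topic 0
    [0.3, 0.7],       # document 1 mixes both topics
]


def word_given_doc(theta, phi):
    """Pr(w | d) = sum_k Pr(w | k) Pr(k | d), i.e. the product theta x phi."""
    n_topics, n_words = len(phi), len(phi[0])
    return [
        [sum(row[k] * phi[k][w] for k in range(n_topics)) for w in range(n_words)]
        for row in theta
    ]


p_wd = word_given_doc(theta, phi)
for d, row in enumerate(p_wd):
    print(d, dict(zip(vocab, row)))
```

Because every row of `phi` and `theta` is a probability distribution, every row of `p_wd` is one as well.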

## Bayesian Matrix Factorization

Let $\{w_{dn}\}_{d=1, n=1}^{D, N_d}$ denote a corpus of $D$ documents, where document $d$ has $N_d$ words. The probability of observing such a corpus is simply

\begin{equation}
\prod_{d=1}^D\prod_{n=1}^{N_d}\Pr(w_{dn})
\end{equation}

To do topic modeling, we assume the above equation can be parametrized by $\vec{\theta}_d$, which is a vector of dimension $K$ that describes the probability of topic $1$ to $K$ being assigned to document $d$:

\begin{equation}
\prod_{d=1}^D\prod_{n=1}^{N_d}\Pr(w_{dn}) = \prod_{d=1}^D\prod_{n=1}^{N_d}\int d\vec{\theta}_d\Pr(w_{dn}\mid \vec{\theta}_d)\Pr(\vec{\theta}_d) \label{full_likelihood_2}\tag{Eq 2}
\end{equation}

In LDA, $\Pr(\vec{\theta}_d)$ is further assumed to follow a Dirichlet distribution $\text{Dir}(\vec{\theta}_d \mid \alpha)$. $\Pr(w_{dn}\mid \vec{\theta}_d)$ can further be rewritten in terms of topic $k$ as 

\begin{equation}
\Pr(w_{dn}\mid \vec{\theta}_d) = \sum_k \Pr(w_{dn}\mid k)\Pr(k \mid \vec{\theta}_d) \label{MF}\tag{Eq 3}
\end{equation}


Substituting \ref{MF} and the Dirichlet prior into \ref{full_likelihood_2} gives

\begin{equation}
\prod_{d=1}^D\prod_{n=1}^{N_d}\Pr(w_{dn} \mid \alpha) = \prod_{d=1}^D\prod_{n=1}^{N_d}\int d\vec{\theta}_d\sum_k \Pr(w_{dn}\mid k)\Pr(k \mid \vec{\theta}_d) \text{Dir}(\vec{\theta}_d \mid \alpha) \label{full_likelihood_3}\tag{Eq 4}
\end{equation}

$\Pr(k \mid \vec{\theta}_d)$ is essentially the $k$-th component of $\vec{\theta}_d$. We will denote this as $\theta_{dk}$. $\Pr(w_{dn}\mid k)$ can be understood as the likelihood of observing the $n$-th word in document $d$ given the document is assigned topic $k$. This is where the sentence embedding is infused into the Bayesian MF framework.

## Document embedding and topic assignment

A document $\vec{w}_d$, which is a vector of words in the LDA world, can be expressed in a more abstract but context-aware space using embedding techniques such as Sentence BERT. Assuming each document is embedded in a high-dimensional space using Sentence BERT, one can group documents with similar contextual meaning into clusters. This is usually done by first projecting the high-dimensional space into a low-dimensional one using a technique like PCA, t-SNE, or UMAP, and then applying a clustering algorithm. Each cluster now consists of a collection of documents. Since a document consists of words, each cluster can also be thought of as a collection of words.
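
The clustering step can be illustrated with a toy k-means written from scratch. This is only a sketch: the 2-D points below are hypothetical stand-ins for embeddings that have already been reduced, the embedding and dimensionality-reduction steps are not shown, and in practice one would more likely rely on a library implementation of whichever clustering algorithm is chosen.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; centroids are initialized to the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centroids


# Hypothetical reduced embeddings for four documents.
embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels, centroids = kmeans(embeddings, k=2)
print(labels)
```

The resulting `labels` list gives one cluster index per document, which is the only output of this step that the rest of the construction needs.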

With this view, one can now express $\Pr(w_n \mid k)$ empirically as

\begin{equation}
\Pr(w \mid k) = \frac{\text{number of occurrences of word $w$ in cluster $k$}}{\text{total number of words in cluster $k$}}
\end{equation}
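
A minimal sketch of this empirical estimate, assuming documents are already tokenized and each one carries a cluster label. The function name `empirical_phi` and the toy `docs` and `labels` are hypothetical.

```python
from collections import Counter


def empirical_phi(docs, labels):
    """Return {cluster: {word: relative frequency of the word in that cluster}}."""
    counts = {}
    for tokens, k in zip(docs, labels):
        # Pool the word counts of all documents assigned to cluster k.
        counts.setdefault(k, Counter()).update(tokens)
    phi_hat = {}
    for k, counter in counts.items():
        total = sum(counter.values())  # total number of words in cluster k
        phi_hat[k] = {w: c / total for w, c in counter.items()}
    return phi_hat


# Hypothetical tokenized documents and their cluster assignments.
docs = [
    ["cat", "dog", "cat"],
    ["stock", "market"],
    ["dog", "pet"],
    ["market", "stock", "dog"],
]
labels = [0, 1, 0, 1]
phi_hat = empirical_phi(docs, labels)
print(phi_hat[0])
```

Words that never occur in a cluster are simply absent from that cluster's dictionary, which corresponds to an empirical probability of zero.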

To borrow the notation used in LDA, we will write the above empirical probability as $\Pr(w \mid k) = \hat{\phi}_{kw}$. \ref{full_likelihood_3} now becomes

\begin{equation}
\prod_{d=1}^D\prod_{n=1}^{N_d}\Pr(w_{dn} \mid \alpha) = \prod_{d=1}^D\prod_{n=1}^{N_d}\int d\vec{\theta}_d\sum_k \hat{\phi}_{kw_{dn}} \theta_{dk} \text{Dir}(\vec{\theta}_d \mid \alpha) \label{full_likelihood_4}\tag{Eq 5}
\end{equation}
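
As a worked step, take \ref{full_likelihood_4} exactly as written, with the integral inside the product over words, and write $\alpha_k$ for the $k$-th component of $\alpha$. Using the standard mean of a Dirichlet distribution, $\int d\vec{\theta}_d \, \theta_{dk} \text{Dir}(\vec{\theta}_d \mid \alpha) = \alpha_k / \sum_{j=1}^K \alpha_j$, each per-word integral can be evaluated in closed form:

\begin{equation}
\int d\vec{\theta}_d\sum_k \hat{\phi}_{kw_{dn}} \theta_{dk} \text{Dir}(\vec{\theta}_d \mid \alpha) = \sum_k \hat{\phi}_{kw_{dn}} \frac{\alpha_k}{\sum_{j=1}^K \alpha_j}
\end{equation}

Under this reading, the likelihood depends on $\alpha$ only through its normalized components $\alpha_k / \sum_j \alpha_j$.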

## Maximum likelihood and inference

To obtain the optimal $\alpha$, we maximize the log-likelihood of \ref{full_likelihood_4}:

\begin{equation}
\alpha^* = \arg\max_\alpha \sum_{d=1}^D\sum_{n=1}^{N_d} \log \Pr(w_{dn} \mid \alpha)
\end{equation}
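
One possible way to carry out this maximization is sketched below. It rests on an assumption worth stating loudly: taking \ref{full_likelihood_4} as written, the Dirichlet mean gives $\Pr(w_{dn} \mid \alpha) = \sum_k \hat{\phi}_{kw_{dn}} m_k$ with $m_k = \alpha_k / \sum_j \alpha_j$, so the objective is the log-likelihood of a mixture with fixed components, and only the normalized weights $m_k$ can be estimated. The sketch uses the standard EM update for mixture weights; the names `em_topic_weights`, `log_likelihood`, `phi_hat`, and `tokens` are hypothetical.

```python
import math


def log_likelihood(tokens, phi_hat, m):
    """Sum over tokens of log sum_k phi_hat[k][w] * m[k]."""
    return sum(
        math.log(sum(phi_hat[k].get(w, 0.0) * m[k] for k in m)) for w in tokens
    )


def em_topic_weights(tokens, phi_hat, n_iter=50):
    """EM for the mixture weights m_k with the components phi_hat held fixed."""
    topics = list(phi_hat)
    m = {k: 1.0 / len(topics) for k in topics}  # uniform starting point
    history = [log_likelihood(tokens, phi_hat, m)]
    for _ in range(n_iter):
        totals = dict.fromkeys(topics, 0.0)
        for w in tokens:
            # E-step: responsibility of each topic for this token.
            joint = {k: phi_hat[k].get(w, 0.0) * m[k] for k in topics}
            norm = sum(joint.values())
            for k in topics:
                totals[k] += joint[k] / norm
        # M-step: new weight is the average responsibility.
        m = {k: totals[k] / len(tokens) for k in topics}
        history.append(log_likelihood(tokens, phi_hat, m))
    return m, history


# Hypothetical topic-word estimates and the pooled tokens of the whole corpus.
phi_hat = {
    0: {"cat": 0.4, "dog": 0.4, "pet": 0.2},
    1: {"stock": 0.4, "market": 0.4, "dog": 0.2},
}
tokens = ["cat", "dog", "cat", "stock", "market",
          "dog", "pet", "market", "stock", "dog"]
m, history = em_topic_weights(tokens, phi_hat)
print(m)
```

The sketch assumes every token has nonzero probability under at least one topic with positive weight; otherwise the normalizer in the E-step would be zero.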

With the optimal $\alpha^*$, we can calculate the following term:

\begin{equation}
\Pr(\vec{\theta}_d, w_{dn} \mid \alpha^*) = \sum_k \hat{\phi}_{kw_{dn}} \theta_{dk} \text{Dir}(\vec{\theta}_d \mid \alpha^*)
\end{equation}

Finally, the topic assignment distribution can be calculated using Bayes' theorem

\begin{equation}
\Pr(\vec{\theta}_d \mid \vec{w}_d) = \frac{\prod_{n=1}^{N_d}\Pr(\vec{\theta}_d, w_{dn} \mid \alpha^*)}{\prod_{n=1}^{N_d}\Pr(w_{dn} \mid \alpha^*)}
\end{equation}

## Inference with out-of-bag data

WIP