# Building an LDA-based Book Recommender System
### Authors: XXX XXX XX

<img align="middle" src="introPic.png"> 
[Source: http://people.ischool.berkeley.edu/~vivienp/presentations/is296/ass1nonfiction.html]

### The task: building a book recommendation engine
*Could be written by Andrew (a motivation for building a recommendation engine)*

A book recommendation system aims to help users find books that might interest them, based on book titles .....


## Related work 

The Latent Dirichlet Allocation (LDA) model and a variational EM algorithm for training it were proposed by Blei, Ng and Jordan in 2003 (Blei et al., 2003a). Blei et al. (2003b) described LDA as a “generative probabilistic model of a corpus which idea is that the documents are represented as weighted relevancy vectors over latent topics, where a distribution over words characterizes a topic”. This topic model belongs to the family of hierarchical Bayesian models of a corpus, whose purpose is to expose the main themes of the corpus so that they can be used to classify, search, and explore its documents.
In LDA models, a topic is a distribution over the feature space of the corpus, and each document can be represented by several topics with different weights. According to Blei et al. (2003a), the number of topics (clusters) and the proportion of the vocabulary that makes up each topic (the number of words in a cluster) are treated as two hidden variables of the model. Computing the conditional distribution of words in topics, given these variables, for an observed set of documents is regarded as the primary challenge of the model.

Griffiths and Steyvers (2004) used a derivation of the Gibbs sampling algorithm for learning LDA models to analyze abstracts from PNAS (...), using Bayesian model selection to set the number of topics. They showed that the extracted topics capture essential structure in the data, compatible with the class designations provided by the authors of the articles, and drew further applications from this analysis, including identifying “hot topics” by examining temporal dynamics and tagging abstracts to illustrate semantic content. Griffiths and Steyvers (2004) also reported that the Gibbs sampling algorithm is more efficient than other LDA training methods (e.g., variational EM). The efficiency of Gibbs sampling for inference in a variety of models that extend LDA is associated with "the conjugacy between the Dirichlet distribution and the multinomial likelihood". Because the posterior distribution belongs to the same family as the prior, posterior sampling becomes easier, which makes inference feasible (Blei et al., 2009; McCallum et al., 2005).

Mimno et al. (2012) introduced a hybrid algorithm for Bayesian topic modeling that aims to merge the efficiency of sparse Gibbs sampling with the scalability of online stochastic inference. Their approach decreases the bias of variational inference and can be generalized to many Bayesian hidden-variable models.

LDA has been applied to various Natural Language Processing tasks, such as opinion analysis (Zhao et al., 2010), native language identification (Jojo et al., 2011), and learning word classes (Chrupala, 2011).

In this blog post, we focus on the task of automatically classifying text into a set of topics using LDA, in order to achieve ..... .

##  What is LDA? 

To understand the kind of books that a certain user likes to read, we used a natural language processing technique called Latent Dirichlet Allocation (LDA), which identifies hidden topics in documents based on the co-occurrence of words in those documents.

The general idea of LDA is that 
>each document is generated from a mixture of topics and each of those topics is a mixture of words.

With this in mind, one could build a mechanism for generating new documents (when the topics are known a priori) or for inferring the topics present in a set of documents that is already known to us. This Bayesian topic modelling technique can be used to find out what share of a given document is devoted to a particular topic, which allows the recommendation system to categorize a book as, for instance, 30% thriller and 20% politics.

Concerning the model name, one can think of it as follows (...):

`Latent`: Topic structures in a document are latent, meaning they are hidden structures in the text.

`Dirichlet`: The Dirichlet distribution determines the mixture proportions of the topics in the documents and the words in each topic.

`Allocation`: Allocation of words to a given topic.

##   Parameter estimation 

LDA is a generative probabilistic model, so to understand exactly how it works one first needs to understand the underlying probability distributions.

The idea behind probabilistic modeling is (Blei, Ng, and Jordan 2003):
  - to treat data as observations that arise from some kind of generative probabilistic process (the hidden variables reflect the thematic structure of the collection of documents, here books),
  - to infer the hidden structure using posterior inference (What are the topics that describe this collection?), and
  - to situate new data into the estimated model (How does a new document fit into the estimated topic structure?).

In the next *XXX* sections we will focus on the multinomial and Dirichlet distributions utilized by LDA.

### Inference: The Building Blocks

.....

### Maximum likelihood  

One of the simplest methods of parameter estimation is the maximum likelihood (ML) method. In effect, one calculates the parameter $\theta$ that maximizes the likelihood:

*to be described further by Quang*
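As a small illustration in the meantime: for a multinomial distribution over a vocabulary, the likelihood-maximizing parameter is the vector of relative word frequencies, $\hat\theta_{i} = n_{i} / N$. The sketch below assumes a made-up, already tokenized toy document; the function name `multinomial_mle` is hypothetical and not part of any library.

```python
from collections import Counter


def multinomial_mle(tokens):
    """Maximum-likelihood estimate of word probabilities: count_i / N."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy document, invented purely for illustration.
tokens = ["murder", "detective", "murder", "election"]
theta_hat = multinomial_mle(tokens)
print(theta_hat)
```

Note that any word that never occurs in the document receives probability zero under this estimate, which is one motivation for the Bayesian treatment discussed next.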

### Bayesian Inference (building blocks)

A further method for estimating parameters is to estimate the posterior of the distribution via Bayesian inference.

*to be described further by Quang*

##  Multinomial Distribution

Instead of maximum likelihood, Bayesian inference encourages the use of predictive densities and evidence scores. This is illustrated in the context of the multinomial distribution, where predictive estimates are often used but rarely described as Bayesian (Minka, 2003).

We now describe the multinomial distribution, which is used to model the probability of words in a document. We will also discuss its conjugate prior, the `Dirichlet distribution`.
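Conjugacy has a concrete payoff: with a Dirichlet prior $Dir(\vec\alpha)$ on the word probabilities and observed word counts $\vec n$, the posterior is again a Dirichlet, $Dir(\vec\alpha + \vec n)$. The following is a minimal sketch of that update, assuming a three-word vocabulary, a symmetric prior, and invented counts; the helper names are hypothetical.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing multinomial counts."""
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


alpha_prior = [1.0, 1.0, 1.0]  # assumed symmetric prior
counts = [2, 1, 0]             # invented word counts; third word never seen
alpha_post = dirichlet_posterior(alpha_prior, counts)
posterior_mean = dirichlet_mean(alpha_post)
print(alpha_post)
print(posterior_mean)
```

Unlike the maximum likelihood estimate, the posterior mean assigns a small non-zero probability to the word that was never observed.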

......

.. to do : describe the intuition behind the MD ...

......

### Dirichlet distribution — what is it, and why is it useful?




The probability density function of the Dirichlet distribution is given by the following equation:

$$
\begin{equation}
Dir (\vec\theta\mid \vec\alpha)={\frac {\Gamma (\sum_{i=1}^{K}\alpha_{i})}{\prod _{i=1}^{K}\Gamma (\alpha_{i})}}\prod _{i=1}^{K}\theta_{i}^{\alpha_{i}-1}
\end{equation}
$$

This equation is often written with the multivariate Beta function in place of the first term, as seen below:

$$
\begin{equation}
Dir (\vec\theta\mid \vec\alpha)={\frac {1}{B(\vec\alpha)}}\prod _{i=1}^{K}\theta_{i}^{\alpha_{i}-1}
\end{equation}
$$
where
$$
\begin{equation}
\frac {1}{B(\vec\alpha)} = {\frac {\Gamma (\sum_{i=1}^{K}\alpha_{i})}{\prod _{i=1}^{K}\Gamma (\alpha_{i})}}
\end{equation}
$$
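To make the formula concrete, the density can be evaluated directly from its definition. The sketch below works in log space with `math.lgamma` for numerical stability; the function name `dirichlet_log_pdf` and the example values are assumptions made for illustration, and every component of `theta` is assumed to be strictly positive.

```python
import math


def dirichlet_log_pdf(theta, alpha):
    """Log density of Dir(theta | alpha), term by term as in the equation."""
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    # log of Gamma(sum alpha_i) / prod Gamma(alpha_i), i.e. log(1 / B(alpha))
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # log of prod theta_i^(alpha_i - 1)
    log_kernel = sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    return log_norm + log_kernel


center = [1 / 3, 1 / 3, 1 / 3]
density_flat = math.exp(dirichlet_log_pdf(center, [1.0, 1.0, 1.0]))
density_peaked = math.exp(dirichlet_log_pdf(center, [2.0, 2.0, 2.0]))
print(density_flat, density_peaked)
```

With all $\alpha_{i} = 1$ the density is constant over the simplex, while larger values concentrate mass around the center.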

To get a better sense of what the distributions look like, let’s visualize a few examples in the context of topic modelling:


 To do: *Example of Dirichlet with different alpha values*
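Until the plots are added, a rough numerical impression can be obtained by sampling. The sketch below draws from symmetric Dirichlet distributions by normalizing independent Gamma draws (a standard construction), using only the Python standard library. The three concentration values, the three-topic setting, and the helper names are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def mean_max_component(concentration, n_samples, rng, dim=3):
    """Average size of the largest component, a crude measure of sparsity."""
    alpha = [concentration] * dim
    total = 0.0
    for _ in range(n_samples):
        total += max(sample_dirichlet(alpha, rng))
    return total / n_samples


rng = random.Random(0)
for concentration in (0.1, 1.0, 10.0):
    sample = sample_dirichlet([concentration] * 3, rng)
    print(concentration, [round(x, 3) for x in sample])
```

Small concentration values tend to put most of the mass on a single topic, whereas large values produce samples close to the uniform mixture.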

## LDA as a generative process


LDA is often described as the simplest topic model (...). The intuition behind this model is that documents exhibit multiple topics. The model is most easily described by its generative process, the process by which the model assumes the documents in the collection arose (...).

Assuming that the word distribution of each topic is drawn from a Dirichlet distribution, as is the topic distribution of each document, and that the document length is drawn from a Poisson distribution, one can generate the words in a two-stage process for each document in the whole data collection:

1. Randomly choose a distribution over topics.
   - Sample the parameters of the document's topic distribution:
   - $\theta_{d} \sim Dirichlet(\alpha) $
2. For each word in the document (for $i=1$ to *W*, where *W* is the number of words in document *d*):
   -  Randomly choose a topic from the distribution over topics drawn in the first step.
      - Select the topic for word *i*:
      - $z_{i} \sim Multinomial(\theta_{d})$ where $\theta_{d}$ is the topic distribution of document *d*
   -  Randomly choose a word from the corresponding distribution over the vocabulary.
      - Select the word based on the word distribution of topic $z_{i}$:
      - $w_{i} \sim Multinomial(\phi^{(z_{i})}) $ where $\phi$ denotes the word distributions of the topics
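The steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the four-word vocabulary, the number of topics, the hyperparameter values, and all helper names are invented, and the Poisson sampler uses Knuth's simple multiplication method.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw a single index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_poisson(lam, rng):
    """Knuth's method; adequate for small lam."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_document(alpha, phi, vocabulary, mean_length, rng):
    theta_d = sample_dirichlet(alpha, rng)         # step 1: topic proportions
    n_words = sample_poisson(mean_length, rng)     # document length
    topics, words = [], []
    for _ in range(n_words):                       # step 2: one draw per word
        z = sample_categorical(theta_d, rng)       # topic assignment z_i
        w = sample_categorical(phi[z], rng)        # word index w_i from topic z
        topics.append(z)
        words.append(vocabulary[w])
    return theta_d, topics, words


rng = random.Random(42)
vocabulary = ["murder", "detective", "election", "parliament"]  # toy vocabulary
n_topics = 2
# Per-topic word distributions, drawn from a symmetric Dirichlet (assumed 0.5).
phi = [sample_dirichlet([0.5] * len(vocabulary), rng) for _ in range(n_topics)]
theta_d, topics, words = generate_document([0.5] * n_topics, phi, vocabulary, 8, rng)
print([round(t, 3) for t in theta_d])
print(words)
```

Note that `phi` is drawn once and shared by every document, while `theta_d` is drawn anew for each document.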

The distinctive characteristic of LDA is that all the documents in the collection share the same set of topics, and each document exhibits those topics in different proportions.


### LDA as a graphical model  
LDA can be described more formally with the following notation:

<img align="middle" src="pic7.png"> 

- The `topics` are $\beta_{1:K}$, where each $\beta_{k}$ is a distribution over the vocabulary.

- The `topic proportion` for the $d^{th}$ document is denoted by $\theta_{d}$, where $\theta_{d,k}$ is the `topic proportion` for `topic` $k$ in `document` $d$.

- The `topic assignment` for the $d^{th}$ document is $z_{d}$, where $z_{d,n}$ is the topic assignment for the $n^{th}$  word in document *d*.

- The observed `words` for document *d* are $w_{d}$, where $w_{d,n}$ is the $n^{th}$ word in document *d*, which is an element from the fixed vocabulary.



Using this notation, the generative process for LDA  is equivalent  to the following joint distribution of the hidden and observed variables (...):

\begin{equation*}
p(\beta_{1:K} , \theta_{1:D} , z_{1:D} , w_{1:D}) =\displaystyle\prod_{i=1}^{K}p(\beta_{i})\displaystyle\prod_{d=1}^{D}p(\theta_{d})\Bigg( \displaystyle\prod_{n=1}^{N}p(z_{d,n}\mid\theta_{d}) p(w_{d,n}\mid\beta_{1:K}, z_{d,n} )\Bigg)
\end{equation*}

As you can see from this distribution, the `topic assignment` $z_{d,n}$ depends on the `per-document topic distribution` $\theta_{d}$, and the word $w_{d,n}$ depends on all of the `topics` $\beta_{1:K}$ and on the `topic assignment` $z_{d,n}$.
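The factorization can be checked numerically on a tiny example by adding up the log of each factor. The sketch below assumes Dirichlet priors for $p(\beta_{i})$ and $p(\theta_{d})$; the hyperparameter name `eta` for the topic prior, the function names, and all toy values are hypothetical and chosen only so the result is easy to verify by hand.

```python
import math


def dirichlet_log_pdf(x, alpha):
    """Log density of a Dirichlet distribution at x (all x_i > 0)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(v) for a, v in zip(alpha, x))


def lda_log_joint(beta, theta, z, w, eta, alpha):
    """Log joint of topics beta[k][v], proportions theta[d][k],
    assignments z[d][n] and word indices w[d][n]."""
    log_p = sum(dirichlet_log_pdf(beta_k, eta) for beta_k in beta)  # p(beta_i)
    for d in range(len(w)):
        log_p += dirichlet_log_pdf(theta[d], alpha)                 # p(theta_d)
        for z_dn, w_dn in zip(z[d], w[d]):
            log_p += math.log(theta[d][z_dn])                       # p(z | theta)
            log_p += math.log(beta[z_dn][w_dn])                     # p(w | beta, z)
    return log_p


# Toy setting: 2 topics, 2 vocabulary words, 1 document with 2 words.
beta = [[0.9, 0.1], [0.2, 0.8]]
theta = [[0.5, 0.5]]
z = [[0, 1]]
w = [[0, 1]]
log_joint = lda_log_joint(beta, theta, z, w, eta=[1.0, 1.0], alpha=[1.0, 1.0])
print(math.exp(log_joint))
```

With flat priors (all hyperparameters equal to 1 on a two-dimensional simplex) the prior densities equal 1, so the joint reduces to the product $0.5 \cdot 0.9 \cdot 0.5 \cdot 0.8 = 0.18$.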

##  Posterior computation for LDA

As already mentioned,  
> the aim of topic modelling is to automatically discover the topics from a collection of documents, which are observed, while the topic structure is hidden.  

This can be thought of as “reversing” the generative process by asking *what is the hidden structure that likely generated the observed collection?* 

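We now turn to the computational problem: computing the conditional distribution of the topic structure given the observed documents (called the *posterior*). Using the notation of the generative process, where $\phi$ denotes the per-topic word distributions (written $\beta_{1:K}$ in the graphical model above) and $\beta$ now denotes the hyperparameter of their Dirichlet prior, the posterior is

$$p(\theta,\phi, z \mid w, \alpha, \beta) = \frac{p(\theta,\phi, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$$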



The left side of the equation gives us the probability of the document topic distributions, the word distribution of each topic, and the topic labels, given all words (in all documents) and the hyperparameters $\alpha$ and $\beta$. In particular, we are interested in estimating the probability of each topic ($z$) for a given word ($w$), given our prior assumptions (i.e. the hyperparameters), for all words and topics.
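The denominator $p(w \mid \alpha, \beta)$ requires summing over every possible topic assignment, so in practice the posterior is approximated. One option is collapsed Gibbs sampling in the spirit of Griffiths and Steyvers (2004), mentioned under Related work: each topic label is resampled in turn with probability proportional to $(n_{d,k} + \alpha)(n_{k,w} + \beta) / (n_{k} + V\beta)$, where the counts exclude the current word. The sketch below is a minimal, unoptimized illustration and not the implementation used in this post; the function name, the toy corpus of word indices, and the scalar symmetric hyperparameters are assumptions.

```python
import random


def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_iters, seed=0):
    """docs: list of documents, each a list of word indices in [0, vocab_size)."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                 # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]    # word counts per topic
    n_k = [0] * n_topics                                  # total words per topic
    z = []
    for d, doc in enumerate(docs):                        # random initialization
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k_old = z[d][n]                           # remove current assignment
                n_dk[d][k_old] -= 1
                n_kw[k_old][w] -= 1
                n_k[k_old] -= 1
                weights = [
                    (n_dk[d][k] + alpha) * (n_kw[k][w] + beta)
                    / (n_k[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][n] = k_new                           # add new assignment
                n_dk[d][k_new] += 1
                n_kw[k_new][w] += 1
                n_k[k_new] += 1
    # Point estimate of the per-document topic proportions from the final counts.
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return z, theta, n_kw


# Toy corpus of word indices over a 4-word vocabulary (invented).
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
z, theta, n_kw = collapsed_gibbs_lda(docs, n_topics=2, vocab_size=4,
                                     alpha=0.5, beta=0.1, n_iters=50)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` is an estimated topic mixture for one document, which is the kind of "30% thriller, 20% politics" description a recommender could build on.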

> ### A central research goal of modern probabilistic modelling is to develop efficient methods for approximating the posterior (Blei, 2012).

...... to be continued .......







# Python implementation 


## References 