# Latent Dirichlet Allocation

In this lecture we're going to give an explanatory overview of how **Latent Dirichlet Allocation** (LDA) works for
topic modeling.

- Johann Peter Gustav Lejeune Dirichlet was a German mathematician in the 1800s who contributed widely to the field of modern mathematics.

- There is a probability distribution named after him, the "Dirichlet distribution". This is the distribution that is used later on in LDA.

- Latent Dirichlet Allocation is based off this probability distribution.

- LDA was published in 2003 as a graphical model for topic discovery in the *Journal of Machine Learning Research* by David Blei, Andrew Ng and Michael I. Jordan.

So keep in mind that even though Dirichlet's name is attached to this method for topic modeling, that is only because it uses the Dirichlet probability distribution, not because Dirichlet himself invented LDA. The method itself is relatively new, dating from 2003.

- So we're going to get a high level overview of how LDA works for topic modeling.

- But I would really encourage you to also take a look at the original publication paper.

There are two main assumptions we're going to make in order to actually apply LDA for topic modeling.

- The first one is that documents with similar topics use similar groups of words. 

That's a pretty reasonable assumption, because it is basically saying that if you have various documents covering a similar topic, such as a bunch of documents about business or the economy, they should end up using similar words like money, price, market, stocks, etc.

- The other assumption we are going to make is that latent topics can then be found by searching for groups of words that frequently occur together in documents across the corpus.  

That's the assumption whose details we're really going to dive into later on.

So again these are the two assumptions and they're both actually quite reasonable for the way humans write documents.

And we can actually think of these two assumptions mathematically. The way we kind of model these assumptions is the following.

- We can say that documents are probability distributions over some underlying latent topics.
- and then topics themselves are probability distributions over words.

So let's see how each of those actually plays out.

- We can imagine that any particular document is going to have a probability distribution over a given number of latent topics.

So let's say we decide that there are five latent topics across various documents. Then, any particular document is going to have a probability of belonging to each topic. 

So here we can see document one has the highest probability of belonging to topic number two.

![](../imgs/lda01.png)

So we have this discrete probability distribution across the topics for each document.

Then we can look at another document such as document number 2. In this case it does have probabilities of belonging to other topics but we're going to say that it has the highest probability of belonging to Topic four.

![](../imgs/lda02.png)

Notice here we're not saying definitively that document one belongs to any particular topic or that document two belongs to any particular topic. Instead, we're modeling them as having a probability distribution over a variety of topics.
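As a small sketch of this idea, a document's topic distribution can be stored as a list of probabilities that sums to 1. The numbers below are made up for illustration (they are not read from the figures), and `most_likely_topic` is a hypothetical helper name.

```python
import math

# Hypothetical topic distributions for two documents over five latent topics.
# The values are illustrative only; each list must sum to 1.
doc_topic = {
    "document 1": [0.10, 0.55, 0.15, 0.10, 0.10],
    "document 2": [0.05, 0.15, 0.10, 0.60, 0.10],
}


def most_likely_topic(distribution):
    """Return the 1-based number of the topic with the highest probability."""
    best_index = max(range(len(distribution)), key=lambda i: distribution[i])
    return best_index + 1


for name, dist in doc_topic.items():
    assert math.isclose(sum(dist), 1.0)  # a valid discrete distribution
    print(name, "-> most likely topic:", most_likely_topic(dist))
```

Note that the most likely topic is only a summary: the document still keeps a nonzero probability for every other topic.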

- And then if we look at the topics themselves those are simply going to be modeled as probability distributions over words.

So for example, we can define topic one by the probability of each of these words belonging to that topic.

![](../imgs/lda03.png)


So we can see here that there is a low probability of the word "he" belonging to topic one, a low probability of "food" belonging to topic one, etc.

And then, we can see that words such as "cat" and "dog" have a higher probability of belonging to topic one.

And here is where we, as users, actually begin trying to understand what this topic represents.

So, if we were to get this sort of probability distribution across the whole vocabulary of the corpus, what we would end up doing is asking for maybe the top 10 highest-probability words for topic one, and then we would try to work out what the actual underlying topic is.

So in this case we could make an educated guess that topic one has to do with pets.

Again, LDA, being an unsupervised learning technique, is not going to be able to tell you that directly. It's up to the user to interpret these probability distributions as topics. And we'll actually get hands-on practice with that when we perform LDA ourselves in Python.
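To make the "top words" idea concrete, here is a minimal sketch. The word probabilities for the topic are invented for this example, and `top_words` is a hypothetical helper, not part of any library.

```python
# Hypothetical word probabilities for topic one (illustrative values only).
# In a real model this distribution covers the whole vocabulary and sums to 1.
topic_1 = {
    "cat": 0.30, "dog": 0.28, "vet": 0.15, "birds": 0.12,
    "home": 0.08, "food": 0.04, "he": 0.03,
}


def top_words(topic_word_probs, n=10):
    """Return the n words with the highest probability under a topic."""
    ranked = sorted(topic_word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


print(top_words(topic_1, n=4))  # ['cat', 'dog', 'vet', 'birds']
```

Reading a list like this and deciding that it means "pets" is the part that is left to the user.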

- So **LDA represents documents as mixtures of topics that spit out words with certain probabilities**

- And it's going to assume that documents are produced in the following fashion:

    - It's first going to decide on the number of words **N** the document will have.
    - Then we choose a **topic mixture** for that document according to a Dirichlet distribution over a fixed set of **K** topics. So that's where the Dirichlet distribution comes into play.

    - So for example, we start off and say this document is 60% business, 30% politics and 10% food.

So that's its actual distribution. Note that the shares have to add up to 100%.

- And then we're going to generate each word in the document by:

    - first picking a topic according to the multinomial distribution that we sampled previously,
    - and then using that topic to generate the word itself, according to the topic's own multinomial distribution across words.

So on average we pick 60% of the words from the business topic, 30% of them from politics and 10% from the food topic.

- So for example, if we selected the food topic, we might generate the word "apple" with 60% probability, another word "home" with a lower probability like 30%, and so on.

- Assuming this sort of generative model for a collection of documents, LDA is actually going to then try to backtrack from the documents to find the topics that are likely to have generated the collection.

So again, LDA assumes that the process we just went over is how the documents were built. Now obviously, in the real world you're not actually writing documents with this frame of mind, but it's a very useful construct for the way topics can be mixed throughout various documents and the way words can be mixed throughout various topics. So what we're going to do is attempt to backtrack that process.
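The generative process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topics, their word probabilities, the Dirichlet parameters `alphas`, and the helper names `sample_dirichlet` and `generate_document` are all made up for this example.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alphas, n_words, rng):
    """Generate one document with the assumed generative process.

    topics:  dict mapping topic name -> {word: probability}
    alphas:  Dirichlet parameters, one per topic (assumed values)
    n_words: the number of words N the document will have
    """
    names = list(topics)
    # Choose the document's topic mixture from a Dirichlet distribution.
    mixture = sample_dirichlet(alphas, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic according to the document's topic mixture.
        topic = rng.choices(names, weights=mixture, k=1)[0]
        # Pick a word according to that topic's own word distribution.
        vocab = list(topics[topic])
        probs = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return mixture, words


# Hypothetical topics with made-up word probabilities.
topics = {
    "business": {"money": 0.5, "price": 0.3, "market": 0.2},
    "politics": {"vote": 0.6, "law": 0.4},
    "food": {"apple": 0.6, "home": 0.3, "bread": 0.1},
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
mixture, words = generate_document(topics, alphas=[1.0, 1.0, 1.0], n_words=8, rng=rng)
print([round(p, 2) for p in mixture])
print(words)
```

The generated "document" is just a bag of words with no grammar, which is exactly the simplification LDA makes.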

So let's see what LDA is actually going to do, given that it assumes that's how the documents were built.

- So we can imagine we have a set of documents.

- The first thing we have to do is choose some fixed number **K** of topics to discover. Note this carefully, because it is actually really hard: in order for LDA to work, you, as the user, need to decide how many topics are going to be discovered. So even before you start LDA, you need to have some intuition about how many topics there are.

So we choose some fixed number **K** of topics to discover, and then we use LDA to learn the topic representation of each document and the words associated with each topic.

- Then we're going to go through each document and we're going to randomly assign each word in the document to one of the K topics.

- Keep in mind that after this very first pass, the random assignment already gives you both topic representations of all the documents and word distributions of all the topics. But since we've assigned everything randomly, we're not done yet: these initial random topics won't really make sense. They're going to be really poor representations of topics, since you essentially just assigned every word to a random topic.
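Here is a minimal sketch of that random first pass. The toy corpus and the helper names `random_initialization` and `count_assignments` are assumptions made for illustration. The two count tables are what "topic representations of the documents" and "word distributions of the topics" boil down to at this stage.

```python
import random
from collections import Counter


def random_initialization(documents, n_topics, rng):
    """Assign every word in every document to a random topic (0 .. n_topics - 1)."""
    return [[rng.randrange(n_topics) for _ in doc] for doc in documents]


def count_assignments(documents, assignments):
    """Build the two count tables implied by the current assignments."""
    # doc_topic[d][t]: number of words in document d assigned to topic t
    doc_topic = [Counter() for _ in documents]
    # topic_word[(t, w)]: number of times word w is assigned to topic t
    topic_word = Counter()
    for d, (doc, topics) in enumerate(zip(documents, assignments)):
        for word, t in zip(doc, topics):
            doc_topic[d][t] += 1
            topic_word[(t, word)] += 1
    return doc_topic, topic_word


# Hypothetical toy corpus, already split into words.
documents = [
    ["cat", "dog", "vet", "cat"],
    ["money", "market", "price"],
    ["dog", "food", "money"],
]

rng = random.Random(0)
assignments = random_initialization(documents, n_topics=2, rng=rng)
doc_topic, topic_word = count_assignments(documents, assignments)
print(assignments)
print(doc_topic[0])
```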

So now it's time to iterate over this and see if we can figure out how to fix these assignments.

- So we're going to iterate over every word in every document to improve these topics.

- For every word **$w$** in every document **$d$**, and for each topic **$t$**, we're going to calculate the following:

    - p(topic **$t$** | document **$d$**) = the proportion of words in document **$d$** that are currently assigned to topic **$t$**.
    - p(word **$w$** | topic **$t$**) = the proportion of assignments to topic **$t$**, over all documents, that come from this word **$w$**.

- Then we reassign **$w$** a new topic, where we choose topic **$t$** with probability

$$P(\text{topic } t \mid \text{document } d) \times P(\text{word } w \mid \text{topic } t)$$

So this is essentially the probability that topic **$t$** generated the word **$w$**.
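The reassignment step can be sketched as follows. This follows the simplified description in this lecture: it counts the word's current assignment and uses plain proportions. Full Gibbs sampling implementations of LDA typically leave the current word out of the counts and add smoothing terms from the Dirichlet priors, which this sketch omits. The toy data and the names `topic_weights` and `reassign` are hypothetical.

```python
import random


def topic_weights(d, word, documents, assignments, n_topics):
    """Return the unnormalized weight P(t | d) * P(w | t) for every topic t."""
    weights = []
    for t in range(n_topics):
        # P(topic t | document d): share of words in document d assigned to t.
        in_doc = sum(1 for a in assignments[d] if a == t)
        p_topic_given_doc = in_doc / len(documents[d])
        # P(word w | topic t): share of all assignments to t that come from w.
        total_t = sum(1 for topics in assignments for a in topics if a == t)
        from_w = sum(
            1
            for doc, topics in zip(documents, assignments)
            for w, a in zip(doc, topics)
            if a == t and w == word
        )
        p_word_given_topic = from_w / total_t if total_t else 0.0
        weights.append(p_topic_given_doc * p_word_given_topic)
    return weights


def reassign(d, i, documents, assignments, n_topics, rng):
    """Resample the topic of word i in document d in place and return it."""
    weights = topic_weights(d, documents[d][i], documents, assignments, n_topics)
    # The word's current topic always has a positive weight, so the
    # weights never sum to zero.
    assignments[d][i] = rng.choices(range(n_topics), weights=weights, k=1)[0]
    return assignments[d][i]


documents = [["cat", "dog", "cat"], ["money", "price", "cat"]]
assignments = [[0, 0, 1], [1, 1, 0]]  # hypothetical current state
weights = topic_weights(0, "cat", documents, assignments, n_topics=2)
print(weights)  # topic 0: (2/3) * (2/3), topic 1: (1/3) * (1/3)
```

`random.choices` normalizes the weights itself, so there is no need to divide by their sum before sampling.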


- After repeating the previous step a large number of times, we eventually reach a roughly steady state where the topic assignments are acceptable: the words and topics stop changing very often and become pretty steady.


- So at the end what we have is each document being assigned to a topic.
- And then we can search for the words that have the highest probability of being assigned to a topic.
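To see the "repeat until roughly steady" idea end to end, here is a compact sketch that runs the simplified reassignment over every word for a number of sweeps and records how many assignments changed in each sweep. It is a toy version of the procedure in this lecture, not a production sampler; the corpus, the number of sweeps, and the name `run_simplified_lda` are assumptions.

```python
import random


def run_simplified_lda(documents, n_topics, n_sweeps, rng):
    """Repeat the reassignment step over every word for n_sweeps passes.

    Returns the final assignments and the number of changed assignments per sweep.
    """
    # Random first pass: every word gets a random topic.
    assignments = [[rng.randrange(n_topics) for _ in doc] for doc in documents]
    changes_per_sweep = []
    for _ in range(n_sweeps):
        changed = 0
        for d, doc in enumerate(documents):
            for i, word in enumerate(doc):
                weights = []
                for t in range(n_topics):
                    in_doc = sum(1 for a in assignments[d] if a == t)
                    total_t = sum(1 for ts in assignments for a in ts if a == t)
                    from_w = sum(
                        1
                        for other, ts in zip(documents, assignments)
                        for w, a in zip(other, ts)
                        if a == t and w == word
                    )
                    p_t_given_d = in_doc / len(doc)
                    p_w_given_t = from_w / total_t if total_t else 0.0
                    weights.append(p_t_given_d * p_w_given_t)
                new_t = rng.choices(range(n_topics), weights=weights, k=1)[0]
                if new_t != assignments[d][i]:
                    changed += 1
                assignments[d][i] = new_t
        changes_per_sweep.append(changed)
    return assignments, changes_per_sweep


# Hypothetical toy corpus, already split into words.
documents = [
    ["cat", "dog", "vet", "cat", "dog"],
    ["money", "market", "price", "money"],
    ["dog", "cat", "food", "vet"],
    ["price", "stocks", "market", "money"],
]

rng = random.Random(0)
assignments, changes = run_simplified_lda(documents, n_topics=2, n_sweeps=20, rng=rng)
print(changes)      # number of reassignments that changed a topic, per sweep
print(assignments)  # final topic of every word
```

Watching the per-sweep change count is one simple way to judge whether the assignments have settled down.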

After running through all the documents and performing LDA, you can pass in one particular document and LDA will report back on it.

- We end up with an output such as:

     - Document assigned to Topic #4
     - Most common words (highest probability) for Topic #4: ['cat', 'vet', 'birds', 'dog', ..., 'food', 'home']
     - It is up to the user to interpret these topics.

- Two important notes:
    - The user must decide on the number of topics present in the documents.
    - The user must interpret what the topics are.
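As a closing sketch, this is one way such a report could be produced from a set of final word-to-topic assignments. The documents, the assignments, and the `summarize` helper are all hypothetical. The topic labels are just numbers: nothing in the code knows that a topic is "about pets".

```python
from collections import Counter


def summarize(documents, assignments, doc_index, n_words=3):
    """Report the dominant topic of one document and that topic's top words."""
    # Dominant topic: the topic most words in the document are assigned to.
    dominant, _ = Counter(assignments[doc_index]).most_common(1)[0]
    # Top words: the words most often assigned to that topic across the corpus.
    word_counts = Counter(
        w
        for doc, topics in zip(documents, assignments)
        for w, t in zip(doc, topics)
        if t == dominant
    )
    return dominant, [w for w, _ in word_counts.most_common(n_words)]


documents = [["cat", "dog", "cat", "vet"], ["money", "price", "cat"]]
assignments = [[0, 0, 0, 1], [1, 1, 0]]  # hypothetical final state
topic, words = summarize(documents, assignments, doc_index=0, n_words=2)
print(f"Document assigned to topic {topic}; top words: {words}")
```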
    
