# **Latent Dirichlet Allocation (LDA)**
Before the recent surge of interest in generative AI, NLP and computer vision already relied on a family of generative models. These models are called generative because they describe a probabilistic process by which the observed data is assumed to have been produced; since they are built on explicit probability distributions, they can also be sampled to generate new data. LDA is one such model: a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics, and each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities [1]. These topic probabilities are not observed and are therefore called latent. For text, learning them yields a compact representation of each document: which topics are present and in what proportions. This is why LDA is widely used in topic modeling, to discover hidden topics within a collection of documents.

### 1. **Unsupervised Technique**
   - LDA is unsupervised: it needs no labeled documents, only the raw text. The key assumption is that every document is a mixture of several topics, and each topic is a distribution over words. In a collection of articles, the topics could be "politics," "technology," "sports," etc., and an article about artificial intelligence might be 60% "technology" and 40% "business."

### 2. **How does it Work?**
   - **Topics as Word Distributions**: Each topic is represented as a distribution of words. For example, a "technology" topic might have high probabilities for words like "computer," "AI," and "data."
   - **Probabilistic Model**: LDA uses Dirichlet distributions (hence the name) as priors over the topic proportions of each document and, in the commonly used smoothed variant, over the word distribution of each topic.
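
The generative story described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the two topics, the six-word vocabulary, the prior `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical and do not come from any library.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document by following LDA's generative story."""
    # 1. Draw this document's topic proportions from the Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from the chosen topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words


# Hypothetical model: two topics over a six-word vocabulary (each row sums to 1).
vocab = ["computer", "ai", "data", "market", "profit", "startup"]
topic_word = [
    [0.40, 0.30, 0.25, 0.02, 0.01, 0.02],  # a "technology"-like topic
    [0.02, 0.03, 0.05, 0.35, 0.30, 0.25],  # a "business"-like topic
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=10, rng=rng)
print("topic proportions:", [round(p, 2) for p in theta])
print("generated words:  ", words)
```

Training LDA runs this story in reverse: given only the words, it infers the topic proportions and word distributions that plausibly produced them.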

### 3. **LDA Steps**
   - **Input**: A set of documents and the number of topics \( K \) you want to discover.
   - **Output**: Probabilities of topics per document and probabilities of words per topic.
   - **Algorithm Steps**:
     1. Initialize the model with a random assignment of topics to words in documents.
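     2. For each word in turn, reassign its topic based on how prevalent each topic currently is in the word's document and how strongly each topic is currently associated with that word.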
     3. Repeat the process until convergence, where topics stabilize, revealing word distributions per topic and topic distributions per document.
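
The steps above closely match collapsed Gibbs sampling, one common way to fit LDA. The following is a minimal sketch of that sampler, not a production implementation: the function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, the toy corpus, and the fixed iteration count (used here in place of a real convergence check) are all assumptions made for illustration. Note that gensim's `LdaModel` uses a different inference method (online variational Bayes).

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling. `docs` is a list of lists of word IDs."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                  # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]    # times word w is assigned to topic k
    topic_total = [0] * num_topics                                # total words assigned to topic k

    # Step 1: assign a random topic to every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    # Steps 2-3: repeatedly resample each word's topic given all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of topic t: (topic prevalence in doc) x (word affinity to topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimates: topic distribution per document, word distribution per topic.
    theta = [[(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
             for d, doc in enumerate(docs)]
    phi = [[(n + beta) / (topic_total[k] + vocab_size * beta) for n in topic_word[k]]
           for k in range(num_topics)]
    return theta, phi


# Toy corpus of word IDs: documents 0-1 use words 0-2, documents 2-3 use words 3-5.
toy_docs = [[0, 1, 0, 1, 2], [0, 0, 1, 1], [3, 4, 3, 4, 5], [4, 4, 3, 5]]
theta, phi = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
for d, row in enumerate(theta):
    print(f"doc {d}:", [round(p, 2) for p in row])
```

The two returned tables correspond to the outputs listed above: `theta` holds the topic probabilities per document, and `phi` holds the word probabilities per topic.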

### 4. **Applications of LDA**
   - **Document Clustering**: Grouping similar documents based on topics.
   - **Information Retrieval**: Improving search by identifying relevant topics within documents.
   - **Content Recommendation**: Recommending content based on similar topics.
   - **Text Summarization**: Summarizing content by topic relevance.
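
Several of these applications reduce to comparing documents by their topic distributions. As a rough sketch, the snippet below ranks documents by Hellinger distance, a standard distance between probability distributions. The document names and topic proportions are invented for illustration; in practice they would come from a trained model.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def most_similar(name, doc_topics):
    """Return the other document whose topic distribution is closest to `name`."""
    others = (other for other in doc_topics if other != name)
    return min(others, key=lambda other: hellinger(doc_topics[name], doc_topics[other]))


# Hypothetical topic proportions over (technology, business, sports).
doc_topics = {
    "ai_article": [0.60, 0.40, 0.00],
    "chip_review": [0.70, 0.25, 0.05],
    "match_report": [0.05, 0.05, 0.90],
}

print(most_similar("ai_article", doc_topics))
```

The same distance can drive clustering (group documents that are close) or recommendation (suggest the nearest documents to the one being read).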


LDA provides insights into large text datasets by identifying common themes and structures, making it an essential tool for natural language processing (NLP).

![LDA](/content/nltk/LDAupdated2.png)





The example below applies LDA to a simple set of example documents, using the popular libraries `gensim` and `nltk`.

We'll go through:
1. Importing necessary libraries.
2. Preprocessing text data.
3. Creating and training the LDA model.
4. Displaying the topics generated by the model.

## Import necessary libraries

```python
import gensim
from gensim import corpora
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Fetch the Punkt tokenizer data into NLTK's default data directory.
nltk.download('punkt')
# Also look for NLTK data in this additional directory.
nltk.data.path.append("/content/nltk")
```

```plaintext
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/oysterable/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
```


## Download NLTK Resources

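```python
# Download the tokenizer data, the stopword list, and WordNet into the
# directory that was added to nltk.data.path above.
nltk.download('punkt', download_dir="/content/nltk")
nltk.download('stopwords', download_dir="/content/nltk")
nltk.download('wordnet', download_dir="/content/nltk")
nltk.download('punkt_tab', download_dir="/content/nltk")
```

The `/content/nltk` directory is typical of Google Colab. On a machine where `/content` is not writable, the first download fails as shown below; in that case, omit the `download_dir` argument so that NLTK uses its default data directory.

```plaintext
[nltk_data] Downloading package punkt to /content/nltk...

OSError: [Errno 30] Read-only file system: '/content'
```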

## Create an Example Set of documents


```python
documents = [
    "I love reading about machine learning and natural language processing.",
    "Artificial intelligence is a fascinating field.",
    "Deep learning is a part of machine learning.",
    "I enjoy creating machine learning models.",
    "Natural language processing is a key component of artificial intelligence.",
    "Learning about AI and ML is exciting and challenging.",
    "Generative models are a powerful tool in deep learning.",
    "Applications of AI include image and speech recognition."
]
```

## Preprocess the documents

* Tokenize
* Remove stopwords
* Lemmatize


```python
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(doc):
    # Lowercase the text and split it into tokens.
    tokens = word_tokenize(doc.lower())
    # Keep alphabetic, non-stopword tokens and reduce each one to its lemma.
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]
    return tokens

processed_docs = [preprocess(doc) for doc in documents]
processed_docs
```

## Create a dictionary and corpus for LDA

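```python
# Map each unique token to an integer ID.
dictionary = corpora.Dictionary(processed_docs)
# Convert each document into a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
```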
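
To make the bag-of-words representation concrete, here is a plain-Python sketch of roughly what the dictionary and `doc2bow` steps compute. The helpers `build_vocab` and `to_bow` and the hand-written token lists are hypothetical stand-ins, and gensim may assign different integer IDs to the same words.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Represent a document as sorted (token_id, count) pairs; word order is discarded."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Hand-written token lists standing in for preprocessed documents.
sample_docs = [
    ["deep", "learning", "part", "machine", "learning"],
    ["enjoy", "creating", "machine", "learning", "model"],
]

vocab = build_vocab(sample_docs)
bow = [to_bow(doc, vocab) for doc in sample_docs]
print(vocab)
print(bow[0])  # "learning" (ID 1) appears twice in the first document
```

LDA only ever sees these ID-and-count pairs, which is why preprocessing choices such as lemmatization directly affect the topics it finds.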

## Train LDA model


```python
# Fit a 3-topic model, making 15 passes over the corpus.
lda_model = gensim.models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)
```

## Display topics

```python
# Show the four highest-weighted words of each topic.
topics = lda_model.print_topics(num_words=4)
print("Topics generated by LDA:")
for idx, topic in topics:
    print(f"Topic {idx + 1}: {topic}")
```

```plaintext
Topics generated by LDA:
Topic 1: 0.121*"learning" + 0.048*"machine" + 0.048*"deep" + 0.048*"model"
Topic 2: 0.061*"artificial" + 0.061*"intelligence" + 0.061*"natural" + 0.061*"processing"
Topic 3: 0.093*"learning" + 0.054*"ai" + 0.053*"application" + 0.053*"speech"
```



### Explanation

1. **Text Preprocessing:** We lowercase and tokenize each document, drop stopwords and non-alphabetic tokens, and lemmatize the remaining words so that the text is simpler and more relevant for topic modeling.

2. **Creating the Dictionary and Corpus:** `dictionary` maps each unique word to an integer ID, and `corpus` represents each document as a "bag of words": a list of `(word_id, count)` pairs that ignores word order.

3. **Training the LDA Model:** `num_topics=3` asks for three topics, and `passes=15` makes training go over the whole corpus 15 times. Both values can be adjusted to the data.

4. **Displaying Topics:** `print_topics(num_words=4)` returns the four highest-weighted words of each topic together with their weights. Topics are interpreted by reading these top words.
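
Each topic is printed as a string of `weight*"word"` terms, which is convenient to read but awkward to process. As a small sketch, the hypothetical helper `parse_topic` below turns one such string (copied from the output above) into `(word, weight)` pairs using only the standard library. With a trained gensim model, `lda_model.show_topic(topic_id)` returns such pairs directly, so parsing is only needed when the printed strings are all that is available.

```python
import re

# Matches terms of the form 0.121*"learning" in a printed topic string.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_str):
    """Convert a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_str)]


topic_str = '0.121*"learning" + 0.048*"machine" + 0.048*"deep" + 0.048*"model"'
pairs = parse_topic(topic_str)
print(pairs)
# [('learning', 0.121), ('machine', 0.048), ('deep', 0.048), ('model', 0.048)]
```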

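### Expected Output (Sample)

LDA training is randomized, so the exact words and weights differ from run to run. The following sample is illustrative rather than the output of the run above.
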
```plaintext
Topics generated by LDA:
Topic 1: 0.200*"learning" + 0.100*"machine" + 0.080*"deep" + 0.080*"model"
Topic 2: 0.150*"intelligence" + 0.120*"natural" + 0.110*"language" + 0.100*"processing"
Topic 3: 0.250*"ai" + 0.140*"application" + 0.100*"image" + 0.090*"recognition"
```

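In this sample, the model has learned three distinct topics, roughly centered on:
1. Machine learning and deep learning.
2. Natural language processing.
3. AI applications like image and speech recognition.

The actual run above separates less cleanly: "learning" is the top word in both Topic 1 and Topic 3, which is not surprising for a corpus of only eight short sentences.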

This code extends to larger datasets: replace the `documents` list with a more extensive corpus of text, or with a more elaborate data structure that supplies the text.

## References

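[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf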