<a href="https://colab.research.google.com/github/pramodgangula19/5731_Spring24/blob/main/gangula_pramod_exercise_04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. Choose the number of topics K by coherence score, then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:
# Write your code here



In [None]:
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sample text data
documents = [
    "Your first document goes here.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document or the second?",
    "The last document completes the collection.",
]


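# Tokenize the text and drop English stop words.
# Note: punctuation tokens such as "." and "?" are not stop words, so they stay in `texts`.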
nltk.download("stopwords")
nltk.download("punkt")
stop_words = set(stopwords.words("english"))
texts = [
    [word for word in word_tokenize(document) if word.lower() not in stop_words]
    for document in documents
]
# Create a dictionary and document-term matrix
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Determine an appropriate range for the number of topics (K)
min_topics = 2
max_topics = 10

# # Suppress warnings
# import warnings
# warnings.filterwarnings("ignore", category=DeprecationWarning)


coherence_scores = []
for num_topics in range(min_topics, max_topics + 1):
    lda_model = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=50, iterations=100)
    coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence="c_v")
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append(coherence_score)


# Select the K with the highest coherence score
optimal_k = min_topics + coherence_scores.index(max(coherence_scores))

# Train the LDA model with the optimal K
lda_model = gensim.models.LdaModel(corpus, num_topics=optimal_k, id2word=dictionary, passes=50, iterations=100)

# Summarize the topics
topics = lda_model.print_topics(num_words=5)

print(topics)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[(0, '0.100*"." + 0.100*"third" + 0.100*"document" + 0.100*"second" + 0.100*"?"'), (1, '0.100*"." + 0.100*"third" + 0.100*"document" + 0.100*"second" + 0.100*"?"'), (2, '0.167*"." + 0.167*"document" + 0.166*"completes" + 0.166*"collection" + 0.166*"last"'), (3, '0.258*"document" + 0.197*"." + 0.136*"first" + 0.136*"second" + 0.076*"?"')]
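
The topics above include "." and "?" because the stop-word filter does not remove punctuation tokens. A minimal sketch of a stricter preprocessing step follows, assuming a regular-expression tokenizer in place of `word_tokenize` and a small hand-picked stop-word set (`STOP_WORDS` and `preprocess` are illustrative names, and the set is not NLTK's full English list):

```python
import re

# Illustrative stop-word subset; NLTK's English list is much longer.
STOP_WORDS = {"your", "the", "is", "and", "this", "or", "here"}


def preprocess(document):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [token for token in tokens if token not in STOP_WORDS]


documents = [
    "Your first document goes here.",
    "Is this the first document or the second?",
]
texts = [preprocess(document) for document in documents]
print(texts)
```

Token lists produced this way could be passed to `corpora.Dictionary` in the same way as the `texts` variable in the cell above.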


## Question 2 (10 Points)

**Generate K topics using LSA. Choose the number of topics K by coherence score, then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:

# Write your code here
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sample text data
documents = [
    "Your first document goes here.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document or the second?",
    "The last document completes the collection.",
]

# Tokenize and preprocess the text data
nltk.download("stopwords")
nltk.download("punkt")
stop_words = set(stopwords.words("english"))
texts = [
    " ".join([word for word in word_tokenize(document) if word.lower() not in stop_words])
    for document in documents
]

# Create a CountVectorizer and transform the data into a document-term matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
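
# Choose the number of components (K) by cumulative explained variance.
# Note: this is not a coherence score, and because explained variance generally
# grows with K, this rule tends to select the largest K in K_range.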
explained_variances = []
K_range = range(2, min(X.shape) + 1)
for K in K_range:
    svd = TruncatedSVD(n_components=K)
    X_reduced = svd.fit_transform(X)
    explained_variances.append(svd.explained_variance_ratio_.sum())

optimal_K = K_range[explained_variances.index(max(explained_variances))]

# Train the final LSA model with the optimal K
svd = TruncatedSVD(n_components=optimal_K)
X_reduced = svd.fit_transform(X)

# Summarize the topics
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(svd.components_):
    top_words_idx = topic.argsort()[:-6:-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")







Topic 0: document, first, second, goes, collection
Topic 1: last, collection, completes, document, third
Topic 2: goes, first, completes, last, collection
Topic 3: third, document, goes, last, completes
Topic 4: goes, document, second, last, completes


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
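
The cell above selects K by cumulative explained variance rather than by coherence. One alternative, sketched here under the assumption that the UMass coherence measure (based on document co-occurrence counts) is acceptable, scores a topic's top words directly against tokenized documents. The function name `umass_coherence` and the toy documents are hypothetical and chosen only to make the arithmetic easy to follow:

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs):
    """UMass coherence: sum over word pairs of log((D(wi, wj) + 1) / D(wj)),
    where D counts documents containing the word(s) and wj ranks before wi."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        # Skip pairs whose conditioning word never occurs, to avoid dividing by zero.
        if doc_freq(earlier) == 0:
            continue
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


# Hypothetical tokenized documents for illustration only.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "vet"],
    ["cat", "pet", "food"],
    ["stock", "market", "trade"],
    ["stock", "trade", "price"],
    ["market", "price", "trade"],
]
coherent = umass_coherence(["cat", "dog", "pet"], docs)
mixed = umass_coherence(["cat", "stock", "trade"], docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score higher (closer to zero) than words drawn from unrelated documents. To choose K this way, one could average the score over each candidate model's topics and keep the K with the highest mean. gensim's `CoherenceModel` also offers `coherence="u_mass"` if a library implementation is preferred.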


## Question 3 (10 Points)

**Generate K topics using lda2vec. Choose the number of topics K by coherence score, then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:

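# Note: this cell trains a plain gensim LdaModel with default settings;
# it does not use the lda2vec library or word embeddings.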
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora
from gensim.models import LdaModel
from gensim.models import CoherenceModel

# Sample text data
documents = [
    "Your first document goes here.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document or the second?",
    "The last document completes the collection.",
]

# Tokenize and preprocess the text data
nltk.download("stopwords")
nltk.download("punkt")
stop_words = set(stopwords.words("english"))
texts = [
    [word for word in word_tokenize(document) if word.lower() not in stop_words]
    for document in documents
]
# Create a dictionary and document-term matrix
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Determine an appropriate range for the number of topics (K)
min_topics = 2
max_topics = 10

coherence_scores = []
for num_topics in range(min_topics, max_topics + 1):
    lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence="c_v")
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append(coherence_score)

# Select the K with the highest coherence score
optimal_k = min_topics + coherence_scores.index(max(coherence_scores))

# Train the LDA model with the optimal K
lda_model = LdaModel(corpus, num_topics=optimal_k, id2word=dictionary)

# Summarize the topics
topics = lda_model.print_topics(num_words=5)  # Adjust the number of words as needed
for topic in topics:
    print(topic)





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(0, '0.100*"document" + 0.100*"." + 0.100*"third" + 0.100*"second" + 0.100*"first"')
(1, '0.101*"document" + 0.100*"." + 0.100*"second" + 0.100*"first" + 0.100*"third"')
(2, '0.253*"document" + 0.253*"." + 0.092*"completes" + 0.092*"last" + 0.092*"second"')
(3, '0.100*"document" + 0.100*"." + 0.100*"second" + 0.100*"first" + 0.100*"third"')
(4, '0.211*"document" + 0.210*"second" + 0.210*"first" + 0.210*"?" + 0.026*"."')
(5, '0.100*"document" + 0.100*"." + 0.100*"first" + 0.100*"second" + 0.100*"third"')
(6, '0.210*"first" + 0.210*"." + 0.210*"document" + 0.210*"goes" + 0.026*"second"')
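
The cell above trains a plain gensim `LdaModel`, so the printed topics are LDA topics. For comparison, the core idea of lda2vec is that a document vector is a softmax-weighted mixture of topic vectors, and that this document vector is added to a pivot word vector to form the context vector used for word2vec-style prediction. The following is a minimal sketch of that composition only, with made-up 3-dimensional vectors; every name and number is hypothetical and no training takes place:

```python
import math


def softmax(weights):
    """Turn unnormalized document-topic weights into proportions that sum to 1."""
    shifted = [w - max(weights) for w in weights]
    exps = [math.exp(w) for w in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def document_vector(doc_topic_weights, topic_vectors):
    """Mix topic vectors using the document's topic proportions."""
    proportions = softmax(doc_topic_weights)
    dim = len(topic_vectors[0])
    return [
        sum(p * topic[i] for p, topic in zip(proportions, topic_vectors))
        for i in range(dim)
    ]


# Hypothetical values for illustration only.
topic_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_topic_weights = [2.0, 0.0]  # this document leans toward topic 0
pivot_word_vector = [0.1, 0.2, 0.3]

doc_vec = document_vector(doc_topic_weights, topic_vectors)
context_vector = [w + d for w, d in zip(pivot_word_vector, doc_vec)]
print(softmax(doc_topic_weights), context_vector)
```

In a trained lda2vec model the topic vectors live in the same space as the word vectors, so each topic can be summarized by the words nearest to its vector.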


## Question 4 (10 Points)

**Generate K topics using BERTopic. Choose the number of topics K by coherence score, then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [None]:
!pip install bertopic

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load a sample dataset
data = fetch_20newsgroups(subset='all')['data'][:100]

# Initialize and fit BERTopic.
# Note: the default vectorizer keeps stop words, so they can dominate the topic words.
model = BERTopic(verbose=True)
topics, _ = model.fit_transform(data)

# Get the number of unique topics
unique_topics = set(topics) - {-1}  # Excluding outlier cluster (-1)
print(f"Number of topics: {len(unique_topics)}")

# Summarize the topics
for topic_num in unique_topics:
    print(f"Topic {topic_num}: {model.get_topic(topic_num)}")




2024-03-29 02:21:44,460 - BERTopic - Embedding - Transforming documents to embeddings.



2024-03-29 02:22:04,074 - BERTopic - Embedding - Completed ✓
2024-03-29 02:22:04,076 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-29 02:22:07,357 - BERTopic - Dimensionality - Completed ✓
2024-03-29 02:22:07,359 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-29 02:22:07,372 - BERTopic - Cluster - Completed ✓
2024-03-29 02:22:07,377 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-29 02:22:07,462 - BERTopic - Representation - Completed ✓


Number of topics: 3
Topic 0: [('the', 0.07672549090258693), ('to', 0.04595422024770698), ('in', 0.04281812785644799), ('of', 0.04061777075101897), ('and', 0.03904016082197215), ('is', 0.03662042761565035), ('it', 0.031085398931291343), ('that', 0.03018004988233436), ('for', 0.02651423362391401), ('from', 0.025128946096449545)]
Topic 1: [('the', 0.07798544040391854), ('to', 0.0634804870284147), ('for', 0.058763320127653416), ('is', 0.05516637026878951), ('and', 0.05509278324050839), ('of', 0.042037637093413925), ('you', 0.03873835593377405), ('have', 0.03724510065506426), ('in', 0.03595975791498973), ('it', 0.03432894198603913)]
Topic 2: [('the', 0.10607754110082072), ('of', 0.07726270632540169), ('to', 0.06000346504720262), ('and', 0.058991416718061976), ('in', 0.0522244201983313), ('that', 0.04279703281616977), ('is', 0.04050528762693314), ('you', 0.03299746044437099), ('from', 0.03248946575754852), ('he', 0.032078538518078965)]
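
Every topic above is dominated by stop words such as "the" and "to". BERTopic's class-based TF-IDF weights a term by its frequency within a topic times log(1 + A / f), where A is the average number of words per topic and f is the term's frequency across all topics. A term that is frequent everywhere still receives a positive weight, so stop words can rank first unless they are filtered out. The sketch below reproduces that weighting in plain Python; the helper `c_tf_idf`, the toy clusters, and the stop-word set are illustrative, and this is not BERTopic's own implementation:

```python
import math
from collections import Counter


def c_tf_idf(cluster_tokens, stop_words=frozenset()):
    """Return {cluster: [(term, weight), ...]} sorted by descending weight."""
    counts = {
        cluster: Counter(t for t in tokens if t not in stop_words)
        for cluster, tokens in cluster_tokens.items()
    }
    total_freq = Counter()
    for counter in counts.values():
        total_freq.update(counter)
    avg_words = sum(total_freq.values()) / len(counts)

    ranked = {}
    for cluster, counter in counts.items():
        cluster_size = sum(counter.values())
        weights = {
            term: (freq / cluster_size) * math.log(1 + avg_words / total_freq[term])
            for term, freq in counter.items()
        }
        # Sort by weight, breaking ties alphabetically for a stable order.
        ranked[cluster] = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked


# Toy clusters: each value is the concatenated tokens of one topic's documents.
clusters = {
    0: "the game the team the score the win".split(),
    1: "the engine the car the road the fuel".split(),
}
with_stop_words = c_tf_idf(clusters)
without_stop_words = c_tf_idf(clusters, stop_words={"the"})
print(with_stop_words[0][0][0], without_stop_words[0][0][0])
```

In BERTopic itself, a common way to get this effect is to pass a scikit-learn `CountVectorizer(stop_words="english")` as the `vectorizer_model` argument when constructing `BERTopic`.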


## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, answer this alternative question instead.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each visualization in detail.

In [None]:
# Write your code here
# Then Explain the visualization

# Repeat for the other 2 visualizations as well.

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [None]:
# Write your code here



In [None]:
# Write your answer here (no code needed for this question)
'''
Comparing the results generated by the four topic modeling algorithms (Latent Dirichlet Allocation - LDA, Latent Semantic Analysis - LSA, Non-Negative Matrix Factorization - NMF, and BERTopic) provides valuable insights into their respective strengths and weaknesses. To assess which one is better, it's essential to consider various aspects of their performance.

LDA (Latent Dirichlet Allocation):

Strengths:
*   LDA is a well-established topic modeling technique with proven effectiveness in various text analysis tasks.
*   It provides interpretable topics, each expressed as a probability distribution over words.
*   The number of topics (K) can be controlled, offering flexibility.

Weaknesses:
*   LDA assumes a Bag of Words (BoW) representation of documents, which may not capture semantic nuances effectively.
*   Topics are represented as probability distributions, which can be less informative for some applications.

LSA (Latent Semantic Analysis):

Strengths:
*   LSA identifies latent semantic structures within documents by analyzing word co-occurrences.
*   It can capture synonymy and polysemy effectively, as it reduces dimensionality and extracts latent semantic information.

Weaknesses:
*   LSA's effectiveness depends on the quality of the term-document matrix, which can be noisy and may not handle complex linguistic patterns well.
*   LSA topics are weighted combinations of terms, and the weights can be negative, which makes them harder to read than LDA's word distributions.

NMF (Non-Negative Matrix Factorization):

Strengths:
*   NMF enforces non-negativity constraints on the factorization, making the resulting topics more interpretable.
*   It can be applied to diverse data types, such as text, images, and audio.
*   Topics extracted by NMF are often considered more interpretable than those from LDA.

Weaknesses:
*   The choice of K (number of topics) is a critical hyperparameter, and selecting an appropriate value can be challenging.
*   NMF is sensitive to the initialization of factors and may converge to different solutions.

BERTopic:

Strengths:
*   BERTopic leverages contextual embeddings from pre-trained BERT models, capturing semantic nuances effectively.
*   It does not require extensive preprocessing, making it suitable for diverse text data.
*   The number of topics can be determined based on coherence scores, which can be more data-driven.

Weaknesses:
*   BERTopic may be computationally intensive, especially with large datasets and a large number of topics.
*   It relies on pre-trained language models, which may require significant resources for fine-tuning.

The choice of the best topic modeling algorithm depends on the specific task and data. If interpretability and well-established techniques are a priority, LDA or NMF may be suitable. However, if capturing semantic nuances and leveraging pre-trained language models are crucial, BERTopic may provide better results. Additionally, BERTopic's data-driven approach to determine the number of topics makes it attractive for applications where the optimal topic count is not known in advance.

In conclusion, the best algorithm among LDA, LSA, NMF, and BERTopic depends on the project's goals, the nature of t
'''


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; answer the questions above in as much detail as possible):

'''
Please write your answer here:
The in-class exercise you've shared is designed to help students understand and apply various topic modeling techniques on a text corpus, a crucial skill in natural language processing (NLP). Let's break down the exercise and its components:

Latent Dirichlet Allocation (LDA): This technique assumes that documents are a mixture of topics and that topics are a mixture of words. By applying LDA, you aim to uncover these hidden topic structures within the corpus. The coherence score helps determine the optimal number of topics by measuring the degree of semantic similarity between high scoring words in the topic.

Latent Semantic Analysis (LSA): LSA is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will appear in similar pieces of text.

lda2vec: This method combines the ideas of word embeddings (like word2vec) and topic models (like LDA) to provide a more nuanced topic representation by incorporating word embeddings to capture semantic relationships between words.

BERTopic: This is a recent technique that leverages the BERT embeddings and a class-based TF-IDF to create dense clusters allowing for easily interpretable topics while keeping important words in the topic descriptions.

For each of these techniques, the exercise guides students through the process of applying the algorithm, determining the optimal number of topics, and interpreting the topics that are generated. This hands-on experience is invaluable for understanding the practical applications of these algorithms in text analysis and NLP.

Challenges students might face include understanding the underlying mathematical concepts, fine-tuning models for better coherence scores, or interpreting the topics in a meaningful way.

In terms of relevance, these techniques are foundational in the field of NLP and are used in various applications such as content recommendation, search engine enhancement, customer feedback analysis, and more. Understanding these algorithms and being able to apply them effectively can provide significant insights from large volumes of text data, which is a common challenge in many fields.

Reflective feedback on this exercise would ideally include a student's personal learning experience, specific challenges they encountered, and insights on how these techniques can be applied in their area of interest within NLP or a related field.





'''