## In-class Exercise 4

The purpose of this exercise is to practice topic modeling.
Please use the text corpus you collected in your last in-class exercise for this exercise.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due tonight, November 1st, 2023, at 11:59 PM.
**Late submissions cannot be considered.**

## (1) (10 points) Generate K topics using LDA. Choose the number of topics K by coherence score, then summarize the topics.

You may refer to the code here:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:


import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sample text data
documents = [
    "Your first document goes here.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document or the second?",
    "The last document completes the collection.",
]


# Tokenize and preprocess the text data
nltk.download("stopwords")
nltk.download("punkt")
stop_words = set(stopwords.words("english"))
texts = [
    [word for word in word_tokenize(document) if word.lower() not in stop_words]
    for document in documents
]


# Create a dictionary and document-term matrix
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Determine an appropriate range for the number of topics (K)
min_topics = 2
max_topics = 10

# # Suppress warnings
# import warnings
# warnings.filterwarnings("ignore", category=DeprecationWarning)


coherence_scores = []
for num_topics in range(min_topics, max_topics + 1):
    lda_model = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=50, iterations=100)
    coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence="c_v")
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append(coherence_score)


# Select the K with the highest coherence score
optimal_k = min_topics + coherence_scores.index(max(coherence_scores))

# Train the LDA model with the optimal K
lda_model = gensim.models.LdaModel(corpus, num_topics=optimal_k, id2word=dictionary, passes=50, iterations=100)


# Summarize the topics
topics = lda_model.print_topics(num_words=5)

print(topics)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[(0,
  '0.242*"." + 0.221*"document" + 0.082*"completes" + 0.082*"collection" + 0.082*"last"'),
 (1,
  '0.226*"second" + 0.183*"document" + 0.158*"first" + 0.139*"?" + 0.055*"."')]
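The topics printed above include punctuation tokens such as "." and "?", because the preprocessing step removes stop words but keeps punctuation and letter case. A minimal sketch of a stricter preprocessing step follows. It assumes a regular-expression tokenizer and a small illustrative stop-word set; both are stand-ins for the NLTK calls in the cell above, and the name `preprocess` is hypothetical.

```python
import re

# Illustrative stop-word subset (an assumption); the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "is", "this", "and", "or", "your", "here"}


def preprocess(document, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [token for token in tokens if token not in stop_words]


documents = [
    "Your first document goes here.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document or the second?",
    "The last document completes the collection.",
]

texts = [preprocess(document) for document in documents]
print(texts[0])  # ['first', 'document', 'goes']
```

Passing token lists like these to `corpora.Dictionary` would keep punctuation out of the vocabulary, so it could no longer appear among the top words of a topic.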

## (2) (10 points) Generate K topics using LSA. Choose the number of topics K by coherence score, then summarize the topics.

You may refer to the code here:
https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sample text data
documents = [
    "Your first document goes here.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document or the second?",
    "The last document completes the collection.",
]

# Tokenize and preprocess the text data
nltk.download("stopwords")
nltk.download("punkt")
stop_words = set(stopwords.words("english"))
texts = [
    " ".join([word for word in word_tokenize(document) if word.lower() not in stop_words])
    for document in documents
]

# Create a CountVectorizer and transform the data into a document-term matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

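# Estimate the number of topics (K) from cumulative explained variance.
# Note: this is not a coherence score. Cumulative explained variance generally
# grows as components are added, so this rule tends to pick the largest K in the range.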
explained_variances = []
K_range = range(2, min(X.shape) + 1)
for K in K_range:
    svd = TruncatedSVD(n_components=K)
    X_reduced = svd.fit_transform(X)
    explained_variances.append(svd.explained_variance_ratio_.sum())

optimal_K = K_range[explained_variances.index(max(explained_variances))]

# Train the final LSA model with the optimal K
svd = TruncatedSVD(n_components=optimal_K)
X_reduced = svd.fit_transform(X)

# Summarize the topics
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(svd.components_):
    top_words_idx = topic.argsort()[:-6:-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")



Topic 0: document, first, second, goes, last
Topic 1: completes, last, collection, document, third
Topic 2: goes, first, last, completes, collection
Topic 3: third, document, goes, completes, last
Topic 4: goes, document, second, completes, last


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
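The LSA cell above chooses K by explained variance, while the task asks for a coherence score. One way to score any model's topics is to compute coherence directly from each topic's top words. The sketch below implements a UMass-style coherence (averaged over word pairs) in plain Python. The function name `umass_coherence` is hypothetical, and the tokenized documents are an assumption: the sample sentences with stop words and punctuation removed.

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs):
    """Average UMass-style coherence of one topic's ranked top words.

    For each pair (higher-ranked word, lower-ranked word) the score is
    log((co-document frequency + 1) / document frequency of the higher-ranked word).
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all the given words
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    scores = []
    for higher, lower in combinations(top_words, 2):
        denominator = doc_freq(higher)
        if denominator == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_freq(higher, lower) + 1) / denominator))
    return sum(scores) / len(scores) if scores else float("nan")


# Assumed tokenization of the sample documents (stop words and punctuation removed)
docs = [
    ["first", "document", "goes"],
    ["second", "document"],
    ["third", "document"],
    ["first", "document", "second"],
    ["last", "document", "completes", "collection"],
]

# Top words of two LSA topics as printed in the output above
print(umass_coherence(["document", "first", "second", "goes", "last"], docs))
print(umass_coherence(["completes", "last", "collection", "document", "third"], docs))
```

To select K this way, one would average the per-topic scores for each candidate K and keep the K with the highest mean. Values closer to zero indicate top words that co-occur more often.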


## (3) (10 points) Generate K topics using lda2vec. Choose the number of topics K by coherence score, then summarize the topics.

You may refer to the code here:
https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora
from gensim.models import LdaModel
from gensim.models import CoherenceModel

# Sample text data
documents = [
    "Your first document goes here.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document or the second?",
    "The last document completes the collection.",
]

# Tokenize and preprocess the text data
nltk.download("stopwords")
nltk.download("punkt")
stop_words = set(stopwords.words("english"))
texts = [
    [word for word in word_tokenize(document) if word.lower() not in stop_words]
    for document in documents
]

# Create a dictionary and document-term matrix
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Determine an appropriate range for the number of topics (K)
min_topics = 2
max_topics = 10

coherence_scores = []
for num_topics in range(min_topics, max_topics + 1):
    lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence="c_v")
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append(coherence_score)

# Select the K with the highest coherence score
optimal_k = min_topics + coherence_scores.index(max(coherence_scores))

# Train the final model with the optimal K.
# Note: gensim's LdaModel is used here as a stand-in; this cell does not implement lda2vec.
lda_model = LdaModel(corpus, num_topics=optimal_k, id2word=dictionary)

# Summarize the topics
topics = lda_model.print_topics(num_words=5)  # Adjust the number of words as needed
for topic in topics:
    print(topic)







[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(0, '0.226*"document" + 0.220*"." + 0.129*"first" + 0.127*"goes" + 0.126*"second"')
(1, '0.190*"document" + 0.111*"." + 0.110*"second" + 0.107*"?" + 0.106*"completes"')
(2, '0.211*"." + 0.207*"document" + 0.194*"third" + 0.059*"first" + 0.057*"last"')
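The selection step in the LDA cells maps a list index back to K by adding `min_topics`, which breaks silently if the range and the score list drift apart. A minimal sketch of an alternative keeps each score keyed by its K and resolves ties in favor of the smaller model. The helper `select_best_k` is hypothetical, and the scores below are placeholder values for illustration, not results from this notebook.

```python
def select_best_k(scores_by_k):
    """Return the K with the highest score; ties go to the smaller K."""
    if not scores_by_k:
        raise ValueError("no scores to choose from")
    # Sort key: highest score first, then smallest K
    return min(scores_by_k, key=lambda k: (-scores_by_k[k], k))


# Placeholder values for illustration only; they are not results from this notebook.
example_scores = {2: 0.41, 3: 0.47, 4: 0.47, 5: 0.39}
print(select_best_k(example_scores))  # 3
```

In the training loop, this would mean storing `scores_by_k[num_topics] = coherence_score` instead of appending to a list.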


## (4) (10 points) Generate K topics using BERTopic. Choose the number of topics K by coherence score, then summarize the topics.

You may refer to the code here:
https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [18]:
!pip install bertopic

from bertopic import BERTopic
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Sample text data
documents = [
    "Your first document goes here.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document or the second?",
    "The last document completes the collection.",
]

# Tokenize and preprocess the text data
stop_words = set(stopwords.words("english"))
texts = [
    " ".join([word for word in word_tokenize(document) if word.lower() not in stop_words])
    for document in documents
]

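# BERTopic has no get_coherence() method, so coherence is computed with gensim instead
from gensim import corpora
from gensim.models import CoherenceModel
from umap import UMAP

# Tokenize with CountVectorizer's default analyzer so that topic words match the dictionary
vectorizer = CountVectorizer()
analyzer = vectorizer.build_analyzer()
tokenized_texts = [analyzer(text) for text in texts]
dictionary = corpora.Dictionary(tokenized_texts)


def topic_word_lists(model):
    """Return the top words of every non-outlier topic (topic -1 holds outliers)."""
    return [
        [word for word, _ in words if word]
        for topic_id, words in model.get_topics().items()
        if topic_id != -1
    ]


def build_model(num_topics):
    # BERTopic takes a UMAP instance through umap_model; it has no umap_model_args parameter.
    # n_neighbors and min_topic_size are kept small only because the sample corpus is tiny;
    # on a corpus this small, clustering may still fail or label every document an outlier.
    umap_model = UMAP(n_neighbors=2, n_components=2, metric="cosine", random_state=42)
    return BERTopic(
        embedding_model="bert-base-uncased",
        nr_topics=num_topics,
        umap_model=umap_model,
        min_topic_size=2,
    )


# Fit BERTopic with different topic limits to find the K with the best coherence
min_topics = 2
max_topics = 10
best_coherence_score = float("-inf")
best_num_topics = min_topics

for num_topics in range(min_topics, max_topics + 1):
    model = build_model(num_topics)
    model.fit_transform(texts)

    topic_words = topic_word_lists(model)
    if not topic_words:
        continue  # every document was labeled an outlier

    coherence_score = CoherenceModel(
        topics=topic_words, texts=tokenized_texts, dictionary=dictionary, coherence="c_v"
    ).get_coherence()

    if coherence_score > best_coherence_score:
        best_coherence_score = coherence_score
        best_num_topics = num_topics

# Train the BERTopic model with the best number of topics
model = build_model(best_num_topics)
topics, _ = model.fit_transform(texts)

# Summarize the topics; get_topics() returns {topic_id: [(word, score), ...]}
for topic_id, words in model.get_topics().items():
    print(f"Topic {topic_id}: {', '.join(word for word, _ in words if word)}")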

## (5) (10 extra points) Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.

Follow the guidelines from the essay to enhance your explanation:

* Writing logic

  Pay attention to how you express your thoughts. For example:

  * Weak Writing Logic: “Artificial Intelligence is risky because it is new technology.”

  * Strong Writing Logic: “Artificial Intelligence presents ethical risks such as data privacy concerns and algorithmic bias, which necessitate cautious implementation and regulation.”

* Topic sentences

  * Focus and Direction: It provides a focus and sets the direction for the paragraph, ensuring that the reader knows what to expect.
  * Reader Guidance: It serves as a guidepost for the reader, making it easier to follow the flow of ideas and arguments in the document.
  * Support for Thesis: In academic papers, topic sentences help in elaborating or providing evidence for the thesis statement or research question.

* Writing flow

  * Transition: Smooth and logical transitions between sentences, paragraphs, and sections.
  * Rhythm: Variation in sentence length and structure to maintain reader engagement.
  * Sequence: The order of points or arguments contributes to a smooth reading experience.
  For example:
    * Weak Writing Flow: “We studied machine learning algorithms. Ethics are important. Data was collected.”
    * Strong Writing Flow: “We initiated our study by focusing on machine learning algorithms. Recognizing the ethical implications, we carefully curated our data set.”

In [None]:
Comparing the four topic modeling algorithms (Latent Dirichlet Allocation, LDA; Latent Semantic Analysis, LSA; Non-Negative Matrix Factorization, NMF; and BERTopic) shows that each has distinct strengths and weaknesses. Deciding which one is better requires weighing several aspects of their performance.

**LDA (Latent Dirichlet Allocation)**

* Strengths:
  * LDA is a well-established topic modeling technique that has been applied to many text analysis tasks.
  * It yields interpretable topics, each expressed as a probability distribution over words.
  * The number of topics (K) is set explicitly, which offers flexibility.
* Weaknesses:
  * LDA assumes a bag-of-words representation of documents, so it ignores word order and may miss semantic nuances.
  * Topics are represented as probability distributions, which can be less informative for some applications.

**LSA (Latent Semantic Analysis)**

* Strengths:
  * LSA identifies latent semantic structure within documents by analyzing word co-occurrences.
  * By reducing dimensionality, it can capture synonymy to some degree, because words used in similar contexts end up close together.
* Weaknesses:
  * LSA's effectiveness depends on the quality of the term-document matrix, which can be noisy and may not handle complex linguistic patterns well.
  * It handles polysemy poorly, since each word gets a single representation regardless of sense.
  * Unlike LDA, LSA topics are vectors of word weights that can be negative rather than probability distributions, which makes them harder to interpret.

**NMF (Non-Negative Matrix Factorization)**

* Strengths:
  * NMF enforces non-negativity constraints on the factorization, making the resulting topics more interpretable.
  * It can be applied to diverse data types, such as text, images, and audio.
  * Topics extracted by NMF are often considered more interpretable than those from LDA.
* Weaknesses:
  * The choice of K (number of topics) is a critical hyperparameter, and selecting an appropriate value can be challenging.
  * NMF is sensitive to the initialization of factors and may converge to different solutions.

**BERTopic**

* Strengths:
  * BERTopic uses contextual embeddings from pre-trained transformer models such as BERT, capturing semantic nuances effectively.
  * It does not require extensive preprocessing, making it suitable for diverse text data.
  * The number of topics emerges from clustering the document embeddings rather than being fixed in advance, and it can be reduced afterward with `nr_topics`, which makes the choice more data-driven.
* Weaknesses:
  * BERTopic may be computationally intensive, especially with large datasets and a large number of topics.
  * It relies on pre-trained language models, which are costly to run and may require significant resources to adapt to a specific domain.

The choice of the best topic modeling algorithm depends on the specific task and data. If interpretability and well-established techniques are a priority, LDA or NMF may be suitable. However, if capturing semantic nuances and leveraging pre-trained language models are crucial, BERTopic may provide better results. Additionally, BERTopic's data-driven approach to determining the number of topics makes it attractive for applications where the optimal topic count is not known in advance.

In conclusion, the best algorithm among LDA, LSA, NMF, and BERTopic depends on the project's goals, the nature of the data, and the resources available. It is essential to evaluate their performance in the context of the specific task to make an informed choice.
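A comparison like the one above can be backed with a number computed from the printed topics. The sketch below computes topic diversity, the fraction of unique words among all topics' top words, for the five LSA topics printed in section (2). Topic diversity is an added metric here, not one used in the notebook, and the helper name `topic_diversity` is hypothetical. A low value means the topics largely repeat the same words.

```python
def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics (1.0 = no overlap)."""
    all_words = [word for topic in topics for word in topic]
    if not all_words:
        raise ValueError("no topic words given")
    return len(set(all_words)) / len(all_words)


# Top words of the five LSA topics, copied from the output in section (2)
lsa_topics = [
    ["document", "first", "second", "goes", "last"],
    ["completes", "last", "collection", "document", "third"],
    ["goes", "first", "last", "completes", "collection"],
    ["third", "document", "goes", "completes", "last"],
    ["goes", "document", "second", "completes", "last"],
]

print(f"{topic_diversity(lsa_topics):.2f}")  # 0.32
```

Only 8 distinct words appear across the 25 listed, which suggests that five topics are too many for a five-sentence sample corpus.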








