<a href="https://colab.research.google.com/github/mushfiq-hussain/INFO5731_Exercise_4_Updated.ipynb/blob/main/INFO5731_Exercise_4_Updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 4**

**This exercise provides hands-on experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be chosen by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [3]:
import numpy as np
import pandas as pd
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

# Sample text data
documents = [
    "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.",
    "Data science is related to data mining, machine learning and big data.",
    "Data science is a concept to unify statistics, data analysis and their related methods in order to understand and analyze actual phenomena with data.",
    "Data science employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science and domain knowledge.",
    "Data science is related to data mining, machine learning and big data.",
    "Data science is a 'concept to unify statistics, data analysis and their related methods in order to understand and analyze actual phenomena with data'",
    "Data science employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science and domain knowledge.",
    "Data science is related to data mining, machine learning and big data.",
    "Data science is a 'concept to unify statistics, data analysis and their related methods in order to understand and analyze actual phenomena with data'"
]

# Preprocessing
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    return [token for token in tokens if token.isalnum() and token not in stop_words]

# Tokenize and preprocess documents
processed_docs = [preprocess_text(doc) for doc in documents]

# Create dictionary and document-term matrix
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Determine the optimal number of topics using coherence scores
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    coherence_values = []
    for num_topics in range(start, limit, step):
        model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherence_model.get_coherence())
    return coherence_values

start, limit, step = 2, 10, 1
coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=processed_docs, start=start, limit=limit, step=step)

# Find optimal number of topics: map the best index back to its K value
optimal_num_topics = start + int(np.argmax(coherence_values)) * step

# Generate topics
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=optimal_num_topics)

# Summarize topics
topics = lda_model.show_topics(num_topics=optimal_num_topics, formatted=False)
for i, topic in topics:
    print(f"Topic {i + 1}:")
    words = [word for word, _ in topic]
    print(" ".join(words))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Topic 1:
data science fields mathematics domain techniques computer employs context statistics
Topic 2:
data related analyze understand phenomena order science methods unify statistics
Topic 3:
data science related big statistics analyze machine learning methods unify
Topic 4:
data science statistics methods knowledge actual analysis unify order phenomena
Topic 5:
data science related mining learning machine big statistics methods analyze


## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be chosen by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

# Sample text data
documents = [
    "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.",
    "Data science is related to data mining, machine learning and big data.",
    "Data science is a concept to unify statistics, data analysis and their related methods in order to understand and analyze actual phenomena with data.",
    "Data science employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science and domain knowledge.",
    "Data science is related to data mining, machine learning and big data.",
    "Data science is a 'concept to unify statistics, data analysis and their related methods in order to understand and analyze actual phenomena with data'",
    "Data science employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science and domain knowledge.",
    "Data science is related to data mining, machine learning and big data.",
    "Data science is a 'concept to unify statistics, data analysis and their related methods in order to understand and analyze actual phenomena with data'"
]

# Preprocessing
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    return [token for token in tokens if token.isalnum() and token not in stop_words]

# Create document-term matrix
vectorizer = CountVectorizer(tokenizer=preprocess_text)
X = vectorizer.fit_transform(documents)

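# Apply LSA with scikit-learn (note: X_lsa is not used again below)
lsa = TruncatedSVD(n_components=5, random_state=42)
X_lsa = lsa.fit_transform(X)

# Determine the optimal number of topics using coherence scores
# (note: this scores gensim's LdaModel, not an LSA model, so the topics printed below are LDA topics)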
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    coherence_values = []
    for num_topics in range(start, limit, step):
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherence_model.get_coherence())
    return coherence_values

dictionary = Dictionary([preprocess_text(doc) for doc in documents])
corpus = [dictionary.doc2bow(preprocess_text(doc)) for doc in documents]

start, limit, step = 2, 10, 1
coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=[preprocess_text(doc) for doc in documents], start=start, limit=limit, step=step)

# Find optimal number of topics: map the best index back to its K value
optimal_num_topics = start + int(np.argmax(coherence_values)) * step

# Generate topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=optimal_num_topics)

# Summarize topics
topics = lda_model.show_topics(num_topics=optimal_num_topics, formatted=False)
for i, topic in topics:
    print(f"Topic {i + 1}:")
    words = [word for word, _ in topic]
    print(" ".join(words))



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Topic 1:
data science related learning statistics methods big analyze phenomena mining
Topic 2:
data science related learning statistics methods mining big machine unify
Topic 3:
data science related machine mining big learning actual understand methods
Topic 4:
data analyze phenomena unify order statistics related analysis methods science
Topic 5:
data science methods knowledge uses field scientific extract algorithms unstructured
Topic 6:
science data computer statistics context drawn mathematics theories many within
Topic 7:
data science related learning big statistics methods mining analysis machine
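
The cell above fits `TruncatedSVD` but reports topics from `LdaModel`, so the printed topics are not LSA topics. As a minimal sketch of what reading topics off an LSA decomposition might look like, the block below applies a plain SVD to a small document-term matrix and lists the highest-weight terms of each component. It assumes NumPy is available; the vocabulary, the count matrix, and the `lsa_topics` helper are hypothetical. In gensim, the analogous change would presumably be to use `LsiModel` in place of `LdaModel` inside `compute_coherence_values`.

```python
import numpy as np

# Hypothetical toy vocabulary and document-term count matrix (rows = documents).
vocab = ["data", "science", "mining", "learning", "statistics", "analysis"]
X = np.array([
    [3, 1, 1, 1, 0, 0],   # "data mining / machine learning" style document
    [3, 1, 0, 0, 1, 1],   # "statistics / data analysis" style document
    [3, 1, 1, 1, 0, 0],
    [2, 1, 0, 0, 1, 1],
], dtype=float)

def lsa_topics(X, vocab, k, top_n=3):
    """Return the first k singular values and the top_n terms of each component."""
    # Rows of Vt are right singular vectors: one weight per vocabulary term.
    _, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
    topics = []
    for component in Vt[:k]:
        # Rank terms by absolute weight, since LSA weights can be negative.
        top_idx = np.argsort(-np.abs(component))[:top_n]
        topics.append([vocab[i] for i in top_idx])
    return singular_values[:k], topics

singular_values, topics = lsa_topics(X, vocab, k=2)
for i, words in enumerate(topics, start=1):
    print(f"Topic {i}: {' '.join(words)}")
```

Unlike LDA, the components here are not probability distributions, which is one reason LSA topics are often harder to interpret.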


## Question 3 (10 points):
**Generate K topics using lda2vec. The number of topics K should be chosen by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [6]:
# Install lda2vec
!pip install lda2vec
# Install pyLDAvis
!pip install pyLDAvis

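import pyLDAvis

# NOTE: the lda2vec package as installed here does not export an `LDA2Vec` class
# (see the ImportError below), so this import fails and the LDA2Vec calls in this cell are untested.
from lda2vec import LDA2Vec
import nltk
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents: reuse `documents` and `preprocess_text` from the Question 1 cell
# Preprocess text
processed_docs = [preprocess_text(doc) for doc in documents]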

# Train lda2vec model
model = LDA2Vec(processed_docs, num_topics=10)

# Compute coherence scores
coherence_scores = []
for k in range(2, 12):
    model.num_topics = k
    doc_topic_matrix = model.doc_topic_distr()

    coherence = 0
    for i in range(len(doc_topic_matrix)):
        for j in range(i+1, len(doc_topic_matrix)):
            coherence += cosine_similarity(doc_topic_matrix[i], doc_topic_matrix[j])
    coherence /= len(doc_topic_matrix)

    coherence_scores.append(coherence)

# Determine optimal K
optimal_k = coherence_scores.index(max(coherence_scores)) + 2

# Refit model with optimal K
model.num_topics = optimal_k
model.update_lda()

# Print topics
for k in range(optimal_k):
    top_words = model.get_topic_words(k, top_n=5)
    print(f"Topic {k+1}: {', '.join(top_words)}")

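# Visualize topics (uses the gensim `lda_model`, `corpus` and `dictionary` from the earlier LDA cells)
import pyLDAvis.gensim_models as gensimvis
vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)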



Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Collecting pandas>=2.0.0 (from pyLDAvis)
  Downloading pandas-2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Collecting tzdata>=2022.7 (from pandas>=2.0.0->pyLDAvis)
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
Installing collected packages: funcy, tzdata, pandas, pyLDAvis
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.3
    Uninstalling pandas-1.5.3:
      Successfully uninstalled pandas-1.5.3
ERRO

ImportError: cannot import name 'LDA2Vec' from 'lda2vec' (/usr/local/lib/python3.10/dist-packages/lda2vec/__init__.py)
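
The scoring loop in the cell above aims at an average pairwise cosine similarity between document-topic vectors. A runnable sketch of that computation, independent of lda2vec, is shown below. The document-topic matrix and the function names are hypothetical, and this version divides by the number of pairs rather than the number of documents. Note that this quantity measures how similar the documents' topic mixtures are to each other, which is not the same thing as topic coherence.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pairwise_similarity(doc_topic_matrix):
    """Average cosine similarity over all unordered pairs of rows."""
    pairs = list(combinations(doc_topic_matrix, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical document-topic distributions (each row sums to 1).
doc_topic_matrix = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
]
score = mean_pairwise_similarity(doc_topic_matrix)
print(f"{score:.3f}")
```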

## Question 4 (10 points):
**Generate K topics using BERTopic. The number of topics K should be chosen by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [None]:
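# Install BERTopic
!pip install bertopic
from bertopic import BERTopic
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Sample documents: reuse `documents` and `preprocess_text` from the Question 1 cell.
# Note: BERTopic typically needs a much larger corpus than this sample to form clusters.
docs = documents
tokenized_docs = [preprocess_text(doc) for doc in docs]
dictionary = Dictionary(tokenized_docs)

def topic_words(model, top_n=10):
    """Top words per topic, skipping the outlier topic -1 and words unknown to the dictionary."""
    result = []
    for topic_id, words in model.get_topics().items():
        if topic_id == -1:
            continue
        kept = [word for word, _ in words[:top_n] if word in dictionary.token2id]
        if kept:
            result.append(kept)
    return result

# Compute coherence scores (gensim c_v over each model's topic words)
coherence_scores = []
for k in range(2, 12):
    # nr_topics asks BERTopic to reduce the discovered topics to the requested number
    topic_model = BERTopic(language="english", nr_topics=k)
    topics, probs = topic_model.fit_transform(docs)
    coherence_model = CoherenceModel(topics=topic_words(topic_model), texts=tokenized_docs, dictionary=dictionary, coherence='c_v')
    coherence_scores.append(coherence_model.get_coherence())

# Determine optimal K
optimal_k = coherence_scores.index(max(coherence_scores)) + 2

# Retrain model with optimal K
topic_model = BERTopic(language="english", nr_topics=optimal_k)
topics, probs = topic_model.fit_transform(docs)

# Print topics (get_topics() maps each topic id to a list of (word, score) pairs)
for topic_id, words in topic_model.get_topics().items():
    print(f"Topic {topic_id}: {[word for word, _ in words[:5]]}")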



Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)
  Preparing metadata (setup.py) ... done
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-2.6.1-py3-none-any.whl (163 kB)

## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, do the alternate question.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each visualization in detail.

In [None]:
# Install pyLDAvis
!pip install pyLDAvis

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Visualize the LDA topics
# (assumes the gensim `lda_model`, `corpus` and `dictionary` from the earlier LDA cells are still in memory)
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

### Three visualizations for topics generated by BERTopic or LDA:

1) PyLDAvis Interactive Topic Visualization
PyLDAvis allows interactive exploration of topics, terms and documents. It visualizes topic-term and topic-document relationships using an interactive scatterplot and barcharts. The scatterplot positions topics and terms based on relevance and prevalence. Selecting a topic highlights the most relevant terms and documents.

This allows interactive exploration of topics, seeing which terms make up each topic and which documents are most relevant for a topic.

2) Topic Correlation Heatmap
A heatmap can visualize the correlation between topics generated by an LDA or BERTopic model. More correlated topics are clustered together, while unrelated topics are farther apart.

This provides an overview of topic relationships and can help identify redundant/similar topics.

3) Topic Word Clouds
Individual word clouds can be generated for each topic, with larger and darker words representing words with higher probability/importance for that topic.
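
For the topic correlation heatmap described in item 2, the underlying matrix could be computed as in the sketch below. It uses cosine similarity between topic-word weight vectors as a stand-in for correlation. The topic weights and the names used are hypothetical; in practice they might come from something like `lda_model.show_topics(formatted=False)`.

```python
import math

# Hypothetical topic-word weights for three topics.
topic_weights = {
    "Topic 1": {"data": 0.30, "mining": 0.20, "learning": 0.20},
    "Topic 2": {"data": 0.25, "statistics": 0.25, "analysis": 0.20},
    "Topic 3": {"mining": 0.30, "learning": 0.25, "machine": 0.20},
}

def cosine(u, v):
    """Cosine similarity between two sparse word-weight dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

names = list(topic_weights)
# similarity[i][j] is the similarity between topic i and topic j.
similarity = [[cosine(topic_weights[a], topic_weights[b]) for b in names] for a in names]

# Print the matrix as a simple text grid.
for name, row in zip(names, similarity):
    print(name, " ".join(f"{value:.2f}" for value in row))
```

The resulting matrix could then be passed to a plotting function such as matplotlib's `imshow` or seaborn's `heatmap`. Pairs of topics with high similarity are candidates for merging.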

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [None]:
# Write your code here
Comparison of the four topic modeling algorithms (LSA, LDA, lda2vec, and BERTopic):

LSA (Latent Semantic Analysis):
Uses singular value decomposition to identify latent topics
Topics may not be clearly separated or interpretable
Fast and simple but topics lack coherence

LDA (Latent Dirichlet Allocation):
Generates clear, discrete topics
Topics are generally coherent and interpretable
Slower than LSA
Requires tuning number of topics

lda2vec:
Combines word2vec-style word embeddings with LDA-style document-topic mixtures
Topics are coherent while still being "soft" representations
Allows dynamic topic inference on new documents

BERTopic:
Uses pretrained BERT-based sentence embeddings, clusters them, and extracts topic words with class-based TF-IDF
Topics tend to be coherent and interpretable
Computationally intensive compared to LSA/LDA
Requires parameter tuning (for example, of the clustering step), but no fine-tuning of the embedding model

Overall, I would expect BERTopic to generate the most coherent topics, with lda2vec sitting between BERTopic and LDA.
LDA is a solid baseline, but its topics may not be as sharp as BERTopic's. LSA is fast, but its topics are the hardest to interpret.
In summary, my expected ranking is BERTopic > lda2vec > LDA > LSA in terms of topic coherence and interpretability, with a tradeoff in computation time. Since the lda2vec cell above failed on import, its position reflects the method's design rather than results from this notebook.

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write down your answer in as much detail as possible for the above questions):

'''
Please write your answer here:

Learning experience: Implementing the different topic modeling techniques (LSA, LDA, lda2vec, and BERTopic) gave me a better understanding
of how they work. Going through the entire process of preprocessing, vectorizing, training models, tuning hyperparameters, and assessing topics
was quite informative.

Challenges encountered: I had trouble handling text preparation properly, including lowercasing, tokenizing, and lemmatizing. I also struggled with
dependency management because of installation and import difficulties with lda2vec and BERTopic. It was difficult to compare topic coherence between algorithms.

Relevance to field of study: Topic modeling is widely used in NLP for detecting latent topics and deriving meaningful representations from text corpora.
The methods covered here are basic NLP approaches for unsupervised text analysis.




'''