# **INFO5731 In-class Exercise 4**

**This exercise provides hands-on experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, grant sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [1]:
# Install Gensim, NLTK, and spaCy
!pip install gensim nltk spacy



In [2]:
# Pre-process example text from previous in-class exercise
import gensim
import nltk
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

# Create the lemmatizer once instead of once per token
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """Tokenize, drop stopwords and tokens of 3 characters or fewer, and lemmatize as verbs."""
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatizer.lemmatize(token, pos='v'))
    return result

# Example text corpus
text_corpus = ["I love this book, it is good, amazing and enjoyable",
             "This book was not good, it was great",
             "I found the book okay, not my favorite",
            ]

# Preprocess the documents
processed_corpus = [clean_text(doc) for doc in text_corpus]


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Creating dictionary and corpus
word_dict = gensim.corpora.Dictionary(processed_corpus)
bow_corpus = [word_dict.doc2bow(doc) for doc in processed_corpus]

In [4]:
# LDA and computing coherence
from gensim.models.coherencemodel import CoherenceModel

def evaluate_topic_models(dictionary, corpus, texts, limit, start=2, step=1):
    model_coherences = []
    topic_models = []
    for num_topics in range(start, limit + 1, step):
        model = create_topic_model(corpus=corpus,
                                   id2word=dictionary,
                                   num_topics=num_topics,
                                   random_state=100,
                                   passes=10,
                                   per_word_topics=True)
        topic_models.append(model)
        model_coherences.append(compute_coherence(model=model, texts=texts, dictionary=dictionary, coherence='c_v'))
    return topic_models, model_coherences

def create_topic_model(corpus, id2word, num_topics, random_state, passes, per_word_topics):
    return gensim.models.LdaMulticore(corpus=corpus,
                                      id2word=id2word,
                                      num_topics=num_topics,
                                      random_state=random_state,
                                      passes=passes,
                                      per_word_topics=per_word_topics)

def compute_coherence(model, texts, dictionary, coherence):
    coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence=coherence)
    return coherencemodel.get_coherence()

# range for the number of topics
start, limit, step = 2, 10, 1
topic_models, model_coherences = evaluate_topic_models(word_dict, bow_corpus, processed_corpus, start=start, limit=limit, step=step)

# model with the highest coherence score
max_coherence_val = max(model_coherences)
model_index = model_coherences.index(max_coherence_val)
selected_model = topic_models[model_index]
model_optimal = start + step * model_index

print(f"Optimal Number of Topics: {model_optimal}, Coherence Score: {max_coherence_val}")


Optimal Number of Topics: 3, Coherence Score: 0.305309568966698


In [5]:
# Summarization of topics
def display_topics(model):
    topics = model.show_topics(formatted=False)
    print("Optimal Model's Topics:")
    for num, topic in topics:
        print(f"Topic {num}: {[word[0] for word in topic]}")

display_topics(selected_model)

Optimal Model's Topics:
Topic 0: ['book', 'great', 'good', 'okay', 'favorite', 'enjoyable', 'love', 'amaze']
Topic 1: ['favorite', 'okay', 'book', 'good', 'great', 'amaze', 'love', 'enjoyable']
Topic 2: ['good', 'book', 'love', 'amaze', 'enjoyable', 'favorite', 'okay', 'great']
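
The three topics above contain the same eight words in different orders, which suggests that a three-document corpus is too small for the best coherence score to be a strong signal. A minimal sketch of listing every candidate K with its score, rather than only the maximum, is shown below. The helper name `rank_topic_counts` and the example scores are hypothetical and are not results from this notebook; in practice the list would be `model_coherences` from the cell above.

```python
def rank_topic_counts(coherences, start=2, step=1):
    """Pair each coherence score with its topic count K, best score first."""
    pairs = [(start + i * step, score) for i, score in enumerate(coherences)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hypothetical scores for K = 2, 3, 4 (illustrative values only)
example_scores = [0.28, 0.31, 0.30]
ranking = rank_topic_counts(example_scores, start=2, step=1)

for k, score in ranking:
    print(f"K={k}: coherence={score:.3f}")
```

If the top few scores are nearly tied, the choice of K is better justified by reading the topics than by the score alone.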


## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [6]:
# Required libraries
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models.coherencemodel import CoherenceModel
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')

# Create the lemmatizer once instead of once per token
lemmatizer = WordNetLemmatizer()

# Function to preprocess text
def preprocess_document(text):
    tokens = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            tokens.append(lemmatizer.lemmatize(token, pos='v'))
    return tokens

# Sample text corpus
document_samples = [
    "I love this book, it is good, amazing and enjoyable",
    "This book was not good, it was great",
    "I found the book okay, not my favorite"
]

# Preprocess the corpus
clean_corpus = [preprocess_document(doc) for doc in document_samples]

# Create a dictionary and corpus
lexicon = corpora.Dictionary(clean_corpus)
vector_corpus = [lexicon.doc2bow(doc) for doc in clean_corpus]

# LSA and computing coherence
def conduct_lsa(corpus, id2word, documents, num_topics):
    lsi = models.LsiModel(corpus, id2word=id2word, num_topics=num_topics)
    coherence_lsi = CoherenceModel(model=lsi, texts=documents, dictionary=id2word, coherence='c_v')
    return lsi, coherence_lsi.get_coherence()

def determine_coherence_values(corpus, id2word, documents, start, end, step):
    coherence_scores = []
    lsa_models = []
    for num_topics in range(start, end, step):
        model, coherence = conduct_lsa(corpus, id2word, documents, num_topics)
        lsa_models.append(model)
        coherence_scores.append(coherence)
    return lsa_models, coherence_scores

# Example usage
start, end, step = 2, 10, 1
lsa_models, coherence_scores = determine_coherence_values(corpus=vector_corpus, id2word=lexicon, documents=clean_corpus, start=start, end=end, step=step)

# Find the model with the highest coherence score
best_score_index = coherence_scores.index(max(coherence_scores))
best_lsa_model = lsa_models[best_score_index]
best_topic_number = start + (best_score_index * step)

# Summarize the topics
print(f"Best Number of Topics: {best_topic_number}")
topics_summary = best_lsa_model.show_topics(num_topics=best_topic_number)
for num, topic in topics_summary:
    print(f"Topic {num}: {topic}")



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Best Number of Topics: 2
Topic 0: 0.633*"book" + 0.500*"good" + 0.303*"enjoyable" + 0.303*"love" + 0.303*"amaze" + 0.197*"great" + 0.133*"favorite" + 0.133*"okay"
Topic 1: -0.545*"favorite" + -0.545*"okay" + -0.329*"book" + 0.286*"amaze" + 0.286*"enjoyable" + 0.286*"love" + 0.216*"good" + -0.071*"great"
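
LSA topic weights are signed, and the overall sign of a component is arbitrary, so a topic is usually easier to read as two opposing groups of words than as one ranked list. The sketch below splits the printed weights of Topic 1 into its positive and negative poles. The helper name `split_by_sign` is hypothetical; the (word, weight) pairs are copied from the output above, and it is assumed that `best_lsa_model.show_topic(1)` would return pairs in the same shape.

```python
def split_by_sign(weighted_terms):
    """Split (word, weight) pairs into positive and negative poles, strongest first."""
    positive = sorted(
        [(w, v) for w, v in weighted_terms if v > 0], key=lambda pair: -pair[1]
    )
    negative = sorted(
        [(w, v) for w, v in weighted_terms if v < 0], key=lambda pair: pair[1]
    )
    return positive, negative

# Weights copied from the printed "Topic 1" above
topic_1 = [
    ("favorite", -0.545), ("okay", -0.545), ("book", -0.329), ("amaze", 0.286),
    ("enjoyable", 0.286), ("love", 0.286), ("good", 0.216), ("great", -0.071),
]

positive, negative = split_by_sign(topic_1)
print("positive pole:", [w for w, _ in positive])
print("negative pole:", [w for w, _ in negative])
```

Read this way, Topic 1 contrasts the enthusiastic first review (amaze, enjoyable, love) with the lukewarm third review (favorite, okay).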


## Question 3 (10 points):
**Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [32]:
!git clone https://github.com/cemoody/lda2vec.git
%cd lda2vec
!pip install -e .



fatal: destination path 'lda2vec' already exists and is not an empty directory.
/content/lda2vec
Obtaining file:///content/lda2vec
  Preparing metadata (setup.py) ... done
Collecting chainer>=1.5.1 (from lda2vec==0.1)
  Using cached chainer-7.8.1.tar.gz (1.0 MB)
  Preparing metadata (setup.py) ... done
Collecting sklearn (from lda2vec==0.1)
  Using cached sklearn-0.0.post12.tar.gz (2.6 kB)
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above,

In [33]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Make sure to download this model first

text_corpus = [
    "I love this book, it is good, amazing and enjoyable",
    "This book was not good, it was great",
    "I found the book okay, not my favorite"
]

# Tokenize and preprocess
preprocessed_corpus = []
for doc in text_corpus:
    tokens = [token.lemma_ for token in nlp(doc) if token.is_alpha and not token.is_stop]
    preprocessed_corpus.append(tokens)


In [34]:
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary

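# Placeholder for topic words extracted from an lda2vec model.
# Note: these placeholder strings do not occur in `dictionary`, so CoherenceModel
# cannot map them to token ids; replace them with real topic words from the corpus vocabulary.
topic_words = [["word1_topic1", "word2_topic1"], ["word1_topic2", "word2_topic2"]]
# Convert preprocessed_corpus into a Gensim dictionary and bag-of-words corpus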
dictionary = Dictionary(preprocessed_corpus)
corpus = [dictionary.doc2bow(text) for text in preprocessed_corpus]

# Evaluate coherence
coherence_model = CoherenceModel(topics=topic_words, texts=preprocessed_corpus, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print("Coherence Score:", coherence_score)


ValueError: unable to interpret topic as either a list of tokens or a list of ids
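
The ValueError above is consistent with the placeholder strings not appearing in the dictionary built from the corpus. One way to guard against this, sketched below under that assumption, is to keep only topic words that occur in the vocabulary before passing them as `topics=` to `CoherenceModel`. The helper name `keep_known_tokens` and the small vocabulary list are hypothetical; in the cell above, the vocabulary would come from `dictionary.token2id`.

```python
def keep_known_tokens(topics, vocabulary):
    """Drop topic words missing from the vocabulary; drop topics left empty."""
    vocab = set(vocabulary)
    cleaned = []
    for words in topics:
        known = [word for word in words if word in vocab]
        if known:
            cleaned.append(known)
    return cleaned

# Hypothetical stand-in for the keys of dictionary.token2id
vocabulary = ["book", "good", "love", "great", "okay", "favorite"]

# The first topic uses the placeholder strings; the second mixes known and unknown words
candidate_topics = [["word1_topic1", "word2_topic1"], ["book", "good", "unseen"]]

usable_topics = keep_known_tokens(candidate_topics, vocabulary)
print(usable_topics)
```

An empty result signals that no real topic words were supplied, which is the situation in the cell above.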

## Question 4 (10 points):
**Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [19]:
!pip install bertopic



In [35]:
from bertopic import BERTopic

# Define your documents
document_samples = [
    "I love this book, it is good, amazing and enjoyable",
    "This book was not good, it was great",
    "I found the book okay, not my favorite"
]

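# Create the topic model. Note: umap_model=None does not disable UMAP;
# BERTopic falls back to its default UMAP model when none is passed.
topic_model = BERTopic(umap_model=None, verbose=True)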

# Fit the model to the corpus
try:
    topics, probabilities = topic_model.fit_transform(document_samples)
except Exception as e:
    print(f"An error occurred: {e}")

# Print topics (assumes fit_transform above succeeded, so `topics` holds the topic assignments)
for topic_num in set(topics):
    if topic_num != -1:  # Exclude outliers
        print(f"Topic {topic_num}: {topic_model.get_topic(topic_num)}\n")


2024-03-30 04:20:25,406 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2024-03-30 04:20:26,238 - BERTopic - Embedding - Completed ✓
2024-03-30 04:20:26,241 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


An error occurred: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.




TypeError: 'NoneType' object is not iterable
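
Once a BERTopic model fits successfully on a larger corpus, choosing K by coherence requires plain word lists per topic. The sketch below converts a mapping of topic id to (word, score) pairs into such lists and skips the outlier topic -1. It assumes the fitted model exposes its topics in that shape (as `get_topics()` is understood to do); the helper name `topic_word_lists` and the example table are hypothetical and are not output from this notebook.

```python
def topic_word_lists(topic_table, top_n=10):
    """Turn {topic_id: [(word, score), ...]} into word lists, skipping outlier topic -1."""
    return [
        [word for word, _ in pairs[:top_n]]
        for topic_id, pairs in sorted(topic_table.items())
        if topic_id != -1
    ]

# Hypothetical topic table (illustrative values only)
example_table = {
    -1: [("the", 0.10), ("and", 0.08)],
    0: [("book", 0.42), ("good", 0.31), ("great", 0.22)],
    1: [("okay", 0.37), ("favorite", 0.33)],
}

word_lists = topic_word_lists(example_table, top_n=2)
print(word_lists)
```

The resulting lists could then be passed as `topics=` to gensim's `CoherenceModel`, in the same way as the lda2vec cell above attempts.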

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [None]:
# I found lda2vec and BERTopic very complex, and I could not understand the code well enough to run them on my sample text.
# Of the two models that produced results, the topics generated by LDA appear more interpretable.
# LSA also provides insight into the data, but its topics are harder to interpret than LDA's.
# However, the numerical values attached to the LSA topic words (e.g., 0.633, 0.500)
# indicate the importance of each word within the topic: a higher value means greater importance and a lower value less.
# This is an advantage over the LDA output shown here, which lists only the words.

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help in grasping the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:





'''