**Yasir Ahmed Siddiqui _241ADM037**

**Assignment for Topic**

In this assignment, you will work with the given brown-small corpus (which is a small version of the Brown corpus).

Task 1

Perform topic modeling using LDA with 10 topics. Name your topics using an LLM. (No plotting of any kind is required.)

Task 2

Perform topic modeling using BERTopic. Name your topics using an LLM. Plot one scatterplot visualizing your results.

Now compare the topics extracted in Task 1 and Task 2 and draw conclusions.


Note that in your code you are required to use only those function libraries that were used in previous lectures and nothing else.

In [None]:
!pip install numpy==1.24.4 scipy==1.10.1 gensim==4.3.1


In [None]:
import nltk
import re
import gensim



nltk.download('brown')
nltk.download('stopwords')
nltk.download('punkt')


stopwords = nltk.corpus.stopwords.words('english')

# custom domain-specific stopwords
custom_stopwords = [
    'said', 'says', 'also', 'would', 'whose', 'well', 'non',
    'may', 'go', 'goes', 'going', 'went', 'like', 'cannot',
    'however', 'u'
]
stopwords.extend(custom_stopwords)

# Define the split and normalize function
def split_and_normalize(docs, stop):
    lists_of_words = []
    for doc in docs:
        # Split on non-word characters
        words = re.split(r'\W+', doc)
        # Lowercase and remove stopwords + non-alpha words
        words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop]
        lists_of_words.append(words)
    return lists_of_words

# Load the Brown corpus (e.g., 'news' category)
from nltk.corpus import brown
docs = [' '.join(sent) for sent in brown.sents(categories='news')]

# Process the documents
cleaned_docs = split_and_normalize(docs, stopwords)

# Create a dictionary mapping for all words
dictionary = gensim.corpora.Dictionary(cleaned_docs)

# Filter out extreme tokens
dictionary.filter_extremes(no_below=2, no_above=0.5)

print("Number of unique words after filtering:", len(dictionary))

# Create Bag-of-Words representation for each document
corpus_bow = [dictionary.doc2bow(doc) for doc in cleaned_docs]

# Train the LDA model with 10 topics
lda_model = gensim.models.LdaMulticore(
    corpus=corpus_bow,
    id2word=dictionary,
    num_topics=10,       # Change this to desired number of topics
    passes=20,
    iterations=500,
    workers=2,           # Adjust based on CPU
    chunksize=1000,
    eval_every=None,
    random_state=1
)

#Display the topics
for i in range(lda_model.num_topics):
    terms = [word for word, prob in lda_model.show_topic(i, topn=10)]
    print(f"Topic #{i}: {', '.join(terms)}")




In [None]:
import getpass
# Prompt for the Groq API key; never put the key (or any part of it) in the prompt text
api_key = getpass.getpass("Enter your Groq API key: ")


In [None]:
lda_model.print_topics(num_words=10)


In [None]:
!pip install groq


In [None]:
#Format the topics and call Groq's LLaMA model to name them using groq SDK
import sys      # Needed for sys.exit below
import getpass  # For secure API key input
from groq import Groq

# Securely enter the API key (the prompt text must not contain the key itself)
api_key = getpass.getpass("Enter your Groq API key: ").strip()

# === Validate API key before continuing ===
if not api_key or not api_key.startswith("gsk_"):
    sys.exit("Invalid or missing API key. Enter a valid key that starts with 'gsk_'.")

# === Initialize Groq client ===
try:
    client = Groq(api_key=api_key)
except Exception as e:
    sys.exit(f"Failed to initialize Groq client: {e}")

# === Extract topics from LDA ===
topics_keywords = lda_model.show_topics(num_topics=10, num_words=10, formatted=False)
formatted_topics = [
    f"Topic {i}: {', '.join(word for word, _ in keywords)}"
    for i, keywords in topics_keywords
]
prompt_body = "\n".join(formatted_topics)

# === Prepare prompts ===
system_prompt = "You are an expert in topic modeling and academic writing."
user_prompt = (
    "Below are 10 topics extracted from a topic model (LDA), each with its top 10 keywords:\n\n"
    f"{prompt_body}\n\n"
    "Please provide a clear, concise, and academic-style name for each topic.\n"
    "Return the results in the format: Topic X – [Descriptive Name]\n"
)

# === Query Groq LLaMA-3 model ===
try:
    response = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.5,
        max_tokens=500
    )
    output = response.choices[0].message.content
except Exception as e:
    sys.exit(f"Error while querying Groq API: {e}")

# === Output ===
print("Named Topics:\n")
print(output)
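
The naming prompt asks the model to answer with lines of the form `Topic X – [Descriptive Name]`. To compare the names from both tasks programmatically, the response text can be parsed into a dictionary. The following is a minimal sketch that assumes the model follows the requested format; `parse_topic_names` and the sample response are hypothetical and are not output from the actual model.

```python
import re

# Matches lines such as "Topic 3 – Local Government"; an en dash, a hyphen,
# or a colon is accepted as the separator (assumption: one topic per line).
TOPIC_LINE = re.compile(r"^\s*\**Topic\s+(\d+)\**\s*[–:-]\s*(.+?)\s*$")

def parse_topic_names(llm_output):
    """Return {topic_id: name} for every line that follows the requested format."""
    names = {}
    for line in llm_output.splitlines():
        match = TOPIC_LINE.match(line)
        if match:
            # Drop optional square brackets or markdown asterisks around the name
            names[int(match.group(1))] = match.group(2).strip("[]* ")
    return names

# Illustrative response text only; it is not output from the actual model.
sample_output = """Here are the names:
Topic 0 – Municipal Politics
Topic 1 – [Sports Results]
Topic 2: Education Funding"""

topic_names = parse_topic_names(sample_output)
print(topic_names)
```

In the notebook, the same function could be applied to `output` (LDA) and `bertopic_named_output` (BERTopic). Lines that do not match the pattern are skipped silently, so it is worth checking that the number of parsed names equals the number of topics.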

In [None]:
# Install or upgrade necessary libraries
!pip install --upgrade numpy
!pip install --upgrade bertopic scikit-learn umap-learn

#Restart the runtime after the installation (if using Jupyter/Colab)
#os._exit(0) terminates the kernel process so the upgraded packages are loaded on restart
import os
os._exit(0)


In [None]:
!pip install -q bertopic groq nltk scikit-learn umap-learn

import nltk
from nltk.corpus import brown
from bertopic import BERTopic
from groq import Groq
import getpass
import sys

nltk.download('brown')
nltk.download('punkt')

#Prepare text data
docs = [' '.join(sent) for sent in brown.sents(categories='news')]

#Train BERTopic model
topic_model = BERTopic(language="english", verbose=True)
topics, probs = topic_model.fit_transform(docs)

#Extract topic keywords
topic_info = topic_model.get_topic_info()
topic_keywords = topic_info[topic_info.Topic != -1][['Topic', 'Representation']]

#Format topic descriptions
formatted_topics = []
for _, row in topic_keywords.iterrows():
    topic_num = row['Topic']
    keywords = row['Representation'][:10]  # Already a list
    formatted_topics.append(f"Topic {topic_num}: {', '.join(keywords)}")

prompt_body = "\n".join(formatted_topics)

#Define prompts
system_prompt = "You are an expert in topic modeling and academic writing."
user_prompt = (
    "Below are topics extracted using BERTopic, each with its top 10 keywords:\n\n"
    f"{prompt_body}\n\n"
    "Please provide a clear, concise, and academic-style name for each topic.\n"
    "Return the results in the format: Topic X – [Descriptive Name]\n"
)

#API key input for Groq
api_key = getpass.getpass("Enter your Groq API key: ").strip()  # Never embed the key in the prompt text
if not api_key or not api_key.startswith("gsk_"):
    sys.exit("Invalid or missing API key. Please enter a valid key that starts with 'gsk_'.")

#Query Groq to name topics
try:
    client = Groq(api_key=api_key)
    response = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0,
        max_tokens=300
    )
    bertopic_named_output = response.choices[0].message.content
except Exception as e:
    sys.exit(f"Error while querying Groq API: {e}")

#Output BERTopic-named topics
print("Named Topics from BERTopic:\n")
print(bertopic_named_output)

In [None]:
#Visualize only the top 10 topics using a scatter plot
print("Generating scatter plot for Top 10 Topics...")

#Get the top 10 topic IDs (excluding -1, the outlier topic)
top_topics = topic_model.get_topic_info().head(11)['Topic'].tolist()
top_topics = [t for t in top_topics if t != -1][:10]  # Ensure only 10 topics and no outliers

#Generate scatter plot for the selected topics only
fig = topic_model.visualize_topics(topics=top_topics)
fig.show()


| Feature                 | Task 1: LDA                               | Task 2: BERTopic                                       |
| ----------------------- | ----------------------------------------- | ------------------------------------------------------ |
| **Algorithm**           | Latent Dirichlet Allocation (Gensim)      | Transformer embeddings + UMAP + clustering (BERTopic)  |
| **Text Representation** | Bag-of-words counts                       | Contextual embeddings (transformer-based)              |
| **Topic Coherence**     | Often generic (driven by co-occurrence)   | Usually more specific and human-readable               |
| **Custom Topic Names**  | Named via LLaMA from the top 10 keywords  | Named via LLaMA from the topic representations         |
| **Topic Overlap**       | Topics can share many keywords            | Topics tend to be more distinct                        |
| **Visualization**       | No built-in plotting in Gensim            | Built-in scatterplot (`visualize_topics`)              |
| **Best For**            | Simpler datasets, limited compute         | Rich, contextual datasets (e.g., news)                 |
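
The "Topic Overlap" row can be checked numerically rather than by eye. One simple option is topic diversity: the fraction of unique words among the top keywords of all topics, where 1.0 means no keyword is shared between topics. The sketch below is only an illustration; `topic_diversity` is a hypothetical helper and the keyword lists are toy data, not results from this notebook.

```python
def topic_diversity(topics, top_n=10):
    """Fraction of unique words among the top_n words of all topics (1.0 = no overlap)."""
    top_words = [word for topic in topics for word in topic[:top_n]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)

# Toy keyword lists for illustration only (not actual model output)
overlapping = [["state", "city", "year"], ["state", "year", "school"]]
distinct = [["state", "city", "year"], ["game", "team", "season"]]

print(topic_diversity(overlapping, top_n=3))  # 4 unique words out of 6
print(topic_diversity(distinct, top_n=3))     # 6 unique words out of 6
```

With the models trained above, the keyword lists could be built from `lda_model.show_topic(i, topn=10)` for LDA and `topic_model.get_topic(t)` for BERTopic, keeping only the word from each `(word, score)` pair.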


**Conclusion:** Comparing LDA (Task 1) with BERTopic (Task 2) reveals several key differences. BERTopic clusters transformer-based sentence embeddings after dimensionality reduction, so it groups sentences by meaning rather than by word co-occurrence alone, and its topics tend to be more contextually nuanced than those LDA derives from bag-of-words counts. BERTopic is often reported to produce more coherent and better-separated topics, and its built-in visualizations, such as the scatterplot above, make the results easier to interpret; Gensim's LDA offers no comparable built-in plotting.

In both tasks the topics were named by a LLaMA model from their top keywords. Because BERTopic keywords are usually more specific, the resulting labels tend to be more precise and human-readable. LDA, in contrast, needed explicit preprocessing (stopword removal and filtering of rare and very frequent tokens) and a fixed number of topics chosen in advance before its output could be interpreted. LDA remains a valuable tool for large-scale text analysis, especially when computational resources are limited, but BERTopic stands out for its ability to capture deeper semantic relationships and to provide more interpretable topics from complex text datasets.
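
To make the comparison between the two sets of topics more concrete, each LDA topic can be matched to the BERTopic topic whose keywords overlap with it the most. The following is a rough sketch using Jaccard similarity on keyword sets; `jaccard` and `best_matches` are hypothetical helpers, the keyword lists are toy data rather than results from this notebook, and `bert_topics` is assumed to be non-empty.

```python
def jaccard(a, b):
    """Jaccard similarity between two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matches(lda_topics, bert_topics):
    """For each LDA topic id, return (best-matching BERTopic topic id, similarity)."""
    matches = {}
    for lda_id, lda_words in lda_topics.items():
        scores = {bert_id: jaccard(lda_words, words)
                  for bert_id, words in bert_topics.items()}
        best_id = max(scores, key=scores.get)
        matches[lda_id] = (best_id, scores[best_id])
    return matches

# Toy keyword lists for illustration only (not actual model output)
lda_topics = {0: ["school", "board", "education", "state"],
              1: ["game", "team", "season", "coach"]}
bert_topics = {0: ["team", "game", "league", "season"],
               1: ["school", "education", "teachers", "students"]}

matches = best_matches(lda_topics, bert_topics)
for lda_id, (bert_id, score) in matches.items():
    print(f"LDA topic {lda_id} best matches BERTopic topic {bert_id} (Jaccard {score:.2f})")
```

A low best-match score for an LDA topic suggests that BERTopic found no close counterpart, which is one way to spot topics that only one of the two methods recovers.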

