<a href="https://colab.research.google.com/github/ms624atyale/NLP/blob/main/13_TopicModeling_LDA_BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color = 'red'> 🐹 👀 🐾 **Text/Content/Web Scraping without HTML tags**

## **API-based Data Collection**

### <font color = 'blue'> **cf., Crawling (a.k.a. HTML Scraping) or Text Mining**

In [None]:
!pip install requests

import requests #Import the requests library to make HTTP requests.

def get_wikipedia_page(title):                   #Define a function
    URL = "https://en.wikipedia.org/w/api.php"  #Set the API (application programming interface) endpoint URL.

    PARAMS = {                                  #Build PARAMS (query parameters) for the API request:
        "action": "query",                      #ask the API to run a query
        "format": "json",                       #request a JSON response
        "prop": "extracts",                     #ask for the page extract (the page content as text)
        "titles": title,                        #specify which page to fetch (by title)
        "explaintext": 1                        #return plain text (no HTML/markup)
    }

    # IMPORTANT: Do not spoof a browser User-Agent. Bots that pretend to be browsers can get error responses or be blocked.
    # The commented-out header below is an example of what NOT to send:
    #headers = {
    #    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 " #mimics a normal browser request
    #                  "(KHTML, like Gecko) Chrome/123.0 Safari/537.36"
    #}

    headers = {
        "User-Agent": "MyNLPProject (education use)"   #Identify your script honestly instead
        }
    response = requests.get(URL, params=PARAMS, headers=headers, timeout=30)  #Send a GET request to the API with requests (give up after 30 seconds)

    if response.status_code != 200:                                        #Check the HTTP status code: If not 200 OK, print an error message and return None.
        print("HTTP error:", response.status_code)
        return None

    try:
        data = response.json()                                            #Try to parse the response body as JSON with response.json().
    except ValueError:                                                    #requests raises a ValueError subclass when the body is not valid JSON.
        print("JSON decode error")                                        #If JSON decoding fails, print a debug message showing the start of the raw response and return None.
        print("Raw response:", response.text[:500])
        return None

    pages = data.get("query", {}).get("pages", {})                      #Navigate the JSON structure to the page data: data["query"]["pages"] (a dictionary keyed by page id).
    page = next(iter(pages.values()), {})                               #Take the single page object (its id is not known in advance); fall back to {} if no page came back.
    return page.get("extract", "")                                      #Return the page's plain-text extract via page.get("extract", "").
                                                                        #If the page exists, this is the article text; if not, it is an empty string (None is returned earlier if an error occurred).
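
The parsing steps at the end of `get_wikipedia_page` can be tried without any network access. The sketch below assumes the response shape that the function navigates (`query`, then `pages`, then one entry keyed by page id). The sample dictionaries and the helper name `extract_from_response` are made up for illustration; they are not real API output.

```python
def extract_from_response(data):
    """Return the plain-text extract from a parsed API response, or "" if absent."""
    pages = data.get("query", {}).get("pages", {})
    page = next(iter(pages.values()), {})  # first (only) page, or {} if there is none
    return page.get("extract", "")

# Hypothetical sample responses (hand-written, not fetched from Wikipedia)
found = {"query": {"pages": {"12345": {"title": "K-pop", "extract": "K-pop is ..."}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}
empty = {}

print(repr(extract_from_response(found)))
print(repr(extract_from_response(missing)))
print(repr(extract_from_response(empty)))
```

Using `.get()` with defaults at every level means a missing page or an unexpected response yields an empty string instead of an exception.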

# 🐹🐾 **Install NLTK and download the necessary models**
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')


# 🐹🐾 **1️⃣ Pandas Library**
!pip install pandas
!pip install lexical_diversity
import pandas as pd #Import the pandas package
from lexical_diversity import lex_div as ld #The lex_div module holds the diversity measures (e.g., ld.ttr)


# 🅰️ **Group1**
# ✅ **Text scraping for Group1**
titles = [
    "K-pop",
    "Korean Wave",
    "KPop Demon Hunters",
    "BTS"
]

corpus = {}

for t in titles:
    txt = get_wikipedia_page(t)
    if txt:
        corpus[t] = txt
    else:
        print("Failed:", t)

# Show first 200 chars for each
for title, text in corpus.items():
    print("\n====", title, "====")
    print(text[:200])

# 🐹 🐾 📌 **Use this!!!**📌
# ⭕ <font color = 'green'> **Script for [Group1]: Create one Txt file with records separated by @@@@@**

output = []

for title in titles:
    txt = get_wikipedia_page(title)
    if not txt:
        txt = ""   # store empty if missing
    block = f"@@@@@\nTITLE: {title}\n{txt}\n"
    output.append(block)

final = "\n".join(output)

with open("wiki_corpus_delimited_group1.txt", "w", encoding="utf-8") as f:
    f.write(final)

print("Saved: wiki_corpus_delimited_group1.txt")
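
Once the file is written, the `@@@@@` delimiter and the `TITLE:` line make it possible to recover each record. The following is a minimal sketch, assuming the block layout produced by the script above; `parse_delimited` is a hypothetical helper and the sample string is made up so that the example runs without the real file.

```python
def parse_delimited(text, delimiter="@@@@@"):
    """Split a delimited corpus string into a {title: body} dictionary."""
    records = {}
    for block in text.split(delimiter):
        block = block.strip()
        if not block:
            continue  # skip the empty piece before the first delimiter
        first_line, _, body = block.partition("\n")
        if first_line.startswith("TITLE:"):
            title = first_line[len("TITLE:"):].strip()
        else:
            title = first_line.strip()
        records[title] = body.strip()
    return records

# Made-up sample in the same layout as the saved file
sample = "@@@@@\nTITLE: K-pop\nFirst article text.\n\n@@@@@\nTITLE: BTS\nSecond article text.\n"
records = parse_delimited(sample)
print(list(records))
```

This only works on the uncleaned file: the cleaning step later replaces newlines with spaces, so the title is no longer on its own line.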


# 🐹🐾 **Read the txt file**
# 🐣 **Open and read the text file for Group1**

# ▶️ Step 1: Modify the path in this line as needed 🍎🍎🍎🍎🍎
file = open("/content/wiki_corpus_delimited_group1.txt", 'rt', encoding="utf-8")

txt = file.read()
print(txt)
file.close() #close() releases the file handle once you have finished reading the file opened with open().



##🐹🐾 ❄️ **Basic Cleaning**
###**📍Apply a series of functions for replacement in Group1**

import re

# STEP 1: Read the file; change the path as needed 🍎🍎🍎🍎🍎🍎
with open("/content/wiki_corpus_delimited_group1.txt", 'rt', encoding="utf-8") as fl:
    raw_text = fl.read()

# STEP 2: Clean the text
clean_text = (
    raw_text
    .replace("\n", " ")
    .replace("“", "")
    .replace("”", "")
    .replace("\"", "")
    .replace("/", "")
    .replace("_", "")
    .replace("===", "")
    .replace("==", "")
    .replace("=", "")
    .replace("*", "")
    .replace("?", "")
    .replace("!", "")
    .replace("--", " ")
    .replace("(", "")
    .replace(")", "")
)

# STEP 3: Save the cleaned content to a NEW file at the output path you choose 🍏🍏🍏🍏🍏🍏
output_path = "/content/wiki_corpus_delimited_group1_CLEANED.txt"
with open(output_path, 'w', encoding="utf-8") as cf:
    cf.write(clean_text) #Write the contents of 'clean_text' to the new file

# Optional: Print to verify
print("✅ Cleaned text saved to:", output_path)
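
The cleaning cell imports `re` but does the work with chained `.replace()` calls. As a sketch of the regular-expression alternative, the function below removes the same characters in the same order; `clean` is a hypothetical helper name and the sample string is made up.

```python
import re

def clean(raw_text):
    """Regex version of the chained .replace() cleaning, keeping the same order of steps."""
    text = raw_text.replace("\n", " ")           # newlines become spaces
    text = re.sub(r'[“”"/_=*?!]', "", text)      # drop quotes, slashes, underscores, =, *, ?, !
    text = text.replace("--", " ")               # double hyphens become a space
    return re.sub(r"[()]", "", text)             # finally drop parentheses

sample = '== History ==\n"K-pop" (Korean pop) -- what is it?'
print(clean(sample))
```

A character class states the list of removed characters in one place, which makes it easier to see (and change) what the cleaning step discards.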


# ✅ ✅**Text scraping for Group2**
titles = [
    "2024 Nobel Prize in Literature",
    "Han Kang",
    "Bong Joon Ho",
    "Pachinko"
]

corpus = {}

for t in titles:
    txt = get_wikipedia_page(t)
    if txt:
        corpus[t] = txt
    else:
        print("Failed:", t)

# Show first 200 chars for each
for title, text in corpus.items():
    print("\n====", title, "====")
    print(text[:200])

# ⭕⭕ <font color = 'blue'> **Script for [Group2]: Create one Txt file with records separated by @@@@@**
output = []

for title in titles:
    txt = get_wikipedia_page(title)
    if not txt:
        txt = ""   # store empty if missing
    block = f"@@@@@\nTITLE: {title}\n{txt}\n"
    output.append(block)

final = "\n".join(output)

with open("wiki_corpus_delimited_group2.txt", "w", encoding="utf-8") as f:
    f.write(final)

print("Saved: wiki_corpus_delimited_group2.txt")


# 🐣🐣 **Open and read the text file for Group2**
# ▶️ Step 1: Modify the path in this line as needed 🍎🍎🍎🍎🍎
file = open("/content/wiki_corpus_delimited_group2.txt", 'rt', encoding="utf-8")

txt = file.read()
print(txt)
file.close() #close() releases the file handle once you have finished reading the file opened with open().



# **📍📍Apply a series of functions for replacement in Group2**
import re

# STEP 1: Read the file; change the path as needed 🍎🍎🍎🍎🍎🍎
with open("/content/wiki_corpus_delimited_group2.txt", 'rt', encoding="utf-8") as fl:
    raw_text = fl.read()

# STEP 2: Clean the text
clean_text = (
    raw_text
    .replace("\n", " ")
    .replace("“", "")
    .replace("”", "")
    .replace("\"", "")
    .replace("/", "")
    .replace("_", "")
    .replace("===", "")
    .replace("==", "")
    .replace("=", "")
    .replace("*", "")
    .replace("?", "")
    .replace("!", "")
    .replace("--", " ")
    .replace("(", "")
    .replace(")", "")
)

# STEP 3: Save the cleaned content to a NEW file at the output path you choose 🍏🍏🍏🍏🍏🍏
output_path = "/content/wiki_corpus_delimited_group2_CLEANED.txt"
with open(output_path, 'w', encoding="utf-8") as cf:
    cf.write(clean_text) #Write the contents of 'clean_text' to the new file

# Optional: Print to verify
print("✅ Cleaned text saved to:", output_path)




#🐹🐣**Clone the GitHub repository you are interested in**
!git clone https://github.com/ms624atyale/NLP


# <font color = 'red'> **🔵 PART 1: LDA**

In [None]:
# 📌 Install Required Packages
!pip install gensim
!pip install nltk


#📌 LDA Code (Works for Group1 or Group2)
# 💊 Change file path depending on which group you want.

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora
from gensim.models import LdaModel

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that word_tokenize needs in recent NLTK releases

# 🔵 Choose file (change path if needed)
file_path = "/content/wiki_corpus_delimited_group1_CLEANED.txt"
# file_path = "/content/wiki_corpus_delimited_group2_CLEANED.txt"

# STEP 1️⃣ Read cleaned file
with open(file_path, "r", encoding="utf-8") as f:
    text = f.read()

# STEP 2️⃣ Split documents using delimiter
documents = text.split("@@@@@")
documents = [doc.strip() for doc in documents if len(doc.strip()) > 0]

# STEP 3️⃣ Preprocess
stop_words = set(stopwords.words("english"))

processed_docs = []
for doc in documents:
    tokens = word_tokenize(doc.lower())
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    processed_docs.append(tokens)

# STEP 4️⃣ Create dictionary and corpus
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# STEP 5️⃣ Train LDA
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,
    passes=20,
    random_state=42
)

# STEP 6️⃣ Print topics
print("\n🟢 LDA Topics:\n")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

#🔎 Optional: Show Topic Distribution Per Document
for i, row in enumerate(lda_model[corpus]):
    print(f"\nDocument {i} topic distribution:")
    print(row)
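
Each row printed by the loop above is a list of `(topic_id, probability)` pairs, and topics with very small probability may be left out of a row. A minimal sketch for picking the dominant topic of each document follows. The rows are hand-written stand-ins for `lda_model[corpus]` output, and `dominant_topic` is a hypothetical helper name.

```python
def dominant_topic(row):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(row, key=lambda pair: pair[1])

# Made-up rows in the same shape as the per-document topic distributions
rows = [
    [(0, 0.91), (2, 0.08)],
    [(1, 0.55), (2, 0.44)],
]

for i, row in enumerate(rows):
    topic_id, prob = dominant_topic(row)
    print(f"Document {i}: topic {topic_id} (p={prob:.2f})")
```

A near tie, as in the second row, is a hint that the document mixes two topics rather than belonging clearly to one.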


# <font color = 'red'> **🔵 PART 2: BERTopic Model**

In [None]:
# 🔵 PART 2: BERTopic Model using Sentence Transformers, UMAP, HDBSCAN, c-TF-IDF

#📌 Install BERTopic
!pip install bertopic
!pip install sentence-transformers

# 📌 BERTopic Code
from bertopic import BERTopic

# 🔵 Choose file
file_path = "/content/wiki_corpus_delimited_group1_CLEANED.txt"
# file_path = "/content/wiki_corpus_delimited_group2_CLEANED.txt"

# STEP 1️⃣ Read cleaned file
with open(file_path, "r", encoding="utf-8") as f:
    text = f.read()

# STEP 2️⃣ Split documents
documents = text.split("@@@@@")
documents = [doc.strip() for doc in documents if len(doc.strip()) > 0]

# STEP 3️⃣ Initialize BERTopic
# NOTE: the default UMAP/HDBSCAN settings are meant for much larger corpora. With only a few documents
# this cell may fail or put every document in the outlier topic (-1); the cells further down pass custom models for small datasets.
topic_model = BERTopic(
    calculate_probabilities=True,
    verbose=True
)

# STEP 4️⃣ Fit model
topics, probs = topic_model.fit_transform(documents)

# STEP 5️⃣ Show topic info
print("\n🔵 BERTopic Summary:\n")
print(topic_model.get_topic_info())

# STEP 6️⃣ Print words per topic
for topic_num in set(topics):
    print(f"\nTopic {topic_num}:")
    print(topic_model.get_topic(topic_num))



# 🔎 Optional Visualization
topic_model.visualize_topics()



# <font color = 'blue'> 1️⃣ **Combine Group1 + Group2**

In [None]:
# Combine cleaned files
file_path1 = "/content/wiki_corpus_delimited_group1_CLEANED.txt"
file_path2 = "/content/wiki_corpus_delimited_group2_CLEANED.txt"

with open(file_path1, "r", encoding="utf-8") as f1:
    text1 = f1.read()

with open(file_path2, "r", encoding="utf-8") as f2:
    text2 = f2.read()

combined_text = text1 + "\n" + text2

documents = combined_text.split("@@@@@")
documents = [doc.strip() for doc in documents if len(doc.strip()) > 0]

print("Total documents:", len(documents))

# 🟢 **PART 1: LDA with Coherence Score using Gensim LDA & CoherenceModel (c_v score)**

In [None]:
!pip install gensim
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

nltk.download("stopwords")
nltk.download("punkt")

# Preprocess
stop_words = set(stopwords.words("english"))

processed_docs = []
for doc in documents:
    tokens = word_tokenize(doc.lower())
    tokens = [w for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if w not in stop_words]
    processed_docs.append(tokens)

# Dictionary & Corpus
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Train LDA
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=4,
    passes=30,
    random_state=42
)

# Print topics
print("\n🟢 LDA Topics:\n")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

# Compute coherence
coherence_model = CoherenceModel(
    model=lda_model,
    texts=processed_docs,
    dictionary=dictionary,
    coherence='c_v'
)

lda_coherence = coherence_model.get_coherence()
print("\n🟢 LDA Coherence Score:", lda_coherence)


🧠 How to Interpret Coherence

Informal rules of thumb for c_v (not fixed thresholds; scores depend on the corpus and the preprocessing):

0.3 - weak

0.4 - decent

0.5+ - good

0.6+ - strong (for small datasets)

Higher = more semantically consistent topics.

🔵 PART 2: BERTopic + Coherence

BERTopic doesn't directly compute coherence,
so we extract topic words and compute coherence manually (see the BERTopic quality-metrics cell further down).
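
To make "coherence" less abstract, here is a small sketch of a UMass-style score, a simpler relative of the c_v measure used in this notebook (it is not what `CoherenceModel(coherence='c_v')` computes). For each pair of topic words, it compares how often the two words share a document with how often the earlier word appears at all. The function name `umass_coherence` and the toy documents are assumptions for illustration.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, tokenized_docs):
    """Average log((co-document count + 1) / document count) over ordered word pairs."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    for earlier, later in combinations(topic_words, 2):
        base = doc_count(earlier)
        if base == 0:
            continue  # the word never occurs; skip to avoid dividing by zero
        scores.append(math.log((doc_count(earlier, later) + 1) / base))
    return sum(scores) / len(scores) if scores else 0.0

# Toy tokenized documents, made up for illustration
docs = [["music", "idol", "album"], ["music", "idol", "fans"], ["music", "novel", "prize"]]
print(umass_coherence(["music", "idol"], docs))   # words that often share a document
print(umass_coherence(["music", "novel"], docs))  # words that rarely share a document
```

The pair that co-occurs more often gets the higher score, which is the intuition behind every coherence measure: a topic is coherent when its top words tend to show up together.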

In [None]:
from bertopic import BERTopic
from umap import UMAP
import hdbscan

# Adjust UMAP for small dataset
umap_model = UMAP(
    n_neighbors=3,      # must be < number of documents
    n_components=2,
    min_dist=0.0,
    metric='cosine',
    random_state=42
)

# Adjust HDBSCAN for small dataset
hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=2,
    min_samples=1,
    metric='euclidean',
    prediction_data=True
)

# Initialize BERTopic with custom models
topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
)

topics, probs = topic_model.fit_transform(documents)

print(topic_model.get_topic_info())


📊 STEP 3: Compare LDA vs BERTopic Quality

🟢 PART 1: LDA QUALITY METRICS

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel
import numpy as np

nltk.download("stopwords")
nltk.download("punkt")

# Preprocess
stop_words = set(stopwords.words("english"))

processed_docs = []
for doc in documents:
    tokens = word_tokenize(doc.lower())
    tokens = [w for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if w not in stop_words]
    processed_docs.append(tokens)

# Dictionary & Corpus
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Train LDA
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=4,
    passes=30,
    random_state=42
)

# -------- Coherence --------
coherence_model_lda = CoherenceModel(
    model=lda_model,
    texts=processed_docs,
    dictionary=dictionary,
    coherence='c_v'
)

lda_coherence = coherence_model_lda.get_coherence()

# -------- Topic Diversity --------
def topic_diversity(topics, top_k=10):
    unique_words = set()
    total_words = 0
    for topic in topics:
        words = topic[:top_k]
        unique_words.update(words)
        total_words += len(words)          # count the words actually present (a topic may have fewer than top_k)
    return len(unique_words) / total_words if total_words else 0.0

lda_topics = []
for i in range(lda_model.num_topics):
    words = [word for word, prob in lda_model.show_topic(i, topn=10)]
    lda_topics.append(words)

lda_diversity = topic_diversity(lda_topics)

print("\n🟢 LDA Results")
print("Coherence:", round(lda_coherence, 4))
print("Topic Diversity:", round(lda_diversity, 4))
print("Number of Topics:", lda_model.num_topics)


🔵 PART 2: BERTopic QUALITY METRICS

In [None]:
from bertopic import BERTopic
from umap import UMAP
import hdbscan

# Adjust for small dataset
umap_model = UMAP(
    n_neighbors=3,
    n_components=2,
    min_dist=0.0,
    metric='cosine',
    random_state=42
)

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=2,
    min_samples=1,
    metric='euclidean',
    prediction_data=True
)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=False
)

topics, probs = topic_model.fit_transform(documents)

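# -------- Extract topic words --------
topic_info = topic_model.get_topics()

bertopic_topics = []
for topic_id in topic_info:
    if topic_id == -1:
        continue                      # -1 is the outlier topic, not a real topic
    # Keep only words that the gensim dictionary knows: BERTopic tokenizes the raw text itself,
    # so its topic words can include stop words, digits, or empty padding strings.
    words = [word for word, _ in topic_info[topic_id][:10] if word in dictionary.token2id]
    if words:
        bertopic_topics.append(words)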

# -------- Coherence --------
coherence_model_bertopic = CoherenceModel(
    topics=bertopic_topics,
    texts=processed_docs,
    dictionary=dictionary,
    coherence='c_v'
)

bertopic_coherence = coherence_model_bertopic.get_coherence()

# -------- Topic Diversity --------
bertopic_diversity = topic_diversity(bertopic_topics)

# -------- Outlier Ratio --------
outlier_ratio = topics.count(-1) / len(topics)

print("\n🔵 BERTopic Results")
print("Coherence:", round(bertopic_coherence, 4))
print("Topic Diversity:", round(bertopic_diversity, 4))
print("Number of Topics:", len(bertopic_topics))
print("Outlier Ratio:", round(outlier_ratio, 4))


📊 FINAL SIDE-BY-SIDE COMPARISON

In [None]:
print("\n📊 MODEL COMPARISON")
print("-" * 40)
print(f"LDA Coherence:        {round(lda_coherence,4)}")
print(f"BERTopic Coherence:   {round(bertopic_coherence,4)}")
print()
print(f"LDA Diversity:        {round(lda_diversity,4)}")
print(f"BERTopic Diversity:   {round(bertopic_diversity,4)}")
print()
print(f"LDA Topics:           {lda_model.num_topics}")
print(f"BERTopic Topics:      {len(bertopic_topics)}")
print(f"BERTopic Outliers:    {round(outlier_ratio,4)}")


🧠 How to Interpret Results
🔹 Coherence

Higher = more semantically meaningful topics.

🔹 Topic Diversity

Closer to 1 = topics share fewer repeated words.

🔹 Outlier Ratio (BERTopic only)

Higher = more documents left without a topic (assigned to -1), which suggests unstable clustering (common in small datasets).
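
The two simpler metrics can be checked by hand on toy data. The sketch below restates topic diversity and the outlier ratio as small standalone functions; the helper name `outlier_share`, the toy topics, and the toy topic assignments are made up for illustration.

```python
def topic_diversity(topics, top_k=10):
    """Share of unique words among the top_k words of every topic."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words) if top_words else 0.0

def outlier_share(assignments):
    """Fraction of documents assigned to topic -1 (no topic)."""
    return assignments.count(-1) / len(assignments) if assignments else 0.0

# Toy inputs, made up for illustration
toy_topics = [["music", "idol", "album"], ["music", "novel", "prize"]]
toy_assignments = [0, 0, 1, -1, 1, 0, -1, 1]

print(round(topic_diversity(toy_topics, top_k=3), 4))  # 5 unique words out of 6
print(outlier_share(toy_assignments))                   # 2 of 8 documents are outliers
```

Because "music" appears in both toy topics, diversity drops below 1; with only a handful of documents, a single outlier moves the outlier ratio noticeably.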

# 🎯 🍰 🍨 🍬  <font color = 'green'> **My Note:**
# 🐹🐾 🌱 <font color = 'green'> **Visit my ChatGPT for <Topic Modeling Clarification> for interpretations!**