# BERTopic Parameter Tuning and Coherence Analysis

This notebook explores a **BERTopic** workflow on the Reddit dataset, performing the following steps:

1. **Import** the necessary libraries (including `bertopic`, `sentence_transformers`, `umap`, `hdbscan`, and `gensim`).  
2. **Define** helper functions:
   - `load_bertopic_docs`: loads processed documents for BERTopic  
   - `compute_coherence`: calculates topic coherence  
   - `train_and_evaluate_bertopic`: trains and evaluates BERTopic with different parameters  
3. **Implement** a `main` function that:
   - Loads the data  
   - Iterates over various parameters  
   - Computes the coherence score  
   - Identifies the best combination  
   - Prints the highest coherence result  

By the end, the notebook reports which combination of **n_neighbors** and **min_cluster_size**, among the values tested, yields the highest c_v coherence score for the data, which can guide further topic-modelling experiments.

In [2]:
# ----------------------------------------------------------------------------------------
# 1) Imports and Setup
# ----------------------------------------------------------------------------------------
import json
import os
import traceback
from pathlib import Path

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Gensim imports for coherence
from gensim.models.coherencemodel import CoherenceModel
from gensim import corpora




## 2) Helper Functions

These functions handle data loading, coherence measurement, and the actual BERTopic training:

- **`load_bertopic_docs(file_path)`**: Gathers `combined_processed` (posts) and `comment_processed` (comments) into a single list of documents for BERTopic.
- **`compute_coherence(topic_model, docs)`**: Calculates `c_v` coherence using Gensim’s `CoherenceModel`.
- **`train_and_evaluate_bertopic(...)`**: Trains a BERTopic model with specified `n_neighbors` and `min_cluster_size`, updates topics, then computes coherence.
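
As a concrete illustration of the input shape that `load_bertopic_docs` expects, the sketch below builds a tiny in-memory sample with the same field names and flattens it the same way. The sample posts and the helper name `flatten_posts` are invented for illustration; the real file's contents are not shown in this notebook.

```python
# Hypothetical sample mirroring the field names read by load_bertopic_docs.
sample_posts = [
    {
        "combined_processed": "new gpu benchmark results",
        "comments": [
            {"comment_processed": "great numbers"},
            {"comment_processed": "   "},  # whitespace-only: skipped
        ],
    },
    {"combined_processed": "", "comments": [{"comment_processed": "first"}]},
    {"combined_processed": "post without comments"},  # no "comments" key
]


def flatten_posts(posts: list[dict]) -> list[str]:
    """Collect non-empty post and comment texts into one list, in order."""
    docs = []
    for post in posts:
        main_text = post.get("combined_processed", "").strip()
        if main_text:
            docs.append(main_text)
        for comment in post.get("comments", []):
            text = comment.get("comment_processed", "").strip()
            if text:
                docs.append(text)
    return docs


docs = flatten_posts(sample_posts)
print(docs)
```

Each post and each comment becomes its own document, so a short comment carries the same weight in clustering as a long post.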


In [4]:
def load_bertopic_docs(file_path: Path) -> list[str]:
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    all_docs = []
    for post in data:
        main_text = post.get("combined_processed", "").strip()
        if main_text:
            all_docs.append(main_text)

        for c in post.get("comments", []):
            c_text = c.get("comment_processed", "").strip()
            if c_text:
                all_docs.append(c_text)
    return all_docs

def compute_coherence(topic_model: BERTopic, docs: list[str]) -> float:
    """Compute c_v coherence for a BERTopic model using Gensim."""
    topic_info = topic_model.get_topic_info()
    # Exclude the outlier topic -1
    unique_topics = sorted(t for t in topic_info["Topic"].unique() if t != -1)

    topic_words = []
    for t in unique_topics:
        words_freqs = topic_model.get_topic(t)
        words = [w for (w, _) in words_freqs]
        topic_words.append(words)

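    # Prepare texts + dictionary for Gensim.
    # Note: whitespace tokenization yields single tokens only, so bigram topic
    # terms produced by update_topics(ngram_range=(1, 2)) are not in this vocabulary.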
    tokenized_docs = [d.split() for d in docs]
    dictionary = corpora.Dictionary(tokenized_docs)

    coherence_model = CoherenceModel(
        topics=topic_words,
        texts=tokenized_docs,
        dictionary=dictionary,
        coherence="c_v"
    )
    return coherence_model.get_coherence()

def train_and_evaluate_bertopic(
    docs: list[str],
    n_neighbors_val: int,
    min_cluster_size_val: int,
    verbose: bool = False
) -> tuple[BERTopic, float]:
    """
    Train a BERTopic model with the given parameters and compute c_v coherence.
    Returns: (model, coherence_score)
    """
    custom_umap = UMAP(
        n_neighbors=n_neighbors_val,
        n_components=2,
        metric="cosine",
        random_state=42,
        init="random"
    )
    custom_hdbscan = HDBSCAN(
        min_cluster_size=min_cluster_size_val,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True
    )
    # No explicit device: sentence-transformers uses a GPU when one is available, otherwise the CPU
    embedding_model = SentenceTransformer("all-mpnet-base-v2")

    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=custom_umap,
        hdbscan_model=custom_hdbscan,
        verbose=verbose
    )
    _topics, _probs = topic_model.fit_transform(docs)

    # Recompute topic representations with English stop words removed, using unigrams and bigrams
    vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
    topic_model.update_topics(docs, vectorizer_model=vectorizer)

    coherence_score = compute_coherence(topic_model, docs)
    return topic_model, coherence_score
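
One detail worth checking: `update_topics` is called with `ngram_range=(1, 2)`, so topic representations can contain bigrams, while `compute_coherence` builds its dictionary from `d.split()`, which only yields single tokens. The sketch below, using invented toy data and a hypothetical helper name, lists which topic terms fall outside that whitespace vocabulary. How Gensim treats such terms appears to depend on the version, so it may be worth inspecting rather than assuming they contribute to the score.

```python
def split_by_vocabulary(
    topic_words: list[list[str]], docs: list[str]
) -> tuple[list[str], list[str]]:
    """Return (in_vocab, out_of_vocab) topic terms under whitespace tokenization."""
    vocabulary = {token for doc in docs for token in doc.split()}
    in_vocab, out_of_vocab = [], []
    for words in topic_words:
        for word in words:
            if word in vocabulary:
                in_vocab.append(word)
            else:
                out_of_vocab.append(word)
    return in_vocab, out_of_vocab


# Invented toy data: one topic whose representation includes a bigram.
toy_docs = ["machine learning model", "deep learning model"]
toy_topic_words = [["learning", "machine learning", "model"]]

in_vocab, out_of_vocab = split_by_vocabulary(toy_topic_words, toy_docs)
print("in vocabulary:", in_vocab)
print("out of vocabulary:", out_of_vocab)
```

If many terms land in `out_of_vocab`, one option is to tokenize the documents with the same analyzer as the vectorizer (scikit-learn's `CountVectorizer.build_analyzer()`), so that topics and texts share a vocabulary. Doing so would change the coherence scores reported in this notebook.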

## 3) Main Pipeline

In the following code cell, the main function is defined and executed to:

1. **Load** data from `bertopic_ready_data.json`  
2. **Generate** a range of parameter combinations for `n_neighbors` and `min_cluster_size`  
3. **Train** a model for each combination, recording coherence scores  
4. **Identify** which combination yields the **best** coherence result  
5. **Print** the best coherence score together with the parameter combination that produced it
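
The search over parameter combinations can be sketched independently of BERTopic. The version below is a minimal illustration that uses `itertools.product` in place of nested loops and a placeholder scorer; `grid_search` and `dummy_score` are hypothetical names, and in the real pipeline the scorer would wrap `train_and_evaluate_bertopic` and return its coherence value.

```python
from itertools import product

n_neighbors_candidates = [5, 10, 15]
min_cluster_size_candidates = [10, 15, 20]


def grid_search(score_fn):
    """Score every combination; return (best_params, best_score, all_results)."""
    results = {}
    for nn, mcs in product(n_neighbors_candidates, min_cluster_size_candidates):
        try:
            results[(nn, mcs)] = score_fn(nn, mcs)
        except Exception as exc:  # keep going if one combination fails
            print(f"Skipping n_neighbors={nn}, min_cluster_size={mcs}: {exc}")
    if not results:
        return None, None, results
    best_params = max(results, key=results.get)
    return best_params, results[best_params], results


def dummy_score(nn: int, mcs: int) -> float:
    """Placeholder scorer with no meaning beyond making the sketch runnable."""
    return nn / 100 - mcs / 1000


best_params, best_score, results = grid_search(dummy_score)
print(f"Evaluated {len(results)} combinations; best={best_params}")
```

Keeping every score in `results`, rather than only the running best, makes it possible to inspect how coherence varies across the grid afterwards.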

In [6]:
def main():
    input_path = Path("Data/bertopic_ready_data.json")
    if not input_path.exists():
        print(f"File {input_path} not found! Please check the path.")
        return

    # Load docs
    docs = load_bertopic_docs(input_path)
    print(f"Loaded {len(docs)} documents from {input_path}")

    # Parameter search space
    n_neighbors_candidates = [5, 10, 15]
    min_cluster_size_candidates = [10, 15, 20]

    best_model = None
    best_score = -1.0
    best_params = (None, None)

    # Try multiple parameter combos, compute coherence
    for nn in n_neighbors_candidates:
        for mcs in min_cluster_size_candidates:
            print(f"\nTrying n_neighbors={nn}, min_cluster_size={mcs}...")
            try:
                model, score = train_and_evaluate_bertopic(docs, nn, mcs, verbose=False)
                print(f"  => c_v coherence={score:.4f}")
                if score > best_score:
                    best_score = score
                    best_model = model
                    best_params = (nn, mcs)
            except Exception as e:
                print(f"Error for n_neighbors={nn}, min_cluster_size={mcs}: {e}")
                traceback.print_exc()

    # If no model succeeded
    if best_model is None:
        print("\nNo successful model found! Exiting.")
        return

    # Print best coherence result
    print(f"\nBest Coherence Score: {best_score:.4f} with n_neighbors={best_params[0]}, min_cluster_size={best_params[1]}")

if __name__ == "__main__":
    main()

Loaded 6479 documents from Data\bertopic_ready_data.json

Trying n_neighbors=5, min_cluster_size=10...
  => c_v coherence=0.4061

Trying n_neighbors=5, min_cluster_size=15...
  => c_v coherence=0.4022

Trying n_neighbors=5, min_cluster_size=20...
  => c_v coherence=0.3962

Trying n_neighbors=10, min_cluster_size=10...
  => c_v coherence=0.4463

Trying n_neighbors=10, min_cluster_size=15...
  => c_v coherence=0.4327

Trying n_neighbors=10, min_cluster_size=20...
  => c_v coherence=0.4327

Trying n_neighbors=15, min_cluster_size=10...
  => c_v coherence=0.4571

Trying n_neighbors=15, min_cluster_size=15...
  => c_v coherence=0.4378

Trying n_neighbors=15, min_cluster_size=20...
  => c_v coherence=0.4447

Best Coherence Score: 0.4571 with n_neighbors=15, min_cluster_size=10
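
To read the run output at a glance, the logged scores can be ranked and averaged per `n_neighbors` value. The sketch below copies the nine c_v values from the output above by hand; it is a reading aid, not part of the original pipeline.

```python
from collections import defaultdict

# c_v scores copied from the run output, keyed by (n_neighbors, min_cluster_size).
scores = {
    (5, 10): 0.4061, (5, 15): 0.4022, (5, 20): 0.3962,
    (10, 10): 0.4463, (10, 15): 0.4327, (10, 20): 0.4327,
    (15, 10): 0.4571, (15, 15): 0.4378, (15, 20): 0.4447,
}

# Rank all combinations from highest to lowest coherence.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for (nn, mcs), score in ranked:
    print(f"n_neighbors={nn:>2}  min_cluster_size={mcs:>2}  c_v={score:.4f}")

# Average coherence for each n_neighbors value across min_cluster_size.
by_neighbors = defaultdict(list)
for (nn, _mcs), score in scores.items():
    by_neighbors[nn].append(score)
mean_by_neighbors = {nn: sum(vals) / len(vals) for nn, vals in by_neighbors.items()}
print(mean_by_neighbors)
```

In this run the mean c_v rises with `n_neighbors` (roughly 0.40, 0.44, and 0.45 for 5, 10, and 15), and the top two combinations differ by about 0.011. These figures come from a single run with a fixed UMAP seed, so small gaps like this may not hold under other seeds.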


## Conclusion

This notebook explores a range of **n_neighbors** and **min_cluster_size** parameters for **BERTopic**, using the **c_v** coherence metric to assess topic quality. In summary, it:

- Loads and combines post/comment texts into a single corpus  
- Trains BERTopic models under multiple parameter settings  
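- Identifies the parameter combination yielding the highest coherence score (in this run, `n_neighbors=15` and `min_cluster_size=10`, with c_v = 0.4571)  

Natural next steps are to refine the **hyperparameter search**, examine the **resulting topics** more closely, and apply further text cleaning or domain-specific stop words to obtain more nuanced topic structures. Because the best combination sits at the edge of the tested grid (the largest `n_neighbors` and the smallest `min_cluster_size`), extending the search beyond those values is a reasonable first refinement.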
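
Domain-specific stop words could be added by extending scikit-learn's built-in English list before constructing the `CountVectorizer` that is passed to `update_topics`. The sketch below shows one way to do this; the extra words are hypothetical placeholders, not terms identified from this dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical domain-specific additions; pick real ones by inspecting the topics.
domain_stop_words = {"subreddit", "post", "comment"}
stop_words = sorted(ENGLISH_STOP_WORDS.union(domain_stop_words))

# Same n-gram range as the notebook's vectorizer, with the extended stop list.
vectorizer = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))

# The analyzer shows which unigrams and bigrams survive for a given text.
analyzer = vectorizer.build_analyzer()
tokens = analyzer("This subreddit post is about python code")
print(tokens)
```

The resulting `vectorizer` can be passed to `topic_model.update_topics(docs, vectorizer_model=vectorizer)` in the same way as the notebook's existing vectorizer.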

## References

- Grootendorst, M. (2022) *BERTopic: Leveraging BERT embeddings for unsupervised topic modeling* [computer program]. Available from: [https://github.com/MaartenGr/BERTopic](https://github.com/MaartenGr/BERTopic) [Accessed 12 January 2025].
- Reimers, N., and Gurevych, I. (2019) *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*. Available from: [https://www.sbert.net/](https://www.sbert.net/) [Accessed 12 January 2025]. Git repo: [SentenceTransformers GitHub](https://github.com/UKPLab/sentence-transformers)
- McInnes, L., Healy, J., and Melville, J. (2018) *UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction* [computer program]. Available from: [https://umap-learn.readthedocs.io/](https://umap-learn.readthedocs.io/) [Accessed 12 January 2025]. Git repo: [UMAP GitHub](https://github.com/lmcinnes/umap)
- Campello, R. J. G. B., Moulavi, D., and Sander, J. (2013) *Density-Based Clustering Based on Hierarchical Density Estimates* [computer program]. Available from: [https://hdbscan.readthedocs.io/](https://hdbscan.readthedocs.io/) [Accessed 12 January 2025]. Git repo: [HDBSCAN GitHub](https://github.com/scikit-learn-contrib/hdbscan)
- Řehůřek, R., and Sojka, P. (2010) *Software Framework for Topic Modelling with Large Corpora*. In *Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks* [computer program]. Available from: [https://radimrehurek.com/gensim/](https://radimrehurek.com/gensim/) [Accessed 12 January 2025]. Git repo: [Gensim GitHub](https://github.com/RaRe-Technologies/gensim)
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., et al. (2011) *Scikit-learn: Machine Learning in Python* [software framework]. *Journal of Machine Learning Research*, 12, pp. 2825–2830. Available from: [https://scikit-learn.org/](https://scikit-learn.org/) [Accessed 12 January 2025]. Git repo: [Scikit-learn GitHub](https://github.com/scikit-learn/scikit-learn)