# BERTopic Pipeline Notebook

This notebook demonstrates how to:

1. **Import** libraries and define the relevant classes/functions for a BERTopic workflow  
2. **Load** documents with metadata, including both posts and comments  
3. **Train** a BERTopic model (using UMAP + HDBSCAN)  
4. **Save** various topic outputs (document info, topic info, and representative documents)  
5. **Generate** HTML visualisations (topics, hierarchy, bar chart)

In [1]:
# ----------------------------------------------------------------------------------------
# 1) Imports and Setup
# ----------------------------------------------------------------------------------------
import json
import os
import traceback
from pathlib import Path

import pandas as pd

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer



## 2) Helper Functions

Here we define:
- **`load_docs_with_metadata(file_path)`**: Loads documents (posts + comments) from a JSON structure and returns a single list of `(text, metadata)` tuples, one per non-empty post or comment.

In [2]:
def load_docs_with_metadata(file_path: Path):
    """
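    Loads data from a JSON file (e.g., 'aggregated_raw_reddit_data.json')
    containing a list of posts, each with an optional list of comments.

    Returns a list of (doc_text, metadata_dict) tuples, where doc_text is either
    the post's combined_processed or a comment's comment_processed.
    Posts and comments whose text is empty after stripping are skipped.
    """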

    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    docs_with_meta = []
    running_index = 0

    for post in data:
        # Basic fields
        post_id = post.get("post_id", "")
        subreddit = post.get("subreddit", "")
        keyword = post.get("keyword", "")
        sub_score = post.get("submission_score", None)

        # Post-level text
        main_text = post.get("combined_processed", "").strip()
        if main_text:
            doc_id_used = post_id if post_id else f"UnknownPost_{running_index}"
            docs_with_meta.append((
                main_text,
                {
                    "doc_idx": running_index,
                    "doc_id": doc_id_used,
                    "subreddit": subreddit,
                    "keyword": keyword,
                    "submission_score": sub_score,
                    "type": "post"
                }
            ))
            running_index += 1

        # Comment-level text
        comments = post.get("comments", [])
        for c in comments:
            c_text = c.get("comment_processed", "").strip()
            if c_text:
                c_id = c.get("comment_id", f"UnknownComment_{running_index}")
                docs_with_meta.append((
                    c_text,
                    {
                        "doc_idx": running_index,
                        "doc_id": c_id,
                        "subreddit": subreddit,
                        "keyword": keyword,
                        "submission_score": sub_score,
                        "type": "comment"
                    }
                ))
                running_index += 1

    return docs_with_meta
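
The loader expects a list of post objects, each optionally carrying a `comments` list. The following is a minimal sketch of that shape and of the flattening it undergoes. The sample records and the condensed `flatten_posts` helper are illustrative assumptions, not part of the notebook's data or code, and the helper omits the fallback IDs for brevity.

```python
import json

# Illustrative input: the field names match those read by load_docs_with_metadata,
# but the values are made up for this sketch.
sample_json = """
[
  {
    "post_id": "P_1",
    "subreddit": "autism",
    "keyword": "therapy",
    "submission_score": 12,
    "combined_processed": "post title and body text",
    "comments": [
      {"comment_id": "C_P_1_1", "comment_processed": "first comment"},
      {"comment_id": "C_P_1_2", "comment_processed": "   "}
    ]
  }
]
"""


def flatten_posts(posts):
    """Condensed, hypothetical equivalent of the loader's inner loop."""
    docs = []
    for post in posts:
        # Comments inherit these fields from their parent post
        shared = {
            "subreddit": post.get("subreddit", ""),
            "keyword": post.get("keyword", ""),
            "submission_score": post.get("submission_score"),
        }
        candidates = [(post.get("post_id", ""), post.get("combined_processed", ""), "post")]
        candidates += [
            (c.get("comment_id", ""), c.get("comment_processed", ""), "comment")
            for c in post.get("comments", [])
        ]
        for doc_id, text, doc_type in candidates:
            text = text.strip()
            if text:  # empty or whitespace-only documents are skipped
                meta = {"doc_idx": len(docs), "doc_id": doc_id, "type": doc_type, **shared}
                docs.append((text, meta))
    return docs


docs_with_meta = flatten_posts(json.loads(sample_json))
for text, meta in docs_with_meta:
    print(meta["doc_idx"], meta["type"], meta["doc_id"], repr(text))
```

The whitespace-only second comment is dropped, so the sample yields two documents: the post and its first comment, both tagged with the post's subreddit, keyword, and score.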

## 3) Main Pipeline

This code cell handles:
1. **Loading** the data,
2. **Fitting** a BERTopic model with UMAP & HDBSCAN,
3. **Generating** various data frames of topic and document info,
4. **Saving** JSON files for further analysis,
5. **Attempting** optional HTML visualisations.
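
The representative-document selection inside the pipeline cell (sort each topic's documents by probability, keep at most two per document ID, then take the top ten) can be sketched in plain Python. The helper name `select_representatives` and the sample rows are assumptions made for illustration; the notebook itself does this with pandas.

```python
from collections import defaultdict


def select_representatives(rows, max_per_id=2, top_n=10):
    """Return up to top_n rows, highest probability first, at most max_per_id per Doc_ID."""
    ranked = sorted(rows, key=lambda r: r["Topic_Probability"], reverse=True)
    seen = defaultdict(int)  # how many rows have been kept for each Doc_ID
    selected = []
    for row in ranked:
        if seen[row["Doc_ID"]] < max_per_id:
            seen[row["Doc_ID"]] += 1
            selected.append(row)
        if len(selected) == top_n:
            break
    return selected


# Made-up rows for one topic; "A" appears three times to show the per-ID cap.
rows = [
    {"Doc_ID": "A", "Topic_Probability": 0.90},
    {"Doc_ID": "A", "Topic_Probability": 0.80},
    {"Doc_ID": "A", "Topic_Probability": 0.70},
    {"Doc_ID": "B", "Topic_Probability": 0.60},
]
picked = select_representatives(rows, max_per_id=2, top_n=3)
print([(r["Doc_ID"], r["Topic_Probability"]) for r in picked])
# [('A', 0.9), ('A', 0.8), ('B', 0.6)]
```

Since each post and comment normally carries its own ID, the per-ID cap should only come into play when IDs repeat.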

In [6]:
def main():
    """
    Main BERTopic pipeline:
    1) Loads 'bertopic_ready_data.json' with custom doc IDs (post_id/comment_id) and text.
    2) Builds a BERTopic model (UMAP + HDBSCAN).
    3) Saves:
       - document-level info (doc_info.json)
       - topic info (topic_info.json)
       - representative docs (topic_representatives.json)
    4) Writes HTML visualisations (topics, hierarchy, bar chart) when possible.
    """
    data_folder = r"C:\Users\laure\Desktop\dissertation_notebook\Data"
    input_path = Path(os.path.join(data_folder, "bertopic_ready_data.json"))
    
    # Output directory
    output_folder = Path(os.path.join(data_folder, "bertopic_output"))
    output_folder.mkdir(parents=True, exist_ok=True)

    if not input_path.exists():
        print(f"File {input_path} not found.")
        return

    # 1) Load docs + metadata
    docs_with_meta = load_docs_with_metadata(input_path)
    docs = [item[0] for item in docs_with_meta]
    metadata_list = [item[1] for item in docs_with_meta]

    # Hyperparameters
    n_neighbors = 15
    min_cluster_size = 10

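    print("Loading embedding model (CUDA if available)...")
    # device=None lets SentenceTransformer use a GPU when one is available and
    # fall back to CPU otherwise; a hard-coded "cuda" fails on CPU-only machines
    embedding_model = SentenceTransformer("all-mpnet-base-v2", device=None)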

    custom_umap = UMAP(
        n_neighbors=n_neighbors,
        n_components=2,
        metric="cosine",
        random_state=42,
        init="random"
    )
    custom_hdbscan = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True
    )

    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=custom_umap,
        hdbscan_model=custom_hdbscan,
        verbose=False
    )

    print(f"Fitting BERTopic with n_neighbors={n_neighbors}, min_cluster_size={min_cluster_size} ...")
    topics, probs = topic_model.fit_transform(docs)
    print("BERTopic modelling complete.")

    # refine displayed topic words using bigrams
    vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
    topic_model.update_topics(docs, vectorizer_model=vectorizer)

    # 2) Retrieve topic info
    topic_info_df = topic_model.get_topic_info()

    # 3) Document-level info
    doc_info_df = topic_model.get_document_info(docs)
    doc_info_df.rename(columns={"Probability": "Topic_Probability"}, inplace=True)

    # Add custom metadata columns
    doc_info_df["Doc_ID"] = ""
    doc_info_df["Type"] = ""
    doc_info_df["Subreddit"] = ""
    doc_info_df["Keyword"] = ""
    doc_info_df["SubmissionScore"] = None

    for i in range(len(doc_info_df)):
        meta = metadata_list[i]
        doc_info_df.at[i, "Doc_ID"] = meta["doc_id"]
        doc_info_df.at[i, "Type"] = meta["type"]
        doc_info_df.at[i, "Subreddit"] = meta["subreddit"]
        doc_info_df.at[i, "Keyword"] = meta["keyword"]
        doc_info_df.at[i, "SubmissionScore"] = meta["submission_score"]

    # Drop BERTopic's representative-document columns from doc_info, if present;
    # representative docs are written to their own file below
    doc_info_df.drop(
        columns=["Representative_Docs", "Representative_document"],
        errors="ignore",
        inplace=True,
    )

    # 4) create a separate structure for topic_representatives.json
    representative_docs = []
    valid_topics = sorted(set(doc_info_df["Topic"]) - {-1})
    max_docs_per_id = 2

    for t in valid_topics:
        subset = doc_info_df[doc_info_df["Topic"] == t].copy()
        subset.sort_values("Topic_Probability", ascending=False, inplace=True)

        grouped_by_id = subset.groupby("Doc_ID")
        limited_docs = []
        for doc_id_val, group in grouped_by_id:
            top_docs_per_id = group.head(max_docs_per_id)
            limited_docs.append(top_docs_per_id)

        limited_docs = pd.concat(limited_docs).sort_values("Topic_Probability", ascending=False)
        top10 = limited_docs.head(10)

        # Build rep-doc info with short snippet
        docs_for_topic = []
        for _, row2 in top10.iterrows():
            snippet_text = ""
            if "Document" in row2 and isinstance(row2["Document"], str):
                snippet_text = row2["Document"][:300]
            docs_for_topic.append({
                "Doc_ID": row2["Doc_ID"],
                "Topic_Probability": float(row2["Topic_Probability"]),
                "Type": row2["Type"],
                "Subreddit": row2["Subreddit"],
                "Keyword": row2["Keyword"],
                "SubmissionScore": row2["SubmissionScore"],
                "Snippet": snippet_text
            })

        representative_docs.append({
            "Topic": int(t),
            "NumDocsInTopic": int(subset.shape[0]),
            "TopDocs": docs_for_topic
        })

    # 5) Save outputs
    doc_info_path = output_folder / "doc_info.json"
    topic_info_path = output_folder / "topic_info.json"
    rep_docs_path = output_folder / "topic_representatives.json"

    # doc_info.json: full text, probability, etc. 
    doc_records = doc_info_df.to_dict(orient="records")
    with open(doc_info_path, "w", encoding="utf-8") as f:
        json.dump(doc_records, f, indent=2, ensure_ascii=False)

    # topic_info.json: summary of topics, top words
    topic_records = topic_info_df.to_dict(orient="records")
    with open(topic_info_path, "w", encoding="utf-8") as f:
        json.dump(topic_records, f, indent=2, ensure_ascii=False)

    # topic_representatives.json: separate file with top docs by topic
    with open(rep_docs_path, "w", encoding="utf-8") as f:
        json.dump(representative_docs, f, indent=2, ensure_ascii=False)

    # 6) Generate HTML visuals
    try:
        fig_topics = topic_model.visualize_topics()
        fig_topics.write_html(str(output_folder / "topics.html"))
    except Exception as e:
        print("Could not visualize topics.html")
        traceback.print_exc()

    try:
        fig_hierarchy = topic_model.visualize_hierarchy()
        fig_hierarchy.write_html(str(output_folder / "hierarchy.html"))
    except Exception as e:
        print("Could not visualize hierarchy.html")
        traceback.print_exc()

    # Show largest topics in a bar chart
    topic_info_nonoutlier = topic_info_df[topic_info_df["Topic"] != -1].sort_values("Count", ascending=False)
    top10_topics = topic_info_nonoutlier["Topic"].head(10).tolist()
    try:
        fig_barchart = topic_model.visualize_barchart(topics=top10_topics)
        fig_barchart.write_html(str(output_folder / "barchart.html"))
    except Exception as e:
        print("Could not visualize barchart.html")
        traceback.print_exc()

    print("\nAll done!")
    print(f"Document info saved to: {doc_info_path}")
    print(f"Topic info saved to: {topic_info_path}")
    print(f"Representative docs per topic: {rep_docs_path}")
    print("HTML visualizations in 'bertopic_output' folder.")


if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print("BERTopic pipeline failed.")
        traceback.print_exc()


Loading embedding model (CUDA if available)...
Fitting BERTopic with n_neighbors=15, min_cluster_size=10 ...
BERTopic modelling complete.

All done!
Document info saved to: C:\Users\laure\Desktop\dissertation_notebook\Data\bertopic_output\doc_info.json
Topic info saved to: C:\Users\laure\Desktop\dissertation_notebook\Data\bertopic_output\topic_info.json
Representative docs per topic: C:\Users\laure\Desktop\dissertation_notebook\Data\bertopic_output\topic_representatives.json
HTML visualizations in 'bertopic_output' folder.
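
For orientation, one entry of `topic_representatives.json` has the shape sketched here. The keys follow the dictionaries built in `main()`. The values are copied from the Topic 3 output shown in this notebook, with the snippet shortened and the keyword left blank because it is not displayed, so treat this as an illustration rather than the file's exact contents.

```python
import json

entry = {
    "Topic": 3,
    "NumDocsInTopic": 62,
    "TopDocs": [
        {
            "Doc_ID": "C_P_57_9611",
            "Topic_Probability": 1.0,
            "Type": "comment",
            "Subreddit": "aspiememes",
            "Keyword": "",  # not shown in the displayed output; left blank here
            "SubmissionScore": 193,
            "Snippet": "my dentist is super careful and gentle while opening my mouth and stuff",
        }
    ],
}

# The file holds a list of such entries, one per non-outlier topic.
serialized = json.dumps([entry], indent=2, ensure_ascii=False)
restored = json.loads(serialized)
print(restored[0]["Topic"], restored[0]["NumDocsInTopic"], len(restored[0]["TopDocs"]))
```

Each `TopDocs` list holds at most ten documents, and each `Snippet` is capped at 300 characters, matching the limits set in `main()`.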


## 4) Display Results

In [22]:
def display_bertopic_results(data_folder):
    """
    Load and display BERTopic results in formatted tables:
    1. An overall topic overview with truncated representative documents
    2. A detailed view of each topic's top documents
    """
    import json
    from IPython.display import display, HTML

    output_folder = Path(data_folder) / "bertopic_output"

    # --- 1) Load results ---
    with open(output_folder / "topic_info.json", 'r', encoding='utf-8') as f:
        topic_info = json.load(f)

    with open(output_folder / "topic_representatives.json", 'r', encoding='utf-8') as f:
        rep_docs = json.load(f)

    # --- 2) Display condensed topic information ---
    topic_df = pd.DataFrame(topic_info)
    topic_df = topic_df[topic_df['Topic'] != -1]  # Excludes the outlier topic
    topic_df = topic_df.sort_values('Count', ascending=False)

    def get_short_rep_examples(topic_id):
        topic_entry = next((d for d in rep_docs if d['Topic'] == topic_id), None)
        if not topic_entry:
            return ""
        topic_docs = topic_entry['TopDocs']
        return '; '.join([d['Snippet'][:100] + '...' for d in topic_docs[:2]])

    topic_df['Representative_Examples'] = topic_df['Topic'].apply(get_short_rep_examples)

    pd.set_option('display.max_colwidth', 200)

    print("\n=== Topic Overview ===")
    display(HTML(topic_df[['Topic', 'Count', 'Name', 'Representative_Examples']].to_html(
        index=False,
        classes='table table-striped'
    )))

    # --- 3) Display detailed topic information ---
    for topic_data in rep_docs:
        topic_id = topic_data['Topic']
        num_docs = topic_data['NumDocsInTopic']

        print(f"\n{'='*80}")
        print(f"\nTopic {topic_id} (Contains {num_docs} documents)")
        print(f"{'-'*80}")

        # Show top 5 representative documents
        docs_data = []
        for doc in topic_data['TopDocs'][:5]:
            docs_data.append({
                'Doc_ID': doc['Doc_ID'],
                'Type': doc['Type'],
                'Subreddit': doc['Subreddit'],
                'Score': doc['SubmissionScore'],
                'Topic Prob': f"{doc['Topic_Probability']:.3f}",
                'Preview': doc['Snippet'][:200] + '...'
            })

        docs_df = pd.DataFrame(docs_data)
        print("\nTop Representative Documents:")
        display(HTML(docs_df.to_html(index=False, classes='table table-striped')))


# Example usage
data_folder = r"C:\Users\laure\Desktop\dissertation_notebook\Data"
display_bertopic_results(data_folder)



=== Topic Overview ===


Topic,Count,Name,Representative_Examples
0,3065,0_diagnosis_autism_people_adhd,"unfortunately just having an afab practitioner wouldn't guarantee your success unless they're active...; i'm audhd so creating and understanding analogies might not be my strong suit, but i'll try. the fir..."
1,1834,1_aba_therapy_child_autistic,aba is basically forcing autistic people to be less autistic and to appear more allistic- which does...; the person who is commenting is coming from a place of knowledge about aba practices of yore. they d...
2,947,2_therapy_therapist_like_just,"i've been to a few different therapists with different methods and it's been unhelpful at best and t...; i actually just broke up with my therapist last night lol. but before her, i was seeing the person w..."
3,62,3_dentist_appointment_dentists_mouth,"my dentist is super careful and gentle while opening my mouth and stuff, she also avoids getting the...; always look up recommendations for gentle densits/hygienists. i will quit a bad one and go to anothe..."
4,57,4_neurodivergent_adhd_people_term,"the unfortunate thing is that the term was coined because people were already using neurodiversity t...; they’re not the most common, dyslexia and dyspraxia are. i have autism and adhd, so there’s nothing ..."
5,42,5_masking_mask_unmasking_just,"there's no treatment in the sense of a cure, but just like there's no treatment for having no legs, ...; i’m 51 am undoubtedly neurodivergent but don’t plan to seek a formal diagnosis. the more i learn abo..."
6,38,6_exposure_exposure therapy_therapy_sensory,"how many of you all were hypersensitive to noises as a kid then had to go to abusive pseudoscience a...; i've heard exposure therapy works for phobias, but not for sensory overload. like some others alread..."
7,31,7_bpd_symptoms_personality disorder_adhd,"i didn't think you thought it wasn't real, just poonting out that this diagnosis is thrown on people...; i wonder if the rampant misdiagnosis of bpd for autistic women by full fledged doctors harms the bpd..."
8,30,8_joke_diagnosis_husband_just,"yeah my sense is, if he's your husband and you've been on this diagnostic journey for a while, he sh...; it could have been a joke that just didn't land. my wife and i have these all the time, and both sus..."
9,25,9_sex_adhd_selfish_husband,this does. not seem like an adhd thing. i'm going to say he's bad at sex. and also kind of an asshol...; being uninterested in pleasing your partner has nothing to do with adhd and i have no idea why you t...




Topic 0 (Contains 3065 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_100_1792,comment,autism,558,1.0,"unfortunately just having an afab practitioner wouldn't guarantee your success unless they're actively aware of the issue. it's kind of an issue with the whole practice, but definitely still a lot mor..."
C_P_418_8037,comment,neurodiversity,50,1.0,"i'm audhd so creating and understanding analogies might not be my strong suit, but i'll try. the fire analogy doesn't resonate with me, because i don't really think being neurodivergent is painful in ..."
C_P_418_9630,comment,neurodiversity,50,1.0,"having adhd itself doesn't feel like anything. or at least, not that i can tell. that's something that made me doubt it pre-diagnosis. but it's because it's something neurological. i never had a typic..."
C_P_419_4680,comment,AutismInWomen,30,1.0,"i am also in scotland, self referred to the new adult autism service 6 weeks after it launched in 2021, had my assessment earlier this year. so almost 3 years on the waiting list. according to a local..."
C_P_41_1465,comment,autism,684,1.0,"bit late, but i fully agree. i only realized it when i was 35 and first read about aspergers, after reading the symptoms the missing links suddenly appeared. i was tested and diagnosed, and then i rea..."




Topic 1 (Contains 1834 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_104_1493,comment,autism,65,1.0,aba is basically forcing autistic people to be less autistic and to appear more allistic- which doesn’t work- it leads to heavy autistic masking which leads to burnout- there is a reason why we have a...
C_P_364_8750,comment,Autism_Parenting,17,1.0,"the person who is commenting is coming from a place of knowledge about aba practices of yore. they do not treat aba like that anymore, for the most part. sure there may be some agencies that still hav..."
C_P_40_7458,comment,Autism_Parenting,81,1.0,so happy to hear that your son is having success with aba! i am glad to hear another parent that loves aba therapy!...
C_P_40_5276,comment,Autism_Parenting,81,1.0,that is what i tell everyone! she’s made more progress in the first 2 weeks of aba than she did in a year of speech therapy with 2 different speech language pathologists! so glad your son is making pr...
C_P_40_2578,comment,Autism_Parenting,81,1.0,"thank you for your comment! yes, i agree - the one on one time and for the length of time she goes (monday-friday) vs 2 days a week of speech makes all the difference! grateful for our slps but after ..."




Topic 2 (Contains 947 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_311_8186,comment,AutisticAdults,23,1.0,"i've been to a few different therapists with different methods and it's been unhelpful at best and traumatizing at worst. most of them weren't specialized in autism, the ones that were, were generally..."
C_P_3_6245,comment,AutismInWomen,214,1.0,"i actually just broke up with my therapist last night lol. but before her, i was seeing the person who told me i might have autism. she ended up moving to canada and tbh it’s been really hard to get o..."
C_P_3_193,comment,AutismInWomen,214,1.0,she did! they just didn’t work. my psychologist felt like the emdr was causing more harm to she stopped it and we switched to regular talk therapy for a few more years. eta - i wrote “didn’t give me t...
C_P_3_2307,comment,AutismInWomen,214,1.0,"i’m a therapist in therapy, and i’ve had the feeling of more significant progress/impact from biofeedback and neurofeedback as more brief treatment modalities. in talk therapy, i specifically ask for ..."
C_P_3_2756,comment,AutismInWomen,214,1.0,i just started with a new therapist & told them right off the bat that “talk therapy” does not work for me. like i know mostly why i do things i do or where it stems from ect but how do i rewire my br...




Topic 3 (Contains 62 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_57_9611,comment,aspiememes,193,1.0,"my dentist is super careful and gentle while opening my mouth and stuff, she also avoids getting the light on my eyes as much as possible. she being a woman helps me be more comfortable, as i don't fe..."
C_P_235_9507,comment,aspiememes,529,1.0,"always look up recommendations for gentle densits/hygienists. i will quit a bad one and go to another if they hurt me. i like the dentist, it's relaxing...."
C_P_57_4385,comment,aspiememes,193,1.0,"it really depends on what kind of dentist you have. i have an incredibly nice dentist that i got recommended to by friends. they always talk me through the procedure, which brings me a lot less anxiet..."
C_P_57_5294,comment,aspiememes,193,1.0,"i have the dentist in an hour. get them to explain everything and narrate what they're up to. do close your eyes and bring a fidget. remember that dentists don't like hurting you, it makes them sad an..."
C_P_57_5749,comment,aspiememes,193,1.0,"i have a great dentist who explains every step what's she's going to do, what it sounds like and how it's going to feel, and it's great. i close my eyes against the bright light and bring a small fidg..."




Topic 4 (Contains 57 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_175_8397,comment,neurodiversity,234,1.0,"the unfortunate thing is that the term was coined because people were already using neurodiversity to just mean autism and adhd, and the person who coined it hoped to encourage its use as a more inclu..."
C_P_175_3117,comment,neurodiversity,234,1.0,"they’re not the most common, dyslexia and dyspraxia are. i have autism and adhd, so there’s nothing that i identify with that is “left out of the neurodivergent conversation”, but those with ocd, tour..."
C_P_318_1044,comment,neurodiversity,13,1.0,"anxiety disorders absolutely fall under the umbrella of neurodivergence, because they stray from what society deems neurologically typical. autism and adhd are only the very tip of the expansive neuro..."
C_P_175_1671,comment,neurodiversity,234,1.0,"actually the most common is dyslexia (10% of the population). i can only guess but my sense from the post is that neurodivergence covers a very wise range of very different neurotypes, and communities..."
C_P_175_7446,comment,neurodiversity,234,1.0,and to add to this discussion point. many neurodivergent people have multiple points of cross over into other nd conditions or are a combination for 2-3. an example is many people are auadhd (autistic...




Topic 5 (Contains 42 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_245_2831,comment,AutisticAdults,60,1.0,"there's no treatment in the sense of a cure, but just like there's no treatment for having no legs, things like ergotherapy can still radically change your life. and masking as a skill is absolutely a..."
C_P_68_4495,comment,AutisticAdults,58,1.0,"i’m 51 am undoubtedly neurodivergent but don’t plan to seek a formal diagnosis. the more i learn about how my brain functions differently than nt brains, the more i have “aha” moments about things in ..."
C_P_232_1887,comment,aspiememes,53,1.0,"relatable. i think of masking like dissociation. it has it's uses, you just don't want to get stuck in it. psychology, brain chemistry, and societal patterns are all extremely interesting for me, thou..."
C_P_232_8857,comment,aspiememes,53,1.0,"i feel about masking the same way i feel about smoking. i won't judge you, it's your life. but don't pressure others to do it and don't pretend it's healthy...."
C_P_562_1275,comment,autism,98,1.0,"for me, masking is going into every situation like “how do i play this?”. i always consider where i am, who i’m with, only following their lead as far as how much to share, whether or not to swear, wh..."




Topic 6 (Contains 38 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
P_530,post,AutisticAdults,35,1.0,how many of you all were hypersensitive to noises as a kid then had to go to abusive pseudoscience audio therapy? i remember having to cover my ears over mundane noises like vacuums and such. it seeme...
C_P_21_4644,comment,AutismInWomen,105,1.0,"i've heard exposure therapy works for phobias, but not for sensory overload. like some others already pointed out, many nts assume our dislike of certain sounds is fear based, they don't realize the n..."
C_P_21_6599,comment,AutismInWomen,105,1.0,"questions like these are something i still struggle with in relation to dealing with my burnou, i think because i have so thoroughly internalized an nt observer gaze, in my effort to figure out and em..."
C_P_21_946,comment,AutismInWomen,105,1.0,"exposure therapy is not good for autistic people. exposure therapy is used mainly for phobias, anxiety disorders, ocd, and ptsd. since autism is not an anxiety disorder and our sensory sensitivities a..."
C_P_21_5049,comment,AutismInWomen,105,1.0,exposure therapy can work for fear but not pain. touching a hot stove over and over does not make the hot stove less painful and does traumatize the person forced to harm themselves over and over agai...




Topic 7 (Contains 31 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_203_9473,comment,autism,523,1.0,"i didn't think you thought it wasn't real, just poonting out that this diagnosis is thrown on people who don't actually have it due to medical misogyny. i think many social factors play into professio..."
C_P_73_428,comment,autism,1135,1.0,i wonder if the rampant misdiagnosis of bpd for autistic women by full fledged doctors harms the bpd community :/...
C_P_62_6414,comment,AutismInWomen,499,1.0,"i felt the above description could fit me. but bpd, discussed in therapy too never fit. i have a few theories on why bpd is such a popular misdiagnosis. adult diagnosis for autism is tricky however th..."
C_P_62_2937,comment,AutismInWomen,499,1.0,"it actually sounds like bpd to me. i've tried to maintain an open mind, but because of personal experience with people who have borderline personality disorder, i've been resistant to the idea that a ..."
C_P_328_7446,comment,AutismInWomen,262,1.0,yes i've heard bpd is a common misdiagnosis in autistic women. sadly that happens to so many!...




Topic 8 (Contains 30 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_84_6501,comment,AutismInWomen,1259,1.0,"yeah my sense is, if he's your husband and you've been on this diagnostic journey for a while, he should know how you've felt about it going in. he should've immediately responded with validation of y..."
C_P_84_8797,comment,AutismInWomen,1259,1.0,"it could have been a joke that just didn't land. my wife and i have these all the time, and both suspect we're on the spectrum but have never taken the time to figure out the process of getting a diag..."
C_P_84_7564,comment,AutismInWomen,1259,1.0,yeah my husband and i do not have the kind of relationship where we tease each other (and i hate being teased and always have and he knows it) and i would have been hurt by a comment like that. when i...
C_P_84_3942,comment,AutismInWomen,1259,1.0,"two problems: - he immediately made you new diagnosis about him - he sees your new diagnosis as a negative thing (that’s the only way the joke makes sense, i think)..."
C_P_84_3784,comment,AutismInWomen,1259,1.0,"well i'd be pissed if my husband made a joke when i'm first telling him about a diagnosis of any kind. it's not the time to make that kind of joke, it's something very serious. and it's a pretty bad j..."




Topic 9 (Contains 25 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_19_1083,comment,neurodiversity,58,1.0,this does. not seem like an adhd thing. i'm going to say he's bad at sex. and also kind of an asshole? you deserve better....
C_P_19_8076,comment,neurodiversity,58,1.0,"being uninterested in pleasing your partner has nothing to do with adhd and i have no idea why you think that it would be, to be honest...."
C_P_19_7994,comment,neurodiversity,58,1.0,"this has nothing to do with adhd or neurodivergent partners. in fact, it’s a bit insulting to even think there’s a connection. many neurodivergent people are actually into kink or are hyper sexual. th..."
C_P_19_7850,comment,neurodiversity,58,1.0,"i am neurodivergent (adhd) and i actually hyperfocus on my partners' responses and enjoyment. i get pleasure from giving them pleasure, and i'm good at it. i have had plenty of neurotypical lovers who..."
C_P_19_1805,comment,neurodiversity,58,1.0,"people with adhd are motivated by what they're interested in. things they're not interested in are work and often not things they want to do (i speak from experience, audhd here). with that being said..."




Topic 10 (Contains 24 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_205_8540,comment,AutismInWomen,77,1.0,i did ketamine treatment for a couple months in 2021. i found it extremely helpful and i can still feel the positive effects a year and a half later. i feel like it helped me parce out which parts of ...
C_P_205_4403,comment,AutismInWomen,77,1.0,"i will say that it is a very vulnerable state, while you're getting the infusion, so feel free to be as assertive and cautious as you need to be before deciding for sure that it's what you want to do...."
P_205,post,AutismInWomen,77,1.0,"ketamine treatment hello everyone! i'm just curious if any of you have done ketamine treatments (for whatever mental health reason) and if so, how did it go? i have heard lots of first hand accounts o..."
C_P_438_1579,comment,neurodiversity,76,1.0,curious if anyone has tried ketamine for autistic burnout? ​ i ask because i've done several ketamine treatment sessions over the past 24 months. i was seeking ketamine treatment because i thought i h...
C_P_367_5017,comment,AutismInWomen,13,1.0,ketamine treatments are done in a doctors office with either nasal spray or an iv and given very little amounts. i don’t plan on taking drugs to get high. i plan on taking medication to be able to fun...




Topic 11 (Contains 19 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_104_4065,comment,autism,65,1.0,"could you ellaborate on what is an autistic burnout? i think im suffering from it. like, i've not worked for two years. relationships just kill me right now.. i fantazise about a comet wiping out our ..."
C_P_446_1240,comment,neurodiversity,50,1.0,tl;dr - check out r/autisticwithadhd. learn about the boom/bust cycle. fellow recovering audhd burnout here 💜 i’d highly recommend checking out r/autisticwithadhd — has honestly been my biggest resour...
P_557,post,AutisticAdults,33,1.0,"can therapy help autistic burnout? so, i am 30 and was diagnosed with autism late in life. for the last eight years i have been going in a cycle of working a job that is very draining on me while putt..."
P_551,post,AutisticAdults,714,1.0,what every autistic person going through burnout needs to hear! i recently quit going to therapy after about 6 years because it was absolutely useless and making me feel worse. then i started going to...
P_312,post,neurodiversity,70,1.0,"autistic burnout resources does anyone know good resources on autistic burnout, written by autistic people? i'd like to have some to share with nts...."




Topic 12 (Contains 15 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_288_2736,comment,neurodiversity,142,1.0,i've had burnout since i started working decades ago and i swear i process things slower now....
C_P_288_3609,comment,neurodiversity,142,1.0,"some of it is age. over the decades, especially with the rise of social media, on demand everything, my ability to focus and retain information has just disappeared. each burnout just added to the dis..."
C_P_288_3960,comment,neurodiversity,142,1.0,"this is precisely what burnout is, and it can definitely have long lasting effect. i’ve been having burnout effects for the better part of 15 years now…and one of the “victims” is my ability to write ..."
C_P_288_5547,comment,neurodiversity,142,1.0,"tl;dr: i don't think healing from burnout is like healing from a broken leg. you don't take it easy for a while, and then go back to your old life. it's finding a new balance that's better for you and..."
C_P_288_8222,comment,neurodiversity,142,1.0,"idk if it's age or what, but each time i've tried to really push for that pure 100% effort it's been sustained for shorter and shorter times. first time i did it i lasted about a year (college). secon..."




Topic 13 (Contains 14 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_439_1509,comment,AutismInWomen,243,1.0,>one of my clients decided to tell me that she thinks vaccines cause autism because of something she read people should just shut up...
C_P_45_7871,comment,autism,421,1.0,"post-diagnosis, my dad told me he’s not a fan of vaccines because he would never want an autistic child...."
C_P_479_1624,comment,autism,155,1.0,didn't the doctor dude who started that whole thing partner up with some wheelchair dude who thought that his own bone marrow would cure it? how do people still believe that vaccines cause autism afte...
C_P_510_3612,comment,AutismInWomen,87,1.0,"because of ex-dr andrew wakefield. he was a british gastroenterologist who - to cut a very, very long story short - developed a competitor to the mmr vaccine. the problem is that his vaccine was exact..."
C_P_532_4917,comment,neurodiversity,103,1.0,...or maybe it’s just a way to spite all the “vaccines cause autism” super moms. that would be pretty funny....




Topic 14 (Contains 13 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_4_1765,comment,autism,562,1.0,"there’s nothing wrong with humming is there? if he’s loud but that’s his stim, you either have the option of stopping these things or letting them be. it’s not hurting anyone, and if he’s forced to st..."
C_P_4_2499,comment,autism,562,1.0,humming is 100% a healthy and harmless stim. i don’t think there’s any reason to teach your son to suppress it. suppressive a harmless stim can cause more harm than good. i would speak to her or even ...
C_P_4_2637,comment,autism,562,1.0,"i second that! especially as an adult who is still stimming vocally. not because i never got taught how not to, but because sometimes i absolutely have to to deal with a situation at all. there is no ..."
C_P_4_3221,comment,autism,562,1.0,you should be worried because that is gonna cause harm like you said this is the tricky part of aba when they try an take away positive stimming the humming isn't hurting your kiddo or anyone else the...
C_P_4_5766,comment,autism,562,1.0,"aba often just teaches us to mask our autism and being forced to mask led me to depression, panic attacks, meltdowns, and self harm as a teen, and eventual burnout as an adult. any child who is happy ..."




Topic 15 (Contains 12 documents)
--------------------------------------------------------------------------------

Top Representative Documents:


Doc_ID,Type,Subreddit,Score,Topic Prob,Preview
C_P_197_1688,comment,Autism_Parenting,35,1.0,"our son did not have much luck with feeding therapy, but he has expanded his “acceptable” foods a bit with age. at 14 he will eat any processed chicken nugget (not the homemade kind), any brand of box..."
C_P_197_2020,comment,Autism_Parenting,35,1.0,"this was us, we never did food therapy and he actually did start trying a few things when he was almost 6. yes, much of it is carb and sugar based but at least there is variety. we went from 5 foods t..."
C_P_197_307,comment,Autism_Parenting,35,1.0,yes. we do feeding therapy at our aba center. it is slow going and our daughter is still picky but it has expanded her pallet. she will eat two non-hidden veggies now! (cucumbers and orange peppers) t...
C_P_197_3720,comment,Autism_Parenting,35,1.0,yes. we did feeding therapy for a year and it helped tremendously....
C_P_197_6229,comment,Autism_Parenting,35,1.0,"yeah if you want one new food, i think you'll achieve it but it's gonna take some time at the start. that does 100% suck about how much the mcdonalds costs and also, i'd do the same thing you're doing..."


## Conclusion

This notebook loads preprocessed data, trains a **BERTopic** model, and exports several outputs:

- **Document Info** – The assigned topic and probability for each post/comment, plus metadata such as subreddit and keyword.  
- **Topic Info** – A summary of each topic, including the number of documents it contains.  
- **Representative Documents** – A sample of the highest-probability documents per topic (capped by source ID).  
- **HTML visualisations** (topics, hierarchy, bar chart) in the `bertopic_output` folder.
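
The "capped by source ID" selection of representative documents can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual implementation: the column names (`Topic`, `Probability`, `Source_ID`, `Doc_ID`), the function name `top_docs_capped`, and the cap values are all assumptions chosen for the example.

```python
import pandas as pd


def top_docs_capped(doc_info: pd.DataFrame, n_top: int = 5, max_per_source: int = 2) -> pd.DataFrame:
    """Keep up to n_top highest-probability rows per topic, with at most
    max_per_source rows sharing the same source post within a topic.

    Column names are hypothetical; adapt them to the exported document info.
    """
    # Order rows so the highest-probability documents come first within each topic.
    ranked = doc_info.sort_values(["Topic", "Probability"], ascending=[True, False])
    # Rank of each row within its (topic, source post) group, starting at 0.
    rank_in_source = ranked.groupby(["Topic", "Source_ID"]).cumcount()
    capped = ranked[rank_in_source < max_per_source]
    # After capping, keep the first n_top rows of each topic.
    return capped.groupby("Topic").head(n_top).reset_index(drop=True)


# Toy data (invented for illustration): three comments from one post, one from another.
example = pd.DataFrame(
    {
        "Doc_ID": ["C_a", "C_b", "C_c", "C_d"],
        "Source_ID": ["P_1", "P_1", "P_1", "P_2"],
        "Topic": [0, 0, 0, 0],
        "Probability": [1.0, 0.9, 0.8, 0.7],
    }
)
result = top_docs_capped(example, n_top=3, max_per_source=2)
```

With a cap of two per source, the third comment from `P_1` is skipped, so a single long thread cannot fill every representative slot for a topic.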

To refine or extend this process, experiment with different UMAP and HDBSCAN parameters or explore the HTML visualisations. This completes the **BERTopic pipeline** demonstration.
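
One way to organise such parameter experiments is to enumerate a small grid of settings up front. The sketch below only builds the configurations and does not train anything. The specific values are illustrative assumptions rather than recommendations, and `expand` is a hypothetical helper.

```python
from itertools import product

# Candidate settings (illustrative values only).
umap_grid = {"n_neighbors": [10, 15, 30], "n_components": [5], "min_dist": [0.0]}
hdbscan_grid = {"min_cluster_size": [10, 20], "min_samples": [5]}


def expand(grid: dict) -> list:
    """Expand a dict of lists into a list of dicts, one per combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


# Every pairing of a UMAP setting with an HDBSCAN setting.
configs = [
    {"umap": u, "hdbscan": h}
    for u, h in product(expand(umap_grid), expand(hdbscan_grid))
]

# Each entry could then be used as, for example:
#   BERTopic(umap_model=UMAP(**cfg["umap"]), hdbscan_model=HDBSCAN(**cfg["hdbscan"]))
```

Comparing the number of topics and the size of the outlier topic across these runs gives a quick sense of how sensitive the clustering is to each setting.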

## References

**Reference:**  
Grootendorst, M. (2022) *BERTopic: Leveraging BERT embeddings for unsupervised topic modeling* [computer program].  
Available from: [https://github.com/MaartenGr/BERTopic](https://github.com/MaartenGr/BERTopic) [Accessed 12 January 2025].

**Git Repo:**  
- [BERTopic GitHub](https://github.com/MaartenGr/BERTopic)

**Reference:**  
Reimers, N., and Gurevych, I. (2019) *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*.  
Available from: [https://www.sbert.net/](https://www.sbert.net/) [Accessed 12 January 2025].

**Git Repo:**  
- [SentenceTransformers GitHub](https://github.com/UKPLab/sentence-transformers)

**Reference:**  
McInnes, L., Healy, J., and Melville, J. (2018) *UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction* [computer program].  
Available from: [https://umap-learn.readthedocs.io/](https://umap-learn.readthedocs.io/) [Accessed 12 January 2025].

**Git Repo:**  
- [UMAP GitHub](https://github.com/lmcinnes/umap)

**Reference:**  
Campello, R. J. G. B., Moulavi, D., and Sander, J. (2013) *Density-Based Clustering Based on Hierarchical Density Estimates* [computer program].  
Available from: [https://hdbscan.readthedocs.io/](https://hdbscan.readthedocs.io/) [Accessed 12 January 2025].

**Git Repo:**  
- [HDBSCAN GitHub](https://github.com/scikit-learn-contrib/hdbscan)

**Reference:**  
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., et al. (2011) *Scikit-learn: Machine Learning in Python* [software framework]. *Journal of Machine Learning Research*, 12, pp. 2825–2830.  
Available from: [https://scikit-learn.org/](https://scikit-learn.org/) [Accessed 12 January 2025].

**Git Repo:**  
- [Scikit-learn GitHub](https://github.com/scikit-learn/scikit-learn)

**Reference:**  
McKinney, W. (2010) *Data Structures for Statistical Computing in Python*. *Proceedings of the 9th Python in Science Conference*, pp. 51–56 [pandas v2.2.3].  
Available from: [https://pandas.pydata.org/](https://pandas.pydata.org/) [Accessed 12 January 2025].

**Git Repo:**  
- [Pandas GitHub](https://github.com/pandas-dev/pandas)