# BERTopic
This notebook uses BERTopic to identify topics in dementia forum text. BERTopic's pipeline is made up of several fully customizable steps; each approach below changes or adds one step and builds on the previous approach.

![BERTopic Structure](files/bertopic-structure.png "BERTopic Structure")

## Data Setup
Read the data into a list in which each item is one document.

In [None]:
# Read documents from the file
# corpus_threads_combined.txt contains all dementia forum data
# Each thread in the forum is represented as a document and separated by a new line

with open('../data/corpus_threads_combined.txt', 'r', encoding='utf-8') as file:
    # Split on newlines and drop empty lines (e.g. the one left by a trailing newline)
    documents = [line.strip() for line in file.read().split('\n') if line.strip()]

In [None]:
# Install the required packages (run this cell before any cell that imports them).
# Depending on your system, you may need `pip` instead of `pip3`.
!pip3 install bertopic
!pip3 install spacy
!pip3 install datamapplot
!pip3 install --upgrade "nbformat>=4.2.0"
!pip3 install ipykernel


## Approach 1: baseline
- **Embedding Model:** [all-MiniLM-L6-v2 Sentence Transformer](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- **Dimensionality Reduction:** UMAP
- **Clustering:** HDBSCAN
- **Tokenizer:** *None*
- **Weighting Scheme:** *None*
- **Representation Tuning:** *None*

In [None]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Initialize a sentence transformer model for embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Create a BERTopic model
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)

# Fit the model on the documents
topics, probs = topic_model.fit_transform(documents)

In [None]:
# Show results and inter-topic distance map visualization
print(topic_model.get_topic_info())
topic_model.visualize_topics()
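
In the `get_topic_info()` table, BERTopic reserves topic `-1` for documents that the clustering step left unassigned (outliers). Counting the labels returned by `fit_transform` is a quick way to see how much of the corpus each topic covers. The sketch below uses a small made-up `example_topics` list in place of the real `topics` output; the helper name `topic_shares` is illustrative, not part of BERTopic.

```python
from collections import Counter


def topic_shares(topics):
    """Map each topic id to the fraction of documents assigned to it."""
    total = len(topics)
    if total == 0:
        return {}
    counts = Counter(topics)
    return {topic: count / total for topic, count in sorted(counts.items())}


# Made-up labels standing in for the `topics` list returned by fit_transform.
example_topics = [-1, 0, 0, 1, -1, 0, 2, 1]
shares = topic_shares(example_topics)
print(f"Outlier share: {shares.get(-1, 0.0):.1%}")
```

A large outlier share is worth knowing about before comparing the approaches that follow, since those documents contribute to no topic.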

## Approach 2: additional stop word removal
- **Embedding Model:** [all-MiniLM-L6-v2 Sentence Transformer](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- **Dimensionality Reduction:** UMAP
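- **Clustering:** HDBSCAN
- **Tokenizer:** CountVectorizer
- **Weighting Scheme:** *None*
- **Representation Tuning:** *None*

### Clean up data
Remove some custom stop words that are not covered by spaCy's built-in English stop-word list. Because the list is passed to the `CountVectorizer`, it affects only the topic keywords, not the document embeddings or the clustering.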
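
To make the effect of a stop-word list concrete, the sketch below imitates in plain Python what a count-based vectorizer does with it: lowercase the text, tokenize, drop stop words, and count what remains. It is a simplified stand-in, not scikit-learn's implementation; the regular expression mirrors `CountVectorizer`'s default token pattern, and the tiny stop-word set and sample sentence are made up for illustration.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in CountVectorizer's default pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_terms(text, stop_words):
    """Count lowercase tokens in `text`, skipping anything in `stop_words`."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Illustrative values only; the notebook builds its real list in the next cell.
example_stop_words = {"my", "was", "with", "and", "she", "is", "the"}
sample = "My mum was diagnosed with dementia and she is in a care home"
term_counts = count_terms(sample, example_stop_words)
print(term_counts)
```

Note that single-character tokens such as "a" never reach the stop-word filter, because the token pattern already excludes them.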

In [None]:
# Add custom stop words on top of spaCy's English stop-word list
from spacy.lang.en.stop_words import STOP_WORDS

custom_stop_words = ['with', 'my', 'your', 'she', 'this', 'was', 'her', 'have', 'as', 'he', 'him', 'but', 'not', 'so', 'are', 'at', 'be', 'has', 'do', 'got', 'how', 'on', 'or', 'would', 'will', 'what', 'they', 'if', 'or', 'get', 'can', 'we', 'me', 'can', 'has', 'his', 'there', 'them', 'just', 'am', 'by', 'that', 'from', 'it', 'is', 'in', 'you', 'also', 'very', 'had', 'a', 'an', 'for']

# Merge both lists and drop duplicates
stop_words = sorted(set(STOP_WORDS) | set(custom_stop_words))

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Use the combined spaCy + custom stop-word list built above
vectorizer_model = CountVectorizer(stop_words=stop_words)
topic_model_2 = BERTopic(vectorizer_model=vectorizer_model, embedding_model=embedding_model, verbose=True)

In [None]:
# Fit the BERTopic model to the documents
topics_2, probs_2 = topic_model_2.fit_transform(documents)

In [None]:
# Print the topic information
print(topic_model_2.get_topic_info())

# visualize inter-topic distance map
topic_model_2.visualize_topics()

## Approach 3: c-TF-IDF weighting scheme
- **Embedding Model:** [all-MiniLM-L6-v2 Sentence Transformer](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- **Dimensionality Reduction:** UMAP
- **Clustering:** HDBSCAN
- **Tokenizer:** CountVectorizer
- **Weighting Scheme:** c-TF-IDF Transformer
- **Representation Tuning:** *None*
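
c-TF-IDF treats all documents in a topic as one class and scores each term as its normalized frequency in the class times `log(1 + A / f)`, where `A` is the average number of words per class and `f` is the term's frequency across all classes. With `reduce_frequent_words=True`, the square root of the normalized frequency is used instead. The plain-Python sketch below follows that formula as given in the BERTopic documentation; it is a simplified illustration on made-up tokens, not the library's implementation, and the function name `c_tf_idf` is an assumption.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens, reduce_frequent_words=False):
    """Score terms per class as normalized tf times log(1 + A / f).

    class_tokens maps a class (topic) id to the tokens of all its documents.
    A is the average number of tokens per class; f is a term's total count.
    """
    counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    total_per_term = Counter()
    for term_counts in counts.values():
        total_per_term.update(term_counts)
    avg_tokens_per_class = sum(total_per_term.values()) / len(counts)

    scores = {}
    for c, term_counts in counts.items():
        class_size = sum(term_counts.values())
        scores[c] = {}
        for term, count in term_counts.items():
            tf = count / class_size  # L1-normalized frequency within the class
            if reduce_frequent_words:
                tf = math.sqrt(tf)  # dampens very frequent terms
            idf = math.log(1 + avg_tokens_per_class / total_per_term[term])
            scores[c][term] = tf * idf
    return scores


# Made-up token lists standing in for two topics.
toy_classes = {
    0: "care home care home visiting staff dementia".split(),
    1: "memory clinic diagnosis scan memory dementia".split(),
}
toy_scores = c_tf_idf(toy_classes, reduce_frequent_words=True)
for topic_id, term_scores in toy_scores.items():
    top_terms = sorted(term_scores, key=term_scores.get, reverse=True)[:3]
    print(topic_id, top_terms)
```

In this toy data "dementia" occurs in both classes, so it scores lower than a term with the same in-class count that is unique to one class.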

In [None]:
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model_3 = BERTopic(ctfidf_model=ctfidf_model, embedding_model=embedding_model, verbose=True, min_topic_size=100, vectorizer_model=vectorizer_model)

In [None]:
# Fit the BERTopic model to the documents
topics_3, probs_3 = topic_model_3.fit_transform(documents)

In [None]:
# Print the topic information
print(topic_model_3.get_topic_info())

# visualize inter-topic distance map
topic_model_3.visualize_topics()

## Approach 4: updated embedding model
- **Embedding Model:** [pritamdeka/S-PubMedBert-MS-MARCO](https://huggingface.co/pritamdeka/S-PubMedBert-MS-MARCO)
- **Dimensionality Reduction:** UMAP
- **Clustering:** HDBSCAN
- **Tokenizer:** CountVectorizer
- **Weighting Scheme:** c-TF-IDF
- **Representation Tuning:** *None*

In [None]:
# Initialize BERTopic with a sentence transformer fine-tuned on medical text for embeddings
medical_embedding_model = SentenceTransformer('pritamdeka/S-PubMedBert-MS-MARCO')

# Note: we also tried the embedding model below, but it gave far worse results than the one above
# nvidia_embedding_model = SentenceTransformer('dunzhang/stella_en_1.5B_v5')

topic_model_4 = BERTopic(ctfidf_model=ctfidf_model, embedding_model=medical_embedding_model, verbose=True, min_topic_size=100, vectorizer_model=vectorizer_model)

# Fit the BERTopic model to the documents
topics_4, probs_4 = topic_model_4.fit_transform(documents)

In [None]:
# Print the topic information
print(topic_model_4.get_topic_info())

# A notebook cell renders only its last expression, so call .show() on each figure

# visualize inter-topic distance map
topic_model_4.visualize_topics().show()

# visualize hierarchy
topic_model_4.visualize_hierarchy().show()

# visualize topic word scores
topic_model_4.visualize_barchart().show()

# visualize term rank
topic_model_4.visualize_term_rank().show()

In [None]:
# visualize with datamapplot
topic_model_4.visualize_document_datamap(documents)

## Approach 5: adding KeyBERT representation model
- **Embedding Model:** [pritamdeka/S-PubMedBert-MS-MARCO](https://huggingface.co/pritamdeka/S-PubMedBert-MS-MARCO)
- **Dimensionality Reduction:** UMAP
- **Clustering:** HDBSCAN
- **Tokenizer:** CountVectorizer
- **Weighting Scheme:** c-TF-IDF
- **Representation Tuning:** KeyBERT
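
A KeyBERT-style representation re-ranks a topic's candidate keywords by how close each keyword's embedding is to an embedding of the topic's documents. The sketch below shows only that ranking step, with made-up three-dimensional vectors standing in for real sentence embeddings; `rank_keywords` and all values are illustrative assumptions, not the internals of `KeyBERTInspired`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_keywords(topic_embedding, keyword_embeddings, top_n=3):
    """Return the top_n keywords most similar to the topic embedding."""
    scored = {
        word: cosine_similarity(topic_embedding, vector)
        for word, vector in keyword_embeddings.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_n]


# Toy vectors; real embeddings would come from the sentence transformer.
toy_topic_embedding = [0.9, 0.1, 0.2]
toy_keyword_embeddings = {
    "care": [0.8, 0.2, 0.1],
    "weather": [0.0, 1.0, 0.0],
    "carer": [0.7, 0.0, 0.4],
    "thanks": [0.1, 0.6, 0.7],
}
ranked = rank_keywords(toy_topic_embedding, toy_keyword_embeddings)
print(ranked)
```

Keywords that are frequent but semantically distant from the topic's documents drop down the ranking, which is the motivation for adding this step after c-TF-IDF.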

In [None]:
from bertopic.representation import KeyBERTInspired

# Create your representation model
representation_model = KeyBERTInspired()

topic_model_5 = BERTopic(ctfidf_model=ctfidf_model, embedding_model=medical_embedding_model, verbose=True, min_topic_size=100, vectorizer_model=vectorizer_model, representation_model=representation_model)


# Fit the BERTopic model to the documents
topics_5, probs_5 = topic_model_5.fit_transform(documents)

In [None]:
# Print the topic information
print(topic_model_5.get_topic_info())

# visualize inter-topic distance map
topic_model_5.visualize_topics()

In [None]:
# Comparison run mentioned in Approach 4: same configuration as topic_model_4,
# but with the stella_en_1.5B_v5 sentence transformer for embeddings
nvidia_embedding_model = SentenceTransformer('dunzhang/stella_en_1.5B_v5')
topic_model_nvidia = BERTopic(ctfidf_model=ctfidf_model, embedding_model=nvidia_embedding_model, verbose=True, min_topic_size=100, vectorizer_model=vectorizer_model)


# Fit the BERTopic model to the documents
topics_nvidia, probs_nvidia = topic_model_nvidia.fit_transform(documents)

# Print the topic information
print(topic_model_nvidia.get_topic_info())


## Approach 6: Add LLM representation
- **Embedding Model:** [pritamdeka/S-PubMedBert-MS-MARCO](https://huggingface.co/pritamdeka/S-PubMedBert-MS-MARCO)
- **Dimensionality Reduction:** UMAP
- **Clustering:** HDBSCAN
- **Tokenizer:** CountVectorizer
- **Weighting Scheme:** c-TF-IDF
- **Representation Tuning:** [mistral-small](https://ollama.com/library/mistral-small)
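
The `MistralRepresentation` module imported in the next cell is local to this project and is not reproduced here. As a rough illustration only, an LLM-based representation step typically turns a topic's top keywords and a few representative documents into a prompt that asks for a short label. The function name, prompt wording, and sample inputs below are all hypothetical and are not taken from that module.

```python
def build_topic_prompt(keywords, representative_docs, max_doc_chars=200):
    """Assemble a labeling prompt from a topic's keywords and sample documents."""
    # Truncate each document so the prompt stays short.
    doc_lines = "\n".join(f"- {doc[:max_doc_chars]}" for doc in representative_docs)
    return (
        "The following documents come from one topic in a dementia forum:\n"
        f"{doc_lines}\n"
        f"Keywords for the topic: {', '.join(keywords)}\n"
        "Reply with a short label (at most five words) describing the topic."
    )


# Hypothetical inputs for illustration.
example_prompt = build_topic_prompt(
    keywords=["care", "home", "visiting"],
    representative_docs=["Mum moved into a care home last month and visiting has been hard."],
)
print(example_prompt)
```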

In [None]:
# Reload local modules automatically so edits to MistralRepresentation are picked up
%load_ext autoreload
%autoreload 2

from MistralRepresentation import MistralRepresentation

representation_model = MistralRepresentation()
topic_model_mistral = BERTopic(ctfidf_model=ctfidf_model, embedding_model=medical_embedding_model, verbose=True, min_topic_size=100, vectorizer_model=vectorizer_model, representation_model=representation_model)


In [None]:
# Fit the BERTopic model to the documents
topics_mistral, probs_mistral = topic_model_mistral.fit_transform(documents)


In [None]:
# Save the output representations to CSV and Markdown
topic_info = topic_model_mistral.get_topic_info()
print(topic_info)

# Increment file_number on each run to keep track of successive outputs
file_number = 1
output_stem = f'mistral_output_prompt_optimized_{file_number}'
topic_info['Representation'].to_csv(f'{output_stem}.csv')


def format_topic_info_to_markdown(topic_info):
    """Render each topic as a level-2 Markdown heading followed by its name."""
    sections = [f"## Topic {row['Topic']}\n\n{row['Name']}\n\n" for _, row in topic_info.iterrows()]
    return "".join(sections)


def write_to_markdown(markdown_content, output_file):
    """Write the Markdown string to output_file as UTF-8."""
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(markdown_content)


# Format the topic information and write it to a Markdown file
markdown_content = format_topic_info_to_markdown(topic_info)
write_to_markdown(markdown_content, f'{output_stem}.md')
