# Topic Modeling Scientific Text

*This template and workflow were developed by Margaret Gratian. This set of notebooks can be used to find topics in scientific text.*
____________________________________
## 4. Use BERTopic to Topic Model Scientific Text

**Notebook Goals**
- Demonstrate how to use the previously embedded scientific text with the BERTopic Topic Modeling library to topic model the scientific text by:
    1) Reducing the dimensionality of the text vectors with UMAP
    2) Performing unsupervised clustering with HDBSCAN
    3) Generating cluster names and labels with a variation of TF-IDF.

**Requirements**
- This notebook requires the BERTopic library. Learn more about it here: https://maartengr.github.io/BERTopic/index.html.
- Please see the README for instructions and recommendations on proper installation.

**Major Caveats**
- This pipeline was developed within a Jupyter Notebook to make it user friendly and to provide documentation along the way on the different methods. Use this template as a guide or as a starting point for a script. 
- Some cells of this notebook should be updated by the user depending on the data source, embedding model, and other details. Some cells should not be modified regardless of the specific application. The cells that should not be modified begin with a comment that says "Do not modify this code."

**References**

We make use of the BERTopic library, from the following paper:
- Grootendorst, M., 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint [arXiv:2203.05794](https://arxiv.org/abs/2203.05794).

**Inputs**

The following assumes you used the recommended path for saving your data in Notebook 3. If you modified it, be sure to modify the input path here.

- Input Filepath 1: "../data/SPECTER_embeddings.csv"
     - Titles and abstracts embedded with the SPECTER model

**Outputs**

An Excel file is generated each time you run this notebook. Its path and filename are built from user-defined settings and a timestamp. If you want to change this path, you must modify the notebook code. 

## Import Packages

Run the following once when you first open this notebook. 

In [None]:
### Do not modify this code ###

import pandas as pd
import numpy as np
import datetime

# Text embedding library
from sentence_transformers import SentenceTransformer

# Topic Model libraries 
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic import BERTopic

***~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ BEGIN SECTION FOR USER INPUT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~***

The following sections require user input: 

1. Read in Data
2. Load in Text Model
3. Define Input Text and Model Output Filename
4. Define Clustering Parameters
5. Improve Topic Representation

Some cells should not be edited - these begin with a comment that says "Do not modify this code."

## 1. Read in Data

Run the following once when you first open this notebook. 

In [None]:
# Set the path to your input data
input_path = "../data/SPECTER_embeddings.csv"

In [None]:
### Do not modify this code ###
input_df = pd.read_csv(input_path, index_col=0)

print(len(input_df))

input_df.head()

## 2. Load in Text Model

The same text model that was used to produce the embeddings upstream of this notebook should be loaded in here. It is used after clustering to produce representative labels (keywords) for the clusters. 

In [None]:
# Load model
model_path = "allenai-specter"
embedding_model = SentenceTransformer(model_path)

## 3. Define Input Text and Model Output Filename

### 3.1 Set Input Text

The input text refers to the text used to produce the embeddings that the model will use for clustering.

Note that the input_text string provided below must exactly match one of the columns of the data read into the notebook. Embeddings of the input text should be available as a column that exactly matches the input_text column name with "_embedding" appended. 

When using SPECTER, the input text must be of the format title + "[SEP]" + abstract.
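As a minimal sketch of the two conventions above, the following shows the expected text format and a quick check that both required columns are present. The helper names `build_specter_text` and `missing_columns` are hypothetical and are not part of this notebook.

```python
def build_specter_text(title, abstract, sep="[SEP]"):
    """Join a title and abstract in the title + "[SEP]" + abstract format."""
    return title + sep + abstract


def missing_columns(columns, input_text):
    """Return the required column names that are absent from `columns`."""
    required = [input_text, input_text + "_embedding"]
    return [name for name in required if name not in columns]


# Toy strings for illustration only
example = build_specter_text("A toy title", "A toy abstract.")
print(example)

# With a DataFrame, pass list(input_df.columns) as the first argument
print(missing_columns(["title_abstract"], "title_abstract"))
```

An empty list from `missing_columns` means both the text column and its "_embedding" column were found.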

In [None]:
# Set the input text that will be used to cluster 
# Note this must exactly match the column name associated with this data 
# This should refer to the text itself, not the embeddings of the text
input_text = "title_abstract"
print(input_text)

### 3.2 Set Analyst Initials

Set the initials of the analyst. This is used in the model result file naming.

Optionally, consider adding a short description of the machine on which this code was run. This can be important to track for replicability, because randomness in UMAP's initialization can be affected by the user's operating system.
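One possible way to record the machine is to build a short tag from Python's standard `platform` module and append it to the initials. This is only a sketch; the `machine_tag` helper and the variable `analyst_initials_with_machine` are hypothetical names, and "PL" is the same placeholder used in the cell below.

```python
import platform


def machine_tag():
    """Build a short, filename-safe description of the current machine."""
    parts = [platform.system(), platform.machine(), "py" + platform.python_version()]
    # Replace characters that are awkward in filenames
    return "_".join(p.replace(" ", "-").replace(".", "-") for p in parts if p)


analyst_initials_with_machine = "PL" + "_" + machine_tag()
print(analyst_initials_with_machine)
```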

In [None]:
# Set analyst initials
# For now, there is a placeholder denoted PL
analyst_initials = "PL"
print(analyst_initials)

### 3.3 Set Project Name

Set a short project description. This is used in model result file naming.

In [None]:
project_keywords = "tobacco_SPECTER"
print(project_keywords)

### 3.4 Get the Time

The timestamp is used in the model output filename to uniquely identify the model run. 

In [None]:
### DO NOT MODIFY THIS CODE ###
timestamp =  datetime.datetime.now().strftime('%m_%d_%Y_%H%M')
print(timestamp)

### 3.5 Set Output Filepath 

In [None]:
# Set the folder location of the model output 
# Pay careful attention to this so your file ends up where you expect 
# You may consider adding subfolders to stay organized
output_path = "../data/"
print(output_path)

### 3.6 Set Output Filename

In [None]:
# Use the information defined above to set the name of the model output file 
output_filename = input_text + "_" + analyst_initials + "_" + timestamp + "_" + project_keywords + ".xlsx"
print(output_filename)
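
The full output location is later formed by concatenating `output_path` and `output_filename`, which relies on `output_path` ending in "/". As a hedged alternative, the sketch below joins the same parts with `pathlib` so the separator is handled automatically. The helper name `build_output_file` is hypothetical and the timestamp is a placeholder value.

```python
from pathlib import Path


def build_output_file(output_dir, input_text, analyst_initials, timestamp, project_keywords):
    """Join the filename parts with underscores and attach them to the output folder."""
    filename = "_".join([input_text, analyst_initials, timestamp, project_keywords]) + ".xlsx"
    return Path(output_dir) / filename


# Placeholder values for illustration only
example_file = build_output_file("../data", "title_abstract", "PL", "01_31_2024_0930", "tobacco_SPECTER")
print(example_file.name)
```

Because the timestamp format used in this notebook has minute resolution, two runs started within the same minute with the same settings would produce the same filename.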

## 4. Define Clustering Parameters 

### 4.1 Set UMAP Parameters (Dimensionality Reduction)

Adjust the UMAP parameters here.

See here for more: https://umap-learn.readthedocs.io/en/latest/parameters.html

In [None]:
# Set to int 
n_neighbors = 3

# Set to int 
n_components = 100

# Set to string, must correspond with a metric implemented in UMAP library 
umap_metric = "cosine"

### 4.2 Set HDBSCAN Parameters 

Adjust the HDBSCAN Parameters here. 

See here for more: https://hdbscan.readthedocs.io/en/latest/parameter_selection.html

In [None]:
# Set to int 
min_cluster_size = 30 

# Set to int 
min_samples = 10

# Set to string, must correspond with a metric implemented in HDBSCAN library 
# Note that the choice of metric should be influenced by the choice of n_components in the UMAP model
# With a low n_components, the reduced space is low dimensional and metrics like euclidean should work well;
# with a high n_components, you may need to consider a metric that works better in higher dimensions
hdbscan_metric = "euclidean"

## 5. Improve Topic Representation

This step is used to help refine the cluster names and labels. Importantly, this does not influence clustering. Rather, this process is applied after clustering to provide a machine-generated representation of the clusters. Consider having subject matter experts review these representations as part of your model validation process.

### 5.1 Define Stop Words 

Stop words are words that should be ignored in the cluster name and label generation process. This is our opportunity to make sure words like "the" or "and" do not show up as cluster names or labels. 

We define a list of our own stop words and add it to the default list of English stop words from scikit-learn. If you want to add additional stop words, add them to the additional_words list. The default scikit-learn list contains common words like "the", "and", and "or". When adding words, consider ones that may be specific to your application. For example, we add "tobacco" and "cessation" as two of our stop words, because all of the publications in our example dataset should contain them (this is how we searched for these publications in PubMed!). We also add a few words that we expect to be common in scientific publications, such as "hypothesis".

We also add "sep" as a stop word, to filter out the "[SEP]" separator required for SPECTER. It is listed in lowercase because CountVectorizer lowercases text by default before stop words are removed.

Learn more about stop words in the following:
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/feature_extraction.html#using-stop-words

In [None]:
# Add any additional words to this list 
# Use lowercase, since CountVectorizer lowercases text by default before removing stop words
additional_words = ["hypothesis", "hypotheses", "hypothesize", "propose", "proposal", 
                    "goal", "goals", "objective", "objectives", 
                    "aim", "aims", "abstract", "tobacco", "cessation", "sep" 
                   ]

In [None]:
### DO NOT MODIFY THIS CODE ###

# Get the default stop words from sklearn
default_stop_words = text.ENGLISH_STOP_WORDS

# See the total number of stop words in the current list
print(len(default_stop_words))

# Add the additional stop words to the set 
updated_stop_words = text.ENGLISH_STOP_WORDS.union(additional_words)

# Convert to list 
# Add sorted() to keep alphabetical sorting
updated_stop_words = sorted(list(updated_stop_words))

# Check data type
print(type(updated_stop_words))

# Check how many stop words we have now
print(len(updated_stop_words))
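
Stop-word matching is case sensitive, and CountVectorizer lowercases text by default (`lowercase=True`) before filtering. The sketch below is a plain-Python approximation of that default behavior, not scikit-learn's actual code; `simple_analyzer` and the toy document are hypothetical. It shows why stop words are best given in lowercase.

```python
import re

# Same idea as CountVectorizer's default token pattern: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_analyzer(doc, stop_words):
    """Lowercase, tokenize, then drop tokens found in `stop_words`."""
    tokens = TOKEN_PATTERN.findall(doc.lower())
    return [t for t in tokens if t not in stop_words]


toy_doc = "Smoking cessation trial[SEP]We propose a tobacco cessation study."

# An uppercase entry never matches the lowercased token "sep"
upper_result = simple_analyzer(toy_doc, {"SEP", "tobacco", "cessation"})
lower_result = simple_analyzer(toy_doc, {"sep", "tobacco", "cessation"})
print(upper_result)
print(lower_result)
```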

### 5.2 Adjust c-TF-IDF Parameters

c-TF-IDF is a modification of TF-IDF developed by the BERTopic library author. It treats all documents in a cluster as a single document and applies TF-IDF across clusters to identify the words that are important to each cluster, then uses these words to produce a topic representation. 

To understand c-TF-IDF, see the following for more:
- https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html#c-tf-idf
- https://maartengr.github.io/BERTopic/api/ctfidf.html#bertopic.vectorizers.ClassTfidfTransformer

To understand TF-IDF, see: 
- https://en.wikipedia.org/wiki/Tf%E2%80%93idf
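To make the idea concrete, here is a toy, plain-Python sketch of the c-TF-IDF weighting described in the BERTopic paper, W(x, c) = tf(x, c) · log(1 + A / f(x)), where tf(x, c) is the frequency of word x in cluster c (normalized by cluster length here), f(x) is the frequency of x across all clusters, and A is the average number of words per cluster. The function name `ctfidf` and the toy clusters are hypothetical, and the library's options such as `bm25_weighting` and `reduce_frequent_words` are not reproduced.

```python
import math
from collections import Counter


def ctfidf(class_docs):
    """Compute toy c-TF-IDF scores for {class_label: [documents]}."""
    # Join every document in a class into one long "class document"
    class_counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in class_docs.items()
    }
    # f(x): frequency of each word across all classes
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per class
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[label] = {
            word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, count in counts.items()
        }
    return scores


# Toy clusters for illustration only
toy_clusters = {
    0: ["nicotine patch study nicotine gum", "nicotine therapy study"],
    1: ["youth vaping flavors study", "vaping youth survey study"],
}
toy_scores = ctfidf(toy_clusters)
for label, word_scores in toy_scores.items():
    top = sorted(word_scores, key=word_scores.get, reverse=True)[:3]
    print(label, top)
```

In this toy example, "study" appears in both clusters and therefore scores lower than words that are frequent in only one cluster.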

In [None]:
# Set a tuple of ints 
# CountVectorizer range (allows you to set how many words can be put together to form a topic label)
ngram_range = (1,3)

# Set to boolean True or False
# ctfidf parameter
bm25_weighting = False

# Set to boolean True or False
# ctfidf parameter
reduce_frequent_words = True

# Set to int 
# Number of top words, based on c-TF-IDF, to consider for topic representation
top_n_words = 20
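
To illustrate what `ngram_range` controls, the sketch below lists every contiguous word sequence whose length falls in the given range. It is a simplified stand-in for what CountVectorizer does internally; `word_ngrams` and the toy tokens are hypothetical.

```python
def word_ngrams(tokens, ngram_range):
    """Return all contiguous n-grams with lengths in [min_n, max_n]."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


toy_tokens = ["nicotine", "replacement", "therapy", "trial"]
toy_grams = word_ngrams(toy_tokens, (1, 3))
print(toy_grams)
```

With a range of (1, 3), single words, two-word phrases, and three-word phrases are all candidates for the topic representation, but four-word phrases are not.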

### 5.3 Adjust MMR Parameters

Use this to limit duplicative words by increasing diversity. MMR takes a float value between 0 and 1, with lower values being less diverse and values closer to 1 being as diverse as possible. Note this is still subjective and based on the embedding model!

See here for more: 
- https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#diversify-topic-representation

Because we are adjusting MMR, we also need to set the top_n_words parameter. Its default is 10, and with that default MMR would be choosing 10 words from a pool of only 10 candidates, so the diversity setting would make no difference! The library author recommends a value between 10 and 20 for top_n_words. Note that, while relevant to this section, top_n_words is a parameter of BERTopic(), not of MMR. 

See the following for more: 
- https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html#top_n_words
- https://github.com/MaartenGr/BERTopic/issues/1654
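The following is a minimal sketch of greedy Maximal Marginal Relevance in plain Python, assuming cosine similarity and the common scoring rule (1 - diversity) × relevance - diversity × redundancy. It is meant to illustrate the effect of `diversity`, not to reproduce the library's implementation; the function names and the toy 2-D vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mmr(topic_vector, word_vectors, top_n, diversity):
    """Greedily select `top_n` words from {word: vector} using MMR."""
    relevance = {w: cosine(vec, topic_vector) for w, vec in word_vectors.items()}
    # Start with the single most relevant word
    selected = [max(relevance, key=relevance.get)]
    candidates = [w for w in word_vectors if w not in selected]

    while candidates and len(selected) < top_n:
        def score(word):
            # Redundancy: similarity to the closest already-selected word
            redundancy = max(cosine(word_vectors[word], word_vectors[s]) for s in selected)
            return (1 - diversity) * relevance[word] - diversity * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D vectors for illustration; real word embeddings come from the text model
toy_topic = (1.0, 0.0)
toy_words = {
    "smoking": (1.0, 0.10),
    "smokers": (1.0, 0.15),
    "policy": (0.6, -0.8),
}
low_diversity = mmr(toy_topic, toy_words, top_n=2, diversity=0.0)
high_diversity = mmr(toy_topic, toy_words, top_n=2, diversity=0.9)
print(low_diversity)
print(high_diversity)
```

With low diversity the near-duplicate "smokers" is kept alongside "smoking", while high diversity swaps it for the less similar "policy". When `top_n` equals the number of candidates, every word is selected regardless of `diversity`, which is why the candidate pool needs to be larger than the number of words returned.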

In [None]:
# Set to float between 0 and 1 (non inclusive of one)
# MMR parameter
# Note that for this to adjust anything, the number of top_n_words to select from must be more than 10
diversity = 0.9 

***~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ END SECTION FOR USER INPUT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~***

## 6. Set Up Dictionary of Definitions, Inputs, and Hyperparameters

The following dictionary contains all the inputs and hyperparameters adjusted above. The values of the dictionary are based on the values set above. You should not modify this section of the code. 

The values of the dictionary are used in the subsequent sections of the code. 

In [None]:
### DO NOT MODIFY THIS CODE ###

hyperparameter_dictionary = {
    "data":input_text,
    "embedding_model": model_path,
    "n_neighbors":n_neighbors, # UMAP 
    "n_components":n_components, # UMAP
    "umap_metric":umap_metric, # UMAP
    "random_state":42, # UMAP (set to prevent stochastic behavior between model runs)
    "min_cluster_size":min_cluster_size, # HDBSCAN
    "min_samples":min_samples, # HDBSCAN
    "hdbscan_metric":hdbscan_metric, # HDBSCAN
    "stop_words": updated_stop_words, # list of words to use as stop words in topic representation
    "ngram_range":ngram_range, # CountVectorizer range (allows you to set how many words can be put together to form a topic)
    "bm25_weighting":bm25_weighting, # ctfidf parameter
    "reduce_frequent_words":reduce_frequent_words, #ctfidf parameter
    "top_n_words":top_n_words, # Number of words to consider for topic representation
    "diversity":diversity, # MMR parameter 
    "output_filename": output_path + output_filename
}

## 7. Prepare Input Text and Input Text Embeddings

Now, get the input text ready for topic modeling.

### 7.1 Get Docs

In [None]:
### Do not modify this code ###

# Make a copy of the input_df
input_df_nas_dropped = input_df.copy()

In [None]:
### Do not modify this code ###
# If following the steps in the previous notebooks 1-3, there should be no empty values by this point. However, this is here as an additional check on the data.

# Drop any Nan values
input_df_nas_dropped = input_df_nas_dropped.dropna(subset=[input_text])

# See updated shape
print(input_df_nas_dropped.shape)

In [None]:
### Do not modify this code ###

# Get all the text as a list of strings
docs = list(input_df_nas_dropped[input_text])
print(len(docs))

# Preview
docs[:1]

### 7.2 Prepare Embeddings 

Embeddings are passed to the BERTopic model as a 2D NumPy array. Because Pandas reads each embedding from the CSV file as a string, we first convert the embedding column back into numeric values.

In [None]:
input_df_nas_dropped[[input_text + "_embedding"]].head()

In [None]:
# Strip the leading and trailing [] and split into a list
input_df_nas_dropped[input_text + "_embedding"] = input_df_nas_dropped[input_text + "_embedding"].str.strip('[]').str.split()

input_df_nas_dropped[[input_text + "_embedding"]].head()

In [None]:
# Convert each element of the list (currently strings) to floats
input_df_nas_dropped[input_text + "_embedding"] = input_df_nas_dropped[input_text + "_embedding"].apply(lambda x: [float(element) for element in x])
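
The two conversion steps above can also be sketched as a single helper, together with a check that every embedding has the same length before the array is built. This is an illustrative sketch only: `parse_embedding` and `check_same_length` are hypothetical names, and the comma handling is an assumption for files where the embeddings were saved as Python lists rather than NumPy arrays.

```python
def parse_embedding(text_value):
    """Convert a string such as "[0.1 -0.2 0.3]" into a list of floats."""
    # Treat commas as whitespace so "[0.1, -0.2]" parses the same way
    cleaned = text_value.strip().strip("[]").replace(",", " ")
    return [float(token) for token in cleaned.split()]


def check_same_length(vectors):
    """Return the shared embedding length, or raise if lengths differ."""
    lengths = {len(v) for v in vectors}
    if len(lengths) > 1:
        raise ValueError(f"Inconsistent embedding lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


# Toy strings for illustration only
parsed = [parse_embedding("[0.1 -0.2\n 0.3]"), parse_embedding("[1.0, 2.0, 3.0]")]
print(check_same_length(parsed))
```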

In [None]:
### Do not modify this code ###

# Get column of embeddings as list 
embedding_column = list(input_df_nas_dropped[input_text + "_embedding"])
print(len(embedding_column))

In [None]:
### Do not modify this code ###

# Now, convert inner list AND outer list to np ndarray
embeddings = np.array([np.array(inner_list, dtype=np.float32) for inner_list in embedding_column], dtype=np.float32)

In [None]:
### Do not modify this code ###

# See the embedding 2D np array
embeddings

## 8. Build BERTopic Model

Use the settings defined in Sections 4 (Define Clustering Parameters) and 5 (Improve Topic Representation) to build and fit the BERTopic model. 

### 8.1 Set Model Components

**8.1.1 UMAP**

In [None]:
### Do not modify this code ###

# Set the random state so we prevent stochastic results 
# Also set other key parameters - n_neighbors, n_components, and metric
umap_model = UMAP(n_neighbors=hyperparameter_dictionary["n_neighbors"], 
                  n_components=hyperparameter_dictionary["n_components"], 
                  metric=hyperparameter_dictionary["umap_metric"], 
                  random_state=hyperparameter_dictionary["random_state"])

**8.1.2 HDBSCAN**

In [None]:
### Do not modify this code ###

# Set key parameters - min_cluster_size, min_samples, and metric
hdbscan_model = HDBSCAN(min_cluster_size=hyperparameter_dictionary["min_cluster_size"], 
                        min_samples=hyperparameter_dictionary["min_samples"], 
                        metric=hyperparameter_dictionary["hdbscan_metric"],
                        gen_min_span_tree=True # Adjust this param so we can look at relative validity 
                       )

**8.1.3 Improving the Default Topic Representation**

In [None]:
### Do not modify this code ###

# Add the modified list of stop words and the ngram range from the dictionary 
vectorizer_model = CountVectorizer(stop_words=hyperparameter_dictionary["stop_words"], 
                                   ngram_range=hyperparameter_dictionary["ngram_range"])

In [None]:
### Do not modify this code ###

# Adjust the c-TF-IDF model
ctfidf_model = ClassTfidfTransformer(bm25_weighting=hyperparameter_dictionary["bm25_weighting"], 
                                     reduce_frequent_words=hyperparameter_dictionary["reduce_frequent_words"])

In [None]:
### Do not modify this code ###

# Add additional topic representations 

# Set mmr_model
mmr_model = MaximalMarginalRelevance(diversity=hyperparameter_dictionary["diversity"])

# KeyBERT
# Here we do not adjust anything, but see here for more: https://maartengr.github.io/KeyBERT/index.html
keybert_model = KeyBERTInspired()

# Add these to a dictionary 
representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model
}

### 8.2 Fit Topic Model

In [None]:
### Do not modify this code ###

# Set up the topic model, using the information defined above
# Note the embedding model MUST match that defined at the top 
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    ctfidf_model=ctfidf_model,
    vectorizer_model=vectorizer_model,
    top_n_words=hyperparameter_dictionary["top_n_words"],
    representation_model=representation_model,
)

In [None]:
### Do not modify this code ###

# Pass the docs and embeddings to topic model 
topics, probs = topic_model.fit_transform(docs, embeddings)

In [None]:
### Do not modify this code ###

# The parameters here should match what was set in the dictionary 
# Note we don't adjust all of these methods
topic_model.get_params()

## 9. Dataset Development

Build DataFrames from the model results. These will be saved in an Excel file.

### 9.1 Build Topic Level DataFrame

In [None]:
### Do not modify this code ###

# Do not modify the topics DataFrame itself, as this is saved as model output 

# Show topics
topics_df = topic_model.get_topic_info()
print(topics_df.shape)

# Preview
topics_df.head()

### 9.2 Build Document Level DataFrame

In [None]:
### Do not modify this code ###

# Do not modify the document_info DataFrame itself, as this is saved as model output 

document_info = topic_model.get_document_info(docs)
print(document_info.shape)

# Preview
document_info.head()

### 9.3 Build Topic Word Representation DataFrame

In [None]:
### Do not modify this code ###

# Do not modify the topic_words DataFrame itself, as this is saved as model output 
topics_dictionary = topic_model.get_topics()

# Convert to DataFrame
# Note that the number of columns = top_n_words
topic_words = pd.DataFrame.from_dict(topics_dictionary, orient="index")

print(topic_words.shape)

# Preview
topic_words.head()

### 9.4 Get Cluster Metrics

Get the following metrics and add to the dictionary: 
- Number of clusters
- Number of documents considered noise
- The relative validity (DBCV) score.
    - Use this metric with caution. It is intended to help with hyperparameter selection, but unsupervised learning methods like clustering are challenging to evaluate with reliable metrics because there is no ground truth. See the following for more about the DBCV score: 
        - https://epubs.siam.org/doi/pdf/10.1137/1.9781611973440.96
        - https://github.com/christopherjenness/DBCV
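As a small illustration of the first two metrics, the sketch below summarizes a list of per-document topic assignments, treating -1 as noise and leaving it out of the cluster count. The helper name `summarize_assignments` and the toy assignments are hypothetical; the noise fraction is an extra convenience value, not something this notebook saves.

```python
from collections import Counter


def summarize_assignments(topics):
    """Summarize a list of per-document topic ids, where -1 marks noise."""
    counts = Counter(topics)
    n_noise = counts.get(-1, 0)
    n_clusters = len([t for t in counts if t != -1])
    noise_fraction = n_noise / len(topics) if topics else 0.0
    return {"num_clusters": n_clusters, "num_noise": n_noise, "noise_fraction": noise_fraction}


# Toy assignments for illustration; in the notebook these come from fit_transform
toy_topics = [0, 0, 1, -1, 1, 2, -1, 0]
summary = summarize_assignments(toy_topics)
print(summary)
```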

In [None]:
### Do not modify this code ###

# Get number of clusters (excluding the -1 noise topic, if present)
num_clusters = int((topics_df["Topic"] != -1).sum())
print(num_clusters)

# Add to dictionary
hyperparameter_dictionary["num_clusters"] = num_clusters

In [None]:
### Do not modify this code ###

# Get number of unclustered
num_unclustered = document_info[document_info["Topic"] == -1].shape[0]
print(num_unclustered)

# Add to dictionary
hyperparameter_dictionary["num_documents_clustered_as_noise"] = num_unclustered

In [None]:
### Do not modify this code ###

# Relative Validity - used to evaluate differences in hyperparameter choice (higher scores are better)
# https://hdbscan.readthedocs.io/en/latest/api.html#id92
relative_validity = topic_model.hdbscan_model.relative_validity_
print(relative_validity)

# Save this to the hyperparameter dictionary 
hyperparameter_dictionary["relative_validity"] = relative_validity

### 9.5 Build DataFrame of Model Inputs and Metrics

In [None]:
### Do not modify this code ###

# Build DataFrame from the dictionary
hyperparameters_df = pd.DataFrame.from_dict(hyperparameter_dictionary, orient="index", columns=["model_info"])

# Preview
hyperparameters_df

## 10. Analyze and Extract Insights from Data

### 10.1 Visualize Data

The BERTopic library provides several different ways to visualize the data. We demonstrate a few that can be useful to understand the results here. 

**10.1.1 Hierarchical View**

In [None]:
### Do not modify this code ###

# See the hierarchical view of the data
topic_model.visualize_hierarchy()

**10.1.2 Barchart View**

In [None]:
# Adjust the parameters here to see different numbers of clusters and words 
# Note this only shows the largest clusters 
topic_model.visualize_barchart(top_n_topics=10, n_words=15)

**10.1.3 2-Dimensional View of Documents**

Note the plot below is interactive.

In [None]:
# Reduce dimensionality to view in 2 dimensions
reduced_embeddings = UMAP(n_components=2, random_state=42).fit_transform(embeddings)

# Now visualize individual documents
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings, custom_labels=True)

## 11. Save Outputs

### 11a. Save Model Results

In [None]:
# The following outputs will be saved to: 
save_to = hyperparameter_dictionary["output_filename"]
print(save_to)

In [None]:
# Save to Excel file
with pd.ExcelWriter(save_to) as writer:  
    hyperparameters_df.to_excel(writer, sheet_name='model_info')
    topics_df.to_excel(writer, sheet_name='topic_overview')
    topic_words.to_excel(writer, sheet_name='topic_words')
    document_info.to_excel(writer, sheet_name='document_topic_assignment')

### 11b. (Optional) Save Model 

If you want to save the model itself and not just the results, uncomment the code below. This uses [Safe Tensors](https://github.com/huggingface/safetensors) to save the model. 

By default this will save into the output path specified earlier, but you may wish to create a distinct folder for all the model data instead.

In [None]:
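# Uncomment if you want to save the model
# With safetensors serialization, save_embedding_model expects a string that points to the embedding model
#topic_model.save(output_path, serialization="safetensors", save_ctfidf=True, save_embedding_model=model_path)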