# Topic Modelling

This notebook performs topic modelling with BERTopic, using OpenAI GPT-4o mini for topic labels together with pre-computed PubMedBERT embeddings and t-SNE coordinates.

## ✋Set Up

### Set up GPUs

In [None]:
# GPU information:

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Wed Oct  9 17:07:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

To use a GPU with this notebook, open **Runtime > Change runtime type** and set the hardware accelerator dropdown to GPU.
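
The `!nvidia-smi` cell above relies on IPython shell magic, so it only runs inside a notebook. As a minimal sketch, assuming all that is needed is to know whether `nvidia-smi` is on the PATH and exits cleanly, the same check can be written in plain Python. The helper name `gpu_summary` is my own and not part of any library.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the nvidia-smi report as text, or None when no GPU is reachable."""
    # nvidia-smi is usually missing entirely on CPU-only runtimes
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # A non-zero exit code means the tool could not talk to a GPU
    if result.returncode != 0:
        return None
    return result.stdout


report = gpu_summary()
print(report if report is not None else "Not connected to a GPU")
```

Checking the exit code avoids searching the output for the word "failed", which depends on the exact wording of the error message.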

### High RAM

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 54.8 gigabytes of available RAM

You are using a high-RAM runtime!


Users who have purchased one of Colab's paid plans have access to high-memory VMs when they are available.

The code cell above reports how much memory the runtime has. If it prints "Not using a high-RAM runtime", you can enable a high-RAM runtime via **Runtime > Change runtime type** and then select High-RAM in the Runtime shape dropdown. Afterwards, re-run the cell.

### Install libraries

In [None]:
#installing for this work.
!pip install --quiet  bertopic==0.16.3 scikit-learn==1.2.2 torch==2.1.0 torchvision==0.16 transformers==4.45.1 bitsandbytes==0.44.1 openai==1.51.1 #openai accelerate bitsandbytes xformers adjustText tqdm


In [None]:
import warnings
warnings.filterwarnings("ignore")  # suppress library warnings in the cell outputs

## 📄 Data

Import the dataset, which contains the full embeddings and the reduced (UMAP and t-SNE) embeddings.

In [None]:
import pandas as pd
df = pd.read_hdf('Files/embeddings_full_tSNE_uMAP_01MAR2024.h5', key='embeddings')

In [None]:
df.head(3)

Unnamed: 0,pmid,title,abstract,language,journal_title,pub_year,authors,predicted_category,full_embeddings,umap_2D_x,umap_2D_y,umap_3D_x,umap_3D_y,umap_3D_z,tsne_2D_x,tsne_2D_y
0,9748443,Effect of slow growth on metabolism of Escheri...,Escherichia coli growing on glucose in minimal...,eng,Journal of bacteriology,1998.0,"Tweeddale H, Notley-McRobb L, Ferenci T",Microbiology,"[[0.04921199, 0.1013429, 0.009529841, -0.08067...",7.770308,7.748135,8.087116,7.735348,6.041772,-18.861538,25.245789
1,10675895,On the optimization of classes for the assignm...,"At present, the assignment of function to nove...",eng,Trends in biotechnology,2000.0,"Kell DB, King RD",unlabeled,"[[0.074717656, 0.12005615, 0.023376802, 0.0167...",3.69291,6.724228,4.727096,6.715599,7.221296,45.593254,63.230408
2,10731098,Assessing the effect of reactive oxygen specie...,A two-dimensional thin-layer chromatographic a...,eng,Redox report : communications in free radical ...,1999.0,"Tweeddale H, Notley-McRobb L, Ferenci T",unlabeled,"[[-0.009071778, 0.013007838, -0.0069063944, -0...",8.485703,8.156181,8.033753,7.959665,5.779803,-18.892046,25.188398


In [None]:
df_filtered = df[['title', 'abstract', 'journal_title', 'pub_year', 'authors', 'tsne_2D_x', 'tsne_2D_y']]

In [None]:
df_filtered.head(3)

Unnamed: 0,title,abstract,journal_title,pub_year,authors,tsne_2D_x,tsne_2D_y
0,Effect of slow growth on metabolism of Escheri...,Escherichia coli growing on glucose in minimal...,Journal of bacteriology,1998.0,"Tweeddale H, Notley-McRobb L, Ferenci T",-18.861538,25.245789
1,On the optimization of classes for the assignm...,"At present, the assignment of function to nove...",Trends in biotechnology,2000.0,"Kell DB, King RD",45.593254,63.230408
2,Assessing the effect of reactive oxygen specie...,A two-dimensional thin-layer chromatographic a...,Redox report : communications in free radical ...,1999.0,"Tweeddale H, Notley-McRobb L, Ferenci T",-18.892046,25.188398


In [None]:
abstracts = df_filtered["abstract"]
titles = df_filtered["title"]

In [None]:
print(abstracts[0])

Escherichia coli growing on glucose in minimal medium controls its metabolite pools in response to environmental conditions. The extent of pool changes was followed through two-dimensional thin-layer chromatography of all 14C-glucose labelled compounds extracted from bacteria. The patterns of metabolites and spot intensities detected by phosphorimaging were found to reproducibly differ depending on culture conditions. Clear trends were apparent in the pool sizes of several of the 70 most abundant metabolites extracted from bacteria growing in glucose-limited chemostats at different growth rates. The pools of glutamate, aspartate, trehalose, and adenosine as well as UDP-sugars and putrescine changed markedly. The data on pools observed by two-dimensional thin-layer chromatography were confirmed for amino acids by independent analysis. Other unidentified metabolites also displayed different spot intensities under various conditions, with four trend patterns depending on growth rate. As R

In [None]:
len(abstracts)

80656

## OpenAI GPT - setup

In [None]:
from google.colab import userdata
import os

openai_key = userdata.get('OPENAI_API_KEY')

### Prompt Template: OpenAI

In [None]:
# OpenAI prompt

openai_prompt = """
You are an expert in metabolomics and scientific literature analysis. Your task
is to generate concise, informative topic labels for collections of metabolomics
abstracts from PubMed. Each topic label should be no more than 6 words long and
should capture the essence of the metabolomics research described.

Here is an example:
I have a topic that contains the following metabolomics abstracts:
- This study investigates the metabolic profiling of plasma samples from patients
with type 2 diabetes using LC-MS/MS. We identified several key metabolites
associated with insulin resistance.
- Our research focuses on the application of NMR spectroscopy to analyze urine
samples for early detection of kidney disease. The metabolic signatures
revealed potential biomarkers.
- We employed GC-MS to examine the metabolome of cancer cells under hypoxic
conditions. The results showed significant alterations in glucose and glutamine
metabolism.

The topic is described by the following keywords: 'metabolomics, LC-MS, NMR,
biomarkers, disease detection.'

A suitable topic label would be: Disease Biomarker Discovery.

Now, based on the information provided below, please create a concise topic
label for this metabolomics topic in 6 words or fewer.

Documents: [DOCUMENTS]
Keywords: [KEYWORDS]

Return only the topic label, nothing more. Make sure it is in the following format:
topic: <topic label>
"""
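
To make the two placeholders concrete, here is a rough sketch of how a prompt of this shape might be filled in and how a reply in the `topic: <topic label>` format can be parsed. It does not call the OpenAI API and does not reproduce BERTopic's internals. `fill_prompt`, `parse_label`, the shortened template and the sample documents are illustrative assumptions; the sample reply reuses a label from this notebook's output.

```python
TEMPLATE = """Documents: [DOCUMENTS]
Keywords: [KEYWORDS]

Return only the topic label, nothing more. Make sure it is in the following format:
topic: <topic label>
"""


def fill_prompt(template, documents, keywords):
    """Substitute the [DOCUMENTS] and [KEYWORDS] placeholders."""
    doc_block = "\n" + "\n".join(f"- {doc}" for doc in documents)
    return (template
            .replace("[DOCUMENTS]", doc_block)
            .replace("[KEYWORDS]", ", ".join(keywords)))


def parse_label(reply, max_words=6):
    """Strip the 'topic:' prefix and report whether the word limit is respected."""
    label = reply.strip()
    if label.lower().startswith("topic:"):
        label = label[len("topic:"):].strip()
    return label, len(label.split()) <= max_words


prompt = fill_prompt(
    TEMPLATE,
    documents=["Abstract one.", "Abstract two."],
    keywords=["cancer", "tumor", "cells"],
)
label, within_limit = parse_label("topic: Cancer Metabolism and Therapy Resistance")
print(label, within_limit)
```

A check like `within_limit` is one way to spot labels that ignore the 6-word instruction before they end up in plots.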

## 🗨️ BERTopic

### Sub-models

#### PubMed Embeddings

The reasoning behind `PrecomputedEmbeddings` class.



The `PrecomputedEmbeddings` class is a workaround for a specific problem I ran into when trying to use BERTopic with our pre-computed embeddings.

1. BERTopic's Default Behavior:
   By default, BERTopic is designed to generate embeddings for documents as part of its pipeline. It typically uses a pre-trained language model (like BERT or Sentence-BERT) to create these embeddings during the fit_transform process.

2. Our Scenario:
   In our case, I already have pre-computed embeddings for our abstracts, generated with PubMedBERT.

3. The Issue:
   When I tried to pass the pre-computed embeddings directly to BERTopic, I encountered errors. This is because BERTopic was still trying to generate new embeddings using its default embedding model, which was set to `None` in our case.

4. The Solution:
   The PrecomputedEmbeddings class acts as a "fake" embedding model for BERTopic. It mimics the interface of a typical embedding model but instead of generating new embeddings, it simply returns the pre-computed embeddings we provide.

Here's how the class works:

```python
class PrecomputedEmbeddings:
    def __init__(self, embeddings):
        self.embeddings = embeddings

    def embed_documents(self, documents, verbose=False):
        return self.embeddings

    def embed_words(self, words, verbose=False):
        return self.embeddings
```

- The `__init__` method stores the pre-computed embeddings.
- The `embed_documents` and `embed_words` methods are required by BERTopic's interface, but they simply return the stored embeddings regardless of the input.

By using this class, we're essentially telling BERTopic: "Don't generate new embeddings. Use these pre-computed ones instead."
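
Because both methods ignore their input, a mismatch between the number of documents and the number of stored vectors would pass silently. A minimal defensive variant, assuming nothing beyond plain Python, could check the lengths before returning anything. `CheckedPrecomputedEmbeddings` is a hypothetical name used only for this sketch, and the toy vectors are made up.

```python
class CheckedPrecomputedEmbeddings:
    """Return stored document vectors, but only when the row count matches."""

    def __init__(self, embeddings):
        self.embeddings = embeddings

    def embed_documents(self, documents, verbose=False):
        # Refuse to hand back vectors that cannot line up with the documents
        if len(documents) != len(self.embeddings):
            raise ValueError(
                f"{len(documents)} documents but {len(self.embeddings)} stored embeddings"
            )
        return self.embeddings


# Toy data: three documents with two-dimensional vectors
toy_model = CheckedPrecomputedEmbeddings([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
print(len(toy_model.embed_documents(["a", "b", "c"])))

try:
    toy_model.embed_documents(["only one"])
except ValueError as err:
    print("rejected:", err)
```

As far as I know, `fit_transform` also accepts an `embeddings` argument, e.g. `topic_model.fit_transform(abstracts, embeddings=embeddings)`, and the custom-backend examples in the BERTopic documentation subclass `bertopic.backend.BaseEmbedder`. Either way, it is worth checking in the training log that no embedding batches are being recomputed.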

In [None]:
class PrecomputedEmbeddings:
    def __init__(self, embeddings):
        self.embeddings = embeddings

    def embed_documents(self, documents, verbose=False):
        return self.embeddings

    def embed_words(self, words, verbose=False):
        return self.embeddings

In [None]:
import numpy as np
embeddings = np.array(df['full_embeddings'].tolist())
if embeddings.ndim == 3:
    embeddings = embeddings.reshape(embeddings.shape[0], -1)

In [None]:
df['full_embeddings'][0].shape

(1, 768)

In [None]:
embeddings.shape

(80656, 768)

In [None]:
# Create an instance of PrecomputedEmbeddings
embedding_model = PrecomputedEmbeddings(embeddings)

#### Dimensionality Reduction

**Pre-computed reduced embeddings**

In [None]:
#reduced embeddings
reduced_embeddings = df[['tsne_2D_x', 'tsne_2D_y']].values

#### Clustering

**The reasoning behind clustering**

**HDBSCAN vs K-Means:** HDBSCAN was selected due to its strong ability to capture structures with varying densities, making it particularly useful for this context. It’s important to note that no clustering model is perfect. For example, K-means allows you to predefine the number of clusters and forces every point into a cluster, meaning no outliers are created. However, this method has its drawbacks. By forcing every point into a cluster, the model is likely to include noise, which can distort topic representations and negatively impact the quality of the clustering.

**Starting Point**: Given the dataset size of ~80k and the desire for fewer topics, a good starting point would be to set `min_cluster_size` to around 800 (1% of your dataset).

**Iterative Approach**: Start with this conservative estimate and then adjust based on the results you get:

- If you get too many topics, increase `min_cluster_size`
- If you get too few topics, decrease `min_cluster_size`

Metabolomics is a broad field, so we might want to lean towards larger clusters (higher `min_cluster_size`) to capture overarching themes.

**Additional Parameters**: Consider setting `min_samples` equal to `min_cluster_size` for more conservative clustering. You can also experiment with `cluster_selection_epsilon` to merge smaller clusters into larger ones.

**Understanding `min_samples` and its relationship with `min_cluster_size`**

**Purpose**: `min_samples` determines how many neighboring points a data point needs to be considered a core point in the clustering process. <br>
**Effect on Clustering**: A higher `min_samples` value leads to more conservative clustering, with fewer but more robust clusters. <br>
**Setting it Equal to `min_cluster_size`**:
- This ensures that every point in a cluster is a core point.
- It results in more clearly defined, robust clusters.
- It reduces the risk of chaining (where disparate clusters get artificially connected).
- It simplifies the parameter tuning process. <br>

**Trade-offs**: While this approach leads to more coherent clusters, it may also result in fewer clusters overall and more points being labeled as noise. <br>

**Flexibility**: You can adjust `min_samples` independently of `min_cluster_size` if needed:
- Lowering it allows for more variation within clusters.
- Raising it makes clustering even more conservative.



For this specific case with ~80,000 metabolomics abstracts, starting with `min_samples` = `min_cluster_size` = 800 is a good conservative approach. This will likely give fewer but very well-defined topics. If this results in too few clusters or too many abstracts being labeled as noise, you can try slightly decreasing `min_samples` while keeping `min_cluster_size` the same. The cell below uses `min_cluster_size=500` and `min_samples=300`.

Remember, the best settings often come through an iterative process of adjusting parameters and evaluating results. The goal is to find clusters (topics) that are meaningful and interpretable in the context of metabolomics research.
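
The iterative loop described above needs two numbers after each run: how many clusters came out and what share of documents ended up as noise (label `-1`). Here is a small sketch, assuming `labels` is the list of cluster labels from a fitted clusterer. `suggest_min_cluster_size` and `summarize_labels` are hypothetical helpers, the toy labels are made up, and the 1% fraction is only the starting heuristic mentioned above.

```python
from collections import Counter


def suggest_min_cluster_size(n_docs, fraction=0.01):
    """Starting value for min_cluster_size as a fraction of the corpus."""
    return max(2, int(n_docs * fraction))


def summarize_labels(labels):
    """Return (number of clusters, share of points labelled as noise)."""
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # -1 is the noise label, not a cluster
    return len(counts), n_noise / len(labels)


# 1% of the 80,656 abstracts in this dataset
print(suggest_min_cluster_size(80656))

toy_labels = [0, 0, 1, 1, 1, -1, 2, 2, -1, 0]
n_clusters, noise_share = summarize_labels(toy_labels)
print(n_clusters, noise_share)
```

Tracking both values across runs makes it easier to see whether a change in `min_cluster_size` or `min_samples` traded topics for noise.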

In [None]:
#from umap import UMAP
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=500, #change this parameter to update number of topics.
                        min_samples=300,
                        metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True) #prediction_data=True ensures that the model can generalize to new documents, assigning them to the appropriate topics as needed.

#### Tokenizer/ Vectorizer

**On Count Vectorizer Model**

Parameters:

**stop_words="english"**: This removes common English stop words (like "the", "a", "an", "in") from the text. Stop words often don't contribute much to the meaning of a topic, so removing them helps focus on more meaningful words.

**min_df=_n_**: This sets the minimum document frequency for a term to be included. A term must appear in at least _n_ documents to be considered. This helps remove very rare words that might not be representative of broader topics.

**ngram_range=(1, 2)**: This allows the vectorizer to consider both unigrams (single words) and bigrams (two-word phrases). This can capture more complex concepts that might be expressed in phrases rather than single words.

How this improves topic modeling:

- Removing stop words helps focus on more meaningful content words, making topics more interpretable and distinct.
- Setting a minimum document frequency (min_df) helps filter out very rare terms that might be noise or very specific to a single document, leading to more general and robust topics.
- Including bigrams allows the model to capture more complex concepts and phrases, which can lead to more nuanced and interpretable topics. For example, instead of just "learning" and "machine" as separate words, it might capture "machine learning" as a single concept.
- By preprocessing the text in this way, the c-TF-IDF step in BERTopic (which uses **CountVectorizer**) can create more meaningful and distinct topic representations.

This preprocessing happens after documents are assigned to topics, so it doesn't influence the clustering process itself. Instead, it improves how topics are represented and described once they're formed.

_Note that I can update the vectorizer model after fitting the BERTopic model. This will allow for better topic representation without retraining the model._
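
To see what the three parameters do, here is a toy re-implementation in plain Python. It is not scikit-learn's code: the tokenizer is a simple whitespace split, the stop-word set is a tiny stand-in for the real English list, and `ngrams` and `vocabulary` are names invented for this sketch.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "of", "is"}  # tiny stand-in list


def ngrams(text, ngram_range=(1, 2)):
    """Lower-case, drop stop words, then emit all n-grams in the given range."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


def vocabulary(docs, min_df=2, ngram_range=(1, 2)):
    """Keep terms that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so each term counts once per document
        doc_freq.update(set(ngrams(doc, ngram_range)))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


docs = [
    "machine learning in the clinic",
    "a machine learning pipeline",
    "learning rates of the pipeline",
]
vocab = vocabulary(docs, min_df=2)
print(vocab)
```

With `min_df=2`, the bigram "machine learning" survives because it appears in two documents, while one-off terms such as "clinic" are dropped.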

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer
vectorizer_model = CountVectorizer(
    stop_words="english",  # This removes common English stop words
    min_df=10,             # A term must appear in at least 10 documents to be considered
    ngram_range=(1, 2)     # Consider both unigrams (single words) and bigrams (two-word phrases)
)

#### Topic Representation: c-TF-IDF

For more information, see the [c-TF-IDF BERTopic documentation](https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html).

In [None]:
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
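
As a rough illustration of what this step computes, the sketch below follows the formula given in the BERTopic documentation: weight = tf × log(1 + A / f), where tf is a term's frequency within a topic, f is its frequency across all topics, and A is the average number of words per topic. With `reduce_frequent_words=True`, the square root of the normalized term frequency is used. This is a simplified pure-Python sketch with made-up counts, not the library's sparse-matrix implementation, and the function name `ctfidf` is my own.

```python
import math


def ctfidf(class_counts, reduce_frequent_words=True):
    """Compute c-TF-IDF-style weights from {topic: {term: count}}."""
    totals = {c: sum(terms.values()) for c, terms in class_counts.items()}
    avg_words = sum(totals.values()) / len(totals)  # A: average words per topic
    term_freq = {}  # f: frequency of each term across all topics
    for terms in class_counts.values():
        for term, count in terms.items():
            term_freq[term] = term_freq.get(term, 0) + count

    weights = {}
    for c, terms in class_counts.items():
        weights[c] = {}
        for term, count in terms.items():
            tf = count / totals[c]  # normalized frequency within the topic
            if reduce_frequent_words:
                tf = math.sqrt(tf)
            weights[c][term] = tf * math.log(1 + avg_words / term_freq[term])
    return weights


counts = {
    "plants": {"plant": 8, "stress": 4, "metabolomics": 8},
    "cancer": {"tumor": 8, "cells": 4, "metabolomics": 8},
}
w = ctfidf(counts)
print(w["plants"]["plant"] > w["plants"]["metabolomics"])
```

In the toy data, "plant" and "metabolomics" are equally frequent inside the plant topic, but "metabolomics" also appears in the other topic, so it receives the lower weight.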

#### Distilled Representation Model & Fine Tuning

Running 50 abstracts through GPT-4o costs approximately $0.90, and processing 100 abstracts costs roughly double that.

I experimented with 4, 50, and 100 abstracts and found no significant differences in topic generation, particularly when comparing 50 versus 100 abstracts.

On the other hand, running 100 abstracts with GPT-4o mini costs only about 10 cents!

In [None]:
from bertopic.representation import KeyBERTInspired, TextGeneration, OpenAI
import openai

# KeyBERT
#keybert = KeyBERTInspired()

# GPT-4o mini
client = openai.OpenAI(api_key=openai_key)
gpt4o = OpenAI(client,
               model="gpt-4o-mini", # switch to "gpt-4o" for the larger model
               delay_in_seconds=2,
               exponential_backoff=True, #retries protocol
               chat=True,
               prompt=openai_prompt,
               nr_docs=100 #number of abstracts included in the [DOCUMENTS]
               )

# Text generation with Llama 3
#llama3_1 = TextGeneration(generator,
#                          prompt=prompt,
#                          nr_docs=5 #number of abstracts included in the [DOCUMENTS]
#                          )

# All representation models
representation_model = {
    #"KeyBERT": keybert,
    "GPT-4o": gpt4o,
    #"Llama3.1": llama3_1,
}

## 🔥 Training

**How the model comes up with the topics**:

The LLM (GPT-4o mini in this notebook) doesn't process all abstracts directly. Instead, it works on a per-topic basis, where topics have already been determined by the clustering algorithm (HDBSCAN in this case).
For each topic, the LLM receives:
- A set of keywords that represent the topic (generated by c-TF-IDF).
- A subset of representative documents from that topic (set by `nr_docs`, 100 in this notebook).

The model then generates a concise label (6 words or fewer, as requested in the prompt) based on this information.
This process is repeated for each topic identified by the clustering algorithm.

The advantage of this approach is that it allows for efficient topic labeling even with large datasets. The LLM doesn't need to process all documents, but instead works with a *distilled representation* of each topic.
However, this also means that the quality of the labels depends heavily on the preceding steps: the quality of the initial embeddings, the effectiveness of the clustering, and the *representativeness of the selected documents and keywords for each topic*.

In [None]:
from bertopic import BERTopic

# Create and train BERTopic model
topic_model = BERTopic(
    embedding_model=embedding_model, # Step 1 - Extract embeddings
    umap_model=None,                 # Step 2 - Reduce dimensionality (None falls back to BERTopic's default UMAP)
    hdbscan_model=hdbscan_model,     # Step 3 - Cluster documents
    vectorizer_model=vectorizer_model, # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,         # Step 5 - Extract topic words
    representation_model=representation_model, # Step 6 - Fine-tune topic representations

    # Hyperparameters
    top_n_words=10,
    verbose=True
)

# Train model using pre-computed embeddings
topics, probs = topic_model.fit_transform(abstracts)

2024-10-09 18:22:10,122 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/2521 [00:00<?, ?it/s]

2024-10-09 18:25:39,288 - BERTopic - Embedding - Completed ✓
2024-10-09 18:25:39,290 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-09 18:26:23,983 - BERTopic - Dimensionality - Completed ✓
2024-10-09 18:26:23,987 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-09 18:26:37,626 - BERTopic - Cluster - Completed ✓
2024-10-09 18:26:37,650 - BERTopic - Representation - Extracting topics from clusters using representation models.
100%|██████████| 21/21 [02:55<00:00,  8.34s/it]
2024-10-09 18:30:02,577 - BERTopic - Representation - Completed ✓


## ☑ Results

### Topics

In [None]:
# Show topics
topic_model.get_topic_info().head()

Unnamed: 0,Topic,Count,Name,Representation,GPT-4o,Representative_Docs
0,-1,14980,-1_exposure_metabolic_study_patients,"[exposure, metabolic, study, patients, metabol...",[Metabolomics in Disease and Treatment],[Background: In numerous studies based predomi...
1,0,19258,0_plant_plants_genes_compounds,"[plant, plants, genes, compounds, stress, grow...",[Plant Stress Response Mechanisms],[To investigate differences in fresh leaves of...
2,1,9730,1_liver_diabetes_insulin_muscle,"[liver, diabetes, insulin, muscle, mice, exerc...",[Metabolic Profiles and Dysregulation],[Fenugreek is a well-known medicinal plant use...
3,2,7254,2_cancer_tumor_cells_cell,"[cancer, tumor, cells, cell, breast, cancer ce...",[Cancer Metabolism and Therapy Resistance],[Cancer cells rewire the metabolic processes b...
4,3,6499,3_data_ms_mass_metabolomics,"[data, ms, mass, metabolomics, sample, method,...",[Metabolomics Data Analysis and Integration],[Although metabolomics data acquisition and an...


**Generate custom labels**

In [None]:
chatgpt_topic_labels = {topic: " | ".join(list(zip(*values))[0]) for topic, values in topic_model.topic_aspects_["GPT-4o"].items()}
chatgpt_topic_labels[-1] = "Outlier Topic"
topic_model.set_topic_labels(chatgpt_topic_labels)
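
The dictionary comprehension above is dense, so here is what it does on a toy input shaped like `topic_model.topic_aspects_["GPT-4o"]`. The `aspects` dictionary is an illustrative stand-in that reuses two labels from this notebook's output. `zip(*values)` transposes the list of `(label, score)` pairs, `[0]` keeps the labels, and `" | ".join` concatenates them.

```python
# Toy stand-in for topic_model.topic_aspects_["GPT-4o"]
aspects = {
    -1: [("Metabolomics in Disease and Treatment", 1)],
    2: [("Cancer Metabolism and Therapy Resistance", 1)],
}

labels = {
    topic: " | ".join(list(zip(*values))[0])
    for topic, values in aspects.items()
}
labels[-1] = "Outlier Topic"  # overwrite the LLM label for the outlier bucket
print(labels)
```

With a single label per topic, the join has nothing to concatenate, so each value is simply the label string.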

In [None]:
# Show topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,CustomName,Representation,GPT-4o,Representative_Docs
0,-1,14980,-1_exposure_metabolic_study_patients,Outlier Topic,"[exposure, metabolic, study, patients, metabol...",[Metabolomics in Disease and Treatment],[Background: In numerous studies based predomi...
1,0,19258,0_plant_plants_genes_compounds,Plant Stress Response Mechanisms,"[plant, plants, genes, compounds, stress, grow...",[Plant Stress Response Mechanisms],[To investigate differences in fresh leaves of...
2,1,9730,1_liver_diabetes_insulin_muscle,Metabolic Profiles and Dysregulation,"[liver, diabetes, insulin, muscle, mice, exerc...",[Metabolic Profiles and Dysregulation],[Fenugreek is a well-known medicinal plant use...
3,2,7254,2_cancer_tumor_cells_cell,Cancer Metabolism and Therapy Resistance,"[cancer, tumor, cells, cell, breast, cancer ce...",[Cancer Metabolism and Therapy Resistance],[Cancer cells rewire the metabolic processes b...
4,3,6499,3_data_ms_mass_metabolomics,Metabolomics Data Analysis and Integration,"[data, ms, mass, metabolomics, sample, method,...",[Metabolomics Data Analysis and Integration],[Although metabolomics data acquisition and an...
5,4,4313,4_gut_microbiota_gut microbiota_microbiome,Gut Microbiota and Metabolomic Interactions,"[gut, microbiota, gut microbiota, microbiome, ...",[Gut Microbiota and Metabolomic Interactions],[Given the high and increasing prevalence of o...
6,5,2723,5_ad_brain_alzheimer_alzheimer disease,Metabolomics in Neurodegenerative Disorders,"[ad, brain, alzheimer, alzheimer disease, pd, ...",[Metabolomics in Neurodegenerative Disorders],[Alzheimer's disease (AD) is the most common c...
7,6,2389,6_fish_zebrafish_exposure_aquatic,Environmental Toxicology and Metabolism,"[fish, zebrafish, exposure, aquatic, exposed, ...",[Environmental Toxicology and Metabolism],[Microplastics (MPs) pollution has been recogn...
8,7,1971,7_milk_meat_dairy_cattle,Metabolomics in Animal Nutrition,"[milk, meat, dairy, cattle, 05, feed, lactatio...",[Metabolomics in Animal Nutrition],[Ruminants account for a relatively large shar...
9,8,1661,8_depression_sleep_circadian_schizophrenia,Microbiota-Gut-Brain Axis Interactions,"[depression, sleep, circadian, schizophrenia, ...",[Microbiota-Gut-Brain Axis Interactions],[Prenatal stress (PS) increases offspring susc...


Here is the topic model for a specific cluster. <br>
`Main` is the c-TF-IDF topic representation, here the top 10 topic words with their scores. `GPT-4o` holds the label produced by the OpenAI representation model (topic distillation).

In [None]:
topic_model.get_topic(2, full=True)

{'Main': [('cancer', 0.37852654333747204),
  ('tumor', 0.312547535360719),
  ('cells', 0.305661962531353),
  ('cell', 0.2660296334483482),
  ('breast', 0.247039976881673),
  ('cancer cells', 0.24132808303516712),
  ('breast cancer', 0.23847543598042675),
  ('tumors', 0.23433026384303857),
  ('mitochondrial', 0.22064409809677088),
  ('patients', 0.2173006254780799)],
 'GPT-4o': [('Cancer Metabolism and Therapy Resistance', 1)]}

### Outlier Reduction - Skipped.

I tried outlier reduction but observed that it tends to contaminate topics, so this step was skipped.

By default, HDBSCAN generates outliers, which is a helpful mechanism for creating accurate topic representations. However, you might want to assign every single document to a topic. We can use `.reduce_outliers` to map some or all outliers to a topic. Here we use the `c-tf-idf` strategy.
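
Conceptually, a threshold-based strategy compares each outlier with every topic and reassigns it only when the best similarity clears the threshold. The sketch below illustrates that idea with cosine similarity on toy vectors. It is not BERTopic's implementation; `cosine`, `reassign_outliers`, the vectors and the threshold of 0.9 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reassign_outliers(doc_vectors, topics, topic_vectors, threshold):
    """Move each -1 document to its most similar topic if similarity >= threshold."""
    new_topics = list(topics)
    for i, (vec, topic) in enumerate(zip(doc_vectors, topics)):
        if topic != -1:
            continue  # only outliers are candidates for reassignment
        best_topic, best_sim = max(
            ((t, cosine(vec, tv)) for t, tv in topic_vectors.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            new_topics[i] = best_topic
    return new_topics


topic_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0]}
doc_vectors = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7], [1.0, 0.05]]
topics = [0, -1, -1, -1]

new_topics = reassign_outliers(doc_vectors, topics, topic_vectors, threshold=0.9)
print(new_topics)
```

The third document sits halfway between the two topics, so it stays an outlier. Lowering the threshold would pull such ambiguous documents into a topic, which is one way the contamination mentioned above can arise.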

In [None]:
# Use the "c-TF-IDF" strategy with a threshold
new_topics = topic_model.reduce_outliers(abstracts,
                                         topics ,
                                         strategy="c-tf-idf",
                                         threshold=0.12)

The outlier reduction strategy above has reduced the outlier documents to 8,261 papers.

In [None]:
print(topics.count(-1))
print(new_topics.count(-1))

18367
8261


**Note:** updating the topics this way may lead to errors if topic reduction or topic merging techniques are used afterwards. When one -1 document is assigned to topic 1 and another -1 document to topic 2, it is unclear how the -1 topic itself should be mapped: to topic 1 or to topic 2?

In [None]:
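# Apply the representation model to rename topics.
# Note: no vectorizer_model is passed here, so update_topics falls back to a default
# CountVectorizer; this is why stop words dominate the representations shown below.
topic_model.update_topics(abstracts,
                          topics=new_topics,
                          representation_model=representation_model)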

100%|██████████| 21/21 [02:14<00:00,  6.38s/it]


In [None]:
# Show updated topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,GPT-4o,Representative_Docs
0,-1,8261,-1_of_the_and_in,"[of, the, and, in, to, is, with, for, that, we]",[Metabolomics in Disease Mechanisms],[Vitamin D is a steroid hormone precursor that...
1,0,19065,0_the_of_and_in,"[the, of, and, in, to, were, that, for, as, by]",[Plant Metabolic Responses to Stress],[Celery is an important leafy vegetable that c...
2,1,9960,1_and_the_of_in,"[and, the, of, in, to, with, were, liver, that...",[Metabolomic Insights into Disease States],[Non-alcoholic steatohepatitis (NASH) is a sev...
3,2,7878,2_the_of_and_to,"[the, of, and, to, for, data, in, ms, is, meta...",[Metabolomics in Disease Diagnosis],[High-resolution mass spectrometry (HRMS)-base...
4,3,7378,3_cancer_of_and_in,"[cancer, of, and, in, the, cells, to, cell, tu...",[Cancer Metabolism and Biomarker Discovery],[Triple-negative breast cancer (TNBC) is the m...
5,4,4892,4_gut_the_microbiota_and,"[gut, the, microbiota, and, of, in, to, microb...",[Gut Microbiota and Metabolism],[Background: Recent evidence suggests that the...
6,5,3092,5_and_the_of_in,"[and, the, of, in, to, with, covid, 19, for, t...",[Metabolomics in COVID-19 Research],[Obese patientss with nonalcoholic steatohepat...
7,6,3054,6_the_and_of_in,"[the, and, of, in, brain, ad, to, disease, wit...",[Metabolic Biomarkers in Neurodegenerative Dis...,[(1) Background: Alzheimer's disease (AD) is a...
8,7,2979,7_the_and_of_in,"[the, and, of, in, to, exposure, were, that, m...",[Environmental Contaminant Impact on Metabolism],[Copper (Cu) is a micronutrient essential for ...
9,8,1858,8_kidney_and_of_renal,"[kidney, and, of, renal, the, in, to, ckd, wit...",[Metabolomics in Kidney Disease Research],[Hepatorenal syndrome (HRS) continues to be on...


In [None]:
topic_model.get_topic(1, full=True)

{'Main': [('the', 0.028779946662807827),
  ('of', 0.02864578100666034),
  ('and', 0.02448622292237831),
  ('to', 0.02432008961573552),
  ('for', 0.0241340138406972),
  ('data', 0.02266133022556849),
  ('in', 0.0201851148834008),
  ('ms', 0.019470664827897718),
  ('is', 0.016727441583856882),
  ('mass', 0.015428174802613718)],
 'GPT-4o': [('Mass Spectrometry-Based Metabolomics Analysis', 1)]}

Update custom names

In [None]:
chatgpt_topic_labels = {topic: " | ".join(list(zip(*values))[0]) for topic, values in topic_model.topic_aspects_["GPT-4o"].items()}
#chatgpt_topic_labels[-1] = "Outlier Topic"
topic_model.set_topic_labels(chatgpt_topic_labels)

In [None]:
topic_model.get_topic_info().head()

Unnamed: 0,Topic,Count,Name,CustomName,Representation,GPT-4o,Representative_Docs
0,-1,2632,-1_of_the_and_in,Metabolic Mechanisms in Disease Conditions,"[of, the, and, in, to, is, with, for, that, we]",[Metabolic Mechanisms in Disease Conditions],[Mycobacterium tuberculosis developed efficien...
1,0,19955,0_the_of_and_in,Metabolomics in Stress Resistance and Adaptation,"[the, of, and, in, to, were, that, for, as, by]",[Metabolomics in Stress Resistance and Adaptat...,[Phenolic compounds are implied in plant-micro...
2,1,8534,1_the_of_and_to,Mass Spectrometry-Based Metabolomics Analysis,"[the, of, and, to, for, data, in, ms, is, mass]",[Mass Spectrometry-Based Metabolomics Analysis],"[Metabolomics, as a part of systems biology, h..."
3,2,7831,2_cancer_cells_of_and,Cancer Metabolism and Biomarkers,"[cancer, cells, of, and, in, the, cell, to, tu...",[Cancer Metabolism and Biomarkers],[Rationale: It has been proposed that cancer s...
4,3,5019,3_gut_microbiota_the_and,Gut Microbiota and Human Health,"[gut, microbiota, the, and, of, in, microbiome...",[Gut Microbiota and Human Health],[Over the last five years an increasing effort...


### GPT-4o Mini

In [None]:
gpt4o_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["GPT-4o"].values()]
topic_model.set_topic_labels(gpt4o_labels)

In [None]:
topic_model.visualize_documents(titles,
                                reduced_embeddings=reduced_embeddings,
                                hide_annotations=True,
                                hide_document_hover=False,
                                custom_labels=True)

Output hidden; open in https://colab.research.google.com to view.

## 🗄 Backup Saving

In [None]:
# Install safetensors if not already installed
!pip install --quiet safetensors==0.4.5

import pickle
from bertopic import BERTopic

In [None]:
# Save the topic model
topic_model.save("topic_model",
                 serialization="safetensors",
                 save_ctfidf=True,
                 #save_embedding_model="meta-llama/Meta-Llama-3.1-8B-Instruct"
                 )

# Save the reduced embeddings
with open('reduced_embeddings.pickle', 'wb') as handle:
    pickle.dump(reduced_embeddings, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Save the representative documents
with open('rep_docs.pickle', 'wb') as handle:
    pickle.dump(topic_model.representative_docs_,
                handle,
                protocol=pickle.HIGHEST_PROTOCOL)

# Save Llama 3.1 labels
#llama3_1_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["Llama3.1"].values()]
#with open('llama2_labels.pickle', 'wb') as handle:
#    pickle.dump(llama3_1_labels, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Save GPT-4o labels
gpt4o_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["GPT-4o"].values()]
with open('gpt4o_labels.pickle', 'wb') as handle:
    pickle.dump(gpt4o_labels, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Optional: Zip the saved files
#!zip -r /content/topic_model.zip /content/topic_model reduced_embeddings.pickle rep_docs.pickle llama2_labels.pickle gpt4o_labels.pickle

In [None]:
# To load the saved components later:
# loaded_model = BERTopic.load("topic_model")  # same path as used in topic_model.save above
# with open('reduced_embeddings.pickle', 'rb') as handle:
#     loaded_reduced_embeddings = pickle.load(handle)
# with open('rep_docs.pickle', 'rb') as handle:
#     loaded_rep_docs = pickle.load(handle)
# with open('llama2_labels.pickle', 'rb') as handle:
#     loaded_llama2_labels = pickle.load(handle)
# with open('gpt4o_labels.pickle', 'rb') as handle:
#     loaded_gpt4o_labels = pickle.load(handle)

# To recreate visualizations:
# topic_model.set_topic_labels(loaded_llama2_labels)  # For Llama 2 visualization
# topic_model.visualize_documents(titles,
#                                 reduced_embeddings=loaded_reduced_embeddings,
#                                 hide_annotations=True,
#                                 hide_document_hover=False,
#                                 custom_labels=True)

# topic_model.set_topic_labels(loaded_gpt4o_labels)  # For GPT-4o visualization
# topic_model.visualize_documents(titles,
#                                 reduced_embeddings=loaded_reduced_embeddings,
#                                 hide_annotations=True,
#                                 hide_document_hover=False,
#                                 custom_labels=True)

In [None]:
!apt-get install --quiet texlive-xetex pandoc

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  dvisvgm fonts-droid-fallback fonts-lato fonts-lmodern fonts-noto-mono fonts-texgyre
  fonts-urw-base35 libapache-pom-java libcmark-gfm-extensions0.29.0.gfm.3 libcmark-gfm0.29.0.gfm.3
  libcommons-logging-java libcommons-parent-java libfontbox-java libfontenc1 libgs9 libgs9-common
  libidn12 libijs-0.35 libjbig2dec0 libkpathsea6 libpdfbox-java libptexenc1 libruby3.0 libsynctex2
  libteckit0 libtexlua53 libtexluajit2 libwoff1 libzzip-0-13 lmodern pandoc-data poppler-data
  preview-latex-style rake ruby ruby-net-telnet ruby-rubygems ruby-webrick ruby-xmlrpc ruby3.0
  rubygems-integration t1utils teckit tex-common tex-gyre texlive-base texlive-binaries
  texlive-fonts-recommended texlive-latex-base texlive-latex-extra texlive-latex-recommended
  texlive-pictures texlive-plain-generic tipa xfonts-encodings xfonts-utils
Suggested packages:
  fonts-noto font

In [None]:
# README: copy this notebook into the session folder (/content) first, otherwise this cell cannot find it.
import os
os.chdir('/content')
!jupyter nbconvert --to pdf "topic-modelling.ipynb"

[NbConvertApp] Converting notebook topic-modelling.ipynb to pdf
  ((*- endblock -*))
[NbConvertApp] Writing 104052 bytes to notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', 'notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: ['bibtex', 'notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 126592 bytes to topic-modelling.pdf
