<a href="https://colab.research.google.com/github/itsdivya1309/Machine-Learning/blob/main/LLMs/Text%20Clustering%20and%20Topic%20Modeling/Topic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling

**Topic modeling** is the task of finding themes, or latent topics, in a collection of text. Traditionally, it involves finding a set of keywords or phrases that best represent and capture the meaning of each topic.

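## arXiv Articles

We will try to find topics in arXiv articles from the quantitative biology (`q-bio`) categories listed below. Ten categories are listed; the filtering code further down matches the first nine and does not include `q-bio.TO`.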

- `q-bio.BM`: Biomolecules
- `q-bio.CB`: Cell Behavior
- `q-bio.GN`: Genomics
- `q-bio.MN`: Molecular Networks
- `q-bio.NC`: Neurons and Cognition
- `q-bio.OT`: Other Quantitative Biology
- `q-bio.PE`: Populations and Evolution
- `q-bio.QM`: Quantitative Methods
- `q-bio.SC`: Subcellular Processes
- `q-bio.TO`: Tissues and Organs

In [None]:
! pip install kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json



In [None]:
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content
 99% 1.40G/1.40G [00:21<00:00, 63.3MB/s]
100% 1.40G/1.40G [00:21<00:00, 70.3MB/s]


In [None]:
!unzip arxiv.zip -d arxiv_data

Archive:  arxiv.zip
  inflating: arxiv_data/arxiv-metadata-oai-snapshot.json  


In [None]:
import numpy as np
import pandas as pd

In [None]:
import json

# Load JSON file
json_file = "arxiv_data/arxiv-metadata-oai-snapshot.json"

# The categories of interest
bio_categories = {
    'q-bio.BM', 'q-bio.CB', 'q-bio.GN', 'q-bio.MN',
    'q-bio.NC', 'q-bio.OT', 'q-bio.PE', 'q-bio.QM', 'q-bio.SC'
}

# Prepare a list to store filtered data
filtered_data = []

# Process the file line by line
with open(json_file, "r") as f:
    for line in f:
        paper = json.loads(line)  # Parse JSON line

        # `categories` is a space-separated string; match on whole category codes
        if bio_categories.intersection(paper['categories'].split()):
            # Extract only the required fields
            filtered_data.append({
                "title": paper.get("title", ""),
                "authors": paper.get("authors", ""),
                "abstract": paper.get("abstract", ""),
                "doi": paper.get("doi", "N/A")  # "N/A" only if the key is absent; a null DOI stays None
            })
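
The filtering step can be tried on a small in-memory sample before running it over the full snapshot. The sketch below assumes that `categories` is a single space-separated string, as in the records above. The two sample records and the helper name `matches` are hypothetical and only illustrate the matching logic.

```python
import io
import json

# Hypothetical two-record sample in the same JSON Lines layout as the snapshot
sample = io.StringIO(
    '{"title": "Paper A", "categories": "q-bio.BM physics.bio-ph"}\n'
    '{"title": "Paper B", "categories": "cs.LG stat.ML"}\n'
)

targets = {"q-bio.BM", "q-bio.GN"}

def matches(record, targets):
    """Return True if any whitespace-separated category of the record is a target."""
    return not targets.isdisjoint(record.get("categories", "").split())

# Parse one JSON object per line and keep the titles of matching records
kept = [rec["title"] for rec in map(json.loads, sample) if matches(rec, targets)]
print(kept)
```

Splitting the string first means a category is only matched as a whole code, never as a fragment of a longer one.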

In [None]:
# Convert to a DataFrame
papers = pd.DataFrame(filtered_data)

# Display the first few rows
papers.head()

| | title | authors | abstract | doi |
|---|---|---|---|---|
| 0 | Molecular Synchronization Waves in Arrays of A... | Vanessa Casagrande, Yuichi Togashi, Alexander ... | Spatiotemporal pattern formation in a produc... | 10.1103/PhysRevLett.99.048301 |
| 1 | Origin of adaptive mutants: a quantum measurem... | Vasily Ogryzko | This is a supplement to the paper arXiv:q-bi... | |
| 2 | A remark on the number of steady states in a m... | Liming Wang and Eduardo D. Sontag | The multisite phosphorylation-dephosphorylat... | |
| 3 | Complexities of Human Promoter Sequences | Fangcui Zhao, Huijie Yang, and Binghong Wang | By means of the diffusion entropy approach, ... | 10.1016/j.jtbi.2007.03.035 |
| 4 | Intricate Knots in Proteins: Function and Evol... | Peter Virnau (1), Leonid A. Mirny (1,2), Mehra... | A number of recently discovered protein stru... | |


In [None]:
# Extract metadata
abstracts = papers['abstract']
titles = papers['title']
abstracts.shape

(45847,)

## BERTopic: a Modular Topic Modeling Framework

BERTopic is a topic modeling technique that clusters semantically similar texts and then extracts various types of topic representations from those clusters.

It essentially has two components:

1. **Text Clustering**
    * Embed the documents using a sentence-transformer model or any other embedding model.
    * Reduce the dimensionality of the embeddings using any dimensionality reduction technique.
    * Cluster the reduced embeddings into groups of semantically similar documents.

    ```
    SBERT -----> UMAP -----> HDBSCAN
    ```

2. **Topic Modeling**
    * The second part of BERTopic's pipeline represents each topic by computing the weight of a term `x` in a class (cluster) `c`.
    * Word frequencies are counted over the entire cluster rather than within a single document.
    * Each cluster is treated as a single document, and class-based TF-IDF (c-TF-IDF) is computed for every word. This gives more weight to words that are distinctive for a cluster and less weight to common words and stop words.

    ```
    CountVectorizer -----> c-TF-IDF
    ```
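
To make the weighting step concrete, here is a minimal sketch of c-TF-IDF on two toy clusters. It assumes the commonly cited form of the weight, `tf(x, c) * log(1 + A / f(x))`, where `tf(x, c)` is the normalized frequency of term `x` in cluster `c`, `f(x)` is the frequency of `x` across all clusters, and `A` is the average number of words per cluster. The toy texts and the function name `c_tf_idf` are made up for illustration, and BERTopic's own implementation may differ in its normalization details.

```python
import math
from collections import Counter

# Hypothetical toy clusters: the documents of each cluster are joined into one "class document"
clusters = {
    0: "protein folding protein energy the the",
    1: "predator prey population the the the",
}

# Term counts per cluster and across all clusters
counts = {c: Counter(text.split()) for c, text in clusters.items()}
total_per_term = Counter()
for cluster_counts in counts.values():
    total_per_term.update(cluster_counts)

# A: average number of words per cluster
avg_words = sum(sum(cc.values()) for cc in counts.values()) / len(counts)

def c_tf_idf(term, c):
    """Weight of `term` in cluster `c` under the assumed formula."""
    tf = counts[c][term] / sum(counts[c].values())
    idf = math.log(1 + avg_words / total_per_term[term])
    return tf * idf

for term in ("protein", "the"):
    print(term, round(c_tf_idf(term, 0), 3))
```

In cluster 0, "protein" and "the" occur equally often, but "the" also occurs in the other cluster, so its weight comes out lower.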

In [None]:
# Embedding model
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("thenlper/gte-small")

embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
print(embeddings.shape)

(45847, 384)
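
Each abstract is now a 384-dimensional vector, and abstracts with similar meaning should have vectors that point in similar directions. The sketch below shows cosine similarity on hypothetical 3-dimensional vectors that stand in for rows of `embeddings`; the values are invented for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors: doc_a and doc_b are close in direction, doc_c is not
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]
doc_c = [0.0, 0.1, 0.9]

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(f"a~b: {sim_ab:.2f}, a~c: {sim_ac:.2f}")
```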


In [None]:
! pip install bertopic


### Using BERTopic with default models

In [None]:
! pip install umap-learn



In [None]:
import umap

# UMAP model to reduce 384 dimensions to 9
umap_model = umap.UMAP(n_components=9, min_dist=0.1, metric='cosine', random_state=42)

# Fit and transform the embeddings
reduced_embeddings = umap_model.fit_transform(embeddings)



In [None]:
# Form clusters using HDBSCAN
from sklearn.cluster import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=60, metric='euclidean', cluster_selection_method='eom')

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

In [None]:
from bertopic import BERTopic

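# Train the model with the sub-models defined above
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    verbose=True)
# Pass the original 384-dimensional embeddings: BERTopic applies `umap_model`
# itself, so passing `reduced_embeddings` would reduce the vectors twice
topic_model.fit(abstracts, embeddings)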

2025-03-04 13:37:40,619 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-03-04 13:40:54,364 - BERTopic - Dimensionality - Completed ✓
2025-03-04 13:40:54,367 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-03-04 13:41:26,343 - BERTopic - Cluster - Completed ✓
2025-03-04 13:41:26,355 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-03-04 13:41:48,064 - BERTopic - Representation - Completed ✓


<bertopic._bertopic.BERTopic at 0x7f9e091c5a50>

In [None]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 427 | -1_model_protein_vaccination_drug | [model, protein, vaccination, drug, disease, e... | [ The rapid emergence and the disastrous impa... |
| 1 | 0 | 25847 | 0_data_model_brain_network | [data, model, brain, network, networks, cell, ... | [ Alzheimer's disease (AD) is an irreversible... |
| 2 | 1 | 2937 | 1_tree_species_phylogenetic_trees | [tree, species, phylogenetic, trees, fitness, ... | [ The presence of reticulate evolutionary eve... |
| 3 | 2 | 1264 | 2_population_species_prey_predator | [population, species, prey, predator, model, e... | [ Antipredator behaviour is a self-preservati... |
| 4 | 3 | 1144 | 3_covid_19_covid 19_pandemic | [covid, 19, covid 19, pandemic, cov, sars, sar... | [ To combat the coronavirus disease 2019 (COV... |
| 5 | 4 | 882 | 4_protein_molecular_prediction_drug | [protein, molecular, prediction, drug, learnin... | [ Understanding the relationships between pro... |
| 6 | 5 | 847 | 5_folding_protein_proteins_energy | [folding, protein, proteins, energy, native, d... | [ Understanding how monomeric proteins fold u... |
| 7 | 6 | 812 | 6_cooperation_game_games_evolutionary | [cooperation, game, games, evolutionary, playe... | [ Cooperative behaviour has been extensively ... |
| 8 | 7 | 781 | 7_protein_molecules_molecular_drug | [protein, molecules, molecular, drug, design, ... | [ The novel nature of SARS-CoV-2 calls for th... |
| 9 | 8 | 781 | 8_species_population_model_fish | [species, population, model, fish, models, eco... | [ Quantifying population dynamics is a fundam... |
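
Topic `-1` collects the documents that the clustering step left unassigned (outliers). A quick way to judge how balanced the clustering is, sketched below, is to turn the counts into shares of the corpus. The numbers are copied from the first rows of the table above and from the corpus size of 45847 abstracts; the variable names are only illustrative.

```python
# Counts copied from the topic table above (first three rows only)
total_docs = 45847
topic_counts = {-1: 427, 0: 25847, 1: 2937}

# Share of the corpus that falls into each listed topic
shares = {topic: n / total_docs for topic, n in topic_counts.items()}
for topic, share in shares.items():
    print(f"topic {topic}: {share:.1%}")
```

Topic 0 alone holds more than half of the abstracts, which suggests that it is a broad catch-all cluster rather than a single coherent theme.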


In [None]:
topic_model.get_topic(48)

[('protein', 0.025490549007796955),
 ('structures', 0.0177940824125552),
 ('amyloid', 0.016970138543439515),
 ('proteins', 0.015194229722274665),
 ('molecular', 0.015018478010263639),
 ('structure', 0.013633411786610398),
 ('amyloidogenic', 0.013425275131773912),
 ('dna', 0.01308861346705917),
 ('rna', 0.012702318934897011),
 ('target', 0.011855808711827315)]
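
`get_topic` returns `(word, score)` pairs sorted by score. Judging from the `Name` column in the topic table, a topic's name looks like the topic id followed by its top four words joined with underscores. The sketch below rebuilds such a label by hand; `topic_label` is a hypothetical helper rather than a BERTopic function, and the scores are rounded copies of the output above.

```python
def topic_label(topic_id, word_scores, n_words=4):
    """Join the topic id and its top `n_words` keywords with underscores."""
    words = [word for word, _score in word_scores[:n_words]]
    return "_".join([str(topic_id), *words])

# Rounded copy of the first entries returned for topic 48
topic_48 = [
    ("protein", 0.0255),
    ("structures", 0.0178),
    ("amyloid", 0.0170),
    ("proteins", 0.0152),
    ("molecular", 0.0150),
]

label = topic_label(48, topic_48)
print(label)
```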

In [None]:
# Visualize barchart with ranked keywords
topic_model.visualize_barchart(top_n_topics=50)  # Adjust `top_n_topics` as needed

In [None]:
# Visualize relationships between topics
topic_model.visualize_heatmap(n_clusters=10)

In [None]:
# Visualize the potential hierarchical structure of topics
topic_model.visualize_hierarchy()

In [None]:
# Reduce the embeddings to 2 dimensions for plotting. Precomputing them is optional,
# but it makes repeated calls to the visualization much faster
reduced_embeddings = umap.UMAP(n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

In [None]:
fig = topic_model.visualize_documents(
    titles,
    reduced_embeddings=reduced_embeddings,
    width=1200,
    hide_annotations=True)

# Update fonts of legend for easier visualization
fig.update_layout(font=dict(size=16))

In [None]:
topic_model.visualize_topics()

## Fine-tuning the topic representations

We can fine-tune the topic representations by reranking the initial distribution of words for each topic. In BERTopic, the reranker models are called representation models.

We are using two reranker models here:

1. **KeyBERTInspired**
    * Inspired by KeyBERT, a model that extracts keywords from texts by comparing the cosine similarity between word embeddings and document embeddings.
    * KeyBERTInspired first selects the most representative documents per topic by calculating the similarity between each document's c-TF-IDF values and those of the topic it belongs to.
    * The average embedding of these documents is then compared with the embeddings of the candidate keywords to rerank the keywords.

2. **Maximal marginal relevance (MMR)**
    * Diversifies our topic representations.
    * Finds a set of keywords that are diverse from one another but still relate to the documents they are compared to.

In [None]:
# Save original representations
from copy import deepcopy
original_topics = deepcopy(topic_model.topic_representations_)

In [None]:
def topic_differences(model, original_topics, nr_topics=5):
    """Compare the saved original topic representations with the model's current ones."""
    df = pd.DataFrame(columns=["Topic", "Original", "Updated"])
    for topic in range(nr_topics):
        # Extract top 5 words per topic per model
        og_words = " | ".join(list(zip(*original_topics[topic]))[0][:5])
        new_words = " | ".join(list(zip(*model.get_topic(topic)))[0][:5])
        df.loc[len(df)] = [topic, og_words, new_words]
    return df
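
The helper relies on the `zip(*pairs)` idiom to split a list of `(word, score)` pairs into a tuple of words and a tuple of scores. A small standalone sketch with invented scores shows what that expression produces:

```python
# Hypothetical (word, score) pairs in the shape returned by `get_topic`
pairs = [("tree", 0.031), ("species", 0.027), ("phylogenetic", 0.024)]

# zip(*pairs) transposes the list: first all words, then all scores
words, scores = zip(*pairs)

# Keep the top two words and join them the way the helper does
top_two = " | ".join(words[:2])
print(top_two)
```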

In [None]:
from bertopic.representation import KeyBERTInspired

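# Update our topic representations using KeyBERTInspired; pass `vectorizer_model` again,
# since `update_topics` otherwise falls back to a default CountVectorizer (no stop-word removal)
representation_model = KeyBERTInspired()
topic_model.update_topics(abstracts, vectorizer_model=vectorizer_model,
                          representation_model=representation_model)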

In [None]:
# Show topic differences
topic_differences(topic_model, original_topics)

| | Topic | Original | Updated |
|---|---|---|---|
| 0 | 0 | data \| model \| brain \| network \| networks | neurons \| neural \| cells \| brain \| cell |
| 1 | 1 | tree \| species \| phylogenetic \| trees \| fitness | phylogenetic \| evolutionary \| evolution \| netw... |
| 2 | 2 | population \| species \| prey \| predator \| model | extinction \| dispersal \| coexistence \| stochas... |
| 3 | 3 | covid \| 19 \| covid 19 \| pandemic \| cov | epidemic \| pandemic \| covid \| coronavirus \| ou... |
| 4 | 4 | protein \| molecular \| prediction \| drug \| lear... | predicting \| proteins \| prediction \| protein \|... |


In [None]:
from bertopic.representation import MaximalMarginalRelevance

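# Update our topic representations using MaximalMarginalRelevance; without `vectorizer_model`,
# `update_topics` uses a default CountVectorizer, which lets stop words such as "to" and "we" through
representation_model = MaximalMarginalRelevance(diversity=0.2)
topic_model.update_topics(abstracts, vectorizer_model=vectorizer_model,
                          representation_model=representation_model)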

In [None]:
# Show topic differences
topic_differences(topic_model, original_topics)

| | Topic | Original | Updated |
|---|---|---|---|
| 0 | 0 | data \| model \| brain \| network \| networks | to \| we \| for \| data \| model |
| 1 | 1 | tree \| species \| phylogenetic \| trees \| fitness | tree \| species \| phylogenetic \| trees \| fitness |
| 2 | 2 | population \| species \| prey \| predator \| model | population \| species \| prey \| predator \| extin... |
| 3 | 3 | covid \| 19 \| covid 19 \| pandemic \| cov | covid \| pandemic \| sars \| epidemic \| spread |
| 4 | 4 | protein \| molecular \| prediction \| drug \| lear... | protein \| molecular \| prediction \| learning \| ... |
