# 🤖 Notebook 3: BERTopic Training

## 1. Setup and GPU Check

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pavannn16/BERTopic-arXiv-Analysis/blob/main/notebooks/03_topic_modeling.ipynb)

---

**Purpose:** Train a BERTopic model on the processed arXiv cs.AI abstracts, using the GPU to compute document embeddings.

⚠️ **GPU Required:** Runtime → Change runtime type → GPU

**Time:** ~2 minutes with GPU

In [25]:
# Check GPU availability
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('⚠️ Not connected to a GPU!')
    print('Go to Runtime > Change runtime type > GPU')
else:
    print('✅ GPU is available!')
    print(gpu_info)

✅ GPU is available!
Tue Dec  2 03:29:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             49W /  400W |    2857MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
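
The `!nvidia-smi` line above is IPython shell syntax, so it only works inside a notebook. A minimal sketch of an equivalent check in plain Python follows, using only the standard library. The helper name `gpu_visible` is hypothetical and not part of this notebook.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # driver tools are not installed or not on PATH
    try:
        result = subprocess.run(
            [exe], capture_output=True, text=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```

Checking the exit code avoids searching the output for the word `failed`, which depends on the exact wording of the error message.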
                            

In [26]:
# Check RAM (psutil's `.total` is the machine's total memory, not the amount currently free)
import psutil

ram_gb = psutil.virtual_memory().total / 1e9
print(f'Available RAM: {ram_gb:.1f} GB')

if ram_gb < 12:
    print('⚠️ Consider enabling High-RAM runtime for large datasets')

Available RAM: 89.6 GB


In [27]:
# Install required packages
%pip install bertopic sentence-transformers umap-learn hdbscan plotly safetensors -q

In [28]:
# ============================================================
# PROJECT PATH SETUP - Works on Colab Web, VS Code, or Local
# ============================================================

import os
from pathlib import Path

# Detect environment and set project path
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_PATH = '/content/drive/MyDrive/BERTopic-arXiv-Analysis'
    print("✅ Running on Google Colab")
else:
    PROJECT_PATH = str(Path(os.getcwd()).parent) if 'notebooks' in os.getcwd() else os.getcwd()
    print("✅ Running locally")

print(f"📁 Project path: {PROJECT_PATH}")

Project path: /content
Path exists: True
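
The local branch of the cell above moves up one directory whenever the string `notebooks` appears anywhere in the working directory path. A slightly stricter sketch is shown below, assuming the notebook lives in a folder named exactly `notebooks`. The helper `resolve_project_root` is hypothetical.

```python
from pathlib import Path


def resolve_project_root(cwd: Path) -> Path:
    """Return the parent directory if cwd is a `notebooks` folder, else cwd."""
    return cwd.parent if cwd.name == "notebooks" else cwd


print(resolve_project_root(Path("/home/user/project/notebooks")))
print(resolve_project_root(Path("/home/user/project")))
```

Comparing only the last path component means a directory such as `notebooks_old` higher up in the path does not trigger the parent lookup.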


In [29]:
# Import libraries
import pandas as pd
import numpy as np
import json
import os
from tqdm import tqdm

# BERTopic components
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Visualization
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'  # Works in VS Code

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load Processed Data

In [30]:
# Load processed data
df = pd.read_csv(f"{PROJECT_PATH}/data/processed/arxiv_cs_ai_processed.csv")

print(f"Loaded {len(df)} documents")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")

# Extract documents for topic modeling
documents = df['text'].tolist()
print(f"\nSample document:")
print(documents[0][:300] + "...")

Loaded 19898 documents
Date range: 2025-07-01 to 2025-12-01

Sample document:
Foundation Priors. Foundation models, and in particular large language models, can generate highly informative responses, prompting growing interest in using these ''synthetic'' outputs as data in empirical research and decision-making. This paper introduces the idea of a foundation prior, which sho...


## 3. Compute Embeddings (GPU Accelerated)

In [31]:
# Configuration
EMBEDDING_MODEL = "all-mpnet-base-v2"  # High quality, ~110M params
# Alternative: "all-MiniLM-L6-v2"      # Faster, ~22M params

BATCH_SIZE = 64  # Adjust based on GPU memory

print(f"Embedding model: {EMBEDDING_MODEL}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Documents to embed: {len(documents)}")

Embedding model: all-mpnet-base-v2
Batch size: 64
Documents to embed: 19898


In [32]:
# Check if embeddings already exist
embeddings_path = f"{PROJECT_PATH}/data/embeddings/embeddings.npy"

if os.path.exists(embeddings_path):
    print("Loading pre-computed embeddings...")
    embeddings = np.load(embeddings_path)
    print(f"Loaded embeddings shape: {embeddings.shape}")
else:
    print("Computing embeddings (this may take 5-15 minutes)...")
    
    # Load embedding model
    embedding_model = SentenceTransformer(EMBEDDING_MODEL)
    
    # Detect available device
    import torch
    if torch.cuda.is_available():
        device = 'cuda'
    elif torch.backends.mps.is_available():
        device = 'mps'  # Apple Silicon
    else:
        device = 'cpu'
    print(f"Using device: {device}")
    
    # Compute embeddings
    embeddings = embedding_model.encode(
        documents,
        batch_size=BATCH_SIZE,
        show_progress_bar=True,
        convert_to_numpy=True,
        device=device
    )
    
    # Save embeddings
    os.makedirs(os.path.dirname(embeddings_path), exist_ok=True)
    np.save(embeddings_path, embeddings)
    print(f"\nEmbeddings saved to {embeddings_path}")
    print(f"Shape: {embeddings.shape}")

Computing embeddings (this may take 5-15 minutes)...
Using device: cuda


Batches:   0%|          | 0/311 [00:00<?, ?it/s]


Embeddings saved to /content/data/embeddings/embeddings.npy
Shape: (19898, 768)
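
The numbers in this output can be cross-checked with a little arithmetic. The sketch below reproduces the 311 batches shown in the progress bar and estimates the size of the saved array, assuming the embeddings are stored as 32-bit floats (an assumption; the notebook does not print the dtype).

```python
import math

n_docs = 19898       # documents reported by the loading cell
batch_size = 64      # BATCH_SIZE from the configuration cell
dim = 768            # embedding width reported in the saved shape
bytes_per_value = 4  # assumption: float32 embeddings

n_batches = math.ceil(n_docs / batch_size)
size_mb = n_docs * dim * bytes_per_value / 1e6

print(f"Batches: {n_batches}")
print(f"Approximate array size: {size_mb:.1f} MB")
```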


## 4. Configure BERTopic Components

In [33]:
# UMAP for dimensionality reduction
# Parameters tuned for topic modeling:
#   - n_neighbors: local structure (15-50 for topics)
#   - n_components: dimensions for clustering (5-15)
#   - min_dist: how tightly points cluster (0.0 for cleaner clusters)
#   - metric: cosine works well with embeddings

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42,
    low_memory=False
)

print("UMAP configured:")
print(f"  n_neighbors: 15")
print(f"  n_components: 5")
print(f"  min_dist: 0.0")
print(f"  metric: cosine")

UMAP configured:
  n_neighbors: 15
  n_components: 5
  min_dist: 0.0
  metric: cosine
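
One reason cosine is a common choice for sentence embeddings is that it compares direction and ignores vector length. The toy sketch below illustrates this with made-up three-dimensional vectors; the function `cosine_distance` is written for this example and is not UMAP's internal implementation.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]    # same direction, ten times longer
orthogonal = [3.0, 0.0, -1.0]  # dot product with v is zero

print(cosine_distance(v, scaled))      # close to 0: same direction
print(cosine_distance(v, orthogonal))  # close to 1: unrelated direction
print(math.dist(v, scaled))            # Euclidean distance is large
```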


In [34]:
# HDBSCAN for clustering
# Parameters:
#   - min_cluster_size: minimum documents per topic (adjust based on dataset size)
#   - min_samples: core point neighborhood (affects noise detection)
#   - cluster_selection_method: 'eom' for variable-sized clusters

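# Rule of thumb: for ~3,000-5,000 documents, try min_cluster_size=15-30.
# This corpus has ~20,000 documents, so larger values are worth trying if topics look too fine-grained.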
MIN_CLUSTER_SIZE = 20

hdbscan_model = HDBSCAN(
    min_cluster_size=MIN_CLUSTER_SIZE,
    min_samples=10,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True
)

print("HDBSCAN configured:")
print(f"  min_cluster_size: {MIN_CLUSTER_SIZE}")
print(f"  min_samples: 10")
print(f"  cluster_selection_method: eom")

HDBSCAN configured:
  min_cluster_size: 20
  min_samples: 10
  cluster_selection_method: eom


In [35]:
# CountVectorizer for c-TF-IDF topic representation
# Use unigrams and bigrams (1-2) to capture two-word phrases like "language model"

vectorizer_model = CountVectorizer(
    ngram_range=(1, 2),
    stop_words='english',
    min_df=5,       # Keep terms found in at least 5 documents (BERTopic fits this on one merged document per topic)
    max_df=0.95     # Drop terms found in more than 95% of those documents
)

print("CountVectorizer configured:")
print(f"  ngram_range: (1, 2)")
print(f"  stop_words: english")
print(f"  min_df: 5")
print(f"  max_df: 0.95")

CountVectorizer configured:
  ngram_range: (1, 2)
  stop_words: english
  min_df: 5
  max_df: 0.95
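
To make `ngram_range=(1, 2)` concrete, the sketch below lists the unigrams and bigrams of a short phrase in plain Python. It is a simplified stand-in for `CountVectorizer` (whitespace tokenization only, no stop-word removal), and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` for n in the inclusive range."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


grams = word_ngrams("large language model")
print(grams)
# → ['large', 'language', 'model', 'large language', 'language model']
```

The full three-word phrase is not in the list. Capturing it would require `ngram_range=(1, 3)`, at the cost of a larger vocabulary.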


## 5. Train BERTopic Model

In [36]:
# Build BERTopic model with custom components
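# Note: Embeddings are pre-computed and passed to fit_transform, so no embedding_model is set here.
# To call transform() on new text later, set embedding_model=EMBEDDING_MODEL so new documents use the same encoder.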

topic_model = BERTopic(
    # Use our custom components
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    
    # Topic representation settings
    top_n_words=10,           # Top words per topic
    nr_topics=None,           # Auto-detect number of topics (or set to specific number)
    
    # Additional settings
    calculate_probabilities=True,
    verbose=True
)

print("BERTopic model configured!")

BERTopic model configured!


In [37]:
%%time
# Fit the model (this may take a few minutes)
print(f"Training BERTopic on {len(documents)} documents...")
print("This typically takes 2-5 minutes on GPU...\n")

topics, probs = topic_model.fit_transform(documents, embeddings=embeddings)

print("\n" + "="*50)
print("TRAINING COMPLETE!")
print("="*50)

2025-12-02 03:30:58,806 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


Training BERTopic on 19898 documents...
This typically takes 2-5 minutes on GPU...



2025-12-02 03:31:20,324 - BERTopic - Dimensionality - Completed ✓
2025-12-02 03:31:20,325 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-02 03:31:39,773 - BERTopic - Cluster - Completed ✓
2025-12-02 03:31:39,780 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-02 03:31:46,359 - BERTopic - Representation - Completed ✓



TRAINING COMPLETE!
CPU times: user 56.4 s, sys: 357 ms, total: 56.7 s
Wall time: 50.9 s
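
The "Representation" step in the log is where topic keywords are scored with c-TF-IDF. As a rough illustration, the sketch below applies the formula from BERTopic's documentation, W(t, c) = tf(t, c) · log(1 + A / f(t)), to a toy corpus, where tf is the term frequency within a topic, f(t) is the term frequency across all topics, and A is the average number of tokens per topic. This is a simplified version written for this example, not BERTopic's implementation, and the toy tokens are invented.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Score terms per class: normalized tf * log(1 + A / f_t)."""
    counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    # f_t: frequency of each term across all classes
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of tokens per class
    avg_tokens = sum(total.values()) / len(counts)
    scores = {}
    for c, counter in counts.items():
        n_tokens = sum(counter.values())
        scores[c] = {
            term: (freq / n_tokens) * math.log(1 + avg_tokens / total[term])
            for term, freq in counter.items()
        }
    return scores


# Toy example: each "class" holds all tokens of one topic, concatenated.
toy = {
    "imaging": ["mri", "segmentation", "model", "mri"],
    "diffusion": ["diffusion", "image", "model", "diffusion"],
}
scores = c_tf_idf(toy)
for topic, term_scores in scores.items():
    best = max(term_scores, key=term_scores.get)
    print(topic, "->", best)
```

The shared word `model` scores lowest in both topics because it appears everywhere, which is the behavior that keeps generic terms out of the topic labels.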


## 6. Explore Topics

In [38]:
# Get topic info
topic_info = topic_model.get_topic_info()

n_topics = int((topic_info['Topic'] != -1).sum())  # Count topics, skipping the -1 outlier row if present
n_outliers = (np.array(topics) == -1).sum()
outlier_pct = 100 * n_outliers / len(topics)

print(f"Number of topics discovered: {n_topics}")
print(f"Outliers: {n_outliers} ({outlier_pct:.1f}%)")
print(f"Documents assigned to topics: {len(topics) - n_outliers}")

print("\nTop 15 topics by size:")
topic_info.head(16)

Number of topics discovered: 159
Outliers: 5146 (25.9%)
Documents assigned to topics: 14752

Top 15 topics by size:


| Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|
| -1 | 5146 | -1_language_llms_reasoning_agents | [language, llms, reasoning, agents, ai, langua... | [Enhancing Vision-Language Model Training with... |
| 0 | 796 | 0_segmentation_medical_clinical_imaging | [segmentation, medical, clinical, imaging, ima... | [LesiOnTime -- Joint Temporal and Clinical Mod... |
| 1 | 578 | 1_diffusion_image_generation_video | [diffusion, image, generation, video, diffusio... | [Personalized Image Editing in Text-to-Image D... |
| 2 | 539 | 2_reasoning_rlvr_reasoning models_rl | [reasoning, rlvr, reasoning models, rl, cot, e... | [PEAR: Phase Entropy Aware Reward for Efficien... |
| 3 | 468 | 3_rag_retrieval_knowledge_graph | [rag, retrieval, knowledge, graph, retrievalau... | [Clue-RAG: Towards Accurate and Cost-Efficient... |
| 4 | 420 | 4_code_software_code generation_software engin... | [code, software, code generation, software eng... | [The Rise of AI Teammates in Software Engineer... |
| 5 | 387 | 5_clinical_medical_health_healthcare | [clinical, medical, health, healthcare, patien... | [Retrieval-Augmented Framework for LLM-Based C... |
| 6 | 340 | 6_molecular_protein_drug_chemical | [molecular, protein, drug, chemical, materials... | [S Drug: Bridging Protein Sequence and 3D Stru... |
| 7 | 334 | 7_manipulation_robot_robotic_vla | [manipulation, robot, robotic, vla, robotic ma... | [From Human Hands to Robot Arms: Manipulation ... |
| 8 | 316 | 8_ai_moral_ethical_governance | [ai, moral, ethical, governance, ai systems, r... | [Trustworthiness of Legal Considerations for t... |
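
The summary figures above follow directly from the list of topic labels. The sketch below shows one way to compute them with `collections.Counter` on a short made-up label list, and then cross-checks the percentages reported for this run. The helper `summarize_topics` is hypothetical.

```python
from collections import Counter


def summarize_topics(labels):
    """Return (number of topics, outlier count, outlier percentage)."""
    counts = Counter(labels)
    n_outliers = counts.get(-1, 0)  # HDBSCAN labels noise points as -1
    n_topics = len(counts) - (1 if -1 in counts else 0)
    outlier_pct = 100 * n_outliers / len(labels)
    return n_topics, n_outliers, outlier_pct


# Made-up labels for illustration only.
print(summarize_topics([0, 0, 1, -1, 2, -1, 1, 0]))

# Cross-check of the figures reported above.
print(f"{100 * 5146 / 19898:.1f}%")  # outlier share
print(19898 - 5146)                  # documents assigned to topics
```

Subtracting one from the topic count only when `-1` is present keeps the result correct for runs that produce no outliers.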


In [39]:
# Explore specific topics
def show_topic_details(topic_id):
    """Display detailed information about a topic."""
    if topic_id == -1:
        print("Topic -1 is the outlier/noise cluster")
        return
    
    # Get top words
    words = topic_model.get_topic(topic_id)
    print(f"\nTopic {topic_id}")
    print("="*50)
    print("Top words:")
    for word, score in words[:10]:
        print(f"  {word}: {score:.4f}")
    
    # Get representative documents
    rep_docs = topic_model.get_representative_docs(topic_id)
    print(f"\nRepresentative documents:")
    for i, doc in enumerate(rep_docs[:3], 1):
        print(f"\n{i}. {doc[:200]}...")

# Show details for top 5 topics
for topic_id in topic_info['Topic'].head(6).tolist():
    if topic_id != -1:
        show_topic_details(topic_id)


Topic 0
Top words:
  segmentation: 0.0183
  medical: 0.0175
  clinical: 0.0147
  imaging: 0.0141
  images: 0.0126
  image: 0.0111
  mri: 0.0109
  diagnostic: 0.0088
  diagnosis: 0.0087
  tumor: 0.0086

Representative documents:

1. LesiOnTime -- Joint Temporal and Clinical Modeling for Small Breast Lesion Segmentation in Longitudinal DCE-MRI. Accurate segmentation of small lesions in Breast Dynamic Contrast-Enhanced MRI (DCE-MRI...

2. MedSAM3: Delving into Segment Anything with Medical Concepts. Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-cons...

3. Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks. Despite significant progress in pixel-level medical image analysis, existi...

Topic 1
Top words:
  diffusion: 0.0206
  image: 0.0175
  generation: 0.0154
  video: 0.0141
  diffusion models: 0.0132
  editing: 0.0110
  image g

## 7. Visualizations

In [40]:
# Topic overview bar chart
fig = topic_model.visualize_barchart(top_n_topics=15)
fig.show()

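# Save figure (create the results directory first so write_html does not fail)
os.makedirs(f"{PROJECT_PATH}/results", exist_ok=True)
fig.write_html(f"{PROJECT_PATH}/results/topic_barchart.html")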

In [41]:
# Topic similarity heatmap
fig = topic_model.visualize_heatmap(top_n_topics=20)
fig.show()

fig.write_html(f"{PROJECT_PATH}/results/topic_heatmap.html")

In [42]:
# 2D UMAP visualization of documents colored by topic
# This is the main visualization showing topic clusters

# Reduce to 2D for visualization
print("Computing 2D UMAP for visualization...")
umap_2d = UMAP(n_components=2, min_dist=0.1, metric='cosine', random_state=42)
embeddings_2d = umap_2d.fit_transform(embeddings)

print("Creating scatter plot...")

Computing 2D UMAP for visualization...
Creating scatter plot...


In [43]:
# Create interactive scatter plot
vis_df = pd.DataFrame({
    'x': embeddings_2d[:, 0],
    'y': embeddings_2d[:, 1],
    'topic': [str(t) for t in topics],
    'title': df['title'].tolist(),
    'date': df['date'].tolist()
})

# Color outliers differently
vis_df['is_outlier'] = vis_df['topic'] == '-1'

fig = px.scatter(
    vis_df[~vis_df['is_outlier']],  # Exclude outliers for clarity
    x='x', y='y',
    color='topic',
    hover_data=['title', 'date'],
    title='arXiv cs.AI Topics - 2D UMAP Projection',
    width=1000, height=700
)

fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.update_layout(legend=dict(title='Topic', itemsizing='constant'))
fig.show()

# Save
fig.write_html(f"{PROJECT_PATH}/results/topic_scatter_2d.html")
print(f"\nSaved to {PROJECT_PATH}/results/topic_scatter_2d.html")


Saved to /content/results/topic_scatter_2d.html


In [44]:
# Hierarchy visualization (dendrogram)
fig = topic_model.visualize_hierarchy(top_n_topics=30)
fig.show()

fig.write_html(f"{PROJECT_PATH}/results/topic_hierarchy.html")

## 8. Save Model and Results

In [45]:
# Save BERTopic model
model_path = f"{PROJECT_PATH}/models/bertopic_model"
os.makedirs(model_path, exist_ok=True)

topic_model.save(
    model_path,
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model=False  # Skip the embedding model; supply it again at load time to transform new documents
)

print(f"Model saved to {model_path}")



Model saved to /content/models/bertopic_model


In [46]:
# Save topic assignments
results_df = df.copy()
results_df['topic'] = topics
results_df['topic_prob'] = [p.max() if len(p) > 0 else 0 for p in probs]

# Add topic labels
topic_labels = {row['Topic']: row['Name'] for _, row in topic_info.iterrows()}
results_df['topic_name'] = results_df['topic'].map(topic_labels)

# Save
results_df.to_csv(f"{PROJECT_PATH}/results/topic_assignments.csv", index=False)
print(f"Topic assignments saved!")

# Save topic info
topic_info.to_csv(f"{PROJECT_PATH}/results/topic_info.csv", index=False)
print("Topic info saved!")

Topic assignments saved!
Topic info saved!


In [47]:
# Save 2D embeddings for visualization
np.save(f"{PROJECT_PATH}/data/embeddings/embeddings_2d.npy", embeddings_2d)
print("2D embeddings saved!")

2D embeddings saved!


## Summary

This notebook has:
1. ✅ Computed document embeddings using Sentence-BERT (GPU)
2. ✅ Trained BERTopic with UMAP + HDBSCAN + c-TF-IDF
3. ✅ Discovered topics in arXiv cs.AI abstracts
4. ✅ Generated visualizations (bar chart, heatmap, scatter, hierarchy)
5. ✅ Saved model and results

**Next step:** Run `04_evaluation.ipynb` to compute coherence and diversity metrics.