<a href="https://colab.research.google.com/github/khushidubeyokok/BERTopic/blob/main/BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Research Paper Topic Modeling with BERTopic (neuralwork/arxiver)

In this notebook, we apply BERTopic to the **[neuralwork/arxiver](https://huggingface.co/datasets/neuralwork/arxiver)** dataset from Hugging Face. The dataset contains 63,357 research papers with fields such as `title`, `abstract`, and `authors`, which makes it well suited to identifying scientific themes through topic modeling.

## Steps Covered in This Notebook
1. **Load and Explore Dataset**: Inspect the data structure and content.
2. **Preprocess Text**: Clean abstracts for analysis.
3. **Apply BERTopic**: Generate and interpret topic clusters.
4. **Visualize Findings**: Plot and analyze topic distributions.


## Load and Explore Dataset

In this section, we load the **neuralwork/arxiver** dataset and examine its structure to better understand what content is available for topic modeling.


In [50]:
# Install required libraries
!pip -q install datasets
!pip -q install bertopic
!pip -q install transformers

In [51]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("neuralwork/arxiver")

In [52]:
# Display the dataset structure
print(dataset)

# View sample entries from the dataset
print("Sample Abstracts:")
for i in range(3):  # Display the first three abstracts
    print(f"Abstract {i+1}:\n{dataset['train'][i]['abstract']}\n")

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'abstract', 'authors', 'published_date', 'link', 'markdown'],
        num_rows: 63357
    })
})
Sample Abstracts:
Abstract 1:
Given the recent advances with image-generating algorithms, deep image
completion methods have made significant progress. However, state-of-art
methods typically provide poor cross-scene generalization, and generated masked
areas often contain blurry artifacts. Predictive filtering is a method for
restoring images, which predicts the most effective kernels based on the input
scene. Motivated by this approach, we address image completion as a filtering
problem. Deep feature-level semantic filtering is introduced to fill in missing
information, while preserving local structure and generating visually realistic
content. In particular, a Dual-path Cooperative Filtering (DCF) model is
proposed, where one path predicts dynamic kernels, and the other path extracts
multi-level features by using Fast Fo

## Text Preprocessing

To improve the quality of topic modeling, we’ll apply the following preprocessing steps to each abstract:
1. **Convert to Lowercase**: Ensure consistent casing.
2. **Remove Numbers and Special Characters**: Strip digits and replace runs of non-word characters (including extra whitespace) with a single space.
3. **Remove Stop Words**: Drop words that carry little meaning, such as "the," "and," and "is."
4. **Lemmatization**: Reduce words to their dictionary forms so that variants such as "pulses" and "pulse" are treated as one term.
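
The regex-based cleaning steps listed above (lowercasing, removing digits, and collapsing special characters and whitespace) can be tried in isolation, without NLTK. This is a minimal sketch: the helper name `basic_clean` is hypothetical, and the sample sentence is adapted from the second abstract in the dataset.

```python
import re


def basic_clean(text: str) -> str:
    """Hypothetical helper: regex-only cleaning, no stop words or lemmatization."""
    text = text.lower()                 # consistent casing
    text = re.sub(r"\d+", "", text)     # drop digits
    text = re.sub(r"\W+", " ", text)    # collapse runs of non-word characters
    return text.strip()


sample = "EVN data of the Crab Pulsar at 1658.49\nMHz."
print(basic_clean(sample))  # → evn data of the crab pulsar at mhz
```

Two side effects are worth keeping in mind: removing digits also discards quantities such as the observing frequency, and `\W` leaves underscores in place because `_` counts as a word character.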



In [53]:
abstracts = dataset['train']['abstract']

In [54]:
abstracts[1]

'We analyzed four epochs of beamformed EVN data of the Crab Pulsar at 1658.49\nMHz. With the high sensitivity resulting from resolving out the Crab Nebula, we\nare able to detect even the faint high-frequency components in the folded\nprofile. We also detect a total of 65951 giant pulses, which we use to\ninvestigate the rates, fluence, phase, and arrival time distributions. We find\nthat for the main pulse component, our giant pulses represent about 80% of the\ntotal flux. This suggests we have a nearly complete giant pulse energy\ndistribution, although it is not obvious how the observed distribution could be\nextended to cover the remaining 20% of the flux without invoking large numbers\nof faint bursts for every rotation. Looking at the difference in arrival time\nbetween subsequent bursts in single rotations, we confirm that the likelihood\nof finding giant pulses close to each other is increased beyond that expected\nfor randomly occurring bursts - some giant pulses consist of ca

In [55]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Download the stop word list and WordNet data if not already present
nltk.download("stopwords")
nltk.download("wordnet")

# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove numbers and special characters
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\W+', ' ', text)

    # Remove stop words and lemmatize
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    # Join words back into a single string
    processed_text = ' '.join(words)

    return processed_text

# Apply preprocessing to all abstracts
processed_abstracts = [preprocess_text(abstract) for abstract in abstracts]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [56]:
# Display a processed abstract
processed_abstracts[1]

'analyzed four epoch beamformed evn data crab pulsar mhz high sensitivity resulting resolving crab nebula able detect even faint high frequency component folded profile also detect total giant pulse use investigate rate fluence phase arrival time distribution find main pulse component giant pulse represent total flux suggests nearly complete giant pulse energy distribution although obvious observed distribution could extended cover remaining flux without invoking large number faint burst every rotation looking difference arrival time subsequent burst single rotation confirm likelihood finding giant pulse close increased beyond expected randomly occurring burst giant pulse consist causally related microbursts typical separation sim rm mu also find evidence separation gtrsim rm mu likelihood finding another giant pulse suppressed addition high sensitivity enabled u detect weak echo feature brightest pulse sim peak giant pulse flux delayed sim rm mu'

## Apply BERTopic for Topic Modeling

With our preprocessed abstracts, we’ll apply BERTopic to identify topics. BERTopic embeds each document with a transformer model, reduces the embeddings with UMAP, clusters them with HDBSCAN, and describes each cluster by its most representative words, which makes it well suited to identifying scientific themes in research abstracts.


In [57]:
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",   # Efficient embedding model for quick computation
    umap_model=UMAP(
        n_neighbors=10,                  # Lower neighbors for tighter clusters
        n_components=5,                  # Dimensionality reduction
        min_dist=0.1,                    # Slight separation between clusters
        metric='cosine'                  # Cosine similarity for text data
    ),
    hdbscan_model=HDBSCAN(
        min_cluster_size=60,             # Increase to reduce topic count
        min_samples=15,                  # Below the default (min_cluster_size); lower values mark fewer points as outliers
        metric='euclidean',              # Works well with UMAP output
        cluster_selection_method='eom'   # Excess of Mass cluster selection
    ),
    top_n_words=10                       # Top words per topic
)

# Fit the model on the preprocessed abstracts
topics, probs = topic_model.fit_transform(processed_abstracts)

In [58]:
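# Handle outliers: reassign documents whose probability is below the threshold to -1.
# Note: filtered_topics is not passed back to the model, so later cells still use `topics`.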
threshold = 0.15
filtered_topics = [topic if prob >= threshold else -1 for topic, prob in zip(topics, probs)]
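
To see what the threshold does before relying on it, it helps to compare the share of outliers before and after filtering. The sketch below repeats the same rule on toy values; the helper names `apply_threshold` and `outlier_share` and the toy lists are illustrative assumptions, not part of the notebook.

```python
def apply_threshold(topics, probs, threshold=0.15):
    """Reassign low-probability documents to the outlier topic -1."""
    return [t if p >= threshold else -1 for t, p in zip(topics, probs)]


def outlier_share(topics):
    """Fraction of documents labeled as outliers (-1)."""
    return sum(t == -1 for t in topics) / len(topics)


# Toy assignments (hypothetical values, not model output)
toy_topics = [13, -1, 19, 71, 0]
toy_probs = [1.0, 0.0, 0.10, 1.0, 0.72]

filtered = apply_threshold(toy_topics, toy_probs)
print(filtered)                   # → [13, -1, -1, 71, 0]
print(outlier_share(toy_topics))  # → 0.2
print(outlier_share(filtered))    # → 0.4
```

A threshold like this can only increase the number of outliers, so it trades coverage for confidence in the assignments that remain.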

In [59]:
# Custom topic labels
topic_model.set_topic_labels(topic_model.generate_topic_labels(
    separator=" | ",
    topic_prefix=False    # Drop the topic-number prefix (e.g. "0_")
))

# View improved topics
print(topic_model.get_topic_info())

     Topic  Count                                            Name  \
0       -1  21991                       -1_model_data_method_task   
1        0  17904                      0_quantum_mass_field_state   
2        1   1281            1_image_segmentation_medical_imaging   
3        2   1139                      2_speech_audio_speaker_asr   
4        3    803          3_network_neural_gradient_optimization   
..     ...    ...                                             ...   
105    104     62                           104_agent_llm_ai_game   
106    105     61              105_percolation_random_vertex_walk   
107    106     61  106_summarization_summary_abstractive_document   
108    107     60          107_sentence_word_embeddings_embedding   
109    108     60                     108_legal_court_case_lawyer   

                                CustomName  \
0                    model | data | method   
1                   quantum | mass | field   
2           image | segmentation 

## Visualize Topics

Visualizing the discovered topics helps in understanding the distribution and relationships between different topics. BERTopic provides several visualization tools to aid in this analysis:
1. **Similarity Heatmap**: Shows how similar each pair of topics is.
2. **Top Words per Topic**: Charts the most representative words for each topic.
3. **Intertopic Distance Map**: Shows how topics are related to each other.
4. **Topic Hierarchy**: Displays the hierarchical structure of topics.

Let's generate these visualizations to gain insights into the topic structure.


In [60]:
topic_model.visualize_heatmap()

In [61]:
topic_model.visualize_barchart()

In [62]:
topic_model.visualize_topics()

In [63]:
topic_model.visualize_hierarchy()

In [64]:
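# Display the first 10 rows of the topic table.
# Row 1 is the outlier topic (-1), so "Topic {i+1}" is a row position, not a topic ID.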
print("Top topics identified:")
for i, topic in enumerate(topic_model.get_topic_info().head(10).values):
    print(f"Topic {i+1}: {topic}")

Top topics identified:
Topic 1: [-1 21991 '-1_model_data_method_task' 'model | data | method'
 list(['model', 'data', 'method', 'task', 'learning', 'based', 'problem', 'paper', 'approach', 'result'])
 list(['transformer highly successful deep learning model revolutionised world artificial neural network first natural language processing later computer vision model based attention mechanism able capture complex semantic relationship variety pattern present input data precisely characteristic transformer recently exploited time series forecasting problem assuming natural adaptability domain continuous numerical series despite acclaimed result literature work raised doubt robustness effectiveness approach paper investigate effectiveness transformer based model applied domain time series forecasting demonstrate limitation propose set alternative model better performing significantly le complex particular empirically show simplifying transformer based forecasting model almost always lead im

## Map Titles to Topics and Probabilities

To provide a clear overview of how each research paper is categorized, we'll create a table that links each paper's title to its assigned topic and the corresponding probability score. This allows us to assess the confidence of the topic assignments and understand the distribution of topics across the dataset.


In [91]:
import pandas as pd

# Extract titles from the dataset
titles = [entry['title'] for entry in dataset["train"]]

# Create a DataFrame with titles, topics, and probabilities
df = pd.DataFrame({
    'Title': titles,
    'Topic': topics,
    'Probability': probs
})

# Display rows 25 through 39 as a sample
print("Sample of Titles with Assigned Topics and Probabilities:")
df_sample = df.iloc[25:40]
print(df_sample)

# Save the DataFrame to a CSV file for future reference
df.to_csv("titles_topics_probabilities.csv", index=False)

Sample of Titles with Assigned Topics and Probabilities:
                                                Title  Topic  Probability
25  Enhanced Controllability of Diffusion Models v...     13     1.000000
26  Utility-based Adaptive Teaching Strategies usi...     -1     0.000000
27  TopoBERT: Plug and Play Toponym Recognition Mo...     -1     0.000000
28  CLiFF-LHMP: Using Spatial Dynamics Patterns fo...     -1     0.000000
29  Potential Ways to Detect Unfairness in HRI and...     19     0.725750
30                         DTC: Deep Tracking Control     71     1.000000
31  tSPM+; a high-performance algorithm for mining...     -1     0.000000
32  Dynamic Multi-Scale Context Aggregation for Co...     -1     0.000000
33  Hand Gesture Recognition with Two Stage Approa...    102     1.000000
34  The Boundaries of Verifiable Accuracy, Robustn...     -1     0.000000
35  Bose Gas Modeling of the Schwarzschild Black H...      0     1.000000
36     Cubical Approximation for Directed Topology II  
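
To judge the confidence of the assignments per topic rather than per paper, the rows can be grouped by topic. The following is a minimal sketch using only the standard library so that it runs without the fitted model; the helper name `summarize_topics` is hypothetical, and the sample rows are copied from the output above.

```python
from collections import defaultdict


def summarize_topics(rows):
    """Group (title, topic, probability) rows by topic and summarize each group."""
    grouped = defaultdict(list)
    for _title, topic, prob in rows:
        grouped[topic].append(prob)
    return {
        topic: {"count": len(ps), "mean_probability": sum(ps) / len(ps)}
        for topic, ps in grouped.items()
    }


rows = [
    ("Enhanced Controllability of Diffusion Models v...", 13, 1.0),
    ("Utility-based Adaptive Teaching Strategies usi...", -1, 0.0),
    ("TopoBERT: Plug and Play Toponym Recognition Mo...", -1, 0.0),
    ("Potential Ways to Detect Unfairness in HRI and...", 19, 0.72575),
]

summary = summarize_topics(rows)
for topic, stats in sorted(summary.items()):
    print(topic, stats)
```

With the full DataFrame, the same summary could be obtained with a pandas `groupby` on the `Topic` column.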

In [89]:
topic_model.get_topic(0)

[('quantum', 0.01276574753832458),
 ('mass', 0.00834357748823333),
 ('field', 0.008038466011139906),
 ('state', 0.007514787294640221),
 ('star', 0.007447554020196089),
 ('energy', 0.0074311689460621715),
 ('spin', 0.007046346561808763),
 ('magnetic', 0.006929218399119117),
 ('phase', 0.006751599543446306),
 ('galaxy', 0.006408631267170356)]
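
The scores next to each word are class-based TF-IDF (c-TF-IDF) weights: a word scores highly when it is frequent within the topic and rare across all topics. The sketch below is a simplified, standard-library version of that idea, using the weighting tf × log(1 + A / f), where A is the average number of words per topic and f is the word's total frequency. The function name `c_tf_idf` and the toy documents are assumptions for illustration; BERTopic's own implementation works on sparse matrices and may differ in detail.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """class_docs maps a topic ID to a list of tokenized documents."""
    # Treat all documents of a topic as one long document
    counts = {c: Counter(tok for doc in docs for tok in doc)
              for c, docs in class_docs.items()}
    totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
    avg_words = sum(totals.values()) / len(totals)  # A: average words per topic

    # f: frequency of each word across all topics
    term_freq = Counter()
    for cnt in counts.values():
        term_freq.update(cnt)

    return {
        c: {t: (n / totals[c]) * math.log(1 + avg_words / term_freq[t])
            for t, n in cnt.items()}
        for c, cnt in counts.items()
    }


toy = {
    0: [["quantum", "field", "state"], ["quantum", "spin"]],
    1: [["image", "segmentation", "image"], ["medical", "image"]],
}
scores = c_tf_idf(toy)
print(max(scores[0], key=scores[0].get))  # → quantum
print(max(scores[1], key=scores[1].get))  # → image
```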

## Conclusion

In this project, we successfully applied BERTopic to the **neuralwork/arxiver** dataset to uncover thematic clusters within research paper abstracts. The key findings include:

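- **Diverse Topic Distribution**: The model found 109 topics in addition to the outlier group, ranging from medical image segmentation and speech recognition to percolation theory and legal text, which reflects the wide range of scientific domains in the dataset.
- **High-Confidence Assignments**: Many papers were assigned to a topic with a probability of 1.0, which indicates that BERTopic discerns clear themes. However, 21,991 of the 63,357 abstracts (about 35%) were placed in the outlier topic -1, so a sizable share of papers remains unassigned.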
- **Insightful Visualizations**: The intertopic distance map and top words per topic provide valuable insights into the relationships and nature of the topics.

### Potential Applications

- **Literature Review Automation**: Assisting researchers in quickly identifying relevant papers based on thematic clusters.
- **Trend Analysis**: Monitoring the evolution of research topics over time to identify emerging areas of interest.
- **Recommendation Systems**: Suggesting related papers or topics to researchers based on their areas of interest.

### Future Work

- **Fine-Tuning BERTopic**: Experimenting with different embedding models or hyperparameters to enhance topic coherence.
- **Expanding the Dataset**: Incorporating more recent papers or additional datasets to broaden the scope of analysis.
- **Interactive Visualizations**: Creating interactive dashboards for dynamic exploration of topics and their relationships.

This project demonstrates the power of advanced topic modeling techniques like BERTopic in organizing and making sense of vast amounts of scientific literature.
