<a href="https://colab.research.google.com/github/khushidubeyokok/BERTopic/blob/main/Copy_of_BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Research Paper Topic Modeling with BERTopic (neuralwork/arxiver)

In this notebook, we apply BERTopic to the **[neuralwork/arxiver](https://huggingface.co/datasets/neuralwork/arxiver)** dataset from Hugging Face. The dataset provides the title, abstract, authors, publication date, link, and Markdown text of 63,356 arXiv papers, which makes it a good fit for discovering scientific themes through topic modeling.

## Steps Covered in This Notebook
1. **Load and Explore Dataset**: Inspect the data structure and content.
2. **Preprocess Text**: Clean abstracts for analysis.
3. **Apply BERTopic**: Generate and interpret topic clusters.
4. **Name Topics with an LLM**: Turn each topic's keywords and sample abstracts into a short label.
5. **Visualize Findings**: Plot and analyze topic distributions.


## Load and Explore Dataset

In this section, we load the **neuralwork/arxiver** dataset and examine its structure to better understand what content is available for topic modeling.


In [1]:
# Install required libraries (umap-learn and hdbscan are installed as BERTopic dependencies)
!pip -q install datasets bertopic transformers nltk openai

In [2]:
import pandas as pd

In [3]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("neuralwork/arxiver")

In [4]:
train_dataset = dataset["train"]
df = train_dataset.to_pandas()
df

Unnamed: 0,id,title,abstract,authors,published_date,link,markdown
0,2305.00379,Image Completion via Dual-path Cooperative Fil...,Given the recent advances with image-generatin...,"Pourya Shamsolmoali, Masoumeh Zareapoor, Eric ...",2023-04-30T03:54:53Z,http://arxiv.org/abs/2305.00379v1,# Image Completion via Dual-Path Cooperative F...
1,2307.16362,High Sensitivity Beamformed Observations of th...,We analyzed four epochs of beamformed EVN data...,"Rebecca Lin, Marten H. van Kerkwijk",2023-07-31T01:36:55Z,http://arxiv.org/abs/2307.16362v2,# High Sensitivity Beamformed Observations of ...
2,2301.07687,"Maybe, Maybe Not: A Survey on Uncertainty in V...",Understanding and evaluating uncertainty play ...,Krisha Mehta,2022-12-14T00:07:06Z,http://arxiv.org/abs/2301.07687v1,"# Maybe, Maybe Not: A Survey on Uncertainty in..."
3,2309.09088,Enhancing GAN-Based Vocoders with Contrastive ...,Vocoder models have recently achieved substant...,"Haoming Guo, Seth Z. Zhao, Jiachen Lian, Gopal...",2023-09-16T20:04:16Z,http://arxiv.org/abs/2309.09088v2,# Enhancing Gan-Based Vocoders with Contrastiv...
4,2307.16404,Nonvolatile Magneto-Thermal Switching in MgB2,Ongoing research explores thermal switching ma...,"Hiroto Arima, Yoshikazu Mizuguchi",2023-07-31T04:59:19Z,http://arxiv.org/abs/2307.16404v1,# Nonvolatile Magneto-Thermal Switching in MgB...
...,...,...,...,...,...,...,...
63352,2306.06241,Almost paratopological groups,A class of almost paratopological groups is in...,Evgenii Reznichenko,2023-06-09T20:27:33Z,http://arxiv.org/abs/2306.06241v2,# Almost paratopological groups\n\n###### Abst...
63353,2301.12293,ACL-Fig: A Dataset for Scientific Figure Class...,Most existing large-scale academic search engi...,"Zeba Karishma, Shaurya Rohatgi, Kavya Shriniva...",2023-01-28T20:27:35Z,http://arxiv.org/abs/2301.12293v1,# ACL-Fig: A Dataset for Scientific Figure Cla...
63354,2303.04288,Polynomial Time and Private Learning of Unboun...,We study the problem of privately estimating t...,"Jamil Arbas, Hassan Ashtiani, Christopher Liaw",2023-03-07T23:24:27Z,http://arxiv.org/abs/2303.04288v2,# Polynomial Time and Private Learning of Unbo...
63355,2308.11291,Improving Knot Prediction in Wood Logs with Lo...,The quality of a wood log in the wood industry...,"Salim Khazem, Jeremy Fix, Cédric Pradalier",2023-08-22T09:12:11Z,http://arxiv.org/abs/2308.11291v1,# Improving Knot Prediction in Wood Logs with ...


## Text Preprocessing

To improve the quality of topic modeling, we’ll perform several preprocessing steps on the abstracts:
1. **Convert to Lowercase**: Ensure consistent casing.
2. **Remove Numbers and Special Characters**: Strip digits and replace punctuation and other non-word characters with spaces.
3. **Remove Stop Words**: Drop words that don’t add much meaning, like "the," "and," "is."
4. **Lemmatization**: Reduce words to their dictionary forms (for example, "epochs" becomes "epoch") to handle variations.
5. **Remove Extra Whitespace**: Rejoin the remaining words with single spaces.



In [5]:
abstracts=dataset['train']['abstract']

In [6]:
abstracts[1]

'We analyzed four epochs of beamformed EVN data of the Crab Pulsar at 1658.49\nMHz. With the high sensitivity resulting from resolving out the Crab Nebula, we\nare able to detect even the faint high-frequency components in the folded\nprofile. We also detect a total of 65951 giant pulses, which we use to\ninvestigate the rates, fluence, phase, and arrival time distributions. We find\nthat for the main pulse component, our giant pulses represent about 80% of the\ntotal flux. This suggests we have a nearly complete giant pulse energy\ndistribution, although it is not obvious how the observed distribution could be\nextended to cover the remaining 20% of the flux without invoking large numbers\nof faint bursts for every rotation. Looking at the difference in arrival time\nbetween subsequent bursts in single rotations, we confirm that the likelihood\nof finding giant pulses close to each other is increased beyond that expected\nfor randomly occurring bursts - some giant pulses consist of ca

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Download the stop-word list and the WordNet data if needed
nltk.download("stopwords")
nltk.download("wordnet")

# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

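    # Remove digits, then collapse runs of non-word characters into a single space.
    # Note: \W does not match underscores, so tokens such as "a_" can survive.
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\W+', ' ', text)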

    # Remove stop words and lemmatize
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    # Join words back into a single string
    processed_text = ' '.join(words)

    return processed_text

# Apply preprocessing to all abstracts
processed_abstracts = [preprocess_text(abstract) for abstract in abstracts]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
# Display a processed abstract
processed_abstracts[1]

'analyzed four epoch beamformed evn data crab pulsar mhz high sensitivity resulting resolving crab nebula able detect even faint high frequency component folded profile also detect total giant pulse use investigate rate fluence phase arrival time distribution find main pulse component giant pulse represent total flux suggests nearly complete giant pulse energy distribution although obvious observed distribution could extended cover remaining flux without invoking large number faint burst every rotation looking difference arrival time subsequent burst single rotation confirm likelihood finding giant pulse close increased beyond expected randomly occurring burst giant pulse consist causally related microbursts typical separation sim rm mu also find evidence separation gtrsim rm mu likelihood finding another giant pulse suppressed addition high sensitivity enabled u detect weak echo feature brightest pulse sim peak giant pulse flux delayed sim rm mu'
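The processed abstract above still contains tokens such as `sim`, `rm`, and `mu`. These are most likely leftovers from LaTeX markup in the raw abstract. The sketch below reapplies only the lowercase and regex steps of `preprocess_text` to two short strings to show the effect; the helper name `strip_digits_and_symbols` and the first example string are illustrative assumptions, not part of the notebook.

```python
import re

def strip_digits_and_symbols(text: str) -> str:
    """Apply only the lowercase and regex steps of preprocess_text (no stop words, no lemmatizer)."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)    # drop digits
    text = re.sub(r'\W+', ' ', text)   # collapse runs of non-word characters
    return text.strip()                # the notebook gets the same effect via split/join

# LaTeX markup loses its backslashes and braces, but the macro names remain as words.
latex_example = strip_digits_and_symbols(r"separation of $\sim 30\,{\rm \mu s}$")

# Underscores count as word characters, so \W+ leaves them in place.
underscore_example = strip_digits_and_symbols(r"Derived $A_\infty$-algebras")

print(latex_example)
print(underscore_example)
```

If these fragments turn out to pollute the topic keywords, one option is to add the most common macro names to the stop-word set before preprocessing.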

In [9]:
df['processed_abstracts']=processed_abstracts
df.head()

Unnamed: 0,id,title,abstract,authors,published_date,link,markdown,processed_abstracts
0,2305.00379,Image Completion via Dual-path Cooperative Fil...,Given the recent advances with image-generatin...,"Pourya Shamsolmoali, Masoumeh Zareapoor, Eric ...",2023-04-30T03:54:53Z,http://arxiv.org/abs/2305.00379v1,# Image Completion via Dual-Path Cooperative F...,given recent advance image generating algorith...
1,2307.16362,High Sensitivity Beamformed Observations of th...,We analyzed four epochs of beamformed EVN data...,"Rebecca Lin, Marten H. van Kerkwijk",2023-07-31T01:36:55Z,http://arxiv.org/abs/2307.16362v2,# High Sensitivity Beamformed Observations of ...,analyzed four epoch beamformed evn data crab p...
2,2301.07687,"Maybe, Maybe Not: A Survey on Uncertainty in V...",Understanding and evaluating uncertainty play ...,Krisha Mehta,2022-12-14T00:07:06Z,http://arxiv.org/abs/2301.07687v1,"# Maybe, Maybe Not: A Survey on Uncertainty in...",understanding evaluating uncertainty play key ...
3,2309.09088,Enhancing GAN-Based Vocoders with Contrastive ...,Vocoder models have recently achieved substant...,"Haoming Guo, Seth Z. Zhao, Jiachen Lian, Gopal...",2023-09-16T20:04:16Z,http://arxiv.org/abs/2309.09088v2,# Enhancing Gan-Based Vocoders with Contrastiv...,vocoder model recently achieved substantial pr...
4,2307.16404,Nonvolatile Magneto-Thermal Switching in MgB2,Ongoing research explores thermal switching ma...,"Hiroto Arima, Yoshikazu Mizuguchi",2023-07-31T04:59:19Z,http://arxiv.org/abs/2307.16404v1,# Nonvolatile Magneto-Thermal Switching in MgB...,ongoing research explores thermal switching ma...


## Apply BERTopic for Topic Modeling

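With our preprocessed abstracts, we’ll apply BERTopic to identify topics. BERTopic embeds each document with a transformer model (here the sentence-transformers model `all-MiniLM-L6-v2`), reduces the embeddings with UMAP, clusters them with HDBSCAN, and describes each cluster by its most characteristic words. This makes it well suited to identifying scientific themes in research abstracts.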
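After clustering, BERTopic picks the keywords for each topic with a class-based TF-IDF (c-TF-IDF) weighting: all documents in a cluster are treated as one long document, and a word scores highly when it is frequent in that cluster but rare overall. The block below is a simplified pure-Python sketch of that idea, based on the formula in the BERTopic documentation (term frequency times `log(1 + A / f)`, where `A` is the average number of words per class and `f` is the word's total frequency). The function name `class_tfidf` and the toy clusters are assumptions for illustration; the library's own implementation differs in its details.

```python
import math
from collections import Counter

def class_tfidf(docs_per_topic: dict[int, list[str]], top_n: int = 3) -> dict[int, list[str]]:
    """Rank words per topic with a simplified class-based TF-IDF."""
    # Treat every topic as one bag of words built from all of its documents.
    counts = {topic: Counter(" ".join(docs).split()) for topic, docs in docs_per_topic.items()}

    # Frequency of each word across all topics, and the average words per topic.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    avg_words = sum(total.values()) / len(counts)

    keywords = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in topic_counts.items()
        }
        # Highest score first; ties are broken alphabetically for a stable order.
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        keywords[topic] = ranked[:top_n]
    return keywords

# Toy clusters (hypothetical data, not taken from the dataset)
toy_clusters = {
    0: ["galaxy star mass", "star black hole mass", "galaxy model"],
    1: ["quantum spin state", "quantum phase model", "spin state"],
}
toy_keywords = class_tfidf(toy_clusters)
print(toy_keywords)
```

The word "model" appears in both toy clusters, so it is weighted down and does not make the top three for either one.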


In [10]:
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",   # Small, fast sentence-transformers model
    umap_model=UMAP(
        n_neighbors=10,                  # Smaller values emphasize local structure
        n_components=5,                  # Reduce embeddings to 5 dimensions before clustering
        min_dist=0.1,                    # Minimum spacing between points in the reduced space
        metric='cosine'                  # Cosine distance suits text embeddings
    ),
    hdbscan_model=HDBSCAN(
        min_cluster_size=60,             # Larger values yield fewer, bigger topics
        min_samples=15,                  # Larger values are more conservative and mark more points as outliers
        metric='euclidean',              # Works well with UMAP output
        cluster_selection_method='eom'   # Excess of Mass cluster selection
    ),
    top_n_words=10                       # Top words per topic
)

# Fit the model on the preprocessed abstracts
topics, probs = topic_model.fit_transform(processed_abstracts)

In [30]:
import pickle

# Save the BERTopic model
with open("bertopic_model.pkl", "wb") as f:
    pickle.dump(topic_model, f)

# Save the list of topics
with open("topics_list.pkl", "wb") as f:
    pickle.dump(topics, f)


In [11]:
# Attach the topic ids and drop outliers (BERTopic labels them -1)
df['topic'] = topics
df_filtered = df[df['topic'] != -1]
df_filtered

Unnamed: 0,id,title,abstract,authors,published_date,link,markdown,processed_abstracts,topic
1,2307.16362,High Sensitivity Beamformed Observations of th...,We analyzed four epochs of beamformed EVN data...,"Rebecca Lin, Marten H. van Kerkwijk",2023-07-31T01:36:55Z,http://arxiv.org/abs/2307.16362v2,# High Sensitivity Beamformed Observations of ...,analyzed four epoch beamformed evn data crab p...,0
3,2309.09088,Enhancing GAN-Based Vocoders with Contrastive ...,Vocoder models have recently achieved substant...,"Haoming Guo, Seth Z. Zhao, Jiachen Lian, Gopal...",2023-09-16T20:04:16Z,http://arxiv.org/abs/2309.09088v2,# Enhancing Gan-Based Vocoders with Contrastiv...,vocoder model recently achieved substantial pr...,5
4,2307.16404,Nonvolatile Magneto-Thermal Switching in MgB2,Ongoing research explores thermal switching ma...,"Hiroto Arima, Yoshikazu Mizuguchi",2023-07-31T04:59:19Z,http://arxiv.org/abs/2307.16404v1,# Nonvolatile Magneto-Thermal Switching in MgB...,ongoing research explores thermal switching ma...,1
6,2304.00044,On The Theory of Ring Afterglows,Synchrotron and inverse Compton emission succe...,"Marcus DuPont, Andrew MacFadyen, Re'em Sari",2023-03-31T18:02:12Z,http://arxiv.org/abs/2304.00044v1,# On The Theory of Ring Afterglows\n\n###### A...,synchrotron inverse compton emission successfu...,0
8,2309.07927,Kid-Whisper: Towards Bridging the Performance ...,Recent advancements in Automatic Speech Recogn...,"Ahmed Adel Attia, Jing Liu, Wei Ai, Dorottya D...",2023-09-12T06:58:18Z,http://arxiv.org/abs/2309.07927v3,Kid-Whisper: Towards Bridging the Performance ...,recent advancement automatic speech recognitio...,5
...,...,...,...,...,...,...,...,...,...
63350,2305.10173,"Quantum theory without the Axiom of choice, an...","In this conceptual paper, we discuss quantum f...",Koen Thas,2023-05-17T12:57:19Z,http://arxiv.org/abs/2305.10173v1,"# Quantum theory without the axiom of choice, ...",conceptual paper discus quantum formalism use ...,1
63351,2307.11414,The Derived Deligne Conjecture,Derived $A_\infty$-algebras have a wealth of t...,"Javier Aguilar Martín, Constanze Roitzheim",2023-07-21T08:16:23Z,http://arxiv.org/abs/2307.11414v3,# The derived Deligne conjecture\n\n###### Abs...,derived a_ infty algebra wealth theoretical ad...,2
63352,2306.06241,Almost paratopological groups,A class of almost paratopological groups is in...,Evgenii Reznichenko,2023-06-09T20:27:33Z,http://arxiv.org/abs/2306.06241v2,# Almost paratopological groups\n\n###### Abst...,class almost paratopological group introduced ...,2
63354,2303.04288,Polynomial Time and Private Learning of Unboun...,We study the problem of privately estimating t...,"Jamil Arbas, Hassan Ashtiani, Christopher Liaw",2023-03-07T23:24:27Z,http://arxiv.org/abs/2303.04288v2,# Polynomial Time and Private Learning of Unbo...,study problem privately estimating parameter d...,62


In [12]:
# get_topic_info() already returns a pandas DataFrame
info = topic_model.get_topic_info()
info

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,20721,-1_model_data_method_task,"[model, data, method, task, learning, based, a...",[vision system see reason compositional nature...
1,0,9571,0_mass_star_galaxy_energy,"[mass, star, galaxy, energy, hole, field, blac...",[perform new general relativistic viscous radi...
2,1,7030,1_quantum_state_spin_phase,"[quantum, state, spin, phase, magnetic, system...",[electronic state calculation using quantum co...
3,2,3053,2_group_algebra_mathbb_category,"[group, algebra, mathbb, category, prove, mani...",[mapping class group x smooth manifold x group...
4,3,1418,3_image_segmentation_medical_imaging,"[image, segmentation, medical, imaging, mri, m...",[accurate medical image segmentation utmost im...
...,...,...,...,...,...
92,91,63,91_summarization_summary_document_abstractive,"[summarization, summary, document, abstractive...",[natural language processing booming applicati...
93,92,63,92_sentence_embeddings_word_embedding,"[sentence, embeddings, word, embedding, simila...",[sentence embeddings enable u capture semantic...
94,93,62,93_entropy_thermodynamics_thermodynamic_equili...,"[entropy, thermodynamics, thermodynamic, equil...",[paper investigates generalized thermodynamic ...
95,94,60,94_distributed_consensus_agent_algorithm,"[distributed, consensus, agent, algorithm, con...",[paper develops novel approach consensus probl...
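The table above puts 20,721 abstracts in topic -1, the outlier bucket. The short calculation below relates that count to the 63,356 rows in the DataFrame (index 0 to 63355); the helper name `outlier_summary` is an illustrative assumption.

```python
def outlier_summary(n_docs: int, n_outliers: int) -> tuple[int, float]:
    """Return the number of assigned documents and the outlier share."""
    assigned = n_docs - n_outliers
    return assigned, n_outliers / n_docs

# Figures taken from the outputs shown in this notebook
assigned, outlier_share = outlier_summary(n_docs=63356, n_outliers=20721)
print(f"{assigned} abstracts assigned to a topic, {outlier_share:.1%} left as outliers")
```

Roughly a third of the corpus is left unassigned with these settings, which is worth keeping in mind when interpreting the topic counts.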


In [13]:
from collections import defaultdict

topic_abstracts = defaultdict(list)

# For each topic, collect up to 10 preprocessed abstracts
for topic_num in info['Topic'].values:
    if topic_num == -1:
        continue  # skip the outlier topic
    topic_docs = df.loc[df['topic'] == topic_num, 'processed_abstracts']
    topic_abstracts[topic_num] = topic_docs.sample(
        n=min(10, len(topic_docs)),
        random_state=42
    ).tolist()


In [14]:
for topic_num in list(topic_abstracts.keys())[:2]:  # first 2 topics
    print(f"\nTopic {topic_num} - Sample Abstracts:")
    for i, abs_text in enumerate(topic_abstracts[topic_num]):
        print(f"{i+1}. {abs_text[:200]}...")  #first 200 chars


Topic 0 - Sample Abstracts:
1. v izzo et al nucl fusion state art modeling thermal current quench cq mhd coupled self consistent evolution runaway electron generation transport showed non axisymmetric n vessel coil could passively ...
2. report analysis result globular cluster gc ngc millisecond pulsar msp j recently reported found gc data used large area telescope onboard fermi gamma ray space telescope fermi detect gamma ray pulsati...
3. present left right symmetric model provides explanation mass hierarchy charged fermion within framework standard model explanation achieved utilization tree level radiative seesaw mechanism model tiny...
4. present study long term variability jupiter mid infrared auroral ch emission micron image jupiter recorded earth based telescope last three decade collated order quantify magnitude timescales northern...
5. demonstrate prototype kinetic inductance detector kid readout system us less mw per pixel ccat prime rfsoc based readout capable reading fou

In [15]:
llm_input = []

# `samples` avoids overwriting the `abstracts` variable defined earlier
for topic_num, samples in topic_abstracts.items():
    # Get topic keywords from BERTopic as (word, score) pairs
    topic_keywords = topic_model.get_topic(topic_num)
    keywords = [word for word, _ in topic_keywords[:10]]  # top 10 keywords

    llm_input.append({
        "topic_num": int(topic_num),
        "topic_keywords": keywords,
        "sample_abstracts": samples
    })

# Preview one sample
import json
print(json.dumps(llm_input[0], indent=2))

{
  "topic_num": 0,
  "topic_keywords": [
    "mass",
    "star",
    "galaxy",
    "energy",
    "hole",
    "field",
    "black",
    "rm",
    "stellar",
    "matter"
  ],
  "sample_abstracts": [
    "v izzo et al nucl fusion state art modeling thermal current quench cq mhd coupled self consistent evolution runaway electron generation transport showed non axisymmetric n vessel coil could passively prevent beam formation disruption sparc compact high field tokamak projected achieve fusion gain q dt plasma however suppression requires finite transport re within magnetic island healed flux surface conservatively assuming zero transport region lead upper bound current compared pre disruption plasma current investigation find core localized electron within r kinetic energy mev contribute plateau formation yet relatively small amount transport e diffusion coefficient mathrm needed core fully mitigate re properly accounting cq electric field effect transport island ii contribution signific

In [32]:
from openai import OpenAI
from google.colab import userdata
openai_api_key = userdata.get('openai_key')
client = OpenAI(api_key=openai_api_key)

In [17]:
def generate_topic_name(topic_keywords, sample_abstracts):
    # Format the sample abstracts as a bulleted list for the prompt
    abstract_lines = "\n".join(f"- {abstract}" for abstract in sample_abstracts)
    prompt = f"""
You are a helpful assistant for naming topics from research paper abstracts.
Given the following keywords generated using BERTopic and sample abstracts, generate a short and meaningful topic name.

The topic name should be very short (3 to 4 words at most), not a sentence or description.

Keywords: {', '.join(topic_keywords)}

Abstracts:
{abstract_lines}

Give a concise 3-4 word topic name:"""

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=30   # a short name needs only a few tokens
    )
    return response.choices[0].message.content.strip()
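The prompt asks for a name of 3 to 4 words, but a chat model may still wrap the answer in quotes, add a label, or run long. A small post-processing step can enforce the format. The helper below, `clean_topic_name`, is a hypothetical addition and not part of the original notebook.

```python
import re

def clean_topic_name(raw: str, max_words: int = 4) -> str:
    """Tidy a model response into a short topic name (hypothetical helper)."""
    stripped = raw.strip()
    # Keep only the first line in case the model adds an explanation.
    first_line = stripped.splitlines()[0] if stripped else ""
    # Drop a leading label such as "Topic:" or "Topic name:".
    first_line = re.sub(r'^\s*topic(\s+name)?\s*:\s*', '', first_line, flags=re.IGNORECASE)
    # Remove surrounding quotes, backticks, asterisks, and trailing periods.
    first_line = first_line.strip(" \"'`.*")
    # Cap the name at max_words words.
    return " ".join(first_line.split()[:max_words])

print(clean_topic_name('"Medical Image Segmentation."'))
print(clean_topic_name("Topic name: Deep Learning for Medical Image Segmentation\nBecause..."))
```

It could be applied to the return value of `generate_topic_name` before the name is stored.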


In [18]:
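renamed_topics = {}

# Ask the LLM to name each topic from its keywords and first 5 sample abstracts.
# Note: this makes one API call per topic.
for entry in llm_input:
    name = generate_topic_name(entry["topic_keywords"], entry["sample_abstracts"][:5])
    renamed_topics[entry["topic_num"]] = name
    print(f"Topic {entry['topic_num']}: {name}")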

Topic 0: Stellar Energy Variability
Topic 1: Quantum Material Properties
Topic 2: Algebraic Curve Construction
Topic 3: Medical Image Segmentation
Topic 4: Discrete Wave Equations
Topic 5: Audio Manipulation Techniques
Topic 6: Magic Distance Graphs
Topic 7: Wireless Channel Communication
Topic 8: Logical Composition Theory
Topic 9: Generative Image Editing
Topic 10: Bayesian Estimation Methods
Topic 11: Stochastic Distribution Analysis
Topic 12: Renewable Energy Optimization
Topic 13: Multilingual Language Models
Topic 14: Multimodal Image Understanding
Topic 15: Automated Program Repair
Topic 16: Video Compression Techniques
Topic 17: Secure Computation Acceleration
Topic 18: Geometric Graph Learning
Topic 19: Climate Forecasting Model
Topic 20: Blockchain Security Analysis
Topic 21: Crop Change Detection
Topic 22: Misinformation Detection Methods
Topic 23: Adversarial Attack Detection
Topic 24: Financial Risk Assessment
Topic 25: Memory-Augmented RL Agent
Topic 26: Causal Inference 
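Because every topic name costs an API call, it can help to store the names on disk and reload them in later sessions instead of regenerating them. The sketch below uses JSON for this; the file name `topic_names.json` and both helper functions are assumptions for illustration. JSON object keys are always strings, so the loader converts them back to integers to match the values in the `topic` column.

```python
import json
import tempfile
from pathlib import Path

def save_topic_names(names: dict[int, str], path: Path) -> None:
    """Write the topic-id-to-name mapping as JSON."""
    Path(path).write_text(json.dumps(names, indent=2), encoding="utf-8")

def load_topic_names(path: Path) -> dict[int, str]:
    """Read the mapping back, restoring integer topic ids; return {} if no cache exists."""
    path = Path(path)
    if not path.exists():
        return {}
    raw = json.loads(path.read_text(encoding="utf-8"))
    return {int(topic_id): name for topic_id, name in raw.items()}

# Round trip in a temporary directory (two names taken from the output above)
with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "topic_names.json"
    save_topic_names({0: "Stellar Energy Variability", 3: "Medical Image Segmentation"}, cache_path)
    restored = load_topic_names(cache_path)
    missing = load_topic_names(Path(tmp) / "absent.json")

print(restored)
```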

In [20]:
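# Map topic ids to names on the filtered rows; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice
df_filtered = df_filtered.assign(**{"Topic Name": df_filtered["topic"].map(renamed_topics)})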

In [21]:
df_filtered

Unnamed: 0,id,title,abstract,authors,published_date,link,markdown,processed_abstracts,topic,Topic Name
1,2307.16362,High Sensitivity Beamformed Observations of th...,We analyzed four epochs of beamformed EVN data...,"Rebecca Lin, Marten H. van Kerkwijk",2023-07-31T01:36:55Z,http://arxiv.org/abs/2307.16362v2,# High Sensitivity Beamformed Observations of ...,analyzed four epoch beamformed evn data crab p...,0,Stellar Energy Variability
3,2309.09088,Enhancing GAN-Based Vocoders with Contrastive ...,Vocoder models have recently achieved substant...,"Haoming Guo, Seth Z. Zhao, Jiachen Lian, Gopal...",2023-09-16T20:04:16Z,http://arxiv.org/abs/2309.09088v2,# Enhancing Gan-Based Vocoders with Contrastiv...,vocoder model recently achieved substantial pr...,5,Audio Manipulation Techniques
4,2307.16404,Nonvolatile Magneto-Thermal Switching in MgB2,Ongoing research explores thermal switching ma...,"Hiroto Arima, Yoshikazu Mizuguchi",2023-07-31T04:59:19Z,http://arxiv.org/abs/2307.16404v1,# Nonvolatile Magneto-Thermal Switching in MgB...,ongoing research explores thermal switching ma...,1,Quantum Material Properties
6,2304.00044,On The Theory of Ring Afterglows,Synchrotron and inverse Compton emission succe...,"Marcus DuPont, Andrew MacFadyen, Re'em Sari",2023-03-31T18:02:12Z,http://arxiv.org/abs/2304.00044v1,# On The Theory of Ring Afterglows\n\n###### A...,synchrotron inverse compton emission successfu...,0,Stellar Energy Variability
8,2309.07927,Kid-Whisper: Towards Bridging the Performance ...,Recent advancements in Automatic Speech Recogn...,"Ahmed Adel Attia, Jing Liu, Wei Ai, Dorottya D...",2023-09-12T06:58:18Z,http://arxiv.org/abs/2309.07927v3,Kid-Whisper: Towards Bridging the Performance ...,recent advancement automatic speech recognitio...,5,Audio Manipulation Techniques
...,...,...,...,...,...,...,...,...,...,...
63350,2305.10173,"Quantum theory without the Axiom of choice, an...","In this conceptual paper, we discuss quantum f...",Koen Thas,2023-05-17T12:57:19Z,http://arxiv.org/abs/2305.10173v1,"# Quantum theory without the axiom of choice, ...",conceptual paper discus quantum formalism use ...,1,Quantum Material Properties
63351,2307.11414,The Derived Deligne Conjecture,Derived $A_\infty$-algebras have a wealth of t...,"Javier Aguilar Martín, Constanze Roitzheim",2023-07-21T08:16:23Z,http://arxiv.org/abs/2307.11414v3,# The derived Deligne conjecture\n\n###### Abs...,derived a_ infty algebra wealth theoretical ad...,2,Algebraic Curve Construction
63352,2306.06241,Almost paratopological groups,A class of almost paratopological groups is in...,Evgenii Reznichenko,2023-06-09T20:27:33Z,http://arxiv.org/abs/2306.06241v2,# Almost paratopological groups\n\n###### Abst...,class almost paratopological group introduced ...,2,Algebraic Curve Construction
63354,2303.04288,Polynomial Time and Private Learning of Unboun...,We study the problem of privately estimating t...,"Jamil Arbas, Hassan Ashtiani, Christopher Liaw",2023-03-07T23:24:27Z,http://arxiv.org/abs/2303.04288v2,# Polynomial Time and Private Learning of Unbo...,study problem privately estimating parameter d...,62,Private Data Mechanisms


In [23]:
df_filtered.to_csv("BERTopic_output.csv", index=False)

## Visualize Topics

Visualizing the discovered topics helps in understanding the distribution and relationships between different topics. BERTopic provides several visualization tools to aid in this analysis:
1. **Topic Similarity Heatmap**: Shows the pairwise similarity between topics.
2. **Top Words per Topic**: Shows bar charts of the highest-scoring words for each topic.
3. **Intertopic Distance Map**: Shows how topics are related to each other in two dimensions.
4. **Topic Hierarchy**: Displays the hierarchical structure of topics.

Let's generate these visualizations to gain insights into the topic structure.


In [24]:
topic_model.visualize_heatmap()

In [25]:
topic_model.visualize_barchart()

In [26]:
topic_model.visualize_topics()

In [27]:
topic_model.visualize_hierarchy()

## Conclusion

In this project, we applied BERTopic to the **neuralwork/arxiver** dataset to uncover thematic clusters within research paper abstracts. The key findings include:

- **Diverse Topic Distribution**: The model found 95 topics spanning a wide range of scientific domains, including astrophysics, quantum physics, algebra, medical imaging, and natural language processing.
- **Topic Coverage**: About two-thirds of the abstracts (42,635 of 63,356) were assigned to a topic, while the remaining 20,721 were labeled as outliers (topic -1).
- **Insightful Visualizations**: The similarity heatmap, intertopic distance map, topic hierarchy, and top words per topic show how the topics relate to each other and what each one covers.

### Potential Applications

- **Literature Review Automation**: Assisting researchers in quickly identifying relevant papers based on thematic clusters.
- **Trend Analysis**: Monitoring the evolution of research topics over time to identify emerging areas of interest.
- **Recommendation Systems**: Suggesting related papers or topics to researchers based on their areas of interest.

### Future Work

- **Fine-Tuning BERTopic**: Experimenting with different embedding models or hyperparameters to enhance topic coherence.
- **Expanding the Dataset**: Incorporating more recent papers or additional datasets to broaden the scope of analysis.
- **Interactive Visualizations**: Creating interactive dashboards for dynamic exploration of topics and their relationships.

This project demonstrates the power of advanced topic modeling techniques like BERTopic in organizing and making sense of vast amounts of scientific literature.
