<a href="https://colab.research.google.com/github/khushidubeyokok/BERTopic/blob/main/Title_vs_Abstract_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 Title vs. Abstract Analysis  
This notebook explores the relationship between topics derived from **research paper titles** and their **abstracts** using BERTopic.

We will use the [titles_topics_probabilities.csv](https://raw.githubusercontent.com/khushidubeyokok/BERTopic/refs/heads/main/titles_topics_probabilities.csv) file created in our previous [topic modeling notebook](https://github.com/khushidubeyokok/BERTopic/blob/main/BERTopic.ipynb) which contains paper titles, assigned topics, and topic probabilities.

### Objectives:
1. **Analyze** whether research paper titles carry meaningful information about their abstracts.  
2. **Compare** topic probabilities between titles and abstracts for deeper insights.  
3. Provide data-driven suggestions for improving title and abstract alignment in academic papers.

By the end, we’ll generate visualizations and a correlation analysis to validate our findings. Let’s dive in! 🚀



## 🛠️ Step 1: Load Essential Libraries  
To kick things off, we’ll import the necessary Python libraries, including BERTopic for topic modeling and data manipulation tools like Pandas.


In [13]:
# Import necessary libraries
!pip -q install bertopic
!pip -q install transformers
from bertopic import BERTopic
import pandas as pd
import matplotlib.pyplot as plt

## 📂 Step 2: Load and Preview Data  
We’ll load the dataset containing **topics and probabilities** from abstracts and take a quick look to ensure everything is in order.  
This dataset will be critical for comparison later on.


In [14]:
# Load the CSV file with topics based on abstracts
url = "https://raw.githubusercontent.com/khushidubeyokok/BERTopic/refs/heads/main/titles_topics_probabilities.csv"
df = pd.read_csv(url)

# Display the first few rows of the dataset
df.head()

|   | Title | Topic | Probability |
|---|-------|-------|-------------|
| 0 | Image Completion via Dual-path Cooperative Fil... | -1 | 0.0 |
| 1 | High Sensitivity Beamformed Observations of th... | 0 | 1.0 |
| 2 | Maybe, Maybe Not: A Survey on Uncertainty in V... | -1 | 0.0 |
| 3 | Enhancing GAN-Based Vocoders with Contrastive ... | 4 | 0.758336 |
| 4 | Nonvolatile Magneto-Thermal Switching in MgB2 | 39 | 0.78135 |
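
A few explicit checks can make "in order" concrete. The sketch below is illustrative only: it runs on a small hand-built frame shaped like the preview (the titles are placeholders, not real data), and it assumes that topic `-1` marks outliers, which is BERTopic's convention. The variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in shaped like the preview; titles are placeholders, not real data.
sample = pd.DataFrame({
    "Title": ["paper a", "paper b", "paper c", "paper d", "paper e"],
    "Topic": [-1, 0, -1, 4, 39],
    "Probability": [0.0, 1.0, 0.0, 0.758336, 0.78135],
})

# Basic checks: expected columns, no missing titles, probabilities within [0, 1].
expected_columns = {"Title", "Topic", "Probability"}
missing_columns = expected_columns - set(sample.columns)
has_missing_titles = bool(sample["Title"].isna().any())
probabilities_valid = bool(sample["Probability"].between(0, 1).all())

# Share of papers left unassigned (topic -1).
outlier_share = float((sample["Topic"] == -1).mean())

print(missing_columns, has_missing_titles, probabilities_valid, outlier_share)
```

Running the same checks on `df` instead of `sample` would show how many abstracts ended up as outliers before any comparison with titles.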


## 📝 Step 3: Preprocessing Titles for Topic Modeling
Next, we will extract the titles from the dataset and prepare them for topic modeling. Since we already have topic probabilities for abstracts, we will focus on the titles and apply BERTopic to them.


In [16]:
# Extract titles from the dataset
titles = df['Title'].values
titles

array(['Image Completion via Dual-path Cooperative Filtering',
       "High Sensitivity Beamformed Observations of the Crab Pulsar's Radio\n  Emission",
       'Maybe, Maybe Not: A Survey on Uncertainty in Visualization', ...,
       'Polynomial Time and Private Learning of Unbounded Gaussian Mixture\n  Models',
       'Improving Knot Prediction in Wood Logs with Longitudinal Feature\n  Propagation',
       'Attention-based Multi-task Learning for Base Editor Outcome Prediction'],
      dtype=object)
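
Several titles in the output above contain an embedded line break followed by spaces (`\n  `). A minimal sketch of one way to tidy them is shown below. `normalize_title` is a hypothetical helper, not part of the notebook, and the run recorded here did not apply it, so its effect on the resulting topics is untested.

```python
import re

def normalize_title(title: str) -> str:
    """Collapse any run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", title).strip()

# Two titles copied from the array shown above.
raw_titles = [
    "High Sensitivity Beamformed Observations of the Crab Pulsar's Radio\n  Emission",
    "Image Completion via Dual-path Cooperative Filtering",
]
clean_titles = [normalize_title(t) for t in raw_titles]
print(clean_titles[0])
```

Applied to the notebook's data, this would be `titles = [normalize_title(t) for t in df['Title']]`, which also hands BERTopic a plain list of strings.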

## 🤖 Step 4: Title-Based Topic Modeling  
We now use **BERTopic** to uncover hidden patterns in research paper titles, clustering them into topics.  
This process involves:
1. Fitting the model on titles.
2. Assigning topics and probabilities to each title.
3. Saving the results for further analysis.  


In [19]:
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Use BERTopic with the same model configuration used for the abstracts
topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",   # Efficient embedding model for quick computation
    umap_model=UMAP(
        n_neighbors=10,                  # Lower neighbors for tighter clusters
        n_components=5,                  # Dimensionality reduction
        min_dist=0.1,                    # Slight separation between clusters
        metric='cosine'                  # Cosine similarity for text data
    ),
    hdbscan_model=HDBSCAN(
        min_cluster_size=60,             # Increase to reduce topic count
        min_samples=15,                  # Higher values are more conservative and mark more points as outliers
        metric='euclidean',              # Works well with UMAP output
        cluster_selection_method='eom'   # Keeps distinct clusters
    ),
    top_n_words=10                       # Top words per topic
)

# Fit the model on the titles
topics, probs = topic_model.fit_transform(titles)

# Add the topic assignments to the dataframe
df['Title_topics'] = topics

# Display the updated dataframe with topics for titles
df[['Title', 'Title_topics']].head()

|   | Title | Title_topics |
|---|-------|--------------|
| 0 | Image Completion via Dual-path Cooperative Fil... | -1 |
| 1 | High Sensitivity Beamformed Observations of th... | 121 |
| 2 | Maybe, Maybe Not: A Survey on Uncertainty in V... | 132 |
| 3 | Enhancing GAN-Based Vocoders with Contrastive ... | 128 |
| 4 | Nonvolatile Magneto-Thermal Switching in MgB2 | -1 |
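
The third item in the Step 4 list, saving the results, could look like the sketch below. It uses toy stand-ins for `df`, `topics`, and `probs`, and assumes `probs` holds one value per title, which should be the case when `calculate_probabilities` is left at its default. The column name `Title_probability` is a hypothetical choice.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's `df`, `topics`, and `probs`.
df_demo = pd.DataFrame({"Title": ["paper a", "paper b", "paper c"]})
topics_demo = [-1, 2, 2]
probs_demo = np.array([0.0, 0.91, 0.64])

# Store both the assigned topic and its probability next to each title.
df_demo["Title_topics"] = topics_demo
df_demo["Title_probability"] = probs_demo

# Write to an in-memory buffer here; in the notebook a file path would take its place.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so reloading it with `pd.read_csv` does not produce an extra unnamed column.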


## 🔍 Step 5: Topic Modeling Analysis
Now that we have topic assignments for both titles and abstracts, we can compare these to explore how well the title's topic correlates with the abstract's topic.

In [20]:
# Display the distribution of topics for titles (from the model fitted in Step 4)
display(topic_model.get_topic_info())

# Display the distribution of topics for abstracts. The abstract model is not loaded
# in this notebook, so count the precomputed `Topic` column instead.
df['Topic'].value_counts()
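
Because the title model and the abstract model were fitted separately, their topic numbers are arbitrary labels: title topic 121 and abstract topic 0 may describe the same subject. Comparing the IDs directly is therefore not meaningful. One possible approach, sketched here on toy labels rather than the real data, is a contingency table plus label-permutation-invariant scores from scikit-learn. This is an illustrative technique, not necessarily the analysis the notebook goes on to perform.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy labels standing in for df['Topic'] (abstracts) and df['Title_topics'] (titles).
toy = pd.DataFrame({
    "Topic":        [0, 0, 0, 1, 1, 1, -1, -1],
    "Title_topics": [5, 5, 5, 7, 7, 7, -1, 5],
})

# Keep only papers that both models assigned to a real topic (not -1).
assigned = toy[(toy["Topic"] != -1) & (toy["Title_topics"] != -1)]

# Contingency table: rows are abstract topics, columns are title topics.
table = pd.crosstab(assigned["Topic"], assigned["Title_topics"])

# Agreement scores that ignore how topics are numbered (1.0 means identical groupings).
ari = adjusted_rand_score(assigned["Topic"], assigned["Title_topics"])
nmi = normalized_mutual_info_score(assigned["Topic"], assigned["Title_topics"])

print(table)
print(ari, nmi)
```

On the real data, the same calls would take `df['Topic']` and `df['Title_topics']`. The share of rows dropped by the `-1` filter is worth reporting alongside the scores, since it shows how many papers either model could not place.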