# Topic Modelling on The NSF Research Awards Abstracts dataset using BERTopic

In this notebook, I'll apply topic modelling using text embeddings with BERTopic. Topic modelling takes a corpus of text and identifies clusters of documents that share common themes or similar words.

Traditionally, topic modelling relies on techniques like LDA (Latent Dirichlet Allocation) or LSA (Latent Semantic Analysis). In this notebook, I instead use embeddings generated by transformer-based models to get a dense representation of the meaning of each text. Here, each text is one abstract from the NSF Research Awards Abstracts dataset.

I'll start by reading the data that was prepared by another notebook.

In [1]:
import pandas as pd
import os

abstracts_df = pd.read_csv(os.path.join('data', 'processed', 'abstracts.csv'))
# https://www.nsf.gov/awardsearch/showAward?AWD_ID=2053734&HistoricalAwards=false
# Drop rows with a missing award ID or abstract; they cannot be used for topic modelling.
abstracts_df.dropna(subset=['award_id', 'abstract'], inplace=True)

## Train the topic model

I'll train the topic model using BERTopic, changing only the default value of `min_topic_size`. I increased it from the default of 10 to **50** so that each topic contains more abstracts, which gives broader and more interesting topics. I also pass a vectorizer that removes stop words **after** the embeddings are computed and the topics are found, as advised in the [tips and tricks](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#removing-stop-words) section of the BERTopic documentation.

In [11]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(verbose=True, min_topic_size=50, vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(abstracts_df['abstract'].tolist())
print(f'# of topics discovered: {len(topic_model.get_topic_info())}')

Batches:   0%|          | 0/412 [00:00<?, ?it/s]

2023-10-06 12:19:10,058 - BERTopic - Transformed documents to Embeddings
2023-10-06 12:19:15,214 - BERTopic - Reduced dimensionality
2023-10-06 12:19:15,633 - BERTopic - Clustered reduced embeddings


# of topics discovered: 49
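To make the effect of the vectorizer more concrete, here is a minimal sketch of what `CountVectorizer(stop_words="english")` keeps when it builds the vocabulary. The two sentences in `toy_docs` are made up for illustration and are not taken from the dataset; only the vectorizer configuration matches the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences standing in for abstracts that ended up in the same topic.
toy_docs = [
    "the students are learning engineering in the classroom",
    "this project supports the learning of stem students",
]

# Same configuration as the notebook's vectorizer_model.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index in `counts`.
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
```

Words such as "the", "are" and "of" never reach the vocabulary, so they cannot appear among the keywords that describe a topic. The embeddings are unaffected, because they are computed from the full, unfiltered text.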


## Topic representation

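The topic table has 49 rows. One of them is the outlier group `-1`, which collects the abstracts that were not assigned to any topic, so the model found 48 topics. Let's look at the first 10 rows: the outlier group and the nine largest topics.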

In [12]:
topic_model.get_topic_info().head(10)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5487 | -1_research_project_data_using |
| 1 | 0 | 617 | 0_stem_engineering_learning_students |
| 2 | 1 | 589 | 1_physics_stars_universe_matter |
| 3 | 2 | 530 | 2_theory_geometry_equations_algebraic |
| 4 | 3 | 493 | 3_species_plant_fellow_research |
| 5 | 4 | 342 | 4_ice_climate_ocean_sea |
| 6 | 5 | 328 | 5_mantle_seismic_earthquakes_subduction |
| 7 | 6 | 282 | 6_patients_patient_health_business |
| 8 | 7 | 246 | 7_proteins_cell_protein_cells |
| 9 | 8 | 231 | 8_water_zone_critical_soil |
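Because the table includes a row for topic `-1`, its length is not quite the number of topics. The following is a small sketch of how the topic count and the share of outlier documents could be computed from a list of topic assignments. The helper `summarize_topics` and the list `toy_topics` are hypothetical names introduced here; in this notebook the input would be the `topics` list returned by `fit_transform`.

```python
from collections import Counter

def summarize_topics(topics):
    """Return (number of real topics, number of outlier documents, outlier share)."""
    counts = Counter(topics)
    n_outliers = counts.get(-1, 0)
    # Topic -1 is the outlier group, so it is not counted as a topic.
    n_topics = len(counts) - (1 if -1 in counts else 0)
    outlier_share = n_outliers / len(topics) if topics else 0.0
    return n_topics, n_outliers, outlier_share

# Toy assignments for illustration only.
toy_topics = [-1, 0, 0, 1, -1, 2, 1, 0]
n_topics, n_outliers, share = summarize_topics(toy_topics)
print(f"{n_topics} topics, {n_outliers} outlier documents ({share:.0%})")
```

A large outlier share, as in the first row of the table, is worth keeping in mind when interpreting the remaining topics, since those abstracts are not represented by any of them.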


In [13]:
topic_model.visualize_barchart(top_n_topics=8, height=700)

From the last visualization, I can identify eight topics: 1) STEM education, 2) astrophysics, 3) mathematics, 4) biology (botany), 5) meteorology (or climate change), 6) seismology and earthquakes, 7) health, and 8) microbiology.

## Topic relationships

I'd like to check how distinct each topic is. Some topics may be similar to others, in which case we could merge them or explore them further. To do this, I use a 2D representation of the topics obtained via [UMAP](https://umap-learn.readthedocs.io/en/latest/).

In [14]:
topic_model.visualize_topics(top_n_topics=47)
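When two bubbles overlap on the map, a numeric check can help decide whether the topics are merge candidates. The sketch below computes pairwise cosine similarity between topic vectors and lists the pairs above a threshold. The function `similar_topic_pairs`, the threshold of 0.9 and the toy vectors are all assumptions made for illustration. With the fitted model, the vectors could plausibly come from `topic_model.topic_embeddings_`, but I have not verified the row order there (the outlier topic may occupy the first row).

```python
import numpy as np

def similar_topic_pairs(topic_vectors, threshold=0.9):
    """Return (i, j, similarity) for row pairs with cosine similarity >= threshold."""
    vectors = np.asarray(topic_vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T
    pairs = []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: -p[2])

# Toy vectors: rows 0 and 1 point in nearly the same direction, row 2 does not.
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
pairs = similar_topic_pairs(toy_vectors, threshold=0.9)
print(pairs)
```

Pairs found this way are only candidates. After reading a few abstracts from each topic, they could be combined with BERTopic's `merge_topics` method.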

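Another way of visualizing the relationships among topics is a hierarchy (dendrogram). As far as I understand, `visualize_hierarchy` does not reuse the cluster tree of the [HDBSCAN algorithm](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) that generated the topics; instead, it applies agglomerative clustering to the distances between the topic representations.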

In [16]:
topic_model.visualize_hierarchy(top_n_topics=47, width=1000)
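To illustrate how a topic hierarchy of this kind can be derived, here is a minimal sketch with SciPy: it computes cosine distances between a few made-up topic vectors, builds a linkage tree, and cuts it into two groups. The vectors, the average linkage method and the cut into two clusters are my own choices for the example and are not necessarily what BERTopic uses internally.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Made-up topic vectors: rows 0-1 are close to each other, and so are rows 2-3.
toy_topic_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])

# Condensed matrix of pairwise cosine distances between topics.
distances = pdist(toy_topic_vectors, metric="cosine")

# Agglomerative clustering: repeatedly merge the two closest groups.
tree = linkage(distances, method="average")

# Cut the tree so that at most two groups remain.
groups = fcluster(tree, t=2, criterion="maxclust")
print(groups)
```

Topics that merge low in the tree are the ones most alike, which is the same information the dendrogram above conveys visually.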

## References

* [Topic Modeling arXiv Abstract with BERTopic](https://www.kaggle.com/code/maartengr/topic-modeling-arxiv-abstract-with-bertopic/notebook)
* [Tips & Tricks - BERTopic](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html)