### Clustering

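This notebook uses BERTopic to model topics in arXiv NLP paper abstracts. It embeds each abstract with a sentence-transformer model, reduces the embeddings with UMAP, clusters them with HDBSCAN, and extracts representative words for each cluster, showing how the resulting topics can be used to analyze research trends in NLP.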
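
By default, BERTopic derives a topic's representative words from a class-based TF-IDF (c-TF-IDF): all documents in a cluster are treated as one long document, and a term scores highly when it is frequent in that cluster but rare overall. The following is a minimal pure-Python sketch of that idea, assuming the weighting tf × log(1 + A / f), where A is the average number of words per cluster and f is the term's total frequency across clusters. The toy documents, labels, and the helper name `ctfidf_top_words` are illustrative only, and the library's own implementation adds tokenization and normalization options not shown here.

```python
import math
from collections import Counter, defaultdict


def ctfidf_top_words(docs, labels, top_n=3):
    """Return the top_n highest-scoring words per cluster label (toy c-TF-IDF)."""
    # Merge all documents of a cluster into one bag of words.
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(doc.lower().split())

    # total_counts[t]: frequency of term t across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # Average number of words per cluster (the "A" in the formula).
    avg_words = sum(total_counts.values()) / len(class_counts)

    top_words = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores = {
            term: (tf / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        # Sort by score (descending); break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        top_words[label] = ranked[:top_n]
    return top_words


# Hypothetical mini-corpus with two clusters.
toy_docs = [
    "speech recognition with acoustic models",
    "neural speech recognition systems",
    "question answering with retrieval",
    "open domain question answering",
]
toy_labels = [0, 0, 1, 1]

print(ctfidf_top_words(toy_docs, toy_labels, top_n=2))
```

In this toy corpus the word "with" appears in both clusters, so it is down-weighted relative to words that are specific to one cluster.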

In [1]:
# Install dependencies before importing them. The `umap` module comes from the
# `umap-learn` package, so remove the unrelated `umap` package if it is present
# (-y skips the confirmation prompt, which would otherwise block the cell).
!pip uninstall -y umap
!pip install datasets umap-learn hdbscan bertopic

from copy import deepcopy

import numpy as np
import pandas as pd
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from datasets import load_dataset
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP

In [2]:
dataset = load_dataset("maartengr/arxiv_nlp")['train']
abstracts = dataset['Abstracts']
titles = dataset['Titles']

In [3]:
embedding_model = SentenceTransformer("thenlper/gte-small")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

In [4]:
embeddings.shape

(44949, 384)

In [5]:
umap_model = UMAP(n_components=5, min_dist=0.0, metric='cosine', random_state=42)
reduced_embeddings = umap_model.fit_transform(embeddings)
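
With `metric='cosine'`, UMAP compares abstracts by the angle between their embedding vectors rather than by vector length. A minimal sketch of cosine distance on made-up 3-dimensional vectors (the real embeddings here have 384 dimensions; the helper name `cosine_distance` is an assumption for illustration):

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice the length
c = [0.0, 0.0, 3.0]  # orthogonal to a

print(cosine_distance(a, b))  # close to 0: direction matches, length is ignored
print(cosine_distance(a, c))  # 1.0: no shared direction
```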

In [6]:
hdbscan_model = HDBSCAN(min_cluster_size=50, metric="euclidean", cluster_selection_method="eom").fit(reduced_embeddings)

In [7]:
clusters = hdbscan_model.labels_
len(set(clusters))

153
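
Note that `len(set(clusters))` counts every distinct label. HDBSCAN assigns the label -1 to points it treats as noise, so the figure includes that noise label whenever at least one point is an outlier. A small sketch on made-up labels that separates the two counts:

```python
from collections import Counter

toy_labels = [0, 0, 1, -1, 2, 1, -1, 0]  # hypothetical HDBSCAN labels

label_counts = Counter(toy_labels)
n_labels = len(label_counts)                                      # distinct labels, noise included
n_clusters = len([label for label in label_counts if label != -1])  # actual clusters only
n_noise = label_counts[-1]                                        # points labeled as noise

print(n_labels, n_clusters, n_noise)  # 4 3 2
```

With the notebook's `clusters` array, the same counts should follow from `Counter(clusters.tolist())`.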

In [8]:
cluster = 0
for index in np.where(clusters==cluster)[0][:3]:
  print(abstracts[index][:300] + "... \n")

  This works aims to design a statistical machine translation from English text
to American Sign Language (ASL). The system is based on Moses tool with some
modifications and the results are synthesized through a 3D avatar for
interpretation. First, we translate the input text to gloss, a written fo... 

  Researches on signed languages still strongly dissociate lin- guistic issues
related on phonological and phonetic aspects, and gesture studies for
recognition and synthesis purposes. This paper focuses on the imbrication of
motion and meaning for the analysis, synthesis and evaluation of sign lang... 

  Modern computational linguistic software cannot produce important aspects of
sign language translation. Using some researches we deduce that the majority of
automatic sign language translation systems ignore many aspects when they
generate animation; therefore the interpretation lost the truth inf... 



In [9]:
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
).fit(abstracts, embeddings)

2025-01-09 10:57:22,544 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-09 10:58:18,838 - BERTopic - Dimensionality - Completed ✓
2025-01-09 10:58:18,840 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-09 10:58:20,809 - BERTopic - Cluster - Completed ✓
2025-01-09 10:58:20,823 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-01-09 10:58:25,276 - BERTopic - Representation - Completed ✓


In [10]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 14462 | -1_the_of_and_to | [the, of, and, to, in, we, for, that, language... | [ Cross-lingual text classification aims at t... |
| 1 | 0 | 2241 | 0_question_questions_qa_answer | [question, questions, qa, answer, answering, a... | [ Question generation (QG) attempts to solve ... |
| 2 | 1 | 2098 | 1_speech_asr_recognition_end | [speech, asr, recognition, end, acoustic, audi... | [ End-to-end models have achieved impressive ... |
| 3 | 2 | 903 | 2_image_visual_multimodal_images | [image, visual, multimodal, images, vision, mo... | [ In this paper we propose a model to learn m... |
| 4 | 3 | 887 | 3_summarization_summaries_summary_abstractive | [summarization, summaries, summary, abstractiv... | [ We present a novel divide-and-conquer metho... |
| ... | ... | ... | ... | ... | ... |
| 148 | 147 | 54 | 147_counseling_mental_therapy_health | [counseling, mental, therapy, health, psychoth... | [ Mental health care poses an increasingly se... |
| 149 | 148 | 53 | 148_chatgpt_its_openai_has | [chatgpt, its, openai, has, it, tasks, capabil... | [ Over the last few years, large language mod... |
| 150 | 149 | 52 | 149_mixed_code_sentiment_mixing | [mixed, code, sentiment, mixing, english, anal... | [ In today's interconnected and multilingual ... |
| 151 | 150 | 51 | 150_diffusion_generation_autoregressive_text | [diffusion, generation, autoregressive, text, ... | [ Diffusion models have achieved great succes... |
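
The first row is the outlier topic -1, which holds the abstracts that were not assigned to any cluster. Using the counts reported in this notebook (14462 outliers out of the 44949 embedded abstracts), the outlier share can be worked out as:

```python
n_outliers = 14462   # Count reported for topic -1 by get_topic_info()
n_documents = 44949  # number of rows reported by embeddings.shape

outlier_share = n_outliers / n_documents
print(f"{outlier_share:.1%}")  # 32.2%
```

Roughly a third of the corpus is therefore left unassigned under these clustering settings.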


In [11]:
original_topics = deepcopy(topic_model.topic_representations_)

In [12]:
def topic_differences(model, original_topics, nr_topics=5):
  df = pd.DataFrame(columns=['Topic', 'Original', 'Updated'])
  for topic in range(nr_topics):
    og_words = " | ".join(list(zip(*original_topics[topic]))[0][:5])
    new_words = " | ".join(list(zip(*model.get_topic(topic)))[0][:5])
    df.loc[len(df)] = [topic, og_words, new_words]
  return df
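
`topic_differences` relies on each topic representation being a list of `(word, score)` pairs: `zip(*pairs)` transposes that list into one tuple of words and one tuple of scores, and `[0][:5]` keeps the first five words. A toy illustration with made-up scores:

```python
# Hypothetical (word, score) pairs in the shape the helper expects.
pairs = [("speech", 0.9), ("asr", 0.8), ("recognition", 0.7)]

words, scores = zip(*pairs)   # transpose into a tuple of words and a tuple of scores
print(words)                  # ('speech', 'asr', 'recognition')
print(" | ".join(words[:2]))  # speech | asr
```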

In [13]:
representation_model = KeyBERTInspired()
topic_model.update_topics(abstracts, representation_model=representation_model)
topic_differences(topic_model, original_topics)

| | Topic | Original | Updated |
|---|---|---|---|
| 0 | 0 | question \| questions \| qa \| answer \| answering | answering \| questions \| answer \| question \| co... |
| 1 | 1 | speech \| asr \| recognition \| end \| acoustic | phonetic \| encoder \| language \| speech \| trans... |
| 2 | 2 | image \| visual \| multimodal \| images \| vision | captioning \| multimodal \| visual \| visually \| ... |
| 3 | 3 | summarization \| summaries \| summary \| abstract... | summarization \| summarizers \| summaries \| summ... |
| 4 | 4 | translation \| nmt \| machine \| neural \| bleu | translation \| translate \| translated \| transla... |
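
KeyBERTInspired reorders each topic's candidate words by how close their embeddings are to an embedding of the topic's documents, which would explain why the updated lists lean toward semantically related terms rather than merely frequent ones. Below is a simplified sketch of that reranking step, assuming cosine similarity against a single topic vector. The 2-dimensional vectors and the helper `rerank_by_similarity` are invented for illustration and stand in for real model embeddings; this is not the library's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rerank_by_similarity(candidates, topic_vector, top_n=3):
    """Order candidate words by similarity to the topic vector, most similar first."""
    ranked = sorted(
        candidates,
        key=lambda word: cosine_similarity(candidates[word], topic_vector),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical embeddings: one vector for the topic, one per candidate word.
topic_vector = [1.0, 0.2]
candidates = {
    "the": [0.1, 1.0],          # frequent but semantically unrelated
    "translation": [0.9, 0.1],  # points in nearly the same direction as the topic
    "bleu": [0.7, 0.6],
}

print(rerank_by_similarity(candidates, topic_vector, top_n=2))  # ['translation', 'bleu']
```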
