<a href="https://colab.research.google.com/github/mZaiam/llm/blob/main/topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install bertopic



In [None]:
from datasets import load_dataset

from sentence_transformers import SentenceTransformer

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.representation import MaximalMarginalRelevance

from umap import UMAP

from sklearn.cluster import HDBSCAN

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Loading the arXiv abstracts dataset.

In [None]:
dataset = load_dataset("maartengr/arxiv_nlp")["train"]

abstracts = dataset["Abstracts"]
titles = dataset["Titles"]

Instantiating an embedding model and generating an embedding for each abstract.

In [None]:
embedding_model = SentenceTransformer("thenlper/gte-small")

embeddings = embedding_model.encode(abstracts)

Instantiating UMAP for dimensionality reduction and HDBSCAN for clustering the papers.

In [None]:
umap_model = UMAP(
    n_components=5,
    min_dist=0.0,
    metric='cosine',
)

hdbscan_model = HDBSCAN(
    min_cluster_size=100,
    metric="euclidean",
    cluster_selection_method="eom"
)

Creating a BERTopic pipeline with the previous models.

In [None]:
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model
).fit(abstracts, embeddings)

Printing the information about each topic. Topic -1 collects the abstracts that HDBSCAN labeled as outliers.

In [None]:
topic_model.get_topic_info()

,Topic,Count,Name,Representation,Representative_Docs
0,-1,13883,-1_the_of_and_to,"[the, of, and, to, in, we, for, that, on, lang...",[ Sentiment analysis (SA) has been a long-sta...
1,0,2664,0_dialogue_dialog_response_the,"[dialogue, dialog, response, the, to, and, res...",[ End-to-end task-oriented dialogue systems a...
2,1,2222,1_speech_asr_recognition_end,"[speech, asr, recognition, end, the, acoustic,...","[ All-neural, end-to-end ASR systems gained r..."
3,2,2014,2_question_questions_answer_qa,"[question, questions, answer, qa, answering, r...","[ In recent years, there have been amazing ad..."
4,3,1194,3_medical_clinical_biomedical_patient,"[medical, clinical, biomedical, patient, and, ...",[ Biomedical Named Entity Recognition (NER) i...
...,...,...,...,...,...
78,77,110,77_continual_forgetting_catastrophic_learning,"[continual, forgetting, catastrophic, learning...",[ Catastrophic forgetting (CF) is a phenomeno...
79,78,106,78_layout_document_documents_understanding,"[layout, document, documents, understanding, i...","[ In recent years, the use of multi-modal pre..."
80,79,106,79_srl_role_argument_labeling,"[srl, role, argument, labeling, predicate, sem...",[ Semantic role labeling (SRL) aims to discov...
81,80,105,80_nlp_crowdsourcing_data_to,"[nlp, crowdsourcing, data, to, and, that, of, ...",[ Crowdsourcing has been the prevalent paradi...


Taking a closer look at a specific topic.

In [None]:
topic_model.get_topic(3)

[('medical', np.float64(0.028219012249494756)),
 ('clinical', np.float64(0.025114363756328723)),
 ('biomedical', np.float64(0.01825143227798579)),
 ('patient', np.float64(0.011822913045745473)),
 ('and', np.float64(0.011013958626039592)),
 ('the', np.float64(0.010242919734656934)),
 ('of', np.float64(0.010181314206431596)),
 ('in', np.float64(0.01005104355868864)),
 ('health', np.float64(0.009597960507068671)),
 ('for', np.float64(0.0095361111857753))]
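The scores above come from BERTopic's default c-TF-IDF representation. A minimal sketch of that weighting, assuming the formula given in the BERTopic documentation (term frequency within a topic times `log(1 + A / f_x)`, where `A` is the average number of words per topic and `f_x` is the frequency of word `x` across all topics). The vocabulary and counts below are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Toy word counts per topic (rows: topics, columns: vocabulary).
# These numbers are hypothetical and only illustrate the weighting.
vocab = ["the", "medical", "speech", "clinical", "asr"]
counts = np.array([
    [50, 20, 0, 15, 0],   # a "medical"-like topic
    [50, 0, 25, 0, 10],   # a "speech"-like topic
], dtype=float)


def c_tf_idf(counts):
    # Term frequency: counts normalized within each topic.
    tf = counts / counts.sum(axis=1, keepdims=True)
    # Frequency of each word summed over all topics.
    f_x = counts.sum(axis=0)
    # Average number of words per topic.
    avg_words = counts.sum() / counts.shape[0]
    # The log term is always positive, so frequent words are
    # down-weighted but never pushed to zero.
    idf = np.log(1 + avg_words / f_x)
    return tf * idf


scores = c_tf_idf(counts)
for topic_id, row in enumerate(scores):
    top = np.argsort(row)[::-1][:3]
    print(topic_id, [(vocab[i], round(float(row[i]), 3)) for i in top])
```

Because the log term never reaches zero, a very frequent word such as "the" can still rank among the top words of a topic, which matches the stop words visible in the output above.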

Finding the topic closest to the search term ``transformers``.

In [None]:
idx = topic_model.find_topics('transformers')[0][0]

topic_model.get_topic(idx)

[('attention', np.float64(0.031004578079129506)),
 ('transformer', np.float64(0.02704728909479139)),
 ('pruning', np.float64(0.020618150053030422)),
 ('transformers', np.float64(0.015471128558229581)),
 ('compression', np.float64(0.013300308150075131)),
 ('the', np.float64(0.010859954888303838)),
 ('models', np.float64(0.010283083530349438)),
 ('heads', np.float64(0.010120370626904076)),
 ('self', np.float64(0.010071056542397415)),
 ('to', np.float64(0.009802507026317266))]
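``find_topics`` returns a tuple of topic ids and similarity scores, which is why the cell above indexes with ``[0][0]`` to get the best match. A rough sketch of the idea, assuming the search term is embedded and compared to one embedding per topic by cosine similarity. The 3-dimensional vectors below are hypothetical stand-ins for real embeddings.

```python
import numpy as np


def cosine_similarity(query, matrix):
    # Normalize both sides so the dot product equals the cosine.
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query


# Hypothetical topic embeddings (one row per topic) and query embedding.
topic_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
])
query_embedding = np.array([0.2, 0.9, 0.1])

similarities = cosine_similarity(query_embedding, topic_embeddings)
ranked = np.argsort(similarities)[::-1]   # most similar topic first
best_topic = int(ranked[0])
print(best_topic, similarities[ranked])
```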

Visualizing the documents with an interactive plot.

In [None]:
topic_model.visualize_documents(
    list(titles),
    embeddings=embeddings,
    width=1200,
    hide_annotations=True
)

Plotting the most important words for each topic.

In [None]:
topic_model.visualize_barchart()

Plotting the hierarchical structure of the topics.

In [None]:
topic_model.visualize_hierarchy()

Refining the topic representations with KeyBERTInspired, a representation model that re-ranks each topic's candidate words by embedding similarity.

In [None]:
representation_model = KeyBERTInspired()
topic_model.update_topics(
    abstracts,
    representation_model=representation_model
)

topic_model.visualize_barchart()

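Updating the representations with Maximal Marginal Relevance (MMR), which balances the relevance of each word against its redundancy with the words already selected. Each call to ``update_topics`` recomputes the representation, so this result replaces the KeyBERTInspired one from the previous step rather than building on it.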

In [None]:
mmr_model = MaximalMarginalRelevance(diversity=0.1)
topic_model.update_topics(
    abstracts,
    representation_model=mmr_model
)

topic_model.visualize_barchart()
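A minimal sketch of the MMR selection rule, assuming the usual formulation: the first word is the one most similar to the topic, and each subsequent word maximizes `(1 - diversity) * relevance - diversity * redundancy`. The function `mmr` and the 2-dimensional embeddings below are hypothetical and only illustrate the effect of the `diversity` parameter.

```python
import numpy as np


def mmr(topic_embedding, word_embeddings, words, top_n=2, diversity=0.1):
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    topic = normalize(topic_embedding)
    cand = normalize(word_embeddings)
    relevance = cand @ topic      # cosine similarity of each word to the topic
    redundancy = cand @ cand.T    # cosine similarity between words

    # Start with the most relevant word.
    selected = [int(np.argmax(relevance))]
    remaining = [i for i in range(len(words)) if i not in selected]

    while len(selected) < min(top_n, len(words)):
        # Penalty: highest similarity to any word already selected.
        penalty = redundancy[np.ix_(remaining, selected)].max(axis=1)
        score = (1 - diversity) * relevance[remaining] - diversity * penalty
        best = remaining[int(np.argmax(score))]
        selected.append(best)
        remaining.remove(best)

    return [words[i] for i in selected]


# Hypothetical embeddings: two near-duplicate words and one distinct word.
words = ["transformer", "transformers", "pruning"]
word_embeddings = np.array([[1.0, 0.25], [1.0, 0.2], [0.3, 1.0]])
topic_embedding = np.array([1.0, 0.3])

low = mmr(topic_embedding, word_embeddings, words, diversity=0.1)
high = mmr(topic_embedding, word_embeddings, words, diversity=0.7)
print(low)
print(high)
```

With a low `diversity` such as the 0.1 used above, near-duplicate words can both survive. A higher value favors words that differ from those already chosen.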