# BERTopic

# Data: ArXiv Articles

In this notebook we focus on topic modeling for a selection of ArXiv articles in machine learning and natural language processing. The full dataset comprises 44,949 articles, from which we draw a random sample of 1,000 abstracts for our text clustering exploration.

In [1]:
# check if we are using google colab
from pathlib import Path
import textwrap
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount("/content/drive")
    !pip install datasets transformers bertopic umap-learn hdbscan tiktoken openai -U -qq

    base_folder = Path("/content/drive/MyDrive/data")
else:
    base_folder = Path("/home/harpreet/Insync/google_drive_shaannoor/data")

Mounted at /content/drive

In [2]:
from pprint import pprint
import joblib
from umap import UMAP
from hdbscan import HDBSCAN
from scipy.cluster import hierarchy as sch
import seaborn as sns
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from copy import deepcopy
import numpy as np

import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, OpenAI

from google.colab import userdata
import openai
import tiktoken


In [3]:
def print_wrap(text, width=80):
    """
    Prints the given text, wrapping lines to a maximum of the specified width (default is 80 characters).

    Args:
    text (str): The text to be printed.
    width (int): The maximum width of a line, in characters.
    """
    wrapper = textwrap.TextWrapper(width=width)
    wrapped_text = wrapper.fill(text)
    print(wrapped_text)

In [4]:
data_folder = base_folder/'datasets/arxiv'
model_folder = base_folder/'models/nlp_fall_2023/clustering/arxiv'
model_folder.mkdir(exist_ok=True, parents=True)
data_folder.mkdir(exist_ok=True, parents=True)


## Import Data

In [5]:
dataset = load_dataset("maartengr/arxiv_nlp")["train"]

In [6]:
dataset

Dataset({
    features: ['Titles', 'Abstracts', 'Years', 'Categories'],
    num_rows: 44949
})

In [7]:
dataset_small = dataset.shuffle(seed=42).select(range(1000))

In [8]:
dataset_small.features

{'Titles': Value(dtype='string', id=None),
 'Abstracts': Value(dtype='string', id=None),
 'Years': Value(dtype='int64', id=None),
 'Categories': Value(dtype='string', id=None)}

## Extract Metadata

In [9]:
abstracts = dataset_small["Abstracts"]
years = dataset_small["Years"]
titles = dataset_small["Titles"]

## Get Embeddings

In [10]:
# We load our model
embedding_model = SentenceTransformer('all-mpnet-base-v2')

In [11]:
# The abstracts are converted to vector representations
embeddings = embedding_model.encode(abstracts)

In [12]:
joblib.dump(embeddings,model_folder/'arxiv_nlp_abstract_embeddings')

['/content/drive/MyDrive/data/models/nlp_fall_2023/clustering/arxiv/arxiv_nlp_abstract_embeddings']

In [13]:
embeddings_arxiv = joblib.load(model_folder/'arxiv_nlp_abstract_embeddings')

In [14]:
embeddings_arxiv.shape

(1000, 768)

# Topic Modeling

Topic modeling is an analytical technique used to uncover latent themes or topics in extensive textual data. It identifies sets of keywords or phrases that encapsulate the essence of each topic, making it an invaluable tool for deciphering common themes across large text corpora. This approach transforms vast, unstructured text collections into meaningful clusters of similar content.

One of the classical and widely adopted methods in topic modeling is **Latent Dirichlet Allocation (LDA)**, introduced by Blei et al. in 2003. LDA operates on the premise that each document within a corpus is a mixture of various topics, with each topic defined by a specific probability distribution over the corpus’s vocabulary. For instance, a document focusing on advanced AI techniques might frequently include terms like “neural networks”, “deep learning”, and “algorithm optimization”, suggesting its alignment with certain topics.

While traditional methods like LDA remain foundational, the advent of Large Language Models (LLMs) has opened new avenues in topic analysis. **BERTopic** builds on them: its flexible, modular architecture lets you swap in newly developed language models at each step, so the pipeline can improve as these models advance.

This marks a shift from purely probabilistic, bag-of-words methods toward contextually aware approaches. Because BERTopic clusters documents on contextual embeddings rather than raw word counts, its topics can capture meaning beyond word co-occurrence, while its topic descriptions still rely on a classical, interpretable word-weighting scheme.

## BERTopic

<img src="https://drive.google.com/uc?export=view&id=130GEogTzjQRQ8wckX2ljFKapjhEJKMOa" width="600"/>

BERTopic, a flexible and advanced topic modeling technique, is structured into two main sections: **Clustering** and **Topic Representation**. Each section utilizes customizable steps, allowing for the use of various methods according to specific needs.

### Clustering
1. **Embedding Documents**: While BERTopic defaults to sentence-transformers models like "all-MiniLM-L6-v2", it allows for the use of any embedding model that captures semantic similarities in documents.
2. **Dimensionality Reduction**: UMAP is the default choice for reducing the high-dimensional embeddings, but alternative methods like PCA can be employed to maintain the essential structure for clustering.
3. **Clustering Documents**: HDBSCAN is used by default due to its effectiveness in identifying diverse cluster shapes and densities, but BERTopic can integrate other clustering techniques as needed.

### Topic Representation
1. **Bag-of-Words Approach**: After clustering, BERTopic combines all documents in a cluster into a single document and uses a bag-of-words model to count word frequencies. This approach is chosen because it makes minimal assumptions about the structure of the clusters, focusing instead on the frequency and distribution of words to represent topics.
2. **Class-Based TF-IDF**: This modified TF-IDF method, unique to BERTopic, emphasizes words that distinguish one cluster from others, highlighting representative terms for each topic.
3. **Fine-Tuning**: Although BERTopic initially relies on c-TF-IDF for topic representation, it can incorporate additional fine-tuning techniques, like GPT or T5, to refine the topics further.

BERTopic's strength lies in its modularity, allowing users to adapt each step, from embedding to topic representation, to suit diverse datasets and objectives, ensuring both accuracy and relevance in the topics it generates.
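The bag-of-words and class-based TF-IDF steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the weighting `tf(word, class) * log(1 + A / f(word))`, where `A` is the average number of words per class and `f(word)` is the word's frequency across all classes. The toy clusters are made up, and BERTopic's own `ClassTfidfTransformer` may normalize or smooth the values differently.

```python
import math
from collections import Counter

# Hypothetical clusters: each list holds the documents assigned to one topic
clusters = {
    0: ["speech recognition with neural models", "end to end speech recognition"],
    1: ["neural machine translation models", "translation of low resource languages"],
}

# Merge each cluster into a single "class document" and count its words
class_counts = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}

# Frequency of each word across all classes, and average words per class (A)
total_counts = Counter()
for counts in class_counts.values():
    total_counts.update(counts)
avg_words_per_class = sum(total_counts.values()) / len(class_counts)


def c_tf_idf(word, cluster_id):
    """Weight = tf(word, class) * log(1 + A / f(word))."""
    tf = class_counts[cluster_id][word]
    return tf * math.log(1 + avg_words_per_class / total_counts[word])


# Rank the words of each cluster by their class-based weight
for cluster_id, counts in class_counts.items():
    ranked = sorted(counts, key=lambda w: c_tf_idf(w, cluster_id), reverse=True)
    print(cluster_id, ranked[:3])
```

Words shared by both clusters (such as "neural" and "models") receive a lower weight than words concentrated in one cluster, which is what makes the top-ranked terms distinctive for a topic.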

<img src="https://drive.google.com/uc?export=view&id=1I8v7bydn2PQ4oB5he8Bl3Lh_uBQR5M_j" width="800"/>

Image Source: https://maartengr.github.io/BERTopic/algorithm/algorithm.html#6-optional-fine-tune-topic-representation


The code for the steps will look something like this:
```python
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer


# Step 1 - Extract embeddings (blue block)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality (red block)
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings (green block)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics (yellow block)
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation (green block)
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with a `bertopic.representation` model (dark blue block)
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

```

Source: https://maartengr.github.io/BERTopic/algorithm/algorithm.html#6-optional-fine-tune-topic-representation

### Default Pipeline

In [15]:
# The default pipeline takes only a couple of lines (fit_transform is left commented out here)
topic_model = BERTopic()
# topics, probs = topic_model.fit_transform(abstracts)

The key issues with using the default BERTopic model include:

- **Computational Efficiency**: Recalculating embeddings for each run is resource-heavy.
- **Stochastic Outcomes**: Default UMAP settings can lead to inconsistent results in dimensionality reduction.
- **Uncontrolled Topic Number**: The number of topics generated by default may not align with specific analytical needs.
- **Basic Topic Representations**: The default c-TF-IDF method might not fully capture the nuances in topic representation.

Improving the default BERTopic model involves several key strategies:

1. **Pre-Calculating Embeddings**: Instead of recalculating embeddings for each iteration, which is resource-intensive, pre-calculate and store them. This approach speeds up the process as BERTopic can directly use these pre-computed embeddings.

2. **Controlling Stochastic Behavior in UMAP**: UMAP, used for dimensionality reduction, can yield different results on each run due to its stochastic nature. To achieve consistent results, set a fixed `random_state` in the UMAP model before integrating it with BERTopic.

3. **Managing the Number of Topics**: Rather than relying solely on BERTopic's `nr_topics` parameter, which merges existing topics, adjust HDBSCAN's `min_cluster_size` parameter (BERTopic exposes it as `min_topic_size` when you use the default HDBSCAN model). This indirectly influences the number of generated topics: a higher `min_cluster_size` results in fewer topics, and vice versa.

4. **Enhancing Topic Representations**: The default c-TF-IDF method can be optimized using the `CountVectorizer`. By removing stopwords, ignoring infrequent words, and adjusting the n-gram range, you can refine the topic representations. This preprocessing step occurs after documents are assigned to topics and doesn't affect the clustering process.

These improvements collectively enhance BERTopic's efficiency, reproducibility, and the relevance of its topic representations.

### Improved Pipeline

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

# Improvement 1: reuse the saved embeddings (embeddings_arxiv) to speed up the process

# Improvement 2: set random_state to make the results reproducible
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# Improvement 3: set min_cluster_size to remove small topics
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Improvement 4: enhance topic representation by removing stopwords and infrequent words, and using bigrams
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))



In [17]:
# Create the BERTopic model; the pre-computed embeddings are passed to fit_transform, so the abstracts are not re-encoded
topic_model = BERTopic(embedding_model=embedding_model, umap_model=umap_model,
                       hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model,
                       calculate_probabilities=True)
topics, probs = topic_model.fit_transform(embeddings=embeddings_arxiv, documents=abstracts)


Using this pipeline, you'll receive three outputs: `topic_model`, `topics`, and `probs`:

- `topic_model` refers to the trained model, encompassing details about its configuration and the generated topics.
  - See the attributes of `topic_model` here: https://maartengr.github.io/BERTopic/api/bertopic.html
- `topics` holds the topic ID assigned to each abstract (`-1` marks outliers).
- `probs` holds the topic probabilities for each abstract. Because `calculate_probabilities=True`, it is a matrix with one row per abstract and one column per topic (the outlier topic has no column).

### Interpreting Topics

#### Overall Summary of Topics

In [18]:
topic_model.get_topic_info().head(20)

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 234 | -1_models_language_model_text | [models, language, model, text, data, based, t... | [ Large-scale pre-trained language models hav... |
| 1 | 0 | 172 | 0_dialogue_llms_models_reasoning | [dialogue, llms, models, reasoning, model, lan... | [ Few-shot learning is a challenging task tha... |
| 2 | 1 | 83 | 1_models_language_tuning_tasks | [models, language, tuning, tasks, training, pr... | [ Transformer-based language models are now w... |
| 3 | 2 | 70 | 2_word_similarity_semantic_embeddings | [word, similarity, semantic, embeddings, embed... | [ This paper introduces Latent Relational Ana... |
| 4 | 3 | 67 | 3_relation_entity_extraction_entities | [relation, entity, extraction, entities, relat... | [ Relation extraction is a fundamental proble... |
| 5 | 4 | 66 | 4_speech_recognition_asr_end | [speech, recognition, asr, end, speech recogni... | [ Self-supervised learning (SSL) achieves gre... |
| 6 | 5 | 55 | 5_sentiment_social_analysis_social media | [sentiment, social, analysis, social media, me... | [ In the last few years, emotion detection in... |
| 7 | 6 | 44 | 6_translation_machine translation_machine_nmt | [translation, machine translation, machine, nm... | [ Neural Machine Translation (NMT) can be use... |
| 8 | 7 | 35 | 7_medical_clinical_healthcare_notes | [medical, clinical, healthcare, notes, informa... | [ Coding diagnosis and procedures in medical ... |
| 9 | 8 | 29 | 8_generation_text_text generation_human | [generation, text, text generation, human, lan... | [ Large language models pre-trained for code ... |


The `topic_model.get_topic_info()` method provides an overview of the topics generated by the BERTopic model, including:

- **Topic IDs**: Each topic is assigned a unique identifier (e.g., 0, 1, 2, ...). The topic labeled "-1" typically represents outliers or noise.
- **Count**: The number of documents or abstracts assigned to each topic.
- **Name**: A descriptive name for each topic, derived from the most representative words or phrases in that topic (e.g., "0_dialogue_llms_models_reasoning").
- **Representation**: Key words or phrases associated with each topic, which are indicative of the topic's content.
- **Representative Docs**: Excerpts or summaries from documents that are representative of each topic.
- **Outlier Topic (-1)**: The first topic, labeled as "-1", comprises documents not fitting within any defined topic, essentially considered outliers. This outcome is due to HDBSCAN's approach of not mandating every point to be part of a cluster. To address these outliers, one could either switch to a clustering algorithm like k-Means, which doesn't produce outliers, or utilize BERTopic’s reduce_outliers() function to reassign some outlier documents to existing topics.

This information is useful for understanding the distribution and thematic focus of the topics within your dataset. For example, topic 0 centers on dialogue and LLM reasoning, while topic 4 is about speech recognition. The "Count" column shows the prevalence of each topic in your dataset.
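For the outlier topic described above, one way to picture reassignment is to move each outlier to the topic whose centroid is most similar to its embedding. The sketch below uses NumPy on made-up 2-D vectors; the function name `reassign_outliers` is hypothetical, and this illustrates the general idea rather than BERTopic's own implementation.

```python
import numpy as np


def reassign_outliers(embeddings, topics):
    """Assign each outlier (-1) to the topic with the most similar centroid."""
    embeddings = np.asarray(embeddings, dtype=float)
    topics = np.asarray(topics)
    topic_ids = np.array(sorted(set(topics.tolist()) - {-1}))
    if topic_ids.size == 0:
        return topics.copy()  # nothing to reassign to

    # L2-normalize so that dot products equal cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids = np.stack([normed[topics == t].mean(axis=0) for t in topic_ids])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    new_topics = topics.copy()
    outliers = topics == -1
    if outliers.any():
        nearest = (normed[outliers] @ centroids.T).argmax(axis=1)
        new_topics[outliers] = topic_ids[nearest]
    return new_topics


# Made-up embeddings: the last document is an outlier that lies close to topic 0
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.3]]
toy_topics = [0, 0, 1, 1, -1]
print(reassign_outliers(toy_embeddings, toy_topics))
```

In the notebook itself, the documented route is `topic_model.reduce_outliers(abstracts, topics)`, followed by `topic_model.update_topics(abstracts, topics=new_topics)` to refresh the topic representations; see the BERTopic documentation for the available strategies.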

#### Get individual Topic Info

The `get_topic()` method returns the top 10 keywords of a topic together with their c-TF-IDF weights, giving a quick thematic overview.

In [19]:
topic_model.get_topic(0)

[('dialogue', 0.024288463185530816),
 ('llms', 0.021958848766865727),
 ('models', 0.020684912214456213),
 ('reasoning', 0.019652427230549406),
 ('model', 0.019454050675933473),
 ('language', 0.01681879499457873),
 ('question', 0.0163905357050371),
 ('task', 0.015987602139905424),
 ('knowledge', 0.014515024761516499),
 ('performance', 0.013512051925950298)]

In the output above, Topic 0 is characterized by keywords such as 'dialogue', 'llms', 'models', 'reasoning', and 'question', with c-TF-IDF weights indicating their significance within the topic. Together these keywords suggest that Topic 0 covers dialogue systems and reasoning or question answering with large language models.

#### Find Topics related to a keyword

In [20]:
embedding_model = SentenceTransformer('all-mpnet-base-v2')


def find_similar_topics(embedding_model, topic_model, search_term, top_n=5):
    """Return the top_n topic IDs most similar to search_term, with their cosine similarities."""
    # Sorted topic IDs; assumes row i of topic_embeddings_ belongs to topic_list[i]
    topic_list = sorted(topic_model.topic_representations_.keys())

    # Embed the search term; encoding a one-item list returns an array of shape (1, dim)
    search_term_embedding = embedding_model.encode([search_term])

    # Cosine similarity between the search term and every topic embedding
    similarities = cosine_similarity(search_term_embedding, topic_model.topic_embeddings_).flatten()

    # Indices of the top_n highest similarities, in descending order
    top_indices = np.argsort(similarities)[-top_n:][::-1]
    similar_topics = [topic_list[i] for i in top_indices]
    similarity_scores = [similarities[i] for i in top_indices]

    return similar_topics, similarity_scores


In [21]:
# Usage
similar_topics, similarity_scores = find_similar_topics(embedding_model, topic_model, "summarization", top_n=5)

In [22]:
similar_topics, similarity_scores

([10, 8, 2, -1, 7], [0.58678555, 0.40861213, 0.3766405, 0.3735007, 0.3330688])

In [23]:
topic_model.get_topic(10)

[('summarization', 0.0769418048202718),
 ('summaries', 0.06058871697116944),
 ('summary', 0.04605578927178144),
 ('abstractive', 0.04304197240526727),
 ('document', 0.0348882519921391),
 ('models', 0.02203933352768306),
 ('sentences', 0.02186824281639762),
 ('factual', 0.02121638362166064),
 ('mds', 0.02075867326444991),
 ('model', 0.020536836427840717)]

In [24]:
topic_model.get_topic(9)

[('parsing', 0.06693560453681066),
 ('dependency', 0.0603612353552701),
 ('parser', 0.047422845962027116),
 ('semantic', 0.038229508473026606),
 ('amr', 0.03284163232351285),
 ('meaning', 0.030414716421819317),
 ('constituent', 0.029977192803917446),
 ('tree', 0.029164267705053575),
 ('neural', 0.026903678260532256),
 ('syntactic', 0.02679027807451442)]

In [25]:
topic_model.get_topic(8)

[('generation', 0.057031573467166655),
 ('text', 0.05084510753523988),
 ('text generation', 0.035438279521028765),
 ('human', 0.02709436375429308),
 ('language', 0.025850195278074212),
 ('models', 0.02259743790723246),
 ('nlg', 0.02202303832228654),
 ('ad', 0.02096359290073598),
 ('metrics', 0.02063690498679956),
 ('pre', 0.0204578228544329)]

#### Topics for Documents

In [26]:
topic_model.get_document_info(abstracts)

| | Document | Topic | Name | Representation | Representative_Docs | Top_n_words | Probability | Representative_document |
|---|---|---|---|---|---|---|---|---|
| 0 | Speech is one of the most effective ways of ... | 4 | 4_speech_recognition_asr_end | [speech, recognition, asr, end, speech recogni... | [ Self-supervised learning (SSL) achieves gre... | speech - recognition - asr - end - speech reco... | 1.000000 | False |
| 1 | Large language models are trained on massive... | 1 | 1_models_language_tuning_tasks | [models, language, tuning, tasks, training, pr... | [ Transformer-based language models are now w... | models - language - tuning - tasks - training ... | 0.333076 | False |
| 2 | Fine-tuning pre-trained contextualized embed... | -1 | -1_models_language_model_text | [models, language, model, text, data, based, t... | [ Large-scale pre-trained language models hav... | models - language - model - text - data - base... | 0.553720 | False |
| 3 | We propose a novel end-to-end Aspect-based R... | -1 | -1_models_language_model_text | [models, language, model, text, data, based, t... | [ Large-scale pre-trained language models hav... | models - language - model - text - data - base... | 0.464680 | False |
| 4 | Pretrained transformer-based language models... | 1 | 1_models_language_tuning_tasks | [models, language, tuning, tasks, training, pr... | [ Transformer-based language models are now w... | models - language - tuning - tasks - training ... | 0.381982 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | We study the entropy of Chinese and English ... | 2 | 2_word_similarity_semantic_embeddings | [word, similarity, semantic, embeddings, embed... | [ This paper introduces Latent Relational Ana... | word - similarity - semantic - embeddings - em... | 0.073524 | False |
| 996 | Matching and retrieving previously translate... | 6 | 6_translation_machine translation_machine_nmt | [translation, machine translation, machine, nm... | [ Neural Machine Translation (NMT) can be use... | translation - machine translation - machine - ... | 0.459700 | False |
| 997 | Hybrid question answering (HQA) aims to answ... | 0 | 0_dialogue_llms_models_reasoning | [dialogue, llms, models, reasoning, model, lan... | [ Few-shot learning is a challenging task tha... | dialogue - llms - models - reasoning - model -... | 1.000000 | False |
| 998 | Cross-lingual topic models have been prevale... | -1 | -1_models_language_model_text | [models, language, model, text, data, based, t... | [ Large-scale pre-trained language models hav... | models - language - model - text - data - base... | 0.538256 | False |


In [27]:
pprint(abstracts[2])

('  Fine-tuning pre-trained contextualized embedding models has become an\n'
 'integral part of the NLP pipeline. At the same time, probing has emerged as '
 'a\n'
 'way to investigate the linguistic knowledge captured by pre-trained models.\n'
 'Very little is, however, understood about how fine-tuning affects the\n'
 'representations of pre-trained models and thereby the linguistic knowledge '
 'they\n'
 'encode. This paper contributes towards closing this gap. We study three\n'
 'different pre-trained models: BERT, RoBERTa, and ALBERT, and investigate\n'
 'through sentence-level probing how fine-tuning affects their '
 'representations.\n'
 'We find that for some probing tasks fine-tuning leads to substantial changes '
 'in\n'
 'accuracy, possibly suggesting that fine-tuning introduces or even removes\n'
 'linguistic knowledge from a pre-trained model. These changes, however, vary\n'
 'greatly across different models, fine-tuning and probing tasks. Our '
 'analysis\n'
 'reveals that

In [28]:
topics[2], probs[2]

(-1,
 array([0.02828948, 0.08096788, 0.03326451, 0.02061631, 0.0188891 ,
        0.02372904, 0.02155308, 0.02062249, 0.02921869, 0.03177763,
        0.01226069, 0.03192929, 0.0265849 , 0.01892844, 0.02451252,
        0.02313631]))

In [29]:
# Five most probable topics for the abstract at index 2 (column i of probs corresponds to topic i)
top_five_topics_for_doc2 = np.argsort(probs[2])[-5:][::-1]
top_five_probs_doc2 = probs[2][top_five_topics_for_doc2]
top_five_topics_for_doc2, top_five_probs_doc2

(array([ 1,  2, 11,  9,  8]),
 array([0.08096788, 0.03326451, 0.03192929, 0.03177763, 0.02921869]))

In [30]:
topic_model.get_topic(2)

[('word', 0.05135076530031293),
 ('similarity', 0.044121546058868),
 ('semantic', 0.040979302403907875),
 ('embeddings', 0.03482121127084541),
 ('embedding', 0.023182405829945112),
 ('words', 0.019542487826814676),
 ('vector', 0.019310883876229284),
 ('vectors', 0.017732365325661463),
 ('tasks', 0.01739919323579438),
 ('language', 0.016722306681134225)]

In [31]:
topic_model.get_topic(15)

[('gender', 0.08268997103639629),
 ('bias', 0.07337297774751578),
 ('biases', 0.06102499076639604),
 ('models', 0.02656638827981151),
 ('gender bias', 0.020377573030231206),
 ('word embeddings', 0.02022017095156906),
 ('embeddings', 0.020155265530524456),
 ('approaches', 0.01982802805246346),
 ('fairness', 0.01941475188361467),
 ('social', 0.01877136281679783)]

### Hierarchical Topics
To build a hierarchy of topics, BERTopic uses SciPy's Ward linkage function by default. Depending on your requirements, you might prefer another linkage method such as single, complete, average, centroid, or median. BERTopic lets you customize both the linkage function and the distance function to suit your use case.
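The effect of the linkage choice can be explored outside BERTopic with SciPy alone. The sketch below builds two hierarchies over made-up topic vectors, standing in for per-topic c-TF-IDF rows, using `1 - cosine similarity` as the distance. The vectors and the choice of distance are assumptions for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical topic vectors (one row per topic)
toy_topic_vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.0, 0.1, 1.0],
])

# Pairwise distance between topics: 1 - cosine similarity
distances = 1 - cosine_similarity(toy_topic_vectors)

# SciPy's linkage expects the condensed (upper-triangle) form of the matrix
condensed = squareform(distances, checks=False)

# Same distances, two linkage methods. SciPy runs Ward on any condensed
# input, although Ward is strictly defined for Euclidean distances.
ward_tree = sch.linkage(condensed, method="ward", optimal_ordering=True)
average_tree = sch.linkage(condensed, method="average", optimal_ordering=True)

# Each row of a linkage matrix: [cluster_a, cluster_b, merge_distance, new_size]
print(ward_tree)
print(average_tree)
```

In BERTopic, `hierarchical_topics` accepts `linkage_function` and `distance_function` callables, so a custom choice could be passed as, for example, `topic_model.hierarchical_topics(abstracts, linkage_function=lambda x: sch.linkage(x, "average", optimal_ordering=True))`.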


In [32]:
# Hierarchical topics

hierarchical_topics = topic_model.hierarchical_topics(abstracts)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)


100%|██████████| 15/15 [00:00<00:00, 258.96it/s]


### Representation Models

BERTopic's design lets it use a variety of representation models, including Large Language Models (LLMs), to refine topic representations. The options range from part-of-speech tagging to text-generation models similar to those behind ChatGPT. Note that "fine-tuning" here means refining a topic's keywords or label, not updating model weights. The broad spectrum of models compatible with BERTopic for this purpose is shown in the figure below:

<img src="https://drive.google.com/uc?export=view&id=1K7f6jDeevS1h9_VCfI8MoZqSdvfgxKa-" width="800"/>

The initial rankings of words within topics, created using c-TF-IDF in BERTopic, act as preliminary or candidate keywords for each topic. These rankings might be adjusted based on different representation models. In the following sections, we will explore various representation models to refine and enhance the initial topic word rankings provided by c-TF-IDF.
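To make the idea of re-ranking candidate keywords concrete, the sketch below reorders a list of candidates by the cosine similarity between each keyword's vector and a topic vector. The function name `rerank_keywords` is hypothetical and the 2-D vectors are made up; a real representation model would work with sentence-transformer embeddings and may select candidates differently.

```python
import numpy as np


def rerank_keywords(candidates, keyword_vectors, topic_vector, top_n=3):
    """Reorder candidate keywords by cosine similarity to a topic vector."""
    keyword_vectors = np.asarray(keyword_vectors, dtype=float)
    topic_vector = np.asarray(topic_vector, dtype=float)
    sims = keyword_vectors @ topic_vector / (
        np.linalg.norm(keyword_vectors, axis=1) * np.linalg.norm(topic_vector)
    )
    # Highest similarity first, keep the top_n keywords
    order = np.argsort(sims)[::-1][:top_n]
    return [(candidates[i], float(sims[i])) for i in order]


# Candidates in a made-up c-TF-IDF order, with made-up 2-D "embeddings"
candidates = ["models", "summarization", "summaries", "document"]
keyword_vectors = [[0.2, 0.9], [1.0, 0.1], [0.9, 0.2], [0.6, 0.6]]
topic_vector = [1.0, 0.0]

reranked = rerank_keywords(candidates, keyword_vectors, topic_vector)
print(reranked)
```

The generic word "models" leads the original candidate list but falls out of the top three once similarity to the topic vector decides the order. In the notebook, a comparable effect could be obtained by passing the already imported `KeyBERTInspired()` as `representation_model` to `topic_model.update_topics`.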


In [33]:
# Representation of topic 10
topic_model.topic_representations_[10]

[('summarization', 0.0769418048202718),
 ('summaries', 0.06058871697116944),
 ('summary', 0.04605578927178144),
 ('abstractive', 0.04304197240526727),
 ('document', 0.0348882519921391),
 ('models', 0.02203933352768306),
 ('sentences', 0.02186824281639762),
 ('factual', 0.02121638362166064),
 ('mds', 0.02075867326444991),
 ('model', 0.020536836427840717)]

In [34]:
# Save original representations

original_topics = deepcopy(topic_model.topic_representations_)

In [35]:
def compare_topic_changes(new_model, original_topic_words, max_length=75, top_n_topics=10):
    """Print the top 5 words of each topic before and after a representation update."""
    for topic_id in range(top_n_topics):
        # Join the top 5 words per topic from the original and the updated model
        original_top_words = "_".join(word for word, _ in original_topic_words[topic_id][:5])
        new_top_words = "_".join(word for word, _ in new_model.get_topic(topic_id)[:5])

        # Pad the 'before' column to max_length so the 'after' column lines up
        print(f"Topic: {topic_id}    {original_top_words.ljust(max_length)} >>     {new_top_words}")


#### OpenAI

In [36]:
# OpenAI Representation Model
prompt = """
I have a topic that contains the following documents: \n[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""



In [37]:
openai_key = userdata.get('clustering')


In [38]:
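# Tokenizer used to truncate documents to `doc_length` tokens
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

client = openai.OpenAI(api_key=openai_key)

# Option A: send 4 representative documents per topic, each truncated to 100 tokens
representation_model = OpenAI(
    client,
    model="gpt-3.5-turbo",
    delay_in_seconds=2,
    chat=True,
    nr_docs=4,
    doc_length=100,
    tokenizer=tokenizer
)

# Option B: default document handling with a longer delay between requests.
# NOTE: this assignment overrides Option A, so it is the model used below;
# delete it to use the truncated prompts instead.
representation_model = OpenAI(client, model="gpt-3.5-turbo", delay_in_seconds=10, chat=True)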


In [39]:
# Update our topic representations
topic_model.update_topics(abstracts, representation_model=representation_model)
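When the topics are updated with this representation model, the `[DOCUMENTS]` and `[KEYWORDS]` placeholders in the prompt are filled in for each topic before the request is sent. The sketch below mimics that substitution with plain string replacement. The helper `fill_prompt` is hypothetical, and BERTopic's exact formatting of the documents and keywords may differ.

```python
def fill_prompt(template, documents, keywords):
    """Substitute the [DOCUMENTS] and [KEYWORDS] placeholders in a prompt template."""
    # One "- " bullet per representative document
    doc_block = "".join(f"- {doc}\n" for doc in documents)
    return (
        template
        .replace("[DOCUMENTS]", doc_block)
        .replace("[KEYWORDS]", ", ".join(keywords))
    )


template = (
    "I have a topic that contains the following documents: \n[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
)

# Made-up documents and keywords for a single topic
filled = fill_prompt(
    template,
    ["Doc about dialogue systems.", "Doc about LLM reasoning."],
    ["dialogue", "llms"],
)
print(filled)
```

Inspecting the filled prompt is a quick way to check how much text each request sends, which matters for both cost and the model's context limit.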

In [40]:
topic_model.representation_model.prompts_[0]

"\nI have a topic that contains the following documents: \n-   Text classification is a very classic NLP task, but it has two prominent\nshortcomings: On the one hand, text classification is deeply domain-dependent.\nThat is, a classifier trained on the corpus of one domain may not perform so\nwell in another domain. On the other hand, text classification models require a\nlot of annotated data for training. However, for some domains, there may not\nexist enough annotated data. Therefore, it is valuable to investigate how to\nefficiently utilize text data from different domains to improve the performance\nof models in various domains. Some multi-domain text classification models are\ntrained by adversarial training to extract shared features among all domains\nand the specific features of each domain. We noted that the distinctness of the\ndomain-specific features is different, so in this paper, we propose to use a\ncurriculum learning strategy based on keyword weight ranking to improv

In [41]:
# Show topic differences
compare_topic_changes(topic_model, original_topics)

Topic: 0    dialogue_llms_models_reasoning_model                                        >>     Cross-domain Dialogue System Adaptation with Tree Encoder and Layer-wise Attention
Topic: 1    models_language_tuning_tasks_training                                       >>     Enhancing Language Model Performance with Data Augmentation and Efficient Parameter Tuning
Topic: 2    word_similarity_semantic_embeddings_embedding                               >>     Exploration of Word Embedding Techniques and Semantic Similarity in NLP
Topic: 3    relation_entity_extraction_entities_relation extraction                     >>     Relation Extraction in Biomedical Research
Topic: 4    speech_recognition_asr_end_speech recognition                               >>     Optimization of Speech Recognition Systems
Topic: 5    sentiment_social_analysis_social media_media                                >>     Hate Speech Detection and Sentiment Analysis on Social Media
Topic: 6    translation_machine trans

In [42]:
hierarchical_topics = topic_model.hierarchical_topics(abstracts)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

100%|██████████| 15/15 [02:46<00:00, 11.08s/it]
