- Tutorial followed: https://www.kaggle.com/code/maartengr/topic-modeling-arxiv-abstract-with-bertopic/notebook
- Evaluation: https://github.com/MaartenGr/BERTopic/issues/90
- Tips and tricks: https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#keybert-bertopic

In [1]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups


In [5]:
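import os, ssl

# Workaround for SSL certificate errors, e.g. when fetch_20newsgroups downloads the data.
# Note: this disables HTTPS certificate verification for the whole process.
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context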

In [31]:
import gensim
import pandas as pd
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

In [80]:
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
model = BERTopic(language="English", top_n_words=10, min_topic_size=20,
                 verbose=False, n_gram_range=(1, 3), nr_topics='auto')
topics, probabilities = model.fit_transform(docs)

2022-12-21 17:25:07,563 - BERTopic - Transformed documents to Embeddings
2022-12-21 17:25:24,371 - BERTopic - Reduced dimensionality


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

2022-12-21 17:25:26,480 - BERTopic - Clustered reduced embeddings
2022-12-21 17:28:00,124 - BERTopic - Reduced number of topics from 115 to 93
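
The huggingface/tokenizers warning in the output above asks for `TOKENIZERS_PARALLELISM` to be set explicitly. A minimal sketch of one way to do that, assuming the cell runs before the tokenizers are first used in the process (for example at the top of the notebook); choosing "false" here is an assumption that favours avoiding the fork deadlock over tokenization speed:

```python
import os

# Set the variable named in the warning before any tokenizer is used.
# "false" disables tokenizer parallelism; "true" would keep it enabled.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```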


In [12]:
model.get_topic_freq().head()

   Topic  Count
0     -1   6907
1      0   1829
2      1    570
3      2    474
4      3    251
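
Topic -1 is BERTopic's outlier topic, and in the output above it is the largest group. A small sketch for checking what share of documents ends up there; `outlier_share` and `example_topics` are hypothetical names, and the toy list only stands in for the `topics` list returned by `fit_transform`:

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topics)
    total = sum(counts.values())
    return counts[-1] / total if total else 0.0


# Hypothetical topic assignments, not taken from the notebook run
example_topics = [-1, 0, 0, 1, -1, 2, 0, -1]
share = outlier_share(example_topics)
print(f"{share:.1%} of documents are outliers")
```

In the notebook this could be called as `outlier_share(topics)`.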


In [63]:
len(model.get_topic_freq())

230

In [62]:
model.get_topic(6)

[('', 1e-05),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05),
 ('', 1e-05)]
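
Topic 6 above contains only empty strings, and such topics have to be skipped before computing coherence. A minimal sketch of that filtering on plain Python data; `drop_empty_topics` and the `example` dictionary are hypothetical, with the tuples mimicking the `(word, score)` pairs shown in the output above:

```python
def drop_empty_topics(topic_terms):
    """Keep only non-empty words per topic and drop topics with no words left."""
    cleaned = {}
    for topic_id, terms in topic_terms.items():
        words = [word for word, _score in terms if word]
        if words:
            cleaned[topic_id] = words
    return cleaned


# Hypothetical topic representations in the (word, score) shape shown above
example = {
    5: [("hockey", 0.03), ("nhl", 0.02), ("", 1e-05)],
    6: [("", 1e-05)] * 10,
}
kept = drop_empty_topics(example)
print(kept)
```

Keeping the topic id as the dictionary key makes it easy to see afterwards which topics were dropped.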

In [20]:
model.visualize_barchart(top_n_topics=10, height=700)

In [22]:
model.visualize_term_rank()

In [24]:
model.visualize_term_rank(log_scale=True)

In [26]:
model.visualize_topics(top_n_topics=50)

In [28]:
model.visualize_hierarchy(top_n_topics=50, width=800)

In [38]:
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics,
                          "Topic_prob": probabilities})

In [39]:
documents.head()

                                            Document  ID  Topic  Topic_prob
0  \n\nI am sure some bashers of Pens fans are pr...   0      0         1.0
1  My brother is in the market for a high-perform...   1     10    0.778909
2  \n\n\n\n\tFinally you said what you dream abou...   2     21     0.46748
3  \nThink!\n\nIt's the SCSI card doing the DMA t...   3     44    0.782099
4     1) I have an old Jasmine drive which I cann...   4     85    0.614488


In [40]:
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})

In [43]:
# _preprocess_text is a private BERTopic method and may change between versions
cleaned_docs = model._preprocess_text(documents_per_topic.Document.values)

In [46]:
# Extract vectorizer and analyzer from BERTopic
vectorizer = model.vectorizer_model
analyzer = vectorizer.build_analyzer()

In [76]:
# Extract features for Topic Coherence evaluation
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
# The -1 skips the outlier topic (assumes topic -1 is present in `topics`)
topic_words = [[word for word, _ in model.get_topic(topic) if word != '']
               for topic in range(len(set(topics)) - 1)]
# Some topics come back with only '' as topic words; drop them
topic_words = [t for t in topic_words if len(t) > 0]

In [78]:
# Evaluate
coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = coherence_model.get_coherence()


In [79]:
coherence

0.5102444996410709
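
To give a feel for what a coherence score measures, here is a simplified sketch. It is not gensim's `c_v`; it is a plain document-level NPMI average over word pairs, swapped in because it fits in a few lines. The function name `npmi_coherence`, the toy documents, and the handling of edge cases are all assumptions made for illustration:

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, tokenized_docs):
    """Average document-level NPMI over all word pairs of one topic."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # words never co-occur: minimum NPMI
        elif p12 == 1.0:
            scores.append(0.0)   # NPMI is undefined here; treated as 0 (assumption)
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    if not scores:
        raise ValueError("need at least two topic words")
    return sum(scores) / len(scores)


# Toy corpus, not taken from the 20 newsgroups data
toy_docs = [
    ["hockey", "nhl", "team"],
    ["hockey", "team", "game"],
    ["scsi", "drive", "disk"],
    ["drive", "disk", "card"],
]
coherent = npmi_coherence(["hockey", "team"], toy_docs)
incoherent = npmi_coherence(["hockey", "drive"], toy_docs)
print(coherent, incoherent)
```

Words that always appear together score near 1, and words that never share a document score -1, which is the intuition behind reading a higher coherence value as a more interpretable topic.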