Inconsistency between topic with maximum probability and the predicted one for a document #65

Closed
arielibaba opened this issue Feb 24, 2021 · 6 comments

Comments

@arielibaba

arielibaba commented Feb 24, 2021

Hi Maarten,

When using BERTopic on the fetch_20newsgroups dataset to extract topics and their associated representative documents, I noticed that for some documents the predicted topic was different from the one with the maximum probability. Of course, I only checked documents whose topic label is different from -1. In other words, there seems to be an inconsistency between the predicted topics and the probabilities. Is this normal?

When we use the following:

topic_model = BERTopic(language="english", calculate_probabilities=True)
preds, probs = topic_model.fit_transform(docs)

For each index idx, shouldn't we have preds[idx] == numpy.argmax(probs[idx, :])?
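To illustrate, here is a minimal sketch of the check (docs comes from fetch_20newsgroups; the variable names are mine):

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

topic_model = BERTopic(language="english", calculate_probabilities=True)
preds, probs = topic_model.fit_transform(docs)

# Compare the assigned topics against the argmax of the probability matrix,
# ignoring outlier documents (topic -1)
preds = np.array(preds)
mask = preds != -1
mismatch = np.mean(preds[mask] != probs[mask].argmax(axis=1))
print(f"fraction of non-outlier documents with a mismatch: {mismatch:.3f}")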

Thank you in advance for your response.

@dopc
Copy link

dopc commented Feb 24, 2021

I think it is related to HDBSCAN. You can check the original soft clustering docs.
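For anyone who wants to reproduce the behaviour outside BERTopic, a minimal sketch against hdbscan directly (toy data, purely illustrative):

import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# Soft-clustering membership vectors; their argmax is not guaranteed to equal
# clusterer.labels_, which is exactly the inconsistency described above
memberships = hdbscan.all_points_membership_vectors(clusterer)
mask = clusterer.labels_ != -1
disagreement = np.mean(memberships[mask].argmax(axis=1) != clusterer.labels_[mask])
print(f"disagreement: {disagreement:.3f}")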

@MaartenGr
Owner

Unfortunately, the soft clustering is still an experimental feature that has its fair share of open issues if you look through the HDBSCAN repo. As of this moment, the probabilities for some documents indeed do not represent the topics they were assigned to. Having said that, after some testing, it seems that 98.9% of the probabilities are correctly assigned, and the documents that are mismatched do match their second-highest probability. Fortunately, this means the probabilities themselves can still be interpreted, although you should indeed be careful about blindly taking the highest probability.
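Roughly, a check along these lines (a sketch, not the exact test; preds and probs as in the snippet above):

import numpy as np

preds = np.array(preds)
mask = preds != -1  # ignore outlier documents

top1 = probs[mask].argmax(axis=1)
top2 = np.argsort(probs[mask], axis=1)[:, -2]  # second-highest probability

mismatched = top1 != preds[mask]
print("correctly assigned:", 1 - mismatched.mean())
print("mismatches equal to the 2nd-highest probability:",
      (top2[mismatched] == preds[mask][mismatched]).mean())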

@arielibaba
Author

arielibaba commented Feb 25, 2021

Thank you guys for being responsive.
Indeed, it seems that the soft clustering for HDBSCAN still has to be improved.
Anyway, I will leverage the concept of exemplar points that they use in the documentation, as it seems to be more reliable.
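For anyone interested, here is a minimal sketch of pulling the exemplar points out of BERTopic's underlying HDBSCAN model (assuming topic_model was fitted as in my first comment; exemplars_ is only available with prediction data or a fast metric):

# The fitted HDBSCAN model sitting inside BERTopic
clusterer = topic_model.hdbscan_model

# exemplars_ gives, per cluster, the (reduced) embeddings of its most central
# points; these are more reliable anchors than the soft-clustering probabilities
for cluster_id, points in enumerate(clusterer.exemplars_):
    print(f"cluster {cluster_id}: {len(points)} exemplar points")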

@firmai

firmai commented Jun 16, 2021

Hey, I'm not entirely sure how to proceed. What is the best way to get probabilities over the documents? Currently I have a 100% disconnect between the predicted topics and the maximum probabilities. Is this normal?


@daviddiazsolis

Hello everyone, I faced the same issue and did some further research on the HDBSCAN repo; there are some new commits that propose a way around the problem. It is not a perfect solution, but the new probabilities are a much closer match than before.

Following the hdbscan_flat functions in flat.py, the solution involves three steps:

1. Use your own HDBSCAN model, so first define one:

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True,
                            cluster_selection_method='eom')

2. Then, call BERTopic with that HDBSCAN model:

import numpy as np
from bertopic import BERTopic

topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True,
                       hdbscan_model=clusterer)

# We get the old topics and the probabilities that show the issue
topics, probs = topic_model.fit_transform(docs)

# The max probabilities don't match the topics
that = [np.argmax(probs[r, :]) for r in range(len(probs))]

3. Get the embeddings from the fitted model and then run the hdbscan_flat functions:

cluster = topic_model.hdbscan_model
embs = topic_model.umap_model.embedding_

from hdbscan.flat import (HDBSCAN_flat,
                          approximate_predict_flat,
                          membership_vector_flat,
                          all_points_membership_vectors_flat)

def n_clusters_from_labels(labels_):
    return np.amax(labels_) + 1

n_clusters = n_clusters_from_labels(cluster.labels_)

# Ask for a flat clustering with the same n_clusters
clusterer_flat = HDBSCAN_flat(embs, n_clusters=n_clusters,
                              cluster_selection_method='eom')

# We get the new topics and probabilities
topics2 = clusterer_flat.labels_
probs2 = all_points_membership_vectors_flat(clusterer_flat)
that2 = [np.argmax(probs2[r, :]) for r in range(len(probs2))]

Now probs2 is a much closer match to topics2 and that2.
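As a quick sanity check (a rough sketch; the exact agreement will vary by dataset):

# Fraction of documents where the flat soft-clustering argmax agrees with
# the flat labels, ignoring outliers (label -1)
mask = np.array(topics2) != -1
agreement = np.mean(np.array(topics2)[mask] == np.array(that2)[mask])
print(f"agreement on non-outlier documents: {agreement:.3f}")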

I hope this helps

@wsosnowski

wsosnowski commented Sep 14, 2022

Hi,
the inconsistency problem during the inference phase is now caused by UMAP rather than HDBSCAN, and there is a very easy fix: just create a custom UMAP model and fix the random_state:

from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(umap_model=umap_model, verbose=True, calculate_probabilities=True)
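A rough way to verify this (sketch only; docs and topic_model as above):

import numpy as np

topics, probs = topic_model.fit_transform(docs)
new_topics, new_probs = topic_model.transform(docs)

# With a fixed random_state the UMAP projection is reproducible, so the
# re-transformed assignments should line up far more closely with the
# fitted ones than they would with an unseeded UMAP
print("agreement:", np.mean(np.array(topics) == np.array(new_topics)))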
