Inconsistency between topic with maximum probability and the predicted one for a document #65

Closed
arielibaba opened this issue Feb 24, 2021 · 6 comments

Comments

@arielibaba

arielibaba commented Feb 24, 2021

Hi Maarten,

When using BERTopic on the fetch_20newsgroups dataset to extract topics and their associated representative documents, I noticed that for some documents the predicted topic was different from the one with the maximum probability. Of course, I only checked documents whose topic label is different from -1. In other words, there seems to be an inconsistency between the predicted topics and the probabilities. Is this normal?

When we use the following:

topic_model = BERTopic(language="english", calculate_probabilities=True)
preds, probs = topic_model.fit_transform(docs)

For each index idx, shouldn't we have preds[idx] == numpy.argmax(probs[idx, :])?
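To illustrate, here is a minimal sketch of the check (docs comes from fetch_20newsgroups; the variable names are mine):

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

topic_model = BERTopic(language="english", calculate_probabilities=True)
preds, probs = topic_model.fit_transform(docs)

# Compare the assigned topics against the argmax of the probability matrix,
# ignoring outlier documents (topic -1)
preds = np.array(preds)
mask = preds != -1
mismatch = np.mean(preds[mask] != probs[mask].argmax(axis=1))
print(f"fraction of non-outlier documents with a mismatch: {mismatch:.3f}")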

Thank you in advance for your response.

@dopc
Copy link

dopc commented Feb 24, 2021

I think it is related to HDBSCAN. You can check the original soft clustering docs.
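For anyone who wants to reproduce the behaviour outside BERTopic, a minimal sketch against hdbscan directly (toy data, purely illustrative):

import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# Soft-clustering membership vectors; their argmax is not guaranteed to equal
# clusterer.labels_, which is exactly the inconsistency described above
memberships = hdbscan.all_points_membership_vectors(clusterer)
mask = clusterer.labels_ != -1
disagreement = np.mean(memberships[mask].argmax(axis=1) != clusterer.labels_[mask])
print(f"disagreement: {disagreement:.3f}")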

@MaartenGr
Owner

Unfortunately, the soft clustering is still an experimental feature that has its fair share of open issues if you look through the HDBSCAN repo. As of this moment, the probabilities for some documents indeed do not represent the topics they were assigned to. Having said that, after some testing, it seems that 98.9% of the probabilities are correctly assigned, and the documents that are mismatched do match their second-highest probability. Fortunately, this means the probabilities themselves can still be interpreted, although you should indeed be careful about blindly taking the highest probability.
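Roughly, a check along these lines (a sketch, not the exact test; preds and probs as in the snippet above):

import numpy as np

preds = np.array(preds)
mask = preds != -1  # ignore outlier documents

top1 = probs[mask].argmax(axis=1)
top2 = np.argsort(probs[mask], axis=1)[:, -2]  # second-highest probability

mismatched = top1 != preds[mask]
print("correctly assigned:", 1 - mismatched.mean())
print("mismatches equal to the 2nd-highest probability:",
      (top2[mismatched] == preds[mask][mismatched]).mean())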

@arielibaba
Author

arielibaba commented Feb 25, 2021

Thank you guys for being responsive.
Indeed, it seems that the soft clustering for HDBSCAN still has to be improved.
Anyway, I will leverage the concept of exemplar points that they use in the documentation, as it seems to be more reliable.
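For anyone interested, here is a minimal sketch of pulling the exemplar points out of BERTopic's underlying HDBSCAN model (assuming topic_model was fitted as in my first comment; exemplars_ is only available with prediction data or a fast metric):

# The fitted HDBSCAN model sitting inside BERTopic
clusterer = topic_model.hdbscan_model

# exemplars_ gives, per cluster, the (reduced) embeddings of its most central
# points; these are more reliable anchors than the soft-clustering probabilities
for cluster_id, points in enumerate(clusterer.exemplars_):
    print(f"cluster {cluster_id}: {len(points)} exemplar points")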

@firmai

firmai commented Jun 16, 2021

Hey, I'm not entirely sure how to proceed. What is the best way to get probabilities over the documents? Currently I have a 100% disconnect between the predicted topics and the maximum probabilities. Is this normal?


@daviddiazsolis

Hello everyone, I faced the same issue and did some further research on the HDBSCAN repo; there are some new commits that propose a way around the problem. It is not a perfect solution, but the new probabilities are a much closer match than before.

Following the hdbscan_flat functions in flat.py, the solution involves three steps:

1. Use your own HDBSCAN model, so first define one:

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True,
                            cluster_selection_method='eom')

2. Then, call BERTopic with that HDBSCAN model:

import numpy as np
from bertopic import BERTopic

topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True,
                       hdbscan_model=clusterer)

# We get the old topics and the probabilities that show the issue
topics, probs = topic_model.fit_transform(docs)

# The max probabilities don't match the topics
that = [np.argmax(probs[r, :]) for r in range(len(probs))]

3. Get the embeddings from the fitted model and then run the hdbscan_flat functions:

cluster = topic_model.hdbscan_model
embs = topic_model.umap_model.embedding_

from hdbscan.flat import (HDBSCAN_flat,
                          approximate_predict_flat,
                          membership_vector_flat,
                          all_points_membership_vectors_flat)

def n_clusters_from_labels(labels_):
    return np.amax(labels_) + 1

n_clusters = n_clusters_from_labels(cluster.labels_)

# Ask for a flat clustering with the same n_clusters
clusterer_flat = HDBSCAN_flat(embs, n_clusters=n_clusters,
                              cluster_selection_method='eom')

# We get the new topics and probabilities
topics2 = clusterer_flat.labels_
probs2 = all_points_membership_vectors_flat(clusterer_flat)
that2 = [np.argmax(probs2[r, :]) for r in range(len(probs2))]

Now probs2 is a much closer match to topics2 and that2.
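As a quick sanity check (a rough sketch; the exact agreement will vary by dataset):

# Fraction of documents where the flat soft-clustering argmax agrees with
# the flat labels, ignoring outliers (label -1)
mask = np.array(topics2) != -1
agreement = np.mean(np.array(topics2)[mask] == np.array(that2)[mask])
print(f"agreement on non-outlier documents: {agreement:.3f}")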

I hope this helps

@wsosnowski

wsosnowski commented Sep 14, 2022

Hi,
the inconsistency problem during the inference phase is now caused by UMAP rather than HDBSCAN, and there is a very easy fix: just create a custom UMAP model and fix the random_state:

from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model = BERTopic(umap_model=umap_model, verbose=True, calculate_probabilities=True)
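A rough way to verify this (sketch only; docs and topic_model as above):

import numpy as np

topics, probs = topic_model.fit_transform(docs)
new_topics, new_probs = topic_model.transform(docs)

# With a fixed random_state the UMAP projection is reproducible, so the
# re-transformed assignments should line up far more closely with the
# fitted ones than they would with an unseeded UMAP
print("agreement:", np.mean(np.array(topics) == np.array(new_topics)))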
