Compare LDA, NMF, LSA with BERTopic (w/ embedding: all-MiniLM-L6-v2 + dim_red: UMAP + cluster: HDBSCAN) #2009

abis330 opened this issue May 23, 2024 · 1 comment


abis330 commented May 23, 2024

Hi @MaartenGr,

Given a dataset of texts, we want to extract topics using LDA, NMF, LSA and BERTopic (w/ embedding: all-MiniLM-L6-v2 + dim_red: UMAP + cluster: HDBSCAN).
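
For reference, this is roughly the pipeline I mean (a minimal sketch; the UMAP/HDBSCAN parameter values are illustrative placeholders, not tuned settings):

```python
# Sketch of the BERTopic pipeline from the title:
# all-MiniLM-L6-v2 embeddings + UMAP + HDBSCAN.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

docs = ["text one ...", "text two ..."]  # replace with the real dataset

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
topics, probs = topic_model.fit_transform(docs)
```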

To select the best algorithm for this dataset, our intuition was to optimize a mathematical combination of an applicable topic coherence measure and an applicable topic diversity measure. In a previous issue, #90, I observed that when calculating topic coherence you treated the concatenation of all texts belonging to a cluster as a single document.
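
For illustration, this is my reading of that concatenation step (a sketch, not necessarily the exact code from #90; the docs and topic assignments are placeholders):

```python
# All documents assigned to the same topic are joined into one "document"
# before tokenizing for the coherence calculation.
import pandas as pd

docs = ["a b c", "c d e", "f g h"]   # placeholder corpus
topics = [0, 0, 1]                   # placeholder topic assignment per document

df = pd.DataFrame({"Document": docs, "Topic": topics})
docs_per_topic = df.groupby("Topic", as_index=False).agg({"Document": " ".join})
tokens_per_topic = [doc.split() for doc in docs_per_topic["Document"]]
```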

However, when calculating topic coherence for LDA, LSA, and NMF, we simply take the BoW representation of the given texts and compute topic coherence on that.

To the best of my understanding, shouldn't we ensure that the corpus and dictionary passed when initializing the CoherenceModel object from gensim.models.coherencemodel are the same for BERTopic and for LSA/LDA/NMF, so that the topic coherence values are actually comparable across algorithms and we can select the one with the highest coherence?
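
Concretely, something like this is what I have in mind (a minimal sketch; the toy corpus and top words are placeholders standing in for the fitted models' output):

```python
# Score the top words of each model against the SAME tokenized texts,
# dictionary, and corpus, so the coherence values are comparable.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market crashed today",
    "investors lost money in stocks",
]  # placeholder corpus
tokenized = [doc.lower().split() for doc in docs]

dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Top-n words per topic, extracted from each fitted model (placeholders here)
topics_bertopic = [["cat", "mat"], ["stock", "market"]]
topics_lda = [["cats", "pets"], ["investors", "money"]]

for name, topic_words in [("BERTopic", topics_bertopic), ("LDA", topics_lda)]:
    cm = CoherenceModel(topics=topic_words, texts=tokenized, corpus=corpus,
                        dictionary=dictionary, coherence="c_v")
    print(name, cm.get_coherence())
```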

Apologies for such a long description.

Thanks,
Abi

@MaartenGr (Owner) commented:

> To the best of my understanding, shouldn't we ensure that the corpus and dictionary passed when initializing the CoherenceModel object from gensim.models.coherencemodel are the same for BERTopic and for LSA/LDA/NMF, so that the topic coherence values are actually comparable across algorithms and we can select the one with the highest coherence?

It depends. Although we would typically like to approach it with the same corpus/dictionary, that would mean being constrained to the same types of representations as the other models. In particular, it would tie the comparison to the c-TF-IDF representation, whereas BERTopic also supports other forms of representation. Personally, and as shown in the mentioned issue, I'm not a big fan of optimizing BERTopic for coherence/diversity, especially since doing so ignores all the additional representations that are integrated in the library. It's always interesting to see papers using BERTopic and reporting coherence on the default pipeline without considering MMR, KeyBERTInspired, PartOfSpeech, or even LLM-based representations.
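
For example, swapping in one of those representation models only changes the `representation_model` argument (a minimal sketch; KeyBERTInspired is just one option):

```python
# A coherence comparison on the default pipeline only evaluates the c-TF-IDF
# words; any of these representation models changes the topic words entirely.
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired  # or MaximalMarginalRelevance, PartOfSpeech

docs = ["text one ...", "text two ..."]  # replace with the real dataset

topic_model = BERTopic(representation_model=KeyBERTInspired())
topics, probs = topic_model.fit_transform(docs)
```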

Also, consider the following: is the model with the highest coherence actually the best model? What is the definition of the best model in your particular use case? In all honesty, I highly doubt that optimizing for coherence/diversity is the answer here, which is why I typically advise people to first find the metrics that fit their use case. That might also mean, and I hope it does, that human evaluation (for instance, with domain experts) is considered, or even your own validation.
