
Fix computation of topic coherence #3197

Merged (7 commits) on Apr 25, 2022

Conversation

@silviatti (Contributor) commented on Jul 21, 2021

Fixes #3181.

Problem description and how I solved it

Before this PR, the accumulator used to compute topic coherence discarded documents that contain none of the words belonging to the input topics; the code calls these documents "not relevant". Although this may seem computationally convenient, it affects the computation of topic coherence (c_v, c_npmi and c_uci): the number of documents counted by the accumulator is part of the formula of the direct confirmation measures.

As a result, discarding "irrelevant" documents makes the resulting topic coherence not comparable with coherences computed over topics made of different words, because each set of topics can lead to a different total number of documents. To solve this, I count all the documents of the corpus when computing the number of documents. This PR should solve issue #3181.
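To make the dependence on the document count explicit, here is a minimal sketch (not the actual gensim internals) of how the boolean document probabilities feed the NPMI direct confirmation measure; EPSILON stands in for the small constant that guards against log(0), and its exact value here is assumed for illustration:

import math

EPSILON = 1e-12  # small guard against log(0); the exact constant is an assumption

def npmi(docs_with_w1, docs_with_w2, docs_with_both, num_docs):
    """NPMI of a word pair from boolean document counts (sketch)."""
    p_w1 = docs_with_w1 / num_docs
    p_w2 = docs_with_w2 / num_docs
    p_both = docs_with_both / num_docs
    return math.log((p_both + EPSILON) / (p_w1 * p_w2)) / -math.log(p_both + EPSILON)

Every probability has num_docs in the denominator, so changing which documents the accumulator counts changes every confirmation score.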

Motivating example

I'll explain the fairness issue with a simple example. Suppose we have two topic models, A and B. Each topic model outputs two topics, and the first topic of each model (A1 and B1) happens to be the same. We want to compare their coherences in a fair way, so we expect the coherences of A1 and B1 to be the same.

Topic A1: ['human', 'computer']
Topic A2: ['system', 'interface']

Topic B1: ['human', 'computer']
Topic B2: ['graph', 'minors']

Corpus of documents used to compute the word occurrences:
d1: 'human computer system'
d2: 'human interface'
d3: 'graph minors trees'

Before the PR:

  • Considering topic model A:
    Document d3 ('graph minors trees') is discarded by the accumulator because it contains none of the top words of A1 or A2, so the total number of documents (num_docs) is 2. Let us compute the probability of occurrence of the word "human", which is part of the direct confirmation measure (I show only "human", but the same reasoning applies to the other words; it is enough to show that the results differ when they shouldn't). P(human) would be:
    P(human) = number of documents containing "human" / number of total "relevant" docs = 2/2

  • Considering topic model B:
    Now document d3 is not discarded because it contains words from the top words of B2, so the total number of documents is 3. Then P(human) is:
    P(human) = number of documents containing "human" / number of total "relevant" docs = 2/3

P(human), along with all the other word probabilities and co-occurrence probabilities, will then be used to compute the topic coherence (in particular, the direct confirmation measure). We therefore get two different coherences for topics A1 and B1, even though they are the same topic in practice. This is not what we expect: the coherence of the same topic should be the same.

Code for comparing the coherences of A1 and B1:

from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary

topics_A = [['human', 'computer'], ['system', 'interface']]
topics_B = [['human', 'computer'], ['graph', 'minors']]

corpus = ['human computer system', 'human interface', 'graph minors trees']
texts = [doc.split() for doc in corpus]
dictionary = Dictionary(texts)

cm_A = CoherenceModel(topics=topics_A, texts=texts, coherence='c_npmi', dictionary=dictionary)
coherence_A = cm_A.get_coherence_per_topic()
print(coherence_A[0])  # coherence of topic A1 is 2.885326251991536e-12

cm_B = CoherenceModel(topics=topics_B, texts=texts, coherence='c_npmi', dictionary=dictionary)
coherence_B = cm_B.get_coherence_per_topic()
print(coherence_B[0])  # coherence of topic B1 is 0.3690702464322811
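Plugging the document counts from the toy corpus into the npmi sketch above reproduces these two values (each toy document is shorter than the sliding window, so window counts coincide with document counts, and since NPMI is symmetric the per-topic mean over the two ordered pairs equals the pair's NPMI):

# "human" appears in 2 documents, "computer" in 1, and they co-occur in 1.
print(npmi(2, 1, 1, num_docs=2))  # ≈ 2.9e-12, matches coherence_A[0]
print(npmi(2, 1, 1, num_docs=3))  # ≈ 0.3691, matches coherence_B[0]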

After the PR:

  • Considering topic model A:
    Document d3 is no longer discarded because all documents are counted, so P(human) is:
    P(human) = number of documents containing "human" / number of total docs = 2/3

  • Considering topic model B:
    Document d3 is not discarded either, so P(human) is:
    P(human) = number of documents containing "human" / number of total docs = 2/3

Now the coherences of A1 and B1 are the same (the same code above can be used to verify it).
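With the same hand-computed sketch, both topics now use num_docs = 3 and therefore get the same score:

# After the fix, both models count all 3 documents:
print(npmi(2, 1, 1, num_docs=3))  # ≈ 0.3691 for A1 and B1 alike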

@silviatti (Contributor, Author) commented:

Hello, do you have any plans to consider this PR and the related issue? Many people use gensim to estimate the coherence of the topics and I believe it's important to get accurate and reliable scores.

Thanks :)
@piskvorky (Owner) commented:

Thanks for the clear and articulate PR!

Sorry we've been swamped with life, and to be honest I forgot about this PR. It's great you pinged us!

I'll try to review. CC @mpenkov WDYT?

@piskvorky (Owner) left a review comment:


I don't really understand the relevant_words business, and the existing code looks a bit shady to me overall.

So I only did a surface review – as long as your changes don't break the corner case you're replacing ("no relevant_words at all") and help your cause, I think we're good to merge. Thanks!

Review comment on gensim/topic_coherence/text_analysis.py (outdated, resolved)
@piskvorky added this to the Next release milestone on Feb 19, 2022
@piskvorky changed the title from "Fix computation of topic coherence #3181" to "Fix computation of topic coherence" on Feb 19, 2022
@piskvorky self-assigned this on Feb 25, 2022
@silviatti (Contributor, Author) commented:

Hello,
I improved the readability of the code by splitting line 303 into two lines, still preserving np.fromiter().

If a topic is composed only of words that are not in the dictionary, that section of the code is not even reached: CoherenceModel(topics, ...) raises an exception before the coherence is computed. I added an additional test for this case.
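As an illustration, a minimal sketch of that case (the out-of-vocabulary tokens are made up, and the exact exception type is whatever CoherenceModel raises for topics it cannot map to dictionary ids):

from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary

texts = [['human', 'computer', 'system'], ['graph', 'minors', 'trees']]
dictionary = Dictionary(texts)

# A topic made up entirely of out-of-vocabulary words: constructing the
# CoherenceModel is expected to fail before any coherence is computed.
try:
    CoherenceModel(topics=[['nonexistent', 'madeupword']], texts=texts,
                   dictionary=dictionary, coherence='c_npmi')
except Exception as err:
    print('CoherenceModel raised:', err)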

Apologies for the late reply. I will have more time to work on this issue in the coming days, in case you request other changes.

@piskvorky (Owner) commented on Apr 25, 2022

What happens when there are no relevant words in the text at all? (Why was the text_is_relevant check there in the first place?)

Before, empty texts were skipped; with this PR they yield an empty numpy array. This changes the behaviour, but it solves @silviatti's original issue (and is also cleaner conceptually), so I'm merging.
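Roughly what that looks like, as a sketch rather than the actual accumulator internals (the ids and variable names are made up; np.fromiter is the call mentioned above):

import numpy as np

relevant_ids = {2, 5, 7}   # hypothetical ids of the topics' top words
text = [11, 12]            # a document containing none of them

# Previously such a document was skipped entirely and did not count towards
# num_docs; now it is still processed and the filter simply yields an empty array.
relevant_words = np.fromiter(
    (word_id for word_id in text if word_id in relevant_ids), dtype=np.uint32)
print(relevant_words)      # -> [] (empty array), while num_docs still increments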
