
Fix computation of topic coherence #3197

Merged (7 commits) on Apr 25, 2022

Conversation

@silviatti (Contributor) commented on Jul 21, 2021

Fixes #3181.

Problem description and how I solved it

Before this PR, the accumulator used to compute topic coherence discarded documents that contain none of the words belonging to the input topics; the code calls these documents "not relevant". Although this may seem computationally convenient, it affects the computation of topic coherence (c_v, c_npmi and c_uci): the number of documents counted by the accumulator is part of the formula of the direct confirmation measures.

As a result, discarding "irrelevant" documents makes the resulting topic coherence not comparable with coherences computed over topics made of different words, because each set of topics can lead to a different total number of documents. To solve this, I count all the documents of the corpus when computing the number of documents. This PR should solve issue #3181.
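To make the dependence on the document count explicit, here is a minimal sketch (not the actual gensim internals) of how the boolean document probabilities feed the NPMI direct confirmation measure; EPSILON stands in for the small constant that guards against log(0), and its exact value here is assumed for illustration:

import math

EPSILON = 1e-12  # small guard against log(0); the exact constant is an assumption

def npmi(docs_with_w1, docs_with_w2, docs_with_both, num_docs):
    """NPMI of a word pair from boolean document counts (sketch)."""
    p_w1 = docs_with_w1 / num_docs
    p_w2 = docs_with_w2 / num_docs
    p_both = docs_with_both / num_docs
    return math.log((p_both + EPSILON) / (p_w1 * p_w2)) / -math.log(p_both + EPSILON)

Every probability has num_docs in the denominator, so changing which documents the accumulator counts changes every confirmation score.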

Motivating example

I'll explain the fairness issue with a simple example. Suppose we have two topic models, A and B. Each topic model outputs two topics, and the first topic of each model (A1 and B1) happens to be the same. We want to compare their coherences in a fair way, so we expect the coherences of A1 and B1 to be the same.

Topic A1: ['human', 'computer']
Topic A2: ['system', 'interface']

Topic B1: ['human', 'computer']
Topic B2: ['graph', 'minors']

Corpus of documents used to compute the word occurrences:
d1: 'human computer system'
d2: 'human interface'
d3: 'graph minors trees'

Before the PR:

  • Considering topic model A:
    Document d3 ('graph minors trees') is discarded by the accumulator because it contains none of the top words of A1 or A2, so the total number of documents (num_docs) is 2. Let us compute the probability of occurrence of the word "human", which is part of the direct confirmation measure (I show only "human", but the same reasoning applies to the other words; it is enough to show that the results differ when they shouldn't). P(human) would be:
    P(human) = number of documents containing "human" / number of total "relevant" docs = 2/2

  • Considering topic model B:
    Now document d3 is not discarded because it contains words from the top words of B2, so the total number of documents is 3. Then P(human) is:
    P(human) = number of documents containing "human" / number of total "relevant" docs = 2/3

P(human), along with all the other word probabilities and co-occurrence probabilities, will then be used to compute the topic coherence (in particular, the direct confirmation measure). We therefore get two different coherences for topics A1 and B1, even though they are the same topic in practice. This is not what we expect: the coherence of the same topic should be the same.

Code for comparing the coherences of A1 and B1:

from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary

topics_A = [['human', 'computer'], ['system', 'interface']]
topics_B = [['human', 'computer'], ['graph', 'minors']]

corpus = ['human computer system', 'human interface', 'graph minors trees']
texts = [doc.split() for doc in corpus]
dictionary = Dictionary(texts)

cm_A = CoherenceModel(topics=topics_A, texts=texts, coherence='c_npmi', dictionary=dictionary)
coherence_A = cm_A.get_coherence_per_topic()
print(coherence_A[0])  # coherence of topic A1 is 2.885326251991536e-12

cm_B = CoherenceModel(topics=topics_B, texts=texts, coherence='c_npmi', dictionary=dictionary)
coherence_B = cm_B.get_coherence_per_topic()
print(coherence_B[0])  # coherence of topic B1 is 0.3690702464322811
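Plugging the document counts from the toy corpus into the npmi sketch above reproduces these two values (each toy document is shorter than the sliding window, so window counts coincide with document counts, and since NPMI is symmetric the per-topic mean over the two ordered pairs equals the pair's NPMI):

# "human" appears in 2 documents, "computer" in 1, and they co-occur in 1.
print(npmi(2, 1, 1, num_docs=2))  # ≈ 2.9e-12, matches coherence_A[0]
print(npmi(2, 1, 1, num_docs=3))  # ≈ 0.3691, matches coherence_B[0]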

After the PR:

  • Considering topic model A:
    Document d3 is no longer discarded because all documents are counted, so P(human) is:
    P(human) = number of documents containing "human" / number of total docs = 2/3

  • Considering topic model B:
    Document d3 is not discarded either, so P(human) is:
    P(human) = number of documents containing "human" / number of total docs = 2/3

Now the coherences of A1 and B1 are the same (the same code above can be used to verify it).
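With the same hand-computed sketch, both topics now use num_docs = 3 and therefore get the same score:

# After the fix, both models count all 3 documents:
print(npmi(2, 1, 1, num_docs=3))  # ≈ 0.3691 for A1 and B1 alike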

@silviatti (Contributor, Author) commented:

Hello, do you have any plans to consider this PR and the related issue? Many people use gensim to estimate the coherence of the topics and I believe it's important to get accurate and reliable scores.

Thanks :)
@piskvorky (Owner) commented:

Thanks for the clear and articulate PR!

Sorry we've been swamped with life, and to be honest I forgot about this PR. It's great you pinged us!

I'll try to review. CC @mpenkov WDYT?

@piskvorky (Owner) left a review comment:


I don't really understand the relevant_words business, and the existing code looks a bit shady to me overall.

So I only did a surface review – as long as your changes don't break the corner case you're replacing ("no relevant_words at all") and help your cause, I think we're good to merge. Thanks!

Review comment on gensim/topic_coherence/text_analysis.py (outdated, resolved)
@piskvorky added this to the Next release milestone on Feb 19, 2022
@piskvorky changed the title from "Fix computation of topic coherence #3181" to "Fix computation of topic coherence" on Feb 19, 2022
@piskvorky self-assigned this on Feb 25, 2022
@silviatti (Contributor, Author) commented:

Hello,
I improved the readability of the code by splitting line 303 into two lines, still preserving np.fromiter().

If a topic is composed only of words that are not in the dictionary, that section of the code is not even reached: CoherenceModel(topics, ...) raises an exception before the coherence is computed. I added an additional test for this case.
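As an illustration, a minimal sketch of that case (the out-of-vocabulary tokens are made up, and the exact exception type is whatever CoherenceModel raises for topics it cannot map to dictionary ids):

from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary

texts = [['human', 'computer', 'system'], ['graph', 'minors', 'trees']]
dictionary = Dictionary(texts)

# A topic made up entirely of out-of-vocabulary words: constructing the
# CoherenceModel is expected to fail before any coherence is computed.
try:
    CoherenceModel(topics=[['nonexistent', 'madeupword']], texts=texts,
                   dictionary=dictionary, coherence='c_npmi')
except Exception as err:
    print('CoherenceModel raised:', err)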

Apologies for the late reply. I will have more time to work on this issue in the coming days, in case you request other changes.

@piskvorky (Owner) commented on Apr 25, 2022

What happens when there are no relevant words in the text at all? (Why was the text_is_relevant check there in the first place?)

Before, empty texts were skipped; with this PR they yield an empty numpy array. This changes the behaviour, but it solves @silviatti's original issue (and is also cleaner conceptually), so I'm merging.
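Roughly what that looks like, as a sketch rather than the actual accumulator internals (the ids and variable names are made up; np.fromiter is the call mentioned above):

import numpy as np

relevant_ids = {2, 5, 7}   # hypothetical ids of the topics' top words
text = [11, 12]            # a document containing none of them

# Previously such a document was skipped entirely and did not count towards
# num_docs; now it is still processed and the filter simply yields an empty array.
relevant_words = np.fromiter(
    (word_id for word_id in text if word_id in relevant_ids), dtype=np.uint32)
print(relevant_words)      # -> [] (empty array), while num_docs still increments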
