
Fixed KeyError in coherence model #2830

Merged (13 commits) Jun 29, 2021
Conversation

@pietrotrope (Contributor) commented May 7, 2020

When the token list of a topic contains a token that is not present in the dictionary's token2id mapping, the framework raised an error instead of excluding the token and continuing to compute the metric score.

Excluding topic tokens that are not present in the dictionary makes it possible to compute the metric on texts that do not contain every representative word of the topics.
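
For reference, here is a minimal sketch of the scenario (the texts, the topic words and the unseen token are made up for illustration; the behaviour after the fix is the one described above):

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Toy texts; the word 'unseen_token' never appears in them,
# so it is absent from the dictionary built from them.
texts = [['computer', 'system', 'interface'],
         ['graph', 'trees', 'minors']]
dictionary = Dictionary(texts)

# One topic deliberately contains the out-of-dictionary word.
topics = [['computer', 'system', 'unseen_token'],
          ['graph', 'trees', 'minors']]

cm = CoherenceModel(topics=topics, texts=texts,
                    dictionary=dictionary, coherence='c_v')

# Before this PR: raises KeyError: 'unseen_token'.
# After this PR: the unseen token is excluded and the score is
# computed from the remaining topic words.
print(cm.get_coherence())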

@piskvorky (Owner)

Thanks. Can you add a test too?

Are there other places in the coherence code with the same issue (poor input validation / preprocessing)? Let's fix them all at once.

@pietrotrope (Contributor, Author)

I ran some tests and they fail when passing topic IDs instead of tokens.
I need to solve this issue first; after that we can look for other places with the same problem.

@pietrotrope (Contributor, Author)

The idea behind the issue is to allow evaluation of topics containing tokens and/or IDs that are not present in the texts, by excluding them from the computation.

To explain, I will refer to the issue example with a simple test.

Given:

topics = [  ['human', 'computer', 'system', 'interface'],
            ['graph', 'minors', 'trees', 'eps']]

test_WITH_human = [  ['human', 'computer', 'system', 'interface'],
                     ['graph', 'minors', 'trees', 'eps']]

test_WITHOUT_human = [  [ 'computer', 'system', 'interface'],
                        ['graph', 'minors', 'trees', 'eps']]

The following code

import gensim
from gensim.models import CoherenceModel

dictionary = gensim.corpora.Dictionary(test_WITHOUT_human)
corpus = [dictionary.doc2bow(text) for text in test_WITHOUT_human]

cm = CoherenceModel(topics=topics,
                    corpus=corpus,
                    texts=test_WITHOUT_human,
                    dictionary=dictionary,
                    coherence="c_npmi")

cm.get_coherence()

Returns


--> 445             return np.array([self.dictionary.token2id[token] for token in topic])
    446         except KeyError:  # might be a list of token ids already, but let's verify all in dict

KeyError: 'human'

Instead of:
1.000000000005771

which is the metric score.

With the last update I get the right result from this test, but it raises some other issues, because in the initial version the token/ID format was detected by handling an exception.

Original code:

def _ensure_elements_are_ids(self, topic):
    try:
        return np.array([self.dictionary.token2id[token] for token in topic])
    except KeyError:  # might be a list of token ids already, but let's verify all in dict
        topic = (self.dictionary.id2token[_id] for _id in topic)
        return np.array([self.dictionary.token2id[token] for token in topic])

To find out whether topics are given as tokens or as IDs, this code treats a KeyError as a hint that the topic lists are expressed as IDs, and then resolves the tokens from those IDs.

But if we have a topic with a token that is not present in the texts, the code raises the exception and proceeds as if the topics were in ID form. Since they are not, we get a runtime error.

Let's try adding a check that excludes tokens which are not present in the dictionary:

def _ensure_elements_are_ids(self, topic):
    try:
        return np.array([self.dictionary.token2id[token] for token in topic if token in self.dictionary.token2id])
    except KeyError:  # might be a list of token ids already, but let's verify all in dict
        topic = (self.dictionary.id2token[_id] for _id in topic if _id in self.dictionary.id2token)
        return np.array([self.dictionary.token2id[token] for token in topic])

The token example above is now covered, but if the input consists of IDs rather than tokens, the code no longer raises an exception: none of the IDs are keys of the token2id dictionary, so every one of them is excluded, and we end up with an empty topic instead of the exception that used to trigger the ID-handling branch.
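
As a quick sketch of that failure mode (toy dictionary, made-up IDs):

from gensim.corpora import Dictionary

dictionary = Dictionary([['computer', 'system', 'interface'],
                         ['graph', 'minors', 'trees', 'eps']])

# A topic expressed as IDs rather than tokens.
topic_as_ids = [0, 1, 2]

# With the filtered lookup, no integer ID is a key of token2id (its keys
# are strings), so every element is silently dropped: we get an empty
# list instead of the KeyError that used to route execution into the
# ID-handling branch.
ids = [dictionary.token2id[token] for token in topic_as_ids
       if token in dictionary.token2id]
print(ids)  # [] -- empty, no exception raised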

To handle the problem, I tried this code:

def _ensure_elements_are_ids(self, topic):
    elements_are_tokens = np.array([self.dictionary.token2id[token] for token in topic if token in self.dictionary.token2id])
    topic_tokens_from_id = (self.dictionary.id2token[_id] for _id in topic if _id in self.dictionary.id2token)
    elements_are_ids = np.array([self.dictionary.token2id[token] for token in topic_tokens_from_id])
    if elements_are_tokens.size > elements_are_ids.size:
        return elements_are_tokens
    elif elements_are_ids.size > elements_are_tokens.size:
        return elements_are_ids
    else:
        raise Exception("Topic list is not a list of lists of tokens or ids")

With this code we assume that the correct format is the one with the higher number of matches against the token2id or id2token dictionaries.
When we cannot tell whether the topics are lists of tokens or lists of IDs, we raise an exception.
This version seems to pass more tests, but it guesses the data format from the number of matches instead of determining it for certain.
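
A rough illustration of that assumption (toy dictionary; the count_matches helper below is only for demonstration and is not part of the patch):

from gensim.corpora import Dictionary

dictionary = Dictionary([['computer', 'system', 'interface']])
dictionary[0]  # id2token is built lazily; accessing an id populates it

def count_matches(topic):
    # Count how many elements look like known tokens vs. known IDs.
    as_tokens = sum(1 for t in topic if t in dictionary.token2id)
    as_ids = sum(1 for t in topic if t in dictionary.id2token)
    return as_tokens, as_ids

print(count_matches(['computer', 'system', 'new_word']))  # (2, 0) -> treat as tokens
print(count_matches([0, 1, 2]))                           # (0, 3) -> treat as IDs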

If anyone can help me solve this, I will gladly work on it again and fix the problem once and for all.

@mpenkov (Collaborator) left a comment

Picked up a code formatting nitpick.

Is this code covered by tests? If not, please add them.

@@ -120,6 +120,7 @@ class CoherenceModel(interfaces.TransformationABC):
>>> coherence = cm.get_coherence() # get coherence value

"""

Review comment on the diff (Collaborator):

Why did you add this blank line?

@mpenkov (Collaborator) commented Jul 18, 2020

> To explain, I will refer to the issue example with a simple test.

I think this would make a good unit test. It would demonstrate the usefulness of your fix and prevent future regressions.
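
A rough sketch of what such a regression test might look like (the test class name and the assertion are only suggestions, not the test that was eventually merged):

import unittest

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel


class TestCoherenceWithUnseenTopicWords(unittest.TestCase):

    def test_topic_word_missing_from_dictionary(self):
        texts = [['computer', 'system', 'interface'],
                 ['graph', 'minors', 'trees', 'eps']]
        dictionary = Dictionary(texts)
        corpus = [dictionary.doc2bow(text) for text in texts]

        # 'human' never occurs in texts, so it is absent from the dictionary.
        topics = [['human', 'computer', 'system', 'interface'],
                  ['graph', 'minors', 'trees', 'eps']]

        cm = CoherenceModel(topics=topics, corpus=corpus, texts=texts,
                            dictionary=dictionary, coherence='c_npmi')
        # Should not raise KeyError; the unseen word is simply excluded.
        self.assertIsInstance(cm.get_coherence(), float)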

@piskvorky (Owner)

Looks like we dropped the ball on this PR. People still keep tripping over the same issue.

@pietrotrope @mpenkov do you think we could finish this & merge? Thanks.

@pietrotrope requested a review from @mpenkov, April 9, 2021 17:27
@pietrotrope (Contributor, Author)

Hi, I tried to update the code according to the requests.
I'd like to point out that the solution I have proposed over the past months (and which I am submitting with this commit) excludes from the coherence computation the words that are not present in the dictionary being used.

@piskvorky (Owner)

Thanks @pietrotrope.

> I'd like to point out that the solution I have proposed over the past months (and which I am submitting with this commit) excludes from the coherence computation the words that are not present in the dictionary being used.

That's fine – better than an exception at any rate. Is this fact clearly documented?
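
For example, a short note in the CoherenceModel docstring could make the behaviour explicit (the wording below is only a suggestion, not the text that was actually merged):

class CoherenceModel(interfaces.TransformationABC):
    """Objects of this class allow for building and maintaining a model for topic coherence.

    Notes
    -----
    Topic words that are not present in the supplied `dictionary` are excluded
    from the coherence computation instead of raising a ``KeyError``.
    ...
    """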

@mpenkov can you please review & merge?

@T0admomo commented May 4, 2021

@pietrotrope - just wanted to say thank you for taking the time to propose this fix! I'm new to gensim and quickly ran into this problem; this fix will cut down on a lot of preprocessing. Following closely!

@mpenkov (Collaborator) left a comment

Left some comments, please have a look. Sorry it took me so long to review this.

Review comments on gensim/models/coherencemodel.py and gensim/test/test_coherencemodel.py were resolved.
@pietrotrope (Contributor, Author)

Hi! I updated the code.
As I said in the review comments, I did some refactoring (following your suggestions) and added the new tests (I also corrected one).

@mpenkov (Collaborator) commented May 9, 2021

Can you merge the origin/develop branch into your PR's branch? Looks like there's some kind of conflict.

@mpenkov added this to Triage in PR triage 2021-06, Jun 22, 2021
@mpenkov (Collaborator) commented Jun 22, 2021

Note to self: resolve conflicts, check tests.

@mpenkov moved this from Unsorted to Needs work before merge in PR triage 2021-06, Jun 22, 2021
@mpenkov self-assigned this, Jun 22, 2021
@mpenkov changed the title from "Fixed coherence model issue #2711" to "Fixed KeyError in coherence model", Jun 29, 2021
@mpenkov merged commit 52fade6 into piskvorky:develop, Jun 29, 2021
@mpenkov (Collaborator) commented Jun 29, 2021

Merged. Thank you @pietrotrope !

@mpenkov moved this from Needs work before merge to Ready to merge in PR triage 2021-06, Jun 29, 2021
@pietrotrope deleted the coherence_score_fix branch, May 25, 2022 07:12
Successfully merging this pull request may close these issues: "Coherence key error on held out set", "Coherence score on new data", "Key Error".

4 participants