Coherence score on new data Key Error #2711

Closed
bauer-jan opened this issue Dec 27, 2019 · 3 comments · Fixed by #2830
Labels
need info Not enough information for reproduce an issue, need more info from author

Comments


bauer-jan commented Dec 27, 2019

I want to compare different models (LDA, Mallet, etc.) using cross-validation. I train each model on the training data and want to calculate the coherence score (c_v) on the test data. I do something like this:

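    import gensim
    from gensim.models.coherencemodel import CoherenceModel

    # `topics` comes from the trained model; `test` is the tokenized test data
    dictionary = gensim.corpora.Dictionary(test)
    corpus = [dictionary.doc2bow(text) for text in test]

    cm = CoherenceModel(topics=topics,
                        corpus=corpus,
                        texts=test,
                        dictionary=dictionary,
                        coherence="c_v")

    cm.get_coherence()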

When a word from a topic learned on the training data is not present in the test data, this raises a KeyError in the dictionary lookup. Does anyone know anything about this? Is this a bug, and if not, how do I solve it?


mpenkov (Collaborator) commented Dec 29, 2019

Can you please make your example reproducible? Ideally, you should be able to copy-paste it into a REPL shell to demonstrate the problem.

@mpenkov mpenkov added the need info Not enough information for reproduce an issue, need more info from author label Dec 29, 2019
bauer-jan (Author) commented

We have two topics and want to calculate their coherence score on new data. If a word like "human" is present in a topic but not in the new data over which we want to calculate the coherence score, a KeyError is thrown.

from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary

topics = [  ['human', 'computer', 'system', 'interface'],
            ['graph', 'minors', 'trees', 'eps']]

test_WITH_human = [  ['human', 'computer', 'system', 'interface'],
                     ['graph', 'minors', 'trees', 'eps']]

test_WITHOUT_human = [  [ 'computer', 'system', 'interface'],
                        ['graph', 'minors', 'trees', 'eps']]

If I calculate the coherence score over test_WITH_human, it works and we get a score of 1.0. If I calculate it over test_WITHOUT_human, an error is thrown:

# Use the imported Dictionary class directly; the `gensim` package itself was never imported above
dictionary = Dictionary(test_WITHOUT_human)
corpus = [dictionary.doc2bow(text) for text in test_WITHOUT_human]

cm = CoherenceModel(topics=topics,
                    corpus=corpus,
                    texts=test_WITHOUT_human,
                    dictionary=dictionary,
                    coherence="c_v")

cm.get_coherence()

Error message:

--> 445             return np.array([self.dictionary.token2id[token] for token in topic])
    446         except KeyError:  # might be a list of token ids already, but let's verify all in dict

KeyError: 'human'
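
A possible workaround, sketched in plain Python so it runs without gensim: drop every topic word that never occurs in the test texts before building the CoherenceModel. The helper name `restrict_topics_to_vocabulary` and the `min_words` threshold are hypothetical and not part of gensim. Dropping topics with fewer than two remaining words is an assumption, made because a single word has no co-occurrence partner.

```python
def restrict_topics_to_vocabulary(topics, texts, min_words=2):
    """Keep only topic words that occur somewhere in `texts` (hypothetical helper)."""
    vocabulary = {token for text in texts for token in text}
    filtered = [[word for word in topic if word in vocabulary] for topic in topics]
    # Assumption: a topic with fewer than `min_words` words left is not worth scoring.
    return [topic for topic in filtered if len(topic) >= min_words]


topics = [['human', 'computer', 'system', 'interface'],
          ['graph', 'minors', 'trees', 'eps']]

test_without_human = [['computer', 'system', 'interface'],
                      ['graph', 'minors', 'trees', 'eps']]

usable_topics = restrict_topics_to_vocabulary(topics, test_without_human)
print(usable_topics)
# [['computer', 'system', 'interface'], ['graph', 'minors', 'trees', 'eps']]
```

Passing `usable_topics` instead of `topics` to CoherenceModel should avoid the failing lookup, but the score then describes the shortened topics, so it may not be directly comparable across test folds.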

nadiafelix commented

I have the same problem. What is the solution?


KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/gensim/models/coherencemodel.py in _ensure_elements_are_ids(self, topic)
438 try:
--> 439 return np.array([self.dictionary.token2id[token] for token in topic])
440 except KeyError: # might be a list of token ids already, but let's verify all in dict

5 frames
KeyError: 'de arma'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/gensim/models/coherencemodel.py in <listcomp>(.0)
439 return np.array([self.dictionary.token2id[token] for token in topic])
440 except KeyError: # might be a list of token ids already, but let's verify all in dict
--> 441 topic = [self.dictionary.id2token[_id] for _id in topic]
442 return np.array([self.dictionary.token2id[token] for token in topic])
443

KeyError: 'de arma'
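
The two chained KeyErrors in this traceback follow from the try/except shape visible in the quoted lines: the fallback branch assumes the topic already holds ids, so an unknown token fails a second time in the id2token lookup. The sketch below mimics that shape with plain dicts; it is an illustration under that assumption, not gensim's code verbatim, and `ensure_ids_old_style` is a made-up name.

```python
token2id = {'computer': 0, 'system': 1, 'interface': 2}
id2token = {idx: token for token, idx in token2id.items()}


def ensure_ids_old_style(topic):
    """Mimic the try/except shape shown in the traceback (illustrative only)."""
    try:
        return [token2id[token] for token in topic]
    except KeyError:
        # Fallback assumes `topic` is a list of ids; an unknown token fails here as well.
        topic = [id2token[_id] for _id in topic]
        return [token2id[token] for token in topic]


outer_error = None
inner_error = None
try:
    ensure_ids_old_style(['human', 'computer'])
except KeyError as err:
    outer_error = err               # raised by the id2token lookup in the fallback
    inner_error = err.__context__   # raised first by the token2id lookup

print(repr(outer_error), repr(inner_error))
```

Both exceptions carry the same missing key, which matches the repeated `KeyError: 'de arma'` above.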

mpenkov added a commit that referenced this issue Jun 29, 2021
* Fixed coherence model issue #2711

* Handled token or id formatting of topics

* Raised error with wrong formatting

* removed blank lines

* updated code

* updated code

* revision on coherencemodel.py

* added new tests

* rm trailing whitespace

* more flake8 fixes

* still more flake8 fixes

* update changelog

Co-authored-by: Michael Penkov <misha.penkov@gmail.com>
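
The commit notes mention handling token or id formatting of topics and raising an error for wrong formatting. One way such logic could look is sketched below against a plain dict. This is an illustrative guess, not the code merged for this issue; the function name `to_ids` and the tie-breaking rule are assumptions.

```python
def to_ids(topic, token2id):
    """Interpret `topic` as tokens or ids, skipping entries unknown to `token2id`.

    Hypothetical sketch: whichever interpretation matches more entries wins.
    """
    known_ids = set(token2id.values())
    ids_from_tokens = [token2id[t] for t in topic if t in token2id]
    ids_from_ids = [i for i in topic if i in known_ids]

    if len(ids_from_tokens) > len(ids_from_ids):
        return ids_from_tokens
    if len(ids_from_ids) > len(ids_from_tokens):
        return ids_from_ids
    # Neither interpretation is better (for example, nothing matched at all).
    raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')


token2id = {'computer': 0, 'system': 1, 'interface': 2}

print(to_ids(['human', 'computer', 'system'], token2id))  # unknown 'human' is skipped
print(to_ids([0, 2], token2id))                           # ids pass through unchanged
```

With this shape, a topic word missing from the test dictionary is silently ignored instead of raising a KeyError, while a topic that matches nothing still fails loudly.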