Coherence score on new data Key Error #2711

bauer-jan · 2019-12-27T14:50:00Z

I want to compare different models (LDA, Mallet, etc.) using cross-validation. I train each model on the training data and want to calculate the coherence score (c_v) on the test data. I do something like this:

    dictionary = gensim.corpora.Dictionary(test)
    corpus = [dictionary.doc2bow(text) for text in test]

    cm = CoherenceModel(topics=topics, 
                        corpus=corpus, 
                        texts=test,
                        dictionary=dictionary, 
                        coherence="c_v")
    
    cm.get_coherence()

When a word from a topic learned on the training data is not present in the test data, this eventually raises a KeyError from the dictionary. Does anyone know anything about this? Is this a bug, and if not, how do I solve it?
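One way to confirm this diagnosis before calling `CoherenceModel` is to list the topic words that never occur in the test texts. The following is a minimal sketch in plain Python, not part of gensim: `missing_topic_words` is a hypothetical helper, and the toy `topics` and `test_texts` are made up for illustration, assuming both are lists of token lists as in the snippet above.

```python
def missing_topic_words(topics, texts):
    """Return {topic index: [words of that topic that never occur in texts]}."""
    # Vocabulary of the held-out texts: every distinct token in any document.
    vocab = {token for doc in texts for token in doc}
    missing = {}
    for i, topic in enumerate(topics):
        absent = [word for word in topic if word not in vocab]
        if absent:
            missing[i] = absent
    return missing


# Toy data, invented for this example.
topics = [["human", "computer"], ["graph", "trees"]]
test_texts = [["computer", "system"], ["graph", "trees"]]

print(missing_topic_words(topics, test_texts))  # {0: ['human']}
```

An empty result means every topic word has an entry in a dictionary built from the test texts, so the token lookup should not fail for that reason.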

mpenkov · 2019-12-29T13:36:00Z

Can you please make your example reproducible? Ideally, you should be able to copy-paste it into a REPL shell to demonstrate the problem.

bauer-jan · 2019-12-29T15:42:30Z

We have two topics and want to calculate their coherence score on new data. If a word like "human" is present in a topic but not in the new data over which we want to calculate the coherence score, a KeyError is thrown.

    from gensim.models.coherencemodel import CoherenceModel
    from gensim.corpora.dictionary import Dictionary

    topics = [['human', 'computer', 'system', 'interface'],
              ['graph', 'minors', 'trees', 'eps']]

    test_WITH_human = [['human', 'computer', 'system', 'interface'],
                       ['graph', 'minors', 'trees', 'eps']]

    test_WITHOUT_human = [['computer', 'system', 'interface'],
                          ['graph', 'minors', 'trees', 'eps']]

If I calculate the coherence score over test_WITH_human, it works and we get a score of 1.0. If I calculate it over test_WITHOUT_human, an error is thrown:

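    # Build the dictionary and bag-of-words corpus from the held-out texts only.
    # Dictionary is used directly because only that name was imported above.
    dictionary = Dictionary(test_WITHOUT_human)
    corpus = [dictionary.doc2bow(text) for text in test_WITHOUT_human]

    cm = CoherenceModel(topics=topics,
                        corpus=corpus,
                        texts=test_WITHOUT_human,
                        dictionary=dictionary,
                        coherence="c_v")

    # Raises KeyError: 'human' because that topic word is not in the dictionary.
    cm.get_coherence()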

Error message:

    --> 445             return np.array([self.dictionary.token2id[token] for token in topic])
        446         except KeyError:  # might be a list of token ids already, but let's verify all in dict

    KeyError: 'human'
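One possible workaround, sketched here under stated assumptions rather than as the library's intended usage, is to drop topic words that the test dictionary does not contain before building the `CoherenceModel`. The helper `restrict_topics_to_vocab` and its `min_words` parameter are hypothetical names; `token2id` stands in for `dictionary.token2id` and is written out by hand so the example runs without gensim. Note that this changes what is scored: coherence is then computed over the surviving words only.

```python
def restrict_topics_to_vocab(topics, token2id, min_words=2):
    """Keep only topic words that are keys of token2id.

    Topics left with fewer than min_words words are dropped, since a
    coherence measure needs at least a pair of words to compare.
    """
    restricted = []
    for topic in topics:
        kept = [word for word in topic if word in token2id]
        if len(kept) >= min_words:
            restricted.append(kept)
    return restricted


topics = [["human", "computer", "system", "interface"],
          ["graph", "minors", "trees", "eps"]]

# Hand-written stand-in for dictionary.token2id built from test_WITHOUT_human.
token2id = {"computer": 0, "system": 1, "interface": 2,
            "graph": 3, "minors": 4, "trees": 5, "eps": 6}

safe_topics = restrict_topics_to_vocab(topics, token2id)
print(safe_topics)
# [['computer', 'system', 'interface'], ['graph', 'minors', 'trees', 'eps']]
```

Passing `topics=safe_topics` together with the test `corpus`, `texts` and `dictionary` should avoid the failing token lookup, because every remaining word has an id. Whether scores computed on such trimmed topics are comparable across models is a separate question that this sketch does not answer.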

nadiafelix · 2021-04-06T20:50:21Z

I have the same problem. What is the solution?

    KeyError                                  Traceback (most recent call last)
    /usr/local/lib/python3.7/dist-packages/gensim/models/coherencemodel.py in _ensure_elements_are_ids(self, topic)
        438         try:
    --> 439             return np.array([self.dictionary.token2id[token] for token in topic])
        440         except KeyError:  # might be a list of token ids already, but let's verify all in dict

    5 frames
    KeyError: 'de arma'

    During handling of the above exception, another exception occurred:

    KeyError                                  Traceback (most recent call last)
    /usr/local/lib/python3.7/dist-packages/gensim/models/coherencemodel.py in <listcomp>(.0)
        439             return np.array([self.dictionary.token2id[token] for token in topic])
        440         except KeyError:  # might be a list of token ids already, but let's verify all in dict
    --> 441             topic = [self.dictionary.id2token[_id] for _id in topic]
        442             return np.array([self.dictionary.token2id[token] for token in topic])
        443

    KeyError: 'de arma'

* Fixed coherence model issue #2711
* Handled token or id formatting of topics
* Raised error with wrong formatting
* removed blank lines
* updated code
* updated code
* revision on coherencemodel.py
* added new tests
* rm trailing whitespace
* more flake8 fixes
* still more flake8 fixes
* update changelog

Co-authored-by: Michael Penkov <misha.penkov@gmail.com>

mpenkov added the "need info" label (Not enough information for reproduce an issue, need more info from author) Dec 29, 2019

pietrotrope mentioned this issue May 7, 2020

Fixed KeyError in coherence model #2830

Merged

piskvorky mentioned this issue Apr 9, 2021

Coherence key error on held out set #3111

Closed

mpenkov closed this as completed in #2830 Jun 29, 2021
