
Use of filter_tokens and add_documents on a Dictionary results in multiple token assignment #326

Closed
bmabey opened this issue Apr 18, 2015 · 4 comments


bmabey commented Apr 18, 2015

I'm on the latest version, 0.11.1-1, and have run into a bug where adding documents after filtering tokens results in multiple words being assigned the same token id.

In [227]: d = gensim.corpora.Dictionary([['foo','bar','baz'], ['foo','bar','bar','baz']])

In [228]: d.token2id
Out[228]: {u'bar': 0, u'baz': 1, u'foo': 2}

In [229]: d.filter_tokens([0])

In [230]: d.token2id
Out[230]: {u'baz': 1, u'foo': 2}

In [231]: d.add_documents([['foo','bar','baz','bar']])

In [232]: d.token2id
Out[232]: {u'bar': 2, u'baz': 1, u'foo': 2}

Note how bar and foo now both map to 2! This of course results in an incorrect bag-of-words when we convert a document:

In [240]: d.doc2bow(['foo','foo','bar','baz','baz','baz'])
Out[240]: [(1, 3), (2, 2)]
@cscorley (Contributor) commented:

Hm, yep. Good catch; going to look into this today. Thanks for the report!

@piskvorky (Owner) commented:

Maybe we should call compactify() automatically after filter_tokens().

Not sure why I left these two as separate methods, but updating the Dictionary after filter_tokens (without compactifying) indeed won't work.
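
In the meantime, calling compactify() by hand between the two steps sidesteps the collision. A minimal sketch using only the public Dictionary API (the expected outputs in the comments assume the ids shown above):

from gensim.corpora import Dictionary

d = Dictionary([['foo', 'bar', 'baz'], ['foo', 'bar', 'bar', 'baz']])
d.filter_tokens([d.token2id['bar']])  # drop 'bar', leaving a gap in the ids
d.compactify()                        # remap to contiguous ids: {u'baz': 0, u'foo': 1}
d.add_documents([['foo', 'bar', 'baz', 'bar']])
print(d.token2id)                     # 'bar' now gets a fresh, unique id (2)
print(d.doc2bow(['foo', 'foo', 'bar', 'baz', 'baz', 'baz']))
# [(0, 3), (1, 2), (2, 1)] -- each token counted under its own id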

@cscorley (Contributor) commented:

@piskvorky Just created a few tests for this, and that seems to work. The problem seems to stem from assuming there are no gaps in the id sequence (see the sketch below). Another option, which would avoid repeated compactifying, is to tolerate gaps and draw new ids from an infinite number generator, but that looks harder to implement.
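
For context, a paraphrased sketch of why the gap bites: new ids are taken from the dictionary's current size (simplified here, not gensim's exact code path):

token2id = {u'baz': 1, u'foo': 2}  # u'bar' (id 0) was filtered out, leaving a gap
new_id = len(token2id)             # == 2, but id 2 already belongs to u'foo'
token2id[u'bar'] = new_id          # collision: u'bar' and u'foo' now share id 2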

cscorley added a commit that referenced this issue Apr 18, 2015
- Always compactify after Dictionary token filtering
- Add a test for Dictionary token filtering
- Add a basic test for Dictionary merging
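
In essence, the fix makes filter_tokens end with a compactify() call. A simplified sketch of the idea (not the exact patch):

def filter_tokens(self, bad_ids=None, good_ids=None):
    """Remove the tokens with the given ids, then close the resulting id gaps."""
    if bad_ids is not None:
        bad_ids = set(bad_ids)
        self.token2id = {token: tokenid
                         for token, tokenid in self.token2id.items()
                         if tokenid not in bad_ids}
    # always compactify, so that later add_documents calls assign unique ids
    self.compactify()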

bmabey commented Apr 18, 2015

Thanks for the quick turn-around!

VorontsovIE added a commit to VorontsovIE/gensim that referenced this issue Jul 24, 2017
…and commit 4863040), so I fixed that point in tutorial.
VorontsovIE added a commit to VorontsovIE/gensim that referenced this issue Jul 25, 2017
…and commit 4863040), so I fixed that point in tutorial.