
Use of filter_tokens and add_documents on a Dictionary results in multiple token assignment #326

Closed
bmabey opened this issue Apr 18, 2015 · 4 comments


bmabey commented Apr 18, 2015

I'm on the latest version, 0.11.1-1, and have run into a bug where adding documents after filtering tokens results in multiple words being assigned the same token id.

In [227]: d = gensim.corpora.Dictionary([['foo','bar','baz'], ['foo','bar','bar','baz']])

In [228]: d.token2id
Out[228]: {u'bar': 0, u'baz': 1, u'foo': 2}

In [229]: d.filter_tokens([0])

In [230]: d.token2id
Out[230]: {u'baz': 1, u'foo': 2}

In [231]: d.add_documents([['foo','bar','baz','bar']])

In [232]: d.token2id
Out[232]: {u'bar': 2, u'baz': 1, u'foo': 2}

Note how bar and foo now both map to 2! This of course results in an incorrect bag-of-words when we convert a document:

In [240]: d.doc2bow(['foo','foo','bar','baz','baz','baz'])
Out[240]: [(1, 3), (2, 2)]
@cscorley (Contributor) commented:

Hm, yep. Good catch; going to look into this today. Thanks for the report!

@piskvorky (Owner) commented:

Maybe we should call compactify() automatically after filter_tokens().

Not sure why I left these two as separate methods, but updating the Dictionary after filter_tokens (without compactifying) indeed won't work.
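
In the meantime, calling compactify() by hand between the two steps sidesteps the collision. A minimal sketch using only the public Dictionary API (the expected outputs in the comments assume the ids shown above):

from gensim.corpora import Dictionary

d = Dictionary([['foo', 'bar', 'baz'], ['foo', 'bar', 'bar', 'baz']])
d.filter_tokens([d.token2id['bar']])  # drop 'bar', leaving a gap in the ids
d.compactify()                        # remap to contiguous ids: {u'baz': 0, u'foo': 1}
d.add_documents([['foo', 'bar', 'baz', 'bar']])
print(d.token2id)                     # 'bar' now gets a fresh, unique id (2)
print(d.doc2bow(['foo', 'foo', 'bar', 'baz', 'baz', 'baz']))
# [(0, 3), (1, 2), (2, 1)] -- each token counted under its own id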

@cscorley (Contributor) commented:

@piskvorky Just created a few tests for this, and that seems to work. The problem seems to stem from assuming there are no gaps in the id sequence (see the sketch below). Another option, which would avoid repeated compactifying, is to tolerate gaps and draw new ids from an infinite number generator, but that looks harder to implement.
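
For context, a paraphrased sketch of why the gap bites: new ids are taken from the dictionary's current size (simplified here, not gensim's exact code path):

token2id = {u'baz': 1, u'foo': 2}  # u'bar' (id 0) was filtered out, leaving a gap
new_id = len(token2id)             # == 2, but id 2 already belongs to u'foo'
token2id[u'bar'] = new_id          # collision: u'bar' and u'foo' now share id 2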

cscorley added a commit that referenced this issue Apr 18, 2015
- Always compactify after Dictionary token filtering
- Add a test for Dictionary token filtering
- Add a basic test for Dictionary merging
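
In essence, the fix makes filter_tokens end with a compactify() call. A simplified sketch of the idea (not the exact patch):

def filter_tokens(self, bad_ids=None, good_ids=None):
    """Remove the tokens with the given ids, then close the resulting id gaps."""
    if bad_ids is not None:
        bad_ids = set(bad_ids)
        self.token2id = {token: tokenid
                         for token, tokenid in self.token2id.items()
                         if tokenid not in bad_ids}
    # always compactify, so that later add_documents calls assign unique ids
    self.compactify()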

bmabey commented Apr 18, 2015

Thanks for the quick turn-around!

VorontsovIE added a commit to VorontsovIE/gensim that referenced this issue Jul 24, 2017
…and commit 4863040), so I fixed that point in tutorial.
VorontsovIE added a commit to VorontsovIE/gensim that referenced this issue Jul 25, 2017
…and commit 4863040), so I fixed that point in tutorial.