Use of `filter_tokens` and `add_documents` on a Dictionary results in multiple token assignment #326
Comments
Hm, yep. Good catch; going to look into this today. Thanks for the report!
Maybe we should call `compactify` automatically? Not sure why I left these two as separate methods, but updating the Dictionary after filtering shouldn't leave gaps in the ids.
@piskvorky Just created a few tests for this, and that fix seems to work. The problem stems from assuming there are no gaps in the token ids. (Another option, which avoids compactifying so often, is to not care about gaps at all and draw new ids from an infinite number generator, but that looks harder to implement.)
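The alternative mentioned above, tolerating gaps and drawing new ids from an ever-increasing counter, could look something like this (a standalone sketch with made-up names, not anything gensim implements):

```python
import itertools

next_id = itertools.count()  # infinite id generator, never reuses an id

token2id = {}

def add_token(token):
    # New ids come from the counter, so gaps left by filtering never
    # cause collisions; the trade-off is that ids are no longer dense.
    if token not in token2id:
        token2id[token] = next(next_id)
    return token2id[token]

for t in ["foo", "bar", "baz"]:
    add_token(t)
del token2id["bar"]  # leaves a gap at id 1, but that's fine now
add_token("qux")     # gets id 3, no collision
print(token2id)      # {'foo': 0, 'baz': 2, 'qux': 3}
```

The downside is that id ranges grow without bound, which matters for code that sizes arrays by the number of tokens.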
- Always compactify after Dictionary token filtering
- Add a test for Dictionary token filtering
- Add a basic test for Dictionary merging
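The first point, always compactifying after filtering, is what removes the gaps. A minimal sketch of what a compactify step does (a standalone illustration, not gensim's actual code):

```python
def compactify(token2id):
    """Reassign token ids so they are the consecutive integers
    0..len(token2id)-1, removing gaps left by filtering.

    After this, giving the next new token the id len(token2id)
    is always safe."""
    # Map each surviving old id to a fresh consecutive id, preserving order.
    remap = {old: new for new, old in enumerate(sorted(token2id.values()))}
    return {token: remap[old] for token, old in token2id.items()}

# After filtering out the token with id 1, the mapping has a gap:
filtered = {"foo": 0, "bar": 2}
compacted = compactify(filtered)
print(compacted)  # {'foo': 0, 'bar': 1}
```

With the ids dense again, appending new tokens with `len(token2id)` can no longer collide with an existing id.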
Thanks for the quick turn-around!
…and commit 4863040), so I fixed that point in the tutorial.
I'm on the latest version, 0.11.1-1, and have run into a bug where adding documents after I have filtered tokens results in multiple words being assigned the same token id. Note how we now have `bar` and `foo` both mapping to `2`! This of course results in an incorrect bag of words when we convert a document: