MalletCorpus breaks with metadata #609
Comments
Indeed, this is handled incorrectly. Thanks for reporting! As a workaround until a fix is released, you can init the corpus and turn on metadata gathering later:
    c = corpora.MalletCorpus("./tmp.txt")
    c.metadata = True
Or provide a pre-built dictionary:
    d = corpora.Dictionary(...)
    c = corpora.MalletCorpus("./tmp.txt", id2word=d) |
Marking this as easy. If anyone wants to take a shot at fixing it before I do this weekend, feel free to contact me for pointers. |
@piskvorky So, I was working on this specific issue, but I began testing/looking at the different corpora and noticed there's a pretty inconsistent use of metadata. Some corpora accept … A few options are: …
What do you think is best here? (updated) |
The metadata functionality came from a user PR some time ago. I merged it because it didn't have any side effects (backward compatible), but it's not a promoted or featured functionality. I don't even remember exactly what it does. I wouldn't like gensim to go in this direction (integrating unrelated data structures, basically databases, into its core). So, from your two options, I'm +1 on the second one. Perhaps asking first to see if anybody still uses it, so we don't break things too badly. Although if the users and other devs (CC @gojomo @tmylk ) disagree, I don't care much either way. As long as the changes are backward compatible, don't disrupt existing use cases and don't bring too much magic or maintenance headache. |
When I first saw the metadata argument, I thought it was a user-supplied dictionary that would be automatically saved and loaded with the corpus. I think this is useful, as it takes a common task off the user's shoulders, but it's not strictly needed. IMHO, the really needed piece for all corpora is the ability to store document ids within the corpus. I doubt there is a real app out there with no need for linking to external databases. Right now users need to track document names separately and save them alongside corpora in custom-built wrappers. But once you decide to keep doc names as part of the corpus, it's one small step towards keeping a full dict of metadata with a special "doc_names" slot, and opening the dict to user-supplied custom metadata. |
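To make the "custom-built wrapper" idea above concrete, here is a minimal standard-library sketch of a reader that keeps document ids next to the tokens. It does not use gensim and is not how MalletCorpus is implemented. The function name `read_mallet_lines`, the `with_metadata` flag, and the assumed line layout (`<doc_id> <lang> <token> ...`, whitespace-separated) are illustrative assumptions.

```python
from typing import Iterable, Iterator, List, Tuple, Union

Doc = List[str]
Meta = Tuple[str, str]  # (doc_id, lang); hypothetical metadata shape


def read_mallet_lines(
    lines: Iterable[str], with_metadata: bool = False
) -> Iterator[Union[Doc, Tuple[Doc, Meta]]]:
    """Yield token lists from lines assumed to look like '<doc_id> <lang> <token> ...'.

    With with_metadata=True, each item is (tokens, (doc_id, lang)) so the
    caller can link documents back to an external store.
    """
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            # Skip blank or malformed lines that lack an id and a language field.
            continue
        doc_id, lang, tokens = parts[0], parts[1], parts[2:]
        if with_metadata:
            yield tokens, (doc_id, lang)
        else:
            yield tokens


# Usage: collect document names alongside the tokenized documents.
sample = ["doc1 en human computer interface", "doc2 en graph trees"]
doc_names = [meta[0] for _, meta in read_mallet_lines(sample, with_metadata=True)]
```

Keeping the id in the same tuple as the tokens means the two cannot drift out of sync, which is the main risk of tracking names in a separate list.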
Input file with metadata in mallet format:
Trying to load it with metadata gives: