Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MalletCorpus breaks with metadata #609

Open
vspinu opened this issue Feb 11, 2016 · 5 comments
Open

MalletCorpus breaks with metadata #609

vspinu opened this issue Feb 11, 2016 · 5 comments
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills

Comments

@vspinu
Copy link

vspinu commented Feb 11, 2016

LOW input file with metadata in mallet format:

id1 en aa bb cc dd
id2 en aa bb cc dd

Trying to load it with metadata gives:

>>> corpora.MalletCorpus("./tmp.txt", metadata=True)
2016-02-11 18:08:50,147 : INFO : loading corpus from ./tmp.txt
2016-02-11 18:08:50,147 : INFO : extracting vocabulary from the corpus
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vspinu/.local/lib/python2.7/site-packages/gensim/corpora/malletcorpus.py", line 41, in __init__
    LowCorpus.__init__(self, fname, id2word)
  File "/home/vspinu/.local/lib/python2.7/site-packages/gensim/corpora/lowcorpus.py", line 78, in __init__
    all_terms.update(word for word, wordCnt in doc)
  File "/home/vspinu/.local/lib/python2.7/site-packages/gensim/corpora/lowcorpus.py", line 78, in <genexpr>
    all_terms.update(word for word, wordCnt in doc)
ValueError: too many values to unpack
@cscorley cscorley added the bug Issue described a bug label Feb 11, 2016
@cscorley
Copy link
Contributor

Indeed, this is handled incorrectly. Thanks for reporting!

As a workaround until a fix is released, you can init the corpus and turn on metadata gathering later:

c = corpora.MalletCorpus("./tmp.txt")
c.metadata = True

Or provide a pre-built dictionary:

d = corpora.Dictionary(...)
c = corpora.MalletCorpus("./tmp.txt", id2word=d)

@cscorley cscorley added the difficulty easy Easy issue: required small fix label Feb 11, 2016
@cscorley
Copy link
Contributor

Marking this as easy, if anyone wants to take a shot at it fixing it before I do it this weekend -- feel free to contact me for pointers.

@cscorley
Copy link
Contributor

@piskvorky So, I was working on this specific issue, but I began testing/looking at the different corpora and noticed there's a pretty inconsistent use of metadata throughout (so much for being easy).

Some corpora accept metadata in the constructor, while others don't but still set it to False and check for it during iteration. Some do not even return metadata (UciCorpus & others), while some simply return the line number (TextCorpus). I'm not sure any corpus outside of MalletCorpus even has any real metadata to return, since getting document numbers is as easy as wrapping the corpus iterator with enumerate().

A few options are:

  1. Add metadata handling to all corpora as needed, in the constructors, &c; defaulting to yielding the line/document number,
  2. Remove some of the metadata handling (e.g., TextCorpus) because it's not very useful.

What do you think is best here?

(updated)

@piskvorky
Copy link
Owner

The metadata functionality came from a user PR some time ago. I merged it because it didn't have any side effects (backward compatible), but it's not a promoted or featured functionality. I don't even remember exactly what it does.

I wouldn't like gensim to go in this direction (integrating unrelated data structures, basically databases, into its core).

So, from your two options, I'm +1 on the second one. Perhaps asking first to see if anybody still uses it, so we don't break things too badly.

Although if the users and other devs (CC @gojomo @tmylk ) disagree, I don't care much either way. As long as the changes are backward compatible, don't disrupt existing use cases and don't bring too much magic or maintenance headache.

@vspinu
Copy link
Author

vspinu commented Feb 16, 2016

When I first seen metadata argument I though it's a user supplied dictionary that will be automatically saved and loaded with the corpus. I think this is useful as it takes a common task from user's shoulders, but it's not strictly needed.

IMHO, the really needed piece for all corpora is to be able to store document ids within the corpora. I doubt there is a real app out there with no need for linking external databases. Right now users need to track document names separately and save then along corpora in custom built wrappers.

But, once you decide to keep doc names as part of the corpora, it's one small step towards keeping a full dict of metadata with a special "doc_names" slot and open the dict for user supplied custom metadata.

@menshikh-iv menshikh-iv added difficulty medium Medium issue: required good gensim understanding & python skills and removed difficulty easy Easy issue: required small fix labels Oct 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills
Projects
None yet
Development

No branches or pull requests

4 participants