MalletCorpus breaks with metadata #609

vspinu · 2016-02-11T17:09:44Z

LOW input file with metadata in mallet format:

id1 en aa bb cc dd
id2 en aa bb cc dd

Trying to load it with metadata gives:

>>> corpora.MalletCorpus("./tmp.txt", metadata=True)
2016-02-11 18:08:50,147 : INFO : loading corpus from ./tmp.txt
2016-02-11 18:08:50,147 : INFO : extracting vocabulary from the corpus
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vspinu/.local/lib/python2.7/site-packages/gensim/corpora/malletcorpus.py", line 41, in __init__
    LowCorpus.__init__(self, fname, id2word)
  File "/home/vspinu/.local/lib/python2.7/site-packages/gensim/corpora/lowcorpus.py", line 78, in __init__
    all_terms.update(word for word, wordCnt in doc)
  File "/home/vspinu/.local/lib/python2.7/site-packages/gensim/corpora/lowcorpus.py", line 78, in <genexpr>
    all_terms.update(word for word, wordCnt in doc)
ValueError: too many values to unpack
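
The failure can be reproduced without gensim. The sketch below assumes, going only by the traceback, that once metadata is enabled each iterated item is a (document, metadata) pair rather than a bare list of (word, count) pairs. `parse_line` and `build_vocab` are hypothetical stand-ins for illustration, not gensim's actual code.

```python
from collections import Counter


def parse_line(line, metadata=False):
    """Hypothetical stand-in for a Mallet line parser (not gensim's code)."""
    docid, lang, *words = line.split()
    doc = sorted(Counter(words).items())  # [(word, count), ...]
    # Assumption: with metadata on, the document is wrapped in a pair.
    return (doc, (docid, lang)) if metadata else doc


def build_vocab(lines, metadata=False):
    """Collect terms by unpacking (word, count), as the traceback shows."""
    all_terms = set()
    for line in lines:
        doc = parse_line(line, metadata)
        all_terms.update(word for word, count in doc)
    return all_terms


lines = ["id1 en aa bb cc dd", "id2 en aa bb cc dd"]
vocab = build_vocab(lines)  # fine: every item is a (word, count) pair

try:
    # Now the first item of `doc` is the whole list of four pairs,
    # which cannot be unpacked into (word, count).
    build_vocab(lines, metadata=True)
    failed = False
except ValueError:
    failed = True
```

Under that assumption, the vocabulary pass has to strip the metadata (or run with metadata off) before it unpacks the pairs.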

cscorley · 2016-02-11T20:04:33Z

Indeed, this is handled incorrectly. Thanks for reporting!

As a workaround until a fix is released, you can init the corpus and turn on metadata gathering later:

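from gensim import corpora
c = corpora.MalletCorpus("./tmp.txt")  # vocabulary is built with metadata off
c.metadata = True  # enable metadata gathering afterwards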

Or provide a pre-built dictionary:

d = corpora.Dictionary(...)
c = corpora.MalletCorpus("./tmp.txt", id2word=d)

cscorley · 2016-02-11T20:26:29Z

Marking this as easy. If anyone wants to take a shot at fixing it before I do this weekend, feel free to contact me for pointers.

cscorley · 2016-02-16T00:40:21Z

@piskvorky So, I was working on this specific issue, but I began testing/looking at the different corpora and noticed there's a pretty inconsistent use of metadata throughout (so much for being easy).

Some corpora accept metadata in the constructor, while others don't but still set it to False and check for it during iteration. Some do not even return metadata (UciCorpus & others), while some simply return the line number (TextCorpus). I'm not sure any corpus outside of MalletCorpus even has any real metadata to return, since getting document numbers is as easy as wrapping the corpus iterator with enumerate().
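
As a minimal sketch of the enumerate() point, assuming a corpus is just an iterable of documents (the toy list below stands in for a real corpus object):

```python
# A toy corpus: each document is a list of (token_id, count) pairs.
corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 3)]]

# enumerate() pairs each document with its position, so no metadata
# flag is needed just to recover document numbers.
numbered = [(docno, doc) for docno, doc in enumerate(corpus)]
```

Because enumerate() works on any iterable, the same pattern applies to a corpus that streams documents from disk.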

A few options are:

1. Add metadata handling to all corpora as needed (in the constructors, etc.), defaulting to yielding the line/document number.
2. Remove some of the metadata handling (e.g., TextCorpus) because it's not very useful.

What do you think is best here?

(updated)

piskvorky · 2016-02-16T02:28:12Z

The metadata functionality came from a user PR some time ago. I merged it because it didn't have any side effects (backward compatible), but it's not a promoted or featured functionality. I don't even remember exactly what it does.

I wouldn't like gensim to go in this direction (integrating unrelated data structures, basically databases, into its core).

So, from your two options, I'm +1 on the second one. Perhaps asking first to see if anybody still uses it, so we don't break things too badly.

Although if the users and other devs (CC @gojomo @tmylk ) disagree, I don't care much either way. As long as the changes are backward compatible, don't disrupt existing use cases and don't bring too much magic or maintenance headache.

vspinu · 2016-02-16T09:47:55Z

When I first saw the metadata argument, I thought it was a user-supplied dictionary that would be automatically saved and loaded with the corpus. I think this is useful as it takes a common task off the user's shoulders, but it's not strictly needed.

IMHO, the really needed piece for all corpora is the ability to store document ids within the corpus. I doubt there is a real app out there with no need for linking to external databases. Right now users need to track document names separately and save them alongside corpora in custom-built wrappers.

But once you decide to keep doc names as part of the corpora, it's one small step towards keeping a full dict of metadata with a special "doc_names" slot and opening the dict for user-supplied custom metadata.
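
To make the idea concrete, here is a rough sketch of the kind of custom wrapper described above. `NamedCorpus` and its `named()` helper are hypothetical names for illustration only; nothing like this is claimed to exist in gensim.

```python
class NamedCorpus:
    """Hypothetical wrapper: documents plus a metadata dict (not a gensim class)."""

    def __init__(self, docs, doc_names, **user_metadata):
        docs = list(docs)
        doc_names = list(doc_names)
        if len(docs) != len(doc_names):
            raise ValueError("need exactly one name per document")
        self.docs = docs
        # "doc_names" is the special slot; every other key is user supplied.
        self.metadata = dict(user_metadata, doc_names=doc_names)

    def __iter__(self):
        # Iteration yields plain documents, as a corpus without metadata would.
        return iter(self.docs)

    def __len__(self):
        return len(self.docs)

    def named(self):
        """Yield (doc_name, document) pairs for linking to external records."""
        return zip(self.metadata["doc_names"], self.docs)


corpus = NamedCorpus([[(0, 1)], [(1, 2)]], ["id1", "id2"], source="tmp.txt")
pairs = list(corpus.named())
```

Keeping iteration free of metadata means code that expects plain documents keeps working, while `named()` gives the link back to external ids.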

cscorley added the "bug" label (Issue described a bug) Feb 11, 2016

cscorley added the "difficulty easy" label (Easy issue: required small fix) Feb 11, 2016

cscorley added a commit that referenced this issue Feb 16, 2016

Fixes issue #609: MalletCorpus metadata=True in constructor

7906b7c

menshikh-iv added the "difficulty medium" label (Medium issue: required good gensim understanding & python skills) and removed the "difficulty easy" label (Easy issue: required small fix) Oct 2, 2017
