Documentation is out of date #22

Closed
DavidNemeskey opened this issue Mar 30, 2011 · 4 comments

@DavidNemeskey
Contributor

Just another issue, for the fun of it. :)

While skimming through the documentation, I found a few places where it is outdated or lacking in certain respects. Here are a few off the top of my head (actually, it's not just the top, it's all I have found for now):

  1. There is no serialize method in the corpus classes, only saveCorpus (both in API and tutorials)
  2. The line lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, numTopics=2) does not work (http://nlp.fi.muni.cz/projekty/gensim/tut2.html). It should be id2word=dictionary.id2word.
  3. In http://nlp.fi.muni.cz/projekty/gensim/tut1.html, it is not at all clear what happens after you call dictionary.filterTokens, let alone compactify. I reckon the corpus would obviously "not work" after these commands, and has to be reparsed, but I am not sure. A few sentences on this would be welcome, along with a code example of how to reparse the corpus with such a dictionary (doc2bow with allowUpdate=False?)
@piskvorky
Owner

Hi David, I think documentation is very important, so improvements are very welcome!

However, in this case I think you are wrong -- I always check all tutorial examples before each release:

  1. The serialize method resides in corpora.IndexedCorpus class, a base class of other serializers.
  2. Dictionary objects can now act as simple mappings, so id2word=dictionary works (and is the preferred way of using id2word).
  3. Dictionary.filterTokens and compactify methods are not explained in great detail in the tutorial itself, but you can always look at the API documentation. In fact, the documentation for these functions is longer than their Python code :-)
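
To illustrate point 2, here is a minimal sketch of what "acting as a simple mapping" means in practice. `ToyDictionary` and `describe_topic` are hypothetical names made up for this illustration, not gensim's classes; the only point is that code which looks up `id2word[token_id]` cannot tell a plain dict from an object that implements the mapping protocol.

```python
class ToyDictionary:
    """Hypothetical stand-in for illustration only (not gensim's Dictionary)."""

    def __init__(self, texts):
        # Assign consecutive integer ids in order of first appearance.
        self.token2id = {}
        for text in texts:
            for token in text:
                self.token2id.setdefault(token, len(self.token2id))

    def __getitem__(self, token_id):
        # Reverse lookup: id -> token, so the object behaves like an id2word dict.
        for token, tid in self.token2id.items():
            if tid == token_id:
                return token
        raise KeyError(token_id)

    def __len__(self):
        return len(self.token2id)

    def keys(self):
        return list(self.token2id.values())


def describe_topic(weights, id2word):
    """A consumer that only needs id2word[token_id], whatever id2word is."""
    return " + ".join("%.2f*%s" % (w, id2word[i]) for i, w in weights)


texts = [["human", "computer"], ["computer", "graph"]]
dictionary = ToyDictionary(texts)
plain_dict = {0: "human", 1: "computer", 2: "graph"}

weights = [(1, 0.5), (2, 0.25)]
from_object = describe_topic(weights, dictionary)
from_dict = describe_topic(weights, plain_dict)
```

Both calls produce the same string, which is why passing the dictionary object itself as `id2word` can work wherever only lookups by id are needed.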

Your note about corpus parsing and reparsing is serious, though: it means it is not clear to users how the dictionary processing fits within corpus creation. That's a conceptual mistake, so the tutorial is apparently not doing a good job there. I will try to improve it.

EDIT: maybe the confusion comes from you using an older version of gensim? The documentation always reflects the latest release.

@DavidNemeskey
Contributor Author

Hi Radim,

Yes, I am using 0.7.7, so that must be the reason for 1 & 2.
If the documentation already reflects the API of 0.7.8, then more power to you. :)
However, you might consider making the older documentation available as well, in case someone has to live with an older version for a while (e.g. because of installation policy, etc.).

As for 3., I didn't mean the methods filterTokens and compactify themselves; I think those are pretty straightforward. So having a short example on corpus reparsing is all I ask for. :)
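
For concreteness, a rough sketch of the kind of reparsing example meant here. It uses a toy stand-in class, so `ToyDictionary` and its method names are hypothetical and only mimic the idea; this is not gensim's actual API or implementation. The assumption being illustrated is that removing tokens and compacting the ids leaves previously built bag-of-words vectors pointing at stale ids, so the texts have to be converted again with the updated dictionary (without letting the conversion add new tokens).

```python
class ToyDictionary:
    """Hypothetical stand-in for illustration only (not gensim's Dictionary)."""

    def __init__(self, texts):
        # Assign consecutive integer ids in order of first appearance.
        self.token2id = {}
        for text in texts:
            for token in text:
                self.token2id.setdefault(token, len(self.token2id))

    def doc2bow(self, tokens):
        # Convert tokens to sorted (id, count) pairs; unknown tokens are
        # ignored and the dictionary is never updated here.
        counts = {}
        for token in tokens:
            if token in self.token2id:
                tid = self.token2id[token]
                counts[tid] = counts.get(tid, 0) + 1
        return sorted(counts.items())

    def filter_tokens(self, bad_tokens):
        # Drop tokens; this leaves gaps in the id sequence.
        for token in bad_tokens:
            self.token2id.pop(token, None)

    def compactify(self):
        # Close the gaps by reassigning ids 0..n-1, keeping the old order.
        ordered = sorted(self.token2id, key=self.token2id.get)
        self.token2id = {token: new_id for new_id, token in enumerate(ordered)}


texts = [["human", "computer", "human"], ["computer", "graph"]]
dictionary = ToyDictionary(texts)

# Vectors built before filtering use the original ids.
old_corpus = [dictionary.doc2bow(text) for text in texts]

# After filtering and compacting, the surviving tokens get new ids,
# so old_corpus no longer matches the dictionary.
dictionary.filter_tokens(["human"])
dictionary.compactify()

# "Reparsing" is simply running the texts through the updated dictionary again.
new_corpus = [dictionary.doc2bow(text) for text in texts]
```

In this toy run, id 1 means "computer" in `old_corpus` but "graph" in `new_corpus`, which is exactly the mismatch that makes the old vectors unusable.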

@DavidNemeskey
Contributor Author

I am sorry, I closed this accidentally. I am still learning GitHub; I just wish the "Comment and close" button weren't the default. :/

@piskvorky
Owner

Learning GitHub is a never ending process... I filed one site bug report just yesterday :-)

Full documentation (including HTML) is version controlled and is part of each gensim release. So you can access the relevant version a) from the docs dir of the source .tgz package of your release, b) from GitHub, by selecting Switch Tags on the main gensim page, or c) from your local repo, by git checkout 0.7.7.

There are several questions about dictionaries and corpora on the mailing list now, not just yours, so apparently the tutorial on that part is insufficient. I'll try to improve it, but once you figure it out, please consider updating the docs yourself. I know gensim too well, so it's difficult for me to have a detached perspective on some things. I may see stuff as obvious and misunderstand problems.
