Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move BrownCorpus from word2vec to gensim.corpora #2956

Open
piskvorky opened this issue Sep 24, 2020 · 6 comments
Open

Move BrownCorpus from word2vec to gensim.corpora #2956

piskvorky opened this issue Sep 24, 2020 · 6 comments
Labels
good first issue Issue for new contributors (not required gensim understanding + very simple) housekeeping internal tasks and processes

Comments

@piskvorky
Copy link
Owner

Not a high-priority at all, but it'd be more sensible for such a tutorial/testing utility corpus to be implemented elsewhere - maybe under /test/ or some other data- or doc- related module – rather than in gensim.models.word2vec.

Originally posted by @gojomo in #2939 (comment)

@piskvorky piskvorky added good first issue Issue for new contributors (not required gensim understanding + very simple) housekeeping internal tasks and processes labels Sep 24, 2020
@rishabhvarshney14
Copy link

Can I work on this?

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 24, 2020

Yes :)

We'll want to retain an alias in word2vec.py, so that people's code that relies on the current location continues to work. But the code itself should live under gensim.corpora.
Same with other such corpus classes that are currently under gensim.models – I believe there were several.

Thanks!

@gojomo
Copy link
Collaborator

gojomo commented Sep 24, 2020

The same goes for the Text8Corpus alongside it - if that class survives at all. Probably, LineSentence should be able to handle not just "more than X tokens" lines but also "all tokens on one line" - in which case, it can subsume what Text8Corpus was doing. (And then, maybe it lives in a more sensible place, like utils/corpora, instead of as a tag-on under word2vec?

@rishabhvarshney14
Copy link

@piskvorky I am thinking about moving BrownCorpus to another file named browncorpus.py in corpora also there is TaggedBrownCorpus in doc2vec which I will move to browncorpus.py along with BrownCorpus and will do it for other utility corpus like Text8Corpus. Is this what should be done? Please correct me if I am wrong.

@piskvorky
Copy link
Owner Author

piskvorky commented Sep 25, 2020

Yes, plus:

  1. Add aliases to the old locations, so code that relies on the original location continues to work.
  2. Make sure the new files have proper headers and docstrings.
  3. Add .rst documentation for each new module here. You can use the same "stub" template as the other existing .rst files in that directory. This step ensures the new modules will appear in the public API documentation.

Do we have unit tests for those classes? If not, could you add them too please? Thanks.

@gojomo
Copy link
Collaborator

gojomo commented Sep 27, 2020

As these classes haven't had unit-tests, I wouldn't burden the simple housekeeping step (moving them to a more logical place) with the requirement to add unit tests for the 1st time.

What would it mean to unit-test a corpus, anyway? Check that it iterates to give an expected number of items/words? But what if that requires downloading extra data that isn't even in the standard gensim project? (I believe that's the case with both Brown and Text8.) Then, the "unit test" is actually adding new download code that wouldn't otherwise be present. It'd be slowing every CI test run with two new remote-downloads – which could (and almost certainly eventually will) fail due to matters entirely outside Gensim's control. And the code it's testing isn't very high-value – only used in a few demos. So I'd view creating unit texts for these classes as negative-value work: creating new costs/risks (slower tests that may generate spurious failures) for negligible value.

OTOH, if Text8Corpus's functionality could be folded into LineSentence, that new more-general corpus iterable would still feature in many tutorials/test-code. It could be (somewhat) meaningfully-tested with a small (self-contained) fragment of a long-lined corpus file (without claiming to actually be a unit-tested Text8Corpus).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Issue for new contributors (not required gensim understanding + very simple) housekeeping internal tasks and processes
Projects
None yet
Development

No branches or pull requests

3 participants