Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove/refactor useless subpackages #1584

Open
5 of 6 tasks
menshikh-iv opened this issue Sep 11, 2017 · 16 comments
Open
5 of 6 tasks

Remove/refactor useless subpackages #1584

menshikh-iv opened this issue Sep 11, 2017 · 16 comments
Labels
breaks backward-compatibility Change breaks backward compatibility difficulty medium Medium issue: required good gensim understanding & python skills

Comments

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Sep 11, 2017

In gensim, we have many sub-packages, but several of this should be a part of another subpackage, another part is broken/useless and should be removed.

Candidates:

  • /examples - Old broken code + confused users with name, should be removed.
  • /parsing - Not relevant for gensim (all of this already implemented in NLTK/etc.), should be removed (but before, need to check that this isn't used).
  • /scripts - Many scripts for Wikipedia + symlinks + some conversions + w2v_standalone. Need to look into all wiki scripts and understand why this needed, remove all that no needed (w2v_standalone too).
  • /summarization - need to refactor code and create one model, no need distinct subpackage for this.
  • /topic_coherence - same as summarization
  • nose.py - unused runner for nose, should be removed.
@souravsingh
Copy link
Contributor

@menshikh-iv Should all the scripts in the examples directory be removed?

@menshikh-iv
Copy link
Contributor Author

menshikh-iv commented Sep 14, 2017

@souravsingh Yes, but I personally will works with this, because we need to do everything very carefully (I mean all issues from project)

@menshikh-iv menshikh-iv added breaks backward-compatibility Change breaks backward compatibility difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 2, 2017
@menshikh-iv
Copy link
Contributor Author

menshikh-iv commented Oct 9, 2017

Update

We discussed it with @macks22 in #1607 and I thinks we shouldn't miss good hierarchy, but we should do it more carefully and deeply, for this reason

  • gensim.utils should be split to several "subsubpackages":

    • gensim.utils.utils for old gensim.utils
    • gensim.utils.text_utils for gensim.parsing
    • etc
  • gensim.topic_coherence should be moved with renaming to gensim.models._coherence

  • gensim.summarization should be moved to gensim.models.summarization (without creation distinct model)

  • gensim.parsing should be cleaned and used as text_utils in gensim.utils.text_utils

СС @piskvorky @macks22 wdyt?

@piskvorky
Copy link
Owner

-1 on gensim.utils.utils -- doesn't look like a good naming scheme.

gensim.utils.parsing - what is the point? Why not simply remove it?

gensim.models.coherence_inner - what is this about? Why inner?

summarization: OK to refactor to a single file under models. Not sure what you mean by (without creation distrinct model).

@menshikh-iv
Copy link
Contributor Author

menshikh-iv commented Oct 9, 2017

gensim.utils.parsing - what is the point? Why not simply remove it?

Because most part of this functions used in different places (mainly in notebooks), I'll remove part of this (that never used, but not all)

gensim.models.coherence_inner - what is this about? Why inner?

Because for CoherenceModel we need a lot of additional well-structured functions, inner because it's for internal usage (in CoherenceModel)

summarization: OK to refactor to a single file under models. Not sure what you mean by (without creation distinct model).

I means I'll move with renaeming gensim.summarization to gensim.models.summarization_inner, and in __init__.py by default will be imported summarize, summarize_corpus and keywords (not moeving all to one file)

-1 on gensim.utils.utils -- doesn't look like a good naming scheme.

I agree, but If we import all functions from utils in gensim/utils/__init__.py, it will not make any difference for users (or move current gensim/utils.py to gensim/utils/__init__.py but I don't think that this approach is better)

@piskvorky
Copy link
Owner

piskvorky commented Oct 9, 2017

Aha, so topic_coherence are some internal functions? And the actual model will live in another module?

I still don't understand the summarization part: where will the models/utility functions be? Same module/different modules? What is the package structure?

utils.utils is too generic. We could perhaps do utils.text_utils and utils.math_utils (etc). Still, we definitely want to ensure backward compatibility (import the functions from utils.__init__ so the "old way" continues to work).

parsing: OK. Better a part of text_utils? Does this need a separate module?

@menshikh-iv
Copy link
Contributor Author

menshikh-iv commented Oct 9, 2017

Aha, so topic_coherence are some internal functions? And the actual model will live in another module?

Yes, CoherenceModel in gensim.models

I still don't understand the summarization part: where will the models/utility functions be? Same module/different modules? What is the package structure?

Now structure here and used as from gensim.summarization import summarize, keywords, summarize_corpus

I suggest move gesnim/summarization -> gensim.models.summarization with imports in gensim/models/summarization_inner/__init__.py.
New imports will look like

  • from gensim.models import summarize, keywords, summarize_corpus
  • from gensim.models.summarization import summarize, keywords, summarize_corpus

Split current utils is a good idea, but I don't think that I should make it now (I'll make it later in distinct PR). For now, I think gensim/utils/utils.py with imports in __init__.py is the best choice

parsing: OK. Better a part of text_utils? Does this need a separate module?

text_utils is OK for parsing, I agree.

@piskvorky
Copy link
Owner

piskvorky commented Oct 9, 2017

I'm not an expert in packaging but that _inner stuff looks weird. To me, it is an indication the whole thing should be a (sub)package, gensim.models.summarization.

Why not simply:

 📂 gensim
    └📁 ...
    └📂 models
        └📂 summarization
            └📁 __init__.py
            └📁 some_utils_module.py
            └📁 class_module.py
            └📁 ...

?

@menshikh-iv
Copy link
Contributor Author

@piskvorky
For summarization - completely agree
for coherence - _inner is better, because this functions for internal use only (it's like an explicit marker)

@piskvorky
Copy link
Owner

Internal use is marked with a _ prefix in Python, not _inner suffix. I still don't see why we should do that. CC @gojomo @jayantj thoughts?

@menshikh-iv
Copy link
Contributor Author

Ok, let's use _

@piskvorky
Copy link
Owner

How do other projects denote "internal modules and packages"?

@menshikh-iv
Copy link
Contributor Author

menshikh-iv commented Oct 9, 2017

_ is default variant. I asked about _inner because we already have similar things (like word2vec_inner.*)

@piskvorky
Copy link
Owner

piskvorky commented Oct 9, 2017

That's a file though (and a special one, not Python), not a package.

I don't feel strongly either way; I think we should follow the common practice here, the path of least surprise for our users.

@menshikh-iv
Copy link
Contributor Author

Agree with you @piskvorky, let's continue my PR.

@gojomo
Copy link
Collaborator

gojomo commented Oct 9, 2017

re: _inner – I have not noticed other Python projects grouping "internal-only" functionality into its own distinctly-named module, though I may have just not noticed. Unless it's really longwinded & arcane stuff, I'd keep it alongside the more public classes/functions that it supports, and group/name by describable functionality (either exact functions, or roles like '_support' or '_extras' or '_utils') rather than just the more-aesthetic (and not really enforceable) 'outer'/surface vs 'inner' distinction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaks backward-compatibility Change breaks backward compatibility difficulty medium Medium issue: required good gensim understanding & python skills
Projects
None yet
Development

No branches or pull requests

4 participants