Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider renaming to nld #138

Closed
honnibal opened this issue May 15, 2020 · 13 comments
Closed

Consider renaming to nld #138

honnibal opened this issue May 15, 2020 · 13 comments
Labels
generic discussion Generic discussion on the library

Comments

@honnibal
Copy link

Hey :)

Just making a thread here recording what I said on Twitter, as it's impossible to follow discussion there. It's also just really not a good way to talk about this sort of thing.

The issue is that modules go into the global namespace, so you shouldn't use variable names that conflict with module names. This means the package makes nlp a bad variable name everywhere in the codebase. I've always used nlp as the canonical variable name of spaCy's Language objects, and this is a convention that a lot of other code has followed (Stanza, flair, etc). And actually, your transformers library uses nlp as the name for its Pipeline instance in your readme.

If you stick with the nlp name for this package, if anyone uses it then they should rewrite all of that code. If nlp is a bad choice of variable anywhere, it's a bad choice of variable everywhere --- because you shouldn't have to notice whether some other function uses a module when you're naming variables within a function. You want to have one convention that you can stick to everywhere.

If people use your nlp package and continue to use the nlp variable name, they'll find themselves with confusing bugs. There will be many many bits of code cut-and-paste from tutorials that give confusing results when combined with the data loading from the nlp library. The problem will be especially bad for shadowed modules (people might reasonably have a module named nlp.py within their codebase) and notebooks, as people might run notebook cells for data loading out-of-order.

I don't think it's an exaggeration to say that if your library becomes popular, we'll all be answering issues around this about once a week for the next few years. That seems pretty unideal, so I do hope you'll reconsider.

I suggest nld as a better name. It more accurately represents what the package actually does. It's pretty unideal to have a package named nlp that doesn't do any processing, and contains data about natural language generation or other non-NLP tasks. The name is equally short, and is sort of a visual pun on nlp, since a d is a rotated p.

@igrinis
Copy link

igrinis commented May 15, 2020

I would suggest nlds. NLP is a very general, broad and ambiguous term, the library is not about NLP (as in processing) per se, it is about accessing Natural Language related datasets. So the name should reflect its purpose.

@pmbaumgartner
Copy link

Chiming in to second everything @honnibal said, and to add that I think the current name is going to impact the discoverability of this library. People who are looking for "NLP Datasets" through a search engine are going to see a library called nlp and think it's too broad. People who are looking to do NLP in python are going to search "Python NLP" and end up here, confused that this is a collection of datasets.

The names of the other huggingface libraries work because they're the only game in town: there are not very many robust, distinct libraries for tokenizers or transformers in python, for example. But there are several options for NLP in python, and adding this as a possible search result for "python nlp" when datasets are likely not what someone is searching for adds noise and frustrates potential users.

@dennlinger
Copy link

I'm also not sure whether the naming of nlp is the problem itself, as long as it comes with the appropriate identifier, so maybe something like huggingface_nlp? This is analogous to what @honnibal and spacy are doing for spacy-transformers. Of course, this is a "step back" from the recent changes/renaming of transformers, but may be some middle ground between a complete rebranding, and keeping it identifiable.

@thomwolf thomwolf added the generic discussion Generic discussion on the library label May 17, 2020
@thomwolf
Copy link
Member

thomwolf commented May 17, 2020

Interesting, thanks for sharing your thoughts.

As we’ll move toward a first non-beta release, we will pool the community of contributors/users of the library for their opinions on a good final name (like when we renamed the beautifully (?) named pytorch-pretrained-bert)

In the meantime, using from nlp import load_dataset, load_metric should work 😉

@julien-c
Copy link
Member

I feel like we are conflating two distinct subjects here:

  1. @honnibal's point is that using nlp as a package name might break existing code and bring developer usability issues in the future
  2. @pmbaumgartner's point is that the nlp package name is too broad and shouldn't be used by a package that exposes only datasets and metrics

(let me know if I mischaracterize your point)

I'll chime in to say that the first point is a bit silly IMO. As Python developers due to the limitations of the import system we already have to share:

  • a single flat namespace for packages
  • which also conflicts with local modules i.e. local files

If we add the constraint that this flat namespace also be shared with variable names this gets untractable pretty fast :)

I also think all Python software developers/ML engineers/scientists are capable of at least a subset of:

  • importing only the methods that they need like @thomwolf suggested
  • aliasing their import
  • renaming a local variable

@thomwolf
Copy link
Member

By the way, nlp will very likely not be only about datasets, and not even just about datasets and metrics.

I see it as a laboratory for testing several long-term ideas about how we could do NLP in terms of research as well as open-source and community sharing, most of these ideas being too experimental/big to fit in transformers.

Some of the directions we would like to explore are about sharing, traceability and more experimental models, as well as seeing a model as the community-based process of creating a composite entity from data, optimization, and code.

We'll see how these ideas end up being implemented and we'll better know how we should define the library when we start to dive into these topics. I'll try to get the nlp team to draft a roadmap on these topics at some point.

@honnibal
Copy link
Author

honnibal commented May 27, 2020

If we add the constraint that this flat namespace also be shared with variable names this gets untractable pretty fast :)

I'm sort of confused by your point here. The namespace is shared by variable names. You should not use local variables that are named the same as modules, because then you cannot use the module within the scope of your function.

For instance,

import nlp
import transformers

nlp = transformers.pipeline("sentiment-analysis")

This is a bug: you've just overwritten the module, so now you can't use it. Or instead:

import transformers

nlp = transformers.pipeline("sentiment-analysis")
# (Later, e.g. in a notebook)
import nlp

This is also a bug: you've overwritten your variable with an import.

If you have a module named nlp, you should avoid using nlp as a variable, or you'll have bugs in some contexts and inconsistencies in other contexts. You'll have situations where you need to import differently in one module vs another, or name variables differently in one context vs another, which is bad.

importing only the methods that they need like @thomwolf suggested

Okay but the same logic applies to naming the module literally anything else. There's absolutely no point in having a module name that's 3 letters if you always plan to do import from! It would be entirely better to name it nlp_datasets if you don't want people to do import nlp.

And finally:

By the way, nlp will very likely not be only about datasets, and not even just about datasets and metrics.

So...it isn't a datasets library? https://twitter.com/Thom_Wolf/status/1261282491622731781

I'm confused 😕

@davidefiocco
Copy link

davidefiocco commented Sep 27, 2020

Dropping by as I noticed that the library has been renamed datasets so I wonder if the conversation above is settled (nlp not used anymore) :)

@thomwolf
Copy link
Member

I guess indeed

@jramapuram
Copy link

I'd argue that datasets is worse than nlp. Datasets should be a user specific decision and not encapsulate all of python (pip install datasets). If this package contained every dataset in the world (NLP / vision / etc) then it would make sense =/

@davidefiocco
Copy link

davidefiocco commented Dec 17, 2020

I can't speak for the HF team @jramapuram, but as member of the community it looks to me that HF wanted to avoid the past path of changing names as scope broadened over time:

Remember
https://github.com/huggingface/pytorch-openai-transformer-lm
https://github.com/huggingface/pytorch-pretrained-BERT
https://github.com/huggingface/pytorch-transformers
and now
https://github.com/huggingface/transformers

;)

Jokes aside, seems that the library is growing in a multi-modal direction (#363) so the current name is not that implausible. Possibly HF ambition is really to grow its community and bring here a large chunk of datasets of the world (including tabular / vision / audio?).

@jramapuram
Copy link

Yea I see your point. However, wouldn't scoping solve the entire problem?

import huggingface.datasets as D
import huggingface.transformers as T

Calling something datasets is akin to saying I'm going to name my package python --> import python

@zxteloiv
Copy link

Sorry to reply to an old thread, but the name issue really makes troubles recently in my project.

I'd never known in advance there's a package called "datasets". My first thought is that such a general term may be safe to arbitrarily use. Avoiding such a common name because of its ambiguity is quite weird.

As we know in python it's not easy to differentiate system-wide and project-wide import like in C and C++.

On the contrary I fully understand the challenge to rename a popular library. So it seems to provide a "huggingface" wrapper library as suggested above by @jramapuram may be a happy medium for both developers and users.

Best Regards.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
generic discussion Generic discussion on the library
Projects
None yet
Development

No branches or pull requests

9 participants