Consider renaming to nld #138

honnibal · 2020-05-15T20:23:27Z

Hey :)

Just making a thread here recording what I said on Twitter, as it's impossible to follow discussion there. It's also just really not a good way to talk about this sort of thing.

The issue is that modules go into the global namespace, so you shouldn't use variable names that conflict with module names. This means the package makes nlp a bad variable name everywhere in the codebase. I've always used nlp as the canonical variable name of spaCy's Language objects, and this is a convention that a lot of other code has followed (Stanza, flair, etc). And actually, your transformers library uses nlp as the name for its Pipeline instance in your readme.

If you stick with the nlp name for this package, if anyone uses it then they should rewrite all of that code. If nlp is a bad choice of variable anywhere, it's a bad choice of variable everywhere --- because you shouldn't have to notice whether some other function uses a module when you're naming variables within a function. You want to have one convention that you can stick to everywhere.

If people use your nlp package and continue to use the nlp variable name, they'll find themselves with confusing bugs. There will be many many bits of code cut-and-paste from tutorials that give confusing results when combined with the data loading from the nlp library. The problem will be especially bad for shadowed modules (people might reasonably have a module named nlp.py within their codebase) and notebooks, as people might run notebook cells for data loading out-of-order.

I don't think it's an exaggeration to say that if your library becomes popular, we'll all be answering issues around this about once a week for the next few years. That seems pretty unideal, so I do hope you'll reconsider.

I suggest nld as a better name. It more accurately represents what the package actually does. It's pretty unideal to have a package named nlp that doesn't do any processing, and contains data about natural language generation or other non-NLP tasks. The name is equally short, and is sort of a visual pun on nlp, since a d is a rotated p.

The text was updated successfully, but these errors were encountered:

igrinis · 2020-05-15T20:50:54Z

I would suggest nlds. NLP is a very general, broad and ambiguous term, the library is not about NLP (as in processing) per se, it is about accessing Natural Language related datasets. So the name should reflect its purpose.

pmbaumgartner · 2020-05-16T12:56:47Z

Chiming in to second everything @honnibal said, and to add that I think the current name is going to impact the discoverability of this library. People who are looking for "NLP Datasets" through a search engine are going to see a library called nlp and think it's too broad. People who are looking to do NLP in python are going to search "Python NLP" and end up here, confused that this is a collection of datasets.

The names of the other huggingface libraries work because they're the only game in town: there are not very many robust, distinct libraries for tokenizers or transformers in python, for example. But there are several options for NLP in python, and adding this as a possible search result for "python nlp" when datasets are likely not what someone is searching for adds noise and frustrates potential users.

dennlinger · 2020-05-16T15:09:43Z

I'm also not sure whether the naming of nlp is the problem itself, as long as it comes with the appropriate identifier, so maybe something like huggingface_nlp? This is analogous to what @honnibal and spacy are doing for spacy-transformers. Of course, this is a "step back" from the recent changes/renaming of transformers, but may be some middle ground between a complete rebranding, and keeping it identifiable.

thomwolf · 2020-05-17T13:10:12Z

Interesting, thanks for sharing your thoughts.

As we’ll move toward a first non-beta release, we will pool the community of contributors/users of the library for their opinions on a good final name (like when we renamed the beautifully (?) named pytorch-pretrained-bert)

In the meantime, using from nlp import load_dataset, load_metric should work 😉

julien-c · 2020-05-18T16:45:15Z

I feel like we are conflating two distinct subjects here:

@honnibal's point is that using nlp as a package name might break existing code and bring developer usability issues in the future
@pmbaumgartner's point is that the nlp package name is too broad and shouldn't be used by a package that exposes only datasets and metrics

(let me know if I mischaracterize your point)

I'll chime in to say that the first point is a bit silly IMO. As Python developers due to the limitations of the import system we already have to share:

a single flat namespace for packages
which also conflicts with local modules i.e. local files

If we add the constraint that this flat namespace also be shared with variable names this gets untractable pretty fast :)

I also think all Python software developers/ML engineers/scientists are capable of at least a subset of:

importing only the methods that they need like @thomwolf suggested
aliasing their import
renaming a local variable

thomwolf · 2020-05-19T12:47:17Z

By the way, nlp will very likely not be only about datasets, and not even just about datasets and metrics.

I see it as a laboratory for testing several long-term ideas about how we could do NLP in terms of research as well as open-source and community sharing, most of these ideas being too experimental/big to fit in transformers.

Some of the directions we would like to explore are about sharing, traceability and more experimental models, as well as seeing a model as the community-based process of creating a composite entity from data, optimization, and code.

We'll see how these ideas end up being implemented and we'll better know how we should define the library when we start to dive into these topics. I'll try to get the nlp team to draft a roadmap on these topics at some point.

honnibal · 2020-05-27T16:13:48Z

If we add the constraint that this flat namespace also be shared with variable names this gets untractable pretty fast :)

I'm sort of confused by your point here. The namespace is shared by variable names. You should not use local variables that are named the same as modules, because then you cannot use the module within the scope of your function.

For instance,

import nlp
import transformers

nlp = transformers.pipeline("sentiment-analysis")

This is a bug: you've just overwritten the module, so now you can't use it. Or instead:

import transformers

nlp = transformers.pipeline("sentiment-analysis")
# (Later, e.g. in a notebook)
import nlp

This is also a bug: you've overwritten your variable with an import.

If you have a module named nlp, you should avoid using nlp as a variable, or you'll have bugs in some contexts and inconsistencies in other contexts. You'll have situations where you need to import differently in one module vs another, or name variables differently in one context vs another, which is bad.

importing only the methods that they need like @thomwolf suggested

Okay but the same logic applies to naming the module literally anything else. There's absolutely no point in having a module name that's 3 letters if you always plan to do import from! It would be entirely better to name it nlp_datasets if you don't want people to do import nlp.

And finally:

By the way, nlp will very likely not be only about datasets, and not even just about datasets and metrics.

So...it isn't a datasets library? https://twitter.com/Thom_Wolf/status/1261282491622731781

I'm confused 😕

davidefiocco · 2020-09-27T10:13:56Z

Dropping by as I noticed that the library has been renamed datasets so I wonder if the conversation above is settled (nlp not used anymore) :)

thomwolf · 2020-09-28T00:08:10Z

I guess indeed

jramapuram · 2020-12-17T17:51:05Z

I'd argue that datasets is worse than nlp. Datasets should be a user specific decision and not encapsulate all of python (pip install datasets). If this package contained every dataset in the world (NLP / vision / etc) then it would make sense =/

davidefiocco · 2020-12-17T18:13:43Z

I can't speak for the HF team @jramapuram, but as member of the community it looks to me that HF wanted to avoid the past path of changing names as scope broadened over time:

Remember
https://github.com/huggingface/pytorch-openai-transformer-lm
https://github.com/huggingface/pytorch-pretrained-BERT
https://github.com/huggingface/pytorch-transformers
and now
https://github.com/huggingface/transformers

;)

Jokes aside, seems that the library is growing in a multi-modal direction (#363) so the current name is not that implausible. Possibly HF ambition is really to grow its community and bring here a large chunk of datasets of the world (including tabular / vision / audio?).

jramapuram · 2020-12-17T18:18:43Z

Yea I see your point. However, wouldn't scoping solve the entire problem?

import huggingface.datasets as D
import huggingface.transformers as T

Calling something datasets is akin to saying I'm going to name my package python --> import python

zxteloiv · 2022-09-16T05:18:22Z

Sorry to reply to an old thread, but the name issue really makes troubles recently in my project.

I'd never known in advance there's a package called "datasets". My first thought is that such a general term may be safe to arbitrarily use. Avoiding such a common name because of its ambiguity is quite weird.

As we know in python it's not easy to differentiate system-wide and project-wide import like in C and C++.

On the contrary I fully understand the challenge to rename a popular library. So it seems to provide a "huggingface" wrapper library as suggested above by @jramapuram may be a happy medium for both developers and users.

Best Regards.

thomwolf added the generic discussion Generic discussion on the library label May 17, 2020

stefan-it mentioned this issue May 18, 2020

Use new Hugging Face datasets library flairNLP/flair#1615

Closed

thomwolf closed this as completed Sep 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider renaming to nld #138

Consider renaming to nld #138

honnibal commented May 15, 2020

igrinis commented May 15, 2020

pmbaumgartner commented May 16, 2020

dennlinger commented May 16, 2020

thomwolf commented May 17, 2020 •

edited

julien-c commented May 18, 2020

thomwolf commented May 19, 2020

honnibal commented May 27, 2020 •

edited

davidefiocco commented Sep 27, 2020 •

edited

thomwolf commented Sep 28, 2020

jramapuram commented Dec 17, 2020

davidefiocco commented Dec 17, 2020 •

edited

jramapuram commented Dec 17, 2020

zxteloiv commented Sep 16, 2022

Consider renaming to nld #138

Consider renaming to nld #138

Comments

honnibal commented May 15, 2020

igrinis commented May 15, 2020

pmbaumgartner commented May 16, 2020

dennlinger commented May 16, 2020

thomwolf commented May 17, 2020 • edited

julien-c commented May 18, 2020

thomwolf commented May 19, 2020

honnibal commented May 27, 2020 • edited

davidefiocco commented Sep 27, 2020 • edited

thomwolf commented Sep 28, 2020

jramapuram commented Dec 17, 2020

davidefiocco commented Dec 17, 2020 • edited

jramapuram commented Dec 17, 2020

zxteloiv commented Sep 16, 2022

thomwolf commented May 17, 2020 •

edited

honnibal commented May 27, 2020 •

edited

davidefiocco commented Sep 27, 2020 •

edited

davidefiocco commented Dec 17, 2020 •

edited