Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add env variable for MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399

Merged
merged 6 commits into from May 26, 2021

Conversation

albertvillanova
Copy link
Member

Add env variable for MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES.

This will allow to turn off default behavior: loading in memory (and not caching) small datasets.

Fix #2387.

Copy link
Contributor

@stas00 stas00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great, @albertvillanova - thank you!

The only suggestion is for the docs to clarify the priority of env var vs config var with the same name. As currently it say either-or.

Also would it help to add a test? To mock env setting, here is one helper testing function we use in transformers:
https://github.com/huggingface/transformers/blob/master/src/transformers/testing_utils.py#L919

@stas00
Copy link
Contributor

stas00 commented May 25, 2021

Thank you for clarifying the precedence, @albertvillanova

Isn't it typically the case where env vars have the highest precedence?

In my understanding the point of env vars is to be able to override software w/o needing to touch the code.

Please correct me if this is not so in the general case.

@albertvillanova
Copy link
Member Author

albertvillanova commented May 25, 2021

Hi @stas00,

Well, I'm not an expert on this topic, but the precedence hierarchy I have normally found is from higher to lower:

  • command line parameters
  • env vars
  • config files
    So yes, normally env vars have precedence over configuration files.

Anyway, for Datasets, there are no configuration files. The in-memory config is set from default values or env vars (which have precedence over default values). But this is done at import.

However, once the library is imported, the user can modify the in-memory config, and this will have precedence over the rest of mechanisms (which take place only at import).

@stas00
Copy link
Contributor

stas00 commented May 25, 2021

In my limited experience env vars are typically above cmd line args, so that one can override scripts with cmd lines using env vars, but usually one then uses env vars inside cmd line args, so it's loud and clear.

For example specifying a specific gpu number on a command line will depend on CUDA_VISIBLE_DEVICES so gpu0 will be different if someone sets CUDA_VISIBLE_DEVICES=2,3 vs CUDA_VISIBLE_DEVICES=0,1.

However, once the library is imported, the user can modify the in-memory config, and this will have precedence over the rest of mechanisms (which take place only at import).

And this is exactly the problem we are trying to solve here. For a good reason HF examples don't want to use keep_in_memory=False, and they may choose to now set datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES and which means we again can't override it via env var.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks :)

The precedence is consistent with the other env/config variables so it's all good to me

@lhoestq lhoestq merged commit fea351a into huggingface:master May 26, 2021
@stas00
Copy link
Contributor

stas00 commented May 26, 2021

oops, sorry, didn't think earlier - do we need to prefix this with HF_DATASETS or HF_ like all the other env vars? or is it long enough already to be unique - it's just not telling the user in the config file what projet this variable is for...

@lhoestq
Copy link
Member

lhoestq commented May 27, 2021

You're right, I just opened #2409

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

datasets 1.6 ignores cache
3 participants