Add env variable for MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399

albertvillanova · 2021-05-24T17:19:15Z

Add env variable for MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES.

This will allow turning off the default behavior of loading small datasets in memory (and not caching them).

stas00

Looking great, @albertvillanova - thank you!

The only suggestion is for the docs to clarify the priority of the env var vs. the config var with the same name, as currently they say either-or.

Also would it help to add a test? To mock env setting, here is one helper testing function we use in transformers:
https://github.com/huggingface/transformers/blob/master/src/transformers/testing_utils.py#L919
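As a rough illustration of what such a test could look like, here is a minimal sketch that mocks the env setting with the standard library's `unittest.mock.patch.dict` rather than the linked transformers helper (which is not reproduced here). `read_max_size` is a hypothetical reader invented for this example; it is not the actual datasets code.

```python
import os
from unittest import mock

ENV_NAME = "MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"


def read_max_size(default=0):
    """Hypothetical reader: returns the env var as an int, else the given default."""
    return int(os.environ.get(ENV_NAME, default))


def test_env_var_is_picked_up():
    # patch.dict sets the variable only inside the with-block and restores
    # os.environ to its previous state on exit.
    with mock.patch.dict(os.environ, {ENV_NAME: "1024"}):
        assert read_max_size() == 1024


def test_default_is_used_when_unset():
    # Removing the key inside patch.dict is safe: the original mapping is restored afterwards.
    with mock.patch.dict(os.environ, {}):
        os.environ.pop(ENV_NAME, None)
        assert read_max_size(default=7) == 7


test_env_var_is_picked_up()
test_default_is_used_when_unset()
```

Because the patch is scoped to the `with` block, the test does not leak the setting into other tests.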

stas00 · 2021-05-25T16:49:46Z

Thank you for clarifying the precedence, @albertvillanova

Isn't it typically the case that env vars have the highest precedence?

In my understanding the point of env vars is to be able to override software w/o needing to touch the code.

Please correct me if this is not so in the general case.

albertvillanova · 2021-05-25T17:12:17Z

Hi @stas00,

Well, I'm not an expert on this topic, but the precedence hierarchy I have normally found is, from higher to lower:

1. command line parameters
2. env vars
3. config files

So yes, normally env vars have precedence over configuration files.

Anyway, for Datasets, there are no configuration files. The in-memory config is set from default values or env vars (which have precedence over default values). But this is done at import.

However, once the library is imported, the user can modify the in-memory config, and this will have precedence over the rest of mechanisms (which take place only at import).
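To make the precedence described above concrete, here is a minimal sketch; it is not the actual datasets code. `InMemoryConfig`, `ENV_NAME`, `DEFAULT_MAX_SIZE` and `max_size` are hypothetical names, and the default value is an arbitrary placeholder.

```python
import os

ENV_NAME = "MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"
DEFAULT_MAX_SIZE = 250 * 2**20  # placeholder default, an assumption for this sketch


class InMemoryConfig:
    """Hypothetical stand-in for a config whose values are resolved once, at import."""

    def __init__(self, environ=None):
        environ = os.environ if environ is None else environ
        raw = environ.get(ENV_NAME)
        # The env var wins over the default, but only at construction ("import") time.
        self.max_size = int(raw) if raw is not None else DEFAULT_MAX_SIZE


# 1. No env var set: the default applies.
from_default = InMemoryConfig(environ={}).max_size

# 2. Env var set before "import": it overrides the default.
env = {ENV_NAME: "0"}
config = InMemoryConfig(environ=env)
from_env = config.max_size

# 3. User code modifies the config afterwards: this has the last word.
config.max_size = 1000
env[ENV_NAME] = "5"  # changing the environment now has no effect
after_user_override = config.max_size
```

In this sketch the environment is only consulted once, so any assignment made after construction cannot be undone by setting the env var later.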

stas00 · 2021-05-25T21:12:28Z

In my limited experience env vars are typically above cmd line args, so that one can override scripts with cmd lines using env vars, but usually one then uses env vars inside cmd line args, so it's loud and clear.

For example, what a specific gpu number on the command line refers to depends on CUDA_VISIBLE_DEVICES, so gpu0 will be a different device if someone sets CUDA_VISIBLE_DEVICES=2,3 vs. CUDA_VISIBLE_DEVICES=0,1.
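A tiny sketch of that remapping, assuming CUDA_VISIBLE_DEVICES holds only a comma-separated list of integer indices (UUID-style entries are not handled). `physical_gpu` is a hypothetical helper written for this illustration, not a CUDA or PyTorch API.

```python
def physical_gpu(logical_index, cuda_visible_devices):
    """Map a logical gpu number (as a script sees it) to the physical device index."""
    visible = [int(part) for part in cuda_visible_devices.split(",") if part.strip()]
    return visible[logical_index]


# The same "gpu0" on the command line points at different hardware.
gpu0_with_2_3 = physical_gpu(0, "2,3")  # physical device 2
gpu0_with_0_1 = physical_gpu(0, "0,1")  # physical device 0
```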

> However, once the library is imported, the user can modify the in-memory config, and this will have precedence over the rest of mechanisms (which take place only at import).

And this is exactly the problem we are trying to solve here. For a good reason, HF examples don't want to use keep_in_memory=False, and they may now choose to set datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES, which means we again can't override it via env var.

lhoestq

Thanks :)

The precedence is consistent with the other env/config variables so it's all good to me

stas00 · 2021-05-26T17:05:35Z

oops, sorry, didn't think of this earlier - do we need to prefix this with HF_DATASETS or HF_ like all the other env vars? or is it long enough already to be unique - it's just not telling the user in the config file what project this variable is for...

lhoestq · 2021-05-27T09:07:15Z

You're right, I just opened #2409

albertvillanova added 3 commits May 24, 2021 19:16

Add env variable for MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES

de57c17

Simplify logic with MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES

4edcb12

Update docs

862fdfe

stas00 approved these changes May 25, 2021


albertvillanova added 3 commits May 25, 2021 08:36

Update docs explaining precedence

7681106

Test env variable

a03643b

Fix style

63bddcb

lhoestq approved these changes May 26, 2021


lhoestq merged commit fea351a into huggingface:master May 26, 2021

lhoestq mentioned this pull request May 27, 2021

Add HF_ prefix to env var MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2409

Merged
