Add env variable for MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399
Looking great, @albertvillanova - thank you!
The only suggestion is for the docs to clarify the priority of the env var vs. the config var with the same name. Currently they say either-or.
Also, would it help to add a test? To mock the env setting, here is one helper testing function we use in transformers:
https://github.com/huggingface/transformers/blob/master/src/transformers/testing_utils.py#L919
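A test along these lines could be sketched as follows. This does not reproduce the linked transformers helper; it swaps in the standard library's `unittest.mock.patch.dict` instead, and `read_limit` is a hypothetical stand-in for whatever code reads the variable, not a function from Datasets.

```python
import os
from unittest import mock

ENV_NAME = "MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"


def read_limit(default=0):
    """Hypothetical reader: return the env value as an int, else the default."""
    raw = os.environ.get(ENV_NAME)
    return int(raw) if raw is not None else default


def test_env_var_is_picked_up():
    # patch.dict restores os.environ to its prior state when the block exits.
    with mock.patch.dict(os.environ, {ENV_NAME: "1024"}):
        assert read_limit() == 1024


def test_default_when_unset():
    with mock.patch.dict(os.environ):
        # Remove the variable inside the patch; it is restored on exit.
        os.environ.pop(ENV_NAME, None)
        assert read_limit(default=5) == 5
```

Because `patch.dict` undoes every change on exit, neither test leaks the setting into the rest of the suite.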
Thank you for clarifying the precedence, @albertvillanova. Isn't it typically the case that env vars have the highest precedence? In my understanding, the point of env vars is to be able to override software w/o needing to touch the code. Please correct me if this is not so in the general case.
Hi @stas00, Well, I'm not an expert on this topic, but the precedence hierarchy I have normally found is from higher to lower:
Anyway, for Datasets, there are no configuration files. The in-memory config is set from default values or env vars (which take precedence over default values), but this happens only at import. Once the library is imported, the user can modify the in-memory config, and this takes precedence over the other mechanisms (which run only at import).
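The precedence described here can be illustrated with a minimal sketch. It is not the actual Datasets implementation: the `Config` class, the attribute name, and the default of 1000 are all assumptions made for illustration, and the environment is passed in as a plain dict so the example does not touch `os.environ`.

```python
# Hypothetical default, picked only for illustration (not the library's value).
DEFAULT_LIMIT = 1000
ENV_NAME = "MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"


class Config:
    """Stand-in for a config module whose values are fixed at import time."""

    def __init__(self, environ):
        # The env var wins over the default, but it is read only once, here.
        raw = environ.get(ENV_NAME)
        self.max_in_memory_dataset_size_in_bytes = (
            int(raw) if raw is not None else DEFAULT_LIMIT
        )


# 1. No env var set: the default applies.
config = Config(environ={})
default_value = config.max_in_memory_dataset_size_in_bytes

# 2. Env var set before "import": it overrides the default.
config = Config(environ={ENV_NAME: "0"})
env_value = config.max_in_memory_dataset_size_in_bytes

# 3. After "import", the user assigns directly; this overrides both.
config.max_in_memory_dataset_size_in_bytes = 500
user_value = config.max_in_memory_dataset_size_in_bytes
```

In this sketch, changing the environment after step 2 would have no effect, since the value is read only when the config object is created.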
In my limited experience env vars are typically above cmd line args, so that one can override scripts with cmd lines using env vars, but usually one then uses env vars inside cmd line args, so it's loud and clear. For example specifying a specific gpu number on a command line will depend on
And this is exactly the problem we are trying to solve here. For a good reason HF examples don't want to use
Thanks :)
The precedence is consistent with the other env/config variables so it's all good to me
oops, sorry, didn't think earlier - do we need to prefix this with
You're right, I just opened #2409
Add env variable for MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES.
This will allow turning off the default behavior: loading small datasets in memory (and not caching them).
Fix #2387.