Optionally key functions by a hash of their source code (includes docs,tests) #129
Oops! Sorry, didn't realize I broke the documentation. My last commit should fix it.
(bump) I find myself using my own fork of Joblib for my personal projects because I like this behavior better. It would be awesome to have something like this in joblib. Is there any way you would like me to polish this patch up? Is something like this incompatible with the Joblib philosophy?
Hi,
Part of this is indeed incompatible with the joblib philosophy, as joblib
tries to make it impossible to shoot yourself in the foot. Although I
haven't studied the patch, it seems to me that the proposed modifications
open a loophole: making hash collisions in the hash function will lead to
buggy behavior. That is indeed against joblib's philosophy, as users will
lose trust in joblib.
What would be acceptable would be to add a string that gets prepended to
the current hash. It would make it less black-boxy, while keeping the
robustness. Would that do the trick for you?
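The suggestion above, prepending a user-chosen string to the existing hash, could look roughly like the following. This is only a sketch under assumptions: `tagged_key` and its `tag` argument are hypothetical names, and the arguments are hashed with `hashlib` over a pickle rather than with joblib's own hashing.

```python
import hashlib
import pickle


def tagged_key(tag, *args, **kwargs):
    """Return '<tag>-<digest>', where digest is a SHA-256 hash of the pickled arguments.

    Hypothetical helper for illustration; not part of joblib.
    """
    # Sort keyword arguments so their order does not change the digest.
    payload = pickle.dumps((args, sorted(kwargs.items())))
    digest = hashlib.sha256(payload).hexdigest()
    # The tag is only a readable prefix; the full hash is kept unchanged.
    return f"{tag}-{digest}"


key_v1 = tagged_key("v1", 3, scale=2.0)
key_v2 = tagged_key("v2", 3, scale=2.0)
```

Because the digest is kept in full, two tags never make distinct arguments collide; the tag only separates results that would otherwise share a key.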
Thanks for writing back and explaining your thought process! Hm. Great points. Just to be clear: are you proposing some sort of 'tag' functionality that you can use to mark "versions" of a function? E.g. something like this?
And the joblib output results would be saved to something like
Maybe the biggest difference in our philosophy is that I feel that sometimes, deleting data is expensive and undesirable, even though it's just forcing some recomputation. In fact, maybe what I want (and thus should be looking at) is just something simpler:
Maybe I want something much simpler, like the ability to specify a per-function results directory and a way to prevent deletion of old state. Both could be provided by the tag functionality outlined above. Does that match what you proposed above? Do you think that's worth implementing? If so, I could look into writing a simpler patch, if you feel it could be worthwhile.
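One way to picture the per-function results directory with retained old state is the layout below. It is a sketch only: `result_dir`, the `train_model` function name, the `v1`/`v2` tags, and the `<root>/<function>/<tag>` layout are all assumptions made for illustration, not joblib's actual on-disk format.

```python
import tempfile
from pathlib import Path


def result_dir(cache_root, func_name, tag):
    """Return (and create) the directory for one tagged version of a function.

    Hypothetical layout: <cache_root>/<func_name>/<tag>/
    """
    path = Path(cache_root) / func_name / tag
    path.mkdir(parents=True, exist_ok=True)
    return path


cache_root = tempfile.mkdtemp()

# Results written under the old tag...
old = result_dir(cache_root, "train_model", "v1")
(old / "output.txt").write_text("result from v1")

# ...are left untouched when a new tag is introduced.
new = result_dir(cache_root, "train_model", "v2")
```

Switching the tag creates a sibling directory instead of clearing the old one, so earlier results stay on disk until someone removes them explicitly.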
I think that in vanilla joblib, the loophole you mention exists because input arguments are also hashed in the same way and can be collided too. Hmm.
I would love something like this. During development, I keep accidentally invalidating my cache because I added a line above my function. I'm going to have to isolate my cached functions to prevent this cache invalidation.
Joblib's default behavior is to delete all of the cached results of a function when its source code or line number changes. Further, two identical functions that appear in different source files are considered to be different. This behavior is annoying if a function is under active development, or if multiple machines have the same function checked out in different filesystem paths.
This patch adds an optional argument to the Memory() constructor, func_key_mode, which allows the user to select alternate behavior. When func_key_mode="code", functions are keyed by a hash of their source code, instead of just their name and filesystem path.
Benefits:
/opt/experiment/run.py while the developer runs /home/gcr/experiment/run.py, but the developer can reuse the cluster's cached results.
What's included?
Memory() parameter. memory = Memory("cachedir", func_key_mode="code") for this behavior, or memory = Memory("cachedir", func_key_mode="filename") for the default, old behavior.
This patch is extremely useful to me. I hope it can be included in mainline! :)
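To illustrate the difference between the two keying modes described above, here is a minimal standalone sketch. It does not use the patch itself: `filename_key` and `code_key` are hypothetical helpers, and the source text is passed in as a string (in practice it could come from `inspect.getsource`).

```python
import hashlib
import textwrap


def filename_key(path, func_name):
    """Key a function by where it lives: changes when the file path changes."""
    return f"{path}:{func_name}"


def code_key(source):
    """Key a function by a hash of its source text (docstring included)."""
    # Dedent and strip so indentation and surrounding blank lines do not matter.
    normalized = textwrap.dedent(source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


SOURCE = '''
def run(x):
    """Double the input."""
    return 2 * x
'''

# Same function checked out at two different paths.
cluster = filename_key("/opt/experiment/run.py", "run")
developer = filename_key("/home/gcr/experiment/run.py", "run")

shared = code_key(SOURCE)                             # identical on both machines
edited = code_key(SOURCE.replace("2 * x", "3 * x"))   # differs after a code change
```

With path-based keys the two checkouts never share results, while the source-based key is the same on both machines and changes only when the function text itself changes.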
I hereby assign copyright of this patch to the joblib developers.