Optionally key functions by a hash of their source code (includes docs,tests) #129

Open
wants to merge 7 commits into
base: main
from


gcr

@gcr gcr commented Apr 14, 2014

Joblib's default behavior is to delete all of the cached results of a function when its source code or line number changes. Further, two identical functions that appear in different source files are considered to be different. This behavior is annoying if a function is under active development, or if multiple machines have the same function checked out in different filesystem paths.

This patch adds an optional argument to the Memory() constructor, func_key_mode, which allows the user to select alternate behavior. When func_key_mode="code", functions are keyed by a hash of their source code, instead of just their name and filesystem path.
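To make the difference between the two keying modes concrete, here is a minimal sketch. It is not the code from this patch: `filename_key` and `code_key` are hypothetical helpers, the source text is passed in as a plain string (in practice it could come from `inspect.getsource`), and SHA-1 is just an assumed choice of hash.

```python
import hashlib


def filename_key(path: str, func_name: str) -> str:
    """Key in the spirit of the "filename" mode: depends on where the code lives."""
    return f"{path}:{func_name}"


def code_key(func_name: str, source: str) -> str:
    """Key in the spirit of the "code" mode: depends on what the code says."""
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
    return f"{func_name}-{digest}"


# The same function text, as it might be checked out on two machines.
SOURCE = "def run(x):\n    return x * 2\n"

cluster_key = code_key("run", SOURCE)
laptop_key = code_key("run", SOURCE)

# Editing the body yields a different key, so old results are not overwritten.
edited_key = code_key("run", SOURCE.replace("* 2", "* 3"))
```

Under these assumptions the two checkouts share one key, while a path-based key would differ between them.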

Benefits:

  • If a function changes, results are now saved to a different place, but the old results are still available if the function's definition changes back.
  • Avoids aliasing issues and handles name conflicts better: functions that share a name but differ in source no longer share a cache location.
  • Filesystem-agnostic. Compute clusters can run /opt/experiment/run.py while the developer runs /home/gcr/experiment/run.py, but the developer can reuse the cluster's cached results.
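The first benefit (old results survive an edit and come back when the edit is reverted) can be illustrated with a toy in-memory cache. This is only a sketch: `CodeKeyedCache` is a hypothetical class, not part of joblib, and the source strings are supplied by hand rather than read from disk.

```python
import hashlib


class CodeKeyedCache:
    """Toy cache keyed by (hash of source text, repr of argument)."""

    def __init__(self):
        self._store = {}
        self.computations = 0  # how many times the function really ran

    def call(self, source, func, arg):
        digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
        key = (digest, repr(arg))
        if key not in self._store:
            self.computations += 1
            self._store[key] = func(arg)
        return self._store[key]


def version_one(x):
    return x + 1


def version_two(x):
    return x + 2


# Hand-written stand-ins for the source text of each version.
V1_SOURCE = "def f(x):\n    return x + 1\n"
V2_SOURCE = "def f(x):\n    return x + 2\n"

cache = CodeKeyedCache()
first = cache.call(V1_SOURCE, version_one, 10)   # computed
second = cache.call(V2_SOURCE, version_two, 10)  # computed, stored separately
third = cache.call(V1_SOURCE, version_one, 10)   # reverted: served from cache
```

Only two computations happen across the three calls, because the reverted definition maps back to the original key.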

What's included?

  • The optional behavior itself, turned off by default so it doesn't interfere with existing joblib users. It is toggled by a Memory() parameter: memory = Memory("cachedir", func_key_mode="code") enables it, and memory = Memory("cachedir", func_key_mode="filename") keeps the old default behavior.
  • Docs (I don't know how to build them though) and docstrings
  • Unit tests

This patch is extremely useful to me. I hope it can be included in mainline! :)

I hereby assign copyright of this patch to the joblib developers.



@gcr
Author

gcr commented Apr 14, 2014

Oops! Sorry, didn't realize I broke the documentation. My last commit should fix it.

@gcr
Author

gcr commented Feb 8, 2015

(bump)

I find myself using my own fork of Joblib for my personal projects because I like this behavior better.

It would be awesome to have something like this in joblib. Is there any way you would like me to polish this patch up? Is something like this incompatible with the Joblib philosophy?

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 8, 2015 via email

@gcr
Author

gcr commented Feb 8, 2015

Thanks for writing back and explaining your thought process! Hm. Great points.

Just to be clear: are you proposing some sort of 'tag' functionality that you can use to mark "versions" of a function? E.g. something like this?

memory = joblib.Memory("data/...", func_key_mode='hash')

@memory(version="2015-02-05 Just fixed some bug")
def optimize(data):
    ...  # function body elided

And the joblib output results would be saved to something like data/joblib/optimize-alias-2015-02-05 Just fixed some bug/INPUT_ARGS_HASH/output.pkl?
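A rough sketch of how such a tagged path could be assembled is below. Everything here is hypothetical: `result_path` and `args_digest` are illustrative names, the directory layout simply mirrors the example path above, and `repr` plus MD5 is an assumed stand-in for whatever argument hashing joblib really uses.

```python
import hashlib
from pathlib import PurePosixPath


def args_digest(*args, **kwargs) -> str:
    """Stand-in for argument hashing; repr + MD5 is an assumption."""
    payload = repr((args, sorted(kwargs.items())))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def result_path(cachedir: str, func_name: str, version: str, digest: str) -> PurePosixPath:
    """Build <cachedir>/joblib/<func>-alias-<version>/<args digest>/output.pkl."""
    return (
        PurePosixPath(cachedir)
        / "joblib"
        / f"{func_name}-alias-{version}"
        / digest
        / "output.pkl"
    )


tag = "2015-02-05 Just fixed some bug"
example = result_path("data", "optimize", tag, "INPUT_ARGS_HASH")

# Different inputs land in different subdirectories under the same tag.
path_a = result_path("data", "optimize", tag, args_digest([1, 2, 3]))
path_b = result_path("data", "optimize", tag, args_digest([4, 5, 6]))
```

With this layout, changing the tag moves results to a new directory and leaves the old ones untouched.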

Maybe the biggest difference in our philosophy is that I feel that sometimes, deleting data is expensive and undesirable, even though it's just forcing some recomputation. In fact, maybe what I want (and thus should be looking at) is just something simpler:

  • A safeguard to prevent expensive computation results from being deleted. In this patch, this is provided implicitly by the hash function, but perhaps joblib could, say, optionally raise an exception instead.
  • A way to migrate or keep old results, especially if the change to the source code is cosmetic. The price (to me) is that if I do make a breaking change, then I would have to remember to update the function tag. Hmmmm. I see what you're saying here about shooting people in the foot: I am likely to forget to do this.
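The first idea, raising instead of deleting, might look roughly like this. It is a toy sketch under stated assumptions: `GuardedCache`, `StaleResultsError`, and the `protect` flag are all hypothetical names, and nothing here reflects how joblib stores results.

```python
import hashlib


class StaleResultsError(RuntimeError):
    """Raised when a function's source changed and old results would be lost."""


class GuardedCache:
    def __init__(self, protect=True):
        self.protect = protect
        self._code_digest = {}  # function name -> digest of the source that produced results
        self._results = {}      # function name -> {repr(arg): result}

    def check(self, name, source):
        """Compare the current source with the one that produced stored results."""
        digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
        known = self._code_digest.get(name)
        if known is not None and known != digest and self._results.get(name):
            if self.protect:
                # Refuse to throw away expensive results silently.
                raise StaleResultsError(f"source of {name!r} changed; results kept")
            self._results[name].clear()  # unprotected: discard results on change
        self._code_digest[name] = digest

    def store(self, name, arg, result):
        self._results.setdefault(name, {})[repr(arg)] = result

    def stored(self, name):
        return dict(self._results.get(name, {}))


V1 = "def optimize(data):\n    return sum(data)\n"
V2 = "def optimize(data):\n    return max(data)\n"

guarded = GuardedCache(protect=True)
guarded.check("optimize", V1)
guarded.store("optimize", [1, 2, 3], 6)
try:
    guarded.check("optimize", V2)
    raised = False
except StaleResultsError:
    raised = True
```

The caller then decides explicitly whether to migrate, keep, or delete the old results.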

Maybe I want something much simpler, like the ability to specify a per-function results directory and a way to prevent deletion of old state. Both could be provided by the tag functionality outlined above.

Does that match what you proposed above? Do you think that's worth implementing? If so, I could look into writing a simpler patch, if you feel it could be worthwhile.

@gcr
Author

gcr commented Feb 8, 2015

I think that in vanilla joblib, the loophole you mention already exists, because input arguments are hashed in the same way and can collide too. Hmm.

@AlJohri

AlJohri commented Oct 25, 2018

I would love something like this. During development, I keep accidentally invalidating my cache because I added a line above my function. I'm going to have to isolate my cached functions to prevent this cache invalidation.
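As a small illustration of why keying on the function's own text would avoid this, the sketch below fingerprints a function from a module's source string. `function_fingerprint` is a hypothetical helper, it assumes Python 3.8+ for `ast.get_source_segment`, and it only looks at top-level functions.

```python
import ast
import hashlib


def function_fingerprint(module_source: str, func_name: str):
    """Return (first line number, SHA-1 of the function's own source text)."""
    tree = ast.parse(module_source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            segment = ast.get_source_segment(module_source, node)
            digest = hashlib.sha1(segment.encode("utf-8")).hexdigest()
            return node.lineno, digest
    raise LookupError(func_name)


BEFORE = "def slow(x):\n    return x ** 2\n"
AFTER = "import os\n\n" + BEFORE  # two lines added above the function

line_before, digest_before = function_fingerprint(BEFORE, "slow")
line_after, digest_after = function_fingerprint(AFTER, "slow")
```

The line number moves when lines are added above the function, but the digest of the function text stays the same.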


3 participants