Optionally key functions by a hash of their source code (includes docs,tests) #129

gcr · 2014-04-14T12:58:41Z

Joblib's default behavior is to delete all of the cached results of a function when its source code or line number changes. Further, two identical functions that appear in different source files are considered to be different. This behavior is annoying if a function is under active development, or if multiple machines have the same function checked out in different filesystem paths.

This patch adds an optional argument to the Memory() constructor, func_key_mode, which allows the user to select alternate behavior. When func_key_mode="code", functions are keyed by a hash of their source code, instead of just their name and filesystem path.
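As a rough illustration of the idea (not the code in this patch), a function can be keyed by a digest of its source text. The helper names `key_from_source` and `key_for_function`, the choice of SHA-1, and the 12-character truncation are all assumptions made for this sketch.

```python
import hashlib
import inspect


def key_from_source(name, source):
    """Build a directory-style key from a function name and its source text."""
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
    return "{}-{}".format(name, digest[:12])


def key_for_function(func):
    """Key a live function by its source; requires the source file to be readable."""
    return key_from_source(func.__name__, inspect.getsource(func))


# The key depends only on the text, not on the file path or line number.
src_v1 = "def f(x):\n    return x + 1\n"
src_v2 = "def f(x):\n    return x + 2\n"
same_text = key_from_source("f", src_v1) == key_from_source("f", src_v1)
changed_text = key_from_source("f", src_v1) != key_from_source("f", src_v2)
```

Because the key is derived from the text alone, moving the function to another file or inserting lines above it would leave the key unchanged, while any edit to the body produces a new key and therefore a new results directory.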

Benefits:

If a function changes, results are now saved to a different place, but the old results are still available if the function's definition changes back.
Avoids aliasing issues and handles name conflicts better.
Filesystem-agnostic. Compute clusters can run /opt/experiment/run.py while the developer runs /home/gcr/experiment/run.py, but the developer can reuse the cluster's cached results.

What's included?

The optional behavior itself, turned off by default so it doesn't interfere with existing joblib users. It is toggled by a Memory() parameter: memory = Memory("cachedir", func_key_mode="code") enables it, and memory = Memory("cachedir", func_key_mode="filename") keeps the old default behavior.
Docs (I don't know how to build them though) and docstrings
Unit tests

This patch is extremely useful to me. I hope it can be included in mainline! :)

I hereby assign copyright of this patch to the joblib developers.

gcr · 2014-04-14T16:27:38Z

Oops! Sorry, didn't realize I broke the documentation. My last commit should fix it.

gcr · 2015-02-08T22:09:14Z

(bump)

I find myself using my own fork of Joblib for my personal projects because I like this behavior better.

It would be awesome to have something like this in joblib. Is there any way you would like me to polish this patch up? Is something like this incompatible with the Joblib philosophy?

GaelVaroquaux · 2015-02-08T22:25:24Z

Hi,

Part of this is indeed incompatible with the joblib philosophy, as joblib tries to make it impossible to shoot yourself in the foot. Although I haven't studied the patch, it seems to me that the proposed modifications open a loophole: hash collisions in the hash function will lead to buggy behavior. That is against joblib's philosophy, as users will lose trust in joblib.

What would be acceptable would be to add a string that gets prepended to the current hash. It would make it less black-boxy, while keeping the robustness. Would that do the trick for you?

gcr · 2015-02-08T23:28:46Z

Thanks for writing back and explaining your thought process! Hm. Great points.

Just to be clear: are you proposing some sort of 'tag' functionality that you can use to mark "versions" of a function? E.g. something like this?

memory = joblib.Memory("data/...", func_key_mode='hash')

@memory(version="2015-02-05 Just fixed some bug")
def optimize(data):
    # ...

And the joblib output results would be saved to something like data/joblib/optimize-alias-2015-02-05 Just fixed some bug/INPUT_ARGS_HASH/output.pkl ?
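To make the layout in question concrete, here is a minimal sketch of how such a path could be assembled. It is not joblib's actual implementation: the function `tagged_output_path`, the use of pickle plus MD5 for the argument hash, and the exact directory naming are assumptions for illustration only.

```python
import hashlib
import os
import pickle


def tagged_output_path(cache_root, func_name, version, args, kwargs):
    """Return <root>/joblib/<func>-alias-<version>/<args hash>/output.pkl."""
    # Hypothetical argument hash: pickle the call arguments and digest the bytes.
    payload = pickle.dumps((args, sorted(kwargs.items())), protocol=2)
    args_hash = hashlib.md5(payload).hexdigest()
    # The user-supplied version tag, not the source code, selects the directory.
    func_dir = "{}-alias-{}".format(func_name, version)
    return os.path.join(cache_root, "joblib", func_dir, args_hash, "output.pkl")
```

With this scheme, results survive any edit to the function until the tag is changed by hand, which is exactly the trade-off discussed next.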

Maybe the biggest difference in our philosophy is that I feel that sometimes, deleting data is expensive and undesirable, even though it's just forcing some recomputation. In fact, maybe what I want (and thus should be looking at) is just something simpler:

A safeguard to prevent expensive computation results from being deleted. In this patch, this is provided implicitly by the hash function, but perhaps joblib could, say, optionally raise an exception instead.
A way to migrate or keep old results, especially if the change to the source code is cosmetic. The price (to me) is that if I do make a breaking change, then I would have to remember to update the function tag. Hmmmm. I see what you're saying here about shooting people in the foot: I am likely to forget to do this.

Maybe I want something much simpler, like the ability to specify a per-function results directory and a way to prevent deletion of old state. Both could be provided by the tag functionality outlined above.
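A sketch of the "raise instead of delete" safeguard, under the assumption that the cache records a digest of the source that produced its results. The classes `GuardedCache` and `StaleCacheError` are hypothetical and exist only for this example; joblib has no such API that I know of.

```python
import hashlib


class StaleCacheError(RuntimeError):
    """Raised instead of silently discarding results from older code."""


class GuardedCache:
    """Toy in-memory cache that refuses to drop results when code changes."""

    def __init__(self):
        self._code_hash = {}  # function name -> hash of the source that filled the cache
        self._results = {}    # (function name, argument key) -> cached result

    def register(self, name, source):
        """Record the source for `name`, or raise if it differs from the recorded one."""
        digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
        known = self._code_hash.setdefault(name, digest)
        if known != digest:
            raise StaleCacheError(
                "source of {!r} changed; refusing to delete cached results".format(name)
            )

    def put(self, name, arg_key, value):
        self._results[(name, arg_key)] = value

    def get(self, name, arg_key):
        return self._results[(name, arg_key)]
```

The user would then decide explicitly whether to clear the old results, move them aside, or revert the code change.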

Does that match what you proposed above? Do you think that's worth implementing? If so, I could look into writing a simpler patch, if you feel it could be worthwhile.

gcr · 2015-02-08T23:30:37Z

I think the loophole you mention already exists in vanilla joblib, because input arguments are hashed in the same way and can collide too. Hmm.

AlJohri · 2018-10-25T05:14:29Z

I would love something like this. During development, I keep accidentally invalidating my cache because I added a line above my function. I'm going to have to isolate my cached functions in their own module to prevent this cache invalidation.

gcr added 7 commits April 13, 2014 16:25

[WIP] Function dirnames should be keyed by their code, not the function name

66f5838

[WIP] Passes tests: thread func_key_mode through other functions too

87c96c5

Make func_key_code a non-optional parameter in the helper

f21d9eb

Add docstrings for the new function_key_mode param

4eaa3f1

Fix representation of a Result to include the function key

f0bdc75

Add documentation to the func_key_mode parameter.

31c45b3

Fixes formatting of documentation for the func_key_mode parameter

91ab0cb
