[MRG] Custom store backends API #397
For the record, I created a new repository which aims at implementing extra store backends for Joblib. This is at a very early stage and only a backend for S3 is provided. |
I refactored my initial backend API proposal into something simpler. A store backend now inherits from a backend base class and a store manager mixin. The latter contains all the complex logic; the former only a small set of required functions. Adding a new backend is now simpler. Apart from that, I backported the cache reduction mechanism to this refactoring and could also make the tests pass again without too many changes.
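To make the base class plus mixin split concrete, here is a minimal sketch of the idea. It is not the code of this PR: the class names `BackendBase`, `StoreManagerMixin` and `InMemoryBackend` are hypothetical, and the two primitives are only loosely modelled on the `open_object` / `object_exists` names quoted later in this thread.

```python
import io
import pickle
from abc import ABC, abstractmethod
from contextlib import contextmanager


class BackendBase(ABC):
    """Hypothetical base class: the few primitives a backend must provide."""

    @abstractmethod
    def open_object(self, path, mode):
        """Return a context manager yielding a file-like object for path."""

    @abstractmethod
    def object_exists(self, path):
        """Return True if something is stored at path."""


class StoreManagerMixin:
    """Hypothetical mixin: shared logic written only against the primitives."""

    def dump(self, obj, path):
        with self.open_object(path, "wb") as f:
            pickle.dump(obj, f)

    def load(self, path):
        if not self.object_exists(path):
            raise KeyError(path)
        with self.open_object(path, "rb") as f:
            return pickle.load(f)


class InMemoryBackend(BackendBase, StoreManagerMixin):
    """Toy backend keeping pickled bytes in a dict (illustration only)."""

    def __init__(self):
        self._storage = {}

    @contextmanager
    def open_object(self, path, mode):
        if "w" in mode:
            buf = io.BytesIO()
            yield buf
            # Persist the written bytes once the caller is done writing.
            self._storage[path] = buf.getvalue()
        else:
            yield io.BytesIO(self._storage[path])

    def object_exists(self, path):
        return path in self._storage
```

With this split, a new backend only has to supply the two primitives; `dump` and `load` come from the mixin unchanged.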
I slightly reworked the initial way of registering a custom backend because I had problems with hdfs. The strategy is now to be explicit (rather than implicit ;) ) by passing a backend name together with a location when instantiating the Memory object. Example:
from joblib import Memory
from joblib import register_store_backend
# register a custom store backend
register_store_backend('mybackend', StoreBackendSubclass)
# cache myfunc using Memory; the backend name must match the registered one
mem = Memory(location='joblib/cache', backend='mybackend', compress=True, mybackend_option='option')
myfunc = mem.cache(myfunc)
[...] by default, the [...] I changed the status of this PR to [MRG] because I think it is in pretty good shape now. The joblibstore project also gives examples based on S3 and HDFS.
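The explicit name-to-class registration described above could be sketched roughly as follows. This is a standalone illustration, not joblib's implementation: `register_backend`, `make_backend`, the `_BACKENDS` dict and `DummyBackend` are hypothetical names chosen for the example.

```python
# Hypothetical module-level registry mapping backend names to classes.
_BACKENDS = {}


def register_backend(name, backend_cls):
    """Associate a backend name with a backend class."""
    if not isinstance(name, str):
        raise ValueError("backend name must be a string, got %r" % (name,))
    if not isinstance(backend_cls, type):
        raise ValueError("backend must be a class, got %r" % (backend_cls,))
    _BACKENDS[name] = backend_cls


def make_backend(name, location, **options):
    """Instantiate the backend registered under name for a location."""
    try:
        backend_cls = _BACKENDS[name]
    except KeyError:
        raise ValueError(
            "unknown backend %r, registered backends: %s"
            % (name, sorted(_BACKENDS))
        ) from None
    return backend_cls(location, **options)


class DummyBackend:
    """Placeholder backend that only records how it was configured."""

    def __init__(self, location, **options):
        self.location = location
        self.options = options


register_backend("mybackend", DummyBackend)
backend = make_backend("mybackend", "joblib/cache", mybackend_option="option")
```

One benefit of the explicit lookup is that a misspelled backend name fails immediately with a clear error instead of silently falling back to another store.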
There was some talk at one point, that we needed an additional level for the cache folder in order for the cloud storage backend to have a chance of being functional. @aabadie have you looked at it? IIRC folder layout would look like this:
Where |
Not yet. Will do asap.
agreed, do you prefer to have this done in this PR or in another one ? (+1 for the first one) |
Different PR please! It is a small, incremental change that would fix the race condition mentioned above, and keeping it separate makes it much easier to review.
I started to look at this and now have something nearly ready. I just have some comments:
|
Just curious: there was some idea of being able to use a custom store for joblib.load/dump, but at the moment doesn't this amount to having a file-like object for your store (as I am guessing hdfs3.HDFileSystem().open provides, for example)?
Not sure I understand correctly, but we have the same requirement with the current refactoring: any store backend must provide a file-like object in order to be able to dump to and load from it.
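As a small illustration of the file-like requirement, the sketch below round-trips an object through an in-memory buffer using only the standard library. It uses `pickle` rather than joblib so that it stays self-contained; `dump_to` and `load_from` are hypothetical helper names. Per the `pickle` documentation, dumping needs an object with `write()`, and loading needs `read()` and `readline()`.

```python
import io
import pickle


def dump_to(obj, fileobj):
    """Serialize obj into any binary file-like object exposing write()."""
    pickle.dump(obj, fileobj, protocol=pickle.HIGHEST_PROTOCOL)


def load_from(fileobj):
    """Deserialize from a binary file-like object with read()/readline()."""
    return pickle.load(fileobj)


# io.BytesIO stands in for whatever a remote store's open() would return.
buf = io.BytesIO()
dump_to({"values": [1, 2, 3]}, buf)
buf.seek(0)  # rewind before reading back
restored = load_from(buf)
```

A store whose file object lacks one of these methods, or implements it with different semantics, will fail at this level before any joblib-specific logic is involved.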
@lesteve, rebased with current master |
joblib/test/test_memory.py
"Does nothing"

def test_register_invalid_store_backends():
I think this one is a good candidate for pytest's parametrize and for being split into 2 separate test functions.
While testing joblibstore with the hdfs3 backend, I ran into serious issues: the hdfs3 file object doesn't work with joblib, nor even with pickle itself. I opened 2 issues regarding the problem: dask/hdfs3#107 and dask/hdfs3#108. Other remarks:
|
I am looking forward to this being merged. Is there anything I can do to speed it up? I would love to have S3 or Google Storage as backend stores for joblib.
Yes: reviewing the changes of this PR and testing them. BTW, joblibstore is still alive and now better tested. It also fully works (at least for me) with S3. With HDFS, there are still issues with numpy arrays, but they don't come from joblib or joblibstore: this is an issue in the hdfs3 package.
@eyadsibai, it would be really great if you could give a try to the joblib store backends implemented in joblibstore (try S3 first) and report bugs if you find any |
@aabadie sure will do |
Codecov Report
@@ Coverage Diff @@
## master #397 +/- ##
==========================================
- Coverage 94.98% 94.81% -0.17%
==========================================
Files 38 39 +1
Lines 5005 5193 +188
==========================================
+ Hits 4754 4924 +170
- Misses 251 269 +18
|
I have some cosmetic changes to do on this PR. @aabadie : is it OK if I push them to your branch? I hope that you don't have changes not pushed here. |
I did a thorough read of this PR, and I don't see any major things to change. I do see two minor things that I'd like to change:
|
While working on the code I also noted that the stores use the vocabulary "result" a lot. We should probably change that to "object", as the store will be used for other things than cached results.
I've started changing the names of the API so that the terminology of the store is less specialized to caching. @aabadie : I am sure that this will break other backends. My apologies. But now is better than later. |
No problem, this was expected. We'll just have to give them a bit of love after we stabilize the API here (and eventually merge this PR). |
Another thing that is not addressed by this PR is some documentation about the new behaviour. |
Yes :)
|
joblib/_store_backends.py
The StoreBackend subclass has to implement 3 methods: create_location,
clear_location and configure. The StoreBackend also has to provide
open_object and object_exists methods by monkey matching them
s/matching/patching/
@GaelVaroquaux, I fixed your CI issues and renamed [...]. I wanted to rebase onto the latest master but some commits conflicted with this branch, so I squashed all commits of this PR into a single one, rebased and force-pushed. We lost the recent history of your and my changes. Hope this is fine.
- make internal methods of store backend base class private - document backend base class methods
Merging. Congratulations! |
Very nice! IIRC for cloud backends, updating an object at a given location is not possible, so you need to delete the object and create a new one at the same location but there is no guarantee which version (old or new) you will get. The fix discussed a while ago involved adding an additional level of folder (one per function hash) in order to avoid sharing the cache between different versions of the cached function. See my old comment. A few issues to solve:
cc @ogrisel since he was the one who originally mentioned this issue about cloud backends.
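To illustrate why one extra folder level per function hash avoids sharing cache entries between versions of a cached function, here is a rough sketch. The exact layout discussed in the thread is not reproduced here; the path scheme, the `cache_path` helper and the use of SHA-1 over the function source are all assumptions made for the example.

```python
import hashlib
import posixpath


def _digest(text):
    """Short, stable hex digest of a string (illustrative choice of SHA-1)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]


def cache_path(location, func_name, func_source, args_repr):
    """Build a hypothetical key: location/func_name/func_hash/args_hash/output.pkl.

    Because the function hash is part of the key, a new version of the
    function writes to a new location instead of overwriting an old object.
    """
    return posixpath.join(
        location,
        func_name,
        _digest(func_source),  # the additional level, one per function version
        _digest(args_repr),
        "output.pkl",
    )


old = cache_path("bucket/cache", "f", "def f(x): return x + 1", "(1,)")
new = cache_path("bucket/cache", "f", "def f(x): return x + 2", "(1,)")
```

Here `old` and `new` differ only in the function-hash component, so a store that cannot update objects in place never has to choose between a stale and a fresh version at the same key.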
if cachedir is not None:
    if location is None:
        warnings.warn("cachedir option is deprecated since version "
                      "0.10 and will be removed after version 0.12.\n"
I am not sure what the versions should be, but at the moment they are inconsistent in this PR. This one says 0.10 -> 0.12. Other places say 0.11 -> 0.13.
I am guessing it would make sense to release a 0.12 with all the changes that are going into master at the moment.
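For reference, the deprecation path under review could be sketched as a standalone helper like the one below. This is an assumption-laden illustration, not the PR's code: `resolve_location` is a hypothetical name, the version numbers are deliberately left out of the message since they are still under discussion, and raising when both options are given is a choice made for the example only.

```python
import warnings


def resolve_location(location=None, cachedir=None):
    """Return the effective store location, honouring a deprecated cachedir."""
    if cachedir is not None:
        if location is None:
            warnings.warn(
                "The 'cachedir' option is deprecated, use 'location' instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
            location = cachedir
        else:
            # Assumption for this sketch: refuse ambiguous input.
            raise ValueError("Pass either 'location' or 'cachedir', not both.")
    return location
```

Keeping the message text in a single helper is one way to avoid the kind of version mismatch noted above, since every call site then reports the same deprecation schedule.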
- Looks like an oversight introduced in joblib#397
- Looks like an oversight introduced in #397
This is a WIP PR just to check how this refactoring behaves on Windows. I'm also interested in API comments. Normally tests should pass (at least on Linux/Python 3.5).
I tried to keep the behavior of the test suite as much as possible, so normally it shouldn't introduce regressions.
Here's a list of remaining things: