
[MRG] Custom store backends API #397

Merged: 4 commits merged into joblib:master on Jan 25, 2018

Conversation

@aabadie (Contributor) commented Sep 21, 2016

This is a WIP PR just to check how this refactoring behaves on Windows. I'm also interested in API comments.
Normally tests should pass (at least on Linux/Python 3.5).
I tried to preserve the behavior of the test suite as much as possible, so it shouldn't introduce regressions.

Here's a list of remaining things:

  • Backport the new cache reduction function to the store API (it behaves as if the store were a file system underneath). Comments/API proposals are welcome here.
  • Provide a gist with an example working with S3 (I still need to update the one I have to reflect my recent API changes)
  • More testing
  • Documentation

@aabadie (Contributor Author) commented Sep 22, 2016

For the record, I created a new repository that aims to implement extra store backends for joblib. This is at a very early stage and only an S3 backend is provided.

https://github.com/aabadie/joblibstore

@aabadie (Contributor Author) commented Sep 24, 2016

I refactored my initial backend API proposal into something simpler. A store backend now inherits from a backend base class and a store manager mixin: the latter contains all the complex logic, while the former defines only a small set of required methods. Adding a new backend is now simpler.

Apart from that, I backported the cache reduction mechanism to this refactoring and could make the tests pass again without too many changes.
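
For illustration, here is a minimal sketch of what a custom backend could look like under this split. The base class/mixin names, module path and method signatures below are assumptions based on the description above and later in this thread; a real backend would also implement the remaining required methods (opening objects, checking existence, listing items, ...).

from joblib import register_store_backend
# assumed module path and class names for the base class and mixin
from joblib._store_backends import StoreBackendBase, StoreBackendMixin

class InMemoryStoreBackend(StoreBackendBase, StoreBackendMixin):
    """Toy backend keeping cached items in a process-local dict."""

    def configure(self, location, verbose=0, backend_options=None):
        # called when Memory is instantiated with this backend
        self.location = location
        self.storage = {}

    def create_location(self, location):
        # nothing to create for an in-memory store
        pass

    def clear_location(self, location):
        self.storage.clear()

register_store_backend('inmemory', InMemoryStoreBackend)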

@aabadie changed the title from "[WIP] Custom store backend" to "[MRG] Custom store backends API" on Sep 27, 2016
@aabadie (Contributor Author) commented Sep 27, 2016

I slightly reworked the initial way of registering custom backends because I had problems with HDFS. The strategy now is to be explicit (rather than implicit ;) ) by using a backend name associated with a location when instantiating the Memory object. Example:

from joblib import Memory
from joblib import register_store_backend

# register a custom store backend (StoreBackendSubclass is a user-defined
# subclass of the store backend base class)
register_store_backend('mybackend', StoreBackendSubclass)

# cache myfunc using Memory with the custom backend
mem = Memory(location='joblib/cache', backend='mybackend', compress=True,
             mybackend_option='option')
myfunc = mem.cache(myfunc)
[...]

By default, the backend parameter is set to 'local', which corresponds to the regular (historical) file system backend.

I changed the status of this PR to [MRG] because I think it's in pretty good shape now. The joblibstore project also gives examples based on S3 and HDFS.

@lesteve (Member) commented Nov 23, 2016

There was some talk at one point that we needed an additional level in the cache folder hierarchy in order for the cloud storage backends to have a chance of being functional. @aabadie have you looked at it?

IIRC folder layout would look like this:

full/module/name/function_name/function_code_hash/arguments_hash/output.pkl

where function_code_hash is the additional level. This would also very likely fix the race condition highlighted in #111.
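
Purely for illustration, here is how one concrete entry could be assembled under that layout (module/function names and the code hash below are made up; joblib computes the real hashes internally):

import os
from joblib import hash as joblib_hash

func_id = os.path.join('mypkg', 'stats', 'compute_mean')   # full/module/name/function_name
func_code_hash = 'd41d8cd98f'                              # placeholder for the function-code hash (new level)
args_hash = joblib_hash(((3, 4, 5),))                      # hash of the call arguments
print(os.path.join(func_id, func_code_hash, args_hash, 'output.pkl'))
# e.g. mypkg/stats/compute_mean/d41d8cd98f/<arguments_hash>/output.pkl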

@aabadie (Contributor Author) commented Nov 24, 2016

> There was some talk at one point that we needed an additional level in the cache folder hierarchy in order for the cloud storage backends to have a chance of being functional. @aabadie have you looked at it?

Not yet. Will do ASAP.

> IIRC folder layout would look like this: ...

Agreed. Do you prefer to have this done in this PR or in another one? (+1 for the first one)

@lesteve (Member) commented Nov 24, 2016

> Agreed. Do you prefer to have this done in this PR or in another one? (+1 for the first one)

A different PR, please! It is a small, incremental change that would fix the race condition mentioned above, and keeping it separate makes it much easier to review.

@aabadie (Contributor Author) commented Jan 5, 2017

> There was some talk at one point that we needed an additional level in the cache folder hierarchy in order for the cloud storage backends to have a chance of being functional. @aabadie have you looked at it?

I started to look at this and now have something nearly ready. I just have some comments:

@lesteve (Member) commented Jan 10, 2017

@aabadie #466 has been merged

@lesteve (Member) commented Jan 10, 2017

Just curious: there was some idea of being able to have a custom store for joblib.load/dump, but at the moment this amounts to having a file-like object for your store (which is what I am guessing hdfs3.HDFileSystem().open provides, for example)?

@aabadie (Contributor Author) commented Jan 10, 2017

> Just curious: there was some idea of being able to have a custom store for joblib.load/dump, but at the moment this amounts to having a file-like object for your store (which is what I am guessing hdfs3.HDFileSystem().open provides, for example)?

Not sure I understand correctly, but we have the same requirement with the current refactoring: any store backend must provide a file-like object in order to dump/load into it.
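
As a minimal illustration of that requirement, joblib.dump/joblib.load accept any file-like object; here an in-memory buffer stands in for whatever an S3 or HDFS filesystem's open() would return:

import io
import joblib

buffer = io.BytesIO()                       # file-like object provided by the store
joblib.dump({'a': 1, 'b': [2, 3]}, buffer)  # write the pickle through it
buffer.seek(0)
restored = joblib.load(buffer)              # read it back through the same object
assert restored == {'a': 1, 'b': [2, 3]}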

@aabadie (Contributor Author) commented Jan 10, 2017

> @aabadie #466 has been merged

Thanks @lesteve, I'll fix the conflicts here asap.

@aabadie force-pushed the storage_backend_2 branch 2 times, most recently from 57261ac to 2020389 on January 10, 2017 18:12
@aabadie (Contributor Author) commented Jan 10, 2017

@lesteve, rebased onto current master.

"Does nothing"


def test_register_invalid_store_backends():
@aabadie (Contributor Author) commented on the diff above:

I think this one is a good candidate for pytest @parametrize and for splitting into 2 separate test functions.
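
A rough sketch of what such a parametrized split might look like (the exact exception types raised by register_store_backend are an assumption here):

import pytest
from joblib import register_store_backend

@pytest.mark.parametrize('invalid_name', [None, 42, object()])
def test_register_store_backend_invalid_name(invalid_name):
    # assumes a non-string backend name is rejected
    with pytest.raises(ValueError):
        register_store_backend(invalid_name, None)

@pytest.mark.parametrize('invalid_backend', [None, 'not-a-class', 42])
def test_register_store_backend_invalid_backend(invalid_backend):
    # assumes a non-StoreBackend class is rejected (exact exception type may differ)
    with pytest.raises((ValueError, TypeError)):
        register_store_backend('custom', invalid_backend)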

@aabadie (Contributor Author) commented Jan 13, 2017

ping @lesteve @ogrisel

@aabadie (Contributor Author) commented Jan 17, 2017

While testing joblibstore with the hdfs3 backend, I hit serious issues: the hdfs3 file object doesn't work with joblib, nor even with pickle itself. I opened 2 issues about the problem: dask/hdfs3#107 and dask/hdfs3#108.

Other remarks:

  • With the S3 backend using s3fs, it's not possible to get the last_accessed metadata of a file. We'll have to find another solution in this case, maybe the metadata.json file (see the sketch below).
  • The hdfs3 filesystem object, on the other hand, does provide this information.
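
A hypothetical workaround sketch for the s3fs case: the backend maintains its own last-access timestamp in the metadata.json stored next to each cached item (the file layout and field name here are made up):

import json
import time

def touch_last_access(fs, item_path):
    """Record an access time for a cached item on a filesystem object `fs`."""
    meta_path = item_path + '/metadata.json'
    with fs.open(meta_path, 'rb') as f:
        metadata = json.loads(f.read().decode('utf-8'))
    metadata['last_access'] = time.time()
    with fs.open(meta_path, 'wb') as f:
        f.write(json.dumps(metadata).encode('utf-8'))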

@eyadsibai commented

I am looking forward to this being merged. Is there anything I can do to speed it up? I would love to have S3 or Google storage as backend stores for joblib.

@aabadie (Contributor Author) commented Feb 27, 2017

> Is there anything I can do to speed it up?

Yes: reviewing and testing the changes in this PR. By the way, joblibstore is still alive and now better tested. It also fully works (at least for me) with S3. With HDFS, there are still issues with numpy arrays, but those don't come from joblib or joblibstore: it's an issue in the hdfs3 package.

@aabadie (Contributor Author) commented Mar 1, 2017

@eyadsibai, it would be really great if you could give the joblib store backends implemented in joblibstore a try (S3 first) and report any bugs you find.
Let us know if you have problems setting things up.

@eyadsibai commented Mar 1, 2017

@aabadie sure will do

@codecov bot commented Apr 19, 2017

Codecov Report

Merging #397 into master will decrease coverage by 0.16%.
The diff coverage is 93.29%.


@@            Coverage Diff             @@
##           master     #397      +/-   ##
==========================================
- Coverage   94.98%   94.81%   -0.17%     
==========================================
  Files          38       39       +1     
  Lines        5005     5193     +188     
==========================================
+ Hits         4754     4924     +170     
- Misses        251      269      +18
Impacted Files                 Coverage Δ
joblib/disk.py                 81.66% <0%> (ø) ⬆️
joblib/__init__.py             100% <100%> (ø) ⬆️
joblib/test/test_disk.py       100% <100%> (ø) ⬆️
joblib/_store_backends.py      91% <91%> (ø)
joblib/test/test_memory.py     97.76% <95.03%> (-1.6%) ⬇️
joblib/memory.py               94.95% <95.08%> (+1.55%) ⬆️
joblib/_parallel_backends.py   94.73% <0%> (-0.44%) ⬇️
joblib/test/test_parallel.py   96.1% <0%> (-0.41%) ⬇️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ba06b7e...ce0f664.

@GaelVaroquaux (Member) commented

I have some cosmetic changes to make on this PR. @aabadie: is it OK if I push them to your branch? I hope you don't have changes that aren't pushed here.

@GaelVaroquaux (Member) commented

I did a thorough read of this PR, and I don't see any major things to change. I do see two minor things that I'd like to change:

  1. There are many places where "kwargs" are used. I'd like to reduce these, as they are hard to read because they are too generic. I see two paths (and will probably end up using a combination of both):

    • I might simply turn "**kwargs" into "store_args" and document that "store_args" is a dictionary of store-specific arguments.
    • Some arguments can be taken out of the kwargs and simply not used by some stores (but maybe none are eligible here).
  2. The path in the store is specified as "func_id", "args_id". As we are considering a more general usage of the store backends in #593 (Expose storing/dumping of objects to disk), I'd like to change it to be more general.

    • How about taking as input "path_elements", which would be a list of strings? Memory would need to be changed to generate it, but it seems easy (see the sketch below).
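
A hypothetical before/after of the signature change discussed in point 2 (method and argument names here are illustrative only):

class StoreBackendSketch:
    # current, caching-specific form
    def get_cached_item(self, func_id, args_id):
        ...

    # proposed, more generic form: the caller assembles the path itself, e.g.
    # Memory would pass ['full.module.name', 'function_name', '<arguments_hash>']
    def get_item(self, path_elements):
        ...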

@GaelVaroquaux (Member) commented

While working on the code, I also noted that the stores use the vocabulary "result" a lot. We should probably change that to "object", as the store will be used for other things than cached results.

@GaelVaroquaux (Member) commented

I've started changing the names of the API so that the terminology of the store is less specialized to caching. @aabadie: I am sure that this will break other backends. My apologies, but now is better than later.

@GaelVaroquaux (Member) commented

@aabadie: we should have a discussion on how we evolve this object in the context of #593. I think we can easily make the naming a bit more universal. Let's talk on Wednesday.

@aabadie (Contributor Author) commented Jan 23, 2018

> I am sure that this will break other backends

No problem, this was expected. We'll just have to give them a bit of love after we stabilize the API here (and eventually merge this PR).

@aabadie (Contributor Author) commented Jan 23, 2018

Another thing that is not addressed by this PR is documentation of the new behaviour.

@GaelVaroquaux (Member) commented Jan 23, 2018 via email


The StoreBackend subclass has to implement 3 methods: create_location,
clear_location and configure. The StoreBackend also has to provide
open_object and object_exists methods by monkey matching them
@aabadie (Contributor Author) commented on the diff above:

s/matching/patching/

@aabadie (Contributor Author) commented Jan 24, 2018

@GaelVaroquaux, I was able to fix your CI issues and renamed the Memory/MemorizedFunc store attribute to store_backend. CI is passing.

I wanted to rebase onto the latest master but some commits conflicted with this branch, so I squashed all the commits in this PR into a single one, rebased, and force-pushed. We lost your/my recent history of changes. I hope this is fine.

- make internal methods of store backend base class private
- document backend base class methods
@GaelVaroquaux (Member) commented

Merging. Congratulations!

@GaelVaroquaux merged commit 89743fa into joblib:master on Jan 25, 2018
@lesteve (Member) commented Jan 29, 2018

Very nice!

IIRC, for cloud backends, updating an object at a given location is not possible, so you need to delete the object and create a new one at the same location, but there is no guarantee which version (old or new) you will get.

The fix discussed a while ago involved adding an additional level of folder (one per function hash) in order to avoid sharing the cache between different versions of the cached function. See my old comment.

A few issues to solve:

  • what kind of backward compatibility do we want with the old cache layout? I'd say we want at least to still be able to read it, with some warning
  • is it doable to move the old cache argument-hash folders into the func-code-hash folder so that the migration to the new folder layout is transparent for the user? (see the sketch below)
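
A hypothetical migration sketch for that second point, moving each old <func_dir>/<args_hash>/ folder under a new <func_dir>/<func_code_hash>/ level (directory names are illustrative):

import os
import shutil

def migrate_function_cache(func_dir, func_code_hash):
    """Move old argument-hash folders under the new function-code-hash level."""
    new_level = os.path.join(func_dir, func_code_hash)
    os.makedirs(new_level, exist_ok=True)
    for entry in os.listdir(func_dir):
        old_path = os.path.join(func_dir, entry)
        # only move the old per-arguments directories, not the new level itself
        if entry != func_code_hash and os.path.isdir(old_path):
            shutil.move(old_path, os.path.join(new_level, entry))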

cc @ogrisel since he was the one who originally mentioned this issue about cloud backends.

if cachedir is not None:
    if location is None:
        warnings.warn("cachedir option is deprecated since version "
                      "0.10 and will be removed after version 0.12.\n"
A project member commented on the diff above:

I am not sure what the versions should be, but at the moment they are inconsistent in this PR. This one says 0.10 -> 0.12; other places say 0.11 -> 0.13.

I am guessing it would make sense to release a 0.12 with all the changes that are going into master at the moment.

jdanbrown added a commit to jdanbrown/joblib that referenced this pull request May 16, 2018
- Looks like an oversight introduced in joblib#397
ogrisel pushed a commit that referenced this pull request May 22, 2018
- Looks like an oversight introduced in #397