
[MRG] persistence in/from file objects #351

Merged
merged 16 commits from aabadie:file_handle_persistence into joblib:master on Jun 29, 2016

Conversation

@aabadie
Contributor

@aabadie aabadie commented May 11, 2016

This PR should fix #341 and fix #312.

Examples of new usage:

import joblib
data = "some data to persist"
# persist to/from a raw file
with open('/tmp/test.pkl', 'wb') as f:
    joblib.dump(data, f)

with open('/tmp/test.pkl', 'rb') as f:
    print(joblib.load(f))

# compressors are also supported:
import gzip
with gzip.GzipFile('/tmp/test.pkl.gz', 'wb') as f:
    joblib.dump(data, f)

with gzip.GzipFile('/tmp/test.pkl.gz', 'rb') as f:
    print(joblib.load(f))

# auto detection of compression also works:
with open('/tmp/test.pkl.gz', 'rb') as f:
    print(joblib.load(f))

# it's also possible to pickle in a buffer in memory using io.BytesIO:
import io
buf = io.BytesIO()
joblib.dump(data, buf)
joblib.load(buf)

Test functions are still missing.
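A minimal sketch of the kind of round-trip test that is still missing, using the stdlib pickle module as a stand-in for joblib (the function name test_roundtrip_fileobj is hypothetical):

```python
import io
import pickle

def test_roundtrip_fileobj():
    # Round-trip through an in-memory file object, mirroring the
    # joblib.dump/joblib.load usage shown above (stdlib pickle stand-in).
    data = {'a': [1, 2, 3], 'b': 'some text'}
    buf = io.BytesIO()
    pickle.dump(data, buf)
    buf.seek(0)  # rewind before reading back
    assert pickle.load(buf) == data

test_roundtrip_fileobj()
```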

@aabadie aabadie added the in progress label May 11, 2016
@aabadie aabadie mentioned this pull request May 11, 2016
@aabadie aabadie changed the title WIP : persistence in file handles WIP : persistence using file handles May 11, 2016
@aabadie aabadie force-pushed the aabadie:file_handle_persistence branch 2 times, most recently from 56d6805 to ba0f914 May 11, 2016
@aabadie aabadie changed the title WIP : persistence using file handles [MRG] persistence using file handles May 12, 2016
@aabadie
Contributor Author

@aabadie aabadie commented May 12, 2016

I think this is in pretty good shape now, so I'm changing the status to MRG.

A few comments though:

  • the implementation is not perfect (e.g. the new _COMPRESSOR_OBJS list, etc.), so I'd like to have someone else's opinion,
  • forcing the decompressor to a wrong one (example: use gzip compression with dump but try to decompress with bz2) will raise an exception. Should we handle this case? My answer is 'no', as the message is quite meaningful, but maybe other people won't agree,
  • online documentation and CHANGES.rst still need to be updated, but this can be done after a preliminary review.
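To illustrate the mismatched-decompressor case with the stdlib compressors alone (not joblib's API): feeding gzip-compressed bytes to the bz2 decompressor fails with an error whose message already points at the problem. A small sketch:

```python
import bz2
import gzip

payload = gzip.compress(b"some data to persist")  # gzip-compressed bytes

try:
    bz2.decompress(payload)  # wrong decompressor on purpose
except OSError as exc:
    # bz2 rejects the stream outright; the message is already meaningful
    print("bz2 refused the gzip stream:", exc)
```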
@aabadie aabadie force-pushed the aabadie:file_handle_persistence branch from ba0f914 to 2d9833c May 18, 2016
@aabadie
Contributor Author

@aabadie aabadie commented May 18, 2016

@lesteve, I addressed what we discussed IRL. Feel free to have a look when you have time.

aabadie referenced this pull request in GaelVaroquaux/website May 20, 2016
@@ -389,6 +389,9 @@ def dump(value, filename, compress=0, protocol=None, cache_size=None):
if Path is not None and isinstance(filename, Path):
filename = str(filename)

isFilename = isinstance(filename, _basestring)
isFileobj = hasattr(filename, "read") and hasattr(filename, "write")


@ogrisel

ogrisel May 26, 2016
Contributor

🚨 ⚠️ 🐫 ⚠️ camelcase alert! ⚠️ 🐫 ⚠️ 🚨


@aabadie

aabadie May 27, 2016
Author Contributor

Sorry, habits from my old Qt life. I renamed using 🐍 snake_case 🐍 😄

new_exc.__cause__ = exc
raise new_exc

if hasattr(filename, "read") and hasattr(filename, "write"):


@ogrisel

ogrisel May 26, 2016
Contributor

We don't need "write" to load, therefore test only for "read" (and "seek" too).


@aabadie

aabadie May 27, 2016
Author Contributor

Agreed and fixed.

@aabadie aabadie force-pushed the aabadie:file_handle_persistence branch 2 times, most recently from 7996f7b to f33fac6 May 27, 2016
@aabadie
Contributor Author

@aabadie aabadie commented Jun 9, 2016

@lesteve, @ogrisel, any comments?

@aabadie aabadie changed the title [MRG] persistence using file handles [MRG] persistence using file object Jun 9, 2016
@aabadie aabadie changed the title [MRG] persistence using file object [MRG] persistence in/from file object Jun 9, 2016
@aabadie aabadie changed the title [MRG] persistence in/from file object [MRG] persistence in/from file objects Jun 9, 2016
@@ -467,12 +472,48 @@ def dump(value, filename, compress=0, protocol=None, cache_size=None):
NumpyPickler(f, protocol=protocol).dump(value)

else:


@lesteve

lesteve Jun 9, 2016
Member

Use an elif here so that all the conditions are on the same indentation level.


@aabadie

aabadie Jun 9, 2016
Author Contributor

Fixed!

@@ -147,20 +151,42 @@ def _read_fileobject(fileobj, filename, mmap_mode=None):
"""
# Detect if the fileobj contains compressed data.
compressor = _detect_compressor(fileobj)
if isinstance(fileobj, tuple(_COMPRESSOR_OBJS)):


@lesteve

lesteve Jun 9, 2016
Member

Remind me why _read_fileobject is a generator function again?


@aabadie

aabadie Jun 15, 2016
Author Contributor

It was to be able to manage, in the same function, compatibility with old pickles. The problem was that the load_compatibility function had to be called from the initial calling function (e.g. load). The solution is to yield the filename in case of old pickles and yield file objects with new pickles.


@lesteve

lesteve Jun 17, 2016
Member

I think my question was why you were using yield rather than return but this discussion is probably best left out of this PR.

@aabadie aabadie force-pushed the aabadie:file_handle_persistence branch from 06d650e to 99a275a Jun 15, 2016
@aabadie aabadie added this to the 0.10 milestone Jun 16, 2016
@aabadie
Contributor Author

@aabadie aabadie commented Jun 17, 2016

Just for the record, here is the kind of thing now possible with this PR: https://gist.github.com/aabadie/074587354d97d872aff6abb65510f618

@aabadie aabadie force-pushed the aabadie:file_handle_persistence branch from 99a275a to 6e9d362 Jun 17, 2016
@@ -333,14 +334,15 @@ def test_compress_mmap_mode_warning():
numpy_pickle.load(this_filename, mmap_mode='r+')
nose.tools.assert_equal(len(caught_warnings), 1)
for warn in caught_warnings:
nose.tools.assert_equal(warn.category, DeprecationWarning)
nose.tools.assert_equal(warn.category, Warning)


@lesteve

lesteve Jun 17, 2016
Member

Why DeprecationWarning -> Warning?


@aabadie

aabadie Jun 17, 2016
Author Contributor

Because this is the type of warning raised here and the one we want to test: it checks that one cannot load a compressed pickle using mmap_mode, so a DeprecationWarning makes no sense in this case.

if compressor == 'zlib':
'mmap_mode "%(mmap_mode)s" flag passed. mmap_mode '
'option will be ignored.'
% locals(), Warning, stacklevel=2)


@lesteve

lesteve Jun 17, 2016
Member

For generic warnings, probably better not to specify anything here:

warnings.warn(message, stacklevel=2)

It will default to UserWarning.
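That default is easy to check directly (a small standalone sketch, unrelated to joblib's code):

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # No category argument: warnings.warn falls back to UserWarning.
    warnings.warn('mmap_mode option will be ignored', stacklevel=2)

print(caught[0].category)  # <class 'UserWarning'>
```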

@@ -49,6 +50,9 @@

# Supported compressors
_COMPRESSORS = ('zlib', 'bz2', 'lzma', 'xz', 'gzip')
_COMPRESSOR_OBJS = [gzip.GzipFile, bz2.BZ2File]


@lesteve

lesteve Jun 17, 2016
Member

_COMPRESSOR_CLASSES would be a better name.

@aabadie aabadie force-pushed the aabadie:file_handle_persistence branch from 6e9d362 to 1af045f Jun 20, 2016
@aabadie aabadie added the enhancement label Jun 20, 2016
'This feature is not supported by joblib.')
new_exc.__cause__ = exc
raise new_exc
# Raise exception "as-is" with python 2


@lesteve

lesteve Jun 20, 2016
Member

Reraise exception with Python 2

@aabadie aabadie force-pushed the aabadie:file_handle_persistence branch from e7b1cb3 to 2002371 Jun 23, 2016
@aabadie
Contributor Author

@aabadie aabadie commented Jun 23, 2016

I had issues with doctest and the usage of file objects in a context manager. I skipped the lines in the doc, but that may not be a solution. Any better idea?

Otherwise, I rebased on the latest master (including the recent flake8_diff.sh changes) and it works just fine.

@lesteve
Member

@lesteve lesteve commented Jun 24, 2016

I had issues with doctest and the usage of file objects in a context manager. I skipped the lines in the doc, but that may not be a solution. Any better idea?

What was the error? It would be great to understand where it comes from and fix it properly ...

Otherwise, I rebased on the latest master (including the recent flake8_diff.sh changes) and it works just fine.

Great, thanks a lot!

@aabadie
Contributor Author

@aabadie aabadie commented Jun 24, 2016

What was the error? It would be great to understand where it comes from and fix it properly ...

There were 2 things:

  • with open() as fo returns the file object ([<io.blabla>]): I fixed it using an ellipsis flag and by adding [<...>] after the with block
  • in Python 2.6, using gzip.GzipFile or bz2.BZ2File doesn't work if you don't use contextlib.closing, so I had to skip the file object compression example.
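The ellipsis trick relies on doctest's ELLIPSIS option flag, which lets <...> match any repr. A standalone sketch of the mechanism (the sample snippet is illustrative, not taken from the joblib docs):

```python
import doctest
import io

sample = """
>>> io.BytesIO()  # doctest: +ELLIPSIS
<...>
"""

# Parse and run the snippet with ELLIPSIS enabled so "<...>" matches
# the full repr, e.g. "<_io.BytesIO object at 0x7f...>".
test = doctest.DocTestParser().get_doctest(sample, {'io': io}, 'sample', None, 0)
runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS)
runner.run(test)
print(runner.failures)  # 0
```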
@lesteve
Member

@lesteve lesteve commented Jun 24, 2016

with open() as fo returns the file object ([<io.blabla>]): I fixed it using an ellipsis flag and by adding [<...>] after the with block

Hmmm I am not sure what we should do in this case:

with gzip.GzipFile('/tmp/test.gz', 'wb') as f:             
    res = joblib.dump({'a': [1, 2, 3], 'b': 'asadf'}, f)

Should res be:

  • a single element list containing f
  • a single element list containing f.name, to be more consistent with when dump is called with a filename
  • or, because this is a new feature, we can do what we want; in particular, returning a list may not make sense, since the list was there for historical reasons (main pickle + companion files for each numpy array)

in python 2.6, using gzip.GzipFile or bz2.BZ2File doesn't work if you don't use contextlib.closing. So I had to skip the file object compression example.

Pfff 😭 is there a way we can just skip doctests in Python 2.6?

@lesteve
Member

@lesteve lesteve commented Jun 24, 2016

Pfff 😭 is there a way we can just skip doctests in Python 2.6?

Seems like scikit-learn has some *fixture* files in the doc folder to skip the doctests. You can probably have a look there. I believe there is something in setup.cfg as well to work with nose and doctest-fixtures.

@aabadie
Contributor Author

@aabadie aabadie commented Jun 27, 2016

Should res be:

  • a single element list containing f
  • a single element list containing f.name, to be more consistent with when dump is called with a filename
  • or, because this is a new feature, we can do what we want; in particular, returning a list may not make sense, since the list was there for historical reasons (main pickle + companion files for each numpy array)

I think returning nothing when a file object is given as input is reasonable, because this means the user knows in advance the destination of the dump, and the returned list is there, as you said, for historical reasons.

@aabadie
Contributor Author

@aabadie aabadie commented Jun 27, 2016

Seems like scikit-learn has some fixture files in the doc folder to skip the doctests

Nice, it works like a charm. The setup.cfg file in joblib was already correctly configured.

@aabadie aabadie force-pushed the aabadie:file_handle_persistence branch from 3a96f67 to 7d1e947 Jun 27, 2016
def setup_module(module):
"""Setup module."""
if _compat.PY26:
raise SkipTest("Skipping persitence doctest in Python 2.6")


@lesteve

lesteve Jun 28, 2016
Member

typo in persistence (missing s)

@lesteve
Member

@lesteve lesteve commented Jun 28, 2016

I think returning nothing in case a file object is given as input is reasonable because this means the user knows in advance the destination of the dump and the returned list is here,

Not entirely convinced, but it is a good point that joblib.dump returning something is mostly for companion files (i.e. npy files) since they are not passed in by the user. Let's ask @GaelVaroquaux and @ogrisel whether they have an opinion on this!

@GaelVaroquaux
Member

@GaelVaroquaux GaelVaroquaux commented Jun 28, 2016

I think returning nothing in case a file object is given as input
is reasonable because this means the user knows in advance the
destination of the dump and the returned list is here,

The danger with returning the file object is that it risks having users hold on to open file handles. Under Windows, this locks the file and hence can cause problems in parallel computing.

@aabadie
Contributor Author

@aabadie aabadie commented Jun 28, 2016

The danger with returning the file object is that it risks having users hold on to open file handles. Under Windows, this locks the file and hence can cause problems in parallel computing.

So I guess we can keep it as it is in this PR (returns nothing).

@lesteve
Member

@lesteve lesteve commented Jun 29, 2016

So I guess we can keep it as it is in this PR (returns nothing).

Let's do it like this in this PR. It'd be good, though, to make the joblib.dump return type consistent whether it takes a filename or a file object in a further PR.

def setup_module(module):
"""Setup module."""
if _compat.PY26:
raise SkipTest("Skipping persistence doctest in Python 2.6")


@lesteve

lesteve Jun 29, 2016
Member

Can you add a comment to say that GzipFile cannot be used as a context manager in Python 2.6, or some better explanation?


@aabadie

aabadie Jun 29, 2016
Author Contributor

Comment added.

@lesteve lesteve merged commit af2e9b7 into joblib:master Jun 29, 2016
3 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
coverage/coveralls Coverage increased (+0.2%) to 89.713%
@lesteve lesteve removed the in progress label Jun 29, 2016
@lesteve
Member

@lesteve lesteve commented Jun 29, 2016

Merging, thanks!

@aabadie aabadie deleted the aabadie:file_handle_persistence branch Sep 24, 2016
yarikoptic added a commit to yarikoptic/joblib that referenced this pull request Sep 30, 2016
* tag '0.10.2': (55 commits)
  Bump up __version__
  Update release to 0.10.2 in CHANGES.rst
  nosetests should run tests only from the joblib folder.
  API expose joblib.parallel.ParallelBackendBase and joblib.parallel.AutoBatchingMixin
  Update to numpydoc 0.6
  [numpy_pick_utils] Handle unicode filenames in BinaryZlibFile (joblib#384)
  Fix format_stack with compiled code (e.g. .so or .pyd)
  PEP8: cosmit fix (joblib#376)
  FIX typo
  Release 0.10.0
  FIX: __all__ should hold symbol names as strings
  Fix bench_auto_batching.py
  [MRG] Persistence in/from file objects (joblib#351)
  Minor tweaks in auto batching benchmark script
  Improve flake8_diff.sh (joblib#371)
  FIX numpy array persistence with pickle HIGHEST_PROTOCOL (joblib#370)
  DOC: remove easy_install from joblib installation documentation (joblib#363)
  MAINT fix typo
  DOC Add documentation of mmap_mode
  Explicit handling of job cancellation on first collected exception (joblib#361)
  ...