Fix numpy pickle for versions less than 1.9 #243

lesteve commented Sep 24, 2015

Fix #236. There were two bugs conflated in one:

Python 3 writer, Python 2 reader

That's the direction reported in #236. A standard pickle of a dtype created with Python 3 and numpy 1.9 cannot be read with Python 2 and numpy 1.8.

Writer script (python 3, numpy 1.9)

import pickle
import numpy as np
dt = np.dtype('>f8')
with open('/tmp/test.pkl', 'wb') as f:
    pickle.Pickler(f, protocol=2).dump(dt)

Reader script (python 2, numpy 1.8)

import pickle
pickle.load(open('/tmp/test.pkl', 'rb'))

Traceback:

Traceback (most recent call last):
  File "/home/lesteve/miniconda3/envs/scratch_py27/lib/python2.7/unittest/case.py", line 331, in run
    testMethod()
  File "/home/lesteve/miniconda3/envs/scratch_py27/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/lesteve/dev/joblib/joblib/test/test_numpy_pickle.py", line 324, in test_compressed_pickle_python_2_3_compatibility
    result_list = numpy_pickle.load(fname)
  File "/home/lesteve/dev/joblib/joblib/numpy_pickle.py", line 524, in load
    obj = unpickler.load()
  File "/home/lesteve/miniconda3/envs/scratch_py27/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/home/lesteve/dev/joblib/joblib/numpy_pickle.py", line 384, in load_build
    Unpickler.load_build(self)
  File "/home/lesteve/miniconda3/envs/scratch_py27/lib/python2.7/pickle.py", line 1217, in load_build
    setstate(state)
TypeError: must be char, not unicode

Python 2 writer, Python 3 reader

Related to the NumpyPickler/NumpyUnpickler cycle. The likely culprit is that we use encoding='bytes' for performance reasons, but that also means that all Python 2 strings are loaded as bytes in Python 3. This seems a bit iffy to say the least ... @GaelVaroquaux @ogrisel any opinions?

Writer script (python 2, numpy 1.9)

import joblib
import numpy as np
# NumpyPickler opens the target file itself; it only needs the filename
pickler = joblib.numpy_pickle.NumpyPickler('/tmp/test.pkl')
pickler.save(np.dtype('f8'))
# Don't forget to quit ipython here in order to flush to file ...

Reader script (python 3, numpy 1.8)

import joblib
unpickler = joblib.numpy_pickle.NumpyUnpickler('/tmp/test.pkl', open('/tmp/test.pkl', 'rb'))
unpickler.load()

Traceback

Traceback (most recent call last):
  File "/volatile/le243287/miniconda3/envs/scratch/lib/python3.4/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/volatile/le243287/miniconda3/envs/scratch/lib/python3.4/unittest/case.py", line 577, in run
    testMethod()
  File "/volatile/le243287/miniconda3/envs/scratch/lib/python3.4/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/le243287/dev/joblib/joblib/test/test_numpy_pickle.py", line 325, in test_compressed_pickle_python_2_3_compatibility
    result_list = numpy_pickle.load(fname)
  File "/home/le243287/dev/joblib/joblib/numpy_pickle.py", line 524, in load
    obj = unpickler.load()
  File "/volatile/le243287/miniconda3/envs/scratch/lib/python3.4/pickle.py", line 1038, in load
    dispatch[key[0]](self)
  File "/home/le243287/dev/joblib/joblib/numpy_pickle.py", line 384, in load_build
    Unpickler.load_build(self)
  File "/volatile/le243287/miniconda3/envs/scratch/lib/python3.4/pickle.py", line 1505, in load_build
    setstate(state)
TypeError: must be a unicode character, not bytes

lesteve commented Sep 24, 2015

The likely culprit is that we use encoding='bytes' for performance reasons, but that also means that all Python 2 strings are loaded as bytes in Python 3. This seems a bit iffy to say the least ... @GaelVaroquaux @ogrisel any opinions?

Full disclosure: at the moment the Python 2/3 compatibility for non-compressed pickles is actually broken because of this iffiness. An instance of NDArrayWrapper ends up with keys that are bytes rather than str.
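
To make the effect of encoding='bytes' concrete, here is a minimal sketch that uses only the standard library pickle module under Python 3, not joblib's NumpyUnpickler. The payload is a hand-written protocol 2 pickle of {'key': 1} as Python 2 would store it; it is an illustrative value, not data taken from this PR.

```python
import pickle

# Hand-assembled protocol 2 pickle equivalent to what Python 2 writes for
# {'key': 1}. The key uses the SHORT_BINSTRING opcode ('U'), which is how
# Python 2 serializes its 8-bit str type.
PY2_PICKLE = b"\x80\x02}q\x00U\x03keyq\x01K\x01s."

# Default decoding (encoding='ASCII'): a Python 2 str becomes a Python 3 str.
as_text = pickle.loads(PY2_PICKLE)

# encoding='bytes': every Python 2 str comes back as bytes, dict keys included.
as_bytes = pickle.loads(PY2_PICKLE, encoding="bytes")

print(as_text)
print(as_bytes)
```

With encoding='bytes' the dict key is b'key' rather than 'key', which is the same kind of mismatch described above for the keys of an NDArrayWrapper instance.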

lesteve added 2 commits Sep 24, 2015
For numpy < 1.9, np.dtype.__setstate__ is very picky about getting '<'
and not b'<' in python 3, and '<' and not u'<' in python 2.
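
The pickiness described in this commit message can be worked around by coercing the byte-order character to the interpreter's native str type before the state reaches np.dtype.__setstate__. The sketch below is one hypothetical way to do that and is not necessarily what this branch implements. The helper names are made up for illustration, and the assumption that the byte-order character sits at index 1 of the state tuple is based on the usual layout of a pickled dtype state.

```python
import sys


def to_native_str(value):
    """Return value as the native str type of the running interpreter."""
    if sys.version_info[0] >= 3:
        # Python 3: turn b'<' into '<'
        if isinstance(value, bytes):
            return value.decode("latin1")
        return value
    # Python 2: turn u'<' into '<' (this branch never runs on Python 3)
    if isinstance(value, unicode):  # noqa: F821
        return value.encode("latin1")
    return value


def normalize_dtype_state(state):
    """Coerce the byte-order entry (assumed to be at index 1) to native str."""
    state = list(state)
    state[1] = to_native_str(state[1])
    return tuple(state)


# Illustrative state tuple, as it might look after loading with encoding='bytes'
print(normalize_dtype_state((3, b"<", None, None, None, -1, -1, 0)))
```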
lesteve force-pushed the lesteve:fix-numpy-pickle-for-versions-less-than-1.9 branch from add4c59 to 6c9d287 Sep 28, 2015

lesteve commented Oct 5, 2015

This has been superseded by #249.

lesteve closed this Oct 5, 2015
0-wiz-0 pushed a commit to NetBSD/pkgsrc-wip that referenced this pull request Nov 4, 2015
Olivier Grisel

    Revert back to the fork start method (instead of forkserver) as the latter was found to cause crashes in interactive Python sessions.

Release 0.9.2

Loïc Estève

    Joblib hashing now uses the default pickle protocol (2 for Python 2 and 3 for Python 3). This makes it very unlikely to get the same hash for a given object under Python 2 and Python 3.

    In particular, for Python 3 users, this means that the output of joblib.hash changes when switching from joblib 0.8.4 to 0.9.2. We strive to ensure that the output of joblib.hash does not change needlessly in future versions of joblib, but this is not officially guaranteed.
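
Why the pickle protocol changes the hash can be sketched with the standard library alone. The snippet below is not joblib.hash (which uses its own pickler subclass); it only assumes that the digest is computed over pickled bytes. The function name pickle_digest and the choice of md5 are illustrative.

```python
import hashlib
import pickle


def pickle_digest(obj, protocol):
    """Digest of the pickled bytes of obj for a given pickle protocol."""
    return hashlib.md5(pickle.dumps(obj, protocol=protocol)).hexdigest()


obj = {"a": [1, 2, 3]}
digest_p2 = pickle_digest(obj, 2)  # default protocol under Python 2
digest_p3 = pickle_digest(obj, 3)  # default protocol under Python 3 at the time

# The byte streams differ (starting with the protocol header),
# so the two digests differ as well.
print(digest_p2 == digest_p3)
```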

Loïc Estève

    Joblib pickles generated with Python 2 cannot be loaded with Python 3, and the same applies to joblib pickles generated with Python 3 and loaded with Python 2.

    During the beta period 0.9.0b2 to 0.9.0b4, we experimented with a joblib serialization that aimed to make pickles serialized with Python 3 loadable under Python 2. Unfortunately this serialization strategy proved to be too fragile as far as long-term maintenance was concerned (for example, see joblib/joblib#243). That means that joblib pickles generated with joblib 0.9.0bN cannot be loaded under joblib 0.9.2. Joblib beta testers, who are the only ones likely to be affected by this, are advised to delete their joblib cache when they upgrade from 0.9.0bN to 0.9.2.

Arthur Mensch

    Fixed a bug with joblib.hash that used to return unstable values for strings and numpy.dtype instances depending on interning states.

Olivier Grisel

    Make joblib use the 'forkserver' start method by default under Python 3.4+ to avoid causing crashes with 3rd party libraries (such as Apple vecLib / Accelerate or the GCC OpenMP runtime) that use an internal thread pool that is not reinitialized when a fork system call happens.

Olivier Grisel

    New context manager based API (with block) to re-use the same pool of workers across consecutive parallel calls.

Vlad Niculae and Olivier Grisel

    Automated batching of fast tasks into longer running jobs to hide multiprocessing dispatching overhead when possible.

Olivier Grisel

    FIX make it possible to call joblib.load(filename, mmap_mode='r') on pickled objects that include a mix of arrays of both memmapable dtypes and object dtype.

Release 0.8.4

2014-11-20 Olivier Grisel

    OPTIM use the C-optimized pickler under Python 3

    This makes it possible to efficiently process parallel jobs that deal with numerous Python objects such as large dictionaries.

Release 0.8.3

2014-08-19 Olivier Grisel

    FIX disable memmapping for object arrays

2014-08-07 Lars Buitinck

    MAINT NumPy 1.10-safe version comparisons

2014-07-11 Olivier Grisel

    FIX #146: Heisen test failure caused by thread-unsafe Python lists

    This fix uses a queue.Queue data structure in the failing test. This data structure is thread-safe thanks to an internal Lock. This Lock instance is not picklable, which causes the picklability check of delayed to fail.

    When using the threading backend, picklability is no longer required, hence this PR gives the user the ability to disable the check on a case-by-case basis.
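
The picklability problem with queue.Queue can be reproduced with the standard library under Python 3. The helper is_picklable below is a hypothetical stand-in for the kind of check described above, not the code joblib runs in delayed.

```python
import pickle
import queue


def is_picklable(obj):
    """Return True if pickle.dumps accepts obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except Exception:
        # A Queue holds an internal lock, which pickle refuses to serialize
        return False
    return True


print(is_picklable([1, 2, 3]))
print(is_picklable(queue.Queue()))
```

A plain list passes the check while the Queue does not, which is why a thread-based backend needs a way to skip it.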

Release 0.8.2

2014-06-30 Olivier Grisel

    BUG: use mmap_mode='r' by default in Parallel and MemmapingPool

    The former default of mmap_mode='c' (copy-on-write) caused problematic use of the paging file under Windows.

2014-06-27 Olivier Grisel

    BUG: fix usage of the /dev/shm folder under Linux

Release 0.8.1

2014-05-29 Gael Varoquaux

    BUG: fix crash with high verbosity

Release 0.8.0

2014-05-14 Olivier Grisel

    Fix a bug in exception reporting under Python 3

2014-05-10 Olivier Grisel

    Fixed a potential segfault when passing non-contiguous memmap instances.

2014-04-22 Gael Varoquaux

    ENH: Make memory robust to modification of source files while the interpreter is running. Should lead to less spurious cache flushes and recomputations.

2014-02-24 Philippe Gervais

    New Memory.call_and_shelve API to handle memoized results by reference instead of by value.

Release 0.8.0a3

2014-01-10 Olivier Grisel & Gael Varoquaux

    FIX #105: Race condition in task iterable consumption when pre_dispatch != 'all' that could cause crash with error messages "Pools seems closed" and "ValueError: generator already executing".

2014-01-12 Olivier Grisel

    FIX #72: joblib cannot persist "output_dir" keyword argument.

Release 0.8.0a2

2013-12-23 Olivier Grisel

    ENH: set default value of Parallel's max_nbytes to 100MB

    Motivation: avoid introducing disk latency on medium sized parallel workload where memory usage is not an issue.

    FIX: properly handle the JOBLIB_MULTIPROCESSING env variable

    FIX: timeout test failures under windows

Release 0.8.0a

2013-12-19 Olivier Grisel

    FIX: support the new Python 3.4 multiprocessing API

2013-12-05 Olivier Grisel

    ENH: make Memory respect mmap_mode at first call too

    ENH: add a threading based backend to Parallel

    This is a low-overhead alternative to the default multiprocessing backend that is suitable when calling compiled extensions that release the GIL.

Author: Dan Stahlke <dan@stahlke.org> Date: 2013-11-08

    FIX: use safe_repr to print arg vals in trace

    This fixes a problem in which extremely long (and slow) stack traces would be produced when function parameters are large numpy arrays.

2013-09-10 Olivier Grisel

    ENH: limit memory copy with Parallel by leveraging numpy.memmap when possible

Release 0.7.1

2013-07-25 Gael Varoquaux

    MISC: capture meaningless argument (n_jobs=0) in Parallel

2013-07-09 Lars Buitinck

    ENH Handle tuples, sets and Python 3's dict_keys type the same as lists in pre_dispatch

2013-05-23 Martin Luessi

    ENH: fix function caching for IPython

Release 0.7.0

This release drops support for Python 2.5 in favor of support for Python 3.0

2013-02-13 Gael Varoquaux

    BUG: fix nasty hash collisions

2012-11-19 Gael Varoquaux

    ENH: Parallel: Turn off pre-dispatch for already expanded lists

Gael Varoquaux 2012-11-19

    ENH: detect recursive sub-process spawning, as when people do not protect the __main__ in scripts under Windows, and raise a useful error.

Gael Varoquaux 2012-11-16

    ENH: Full python 3 support

Release 0.6.5

2012-09-15 Yannick Schwartz

    BUG: make sure that sets and dictionaries give reproducible hashes

2012-07-18 Marek Rudnicki

    BUG: make sure that object-dtype numpy arrays hash correctly

2012-07-12 GaelVaroquaux

    BUG: Bad default n_jobs for Parallel

Release 0.6.4

2012-05-07 Vlad Niculae

    ENH: controlled randomness in tests and doctest fix

2012-02-21 GaelVaroquaux

    ENH: add verbosity in memory

2012-02-21 GaelVaroquaux

    BUG: non-reproducible hashing: order of kwargs

    The ordering of a dictionary is random. As a result, the function hashing was not reproducible. Pretty hard to test.
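
One common way to make a hash independent of dictionary ordering is to sort the items before serializing them. The sketch below illustrates that idea with the standard library; hash_kwargs is a hypothetical helper and the actual joblib fix may be implemented differently.

```python
import hashlib
import pickle


def hash_kwargs(kwargs):
    """Digest of keyword arguments that does not depend on their order."""
    # Sorting by key gives a canonical ordering regardless of insertion order
    canonical = sorted(kwargs.items())
    return hashlib.md5(pickle.dumps(canonical, protocol=2)).hexdigest()


h1 = hash_kwargs({"a": 1, "b": 2})
h2 = hash_kwargs({"b": 2, "a": 1})
print(h1 == h2)
```

This sketch assumes the keys are mutually comparable, which holds for keyword argument names since they are always strings.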

Release 0.6.3

2012-02-14 GaelVaroquaux

    BUG: fix joblib Memory pickling

2012-02-11 GaelVaroquaux

    BUG: fix hasher with Python 3

2012-02-09 GaelVaroquaux

    API: filter_args: *args, **kwargs -> args, kwargs

Release 0.6.2

2012-02-06 Gael Varoquaux

    BUG: make sure Memory pickles even if cachedir=None

Release 0.6.1

Bugfix release because of a merge error in release 0.6.0

Release 0.6.0

Beta 3

2012-01-11 Gael Varoquaux

    BUG: ensure compatibility with old numpy

    DOC: update installation instructions

    BUG: file semantic to work under Windows

2012-01-10 Yaroslav Halchenko

    BUG: a fix toward 2.5 compatibility

Beta 2

2012-01-07 Gael Varoquaux

    ENH: hash: bugware to be able to hash objects defined interactively in IPython

2012-01-07 Gael Varoquaux

    ENH: Parallel: warn and not fail for nested loops

    ENH: Parallel: n_jobs=-2 now uses all CPUs but one
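
The negative n_jobs convention can be written as a small formula: for negative values, n_cpus + 1 + n_jobs workers are used. The function below is an illustrative sketch of that arithmetic with a hypothetical name, not joblib's own implementation. The floor of one worker is an added assumption.

```python
def effective_workers(n_jobs, n_cpus):
    """Translate an n_jobs value into a concrete worker count."""
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 means all CPUs, -2 all CPUs but one, and so on;
        # never go below one worker (assumption of this sketch)
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


print(effective_workers(-1, 8))
print(effective_workers(-2, 8))
```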

2012-01-01 Juan Manuel Caicedo Carvajal and Gael Varoquaux

    ENH: add verbosity levels in Parallel

Release 0.5.7

2011-12-28 Gael varoquaux

    API: zipped -> compress

2011-12-26 Gael varoquaux

    ENH: Add a zipped option to Memory

    API: Memory no longer accepts save_npy

2011-12-22 Kenneth C. Arnold and Gael varoquaux

    BUG: fix numpy_pickle for array subclasses

2011-12-21 Gael varoquaux

    ENH: add zip-based pickling

2011-12-19 Fabian Pedregosa

    Py3k: compatibility fixes. This makes the tests test_disk and test_parallel run fine.

Release 0.5.6

2011-12-11 Lars Buitinck

    ENH: Replace os.path.exists before makedirs with exception check. New disk.mkdirp will fail with other errnos than EEXIST.
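
The pattern described in this entry avoids the race between checking for a directory and creating it: attempt the creation and only ignore the "already exists" error. The following is a minimal sketch of that pattern; it reuses the name mkdirp from the entry, but it is written from the description and may differ from joblib's disk.mkdirp.

```python
import errno
import os
import tempfile


def mkdirp(path):
    """Create path and its parents, ignoring only 'already exists'."""
    try:
        os.makedirs(path)
    except OSError as exc:
        # Any other errno (permissions, read-only filesystem, ...) propagates
        if exc.errno != errno.EEXIST:
            raise


base = tempfile.mkdtemp()
target = os.path.join(base, "a", "b")
mkdirp(target)
mkdirp(target)  # second call is a no-op instead of an error
print(os.path.isdir(target))
```

On Python 3.2 and later, os.makedirs(path, exist_ok=True) covers the same case.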

2011-12-10 Bala Subrahmanyam Varanasi

    MISC: pep8 compliant

Release 0.5.5

2011-19-10 Fabian Pedregosa

    ENH: Make joblib installable under Python 3.X

Release 0.5.4

2011-09-29 Jon Olav Vik

    BUG: Make mangling path to filename work on Windows

2011-09-25 Olivier Grisel

    FIX: doctest heisenfailure on execution time

2011-08-24 Ralf Gommers

    STY: PEP8 cleanup.

Release 0.5.3

2011-06-25 Gael varoquaux

    API: All the useful symbols in the __init__

Release 0.5.2

2011-06-25 Gael varoquaux

    ENH: Add cpu_count

2011-06-06 Gael varoquaux

    ENH: Make sure memory hash in a reproducible way

Release 0.5.1

2011-04-12 Gael varoquaux

    TEST: Better testing of parallel and pre_dispatch

Yaroslav Halchenko 2011-04-12

    DOC: quick pass over docs -- trailing spaces/spelling

Yaroslav Halchenko 2011-04-11

    ENH: JOBLIB_MULTIPROCESSING env var to disable multiprocessing from the environment

Alexandre Gramfort 2011-04-08

    ENH : adding log message to know how long it takes to load from disk the cache

Release 0.5.0

2011-04-01 Gael varoquaux

    BUG: pickling MemoizeFunc does not store timestamp

2011-03-31 Nicolas Pinto

    TEST: expose hashing bug with cached method

2011-03-26...2011-03-27 Pietro Berkes

    BUG: fix error management in rm_subdirs

    BUG: fix for race condition during tests in mem.clear()

Gael varoquaux 2011-03-22...2011-03-26

    TEST: Improve test coverage and robustness

Gael varoquaux 2011-03-19

    BUG: hashing functions with only *var **kwargs

Gael varoquaux 2011-02-01... 2011-03-22

    BUG: Many fixes to capture interprocess race condition when mem.cache is used by several processes on the same cache.

Fabian Pedregosa 2011-02-28

    First work on Py3K compatibility

Gael varoquaux 2011-02-27

    ENH: pre_dispatch in parallel: lazy generation of jobs in parallel for to avoid drowning memory.

GaelVaroquaux 2011-02-24

    ENH: Add the option of overloading the arguments of the mother 'Memory' object in the cache method that is doing the decoration.

Gael varoquaux 2010-11-21

    ENH: Add a verbosity level for more verbosity

Release 0.4.6

Gael varoquaux 2010-11-15

    ENH: Deal with interruption in parallel

Gael varoquaux 2010-11-13

    BUG: Exceptions raised by Parallel when n_job=1 are no longer captured.

Gael varoquaux 2010-11-13

    BUG: Capture wrong arguments properly (better error message)

Release 0.4.5

Pietro Berkes 2010-09-04

    BUG: Fix Windows peculiarities with path separators and file names

    BUG: Fix more windows locking bugs

Gael varoquaux 2010-09-03

    ENH: Make sure that exceptions raised in Parallel also inherit from the original exception class

    ENH: Add a shadow set of exceptions

Fabian Pedregosa 2010-09-01

    ENH: Clean up the code for parallel. Thanks to Fabian Pedregosa for the patch.

Release 0.4.4

Gael varoquaux 2010-08-23

    BUG: Fix Parallel on computers with only one CPU, for n_jobs=-1.

Gael varoquaux 2010-08-02

    BUG: Fix setup.py for extra setuptools args.

Gael varoquaux 2010-07-29

    MISC: Silence tests (and hopefully Yaroslav :P)

Release 0.4.3

Gael Varoquaux 2010-07-22

    BUG: Fix hashing for functions with a side effect modifying their input argument. Thanks to Pietro Berkes for reporting the bug and providing the patch.

Release 0.4.2

Gael Varoquaux 2010-07-16

    BUG: Make sure that joblib still works with Python2.5. => release 0.4.2

Release 0.4.1