Skip to content

Cannot specify pickle protocol used when writing HDF5. #33087

@pdemarti

Description

@pdemarti

Problem description

It appears currently impossible (unless I'm mistaken) to specify what pickle protocol will be used by DataFrame.to_hdf() (and thus by PyTables) if pickling is necessary. That makes it impossible to share HDF5 data written and read by a mix of clients running py37 and py38.

In Python 3.8, pickle protocol 5 was introduced (PEP-574). This prevents my team from supporting py38 in a system meant to support clients running a range of Python versions (from 3.6) and sharing a common distributed filesystem. We plan to eventually deprecate support for py36 and py37, but the problem is that there doesn't seem to be a way to manage the transition when such a new protocol is introduced.

xref: this StackOverflow question.

Example

(base) $ conda activate py38
(py38) $ python
Python 3.8.1 (default, Jan  8 2020, 22:29:32)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.DataFrame(['hello', 'world']))
>>> df.to_hdf('foo', 'x')
>>> exit()
(py38) $ conda deactivate
(base) $ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_hdf('foo', 'x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 407, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 782, in select
    return it.get_result()
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 1639, in get_result
    results = self.func(self.start, self.stop, where)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 766, in func
    return s.read(start=_start, stop=_stop, where=_where, columns=columns)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 3206, in read
    "block{idx}_values".format(idx=i), start=_start, stop=_stop
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 2737, in read_array
    ret = node[0][start:stop]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 681, in __getitem__
    return self.read(start, stop, step)[0]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in read
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in <listcomp>
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/atom.py", line 1227, in fromarray
    return six.moves.cPickle.loads(array.tostring())
ValueError: unsupported pickle protocol: 5
>>>

Expected Behavior

df.to_hdf('foo', 'x', pickle_protocol=4)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions