BUG: Unable to write to hdf when format='table' and list-like values are present in a column #39138

Open
2 of 3 tasks
galipremsagar opened this issue Jan 13, 2021 · 1 comment
Labels
Enhancement IO HDF5 read_hdf, HDFStore

Comments

@galipremsagar

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In[43]: df = pd.DataFrame({'a':[[1, 2, 3]]})
In[44]: df
Out[44]: 
           a
0  [1, 2, 3]
In[45]: df.to_hdf('sample.hdf', 'test')
/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/pandas/core/generic.py:2449: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['a'], dtype='object')]

  encoding=encoding,
In[46]: df.to_hdf('sample.hdf', 'test')
/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/pandas/core/generic.py:2449: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['a'], dtype='object')]

  encoding=encoding,
In[47]: df.to_hdf('sample.hdf', 'test', format='table')
Traceback (most recent call last):
  File "/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-47-13d7ad5be031>", line 1, in <module>
    df.to_hdf('sample.hdf', 'test', format='table')
  File "/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/pandas/core/generic.py", line 2449, in to_hdf
    encoding=encoding,
  File "/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/pandas/io/pytables.py", line 270, in to_hdf
    f(store)
  File "/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/pandas/io/pytables.py", line 262, in <lambda>
    encoding=encoding,
  File "/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/pandas/io/pytables.py", line 1129, in put
    track_times=track_times,
  File "/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/pandas/io/pytables.py", line 1801, in _write_to_group
    track_times=track_times,
  File "/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/pandas/io/pytables.py", line 4238, in write
    data_columns=data_columns,
  File "/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/pandas/io/pytables.py", line 3907, in _create_axes
    errors=self.errors,
  File "/home/pgali/anaconda3/envs/cudf_dev/lib/python3.7/site-packages/pandas/io/pytables.py", line 4896, in _maybe_convert_for_string_atom
    for i in range(len(block.shape[0])):
TypeError: object of type 'int' has no len()

Problem description

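My DataFrame has list-like values in a column. Writing it with the default format succeeds with only a PerformanceWarning, but passing format='table' to to_hdf raises the TypeError shown above, so I'm not able to write the DataFrame to HDF in table format.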
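
The last frame of the traceback calls `len()` on `block.shape[0]`, which is an integer, and that appears to be where the TypeError comes from. Here is a minimal sketch that reproduces only that expression, using a plain NumPy object array as a stand-in for the block's values (the array and the variable names are assumptions for illustration, not the actual pandas internals):

```python
import numpy as np

# Hypothetical stand-in for the object block: one column, one row holding a list.
values = np.empty((1, 1), dtype=object)
values[0, 0] = [1, 2, 3]

try:
    # Same expression as the last frame of the traceback:
    # shape[0] is an int, so len() raises before the loop starts.
    for i in range(len(values.shape[0])):
        pass
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # object of type 'int' has no len()
```

This only accounts for the confusing error message. It does not show that a list-valued column could be written in table format if that expression were different.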

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.7.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-36-generic
Version : #40~20.04.1-Ubuntu SMP Wed Jan 6 10:15:55 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.5
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.3.3
setuptools : 49.6.0.post20210108
Cython : 0.29.21
pytest : 6.2.1
hypothesis : 6.0.0
sphinx : 3.4.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : 2.7.2
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.52.0

@galipremsagar added the Bug and Needs Triage labels Jan 13, 2021
@jreback
Contributor

jreback commented Jan 13, 2021

There is no support for nested data types with the table format.
You can try the fixed format, serialize the data (say, to bytes), look for a community contribution for this enhancement, or use Parquet, which supports this.
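
One way to read the "serialize the data" suggestion is to turn each list into a string before writing and parse it back after reading. This is a rough sketch only: it uses JSON strings rather than bytes, and the helper names `encode_list_column`, `decode_list_column`, and `write_and_read_table` are made up for illustration and are not part of pandas.

```python
import json

import pandas as pd


def encode_list_column(df, column):
    """Return a copy of df with the values in `column` stored as JSON strings."""
    out = df.copy()
    out[column] = out[column].map(json.dumps)
    return out


def decode_list_column(df, column):
    """Return a copy of df with JSON strings in `column` parsed back to Python objects."""
    out = df.copy()
    out[column] = out[column].map(json.loads)
    return out


def write_and_read_table(frame, path="sample.hdf", key="test"):
    """Write an already-encoded frame in table format and read it back.

    Not called here: it needs PyTables installed and has not been checked
    against every pandas version.
    """
    frame.to_hdf(path, key=key, format="table")
    return pd.read_hdf(path, key)


df = pd.DataFrame({"a": [[1, 2, 3]]})
encoded = encode_list_column(df, "a")        # column now holds plain strings
restored = decode_list_column(encoded, "a")  # column holds lists again
```

The encoded frame holds only strings in column `a`, which is the kind of column the table format is meant to handle. The trade-off is that the values are opaque to HDF queries and every reader has to know to decode them.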

@jreback added the IO HDF5 and Enhancement labels and removed the Bug and Needs Triage labels Jan 13, 2021
@jreback jreback added this to the Contributions Welcome milestone Jan 13, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022