Unable to write to HDF5 table if DataFrame has mixed object types (pd.Timestamp and str) #8284

Closed
kvncp opened this issue Sep 16, 2014 · 9 comments
Labels: Bug · Dtype Conversions (Unexpected or buggy dtype conversions) · Enhancement · IO HDF5 (read_hdf, HDFStore)

kvncp commented Sep 16, 2014

When attempting to store data in an HDF5 table, I found a problem where an error is raised if there are multiple object columns containing different data.

import pandas as pd

data = {
    'ints': pd.Series([1, 2, 3], index=index),
    'Timestamps': pd.Series([pd.Timestamp('2014-1-1 12:00', tz='UTC'),
                             pd.Timestamp('2014-1-2 12:00', tz='UTC'),
                             pd.Timestamp('2014-1-3 12:00', tz='UTC')],
                            index=index),
    'strings': pd.Series(['r', 'g', 'b'], index=index),
}

df = pd.DataFrame(data)

df.to_hdf('test.h5', 'data', format='table')

This leads to an exception: TypeError: Cannot serialize the column [Timestamps] because
its data contents are [datetime] object dtype

However, if I remove the string column:

del df['strings']
df.to_hdf('test.h5', 'data', format='table')

Now it works fine - so it isn't a problem with using the pd.Timestamp type.

Digging a little deeper, it appears the problem is that pandas.io.pytables.Table.create_axes groups the columns by data type, with all columns of type object being grouped into one set of data. Then when set_atom is called, it does this:

rvalues = block.values.ravel()
inferred_type = lib.infer_dtype(rvalues)

This leads to an inferred type of 'mixed', since the block contains objects of more than one type. That case isn't handled, so the exception is raised.
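To illustrate the inference step described above, here is a minimal sketch using the public `pandas.api.types.infer_dtype` function rather than the internal `lib.infer_dtype` call. The sample values and variable names are made up for illustration, and the combined list only approximates what a raveled object block would contain.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Illustrative values only; not taken from the original report.
timestamps = [pd.Timestamp("2014-01-01 12:00"), pd.Timestamp("2014-01-02 12:00")]
strings = ["r", "g"]

# Each column on its own has a single, well-defined inferred type.
kind_timestamps = infer_dtype(timestamps)
kind_strings = infer_dtype(strings)

# Flattening both columns together, roughly what happens when they share
# one object block, leaves values of more than one kind in the same array.
kind_combined = infer_dtype(timestamps + strings)

print(kind_timestamps, kind_strings, kind_combined)
```

The point is that the problem only appears once the two columns are inspected together; inspected separately, each one has an inferred type that the writer knows how to handle.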

As a fix, it seems that each object column should be handled separately, or at least grouped by the inferred type. I haven't committed to pandas before, or dug this deeply into this section of code, so I'm not sure of the best way to fix this and what other implications there may be, but I'd be happy to help however I can.
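The "group by inferred type" idea could be sketched roughly as follows. This is a pure-Python illustration of the grouping logic only, not the pandas implementation: `infer_kind` and `group_object_columns` are hypothetical helpers, and `infer_kind` is a crude stand-in for pandas' own type inference.

```python
from collections import defaultdict
from datetime import datetime


def infer_kind(values):
    """Hypothetical stand-in for type inference: the single Python type
    name shared by all values, or 'mixed' if there is no single type."""
    kinds = {type(v).__name__ for v in values}
    return kinds.pop() if len(kinds) == 1 else "mixed"


def group_object_columns(columns):
    """Group column names by the inferred kind of their values, so that
    each group could be written as its own homogeneous block."""
    groups = defaultdict(list)
    for name, values in columns.items():
        groups[infer_kind(values)].append(name)
    return dict(groups)


# Illustrative data: one datetime column and two string columns.
columns = {
    "Timestamps": [datetime(2014, 1, 1, 12), datetime(2014, 1, 2, 12)],
    "strings": ["r", "g"],
    "more_strings": ["x", "y"],
}
groups = group_object_columns(columns)
print(groups)
```

Grouping by kind instead of splitting every column keeps columns of the same kind together in one block, so the inference runs on homogeneous data without giving up block storage entirely.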

Contributor

jreback commented Sep 16, 2014

you can work around this by setting the non-string object columns as data_columns (that will segregate them up front)

if these are truly UTC tz-aware then to be honest you should simply make them datetime64[ns] columns and the problem also goes away
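A rough sketch of the two workarounds mentioned above. The helper names `non_string_object_columns` and `write_table` are hypothetical, the sample frame is made up, and `write_table` is only defined, not called, because `to_hdf` needs the optional PyTables dependency. Whether `data_columns` avoids the error may depend on the pandas version and on what the column actually holds.

```python
import pandas as pd
from pandas.api.types import infer_dtype, is_object_dtype


def non_string_object_columns(df):
    """Hypothetical helper: object-dtype columns not inferred as strings."""
    return [
        name
        for name in df.columns
        if is_object_dtype(df[name]) and infer_dtype(df[name]) != "string"
    ]


def write_table(df, path, key="data"):
    """Workaround 1: pass the non-string object columns as data_columns so
    they are segregated up front. Requires PyTables to be installed."""
    df.to_hdf(path, key=key, format="table",
              data_columns=non_string_object_columns(df))


# Illustrative frame with tz-aware timestamps held in an object column.
df = pd.DataFrame({
    "ints": [1, 2, 3],
    "Timestamps": pd.Series(
        [pd.Timestamp("2014-01-01 12:00", tz="UTC"),
         pd.Timestamp("2014-01-02 12:00", tz="UTC"),
         pd.Timestamp("2014-01-03 12:00", tz="UTC")],
        dtype=object,
    ),
    "strings": ["r", "g", "b"],
})
segregated = non_string_object_columns(df)

# Workaround 2: convert the object column to a real datetime64 column.
converted = df.assign(Timestamps=pd.to_datetime(df["Timestamps"], utc=True))
print(segregated, converted["Timestamps"].dtype)
```

The conversion only applies when every value in the column shares one timezone; otherwise the `data_columns` route is the one to try.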

you are right though, see here: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L1734 for the inference on an object column (note that the values could be a period type, tz-aware datetimes, or actual strings)

so the object block handling needs to be fixed up a bit - by further splitting of object blocks if necessary

pull requests welcome!

Contributor

jreback commented Sep 16, 2014

see #7796 as well (for the period support)

jreback added the Bug, Enhancement, IO HDF5, and Dtype Conversions labels Sep 16, 2014
jreback added this to the 0.16 milestone Sep 16, 2014
Contributor

jreback commented Sep 16, 2014

FYI the columns are normally not handled separately but stored as a single block, as that's much more efficient (this can be controlled by specifying data_columns though)

Author

kvncp commented Sep 16, 2014

Thanks for the fast response. They aren't actually UTC in my application, that was just the easiest way to create a simple example. Setting as a data_column will work though, thanks for the tip.

If I get a bit of time I'll look into a fix.

jreback modified the milestones: 0.16, 0.15.1 Oct 7, 2014
jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@TomAugspurger
Contributor

I can't reproduce the original example. index is not defined.

This simple example seems to work

In [36]: df = pd.DataFrame({"A": [1, 2], 'B': ['a', 'b'], 'C': pd.to_datetime(['2017', '2018']).tz_localize("UTC")})

In [37]: df.to_hdf('test.h5', 'data', format='table')

Let me know if that isn't representative of the original.

TomAugspurger modified the milestones: Contributions Welcome, No action Jul 6, 2018

petiop commented Nov 7, 2019

I am having the same issue where the use-case is storing multidimensional and variable-shape np arrays (unflattened images). I store in 'table' format and I tried adding the column to data_columns. Still getting the same error:

TypeError: Cannot serialize the column [image] because
its data contents are [mixed] object dtype

Are there other workarounds that I can try? Also, is this issue still open to contributions (beefing up the object-block handling to work with types other than strings)?

Contributor

jreback commented Nov 8, 2019

there is no support for non-scalar types at all


petiop commented Nov 8, 2019

I don’t mind converting them to bytes and saving that, but that too is not supported atm

Contributor

jreback commented Nov 8, 2019

@petiop you are welcome to submit a PR for this but it’s non-trivial

i would use parquet for this
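For the variable-shape array case, one possible approach (an assumption, not something confirmed in this thread) is to encode each array as bytes before storing it and decode it after loading. The sketch below uses `np.save` and `np.load` on in-memory buffers so that shape and dtype travel with the data. The helpers `array_to_bytes`, `bytes_to_array`, and `write_parquet` are hypothetical names, and `write_parquet` is only defined, not called, because `DataFrame.to_parquet` needs an optional engine such as pyarrow.

```python
import io

import numpy as np
import pandas as pd


def array_to_bytes(arr):
    """Serialize an array, including its shape and dtype, to bytes."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()


def bytes_to_array(blob):
    """Inverse of array_to_bytes."""
    return np.load(io.BytesIO(blob), allow_pickle=False)


def write_parquet(df, path):
    """Write the frame to Parquet; requires an optional Parquet engine."""
    df.to_parquet(path)


# Illustrative images with different shapes.
images = [np.zeros((2, 3), dtype=np.uint8), np.ones((4, 4, 3), dtype=np.uint8)]
df = pd.DataFrame({"image": [array_to_bytes(a) for a in images]})

# Round-trip check: decode every stored blob back into an array.
restored = [bytes_to_array(b) for b in df["image"]]
print([a.shape for a in restored])
```

Every cell is then a plain scalar (a bytes object), which sidesteps the lack of non-scalar support, at the cost of an explicit decode step on read.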
