HDFStore.append fails when appending dataframe with empty string column for which min_itemsize < 8 #12242

Closed
amcpherson opened this issue Feb 6, 2016 · 4 comments

@amcpherson
Contributor

commented Feb 6, 2016

Reproduced as follows:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: store = pd.HDFStore('teststore.h5', 'w')

In [4]: chunk = pd.DataFrame({'V1':['a','b','c','d','e'], 'data':np.arange(5)})

In [5]: store.append('df', chunk, min_itemsize={'V1': 4})

In [6]: chunk = pd.DataFrame({'V1':['', ''], 'data': [3, 5]})

In [7]: store.append('df', chunk, min_itemsize={'V1': 4})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-c9bafa18ead0> in <module>()
----> 1 store.append('df', chunk, min_itemsize={'V1': 4})

/Users/amcpherson/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in append(self, key, value, format, append, columns, dropna, **kwargs)
    905         kwargs = self._validate_format(format, kwargs)
    906         self._write_to_group(key, value, append=append, dropna=dropna,
--> 907                              **kwargs)
    908
    909     def append_to_multiple(self, d, value, selector, data_columns=None,

/Users/amcpherson/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
   1250
   1251         # write the object
-> 1252         s.write(obj=value, append=append, complib=complib, **kwargs)
   1253
   1254         if s.is_table and index:

/Users/amcpherson/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwargs)
   3755         self.create_axes(axes=axes, obj=obj, validate=append,
   3756                          min_itemsize=min_itemsize,
-> 3757                          **kwargs)
   3758
   3759         for a in self.axes:

/Users/amcpherson/Anaconda/lib/python2.7/site-packages/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   3432                 self.values_axes.append(col)
   3433             except (NotImplementedError, ValueError, TypeError) as e:
-> 3434                 raise e
   3435             except Exception as detail:
   3436                 raise Exception(

ValueError: Trying to store a string with len [8] in [V1] column but
this column has a limit of [4]!
Consider using min_itemsize to preset the sizes on these columns

The error is raised only when every value in the appended column is an empty string. A workaround is to set min_itemsize to 8 or higher.
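The workaround can be sketched as a small helper that raises every requested width to at least 8 before it is passed to `store.append`. The helper name `clamp_min_itemsize` and its `floor` parameter are hypothetical and not part of pandas; this is only an illustration of the workaround described above, and it assumes the same clamped sizes are used on every append, including the one that creates the table.

```python
def clamp_min_itemsize(min_itemsize, floor=8):
    """Return a copy of min_itemsize with each width raised to at least `floor`.

    Hypothetical helper, not a pandas API: it only prepares the dict that
    would be passed as the min_itemsize argument of HDFStore.append.
    """
    return {col: max(width, floor) for col, width in min_itemsize.items()}


sizes = clamp_min_itemsize({'V1': 4})
unchanged = clamp_min_itemsize({'V1': 12})

# Intended use, mirroring the reproduction above (not executed here):
# store.append('df', chunk, min_itemsize=sizes)
```

Widths already at or above the floor are left as they are, so the helper only affects columns that would hit the reported error.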

Version information:

In [10]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.2
setuptools: 19.6.2
Cython: 0.23.4
numpy: 1.10.2
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: None
@TomAugspurger

Contributor

commented Feb 6, 2016

So it looks like this comes from np.asarray doing something strange with length 0 fixed-width strings:

ipdb> data
array([[b'', b'']], dtype=object)
ipdb> n
> /Users/tom.augspurger/Envs/dev/lib/python3.4/site-packages/pandas/pandas/io/pytables.py(4453)_convert_string_array()
   4451
   4452     data = np.asarray(data, dtype="S%d" % itemsize)  # itemsize is 0
-> 4453     return data
   4454
   4455

ipdb> data
array([[b'', b'']],
      dtype='|S8')   # note the length 8

The array constructor converts to length 1 strings though:

In [12]: np.array([b'', b''], dtype='S0')
Out[12]:
array([b'', b''],
      dtype='|S1')   # length 1

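Looks like a numpy bug maybe? Either way, we can work around by changing this line (https://github.com/pydata/pandas/blob/6693a723aa2a8a53a071860a43804c173a7f92c6/pandas/io/pytables.py#L4450) to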

        itemsize = max(1, lib.max_len_string_array(com._ensure_object(data.ravel())))
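The idea behind the `max(1, ...)` guard can be illustrated outside pandas with plain NumPy. This is a minimal sketch under stated assumptions: `to_fixed_width_bytes` is a hypothetical stand-in for the conversion step, not the actual `_convert_string_array` implementation, and it takes bytes values like those in the ipdb session. It never requests a zero-width `S0` dtype, so the result does not depend on how NumPy handles that case.

```python
import numpy as np


def to_fixed_width_bytes(data, floor=1):
    """Convert bytes values to a fixed-width 'S<n>' array with n >= floor.

    Hypothetical helper for illustration only; it is not the pandas code.
    """
    obj = np.asarray(data, dtype=object)
    # Longest value in the array; 0 if every value is empty.
    longest = max((len(s) for s in obj.ravel()), default=0)
    # The guard: never ask NumPy for a zero-width string dtype.
    itemsize = max(floor, longest)
    return np.asarray(obj, dtype="S%d" % itemsize)


empty = to_fixed_width_bytes([[b"", b""]])   # all-empty column, as in the report
mixed = to_fixed_width_bytes([b"abc", b""])  # width follows the longest value
```

With the guard in place the all-empty case gets an explicit width of 1, which stays below any `min_itemsize` limit a caller is likely to set.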

@amcpherson thanks for the report. Are you interested in submitting a Pull Request with that fix and a test?

@TomAugspurger modified the milestones: 0.18.0, Next Major Release on Feb 6, 2016

@amcpherson

Contributor Author

commented Feb 6, 2016

I can do a pull request, but I can't guarantee how soon.

@finkelm


commented Apr 30, 2017

This still hasn't been factored in.

@jreback closed this on Apr 30, 2017

@jreback reopened this on Apr 30, 2017

@jreback

Contributor

commented Apr 30, 2017

> This still hasn't been factored in.

hence the open issue
