Dimension mismatch when reading categorical HDF5 column with weird NaN encoding #21741

Open
sschuldenzucker opened this issue Jul 5, 2018 · 5 comments
Labels
Bug · IO HDF5 (read_hdf, HDFStore) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@sschuldenzucker

Hi guys, I got a crash reading an HDF5 file with the message "ValueError: operands could not be broadcast together with shapes (171285,) (15,) (171285,)" (full stack trace below).

I think the problem arises whenever a categorical column is read via pytables where NaN is stored as one of the categories (rather than as the special code -1). I'm not sure in which situations pytables generates one representation of NaN or the other (any explanation would be appreciated).

In my case, 171285 is the number of rows in my data and 15 is the number of categories, including NaN.

The offending line is:

codes[codes != -1] -= mask.astype(int).cumsum().values

I'm not sure what's going on here or why this code is needed (any explanation would be appreciated). But indeed, codes has length #rows while mask has length #categories (mask[i] is True iff categories[i] is NaN), so broadcasting them against each other can't be right. It looks like we actually want a lookup by code value, equivalent to something like:

non_nan_codes = codes[codes != -1]
delta = mask.astype(int).cumsum().values
for i in range(len(non_nan_codes)):
    non_nan_codes[i] -= delta[non_nan_codes[i]]
    # The current line amounts to subtracting delta[i] (by position),
    # but what we want is delta[non_nan_codes[i]] (looked up by code value).

I don't know how to do this fast in raw numpy, though.
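(For what it's worth, here is a vectorized sketch of that remapping in plain numpy. This is illustrative only, not the actual pandas internals; the example codes and mask are made up.)

import numpy as np

codes = np.array([0, 1, 2, 3, -1, 2])          # hypothetical codes into the old categories
mask = np.array([False, True, False, False])   # True where the old category is NaN

delta = mask.astype(int).cumsum()              # NaN categories seen up to each position
non_nan = codes != -1
points_at_nan = non_nan & mask[np.clip(codes, 0, None)]

new_codes = codes.copy()
new_codes[non_nan] -= delta[codes[non_nan]]    # index delta by code value, not by row position
new_codes[points_at_nan] = -1                  # codes that referenced a NaN category become missing
print(new_codes)                               # [ 0 -1  1  2 -1  1]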


Stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-0fe86b15df04> in <module>()
----> 1 df = pd.read_hdf(hdfs[5])

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read_hdf(path_or_buf, key, **kwargs)
    356                                      'contains multiple datasets.')
    357             key = candidate_only_group._v_pathname
--> 358         return store.select(key, auto_close=auto_close, **kwargs)
    359     except:
    360         # if there is an error, close the store

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    720                            chunksize=chunksize, auto_close=auto_close)
    721 
--> 722         return it.get_result()
    723 
    724     def select_as_coordinates(

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in get_result(self, coordinates)
   1426 
   1427         # directly return the result
-> 1428         results = self.func(self.start, self.stop, where)
   1429         self.close()
   1430         return results

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in func(_start, _stop, _where)
    713             return s.read(start=_start, stop=_stop,
    714                           where=_where,
--> 715                           columns=columns, **kwargs)
    716 
    717         # create the iterator

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read(self, where, columns, **kwargs)
   4101     def read(self, where=None, columns=None, **kwargs):
   4102 
-> 4103         if not self.read_axes(where=where, **kwargs):
   4104             return None
   4105 

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read_axes(self, where, **kwargs)
   3306         for a in self.axes:
   3307             a.set_info(self.info)
-> 3308             a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
   3309 
   3310         return True

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in convert(self, values, nan_rep, encoding)
   2112                 if mask.any():
   2113                     categories = categories[~mask]
-> 2114                     codes[codes != -1] -= mask.astype(int).cumsum().values
   2115 
   2116                 self.data = Categorical.from_codes(codes,

ValueError: operands could not be broadcast together with shapes (171285,) (15,) (171285,)
@sschuldenzucker (Author)

FYI, as for why pandas writes such an HDF5 file in the first place, I think I figured it out:

In my code, I was doing some data conversions that ultimately amounted to something like:

s = pd.Series(['foo', np.nan, 'bar', 'foo']).astype(str).astype('category')

It turns out this creates a regular category 'nan' (the string, not NaN; mind the quotes!), which is then written to HDF. 'nan' also happens to be the default nan_rep used by pytables, so when the categories in the HDF file are read back in, the string is converted into NaN (without quotes). We now end up with a NaN category, which triggers the offending code.
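A quick way to see the first step (a sketch; the printed categories are what I would expect, not copied from a run):

import numpy as np
import pandas as pd

s = pd.Series(['foo', np.nan, 'bar', 'foo']).astype(str).astype('category')
print(s.cat.categories)
# Expect something like: Index(['bar', 'foo', 'nan'], dtype='object')
# i.e. 'nan' is an ordinary string category, not a missing value.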

@jreback (Contributor) commented Jul 5, 2018

pls show a reproducible example

@sschuldenzucker (Author)

MWE:

import pandas as pd
import numpy as np

# Note: We need a repeated entry here. It happens to not crash if #rows = #categories. 
# (coincidentally; it's still doing the wrong thing of course.)
s = pd.Series(['foo', 'foo', 'nan']).astype('category')
# Alternatively:
# s = pd.Series(['foo', 'foo', np.nan]).astype(str).astype('category')

df = pd.DataFrame({'A': s})
df.to_hdf('test.h5', key='data', format='table')
df_back = pd.read_hdf('test.h5', key='data') # This crashes

I'm on pandas 0.20.1 btw, but I don't think the issue is fixed in more recent versions.

@jreback (Contributor) commented Jul 5, 2018

you need to use nan_rep: http://pandas.pydata.org/pandas-docs/stable/io.html#string-columns

I guess you could detect this issue, though; a PR to fix it is welcome!
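
(Concretely, the suggested workaround is to write with a nan_rep that cannot collide with real string values. A sketch using the MWE's df; the sentinel '_missing_' is arbitrary and assumed not to occur in the data:)

df.to_hdf('test.h5', key='data', format='table', nan_rep='_missing_')
df_back = pd.read_hdf('test.h5', key='data')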

jreback added the Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), IO HDF5 (read_hdf, HDFStore), and Difficulty Intermediate labels on Jul 5, 2018
jreback added this to the Next Major Release milestone on Jul 5, 2018
@joseortiz3 (Contributor) commented Jan 29, 2019

Ran into this problem today, in exactly the same scenario: the string 'nan' ends up as a value in the column due to using df[col_name] = df[col_name].astype(str).astype('category').

>>> df_mixed = pd.DataFrame({'A': np.random.randn(8),
... 'B': np.random.randn(8),
... 'C': np.array(np.random.randn(8), dtype='float32'),
... 'string': 'string',
... 'int': 1,
... 'bool': True,
... 'datetime64': pd.Timestamp('20010102')},
... index=list(range(8)))
>>> df_mixed
          A         B         C  string  int  bool datetime64
0 -0.833346 -0.598527  1.013500  string    1  True 2001-01-02
1 -0.823901 -0.118210  0.793684  string    1  True 2001-01-02
2  0.725413 -0.867698  1.478408  string    1  True 2001-01-02
3 -0.246141  0.786121  1.483667  string    1  True 2001-01-02
4  1.760388  1.675248  1.169727  string    1  True 2001-01-02
5 -0.000398  0.039454  1.514879  string    1  True 2001-01-02
6 -2.815542 -0.539987 -1.873862  string    1  True 2001-01-02
7  0.791794 -0.031423  1.250562  string    1  True 2001-01-02
>>> df_mixed.loc[df_mixed.index[3:5],
... ['A', 'B', 'string', 'datetime64']] = np.nan
>>> df_mixed
          A         B         C  string  int  bool datetime64
0 -0.833346 -0.598527  1.013500  string    1  True 2001-01-02
1 -0.823901 -0.118210  0.793684  string    1  True 2001-01-02
2  0.725413 -0.867698  1.478408  string    1  True 2001-01-02
3       NaN       NaN  1.483667     NaN    1  True        NaT
4       NaN       NaN  1.169727     NaN    1  True        NaT
5 -0.000398  0.039454  1.514879  string    1  True 2001-01-02
6 -2.815542 -0.539987 -1.873862  string    1  True 2001-01-02
7  0.791794 -0.031423  1.250562  string    1  True 2001-01-02
>>> df_mixed['string'].iloc[4]
nan
>>> type(df_mixed['string'].iloc[4])
<class 'float'>
>>> df_mixed['string'] = df_mixed['string'].astype(str).astype('category')
>>> df_mixed
          A         B         C  string  int  bool datetime64
0 -0.833346 -0.598527  1.013500  string    1  True 2001-01-02
1 -0.823901 -0.118210  0.793684  string    1  True 2001-01-02
2  0.725413 -0.867698  1.478408  string    1  True 2001-01-02
3       NaN       NaN  1.483667     nan    1  True        NaT
4       NaN       NaN  1.169727     nan    1  True        NaT
5 -0.000398  0.039454  1.514879  string    1  True 2001-01-02
6 -2.815542 -0.539987 -1.873862  string    1  True 2001-01-02
7  0.791794 -0.031423  1.250562  string    1  True 2001-01-02
>>> df_mixed['string'].iloc[4]
'nan'
>>> df_mixed.to_hdf(dir+'hdf.hdf',key='df_mixed',format='table')
>>> df_mixed2 = pd.read_hdf(dir+'hdf.hdf',key='df_mixed',format='table')

results in the error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 394, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 741, in select
    return it.get_result()
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 1483, in get_result
    results = self.func(self.start, self.stop, where)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 734, in func
    columns=columns)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 4180, in read
    if not self.read_axes(where=where, **kwargs):
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 3383, in read_axes
    errors=self.errors)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 2177, in convert
    codes[codes != -1] -= mask.astype(int).cumsum().values
ValueError: operands could not be broadcast together with shapes (8,) (2,) (8,) 

How to fix it temporarily (until a PR lands): pass nan_rep=np.nan when writing, not nan_rep='nan' (this parameter's documentation is also lacking):

>>> df_mixed.to_hdf(dir+'hdf.hdf',key='df_mixed',format='table',nan_rep=np.nan)
>>> df_mixed2 = pd.read_hdf(dir+'hdf.hdf',key='df_mixed',format='table')

mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022