Dimension mismatch when reading categorical HDF5 column with weird NaN encoding #21741

Open
sschuldenzucker opened this issue Jul 5, 2018 · 5 comments
Labels
Bug · IO HDF5 (read_hdf, HDFStore) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@sschuldenzucker

Hi guys, I got a crash reading an HDF5 file with the message "ValueError: operands could not be broadcast together with shapes (171285,) (15,) (171285,)" (full stack trace below).

I think the problem arises whenever a categorical column is read via pytables where NaN is stored as one of the categories (rather than as the special code -1). I'm not sure in which situations pytables generates one representation of NaN or the other (any explanation would be appreciated).

In my case, 171285 is the number of rows in my data and 15 is the number of categories, including NaN.

The offending line is:

codes[codes != -1] -= mask.astype(int).cumsum().values

I'm not sure what's going on here or why this code is needed (any explanation would be appreciated). But indeed, codes has length #rows while mask has length #categories (mask[i] is True iff categories[i] is NaN), so broadcasting them against each other can't be right. It looks like we actually want a lookup by code value, equivalent to something like:

non_nan_codes = codes[codes != -1]
delta = mask.astype(int).cumsum().values
for i in range(len(non_nan_codes)):
    non_nan_codes[i] -= delta[non_nan_codes[i]]
    # The current line amounts to subtracting delta[i] (by position),
    # but what we want is delta[non_nan_codes[i]] (looked up by code value).

I don't know how to do this fast in raw numpy, though.
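(For what it's worth, here is a vectorized sketch of that remapping in plain numpy. This is illustrative only, not the actual pandas internals; the example codes and mask are made up.)

import numpy as np

codes = np.array([0, 1, 2, 3, -1, 2])          # hypothetical codes into the old categories
mask = np.array([False, True, False, False])   # True where the old category is NaN

delta = mask.astype(int).cumsum()              # NaN categories seen up to each position
non_nan = codes != -1
points_at_nan = non_nan & mask[np.clip(codes, 0, None)]

new_codes = codes.copy()
new_codes[non_nan] -= delta[codes[non_nan]]    # index delta by code value, not by row position
new_codes[points_at_nan] = -1                  # codes that referenced a NaN category become missing
print(new_codes)                               # [ 0 -1  1  2 -1  1]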


Stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-0fe86b15df04> in <module>()
----> 1 df = pd.read_hdf(hdfs[5])

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read_hdf(path_or_buf, key, **kwargs)
    356                                      'contains multiple datasets.')
    357             key = candidate_only_group._v_pathname
--> 358         return store.select(key, auto_close=auto_close, **kwargs)
    359     except:
    360         # if there is an error, close the store

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    720                            chunksize=chunksize, auto_close=auto_close)
    721 
--> 722         return it.get_result()
    723 
    724     def select_as_coordinates(

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in get_result(self, coordinates)
   1426 
   1427         # directly return the result
-> 1428         results = self.func(self.start, self.stop, where)
   1429         self.close()
   1430         return results

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in func(_start, _stop, _where)
    713             return s.read(start=_start, stop=_stop,
    714                           where=_where,
--> 715                           columns=columns, **kwargs)
    716 
    717         # create the iterator

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read(self, where, columns, **kwargs)
   4101     def read(self, where=None, columns=None, **kwargs):
   4102 
-> 4103         if not self.read_axes(where=where, **kwargs):
   4104             return None
   4105 

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read_axes(self, where, **kwargs)
   3306         for a in self.axes:
   3307             a.set_info(self.info)
-> 3308             a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
   3309 
   3310         return True

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in convert(self, values, nan_rep, encoding)
   2112                 if mask.any():
   2113                     categories = categories[~mask]
-> 2114                     codes[codes != -1] -= mask.astype(int).cumsum().values
   2115 
   2116                 self.data = Categorical.from_codes(codes,

ValueError: operands could not be broadcast together with shapes (171285,) (15,) (171285,)
@sschuldenzucker (Author)

FYI, as for why pandas writes such an HDF5 file in the first place, I think I figured it out:

In my code, I was doing some data conversions that ultimately amounted to something like:

s = pd.Series(['foo', np.nan, 'bar', 'foo']).astype(str).astype('category')

It turns out this creates a regular category 'nan' (the string, not NaN; mind the quotes!), which is then written to HDF. 'nan' also happens to be the default nan_rep used by pytables, so when the categories in the HDF file are read back in, the string is converted into NaN (without quotes). We now end up with a NaN category, which triggers the offending code.
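A quick way to see the first step (a sketch; the printed categories are what I would expect, not copied from a run):

import numpy as np
import pandas as pd

s = pd.Series(['foo', np.nan, 'bar', 'foo']).astype(str).astype('category')
print(s.cat.categories)
# Expect something like: Index(['bar', 'foo', 'nan'], dtype='object')
# i.e. 'nan' is an ordinary string category, not a missing value.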

@jreback (Contributor) commented Jul 5, 2018

pls show a reproducible example

@sschuldenzucker (Author)

MWE:

import pandas as pd
import numpy as np

# Note: We need a repeated entry here. It happens to not crash if #rows = #categories. 
# (coincidentally; it's still doing the wrong thing of course.)
s = pd.Series(['foo', 'foo', 'nan']).astype('category')
# Alternatively:
# s = pd.Series(['foo', 'foo', np.nan]).astype(str).astype('category')

df = pd.DataFrame({'A': s})
df.to_hdf('test.h5', key='data', format='table')
df_back = pd.read_hdf('test.h5', key='data') # This crashes

I'm on pandas 0.20.1 btw, but I don't think the issue is fixed in more recent versions.

@jreback (Contributor) commented Jul 5, 2018

you need to use nan_rep: http://pandas.pydata.org/pandas-docs/stable/io.html#string-columns

I guess you could detect this issue, though; a PR to fix it is welcome!
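
(Concretely, the suggested workaround is to write with a nan_rep that cannot collide with real string values. A sketch using the MWE's df; the sentinel '_missing_' is arbitrary and assumed not to occur in the data:)

df.to_hdf('test.h5', key='data', format='table', nan_rep='_missing_')
df_back = pd.read_hdf('test.h5', key='data')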

jreback added the Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), IO HDF5 (read_hdf, HDFStore), and Difficulty Intermediate labels on Jul 5, 2018
jreback added this to the Next Major Release milestone on Jul 5, 2018
@joseortiz3 (Contributor) commented Jan 29, 2019

Ran into this problem today, in exactly the same scenario: the string 'nan' ends up as a value in the column due to using df[col_name] = df[col_name].astype(str).astype('category').

>>> df_mixed = pd.DataFrame({'A': np.random.randn(8),
... 'B': np.random.randn(8),
... 'C': np.array(np.random.randn(8), dtype='float32'),
... 'string': 'string',
... 'int': 1,
... 'bool': True,
... 'datetime64': pd.Timestamp('20010102')},
... index=list(range(8)))
>>> df_mixed
          A         B         C  string  int  bool datetime64
0 -0.833346 -0.598527  1.013500  string    1  True 2001-01-02
1 -0.823901 -0.118210  0.793684  string    1  True 2001-01-02
2  0.725413 -0.867698  1.478408  string    1  True 2001-01-02
3 -0.246141  0.786121  1.483667  string    1  True 2001-01-02
4  1.760388  1.675248  1.169727  string    1  True 2001-01-02
5 -0.000398  0.039454  1.514879  string    1  True 2001-01-02
6 -2.815542 -0.539987 -1.873862  string    1  True 2001-01-02
7  0.791794 -0.031423  1.250562  string    1  True 2001-01-02
>>> df_mixed.loc[df_mixed.index[3:5],
... ['A', 'B', 'string', 'datetime64']] = np.nan
>>> df_mixed
          A         B         C  string  int  bool datetime64
0 -0.833346 -0.598527  1.013500  string    1  True 2001-01-02
1 -0.823901 -0.118210  0.793684  string    1  True 2001-01-02
2  0.725413 -0.867698  1.478408  string    1  True 2001-01-02
3       NaN       NaN  1.483667     NaN    1  True        NaT
4       NaN       NaN  1.169727     NaN    1  True        NaT
5 -0.000398  0.039454  1.514879  string    1  True 2001-01-02
6 -2.815542 -0.539987 -1.873862  string    1  True 2001-01-02
7  0.791794 -0.031423  1.250562  string    1  True 2001-01-02
>>> df_mixed['string'].iloc[4]
nan
>>> type(df_mixed['string'].iloc[4])
<class 'float'>
>>> df_mixed['string'] = df_mixed['string'].astype(str).astype('category')
>>> df_mixed
          A         B         C  string  int  bool datetime64
0 -0.833346 -0.598527  1.013500  string    1  True 2001-01-02
1 -0.823901 -0.118210  0.793684  string    1  True 2001-01-02
2  0.725413 -0.867698  1.478408  string    1  True 2001-01-02
3       NaN       NaN  1.483667     nan    1  True        NaT
4       NaN       NaN  1.169727     nan    1  True        NaT
5 -0.000398  0.039454  1.514879  string    1  True 2001-01-02
6 -2.815542 -0.539987 -1.873862  string    1  True 2001-01-02
7  0.791794 -0.031423  1.250562  string    1  True 2001-01-02
>>> df_mixed['string'].iloc[4]
'nan'
>>> df_mixed.to_hdf(dir+'hdf.hdf',key='df_mixed',format='table')
>>> df_mixed2 = pd.read_hdf(dir+'hdf.hdf',key='df_mixed',format='table')

results in the error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 394, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 741, in select
    return it.get_result()
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 1483, in get_result
    results = self.func(self.start, self.stop, where)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 734, in func
    columns=columns)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 4180, in read
    if not self.read_axes(where=where, **kwargs):
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 3383, in read_axes
    errors=self.errors)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 2177, in convert
    codes[codes != -1] -= mask.astype(int).cumsum().values
ValueError: operands could not be broadcast together with shapes (8,) (2,) (8,) 

How to fix it temporarily (until a PR lands): pass nan_rep=np.nan when writing, not nan_rep='nan' (this parameter's documentation is also lacking):

>>> df_mixed.to_hdf(dir+'hdf.hdf',key='df_mixed',format='table',nan_rep=np.nan)
>>> df_mixed2 = pd.read_hdf(dir+'hdf.hdf',key='df_mixed',format='table')

mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022