Dimension mismatch when reading categorical HDF5 column with weird NaN encoding #21741
Comments
FYI, as for the question why pandas creates such a HDF5, I think I figured it out: In my code, I was doing some data conversions that turned out to be ultimately akin to something like:
Turned out that this creates a regular category
pls show a reproducible example
MWE:
import pandas as pd
import numpy as np
# Note: We need a repeated entry here. It happens to not crash if #rows = #categories.
# (coincidentally; it's still doing the wrong thing of course.)
s = pd.Series(['foo', 'foo', 'nan']).astype('category')
# Alternatively:
# s = pd.Series(['foo', 'foo', np.nan]).astype(str).astype('category')
df = pd.DataFrame({'A': s})
df.to_hdf('test.h5', key='data', format='table')
df_back = pd.read_hdf('test.h5', key='data')  # This crashes
I'm on pandas 0.20.1 btw, but I don't think the issue is fixed in more recent versions.
you need to use I guess you could detect this issue though, a PR to fix is welcome!
Ran into this problem today, same exact scenario where the string
>>> df_mixed = pd.DataFrame({'A': np.random.randn(8),
... 'B': np.random.randn(8),
... 'C': np.array(np.random.randn(8), dtype='float32'),
... 'string': 'string',
... 'int': 1,
... 'bool': True,
... 'datetime64': pd.Timestamp('20010102')},
... index=list(range(8)))
>>> df_mixed
A B C string int bool datetime64
0 -0.833346 -0.598527 1.013500 string 1 True 2001-01-02
1 -0.823901 -0.118210 0.793684 string 1 True 2001-01-02
2 0.725413 -0.867698 1.478408 string 1 True 2001-01-02
3 -0.246141 0.786121 1.483667 string 1 True 2001-01-02
4 1.760388 1.675248 1.169727 string 1 True 2001-01-02
5 -0.000398 0.039454 1.514879 string 1 True 2001-01-02
6 -2.815542 -0.539987 -1.873862 string 1 True 2001-01-02
7 0.791794 -0.031423 1.250562 string 1 True 2001-01-02
>>> df_mixed.loc[df_mixed.index[3:5],
... ['A', 'B', 'string', 'datetime64']] = np.nan
>>> df_mixed
A B C string int bool datetime64
0 -0.833346 -0.598527 1.013500 string 1 True 2001-01-02
1 -0.823901 -0.118210 0.793684 string 1 True 2001-01-02
2 0.725413 -0.867698 1.478408 string 1 True 2001-01-02
3 NaN NaN 1.483667 NaN 1 True NaT
4 NaN NaN 1.169727 NaN 1 True NaT
5 -0.000398 0.039454 1.514879 string 1 True 2001-01-02
6 -2.815542 -0.539987 -1.873862 string 1 True 2001-01-02
7 0.791794 -0.031423 1.250562 string 1 True 2001-01-02
>>> df_mixed['string'].iloc[4]
nan
>>> type(df_mixed['string'].iloc[4])
<class 'float'>
>>> df_mixed['string'] = df_mixed['string'].astype(str).astype('category')
>>> df_mixed
A B C string int bool datetime64
0 -0.833346 -0.598527 1.013500 string 1 True 2001-01-02
1 -0.823901 -0.118210 0.793684 string 1 True 2001-01-02
2 0.725413 -0.867698 1.478408 string 1 True 2001-01-02
3 NaN NaN 1.483667 nan 1 True NaT
4 NaN NaN 1.169727 nan 1 True NaT
5 -0.000398 0.039454 1.514879 string 1 True 2001-01-02
6 -2.815542 -0.539987 -1.873862 string 1 True 2001-01-02
7 0.791794 -0.031423 1.250562 string 1 True 2001-01-02
>>> df_mixed['string'].iloc[4]
'nan'
>>> df_mixed.to_hdf(dir+'hdf.hdf',key='df_mixed',format='table')
>>> df_mixed2 = pd.read_hdf(dir+'hdf.hdf',key='df_mixed',format='table')
results in the error
How to [temporarily until a PR] fix it using
>>> df_mixed.to_hdf(dir+'hdf.hdf',key='df_mixed',format='table',nan_rep=np.nan)
>>> df_mixed2 = pd.read_hdf(dir+'hdf.hdf',key='df_mixed',format='table')
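Another way to sidestep the problem before writing, as a rough sketch: drop the literal 'nan' category so that the affected rows become real missing values (code -1) instead of a category of their own. This assumes the column is already categorical and the placeholder is exactly the string 'nan'. It only shows the in-memory cleanup and does not touch the HDF5 file, so whether it avoids the crash on a given pandas version would still need checking.

```python
import pandas as pd

# A column where the missing value ended up as the literal string 'nan',
# as in the MWE earlier in this thread.
s = pd.Series(['foo', 'foo', 'nan']).astype('category')

# Remove the literal 'nan' category. Rows that used it become real
# missing values, which a categorical encodes with the code -1.
cleaned = s.cat.remove_categories(['nan'])

print(list(cleaned.cat.categories))
print(cleaned.cat.codes.tolist())
```

The cleaned column can then be put back into the frame (for example `df['A'] = cleaned`) before calling `to_hdf`.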
Hi guys, I got a crash reading an HDF5 file with a message "ValueError: operands could not be broadcast together with shapes (171285,) (15,) (171285,)" (stack trace below).
I think the problem arises whenever reading a categorical column via pytables where NaN is stored as one of the categories (rather than a special code -1). I'm not sure in which situations pytables generates one or the other representation of NaN. (any explanation is appreciated)
In my case, 171285 is the number of rows in my data and 15 is the number of categories including NaN.
The offending line is:
pandas/pandas/io/pytables.py
Line 2224 in 1070976
I'm not sure what's going on here or why we need this code. (any explanations are appreciated) But indeed, codes has size # rows and mask has size # categories. (mask[i] is True iff categories[i] is NaN) So this seems wrong. Looks like we'd rather want some kind of join equivalent to
I don't know how to do this fast in raw numpy, though.
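For what it's worth, the per-row/per-category mismatch can be handled in plain NumPy by indexing the per-category arrays with the codes. This turns a length-#categories array into a length-#rows one. Below is a minimal sketch of that idea, not the actual pandas code or its eventual fix. The function name `drop_nan_categories` and its arguments are made up for illustration, and it assumes -1 already means "missing" in the codes.

```python
import numpy as np

def drop_nan_categories(codes, category_is_nan):
    """Hypothetical helper: recode `codes` as if the NaN categories were removed.

    codes           -- integer array, one entry per row (-1 means missing)
    category_is_nan -- boolean array, one entry per category
    """
    codes = np.asarray(codes).copy()
    category_is_nan = np.asarray(category_is_nan, dtype=bool)

    # shift[i] = number of NaN categories at positions <= i, i.e. how far
    # category i moves down once the NaN categories are dropped.
    shift = np.cumsum(category_is_nan)

    # Only rows with a real category can be used as indices.
    valid = codes >= 0
    old = codes[valid]

    # Index the per-category arrays by the per-row codes: both results
    # have one entry per valid row, so the shapes agree.
    new = old - shift[old]
    new[category_is_nan[old]] = -1  # rows that pointed at a NaN category

    codes[valid] = new
    return codes

# Categories: ['a', NaN, 'b']; rows use codes 0, 0, 1, 2 and one missing value.
result = drop_nan_categories([0, 0, 1, 2, -1], [False, True, False])
print(result.tolist())
```

Everything here is a fancy-indexing lookup or an elementwise operation, so it stays vectorised even when # rows is much larger than # categories.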
Stack trace: