BUG: Categorical data fails to load from hdf when all columns are NaN #18413

Closed
ssche opened this Issue Nov 21, 2017 · 6 comments

ssche commented Nov 21, 2017

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', np.nan], 'b': [np.nan, np.nan, np.nan, np.nan]})
df['a'] = df.a.astype('category')
df['b'] = df.b.astype('category')

df.to_hdf('foo.h5', 'bar', format='table')
pd.read_hdf('foo.h5', 'bar')

Problem description

While storing an hdf file with categorical data containing np.nans works fine, loading the file back into a DataFrame raises an exception.

  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 372, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 742, in select
    return it.get_result()
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 1449, in get_result
    results = self.func(self.start, self.stop, where)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 735, in func
    columns=columns, **kwargs)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 4124, in read
    if not self.read_axes(where=where, **kwargs):
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 3329, in read_axes
    a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 2133, in convert
    if mask.any():
AttributeError: 'bool' object has no attribute 'any'

This exception is related to storing a column that contains np.nan values only (column a stores and loads fine on its own).

The problem may already lie in how the metadata for column b (the np.nan-only column) is stored by df.to_hdf(), since that metadata is None on load. The relevant code for pd.read_hdf() in DataCol.convert:

elif meta == u('category'):

    # we have a categorical
    categories = self.metadata
    codes = self.data.ravel()

    # if we have stored a NaN in the categories
    # then strip it; in theory we could have BOTH
    # -1s in the codes and nulls :<
    mask = isnull(categories)
    if mask.any():
        categories = categories[~mask]
        codes[codes != -1] -= mask.astype(int).cumsum().values

Here self.metadata, and therefore categories, is None, so mask = isnull(None) is a plain Python bool (True, since None counts as missing) rather than a boolean array, and mask.any() fails.
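A minimal sketch of that failure mode, using the public pd.isna (the name the later patch in this thread uses) rather than the internal pytables code path. A scalar None yields a plain bool with no .any(), while an array input yields a boolean array:

```python
import numpy as np
import pandas as pd

# Scalar input: pd.isna returns a plain Python bool, which has no .any().
scalar_mask = pd.isna(None)

# Array input: pd.isna returns a boolean ndarray, so .any() works.
array_mask = pd.isna(np.array(["a", None], dtype=object))

print(type(scalar_mask).__name__, hasattr(scalar_mask, "any"))
print(type(array_mask).__name__, bool(array_mask.any()))
```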

Expected Output

No exception; dataframe loads as it would without categorical data.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.16.0
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.9.5
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None

ssche commented Nov 26, 2017

Has anyone had a chance to look at this? Can it be reproduced? If so, is there a workaround or some suggestions?

jreback commented Nov 26, 2017

looks like a bug. The issue is the categories are an empty array; when this is stored, pytables cannot write a zero-len array, so on readback the categories are None.

In [2]: pd.read_hdf('foo.h5', 'bar')
Out[2]: 
     a    b
0    a  NaN
1    b  NaN
2    c  NaN
3  NaN  NaN

In [3]: pd.read_hdf('foo.h5', 'bar').dtypes
Out[3]: 
a    category
b    category
dtype: object

patch

diff --git a/pandas/io/pytables.py b/pandas/io/pytables.py
index 2a66aea..2ae3765 100644
--- a/pandas/io/pytables.py
+++ b/pandas/io/pytables.py
@@ -2137,10 +2137,13 @@ class DataCol(IndexCol):
                 # if we have stored a NaN in the categories
                 # then strip it; in theory we could have BOTH
                 # -1s in the codes and nulls :<
-                mask = isna(categories)
-                if mask.any():
-                    categories = categories[~mask]
-                    codes[codes != -1] -= mask.astype(int).cumsum().values
+                if categories is None:
+                    categories = []
+                else:
+                    mask = isna(categories)
+                    if mask.any():
+                        categories = categories[~mask]
+                        codes[codes != -1] -= mask.astype(int).cumsum().values
 
                 self.data = Categorical.from_codes(codes,
                                                    categories=categories,

If you can submit a PR, that would be great.
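
The empty-categories condition described in this comment can be checked without touching HDF5. A small sketch (variable names are illustrative): an all-NaN column cast to category ends up with zero categories and a code of -1 for every row.

```python
import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan] * 4).astype("category")

# No non-null values means there are no categories at all...
print(len(all_nan.cat.categories))

# ...and every row carries the missing-value code -1.
print(all_nan.cat.codes.tolist())
```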

@jreback jreback added this to the Next Major Release milestone Nov 26, 2017

@jreback jreback changed the title from Categorical data fails to load from hdf when all columns are NaN to BUG: Categorical data fails to load from hdf when all columns are NaN Nov 26, 2017

ssche added a commit to ssche/pandas that referenced this issue Dec 5, 2017

Fixed pandas-dev#18413, but test case not passing
* Handle all-NaN columns differently when building metadata for categorical axes on saving hdf5 file
* Categorical axes fail test case comparison due to type difference (even though there isn't a visible type difference)

ssche commented Dec 5, 2017

Ok, I tried, but am struggling with a new test case which is not passing. It seems to be related to being able to store NaN-only categorical columns now (only fails when including column 'b' in the test case test_categorical_nan_only_columns). I can't see where the axes are different... Any help would be appreciated.

______________________________________________ TestHDFStore.test_categorical_nan_only_columns ______________________________________________

self = <pandas.tests.io.test_pytables.TestHDFStore object at 0x7f844c87ff90>

    def test_categorical_nan_only_columns(self):
        # GH18413
        # Check that read_hdf with categorical columns with NaN-only values can
        # be read back.
        df = pd.DataFrame({
            'a': ['a', 'b', 'c', np.nan],
            'b': [np.nan, np.nan, np.nan, np.nan],
            'c': [1, 2, 3, 4]
        })
        df['a'] = df.a.astype('category')
        df['b'] = df.b.astype('category')
        expected = df.copy()
        with ensure_clean_path(self.path) as path:
            df.to_hdf(path, 'df', format='table', data_columns=True)
            result = read_hdf(path, 'df')
            print 'result', result.dtypes
            print 'expected', expected.dtypes
>           tm.assert_frame_equal(result, expected)

pandas/tests/io/test_pytables.py:4947: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/util/testing.py:1390: in assert_frame_equal
    obj='DataFrame.iloc[:, {idx}]'.format(idx=i))
pandas/util/testing.py:1235: in assert_series_equal
    assert_attr_equal('dtype', left, right)
pandas/util/testing.py:1001: in assert_attr_equal
    raise_assert_detail(obj, msg, left_attr, right_attr)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = 'Attributes', message = 'Attribute "dtype" are different', left = 'CategoricalDtype(categories=[], ordered=False)'
right = 'CategoricalDtype(categories=[], ordered=False)', diff = None

    def raise_assert_detail(obj, message, left, right, diff=None):
        if isinstance(left, np.ndarray):
            left = pprint_thing(left)
        elif is_categorical_dtype(left):
            left = repr(left)
        if isinstance(right, np.ndarray):
            right = pprint_thing(right)
        elif is_categorical_dtype(right):
            right = repr(right)
    
        msg = """{obj} are different
    
    {message}
    [left]:  {left}
    [right]: {right}""".format(obj=obj, message=message, left=left, right=right)
    
        if diff is not None:
            msg += "\n[diff]: {diff}".format(diff=diff)
    
>       raise AssertionError(msg)
E       AssertionError: Attributes are different
E       
E       Attribute "dtype" are different
E       [left]:  CategoricalDtype(categories=[], ordered=False)
E       [right]: CategoricalDtype(categories=[], ordered=False)

pandas/util/testing.py:1086: AssertionError

ssche commented Dec 5, 2017

Actually, I found a difference (there may be more).

    def test_categorical_nan_only_columns(self):
        # GH18413
        # Check that read_hdf with categorical columns with NaN-only values can
        # be read back.
        df = pd.DataFrame({
            'a': ['a', 'b', 'c', np.nan],
            'b': [np.nan, np.nan, np.nan, np.nan],
            'c': [1, 2, 3, 4]
        })
        df['a'] = df.a.astype('category')
        df['b'] = df.b.astype('category')
        expected = df.copy()
        with ensure_clean_path(self.path) as path:
            df.to_hdf(path, 'df', format='table')
            result = read_hdf(path, 'df', data_columns=True)
            print 'result', result.b.dtype, type(result.b.dtype)
            print 'expected', expected.b.dtype, type(expected.b.dtype)
            print result.b.dtype == expected.b.dtype
            print result.b.dtype.categories
            print expected.b.dtype.categories
            tm.assert_frame_equal(result, expected, check_dtype=False)

shows

result category <class 'pandas.core.dtypes.dtypes.CategoricalDtype'>
expected category <class 'pandas.core.dtypes.dtypes.CategoricalDtype'>
False
Index([], dtype='object')
Float64Index([], dtype='float64')

Is this expected behaviour? Does it also need to be addressed? If so, a pointer to where to look would be helpful.
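
The False comparison above appears to come down to the dtype of the empty categories index. A sketch that reproduces the mismatch in memory, assuming Categorical.from_codes as the constructor (as in the patch) and no HDF5 file at all:

```python
import numpy as np
import pandas as pd

codes = np.full(4, -1, dtype="int8")  # -1 marks a missing value

# What the test expects: an all-NaN float column cast to category.
expected = pd.Series([np.nan] * 4).astype("category")

# The same codes paired with empty categories of two different dtypes.
object_backed = pd.Categorical.from_codes(
    codes, categories=pd.Index([], dtype=object))
float_backed = pd.Categorical.from_codes(
    codes, categories=pd.Index([], dtype=np.float64))

print(object_backed.dtype == expected.dtype)
print(float_backed.dtype == expected.dtype)
```

Both dtypes print identically as a category with no categories, which would explain why the assertion message shows the same text for left and right.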

jreback commented Dec 5, 2017

in the patch use Index([]) rather than [] for the empty categories
I think we just fixed this

ssche added a commit to ssche/pandas that referenced this issue Dec 6, 2017

Fixed test case (pandas-dev#18413)
* Change empty category to `Index([], dtype=np.float64)` instead of `[]`.
* Remove printouts in test case.

ssche commented Dec 6, 2017

I had to make it an Index([], dtype=np.float64) because that's the type that is read back from the hdf5 store.
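
Putting the thread's conclusion together, the None guard could be sketched as a standalone function. The helper name categorical_from_stored is hypothetical and this is not the code that was merged; it only illustrates falling back to an empty float64 index when no categories were stored.

```python
import numpy as np
import pandas as pd

def categorical_from_stored(codes, categories):
    """Hypothetical helper: rebuild a Categorical from stored codes."""
    if categories is None:
        # Nothing was written for an all-NaN column; fall back to an
        # empty float64 index, matching what an all-NaN float column
        # produces when cast to category.
        categories = pd.Index([], dtype=np.float64)
    return pd.Categorical.from_codes(codes, categories=categories)

restored = categorical_from_stored(np.full(4, -1, dtype="int8"), None)
original = pd.Series([np.nan] * 4).astype("category")
print(restored.dtype == original.dtype)
```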

@jreback jreback modified the milestones: Next Major Release, 0.21.1, 0.22.0 Dec 6, 2017
