Irregular errors when reading certain categorical strings from hdf #10366

cottrell · 2015-06-16T16:33:23Z

Reading a categorical Series back from HDF seems to fail when its values include both a string with special (non-ASCII) characters AND the empty string:

    # -*- coding: latin-1 -*-
    import pandas
    import os

    examples = [
            pandas.Series(['EÉ, 17', '', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['EÉ, 17', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['EE, 17', '', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['øü', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['Aøü', '', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['EÉ, 17', 'øü', 'a', 'b', 'c'], dtype='category')
            ]

    def test_hdf(s):
        f = 'testhdf.h5'
        if os.path.exists(f):
            os.remove(f)
        s.to_hdf(f, 'data', format='table')
        return pandas.read_hdf(f, 'data')

    for i, s in enumerate(examples):
        flag = True
        e = ''
        try:
            test_hdf(s)
        except Exception as ex:
            e = ex
            flag = False
        print('%d: %s\t%s\t%s' % (i, 'pass' if flag else 'fail', s.tolist(), e))

Results in:

    0: fail ['EÉ, 17', '', 'a', 'b', 'c']   Categorical categories must be unique
    1: pass ['EÉ, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['øü', 'a', 'b', 'c']
    5: fail ['Aøü', '', 'a', 'b', 'c']  Categorical categories must be unique
    6: pass ['EÉ, 17', 'øü', 'a', 'b', 'c']

Not sure if I am using this incorrectly or if this is actually a corner case.

bashtage · 2015-06-16T20:27:15Z

On Linux/Python 2.7/pandas: 0.16.2-9-g7636c2c (master) I get:

    0: pass ['E\xc3\x89, 17', '', 'a', 'b', 'c']
    1: pass ['E\xc3\x89, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['\xc3\xb8\xc3\xbc', 'a', 'b', 'c']
    5: pass ['A\xc3\xb8\xc3\xbc', '', 'a', 'b', 'c']
    6: pass ['E\xc3\x89, 17', '\xc3\xb8\xc3\xbc', 'a', 'b', 'c']

What does pd.show_versions() output?

cottrell · 2015-06-17T09:05:15Z

@bashtage Sorry, I forgot to mention any version or platform information. I am on OS X and am getting the errors only in Python 3 (Python 2 behaves differently). I was on master but went back to the latest release.

Note the "-*- coding -*-" line that I edited into the example above. It changes the look of the output slightly.

Here is what I just tested now:

    $ python --version
    Python 3.4.3 :: Anaconda 2.2.0 (x86_64)

    $ python -c 'import pandas; pandas.show_versions()'

    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 3.4.3.final.0
    python-bits: 64
    OS: Darwin
    OS-release: 14.3.0
    machine: x86_64
    processor: i386
    byteorder: little
    LC_ALL: None
    LANG: en_CA.UTF-8

    pandas: 0.16.2
    nose: None
    Cython: 0.22
    numpy: 1.9.2
    scipy: 0.15.1
    statsmodels: None
    IPython: 3.1.0
    sphinx: 1.2.3
    patsy: 0.3.0
    dateutil: 2.4.2
    pytz: 2015.4
    bottleneck: None
    tables: 3.1.1
    numexpr: 2.3.1
    matplotlib: 1.4.3
    openpyxl: 1.8.5
    xlrd: 0.9.3
    xlwt: None
    xlsxwriter: 0.6.7
    lxml: 3.4.2
    bs4: 4.3.2
    html5lib: 0.999
    httplib2: None
    apiclient: None
    sqlalchemy: 0.9.9
    pymysql: None
    psycopg2: None

    $ python debug.py
    0: fail ['EÃ\x89, 17', '', 'a', 'b', 'c']  Categorical categories must be unique
    1: pass ['EÃ\x89, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['Ã¸Ã¼', 'a', 'b', 'c']
    5: fail ['AÃ¸Ã¼', '', 'a', 'b', 'c']    Categorical categories must be unique
    6: fail ['EÃ\x89, 17', 'Ã¸Ã¼', 'a', 'b', 'c']  'utf-8' codec can't decode byte 0xc2 in position 6: unexpected end of data

    $ python --version
    Python 2.7.10 :: Continuum Analytics, Inc.

    $ python -c 'import pandas; pandas.show_versions()'

    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 2.7.10.final.0
    python-bits: 64
    OS: Darwin
    OS-release: 14.3.0
    machine: x86_64
    processor: i386
    byteorder: little
    LC_ALL: None
    LANG: en_CA.UTF-8

    pandas: 0.16.2
    nose: 1.3.4
    Cython: 0.21.1
    numpy: 1.9.2
    scipy: 0.15.1
    statsmodels: 0.6.1
    IPython: 2.3.1
    sphinx: 1.2.3
    patsy: 0.3.0
    dateutil: 2.4.2
    pytz: 2015.4
    bottleneck: None
    tables: 3.1.1
    numexpr: 2.3.1
    matplotlib: 1.4.3
    openpyxl: 2.0.2
    xlrd: 0.9.3
    xlwt: None
    xlsxwriter: 0.6.4
    lxml: 3.4.1
    bs4: 4.3.2
    html5lib: None
    httplib2: 0.8
    apiclient: None
    sqlalchemy: 0.9.8
    pymysql: None
    psycopg2: None

    $ python debug.py
    0: pass ['E\xc3\x89, 17', '', 'a', 'b', 'c']
    1: pass ['E\xc3\x89, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['\xc3\xb8\xc3\xbc', 'a', 'b', 'c']
    5: pass ['A\xc3\xb8\xc3\xbc', '', 'a', 'b', 'c']
    6: pass ['E\xc3\x89, 17', '\xc3\xb8\xc3\xbc', 'a', 'b', 'c']
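
The garbled strings in the Python 3 output are what you get when UTF-8 bytes are read as latin-1, which is what the coding line asks for if the file itself is saved as UTF-8. A minimal sketch of that effect follows; the helper name `misread_as_latin1` is made up for illustration. The second half is only a guess about failure 6: it assumes, purely for illustration, that values end up in a fixed-width field sized by character count rather than byte count. Nothing in this thread confirms that this is what pandas does; it only shows that truncating UTF-8 bytes produces the same class of message.

```python
def misread_as_latin1(text):
    """Return what `text` looks like when its UTF-8 bytes are read as latin-1."""
    return text.encode('utf-8').decode('latin-1')

# 'E\u00c9, 17' is "E", E-acute, ", 17"; '\u00f8\u00fc' is o-slash, u-umlaut.
garbled_e = misread_as_latin1('E\u00c9, 17')
garbled_o = misread_as_latin1('\u00f8\u00fc')
print(ascii(garbled_e))
print(ascii(garbled_o))

# Assumption, not confirmed in this thread: a field width taken from the
# character count of the longest value (7 here) but applied to UTF-8 bytes.
width = len(garbled_e)
raw = garbled_o.encode('utf-8')  # 8 bytes, one more than `width`
try:
    raw[:width].decode('utf-8')
    truncation_error = None
except UnicodeDecodeError as exc:
    truncation_error = exc
print(truncation_error)
```

Each non-ASCII character becomes two characters after the misread, so character counts and byte counts diverge even further once the garbled text is encoded again.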

jreback · 2015-06-17T10:35:39Z

So this is a bug.
The categoricals are written as a separate table in a sub-node of the main table. The encoding and nan_rep are known on the table at this point and should be passed through. Since you already have a test, you have the hardest part already done. Please submit a pull request to fix!

cottrell · 2015-06-20T11:15:23Z

@jreback Thanks. I think I can get to this in a week or so.

jreback · 2015-06-20T11:17:29Z

@cottrell gr8!

cottrell · 2015-06-26T14:21:17Z

It seems feeding the encoding and nan_rep through only fixed some of the errors. Basically, it looks like the categorical metadata is being mapped to "nan" for anything with non-standard encodings. Any suggestions on how to check whether the problem is with the writing or the reading of the hdf store? I can open the hdf store using pytables directly and I think the relevant node is '/data/meta/values/meta' where "data" is my top level key.

jreback · 2015-06-26T14:28:12Z

Hmm, you will need to encode the nan string as well when writing.

When reading, everything needs to be decoded first, and then the categorical created.
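
The ordering described here can be sketched without pandas or PyTables. The helper names `write_categories` and `read_categories`, and the `NAN_REP` placeholder value, are hypothetical and are not pandas internals. The only point is that the same encoding and nan_rep are used on both sides, and that decoding happens before anything else on the read path.

```python
NAN_REP = 'nan'     # assumed placeholder for missing values
ENCODING = 'utf-8'  # assumed encoding, shared by writer and reader

def write_categories(categories, encoding=ENCODING, nan_rep=NAN_REP):
    """Replace missing values with nan_rep, then encode every value to bytes."""
    filled = [nan_rep if c is None else c for c in categories]
    return [c.encode(encoding) for c in filled]

def read_categories(stored, encoding=ENCODING, nan_rep=NAN_REP):
    """Decode every value first, and only then map nan_rep back to missing."""
    decoded = [b.decode(encoding) for b in stored]
    return [None if c == nan_rep else c for c in decoded]

# 'E\u00c9, 17' is the string from example 0, written with an escape.
original = ['', 'E\u00c9, 17', 'a', 'b', 'c']
stored = write_categories(original)
restored = read_categories(stored)
print(restored == original, len(set(restored)) == len(restored))
```

With this ordering the empty string and the non-ASCII string stay distinct after the round trip, so a uniqueness check on the restored categories passes.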

cottrell · 2015-06-27T13:46:48Z

Posting some notes here as I go.

https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L4408 seems to be turning different encodings to nan. Commenting it out resolves the uniqueness exceptions but encodings are still not quite right.

So it looks to me like the writing (to_hdf) is possibly OK:

    In [94]: import tables
    In [95]: f = tables.open_file('testhdf.h5', 'r')
    In [96]: for r in f.root.data.meta.values.meta.table:
       ....:     print(r['index'], r['values'])
       ....:
    0 b''
    1 b'E\xc3\x89, 17'
    2 b'a'
    3 b'b'
    4 b'c'
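
A quick check on those stored bytes, copied by hand from the listing above rather than read from the file: they decode cleanly as UTF-8 and the decoded values are all distinct, which is consistent with the write side being fine. This is a sketch of the check only, and the variable names are made up.

```python
# Byte strings copied from the PyTables listing above.
stored_values = [b'', b'E\xc3\x89, 17', b'a', b'b', b'c']

# Decode each stored value as UTF-8 and check that no two collide.
decoded = [v.decode('utf-8') for v in stored_values]
is_unique = len(set(decoded)) == len(decoded)
print([ascii(v) for v in decoded], is_unique)
```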

jreback added Bug Unicode Unicode strings IO HDF5 read_hdf, HDFStore Difficulty Intermediate labels Jun 17, 2015

jreback added this to the 0.17.0 milestone Jun 17, 2015

cottrell mentioned this issue Jun 27, 2015

Attempt to fix issue #10366 encoding and categoricals hdf serialization. #10454

Closed

cottrell pushed a commit to cottrell/pandas that referenced this issue Aug 22, 2015

Add tests and fix issue pandas-dev#10366 encoding and categoricals hdf serialization.

8463c63

jreback mentioned this issue Aug 22, 2015

BUG: encoding of categoricals in hdf serialization #10889

Merged

jreback pushed a commit to jreback/pandas that referenced this issue Aug 26, 2015

Add tests and fix issue pandas-dev#10366 encoding and categoricals hdf serialization.

e10e701

jreback pushed a commit to jreback/pandas that referenced this issue Aug 27, 2015

Add tests and fix issue pandas-dev#10366 encoding and categoricals hdf serialization.

b268bb0

jreback closed this as completed in #10889 Aug 28, 2015
