Irregular errors when reading certain categorical strings from hdf #10366

Closed
cottrell opened this Issue Jun 16, 2015 · 8 comments

3 participants
Contributor

cottrell commented Jun 16, 2015

Round-tripping a categorical Series through HDF seems to fail when the values include both a string with non-ASCII characters AND the empty string:

    # -*- coding: latin-1 -*-
    import pandas
    import os

    examples = [
            pandas.Series(['EÉ, 17', '', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['EÉ, 17', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['EE, 17', '', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['øü', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['Aøü', '', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['EÉ, 17', 'øü', 'a', 'b', 'c'], dtype='category')
            ]

    def test_hdf(s):
        f = 'testhdf.h5'
        if os.path.exists(f):
            os.remove(f)
        s.to_hdf(f, 'data', format='table')
        return pandas.read_hdf(f, 'data')

    for i, s in enumerate(examples):
        flag = True
        e = ''
        try:
            test_hdf(s)
        except Exception as ex:
            e = ex
            flag = False
        print('%d: %s\t%s\t%s' % (i, 'pass' if flag else 'fail', s.tolist(), e))

Results in:

    0: fail ['EÉ, 17', '', 'a', 'b', 'c']   Categorical categories must be unique
    1: pass ['EÉ, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['øü', 'a', 'b', 'c']
    5: fail ['Aøü', '', 'a', 'b', 'c']  Categorical categories must be unique
    6: pass ['EÉ, 17', 'øü', 'a', 'b', 'c']

Not sure if I am using this incorrectly or if this is actually a corner case.

Contributor

bashtage commented Jun 16, 2015

On Linux/Python 2.7/pandas: 0.16.2-9-g7636c2c (master) I get

    0: pass ['E\xc3\x89, 17', '', 'a', 'b', 'c']
    1: pass ['E\xc3\x89, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['\xc3\xb8\xc3\xbc', 'a', 'b', 'c']
    5: pass ['A\xc3\xb8\xc3\xbc', '', 'a', 'b', 'c']
    6: pass ['E\xc3\x89, 17', '\xc3\xb8\xc3\xbc', 'a', 'b', 'c']

What does pd.show_versions() output?

Contributor

cottrell commented Jun 17, 2015

@bashtage Sorry, I forgot to include any version or platform information. I am on OS X and get the errors only under Python 3 (Python 2 behaves differently). I was on master but have switched back to the latest release.

Note the "-*- coding -*-" line that I edited into the example above. It slightly changes how the output looks.
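
One possible reason the output looks different, assuming the script was saved as UTF-8 while declaring latin-1 (the thread does not state the file's actual encoding): Python 3 would then read each UTF-8 byte as a separate latin-1 character. A minimal sketch of that effect, using only the standard library:

```python
# Assumption: the source file holds UTF-8 bytes but is declared as latin-1.
original = 'E\u00c9, 17'                  # the intended string, 'EÉ, 17'
utf8_bytes = original.encode('utf-8')     # b'E\xc3\x89, 17'
misread = utf8_bytes.decode('latin-1')    # every byte becomes one character

print(ascii(misread))
print(len(original), len(misread))
```

Under that assumption the misread string is one character longer than the intended one, because the two-byte sequence for É turns into two separate characters.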

Here is what I just tested now:

    $ python --version
    Python 3.4.3 :: Anaconda 2.2.0 (x86_64)

    $ python -c 'import pandas; pandas.show_versions()'

    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 3.4.3.final.0
    python-bits: 64
    OS: Darwin
    OS-release: 14.3.0
    machine: x86_64
    processor: i386
    byteorder: little
    LC_ALL: None
    LANG: en_CA.UTF-8

    pandas: 0.16.2
    nose: None
    Cython: 0.22
    numpy: 1.9.2
    scipy: 0.15.1
    statsmodels: None
    IPython: 3.1.0
    sphinx: 1.2.3
    patsy: 0.3.0
    dateutil: 2.4.2
    pytz: 2015.4
    bottleneck: None
    tables: 3.1.1
    numexpr: 2.3.1
    matplotlib: 1.4.3
    openpyxl: 1.8.5
    xlrd: 0.9.3
    xlwt: None
    xlsxwriter: 0.6.7
    lxml: 3.4.2
    bs4: 4.3.2
    html5lib: 0.999
    httplib2: None
    apiclient: None
    sqlalchemy: 0.9.9
    pymysql: None
    psycopg2: None

    $ python debug.py
    0: fail ['EÃ\x89, 17', '', 'a', 'b', 'c']  Categorical categories must be unique
    1: pass ['EÃ\x89, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['øü', 'a', 'b', 'c']
    5: fail ['Aøü', '', 'a', 'b', 'c']    Categorical categories must be unique
    6: fail ['EÃ\x89, 17', 'øü', 'a', 'b', 'c']  'utf-8' codec can't decode byte 0xc2 in position 6: unexpected end of data

    $ python --version
    Python 2.7.10 :: Continuum Analytics, Inc.

    $ python -c 'import pandas; pandas.show_versions()'

    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 2.7.10.final.0
    python-bits: 64
    OS: Darwin
    OS-release: 14.3.0
    machine: x86_64
    processor: i386
    byteorder: little
    LC_ALL: None
    LANG: en_CA.UTF-8

    pandas: 0.16.2
    nose: 1.3.4
    Cython: 0.21.1
    numpy: 1.9.2
    scipy: 0.15.1
    statsmodels: 0.6.1
    IPython: 2.3.1
    sphinx: 1.2.3
    patsy: 0.3.0
    dateutil: 2.4.2
    pytz: 2015.4
    bottleneck: None
    tables: 3.1.1
    numexpr: 2.3.1
    matplotlib: 1.4.3
    openpyxl: 2.0.2
    xlrd: 0.9.3
    xlwt: None
    xlsxwriter: 0.6.4
    lxml: 3.4.1
    bs4: 4.3.2
    html5lib: None
    httplib2: 0.8
    apiclient: None
    sqlalchemy: 0.9.8
    pymysql: None
    psycopg2: None

    $ python debug.py
    0: pass ['E\xc3\x89, 17', '', 'a', 'b', 'c']
    1: pass ['E\xc3\x89, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['\xc3\xb8\xc3\xbc', 'a', 'b', 'c']
    5: pass ['A\xc3\xb8\xc3\xbc', '', 'a', 'b', 'c']
    6: pass ['E\xc3\x89, 17', '\xc3\xb8\xc3\xbc', 'a', 'b', 'c']
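
The decode error reported for case 6 under Python 3 can be reproduced outside pandas. The sketch below rests on two guesses that are not confirmed anywhere in this thread: that both non-ASCII strings were misread as latin-1, and that the stored string width was taken from the longest string's character count rather than its encoded byte length. The names values, width and stored are hypothetical.

```python
# Assumption: both strings are the latin-1 misreadings of UTF-8 source text.
values = ['E\u00c3\u0089, 17', '\u00c3\u00b8\u00c3\u00bc', 'a', 'b', 'c']

# Hypothetical fixed width: the longest string measured in characters.
width = max(len(v) for v in values)

# Encode to UTF-8 and cut each value to that width, as a fixed-size
# byte column would.
stored = [v.encode('utf-8')[:width] for v in values]

try:
    [b.decode('utf-8') for b in stored]
    message = ''
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

If a width is measured in characters but applied to bytes, any value whose encoded form is longer than the width can be cut in the middle of a multi-byte sequence.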

jreback added this to the 0.17.0 milestone Jun 17, 2015

Contributor

jreback commented Jun 17, 2015

So this is a bug here.
The categoricals are written as a separate table in a sub-node of the main table. The encoding and nan_rep are already known on the table at this point and should be passed through. Since you already have a test, the hardest part is done. Please submit a pull request to fix this!

Contributor

cottrell commented Jun 20, 2015

@jreback Thanks. I think I can get to this in a week or so.

Contributor

jreback commented Jun 20, 2015

@cottrell great!

Contributor

cottrell commented Jun 26, 2015

It seems feeding the encoding and nan_rep through only fixed some of the errors. Basically, it looks like the categorical metadata is being mapped to "nan" for anything with non-standard encodings. Any suggestions on how to check whether the problem is with the writing or the reading of the hdf store? I can open the hdf store using pytables directly and I think the relevant node is '/data/meta/values/meta' where "data" is my top level key.

Contributor

jreback commented Jun 26, 2015

Hmm, you will need to encode the nan string as well when writing.

When reading, everything needs to be decoded first, and then the categorical created.
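
The order described above can be sketched in plain Python, without pandas or PyTables. The helpers encode_values and decode_values are hypothetical and only illustrate the idea, not the actual pandas code: on write, missing values are replaced by the nan string and everything is encoded with the same encoding; on read, everything is decoded first, the nan string is mapped back to missing, and only then are the categories checked.

```python
def encode_values(values, encoding='utf-8', nan_rep='nan'):
    """Write side: replace missing values with nan_rep, then encode."""
    return [(nan_rep if v is None else v).encode(encoding) for v in values]


def decode_values(raw, encoding='utf-8', nan_rep='nan'):
    """Read side: decode every item first, then restore missing values."""
    decoded = [b.decode(encoding) for b in raw]
    return [None if s == nan_rep else s for s in decoded]


categories = ['', 'E\u00c9, 17', 'a', 'b', 'c']
raw = encode_values(categories)
restored = decode_values(raw)

# The empty string comes back as its own category, so uniqueness holds.
print(restored == categories)
print(len(set(restored)) == len(restored))
```

Note that in this sketch a real category equal to nan_rep would be read back as missing, so the nan string has to be chosen so that it cannot collide with the data.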

Contributor

cottrell commented Jun 27, 2015

Posting some notes here as I go.

https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L4408 seems to be turning different encodings to nan. Commenting it out resolves the uniqueness exceptions but encodings are still not quite right.

So it looks to me like the writing (to_hdf) is possibly OK:

    In [94]: import tables
    In [95]: f = tables.open_file('testhdf.h5', 'r')
    In [96]: for r in f.root.data.meta.values.meta.table:
       ....:     print(r['index'], r['values'])
       ....:
    0 b''
    1 b'E\xc3\x89, 17'
    2 b'a'
    3 b'b'
    4 b'c'
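
As a quick check of that reading, the rows printed above can be decoded by hand. This sketch copies the byte strings from the session and assumes they are UTF-8; it uses only the standard library.

```python
# Byte strings copied from the PyTables rows printed above.
rows = [b'', b'E\xc3\x89, 17', b'a', b'b', b'c']

decoded = [r.decode('utf-8') for r in rows]

print(decoded)
print(len(set(decoded)) == len(decoded))
```

All five values decode without error and are distinct, which is consistent with the stored data being fine and the duplicate categories arising on the read side.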

@cottrell cottrell pushed a commit to cottrell/pandas that referenced this issue Aug 22, 2015

David Cottrell Add tests and fix issue #10366 encoding and categoricals hdf serializ…
…ation.
8463c63

@jreback jreback added a commit to jreback/pandas that referenced this issue Aug 26, 2015

@jreback David Cottrell + jreback Add tests and fix issue #10366 encoding and categoricals hdf serializ…
…ation.
e10e701

@jreback jreback added a commit to jreback/pandas that referenced this issue Aug 27, 2015

@jreback David Cottrell + jreback Add tests and fix issue #10366 encoding and categoricals hdf serializ…
…ation.
b268bb0

jreback closed this in #10889 Aug 28, 2015
