Irregular errors when reading certain categorical strings from hdf #10366

Closed
cottrell opened this issue Jun 16, 2015 · 8 comments · Fixed by #10889
Labels
Bug IO HDF5 read_hdf, HDFStore Unicode Unicode strings

@cottrell
Contributor

Something goes wrong when a categorical contains both a string with non-ASCII characters AND the empty string, and is then written to HDF and read back:

    # -*- coding: latin-1 -*-
    import pandas
    import os

    examples = [
            pandas.Series(['EÉ, 17', '', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['EÉ, 17', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['EE, 17', '', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['øü', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['Aøü', '', 'a', 'b', 'c'], dtype='category'),
            pandas.Series(['EÉ, 17', 'øü', 'a', 'b', 'c'], dtype='category')
            ]

    def test_hdf(s):
        f = 'testhdf.h5'
        if os.path.exists(f):
            os.remove(f)
        s.to_hdf(f, 'data', format='table')
        return pandas.read_hdf(f, 'data')

    for i, s in enumerate(examples):
        flag = True
        e = ''
        try:
            test_hdf(s)
        except Exception as ex:
            e = ex
            flag = False
        print('%d: %s\t%s\t%s' % (i, 'pass' if flag else 'fail', s.tolist(), e))

Results in:

    0: fail ['EÉ, 17', '', 'a', 'b', 'c']   Categorical categories must be unique
    1: pass ['EÉ, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['øü', 'a', 'b', 'c']
    5: fail ['Aøü', '', 'a', 'b', 'c']  Categorical categories must be unique
    6: pass ['EÉ, 17', 'øü', 'a', 'b', 'c']

Not sure if I am using this incorrectly or if this is actually a corner case.

@bashtage
Contributor

On Linux/Python 2.7/pandas: 0.16.2-9-g7636c2c (master) I get

    0: pass ['E\xc3\x89, 17', '', 'a', 'b', 'c']
    1: pass ['E\xc3\x89, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['\xc3\xb8\xc3\xbc', 'a', 'b', 'c']
    5: pass ['A\xc3\xb8\xc3\xbc', '', 'a', 'b', 'c']
    6: pass ['E\xc3\x89, 17', '\xc3\xb8\xc3\xbc', 'a', 'b', 'c']

What does pd.show_versions() output?

@cottrell
Contributor Author

@bashtage Sorry, I forgot to include any version or platform information. I am on OSX and get the errors only in python3 (python2 behaves differently). I was on master but went back to the latest release.

Note the "coding" line I edited into the example above. It changes the look of the output slightly.

Here is what I just tested now:

    $ python --version
    Python 3.4.3 :: Anaconda 2.2.0 (x86_64)

    $ python -c 'import pandas; pandas.show_versions()'

    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 3.4.3.final.0
    python-bits: 64
    OS: Darwin
    OS-release: 14.3.0
    machine: x86_64
    processor: i386
    byteorder: little
    LC_ALL: None
    LANG: en_CA.UTF-8

    pandas: 0.16.2
    nose: None
    Cython: 0.22
    numpy: 1.9.2
    scipy: 0.15.1
    statsmodels: None
    IPython: 3.1.0
    sphinx: 1.2.3
    patsy: 0.3.0
    dateutil: 2.4.2
    pytz: 2015.4
    bottleneck: None
    tables: 3.1.1
    numexpr: 2.3.1
    matplotlib: 1.4.3
    openpyxl: 1.8.5
    xlrd: 0.9.3
    xlwt: None
    xlsxwriter: 0.6.7
    lxml: 3.4.2
    bs4: 4.3.2
    html5lib: 0.999
    httplib2: None
    apiclient: None
    sqlalchemy: 0.9.9
    pymysql: None
    psycopg2: None

    $ python debug.py
    0: fail ['EÃ\x89, 17', '', 'a', 'b', 'c']  Categorical categories must be unique
    1: pass ['EÃ\x89, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['øü', 'a', 'b', 'c']
    5: fail ['Aøü', '', 'a', 'b', 'c']    Categorical categories must be unique
    6: fail ['EÃ\x89, 17', 'øü', 'a', 'b', 'c']  'utf-8' codec can't decode byte 0xc2 in position 6: unexpected end of data

    $ python --version
    Python 2.7.10 :: Continuum Analytics, Inc.

    $ python -c 'import pandas; pandas.show_versions()'

    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 2.7.10.final.0
    python-bits: 64
    OS: Darwin
    OS-release: 14.3.0
    machine: x86_64
    processor: i386
    byteorder: little
    LC_ALL: None
    LANG: en_CA.UTF-8

    pandas: 0.16.2
    nose: 1.3.4
    Cython: 0.21.1
    numpy: 1.9.2
    scipy: 0.15.1
    statsmodels: 0.6.1
    IPython: 2.3.1
    sphinx: 1.2.3
    patsy: 0.3.0
    dateutil: 2.4.2
    pytz: 2015.4
    bottleneck: None
    tables: 3.1.1
    numexpr: 2.3.1
    matplotlib: 1.4.3
    openpyxl: 2.0.2
    xlrd: 0.9.3
    xlwt: None
    xlsxwriter: 0.6.4
    lxml: 3.4.1
    bs4: 4.3.2
    html5lib: None
    httplib2: 0.8
    apiclient: None
    sqlalchemy: 0.9.8
    pymysql: None
    psycopg2: None

    $ python debug.py
    0: pass ['E\xc3\x89, 17', '', 'a', 'b', 'c']
    1: pass ['E\xc3\x89, 17', 'a', 'b', 'c']
    2: pass ['', 'a', 'b', 'c']
    3: pass ['EE, 17', '', 'a', 'b', 'c']
    4: pass ['\xc3\xb8\xc3\xbc', 'a', 'b', 'c']
    5: pass ['A\xc3\xb8\xc3\xbc', '', 'a', 'b', 'c']
    6: pass ['E\xc3\x89, 17', '\xc3\xb8\xc3\xbc', 'a', 'b', 'c']
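
The `unexpected end of data` failure in row 6 of the Python 3 run is the kind of error a truncated UTF-8 byte string produces. The sketch below is only an illustration, not a diagnosis of pandas internals: it assumes both strings were read as latin-1 renderings of their UTF-8 bytes (as the `'EÃ\x89, 17'` output suggests) and that a fixed column width was taken from the character count rather than the encoded byte count. Both assumptions are mine and are not confirmed in this thread.

```python
# ASSUMPTION: the source strings were mis-read as latin-1, so each non-ASCII
# character became two characters (escapes used to keep this file ASCII-only).
long_value = 'E\u00c3\u0089, 17'           # 7 characters, 9 bytes in UTF-8
short_value = '\u00c3\u00b8\u00c3\u00bc'   # 4 characters, 8 bytes in UTF-8

values = [long_value, short_value, 'a', 'b', 'c']

# ASSUMPTION: the fixed width is derived from character counts, not byte counts.
itemsize = max(len(v) for v in values)

encoded = short_value.encode('utf-8')
print(len(short_value), len(encoded), itemsize)

# Cutting the encoded bytes to the fixed width splits a multi-byte sequence.
truncated = encoded[:itemsize]
try:
    truncated.decode('utf-8')
    message = ''
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)
```

Under these assumptions the width comes out as 7 while the shorter string needs 8 bytes, so the last two-byte sequence is cut in half and decoding fails.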

@jreback added Bug Unicode Unicode strings IO HDF5 read_hdf, HDFStore Difficulty Intermediate labels Jun 17, 2015
@jreback added this to the 0.17.0 milestone Jun 17, 2015
@jreback
Contributor

jreback commented Jun 17, 2015

So this is a bug here.
The Categoricals are written as a separate table in a sub-node of the main table. The encoding and nan_rep are known on the table at this point and should be passed through. Since you already have a test, the hardest part is already done. Please submit a pull request to fix!

@cottrell
Contributor Author

@jreback Thanks. I think I can get to this in a week or so.

@jreback
Contributor

jreback commented Jun 20, 2015

@cottrell gr8!

@cottrell
Contributor Author

It seems feeding the encoding and nan_rep through only fixed some of the errors. Basically, it looks like the categorical metadata is being mapped to "nan" for anything with non-standard encodings. Any suggestions on how to check whether the problem is with the writing or the reading of the hdf store? I can open the hdf store using pytables directly and I think the relevant node is '/data/meta/values/meta' where "data" is my top level key.

@jreback
Contributor

jreback commented Jun 26, 2015

Hmm, you will need to encode the nan string as well when writing.

When reading, everything needs to be decoded first, and then the categorical created.
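
The order of operations described here can be sketched in plain Python. This is a minimal illustration, not the pandas implementation: `encode_categories` and `decode_categories` are hypothetical helpers, and the defaults `'utf-8'` and `'nan'` are assumptions chosen for the example.

```python
def encode_categories(categories, encoding='utf-8', nan_rep='nan'):
    """Hypothetical write step: swap missing values for nan_rep, then encode."""
    return [(nan_rep if c is None else c).encode(encoding) for c in categories]


def decode_categories(stored, encoding='utf-8', nan_rep='nan'):
    """Hypothetical read step: decode every value first, then restore missing values."""
    decoded = [raw.decode(encoding) for raw in stored]
    return [None if text == nan_rep else text for text in decoded]


# Categories from the first failing example ('\u00c9' is the accented E).
categories = ['', 'E\u00c9, 17', 'a', 'b', 'c']

stored = encode_categories(categories)
restored = decode_categories(stored)

# The round trip keeps the values intact, so they are still unique and a
# categorical could be built from them.
print(restored == categories, len(set(restored)) == len(restored))
```

Because the nan_rep comparison happens only after decoding, it compares text with text, and the empty string and the non-ASCII string stay distinct.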

@cottrell
Contributor Author

Posting some notes here as I go.

https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L4408 seems to be turning different encodings to nan. Commenting it out resolves the uniqueness exceptions but encodings are still not quite right.

So it looks to me like the writing (to_hdf) is possibly OK:

    In [94]: import tables
    In [95]: f = tables.open_file('testhdf.h5', 'r')
    In [96]: for r in f.root.data.meta.values.meta.table:
       ....:     print(r['index'], r['values'])
       ....:
    0 b''
    1 b'E\xc3\x89, 17'
    2 b'a'
    3 b'b'
    4 b'c'
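
The stored rows printed above can be checked by hand. The sketch below copies the five byte strings from that output and decodes them as UTF-8 (assuming UTF-8 is the encoding that was used for writing). It only checks that the bytes are valid and distinct and says nothing about what the read path does with them.

```python
# Byte strings copied from the PyTables output above.
rows = [b'', b'E\xc3\x89, 17', b'a', b'b', b'c']

# ASSUMPTION: the values were written as UTF-8.
decoded = [raw.decode('utf-8') for raw in rows]
print(decoded)

# All five decoded values are distinct, so the stored categories are unique.
print(len(set(decoded)) == len(decoded))
```

If the rows decode cleanly and remain unique, as they do here, the duplicate categories are more likely to appear somewhere on the read side.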

3 participants