Irregular errors when reading certain categorical strings from hdf #10366
Comments
|
On Linux/Python 2.7/pandas: 0.16.2-9-g7636c2c (master) I get
What does |
|
@bashtage Sorry, I forgot to mention any version or platform information. I am on OSX and get the errors only in Python 3 (Python 2 behaves differently). I was on master but went back to the latest release. Note the "-coding-" EDIT in the example I gave above; it changes the look of the output slightly. Here is what I just tested:
|
jreback added the Bug, Unicode, HDF5, Difficulty Intermediate, Effort Low labels Jun 17, 2015
jreback added this to the 0.17.0 milestone Jun 17, 2015
|
So this is a bug here. |
|
@jreback Thanks. I think I can get to this in a week or so. |
|
@cottrell gr8! |
|
It seems feeding the encoding and nan_rep through only fixed some of the errors. Basically, it looks like the categorical metadata is being mapped to "nan" for anything with non-standard encodings. Any suggestions on how to check whether the problem is with the writing or the reading of the hdf store? I can open the hdf store using pytables directly and I think the relevant node is '/data/meta/values/meta' where "data" is my top level key. |
|
hmm, you will need to encode the nan string as well. When reading, everything needs to be decoded first, and then the categorical created. |
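One way to read the suggestion above, as a rough sketch only: decode every stored category value first, then build the categorical from the decoded values. The variable names and the sample codes are made up for illustration, and this uses the public `pd.Categorical.from_codes` rather than whatever `pandas/io/pytables.py` does internally.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for what a reader might pull out of the store:
# category values as UTF-8 bytes, plus integer codes (-1 marks a missing value).
stored_categories = [b"", b"E\xc3\x89, 17", b"a", b"b", b"c"]
stored_codes = np.array([1, 0, 2, -1, 4])

# Step 1: decode every category value.
categories = [raw.decode("utf-8") for raw in stored_categories]

# Step 2: only then build the categorical, so the empty string and the
# non-ASCII value remain separate categories and only code -1 becomes NaN.
cat = pd.Categorical.from_codes(stored_codes, categories=categories)
print(list(cat))
```

Here the missing value is carried by the code -1, so no category ever has to be turned into a "nan" string.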
|
Posting some notes here as I go. https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L4408 seems to be turning different encodings to nan. Commenting it out resolves the uniqueness exceptions, but the encodings are still not quite right. So it looks to me like the writing (to_hdf) is possibly ok:
In [94]: import tables
In [95]: f = tables.open_file('testhdf.h5', 'r')
In [96]: for r in f.root.data.meta.values.meta.table:
   ....:     print(r['index'], r['values'])
   ....:
0 b''
1 b'E\xc3\x89, 17'
2 b'a'
3 b'b'
4 b'c' |
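A quick sanity check on the raw values printed above, as a sketch: it assumes the values were written as UTF-8 and only confirms that they decode cleanly and stay unique, which fits the guess that the write side is fine.

```python
# Raw category values copied from the PyTables session above.
raw_values = [b"", b"E\xc3\x89, 17", b"a", b"b", b"c"]

# Assumption: the store was written with UTF-8 encoding.
decoded = [value.decode("utf-8") for value in raw_values]
print(decoded)

# Categories must be unique; after decoding, all five values are distinct.
is_unique = len(set(decoded)) == len(decoded)
print(is_unique)
```

If the decoded values are unique here, the uniqueness exceptions are more likely to come from the read path than from the stored data.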
cottrell referenced this issue Jun 27, 2015: Closed: Attempt to fix issue #10366 encoding and categoricals hdf serialization. #10454
cottrell pushed a commit to cottrell/pandas that referenced this issue Aug 22, 2015: 8463c63
jreback referenced this issue Aug 22, 2015: Merged: BUG: encoding of categoricals in hdf serialization #10889
jreback added a commit to jreback/pandas that referenced this issue Aug 26, 2015: e10e701
jreback added a commit to jreback/pandas that referenced this issue Aug 27, 2015: b268bb0
cottrell commented Jun 16, 2015
It seems that there is something bad happening when we use certain strings with special characters AND the empty string with categoricals:
Results in:
Not sure if I am using this incorrectly or if this is actually a corner case.