HDFStore fails to read non-ascii characters #11234
Comments
should be fixed by pydata#10889, give it a try with
jreback added the IO HDF5 label Oct 4, 2015
FilipDusek commented Oct 4, 2015
No, unfortunately I still get the error.
Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.17.0rc2
nose: 1.3.7
pip: 7.1.2
setuptools: 18.3.2
Cython: 0.22.1
numpy: 1.9.3
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None
@jreback looks like we truncate the column to be length 1 since
This works though
In [19]: df = pd.DataFrame({'A': ['é']})
In [20]: store = pd.HDFStore(r'thiswillcrash.h5')
In [21]: store.put('df', df, format='table', min_itemsize={'A': 30})
In [22]: store.get('df')
Out[22]:
   A
0  é
Do you have a good idea where a fix would go?
https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L972 is where the width of the strings is determined.
Is it because the encoded length is different from the number of characters?
In [10]: x
Out[10]: 'é'
In [11]: len(x)
Out[11]: 1
In [12]: len(x.encode('utf-8'))
Out[12]: 2
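The mismatch above would account for both symptoms in the original report. The following is a minimal sketch in plain Python, not pandas or PyTables code: it assumes the column is stored as fixed-width bytes whose width was taken from the character count, and `store_and_load` is a hypothetical helper that only simulates that behaviour.

```python
# Assumption: the string column was sized by character count, not byte count.
def store_and_load(text, encoding="utf-8"):
    """Simulate a fixed-width byte column whose width is len(text) characters."""
    width = len(text)                        # character count, too small for non-ASCII
    stored = text.encode(encoding)[:width]   # bytes beyond the width are cut off
    return stored.decode(encoding)


print(store_and_load("abc"))   # ASCII text round-trips unchanged
print(store_and_load("aée"))   # 4 encoded bytes cut to 3, so the final 'e' is lost

try:
    store_and_load("é")        # 2 encoded bytes cut to 1, leaving half a character
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

Under that assumption, 'é' alone is cut in the middle of its two-byte sequence and cannot be decoded, while 'aée' decodes cleanly but comes back one character short.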
Yep, we should encode before we check and set the length.
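As a rough illustration of "encode first, then measure", the sketch below computes a column width from encoded byte lengths. `max_encoded_itemsize` is a hypothetical name chosen for this example; it is not the function in `lib.pyx` and not necessarily how the eventual fix was written.

```python
def max_encoded_itemsize(values, encoding="utf-8"):
    """Return the widest value in bytes after encoding (hypothetical helper).

    Falls back to 1 for an empty sequence so the width is never zero.
    """
    return max((len(v.encode(encoding)) for v in values), default=1)


print(max_encoded_itemsize(["é"]))           # byte width of a single non-ASCII character
print(max_encoded_itemsize(["aée", "abc"]))  # the widest encoded value determines the size
```

Until a fix is available, passing a generous `min_itemsize` to `store.put`, as in the working example earlier in the thread, sidesteps the undersized column.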
TomAugspurger referenced this issue Oct 5, 2015
Closed: BUG: HDFStore.append with encoded string itemsize #11240
jreback added the Bug and Unicode labels Oct 5, 2015
jreback added this to the 0.17.1 milestone Oct 5, 2015
jreback added a commit that referenced this issue Oct 9, 2015
TomAugspurger + jreback 26db172
closed by #11240
jreback closed this Oct 9, 2015
yarikoptic added a commit to neurodebian/pandas that referenced this issue Oct 11, 2015
yarikoptic cd7c38b
FilipDusek commented Oct 4, 2015
When I try to save a non-ASCII character such as é and then load it again, I end up with a UnicodeDecodeError. If you add some more data to the string (like 'aée'), the data gets stored and retrieved without error, but the result is missing the last character.