HDFStore fails to read non-ascii characters #11234

Closed
FilipDusek opened this Issue Oct 4, 2015 · 7 comments

FilipDusek commented Oct 4, 2015

When I try to save a non-ASCII character such as é and then load it again, I get a UnicodeDecodeError. If the string contains more data (for example 'aée'), it is stored and retrieved without error, but the result is missing the last character.

import pandas as pd

df = pd.DataFrame(columns=["A"])
toAppend = {"A": "é"}
df = df.append(toAppend, ignore_index = True)

store = pd.HDFStore(r'thiswillcrash.h5')
store.put('df', df, format='table', encoding="utf-8")
d = store["df"]
print(d)

store.close()

Versions

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.3
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.7
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None

jreback commented Oct 4, 2015

should be fixed by : #10889

give a try with v0.17.0rc2

conda install pandas -c pandas

jreback added the IO HDF5 label Oct 4, 2015


FilipDusek commented Oct 4, 2015

No, unfortunately I still get the error

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-3e6096eba1ca> in <module>()
      8 store = pd.HDFStore(r'iwillcrash30.h5')
      9 store.put('df', df, format='table', encoding="utf-8")
---> 10 d = store["df"]
     11 print(d)
     12 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in __getitem__(self, key)
    424 
    425     def __getitem__(self, key):
--> 426         return self.get(key)
    427 
    428     def __setitem__(self, key, value):

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in get(self, key)
    634         if group is None:
    635             raise KeyError('No object named %s in the file' % key)
--> 636         return self._read_group(group)
    637 
    638     def select(self, key, where=None, start=None, stop=None, columns=None,

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in _read_group(self, group, **kwargs)
   1271         s = self._create_storer(group)
   1272         s.infer_axes()
-> 1273         return s.read(**kwargs)
   1274 
   1275 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in read(self, where, columns, **kwargs)
   4004     def read(self, where=None, columns=None, **kwargs):
   4005 
-> 4006         if not self.read_axes(where=where, **kwargs):
   4007             return None
   4008 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in read_axes(self, where, **kwargs)
   3216         for a in self.axes:
   3217             a.set_info(self.info)
-> 3218             a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
   3219 
   3220         return True

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in convert(self, values, nan_rep, encoding)
   2062         if _ensure_decoded(self.kind) == u('string'):
   2063             self.data = _unconvert_string_array(
-> 2064                 self.data, nan_rep=nan_rep, encoding=encoding)
   2065 
   2066         return self

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in _unconvert_string_array(data, nan_rep, encoding)
   4430 
   4431         if isinstance(data[0], compat.binary_type):
-> 4432             data = Series(data).str.decode(encoding).values
   4433         else:
   4434             data = data.astype(dtype, copy=False).astype(object, copy=False)

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\core\strings.py in decode(self, encoding, errors)
   1310     @copy(str_decode)
   1311     def decode(self, encoding, errors="strict"):
-> 1312         result = str_decode(self.series, encoding, errors)
   1313         return self._wrap_result(result)
   1314 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\core\strings.py in str_decode(arr, encoding, errors)
    979     """
    980     f = lambda x: x.decode(encoding, errors)
--> 981     return _na_map(f, arr)
    982 
    983 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\core\strings.py in _na_map(f, arr, na_result, dtype)
    119 def _na_map(f, arr, na_result=np.nan, dtype=object):
    120     # should really _check_ for NA
--> 121     return _map(f, arr, na_mask=True, na_value=na_result, dtype=dtype)
    122 
    123 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\core\strings.py in _map(f, arr, na_mask, na_value, dtype)
    135         mask = isnull(arr)
    136         try:
--> 137             result = lib.map_infer_mask(arr, f, mask.view(np.uint8))
    138         except (TypeError, AttributeError):
    139             def g(x):

pandas\src\inference.pyx in pandas.lib.map_infer_mask (pandas\lib.c:61753)()

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\core\strings.py in <lambda>(x)
    978     decoded : Series/Index of objects
    979     """
--> 980     f = lambda x: x.decode(encoding, errors)
    981     return _na_map(f, arr)
    982 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data

Versions

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0rc2
nose: 1.3.7
pip: 7.1.2
setuptools: 18.3.2
Cython: 0.22.1
numpy: 1.9.3
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None

TomAugspurger commented Oct 4, 2015

@jreback looks like we truncate the column to be length 1 since len(df.iloc[0, 0]) is 1.

This works though

In [19]: df = pd.DataFrame({'A': ['é']})

In [20]: store = pd.HDFStore(r'thiswillcrash.h5')

In [21]: store.put('df', df, format='table', min_itemsize={'A': 30})

In [22]: store.get('df')
Out[22]:
   A
0  é

Do you have a good idea where a fix would go?

jreback commented Oct 4, 2015

https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L972

is where the width of the strings is determined,
but it should work for unicode

TomAugspurger commented Oct 4, 2015

Is it because the encoded length is different than the number of characters?

In [10]: x
Out[10]: 'é'

In [11]: len(x)
Out[11]: 1

In [12]: len(x.encode('utf-8'))
Out[12]: 2
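
The length mismatch shown in this session is enough to reproduce both reported symptoms in plain Python, without HDF5. This is a minimal sketch, assuming the column width is taken from the character count and the encoded bytes are then cut to that width. The `truncate_to_chars` helper is hypothetical and is not pandas code.

```python
def truncate_to_chars(text, encoding="utf-8"):
    """Encode text, then keep only len(text) bytes, mimicking a too-narrow column."""
    return text.encode(encoding)[:len(text)]

# "\u00e9" is 'é': 1 character but 2 bytes in UTF-8, so the second byte is cut off
raw = truncate_to_chars("\u00e9")
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # same kind of failure as in the traceback: truncated 0xc3 sequence

# 'aée' is 3 characters but 4 bytes: the cut falls on a character boundary,
# so decoding succeeds but the trailing 'e' is silently lost
roundtrip = truncate_to_chars("a\u00e9e").decode("utf-8")
```

Here `error` holds a `UnicodeDecodeError` for the single 'é', and `roundtrip` is 'aé', which matches the missing last character described in the original report.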

jreback commented Oct 4, 2015

yep should encode before we check and set the length
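
The idea of encoding before measuring can be sketched as follows. This is only an illustration of the approach, not the change that was actually merged. The `max_encoded_itemsize` name is hypothetical.

```python
def max_encoded_itemsize(values, encoding="utf-8"):
    """Return the width in bytes of the widest value after encoding (at least 1)."""
    return max((len(v.encode(encoding)) for v in values), default=1)

values = ["\u00e9", "a\u00e9e"]  # 'é' and 'aée'

# Width measured in characters: too small for the UTF-8 bytes
char_width = max(len(v) for v in values)

# Width measured on the encoded bytes: large enough to hold every value
byte_width = max_encoded_itemsize(values)
```

With these values `char_width` is 3 while `byte_width` is 4. Until a fixed version is installed, a width computed this way could be passed through the `min_itemsize` argument used in the workaround earlier in this thread.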

jreback added this to the 0.17.1 milestone Oct 5, 2015

jreback added a commit that referenced this issue Oct 9, 2015

BUG: HDFStore.append with encoded string itemsize, #11234
Failure came when the maximum length of the unencoded string
was smaller than the maximum encoded length.

jreback commented Oct 9, 2015

closed by #11240

jreback closed this Oct 9, 2015

yarikoptic added a commit to neurodebian/pandas that referenced this issue Oct 11, 2015

Merge commit 'v0.17.0-8-gcac4ad2' into debian
* commit 'v0.17.0-8-gcac4ad2': (57 commits)
  BUG: to_excel duplicate columns
  BUG: HDFStore.append with encoded string itemsize, pandas-dev#11234
  BUG: remove midrule in latex output with header=False
  BUG: squeeze works on 0 length arrays, pandas-dev#11299, pandas-dev#8999
  DOC: add whatsnew 0.17.1 to index
  DOC: update resample docs
  timeseries: add tip about using groupby() rather than resample
  DOC: release_stats.sh script to report release stats
  DOC: edit release.rst
  CI: fix numpy to 1.9.3 in 2.7,3.5 builds for now, as packages for 1.10.0 not released ATM
  DOC: Included halflife as one 3 optional params that must be specified
  DOC: whatsnew 0.17.0 edits
  BUG/ERR: raise when trying to set a subset of values in a datetime64[ns, tz] column with another tz
  DOC: Add note about unicode layout
  DOC: hack to numpydoc to include attributes that are None (GH6100)
  DOC: add str accessor docstring pages to api.rst to avoid warning
  DOC: hack to numpydoc to avoid warnings for Categorical (not including members)
  skip some plotting tests if scipy is not installed
  add matplotlib to ci for 3.5
  COMPAT/PERF: lib.ismember_int64 on older numpies/cython not comparing correctly PERF: use np.in1d on larger isin sizes
  ...