ERR: raise on invalid columns using a fixed HDFStore #13492

Closed
amanhanda opened this Issue Jun 20, 2016 · 12 comments


Code Sample

import datetime
import numpy as np
import pandas as pd

idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')
idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')
s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)
print type(s.index.name)
# The type is str
<type 'str'>
s.reset_index()
cols       rows  2000-01-01 00:00:00  2000-01-02 00:00:00
0    2010-01-01                    0                    1
1    2010-01-02                    2                    3
with pd.HDFStore("/logs/tmp/test.h5", "w") as store:
    store.put("test", s, "fixed")
# When reading the data from HDF5, the index name comes back as a numpy.string_

with pd.HDFStore("/logs/tmp/test.h5", "r") as store:
    s1 = store["test"]
type(s1.index.name)
numpy.string_
# reset_index() fails: the np.concatenate line raises a ValueError, which the except clause
# does not catch, so the DatetimeIndex columns are never converted to an object index

s1.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-93-f61766d7f5c1> in <module>()
----> 1 s1.reset_index()

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/core/frame.pyc in reset_index(self, level, drop, inplace, col_level, col_fill)
   2731                     name = tuple(name_lst)
   2732             values = _maybe_casted_values(self.index)
-> 2733             new_obj.insert(0, name, values)
   2734
   2735         new_obj.index = new_index

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/core/frame.pyc in insert(self, loc, column, value, allow_duplicates)
   2228         value = self._sanitize_column(column, value)
   2229         self._data.insert(
-> 2230             loc, column, value, allow_duplicates=allow_duplicates)
   2231
   2232     def assign(self, **kwargs):

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/core/internals.pyc in insert(self, loc, item, value, allow_duplicates)
   3100             self._blknos = np.insert(self._blknos, loc, len(self.blocks))
   3101
-> 3102         self.axes[0] = self.items.insert(loc, item)
   3103
   3104         self.blocks += (block,)

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/tseries/index.pyc in insert(self, loc, item)
   1505             item = _to_m8(item, tz=self.tz)
   1506         try:
-> 1507             new_dates = np.concatenate((self[:loc].asi8, [item.view(np.int64)],
   1508                                         self[loc:].asi8))
   1509             if self.tz is not None:

ValueError: new type not compatible with array.

# The except clause does not catch ValueError

.../pandas/tseries/index.py
   1720         freq = None
   1721
   1722         if isinstance(item, (datetime, np.datetime64)):
   1723             self._assert_can_do_op(item)
   1724             if not self._has_same_tz(item):
   1725                 raise ValueError(
   1726                     'Passed item and index have different timezone')
   1727             # check freq can be preserved on edge cases
   1728             if self.size and self.freq is not None:
   1729                 if ((loc == 0 or loc == -len(self)) and
   1730                         item + self.freq == self[0]):
   1731                     freq = self.freq
   1732                 elif (loc == len(self)) and item - self.freq == self[-1]:
   1733                     freq = self.freq
   1734             item = _to_m8(item, tz=self.tz)
   1735         try:
1> 1736             new_dates = np.concatenate((self[:loc].asi8, [item.view(np.int64)],
   1737                                         self[loc:].asi8))
   1738             if self.tz is not None:
   1739                 new_dates = tslib.tz_convert(new_dates, 'UTC', self.tz)
   1740             return DatetimeIndex(new_dates, name=self.name, freq=freq,
   1741                                  tz=self.tz)
   1742
   1743         except (AttributeError, TypeError):
   1744
   1745             # fall back to object index
   1746             if isinstance(item, compat.string_types):
   1747                 return self.asobject.insert(loc, item)
   1748             raise TypeError(
   1749                 "cannot insert DatetimeIndex with incompatible label")
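
The observation above can be reproduced in miniature without pandas. The following is a minimal sketch, not pandas code: `insert_narrow` and `insert_wide` are hypothetical helpers, and `int(item)` merely stands in for the typed fast path (`item.view(np.int64)` in the listing). It shows that a fallback guarded by `except (AttributeError, TypeError)` is skipped when the fast path raises ValueError.

```python
def insert_narrow(values, loc, item):
    """Typed fast path; the fallback only catches AttributeError/TypeError."""
    try:
        return values[:loc] + [int(item)] + values[loc:]
    except (AttributeError, TypeError):
        # Fallback: keep the item as-is (an "object" insert).
        return values[:loc] + [item] + values[loc:]


def insert_wide(values, loc, item):
    """Same fast path, but the fallback also catches ValueError."""
    try:
        return values[:loc] + [int(item)] + values[loc:]
    except (AttributeError, TypeError, ValueError):
        return values[:loc] + [item] + values[loc:]


# int(None) raises TypeError, so the narrow fallback handles it.
print(insert_narrow([1, 2], 0, None))    # [None, 1, 2]

# int("rows") raises ValueError, which the narrow version lets escape.
try:
    insert_narrow([1, 2], 0, "rows")
except ValueError as exc:
    print("escaped:", exc)

# The wide version falls back instead of raising.
print(insert_wide([1, 2], 0, "rows"))    # ['rows', 1, 2]
```

Whether widening the except clause is the right remedy is a separate question; the sketch only illustrates why the fallback branch was never reached.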

Expected Output

cols       rows  2000-01-01 00:00:00  2000-01-02 00:00:00
0    2010-01-01                    0                    1
1    2010-01-02                    2                    3

output of pd.show_versions()

# Problem occurs in 0.16.2 and 0.18.1

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-573.7.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.4.3
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None



jreback Jun 20, 2016

Contributor

Not really sure what you are doing.

Please show an exact reproduction.

In [7]: idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')

In [8]: idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')

In [9]: s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)

In [10]: s.to_hdf('test.h5','df',mode='w',format='table')                                          

In [11]: pd.read_hdf('test.h5','df')                                                               
Out[11]: 
cols        2000-01-01  2000-01-02
rows                              
2010-01-01           0           1
2010-01-02           2           3

In [12]: s.to_hdf('test.h5','df',mode='w',format='fixed')                                          

In [13]: pd.read_hdf('test.h5','df')
Out[13]: 
cols        2000-01-01  2000-01-02
rows                              
2010-01-01           0           1
2010-01-02           2           3

In [14]: pd.__version__
Out[14]: u'0.18.1'

amanhanda Jun 20, 2016

I am using the HDFStore interface. With your code snippet, please try calling reset_index() on the returned frame when format="fixed".


In [36]: s.to_hdf('test.h5','df',mode='w', format="fixed")

In [37]: s1 = pd.read_hdf('test.h5','df')

In [38]: s1.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-f61766d7f5c1> in <module>()
----> 1 s1.reset_index()

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/core/frame.pyc in reset_index(self, level, drop, inplace, col_level, col_fill)
   2959                     name = tuple(name_lst)
   2960             values = _maybe_casted_values(self.index)
-> 2961             new_obj.insert(0, name, values)
   2962
   2963         new_obj.index = new_index

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/core/frame.pyc in insert(self, loc, column, value, allow_duplicates)
   2447         value = self._sanitize_column(column, value)
   2448         self._data.insert(loc, column, value,
-> 2449                           allow_duplicates=allow_duplicates)
   2450
   2451     def assign(self, **kwargs):

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/core/internals.pyc in insert(self, loc, item, value, allow_duplicates)
   3514
   3515         # insert to the axis; this could possibly raise a TypeError
-> 3516         new_axis = self.items.insert(loc, item)
   3517
   3518         block = make_block(values=value, ndim=self.ndim,

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/tseries/index.pyc in insert(self, loc, item)
   1734             item = _to_m8(item, tz=self.tz)
   1735         try:
-> 1736             new_dates = np.concatenate((self[:loc].asi8, [item.view(np.int64)],
   1737                                         self[loc:].asi8))
   1738             if self.tz is not None:

ValueError: new type not compatible with array.



jreback Jun 20, 2016

Contributor

I see. Well, that's not really supported; you must have strings for column names. We did a fix for tables, IIRC.
#10098 is related (but the check isn't there).

Want to do a pull request?


@jreback jreback added this to the Next Major Release milestone Jun 20, 2016

@jreback jreback changed the title from DataFrame reset_index() fails when data frame read from HDF5. to ERR: raise on invalid columns using a fixed HDFStore Jun 20, 2016


amanhanda Jun 21, 2016

The index name is a string in the source data frame. The type changes to numpy.string_ only when it is stored to HDF5 and read back.
The column name is "cols" and the index name is "rows". Both are strings.

I have not done a pull request before. This would be my first. Will give it a shot.


jreback Jun 21, 2016

Contributor

The fixed format is not very respectful of attributes like this.
The table format generally handles them more smoothly.


makmanalp May 22, 2017

Contributor

Hi! I'm at the sprints at PyCon and am looking to pick this up! I managed to reproduce the issue, although for the type I get:

In [27]: type(s1.index.name)
Out[27]: numpy.str_

instead of numpy.string_ but perhaps that's a naming difference across numpy versions ('1.12.1' here).

Same issue arises when reading the table with read_hdf instead of HDFStore and doing a reset_index().

In terms of expected behavior, I'm not entirely certain what we want here - should we be casting the numpy.str_ to a string? (seems reasonable - unsure why they're incompatible in the first place).
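
One possible explanation for the incompatibility, offered as a sketch rather than a confirmed diagnosis: the `insert` code in the traceback calls `item.view(np.int64)` and only falls back to an object index on AttributeError or TypeError. A plain `str` has no `.view` method, whereas a NumPy string scalar does, so the two can fail in different ways. The helper `view_error` below is hypothetical and only reports which exception type is raised; Python 3 is assumed.

```python
import numpy as np

plain_name = "rows"
numpy_name = np.str_("rows")

# A NumPy string scalar is still a str on Python 3 ...
print(isinstance(numpy_name, str))    # True

# ... but unlike a plain str it also exposes ndarray-style methods.
print(hasattr(plain_name, "view"))    # False
print(hasattr(numpy_name, "view"))    # True


def view_error(item):
    """Return the name of the exception raised by item.view(np.int64), or None."""
    try:
        item.view(np.int64)
    except Exception as exc:
        return type(exc).__name__
    return None


# The plain str fails with AttributeError, which the except clause catches.
print(view_error(plain_name))         # AttributeError

# The NumPy scalar fails inside .view itself; the traceback in this issue
# shows a ValueError at that point, which the except clause does not catch.
print(view_error(numpy_name))
```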


makmanalp May 22, 2017

Contributor

Also can confirm that this doesn't happen with table.


TomAugspurger May 22, 2017

Contributor

@makmanalp yeah, I think the best thing to do would be to cast np.str_ to a python string. Hopefully we don't hit any encoding issues... It's not clear to me whether np.str_ is a python 3 str (unicode) or a python 2 str (bytes)
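
A minimal sketch of such a cast, assuming Python 3 and assuming UTF-8 for byte strings (the thread does not establish the encoding). `ensure_text` is a hypothetical name used for illustration, not the helper that pandas ships; non-string names are passed through unchanged.

```python
import numpy as np


def ensure_text(name, encoding="utf-8"):
    """Return `name` as a built-in str when it is string-like.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(name, bytes):      # includes np.bytes_
        return name.decode(encoding)
    if isinstance(name, str):        # includes np.str_
        return str(name)
    return name                      # e.g. None, ints, tuples


print(type(ensure_text(np.str_("rows"))))    # <class 'str'>
print(ensure_text(np.bytes_(b"rows")))       # rows
print(ensure_text(None))                     # None
```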


makmanalp May 22, 2017

Contributor

On my python3 installation, I'm finding that np.string_ is just the same as np.bytes_, which is different from np.str_. So perhaps there is some py2/3 trickiness here. I'll give it a first stab and perhaps try it on both somehow.
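
These relationships can be checked directly. A small sketch, assuming Python 3: `np.string_` is looked up with `getattr` because the alias is not present in every NumPy release (it was removed in NumPy 2.0).

```python
import numpy as np

# np.bytes_ wraps bytes and np.str_ wraps text; they are distinct types.
print(issubclass(np.bytes_, bytes))   # True
print(issubclass(np.str_, str))       # True
print(np.bytes_ is np.str_)           # False

# np.string_ is an alias of np.bytes_ on releases that still define it.
string_alias = getattr(np, "string_", None)
if string_alias is not None:
    print(string_alias is np.bytes_)  # True
```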


TomAugspurger May 22, 2017

Contributor

Ugh that's unfortunate. I guess we should know the encoding inside the HDF reader.


makmanalp May 22, 2017

Contributor

Single-file example for easy reproduction:

import pandas as pd
import numpy as np
import datetime

idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')
idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')
s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)

with pd.HDFStore("test.h5", "w") as store:
    store.put("test", s, "fixed")

with pd.HDFStore("test.h5", "r") as store:
    s1 = store["test"]

# s1.reset_index()

makmanalp added a commit to makmanalp/pandas that referenced this issue May 23, 2017


makmanalp May 23, 2017

Contributor

So, I just made a PR, it's just a first stab at the issue but hopefully it's in the right direction! Please let me know how happy you are with this fix and what I can do to get it release-ready!


makmanalp added a commit to makmanalp/pandas that referenced this issue Jun 1, 2017

makmanalp added a commit to makmanalp/pandas that referenced this issue Jun 1, 2017

makmanalp added a commit to makmanalp/pandas that referenced this issue Jun 1, 2017

makmanalp added a commit to makmanalp/pandas that referenced this issue Jun 1, 2017

@jreback jreback modified the milestones: 0.20.2, Next Major Release Jun 2, 2017

TomAugspurger added a commit that referenced this issue Jun 4, 2017

BUG: convert numpy strings in index names in HDF #13492 (#16444)
* BUG: Handle numpy strings in index names in HDF5 #13492

* REF: refactor to _ensure_str

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jun 4, 2017

BUG: convert numpy strings in index names in HDF #13492 (#16444)
* BUG: Handle numpy strings in index names in HDF5 #13492

* REF: refactor to _ensure_str

(cherry picked from commit 18c316b)

TomAugspurger added a commit that referenced this issue Jun 4, 2017

BUG: convert numpy strings in index names in HDF #13492 (#16444)
* BUG: Handle numpy strings in index names in HDF5 #13492

* REF: refactor to _ensure_str

(cherry picked from commit 18c316b)

Kiv added a commit to Kiv/pandas that referenced this issue Jun 11, 2017

BUG: convert numpy strings in index names in HDF #13492 (#16444)
* BUG: Handle numpy strings in index names in HDF5 #13492

* REF: refactor to _ensure_str

stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017

BUG: convert numpy strings in index names in HDF #13492 (#16444)
* BUG: Handle numpy strings in index names in HDF5 #13492

* REF: refactor to _ensure_str

yarikoptic added a commit to neurodebian/pandas that referenced this issue Jul 12, 2017

Merge tag 'v0.20.2' into releases
Version 0.20.2

* tag 'v0.20.2': (68 commits)
  RLS: v0.20.2
  DOC: Update release.rst
  DOC: Whatsnew fixups (#16596)
  ERRR: Raise error in usecols when column doesn't exist but length matches (#16460)
  BUG: convert numpy strings in index names in HDF #13492 (#16444)
  PERF: vectorize _interp_limit (#16592)
  DOC: whatsnew 0.20.2 edits (#16587)
  API: Make is_strictly_monotonic_* private (#16576)
  BUG: reimplement MultiIndex.remove_unused_levels (#16565)
  Strictly monotonic (#16555)
  ENH: add .ngroup() method to groupby objects (#14026) (#14026)
  fix linting
  BUG: Incorrect handling of rolling.cov with offset window (#16244)
  BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (#16317)
  return empty MultiIndex for symmetrical difference on equal MultiIndexes (#16486)
  BUG: Bug in .resample() and .groupby() when aggregating on integers (#16549)
  BUG: Fixed tput output on windows (#16496)
  Strictly monotonic (#16555)
  BUG: fixed wrong order of ordered labels in pd.cut()
  BUG: Fixed to_html ignoring index_names parameter
  ...