
BUG: indexing in datetime IntervalIndex with duplicate values fails #20636

Closed
avnovikov opened this issue Apr 8, 2018 · 4 comments · Fixed by #22988
Labels
Bug · Datetime · Indexing · Interval
Milestone
Next Major Release

Comments


avnovikov commented Apr 8, 2018

Code Sample

import pandas as pd

# NOTE: start, end and mm are not defined in the report (a reproducible example is requested below)
start_end_idx = pd.IntervalIndex.from_arrays(start, end)
start_end_idx.name = 'port_time'
mm = mm.set_index(['imo', 'port_id', start_end_idx])
global_bins = mm.index
global_bins.get_loc((8801527, 5610.0, pd.to_datetime('2016-10-16 16:57:44.316')))
/anaconda3/envs/mariquant/lib/python3.6/site-packages/pandas/core/indexes/interval.py in _engine(self)
    283     @cache_readonly
    284     def _engine(self):
--> 285         return IntervalTree(self.left, self.right, closed=self.closed)
    286 
    287     @property

pandas/_libs/intervaltree.pxi in pandas._libs.interval.IntervalTree.__init__()

KeyError: ('datetime64[ns]', 'right')

Problem description

It is impossible to use a non-unique IntervalIndex with datetimes as start/end points. Both Index.get_loc and DataFrame.loc produce the same error inside IntervalTree.

Expected Output

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: 0.10.2
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jorisvandenbossche (Member)

@avnovikov Can you provide a reproducible example? (in this case, provide example code for start, end and mm) See also http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

jorisvandenbossche added the Needs Info label on Apr 9, 2018
jschendel (Member)

I can reproduce the error on master with a duplicated IntervalIndex:

In [2]: ii = pd.interval_range(pd.Timestamp('20180101'), periods=2).repeat(2)

In [3]: ii
Out[3]:
IntervalIndex([(2018-01-01, 2018-01-02], (2018-01-01, 2018-01-02], (2018-01-02, 2018-01-03], (2018-01-02, 2018-01-03]]
              closed='right',
              dtype='interval[datetime64[ns]]')

In [4]: ii.get_loc(pd.Timestamp('20180102'))
---------------------------------------------------------------------------
KeyError: ('datetime64[ns]', 'right')

jorisvandenbossche (Member) commented Apr 10, 2018

OK, and for numeric intervals, this works correctly:

In [30]: ii = pd.interval_range(1, periods=2).repeat(2)

In [31]: ii
Out[31]: 
IntervalIndex([(1, 2], (1, 2], (2, 3], (2, 3]]
              closed='right',
              dtype='interval[int64]')

In [33]:  ii.get_loc(2)
Out[33]: array([0, 1])

jorisvandenbossche added the Bug, Datetime, Indexing and Interval labels and removed the Needs Info label on Apr 10, 2018
jorisvandenbossche added this to the Next Major Release milestone on Apr 10, 2018
jorisvandenbossche changed the title from "Problem with datetime interval index" to "BUG: indexing in datetime IntervalIndex with duplicate values fails" on Apr 10, 2018
jschendel (Member) commented Apr 10, 2018

I suspect this will fail for dtypes other than float32/float64/int32/int64 whenever the IntervalIndex is overlapping or not monotonic (the non-overlapping monotonic case doesn't require using IntervalTree). It looks like IntervalTree is only implemented for those four dtypes:

# we need specialized nodes and leaves to optimize for different dtype and
# closed values
{{py:
nodes = []
for dtype in ['float32', 'float64', 'int32', 'int64']:
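
If that's right, a non-overlapping, monotonic datetime IntervalIndex should still index fine, since that path never builds an IntervalTree. A quick sketch to check (my illustration, not from the thread; it assumes the fast path described above):

import pandas as pd

# non-overlapping, monotonic datetime IntervalIndex: get_loc should take the
# searchsorted fast path and never construct an IntervalTree
ii = pd.interval_range(pd.Timestamp('20180101'), periods=2)

ii.get_loc(pd.Timestamp('20180102'))  # expected: 0 (falls in the first, right-closed interval)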

Using timedelta64[ns] fails:

In [2]: ii = pd.interval_range(pd.Timedelta('0 days'), periods=2).repeat(2)

In [3]: ii
Out[3]:
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00]]
              closed='right',
              dtype='interval[timedelta64[ns]]')

In [4]: ii.get_loc(pd.Timedelta('1 day'))
---------------------------------------------------------------------------
KeyError: ('timedelta64[ns]', 'right')

Using uint64 fails:

In [11]: ii = pd.interval_range(1, periods=2).repeat(2).astype('interval[uint64]')

In [12]: ii
Out[12]:
IntervalIndex([(1, 2], (1, 2], (2, 3], (2, 3]]
              closed='right',
              dtype='interval[uint64]')

In [13]: ii.get_loc(1.5)
---------------------------------------------------------------------------
KeyError: ('uint64', 'right')

I'm guessing for datetimelike we'll need to do i8 conversion? Or is there a way to add that directly? I think uint64 can be directly added as a dtype. Any other dtypes I'm forgetting?
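
For reference, a user-level sketch of the i8-conversion idea (my illustration only, not the fix that eventually landed via #22988): rebuild the intervals on int64 endpoints, which IntervalTree does support, and query with the timestamp's integer value.

import pandas as pd

# duplicated datetime IntervalIndex that currently raises KeyError in get_loc
ii = pd.interval_range(pd.Timestamp('20180101'), periods=2).repeat(2)

# rebuild the same intervals on int64 (i8) endpoints, a dtype IntervalTree supports
ii_i8 = pd.IntervalIndex.from_arrays(ii.left.asi8, ii.right.asi8, closed=ii.closed)

# query with the i8 (nanosecond) value of the timestamp
ii_i8.get_loc(pd.Timestamp('20180102').value)  # expected: array([0, 1])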
