BUG: indexing in datetime IntervalIndex with duplicate values fails #20636

avnovikov · 2018-04-08T11:45:22Z

Code Sample

import pandas as pd
start_end_idx = pd.IntervalIndex.from_arrays(start, end)
start_end_idx.name = 'port_time'
mm = mm.set_index(['imo','port_id', start_end_idx])
global_bins = mm.index
global_bins.get_loc((8801527, 5610.0, pd.to_datetime('2016-10-16 16:57:44.316')))

/anaconda3/envs/mariquant/lib/python3.6/site-packages/pandas/core/indexes/interval.py in _engine(self)
    283     @cache_readonly
    284     def _engine(self):
--> 285         return IntervalTree(self.left, self.right, closed=self.closed)
    286 
    287     @property

pandas/_libs/intervaltree.pxi in pandas._libs.interval.IntervalTree.__init__()

KeyError: ('datetime64[ns]', 'right')

Problem description

It is impossible to use a non-unique IntervalIndex with datetimes as start and end points. Both index.get_loc and DataFrame.loc raise the same error from IntervalTree.

Expected Output

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: 0.10.2
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jorisvandenbossche · 2018-04-09T11:27:47Z

@avnovikov Can you provide a reproducible example? (in this case, provide example code for start, end and mm) See also http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

jschendel · 2018-04-09T23:08:55Z

I can reproduce the error on master with a duplicated IntervalIndex:

In [2]: ii = pd.interval_range(pd.Timestamp('20180101'), periods=2).repeat(2)

In [3]: ii
Out[3]:
IntervalIndex([(2018-01-01, 2018-01-02], (2018-01-01, 2018-01-02], (2018-01-02, 2018-01-03], (2018-01-02, 2018-01-03]]
              closed='right',
              dtype='interval[datetime64[ns]]')

In [4]: ii.get_loc(pd.Timestamp('20180102'))
---------------------------------------------------------------------------
KeyError: ('datetime64[ns]', 'right')

jorisvandenbossche · 2018-04-10T09:23:54Z

OK, and for numeric intervals, this works correctly:

In [30]: ii = pd.interval_range(1, periods=2).repeat(2)

In [31]: ii
Out[31]: 
IntervalIndex([(1, 2], (1, 2], (2, 3], (2, 3]]
              closed='right',
              dtype='interval[int64]')

In [33]:  ii.get_loc(2)
Out[33]: array([0, 1])

jschendel · 2018-04-10T23:10:58Z

I suspect this will fail for dtypes other than float32/float64/int32/int64 when the IntervalIndex is overlapping or not monotonic (the non-overlapping monotonic case doesn't require using IntervalTree). It looks like IntervalTree was only implemented for the four dtypes I mentioned earlier:

pandas/pandas/_libs/intervaltree.pxi.in

Lines 202 to 208 in 4e6aa1c

    # we need specialized nodes and leaves to optimize for different dtype and
    # closed values
    {{py:
    nodes = []
    for dtype in ['float32', 'float64', 'int32', 'int64']:
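
To illustrate why a missing dtype would surface as a KeyError carrying a (dtype, closed) tuple, here is a minimal pure-Python sketch of that kind of dispatch. It assumes the generated classes are looked up in a dict keyed by (dtype, closed). The names NODE_CLASSES and pick_node_class and the string placeholders are hypothetical and stand in for whatever the template actually generates.

```python
# Hypothetical stand-in for a (dtype, closed) dispatch table.
# Names and values are illustrative only, not the real pandas internals.
SUPPORTED_DTYPES = ['float32', 'float64', 'int32', 'int64']
CLOSED_VALUES = ['left', 'right', 'both', 'neither']

# One placeholder "node class" per (dtype, closed) pair, mimicking a
# template loop that only runs over the four supported dtypes.
NODE_CLASSES = {
    (dtype, closed): f"{dtype.title()}Closed{closed.title()}IntervalNode"
    for dtype in SUPPORTED_DTYPES
    for closed in CLOSED_VALUES
}


def pick_node_class(dtype, closed):
    """Look up the specialized node; raises KeyError for ungenerated pairs."""
    return NODE_CLASSES[(dtype, closed)]


print(pick_node_class('int64', 'right'))

try:
    pick_node_class('datetime64[ns]', 'right')
except KeyError as err:
    # The missing key is the whole (dtype, closed) tuple.
    print('KeyError:', err)
```

Under this assumption, any dtype left out of the loop fails in the same way, which would match the datetime64[ns], timedelta64[ns] and uint64 errors reported in this thread.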

Using timedelta64[ns] fails:

In [2]: ii = pd.interval_range(pd.Timedelta('0 days'), periods=2).repeat(2)

In [3]: ii
Out[3]:
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00]]
              closed='right',
              dtype='interval[timedelta64[ns]]')

In [4]: ii.get_loc(pd.Timedelta('1 day'))
---------------------------------------------------------------------------
KeyError: ('timedelta64[ns]', 'right')

Using uint64 fails:

In [11]: ii = pd.interval_range(1, periods=2).repeat(2).astype('interval[uint64]')

In [12]: ii
Out[12]:
IntervalIndex([(1, 2], (1, 2], (2, 3], (2, 3]]
              closed='right',
              dtype='interval[uint64]')

In [13]: ii.get_loc(1.5)
---------------------------------------------------------------------------
KeyError: ('uint64', 'right')

I'm guessing for datetimelike we'll need to do i8 conversion? Or is there a way to add that directly? I think uint64 can be directly added as a dtype. Any other dtypes I'm forgetting?
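
A rough sketch of what i8 conversion could look like, using only the standard library: endpoints and the query key are turned into integer nanoseconds since the epoch, and the containment test then runs on plain integers. The helpers to_i8 and get_loc_i8 are hypothetical names, and the linear scan is a stand-in for the tree lookup, not how IntervalTree works.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def to_i8(ts):
    """Naive datetime -> integer nanoseconds since the Unix epoch."""
    delta = ts - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


def get_loc_i8(lefts, rights, key):
    """Positions of right-closed intervals (left, right] that contain key.

    All comparisons happen on integers, so an int64-only structure could
    serve datetime queries once both sides are converted.
    """
    k = to_i8(key)
    return [
        i
        for i, (left, right) in enumerate(zip(lefts, rights))
        if to_i8(left) < k <= to_i8(right)
    ]


# Same shape as the duplicated index above: two daily intervals, each repeated.
lefts = [datetime(2018, 1, 1), datetime(2018, 1, 1),
         datetime(2018, 1, 2), datetime(2018, 1, 2)]
rights = [datetime(2018, 1, 2), datetime(2018, 1, 2),
          datetime(2018, 1, 3), datetime(2018, 1, 3)]

matches = get_loc_i8(lefts, rights, datetime(2018, 1, 2))
print(matches)
```

For the key 2018-01-02 this selects positions 0 and 1, the two copies of the first interval, which mirrors the numeric result shown earlier in the thread.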

jorisvandenbossche added the Needs Info label Apr 9, 2018

jorisvandenbossche added the Bug, Datetime, Indexing and Interval labels and removed the Needs Info label Apr 10, 2018

jorisvandenbossche added this to the Next Major Release milestone Apr 10, 2018

jorisvandenbossche changed the title ~~Problem with datetime interval index~~ BUG: indexing in datetime IntervalIndex with duplicate values fails Apr 10, 2018

jschendel mentioned this issue Apr 11, 2018 in BUG: Add uint64 support to IntervalTree #20651 (merged)

jschendel mentioned this issue Oct 4, 2018 in BUG: Perform i8 conversion for datetimelike IntervalTree queries #22988 (merged)

jreback modified the milestones: Contributions Welcome, 0.24.0 Oct 7, 2018

jreback closed this as completed in #22988 Oct 7, 2018
