BUG: indexing in datetime IntervalIndex with duplicate values fails #20636

Closed
avnovikov opened this Issue Apr 8, 2018 · 4 comments

@avnovikov

commented Apr 8, 2018

Code Sample

import pandas as pd
start_end_idx = pd.IntervalIndex.from_arrays(start, end)
start_end_idx.name = 'port_time'
mm = mm.set_index(['imo','port_id', start_end_idx])
global_bins = mm.index
global_bins.get_loc((8801527, 5610.0, pd.to_datetime('2016-10-16 16:57:44.316')))

/anaconda3/envs/mariquant/lib/python3.6/site-packages/pandas/core/indexes/interval.py in _engine(self)
    283     @cache_readonly
    284     def _engine(self):
--> 285         return IntervalTree(self.left, self.right, closed=self.closed)
    286 
    287     @property

pandas/_libs/intervaltree.pxi in pandas._libs.interval.IntervalTree.__init__()

KeyError: ('datetime64[ns]', 'right')

Problem description

A non-unique IntervalIndex with datetime start and end points cannot be indexed. Both Index.get_loc and DataFrame.loc raise the same KeyError from IntervalTree.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: 0.10.2
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche

Member

commented Apr 9, 2018

@avnovikov Can you provide a reproducible example? In this case, that means example code for start, end and mm. See also http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@jschendel

Member

commented Apr 9, 2018

I can reproduce the error on master with a duplicated IntervalIndex:

In [2]: ii = pd.interval_range(pd.Timestamp('20180101'), periods=2).repeat(2)

In [3]: ii
Out[3]:
IntervalIndex([(2018-01-01, 2018-01-02], (2018-01-01, 2018-01-02], (2018-01-02, 2018-01-03], (2018-01-02, 2018-01-03]]
              closed='right',
              dtype='interval[datetime64[ns]]')

In [4]: ii.get_loc(pd.Timestamp('20180102'))
---------------------------------------------------------------------------
KeyError: ('datetime64[ns]', 'right')

@jorisvandenbossche

Member

commented Apr 10, 2018

OK, and for numeric intervals, this works correctly:

In [30]: ii = pd.interval_range(1, periods=2).repeat(2)

In [31]: ii
Out[31]: 
IntervalIndex([(1, 2], (1, 2], (2, 3], (2, 3]]
              closed='right',
              dtype='interval[int64]')

In [33]: ii.get_loc(2)
Out[33]: array([0, 1])

@jorisvandenbossche added this to the Next Major Release milestone Apr 10, 2018

@jorisvandenbossche changed the title from "Problem with datetime interval index" to "BUG: indexing in datetime IntervalIndex with duplicate values fails" Apr 10, 2018

@jschendel

Member

commented Apr 10, 2018

I suspect this will fail for any dtype other than float32/float64/int32/int64 whenever the IntervalIndex is overlapping or non-monotonic (the non-overlapping, monotonic case doesn't need IntervalTree). It looks like IntervalTree is only implemented for those four dtypes:

# we need specialized nodes and leaves to optimize for different dtype and
# closed values
{{py:
nodes = []
for dtype in ['float32', 'float64', 'int32', 'int64']:
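
Whether a given index is overlapping or non-monotonic can be checked from its properties. The sketch below is only an illustration: it assumes a pandas release newer than the 0.22.0 reported above, one that provides IntervalIndex.is_overlapping, and the variable names are made up for the example.

```python
import pandas as pd

# Unique, monotonic datetime intervals: (Jan 1, Jan 2], (Jan 2, Jan 3]
unique = pd.interval_range(pd.Timestamp("2018-01-01"), periods=2)
# The same intervals repeated, as in the failing examples
duplicated = unique.repeat(2)

for name, idx in [("unique", unique), ("duplicated", duplicated)]:
    print(
        name,
        "is_unique:", idx.is_unique,
        "monotonic:", idx.is_monotonic_increasing,
        "overlapping:", idx.is_overlapping,
    )
```

The repeated index is still monotonic, but its duplicate intervals overlap each other, so it is an instance of the overlapping case described above.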

Using timedelta64[ns] fails:

In [2]: ii = pd.interval_range(pd.Timedelta('0 days'), periods=2).repeat(2)

In [3]: ii
Out[3]:
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00]]
              closed='right',
              dtype='interval[timedelta64[ns]]')

In [4]: ii.get_loc(pd.Timedelta('1 day'))
---------------------------------------------------------------------------
KeyError: ('timedelta64[ns]', 'right')

Using uint64 fails:

In [11]: ii = pd.interval_range(1, periods=2).repeat(2).astype('interval[uint64]')

In [12]: ii
Out[12]:
IntervalIndex([(1, 2], (1, 2], (2, 3], (2, 3]]
              closed='right',
              dtype='interval[uint64]')

In [13]: ii.get_loc(1.5)
---------------------------------------------------------------------------
KeyError: ('uint64', 'right')

I'm guessing for datetimelike we'll need to do i8 conversion? Or is there a way to add that directly? I think uint64 can be directly added as a dtype. Any other dtypes I'm forgetting?
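
To picture the i8 conversion floated above, here is a user-level sketch. It is not the fix inside IntervalTree. It takes the int64 nanosecond values behind the datetime endpoints, builds an integer IntervalIndex from them, and looks up the key's integer value. The helper to_i8 and the variable names are hypothetical. The form of the get_loc result for a non-unique index can differ between pandas versions, so the sketch turns it into positions rather than relying on one form.

```python
import numpy as np
import pandas as pd


def to_i8(values):
    """Hypothetical helper: int64 nanoseconds since the epoch."""
    return pd.DatetimeIndex(values).astype("datetime64[ns]").asi8


# Duplicated datetime intervals, as in the reproduction earlier in the thread
ii = pd.interval_range(pd.Timestamp("2018-01-01"), periods=2).repeat(2)

# Integer-valued index with the same endpoints and the same closed side
ii_i8 = pd.IntervalIndex.from_arrays(
    to_i8(ii.left), to_i8(ii.right), closed=ii.closed
)

key = pd.Timestamp("2018-01-02")
loc = ii_i8.get_loc(int(to_i8([key])[0]))

# Normalise an int, slice, mask, or integer array into positions
positions = np.atleast_1d(np.arange(len(ii))[loc])
print(ii[positions])
```

A timedelta64[ns] index would need an analogous helper. Neither version addresses uint64, which has no such integer detour.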
