ENH: Interval type should support intersection, union & overlaps & difference #21998

Open
haleemur opened this issue Jul 20, 2018 · 3 comments

@haleemur
Contributor

commented Jul 20, 2018

Problem description

We have the Interval type in pandas, which is extremely useful. However, the standard interval arithmetic operations are missing from the pandas implementation. I would be happy to work on this enhancement.

One should be able to do the following with pandas.Interval:

The example uses numeric intervals, but the same operations are also valid for time series intervals.

# Proposed operations and suggested behaviour:

import pandas as pd
i0 = pd.Interval(0, 3, closed='right')
i1 = pd.Interval(2, 4, closed='right')
i2 = pd.Interval(5, 8, closed='right')

# 1. intersection
i0.intersection(i1)
# should return: pd.Interval(2, 3, closed='right')
i0.intersection(i2)
# should return: np.nan (or, perhaps a more appropriate null-interval representation)

# 2. union
i0.union(i1)
# should return: pd.IntervalIndex([pd.Interval(0, 4, closed='right')])
i0.union(i2)
# should return: pd.IntervalIndex([pd.Interval(0, 3, closed='right'), pd.Interval(5, 8, closed='right')])

# 3. overlaps
i0.overlaps(i1)
# should return: True
i0.overlaps(i2)
# should return: False

# 4. difference
i0.difference(i1)
# should return: pd.Interval(0, 2, closed='right')
i0.difference(i2)
# should return: pd.Interval(0, 3, closed='right')
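
For reference, the overlaps check proposed above can be sketched in plain Python. This is only an illustration under stated assumptions: `Iv` is a hypothetical stand-in for pd.Interval (it is not a pandas class), the inputs are assumed to be non-empty, and a shared endpoint counts as overlap only when both intervals include it.

```python
from typing import NamedTuple


class Iv(NamedTuple):
    """Hypothetical stand-in for pd.Interval; not part of pandas."""
    left: float
    right: float
    closed: str = "right"  # one of 'left', 'right', 'both', 'neither'

    @property
    def closed_left(self) -> bool:
        return self.closed in ("left", "both")

    @property
    def closed_right(self) -> bool:
        return self.closed in ("right", "both")


def overlaps(a: Iv, b: Iv) -> bool:
    """Return True if a and b share at least one point (assumes non-empty inputs)."""

    def starts_before_end(x: Iv, y: Iv) -> bool:
        # x begins before y ends; a shared endpoint counts only if both include it.
        if x.left != y.right:
            return x.left < y.right
        return x.closed_left and y.closed_right

    return starts_before_end(a, b) and starts_before_end(b, a)


i0, i1, i2 = Iv(0, 3), Iv(2, 4), Iv(5, 8)
print(overlaps(i0, i1))  # True
print(overlaps(i0, i2))  # False
```

Under this rule, (0, 3] and (3, 5] do not overlap, because the second interval excludes the shared endpoint 3.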
@jschendel

Member

commented Jul 20, 2018

xref #19480

This is reasonable, though some care is needed if these operations are to work between intervals whose closed values differ (mixed closed). My initial inclination was simply to disallow mixed closed operations, but after thinking about it they seem generally well defined.

For example, the following seems reasonable:

In [2]: i0 = pd.Interval(0, 2, closed='both')

In [3]: i1 = pd.Interval(1, 3, closed='neither')

In [4]: i0.intersection(i1)
Out[4]: Interval(1, 2, closed='right')
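
The mixed closed rule behind that result can be sketched as follows: each endpoint of the intersection comes from the tighter of the two bounds, and a tied endpoint is closed only if both inputs are closed there. This is a sketch, not a proposed implementation. `Iv` is a hypothetical stand-in for pd.Interval, and returning None for an empty intersection is an assumption (the original proposal suggests np.nan or a dedicated null interval).

```python
from typing import NamedTuple, Optional


class Iv(NamedTuple):
    """Hypothetical stand-in for pd.Interval; not part of pandas."""
    left: float
    right: float
    closed: str = "right"  # one of 'left', 'right', 'both', 'neither'

    @property
    def closed_left(self) -> bool:
        return self.closed in ("left", "both")

    @property
    def closed_right(self) -> bool:
        return self.closed in ("right", "both")


# Map (closed_left, closed_right) flags back to a closed string.
_CLOSED = {(True, True): "both", (True, False): "left",
           (False, True): "right", (False, False): "neither"}


def intersection(a: Iv, b: Iv) -> Optional[Iv]:
    # Left bound: the larger value wins; on a tie it is closed only if both are.
    if a.left == b.left:
        left, cl = a.left, a.closed_left and b.closed_left
    else:
        inner = max(a, b, key=lambda iv: iv.left)
        left, cl = inner.left, inner.closed_left
    # Right bound: the smaller value wins, with the same tie rule.
    if a.right == b.right:
        right, cr = a.right, a.closed_right and b.closed_right
    else:
        inner = min(a, b, key=lambda iv: iv.right)
        right, cr = inner.right, inner.closed_right
    # Nothing is left if the bounds cross, or meet at an excluded point.
    if left > right or (left == right and not (cl and cr)):
        return None  # assumption: None marks the empty intersection
    return Iv(left, right, _CLOSED[(cl, cr)])


print(intersection(Iv(0, 2, "both"), Iv(1, 3, "neither")))
# Iv(left=1, right=2, closed='right')
```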

The only thing that immediately comes to mind as problematic is union. Mixed closed should be fine if the intervals are overlapping. For example, using the intervals defined above, the following seems reasonable:

In [5]: i0.union(i1)
Out[5]: Interval(0, 3, closed='left')

Non-overlapping intervals with the same closed should also be fine, though I'd like to note that my preference would be to return an IntervalArray (newly implemented, slated for 0.24.0) instead of an IntervalIndex:

In [6]: i2 = pd.Interval(8, 10, closed='both')

In [7]: i0.union(i2)
Out[7]:
IntervalArray([[0, 2], [8, 10]],
              closed='both',
              dtype='interval[int64]')

The problematic case for union is when you have mixed closed intervals that are non-overlapping. We can't use IntervalArray (or IntervalIndex) in that case, as they require all intervals to be closed on the same side. The options would be to either raise an error, or return a numpy object dtype array:

In [8]: i3 = pd.Interval(8, 10, closed='neither')

In [9]: i0.union(i3)
Out[9]: array([Interval(0, 2, closed='both'), Interval(8, 10, closed='neither')], dtype=object)

I'm not sure an object dtype array actually provides any utility here, and it seems a bit unnatural, so I'd lean towards raising. I could be convinced otherwise if anyone has a practical use for it, though.
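
A rough sketch of the "raise when the result is not a single interval" option, in plain Python. `Iv` is a hypothetical stand-in for pd.Interval, and the error message is borrowed for illustration. One assumption worth flagging: intervals that merely touch, with at least one of them including the shared endpoint, are merged here as well.

```python
from typing import NamedTuple


class Iv(NamedTuple):
    """Hypothetical stand-in for pd.Interval; not part of pandas."""
    left: float
    right: float
    closed: str = "right"  # one of 'left', 'right', 'both', 'neither'

    @property
    def closed_left(self) -> bool:
        return self.closed in ("left", "both")

    @property
    def closed_right(self) -> bool:
        return self.closed in ("right", "both")


_CLOSED = {(True, True): "both", (True, False): "left",
           (False, True): "right", (False, False): "neither"}


def union(a: Iv, b: Iv) -> Iv:
    # Order by left bound; on a tie the interval closed on the left comes first.
    first, second = sorted((a, b), key=lambda iv: (iv.left, not iv.closed_left))
    # No gap if they overlap, or touch at a point that at least one includes.
    touching = first.right == second.left and (
        first.closed_right or second.closed_left)
    if not (first.right > second.left or touching):
        raise ValueError("result of interval union would not be contiguous")
    # Right bound: the larger value wins; on a tie prefer the closed one.
    last = max(a, b, key=lambda iv: (iv.right, iv.closed_right))
    return Iv(first.left, last.right,
              _CLOSED[(first.closed_left, last.closed_right)])


print(union(Iv(0, 2, "both"), Iv(1, 3, "neither")))
# Iv(left=0, right=3, closed='left')
```

With this approach the mixed closed, non-overlapping case never needs an object dtype array, since it raises just like the same-closed, non-overlapping case would.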

@jschendel

Member

commented Jul 20, 2018

Actually, I think difference could also be similarly problematic, even when both intervals are closed on the same side. Specifically, I'd expect problematic behavior to occur with nested intervals:

In [2]: i0 = pd.Interval(0, 3, closed='both')

In [3]: i1 = pd.Interval(1, 2, closed='both')

In [4]: i0.difference(i1)
Out[4]: array([Interval(0, 1, closed='left'), Interval(2, 3, closed='right')], dtype=object)

But with mixed closed you could actually get a valid IntervalArray:

In [5]: i2 = pd.Interval(1, 2, closed='neither')

In [6]: i0.difference(i2)
Out[6]:
IntervalArray([[0, 1], [2, 3]],
              closed='both',
              dtype='interval[int64]')

So I'm not really sure whether difference should be supported, as it can become a bit confusing from a user perspective to keep track of what is valid and what isn't (i.e. what returns an IntervalArray vs. what raises/returns an object dtype array).
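
To make the nested case concrete, here is a plain-Python sketch that returns the difference as a list of zero, one, or two pieces. `Iv` and the helper names are hypothetical and not part of pandas. The key point is that the subtracted interval's closedness is flipped at the cut: where it includes an endpoint, the remaining piece excludes it, and vice versa.

```python
from typing import List, NamedTuple, Optional, Tuple


class Iv(NamedTuple):
    """Hypothetical stand-in for pd.Interval; not part of pandas."""
    left: float
    right: float
    closed: str = "right"  # one of 'left', 'right', 'both', 'neither'

    @property
    def closed_left(self) -> bool:
        return self.closed in ("left", "both")

    @property
    def closed_right(self) -> bool:
        return self.closed in ("right", "both")


Bound = Tuple[float, bool]  # (value, is_closed)
_CLOSED = {(True, True): "both", (True, False): "left",
           (False, True): "right", (False, False): "neither"}


def _tighter(p: Bound, q: Bound, pick) -> Bound:
    # pick is min for right bounds and max for left bounds;
    # on a tie the bound is closed only if both are closed.
    if p[0] == q[0]:
        return (p[0], p[1] and q[1])
    return pick(p, q, key=lambda bound: bound[0])


def _make(lo: Bound, hi: Bound) -> Optional[Iv]:
    (left, cl), (right, cr) = lo, hi
    if left > right or (left == right and not (cl and cr)):
        return None  # empty piece
    return Iv(left, right, _CLOSED[(cl, cr)])


def difference(a: Iv, b: Iv) -> List[Iv]:
    a_lo, a_hi = (a.left, a.closed_left), (a.right, a.closed_right)
    # Part of a before b starts: closed at b.left only if b is open there.
    before_b = _make(a_lo, _tighter(a_hi, (b.left, not b.closed_left), min))
    # Part of a after b ends: closed at b.right only if b is open there.
    after_b = _make(_tighter(a_lo, (b.right, not b.closed_right), max), a_hi)
    return [piece for piece in (before_b, after_b) if piece is not None]


print(difference(Iv(0, 3, "both"), Iv(1, 2, "both")))
# [Iv(left=0, right=1, closed='left'), Iv(left=2, right=3, closed='right')]
```

The two pieces in the nested, same-closed case come out with different closed values, which is exactly why they cannot be stored in a single IntervalArray.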

@haleemur

Contributor Author

commented Jul 27, 2018

@jschendel you are correct in the above observations. The complexity in my proposal derives from trying to return multiple intervals with mixed boundary types in either an IntervalIndex or an IntervalArray.

I looked into the behaviour of PostgreSQL's range types, and perhaps we can model these operations similarly. PostgreSQL avoids the mixed boundary type problem by only returning a single value (a boolean or a single contiguous interval) as the result.

I think the following PostgreSQL range operators, each taking two intervals as arguments, should be interesting for pandas users:

Operator  Description                      Example                                  Result
&&        overlap (have points in common)  int8range(3,7) && int8range(4,12)        t
<<        strictly left of                 int8range(1,10) << int8range(100,110)    t
>>        strictly right of                int8range(50,60) >> int8range(20,30)     t
&<        does not extend to the right of  int8range(1,20) &< int8range(18,20)      t
&>        does not extend to the left of   int8range(7,20) &> int8range(5,10)       t
-|-       is adjacent to                   numrange(1.1,2.2) -|- numrange(2.2,3.3)  t
+         union                            numrange(5,15) + numrange(10,20)         [5,20)
*         intersection                     int8range(5,15) * int8range(10,20)       [10,15)
-         difference                       int8range(5,15) - int8range(10,20)       [5,10)

and this function:

Function:     range_merge(anyrange, anyrange)
Return type:  anyrange
Description:  the smallest range which includes both of the given ranges
Example:      range_merge('[1,2)'::int4range, '[3,4)'::int4range)
Result:       [1,4)
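
A few of these could translate to Python roughly as follows. This is a sketch under assumptions: `Iv` is a hypothetical stand-in for pd.Interval, the function names are made up for illustration, and endpoints are treated as continuous values (PostgreSQL canonicalises discrete ranges such as int8range, which this sketch does not model).

```python
from typing import NamedTuple


class Iv(NamedTuple):
    """Hypothetical stand-in for pd.Interval; not part of pandas."""
    left: float
    right: float
    closed: str = "left"  # PostgreSQL ranges default to '[)'

    @property
    def closed_left(self) -> bool:
        return self.closed in ("left", "both")

    @property
    def closed_right(self) -> bool:
        return self.closed in ("right", "both")


_CLOSED = {(True, True): "both", (True, False): "left",
           (False, True): "right", (False, False): "neither"}


def strictly_left_of(a: Iv, b: Iv) -> bool:
    """Analogue of <<: every point of a lies before every point of b."""
    if a.right != b.left:
        return a.right < b.left
    return not (a.closed_right and b.closed_left)


def is_adjacent(a: Iv, b: Iv) -> bool:
    """Analogue of -|-: the intervals touch with no gap and no shared point."""

    def touches(x: Iv, y: Iv) -> bool:
        # Exactly one side includes the shared endpoint.
        return x.right == y.left and x.closed_right != y.closed_left

    return touches(a, b) or touches(b, a)


def range_merge(a: Iv, b: Iv) -> Iv:
    """Smallest interval containing both inputs, gaps included."""
    lo = min(a, b, key=lambda iv: (iv.left, not iv.closed_left))
    hi = max(a, b, key=lambda iv: (iv.right, iv.closed_right))
    return Iv(lo.left, hi.right, _CLOSED[(lo.closed_left, hi.closed_right)])


print(strictly_left_of(Iv(1, 10), Iv(100, 110)))  # True
print(is_adjacent(Iv(1.1, 2.2), Iv(2.2, 3.3)))    # True
print(range_merge(Iv(1, 2), Iv(3, 4)))            # Iv(left=1, right=4, closed='left')
```

All three return a single boolean or a single interval, so none of them runs into the mixed boundary problem.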

Examples of how PostgreSQL handles non-trivial range operations:

intersection:

hal=> select int8range(4,8) * int8range(10,20);
 ?column?
----------
 empty
(1 row)

difference:

hal=> select int8range(4,8) - int8range(5,7);
ERROR:  result of range difference would not be contiguous

union:

hal=> select int8range(4,8) + int8range(10,20);
ERROR:  result of range union would not be contiguous

I think we could implement behaviour similar to that of the casting functions, where the difference and union methods would take an errors parameter with the following options: raise|coerce|first|last|greatest|smallest

  • raise: raises an error
  • coerce: sets the value to np.nan
  • first: returns the first interval in the result array
  • last: returns the last interval in the result array
  • greatest: returns the biggest interval in the result array
  • smallest: returns the smallest interval in the result array

greatest & smallest could be useful options for the difference operation.
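
A minimal sketch of how such an errors option might reduce a multi-piece result to a single value. Everything here is illustrative: `resolve_pieces` is a hypothetical helper, pieces are plain (left, right) tuples sorted by left endpoint, the list is assumed to be non-empty, "greatest" and "smallest" are interpreted as comparing by length, and math.nan stands in for np.nan.

```python
import math
from typing import List, Tuple, Union

Piece = Tuple[float, float]  # (left, right); closedness omitted for brevity


def resolve_pieces(pieces: List[Piece], errors: str = "raise") -> Union[Piece, float]:
    """Collapse the pieces of a union/difference result to one value."""
    if len(pieces) == 1:
        return pieces[0]  # already contiguous, nothing to resolve
    if errors == "raise":
        raise ValueError("result would not be contiguous")
    if errors == "coerce":
        return math.nan  # stand-in for np.nan
    if errors == "first":
        return pieces[0]
    if errors == "last":
        return pieces[-1]

    def length(piece: Piece) -> float:
        return piece[1] - piece[0]

    if errors == "greatest":
        return max(pieces, key=length)
    if errors == "smallest":
        return min(pieces, key=length)
    raise ValueError(f"unknown errors option: {errors!r}")


pieces = [(0, 1), (2, 5)]  # e.g. the two pieces left by a nested difference
print(resolve_pieces(pieces, errors="first"))     # (0, 1)
print(resolve_pieces(pieces, errors="greatest"))  # (2, 5)
```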
