ENH: Interval type should support intersection, union & overlaps & difference #21998

Open
opened this issue Jul 20, 2018 · 3 comments

Contributor

haleemur commented Jul 20, 2018

Problem description

pandas has the `Interval` type, which is extremely useful; however, the standard interval arithmetic operations are missing from the pandas implementation. I would be happy to work on this enhancement.

One should be able to do the following with `pandas.Interval`:

The example uses numeric intervals, but the same operations are also valid for time series intervals.

```
# Proposed operations and suggested behaviour:

import pandas as pd
i0 = pd.Interval(0, 3, closed='right')
i1 = pd.Interval(2, 4, closed='right')
i2 = pd.Interval(5, 8, closed='right')

# 1. intersection
i0.intersection(i1)
# should return: pd.Interval(2, 3, closed='right')
i0.intersection(i2)
# should return: np.nan (or, perhaps a more appropriate null-interval representation)

# 2. union
i0.union(i1)
# should return: pd.IntervalIndex([pd.Interval(0, 4, closed='right')])
i0.union(i2)
# should return: pd.IntervalIndex([pd.Interval(0, 3, closed='right'), pd.Interval(5, 8, closed='right')])

# 3. overlaps
i0.overlaps(i1)
# should return: True
i0.overlaps(i2)
# should return: False

# 4. difference
i0.difference(i1)
# should return: pd.Interval(0, 2, closed='right')
i0.difference(i2)
# should return: pd.Interval(0, 3, closed='right')
```
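To make the suggested semantics for `intersection` and `overlaps` concrete, here is a minimal sketch in plain Python. It is not pandas code: `Iv` is a hypothetical named tuple that only mirrors the `left`, `right` and `closed` attributes of `pd.Interval`, and returning `None` for an empty intersection is an assumption standing in for whatever null representation is eventually chosen.

```python
from typing import NamedTuple, Optional


class Iv(NamedTuple):
    """Hypothetical stand-in for pd.Interval (same left/right/closed fields)."""
    left: float
    right: float
    closed: str = "right"  # 'left', 'right', 'both' or 'neither'


def _closed_name(left_closed: bool, right_closed: bool) -> str:
    """Map two endpoint flags back to a `closed` string."""
    names = {(True, True): "both", (True, False): "left",
             (False, True): "right", (False, False): "neither"}
    return names[(left_closed, right_closed)]


def intersection(a: Iv, b: Iv) -> Optional[Iv]:
    """Return the common part of a and b, or None when they share no point."""
    a_lc, b_lc = a.closed in ("left", "both"), b.closed in ("left", "both")
    a_rc, b_rc = a.closed in ("right", "both"), b.closed in ("right", "both")

    # Lower bound: the larger left endpoint; on a tie it is closed only if both are.
    if a.left > b.left:
        left, lc = a.left, a_lc
    elif a.left < b.left:
        left, lc = b.left, b_lc
    else:
        left, lc = a.left, a_lc and b_lc

    # Upper bound: the smaller right endpoint, with the same tie rule.
    if a.right < b.right:
        right, rc = a.right, a_rc
    elif a.right > b.right:
        right, rc = b.right, b_rc
    else:
        right, rc = a.right, a_rc and b_rc

    # Empty when the bounds cross, or meet at a point that is not closed on both sides.
    if left > right or (left == right and not (lc and rc)):
        return None
    return Iv(left, right, _closed_name(lc, rc))


def overlaps(a: Iv, b: Iv) -> bool:
    """Two intervals overlap exactly when their intersection is non-empty."""
    return intersection(a, b) is not None


i0, i1, i2 = Iv(0, 3), Iv(2, 4), Iv(5, 8)
print(intersection(i0, i1))  # Iv(left=2, right=3, closed='right')
print(overlaps(i0, i2))      # False
```

Defining `overlaps` through `intersection` keeps the two operations consistent by construction, at the cost of building an interval that is then thrown away.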
Member

jschendel commented Jul 20, 2018

xref #19480

This is reasonable, though some care is needed if these operations are to work between intervals with mixed `closed`. My initial inclination was to simply not allow mixed `closed` operations, but they seem generally well defined after thinking about it. For example, the following seems reasonable:

```
In [2]: i0 = pd.Interval(0, 2, closed='both')

In [3]: i1 = pd.Interval(1, 3, closed='neither')

In [4]: i0.intersection(i1)
Out[4]: Interval(1, 2, closed='right')
```

The only thing that immediately comes to mind as problematic is `union`. Mixed `closed` should be fine if the intervals are overlapping. For example, using the intervals defined above, the following seems reasonable:

```
In [5]: i0.union(i1)
Out[5]: Interval(0, 3, closed='left')
```

Non-overlapping intervals with the same `closed` should also be fine, though I'd like to note that my preference would be to return an `IntervalArray` (newly implemented, slated for 0.24.0) instead of an `IntervalIndex`:

```
In [6]: i2 = pd.Interval(8, 10, closed='both')

In [7]: i0.union(i2)
Out[7]: IntervalArray([[0, 2], [8, 10]], closed='both', dtype='interval[int64]')
```

The problematic case for `union` is when you have mixed `closed` intervals that are non-overlapping. We can't use `IntervalArray` (or `IntervalIndex`) in that case, as they require all intervals to be closed on the same side. The options would be to either raise an error, or return a numpy object dtype array:

```
In [8]: i3 = pd.Interval(8, 10, closed='neither')

In [9]: i0.union(i3)
Out[9]: array([Interval(0, 2, closed='both'), Interval(8, 10, closed='neither')], dtype=object)
```

I'm not sure if an object dtype array actually provides any utility here, and it seems a bit unnatural, so I'd lean towards raising. Could be convinced otherwise if anyone has a practical use for it though.
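The mixed-`closed` union rule described above can be sketched as follows. This is an illustration only, written in plain Python with a hypothetical `Iv` named tuple in place of `pd.Interval`. It takes the "raise" option for every non-contiguous result (including the same-`closed` case), assumes both inputs are non-empty, and treats two intervals that merely touch at an endpoint closed on at least one side as contiguous, which is an assumption not spelled out in the discussion.

```python
from typing import NamedTuple


class Iv(NamedTuple):
    """Hypothetical stand-in for pd.Interval (same left/right/closed fields)."""
    left: float
    right: float
    closed: str = "right"  # 'left', 'right', 'both' or 'neither'


def _lc(iv: Iv) -> bool:
    return iv.closed in ("left", "both")


def _rc(iv: Iv) -> bool:
    return iv.closed in ("right", "both")


def _closed_name(left_closed: bool, right_closed: bool) -> str:
    names = {(True, True): "both", (True, False): "left",
             (False, True): "right", (False, False): "neither"}
    return names[(left_closed, right_closed)]


def union(a: Iv, b: Iv) -> Iv:
    """Return a | b as one interval, or raise ValueError if it is not contiguous."""
    # Put the interval that starts first in `first`; a closed endpoint wins a tie.
    first, second = sorted([a, b], key=lambda iv: (iv.left, not _lc(iv)))

    # Contiguous if they overlap, or touch at a point that at least one of them contains.
    touching = second.left == first.right and (_rc(first) or _lc(second))
    if not (second.left < first.right or touching):
        raise ValueError("result of union would not be contiguous")

    # Upper bound: the larger right endpoint; on a tie it is closed if either is.
    if first.right > second.right:
        right, rc = first.right, _rc(first)
    elif first.right < second.right:
        right, rc = second.right, _rc(second)
    else:
        right, rc = first.right, _rc(first) or _rc(second)
    return Iv(first.left, right, _closed_name(_lc(first), rc))


print(union(Iv(0, 2, "both"), Iv(1, 3, "neither")))  # Iv(left=0, right=3, closed='left')
```

Calling `union(Iv(0, 2, "both"), Iv(8, 10, "neither"))` raises `ValueError`, which corresponds to the "lean towards raising" option rather than returning an object dtype array.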

Member

jschendel commented Jul 20, 2018

Actually, I think `difference` could also be similarly problematic, even when both intervals are closed on the same side. Specifically, I'd expect problematic behavior to occur with nested intervals:

```
In [2]: i0 = pd.Interval(0, 3, closed='both')

In [3]: i1 = pd.Interval(1, 2, closed='both')

In [4]: i0.difference(i1)
Out[4]: array([Interval(0, 1, closed='left'), Interval(2, 3, closed='right')], dtype=object)
```

But with mixed `closed` you could actually get a valid `IntervalArray`:

```
In [5]: i2 = pd.Interval(1, 2, closed='neither')

In [6]: i0.difference(i2)
Out[6]: IntervalArray([[0, 1], [2, 3]], closed='both', dtype='interval[int64]')
```

So not really sure if `difference` should be supported, as it can become a bit confusing from a user perspective to keep track of what is valid and what isn't (i.e. what returns an `IntervalArray` vs. what raises/returns an object dtype array).
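The nested-interval behaviour can be reproduced with a small plain-Python sketch. Everything here is hypothetical: `Iv` is a named tuple standing in for `pd.Interval`, `difference` returns a plain list of pieces rather than any pandas container, and both inputs are assumed to be non-empty. The idea is to intersect `a` with the two rays lying outside `b`; each ray includes `b`'s endpoint exactly when `b` itself excludes it.

```python
import math
from typing import List, NamedTuple, Optional


class Iv(NamedTuple):
    """Hypothetical stand-in for pd.Interval (same left/right/closed fields)."""
    left: float
    right: float
    closed: str = "right"  # 'left', 'right', 'both' or 'neither'


def _lc(iv: Iv) -> bool:
    return iv.closed in ("left", "both")


def _rc(iv: Iv) -> bool:
    return iv.closed in ("right", "both")


def _closed_name(left_closed: bool, right_closed: bool) -> str:
    names = {(True, True): "both", (True, False): "left",
             (False, True): "right", (False, False): "neither"}
    return names[(left_closed, right_closed)]


def _intersect(a: Iv, b: Iv) -> Optional[Iv]:
    """Common part of a and b, or None if they share no point."""
    if a.left > b.left:
        left, lc = a.left, _lc(a)
    elif a.left < b.left:
        left, lc = b.left, _lc(b)
    else:
        left, lc = a.left, _lc(a) and _lc(b)
    if a.right < b.right:
        right, rc = a.right, _rc(a)
    elif a.right > b.right:
        right, rc = b.right, _rc(b)
    else:
        right, rc = a.right, _rc(a) and _rc(b)
    if left > right or (left == right and not (lc and rc)):
        return None
    return Iv(left, right, _closed_name(lc, rc))


def difference(a: Iv, b: Iv) -> List[Iv]:
    """Return the pieces of a that are not in b (zero, one or two intervals)."""
    # Rays covering everything outside b, open at infinity.
    left_ray = Iv(-math.inf, b.left, "neither" if _lc(b) else "right")
    right_ray = Iv(b.right, math.inf, "neither" if _rc(b) else "left")
    pieces = (_intersect(a, left_ray), _intersect(a, right_ray))
    return [p for p in pieces if p is not None]


outer = Iv(0, 3, "both")
print(difference(outer, Iv(1, 2, "both")))     # pieces closed 'left' and 'right'
print(difference(outer, Iv(1, 2, "neither")))  # two pieces, both closed 'both'
```

The two calls at the end mirror the examples above: same-`closed` inputs produce pieces with different `closed` values, while the mixed case produces two pieces that are both closed on both sides.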
Contributor Author

haleemur commented Jul 27, 2018

@jschendel you are correct in the above observations. The complexity in my proposal derives from trying to return multiple intervals in either an `IntervalIndex` or an `IntervalArray` with mixed boundary types.

I looked into the behaviour of PostgreSQL's range types, and perhaps we can model these operations similarly. PostgreSQL avoids the mixed boundary type problem by only ever returning a single value (a boolean or a single contiguous interval) as the result.

I think the following range operators from PostgreSQL, each taking two intervals as arguments, should be interesting for pandas users:

| Operator | Description | Example | Result |
| --- | --- | --- | --- |
| `&&` | overlap (have points in common) | `int8range(3,7) && int8range(4,12)` | `t` |
| `<<` | strictly left of | `int8range(1,10) << int8range(100,110)` | `t` |
| `>>` | strictly right of | `int8range(50,60) >> int8range(20,30)` | `t` |
| `&<` | does not extend to the right of | `int8range(1,20) &< int8range(18,20)` | `t` |
| `&>` | does not extend to the left of | `int8range(7,20) &> int8range(5,10)` | `t` |
| `-\|-` | is adjacent to | `numrange(1.1,2.2) -\|- numrange(2.2,3.3)` | `t` |
| `+` | union | `numrange(5,15) + numrange(10,20)` | `[5,20)` |
| `*` | intersection | `int8range(5,15) * int8range(10,20)` | `[10,15)` |
| `-` | difference | `int8range(5,15) - int8range(10,20)` | `[5,10)` |

and this function:

| Function | Return Type | Description | Example | Result |
| --- | --- | --- | --- | --- |
| `range_merge(anyrange, anyrange)` | `anyrange` | the smallest range which includes both of the given ranges | `range_merge('[1,2)'::int4range, '[3,4)'::int4range)` | `[1,4)` |

Examples of how PostgreSQL handles non-trivial range operations:

intersection:

```
hal=> select int8range(4,8) * int8range(10,20);
?column?
----------
empty
(1 row)
```

difference:

```
hal=> select int8range(4,8) - int8range(5,7);
ERROR:  result of range difference would not be contiguous
```

union:

```
hal=> select int8range(4,8) + int8range(10,20);
ERROR:  result of range union would not be contiguous
```
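For reference, three of the single-valued PostgreSQL operations listed above (`<<`, `-|-` and `range_merge`) translate into short predicates. The sketch below is plain Python, not pandas or PostgreSQL code: `Iv` is a hypothetical named tuple with the same `left`, `right` and `closed` fields as `pd.Interval`, and the function names are made up for illustration. It assumes continuous (non-discrete) bounds and non-empty intervals.

```python
from typing import NamedTuple


class Iv(NamedTuple):
    """Hypothetical stand-in for pd.Interval (same left/right/closed fields)."""
    left: float
    right: float
    closed: str = "left"  # PostgreSQL ranges print as [a,b), i.e. closed on the left


def _lc(iv: Iv) -> bool:
    return iv.closed in ("left", "both")


def _rc(iv: Iv) -> bool:
    return iv.closed in ("right", "both")


def _closed_name(left_closed: bool, right_closed: bool) -> str:
    names = {(True, True): "both", (True, False): "left",
             (False, True): "right", (False, False): "neither"}
    return names[(left_closed, right_closed)]


def strictly_left(a: Iv, b: Iv) -> bool:
    """Like `a << b`: every point of a lies before every point of b."""
    if a.right != b.left:
        return a.right < b.left
    # Endpoints coincide: a is strictly left unless both intervals contain that point.
    return not (_rc(a) and _lc(b))


def adjacent(a: Iv, b: Iv) -> bool:
    """Like `a -|- b`: the intervals touch with no gap and no shared point."""
    a_then_b = a.right == b.left and _rc(a) != _lc(b)
    b_then_a = b.right == a.left and _rc(b) != _lc(a)
    return a_then_b or b_then_a


def range_merge(a: Iv, b: Iv) -> Iv:
    """Smallest interval containing both a and b, gaps included."""
    if a.left != b.left:
        lo = a if a.left < b.left else b
        left, lc = lo.left, _lc(lo)
    else:
        left, lc = a.left, _lc(a) or _lc(b)
    if a.right != b.right:
        hi = a if a.right > b.right else b
        right, rc = hi.right, _rc(hi)
    else:
        right, rc = a.right, _rc(a) or _rc(b)
    return Iv(left, right, _closed_name(lc, rc))


print(strictly_left(Iv(1, 10), Iv(100, 110)))  # True
print(adjacent(Iv(1.1, 2.2), Iv(2.2, 3.3)))    # True
print(range_merge(Iv(1, 2), Iv(3, 4)))         # Iv(left=1, right=4, closed='left')
```

Because each of these returns either a boolean or exactly one interval, none of them runs into the mixed-`closed` container problem.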

I think we could implement behaviour similar to casting functions, where the `difference` & `union` functions would have an `errors` parameter, with the following options: `raise|coerce|first|last|greatest|smallest`

• raise: raises an error
• coerce: sets the value to np.nan
• first: returns the first interval in the result array
• last: returns the last interval in the result array
• greatest: returns the biggest interval in the result array
• smallest: returns the smallest interval in the result array

`greatest` & `smallest` could be useful options for the difference operation.
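One possible reading of the `errors` options is sketched below in plain Python. All names are hypothetical: `Iv` stands in for `pd.Interval`, and `resolve` is a made-up helper that takes the list of pieces a `union` or `difference` would produce. Several details are assumptions the proposal does not settle: "first" and "last" order the pieces by left endpoint, "greatest" and "smallest" compare by length (`right - left`) with ties going to the earlier piece, and an empty result is coerced to NaN for every option except `raise`.

```python
import math
from typing import List, NamedTuple, Union


class Iv(NamedTuple):
    """Hypothetical stand-in for pd.Interval (same left/right/closed fields)."""
    left: float
    right: float
    closed: str = "right"  # 'left', 'right', 'both' or 'neither'


OPTIONS = ("raise", "coerce", "first", "last", "greatest", "smallest")


def resolve(pieces: List[Iv], errors: str = "raise") -> Union[Iv, float]:
    """Reduce the pieces of a union/difference to a single value."""
    if errors not in OPTIONS:
        raise ValueError(f"errors must be one of {OPTIONS}, got {errors!r}")
    if len(pieces) == 1:
        return pieces[0]  # already a single contiguous interval
    if errors == "raise":
        raise ValueError("result would not be a single contiguous interval")
    if errors == "coerce" or not pieces:
        return math.nan  # assumption: an empty result is also coerced to NaN

    ordered = sorted(pieces, key=lambda iv: iv.left)
    if errors == "first":
        return ordered[0]
    if errors == "last":
        return ordered[-1]

    def length(iv: Iv) -> float:
        return iv.right - iv.left

    if errors == "greatest":
        return max(ordered, key=length)
    return min(ordered, key=length)  # errors == "smallest"


pieces = [Iv(0, 1, "left"), Iv(2, 5, "right")]
print(resolve(pieces, errors="greatest"))  # Iv(left=2, right=5, closed='right')
```

Keeping the selection logic separate from the set operation itself means `union` and `difference` could share it.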
