# ENH: Interval type should support intersection, union & overlaps & difference #21998

opened this issue Jul 20, 2018 · 3 comments

Contributor

### haleemur commented Jul 20, 2018

#### Problem description

We have the Interval type in pandas, which is extremely useful, however the standard interval arithmetic operations are missing from the pandas implementation. I would be happy to work on this enhancement.

#### One should be able to do the following with `pandas.Interval`

The example uses numeric intervals, but the same operations are also valid for time series intervals.

```python
# The following are the proposed operations and their suggested behaviour:

import pandas as pd
i0 = pd.Interval(0, 3, closed='right')
i1 = pd.Interval(2, 4, closed='right')
i2 = pd.Interval(5, 8, closed='right')

# 1. intersection
i0.intersection(i1)
# should return: pd.Interval(2, 3, closed='right')
i0.intersection(i2)
# should return: np.nan (or, perhaps a more appropriate null-interval representation)

# 2. union
i0.union(i1)
# should return: pd.IntervalIndex([pd.Interval(0, 4, closed='right')])
i0.union(i2)
# should return: pd.IntervalIndex([pd.Interval(0, 3, closed='right'), pd.Interval(5, 8, closed='right')])

# 3. overlaps
i0.overlaps(i1)
# should return: True
i0.overlaps(i2)
# should return: False

# 4. difference
i0.difference(i1)
# should return: pd.Interval(0, 2, closed='right')
i0.difference(i2)
# should return: pd.Interval(0, 3, closed='right')
```
Member

### jschendel commented Jul 20, 2018

xref #19480

This is reasonable, though some care is needed if these operations are to work between intervals with mixed `closed`. My initial inclination was to simply not allow mixed `closed` operations, but they seem generally well defined after thinking about it. For example, the following seems reasonable:

```python
In : i0 = pd.Interval(0, 2, closed='both')

In : i1 = pd.Interval(1, 3, closed='neither')

In : i0.intersection(i1)
Out: Interval(1, 2, closed='right')
```

The only thing that immediately comes to mind as problematic is `union`. Mixed `closed` should be fine if the intervals are overlapping. For example, using the intervals defined above, the following seems reasonable:

```python
In : i0.union(i1)
Out: Interval(0, 3, closed='left')
```

Non-overlapping intervals with the same `closed` should also be fine, though I'd like to note that my preference would be to return an `IntervalArray` (newly implemented, slated for 0.24.0) instead of an `IntervalIndex`:

```python
In : i2 = pd.Interval(8, 10, closed='both')

In : i0.union(i2)
Out: IntervalArray([[0, 2], [8, 10]], closed='both', dtype='interval[int64]')
```

The problematic case for `union` is when you have mixed `closed` intervals that are non-overlapping. We can't use `IntervalArray` (or `IntervalIndex`) in that case, as they require all intervals to be closed on the same side. The options would be to either raise an error, or return a numpy object dtype array:

```python
In : i3 = pd.Interval(8, 10, closed='neither')

In : i0.union(i3)
Out: array([Interval(0, 2, closed='both'), Interval(8, 10, closed='neither')], dtype=object)
```

I'm not sure if an object dtype array actually provides any utility here, and it seems a bit unnatural, so I'd lean towards raising. Could be convinced otherwise if anyone has a practical use for it though.
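To make the mixed `closed` intersection rule concrete, here is a minimal plain-Python sketch. It is not pandas code: `Iv` is a hypothetical stand-in that only mirrors the `left`, `right` and `closed` attributes of `pd.Interval`, and returning `None` for an empty result is an assumption (the thread has not settled on a null representation).

```python
from typing import NamedTuple, Optional


class Iv(NamedTuple):
    """Hypothetical stand-in with the same left/right/closed fields as pd.Interval."""
    left: float
    right: float
    closed: str = "right"  # one of 'left', 'right', 'both', 'neither'


def closed_left(iv: Iv) -> bool:
    return iv.closed in ("left", "both")


def closed_right(iv: Iv) -> bool:
    return iv.closed in ("right", "both")


def closed_name(left_closed: bool, right_closed: bool) -> str:
    """Map two endpoint flags back to a `closed` string."""
    if left_closed and right_closed:
        return "both"
    if left_closed:
        return "left"
    if right_closed:
        return "right"
    return "neither"


def intersection(a: Iv, b: Iv) -> Optional[Iv]:
    """Return the overlap of a and b, or None when they share no points."""
    # Lower bound: the larger left endpoint wins; on a tie the bound is
    # closed only if both intervals include it.
    if a.left != b.left:
        lo = a if a.left > b.left else b
        left, left_closed = lo.left, closed_left(lo)
    else:
        left, left_closed = a.left, closed_left(a) and closed_left(b)
    # Upper bound: the smaller right endpoint wins, with the same tie rule.
    if a.right != b.right:
        hi = a if a.right < b.right else b
        right, right_closed = hi.right, closed_right(hi)
    else:
        right, right_closed = a.right, closed_right(a) and closed_right(b)
    # Empty when the bounds cross, or meet at a point that is excluded.
    if left > right or (left == right and not (left_closed and right_closed)):
        return None
    return Iv(left, right, closed_name(left_closed, right_closed))


print(intersection(Iv(0, 2, "both"), Iv(1, 3, "neither")))
```

Each endpoint of the result inherits its openness from whichever input supplies that endpoint, which is why `[0, 2]` and `(1, 3)` intersect to `(1, 2]`.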

Member

### jschendel commented Jul 20, 2018

Actually, I think `difference` could also be similarly problematic, even when both intervals are closed on the same side. Specifically, I'd expect problematic behavior to occur with nested intervals:

```python
In : i0 = pd.Interval(0, 3, closed='both')

In : i1 = pd.Interval(1, 2, closed='both')

In : i0.difference(i1)
Out: array([Interval(0, 1, closed='left'), Interval(2, 3, closed='right')], dtype=object)
```

But with mixed `closed` you could actually get a valid `IntervalArray`:

```python
In : i2 = pd.Interval(1, 2, closed='neither')

In : i0.difference(i2)
Out: IntervalArray([[0, 1], [2, 3]], closed='both', dtype='interval[int64]')
```

So not really sure if `difference` should be supported, as it can become a bit confusing from a user perspective to keep track of what is valid and what isn't (i.e. what returns an `IntervalArray` vs. what raises/returns an object dtype array).
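The nested-interval behaviour described above can be reproduced with a small plain-Python sketch. Again this is only an illustration under stated assumptions: `Iv`, `_clip` and `difference` are hypothetical names, `Iv` merely mimics the `left`/`right`/`closed` fields of `pd.Interval`, and the result is returned as a plain list of zero, one or two pieces.

```python
import math
from typing import List, NamedTuple


class Iv(NamedTuple):
    """Hypothetical stand-in with the same left/right/closed fields as pd.Interval."""
    left: float
    right: float
    closed: str = "right"  # one of 'left', 'right', 'both', 'neither'


_NAMES = {(True, True): "both", (True, False): "left",
          (False, True): "right", (False, False): "neither"}


def _clip(a: Iv, lo: float, lo_closed: bool, hi: float, hi_closed: bool) -> List[Iv]:
    """Restrict a to the span between lo and hi; return [] if nothing is left."""
    left, lc = a.left, a.closed in ("left", "both")
    right, rc = a.right, a.closed in ("right", "both")
    # Tighten the lower bound if the span starts later or excludes a's endpoint.
    if lo > left or (lo == left and not lo_closed):
        left, lc = lo, lo_closed
    # Tighten the upper bound in the same way.
    if hi < right or (hi == right and not hi_closed):
        right, rc = hi, hi_closed
    if left > right or (left == right and not (lc and rc)):
        return []
    return [Iv(left, right, _NAMES[(lc, rc)])]


def difference(a: Iv, b: Iv) -> List[Iv]:
    """Return the pieces of a that are not covered by b."""
    b_lc = b.closed in ("left", "both")
    b_rc = b.closed in ("right", "both")
    # The cut at b.left keeps that point exactly when b excludes it,
    # and likewise for the cut at b.right.
    before = _clip(a, -math.inf, False, b.left, not b_lc)
    after = _clip(a, b.right, not b_rc, math.inf, False)
    return before + after


pieces = difference(Iv(0, 3, "both"), Iv(1, 2, "both"))
print(pieces)
# A single `closed` value across all pieces is what an IntervalArray would need.
print(len({p.closed for p in pieces}) == 1)
```

Removing a closed interval leaves open cut points, so the two pieces end up with different `closed` values; removing an open interval leaves closed cut points and the pieces agree.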
Contributor Author

### haleemur commented Jul 27, 2018

@jschendel you are correct in the above observations. The complexity in my proposal comes from trying to return multiple intervals with mixed boundary types in either an `IntervalIndex` or an `IntervalArray`.

I looked into the behaviour of PostgreSQL's range types, and perhaps we can model these operations similarly. PostgreSQL avoids the mixed boundary type problem by only ever returning a single value (a boolean or a single contiguous interval) as the result.

I think the following PostgreSQL range operators, each taking two intervals as arguments, would be interesting for pandas users:

| Operator | Description | Example | Result |
| --- | --- | --- | --- |
| `&&` | overlap (have points in common) | `int8range(3,7) && int8range(4,12)` | `t` |
| `<<` | strictly left of | `int8range(1,10) << int8range(100,110)` | `t` |
| `>>` | strictly right of | `int8range(50,60) >> int8range(20,30)` | `t` |
| `&<` | does not extend to the right of | `int8range(1,20) &< int8range(18,20)` | `t` |
| `&>` | does not extend to the left of | `int8range(7,20) &> int8range(5,10)` | `t` |
| `-\|-` | is adjacent to | `numrange(1.1,2.2) -\|- numrange(2.2,3.3)` | `t` |
| `+` | union | `numrange(5,15) + numrange(10,20)` | `[5,20)` |
| `*` | intersection | `int8range(5,15) * int8range(10,20)` | `[10,15)` |
| `-` | difference | `int8range(5,15) - int8range(10,20)` | `[5,10)` |

and this function:

| Function | Return Type | Description | Example | Result |
| --- | --- | --- | --- | --- |
| `range_merge(anyrange, anyrange)` | `anyrange` | the smallest range which includes both of the given ranges | `range_merge('[1,2)'::int4range, '[3,4)'::int4range)` | `[1,4)` |
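For illustration, a few of these could look roughly like the following in plain Python. This is only a sketch: `Iv` is a hypothetical stand-in mirroring the `left`/`right`/`closed` fields of `pd.Interval`, the function names are made up for this example, and it treats all values as continuous (PostgreSQL canonicalizes its built-in discrete ranges to `[)` form, which is not modelled here).

```python
from typing import NamedTuple


class Iv(NamedTuple):
    """Hypothetical stand-in with the same left/right/closed fields as pd.Interval."""
    left: float
    right: float
    closed: str = "right"  # one of 'left', 'right', 'both', 'neither'


_NAMES = {(True, True): "both", (True, False): "left",
          (False, True): "right", (False, False): "neither"}


def closed_left(iv: Iv) -> bool:
    return iv.closed in ("left", "both")


def closed_right(iv: Iv) -> bool:
    return iv.closed in ("right", "both")


def strictly_left(a: Iv, b: Iv) -> bool:
    """Analogue of `a << b`: every point of a lies before every point of b."""
    if a.right != b.left:
        return a.right < b.left
    # Endpoints coincide: a is strictly left unless both intervals include the point.
    return not (closed_right(a) and closed_left(b))


def adjacent(a: Iv, b: Iv) -> bool:
    """Analogue of `a -|- b`: the intervals touch with no gap and no shared point."""
    def touches(x: Iv, y: Iv) -> bool:
        # Exactly one side may own the shared endpoint.
        return x.right == y.left and closed_right(x) != closed_left(y)
    return touches(a, b) or touches(b, a)


def range_merge(a: Iv, b: Iv) -> Iv:
    """Analogue of range_merge: the smallest interval containing both a and b."""
    if a.left != b.left:
        lo = a if a.left < b.left else b
        left, lc = lo.left, closed_left(lo)
    else:
        left, lc = a.left, closed_left(a) or closed_left(b)
    if a.right != b.right:
        hi = a if a.right > b.right else b
        right, rc = hi.right, closed_right(hi)
    else:
        right, rc = a.right, closed_right(a) or closed_right(b)
    return Iv(left, right, _NAMES[(lc, rc)])


print(strictly_left(Iv(1, 10, "left"), Iv(100, 110, "left")))
print(adjacent(Iv(1.1, 2.2, "left"), Iv(2.2, 3.3, "left")))
print(range_merge(Iv(1, 2, "left"), Iv(3, 4, "left")))
```

All three return either a boolean or exactly one interval, so none of them runs into the mixed `closed` container problem.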

Examples of how PostgreSQL handles non-trivial range operations:

intersection:

```
hal=> select int8range(4,8) * int8range(10,20);
 ?column?
----------
 empty
(1 row)
```

difference:

```
hal=> select int8range(4,8) - int8range(5,7);
ERROR:  result of range difference would not be contiguous
```

union:

```
hal=> select int8range(4,8) + int8range(10,20);
ERROR:  result of range union would not be contiguous
```

I think we could implement behaviour similar to casting functions, where the `difference` & `union` functions would have an `errors` parameter, with the following options: `raise|coerce|first|last|greatest|smallest`

• raise: raises an error
• coerce: sets the value to np.nan
• first: returns the first interval in the result array
• last: returns the last interval in the result array
• greatest: returns the biggest interval in the result array
• smallest: returns the smallest interval in the result array

`greatest` & `smallest` could be useful options for the difference operation.
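A rough sketch of how such an `errors` option might be applied to a non-contiguous result is shown below. Everything here is hypothetical: `resolve` and `Iv` are illustrative names, not pandas API; "greatest" and "smallest" are interpreted as longest and shortest by `right - left`; and returning NaN for an empty result is an assumption.

```python
import math
from typing import List, NamedTuple, Union


class Iv(NamedTuple):
    """Hypothetical stand-in with the same left/right/closed fields as pd.Interval."""
    left: float
    right: float
    closed: str = "right"  # one of 'left', 'right', 'both', 'neither'


def _length(iv: Iv) -> float:
    return iv.right - iv.left


def resolve(pieces: List[Iv], errors: str = "raise") -> Union[Iv, float]:
    """Collapse a list of result intervals to a single value per the `errors` policy."""
    if not pieces:
        return math.nan  # assumption: an empty result is reported as NaN
    if len(pieces) == 1:
        return pieces[0]  # contiguous result, no policy needed
    if errors == "raise":
        raise ValueError("result would not be contiguous")
    if errors == "coerce":
        return math.nan
    if errors == "first":
        return pieces[0]
    if errors == "last":
        return pieces[-1]
    if errors == "greatest":
        return max(pieces, key=_length)
    if errors == "smallest":
        return min(pieces, key=_length)
    raise ValueError(f"unknown errors option: {errors!r}")


# Two pieces, as a difference on nested intervals might produce.
pieces = [Iv(0, 1, "left"), Iv(2, 5, "right")]
print(resolve(pieces, errors="greatest"))
```

With this shape, the set operation itself can always compute the full list of pieces, and the policy is applied as a final step.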

