# BUG: Infinite recursion on IntervalIndex constructor due to integer overflow #25485

Closed
opened this issue Feb 28, 2019 · 2 comments

### kingsykes commented Feb 28, 2019 • edited

#### Code Sample:

```import numpy as np
import pandas as pd

# construct IntervalIndex from timestamps
# --> Length of index is > 100, the default leaf_size
# --> index is heavily skewed towards Timstamp.max
idx = pd.IntervalIndex.from_arrays(pd.date_range('2018-01-01', '2018-12-31', freq='D'), [pd.Timestamp.max]*365)

# type conversion from IntervalIndex._engine
left = idx._maybe_convert_i8(idx.left)
right = idx._maybe_convert_i8(idx.right)

# pivot point calculation from IntervalNode constructor, when n_elements > leaf_size
pivot = np.median(left + right) / 2

print(pd.Timestamp(pivot))
# >>> 1848-02-11 00:06:21.572612096
# This is below the minimum left-side value, which should not be

print(pd.Timestamp(np.median(idx.mid.asi8)))
# >>> 2140-05-21 23:53:38.427387904
# This is the correct midpoint
# NB: with my own data, accessing `mid` throws OverflowError: Overflow in int64 addition

# proposed alternate logic for midpoint
alternate = np.median(left/2 + right/2)

print(pd.Timestamp(alternate)
# >>> 2140-05-21 23:53:38.427387904
# Matches the correct midpoint, and should never over/under flow```

#### Description

Current behavior causes a hard crash in the python process under certain conditions:

• Length of the index is > 100, the default leaf_size of the IntervalTree
• The intervals are heavily skewed towards the upper bound of the type (for signed types I believe this can also be the lower bound, but I haven't verified it)

Having an index length greater than the leaf_size parameter, causes the the IndexNode constructor to create child nodes. Currently the default leaf_size is 100 and is not exposed to the user API. Instantiating an IntervalTree directly and setting the leaf_size > index length does not lead to a crash.

By emulating the `IntervalNode.classify_intervals` method in the REPL, I found that the entire index was assigned to the right-hand child node, because the pivot was calculated on an array of overflowed integers, and that the constructor would never stop recursing.

As an additional note, I also encountered an `OverflowError` when accessing the `mid` attribute of the `IntervalIndex` with my own data, but this does not occur in the code sample above. I traced it back through the `length` attribute, where left bounds are subtracted from right bounds.

```idx.right - idx.left
# >>> Traceback (most recent call last):
# >>>   File "<input>", line 1, in <module>
# >>>   File "~\pandas\core\indexes\datetimelike.py", line 501, in __sub__
# >>>     result = self._data.__sub__(maybe_unwrap_index(other))
# >>>   File "~\pandas\core\arrays\datetimelike.py", line 1275, in __sub__
# >>>     result = self._sub_datetime_arraylike(other)
# >>>   File "~\pandas\core\arrays\datetimes.py", line 724, in _sub_datetime_arraylike
# >>>   File "~\pandas\core\algorithms.py", line 938, in checked_add_with_arr
# >>>     raise OverflowError("Overflow in int64 addition")
# >>> OverflowError: Overflow in int64 addition```

#### Expected Output

Changing the pivot calculation from
`self.pivot = np.median(left + right) / 2`
to
`self.pivot = np.median(left/2 + right/2)`
should protect against over and underflows.

#### Output of `pd.show_versions()`

```INSTALLED VERSIONS
------------------
commit: None
python: 3.5.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
------------------
pandas: 0.24.1
pytest: None
pip: 10.0.1
setuptools: 40.2.0
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
gcsfs: None
```

Member

### gfyoung commented Mar 1, 2019 • edited

 @kingsykes : Thanks for reporting this! Unless there is pushback, please feel free to open a PR with your patches and add a test to confirm working behavior.
This was referenced Mar 1, 2019
Member

### jschendel commented Mar 1, 2019

 Thanks for the detailed report @kingsykes! I've created a PR (#25498) with your proposed change to the pivot calculation. I've also created a separate issue (#25499) for your additional note on the `mid` attribute in order to have that fully documented in a separate location.