Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Infinite recursion on IntervalIndex constructor due to integer overflow #25485

Closed
kingsykes opened this issue Feb 28, 2019 · 2 comments

Comments

@kingsykes
Copy link

commented Feb 28, 2019

Code Sample:

import numpy as np
import pandas as pd

# construct IntervalIndex from timestamps
# --> Length of index is > 100, the default leaf_size
# --> index is heavily skewed towards Timstamp.max
idx = pd.IntervalIndex.from_arrays(pd.date_range('2018-01-01', '2018-12-31', freq='D'), [pd.Timestamp.max]*365)

# type conversion from IntervalIndex._engine
left = idx._maybe_convert_i8(idx.left)
right = idx._maybe_convert_i8(idx.right)

# pivot point calculation from IntervalNode constructor, when n_elements > leaf_size
pivot = np.median(left + right) / 2

print(pd.Timestamp(pivot))
# >>> 1848-02-11 00:06:21.572612096
# This is below the minimum left-side value, which should not be

print(pd.Timestamp(np.median(idx.mid.asi8)))
# >>> 2140-05-21 23:53:38.427387904
# This is the correct midpoint
# NB: with my own data, accessing `mid` throws OverflowError: Overflow in int64 addition

# proposed alternate logic for midpoint
alternate = np.median(left/2 + right/2)

print(pd.Timestamp(alternate)
# >>> 2140-05-21 23:53:38.427387904
# Matches the correct midpoint, and should never over/under flow

Description

Current behavior causes a hard crash in the python process under certain conditions:

  • Length of the index is > 100, the default leaf_size of the IntervalTree
  • The intervals are heavily skewed towards the upper bound of the type (for signed types I believe this can also be the lower bound, but I haven't verified it)

Having an index length greater than the leaf_size parameter, causes the the IndexNode constructor to create child nodes. Currently the default leaf_size is 100 and is not exposed to the user API. Instantiating an IntervalTree directly and setting the leaf_size > index length does not lead to a crash.

By emulating the IntervalNode.classify_intervals method in the REPL, I found that the entire index was assigned to the right-hand child node, because the pivot was calculated on an array of overflowed integers, and that the constructor would never stop recursing.

As an additional note, I also encountered an OverflowError when accessing the mid attribute of the IntervalIndex with my own data, but this does not occur in the code sample above. I traced it back through the length attribute, where left bounds are subtracted from right bounds.

idx.right - idx.left
# >>> Traceback (most recent call last):
# >>>   File "<input>", line 1, in <module>
# >>>   File "~\pandas\core\indexes\datetimelike.py", line 501, in __sub__
# >>>     result = self._data.__sub__(maybe_unwrap_index(other))
# >>>   File "~\pandas\core\arrays\datetimelike.py", line 1275, in __sub__
# >>>     result = self._sub_datetime_arraylike(other)
# >>>   File "~\pandas\core\arrays\datetimes.py", line 724, in _sub_datetime_arraylike
# >>>     arr_mask=self._isnan)
# >>>   File "~\pandas\core\algorithms.py", line 938, in checked_add_with_arr
# >>>     raise OverflowError("Overflow in int64 addition")
# >>> OverflowError: Overflow in int64 addition

Expected Output

Changing the pivot calculation from
self.pivot = np.median(left + right) / 2
to
self.pivot = np.median(left/2 + right/2)
should protect against over and underflows.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
------------------
pandas: 0.24.1
pytest: None
pip: 10.0.1
setuptools: 40.2.0
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
@gfyoung

This comment has been minimized.

Copy link
Member

commented Mar 1, 2019

cc @TomAugspurger

@kingsykes : Thanks for reporting this! Unless there is pushback, please feel free to open a PR with your patches and add a test to confirm working behavior.

@jschendel

This comment has been minimized.

Copy link
Member

commented Mar 1, 2019

Thanks for the detailed report @kingsykes!

I've created a PR (#25498) with your proposed change to the pivot calculation. I've also created a separate issue (#25499) for your additional note on the mid attribute in order to have that fully documented in a separate location.

@jreback jreback modified the milestones: 0.25.0, 0.24.2 Mar 3, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.