BUG: Series.groupby returns an error if index is float and size is >= 1000000 #35788

Closed
2 of 3 tasks
MarcosCarreira opened this issue Aug 18, 2020 · 1 comment · Fixed by #35999
Labels
Bug Index Related to the Index class or subclasses Indexing Related to indexing on series/frames, not to indexes themselves
MarcosCarreira commented Aug 18, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

# Imports
import numpy as np
import pandas as pd

# Values
nsteps = 5*10**6
sv = np.random.normal(loc=100.0, scale=1.0, size=nsteps+1)

# Series with int index
se = pd.Series(sv)

# Series with float index
flind = np.arange(nsteps+1)/nsteps
sef = pd.Series(sv, index=flind)

# Group with int index works
seg = se.groupby(level=0).last()

# Group with float index and size<1000000 works
sefg = sef.iloc[:999999].groupby(level=0).last()

# Group with float index and size>=1000000 doesn't work
sefg2 = sef.iloc[:1000000].groupby(level=0).last()

Problem description

Series.groupby raises a TypeError if the index is float and the size is >= 1000000 (the same size works if the index is int):

File "pandas/_libs/index.pyx", line 345, in pandas._libs.index._bin_search
TypeError: '<' not supported between instances of 'float' and 'NoneType'

Expected Output

sefg2 should match sefg, with one additional row.
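
As a rough sanity check of that expectation, here is a minimal sketch on a much smaller float index (the size `n = 1_000` and the names `series` and `grouped` are assumptions chosen for illustration, not part of the original report). Because every index value is unique, grouping by the index level and taking `last()` should return one row per original row with the same values:

```python
import numpy as np
import pandas as pd

# Deliberately small size (an assumption for illustration), far below the
# 1000000 rows at which the report says the failure appears.
n = 1_000
values = np.random.default_rng(0).normal(loc=100.0, scale=1.0, size=n + 1)

# Float index built the same way as in the report: evenly spaced in [0, 1].
float_index = np.arange(n + 1) / n
series = pd.Series(values, index=float_index)

# Every index value is unique, so each group holds exactly one row.
grouped = series.groupby(level=0).last()

print(len(grouped) == len(series))
print(np.array_equal(grouped.to_numpy(), series.to_numpy()))
```

The same relationship is what the report expects to hold for the 1000000-row slice.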

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.7.7.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.1.0.post20200704
Cython : None
pytest : 5.4.3
hypothesis : None
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.50.1

@MarcosCarreira MarcosCarreira added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 18, 2020
WillAyd (Member) commented Aug 18, 2020

Probably an issue with how we are hashing index objects. Here's a line that is most likely related:

_SIZE_CUTOFF = 1_000_000

I don't know the history of that cutoff, but investigation and PRs are certainly welcome.
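
To illustrate how a size cutoff could produce this kind of error, here is a toy sketch in plain Python. It is not pandas' actual implementation: the `contains` helper and the tiny `SIZE_CUTOFF` value are hypothetical, and it assumes (based only on the traceback) that the failing path is a binary search that ends up comparing a float with `None`. Below the cutoff the lookup uses hashing, which never orders the keys; at or above it, the lookup switches to binary search, which does:

```python
import bisect

# Hypothetical, tiny stand-in for the 1_000_000 cutoff quoted above.
SIZE_CUTOFF = 5


def contains(sorted_keys, key):
    """Toy membership test: hashing below the cutoff, binary search otherwise."""
    if len(sorted_keys) < SIZE_CUTOFF:
        # Hash-based lookup: only needs hashing and equality, never `<`.
        return key in set(sorted_keys)
    # Binary search: compares stored elements against the key with `<`.
    pos = bisect.bisect_left(sorted_keys, key)
    return pos < len(sorted_keys) and sorted_keys[pos] == key


small = [0.0, 0.25, 0.5, 0.75]          # below the toy cutoff
large = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]  # at or above the toy cutoff

small_result = contains(small, None)    # hash path handles None without error

try:
    contains(large, None)
    large_error = None
except TypeError as exc:
    # The binary-search path fails when ordering a float against None.
    large_error = str(exc)

print(small_result)
print(large_error)
```

If the real cause is along these lines, the fix would be in how the large-index path treats keys that cannot be ordered against floats, rather than in the cutoff value itself.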

@WillAyd WillAyd added Index Related to the Index class or subclasses Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 18, 2020
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Aug 19, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.1.2 Sep 1, 2020