Rank Mixes np.nan with np.inf values #19538

Closed
WillAyd opened this Issue Feb 5, 2018 · 4 comments

@WillAyd
Member

WillAyd commented Feb 5, 2018

Code Sample, a copy-pastable example if possible

In []: df = pd.DataFrame([1, np.nan, np.inf, -np.inf, 25])
In []: df.rank()
Out []:
     0
0  2.0
1  NaN
2  NaN
3  1.0
4  3.0

In []: df.rank(ascending=False)
Out []:
     0
0  3.0
1  NaN
2  1.0
3  NaN
4  2.0

Problem description

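Depending on the sort direction, either np.inf or -np.inf gets grouped with np.nan in the rank operation: with the default ascending order np.inf is ranked as NaN, and with ascending=False -np.inf is. Ideally, infinite values would be ranked entirely separately from np.nan.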
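
To make the expected behavior concrete, here is a minimal pure-Python sketch of a rank that leaves NaN as NaN and treats infinities as ordinary extreme values. The function name `expected_rank` is hypothetical and this is not how pandas implements ranking; it only mirrors the default tie handling (`method='average'`) for illustration.

```python
import math


def expected_rank(values, ascending=True):
    """Average-method rank: NaN stays NaN, +/-inf rank as ordinary extremes.

    Hypothetical reference helper for illustration, not pandas code.
    """
    # Keep only non-NaN values, remembering where each one came from.
    valid = [(v, i) for i, v in enumerate(values) if not math.isnan(v)]
    valid.sort(key=lambda pair: pair[0], reverse=not ascending)

    ranks = [math.nan] * len(values)
    pos = 0
    while pos < len(valid):
        # Extend `end` over the run of values tied with valid[pos].
        end = pos
        while end + 1 < len(valid) and valid[end + 1][0] == valid[pos][0]:
            end += 1
        average = (pos + end) / 2 + 1  # ranks are 1-based
        for _, original_index in valid[pos:end + 1]:
            ranks[original_index] = average
        pos = end + 1
    return ranks


data = [1, math.nan, math.inf, -math.inf, 25]
print(expected_rank(data))                   # [2.0, nan, 4.0, 1.0, 3.0]
print(expected_rank(data, ascending=False))  # [3.0, nan, 1.0, 4.0, 2.0]
```

Under this definition only the real NaN at position 1 is left unranked, and np.inf and -np.inf take the two ends of the ordering in both directions.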

@WillAyd WillAyd referenced this issue Feb 5, 2018

Merged

PERF: Cythonize Groupby Rank #19481

4 of 4 tasks complete
@jreback

Contributor

jreback commented Feb 6, 2018

See #6945 / #17903. This was just addressed in master; is this another case?

cc @peterpanmj

@jreback jreback added the Numeric label Feb 6, 2018

@WillAyd

Member

WillAyd commented Feb 6, 2018

Interesting... From glancing at it, I think the problem is that the PR only updated the rank_1d_ methods in algos. As a result, you get the desired behavior from a Series object but not from a DataFrame or GroupBy object. There's also a bug in how it handles values in descending order. All of this is shown below:

In []: df[0].rank()  # this works
Out[]: 
0    2.0
1    NaN
2    4.0
3    1.0
4    3.0

In []: df[0].rank(ascending=False)  # handles NaN appropriately, but incorrectly gives np.inf and -np.inf the same rank
Out[]: 
0    3.0
1    NaN
2    1.0
3    1.0
4    2.0

In []: df['key'] = ['foo'] * 5
In []: df.groupby('key').rank()  # doesn't handle missing values appropriately
Out[]: 
     0
0  2.0
1  NaN
2  NaN
3  1.0
4  3.0
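
The comparison above can be collected into one script that ranks the same values through all three code paths and reports whether they agree. This is only a sketch: the helper name `rank_by_code_path` and the column names `value` and `key` are made up for illustration, and the output depends on the installed pandas version.

```python
import numpy as np
import pandas as pd


def rank_by_code_path(values, ascending=True):
    """Rank the same values via Series, DataFrame and GroupBy (hypothetical helper)."""
    df = pd.DataFrame({"value": values, "key": ["foo"] * len(values)})
    return pd.DataFrame({
        "series": df["value"].rank(ascending=ascending),
        "frame": df[["value"]].rank(ascending=ascending)["value"],
        "groupby": df.groupby("key")["value"].rank(ascending=ascending),
    })


values = [1, np.nan, np.inf, -np.inf, 25]
for ascending in (True, False):
    ranks = rank_by_code_path(values, ascending=ascending)
    # A row is consistent when all three paths give one distinct value
    # (dropna=False makes an all-NaN row count as a single value).
    agree = ranks.nunique(axis=1, dropna=False).eq(1).all()
    print(f"ascending={ascending}: all paths agree = {agree}")
    print(ranks)
```

Any row where the three columns differ points at a code path that still treats an infinite value as missing.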

INSTALLED VERSIONS

commit: 93c86aa
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0.dev0+238.g93c86aa13.dirty
pytest: 3.2.5
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.26
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@peterpanmj

Contributor

peterpanmj commented Feb 9, 2018

This is the same issue as #6945, and it seems that #17903 does not completely fix it when ascending is False. I will look into it.
The GroupBy handling of NaN values is a separate matter:

In []: df['key'] = ['foo'] * 5
In []: df.groupby('key').rank()  # doesn't handle missing values appropriately
Out[]: 
     0
0  2.0
1  NaN
2  NaN
3  1.0
4  3.0

It should be related to #3729.

@WillAyd

Member

WillAyd commented Feb 9, 2018

@peterpanmj whatever solution you come up with in algos, it shouldn't be too different to port to the groupby_helper.pyx.in file so that GroupBy handles NaN consistently as well (I'm touching the latter in #19481). Ideally these would be consistent.
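
Until the two code paths are consistent, one way to get Series-style ranking inside groups is to call Series.rank on each group through `transform`. This is a user-level workaround sketch, not the Cython change discussed above; the helper name `grouped_rank_via_series` and the column names are assumptions made for illustration, and it will be slower than a native GroupBy rank.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["foo"] * 5,
    "value": [1, np.nan, np.inf, -np.inf, 25],
})


def grouped_rank_via_series(frame, by, column, **rank_kwargs):
    """Hypothetical helper: rank within groups by calling Series.rank per group."""
    # transform applies the function to each group's Series and
    # realigns the results to the original index.
    return frame.groupby(by)[column].transform(lambda s: s.rank(**rank_kwargs))


print(grouped_rank_via_series(df, "key", "value"))
print(grouped_rank_via_series(df, "key", "value", ascending=False))
```

Because each group goes through the same routine as a plain Series, the result inherits whatever NaN and infinity handling Series.rank has in the installed version.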
