New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupBy Rank Operations With Infinity Incorrect #20561

Closed
WillAyd opened this Issue Mar 30, 2018 · 5 comments

Comments

Projects
None yet
3 participants
@WillAyd
Member

WillAyd commented Mar 30, 2018

xref #20091 @peterpanmj

In [51]: df = pd.DataFrame([1, np.nan, np.inf, -np.inf, 25])

In [52]: df['key'] = 'foo'

In [53]: df.groupby("key").rank()      # not working properly for infinity
Out[53]:
     0
0  2.0
1  NaN
2  NaN
3  1.0
4  3.0

In [54]: df.rank()     # this is ok
Out[54]:
     0  key
0  2.0  3.0
1  NaN  3.0
2  4.0  3.0
3  1.0  3.0
4  3.0  3.0

In [55]: df.groupby("key").apply(lambda x:x.rank())   # this is also ok
Out[55]:
     0  key
0  2.0  3.0
1  NaN  3.0
2  4.0  3.0
3  1.0  3.0
4  3.0  3.0
@peterpanmj

This comment has been minimized.

Contributor

peterpanmj commented Apr 1, 2018

At line 723, groupby_helper.pxi

            if (i == N - 1 or (
                    (masked_vals[_as[i]] != masked_vals[_as[i+1]]) and not
                    (mask[_as[i]] and mask[_as[i+1]]))):

This is the cause of it. We cannot separate infs from nans.
When inf and nan are stacked together after the lexsort, this logic won't catch the transition point,
It is always True that, Masked_vals[_as[i]] == masked_vals[_as[i+1]], because they are all infinity (one real inf, one filled inf).

@WillAyd

This comment has been minimized.

Member

WillAyd commented Apr 1, 2018

@peterpanmj doesn't the mask array still differentiate between np.nan and np.inf? Your comment is correct in that the masked_vals array obfuscates that different because it uses np.inf as a fill value, but I'm assuming mask would still help you distinguish the two, right?

@peterpanmj

This comment has been minimized.

Contributor

peterpanmj commented Apr 2, 2018

Yes, mask should distinguish the two. However the code combines the two conditions with an and. When
not (mask[_as[i]] and mask[_as[i+1]]) is True at the transition point from real infs to filled infs , (masked_vals[_as[i]] != masked_vals[_as[i+1]]) is False. The result is a False. It missed it.

@WillAyd

This comment has been minimized.

Member

WillAyd commented Apr 2, 2018

So would it make sense to have those be separate conditions then? Something along the lines of

if masked_vals[_as[i]] != masked_vals[_as[i+1]]:  
    if mask[_as[i]] or mask[_as[i+1]:
        # Either nan or inf here...
        mask[_as[i]] and mask[_as[i+1]:
            # This would mean both are nan
        else:
            # One is inf
    else:
        # Not dealing with nan or inf

Not saying the above is the right approach and couldn't be refactored but think its in the right direction. All of the elements are there so just need to piece together appropriately

@peterpanmj

This comment has been minimized.

Contributor

peterpanmj commented Apr 2, 2018

            if (i == N - 1 or         # condition 1
                    (masked_vals[_as[i]] != masked_vals[_as[i+1]]) or   # condition 2
                    (mask[_as[i]] ^ mask[_as[i+1]])):   # condition 3
  1. End of Arrary,
  2. A new value occurs
  3. Transition from nans to not-nans ( or the reverse depends on na_option and ascending)
    I think condition 2 and 3 are independent to each others. Haven't tested it though.

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Apr 14, 2018

@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 21, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment