Fixed segfaults and incorrect results in GroupBy.quantile with NA Values in Grouping #29173

Merged (7 commits) on Oct 30, 2019

Conversation

@WillAyd commented Oct 22, 2019

@WillAyd added the Groupby and Segfault (Non-Recoverable Error) labels Oct 22, 2019
@jbrockmendel added the quantile (quantile method) label Oct 22, 2019

# Random segfaults; would have been guaranteed in loop
grp = df.groupby("key")
for _ in range(100):
Member

I take it this doesn't reliably segfault on the first try?

Member

... or I could just read the comment three lines up. Never mind.

def test_quantile_missing_group_values_correct_results():
    # GH 28662
    data = np.array([1.0, np.nan, 3.0, np.nan])
    df = pd.DataFrame(dict(key=data, val=range(4)))
Member

Any reason not to re-use the setup from the other test? We could just assert the result after the loop.

Member Author

Different frame shapes. I think the segfaults were occurring with odd-sized frames and the bad results with even-sized ones. There might be some other confounding factor.

@jreback added this to the 1.0 milestone Oct 23, 2019
Contributor

@jreback left a comment

LGTM. I suppose we could mark this for a potential 0.25.3, but we aren't planning one as of yet @pandas-dev/pandas-core

@WillAyd
Member Author

WillAyd commented Oct 25, 2019

All green on this. FWIW I think this is worth a 0.25.3 release since it's a regression, and I'm happy to volunteer for that if others agree.

@jbrockmendel
Member

lgtm cc @jreback


result = df.groupby("key").quantile()
expected = pd.DataFrame(
    [1.0, 3.0], index=pd.Index([1.0, 3.0], name="key"), columns=["val"]
Member

The expected values should be [0, 2]. Also, on 0.24.2 the columns index name was the q value; not sure if this is important.

>>> pd.__version__
'0.24.2'
>>>
>>> import numpy as np
>>>
>>> data = np.array([1.0, np.nan, 3.0, np.nan])
>>> df = pd.DataFrame(dict(key=data, val=range(4)))
>>>
>>>
>>> df.groupby("key").quantile()
0.5  val
key
1.0  0.0
3.0  2.0

Member Author

Keep in mind that this is a groupby and each element belongs to its own group. [1, 3] is essentially the identity of the non-NA groupings and should be correct.

W.r.t. the column index name, that probably goes back to https://github.com/pandas-dev/pandas/pull/20405/files#r208366338, which was an inconsistency in 0.24.2.
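
For reference, the values discussed in this thread can be checked by hand. The following is a minimal pure-Python sketch, not the pandas implementation: `quantile_linear` and `grouped_quantile` are hypothetical helper names, and the sketch assumes that rows with a NaN key are dropped from the grouping and that the quantile uses linear interpolation (the pandas default).

```python
import math


def quantile_linear(values, q=0.5):
    """Quantile of a non-empty list using linear interpolation."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)  # fractional position in the sorted data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


def grouped_quantile(keys, vals, q=0.5):
    """Hypothetical reference: quantile of vals per key, skipping NaN keys."""
    groups = {}
    for key, val in zip(keys, vals):
        if isinstance(key, float) and math.isnan(key):
            continue  # assumption: a row with an NA key belongs to no group
        groups.setdefault(key, []).append(float(val))
    return {key: quantile_linear(group, q) for key, group in sorted(groups.items())}


# Same data as test_quantile_missing_group_values_correct_results
keys = [1.0, float("nan"), 3.0, float("nan")]
vals = range(4)
result = grouped_quantile(keys, vals)
print(result)  # {1.0: 0.0, 3.0: 2.0}
```

Under those assumptions each non-NA key holds a single `val`, so the median is that value itself: 0.0 for key 1.0 and 2.0 for key 3.0, which matches the 0.24.2 output quoted above.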

@jreback
Contributor

jreback commented Oct 30, 2019

> All green on this. FWIW I think this is worth a 0.25.3 release since it's a regression, and I'm happy to volunteer for that if others agree.

@jorisvandenbossche @TomAugspurger

ok by me

@TomAugspurger
Contributor

TomAugspurger commented Oct 30, 2019 via email

@WillAyd merged commit d0fe636 into pandas-dev:master Oct 30, 2019
@WillAyd
Member Author

WillAyd commented Oct 30, 2019

Sounds good. I'll do the backport and start on the release today / tomorrow. I'll reach out as questions come up.

@WillAyd deleted the quantile-segfault branch October 30, 2019 18:29
@@ -410,6 +410,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrame.groupby` not offering selection by column name when ``axis=1`` (:issue:`27614`)
- Bug in :meth:`DataFrameGroupby.agg` not able to use lambda function with named aggregation (:issue:`27519`)
- Bug in :meth:`DataFrame.groupby` losing column name information when grouping by a categorical column (:issue:`28787`)
- Bug in :meth:`DataFrameGroupBy.quantile` where NA values in the grouping could cause segfaults or incorrect results (:issue:`28882`)
Member Author

Should have removed this before merging; will clean up after backport

WillAyd added a commit to WillAyd/pandas that referenced this pull request Oct 31, 2019
…n GroupBy.quantile with NA Values in Grouping
WillAyd added a commit to WillAyd/pandas that referenced this pull request Oct 31, 2019
jreback pushed a commit that referenced this pull request Oct 31, 2019
* Backport PR #27826 for 0.25.3 release

* BUG: Fix groupby quantile segfault

Validate that q is between 0 and 1.

Closes #27470

* prettier

* Backport PR #29173 for 0.25.3 release

* Backport PR #29296 for 0.25.3 release

* Backport PR #29294 for 0.25.3 whatsnew
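
The "Validate that q is between 0 and 1" step in the backported commit message above could look roughly like the sketch below. `validate_q` is a hypothetical helper name used for illustration only; the actual check in pandas may be placed and worded differently.

```python
def validate_q(q):
    """Hypothetical helper: reject quantile levels outside [0, 1]."""
    # The chained comparison is False for NaN as well, so NaN is rejected too.
    if not 0 <= q <= 1:
        raise ValueError(f"'q' must be between 0 and 1. Got '{q}' instead")
    return float(q)


print(validate_q(0.5))  # 0.5
```

Rejecting an out-of-range `q` before any computation means a bad argument surfaces as a Python exception instead of reaching lower-level code with an invalid position.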
Reksbril pushed a commit to Reksbril/pandas that referenced this pull request Nov 18, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Labels
Groupby; quantile (quantile method); Segfault (Non-Recoverable Error)
Development

Successfully merging this pull request may close these issues.

Crash during groupby quantile
5 participants