Ensure valid Block mutation in SeriesBinGrouper. #32561

TomAugspurger · 2020-03-09T19:16:11Z

This "fixes" #31802 by expanding the number of cases where we swallow an
exception in libreduction. Currently, we're creating an invalid Series
in SeriesBinGrouper where the .mgr_locs doesn't match the values. See
#31802 (comment)
for more.

For now, we simply catch more cases that fall back to Python. I've gone
with a minimal change which addresses only issues hitting this exact
exception. We might want to go broader, but that's not clear.
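As a rough illustration of the "catch and fall back to Python" pattern described above, here is a minimal sketch. It is not the pandas code: `aggregate_fast`, `aggregate_python`, `aggregate`, and the message in `FAST_PATH_REFUSAL` are all hypothetical names made up for this example.

```python
# Hypothetical message; stands in for whatever error text the fast path raises.
FAST_PATH_REFUSAL = "fast path cannot handle this result"


def aggregate_fast(values, func):
    """Hypothetical fast path: rejects results that are not scalar-like."""
    result = func(values)
    if isinstance(result, (list, tuple)):
        raise ValueError(FAST_PATH_REFUSAL)
    return result


def aggregate_python(values, func):
    """Hypothetical slow path: accepts whatever the function returns."""
    return func(values)


def aggregate(values, func):
    # Try the fast path first. Only the one known ValueError triggers the
    # fallback; any other ValueError is re-raised unchanged.
    try:
        return aggregate_fast(values, func)
    except ValueError as err:
        if FAST_PATH_REFUSAL not in str(err):
            raise
    return aggregate_python(values, func)
```

The narrow message check is the "minimal change" flavour: it fixes the known failure, but any other error raised inside the fast path still propagates.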

cc @jbrockmendel & @WillAyd

Closes pandas-dev#31802

TomAugspurger · 2020-03-09T19:17:29Z

pandas/tests/groupby/test_bin_groupby.py

+    "func",
+    [
+        cumsum_max,
+        pytest.param(assert_block_lengths, marks=pytest.mark.xfail(reason="debatable")),


Currently we just catch ValueError in https://github.com/pandas-dev/pandas/pull/32561/files#diff-8c0985a9fca770c2028bed688dfc043fR641. "fixing" this would essentially require an except Exception. Do people have an opinion here?

jorisvandenbossche

I am personally -1 on catching such specific error messages like this. This only fixes the exact bug report at hand, not the general issue that things can fail inside the libreduction code in several unforeseen ways.

I would personally just broaden the exception that is swallowed (at least in releases; on the master branch I am fine with being stricter in the hope of catching some bugs).

TomAugspurger · 2020-03-09T19:55:27Z

I could go either way. My long-term hope is to get away from using exceptions as control flow like this. I'm not sure whether except Exception gets us closer or further from that goal (probably further) but it does fix regressions.

jbrockmendel · 2020-03-09T20:03:00Z

Per comment in #31082, would it be viable to check in libreduction before setting incorrectly-shaped block.values?

TomAugspurger · 2020-03-09T20:14:11Z

I don't have a firm grip on when this occurs, but it seems like it should be any time you have a change in the group size. But I might be incorrect.

pandas/_libs/reduction.pyx

TomAugspurger · 2020-03-10T18:30:55Z

ad746ba changed things to also mutate the Block.mgr_locs in addition to the values, so the earlier discussion on what to catch is moot (as far as this PR is concerned).
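To make the invariant concrete, here is a toy sketch of why the placement has to be updated together with the values. `ToyBlock` and `set_group` are hypothetical stand-ins written for this example; they are not pandas internals and do not reflect the actual Cython change.

```python
from dataclasses import dataclass


@dataclass
class ToyBlock:
    """Toy stand-in for a Block: values plus the slots they occupy."""

    values: list
    mgr_locs: slice

    def is_valid(self) -> bool:
        # The placement must describe exactly as many slots as there are values.
        locs = range(self.mgr_locs.start or 0, self.mgr_locs.stop, self.mgr_locs.step or 1)
        return len(locs) == len(self.values)


def set_group(block: ToyBlock, data: list, start: int, end: int) -> None:
    """Point the block at data[start:end], updating both fields together."""
    block.values = data[start:end]
    block.mgr_locs = slice(0, end - start, 1)
```

Assigning only `values` when the group size changes leaves `is_valid()` false; going through `set_group` keeps the two in step.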

pandas/tests/groupby/test_bin_groupby.py

jbrockmendel · 2020-03-10T18:39:26Z

I think some unrelated (already merged in master) edits have snuck in

TomAugspurger · 2020-03-10T18:55:48Z

Fixed the git snafu I think.

doc/source/whatsnew/v1.0.2.rst

jreback

lgtm. conflict in whatsnew, merge on green.

jorisvandenbossche · 2020-03-11T12:21:34Z

pandas/tests/groupby/test_bin_groupby.py

@@ -51,6 +52,30 @@ def test_series_bin_grouper():
    tm.assert_almost_equal(counts, exp_counts)


+def assert_block_lengths(x):
+    assert len(x) == len(x._data.blocks[0].mgr_locs)


If this fails, does the assertion error bubble up?

Yeah

diff --git a/pandas/tests/groupby/test_bin_groupby.py b/pandas/tests/groupby/test_bin_groupby.py
index 152086c241..b6518c1962 100644
--- a/pandas/tests/groupby/test_bin_groupby.py
+++ b/pandas/tests/groupby/test_bin_groupby.py
@@ -53,7 +53,7 @@ def test_series_bin_grouper():
 def assert_block_lengths(x):
-    assert len(x) == len(x._data.blocks[0].mgr_locs)
+    assert len(x) == len(x._data.blocks[0].mgr_locs) + 1
     return 0

_______________ test_mgr_locs_updated[assert_block_lengths] _______________

func = <function assert_block_lengths at 0x122354b90>

    @pytest.mark.parametrize("func", [cumsum_max, assert_block_lengths])
    def test_mgr_locs_updated(func):
        # https://github.com/pandas-dev/pandas/issues/31802
        # Some operations may require creating new blocks, which requires
        # valid mgr_locs
        df = pd.DataFrame({"A": ["a", "a", "a"], "B": ["a", "b", "b"], "C": [1, 1, 1]})
>       result = df.groupby(["A", "B"]).agg(func)

pandas/tests/groupby/test_bin_groupby.py:71:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/groupby/generic.py:939: in aggregate
    return self._python_agg_general(func, *args, **kwargs)
pandas/core/groupby/groupby.py:926: in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
pandas/core/groupby/ops.py:640: in agg_series
    return self._aggregate_series_fast(obj, func)
pandas/core/groupby/ops.py:665: in _aggregate_series_fast
    result, counts = grouper.get_result()
pandas/_libs/reduction.pyx:377: in pandas._libs.reduction.SeriesGrouper.get_result
    res, initialized = self._apply_to_group(cached_typ, cached_ityp,
pandas/_libs/reduction.pyx:195: in pandas._libs.reduction._BaseGrouper._apply_to_group
    res = self.f(cached_typ)
pandas/core/groupby/groupby.py:913: in <lambda>
    f = lambda x: func(x, *args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x = Series([], Name: C, dtype: int64)

    def assert_block_lengths(x):
>       assert len(x) == len(x._data.blocks[0].mgr_locs) + 1
E       assert 1 == (1 + 1)
E        +  where 1 = len(0 1\nName: C, dtype: int64)
E        +  and   1 = len(BlockPlacement(slice(0, 1, 1)))
E        +    where BlockPlacement(slice(0, 1, 1)) = IntBlock: 1 dtype: int64.mgr_locs

pandas/tests/groupby/test_bin_groupby.py:56: AssertionError
=============== 1 failed, 1 passed, 9 deselected in 0.24s ===============
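The behaviour shown in the traceback follows from ordinary Python exception handling: a handler for `ValueError` does not intercept an `AssertionError`. A minimal sketch, using hypothetical helpers (`run_with_fallback`, `failing_check`, `bubbles_up`) that are not part of pandas:

```python
def run_with_fallback(func, group):
    """Hypothetical wrapper that swallows only ValueError."""
    try:
        return func(group)
    except ValueError:
        # Stand-in for "retry on the slow path".
        return None


def failing_check(group):
    # Mirrors the deliberately broken assertion from the diff above,
    # written as an explicit raise so it also fires under `python -O`.
    if len(group) != len(group) + 1:
        raise AssertionError("lengths differ")
    return 0


def bubbles_up():
    """Return True if the AssertionError escapes the ValueError handler."""
    try:
        run_with_fallback(failing_check, [1])
    except AssertionError:
        return True
    return False
```

Widening the handler to `except Exception` would swallow the `AssertionError` as well, which is the trade-off discussed earlier in the thread.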

TomAugspurger · 2020-03-11T14:28:32Z

Has anyone seen the 32bit failure elsewhere?

=================================== FAILURES ===================================
___________________ TestDataFrameAnalytics.test_stat_op_calc ___________________
[gw0] linux -- Python 3.6.7 /home/vsts/miniconda3/envs/pandas-dev/bin/python

    
        def kurt(x):
            from scipy.stats import kurtosis  # noqa:F811
    
            if len(x) < 4:
                return np.nan
            return kurtosis(x, bias=False)
    
        assert_stat_op_calc(
            "nunique",
            nunique,
            float_frame_with_na,
            has_skipna=False,
            check_dtype=False,
            check_dates=True,
        )
    
        # mixed types (with upcasting happening)
        assert_stat_op_calc(
>           "sum", np.sum, mixed_float_frame.astype("float32"), check_dtype=False,
        )

pandas/tests/frame/test_analytics.py:321:

I guess https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=30468&view=logs&j=3a03f79d-0b41-5610-1aa4-b4a014d0bc70&t=4d05ed0e-1ed3-5bff-dd63-1e957f2766a9&l=74 had it on the npdev build. Is that a flaky test?

jbrockmendel · 2020-03-11T14:35:44Z

Has anyone seen the 32bit failure elsewhere?

Yes. Best guess is #32571 caused this, am troubleshooting now

TomAugspurger added the Groupby and Regression (functionality that used to work in a prior pandas version) labels Mar 9, 2020

TomAugspurger added this to the 1.0.2 milestone Mar 9, 2020

jorisvandenbossche reviewed Mar 9, 2020

update mgr_locs

ad746ba

jbrockmendel reviewed Mar 9, 2020

pandas/_libs/reduction.pyx

TomAugspurger and others added 3 commits March 9, 2020 20:35

revert

922b30d

TST: separate out pd.crosstab tests from test_pivot (pandas-dev#32536)

f63acd3

CLN: remove Categorical.put (pandas-dev#32554)

7e49bd5

TomAugspurger changed the title ~~REGR: Expand ValueError catching in series aggregate~~ Ensure valid Block mutation in SeriesBinGroupr. Mar 10, 2020

TomAugspurger changed the title ~~Ensure valid Block mutation in SeriesBinGroupr.~~ Ensure valid Block mutation in SeriesBinGrouper. Mar 10, 2020

jbrockmendel reviewed Mar 10, 2020

pandas/tests/groupby/test_bin_groupby.py

Merge remote-tracking branch 'upstream/master' into 31802-groupby-regression

26ecad8

jreback reviewed Mar 11, 2020

doc/source/whatsnew/v1.0.2.rst

jreback approved these changes Mar 11, 2020

Merge remote-tracking branch 'upstream/master' into 31802-groupby-regression

cb5d20f

jorisvandenbossche reviewed Mar 11, 2020

jorisvandenbossche approved these changes Mar 11, 2020

Merge remote-tracking branch 'upstream/master' into 31802-groupby-regression

4649033

Merge remote-tracking branch 'upstream/master' into 31802-groupby-regression

7060dd3

jreback merged commit ecb5b57 into pandas-dev:master Mar 11, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 11, 2020

Backport PR pandas-dev#32561: Ensure valid Block mutation in SeriesBinGrouper.

8625282

meeseeksmachine mentioned this pull request Mar 11, 2020

Backport PR #32561 on branch 1.0.x (Ensure valid Block mutation in SeriesBinGrouper.) #32635

Merged

TomAugspurger added a commit that referenced this pull request Mar 11, 2020

Backport PR #32561: Ensure valid Block mutation in SeriesBinGrouper. (#32635)

Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>

4dae4ab

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020

Ensure valid Block mutation in SeriesBinGrouper. (pandas-dev#32561)

1223029
