Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure valid Block mutation in SeriesBinGrouper. #32561

Merged
merged 9 commits into from Mar 11, 2020

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Mar 9, 2020

Closes #31802

This "fixes" #31802 by expanding the number of cases where we swallow an
exception in libreduction. Currently, we're creating an invalid Series
in SeriesBinGrouper where the .mgr_locs doesn't match the values. See
#31802 (comment)
for more.

For now, we simply catch more cases that fall back to Python. I've gone
with a minimal change which addresses only issues hitting this exact
exception. We might want to go broader, but that's not clear.

cc @jbrockmendel & @WillAyd

Closes pandas-dev#31802

This "fixes" pandas-dev#31802 by expanding the number of cases where we swallow an
exception in libreduction. Currently, we're creating an invalid Series
in SeriesBinGrouper where the `.mgr_locs` doesn't match the values. See
pandas-dev#31802 (comment)
for more.

For now, we simply catch more cases that fall back to Python. I've gone
with a minimal change which addresses only issues hitting this exact
exception. We might want to go broader, but that's not clear.
@TomAugspurger TomAugspurger added Groupby Regression Functionality that used to work in a prior pandas version labels Mar 9, 2020
@TomAugspurger TomAugspurger added this to the 1.0.2 milestone Mar 9, 2020
"func",
[
cumsum_max,
pytest.param(assert_block_lengths, marks=pytest.mark.xfail(reason="debatable")),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we just catch ValueError in https://github.com/pandas-dev/pandas/pull/32561/files#diff-8c0985a9fca770c2028bed688dfc043fR641. "fixing" this would essentially require an except Exception. Do people have an opinion here?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am personally -1 on catching such specific error messages like this. This only fixes the exact bug report at hand, not the general issue that things can fail inside the libreduction code in several unforeseen ways.

I would personally just broaden the exception that is swallowed (or at least in releases, in master branch I am fine with being more strict in the hope to catch some bugs)

@TomAugspurger
Copy link
Contributor Author

I could go either way. My long-term hope is to get away from using exceptions as control flow like this. I'm not sure whether except Exception gets us closer or further from that goal (probably further) but it does fix regressions.

@jbrockmendel
Copy link
Member

Per comment in #31082, would it be viable to check in libreduction before setting incorrectly-shaped block.values?

@TomAugspurger
Copy link
Contributor Author

I don't have a firm grip on when this occurs, but it seems like it should be any time you have a change in the group size. But I might be incorrect.

@TomAugspurger TomAugspurger changed the title REGR: Expand ValueError catching in series aggregate Ensure valid Block mutation in SeriesBinGroupr. Mar 10, 2020
@TomAugspurger TomAugspurger changed the title Ensure valid Block mutation in SeriesBinGroupr. Ensure valid Block mutation in SeriesBinGrouper. Mar 10, 2020
@TomAugspurger
Copy link
Contributor Author

ad746ba changed things to also mutate the Block.mgr_locs in addition to the values, so the earlier discussion on what to catch is moot (as far as this PR is concerned).

@jbrockmendel
Copy link
Member

I think some unrelated (already merged in master) edits have snuck in

@TomAugspurger
Copy link
Contributor Author

Fixed the git snafu I think.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. conflict in whatsnew, merge on green.

@@ -51,6 +52,30 @@ def test_series_bin_grouper():
tm.assert_almost_equal(counts, exp_counts)


def assert_block_lengths(x):
assert len(x) == len(x._data.blocks[0].mgr_locs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fails, the assertion errors bubbles up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah

diff --git a/pandas/tests/groupby/test_bin_groupby.py b/pandas/tests/groupby/test_bin_groupby.py
index 152086c241..b6518c1962 100644
--- a/pandas/tests/groupby/test_bin_groupby.py
+++ b/pandas/tests/groupby/test_bin_groupby.py
@@ -53,7 +53,7 @@ def test_series_bin_grouper():
 
 
 def assert_block_lengths(x):
-    assert len(x) == len(x._data.blocks[0].mgr_locs)
+    assert len(x) == len(x._data.blocks[0].mgr_locs) + 1
     return 0
___________________________________________________________________ test_mgr_locs_updated[assert_block_lengths] ____________________________________________________________________

func = <function assert_block_lengths at 0x122354b90>

    @pytest.mark.parametrize("func", [cumsum_max, assert_block_lengths])
    def test_mgr_locs_updated(func):
        # https://github.com/pandas-dev/pandas/issues/31802
        # Some operations may require creating new blocks, which requires
        # valid mgr_locs
        df = pd.DataFrame({"A": ["a", "a", "a"], "B": ["a", "b", "b"], "C": [1, 1, 1]})
>       result = df.groupby(["A", "B"]).agg(func)

pandas/tests/groupby/test_bin_groupby.py:71:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/groupby/generic.py:939: in aggregate
    return self._python_agg_general(func, *args, **kwargs)
pandas/core/groupby/groupby.py:926: in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
pandas/core/groupby/ops.py:640: in agg_series
    return self._aggregate_series_fast(obj, func)
pandas/core/groupby/ops.py:665: in _aggregate_series_fast
    result, counts = grouper.get_result()
pandas/_libs/reduction.pyx:377: in pandas._libs.reduction.SeriesGrouper.get_result
    res, initialized = self._apply_to_group(cached_typ, cached_ityp,
pandas/_libs/reduction.pyx:195: in pandas._libs.reduction._BaseGrouper._apply_to_group
    res = self.f(cached_typ)
pandas/core/groupby/groupby.py:913: in <lambda>
    f = lambda x: func(x, *args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x = Series([], Name: C, dtype: int64)

    def assert_block_lengths(x):
>       assert len(x) == len(x._data.blocks[0].mgr_locs) + 1
E       assert 1 == (1 + 1)
E        +  where 1 = len(0    1\nName: C, dtype: int64)
E        +  and   1 = len(BlockPlacement(slice(0, 1, 1)))
E        +    where BlockPlacement(slice(0, 1, 1)) = IntBlock: 1 dtype: int64.mgr_locs

pandas/tests/groupby/test_bin_groupby.py:56: AssertionError
==================================================================== 1 failed, 1 passed, 9 deselected in 0.24s =====================================================================

@TomAugspurger
Copy link
Contributor Author

Has anyone seen the 32bit failure elsewhere?

=================================== FAILURES ===================================
___________________ TestDataFrameAnalytics.test_stat_op_calc ___________________
[gw0] linux -- Python 3.6.7 /home/vsts/miniconda3/envs/pandas-dev/bin/python

    
        def kurt(x):
            from scipy.stats import kurtosis  # noqa:F811
    
            if len(x) < 4:
                return np.nan
            return kurtosis(x, bias=False)
    
        assert_stat_op_calc(
            "nunique",
            nunique,
            float_frame_with_na,
            has_skipna=False,
            check_dtype=False,
            check_dates=True,
        )
    
        # mixed types (with upcasting happening)
        assert_stat_op_calc(
>           "sum", np.sum, mixed_float_frame.astype("float32"), check_dtype=False,
        )

pandas/tests/frame/test_analytics.py:321: 

I guess https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=30468&view=logs&j=3a03f79d-0b41-5610-1aa4-b4a014d0bc70&t=4d05ed0e-1ed3-5bff-dd63-1e957f2766a9&l=74 had it on the npdev build. Is that a flaky test?

@jbrockmendel
Copy link
Member

Has anyone seen the 32bit failure elsewhere?

Yes. Best guess is #32571 caused this, am troubleshooting now

@jreback jreback merged commit ecb5b57 into pandas-dev:master Mar 11, 2020
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 11, 2020
TomAugspurger added a commit that referenced this pull request Mar 11, 2020
…32635)

Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REGR: cumsum regression with groupby call to agg
5 participants