Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/BUG: Use Kleene logic for groupby any/all #40819

Merged
merged 40 commits into from
Apr 13, 2021

Conversation

mzeitlin11
Copy link
Member

@mzeitlin11 mzeitlin11 commented Apr 6, 2021

Built on #40811

ASV's don't look affected:

before           after         ratio
     [3b4b193d]       [423f43fc]
     <master>         <enh/any_all_kleene>
          216±8ms          205±4ms     0.95  groupby.GroupByCythonAgg.time_frame_agg('float64', 'all')
          208±8ms          197±3ms     0.95  groupby.GroupByCythonAgg.time_frame_agg('float64', 'any')
         162±10μs          139±4μs    ~0.86  groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'direct')
         161±20μs          138±9μs    ~0.86  groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'transformation')
         159±10μs          142±5μs    ~0.90  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'direct')
         160±10μs          140±5μs    ~0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'transformation')
          109±8μs         113±10μs     1.04  groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'direct')
         103±10μs          109±9μs     1.06  groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'transformation')
         113±20μs         132±10μs    ~1.17  groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'direct')
         138±10μs         124±20μs    ~0.90  groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'transformation')
          110±9μs          112±5μs     1.02  groupby.GroupByMethods.time_dtype_as_field('int', 'all', 'direct')
          115±9μs         109±10μs     0.94  groupby.GroupByMethods.time_dtype_as_field('int', 'all', 'transformation')
         123±10μs          115±7μs     0.94  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'direct')
         125±10μs          114±8μs     0.92  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'transformation')
         449±30μs         452±20μs     1.01  groupby.GroupByMethods.time_dtype_as_field('object', 'all', 'direct')
         442±30μs         445±20μs     1.01  groupby.GroupByMethods.time_dtype_as_field('object', 'all', 'transformation')
         470±10μs         455±30μs     0.97  groupby.GroupByMethods.time_dtype_as_field('object', 'any', 'direct')
         466±30μs          406±4μs    ~0.87  groupby.GroupByMethods.time_dtype_as_field('object', 'any', 'transformation')
          118±8μs          118±7μs     1.00  groupby.GroupByMethods.time_dtype_as_group('datetime', 'all', 'direct')
          134±9μs         132±20μs     0.99  groupby.GroupByMethods.time_dtype_as_group('datetime', 'all', 'transformation')
         131±20μs          115±9μs    ~0.88  groupby.GroupByMethods.time_dtype_as_group('datetime', 'any', 'direct')
         121±10μs          118±5μs     0.97  groupby.GroupByMethods.time_dtype_as_group('datetime', 'any', 'transformation')
         111±10μs          111±4μs     1.00  groupby.GroupByMethods.time_dtype_as_group('float', 'all', 'direct')
         109±10μs         120±10μs    ~1.11  groupby.GroupByMethods.time_dtype_as_group('float', 'all', 'transformation')
          110±9μs         127±10μs    ~1.16  groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'direct')
         112±10μs         122±10μs     1.09  groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'transformation')
         130±10μs          115±6μs    ~0.89  groupby.GroupByMethods.time_dtype_as_group('int', 'all', 'direct')
          133±7μs          115±7μs    ~0.86  groupby.GroupByMethods.time_dtype_as_group('int', 'all', 'transformation')
          117±5μs         123±10μs     1.05  groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'direct')
         133±10μs         118±20μs    ~0.89  groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'transformation')
          107±9μs         122±10μs    ~1.14  groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'direct')
         113±10μs          107±5μs     0.94  groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'transformation')
         135±10μs          108±4μs    ~0.80  groupby.GroupByMethods.time_dtype_as_group('object', 'any', 'direct')
          114±6μs          105±2μs     0.92  groupby.GroupByMethods.time_dtype_as_group('object', 'any', 'transformation')

@mzeitlin11 mzeitlin11 added Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 6, 2021
@mzeitlin11 mzeitlin11 added NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 7, 2021
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

doc/source/whatsnew/v1.3.0.rst Outdated Show resolved Hide resolved
pandas/tests/reductions/test_reductions.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mzeitlin11 Thanks a lot for working on this!

Comment on lines 83 to 84
from pandas.core.arrays.boolean import BooleanArray
from pandas.core.arrays.masked import BaseMaskedArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can import both from from pandas.core.arrays import ..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@@ -1435,6 +1446,7 @@ def result_to_bool(result: np.ndarray, inference: type) -> np.ndarray:
post_processing=result_to_bool,
val_test=val_test,
skipna=skipna,
masked=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this set to always be False?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had originally used masked being included in kwargs as a signal for whether to pass masked into the groupby func. But the original value here wasn't relevant since it gets set before being used.

This is definitely confusing and a poor solution, changed to instead use a positional arg

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the update looks clearer to follow now

@pytest.mark.parametrize("bool_agg_func", ["any", "all"])
@pytest.mark.parametrize("skipna", [True, False])
@pytest.mark.parametrize(
# expected indexed as [skipna][bool_agg_func == "all"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you maybe also write out what that results in practice? So [skipna=False/any, skipna=False / all], [skipna=True / any, skipna=True / all], if I understand correctly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep have changed, that is a much clearer way to describe it

pandas/tests/groupby/test_any_all.py Outdated Show resolved Hide resolved
@pep8speaks
Copy link

pep8speaks commented Apr 10, 2021

Hello @mzeitlin11! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-13 15:02:14 UTC

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! (and much cleaner now ;))

I only added a few tiny stylistic/doc suggestions you can apply.

doc/source/whatsnew/v1.3.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.3.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.3.0.rst Outdated Show resolved Hide resolved
pandas/tests/groupby/test_any_all.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_any_all.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls merge master as well

doc/source/whatsnew/v1.3.0.rst Outdated Show resolved Hide resolved

# The expected result we compared to should match aggregating on the whole
# series
result = getattr(df[0], bool_agg_func)(skipna=skipna)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is really hard to parse.
what exactly are you checking? can you make this much simpler

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this is essentially doing is series.any(skipna=skipna), but then with getattr because it's generic for any/all. Do you have a concrete suggestion to change? (or to clarify the comment?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a complicated expexcted then a really complicated assert. This needs to be greatly simplified. It is impossible to grok with all of this logic. my suggesetion is to break this up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be very concrete, you mean breaking it up into 2 lines, like

series = df[0]
getattr(series, method)(skipna=skipna)

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have tried to simplify by moving this validation of "expected" into a separate test in test_reductions.py (replacing the existing test there as this tests a superset of the logic).

def test_masked_kleene_logic(bool_agg_func, data, expected_data, skipna):
# GH#37506
df = DataFrame(data, dtype="boolean")
expected = DataFrame(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this expected is really hard to parse can you do in multiple steps / make simpler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressing your comment above makes this simpler, let me know if you think any piece is still confusing

mzeitlin11 and others added 8 commits April 13, 2021 10:29
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok looks good. question on the rendering in the notes, ping on green.

doc/source/whatsnew/v1.3.0.rst Show resolved Hide resolved
pandas/tests/reductions/test_reductions.py Show resolved Hide resolved
@jreback jreback added this to the 1.3 milestone Apr 13, 2021
@mzeitlin11
Copy link
Member Author

Greenish, failures unrelated and fixed by #40930 and flaky zip fix that just got merged

@jreback jreback merged commit bbbd15c into pandas-dev:master Apr 13, 2021
@jreback
Copy link
Contributor

jreback commented Apr 13, 2021

thanks @mzeitlin11 very nice!

@mzeitlin11 mzeitlin11 deleted the enh/any_all_kleene branch April 13, 2021 17:13
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
5 participants