Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix Series.apply(..., by_row), v2. #53601

Merged
merged 14 commits into from
Jun 29, 2023

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Jun 11, 2023

Fixes #53400 (comment) by making the by_row param take "compat" as a parameter and using that internally that when apply is given dicts og lists.

The compatability path is now to deprecate by_rows=(True|"compat") at some point, so by_rows=False will become the default in v3.0.

Supercedes #53584.

Code example:

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})

# On 2.0.x and main
print(df.apply([lambda x: 1]))
#          a        b
#   <lambda> <lambda>
# 0        1        1
# 1        1        1
# 2        1        1

@topper-123
Copy link
Contributor Author

CC: @rhshadrach.

@rhshadrach
Copy link
Member

rhshadrach commented Jun 11, 2023

I don't see how this addresses #53584 (comment). The issue is with DataFrame.apply:

pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}).apply([lambda x: 1])

Users have no argument to specify by_row, so they have no way to adopt the future (3.0) behavior of by_rows=False. Am I missing it?

@topper-123
Copy link
Contributor Author

topper-123 commented Jun 11, 2023

Puff, I was more focused on the fact than in main we currently have:

>>> pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}).apply([lambda x: 1])
          a  b
<lambda>  1  1

While in v2.0 we had:

>>> pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}).apply([lambda x: 1])
         a        b
  <lambda> <lambda>
0        1        1
1        1        1
2        1        1

but looks like you're right wrt. deprecation process, if we want to deprecate the current behavior for DataFrame, we will need a by_row parameter in DataFrame.apply.

@mroeschke mroeschke added the Apply Apply, Aggregate, Transform label Jun 12, 2023
@topper-123
Copy link
Contributor Author

I've made a new version, where we can deprecate the old behavior by deprecating by_row=True|"compat" in DataFrame.apply, similarly to how do it in Series.apply.

pandas/core/frame.py Outdated Show resolved Hide resolved
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the removal of by_row=True would also apply in the Series case, is that right? If not, a lot of my requests below are invalid.

pandas/core/apply.py Outdated Show resolved Hide resolved
pandas/core/apply.py Outdated Show resolved Hide resolved
pandas/core/apply.py Outdated Show resolved Hide resolved
kwargs,
) -> None:
if by_row is not False and by_row != "compat":
raise NotImplementedError(f"by_row={by_row} not implemented")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the docs, I think NotImplementedError signifies the implementation is currently incomplete, and that users can expect this to be supported once we "get around to it". Can this be a ValueError instead.

This comment was marked as resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this comment somehow. I've changed it now.

Comment on lines 9663 to 9673
by_row : False or "compat", default "compat"
If "compat", will if possible first translate the func into pandas
methods (e.g. ``Series().apply(np.sum)`` will be translated to
``Series().sum()``). If that doesn't work, will try call to apply again with
``by_row=True`` and if that fails, will call apply again with
``by_row=False``
If False, the funcs will be passed the whole Series at once.
``by_row`` only has effect when ``func`` is a listlike or dictlike of funcs
and the func isn't a string.
``by_row=True`` has not been implemented, and will raise an
``NotImplenentedError``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to have the callout on this only applying to list/dict-likes at the beginning, and adding in that this is compatible with previous versions. What do you think about being more vague about the compat behavior instead of trying to detail it out? Something like

        by_row : False or "compat", default "compat"
            Only has effect an when ``func`` is a listlike or dictlike of funcs
            on the values that aren't NumPy functions (e.g. ``np.sum``) or 
            string-aliases for operations (e.g. ``"sum"``). 
            "compat" is backwards compatible with previous versions and will
            sometimes operate by row and sometimes operate on the whole Series at once.
            If False, the funcs will be passed the whole Series at once.

I'm also okay with keeping the more detailed description of compat if you prefer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is for the other version, if that's ok. I changed it a bit though.

@topper-123
Copy link
Contributor Author

I was thinking the removal of by_row=True would also apply in the Series case, is that right? If not, a lot of my requests below are invalid.

Unfortunately not. For Series.apply we have:

>>> ser = pd.Series([1, np.nan, 2])
>>> ser.apply(np.sum, by_row=True)
0    1.0
1    NaN
2    2.0
dtype: float64
>>> ser.apply(np.sum, by_row="compat")
3.0

for Series.apply , by_row="compat" ends up calling SeriesApply.apply_compat, when given a dict- or listlike. For example:

>>> ser.apply(np.sum)  # by_row=True is implicit here
0    1.0
1    NaN
2    2.0
dtype: float64
>>> ser.apply([np.sum])  # for each func in the list, call ser.apply(np.sum, by_row="compat")
sum    3.0
dtype: float64

These two cases can't be combined into one parameter option, so I don't see another way forward here myself.

I'll look into your detailed questions later today.

@rhshadrach
Copy link
Member

In pandas 3.0, we will just have the by_row=False behavior, right? Why does the user need to be able to specify by_row=True?

@topper-123
Copy link
Contributor Author

I've changed the by_row parameter value names to better suit your proposals:

  • True in Series.apply has been changed to "compat"
  • compat in Series.apply has been changed to "_compat"
  • False is unchanged

This means that for both Series and DataFrame by_row="compat" now means "do it the same way as in v2.0", by_row=False means pass the whole Series to func in all cases, while by_row="_compat" in Series.apply is internal and should not be called by end users.

I think this is better?

pandas/core/series.py Outdated Show resolved Hide resolved
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I see now why you need the third state "_compat". We're now calling apply instead of agg from agg_or_apply_list_like when op_name is apply. Agreed this naming is better.

@topper-123
Copy link
Contributor Author

I’ve updated. Yeah, it’s needed for backward compatability, unfortunately. After by_row=“compat” has been removed in v.3.0, this will become a whole lot simpler.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just the two open requests from earlier on.

@topper-123
Copy link
Contributor Author

In pandas 3.0, we will just have the by_row=False behavior, right?

Yes, though that parameter will be deprecated and it's default value changed to lib.no_default in v3.0 in preparation for its removal in v4.0.

@rhshadrach rhshadrach added this to the 2.1 milestone Jun 29, 2023
@rhshadrach rhshadrach changed the title fix Series.apply(..., by_row), v2. BUG: fix Series.apply(..., by_row), v2. Jun 29, 2023
@rhshadrach rhshadrach added the Bug label Jun 29, 2023
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach merged commit 9a9fcf6 into pandas-dev:main Jun 29, 2023
34 checks passed
@rhshadrach
Copy link
Member

Thanks @topper-123!

@topper-123 topper-123 deleted the fix_series_Apply_by_row_II branch June 29, 2023 22:04
@topper-123
Copy link
Contributor Author

Nice. I'll scan to see if there are other changes needed for Series/DataFrame.apply/agg, but if not, this concludes the work for these methods, until the deprecation is activated. Series/DataFrame.transform still need some work to get in line with the others, though.

I have not intended to work on the groupby methods, because you're taking care of those, right?

@rhshadrach
Copy link
Member

I have not intended to work on the groupby methods, because you're taking care of those, right?

Yes - my next step is to refactor groupby methods to by-and-large not share code with core.apply. Then after this is done, evaluate what we can reasonable deprecate for 3.0.

Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants