DEPR: na_sentinel in factorize #47157

Merged
merged 26 commits into from Jun 24, 2022

Conversation

rhshadrach
Member

The NotImplementedError will be removed as part of #46601

@rhshadrach rhshadrach added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Deprecate Functionality to remove in pandas labels May 28, 2022
@rhshadrach rhshadrach added this to the 1.5 milestone May 28, 2022
@pep8speaks

pep8speaks commented May 28, 2022

Hello @rhshadrach! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-06-24 18:33:26 UTC

@jbrockmendel
Member

The warning handling in algorithms.factorize isn't my favorite, but I'll take your word for it that this is the best viable option.

    use_na_sentinel: bool | lib.NoDefault,
    warn: bool = True,
) -> int | None:
    """Determine value of na_sentinel for factorize methods.
Member

Nitpick: add a newline before "Determine".

@jbrockmendel
Member

needs rebase, one nitpick, otherwise LGTM

… into depr_na_sentinel

# Conflicts:
#	doc/source/whatsnew/v1.5.0.rst
#	pandas/core/algorithms.py
#	pandas/core/arrays/base.py
#	pandas/core/arrays/sparse/array.py
#	pandas/core/common.py
#	pandas/tests/extension/base/methods.py
…_na_sentinel

# Conflicts:
#	doc/source/whatsnew/v1.5.0.rst
@rhshadrach
Member Author

The warning handling in algorithms.factorize isn't my favorite, but I'll take your word for it that this is the best viable option.

The other option is to use catch_warnings when the call to EA.factorize is made from within pd.factorize. I typically try to avoid using it, but perhaps this is a place where it's worth it. It would certainly simplify that logic (which I myself am not a fan of).
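The catch_warnings option described above can be sketched with the standard library alone. The two functions below are hypothetical stand-ins for pd.factorize and EA.factorize (they are not pandas code, and the warning text is invented for illustration); the point is only that the outer call warns once and silences the duplicate warning from the inner call.

```python
import warnings


def inner_factorize(values, na_sentinel=-1):
    # Hypothetical stand-in for an EA.factorize that warns about the old keyword.
    warnings.warn(
        "`na_sentinel` is deprecated; specify `use_na_sentinel` instead.",
        FutureWarning,
        stacklevel=2,
    )
    uniques, codes = [], []
    for value in values:
        if value is None:
            codes.append(na_sentinel)  # missing values get the sentinel code
            continue
        if value not in uniques:
            uniques.append(value)
        codes.append(uniques.index(value))
    return codes, uniques


def outer_factorize(values, na_sentinel=-1):
    # Hypothetical stand-in for pd.factorize: warn once at the top level ...
    warnings.warn(
        "`na_sentinel` is deprecated; specify `use_na_sentinel` instead.",
        FutureWarning,
        stacklevel=2,
    )
    # ... then suppress the duplicate warning raised by the inner call.
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", ".*use_na_sentinel.*", FutureWarning)
        return inner_factorize(values, na_sentinel=na_sentinel)


# Record warnings to confirm that only the outer one reaches the caller.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    codes, uniques = outer_factorize(["a", None, "a", "b"])
```

The trade-off is the one raised in the comment: the filter is simple to write, but it matches on message text, so it can hide any other FutureWarning that happens to mention use_na_sentinel.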

Comment on lines 757 to 767
if passed_na_sentinel is lib.no_default:
    # User didn't specify na_sentinel; avoid warning. Note EA path always
    # uses a na_sentinel value.
    codes, uniques = values.factorize(use_na_sentinel=True)
elif passed_na_sentinel is None:
    # Emit the appropriate warning message for None
    _ = com.resolve_na_sentinel(passed_na_sentinel, use_na_sentinel)
    codes, uniques = values.factorize(use_na_sentinel=True)
else:
    # EA.factorize will warn
    codes, uniques = values.factorize(na_sentinel=na_sentinel)
Member

I think we need to distinguish deprecating the keyword that the user passes (I think that is handled above?) vs deprecating what the EA can accept.

Currently, if you have an external EA that has a custom factorize(self, na_sentinel=-1) method, then passing use_na_sentinel will raise an error. So to provide a smooth upgrade path for EA authors (and not users calling Series.factorize()), we might need to check here if the new keyword can be passed or not? (and if not, raising a deprecation warning urging the EA author to update their factorize() implementation)

Member

I guess it's also worth clarifying whether we expect the user to call EA.factorize directly vs pd.factorize.

Member

Generally we don't "expect" users to directly call EA.factorize, but it is a public method on a public object, so it falls under a deprecation policy (as is being done in this PR already). So I think for EA authors that have a custom EA.factorize, we should recommend doing a similar deprecation in their custom method?

Member Author

Thanks @jorisvandenbossche. I think anyone using mypy will be getting errors that their signature does not match the base class. I guess this is not sufficient? It does seem noisy to add warnings for users, so I added the warning via __init_subclass__ so that it's emitted only once for each EA subclass. The potential downside is that it's emitted regardless of whether the method is used. Note users can silence warnings that are emitted when using pd.factorize even before the EA author adopts the new signature.

This also caught a defect in my previous implementation - pd.factorize was passing use_na_sentinel in some cases, which would cause code to fail if the EA author has not yet made the update. I went the catch_warnings route as I mentioned above.

Member Author

@rhshadrach rhshadrach Jun 1, 2022

Ah, I didn't notice you mentioned deprecation warning instead of future warning. That makes sense and will be less noisy.
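As a rough illustration of the __init_subclass__ approach discussed in this thread, the sketch below warns once per subclass, at class-definition time, when a subclass's factorize lacks the new keyword. BaseArray and both subclasses are hypothetical stand-ins, and the warning text is invented; this is not the code merged into pandas.

```python
import inspect
import warnings


class BaseArray:
    # Hypothetical stand-in for ExtensionArray; not the pandas class.

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once, when each subclass is defined (not on every call).
        if "use_na_sentinel" not in inspect.signature(cls.factorize).parameters:
            warnings.warn(
                f"`{cls.__name__}.factorize` does not accept `use_na_sentinel`; "
                "update its signature.",
                DeprecationWarning,
                stacklevel=2,
            )

    def factorize(self, use_na_sentinel=True):
        raise NotImplementedError


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    class OldStyleArray(BaseArray):
        def factorize(self, na_sentinel=-1):  # old signature: triggers the warning
            return [], []

    class NewStyleArray(BaseArray):
        def factorize(self, use_na_sentinel=True):  # new signature: silent
            return [], []

messages = [str(w.message) for w in caught]
```

Using DeprecationWarning rather than FutureWarning keeps this quiet for most end users, since Python hides DeprecationWarning by default outside of `__main__` and test runners, while array authors running their test suites still see it.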

@rhshadrach
Member Author

@jorisvandenbossche - friendly ping

Member

@jorisvandenbossche jorisvandenbossche left a comment

For testing, I think it would be good if we have one of our test extension arrays (eg decimal) to still have the old signature, to ensure that we don't start to error for that

Comment on lines +604 to +605
The na_sentinel argument is deprecated and
will be removed in a future version of pandas. Specify use_na_sentinel as
Member

Suggested change
- The na_sentinel argument is deprecated and
- will be removed in a future version of pandas. Specify use_na_sentinel as
+ The `na_sentinel` argument is deprecated and
+ will be removed in a future version of pandas. Specify `use_na_sentinel` as

Member Author

I took this to mean use backticks for any argument / code in the warnings. I went and implemented that for all warnings.

Comment on lines 749 to 752
with warnings.catch_warnings():
    # We've already warned above
    warnings.filterwarnings("ignore", ".*use_na_sentinel.*", FutureWarning)
    codes, uniques = values.factorize(na_sentinel=na_sentinel)
Member

Should we do a similar "if use_na_sentinel in signature" check (as in init_subclass) and in that case already pass the new keyword to the underlying EA? (in that case we don't have to catch any warning)

Member Author

It's unclear to me as we'd be trading out catch_warnings for inspect and more complex logic (though not by much). I thought inspect would take somewhat longer (on the order of 1ms) but on my machine it's pretty quick:

# Run in an IPython session (%timeit is an IPython magic)
import inspect
import pandas as pd

values = pd.array([1, 2, 3])
print("use_na_sentinel" in inspect.signature(values.factorize).parameters)
%timeit "use_na_sentinel" in inspect.signature(values.factorize).parameters

# True
# 12.8 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

I've gone and implemented it; easy to revert if we don't like it.
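For readers following along, a signature check of this kind might be used to choose which keyword to pass. The sketch below is a simplified illustration with hypothetical classes and a hypothetical dispatch_factorize helper; the branch in the PR also conditions on the value of na_sentinel, which is left out here.

```python
import inspect


class OldSignatureArray:
    # Hypothetical array whose factorize still takes only the old keyword.
    def factorize(self, na_sentinel=-1):
        return ("old", na_sentinel)


class NewSignatureArray:
    # Hypothetical array that has adopted the new keyword.
    def factorize(self, use_na_sentinel=True):
        return ("new", use_na_sentinel)


def dispatch_factorize(values, use_na_sentinel=True):
    # Inspect the bound method to see which keyword it can accept.
    params = inspect.signature(values.factorize).parameters
    if "use_na_sentinel" in params:
        return values.factorize(use_na_sentinel=use_na_sentinel)
    # Fall back to the old keyword: -1 marks missing values, None disables it.
    return values.factorize(na_sentinel=-1 if use_na_sentinel else None)
```

Calling `dispatch_factorize(OldSignatureArray())` passes `na_sentinel=-1`, while `dispatch_factorize(NewSignatureArray(), use_na_sentinel=False)` passes the new keyword through, so neither array raises a TypeError for an unexpected argument.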

…_na_sentinel

# Conflicts:
#	doc/source/whatsnew/v1.5.0.rst
@rhshadrach
Member Author

Thanks @jorisvandenbossche - changes made and ready for another review.

For testing, I think it would be good if we have one of our test extension arrays (eg decimal) to still have the old signature, to ensure that we don't start to error for that

I don't think there is decimal, so I did this for datetimelike. It has the easiest implementation to change and doesn't generate warnings on its own (it uses super to do that), so seemed like the best option. Plus it also starts with a 'd' :D

@rhshadrach
Member Author

rhshadrach commented Jun 10, 2022

Encountered a minor difficulty with the warnings and leaving TimelikeOps with the old signature. To avoid getting the warning on import of pandas, we need to special-case it in __init_subclass__. However, as this "method" is run at "import time", we can't even import DatetimeArray et al. (it gives circular imports) to test whether cls == DatetimeArray. The best I think we can do is check its name, which is not exactly ideal.

With the current implementation, worst case is some code does not get warned to update the signature for classes named "TimelikeOps", "DatetimeArray", "TimedeltaArray".
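The name-based special case and its worst case can be illustrated as follows. This is a hypothetical sketch (BaseArray, _SKIP_NAMES, and the warning text are invented for illustration), showing why an unrelated class that merely shares one of the skipped names would not be warned.

```python
import inspect
import warnings

# Assumed for illustration: classes intentionally left on the old signature.
_SKIP_NAMES = {"TimelikeOps", "DatetimeArray", "TimedeltaArray"}


class BaseArray:
    # Hypothetical stand-in for ExtensionArray; not the pandas class.

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # The real classes cannot be imported here without a circular import,
        # so the check compares names instead of class objects.
        if cls.__name__ in _SKIP_NAMES:
            return
        if "use_na_sentinel" not in inspect.signature(cls.factorize).parameters:
            warnings.warn(
                f"`{cls.__name__}.factorize` does not accept `use_na_sentinel`.",
                DeprecationWarning,
                stacklevel=2,
            )

    def factorize(self, use_na_sentinel=True):
        raise NotImplementedError


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    class DatetimeArray(BaseArray):  # skipped purely because of its name
        def factorize(self, na_sentinel=-1):
            return [], []

    class ThirdPartyArray(BaseArray):  # same old signature, but warned
        def factorize(self, na_sentinel=-1):
            return [], []

messages = [str(w.message) for w in caught]
```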

…_na_sentinel

# Conflicts:
#	doc/source/whatsnew/v1.5.0.rst
elif not isinstance(values.dtype, np.dtype):
    if (
        na_sentinel == -1
        and "use_na_sentinel" in inspect.signature(values.factorize).parameters
Contributor

Why are you inspecting like this? It's always passed (with no_default, maybe).

Member Author

We're inspecting whether the EA values can accept the new argument "use_na_sentinel". #47157 (comment)

Member

Might be good to add a comment to that effect

@@ -700,3 +700,53 @@ def deprecate_numeric_only_default(cls: type, name: str, deprecate_none: bool =
)

warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())


def resolve_na_sentinel(
Contributor

Can you put this somewhere else, closer to where it's used?

Member Author

Moved to algorithms.
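The body of resolve_na_sentinel is not shown in this conversation. Purely as an illustration of what such a resolver could do, here is a sketch under stated assumptions: NO_DEFAULT stands in for lib.no_default, the function name and warning text are invented, and raising when both keywords are given is an assumption rather than something confirmed above.

```python
import warnings

NO_DEFAULT = object()  # hypothetical stand-in for pandas' lib.no_default


def resolve_na_sentinel_sketch(na_sentinel=NO_DEFAULT, use_na_sentinel=NO_DEFAULT):
    if na_sentinel is not NO_DEFAULT and use_na_sentinel is not NO_DEFAULT:
        # Assumption: passing both keywords is treated as an error.
        raise ValueError("Specify only one of `na_sentinel` and `use_na_sentinel`.")
    if na_sentinel is NO_DEFAULT:
        # New-style call (or all defaults): resolve silently.
        use = True if use_na_sentinel is NO_DEFAULT else use_na_sentinel
        return -1 if use else None
    # Old-style call: warn, then pass the value through unchanged.
    replacement = (
        "use_na_sentinel=False" if na_sentinel is None else "use_na_sentinel=True"
    )
    warnings.warn(
        f"`na_sentinel` is deprecated; specify `{replacement}` instead.",
        FutureWarning,
        stacklevel=2,
    )
    return na_sentinel
```

With this shape, callers that never pass the old keyword see no warning, which matches the branch in the earlier review snippet where an unspecified na_sentinel avoids warning.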

@rhshadrach
Member Author

@jreback @jorisvandenbossche friendly ping.

Member

@mroeschke mroeschke left a comment

IMO looks pretty good

…_na_sentinel

# Conflicts:
#	pandas/core/arrays/arrow/array.py
@jorisvandenbossche
Member

@rhshadrach thanks for the updates! Looks good to me from a quick look. I won't be able to further look at it the coming two weeks, so don't wait on me to move forward here

@mroeschke mroeschke merged commit d580826 into pandas-dev:main Jun 24, 2022
@mroeschke
Member

Thanks @rhshadrach. Follow ups can be made as needed

@rhshadrach
Member Author

@rhshadrach thanks for the updates! Looks good to me from a quick look. I won't be able to further look at it the coming two weeks, so don't wait on me to move forward here

Thanks @jorisvandenbossche; if you do end up circling back, let me know if you have any more comments and happy to do a followup.

@rhshadrach rhshadrach deleted the depr_na_sentinel branch June 24, 2022 21:48
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
* DEPR: na_sentinel in factorize

* WIP

* DEPR: na_sentinel in factorize

* Fixups

* Fixups

* black

* fixup

* docs

* newline

* Warn on class construction, rework pd.factorize warnings

* FutureWarning -> DeprecationWarning

* Remove old comment

* backticks in warnings, revert datetimelike, avoid catch_warnings

* fixup for warnings

* mypy fixups

* Move resolve_na_sentinel

* Remove underscores

Co-authored-by: Jeff Reback <jeff@reback.net>
Development

Successfully merging this pull request may close these issues.

Deprecate na_sentinel, add use_na_sentinel
6 participants