Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: Series.nlargest with masked arrays #42838

Merged
merged 8 commits into from
Aug 10, 2021

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel added NA - MaskedArrays Related to pd.NA and nullable extension arrays Regression Functionality that used to work in a prior pandas version labels Aug 1, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.2 milestone Aug 2, 2021
@@ -1255,6 +1259,18 @@ def compute(self, method: str) -> Series:

dropped = self.obj.dropna()

if is_extension_array_dtype(dropped.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this prefereable to hanlding on L1281 with the other dtypes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bc ensure_data does the wrong thing with MaskedArrays, np.asarray(obj) defaults to object dtype.

we could kludge _ensure_data to work in cases where we dont have any pd.NAs, but doing it here lets us handle cases with NAs too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bc ensure_data does the wrong thing with MaskedArrays, np.asarray(obj) defaults to object dtype.

this is very unfortunate. I don't really like this approach here. i suppose its ok for a backport though this is na experimental type and so i don't consider this regression to be a big deal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the more i work on it, the more i want to nuke NA from space

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok what is involved with changing this to a non-recursive formulation though? e.g. ensure_data should be able to handle NA (if not we will have other issues)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three options:

  1. have ensure_data special-case MaskedArray cases with no NAs, in which case they can just use the ndarray. This fixes the regression (cases with NAs didnt use to work IIUC). kluuuuudge
  2. make algos.kth_smallest support object dtype (and not choke on pd.NA)
  3. make the EA implement its own nlargest

I decided the approach here was less kludgy than those in part bc this function uses obj.dropna(), so the MaskedArray case actually is much simpler to implement than an arbitrary EA

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is 1 a kludge?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bc its special-casing for MaskedArray and special casing for not values.isna().any()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worse than that, it isnt not values.isna().any() but not values._mask.any()

@jreback jreback merged commit 61cbb73 into pandas-dev:master Aug 10, 2021
@jreback
Copy link
Contributor

jreback commented Aug 10, 2021

not super thrilled with this, but does fix.

@jreback
Copy link
Contributor

jreback commented Aug 10, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app
Copy link

lumberbot-app bot commented Aug 10, 2021

Something went wrong ... Please have a look at my logs.

@jbrockmendel jbrockmendel deleted the regr-42816 branch August 11, 2021 05:18
@@ -23,6 +23,7 @@ Fixed regressions
- Fixed regression where :meth:`pandas.read_csv` raised a ``ValueError`` when parameters ``names`` and ``prefix`` were both set to None (:issue:`42387`)
- Fixed regression in comparisons between :class:`Timestamp` object and ``datetime64`` objects outside the implementation bounds for nanosecond ``datetime64`` (:issue:`42794`)
- Fixed regression in :meth:`.Styler.highlight_min` and :meth:`.Styler.highlight_max` where ``pandas.NA`` was not successfully ignored (:issue:`42650`)
- Regression in :meth:`Series.nlargest` and :meth:`Series.nsmallest` with nullable integer or float dtype (:issue:`41816`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will need to change to #42816. will do that after backport is merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in #42983

simonjayhawkins pushed a commit that referenced this pull request Aug 11, 2021
… arrays) (#42975)

* Backport PR #42838: REGR: Series.nlargest with masked arrays

* fix final import

Co-authored-by: jbrockmendel <jbrockmendel@gmail.com>
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NA - MaskedArrays Related to pd.NA and nullable extension arrays Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: nlargest raises TypeError "No matching signature found" on Float64Dtype Series, versions >1.3.0
3 participants