Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Avoid full data sorts in masked array median #20258

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

david-cortes
Copy link
Contributor

Fixes #18821

Currently, calling np.ma.median will do a full sort of the data, which is inefficient if one needs to get only the median. This PR changes it to instead filter the data and call the non-ma median on that filtered data, which only requires an argpartition operation instead.

Before:

import numpy as np, pandas as pd

rng = np.random.default_rng(seed=1)
nrows = int(1e6)
ncols = 10
X = pd.DataFrame(rng.normal(size=(nrows, ncols)))
%%timeit
np.ma.median(X, axis=0)
1.07 s ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:

212 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@david-cortes
Copy link
Contributor Author

david-cortes commented Nov 1, 2021

That failing test is interesting and I am not sure if the expected behavior is what's in this PR or what's in the values to expect in the tests.

import numpy as np
x = np.ma.array(np.arange(10).reshape(2, 5), mask=[0]*6 + [1]*4)
x.fill_value = 3
np.ma.median(x, axis=-1)

Current master:

masked_array(data=[2.0, 5.0],
             mask=[False, False],
       fill_value=1e+20)

This PR:

masked_array(data=[2., 5.],
             mask=False,
       fill_value=3.0)

On the one hand, I'd have expected the fill_value to be kept when calculating medians (and did so explicitly in the PR code), but on the other hand, it perhaps doesn't make sense to keep it if the result of the median will not have anything filled with that value.

What should be the right behavior in this case?

@WarrenWeckesser WarrenWeckesser added the component: numpy.ma masked arrays label Nov 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

np.ma.median is inefficient with DataFrames
2 participants