Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add weighted quantile for inverted_cdf #24254

Merged
merged 46 commits into from Jan 22, 2024

Conversation

lorentzenchr
Copy link
Contributor

Partially solves #8935.

This PR adds support for weights in np.quantile similar to np.average(..., weights=w). As an uncontroversial start, only method="inverted_cdf" is implemented for weights support.

See #8935 (comment) for details.

@lorentzenchr lorentzenchr changed the title ENH add weighted quantile for inverted_cdf ENH: add weighted quantile for inverted_cdf Jul 24, 2023
@lorentzenchr
Copy link
Contributor Author

@seberg friendly ping. I know numpy 2.0 requires resources, but maybe you could have a quick look. The underlying issue is very old with a ton of discussions. (And I need help to fix the emscripten CI error.)

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am even here unsure whether this should be called weights or something more clear? For inverted_cdf I guess it doesn't matter (unless someone expects more than a single datapoint being mixed in the final step if its weight < 1, but I am not sure that can be argued to even make sense).

# setup wgt to broadcast along axis
wgt = np.broadcast_to(wgt, (a.ndim-1)*(1,) + wgt.shape)
wgt = wgt.swapaxes(-1, axis)
return wgt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting logic, but prior art in either case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why interesting? I just shifted these lines of code out of average for reuse in quantile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic tries to add convenience for an N-D input with with axis and 1-D (or other input) and I find that very dubious guessing of user intent with relatively little gain (it isn't hard to make sure weights and data are properly broadcast/aligned) and potential for confusion.
(The potential for terrible bugs where you expect swapping of weights to happen but it doesn't because shapes happen to be identical seems possible but unlikely.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, that code is not new or from me. I just took these lines out of np.average for reuse in np.quantile.

numpy/lib/function_base.py Outdated Show resolved Hide resolved
weights = np.moveaxis(weights, axis, destination=0)
index_array = np.argsort(arr, axis=axis, kind="stable")
# TODO: Which value to set to slices_having_nans.
slices_having_nans = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just [-1] after sorting.

numpy/lib/function_base.py Outdated Show resolved Hide resolved
numpy/lib/tests/test_function_base.py Outdated Show resolved Hide resolved
@lorentzenchr
Copy link
Contributor Author

@seberg Thank you very much for having a look!

@lorentzenchr
Copy link
Contributor Author

Do I have to add tests for n-dim data and that axis works correctly or is this done somewhere automatically?

@lorentzenchr
Copy link
Contributor Author

@seberg I put more effort into it to also test several ndim of input. What should I do with _function_base_impl.pyi?
Can you also help with some of the failing tests? Locally with my laptop, they all pass.

@lorentzenchr
Copy link
Contributor Author

@ogrisel uncovered some shortcomings that I'm trying to solve, in particular the out parameter.

I'm stuck because of the following error:

import numpy as np

np.take(np.arange(10), np.array([1, 4]), out=np.zeros(2, dtype=float))
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

How can I write the content of some slice of an int64 array into a float64 array?

Copy link
Contributor Author

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm currently stuck - and I already thought the PR is ready...

EDIT: I finally solved it in fc0b5b7 and some minor fixes in 2d96b56.
I think this warrants for another proper review as those changes are not trivial at all.

numpy/lib/_function_base_impl.py Outdated Show resolved Hide resolved
- cast cdf to dtype of quantile to avoid surprises
- Convert scalar python objects to pure python objects as in unweighted case
- Extend weighted case to test_linear_interpolation
- Extend weighted case to test_percentile_out
@lorentzenchr
Copy link
Contributor Author

lorentzenchr commented Jan 15, 2024

This is now ready for a - hopfully - final review. Test coverage is much extended and corner cases like nan handling are now fixed (and tested), no todo left.

Edit: And I like the 100% 🟢 of the current CI.

@lorentzenchr
Copy link
Contributor Author

@seberg Do you think this could get in numpy 2.0? That would be my hope.

@seberg seberg removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jan 22, 2024
@seberg
Copy link
Member

seberg commented Jan 22, 2024

Thanks, @lorentzenchr and @ogrisel for reviewing. The new changes look good, and the new tests were really needed. I did have a look at the shape logic, and it seems really quite clean to me, only comments I added were so small, that there would be no point in following up.

Since voiced anything about the added API here or on the mailing list. Putting this in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants