Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Improve error messages for random.choice #25521

Merged
merged 5 commits into from
Jan 22, 2024

Conversation

MilesCranmer
Copy link
Contributor

@MilesCranmer MilesCranmer commented Jan 2, 2024

This improves the error messages for random.choice by suggesting the user use p = p / np.sum(p). It also suggests what to do for round-off issues if the sum of probabilities is, for example, 0.99999997 (due to precision issues in division or summation), which can otherwise be very confusing (see for example [1], [2], and [3])

@mdhaber
Copy link
Contributor

mdhaber commented Jan 2, 2024

@rkern thought you might have an opinion about whether the choice error message "probabilities do not sum to 1" should suggest workarounds.

raise ValueError("Probabilities do not sum to 1. "
"You can typically solve this issue with "
"`p = p / np.sum(p)`. "
"In rare cases this may not work due to round-off error, "
Copy link
Contributor

@mdhaber mdhaber Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there examples of normalization by np.sum(p) in double precision arithmetic not working in modern NumPy? Assuming kahan_sum is not so different from np.sum, I would have expected the tolerance atol = max(atol, np.sqrt(np.finfo(p.dtype).eps)) to be pretty generous compared to typical roundoff error in float64, even for large arrays.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that p came in as float64, yes. When the user provides a float32 array that sums reasonably close to 1 in float32 arithmetic, we're simply casting it to float64, and that's one of the more common sources of "spuriously" hitting this exception.

Copy link
Contributor

@mdhaber mdhaber Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I was going to suggest that if we were to provide a recommendation, it would be sufficient to recommend that the user normalize by np.sum(p, dtype=np.float64) (assuming NEP 50) rather than providing a primary strategy and backup strategy.

@rkern
Copy link
Member

rkern commented Jan 2, 2024

I feel like the possible remedies are more context-dependent than what should fit into an exception message. The advice to rescale works when the user provided just implicitly-scaled weights (that do not sum at all to 1) or when they are close enough to 1 (maybe in a reduced-precision arithmetic) that the rescaling does negligible damage to the values. But sometimes, you just want to fix up the last value and leave the rest alone. Or maybe the first. Or some other one that can tolerate the deviation from what was requested better than the others because of problem-specific circumstances.

@rkern
Copy link
Member

rkern commented Jan 2, 2024

@matheussouza88 Welcome to the project. Please look at our contribution guide on ways to contribute to the project. Numpy is an old, mature project, and there are some ways to contribute that are more well-adapted to newcomers than others. PR approval is one of the things that requires some longer experience with the project, its forward-looking goals, and its backwards-looking history. Thanks.

@mdhaber
Copy link
Contributor

mdhaber commented Jan 2, 2024

I feel like the possible remedies are more context-dependent than what should fit into an exception message.

I agree that we cannot provide exhaustive or one-size-fits-all advice. Perhaps a compromise would be to provide a short, lightly worded suggestion to those who are confused by the error:

Probabilities do not sum to 1. Consider normalizing p; e.g. p = p / np.sum(p, dtype=float).

Those who have considered it and decided that it is not appropriate for their use case can do whatever they deem appropriate.

Another possibility is to add an example in the documentation.

@rkern
Copy link
Member

rkern commented Jan 2, 2024

A float32 example would probably be the best medium.

@mdhaber
Copy link
Contributor

mdhaber commented Jan 2, 2024

A float32 example would probably be the best medium.

@MilesCranmer would that address the issue, and if so, would you change the PR accordingly?

@MilesCranmer
Copy link
Contributor Author

I don't think this would help because requiring the user to google their error is the poor experience that this PR attempts to address. Whether that google search goes to stackoverflow (current) or the docs is not really a big difference imo. Avoiding placing the burden on the user is really the PR's goal (however that ends up).

I suppose the p / sum(p) is a more obvious from the existing message, so isn't needed, but the float32 -> float64 which changes the sum away from 1 is subtle and a real pain (which I ran into myself; hence this PR). In this case a helpful error message would be the best form of documentation.

@rkern
Copy link
Member

rkern commented Jan 3, 2024

I'm happy to expand on the message, for instance to mention that the calculation is done in float64, but the extended explanation of ways to fix it should remain in the docstring, IMO. I'd be happy for the message to say to see the docstring for more details. I'd even be happy to check if the input was originally float32 and mention that (in the Generator.choice() implementation; RandomState must be left alone).

@MilesCranmer
Copy link
Contributor Author

Good idea. Let me try to implement that.

@mdhaber
Copy link
Contributor

mdhaber commented Jan 4, 2024

@rkern Were you OK with changing the error message of RandomState.choice or change just Generator.choice?

Would you like to trim it to:

Probabilities do not sum to 1. See Notes section of docstring for more information.

And move the bit about needing normalization to the Notes?

@rkern
Copy link
Member

rkern commented Jan 4, 2024

Only change Generator.choice, please. If RandomState isn't segfaulting, don't change it.

The content doesn't necessarily have to move to the Notes if it's going to be that one sentence. But in that case, the message should refer to just "the docstring" and not the Notes.

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
@mdhaber mdhaber merged commit 6fbbcab into numpy:main Jan 22, 2024
57 of 63 checks passed
@mdhaber
Copy link
Contributor

mdhaber commented Jan 22, 2024

Failures look unrelated. Thanks @MilesCranmer!

@MilesCranmer MilesCranmer deleted the choice-errors branch January 22, 2024 17:30
@COderHop
Copy link

COderHop commented Mar 9, 2024

Hi i am sorry to ask this kind of question ,
i am newbies i tried to find the source code of numpy.random.choice without succes. where i can found it ?
i am wondering who the weights function works to select random value from array but the same function of numpy are used on pandas DataFrame sample(weights)
thanks in advance

@rkern
Copy link
Member

rkern commented Mar 9, 2024

Questions like this are best asked on the mailing list or the Scientific Python Discourse, preferably not on unrelated Github issues.

The current recommended implementation is to use Generator.choice (i.e. rng = np.random.default_rng(); rng.choice(...)) rather than np.random.choice(), which is a legacy implementation. If you do need the source for np.random.choice() in particular, it's here.

@COderHop
Copy link

COderHop commented Mar 9, 2024

thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants