DOC: Improve error messages for random.choice #25521

MilesCranmer · 2024-01-02T11:30:20Z

This improves the error messages for random.choice by suggesting that the user normalize with p = p / np.sum(p). It also suggests what to do about round-off issues when the sum of probabilities is, for example, 0.99999997 (due to precision issues in division or summation), which can otherwise be very confusing (see for example [1], [2], and [3]).

mdhaber · 2024-01-02T16:33:46Z

@rkern thought you might have an opinion about whether the choice error message "probabilities do not sum to 1" should suggest workarounds.

mdhaber · 2024-01-02T16:35:55Z

numpy/random/_generator.pyx

+                raise ValueError("Probabilities do not sum to 1. "
+                                 "You can typically solve this issue with "
+                                 "`p = p / np.sum(p)`. "
+                                 "In rare cases this may not work due to round-off error, "


Are there examples of normalization by np.sum(p) in double precision arithmetic not working in modern NumPy? Assuming kahan_sum is not so different from np.sum, I would have expected the tolerance atol = max(atol, np.sqrt(np.finfo(p.dtype).eps)) to be pretty generous compared to typical roundoff error in float64, even for large arrays.
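To put numbers on the tolerance expression quoted above, here is a minimal sketch. The helper name `choice_tolerance` is hypothetical, and the float64 baseline for `atol` is an assumption about the surrounding code rather than something shown in this thread.

```python
import numpy as np


def choice_tolerance(dtype):
    """Hypothetical helper: evaluate the tolerance expression for a dtype."""
    # Assumed baseline: square root of float64 machine epsilon.
    atol = np.sqrt(np.finfo(np.float64).eps)
    # Widen the tolerance for floating-point inputs, as in the quoted expression.
    if np.issubdtype(dtype, np.floating):
        atol = max(atol, np.sqrt(np.finfo(dtype).eps))
    return float(atol)


tol64 = choice_tolerance(np.float64)  # roughly 1.5e-8
tol32 = choice_tolerance(np.float32)  # roughly 3.5e-4
print(tol64, tol32)
```

Under these assumptions the float32 tolerance is several orders of magnitude looser than the float64 one, but only when the widened branch is taken.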

rkern · 2024-01-02

Assuming that p came in as float64, yes. When the user provides a float32 array that sums reasonably close to 1 in float32 arithmetic, we're simply casting it to float64, and that's one of the more common sources of "spuriously" hitting this exception.

mdhaber · 2024-01-02

Right. I was going to suggest that if we were to provide a recommendation, it would be sufficient to recommend that the user normalize by np.sum(p, dtype=np.float64) (assuming NEP 50) rather than providing a primary strategy and backup strategy.
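A small sketch of the float32 situation and the float64 normalization may help here. The ten-element array and the seed are illustrative choices, not taken from the PR, and the explicit `astype` is used so that the result is float64 regardless of which promotion rules the installed NumPy follows.

```python
import numpy as np

# Ten float32 weights of 0.1 each; float32(0.1) is slightly above 0.1.
p32 = np.full(10, 0.1, dtype=np.float32)

# Summed in float64, the stored float32 values land slightly above 1.
total64 = np.sum(p32, dtype=np.float64)
print(total64 - 1.0)

# Normalize in float64 so that the probabilities sum to 1 in float64.
p = p32.astype(np.float64) / total64

rng = np.random.default_rng(0)
sample = rng.choice(10, size=5, p=p)
print(sample)
```

With NEP 50 promotion, `p32 / np.sum(p32, dtype=np.float64)` should also yield a float64 array, which is the shorter form suggested above.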

rkern · 2024-01-02T18:28:33Z

I feel like the possible remedies are more context-dependent than what should fit into an exception message. The advice to rescale works when the user provided just implicitly-scaled weights (that do not sum at all to 1) or when they are close enough to 1 (maybe in a reduced-precision arithmetic) that the rescaling does negligible damage to the values. But sometimes, you just want to fix up the last value and leave the rest alone. Or maybe the first. Or some other one that can tolerate the deviation from what was requested better than the others because of problem-specific circumstances.
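As an illustration of the "fix up the last value" alternative, here is a minimal sketch. The helper name `fix_last` and the sample values are hypothetical, and which entry should absorb the residual is a problem-specific choice.

```python
import numpy as np


def fix_last(p):
    """Hypothetical helper: let the last entry absorb the residual."""
    q = np.array(p, dtype=np.float64)  # copy, so the input is untouched
    q[-1] = 1.0 - q[:-1].sum()
    if q[-1] < 0:
        raise ValueError("residual would make the last probability negative")
    return q


p = [0.2, 0.3, 0.4999]
q = fix_last(p)
print(q)  # the first two entries are unchanged; only the last one moves
```

Unlike rescaling by the sum, this leaves every entry except the chosen one exactly as requested.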

rkern · 2024-01-02T18:35:24Z

@matheussouza88 Welcome to the project. Please look at our contribution guide on ways to contribute to the project. Numpy is an old, mature project, and there are some ways to contribute that are more well-adapted to newcomers than others. PR approval is one of the things that requires some longer experience with the project, its forward-looking goals, and its backwards-looking history. Thanks.

mdhaber · 2024-01-02T20:28:20Z

> I feel like the possible remedies are more context-dependent than what should fit into an exception message.

I agree that we cannot provide exhaustive or one-size-fits-all advice. Perhaps a compromise would be to provide a short, lightly worded suggestion to those who are confused by the error:

Probabilities do not sum to 1. Consider normalizing p; e.g. p = p / np.sum(p, dtype=float).

Those who have considered it and decided that it is not appropriate for their use case can do whatever they deem appropriate.

Another possibility is to add an example in the documentation.

rkern · 2024-01-02T20:33:21Z

A float32 example would probably be the best medium.

mdhaber · 2024-01-02T20:34:45Z

> A float32 example would probably be the best medium.

@MilesCranmer would that address the issue, and if so, would you change the PR accordingly?

MilesCranmer · 2024-01-03T05:33:14Z

I don't think this would help, because requiring the user to Google their error is the poor experience that this PR attempts to address. Whether that Google search leads to Stack Overflow (as it does currently) or to the docs is not really a big difference, IMO. Avoiding placing the burden on the user is really the PR's goal (however that ends up).

I suppose the p / sum(p) fix is more obvious from the existing message, so it isn't needed, but the float32 -> float64 cast, which moves the sum away from 1, is subtle and a real pain (I ran into it myself; hence this PR). In this case a helpful error message would be the best form of documentation.

rkern · 2024-01-03T05:42:00Z

I'm happy to expand on the message, for instance to mention that the calculation is done in float64, but the extended explanation of ways to fix it should remain in the docstring, IMO. I'd be happy for the message to say to see the docstring for more details. I'd even be happy to check if the input was originally float32 and mention that (in the Generator.choice() implementation; RandomState must be left alone).
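A rough pure-Python sketch of what such a dtype-aware message could look like follows. Everything here is hypothetical: `check_probabilities` is not NumPy's implementation, the message wording is illustrative, and `math.fsum` stands in for the internal summation routine.

```python
import math

import numpy as np


def check_probabilities(p):
    """Hypothetical stand-in for the sum check; not NumPy's actual code."""
    orig_dtype = getattr(p, "dtype", None)
    # Assumed tolerance: float64 baseline, widened for floating-point arrays.
    atol = np.sqrt(np.finfo(np.float64).eps)
    if isinstance(p, np.ndarray) and np.issubdtype(p.dtype, np.floating):
        atol = max(atol, np.sqrt(np.finfo(p.dtype).eps))
    p64 = np.asarray(p, dtype=np.float64)
    total = math.fsum(p64)  # accurate float64 sum
    if abs(total - 1.0) > atol:
        msg = "Probabilities do not sum to 1."
        if orig_dtype is not None and orig_dtype != np.float64:
            # Mention the cast only when the input was not float64 already.
            msg += f" The input was {orig_dtype}, but the sum is computed in float64."
        msg += " See the docstring for more information."
        raise ValueError(msg)
    return p64


try:
    check_probabilities(np.array([0.5, 0.4], dtype=np.float32))
except ValueError as exc:
    message = str(exc)
    print(message)
```

The short message keeps the remedy in the docstring, as proposed above, and only adds the dtype hint when it is relevant.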

MilesCranmer · 2024-01-03T05:57:40Z

Good idea. Let me try to implement that.

mdhaber · 2024-01-04T06:20:26Z

@rkern Were you OK with changing the error message of RandomState.choice, or should only Generator.choice change?

Would you like to trim it to:

Probabilities do not sum to 1. See Notes section of docstring for more information.

And move the bit about needing normalization to the Notes?

rkern · 2024-01-04T14:53:58Z

Only change Generator.choice, please. If RandomState isn't segfaulting, don't change it.

The content doesn't necessarily have to move to the Notes if it's going to be that one sentence. But in that case, the message should refer to just "the docstring" and not the Notes.

mdhaber · 2024-01-22T17:22:20Z

Failures look unrelated. Thanks @MilesCranmer!

COderHop · 2024-03-09T09:59:45Z

Hi, I am sorry to ask this kind of question.
I am a newbie and tried to find the source code of numpy.random.choice without success. Where can I find it?
I am wondering how the weights are used to select a random value from an array, since the same NumPy function is used by pandas DataFrame.sample(weights).
Thanks in advance.

rkern · 2024-03-09T16:06:27Z

Questions like this are best asked on the mailing list or the Scientific Python Discourse, preferably not on unrelated GitHub issues.

The current recommended implementation is to use Generator.choice (i.e. rng = np.random.default_rng(); rng.choice(...)) rather than np.random.choice(), which is a legacy implementation. If you do need the source for np.random.choice() in particular, it's here.
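For reference, a minimal usage sketch of the recommended API with weights. The item labels, weights, and seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

items = np.array(["a", "b", "c"])
weights = np.array([1.0, 2.0, 7.0])

# Turn unnormalized weights into probabilities that sum to 1.
p = weights / weights.sum()

# Draw 1000 items with replacement according to p.
draws = rng.choice(items, size=1000, p=p)
print(np.unique(draws, return_counts=True))
```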

COderHop · 2024-03-09T17:18:43Z

thank you

DOC: Improve error messages for random.choice

6551b8a

github-actions bot added the 04 - Documentation label Jan 2, 2024

matheussouza88 approved these changes Jan 2, 2024

mdhaber reviewed Jan 2, 2024

DOC: Describe p normalization in docstring

a031206

MilesCranmer force-pushed the choice-errors branch from de1c387 to a031206 Compare January 3, 2024 06:20

MilesCranmer requested review from rkern and mdhaber January 3, 2024 06:20

mdhaber added 2 commits January 22, 2024 00:57

Merge remote-tracking branch 'upstream/main' into choice-errors

41162aa

MAINT: RandomState.choice: revert changes

1646243

[skip cirrus] [skip azp]

mdhaber reviewed Jan 22, 2024

numpy/random/_generator.pyx (outdated)

mdhaber reviewed Jan 22, 2024

numpy/random/_generator.pyx (outdated)

Apply suggestions from code review

5a8c6e3

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>

mdhaber merged commit 6fbbcab into numpy:main Jan 22, 2024
57 of 63 checks passed

MilesCranmer deleted the choice-errors branch January 22, 2024 17:30
