Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stdlib: add Random.int_in_range and similar functions #12459

Merged
merged 21 commits into from
Nov 22, 2023

Conversation

gmevel
Copy link
Contributor

@gmevel gmevel commented Aug 3, 2023

This PR adds the following functions to Random:

val int_in_range : min:int -> max:int -> int
(** [Random.int_in_range ~min ~max] returns a random integer
    between [min] and [max] (both bounds inclusive).
    [min] must be less than or equal to [max].
*)

val int32_in_range : min:int32 -> max:int32 -> int32
val int64_in_range : min:int64 -> max:int64 -> int64
val nativeint_in_range : min:nativeint -> max:nativeint -> nativeint
(* similar *)

and their counterpart in the Random.State sub-module:

  val int_in_range : t -> min:int -> max:int -> int
  val int32_in_range : t -> min:int32 -> max:int32 -> int32
  val nativeint_in_range : t -> min:nativeint -> max:nativeint -> nativeint
  val int64_in_range : t -> min:int64 -> max:int64 -> int64

Motivation

The already-existing function for drawing integers uniformly in an interval, namely Random.full_int, has limited functionality:

(1) expressiveness: it does not support the full range of integers (it can only draw integers between 0 and max_int−1), so we cannot easily use it to emulate drawing from an arbitrary interval;

(2) code clarity: even when its range suffices, it must be used in a convoluted way if wanting to draw from an interval whose lower bound is not zero; namely, the code pattern min + Random.full_int (max - min + 1), which is not terribly explicit, prone to errors (it is easy to forget the +1); worse, this code is broken in the general case because max - min + 1 may overflow [this is a re-statement of point (1)].

Hence these new functions are useful. and they are not trivial to implement, (and indeed the implementation in Containers has the mentioned bug, for instance).

Interface

The proposed interface tries to be as abundantly explicit as possible about what the two integer arguments are: inclusive bounds (by contrast with exclusive bounds, or a length for the second argument, or something else yet…). This goes with the name "int_in_range" and labeled arguments "~min ~max".

Related functions in third-party libraries:

Future work

An analogous function for float should be provided, but this PR does not tackle it. This might be as simple as min +. float state (max -. min) (as Base does it), but I do not feel bold enough to venture into the floating-point land (the subtraction scares me). Also, the documentation states that the bounds are inclusive, but the implementation makes it clear that both zero and the upper bound are in fact exclusive (unless bound = 0.0).

Side comments

I also added one explanatory comment in the internal implementation of Random.

Regarding said implementation: why isn’t intaux replaced with int63aux entirely? In 32-bit OCaml, both functions are identical (so "int63aux" is a misnomer). In 64-bit OCaml, intaux produces the same thing as int63aux, only with a restricted input range, and (if I get it right) a higher probability of re-drawing. intaux does not even spare some random bits, it draws 64 bits of randomness anyway.

@gmevel
Copy link
Contributor Author

gmevel commented Aug 3, 2023

Answering my own question:

why isn’t intaux replaced with int63aux entirely?

I guess this is for backwards compatibility, to preserve pseudo-random sequences with a given seed? In that case, the new function int_in_range does not need to follow that logic which I dumbly mimicked from full_int, of switching between intaux and int63aux in function of the size of the arguments.

Waiting for confirmation. If my new understanding is correct, I believe it would be worthwhile to clarify the implementation by adding a small comment and by renaming intaux to int31aux and int63aux to full_int_aux.

@gasche
Copy link
Member

gasche commented Aug 3, 2023

I believe that the reason for intaux is for 32bit and 64bit generators to behave in the same way (on 32bit inputs), improving portability for computations that run on both kinds of machines, at little performance cost (compared to performing all computations on Int64 numbers, which would be fairly costly on 32bit machines).

@gmevel
Copy link
Contributor Author

gmevel commented Aug 3, 2023

Oh, that makes sense. In that case int_in_range should stick to that logic, I guess. This definitively deserves a comment.

@gasche
Copy link
Member

gasche commented Aug 3, 2023

More on this question of intaux vs. int63aux:

  • We changed our PRNG algorithm for 5.0 to support splitting.
  • If I remember correctly, the previous OCaml PRNG was tuned to generate exactly 30 random bits on each state update, so Random.int would fail (and still does) with an exception if asking for higher numbers. Note that even on 32bit systems, 2^30 (about one billion) is strictly smaller than max_int which is 2^31-1 (about two billions) (let's call this max_int_32), so users could observe a failure.
  • Random.int has the 32-64-compatibility property: it produces the same output on 32bit and 64bit machines on 32bit-representable inputs, which is (for this function) all non-failing inputs.
  • Random.full_int was added to generate larger numbers in Add Random.full_int #9489 , before the change or PRNG algorithm ; the implementation is not simple and the discussion was very long! In particular, its implementation was careful to provide the 32-64-compatibility property for bounds above 2^30 and below max_int_32, on which Random.int still fails.
  • The implementation was modified and much simplified when we migrated to the new PRNG. Random.int is still limited to bounds below 2^30, but on the other hand for full_int I don't know if the 32-64-compatibility property still holds today -- this is not obvious from the implementation, so a question for @xavierleroy.

It would be nice if int_int_range also satisfied the 32-64-compatibility property. I suspect that it does not satisfy it today: it is obviously compatible when 0 < span <= 2^30, but there are cases of 32-bit min, max inputs that give span > 2^30 on all machines, and also some cases that have span <= 0 on 32bit machines and span > 2^30 on 64bit machines.

@gasche
Copy link
Member

gasche commented Aug 3, 2023

It would also be nice to have the following properties:

forall (0 < n <= 2^30), Random.int n = Random.int_in_range ~min:0 ~max:(n-1)
forall (0 < n),         Random.full_int n = Random.int_in_range ~min:0 ~max:(n-1)

I believe that the implementation as proposed does satisfy these properties: with ~min:0 ~max:(n-1) we have span = n and the result is exactly Random.full_int n. (Same for int32, int64, nativeint.)

Documenting them somewhere may be useful.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that these functions are useful to have, and I find the naming choices tasteful.

I believe that the implementations are correct (including the conditional usage of intaux), except possibly for corner cases of 32-64-compatibility of int_in_range -- which I would prefer to see fixed.

In other words, I am heading towards approval. Stdlib PRs need two maintainer approvals, and PRNG stuff should get an opinion from @xavierleroy in any case.

stdlib/random.mli Show resolved Hide resolved
@xavierleroy
Copy link
Contributor

The implementation was modified and much simplified when we migrated to the new PRNG. Random.int is still limited to bounds below 2^30, but on the other hand for full_int I don't know if the 32-64-compatibility property still holds today

I see only one documented compatibility property:

If [bound] is less than 2{^30}, [Random.full_int bound] is equal to {!Random.int}[ bound].

It trivially holds because full_int calls intaux if the bound is less than 2^30.

Re-reading #9489, I see there was a special case for JS-of-OCaml and its 32-bit integers that disappeared in OCaml 5. Oh, well, nobody noticed :-) This will have to be discussed separately.

@gmevel
Copy link
Contributor Author

gmevel commented Aug 3, 2023

Thanks for the background! I agree that the 32-64-compatibility is desirable, and that the current implementation does not satisfy it. I will add a test to check that.

About what you said:

Note that even on 32bit systems, 2^30 (about one billion) is strictly smaller than max_int which is 2^31-1 (about two billions) (let's call this max_int_32), so users could observe a failure.

Unless I’m misunderstanding something, Int.max_int on 32-bit OCaml is 2^30-1, not 2^31-1 (because then ints have 31 bits and are signed). So users on 32-bit OCaml could not observe the failure.

[the implementation of Random.full_int] was careful to provide the 32-64-compatibility property for bounds above 2^30 and below max_int_32, on which Random.int still fails.

I don’t follow here: are you saying that 32-64-compatibility implies to produce the same result on 32-bit and 64-bit platforms when 2^30 <= bound <= 2^31-1? Such values are not representable with the type int on 32-bit platforms, so I don’t understand that point about full_int. However it makes sense for int_in_range because, even on 32-bit platforms, the span of the requested range (i.e. max-min+1) can be such that 2^30 <= span <= 2^31.

I agree with the other compatibility properties you mention. I will document them and try to add tests for them as well.

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I usually prefer semi-open intervals [min, max+1) to closed intervals [min, max], but I agree that it's nice to be able to take max = max_int.

stdlib/random.ml Outdated Show resolved Hide resolved
stdlib/random.ml Outdated Show resolved Hide resolved
stdlib/random.mli Show resolved Hide resolved
@xavierleroy
Copy link
Contributor

xavierleroy commented Aug 3, 2023

Trying to clarify the discussion: the "32-64-compatibility" property is ill-defined. There are THREE sizes for OCaml's type int:

  • 31 bits for plain OCaml on a 32-bit platform
  • 32 bits with JS-of-OCaml
  • 63 bits for plain OCaml on a 64-bit platform.

The fancy footwork in #9489 was to guarantee consistent results for full_int between JS-of-OCaml and OCaml-on-a-64-bit-platform in the case 0x3FFF_FFFF < bound <= 0x7FFF_FFFF. But that's been lost in OCaml 5 and will have to be rediscussed elsewhere.

@xavierleroy
Copy link
Contributor

Concerning the equalities Random.int n = Random.int_in_range ~min:0 ~max:(n-1) et al, I don't think they should be guaranteed, even if they happen to hold in the current implementation. Two different ways of producing random numbers have no reason to produce the same random numbers.

@gmevel
Copy link
Contributor Author

gmevel commented Aug 3, 2023

I rebased, addressed all minor remarks but one, and in follow-up commits I addressed the "32-64-compatibility" question (the fact that, for 31-bit input, Random.full_int and Random.int_in_range yield identical output regardless of int size):

  • enforced it for Random.int_in_range;
    • the code is straightforward and I believe it is correct, but perhaps there is a clever way to reduce branching;
  • added clarifying comments for Random.full_int, and I also took the liberty of renaming internal functions as I suggested (intaux -> int31aux, int63aux -> full_int_aux);
  • documented it for both Random.full_int and Random.int_in_range;
  • added a test for both Random.full_int and Random.int_in_range.

No effort has been made to extend this compatibility to 32-bit integers (as found in JavaScript environments). Future work, as Xavier said.

Regarding the other compatibility property (the fact that full_int n = int_in_range ~min:0 ~max:(n-1), and similarly for other integer types): I mentioned it in an implementation comment. Since there is no consensus yet on whether it should be guaranteed, I did not document it, nor wrote a test for it.

On my machine, I have one build failure that I cannot understand:

  LINKC ocamlc
File "_none_", line 1:
Error: Generated bytecode executable "ocamlc" cannot be used on a 32-bit platform

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is wrong in int31_in_range_aux and more generally in all this fancy footwork for 31/32/63 bit compatibility. Time to go back to the drawing board. In the meantime, it might be good to add tests to testsuite/tests/lib-random/chi2.ml : bugs like "it should draw numbers in [-1000,1000] but the result is always positive" are nicely caught by such statistical tests.

stdlib/random.ml Outdated Show resolved Hide resolved
Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the code for int_in_range carefully, going so far as to draw the flowchart, and I think it is correct, although there may be ways to write this logic more concisely. (I'll play with it later as I try to reintroduce the JS-of-OCaml special case that went away between 4.14 and 5.0.)

The chi-square tests may have biases, see below.

The 32/64 compatibility test is a good idea. It seems to be failing on 32-bit x86, see Github CI logs, but I didn't investigate.

testsuite/tests/lib-random/chi2.ml Outdated Show resolved Hide resolved
testsuite/tests/lib-random/compat_32_64.ml Outdated Show resolved Hide resolved
@xavierleroy
Copy link
Contributor

For the record, here is Euclidean division from truncating division :

let ediv a b =
  let q = a / b and r = a mod b in if r >= 0 then q else q - 1

@gmevel
Copy link
Contributor Author

gmevel commented Sep 12, 2023

I just pushed a fix to the chi2 tests. Because of the way I had chosen constants, the coverage of some tests was very poor. In particular, it was not catching the biased test spotted during review. Now, if we replace unsigned divisions with signed divisions, “suspicious results” are reported as expected.

Quoting the commit message:

There are chi2 tests for int_in_range and similar, with intervals
whose length is comprised between 2^k and 2^(k+1) for some k; in some
tests, the length is 256*p, where p is some prime number such that
2^(k-8) < p < 2^(k-7).

Previously, p had been chosen to be the smallest prime number larger
than 2^(k-8), which made it very unlikely to draw samples larger than
min_ + 2^k; hence, the coverage of the test was very poor.

Now, p is chosen near 2^(k-8) * 3/2.

Do you need anything from me?

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At long last I was able to return to this code. The current state looks good to me in terms of functionality and algorithms used. The code is sometimes hard to read, see below for an example.

Cc @Octachron.

stdlib/random.ml Outdated Show resolved Hide resolved
@xavierleroy
Copy link
Contributor

FYI: starting from this PR, I have two extensions in preparation:

  • xavierleroy@a73dddd refactors the code of Random.{int,full_int,int_in_range} to share more code between the 32-bit OCaml case and the general case.
  • xavierleroy@5fd0768 adds a second special case for compatibility with JavaScript and its 32-bit integers. This compatibility is there in 4.14 for Random.full_int (PR Add Random.full_int #9489) but I wrongly removed it when reimplementing Random for 5.0.

@gasche
Copy link
Member

gasche commented Oct 4, 2023

@gmevel we discussed this PR at today's triaging meeting; it seems that the ball is in your court. When you have time, could you consider Xavier's last-minute comments (and decide to change your code or not) so that we can merge?

@xavierleroy
Copy link
Contributor

I have submitted my changes as a pull request against this pull request: gmevel#1 .

We will also need a second reviewer from the core dev team, since this is a standard library change.

@gmevel
Copy link
Contributor Author

gmevel commented Oct 4, 2023

My apologies, I misunderstood that Xavier’s commits were part of a PR that would supersede this one, or that he would push his commits to this PR. Do you just need me to pull Xavier’s commits (I just did)?

@gmevel
Copy link
Contributor Author

gmevel commented Oct 4, 2023

  • pulled Xavier’s commits
  • rebased
  • added a separate changelog entry for the JavaScript compatiblity (under the Standard Library section, feel free to move it to the Bugfixes section if you think that’s more appropriate)
    • reviewer to be filled in
  • documented 5.2 as the first version with the new functions.

Should the JavaScript compatibility be mentioned in the doc? As the 32-bit OCaml / 64-bit OCaml already is.

@xavierleroy
Copy link
Contributor

As far as I can say, this PR is good for merging. (Plus, I'm tired of reviewing it.) A second approval from a core developer is required, however.

@gasche
Copy link
Member

gasche commented Oct 20, 2023

Would @damiendoligez or @Ekdohibs be interested in reviewing this PR?

stdlib/random.ml Outdated
We must have [-2{^nbits - 1} <= min <= max < 2{^nbits - 1}].
Moreover, for the iteration to converge quickly, the interval
[[min, max]] should have width at least [2{^nbits - 1}]. *)
let rec int_in_range_alt s ~min ~max ~nbits =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we document that near the 2^{nbits - 1} limit, we have a quite high standard deviation in the number of samples (around 2+ε)?
The suffix _alt is also quite unclear, and it is not reused for the other variants of those functions: Maybe int_in_large_range would be more readable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done (7ebfbe4).

gmevel and others added 21 commits November 18, 2023 10:26
Explain a non-obvious bit of the implementation.
New functions:
- `Random.int_in_range`
- `Random.int32_in_range`
- `Random.int64_in_range`
- `Random.nativeint_in_range`
- `Random.State.int_in_range`
- `Random.State.int32_in_range`
- `Random.State.int64_in_range`
- `Random.State.nativeint_in_range`
- move the note about @before 5.0 to the top of the module,
  because it affects all functions rather than `bits` specifically
- document all the Invalid_argument raised
This was already the case of `Random.full_int`.
Comments have been added in the implementation to clarify that point.

A test has been added, which covers `full_int` and `int_in_range`.
- `intaux` -> `int31aux` (because this one is for 31-bit integers)
- `int63aux` -> `full_int_aux` (because this one is for any value of
  type `int`, regardless of the int size)
There are chi2 tests for `int_in_range` and similar, with intervals
whose length is comprised between 2^k and 2^(k+1) for some k; in some
tests, the length is 256*p, where p is some prime number such that
2^(k-8) < p < 2^(k-7).

Previously, p had been chosen to be the smallest prime number larger
than 2^(k-8), which made it very unlikely to draw samples larger than
min_ + 2^k; hence, the coverage of the test was very poor.

Now, p is chosen near 2^(k-8) * 3/2.
…bility

JS-of-OCaml uses JavaScript's 32-bit integers.  This commit makes sure
that we get the same results with JavaScript and with 64-bit OCaml.
For `Random.full_int` this was guaranteed in OCaml 4.14 but the code
to do so was wrongly removed in 5.0.
also:
- reword "fit in X bits" to "fit in X-bit signed integers" (disambiguation)
- rename [int_in_range_gen] to [int_in_range_aux] (consistency with [int_aux])
Rename `int_in_range_aux` to `int_in_large_range`.
Comment on the expected number of draws.
Copy link
Member

@Octachron Octachron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two rejection samplings are correct and restore the OS-independence of integer sampling.

Playing around with nodes confirmed that the implementation works and is testable with the js_of_ocaml backend (even if I am now tempted to generalize a bit the chi2 testing).

@Octachron
Copy link
Member

And in term of API, the int_in_range functions are certainly subtle enough to warrant inclusion in the Random module.

@gasche
Copy link
Member

gasche commented Nov 21, 2023

Great! Thanks @Octachron, this should now be ready to merge.

The last remaining issue is the commit history. It is not so clean right now and I wonder what we would want to do about it.

@xavierleroy
Copy link
Contributor

The last remaining issue is the commit history. It is not so clean right now and I wonder what we would want to do about it.

Squash! With a killer commit message that I'm about to write.

@xavierleroy xavierleroy merged commit 104656f into ocaml:trunk Nov 22, 2023
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants