-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add random functions #1260
Add random functions #1260
Conversation
Hi. Thanks for this contribution. It's very nice. What I'd like though is to use |
This change is implemented, it will need to be repushed when appveyor builds are fixed to make sure it works on windows. A number of questions / information about the patch:
I also wanted to be able to specify a random seed to get deterministic random output. Since |
I guess we could add a |
BTW, there's more liberally licensed code in an IETF RFC for the various SHA functions. We could just use HMAC-SHA256 with a key generated from /dev/urandom, first applied to a constant, outputting half the HMAC output as a the output and using the other half as the input to the next iteration. Also, the Mersenne Twister, or any LFSR, could be used as a non-cryptographically-secure PRNG, because that will be faster than an HMAC-based one. I can probably take over from here next week, maybe, if you don't want to put up with all my requests! :) |
I'm okay putting up with these requests, they're not hard. I guess before I update this, it's reasonable to discuss exactly what interfaces we/you want, so I don't waste any time. For the record, I'm going to be talking about *nix systems, I don't really care about windows...
In my mind, the only ones that seem unnecessary are
The next question would be what the cli would look like. I could see something somewhat like the As far as the license goes, it seems like we probably just need to have a license file available whenever anyone gets the binary, e.g. a download option on the website, and a licenses folder for things like ubuntu or otherwise where the binary is just distributed. The authors asked to get an email if it was ever used, so I can also ask if they have a specific request of how thy want the license to be available. |
Just (2) and (3). No /dev/random, only urandom, and no user-provided seeds. If you want to provide an env var for user seeding, that's fine with me. I see about the license -- that's fine. I'd also provide a |
Thanks for the help / advice / putting up with all of these questions. |
On a side note: Makoto Matsumoto was very prompt, and said:
So I'll make sure an update to the website is included in the PR that lists the license. |
I've no objection to the builtin functions you propose. |
I tried but couldn't find a cryptographically secure implementation of a DRBG that I thought was good enough to just use. The IETF does have nice implementations of HMAC-SHA* but NIST's recommendations on how to turn that into a DRBG are relatively complicated and I'd rather follow the first rule of crypto, "don't implement your own crypto", even if I have an implementation of HMAC-SHA256. If you (or anyone else who reads this) wants to try to implement something, the following script should download the IETF code in a way that the tests will compile and run. It seems like their license is also close to BSD, so I would add a copyright notice to the download page, like I did for the TinyMT implementation.
Other things that you may have an opinion about in this future change:
|
src/rand.c
Outdated
void jq_rand_init() { | ||
uint64_t seed; | ||
FILE *devur = fopen("/dev/urandom", "r"); | ||
if (fread(&seed, sizeof(seed), 1, devur) != 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to check that devur
isn't NULL... That should raise an error.
Ideally we could #ifdef out on WIN32
so that on Windows these builtins elicit a compile-time error.
We should also have a notion of unsupported builtins so that a compile-time error about them can be more user-friendly.
And, lastly, WIN32 support, using CNG, would be nice.
Lastly, getentropy(2)
or getrandom(2)
might be better than reading from /dev/urandom
where those are supported.
To be clear, I'm not asking for Windows support though, or for support for getentropy(2)
/getrandom(2)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does actually check at compile time for the existence of the "getrandom" syscall, and the "/dev/urandom" file. It's still possible for "dev/urandom" to be null, so you're right, it should be checked.
src/main.c
Outdated
@@ -528,6 +549,13 @@ int main(int argc, char* argv[]) { | |||
if (getenv("JQ_COLORS") != NULL && !jq_set_colors(getenv("JQ_COLORS"))) | |||
fprintf(stderr, "Failed to set $JQ_COLORS\n"); | |||
|
|||
// Randomness | |||
if (options & USER_DEFINED_SEED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we want user-defined seeds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I added in a user defined seed is for reproducability. Often time I need a source of numbers that seem random, but I also want to guarantee the the result is deterministic. That's why I added this option.
src/rand.c
Outdated
} | ||
|
||
double jq_rand_double() { | ||
return tinymt64_generate_double01(&fast_rand); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't we always read from /dev/urandom
(or getentropy(2)
or getrandom(2)
or CNG)?
I'm not keen on having our own PRNG. That way lies trouble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what's tinymt64_generate_double01()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I've mentioned previously, my motivation isn't cryptographic security, but often reproducability, which is why I used a PRNG and allowed specifying a seed. I agree that out own PRNG spell trouble, which is why I'm using code from the authors of Mersenne Twister as a submodule. tinymy64_generate_double01 is the name of their function that returns the next double in [0, 1).
Let's not add any PRNGs that are not cryptographically secure. Better yet, let's not add PRNGs. |
What I meant is that reading from |
Ultimately, I'm just aiming to be a contributor, and I'll respect whatever decisions you make. I would like the ability to specify a seed so that I can have reproducable randomness. A potential compromise would be to constantly pull from "getrandom" or "/dev/urandom" if no seed is specified, and only seed a PRNG if a seed is specified. The other decision to be made is how to handle when /dev/urandom and getrandom don't exist. Do you want to seed a PRNG with time like I'm doing now, or compile out the random functions entirely? |
I like that compromise. It might be good to have a differently named set of builtins that always use |
I considered that, but as is, this adds five new functions. I can't think of any reason why you'd want both secure random and PRNG generation in the same script, so having it behind the flag of |
As to Windows, for now just return a run-time error. As to a |
And thanks! |
I understand the desire to minimize command line args, but without the ability to set the seed, there's no way to get the reproducability from the PRNG. I could add a function to set the seed, but I can't think of an ideal way to do so. One potential implementation would take an argument to set the seed to and otherwise leave the input unchanged so you could do something like
And you put it through
I'm open to suggestions if you can think of a better way to set the seed in the face of input streams and/or the executive decision that this isn't a use case to support. If you still deem this unnecessary and would just prefer I implement taking all randomness from system sources I can do that too. P.S. I'm doing this because I like |
It should definitely be possible for users to set the seed in a straightforward way. Personally, I think a command-line flag would be fine, but there are two alternatives that would be possible: something based on
|
Well, the idea was to use As to @pkoppstein's idea of using a |
I like that idea. I'll look into how to catch setting that variable. Thanks for the feedback! |
- Functions include: `rand`, `randint`, `shuffle`, `rand_select`, and `rand_select_rep`. - By default these will pull from system random sources (/dev/urandom or getrandom) if they exist. Optionally a Mersenne Twister PRNG can be seeded with `--arg .seed <seed>` or `--argjson .seed <seed>`. This also allows using pseudoradom numbers on machines without hardware randomness. exists to add new PRNGs. - The build and test of TinyMT was added nonrecursively to the existing automake file, due to a) not needing the entire library, and b) the library not supporting a standard `check` target. fixes #677 and fixes #1038
Everything should be fixed. By default random functions will pull from the system random source, but by specifying a seed with either A downside of this is that the random jq tests and manual tests will fail on windows since no seed is specified. If you want, I could fix this by specifying a seed when running those tests, and adding a few shtests that don't use the seed with a guard around windows. However, it looks like test failures are ignored on windows builds as it is / the random tests aren't the only ones that fail. Maybe these failing tests will inspire a windows developer to add CNG support? |
This is to confirm that I had no problems installing PR #1260 on a "High Sierra" Mac:
The size of jq increased from 355872 to 362152. |
Being able to set the seed at the command-line is great, but having it does not preclude being able to set the seed programmatically as well. However, if making further changes in this PR is going to delay the next official release of jq, I'd vote for no delay. Regarding the details: what would |
I'm happy to add a set seed, and could have the modified PR in tonight. My feeling would be to provide both variants. A 0-arity that sets the seed to its input and returns the passed in seed, as well as a one argument version that otherwise works as the identity presuming the input is valid. As you mentioned, the 0-airty fits more in line with builtins, but with the 1-arity one could do things like:
Not that that example is particularly useful, but it seems like seeds will mostly come externally and it seems bad to force users to otherwise do
|
I'll review this later this week. We might not accept it into master until 1.6 is released. Alternatively we might create a branch for 1.6 (we have other things we want to push to master but not 1.6, so it's probably time to start using release branches). |
A set_seed that merely returned its input would be pretty pointless, I think; one that set the seed based on @nicowilliams - @erikbrinkman has evidently put much thought and care into this contribution, which I believe is important and deserves recognition, so I hope it makes it into 1.6 for both reasons. |
I see the purpose of
and it would work, instead of having to do something like:
However, setting the seed off of the time is probably only useful on windows right now as omitting it almost certainly gives you a better source of randomness. Ultimately, I'm kind of on the fence. Any of these functions are possible, but I'm not sure how necessary they are, or what primary usecase there is. if you just want to set the seed off of the time, then you can always do
|
@erikbrinkman - Maybe I need to be a bit clearer about a couple of things. First, I'd really like the randomness functions that you've already implemented (and documented) so well to be part of the next official jq release, and for that release, I'm fine with the functionality that is in place now (i.e. rand and friends, and the ".seed" method for setting the seed). Second, I believe that being able to set the seed programmatically (in addition to the command-line method) would be a Very Good Thing. Third, it seems to me that having
would probably be worse than useless, for the reason you already gave (i.e., to use it, one would have to use the long-winded idiom: On the other hand, with |
That all makes sense to me. I added a commit that adds the function The discussion of what's the most "jq" in style got me thinking about how integers are handled since jq and json don't have an integer concept. What I decided was that the random functions should more closely match how related functions handle inputs that should ostensibly be integers. I changed However, I decided that setting the seed should enforce that it's input matches it's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uniary
should be unary
@pkoppstein fixed... |
All very excellent, though I think it would be better for rand_range to behave even more like range, e.g. range(0;0) emits the empty stream rather than raising an error. |
I think arguably it should return |
@erikbrinkman - I have no strenuous objections to defining the functional semantics of
|
So if we define the semantics as it should return the same result as
|
- Also rename randint to rand_range and make treatment of floating points match the way other jq functions handle them.
@erikbrinkman - Yes, but I believe By the way, if you could easily provide some guidance regarding git, I'd appreciate it. When I tried fetching the latest version of your work, git complains:
Previously, I had successfully run: git fetch origin pull/1260/head:random Thanks. |
In the current PR, As for git, it looks like this error could be because you're on branch random, and it could be related to my commit amending. If you're on branch |
I tried various other things, all leading nowhere but frustration. |
@pkoppstein Actually, I think it should be |
|
@pkoppstein What if you delete the branch and refetch it?
|
@erikbrinkman - Thanks, that did it, though there was an ominous-looking warning:
|
@pkoppstein That's fine. There are build artifacts left in the TinyMT (the PRNG I chose) submodule so git couldn't delete it. Since you presumably checked random out again it doesn't matter, but you can also just delete the folder yourself. |
I'm also happy to report (on a High Sierra Mac):
If you're interested and have the time, it would be helpful if you could look at One of the mysteries here is that I am no longer able to recompile the prior versions that would help identify the commit that evidently introduced the problem. |
return type_error2(input, j_num, "can't select a non-number of elements"); | ||
} | ||
double length = jv_array_length(jv_copy(input)); | ||
double num = ceil(jv_number_value(jv_copy(j_num))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And - here goes the leak.
@wtlangford exactly my point about ambiguous method naming. jv_number_value
doesn't consume the parameter, jv_copy
here is unbalanced. Not visible until the changes where numbers become ref counted, so a kind of time bomb
Any update on this feature? Is anything blocking it? |
This should be superseded by #1843, which adds random and much more. |
rand
,randint
,shuffle
,rand_select
, andrand_select_rep
.rand
andsrand
), however they should be good enough for most purposes.fixes #677 and fixes #1038