New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Utils: Reimplement util.GetRandomString to avoid modulo bias #64481
Conversation
This pull request was removed from the 9.5.0 milestone because 9.5.0 is currently being released. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried running the tests a bunch and fuzzing this. With fuzzing, n
below zero and alphabets larger than 255 bytes misbehave (…which arguably isn't very likely unless there's repetition), otherwise this seems to work as intended. Those aren't problems for the ways we're using it, so it doesn't have to be addressed.
When running the test suite I noticed that particularly the alphanumeric test suite has a relatively noticeable flakiness that I don't think we should merge. I'm considering suggesting running your checks say three times and failing only if two of those runs fail. That should reduce the flakiness to "very low", even if it's still there as a consequence of what it's testing.
We should also have a note about it in the test 🤔
pkg/util/encoding_test.go
Outdated
// Ensure there is no more than 5% variance between lowest and highest frequency characters | ||
assert.LessOrEqual(t, float64(max-min)/float64(min), 0.05, "Variance between lowest and highest frequency characters must be no more than 5%") | ||
|
||
// Ensure chi-squared value is lower than the critical bound | ||
// 99.9% probability for 61 degrees of freedom | ||
assert.Less(t, chiSquared, 100.888, "Chi squared value must be less than the 99.9% critical bound") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to run this 10 000 times (it timed out after 10 minutes, I'm a little unsure how many rounds it got through, but since 1000 runs took a little over a minute, I don't think the lost runs on the end are too numerous) and it failed 165 times for the variance check and 3 times for the chi check, for a little over 1.5% failure rate.
I think that's a bit too frequent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll tweak the tests to make them a bit less sensitive, they're probably tighter than they need to be right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I relaxed the test criteria to 10% variance and 99.99% chi squared, 5k tests without any failures. The old code fails both chi squared and the string variance test. It passes the digits variance test which is expected because the skew was less than 10% when generating digits.
224eecb
to
8656fcd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
* reimplement GetRandomString, add tests that results are unbiased (cherry picked from commit 7e765c8)
* reimplement GetRandomString, add tests that results are unbiased (cherry picked from commit 7e765c8)
* reimplement GetRandomString, add tests that results are unbiased
We discovered that the
util.GetRandomString
function exhibits a bias, and when used to generate alphanumeric strings will produce the characters 0-7 with approximately 25% higher frequency on average than the other characters (8-9, A-Z, a-z) in the allowed character list.This is due to modulo bias, the fact that when we calculate the value of a random byte with value 0-255 module the number of allowed characters (62 by default), we get 4 input values that will produce each of the numbers 8-61 (mapped to characters 8-9, A-Z and a-z), but 5 inputs that will produce the numbers 0-7 (mapped to characters 0-7).
When we restrict the set of output characters to the numbers 0-9 we see a smaller effect, because there are 25 input values that produce the values 5-9 but 26 input values that map to the values 0-4, creating a 4% bias toward the numbers 0-4.
Further reading on modulo bias:
This PR reimplements
util.GetRandomString
following best practices to produce unbiased output regardless of the allowed characters. It applies the standard method of avoiding modulo bias, by discarding random values that are larger than the largest multiple of the number of characters in the output set that is smaller than 255.It also adds 2 different tests to verify that the function is working correctly by comparing the relative frequencies of the highest and lowest-frequency characters in the output, and by using a chi-squared test.
We generate ~25% more random bytes than output characters to avoid multiple calls to
rand.Read
(which are slow), but if that isn't enough we will read again (and again if needed) until we have enough random bytes to generate the output string. In my testing the performance is very similar to the existing function, in part due to some efficiencies in go offsetting the need to read more bytes withrand.Read
.We use
util.GetRandomString
for various purposes including generating random filenames but also for generating cryptographic salt values and api key secrets, so the impact of this bias on the security of Grafana needed to be assessed.@kelnage performed an analysis of the reduction in entropy of the current generated strings, with the following results:
@jmatosgrafana performed analysis on the current and proposed implementations:
At this point our assessment is that the current bias does not constitute a security issue as we still have sufficient entropy for the ways we use the generated strings. This PR improves that entropy value and brings our measured entropy closer to the theoretical values.