
Validate the statistics of the random distributions #15911

Open
charris opened this issue Apr 4, 2020 · 11 comments

@charris
Member

charris commented Apr 4, 2020

Possible candidate for GSoC. The testing will be long running, so it should be separate and run occasionally. The organization is open to discussion.


@anirudh2290
Member

Is NumPy participating in GSoC this year?

@charris
Member Author

charris commented Apr 5, 2020

No, I'm thinking of next year.

@GabrielSimonetto

GabrielSimonetto commented Jan 2, 2021

@charris do you have any references on this subject, or any starting points? I would like to understand how this could be done and how we would know when the work is done.

@mdhaber
Contributor

mdhaber commented Sep 12, 2023

@charris were you still interested in this being done?

I'm assuming you meant to perform hypothesis tests in which the null hypothesis is that a sample from the RNG is drawn from a hypothesized distribution. A small p-value would be reason to take a closer look. Of course, you could get false positives (especially if many tests are to be performed), and a large p-value is inconclusive.

SciPy has two tests that would work out of the box for any univariate distribution.

  • ks_1samp - the best known, and actually reasonable to use in this situation since all the parameters of the hypothesized distribution are known
  • cramervonmises - another option

goodness_of_fit can perform randomized versions of the two tests above as well as the Anderson-Darling test and Filliben's test. The caveat is that it needs to be able to produce random samples from the hypothesized distribution. Without care, this could lead to a circular test. If you trust NumPy's uniform RNG, though, you can generate samples using inverse transform sampling or something from scipy.stats.sampling.

import numpy as np
from scipy import stats
rng = np.random.default_rng(35823458245)
n = 1000

shape = 1.
loc = 0.
scale = 1.
sample = rng.gamma(shape=shape, scale=scale, size=n)

dist = stats.gamma
known_params = dict(a=shape, loc=loc, scale=scale)

res = stats.ks_1samp(sample, dist(**known_params).cdf)
assert res.pvalue > 0.05

res = stats.cramervonmises(sample, dist(**known_params).cdf)
assert res.pvalue > 0.05

for statistic in ('ad', 'filliben'):  # also supports 'ks' and 'cvm'
    # As written, this test is somewhat circular, since it will use `dist.rvs`
    # to draw random samples, and `rvs` methods rely on NumPy.
    # It can be modified to perform a valid test, though.
    res = stats.goodness_of_fit(dist, sample, statistic=statistic,
                                known_params=known_params)
    assert res.pvalue > 0.05
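
One way to break that circularity, sticking with the suggestion above of trusting only the uniform bit stream: build a reference sample by inverse transform sampling (`dist.ppf` applied to uniform draws) and compare it with the gamma sample using a two-sample test. A minimal sketch; the seed, sample size, and the choice of `ks_2samp` are illustrative, not prescriptive.

import numpy as np
from scipy import stats

rng = np.random.default_rng(35823458245)
n = 1000
shape, loc, scale = 1.0, 0.0, 1.0
dist = stats.gamma(a=shape, loc=loc, scale=scale)

# Sample under test: NumPy's gamma generator.
sample = rng.gamma(shape=shape, scale=scale, size=n)

# Reference sample built only from the uniform bit stream via inverse
# transform sampling, so it does not depend on NumPy's gamma algorithm.
reference = dist.ppf(rng.random(size=n))

# Two-sample test: are both samples consistent with a common distribution?
res = stats.ks_2samp(sample, reference)
assert res.pvalue > 0.05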

Also, since the parameters of the underlying distribution are known, one could take the CDF of the sampled values and use any test developed to validate a uniform RNG (e.g. here).
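
For concreteness, a minimal sketch of that transform, using the standard normal generator purely as an example and `kstest` as a stand-in for whichever uniform-RNG test one prefers:

import numpy as np
from scipy import stats

rng = np.random.default_rng(20230912)
sample = rng.standard_normal(100_000)

# Probability integral transform: under the null hypothesis these values
# should be uniform on [0, 1].
u = stats.norm.cdf(sample)

# Any uniformity test could be plugged in here; kstest is just one option.
res = stats.kstest(u, stats.uniform.cdf)
assert res.pvalue > 0.05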

Perhaps @steppi has other ideas (e.g. Bayesian methods)?

@charris
Member Author

charris commented Sep 12, 2023

Long ago the legacy random distributions were tested to see if they produced the correct distributions. The usual method is to map the output values so that the result should be uniformly distributed and check that. @rkern, @bashtage would have a better idea of whether it would be desirable to do that with the new code.

@bashtage
Contributor

Would be a good project for someone. I think it would be particularly helpful to verify the distribution near various boundaries, especially when the variates are generated using a rejection sampler. This would lead to practical advice about the effective range of parameter values; e.g., a formal requirement in the distribution that x >= 0 might mean that in practice either x == 0 or x > 1e-8.
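
As a purely illustrative probe of that kind of boundary behavior (the gamma sampler and the extreme shape value below are arbitrary choices, not a known problem case), one could count exact zeros and look at how small the positive variates get:

import numpy as np

rng = np.random.default_rng(0)
shape = 1e-3  # an extreme shape parameter, chosen only for illustration
x = rng.gamma(shape, size=1_000_000)

# Probe the lower boundary: does "x >= 0" admit exact zeros in practice,
# and how small do the positive variates get?
print("exact zeros:        ", np.count_nonzero(x == 0))
print("smallest positive:  ", x[x > 0].min())
print("fraction below 1e-8:", np.mean(x < 1e-8))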

@mdhaber
Contributor

mdhaber commented Sep 15, 2023

Example: zipf with `a` near 1 or large (gh-9829). This is something we're working toward for distributions in SciPy.
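
A rough sketch of the kind of comparison that issue calls for, assuming `scipy.stats.zipf` as the reference pmf and an arbitrary exponent near the `a > 1` boundary:

import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)
a = 1.1  # exponent near the a > 1 boundary, chosen only for illustration
n = 200_000
sample = rng.zipf(a, size=n)

# Compare empirical frequencies of the first few support points against the
# theoretical pmf from scipy.stats.zipf.
k = np.arange(1, 11)
observed = np.array([(sample == ki).mean() for ki in k])
expected = stats.zipf(a).pmf(k)
print(np.column_stack((k, observed, expected)))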

@jakevdp
Contributor

jakevdp commented Sep 15, 2023

I'm interested to see what ideas are generated here. In JAX we use KS tests and chi-squared tests to validate distributions in jax.random, but currently this leads to very slow tests (we generate many samples for statistical robustness), and even with that many samples we still see statistically expected false-positive test failures when we change the random seed! (After all, a threshold of p < 0.05 means a correct implementation is still expected to fail about one test in 20 due to statistical noise.)

I'm sure there are better ideas out there, and I'd love to hear them!
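
For reference, one common shape of such a chi-squared check (only a sketch with an arbitrary distribution and bin count, not JAX's actual test code): map the hypothesized distribution into equal-probability bins via its ppf, then compare the observed counts against the flat expected counts.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100_000
scale = 2.0
sample = rng.exponential(scale=scale, size=n)

# Equal-probability bin edges from the hypothesized distribution's ppf, so
# every bin has the same expected count under the null hypothesis.
n_bins = 50
edges = stats.expon(scale=scale).ppf(np.linspace(0, 1, n_bins + 1))
observed, _ = np.histogram(sample, bins=edges)

# With equiprobable bins, chisquare's default of equal expected frequencies applies.
res = stats.chisquare(observed)
assert res.pvalue > 0.05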

@WarrenWeckesser
Member

There must be an extensive body of literature and prior work on this. Is anyone familiar with the fundamental literature or existing software in this area? For the low-level RNG algorithms, there are the "diehard tests".

For the reasons already mentioned, a single p-value from a single sample has limited value as a rigorous test. FWIW, when I added the multivariate_hypergeometric distribution to NumPy in #13794, I validated the algorithms by running many trials, with each trial generating a p-value from a G-test (a slight variation of the chi-squared test) that compared the actual and expected frequencies of the samples. I then plotted a histogram of the p-values and visually checked for any anomalies (so it was not an automated process).
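
A sketch of that workflow (the Poisson distribution, bin edges, and trial count below are illustrative stand-ins, not the ones used in #13794), using `scipy.stats.power_divergence` with `lambda_="log-likelihood"` as the G-test:

import numpy as np
from scipy import stats

rng = np.random.default_rng(987654321)
mu = 3.0        # Poisson mean (an arbitrary illustrative choice)
n = 10_000      # sample size per trial
n_trials = 200  # number of independent trials

# Bins 0..8 plus a lumped "k > 8" tail, with expected counts from the pmf.
k = np.arange(0, 9)
pmf = stats.poisson(mu).pmf(k)
expected = n * np.append(pmf, 1.0 - pmf.sum())

pvalues = []
for _ in range(n_trials):
    sample = rng.poisson(mu, size=n)
    counts = np.bincount(np.minimum(sample, 9), minlength=10)
    # G-test = power_divergence with the log-likelihood-ratio statistic.
    res = stats.power_divergence(counts, expected, lambda_="log-likelihood")
    pvalues.append(res.pvalue)

# Under the null, the p-values should be roughly uniform on [0, 1];
# bin them (or plot a histogram) and look for clustering near 0.
print(np.histogram(pvalues, bins=10, range=(0, 1))[0])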

@charris
Member Author

charris commented Sep 15, 2023

I think diehard is considered outdated. @rkern Recommendations?

@rkern
Member

rkern commented Sep 15, 2023

There must be an extensive body of literature and prior work on this.

Yes and no! On the no side, even the low-level PRNG tests are of the form "make a bunch of NHSTs (null-hypothesis significance tests) and check the p-values; be suspicious of clusters of extreme values". And their concerns are largely distinct from ours. The low-level PRNG tests mostly spend their effort on testing for independence, since the uniform distribution properties are much easier for those algorithms to obtain. We have complementary concerns here: independence is relatively straightforward when relying on a quality low-level PRNG, while the fine details of the distribution, particularly in tail regions and at extreme parameters, are of more concern.

On the yes side, testing whether a set of samples follows a given distribution is the bread and butter of NHSTing in general. The problem for us is that we are generally implementing the numerical CDF/PPFs and the rvs at the same time, often in terms of each other, and testing both of them at once. We're looking for pretty fine deviations over large numbers of draws, while real-data statistical practice is looking for rather larger deviations in smaller data sets (and the low-level PRNG test batteries are looking for even finer deviations).

NHSTs are okay for screening for problem areas, but once alerted to a potential problem area, then you have to go in and look at the generated values and compare them with the distributions to see how the issue actually presents itself. Because it's floating point, there are going to be issues somewhere; there is little chance of recovering the perfect uniformity of the BitGenerator.

I think diehard is considered outdated.

FWIW, PractRand is what I use. TestU01 is also well-regarded (if you hear about passing/failing "BigCrush", that's TestU01). Both are ... venerable, so I use lemire/testingRNG to get them built and runnable with the latest community patches.

But neither is particularly relevant for this issue. They test low-level bit-spouting PRNG algorithms, not floating point continuous distributions. Reading about their test suite philosophy and interpretation of extreme p-values is mildly applicable, but not by much.
