CHANGELOG: Add `hl.pgenchisq`, the cumulative distribution function of the generalized chi-squared distribution.

The generalized chi-squared distribution arises from weighted sums of sums of squares of independent normally distributed variables and is used by `hl.skat` to generate p-values. The simplest formulation I know for it is this:

The non-central chi-squared distribution arises from a sum of squares of independent normally distributed variables with non-zero mean and unit variance. The non-centrality parameter, lambda, is defined as the sum of the squares of the means of each component normal random variable.
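
As a rough illustration of these definitions, here is a minimal Monte Carlo sketch in plain Python. It assumes the generalized chi-squared variable is a weighted sum of independent non-central chi-squared variables plus a normal term. The names `w`, `k`, `lam`, `mu`, `sigma` and both function names are illustrative assumptions, not code from this PR, and sampling is far too slow and imprecise to replace a real CDF algorithm.

```python
import random


def sample_generalized_chisq(w, k, lam, mu, sigma, rng):
    """Draw one sample of N(mu, sigma^2) + sum_j w[j] * X_j,
    where X_j is non-central chi-squared with k[j] degrees of freedom
    and non-centrality lam[j]. All names here are hypothetical."""
    total = rng.gauss(mu, sigma)
    for w_j, k_j, lam_j in zip(w, k, lam):
        # Put the whole non-centrality on the first normal component:
        # its mean is sqrt(lam_j), so the sum of squared means is lam_j.
        x = rng.gauss(lam_j ** 0.5, 1.0) ** 2
        # The remaining k_j - 1 components are standard normals.
        x += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k_j - 1))
        total += w_j * x
    return total


def monte_carlo_cdf(x, w, k, lam, mu, sigma, n=100_000, seed=0):
    """Estimate P(Q <= x) as the fraction of samples at or below x."""
    rng = random.Random(seed)
    hits = sum(
        sample_generalized_chisq(w, k, lam, mu, sigma, rng) <= x
        for _ in range(n)
    )
    return hits / n


# One weight of 1, two degrees of freedom, no non-centrality, no normal term:
# this is an ordinary chi-squared distribution with 2 degrees of freedom.
estimate = monte_carlo_cdf(2.0, w=[1.0], k=[2], lam=[0.0], mu=0.0, sigma=0.0)
```

In that special case the exact CDF is 1 - exp(-x/2), so `estimate` should land near 0.632, up to sampling noise.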
Although the non-central chi-squared distribution has a closed form implementation (indeed, Hail implements this CDF: `hl.pchisqtail`), the generalized chi-squared distribution does not have a closed form. There are at least four distinct algorithms for evaluating the CDF. To my knowledge, the oldest one is by Robert Davies:

The original publication includes a Fortran implementation. Davies' website also includes a C version.
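
To give a feel for what numerically evaluating such a CDF involves, here is a minimal sketch that inverts the characteristic function with the Gil-Pelaez formula. This is not Davies' algorithm and not the code in this PR: it handles only the central case (no non-centrality, no normal term), uses a fixed-step midpoint rule, and has no error control. The function name, its parameters, and the truncation settings are assumptions chosen for illustration.

```python
import cmath
import math


def central_weighted_chisq_cdf(x, w, k, upper=200.0, step=0.01):
    """Approximate P(sum_j w[j] * chi2(k[j]) <= x) via Gil-Pelaez inversion:
    F(x) = 1/2 - (1/pi) * integral_0^inf Im[exp(-i t x) * phi(t)] / t dt.
    The integral is truncated at `upper` (a hypothetical, fixed cutoff)."""
    n_steps = int(upper / step)
    integral = 0.0
    for i in range(n_steps):
        t = (i + 0.5) * step  # midpoint rule, so t = 0 is never evaluated
        # Characteristic function of the weighted sum:
        # phi(t) = prod_j (1 - 2 i w_j t)^(-k_j / 2)
        phi = complex(1.0, 0.0)
        for w_j, k_j in zip(w, k):
            phi *= (1 - 2j * w_j * t) ** (-k_j / 2)
        integral += (cmath.exp(-1j * t * x) * phi).imag / t * step
    return 0.5 - integral / math.pi


# Two unit-weight terms with one degree of freedom each form a chi-squared
# variable with 2 degrees of freedom, whose exact CDF is 1 - exp(-x / 2).
approx = central_weighted_chisq_cdf(2.0, w=[1.0, 1.0], k=[1, 1])
exact = 1 - math.exp(-1.0)
```

A fixed cutoff and step size like this give no accuracy guarantee in general; choosing them with explicit error bounds is the hard part that a production algorithm has to solve.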
Hail includes a copy of the C version as `davies.cpp`. I suspect this code contains undefined behavior. Moreover, it is not supported on Apple M1 machines because we don't ship binaries for that platform.

It seemed to me that the simplest solution is to port this algorithm to Scala. This PR is that port. I tested against the 39 test cases that Davies provides with the source code. I also added some doctests based on the CDF plots from Wikipedia. The same 39 test cases are tested in Scala and in Python.
I am open to suggestions for the name. `pgenchisq` seems to strike a balance between clarity and brevity.

I believe this is the first CDF which can fail to converge. I included some relevant debugging information. I think we should standardize on a schema, but I need more examples before I am certain of the right standard.
I am open to critique of `GeneralizedChiSquaredDistribution.scala` but I will strongly argue against significant refactoring. I worry that we will subtly break this algorithm.

I directly reached out to Robert Davies to clarify the licensing of this algorithm. It appears to have been released at least under both GPL2 and MIT by unaffiliated third parties (who, really, have no right to apply a license to it). Do not remove WIP until I resolve this.
With this PR in place, `hl.skat` can be implemented entirely in Python.