Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] add hl.pgenchisq #12605

Merged
merged 2 commits into from
Jan 23, 2023
Merged

[query] add hl.pgenchisq #12605

merged 2 commits into from
Jan 23, 2023

Conversation

danking
Copy link
Contributor

@danking danking commented Jan 18, 2023

CHANGELOG: Add hl.pgenchisq the cumulative distribution function of the generalized chi-squared distribution.

The Generalized Chi-Squared
Distribution
arises from weighted sums of sums of squares of independent normally distributed variables and is used by hl.skat to generate p-values. The simplest formulation I know for it is this:

w     : R^n
k     : Z^n
lam   : R^n
mu    : R
sigma : R

x   ~ N(mu, sigma^2)
y_i ~ NonCentralChiSquared(k_i, lam_i)

Z = x + w y^T
  = x + sum_i{ w_i y_i }
Z ~ GeneralizedNonCentralChiSquared(w, k, lam, mu, sigma)

The non-central chi-squared distribution arises from a sum of independent normally distributed variables with non-zero mean and unit variance. The non-centrality parameter, lambda, is defined as the sum of the squares of the means of each component normal random variable.

Although the non-central chi-squared distribution has a closed form implementation (indeed, Hail implements this CDF: hl.pchisqtail), the generalized chi-squared distribution does not have a closed form. There are at least four distinct algorithms for evaluating the CDF. To my knowledge, the oldest one is by Robert Davies:

Davies, Robert. "The distribution of a linear combination of chi-squared
random variables." Applied Statistics 29 323-333. 1980.

The original publication includes a Fortran implementation in the publication. Davies' website also includes a C version.

Hail includes a copy of the C version as davies.cpp. I suspect this code contains undefined behavior. Moreover, it is not supported on Apple M1 machines because we don't ship binaries for that platform.

It seemed to me that the simplest solution is to port this algorithm to Scala. This PR is that port. I tested against the 39 test cases provided Davies with the source code. I also added some doctests based on the CDF plots from Wikipedia. The same 39 test cases are tested in Scala and in Python.

I am open to suggestions for the name. pgenchisq seems to strike a balance between clarity and brevity.

I believe this is the first CDF which can fail to converge. I included some relevant debugging information. I think we should standardize on a schema, but I need more examples before I am certain of the right standard.

I am open to critique of GeneralizedChiSquaredDistribution.scala but I will strongly argue against significant refactoring. I worry that we will subtly break this algorithm.

I directly reached out to Robert Davies to clarify the licensing of this algorithm. It appears to have been released at least under both GPL2 and MIT by unaffiliated third parties (who, really, have no right to apply a license to it). Do not remove WIP until I resolve this.

With this PR in place, hl.skat can be implemented entirely in Python.

@danking
Copy link
Contributor Author

danking commented Jan 18, 2023

OK, based on personal e-mail correspondence, I can confirm that the Davies algorithm C code is MIT licensed.

@danking danking removed the WIP label Jan 18, 2023
Dan King added 2 commits January 20, 2023 11:08
CHANGELOG: Add `hl.pgenchisq` the cumulative distribution function of the generalized chi-squared distribution.

The [Generalized Chi-Squared
Distribution](https://en.wikipedia.org/wiki/Generalized_chi-squared_distribution)
arises from weighted sums of sums of squares of independent normally distributed
variables and is used by `hl.skat` to generate p-values. The simplest
formulation I know for it is this:

    w     : R^n
    k     : Z^n
    lam   : R^n
    mu    : R
    sigma : R

    x   ~ N(mu, sigma^2)
    y_i ~ NonCentralChiSquared(k_i, lam_i)

    Z = x + w y^T
      = x + sum_i{ w_i y_i }
    Z ~ GeneralizedNonCentralChiSquared(w, k, lam, mu, sigma)

The non-central chi-squared distribution arises from a sum of independent
normally distributed variables with non-zero mean and unit variance. The
non-centrality parameter, lambda, is defined as the sum of the squares of the
means of each component normal random variable.

Although the non-central chi-squared distribution has a closed form
implementation (indeed, Hail implements this CDF: `hl.pchisqtail`), the
generalized chi-squared distribution does not have a closed form. There are at
least four distinct algorithms for evaluating the CDF. To my knowledge, the
oldest one is by Robert Davies:

    Davies, Robert. "The distribution of a linear combination of chi-squared
    random variables." Applied Statistics 29 323-333. 1980.

The [original publication](http://www.robertnz.net/pdf/lc_chisq.pdf) includes a
Fortran implementation in the publication. Davies'
[website](http://www.robertnz.net/QF.htm) also includes a C version.

Hail includes a copy of the C version as `davies.cpp`. I suspect this code
contains undefined behavior. Moreover, it is not supported on Apple M1 machines
because we don't ship binaries for that platform.

It seemed to me that the simplest solution is to port this algorithm to
Scala. This PR is that port. I tested against the 39 test cases provided Davies
with the source code. I also added some doctests based on the CDF plots from
Wikipedia. The same 39 test cases are tested in Scala and in Python.

I am open to suggestions for the name. `pgenchisq` seems to strike a balance
between clarity and brevity.

I believe this is the first CDF which can fail to converge. I included some
relevant debugging information. I think we should standardize on a schema, but I
need more examples before I am certain of the right standard.

I am open to critique of `GeneralizedChiSquaredDistribution.scala` but I will
strongly argue against significant refactoring. I worry that we will subtly
break this algorithm.

I directly reached out to Robert Davies to clarify the licensing of this
algorithm. It appears to have been released at least under both GPL2 and MIT by
unaffiliated third parties (who, really, have no right to apply a license to
it). Do not remove WIP until I resolve this.

With this PR in place, `hl.skat` can be implemented entirely in Python.
@danking danking merged commit 20fc42f into hail-is:main Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants