[query] lower logistic SKAT #12643
Conversation
Force-pushed from 121d147 to bce7b7d
CHANGELOG: `hl.skat(..., logistic=True)` now supported in the Batch backend.
@@ -2140,8 +2567,19 @@ def skat(key_expr,
    Table of SKAT results.
    """
    if hl.current_backend().requires_lowering and not logistic:
        ht = hl._linear_skat(key_expr, weight_expr, y, x, covariates, max_size, accuracy, iterations)
    if hl.current_backend().requires_lowering:
In the second phase, we'll benchmark and address speed problems. Until we're certain it is fast enough, I don't want to enable it for all backends.
[2, 1, 1, 1, 0, 1, 1, 2, 1, 1, 2, 1, 0, 0, 1],
[1, 0, 1, 1, 1, 2, 0, 2, 1, 1, 0, 1, 1, 0, 0],
[0, 2, 0, 0, 2, 1, 1, 2, 2, 1, 1, 1, 0, 1, 1],
[1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0]]
I needed to create a matrix on which the logistic model converged.
Force-pushed from bce7b7d to 6165ffb
@@ -105,7 +186,124 @@ def test_linear_skat_no_weights_R_truth():
    assert result.fault == 0


def test_linear_skat_R_truth():
This test is inadvertently duplicated in main. I just removed it entirely, but the diff is a bit awkward.
Hmm. I trust the code now. I test against several R SKAT runs. I'm not sure I understand how we derive that Q is generalized chi-squared distributed. We use the residual phenotypes in the calculation of Q, but those are inverse-logit transformed normal variables. The derivation for the linear case doesn't apply, as far as I can tell. I assume the residuals are Bernoulli distributed? Maybe not. I guess the phenotypes are Bernoulli but the errors aren't? I'm not sure.
bump @patrick-schultz ideally we would merge this before Wednesday's meeting!
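For reference, a minimal sketch of the test statistic from the original SKAT paper (Wu et al., 2011), which is what motivates the generalized chi-squared claim; the notation here is mine, not the PR's:

```latex
Q = (y - \hat{\mu})^T G W G^T (y - \hat{\mu})
```

where \(\hat{\mu}\) is the vector of fitted null-model probabilities and \(W\) is the diagonal matrix of squared variant weights. Under the null, \(Q\) is asymptotically a weighted sum of independent \(\chi^2_1\) variables, with weights given by the eigenvalues of a matrix built from \(G\), \(W\), and the null-model variances. That asymptotic mixture is the "generalized chi-squared" distribution; no normality of the residuals themselves is required.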
hail/python/hail/methods/statgen.py
Outdated
# See linear SKAT code comment for an extensive description of the mathematics here.

sqrtv = hl.sqrt(ht.s2)
Q, _ = hl.nd.qr((ht.covmat.T * sqrtv).T)
Should use ht.covmat.T * sqrtv.reshape(-1, 1)
covmat is K by N_samples. sqrtv is N_samples. The code I have above scales each sample by its s2, e.g.:
In [7]: import hail as hl
...: import numpy as np
...: s2 = np.array([1,2,3,4])
...: covmat = np.array([[1,1],[1,2],[1,3],[1,1]])
...:
...: hl.eval((hl.literal(covmat).T * hl.literal(s2)).T)
Out[7]:
array([[1, 1],
[2, 4],
[3, 9],
[4, 4]])
If I use your expression,
In [11]: import hail as hl
...: import numpy as np
...: s2 = np.array([1,2,3,4])
...: covmat = np.array([[1,1],[1,2],[1,3],[1,1]])
...:
...: hl.eval((hl.literal(covmat).T * hl.literal(s2).reshape(-1, 1)))
...
Error summary: HailException: Incompatible NDArray shapes: [ 2 4 ] vs [ 4 1 ]
But maybe you meant to say this?
In [11]: import hail as hl
...: import numpy as np
...: s2 = np.array([1,2,3,4])
...: covmat = np.array([[1,1],[1,2],[1,3],[1,1]])
...:
...: hl.eval((hl.literal(covmat) * hl.literal(s2).reshape(-1, 1)))
Out[11]:
array([[1, 1],
[2, 4],
[3, 9],
[4, 4]])
I've made the last change since it seems better than all the transposition, but I must admit to not understanding how -1
functions in reshape.
Oops, yes, that's what I meant.
The -1 thing is a numpy behavior we mimic. From their docs:
One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.
So reshape(-1, 1) is a universal "make this a column vector" command.
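For the record, the equivalence under discussion can be checked in plain numpy; this is a small hypothetical example mirroring the shapes above, not code from the PR:

```python
import numpy as np

# Scale each row (sample) of a 4 x 2 matrix by the matching entry of a
# length-4 vector, two ways.
s2 = np.array([1, 2, 3, 4])
covmat = np.array([[1, 1], [1, 2], [1, 3], [1, 1]])

# Double-transpose form: s2 broadcasts along the last axis of covmat.T,
# then transposing back leaves each row scaled.
via_transpose = (covmat.T * s2).T

# Column-vector form: reshape(-1, 1) gives s2 shape (4, 1), which
# broadcasts row-by-row against covmat's shape (4, 2).
via_reshape = covmat * s2.reshape(-1, 1)

assert (via_transpose == via_reshape).all()
# Both produce [[1, 1], [2, 4], [3, 9], [4, 4]]
```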
hail/python/hail/methods/statgen.py
Outdated
sqrtv = hl.sqrt(ht.s2)
Q, _ = hl.nd.qr((ht.covmat.T * sqrtv).T)
weights_arr = ht.weight._data_array()
G_scaled = (ht.G.T * sqrtv).T
Here too
Done.
hail/python/hail/methods/statgen.py
Outdated
1. The residual phenotypes are Bernoulli distributed with mean :math:`p` and variance
   :math:`\sigma^2 = p(1 - p)` where :math:`p` is the best-fit probability.
Should be "the phenotypes are Bernoulli ..."
done
I changed this to have a sample index since the sigmas are sample indexed in the logistic case.
hail/python/hail/methods/statgen.py
Outdated
We can transform the residuals into standard normal variables by normalizing by their
variance.
I think everything below this doesn't make much sense. The residuals are certainly not normal -- they can only take two possible values. I tried to look into how to justify the logistic case, and made a little progress ("Pearson residuals" are relevant), but I think it's best to not try to write a complete derivation here. It should be enough to summarize the math from the paper that we implement.
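For what it's worth, the "Pearson residual" of a fitted binomial GLM (the standard textbook definition, not code from this PR) is:

```latex
r_i = \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i (1 - \hat{p}_i)}}
```

where \(\hat{p}_i\) is the fitted null-model probability for sample \(i\). These residuals have mean zero and unit variance under the model, but each takes only two possible values, so they are not normal; only asymptotic arguments justify the chi-squared machinery.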
My bad, didn't realize I hadn't submitted my review.
@patrick-schultz , I've reworked the docs in a less wrong way. Let me know what you think!
Looks great, just a couple small fixes
hail/python/hail/methods/statgen.py
Outdated
X &: R^{N \times K} \\
G &: \{0, 1, 2\}^{N \times M} \\
\\
\textrm{logit}(P(y=1)) &= \beta_0 X + \beta_1 G + \varepsilon \quad\quad \varepsilon \sim N(0, \sigma^2)
This is still saying the residuals are normally distributed. I think the right way to express the logistic regression model is
y ~ Bernoulli(logit^{-1}(beta_0 X + beta_1 G))
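Spelled out in the aligned math style the docs already use, the suggested formulation (my sketch of the standard logistic model, with no additive normal error term) would be:

```latex
y_i &\sim \textrm{Bernoulli}(p_i) \\
\textrm{logit}(p_i) &= x_i^T \beta_0 + g_i^T \beta_1
```

with the null model dropping the genotype term: \(\textrm{logit}(p_i) = x_i^T \beta_{\textrm{null}}\).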
hail/python/hail/methods/statgen.py
Outdated
.. math::

    \textrm{logit}(P(y=1)) = \beta_\textrm{null} X + \varepsilon \quad\quad \varepsilon \sim N(0, \sigma^2)
Here too
hail/python/hail/methods/statgen.py
Outdated
Recall that the eigenvalues of a symmetric matrix and its transpose are the same, so we can
instead consider :math:`Z^T Z` (note that we elide transpositions of symmetric matrices):
This isn't saying what you mean (Z Z^T is symmetric, so its transpose is Z Z^T). The only way I know to say this is that the eigenvalues of both Z Z^T and Z^T Z are the squared singular values of Z.
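That claim is easy to sanity-check numerically; here is a small sketch (my example, not part of the PR) verifying that the nonzero eigenvalues of Z Z^T and Z^T Z both equal the squared singular values of Z:

```python
import numpy as np

# A random rectangular Z, so Z Z^T (5 x 5) and Z^T Z (3 x 3) differ in size
# but share their nonzero eigenvalues.
rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 3))

# Squared singular values of Z, in descending order.
sv2 = np.linalg.svd(Z, compute_uv=False) ** 2

# Eigenvalues of the two Gram matrices, sorted descending; the larger
# matrix has two extra (numerically zero) eigenvalues, which we drop.
eig_small = np.sort(np.linalg.eigvalsh(Z.T @ Z))[::-1]
eig_big = np.sort(np.linalg.eigvalsh(Z @ Z.T))[::-1][:3]

assert np.allclose(sv2, eig_small)
assert np.allclose(sv2, eig_big)
```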
🎉