
[query] lower logistic SKAT #12643

Merged — 10 commits from lowered-logistic-skat merged into hail-is:main on Feb 17, 2023
Conversation

@danking (Contributor) commented Feb 2, 2023

CHANGELOG: hl.skat(..., logistic=True) now supported in the Batch backend.

@danking force-pushed the lowered-logistic-skat branch 4 times, most recently from 121d147 to bce7b7d on February 6, 2023 at 14:13
CHANGELOG: `hl.skat(..., logistic=True)` now supported in the Batch backend.
@danking removed the stacked PR label on Feb 6, 2023
@@ -2140,8 +2567,19 @@ def skat(key_expr,
Table of SKAT results.

"""
if hl.current_backend().requires_lowering and not logistic:
ht = hl._linear_skat(key_expr, weight_expr, y, x, covariates, max_size, accuracy, iterations)
if hl.current_backend().requires_lowering:
@danking (Contributor Author):
In the second phase, we'll benchmark and address speed problems. Until we're certain it is fast enough, I don't want to enable it for all backends.

[2, 1, 1, 1, 0, 1, 1, 2, 1, 1, 2, 1, 0, 0, 1],
[1, 0, 1, 1, 1, 2, 0, 2, 1, 1, 0, 1, 1, 0, 0],
[0, 2, 0, 0, 2, 1, 1, 2, 2, 1, 1, 1, 0, 1, 1],
[1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0]]
@danking (Contributor Author):
I needed to create a matrix on which the logistic model converged.

@danking force-pushed the lowered-logistic-skat branch from bce7b7d to 6165ffb on February 6, 2023 at 15:26
@@ -105,7 +186,124 @@ def test_linear_skat_no_weights_R_truth():
assert result.fault == 0


def test_linear_skat_R_truth():
@danking (Contributor Author):
This test is inadvertently duplicated in main. I just removed it entirely, but the diff is a bit awkward.

@danking (Contributor Author) commented Feb 6, 2023

Hmm. I trust the code now. I test against several R SKAT runs. I'm not sure I understand how we derive that Q is generalized chi-squared distributed. We use the residual phenotypes in the calculation of Q, but those are inverse-logit transformed normal variables. The derivation for the linear case doesn't apply, as far as I can tell. I assume the residuals are Bernoulli distributed? Maybe not. I guess the phenotypes are Bernoulli but the errors aren't? I'm not sure.

@danking (Contributor Author) commented Feb 13, 2023

bump @patrick-schultz ideally we would merge this before Wednesday's meeting!

# See linear SKAT code comment for an extensive description of the mathematics here.

sqrtv = hl.sqrt(ht.s2)
Q, _ = hl.nd.qr((ht.covmat.T * sqrtv).T)
@patrick-schultz (Collaborator):

Should use ht.covmat.T * sqrtv.reshape(-1, 1)

@danking (Contributor Author):

covmat is K by N_samples. sqrtv is N_samples. The code I have above scales each sample by its s2, e.g.:

In [7]: import hail as hl
   ...: import numpy as np
   ...: s2 = np.array([1,2,3,4])
   ...: covmat = np.array([[1,1],[1,2],[1,3],[1,1]])
   ...: 
   ...: hl.eval((hl.literal(covmat).T * hl.literal(s2)).T)
Out[7]: 
array([[1, 1],
       [2, 4],
       [3, 9],
       [4, 4]])

If I use your expression,

In [11]: import hail as hl
    ...: import numpy as np
    ...: s2 = np.array([1,2,3,4])
    ...: covmat = np.array([[1,1],[1,2],[1,3],[1,1]])
    ...: 
    ...: hl.eval((hl.literal(covmat).T * hl.literal(s2).reshape(-1, 1)))
...
Error summary: HailException: Incompatible NDArray shapes: [ 2 4 ] vs [ 4 1 ]

But maybe you meant to say this?

In [11]: import hail as hl
    ...: import numpy as np
    ...: s2 = np.array([1,2,3,4])
    ...: covmat = np.array([[1,1],[1,2],[1,3],[1,1]])
    ...: 
    ...: hl.eval((hl.literal(covmat) * hl.literal(s2).reshape(-1, 1)))
Out[11]: 
array([[1, 1],
       [2, 4],
       [3, 9],
       [4, 4]])

I've made the last change since it seems better than all the transposition, but I must admit to not understanding how -1 functions in reshape.

@patrick-schultz (Collaborator):

Oops, yes, that's what I meant.

The -1 thing is a numpy behavior we mimic. From their docs:

One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.

So reshape(-1, 1) is a universal "make this a column vector" command.
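For what it's worth, the equivalence of the two broadcasting styles discussed above can be checked in plain NumPy (whose behavior Hail ndarrays mimic); this is just an illustrative sketch, not code from the PR:

```python
import numpy as np

# Illustrative sketch (not from the PR): two equivalent ways to scale
# each row of a matrix by a per-row factor.
s = np.array([1.0, 2.0, 3.0, 4.0])
M = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 1.0]])

a = (M.T * s).T           # transpose, broadcast over the last axis, transpose back
b = M * s.reshape(-1, 1)  # reshape s into a column vector and broadcast directly

assert np.array_equal(a, b)
```

The `reshape(-1, 1)` form avoids the two transpositions, which is why it was preferred in the final code.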

sqrtv = hl.sqrt(ht.s2)
Q, _ = hl.nd.qr((ht.covmat.T * sqrtv).T)
weights_arr = ht.weight._data_array()
G_scaled = (ht.G.T * sqrtv).T
@patrick-schultz (Collaborator):
Here too

@danking (Contributor Author):
Done.

Comment on lines 2062 to 2063
1. The residual phenotypes are Bernoulli distributed with mean :math:`p` and variance
:math:`\sigma^2 = p(1 - p)` where :math:`p` is the best-fit probability.
@patrick-schultz (Collaborator):
Should be "the phenotypes are Bernoulli ..."

@danking (Contributor Author):
Done.

@danking (Contributor Author):

I changed this to have a sample index since the sigmas are sample indexed in the logistic case.

Comment on lines 2068 to 2069
We can transform the residuals into standard normal variables by normalizing by their
variance.
@patrick-schultz (Collaborator):
I think everything below this doesn't make much sense. The residuals are certainly not normal -- they can only take two possible values. I tried to look into how to justify the logistic case, and made a little progress ("Pearson residuals" are relevant), but I think it's best to not try to write a complete derivation here. It should be enough to summarize the math from the paper that we implement.

@patrick-schultz (Collaborator) commented:

> bump @patrick-schultz ideally we would merge this before Wednesday's meeting!

My bad, didn't realize I hadn't submitted my review.

@danking (Contributor Author) commented Feb 15, 2023

@patrick-schultz , I've reworked the docs in a less wrong way. Let me know what you think!

[Three screenshots of the reworked SKAT documentation, Feb 15, 2023]

@patrick-schultz (Collaborator) left a review:

Looks great, just a couple small fixes

X &: R^{N \times K} \\
G &: \{0, 1, 2\}^{N \times M} \\
\\
\textrm{logit}(P(y=1)) &= \beta_0 X + \beta_1 G + \varepsilon \quad\quad \varepsilon \sim N(0, \sigma^2)
@patrick-schultz (Collaborator):

This is still saying the residuals are normally distributed. I think the right way to express the logistic regression model is

y ~ Bernoulli(logit^{-1}(beta_0 X + beta_1 G))
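In the doc's own RST math notation, the reviewer's suggestion might be rendered along these lines (a sketch of the suggested model, not the final doc text):

```latex
.. math::

    y_i \sim \textrm{Bernoulli}(p_i), \qquad
    \textrm{logit}(p_i) = \beta_0 X_i + \beta_1 G_i
```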


.. math::

\textrm{logit}(P(y=1)) = \beta_\textrm{null} X + \varepsilon \quad\quad \varepsilon \sim N(0, \sigma^2)
@patrick-schultz (Collaborator):

Here too

Comment on lines 2084 to 2085
Recall that the eigenvalues of a symmetric matrix and its transpose are the same, so we can
instead consider :math:`Z^T Z` (note that we elide transpositions of symmetric matrices):
@patrick-schultz (Collaborator):

This isn't saying what you mean (Z Z^T is symmetric, so its transpose is Z Z^T). The only way I know to say this is that the eigenvalues of both Z Z^T and Z^T Z are the squared singular values of Z.
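The singular-value fact the reviewer states is easy to sanity-check in NumPy (an illustrative sketch, not PR code):

```python
import numpy as np

# Illustrative check (not from the PR): the nonzero eigenvalues of both
# Z Z^T and Z^T Z equal the squared singular values of Z.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))

sv2 = np.sort(np.linalg.svd(Z, compute_uv=False) ** 2)
small = np.sort(np.linalg.eigvalsh(Z.T @ Z))     # all 3 eigenvalues
big = np.sort(np.linalg.eigvalsh(Z @ Z.T))[-3:]  # drop the two zero eigenvalues

assert np.allclose(sv2, small)
assert np.allclose(sv2, big)
```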


@danking (Contributor Author) commented Feb 16, 2023

Thanks for clarifying these points and pointing out the errors!

[Three screenshots of the updated SKAT documentation, Feb 16, 2023]

@patrick-schultz (Collaborator) left a review:

🎉

@danking merged commit 9c5851c into hail-is:main on Feb 17, 2023