Conversation
CHANGELOG: Query-on-Batch now supports `hl.skat(..., logistic=False)`. I also added actual tests for `hl.skat`, which were lost at some point. I am not fully confident in my documentation and comments, because the SKAT paper is terse and unclear. I would really appreciate strong criticism of the documentation and the code comments.
2892148 to
6f5e795
patrick-schultz
left a comment
Some notes after a first pass
hail/python/hail/methods/statgen.py
Outdated
| 4. Multiplying an orthogonal matrix by a vector of independent normal variables produces a new
| vector of independent normal variables.
This is only true if you replace "independent" with "i.i.d."
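To make the i.i.d. point concrete, here is a small NumPy sketch (not Hail code; the rotation angle and the variances are made-up values). If x has independent normal entries with covariance D, then Q x has covariance Q D Q^T, which stays diagonal for every orthogonal Q only when all the variances are equal.

```python
import numpy as np

# A 45-degree rotation: an orthogonal matrix (Q @ Q.T == I).
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])


def cov_after_rotation(variances):
    """Covariance of Q @ x when x has independent normal entries with these variances."""
    return Q @ np.diag(variances) @ Q.T


cov_iid = cov_after_rotation([1.0, 1.0])      # i.i.d. N(0, 1) entries
cov_unequal = cov_after_rotation([1.0, 4.0])  # independent, but not identically distributed

print(cov_iid)      # identity: the rotated entries are still uncorrelated
print(cov_unequal)  # nonzero off-diagonal: the rotated entries are correlated
```

For jointly normal variables, uncorrelated is the same as independent, so the nonzero off-diagonal entry in the second case means the rotated entries are no longer independent.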
hail/python/hail/methods/statgen.py
Outdated
| .. math::
|
| \begin{align*}
| U \Lambda U^T &= G W G^T \quad\quad U \textrm{ orthonormal, } \Lambda \textrm{ diagonal} \\
| ht = ht.annotate(
|     Q = (((ht.y_residual @ ht.G) * ht.weight) @ ht.G.T) @ ht.y_residual.T
| )
More efficient:
((ht.y_residual @ ht.G).map(lambda x: x**2) * ht.weight).sum(0)
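A quick NumPy check that the two expressions agree (a sketch with made-up shapes and random data, using plain arrays rather than Hail ndarrays). With s = y_residual @ G, the original form computes s W s^T, which is the same as the sum of w_j * s_j^2, so the second multiplication by G.T is unnecessary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_variants = 6, 4  # hypothetical sizes

G = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)  # genotypes
weight = rng.uniform(0.5, 2.0, size=n_variants)                     # per-variant weights
y_residual = rng.normal(size=n_samples)

# Original form: r G W G^T r^T
q_full = ((y_residual @ G) * weight) @ G.T @ y_residual

# Suggested form: sum_j w_j * (r G)_j^2
q_fast = (((y_residual @ G) ** 2) * weight).sum()

print(q_full, q_fast)
```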
hail/python/hail/methods/statgen.py
Outdated
| #
| # We avoid the square root in order to avoid complex numbers.
|
| Q, _ = hl.nd.qr(ht.covmat)
hail/python/hail/methods/statgen.py
Outdated
| Q, _ = hl.nd.qr(ht.covmat)
| C0 = Q.T @ ht.G
| singular_values = hl.nd.svd(ht.G.T @ (ht.G * ht.weight) - C0.T @ (C0 * ht.weight), compute_uv=False)
As discussed in Zulip, I think a better way to compute this is:
R = hl.nd.qr(ht.G - Q @ (Q.T @ ht.G), mode="r")
singular_values = hl.nd.svd(R, compute_uv=False)
eigenvalues = singular_values.map(lambda x: x**2)
(replace singular_values with eigenvalues below)
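A NumPy sketch of why this works (random data, and the weights are assumed to be already folded into G, so this is an unweighted simplification rather than the PR's exact expression). With A = G - Q Q^T G and Q having orthonormal columns, A^T A = G^T G - C0^T C0, and if A = Q_A R then A^T A = R^T R, so the eigenvalues are the squared singular values of R.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_covariates, n_variants = 8, 2, 3  # hypothetical sizes

covmat = rng.normal(size=(n_samples, n_covariates))
G = rng.normal(size=(n_samples, n_variants))

Q, _ = np.linalg.qr(covmat)  # reduced QR: Q has orthonormal columns
C0 = Q.T @ G

# Direct route: form the small Gram-matrix difference and take its eigenvalues.
direct = np.linalg.eigvalsh(G.T @ G - C0.T @ C0)

# Suggested route: project out the covariates, take only the R factor, square its singular values.
R = np.linalg.qr(G - Q @ (Q.T @ G), mode="r")
via_qr = np.linalg.svd(R, compute_uv=False) ** 2

print(np.sort(direct), np.sort(via_qr))
```

The QR route never forms the difference of two Gram matrices, which is where cancellation error would otherwise creep in.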
Comments addressed. Still gotta work on the verbiage to clearly explain the presence of P_0.
03a8191 to
a45a1b7
a45a1b7 to
ce65d52
ce65d52 to
5bd1f52
@patrick-schultz Alright. This passes tests locally. I'm happy with the docs verbiage. I am eager for your review!
patrick-schultz
left a comment
A few docs comments. Everything else looks great!
hail/python/hail/methods/statgen.py
Outdated
| .. math::
|
| \begin{align*}
| X &: R^{N \times K} \\
Might mention that X is covariates. I was momentarily confused since in the args x is the genotypes.
hail/python/hail/methods/statgen.py
Outdated
| h &\sim N(0, 1) \\
| h &= \frac{1}{\widehat{\sigma}} r \\
I don't know what the standard stats idiom is, but these feel backwards to me. I want to read this as "if we define h = ..., then h is distributed as N(0, 1)". The other way sounds like "draw a random normal variate h, then h = this other thing we already computed".
hail/python/hail/methods/statgen.py
Outdated
| .. math::
|
| \begin{align*}
| U \Lambda U &= B \quad\quad \Lambda \textrm{ orthogonal } U \textrm{ diagonal} \\
Should be U \Lambda U^T, Lambda diagonal, U orthogonal
hail/python/hail/methods/statgen.py
Outdated
| # B = A A.T
| # Q = h.T B h
| #
| # This is called a "quadratic form". It is a weighted sum of the squares of the entries of h,
Not quite. It's a weighted sum of products of pairs of entries of h (w_{00} h_0^2 + w_{01} h_0 h_1 + ....). The eigendecomposition converts it to a sum of squares of normals.
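A small NumPy illustration of the distinction (random data; the names are only for this sketch). The quadratic form h^T B h is a sum over all pairs of entries of h, and only after rotating h by the eigenvectors of B does it become a weighted sum of squares, with the eigenvalues as weights.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, 3))
B = A @ A.T              # real symmetric
h = rng.normal(size=n)

# The quadratic form, and the same thing written as a sum over pairs (i, j).
q_direct = h @ B @ h
q_pairs = sum(B[i, j] * h[i] * h[j] for i in range(n) for j in range(n))

# Eigendecomposition B = U diag(eigenvalues) U^T; z = U^T h are the rotated coordinates.
eigenvalues, U = np.linalg.eigh(B)
z = U.T @ h
q_squares = (eigenvalues * z ** 2).sum()

print(q_direct, q_pairs, q_squares)
```

If h is i.i.d. standard normal, z is too (U is orthogonal), which is what makes Q a weighted sum of independent chi-squared variables.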
hail/python/hail/methods/statgen.py
Outdated
| # Since B is a real symmetric matrix, U is orthogonal. U and W are not necessarily the same
| # matrix but their determinants are +-1 so the squared singular values and eigenvalues differ by
| # at most a sign.
U and W are the same, in the sense that eigenvectors / singular vectors are only defined up to sign anyways. But we don't care about the eigenvectors, do we?
The determinant bit is confusing (their determinants are +-1 because they're orthogonal, but that doesn't tell us anything), and the squared singular values are equal to the eigenvalues; no sign ambiguity there. B is pos. def., it has positive eigenvalues.
I think you can just drop this paragraph.
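For reference, a NumPy sketch of the claim that there is no sign ambiguity (random data, and A is deliberately tall so that B = A A^T is rank-deficient). The nonzero eigenvalues of B are exactly the squared singular values of A, and none of them are negative.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 3))

singular_values = np.linalg.svd(A, compute_uv=False)  # descending, all >= 0
eigenvalues = np.linalg.eigvalsh(A @ A.T)[::-1]       # eigvalsh is ascending; reverse it

# A @ A.T is 4x4 with rank at most 3: its three largest eigenvalues are the
# squared singular values of A, and the remaining one is zero up to rounding.
nonzero_match = np.allclose(eigenvalues[:3], singular_values ** 2)
smallest = eigenvalues[-1]

print(nonzero_match, smallest)
```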
| ).or_error(hl.format('hl._linear_skat: every weight must be positive, in group %s, the weights were: %s',
| ht.group, weights_arr))
| singular_values = hl.nd.svd(A, compute_uv=False)
| # SVD(M) = U S V. U and V are unitary, therefore SVD(k M) = U (k S) V.
Not sure what this is about
In the next line, we scale the singular values by \sigma^2 instead of multiplying A by sqrt(\sigma^2).
I added a blank line to clarify to which expression the comment applies.
Oh I see. That doesn't have anything to do with unitarity. Scalars commute with all matrices.
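A NumPy sketch of the scaling being discussed (random data; `sigma_sq` is a made-up stand-in for the estimated residual variance). Scaling the squared singular values by sigma^2 after the SVD gives the same numbers as multiplying A by sqrt(sigma^2) before it, simply because a positive scalar factors out of the decomposition.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))
sigma_sq = 2.5  # hypothetical residual variance, must be positive

# Scale after the SVD: sigma^2 * s_i^2
scaled_after = sigma_sq * np.linalg.svd(A, compute_uv=False) ** 2

# Scale before the SVD: singular values of sqrt(sigma^2) * A, then squared
scaled_before = np.linalg.svd(np.sqrt(sigma_sq) * A, compute_uv=False) ** 2

print(scaled_after, scaled_before)
```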
hail/python/hail/methods/statgen.py
Outdated
| # I *think* the reasoning for taking the complement of the CDF value is:
| #
| # 1. Q is a measure of variance and thus positive.
| #
| # 2. We want to know the probability of obtaining a variance even larger ("more extreme")
| #
| # Ergo, we want to check the right-tail of the distribution.
I addressed every comment except the one about the SVD(k M) code comment; let me know if that one makes sense now.
hail/python/hail/methods/statgen.py
Outdated
| .. math::
|
| \begin{align*}
| U \Lambda U &= B \quad\quad \Lambda \textrm{ diagonal } U \textrm{ orthogonal} \\
Still missing a transpose: U \Lambda U.T
fixed that one and a couple other ones where the transpose was missing