[hail] Fix crash in ld_prune because we weren't imputing missing GTs #7653
Unfortunately, I'm not sure. @jbloom22 or @alexb-3, do you remember: if you have a matrix of genotypes and you want to compute the Pearson correlation between variants, what is the right thing to do with missing values? Can you just mean impute them? I looked at the local prune algorithm, and it looks like the missing values are mean imputed:
val gtMean = gtSum.toDouble / nPresent
val gtSumAll = gtSum + nMissing * gtMean
val gtSumSqAll = gtSumSq + nMissing * gtMean * gtMean
val gtCenteredLengthRec = 1d / math.sqrt(gtSumSqAll - (gtSumAll * gtSumAll / nSamples))
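For anyone following along, here is a minimal Python port of the snippet above, as a sketch only. The function names and the use of `None` for a missing call are assumptions made for illustration, not Hail code. It also checks that the mean-imputed centered length equals the centered length computed over the present genotypes alone, since each imputed entry contributes zero after centering.

```python
import math


def centered_length_rec(gts):
    """Reciprocal centered length with mean imputation (port of the Scala lines).

    `gts` is a list of genotype values (0, 1, 2) with None for missing calls.
    Assumes at least one present call and a variant that is not monomorphic.
    """
    present = [g for g in gts if g is not None]
    n_samples = len(gts)
    n_present = len(present)
    n_missing = n_samples - n_present

    gt_sum = sum(present)
    gt_sum_sq = sum(g * g for g in present)

    gt_mean = gt_sum / n_present
    gt_sum_all = gt_sum + n_missing * gt_mean
    gt_sum_sq_all = gt_sum_sq + n_missing * gt_mean * gt_mean
    return 1.0 / math.sqrt(gt_sum_sq_all - gt_sum_all * gt_sum_all / n_samples)


def centered_length_rec_present_only(gts):
    """Same quantity computed from the present genotypes only."""
    present = [g for g in gts if g is not None]
    mean = sum(present) / len(present)
    return 1.0 / math.sqrt(sum((g - mean) ** 2 for g in present))
```

So the imputation changes which samples count toward `nSamples`, but not the centered length itself.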
Yeah, that's why I went with this solution; it seemed consistent.
although, wait, it looks like this also standardizes...
I think this will deflate the value; the better thing would be to omit
terms where either genotype is missing, and then normalize by N_nonmissing.
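A small illustration of the deflation, as a sketch under stated assumptions: the helper names are hypothetical, `None` marks a missing genotype, and the data are made up. The first function mean imputes each variant and then computes Pearson r; the second drops every sample where either genotype is missing and normalizes by the number of jointly non-missing samples.

```python
import math


def mean_imputed_r(x, y):
    """Pearson r after replacing each None with that variant's own mean."""
    def centered(v):
        present = [g for g in v if g is not None]
        mean = sum(present) / len(present)
        # An imputed entry equals the mean, so it centers to exactly 0.
        return [0.0 if g is None else g - mean for g in v]

    cx, cy = centered(x), centered(y)
    dot = sum(a * b for a, b in zip(cx, cy))
    return dot / math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))


def complete_case_r(x, y):
    """Pearson r over samples where both genotypes are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)  # N_nonmissing for this pair of variants
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs) / n
    var_x = sum((a - mx) ** 2 for a, _ in pairs) / n
    var_y = sum((b - my) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(var_x * var_y)
```

With `x = [0, 1, 2, 0, 1, 2]` and `y = [0, 1, 2, None, None, None]`, the two variants agree wherever both are called, yet the mean-imputed value is about 0.71 while the complete-case value is 1.0.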
This is a big redesign -- we use matrix multiplication to compute the correlations in parallel. It looks like the local prune stuff does mean-center and standardize, so I'll change it to match that. That sound OK?
Makes sense, but what about the following: replace missing values with 0
*after* standardization, so you can still use matrix multiplication; the
only extra thing is computing N_nonmissing for each pair.
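A rough sketch of that suggestion in plain Python, to make the shape concrete. This is not the Hail implementation; the function names, the list-of-rows layout (one row per variant), and `None` for missing are all assumptions. Each variant is standardized using its own present samples, missing entries are set to 0, and two matrix products give the pairwise sums and the pairwise N_nonmissing counts.

```python
import math


def standardize_zero_fill(row):
    """Standardize one variant over its present samples, then zero-fill missing."""
    present = [g for g in row if g is not None]
    mean = sum(present) / len(present)
    sd = math.sqrt(sum((g - mean) ** 2 for g in present) / len(present))
    return [0.0 if g is None else (g - mean) / sd for g in row]


def matmul_transpose(a, b):
    """Compute a @ b.T for matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(ra, rb)) for rb in b] for ra in a]


def pairwise_r(genotypes):
    """Variant-by-variant correlation estimates, normalized by N_nonmissing.

    Assumes every pair of variants shares at least one non-missing sample
    and no variant is monomorphic among its present samples.
    """
    z = [standardize_zero_fill(row) for row in genotypes]
    mask = [[0.0 if g is None else 1.0 for g in row] for row in genotypes]
    sums = matmul_transpose(z, z)          # sums over jointly present samples
    counts = matmul_transpose(mask, mask)  # N_nonmissing for each pair
    return [[s / c for s, c in zip(srow, crow)]
            for srow, crow in zip(sums, counts)]
```

One caveat with this sketch: the mean and standard deviation of each variant come from its own present samples rather than from the jointly present ones, so the result approximates the complete-case correlation and can land slightly outside [-1, 1].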
Yep, that's what we do.
bump
I don't feel qualified to look at this change.
I’ll look more closely once I get to the retreat, but first impression is that centering and normalizing are redundant.
Maybe no longer relevant, but zeroing missings *after* centering is
equivalent to using non-missing terms only rather than mean imputing,
provided you then use N_nonmissing for the final normalization.
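Spelled out as a short derivation, with notation that is introduced here for illustration and is not from the PR: P_x is the set of samples where variant x is called, \bar{x} is the mean of x over P_x, and N is the total number of samples.

```latex
\tilde{x}_i =
\begin{cases}
  x_i - \bar{x}, & i \in P_x \\
  0,             & i \notin P_x
\end{cases}
\qquad\Longrightarrow\qquad
\sum_{i=1}^{N} \tilde{x}_i \, \tilde{y}_i
  = \sum_{i \in P_x \cap P_y} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
N_{\text{nonmissing}} = \lvert P_x \cap P_y \rvert
```

Only jointly non-missing samples contribute to the sum, so dividing by N_nonmissing rather than by a quantity based on all N samples is what distinguishes this from the mean-imputed normalization.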
I fixed this; it's a much more obvious change (the unfilter comes before the or_else). Should be reviewable now.
Assigned Jackie because I want to make sure this is correct.