
[hail] Fix crash in ld_prune because we weren't imputing missing GTs #7653

Merged: 3 commits into hail-is:master on Dec 16, 2019

Conversation

@tpoterba (Collaborator) commented Dec 3, 2019

Assigned Jackie because I want to make sure this is correct.

@jigold (Collaborator) commented Dec 4, 2019

Unfortunately, I'm not sure. @jbloom22 or @alexb-3 -- do you remember what the right thing to do with missing values is when you have a matrix of genotypes and want to compute the Pearson correlation between variants? Can you just mean-impute them? I looked at the local prune algorithm, and it looks like the missing values are mean imputed:

    // mean genotype over the non-missing samples
    val gtMean = gtSum.toDouble / nPresent
    // sums over all samples, with each missing GT imputed as the mean
    val gtSumAll = gtSum + nMissing * gtMean
    val gtSumSqAll = gtSumSq + nMissing * gtMean * gtMean
    // reciprocal of the length of the mean-centered genotype vector
    val gtCenteredLengthRec = 1d / math.sqrt(gtSumSqAll - (gtSumAll * gtSumAll / nSamples))
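
For reference, here is a minimal standalone sketch of the same scheme (an editorial illustration, not Hail source; the function names are hypothetical, and genotypes are assumed to be alt-allele counts with None for missing):

    // Mean-impute missing genotypes, then mean-center and scale to unit
    // length, mirroring gtMean and gtCenteredLengthRec above.
    def standardizeWithImputation(gts: Array[Option[Int]]): Array[Double] = {
      val present = gts.flatten                        // non-missing genotypes
      val mean = present.sum.toDouble / present.length
      val imputed = gts.map(_.fold(mean)(_.toDouble))  // missing -> mean
      val centered = imputed.map(_ - mean)             // imputed entries become 0
      val len = math.sqrt(centered.map(c => c * c).sum)
      centered.map(_ / len)
    }

    // With unit-length centered vectors, Pearson correlation is a dot product.
    def corr(x: Array[Double], y: Array[Double]): Double =
      x.zip(y).map { case (a, b) => a * b }.sum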
@tpoterba (Collaborator, Author) commented Dec 4, 2019

> I looked at the local prune algorithm, and it looks like the missing values are mean imputed

Yeah, that's why I went with this solution; it seemed consistent.

@tpoterba (Collaborator, Author) commented Dec 4, 2019

Although, wait, it looks like this also standardizes...

@alexb-3 (Contributor) commented Dec 4, 2019

(This comment has been minimized.)

@tpoterba (Collaborator, Author) commented Dec 4, 2019

> I think this will deflate the value; the better thing would be to omit terms where either genotype is missing, and then normalize by N_nonmissing.

This is a big redesign -- we use matrix multiplication to compute the correlations in parallel.

It looks like the local prune stuff does mean-center and standardize, so I'll change it to match that. That sound OK?
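
For contrast with the mean-imputation scheme, here is a minimal sketch of the pairwise-complete approach quoted above (hypothetical code -- the function name is made up, and this is not what the PR implements): drop the samples where either genotype is missing and normalize over the remaining pairs.

    // Pairwise-complete Pearson correlation: use only samples where both
    // genotypes are present, so nothing is imputed.
    def pairwiseCompleteCorr(x: Array[Option[Int]], y: Array[Option[Int]]): Double = {
      val pairs = x.zip(y).collect { case (Some(a), Some(b)) => (a.toDouble, b.toDouble) }
      val n = pairs.length // N_nonmissing for this pair
      val (xs, ys) = pairs.unzip
      val mx = xs.sum / n
      val my = ys.sum / n
      val cov = pairs.map { case (a, b) => (a - mx) * (b - my) }.sum
      val sx = math.sqrt(xs.map(a => (a - mx) * (a - mx)).sum)
      val sy = math.sqrt(ys.map(b => (b - my) * (b - my)).sum)
      cov / (sx * sy)
    }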

@alexb-3 (Contributor) commented Dec 4, 2019

(This comment has been minimized.)

@tpoterba (Collaborator, Author) commented Dec 5, 2019

> Makes sense, but what about the following: replace missing values with 0 after standardization, so you can still use matrix multiplication; the only extra thing is computing N_nonmissing for each pair.

Yep, that's what we do.
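
Concretely, a minimal sketch of that scheme (hypothetical code -- corrAndCounts is made up, and per the discussion above the real implementation does the equivalent with matrix multiplication in parallel): after standardization the imputed entries are exactly 0, so they drop out of every dot product; X * X^T over the standardized rows yields the correlations, and M * M^T over 0/1 non-missingness indicators yields N_nonmissing for each pair.

    // Rows are variants, columns are samples. Reuses the hypothetical
    // standardizeWithImputation from the earlier sketch.
    def corrAndCounts(gts: Array[Array[Option[Int]]])
        : (Array[Array[Double]], Array[Array[Int]]) = {
      val x = gts.map(standardizeWithImputation) // imputed entries are 0
      val present = gts.map(_.map(g => if (g.isDefined) 1 else 0))
      val m = gts.length
      // (X * X^T)(i, j): correlation of variants i and j
      val corr = Array.tabulate(m, m) { (i, j) =>
        x(i).zip(x(j)).map { case (a, b) => a * b }.sum
      }
      // (M * M^T)(i, j): number of samples non-missing in both variants
      val nNonMissing = Array.tabulate(m, m) { (i, j) =>
        present(i).zip(present(j)).map { case (a, b) => a * b }.sum
      }
      (corr, nNonMissing)
    }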

@tpoterba (Collaborator, Author) commented Dec 10, 2019

bump

@jigold (Collaborator) commented Dec 10, 2019

I don't feel qualified to look at this change.

@jbloom22 commented Dec 10, 2019

I'll look more closely once I get to the retreat, but my first impression is that centering and normalizing are redundant.

@alexb-3 (Contributor) commented Dec 10, 2019

(This comment has been minimized.)

@tpoterba (Collaborator, Author) commented Dec 12, 2019

I fixed this; it's a much more obvious change (the unfilter comes before the or_else). Should be reviewable now.

@johnc1231 (Contributor) left a comment

Jon Bloom signed off at the retreat, and the code looks fine.

@danking merged commit 2fac8c4 into hail-is:master on Dec 16, 2019
1 check passed: ci-test (success)