
[hail] Fix crash in ld_prune because we weren't imputing missing GTs #7653

Merged — 3 commits merged into hail-is:master on Dec 16, 2019

Conversation

@tpoterba (Collaborator) commented Dec 3, 2019

Assigned Jackie because I want to make sure this is correct.

@jigold (Collaborator) commented Dec 4, 2019

Unfortunately, I'm not sure. @jbloom22 or @alexb-3 -- do you remember: if you have a matrix of genotypes and you want to compute the Pearson correlation between variants, what is the right thing to do with missing values? Can you just mean-impute them? I looked at the local prune algorithm, and it looks like the missing values are mean imputed:

```scala
// Mean-impute missing genotypes, then compute the reciprocal of the
// centered genotype vector's length
val gtMean = gtSum.toDouble / nPresent
val gtSumAll = gtSum + nMissing * gtMean
val gtSumSqAll = gtSumSq + nMissing * gtMean * gtMean
val gtCenteredLengthRec = 1d / math.sqrt(gtSumSqAll - (gtSumAll * gtSumAll / nSamples))
```
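For context, here is a minimal NumPy sketch (hypothetical illustration, not Hail code) of the same arithmetic: because the imputed value is the per-variant mean, the missing entries contribute exactly zero after centering, so the incremental sums above agree with imputing first and centering afterward.

```python
import numpy as np

# Hypothetical genotype row with missing calls encoded as np.nan
gt = np.array([0.0, 1.0, 2.0, np.nan, 1.0, np.nan])

present = ~np.isnan(gt)
n_samples = gt.size
n_present = present.sum()
n_missing = n_samples - n_present
gt_sum = gt[present].sum()
gt_sum_sq = (gt[present] ** 2).sum()

# Mirrors the Scala snippet above
gt_mean = gt_sum / n_present
gt_sum_all = gt_sum + n_missing * gt_mean
gt_sum_sq_all = gt_sum_sq + n_missing * gt_mean ** 2
gt_centered_length_rec = 1.0 / np.sqrt(gt_sum_all * 0 + gt_sum_sq_all - gt_sum_all ** 2 / n_samples)

# Equivalent direct computation: impute, center, take reciprocal of length
imputed = np.where(present, gt, gt_mean)
centered = imputed - imputed.mean()
assert np.isclose(1.0 / np.linalg.norm(centered), gt_centered_length_rec)
```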

@tpoterba (Collaborator, Author) commented Dec 4, 2019

> I looked at the local prune algorithm, and it looks like the missing values are mean imputed

Yeah, that's why I went with this solution; it seemed consistent.

@tpoterba (Collaborator, Author) commented Dec 4, 2019

although, wait, it looks like this also standardizes...

@alexb-3 (Contributor) commented Dec 4, 2019

@tpoterba (Collaborator, Author) commented Dec 4, 2019

> I think this will deflate the value; the better thing would be to omit terms where either genotype is missing, and then normalize by N_nonmissing.

This is a big redesign -- we use matrix multiplication to compute the correlations in parallel.

It looks like the local prune stuff does mean-center and standardize, so I'll change it to match that. That sound OK?

@alexb-3 (Contributor) commented Dec 4, 2019

@tpoterba (Collaborator, Author) commented Dec 5, 2019

> Makes sense, but what about the following: replace missing values with 0 after standardization, so you can still use matrix multiplication; the only extra thing is computing N_nonmissing for each pair.

Yep, that's what we do.
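For context, the zero-fill-after-standardization trick can be sketched with NumPy (a toy illustration with an assumed NaN encoding, not the Hail implementation): once rows are mean-centered and unit-normalized, missing entries are exactly 0, so a single matrix multiply yields all pairwise correlations and the missing terms simply drop out of the dot products.

```python
import numpy as np

# Toy genotype matrix: variants x samples, np.nan = missing (assumed encoding)
G = np.array([
    [0.0, 1.0, 2.0, np.nan],
    [1.0, np.nan, 2.0, 0.0],
    [2.0, 2.0, 0.0, 1.0],
])

present = ~np.isnan(G)

# Per-variant: mean-impute, center, and scale to unit length
means = np.nanmean(G, axis=1, keepdims=True)
X = np.where(present, G, means) - means          # missing entries become exactly 0
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-length rows

# One matrix multiply gives all pairwise correlations; the zeros contributed
# by missing entries drop out of every dot product
R = X @ X.T

# Pairwise count of samples non-missing in both variants, for normalizing
# or filtering by N_nonmissing
N = present.astype(float) @ present.astype(float).T
```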

@tpoterba (Collaborator, Author) commented Dec 10, 2019

bump

@jigold (Collaborator) commented Dec 10, 2019

I don't feel qualified to look at this change.

@jbloom22 (Contributor) commented Dec 10, 2019

I’ll look more closely once I get to the retreat, but my first impression is that centering and normalizing are redundant.

@alexb-3 (Contributor) commented Dec 10, 2019

@tpoterba (Collaborator, Author) commented Dec 12, 2019

I fixed this; it's a much more obvious change (the unfilter comes before the or_else). Should be reviewable now.

@johnc1231 (Contributor) left a comment

Jon Bloom signed off at the retreat, and the code looks fine.

@danking danking merged commit 2fac8c4 into hail-is:master Dec 16, 2019
1 check passed