-
Notifications
You must be signed in to change notification settings - Fork 251
[hail] implement KING #9343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[hail] implement KING #9343
Conversation
4acf83d
to
8b026c4
Compare
Needs a rebase to address conflicts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool! Will give it another look after you rebase, so far noticed two documentation things.
|
||
- :math:`i` and :math:`j` be two individuals in the dataset | ||
|
||
- :math:`N^{Aa}_{i}` be the number of heterozygote genotype force individual |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this right? "heterozygote genotype force individual" sounds wrong, but I don't know the math here. At least should probably be "individuals"? I don't know what the "force" part means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be "genotypes for"
4 p_s (1 - p_s) (1 - 2\phi_{i,j}) | ||
|
||
This identity reveals that the quotient of the expected genetic distance and | ||
the four-trial binomial variance in the allele frequency, represents, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think there should be a comma between "frequency" and "represents".
CHANGELOG: Implement the KING method for relationship inference as hl.methods.king. Just look at the last commit. The other commits are PRs that I hope will merge on Tuesday. This PR implements `hl.methods.king` a new, relatively fast, method for relationship inference on genotype data. I am eager for criticism of the "Notes" section in which I attempt to describe the KING method to a Hail user with only a basic understanding of genotype matrices and Hail. I also include a benchmark which exercises MT->BM, matrix multiply, and BM->MT. We have an opportunity for a substantial improvement in performance by BM->replacing the BM interface by one which permits multiple entry fields. In BM->particular, note that I have to convert from row-partitioning to BM->block-partitioning four times! Co-authored-by: Gonzalo Anyosa <ganyosagalvez@gmail.com> Co-authored-by: Ahmed Benghomari <abenghomari@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits and a question, code all looks good, and I think the explanation is pretty thorough
datasets containing one homogeneous population. :meth:`pc_relate` is appropriate | ||
for datasets containing multiple homogeneous populations and admixture. | ||
Hail provides three methods for the inference of relatedness: PLINK-style | ||
identity by descent, KING, and PC-Relate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe these method names should be links to their papers / websites?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added references.
identity_by_descent | ||
pc_relate | ||
king |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but let's put them in the same order as above (ibd, king, pc_relate).
identity_by_descent | ||
pc_relate | ||
king | ||
|
||
.. autofunction:: identity_by_descent | ||
.. autofunction:: pc_relate | ||
.. autofunction:: king |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's match the order we describe them in (which is also alphabetical)
- :math:`X_{i,s}` be the genotype score matrix. Homozygous-reference | ||
genotypes are represented as 0, heterozygous genotypes are represented as | ||
1, and homozygous-alternate genotypes are represented as | ||
2. :math:`X_{i,s}` is calculated by invoking | ||
:meth:`CallExpression.n_alt_alleles` on the `call_expr`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a matrix or a vector? (Wondering because it refers to a particular sample, which makes me think of one column of a matrix table). What is little s
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a matrix. s
is for SNP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a sentence describing the subscripts.
d_{i,j} = 4 N^{AA,aa}_{i,j} | ||
+ N^{Aa}_{i} | ||
+ N^{Aa}_{j} | ||
- 2 N^{Aa,Aa}_{i,j} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the images you posted, the tops of the superscripts look cut off
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is a problem with any symbol that has both subscripts and superscripts. I found the root-cause and proposed a fix to the sphinx-katex people: hagenw/sphinxcontrib-katex#33.
I'm inclined to wait for them to fix rather than try to monkey patch this. I'm not sure that my fix doesn't cause issues on other browsers (though I did verify our website looks fine on a phone).
Notes | ||
----- | ||
|
||
The following presentation summarizes the methods section of Manichaikul, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link to paper?
# We need the count of times the pair is AA,aa and aa,AA. da_ref_var is only AA,aa. | ||
# Transposing da_ref_var gives da_var_ref, i.e. aa,AA. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does da_ref_var
refer to? Old name for ref_var
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. fixed
This reverts commit 1f1abcb.
CHANGELOG: Implement the KING method for relationship inference as hl.methods.king.
Just look at the last commit. The other commits are PRs that I hope will merge
on Tuesday.
This PR implements
hl.methods.king
a new, relatively fast, method forrelationship inference on genotype data. I am eager for criticism of the "Notes"
section in which I attempt to describe the KING method to a Hail user with only
a basic understanding of genotype matrices and Hail.
I also include a benchmark which exercises MT->BM, matrix multiply, and
BM->MT. We have an opportunity for a substantial improvement in performance by
BM->replacing the BM interface by one which permits multiple entry fields. In
BM->particular, note that I have to convert from row-partitioning to
BM->block-partitioning four times!