Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] implement KING #9343

Merged
merged 6 commits into from Sep 1, 2020
Merged

[hail] implement KING #9343

merged 6 commits into from Sep 1, 2020

Conversation

danking
Copy link
Collaborator

@danking danking commented Aug 25, 2020

CHANGELOG: Implement the KING method for relationship inference as hl.methods.king.

Just look at the last commit. The other commits are PRs that I hope will merge
on Tuesday.

This PR implements hl.methods.king a new, relatively fast, method for
relationship inference on genotype data. I am eager for criticism of the "Notes"
section in which I attempt to describe the KING method to a Hail user with only
a basic understanding of genotype matrices and Hail.

I also include a benchmark which exercises MT->BM, matrix multiply, and
BM->MT. We have an opportunity for a substantial improvement in performance by
BM->replacing the BM interface by one which permits multiple entry fields. In
BM->particular, note that I have to convert from row-partitioning to
BM->block-partitioning four times!

@danking
Copy link
Collaborator Author

danking commented Aug 25, 2020

Some shots of the rendered documents.

Methods

Screen Shot 2020-08-25 at 4 11 57 PM

Relatedness

Screen Shot 2020-08-25 at 4 12 02 PM

King

Screen Shot 2020-08-25 at 4 12 17 PM

Screen Shot 2020-08-25 at 4 12 19 PM

@johnc1231
Copy link
Contributor

Needs a rebase to address conflicts.

Copy link
Contributor

@johnc1231 johnc1231 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! Will give it another look after you rebase, so far noticed two documentation things.


- :math:`i` and :math:`j` be two individuals in the dataset

- :math:`N^{Aa}_{i}` be the number of heterozygote genotype force individual
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? "heterozygote genotype force individual" sounds wrong, but I don't know the math here. At least should probably be "individuals"? I don't know what the "force" part means.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be "genotypes for"

4 p_s (1 - p_s) (1 - 2\phi_{i,j})

This identity reveals that the quotient of the expected genetic distance and
the four-trial binomial variance in the allele frequency, represents,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think there should be a comma between "frequency" and "represents".

hail/python/hail/methods/relatedness/king.py Show resolved Hide resolved
danking and others added 2 commits August 27, 2020 11:28
CHANGELOG: Implement the KING method for relationship inference as hl.methods.king.

Just look at the last commit. The other commits are PRs that I hope will merge
on Tuesday.

This PR implements `hl.methods.king` a new, relatively fast, method for
relationship inference on genotype data. I am eager for criticism of the "Notes"
section in which I attempt to describe the KING method to a Hail user with only
a basic understanding of genotype matrices and Hail.

I also include a benchmark which exercises MT->BM, matrix multiply, and
BM->MT. We have an opportunity for a substantial improvement in performance by
BM->replacing the BM interface by one which permits multiple entry fields. In
BM->particular, note that I have to convert from row-partitioning to
BM->block-partitioning four times!

Co-authored-by: Gonzalo Anyosa <ganyosagalvez@gmail.com>
Co-authored-by: Ahmed Benghomari <abenghomari@gmail.com>
Copy link
Contributor

@johnc1231 johnc1231 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits and a question, code all looks good, and I think the explanation is pretty thorough

datasets containing one homogeneous population. :meth:`pc_relate` is appropriate
for datasets containing multiple homogeneous populations and admixture.
Hail provides three methods for the inference of relatedness: PLINK-style
identity by descent, KING, and PC-Relate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe these method names should be links to their papers / websites?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added references.

Comment on lines 97 to 99
identity_by_descent
pc_relate
king
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but let's put them in the same order as above (ibd, king, pc_relate).

Comment on lines 41 to 47
identity_by_descent
pc_relate
king

.. autofunction:: identity_by_descent
.. autofunction:: pc_relate
.. autofunction:: king
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's match the order we describe them in (which is also alphabetical)

Comment on lines 44 to 48
- :math:`X_{i,s}` be the genotype score matrix. Homozygous-reference
genotypes are represented as 0, heterozygous genotypes are represented as
1, and homozygous-alternate genotypes are represented as
2. :math:`X_{i,s}` is calculated by invoking
:meth:`CallExpression.n_alt_alleles` on the `call_expr`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a matrix or a vector? (Wondering because it refers to a particular sample, which makes me think of one column of a matrix table). What is little s?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a matrix. s is for SNP.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a sentence describing the subscripts.

Comment on lines +81 to +84
d_{i,j} = 4 N^{AA,aa}_{i,j}
+ N^{Aa}_{i}
+ N^{Aa}_{j}
- 2 N^{Aa,Aa}_{i,j}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the images you posted, the tops of the superscripts look cut off

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is a problem with any symbol that has both subscripts and superscripts. I found the root-cause and proposed a fix to the sphinx-katex people: hagenw/sphinxcontrib-katex#33.

I'm inclined to wait for them to fix rather than try to monkey patch this. I'm not sure that my fix doesn't cause issues on other browsers (though I did verify our website looks fine on a phone).

Notes
-----

The following presentation summarizes the methods section of Manichaikul,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to paper?

Comment on lines 237 to 238
# We need the count of times the pair is AA,aa and aa,AA. da_ref_var is only AA,aa.
# Transposing da_ref_var gives da_var_ref, i.e. aa,AA.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does da_ref_var refer to? Old name for ref_var?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. fixed

@danking danking merged commit 74990cc into hail-is:main Sep 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants