-
Notifications
You must be signed in to change notification settings - Fork 90
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Brian Ondov
committed
Mar 9, 2016
1 parent
f4add5d
commit 6e3d454
Showing
1 changed file
with
55 additions
and
31 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,85 +1,109 @@ | ||
Distance Estimation | ||
=================== | ||
|
||
Jaccard distance formulation | ||
---------------------------- | ||
MinHash Jaccard estimation | ||
-------------------------- | ||
|
||
For sequences with sketches :math:`A` and :math:`B`, the distance of the | ||
sequences can be estimated by the Jaccard distance of the sketches: | ||
Given :math:`k`-mer sets :math:`A` and :math:`B`, the MinHash algorithm provides an | ||
estimation of the Jaccard index: | ||
|
||
.. math:: | ||
d_J(A,B) = 1 - \frac {\lvert A \cap B \rvert} {\lvert A \cup B \rvert} | ||
j(A_s,B_s) = \frac {\lvert A_s \cap B_s \rvert} s | ||
As implemented in :code:`mash`, the subsets :math:`A_s` and :math:`B_s` are used | ||
such that :math:`\lvert A_s \cup B_s \rvert` is equal to the sketch size, | ||
allowing for a known error bound based on the given sketch size, as suggested by | ||
Broder [#f1]_: | ||
where :math:`A_s` and :math:`B_s` are subsets such that | ||
:math:`\lvert A_s \cup B_s \rvert` is equal to the sketch size, :math:`s`, | ||
allowing for a known error bound as suggested by Broder [#f1]_. This is done by | ||
using a merge-sort algorithm to find common values between the two sorted | ||
sketches and terminating when the total number of hashes seen reaches the sketch | ||
size (or all hashes in both sketches have been seen). | ||
|
||
Mash distance formulation | ||
------------------------- | ||
|
||
For mutating a sequence with :math:`t` total :math:`k`-mers and a | ||
conserved :math:`k`-mer count :math:`w`, an approximate mutation rate :math:`d` can be | ||
estimated using a Poisson model of mutations occurring in :math:`k`-mers, as suggested | ||
by Fan et al. [#f2]_: | ||
|
||
.. math:: | ||
d_J(A_s,B_s) = 1 - \frac {\lvert A_s \cap B_s \rvert} s | ||
d = \frac {-1} k \ln \frac w t | ||
This is done by | ||
using a merge-sort algorithm to find common values between the two sorted | ||
sketches and terminating when the total number of hashes seen reaches the sketch | ||
size (or all hashes in both sketches have been seen). | ||
In order to use a Jaccard estimate :math:`j` between two :math:`k`-mer sets of arbitrary | ||
sizes, the Jaccard estimate can be framed in terms of the conserved :math:`k`-mer count | ||
:math:`w` and the average set size :math:`n`: | ||
|
||
.. math:: | ||
For closer correlation with nucleotide identity, the distance can be log scaled | ||
(:code:`-L` in :code:`mash dist`) and divided by the kmer size to account for | ||
the increase chance of differences in larger kmers, as suggested by Fan et al. [#f2]_: | ||
j \approx \frac w {2n - w} | ||
To substitute :math:`n` for the total :math:`k`-mer count :math:`t` in the mutation | ||
estimation, this approximation can be reformulated as: | ||
|
||
.. math:: | ||
\frac w n \approx \frac {2j} {1 + j} | ||
Substituting :math:`\frac w n` for :math:`\frac w t` thus yields the Mash | ||
distance: | ||
|
||
.. math:: | ||
d(A_s,B_s)=\Bigg\{\begin{split} | ||
&1&,\ \lvert A_s \cap B_s \rvert=0\\ | ||
&\frac {-1} k \log \frac {\lvert A_s \cap B_s \rvert} s&,\ 0<\lvert A_s \cap B_s \rvert<=s | ||
D(k,j)=\Bigg\{\begin{split} | ||
&1&,\ j=0\\ | ||
&\frac {-1} k \ln \frac {2j} {1 + j}&,\ 0<j\le 1 | ||
\end{split} | ||
Assessing significance with p-values | ||
------------------------------------ | ||
Since MinHash distances are probabilistic estimates, it is important to | ||
consider the probability of seeing a given distance by chance. :code:`mash dist` | ||
thus provides p-values with distance estimations. Lower p-values correspond to | ||
more confident distance estimations, and will often be rounded down to 0 due to | ||
floating point limits. If p-values are high (above, say, 0.01), the k-mer size | ||
floating point limits. If p-values are high (above, say, 0.01), the :math:`k`-mer size | ||
is probably too small for the size of the genomes being compared. | ||
|
||
When estimating the distance of genome 1 and genome 2 from sketches with the | ||
properties: | ||
|
||
:math:`k` := k-mer size | ||
:math:`\Sigma` := alphabet | ||
|
||
:math:`k` := :math:`k`-mer size | ||
|
||
:math:`l_1` := length of genome 1 | ||
|
||
:math:`l_2` := length of genome 2 | ||
|
||
:math:`s` := sketch size | ||
|
||
:math:`x` := number of shared k-mers between sketches of size :math:`s` of | ||
:math:`x` := number of shared :math:`k`-mers between sketches of size :math:`s` of | ||
genome 1 and genome 2 | ||
|
||
...the chance of a k-mer appearing in random sequences of lengths :math:`l_1` | ||
...the chance of a :math:`k`-mer appearing in random sequences of lengths :math:`l_1` | ||
and :math:`l_2` are estimated as: | ||
|
||
.. math:: | ||
r_1 = 1-(1-\frac{1}{4^k})^{l_1} \approx \frac{l_1}{l_1+4^k} | ||
r_1 = 1-(1-\frac{1}{{\lvert\Sigma\rvert}^k})^{l_1} \approx \frac{l_1}{l_1+{\lvert\Sigma\rvert}^k} | ||
r_2 = 1-(1-\frac{1}{4^k})^{l_2} \approx \frac{l_2}{l_2+4^k} | ||
r_2 = 1-(1-\frac{1}{{\lvert\Sigma\rvert}^k})^{l_2} \approx \frac{l_2}{l_2+{\lvert\Sigma\rvert}^k} | ||
The expected Jaccard index of the sketches of the random sequences is then: | ||
|
||
.. math:: | ||
j = \frac{r_1 r_2}{r_1 + r_2 - r_1 r_2} | ||
j_r = \frac{r_1 r_2}{r_1 + r_2 - r_1 r_2} | ||
...and the probability of observing at least :math:`x` shared k-mers can be | ||
...and the probability of observing at least :math:`x` shared :math:`k`-mers can be | ||
estimated with the tail of a cumulative binomial distribution: | ||
|
||
.. math:: | ||
p = 1 - \sum\limits_{i=0}^{x-1} \binom{s}{i} j^i (1-j)^{s-i} | ||
p = 1 - \sum\limits_{i=0}^{x-1} \binom{s}{i} j_r^i (1-j_r)^{s-i} | ||
.. [#f1] Broder ... | ||
.. [#f2] Fan ... | ||
.. [#f1] `Broder, A.Z. On the resemblance and containment of documents. Compression and Complexity of Sequences 1997 - Proceedings, 21-29 (1998). <http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=666900&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D666900>`_ | ||
.. [#f2] `Fan, H., Ives, A.R., Surget-Groba, Y. & Cannon, C.H. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC genomics 16, 522 (2015). <http://www.ncbi.nlm.nih.gov/pubmed/26169061>`_ |