Skip to content

Commit

Permalink
docs: distance update
Browse files Browse the repository at this point in the history
  • Loading branch information
Brian Ondov committed Mar 9, 2016
1 parent f4add5d commit 6e3d454
Showing 1 changed file with 55 additions and 31 deletions.
86 changes: 55 additions & 31 deletions doc/sphinx/distances.rst
Original file line number Diff line number Diff line change
@@ -1,85 +1,109 @@
Distance Estimation
===================

Jaccard distance formulation
----------------------------
MinHash Jaccard estimation
--------------------------

For sequences with sketches :math:`A` and :math:`B`, the distance of the
sequences can be estimated by the Jaccard distance of the sketches:
Given :math:`k`-mer sets :math:`A` and :math:`B`, the MinHash algorithm provides an
estimation of the Jaccard index:

.. math::
d_J(A,B) = 1 - \frac {\lvert A \cap B \rvert} {\lvert A \cup B \rvert}
j(A_s,B_s) = \frac {\lvert A_s \cap B_s \rvert} s
As implemented in :code:`mash`, the subsets :math:`A_s` and :math:`B_s` are used
such that :math:`\lvert A_s \cup B_s \rvert` is equal to the sketch size,
allowing for a known error bound based on the given sketch size, as suggested by
Broder [#f1]_:
where :math:`A_s` and :math:`B_s` are subsets such that
:math:`\lvert A_s \cup B_s \rvert` is equal to the sketch size, :math:`s`,
allowing for a known error bound as suggested by Broder [#f1]_. This is done by
using a merge-sort algorithm to find common values between the two sorted
sketches and terminating when the total number of hashes seen reaches the sketch
size (or all hashes in both sketches have been seen).

Mash distance formulation
-------------------------

For mutating a sequence with :math:`t` total :math:`k`-mers and a
conserved :math:`k`-mer count :math:`w`, an approximate mutation rate :math:`d` can be
estimated using a Poisson model of mutations occurring in :math:`k`-mers, as suggested
by Fan et al. [#f2]_:

.. math::
d_J(A_s,B_s) = 1 - \frac {\lvert A_s \cap B_s \rvert} s
d = \frac {-1} k \ln \frac w t
This is done by
using a merge-sort algorithm to find common values between the two sorted
sketches and terminating when the total number of hashes seen reaches the sketch
size (or all hashes in both sketches have been seen).
In order to use a Jaccard estimate :math:`j` between two :math:`k`-mer sets of arbitrary
sizes, the Jaccard estimate can be framed in terms of the conserved :math:`k`-mer count
:math:`w` and the average set size :math:`n`:

.. math::
For closer correlation with nucleotide identity, the distance can be log scaled
(:code:`-L` in :code:`mash dist`) and divided by the kmer size to account for
the increase chance of differences in larger kmers, as suggested by Fan et al. [#f2]_:
j \approx \frac w {2n - w}
To substitute :math:`n` for the total :math:`k`-mer count :math:`t` in the mutation
estimation, this approximation can be reformulated as:

.. math::
\frac w n \approx \frac {2j} {1 + j}
Substituting :math:`\frac w n` for :math:`\frac w t` thus yields the Mash
distance:

.. math::
d(A_s,B_s)=\Bigg\{\begin{split}
&1&,\ \lvert A_s \cap B_s \rvert=0\\
&\frac {-1} k \log \frac {\lvert A_s \cap B_s \rvert} s&,\ 0<\lvert A_s \cap B_s \rvert<=s
D(k,j)=\Bigg\{\begin{split}
&1&,\ j=0\\
&\frac {-1} k \ln \frac {2j} {1 + j}&,\ 0<j\le 1
\end{split}
Assessing significance with p-values
------------------------------------
Since MinHash distances are probabilistic estimates, it is important to
consider the probability of seeing a given distance by chance. :code:`mash dist`
thus provides p-values with distance estimations. Lower p-values correspond to
more confident distance estimations, and will often be rounded down to 0 due to
floating point limits. If p-values are high (above, say, 0.01), the k-mer size
floating point limits. If p-values are high (above, say, 0.01), the :math:`k`-mer size
is probably too small for the size of the genomes being compared.

When estimating the distance of genome 1 and genome 2 from sketches with the
properties:

:math:`k` := k-mer size
:math:`\Sigma` := alphabet

:math:`k` := :math:`k`-mer size

:math:`l_1` := length of genome 1

:math:`l_2` := length of genome 2

:math:`s` := sketch size

:math:`x` := number of shared k-mers between sketches of size :math:`s` of
:math:`x` := number of shared :math:`k`-mers between sketches of size :math:`s` of
genome 1 and genome 2

...the chance of a k-mer appearing in random sequences of lengths :math:`l_1`
...the chance of a :math:`k`-mer appearing in random sequences of lengths :math:`l_1`
and :math:`l_2` are estimated as:

.. math::
r_1 = 1-(1-\frac{1}{4^k})^{l_1} \approx \frac{l_1}{l_1+4^k}
r_1 = 1-(1-\frac{1}{{\lvert\Sigma\rvert}^k})^{l_1} \approx \frac{l_1}{l_1+{\lvert\Sigma\rvert}^k}
r_2 = 1-(1-\frac{1}{4^k})^{l_2} \approx \frac{l_2}{l_2+4^k}
r_2 = 1-(1-\frac{1}{{\lvert\Sigma\rvert}^k})^{l_2} \approx \frac{l_2}{l_2+{\lvert\Sigma\rvert}^k}
The expected Jaccard index of the sketches of the random sequences is then:

.. math::
j = \frac{r_1 r_2}{r_1 + r_2 - r_1 r_2}
j_r = \frac{r_1 r_2}{r_1 + r_2 - r_1 r_2}
...and the probability of observing at least :math:`x` shared k-mers can be
...and the probability of observing at least :math:`x` shared :math:`k`-mers can be
estimated with the tail of a cumulative binomial distribution:

.. math::
p = 1 - \sum\limits_{i=0}^{x-1} \binom{s}{i} j^i (1-j)^{s-i}
p = 1 - \sum\limits_{i=0}^{x-1} \binom{s}{i} j_r^i (1-j_r)^{s-i}
.. [#f1] Broder ...
.. [#f2] Fan ...
.. [#f1] `Broder, A.Z. On the resemblance and containment of documents. Compression and Complexity of Sequences 1997 - Proceedings, 21-29 (1998). <http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=666900&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D666900>`_
.. [#f2] `Fan, H., Ives, A.R., Surget-Groba, Y. & Cannon, C.H. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC genomics 16, 522 (2015). <http://www.ncbi.nlm.nih.gov/pubmed/26169061>`_

0 comments on commit 6e3d454

Please sign in to comment.