chp1 fig
luizirber committed Sep 15, 2020
1 parent 2d53e2d commit b211127
Showing 7 changed files with 110 additions and 37 deletions.
Binary file added experiments/smol_gather/figures/containment.pdf
Binary file not shown.
Binary file removed experiments/smol_gather/figures/containment_1000.pdf
Binary file not shown.
Binary file not shown.
69 changes: 45 additions & 24 deletions thesis/01-scaled.Rmd
@@ -8,15 +8,14 @@ The {#rmd-basics} text after the chapter declaration will allow us to link throu

## Introduction

<!-- TODO
- Note, can be narrow given the whole thesis introduction.
- paragraph 1: what is the technical problem of interest? lightweight compositional queries? motivate briefly with some biology, maybe.
- paragraph 2: motivate narrowing our focus to k-mer containment and minhash-based techniques. Will you consider dashing etc?
-->

New computational methods are required to analyze the growing volume of sequencing data
made available by falling costs,
since traditional methods like alignment don't scale to data of this magnitude.

An interesting class of algorithms is sketches [@gibbons_synopsis_1999]:
sublinear-space representations of the original data focused on queries for specific properties,
using hashing techniques to provide statistical guarantees on the precision of the answer to a query.
These probabilistic data structures allow a memory/accuracy trade-off:
using more memory leads to more accurate results,
but in memory-constrained situations results are still bounded by an expected error rate.

@@ -25,16 +24,14 @@ but in memory-constrained situations it still bounds results to an expected erro
The MinHash sketch [@broder_resemblance_1997] was developed at AltaVista in the context of document clustering and deduplication.
It provides an estimate of the Jaccard similarity
(called **resemblance** in the original article)
$$ J(A, B) = \frac{\vert A \cap B \vert }{\vert A \cup B \vert} $$
and the **containment** of two documents
$$C(A, B) = \frac{\vert A \cap B \vert }{\vert A \vert}$$
estimating how much of document $A$ is contained in document $B$.
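
As a toy illustration of the two formulas (the shingle sets and numbers below are made up for this example, not taken from the original paper), a short Rust snippet using only the standard library:

```rust
use std::collections::HashSet;

fn main() {
    // Hypothetical "documents" already converted to sets of 3-word shingles.
    let a: HashSet<&str> = ["the quick brown", "quick brown fox", "brown fox jumps"]
        .iter()
        .copied()
        .collect();
    let b: HashSet<&str> = ["quick brown fox", "brown fox jumps", "fox jumps over"]
        .iter()
        .copied()
        .collect();

    let intersection = a.intersection(&b).count() as f64;
    let union = a.union(&b).count() as f64;

    // J(A, B) = |A ∩ B| / |A ∪ B|
    let jaccard = intersection / union;
    // C(A, B) = |A ∩ B| / |A|: how much of A is contained in B
    let containment = intersection / a.len() as f64;

    println!("J(A, B) = {:.2}", jaccard);     // 2 / 4 = 0.50
    println!("C(A, B) = {:.2}", containment); // 2 / 3 = 0.67
}
```
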
These estimates depend on two processes:
converting documents to sets (*Shingling*),
and transforming large sets into short signatures,
while preserving similarity (*Min-Hashing*).
In the original use case the *$w$-shingling* $\Omega$ of a document $D$ is defined as the set of all continuous subsequences of $w$ words contained in $D$.
*Min-hashing* is the process of creating $W = \{\,h(x) \mid x \in \Omega\,\}$,
where $h(x)$ is a uniform hash function,
Expand Down Expand Up @@ -88,11 +85,11 @@ This precludes the need to keep the original data for the query,
and since the Bloom Filter is in most cases much smaller than the original dataset,
this leads to storage savings for this use case.

### Containment score and Mash Screen

\emph{Mash Screen} [@ondov_mash_2019] is a new method implemented in Mash for calculating containment scores.
Given a collection of reference MinHash sketches and a query sequence mixture (a metagenome, for example),
\emph{Mash Screen} builds a mapping of each distinct hash from the set of all hashes in the reference MinHash sketches to a count of how many times the hash was observed.
The query sequence mixture is decomposed and hashed with the same parameters $k$
(from the $k$-mer composition) and $h$ (the hash function) used to generate the reference MinHash sketches,
and for each hash in the query that is also present in the mapping the counter is updated.
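
A minimal sketch of this counting scheme follows. It is not the actual Mash Screen implementation (which streams raw sequencing reads and uses Mash's own hashing); the function name and toy hash values are illustrative only, but the structure is the one described above: one counter per distinct reference hash, updated while streaming the query.

```rust
use std::collections::HashMap;

/// Count how often each reference hash is observed in a stream of query hashes.
/// `reference_hashes` stands for the set of all distinct hashes across the
/// reference MinHash sketches; `query_hashes` stands for the hashes produced by
/// decomposing the query with the same k and hash function h as the references.
fn screen(reference_hashes: &[u64], query_hashes: impl Iterator<Item = u64>) -> HashMap<u64, u64> {
    // One counter per distinct reference hash, starting at zero.
    let mut counts: HashMap<u64, u64> = reference_hashes.iter().map(|&h| (h, 0)).collect();
    for h in query_hashes {
        // Hashes not present in any reference sketch are ignored.
        if let Some(counter) = counts.get_mut(&h) {
            *counter += 1;
        }
    }
    counts
}

fn main() {
    let references = [7u64, 13, 42, 99];
    let query = [13u64, 13, 42, 5, 7, 13];
    let counts = screen(&references, query.iter().copied());
    // Expected: {7: 1, 13: 3, 42: 1, 99: 0} (in arbitrary order)
    println!("{:?}", counts);
}
```
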
@@ -107,15 +104,15 @@ the Mash Containment,
to model the $k$-mer survival
-->

Comparing \emph{Mash Screen} to \emph{CMash},
the main difference is that the former streams the query as raw sequencing data,
while the latter generates a Bloom Filter for the query first.
\emph{Mash Screen} avoids repeated membership checks to a Bloom Filter by collecting all distinct hashes across the reference MinHash sketches first,
and then updating a counter associated with each hash if it is observed in the query.
After the query finishes streaming,
it is then summarized again against the sketches in the collection.

\emph{Mash Screen} needs the original data for the query during reanalysis,
because adding new sketches to the collection of references might introduce new hashes not observed before.
For large-scale projects like reanalysis of all the SRA metagenomes,
this requirement means continuous storage or re-download of many petabytes of data.
Expand Down Expand Up @@ -204,8 +201,32 @@ Figure \ref{fig:minhashes} shows an example comparing MinHash, ModHash and Scale

### Comparison with other containment estimation methods

In this section the _Scaled MinHash_ method implemented in `smol`
is compared to CMash (_Containment MinHash_)
and Mash Screen (_Containment Score_) for containment queries.
`smol` is a minimal implementation of _Scaled MinHash_ for demonstrating the method,
and doesn't include many features required for working with real biological data,
but its smaller code base makes it a more readable and concise example of the method.
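
To make the core of the method concrete, below is a minimal sketch of the _Scaled MinHash_ retention rule: keep a $k$-mer's hash only if it is at most $\frac{H}{scaled}$, where $H$ is the largest possible hash value, so that on average $\frac{1}{scaled}$ of the distinct $k$-mer hashes are retained. This is not `smol`'s actual code; the hash function (`DefaultHasher`) is a stand-in for the real one, and reverse-complement canonicalization is ignored.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Keep a k-mer's hash only if it falls at or below max_hash = u64::MAX / scaled,
/// so that roughly 1/scaled of all distinct k-mers are retained.
fn scaled_minhash(sequence: &[u8], k: usize, scaled: u64) -> BTreeSet<u64> {
    let max_hash = u64::MAX / scaled;
    let mut sketch = BTreeSet::new();
    for kmer in sequence.windows(k) {
        // DefaultHasher is a stand-in for the hash function h used by the real method.
        let mut hasher = DefaultHasher::new();
        kmer.hash(&mut hasher);
        let h = hasher.finish();
        if h <= max_hash {
            sketch.insert(h);
        }
    }
    sketch
}

fn main() {
    let seq = b"ACGTACGTGGTCACGTTAGCACGTAACGT";
    // scaled = 4 keeps roughly a quarter of the distinct k-mer hashes.
    let sketch = scaled_minhash(seq, 5, 4);
    println!("retained {} hashes", sketch.len());
}
```
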

Experiments were run for $k=\{21, 31, 51\}$
(except for Mash, which only supports $k \le 32$).
Mash and CMash were run with $n=\{1000, 100000\}$
to evaluate the containment estimates when using larger sketches.
The truth set is calculated using an exact $k$-mer counter implemented with a
_HashSet_ data structure in the Rust programming language.
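
A simplified stand-in for such an exact counter is sketched below; it is not the code used in the experiments, and it ignores reverse-complement canonicalization and other details a real $k$-mer counter handles, but it shows the ground-truth computation the estimates are compared against.

```rust
use std::collections::HashSet;

/// Exact containment C(A, B) = |A ∩ B| / |A| over the k-mer sets of two sequences.
/// Stand-in for the ground-truth counter used in the experiments.
fn exact_containment(query: &[u8], reference: &[u8], k: usize) -> f64 {
    let a: HashSet<&[u8]> = query.windows(k).collect();
    let b: HashSet<&[u8]> = reference.windows(k).collect();
    let shared = a.intersection(&b).count();
    shared as f64 / a.len() as f64
}

fn main() {
    // Toy sequences with k = 5; the experiments use k = 21, 31, 51.
    let query = b"ACGTACGTGG";
    let reference = b"TTACGTACGTGGTCA";
    println!("containment = {:.2}", exact_containment(query, reference, 5)); // 1.00
}
```
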

```{r minhash1000, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, out.width="100%", auto_pdf=TRUE, fig.cap='(ref:minhash1000)', fig.show="hold", fig.align="center"}
knitr::include_graphics('../experiments/smol_gather/figures/containment.pdf')
```

(ref:minhash1000) Letter-value plot [@hofmann_letter-value_2017] of the
differences between each containment estimate and the ground truth (exact).
Each method is evaluated for $k=\{21,31,51\}$,
except for `Mash` with $k=51$,
since `Mash` doesn't support $k>32$.
**A**: `Mash` and `CMash` using $n=1000$, _Scaled MinHash_ using $scaled=1000$.
**B**: $n=1000$, $scaled=1000$, excluding low-coverage genomes.
**C** and **D**: same as **A** and **B**, but with $n=10000$ for `Mash` and `CMash`.

<!-- TODO
- show method works, even if slow, lead to introduction of other indices
27 changes: 14 additions & 13 deletions thesis/05-decentralized.Rmd
@@ -319,7 +319,14 @@ The goal is to show the performance impact of a cold start (experiment 2),
and the overhead that IPFS imposes when used in conditions more similar to experiment 1),
when all the data is available.

Table: (\#tab:sbt-ipfs) Performance of MHBT operations with IPFS storage.
Units in seconds, with fold-difference to the baseline in parentheses.

| Experiment | takver | datasilo | rosewater |
|:---------------|-------:|-----------:|------------:|
| 1) Local (ZIP) | 9 | 43 (4.7x) | 14 (1.5x) |
| 2) IPFS | 12 | 115 (9.5x) | 415 (34.5x) |
| 3) IPFS rerun | 12 | 64 (5.5x) | 23 (1.9x) |

<!--
sourmash index -k 51 --traverse-directory index.sbt.zip sigs/
Expand All @@ -329,12 +336,6 @@ sourmash storage convert -b ipfs index.sbt.json
time to download zipped DB: 1m53s
-->


Experiment 1) is a measure of raw processing power,
and as expected `takver` is the fastest one.
`datasilo` suffers from the low-cost components in the system,
@@ -376,6 +377,12 @@ but otherwise they follow the same conditions as experiments 1-3.

Table: (\#tab:leaf-ipfs) Performance of leaf-only MHBT operations with IPFS storage. Units in seconds, with fold-difference to the baseline in parentheses.

| Experiment | takver | datasilo | rosewater |
|:------------------------|-------:|------------:|-------------:|
| 4) Leaf-only (zip) | 20 | 92 (4.6x) | 35 (1.7x) |
| 5) Leaf-only IPFS | 31 | 307 (14.6x) | 1267 (40.8x) |
| 6) Leaf-only IPFS rerun | 31 | 170 (5.4x) | 63 (2x) |

<!--
sourmash index -k 51 --sparseness 1 --traverse-directory index_leaf.sbt.zip sigs/
sourmash index -k 51 --sparseness 1 --traverse-directory index_leaf.sbt.json sigs/
Expand All @@ -384,12 +391,6 @@ sourmash storage convert -b ipfs index_leaf.sbt.json
time to download zipped DB: 51s
-->


The relative performance difference shows a 2-3 times slowdown when comparing experiments 4-6 to their counterparts in the previous section.
We can see more clearly the performance impact of reconstructing the internal nodes in experiment 4),
where all systems take twice as long to run when compared to experiment 1).
2 changes: 2 additions & 0 deletions thesis/06-conclusion.Rmd
@@ -1 +1,3 @@
# Conclusion {-}


49 changes: 49 additions & 0 deletions thesis/bib/thesis.bib
@@ -3454,3 +3454,52 @@ @article{lapierre_metalign_2020
urldate = {2020-09-12},
date = {2020-09-10},
}

@article{gibbons_synopsis_1999,
title = {Synopsis data structures for massive data sets},
volume = {50},
pages = {39--70},
journaltitle = {External memory algorithms},
author = {Gibbons, Phillip B. and Matias, Yossi},
date = {1999},
}

@video{noauthor_diversity_2020,
title = {Diversity Sampling: Genome Informatics 2020},
url = {https://www.youtube.com/watch?v=ygDL1u62Xho},
shorttitle = {Diversity Sampling},
abstract = {Lightning talk at Genome Informatics 2020 for our paper "Diversified {RACE} Sampling on Data Streams Applied to Metagenomic Sequence Analysis"
https://www.biorxiv.org/content/10.11...},
urldate = {2020-09-14},
date = {2020-09-03},
}

@video{noauthor_rambo-sequence_2020,
title = {{RAMBO}-sequence search: Genome Informatics 2020},
url = {https://www.youtube.com/watch?v=4iIKph6DTPQ},
shorttitle = {{RAMBO}-sequence search},
abstract = {Lightning talk at Genome Informatics 2020 for our work on "{RAMBO}: Repeated And Merged {BloOm} filter for ultra-fast sequence search on large-scale genomic data"
https://arxiv.org/abs/1910.02611},
urldate = {2020-09-14},
date = {2020-09-05},
}

@article{hofmann_letter-value_2017,
title = {Letter-Value Plots: Boxplots for Large Data},
volume = {26},
issn = {1061-8600},
url = {https://doi.org/10.1080/10618600.2017.1305277},
doi = {10.1080/10618600.2017.1305277},
shorttitle = {Letter-Value Plots},
abstract = {Boxplots are useful displays that convey rough information about the distribution of a variable. Boxplots were designed to be drawn by hand and work best for small datasets, where detailed estimates of tail behavior beyond the quartiles may not be trustworthy. Larger datasets afford more precise estimates of tail behavior, but boxplots do not take advantage of this precision, instead presenting large numbers of extreme, though not unexpected, observations. Letter-value plots address this problem by including more detailed information about the tails using “letter values,” an order statistic defined by Tukey. Boxplots display the first two letter values (the median and quartiles); letter-value plots display further letter values so far as they are reliable estimates of their corresponding quantiles. We illustrate letter-value plots with real data that demonstrate their usefulness for large datasets. All graphics are created using the R package lvplot, and code and data are available in the supplementary materials.},
pages = {469--477},
number = {3},
journaltitle = {Journal of Computational and Graphical Statistics},
author = {Hofmann, Heike and Wickham, Hadley and Kafadar, Karen},
urldate = {2020-09-14},
date = {2017-07-03},
note = {Publisher: Taylor \& Francis
\_eprint: https://doi.org/10.1080/10618600.2017.1305277},
keywords = {Fourths, Location depth, Order statistics, Quantiles, Tail area},
}
