diff --git a/experiments/smol_gather/figures/containment.pdf b/experiments/smol_gather/figures/containment.pdf
new file mode 100644
index 0000000..66cf252
Binary files /dev/null and b/experiments/smol_gather/figures/containment.pdf differ
diff --git a/experiments/smol_gather/figures/containment_1000.pdf b/experiments/smol_gather/figures/containment_1000.pdf
deleted file mode 100644
index 88a05ca..0000000
Binary files a/experiments/smol_gather/figures/containment_1000.pdf and /dev/null differ
diff --git a/experiments/smol_gather/figures/containment_100000.pdf b/experiments/smol_gather/figures/containment_100000.pdf
deleted file mode 100644
index 5b5e75e..0000000
Binary files a/experiments/smol_gather/figures/containment_100000.pdf and /dev/null differ
diff --git a/thesis/01-scaled.Rmd b/thesis/01-scaled.Rmd
index 1a63618..1092a83 100644
--- a/thesis/01-scaled.Rmd
+++ b/thesis/01-scaled.Rmd
@@ -8,15 +8,14 @@ The {#rmd-basics} text after the chapter declaration will allow us to link throu
 
 ## Introduction
 
-
-
-A data sketch is a representative proxy for the original data focused on queries for specific properties.
-It is also a probabilistic data structure since it uses hashing techniques to provide statistical guarantees on the precision of the answer for a query.
-This allows a memory/accuracy trade-off:
+Sequencing data is increasingly abundant as costs drop,
+and traditional analysis methods like alignment do not scale to data of this magnitude,
+so new computational methods are required to leverage it.
+
+An interesting class of algorithms is sketches [@gibbons_synopsis_1999]:
+sublinear-space representations of the original data focused on queries for specific properties,
+using hashing techniques to provide statistical guarantees on the precision of the answer to a query.
+These probabilistic data structures allow a memory/accuracy trade-off:
 using more memory leads to more accurate results,
 but in memory-constrained situations it still bounds results to an expected error rate.
@@ -25,16 +24,14 @@ but in memory-constrained situations it still bounds results to an expected erro
 
 The MinHash sketch [@broder_resemblance_1997] was developed at Altavista in the context of document clustering and deduplication.
 It provides an estimate of the Jaccard similarity (called **resemblance** in the original article)
-and the **containment** of two documents,
+$$ J(A, B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert} $$
+and the **containment** of two documents
+$$ C(A, B) = \frac{\vert A \cap B \vert}{\vert A \vert} $$
 estimating how much of document $A$ is contained in document $B$.
-
 These estimates depend on two processes:
-converting documents to sets ("Shingling"),
+converting documents to sets (*Shingling*),
 and transforming large sets into short signatures,
-while preserving similarity ("Min-Hashing").
+while preserving similarity (*Min-Hashing*).
 In the original use case the *$w$-shingling* $\Omega$ of a document $D$ is defined as
 the set of all contiguous subsequences of $w$ words contained in $D$.
 *Min-hashing* is the process of creating $W = \{\,h(x) \mid \forall x \in \Omega\,\}$,
 where $h(x)$ is a uniform hash function,
@@ -88,11 +85,11 @@ This precludes the need to keep the original data for the query,
 and since the Bloom Filter is much smaller than the original dataset in most cases
 that leads to storage savings for this use case.
 
-### Containment score and mash screen
+### Containment score and Mash Screen
 
-\emph{mash screen} [@ondov_mash_2019] is a new method implemented in Mash for calculating containment scores.
+\emph{Mash Screen} [@ondov_mash_2019] is a new method implemented in Mash for calculating containment scores.
 Given a collection of reference MinHash sketches and a query sequence mixture (a metagenome, for example),
-\emph{mash screen} builds a mapping of each distinct hash from the set of all hashes in the reference MinHash sketches to a count of how many times the hash was observed.
+\emph{Mash Screen} builds a mapping of each distinct hash from the set of all hashes in the reference MinHash sketches to a count of how many times the hash was observed.
 The query sequence mixture is decomposed and hashed with the same parameters $k$ (from the $k$-mer composition)
 and $h$ (the hash function) used to generate the reference MinHash sketches,
 and for each hash in the query also present in the mapping, the counter is updated.
@@ -107,15 +104,15 @@ the Mash Containment,
 to model the $k$-mer survival
 -->
 
-Comparing \emph{mash screen} to \emph{CMash},
+Comparing \emph{Mash Screen} to \emph{CMash},
 the main difference is that the former streams the query as raw sequencing data,
 while the latter generates a Bloom Filter for the query first.
-\emph{mash screen} avoids repeated membership checks to a Bloom Filter by collecting all distinct hashes across the reference MinHash sketches first,
+\emph{Mash Screen} avoids repeated membership checks to a Bloom Filter by collecting all distinct hashes across the reference MinHash sketches first,
 and then updating a counter associated with each hash if it is observed in the query.
 After the query finishes streaming,
 it is then summarized again against the sketches in the collection.
-\emph{mash screen} needs the original data for the query during reanalysis,
+\emph{Mash Screen} needs the original data for the query during reanalysis,
 because adding new sketches to the collection of references might introduce new hashes not observed before.
 For large-scale projects like reanalysis of all the SRA metagenomes, this requirement means continuous storage or re-download of many petabytes of data.
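The streaming pass described above can be sketched in a few lines of Python. This is an illustrative toy, not the actual Mash implementation: the MD5-based hash function, the bottom-$n$ sketch construction, and the example sequences are all assumptions made for the sketch.

```python
import hashlib


def kmers(seq, k):
    """All k-length substrings (k-mers) of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def h(kmer):
    """Stand-in 64-bit uniform hash (first 8 bytes of MD5)."""
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")


def minhash_sketch(seq, k, n):
    """Bottom-n MinHash sketch: the n smallest distinct k-mer hashes."""
    return set(sorted({h(x) for x in kmers(seq, k)})[:n])


def screen(references, query_reads, k):
    """Mash Screen-style pass: map every distinct hash across all
    reference sketches to a counter, stream the query reads once
    while updating the counters, then report the containment of
    each reference sketch in the query."""
    counts = {hv: 0 for sketch in references.values() for hv in sketch}
    for read in query_reads:        # single pass over the raw query
        for x in kmers(read, k):
            hv = h(x)
            if hv in counts:        # one dict lookup, no per-reference checks
                counts[hv] += 1
    return {name: sum(1 for hv in sketch if counts[hv] > 0) / len(sketch)
            for name, sketch in references.items()}
```

Note how the counters only cover hashes already present in the reference sketches when the query was streamed: summarizing against a reference added later may involve hashes with no counter, which is why the raw query data is needed again for reanalysis.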
@@ -204,8 +201,32 @@ Figure \ref{fig:minhashes} shows an example comparing MinHash, ModHash and Scale
 
 ### Comparison with other containment estimation methods
 
-In this section the Scaled MinHash method is compared to CMash (containment
-MinHash) and mash screen (containment score).
+In this section the _Scaled MinHash_ method implemented in `smol`
+is compared to CMash (_Containment MinHash_)
+and Mash Screen (_Containment Score_) for containment queries.
+`smol` is a minimal implementation of _Scaled MinHash_ for demonstrating the method
+and doesn't include many of the features required for working with real biological data,
+but its smaller code base makes it a more readable and concise example of the method.
+
+Experiments were run for $k=\{21, 31, 51\}$
+(except for Mash, which only supports $k \le 32$).
+Mash and CMash were run with $n=\{1000, 100000\}$
+to evaluate the containment estimates when using larger sketches.
+The truth set is calculated using an exact $k$-mer counter implemented with a
+_HashSet_ data structure in the Rust programming language.
+
+```{r minhash1000, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, out.width="100%", auto_pdf=TRUE, fig.cap='(ref:minhash1000)', fig.show="hold", fig.align="center"}
+knitr::include_graphics('../experiments/smol_gather/figures/containment.pdf')
+```
+
+(ref:minhash1000) Letter-value plot [@hofmann_letter-value_2017] of the
+differences between containment estimates and the ground truth (exact).
+Each method is evaluated for $k=\{21,31,51\}$,
+except for `Mash` with $k=51$,
+since `Mash` doesn't support $k>32$.
+**A**: `Mash` and `CMash` using $n=1000$, _Scaled MinHash_ using $scaled=1000$.
+**B**: $n=1000$, $scaled=1000$, excluding low coverage genomes.
+**C** and **D**: same as **A** and **B**, but with $n=100000$ for `Mash` and `CMash`.
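The comparison set-up can be illustrated with a minimal Python sketch, standing in for the Rust/`smol` code actually used in the experiments (the MD5-based hash function and the sequences are assumptions, not the real implementation): a _Scaled MinHash_ keeps every hash below a fixed fraction of the hash space, so containment is estimated directly from the retained hashes, while the ground truth uses the full $k$-mer sets.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def h(kmer):
    """Illustrative 64-bit hash (MD5 prefix); not the hash used by smol."""
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")


def scaled_minhash(seq, k, scaled):
    """Scaled MinHash sketch: keep every k-mer hash below MAX_HASH / scaled,
    retaining on average 1/scaled of the distinct hashes."""
    threshold = MAX_HASH // scaled
    return {hv for hv in (h(x) for x in kmers(seq, k)) if hv < threshold}


def containment(sa, sb):
    """Containment estimate C(A, B) = |A ∩ B| / |A| from two sketches
    built with the same k and scaled parameters."""
    return len(sa & sb) / len(sa) if sa else 0.0


def exact_containment(seq_a, seq_b, k):
    """Ground truth from the full k-mer sets (the role of the exact
    HashSet-based counter in the experiments, with Python sets)."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a)
```

With $scaled=1$ every hash is kept and the estimate matches the exact value; larger `scaled` values shrink the sketch and widen the estimation error, which is the difference the letter-value plots measure.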
- experiment       takver   datasilo       rosewater
- -------------- -------- -------------- ------------------
- 1) local (zip)   9        43 (4.7x)      14 (1.5x)
- 2) ipfs          12       115 (9.5x)     415 (34.5x)
- 3) ipfs again    12       64 (5.5x)      23 (1.9x)
-
 Experiment 1) is a measure of raw processing power,
 and as expected `takver` is the fastest one.
 `datasilo` suffers from the low cost components in the system,
@@ -376,6 +377,12 @@ but otherwise they follow the same conditions as experiments 1-3.
 
 Table: (\#tab:leaf-ipfs) Performance of leaf-only MHBT operations with IPFS storage.
 Units in seconds, with fold-difference to the baseline in parentheses.
 
+| Experiment              | takver | datasilo    | rosewater    |
+|:------------------------|-------:|------------:|-------------:|
+| 4) Leaf-only (zip)      | 20     | 92 (4.6x)   | 35 (1.7x)    |
+| 5) Leaf-only IPFS       | 31     | 307 (14.6x) | 1267 (40.8x) |
+| 6) Leaf-only IPFS rerun | 31     | 170 (5.4x)  | 63 (2x)      |
+
- experiment              takver   datasilo       rosewater
- --------------         -------- -------------- ------------------
- 4) leaf-only (zip)       20       92 (4.6x)      35 (1.7x)
- 5) leaf-only ipfs        31       307 (14.6x)    1267 (40.8x)
- 6) leaf-only ipfs again  31       170 (5.4x)     63 (2x)
-
 The relative performance difference shows a 2-3 times slowdown when comparing experiments 4-6 to their counterparts in the previous section.
 We can see more clearly the performance impact of reconstructing the internal nodes in experiment 4),
 where all systems take twice as long to run when compared to experiment 1).
diff --git a/thesis/06-conclusion.Rmd b/thesis/06-conclusion.Rmd
index f3b7fa3..9b00b53 100644
--- a/thesis/06-conclusion.Rmd
+++ b/thesis/06-conclusion.Rmd
@@ -1 +1,3 @@
 # Conclusion {-}
+
+
diff --git a/thesis/bib/thesis.bib b/thesis/bib/thesis.bib
index cf4e7fb..925ab73 100644
--- a/thesis/bib/thesis.bib
+++ b/thesis/bib/thesis.bib
@@ -3454,3 +3454,52 @@ @article{lapierre_metalign_2020
   urldate = {2020-09-12},
   date = {2020-09-10},
 }
+
+@article{gibbons_synopsis_1999,
+  title = {Synopsis data structures for massive data sets},
+  volume = {50},
+  pages = {39--70},
+  journaltitle = {External Memory Algorithms},
+  author = {Gibbons, Phillip B. and Matias, Yossi},
+  date = {1999},
+}
+
+@video{noauthor_diversity_2020,
+  title = {Diversity Sampling: Genome Informatics 2020},
+  url = {https://www.youtube.com/watch?v=ygDL1u62Xho},
+  shorttitle = {Diversity Sampling},
+  abstract = {Lightning talk at Genome Informatics 2020 for our paper "Diversified {RACE} Sampling on Data Streams Applied to Metagenomic Sequence Analysis"
+
+https://www.biorxiv.org/content/10.11...},
+  urldate = {2020-09-14},
+  date = {2020-09-03},
+}
+
+@video{noauthor_rambo-sequence_2020,
+  title = {{RAMBO}-sequence search: Genome Informatics 2020},
+  url = {https://www.youtube.com/watch?v=4iIKph6DTPQ},
+  shorttitle = {{RAMBO}-sequence search},
+  abstract = {Lightning talk at Genome Informatics 2020 for our work on "{RAMBO}: Repeated And Merged {BloOm} filter for ultra-fast sequence search on large-scale genomic data"
+https://arxiv.org/abs/1910.02611},
+  urldate = {2020-09-14},
+  date = {2020-09-05},
+}
+
+@article{hofmann_letter-value_2017,
+  title = {Letter-Value Plots: Boxplots for Large Data},
+  volume = {26},
+  issn = {1061-8600},
+  url = {https://doi.org/10.1080/10618600.2017.1305277},
+  doi = {10.1080/10618600.2017.1305277},
+  shorttitle = {Letter-Value Plots},
+  abstract = {Boxplots are useful displays that convey rough information about the distribution of a variable. Boxplots were designed to be drawn by hand and work best for small datasets, where detailed estimates of tail behavior beyond the quartiles may not be trustworthy. Larger datasets afford more precise estimates of tail behavior, but boxplots do not take advantage of this precision, instead presenting large numbers of extreme, though not unexpected, observations. Letter-value plots address this problem by including more detailed information about the tails using “letter values,” an order statistic defined by Tukey. Boxplots display the first two letter values (the median and quartiles); letter-value plots display further letter values so far as they are reliable estimates of their corresponding quantiles. We illustrate letter-value plots with real data that demonstrate their usefulness for large datasets. All graphics are created using the R package lvplot, and code and data are available in the supplementary materials.},
+  pages = {469--477},
+  number = {3},
+  journaltitle = {Journal of Computational and Graphical Statistics},
+  author = {Hofmann, Heike and Wickham, Hadley and Kafadar, Karen},
+  urldate = {2020-09-14},
+  date = {2017-07-03},
+  note = {Publisher: Taylor \& Francis
+\_eprint: https://doi.org/10.1080/10618600.2017.1305277},
+  keywords = {Fourths, Location depth, Order statistics, Quantiles, Tail area},
+}