chp1 fig
luizirber committed Sep 15, 2020
1 parent 2d53e2d commit b211127
Showing 7 changed files with 110 additions and 37 deletions.
Binary file added experiments/smol_gather/figures/containment.pdf
Binary file not shown.
Binary file removed experiments/smol_gather/figures/containment_1000.pdf
Binary file not shown.
Binary file not shown.
69 changes: 45 additions & 24 deletions thesis/01-scaled.Rmd
@@ -8,15 +8,14 @@ The {#rmd-basics} text after the chapter declaration will allow us to link throu

## Introduction

<!-- TODO
- Note, can be narrow given the whole thesis introduction.
- paragraph 1: what is the technical problem of interest? lightweight compositional queries? motivate briefly with some biology, maybe.
- paragraph 2: motivate narrowing our focus to k-mer containment and minhash-based techniques. Will you consider dashing etc?
-->

New computational methods are required to analyze the growing volume of sequencing data
made available by falling costs,
since traditional methods like alignment don't scale to data of this magnitude.

An interesting class of algorithms is sketches [@gibbons_synopsis_1999]:
sublinear-space representations of the original data focused on queries for specific properties,
using hashing techniques to provide statistical guarantees on the precision of the answer to a query.
These probabilistic data structures allow a memory/accuracy trade-off:
using more memory leads to more accurate results,
but in memory-constrained situations results are still bounded by an expected error rate.

@@ -25,16 +24,14 @@ but in memory-constrained situations it still bounds results to an expected erro
The MinHash sketch [@broder_resemblance_1997] was developed at AltaVista in the context of document clustering and deduplication.
It provides an estimate of the Jaccard similarity
(called **resemblance** in the original article)
$$ J(A, B) = \frac{\vert A \cap B \vert }{\vert A \cup B \vert} $$
and the **containment** of two documents
$$C(A, B) = \frac{\vert A \cap B \vert }{\vert A \vert}$$
estimating how much of document $A$ is contained in document $B$.
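
As a toy illustration of the two formulas (the shingle sets and numbers below are made up for this example, not taken from the original paper), a short Rust snippet using only the standard library:

```rust
use std::collections::HashSet;

fn main() {
    // Hypothetical "documents" already converted to sets of 3-word shingles.
    let a: HashSet<&str> = ["the quick brown", "quick brown fox", "brown fox jumps"]
        .iter()
        .copied()
        .collect();
    let b: HashSet<&str> = ["quick brown fox", "brown fox jumps", "fox jumps over"]
        .iter()
        .copied()
        .collect();

    let intersection = a.intersection(&b).count() as f64;
    let union = a.union(&b).count() as f64;

    // J(A, B) = |A ∩ B| / |A ∪ B|
    let jaccard = intersection / union;
    // C(A, B) = |A ∩ B| / |A|: how much of A is contained in B
    let containment = intersection / a.len() as f64;

    println!("J(A, B) = {:.2}", jaccard);     // 2 / 4 = 0.50
    println!("C(A, B) = {:.2}", containment); // 2 / 3 = 0.67
}
```
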
These estimates depend on two processes:
converting documents to sets (*Shingling*),
and transforming large sets into short signatures,
while preserving similarity (*Min-Hashing*).
In the original use case the *$w$-shingling* $\Omega$ of a document $D$ is defined as the set of all continuous subsequences of $w$ words contained in $D$.
*Min-hashing* is the process of creating $W = \{\,h(x) \mid x \in \Omega\,\}$,
where $h(x)$ is a uniform hash function,
Expand Down Expand Up @@ -88,11 +85,11 @@ This precludes the need to keep the original data for the query,
and since the Bloom Filter is in most cases much smaller than the original dataset,
this leads to storage savings for this use case.

### Containment score and Mash Screen

\emph{Mash Screen} [@ondov_mash_2019] is a new method implemented in Mash for calculating containment scores.
Given a collection of reference MinHash sketches and a query sequence mixture (a metagenome, for example),
\emph{Mash Screen} builds a mapping of each distinct hash from the set of all hashes in the reference MinHash sketches to a count of how many times the hash was observed.
The query sequence mixture is decomposed and hashed with the same parameters $k$
(from the $k$-mer composition) and $h$ (the hash function) used to generate the reference MinHash sketches,
and for each hash in the query that is also present in the mapping the counter is updated.
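
A minimal sketch of this counting scheme follows. It is not the actual Mash Screen implementation (which streams raw sequencing reads and uses Mash's own hashing); the function name and toy hash values are illustrative only, but the structure is the one described above: one counter per distinct reference hash, updated while streaming the query.

```rust
use std::collections::HashMap;

/// Count how often each reference hash is observed in a stream of query hashes.
/// `reference_hashes` stands for the set of all distinct hashes across the
/// reference MinHash sketches; `query_hashes` stands for the hashes produced by
/// decomposing the query with the same k and hash function h as the references.
fn screen(reference_hashes: &[u64], query_hashes: impl Iterator<Item = u64>) -> HashMap<u64, u64> {
    // One counter per distinct reference hash, starting at zero.
    let mut counts: HashMap<u64, u64> = reference_hashes.iter().map(|&h| (h, 0)).collect();
    for h in query_hashes {
        // Hashes not present in any reference sketch are ignored.
        if let Some(counter) = counts.get_mut(&h) {
            *counter += 1;
        }
    }
    counts
}

fn main() {
    let references = [7u64, 13, 42, 99];
    let query = [13u64, 13, 42, 5, 7, 13];
    let counts = screen(&references, query.iter().copied());
    // Expected: {7: 1, 13: 3, 42: 1, 99: 0} (in arbitrary order)
    println!("{:?}", counts);
}
```
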
@@ -107,15 +104,15 @@ the Mash Containment,
to model the $k$-mer survival
-->

Comparing \emph{Mash Screen} to \emph{CMash},
the main difference is that the former streams the query as raw sequencing data,
while the latter generates a Bloom Filter for the query first.
\emph{Mash Screen} avoids repeated membership checks to a Bloom Filter by collecting all distinct hashes across the reference MinHash sketches first,
and then updating a counter associated with each hash if it is observed in the query.
After the query finishes streaming,
it is then summarized again against the sketches in the collection.

\emph{Mash Screen} needs the original data for the query during reanalysis,
because adding new sketches to the collection of references might introduce new hashes not observed before.
For large-scale projects like reanalysis of all the SRA metagenomes,
this requirement means continuous storage or re-download of many petabytes of data.
Expand Down Expand Up @@ -204,8 +201,32 @@ Figure \ref{fig:minhashes} shows an example comparing MinHash, ModHash and Scale

### Comparison with other containment estimation methods

In this section the _Scaled MinHash_ method implemented in `smol`
is compared to CMash (_Containment MinHash_)
and Mash Screen (_Containment Score_) for containment queries.
`smol` is a minimal implementation of _Scaled MinHash_ for demonstrating the method,
and doesn't include many features required for working with real biological data,
but its smaller code base makes it a more readable and concise example of the method.
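
To make the core of the method concrete, below is a minimal sketch of the _Scaled MinHash_ retention rule: keep a $k$-mer's hash only if it is at most $\frac{H}{scaled}$, where $H$ is the largest possible hash value, so that on average $\frac{1}{scaled}$ of the distinct $k$-mer hashes are retained. This is not `smol`'s actual code; the hash function (`DefaultHasher`) is a stand-in for the real one, and reverse-complement canonicalization is ignored.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Keep a k-mer's hash only if it falls at or below max_hash = u64::MAX / scaled,
/// so that roughly 1/scaled of all distinct k-mers are retained.
fn scaled_minhash(sequence: &[u8], k: usize, scaled: u64) -> BTreeSet<u64> {
    let max_hash = u64::MAX / scaled;
    let mut sketch = BTreeSet::new();
    for kmer in sequence.windows(k) {
        // DefaultHasher is a stand-in for the hash function h used by the real method.
        let mut hasher = DefaultHasher::new();
        kmer.hash(&mut hasher);
        let h = hasher.finish();
        if h <= max_hash {
            sketch.insert(h);
        }
    }
    sketch
}

fn main() {
    let seq = b"ACGTACGTGGTCACGTTAGCACGTAACGT";
    // scaled = 4 keeps roughly a quarter of the distinct k-mer hashes.
    let sketch = scaled_minhash(seq, 5, 4);
    println!("retained {} hashes", sketch.len());
}
```
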

Experiments were run for $k=\{21, 31, 51\}$
(except for Mash, which only supports $k \le 32$).
Mash and CMash were run with $n=\{1000, 100000\}$
to evaluate the containment estimates when using larger sketches.
The truth set is calculated using an exact $k$-mer counter implemented with a
_HashSet_ data structure in the Rust programming language.
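
A simplified stand-in for such an exact counter is sketched below; it is not the code used in the experiments, and it ignores reverse-complement canonicalization and other details a real $k$-mer counter handles, but it shows the ground-truth computation the estimates are compared against.

```rust
use std::collections::HashSet;

/// Exact containment C(A, B) = |A ∩ B| / |A| over the k-mer sets of two sequences.
/// Stand-in for the ground-truth counter used in the experiments.
fn exact_containment(query: &[u8], reference: &[u8], k: usize) -> f64 {
    let a: HashSet<&[u8]> = query.windows(k).collect();
    let b: HashSet<&[u8]> = reference.windows(k).collect();
    let shared = a.intersection(&b).count();
    shared as f64 / a.len() as f64
}

fn main() {
    // Toy sequences with k = 5; the experiments use k = 21, 31, 51.
    let query = b"ACGTACGTGG";
    let reference = b"TTACGTACGTGGTCA";
    println!("containment = {:.2}", exact_containment(query, reference, 5)); // 1.00
}
```
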

```{r minhash1000, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, out.width="100%", auto_pdf=TRUE, fig.cap='(ref:minhash1000)', fig.show="hold", fig.align="center"}
knitr::include_graphics('../experiments/smol_gather/figures/containment.pdf')
```

(ref:minhash1000) Letter-value plot [@hofmann_letter-value_2017] of the
differences between each containment estimate and the ground truth (exact).
Each method is evaluated for $k=\{21,31,51\}$,
except for `Mash` with $k=51$,
since `Mash` doesn't support $k>32$.
**A**: `Mash` and `CMash` using $n=1000$, _Scaled MinHash_ using $scaled=1000$.
**B**: $n=1000$, $scaled=1000$, excluding low-coverage genomes.
**C** and **D**: same as **A** and **B**, but with $n=10000$ for `Mash` and `CMash`.

<!-- TODO
- show method works, even if slow, lead to introduction of other indices
27 changes: 14 additions & 13 deletions thesis/05-decentralized.Rmd
@@ -319,7 +319,14 @@ The goal is to show the performance impact of a cold start (experiment 2),
and the overhead that IPFS imposes when used in conditions more similar to experiment 1),
when all the data is available.

Table: (\#tab:sbt-ipfs) Performance of MHBT operations with IPFS storage.
Units in seconds, with fold-difference to the baseline in parentheses.

| Experiment | takver | datasilo | rosewater |
|:---------------|-------:|-----------:|------------:|
| 1) Local (ZIP) | 9 | 43 (4.7x) | 14 (1.5x) |
| 2) IPFS | 12 | 115 (9.5x) | 415 (34.5x) |
| 3) IPFS rerun | 12 | 64 (5.5x) | 23 (1.9x) |

<!--
sourmash index -k 51 --traverse-directory index.sbt.zip sigs/
Expand All @@ -329,12 +336,6 @@ sourmash storage convert -b ipfs index.sbt.json
time to download zipped DB: 1m53s
-->


Experiment 1) is a measure of raw processing power,
and as expected `takver` is the fastest one.
`datasilo` suffers from the low-cost components in the system,
@@ -376,6 +377,12 @@ but otherwise they follow the same conditions as experiments 1-3.

Table: (\#tab:leaf-ipfs) Performance of leaf-only MHBT operations with IPFS storage. Units in seconds, with fold-difference to the baseline in parentheses.

| Experiment | takver | datasilo | rosewater |
|:------------------------|-------:|------------:|-------------:|
| 4) Leaf-only (zip) | 20 | 92 (4.6x) | 35 (1.7x) |
| 5) Leaf-only IPFS | 31 | 307 (14.6x) | 1267 (40.8x) |
| 6) Leaf-only IPFS rerun | 31 | 170 (5.4x) | 63 (2x) |

<!--
sourmash index -k 51 --sparseness 1 --traverse-directory index_leaf.sbt.zip sigs/
sourmash index -k 51 --sparseness 1 --traverse-directory index_leaf.sbt.json sigs/
Expand All @@ -384,12 +391,6 @@ sourmash storage convert -b ipfs index_leaf.sbt.json
time to download zipped DB: 51s
-->


The relative performance difference shows a 2-3 times slowdown when comparing experiments 4-6 to their counterparts in the previous section.
We can see more clearly the performance impact of reconstructing the internal nodes in experiment 4),
where all systems take twice as long to run when compared to experiment 1).
2 changes: 2 additions & 0 deletions thesis/06-conclusion.Rmd
@@ -1 +1,3 @@
# Conclusion {-}


49 changes: 49 additions & 0 deletions thesis/bib/thesis.bib
@@ -3454,3 +3454,52 @@ @article{lapierre_metalign_2020
urldate = {2020-09-12},
date = {2020-09-10},
}

@article{gibbons_synopsis_1999,
title = {Synopsis data structures for massive data sets},
volume = {50},
pages = {39--70},
journaltitle = {External memory algorithms},
author = {Gibbons, Phillip B. and Matias, Yossi},
date = {1999},
}

@video{noauthor_diversity_2020,
title = {Diversity Sampling: Genome Informatics 2020},
url = {https://www.youtube.com/watch?v=ygDL1u62Xho},
shorttitle = {Diversity Sampling},
abstract = {Lightning talk at Genome Informatics 2020 for our paper "Diversified {RACE} Sampling on Data Streams Applied to Metagenomic Sequence Analysis"
https://www.biorxiv.org/content/10.11...},
urldate = {2020-09-14},
date = {2020-09-03},
}

@video{noauthor_rambo-sequence_2020,
title = {{RAMBO}-sequence search: Genome Informatics 2020},
url = {https://www.youtube.com/watch?v=4iIKph6DTPQ},
shorttitle = {{RAMBO}-sequence search},
abstract = {Lightning talk at Genome Informatics 2020 for our work on "{RAMBO}: Repeated And Merged {BloOm} filter for ultra-fast sequence search on large-scale genomic data"
https://arxiv.org/abs/1910.02611},
urldate = {2020-09-14},
date = {2020-09-05},
}

@article{hofmann_letter-value_2017,
title = {Letter-Value Plots: Boxplots for Large Data},
volume = {26},
issn = {1061-8600},
url = {https://doi.org/10.1080/10618600.2017.1305277},
doi = {10.1080/10618600.2017.1305277},
shorttitle = {Letter-Value Plots},
abstract = {Boxplots are useful displays that convey rough information about the distribution of a variable. Boxplots were designed to be drawn by hand and work best for small datasets, where detailed estimates of tail behavior beyond the quartiles may not be trustworthy. Larger datasets afford more precise estimates of tail behavior, but boxplots do not take advantage of this precision, instead presenting large numbers of extreme, though not unexpected, observations. Letter-value plots address this problem by including more detailed information about the tails using “letter values,” an order statistic defined by Tukey. Boxplots display the first two letter values (the median and quartiles); letter-value plots display further letter values so far as they are reliable estimates of their corresponding quantiles. We illustrate letter-value plots with real data that demonstrate their usefulness for large datasets. All graphics are created using the R package lvplot, and code and data are available in the supplementary materials.},
pages = {469--477},
number = {3},
journaltitle = {Journal of Computational and Graphical Statistics},
author = {Hofmann, Heike and Wickham, Hadley and Kafadar, Karen},
urldate = {2020-09-14},
date = {2017-07-03},
note = {Publisher: Taylor \& Francis
\_eprint: https://doi.org/10.1080/10618600.2017.1305277},
keywords = {Fourths, Location depth, Order statistics, Quantiles, Tail area},
}
