Skip to content

langmead-lab/dashing-experiments

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Dashing Experiments

This repository contains code and data for reproducing experiments from the manuscript accompanying the Dashing software application, available at https://github.com/dnbaker/dashing.

Build instructions

To build executables, either git clone --recursive bonsai && make, or if you have bonsai available on your system (with its submodules), make BONSAI_DIR=$PATH_TO_BONSAI.

Like dashing and bonsai, these require C++14.

dsexp

dsexp contains experiments testing the performance of various data structures for Jaccard-coefficient calculation.

  1. dsexp.cpp
    1. Performs numerical simulations comparing the error rates of bloom filters, minhash sketches, and hyperloglogs at specific sketch sizes for Jaccard-coefficient estimation of sets of varying sizes.
  2. dsexp.Rmd
    1. Contains code for visualizing results from dsexp.cpp, which can be used to reproduce Fig. 1 and Supplementary Table 1 from the manuscript.

timing

This code was used to generate Fig. 3, Table 2, and Supplementary Table 3.

  1. all_pairwise.py
    1. Performs all pairwise comparisons between a set of genomes across varying sketch size and kmer length, comparing bindash, mash, and several estimation methods for HyperLogLogs.
  2. filenames.txt
    1. A list of all genomes used for this experiment. These were fetched using bonsai's (https://github.com/dnbaker/bonsai/) download_genomes.py script, requesting all genomes.

accuracy

This code and data were used to generate Table 1, Fig. 2, and Supplementary Table 2.

  1. pairselector.py
    1. This script finds candidate genome pairs in specified ranges of Jaccard indices from a large, upper-triangular table of pairwise distances.
      1. We generated this table with dashing dist, with k=31 and p=16.
  2. pairwise_benchmark.cpp
    1. For all pairs of genomes provided, calculate the exact Jaccard index with hash sets, and report the difference and errors between estimates and the true value for bindash, mash, and several estimation methods for HyperLogLogs.
    2. This can be rather memory-intensive due to the use of full hash sets; for this reason, we suggest omitting large genomes from the call generating the table used in pairselector.py.
  3. genomes_for_exp.txt
    1. The set of genomes emitted by pairselector.py.
  4. ji_range.Rmd
    1. Contains code for generating Fig. 2 from the output of pairwise_benchmark.
  5. ji_range_postprocess.py
    1. Contains code for finalizing preparation of Table 1 from the output of pairwise_benchmark.

hash

This code was used to evaluate the accuracy of JI estimates as a function of hash function selection.

  1. testhash.cpp
    1. This code evaluates the performance of the HLL and experimental, related structures for cardinality estimation.
    2. We used this code, extracting only the HLL-relevant data, to provide experimental results relating to the performance of hash functions in the text.