Skip to content

lczech/fisk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

93 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FISK: Fast Iteration of Spaced K-mers

This is a code exploration project to test different implementations to extract spaced k-mers from a genomic sequence. We find that the algorithms implemented here are up to an order of magnitude faster than existing approaches to extract spaced k-mers.

Build and execute

Simply call

make

to build the program, and

./bin/fisk

This will produce performance measurements for all implemented benchmarks on the given hardware, measuring time per operation, such as per extracted k-mer:

  • bit_extract_weights.csv: Performance of bit extracting, for different weights of the mask.
  • bit_extract_blocks.csv: Performance of bit extracting, for masks with different numbers of blocks of consecutive 1s.
  • seq_enc.csv: Sequence encoding, from ASCII to 2-bit codes.
  • kmer_extract.csv: Regular k-mer extraction, traversing along an input sequence.
  • kmer_spaced_single.csv: Spaced k-mer extraction, using a single mask.
  • kmer_spaced_multi.csv: Spaced k-mer extraction, for multiple masks at once.
  • kmer_clark.csv: Small test to examine bottlenecks in the CLARK-S implementation.

We provide these results for all hardware architectures tested in benchmarks. We also added the results of our DuoHash benchmark there, called DuoHash.csv.

To create the plots from the manuscript, run ./plot/plot_all_cpus.sh. This requires some standard Python packages to be installed; a conda env file with these packages is provided in conda-env.yaml.

Implementation

The code separates the bit extraction functionality from the k-mer functionality. The bit extraction functions can thus be used in a general context as well.

Functions that process k-mers are mostly templated here, in order to allow us to benchmark different implemenations of, e.g., the nucleotide to two-bit encoding and the bit extraction. Thus, to use these function in your code, you might want to replace those template parameters with hard-coded versions for simplicity.

To use this in our own code, we recommend to use PEXT for bit extraction where possible, and a SIMD-accelerated Butterfly network otherwise. In the simplest case, use bit_extract_selector() as a dynamic check to test this, and a switch as shown in switch_bit_extract_mode() and run_bit_extract_mode() to run. This avoids overhead for function pointers in the hot loop per extracted value.

Similarly, for spaced k-mer extraction, use spaced_kmer_selector() to decide which extraction algorithm to use dynamically on a given hardware, and again a switch outside of the hot loop to dispatch between the algorithms.

Sources

Existing approaches for spaced k-mers:

Information on PEXT:

About

FISK: Fast Iteration of Spaced K-mers

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors