
KmerStream

Streaming algorithm for computing kmer statistics for massive genomics datasets.

Installation

To compile, run make in the repository directory.

Running

To see the usage message, run KmerStream without any arguments:

KmerStream 1.1

Estimates occurrences of k-mers in fastq or fasta files and saves results

Usage: KmerStream [options] ... FASTQ files

-k, --kmer-size=INT      Size of k-mers, either a single value or comma separated list
-q, --quality-cutoff=INT Comma separated list, keep k-mers with bases above quality threshold in PHRED (default 0)
-o, --output=STRING      Filename for output
-e, --error-rate=FLOAT   Error rate guaranteed (default value 0.01)
-t, --threads=INT        Number of threads to use (default value 1)
-s, --seed=INT           Seed value for the randomness (default value 0, use time based randomness)
-b, --bam                Input is in BAM format (default false)
    --binary             Output is written in binary format (default false)
    --tsv                Output is written in TSV format (default false)
    --verbose            Print lots of messages during run
    --online             Prints out estimates every 100K reads
    --q64                set if PHRED+64 scores are used (@...h) default used PHRED+33
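
As an illustration of how these options combine, the following is a minimal Python sketch that assembles a KmerStream command line from the flags listed above. The helper name build_kmerstream_command, its parameters, and the file names are hypothetical and not part of KmerStream; the sketch also assumes the KmerStream executable is on your PATH.

```python
import shlex


def build_kmerstream_command(fastq_files, kmer_sizes, output,
                             quality_cutoffs=None, threads=1, seed=None,
                             tsv=False, executable="KmerStream"):
    """Return an argv list for KmerStream (hypothetical helper).

    Only flags documented in the usage message are used:
    -k, -o, -q, -t, -s and --tsv.
    """
    # -k takes a single value or a comma separated list, e.g. 31,47,63
    cmd = [executable, "-k", ",".join(str(k) for k in kmer_sizes), "-o", output]
    if quality_cutoffs:
        # -q also takes a comma separated list
        cmd += ["-q", ",".join(str(q) for q in quality_cutoffs)]
    if threads != 1:
        cmd += ["-t", str(threads)]
    if seed is not None:
        # a fixed seed makes the random hash functions reproducible
        cmd += ["-s", str(seed)]
    if tsv:
        cmd.append("--tsv")
    # input files come after the options
    cmd += list(fastq_files)
    return cmd


if __name__ == "__main__":
    # File names below are placeholders.
    cmd = build_kmerstream_command(
        ["reads_1.fastq", "reads_2.fastq"], [31, 47, 63], "reads.tsv",
        quality_cutoffs=[0, 20], threads=4, seed=42, tsv=True)
    print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True) to execute the command.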

Options:

  • -k the k-mer size. This should be an integer or a comma separated list of integers, e.g. -k 31 or -k 31,47,63. Odd values behave better than even values
  • -q optional quality cutoff values. Any k-mer containing a base with quality below the threshold is discarded
  • -o filename where the output should be written
  • -e guaranteed bound on the error of the estimator. The default value is 0.01 (1%); lower values increase memory usage
  • -t number of threads to use
  • -s KmerStream uses random hash functions for computing the statistics. To fix the hash functions for reproducibility, set the seed to a fixed nonzero value, e.g. '-s 42'. The default value 0 selects a time-based seed
  • -b Input is in BAM format
  • --binary Write output in binary format. This includes the data necessary for running KmerStreamJoin. The output filename is used as a prefix, and the file containing the output is named PREFIX followed by a suffix such as _Q_0_k_31
  • --tsv Write output in TSV (tab separated values) format for easier parsing
  • --online prints estimates every 100K reads; see https://pmelsted.wordpress.com/2014/07/12/analyzing-data-while-downloading/ for example usage
  • --q64 Quality values are encoded in PHRED+64 format rather than the default PHRED+33. Use this if your quality values range from @ to h rather than from ! to I
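
To make the -q and --q64 options concrete, here is a small Python sketch of quality-based k-mer filtering. It is an illustration of the idea, not KmerStream's actual implementation. The function names are hypothetical, and it assumes that a base whose quality equals the cutoff is kept, which this README does not state explicitly.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer PHRED scores.

    offset=33 corresponds to the default encoding (! to I),
    offset=64 to the encoding selected by --q64 (@ to h).
    """
    return [ord(c) - offset for c in quality]


def kmers_passing_quality(seq, quality, k, cutoff=0, offset=33):
    """Return the k-mers of seq whose bases all have quality >= cutoff.

    Assumption: a base exactly at the cutoff is kept.
    """
    scores = phred_scores(quality, offset)
    kept = []
    for i in range(len(seq) - k + 1):
        # discard the k-mer if any of its k bases falls below the cutoff
        if all(s >= cutoff for s in scores[i:i + k]):
            kept.append(seq[i:i + k])
    return kept


if __name__ == "__main__":
    # The last base has quality '!' (PHRED 0), so the final 3-mer is dropped.
    print(kmers_passing_quality("ACGTA", "IIII!", k=3, cutoff=20))
    # Same read with PHRED+64 encoding: 'h' is 40 and '@' is 0.
    print(kmers_passing_quality("ACGTA", "hhhh@", k=3, cutoff=20, offset=64))
```

With the default cutoff of 0, every k-mer is kept, which matches the documented default behaviour of -q.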

KmerStreamJoin

KmerStreamJoin 1.1

Creates union of many stream estimates

Usage: KmerStreamJoin -o output files ...
       KmerStreamJoin merged-file

-o, --output=STRING      Filename for output
    --verbose            Print output at the end

When run with the -o option, KmerStreamJoin takes a list of KmerStream binary output files (created with the --binary option of KmerStream) and creates a single binary output file that is equivalent to the result of a single KmerStream run on all of the input files. When the -o option is omitted, it outputs the KmerStream result stored in the given binary file.

This utility is useful when the binary files are created in a distributed fashion or computed incrementally.
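
The distributed workflow described above might be planned as in the following Python sketch, which only builds the command lines and does not run them. The helper names, the part0, part1, ... prefixes, and the input file names are hypothetical. The PREFIX_Q_<q>_k_<k> file name pattern is an assumption generalized from the single _Q_0_k_31 example given for --binary. Passing the same fixed seed to every run is a precaution rather than a documented requirement.

```python
def binary_output_name(prefix, quality_cutoff, k):
    """Guess the file written by --binary (assumed naming pattern)."""
    return f"{prefix}_Q_{quality_cutoff}_k_{k}"


def plan_distributed_run(fastq_files, k, merged_output,
                         quality_cutoff=0, seed=42):
    """Build commands for per-file KmerStream runs followed by a join."""
    per_file_cmds = []
    sketch_files = []
    for i, fastq in enumerate(fastq_files):
        prefix = f"part{i}"  # hypothetical output prefix for this input
        per_file_cmds.append([
            "KmerStream", "-k", str(k), "-q", str(quality_cutoff),
            "-s", str(seed),      # same seed for every run (precaution)
            "--binary", "-o", prefix, fastq,
        ])
        sketch_files.append(binary_output_name(prefix, quality_cutoff, k))
    # With -o, KmerStreamJoin merges the binary files into one.
    join_cmd = ["KmerStreamJoin", "-o", merged_output] + sketch_files
    # Without -o, it outputs the result stored in the merged file.
    report_cmd = ["KmerStreamJoin", merged_output]
    return per_file_cmds, join_cmd, report_cmd


if __name__ == "__main__":
    runs, join, report = plan_distributed_run(
        ["lane1.fastq", "lane2.fastq"], k=31, merged_output="merged.bin")
    for cmd in runs + [join, report]:
        print(" ".join(cmd))
```

Each per-file command can run on a different machine; only the small binary files need to be collected for the join step.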

KmerStreamEstimate.py

KmerStreamEstimate is a Python script that reads a TSV file as input (generated by running KmerStream with --tsv) and estimates the genome size (G), error rate (e), and coverage (lambda).