Counting k-mers in massive datasets

streamcount

This program counts occurrences of k-mers (strings of k characters) in an arbitrarily large input.

The program first takes a set of pattern strings, breaks them into k-mers, and builds a keyword tree with suffix links from this set of k-mers (see the Aho-Corasick algorithm).
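The step of breaking a pattern string into k-mers can be sketched as a sliding window. This is an illustration only: it assumes overlapping windows shifted by one character, and the names num_kmers and kmer_at are hypothetical, not functions from the streamcount sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of overlapping k-mers in a pattern string.
 * A pattern of length len holds len - k + 1 k-mers, or none if len < k. */
size_t num_kmers(const char *pattern, size_t k)
{
    assert(pattern != NULL && k > 0);
    size_t len = strlen(pattern);
    return len >= k ? len - k + 1 : 0;
}

/* Hypothetical helper: copy the k-mer starting at offset i into out,
 * which must have room for k + 1 bytes.
 * Returns 0 on success, -1 if offset i does not start a full k-mer. */
int kmer_at(const char *pattern, size_t k, size_t i, char *out)
{
    assert(out != NULL);
    if (i >= num_kmers(pattern, k))
        return -1;
    memcpy(out, pattern + i, k);
    out[k] = '\0';
    return 0;
}
```

For the pattern ACGTAC and k = 4, this yields the three k-mers ACGT, CGTA and GTAC.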

In the second step, each line of the input file is streamed through the keyword tree, and the counts of the matching k-mers are collected.
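The two steps can be illustrated with a small Aho-Corasick-style automaton. This is a minimal sketch, not the code in keyword_tree.c: it assumes all k-mers have the same length k and use only A, C, G and T, and every name in it (Node, Automaton, ac_build, ac_count) is hypothetical. Because all k-mers have equal length, none can be a proper suffix of another, so the sketch needs no separate output links.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SIGMA 4 /* DNA alphabet: A, C, G, T */

typedef struct {
    int next[SIGMA]; /* transition per base; -1 while the trie is being built */
    int fail;        /* suffix link */
    int kmer_id;     /* index of the k-mer ending at this node, or -1 */
} Node;

typedef struct {
    Node *nodes;
    int n_nodes;
} Automaton;

/* Map a base to 0..3, or -1 for any other character. */
static int base_code(char c)
{
    const char *alphabet = "ACGT";
    const char *p = c ? strchr(alphabet, c) : NULL;
    return p ? (int)(p - alphabet) : -1;
}

/* Append an empty node and return its index. */
static int new_node(Automaton *a)
{
    Node *nd = &a->nodes[a->n_nodes];
    for (int c = 0; c < SIGMA; c++)
        nd->next[c] = -1;
    nd->fail = 0;
    nd->kmer_id = -1;
    return a->n_nodes++;
}

/* Build the keyword tree for n k-mers of length k, then add suffix links. */
Automaton ac_build(const char **kmers, int n, int k)
{
    Automaton a;
    a.nodes = malloc(((size_t)n * (size_t)k + 1) * sizeof(Node));
    assert(a.nodes != NULL);
    a.n_nodes = 0;
    new_node(&a); /* the root is node 0 */
    for (int i = 0; i < n; i++) {
        int cur = 0;
        for (int j = 0; j < k; j++) {
            int c = base_code(kmers[i][j]);
            assert(c >= 0);
            if (a.nodes[cur].next[c] < 0) {
                int child = new_node(&a);
                a.nodes[cur].next[c] = child;
            }
            cur = a.nodes[cur].next[c];
        }
        if (a.nodes[cur].kmer_id < 0) /* a repeated k-mer keeps its first id */
            a.nodes[cur].kmer_id = i;
    }
    /* Breadth-first pass: set suffix links and fill in missing transitions. */
    int *queue = malloc((size_t)a.n_nodes * sizeof(int));
    assert(queue != NULL);
    int head = 0, tail = 0;
    for (int c = 0; c < SIGMA; c++) {
        int v = a.nodes[0].next[c];
        if (v < 0)
            a.nodes[0].next[c] = 0;
        else
            queue[tail++] = v;
    }
    while (head < tail) {
        int u = queue[head++];
        int f = a.nodes[u].fail;
        for (int c = 0; c < SIGMA; c++) {
            int v = a.nodes[u].next[c];
            if (v < 0) {
                a.nodes[u].next[c] = a.nodes[f].next[c];
            } else {
                a.nodes[v].fail = a.nodes[f].next[c];
                queue[tail++] = v;
            }
        }
    }
    free(queue);
    return a;
}

/* Stream one line through the automaton, adding matches to counts[]. */
void ac_count(const Automaton *a, const char *line, long *counts)
{
    int state = 0;
    for (const char *p = line; *p; p++) {
        int c = base_code(*p);
        if (c < 0) { state = 0; continue; } /* non-ACGT character: restart */
        state = a->nodes[state].next[c];
        if (a->nodes[state].kmer_id >= 0)
            counts[a->nodes[state].kmer_id]++;
    }
}
```

Each input character costs one table lookup, so the time per line does not depend on the number of k-mers in the index. The caller owns the index and releases it with free(a.nodes).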

The number of k-mers that can be counted simultaneously is limited by the amount of available RAM. It is also limited by the signed integer type SC_INT, defined as int32_t on line 18 of common.h. With this definition, we can build an index for at most INT32_MAX/k input k-mers. To increase this limit, redefine SC_INT as int64_t and recompile.
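The INT32_MAX/k bound can be worked out for a concrete k as below. This sketch only restates the arithmetic from the paragraph above; the function name max_indexable_kmers is hypothetical. A plausible reading of the bound, stated here as an assumption, is that each k-mer can add up to k nodes to the keyword tree and node indices are stored as SC_INT.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest number of k-mers of length k that can be
 * indexed when the index type can hold values up to max_index. */
int64_t max_indexable_kmers(int64_t max_index, int64_t k)
{
    assert(k > 0);
    return max_index / k;
}
```

For example, max_indexable_kmers(INT32_MAX, 31) gives 69273666, so roughly 69 million 31-mers fit under the default int32_t definition.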

Dependencies:

 zlib 
To install: apt-get install zlib1g-dev

To compile:

 make 

To run:

If you add the path of the compiled streamcount binary to your PATH variable, it can be run as a standard Unix command: streamcount

Program arguments

Required:

 --kmers 'kmers_file' 
where 'kmers_file' is the full path and file name of the file from which to extract the k-mers.
NOTE: The file with k-mers should contain only characters from a valid DNA alphabet. This should be dealt with prior to running the program.
 -i --input 'input_file' 

where 'input_file' is the full path and file name of the file in which to count the k-mers. If the input option is not specified, the program tries to read the input text from stdin. In this case, the following commands are valid:

 cat 'input_file' | ./streamcount --kmers 'kmers_file' 
 ./streamcount --kmers 'kmers_file' < 'input_file' 

By specifying only these two parameters, we accept the following default program behaviour:

  1. 'input_file' is of type FASTA. It can be compressed.
  2. Each line of 'kmers_file' is treated as a separate k-mer.
  3. The final count for each k-mer includes a count for its reverse complement string.
  4. The final counts for each k-mer are written to stdout, one count per line.
  5. If some k-mers in 'kmers_file' are not unique, the information about this is suppressed.
  6. Counting is performed with DEFAULT_NUMBER_OF_THREADS defined on line 24 in common.h.
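Default 3 in the list above relies on the reverse complement of a k-mer: the string read backwards with A swapped with T and C swapped with G. A minimal sketch of that operation follows. The function names are hypothetical and not taken from dna_common.c, and mapping any other character to N is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Complement of a single base; anything outside ACGT maps to 'N'
 * (an assumption of this sketch). */
static char complement(char base)
{
    switch (base) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return 'N';
    }
}

/* Write the reverse complement of seq into out, which must have room
 * for strlen(seq) + 1 bytes and must not overlap seq. */
void reverse_complement(const char *seq, char *out)
{
    assert(seq != NULL && out != NULL);
    size_t len = strlen(seq);
    for (size_t i = 0; i < len; i++)
        out[i] = complement(seq[len - 1 - i]);
    out[len] = '\0';
}
```

For example, the reverse complement of AACG is CGTT, so by default occurrences of CGTT in the input are added to the count of AACG.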

Optional:

Input options:

length of each k-mer
 -k='k' 
If a line of 'kmers_file' contains more than one k-mer, all of them are counted. In this case, the output for each line is a single line of comma-separated counts.
type of k-mers input
 --kmers-multiline 
Extracts k-mers from 'kmers_file', treating the entire file as one string.
type of input file
 --input-plain-text 
Treats the input as plain text lines rather than FASTA.
number of threads
 -t 
For best performance, set the number of threads to the number of cores. The maximum number of threads is 8; it can be redefined on line 23 of common.h.

Counting options:

reverse complement
 --no-rc 
Excludes the reverse complement count from the final count of each k-mer. This option can be useful when counting k-mers in a genomic sequence rather than in a set of reads.
memory in MB
 -m,     --mem='MEMORY_MB' 
Specify the amount of memory (in MB) that you are willing to dedicate to the k-mer index. This is used to estimate, before processing starts, whether the index will fit in memory. Default: 4000 MB.

Output options:

print options
 --printseq 
Prints each original line of 'kmers_file' before its count(s).
mark repeats
 --repeat-mask-tofile='repeat-mask-file' 
For each k-mer, prints 0 or 1 to 'repeat-mask-file': 1 if the k-mer is not unique (it repeats) in 'kmers_file', 0 otherwise. Use this when you need precise counts for all k-mers extracted from the same line: if the same k-mer also occurs on a different line, the counts of consecutive k-mers can be distorted.
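The repeat mask described above can be sketched as follows. This is an illustration under assumptions, not the program's implementation: the name repeat_mask is hypothetical, every occurrence of a repeated k-mer is marked with 1 (including the first), and the quadratic comparison is chosen for brevity rather than speed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: set mask[i] to 1 if kmers[i] occurs more than once
 * among the n k-mers, and to 0 otherwise. */
void repeat_mask(const char **kmers, size_t n, int *mask)
{
    assert(kmers != NULL && mask != NULL);
    for (size_t i = 0; i < n; i++)
        mask[i] = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (strcmp(kmers[i], kmers[j]) == 0) {
                mask[i] = 1;
                mask[j] = 1;
            }
        }
    }
}
```

For the k-mers ACG, CGT, ACG, GTA this gives the mask 1, 0, 1, 0. For a large k-mer set, sorting or hashing the k-mers first would avoid the quadratic cost.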

Sample usage:

The archive 'sample_data.zip' contains one sample input file and one k-mers file. It also contains SAMPLE_RUNS.txt with examples of running streamcount.