This is a program which counts occurences of k-mers (strings of length k characters) in an arbitrarily large input. streamcount
The program first takes a set of pattern strings, breaks the strings into k-mers, and builds from this set of k-mers a keyword tree with suffix links (see Aho-Corasick algorithm).
In the second part, each line of an input file is streamed through the keyword tree and the counters of the corresponding k-mers for this file are collected.
The number of k-mers which can be simultaneously counted is limited by the amount of the available RAM. The number is also limited by the use of the signed integer SC_INT defined as int32_t on line 18 of common.h. With this definition, we can build an index for at most Int32.MaxValue/k input k-mers. To increase this limit, redefine SC_INT as int64_t and recompile.
To install: apt-get install zlib1g-dev
If you add a path to the compiled streamcount to your PATH variable, it can be run as a standard unix command: streamcount To run:
where 'kmers_file' is the full path and file name of the file from which to extract the k-mers.
NOTE: The file with k-mers should contain only characters from a valid DNA alphabet. This should be dealt with prior to running the program.
-i --input 'input_file'
where 'input_file' is the full path and file name of the file where to count the k-mers. If the input option is not specified, the program tries to read the input text from stdin. In this case, the following commands are valid:
cat 'input_file' |./streamcount --kmers 'kmers_file'
./streamcount --kmers 'kmers_file' < 'input_file'
By specifying only these two parameters, we accept the following default program behaviour:
- 'input_file' is of type FASTA. It can be compressed.
- Each line of 'kmers_file' is treated as a separate k-mer.
- The final count for each k-mer includes a count for its reverse complement string.
- The final counts for each k-mer are written to stdout, one count per line.
- If some k-mers in 'kmers_file' are not unique, the information about this is supressed.
- Counting is performed with DEFAULT_NUMBER_OF_THREADS defined on line 24 in common.h.
length of each k-mer
If there are more than one k-mer in each input line, all of them will be considered. In this case, output for each line will consist of a line of comma-separated counts
type of k-mers input
This will extract k-mers from 'kmers_file' treating the entire file as one string
type of input file
This will treat input as text lines, rather than FASTA.
number of threads
It is optimal to define the number of threads as the number of cores. Maximum number of threads is set to 8. It can be redefined in common.h line 23
This will not include count of reverse complement into final count of each k-mer. This option can be useful when counting k-mers in a genomic sequence, rather than in set of reads.
memory in MB
Specify the amount of memory (in MB) that you are ready to sacrifice to hold a k-mer index. This is used to estimate if you can hold k-mers index prior to processing. Default: 4000MB
Prints each original line of 'kmers_file' before its count(s).
For each k-mer, prints to 'repeat-mask-file' 0 or 1. 1 is printed if this k-mer is not unique (repeats) in the 'kmers_file'. This is used if you need a precise count for all k-mers extracted from the same line. Because the same k-mer occurs also on a different line, the counts of consecutive k-mers could be distorted.