khmer-counting.tex

\documentclass{article}
\usepackage{simplemargins}

%\usepackage{multirow}
\usepackage[pdftex]{graphicx}
\graphicspath{{figures/}}
\bibliographystyle{plain}
\setlength{\parindent}{0pt} \setlength{\parskip}{1.6ex}
\setallmargins{1in} \linespread{1.6}

% figures: khmer blue; tallymer green; jellyfish red; dsk yellow

% @CTB do we want to talk about k-mer frequency spectrum instead of k-mer
%      counting?
% @CTB mention altmet stuff? ``top X packages''
% @CTB be sure to mention Conway & Bromage paper.
% @CTB point out that implementation is really easy!
% @CTB discuss importance of in-memory counting
% @CTB relate to compression, too.
% @CTB discuss tunable error rate.

% @CTB mention altmetrics/popularity; API
% @CTB put in 'diff' command in Makefile
% @CTB emphasize annoying size of data, etc.
% @CTB talk about standard analyses vs ours.
% @CTB put in two lines for khmer disk usage, 1% and 5%?

% @CTB   ref the use of Amazon for benchmarks; cloud environment.
% @CTB   do we want to compare cloud vs not cloud?
% @CTB   mention: can update at any time!
% @CTB   note tallymer .mct building into benchmarks?
% @CTB   point out we are memory bound

% @CTB new stuff/TODO for submission:
% @CTB   citation combine
% @CTB   spellcheck
% @CTB   put in github tag references

\begin{document}


% Title must be 150 characters or less
\begin{flushleft}
{\Large \textbf{Count Me Maybe: A Probabilistic Approach to k-mer Counting} }
% Insert Author names, affiliations and corresponding author email.
\\
Qingpeng Zhang$^{1}$, 
Jason Pell$^{1}$,
Eric McDonald$^{1}$,
Rosangela Canino-Koning$^{1}$,
Adina Chuang Howe$^{2,3}$,and 
C. Titus Brown$^{1,2\ast}$
\\
\bf{1} Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
\\
\bf{2} Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, USA
\\
\bf{3} Plant, Soil, and Microbial Sciences, Michigan State University, East Lansing, MI, USA
\\
$\ast$ E-mail: ctb@msu.edu
\end{flushleft}

\section{Abstract}

\paragraph{Introduction:}

K-mer abundance analysis is widely used for many purposes in sequence
analysis, including data preprocessing for de novo assembly, repeat
detection, and sequencing coverage estimation.  Recently, a number of
new k-mer counting libraries have emerged to handle the increasing
amount of data available from sequencing platforms; these libraries
offer various trade-offs between memory and disk usage.

\paragraph{Results:}

We present the khmer software package for fast and memory efficient
online counting of individual k-mer abundances in sequencing data
sets. Unlike previous methods based on data structures such as hash
tables, suffix arrays, and trie structures, khmer relies entirely on a
simple probabilistic data structure, a Count-Min Sketch.  On sparse
data sets, this data structure is considerably more memory efficient
than any exact data structure.  The trade-off for this memory
efficiency is that the use of a Count-Min Sketch introduces a
systematic positive overcount for k-mers.  Here we analyze the
analysis speed, memory usage, and miscount rate of our Count-Min
Sketch implementation for generating the k-mer frequency distribution
in simulated and real sequencing data sets.  We also compare the
performance of khmer to several other k-mer counting packages,
Tallymer, Jellyfish, and DSK.  We discuss the tradeoffs and benefits
of khmer, which, while slower than Jellyfish and DSK, is very simple to
implement and allows online updating of k-mer counts.  Finally, we
present two applications of khmer on short-read sequencing data:
first, doing reference-free error analysis of short-read data with
k-mer abundances; and second, trimming errors from short reads.

\paragraph{Conclusion:}

The Count-Min Sketch, as implemented in khmer, is an effective and
efficient tool for online k-mer counting in biological sequences.  In
particular, the miscount behavior of the Count-Min Sketch performs
well on error-prone short-read sequencing data.  We show that the
khmer package offers a useful set of trade-offs for certain k-mer
counting applications.  khmer is implemented in C++ wrapped with a
Python interface, offers a tested and robust API, and is freely
available under the BSD license at github.com/ged-lab/khmer.

\section{Introduction}

A k-mer is a substring of length k in a DNA sequence, where k is
usually between 4 and the length of a short sequencing read. The goal
of k-mer counting is to determine the number of occurrences for each
k-mer in a dataset composed of many sequence reads. Efficient k-mer
counting plays an important role in many bioinformatics approaches,
including data preprocessing for de novo assembly, repeat detection,
and sequencing coverage estimation

One particular application of k-mer frequency analysis is the
detection and removal of sequencing errors.  Because sequencing errors
generate many erroneous k-mers with low abundance, we can optimize for
more heavyweight computational approaches such as assembly by removing
reads with too many low-abundance k-mers prior to assembly. Similarly,
we can estimate coverage and remove redundancy by analyzing k-mer
distributions within reads \cite{Brown2012}.  In both cases,
pre-assembly filtering of reads to reduce dataset sizes is a crucial
component of assembly time and memory reduction (cite SGA paper,
Simpson JT). Additionally, we can use k-mer counting to evaluate
genome size and the coverage of sequencing reads, which can help
determine parameter settings and analyze assembly results
\cite{Chikhi2013}. K-mer counting can also give important information
for predicting regions with repetitive elements such as transposons
\cite{Kurtz2008}.

The central challenge for k-mer counting in short-read shotgun
sequencing data set is that these data sets are both relatively sparse
in k-mers
and contain many erroneous k-mers.  The data sets are sparse because for
typical values of k such as k=32, only a small fraction of the total
possible number of k-mers ($4^{32}$) are actually present in the data
set.  The high error rate (e.g. Illumina has a ~0.1-1\% per-base error rate @cite) generates many unique k-mers.  In particular,
as the total number of
reads generated increases, the total number of errors grows linearly,
leading to data sets where the erroneous k-mers vastly outnumber the
true k-mers \cite{Conway2011}.  Dealing with the resulting large number of k-mers, most of which
are erroneous, has become an unavoidable and challenging task
\cite{Minoche2011}.

A variety of k-mer counting approaches, and standalone software
packages implementing them, have emerged in recent years; this
includes Tallymer\cite{Kurtz2008}, Jellyfish\cite{Marcais2011}, BF-Counter\cite{Melsted2011}, DSK\cite{Rizk2013}, KMC\cite{Deorowicz2013}, and Turtle\cite{Roy2013}.
% @CTB see http://arxiv.org/abs/1305.1861)
% @CTB include more discussion of BFcounter & turtle?
These
approaches and implementations each offer different algorithmic
trade-offs and different efficiencies, and enable a non-overlapping set
of functionality.  Tallymer uses a suffix tree to store k-mer counts
in memory and on disk.  Jellyfish stores k-mer counts in in-memory
hash tables, and makes use of disk storage to scale to larger
data sets.  BF-Counter uses a Bloom filter as a pre-filter to avoid
counting unique k-mers, and is the first published probabilistic approach
to k-mer counting.  DSK adopts a streaming approach to k-mers that
enables time- and memory-efficient k-mer counting with an explicit
trade-off between disk and memory usage.  And KMC relies primarily
on fast and inexpensive disk access to count k-mers in very little
memory.
% @CTB discuss turtle.

Our motivation for exploring efficient k-mer counting comes from our
work with metagenomic data, where we routinely encounter data sets
that contain $300^9$ bases of DNA and over 50 billion distinct k-mers
\cite{Howe2012}.  In order to efficiently filter, partition, and
assemble these data, we need to store counts for each of these k-mers
in main memory, and query and update them in realtime -- a set of
functionality not readily offered by any current packages.  Moreover,
we wish to enable the use cloud computing and desktop computers, which may
have poor I/O or limited memory. This has
dictated our exploration of efficient in-memory k-mer counting
techniques.


Below, we describe an implementation of a simple probabilistic data
structure for k-mer counting.  This data structure is based on a
Count-Min Sketch, a generalized probabilistic data structure for
storing the frequency distributions of distinct elements
\cite{Cormode2005}.  Our implementation is based on an extension of a
Bloom filter, which has been previously used for k-mer counting and de
Bruijn graph storage and traversal
\cite{Bloom70,BroderM03,Melsted2011,Pell2012}.

% @CTB also cite Quip, Jones et al.


The Count-Min Sketch approach is particularly memory efficient, with
memory usage that can significantly outperform exact data structures
 \cite{Conway2011,Pell2012}.  However, the use of a probabilistic
data structure introduces counting errors, which have effects that
must be analyzed in the context of specific problems.  Below, we
compare CPU, memory and disk usage of our implementation with that of
Tallymer, Jellyfish, and DSK, and show that khmer is competitive in speed,
memory, and disk usage.  We also analyze the effects of this
counting error on calculations of the frequency distribution for
sequencing data sets, and in particular on metagenomic data sets.

Our software implementation, called khmer, is implemented in C++ with
a Python wrapper, enabling flexible use and reuse by users with a wide
range of expertise.  The software package is freely available for
academic and commercial use and redistribution under the BSD license
at github.com/ged-lab/khmer/.  khmer comes with substantial
documentation and many tutorials, and contains extensive unit tests.
Moreover, we have built several applications on top of khmer,
including memory-efficient de Bruijn graph partitioning
\cite{Pell2012} and lossy compression of short read data sets for
assembly \cite{Brown2012}.

\section{Results}

\subsection{Implementing a Count-Min Sketch for k-mers}

The implementation details are similar to those of the Bloom filter in
\cite{Pell2012}, but with the use of 8 bit counters instead of 1 bit
counters.  Briefly, Z hash tables are allocated, each with a different
size of approximately H bytes; the sum of these hash table sizes must
fit within available main memory.  To increment the count for a
particular k-mer, a single hash is computed for the k-mer, and the
modulus of that hash with each hash table's size H gives the location
for each hash table; the associated count in each hash table is then
incremented by 1.  To retrieve the count for a k-mer, the same hash is
computed and the minimum count across all hash tables is computed.
While different in implementation detail from the standard Count-Min Sketch,
which uses a single hash table with many hash
functions, the performance details are identical \cite{Pell2012}.

To analyze the false positive rate -- the probability with which a
given k-mer count will be incorrect -- we can look at the hash table
load. Suppose N unique k-mers have been counted using Z hash tables,
each with size H.  The probability that no collisions happened in a
specific entry in one hash table is $(1-1/H)^{N}$, which can be
approximated as $e^{-N/H}$. The individual collision rate in one hash
table is $1-e^{-N/H}$. The total collision rate, which is the
probability that a collision occurred in each entry where a k-mer maps
to in all Z hash tables, is $(1-e^{-N/H})^{Z}$. In this situation, the
counts in all Z hash table bins cannot give the true count of a k-mer.

While the false positive rate can easily be calculated from the hash
table load, the average {\em miscount} depends on the k-mer frequency
distribution, which must be determined empirically.  We analyze the
effects of this below.

\subsection{khmer efficiently calculates k-mer abundance histograms}

We measured the time and memory usage to calculate k-mer abundance
histograms in five soil metagenomic read data sets using khmer,
Tallymer, Jellyfish, and DSK (Figures \ref{cmp_time} and
\ref{cmp_memory}).  We chose to benchmark abundance histograms because
this was a built-in feature common to all four software packages,
we applied each package to increasing subsets of a 50m read soil
metagenome data set (@cite adina).

Figure \ref{cmp_time} shows that the time usage of our khmer approach
is comparable to Jellyfish and DSK, and, as expected, increases linearly
with data set size.

From Figure \ref{cmp_memory}, we see that the memory usage of both
Jellyfish and Tallymer increases linearly with dataset size, although
Jellyfish is more efficient than Tallymer in memory usage for smaller
k size. Using option -parts 4 for Tallymer subroutine suffix reduces
the memory usage. But the second step of the Tallymer counting method
- the mkindex subroutine -- will always use more memory as the number
of k-mers being counted increases.  For a 5 GB dataset with 2.7
billion total k-mers, Jellyfish uses 5 GB memory; Tallymer exceeds 24
GB of memory usage for a smaller 4 GB data set.  DSK uses less memory
than all of the other programs for large data sets, but at the cost
of more limited functionality (discuss below).
% @CTB check units

In addition, the memory usage of khmer also increases linearly with
data set size as long as we hold the error rate constant.  However,
the memory usage of khmer varies substantially with the desired false
positive rate: we can decrease the memory usage by increasing the
false positive rate, as shown in Figure \ref{cmp_memory}.  We can also
see from the figure that with a low false positive rate of 1\%, the
memory usage is competitive with other programs; with a higher 5\%
false positive rate, the memory usage is lower than all but the
disk-based DSK.  The memory usage of Jellyfish is dependent on the k-mer
size, because it uses hash tables to store k-mer counts and must store
the exact k-mer to track collisions; however, khmer's memory
usage is independent of k, because it does not track collisions.

Another concern is disk usage: both Jellyfish and Tallymer use large
index files stored on the hard disk.  Figure \ref{cmp_disk} shows that
the disk usage also increases linearly with the dataset size. For a
dataset of 5 GB, the disk usage of both Jellyfish and Tallymer is
around 30 GB.  khmer is entirely in-memory and does not rely on any
on-disk storage to store or retrieve k-mer counts, although for
practicality the hash tables can be saved for later reuse; the
uncompressed disk usage for khmer in Figure \ref{cmp_disk} is the same
as its memory.  At the expense of more time, khmer supports saving and
loading gzip-compressed hash tables, which are competitive in size to
DSK's on-disk database.

\subsection{khmer counts k-mers efficiently}

Tallymer, Jellyfish, and khmer all support random access to k-mer
counts, while DSK does not.  We measured the time it took to access
9.7m 22-mers across five different data sets (Figure \ref{cmp_count}),
after the initial databases had been built.  Here, khmer and Tallymer
performed very well, dramatically outperforming Jellyfish.  In all
three cases, system time dominated the overall time required to
retrieve k-mers, suggesting that the primary reason for the linear
increase in retrieval time was due to the increased size of the
database on the disk.  In particular, khmer is constant in retrieval
time once the hash tables are loaded into memory.

\subsection{The measured counting error rate is low on short-read data}

Miscounts are a significant concern for probabilistic counters.
However, a large number of the k-mers in short-read dataset are
low-abundance, leading to a skewed k-mer abundance distribution which
should result in good performance for Count-Min Sketch-like approaches
\cite{Melsted2011} (also @cite CMS basics). Here we use both real and
simulated datasets to evaluate the counting performance in practice.
Our primary measure for evaluation is the {\em offset}, which is the
difference between the true k-mer frequency and the frequency reported
by khmer.

Figure \ref{average_offset_vs_fpr} shows the relationship between average 
miscount and counting error rate for four different test data sets with
identical numbers of distinct k-mers: one
metagenome data set; a simulated set of random k-mers; a simulated set
of reads, chosen with 3x coverage and 1\% error; a simulated set of
reads (3x) with no error.
Even when the counting error rate is as high as 0.9 -- 90\% of k-mers have an incorrect count -- the average offset is still below 4.

We separately analyzed the average {\em percentage} miscount between
true and false k-mers; e.g. a miscount of 4 for a k-mer whose true
count is 1 would be 400\%.  Figure \ref{percent_offset_vs_fpr} shows the relationship between average 
miscount and counting error rate for the same four data sets as in Figure \ref{average_offset_vs_fpr}.  For an error rate of 0.1 (10\% of k-mer counts are incorrect), the average percentage offset is less than 10\% for all four data sets.
% @CTB recheck these numbers when QP fixes the script :)

We see here that for a fixed counting error rate, the simulated reads
without error have the highest average miscount and the simulated
k-mers data have the lowest average miscount.  This is because the
abundance distributions have the least and most left-skew,
respectively: the simulated reads without error have no abundance-1
k-mers, while the randomly chosen k-mers are abundance-1.

\subsection{Sequencing error profiles can be measured with k-mer abundance
profiles}

One specific use for khmer is detecting random sequencing errors by
looking at the k-mer spectrum within reads \cite{Medvedev2011}.
Low-abundance k-mers contained in a high-coverage data set typically
represent random sequencing errors, and a variety of read trimming and
error correcting tools use k-mer counting to reduce the error content
of the read data set, independent of quality scores or reference
genomes \cite{Kelley2010}.  This is an application where the counting
error effect of the Count-Min Sketch approach used by khmer may be
particularly tolerable, because khmer never underestimates counts, so
will never falsely call an erroneous k-mer.

Here we use khmer to examine the sequencing error pattern of a 5m-read
subset of an Illumina reads dataset from single-colony sequencing of
{\em E. coli} \cite{pubmed21926975}.  As shown in Figure
\ref{perc_unique_pos} there are many more unique k-mers close to the
3' end of reads.  This is almost certainly due to the increased error
rate at the 3' end of reads.

\subsection{Using khmer to trim errors from reads}

% @CTB make sure we make the effects of false positives clear.

(@CTB revise this section :)

As discussed above, we can detect erroneous k-mers in high coverage
data sets by finding low abundance k-mers.  Removing or trimming reads
containing unique or low-abundance k-mers will remove many errors,
which can help scale de Bruijn graph based assembly methods.
\cite{Qin2010} \cite{Hess2011} \cite{Mackelprang2011}
% all of which used low abundance trimming for metagenomic data).

The khmer k-mer counting approach can filter reads based on
k-mer abundance, and can be used for arbitrary k. One approach to
k-mer abundance filtering involves removing any read that contains
even a single low-abundance k-mer.
This filtering can be implemented
in two passes: the first pass for counting k-mers from reads and the
second pass for filtering the reads. The counting error here manifests
as an overestimated count in hash entries with one or more collisions,
in which case k-mers hashing to that entry may not be correctly
flagged as low-abundance.  High counting error rate therefore manifest
as "lenient" filtering, in which reads may not be properly
removed. However, any read that is trimmed will be correctly
trimmed. To reduce the effect of such counting error rate, we can do
the filtering iteratively; after each run of filtering, more
reads with low-abundance k-mers will be discarded.

As an example of this method, we filtered out reads with low abundance
k-mers from a human gut microbiome metagenomic dataset (MH0001) with
more than 42 million reads. In fact we want any read with any unique
k-mer to be discarded.  We used hash tables with different size to
show the influence of hash table size. We also showed the effect of
iterative filtering to reduce false positive rate. To assess the
counting error rate, we used Tallymer to obtain the actual accurate
count of the k-mers in the dataset.

From Figure \ref{num_remaining_reads}, we see that after each run,
more low-abundance reads were discarded. With larger hash table, the
low-abundance reads were discarded faster. On the other hand, from
Figure \ref{num_remaining_reads}, we can see that after each
iteration of filtering, the percentage of ``bad'' reads - reads with
unique k-mers - decreased. After four iterations, the percentage of
bad reads was less than 4\%. The result showed that with our method
nearly 40\% of the original reads were discarded by removing the
low-abundance reads with an acceptable false positive rate (less than
4\% after four iterations of filtering). This reduces the memory and
time requirements considerably in follow-on assembly \cite{Howe2012}.

% @CTB what is the initial false positive rate here? It'd be nice
% to show how iterative filtering can lead to substantial improvements
% in the second round, if enough k-mers are removed in the first round.
% Also, can we use ``number of reads with low-abundance k-mers'' instead
% of number of remaining reads on the y axis?
% @QP Figure 7 But here one of the purpose of figure 6 is that it can show the
% iteration of filtering can reduce the number of remaining reads by discarding
% reads with low-abundance k-mers. Figure 7 can not show this information.

\section{Discussion}

\subsection{khmer is a useful k-mer counting approach}

From the performance comparison between khmer and other k-mer counting
packages in calculating k-mer abundance distributions, we can conclude
that khmer offers a range of useful performance tradeoffs for disk
I/O, time, and memory.  In time, khmer performs competitively with DSK
and Jellyfish; khmer also provides a way to systematically trade
memory for miscounts, across a wide range of parameters.  khmer's
uncompressed disk storage is competitive with Jellyfish, and, in
situations where disk space is at a premium, khmer can take advantage
of gzip compression to provide storage similar to that of DSK.

Interestingly, DSK performs especially well in terms of memory usage
for calculating the abundance distribution of k-mers. However, in
exchange for this efficiency, retrieving specific k-mer counts at
random is likely to be quite slow, as DSK is optimized for iterating
across partition sets of k-mers rather than randomly accessing k-mer
counts.

For direct k-mer counting speed, khmer significantly outperforms both
Tallymer and Jellyfish.  This is not surprising, since this was a
primary motivation for the development of khmer.

We also note that, unlike Tallymer and Jellyfish, khmer can be used
for {\em online} counting of k-mers: that is, query and updating of
k-mer counts can be done directly as k-mers are being loaded, with no
need for disk access or an indexing step.  This provides the
flexibility needed for streaming applications such as digital
normalization \cite{Brown2012}.

% @CTB ref UW QUIP?

\subsection{khmer memory usage is fixed and low}

The memory usage of the basic Count-Min Sketch approach is fixed, which means
that khmer will never crash due to memory limitations, and all
operations can be performed in main memory without recourse to disk
storage.  However, the memory size chosen must be considered in light
of the false positive rate and miscount acceptable for a given
application.  In practice, we have found that a false positive rate of
between 1\% and 10\% offers acceptable miscount performance for a wide range of
tasks, including digital normalization and low-abundance read-trimming.

For any given dataset, the size and number of hash tables will
determine the accuracy of k-mer counting with khmer.  Thus, the user
can control the memory usage based on the desired level of
accuracy. The time usage for the first step of k-mer counting , to
consume the k-mers into a counting data structure, depends on the
total amount of data, since we must traverse every k-mer in every read.
The second step, k-mer retrieval, is algorithmically constant for
fixed k; however, for practicality, the hash tables are usually saved
to and loaded from disk, meaning that k-mer retrieval time depends directly
on the size of the database being queried.

The memory usage of khmer is particularly low for sparse data sets,
especially since only main memory is used and no disk space is
necessary beyond that required for the read data sets.  This is no
surprise: the information theoretic comparison in
\cite{Pell2012} shows that, for sparse sequencing data sets, Bloom
filters require considerably less memory than any possible exact
information storage for a wide range of error rates and data set
sparseness.

In our implementation we use 1 byte to store the count of each k-mer
in the data structure. Thus the maximum count for a k-mer will be 255.
In cases where tracking bigger counts is required, khmer also provides
an option to use an STL map data structure to store counts above 255,
with the trade-off of significantly higher memory usage.  In the
future, we may extend khmer to counters of arbitrary bit sizes.

\subsection{Error rates in k-mer counting are low and predictable}

The Count-Min sketch is a probabilistic data structure with a
one-sided error that results in random overestimates of k-mer
frequency, but does not generate underestimates. The chosen parameters
of the data structure will influence the accuracy of the count.  While
the probability of an inaccurate count can easily be estimated based
on the hash table load, the miscount size is dependent on details of
the frequency distribution of k-mers \cite{Cormode2005}.

More specifically, in the analysis of the Count-Min
sketch\cite{Cormode2005}, the offset between the incorrect count and
actual count is related to the total number of k-mers in a dataset and
the size of each hash table. Further study has shown that the behavior
of Count-Min sketch depends on specific characteristics of the data
set under consideration, most especially skewness\cite{Rusu2008,
  CormodeM05}.  Thus these probabilistic properties suit short reads
from next generation sequencing data sets: the counts are not so
wrong for next generation sequencing reads data sets because of the
highly left-skewed abundance distribution of k-mers in those data sets.

Figures \ref{average_offset_vs_fpr} and \ref{percent_offset_vs_fpr}
demonstrate these properties very well.  We see more correct
counting for error-prone reads from a genome than for error-free
reads from a genome, with a normal
distribution of k-mer abundance.  Thus, this counting approach is
especially suitable for high diversity data sets, such as metagenomic
data, in which a larger proportion of k-mers are low abundance or
unique due to sequencing errors.

\subsection{Real-world applications for khmer}

For many applications, an approximate k-mer count is sufficient.  For
example, to eliminate reads with low abundance k-mers, we
can tolerate a certain number of reads with low frequency remaining in
the resulting data set falsely.  If RAM-limited we can do the
filtering iteratively so that at each step we are making more
effective use of the available memory.

Another example is measuring sequencing error profiles within reads
using k-mer abundance profiles. As shown above, we can reliably use
the abundance distribution of unique k-mers, since the counting errors
from collision does not influence the count of these unique
k-mers.  Furthermore, because the rate of inaccurate counting can be
predicted, we can adjust the parameters of the data structure to make
sure the count accuracy satisfies our downstream analysis.

A third example, digital normalization (discussed in
\cite{Brown2012}), uses khmer to efficiently compress short-read data
sets with a streaming lossy compression algorithm.  In all three
examples, the fact that khmer does not break an imposed memory bound
is extremely useful, since for many data sets -- especially
metagenomic data sets (@cite adina) -- we are operating in memory
limited circumstances.  Moreover, because the false positive rate is
straightforward to measure, the user can be warned or the results
invalidated when too little memory is used.

\subsection{Conclusion}

K-mer counting is widely used, and as sequencing data set sizes
increase, graceful degradation of data structures in the face of large
amounts of data becomes quite useful.  This is especially true when
the theoretical and practical effects of the degradation can be
predicted (see e.g. \cite{Melsted2011, Pell2012, Roy2013}).  This
graceful degradation is a key property of the Count-Min Sketch
approach, and its implementation in khmer.

The khmer implementation offers good performance, a robust and
well-tested Python API, and a number of useful and well-documented
scripts.  While Jellyfish and DSK also offer very good performance,
khmer is competitive in many situations, and, because it provides a
Python API, is more flexible than either Jellyfish or DSK.  In
memory-limited situations with poor I/O performance, khmer is
particularly useful, because it will not break an imposed memory bound
and does not use the disk to store k-mer counts.  However, in exchange
for this memory guarantee, counting becomes increasingly incorrect as
less memory is used or as the data set size grows larger.

\subsection{Future considerations}

Scaling khmer to extremely large data sets with many unique k-mers
requires a large amount of memory: approximately 446 GB of memory is
required to achieve a false positive rate of 1\% for $N \approx
50x10^9$. It is possible to reduce the required memory by dividing
k-mer space into multiple partitions and counting k-mers separately
for each partition. Partitioning k-mer space into $M$ partitions
results in a linear decrease in the number of k-mers under
consideration, thus reducing the occupancy by a constant factor $M$
and correspondingly reducing the collision rate.  Partitioning k-mer
space is a generalization of the systematic prefix filtering approach,
where one might first count all k-mers starting with AA, then AC, then
AG, AT, CA, etc., which is equivalent to partitioning k-mer space into
16 equal-sized partitions. These partitions can be calculated
independently, either across multiple machines or iteratively on a
single machine, and the results stored for later comparison or
analysis.  This is similar to the approach taken by DSK
\cite{Rizk2013}, and could easily be implemented in khmer.

Further optimization of khmer on single machines, e.g. for multi-core
architectures, is unlikely to achieve significantly greater speed.
Past a certain point k-mer counting is fundamentally I/O bound
@cite McDonald paper.

Perhaps the most interesting future direction for probabilistic k-mer
counting is that suggested by Turtle \cite{Roy2013}, in which several
data structures are provided, each with different tradeoffs, but with
a common API.  We hope to pursue this direction in the future by
integrating such approaches into khmer.
% see: http://delivery.acm.org.proxy1.cl.msu.edu/10.1145/2490000/2487357/p123-huang.pdf?ip=35.8.11.2&acc=ACTIVE%20SERVICE&key=C2716FEBFA981EF16B5540ABA7506E55713C06D93B9F76B9&CFID=344104418&CFTOKEN=64899208&__acm__=1372542398_54d84e5dea52f6a71c5b8e462040a819

\section{Methods}

\subsection{Code and data set availability}

% @CTB update

The version of khmer used to generate the results below is available
at http://github.com/ged-lab/khmer.git, tag '2013-khmer-counting'.
Scripts specific to this paper are available in the paper repository
at https://github.com/ged-lab/2013-khmer-counting.  Tutorials etc.

\subsection{Sequence Data}

Two human gut metagenome reads datasets (MH0001 and MH0002) were used from the 
MetaHIT (Metagenomics of the Human Intestinal Tract) project\cite{Qin2010}. 
The MH0001 dataset contains approximately 59 million reads, each 44bp long. 
The MH0002 dataset consists of about 61 million 75bp long reads.
We trimmed each FASTA file to remove low quality sequences. 

Five soil metagenomics reads data sets with different size were taken
from GPGC project for benchmark purpose.
(Iowa Prairie Table 1\ to cite here.)
% @CTB we will need accession numbers.  Adina has them.

We also generated four short read data sets to assess the false
positive rate and miscount distribution. One is a subset of a real
metagenomics data set from the MH0001 dataset, above. The second
consists of randomly generated reads. The third and fourth contain
reads simulated from a random, 1 Mbp long genome.  The third has a
substitution error rate of 3\%, and the fourth contains no errors. The
four data sets were chosen to contain identical numbers of unique
22-mers.

\subsection{Count-Min Sketch implementation}

We implemented the Count-Min Sketch data structure, a simple
probabilistic data structure for counting distinct elements
\cite{Cormode2005}.  Our implementation uses $Z$ independent hash
tables, each containing a prime number of counters $H_i$.  The hashing
function for each hash table is fixed, and bijectively converts each
DNA k-mer (for k $<=$ 32) into a 64-bit number to which the modulus of
the hash table size is applied.  This provides $Z$ distinct hash
functions (see also \cite{adina2013}).

To increment the count associated with a k-mer, the counter associated
with the hashed k-mer in each of the $N$ hash tables is incremented.
To retrieve the count associated with a k-mer, the minimum count
across all $N$ hash tables is chosen.

In this scheme, collisions are explicitly not handled, so the count
associated with a k-mer may not be accurate. Because collisions only
falsely {\em increment} counts, however, the retrieved count for any
given k-mer is guaranteed to be no less than the correct count.  Thus
the counting error is one-sided.

\subsection{Hash function and khmer implementation}

The current khmer hash function works only for $k <= 32$ and converts
DNA strings exactly into 64-bit numbers.  However, any hash function
would work. For example, a cyclic hash would enable khmer to count
k-mers larger in size than 32; this would not change the scaling
behavior of khmer at all, and is a planned extension.

By default khmer counts k-mers in DNA sequence, i.e. strandedness is
disregarded by having the hash function choose the lower numerical
value for the exact hash of both a k-mer and its reverse complement.
This behavior is configurable via a compile-time option.

\subsection{Comparing with other k-mer counting programs}

We generated k-mer abundance histograms from five soil metagenomic reads
datasets of different sizes using our khmer counting package and
three other k-mer counting packages - Tallymer, Jellyfish and DSK -  to compare
performance.  We fixed k at 22.

\paragraph{khmer:}
For khmer, we set hash table sizes to fix the false positive rate at
either 1\% or 5\%, and used 8 threads in loading the data.

The khmer random-access k-mer counting benchmark was done with a
custom-written Python script {\tt khmer-count-kmers} which loaded the
database file and then used the Python API to query each k-mer
individually.

\paragraph{Tallymer:}
Tallymer is from the genometools package version 1.3.4, and was run
with the following options. For the suffixerator subroutine we used:
{\tt -dna -pl -tis -suf -lcp}.  We varied the {\tt -parts n} option to
create the index with only 1/n of the total data in main memory, which
reduces the memory usage.  We separately used {\tt -parts 4} and {\tt
  -parts 1} to test performance.

For the mkindex subroutine, we used: {\tt -mersize 31} and {\tt -mersize 22}.

The Tallymer random access k-mer counting benchmark was done using the
'tallymer search' routine against both strands; see the script
{\tt tallymer-search.sh}.

\paragraph{Jellyfish:}
Jellyfish is version 1.1.2 and the multithreading option is set to 8 threads.

Jellyfish uses a hash table to store the k-mers and the size of the
hash table can be modified by the user.  When the specified hash table
size is not large enough and fills up, jellyfish writes it to the hard
disk and initializes a new hash table to more k-mers.  Here we use a
similar strategy as in \cite{Melsted2011} and chose the minimum size of the hash 
tables for Jellyfish so that all k-mers were stored in memory.

We ran Jellyfish with the options as below:

{\tt jellyfish count -m 22 -c 2 -C} for k=22.

{\tt jellyfish count -m 31 -c 2 -C} for k=31.

The Jellyfish random access k-mer counting benchmark was performed
using the 'query' routine and querying against both strands; see
the script {\tt jelly-search.sh}.

\paragraph{DSK:} We ran DSK with default parameters.

\bibliography{khmer-counting}

\begin{table}[ht]
    \begin{tabular}{ |c | c |c| c|c| }
      \hline                        
       & size of file(GB) & number of reads & number of distinct
      k-mers & total number of k-mers \\
      \hline
    dataset1 & 1.90 & 9,744,399 & 561,178,082 & 630,207,985
    \\
    dataset2 & 2.17 & 19,488,798 & 1,060,354,144 & 1,259,079,821
    \\ 
    dataset3 & 3.14 & 29,233,197 & 1,445,923,389 & 1,771,614,378
    \\ 
    dataset4 & 4.05 & 38,977,596 & 1,770,589,216 & 2,227,756,662
    \\ 
    dataset5 & 5.00 & 48,721,995 & 2,121,474,237 & 2,743,130,683
    \\
      \hline  
    \end{tabular}
\end{table}

\newcommand{\bigcell}[2]{\begin{tabular}{@{}#1@{}}#2\end{tabular}}

\begin{tabular}{ |c | c |c| c|c| }
  \hline                        
  ~ÊÊÊÊÊÊÊ & \bigcell{c}{Real metagenomics\\reads} & \bigcell{c}{Totally random 
reads\\with randomly\\ generated k-mers} & \bigcell{c}{Simulated reads\\from 
simulated\\genome with error} & \bigcell{c}{Simulated reads\\from 
simulated\\genome without error}  \\
  \hline
ÊÊÊÊÊÊÊÊSize of data set file      & 7.01MÊÊÊ & 3.53MÊÊÊÊ& 5.92MÊÊ   & 9.07M 
ÊÊÊÊÊÊÊÊÊÊÊÊ \\ 
ÊÊÊÊÊÊÊÊNumber of total k-mers     & 2917200Ê & 2250006ÊÊ& 3757479ÊÊÊ& 
5714973ÊÊÊÊÊÊÊÊÊ \\ 
ÊÊÊÊÊÊÊÊNumber of unique k-mers    & 1944996  & 1973430ÊÊ& 1982403ÊÊ & 
1991148ÊÊÊÊÊÊÊÊÊÊ \\ 
  \hline  
\end{tabular}

%\graphicspath{./figure/}

\begin{figure}
\center{\includegraphics[width=5in]{./figure/time_benchmark}}
\caption{Time usage of four different k-mer counting tools (y axis, in seconds) against data set size (in reads, x axis). @CTB WILLCHANGE}
\label{cmp_time}
\end{figure}

\begin{figure}
\center{\includegraphics[width=5in]{./figure/memory_benchmark}}
\caption{Memory usage of different k-mer counting tools (y axis, maximum resident program size, GB) plotted against the total number of distinct k-mers in the data set.  By increasing the error rate in khmer, memory usage can be substantially decreased.}
\label{cmp_memory}
\end{figure}

\begin{figure}
\center{\includegraphics[width=5in]{./figure/disk_benchmark}}
\caption{Disk storage usage of different k-mer counting tools in GB (y axis),
plotted against the number of distinct k-mers in the data set (x axis).  Note that khmer does not use the disk during counting or retrieval, although its hash tables can be saved for reuse.}
\label{cmp_disk}
\end{figure}

\begin{figure}
\center{\includegraphics[width=5in]{./figure/count_benchmark}}
\caption{Time for different k-mer counting tools to retrieve the counts of 9.7m randomly chosen k-mers (y axis), plotted against the number of unique k-mers in the data set being queried (x axis).  All k-mers are present in the data set being queried. @CTB WILLCHANGE}
\label{cmp_count}
\end{figure}

\begin{figure}
\center{\includegraphics[width=5in]{./figure/average_offset_vs_fpr}}
\caption{Relation between average miscount -- the amount by which
the average count for k-mers is incorrect -- on the y axis, plotted against
the false positive rate (x axis), for four data sets.  The four data
sets were chosen to have the same total number of k-mers: one metagenome
metagenome data set; a set of randomly generated k-mers; a set
of reads, chosen with 3x coverage and 1\% error, from a randomly generated
genome; and a simulated set of error-free reads (3x) chosen from a randomly
generated genome.}
\label{average_offset_vs_fpr}
\end{figure}

\begin{figure}
\center{\includegraphics[width=5in]{./figure/percent_offset_vs_fpr}}
\caption{Relation between percent miscount -- the amount by which
the count for k-mers is incorrect relative to its true count -- on the y axis, plotted against
the false positive rate (x axis), for four data sets.  The four data
sets are the same as in Figure \ref{average_offset_vs_fpr}.}
\label{percent_offset_vs_fpr}
% @CTB NOTE the negative numbers!!
\end{figure}

\begin{figure}
\center{\includegraphics[width=5in]{./figure/perc_unique_pos}}
\caption{Number of the unique k-mers (y axis) by starting position within read (x axis) in an untrimmed {\em E. coli} 100-bp Illumina shotgun data set, for k=17 and k=32.  The increasing numbers of unique k-mers are a sign of the increasing error rate of shotgun sequencing towards the 3' end of the read.  Note that there are only 69 starting positions for 32-mers in a 100 base read.}
\label{perc_unique_pos}
\end{figure}


\begin{figure}
\center{\includegraphics[width=5in]{./figure/num_remaining_reads}}
\caption{Number of remaining reads after iterating filtering out low-abundance 
reads that contain even a single unique k-mer with hash tables with different 
sizes(1e8 and 1e9) for a human gut microbiome metagenomic dataset(MH0001, 
42,458,402 reads).  @CTB WILLCHANGE}
\label{num_remaining_reads}
\end{figure}

\begin{figure}
\center{\includegraphics[width=5in]{./figure/num_bad_reads.pdf}}
\caption{Number of reads with low-abundance k-mers after iterating filtering out low-abundance 
reads that contain even a single unique k-mer with hash tables with different 
sizes(1e8 and 1e9) for a human gut microbiome metagenomic dataset(MH0001, 
42,458,402 reads).  What is FP rate for hash table size? @CTB WILLCHANGE}
\label{num_bad_reads}
\end{figure}

\end{document}