Pull request Compare This branch is 6249 commits behind dib-lab:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.3.7: http://docutils.sourceforge.net/" />
<link rel="stylesheet" href="default.css" type="text/css" />
<div class="document">
<div class="section" id="khmer-a-simple-k-mer-counting-library">
<h1><a name="khmer-a-simple-k-mer-counting-library">khmer, a simple k-mer counting library</a></h1>
<p>khmer is a simple C++ library for counting k-mers in DNA sequences.
It has a complete Python wrapping and should be pretty darned fast;
it's intended for genome-scale k-mer counting.</p>
<p>The current version is <strong>0.2</strong>.  I haven't used it for much myself,
but the test code functions &amp; it should work as advertised.</p>
<p>khmer operates by building a 'ktable', a table of 4**k counters.
It then maps each k-mer into this table with a simple
(and reversible) hash function.</p>
<p>Right now, only the Python interface is documented here.  The C++
interface is essentially identical; if you need to use it and want
it documented, drop me a line.</p>
<div class="section" id="counting-speed-and-memory-usage">
<h1><a name="counting-speed-and-memory-usage">Counting Speed and Memory Usage</a></h1>
<p>On the 5 mb <em>Shewanella oneidensis</em> genome, khmer takes less than a second
to count all k-mers, for any k between 6 and 12.  At 13 it craps out
because the table goes over my default stack size limit.</p>
<p>Approximate memory usage can be calculated by finding the size of a
<tt class="docutils literal"><span class="pre">long</span> <span class="pre">long</span></tt> on your machine and then multiplying that by 4**k.
For a 12bp wordsize, this works out to 16384*1024; on an Intel-based
processor running Linux, <tt class="docutils literal"><span class="pre">long</span> <span class="pre">long</span></tt> is 8 bytes, so memory usage
is approximately 128 mb.</p>
<div class="section" id="python-interface">
<h1><a name="python-interface">Python interface</a></h1>
<p>Essentially everything requires a <tt class="docutils literal"><span class="pre">ktable</span></tt>.</p>
<pre class="literal-block">
import khmer
ktable = khmer.new_ktable(L)
<p>These commands will create a new <tt class="docutils literal"><span class="pre">ktable</span></tt> of size 4**L, suitable
for counting L-mers.</p>
<p>Each <tt class="docutils literal"><span class="pre">ktable</span></tt> object has a few accessor functions:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">ktable.ksize()</span></tt> will return L.</li>
<li><tt class="docutils literal"><span class="pre">ktable.max_hash()</span></tt> will return the max hash value in the table, 4**L - 1.</li>
<li><tt class="docutils literal"><span class="pre">ktable.n_entries()</span></tt> will return the number of table entries, 4**L.</li>
<p>The forward and reverse hashing functions are directly accessible:</p>
<li><dl class="first docutils">
<dt><tt class="docutils literal"><span class="pre">hashval</span> <span class="pre">=</span> <span class="pre">ktable.forward_hash(kmer)</span></tt> will return the hash value</dt>
<dd><p class="first last">of the given kmer.</p>
<li><dl class="first docutils">
<dt><tt class="docutils literal"><span class="pre">kmer</span> <span class="pre">=</span> <span class="pre">ktable.reverse_hash(hashval)</span></tt> will return the kmer that hashes</dt>
<dd><p class="first last">to the given hashval.</p>
<p>There are also some counting functions:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">ktable.count(kmer)</span></tt> will increment the count associated with the given kmer
by one.</li>
<li><tt class="docutils literal"><span class="pre">ktable.consume(sequence)</span></tt> will run through the sequence and count
each kmer present.</li>
<li><tt class="docutils literal"><span class="pre">n</span> <span class="pre">=</span> <span class="pre">ktable.get(kmer|hashval)</span></tt> will return the count associated with the
given kmer string or the given hashval, whichever is passed in.</li>
<li><tt class="docutils literal"><span class="pre">ktable.set(kmer|hashval,</span> <span class="pre">count)</span></tt> set the count for the given kmer
string or hashval.</li>
<p>In all of the cases above, 'kmer' is an L-length string, 'hashval' is
a non-negative integer, and 'sequence' is a DNA sequence containg ONLY
<p><strong>Note:</strong> 'N' is not a legal DNA character as far as khmer is concerned!</p>
<p>And, finally, there are some set operations:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">ktable.clear()</span></tt> empties the ktable.</li>
<li><tt class="docutils literal"><span class="pre">ktable.update(other)</span></tt> adds all of the entries in <tt class="docutils literal"><span class="pre">other</span></tt> into
<tt class="docutils literal"><span class="pre">ktable</span></tt>.  The wordsize must be the same for both ktables.</li>
<li><tt class="docutils literal"><span class="pre">intersection</span> <span class="pre">=</span> <span class="pre">ktable.intersect(other)</span></tt> returns a ktable where
only nonzero entries in both ktables are kept.  The count for ach
entry is the sum of the counts in <tt class="docutils literal"><span class="pre">ktable</span></tt> and <tt class="docutils literal"><span class="pre">other</span></tt>.</li>
<div class="section" id="an-example">
<h1><a name="an-example">An Example</a></h1>
<p>This short code example will count all 6-mers present in the given
DNA sequence, and then print them all out along with their prevalence.</p>
<pre class="literal-block">
# make a new ktable, L=6
ktable = khmer.new_ktable(6)

# count all k-mers in the given string

# run through all entries. if they have nonzero presence, print.
for i in range(0, ktable.n_entries()):
   n = ktable.get(i)
   if n:
      print ktable.reverse_hash(i), &quot;is present&quot;, n, &quot;times.&quot;
<p>And that's all, folks... Let me know if there's other functionality that
you think is important.</p>
<pre class="literal-block">
CTB, 3/2005