This code contains a single function, kmer.complexity
that calculates
entropy (
Where
kmer.complexity(seq, k, window.size, olap, comp.beg=FALSE)
The function takes 6 arguments:
- seq
A character vector. Entropy will be determined for only the first element of this vector. - k
A numeric vector specifying the size of the k-mers to be used for the analysis. Any number of values of k can be specified, but they should be within 1 and 16. - window.size
The size of windows across which entropy should be calculated. If set to 0, the window size will be set to$4^k$ for eachk
specified. If a vector of window sizes is specified this will be recycled for each value ofk
. - olap
Should the complexity be calculated for all overlapping k-mers or only for distinct ones? That is, should the kmers be moved by a step of1
ork
during the determination. - comp.beg
Compensate the entropy calculated for the incomplete windows. If true, the frequency$f_i$ will be determined for the number of k-mers counted until a complete window is obtained. This means that the initial entropy will always be the maximal possible, and this will then drop until a complete window has been obtained.
The function uses
where
kmer.complexity
returns a named list containing the following components:
k<k>
: A vector of entropy for each value ofk
specified. These will be named "k" with<k>
representing the size ofk
. That is, ifk
is 5, the entry will be namedk5
.kws
: a matrix with columnsk
andws
giving the kmer and window sizes for which entropy has been estimated.olap
: A logical value indicating whether entropy was estimated for overlapping kmers.
Here I have run the function scaffold 4 of an assembly of the genome of L. piscatorius. This scaffold appears to include telomere sequences at both ends and contains about 42 million base pairs.
## seq contains genome assembly scaffolds
i <- 4
system.time(
entropy.ol <- kmer.complexity(seq[i], c(5), 1024, olap=TRUE, comp.beg=TRUE)
)
## user system elapsed
## 4.075 0.185 4.260
i <- 4
system.time(
entropy.no <- kmer.complexity(seq[i], c(5), 1024, olap=FALSE, comp.beg=TRUE)
)
## user system elapsed
## 1.695 0.050 1.744
It takes about 4 and 1.7 seconds respectively for overlapping and non-overlapping kmers with a k of 5. There is no discernible differences in the estimates for overlapping and non-overlapping kmers at this resolution.
The function uses a 2 bit representation of kmers by modifying a single 32 bit unsigned integer using the following bitwise operations:
kmer = (kmer << 2) | ( (seq[i] >> 1) & 3 );
This means that any stretches containing ambiguity symbols (especially
N
) will give incorrect entropy estimates without warning. The most
likely problem is that all windows that contain N
s will have a
reduced entropy. I do not consider this to be a major issue since the
estimates themselves do not consider sequence content and this is
something that should be extracted as a secondary step.
The window size is specified by the number of k-mers counted; this means that the physical window size (the size of the region of the chromosome) is different for overlapping and non-overlapping k-mers.