Dear seqtk author,
I am writing to ask whether DNA strings over {A,G,C,T} (and possibly N?) can be stored in compressed form in memory after reading, for example when holding k-mers and minimizers. By default a character in C takes 1 byte (8 bits, 256 possible values), but a valid DNA base has only 4 possibilities, so each base could be packed into 2 bits, 1/4 of a regular character. Is this already implemented in kseq.h? I ask because many k-mer counting and minimizer counting tools are built on kseq.h. With a huge number of k-mers or minimizers, the difference in memory consumption between 2 bits and 1 byte per base could be large.
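To make the idea concrete, here is a minimal sketch of the kind of 2-bit packing I have in mind. The names `base_to_code`, `pack_kmer` and `unpack_kmer` are hypothetical and are not taken from kseq.h; the mapping A=0, C=1, G=2, T=3 is just one possible assumption. Since 2 bits give only 4 codes, N cannot be represented, so this sketch simply rejects any k-mer that contains it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a base to a 2-bit code (A=0, C=1, G=2, T=3).
 * Anything else, including N, returns 4 to signal "not encodable". */
static int base_to_code(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 4;
    }
}

/* Pack the first k bases of s (1 <= k <= 32) into one 64-bit integer,
 * first base in the most significant used bits.
 * Returns 0 on success, -1 if k is out of range or a base is not A/C/G/T. */
static int pack_kmer(const char *s, int k, uint64_t *out)
{
    uint64_t x = 0;
    if (k < 1 || k > 32) return -1;
    for (int i = 0; i < k; ++i) {
        int c = base_to_code(s[i]);
        if (c > 3) return -1;          /* e.g. N: no room in 2 bits */
        x = (x << 2) | (uint64_t)c;
    }
    *out = x;
    return 0;
}

/* Inverse of pack_kmer: write k bases plus a terminating NUL into buf,
 * which must hold at least k + 1 bytes. */
static void unpack_kmer(uint64_t x, int k, char *buf)
{
    assert(k >= 1 && k <= 32);
    for (int i = k - 1; i >= 0; --i) {
        buf[i] = "ACGT"[x & 3];        /* lowest 2 bits hold the last base */
        x >>= 2;
    }
    buf[k] = '\0';
}
```

With this layout a k-mer of up to 32 bases fits in 8 bytes instead of up to 32, which is the 4x saving I mentioned above. Handling N would need something extra, such as skipping k-mers that contain it.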
Thanks,
Jianshu