Question: DNA string compressing #197

jianshu93 · 2022-09-03T02:22:41Z

Dear seqtk author,

I am writing to ask whether DNA strings over {A,G,C,T} (and possibly N) can be compressed when reading and storing sequences in memory, for example as kmers and minimizers. By default a character in C takes 1 byte, but a valid DNA sequence has only 4 possible characters rather than 256 (2^8), so each base can be stored in 2 bits, 1/4 the size of a regular character. Is this already implemented in kseq.h? I ask because many kmer counting and minimizer counting tools are based on kseq.h. With a huge number of kmers or minimizers, the difference in memory consumption between 2 bits and 1 byte per base could be huge.

Thanks,

Jianshu

lh3 · 2022-09-03T03:17:49Z

Read the source code of kmer counters (e.g. this) to see how this is handled.
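
For illustration only, here is a minimal sketch of the 2-bit packing the question describes. It is not taken from kseq.h or from any particular kmer counter; the function names (nt_to_2bit, pack_kmer, unpack_kmer) and the A=0, C=1, G=2, T=3 mapping are assumptions made for this example. Since 2 bits cannot represent a fifth symbol, the sketch simply rejects any kmer containing N or another ambiguous base.

```c
#include <stdint.h>

/* Hypothetical helper: map a base to a 2-bit code (A=0, C=1, G=2, T=3).
 * Any other character, including N, returns 4 to signal "not encodable". */
static int nt_to_2bit(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 4;
    }
}

/* Pack the first k bases of s (1 <= k <= 32) into a 64-bit integer,
 * first base in the most significant used bits.
 * Returns 0 on success, -1 if k is out of range or a base is ambiguous. */
static int pack_kmer(const char *s, int k, uint64_t *out)
{
    uint64_t x = 0;
    int i;
    if (k < 1 || k > 32) return -1;
    for (i = 0; i < k; ++i) {
        int c = nt_to_2bit(s[i]);
        if (c > 3) return -1;          /* N or other non-ACGT character */
        x = (x << 2) | (uint64_t)c;    /* append 2 bits per base */
    }
    *out = x;
    return 0;
}

/* Decode a packed kmer back into a NUL-terminated string.
 * buf must have room for k + 1 bytes. */
static void unpack_kmer(uint64_t x, int k, char *buf)
{
    int i;
    for (i = k - 1; i >= 0; --i) {
        buf[i] = "ACGT"[x & 3];        /* lowest 2 bits hold the last base */
        x >>= 2;
    }
    buf[k] = '\0';
}
```

With this layout a kmer of up to 32 bases fits in a single 8-byte integer instead of up to 32 one-byte characters, which is where the 4x saving mentioned in the question comes from.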

lh3 closed this as completed Sep 3, 2022
