Question: DNA string compressing #197

Closed
jianshu93 opened this issue Sep 3, 2022 · 1 comment

Comments

@jianshu93

Dear seqtk author,

I am writing to ask whether DNA strings over {A,G,C,T} (and possibly N) can be compressed when they are read and stored in memory, for example as k-mers or minimizers. By default a character in C takes 1 byte, but a valid DNA base has only 4 possibilities instead of 256 (2^8), so each base could be stored in 2 bits, 1/4 the size of a regular character. Is this already implemented in kseq.h? I ask because I have seen that many k-mer counting and minimizer counting tools are based on kseq.h. With a huge number of k-mers or minimizers, the difference in memory consumption between 2 bits and 1 byte per base could be large.

Thanks,

Jianshu
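
For readers of this thread, here is a minimal sketch of the 2-bit packing described in the question. It assumes the input is a plain `char` string with its length already known. The function names `base_to_2bit`, `pack_2bit` and `unpack_base` are hypothetical and are not part of kseq.h or seqtk. Note that N does not fit in 2 bits, so this sketch simply rejects any sequence that contains a non-ACGT base.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: A=0, C=1, G=2, T=3; any other character (e.g. N) maps to 4. */
static inline uint8_t base_to_2bit(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 4;
    }
}

/* Pack len bases into (len + 3) / 4 bytes, four bases per byte,
 * with base i stored in bits 2*(i%4) .. 2*(i%4)+1 of byte i/4.
 * Returns NULL on allocation failure or on a non-ACGT base.
 * The caller owns the returned buffer and must free() it. */
static uint8_t *pack_2bit(const char *seq, size_t len)
{
    size_t nbytes = (len + 3) / 4;
    uint8_t *buf = calloc(nbytes ? nbytes : 1, 1);
    if (buf == NULL) return NULL;
    for (size_t i = 0; i < len; ++i) {
        uint8_t code = base_to_2bit(seq[i]);
        if (code > 3) { free(buf); return NULL; }
        buf[i >> 2] |= (uint8_t)(code << ((i & 3) << 1));
    }
    return buf;
}

/* Recover base i (as an uppercase letter) from a packed buffer. */
static inline char unpack_base(const uint8_t *buf, size_t i)
{
    return "ACGT"[(buf[i >> 2] >> ((i & 3) << 1)) & 3];
}
```

With this layout a sequence of length n takes about n/4 bytes instead of n. The sequence length has to be stored separately, since the last byte may be only partly used.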

@lh3
Owner

lh3 commented Sep 3, 2022

Read the source code of kmer counters (e.g. this) to see how this is handled.
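
The link in the reply above did not survive in this copy of the thread, so the following is only a sketch of one common approach and not necessarily what the linked code does. It keeps a rolling 2-bit encoding of the last k bases in a 64-bit integer (so k is at most 32) and restarts whenever it meets a base other than A, C, G or T. The function name `encode_kmers` and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: write the 2-bit encoding of every k-mer of seq that
 * contains only A/C/G/T into out[] (at most cap entries are written).
 * Returns the total number of valid k-mers found, which may exceed cap.
 * Encoding: A=0, C=1, G=2, T=3, first base in the most significant bits. */
static size_t encode_kmers(const char *seq, size_t len, int k,
                           uint64_t *out, size_t cap)
{
    if (k < 1 || k > 32) return 0;
    const uint64_t mask = (k == 32) ? UINT64_MAX : ((1ULL << (2 * k)) - 1);
    uint64_t x = 0;  /* rolling k-mer */
    int run = 0;     /* valid bases seen since the last reset, capped at k */
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        int c;
        switch (seq[i]) {
        case 'A': case 'a': c = 0; break;
        case 'C': case 'c': c = 1; break;
        case 'G': case 'g': c = 2; break;
        case 'T': case 't': c = 3; break;
        default:            c = 4; break;  /* N or any other symbol */
        }
        if (c < 4) {
            x = ((x << 2) | (uint64_t)c) & mask;  /* shift in the new base */
            if (run < k) ++run;
            if (run == k) {
                if (n < cap) out[n] = x;
                ++n;
            }
        } else {
            x = 0;   /* ambiguous base: no k-mer may span it */
            run = 0;
        }
    }
    return n;
}
```

Each k-mer then occupies 8 bytes regardless of k, compared with k bytes as a character string. Strand handling (for example choosing the smaller of a k-mer and its reverse complement) is left out of this sketch.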

@lh3 lh3 closed this as completed Sep 3, 2022