Hey, sounds like you have made a lot of the same design choices in RAD that I use in the alignment-writer format. My format is currently structured like this:
This achieves many of the goals you described in RAD (multithreaded parsing, allocating memory in advance and, in particular, sorting the alignments so they are in the same order as the reads), but no extra information is stored apart from what's in the header. The way I've stored the file also allows set operations on alignment files from paired-end reads to be streamed, which helps immensely with large alignments. Themisto and others (I imagine) can produce other read-specific information that might be useful to store, though. I'll ping Jarno about this discussion so he can give his input.
Some clarifications regarding my comments:
Pseudoalignments fortunately contain relatively little information compared to SAM files, so they're much easier to compact 😄
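To illustrate why keeping the alignments in read order makes set operations streamable, here is a minimal Python sketch. It is not the alignment-writer implementation: the record layout (a read index paired with a set of target indices) and the name `stream_intersect` are assumptions made purely for illustration. Because both mate files are sorted by read index, the intersection needs only one record from each file in memory at a time.

```python
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical record layout: (read index, indices of the reference targets hit).
Record = Tuple[int, Set[int]]


def stream_intersect(mate1: Iterable[Record], mate2: Iterable[Record]) -> Iterator[Record]:
    """Yield per-read intersections of two pseudoalignment streams.

    Assumes both inputs are sorted by read index. Reads present in only
    one stream are skipped, since their intersection is empty.
    """
    it1, it2 = iter(mate1), iter(mate2)
    r1, r2 = next(it1, None), next(it2, None)
    while r1 is not None and r2 is not None:
        if r1[0] == r2[0]:
            # Same read in both files: keep the targets hit by both mates.
            yield r1[0], r1[1] & r2[1]
            r1, r2 = next(it1, None), next(it2, None)
        elif r1[0] < r2[0]:
            # mate1 is behind: advance it until it catches up.
            r1 = next(it1, None)
        else:
            # mate2 is behind: advance it until it catches up.
            r2 = next(it2, None)


# Toy data standing in for two decoded mate files.
mate1 = [(0, {1, 2}), (1, {3}), (3, {0, 4})]
mate2 = [(0, {2, 5}), (2, {1}), (3, {4})]
result = list(stream_intersect(mate1, mate2))
print(result)
```

With unsorted input the same operation would need one of the files indexed in memory, which is exactly what becomes impractical for very large alignments.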
-
Hey, so I went through the RAD specification and it's really neat; hopefully we can agree on it and get it implemented in both our tools. Regarding the compressor and file format I mentioned in the other issue: supporting RAD in my compressor would be almost trivial, as my format is essentially RAD already. I currently convert the plaintext alignments from Themisto/Fulgor/Metagraph/Bifrost to and from the following format:
... so nearly RAD except with compressed fields, although it also looks like I store the read-level tags in the second header section, whereas in RAD they are part of the chunk itself. I also don't support multiple alignments for a read in the functions that decompress the data, but the file format could handle them. This is currently WIP but available here if you are interested: https://github.com/tmaklin/alignment-writer/tree/dev The readme should also be up to date, except for the "File format" section. I plan to fork this to another repository, since it has diverged quite a bit from the original purpose, and also to write a Rust version that can at least read the compressed data. The compressed format is essential for my application in bacterial metagenomics: the output files from the pseudoalignments can reach several TB in plaintext, but they compress extremely well (~10-100x smaller than the plaintext, and with better ratios than just gzipping), so it would be nice to have interoperability here, too!
-
Pursuant to the discussion that was started in #13, this is a space that we can use to discuss ideas relevant to an optimized and standardized output format for mapping information. @tmaklin brings up some very good points and desiderata in this comment.
Several of these issues, I think, have clear solutions.