An optimized and general alignment output format #14

rob-p (Collaborator) · 2023-07-12T14:53:51Z

Pursuant to the discussion that was started in #13 — this is a space that we can use to discuss ideas relevant to an optimized and standardized output format for mapping information. @tmaklin brings up some very good points and desiderata in this comment.

Several of these issues, I think, have clear solutions.

Total number of alignment targets can't be inferred with certainty from the format.

The output format should have a header. We do this with RAD files and it's an essential place to store important information (like the number of potential targets, and metadata about those targets like length). Given that the relevant data is (or can easily be) stored in the index, there is virtually no downside to having a header start every file. Further, the header can contain extra information about the records that will follow: for example, whether query names (i.e. the names of read records) are recorded or not. If they are not needed, we can save space; either way, the header tells the parser whether to expect this information.
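As a rough illustration of the header idea, here is a minimal Python sketch. The layout (the `ALNX` magic bytes, a one-byte query-name flag, length-prefixed target names) is invented for this example and is not the RAD layout; `write_header` and `read_header` are hypothetical names.

```python
import io
import struct

MAGIC = b"ALNX"  # hypothetical magic bytes, not taken from any real format


def write_header(out, targets, has_query_names):
    """Write a toy header: magic, query-name flag, target count, targets.

    `targets` is a list of (name, length) pairs.
    """
    out.write(MAGIC)
    out.write(struct.pack("<B", 1 if has_query_names else 0))
    out.write(struct.pack("<Q", len(targets)))
    for name, length in targets:
        encoded = name.encode("utf-8")
        out.write(struct.pack("<H", len(encoded)))  # name length prefix
        out.write(encoded)
        out.write(struct.pack("<Q", length))  # target sequence length


def read_header(inp):
    """Read back what write_header wrote."""
    if inp.read(4) != MAGIC:
        raise ValueError("not a toy alignment file")
    (flag,) = struct.unpack("<B", inp.read(1))
    (n_targets,) = struct.unpack("<Q", inp.read(8))
    targets = []
    for _ in range(n_targets):
        (name_len,) = struct.unpack("<H", inp.read(2))
        name = inp.read(name_len).decode("utf-8")
        (length,) = struct.unpack("<Q", inp.read(8))
        targets.append((name, length))
    return bool(flag), targets
```

With a flag like this, a parser knows up front whether each read record carries a name, and the number of targets is explicit instead of being inferred from the records.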

Not printing empty lines for no alignments.

There should be an empty record type — a read with no alignments. Historically, with respect to standard read alignments (SAM/BAM files), there has been some lack of consensus over whether unaligned reads should have records stored in the same file as the mapped reads, or in different files. My default preference is to have a single output file, though we could easily have a tool that splits the single file into mapped / unmapped sub-files.

Fragment names instead of the position of the read in the fastq files (makes sorting the file difficult and slows down matching the alignments with the reads if the file is not sorted).

If the specific records need to be identified, then I totally agree. Sometimes, it's useful to have the mapping information and the actual read identifier is simply unimportant. In that case, it's possible to save a lot of space by simply not writing down anything about the read (other than its mapping information). However, if the read does need to be identified, its original record name is the most useful identifier. I think whether reads are identified or not is a piece of information that could go in the header.

Total number of reads can't be inferred.

Do you mean prior to reading through the file? The way we handle this in the RAD format is to reserve 8 bytes in the header (filled with a dummy value during mapping). Then, when mapping is finished, we simply rewind the file pointer to that location and fill in the true number of observed reads. This is very easy to do, and works well. Of course, it does preclude a 100% streaming solution, as the header is not complete until all of the reads have been observed and mapping attempted. However, I've found that it retains the most important aspect of streaming, which is that it permits essentially constant memory overhead for writing to the file. If one truly wanted an entirely streaming solution, the file could have both a header and a footer, where the footer contains information that we don't know about until processing all of the data.
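The reserve-then-rewind approach can be sketched in a few lines of Python. This is only an illustration under assumed details: the placeholder value, the little-endian 8-byte count, and the length-prefixed record encoding are all choices made for the example, not the RAD encoding.

```python
import io
import struct

PLACEHOLDER = 0xFFFFFFFFFFFFFFFF  # dummy value written before the count is known


def write_with_backpatched_count(out, records):
    """Stream records to a seekable file, then fill in the read count."""
    count_offset = out.tell()
    out.write(struct.pack("<Q", PLACEHOLDER))  # reserve 8 bytes in the header
    n_reads = 0
    for record in records:  # `records` may be a generator: constant memory
        out.write(struct.pack("<I", len(record)))
        out.write(record)
        n_reads += 1
    end = out.tell()
    out.seek(count_offset)  # rewind to the reserved slot
    out.write(struct.pack("<Q", n_reads))
    out.seek(end)  # leave the file position at the end of the data
    return n_reads
```

The output must be seekable, which is exactly why this rules out writing to a pipe; a footer would lift that restriction at the cost of the count not being available at the start of the file.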

Multiple files to store the results (for example unique alignments + their counts).

I'd be interested in discussing the use-cases for this. In general, I prefer one output file, with appropriate tags for the status of mappings, but am open to other ideas if they have an important use case.

I tend to prefer formats that support streaming the results rather than having to wait for the whole alignment to finish, or conversely read in the whole file before processing the results.

Depending on the application, one strategy we've found to work well (with the RAD file, e.g.) is to have the file be "chunked". The way the file is organized is that there is a header with relevant metadata, both about the alignment targets and info about e.g. the number of reads, followed by a series of "file-level" tags. These are tags that tell the parser about how to interpret the file and what to expect. Subsequently, the file consists of a series of chunks. Each chunk starts with a chunk header, which contains the number of reads in the subsequent chunk and the number of bytes occupied by the subsequent chunk. This is a very important strategy to allow low-contention, highly-multithreaded parsing, since the critical section of the read loop only copies over the next chunk and then threads can independently parse these chunks. Each chunk consists of a series of read records, each of which, themselves, consist of a series of alignment records. There can optionally also be read-level tags and alignment-level tags.
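To make the chunking idea concrete, here is a small Python sketch of a chunked writer and a two-stage reader. The field widths and the record encoding are assumptions for illustration and do not follow the RAD specification; all function names are hypothetical.

```python
import io
import struct
from concurrent.futures import ThreadPoolExecutor

CHUNK_HEADER = struct.Struct("<II")  # (n_reads, n_bytes); widths are assumed


def write_chunk(out, reads):
    """Write one chunk: header, then length-prefixed read payloads."""
    body = b"".join(struct.pack("<H", len(r)) + r for r in reads)
    out.write(CHUNK_HEADER.pack(len(reads), len(body)))
    out.write(body)


def iter_chunks(inp):
    """Sequential part: read each chunk header and copy out the raw body."""
    while True:
        header = inp.read(CHUNK_HEADER.size)
        if not header:
            return
        n_reads, n_bytes = CHUNK_HEADER.unpack(header)
        yield n_reads, inp.read(n_bytes)


def parse_chunk(chunk):
    """Independent part: decode the reads inside one chunk."""
    n_reads, body = chunk
    reads, pos = [], 0
    for _ in range(n_reads):
        (size,) = struct.unpack_from("<H", body, pos)
        pos += 2
        reads.append(body[pos:pos + size])
        pos += size
    return reads


def parse_file(inp, workers=4):
    """Hand raw chunks to a pool; results come back in file order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_chunk, iter_chunks(inp)))
```

Only `iter_chunks` touches the file, so it plays the role of the critical section, while `parse_chunk` needs no shared state. In CPython the global interpreter lock limits the actual speedup from threads, so this shows the structure only, not the performance.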

One thing that I think we do not yet have a sufficiently general solution for is how to strike the best balance between optimally compacting the necessary information and making the format general enough to easily store extra information. I think this is a general engineering challenge, but if we can properly scope what we hope to cover, I think we can devise a sufficient strategy.

tmaklin · 2023-07-12T16:02:39Z

Hey, sounds like you have made a lot of the same design choices in RAD that I use in the alignment-writer format. My format is currently structured like this:

1. Header with the total number of reads and total number of alignment targets, both supplied as arguments to the program when run and not determined from the input.
2. Repeating sections of:
   a. Size of a char array required to store the next chunk.
   b. Binary chunk representing a subset of the contiguously stored num_reads x num_targets pseudoalignment matrix. The number of reads and the number of target sequences from the header can be used to determine which read and which target the alignments in the chunk are for, meaning that these chunks or the input data do not need to be in any particular order.
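As an illustration of how a position in the flattened matrix maps back to a read and a target, here is a short Python sketch. It assumes row-major storage with one row per read and 0-based indexes; the actual layout in alignment-writer may differ, and the function names are hypothetical.

```python
def position_to_read_target(position, num_targets):
    """Map a flat bit position to (read_index, target_index), row-major."""
    return divmod(position, num_targets)


def read_target_to_position(read_index, target_index, num_targets):
    """Inverse mapping: flat position of one (read, target) cell."""
    return read_index * num_targets + target_index


def decode_positions(set_positions, num_reads, num_targets):
    """Group set bit positions into per-read lists of target indexes."""
    hits = [[] for _ in range(num_reads)]  # reads with no alignments stay empty
    for position in set_positions:
        read_index, target_index = position_to_read_target(position, num_targets)
        hits[read_index].append(target_index)
    return hits
```

Because each position encodes its own read and target, the positions can arrive in any order, and a read without alignments still gets an (empty) entry.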

This achieves many of the goals you described in RAD (multithreaded parsing, memory allocation in advance and sorting the alignments so they're in the same order as the reads in particular) but there is no extra information stored apart from what's in the header. The way I've stored the file additionally allows set operations on alignment files from paired-end reads to be streamed which helps immensely with large alignments.

Themisto and others (I imagine) have the ability to produce some other read-specific information that might be useful to store, though. I'll ping Jarno about this discussion so he can give his input.

Some clarifications regarding my comments

Not printing empty lines for no alignments.

I agree a single file that also contains records for empty alignments is the way to go. Storing the empty records causes only a tiny increase in file size when any reasonable compression is applied so there's no reason to discard them.

Fragment names instead of the position of the read in the fastq files (makes sorting the file difficult and slows down matching the alignments with the reads if the file is not sorted).

I only store the index of the read in the fastq file (1st read, 2nd read, and so on) via writing it at the right position in the contiguously stored pseudoalignment. Knowing the index of a single alignment, or the indexes of a subset of the alignments, enables retrieving the original name(s) from the fastq file with a simple awk command so the original names do not seem that useful to me.
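For illustration, the same lookup can be written in Python instead of awk. This sketch assumes strict four-line FASTQ records (no wrapped sequence lines) and 0-based read indexes; `read_names_by_index` is a hypothetical helper.

```python
import io


def read_names_by_index(fastq, wanted):
    """Return {read_index: header} for the 0-based indexes in `wanted`.

    `fastq` is any iterable of text lines, e.g. an open file.
    """
    wanted = set(wanted)
    names = {}
    for line_number, line in enumerate(fastq):
        index, offset = divmod(line_number, 4)
        if offset == 0 and index in wanted:
            names[index] = line[1:].rstrip()  # drop the leading '@'
            if len(names) == len(wanted):
                break  # stop early once every requested read is found
    return names
```

The lookup is a single pass over the FASTQ file, so it still requires access to the original reads, which is the trade-off against storing names in the alignment file.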

Total number of reads can't be inferred.

I was referring to formats that skip writing empty alignments or otherwise combine read information in a way that requires access to the original reads to determine how many of them were actually processed. What you do in RAD sounds sensible. A 100% streaming solution is probably not necessary, as the total number of reads is mostly useful only when intending to read the whole pseudoalignment into memory. However, I think this is still important to store in the header.

Multiple files to store the results (for example unique alignments + their counts).

Sorry, this was supposed to be an example of a bad practice. I think kallisto did this originally, and it was painful to use, although it is an efficient way to store the alignments in memory if/when many of them are repeated and knowing which read an alignment is for does not matter. I agree that one output file is a better solution.

Pseudoalignments fortunately contain relatively little information compared to sam files so they're much easier to compact 😄

tmaklin · 2024-08-19T17:08:33Z

Hey, so I went through the RAD specification and it's really neat. Hopefully we can agree on that and get it implemented in both our tools.

Re my work on a compressor and file format I mentioned in the other issue, it would be almost trivial for me to support it in my compressor as it's essentially in the RAD format already anyway; I currently convert the plaintext alignments to and from Themisto/Fulgor/Metagraph/Bifrost to the following format:

File header: JSON section compressed with lzma, must contain names of the target sequences (refNames and refCount in RAD) and total number of queries (don't think RAD has this?), may contain other info about the alignment such as which tool produced it.
Repeating sections of
- chunk header: 2x JSON sections compressed with lzma, first section must contain size of the chunk in bytes and size of the second JSON section in bytes. Second JSON section may contain info about the reads, such as the names. There are two JSONs because I wanted to support skipping over the read info if it's not needed, and some pseudoaligner formats require more extensive read info than others.
- Pseudoalignment chunk: contiguously stored n_queries x n_targets bit vector compressed with BitMagic.

... so nearly RAD except with compressed fields, although it also looks like I store the read-level tags in the second header section whereas in RAD they are a part of the chunk itself. I also don't support multiple alignments for a read in the functions that decompress the data but the file format could handle them.
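A minimal Python sketch of lzma-compressed JSON sections, using only the standard library, might look like the following. The 4-byte length prefix and the function names are assumptions for the example; the actual alignment-writer encoding may differ, and the BitMagic-compressed chunk itself is not shown.

```python
import io
import json
import lzma
import struct


def write_json_section(out, obj):
    """Write one lzma-compressed JSON section with a 4-byte length prefix."""
    payload = lzma.compress(json.dumps(obj).encode("utf-8"))
    out.write(struct.pack("<I", len(payload)))  # assumed length prefix
    out.write(payload)


def read_json_section(inp):
    """Read and decompress one section written by write_json_section."""
    (size,) = struct.unpack("<I", inp.read(4))
    return json.loads(lzma.decompress(inp.read(size)))


def skip_json_section(inp):
    """Skip a section without decompressing it, e.g. unneeded read info."""
    (size,) = struct.unpack("<I", inp.read(4))
    inp.seek(size, io.SEEK_CUR)
```

Knowing each section's size before reading it is what makes it cheap to skip the read-info section when only the alignments are needed.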

This is currently WIP but available here if you are interested: https://github.com/tmaklin/alignment-writer/tree/dev The readme should also be up to date except for the "File format" section. I plan to fork this to another repository since it has diverged quite a bit from the original purpose, and also to write a Rust version that can at least read the compressed data.

The compressed format is essential for my application in bacterial metagenomics since the output files from the pseudoalignments can become several TBs in plaintext size but they compress extremely nicely (~10-100x smaller than the plaintext size and also better ratios than just gzipping), so it would be nice to have interoperability here, too!
