[query] Add composable option to text export
#8854
Merged
CHANGELOG: Add `composable` option to parallel text export for use with `gsutil compose`

The BGZF spec recommends that one empty BGZF block be written at the end of
a BGZF file. The `gsutil compose` command concatenates a list
of objects into one composite object. We recently discovered that when
these empty blocks are present in the middle of a file, utilities like
tabix will output pointers to them (as from a reading perspective, the
empty blocks are equivalent to the next block). This will hit assertions
in code like htsjdk that checks to make sure that seek operations from
tabix virtual pointers point to the end of a block if and only if that
block is at the end of the file. This is a bug in tabix implementations.
Furthermore, the end-of-file marker probably shouldn't be appended to
BGZF streams in the first place.
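
To make the problem concrete, here is a minimal Python sketch that walks the BGZF blocks in a byte string and reports empty blocks that are not the last block. It is an illustration only, not code from this PR: the helper names (`make_bgzf_block`, `iter_bgzf_blocks`, `interior_empty_blocks`) are hypothetical, and the block layout and 28-byte end-of-file marker are taken from the BGZF section of the SAM specification.

```python
import struct
import zlib

# The 28-byte empty BGZF block that the BGZF specification uses as the
# end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"   # gzip header, XLEN = 6
    "42430200" "1b00"                     # 'BC' subfield, BSIZE = 27
    "0300"                                # empty deflate stream
    "00000000" "00000000"                 # CRC32 = 0, ISIZE = 0
)


def make_bgzf_block(payload):
    """Build one BGZF block holding payload (hypothetical helper for the demo)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate
    cdata = compressor.compress(payload) + compressor.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1  # total block size minus one
    header = struct.pack("<4BI2BH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    extra = struct.pack("<2BHH", 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer


def iter_bgzf_blocks(data):
    """Yield (offset, block_size, uncompressed_size) for each block in data."""
    offset = 0
    while offset < len(data):
        # Fixed 12-byte gzip header: ID1 ID2 CM FLG MTIME XFL OS XLEN.
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
            "<4BI2BH", data, offset
        )
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
            raise ValueError(f"not a BGZF block at offset {offset}")
        # Walk the extra subfields to find 'BC', which stores block size - 1.
        pos, end, block_size = offset + 12, offset + 12 + xlen, None
        while pos < end:
            si1, si2, slen = struct.unpack_from("<2BH", data, pos)
            if (si1, si2, slen) == (66, 67, 2):
                block_size = struct.unpack_from("<H", data, pos + 4)[0] + 1
            pos += 4 + slen
        if block_size is None:
            raise ValueError(f"missing BC subfield at offset {offset}")
        # ISIZE (uncompressed length) is the last four bytes of the block.
        isize = struct.unpack_from("<I", data, offset + block_size - 4)[0]
        yield offset, block_size, isize
        offset += block_size


def interior_empty_blocks(data):
    """Return offsets of empty blocks that are not the final block."""
    blocks = list(iter_bgzf_blocks(data))
    return [off for off, _size, isize in blocks[:-1] if isize == 0]


part_a = make_bgzf_block(b"chr1\t100\n")
part_b = make_bgzf_block(b"chr1\t200\n")

# Concatenating parts that each end with the marker leaves one mid-file.
naive = part_a + BGZF_EOF + part_b + BGZF_EOF
# Writing the marker only once, at the very end, avoids the interior block.
composed = part_a + part_b + BGZF_EOF

print(interior_empty_blocks(naive))
print(interior_empty_blocks(composed))
```

The first call reports the offset of the stray marker between the two parts; the second reports none, which is the layout the new option aims to produce after composition.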
To improve the interoperability of Hail with other tools, we add
the 'composable' output option to export types. 'composable' behaves
like 'separate_header', except that we do not write the end-of-file marker at
the end of the header or of any partition written. Instead, an extra, empty
bgz file containing only the end-of-file marker is written to
`part-composable-end`, which should sort later than any part file written from the RDD and thus
should be amenable to globbing.
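
As a rough local stand-in for the compose step, the sketch below concatenates every `part-*` file in a directory in sorted order. It is not part of this PR: the function name `concatenate_parts`, the `header_path` parameter, and the zero-padded `part-NNNNN` file names in the demo are assumptions for illustration. With names of that shape, `part-composable-end` sorts last because the letter `c` compares greater than any digit.

```python
import glob
import os
import shutil
import tempfile


def concatenate_parts(directory, output_path, header_path=None):
    """Concatenate part-* files in sorted order and return the order used.

    output_path should live outside the part-* glob so it is not picked up.
    header_path is optional and, if given, is written first.
    """
    part_paths = sorted(glob.glob(os.path.join(directory, "part-*")))
    inputs = ([header_path] if header_path else []) + part_paths
    with open(output_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return [os.path.basename(p) for p in part_paths]


# Demo with placeholder contents; real part files would hold BGZF blocks.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [
        ("part-00001", b"B"),
        ("part-composable-end", b"<EOF>"),
        ("part-00000", b"A"),
    ]:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(body)
    out_path = os.path.join(tmp, "combined")
    order = concatenate_parts(tmp, out_path)
    with open(out_path, "rb") as f:
        combined = f.read()

print(order)
print(combined)
```

In this demo the end-of-file placeholder lands after both data parts, so a glob such as `part-*` should be enough to pick the inputs in the right order, provided the part names follow the assumed pattern.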