@chrisvittal
Collaborator

CHANGELOG: Add `composable` option to parallel text export for use with `gsutil compose`

The BGZF spec recommends that one empty BGZF block be written at the
end of a BGZF file as an end-of-file marker. The `gsutil compose`
command concatenates a list of objects into one composite object. We
recently discovered that when these empty blocks are present in the
middle of a file, utilities like tabix will output pointers to them
(since, from a reading perspective, an empty block is equivalent to the
start of the next block). This trips assertions in code like htsjdk,
which checks that seek operations from tabix virtual pointers point to
the end of a block if and only if that block is at the end of the file.
This is a bug in tabix implementations. Furthermore, the end-of-file
marker probably shouldn't be appended to BGZF streams in the first
place.
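
To make the problem concrete, here is a minimal standard-library sketch (not Hail code). It builds two tiny BGZF files that each end with the end-of-file marker, concatenates them the way a naive compose would, and locates the empty block left in the middle. The helper names `bgzf_block` and `block_offsets` are hypothetical, and the block walker assumes the `BC` subfield is the only gzip extra field.

```python
import gzip
import struct
import zlib

# The 28-byte empty BGZF block that the spec uses as the end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 "
    "1b 00 03 00 00 00 00 00 00 00 00 00"
)


def bgzf_block(payload: bytes) -> bytes:
    """Wrap a small payload (well under 64 KiB) in a single BGZF block."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    deflated = deflater.compress(payload) + deflater.flush()
    block_size = 12 + 6 + len(deflated) + 8  # header + extra + data + trailer
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # 'BC' subfield: 2-byte value holding the total block size minus one.
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, block_size - 1)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + deflated + trailer


def block_offsets(data: bytes):
    """Yield (offset, uncompressed_size) for each BGZF block in data."""
    offset = 0
    while offset < len(data):
        (bsize,) = struct.unpack_from("<H", data, offset + 16)
        block_len = bsize + 1
        (isize,) = struct.unpack_from("<I", data, offset + block_len - 4)
        yield offset, isize
        offset += block_len


# Two complete BGZF files, each terminated by the end-of-file marker.
file_a = bgzf_block(b"chr1\t100\n") + BGZF_EOF
file_b = bgzf_block(b"chr1\t200\n") + BGZF_EOF
naive = file_a + file_b  # byte-wise concatenation, as a compose would do

text = gzip.decompress(naive)  # the data itself still reads back fine
blocks = list(block_offsets(naive))
# Empty blocks that are not the final block of the composed file.
interior_empty = [off for off, isize in blocks[:-1] if isize == 0]
```

The concatenated bytes still decompress to the expected text, but `interior_empty` is non-empty: an empty block sits between the two data blocks, which is the situation described above.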

In order to improve interoperability of Hail with other tools, we add
the `composable` output option to export types. `composable` behaves
like `separate_header`, except that we do not write the end-of-file
marker at the end of the header or of any partition written, and an
extra, empty bgz file containing only the end-of-file marker is written
to `part-composable-end`, which should sort later than any part file
written from the RDD and thus should be amenable to globbing.
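
The resulting layout can be sketched locally as follows. This is an illustration under stated assumptions, not Hail's implementation: the file names `header` and `part-00000`-style names are hypothetical (only `part-composable-end` comes from this PR), plain gzip members stand in for BGZF blocks, and local concatenation stands in for `gsutil compose`.

```python
import gzip
import tempfile
from pathlib import Path

# The 28-byte empty BGZF block that the spec uses as the end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 "
    "1b 00 03 00 00 00 00 00 00 00 00 00"
)


def write_composable(out_dir: Path, header: str, partitions) -> None:
    """Write a header and part files WITHOUT the EOF marker, plus one end file."""
    # Hypothetical file names; gzip members stand in for BGZF blocks.
    (out_dir / "header").write_bytes(gzip.compress(header.encode()))
    for i, text in enumerate(partitions):
        (out_dir / f"part-{i:05d}").write_bytes(gzip.compress(text.encode()))
    # The only file that carries the end-of-file marker.
    (out_dir / "part-composable-end").write_bytes(BGZF_EOF)


def compose(out_dir: Path) -> bytes:
    """Concatenate header + globbed parts in sorted order."""
    parts = sorted(out_dir.glob("part-*"))
    return b"".join(p.read_bytes() for p in [out_dir / "header", *parts])


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    write_composable(out_dir, "#chrom\tpos\n", ["chr1\t100\n", "chr1\t200\n"])
    part_names = [p.name for p in sorted(out_dir.glob("part-*"))]
    composed = compose(out_dir)
```

Because digits sort before the letter `c`, a glob over `part-*` places `part-composable-end` last, so the composed bytes contain the end-of-file marker exactly once, at the very end.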

@tpoterba
Contributor

oh, sweet! didn't know you were working on this.

@chrisvittal chrisvittal marked this pull request as ready for review May 22, 2020 19:11
Contributor

@tpoterba tpoterba left a comment

nice job!

@danking danking merged commit 4175ed6 into hail-is:master May 30, 2020