[query] Add composable option to text export #8854

Merged
merged 2 commits into from
May 30, 2020

Conversation

chrisvittal
Collaborator

CHANGELOG: Add composable option to parallel text export for use with gsutil compose

The BGZF spec recommends that one empty BGZF block be written at the
end of a BGZF file. The gsutil compose command concatenates a list
of objects into one composite object. We recently discovered that when
these empty blocks are present in the middle of a file, utilities like
tabix will output pointers to them (as from a reading perspective, the
empty blocks are equivalent to the next block). This will hit assertions
in code like htsjdk, which checks that seek operations from tabix
virtual pointers point to the end of a block if and only if that block
is the end of the file. This is a bug in tabix implementations.
Furthermore, the end-of-file marker probably shouldn't be appended to
BGZF streams in the first place.
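
To make the problem concrete, here is a minimal Python sketch (not Hail's implementation) that builds BGZF blocks by hand and compares two ways of concatenating shards: with an end-of-file marker after every shard, and with a single marker at the very end. The helper names `bgzf_block` and `concat_shards` and the sample shard contents are hypothetical; the 28-byte marker is the empty block defined in the BGZF section of the SAM specification.

```python
import gzip
import struct
import zlib

# The 28-byte empty BGZF block used as the end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00"
    " 00 00 00 00 00 00 00 00"
)


def bgzf_block(data: bytes) -> bytes:
    """Wrap `data` (small enough to fit one block) in a single BGZF block."""
    comp = zlib.compressobj(wbits=-15)  # raw deflate, no zlib/gzip framing
    cdata = comp.compress(data) + comp.flush()
    # 12-byte gzip header + 6-byte 'BC' extra field + payload + 8-byte trailer
    bsize = 26 + len(cdata) - 1  # BSIZE stores total block size minus one
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + cdata + trailer


def concat_shards(shards, eof_per_shard: bool) -> bytes:
    """Byte-concatenate shards, the way an object-compose step would."""
    if eof_per_shard:
        # Every shard ends with its own marker: markers land mid-file.
        return b"".join(bgzf_block(s) + BGZF_EOF for s in shards)
    # Shards carry no marker; one is appended after the last shard.
    return b"".join(bgzf_block(s) for s in shards) + BGZF_EOF


# Hypothetical shard contents: a header followed by two partitions.
shards = [b"chrom\tpos\n", b"1\t100\n", b"1\t200\n"]
with_inner_markers = concat_shards(shards, eof_per_shard=True)
single_marker = concat_shards(shards, eof_per_shard=False)
```

Both byte strings decompress to the same text with `gzip.decompress`, which is why the stray markers go unnoticed by plain readers; only the second layout keeps the empty block out of the middle of the file, where an indexer could hand out a virtual pointer to it.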

In order to improve interoperability of hail with other tools, we add
the 'composable' output option to export types. 'composable' behaves
like 'separate_header', except we do not write the end-of-file marker at
the end of the header or every partition written, and an extra, empty
bgz file with the end-of-file marker is written to part-composable-end
which should sort later than any partfile written from the RDD and thus
should be amenable to globbing.
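
As a quick illustration of the sort-order claim, the following sketch sorts a hypothetical directory listing the way a shell glob would expand it. The names `header`, `part-00000` and `part-00001` are assumptions about how the part files might be named; only `part-composable-end` comes from the description above.

```python
# Hypothetical listing of an output directory written in 'composable' mode.
names = ["part-composable-end", "part-00001", "header", "part-00000"]

# Globs expand in lexicographic order; digits sort before lowercase
# letters in ASCII, so numbered part files precede 'part-composable-end'.
compose_order = sorted(names)
```

Under these naming assumptions, `compose_order` lists the header first, then the numbered parts, then the end-of-file marker file, which is the order the objects would need to be passed to `gsutil compose`.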

@tpoterba
Contributor

oh, sweet! didn't know you were working on this.

@chrisvittal chrisvittal marked this pull request as ready for review May 22, 2020 19:11
Contributor

@tpoterba tpoterba left a comment

nice job!

@danking danking merged commit 4175ed6 into hail-is:master May 30, 2020