Make SemanticHash Resilient to FileNotFoundExceptions #13919
Fixes hail-is#13915

`MatrixVCFReader` accepts glob patterns (wildcards in file names). This bamboozled `SemanticHash`, which had assumed all files had been resolved. This change fixes that by adding explicit `FileNotFoundException` handling to `SemanticHash` and replacing the `params.files` object of `MatrixVCFReader` with the resolved paths.
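The failure mode described above can be illustrated outside of Hail. The following is a minimal Python sketch, not the Scala code in this PR: `resolve_paths` and `semantic_hash` are hypothetical names, and hashing file size and modification time is only a stand-in for whatever metadata `SemanticHash` actually reads. It shows an unresolved glob pattern producing a file-not-found error, the error being logged instead of propagated, and the resolved paths hashing cleanly.

```python
import glob
import hashlib
import logging
import os
from typing import List, Optional

log = logging.getLogger("semhash_sketch")  # hypothetical logger name


def resolve_paths(patterns: List[str]) -> List[str]:
    """Expand each glob pattern into the concrete paths it matches."""
    resolved: List[str] = []
    for pattern in patterns:
        resolved.extend(sorted(glob.glob(pattern)))
    return resolved


def semantic_hash(paths: List[str]) -> Optional[str]:
    """Hash path names plus file metadata; return None if a path is missing."""
    digest = hashlib.sha256()
    try:
        for path in paths:
            # An unresolved pattern such as 'chr*.vcf' is not a real file,
            # so os.stat raises FileNotFoundError for it.
            stat = os.stat(path)
            digest.update(path.encode())
            digest.update(str(stat.st_size).encode())
            digest.update(str(stat.st_mtime_ns).encode())
    except FileNotFoundError:
        # Log loudly, but do not fail the caller's query.
        log.warning("could not compute semantic hash", exc_info=True)
        return None
    return digest.hexdigest()
```

Calling `semantic_hash(resolve_paths(patterns))` rather than `semantic_hash(patterns)` mirrors the second half of the fix: the hash only ever sees post-glob paths.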
log.warn(
  """An internal compiler error occurred.
    |Please report this to the Hail Team using the link below,
    |including the stack trace at the end of this message.
    |https://github.com/hail-is/hail/issues/new/choose
    |""".stripMargin,
  error
Seems right that a failure in semhash shouldn't bring the whole query down. It's not that important. Simmer down.
We want to know about these failures though, so a shouty log might be appropriate. Though this needs to be more shouty.
  """AN INTERNAL COMPILER ERROR OCCURRED.
    |PLEASE REPORT THIS TO THE HAIL TEAM USING THE LINK BELOW,
    |INCLUDING THE STACK TRACE AT THE END OF THIS MESSAGE.
    |https://github.com/hail-is/hail/issues/new/choose
    |""".stripMargin,
Better.
Indeed appropriately shouty
I confirmed that, with this change, #13915 is resolved. In the interest of fixing that for 0.2.125, I'm gonna approve and get it into this release.
However, I left a comment inline. It seems that the meaning of pathsUsed was always a bit buggy and I think we should kill that tech debt now before it trips us again.
I'm also still concerned that `test_glob` didn't catch this bug; we should nail down why.
@@ -1744,7 +1744,7 @@ object MatrixVCFReader {

     LoadVCF.warnDuplicates(sampleIDs)

-    new MatrixVCFReader(params, fileListEntries, referenceGenome, header1)
+    new MatrixVCFReader(params.copy(files = fileListEntries.map(_.getPath)), fileListEntries, referenceGenome, header1)
This feels inconsistent across the readers. `StringTableReader` is special-cased in `SemanticHash` to use the `fileStatuses`, but its `params.files` is still the original glob-pattern-containing params.
`pathsUsed` was introduced a long time ago to protect against silly pipelines like:
hl.read_matrix_table('foo.mt').write('foo.mt')
This works fine for non-globbed data sources like matrix tables and tables. However, even in the original PR #8327, it seems we would always fail to notice:
hl.export_vcf(hl.import_vcf('chr*.vcf'), 'chr1.vcf')
It seems to me that the durable & reliable fix to this is for `pathsUsed` to always be the list of post-globbed files/blobs. I have a mild preference to treat `params` as an immutable record of what the user requested.
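To make the gap concrete, here is a small Python sketch of the overlap check, under stated assumptions: `LISTING`, `expand`, and `clobbers_input` are hypothetical names, and the in-memory listing stands in for a real file system. It is not Hail's implementation of `pathsUsed`; it only shows why comparing the output path against the raw pattern misses the conflict while comparing against post-globbed paths catches it.

```python
import fnmatch
from typing import List

# Hypothetical directory listing standing in for a real file system.
LISTING = ["chr1.vcf", "chr2.vcf", "chrX.vcf"]


def expand(pattern: str, listing: List[str]) -> List[str]:
    """Return the entries of `listing` matched by the glob `pattern`."""
    return fnmatch.filter(listing, pattern)


def clobbers_input(paths_used: List[str], output: str) -> bool:
    """True if the pipeline would write to a path it also reads from."""
    return output in paths_used


requested = ["chr*.vcf"]  # what the user typed
resolved = [p for pat in requested for p in expand(pat, LISTING)]

print(clobbers_input(requested, "chr1.vcf"))  # False: the pattern hides the overlap
print(clobbers_input(resolved, "chr1.vcf"))   # True: post-globbed paths expose it
```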
Taking this one step further: `pathsUsed` could be a list of `FileStatus` (which is a supertype of `FileListEntry`) and `FileStatus` could have an `eTag`. This avoids O(N_FILES) calls to Google to get the etag of all the files. Any reader that uses glob will already have a `FileListEntry` for every file. I suspect readers that don't use glob do still check the existence of the files ahead of time. They should just keep a `FileStatus` (which is proof of existence anyway) around so that we can later get its etag.
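A rough Python sketch of that idea follows. The `FileStatus` dataclass here is a hypothetical stand-in, not Hail's Scala type, and `fetch_etag` is an assumed callback representing one remote metadata request per file. The point is only that a status which already carries its ETag from the listing call needs no further lookup.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical record kept from a listing or existence check."""
    path: str
    size: int
    etag: Optional[str] = None  # populated when the listing already returned it


def etags(statuses: List[FileStatus], fetch_etag: Callable[[str], str]) -> List[str]:
    """Reuse each status's ETag; call `fetch_etag` only when it is missing."""
    return [s.etag if s.etag is not None else fetch_etag(s.path) for s in statuses]
```

With every status populated by the glob listing, `fetch_etag` is never invoked, so the per-file round trips disappear.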
We don't catch this in
Ahh! Drat. OK, I'm glad we know why.
I'll follow this change up with adding support for that, as well as addressing your comment.
Semhash support: #13922