Skip to content

Make SemanticHash Resilient to FileNotFoundExceptions #13919

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ehigham
Copy link
Member

@ehigham ehigham commented Oct 26, 2023

Fixes #13915
MatrixVCFReader accepts glob patterns (wildcards in glob names). This bamboozled SemanticHash which had assumed all files had been resolved. This change fixes this by adding explicit FileNotFoundException handling to SemanticHash and replacing the params.files object of MatrixVCFReader with the resolved paths.

Fixes hail-is#13915
`MatrixVCFReader` accepts glob patterns (wildcards in glob names). This
bamboozled `SemanticHash` which had assumed all files had been resolved.
This change fixes this by adding explicit `FileNotFoundException`
handling to `SemanticHash` and replacing the `params.files` object of
`MatrixVCFReader` with the resolved paths.
Comment on lines 53 to 59
log.warn(
"""An internal compiler error occurred.
|Please report this to the Hail Team using the link below,
|including the stack trace at the end of this message.
|https://github.com/hail-is/hail/issues/new/choose
|""".stripMargin,
error
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems right that a failure in semhash shouldn't bring the whole query down. It's not that important. Simmer down.
We want to know about these failures though, so a shouty log might be appropriate. Though this needs to be more shouty.

Comment on lines +54 to +58
"""AN INTERNAL COMPILER ERROR OCCURRED.
|PLEASE REPORT THIS TO THE HAIL TEAM USING THE LINK BELOW,
|INCLUDING THE STACK TRACE AT THE END OF THIS MESSAGE.
|https://github.com/hail-is/hail/issues/new/choose
|""".stripMargin,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed appropriately shouty

@ehigham ehigham requested a review from danking October 26, 2023 15:47
Copy link
Contributor

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed that, with this change, #13915 is resolved. In the interest of fixing that for 0.2.125, I'm gonna approve and get it into this release.

However, I left a comment inline. It seems that the meaning of pathsUsed was always a bit buggy and I think we should kill that tech debt now before it trips us again.

I'm also still concerned that test_glob didn't catch this bug; we should nail down why.

Comment on lines +54 to +58
"""AN INTERNAL COMPILER ERROR OCCURRED.
|PLEASE REPORT THIS TO THE HAIL TEAM USING THE LINK BELOW,
|INCLUDING THE STACK TRACE AT THE END OF THIS MESSAGE.
|https://github.com/hail-is/hail/issues/new/choose
|""".stripMargin,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed appropriately shouty

@@ -1744,7 +1744,7 @@ object MatrixVCFReader {

LoadVCF.warnDuplicates(sampleIDs)

new MatrixVCFReader(params, fileListEntries, referenceGenome, header1)
new MatrixVCFReader(params.copy(files = fileListEntries.map(_.getPath)), fileListEntries, referenceGenome, header1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels inconsistent across the readers.

StringTableReader is special-cased in SemanticHash to use the fileStatuses but its params.files is still the original glob-pattern-containing params.

pathsUsed as introduced a long time ago to protect against silly pipelines like:

hl.read_matrix_table('foo.mt').write('foo.mt')

This works fine for non-globbed data sources like matrix tables and tables. However, even in the original PR #8327, it seems we would always fail to notice:

hl.export_vcf(hl.import_vcf('chr*.vcf'), 'chr1.vcf')

It seems to me that the durable & reliable fix to this is for pathsUsed to always be the list of post-globbed files/blobs. I have a mild preference to treat params as an immutable record of what the user requested.

Taking this one step further: pathsUsed could be a list of FileStatus (which is a super type of FileListEntry) and FileStatus could have an eTag. This avoids O(N_FILES) calls to google to get the etag of all the files. Any reader that uses glob will already have a FileListEntry for every file. I suspect readers that don't use glob do still check the existence of the files ahead of time. They should just keep a FileStatus (which is proof of existence anyway) around so that we can later get its etag.

@ehigham
Copy link
Member Author

ehigham commented Oct 26, 2023

I confirmed that, with this change, #13915 is resolved. In the interest of fixing that for 0.2.125, I'm gonna approve and get it into this release.

However, I left a comment inline. It seems that the meaning of pathsUsed was always a bit buggy and I think we should kill that tech debt now before it trips us again.

I'm also still concerned that test_glob didn't catch this bug; we should nail down why.

We don't catch this in test_glob because the matrixread is nested within a TableKeyByAndAggregate, which semhash can't handle yet:

2023-10-26 12:54:40.576 : WARN: Failed to compute SemanticHash: SemanticHash unknown: is.hail.expr.ir.TableKeyByAndAggregate

@danking
Copy link
Contributor

danking commented Oct 26, 2023

Ahh! Drat. OK, I'm glad we know why.

@ehigham
Copy link
Member Author

ehigham commented Oct 26, 2023

I'll follow this change up wth adding support for that, as well as addressing your comment.

@danking danking merged commit cfb7323 into hail-is:main Oct 26, 2023
@ehigham ehigham deleted the ehigham/fix-13915-fix-import-vcf-glob-patterns branch October 26, 2023 18:01
@ehigham
Copy link
Member Author

ehigham commented Oct 26, 2023

Semhash support: #13922

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[query] SemanticHash does not respect glob patterns
2 participants