[query] In Google, large pipelines which encounter transient errors often fail to cleanly restart. #13356
See also #13409.
We will attempt to reproduce this by reading the VDS with a new partitioning and force counting. That should trigger lots of seeks.
Here's a clear instance of buffer corruption after a transient error (in this case an SSLException): https://batch.hail.is/batches/7996481/jobs/182741
danking changed the title from "The VDS combiner is flaky on query on batch on GCP due to issues reading VCFs with intervals." to "[query] In Google, large pipelines which encounter transient errors often fail to cleanly restart." on Sep 14, 2023.
I think I have a fix for this.
danking added a commit to danking/hail that referenced this issue on Sep 27, 2023:
CHANGELOG: Fix hail-is#13356 and fix hail-is#13409. In QoB pipelines with 10K or more partitions, transient "Corrupted block detected" errors were common. This was caused by incorrect retry logic. That logic has been fixed. I now assume we cannot reuse a ReadChannel after any exception occurs during read. We also do not assume that the ReadChannel "atomically", in some sense, modifies the ByteBuffer. In particular, if we encounter any error, we blow away the ByteBuffer and restart our read entirely.
See details at #13409 (comment) and fix at #13730.
danking added a commit that referenced this issue on Sep 28, 2023:
…13730) CHANGELOG: Fix #13356 and fix #13409. In QoB pipelines with 10K or more partitions, transient "Corrupted block detected" errors were common. This was caused by incorrect retry logic. That logic has been fixed. I now assume we cannot reuse a ReadChannel after any exception occurs during read. We also do not assume that the ReadChannel "atomically", in some sense, modifies the ByteBuffer. In particular, if we encounter any error, we blow away the ByteBuffer and restart our read entirely. As I described in [this comment to #13409](#13409 (comment)), I have a 10K partition pipeline which was reliably producing this error but now reliably *does not* produce this error (it produces another one, #13721, fix forthcoming for that too).
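To make the retry change described in the commit message concrete, here is a minimal sketch in Python. It is not Hail's actual code: `FlakyChannel`, `make_opener`, `read_block_naive`, and `read_block_restart` are hypothetical names, and the channel is an in-memory stand-in that only imitates a read which fails after partially filling the caller's buffer.

```python
class TransientError(Exception):
    """Stand-in for a transient failure such as an SSLException."""


class FlakyChannel:
    """Hypothetical read channel over an in-memory blob (not a real GCS API).

    When `fail` is set, the first read copies some bytes into the caller's
    buffer, does not advance its own position, and then raises. This models
    a read that is not atomic with respect to the buffer.
    """

    def __init__(self, data, position, fail):
        self.data = data
        self.position = position
        self.fail = fail

    def read_into(self, buf, size):
        chunk = self.data[self.position:self.position + size - len(buf)]
        if self.fail and chunk:
            self.fail = False
            buf.extend(chunk[:max(1, len(chunk) // 2)])
            raise TransientError("connection dropped mid-read")
        buf.extend(chunk)
        self.position += len(chunk)
        return len(chunk)


def make_opener(data, failures):
    """Return a function that opens channels; the first `failures` of them fail once."""
    remaining = [failures]

    def open_channel(position):
        fail = remaining[0] > 0
        if fail:
            remaining[0] -= 1
        return FlakyChannel(data, position, fail)

    return open_channel


def read_block_naive(open_channel, start, size, max_attempts=3):
    """Assumed shape of the faulty logic: reuse the channel and buffer after an error."""
    channel = open_channel(start)
    buf = bytearray()
    for _ in range(max_attempts):
        try:
            while len(buf) < size:
                if channel.read_into(buf, size) == 0:
                    break
            return bytes(buf)
        except TransientError:
            continue  # keeps the half-written buffer and the same channel
    raise TransientError("too many transient errors")


def read_block_restart(open_channel, start, size, max_attempts=3):
    """On any error, drop the channel and the buffer and restart the whole read."""
    last_error = None
    for _ in range(max_attempts):
        channel = open_channel(start)  # never reuse a channel after an exception
        buf = bytearray()              # never trust bytes written before an exception
        try:
            while len(buf) < size:
                if channel.read_into(buf, size) == 0:
                    break
            return bytes(buf)
        except TransientError as err:
            last_error = err
    raise last_error
```

With `data = bytes(range(256))` and one injected failure, `read_block_restart(make_opener(data, 1), 10, 8)` returns `data[10:18]`, while `read_block_naive` with the same arguments returns eight bytes that duplicate the first four. The restart version pays for a full re-read on every failure, in exchange for not depending on what state the channel or buffer is left in.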
The VDS combiner is flaky on query on batch on GCP due to issues reading VCFs with intervals.
Errors observed:
Both of these point to issues in the interface between the FSSeekableInputStream that underpins GoogleFS and the BGZipInputStream that contains it, at least in the presence of more than one seek.
Unfortunately, the conditions that reproduce this are rare, and when our clusters are quieter (nighttime) the errors are even less frequent.