[query] corrupted zstd block #13979
Summary

It has happened twice. The failing partition is different in each run.

The pipeline runs two table collects to get sample information, then converts the matrix table to a table of ndarrays of the values. The entries are getting subsetted, so there is skipping going on. In both cases, we are decoding the entry array when the corrupted block is discovered. In the first case, we are skipping an int (it must be RGQ, based on the etype and type). In the second case, we are decoding a string (it must be FT). Since the error happens on a seemingly arbitrary partition, it seems likely this is related to our transient error handling. Both runs use a version of Hail after we fixed the broken transient error handling in GoogleStorageFS (run 1 used fcaafc5, run 2 used 0.2.126 / ee77707).

Path forward

If it is a transient error, we need to fix how we handle transient errors. Maybe our position handling logic is wrong? If it is not a transient error, maybe our skipping logic is wrong? FT appears immediately after RGQ, and we know RGQ is getting skipped. Our implementation of …

Action items:
Debugging information

EType:
(zipped) Type:
Source buffer spec:
Error for run 1:
Error for run 2:
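As an aside on the skipping hypothesis above (skip RGQ, then decode FT), here is a minimal sketch of skip-versus-decode over a toy length-prefixed layout. It is not Hail's EType decoder or InputBuffer; the field layout is assumed purely for illustration of why a wrong skip corrupts everything that follows.

```scala
import java.io.{ByteArrayInputStream, DataInputStream}

// Toy illustration only: an entry with an int field (think RGQ) followed by
// a length-prefixed UTF-8 string field (think FT). Hail's real decoders are
// more involved; this layout is assumed for the example.
object SkipVsDecode {
  // If the int field isn't requested, we must skip exactly the bytes a
  // decode would have consumed, or every later field reads garbage.
  def readEntry(in: DataInputStream, wantInt: Boolean): String = {
    if (wantInt) {
      val _ = in.readInt() // decode the int
    } else {
      in.skipBytes(4)      // skip it: advance 4 bytes without decoding
    }
    val len = in.readInt() // the string is length-prefixed
    val bytes = new Array[Byte](len)
    in.readFully(bytes)
    new String(bytes, "UTF-8")
  }

  def main(args: Array[String]): Unit = {
    // 4-byte int (42), then length 4 and "PASS"
    val data = Array[Byte](0, 0, 0, 42, 0, 0, 0, 4) ++ "PASS".getBytes("UTF-8")
    val in = new DataInputStream(new ByteArrayInputStream(data))
    println(readEntry(in, wantInt = false)) // prints PASS
  }
}
```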
I've asked Wenhan to run this pipeline with a JAR that has extra debugging information enabled: main...danking:hail:debug-13979.
OK, here's the most recent failure: https://batch.hail.is/batches/8090848/jobs/21993. Don't be duped by my bad log message! There were zero transient errors. I added a log statement that increments the number of errors and prints that message after every error, even if it's not transient. This time it was partition 20053 (we keep moving earlier?). I forgot to catch and rethrow the error with the toString of the input buffer, but I'm not sure there is much to learn from that anyway. FWIW, 20053 was successful in the two previous executions: Interestingly, the peak bytes are not consistent:
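For reference, the extra log statement mentioned above amounts to roughly the following. This is a sketch of the idea, not Hail's actual retry code, and the transience check is a stand-in.

```scala
// Sketch only: count every caught exception and log the running total,
// whether or not the exception is classified as transient.
object CountingRetry {
  private def isTransient(e: Throwable): Boolean = e match {
    case _: java.net.SocketTimeoutException => true // stand-in classification
    case _ => false
  }

  def retryWithErrorCount[A](maxErrors: Int)(op: => A): A = {
    var errors = 0
    var result: Option[A] = None
    while (result.isEmpty) {
      try {
        result = Some(op)
      } catch {
        case e: Exception =>
          errors += 1
          // Printed for every error, transient or not, so a nonzero count
          // here does not by itself mean the errors were transient.
          println(s"error count: $errors (${e.getMessage})")
          if (!isTransient(e) || errors >= maxErrors) throw e
      }
    }
    result.get
  }
}
```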
Whatever is causing this bug is rare: approximately once every 31,000 partitions. The CDA IR is the same except for a couple of iruid names, and the order of the aggregators in the aggregator array is swapped (collect & take vs take & collect). AFAICT, the GCS Java library doesn't do any streaming verification of the hash. We could compute the CRC32C in a streaming manner and fail if/when we get to the end of the object, but this wouldn't work when we read intervals. I'm really mystified.
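On the streaming-CRC32C idea: a rough sketch using `java.util.zip.CRC32C` is below. The expected checksum would have to come from the object's metadata (not shown, and GCS reports it base64-encoded, which is also glossed over here), and as noted this only helps when the whole object is read start to finish, not for interval reads.

```scala
import java.io.{FilterInputStream, IOException, InputStream}
import java.util.zip.CRC32C

// Sketch: wrap an InputStream, update a CRC32C as bytes flow through, and
// compare against an expected checksum once EOF is reached.
class Crc32cVerifyingInputStream(in: InputStream, expected: Long)
    extends FilterInputStream(in) {
  private val crc = new CRC32C()

  override def read(): Int = {
    val b = super.read()
    if (b >= 0) crc.update(b) else check()
    b
  }

  override def read(buf: Array[Byte], off: Int, len: Int): Int = {
    val n = super.read(buf, off, len)
    if (n > 0) crc.update(buf, off, n) else if (n < 0) check()
    n
  }

  private def check(): Unit =
    if (crc.getValue != expected)
      throw new IOException(
        s"CRC32C mismatch: expected $expected, computed ${crc.getValue}")
}
```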
Now five times: partitions 41869, 45088, 46688, 47294, and 59799.
| Job ID | Name | State | Duration | Cost |
| --- | --- | --- | --- | --- |
| 43809 | execute(...)_stage2_table_native_writer_job41869 | Failed | 16s 647ms | $0.0001 |
We tried updating to zstd-jni 1.5.5-11 from 1.5.5-2. Four failures.

| Job ID | Name | State | Duration | Cost |
| --- | --- | --- | --- | --- |
| 6873 | execute(...)_stage2_table_native_writer_job4933 | Failed | 13s 631ms | $0.0001 |
In the latter two cases, the error does not come from zstd decompression. It comes later, during region allocation and when calling isHet on a Call with ploidy 3. When zstd does notice a decompression issue, it's always immediately after a read. In this case, immediately after a read of the entries data, but in the past we've seen reads of other MTs/HTs. Note that the entries are the bulk of the bytes, so if there's something that is rare in terms of bytes processed, we're just much more likely to see it in the entries.
Let's compare 8093951-8854 to 8093977-8854. The latter is a failed task (partition 6914); the former is successful. We'll download the logs and toss away some debug info that changed between the experiments.
Since the latter failed, the log obviously ends earlier, but there are no differences (besides timestamps) in the size of the blocks read from GCS. Since these block sizes are read from the input stream, this is pretty good evidence that the bytes aren't corrupted up until now.
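For reference, the log comparison itself can be as simple as the following sketch: strip timestamps and diff line by line. The timestamp regex is an assumption about the log format.

```scala
import scala.io.Source

// Sketch: compare two job logs line by line after dropping timestamps, so
// that only the logged content (e.g. block sizes read from GCS) is diffed.
object DiffLogs {
  private val Timestamp = """^\d{4}-\d{2}-\d{2}[ T][\d:.,]+\s*""".r

  def normalized(path: String): Iterator[String] =
    Source.fromFile(path).getLines().map(Timestamp.replaceFirstIn(_, ""))

  def main(args: Array[String]): Unit = {
    val Array(a, b) = args
    normalized(a).zipAll(normalized(b), "<eof>", "<eof>")
      .zipWithIndex
      .collect { case ((x, y), i) if x != y => s"line ${i + 1}:\n  $a: $x\n  $b: $y" }
      .foreach(println)
  }
}
```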
The decompressed data size is the same: 65536. It's worth noting that this is a relatively small compressed buffer after a series of much larger compressed buffers: this one is 2081 bytes, the immediately previous one is 14675, and most of the ones before that are also in the 14k range. The same experiment on job 7157 again shows no differences in bytes read before the exception occurs.
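To make those sizes concrete: the reader pulls a length-prefixed compressed block off the stream and hands it to zstd-jni. A stripped-down sketch is below; it is not Hail's actual block buffer code, and the big-endian 4-byte length prefix and fixed 64 KiB uncompressed block size are assumptions chosen to match the numbers in the logs.

```scala
import java.io.{DataInputStream, InputStream}
import com.github.luben.zstd.Zstd

// Sketch: read one compressed block (length prefix + payload) and
// decompress it. The framing here is assumed for illustration.
object ReadZstdBlock {
  val UncompressedBlockSize = 64 * 1024 // matches the 65536 seen in the logs

  def readBlock(in: InputStream): Array[Byte] = {
    val din = new DataInputStream(in)
    val compressedLen = din.readInt() // e.g. 2081 or 14675
    val compressed = new Array[Byte](compressedLen)
    din.readFully(compressed)
    // Throws ZstdException (e.g. "Corrupted block detected") if the payload
    // is not valid zstd data for this content.
    Zstd.decompress(compressed, UncompressedBlockSize)
  }
}
```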
The network reads are identical other than the size of the first read. That first read is the serialized function. I'm not that surprised it differs in size between different commits of Hail. The byte counting is done in our code. If we're counting bytes correctly, then it seems like we're reading the same series of chunks from GCS.
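The byte counting can be as simple as a counting wrapper around the stream; the sketch below is illustrative, not Hail's actual accounting code.

```scala
import java.io.{FilterInputStream, InputStream}
import java.util.concurrent.atomic.AtomicLong

// Sketch: count every byte handed to the decoder so two runs can be
// compared read-for-read.
class CountingInputStream(in: InputStream) extends FilterInputStream(in) {
  val bytesRead = new AtomicLong(0L)

  override def read(): Int = {
    val b = super.read()
    if (b >= 0) bytesRead.incrementAndGet()
    b
  }

  override def read(buf: Array[Byte], off: Int, len: Int): Int = {
    val n = super.read(buf, off, len)
    if (n > 0) bytesRead.addAndGet(n)
    n
  }
}
```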
I'm really starting to grasp at straws here. Hail's code has been around the block several times, so I have some confidence in it. What's new here? Well, this is Query-on-Batch, meaning we're using the Google Cloud Storage Java API directly rather than whatever Hadoop does. A recently posted bug reporting data corruption was fixed in 2.30.0: googleapis/java-storage#2301. It wouldn't be the first time we found bugs in the Google Cloud Storage Java API (#13937).
CHANGELOG: Fix #13979, affecting Query-on-Batch and manifesting most frequently as "com.github.luben.zstd.ZstdException: Corrupted block detected". This PR upgrades google-cloud-storage from 2.29.1 to 2.30.1. The google-cloud-storage Java library has a bug, present since at least 2.29.0, in which simply incorrect data was returned: googleapis/java-storage#2301. The issue seems related to their use of multiple intermediate ByteBuffers. As far as I can tell, this is what could happen:

1. If there's no channel, open a new channel at the current position.
2. Read *some* data from the input ByteChannel into an intermediate ByteBuffer.
3. While attempting to read more data into a subsequent intermediate ByteBuffer, a retryable exception occurs.
4. The exception bubbles up to google-cloud-storage's error handling, which frees the channel and loops back to (1).

The key bug is that the intermediate buffers have data but the `position` hasn't been updated. When we recreate the channel, we jump to the wrong position and re-read some data. Lucky for us, between Zstd and our assertions, this usually crashes the program instead of silently returning bad data.

This is the third bug we have found in Google's cloud storage Java library. The previous two:

1. #13721
2. #13937

Be forewarned: the next time we see bizarre networking or data-corruption issues, check whether updating google-cloud-storage fixes the problem.
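To make the suspected failure mode concrete, here is an illustrative sketch of the buggy pattern described above. It is not google-cloud-storage's real code; `openAt` is a stand-in for (re)opening a ranged read of the object at a given offset.

```scala
import java.io.IOException
import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Illustrative sketch of the bug pattern only, NOT the library's real code.
class RetryingObjectReader(openAt: Long => ReadableByteChannel) {
  private var position: Long = 0L
  private var channel: ReadableByteChannel = _

  // Fill several intermediate buffers, retrying on IOException.
  def read(buffers: Array[ByteBuffer]): Unit = {
    var filled = 0L
    try {
      if (channel == null)
        channel = openAt(position)   // (1) open at the current position
      for (buf <- buffers) {
        val n = channel.read(buf)    // (2) some buffers receive data...
        if (n > 0) filled += n       // ...but `position` is not advanced yet
      }
      position += filled             // only updated after ALL buffers are filled
    } catch {
      case _: IOException =>         // (3) retryable failure partway through
        channel = null               // (4) drop the channel and retry
        // BUG: `position` still reflects the state before this attempt, yet
        // `buffers` already hold `filled` bytes. The retry reopens at the
        // stale position and re-delivers bytes the caller already has.
        read(buffers)
    }
  }
}
```

Presumably the fix keeps the tracked position consistent with the bytes already delivered to the intermediate buffers before retrying; the details are in googleapis/java-storage#2301.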
What happened?
https://hail.zulipchat.com/#narrow/stream/223457-Hail-Batch-support/topic/QoB.20Error.3A.20GoogleJsonResponseException.3A.20404.20Not.20Found/near/398355473
Version
0.2.126
Relevant log output
No response