[query] Unhandled transient error for GCS 503s #13937
Comments
This is a bug in the Google Cloud Storage Java client library. It was introduced in 2.25.0 by googleapis/java-storage@4c2f44e and fixed in 2.29.1 by googleapis/java-storage@9b4bb82.
The fix is to update to 2.29.1.
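For deployments that cannot move to 2.29.1 immediately, one possible stop-gap is to retry the read at the call site when the library surfaces a transient StorageException. The sketch below is illustrative only, not the fix that was merged; the bucket and object names, attempt count, and backoff parameters are placeholders.

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import com.google.cloud.storage.StorageOptions;

public final class RetryingDownload {
    // Retries reads that fail with a retryable StorageException (e.g. the 503s
    // described above), using simple exponential backoff. This is a user-level
    // stop-gap; the real fix is upgrading google-cloud-storage to >= 2.29.1.
    public static byte[] readAllBytesWithRetry(Storage storage, BlobId blob, int maxAttempts)
            throws InterruptedException {
        long backoffMillis = 500;
        for (int attempt = 1; ; attempt++) {
            try {
                return storage.readAllBytes(blob);
            } catch (StorageException e) {
                boolean transientError = e.isRetryable() || e.getCode() == 503;
                if (!transientError || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoffMillis);
                backoffMillis = Math.min(backoffMillis * 2, 16_000);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        // Placeholder bucket/object names.
        byte[] data = readAllBytesWithRetry(storage, BlobId.of("my-bucket", "path/to/object"), 5);
        System.out.println("read " + data.length + " bytes");
    }
}
```

Upgrading remains the correct fix; a wrapper like this only papers over the missing retry classification inside the library.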
danking pushed a commit to danking/hail that referenced this issue on Nov 21, 2023:
CHANGELOG: Fix hail-is#13937 caused by faulty library code in the Google Cloud Storage API Java client library.
danking added a commit that referenced this issue on Nov 21, 2023.
danking pushed a commit to danking/hail that referenced this issue on Dec 7, 2023:
CHANGELOG: Fix hail-is#13979, affecting Query-on-Batch and manifesting most frequently as "com.github.luben.zstd.ZstdException: Corrupted block detected". This PR upgrades google-cloud-storage from 2.29.1 to 2.30.1. The google-cloud-storage Java library has a bug, present since at least 2.29.0, in which simply incorrect data was returned (googleapis/java-storage#2301). The issue seems related to their use of multiple intermediate ByteBuffers. As far as I can tell, this is what could happen:

1. If there's no channel, open a new channel at the current position.
2. Read *some* data from the input ByteChannel into an intermediate ByteBuffer.
3. While attempting to read more data into a subsequent intermediate ByteBuffer, a retryable exception occurs.
4. The exception bubbles up to google-cloud-storage's error handling, which frees the channel and loops back to (1).

The key bug is that the intermediate buffers have data but the `position` hasn't been updated. When we recreate the channel, we jump to the wrong position and re-read some data. Luckily for us, between Zstd and our assertions, this usually crashes the program instead of silently returning bad data.

This is the third bug we have found in Google's Cloud Storage Java library. The previous two:

1. hail-is#13721
2. hail-is#13937

Be forewarned: the next time we see bizarre networking or data corruption issues, check whether updating google-cloud-storage fixes the problem.
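To make that failure mode concrete, here is a deliberately simplified, hypothetical sketch of the pattern the changelog describes. It is not the actual java-storage code; the class, interface, and field names are invented for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Hypothetical illustration of the retry pattern described above; not java-storage code.
final class StalePositionReader {
    /** Stand-in for "open a new object read channel at this byte offset". */
    interface OffsetChannelFactory {
        ReadableByteChannel open(long offset) throws IOException;
    }

    private final OffsetChannelFactory factory;
    private ReadableByteChannel channel;  // recreated after a transient failure
    private long position = 0;            // offset the reader believes it has consumed

    StalePositionReader(OffsetChannelFactory factory) {
        this.factory = factory;
    }

    long readInto(ByteBuffer[] intermediates) throws IOException {
        while (true) {
            try {
                if (channel == null) {
                    channel = factory.open(position);     // (1) reopen at the recorded position
                }
                long readThisCall = 0;
                for (ByteBuffer buf : intermediates) {    // (2)/(3) fill several intermediate buffers
                    int n = channel.read(buf);
                    if (n > 0) {
                        readThisCall += n;
                    }
                }
                position += readThisCall;                 // position advances only after *all* reads succeed
                return readThisCall;
            } catch (IOException retryable) {             // (4) transient failure part-way through
                channel = null;
                // BUG: some intermediate buffers already hold data, but `position` was never
                // advanced for them, so the reopened channel starts too early and bytes are
                // re-read -- which the caller later sees as corruption (e.g. the Zstd errors).
            }
        }
    }
}
```

The upgrade to 2.30.1 referenced in the changelog avoids this resume-at-the-wrong-offset behavior.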
danking added a commit that referenced this issue on Dec 7, 2023.
danking pushed a commit to danking/hail that referenced this issue on Dec 16, 2023.
What happened?
The GCS library throws a `StorageException: Unknown Error` on 503s, resulting in the stacktrace below. Such a transient error should be retried gracefully.

Version
0.2.124
Relevant log output
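Regardless of the exact trace, code that needs to recognize this failure as transient can inspect the caught StorageException directly. A minimal sketch; the class and method names here are placeholders, not part of Hail or the library.

```java
import com.google.cloud.storage.StorageException;

final class GcsErrors {
    // 503 = Service Unavailable. isRetryable() reflects the library's own classification;
    // per this report, the 503s surfaced as an unhandled StorageException before the
    // upstream fix, so checking the HTTP code explicitly is a useful belt-and-braces test.
    static boolean isTransientGcsError(StorageException e) {
        return e.getCode() == 503 || e.isRetryable();
    }
}
```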