[query] update past a broken google cloud storage java library #14080

Merged 1 commit on Dec 7, 2023


@danking danking commented Dec 7, 2023

CHANGELOG: Fix #13979, affecting Query-on-Batch and manifesting most frequently as "com.github.luben.zstd.ZstdException: Corrupted block detected".

This PR upgrades google-cloud-storage from 2.29.1 to 2.30.1. The google-cloud-storage Java library has a bug, present since at least 2.29.0, in which it returns incorrect data: googleapis/java-storage#2301. The issue seems related to its use of multiple intermediate ByteBuffers. As far as I can tell, this is what could happen:

  1. If there's no channel, open a new channel with the current position.
  2. Read some data from the input ByteChannel into an intermediate ByteBuffer.
  3. While attempting to read more data into a subsequent intermediate ByteBuffer, a retryable exception occurs.
  4. The exception bubbles up to google-cloud-storage's error handling, which frees the channel and loops back to (1).

The key bug is that the intermediate buffers have data but the position hasn't been updated. When we recreate the channel we will jump to the wrong position and re-read some data. Lucky for us, between Zstd and our assertions, this usually crashes the program instead of silently returning bad data.
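
To make the sequence above concrete, here is a toy model of it in plain Java. It is only a sketch of the failure mode as I understand it, not google-cloud-storage's actual code: the class `RetryPositionSketch`, the method `readAll`, and the `trackBufferedBytes` flag are all hypothetical names invented for illustration, and the "channel" is just an offset into a byte array with one injected retryable failure.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical toy model; none of these names come from google-cloud-storage.
final class RetryPositionSketch {
    /**
     * Copies `source` to the output in chunks, injecting one retryable failure
     * between the first and second intermediate reads.
     * If trackBufferedBytes is false, `position` is not advanced for bytes that
     * were already buffered before the failure (the suspected bug).
     */
    static byte[] readAll(byte[] source, int chunkSize, boolean trackBufferedBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int position = 0;               // offset used whenever the channel is (re)opened
        boolean failureInjected = false;

        while (position < source.length) {
            int channelOffset = position;   // step 1: open a channel at `position`
            try {
                // step 2: read some data into the first intermediate buffer
                int n1 = Math.min(chunkSize, source.length - channelOffset);
                out.write(source, channelOffset, n1);
                channelOffset += n1;
                if (trackBufferedBytes) {
                    position = channelOffset;   // account for the buffered bytes
                }

                // step 3: a retryable error occurs while filling the next buffer
                if (!failureInjected) {
                    failureInjected = true;
                    throw new IOException("simulated retryable error");
                }
                int n2 = Math.min(chunkSize, source.length - channelOffset);
                out.write(source, channelOffset, n2);
                channelOffset += n2;
                position = channelOffset;
            } catch (IOException e) {
                // step 4: drop the channel and loop back to step 1
            }
        }
        return out.toByteArray();
    }
}
```

In this model, calling `readAll` on the ASCII bytes of `abcdefgh` with a chunk size of 2 and `trackBufferedBytes` set to `false` yields `ababcdefgh`: the first chunk is read twice because the channel is reopened at the stale position. With the flag set to `true` the output matches the input. Duplicated bytes in the middle of a compressed stream are the kind of corruption Zstd would be expected to reject.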

This is the third bug we have found in Google's Cloud Storage Java library. The previous two:

  1. [query] rare Google Cloud Storage error #13721
  2. [query] Unhandled transient error for GCS 503s #13937

Be forewarned: the next time we see bizarre networking or data corruption issues, check if updating google-cloud-storage fixes the problem.

@danking danking merged commit 98adcce into hail-is:main Dec 7, 2023
8 checks passed
Successfully merging this pull request may close these issues.

[query] corrupted zstd block
2 participants