
GCP cloud storage downloaded file corruption #2301

Closed
Cai-Chen opened this issue Nov 14, 2023 · 4 comments · Fixed by #2303 or #2296
Assignees
Labels
api: storage Issues related to the googleapis/java-storage API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@Cai-Chen

Hi, we recently hit an intermittent issue where the size of a file downloaded via the storage SDK differs from the object in GCP Cloud Storage. Our initial investigation pointed us here (code): when an exception is thrown, the retry does not update the read position, so data ends up duplicated/corrupted.

We wrote a simple test to verify.

package test;

import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.ReadChannel;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.StorageOptions;
import org.testng.annotations.Test;
import org.testng.collections.Lists;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.Buffer;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;

public class test {
    @Test
    public void testDownload() throws Exception {
        GoogleCredentials credentials = GoogleCredentials.fromStream(new FileInputStream("/path/secret.json"))
                .createScoped(Lists.newArrayList("https://www.googleapis.com/auth/cloud-platform"));
        var storage = StorageOptions.newBuilder().setCredentials(credentials).build().getService();

        var bucket = "test-bucket";
        var name = "test.file";

        // GCPRemoteObjectReference in our code is just a thin wrapper around a BlobId,
        // so the BlobId is used directly here to keep the test self-contained.
        var blobId = BlobId.of(bucket, name);

        final File localFilePath = Paths.get("/local/path/test.file").toFile();

        try (final ReadChannel inputChannel = storage.reader(blobId)) {
            localFilePath.getParentFile().mkdirs();
            try (FileChannel fileChannel = new FileOutputStream(localFilePath).getChannel()) {
                ByteBuffer bytes = ByteBuffer.allocate(64 * 1024);

                while (inputChannel.read(bytes) > 0) {
                    ((Buffer) bytes).flip();
                    fileChannel.write(bytes);
                    bytes.clear();
                }
            }
        }
    }

}

We then set a breakpoint in java.nio.channels.Channels:
[screenshot: breakpoint set inside java.nio.channels.Channels]

While debugging this test, when the breakpoint is hit we manually throw a java.net.SocketTimeoutException, then remove the breakpoint and resume the program so it runs to completion. Finally, we compare the file size locally and in the bucket.
[screenshot: the local file size differs from the object size in the bucket]
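(For reference, a minimal sketch of how that size comparison could be done programmatically at the end of the test, reusing the storage, bucket, name, and localFilePath variables from the snippet above; this check is illustrative and not part of the original report.)

// Illustrative size check (not part of the original repro); requires
// import com.google.cloud.storage.Blob;
Blob remoteBlob = storage.get(BlobId.of(bucket, name));   // fetch object metadata from GCS
long remoteSize = remoteBlob.getSize();                   // size of the object in the bucket
long localSize = localFilePath.length();                  // size of the downloaded local file
if (localSize != remoteSize) {
    throw new AssertionError(
            "Corrupted download: local=" + localSize + " bytes, bucket=" + remoteSize + " bytes");
}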

We know this internal hack is not a perfect way to reproduce the issue, but it reflects our initial investigation, and the problem is hard to reproduce externally.

Could this be a false alarm?

Thanks.

@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/java-storage API. label Nov 14, 2023
@BenWhitehead BenWhitehead self-assigned this Nov 14, 2023
BenWhitehead added a commit that referenced this issue Nov 14, 2023
…ytes copied

Add integration test with testbench to force failure during read

Fixes #2301
@BenWhitehead
Collaborator

Thanks for the report, and the repro. I was able to translate your repro into an integration test and add it to our suite.

After that I was able to fix the tracking error in the read logic.

Fix is in #2303

Fair warning: we are currently in a code freeze for releases due to Thanksgiving in the US. The next release of the library will be sometime in December.

@BenWhitehead BenWhitehead added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Nov 14, 2023
BenWhitehead added a commit that referenced this issue Nov 14, 2023
…ytes copied

Add integration test with testbench to force failure during read

Update retry conformance tests to assert full byte content when read through reader

Fixes #2301
BenWhitehead added a commit that referenced this issue Nov 14, 2023
…ytes copied (#2303)

Add integration test with testbench to force failure during read

Update retry conformance tests to assert full byte content when read through reader

Fixes #2301
@Cai-Chen
Author

Cool, thanks for your quick response.

@danking

danking commented Dec 6, 2023

Hey @Cai-Chen, could you share which version you were using? We're seeing some intermittent, rather rare corrupted-data issues on 2.29.1 and are wondering if this could be related.

@Cai-Chen
Author

Cai-Chen commented Dec 7, 2023

Hey @Cai-Chen, could you share which version you were using? We're seeing some intermittent, rather rare corrupted-data issues on 2.29.1 and are wondering if this could be related.

We are using libraries-bom 26.27.0, which pulls in google-cloud-storage 2.29.0.

danking added a commit to danking/hail that referenced this issue Dec 7, 2023
CHANGELOG: Fix hail-is#13979, affecting Query-on-Batch and manifesting most frequently as "com.github.luben.zstd.ZstdException: Corrupted block detected".

This PR upgrades google-cloud-storage from 2.29.1 to 2.30.1. The google-cloud-storage java library
has a bug present at least since 2.29.0 in which simply incorrect data was
returned: googleapis/java-storage#2301. The issue seems related to their
use of multiple intermediate ByteBuffers. As far as I can tell, this is what could happen:

1. If there's no channel, open a new channel with the current position.
2. Read *some* data from the input ByteChannel into an intermediate ByteBuffer.
3. While attempting to read more data into a subsequent intermediate ByteBuffer, a retryable exception occurs.
4. The exception bubbles up to google-cloud-storage's error handling, which frees the channel and loops back to (1).

The key bug is that the intermediate buffers have data but the `position` hasn't been updated. When
we recreate the channel we will jump to the wrong position and re-read some data. Lucky for us,
between Zstd and our assertions, this usually crashes the program instead of silently returning bad
data.

This is the third bug we have found in Google's cloud storage java library. The previous two:

1. hail-is#13721
2. hail-is#13937

Be forewarned: the next time we see bizarre networking or data corruption issues, check if updating
google-cloud-storage fixes the problem.
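To make the four steps in the commit message above concrete, here is a heavily simplified, hypothetical sketch of a retrying read loop with the same flaw. It is not the actual java-storage code (see #2303 for the real fix); openChannelAt, readWithRetry, and the sink parameter are made-up stand-ins for re-opening the GCS read at an offset and for the caller's destination.

// Hypothetical illustration of the failure mode, NOT the java-storage implementation.
// Assumed imports: java.io.IOException, java.nio.ByteBuffer,
// java.nio.channels.ReadableByteChannel, java.nio.channels.WritableByteChannel.
static void readWithRetry(WritableByteChannel sink) throws IOException {
    long position = 0;                           // offset used when (re)opening the channel
    ReadableByteChannel channel = null;
    boolean eof = false;
    while (!eof) {
        try {
            if (channel == null) {
                channel = openChannelAt(position);       // step 1: (re)open at the tracked position
            }
            ByteBuffer buf = ByteBuffer.allocate(8192);
            int first = channel.read(buf);               // step 2: some bytes are buffered...
            buf.flip();
            sink.write(buf);                             // ...and already handed to the caller
            buf.clear();
            int second = channel.read(buf);              // step 3: a retryable exception here...
            buf.flip();
            sink.write(buf);
            eof = (first == -1 || second == -1);
            position += Math.max(first, 0) + Math.max(second, 0);   // ...skips this update
        } catch (IOException retryable) {
            try { if (channel != null) channel.close(); } catch (IOException ignored) { }
            channel = null;      // step 4: the loop reopens at the stale `position` and re-reads
        }                        // bytes the sink already received, duplicating data
    }
}

// Made-up stand-in: in the real library this re-issues the GCS read starting at `position`.
static ReadableByteChannel openChannelAt(long position) throws IOException {
    throw new UnsupportedOperationException("illustrative only");
}

Conceptually, the fix is to advance `position` by the bytes actually copied out before any later read can fail, so a retry reopens the channel at the correct offset; that appears to be what the truncated commit titles above ("…ytes copied") describe.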
danking added a commit to hail-is/hail that referenced this issue Dec 7, 2023
danking added a commit to danking/hail that referenced this issue Dec 16, 2023