
feat: improve throughput of http based storage#reader between 100 MiB/s and 200 MiB/s #1799

Merged
BenWhitehead merged 7 commits into main from read-channel-v2 on Dec 20, 2022

Conversation

@BenWhitehead (Collaborator) commented Dec 7, 2022

Work

Implement a new BlobReadChannelV2 which replaces BlobReadChannel and improves on
its resource usage, reducing the minimum number of RPCs from (objSize / chunkSize + 1)
to 1 while still maintaining the ability to restart a stream that may have been
interrupted.
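
The caller-facing API is unchanged. As a point of reference, here is a minimal sketch (bucket and object names are placeholders) of reading an object through Storage#reader; with BlobReadChannelV2 a loop like this is served by a single streaming request that is only re-issued from an offset if the stream is interrupted, rather than roughly one ranged RPC per chunk.

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class ReadObjectExample {
  public static void main(String[] args) throws IOException {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobId blobId = BlobId.of("my-bucket", "my-object"); // placeholder names

    try (ReadChannel reader = storage.reader(blobId)) {
      WritableByteChannel out = Channels.newChannel(System.out);
      ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
      // Drain the channel; the number of underlying HTTP requests is an
      // implementation detail of the channel, not of this loop.
      while (reader.read(buffer) > 0) {
        buffer.flip();
        out.write(buffer);
        buffer.clear();
      }
    }
  }
}
```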

Results

Throughput in MiB/s has increased across the board:

        ClassName           mean    25%    50%    75%    90%    95%    99%    max
READ[0] BlobReadChannel     32.2   25.3   29.0   32.6   42.1   56.1  111.9  214.1
READ[1] BlobReadChannel     32.1   25.4   28.7   32.6   41.7   55.4  106.1  224.4
READ[2] BlobReadChannel     31.9   25.2   28.6   32.8   41.6   55.2  105.4  227.2
READ[0] BlobReadChannelV2  214.1  196.4  219.8  239.3  254.3  261.2  278.0  315.2
READ[1] BlobReadChannelV2  215.9  198.8  221.0  240.0  254.4  261.8  281.8  315.6
READ[2] BlobReadChannelV2  216.4  199.5  221.2  239.4  253.9  261.6  281.6  308.6

Data collected using all default settings, against a regional bucket,
across a range of object sizes [256KiB, 2GiB]. Each object is read in full three
times to account for any GCS caching variability.

Internal implementation notes

Add ByteRangeSpec to encapsulate relative vs explicit(open) vs explicit(closed)
vs null vs open-ended ranges and their associated logical subtleties.
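
A purely illustrative sketch of the kind of value type this describes (the real ByteRangeSpec is package-private and its factory methods may differ); the point is that the relative / explicit / null / open-ended cases, and how they render into an HTTP Range header, live in one place:

```java
// Illustrative only; not the actual ByteRangeSpec API.
final class ByteRangeSpecSketch {
  private final Long begin;  // null => start of the object
  private final Long end;    // inclusive end offset, null => open-ended
  private final Long length; // non-null => relative: begin + length bytes

  private ByteRangeSpecSketch(Long begin, Long end, Long length) {
    this.begin = begin;
    this.end = end;
    this.length = length;
  }

  static ByteRangeSpecSketch explicitClosed(long begin, long end) {
    return new ByteRangeSpecSketch(begin, end, null);
  }

  static ByteRangeSpecSketch explicitOpen(long begin) {
    return new ByteRangeSpecSketch(begin, null, null);
  }

  static ByteRangeSpecSketch relative(long begin, long length) {
    return new ByteRangeSpecSketch(begin, null, length);
  }

  static ByteRangeSpecSketch nullRange() {
    return new ByteRangeSpecSketch(null, null, null); // whole object
  }

  /** Render as an HTTP Range header value, or null when no header is needed. */
  String httpRangeHeader() {
    if (begin == null && end == null && length == null) {
      return null; // whole object, no Range header
    }
    long b = begin == null ? 0 : begin;
    if (length != null) {
      return String.format("bytes=%d-%d", b, b + length - 1);
    }
    if (end != null) {
      return String.format("bytes=%d-%d", b, end);
    }
    return String.format("bytes=%d-", b);
  }
}
```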

The new StorageReadChannel interface is a possible candidate for a new storage-specific
interface we can expose to folks, enabling improvements independent of core and
BigQuery.

Future Breaking Change

In order to facilitate migrating any RestorableState<ReadChannel> customers might
have, we have left the existing class hierarchy in place and updated
BlobReadChannel.StateImpl#restore() to produce a new BlobReadChannelV2 instance
when called.

In the next major version this compatibility path will be removed.
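
For reference, a minimal sketch of the capture/restore round trip that this compatibility path keeps working (bucket and object names are placeholders); after this change the channel returned by restore() is backed by BlobReadChannelV2 even when the state was captured from the old implementation:

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.RestorableState;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.nio.ByteBuffer;

public class CaptureRestoreExample {
  public static void main(String[] args) throws IOException {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    ReadChannel reader = storage.reader(BlobId.of("my-bucket", "my-object"));

    ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
    reader.read(buf); // read some bytes, advancing the channel position

    // Capture the channel's state; RestorableState is Serializable, so it can
    // be persisted and resumed later, possibly in another process.
    RestorableState<ReadChannel> state = reader.capture();
    reader.close();

    // restore() now hands back a channel backed by BlobReadChannelV2.
    try (ReadChannel resumed = state.restore()) {
      buf.clear();
      resumed.read(buf);
    }
  }
}
```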

TODOs

  • Refactor BlobReadChannelContext to push retry config down
  • Wire through populated x-goog-* headers to a StorageObject instance which is set in ApiaryUnbufferedReadableByteChannel
  • Add ByteRangeSpec method argument validation at such time as it becomes public and can no longer rely solely on higher-level pre-validation

@BenWhitehead BenWhitehead requested review from a team as code owners December 7, 2022 00:05
@product-auto-label product-auto-label bot added the labels size: xl (Pull request size is extra large) and api: storage (Issues related to the googleapis/java-storage API) on Dec 7, 2022
@BenWhitehead BenWhitehead changed the title from "Read channel v2" to "feat: improve throughput of http based storage#reader between 100 MiB/s and 200 MiB/s" on Dec 7, 2022
@BenWhitehead BenWhitehead force-pushed the read-channel-v2 branch 2 times, most recently from 9e10999 to 6878b73, on December 13, 2022 22:12
…s and 200 MiB/s

### Internal implementation notes
ClientStuff is no longer Serializable. Instead, BlobReadChannelV2State will store the instance of HttpStorageOptions and then reconstitute the BlobReadChannelContext at restore() time.

To test compatibility, a serialized instance of a BlobReadChannel from v2.16.0 has been created and stored in blobWriteChannel.ser.properties, along with a comment describing how the .ser was generated.
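
For illustration, a sketch of how such a fixture might be produced (the exact bucket, object, and encoding of blobWriteChannel.ser.properties here are assumptions; the authoritative recipe is the comment in the fixture itself): capture the channel state on v2.16.0, Java-serialize it, and base64-encode it into a properties file.

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.RestorableState;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public final class GenerateReaderSerFixture {
  public static void main(String[] args) throws IOException {
    // Run with google-cloud-storage v2.16.0 on the classpath.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    ReadChannel reader = storage.reader(BlobId.of("my-bucket", "my-object")); // placeholders

    // Capture the channel state and Java-serialize it.
    RestorableState<ReadChannel> state = reader.capture();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
      oos.writeObject(state);
    }

    // Base64-encode the bytes so they can live in a .properties file.
    String b64 = Base64.getEncoder().encodeToString(baos.toByteArray());
    Files.write(
        Paths.get("blobWriteChannel.ser.properties"),
        ("# generated against google-cloud-storage v2.16.0\nser=" + b64 + "\n")
            .getBytes(StandardCharsets.UTF_8));
  }
}
```
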
@@ -49,6 +48,6 @@ public void start() {

   @Override
   public void stop() {
-    BucketCleaner.doCleanup(bucketInfo.getName(), s);
+    // BucketCleaner.doCleanup(bucketInfo.getName(), s);
Member

Uncomment

/**
* Hierarchy retained for {@link RestorableState#restore()}. Will be removed in next major version!
*/
@Deprecated
class BlobReadChannel implements ReadChannel {
Member

I'm confused: this class has been reduced to unimplemented; that should be okay because it's a package-private class and can only be instantiated by the Storage.reader() method.

Member

Per @BenWhitehead: this is to continue supporting restore() / deserializing BlobReadChannel state from a serialized object as the implementation moves to BlobReadChannelV2. The class attributes still need to exist for deserialization to continue working; for now the hierarchy is kept for this particular case.

}

private ApiaryReadRequest getApiaryReadRequest() {
// TODO: test what happens if you try to read from the last byte of an object
Member

Does this mean pos=99 of size=100, or pos=-5, which is equivalent to pos=94-99?

Member

Oops, I forgot to ask this question @BenWhitehead

public static final class ReadableByteChannelSessionBuilder {

private final BlobReadChannelContext blobReadChannelContext;
// private Hasher hasher; // TODO: wire in Hasher
Member

Given you're merging into main, please address the Hasher TODO.

Member

Per @BenWhitehead: this is a feature request; the existing reader did not have checksum support.

ImmutableList<HttpRequest> requests = getRequests();

List<T> actual =
requests.stream()
.map(HttpRequest::getUrl)
.distinct() // todo: figure out why requests seem to be recorded twice for blob create
// When a multipart (http, not MPU) request is sent it will show up as multiple requests
Member

Is that due to the multiple parts, or is it a complete duplicate?

Member

@BenWhitehead I forgot to ask this question

Collaborator Author

In this case it's due to the multiple parts: there will be a request for url /path for the object JSON, and a request for url /path for the content bytes. Here distinct() will collapse those to a single entry rather than two. I'm not totally sure why the internal structure of the requests is modeled like this, but it is (for multipart I would have expected some type of content rather than a whole new request, but 🤷🏻‍♂️).

import java.util.Map;
import java.util.Objects;

final class BlobReadChannelV2 implements StorageReadChannel {
Member

How does BlobReadChannelV2 surface limit() so it's usable? It seems like it's package-private.

Member

Per @BenWhitehead: Yes, it's still usable because the accessibility of these methods is set by ReadChannel, which is where these classes/interfaces derive from, and public carries over from there.

Collaborator Author

Correct, the interface dictates the access scope, not the implementation itself. In fact, the compiler will prevent a class from narrowing the scope of an overriding method declaration.
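
A minimal illustration of that point using a generic channel type (not the real storage classes): interface methods are implicitly public, and the compiler rejects an override with weaker access.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// The class itself is package-private, but callers holding it as a
// ReadableByteChannel can still invoke every interface method publicly.
final class PackagePrivateChannel implements ReadableByteChannel {
  private boolean open = true;

  @Override
  public int read(ByteBuffer dst) throws IOException {
    return -1; // trivially empty channel
  }

  // Declaring this override without `public`, e.g. `int read(ByteBuffer dst)`,
  // fails to compile: "attempting to assign weaker access privileges; was public".

  @Override
  public boolean isOpen() {
    return open;
  }

  @Override
  public void close() {
    open = false;
  }
}
```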

@BenWhitehead BenWhitehead merged commit 94cd288 into main Dec 20, 2022
@BenWhitehead BenWhitehead deleted the read-channel-v2 branch December 20, 2022 16:26
gcf-owl-bot bot added a commit that referenced this pull request May 19, 2023
Source-Link: googleapis/synthtool@1fd6dff
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-java:latest@sha256:ad9cabee4c022f1aab04a71332369e0c23841062124818a4490f73337f790337
gcf-merge-on-green bot pushed a commit that referenced this pull request May 19, 2023
Source-Link: https://togithub.com/googleapis/synthtool/commit/1fd6dff029bb3d873a4780e616388f802f086907
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-java:latest@sha256:ad9cabee4c022f1aab04a71332369e0c23841062124818a4490f73337f790337