
feat: improve throughput of http based storage#reader between 100 MiB/s and 200 MiB/s #1799

Merged
BenWhitehead merged 7 commits into main from read-channel-v2 on Dec 20, 2022

Conversation

@BenWhitehead (Collaborator) commented Dec 7, 2022

Work

Implement a new BlobReadChannelV2 which replaces BlobReadChannel and improves on
its resource usage, reducing the minimum number of RPCs from (objSize / chunkSize + 1)
to 1 while still maintaining the ability to restart a stream that may have been
interrupted.
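
The caller-facing API is unchanged. As a point of reference, here is a minimal sketch (bucket and object names are placeholders) of reading an object through Storage#reader; with BlobReadChannelV2 a loop like this is served by a single streaming request that is only re-issued from an offset if the stream is interrupted, rather than roughly one ranged RPC per chunk.

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class ReadObjectExample {
  public static void main(String[] args) throws IOException {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobId blobId = BlobId.of("my-bucket", "my-object"); // placeholder names

    try (ReadChannel reader = storage.reader(blobId)) {
      WritableByteChannel out = Channels.newChannel(System.out);
      ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
      // Drain the channel; the number of underlying HTTP requests is an
      // implementation detail of the channel, not of this loop.
      while (reader.read(buffer) > 0) {
        buffer.flip();
        out.write(buffer);
        buffer.clear();
      }
    }
  }
}
```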

Results

Throughput in MiB/s has increased across the board:

        ClassName           mean    25%    50%    75%    90%    95%    99%    max
READ[0] BlobReadChannel     32.2   25.3   29.0   32.6   42.1   56.1  111.9  214.1
READ[1] BlobReadChannel     32.1   25.4   28.7   32.6   41.7   55.4  106.1  224.4
READ[2] BlobReadChannel     31.9   25.2   28.6   32.8   41.6   55.2  105.4  227.2
READ[0] BlobReadChannelV2  214.1  196.4  219.8  239.3  254.3  261.2  278.0  315.2
READ[1] BlobReadChannelV2  215.9  198.8  221.0  240.0  254.4  261.8  281.8  315.6
READ[2] BlobReadChannelV2  216.4  199.5  221.2  239.4  253.9  261.6  281.6  308.6

Data collected using all default settings, against a regional bucket,
across a range of object sizes [256KiB, 2GiB]. Each object is read in full three
times to account for any GCS caching variability.

Internal implementation notes

Add ByteRangeSpec to encapsulate relative vs explicit(open) vs explicit(closed)
vs null vs open-ended ranges and their associated logical subtleties.
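
A purely illustrative sketch of the kind of value type this describes (the real ByteRangeSpec is package-private and its factory methods may differ); the point is that the relative / explicit / null / open-ended cases, and how they render into an HTTP Range header, live in one place:

```java
// Illustrative only; not the actual ByteRangeSpec API.
final class ByteRangeSpecSketch {
  private final Long begin;  // null => start of the object
  private final Long end;    // inclusive end offset, null => open-ended
  private final Long length; // non-null => relative: begin + length bytes

  private ByteRangeSpecSketch(Long begin, Long end, Long length) {
    this.begin = begin;
    this.end = end;
    this.length = length;
  }

  static ByteRangeSpecSketch explicitClosed(long begin, long end) {
    return new ByteRangeSpecSketch(begin, end, null);
  }

  static ByteRangeSpecSketch explicitOpen(long begin) {
    return new ByteRangeSpecSketch(begin, null, null);
  }

  static ByteRangeSpecSketch relative(long begin, long length) {
    return new ByteRangeSpecSketch(begin, null, length);
  }

  static ByteRangeSpecSketch nullRange() {
    return new ByteRangeSpecSketch(null, null, null); // whole object
  }

  /** Render as an HTTP Range header value, or null when no header is needed. */
  String httpRangeHeader() {
    if (begin == null && end == null && length == null) {
      return null; // whole object, no Range header
    }
    long b = begin == null ? 0 : begin;
    if (length != null) {
      return String.format("bytes=%d-%d", b, b + length - 1);
    }
    if (end != null) {
      return String.format("bytes=%d-%d", b, end);
    }
    return String.format("bytes=%d-", b);
  }
}
```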

The new StorageReadChannel interface is a possible candidate for a new storage-specific
interface we can expose to folks, enabling improvements independent of core and
BigQuery.

Future Breaking Change

In order to facilitate migrating any RestorableState<ReadChannel> customers might
have, we have left the existing class hierarchy in place and updated
BlobReadChannel.StateImpl#restore() to produce a new BlobReadChannelV2 instance
when called.

In the next major version this compatibility path will be removed.
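
For reference, a minimal sketch of the capture/restore round trip that this compatibility path keeps working (bucket and object names are placeholders); after this change the channel returned by restore() is backed by BlobReadChannelV2 even when the state was captured from the old implementation:

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.RestorableState;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.nio.ByteBuffer;

public class CaptureRestoreExample {
  public static void main(String[] args) throws IOException {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    ReadChannel reader = storage.reader(BlobId.of("my-bucket", "my-object"));

    ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
    reader.read(buf); // read some bytes, advancing the channel position

    // Capture the channel's state; RestorableState is Serializable, so it can
    // be persisted and resumed later, possibly in another process.
    RestorableState<ReadChannel> state = reader.capture();
    reader.close();

    // restore() now hands back a channel backed by BlobReadChannelV2.
    try (ReadChannel resumed = state.restore()) {
      buf.clear();
      resumed.read(buf);
    }
  }
}
```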

TODOs

  • Refactor BlobReadChannelContext to push retry config down
  • Wire through populated x-goog-* headers to a StorageObject instance which is set in ApiaryUnbufferedReadableByteChannel
  • Add ByteRangeSpec method argument validation at such time as it becomes public and can no longer rely solely on higher-level pre-validation

@BenWhitehead BenWhitehead requested review from a team as code owners December 7, 2022 00:05
@product-auto-label product-auto-label bot added the labels size: xl (Pull request size is extra large) and api: storage (Issues related to the googleapis/java-storage API) on Dec 7, 2022
@BenWhitehead BenWhitehead changed the title from "Read channel v2" to "feat: improve throughput of http based storage#reader between 100 MiB/s and 200 MiB/s" on Dec 7, 2022
@BenWhitehead BenWhitehead force-pushed the read-channel-v2 branch 2 times, most recently from 9e10999 to 6878b73, on December 13, 2022 22:12
…s and 200 MiB/s

### Internal implementation notes
ClientStuff is no longer Serializable. Instead, BlobReadChannelV2State will store the instance of HttpStorageOptions and then reconstitute the BlobReadChannelContext at restore() time.

To test compatibility, a serialized instance of a BlobReadChannel from v2.16.0 has been created and stored in blobWriteChannel.ser.properties, along with a comment describing how the .ser was generated.
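
For illustration, a sketch of how such a fixture might be produced (the exact bucket, object, and encoding of blobWriteChannel.ser.properties here are assumptions; the authoritative recipe is the comment in the fixture itself): capture the channel state on v2.16.0, Java-serialize it, and base64-encode it into a properties file.

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.RestorableState;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public final class GenerateReaderSerFixture {
  public static void main(String[] args) throws IOException {
    // Run with google-cloud-storage v2.16.0 on the classpath.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    ReadChannel reader = storage.reader(BlobId.of("my-bucket", "my-object")); // placeholders

    // Capture the channel state and Java-serialize it.
    RestorableState<ReadChannel> state = reader.capture();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
      oos.writeObject(state);
    }

    // Base64-encode the bytes so they can live in a .properties file.
    String b64 = Base64.getEncoder().encodeToString(baos.toByteArray());
    Files.write(
        Paths.get("blobWriteChannel.ser.properties"),
        ("# generated against google-cloud-storage v2.16.0\nser=" + b64 + "\n")
            .getBytes(StandardCharsets.UTF_8));
  }
}
```
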
@@ -49,6 +48,6 @@ public void start() {

   @Override
   public void stop() {
-    BucketCleaner.doCleanup(bucketInfo.getName(), s);
+    // BucketCleaner.doCleanup(bucketInfo.getName(), s);
Member

Uncomment

/**
* Hierarchy retained for {@link RestorableState#restore()}. Will be removed in next major version!
*/
@Deprecated
class BlobReadChannel implements ReadChannel {
Member

I'm confused: this class has been reduced to unimplemented; that should be okay because it's a package-private class and can only be instantiated by the Storage.reader() method.

Member

Per @BenWhitehead: this is to continue supporting restore() / deserializing BlobReadChannel state from a serialized object as the implementation moves to BlobReadChannelV2. The class attributes still need to exist for deserialization to continue working; for now the hierarchy is kept for this particular case.

}

private ApiaryReadRequest getApiaryReadRequest() {
// TODO: test what happens if you try to read from the last byte of an object
Member

Does this mean pos=99 of size=100, or pos=-5, which is equivalent to pos=94-99?

Member

Oops, I forgot to ask this question @BenWhitehead

public static final class ReadableByteChannelSessionBuilder {

private final BlobReadChannelContext blobReadChannelContext;
// private Hasher hasher; // TODO: wire in Hasher
Member

Given you're merging into main, please address the Hasher TODO.

Member

Per @BenWhitehead: this is a feature request; the existing reader did not have checksum support.

ImmutableList<HttpRequest> requests = getRequests();

List<T> actual =
requests.stream()
.map(HttpRequest::getUrl)
.distinct() // todo: figure out why requests seem to be recorded twice for blob create
// When a multipart (http, not MPU) request is sent it will show up as multiple requests
Member

Is that due to the multiple parts, or is it a complete duplicate?

Member

@BenWhitehead I forgot to ask this question

Collaborator Author

In this case it's due to the multiple parts: there will be a request for url /path for the object JSON, and a request for url /path for the content bytes. Here distinct() will collapse those to a single entry rather than two. I'm not totally sure why the internal structure of the requests is modeled like this, but it is (for multipart I would have expected some type of content rather than a whole new request, but 🤷🏻‍♂️).

import java.util.Map;
import java.util.Objects;

final class BlobReadChannelV2 implements StorageReadChannel {
Member

How does BlobReadChannelV2 surface limit() so it's usable? It seems like it's package-private.

Member

Per @BenWhitehead: Yes, it's still usable because the accessibility of these methods is set by ReadChannel, which is where these classes/interfaces derive from, and public carries over from there.

Collaborator Author

Correct, the interface dictates the access scope, not the implementation itself. In fact, the compiler will prevent a class from narrowing the scope of an overriding method declaration.
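
A minimal illustration of that point using a generic channel type (not the real storage classes): interface methods are implicitly public, and the compiler rejects an override with weaker access.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// The class itself is package-private, but callers holding it as a
// ReadableByteChannel can still invoke every interface method publicly.
final class PackagePrivateChannel implements ReadableByteChannel {
  private boolean open = true;

  @Override
  public int read(ByteBuffer dst) throws IOException {
    return -1; // trivially empty channel
  }

  // Declaring this override without `public`, e.g. `int read(ByteBuffer dst)`,
  // fails to compile: "attempting to assign weaker access privileges; was public".

  @Override
  public boolean isOpen() {
    return open;
  }

  @Override
  public void close() {
    open = false;
  }
}
```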

@BenWhitehead BenWhitehead merged commit 94cd288 into main Dec 20, 2022
@BenWhitehead BenWhitehead deleted the read-channel-v2 branch December 20, 2022 16:26
gcf-owl-bot bot added a commit that referenced this pull request May 19, 2023
Source-Link: googleapis/synthtool@1fd6dff
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-java:latest@sha256:ad9cabee4c022f1aab04a71332369e0c23841062124818a4490f73337f790337
gcf-merge-on-green bot pushed a commit that referenced this pull request May 19, 2023
Source-Link: https://togithub.com/googleapis/synthtool/commit/1fd6dff029bb3d873a4780e616388f802f086907
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-java:latest@sha256:ad9cabee4c022f1aab04a71332369e0c23841062124818a4490f73337f790337