Redesign Exchange protocol to reduce query latency #21926

Open
mbasmanova opened this issue Feb 14, 2024 · 37 comments

@mbasmanova
Contributor

An investigation of a slow exchange-heavy query prompted a re-think of the Exchange protocol used in Presto and Prestissimo.

The query under investigation has 10+ count(distinct) aggregations, which are planned as 10+ stages with MarkDistinct and a shuffle. This query shuffles the same data 10+ times, and some stages are skewed so that a handful of workers process 80% of the data. A simplified version of the query with fewer count(distinct) aggregations was taking 8 minutes, of which 2 minutes went into fetching data over the network and 6 minutes went into waiting.

The consumer task has a memory limit (32GB) and therefore cannot pull from all the producers at once. The consumer also doesn't know which producers have data and which don't. When pulling from producers that do not have data, the consumer waits for 2 seconds before moving on to the next producer. These delays add up. We'd like to redesign the Exchange protocol to eliminate these delays. We tried a prototype of the modified protocol described below on a query that used to time out after 1 hour; that query finished in under 4 minutes.

Modified Protocol

Presto uses a many-to-many streaming shuffle. M upstream workers partition data N ways. Each of the N downstream workers pulls data from all M upstream workers.

Upstream workers use PartitionedOutput operators to partition the data and store it in OutputBuffer. There is a single OutputBuffer for each task with N partitions, one per downstream worker.

Downstream workers use ExchangeOperators and a single ExchangeClient to pull data from output buffers. ExchangeClient creates multiple ExchangeSources, one per upstream worker, to pull the data.

ExchangeSource uses HTTP protocol to pull data: https://prestodb.io/docs/current/develop/worker-protocol.html#data-plane

The existing protocol allows the consumer to ask for data and to specify a limit on response size (maxBytes) and a maximum time to wait for data to become available before sending an empty response (maxWait).

The limit on response size is advisory. The producer may not respect it. This would happen if the limit is very small (a few KB) or if there is a single row that's bigger than 1MB.
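
For example, a single fetch under the existing protocol would look like this (using X-Presto-Max-Size for maxBytes, as later in this proposal, and assuming X-Presto-Max-Wait as the header carrying maxWait):

X-Presto-Max-Size: 1MB
X-Presto-Max-Wait: 2s
GET {taskId}/results/{bufferId}/{token}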

The consumer needs to maintain a cap on memory usage and therefore needs to be careful not to fetch from too many sources at once.

It is hard to come up with an optimal scheduling algorithm without knowing which sources have data and how much. Safe choices sacrifice latency.

Therefore, we propose to modify the protocol to allow consumers to first ask producers how much data they have available, and then use this information to schedule data fetches.

There are cases when Prestissimo workers need to continue using the existing protocol to exchange data with the Coordinator. This happens when the Coordinator pulls query results from a C++ worker. It also happens when C++ workers pull data from a leaf fragment that runs on the Coordinator (TableScan for system tables). Therefore, we are going to extend the existing protocol in a backwards-compatible way.

We introduce the HTTP header X-Presto-Get-Data-Size and use it with a GET {taskId}/results/{bufferId}/{token} request.

Prestissimo workers, in the presence of the X-Presto-Get-Data-Size header, will return the amount of data available: a list of page sizes. The consumer will then be able to request as many pages from the start of the list as would fit in memory by specifying the total number of bytes in these pages.

If data is not available yet, the producer would wait for up to maxWait seconds before responding.

We also introduce an HTTP header X-Presto-Buffer-Remaining-Bytes that is set by the producer in response to a GET {taskId}/results/{bufferId}/{token} to indicate how many bytes are still available.

A Prestissimo consumer would first query all producers (at the same time) for how much data they have.

X-Presto-Get-Data-Size: true
GET {taskId}/results/{bufferId}/{token} 

The producers would respond with available data specified in the X-Presto-Buffer-Remaining-Bytes header.

X-Presto-Buffer-Remaining-Bytes: 100,200,150,400

The consumer then comes up with a plan to fetch available data and control the response sizes using the existing X-Presto-Max-Size header.

X-Presto-Max-Size: 1MB
GET {taskId}/results/{bufferId}/{token} 

The consumer would prioritize fetching data from the sources that have the most data. This unblocks these producers sooner and allows them to produce more data while the consumer fetches data from the remaining sources.

When fetching data, the consumer would receive the data and the size of the remaining data. This will allow the consumer to continue fetching the data without extra metadata requests.

Rinse and repeat until all sources report no-more-data.
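
For illustration, here is a minimal sketch of how a consumer could plan one round of fetches from the reported page sizes, prioritizing the sources with the most data while staying within the memory budget. This is not the actual ExchangeClient code; the names SourcePlan and planFetchRound are made up for this sketch.

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <utility>
    #include <vector>

    struct SourcePlan {
      int sourceId;        // index of the upstream producer
      int64_t bytesToAsk;  // value to send in X-Presto-Max-Size for this round
    };

    // pageSizesPerSource[i] holds the page sizes reported by source i via
    // X-Presto-Buffer-Remaining-Bytes; memoryBudget is the consumer's cap.
    std::vector<SourcePlan> planFetchRound(
        const std::vector<std::vector<int64_t>>& pageSizesPerSource,
        int64_t memoryBudget) {
      // Total remaining bytes per source; skip sources with nothing buffered.
      std::vector<std::pair<int64_t, int>> totals;
      for (int i = 0; i < static_cast<int>(pageSizesPerSource.size()); ++i) {
        int64_t total = std::accumulate(
            pageSizesPerSource[i].begin(), pageSizesPerSource[i].end(), int64_t{0});
        if (total > 0) {
          totals.emplace_back(total, i);
        }
      }
      // Largest sources first, so their producers are unblocked sooner.
      std::sort(totals.rbegin(), totals.rend());

      std::vector<SourcePlan> plan;
      int64_t remainingBudget = memoryBudget;
      for (const auto& [totalBytes, sourceId] : totals) {
        if (remainingBudget <= 0) {
          break;
        }
        // Take whole pages from the front of the list while they fit in the
        // remaining budget; always take at least one page to guarantee progress.
        int64_t ask = 0;
        for (int64_t pageBytes : pageSizesPerSource[sourceId]) {
          if (ask > 0 && ask + pageBytes > remainingBudget) {
            break;
          }
          ask += pageBytes;
        }
        plan.push_back({sourceId, ask});
        remainingBudget -= ask;
      }
      return plan;
    }

    int main() {
      // Four sources reported these page sizes, e.g. source 0 responded with
      // X-Presto-Buffer-Remaining-Bytes: 100,200,150,400
      std::vector<std::vector<int64_t>> reported = {
          {100, 200, 150, 400}, {}, {1 << 20}, {500, 500}};
      for (const auto& p : planFetchRound(reported, /*memoryBudget=*/4 << 20)) {
        std::cout << "source " << p.sourceId << ": ask " << p.bytesToAsk << " bytes\n";
      }
      return 0;
    }

In this sketch the first page of a source is taken even if it exceeds the remaining budget, mirroring the note above that the response size limit is advisory.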

The value of X-Presto-Buffer-Remaining-Bytes is a list of page sizes in the order in which the pages will be produced. This is a comma-separated list of positive integers, where each value is the number of bytes in a page. For example,

X-Presto-Buffer-Remaining-Bytes: 12323432,123452342,23344354,34532343241341

A Prestissimo worker includes the X-Presto-Buffer-Remaining-Bytes header in all responses to GET {taskId}/results/{bufferId}/{token}, regardless of whether the X-Presto-Get-Data-Size header is present. When no data is available, the header value is set to 0.

X-Presto-Buffer-Remaining-Bytes: 0
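
On the consumer side, parsing this header value is straightforward; a minimal sketch (a hypothetical helper, not the actual Prestissimo code) could look like this:

    #include <cstdint>
    #include <sstream>
    #include <string>
    #include <vector>

    // Parses an X-Presto-Buffer-Remaining-Bytes value such as "100,200,150,400"
    // into a list of page sizes.  A value of "0" means nothing is buffered yet,
    // which is returned as an empty list.
    std::vector<int64_t> parseRemainingBytes(const std::string& headerValue) {
      std::vector<int64_t> pageSizes;
      std::stringstream stream(headerValue);
      std::string token;
      while (std::getline(stream, token, ',')) {
        int64_t bytes = std::stoll(token);
        if (bytes > 0) {
          pageSizes.push_back(bytes);
        }
      }
      return pageSizes;
    }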

Backwards compatibility with the Coordinator

There are two use cases that involve data transfer between the Coordinator that doesn’t understand the new protocol and a Prestissimo worker.

Use case 1: Coordinator fetches query results from the worker.

Prestissimo workers would include the new X-Presto-Buffer-Remaining-Bytes header in responses to the existing GET {taskId}/results/{bufferId}/{token} request (without X-Presto-Get-Data-Size). The Coordinator would ignore this header and continue fetching data without the benefit of knowing how much data is available.

Use case 2: Prestissimo worker fetches table scan results for system table from the Coordinator.

The Prestissimo worker includes the new X-Presto-Get-Data-Size header in the existing GET {taskId}/results/{bufferId}/{token} request. The Coordinator will ignore this header and return the data. Prestissimo workers should accept that response (even though they only asked for the data size). This should work because in this case there is only one source and it doesn't produce a lot of data. Prestissimo workers would also set the X-Presto-Max-Size header to avoid accidentally getting too much data (just in case).

Implementation

ExchangeClient in Velox would implement two independent loops: wait-for-data, fetch-data.

The wait-for-data loop waits for sources to report having data. This loop processes all sources in parallel.

The fetch-data loop fetches data from sources that reported having some data. This loop processes a subset of sources at a time to stay within the memory budget.

Any given source is assigned to one loop or the other, but not both. Initially, all sources go into the wait-for-data loop. As soon as a source responds as having data, it is moved to the fetch-data loop. A source stays in the fetch-data loop until a response indicates there is no more data available but the source is not yet at-end. At that point the source is moved back into the wait-for-data loop. Sources that report being at-end are removed from both loops and do not participate in further processing.
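
The state transitions above can be sketched roughly as follows. The names (SourceState, ExchangeClientSketch) are hypothetical; the real ExchangeClient drives these transitions from asynchronous HTTP callbacks, shown here as plain synchronous methods for clarity.

    #include <cstdint>
    #include <vector>

    enum class SourceState {
      kWaitingForData,  // in the wait-for-data loop: polled for data sizes
      kFetchingData,    // in the fetch-data loop: receiving pages
      kAtEnd            // reported no-more-data: removed from both loops
    };

    class ExchangeClientSketch {
     public:
      explicit ExchangeClientSketch(int numSources)
          : states_(numSources, SourceState::kWaitingForData) {}

      // Wait-for-data loop: a source reported buffered bytes, start fetching.
      void onDataSizeReported(int source, int64_t remainingBytes) {
        if (states_[source] == SourceState::kWaitingForData && remainingBytes > 0) {
          states_[source] = SourceState::kFetchingData;
        }
      }

      // Fetch-data loop: a data response arrived with the remaining-bytes header.
      void onDataReceived(int source, int64_t remainingBytes, bool atEnd) {
        if (atEnd) {
          states_[source] = SourceState::kAtEnd;           // done with this source
        } else if (remainingBytes == 0) {
          states_[source] = SourceState::kWaitingForData;  // back to waiting
        }
        // Otherwise the source stays in the fetch-data loop.
      }

     private:
      std::vector<SourceState> states_;
    };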

CC: @Yuhta @spershin @arhimondr @tdcmeehan @aditi-pandit @majetideepak @pranjalssh

@mbasmanova mbasmanova added the bug label Feb 14, 2024
@mbasmanova
Contributor Author

CC: @agrawaldevesh

@tdcmeehan
Contributor

@mbasmanova thank you for the clear write up. I had one initial question.

A Prestissimo consumer would first query all (at the same time) producers for how much data they have.

Do we need special casing for this first request in order to receive most of the performance improvement? Would it not be sufficient to simply receive the header on subsequent shuffles and adapt according to the values we receive?

@mbasmanova
Contributor Author

mbasmanova commented Feb 14, 2024

@tdcmeehan Tim, thank you for considering this proposal.

Do we need special casing for this first request in order to receive most of the performance improvement? Would it not be sufficient to simply receive the header on subsequent shuffles and adapt according to the values we receive?

A simpler improvement on the existing protocol is to simply add the X-Presto-Buffer-Remaining-Bytes header that the producer uses to specify how many more bytes it has buffered and ready to go.

However, this is not enough. Consider what happens when a request for data comes back with X-Presto-Buffer-Remaining-Bytes indicating there is no more data yet. The consumer needs to continue polling the producer, but it no longer knows how much data it may get in response and when. So, we are back to the original problem where we don't know whom to ask for data and end up spending lots of wall time asking the "wrong" producers before finally getting to the "right" one.

@tdcmeehan
Contributor

tdcmeehan commented Feb 14, 2024

@mbasmanova so is the idea like this:

  1. Query all task result endpoints at the start of execution for how much data is available for shuffle. The hope is that even though this is at the start of execution, there is some meaningful number of bytes available for the consumer to pick amongst them.
  2. Once we have the list of available bytes from all task result endpoints, we sort them in descending order so that we pick the endpoints that have the most bytes available.
  3. We prioritize the ones that have the most data available, in a way that doesn't cause starvation to the ones that initially reported low bytes available.

If this is accurate, does this mean, what we're trying to avoid is being biased by one particular task result reporting before others have finished reporting?

I also don't understand your example. If we receive X-Presto-Buffer-Remaining-Bytes, then this goes into the bottom of the priority queue of producers to poll. I imagine we go to another producer at this point. Either this next producer reports it has a lot of data, or not a lot. If not a lot, we continue until we reach a producer that does produce a lot of data. If it does produce a lot, then we are lucky and prioritize it on top until we either consider another producer for fairness (to prevent starvation) or consume quickly what data it has buffered and move on.

In other words, I believe there may be some fairness mechanism in this algorithm, otherwise we could be misled by the initially reported bytes which might not reflect current reality. And if there is, wouldn't this fairness mechanism be enough to ensure that even if we chose producers at random, given enough time it would prioritize the producers that end up producing the most data?

@mbasmanova
Contributor Author

@tdcmeehan

The main goal is to ensure that, as long as there is data available, the consumer is working hard at fetching that data, and to avoid the consumer sitting idle when there is data to fetch.

With the existing design, we see cases when consumers are not fetching data although it is available.

Consider a single consumer with 100 producers. If the consumer initiates a fetch from all 100 producers in parallel, it is possible that all 100 respond back with 1MB of data each. This would require the consumer to allocate 100MB of memory and go over budget. To solve this problem, consumers were modified to fetch from a limited number of producers at once. Furthermore, to avoid too much back-and-forth when waiting for data to be produced, each fetch blocks for 2 seconds if data is not available.

Let's say a consumer has a budget of 10MB, so it fetches data from 10 producers in parallel. Let's also say that producer 95 has data, but no other producer does. The consumer will spend 2s waiting for producers 1-10 to report they have nothing, another 2s for producers 11-20 to report they have nothing, etc., until it gets to producer 95 after spending 18 seconds waiting.

Then, let's say producer 95 returned all the data it had and will take a bit of time to produce more data. The consumer will repeat the cycle above and again sit idle for another 18 seconds before getting the next batch of data.
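
As a back-of-the-envelope sketch of this arithmetic (purely illustrative code, using the constants from this example):

    #include <iostream>

    int main() {
      const int producersPerRound = 10;   // 10MB budget / ~1MB per producer
      const int waitSecondsPerRound = 2;  // each fetch blocks 2s when a producer has nothing
      const int producerWithData = 95;    // the only producer that has data

      // Rounds spent entirely on producers that have nothing to give,
      // before the round that contains producer 95.
      const int idleRounds = (producerWithData - 1) / producersPerRound;          // = 9
      std::cout << "idle time: " << idleRounds * waitSecondsPerRound << "s\n";    // 18s
      return 0;
    }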

@tdcmeehan
Contributor

@mbasmanova thank you for the example, that clears things up a lot.

Just to clear up my understanding: in this example, would it be fair to say that if we just added the header, and not the initial special-case call to retrieve the initial buffers, it would take us 18s to discover that producer 95 has the majority of the data? And if so, that means going forward, producer 95 would be prioritized above the rest? Meaning, we would have just an 18s penalty, instead of a penalty that is amortized over the life of the query?

@Yuhta
Contributor

Yuhta commented Feb 14, 2024

@tdcmeehan Yes, that is true, but only if the throughput distribution of the producers does not change over time. With the new protocol, the consumer can adapt to any change and react quickly, so no large latency occurs no matter what.

@tdcmeehan
Contributor

Oh, I see. Is this because we periodically take a "snapshot" of all producers' pending output buffers? I had presumed that the initial request with X-Presto-Get-Data-Size was done once at the beginning. Is it actually done periodically?

@Yuhta
Contributor

Yuhta commented Feb 14, 2024

@tdcmeehan It's done periodically whenever the source is in wait mode:

Any given source is assigned to one loop or the other, but not both. Initially, all sources go into the wait-for-data loop. As soon as a source responds as having data, it is moved to the fetch-data loop. A source stays in the fetch-data loop until a response indicates there is no more data available but the source is not yet at-end. At that point the source is moved back into the wait-for-data loop. Sources that report being at-end are removed from both loops and do not participate in further processing.

@mbasmanova
Contributor Author

@tdcmeehan

Is this because we periodically take a "snapshot" of all producers' pending output buffers?

Not quite.

We run two independent loops: wait-for-data, fetch-data.

The wait-for-data loop waits for sources to report having data. This loop processes all sources in parallel.

The fetch-data loop fetches data from sources that reported having some data. This loop processes a subset of sources at a time to stay within the memory budget.

Any given source is assigned to one loop or the other, but not both. Initially, all sources go into the wait-for-data loop. As soon as a source responds as having data, it is moved to the fetch-data loop. A source stays in the fetch-data loop until a response indicates there is no more data available but the source is not yet at-end. At that point the source is moved back into the wait-for-data loop. Sources that report being at-end are removed from both loops and do not participate in further processing.

@tdcmeehan
Contributor

Makes sense.

One idea on the API with respect to backwards compatibility. It seems like X-Presto-Get-Data-Size: true is a metadata-only request. Typically, such requests are fulfilled by HTTP HEAD, whose purpose is to send back headers without data. What if, instead of sending this new header on the GET to {taskId}/results/{bufferId}/{token}, we just have consumers in wait-for-data mode send HEAD requests to the underlying source? That way, we always send back X-Presto-Buffer-Remaining-Bytes consistently between HEAD and GET, and the only difference is the new HEAD request. It's backwards compatible because the only difference between the coordinator and workers is that the coordinator doesn't care to make the HEAD requests, and we don't need to define a new header type for this purpose.

@mbasmanova
Contributor Author

mbasmanova commented Feb 14, 2024

@tdcmeehan

What if, instead of sending this new header on the GET to {taskId}/results/{bufferId}/{token}, we just have consumers in wait-for-data mode send HEAD requests to the underlying source?

To do that, we need the Coordinator to add support for this new HEAD request. This is needed to support queries that use SYSTEM tables and have their leaf scans run on the Coordinator. In this case, Prestissimo workers need to fetch data from the Coordinator.

@tdcmeehan
Contributor

@mbasmanova this is true, but from the description, it seems that we would just need to make sure that this call doesn't fail (perhaps, just returns 200 if the buffer exists at all), which would be a trivial change. It doesn't seem like it would require support for the X-Presto-Buffer-Remaining-Bytes header.

@mbasmanova
Contributor Author

@tdcmeehan

we would just need to make sure that this call doesn't fail (perhaps, just returns 200 if the buffer exists at all), which would be a trivial change. It doesn't seem like it would require support for the X-Presto-Buffer-Remaining-Bytes header.

In the current design, this call must return remaining bytes, otherwise the consumer will never switch from waiting to fetching-data for that producer.

Currently, we say that the get-data-size request is allowed to return data, but if it doesn't, and if it doesn't specify remaining bytes, we'll continue issuing that same request until we get "no-more-data" or "remaining-bytes".

@tdcmeehan
Contributor

tdcmeehan commented Feb 14, 2024

@mbasmanova got it, I missed that part of the design. Would something along these lines work in TaskResource? I suppose we'd need a similar change in the GET method as well.

    @HEAD
    @Path("{taskId}/results/{bufferId}")
    public Response head(
            @PathParam("taskId") TaskId taskId,
            @PathParam("bufferId") OutputBufferId bufferId,
            @HeaderParam(PRESTO_MAX_SIZE) DataSize maxSize)
    {
        requireNonNull(taskId, "taskId is null");
        requireNonNull(bufferId, "bufferId is null");

        OutputBufferInfo outputBuffers = taskManager.getTaskInfo(taskId).getOutputBuffers();
        Optional<BufferInfo> bufferInfo = outputBuffers.getBuffers().stream().filter(buffer -> buffer.getBufferId().equals(bufferId)).findFirst();
        Response.ResponseBuilder responseBuilder = Response.ok();
        return bufferInfo
                .map(info -> responseBuilder.header("X-Presto-Buffer-Remaining-Bytes", info.getPageBufferInfo().getBufferedBytes())).orElse(responseBuilder)
                .build();
    }

@mbasmanova
Contributor Author

@tdcmeehan I'm not familiar with this code, but it looks like this would work. Do you know if this works for all kinds of buffers, i.e. broadcast/partitioned/arbitrary?

@tdcmeehan
Contributor

There are BufferInfo implementations for all buffer types, and it's being reported in TaskInfo, so I believe this should work consistently.

@mbasmanova
Contributor Author

@tdcmeehan Tim, any chance you could help add support for HTTP HEAD metadata requests and returning remaining bytes from existing HTTP GET requests to the coordinator?

@Yuhta Jimmy, what do you think about this proposal? It looks cleaner, but unfortunately requires changes to the coordinator, so we won't be able to modify Prestissimo independently.

@Yuhta
Contributor

Yuhta commented Feb 15, 2024

I think this is good. On the Prestissimo side we will handle both HEAD and X-Presto-Get-Data-Size temporarily; once the coordinator is fully upgraded, we can remove X-Presto-Get-Data-Size.

@tdcmeehan
Contributor

@mbasmanova @Yuhta I'll work on the coordinator part.

@mbasmanova
Contributor Author

@tdcmeehan Thank you, Tim.

Yuhta added a commit to Yuhta/velox that referenced this issue Feb 15, 2024
Summary:
This is the first diff to upgrade the exchange protocol.  This change
only exposes the remaining bytes to buffer manager; it does not change the
existing protocol yet, and is compatible with the current Prestissimo code.

Also added a few statistics to spot skewed exchange.

See prestodb/presto#21926 for details about the
design.

Differential Revision: D53793123
facebook-github-bot pushed a commit to facebookincubator/velox that referenced this issue Feb 16, 2024
Summary:
Pull Request resolved: #8758

This is the first diff to upgrade the exchange protocol.  This change
only exposes the remaining bytes to buffer manager; it does not change the
existing protocol yet, and is compatible with the current Prestissimo code.

See prestodb/presto#21926 for details about the
design.

Reviewed By: mbasmanova

Differential Revision: D53793123

fbshipit-source-id: 31d83f1e5b2f21db696a3aff094fa2953e65cf60
mbasmanova pushed a commit that referenced this issue Feb 27, 2024
This is the change needed on the presto_cpp side to implement the new exchange
protocol.  Implement `ExchangeSource::getDataSizes` and use it when the header
`X-Presto-Get-Data-Size` exists (it will be removed in the future once the
coordinator supports the `HEAD` calls), or when the endpoint is called with the
HTTP `HEAD` method.

All changes here are backward-compatible.  The new protocol will not be used
until we start calling `getDataSizes` on the Velox side in `ExchangeClient`.

See #21926 for details about the design of the new protocol.

Also added a new runtime stat `numTopOutputBuffers`.
facebook-github-bot pushed a commit to facebookincubator/velox that referenced this issue Mar 4, 2024
Summary:
Pull Request resolved: #8845

#8845

Upgrade the exchange protocol.  We will poll the remaining data sizes using `ExchangeSource::getDataSizes` from all the producers and schedule the actual data fetch according to the memory budget.  This reduces the wait-for-data time significantly in some cases: a query that was timing out after 1 hour on a 600-node cluster now finishes in 4.72 minutes on a 400-node cluster (Java takes 36.08 minutes on a 1000-node cluster).

See prestodb/presto#21926

Reviewed By: amitkdutta, mbasmanova

Differential Revision: D54027466

fbshipit-source-id: 5f0274132aee27a61774bdc0287c35834285a1bd
@tdcmeehan
Contributor

Have we made the changes to use the new protocol that relies on HEAD to get output buffer metadata in C++?

@Yuhta
Contributor

Yuhta commented May 6, 2024

@tdcmeehan Not yet. What is the coordinator version that has the HEAD support? We need to check that all our internal coordinators have been upgraded to a version newer than that.

@tdcmeehan
Contributor

@Yuhta this has been present for any version cut after March 3rd. I believe given your internal release cadence, it would be safe to assume it's present.

Unrelated, but I also believe you're already consuming the X-Presto-Buffer-Remaining-Bytes header from the GETs, so I believe this is just removing the X-Presto-Get-Data-Size header.

@Yuhta
Contributor

Yuhta commented May 7, 2024

Yes we just need to switch to HEAD and remove X-Presto-Get-Data-Size on consumer side. The producer side accepts requests on both endpoints.

@Yuhta
Contributor

Yuhta commented May 8, 2024

@tdcmeehan I tried to switch the implementation in #22697, but it seems to hit some bug on the coordinator side: the coordinator keeps sending back 0 remaining bytes without X-Presto-Buffer-Complete. The following is from https://app.circleci.com/pipelines/github/prestodb/presto/15575/workflows/d18f7aaf-3d95-49f0-87dc-11aa288f4371/jobs/61026/artifacts parallel run 3, worker 1:

I20240508 20:30:44.935407  3426 PrestoExchangeSource.cpp:291] Remaining bytes: 313
I20240508 20:30:44.935550  3424 PrestoExchangeSource.cpp:189] Fetching data from 127.0.0.1:8081 /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0
I20240508 20:30:44.937670  3426 PrestoExchangeSource.cpp:276] Fetched data for /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0: 230 bytes
I20240508 20:30:44.937734  3426 PrestoExchangeSource.cpp:291] Remaining bytes: 0
I20240508 20:30:44.937784  3426 PrestoExchangeSource.cpp:351] Enqueuing page for /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0: 230 bytes
I20240508 20:30:44.937907  3426 PrestoExchangeSource.cpp:422] Sending ack /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/1/acknowledge
I20240508 20:30:44.937968  3422 PrestoExchangeSource.cpp:186] Get data sizes from 127.0.0.1:8081 /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/1
I20240508 20:30:44.939229  3424 PrestoExchangeSource.cpp:442] Ack 204
I20240508 20:30:44.940085  3424 PrestoExchangeSource.cpp:276] Fetched data for /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/1: 0 bytes
I20240508 20:30:44.940120  3424 PrestoExchangeSource.cpp:291] Remaining bytes: 0
I20240508 20:30:44.940227  3426 PrestoExchangeSource.cpp:186] Get data sizes from 127.0.0.1:8081 /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0
I20240508 20:30:44.941635  3424 PrestoExchangeSource.cpp:276] Fetched data for /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0: 0 bytes
I20240508 20:30:44.941684  3424 PrestoExchangeSource.cpp:291] Remaining bytes: 0
I20240508 20:30:44.941787  3426 PrestoExchangeSource.cpp:186] Get data sizes from 127.0.0.1:8081 /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0
I20240508 20:30:44.943198  3424 PrestoExchangeSource.cpp:276] Fetched data for /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0: 0 bytes
I20240508 20:30:44.943238  3424 PrestoExchangeSource.cpp:291] Remaining bytes: 0
I20240508 20:30:44.943327  3426 PrestoExchangeSource.cpp:186] Get data sizes from 127.0.0.1:8081 /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0
I20240508 20:30:44.944546  3424 PrestoExchangeSource.cpp:276] Fetched data for /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0: 0 bytes
I20240508 20:30:44.944590  3424 PrestoExchangeSource.cpp:291] Remaining bytes: 0
I20240508 20:30:44.944684  3426 PrestoExchangeSource.cpp:186] Get data sizes from 127.0.0.1:8081 /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0
I20240508 20:30:44.945932  3424 PrestoExchangeSource.cpp:276] Fetched data for /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0: 0 bytes
I20240508 20:30:44.945966  3424 PrestoExchangeSource.cpp:291] Remaining bytes: 0
I20240508 20:30:44.946046  3426 PrestoExchangeSource.cpp:186] Get data sizes from 127.0.0.1:8081 /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0
I20240508 20:30:44.947177  3424 PrestoExchangeSource.cpp:276] Fetched data for /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0: 0 bytes
I20240508 20:30:44.947213  3424 PrestoExchangeSource.cpp:291] Remaining bytes: 0
I20240508 20:30:44.947302  3426 PrestoExchangeSource.cpp:186] Get data sizes from 127.0.0.1:8081 /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0
I20240508 20:30:44.948417  3424 PrestoExchangeSource.cpp:276] Fetched data for /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0: 0 bytes
I20240508 20:30:44.948449  3424 PrestoExchangeSource.cpp:291] Remaining bytes: 0
I20240508 20:30:44.948565  3426 PrestoExchangeSource.cpp:186] Get data sizes from 127.0.0.1:8081 /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0
I20240508 20:30:44.949687  3424 PrestoExchangeSource.cpp:276] Fetched data for /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0: 0 bytes
I20240508 20:30:44.949720  3424 PrestoExchangeSource.cpp:291] Remaining bytes: 0
I20240508 20:30:44.949806  3426 PrestoExchangeSource.cpp:186] Get data sizes from 127.0.0.1:8081 /v1/task/20240508_203044_00001_w7icq.3.0.0.0/results/0/0
...

@tdcmeehan
Contributor

@Yuhta I looked into this, and found that it's because there's a difference in behavior between buffer info when it's reported (e.g. in TaskInfo) versus when it's reported back as part of the exchange data. Basically the coordinator is expecting at least one GET to finish off the buffer and transition it to finished, while C++ is expecting this to be reported from the header. I opened #22711 which I think will fix it.

tdcmeehan added a commit that referenced this issue Jun 5, 2024
The new exchange protocol in #21926 expects the HEAD
request to indicate the buffer is completed when
the noMorePages signal has been set and the buffer is
empty.  This commit fixes the current behavior, which
is that the coordinator expects at least one GET to
transition the output buffer to completed state before
returning that the buffer has been completed.
@tdcmeehan
Contributor

@Yuhta #22711 has been merged for some time; can we give #22697 another try?

@Yuhta
Contributor

Yuhta commented Aug 29, 2024

@tdcmeehan Seems there is still some issue: the coordinator keeps reporting that there is some data to fetch, so the query never ends: https://app.circleci.com/pipelines/github/prestodb/presto/19551/workflows/2f010ea7-2634-4505-a5a1-aa3571f9a7be/jobs/78435/artifacts
https://output.circle-artifacts.com/output/job/a3fe959c-4463-4c32-85e7-85135cceeca0/artifacts/0/tmp/PrestoNativeQueryRunnerUtils/worker8375275480376116402/worker.1.out

I20240829 20:25:37.540262  2550 PrestoExchangeSource.cpp:188] Fetching data from 127.0.0.1:8081 /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/0
I20240829 20:25:37.540719  3292 PrestoExchangeSource.cpp:275] Fetched data for /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/0: 516 bytes
I20240829 20:25:37.540736  3292 PrestoExchangeSource.cpp:290] Remaining bytes: 0
I20240829 20:25:37.540755  3292 PrestoExchangeSource.cpp:352] Enqueuing page for /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/0: 516 bytes
I20240829 20:25:37.540830  2550 PrestoExchangeSource.cpp:185] Get data sizes from 127.0.0.1:8081 /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/1
I20240829 20:25:37.541275  3292 PrestoExchangeSource.cpp:275] Fetched data for /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/1: 0 bytes
I20240829 20:25:37.541301  3292 PrestoExchangeSource.cpp:290] Remaining bytes: 599
I20240829 20:25:37.541357  3267 PrestoExchangeSource.cpp:188] Fetching data from 127.0.0.1:8081 /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/0
I20240829 20:25:37.541846  3292 PrestoExchangeSource.cpp:275] Fetched data for /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/0: 516 bytes
I20240829 20:25:37.541862  3292 PrestoExchangeSource.cpp:290] Remaining bytes: 0
I20240829 20:25:37.541879  3292 PrestoExchangeSource.cpp:352] Enqueuing page for /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/0: 516 bytes
I20240829 20:25:37.541949  3267 PrestoExchangeSource.cpp:185] Get data sizes from 127.0.0.1:8081 /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/1
I20240829 20:25:37.542418  3292 PrestoExchangeSource.cpp:275] Fetched data for /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/1: 0 bytes
I20240829 20:25:37.542433  3292 PrestoExchangeSource.cpp:290] Remaining bytes: 599
I20240829 20:25:37.542484  3268 PrestoExchangeSource.cpp:188] Fetching data from 127.0.0.1:8081 /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/0
I20240829 20:25:37.542917  3292 PrestoExchangeSource.cpp:275] Fetched data for /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/0: 516 bytes
I20240829 20:25:37.542937  3292 PrestoExchangeSource.cpp:290] Remaining bytes: 0
I20240829 20:25:37.542956  3292 PrestoExchangeSource.cpp:352] Enqueuing page for /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/0: 516 bytes
I20240829 20:25:37.543026  3268 PrestoExchangeSource.cpp:185] Get data sizes from 127.0.0.1:8081 /v1/task/20240829_201537_00000_c4u3h.1.0.0.0/results/0/1

@tdcmeehan
Contributor

@Yuhta this is because #23097 was merged afterward. There's a corresponding change in documentation with #23237. This changed the behavior of the exchange protocol without making the same update in the Java code. Not to worry, I'll update the Java side and revert here.
