pubsub: Many acks/modacks could cause other acks/modacks to be delayed #9727
Comments
I have some high-level thoughts before doing a thorough investigation. How large are your messages on average? StreamingPull streams support up to 10 MB/s each, so having 40 goroutines (the equivalent of 40 open streams) might be overkill. If you suspect network bottleneck issues, reducing this could help.
It's possible the decrease comes from streams restarting, as streams are periodically terminated, but you would need to open a ticket to provide project ID / subscription ID for me to look into this.
We currently have a beta release of the library that includes OpenTelemetry tracing that would potentially provide insight into this. Given your very high throughput, you would have to significantly reduce the sampling rate to make sure you don't hit the tracing quota if using Cloud Trace. This might not be an issue if you use your own backend. Instructions to enable tracing live here for now.
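A minimal sketch of wiring up OpenTelemetry with an aggressively reduced sampling rate and the Cloud Trace exporter; the 0.0001 ratio and the project ID are placeholders, and the exact option for turning tracing on in the pubsub beta is in the linked instructions:

```go
import (
    "context"

    texporter "github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/trace"
    "go.opentelemetry.io/otel"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
    // Export spans to Cloud Trace; the project ID is a placeholder.
    exp, err := texporter.New(texporter.WithProjectID("my-project"))
    if err != nil {
        return nil, err
    }
    // Keep only a tiny fraction of traces so a high-throughput subscriber
    // doesn't blow through the Cloud Trace ingestion quota.
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.0001))),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}
```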
Mean is 768 bytes and the 95th percentile is 1 KB. 40 is derived from 10 x NumCPU. We will try lowering that.
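For reference, a minimal sketch of lowering NumGoroutines on the subscriber (the value 4 is illustrative, not a recommendation from this thread):

```go
func receiveWithFewerStreams(ctx context.Context, client *pubsub.Client) error {
    sub := client.Subscription("my-subscription") // subscription ID is a placeholder
    // Each receive goroutine maintains its own StreamingPull stream, so lowering
    // this from the 10 x NumCPU-derived 40 reduces the number of open streams.
    sub.ReceiveSettings.NumGoroutines = 4
    return sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
        // ... handle the message ...
        m.Ack()
    })
}
```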
We will try that out and report back.
@hongalex Here's a trace where we called the Ack after 1.7s but then the ack wasn't started for another 6s and then the ack itself took 8s: Here's one where the ack took 12s to start (maybe because the modack took 10s to end?): Another where the ack took 10s to start and I'm not sure why: There are already a lot of these cases and we only have the sampling set to record a few per minute. This trace is a bit confusing since we set the
@hongalex Also, for added context, our handler batches messages in memory and uploads batches to GCS in chunks of ~25k. So we are acking ~25k messages at once, which might exacerbate this problem.
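A rough sketch of that batch-then-ack pattern, with a hypothetical helper (uploadBatchToGCS) and no error handling; the point is that ~25k Ack calls arrive at the client's sender at essentially the same moment:

```go
func receiveAndBatch(ctx context.Context, sub *pubsub.Subscription) error {
    var (
        mu      sync.Mutex
        pending []*pubsub.Message
    )
    return sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
        mu.Lock()
        pending = append(pending, m)
        if len(pending) < 25000 {
            mu.Unlock()
            return // the message stays outstanding until its batch is flushed
        }
        batch := pending
        pending = nil
        mu.Unlock()

        uploadBatchToGCS(batch) // hypothetical helper; on failure you would Nack instead
        // Acking ~25k messages in one burst queues roughly ten Acknowledge
        // RPCs' worth of ackIDs behind the client's single sender.
        for _, msg := range batch {
            msg.Ack()
        }
    })
}
```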
Thanks for the screenshots, this helps a lot with understanding the problem. Are these values already including the change to lower NumGoroutines?
So I think the delayed acks are a result of what you wrote in the first comment: when batching large groups of ackIDs, any delay will result in a lot of further delay for messages later in the batch. Could you confirm what message batch size you're seeing? This can be seen on the "modack start" and "ack end" events' attributes.
That's correct. The first modack is the receipt modack that happens regardless, and subsequent modacks happen on a regular schedule.
Interesting, so maybe the bottleneck isn't the number of streams opened, but the number of modacks sent. If the batches coming in are small, then a large number of modacks will be sent relative to acks. Can you explain how your Mod Ack Count graph is being populated?
I'm not sure why there aren't any grpc traces. We aren't setting
Both of those graphs are from the OpenCensus data from the pubsub client. We have this function:

```go
var registerPubSubOpenCensusViewsOnce sync.Once

func registerPubSubOpenCensusViews() {
    registerPubSubOpenCensusViewsOnce.Do(func() {
        // Register the client library's default publish and subscribe views.
        for _, v := range pubsub.DefaultPublishViews {
            lmetric.RegisterOpenCensusView("", v)
        }
        for _, v := range pubsub.DefaultSubscribeViews {
            lmetric.RegisterOpenCensusView("", v)
        }
    })
}
```

which will register each of those views with our own lmetric package.
So the generated gRPC traces are children of batch ack/modack spans that live in their own trace, outside of the message trace spans you've been screenshotting. The reason is a bit complicated, but it has to do with how parent/child relationships work in trace views. You should be able to see spans named
StreamingPull traces should also be generated, but they're not very useful, since they only represent the duration of the stream being opened, and not more specific information.
Ah, the reason our sampler was not catching these was because I thought they'd have
I'll make some changes to the sampler to include these other spans.
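One way to do that, sketched with the OpenTelemetry SDK; the span-name check and the base ratio are assumptions, not the sampler actually used here:

```go
// Uses go.opentelemetry.io/otel/sdk/trace (as sdktrace) and go.opentelemetry.io/otel/trace.
type ackAwareSampler struct {
    base sdktrace.Sampler // e.g. sdktrace.TraceIDRatioBased(0.0001)
}

func (s ackAwareSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
    // Always record the batch ack/modack spans, which live in their own traces
    // and are only connected to the message spans via links. The exact span
    // names should be checked against what the library actually emits.
    if strings.HasSuffix(p.Name, " ack") || strings.HasSuffix(p.Name, " modack") {
        return sdktrace.SamplingResult{
            Decision:   sdktrace.RecordAndSample,
            Tracestate: trace.SpanContextFromContext(p.ParentContext).TraceState(),
        }
    }
    // Everything else falls back to the ratio-based sampler.
    return s.base.ShouldSample(p)
}

func (s ackAwareSampler) Description() string { return "ackAwareSampler" }
```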
That's good feedback, let me get these added.
I think the next thing is to figure out if Pub/Sub is seeing the same high modack latencies on the server side. I suspect this isn't a bug with the server, but rather either a throttled network or, perhaps less likely, a bug in the client library; still, it wouldn't hurt to check if you can open a support ticket.
@hongalex The case number is 50592102. Also, there are a few spans I found where the grpc is significantly delayed:
Then there are some like:
I'm not sure if that indicates some kind of client-side throttling is happening.
So I understand that my colleagues are investigating the bug with the message IDs you provided. In the meantime, I'm still trying to rule out what's going on from the client library perspective. From your latest message (and also earlier messages), it is a bit concerning that the batch is only comprised of a single message (ack ID). I was able to confirm that Pub/Sub should return, on a best-effort basis, as many messages as are available. Some open questions I have:
I plan to create some kind of reproduction with the numbers you gave me to see if I can repro the behavior above.
Some more investigations
So I realized this might be happening because of how flow control works. It seems that maybe one or two messages are acked at a time, which allows a few more messages to come through the stream (which issues another modack RPC). I still need to do more investigation to see if this can be mitigated.
It's hard for me to connect a subscribe span to a modack/ack span but it's easy to go the other way because there's a link. Is there anything we can do to make it easier to navigate both ways (subscribe -> modack/ack and modack/ack -> subscribe) with links?
I'm not sure if this is the reason, because we are typically always acking in batches of ~25000, which should translate to ~9 Acknowledge calls containing 2500 IDs each plus a final one with the leftover. The GCP Console's trace explorer isn't complex enough to let me graph the
I think the reason we're getting batches of 1-2 messages is because we're publishing in 1-2 message batches. My naive understanding of the server architecture is that the streaming subscribe batches are equivalent to the publish batches if our subscribers are keeping up. We have 2 subscribe VMs but we have >50 publishing VMs, each publishing messages coming from end-user HTTP requests which typically only contain a few messages. We also have the following settings on the publishing side:
I don't have the tracing beta version on the publishing side since I can't put beta code in that application, so I can't tell you exactly what the batches look like on the publishing side. Depending on the region, the servers are publishing between 150 and 250 messages/second. One thing we could try is raising the
Also, I noticed an odd pattern several times with the ModAckDeadline spans: it looks like the latencies start to increase until they plateau at 60 seconds (the client-side timeout) and then recover later. Almost like there's something happening on the server where it starts to respond slower and slower and then either crashes or recovers somehow.
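For reference, these are the publisher-side batching knobs being discussed; the values below are placeholders, not the settings actually in use (which were not captured in this thread):

```go
topic := client.Topic("my-topic") // topic ID is a placeholder
// Higher thresholds let the publisher coalesce more messages per Publish RPC.
topic.PublishSettings.CountThreshold = 100                   // max messages per batch
topic.PublishSettings.DelayThreshold = 50 * time.Millisecond // max wait before flushing a partial batch
topic.PublishSettings.ByteThreshold = 1 << 20                // max bytes per batch (1 MiB)
```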
Hm, that's fair. I had contrived a toy example where processing times were randomized between 5 and 9 seconds, which resulted in acks completing randomly within that interval and less batching.
I would hold off on making publisher-side changes for now. I was recently told that Pub/Sub servers will try to deliver as many messages as possible in a single StreamingPull response, even if the messages are published in small batches. It used to be that publish batches were more tightly coupled with subscribe batches, but that was before server-side flow control was introduced. I'll try reproducing with a large number of smaller publishers to see if that changes anything on my end, but aside from that I wasn't able to get my modack latency high.
Edit: forgot to respond:
The OpenTelemetry spec only recently started allowing links to be added after span creation, so this bidirectional linking wasn't previously possible. I'm looking into adding it now.
@jameshartig apologies for the delay here. Would you be willing to try this fix that batches the initial receipt modacks? In cases where you're receiving a large number of small batched StreamingPull responses, this should reduce the number of modack RPCs being issued and hopefully lower the amount of network overhead those RPCs induce. Note, this branch is separate from the OpenTelemetry preview and builds upon the 1.37.1 release of the library.
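One way to try an unreleased branch like that is a replace directive pointing at a local checkout of it; the path below is a placeholder:

```
// in go.mod
replace cloud.google.com/go/pubsub => ../google-cloud-go/pubsub
```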
I'm willing to test it, but I think modacks need to be in their own goroutine because, even in your proposed change, they could still be held up by a long-running ack. If we had the acknowledge deadline at 10 seconds, then we'd have only 10 seconds to send the receipt modack, but if a large ack took 10+ seconds we would miss all of those receipt modacks. Even a separate goroutine isn't a perfect solution because we've already found modack calls that take 60+ seconds. However, if you think the underlying delay is a network/grpc contention issue and not something on the server side, then this could help.
@hongalex The
Update:
Yeah, so the biggest pain point I saw was the number of receipt modacks that were being issued, since they aren't batched. The reason I came back to this from a client library perspective is because I was working with someone on the backend side and couldn't see the acks/modacks actually reach the server, hinting there was network/CPU contention. I'm curious to see if the fix holds stable. Your previous two graphs are great insights (thanks for sharing those) but let me know if anything changes. I'll start working on getting this fix merged. Also, batching of receipts only happens with non-exactly-once delivery subscriptions. I don't think you're using that, but thought I'd let you know anyway.
That makes a lot of sense. Do you have any thoughts on whether it's a Linux kernel TCP contention issue, a Go scheduling issue, or something in the Go grpc library? I'm not familiar enough with grpc to know if there's potentially a max-outstanding-in-flight limit or something similar that we might've been hitting.
To be honest, I'm not confident where the bottleneck is. I suspect it's a gRPC issue, but I don't have the data to back that up. I remember a while back, I had suggested increasing the gRPC connection pool.
I missed that update to the docs, so we just have it at the default. I can bump it up to 8 and see how that affects the ack latencies.
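A minimal sketch of raising the gRPC connection pool when constructing the client (the project ID is a placeholder; 8 is the value being tried here):

```go
import (
    "context"

    "cloud.google.com/go/pubsub"
    "google.golang.org/api/option"
)

func newPubSubClient(ctx context.Context) (*pubsub.Client, error) {
    // Each connection in the pool is a separate HTTP/2 connection, so a larger
    // pool spreads StreamingPull traffic and ack/modack RPCs over more sockets.
    return pubsub.NewClient(ctx, "my-project",
        option.WithGRPCConnectionPool(8),
    )
}
```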
Yeah, so I'm inclined to make all of the ack/modack RPCs concurrent (within a single invocation of
Let me know what's next on my end for debugging. Happy to help. We might need the tracing changes though, which means maybe we need to wait until your batching branch is merged?
@jameshartig Sorry for the delay. Can you try pulling from
Thanks for reporting back. I am a bit confused as to why increasing the number of RPCs in flight does not decrease latency. Something I was hoping you could try is lowering
You can just use the grpc connection pool = 8 for testing this.
So these two values are technically separate but related in a somewhat confusing way.
On the last point, I'm going to do some testing to see if setting NumGoroutines or MaxOutstanding too high will result in ack expirations. I have a suspicion that excess messages could be flow controlled (corresponding to the concurrency control span) on the client side and then expire. This is the main motivation for suggesting you decrease
I think that's what I was looking for. Ideally we would not expose any knobs to our developers, but we've had a lot of use cases to expose
Interesting, that seems like a big flaw. I can see, strictly speaking, why it would be a breaking change, but that also feels like an implementation detail that isn't documented either way.
That's very interesting and something I hadn't considered. But isn't the client doing modacks even if messages are blocked in the flow controller? I guess they could exceed
Thanks for your patience with this issue. I'm glad to see that lowering
Yeah, it's something we thought about in the past but haven't finalized a concrete decision on how we want to fix this. I'll prioritize revisiting this, now that we have some stats to back up this claim.
Yes, modacks start happening soon after messages get received, before flow control.
Given that you are able to process messages in under 10s (in your most recent screenshots), this will probably just result in higher ack latency, which might be undesirable. It's better to have the client process the messages as soon as they come in.
Yeah, it's worth considering disabling client-side flow control,
That's not controllable for receiving, right? That would need to be a change in the client to make that customizable.
One feature request out of this would be to make the
A dynamic level of
Agreed that the default of 10 doesn't make as much sense and should be significantly lower. This didn't use to matter until server-side flow control was added. The alternative is to introduce a mechanism that fixes the outstanding-messages-per-stream behavior, making that opt-in and eventually deprecating the old behavior altogether.
You're right. It's something I'll consider adding, like a
Edit: Ignore the above. As I finished typing this, I realized the client is then limited to 50 MB of messages simultaneously rather than 250 MB of messages, which is what you need to get 10k messages/second. The math above runs into the same issue of
One of the issues you were running into is that the client library pulls too many messages at once and gets stuck holding the messages while waiting for flow control resources. Disabling both flow control settings and relying purely on the stream's limit of 10 MB/s might solve your problems temporarily until we can add in the other settings like LimitExceeded behavior + sensible flow control limits per stream.
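For completeness, a sketch of what disabling client-side flow control looks like; the Go client treats negative values as "no limit", leaving the server's per-stream throughput cap as the only throttle:

```go
sub := client.Subscription("my-subscription") // subscription ID is a placeholder
// Negative values disable the corresponding client-side flow control limit.
sub.ReceiveSettings.MaxOutstandingMessages = -1
sub.ReceiveSettings.MaxOutstandingBytes = -1
```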
Does this work? https://issuetracker.google.com/issues/341124500
I think that's kind of what we're doing by setting
Perfect thanks.
Since it seems you're OK with a slight increase in ack latency (up to MaxExtension), that seems like a decent tradeoff. If you want to minimize ack latency (no waiting for client-side flow control), you will probably want to play around with scaling up and disabling client-side flow control altogether. I'll leave this current issue open until I merge in the change that batches the receipt modacks / makes modacks/acks concurrent.
Thanks again for all the help here @hongalex!!
No, thank you for the detailed graphs and reproduction. We're hoping to introduce a new setting that will allow configuring the number of callbacks independently of flow control, which should also help. I can tag you when that gets implemented.
Great! Additionally, should we store the time a message was received in the iterator so we can reject it if it's been sitting in the client for longer than MaxExtension?
Yeah, it's something I'm working through. Another behavior I'm thinking about is that, currently, messages that are expired are still being processed. The advantage of this is that messages can still be "acked" after they expire, but perhaps a better solution would be to stop processing and nack the message on expiration instead.
Client: PubSub
Environment: CentOS on GCE (t2d-standard-4)
Go Environment:
Expected behavior
I expect to see 0 in subscription/expired_ack_deadlines_count, assuming that my AckCount matches my PullCount.
Actual behavior
We periodically see a huge rate of expired acks, as high as 15k/s. We are currently acknowledging 20k messages per second across 2 GCE instances (~10k/s per instance) and pulling 20k messages per second across those instances as well. I would expect then that we shouldn't really be seeing any expired acks.
I don't know the actual distribution of messages across the 40 goroutines, but if some of them are getting most of the messages then it's possible for the ackIDBatchSize to be exceeded. When it's exceeded, sendModAck and sendAck both loop internally until all of the ackIDs have been sent. We don't have visibility into the distribution of time it takes to Acknowledge 2500 ackIDs, but we can see from the GCP console that the Acknowledge method has a 95th percentile latency of over 100ms. Separately, we are calling ModifyAckDeadline (which at the 95th percentile takes 40ms) with 16k IDs per second, which needs 7 calls per instance and could take more than 250ms. Either of those would end up delaying the other, since there's only a single sender goroutine, which could be contributing to our expired acks issue.
Additionally, since we aren't using exactly-once delivery, there's no way for us to measure how long it took from when we called Ack to when the request was sent to Pub/Sub. One way to fix that would be if the *AckResult returned from AckWithResult would actually be Ready once the message is sent, even if you're not using exactly-once delivery.
Screenshots
Something else that's interesting is that the latencies shown in the GCP Console do not match our application-level metrics (which measure from the start of the Receive callback to the Ack/Nack function call) at all:
vs
This is what led us to investigate whether there was some sort of delay between when we Ack a message and when the underlying Acknowledge is sent by the client.
ModAckCount
(the reason for the increase at 19:30 UTC is that we increased MaxOutstandingMessages from 25000 to 40000)
Finally, the increase in expired acks happened after a sharp decrease in the StreamingPull response for which I have no explanation unless some change was made on Pub/Sub's side. It's not clear if this might mean that there's a higher concentration of messages in individual goroutines.
Additional context
We don't have any visibility into the 99th percentile modack extension being used and that would have been helpful in debugging.