Google Cloud Pub/Sub triggers high latency on low message throughput #925
@andryichenko Sorry for the late reply on this, but could it be that the publisher piece you showed above is queueing the messages for batch send? You might try decreasing the max batching timeout:
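The original snippet wasn't captured in this export; a minimal sketch of that tweak, assuming the Node.js client's `batching` topic option (the values and the commented-out topic name are illustrative):

```javascript
// Hedged sketch: shrink the batching window so single messages go out
// promptly instead of waiting for a batch to fill.
const batching = {
  maxMessages: 1,      // flush as soon as one message is queued
  maxMilliseconds: 10, // ...or after at most 10 ms, whichever comes first
};

// Usage (assumes @google-cloud/pubsub is installed):
// const {PubSub} = require('@google-cloud/pubsub');
// const topic = new PubSub().topic('my-topic', {batching});
// await topic.publish(Buffer.from('ping'));
module.exports = { batching };
```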
The default is already low (100 ms), but it's possible something is being set incorrectly. It's also possible that the library isn't properly obeying that number in your case, in which case I need to do some more debugging. The other possibility is that it's getting hung up in the machinery somewhere between your client and your GCF function executing. I can pass it off to the right people in that case, but we should eliminate the client library first.
@feywind Thanks for the reply!
Unfortunately, it didn't really seem to help.
We also face the same problem: when we send 1-2 messages per minute, the latency is around 1200 ms, but once we push more messages it gets faster and faster (down to around 10 ms). It looks like there is some caching mechanism for pushing the messages.
I'm not as knowledgeable about what happens after it leaves the Pub/Sub library, so it's possible something down in gax/gRPC is doing its own buffering. I can ask the maintainers of those libraries if there's anything there. But as for the Pub/Sub library itself, it should be obeying that batching configuration and not pausing. I'll take a peek and see if there's possibly a bug there, such that it's not using the right values.
@feywind did you find any solution?
I missed that there were two issues potentially related to this: #1087. Do you know if 2.5.0 still has the issue?
Seems like this has been merged but hasn't solved the latency problem.
@dozie Thanks for the update! @alexander-fenster Sorry to ping you again :D but could there be something introducing latency and doing its own batching of calls at the gax level? If setting the batch size to 1 for the Pub/Sub library doesn't reduce the latency, it seems like it might be something at a lower level. |
@feywind I can confirm that the batch size setting is set to 1 message, but the latency didn't decrease.
We are experiencing a similar issue: high latency (5-10 seconds, inconsistently) at low message throughput. We keep tracking this delay. I have been looking for a Pub/Sub SLA that would guarantee that "queue time" but could not find one.
@kamalaboulhosn Since there's a question here about the server side... Is it possible that the service is doing some sort of batching or delaying to queue up more messages? |
@kamalaboulhosn Gentle ping on this since it seems like it might be a service question at this point. |
Same issue here, but the latency is around 5 minutes. We have two topics between a GAE app and a Cloud Function, forming a request-response cycle. The latency shows up only on the GCF -> GAE topic. The GAE app runs the Python client; the GCF runs this Node.js library. The function is almost identical to what @andryichenko has posted. Not sure how to debug the latency on the way back to GAE other than looking at HTTP log timestamps. Things I've tried:
Things I've considered but didn't have time to test:
I've enabled some additional logging, and it looks like the function exits almost instantly, leaving the actual (async) code to run in the background for some time. The last line comes from this code:

```javascript
const publish = async (event) => {
  const client = new PubSub({ apiEndpoint: config.PUB_SUB_API_ENDPOINT });
  const data = Buffer.from(JSON.stringify(event));
  const topic = client.topic(topicName, { batching: { maxMessages: 1 } });
  const messageId = await topic.publish(data, metadata);
  logger.log('Message published', messageId);
  await topic.flush();
  logger.log('Flushed', messageId);
  return messageId;
};
```

Some observations:
Ok, mystery solved! The main cloud function handler was wrapped in an async wrapper. I've temporarily removed the wrapper, and the whole thing now takes 7 seconds to run. And that's including multiple external HTTP requests 🎉
@killthekitten Ah, yeah. I've seen a few issues in the past with GCF losing track of async happening in the background. I'm glad you found something to help there. I'm not sure how many of the comments above were using GCF, but it looks like the original poster was doing so. Is anyone else having a non-GCF issue with this? |
I am experiencing the same, seeing a delay of 15s and more a few times. Most of the time everything works OK. If I don't use
@flunderpero There is currently a known issue in the Pub/Sub service that is causing the 15s delay. The change that caused the issue is being rolled back right now. This issue would only have existed between 2/7/2022 and now. |
@kamalaboulhosn Thanks for letting me know. This actually coincides with when the problems started for us. |
Is there any way we can track that issue? We're seeing these same 15s delays. From what we can see, there hasn't been a change to the service since December according to the Pub/Sub release notes: https://cloud.google.com/pubsub/docs/release-notes. Or are you talking about something else, @kamalaboulhosn?
@ChrisWestcottUK The release notes for the server only track significant API, region availability, and feature availability changes, not every server change that is made. There is no public tracking for all such changes. For tracking issues that affect your projects, it is best to put in a support request so support engineers can follow up. This particular incident should now be resolved.
One thing that appeared to have also occurred at this time was messages published multiple times (we're using an
Doesn't that show that it's important to choose good idempotency keys and not follow the suggestions in the documentation? Or is there something I'm missing? |
Linked to the meta-issue about transport problems: b/242894947 |
I'm seeing this issue as well. I also faced it with Cloud Functions, but I took my Cloud Function code and modified it a bit so I could easily trigger it locally. To me it seems like disabling batching doesn't work properly: batching configurations are respected if publishing fewer messages than the configured maximum. Since the Java Pub/Sub client doesn't suffer from problems like this at all, I'm leaning towards the whole batch scheduling logic in this library being somehow broken. I'll look into this for a while still, but if I can't find the issue really soon, we'll have to rewrite our Cloud Functions in a different language to get acceptable performance, since this issue causes a user-facing performance problem.
Going back to the original issue, I think the problem might be in the way the topic is being instantiated: it's recreated for every publish, which means the topic is never reused and the connection to the server has to be renegotiated every time. It would be better to do this:
If the topic name is not known before the publish and can change, then you should keep a map from topic name to topic object.
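A minimal sketch of that cache. The factory is parameterized here so the pattern runs on its own; with the real client it would be `(name) => client.topic(name)` on an already-constructed `PubSub` instance:

```javascript
// Hedged sketch: memoize topic objects by name so each topic (and its
// underlying connection state) is created once and reused thereafter.
const topicCache = new Map();

function getTopic(createTopic, name) {
  if (!topicCache.has(name)) {
    topicCache.set(name, createTopic(name)); // first use: create and remember
  }
  return topicCache.get(name); // later uses: reuse the same object
}
module.exports = { getTopic, topicCache };
```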
I'm seeing a similar issue over here. My setup is similar to the OP's: a Firebase Cloud Function publishes individual Pub/Sub topic messages. I'm seeing latency in minutes rather than seconds. I've tried setting max messages, but that didn't seem to help.
Anything else I can try or information to share that would help debug this? |
Have you tried debugging it the same way I did in one of the comments above? It could be caused by throttling on a function because of an async wrapper, or any async code that you aren't awaiting properly. The key is to log when the function returns - does it return sooner than the publish() call? |
@killthekitten Great tip, thanks! Turns out that was the problem in my case - I was not awaiting the publish call.
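For anyone debugging the same pattern, the failure mode can be reproduced without Pub/Sub at all; `slowPublish` below is a hypothetical stand-in for `topic.publish()`:

```javascript
// Hedged sketch: a handler that fires publish without awaiting it can
// "finish" before the message is actually sent; awaiting fixes the order.
const events = [];
const slowPublish = () =>
  new Promise((resolve) => setTimeout(() => {
    events.push('published');
    resolve('message-id');
  }, 10));

async function badHandler() {
  slowPublish();       // not awaited: work keeps running after return
  events.push('returned');
}

async function goodHandler() {
  await slowPublish(); // awaited: publish completes before return
  events.push('returned');
}
```

In GCF, any work still running after the handler resolves may be frozen until the next invocation, which is what shows up as seconds or minutes of apparent publish latency.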
Hi all, I just made some updates to the batch flushing logic for publishing, and that's in 3.4.0. I'd be interested to hear if it helps this. |
This issue has covered a lot of different causes and investigations that are not entirely related. The original issue looks to have been an inefficiency in the way the user code was written. Other issues included a transient server-side problem as well as miscellaneous other issues. Going forward, if anyone is still experiencing issues, please file a support case. Thanks!
I'm running a project which publishes messages to a Pub/Sub topic and triggers a background Cloud Function.
I've read that with high volumes of messages it performs well, but with smaller amounts, like hundreds or even tens of messages per second, Pub/Sub may yield high latencies.
Code example to publish message:
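The original snippet wasn't captured in this export; based on the later comments it presumably looked something like this hedged reconstruction. The topic object is injected here so the logic runs without GCP credentials; in real code it would be `new PubSub().topic('my-topic')`:

```javascript
// Hedged reconstruction of the publisher, not the OP's exact code.
async function publishEvent(topic, event) {
  const data = Buffer.from(JSON.stringify(event)); // Pub/Sub expects a Buffer
  const messageId = await topic.publish(data);     // v1.x publish API
  console.log('Published message', messageId);
  return messageId;
}
module.exports = { publishEvent };
```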
Code example of function triggered by PubSub:
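This snippet wasn't captured either; a hedged sketch of what a Pub/Sub-triggered background function typically looks like (the payload arrives base64-encoded in `message.data`; the function name is a placeholder):

```javascript
// Hedged reconstruction, not the OP's exact code: a background function
// that decodes the base64 payload of an incoming Pub/Sub message.
const handleMessage = (message) => {
  const payload = message.data
    ? JSON.parse(Buffer.from(message.data, 'base64').toString())
    : {};
  console.log('Received event', payload);
  return payload;
};
exports.handleMessage = handleMessage; // GCF entry point
```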
Environment details:
- @google-cloud/pubsub version: 1.6

The problem is that when using Pub/Sub with a low throughput of messages (for example, 1 request per second), it sometimes struggles and shows incredibly high latency (up to 7-9 s or more).
Is there a way or workaround to make Pub/Sub perform well every time (50 ms delay or less) even with a small number of incoming messages?
Thanks!