
Google Cloud Pub/Sub triggers high latency on low message throughput #925

Closed
andryichenko opened this issue Mar 23, 2020 · 31 comments
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: question Request for information or clarification. Not an issue.

Comments

@andryichenko

andryichenko commented Mar 23, 2020

I'm running a project that publishes messages to a Pub/Sub topic and triggers a background Cloud Function.

I've read that Pub/Sub performs well with high volumes of messages, but for smaller amounts, like hundreds or even tens of messages per second, it may yield high latencies.

Code example to publish a message:

const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();

async function publishMessage() {
  const topicName = 'my-topic';
  const data = 'Hello, world!'; // placeholder; the real payload is built elsewhere
  const dataBuffer = Buffer.from(data);

  const messageId = await pubSubClient.topic(topicName).publish(dataBuffer);
  console.log(`Message ${messageId} published.`);
}

publishMessage().catch(console.error);

Code example of the function triggered by Pub/Sub:

exports.subscribe = async (message) => {
  const name = message.data
    ? Buffer.from(message.data, 'base64').toString()
    : 'World';

  console.log(`Hello, ${name}!`);
}

Environment details

  • OS: Windows 10
  • Node.js version: 8
  • @google-cloud/pubsub version: 1.6

The problem is that when using Pub/Sub with a low message throughput (for example, one message per second), it sometimes struggles and shows very high latency (up to 7-9 s or more).

Is there a way or workaround to make Pub/Sub perform well every time (50 ms of delay or less) even with a small number of incoming messages?

Thanks!

@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Mar 23, 2020
@meredithslota meredithslota added the type: question Request for information or clarification. Not an issue. label Mar 23, 2020
@feywind
Collaborator

feywind commented May 5, 2020

@andryichenko Sorry for the late reply on this, but could it be that the publisher piece you showed above is queueing the messages for batch send? You might try decreasing the max batching timeout:

pubSubClient.topic(topicName, {
    batching: { maxMilliseconds: 10 }
})

The default is already low (100 ms), but it's possible something is getting set incorrectly. It's also possible that it's not properly obeying that number in your case, in which case I need to do some more debugging.

The other possibility is that it's getting hung up in the machinery somewhere between your client and your GCF function executing. I can pass it off to the right people in that case, but we should eliminate the client library first.

@andryichenko
Author

@feywind Thanks for the reply!
I already tried decreasing the max batching timeout, but it didn't really help. (It seemed to me that messages were going faster, but it still struggled.)
Also, as can be found in this guide, Pub/Sub tries to gather multiple messages before delivering them; in other words, it tries to deliver many messages at once. In this specific case, to achieve a more realistic real-time scenario, a batch size of 1 should be specified, which would cause Pub/Sub to deliver every message separately.
So I also added the maxMessages property to the batching object, like so:

pubSubClient.topic(topicName, {
    batching: {
        maxMessages: 1,
        maxMilliseconds: 10
    }
})

But that didn't seem to help much either.
After some research I came to the conclusion that it may be Cloud Functions causing delays on cold start. So as a workaround I tried making a simple Node.js server that pings the Cloud Functions via Pub/Sub so they stay warm, and it shows much better results now.
I'm still trying to find a good solution for this, though.
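
For reference, the keep-warm workaround boils down to something like the sketch below (a rough approximation, not my exact code; the topic name and interval are placeholders):

const {PubSub} = require('@google-cloud/pubsub');

// Placeholder topic; the real one is whatever the Cloud Function subscribes to.
const topic = new PubSub().topic('my-topic', {
  batching: { maxMessages: 1, maxMilliseconds: 10 }
});

// Publish a small "ping" message periodically so the function stays warm.
setInterval(async () => {
  try {
    const messageId = await topic.publish(Buffer.from('keep-warm ping'));
    console.log(`Ping ${messageId} published`);
  } catch (err) {
    console.error('Keep-warm publish failed', err);
  }
}, 60 * 1000); // e.g. once a minute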

@vyacheslav-kara

We also face the same problem: when we send 1-2 messages per minute, the latency is around 1200 ms, but once we push more messages it becomes faster and faster (around 10 ms). It looks like there is some caching mechanism for pushing the messages.

@feywind
Collaborator

feywind commented May 19, 2020

I'm not as knowledgeable about what happens after it leaves the Pub/Sub library, so it's possible something down in gax/gRPC is doing its own buffering. I can ask the maintainers of those libraries if there's possibly anything there. But as for the Pub/Sub library itself, it should be obeying that batching configuration and not pausing. I will peek at that and see if there's possibly a bug there, such that it's not using the right values.

@andryichenko
Author

@feywind did you find any solution?

@andresmargalef

@feywind did you find any solution?

@feywind
Collaborator

feywind commented Aug 25, 2020

I missed that there were two issues potentially related to this: #1087

Do you know if 2.5.0 still has the issue?

@dozie

dozie commented Sep 11, 2020

Seems like this has been merged and hasn't solved the problem of latency.

@feywind feywind self-assigned this Sep 15, 2020
@feywind
Collaborator

feywind commented Sep 18, 2020

@dozie Thanks for the update!

@alexander-fenster Sorry to ping you again :D but could there be something introducing latency and doing its own batching of calls at the gax level? If setting the batch size to 1 for the Pub/Sub library doesn't reduce the latency, it seems like it might be something at a lower level.

@andryichenko
Author

@feywind I can confirm that the batch size setting is set to 1 message, but the latency didn't decrease.

@xiaowei-routific

xiaowei-routific commented Nov 6, 2020

We are experiencing a similar issue: high latency (5-10 seconds, inconsistently) on low message throughput.
I don't think it's related to the message producer; it may be the Pub/Sub server itself, and here is why:

We keep track of the publish_time on each message, and from the official Google docs we know that publish_time is the time the Pub/Sub server receives the message, so it doesn't matter what you set up on the message producer side. The time between the publish_time and the time our client/message consumer receives the message can be 5-10 seconds when throughput is low, especially in the scenario where a message is sent to Pub/Sub after a long period with no messages.

I have been looking for a Pub/Sub SLA that guarantees that "queue time" but could not find one.
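
For what it's worth, this is roughly how we measure that gap on the consumer side (a simplified sketch; the subscription name is a placeholder):

const {PubSub} = require('@google-cloud/pubsub');

const subscription = new PubSub().subscription('my-subscription');

subscription.on('message', (message) => {
  // publish_time is assigned by the Pub/Sub server when it accepts the message,
  // so the difference to the local receive time is the delivery latency.
  const deliveryLagMs = Date.now() - message.publishTime.getTime();
  console.log(`Message ${message.id} delivered ${deliveryLagMs} ms after publish_time`);
  message.ack();
});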

@feywind
Collaborator

feywind commented Nov 20, 2020

@kamalaboulhosn Since there's a question here about the server side... Is it possible that the service is doing some sort of batching or delaying to queue up more messages?

@feywind
Collaborator

feywind commented Jan 15, 2021

@kamalaboulhosn Gentle ping on this since it seems like it might be a service question at this point.

@killthekitten

killthekitten commented Mar 12, 2021

Same issue here, but the latency is around 5 minutes. We have two topics between a GAE app and a Cloud Function, forming a request-response cycle. The latency shows up only on the GCF -> GAE topic. The GAE app runs the Python client; the GCF is on "@google-cloud/pubsub": "^2.10.0".

The function is almost identical to what @andryichenko has posted. Not sure how to debug the latency on the way back to GAE other than looking at HTTP log timestamps.

Things I've tried:

  • Created a parallel pull subscription to check timestamps on the messages. Same 5 minutes there.
  • Changed batching to maxMessages: 1; same behavior.

Things I've considered but didn't have time to test:

  • Forcing flush() on the Cloud Function before it ends. This is based on the assumption that GCF blocks the CPU for any background work that hasn't terminated after the function is done.
  • Recreating the topic/subscriptions.

@killthekitten

I've enabled some additional logging, and it looks like the function exits almost instantly, leaving the actual (async) code to run in the background for some time:

[screenshot: gcf-pubsub-logs]

The last line comes from this code:

const publish = async (event) => {
  const client = new PubSub({ apiEndpoint: config.PUB_SUB_API_ENDPOINT });

  const data = Buffer.from(JSON.stringify(event));
  const topic = client.topic(topicName, { batching: { maxMessages: 1 } });
  const messageId = await topic.publish(data, metadata);
  logger.log('Message published', messageId);

  await topic.flush();
  logger.log('Flushed', messageId);

  return messageId;
};

Some observations:

  • The latency up until (and including) the line 2021-03-12 12:45:49.244 is expected, but the 2-minute gap after that line isn't. I attribute the gap to the code above. In subsequent runs the latency on this step fluctuates around the 1-2 minute mark.
  • Interestingly, the call to .flush() fails with 3 INVALID_ARGUMENT: The value for message_count is too small. You passed 0 in the request, but the minimum value is 1.
  • Looks like the awaits are not respected by the cloud function runner.

@killthekitten

killthekitten commented Mar 12, 2021

OK, mystery solved!

The main Cloud Function handler was wrapped in Sentry.GCPFunction.wrapEventFunction, which somehow loses track of the returned promises. In turn, that led to the function being throttled and the latency skyrocketing.

I've temporarily removed the wrapper, and the whole thing now takes 7 seconds to run. And that's including multiple external HTTP requests 🎉

@feywind
Collaborator

feywind commented Mar 16, 2021

@killthekitten Ah, yeah. I've seen a few issues in the past with GCF losing track of async happening in the background. I'm glad you found something to help there.

I'm not sure how many of the comments above were using GCF, but it looks like the original poster was doing so. Is anyone else having a non-GCF issue with this?

@feywind feywind added the status: investigating The issue is under investigation, which is determined to be non-trivial. label Jan 26, 2022
@flunderpero

I am experiencing the same thing: a delay of 15 s or more, a few times. Most of the time everything works OK. If I don't use an orderingKey there seems to be no delay, after some quick testing (we rely on orderingKey, so I did not investigate further). I re-created all subscriptions and topics and everything seems to be fine now. I started seeing these problems on 2022-02-09. Perhaps something changed in the Pub/Sub system?
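
For context, our ordered publishing looks roughly like the sketch below (simplified; the topic name, key, and regional endpoint are placeholders):

const {PubSub} = require('@google-cloud/pubsub');

// Ordering keys generally require a regional endpoint and messageOrdering enabled.
const pubsub = new PubSub({ apiEndpoint: 'us-central1-pubsub.googleapis.com:443' });
const topic = pubsub.topic('my-ordered-topic', { messageOrdering: true });

async function publishOrdered(key, payload) {
  const messageId = await topic.publishMessage({
    data: Buffer.from(payload),
    orderingKey: key
  });
  console.log(`Published ${messageId} with ordering key ${key}`);
  return messageId;
}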

@kamalaboulhosn
Contributor

@flunderpero There is currently a known issue in the Pub/Sub service that is causing the 15s delay. The change that caused the issue is being rolled back right now. This issue would only have existed between 2/7/2022 and now.

@flunderpero

@kamalaboulhosn Thanks for letting me know. This actually coincides with when the problems started for us.

@ChrisWestcottUK

ChrisWestcottUK commented Feb 15, 2022

@flunderpero There is currently a known issue in the Pub/Sub service that is causing the 15s delay. The change that caused the issue is being rolled back right now. This issue would only have existed between 2/7/2022 and now.

Is there any way that we can track that issue as we're seeing these same 15s delays? From what we can see there hasn't been a change to the service since December according to the Pub/Sub release notes: https://cloud.google.com/pubsub/docs/release-notes

Or are you talking about something else, @kamalaboulhosn?

@kamalaboulhosn
Contributor

@ChrisWestcottUK The release notes for the server only track significant API/region availability/feature availability changes, not every server change that is made. There is no public tracking for all of these such changes. For tracking issues if they affect your projects, it is best to put in a support request so support engineers can follow up. This particular incident should now be resolved.

@ChrisWestcottUK

ChrisWestcottUK commented Feb 16, 2022

One thing that appeared to have also occurred at this time was messages being published multiple times (we're using an orderingKey).

Doesn't that show that it's important to choose good idempotency keys and not follow the suggestions in the documentation? Or is there something I'm missing?

@feywind feywind added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed status: investigating The issue is under investigation, which is determined to be non-trivial. labels Aug 17, 2022
@feywind
Collaborator

feywind commented Aug 17, 2022

Linked to the meta-issue about transport problems: b/242894947

@empperi

empperi commented Oct 6, 2022

I'm seeing this issue as well. I also faced it with Cloud Functions, but I took my Cloud Function code and modified it a bit so I could easily trigger it locally with node testProblem.js, so it is just like your typical Node command-line app. This still causes a 2-5 second latency for publishing a message to a topic. At the same time, a different app involved, running in GKE, uses the Java version of the Pub/Sub library (via Clojure), and that works lightning fast, with no delay whatsoever. The same lightning-fast performance is retained when running that app locally, so it is not related to my personal internet connectivity or similar wonky stuff.

To me this seems like disabling batching doesn't work properly. Batching configurations are respected: if publishing fewer messages than maxMessages, the latency is very close to maxMilliseconds, as one would expect. If, however, maxMessages is set to 1, then this weird latency occurs. If sending two messages with maxMessages set to 2, they both finish in exactly the same amount of time, which again is to be expected since they are batched. If sending two messages with maxMessages set to 1, they both finish about twice as fast! This does NOT make any sense.

Since the Java Pub/Sub library doesn't suffer from problems like this at all, I'm leaning towards the whole batch scheduling logic in this library being somehow broken. I'll look into this for a while longer, but if I can't find the issue really soon, we will have to rewrite our Cloud Functions in a different language to get acceptable performance, since this issue causes a user-facing performance problem.
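
For reference, the local test I'm describing is essentially the sketch below (not my exact testProblem.js; the topic name is a placeholder):

const {PubSub} = require('@google-cloud/pubsub');

async function timePublishes(maxMessages) {
  const topic = new PubSub().topic('my-topic', {
    batching: { maxMessages, maxMilliseconds: 10 }
  });

  // Send two messages concurrently and time how long each publish() takes to resolve.
  const timings = await Promise.all([0, 1].map(async (i) => {
    const start = Date.now();
    const messageId = await topic.publish(Buffer.from(`test ${i}`));
    return `publish ${i} (${messageId}) took ${Date.now() - start} ms`;
  }));
  timings.forEach((line) => console.log(line));
}

// Compare maxMessages: 2 (batched together) with maxMessages: 1.
timePublishes(2)
  .then(() => timePublishes(1))
  .catch(console.error);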

@kamalaboulhosn
Contributor

Going back to the original issue, I think the problem might be in the way the topic is being instantiated. The topic is being recreated for every publish, which means the topic is not being reused. It also means that the connection to the server has to be renegotiated every time. It would be better to do this:

const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();
const topicName = 'my-topic';
const topic = pubSubClient.topic(topicName);

async function publishMessage() {
  const data = 'Hello, world!'; // placeholder; the real payload is built elsewhere
  const dataBuffer = Buffer.from(data);

  const messageId = await topic.publish(dataBuffer);
  console.log(`Message ${messageId} published.`);
}

publishMessage().catch(console.error);

If the topic name is not known before the publish and can change, then you should keep a map from topic name to topic object and reuse it.
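
For example, such a cache could look like this (a minimal sketch; the helper names are illustrative):

const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();
const topicCache = new Map();

// Reuse the same Topic object (and its underlying connection) per topic name.
function getTopic(topicName) {
  if (!topicCache.has(topicName)) {
    topicCache.set(topicName, pubSubClient.topic(topicName));
  }
  return topicCache.get(topicName);
}

async function publishTo(topicName, data) {
  const messageId = await getTopic(topicName).publish(Buffer.from(data));
  console.log(`Message ${messageId} published to ${topicName}.`);
  return messageId;
}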

@gchristov

I'm seeing a similar issue over here. My setup is similar to the OP's: a Firebase Cloud Function triggers individual Pub/Sub topic messages. I'm seeing latency in minutes rather than seconds.

Interestingly, the topic().publish() method seems to be taking a long time based on the logs.

I've tried setting maxMessages, but that didn't seem to help:

batching: {
  maxMessages: 1
}

Anything else I can try or information to share that would help debug this?

@killthekitten

I’m seeing latency in minutes rather than seconds.

Have you tried debugging it the same way I did in one of the comments above? It could be caused by throttling on a function because of an async wrapper, or any async code that you aren't awaiting properly.

The key is to log when the function returns - does it return sooner than the publish() call?

@gchristov

@killthekitten Great tip, thanks! It turned out that was the problem in my case - I was not waiting for the topic().publish() call to complete before terminating the Cloud Function. The fix was to await the publish promise, and now the messages seem to be delivered much faster 👍
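
In essence, the handler now looks roughly like this (a simplified sketch; the topic and handler names are placeholders):

const {PubSub} = require('@google-cloud/pubsub');

const topic = new PubSub().topic('my-topic');

exports.handler = async (event) => {
  const data = Buffer.from(JSON.stringify(event));
  // Awaiting here keeps the function alive until Pub/Sub has accepted the message;
  // returning without awaiting leaves the publish pending in the background.
  const messageId = await topic.publish(data);
  console.log(`Message ${messageId} published.`);
};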

@feywind
Collaborator

feywind commented Mar 7, 2023

Hi all, I just made some updates to the batch flushing logic for publishing, and that's in 3.4.0. I'd be interested to hear if it helps this.

@kamalaboulhosn
Contributor

This issue has covered a lot of different causes and investigations that are not entirely related. The original issue looks to have been an inefficiency in the way the user code was written. Other issues included a transient server-side problem as well as other miscellaneous issues. Going forward, if anyone is still experiencing issues, please enter a support case. Thanks!
