
Google Cloud Pub/Sub triggers high latency on low message throughput #925

Closed
andryichenko opened this issue Mar 23, 2020 · 31 comments
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: question Request for information or clarification. Not an issue.

Comments

@andryichenko

andryichenko commented Mar 23, 2020

I'm running a project that publishes messages to a Pub/Sub topic and triggers a background Cloud Function.

I've read that Pub/Sub performs well with high volumes of messages, but for smaller amounts, like hundreds or even tens of messages per second, it may yield high latencies.

Code example to publish a message:

const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();

async function publishMessage() {
  const topicName = 'my-topic';
  const data = 'Hello, world!'; // placeholder; the real payload is built elsewhere
  const dataBuffer = Buffer.from(data);

  const messageId = await pubSubClient.topic(topicName).publish(dataBuffer);
  console.log(`Message ${messageId} published.`);
}

publishMessage().catch(console.error);

Code example of the function triggered by Pub/Sub:

exports.subscribe = async (message) => {
  const name = message.data
    ? Buffer.from(message.data, 'base64').toString()
    : 'World';

  console.log(`Hello, ${name}!`);
}

Environment details

  • OS: Windows 10
  • Node.js version: 8
  • @google-cloud/pubsub version: 1.6

The problem is that when using Pub/Sub with a low message throughput (for example, one message per second), it sometimes struggles and shows very high latency (up to 7-9 s or more).

Is there a way or workaround to make Pub/Sub perform well every time (50 ms of delay or less) even with a small number of incoming messages?

Thanks!

@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Mar 23, 2020
@meredithslota meredithslota added the type: question Request for information or clarification. Not an issue. label Mar 23, 2020
@feywind
Collaborator

feywind commented May 5, 2020

@andryichenko Sorry for the late reply on this, but could it be that the publisher piece you showed above is queueing the messages for batch send? You might try decreasing the max batching timeout:

pubSubClient.topic(topicName, {
    batching: { maxMilliseconds: 10 }
})

The default is already low (100 ms), but it's possible something is getting set incorrectly. It's also possible that it's not properly obeying that number in your case, in which case I need to do some more debugging.

The other possibility is that it's getting hung up in the machinery somewhere between your client and your GCF function executing. I can pass it off to the right people in that case, but we should eliminate the client library first.

@andryichenko
Author

@feywind Thanks for the reply!
I already tried decreasing the max batching timeout, but it didn't really help. (It seemed to me that messages were going faster, but it still struggled.)
Also, as can be found in this guide, Pub/Sub tries to gather multiple messages before delivering them; in other words, it tries to deliver many messages at once. In this specific case, to achieve a more realistic real-time scenario, a batch size of 1 should be specified, which would cause Pub/Sub to deliver every message separately.
So I also added the maxMessages property to the batching object, like so:

pubSubClient.topic(topicName, {
    batching: {
        maxMessages: 1,
        maxMilliseconds: 10
    }
})

But that didn't seem to help much either.
After some research I came to the conclusion that it may be Cloud Functions causing delays on cold start. So as a workaround I tried making a simple Node.js server that pings the Cloud Functions via Pub/Sub so they stay warm, and it shows much better results now.
I'm still trying to find a good solution for this, though.
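
For reference, the keep-warm workaround boils down to something like the sketch below (a rough approximation, not my exact code; the topic name and interval are placeholders):

const {PubSub} = require('@google-cloud/pubsub');

// Placeholder topic; the real one is whatever the Cloud Function subscribes to.
const topic = new PubSub().topic('my-topic', {
  batching: { maxMessages: 1, maxMilliseconds: 10 }
});

// Publish a small "ping" message periodically so the function stays warm.
setInterval(async () => {
  try {
    const messageId = await topic.publish(Buffer.from('keep-warm ping'));
    console.log(`Ping ${messageId} published`);
  } catch (err) {
    console.error('Keep-warm publish failed', err);
  }
}, 60 * 1000); // e.g. once a minute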

@vyacheslav-kara

We also face the same problem: when we send 1-2 messages per minute, the latency is around 1200 ms, but once we push more messages it becomes faster and faster (around 10 ms). It looks like there is some caching mechanism for pushing the messages.

@feywind
Collaborator

feywind commented May 19, 2020

I'm not as knowledgeable about what happens after it leaves the Pub/Sub library, so it's possible something down in gax/gRPC is doing its own buffering. I can ask the maintainers of those libraries if there's possibly anything there. But as for the Pub/Sub library itself, it should be obeying that batching configuration and not pausing. I will peek at that and see if there's possibly a bug there, such that it's not using the right values.

@andryichenko
Author

@feywind did you find any solution?

@andresmargalef

@feywind did you find any solution?

@feywind
Collaborator

feywind commented Aug 25, 2020

I missed that there were two issues potentially related to this: #1087

Do you know if 2.5.0 still has the issue?

@dozie

dozie commented Sep 11, 2020

Seems like this has been merged and hasn't solved the problem of latency.

@feywind feywind self-assigned this Sep 15, 2020
@feywind
Collaborator

feywind commented Sep 18, 2020

@dozie Thanks for the update!

@alexander-fenster Sorry to ping you again :D but could there be something introducing latency and doing its own batching of calls at the gax level? If setting the batch size to 1 for the Pub/Sub library doesn't reduce the latency, it seems like it might be something at a lower level.

@andryichenko
Author

@feywind I can confirm that the batch size setting is set to 1 message, but the latency didn't decrease.

@xiaowei-routific

xiaowei-routific commented Nov 6, 2020

We are experiencing a similar issue: high latency (5-10 seconds, inconsistently) on low message throughput.
I don't think it's related to the message producer; it may be the Pub/Sub server itself, and here is why:

We keep track of the publish_time on each message, and from the official Google docs we know that publish_time is the time the Pub/Sub server receives the message, so it doesn't matter what you set up on the message producer side. The time between the publish_time and the time our client/message consumer receives the message can be 5-10 seconds when throughput is low, especially in the scenario where a message is sent to Pub/Sub after a long period with no messages.

I have been looking for a Pub/Sub SLA that guarantees that "queue time" but could not find one.
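
For what it's worth, this is roughly how we measure that gap on the consumer side (a simplified sketch; the subscription name is a placeholder):

const {PubSub} = require('@google-cloud/pubsub');

const subscription = new PubSub().subscription('my-subscription');

subscription.on('message', (message) => {
  // publish_time is assigned by the Pub/Sub server when it accepts the message,
  // so the difference to the local receive time is the delivery latency.
  const deliveryLagMs = Date.now() - message.publishTime.getTime();
  console.log(`Message ${message.id} delivered ${deliveryLagMs} ms after publish_time`);
  message.ack();
});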

@feywind
Collaborator

feywind commented Nov 20, 2020

@kamalaboulhosn Since there's a question here about the server side... Is it possible that the service is doing some sort of batching or delaying to queue up more messages?

@feywind
Collaborator

feywind commented Jan 15, 2021

@kamalaboulhosn Gentle ping on this since it seems like it might be a service question at this point.

@killthekitten

killthekitten commented Mar 12, 2021

Same issue here, but the latency is around 5 minutes. We have two topics between a GAE app and a Cloud Function, forming a request-response cycle. The latency shows up only on the GCF -> GAE topic. The GAE app runs the Python client; the GCF is on "@google-cloud/pubsub": "^2.10.0".

The function is almost identical to what @andryichenko has posted. Not sure how to debug the latency on the way back to GAE other than looking at HTTP log timestamps.

Things I've tried:

  • Created a parallel pull subscription to check timestamps on the messages. Same 5 minutes there.
  • Changed batching to maxMessages: 1; same behavior.

Things I've considered but didn't have time to test:

  • Forcing flush() on the Cloud Function before it ends. This is based on the assumption that GCF blocks the CPU for any background work that hasn't terminated after the function is done.
  • Recreating the topic/subscriptions.

@killthekitten

I've enabled some additional logging, and it looks like the function exits almost instantly, leaving the actual (async) code to run in the background for some time:

[screenshot: gcf-pubsub-logs]

The last line comes from this code:

const publish = async (event) => {
  const client = new PubSub({ apiEndpoint: config.PUB_SUB_API_ENDPOINT });

  const data = Buffer.from(JSON.stringify(event));
  const topic = client.topic(topicName, { batching: { maxMessages: 1 } });
  const messageId = await topic.publish(data, metadata);
  logger.log('Message published', messageId);

  await topic.flush();
  logger.log('Flushed', messageId);

  return messageId;
};

Some observations:

  • The latency up until (and including) the line 2021-03-12 12:45:49.244 is expected, but the 2-minute gap after that line isn't. I attribute the gap to the code above. In subsequent runs the latency on this step fluctuates around the 1-2 minute mark.
  • Interestingly, the call to .flush() fails with 3 INVALID_ARGUMENT: The value for message_count is too small. You passed 0 in the request, but the minimum value is 1.
  • Looks like the awaits are not respected by the cloud function runner.

@killthekitten

killthekitten commented Mar 12, 2021

OK, mystery solved!

The main Cloud Function handler was wrapped in Sentry.GCPFunction.wrapEventFunction, which somehow loses track of the returned promises. In turn, that led to the function being throttled and the latency skyrocketing.

I've temporarily removed the wrapper, and the whole thing now takes 7 seconds to run. And that's including multiple external HTTP requests 🎉

@feywind
Collaborator

feywind commented Mar 16, 2021

@killthekitten Ah, yeah. I've seen a few issues in the past with GCF losing track of async happening in the background. I'm glad you found something to help there.

I'm not sure how many of the comments above were using GCF, but it looks like the original poster was doing so. Is anyone else having a non-GCF issue with this?

@feywind feywind added the status: investigating The issue is under investigation, which is determined to be non-trivial. label Jan 26, 2022
@flunderpero

I am experiencing the same thing: a delay of 15 s or more, a few times. Most of the time everything works OK. If I don't use an orderingKey there seems to be no delay, after some quick testing (we rely on orderingKey, so I did not investigate further). I re-created all subscriptions and topics and everything seems to be fine now. I started seeing these problems on 2022-02-09. Perhaps something changed in the Pub/Sub system?
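
For context, our ordered publishing looks roughly like the sketch below (simplified; the topic name, key, and regional endpoint are placeholders):

const {PubSub} = require('@google-cloud/pubsub');

// Ordering keys generally require a regional endpoint and messageOrdering enabled.
const pubsub = new PubSub({ apiEndpoint: 'us-central1-pubsub.googleapis.com:443' });
const topic = pubsub.topic('my-ordered-topic', { messageOrdering: true });

async function publishOrdered(key, payload) {
  const messageId = await topic.publishMessage({
    data: Buffer.from(payload),
    orderingKey: key
  });
  console.log(`Published ${messageId} with ordering key ${key}`);
  return messageId;
}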

@kamalaboulhosn
Contributor

@flunderpero There is currently a known issue in the Pub/Sub service that is causing the 15s delay. The change that caused the issue is being rolled back right now. This issue would only have existed between 2/7/2022 and now.

@flunderpero

@kamalaboulhosn Thanks for letting me know. This actually coincides with when the problems started for us.

@ChrisWestcottUK

ChrisWestcottUK commented Feb 15, 2022

@flunderpero There is currently a known issue in the Pub/Sub service that is causing the 15s delay. The change that caused the issue is being rolled back right now. This issue would only have existed between 2/7/2022 and now.

Is there any way that we can track that issue as we're seeing these same 15s delays? From what we can see there hasn't been a change to the service since December according to the Pub/Sub release notes: https://cloud.google.com/pubsub/docs/release-notes

Or are you talking about something else, @kamalaboulhosn?

@kamalaboulhosn
Contributor

@ChrisWestcottUK The release notes for the server only track significant API/region availability/feature availability changes, not every server change that is made. There is no public tracking for all of these such changes. For tracking issues if they affect your projects, it is best to put in a support request so support engineers can follow up. This particular incident should now be resolved.

@ChrisWestcottUK

ChrisWestcottUK commented Feb 16, 2022

One thing that appeared to have also occurred at this time was messages being published multiple times (we're using an orderingKey).

Doesn't that show that it's important to choose good idempotency keys and not follow the suggestions in the documentation? Or is there something I'm missing?

@feywind feywind added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed status: investigating The issue is under investigation, which is determined to be non-trivial. labels Aug 17, 2022
@feywind
Collaborator

feywind commented Aug 17, 2022

Linked to the meta-issue about transport problems: b/242894947

@empperi

empperi commented Oct 6, 2022

I'm seeing this issue as well. I also faced it with Cloud Functions, but I took my Cloud Function code and modified it a bit so I could easily trigger it locally with node testProblem.js, so it is just like your typical Node command-line app. This still causes a 2-5 second latency for publishing a message to a topic. At the same time, a different app involved, running in GKE, uses the Java version of the Pub/Sub library (via Clojure), and that works lightning fast, with no delay whatsoever. The same lightning-fast performance is retained when running that app locally, so it is not related to my personal internet connectivity or similar wonky stuff.

To me this seems like disabling batching doesn't work properly. Batching configurations are respected: if publishing fewer messages than maxMessages, the latency is very close to maxMilliseconds, as one would expect. If, however, maxMessages is set to 1, then this weird latency occurs. If sending two messages with maxMessages set to 2, they both finish in exactly the same amount of time, which again is to be expected since they are batched. If sending two messages with maxMessages set to 1, they both finish about twice as fast! This does NOT make any sense.

Since the Java Pub/Sub library doesn't suffer from problems like this at all, I'm leaning towards the whole batch scheduling logic in this library being somehow broken. I'll look into this for a while longer, but if I can't find the issue really soon, we will have to rewrite our Cloud Functions in a different language to get acceptable performance, since this issue causes a user-facing performance problem.
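
For reference, the local test I'm describing is essentially the sketch below (not my exact testProblem.js; the topic name is a placeholder):

const {PubSub} = require('@google-cloud/pubsub');

async function timePublishes(maxMessages) {
  const topic = new PubSub().topic('my-topic', {
    batching: { maxMessages, maxMilliseconds: 10 }
  });

  // Send two messages concurrently and time how long each publish() takes to resolve.
  const timings = await Promise.all([0, 1].map(async (i) => {
    const start = Date.now();
    const messageId = await topic.publish(Buffer.from(`test ${i}`));
    return `publish ${i} (${messageId}) took ${Date.now() - start} ms`;
  }));
  timings.forEach((line) => console.log(line));
}

// Compare maxMessages: 2 (batched together) with maxMessages: 1.
timePublishes(2)
  .then(() => timePublishes(1))
  .catch(console.error);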

@kamalaboulhosn
Contributor

Going back to the original issue, I think the problem might be in the way the topic is being instantiated. The topic is being recreated for every publish, which means the topic is not being reused. It also means that the connection to the server has to be renegotiated every time. It would be better to do this:

const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();
const topicName = 'my-topic';
const topic = pubSubClient.topic(topicName);

async function publishMessage() {
  const data = 'Hello, world!'; // placeholder; the real payload is built elsewhere
  const dataBuffer = Buffer.from(data);

  const messageId = await topic.publish(dataBuffer);
  console.log(`Message ${messageId} published.`);
}

publishMessage().catch(console.error);

If the topic name is not known before the publish and can change, then you should keep a map from topic name to topic object and reuse it.
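
For example, such a cache could look like this (a minimal sketch; the helper names are illustrative):

const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();
const topicCache = new Map();

// Reuse the same Topic object (and its underlying connection) per topic name.
function getTopic(topicName) {
  if (!topicCache.has(topicName)) {
    topicCache.set(topicName, pubSubClient.topic(topicName));
  }
  return topicCache.get(topicName);
}

async function publishTo(topicName, data) {
  const messageId = await getTopic(topicName).publish(Buffer.from(data));
  console.log(`Message ${messageId} published to ${topicName}.`);
  return messageId;
}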

@gchristov

I'm seeing a similar issue over here. My setup is similar to the OP's: a Firebase Cloud Function triggers individual Pub/Sub topic messages. I'm seeing latency in minutes rather than seconds.

Interestingly, the topic().publish() method seems to be taking a long time based on the logs.

I've tried setting maxMessages, but that didn't seem to help:

batching: {
  maxMessages: 1
}

Anything else I can try or information to share that would help debug this?

@killthekitten

I’m seeing latency in minutes rather than seconds.

Have you tried debugging it the same way I did in one of the comments above? It could be caused by throttling on a function because of an async wrapper, or any async code that you aren't awaiting properly.

The key is to log when the function returns - does it return sooner than the publish() call?

@gchristov

@killthekitten Great tip, thanks! It turned out that was the problem in my case - I was not waiting for the topic().publish() call to complete before terminating the Cloud Function. The fix was to await the publish promise, and now the messages seem to be delivered much faster 👍
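
In essence, the handler now looks roughly like this (a simplified sketch; the topic and handler names are placeholders):

const {PubSub} = require('@google-cloud/pubsub');

const topic = new PubSub().topic('my-topic');

exports.handler = async (event) => {
  const data = Buffer.from(JSON.stringify(event));
  // Awaiting here keeps the function alive until Pub/Sub has accepted the message;
  // returning without awaiting leaves the publish pending in the background.
  const messageId = await topic.publish(data);
  console.log(`Message ${messageId} published.`);
};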

@feywind
Collaborator

feywind commented Mar 7, 2023

Hi all, I just made some updates to the batch flushing logic for publishing, and that's in 3.4.0. I'd be interested to hear if it helps this.

@kamalaboulhosn
Contributor

This issue has covered a lot of different causes and investigations that are not entirely related. The original issue looks to have been an inefficiency in the way the user code was written. Other issues included a transient server-side problem as well as other miscellaneous issues. Going forward, if anyone is still experiencing issues, please enter a support case. Thanks!
