
Many parallel subscriptions in one process => DEADLINE_EXCEEDED upon ack #550

Closed
ctavan opened this issue Mar 13, 2019 · 24 comments


ctavan commented Mar 13, 2019

Environment details

  • OS: GKE
  • Node.js version: 10.13.0
  • @google-cloud/pubsub version: 0.28.1

Steps to reproduce

This is a follow-up report to #240. The symptoms are the same, but I think I have a better understanding of what might be going wrong this time.

I have made another attempt at finally upgrading the node pubsub library from v0.18.0 to v0.28.1.

  • I am currently running a couple of pods that subscribe to several pubsub topics.
  • Most of the pods subscribe to only one or two topics, those work fine with the updated library.
  • I have one pod that subscribes to many subscriptions; this one shows problems, see below.

Here's what the problematic pod does (a rough sketch of the setup follows the list):

  • Starts up by initializing 23 different (pre-existing) subscriptions on 23 different topics.
  • Starts pulling the messages that sit in these subscriptions; these are low-frequency subscriptions, so we're talking about 0-500 messages sitting in the subscriptions initially.
  • Processes those messages in batches and ack()s all of them; processing a batch takes ~500ms.
  • After a few seconds (I've seen 5-12), I receive a DEADLINE_EXCEEDED error event on one of the subscriptions where acks have been sent (and I crash the process).
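
For illustration, a minimal sketch of that setup (subscription names and the processBatch() helper are placeholders, not my actual code):

const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
// 23 pre-existing subscriptions on 23 different topics (names made up here).
const subscriptionNames = ['sub-01', 'sub-02', /* … */ 'sub-23'];

// Placeholder for the real batching + BigQuery pipeline (~500ms per batch).
async function processBatch(message) {}

for (const name of subscriptionNames) {
  const subscription = pubsub.subscription(name);

  subscription.on('message', async message => {
    await processBatch(message);
    message.ack();
  });

  subscription.on('error', err => {
    // This is where the DEADLINE_EXCEEDED error below shows up; I crash the process.
    console.error(`Subscription ${name} failed:`, err);
    process.exit(1);
  });
}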

Here's the error:

Error: Failed to "acknowledge" for 6 message(s). Reason: 4 DEADLINE_EXCEEDED: Deadline Exceeded
    at AckQueue.<anonymous> (/app/node_modules/@google-cloud/pubsub/build/src/message-queues.js:162:23)
    at Generator.throw (<anonymous>)
    at rejected (/app/node_modules/@google-cloud/pubsub/build/src/message-queues.js:20:65)

For the affected subscriptions, the num_undelivered_messages stackdriver metric just keeps on rising:

[Screenshot: Stackdriver graph of num_undelivered_messages, rising steadily]

Stackdriver also indicates that there are indeed streaming_pull_message_operation_count operations (they keep rising because my pod keeps crashing and restarting, thus pulling the growing number of unacked messages over and over again):

[Screenshot: Stackdriver graph of streaming_pull_message_operation_count, rising]

However, Stackdriver doesn't show any pull_ack_message_operation_count operations (ignore the spike to the right; that is where I started the pod with only one subscription, where it worked):

[Screenshot: Stackdriver graph showing no pull_ack_message_operation_count operations]

Just for comparison: version v0.18.0 of this client library, which I had been using before, was still using streamingAcks (see discussion in #240), and everything was working well, even with the 23 parallel subscriptions in one node process:

[Screenshot: Stackdriver graph of the same subscription under v0.18.0, for comparison]

All screenshots above show the same subscription and the same timeframe. At ~4:30pm I deployed the new version, which is the point where the number of undelivered messages starts rising.

To me it looks like the number of parallel subscriptions determines whether acks work as expected or whether I consistently receive DEADLINE_EXCEEDED errors in response to the ack batches. It's interesting that acks either work for all subscriptions handled by the affected process or for none.

I doubt that it is CPU related because the process basically only does I/O (piping the messages from Pubsub to BigQuery) and the GKE CPU graphs also show no significant CPU usage. CPU usage of the same process with v0.18.0 of the library was also <1%.

Is the number of parallel subscriptions that can be handled in a single node process (all subscription message handlers registered with a single new PubSub() client instance) somehow limited?

Any pointer as to which direction I should be looking in would be highly appreciated!

(The next thing I'd like to try is to see if I can narrow down the number of parallel subscriptions in one process at which it starts to break… However, due to this npm registry issue I'm currently having trouble deploying new code to the GKE cluster: https://status.npmjs.org/incidents/3p30y0pkblsp)

@callmehiphop callmehiphop added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Mar 14, 2019
@callmehiphop

@ctavan as always, thank you for the detailed write up!

Is the number of parallel subscriptions that can be handled in a single node process (all subscription message handlers registered with a single new PubSub() client instance) somehow limited?

I suspect this is exactly it. I'm a little fuzzy on the internal workings of grpc, but if you're re-using the same PubSub instance for all subscriptions, I think they are all sharing the same channel. Each subscription creates 5 StreamingPull streams by default, so we are looking at 115 streams total for 23 subscriptions.

You could limit the number of underlying streams per subscription via the maxStreams option. This might give you lower throughput, but if they are low-frequency subs it is probably OK.

const sub = pubsub.subscription('my-sub', {
  streamingOptions: {maxStreams: 1}
});

Alternatively, you could try and create a new PubSub instance per Subscription.
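
For example, something along these lines (sketch only; subscription names are made up), so that each subscription gets its own client and therefore its own channel:

const {PubSub} = require('@google-cloud/pubsub');

const subscriptionNames = ['sub-01', 'sub-02', 'sub-03'];

// One client per subscription, so each subscription's StreamingPull
// streams live on their own grpc channel.
const subscriptions = subscriptionNames.map(name => {
  const pubsub = new PubSub();
  return pubsub.subscription(name);
});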

If you don't mind testing this out, that would be awesome! Otherwise I'll try and put together a reproduction case this week.


ctavan commented Mar 14, 2019

@callmehiphop thanks for the swift reply, I think we're getting closer!

The magic number where it starts failing in my setup seems to be 20 parallel subscriptions. With 19 it works fine; with 20 I get DEADLINE_EXCEEDED.

Reducing maxStreams to 4 just for testing made it work with 23 subscriptions. So maybe there's a limit of exactly 100 streams? Is this something you can confirm from the code?

I have now even artificially restricted the number of concurrent streams to exactly 99 (spawning 19 subscriptions with maxStreams: 5 and 1 subscription with maxStreams: 4) and this setup did not crash; going to 100 -> DEADLINE_EXCEEDED!
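
Roughly, that test looked like this (sketch; subscription names are made up — changing the last maxStreams from 4 to 5 gives 100 streams and the failures reappear):

const {PubSub} = require('@google-cloud/pubsub');
const pubsub = new PubSub();

// 19 subscriptions × 5 streams = 95 streams …
for (let i = 1; i <= 19; i++) {
  pubsub
    .subscription(`test-sub-${i}`, {streamingOptions: {maxStreams: 5}})
    .on('message', message => message.ack());
}

// … plus 1 subscription × 4 streams = 99 streams in total: this works.
// With maxStreams: 5 here (100 streams in total) the acks fail with DEADLINE_EXCEEDED.
pubsub
  .subscription('test-sub-20', {streamingOptions: {maxStreams: 4}})
  .on('message', message => message.ack());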

So even though I'm not 100% certain that my test setup is ideal, since not all of the subscriptions I subscribe to actually receive messages during the test, the seemingly exact limit of 100 concurrent streams sounds suspicious to me.

I see a few options to move forward:

  • Either find an explanation in the code for the "magic number" of 100 grpc streams.
  • Or build a better reproduction setup that isolates the problem and figures out the limits.

Then:

  • Document the limit.
  • Ideally have the library spit out warnings or even crash if a user tries to open more streams than supported.

WDYT @callmehiphop ?

For the time being I can live with the workaround of reducing maxStreams, so this is no longer a blocker for me. However, I do believe that for future "me"s it would be great not to let users use this library in a way that is known not to work.

@callmehiphop

I came across this issue grpc/grpc-go#1514, which then led me to this option. It sounds like grpc might default to 100; could you give the following a test run? Assuming that is the issue, I think the solution here would be to unbound the number of concurrent streams.

const pubsub = new PubSub({
  'grpc.max_concurrent_streams': 115
});


ctavan commented Mar 15, 2019

@callmehiphop unfortunately 'grpc.max_concurrent_streams' doesn't seem to have any effect at all on the described issue. Even setting it to 1 still leaves the setup working as long as NUM_SUBSCRIPTIONS * maxStreams < 100, and setting it to a high number doesn't fix the issue for 20 concurrent subscriptions.

I have built a simple reproduction case that you can use to further debug this issue: https://gist.github.com/ctavan/2298c2bd965e902d4d8ad2788ffcb0b8

Just play with the two constants on top to see the errors appearing.
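
In spirit, the reproduction boils down to something like this (the actual code and constant values are in the gist; names here are illustrative):

const {PubSub} = require('@google-cloud/pubsub');

// The two constants to play with: errors appear once
// NUM_SUBSCRIPTIONS * MAX_STREAMS reaches 100.
const NUM_SUBSCRIPTIONS = 20;
const MAX_STREAMS = 5;

const pubsub = new PubSub();

for (let i = 0; i < NUM_SUBSCRIPTIONS; i++) {
  const subscription = pubsub.subscription(`repro-sub-${i}`, {
    streamingOptions: {maxStreams: MAX_STREAMS},
  });
  subscription.on('message', message => message.ack());
  subscription.on('error', err => console.error(`repro-sub-${i}:`, err));
}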


ctavan commented Mar 15, 2019

I should maybe mention that I was running the reproduction case from my local machine, not GKE, so the issue doesn't seem to be environment-related.

@callmehiphop

@kir-titievsky does the PubSub API limit the number of concurrent streams per channel?

@callmehiphop

friendly ping @kir-titievsky @jganetsk

@jganetsk

Is 'grpc.max_concurrent_streams' per channel? Maybe there's another property that's specific to channels?

@callmehiphop

I believe that is per channel; @murgatroid99 might be able to help here.


bcoe commented Apr 4, 2019

Hey @ctavan 👋 small world.

Just dusting off this issue, which has gotten a little bit stale. Are you still bumping into issues? Any updates?

@callmehiphop

@bcoe I'm going to go ahead and say yes; it looks like there's a magical limit that I'm guessing gRPC sets, but it is pretty unclear to me how it is set and whether it is configurable. I'm hoping @murgatroid99 can shed some light here.


ctavan commented Apr 5, 2019

Hey @bcoe & @callmehiphop: I just re-ran my reproduction case and can confirm that the issue still persists.

Picking up my suggestions from above I see two routes moving forward:

  1. Figure out why this 100-stream limit exists and get rid of it.
  2. Accept the 100-stream limit, document it and make sure that the library throws a descriptive error as soon as somebody tries to establish the 100th connection.

While I think the first option is preferable, I also believe that opening 100 gRPC streams on a single PubSub instance is rather an edge case, and not being able to do so should not be a big deal for users of this library. So I think the second option would be acceptable as well.

For the second option we should make sure to check:

  1. Is the limit per PubSub instance or per node process?
  2. Do other connections like publishers also count towards the limit or is the issue unique to subscriptions?

I'll see if I can extend my reproduction case to answer these two questions.


ctavan commented Apr 8, 2019

@callmehiphop @bcoe I've extended my reproduction case a bit to show that:

  1. The limit is indeed per new PubSub() instance. If I create a new PubSub() instance for each subscription, I can subscribe to 20 subscriptions (or more) with maxStreams: 5 (the default), equaling >= 100 concurrent grpc streams.
  2. Publishing on the same new PubSub() instance doesn't seem to have any effect on the observed subscription behavior. Whether or not I published on the same new PubSub() instance beforehand, I can spawn exactly 99 concurrent grpc streams, and as soon as I spawn the 100th I get DEADLINE_EXCEEDED errors.

@sduskis sduskis assigned ajaaym and unassigned callmehiphop May 3, 2019
@sduskis sduskis added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed triaged for GA priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels May 20, 2019

sduskis commented May 20, 2019

I changed this issue from a "bug" to a "feature request" for supporting multiple subscriptions in the same client. That isn't the standard use of the client, but we would like to support that feature.

There's a hard limit of 100 streams per channel, and we can't do anything about that. There are a number of options to fix this:

  1. A user can reduce the number of streams (1 to 100/subscription count, or preferably 50/subscription count to allow for acks; sketched below):

     const sub = pubsub.subscription('my-sub', {
       streamingOptions: {maxStreams: 1}
     });

  2. A user can use multiple clients, since each client will have its own channel. That said, there seems to be a problem with this approach as per this comment.

  3. A user can use AsyncPull (see this example).

  4. Enhance the nodejs client infrastructure with channel pooling to allow this and other scenarios to work.

Option 4 is a long-term enhancement. Options 1-3 make the user code a bit more complicated, but are solutions that a user can implement.
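
As an illustration of option 1, user code could divide a stream budget across its subscriptions (the 50-stream budget and the snippet below are just a sketch, not an official recommendation):

const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const subscriptionNames = ['sub-01', 'sub-02', 'sub-03']; // your subscriptions

// Stay well below the 100-streams-per-channel limit and leave headroom for acks.
const STREAM_BUDGET = 50;
const maxStreams = Math.max(1, Math.floor(STREAM_BUDGET / subscriptionNames.length));

const subscriptions = subscriptionNames.map(name =>
  pubsub.subscription(name, {streamingOptions: {maxStreams}})
);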


ctavan commented May 20, 2019

@sduskis thank you for confirming the 100 streams per channel limit.

As outlined in my comment above, I believe that opening more than 100 subscription streams per client is rather an edge-case usage, and I don't believe that it's really worth the extra complexity to implement your option 4 above.

However, now that we know the 100-stream limit exists and that there is nothing we can do about it, I still think we should consider route 2 from my comment above:

  2. Accept the 100-stream limit, document it and make sure that the library throws a descriptive error as soon as somebody tries to establish the 100th connection.

The issue for me wasn't really that I'm not able to use more than 100 streams on one pubsub client instance; I can easily create a new client instance for each subscription.

My issue was that the 100-stream limit is not documented and that the DEADLINE_EXCEEDED errors didn't give me a direct hint at the underlying problem. In fact, it took me several hours of debugging to ultimately pin it down to the number of streams I was opening on a single client instance.

So can we agree that the client should keep track of the number of streams and throw a descriptive error as soon as a user tries to open the 100th stream, and document this in the API docs?
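
To illustrate what I mean (this is not existing library behavior, just a sketch of an application-level guard one could write today; the default of 5 streams per subscription comes from the discussion above):

// Hypothetical guard: count the streams opened on one client and fail fast
// with a descriptive error instead of a late DEADLINE_EXCEEDED.
const MAX_STREAMS_PER_CHANNEL = 100;
let openStreams = 0;

function subscribeSafely(pubsub, name, options = {}) {
  const requested =
    (options.streamingOptions && options.streamingOptions.maxStreams) || 5;
  if (openStreams + requested >= MAX_STREAMS_PER_CHANNEL) {
    throw new Error(
      `Opening ${requested} more StreamingPull streams would reach the ` +
      `${MAX_STREAMS_PER_CHANNEL}-streams-per-channel limit; ` +
      `use a separate PubSub client or lower maxStreams.`
    );
  }
  openStreams += requested;
  return pubsub.subscription(name, options);
}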


ctavan commented May 20, 2019

@sduskis regarding your comment:

  2. A user can use multiple clients, since each client will have its own channel. That said, there seems to be a problem with this approach as per this comment.

I don't think that there is any problem with this approach; it worked very well for me. What part of my comment were you referring to?


sduskis commented May 21, 2019

@ctavan, option 4 is actually still on the table, since there's work being done towards channel pooling for other products.

I agree that we should give an error at some point. I'm not sure what the actual limit ought to be. 100 streams in use for receiving messages means that the messages can't be acked on that connection. I'd say that erroring out after 50 streams is safer, even if it's not the absolute limit.


ctavan commented May 22, 2019

@sduskis I'm totally fine with a safe limit of 50. Just make it whatever error message is more descriptive than DEADLINE_EXCEEDED and throws immediately, not only after a while when the first acks start to fail… 😉

I furthermore suggest moving option 4 into a separate issue and keeping this one open only until the error message has been added.

@sduskis sduskis assigned callmehiphop and unassigned ajaaym May 22, 2019

sduskis commented May 22, 2019

@callmehiphop, can you please add the error?

@callmehiphop

@sduskis I actually think this isn't the right place to solve this problem. If grpc receives a limit from the server and accepts an option to adjust said limit, shouldn't we ask grpc to provide a better error instead of implementing similar logic in the client?


sduskis commented May 23, 2019

@callmehiphop, I think this is purely a client-side logic thing. In most cases, a client sending more than 100 streams is fine. The behavior ought to be to keep a queue that waits until one of the streams is done. In unary-call systems or short stream calls, which are the vast majority of use cases, that's a perfectly reasonable solution. 100 Pub/Sub subscriber streams are a special case since they stay open for an infinite amount of time. Even for Pub/Sub, Node.js (and other dynamic languages) is a special case. In Java or Go, the client would have a channel pool anyway, which would allow ack-ing on a different channel. I don't think that the Pub/Sub servers should cut off more than x < 100 streams. I believe that's a non-channel-pooled Pub/Sub client's job.

The real solution is to enhance the client with channel pooling so that this scenario works. IMHO, the short term fix is to tell the user to fix it.

@callmehiphop

@sduskis I think in that case we should look into adding support for grpc-js in grpc-gcp. I'd be happy to try and take that on if there are no objections. Otherwise I would suggest documenting this as a quirk of the PubSub client.

@sduskis
sduskis commented May 28, 2019

+1 on a long-term plan for grpc-js in grpc-gcp. Users have a way around it in the short term (specifically, creating new clients), and we have higher-priority issues than this.

+1 on a short term docs fix.

@callmehiphop

Since there's nothing we can do to resolve this (until grpc-gcp supports grpc-js), I'm going to close this out.
