
pubsub: getter for all paused publishers on a topic, or method to resume all #5905

Closed
mgabeler-lee-6rs opened this issue Apr 18, 2022 · 12 comments

@mgabeler-lee-6rs

Is your feature request related to a problem? Please describe.

When ordered publishers get stuck on an error, it's difficult to build good application recovery code when we cannot enumerate the paused publishers, nor resume them in bulk.

Describe the solution you'd like

One or both of:

  1. A method to call on a topic to get back a slice of paused ordering keys
  2. A method to call on a topic to resume all paused ordering keys

Describe alternatives you've considered

  • Resume all possible keys: comically non-viable when keys are UUIDs
  • Resume before every publish request: wasteful, and defeats the point of having a controlled resume

Additional context

The same basic problem is affecting us in the NodeJS client: googleapis/nodejs-pubsub#1524

mgabeler-lee-6rs added the "triage me" label on Apr 18, 2022
product-auto-label bot added the "api: pubsub" label on Apr 18, 2022
hongalex added the "type: feature request" label and removed the "triage me" label on Apr 18, 2022
@hongalex
Member

If the issue is not knowing which ordering key has failed, would it be helpful if errors from PublishResult.Get carried ErrorInfo metadata containing the ordering key?

@mgabeler-lee-6rs
Author

Not in my use case, as in that context we would generally still know what we had tried to publish when we got that error, since we would be waiting on that result to mark a "thing" complete.

The issue I'm facing is more that applications can get into an apparent "broken until restart" condition because of gRPC timeouts (I have an Enterprise Support ticket open looking into why we have seen a sudden increase in those timeouts since early April, but of course I am also looking to make our applications more robust).

Calling ResumePublish before each publish or group of publishes, integrated into our retry logic, is an option, but it seems poor for performance: it involves various mutex and map hits each time, when the vast majority of the time it is not necessary. A higher-level API would allow us to:

  1. Report internal metrics on the stuck ordering keys
  2. Build automated logic to do a "resume everything" at a safe point in our application logic as opposed to deep in the many routines that might try to publish messages
  3. Build manual circuit-breaker type triggers to resume all publishers based on human incident investigations

@hongalex
Member

hongalex commented Apr 18, 2022

I think I'm a little confused here and would appreciate some more context. Here's what the ResumePublish usage should look like at a high level (a rough Go sketch follows the list):

  1. Start by publishing messages 1, 2, and 3 with some ordering key and no batching. Keep track of the last message that was published successfully.
  2. Message 1 is published, but 2 fails due to a timeout and 3 also fails (because the library requires ResumePublish to be called before further publishes on that key).
  3. Since we receive an error on 2, we reset publish state and call ResumePublish before proceeding to publish. We also try to handle this for message 3, but there's nothing to be done since the earliest message that failed was message 2.
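Here is a rough sketch of that flow, assuming a *pubsub.Topic that has EnableMessageOrdering set before any ordered publishes; publishInOrder and the choice to stop at the first error are illustrative only, not part of the client library:

```go
package example

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

// publishInOrder publishes payloads one at a time on a single ordering key,
// handling a failed publish by resuming the key before giving up.
func publishInOrder(ctx context.Context, topic *pubsub.Topic, key string, payloads [][]byte) {
	for _, data := range payloads {
		res := topic.Publish(ctx, &pubsub.Message{Data: data, OrderingKey: key})
		if _, err := res.Get(ctx); err != nil {
			// The client pauses publishing on this key after the error. Reset
			// any application ordering state, resume the key, and decide which
			// message to restart from.
			log.Printf("publish failed for ordering key %q: %v", key, err)
			topic.ResumePublish(key)
			return
		}
	}
}
```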

I think my confusion stems from this:

Build automated logic to do a "resume everything" at a safe point in our application logic as opposed to deep in the many routines that might try to publish messages

Could you describe a bit more why this is preferred to handling the error within the publish goroutine? Also, if you need to handle this elsewhere, could you keep track of the failed ordering keys in a set/map and later call ResumePublish on them?

Thanks!

@mgabeler-lee-6rs
Author

A couple bits of context for my use case that may help:

  1. We may have hundreds of active ordering keys
  2. Within a single ordering key, we often have groups of messages that don't care about relative order within the group, but do care about being definitely before or after messages outside the group. So, e.g. we might have published 1, 2, and 3 OK, and then enqueued 4 through 400 to go in parallel, and then after those are going to enqueue 401, 402, 403, etc. We don't care about the relative order of 4 through 400, just that they are after 3 and before 401.
  3. Another pattern we might have is that, for a given key, we have messages 1, 2, 3 that need to go in that order, and A, B, and C that go in that order, but where the numbered messages land relative to the lettered ones doesn't matter. They end up on the same ordering key because it makes application code much simpler and more consistent if we have a fixed way of deriving the ordering key from the message payloads. In this context, 1, 2, 3 may be published from one goroutine and A, B, C from a different one, so having either of them resume publishing on their own is problematic, as it might cause weird out-of-order publishing for other concurrent workers. That is part of what I had in mind for the "at a safe point in our application logic" bit -- we could build logic to detect when things go bad, freeze processing so the app is in a safe state, and then clear all the errors. But with stuff flying around concurrently, there is no point at which it is safe to resume any queue.
  4. When processing an incoming request, we can't always know in advance which keys we are going to publish on, so it's hard to have an "up front" resume call. Plus the safety issues with that mentioned above.

@hongalex
Member

Thanks for the clarification. Let me bring this up with some folks (including the Node library maintainer) and get back to you.

@hongalex
Member

Actually, sorry, one more point that I brought up earlier but that wasn't addressed: given that you know which ordering keys failed on Publish, is it not possible to keep track of these in a set? Since these errors don't happen that often, there shouldn't be much overhead in keeping this around.
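A hypothetical application-side helper for that pattern, assuming the application records each key whose publish failed and resumes them together at a point it considers safe; pausedKeys, MarkPaused, and ResumeAll are illustrative names, not part of the pubsub client:

```go
package example

import (
	"sync"

	"cloud.google.com/go/pubsub"
)

// pausedKeys tracks ordering keys whose publishes have failed.
type pausedKeys struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

// MarkPaused records an ordering key whose publish returned an error.
func (p *pausedKeys) MarkPaused(key string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.keys == nil {
		p.keys = make(map[string]struct{})
	}
	p.keys[key] = struct{}{}
}

// ResumeAll calls ResumePublish for every recorded key and clears the set.
func (p *pausedKeys) ResumeAll(topic *pubsub.Topic) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for key := range p.keys {
		topic.ResumePublish(key)
	}
	p.keys = nil
}
```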

@mgabeler-lee-6rs
Author

Yes, and that is something we're looking into in our application. There are some complexities there due to potentially multiple components publishing.

And there is a bit of frustration when I can F12 Go To Definition and see the set of those that already exists in the PubSub client code, but I just can't access it ;)

@hongalex
Member

And there is a bit of frustration when I can F12 Go To Definition and see the set of those that already exists in the PubSub client code, but I just can't access it ;)

Yeah I understand the frustration. However, for the majority of use cases, the best pattern is to handle the Publish error right away and call ResumePublish then and there.

We're not keen on exposing a ResumeAll since it promotes the wrong behavior where messages can end up being delivered out of order if users call this without knowing the implications.

@mgabeler-lee-6rs
Author

That makes sense. How would you feel about:

  1. Exposing a PausedOrderingKeys method that returns a slice-copy? (I get that this is awfully close to exposing ResumeAll, but at least it makes someone think before they do that, and I'd love to have this just for metrics & alerting.)
  2. Exposing the IsPaused method from the Publisher on the Topic? (A hypothetical sketch of both follows.)
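For illustration only, a hypothetical sketch of the two proposed methods; neither exists on pubsub.Topic today:

```go
package example

// orderingIntrospection is a hypothetical view of the proposed additions.
type orderingIntrospection interface {
	// PausedOrderingKeys would return a copy of the ordering keys whose
	// publishing is currently paused after a publish error.
	PausedOrderingKeys() []string
	// IsPaused would report whether publishing on the given ordering key is
	// currently paused.
	IsPaused(orderingKey string) bool
}
```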

@kamalaboulhosn
Contributor

I think exposing those methods is kind of in the same realm as exposing ResumeAll. The client library is meant to make it easy to use the system in the way it was designed, which in this case is: handle the failure of a publish, reset any local state needed to preserve ordering, call ResumePublish, then start publishing again. It should really be possible to do this with the context of the ordering key for the message that failed.

Metrics and alerting is something we are looking at more broadly and there may be some information within the client library exposed for summary purposes, but I'm not sure it would necessarily include just the list of paused ordering keys.

The request to expose these additional methods is something we'll keep in mind and see if we get other similar requests for such use cases.

@hongalex
Member

hongalex commented Apr 20, 2022

I had a question from a coworker that I wanted to pass along here:


For their example:

Within a single ordering key, we often have groups of messages that don't care about relative order within the group, but do care about being definitely before or after messages outside the group. So, e.g. we might have published 1, 2, and 3 OK, and then enqueued 4 through 400 to go in parallel, and then after those are going to enqueue 401, 402, 403, etc. We don't care about the relative order of 4 through 400, just that they are after 3 and before 401.

Isn't the proper use of ordering keys for them to publish just three messages, each of which would be a set or list of their original msgs: [{msg1, msg2, msg3}, {msg4, msg5, ..., msg400}, {msg401, msg402, msg403}]?


I think this question is slightly unrelated to the original issue, but I wanted to bring it up anyway. Is it possible you might want to batch these messages? (A rough sketch of that idea is below.)
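A rough sketch of that idea, assuming the group of messages can be encoded into a single payload (JSON here) and published as one ordered message; the batch type and publishBatch are illustrative only:

```go
package example

import (
	"context"
	"encoding/json"

	"cloud.google.com/go/pubsub"
)

// batch wraps a group of payloads whose relative order doesn't matter.
type batch struct {
	Payloads [][]byte `json:"payloads"`
}

// publishBatch publishes the whole group as a single ordered message.
func publishBatch(ctx context.Context, topic *pubsub.Topic, key string, payloads [][]byte) error {
	data, err := json.Marshal(batch{Payloads: payloads})
	if err != nil {
		return err
	}
	res := topic.Publish(ctx, &pubsub.Message{Data: data, OrderingKey: key})
	_, err = res.Get(ctx)
	return err
}
```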

@hongalex
Member

hongalex commented Jun 2, 2022

Closing this issue since this isn't something we're looking to implement at this time.

hongalex closed this as completed on Jun 2, 2022