Problem
This is simplified pseudo-code of the deliverymq flow, with emphasis on some interesting cases:
step 1: pre-publish operations (query destination, query event, etc.)
step 2: publish event
step 3: post-publish operations (schedule retry, send event to logmq, idempotency cleanup etc.)
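To make the three steps concrete, here is a minimal Python sketch of a handler that runs them in order. The function name `handle_delivery`, the three callables, and the "ack"/"nack" return strings are hypothetical stand-ins, not the project's actual API. It also assumes, for illustration only, that every failure ends in a nack regardless of which step raised it.

```python
from typing import Callable


def handle_delivery(
    pre_publish: Callable[[], None],
    publish: Callable[[], None],
    post_publish: Callable[[], None],
) -> str:
    """Run the three steps in order and return "ack" or "nack" (hypothetical)."""
    try:
        pre_publish()   # step 1: query destination, query event, etc.
        publish()       # step 2: deliver the event to the destination
        post_publish()  # step 3: schedule retry, send to logmq, idempotency cleanup
    except Exception:
        # Assumption: no distinction between steps, so any failure nacks,
        # even when the event has already reached the destination.
        return "nack"
    return "ack"
```

Note that a failure in step 3 is indistinguishable from a failure in step 1 here, even though the destination has already received the event by then.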
We have clear error handling around pre-publish ops & the publish step. The post-publish ops error handling is a bit trickier.
Currently, we don't have any special error handling to differentiate pre vs post publish ops. Is this something we should consider?
For example:
for event A
  1: deliverymq
    a: publish succeeds
    b: logmq fails
    c: nack
  2: deliverymq
    a: publish succeeds
    b: logmq fails
    c: nack
  ...
As you can see, logmq becomes a very critical piece of infrastructure: if it fails, every nack leads to a redelivery that publishes again, so we will spam all destinations with however many retries we can until the message ends up in the DLQ.
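The loop above can be sketched as a small simulation that counts how many times a destination receives the same event. The `max_attempts` parameter is a hypothetical stand-in for whatever redelivery limit the mq applies before dead-lettering. The sketch also assumes the publish step is not suppressed by an idempotency check on redelivery, which may not match the real system.

```python
def simulate(max_attempts: int, logmq_healthy: bool) -> dict:
    """Count destination deliveries for one event under nack-on-any-failure."""
    deliveries = 0
    for _attempt in range(max_attempts):
        deliveries += 1  # publish succeeds: the destination receives the event
        if logmq_healthy:
            # Post-publish ops succeed, so the message is acked and we stop.
            return {"deliveries": deliveries, "outcome": "ack"}
        # logmq fails: the message is nacked and the mq redelivers it.
    # The redelivery limit is exhausted and the message is dead-lettered.
    return {"deliveries": deliveries, "outcome": "dlq"}
```

With a healthy logmq the destination sees the event once. With logmq down and `max_attempts=5`, the destination sees it five times before the message reaches the DLQ.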
It's not super clear to me if this is an expected problem of distributed systems, or if there's a way to limit the impact.
Another scenario is what happens if publish fails & the log also fails.
- Should we simply nack & let the mq system retry?
- Should we schedule a retry via the Redis-based system? There's a chance that may fail too. If it does, should we nack or ack?
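The two options can be laid out side by side as a sketch, without picking one. Everything here is hypothetical: the policy names, the `schedule_retry` callable, and the `on_schedule_failure` parameter are made up for illustration. The choice to ack once a retry has been scheduled is also an assumption.

```python
from typing import Callable


def after_failed_publish(
    policy: str,
    schedule_retry: Callable[[], None],
    on_schedule_failure: str = "nack",
) -> str:
    """Decide ack/nack once publish has already failed (hypothetical sketch)."""
    if policy == "nack_only":
        # Option 1: do nothing else and let the mq system redeliver.
        return "nack"
    if policy != "schedule_retry":
        raise ValueError(f"unknown policy: {policy}")
    try:
        # Option 2: hand the retry to the Redis-based scheduler.
        schedule_retry()
    except Exception:
        # The open question: "nack" falls back to mq redelivery, while "ack"
        # means nothing will retry this event.
        return on_schedule_failure
    # Assumption: the scheduler now owns the retry, so the message is acked.
    return "ack"
```

The sketch shows that the second option still needs a fallback decision when scheduling fails, which is the ack-or-nack question raised in the last bullet.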