Question: DeliveryMQ edge case around post-publish errors #151

@alexluong

Problem

Here is simplified pseudocode for the deliverymq flow, with emphasis on some interesting cases:

step 1: pre-publish operations (query destination, query event, etc.)
step 2: publish event
step 3: post-publish operations (schedule retry, send event to logmq, idempotency cleanup, etc.)

We have clear error handling around pre-publish ops & the publish step. The post-publish ops error handling is a bit trickier.

Currently, we don't have any special error handling that differentiates pre-publish from post-publish ops. Is this something we should consider?
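
To make the question concrete, here is a minimal sketch of what differentiating the phases could look like. It is Python purely for illustration, and every name in it (`PrePublishError`, `PublishError`, `PostPublishError`, `handle`) is hypothetical rather than taken from the codebase. The idea is only that the caller can tell which phase failed before choosing between ack and nack.

```python
class PrePublishError(Exception):
    """A step before the publish attempt failed."""


class PublishError(Exception):
    """The publish attempt itself failed."""


class PostPublishError(Exception):
    """Publish succeeded, but a follow-up step failed."""


def handle(event, pre_publish, publish, post_publish):
    """Run the three steps and tag any failure with the phase it came from.

    The three callables are hypothetical stand-ins for the real steps.
    """
    try:
        pre_publish(event)
    except Exception as exc:
        raise PrePublishError(str(exc)) from exc
    try:
        publish(event)
    except Exception as exc:
        raise PublishError(str(exc)) from exc
    try:
        post_publish(event)
    except Exception as exc:
        # The destination has already received the event at this point.
        raise PostPublishError(str(exc)) from exc
```

With this shape, a consumer could catch `PostPublishError` separately and avoid treating it like a failed delivery.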

For example:

for event A

1: deliverymq
  a: publish succeeds
  b: logmq fails
  c: nack

2: deliverymq
  a: publish succeeds
  b: logmq fails
  c: nack

...

As you can see, logmq becomes a very critical piece of infrastructure: if it fails, every nack leads to a redelivery that publishes again, so we spam all destinations with however many retries we can until the message ends up in the DLQ.

It's not super clear to me if this is an expected problem of distributed systems, or if there's a way to limit the impact.
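
One possible way to limit the impact, sketched under assumptions: record that the publish succeeded before running any post-publish op, so a redelivery skips the publish and retries only the failed follow-up. The `delivered` set below stands in for a durable store, and `deliver`, `publish`, and `send_log` are hypothetical names, not the project's API.

```python
def deliver(event_id, delivered, publish, send_log):
    """Process one delivery attempt and return "ack" or "nack".

    `delivered` is a set standing in for a durable store (assumption).
    """
    if event_id not in delivered:
        try:
            publish(event_id)
        except Exception:
            return "nack"
        # Remember the success before running any post-publish op.
        delivered.add(event_id)
    try:
        send_log(event_id)
    except Exception:
        return "nack"  # redelivery retries only the logging step
    return "ack"


# Simulate three redeliveries of event A while logmq is down.
published = []
delivered = set()


def failing_log(event_id):
    raise RuntimeError("logmq down")


results = [deliver("A", delivered, published.append, failing_log) for _ in range(3)]
print(results, len(published))  # ['nack', 'nack', 'nack'] 1
```

This does not remove duplicates entirely: if writing the marker itself fails after a successful publish, the next attempt publishes again, so delivery stays at-least-once.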


Another scenario is what happens if publish fails and logmq also fails.

  • Should we simply nack & let the mq system retry?
  • Should we schedule a retry via the Redis-based system? There's a chance that may fail too. If yes, should we nack or ack?
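
The two options above can be combined into one candidate policy, sketched here as a starting point for discussion rather than an answer. It assumes that a missing log entry for a failed attempt is tolerable and that acking once a retry is scheduled avoids retrying twice (once via the scheduler, once via queue redelivery). `after_failed_publish`, `schedule_retry`, and `send_log` are hypothetical names.

```python
def after_failed_publish(schedule_retry, send_log):
    """Decide between "ack" and "nack" once the publish step has failed."""
    try:
        schedule_retry()
    except Exception:
        # Nothing else will retry this event, so let the queue redeliver it.
        return "nack"
    try:
        send_log()
    except Exception:
        # Assumption: losing this log entry is acceptable because the
        # retry is already scheduled and will produce its own log entry.
        pass
    return "ack"
```

If log completeness matters more than avoiding an extra attempt, the second branch would need to nack instead, which brings back the double-retry risk.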

Metadata

Labels: question (Further information is requested)

Status: Done