Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: allow unbuffered memory chan if ephemeral or deferred #1376

Merged
merged 1 commit into from
Aug 2, 2022

Conversation

ploxiln
Copy link
Member

@ploxiln ploxiln commented Sep 15, 2021

In b3b29b7 / #1159 unbuffered
memoryMsgChan were replaced with nil chan for both Topic and Channel
(this only happens if mem-queue-size=0). This inadvertently made
deferred publish never work (with mem-queue-size=0), because some
metadata is lost when messages go through the "backend" disk queue,
but previously messages could go immediately through the topic's
unbuffered chan with their metadata intact, to then be added to each
channel's deferred pqueue. This partial revert brings this part back.

Granted, mem-queue-size=0 never worked very well, in many respects.
For this functionality, at low message publish rates, it seems there
was rarely a problem. But at high message publish rates it was always
likely that some messages would hit the topic disk queue and lose
the deferred time metadata. This potential problem remains.

In fact, the motivation for fully disabling the memoryMsgQueue for
mem-queue-size=0 was to avoid excessively shuffling message order,
which would happen if some messages jump instantly through the
memory-queue while others take the longer way through the disk-queue.
That will be likely again. But the users who noticed this probably
did not use publish-deferred functionality!

Additional fixes would be appropriate, but this is a quick-fix to
restore previous useful behavior for mem-queue-size=0.
fixes #1375

@ploxiln
Copy link
Member Author

ploxiln commented Sep 19, 2021

I pushed up another commit, a hacky idea to handle the case of memQueueSize=0 (or memQueueSize=1 which it seems like some people might pick with the same idea/motivation), by waiting up to a second to push through the memoryMsgQueue, but only if there's a good chance the message would be imminently consumed and copied to the channels. Comments and ideas welcome :)

@ploxiln
Copy link
Member Author

ploxiln commented Sep 20, 2021

OK, I came up with another strategy, which I think is more efficient, fixes some long-standing potential for undesired behavior with very small MemQueueSize, and is pretty nifty. But it ran into a deadlock during tests, which I fixed easily enough (in the follow-up commit) ... but which is worrying:

  • test calls GetTopic() -> NewTopic() -> nsqd.Notify()
  • NewTopic() spawns a goroutine for topic.messagePump()
  • nsqd.Notify() spawns a goroutine for PersistMetadata()
  • test calls topic.PutMessage() -> topic.RLock() -> success
    • topic.put() waits for either memoryMsgChan or inactiveChan (new in this PR)
  • PersistMetadata() goroutine waits for topic.Lock() (behind RLock() just taken by PutMessage())
  • topic.messagePump() goroutine waits for topic.RLock() to get channel list (behind pending Lock() for PersistMetadata())

fixed by making PersistMetadata() also use an RLock() too ... just a bit weird that PutMessage() can get in front of messagePump() initialization and PersistMetadata(). But it seems that if anything else could sometimes call topic.Lock() between topic.PutMessage() and messagePump processing channelUpdateChan, it could deadlock again. So that's the downside/fault of this latest strategy, which I think is otherwise pretty darn cool.

@ploxiln
Copy link
Member Author

ploxiln commented Sep 27, 2021

Went back to the simpler idea, with an additional "started" flag.

@mreiferson
Copy link
Member

@ploxiln catching up on this...

Doesn't this effectively reduce the likelihood of messages being written to disk when mem-queue-size < 32, which is the opposite of what most users intend?

Seems like users are trying to accomplish a few things with mem-queue-size = 0:

  1. Round trip all messages to disk
  2. Preserve a more strict ordering (retries aside)
  3. Still be able to use deferred pub

This fix breaks (1), somewhat handles (2), and fixes (3).

Note, it looks like we also broke ephemeral channels according to metal-stack/metal-roles#74. This makes sense intuitively, if you don't want any messages stored in memory ephemeral channels shouldn't work? Not sure we did that intentionally though.

@mreiferson mreiferson added the bug label Nov 28, 2021
@ploxiln
Copy link
Member Author

ploxiln commented Nov 28, 2021

With mem-queue-size = 0 (and this change), the messages should still hit disk, but pass more directly to the channels diskqueues if possible (and if not possible still go directly to the topic diskqueue, like nsq-1.2.1). This change should fix deferred publish, while keeping mem-queue-size = 0 similarly useful as it was before (similarly not-100%-guarantee of message persistence ... but less I suppose, because the disk write error will happen in Topic.messagePump() -> Channel.PutMessage(), instead of {request} -> Topic.PutMessage(), thus will not return error to caller, but will still SetHealth()).

The change to nil channel was somewhat motivated by a desire for message order to be more generally preserved ... and with this PR, order will also be generally preserved, because the channels will quickly drain messages from (waiting 1 second to be sent on) the topic memchan, and then write them directly to the channel diskqueue (since the channel memchan is being left nil if mem-queue-size = 0).

@ploxiln
Copy link
Member Author

ploxiln commented Nov 28, 2021

... so yes, agreed with what you wrote, but I think this handles (2) just fine, and only somewhat breaks (1) (mainly the feedback).

@mreiferson
Copy link
Member

sorry I'm late on this... I think I follow what's happening now, but I'm still not sold. In the event consumers are slow, this creates undesirable back-pressure when you'd rather those messages get written to disk.

Why don't we just revert #1159 and take a step back with what (if anything) we want to achieve with --mem-queue-size=0?

Alternatively, I think the fix in 786c1f2 is fine — creating an ephemeral channel should arguably never write to disk. For DPUB, perhaps there's a similar highly-targeted fix (like keeping a non-nil memory message chan and only using it when a DPUB occurs?

@ploxiln
Copy link
Member Author

ploxiln commented Jan 17, 2022

Why don't we just revert #1159

Yeah, I'm amenable to this idea. I was trying to please the no-mem-queue use-case here but agree it might not be worth all this.

@herry-go
Copy link

herry-go commented Jan 21, 2022

Why don't we just revert #1159

Yeah, I'm amenable to this idea. I was trying to please the no-mem-queue use-case here but agree it might not be worth all this.

@mreiferson Sorry for bothering but is there any progress?
Our team has some resources to help fix this problem, that is, you can simply point out the potential problem location.
We want to help.

Here are some of my ideas that may not be consistent with your original design:

  1. For temporary channels, I find that the virtual back-end queue is currently useless. When memorymsgchan = 0, whether the message should also be sent to the backend queue, but the backend queue can have timeliness?
  2. Is it possible to allow some messages to exist in the backend queue, but this message is time effective? Even if you change memorymsgchan to not nil, I also find that when the number of messages I send in batches exceeds memqueuesize, only some messages are consumed and the rest are lost

@ploxiln
Copy link
Member Author

ploxiln commented Aug 1, 2022

Updated to only change behavior (compared to latest nsq-1.2.1 release) for deferred messages or paused topics. (But it's still kinda a lot of lines.)

@Gradelia21
Copy link

Want it back

return nil
case <-time.After(time.Second):
break // give up
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to wait around for a second in the ephemeral case - the intention is that it's bounded and lossy, why add back pressure when the queue is full?

My proposal would be to simplify this whole if/else block to:

if cap(t.memoryMsgChan) > 0 || m.deferred != 0 {
		select {
		case t.memoryMsgChan <- m:
			return nil
		default:
			break // write to backend
		}
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can pare it all the way down and remove the wait. The 1 second is arbitrary ... the expectation/desire is more like a handful of milliseconds. If there's a burst of messages and a bunch of channels, they can all get through the topic to the channels in just a few milliseconds, instead of having deferred time lost or messages kinda sampled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(oh and ephemeral topics don't work at all without the memory chan)

In b3b29b7 / nsqio#1159 unbuffered
memoryMsgChan were replaced with nil chan for both Topic and Channel
if mem-queue-size=0. This inadvertently made deferred publish never
work (with mem-queue-size=0), because some metadata is lost when
messages go through the "backend" disk queue, instead of immediately
going through the topic memory chan to the channels' deferred pqueues.

The motivation for fully disabling the memoryMsgQueue for
mem-queue-size=0 was to avoid excessively shuffling message order,
which would happen if some messages jump instantly through the
memory-queue while others take the longer way through the disk-queue.
So, only allow using the memory queue in this case if really needed,
for deferred messages, or for ephemeral topic or channel which just
lose all messages if they all go through the no-op backend queue.

Granted, mem-queue-size=0 never worked very well, in many respects.
@ploxiln ploxiln changed the title topic: allow unbuffered memoryMsgChan (partial revert of b3b29b7eff41) nsqd: allow unbuffered memory chan if ephemeral or deferred Aug 2, 2022
@ploxiln
Copy link
Member Author

ploxiln commented Aug 2, 2022

much reduced, squashed, re-wrote commit message

@ploxiln
Copy link
Member Author

ploxiln commented Aug 2, 2022

Thanks for the reviews @mreiferson :)

@ploxiln ploxiln merged commit 4138c44 into nsqio:master Aug 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nsqd: deferred messages not deferred with --mem-queue-size=0
4 participants