Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED] Stalled Queue member may stall the whole Queue #376

Merged
merged 2 commits into from
Aug 16, 2017

Conversation

kozlovic
Copy link
Member

This happens if a member is stalled and other members are not but
have a number of outstanding acks higher than the stalled member.

Resolves #375

Copy link
Contributor

@petemiron petemiron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion for simplifying logic of findBestQueueSub, let me know your thoughts.

server/server.go Outdated
@@ -2187,7 +2189,7 @@ func findBestQueueSub(sl []*subState) (rsub *subState) {

// Favor non stalled subscribers and clients that do not have
// failed heartbeats
if (!sStalled || rStalled) && (!sHasFailedHB || rHasFailedHB) && (sOut < rOut) {
if (!sStalled && !sHasFailedHB && (rStalled || rHasFailedHB)) || (sOut < rOut) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic takes a bit to grok and got me thinking about the func as a whole. Could this be simplified to make the criteria evaluation for "BestQueueSub" more clear and reduce locking:

// ... (line 2170)
leastOutstanding := int(^uint(0) >> 1) // set to max int
for _, sub := range sl {

	sub.RLock()
	sOut := len(sub.acksPending)
	sStalled := sub.stalled
	sHasFailedHB := sub.hasFailedHB
	sub.RUnlock()

	if !sStalled && !sHasFailedHB {
		if sOut < leastOutstanding {
			leastOutstanding = sOut			
			rsub = sub
		}
	}
}
...

I realize this would increase the time between evaluating a subscriber's state and possibly sending a message to it, but it would also reduce the locks so hopefully the loop would run a tiny bit faster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The only difference in behavior is that with the current one, we at least pick one (even if it is stalled) as long as the array has 1 element. In the above approach (as-is), we would potentially return nil. (just so you know, I am working on a change for better handling stalled semantic, especially for queue groups, which then could address the fact that we return nil here if all are stalled).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But for this PR and to maintain current behavior, we may want to pick the first at the end of the loop if rsub has not been set in the loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you're going to change the behavior shortly, I wouldn't recommend this change. I'll approve.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to return a sub, because this function is also used in case of redeliveries where redeliveries are always forced (even if the sub is stalled). If I don't return a sub, test TestPersistentStoreStalledQueueRedelivery would fail.

Copy link
Contributor

@petemiron petemiron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

This happens if a member is stalled and other members are not but
have a number of outstanding acks higher than the stalled member.

Resolves #375
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.02%) to 94.459% when pulling 22e376a on fix_queue_stall into 4e63b01 on master.

@nats-io nats-io deleted a comment from coveralls Aug 16, 2017
@kozlovic kozlovic merged commit 38cdcae into master Aug 16, 2017
@kozlovic kozlovic deleted the fix_queue_stall branch August 16, 2017 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants