Avoid potential proxycfg/xDS deadlock using non-blocking send #9689

Merged
merged 7 commits into master from proxycfg-deadlock on Feb 8, 2021

Conversation

@freddygv (Contributor) commented Feb 2, 2021

Deadlock scenario:

1. Due to scheduling, the state runner sends one snapshot into
snapCh and then attempts to send a second. The first send succeeds
because the channel is buffered, but the second blocks.

2. Separately, Manager.Watch is called by the xDS server after
getting a discovery request from Envoy. This function acquires the
manager lock and then blocks on receiving the CurrentSnapshot from
the state runner.

3. Separately, there is a Manager goroutine that reads the snapshots
from the channel in step 1. These reads are done to notify proxy
watchers, but they require holding the manager lock. This goroutine
goes to acquire that lock, but can't because it is held by step 2.

Now, the goroutine from step 3 is waiting on the one from step 2 to
release the lock. The goroutine from step 2 won't release the lock until
the goroutine in step 1 advances. But the goroutine in step 1 is waiting
for the one in step 3 to read from the channel it is blocking on. Deadlock.

By making the send in step 1 non-blocking, the state runner can proceed. The coalesce
timer will be reset, and a new snapshot (if valid) will be delivered after the
timer elapses or when one is requested by xDS.

Impact: This deadlock can block the delivery of proxy configuration to Envoy.


Also added some trace logging that would have helped while chasing this down.
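
For illustration, here is a minimal, self-contained sketch of the non-blocking send pattern the fix relies on; the channel and snapshot types are simplified stand-ins, not the actual Consul identifiers:

package main

import "fmt"

type ConfigSnapshot struct{ Version int }

func main() {
	// Buffered channel of size 1, like the snapCh described above.
	snapCh := make(chan ConfigSnapshot, 1)

	// The first send succeeds because the buffer has room.
	snapCh <- ConfigSnapshot{Version: 1}

	// The second send is non-blocking: if the reader is stuck (for example,
	// waiting on the manager lock), drop this delivery instead of blocking.
	select {
	case snapCh <- ConfigSnapshot{Version: 2}:
		fmt.Println("delivered second snapshot")
	default:
		fmt.Println("dropped delivery; the coalesce timer or a later xDS request picks up the latest snapshot")
	}
}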

@freddygv requested a review from a team February 2, 2021 18:52
@@ -583,6 +583,8 @@ func (s *state) initialConfigSnapshot() ConfigSnapshot {
}

func (s *state) run() {
	logger := s.logger.Named(logging.ProxyConfig)

Member:
Non-blocking idea: use the With() call here and tack on the proxy-id logging kv so that all messages emitted by the logger automatically get proxy-id attached.
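
As a rough sketch of that suggestion (the proxy ID value is hypothetical; hclog is the logging library the surrounding code uses), attaching the ID once via With() makes every later log line carry it automatically:

package main

import hclog "github.com/hashicorp/go-hclog"

func main() {
	base := hclog.New(&hclog.LoggerOptions{Name: "proxycfg", Level: hclog.Trace})

	// With() returns a logger that attaches the key/value pair to every
	// message it emits, so callers don't have to repeat it on each line.
	logger := base.With("proxy-id", "web-sidecar-proxy") // hypothetical proxy ID

	logger.Trace("delivering snapshot") // logged with proxy-id=web-sidecar-proxy attached
}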

freddygv (author):

Ah, didn't know about .With(), done

freddygv (author):

Actually the labeling was already taken care of:

Logger: a.logger.Named(logging.ProxyConfig),

state.logger = m.Logger.With("service_id", sid.String())

Removed it from here.

@@ -164,6 +165,8 @@ const (
)

func (s *Server) process(stream ADSStream, reqCh <-chan *envoy.DiscoveryRequest) error {
	logger := s.Logger.Named(logging.XDS)

Member:

similar comment here about using With

freddygv (author):

I'm not sure that would work here since the proxy ID isn't known until the xDS server state machine is running
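
As a rough, hypothetical sketch of why (simplified stand-in types, not the real xDS server code): the proxy ID only becomes known once the first discovery request arrives on the stream, so it can't be attached to the logger up front.

package main

import "fmt"

// DiscoveryRequest is a simplified stand-in for the Envoy xDS request type;
// only the node ID matters for this sketch.
type DiscoveryRequest struct {
	NodeID string
}

func process(reqCh <-chan *DiscoveryRequest) {
	// The logger can only be named here; the proxy ID is not yet known.
	loggerName := "xds"

	for req := range reqCh {
		// Only now, after the first request arrives, could a proxy-id field
		// be attached to the logger.
		fmt.Printf("[%s] proxy-id=%s received discovery request\n", loggerName, req.NodeID)
	}
}

func main() {
	reqCh := make(chan *DiscoveryRequest, 1)
	reqCh <- &DiscoveryRequest{NodeID: "web-sidecar-proxy"}
	close(reqCh)
	process(reqCh)
}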

@rboyer (Member) left a comment:

LGTM with comments

@rboyer (Member) commented Feb 4, 2021:

@freddygv should this also get backport labels?

@banks (Member) left a comment:

I might be misunderstanding this but...

default:
	logger.Trace("Failed to deliver new snapshot to proxy config watchers",
		"proxy-id", s.proxyID.String(),
	)

Member:

I see how this breaks the deadlock, but is it OK? If we just drop the new snapshot delivery on the floor that means that some change in proxy config that happened since the chan was last filled is now never going to get applied to any proxies consuming this state (at least until something else changes).

Isn't that just making the deadlock into a silent (other than the trace log) delivery failure that won't automatically resolve?

I could understand if the chan were just a "something changed" edge signal for the proxy to reload the current config state, but it's not: it's delivering a concrete state, and now if we do hit this race we'll just drop some valid config change on the floor, won't we?

@banks (Member) commented Feb 5, 2021:

Would it be correct if, instead of dropping the update entirely, we just scheduled a new delivery of it later (by sending on sendCh again)?

That would unblock us here, release the deadlock, but then in a few milliseconds we'd be back and would finally get another chance to deliver the latest snapshot instead of dropping it?

We might have to only push to sendCh if it wouldn't block, but I think this is OK: sendCh is just an edge trigger that causes a re-read of the latest state when we re-enter this case, so we'd only need to send to sendCh if it wasn't already full and could be sure that the latest config will eventually be delivered.

That said, I don't recall if sendCh is buffered; if not, we wouldn't be able to send to it directly here, since by definition nothing is reading from it while we're inside the loop that reads it. We'd either need to buffer it by 1 and drop when full, or spawn a new goroutine with time.AfterFunc or something, but I think buffering is reasonable.
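
For illustration, a minimal sketch of the buffered edge-trigger idea described above (the channel name and usage are simplified, not the actual Consul code):

package main

import "fmt"

func main() {
	// sendCh is an edge trigger: a size-1 buffered channel where a send just
	// means "something changed, re-read the latest state".
	sendCh := make(chan struct{}, 1)

	notify := func() {
		select {
		case sendCh <- struct{}{}:
			// Scheduled a re-read of the latest state.
		default:
			// A notification is already pending; since the reader always
			// re-reads the latest state, dropping this one is safe.
		}
	}

	notify()
	notify() // coalesced into the already-pending notification

	<-sendCh
	fmt.Println("re-read latest state once, covering both notifications")
}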

freddygv (author):

We're already scheduling a new delivery of it though.

  1. After dropping the send, here we're resetting the coalesceTimer
  2. In a following loop we fall into this block where we re-create the coalesceTimer since the snapshot is valid
  3. After the timer elapses (200ms), we end up in the case where we started

So then the question is: is 200ms too long?

Member:

Hmm I'm not sure if I follow.

We reset the timer, but then we continue, which means we skip the code at the bottom of the loop and start waiting for a further change. Doesn't that mean we will just wait, without sending the actual update, until something else triggers a change? That could be an unbounded amount of time later.

freddygv (author):

Ah yep I mistakenly thought we would fall through the select statement in a later loop even if nothing happened.

@freddygv requested a review from banks February 6, 2021 01:01

@banks (Member) left a comment:

LGTM.

It's a shame this is not easy to test in its current form as it's for a race condition that's pretty hard to simulate reliably. Have you found any way to reproduce even manually while developing it?

agent/proxycfg/state.go (resolved conversation)
@freddygv merged commit 82c269a into master Feb 8, 2021
@freddygv deleted the proxycfg-deadlock branch February 8, 2021 23:14
@hashicorp-ci

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/325395.

@hashicorp-ci

🍒✅ Cherry pick of commit 82c269a onto release/1.9.x succeeded!

@hashicorp-ci

🍒✅ Cherry pick of commit 82c269a onto release/1.8.x succeeded!

@hashicorp-ci

🍒❌ Cherry pick of commit 82c269a onto release/1.7.x failed! Build Log
