Avoid potential proxycfg/xDS deadlock using non-blocking send #9689

Merged
merged 7 commits into master from proxycfg-deadlock on Feb 8, 2021

Conversation

@freddygv (Contributor) commented Feb 2, 2021

Deadlock scenario:

1. Due to scheduling, the state runner sends one snapshot into
snapCh and then attempts to send a second. The first send succeeds
because the channel is buffered, but the second blocks.

2. Separately, Manager.Watch is called by the xDS server after
getting a discovery request from Envoy. This function acquires the
manager lock and then blocks on receiving the CurrentSnapshot from
the state runner.

3. Separately, there is a Manager goroutine that reads the snapshots
from the channel in step 1. These reads are done to notify proxy
watchers, but they require holding the manager lock. This goroutine
goes to acquire that lock, but can't because it is held by step 2.

Now, the goroutine from step 3 is waiting on the one from step 2 to
release the lock. The goroutine from step 2 won't release the lock until
the goroutine in step 1 advances. But the goroutine in step 1 is waiting
for the one in step 3 to read from the channel it is blocking on. Deadlock.

By making the send in step 1 non-blocking, the state runner can proceed. The coalesce
timer will be reset, and a new snapshot (if valid) will be delivered after the
timer elapses or when one is requested by xDS.

Impact: This deadlock can block the delivery of proxy configuration to Envoy.


Also added some trace logging that would have helped while chasing this down.
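
For illustration, here is a minimal, self-contained sketch of the non-blocking send pattern the fix relies on; the channel and snapshot types are simplified stand-ins, not the actual Consul identifiers:

package main

import "fmt"

type ConfigSnapshot struct{ Version int }

func main() {
	// Buffered channel of size 1, like the snapCh described above.
	snapCh := make(chan ConfigSnapshot, 1)

	// The first send succeeds because the buffer has room.
	snapCh <- ConfigSnapshot{Version: 1}

	// The second send is non-blocking: if the reader is stuck (for example,
	// waiting on the manager lock), drop this delivery instead of blocking.
	select {
	case snapCh <- ConfigSnapshot{Version: 2}:
		fmt.Println("delivered second snapshot")
	default:
		fmt.Println("dropped delivery; the coalesce timer or a later xDS request picks up the latest snapshot")
	}
}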

@freddygv requested a review from a team February 2, 2021 18:52
@@ -583,6 +583,8 @@ func (s *state) initialConfigSnapshot() ConfigSnapshot {
}

func (s *state) run() {
	logger := s.logger.Named(logging.ProxyConfig)

Member:
Non-blocking idea: use the With() call here and tack on the proxy-id logging kv so that all messages emitted by the logger automatically get proxy-id attached.
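
As a rough sketch of that suggestion (the proxy ID value is hypothetical; hclog is the logging library the surrounding code uses), attaching the ID once via With() makes every later log line carry it automatically:

package main

import hclog "github.com/hashicorp/go-hclog"

func main() {
	base := hclog.New(&hclog.LoggerOptions{Name: "proxycfg", Level: hclog.Trace})

	// With() returns a logger that attaches the key/value pair to every
	// message it emits, so callers don't have to repeat it on each line.
	logger := base.With("proxy-id", "web-sidecar-proxy") // hypothetical proxy ID

	logger.Trace("delivering snapshot") // logged with proxy-id=web-sidecar-proxy attached
}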

freddygv (author):

Ah, didn't know about .With(), done

freddygv (author):

Actually the labeling was already taken care of:

Logger: a.logger.Named(logging.ProxyConfig),

state.logger = m.Logger.With("service_id", sid.String())

Removed it from here.

@@ -164,6 +165,8 @@ const (
)

func (s *Server) process(stream ADSStream, reqCh <-chan *envoy.DiscoveryRequest) error {
	logger := s.Logger.Named(logging.XDS)

Member:

similar comment here about using With

freddygv (author):

I'm not sure that would work here since the proxy ID isn't known until the xDS server state machine is running
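
As a rough, hypothetical sketch of why (simplified stand-in types, not the real xDS server code): the proxy ID only becomes known once the first discovery request arrives on the stream, so it can't be attached to the logger up front.

package main

import "fmt"

// DiscoveryRequest is a simplified stand-in for the Envoy xDS request type;
// only the node ID matters for this sketch.
type DiscoveryRequest struct {
	NodeID string
}

func process(reqCh <-chan *DiscoveryRequest) {
	// The logger can only be named here; the proxy ID is not yet known.
	loggerName := "xds"

	for req := range reqCh {
		// Only now, after the first request arrives, could a proxy-id field
		// be attached to the logger.
		fmt.Printf("[%s] proxy-id=%s received discovery request\n", loggerName, req.NodeID)
	}
}

func main() {
	reqCh := make(chan *DiscoveryRequest, 1)
	reqCh <- &DiscoveryRequest{NodeID: "web-sidecar-proxy"}
	close(reqCh)
	process(reqCh)
}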

@rboyer (Member) left a comment:

LGTM with comments

@rboyer (Member) commented Feb 4, 2021:

@freddygv should this also get backport labels?

@banks (Member) left a comment:

I might be misunderstanding this but...

default:
	logger.Trace("Failed to deliver new snapshot to proxy config watchers",
		"proxy-id", s.proxyID.String(),
	)

Member:

I see how this breaks the deadlock, but is it OK? If we just drop the new snapshot delivery on the floor that means that some change in proxy config that happened since the chan was last filled is now never going to get applied to any proxies consuming this state (at least until something else changes).

Isn't that just making the deadlock into a silent (other than the trace log) delivery failure that won't automatically resolve?

I could understand if the chan were just a "something changed" edge signal for the proxy to reload the current config state, but it's not: it's delivering a concrete state, and now if we do hit this race we'll just drop some valid config change on the floor, won't we?

@banks (Member) commented Feb 5, 2021:

Would it be correct if, instead of dropping the update entirely, we just scheduled a new delivery of it later (by sending on sendCh again)?

That would unblock us here, release the deadlock, but then in a few milliseconds we'd be back and would finally get another chance to deliver the latest snapshot instead of dropping it?

We might have to only push to sendCh if it wouldn't block, but I think this is OK: sendCh is just an edge trigger that causes a re-read of the latest state when we re-enter this case, so we'd only need to send to sendCh if it wasn't already full and could be sure that the latest config will eventually be delivered.

That said, I don't recall if sendCh is buffered; if not, we wouldn't be able to send to it directly here, since by definition nothing is reading from it while we're inside the loop that reads it. We'd either need to buffer it by 1 and drop when full, or spawn a new goroutine with time.AfterFunc or something, but I think buffering is reasonable.
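
For illustration, a minimal sketch of the buffered edge-trigger idea described above (the channel name and usage are simplified, not the actual Consul code):

package main

import "fmt"

func main() {
	// sendCh is an edge trigger: a size-1 buffered channel where a send just
	// means "something changed, re-read the latest state".
	sendCh := make(chan struct{}, 1)

	notify := func() {
		select {
		case sendCh <- struct{}{}:
			// Scheduled a re-read of the latest state.
		default:
			// A notification is already pending; since the reader always
			// re-reads the latest state, dropping this one is safe.
		}
	}

	notify()
	notify() // coalesced into the already-pending notification

	<-sendCh
	fmt.Println("re-read latest state once, covering both notifications")
}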

freddygv (author):

We're already scheduling a new delivery of it though.

  1. After dropping the send, here we're resetting the coalesceTimer
  2. In a following loop we fall into this block where we re-create the coalesceTimer since the snapshot is valid
  3. After the timer elapses (200ms), we end up in the case where we started

So then the question is: is 200ms too long?

Member:

Hmm I'm not sure if I follow.

We reset the timer, but then we continue, which means we skip the code at the bottom of the loop and start waiting for a further change. Doesn't that mean we will just wait, without sending the actual update, until something else triggers a change? That could be an unbounded amount of time later.

freddygv (author):

Ah yep I mistakenly thought we would fall through the select statement in a later loop even if nothing happened.

@freddygv requested a review from banks February 6, 2021 01:01

@banks (Member) left a comment:

LGTM.

It's a shame this is not easy to test in its current form as it's for a race condition that's pretty hard to simulate reliably. Have you found any way to reproduce even manually while developing it?

agent/proxycfg/state.go (resolved conversation)
@freddygv merged commit 82c269a into master Feb 8, 2021
@freddygv deleted the proxycfg-deadlock branch February 8, 2021 23:14
@hashicorp-ci

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/325395.

@hashicorp-ci

🍒✅ Cherry pick of commit 82c269a onto release/1.9.x succeeded!

@hashicorp-ci

🍒✅ Cherry pick of commit 82c269a onto release/1.8.x succeeded!

@hashicorp-ci

🍒❌ Cherry pick of commit 82c269a onto release/1.7.x failed! Build Log
