agent: fix several data races and bugs related to node-local alias checks #5876
Conversation
Great job RB! That sounded like a nightmare to track down and this makes sense.
I have a couple of questions inline but unless they are real issues I think this looks good.
I wonder if we need to explain in comments somewhere why the "edge trigger" mechanism used here doesn't actually prevent all races - it was assumed it was robust and copied in a few places IIRC (leaf cert cache type is one of them maybe).
I tried to work through a sequence diagram of how the race could happen but I couldn't quite figure it out assuming we only send the trigger event strictly after making the actual update. Did I just miss something or is the real bug here that we edge trigger before the update is actually visible from the source? Any interleaving I explored still ended up with the right value unless that assumption was broken.
…may be on the hot path
We hashed this out off-issue, but summarizing here for the record: The pattern of having a notification channel and then independently fetching the notifications works fine. In this specific situation during the Check lifecycle:
The problem was that any given named check can have only one notification channel registered to receive notifications. When (1) causes the old check to continue running until after (5) completes, we run into issues where the overall check status may be updated out of order: the new goroutine initializes from the current `passing` state, and the old goroutine then commits one final stale `critical` write before exiting.
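To make that pattern concrete, here is a minimal, hypothetical sketch (invented names, not the agent's actual types): the producer updates shared state first and only then signals the channel, and the consumer treats each signal purely as a hint to re-read the latest state.

```go
package main

import (
	"fmt"
	"sync"
)

// store holds the latest check status. The names and fields here are
// illustrative, not Consul's real types.
type store struct {
	mu     sync.Mutex
	status string
	notify chan struct{} // capacity 1: a pending signal means "re-read me"
}

// setStatus updates the state first, then edge-triggers the consumer.
func (s *store) setStatus(v string) {
	s.mu.Lock()
	s.status = v
	s.mu.Unlock()

	// Non-blocking send: if a signal is already pending, the consumer
	// will see the newest value on its next fetch anyway.
	select {
	case s.notify <- struct{}{}:
	default:
	}
}

func (s *store) getStatus() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.status
}

func main() {
	s := &store{status: "passing", notify: make(chan struct{}, 1)}
	done := make(chan struct{})

	// Consumer: each notification is only a hint to re-fetch current state.
	go func() {
		defer close(done)
		for range s.notify {
			fmt.Println("observed:", s.getStatus())
		}
	}()

	s.setStatus("critical")
	s.setStatus("passing") // may coalesce with the previous signal
	close(s.notify)
	<-done
}
```

As long as the state write strictly precedes the signal, a consumer that re-fetches on every signal converges on the latest value; the failure in this PR arose because, for a short window, two producers for the same check coexisted.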
LGTM!
The observed bug was that a full restart of a consul datacenter (servers and clients) in conjunction with a restart of a connect-flavored application with bring-your-own-service-registration logic would very frequently cause the envoy sidecar service check to never reflect the aliased service.
Over the course of investigation several bugs and unfortunate interactions were corrected:
(1) data races
`local.CheckState` objects were only shallow copied, but the key piece of data that gets read and updated is one of the things not copied (the underlying Check with a Status field). When the stock code was run with the race detector enabled, this highly-relevant-to-the-test-scenario field was found to be racy.
Changes
a) update the existing Clone method to include the Check field
b) copy-on-write when those fields need to change rather than incrementally updating them in place
This made the observed behavior occur slightly less often.
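As an illustration of that change, here is a simplified sketch using hypothetical stand-in types (not the real `local` package structs): the `Clone` method also duplicates the pointed-to check, and updates swap in a fresh check rather than mutating the shared one in place.

```go
package main

import "fmt"

// HealthCheck and CheckState are simplified stand-ins for the real
// structs; only the shape of the bug is reproduced here.
type HealthCheck struct {
	Status string
}

type CheckState struct {
	Check *HealthCheck
	Token string
}

// Clone now copies the pointed-to Check as well, so callers get a value
// that cannot be mutated underneath them by another goroutine.
func (c *CheckState) Clone() *CheckState {
	c2 := *c // copies Token and the Check pointer itself
	if c.Check != nil {
		chk := *c.Check // also duplicate the underlying check
		c2.Check = &chk
	}
	return &c2
}

func main() {
	orig := &CheckState{Check: &HealthCheck{Status: "passing"}, Token: "t1"}
	snap := orig.Clone()

	// Copy-on-write style update: build a fresh Check rather than
	// mutating the shared one in place.
	orig.Check = &HealthCheck{Status: "critical"}

	fmt.Println(snap.Check.Status) // "passing": the snapshot is unaffected
	fmt.Println(orig.Check.Status) // "critical"
}
```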
(2) control loop corrections
If anything about how the `runLocal()` method for node-local alias check logic was ever flawed, there was no fallback option. Those checks are purely edge-triggered, and failure to properly notice a single edge transition would leave the alias check incorrect until the next flap of the aliased check.
Change
The change was to introduce a fallback timer to act as a control loop to double check the alias check matches the aliased check every minute (borrowing the duration from the non-local alias check logic body).
This made the observed behavior eventually go away when it did occur.
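The shape of that fix, sketched with simplified, invented names rather than the real `runLocal` body: the loop still reacts to edge-trigger notifications, but a periodic timer re-synchronizes the alias check even if a notification was somehow missed.

```go
package main

import (
	"fmt"
	"time"
)

// syncAliasOnce would copy the aliased check's current status onto the
// alias check; here it just reports what it would do.
func syncAliasOnce(current func() string) {
	fmt.Println("alias check set to:", current())
}

// runLocalLoop is an illustrative stand-in for the node-local alias
// check loop: edge-triggered via notifyCh, with a one-minute fallback
// re-check so a missed edge cannot leave the alias wrong forever.
func runLocalLoop(stopCh <-chan struct{}, notifyCh <-chan struct{}, current func() string) {
	const fallback = time.Minute // borrowed from the non-local alias logic

	syncAliasOnce(current) // establish the base case from current state
	for {
		select {
		case <-stopCh:
			return
		case <-notifyCh:
			syncAliasOnce(current) // normal edge-triggered path
		case <-time.After(fallback):
			syncAliasOnce(current) // control-loop safety net
		}
	}
}

func main() {
	stop := make(chan struct{})
	notify := make(chan struct{}, 1)
	go runLocalLoop(stop, notify, func() string { return "passing" })

	notify <- struct{}{} // simulate an edge trigger
	time.Sleep(100 * time.Millisecond)
	close(stop)
}
```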
(3) overlapping alias check lifetimes
Originally I thought there were two main actions involved in the data race:
(A) The act of adding the original check (from disk recovery) and its first health evaluation.
(B) The act of the HTTP API requests coming in and resetting the local state when re-registering the same services and checks.
It took a while for me to realize that there's a third action at work:
(C) The goroutines associated with the original check and the later checks.
The actual sequence of actions that was causing the bad behavior was that the API actions resulted in the original check being removed and re-added without waiting for the original goroutine to terminate. This means that for brief windows of time during check definition edits there are two goroutines that can be sending updates for the alias check status.
In extremely unlikely scenarios the original goroutine sees the aliased check start up in `critical` before being removed, but does not get the notification about the nearly immediate update of that check to `passing`. This is interlaced with the new goroutine coming up, initializing its base case to `passing` from the current state, and then listening for new notifications of edge triggers. If the original goroutine "finishes" its update, it then commits one more write into the local state of `critical` and exits, leaving the alias check no longer reflecting the underlying check.
Change
The correction here is to enforce that the old goroutines must terminate before spawning the new one for alias checks, using `sync.WaitGroup`.
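A minimal sketch of that shutdown ordering with invented names (`stopCh`, `startAliasCheck`), not the agent's real fields: stopping an alias check closes its stop channel and then blocks on a `sync.WaitGroup` until the goroutine exits, so the replacement goroutine is only started once the old writer is gone.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// aliasCheck is an illustrative stand-in for one running alias-check
// goroutine; stopCh and wg are invented names, not the agent's fields.
type aliasCheck struct {
	stopCh chan struct{}
	wg     sync.WaitGroup
}

func startAliasCheck(name string) *aliasCheck {
	a := &aliasCheck{stopCh: make(chan struct{})}
	a.wg.Add(1)
	go func() {
		defer a.wg.Done()
		for {
			select {
			case <-a.stopCh:
				fmt.Println(name, "writer exited")
				return
			case <-time.After(10 * time.Millisecond):
				// ...would publish an alias status update here...
			}
		}
	}()
	return a
}

// stop blocks until the goroutine has fully exited; that ordering is the
// fix: the old writer is gone before a replacement is allowed to start.
func (a *aliasCheck) stop() {
	close(a.stopCh)
	a.wg.Wait()
}

func main() {
	old := startAliasCheck("old")
	old.stop() // wait for the old goroutine to terminate first

	replacement := startAliasCheck("new") // only now spawn the new one
	time.Sleep(25 * time.Millisecond)
	replacement.stop()
}
```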