agent: avoid reverting any check updates that occur while a service is being added or the config is reloaded #6144

rboyer · 2019-07-16T17:08:10Z

An Agent has two confusingly named locks:

the outer lock Agent.stateLock
and the inner Agent.State embedded lock

Operations like Agent.addServiceInternal will grab the outer lock and then briefly acquire the inner lock to clone the current state of all of the checks registered on the agent. This is so that when a service is re-registered any checks re-added can have their statuses carried over.

Unfortunately there is a logical data race (rather than a -race race) whereby the individual check goroutines (like for an alias check) are independently sending updated statuses into the Agent.State and only acquiring the inner lock to do so.

Example situation:

1. (goroutine 1) User registers service "foo".
2. (goroutine 1) addServiceInternal locks the outer lock.
3. (goroutine 1) snapshotCheckState locks the inner lock, copies the check states, and unlocks the inner lock.
4. (goroutine 2) A checker determines the health of a check "bar" has changed and calls State.UpdateCheck to change from critical -> passing.
5. (goroutine 2) The inner lock is locked, the status for "bar" is flipped to passing, and the inner lock is unlocked.
6. (goroutine 1) The "foo" service finishes being added and the deferred restoreCheckState call walks the check snapshot captured in step 3.
7. (goroutine 1) The "bar" check is reverted back to critical, the value it had in step 3.
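The sequence above can be replayed deterministically in a small sketch. This is not Consul's code: the `State` and `Agent` types here are simplified stand-ins, `addServiceBuggy` is a hypothetical name, and the `duringAdd` hook is an assumption that plays the part of goroutine 2 so that the interleaving is reproducible.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a simplified stand-in for the agent's local state. The embedded
// mutex plays the role of the "inner" lock.
type State struct {
	sync.Mutex
	checks map[string]string // check ID -> status
}

// UpdateCheck takes only the inner lock, as the check goroutines do.
func (s *State) UpdateCheck(id, status string) {
	s.Lock()
	defer s.Unlock()
	s.checks[id] = status
}

// snapshot copies all check statuses under the inner lock.
func (s *State) snapshot() map[string]string {
	s.Lock()
	defer s.Unlock()
	snap := make(map[string]string, len(s.checks))
	for id, status := range s.checks {
		snap[id] = status
	}
	return snap
}

// Agent holds the "outer" lock plus the state.
type Agent struct {
	stateLock sync.Mutex
	State     *State
}

// addServiceBuggy (hypothetical) mimics the old flow: snapshot, do the work,
// then blindly restore every captured status on the way out.
func (a *Agent) addServiceBuggy(duringAdd func()) {
	a.stateLock.Lock()
	defer a.stateLock.Unlock()

	snap := a.State.snapshot() // step 3
	defer func() {             // steps 6-7: restore walks the whole snapshot
		for id, status := range snap {
			a.State.UpdateCheck(id, status)
		}
	}()

	duringAdd() // steps 4-5 happen while the service is being added
}

// replayRace returns the final status of check "bar".
func replayRace() string {
	a := &Agent{State: &State{checks: map[string]string{"bar": "critical"}}}
	a.addServiceBuggy(func() {
		// The checker only needs the inner lock, so the outer lock
		// held by addServiceBuggy does not stop this update.
		a.State.UpdateCheck("bar", "passing")
	})
	return a.State.snapshot()["bar"]
}

func main() {
	fmt.Println(replayRace()) // prints "critical": the update to passing was lost
}
```

No data race detector would flag this, because every map access is correctly locked; the problem is purely in the ordering of snapshot, update, and restore.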

The fix here is to only use the snapshot to seed the value of the check when it is initially inserted/updated, rather than letting the check be inserted/updated at the default unmeasured state of critical and restoring the whole snapshot afterwards.
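A minimal sketch of that idea, under stated assumptions: the `check` type, the `addCheck` helper, and the check IDs are all hypothetical and only illustrate seeding from a snapshot at insertion time. The actual change lives in agent/agent.go and is not reproduced here.

```go
package main

import "fmt"

const (
	statusCritical = "critical"
	statusPassing  = "passing"
)

// check is a hypothetical, stripped-down health check record.
type check struct {
	ID     string
	Status string
}

// addCheck inserts or updates a check. A newly built check starts at the
// default critical status; if the snapshot has a previous status for the
// same ID, that status is used as the seed instead.
func addCheck(checks map[string]*check, id string, snap map[string]string) {
	c := &check{ID: id, Status: statusCritical}
	if prev, ok := snap[id]; ok {
		c.Status = prev
	}
	checks[id] = c
}

// seedDemo returns the final statuses of "foo-check" and "bar".
func seedDemo() (string, string) {
	checks := map[string]*check{
		"foo-check": {ID: "foo-check", Status: statusPassing},
		"bar":       {ID: "bar", Status: statusCritical},
	}

	// Capture the statuses before the service is re-added.
	snap := make(map[string]string, len(checks))
	for id, c := range checks {
		snap[id] = c.Status
	}

	// A checker flips "bar" to passing after the snapshot was taken.
	checks["bar"].Status = statusPassing

	// Re-adding service "foo" only touches its own check, seeded from the
	// snapshot. There is no restore pass over unrelated checks.
	addCheck(checks, "foo-check", snap)

	return checks["foo-check"].Status, checks["bar"].Status
}

func main() {
	foo, bar := seedDemo()
	fmt.Println(foo, bar) // prints "passing passing"
}
```

Because the snapshot is consulted only for the check being inserted, a concurrent update to an unrelated check such as "bar" is never overwritten.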

banks

Yay! Great job tracking this down.

agent/agent.go

agent/agent_test.go

…ervices

This fixes issue hashicorp#7318.

Between versions 1.5.2 and 1.5.3, a regression was introduced regarding the health of services. A patch (hashicorp#6144) had been issued for health checks of nodes, but not for health checks of services.

What happened during a reload was:

1. save all health check statuses
2. clean up everything
3. add new services with their health checks

In step 3, the state of the health checks was looked up locally, but since everything had been cleaned up in step 2, that state was lost.

This PR introduces the snap parameter, so that step 3 can use the information saved in step 1.
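The three reload steps can be sketched roughly as follows. The `registry` type, the `reload` function, and the check IDs are assumptions made for illustration; only the idea of passing a `snap` argument into the add step comes from the commit message above.

```go
package main

import "fmt"

// registry is a hypothetical map of check ID -> status.
type registry map[string]string

// addCheck registers a check, preferring the status saved in snap over the
// default critical status.
func (r registry) addCheck(id string, snap map[string]string) {
	status := "critical"
	if prev, ok := snap[id]; ok {
		status = prev
	}
	r[id] = status
}

// reload follows the three steps: save, clean up, re-add.
func reload(r registry, configured []string) {
	// Step 1: save all health check statuses.
	snap := make(map[string]string, len(r))
	for id, status := range r {
		snap[id] = status
	}

	// Step 2: clean up everything.
	for id := range r {
		delete(r, id)
	}

	// Step 3: add the configured checks. Without snap, a lookup in r would
	// find nothing here, because step 2 already emptied it.
	for _, id := range configured {
		r.addCheck(id, snap)
	}
}

func main() {
	r := registry{"web-check": "passing"}
	reload(r, []string{"web-check", "new-check"})
	fmt.Println(r["web-check"], r["new-check"]) // prints "passing critical"
}
```

A check that existed before the reload keeps its last known status, while a check that is new in the configuration still starts at critical.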

fumikojandr3 · 2023-09-04T08:47:19Z

Service A registers with status=critical, then re-registers with status=passing. With this commit, the status after the re-registration is still critical. @rboyer

agent: avoid reverting any check updates that occur while a service is being added or the config is reloaded

d42a9e1

rboyer requested a review from a team July 16, 2019 17:08

rboyer self-assigned this Jul 16, 2019

banks approved these changes Jul 17, 2019

fixes from comments

9aadebe

rboyer merged commit b16d7f0 into master Jul 17, 2019

rboyer deleted the dont-revert-check-states-on-restore branch July 17, 2019 19:48

jameshartig mentioned this pull request Jul 27, 2019

TTL check marked critical but then reset to passing and not removed #6231

Closed

lswinblad mentioned this pull request Dec 10, 2019

Checks transition to critical state during reload #6914

Closed

pierresouchay mentioned this pull request Feb 19, 2020

Consul 1.5.3 changes check status behavior when doing a consul reload? #7318

Closed

pierresouchay mentioned this pull request Feb 25, 2020

[BUGFIX] Configuration reload does not discard Check's statuses for services #7345

Merged
