Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: avoid reverting any check updates that occur while a service is being added or the config is reloaded #6144

Merged
merged 2 commits into from
Jul 17, 2019

Conversation

rboyer
Copy link
Member

@rboyer rboyer commented Jul 16, 2019

An Agent has two confusingly named locks:

  • the outer lock Agent.stateLock
  • and the inner Agent.State embedded lock

Operations like Agent.addServiceInternal will grab the outer lock and then briefly acquire the inner lock to clone the current state of all of the checks registered on the agent. This is so that when a service is re-registered any checks re-added can have their statuses carried over.

Unfortunately there is a logical data race (rather than a -race race) whereby the individual check goroutines (like for an alias check) are independently sending updated statuses into the Agent.State and only acquiring the inner lock to do so.

Example situation:

  1. (goroutine 1) User registers service "foo".
  2. (goroutine 1) addServiceInternal locks the outer lock
  3. (goroutine 1) snapshotCheckState locks the inner lock, copies the check states, and unlocks the inner lock
  4. (goroutine 2) A checker determines the health of a check "bar" has changed and calls State.UpdateCheck to change from critical -> passing.
  5. (goroutine 2) The inner lock is locked, the status for "bar" is flipped to passing, and the inner lock is unlocked.
  6. (goroutine 1) The "foo" service finishes being added and the defer restoreCheckState call walks the captured check snapshot from (3).
  7. (goroutine 1) The previous value of the "bar" check is reverted back to critical as it was in step (3).

The fix here is to only use the snapshot to seed the value of the check initially inserted/updated, rather than letting the check be inserted/updated at the default unmeasured state of critical.

@rboyer rboyer requested a review from a team July 16, 2019 17:08
@rboyer rboyer self-assigned this Jul 16, 2019
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay! Great job tracking this down.

agent/agent.go Outdated Show resolved Hide resolved
agent/agent_test.go Show resolved Hide resolved
@rboyer rboyer merged commit b16d7f0 into master Jul 17, 2019
@rboyer rboyer deleted the dont-revert-check-states-on-restore branch July 17, 2019 19:48
pierresouchay added a commit to pierresouchay/consul that referenced this pull request Feb 25, 2020
…ervices

This fixes issue hashicorp#7318

Between versions 1.5.2 and 1.5.3, a regression has been introduced regarding health
of services. A patch hashicorp#6144 had been issued for HealthChecks of nodes, but not for healthchecks
of services.

What happened when a reload was:

1. save all healthcheck statuses
2. cleanup everything
3. add new services with healthchecks

In step 3, the state of healthchecks was taken into account locally,
so at step 3, but since we cleaned up at step 2, state was lost.

This PR introduces the snap parameter, so step 3 can use information from step 1
hanshasselberg pushed a commit that referenced this pull request Mar 9, 2020
…7345)

This fixes issue #7318

Between versions 1.5.2 and 1.5.3, a regression has been introduced regarding health
of services. A patch #6144 had been issued for HealthChecks of nodes, but not for healthchecks
of services.

What happened when a reload was:

1. save all healthcheck statuses
2. cleanup everything
3. add new services with healthchecks

In step 3, the state of healthchecks was taken into account locally,
so at step 3, but since we cleaned up at step 2, state was lost.

This PR introduces the snap parameter, so step 3 can use information from step 1
freddygv pushed a commit that referenced this pull request Mar 12, 2020
…7345)

This fixes issue #7318

Between versions 1.5.2 and 1.5.3, a regression has been introduced regarding health
of services. A patch #6144 had been issued for HealthChecks of nodes, but not for healthchecks
of services.

What happened when a reload was:

1. save all healthcheck statuses
2. cleanup everything
3. add new services with healthchecks

In step 3, the state of healthchecks was taken into account locally,
so at step 3, but since we cleaned up at step 2, state was lost.

This PR introduces the snap parameter, so step 3 can use information from step 1
@fumikojandr3
Copy link

Service A register with status=critical,then re-register with status=passing.This commit makes the "re-register" status still being critical. @rboyer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants