container: protect health monitor channel #35482

Merged
merged 1 commit into moby:master from stevvooe:protect-health-monitor-channel on Nov 14, 2017

@stevvooe
Contributor

stevvooe commented Nov 13, 2017

While this code was likely called from a single thread before, we have
now seen panics, indicating that it could be called in parallel. This
change adds a mutex to protect opening and closing of the channel. There
may be another root cause associated with this panic, such as something
that led to the calling of this in parallel, as this code is old and we
had not seen this condition until recently.

This fix is by no means a permanent fix. Typically, bugs like this
indicate misplaced channel ownership. In idiomatic uses, the channel
should have a particular "owner" that coordinates sending and closure.
In this case, the owner of the channel is unclear, so it gets opened
lazily. Synchronizing this access is a decent solution, but a refactor
may yield better results.

Signed-off-by: Stephen J Day &lt;stephen.day@docker.com&gt;

Terror Bird doesn't like race conditions

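The pattern the fix describes can be sketched as follows. This is a minimal sketch, not the actual moby code: the `Health` type and `stop` field are assumptions modeled on the description, while `OpenMonitorChannel` and `CloseMonitorChannel` are the method names the discussion refers to.

```go
package main

import (
	"fmt"
	"sync"
)

// Health is a hypothetical stand-in for the container health state.
// mu protects stop, so open and close cannot interleave.
type Health struct {
	mu   sync.Mutex
	stop chan struct{}
}

// OpenMonitorChannel lazily creates the stop channel under the mutex,
// so two concurrent callers cannot both see nil and race to assign it.
func (h *Health) OpenMonitorChannel() chan struct{} {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.stop == nil {
		h.stop = make(chan struct{})
	}
	return h.stop
}

// CloseMonitorChannel closes and clears the channel under the same
// mutex, preventing a double close or a close of a nil channel.
func (h *Health) CloseMonitorChannel() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.stop != nil {
		close(h.stop)
		h.stop = nil
	}
}

func main() {
	h := &Health{}
	ch := h.OpenMonitorChannel()
	fmt.Println(ch != nil) // true
	h.CloseMonitorChannel()
	h.CloseMonitorChannel() // safe: second close is a no-op
	fmt.Println(h.stop == nil) // true
}
```

As the commit message notes, holding a mutex around both operations is a patch over unclear channel ownership rather than a redesign; a single owning goroutine would remove the need for the lock.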

@thaJeztah

LGTM, but would like @cpuguy83 to have a look as well

@tonistiigi
Member

tonistiigi commented Nov 14, 2017

cc @talex5

@vieux

vieux approved these changes Nov 14, 2017

LGTM

@yongtang yongtang merged commit e4d0fe8 into moby:master Nov 14, 2017

6 checks passed

- dco-signed: All commits are signed
- experimental: Jenkins build Docker-PRs-experimental 37826 has succeeded
- janky: Jenkins build Docker-PRs 46537 has succeeded
- powerpc: Jenkins build Docker-PRs-powerpc 6950 has succeeded
- windowsRS1: Jenkins build Docker-PRs-WoW-RS1 18094 has succeeded
- z: Jenkins build Docker-PRs-s390x 6754 has succeeded
@andrewhsu


Contributor

andrewhsu commented Nov 14, 2017

in the future, it would be good to have a health_test.go to accompany a fix like this, to exercise what we can verify

@thaJeztah


Member

thaJeztah commented Nov 14, 2017

@andrewhsu testing was already discussed, but the issue this is fixing is not possible to replicate in a testing scenario

@andrewhsu


Contributor

andrewhsu commented Nov 14, 2017

even if it can't replicate the failure scenario, just having the framework in place for a unit test prompts any future contributor to write an accompanying test where it's relevant.

just saying... not all fix scenarios can be tested in a unit test, but some of the auxiliary code can be verified to not have changed behaviour in a fix attempt.

@thaJeztah


Member

thaJeztah commented Nov 15, 2017

Adding empty boilerplate with the testing framework in place would make compilation fail (unused imports). Although I generally agree on having (unit) tests, we should guard against adding tests that don't test anything, as they give a false sense of coverage.

@stevvooe stevvooe deleted the stevvooe:protect-health-monitor-channel branch Nov 15, 2017

@stevvooe


Contributor

stevvooe commented Nov 15, 2017

@andrewhsu If a valid test were possible here, I would have included one. Unfortunately, races such as these fall into the category of a negative assertion. One cannot prove that there is not a race with a test that doesn't cover all of time (without the race detector). We have to rely on our mental faculties and experience to identify and address these kinds of conditions. In that framework, we can also only show that we are reasonably certain.

In this case, we have a race condition, where the value of the stop field ends up with a few different values depending on the race. In all cases, the first if statement sees that stop is not nil. After this branch, we get the problems. In one case, the thread sees stop and simply closes the channel and sets it to nil. Another thread comes along, seeing the partial result of that operation, leading to a panic on close or a panic on nil, depending on what point the value of stop is synced into the other core on the other thread.

Indeed, we can reproduce that situation in a test case by running CloseMonitorChannel and OpenMonitorChannel alternately in separate threads to produce the condition. The problem is that reproducing the ordering here is going to be probabilistic, so you'll have to identify a number of threads that will reproduce the timing with a high probability. In practice, this is a huge waste of resources: the test only has a chance of reproducing the condition. Even then, we'd really only be testing Go's runtime behavior and the implementation of the synchronization primitives, so the value is limited.

Ideally, the functionality here is covered by some functional test that ensures these implementation details are working. It seems like the concurrency level of that testing might not be sufficient. I am not wildly familiar with the root cause, but there was some talk of a higher-level condition that could cause this. A test there, with the race detector enabled, would likely address this issue more thoroughly, and expose other possible issues.
