
Fix flaky unit test Test_Run_Positive_VolumeMountControllerAttachEnabledRace data race #94676

Conversation

@JornShen (Member) commented Sep 10, 2020

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

1. What failed

One failing run's output:

WARNING: DATA RACE
Write at 0x00c000116ba0 by goroutine 20:
  k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler.Test_Run_Positive_VolumeMountControllerAttachEnabledRace()
      /opt/workspace/myspace/k8s/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler_test.go:1794 +0x12fe
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1108 +0x202

Previous read at 0x00c000116ba0 by goroutine 25:
  k8s.io/kubernetes/pkg/volume/testing.(*FakeVolumePlugin).getFakeVolume()
      /opt/workspace/myspace/k8s/kubernetes/pkg/volume/testing/testing.go:416 +0x5bd
  k8s.io/kubernetes/pkg/volume/testing.(*FakeVolumePlugin).NewAttacher()
      /opt/workspace/myspace/k8s/kubernetes/pkg/volume/testing/testing.go:560 +0xf9
  k8s.io/kubernetes/pkg/volume/testing.(*FakeVolumePlugin).NewDeviceMounter()
      /opt/workspace/myspace/k8s/kubernetes/pkg/volume/testing/testing.go:564 +0x3c
  k8s.io/kubernetes/pkg/volume/util/operationexecutor.(*operationGenerator).GenerateMountVolumeFunc.func1()
      /opt/workspace/myspace/k8s/kubernetes/pkg/volume/util/operationexecutor/operation_generator.go:540 +0x2251
  k8s.io/kubernetes/pkg/volume/util/types.(*GeneratedOperations).Run()
      /opt/workspace/myspace/k8s/kubernetes/pkg/volume/util/types/types.go:54 +0x142
  k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations.(*nestedPendingOperations).Run.func1()
      /opt/workspace/myspace/k8s/kubernetes/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go:183 +0x159

Goroutine 20 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1159 +0x796
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:1430 +0xa6
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1108 +0x202
  testing.runTests()
      /usr/local/go/src/testing/testing.go:1428 +0x5aa
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:1338 +0x4eb
  main.main()
      _testmain.go:73 +0x236

Goroutine 25 (running) created at: 
  k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations.(*nestedPendingOperations).Run()
      /opt/workspace/myspace/k8s/kubernetes/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go:178 +0x5b2
  k8s.io/kubernetes/pkg/volume/util/operationexecutor.(*operationExecutor).MountVolume()
      /opt/workspace/myspace/k8s/kubernetes/pkg/volume/util/operationexecutor/operation_executor.go:839 +0x2b0
  k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler.(*reconciler).mountAttachVolumes()
      /opt/workspace/myspace/k8s/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:255 +0x974
I0909 22:54:05.010993   24541 operation_generator.go:595] MountVolume.MountDevice succeeded for volume "volume-name" (UniqueName: "fake-plugin/fake-device1") pod "pod1" (UID: "pod1uid") device mount path ""
  k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler.(*reconciler).reconcile()
      /opt/workspace/myspace/k8s/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:174 +0x46
  k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler.(*reconciler).reconciliationLoopFunc.func1()
      /opt/workspace/myspace/k8s/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:151 +0x4a
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /opt/workspace/myspace/k8s/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x75
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /opt/workspace/myspace/k8s/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xba
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /opt/workspace/myspace/k8s/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x114
  k8s.io/apimachinery/pkg/util/wait.Until()
      /opt/workspace/myspace/k8s/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x8e
  k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler.(*reconciler).Run()
      /opt/workspace/myspace/k8s/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go:146 +0x2f
  k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler.Test_Run_Positive_VolumeMountControllerAttachEnabledRace.func1()
      /opt/workspace/myspace/k8s/kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler_test.go:1785 +0x48

From the output, we can see two goroutines racing on the WaitForAttachHook and UnmountDeviceHook fields of FakeVolumePlugin.

One goroutine (goroutine 25 above) is created inside nestedPendingOperations.Run and ultimately originates from Test_Run_Positive_VolumeMountControllerAttachEnabledRace.

This child goroutine reads the WaitForAttachHook and UnmountDeviceHook fields of FakeVolumePlugin:

volume.WaitForAttachHook = plugin.WaitForAttachHook
volume.UnmountDeviceHook = plugin.UnmountDeviceHook

while the main goroutine in Test_Run_Positive_VolumeMountControllerAttachEnabledRace resets the WaitForAttachHook and UnmountDeviceHook fields of FakeVolumePlugin:

fakePlugin.UnmountDeviceHook = func(mountPath string) error {

fakePlugin.WaitForAttachHook = func(spec *volume.Spec, devicePath string, pod *v1.Pod, spectimeout time.Duration) (string, error) {

and they cause the data race (one goroutine reads the fields while the other writes them). A self-contained sketch of the same pattern follows.
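To make the failure mode concrete, here is a minimal standalone sketch of the same pattern, using hypothetical names rather than the actual kubelet test code. Running it with `go run -race` reports the same kind of read/write conflict: the read side takes the lock, but a lock held on only one side establishes no happens-before with the unsynchronized write.

package main

import (
    "sync"
    "time"
)

// fakePlugin mirrors the shape of FakeVolumePlugin: an embedded RWMutex
// plus a function-valued hook field. All names here are illustrative.
type fakePlugin struct {
    sync.RWMutex
    WaitForAttachHook func() (string, error)
}

// readHook reads the hook under the lock, as NewAttacher/getFakeVolume do.
func (p *fakePlugin) readHook() func() (string, error) {
    p.RLock()
    defer p.RUnlock()
    return p.WaitForAttachHook
}

func main() {
    p := &fakePlugin{}

    // Child goroutine: the operation-executor side, repeatedly reading the hook.
    go func() {
        for i := 0; i < 1000; i++ {
            _ = p.readHook()
        }
    }()

    // Main goroutine: resets the hook WITHOUT taking p.Lock(), which is the
    // unsynchronized write the race detector reported in the test.
    for i := 0; i < 1000; i++ {
        p.WaitForAttachHook = func() (string, error) { return "/dev/fake", nil }
    }

    time.Sleep(100 * time.Millisecond)
}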

2. How to deal with it

The child goroutine's side of the race is in the getFakeVolume method, whose caller is the NewAttacher method:

func (plugin *FakeVolumePlugin) NewAttacher() (Attacher, error) {
    plugin.Lock()
    defer plugin.Unlock()
    plugin.NewAttacherCallCount = plugin.NewAttacherCallCount + 1
    return plugin.getFakeVolume(&plugin.Attachers), nil
}

So the FakeVolumePlugin object embeds a lock to synchronize changes to its data:

type FakeVolumePlugin struct {
    sync.RWMutex
    PluginName string

So from the outside, in Test_Run_Positive_VolumeMountControllerAttachEnabledRace, we should acquire that lock before resetting the WaitForAttachHook and UnmountDeviceHook fields of FakeVolumePlugin, roughly as in the sketch below.
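A minimal sketch of what the fix looks like on the test side (the hook signatures come from the test code quoted above; the assertion bodies are elided, and the Lock/Unlock pair around the writes is the substance of the change):

// Take the plugin's embedded mutex so these writes are synchronized
// with the locked reads in NewAttacher/getFakeVolume.
fakePlugin.Lock()
fakePlugin.UnmountDeviceHook = func(mountPath string) error {
    // ... test-specific assertions elided ...
    return nil
}
fakePlugin.WaitForAttachHook = func(spec *volume.Spec, devicePath string, pod *v1.Pod, spectimeout time.Duration) (string, error) {
    // ... test-specific assertions elided ...
    return "", nil
}
fakePlugin.Unlock()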

3. Result of the test with my fix

Results of repeated runs on my local machine:

34017 runs so far, 0 failures
34096 runs so far, 0 failures

Which issue(s) this PR fixes:

Fixes #94568
Part of #94528 (comment)

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 10, 2020
@k8s-ci-robot (Contributor)

Hi @JornShen. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 10, 2020
@JornShen (Member, Author) commented Sep 10, 2020

/cc @liggitt

@k8s-ci-robot (Contributor)

@JornShen: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Lion-Wei

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 10, 2020
@figo (Contributor) commented Sep 11, 2020

/test pull-kubernetes-bazel-test

@JornShen (Member, Author)

/cc @liggitt @knight42 @msau42

@gnufied (Member) commented Sep 14, 2020

/assign

@JornShen (Member, Author)

ping @gnufied

@JornShen JornShen force-pushed the fix_Test_Run_Positive_VolumeMountControllerAttachEnabledRace_data_trace branch from c0870c9 to b6b462b on October 19, 2020 08:29
@msau42 (Member) commented Oct 26, 2020

Apologies for the delay.

Looking at the test, the test stops and waits for the reconciler to stop before proceeding to update the hooks. It seems like maybe the issue could be that stopping reconciler doesn't wait for nested pending operations to finish?

@JornShen (Member, Author)

> Apologies for the delay.
>
> Looking at the test, the test stops and waits for the reconciler to stop before proceeding to update the hooks. It seems like maybe the issue could be that stopping reconciler doesn't wait for nested pending operations to finish?

@msau42 yeah, it can be seen from the trace log: the reconciler will not wait for pendingOperations.Run, since the latter spawns a goroutine to execute generatedOperations.Run(), which reads the hooks inside. The sketch below illustrates the general hazard.
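To illustrate that point (a generic sketch, not the actual reconciler code): stopping the loop that launches operations does not stop operation goroutines that are already in flight.

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    stopCh := make(chan struct{})
    var wg sync.WaitGroup

    // A stand-in for the reconciler loop: it spawns fire-and-forget
    // "operation" goroutines, much like nestedPendingOperations.Run does.
    wg.Add(1)
    go func() {
        defer wg.Done()
        for {
            select {
            case <-stopCh:
                return
            default:
            }
            go func() {
                time.Sleep(10 * time.Millisecond)
                fmt.Println("operation still running after the loop stopped")
            }()
            time.Sleep(time.Millisecond)
        }
    }()

    time.Sleep(5 * time.Millisecond) // let a few operations start
    close(stopCh)
    wg.Wait() // the loop goroutine has exited...
    // ...but in-flight operation goroutines may still be executing here,
    // which is why hooks reset at this point can race with them.
    time.Sleep(50 * time.Millisecond)
}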

@msau42 (Member) commented Oct 27, 2020

I think adding the lock is fine. If the pending operation is still running, then it will finish without the hook set, and a future reconciler loop will run the operation again with the hook set.

@msau42 (Member) commented Oct 27, 2020

/lgtm
/approve

Thanks for the fix!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 27, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JornShen, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 27, 2020
@JornShen (Member, Author)

/kind bug
/kind flake

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Oct 28, 2020
@JornShen (Member, Author)

/test pull-kubernetes-e2e-kind
/test pull-kubernetes-e2e-gce-ubuntu-containerd

@JornShen (Member, Author)

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@JornShen (Member, Author)

/test pull-kubernetes-node-e2e
