data race between hybrid test code and controller #3203

Closed
trozet opened this issue Oct 4, 2022 · 1 comment · Fixed by #3226

trozet commented Oct 4, 2022

Saw this downstream:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1289/pull-ci-openshift-ovn-kubernetes-master-unit/1577350209917161472/build-log.txt

WARNING: DATA RACE
Read at 0x0000043fc0e9 by goroutine 1333:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).addLogicalPort()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/pods.go:674 +0x2ffa
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).ensurePod()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:550 +0x9b5
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).addResource()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/obj_retry.go:593 +0x185
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).resourceRetry.func1()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/obj_retry.go:1160 +0x1d84
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*RetryObjs).DoWithLock()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/obj_retry.go:95 +0x12a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).resourceRetry()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/obj_retry.go:1063 +0xdb
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).iterateRetryResources.func1()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/obj_retry.go:1197 +0x104
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).iterateRetryResources.func2()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/obj_retry.go:1198 +0x58

Previous write at 0x0000043fc0e8 by goroutine 1130:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/config.PrepareTestConfig()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/config/config.go:527 +0x535
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.glob..func9.1()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/hybrid_test.go:173 +0x73
github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113 +0xf5
github.com/onsi/ginkgo/internal/leafnodes.(*runner).run()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:64 +0x157
github.com/onsi/ginkgo/internal/leafnodes.(*SetupNode).Run()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/internal/leafnodes/setup_nodes.go:15 +0x94
github.com/onsi/ginkgo/internal/spec.(*Spec).runSample()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/internal/spec/spec.go:193 +0x6a8
github.com/onsi/ginkgo/internal/spec.(*Spec).Run()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/internal/spec/spec.go:138 +0x1ab
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpec()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:200 +0x15a
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpecs()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:170 +0x255
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/internal/specrunner/spec_runner.go:66 +0x136
github.com/onsi/ginkgo/internal/suite.(*Suite).Run()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/internal/suite/suite.go:79 +0x6f5
github.com/onsi/ginkgo.runSpecsWithCustomReporters()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/ginkgo_dsl.go:245 +0x212
github.com/onsi/ginkgo.RunSpecs()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/github.com/onsi/ginkgo/ginkgo_dsl.go:220 +0x267
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.TestClusterNode()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn_suite_test.go:12 +0xe4
testing.tRunner()
/usr/local/go/src/testing/testing.go:1439 +0x213
testing.(*T).Run.func1()
/usr/local/go/src/testing/testing.go:1486 +0x47

The reason is that there is no waitgroup to wait on between tests to make sure the controller is fully "shut down", or in our case that the retry object loop has stopped. This is because in these tests we never actually run the controller; we simply start WatchPods, and the retry goroutine there relies on the controller's stopChan. Perhaps we should change this to use the watchFactory's stopChan, or alternatively fix the tests to wait for proper shutdown of the controller.
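For illustration, a minimal Go sketch of the kind of coordination being suggested, assuming a hypothetical retry loop and stop channel (startRetryLoop, the ticker interval, and the channel names are illustrative, not the actual ovn-kubernetes code): the retry goroutine exits when the stop channel is closed and signals completion through a WaitGroup, so a test's cleanup can block until the goroutine has really exited before mutating shared configuration.

```go
// Illustrative sketch only (hypothetical names): a periodic retry loop that
// exits on a stop channel and signals completion via a WaitGroup, so test
// cleanup can block until the goroutine is gone before touching shared config.
package main

import (
	"fmt"
	"sync"
	"time"
)

func startRetryLoop(stopChan <-chan struct{}, wg *sync.WaitGroup) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				// iterate retry resources here
			case <-stopChan:
				return // stop retrying once the watcher/controller is shut down
			}
		}
	}()
}

func main() {
	var wg sync.WaitGroup
	stop := make(chan struct{})
	startRetryLoop(stop, &wg)

	// Test teardown: close the stop channel, then wait for the goroutine to
	// exit before the next test rewrites shared configuration.
	close(stop)
	wg.Wait()
	fmt.Println("retry loop fully stopped")
}
```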

@bpickard22

/assign @bpickard22

bpickard22 added a commit to bpickard22/ovn-kubernetes that referenced this issue Oct 21, 2022
leverage stop channel on watch factory when periodic retry stopchannel is called

closes issue -> ovn-org#3203

Problem: The hybrid test was creating an ovn-controller for some tests. The tests could not use the waitGroup, so the afterEach function (the cleanup function run after each test) could execute while leaving controllers up. In the tests, we were leveraging the code in ovn.go to run the controllers and creating a pod watcher on them. This issue was just a symptom of the main issue, which was that we were using the incorrect stopChannel in the retry logic.

Solution: The stopChannel that we use in the watchFactory is private, so we created a function that, when called, closes the watcher's stopChannel, shutting down the watcher. We then call this function in the select (basically a switch statement in Go, but for channels) to close the watcher, because once we stop watching an object we no longer care about it and should stop retrying it.

Signed-off-by: Ben Pickard <bpickard@redhat.com>
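A rough sketch of the approach described in that commit message, with illustrative names only (WatchFactory, Shutdown, GetStopChannel, and periodicRetry are assumptions here, not the actual ovn-kubernetes API): the watch factory keeps its stop channel private but exposes a function that closes it, and the periodic retry loop's select stops once that channel is closed.

```go
// Hedged sketch of the described idea (names are illustrative): expose a
// function that closes the watch factory's private stop channel, and have the
// retry select stop once the watcher is gone.
package main

import "sync"

type WatchFactory struct {
	stopChan chan struct{} // private: callers cannot close it directly
	once     sync.Once
}

func NewWatchFactory() *WatchFactory {
	return &WatchFactory{stopChan: make(chan struct{})}
}

// Shutdown closes the private stop channel exactly once.
func (wf *WatchFactory) Shutdown() {
	wf.once.Do(func() { close(wf.stopChan) })
}

// GetStopChannel returns the channel the retry loop selects on.
func (wf *WatchFactory) GetStopChannel() <-chan struct{} {
	return wf.stopChan
}

func periodicRetry(wf *WatchFactory, work <-chan struct{}) {
	for {
		select {
		case <-work:
			// retry the failed objects
		case <-wf.GetStopChannel():
			// the watcher is gone, so stop retrying its objects
			return
		}
	}
}

func main() {
	wf := NewWatchFactory()
	done := make(chan struct{})
	go func() {
		periodicRetry(wf, make(chan struct{}))
		close(done)
	}()
	wf.Shutdown() // stopping the watcher also stops the retry loop
	<-done
}
```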
bpickard22 added a commit to bpickard22/ovn-kubernetes that referenced this issue Oct 26, 2022
leverage stop channel on watch factory when periodic retry stopchannel is called

Problem: The hybrid test was creating an ovn-controller for some tests.
The tests could not use the waitGroup, so the afterEach function
(the cleanup function run after each test) could execute while leaving
controllers up. In the tests, we were leveraging the code in ovn.go to run
the controllers and creating a pod watcher on them. This issue was just a
symptom of the main issue, which was that we were using the incorrect
stopChannel in the retry logic.

Solution: The stopChannel that we use in the watchFactory is private, so we
created a function that, when called, closes the watcher's stopChannel,
shutting down the watcher. We then call this function in the select
(basically a switch statement in Go, but for channels) to close the watcher,
because once we stop watching an object we no longer care about it and
should stop retrying it.

This also adds a public waitgroup property to the watchFactory that we
use to ensure that the retry loop has exited before we shut down
the watcher, to avoid any races there.

Signed-off-by: Ben Pickard <bpickard@redhat.com>
Closes: ovn-org#3203
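A minimal sketch of the waitgroup idea from the commit message above; the field and method names (Wg, Shutdown, runRetryLoop) are assumptions for illustration, not the real ovn-kubernetes code. The factory carries an exported WaitGroup that retry goroutines register with, and Shutdown waits for them to exit after closing the stop channel.

```go
// Rough sketch, assuming hypothetical names: the watch factory carries an
// exported WaitGroup so callers can confirm retry goroutines have exited
// before the watcher itself is torn down.
package main

import "sync"

type WatchFactory struct {
	Wg       sync.WaitGroup // exported so retry loops can register themselves
	stopChan chan struct{}
}

func (wf *WatchFactory) Shutdown() {
	close(wf.stopChan) // tell retry loops to stop...
	wf.Wg.Wait()       // ...and wait until they have actually exited
	// only now is it safe to stop informers or mutate shared test state
}

func runRetryLoop(wf *WatchFactory) {
	wf.Wg.Add(1)
	go func() {
		defer wf.Wg.Done()
		<-wf.stopChan // placeholder for the real select-based retry loop
	}()
}

func main() {
	wf := &WatchFactory{stopChan: make(chan struct{})}
	runRetryLoop(wf)
	wf.Shutdown()
}
```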
bpickard22 added a commit to bpickard22/ovn-kubernetes that referenced this issue Oct 26, 2022
leverage stop channel on watch factory when periodic retry stopchannel is called

Problem: The hybrid test was creating an ovn-controller for some tests.
The tests could not use the waitGroup, so the afterEach function
(the cleanup function run after each test) could execute while leaving
controllers up. In the tests, we were leveraging the code in ovn.go to run
the controllers and creating a pod watcher on them. This issue was just a
symptom of the main issue, which was that we were using the incorrect
stopChannel in the retry logic.

Solution: The stopChannel that we use in the watchFactory is private, so we
created a function that, when called, closes the watcher's stopChannel,
shutting down the watcher. We then call this function in the select
(basically a switch statement in Go, but for channels) to close the watcher,
because once we stop watching an object we no longer care about it and
should stop retrying it.

This also adds a public waitgroup property to the watchFactory that we
use to ensure that the retry loop has exited before we shut down
the watcher, to avoid any races there.

Since we no longer use the stopChannel in the retryFramework, this also
removes that stopChannel from the newRetryFramework function definition
and all its references (which at this time was just in obj_retry_master).

Signed-off-by: Ben Pickard <bpickard@redhat.com>
Closes: ovn-org#3203

Remove stopChannel from retryFramework

We no longer use the controller stopChannel passed to the
retryFramework; since we use the watcher stopChannel, we can remove it.

Signed-off-by: Ben Pickard <bpickard@redhat.com>
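For context, a sketch of what that signature change might look like (types and names here are illustrative, not the actual code from the PR): newRetryFramework no longer takes a stop channel and instead derives the stop signal from the watch factory it already references.

```go
// Sketch of the described signature change, with illustrative types only.
package main

type WatchFactory struct{ stopChan chan struct{} }

func (wf *WatchFactory) GetStopChannel() <-chan struct{} { return wf.stopChan }

type RetryFramework struct {
	watchFactory *WatchFactory
}

// Before (roughly): newRetryFramework(stopChan chan struct{}, wf *WatchFactory, ...)
// After: the stopChan parameter is dropped entirely.
func newRetryFramework(wf *WatchFactory) *RetryFramework {
	return &RetryFramework{watchFactory: wf}
}

// stop exposes the watcher's stop channel to the retry loop.
func (r *RetryFramework) stop() <-chan struct{} {
	return r.watchFactory.GetStopChannel()
}

func main() {
	rf := newRetryFramework(&WatchFactory{stopChan: make(chan struct{})})
	_ = rf.stop()
}
```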
jcaamano pushed a commit to jcaamano/ovn-kubernetes that referenced this issue Jan 19, 2023
leverage stop channel on watch factory when periodic retry stopchannel is called

Problem: The hybrid test was creating an ovn-controller for some tests.
The tests could not use the waitGroup, so the afterEach function
(the cleanup function run after each test) could execute while leaving
controllers up. In the tests, we were leveraging the code in ovn.go to run
the controllers and creating a pod watcher on them. This issue was just a
symptom of the main issue, which was that we were using the incorrect
stopChannel in the retry logic.

Solution: The stopChannel that we use in the watchFactory is private, so we
created a function that, when called, closes the watcher's stopChannel,
shutting down the watcher. We then call this function in the select
(basically a switch statement in Go, but for channels) to close the watcher,
because once we stop watching an object we no longer care about it and
should stop retrying it.

This also adds a public waitgroup property to the watchFactory that we
use to ensure that the retry loop has exited before we shut down
the watcher, to avoid any races there.

Since we no longer use the stopChannel in the retryFramework, this also
removes that stopChannel from the newRetryFramework function definition
and all its references (which at this time was just in obj_retry_master).

Signed-off-by: Ben Pickard <bpickard@redhat.com>
Closes: ovn-org#3203

Remove stopChannel from retryFramework

We no longer use the controller stopChannel passed to the
retryFramework; since we use the watcher stopChannel, we can remove it.

Signed-off-by: Ben Pickard <bpickard@redhat.com>
(cherry picked from commit fc2c3b9)