Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor retry to close watchFactory #3226

Merged
merged 1 commit into from
Oct 27, 2022
Merged

Conversation

bpickard22
Copy link
Contributor

@bpickard22 bpickard22 commented Oct 14, 2022

leverage stop channel on watch factory when periodic retry stopchannel is called

Problem: Hybrid test was creating creating an ovn-controller for some tests.
The tests could not use the waitGroup in the test so the afterEach function
(cleanup function after each test) could execute while leaving controllers up.
In the tests, we were leveraging the code in ovn.go to run the controllers
and creating a watcher pod on them. This issue was just a symptom of the main
issue which was that we were using the incorrect stopChannel in the retry logic.

Solution: The stopChannel that we use in the watchFactory is private, so we
created a function that when called will close the watcher stopChannel,
closing the watcher. We then call this function in the select
(basically a switch statement in go but for channels) to close the watcher,
because when we stop watching an object, that means we no longer care about it
and should stop retrying it.

Signed-off-by: Ben Pickard bpickard@redhat.com
Closes: #3203

@coveralls
Copy link

coveralls commented Oct 17, 2022

Coverage Status

Coverage increased (+0.02%) to 52.961% when pulling 3eb11b1 on bpickard22:data-race into c72e852 on ovn-org:master.

Copy link
Contributor

@flavio-fernandes flavio-fernandes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add closes #3203 in commit msg
@trozet PTAL

go-controller/pkg/ovn/obj_retry.go Outdated Show resolved Hide resolved
@flavio-fernandes
Copy link
Contributor

/lgtm

@bpickard22 bpickard22 changed the title [draft] refactor retry to close watchFactory refactor retry to close watchFactory Oct 20, 2022
Copy link
Contributor

@trozet trozet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@bpickard22 can you please rebase?

@bpickard22 bpickard22 force-pushed the data-race branch 3 times, most recently from 01622a4 to e1a3919 Compare October 21, 2022 19:38
@flavio-fernandes
Copy link
Contributor

/lgtm
TY @bpickard22 !

@bpickard22
Copy link
Contributor Author

@ricky-rav PTAL

@bpickard22
Copy link
Contributor Author

/test Build-PR

@bpickard22
Copy link
Contributor Author

/test build-pr

@bpickard22
Copy link
Contributor Author

/retest

Copy link
Contributor

@flavio-fernandes flavio-fernandes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Let's squash all commits into one?

@bpickard22 bpickard22 force-pushed the data-race branch 2 times, most recently from 876dae2 to 6a6bf96 Compare October 26, 2022 21:38
@ricky-rav
Copy link
Contributor

Now we've introduced a new data race between r.watchFactory.Wg.Add(1) in the goroutine periodicallyRetryResources and wf.Wg.Wait() in Shutdown() in the main execution thread :)

WARNING: DATA RACE
Write at 0x00c01ce24b70 by goroutine 1183:
  runtime.racewrite()
      <autogenerated>:1 +0x24
  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).Shutdown()
      /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/factory/factory.go:364 +0xf6
  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*FakeOVN).shutdown()
      /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/ovn/ovn_test.go:98 +0x44
[...]

Previous read at 0x00c01ce24b70 by goroutine 2233:
  runtime.raceread()
      <autogenerated>:1 +0x24
  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/retry.(*RetryFramework).periodicallyRetryResources()
      /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/retry/obj_retry.go:363 +0x74
  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/retry.(*RetryFramework).WatchResourceFiltered.func4()
      /home/runner/work/ovn-kubernetes/ovn-kubernetes/go-controller/pkg/retry/obj_retry.go:663 +0x39

So Add and Wait want to be in the same thread... can we try moving Add to the line right before we call go r.periodicallyRetryResources() in obj_retry.go?

 // track the retry entries and every 30 seconds (or upon explicit request) check if any objects 
 // need to be retried 
 r.watchFactory.Wg.Add(1)
 go r.periodicallyRetryResources() 

leverage stop channel on watch factory when periodic retry stopchannel is called

Problem: Hybrid test was creating creating an ovn-controller for some tests.
The tests could not use the waitGroup in the test so the afterEach function
(cleanup function after each test) could execute while leaving controllers up.
In the tests, we were leveraging the code in ovn.go to run the controllers
and creating a watcher pod on them. This issue was just a symptom of the main
issue which was that we were using the incorrect stopChannel in the retry logic.

Solution: The stopChannel that we use in the watchFactory is private, so we
created a function that when called will close the watcher stopChannel,
closing the watcher. We then call this function in the select
(basically a switch statement in go but for channels) to close the watcher,
because when we stop watching an object, that means we no longer care about it
and should stop retrying it.

This also adds a public waitgroup property to the watchfactory that we
are using to ensure that the retry loop has exited before we shutdown
the watcher to avoid any races there

Since we no longer use the stopChannel in the retryFramework, this also
removes that stopChannel from the newRetryFramework function definition,
and all its refrences (which at this time was just in obj_retry_master)

Signed-off-by: Ben Pickard <bpickard@redhat.com>
Closes: ovn-org#3203

Remove stopChannel from retryFrameowork

We no longer use the controller stopChannel passed to the
retryFramework, as we use the watcher stopChannel, so we can remove it

Signed-off-by: Ben Pickard <bpickard@redhat.com>
@ricky-rav
Copy link
Contributor

/lgtm
Thanks again, Ben!

Copy link
Contributor

@flavio-fernandes flavio-fernandes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

NIT: the comment in the commit has creating mentioned twice. But that is no biggie.
"Problem: Hybrid test was creating creating an ovn-controller for some tests."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

data race between hybrid test code and controller
5 participants