-
Notifications
You must be signed in to change notification settings - Fork 39.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: fix potential blocking problems in unit tests #103667
Conversation
Signed-off-by: Ziheng Liu <lzhfromustc@gmail.com>
Hi @lzhfromustc. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: lzhfromustc The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/ok-to-test |
/retest |
/priority important-longterm |
/assign @cynepco3hahue |
@@ -501,6 +501,10 @@ udpIdleTimeout: 250ms`) | |||
errCh := make(chan error, 1) | |||
go func() { | |||
errCh <- opt.runLoop() | |||
select { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please give an example when it failed?
I am just curious how adding the select should fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the description part 1,
During the second testcase "fake error", channel opt.errCh will be sent a non-nil error (server.go:329) and a nil error (server.go:269).
However, its receive will only be executed once because the loop will be broken after the first non-nil error, as shown below:
kubernetes/cmd/kube-proxy/app/server.go
Lines 332 to 337 in 234d731
for {
err := <-o.errCh
if err != nil {
return err
}
}
This fix adds a select that can receive the second error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so please correct me if I am wrong, but the method
kubernetes/cmd/kube-proxy/app/server.go
Line 265 in 1098899
func (o *Options) eventHandler(ent fsnotify.Event) { |
kubernetes/cmd/kube-proxy/app/server.go
Line 274 in 1098899
o.errCh <- nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used a debugger to run this unit test, and found that (*Options).eventHandler() is called once, by this call chain:
err = opt.Complete() at cmd/kube-proxy/app/server_test.go:495
if err := o.initWatcher(); err != nil { at cmd/kube-proxy/app/server.go:233
err := fswatcher.Init(o.eventHandler, o.errorHandler) at cmd/kube-proxy/app/server.go:248
w.watcher, err = fsnotify.NewWatcher() at pkg/util/filesystem/watcher.go:63
go w.readEvents() at vendor/github.com/fsnotify/fsnotify/inotify.go:59
case w.Events <- event: at vendor/github.com/fsnotify/fsnotify/inotify.go:285
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think eventHandle() is called only for the first unit test. Sometimes, it is called once, but sometimes it is called twice. The call chain is exactly the same for the two runs.
@@ -51,7 +51,7 @@ func TestExtractFromNonExistentFile(t *testing.T) { | |||
} | |||
|
|||
func TestUpdateOnNonExistentFile(t *testing.T) { | |||
ch := make(chan interface{}) | |||
ch := make(chan interface{}, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what the point to make it unblocking channel when we have select anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the reason for making it unblocking is from consideration of unblocked sending. If select goes with timeout, it means no receiver for ch
, then sending operation in NewSourceFile
will be blocked if channel is unbuffered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cynepco3hahue There is a child goroutine created in NewSourceFile (line 95 in file.go). The child goroutine sends a message to ch (line 129 in file.go). If the select chooses the timeout case, the child goroutine is leaked.
@cynepco3hahue Hi here, kindly ask do we have any update on this PR? |
@charlesxsh I will give it another review round today. |
@@ -501,6 +501,10 @@ udpIdleTimeout: 250ms`) | |||
errCh := make(chan error, 1) | |||
go func() { | |||
errCh <- opt.runLoop() | |||
select { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so please correct me if I am wrong, but the method
kubernetes/cmd/kube-proxy/app/server.go
Line 265 in 1098899
func (o *Options) eventHandler(ent fsnotify.Event) { |
kubernetes/cmd/kube-proxy/app/server.go
Line 274 in 1098899
o.errCh <- nil |
@@ -184,6 +184,7 @@ func TestMonitorShutdown(t *testing.T) { | |||
signal := &dbus.Signal{Body: []interface{}{tc.shutdownActive}} | |||
fakeSystemBus.signalChannel <- signal | |||
<-done | |||
close(fakeSystemBus.signalChannel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add defer close(fakeSystemBus.signalChannel)
after the fakeSystemBus := &fakeSystemDBus{}
to make sure we will close it also when the test failed in the middle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cynepco3hahue no we cannot do that. fakeSystemBus.signalChannel
is initialized in MonitorShutdown()
. If we use defer
after fakeSystemBus := &fakeSystemDBus{}
, we are actually defer close(nil)
.
@cynepco3hahue all questions are answered. Could you help review this pull request again? |
sure, will review it on the week |
/assign @bowei @dchen1107 |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What type of PR is this?
/kind bug
/kind failing-test
What this PR does / why we need it:
This PR fixes several potential blocking problems in unit tests. The fixes are explained as following:
(1) server_test.go
During the second testcase "fake error", channel
opt.errCh
will be sent a non-nil error (server.go:329) and a nil error (server.go:269).However, its receive will only be executed once because the loop will be broken after the first non-nil error, as shown below:
kubernetes/cmd/kube-proxy/app/server.go
Lines 332 to 337 in 234d731
This PR adds a select that can receive the second error.
(2) cloud_cidr_allocator_test.go
The goroutine running
go ca.worker(stopChan)
will never be stopped becausestopChan
is never closed.This PR adds the close to this channel.
(3) sync_test.go
This unit test will only send one message to
fake.reportChan
, which is at sync.go:148. However, it has two receive operations for this channel.This PR puts the second receive operation into a select that can unblock together with the unit test goroutine, so it won't be blocked.
(4) file_linux_test.go
ch
in this unit test is alias tos.updates
below:kubernetes/pkg/kubelet/config/file.go
Lines 124 to 131 in 234d731
It will be blocked if the unit test calls
t.Fatalf()
and doesn't receive fromch
.This PR adds 1 buffer to
ch
to make the send operation non blocking.(5) inhibit_linux_test.go
fakeSystemBus.signalChannel
is alias tobusChan
below:kubernetes/pkg/kubelet/nodeshutdown/systemd/inhibit_linux.go
Lines 145 to 156 in 234d731
Note that the goroutine created above can stop only when
busChan
is closed, which never happens.This PR adds the close operation of it.
Special notes for your reviewer:
Sorry that these problems may be somewhat trivial and can only waste some memory during testing. We found them by our fuzzer that just report any blocking in the program.
For the
worker()
mentioned in cloud_cidr_allocator_test.go, we also noticed that there are six places usingwait.NeverStop
when calling several worker functions, which also result in never stopped workers.Wondering if there is any way to avoid such behavior.
These places are: pkg/kubelet/cm/devicemanager/manager_test.go:275; pkg/kubelet/kubelet.go:1438; pkg/controller/nodeipam/ipam/range_allocator_test.go:554 & 652 & 794
Does this PR introduce a user-facing change?