test: fix potential blocking problems in unit tests #103667

lzhfromustc · 2021-07-13T14:52:13Z

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

This PR fixes several potential blocking problems in unit tests. The fixes are explained below:

(1) server_test.go
During the second test case, "fake error", two values are sent on opt.errCh: a non-nil error (server.go:329) and a nil error (server.go:269).
However, only one receive is executed, because the loop returns after the first non-nil error, as shown below:

kubernetes/cmd/kube-proxy/app/server.go

Lines 332 to 337 in 234d731

    for {
    	err := <-o.errCh
    	if err != nil {
    		return err
    	}
    }

This PR adds a select that can receive the second error.
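
The effect of the extra receive can be sketched in isolation. This is a minimal illustration, not the PR's actual change (the body of the added select is not shown in this thread). It assumes errCh is unbuffered, and the names runLoop, drainOnce, and simulate are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// runLoop mirrors the quoted loop: it returns on the first non-nil error
// and never receives from errCh again.
func runLoop(errCh <-chan error) error {
	for {
		if err := <-errCh; err != nil {
			return err
		}
	}
}

// drainOnce performs one extra receive so a sender still blocked on errCh
// can finish. The timeout is an assumption of this sketch; it keeps the
// caller from blocking if no further value arrives.
func drainOnce(errCh <-chan error, timeout time.Duration) bool {
	select {
	case <-errCh:
		return true
	case <-time.After(timeout):
		return false
	}
}

// simulate sends a non-nil error followed by nil on an unbuffered channel
// and reports whether the sender goroutine was able to finish.
func simulate() (senderFinished bool, err error) {
	errCh := make(chan error)
	sent := make(chan struct{})
	go func() {
		defer close(sent)
		errCh <- errors.New("fake error") // consumed by runLoop
		errCh <- nil                      // blocks until drainOnce receives it
	}()

	err = runLoop(errCh)
	drainOnce(errCh, time.Second)

	select {
	case <-sent:
		senderFinished = true
	case <-time.After(time.Second):
	}
	return senderFinished, err
}

func main() {
	finished, err := simulate()
	fmt.Println(finished, err)
}
```

Without the call to drainOnce, the second send in the goroutine would have no receiver and the goroutine would stay blocked for the rest of the test run.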

(2) cloud_cidr_allocator_test.go
The goroutine started by go ca.worker(stopChan) never stops because stopChan is never closed.
This PR closes the channel.
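
A small sketch of why the close matters, using a hypothetical worker function rather than the real ca.worker (whose body is not quoted here). The items and exited channels exist only so the sketch can observe the worker.

```go
package main

import (
	"fmt"
	"time"
)

// worker is a hypothetical stand-in for a worker loop: it handles items
// until stopCh is closed, then closes exited to signal that it returned.
func worker(stopCh <-chan struct{}, items <-chan int, exited chan<- struct{}) {
	defer close(exited)
	for {
		select {
		case <-items:
			// Process the item (omitted in this sketch).
		case <-stopCh:
			return
		}
	}
}

// workerStops starts a worker, optionally closes its stop channel, and
// reports whether the worker returned within 200ms.
func workerStops(closeStop bool) bool {
	stopCh := make(chan struct{})
	items := make(chan int)
	exited := make(chan struct{})
	go worker(stopCh, items, exited)

	items <- 1 // the worker is running and receiving
	if closeStop {
		close(stopCh)
	}
	select {
	case <-exited:
		return true
	case <-time.After(200 * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println(workerStops(false), workerStops(true))
}
```

In a test, the same effect is usually obtained with defer close(stopCh) right after the channel is created.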

(3) sync_test.go
This unit test sends only one message to fake.reportChan (at sync.go:148), but performs two receive operations on that channel.
This PR moves the second receive into a select that also unblocks when the unit test goroutine finishes, so it no longer blocks forever.
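
The pattern can be sketched as follows. The names waitForReports, done, and demo are hypothetical, and the real test's channel element type is not shown in this thread, so string is an assumption.

```go
package main

import "fmt"

// waitForReports receives one report unconditionally, then waits for a
// second report only until done is closed.
func waitForReports(reportChan <-chan string, done <-chan struct{}) []string {
	got := []string{<-reportChan}
	select {
	case r := <-reportChan:
		got = append(got, r)
	case <-done:
		// The test is finished; stop waiting for a report that never comes.
	}
	return got
}

// demo sends a single report and then closes done, as a test would do on
// exit, and returns what the receiver collected.
func demo() []string {
	reportChan := make(chan string)
	done := make(chan struct{})
	result := make(chan []string, 1)
	go func() { result <- waitForReports(reportChan, done) }()

	reportChan <- "first"
	close(done)
	return <-result
}

func main() {
	fmt.Println(demo())
}
```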

(4) file_linux_test.go
ch in this unit test is an alias for s.updates below:

kubernetes/pkg/kubelet/config/file.go

Lines 124 to 131 in 234d731

    if err != nil {
    	if !os.IsNotExist(err) {
    		return err
    	}
    	// Emit an update with an empty PodList to allow FileSource to be marked as seen
    	s.updates <- kubetypes.PodUpdate{Pods: []*v1.Pod{}, Op: kubetypes.SET, Source: kubetypes.FileSource}
    	return fmt.Errorf("path does not exist, ignoring")
    }

The send blocks forever if the unit test calls t.Fatalf() without receiving from ch.
This PR gives ch a buffer of size 1 so that the send does not block.
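
To illustrate the difference the buffer makes, here is a minimal sketch in which nobody ever receives from the channel. The emit and producerReturns functions are hypothetical, and the 200ms wait is an arbitrary choice for the sketch.

```go
package main

import (
	"fmt"
	"time"
)

// emit is a hypothetical stand-in for the producer: it performs exactly
// one send and then signals that it returned.
func emit(updates chan<- interface{}, returned chan<- struct{}) {
	updates <- "empty pod list"
	close(returned)
}

// producerReturns starts emit on a channel with the given buffer size and
// never receives from it, as when a test bails out early. It reports
// whether emit returned within 200ms.
func producerReturns(buffer int) bool {
	updates := make(chan interface{}, buffer)
	returned := make(chan struct{})
	go emit(updates, returned)

	select {
	case <-returned:
		return true
	case <-time.After(200 * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println(producerReturns(0), producerReturns(1))
}
```

A buffer of 1 is enough here only because the producer sends a single value; a producer that sends more than once would still block on the second send.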

(5) inhibit_linux_test.go
fakeSystemBus.signalChannel is an alias for busChan below:

kubernetes/pkg/kubelet/nodeshutdown/systemd/inhibit_linux.go

Lines 145 to 156 in 234d731

    busChan := make(chan *dbus.Signal, 1)
    bus.SystemBus.Signal(busChan)

    shutdownChan := make(chan bool, 1)

    go func() {
    	for {
    		event, ok := <-busChan
    		if !ok {
    			close(shutdownChan)
    			return
    		}

Note that the goroutine created above stops only when busChan is closed, which never happens.
This PR adds the missing close.
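
The shutdown path can be sketched with plain bool events in place of *dbus.Signal. The monitor function below is hypothetical, and because the quoted snippet is truncated, the forwarding step after the !ok check is an assumption of this sketch.

```go
package main

import "fmt"

// monitor forwards events from busChan until busChan is closed, then
// closes the returned channel and lets its goroutine exit.
func monitor(busChan <-chan bool) <-chan bool {
	shutdownChan := make(chan bool, 1)
	go func() {
		for {
			event, ok := <-busChan
			if !ok {
				close(shutdownChan)
				return
			}
			shutdownChan <- event // assumed handling; the original is truncated
		}
	}()
	return shutdownChan
}

// demo sends one event, closes busChan, and reports the forwarded event
// and whether shutdownChan is still open afterwards.
func demo() (event bool, stillOpen bool) {
	busChan := make(chan bool, 1)
	shutdownChan := monitor(busChan)

	busChan <- true
	event = <-shutdownChan

	close(busChan) // without this, the goroutine waits on busChan forever
	_, stillOpen = <-shutdownChan
	return event, stillOpen
}

func main() {
	fmt.Println(demo())
}
```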

Special notes for your reviewer:

Sorry that these problems may be somewhat trivial and only waste some memory during testing. We found them with our fuzzer, which reports any blocking in the program.

For the worker() mentioned in cloud_cidr_allocator_test.go, we also noticed six places that pass wait.NeverStop when calling worker functions, which likewise results in workers that never stop.
Is there any way to avoid this behavior?
These places are: pkg/kubelet/cm/devicemanager/manager_test.go:275; pkg/kubelet/kubelet.go:1438; pkg/controller/nodeipam/ipam/range_allocator_test.go:554 & 652 & 794

Does this PR introduce a user-facing change?

NONE

Signed-off-by: Ziheng Liu <lzhfromustc@gmail.com>

k8s-ci-robot · 2021-07-13T14:52:21Z

Hi @lzhfromustc. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2021-07-13T14:52:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lzhfromustc
To complete the pull request process, please assign bowei, dchen1107 after the PR has been reviewed.
You can assign the PR to them by writing /assign @bowei @dchen1107 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

endocrimes · 2021-07-20T13:51:37Z

/ok-to-test

endocrimes · 2021-07-20T15:25:28Z

/retest

SergeyKanzhelev · 2021-07-21T16:24:21Z

/priority important-longterm
/triage accepted

SergeyKanzhelev · 2021-07-21T17:47:36Z

/assign @cynepco3hahue

cynepco3hahue · 2021-08-24T17:02:43Z

cmd/kube-proxy/app/server_test.go

@@ -501,6 +501,10 @@ udpIdleTimeout: 250ms`)
 		errCh := make(chan error, 1)
 		go func() {
 			errCh <- opt.runLoop()
+			select {


Can you please give an example of when it failed?
I am curious how adding the select fixes it.

As mentioned in part 1 of the description:
During the second test case, "fake error", two values are sent on opt.errCh: a non-nil error (server.go:329) and a nil error (server.go:269).
However, only one receive is executed, because the loop returns after the first non-nil error, as shown below:

kubernetes/cmd/kube-proxy/app/server.go
Lines 332 to 337 in 234d731

    for {
    	err := <-o.errCh
    	if err != nil {
    		return err
    	}
    }

This fix adds a select that can receive the second error.

Please correct me if I am wrong, but the method

kubernetes/cmd/kube-proxy/app/server.go

Line 265 in 1098899

func (o *Options) eventHandler(ent fsnotify.Event) {

is called only when the file changes; if the file does not change, nothing calls it, so I do not see how we can get the nil error here

kubernetes/cmd/kube-proxy/app/server.go

Line 274 in 1098899

o.errCh <- nil

I used a debugger to run this unit test, and found that (*Options).eventHandler() is called once, by this call chain:
err = opt.Complete() at cmd/kube-proxy/app/server_test.go:495
if err := o.initWatcher(); err != nil { at cmd/kube-proxy/app/server.go:233
err := fswatcher.Init(o.eventHandler, o.errorHandler) at cmd/kube-proxy/app/server.go:248
w.watcher, err = fsnotify.NewWatcher() at pkg/util/filesystem/watcher.go:63
go w.readEvents() at vendor/github.com/fsnotify/fsnotify/inotify.go:59
case w.Events <- event: at vendor/github.com/fsnotify/fsnotify/inotify.go:285

I think eventHandler() is called only for the first unit test. Sometimes it is called once, and sometimes it is called twice. The call chain is exactly the same in both runs.

cynepco3hahue · 2021-08-24T17:05:41Z

pkg/kubelet/config/file_linux_test.go

@@ -51,7 +51,7 @@ func TestExtractFromNonExistentFile(t *testing.T) {
 }

 func TestUpdateOnNonExistentFile(t *testing.T) {
-	ch := make(chan interface{})
+	ch := make(chan interface{}, 1)


What is the point of making the channel non-blocking when we have the select anyway?

I suppose the reason is to keep the send from blocking. If the select takes the timeout case, ch has no receiver, so the send in NewSourceFile blocks if the channel is unbuffered.

@cynepco3hahue There is a child goroutine created in NewSourceFile (line 95 in file.go). The child goroutine sends a message to ch (line 129 in file.go). If the select chooses the timeout case, the child goroutine is leaked.

charlesxsh · 2021-10-26T21:52:35Z

@cynepco3hahue Hi, is there any update on this PR?

cynepco3hahue · 2021-10-27T07:26:58Z

@charlesxsh I will give it another review round today.

cynepco3hahue · 2021-10-27T11:53:35Z

cmd/kube-proxy/app/server_test.go

@@ -501,6 +501,10 @@ udpIdleTimeout: 250ms`)
 		errCh := make(chan error, 1)
 		go func() {
 			errCh <- opt.runLoop()
+			select {


Please correct me if I am wrong, but the method

kubernetes/cmd/kube-proxy/app/server.go

Line 265 in 1098899

func (o *Options) eventHandler(ent fsnotify.Event) {

is called only when the file changes; if the file does not change, nothing calls it, so I do not see how we can get the nil error here

kubernetes/cmd/kube-proxy/app/server.go

Line 274 in 1098899

o.errCh <- nil

cynepco3hahue · 2021-10-27T12:19:52Z

pkg/kubelet/nodeshutdown/systemd/inhibit_linux_test.go

@@ -184,6 +184,7 @@ func TestMonitorShutdown(t *testing.T) {
 			signal := &dbus.Signal{Body: []interface{}{tc.shutdownActive}}
 			fakeSystemBus.signalChannel <- signal
 			<-done
+			close(fakeSystemBus.signalChannel)


Can we add defer close(fakeSystemBus.signalChannel) after fakeSystemBus := &fakeSystemDBus{} to make sure we also close it when the test fails midway?

@cynepco3hahue No, we cannot do that. fakeSystemBus.signalChannel is initialized in MonitorShutdown(). If we place the defer right after fakeSystemBus := &fakeSystemDBus{}, we are effectively calling defer close(nil).
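
The point about defer can be checked in isolation: Go evaluates the arguments of a deferred call when the defer statement executes, so a channel field that is still nil at that moment stays nil for the deferred close, and closing a nil channel panics. The sketch below uses a hypothetical fakeBus type and start method in place of the real fake and MonitorShutdown(). The closure variant is shown only for contrast and is not something proposed in this thread.

```go
package main

import "fmt"

// fakeBus is a hypothetical stand-in for the fake used in the test: its
// channel field is nil until something assigns it.
type fakeBus struct {
	signalChannel chan bool
}

// start stands in for the code under test assigning the channel.
func (b *fakeBus) start() {
	b.signalChannel = make(chan bool, 1)
}

// deferCloseDirect defers close before the channel exists and reports
// whether the deferred close panicked.
func deferCloseDirect() (panicked bool) {
	defer func() { panicked = recover() != nil }()
	bus := &fakeBus{}
	defer close(bus.signalChannel) // argument is evaluated here, while still nil
	bus.start()
	return false
}

// deferCloseInClosure wraps the close in a closure, so the field is read
// only when the deferred function runs.
func deferCloseInClosure() (panicked bool) {
	defer func() { panicked = recover() != nil }()
	bus := &fakeBus{}
	defer func() { close(bus.signalChannel) }()
	bus.start()
	return false
}

func main() {
	fmt.Println(deferCloseDirect(), deferCloseInClosure())
}
```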

songlh · 2021-12-20T05:23:30Z

@cynepco3hahue all questions are answered. Could you help review this pull request again?

cynepco3hahue · 2021-12-20T08:47:10Z

Sure, I will review it this week.

dims · 2022-01-06T14:27:35Z

/assign @bowei @dchen1107

k8s-triage-robot · 2022-04-06T15:04:53Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2022-05-06T15:21:10Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2022-06-05T15:53:33Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2022-06-05T15:53:47Z

@k8s-triage-robot: Closed this PR.

test: fix potential blocking problems in unit tests

5aaaa43

Signed-off-by: Ziheng Liu <lzhfromustc@gmail.com>

k8s-ci-robot added labels Jul 13, 2021: needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.), needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)

k8s-ci-robot requested review from dcbw and derekwaynecarr July 13, 2021 14:53

k8s-ci-robot added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test.) and removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) Jul 20, 2021

k8s-ci-robot assigned cynepco3hahue Jul 21, 2021

cynepco3hahue reviewed Aug 24, 2021

cynepco3hahue reviewed Oct 27, 2021

k8s-ci-robot assigned bowei and dchen1107 Jan 6, 2022

k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Apr 6, 2022

k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) May 6, 2022

k8s-ci-robot closed this Jun 5, 2022
