test: fix potential blocking problems in unit tests #103667

Closed
wants to merge 1 commit into from

lzhfromustc
Contributor

@lzhfromustc lzhfromustc commented Jul 13, 2021

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

This PR fixes several potential blocking problems in unit tests. The fixes are explained below:

(1) server_test.go
During the second test case, "fake error", two errors are sent to channel opt.errCh: a non-nil error (server.go:329) and a nil error (server.go:269).
However, the receive executes only once, because the loop returns after the first non-nil error, as shown below:

for {
	err := <-o.errCh
	if err != nil {
		return err
	}
}

This PR adds a select that can receive the second error.

(2) cloud_cidr_allocator_test.go
The goroutine started by go ca.worker(stopChan) is never stopped, because stopChan is never closed.
This PR adds a close of this channel.

(3) sync_test.go
This unit test sends only one message to fake.reportChan (at sync.go:148), but it has two receive operations on this channel.
This PR puts the second receive operation into a select that can unblock together with the unit test goroutine, so it no longer blocks.
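
The idea in (3) can be sketched as a receive guarded by a second channel. The names `receiveOrDone` and `done` are hypothetical, and the real test may structure this differently.

```go
package main

import "fmt"

// receiveOrDone waits for a report but gives up when done is closed.
// It returns the report and true, or "" and false if done fired first.
func receiveOrDone(reportChan <-chan string, done <-chan struct{}) (string, bool) {
	select {
	case r := <-reportChan:
		return r, true
	case <-done:
		return "", false
	}
}

func main() {
	reportChan := make(chan string) // hypothetical stand-in for fake.reportChan
	done := make(chan struct{})
	finished := make(chan struct{})

	go func() { reportChan <- "only report" }() // exactly one send

	first, _ := receiveOrDone(reportChan, done)
	fmt.Println("first receive:", first)

	// The second receive has no matching send; it unblocks only via done.
	go func() {
		defer close(finished)
		_, ok := receiveOrDone(reportChan, done)
		fmt.Println("second receive got a value:", ok)
	}()

	close(done) // the test is finishing
	<-finished
}
```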

(4) file_linux_test.go
ch in this unit test is an alias of s.updates below:

if err != nil {
	if !os.IsNotExist(err) {
		return err
	}
	// Emit an update with an empty PodList to allow FileSource to be marked as seen
	s.updates <- kubetypes.PodUpdate{Pods: []*v1.Pod{}, Op: kubetypes.SET, Source: kubetypes.FileSource}
	return fmt.Errorf("path does not exist, ignoring")
}

The send blocks forever if the unit test calls t.Fatalf() and never receives from ch.
This PR gives ch a buffer of size 1 so that the send operation does not block.

(5) inhibit_linux_test.go
fakeSystemBus.signalChannel is an alias of busChan below:

busChan := make(chan *dbus.Signal, 1)
bus.SystemBus.Signal(busChan)
shutdownChan := make(chan bool, 1)
go func() {
	for {
		event, ok := <-busChan
		if !ok {
			close(shutdownChan)
			return
		}

Note that the goroutine created above can stop only when busChan is closed, which never happens.
This PR adds the missing close operation.
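
The exit condition in (5) can be sketched with a simplified event type. `monitor` is a hypothetical stand-in, and using `bool` events instead of `*dbus.Signal` is an assumption made to keep the example self-contained.

```go
package main

import "fmt"

// monitor mimics the goroutine above: it consumes events until busChan is
// closed, then reports how many events it saw and returns.
func monitor(busChan <-chan bool) <-chan int {
	result := make(chan int, 1)
	go func() {
		seen := 0
		for {
			_, ok := <-busChan
			if !ok {
				result <- seen
				return
			}
			seen++
		}
	}()
	return result
}

func main() {
	busChan := make(chan bool, 1)
	result := monitor(busChan)

	busChan <- true
	close(busChan) // without this, the goroutine in monitor never returns

	fmt.Println("events seen before exit:", <-result)
}
```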

Special notes for your reviewer:

These problems may be somewhat trivial, since they only waste some memory during testing. We found them with our fuzzer, which reports any blocking in the program.

For the worker() mentioned in cloud_cidr_allocator_test.go, we also noticed that there are six places that pass wait.NeverStop when calling worker functions, which also results in workers that never stop.
Is there any way to avoid this behavior?
These places are: pkg/kubelet/cm/devicemanager/manager_test.go:275; pkg/kubelet/kubelet.go:1438; pkg/controller/nodeipam/ipam/range_allocator_test.go:554 & 652 & 794

Does this PR introduce a user-facing change?

NONE

 Signed-off-by: Ziheng Liu <lzhfromustc@gmail.com>
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 13, 2021
@k8s-ci-robot
Contributor

Hi @lzhfromustc. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 13, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lzhfromustc
To complete the pull request process, please assign bowei, dchen1107 after the PR has been reviewed.
You can assign the PR to them by writing /assign @bowei @dchen1107 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 13, 2021
@endocrimes
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 20, 2021
@endocrimes
Member

/retest

@SergeyKanzhelev
Member

/priority important-longterm
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 21, 2021
@SergeyKanzhelev
Member

/assign @cynepco3hahue

@@ -501,6 +501,10 @@ udpIdleTimeout: 250ms`)
 	errCh := make(chan error, 1)
 	go func() {
 		errCh <- opt.runLoop()
+		select {

Can you please give an example of when it failed?
I am just curious how adding the select fixes it.

Contributor

As mentioned in part (1) of the description,
during the second test case, "fake error", two errors are sent to channel opt.errCh: a non-nil error (server.go:329) and a nil error (server.go:269).
However, the receive executes only once, because the loop returns after the first non-nil error, as shown below:

kubernetes/cmd/kube-proxy/app/server.go
Lines 332 to 337 in 234d731
for {
	err := <-o.errCh
	if err != nil {
		return err
	}
}

This fix adds a select that can receive the second error.

Please correct me if I am wrong, but the method

func (o *Options) eventHandler(ent fsnotify.Event) {
is called only when there is a change to the file. If there are no changes to the file, no one will call it, so I do not see how we can get the nil error here.

Contributor Author

I used a debugger to run this unit test, and found that (*Options).eventHandler() is called once, by this call chain:
err = opt.Complete() at cmd/kube-proxy/app/server_test.go:495
if err := o.initWatcher(); err != nil { at cmd/kube-proxy/app/server.go:233
err := fswatcher.Init(o.eventHandler, o.errorHandler) at cmd/kube-proxy/app/server.go:248
w.watcher, err = fsnotify.NewWatcher() at pkg/util/filesystem/watcher.go:63
go w.readEvents() at vendor/github.com/fsnotify/fsnotify/inotify.go:59
case w.Events <- event: at vendor/github.com/fsnotify/fsnotify/inotify.go:285

Contributor

I think eventHandler() is called only for the first unit test. Sometimes it is called once, and sometimes it is called twice. The call chain is exactly the same for the two runs.

@@ -51,7 +51,7 @@ func TestExtractFromNonExistentFile(t *testing.T) {
 }

 func TestUpdateOnNonExistentFile(t *testing.T) {
-	ch := make(chan interface{})
+	ch := make(chan interface{}, 1)

What is the point of making it a buffered (non-blocking) channel when we have the select anyway?

Contributor

I suppose the buffer is there so that the send never blocks. If the select takes the timeout case, there is no receiver for ch, so the send operation in NewSourceFile blocks forever if the channel is unbuffered.

Contributor

@cynepco3hahue There is a child goroutine created in NewSourceFile (line 95 in file.go). The child goroutine sends a message to ch (line 129 in file.go). If the select chooses the timeout case, the child goroutine is leaked.

@charlesxsh
Contributor

@cynepco3hahue Hi, is there any update on this PR?

@cynepco3hahue

@charlesxsh I will give it another review round today.

@@ -184,6 +184,7 @@ func TestMonitorShutdown(t *testing.T) {
 	signal := &dbus.Signal{Body: []interface{}{tc.shutdownActive}}
 	fakeSystemBus.signalChannel <- signal
 	<-done
+	close(fakeSystemBus.signalChannel)

Can we add defer close(fakeSystemBus.signalChannel) right after fakeSystemBus := &fakeSystemDBus{}, to make sure the channel is also closed when the test fails midway?

Contributor

@cynepco3hahue No, we cannot do that. fakeSystemBus.signalChannel is initialized in MonitorShutdown(). If we put the defer right after fakeSystemBus := &fakeSystemDBus{}, the argument is evaluated while the field is still nil, so the deferred call is effectively close(nil).
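
The defer semantics behind this answer can be sketched as follows. Go evaluates the arguments of a deferred call when the defer statement executes, and closing a nil channel panics. The `fakeBus` type and both functions are hypothetical illustrations, not the test code from this PR.

```go
package main

import "fmt"

// fakeBus is a hypothetical stand-in for fakeSystemDBus.
type fakeBus struct {
	signalChannel chan int
}

// deferDirect registers `defer close(b.signalChannel)` before the channel is
// initialized. The argument is evaluated at the defer statement, so the
// deferred call closes a nil channel and panics.
func deferDirect() (panicked bool) {
	defer func() { panicked = recover() != nil }()
	b := &fakeBus{}
	defer close(b.signalChannel) // b.signalChannel is nil here
	b.signalChannel = make(chan int, 1)
	return false
}

// deferClosure wraps the close in a closure, so the field is read when the
// function returns, after it has been initialized.
func deferClosure() (panicked bool) {
	defer func() { panicked = recover() != nil }()
	b := &fakeBus{}
	defer func() { close(b.signalChannel) }()
	b.signalChannel = make(chan int, 1)
	return false
}

func main() {
	fmt.Println("direct defer panicked:", deferDirect())
	fmt.Println("closure defer panicked:", deferClosure())
}
```

The closure variant would still panic if the function returned before the channel was initialized, so a nil check inside the closure would be needed in that case.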

@songlh
Contributor

songlh commented Dec 20, 2021

@cynepco3hahue all questions are answered. Could you help review this pull request again?

@cynepco3hahue

Sure, I will review it this week.

@dims
Member

dims commented Jan 6, 2022

/assign @bowei @dchen1107

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 6, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 6, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

Labels
area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.