Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/poll: TestSplicePipePool failures with "at least one pipe is still open" on linux-arm #48066

Closed
bcmills opened this issue Aug 30, 2021 · 14 comments

Comments

@bcmills
Copy link
Member Author

@bcmills bcmills commented Aug 30, 2021

This test was added for Go 1.17 (March 9, in CL 271537 for #42740), so these failures represent at least a regression in test flakiness in the current release (CC @golang/release). A rollback of the flaky test in CL 308089 was abandoned in favor of an attempt to fix the test in CL 308329.

The failure rate seems to have been partially obscured by the low rate of changes during the Go 1.17 code freeze.

It is not clear to me whether the test failures represent a defect in the test itself or an incomplete fix for #42740. Since this flakiness is new as of Go 1.17, I think a fix for it should at least block the next release.

@panjf2000
Copy link
Contributor

@panjf2000 panjf2000 commented Aug 30, 2021

Given that at least one pipe was not closed in the test, it could only be either that the finalizers for each pipe didn't work properly on some platforms, or that the CloseFunc in destroyPipe() failed when finalizers were executed.

If it is the latter mentioned above, then this issue has been there since the day splice was implemented, otherwise, the problematic issue will go to the former: runtime.SetFinalizer() and we have to figure out why it was not working as expected.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Aug 30, 2021

Per https://pkg.go.dev/runtime#SetFinalizer (emphasis mine):

The finalizer is scheduled to run at some arbitrary time after the program can no longer reach the object to which obj points. There is no guarantee that finalizers will run before a program exits, so typically they are useful only for releasing non-memory resources associated with an object during a long-running program.

If the test depends on a finalizer actually executing, it is depending on a property of the runtime that is not guaranteed to hold — it should probably have extra scrutiny from the runtime team.

@panjf2000
Copy link
Contributor

@panjf2000 panjf2000 commented Aug 31, 2021

It seems to have been working well on https://github.com/golang/go/blob/go1.17/src/sync/pool_test.go#L100-L135 with runtime.SetFinalizer.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Sep 1, 2021

  1. Is it possible that F_GETPIPE_SZ somehow succeeds on a file descriptor for a pipe that was actually destroyed? (Perhaps the FD is somehow reused for another, unrelated file for which F_GETPIPE_SZ succeeds, or some bug in the kernel affecting ARM processors causes a spurious 0-return?)

  2. Is it possible that the destroyPipe call in the finalizer is actually failing to destroy the pipe? I notice that it makes two calls to CloseFunc, which returns an error, but does not check whether those errors are non-nil.

  3. Is it possible that the finalizer is failing to run entirely? (Would it make sense to modify the test to count or examine the finalizer calls independently from the status of the FDs, perhaps by allowing the test to inject an atomic-counter-update into the finalizer?)

@panjf2000
Copy link
Contributor

@panjf2000 panjf2000 commented Sep 2, 2021

I want to debug and reproduce this test failure on those platforms, but I don't have any of those machines with ARM processors on hand now, is there a way for me to run the test on machines of golang build farm without interfering with the routine builds?

@panjf2000
Copy link
Contributor

@panjf2000 panjf2000 commented Sep 2, 2021

  1. Is it possible that F_GETPIPE_SZ somehow succeeds on a file descriptor for a pipe that was actually destroyed? (Perhaps the FD is somehow reused for another, unrelated file for which F_GETPIPE_SZ succeeds, or some bug in the kernel affecting ARM processors causes a spurious 0-return?)
  2. Is it possible that the destroyPipe call in the finalizer is actually failing to destroy the pipe? I notice that it makes two calls to CloseFunc, which returns an error, but does not check whether those errors are non-nil.
  3. Is it possible that the finalizer is failing to run entirely? (Would it make sense to modify the test to count or examine the finalizer calls independently from the status of the FDs, perhaps by allowing the test to inject an atomic-counter-update into the finalizer?)

Another thought about improving test code is to leverage epoll to monitor all pipe fds and when we receive the sufficient amount (equal to the amount of pipes) of POLLERR/EPOLLERR, then it's legitimate to mark this test as successful.

@panjf2000
Copy link
Contributor

@panjf2000 panjf2000 commented Sep 2, 2021

Besides, I just found a analogical discussion at #26049

@toothrot
Copy link
Contributor

@toothrot toothrot commented Sep 8, 2021

Just a note on the original issue: the scaleway builders no longer exist, as Scaleway dropped ARM support.

@panjf2000 We aren't granting new access to our builders at this time.

It looks like it also failed today on linux-386-jessie: https://build.golang.org/log/e4fcae576822945ee58df39050d0db1b2a8a7b8d, for which we have a public image: gcr.io/symbolic-datum-552/linux-x86-jessie:latest

The only other special thing about that builder is the following environment variables, which can also be seen in the linked logs:

			"GOARCH=386",
			"GOHOSTARCH=386",
			"GO_DISABLE_OUTBOUND_NETWORK=1",

@gopherbot
Copy link

@gopherbot gopherbot commented Sep 8, 2021

Change https://golang.org/cl/348579 mentions this issue: internal/poll: report open fds when TestSplicePipePool fails

gopherbot pushed a commit that referenced this issue Sep 9, 2021
For #48066

Change-Id: I1152a1c15756df35b71b27d3e7025d97da9e70b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/348579
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
@panjf2000
Copy link
Contributor

@panjf2000 panjf2000 commented Sep 14, 2021

It appears that all finalizers occasionally failed to execute on the android platform for some unknown reasons, take a look at this test log, where none of the finalizers executed.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Sep 15, 2021

I think you are misreading the log. Of the various descriptors to check, only descriptor 8 is still open.

This makes me wonder whether descriptor 8 has been opened by some other thread, as in #25628. This could in fact be the same issue as that one. According to the internal/testenv package, android-arm always uses external linking. That doesn't explain the linux-arm failure, though.

@panjf2000
Copy link
Contributor

@panjf2000 panjf2000 commented Sep 15, 2021

I think you are misreading the log. Of the various descriptors to check, only descriptor 8 is still open.

This makes me wonder whether descriptor 8 has been opened by some other thread, as in #25628. This could in fact be the same issue as that one. According to the internal/testenv package, android-arm always uses external linking. That doesn't explain the linux-arm failure, though.

I did misread the log and sorry about that, it was the minimum fd that was still alive, which makes it stands a good chance of being reused inside the process, I will submit a CL to improve the splice_linux_test.go, it would be rational for this test to ignore those fds reopened by other threads.

@gopherbot
Copy link

@gopherbot gopherbot commented Sep 15, 2021

Change https://golang.org/cl/349774 mentions this issue: internal/poll: ignore the reused pipe file descriptors when checking if all pipes from pool are closed.

@gopherbot gopherbot closed this in 74e384f Sep 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants