x/sys/unix: panic when using solaris Event Ports #54254
Comments
Reproduction is hard and inconsistent in how long it takes. Here's a test case that can eventually trigger it at least sometimes:
diff --git a/unix/syscall_solaris_test.go b/unix/syscall_solaris_test.go
index c2b28be..d597355 100644
--- a/unix/syscall_solaris_test.go
+++ b/unix/syscall_solaris_test.go
@@ -12,7 +12,9 @@ import (
 	"io/ioutil"
 	"os"
 	"os/exec"
+	"path/filepath"
 	"runtime"
+	"sync"
 	"testing"

 	"golang.org/x/sys/unix"
Then add this test:
func TestEventPortMemoryStress(t *testing.T) {
	path, err := os.MkdirTemp("", "eventport")
	if err != nil {
		t.Fatalf("unable to create a tempdir: %v", err)
	}
	defer os.RemoveAll(path)
	stat, err := os.Stat(path)
	if err != nil {
		t.Fatalf("Failed to stat %s: %v", path, err)
	}
	port, err := unix.NewEventPort()
	if err != nil {
		t.Fatalf("NewEventPort failed: %v", err)
	}
	defer port.Close()
	cookie := stat.Mode()
	err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
	if err != nil {
		t.Errorf("AssociatePath failed: %v", err)
	}
	if !port.PathIsWatched(path) {
		t.Errorf("PathIsWatched unexpectedly returned false")
	}
	c := make(chan int)
	done := make(chan bool)
	var mu sync.Mutex
	go func(c chan int, done chan bool) {
		for {
			_, err = port.GetOne(nil)
			if err != nil {
				t.Errorf("GetOne failed: %v", err)
			}
			mu.Lock()
			err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
			mu.Unlock()
			select {
			case _, _ = <-done:
				return
			default:
				if err != nil {
					t.Errorf("AssociatePath failed: %v", err)
				}
			}
			c <- 1
		}
	}(c, done)
	iterations := 500000
	for i := 0; i < iterations; i++ {
		mu.Lock()
		file, err := os.Create(filepath.Join(path, fmt.Sprintf("%d", i)))
		if err != nil {
			t.Fatalf("unable to create files in %s: %v", path, err)
		}
		file.Close()
		mu.Unlock()
	}
	var sum int
	for i := 0; i < iterations; i++ {
		sum += <-c
	}
	done <- true
	if sum != iterations {
		t.Errorf("didn't get all %d events", iterations)
	}
}
It usually takes many seconds before it triggers, e.g.:
|
CC @golang/runtime, @ianlancetaylor. |
I'm not sure what the issue is, but in your test the Is that intentional? It looks like you want those goroutines to race as much as possible? If so, eliminating |
FWIW, at https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=954 That said, I don't immediately see how that would cause problems, as the |
For posterity I'm updating this comment with the version of the test that I ended up using to further debug this issue and to test the fix.
func TestEventPortMemoryStress(t *testing.T) {
	path, err := os.MkdirTemp("", "eventport")
	if err != nil {
		t.Fatalf("unable to create a tempdir: %v", err)
	}
	defer os.RemoveAll(path)
	stat, err := os.Stat(path)
	if err != nil {
		t.Fatalf("Failed to stat %s: %v", path, err)
	}
	port, err := unix.NewEventPort()
	if err != nil {
		t.Fatalf("NewEventPort failed: %v", err)
	}
	defer port.Close()
	iterations := 100000
	for i := 0; i < iterations; i++ {
		cookie := fmt.Sprintf("cookie %d", i)
		err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
		if err != nil {
			t.Errorf("AssociatePath failed: %v", err)
		}
		if !port.PathIsWatched(path) {
			t.Errorf("PathIsWatched unexpectedly returned false")
		}
		file, err := os.Create(filepath.Join(path, fmt.Sprintf("%d", i)))
		if err != nil {
			t.Fatalf("unable to create files in %s: %v", path, err)
		}
		file.Close()
		err = os.Remove(filepath.Join(path, fmt.Sprintf("%d", i)))
		if err != nil {
			t.Errorf("os.Remove failed: %v", err)
		}
		_, err = port.GetOne(nil)
		if err != nil {
			t.Errorf("GetOne failed: %v", err)
		}
	}
}
|
Yeah... If I run that test with
Without |
Based on the discovery that the panics are pretty clearly related to garbage collection, I think you're onto something. I thought all of these maps were safe (the key in
One thing that I appear to be doing that could maybe be changed is that I'm working pretty hard to pass the pointer to the exact thing that the user gave me to the call to port_associate, but maybe I'm trying too hard. As long as I pass in some pointer that is not going to get garbage collected, and as long as I can figure out what it points to when I get it back, I can then return to the user whatever they asked me to associate... |
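To make the idea in the comment above concrete, here is a minimal sketch of that pattern. It is not the code in x/sys/unix; the `handle` and `registry` types and their methods are hypothetical names invented for illustration, and no real kernel call is made. The point is only that a map holding the pointer keeps the object reachable, and the same pointer is enough to find the caller's cookie again.

```go
package main

import "fmt"

// handle is a hypothetical stand-in for the object whose address would be
// handed to the kernel as the event's user pointer.
type handle struct {
	cookie interface{} // whatever the caller asked to associate
}

// registry keeps every outstanding handle reachable, so the garbage
// collector cannot free it while something outside Go holds its address.
type registry struct {
	live map[*handle]struct{}
}

func newRegistry() *registry {
	return &registry{live: make(map[*handle]struct{})}
}

// register wraps the caller's cookie in a fresh handle and records it.
func (r *registry) register(cookie interface{}) *handle {
	h := &handle{cookie: cookie}
	r.live[h] = struct{}{}
	return h
}

// resolve is given the pointer that came back with an event. It returns the
// caller's cookie only if that handle is still registered, then forgets it.
func (r *registry) resolve(h *handle) (interface{}, bool) {
	if _, ok := r.live[h]; !ok {
		return nil, false
	}
	delete(r.live, h)
	return h.cookie, true
}

func main() {
	r := newRegistry()
	h := r.register("cookie 1")

	cookie, ok := r.resolve(h)
	fmt.Println(cookie, ok)

	// A second lookup of the same handle fails because it was removed.
	_, ok = r.resolve(h)
	fmt.Println(ok)
}
```

In a real binding the pointer would cross the system call boundary as an integer, which is where the conversion rules for `unsafe.Pointer` come into play; this sketch deliberately leaves that part out.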
Well, I tried simplifying things and while that makes all the code way easier to read, and the tests pass with For now, my code is in a github branch here: The simplification looks like a good idea to me, but doesn't solve the problem. Maybe there really is a GC bug?! |
These are all fine. Storing as Casting to This code is full of unsafe uintptr casts. e.g., https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=833-834 casts Given the amount of uintptr code here, I suspect that one of these cases fails to keep an object alive and is causing the problem. If you have trouble determining the problematic spots, one debugging technique would be to add [1] Note: you've probably seen case 4 in |
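As a general illustration of the concern about uintptr conversions (not a reproduction of any specific line in syscall_solaris.go), the sketch below shows the usual shape of the rule: an integer holding an address does not keep the object alive, so the pointer has to stay referenced for as long as the address is in use. `fakeSyscall`, `associate`, and `cookie` are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"unsafe"
)

type cookie struct{ id int }

// fakeSyscall is a hypothetical stand-in for a system call that receives an
// address as a plain integer. It is an ordinary Go function, so the compiler
// gives it none of the special treatment that real syscall wrappers get.
func fakeSyscall(addr uintptr) uintptr { return addr }

// associate passes c's address as an integer and keeps c reachable until
// the call has returned.
func associate(c *cookie) uintptr {
	// The conversion happens in the call expression itself; the resulting
	// uintptr is just a number as far as the garbage collector is concerned.
	ret := fakeSyscall(uintptr(unsafe.Pointer(c)))
	// KeepAlive marks c as reachable up to this point in the program.
	runtime.KeepAlive(c)
	return ret
}

func main() {
	c := &cookie{id: 42}
	addr := associate(c)
	fmt.Println(addr != 0)
}
```

Note that `runtime.KeepAlive` only covers the duration of the call. When the receiver keeps the address after the call returns, as an event port does until the event is delivered, the object needs to stay referenced from some live Go data structure for that whole period.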
I suspect it would also be helpful to know in the |
I've pushed up a commit with some heavy-handed debugging to answer those questions at nshalman/sys@1bbdf46
So it looks like However, on one run I saw the most confusing output of all:
Instead of getting back the wrong cookie, I somehow got the right cookie but at a different address. The address seeming to be different at Can we add the OS-illumos tag as well, since that's where I'm doing my development and testing? |
Had a brief discussion with @rmustacc on IRC, who helped me come up with some dtrace to look a little closer at what's going on from the OS side.
fbt:portfs:port_associate_fop:entry
{
	printf(" in: %p->%p", args[2], args[4]);
}

fbt::port_copy_event:entry
/args[1]->portkev_source == 7/
{
	printf("out: %x->%p\n", args[1]->portkev_object, args[1]->portkev_user);
}

fbt::port_pfp_setup:entry
{
	printf(" in: %x->%p", args[4], args[6]);
}
I think the lack of
I've filed https://www.illumos.org/issues/14898 in case this is indeed a bug on the illumos end. |
Oh sweet summer child... I'm now convinced that this is https://www.illumos.org/issues/14898 rather than an issue with the code in x/sys/unix. Since it will take a while for that fix to land, and longer for it to make it out to all machines in production, I'm inclined to update the panic message to reference this issue. Perhaps something along the lines of: I think there is a nontrivial chance that once fsnotify/fsnotify#371 lands (which this bug was preventing me from feeling comfortable landing), illumos and Solaris (which might have this same issue) systems might start tripping over this panic and I want to provide users with a hint of what to do about it. Thank you everyone for your help guiding me into narrowing down the issue. Your feedback on the panic message is requested so that I know what to put into the CR. |
Change https://go.dev/cl/422099 mentions this issue: |
For golang/go#54254 Change-Id: Id59bacfabc5c818478f6a9af8d585f56f74c2bde
I've updated the description at the top of this issue to include some helpful text to anyone directed here from the error message proposed in https://go.dev/cl/422099 |
For golang/go#54254 Change-Id: Id59bacfabc5c818478f6a9af8d585f56f74c2bde Reviewed-on: https://go-review.googlesource.com/c/sys/+/422099 Reviewed-by: Nahum Shalman <nahamu@gmail.com> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Robert Griesemer <gri@google.com>
I've tweaked the language in the description one more time now that golang/sys@c680a09 has landed. |
Sorry to drive-by this issue with a tangential one, but was anything determined about the |
@davepacheco The code was rewritten (https://go.dev/cl/422338) to avoid the unsafe conversions. The exact rules can be found at https://pkg.go.dev/unsafe#Pointer. |
Thanks -- that's helpful! Do you know what release(s) that will wind up in (or is there some way for me to figure that out)? |
Sorry, I'm not sure exactly what you are asking. If you are asking about the changes to golang.org/x/sys/unix, the golang.org/x/sys/unix package doesn't currently have a release system. We expect people to update ( (Systematic releases for the golang.org/x repos is #21324, but nobody is thinking about that as far as I know.) |
My mistake. I thought this package was tied to the Go runtime and so needed a new Go release to get the fix. (Sorry for my ignorance here...I'm a user of software written in Go but I haven't worked with Go much myself.) |
@davepacheco No worries. (For future notice, see https://go.dev/wiki/Questions for good places to ask questions about Go.) |
Summary
If you are running on illumos, you are likely suffering from https://www.illumos.org/issues/14898 and you should look into applying updates. If you are on an LTS release of some sort, you can ask your distro provider if that fix can be backported for you.
If you are running on Solaris, you should probably contact support and send them a link to the illumos issue to see if there's a corresponding fix available for Solaris.
There's a small chance that you're running into #54363 if the version of x/sys/unix in use doesn't contain the fix from https://go.dev/cl/422338.
Original Report (for posterity)
CAVEAT
This is clearly my own fault, but I am not smart enough (yet?) to figure out where my bug is.
It was introduced while attempting to fix a different problem in golang/sys@594fa53
as worked on in https://go-review.googlesource.com/c/sys/+/380034
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
See fsnotify/fsnotify#371 (comment)
What did you expect to see?
No panics
What did you see instead?