Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: netpoll port_getn() fails with impossible errno on illumos #45643

Open
whorfin opened this issue Apr 20, 2021 · 19 comments
Open

runtime: netpoll port_getn() fails with impossible errno on illumos #45643

whorfin opened this issue Apr 20, 2021 · 19 comments

Comments

@whorfin
Copy link

@whorfin whorfin commented Apr 20, 2021

What version of Go are you using (go version)?

$ go version
go version go1.16.3 illumos/amd64

Does this issue reproduce with the latest release?

I do not know, I found the issue running Navidrome
https://github.com/navidrome/navidrome
which enforces building only with Go 1.16
The relevant bits of netpoll_solaris.go do not seem to be changed against latest.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/whorfin/.cache/go-build"
GOENV="/home/whorfin/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="illumos"
GOINSECURE=""
GOMODCACHE="/home/whorfin/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="illumos"
GOPATH="/home/whorfin/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/ooce/go-1.16"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/ooce/go-1.16/pkg/tool/illumos_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/whorfin/navidrome/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3292860960=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Having built the latest navidrome (master from https://github.com/navidrome/navidrome), it will sporadically crash with:

runtime: port_getn on fd 4 failed (errno=0)
fatal error: runtime: netpoll failed

runtime stack:
runtime.throw(0xe01def, 0x17)
        /opt/ooce/go-1.16/src/runtime/panic.go:1117 +0x72
runtime.netpoll(0x71c11d, 0x12202ce5e8f688)
        /opt/ooce/go-1.16/src/runtime/netpoll_solaris.go:249 +0x505
runtime.findrunnable(0xc000034000, 0x0)
        /opt/ooce/go-1.16/src/runtime/proc.go:2879 +0x3ee
runtime.schedule()
        /opt/ooce/go-1.16/src/runtime/proc.go:3125 +0x2d7
runtime.park_m(0xc000001080)
        /opt/ooce/go-1.16/src/runtime/proc.go:3274 +0x9d
runtime.mcall(0x57eac8)
        /opt/ooce/go-1.16/src/runtime/asm_amd64.s:327 +0x64

goroutine 1 [chan receive, 26 minutes]:
github.com/oklog/run.(*Group).Run(0xc000052d20, 0xc0003fa0b0, 0x1)
        /home/whorfin/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:43 +0xed
github.com/navidrome/navidrome/cmd.runNavidrome()
        /home/whorfin/navidrome/cmd/root.go:59 +0x107
github.com/navidrome/navidrome/cmd.glob..func2(0x25e8540, 0x266d098, 0x0, 0x0)
        /home/whorfin/navidrome/cmd/root.go:31 +0x25
github.com/spf13/cobra.(*Command).execute(0x25e8540, 0xc000092020, 0x0, 0x0, 0x25e8540, 0xc000092020)
        /home/whorfin/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x25e8540, 0xc000000180, 0x200000003, 0xc000000180)
        /home/whorfin/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /home/whorfin/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897
github.com/navidrome/navidrome/cmd.Execute()
        /home/whorfin/navidrome/cmd/root.go:39 +0x65
main.main()
        /home/whorfin/navidrome/main.go:11 +0x2b

That (errno=0) appears to be impossible.

What did you expect to see?

Ideally, no crashes. If there were a crash, I'd expect a meaningful errno.

What did you see instead?

A crash with errno=0

@toothrot
Copy link
Contributor

@toothrot toothrot commented Apr 20, 2021

Is this possibly related to #35261? @jclulow @ianlancetaylor

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Apr 20, 2021

This is not related to #35261.

I don't see how this could happen. This is calling the function port_getn in the C library. That function is returning -1, but is not setting errno. That sounds like a bug in the C library.

Is there a Solaris equivalent to strace that will show the exact system calls being made and what they return? That might help identify the problem.

It's odd that nobody else is seeing this.

@jclulow
Copy link
Contributor

@jclulow jclulow commented Apr 20, 2021

That definitely seems unusual. On illumos, truss(1) is the moral equivalent of strace.

@jclulow
Copy link
Contributor

@jclulow jclulow commented Apr 20, 2021

e.g., given process 597:

# truss -t port_getn -p 597
/6:     port_getn(4, 0xFFFFFC7FEB3FF200, 128, 1, 0x00000000) (sleeping...)
/6:     port_getn(4, 0xFFFFFC7FEB3FF200, 128, 1, 0x00000000) = 0 [1]
/8:     port_associate(4, 4, 0x00000007, 0x00000001, 0xFFFFFC7FEF023F08) = 0
/6:     port_getn(4, 0xFFFFFC7FEB3FF210, 128, 1, 0xFFFFFC7FEB3FF1E8) = 0 [0]
/6:     port_getn(4, 0xFFFFFC7FEB3FF210, 128, 1, 0x00000000) (sleeping...)
...
@whorfin
Copy link
Author

@whorfin whorfin commented Apr 20, 2021

I made this evil, gross hack as an experiment; I do not understand port_getn well enough to know if this is safe,; it is probably stupid, but it confirms that something funky is happening. the errno() function here does seem to be doing something different AFAIK from what the "normal" pattern of using the "raw" or "Err" variant of sysvicall5 would be.

The gross hack allowed Navidrome to power through an initial scan - "seemed to work"
I saw the the "returning gList..." branch fire each time
Probably of no use to anybody, but in the spirit of "hope this helps":

        if r < 0 {
                if e != _EINTR && e != _ETIME {
                        print("runtime: port_getn on fd ", portfd, " failed (errno=", e, ")\n")
                        // whorfin
                        if (e ==0) {
                                if delay > 0 {
                                        print("\tImpossible zero errno, returning gList...\n")
                                        return gList{}
                                }
                                print("\tImpossible zero errno, retrying...\n")
                                goto retry
                        }
                        throw("runtime: netpoll failed")
                }
                // If a timed sleep was interrupted and there are no events,
                // just return to recalculate how long we should sleep now.
                if delay > 0 {
                        return gList{}
                }
                goto retry
        }

I was spooked by the whole thing and backed out the change.

@jclulow
Copy link
Contributor

@jclulow jclulow commented Apr 20, 2021

Can you give a bit more detail about the environment you're running this in? Which illumos distribution are you using, and what release or version of that distribution?

Were you able to collect information about the port_getn(3C) calls which are failing, using truss or DTrace?

I'm not familiar with Navidrome -- is it doing anything unusual like polling on a character device, or is it just regular network (e.g., TCP & UDP) things?

@whorfin
Copy link
Author

@whorfin whorfin commented Apr 21, 2021

SunOS xxxxxx 5.11 omnios-r151030-5bd7739fe4 i86pc i386 i86pc illumos

ie OmniOS CE r151030 LTS

The failures were random-seeming enough that I didn't get a good opportunity to run truss at an opportune time. I don't know how to make truss reveal errnos of syscalls which don't return them directly though; wouldn't it just show the "-1" retval?

It was always fd 4. And it was always from the findrunnable call, called by park_m, both in runtime/proc.go as shown in the stacktrace; Navidrome does not call netpoll directly, this seems to be from some fairly intrinsic threading code. That's why this seemed worth elevating to y'all golang folks.

(navidrome is a subsonic-API compatible music server)

I am using golang 1.16 as mentioned above, but also worth mentioning that is the go pkg from the omnios "extras" repo [OOCE], ie as OmniOS-standard as can be.

@whorfin
Copy link
Author

@whorfin whorfin commented Apr 22, 2021

Update:
I did a hack with a local runtime build, where I made a sysvicall5Err variant of sysvicall5 in order to call port_getn and directly get the library errno.
I then modified the netpoll call to always check if that errno matched that returned by errno(). Which it should but I'm paranoid
Managed to catch a crash, confirmed the errno() matched that from sysvicall5Err, and also that the return value of port_getn was -1. errno was the impossible 0
Finally, confirmed with ldd of my navidrome build that

libc.so.1 => /lib/64/libc.so.1

which is as expected, and the same version used by /opt/ooce/bin/go

@jclulow
Copy link
Contributor

@jclulow jclulow commented Apr 22, 2021

Managed to catch a crash, confirmed the errno() matched that from sysvicall5Err, and also that the return value of port_getn was -1. errno was the impossible 0

That's good news! Can you include the truss output that shows that information as well?

@whorfin
Copy link
Author

@whorfin whorfin commented Apr 23, 2021

Well that's not what I expected...
It looks like errno=0 because ... errno is zero
But the retval of port_getn was not negative!
here's the last port_getn lines captured by truss as it died:

/10:	port_getn(4, 0xFFFFFC7FDFFFF170, 128, 1, 0xFFFFFC7FDFFFF148) = 0 [0]
/5:	port_getn(4, 0xFFFFFC7FE87FF170, 128, 1, 0xFFFFFC7FE87FF148) = 0 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 62 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 0 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 0 [0]
/10:	port_getn(4, 0xFFFFFC7FDFFFF170, 128, 1, 0xFFFFFC7FDFFFF148) = 0 [0]
/5:	port_getn(4, 0xFFFFFC7FE87FF170, 128, 1, 0xFFFFFC7FE87FF148) = 0 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 62 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 0 [0]
/10:	port_getn(4, 0xFFFFFC7FDFFFF170, 128, 1, 0xFFFFFC7FDFFFF148) = 0 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 0 [0]
/5:	port_getn(4, 0xFFFFFC7FE87FF170, 128, 1, 0xFFFFFC7FE87FF148) = 0 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 62 [0]

Something seems borked here - the man page says either 0 or -1 should be returned
Here we see a retval of 62 with an errno of 0?! Am i reading this wrong?
This is on

SunOS illumosbox 5.11 omnios-r151030-5bd7739fe4 i86pc i386 i86pc illumos

and for completeness, the running program said just as became non-running:

runtime: port_getn on fd 4 failed (errno=0)
fatal error: runtime: netpoll failed

runtime stack:
runtime.throw(0xe02209, 0x17)
        /opt/ooce/go-1.16/src/runtime/panic.go:1117 +0x72
runtime.netpoll(0x7256d3, 0xc00003d800)
        /opt/ooce/go-1.16/src/runtime/netpoll_solaris.go:258 +0x505
runtime.findrunnable(0xc000036000, 0x0)
        /opt/ooce/go-1.16/src/runtime/proc.go:2879 +0x3ee
runtime.schedule()
        /opt/ooce/go-1.16/src/runtime/proc.go:3125 +0x2d7
runtime.park_m(0xc000001080)
        /opt/ooce/go-1.16/src/runtime/proc.go:3274 +0x9d
runtime.mcall(0x57eac8)
        /opt/ooce/go-1.16/src/runtime/asm_amd64.s:327 +0x64
@jclulow
Copy link
Contributor

@jclulow jclulow commented Apr 23, 2021

The mechanics are a little complicated here. What you see from truss is the underlying system call, which is actually a bit different from the signature of the C function port_getn(). The return value (here, either = 0 or = 62) is either success, or the error ETIME as you see on some of the line. The value in square brackets is, I believe, the number of events the kernel returned, and which we will write to the *nget pointer.

I am a bit confused by the output, as it doesn't seem like there was a non-zero return at least in the truss output you've pasted. If there was an error, I would expect to see something like:

port_getn(10000, 0x00000000, 128, 1, 0x00000000) Err#9 EBADF

That's what comes out of truss when I run...

#include <port.h>

int
main(int argc, char *argv[])
{
        int nget = 1;

        port_getn(10000, NULL, 128, &nget, NULL);
}

... at any rate. The ETIME condition (system call level return value of 62) is not technically an error from the perspective of truss because we don't communicate that back through the regular "set errno" path of a system call return.

I will try and knock together a DTrace script that will show you the user-level port_getn() calls.

@jclulow
Copy link
Contributor

@jclulow jclulow commented Apr 23, 2021

Here is a script that will trace some of the port_getn() calls for a process:

#!/usr/sbin/dtrace -qs

pid$target::port_getn:entry
{
        self->x = timestamp;
        self->fd = args[0];
        self->ngetp = args[3];
        self->nget = *self->ngetp;

        printf("%u [%d/%d:%s] port_getn(%d, nget %p = %d)...\n",
            timestamp, pid, tid, execname,
            self->fd, self->ngetp, self->nget);
}

syscall::portfs:entry
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs %d entry\n",
            timestamp, pid, tid, execname,
            (int)arg0);
}

syscall::portfs:return
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs return %d (errno %d)\n",
            timestamp, pid, tid, execname,
            (int)arg0, errno);
}

pid$target::port_getn:return
/self->x/
{
        this->dur = (timestamp - self->x) / 1000;

        /*
         * Get the per-thread errno location for this thread:
         */
        this->fsbase = curthread->t_lwp->lwp_pcb.pcb_fsbase;
        this->ulwp = (userland struct pid`ulwp *)this->fsbase;
        this->errnop = this->ulwp->ul_errnop;

        printf("%u [%d/%d:%s] -> %d (errnop %p = %d, nget = %d) %dus\n",
            timestamp, pid, tid, execname,
            (int)arg1, this->errnop, *this->errnop, *self->ngetp, this->dur);

        if ((int)arg1 < 0 && *this->errnop == 0) {
                printf("WARNING: oddball return detected\n");
                ustack();
                printf("\n");
        }

        self->x = 0;
        self->fd = 0;
        self->ngetp = 0;
        self->nget = 0;
}

e.g.,

some output from an acme-dns process:

dns2 # /var/tmp/portgetn.d -p $(pgrep acme-dns)
13334996135948040 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13334996135962747 [597/6:acme-dns] SYSCALL portfs 6 entry
13334996135968032 [597/6:acme-dns] SYSCALL portfs return 0 (errno 0)
13334996135973638 [597/6:acme-dns] -> 0 (errnop fffffc7feec8234c = 0, nget = 0) 25us

13334996135986941 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13334996135991899 [597/6:acme-dns] SYSCALL portfs 6 entry
13335017554884999 [597/6:acme-dns] SYSCALL portfs return 1 (errno 0)
13335017554908883 [597/6:acme-dns] -> 0 (errnop fffffc7feec8234c = 0, nget = 1) 21418921us

13335017555435247 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13335017555446540 [597/6:acme-dns] SYSCALL portfs 6 entry
13335017555451788 [597/6:acme-dns] SYSCALL portfs return 0 (errno 0)
13335017555456734 [597/6:acme-dns] -> 0 (errnop fffffc7feec8234c = 0, nget = 0) 21us

13335017555465954 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13335017555470793 [597/6:acme-dns] SYSCALL portfs 6 entry

Critically this script looks at two things:

  • the libc-level port_getn() call, including the per-thread errno value that Go should be collecting via errno()
  • the underlying portfs system call made by port_getn() which is what truss is showing you as well, including the errno value that the kernel put in place for the system call return (this is separate but related to the libc errno value)

There is a conditional print if we detect a failure return (-1) where errno is 0, which should print something like:

WARNING: oddball return detected

              libc.so.1`port_getn+0x51
              acme-dns`runtime.netpoll+0x1ef
              acme-dns`runtime.startTheWorldWithSema+0x236
              acme-dns`runtime.gcMarkTermination.func3+0x26
              acme-dns`runtime.systemstack+0x73
              acme-dns`runtime.gcMarkDone+0x1ff
              acme-dns`runtime.gcBgMarkWorker+0x29d
              acme-dns`runtime.goexit+0x1

It would be good to capture this sort of trace output while the problem is occurring, so that we can determine whether the C library is mishandling the error numbers here, or if it is some part of the Go machinery for fetching thread local errno after a C library call, or something else.

@whorfin
Copy link
Author

@whorfin whorfin commented Apr 23, 2021

OK! and slaps forehead for confusing native syscall vs libc illusions. Living in userland...
Nicely done. Here's the last reelvant bits, showing some -1 as well as 0 retvals, and then we die right after your dtrace script detects the oddball return:

5375078499701446 [28188/6:navidrome] -> 0 (errnop fffffc7fef2a238c = 0, nget = 0) 86us
5375078499704052 [28188/1:navidrome] SYSCALL portfs 6 entry
5375078499714442 [28188/8:navidrome] -> 0 (errnop fffffc7fef2a338c = 0, nget = 0) 82us
5375078506469969 [28188/1:navidrome] SYSCALL portfs return 0 (errno 0)
5375078506524468 [28188/1:navidrome] -> -1 (errnop fffffc7feea28ed8 = 62, nget = 0) 6883us
5375078506593065 [28188/1:navidrome] port_getn(4, nget fffffc7fffdfe338 = 1)...
5375078506613707 [28188/1:navidrome] SYSCALL portfs 6 entry
5375078506626595 [28188/1:navidrome] SYSCALL portfs return 0 (errno 0)
5375078506632341 [28188/1:navidrome] -> 0 (errnop fffffc7feea28ed8 = 0, nget = 0) 39us
5375078506769053 [28188/1:navidrome] port_getn(4, nget fffffc7fffdfe338 = 1)...
5375078506776102 [28188/1:navidrome] SYSCALL portfs 6 entry
5375078506779076 [28188/8:navidrome] port_getn(4, nget fffffc7fe87ff118 = 1)...
5375078506779730 [28188/1:navidrome] SYSCALL portfs return 0 (errno 0)
5375078506786340 [28188/1:navidrome] -> 0 (errnop fffffc7feea28ed8 = 0, nget = 0) 17us
5375078506792631 [28188/6:navidrome] port_getn(4, nget fffffc7fe83ff120 = 1)...
5375078506800609 [28188/1:navidrome] port_getn(4, nget fffffc7fffdfe338 = 1)...
5375078506807195 [28188/1:navidrome] SYSCALL portfs 6 entry
5375078506819033 [28188/8:navidrome] SYSCALL portfs 6 entry
5375078506831290 [28188/6:navidrome] SYSCALL portfs 6 entry
5375078506838443 [28188/8:navidrome] SYSCALL portfs return 0 (errno 0)
5375078506851638 [28188/6:navidrome] SYSCALL portfs return 0 (errno 0)
5375078506863374 [28188/8:navidrome] -> 0 (errnop fffffc7fef2a338c = 0, nget = 0) 84us
5375078506875764 [28188/6:navidrome] -> 0 (errnop fffffc7fef2a238c = 0, nget = 0) 83us
5375078512819552 [28188/1:navidrome] SYSCALL portfs return 0 (errno 0)
5375078512870335 [28188/1:navidrome] -> -1 (errnop fffffc7feea28ed8 = 0, nget = 0) 6069us
WARNING: oddball return detected

              libc.so.1`port_getn+0x5f
              navidrome`runtime.netpoll+0xc5
              navidrome`runtime.findrunnable+0x3ee
              navidrome`runtime.schedule+0x2d7
              navidrome`runtime.park_m+0x9d
              navidrome`runtime.mcall+0x64
              navidrome`runtime.scavengeSleep+0xc5
              navidrome`runtime.bgscavenge+0x1e5
              navidrome`runtime.goexit+0x1
@jclulow
Copy link
Contributor

@jclulow jclulow commented Apr 26, 2021

This is definitely strange. I have been looking at the kernel and the C library code for a while and I cannot see how that condition could arise so far.

I have added some more in-kernel bits to the tracing here to try and find what I'm missing:

#!/usr/sbin/dtrace -qs

pid$target::port_getn:entry
{
        self->x = timestamp;
        self->fd = args[0];
        self->ngetp = args[3];
        self->nget = *self->ngetp;

        printf("%u [%d/%d:%s] port_getn(%d, nget %p = %d)...\n",
            timestamp, pid, tid, execname,
            self->fd, self->ngetp, self->nget);
}

syscall::portfs:entry
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs %d entry\n",
            timestamp, pid, tid, execname,
            (int)arg0);
}

syscall::portfs:return
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs return %d (errno %d)\n",
            timestamp, pid, tid, execname,
            (int)arg0, errno);
}

pid$target::port_getn:return
/self->x/
{
        this->dur = (timestamp - self->x) / 1000;

        /*
         * Get the per-thread errno location for this thread:
         */
        this->fsbase = curthread->t_lwp->lwp_pcb.pcb_fsbase;
        this->ulwp = (userland struct pid`ulwp *)this->fsbase;
        this->errnop = this->ulwp->ul_errnop;

        printf("%u [%d/%d:%s] -> %d (errnop %p = %d, nget = %d) %dus\n",
            timestamp, pid, tid, execname,
            (int)arg1, this->errnop, *this->errnop, *self->ngetp, this->dur);

        self->x = 0;
        self->fd = 0;
        self->ngetp = 0;
        self->nget = 0;
}

fbt::portfs:entry
/self->x/
{
        printf("%u [%d/%d:%s] KERNEL portfs op %d a0 %x a1 %x a2 %x a3 %x a4 %x\n",
            timestamp, pid, tid, execname,
            args[0], args[1], args[2], args[3], args[4], args[5]);
}

fbt::port_getn:entry
/self->x/
{
        self->k_nget = args[3];
        self->k_pgt = args[4];

        printf("%u [%d/%d:%s] KERNEL port_getn max %u nget %p = %u pgt %p loop %u\n",
            timestamp, pid, tid, execname,
            args[2], self->k_nget, *self->k_nget, self->k_pgt,
            self->k_pgt->pgt_loop);
}

fbt::set_errno:entry
/self->x/
{
        printf("%u [%d/%d:%s] KERNEL set_errno %d\n",
            timestamp, pid, tid, execname,
            args[0]);
        stack();
        printf("\n");
}

fbt::port_getn:return
/self->x/
{
        printf("%u [%d/%d:%s] KERNEL port_getn ret i %x v %d 0x%x nget %u loop %u\n",
            timestamp, pid, tid, execname,
            arg0, arg1, arg1, *self->k_nget, self->k_pgt->pgt_loop);

        self->k_nget = 0;
        self->k_pgt = 0;
}


fbt::portfs:return
/self->x/
{
        printf("%u [%d/%d:%s] KERNEL portfs ret i %x v %d 0x%x\n",
            timestamp, pid, tid, execname,
            arg0, arg1, arg1);
}

Can you run this one while reproducing again?

As an aside: does the system where you're running this program have ECC memory in it?

@whorfin
Copy link
Author

@whorfin whorfin commented Apr 26, 2021

as before, last bits:

5673413389769518 [11542/1:navidrome] KERNEL port_getn max 128 nget fffffe0011153dec = 1 pgt fffffe0011153da0 loop 0
5673413389781061 [11542/1:navidrome] KERNEL port_getn ret i 94 v 0 0x0 nget 0 loop 0
5673413389785050 [11542/1:navidrome] KERNEL portfs ret i e1 v 0 0x0
5673413389787737 [11542/1:navidrome] SYSCALL portfs return 0 (errno 0)
5673413389792140 [11542/1:navidrome] -> 0 (errnop fffffc7feefb8ed8 = 0, nget = 0) 53us
5673413389886540 [11542/3:navidrome] port_getn(4, nget fffffc7fe53ff120 = 1)...
5673413389886581 [11542/1:navidrome] port_getn(4, nget fffffc7fffdfd708 = 1)...
5673413389893263 [11542/1:navidrome] SYSCALL portfs 6 entry
5673413389896461 [11542/1:navidrome] KERNEL portfs op 6 a0 4 a1 fffffc7fffdfd758 a2 80 a3 1 a4 fffffc7fffdfd730
5673413389896890 [11542/11:navidrome] port_getn(4, nget fffffc7fe37ff120 = 1)...
5673413389900305 [11542/1:navidrome] KERNEL port_getn max 128 nget fffffe0011153dec = 1 pgt fffffe0011153da0 loop 0
5673413389905345 [11542/1:navidrome] KERNEL port_getn ret i 94 v 0 0x0 nget 0 loop 0
5673413389909244 [11542/1:navidrome] KERNEL portfs ret i e1 v 0 0x0
5673413389909799 [11542/3:navidrome] SYSCALL portfs 6 entry
5673413389911866 [11542/1:navidrome] SYSCALL portfs return 0 (errno 0)
5673413389916351 [11542/1:navidrome] -> 0 (errnop fffffc7feefb8ed8 = 0, nget = 0) 29us
5673413389918754 [11542/11:navidrome] SYSCALL portfs 6 entry
5673413389920104 [11542/3:navidrome] KERNEL portfs op 6 a0 4 a1 fffffc7fe53ff170 a2 80 a3 1 a4 fffffc7fe53ff148
5673413389927285 [11542/1:navidrome] port_getn(4, nget fffffc7fffdfd708 = 1)...
5673413389928718 [11542/11:navidrome] KERNEL portfs op 6 a0 4 a1 fffffc7fe37ff170 a2 80 a3 1 a4 fffffc7fe37ff148
5673413389933200 [11542/1:navidrome] SYSCALL portfs 6 entry
5673413389934911 [11542/3:navidrome] KERNEL port_getn max 128 nget fffffe00114cedec = 1 pgt fffffe00114ceda0 loop 0
5673413389936374 [11542/1:navidrome] KERNEL portfs op 6 a0 4 a1 fffffc7fffdfd758 a2 80 a3 1 a4 fffffc7fffdfd730
5673413389940191 [11542/1:navidrome] KERNEL port_getn max 128 nget fffffe0011153dec = 1 pgt fffffe0011153da0 loop 0
5673413389943122 [11542/11:navidrome] KERNEL port_getn max 128 nget fffffe0010780dec = 1 pgt fffffe0010780da0 loop 0
5673413389953552 [11542/3:navidrome] KERNEL port_getn ret i 94 v 0 0x0 nget 0 loop 0
5673413389959349 [11542/11:navidrome] KERNEL port_getn ret i 94 v 0 0x0 nget 0 loop 0
5673413389965264 [11542/3:navidrome] KERNEL portfs ret i e1 v 0 0x0
5673413389970619 [11542/11:navidrome] KERNEL portfs ret i e1 v 0 0x0
5673413389972164 [11542/3:navidrome] SYSCALL portfs return 0 (errno 0)
5673413389977261 [11542/11:navidrome] SYSCALL portfs return 0 (errno 0)
5673413389981458 [11542/3:navidrome] -> 0 (errnop fffffc7feecd0b8c = 0, nget = 0) 94us
5673413389986837 [11542/11:navidrome] -> 0 (errnop fffffc7feecd4b8c = 0, nget = 0) 89us
5673413394332656 [11542/1:navidrome] KERNEL port_getn ret i 94 v 62 0x3e nget 0 loop 0
5673413394378638 [11542/1:navidrome] KERNEL portfs ret i e1 v 266287972352 0x3e00000000
5673413394387872 [11542/1:navidrome] SYSCALL portfs return 0 (errno 0)
5673413394403418 [11542/1:navidrome] -> -1 (errnop fffffc7feefb8ed8 = 62, nget = 0) 4476us

Yes, I believe the server has ECC memory. All filesystems are regularly scrubbed and show no errors.
For extra paranoia:
$ ls -l /lib/64/libc.so.1
-rwxr-xr-x 1 root bin 2115832 Oct 23 2020 /lib/64/libc.so.1
$ sha512sum /lib/64/libc.so.1
802da4ec8c007ffc9354c64ef715e730853574b72c2ee324e9fe4992eaa1fe6a43c256d3c57de71d3b8bde715e1207c4edc38a7c3be6c26c6940d0f9429baf40 /lib/64/libc.so.1

@jclulow
Copy link
Contributor

@jclulow jclulow commented Apr 26, 2021

In this last trial did the WARNING: oddball return detected message not appear? The final line in the trace output you pasted shows that errno was ETIME (62):

...
5673413394378638 [11542/1:navidrome] KERNEL portfs ret i e1 v 266287972352 0x3e00000000
5673413394387872 [11542/1:navidrome] SYSCALL portfs return 0 (errno 0)
5673413394403418 [11542/1:navidrome] -> -1 (errnop fffffc7feefb8ed8 = 62, nget = 0) 4476us
@whorfin
Copy link
Author

@whorfin whorfin commented Apr 26, 2021

correct, those are the last lines of the output
the program blew up, per usual, with the message about failed (errno=0)
(that last line looks just like what truss showed, right?)

@jclulow
Copy link
Contributor

@jclulow jclulow commented Apr 26, 2021

If it blew up and reported errno=0 in that case I'm not sure it can be a libc issue, as the last line of this trace output shows that the memory address of the thread local errno was set to 62 (ETIME). The return from the kernel portfs() function just prior was 0x3e00000000 which we shift down, effectively, to get 0x3e which is 62 -- this causes us to set errno to ETIME in libc before returning:

https://github.com/illumos/illumos-gate/blob/9ecd05bdc59e4a1091c51ce68cce2028d5ba6fd1/usr/src/lib/libc/port/gen/event_port.c#L94-L97

(that the ETIME case is handled this way is why truss does not see an error, but libc is able to return one anyway)

@whorfin
Copy link
Author

@whorfin whorfin commented Apr 26, 2021

I have no idea what I'm doing as regards the illumos kernel.
But, uh, couldn't this be the return at line 91 with something globally haywire?
...or line 98 for that matter; a million ways to die in the kernel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants