runtime: panics and/or deadlocks on FreeBSD 12.2-RELEASE #43873

Open
problame opened this issue Jan 23, 2021 · 54 comments
@problame commented Jan 23, 2021

What version of Go are you using (go version)?

$ go version
go version go1.15.6 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/cs/.cache/go-build"
GOENV="/home/cs/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/cs/development/golang/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/cs/development/golang"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/cs/go1.15.6"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/cs/go1.15.6/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build804757202=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I maintain zrepl, a tool for ZFS replication written in Go.
Some users have been hitting Go runtime panics and/or lock-ups on FreeBSD.

The problems were first reported in July 2020 by a user running a development version of FreeBSD (12-STABLE, somewhere between 12.1-RELEASE and 12.2-RELEASE).
The problem occurred with both Go 1.13.5 and 1.14.x.
At the time I asked for the problem to be reproduced on an official binary release of FreeBSD, but that did not happen.

After I updated my personal FreeBSD server to 12.2-RELEASE, I started encountering issues similar to those reported in July 2020.
I have not yet encountered runtime panics, but I have seen several lock-ups (for lack of a better term) of the Go runtime.
The last of the links above contains a stack trace of a goroutine blocked forever in runtime.newobject (stack obtained using dlv).

Summary of my triaging since July:

  • We have ruled out that it's due to the one use of unsafe in zrepl by removing the unsafe code path in a test build. The panics / lock-ups still occurred.
  • The problems stop reproducibly when limiting the process to one CPU using the OS scheduler (cpuset).
  • Most often the problems happen while sockets are being used. It does not happen when the daemon is idle.
  • We have ruled out faulty hardware.
  • The issue has only occurred on Intel systems on bare metal so far. I was unable to reproduce it in a stress test between two FreeBSD VMs on a Ryzen 1700X.
  • I suspect the root cause is one of the following:
    • FreeBSD kernel bug introduced between 12.1-RELEASE and 12.2-RELEASE
    • Go runtime bug (would need to be present in multiple Go versions though)

It would be very helpful to get a quick explanation of what these panics mean so that I can narrow down my audit of the changes between FreeBSD 12.1-RELEASE and 12.2-RELEASE.

Also, I can offer a tmate or similar to a Go / FreeBSD developer to the system with the locked-up daemon.
The lock-up usually occurs after 2-3 days on my system, sometimes sooner, but I can leave it in the locked-up state for a day or two.

Related zrepl issues:

@seankhliao seankhliao changed the title freebsd: runtime panics and/or deadlocks on FreeBSD 12.2-RELEASE runtime: panics and/or deadlocks on FreeBSD 12.2-RELEASE Jan 24, 2021
@dmgk (Contributor) commented Jan 25, 2021

Just a wild guess, but it looks related to 4d27c4c.
@problame Can you try reproducing this issue with the recent tip? I can update lang/go-devel to include the above commit for easier testing.

@problame (Author) commented Jan 25, 2021

@dmgk that seems ... quite concerning, and since zrepl uses pipes like crazy, I wouldn't be surprised if this is the problem.

Is this a regression in Go or something that was always handled incorrectly?
I'm asking because I'm currently trying to 'bisect' Go versions by compiling zrepl with older Go versions and seeing whether the problem occurs.
The version that is running right now is built with Go 1.13, I would expect a crash to happen within the next 5 days.

Should I abort that run and build with tip + 4d27c4c as you suggested, or should we wait and see whether it crashes with Go 1.13?

(I cross-compile from Linux so I don't need the lang/go-devel port to be updated).

@dmgk (Contributor) commented Jan 25, 2021

Well, looking at the commit, it seems that the issue goes back to 2009 - https://github.com/golang/go/blame/master/src/runtime/sys_freebsd_amd64.s#L21, so I'd try reproducing with the tip first.

@problame (Author) commented Jan 25, 2021

Any idea why this would surface only since FreeBSD 12.2 then?

@mdtancsa commented Jan 28, 2021

Just a wild guess, but it look related to 4d27c4c
@problame Can you try reproducing this issue with the recent tip? I can update lang/go-devel to include the above commit for easier testing.

Hi,
I would be happy to try locally if you can update the port with this tip/fix/patch. I was the original reporter, and it's fairly easy for me to reproduce. A couple of other details about when I ran into the issue last summer: I had added more retention to the server (45K+ snapshots) and more clients replicating, so it would have become busier, as well as moving to RELENG12 on FreeBSD at the time.

@dmgk (Contributor) commented Jan 28, 2021

@mdtancsa Thanks, I just updated lang/go-devel to go1.16rc1 a few hours ago, it includes 4d27c4c.

@mdtancsa commented Jan 28, 2021

@mdtancsa commented Jan 29, 2021

So far so good! I am allowing all threads to bind to whatever core of the CPU. Normally I would have had a few deadlocks by now, so it's certainly MUCH improved.

PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
10778 root 21 0 739M 83M uwait 7 8:05 4.72% zrepl-jan2021{zrepl-jan2021}
10778 root 21 0 739M 83M wait 10 3:03 3.50% zrepl-jan2021{zrepl-jan2021}
10778 root 23 0 739M 83M kqread 1 1:21 3.00% zrepl-jan2021{zrepl-jan2021}
10778 root 24 0 739M 83M uwait 5 2:44 2.87% zrepl-jan2021{zrepl-jan2021}
10778 root 21 0 739M 83M uwait 5 2:39 2.75% zrepl-jan2021{zrepl-jan2021}
10778 root 21 0 739M 83M uwait 11 3:10 2.27% zrepl-jan2021{zrepl-jan2021}
10778 root 21 0 739M 83M uwait 2 2:05 1.70% zrepl-jan2021{zrepl-jan2021}
10778 root 21 0 739M 83M uwait 9 2:51 1.45% zrepl-jan2021{zrepl-jan2021}
10778 root 21 0 739M 83M uwait 11 3:02 1.27% zrepl-jan2021{zrepl-jan2021}
10778 root 20 0 739M 83M uwait 0 3:02 0.51% zrepl-jan2021{zrepl-jan2021}
10778 root 20 0 739M 83M uwait 8 2:40 0.46% zrepl-jan2021{zrepl-jan2021}
10778 root 52 0 739M 83M uwait 7 0:25 0.34% zrepl-jan2021{zrepl-jan2021}
10778 root 23 0 739M 83M wait 4 2:34 0.28% zrepl-jan2021{zrepl-jan2021}
10778 root 21 0 739M 83M uwait 6 3:05 0.10% zrepl-jan2021{zrepl-jan2021}
10778 root 22 0 739M 83M uwait 7 3:11 0.04% zrepl-jan2021{zrepl-jan2021}
10778 root 20 0 739M 83M uwait 0 2:57 0.03% zrepl-jan2021{zrepl-jan2021}
10778 root 21 0 739M 83M uwait 4 2:56 0.02% zrepl-jan2021{zrepl-jan2021}
10778 root 22 0 739M 83M uwait 3 3:00 0.00% zrepl-jan2021{zrepl-jan2021}
10778 root 27 0 739M 83M uwait 8 1:59 0.00% zrepl-jan2021{zrepl-jan2021}
10778 root 20 0 739M 83M uwait 3 1:41 0.00% zrepl-jan2021{zrepl-jan2021}
10778 root 20 0 739M 83M uwait 1 0:00 0.00% zrepl-jan2021{zrepl-jan2021}
10778 root 20 0 739M 83M uwait 9 0:00 0.00% zrepl-jan2021{zrepl-jan2021}

This is still v0.3.0-rc1. If it survives the weekend without a deadlock, I will check out a newer version. It's a bit of work since I have to update all the clients as well. @problame, is there a particular version from the repo I should use? Or can I just stick with 0.3.1?

@problame (Author) commented Jan 30, 2021

@mdtancsa While chasing this bug over the last few weeks, we've found a few places in zrepl that could cause deadlocks / data races. In all of those cases it was easy to rule out that they were related to the bug I have reported here. Those patches are in branch problame/develop. But that branch also contains a bump of a bunch of dependencies, including gRPC, which should not be included in your testing.

@mdtancsa commented Jan 30, 2021

@problame (Author) commented Jan 30, 2021

If you are already running 0.3.1 then I wouldn't worry.

You shouldn't run the "Update to protobuf v1.25 and grpc 1.35; bump CI to go1.12" commit when doing testing for this issue, though, because a bump in grpc / protobuf might introduce a regression that we could falsely identify as the issue we are debugging here.

So the commit for you is 166e80bb.

If you are not running 0.3.1 yet, just make sure that your grid pruning rules are correct (ref zrepl/zrepl#400).

@mdtancsa commented Jan 30, 2021

@problame (Author) commented Jan 30, 2021

Dependencies are determined by go.mod, no need to deal with that.
The commit is correct, build process seems correct.
Please make sure you have looked into zrepl/zrepl#400 (mentioned above).

The other slightly 'risky' commit is 0d96627, which adds additional concurrency in some places. I have not run it in production yet, so you may want to revert it for your deployment.
Also, we want to remove the little piece of unsafe code for this investigation.

I pushed a branch problame/golang-freebsd-investigation (commit 538382a36bd0c71c679134a09373a98f3d3f4cc4 ) for you to deploy.

@mdtancsa commented Jan 30, 2021

@mdtancsa commented Jan 31, 2021

@problame (Author) commented Jan 31, 2021

@dmgk I'm pretty sure this is not related to zrepl. @mdtancsa is running a build with no unsafe code, and AFAIK we don't have dependencies that use unsafe. The SIGSEGV is most likely a problem in the Go runtime.
The crash is the same as the one you posted for 0.3-rc1 on the corresponding zrepl issue. My analysis there holds for this stack trace as well.

@mdtancsa could you avoid pasting these large stack traces verbatim here? I think it hurts readability. Either write the reply through the web interface instead of email so that Markdown works and you can use

<details>
yourstacktrace surrounded by three backticks
</details>

or put them in an attachment / uploaded file (if that works)

@mdtancsa commented Jan 31, 2021

@problame (Author) commented Jan 31, 2021

So one difference between our two deployments is that I cross-compile from Linux, so it's a fully static binary. @dmgk how does this affect the Go commit that we are testing?

@dmgk (Contributor) commented Jan 31, 2021

@problame My guess is that cross compilation shouldn't affect anything, but I'm not very familiar with Go internals.

@mdtancsa commented Feb 1, 2021

Another crash, but still no deadlocks at least. Should I try compiling the server process with the regular release Go version, without the patch, in a day or two?

crash2.txt

@mdtancsa commented Feb 4, 2021

@dmgk another crash. Any thoughts? This was built with make GO_EXTRA_BUILDFLAGS=-race. Anything else I can add to the build process to help narrow down the issue?

crash-feb3.txt

@problame (Author) commented Feb 4, 2021

There was no output from the race detector?
(It would have appeared prior to the crash.)

I would like to see what happens if you use a completely static binary. Build with CGO_ENABLED=0 and verify that the binary is static using ldd.

Please post go version, zrepl commit sha, and maybe the build command + output with every crash; I'm starting to lose track of the configurations we already went through.

@mdtancsa commented Feb 4, 2021

@problame (Author) commented Feb 4, 2021

So I don't follow: which binary is dynamically linked and which statically linked? The ldd outputs you posted are contradictory.

@mdtancsa commented Feb 4, 2021

So I don't follow: which binary is dynamically linked and which statically linked? The ldd outputs you posted are contradictory.

The procstat output was from yesterday's binary that was dynamically linked with the race detector code. I recompiled this morning and am running with a statically linked binary now. However, I cannot have both a statically linked binary and the race detector code compiled in at the same time. I started the statically linked binary just a few hours ago.

@problame (Author) commented Feb 4, 2021

All right, thanks for clearing this up. I'm "looking forward" to the results ;)

@mdtancsa commented Feb 5, 2021

@problame (Author) commented Feb 5, 2021

Does Go on FreeBSD do this too? By compiling the app statically vs. dynamically,
is the app now bypassing one layer of API? I hesitate to bring it up in
case it's way off base.

AFAIK the static binaries do the syscall directly, not sure though.

The other detail however that radically improves stability of the app is
to pin it to one or two CPUs via cpuset. It will last many weeks
without a crash / deadlock if I do that.

Yes. With a single CPU, a race condition in the runtime could only manifest through preemption (either by Go's scheduler or by the OS scheduler). Please don't apply the cpuset limit if you can afford the annoyance of the frequent crashes; we shouldn't be content with this as a solution or even as a workaround.

@mdtancsa commented Feb 5, 2021

@mdtancsa commented Feb 5, 2021

Another crash on the server process.

./zrepl.feb4 version

client: zrepl version=v0.3.1-11-g538382a go=go1.16rc1 GOOS=freebsd GOARCH=amd64 Compiler=gc

statically built

ldd zrepl.feb4

ldd: zrepl.feb4: not a dynamic ELF executable
crash-feb5.txt

@mdtancsa commented Feb 6, 2021

Another crash. Same binary as before.

crashfeb6.txt

@problame (Author) commented Feb 6, 2021

I have an idea:

=> Let's see if it's related to some compiler optimization.

make GO_EXTRA_BUILDFLAGS=-gcflags=-N

@mdtancsa commented Feb 6, 2021

I have an idea:

=> Let's see if it's related to some compiler optimization.

make GO_EXTRA_BUILDFLAGS=-gcflags=-N

Running with this now

gmake CGO_ENABLED=0 GO_EXTRA_BUILDFLAGS=-gcflags=-N

GO111MODULE=on go build -mod=readonly -gcflags=-N -ldflags "-X github.com/zrepl/zrepl/version.zreplVersion=v0.3.1-11-g538382a" -o "artifacts/zrepl-freebsd-amd64"

cd artifacts/

ldd zrepl-freebsd-amd64

ldd: zrepl-freebsd-amd64: not a dynamic ELF executable

@mdtancsa commented Feb 8, 2021

I have an idea:

=> Let's see if it's related to some compiler optimization.

make GO_EXTRA_BUILDFLAGS=-gcflags=-N

Well, I don't know if it fixes the problem, but it certainly has let the process run a lot longer than it has to date!

ps -auxwwp 32356 -o lstart=

USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 32356 3.0 0.1 724264 81244 2 S+ Sat11 127:23.30 /usr/local/bin/z Sat Feb 6 11:37:11 2021

Normally I would have seen a crash or deadlock by now.
The process has been running naturally across all cores, real and HT, without issue for almost 2 days:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
32356 root 20 0 707M 79M uwait 1 18:54 0.78% zrepl.feb6{zrepl.feb6}
32356 root 21 0 707M 79M uwait 10 6:50 0.20% zrepl.feb6{zrepl.feb6}
32356 root 21 0 707M 79M uwait 8 6:44 0.20% zrepl.feb6{zrepl.feb6}
32356 root 21 0 707M 79M uwait 7 6:39 0.20% zrepl.feb6{zrepl.feb6}
32356 root 21 0 707M 79M uwait 11 1:02 0.20% zrepl.feb6{zrepl.feb6}
32356 root 20 0 707M 79M uwait 6 7:11 0.10% zrepl.feb6{zrepl.feb6}
32356 root 20 0 707M 79M uwait 7 6:44 0.10% zrepl.feb6{zrepl.feb6}
32356 root 21 0 707M 79M uwait 5 6:41 0.10% zrepl.feb6{zrepl.feb6}
32356 root 20 0 707M 79M uwait 3 7:02 0.00% zrepl.feb6{zrepl.feb6}
32356 root 20 0 707M 79M kqread 5 7:01 0.00% zrepl.feb6{zrepl.feb6}
32356 root 21 0 707M 79M uwait 3 7:00 0.00% zrepl.feb6{zrepl.feb6}
32356 root 21 0 707M 79M uwait 0 6:47 0.00% zrepl.feb6{zrepl.feb6}
32356 root 20 0 707M 79M uwait 9 6:44 0.00% zrepl.feb6{zrepl.feb6}
32356 root 20 0 707M 79M uwait 9 6:39 0.00% zrepl.feb6{zrepl.feb6}
32356 root 20 0 707M 79M uwait 1 6:29 0.00% zrepl.feb6{zrepl.feb6}
32356 root 20 0 707M 79M uwait 4 6:25 0.00% zrepl.feb6{zrepl.feb6}
32356 root 21 0 707M 79M uwait 8 6:25 0.00% zrepl.feb6{zrepl.feb6}
32356 root 22 0 707M 79M uwait 7 5:26 0.00% zrepl.feb6{zrepl.feb6}
32356 root 43 0 707M 79M uwait 2 0:44 0.00% zrepl.feb6{zrepl.feb6}
32356 root 20 0 707M 79M uwait 6 0:00 0.00% zrepl.feb6{zrepl.feb6}
32356 root 21 0 707M 79M uwait 10 0:00 0.00% zrepl.feb6{zrepl.feb6}

@mdtancsa commented Feb 9, 2021

Well, a hang and a crash. Tuesday is a very busy day CPU- and disk-wise on this server, as I offload a copy to disks that go physically offsite. Still, it seemed to last a LOT longer than it normally would. The deadlock happened first, after several days, and then the crash came a few hours later, after the process was restarted.

crash-feb9.txt

deadlock-feb9.txt

@problame (Author) commented Feb 9, 2021

:/

The procstat kstack output is not very helpful.
It would be very helpful if you could attach using dlv and get a dump of all the goroutines (grs -t).
Instructions on how to obtain dlv: https://github.com/go-delve/delve/blob/master/Documentation/installation/freebsd/install.md

@problame (Author) commented Feb 9, 2021

Also: If I understand correctly these were two separate occasions. Did they happen the same day? Same machine? Different machine?

@problame (Author) commented Feb 9, 2021

crash-feb9.txt

That's a runtime panic (from the binary built with gmake CGO_ENABLED=0 GO_EXTRA_BUILDFLAGS=-gcflags=-N):

wirep: p->m=824638963712(10) p->status=1
fatal error: wirep: invalid p state
wirep: p->m=0(0) p->status=3
fatal error: wirep: invalid p state

For context: wirep is the runtime scheduler routine that associates an M (OS thread) with a P (scheduling context), and it throws this fatal error when the P is not in the state it expects, i.e. the scheduler's own bookkeeping looks corrupted.

@mdtancsa commented Feb 9, 2021

Also: If I understand correctly these were two separate occasions. Did they happen the same day? Same machine? Different machine?

Same machine, same binary, but two separate occurrences. The hang happened early this AM, my script killed off the process and restarted it, and then a few hours later came the runtime crash.
Is there a way to automate the dlv attach if there is a hang? I am guessing not. OK, I will rework it so that the next time I get a hang, I will send myself an alert and attach to the process. I check whether it's working via its stats interface. Shell script attached:
check-script.txt

@problame (Author) commented Feb 10, 2021

Is there a way to automate the dlv attach if there is a hang ?

  • I think you can start the zrepl daemon through dlv in headless mode. In that mode dlv runs as a daemon itself. Then you can use dlv to connect to the dlv daemon (much like gdb remote debugging).
  • there appears to be some support for scripting, though I've never used it: aarzilli/delve@bbbdd55
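
A headless setup might look like the following; the port, binary path, and daemon argument are placeholders, so check dlv --help for your version:

```sh
# Run the daemon under a headless dlv server (port is arbitrary):
dlv exec --headless --listen=127.0.0.1:4040 /usr/local/bin/zrepl -- daemon

# Later, from another shell, connect and dump all goroutine stacks:
dlv connect 127.0.0.1:4040
(dlv) grs -t
```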

@mdtancsa commented Feb 13, 2021

:/

The procstat kstack output is not very helpful.
It would be very helfpul if you could attach using dlv and get a dump of all the goroutines (grs -t).
Instructions on how to obtain dlv: https://github.com/go-delve/delve/blob/master/Documentation/installation/freebsd/install.md

OK, some progress (I think). My watcher script that restarts the daemon after a crash had a bug I introduced, so it didn't restart after yet another runtime crash. The good news is that since building the binary with (gmake CGO_ENABLED=0 GO_EXTRA_BUILDFLAGS=-gcflags=-N), the crashes and hangs are fewer. (feb13-crash.txt)

As a number of clients needed to catch up, I was watching the load on all the threads. There really is not much CPU contention going on; it seems to be all disk contention, and many of the threads are at zero CPU. But the more clients I woke up, the busier the threads got. As I manually logged into my servers and did a zrepl signal wakeup to make more and more of the server threads busy, I finally got to the point where the server process deadlocked. Not sure if it was a fluke or not; I will try to reproduce it again later after all the clients catch up. The capture dlv-feb13.txt has that grs -t output.
It sort of makes sense in that if the server is under extra load, like it is on Tuesdays, a crash or hang seems more likely.

dlv-feb13.txt

feb13-crash.txt

here is the process right now. I watch it via top -p -H

PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
81446 root 23 0 771M 100M uwait 5 0:12 3.46% zrepl.feb6{zrepl.feb6}
81446 root 21 0 771M 100M nanslp 11 0:25 2.63% zrepl.feb6{zrepl.feb6}
81446 root 21 0 771M 100M uwait 10 0:11 2.60% zrepl.feb6{zrepl.feb6}
81446 root 22 0 771M 100M wait 2 0:11 1.96% zrepl.feb6{zrepl.feb6}
81446 root 20 0 771M 100M kqread 9 0:04 1.75% zrepl.feb6{zrepl.feb6}
81446 root 22 0 771M 100M wait 5 0:13 1.51% zrepl.feb6{zrepl.feb6}
81446 root 22 0 771M 100M uwait 2 0:07 0.67% zrepl.feb6{zrepl.feb6}
81446 root 20 0 771M 100M uwait 5 0:10 0.61% zrepl.feb6{zrepl.feb6}
81446 root 52 0 771M 100M uwait 4 0:03 0.04% zrepl.feb6{zrepl.feb6}
81446 root 23 0 771M 100M wait 9 0:12 0.02% zrepl.feb6{zrepl.feb6}
81446 root 22 0 771M 100M wait 10 0:00 0.02% zrepl.feb6{zrepl.feb6}
81446 root 20 0 771M 100M wait 1 0:12 0.02% zrepl.feb6{zrepl.feb6}
81446 root 20 0 771M 100M wait 8 0:06 0.01% zrepl.feb6{zrepl.feb6}
81446 root 22 0 771M 100M uwait 3 0:14 0.00% zrepl.feb6{zrepl.feb6}
81446 root 20 0 771M 100M uwait 0 0:13 0.00% zrepl.feb6{zrepl.feb6}
81446 root 20 0 771M 100M uwait 4 0:13 0.00% zrepl.feb6{zrepl.feb6}
81446 root 21 0 771M 100M uwait 4 0:12 0.00% zrepl.feb6{zrepl.feb6}
81446 root 23 0 771M 100M uwait 1 0:12 0.00% zrepl.feb6{zrepl.feb6}
81446 root 23 0 771M 100M uwait 5 0:12 0.00% zrepl.feb6{zrepl.feb6}
81446 root 21 0 771M 100M uwait 1 0:11 0.00% zrepl.feb6{zrepl.feb6}
81446 root 22 0 771M 100M uwait 2 0:11 0.00% zrepl.feb6{zrepl.feb6}
81446 root 20 0 771M 100M uwait 5 0:10 0.00% zrepl.feb6{zrepl.feb6}
81446 root 22 0 771M 100M uwait 4 0:05 0.00% zrepl.feb6{zrepl.feb6}
81446 root 20 0 771M 100M uwait 4 0:00 0.00% zrepl.feb6{zrepl.feb6}
81446 root 20 0 771M 100M uwait 9 0:00 0.00% zrepl.feb6{zrepl.feb6}

Normally it looks like the above, with a whole bunch of threads idle. But as I generated more load from manually waking up the clients, none were idle and I got the deadlock.

@mdtancsa commented Feb 14, 2021

deadlock again this AM. grs -t output attached

dlv-hang-feb14.txt

@egonelbre (Contributor) commented Feb 14, 2021

I noticed that the only FreeBSD-specific thing in the stacks was sys_umtx_op, and there was a fix to it in https://go-review.googlesource.com/c/go/+/276892 / #43106. Maybe there are multiple causes for the failures?

Were these failures with go1.16rc1? Maybe Go 1.16rc1 without optimizations is a good thing to try?

@problame (Author) commented Feb 14, 2021

@egonelbre yes we use 1.16rc1 after ruling out that it's a regression in Go.

(See #43873 on the Go issue tracker as well )

@mdtancsa commented Feb 14, 2021

I noticed that the only freebsd specific thing in the stacks was sys_umtx_op and there was a fix to it in https://go-review.googlesource.com/c/go/+/276892 / #43106. Maybe there are multiple causes for the failures?

Were these failures with go1.16rc1? Maybe Go 1.16rc1 without optimizations is a good thing to try?
@egonelbre this was with Go from the ports tree, lang/go-devel, which is the 1.16rc1 that @dmgk updated. I built the binary with
gmake CGO_ENABLED=0 GO_EXTRA_BUILDFLAGS=-gcflags=-N
Are there any other flags I should try building with?

@mdtancsa commented Feb 16, 2021

@problame, another deadlock (feb16.log attached). Unless you need more examples, I am going to pin the process to two CPUs, which makes it more stable and longer-lasting.

@problame (Author) commented Feb 17, 2021

All right. I'm pretty pessimistic about this getting fixed any time soon, since FreeBSD is not a tier 1 platform in Go.
I referenced this issue from a bunch of other suspicious FreeBSD-related issues.
I encounter the issue on my FreeBSD 12.2 system about every other day now, so there's at least a relatively timely feedback loop should anyone have a proposal on how to make progress here :/

@egonelbre (Contributor) commented Feb 17, 2021

I think the major hurdle in making the problem actionable is making it easier to reproduce, i.e. trying to create something that fails within 1 minute of running the program.

@mdtancsa commented Feb 17, 2021

Now that go1.16 is released, I rebuilt everything with it from lang/go on my FreeBSD server. I still used
gmake CGO_ENABLED=0 GO_EXTRA_BUILDFLAGS=-gcflags=-N
to build.

@mdtancsa commented Feb 19, 2021

Now that go1.16 is R, I rebuilt everything with it from lang/go on my FreeBSD server. I still used
gmake CGO_ENABLED=0 GO_EXTRA_BUILDFLAGS=-gcflags=-N
to build.

Still getting runtime crashes with the release version. One more crash dump for the archives:
feb19-crash.txt

@cuonglm (Member) commented Aug 30, 2021