
runtime: fail with: Fatal glibc error: rseq registration failed #51315

Closed
belimawr opened this issue Feb 22, 2022 · 14 comments
Labels
NeedsInvestigation


@belimawr belimawr commented Feb 22, 2022

What version of Go are you using (go version)?

$ go version
go version go1.17.7 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/tiago/.cache/go-build"
GOENV="/home/tiago/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/tiago/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/tiago/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.7"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/tiago/go/src/github.com/elastic/beats/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3405069067=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I built filebeat and ran it, then got this message:

Fatal glibc error: rseq registration failed
zsh: IOT instruction (core dumped)  ./filebeat -c ./config.yml > /dev/null

I inspected the core dump with delve, and this is what I got:

0  0x0000000000f5d8a1 in runtime.raise
   at /usr/local/go/src/runtime/sys_linux_amd64.s:165
1  0x0000000000f5bfe1 in runtime.goexit
   at /usr/local/go/src/runtime/asm_amd64.s:1581

The issue started when I updated my system (Arch Linux) and as part of the update glibc was updated to 2.35.

Somehow it seems related to starting threads, but I have not managed to reproduce it with a simple Go program.

I'm not sure whether this is an issue with glibc or with the way the Go runtime interacts with it. The rest of my system is stable, so I'm inclined to believe it's Go related. I have not changed my application at all; the only change was the system update. I also saw the same behaviour with an older Go version (1.16, I believe), and some colleagues report similar problems when running Go programs.

Disabling CGO when compiling makes the issue go away.
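As a workaround sketch (assuming the standard Go toolchain environment variables, run from the filebeat directory of the reproduction steps):

```shell
# Build filebeat with cgo disabled; the resulting binary does not link
# against glibc, so glibc's rseq registration path is never reached.
CGO_ENABLED=0 go build .

# Check the linkage of the resulting binary; a pure-Go build is
# typically statically linked, with no libc dependency.
file ./filebeat
```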

I see the same issue on an Arch Linux VM running glibc 2.35 and the latest Go compiler.
Everything works on an Arch Linux VM with glibc 2.33.

Some more information about my environment:
Linux distro: Arch Linux

$ pacman -Ss glibc | grep installed
core/glibc 2.35-2 [installed]
extra/nss-mdns 0.15.1-1 [installed]

$ uname -a
Linux x-wing 5.16.10-arch1-1 #1 SMP PREEMPT Wed, 16 Feb 2022 19:35:18 +0000 x86_64 GNU/Linux

$ go version
go version go1.17.7 linux/amd64

Full steps to reproduce

Clone, compile and run filebeat:

git clone https://github.com/elastic/beats
cd beats/filebeat
go build .
./filebeat -c ./config.yml

The config.yml file is this:

filebeat.inputs:
- type: filestream
  enabled: true
  paths:
    - /home/tiago/sandbox/filestream-duplication/first.log
  fields:
    app.name: "first"
  fields_under_root: true
  parsers:
    - ndjson:
        keys_under_root: true
        add_error_key: true

- type: filestream
  enabled: true
  paths:
    - /home/tiago/sandbox/filestream-duplication/to-append.log
  fields:
    app.name: "second"
  fields_under_root: true
  parsers:
    - ndjson:
        keys_under_root: true
        add_error_key: true

output.console:
  enabled: true

#logging.level: debug
logging.to_stderr: true
#logging.selectors: ["havester", "filestream"]
logging.selectors: ["input.filestream", "file_watcher", "input_type"]
logging.metrics.enabled: false
logging.to_files: false

path.home: ./_tmp

What did you expect to see?

The program to run normally.

What did you see instead?

The program crashes and a core dump is generated. Most of the time it breaks, but occasionally (very rarely) it just runs without any issue.

@mengzhuo mengzhuo changed the title affected/package: all. Some Go programs fail with: Fatal glibc error: rseq registration failed runtime: fail with: Fatal glibc error: rseq registration failed Feb 22, 2022

@mengzhuo mengzhuo commented Feb 22, 2022

Could you provide coredump logs? Try coredumpctl if you are using systemd.


@mengzhuo mengzhuo commented Feb 22, 2022

I can't reproduce with your steps:

root@anton:~/beats/filebeat# ldd filebeat                                              
        linux-vdso.so.1 (0x00007ffeb01fd000)                                           
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f32e7d4b000)  
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f32e7d45000)            
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f32e7b80000)              
        /lib64/ld-linux-x86-64.so.2 (0x00007f32e7d76000)                               
 /lib/x86_64-linux-gnu/libc.so.6 --version
GNU C Library (Debian GLIBC 2.31-13+deb11u2) stable release version 2.31.                                  

cc @cherrymui @randall77

@mengzhuo mengzhuo added the NeedsInvestigation label Feb 22, 2022

@belimawr belimawr commented Feb 22, 2022

@mengzhuo yes, I have plenty of coredumps!

Some more info about my glibc:

 ldd ./filebeat
        linux-vdso.so.1 (0x00007fff3d540000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f3175016000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f317523c000)
 tiago@x-wing~/go/src/github.com/elastic/beats/filebeat(filestream-duplication|…) % /usr/lib/libc.so.6 --version
GNU C Library (GNU libc) stable release version 2.35.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 11.2.0.
libc ABIs: UNIQUE IFUNC ABSOLUTE
For bug reporting instructions, please see:
<https://bugs.archlinux.org/>.

Some extra info about the binary crashing:

file ./filebeat
./filebeat: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=c17a039ad495afc6908ef2e1c4d7e385c22e5eac, for GNU/Linux 4.4.0, not stripped

Here is the output of coredumpctl:

tiago@x-wing~/go/src/github.com/elastic/beats/filebeat(filestream-duplication|…) % coredumpctl -o ./filebeat.dump dump filebeat
           PID: 607351 (filebeat)
           UID: 1000 (tiago)
           GID: 1000 (tiago)
        Signal: 6 (ABRT)
     Timestamp: Tue 2022-02-22 13:05:36 CET (2h 31min ago)
  Command Line: ./filebeat -c ./config.yml
    Executable: /home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat
 Control Group: /user.slice/user-1000.slice/session-3.scope
          Unit: session-3.scope
         Slice: user-1000.slice
       Session: 3
     Owner UID: 1000 (tiago)
       Boot ID: 1ff037b947a344fbbf4de57ce1b499a9
    Machine ID: a6a19d57efcf4bf38705c63217a63ba3
      Hostname: x-wing
       Storage: /var/lib/systemd/coredump/core.filebeat.1000.1ff037b947a344fbbf4de57ce1b499a9.607351.1645531536000000.zst (present)
     Disk Size: 2.0M
       Message: Process 607351 (filebeat) of user 1000 dumped core.
                
                Module /home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat with build-id c17a039ad495afc6908ef2e1c4d7e385c22e5eac
                Module linux-vdso.so.1 with build-id 2ad6b24b487ac2a9c2901c8ddf457eeb465c1d5a
                Module ld-linux-x86-64.so.2 with build-id c09c6f50f6bcec73c64a0b4be77eadb8f7202410
                Module libc.so.6 with build-id 85766e9d8458b16e9c7ce6e07c712c02b8471dbc
                Stack trace of thread 607369:
                #0  0x0000000000f5d8a1 n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb5d8a1)
                #1  0x0000000000f3e985 n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb3e985)
                #2  0x0000000000f3d38b n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb3d38b)
                #3  0x0000000000f5e60e n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb5e60e)
                #4  0x0000000000f5db3d n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb5db3d)
                #5  0x00007f4fc842d560 __restore_rt (libc.so.6 + 0x42560)
                #6  0x00007f4fc847a34c __pthread_kill_implementation (libc.so.6 + 0x8f34c)
                #7  0x00007f4fc842d4b8 raise (libc.so.6 + 0x424b8)
                #8  0x00007f4fc8417534 abort (libc.so.6 + 0x2c534)
                #9  0x00007f4fc846e397 __libc_message (libc.so.6 + 0x83397)
                #10 0x00007f4fc846e41a __libc_fatal (libc.so.6 + 0x8341a)
                #11 0x00007f4fc84785e0 start_thread (libc.so.6 + 0x8d5e0)
                #12 0x00007f4fc84fd584 __clone (libc.so.6 + 0x112584)
                
                Stack trace of thread 607351:
                #0  0x0000000000f5d83d n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb5d83d)
                #1  0x0000000000f37585 n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb37585)
                #2  0x0000000000f376dd n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb376dd)
                #3  0x0000000000f30192 n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb30192)
                #4  0x0000000000f2f68c n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb2f68c)
                #5  0x0000000000f30c19 n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb30c19)
                #6  0x0000000000f3116d n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb3116d)
                #7  0x0000000000f59da3 n/a (/home/tiago/go/src/github.com/elastic/beats/filebeat/filebeat + 0xb59da3)
                ELF object binary architecture: AMD x86-64
More than one entry matches, ignoring rest.
 tiago@x-wing~/go/src/github.com/elastic/beats/filebeat(filestream-duplication|…)

And the file: filebeat.dump.gz


@ianlancetaylor ianlancetaylor commented Feb 22, 2022

The code that is failing appears to have been added to glibc very recently, in December, 2021. The first version it appears in is glibc 2.35. I think the patch series starts here: https://sourceware.org/pipermail/libc-alpha/2021-December/133656.html .

I don't know why it would affect Go programs in this way. CC @fweimer for thoughts.

@belimawr Do all Go programs fail? Do all Go programs that use cgo fail? Or is there something particular to the program that you are reporting? Thanks.


@belimawr belimawr commented Feb 22, 2022

@ianlancetaylor So far only this program is affected. It's a huge project, so it's hard to pinpoint the cause, but there are a few high-level things I know happen when I experience the crash:

  1. There is low-level interaction with the file system to watch files for changes and read them. (Filebeat ships log files to an Elasticsearch cluster.)
  2. There are multiple goroutines involved in this process: if I recall correctly, at least one per file being watched, plus some others to manage the whole thing.
  3. Very rarely everything just works (within a very small testing window, a minute or less).
  4. The stack trace returned by the coredump/delve is always the same: the two assembly files I mentioned earlier.

So far I haven't had much opportunity to experiment with other applications, but the go tool/compiler works fine, and my unit tests also run without any problems.

I opened this issue because the only change that triggered the problem was a system update (no code change), and both the error and the stack trace point at low-level code, leading me to believe it could be related to how the runtime interacts with the OS/kernel.

I'll try to gather more information about how cgo is used on this application.


@belimawr belimawr commented Feb 24, 2022

I've been playing with delve trying to find some pattern to those crashes, and I managed to get the program crashing on the same line multiple times. Interestingly, those crashes happen on a channel read. I also got a coredump right before the crash.

Here is the coredump:
coredump.tar.gz

I built the binary from this commit on elastic/beats:

  • clone the repo and check out commit dc8ed3759ff10cea5b28c980112e5239da6f3084
  • from the root of the repo, cd filebeat
  • go build .

A few interesting things:

  • There were 63 goroutines running
  • The stack trace returned by bt on the core dump is now different, but it still crashes on some assembly code.
  • The crash happened right after I "stepped into" the channel read line.

The stack trace and where the program crashed:

(dlv) bt
0  0x0000000000f5dde3 in runtime.futex
   at /usr/local/go/src/runtime/sys_linux_amd64.s:520
1  0x0000000000f23b76 in runtime.futexsleep
   at /usr/local/go/src/runtime/os_linux.go:44
2  0x0000000000efdde7 in runtime.notesleep
   at /usr/local/go/src/runtime/lock_futex.go:160
3  0x0000000000f2d2ca in runtime.mPark
   at /usr/local/go/src/runtime/proc.go:1441
4  0x0000000000f2e818 in runtime.stopm
   at /usr/local/go/src/runtime/proc.go:2408
5  0x0000000000f2fce5 in runtime.findrunnable
   at /usr/local/go/src/runtime/proc.go:2984
6  0x0000000000f30c19 in runtime.schedule
   at /usr/local/go/src/runtime/proc.go:3367
7  0x0000000000f3116d in runtime.park_m
   at /usr/local/go/src/runtime/proc.go:3516
8  0x0000000000f59da3 in runtime.mcall
   at /usr/local/go/src/runtime/asm_amd64.s:307
(dlv) l
> runtime.futex() /usr/local/go/src/runtime/sys_linux_amd64.s:520 (PC: 0xf5dde3)
Warning: debugging optimized function
   515:         MOVQ    ts+16(FP), R10
   516:         MOVQ    addr2+24(FP), R8
   517:         MOVL    val3+32(FP), R9
   518:         MOVL    $SYS_futex, AX
   519:         SYSCALL
=> 520:         MOVL    AX, ret+40(FP)
   521:         RET
   522: 
   523: // int32 clone(int32 flags, void *stk, M *mp, G *gp, void (*fn)(void));
   524: TEXT runtime·clone(SB),NOSPLIT,$0
   525:         MOVL    flags+0(FP), DI

The list of goroutines

(dlv) grs
  Goroutine 1 - User: ./beater/signalwait.go:45 github.com/elastic/beats/v7/filebeat/beater.(*Filebeat).Run (0x277cb0d) (thread 514837)
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [force gc (idle) 457136h28m39.07599196s]
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle) 457136h28m39.075997315s]
  Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 5 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 6 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 7 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 8 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 9 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069730013s]
  Goroutine 10 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069735859s]
  Goroutine 11 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069744499s]
  Goroutine 12 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069748707s]
  Goroutine 13 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069753349s]
  Goroutine 14 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069759205s]
  Goroutine 15 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069763277s]
  Goroutine 16 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069768363s]
  Goroutine 18 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC sweep wait]
  Goroutine 19 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC scavenge wait]
  Goroutine 20 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [finalizer wait 457136h28m39.076077093s]
  Goroutine 23 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle) 457136h28m39.076081776s]
  Goroutine 24 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle) 457136h28m39.076085699s]
  Goroutine 25 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 29 - User: /home/tiago/go/pkg/mod/github.com/elastic/go-concert@v0.2.0/timed/timed.go:76 github.com/elastic/go-concert/timed.Periodic (0x2781a15) [select]
  Goroutine 30 - User: ./input/filestream/fswatch.go:229 github.com/elastic/beats/v7/filebeat/input/filestream.(*fileWatcher).Event (0x27910dc) [chan receive]
  Goroutine 31 - User: /home/tiago/go/pkg/mod/github.com/elastic/go-concert@v0.2.0/timed/timed.go:76 github.com/elastic/go-concert/timed.Periodic (0x2781a15) [select]
  Goroutine 32 - User: ./input/filestream/fswatch.go:229 github.com/elastic/beats/v7/filebeat/input/filestream.(*fileWatcher).Event (0x27910dc) [chan receive]
  Goroutine 33 - User: /home/tiago/go/src/github.com/elastic/beats/libbeat/common/streambuf/streambuf.go:147 github.com/elastic/beats/v7/libbeat/common/streambuf.(*Buffer).doAppend (0x1732070) (thread 514828)                                                                     
  Goroutine 34 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 35 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 36 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 37 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 38 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 39 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 40 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [GC worker (idle)]
  Goroutine 41 - User: /usr/local/go/src/runtime/sigqueue.go:169 os/signal.signal_recv (0xf585f8) (thread 514836)
  Goroutine 42 - User: /home/tiago/go/src/github.com/elastic/beats/libbeat/service/service.go:50 github.com/elastic/beats/v7/libbeat/service.HandleSignals.func1 (0x18875a5) [chan receive]
  Goroutine 44 - User: ./input/filestream/internal/input-logfile/update_writer.go:145 github.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.(*updateChan).Recv (0x278aa1b) [select]
  Goroutine 45 - User: /home/tiago/go/pkg/mod/github.com/elastic/go-concert@v0.2.0/timed/timed.go:76 github.com/elastic/go-concert/timed.Periodic (0x2781a15) [select]
  Goroutine 46 - User: ./registrar/registrar.go:164 github.com/elastic/beats/v7/filebeat/registrar.(*Registrar).Run (0x26e0cc5) [select]
  Goroutine 47 - User: /usr/local/go/src/runtime/sema.go:56 sync.runtime_Semacquire (0xf57e25) [semacquire]
  Goroutine 48 - User: /usr/local/go/src/runtime/sema.go:56 sync.runtime_Semacquire (0xf57e25) [semacquire]
  Goroutine 49 - User: ./beater/signalwait.go:79 github.com/elastic/beats/v7/filebeat/beater.(*signalWait).AddChan.func1 (0x277d6df) [chan receive]
  Goroutine 66 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069896298s]
  Goroutine 67 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069901848s]
  Goroutine 68 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069906838s]
  Goroutine 69 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069913122s]
  Goroutine 70 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069918689s]
  Goroutine 71 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069923822s]
  Goroutine 72 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069928852s]
  Goroutine 73 - User: /home/tiago/go/pkg/mod/github.com/klauspost/compress@v1.13.6/zstd/blockdec.go:212 github.com/klauspost/compress/zstd.(*blockDec).startDecoder (0x19dc674) [chan receive 457136h28m39.069933798s]
  Goroutine 74 - User: /home/tiago/go/pkg/mod/k8s.io/klog/v2@v2.30.0/klog.go:1181 k8s.io/klog/v2.(*loggingT).flushDaemon (0x1bddeca) [chan receive 457136h28m39.069938727s]
  Goroutine 87 - User: /home/tiago/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/eventloop.go:316 github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*bufferingEventLoop).run (0x189803c) (thread 514839)                                                                                            
  Goroutine 88 - User: /home/tiago/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/ackloop.go:60 github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*ackLoop).run (0x1894c66) [select]
  Goroutine 89 - User: /home/tiago/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/consumer.go:159 github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*eventConsumer).run (0x1880dae) [select]
  Goroutine 90 - User: /home/tiago/go/src/github.com/elastic/beats/libbeat/publisher/queue/memqueue/consume.go:65 github.com/elastic/beats/v7/libbeat/publisher/queue/memqueue.(*consumer).Get (0x189699e) [select]
  Goroutine 91 - User: /home/tiago/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/client_worker.go:93 github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*clientWorker).run (0x187fc36) [select]
  Goroutine 92 - User: /usr/local/go/src/runtime/proc.go:367 runtime.gopark (0xf2a996) [select]
  Goroutine 93 - User: /usr/local/go/src/encoding/json/decode.go:682 encoding/json.(*decodeState).object (0x10b1f29) (thread 514562)
  Goroutine 94 - User: /usr/local/go/src/runtime/select.go:573 reflect.rselect (0xf57c7e) [select]
  Goroutine 95 - User: /home/tiago/go/pkg/mod/github.com/elastic/go-concert@v0.2.0/timed/timed.go:76 github.com/elastic/go-concert/timed.Periodic (0x2781a15) [select]
  Goroutine 96 - User: /home/tiago/go/pkg/mod/github.com/elastic/go-concert@v0.2.0/ctxtool/func.go:64 github.com/elastic/go-concert/ctxtool.(*funcContext).wait (0x1933d45) [select]
  Goroutine 98 - User: /home/tiago/go/pkg/mod/github.com/elastic/go-concert@v0.2.0/unison/taskgroup.go:159 github.com/elastic/go-concert/unison.(*TaskGroup).Go.func1 (0x1944a00)
  Goroutine 99 - User: /home/tiago/go/pkg/mod/github.com/elastic/go-concert@v0.2.0/ctxtool/func.go:50 github.com/elastic/go-concert/ctxtool.WithFunc·dwrap·1 (0x1933c20)


@ianlancetaylor ianlancetaylor commented Feb 25, 2022

I'm not sure the goroutine stack traces are all that helpful. The error is almost certainly coming from the libc function pthread_create. A Go program that uses cgo will call pthread_create whenever it needs to create a new thread. The scheduler will create a new thread whenever there are fewer than GOMAXPROCS threads available to run Go code.

It is also possible that the C code called by your program is calling pthread_create for its own purposes.

If you can capture a backtrace from the code in pthread_create that calls rseq_register_current_thread, that might be helpful.
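A possible way to capture such a backtrace with gdb (a sketch: rseq_register_current_thread is a glibc-internal symbol, so its exact name may differ between builds, and glibc debug symbols likely need to be installed):

```shell
# Sketch of a gdb session: break inside glibc's rseq registration and
# dump the stack once the breakpoint is hit.
gdb --args ./filebeat -c ./config.yml
(gdb) break rseq_register_current_thread
(gdb) run
(gdb) backtrace
```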


@mengzhuo mengzhuo commented Feb 25, 2022

sched_getcpu may be the culprit.

https://www.phoronix.com/scan.php?page=news_item&px=Glibc-RSEQ-Still-2020

On Monday was the latest version of a Glibc patch for making use of RSEQ for sched_getcpu(), the call to determine the CPU core on which the calling thread is executing.

cc @compudj, the author of this patch


@fweimer-rh fweimer-rh commented Feb 25, 2022

It is likely that beats installs a seccomp handler that blocks the rseq system call for the current process (as opposed to for new processes). This means that rseq is active for the main thread, but cannot be activated for other threads, a configuration that is deliberately not supported.


@Azertooth Azertooth commented Feb 25, 2022

The code that is failing appears to have been added to glibc very recently, in December, 2021. The first version it appears in is glibc 2.35. I think the patch series starts here: https://sourceware.org/pipermail/libc-alpha/2021-December/133656.html .

I don't know why it would affect Go programs in this way. CC @fweimer for thoughts.

@belimawr Do all Go programs fail? Do all Go programs that use cgo fail? Or is there something particular to the program that you are reporting? Thanks.

Hi, I can confirm it crashes journalbeat and journalbeat-oss instances too


@prattmic prattmic commented Feb 25, 2022

It is likely that beats installs a seccomp handler that blocks the rseq system call for the current process (as opposed to for new processes). This means that rseq is active for the main thread, but cannot be activated for other threads, a configuration that is deliberately not supported.

This sounds correct to me. I see a policy here which does not include rseq (and action errno, meaning the syscall fails rather than killing the process), and this package would apply the policy to all threads in the current process, which would cause a failure on future thread creation.


@belimawr belimawr commented Mar 1, 2022

Thank you so much for the help folks ❤️ !

@fweimer-rh @prattmic that was the issue. I added rseq to the list of allowed syscalls and everything seems to be working fine.
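For reference, the change amounts to something like the fragment below. This is a sketch assuming the policy format used by elastic/go-seccomp-bpf; the actual beats allowlist contains many more syscalls than shown here.

```yaml
seccomp:
  default_action: errno
  syscalls:
    - action: allow
      names:
        - rseq
        # ... the rest of the existing allowlist
```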

I'm closing this issue as it is not a runtime bug.

Sorry for the trouble.


@prattmic prattmic commented Mar 1, 2022

For what it's worth, difficult-to-debug cases like this are why, when using a seccomp policy, I prefer to:

  • Use ActionKillProcess or ActionTrap, so violations fail loudly and clearly.
  • Disable cgo, or at least statically link libc, so an update of libc on the system can't suddenly make a program start failing without even rebuilding the main binary.
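The two build options above can be sketched as follows (standard Go toolchain flags; whether static linking succeeds depends on the C libraries the program pulls in):

```shell
# Option A: drop cgo entirely; the binary no longer depends on glibc.
CGO_ENABLED=0 go build .

# Option B: keep cgo but link libc statically, so the glibc installed
# on the target system no longer matters at run time.
go build -ldflags '-linkmode external -extldflags "-static"' .
```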


@belimawr belimawr commented Mar 1, 2022

For what it's worth, difficult-to-debug cases like this are why, when using a seccomp policy, I prefer to:

  • Use ActionKillProcess or ActionTrap, so violations fail loudly and clearly.
  • Disable cgo, or at least statically link libc, so an update of libc on the system can't suddenly make a program start failing without even rebuilding the main binary.

Thank you so much for those tips ❤️ !

I'll try to implement them so we won't be bitten by it again.
