Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

external-dns eventually segfaults on ARM build due to istio library #1215

Open
nickbp opened this issue Sep 28, 2019 · 9 comments
Open

external-dns eventually segfaults on ARM build due to istio library #1215

nickbp opened this issue Sep 28, 2019 · 9 comments

Comments

@nickbp
Copy link

@nickbp nickbp commented Sep 28, 2019

external-dns revision: 9424664 (v0.5.17)
go version: go1.13 linux/amd64

Recently noticed that the external-dns pod in an ARM cluster was restarting with an error, and left open a log to see if I could catch a sample. After several minutes external-dns crashed with the following, which appears to be a cache entry getting cleared after some TTL expired?:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x120cc]

goroutine 10 [running]:
runtime/internal/atomic.goStore64(0x4275fcc, 0x6d886169, 0x15c8bd5c)
	/usr/lib/go/src/runtime/internal/atomic/atomic_arm.go:144 +0x1c
istio.io/istio/pkg/cache.(*ttlCache).evictExpired(0x4275f80, 0xb0558d69, 0xbf5c1868, 0x195f6390, 0x1a3, 0x2b392c8)
	/external-dns/.gopath/pkg/mod/istio.io/istio@v0.0.0-20190322063008-2b1331886076/pkg/cache/ttlCache.go:138 +0x124
istio.io/istio/pkg/cache.(*ttlCache).evicter(0x4275f80, 0x185c5000, 0x1a3)
	/external-dns/.gopath/pkg/mod/istio.io/istio@v0.0.0-20190322063008-2b1331886076/pkg/cache/ttlCache.go:123 +0x54
created by istio.io/istio/pkg/cache.NewTTLWithCallback
	/external-dns/.gopath/pkg/mod/istio.io/istio@v0.0.0-20190322063008-2b1331886076/pkg/cache/ttlCache.go:100 +0x1f4

Built for arm as a static binary as follows:

# create temp GOPATH to avoid cross-pollination with host
git clone git@github.com:kubernetes-incubator/external-dns
cd external-dns
mkdir -p .gopath/src/github.com/kubernetes-incubator/
ln -s $(pwd) .gopath/src/github.com/kubernetes-incubator/external-dns
export GOPATH=$(pwd)/.gopath
cd .gopath/src/github.com/kubernetes-incubator/external-dns

# build static ARM binary
CGO_ENABLED=0 GOARCH=arm GOOS=linux go build -ldflags="-s"

# static binary built
file external-dns

I haven't tried running it on x86 so I don't know whether the issue is specific to ARM builds, but given the atomic_arm.go in the stacktrace I wouldn't be surprised. Before the crash, external-dns works fine - certs are updated successfully.

@nickbp

This comment has been minimized.

Copy link
Author

@nickbp nickbp commented Oct 12, 2019

Checked the go atomics code and noticed that this segfault appears to be intentional due to the cache library not using aligned values:

//go:nosplit
func goStore64(addr *uint64, v uint64) {
	if uintptr(unsafe.Pointer(addr))&7 != 0 {
		*(*int)(nil) = 0 // crash on unaligned uint64 <<<<< SEGFAULT HERE
	}
	_ = *addr // if nil, fault before taking the lock
	addrLock(addr).lock()
	*addr = v
	addrLock(addr).unlock()
}

The docs meanwhile mention:

On ARM, x86-32, and 32-bit MIPS, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a variable or in an allocated struct, array, or slice can be relied upon to be 64-bit aligned.

So ultimately the fix would likely need to be in the istio cache library being used by external-dns, or alternatively external-dns could switch to a cache implementation that aligns its atomics.

@nickbp

This comment has been minimized.

Copy link
Author

@nickbp nickbp commented Oct 12, 2019

I was able to trivially reproduce the cache segfault with some example code. I've submitted a fix for the cache library here: istio/pkg#75

@nickbp

This comment has been minimized.

Copy link
Author

@nickbp nickbp commented Oct 12, 2019

One thing I hadn't checked is where the cache is being created/used to begin with. The crash stacktrace doesn't show this since it's not occuring on the main thread. To answer that I hacked in a panic at the point where the cache is created:

goroutine 1 [running]:
istio.io/istio/pkg/cache.NewTTLWithCallback(0x914f0000, 0x4e94, 0x185c5000, 0x1a3, 0x1998e4c, 0x40000000, 0x0)
	/home/nick/code/external-dns/.gopath/pkg/mod/istio.io/istio@v0.0.0-20190322063008-2b1331886076/pkg/cache/ttlCache.go:91 +0x24
istio.io/istio/pkg/cache.NewTTL(0x914f0000, 0x4e94, 0x185c5000, 0x1a3, 0x18b08e9, 0x5)
	/home/nick/code/external-dns/.gopath/pkg/mod/istio.io/istio@v0.0.0-20190322063008-2b1331886076/pkg/cache/ttlCache.go:85 +0x3c
istio.io/istio/pilot/pkg/model.newJwksResolver(0x30b8a000, 0x346, 0xf9290000, 0x2260f, 0x6592e000, 0x117, 0x2ab8270)
	/home/nick/code/external-dns/.gopath/pkg/mod/istio.io/istio@v0.0.0-20190322063008-2b1331886076/pilot/pkg/model/jwks_resolver.go:113 +0x34
istio.io/istio/pilot/pkg/model.init()
	/home/nick/code/external-dns/.gopath/pkg/mod/istio.io/istio@v0.0.0-20190322063008-2b1331886076/pilot/pkg/model/authentication.go:62 +0x2f0

Turns out that the cache is being created within a JwksResolver object, which is in turn being statically initialized when istio/pilot/pkg/model is imported. This package is imported by external-dns' source/gateway.go and source/store.go.

So this istio library crash will occur even if the user isn't actually using istio gateway (in my case I'm using ingress-nginx). Ideally the istio library wouldn't have a shared static global that's going and initializing a bunch of caches/threads at import-time to begin with.

@nickbp nickbp changed the title external-dns eventually segfaults on ARM build external-dns eventually segfaults on ARM build due to istio library Oct 12, 2019
@szuecs

This comment has been minimized.

Copy link
Contributor

@szuecs szuecs commented Oct 13, 2019

@njuettner @linki this needs a better dependency handling. Istio import is causing a crash on non istio use.
We should disable istio at compile time if this is happening. We can provide istio builds if needed.

@nickbp

This comment has been minimized.

Copy link
Author

@nickbp nickbp commented Oct 18, 2019

The upstream fix has been merged but I imagine it will be a while before a new version of the library is released.

It's worth pointing out that this just fixes the segfault. The fact that istio will launch its own background timers at import time, regardless of whether the external-dns is even configured to use istio, remains. In other words, this doesn't fix the fact that the istio library uncontrollably launches its own timers, but it at least ensures that those timers aren't destroying the entire process every 30 minutes.

I don't know the full context of why the istio library is necessary, but if it were me I'd look at finding a way to not use it anymore.

@nickbp

This comment has been minimized.

Copy link
Author

@nickbp nickbp commented Oct 18, 2019

In the meantime, I've ended up with a workaround of just running external-dns with the --once flag, within a shell script that handles the loop:

#!/bin/sh
set -e
while [ true ]; do
  # We manually handle our own sleep loop.
  # Avoid istio segfault: https://github.com/kubernetes-incubator/external-dns/issues/1215
  /usr/bin/external-dns --log-level=debug --once --source=ingress --provider=cloudflare --cloudflare-proxied
  sleep 300s
done
@fejta-bot

This comment has been minimized.

Copy link

@fejta-bot fejta-bot commented Jan 16, 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@nickbp

This comment has been minimized.

Copy link
Author

@nickbp nickbp commented Jan 16, 2020

/remove-lifecycle stale

@nickbp

This comment has been minimized.

Copy link
Author

@nickbp nickbp commented Jan 16, 2020

external-dns will continue to segfault in all 32bit builds until the istio dependency is updated or removed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.