Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: unexpected fault address runtime.memhash16 #33635

Open
kevinconaway opened this issue Aug 13, 2019 · 6 comments

Comments

@kevinconaway
Copy link

commented Aug 13, 2019

What version of Go are you using (go version)?

go version go1.12.5 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"
NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.5 LTS"
VERSION_ID="14.04"

$ uname -a
Linux {pod name} 4.15.0-1033-aws #35-Ubuntu SMP Wed Feb 6 13:29:46 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

What did you do?

What did you expect to see?

What did you see instead?

Our application is occasionally crashing with the following error:

unexpected fault address 0x432200
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x2 addr=0x432200 pc=0x432200]

goroutine 2373 [running]:
runtime.throw(0x1306be9, 0x5)
	/root/.gimme/versions/go1.12.5.linux.amd64/src/runtime/panic.go:617 +0x72 fp=0xc007f9cd28 sp=0xc007f9ccf8 pc=0x45dc92
runtime.sigpanic()
	/root/.gimme/versions/go1.12.5.linux.amd64/src/runtime/signal_unix.go:397 +0x401 fp=0xc007f9cd58 sp=0xc007f9cd28 pc=0x473251
runtime.memhash16(0xc007f9ce1e, 0xd2bd649e, 0x0)
	/root/.gimme/versions/go1.12.5.linux.amd64/src/runtime/alg.go:56 fp=0xc007f9cd60 sp=0xc007f9cd58 pc=0x432200
runtime.mapassign(0x1146c80, 0xc001b329f0, 0xc007f9ce1e, 0x11)
	/root/.gimme/versions/go1.12.5.linux.amd64/src/runtime/map.go:593 +0x73 fp=0xc007f9cde8 sp=0xc007f9cd60 pc=0x43e273
...snip...

The code in question looks like:

type MessageReader struct {
        //...
	latency map[string]map[uint16]int64
}

func (mr *MessageReader) captureLatency(timestamp int64, cluster string, partition uint16) {
	if _, ok := mr.latency[cluster]; !ok {
		mr.latency[cluster] = make(map[uint16]int64)
	}
	mr.latency[cluster][partition] = time.Now().Unix() - timestamp
}

This particular application is processing messages and calling this once per message at a rate of ~75k / second. We've seen this happen twice over the past week or so.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Aug 13, 2019

This is exceedingly strange. memhash16 only reads and writes to the stack. It never loads or stores off of anything except %rsp (and %rcx in the stack check, which is the g).

Could you disassemble your binary and show us exactly the disassembled code for memhash16? I want to know what instruction is at 0x432200.

The fault address is also 0x432200, which implies that the code page containing that function is getting unmapped somehow. I have no idea how that could happen.

@kevinconaway

This comment has been minimized.

Copy link
Author

commented Aug 13, 2019

Could you disassemble your binary and show us exactly the disassembled code for memhash16?

Below is the output from go tool objdump

TEXT runtime.memhash16(SB) /root/.gimme/versions/go1.12.5.linux.amd64/src/runtime/alg.go
  alg.go:56		0x432200		64488b0c2560e8ffff	MOVQ FS:0xffffe860, CX			
  alg.go:56		0x432209		483b6110		CMPQ 0x10(CX), SP			
  alg.go:56		0x43220d		7643			JBE 0x432252				
  alg.go:56		0x43220f		4883ec28		SUBQ $rd_kafka_thread_name+24(SB), SP	
  alg.go:56		0x432213		48896c2420		MOVQ BP, 0x20(SP)			
  alg.go:56		0x432218		488d6c2420		LEAQ 0x20(SP), BP			
  alg.go:57		0x43221d		488b442430		MOVQ 0x30(SP), AX			
  alg.go:57		0x432222		48890424		MOVQ AX, 0(SP)				
  alg.go:57		0x432226		488b442438		MOVQ 0x38(SP), AX			
  alg.go:57		0x43222b		4889442408		MOVQ AX, 0x8(SP)			
  alg.go:57		0x432230		48c744241002000000	MOVQ $0x2, 0x10(SP)			
  alg.go:57		0x432239		e8e2690000		CALL runtime.memhash(SB)		
  alg.go:57		0x43223e		488b442418		MOVQ 0x18(SP), AX			
  alg.go:57		0x432243		4889442440		MOVQ AX, 0x40(SP)			
  alg.go:57		0x432248		488b6c2420		MOVQ 0x20(SP), BP			
  alg.go:57		0x43224d		4883c428		ADDQ $rd_kafka_thread_name+24(SB), SP	
  alg.go:57		0x432251		c3			RET					
  alg.go:56		0x432252		e849910500		CALL runtime.morestack_noctxt(SB)	
  alg.go:56		0x432257		eba7			JMP runtime.memhash16(SB)	
@randall77

This comment has been minimized.

Copy link
Contributor

commented Aug 13, 2019

Yes, that's weird. The instruction that's loading the G pointer from the TLS area is faulting, and on the address of the instruction itself, not the address it's loading from.
Are you using any unsafe? cgo?
I can't imagine it would fail this way unless you're calling syscall.Mmap to unmap the code segment.
Any chance you could give us a program with which we could reproduce this ourselves? Otherwise I don't see any way to figure out what is going on.

@kevinconaway

This comment has been minimized.

Copy link
Author

commented Aug 13, 2019

Are you using any unsafe? cgo?

We are indeed using cgo libraries: https://github.com/confluentinc/confluent-kafka-go/ and https://github.com/DataDog/zstd would be involved here but the scope of the captureLatency method that I posted above is not directly calling cgo, it is reproduced in its entirety.

Unfortunately, I don't think we can hand over a binary for this particular service.

Are there any debug steps that we can take on our end to gather more data for you? As I mentioned in the initial post, this service has been running for quite a long time and has only crashed twice with this error, I'm not quite sure how we could reproduce it.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Aug 13, 2019

If you can catch it crash in a debugger, a dump of the memory regions with their permissions would be nice.

Try running on a different machine. We've seen memory corruption issues generate weird bugs before.

Those cgo libraries don't seem particularly dangerous.

@agnivade agnivade added this to the Go1.14 milestone Aug 14, 2019

@kevinconaway

This comment has been minimized.

Copy link
Author

commented Aug 14, 2019

Try running on a different machine

Interesting that you bring this up. We are running this application in our k8s environment but I just looked and both of these pods happened to be scheduled on the same underlying VM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.