Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: stalls under Rosetta 2 #42700

Closed
FiloSottile opened this issue Nov 18, 2020 · 30 comments
Closed

runtime: stalls under Rosetta 2 #42700

FiloSottile opened this issue Nov 18, 2020 · 30 comments

Comments

@FiloSottile
Copy link
Contributor

@FiloSottile FiloSottile commented Nov 18, 2020

What version of Go are you using (go version)?

$ go version
go version devel +041a4e4c34 Tue Nov 17 22:57:34 2020 +0000 darwin/amd64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/filippo/Library/Caches/go-build"
GOENV="/Users/filippo/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/filippo/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/filippo"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/filippo/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/filippo/go/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="devel +041a4e4c34 Tue Nov 17 22:57:34 2020 +0000"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/jh/3ydm4lxd71s2__g_x4hny6r00000gn/T/go-build062328339=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

While trying to run make.bash under Rosetta 2, I noticed high-CPU endless stalls in link, asm and compile. Without GOMAXPROCS=1, make.bash never gets to finish; with it, it works more often than not, but not always.

Once, I got this printed to my terminal during a stall

rosetta error: ulock_wait failure: 105

I have no reason to think this is specific to make.bash, they are probably just the longest-running things I ran so far.

/cc @cherrymui

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 18, 2020

Try GODEBUG=asyncpreemptoff=1 and don't use signals in general. I think Rosetta 2's signal handling may be buggy (I have a list of different crashes, including the emulator internal errors).

@FiloSottile
Copy link
Contributor Author

@FiloSottile FiloSottile commented Nov 18, 2020

Indeed, with GODEBUG=asyncpreemptoff=1 I am not observing the stalls anymore, thank you! I'll leave this open to figure out if there's anything we should do for Rosetta 2 compatibility.

@zwarich
Copy link

@zwarich zwarich commented Nov 18, 2020

I think Rosetta 2's signal handling may be buggy (I have a list of different crashes, including the emulator internal errors).

@cherrymui, could you share a gist of any Rosetta error messages you've seen? I'm the main developer of Rosetta 2, and I would be interested in seeing them.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 18, 2020

@zwarich thanks for reaching out!

One of the internal error is

assertion failed [abi_info.kind == AbiKind::TranslatedCode]: emulated forward to an arm pc that isn't in translated code. arm_pc=0x1020ad4e8 abi_kind=6 emulation_interval=[0x1021e13c0,0x1021e13d0) instruction_interval=[0x1021e13b4, 0x1021e13d0) x86_rip=0x1222ffc

I'll see if I can find out others. Thanks.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 18, 2020

Here is another one

oah error: unexpectedly need to EmulateForward on a synchronous exception x86_rip=0x404421e arm_pc=0x10472e9bc num_insts=6 inst_index=4 x86 instruction bytes: 0xc3940f045ab10ff0 0x9e3d83487175db84

This program includes sending itself a SIGSEGV using kill.

Others more likely materialize as program failures, e.g. hanging, or memory corruption. I'll see if I can reproduce them with C code. Thanks.

@zwarich
Copy link

@zwarich zwarich commented Nov 19, 2020

@cherrymui, is the program that sends itself a SIGSEGV publicly available?

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 19, 2020

Yes, all programs I mentioned are publicly available. They are mostly Go programs. If you're okay with building and running Go programs, I can attach instructions here. Or if you prefer, I can attach binaries.

I'm still working on C producers.

@zwarich
Copy link

@zwarich zwarich commented Nov 19, 2020

@cherrymui, instructions would be greatly appreciated.

@zwarich
Copy link

@zwarich zwarich commented Nov 19, 2020

Actually, I think we can now reproduce all of the issues mentioned here.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 19, 2020

@zwarich that's great!

I'm about to send you the instruction for running Go programs. So I'll do it anyway.

  1. install Go. The easiest way is to download the darwin-amd64 binary distribution from https://golang.org/dl/ . You can also do it with other methods mentioned in https://golang.org/doc/install and https://golang.org/doc/install/source
  2. suppose GOROOT is the root of your Go installation, do
cd GOROOT/src/runtime/testdata/testprogcgo
GOROOT/bin/go build -o /tmp/test
/tmp/test SegvInCgo

This is the program that sends SIGSEGV to itself. This program is expected to crash due to SIGSEGV with a stack dump (but not emulator crash :). This does not always fail. In fact, it fails in a fairly low rate.

I also see another error for this program:

assertion failed [!is_synchronous_signal(sig)]: can't pend synchronous signals
        (ThreadContextSignals.cpp:234 pend_signal)

This is somewhat more likely to happen that the one mentioned previously.

I managed to reproduce the last one with a C program:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <pthread.h>

void e() {}

void handler(int sig, siginfo_t *info, ucontext_t *uap)
{
	// inject a call to e
	uintptr_t pc = uap->uc_mcontext->__ss.__rip;
	uintptr_t sp = uap->uc_mcontext->__ss.__rsp;
	sp -= 8;
	*(uintptr_t*)sp = pc;
	uap->uc_mcontext->__ss.__rsp = sp;
	uap->uc_mcontext->__ss.__rip = (uintptr_t)e;
}

void setup_handler()
{
	struct sigaction sa = {(void*)&handler, -1, SA_SIGINFO | SA_ONSTACK | SA_RESTART};
	stack_t sigstk;
	if ((sigstk.ss_sp = malloc(SIGSTKSZ)) == NULL)
		exit(1);
	sigstk.ss_size = SIGSTKSZ;
	sigstk.ss_flags = 0;
	if (sigaltstack(&sigstk,0) < 0)
		exit(2);

	sigaction(SIGSEGV, &sa, 0);
	sigaction(SIGURG, &sa, 0);
}

void *thr(void *a)
{
	setup_handler();
	while (1);
}

int main()
{
	pid_t pid = getpid();
	int i, j;
	pthread_t tid[10];

	setup_handler();

	for (i = 0; i < 10; i++)
		pthread_create(&tid[i], 0, thr, 0);

	e();
	usleep(1000);
	for (j = 0; j < 10; j++)
		for (i = 0; i < 10; i++) {
			pthread_kill(tid[i], SIGURG);
			pthread_kill(tid[i], SIGSEGV);
		}
	usleep(1000000);
	return 0;
}

This program, running under Rosetta 2, is likely to emit the error mentioned above, or hang.

The call injection part is a bit hacky under C ABI, but it works okay given what e() compiles to. And it somewhat mimics what the Go runtime does.

Thanks!

@tmm1
Copy link
Contributor

@tmm1 tmm1 commented Nov 19, 2020

Seeing something similar as well:

assertion failed [abi_info.kind == AbiKind::TranslatedCode]: emulated forward to an arm pc that isn't in translated code. arm_pc=0x10686e4d8 abi_kind=6 emulation_interval=[0x106ea790c,0x106ea7920) instruction_interval=[0x106ea78f8, 0x106ea7920) x86_rip=0x4bddb6d
(ThreadContextRegisterState.cpp:677 move_to_instruction_boundary)

@tmm1
Copy link
Contributor

@tmm1 tmm1 commented Nov 19, 2020

Is there a way to change runtime.asyncpreemptoff once the process has already started? If I use os.Setenv("GODEBUG", "asyncpreemptoff=1") during init() will that work, or is it too late already.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 19, 2020

Is there a way to change runtime.asyncpreemptoff once the process has already started?

No. This is a temporary workaround and not a long-term solution.

(You can re-exec the process with the environment variable set, if that counts.)

@tmm1
Copy link
Contributor

@tmm1 tmm1 commented Nov 20, 2020

(You can re-exec the process with the environment variable set, if that counts.)

Thanks for this suggestion. I needed a quick workaround and this fit the bill.

The simplest way I've found to check if a process is running under Rosetta 2:

if v, _ := syscall.SysctlUint32("sysctl.proc_translated"); v == 1 {
    // Rosetta 2
}

@Developer-Ecosystem-Engineering

Can you see if this is still happening in beta build 20C5061b, released today?

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 4, 2020

I updated the DTK to 20C5061b. I tried the the C reproducer above and the Go program that sends itself SIGSEGV, and they no longer fail. And I didn't see the assertion failures I saw before. Thanks!

However, when running Go build and testsuite I got a different assertion failure:

assertion failed: unexpectedly received EXC_BAD_INSTRUCTION
(ExceptionServer.cpp:598 translate_exception_types)

Also, I still see hanging processes occasionally. I sent a SIGQUIT to a hanging process (which typically triggers a stack dump for Go), and I got

assertion failed [*ret_index + 2 == num_insts]: RET should be second last instruction in translation
(StateRecovery.cpp:304 check_cfg)

And

rosetta error: no code fragment associated with the given arm pc

(These are separate cases.)

Thanks.

@Developer-Ecosystem-Engineering
Copy link

@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering commented Dec 4, 2020

Agreed, different but potentially related. Investigating, thanks!

@directionless
Copy link

@directionless directionless commented Dec 6, 2020

I see hanging processes, and often with a assertion failed [*ret_index + 2 == num_insts]: RET should be second last instruction in translation It is in public code, (in https://github.com/kolide/launcher/) but I haven't been able to extract it into a simple test case.

david9991 added a commit to david9991/remote-kubernetes that referenced this issue Dec 8, 2020
@zosmac
Copy link

@zosmac zosmac commented Dec 9, 2020

I observe sporadically in a golang program now running on an M1 Air:

assertion failed [*ret_index + 2 == num_insts]: RET should be second last instruction in translation
(StateRecovery.cpp:304 check_cfg)

Subsequently, the program enters a loop, with process CPU at 100%, requiring kill to terminate.

Setting GODEBUG=asyncpreemptoff=1 appears to address this, and I do not see the assertion failure after several days of running program (it is a server).

In addition to running as a HTTP server, the program also issues HTTP calls to another server. I do not know if net connections are subject to SIGURG signals. Also, the program relies on kqueue/kevents for notifications about process changes, as well as an FSEvents callback for file system updates and the local OSLogStore for tracking logging events. Perhaps the async preemption of go routines is confusing network activity or some of the host C API calls. Perhaps some of the native layers are already ARM, and some transition to/from translation mode is causing an issue?

No issues are seen on an Intel Macbook, where this program has been installed and run regularly for several years. Perhaps with ARM support in Go 1.16, Rosetta will not be required?

Thank you.

@zwarich
Copy link

@zwarich zwarich commented Dec 9, 2020

@directionless @zosmac That issue will be fixed in an upcoming version of Rosetta. I will post here when a beta or release is available.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 9, 2020

Perhaps with ARM support in Go 1.16, Rosetta will not be required?

@zosmac yes. You can try that with Go tip (the master branch).

@zosmac
Copy link

@zosmac zosmac commented Dec 10, 2020

Thank you both. And perhaps another useful datapoint:

I run my golang program using sudo to give it access to some root owned resources. I added a fcntl(0, F_GETOWN) to see where SIGURG handling was assigned and discovered the handler is assigned to the process group established by sudo. I redefined my program as root owned and enabled setuid so it could run standalone. It has been running now for several hours without encountering the assertion failure. I did not set GODEBUG=asyncpreemptoff=1.

I will continue investigating go with arm support.

@cfstras
Copy link

@cfstras cfstras commented Dec 10, 2020

@zosmac I managed to bootstrap an arm version using the above workarounds with rosetta. It's nothing special, see my notes: https://gist.github.com/cfstras/c7e2d537114eb3f640b4c4c3cd5c0809#go-golang

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 10, 2020

@cfstras I looked at your link above. After you get go-darwin-arm64-bootstrap, I think it is better to do a full build of the ARM64 toolchain using go-darwin-arm64-bootstrap as the bootstrap toolchain. Something linke

cd go/src
GOROOT_BOOTSTRAP=/path/to/go-darwin-arm64-bootstrap ./make.bash

The bootstrap toolchain is, although mostly functional, special purposed for bootstrapping. It is not as complete as full build.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 11, 2020

I updated the DTK to 11.1 20C69, and I still see hanging processes, and still see rosetta errors when sending SIGQUIT to hanging processes

rosetta error: no code fragment associated with the given arm pc

and

assertion failed: unexpectedly received EXC_BAD_INSTRUCTION
(ExceptionServer.cpp:598 translate_exception_types)

@ziqiqi
Copy link

@ziqiqi ziqiqi commented Dec 16, 2020

@zwarich I also had the below two error messages with Java:

rosetta error: no code fragment associated with the given arm pc

assertion failed: unexpectedly received EXC_BAD_INSTRUCTION
(ExceptionServer.cpp:598 translate_exception_types)

Tried different JDK versions, but the issue persists. I'm using M1 MacBook Pro. The same program runs just fine on Windows.

@Developer-Ecosystem-Engineering
Copy link

@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering commented Dec 16, 2020

There are additional issues under investigation beyond the original issue reported, which should be resolved in macOS 11.1. This particular issue has evolved into various different problems and errors. It is helpful if issues are treated independently with their own Feedback, as their root cause and fixes may be different.

#⁠1

rosetta error: ulock_wait failure: 105

#⁠2

assertion failed [abi_info.kind == AbiKind::TranslatedCode]: emulated forward to an arm pc that isn't in translated code. arm_pc=0x1020ad4e8 abi_kind=6 emulation_interval=[0x1021e13c0,0x1021e13d0) instruction_interval=[0x1021e13b4, 0x1021e13d0) x86_rip=0x1222ffc

#⁠3

oah error: unexpectedly need to EmulateForward on a synchronous exception x86_rip=0x404421e arm_pc=0x10472e9bc num_insts=6 inst_index=4 x86 instruction bytes: 0xc3940f045ab10ff0 0x9e3d83487175db84

#⁠4

assertion failed [!is_synchronous_signal(sig)]: can't pend synchronous signals
        (ThreadContextSignals.cpp:234 pend_signal)

#⁠5

assertion failed [abi_info.kind == AbiKind::TranslatedCode]: emulated forward to an arm pc that isn't in translated code. arm_pc=0x10686e4d8 abi_kind=6 emulation_interval=[0x106ea790c,0x106ea7920) instruction_interval=[0x106ea78f8, 0x106ea7920) x86_rip=0x4bddb6d
(ThreadContextRegisterState.cpp:677 move_to_instruction_boundary)

#⁠6

assertion failed: unexpectedly received EXC_BAD_INSTRUCTION
(ExceptionServer.cpp:598 translate_exception_types)

#⁠7
rosetta error: no code fragment associated with the given arm pc

Issues #⁠1-#⁠5 should be verified on 20C69 (macOS 11.1) Issues #⁠6 and #⁠7 you should confirm if they are still happening in beta build 20D5029f, available today.

@zwarich
Copy link

@zwarich zwarich commented Dec 16, 2020

I believe that 6) should be fixed in 20D5029f. Additionally, 7) is mostly fixed, but still occurs very rarely, which impedes debugging. It is a fairly generic assertion, which can have many causes. If you find a way to reproduce 7) where the elapsed time is measured in minutes rather than hours, please let me know.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 18, 2020

Having updated to 20D5029f, I'm able to run all.bash under Rosetta 10 times in a row without hanging or assertion failure. Thank you for the fix!

I'll let you know if I can reproduce 7.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Feb 21, 2021

This should be fixed in macOS 11.2. Thanks @zwarich , @Developer-Ecosystem-Engineering !

Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
10 participants