Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: unexpected SIGURG in TestCgoCallbackGC on openbsd-386-62 builder #36996

Closed
bcmills opened this issue Feb 3, 2020 · 11 comments
Closed

runtime: unexpected SIGURG in TestCgoCallbackGC on openbsd-386-62 builder #36996

bcmills opened this issue Feb 3, 2020 · 11 comments

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Feb 3, 2020

--- FAIL: TestCgoCallbackGC (13.52s)
    crash_test.go:95: testprogcgo CgoCallbackGC exit status: signal: urgent I/O condition
    crash_cgo_test.go:70: expected "OK\n", but got:
FAIL
FAIL	runtime	86.188s

2020-01-31T23:32:02-53558cb/openbsd-386-62

Given the SIGURG, I'm guessing this is related to the 1.14 preemption changes.
Is this a bug in the runtime, or does the test need to be updated to account for the preemption signals?

@aclements @mknyszek @ianlancetaylor

@bcmills bcmills added this to the Go1.14 milestone Feb 3, 2020
@bcmills
Copy link
Member Author

@bcmills bcmills commented Feb 3, 2020

Marking as release-blocker until we understand the impact.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 3, 2020

OpenBSD is not a first-class port per https://golang.org/wiki/PortingPolicy, so I don't think an occasional failure on OpenBSD counts as a release blocker.

But of course it would be nice to understand this.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Feb 3, 2020

Since we only have N=1 test failures, I don't think we have enough evidence to say whether the issue is OpenBSD-specific at this point. (And even with a larger N, sometimes timing or other variations can cause an issue that affects multiple platforms to only manifest on one specific builder.)

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 3, 2020

I'm 99% sure it's a problem with the test. We should be blocking signal before creating the new C thread.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 3, 2020

Although that does suggest a possible problem for Go programs that call C code that creates new threads. Hmmm. We aren't seeing problems on GNU/Linux. I may be mistaken.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 3, 2020

I note that on GNU/Linux go test -test.count=100 -test.run=TestCgoCallbackGC runtime runs in about 8 seconds. On the openbsd-386-62 gomote it takes about 90 seconds.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 3, 2020

This is strange. The process is dying of a SIGURG signal. But the default action for a SIGURG signal is to ignore the signal.

So far I've run the test several thousand times under ktrace on the openbsd-386-62 gomote. I was able to get one failure. Unfortunately, the gomote then crashed before I could look at all the data. The gomote continues to crash periodically, forcing me to rebuild everything before I can do more testing. I'm still trying.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 3, 2020

I think I see the problem.

This test case has a C thread that calls into Go code. The scheduler will sometimes decide to preempt that Go code. When that happens, the scheduler will send a SIGURG signal to the thread. That signal will only be sent if the scheduler believes that the thread is executing Go code.

However, it possible that by the time the signal is delivered, the thread is no longer executing Go code. In that case, the thread-local 'g' variable will be nil. The Go runtime will have installed a signal handler for SIGURG, which will call sigtrampgo. That function will fetch g, find that it is nil, and then call badsignal. The badsignal function will call raisebadsignal to raise a SIGURG signal for the thread.

All those steps can happen on any system. Normally, it causes a bit of extra work but no actual harm, as SIGURG is ignored by default.

However, the OpenBSD 6.2 kernel appears to have a race condition. If thread 1 changes the SIGURG signal to SIG_DFL while thread 2 sends a SIGURG to thread 3, it appears that the program can fail as though it were killed by a SIGURG signal, even though that is supposed to be impossible.

This program shows the problem. It exits with 0 on GNU/Linux, and with 144 (SIGURG) on OpenBSD 6.2 (at least on 386 OpenBSD 6.2). The program does not fail with SIGURG on OpenBSD 6.4, either 386 or amd64, so it is likely that the race condition, if there was one, has been fixed.

#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Die with an error message.  */
static int die(const char *s) {
  perror(s);
  exit(EXIT_FAILURE);
}

/* Signal handler.  */
static void handler(int sig, siginfo_t* info, void *ctxt) {
  sigset_t set;
  struct sigaction sa;
  struct timespec ts;

  /* Set the signal handler to the default handler.  */
  memset(&sa, 0, sizeof sa);
  sa.sa_handler = SIG_DFL;
  if (sigfillset(&sa.sa_mask) < 0) {
    die("sigfillset");
  }
  sa.sa_flags = SA_SIGINFO | SA_ONSTACK | SA_RESTART;
  if (sigaction(sig, &sa, NULL) < 0) {
    die("sigaction");
  }

  /* Unblock the signal.  */
  if (sigemptyset(&set) < 0) {
    die("sigemptyset");
  }
  if (sigaddset(&set, sig) < 0) {
    die("sigaddset");
  }
  if (sigprocmask(SIG_UNBLOCK, &set, NULL) < 0) {
    die("sigprocmask");
  }

  /* Sleep for ten microseconds.  */
  memset(&ts, 0, sizeof ts);
  ts.tv_nsec = 10000;
  nanosleep(&ts, NULL);

  /* Block the signal.  */
  if (sigprocmask(SIG_BLOCK, &set, NULL) < 0) {
    die("sigprocmask");
  }

  /* Set the signal handler back to this handler.  */
  sa.sa_sigaction = handler;
  if (sigaction(sig, &sa, NULL) < 0) {
    die("sigaction");
  }
}

/* Thread that just waits for a signal.  */
static void *thread_waiter(void *arg) {
  sigset_t mask;

  if (sigemptyset(&mask) < 0) {
    die("sigemptyset");
  }
  while (1) {
    sigsuspend(&mask);
  }
  return NULL;
}

/* Thread that sends regular SIGURG signal to the tid passed in arg.  */
static void *thread_sender(void *arg) {
  pthread_t* ptid;
  int i;

  ptid = (pthread_t*)arg;
  for (i = 0; i < 10000000; i++) {
    int err;

    err = pthread_kill(ptid[0], SIGURG);
    if (err != 0) {
      errno = err;
      die("pthread_kill");
    }

    err = pthread_kill(ptid[1], SIGURG);
    if (err != 0) {
      errno = err;
      die("pthread_kill");
    }
  }
  return NULL;
}

int main() {
  struct sigaction sa;
  pthread_t waiter;
  pthread_t sendto[2];
  pthread_t sender;
  int err;
  struct timespec ts;

  /* Arrange to set up a signal handler for SIGURG.  */
  memset(&sa, 0, sizeof sa);
  sa.sa_sigaction = handler;
  if (sigfillset(&sa.sa_mask) < 0) {
    die("sigfillset");
  }
  sa.sa_flags = SA_SIGINFO | SA_ONSTACK | SA_RESTART;
  if (sigaction(SIGURG, &sa, NULL) < 0) {
    die("sigaction");
  }

  /* Start the waiter thread.  */
  err = pthread_create(&waiter, NULL, thread_waiter, NULL);
  if (err != 0) {
    errno = err;
    die("pthread_create");
  }

  /* Start the sender thread.  */
  sendto[0] = pthread_self();
  sendto[1] = waiter;
  err = pthread_create(&sender, NULL, thread_sender, (void *)(&sendto[0]));
  if (err != 0) {
    errno = err;
    die("pthread_create");
  }

  /* Run for five seconds.  */
  memset(&ts, 0, sizeof ts);
  ts.tv_sec = 6;
  while (ts.tv_sec > 0) {
    if (nanosleep(&ts, &ts) == 0) {
      break;
    }
  }

  return 0;
}

I think the fix is going to be to avoid raising SIGURG as a bad signal.

@josharian
Copy link
Contributor

@josharian josharian commented Feb 3, 2020

cc @mdempsky because for keywords "openbsd kernel"

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 3, 2020

Change https://golang.org/cl/217617 mentions this issue: runtime: don't treat SIGURG as a bad signal

@esote
Copy link

@esote esote commented Feb 4, 2020

If this is 6.2-specific then #35712 is related. Switching to only have builders with the supported OpenBSD releases as @4a6f656c suggests, may resolve this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.