Avoid double sending of SIGINT to subprocesses #889

rdokov · 2023-01-13T18:07:55Z

This addresses an issue when manually interrupting the test suite via Ctrl+C in the shell. In particular we've noticed that this can sometimes leave orphaned web browser processes from our selenium feature specs, which then cause various problems until manually terminated.

Tracing the problem I found that it's caused by subprocesses receiving SIGINT twice thus preventing the graceful shutdown that would normally happen. The fix prevents the shell from sending a SIGINT directly to the subprocesses leaving only one SIGINT from the interrupt handler in the CLI class.

Standard bash behavior when pressing Ctrl+C is to send SIGINT to the foreground process group. This combines well with standard fork behavior because subprocesses are started in the same group, so a multi-process program is correctly terminated without any explicit signal handling. It however does not work well in cases where signals are handled explicitly like here, because on Ctrl+C the subprocesses receive two SIGINTs - one directly from the shell and one from the interrupt handler in the `CLI` class. This in turn can prevent graceful shutdown in the subprocesses. The specific example that prompted this fix was interrupting feature specs using selenium. In case of a double interrupt the browser process will sometimes remain running after rspec is terminated.

grosser · 2023-01-14T23:38:24Z

thx, sounds like a good change overall ... a little scared that will trigger other edge-cases, but worth a try! :)

grosser · 2023-01-15T00:55:06Z

4.1.0

rdokov · 2023-01-16T11:10:11Z

Thanks for the quick response. We've been running with this change internally for about a month, so it's moderately well tested, but we definitely do not cover all possible shell and os combinations :)

jdelStrother · 2023-01-16T19:56:27Z

I think I've managed to find one of those edge cases, but I'm struggling to narrow it down. Our parallel tests have started hanging forever if they call sleep 0.1 😬

I've got some code that spawns a subprocess and loops while waiting for it to finish - it boils down to something like:

    pid = spawn("some_command")
    loop do
      _child_pid, status = Process.waitpid2(pid, Process::WNOHANG)
      if status
        puts "OK we're done #{status.inspect}"
        break
      else
        puts "sleeeeepy"
        sleep 0.1
        puts "OK time to do some work"
      end
    end

With parallel_tests 4.1.0, the specs that touch this code hang forever when they hit that sleep 0.1 line.
The same code works fine if I use parallel_tests 4.0.0 or just use rspec without parallelisation.

However, just that code on its own doesn't seem enough to trigger it - only in the context of our much larger test suite. I'll carry on trying to narrow it down tomorrow, but I've already lost a couple of hours to this and starting to lose the will to live... 😭

I'm seeing this on ruby 3.1.2p20 with an M1 running macOS 13.1.
Our Github CI seems unaffected (ubuntu/x86, with a subtly different version of ruby 3.1).

Appreciate this probably sounds crazy, but posting here in the hope that one of you could see some obvious reason this might fail.

grosser · 2023-01-16T23:24:00Z

yeah issues like this suck :(
the pgroup change makes sense to me (and also came with a hard to debug issue)
... I'd rather err on the side of reverting since I expect there will be more issues, and this PR did not have a failing test case either
would be great to have a reproduction or workaround tough, since it might be that the code was bad before and the fix just surfaces the badness ...

jdelStrother · 2023-01-17T11:57:40Z

So I think the problem I was seeing is my fault, but it's an interesting edge case at least...

RSpec.describe "WTF" do
  specify do
    puts "#{$$}/#{Process.getpgid($$)} #{$0} start spec"

    src = "./audio.mp3"
    dst = "./audio.aiff"
    args = %W[ffmpeg -i #{src} -y -f aiff #{dst}]
    pid = spawn(*args)
    puts "waitpid(#{pid})"
    Process.waitpid(pid)
    expect($?).to be_success
  end
end

Turns out that ffmpeg allows you to hit 'q' while it's in the middle of transcoding and it'll abort. When my rspec process is the process group leader, that results in the rspec process receiving a TTOU signal, and then it blocks forever - either in waitpid like above, or in the sleep 0.1 call from my original example. I'm still unsure why that causes a hang, would love some explanation if anyone knows.

Being able to interactively interrupt ffmpeg isn't really desirable behaviour for my usecase, so for me this is fixed by adding -nostdin to the ffmpeg arguments.

rdokov · 2023-01-17T12:14:55Z

@jdelStrother according to the man page the default behaviour for TTOU is to stop the process (i.e. it's the behaviour you would get from Ctrl+Z on the shell, although that's a different signal). So it will indeed result in a what seems like a hang. I don't have a good theory as to why this has been surfaced by the process group change, because even if the signal was hitting a different process I would have expected something to be hanging.

rdokov · 2023-01-17T15:55:22Z

@grosser if you prefer I should be able to put together a minimal example that shows that the test process incorrectly receives SIGINT twice when manually interrupted from the command line. Then you can treat it as a bug report and maybe find a different way to fix it.

grosser · 2023-01-17T17:16:41Z

a minimal test that fails without the change would be nice so we have something that breaks if it ever gets reverted :)

this TTOU signal seems super sus, but at least the fix is easy, so let's keep this change for now and see if we see more bug reports 🤞

jdelStrother · 2023-01-17T17:36:22Z

TTOU signal seems super sus

Yep. I'm still trying to wrap my head around it. I think the answer is something along the lines of:

the spawned process (eg ffmpeg) tries to read from stdin is if it were a TTY
which, if the process is a background process, sends TTOU
TTOU halts the process by default. (You can unblock it by sending CONT signals.)

In 4.0.0, it's not considered a background process, so there's no TTOUs, so no halting.

I'm a little confused about whether rspec is receiving the TTOU directly, or whether the ffmpeg process receives TTOU and that's propagated to the parent somehow.

rdokov · 2023-01-20T11:21:19Z

I'm having a little trouble writing a spec for this because the problem is actually triggered by the behaviour of Ctrl+C in bash/zsh. You should however be able to reproduce the double SIGINT problem with the following script:

# dummy_spec.rb
count = 0
Signal.trap(:INT) do
  puts 'Received SIGINT'
  count += 1
  exit if count == 2
end
sleep 10

Then what happens when I run it in the shell and press Ctrl+C is the following:

$ ./bin/parallel_rspec dummy_spec.rb 
1 process for 1 spec, ~ 1 spec per process
^CReceived SIGINT
Received SIGINT

While loading ./dummy_spec.rb an `exit` / `raise SystemExit` occurred, RSpec will now quit.
Failure/Error: exit if count == 2

SystemExit:
  exit
# ./dummy_spec.rb:5:in `exit'

The reason I can't easily convert this to a spec is that in the integration tests there is no shell process and the problem is actually triggered by the fact that Ctrl+C in the shell sends a SIGINT to each process of the foreground group. So to write a spec I'll need to wrap the test runner in an extra shell process and I've yet to find a sensible way to do that.

grosser · 2023-02-04T17:40:38Z

possible fix #891
lmk if you see any issue with this ... I'll let it sit for a bit and then release as next version 🤞

grosser · 2023-02-04T17:42:03Z

I was unable to reproduce any way of making it not send the kill to all children :/

grosser · 2023-02-04T18:52:46Z

... got it working now I think

rdokov · 2023-02-06T11:05:05Z

I can confirm that #891 solves the problem for me.

I'll just point out that the documentation states that Process.getpgid is not available on all platforms. If that means NotImplementedError on non-posix systems such as windows you may need some additional logic. (I only have a linux machine, so can't check)

grosser · 2023-02-06T22:36:15Z

yeah good point, will add

rdokov force-pushed the fix-double-sigint branch from a19e97f to b4c238b Compare January 13, 2023 18:18

grosser merged commit 445ec9e into grosser:master Jan 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid double sending of SIGINT to subprocesses #889

Avoid double sending of SIGINT to subprocesses #889

rdokov commented Jan 13, 2023

grosser commented Jan 14, 2023

grosser commented Jan 15, 2023

rdokov commented Jan 16, 2023

jdelStrother commented Jan 16, 2023

grosser commented Jan 16, 2023

jdelStrother commented Jan 17, 2023

rdokov commented Jan 17, 2023

rdokov commented Jan 17, 2023

grosser commented Jan 17, 2023

jdelStrother commented Jan 17, 2023 •

edited

Loading

rdokov commented Jan 20, 2023 •

edited

Loading

grosser commented Feb 4, 2023

grosser commented Feb 4, 2023

grosser commented Feb 4, 2023

rdokov commented Feb 6, 2023

grosser commented Feb 6, 2023

Avoid double sending of SIGINT to subprocesses #889

Avoid double sending of SIGINT to subprocesses #889

Conversation

rdokov commented Jan 13, 2023

grosser commented Jan 14, 2023

grosser commented Jan 15, 2023

rdokov commented Jan 16, 2023

jdelStrother commented Jan 16, 2023

grosser commented Jan 16, 2023

jdelStrother commented Jan 17, 2023

rdokov commented Jan 17, 2023

rdokov commented Jan 17, 2023

grosser commented Jan 17, 2023

jdelStrother commented Jan 17, 2023 • edited Loading

rdokov commented Jan 20, 2023 • edited Loading

grosser commented Feb 4, 2023

grosser commented Feb 4, 2023

grosser commented Feb 4, 2023

rdokov commented Feb 6, 2023

grosser commented Feb 6, 2023

jdelStrother commented Jan 17, 2023 •

edited

Loading

rdokov commented Jan 20, 2023 •

edited

Loading