Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid double sending of SIGINT to subprocesses #889

Merged
merged 1 commit into from
Jan 14, 2023

Conversation

rdokov
Copy link
Contributor

@rdokov rdokov commented Jan 13, 2023

This addresses an issue when manually interrupting the test suite via Ctrl+C in the shell. In particular we've noticed that this can sometimes leave orphaned web browser processes from our selenium feature specs, which then cause various problems until manually terminated.

Tracing the problem I found that it's caused by subprocesses receiving SIGINT twice thus preventing the graceful shutdown that would normally happen. The fix prevents the shell from sending a SIGINT directly to the subprocesses leaving only one SIGINT from the interrupt handler in the CLI class.

Standard bash behavior when pressing Ctrl+C is to send SIGINT to the
foreground process group. This combines well with standard fork behavior
because subprocesses are started in the same group, so a multi-process
program is correctly terminated without any explicit signal handling.

It however does not work well in cases where signals are handled
explicitly like here, because on Ctrl+C the subprocesses receive two
SIGINTs - one directly from the shell and one from the interrupt handler
in the `CLI` class. This in turn can prevent graceful shutdown in the
subprocesses.

The specific example that prompted this fix was interrupting feature
specs using selenium. In case of a double interrupt the browser process
will sometimes remain running after rspec is terminated.
@grosser grosser merged commit 445ec9e into grosser:master Jan 14, 2023
@grosser
Copy link
Owner

grosser commented Jan 14, 2023

thx, sounds like a good change overall ... a little scared that will trigger other edge-cases, but worth a try! :)

@grosser
Copy link
Owner

grosser commented Jan 15, 2023

4.1.0

@rdokov
Copy link
Contributor Author

rdokov commented Jan 16, 2023

Thanks for the quick response. We've been running with this change internally for about a month, so it's moderately well tested, but we definitely do not cover all possible shell and os combinations :)

@jdelStrother
Copy link

I think I've managed to find one of those edge cases, but I'm struggling to narrow it down. Our parallel tests have started hanging forever if they call sleep 0.1 😬

I've got some code that spawns a subprocess and loops while waiting for it to finish - it boils down to something like:

    pid = spawn("some_command")
    loop do
      _child_pid, status = Process.waitpid2(pid, Process::WNOHANG)
      if status
        puts "OK we're done #{status.inspect}"
        break
      else
        puts "sleeeeepy"
        sleep 0.1
        puts "OK time to do some work"
      end
    end

With parallel_tests 4.1.0, the specs that touch this code hang forever when they hit that sleep 0.1 line.
The same code works fine if I use parallel_tests 4.0.0 or just use rspec without parallelisation.

However, just that code on its own doesn't seem enough to trigger it - only in the context of our much larger test suite. I'll carry on trying to narrow it down tomorrow, but I've already lost a couple of hours to this and starting to lose the will to live... 😭

I'm seeing this on ruby 3.1.2p20 with an M1 running macOS 13.1.
Our Github CI seems unaffected (ubuntu/x86, with a subtly different version of ruby 3.1).

Appreciate this probably sounds crazy, but posting here in the hope that one of you could see some obvious reason this might fail.

@grosser
Copy link
Owner

grosser commented Jan 16, 2023

yeah issues like this suck :(
the pgroup change makes sense to me (and also came with a hard to debug issue)
... I'd rather err on the side of reverting since I expect there will be more issues, and this PR did not have a failing test case either
would be great to have a reproduction or workaround tough, since it might be that the code was bad before and the fix just surfaces the badness ...

@jdelStrother
Copy link

So I think the problem I was seeing is my fault, but it's an interesting edge case at least...

RSpec.describe "WTF" do
  specify do
    puts "#{$$}/#{Process.getpgid($$)} #{$0} start spec"

    src = "./audio.mp3"
    dst = "./audio.aiff"
    args = %W[ffmpeg -i #{src} -y -f aiff #{dst}]
    pid = spawn(*args)
    puts "waitpid(#{pid})"
    Process.waitpid(pid)
    expect($?).to be_success
  end
end

Turns out that ffmpeg allows you to hit 'q' while it's in the middle of transcoding and it'll abort. When my rspec process is the process group leader, that results in the rspec process receiving a TTOU signal, and then it blocks forever - either in waitpid like above, or in the sleep 0.1 call from my original example. I'm still unsure why that causes a hang, would love some explanation if anyone knows.

Being able to interactively interrupt ffmpeg isn't really desirable behaviour for my usecase, so for me this is fixed by adding -nostdin to the ffmpeg arguments.

@rdokov
Copy link
Contributor Author

rdokov commented Jan 17, 2023

@jdelStrother according to the man page the default behaviour for TTOU is to stop the process (i.e. it's the behaviour you would get from Ctrl+Z on the shell, although that's a different signal). So it will indeed result in a what seems like a hang. I don't have a good theory as to why this has been surfaced by the process group change, because even if the signal was hitting a different process I would have expected something to be hanging.

@rdokov
Copy link
Contributor Author

rdokov commented Jan 17, 2023

@grosser if you prefer I should be able to put together a minimal example that shows that the test process incorrectly receives SIGINT twice when manually interrupted from the command line. Then you can treat it as a bug report and maybe find a different way to fix it.

@grosser
Copy link
Owner

grosser commented Jan 17, 2023

a minimal test that fails without the change would be nice so we have something that breaks if it ever gets reverted :)

this TTOU signal seems super sus, but at least the fix is easy, so let's keep this change for now and see if we see more bug reports 🤞

@jdelStrother
Copy link

jdelStrother commented Jan 17, 2023

TTOU signal seems super sus

Yep. I'm still trying to wrap my head around it. I think the answer is something along the lines of:

  • the spawned process (eg ffmpeg) tries to read from stdin is if it were a TTY
  • which, if the process is a background process, sends TTOU
  • TTOU halts the process by default. (You can unblock it by sending CONT signals.)

In 4.0.0, it's not considered a background process, so there's no TTOUs, so no halting.

I'm a little confused about whether rspec is receiving the TTOU directly, or whether the ffmpeg process receives TTOU and that's propagated to the parent somehow.

@rdokov
Copy link
Contributor Author

rdokov commented Jan 20, 2023

I'm having a little trouble writing a spec for this because the problem is actually triggered by the behaviour of Ctrl+C in bash/zsh. You should however be able to reproduce the double SIGINT problem with the following script:

# dummy_spec.rb
count = 0
Signal.trap(:INT) do
  puts 'Received SIGINT'
  count += 1
  exit if count == 2
end
sleep 10

Then what happens when I run it in the shell and press Ctrl+C is the following:

$ ./bin/parallel_rspec dummy_spec.rb 
1 process for 1 spec, ~ 1 spec per process
^CReceived SIGINT
Received SIGINT

While loading ./dummy_spec.rb an `exit` / `raise SystemExit` occurred, RSpec will now quit.
Failure/Error: exit if count == 2

SystemExit:
  exit
# ./dummy_spec.rb:5:in `exit'

The reason I can't easily convert this to a spec is that in the integration tests there is no shell process and the problem is actually triggered by the fact that Ctrl+C in the shell sends a SIGINT to each process of the foreground group. So to write a spec I'll need to wrap the test runner in an extra shell process and I've yet to find a sensible way to do that.

@grosser
Copy link
Owner

grosser commented Feb 4, 2023

possible fix #891
lmk if you see any issue with this ... I'll let it sit for a bit and then release as next version 🤞

@grosser
Copy link
Owner

grosser commented Feb 4, 2023

I was unable to reproduce any way of making it not send the kill to all children :/

@grosser
Copy link
Owner

grosser commented Feb 4, 2023

... got it working now I think

@rdokov
Copy link
Contributor Author

rdokov commented Feb 6, 2023

I can confirm that #891 solves the problem for me.

I'll just point out that the documentation states that Process.getpgid is not available on all platforms. If that means NotImplementedError on non-posix systems such as windows you may need some additional logic. (I only have a linux machine, so can't check)

@grosser
Copy link
Owner

grosser commented Feb 6, 2023

yeah good point, will add

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants