Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster - worker shutdown - use WNOHANG with nil return tests #1741

Merged
merged 2 commits into from
Mar 19, 2019

Conversation

MSP-Greg
Copy link
Member

@MSP-Greg MSP-Greg commented Mar 12, 2019

Ruby 2.6 introduced a change that affects worker shutdown. I believe the change occurred between 'commits' r64316 and r64376 of ruby trunk.

  1. Added code using Process::WNOHANG along with needed logic. Adds worker status (via $?) and total shutdown time to log.

  2. Logged status matches status returned by current code (exit 0) for Ruby 2.5.x and lower.

Co-authored-by: MSP-Greg greg.mpls@gmail.com
Co-authored-by: guilleiguaran guilleiguaran@gmail.com

@MSP-Greg
Copy link
Member Author

force-pushed change - cleaned up sleep logic

@guilleiguaran
Copy link
Contributor

Thanks for this!

I prefer this version over #1738 since I appreciate the extra logs but I'll keep #1738 open just in case if the Puma authors don't want the extra logging.

@MSP-Greg
Copy link
Member Author

@evanphx

See #1674 (comment)

As to the code I added, it's replacing line 40 of:

puma/lib/puma/cluster.rb

Lines 35 to 44 in ca03c52

def stop_workers
log "- Gracefully shutting down workers..."
@workers.each { |x| x.term }
begin
@workers.each { |w| Process.waitpid(w.pid) }
rescue Interrupt
log "! Cancelled waiting for workers"
end
end

  1. Line 40 did not change the @workers array, so my code does not.
  2. Since Process.waitpid(w_pid, Process::WNOHANG) is essentially non-blocking on a running process, I wanted the code to loop thru the whole @workers array, as the original blocked on a running process. Also, I added code to not add a 'sleep' if the all processes were shutdown.
  3. I can remove the logging, but it might be helpful to see the 'shutdown' time?

@MSP-Greg MSP-Greg force-pushed the wait-issue branch 2 times, most recently from 664c877 to d294d63 Compare March 15, 2019 13:42
end
end
sleep 0.5 if wait
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two cycles instead of one.

You can use redo.

See #1738 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexWayfer

Thanks. As mentioned, I did it this way so it would loop thru all the workers on first pass, then go back for the nil returns. I'd rather use Process.waitpid(-1, Process::WNOHANG), as that's a bit cleaner, but it seems to repeatedly fail on trunk JIT, I never tried with 2.6.x JIT...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, I did it this way so it would loop thru all the workers on first pass, then go back for the nil returns.

It's strange algorithm.

Option 1, with redo:

  1. Go through all workers
  2. Sleep and redo on the first fail (wait every first fail)

Option 2:

  1. Go through all workers and attempt to stop them
  2. Delete from array in success, wait after all if any failed
  3. Retry if there are any remained

Option 3, your choice, as I see:

  1. Go through all workers
  2. Delete in success, and retry (go further)
  3. Don't do anything in fail, just sleep in the end
  4. Retry until there are any

I think, the first option is simplest (but may be longest), the second is most optimal, and the third is strange.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the logic to use reject!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the logic to use reject!

You can use loop do instead of while true do. 😉

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I often look here:
https://msp-greg.github.io/ruby_trunk/file.control_expressions.html

loop do isn't mentioned there. Time to review kernel methods again...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, about two cycles (and map): do we need for @workers after successful run of this code?

If no — we can use @workers.reject!.

@MSP-Greg MSP-Greg force-pushed the wait-issue branch 2 times, most recently from 726fe01 to 54e8030 Compare March 16, 2019 00:27
@evanphx
Copy link
Member

evanphx commented Mar 18, 2019

I don't see why we need this. Is Process.wait without WNOHANG fundementally broken on ruby now? This PR just implements a busy polling loop which is unnecessary if a blocking Process.wait works the way it should, which is that it waits until a child is finished.

@MSP-Greg
Copy link
Member Author

MSP-Greg commented Mar 18, 2019

Is Process.wait without WNOHANG fundamentally broken on ruby now?

As I think I've mentioned elsewhere, is it a bug or a breaking change? I tried to create a small repo of it, but being a windows type, I can't fork locally. Couldn't find a way to repo in Windows.

There are a few Ruby tests with waitpid & waitpid2, and also a few specs. I believe these have continued to pass even when Puma started having issues last year with trunk.

busy polling loop

I tried to make it as 'light' as possible, with the minimum number of waitpid calls.

@guilleiguaran
Copy link
Contributor

guilleiguaran commented Mar 18, 2019

@evanphx indeed, it seems to be caused by ruby/ruby@48b6bd7 but doesn't look like it will be fixed anytime soon since the Ruby team isn't sure about how to fix it 😞

Edit: The related issue is https://bugs.ruby-lang.org/issues/15499, but it was closed as resolved even when someone pointed that the patch didn't solve the problem with Puma.

@evanphx
Copy link
Member

evanphx commented Mar 18, 2019

This is absolutely a bug in ruby, not a behavior change. Looks like they attempted to fix it here: ruby/ruby@9e66910

@MSP-Greg
Copy link
Member Author

I could add a conditional on RUBY_VERSION and/or comments about the code being needed for a bug in Ruby 2.6+.

See #1674 (comment)

@evanphx
Copy link
Member

evanphx commented Mar 18, 2019

I'm pretty annoyed that ruby-core has broken a fundamental and simple unix function and hasn't fixed it.

Conditionalizing it on ruby version is fine.

@MSP-Greg
Copy link
Member Author

MSP-Greg commented Mar 19, 2019

Okay with a string (RUBY_VERSION < '2.6'), which will fail on Ruby 2.10?

MSP-Greg and others added 2 commits March 18, 2019 19:09
Ruby 2.6 introduced a bug that affects worker shutdown (waitpid).

Added code using Process::WNOHANG along with needed logic. Adds worker status (via $?) and total shutdown time to log.

Co-authored-by: MSP-Greg <greg.mpls@gmail.com>
Co-authored-by: guilleiguaran <guilleiguaran@gmail.com>
JRuby 9.2.6.0 may have intermittent test failures, leave for now

Co-authored-by: MSP-Greg <greg.mpls@gmail.com>
Co-authored-by: guilleiguaran <guilleiguaran@gmail.com>
@MSP-Greg
Copy link
Member Author

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants