Fix child processes not being reaped when `Process.detach` used #3314

stanhu · 2024-01-10T23:02:16Z

Description

Starting with Puma v6.4.1, we observed that killed Puma cluster workers were never being restarted when the parent was run as PID 1. For example, I issued a `kill 44` and PID 44 remained in the `defunct` state:

git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef

Further investigation showed that the `Process.wait2(-1, Process::WNOHANG)` call introduced in #3255 never appears to return anything when `Process.detach` has been run on some process that has not exited. This bug appears to be present from Ruby 2.6 to 3.2, but has been fixed in Ruby 3.3: https://bugs.ruby-lang.org/issues/19837

Previously `Process.wait(w.pid, Process::WNOHANG)` was called on each known worker PID. #3255 changed this behavior to do this only if the `fork_worker` config parameter is enabled, but it seems that we should always do this to ensure that terminated workers are reaped in a timely manner.
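
As a rough illustration of the per-PID approach described above, here is a minimal sketch. It is not Puma's actual implementation: `reap_known_workers` is a hypothetical helper name, and the polling loop stands in for whatever periodic check the parent process performs.

```ruby
# Hypothetical helper (not Puma's code): poll each known worker PID
# without blocking and return the PIDs that no longer need reaping.
def reap_known_workers(pids)
  pids.select do |pid|
    begin
      # Returns the PID once the child has exited, nil while it is still running.
      Process.wait(pid, Process::WNOHANG)
    rescue Errno::ECHILD
      # The child was already reaped elsewhere, e.g. by a Process.detach thread.
      true
    end
  end
end

# Usage: fork two short-lived children and poll until both are reaped.
pids = 2.times.map { fork { exit 0 } }
reaped = []
until pids.empty?
  done = reap_known_workers(pids)
  reaped.concat(done)
  pids -= done
  sleep 0.05
end
puts "reaped #{reaped.size} workers"
```

Because each call names a specific PID, it does not depend on `Process.wait2(-1, ...)` reporting the exit, which is the call affected by the Ruby bug.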

Closes #3313

Your checklist for this pull request

I have reviewed the guidelines for contributing to this repository.
I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
My pull request is 100 lines added/removed or less so that it can be easily reviewed.
If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
I have updated the documentation accordingly.
All new and existing tests passed, including Rubocop.

MSP-Greg · 2024-01-11T02:01:47Z

@stanhu

Thanks for the PR. LGTM, but I'd like to run some additional tests. Currently knee deep in some dumb s*&(...

dentarg · 2024-01-11T08:50:03Z

Does your latest comment #3313 (comment) change anything for these changes @stanhu?

dentarg · 2024-01-11T08:52:23Z

Also, from your investigation in #3313, it looks like it wouldn't be impossible to add some test (using Docker) for this PR and #3255? Could be something only GitHub Actions run, not part of bundle exec rake. What do you and others (@byroot) think about that? Is it worth it?

stanhu · 2024-01-11T08:53:48Z

@dentarg I'm still investigating how it's possible for the application to interfere with Process.wait2(-1, Process::WNOHANG), presumably by trapping certain signals.

I think this PR can't hurt, and it restores the previous behavior of checking known PIDs in addition to checking child processes.

byroot · 2024-01-11T08:59:35Z

> Could be something only GitHub Actions run, not part of bundle exec rake. What do you and others (@byroot) think about that? Is it worth it?

I think it would be worth it. Note that an alternative to running as PID 1 is to declare the process as a SUBREAPER: https://github.com/Shopify/pitchfork/blob/50f3e3389218e6e82c65638ab3c91f805ec02c4b/ext/pitchfork_http/child_subreaper.h

It's a Linux-only API, but it would allow doing this as part of the main test suite when run on Linux. I made a gem a while ago you could use: https://rubygems.org/gems/child_subreaper
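
For illustration, a process can mark itself as a child subreaper through the Linux `prctl(2)` call. The sketch below uses Ruby's bundled Fiddle library rather than the gem mentioned above, whose API is not shown here. `enable_child_subreaper` is a hypothetical helper name, and the constant value is assumed from `<linux/prctl.h>`.

```ruby
require "fiddle"

# Assumed value of PR_SET_CHILD_SUBREAPER from <linux/prctl.h>.
PR_SET_CHILD_SUBREAPER = 36

# Hypothetical helper: ask the kernel to reparent orphaned descendants to
# this process instead of PID 1. Returns true on success, false otherwise
# (including on non-Linux platforms, where the call is skipped).
def enable_child_subreaper
  return false unless RUBY_PLATFORM.include?("linux")

  libc = Fiddle.dlopen(nil) # symbols already loaded into this process
  prctl = Fiddle::Function.new(
    libc["prctl"],
    [Fiddle::TYPE_INT, Fiddle::TYPE_LONG, Fiddle::TYPE_LONG,
     Fiddle::TYPE_LONG, Fiddle::TYPE_LONG],
    Fiddle::TYPE_INT
  )
  prctl.call(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0).zero?
end

puts "subreaper enabled: #{enable_child_subreaper}"
```

With the flag set, orphaned grandchildren become children of the test process, so it sees the same reaping duties a PID 1 process would without needing a container.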

stanhu · 2024-01-11T17:37:25Z

I've updated this pull request to reflect that it appears that this works around a Ruby 3.1/3.2 bug with Process.detach: #3313 (comment)

I think a test is a good idea, but that will take me longer to get to at the moment.

byroot · 2024-01-12T08:59:08Z

> a Ruby 3.1/3.2 bug with Process.detach

Would be worth reporting upstream. Even if it's fixed in 3.3, it may be worth a backport.

stanhu · 2024-01-12T17:05:51Z

Good idea. I filed this bug: https://bugs.ruby-lang.org/issues/20181

stanhu · 2024-01-12T20:21:00Z

@MSP-Greg @dentarg I've added an integration test that appears to reproduce https://bugs.ruby-lang.org/issues/20181. This test fails in master, but passes in this branch.

stanhu · 2024-01-13T05:44:10Z

It seems this bug was already reported in https://bugs.ruby-lang.org/issues/19837 and fixed in the ruby_3_2 and ruby_3_1 branches, but the fixes aren't in a release yet.

stanhu · 2024-01-18T17:13:07Z

@MSP-Greg Is there anything else I can help with to get this merged?

MSP-Greg · 2024-01-18T17:44:53Z

> Is there anything else I can help with to get this merged?

Do you know how to clone one's self? Sorry.

Soon, like no later than Sunday. I apologize for the delay, 'when it rains, it pours'.

> It seems this bug was already reported

Ruby 3.2.3 is released...

stanhu · 2024-01-26T05:29:17Z

@MSP-Greg Sorry to bother you again, but would you have a moment to review?

MSP-Greg · 2024-01-26T14:34:18Z

@stanhu No problem. Did you see the comment above about line 189 in test_integration_cluster.rb? I tried this without and with the lib patch on several Ruby versions. As you've mentioned, it passes on some 'current' patch releases and 'head', but fails on many older ones.

stanhu · 2024-01-26T17:08:55Z

@MSP-Greg I did not see the comment. Is it published?

MSP-Greg · 2024-01-26T17:41:36Z

> I did not see the comment. Is it published?

Sorry, my mistake. I didn't click 'Publish Review'. Until one does that, the reviewer can see it, but no one else...

Do you see it now?

MSP-Greg · 2024-01-26T18:39:09Z

@stanhu

Thank you for the PR. Sorry for the delay.

…ed (puma#3314)" This reverts commit 9bd838b. Did this start to happen after this commit? Sure looks like that so far https://github.com/dentarg/puma/actions/runs/7709969145/job/21012318760#step:10:43

stanhu · 2024-04-10T14:45:59Z

@nateberkopec Would you mind releasing an update with this? We're blocked on Puma 6.4.0 until this gets shipped.

stanhu · 2024-05-03T00:23:36Z

@nateberkopec Sorry to bother you again. Could you find some time to release a new version of Puma?

stanhu force-pushed the sh-fix-pid1-wait branch 5 times, most recently from 02096b5 to 83f53bd Compare January 10, 2024 23:24

stanhu mentioned this pull request Jan 11, 2024

Puma cluster not reaping child processes when PID is 1 with Puma 6.4.1 #3313

Closed

stanhu force-pushed the sh-fix-pid1-wait branch from 83f53bd to bd08e8a Compare January 11, 2024 17:04

stanhu force-pushed the sh-fix-pid1-wait branch from bd08e8a to b20415f Compare January 12, 2024 20:17

stanhu changed the title ~~Fix child processes not being reaped with PID 1~~ Fix child processes not being reaped when Process.detach used Jan 12, 2024

dentarg added the waiting-for-review Waiting on review from anyone label Jan 14, 2024

mperham mentioned this pull request Jan 15, 2024

Sidekiq swarm not shutting down gracefully on Heroku sidekiq/sidekiq#5956

Closed

stanhu force-pushed the sh-fix-pid1-wait branch from b20415f to 883b630 Compare January 16, 2024 16:51

MSP-Greg reviewed Jan 26, 2024

Add integration test for Puma worker reaping

d4ac708

This test ensures that Puma handles the `Process.detach` bug described in https://bugs.ruby-lang.org/issues/19837.

stanhu force-pushed the sh-fix-pid1-wait branch from 883b630 to d4ac708 Compare January 26, 2024 17:59

MSP-Greg merged commit 9bd838b into puma:master Jan 26, 2024
65 of 71 checks passed

MSP-Greg added bug and removed waiting-for-review Waiting on review from anyone labels Jan 31, 2024
