Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix child processes not being reaped when Process.detach used #3314

Merged
merged 2 commits into from Jan 26, 2024

Conversation

stanhu
Copy link
Contributor

@stanhu stanhu commented Jan 10, 2024

Description

Starting with Puma v6.4.1, we observed that killed Puma cluster workers were never being restarted when the parent was run as PID 1. For example, I issued a kill 44 and PID 44 remained in the defunct state:

git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef

Further investigation showed that the introduction of Process.wait2(-1, Process::WNOHANG) in #3255 never appears to return anything when Process.detach is run on some process that has not exited. This bug appears to be present from Ruby 2.6 to 3.2, but has been been fixed in Ruby 3.3: https://bugs.ruby-lang.org/issues/19837

Previously Process.wait(w.pid, Process::WNOHANG) was called on each known worker PID. #3255 changed this behavior to do this only if the fork_worker config parameter were enabled, but it seems that we should always do this to ensure that terminated workers are reaped in a timely manner.

Closes #3313

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@stanhu stanhu force-pushed the sh-fix-pid1-wait branch 5 times, most recently from 02096b5 to 83f53bd Compare January 10, 2024 23:24
@MSP-Greg
Copy link
Member

@stanhu

Thanks for the PR. LGTM, but I'd like to run some additional tests. Currently knee deep in some dumb s*&(...

@dentarg
Copy link
Member

dentarg commented Jan 11, 2024

Does your latest comment #3313 (comment) change anything for these changes @stanhu?

@dentarg
Copy link
Member

dentarg commented Jan 11, 2024

Also, from your investigation in #3313, it looks like it wouldn't be impossible to add some test (using Docker) for this PR and #3255? Could be something only GitHub Actions run, not part of bundle exec rake. What do you and others (@byroot) think about that? Is it worth it?

@stanhu
Copy link
Contributor Author

stanhu commented Jan 11, 2024

@dentarg I'm still investigating how it's possible for the application to interfere with Process.wait2(-1, Process::WNOHANG), presumably by trapping certain signals.

I think this PR can't hurt, and it restores the previous behavior of checking known PIDs in addition to checking child processes.

@byroot
Copy link
Contributor

byroot commented Jan 11, 2024

Could be something only GitHub Actions run, not part of bundle exec rake. What do you and others (@byroot) think about that? Is it worth it?

I think it would be worth it. Note that an alternative to running as PID 1 is to declare the process as a SUBREAPER: https://github.com/Shopify/pitchfork/blob/50f3e3389218e6e82c65638ab3c91f805ec02c4b/ext/pitchfork_http/child_subreaper.h

It's a Linux only API, but would allow to do it as part of the main test suite when ran on Linux. I made a gem a while ago you could use: https://rubygems.org/gems/child_subreaper

@stanhu
Copy link
Contributor Author

stanhu commented Jan 11, 2024

I've updated this pull request to reflect that it appears that this works around a Ruby 3.1/3.2 bug with Process.detach: #3313 (comment)

I think a test is a good idea, but that will take me longer to get to at the moment.

@byroot
Copy link
Contributor

byroot commented Jan 12, 2024

a Ruby 3.1/3.2 bug with Process.detach

Would be worth reporting upstream. Even if it's fixed in 3.3, it may be worth a backport.

@stanhu
Copy link
Contributor Author

stanhu commented Jan 12, 2024

Good idea. I filed this bug: https://bugs.ruby-lang.org/issues/20181

@stanhu stanhu changed the title Fix child processes not being reaped with PID 1 Fix child processes not being reaped when Process.detach used Jan 12, 2024
@stanhu
Copy link
Contributor Author

stanhu commented Jan 12, 2024

@MSP-Greg @dentarg I've added an integration test that appears to reproduce https://bugs.ruby-lang.org/issues/20181. This test fails in master, but passes in this branch.

@stanhu
Copy link
Contributor Author

stanhu commented Jan 13, 2024

It seems this bug was already reported in https://bugs.ruby-lang.org/issues/19837 and fixed in the ruby_3_2 and ruby_3_1 branches, but the fixes aren't in a release yet.

Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when the parent was run as PID
1. For example, I issued a `kill 44` and PID 44 remained in the
`defunct` state:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything when `Process.detach` is run on some process that has not
exited. This bug appears to be present from Ruby 2.6 to 3.2, but has
been been fixed in Ruby 3.3: https://bugs.ruby-lang.org/issues/19837

Previously `Process.wait(w.pid, Process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this to ensure that terminated workers are
reaped in a timely manner.

Closes puma#3313
@stanhu
Copy link
Contributor Author

stanhu commented Jan 18, 2024

@MSP-Greg Is there anything else I can help with to get this merged?

@MSP-Greg
Copy link
Member

Is there anything else I can help with to get this merged?

Do you know how to clone one's self? Sorry.

Soon, like no later than Sunday. I apologize for the delay, 'when it rains, it pours`

It seems this bug was already reported

Ruby 3.2.3 is released...

@stanhu
Copy link
Contributor Author

stanhu commented Jan 26, 2024

@MSP-Greg Sorry to bother you again, but would you have a moment to review?

@MSP-Greg
Copy link
Member

@stanhu No problem. Did you see the comment above about line 189 in test_integration_cluster.rb? I tried this without and with the lib patch on several Ruby versions. As you've mentioned, it passes on some 'current' patch releases and 'head', but fails on many older ones.

@stanhu
Copy link
Contributor Author

stanhu commented Jan 26, 2024

@MSP-Greg I did not see the comment. Is it published?

@MSP-Greg
Copy link
Member

I did not see the comment. Is it published?

Sorry, my mistake. I didn't click 'Publish Review'. Until one does that, the reviewer can see it, but no one else...

Do you see it now?

This test ensures that Puma handles the `Process.detach` bug described
in https://bugs.ruby-lang.org/issues/19837.
@MSP-Greg
Copy link
Member

@stanhu

Thank you for the PR. Sorry for the delay.

@MSP-Greg MSP-Greg merged commit 9bd838b into puma:master Jan 26, 2024
65 of 71 checks passed
dentarg added a commit to dentarg/puma that referenced this pull request Jan 30, 2024
@MSP-Greg MSP-Greg added bug and removed waiting-for-review Waiting on review from anyone labels Jan 31, 2024
@stanhu
Copy link
Contributor Author

stanhu commented Apr 10, 2024

@nateberkopec Would you mind releasing an update with this? We're blocked on Puma 6.4.0 until this gets shipped.

@stanhu
Copy link
Contributor Author

stanhu commented May 3, 2024

@nateberkopec Sorry to bother you again. Could you find some time to release a new version of Puma?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Puma cluster not reaping child processes when PID is 1 with Puma 6.4.1
4 participants