Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make puma cluster process suitable as PID 1 #3255

Merged
merged 1 commit into from Oct 19, 2023

Conversation

casperisfine
Copy link
Contributor

@casperisfine casperisfine commented Oct 9, 2023

We recently ran into an issue in production where our puma containers had thousands of zombie (defunct) processes.

This was caused by the web application spawning some multi-process command (Google Chrome) and sometimes interrupting it hard.

- pid=1 puma cluster
 \- pid=2 puma worker
  \- pid=3 google chrome
   \- pid=4 google chrome subprocess

In such scenario if pid=3 dies without first reaping pid=4, then pid=4 get reparented as a child of pid=1, and pid=1 is responsible to reap it.

On classic style hosting, PID 1 is the init system (e.g. systemd), but with containers PID 1 is the container initial command, in our case the Puma cluster process.

Some people uses some minimal init implementations like tini to handle this, but I believe this enter the category of "unknown unknowns", as it you need to know about this concern to avoid the problem.

This commit simply change the wait_workers method to reap all childrens not just the workers it knows about.

If it end up reaping an unknown process, it logs it.

This is pretty much exactly what Unicorn and Pitchfork do, and remove the need for an init system in containers.

PS: I really don't know how to cover this with tests, it would require the test to be ran as PID 1.

@casperisfine
Copy link
Contributor Author

The CI failures seem legit, I'll start digging into them.

@nateberkopec
Copy link
Member

On classic style hosting, PID 1 is the init system (e.g. systemd), but with containers PID 1 is the container initial command, in our case the Puma cluster process.

TIL re: the container part

We recently ran into an issue in production where our puma
containers had thousands of zombie (defunct) processes.

This was caused by the web application spawning some multi-process
command (Google Chrome) and sometimes interrupting it hard.

- pid=1 puma cluster
 \- pid=2 puma worker
  \- pid=3 google chrome
   \- pid=4 google chrome subprocess

In such scenario if pid=3 dies without first reaping pid=4,
then pid=4 get reparented as a child of pid=1, and pid=1 is
responsible to reap it.

On classic style hosting, PID 1 is the init system (e.g. systemd),
but with containers PID 1 is the container initial command, in our
case the Puma cluster process.

Some people uses some minimal init implementations like `tini` to
handle this, but I believe this enter the category of "unknown unknowns",
as it you need to know about this concern to avoid the problem.

This commit simply change the `wait_workers` method to reap all childrens
not just the workers it knows about.

If it end up reaping an unknown process, it logs it.

This is pretty much exactly what Unicorn and Pitchfork do, and remove
the need for an init system in containers.
@casperisfine
Copy link
Contributor Author

Apologies for the delay, I was a bit swamped last week.

I believe I fixed CI, there are two jobs failing on the last push, but both are ruby-head I suspect it's not related to my PR.

It should be ready for review now.

Copy link
Member

@nateberkopec nateberkopec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for me but I want @MSP-Greg to approve as well

@MSP-Greg
Copy link
Member

@byroot @casperisfine

Thanks for the PR. I'm working on a revised test suite, and this passed when cherry picked on top of it.

both are ruby-head I suspect it's not related to my PR.

Ruby head has been unstable recently, especially on limited resource CI VMs. Or, fails in the cloud, but never on my local desktop...

@MSP-Greg MSP-Greg mentioned this pull request Oct 18, 2023
@casperisfine
Copy link
Contributor Author

Ah thanks, I'll rebase then.

@casperisfine
Copy link
Contributor Author

Nvm, I just realized you haven't merged these fixes yet.

@nateberkopec nateberkopec merged commit a4826bb into puma:master Oct 19, 2023
62 of 64 checks passed
@nateberkopec
Copy link
Member

Thanks! Should get 6.4.1 out tomorrow or so.

nateberkopec pushed a commit that referenced this pull request Jan 2, 2024
We recently ran into an issue in production where our puma
containers had thousands of zombie (defunct) processes.

This was caused by the web application spawning some multi-process
command (Google Chrome) and sometimes interrupting it hard.

- pid=1 puma cluster
 \- pid=2 puma worker
  \- pid=3 google chrome
   \- pid=4 google chrome subprocess

In such scenario if pid=3 dies without first reaping pid=4,
then pid=4 get reparented as a child of pid=1, and pid=1 is
responsible to reap it.

On classic style hosting, PID 1 is the init system (e.g. systemd),
but with containers PID 1 is the container initial command, in our
case the Puma cluster process.

Some people uses some minimal init implementations like `tini` to
handle this, but I believe this enter the category of "unknown unknowns",
as it you need to know about this concern to avoid the problem.

This commit simply change the `wait_workers` method to reap all childrens
not just the workers it knows about.

If it end up reaping an unknown process, it logs it.

This is pretty much exactly what Unicorn and Pitchfork do, and remove
the need for an init system in containers.

Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
nateberkopec pushed a commit that referenced this pull request Jan 2, 2024
We recently ran into an issue in production where our puma
containers had thousands of zombie (defunct) processes.

This was caused by the web application spawning some multi-process
command (Google Chrome) and sometimes interrupting it hard.

- pid=1 puma cluster
 \- pid=2 puma worker
  \- pid=3 google chrome
   \- pid=4 google chrome subprocess

In such scenario if pid=3 dies without first reaping pid=4,
then pid=4 get reparented as a child of pid=1, and pid=1 is
responsible to reap it.

On classic style hosting, PID 1 is the init system (e.g. systemd),
but with containers PID 1 is the container initial command, in our
case the Puma cluster process.

Some people uses some minimal init implementations like `tini` to
handle this, but I believe this enter the category of "unknown unknowns",
as it you need to know about this concern to avoid the problem.

This commit simply change the `wait_workers` method to reap all childrens
not just the workers it knows about.

If it end up reaping an unknown process, it logs it.

This is pretty much exactly what Unicorn and Pitchfork do, and remove
the need for an init system in containers.

Co-authored-by: Jean Boussier <jean.boussier@gmail.com>
stanhu added a commit to stanhu/puma that referenced this pull request Jan 10, 2024
Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when run as PID 1. Note that PID 44
remained in the `defunct` state after a `kill 44` was issued:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything inside Google Kubernetes Engine running as PID 1.

Previously `Process.wait(w.pid, process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this.
stanhu added a commit to stanhu/puma that referenced this pull request Jan 10, 2024
Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when run as PID 1. Note that PID 44
remained in the `defunct` state after a `kill 44` was issued:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything inside Google Kubernetes Engine running as PID 1.

Previously `Process.wait(w.pid, process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this.
stanhu added a commit to stanhu/puma that referenced this pull request Jan 10, 2024
Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when run as PID 1. Note that PID 44
remained in the `defunct` state after a `kill 44` was issued:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything inside Google Kubernetes Engine running as PID 1.

Previously `Process.wait(w.pid, process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this.

Closes puma#3313
stanhu added a commit to stanhu/puma that referenced this pull request Jan 10, 2024
Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when run as PID 1. Note that PID 44
remained in the `defunct` state after a `kill 44` was issued:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything inside Google Kubernetes Engine running as PID 1.

Previously `Process.wait(w.pid, process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this.

Closes puma#3313
stanhu added a commit to stanhu/puma that referenced this pull request Jan 10, 2024
Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when the parent was run as PID
1. For example, I issued a `kill 44` and PID 44 remained in the
`defunct` state:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything inside Google Kubernetes Engine running as PID 1.

Previously `Process.wait(w.pid, process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this.

Closes puma#3313
stanhu added a commit to stanhu/puma that referenced this pull request Jan 10, 2024
Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when the parent was run as PID
1. For example, I issued a `kill 44` and PID 44 remained in the
`defunct` state:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything inside Google Kubernetes Engine running as PID 1.

Previously `Process.wait(w.pid, process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this.

Closes puma#3313
stanhu added a commit to stanhu/puma that referenced this pull request Jan 10, 2024
Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when the parent was run as PID
1. For example, I issued a `kill 44` and PID 44 remained in the
`defunct` state:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything inside Google Kubernetes Engine running as PID 1.

Previously `Process.wait(w.pid, Process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this.

Closes puma#3313
stanhu added a commit to stanhu/puma that referenced this pull request Jan 11, 2024
Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when the parent was run as PID
1. For example, I issued a `kill 44` and PID 44 remained in the
`defunct` state:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything when:

1. The parent PID is 1.
2. `Process.detach(some PID != 1)` is run after a `Process.spawn`.
This bug appears to be present in Ruby 3.1 and 3.2, but it seems to
have been fixed in Ruby 3.3.

Previously `Process.wait(w.pid, Process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this.

Closes puma#3313
stanhu added a commit to stanhu/puma that referenced this pull request Jan 12, 2024
Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when the parent was run as PID
1. For example, I issued a `kill 44` and PID 44 remained in the
`defunct` state:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything when `Process.detach` is run on some process that has not
exited. This bug appears to be present from Ruby 2.6 to 3.2, but has
been been fixed in Ruby 3.3: https://bugs.ruby-lang.org/issues/20181

Previously `Process.wait(w.pid, Process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this to ensure that terminated workers are
reaped in a timely manner.

Closes puma#3313
stanhu added a commit to stanhu/puma that referenced this pull request Jan 16, 2024
Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when the parent was run as PID
1. For example, I issued a `kill 44` and PID 44 remained in the
`defunct` state:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in puma#3255 never appears to return
anything when `Process.detach` is run on some process that has not
exited. This bug appears to be present from Ruby 2.6 to 3.2, but has
been been fixed in Ruby 3.3: https://bugs.ruby-lang.org/issues/19837

Previously `Process.wait(w.pid, Process::WNOHANG)` was called on each
known worker PID. puma#3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this to ensure that terminated workers are
reaped in a timely manner.

Closes puma#3313
MSP-Greg pushed a commit that referenced this pull request Jan 26, 2024
* Fix child processes not being reaped when `Process.detach` used

Starting with Puma v6.4.1, we observed that killed Puma cluster
workers were never being restarted when the parent was run as PID
1. For example, I issued a `kill 44` and PID 44 remained in the
`defunct` state:

```
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
```

Further investigation showed that the introduction of
`Process.wait2(-1, Process::WNOHANG)` in #3255 never appears to return
anything when `Process.detach` is run on some process that has not
exited. This bug appears to be present from Ruby 2.6 to 3.2, but has
been been fixed in Ruby 3.3: https://bugs.ruby-lang.org/issues/19837

Previously `Process.wait(w.pid, Process::WNOHANG)` was called on each
known worker PID. #3255 changed this behavior to do this only if the
`fork_worker` config parameter were enabled, but it seems that we
should always do this to ensure that terminated workers are
reaped in a timely manner.

Closes #3313

* Add integration test for Puma worker reaping

This test ensures that Puma handles the `Process.detach` bug described
in https://bugs.ruby-lang.org/issues/19837.
@dentarg dentarg added the bug label Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants