ext/opcache: Retry waitpid on EINTR from repeated reload signals #20051

welcomycozyhom · 2025-10-03T15:48:45Z

Background

Our company has been running a self-developed e-commerce solution on Apache/httpd
with mod_php for over 20 years. This runtime environment is deployed across
thousands of physical servers with different configurations.

Recently, we started using FFI::scope() from ext/ffi, which required enabling
opcache.preload. During this implementation, we discovered an issue with how
preload handles graceful restarts.

The Problem

In our production environment, Apache graceful restarts (triggered by SIGUSR1)
occur frequently as part of our operational workflows. We found that when
multiple restart signals are accidentally sent to the master process in rapid
succession, the waitpid() call in accel_finish_startup() can be interrupted
with EINTR.

The current code doesn't handle this:

static zend_result accel_finish_startup(void)
{
    /* ...snip... */
    if (waitpid(pid, &status, 0) < 0) {
        zend_shared_alloc_unlock();
        zend_accel_error_noreturn(ACCEL_LOG_FATAL,
            "Preloading failed to waitpid(%d)", pid);
    }
}

When waitpid() returns -1 due to EINTR, the master process terminates
unexpectedly, causing a complete service outage that looks like a system-wide
shutdown.
Notably, Apache's internal SIGUSR1 signal handler is registered without
SA_RESTART, which means interrupted system calls return EINTR rather than
auto-resuming.

This failure is completely unrelated to the actual success or failure of the
preload subprocess itself - it's purely a signal handling timing issue. While
the occurrence is rare and non-deterministic, it has significant impact when
it happens.
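
To make the mechanism concrete, here is a minimal standalone C sketch (not PHP or Apache code). It installs a SIGUSR1 handler without SA_RESTART, forks a child that signals the parent while the parent is blocked in waitpid(), and retries on EINTR. The names on_usr1 and wait_with_retry_demo and the 100 ms delays are illustrative assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Empty handler: its only effect is to interrupt blocking system calls. */
static void on_usr1(int signo) { (void)signo; }

/*
 * Returns how many times waitpid() was interrupted with EINTR before the
 * child was reaped, or -1 on a real failure.
 */
static int wait_with_retry_demo(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* deliberately no SA_RESTART */
    if (sigaction(SIGUSR1, &sa, NULL) < 0) {
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: give the parent time to block in waitpid(), then signal it. */
        struct timespec delay = { 0, 100 * 1000 * 1000 };
        nanosleep(&delay, NULL);
        kill(getppid(), SIGUSR1);
        nanosleep(&delay, NULL);
        _exit(0);
    }

    int status = 0;
    int interruptions = 0;
    pid_t ret;
    do {
        ret = waitpid(pid, &status, 0);
        if (ret < 0 && errno == EINTR) {
            interruptions++; /* a single un-retried call would have failed here */
        }
    } while (ret < 0 && errno == EINTR);

    if (ret != pid || !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        return -1;
    }
    return interruptions;
}
```

The exact count depends on timing, but the child is reaped either way. On Linux, setting sa.sa_flags to SA_RESTART instead should let the kernel resume waitpid() transparently, so no EINTR would be observed.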

Given our large-scale on-premises infrastructure with 20+ years of accumulated
workflows, eliminating all scenarios where duplicate restart signals might be
sent is impractical. We believe a defensive approach in the PHP code is more
pragmatic.

Reproduction

While our production scenario is complex, here's a simplified test case that
demonstrates the issue:

Test script

#!/bin/bash

CURR=100
NAME="/httpd"
PID=$(ps -ef | grep ${NAME} | grep -v grep | sort -k3 -n | head -1 | awk '{print $2}')
if [ -z "$PID" ]; then
    echo "Process not found: ${NAME}"
    exit 1
fi

echo "PID: ${PID}"

(seq ${CURR} | xargs -P 0 -I{} sudo kill -USR1 ${PID}) &

echo "DONE"

php.ini

...snip...
zend_extension=opcache

opcache.enable=1
opcache.preload=/path/to/preload.php
opcache.preload_user=www-data
opcache.log_verbosity_level=4

Before test - Normal process tree

root       3908672    3063  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start
www-data   3912675 3908672  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start
www-data   3912676 3908672  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start
www-data   3912677 3908672  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start
www-data   3912678 3908672  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start
www-data   3912679 3908672  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start

Error log when issue occurs

$ tail -f logs/error_log

...snip...
[Fri Oct 03 11:54:15.677855 2025] [mpm_prefork:notice] [pid 3908672:tid 3908672] AH00171: Graceful restart requested, doing restart
Fri Oct  3 11:54:15 2025 (3908672): Fatal Error Preloading failed to waitpid(3912758)
Fri Oct  3 11:54:15 2025 (3912758): Message Cached script '$PRELOAD$'
Fri Oct  3 11:54:15 2025 (3912758): Message Cached script '/path/to/preload.php'

After this error, the master process exits and all worker processes are lost,
requiring manual intervention to restore service.

In production, reload signals may arrive multiple times before previous restart operations complete. This occurs when:

- Legacy deployment scripts trigger rapid reloads
- Monitoring systems aggressively check service health
- Orchestration platforms retry operations

The SAPI registers signal handlers without SA_RESTART, causing system calls to return EINTR. Without retry logic, waitpid() during preload can fail non-deterministically, terminating the master process unexpectedly. This adds EINTR handling to ensure stable operation in signal-heavy environments.

dstogov · 2025-10-06T07:25:51Z

ext/opcache/ZendAccelerator.c

+		do {
+			chld_pid = waitpid(pid, &status, 0);
+		} while (chld_pid < 0 && errno == EINTR);


This looks right, except that we probably have to check for stronger signals (e.g. SIGQUIT).
Otherwise we may hang in this loop forever.
@arnaud-lb, can you please take care of this?
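
One possible shape for the concern above, as a sketch only: retry on EINTR, but stop once a termination request has been recorded. This assumes the handlers for stronger signals (e.g. SIGQUIT, SIGTERM) set a flag rather than killing the process. The helper name waitpid_retry and the stop_requested flag are hypothetical and are not part of PHP or of this PR.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: waits for `pid`, retrying when waitpid() is
 * interrupted, but gives up once *stop_requested becomes non-zero.
 * The flag is assumed to be set by the process's own handlers for
 * termination signals.
 */
static pid_t waitpid_retry(pid_t pid, int *status,
                           const volatile sig_atomic_t *stop_requested)
{
    for (;;) {
        pid_t ret = waitpid(pid, status, 0);
        if (ret >= 0) {
            return ret; /* child reaped */
        }
        if (errno != EINTR) {
            return -1; /* real failure, errno describes it */
        }
        if (*stop_requested) {
            errno = EINTR; /* report the interruption to the caller */
            return -1;
        }
        /* Interrupted by a benign signal such as SIGUSR1: wait again. */
    }
}
```

With this shape the caller can tell "interrupted and asked to stop" (return -1, errno == EINTR) apart from other waitpid() failures, and repeated reload signals alone never end the wait.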

welcomycozyhom requested a review from dstogov as a code owner October 3, 2025 15:48

github-actions bot added the Extension: opcache label Oct 3, 2025

dstogov reviewed Oct 6, 2025


This was referenced Oct 6, 2025

Mask USR1/HUP while waiting for preloading #20079

Open

Fix access to uninitialized vars in preload_load() #20081

Open
