welcomycozyhom
Background

Our company has been running a self-developed e-commerce solution on Apache/httpd
with mod_php for over 20 years. This runtime environment is deployed across
thousands of physical servers with different configurations.

Recently, we started using FFI::scope() from ext/ffi, which required enabling
opcache.preload. During this implementation, we discovered an issue with how
preload handles graceful restarts.

The Problem

In our production environment, Apache graceful restarts (triggered by SIGUSR1)
occur frequently as part of our operational workflows. We found that when
multiple restart signals are accidentally sent to the master process in rapid
succession, the waitpid() call in accel_finish_startup() can be interrupted
with EINTR.

The current code doesn't handle this:

static zend_result accel_finish_startup(void)
{
    /* ..snip.. */
    if (waitpid(pid, &status, 0) < 0) {
        zend_shared_alloc_unlock();
        zend_accel_error_noreturn(ACCEL_LOG_FATAL,
            "Preloading failed to waitpid(%d)", pid);
    }
}

When waitpid() returns -1 with errno set to EINTR, this code treats the
interruption as a fatal preload failure and the master process terminates,
causing a complete service outage that looks like a system-wide shutdown.

Notably, Apache's internal SIGUSR1 signal handler is registered without
SA_RESTART, which means interrupted system calls return EINTR rather than
being restarted automatically.
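
To illustrate the SA_RESTART behavior described above outside of Apache and PHP, here is a minimal standalone sketch. The helper name first_waitpid_errno() and the 200 ms delays are assumptions made for this demo only; nothing here is taken from the httpd or php-src code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt a blocking system call. */
static void on_usr1(int signo) { (void)signo; }

/*
 * Hypothetical demo helper (not part of PHP or Apache).
 * Installs a SIGUSR1 handler with the given sa_flags, forks a child that
 * signals the parent once and then exits, and returns the errno seen by
 * the parent's first waitpid() call (0 if that call succeeded).
 */
static int first_waitpid_errno(int sa_flags)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = sa_flags;
    if (sigaction(SIGUSR1, &sa, NULL) < 0) {
        return errno;
    }

    pid_t pid = fork();
    if (pid < 0) {
        return errno;
    }
    if (pid == 0) {
        struct timespec ts = {0, 200 * 1000 * 1000}; /* 200 ms */
        nanosleep(&ts, NULL);      /* let the parent block in waitpid() */
        kill(getppid(), SIGUSR1);  /* interrupt the parent */
        nanosleep(&ts, NULL);      /* stay alive a little longer */
        _exit(0);
    }

    int status = 0;
    int err = 0;
    if (waitpid(pid, &status, 0) < 0) {
        err = errno;
        /* Reap the child so the demo leaves no zombie behind. */
        while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {
        }
    }
    return err;
}
```

On Linux, first_waitpid_errno(0) should report EINTR, while first_waitpid_errno(SA_RESTART) should report 0 because the kernel restarts the interrupted waitpid() transparently. The first case corresponds to the situation accel_finish_startup() is in under mod_php.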

This failure is completely unrelated to the actual success or failure of the
preload subprocess itself - it's purely a signal handling timing issue. While
the occurrence is rare and non-deterministic, it has significant impact when
it happens.

Given our large-scale on-premises infrastructure with 20+ years of accumulated
workflows, eliminating all scenarios where duplicate restart signals might be
sent is impractical. We believe a defensive approach in the PHP code is more
pragmatic.

Reproduction

While our production scenario is complex, here's a simplified test case that
demonstrates the issue:

Test script

#!/bin/bash

CURR=100
NAME="/httpd"
PID=$(ps -ef | grep ${NAME} | grep -v grep | sort -k3 -n | head -1 | awk '{print $2}')
if [ -z "$PID" ]; then
    echo "PROCESS NOT FOUND: ${NAME}"
    exit 1
fi

echo "PID: ${PID}"

(seq ${CURR} | xargs -P 0 -I{} sudo kill -USR1 ${PID}) &

echo "DONE"

php.ini

...snip...
zend_extension=opcache

opcache.enable=1
opcache.preload=/path/to/preload.php
opcache.preload_user=www-data
opcache.log_verbosity_level=4

Before test - Normal process tree

root       3908672    3063  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start
www-data   3912675 3908672  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start
www-data   3912676 3908672  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start
www-data   3912677 3908672  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start
www-data   3912678 3908672  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start
www-data   3912679 3908672  0 Oct03 ?        00:00:00 /path/to/bin/httpd -k start

Error log when issue occurs

$ tail -f logs/error_log

...snip...
[Fri Oct 03 11:54:15.677855 2025] [mpm_prefork:notice] [pid 3908672:tid 3908672] AH00171: Graceful restart requested, doing restart
Fri Oct  3 11:54:15 2025 (3908672): Fatal Error Preloading failed to waitpid(3912758)
Fri Oct  3 11:54:15 2025 (3912758): Message Cached script '$PRELOAD$'
Fri Oct  3 11:54:15 2025 (3912758): Message Cached script '/path/to/preload.php'

After this error, the master process exits and all worker processes are lost,
requiring manual intervention to restore service.

In production, reload signals may arrive multiple times before
previous restart operations complete. This occurs when:
- Legacy deployment scripts trigger rapid reloads
- Monitoring systems aggressively check service health
- Orchestration platforms retry operations

The SAPI registers signal handlers without SA_RESTART, causing
system calls to return EINTR. Without retry logic, waitpid()
during preload can fail non-deterministically, terminating the
master process unexpectedly.

This adds EINTR handling to ensure stable operation in signal-heavy
environments.
Comment on lines +4891 to +4893
do {
    chld_pid = waitpid(pid, &status, 0);
} while (chld_pid < 0 && errno == EINTR);
Member

This looks right, except we probably have to check for stronger signals (e.g. SIGQUIT).
Otherwise we may hang in this loop forever.
@arnaud-lb, can you please take care of this?
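
One possible shape for such a check, as a sketch only: retry on EINTR, but stop retrying once a termination-style signal has been observed. Everything below is hypothetical and standalone. The names stop_requested, on_restart(), on_stop(), install_handler(), waitpid_retry() and demo_wait() are invented for illustration, SIGTERM stands in for whichever "stronger" signal the SAPI actually handles, and this is not the change that was made in php-src.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Set by the (hypothetical) termination handler below. */
static volatile sig_atomic_t stop_requested = 0;

static void on_restart(int signo) { (void)signo; }
static void on_stop(int signo) { (void)signo; stop_requested = 1; }

/* Install fn for signo without SA_RESTART, so waitpid() can see EINTR. */
static int install_handler(int signo, void (*fn)(int))
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Retry waitpid() on EINTR, but give up once a stop was requested. */
static pid_t waitpid_retry(pid_t pid, int *status)
{
    pid_t r;
    do {
        r = waitpid(pid, status, 0);
    } while (r < 0 && errno == EINTR && !stop_requested);
    return r;
}

/*
 * Demo driver: the child sends signo to the parent twice, then exits
 * with code 7. Returns waitpid_retry()'s result; *exit_code is set only
 * if the wait completed normally.
 */
static pid_t demo_wait(int signo, int *exit_code)
{
    stop_requested = 0;
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        struct timespec ts = {0, 100 * 1000 * 1000}; /* 100 ms */
        for (int i = 0; i < 2; i++) {
            nanosleep(&ts, NULL);
            kill(getppid(), signo);
        }
        nanosleep(&ts, NULL);
        _exit(7);
    }

    int status = 0;
    pid_t r = waitpid_retry(pid, &status);
    if (r == pid && WIFEXITED(status)) {
        *exit_code = WEXITSTATUS(status);
    } else if (r < 0) {
        /* We gave up waiting: stop the child and reap it. */
        kill(pid, SIGKILL);
        while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {
        }
    }
    return r;
}
```

With on_restart() installed for SIGUSR1 and on_stop() for SIGTERM, demo_wait(SIGUSR1, &code) should ride out both interruptions and report exit code 7, while demo_wait(SIGTERM, &code) should return -1 with stop_requested set. A flag check like this still has a small window: a stop signal that lands between the check and the next waitpid() call is not noticed until that call returns.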