Skip to content

Fixed bug #77335 (SA_RESTART despite SIGALRM unless no restart) #3717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

yitam
Copy link

@yitam yitam commented Dec 24, 2018

For details please refer to bug report 77335

@krakjoe
Copy link
Member

krakjoe commented Dec 27, 2018

Is it possible to write a test case for this change ?

@yitam
Copy link
Author

yitam commented Dec 31, 2018

@krakjoe not at this moment, but I suspect it's similar to the problem reported on this page by wally at soggysoftware dot co dot uk. Let me see what I can do

@mathieuk
Copy link

mathieuk commented Jan 2, 2019

I think this may introduce some unexpected behaviour for some people. I ran into atleast one instance where people are expecting pcntl_signal() to work as if the 3rd argument (restart_syscalls) is false which it will effectively always be in the current state.

With this patch, restart_syscalls actually starts working and in that particular example I think you may very well end up with a hung process due to the way signal handlers are called by PCNTL as I've described in more details here: microsoft/msphpsql#885 (comment) .

For the patch to land I think the default for restart_syscalls should become false.

@yitam
Copy link
Author

yitam commented Jan 3, 2019

Hi @krakjoe @mathieuk and I have some brief discussion regarding this patch (see my response and his subsequent comment for msphpsql issue 885)

While I agree that my current proposed fix may introduce other issues, our recent findings did prove that the PCTNL way of calling signal handlers is flawed. I am inclined to fix the PCTNL code rather than changing the existing API however.

@mathieuk
Copy link

mathieuk commented Jan 4, 2019

So to summarize:

  • The Microsoft PHP SQL Server driver depends on the ODBC driver and the ODBC driver expects clients to use SA_RESTART on signals. This might actually apply to other db drivers. This change actually enables the setting of SA_RESTART but only for SIGALRM.
  • After this change, you stand a good chance of never having your SIGALRM userland signal handlers called because PCNTL queues them for (slightly) later execution. But, with SA_RESTART enabled it might never get there if the primitive (likeflock) is in a hung state.

I think this would be the ideal change to make.

  • SA_RESTART becomes available for all signals, not just SIGALRM. This prevents signals like SIGUSR1/SIGUSR2/SIGTERM from causing similar issues in db drivers.
  • The current behaviour for $restart_syscalls is broken, which means that SA_RESTART is never set. This change would default to always having it set. This may break existing scripts. To mitigate we chould change the default value for $restart_syscalls to false.
  • To ensure signal delivery to userland PCNTL should call the signal handlers from pcntl_signal_handler() directly instead of queuing them for slightly later execution because you might never get back to the main execution loop.

So that last one might be tricky. It introduces the concept of calling non-signal-safe functions. We might cover that with documentation. But if that's deemed too risky, maybe we could add the ability to register signal handlers that you want called directly? Like:

pcntl_signal(
  SIGALRM,
  function() { /* handling goes here */ },
 $restart_syscalls = true, 
 $call_directly = true
); 

Which makes it a power-user kinda thing, which I think a lot of PCNTL-using code already is.

@nikic
Copy link
Member

nikic commented Jan 4, 2019

@mathieuk Running PHP code inside a signal handler is definitely a no-go. We used to allow such things and it kinda somewhat works -- until the stars align and you have a malloc deadlock. PHP does not provide a sufficiently controlled environment to execute code in signal handlers.

@@ -43,7 +43,7 @@ Sigfunc *php_signal4(int signo, Sigfunc *func, int restart, int mask_all)
#ifdef HAVE_STRUCT_SIGINFO_T
act.sa_flags |= SA_SIGINFO;
#endif
if (signo == SIGALRM || (! restart)) {
if (signo == SIGALRM && (! restart)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than changing this to an &&, shouldn't the signo == SIGALRM part just be dropped? We still want to allow using SA_RESTART with non-SIGALRM signals.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please see my original intention but it might be drastic and break other existing code.

@mathieuk
Copy link

mathieuk commented Jan 4, 2019

@mathieuk Running PHP code inside a signal handler is definitely a no-go. We used to allow such things and it kinda somewhat works -- until the stars align and you have a malloc deadlock. PHP does not provide a sufficiently controlled environment to execute code in signal handlers.

Yeah, makes sense. Any ideas on how the userland signal handler could still be called (timely) with SA_RESTART enabled?

@nikic
Copy link
Member

nikic commented Jan 4, 2019

@mathieuk Not really. I think we should do the change to allow setting SA_RESTART for SIGALRM, but it's going to be an either/or thing. Either you get a reliable interrupt, or you get restarting syscalls, but you won't be getting both...

We should only land this for master though. As you mentioned, this may break user code that is relying on the current (undocumented) exception for SIGALRM.

@ralphschindler
Copy link
Contributor

@nikic is there any path forward that would allow this to not be a bug in the 7.3 release cycle?

@nikic
Copy link
Member

nikic commented Jan 8, 2019

@ralphschindler I think we could do the following:

  • Allow setting SA_RESTART for SIGALRM
  • Change the $restart_syscalls parameter for pcntl_signal() to accept ?int. If null is passed, we will choose the default $restart_syscalls=false for SIGALRM and $restart_syscalls=true for everything else.
  • Make the default value of $restart_syscalls null.

That is, if no explicit value is passed, then we will stick with the (effective) current default of SA_RESTART for everything but SIGALRM, but have the option of explicitly passing $restart_syscalls=false.

Not sure how much this will really solve in the end -- if you're using SIGALRM quite likely you do want to interrupt syscalls, at which point you're back where you started.

@yitam
Copy link
Author

yitam commented Jan 8, 2019

@nikic @mathieuk @ralphschindler

Not sure how much this will really solve in the end -- if you're using SIGALRM quite likely you do want to interrupt syscalls, at which point you're back where you started.

As indicated here "Some older systems, such as SunOS, define the SA_INTERRUPT flag. These systems restart interrupted system calls by default, so specifying this flag causes system calls to be interrupted. Linux defines the SA_INTERRUPT flag for compatibility with applications that use it, but the default is to not restart system calls when the signal handler is installed with sigaction. The XSI extension of the Single UNIX Specification specifies that the sigaction function not restart interrupted system calls unless the SA_RESTART flag is specified."

In short, Linux behaves completely the opposite way from the older systems like SunOS. This makes interrupting system calls questionable,

@yitam
Copy link
Author

yitam commented Jan 9, 2019

@nikic @mathieuk @ralphschindler

Upon more investigation, in many places it says that "SA_INTERRUPT is a no-op left for historical reasons or backward compatibility. That is, it is unused or not really recommended. You can also look for SA_INTERRUPT on signal.h and check its comment.

Nonetheless, below is a very simple test case with different combinations (borrowed from this example and inspired by the latest script provided by @mathieuk). Only one task is able to write to the file. Hopefully the test scenarios help moving things forward.

<?php
    /**
    * @file
    * Basic demonstration of how to do parallel threads in PHP.
    */
    // This array of "tasks" could be anything. For demonstration purposes
    // these are just strings, but they could be a callback, class or
    // include file (hell, even code-as-a-string to pass to eval()).
    $tasks = [
        "task 1",
        "task 2",
        "task 3",
        "task 4",
        ];

    pcntl_async_signals(true);

    $longAlarm = ($_SERVER['argv'][1] ?? '') === 'true';
    $restart = ($_SERVER['argv'][2] ?? '') === 'true';

    // This loop creates a new fork for each of the items in $tasks.
    foreach ($tasks as $task) {
        $pid = pcntl_fork();
        if ($pid == -1) {
            exit("Error forking...\n");
        }
        else if ($pid == 0) {
            execute_task($task, $longAlarm, $restart);
            exit();
        }
    }
    // This while loop holds the parent process until all the child threads
    // are complete - at which point the script continues to execute.
    while(pcntl_waitpid(0, $status) != -1);

    // You could have more code here.
    echo "\nNow read the file after all parallel execution is complete.\n";

    $fp = fopen('lock.txt', 'r');
    echo fread($fp, 50);
    fclose($fp);

    /**
    * Helper method to execute a task.
    */
    function execute_task($task_id, $longAlarm, $restart) {
        echo "Starting: $task_id\n";

        pcntl_signal(
            SIGALRM,
            function() use ($task_id) {
                echo "HANDLED ALARM for $task_id!\n";
                exit();
            },
            $restart 
        );

        $seconds = rand(5, 10);

        if ($longAlarm) {
            pcntl_alarm($seconds + 2);
        } else {
            pcntl_alarm(1);
        }

        $fp = fopen('lock.txt', 'w');
        echo "$task_id: ";
        var_dump($fp);
        $locked = flock($fp, LOCK_EX);
        echo "$task_id: ";
        var_dump($locked);

        sleep($seconds);
        $txt = "Completed: $task_id. Took $seconds.\n";
        $bytes = fwrite($fp, $txt);
        fclose($fp);

        echo "$task_id: ";
        var_dump($bytes);
    }
?>

Tested my original patch in this PR as well as this one (dropped the first condition) -- seemingly no difference.

-	if (signo == SIGALRM || (! restart)) {
+	if (! restart) {
 #ifdef SA_INTERRUPT
 		act.sa_flags |= SA_INTERRUPT; /* SunOS */
 #endif

The output of running php pcntl_test.php false true is (if the first argument is true it means it will wait longer before sending a SIGALRM signal to the process; otherwise it only waits 1 second):

Starting: task 1
task 1: resource(5) of type (stream)
task 1: bool(true)
Starting: task 2
task 2: resource(5) of type (stream)
Starting: task 3
task 3: resource(5) of type (stream)
Starting: task 4
task 4: resource(5) of type (stream)
HANDLED ALARM for task 1!
HANDLED ALARM for task 2!
HANDLED ALARM for task 3!
HANDLED ALARM for task 4!

Now read the file after all parallel execution is complete.

The output of running php pcntl_test.php true true is:

tarting: task 1
task 1: resource(5) of type (stream)
task 1: bool(true)
Starting: task 2
task 2: resource(5) of type (stream)
Starting: task 3
task 3: resource(5) of type (stream)
Starting: task 4
task 4: resource(5) of type (stream)
task 4: bool(true)
task 1: int(27)
HANDLED ALARM for task 4!
HANDLED ALARM for task 2!
HANDLED ALARM for task 3!

Now read the file after all parallel execution is complete.
Completed: task 1. Took 9.

The output of running php pcntl_test.php true false is:

Starting: task 1
task 1: resource(5) of type (stream)
task 1: bool(true)
Starting: task 2
task 2: resource(5) of type (stream)
Starting: task 3
task 3: resource(5) of type (stream)
Starting: task 4
task 4: resource(5) of type (stream)
HANDLED ALARM for task 4!
task 1: int(28)
HANDLED ALARM for task 2!
HANDLED ALARM for task 3!

Now read the file after all parallel execution is complete.
Completed: task 1. Took 10.

@mathieuk
Copy link

mathieuk commented Jan 9, 2019

@nikic @yitam The main issue - for me, and I think others in the same boat (relying on SIGALRM to protect against runaway/stalled jobs in a worker daemon) - is that the MSSQL driver would cause an exception to be thrown before the signal handler was being called. That mucks with my ability to deal with problems.

With my current understanding that happens because:

  1. ODBC might be in read()
  2. SIGALRM arrives and interrupts that read()
  3. The pcntl_signal_handler function gets called and 'schedules' a user-land signal handler PHP function to be called when Zend Engine regains control. It then returns.
  4. Control is passed back to the code where read() was being called, that gets EINTR for read() and bubbles an error all the way back to PDO which throws an exception (if so configured).
  5. Only now is the engine in a position where it might call the user-land signal handler function but now it's too late because an exception was thrown.

While providing proper SA_RESTART control will prevent ODBC from causing an exception to be thrown it might be useless to me as I can't be sure my signal handler will ever be called.

One may question what should happen first here: should the signal be processed first (an 'interrupt' was declared after all) or should the exception be thrown as-is? If I could get the signal handler first, I might be able to do my housekeeping (update some job status somewhere) and kill the process, so the exception wouldn't matter anymore (Laravel's queue worker basically works that way).

@nikic I understand this is grasping at straws but would there even be a way to do that?

If not, it seems my options are a) creating a worker that forks each job and handles error situations from the child, b) disable exceptions for PDO and handle error values or c) hope the ODBC driver starts handling EINTR ánd is extended to have client side timeouts (otherwise handling EINTR would leave me in the same place as enabling SA_RESTART on the signal handler).

@yitam
Copy link
Author

yitam commented Jan 9, 2019

@mathieuk The real problem in the original pcntl code is that it explicitly disables SA_RESTART for SIGALRM, which is error prone as already explained by the pgsql patch. In short, one can not (or should not) assume all system processes can be fully EINTR-proof, not just ODBC.

Now I'm inclined to propose the following change instead, which makes sense if the users specifically set restart to false:

-	if (signo == SIGALRM || (! restart)) {
+	if (! restart) {
 #ifdef SA_INTERRUPT
 		act.sa_flags |= SA_INTERRUPT; /* SunOS */
 #endif

creating a worker that forks each job and handles error situations from the child

Yes @mathieuk that sounds like a good workaround.

You might want to read about potential race conditions discussed above the example on this page, in which it says,

"A common use for alarm, in addition to implementing the sleep function, is to put an upper time limit on operations that can block."

@nikic
Copy link
Member

nikic commented Jan 14, 2019

I've opened #3742 to implement the variant I described in #3717 (comment). I think this is the best we can do here.

If you need both support for a reliable alarm and want to use libraries that are not interrupt safe, then I think the only option is to run the alarm in a separate process.

@yitam
Copy link
Author

yitam commented Jan 14, 2019

Sounds good @nikic
Should I close this one then?

@nikic
Copy link
Member

nikic commented Oct 2, 2019

Closing this in favor of #3742, which is now merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants