Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHP 8.1.16 segfaults on line 597 of sapi/apache2handler/sapi_apache2.c #10737

Closed
ElliotNB opened this issue Feb 28, 2023 · 108 comments
Closed

PHP 8.1.16 segfaults on line 597 of sapi/apache2handler/sapi_apache2.c #10737

ElliotNB opened this issue Feb 28, 2023 · 108 comments

Comments

@ElliotNB
Copy link

ElliotNB commented Feb 28, 2023

Description

I apologize in advance for the low quality issue report. I'm hoping that someone can point me in the right direction on next steps for completing a more detailed analysis.

I am running Apache 2.4.54 mpm_event with PHP 8.1.16 on a CentOS 8 Stream machine. When we gradually introduce production traffic load to this server, it begins to segfault every 25-45 minutes. I've been unable to find any correlation between the PHP code being executed and the segmentation faults. There is no particular request or script that reliably triggers a segmentation fault.

I captured core dumps for the segfaults and ran them through gdb. gdb produced the following traces. I also ran valgrind in an attempt to capture debug info on memory problems (see below).

(gdb) bt
#0  0x00007fb84f317161 in php_handler (r=0x7fb7582003b8)
    at /usr/src/debug/php-8.1.16-1.el8.x86_64/sapi/apache2handler/sapi_apache2.c:597
#1  0x0000557a01527f38 in ap_run_handler (r=r@entry=0x7fb7582003b8) at config.c:169
#2  0x0000557a015284f6 in ap_invoke_handler (r=r@entry=0x7fb7582003b8) at config.c:443
#3  0x0000557a0153e1ec in ap_internal_redirect (new_uri=<optimized out>, r=<optimized out>) at http_request.c:790
#4  0x00007fb84fb12a69 in handler_redirect (r=0x7fb7581f8c80) at mod_rewrite.c:5293
#5  0x0000557a01527f38 in ap_run_handler (r=r@entry=0x7fb7581f8c80) at config.c:169
#6  0x0000557a015284f6 in ap_invoke_handler (r=r@entry=0x7fb7581f8c80) at config.c:443
#7  0x0000557a0153eed3 in ap_process_async_request (r=0x7fb7581f8c80) at http_request.c:452
#8  0x0000557a0153f042 in ap_process_request (r=r@entry=0x7fb7581f8c80) at http_request.c:487
#9  0x00007fb85139bb76 in h2_task_process_request (c=0x7fb758194300, task=<optimized out>) at h2_task.c:671
#10 h2_task_process_conn (c=0x7fb758194300) at h2_task.c:713
#11 h2_task_process_conn (c=0x7fb758194300) at h2_task.c:700
#12 0x0000557a01531a08 in ap_run_process_connection (c=c@entry=0x7fb758194300) at connection.c:42
#13 0x00007fb85139ced7 in h2_task_do (task=0x7fb75824d210, thread=thread@entry=0x557a02546cc8, worker_id=<optimized out>) at h2_task.c:631
#14 0x00007fb8513a0c80 in slot_run (thread=0x557a02546cc8, wctx=0x557a02557700) at h2_workers.c:263
#15 0x00007fb85622b1ca in start_thread () from /lib64/libpthread.so.0
#16 0x00007fb855c93e73 in clone () from /lib64/libc.so.6
Thread 703 "httpd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb7bc9ff700 (LWP 31768)]
0x00007fb84f317161 in php_handler (r=0x7fb7582003b8)
    at /usr/src/debug/php-8.1.16-1.el8.x86_64/sapi/apache2handler/sapi_apache2.c:597
597             ctx = SG(server_context);
(gdb) bt full
#0  0x00007fb84f317161 in php_handler (r=0x7fb7582003b8)
    at /usr/src/debug/php-8.1.16-1.el8.x86_64/sapi/apache2handler/sapi_apache2.c:597
        ctx = 0x557a02269ef8
        conf = 0x557a0222e230
        brigade = 0x0
        bucket = 0x7fb758201918
        rv = 0
        parent_req = 0x0
#1  0x0000557a01527f38 in ap_run_handler (r=r@entry=0x7fb7582003b8) at config.c:169
        pHook = <optimized out>
        n = 7
        rv = -1
#2  0x0000557a015284f6 in ap_invoke_handler (r=r@entry=0x7fb7582003b8) at config.c:443
        handler = <optimized out>
        p = <optimized out>
        result = 0
        old_handler = 0x0
        ignore = <optimized out>
#3  0x0000557a0153e1ec in ap_internal_redirect (new_uri=<optimized out>, r=<optimized out>) at http_request.c:790
        access_status = <optimized out>
        new = 0x7fb7582003b8
#4  0x00007fb84fb12a69 in handler_redirect (r=0x7fb7581f8c80) at mod_rewrite.c:5293
No locals.
#5  0x0000557a01527f38 in ap_run_handler (r=r@entry=0x7fb7581f8c80) at config.c:169
        pHook = <optimized out>
        n = 6
        rv = -1
#6  0x0000557a015284f6 in ap_invoke_handler (r=r@entry=0x7fb7581f8c80) at config.c:443
        handler = <optimized out>
        p = <optimized out>
        result = 0
        old_handler = 0x7fb84fb1a44a "redirect-handler"
        ignore = <optimized out>
#7  0x0000557a0153eed3 in ap_process_async_request (r=0x7fb7581f8c80) at http_request.c:452
        c = <optimized out>
        access_status = 0
#8  0x0000557a0153f042 in ap_process_request (r=r@entry=0x7fb7581f8c80) at http_request.c:487
        bb = 0x0
        b = <optimized out>
        c = 0x7fb758194300
        rv = <optimized out>
#9  0x00007fb85139bb76 in h2_task_process_request (c=0x7fb758194300, task=<optimized out>) at h2_task.c:671
        req = <optimized out>
        cs = 0x7fb758194998
        r = 0x7fb7581f8c80
        req = <optimized out>
        cs = <optimized out>
        r = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--
#10 h2_task_process_conn (c=0x7fb758194300) at h2_task.c:713
        ctx = <optimized out>
        ctx = <optimized out>
#11 h2_task_process_conn (c=0x7fb758194300) at h2_task.c:700
        ctx = <optimized out>
#12 0x0000557a01531a08 in ap_run_process_connection (c=c@entry=0x7fb758194300) at connection.c:42
        pHook = <optimized out>
        n = 1
        rv = -1
#13 0x00007fb85139ced7 in h2_task_do (task=0x7fb75824d210, thread=thread@entry=0x557a02546cc8, worker_id=<optimized out>) at h2_task.c:631
        c = 0x7fb758194300
#14 0x00007fb8513a0c80 in slot_run (thread=0x557a02546cc8, wctx=0x557a02557700) at h2_workers.c:263
        slot = 0x557a02557700
#15 0x00007fb85622b1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#16 0x00007fb855c93e73 in clone () from /lib64/libc.so.6
No symbol table info available.

The trace ends at line 597 of sapi/apache2handler/sapi_apache2.c which contains this line: ctx = SG(server_context); in this block of code:

#define PHPAP_INI_OFF php_apache_ini_dtor(r, parent_req);

        conf = ap_get_module_config(r->per_dir_config, &php_module);

        /* apply_config() needs r in some cases, so allocate server_context early */
        ctx = SG(server_context);
        if (ctx == NULL || (ctx && ctx->request_processed && !strcmp(r->protocol, "INCLUDED"))) {
normal:
                ctx = SG(server_context) = apr_pcalloc(r->pool, sizeof(*ctx));
                /* register a cleanup so we clear out the SG(server_context)
                 * after each request. Note: We pass in the pointer to the
                 * server_context in case this is handled by a different thread.
                 */
                apr_pool_cleanup_register(r->pool, (void *)&SG(server_context), php_server_context_cleanup, apr_pool_cleanup_null);
                ctx->r = r;
                ctx = NULL; /* May look weird to null it here, but it is to catch the right case in the first_try later on */
        } else {
                parent_req = ctx->r;
                ctx->r = r;
        }
        apply_config(conf);

I'm guessing that this isn't enough information to figure out what's going on. Could anyone recommend next steps for troubleshooting this?

PHP Version

8.1.16

configure line:

./configure --prefix=/usr/local --enable-fpm --disable-fileinfo --enable-bcmath --enable-calendar --with-libxml --enable-soap --enable-mbstring --enable-pdo --enable-sockets --with-zip --with-apxs2=/usr/local/apache/bin/apxs --with-bz2 --with-curl=/usr/local --with-gettext --with-libdir=lib64 --with-openssl=/usr --with-openssl-dir=/usr --with-pdo-pgsql=/usr/pgsql-11 --with-pgsql=/usr/pgsql-11 --with-pic --with-zlib --with-zlib-dir=/usr --enable-opcache --enable-debug --enable-gd --with-jpeg PKG_CONFIG_PATH=:/usr/local/lib/pkgconfig:/usr/local/bin/pkgconfig

Dynamically loaded extensions:

zend_extension=/usr/local/lib/php/extensions/debug-zts-20210902/opcache.so
opcache.memory_consumption=256
opcache.interned_strings_buffer=16
opcache.max_accelerated_files=5000
opcache.revalidate_freq=0
;opcache.validate_timestamps=1
opcache.fast_shutdown=1

extension=/usr/local/lib/php/extensions/debug-zts-20210902/memcached.so;

We experimented with disabling Zend opcache, but the segfaults persisted.

Operating System

CentOS 8 Stream

@ElliotNB ElliotNB changed the title PHP 8.1.18 segfaults on line 597 of sapi/apache2handler/sapi_apache2.c PHP 8.1.16 segfaults on line 597 of sapi/apache2handler/sapi_apache2.c Feb 28, 2023
@ElliotNB
Copy link
Author

ElliotNB commented Mar 1, 2023

Just a little bit more information... I launched my Apache/PHP with valgrind to see if I could identify the source of any memory problems.

I don't understand these valgrind logs very well, but I'm seeing lots of traces point back to line 426 of zend_string.c (see below). The trace differs, but the final file and line number always remain the same.

Does this indicate that anything is amiss?

==210750== Thread 26:
==210750== Conditional jump or move depends on uninitialised value(s)
==210750==    at 0xD469860: zend_string_equal_val (zend_string.c:426)
==210750==    by 0x128173D6: zend_string_equal_content (zend_string.h:355)
==210750==    by 0x12817418: zend_string_equals (zend_string.h:360)
==210750==    by 0x1281D2D0: cache_script_in_shared_memory (ZendAccelerator.c:1632)
==210750==    by 0x128200AA: persistent_compile_file (ZendAccelerator.c:2175)
==210750==    by 0xD3C96D1: zend_include_or_eval (zend_execute.c:4791)
==210750==    by 0xD3D9A95: ZEND_INCLUDE_OR_EVAL_SPEC_CONST_HANDLER (zend_vm_execute.h:4869)
==210750==    by 0xD4484E1: execute_ex (zend_vm_execute.h:56188)
==210750==    by 0xD376BA7: zend_call_function (zend_execute_API.c:912)
==210750==    by 0xD377155: zend_call_known_function (zend_execute_API.c:1001)
==210750==    by 0xD18397B: spl_perform_autoload (php_spl.c:446)
==210750==    by 0xD3778C8: zend_lookup_class_ex (zend_execute_API.c:1145)
==210750==    by 0xD3785DB: zend_fetch_class_by_name (zend_execute_API.c:1605)
==210750==    by 0xD3E43DB: ZEND_NEW_SPEC_CONST_UNUSED_HANDLER (zend_vm_execute.h:10167)
==210750==    by 0xD448CA1: execute_ex (zend_vm_execute.h:56684)
==210750==    by 0xD44C4A3: zend_execute (zend_vm_execute.h:60151)
==210750==    by 0xD3908F5: zend_execute_scripts (zend.c:1799)
==210750==    by 0xD2DB70F: php_execute_script (main.c:2542)
==210750==    by 0xD507196: php_handler (sapi_apache2.c:710)
==210750==    by 0x45582F: ap_run_handler (config.c:169)
==210750==    by 0x455DC5: ap_invoke_handler (config.c:443)
==210750==    by 0x469BAB: ap_internal_redirect (http_request.c:790)
==210750==    by 0xC8B5FB4: handler_redirect (mod_rewrite.c:5293)
==210750==    by 0x45582F: ap_run_handler (config.c:169)
==210750==    by 0x455DC5: ap_invoke_handler (config.c:443)
==210750==    by 0x46A87A: ap_process_async_request (http_request.c:452)
==210750==    by 0x46A9CD: ap_process_request (http_request.c:487)
==210750==    by 0xB04AD75: h2_task_process_request (h2_task.c:671)
==210750==    by 0xB04AD75: h2_task_process_conn (h2_task.c:713)
==210750==    by 0xB04AD75: h2_task_process_conn (h2_task.c:700)
==210750==    by 0x45E66F: ap_run_process_connection (connection.c:42)
==210750==    by 0xB04C006: h2_task_do (h2_task.c:631)

@ElliotNB
Copy link
Author

ElliotNB commented Mar 1, 2023

@Girgias I saw looking at zend_string.c and saw that you most recently touched the function relevant to this valgrind trace in #8304. Any thoughts on why valgrind reports a Conditional jump or move depends on uninitialised value(s) on line 426 of zend_string.c and if it's a significant problem? My valgrind logs end up filled with that error pointing to the same file and line number. I'm out of my depth to go much further with the troubleshooting.

@nielsdos
Copy link
Member

nielsdos commented Mar 1, 2023

Any thoughts on why valgrind reports a Conditional jump or move depends on uninitialised value(s) on line 426 of zend_string.c and if it's a significant problem? My valgrind logs end up filled with that error pointing to the same file and line number. I'm out of my depth to go much further with the troubleshooting.

This is a false positive. See #10221 and #9068 (comment) for the analysis

@ElliotNB
Copy link
Author

ElliotNB commented Mar 2, 2023

@nielsdos Got it, so according to #9068 I should ignore any valgrind complaints about zend_string_equal_val?

I think I've narrowed down the root cause a bit further. I ran make test on our build and a whole bunch of opcache tests failed. Starting to dig through the cause on those. Perhaps those errors are contributing to the intermittent segfault.

@nielsdos
Copy link
Member

nielsdos commented Mar 2, 2023

Yes, you should ignore those valgrind reports.

If you run opcache in production, and you have failing opcache tests then those 2 seem correlated indeed. EDIT: just read that the segfault persists without opcache.
Could you give a list of the failing opcache tests?

To debug you can configure php with --enable-address-sanitizer , which can find memory errors without having to use valgrind. You could try reproducing your apache crash with that and see what happens.

Also what do you mean with "our build"? Is it a custom fork of php or do you just mean that you compiled php yourself without modifications?

@ElliotNB
Copy link
Author

ElliotNB commented Mar 2, 2023

Could you give a list of the failing opcache tests?

@nielsdos About 30 opcache-related unit tests were failing via make test, but I eventually figured out that was due to opcache.preload_user missing from our php.ini. opcache.preload_user seems to have been added in PHP 7.4. Our company is in the process of upgrading our fleet of web servers from 7.3.27 to 8.1.16 (which explains why the opcache.preload_user option was missing). After adding that missing option to our config, we still do have some failing tests, but the list is much smaller now:

Bug #79919 (Stack use-after-scope in define()) [Zend/tests/bug79919.phpt]
Bug #80839: PHP problem with JIT [ext/opcache/tests/jit/bug80861.phpt]
Multicast support: IPv4 receive options [ext/sockets/tests/mcast_ipv4_recv.phpt]
setcookie() array variant error tests [ext/standard/tests/network/setcookie_array_option_error.phpt]
setcookie() error tests [ext/standard/tests/network/setcookie_error.phpt]
setrawcookie() error tests [ext/standard/tests/network/setrawcookie_error.phpt]
Bug #69487 (SAPI may truncate POST data) [sapi/cgi/tests/bug69487.phpt]

To debug you can configure php with --enable-address-sanitizer , which can find memory errors without having to use valgrind. You could try reproducing your apache crash with that and see what happens.

Thanks for the tip -- I'll give that a try and report back.

We also just tried adding SetEnv USE_ZEND_ALLOC 0 to our httpd.conf and restarted Apache. Unfortunately the segmentation faults persisted after that.

We confirmed that USE_ZEND_ALLOC 0 appeared in the Environment Variables section of the phpinfo() output, but strangely Zend Memory Manager was still listed as enabled up top in the phpinfo() output. Would that indicate that the Zend Memory Manager was still in effect despite the USE_ZEND_ALLOC environment variable?

Also what do you mean with "our build"? Is it a custom fork of php or do you just mean that you compiled php yourself without modifications?

I'm just referring to compiling PHP from the standard 8.1.16 source with our own selection of extensions, etc. Not a custom fork of PHP or anything like that.

@nielsdos
Copy link
Member

nielsdos commented Mar 2, 2023

After adding that missing option to our config, we still do have some failing tests, but the list is much smaller now:

I don't see an immediate relation between those tests. What do the failures look like? Are they segfaults? If so, is it possible to attach gdb and see where they segfault?

We confirmed that USE_ZEND_ALLOC 0 appeared in the Environment Variables section of the phpinfo() output, but strangely Zend Memory Manager was still listed as enabled up top in the phpinfo() output. Would that indicate that the Zend Memory Manager was still in effect despite the USE_ZEND_ALLOC environment variable?

I think so. But I don't think the memory manager is the source of the issue tbh.
Looking at line 597 I don't see how that can segfault unless something goes really wrong in TSRM.

I just noticed however that you didn't use --enable-zts in your configure command. But given your mpm-event setup I would expect that you need to use zts. I also see that some extensions in your configuration use zts. Are you actually configuring&compiling php with zts enabled?

@ElliotNB
Copy link
Author

ElliotNB commented Mar 2, 2023

@nielsdos

I just noticed however that you didn't use --enable-zts in your configure command. But given your mpm-event setup I would expect that you need to use zts. I also see that some extensions in your configuration use zts. Are you actually configuring&compiling php with zts enabled?

Oh that's a good catch. Yes, we are running Apache 2.4.54 with mpm_event:

[root@app ~]# /usr/local/apache/bin/httpd -l
Compiled in modules:
  core.c
  mod_so.c
  http_core.c
  event.c
[root@app ~]#

And nope, we have not compiled PHP with --enable-zts. Yet we have those ZTS extensions enabled in our php.ini...

For what it's worth, I confirmed that our prior 7.3.27 PHP build was also compiled without --enable-zts and also had those same ZTS extensions (and Apache with mpm_event), but it never caused segmentation faults.

I'll try re-compiling PHP with --enable-zts included and will report back. Thank you for the help!

@ElliotNB
Copy link
Author

ElliotNB commented Mar 3, 2023

@nielsdos Unfortunately adding --enable-zts to our ./configure line didn't make a difference -- the segmentation faults persist. I came across a post made by @derickr ( https://stackoverflow.com/questions/58918146/why-php-zts-is-disabled-by-default ) where he says If you were to build PHP for the ISAPI SAPI (Server Abstraction) layer, then the PHP build process automatically turns on ZTS mode. That probably explains why our prior build of PHP 7.3.27 didn't have ZTS issues even while omitting the --enable-zts line.

I'm still working on compiling with --enable-address-sanitizer (my first compile wasn't working with Apache) and will report back on that soon. I'm also digging deeper into the handful of failed unit tests to see if they any produce segfaults (didn't see any segfaults running them individually, but will take a closer look).

I immensely appreciate your help! Aside from what you've suggested above, I'm completely out of ideas on where to go from here.

@ElliotNB
Copy link
Author

ElliotNB commented Mar 3, 2023

@nielsdos Another update...

I managed to resolve all but one of the unit tests that were failing. They were all caused by a small set of differences in my php.ini compared to the default php.ini. For example, Zend/tests/bug79919.phpt and ext/opcache/tests/jit/bug80861.phpt both fail if you define a custom error_log in your php.ini (I had set error_log = /var/log/php_error_log). Is that something worth reporting as a separate minor bug? The only unit test failing now is Multicast support: IPv4 receive options [ext/sockets/tests/mcast_ipv4_recv.phpt]. However, despite getting all but one of the unit tests to pass -- the segmentation faults still persist.

To debug you can configure php with --enable-address-sanitizer

I've run into some trouble trying to accomplish that. When I compile PHP with --enable-address-sanitizer, I get this error when trying to start Apache: httpd: Syntax error on line 177 of /usr/local/apache/conf/httpd.conf: Cannot load modules/libphp.so into server: /usr/local/apache/modules/libphp.so: undefined symbol: __asan_option_detect_stack_use_after_return I'm still researching on what might be causing that, but I'm not seeing anyone else who has reported that error in the context of compiling PHP from source.

I'm also going to try switching from gcc to clang to compile PHP. The standard repos for CentOS 8 Stream only offer gcc version 8.5.0. Meanwhile clang version 15.0.7 is available.

@ElliotNB
Copy link
Author

ElliotNB commented Mar 3, 2023

No luck :( I compiled PHP with clang, restarted Apache and had a segmentation fault within ~30 minutes.

@nielsdos
Copy link
Member

nielsdos commented Mar 3, 2023

The error_log issue is probably worth reporting separately yeah. I would expect clang vs gcc to make no difference.
For the asan issue, try setting LDFLAGS="-lasan" upon configuring and compiling.

@ElliotNB
Copy link
Author

ElliotNB commented Mar 3, 2023

@nielsdos Thanks for the tip on the LDFLAGS -- will give that a try tomorrow AM. For what it's worth, I just tried out PHP 8.2.3 and still had segmentation faults after running ~30 minutes of production traffic.

@ElliotNB
Copy link
Author

ElliotNB commented Mar 3, 2023

@nielsdos Added the LDFLAGS flag prior to the ./configure line. Compiles fine (just as before), but still has the same error unfortunately upon starting up httpd. Will keep trying to get the --enable-address-sanitizer option enabled.

@nielsdos
Copy link
Member

nielsdos commented Mar 3, 2023

That's unfortunate. Thanks for keeping trying.
Next time you're able to get the crash in gdb, can you not only show the faulting line, but also do:
layout asm? That's so we can see the assembly code that faulted (and the assembly around that).
And also info registers so we can see the registers, which helps analyze the assembly.

EDIT: there's a PR open that does something with LDFLAGS, maybe that's related #10678 to the linker issue

nielsdos added a commit to nielsdos/php-src that referenced this issue Mar 4, 2023
Some tests fail if the error_log is overriden by the loaded ini
configuration. Explicitly set it to an empty value to prevent the
failures.
See php#10737 (comment)
@ElliotNB
Copy link
Author

ElliotNB commented Mar 4, 2023

@nielsdos I haven't made much new progress today, but I did make an interesting discovery. If I deploy PHP 7.3.27 with the same configuration that we've used on our older fleet of web servers, it also incurs a segmentation fault. So whatever is causing these segfaults, it's nothing new to PHP -- 7.3.27 segfaults, 8.1.16 segfault and 8.2.3 segfaults.

I'm scratching my head because we're experiencing these segfaults in a very "vanilla" environment. A KVM virtual machine running CentOS 8 Stream (4.18.0-408.el8.x86_64 #1 SMP with standard repos and updated to the latest), Apache 2.4.54, PostgreSQL 11 and memcached. That's it!

@ElliotNB
Copy link
Author

ElliotNB commented Mar 4, 2023

@nielsdos Here's the gdb info on a segfault with PHP 7.3.27 -- the line numbers are slightly different, but the filename and line of code for where the segfault occurs remains the same as before.

I captured the layout asm and info registers information you were looking for: https://pastebin.com/raw/4fykKB8t

It's hard to see in that text output, but <php_handler+130> is the line highlighted by the layout asm command.

All of that is pretty much gibberish to me, but does it have some meaning to you? :)

@nielsdos
Copy link
Member

nielsdos commented Mar 4, 2023

Yes that is actually very very helpful, thank you very much!
It faults at mov (%rax),%rdx which makes sense because rax is 0, so it derefs a null pointer. Looking at rbx, which should contain the return value of tsrm_get_ls_cache(), that is also null, so that's probably the issue. Interestingly I did find something that sounds very related online: https://stackoverflow.com/questions/56838536/call-to-get-resource-in-php-7-3-4-intermittently-returns-nullptr
I'm going to spend some more time today to try figure out what is going on.

@nielsdos
Copy link
Member

nielsdos commented Mar 4, 2023

I managed to reproduce the failure on my system, by creating lots and lots of concurrent requests. It also seems to help to stretch the execution of the php script with a busy-wait forloop to increase the chance of a concurrent request.
I'll keep this issue report updated with new information I get.

@nielsdos
Copy link
Member

nielsdos commented Mar 4, 2023

@ElliotNB Can you please try to add the following flag to what you already have for your configure:
--disable-zend-signals

And then compile PHP and see if you still get segfaults.
Thanks in advance!

@ElliotNB
Copy link
Author

ElliotNB commented Mar 4, 2023

@nielsdos Wow, that's incredible news! Yes, I will recompile with that option and report back right away. With our production traffic sometimes it takes the seg faults just over an hour to show up so it'll probably be about an hour and a half before I can confirm if they're gone. Thank you!

@ElliotNB
Copy link
Author

ElliotNB commented Mar 4, 2023

@nielsdos Unfortunately I just had a segfault. The gdb output appears pretty similar to what I saw before, but here's an updated output: https://pastebin.com/raw/AexnR52Q

Here's the system and compile info:

PHP 7.3.27
CentOS 8 Stream
Linux 4.18.0-408.el8.x86_64 #1 SMP Mon Jul 18 17:42:52 UTC 2022 x86_64
./configure --prefix=/usr/local --enable-fpm --disable-fileinfo --enable-bcmath --enable-calendar --with-libxml --enable-soap --enable-mbstring --enable-pdo --enable-sockets --with-zip --with-apxs2=/usr/local/apache/bin/apxs --with-bz2 --with-curl=/usr/local --with-gettext --with-libdir=lib64 --with-openssl=/usr --with-openssl-dir=/usr --with-pdo-pgsql=/usr/pgsql-11 --with-pgsql=/usr/pgsql-11 --with-pic --with-zlib --with-zlib-dir=/usr --enable-opcache --enable-debug --enable-gd --with-jpeg --disable-zend-signals PKG_CONFIG_PATH=:/usr/local/lib/pkgconfig:/usr/local/bin/pkgconfig

Any other debug info you'd like me to capture?

It also seems to help to stretch the execution of the php script with a busy-wait forloop to increase the chance of a concurrent request.

Your observation aligns very well with our production traffic. Part of our production traffic involves data ingestion for an ETL process. Inbound POST requests with a large JSON payload that we iterate over and update a database. That might explain why we're triggering these segfaults so easily while, presumably, it doesn't happen as much for others. I'm curious why we weren't seeing these segfaults on our older fleet of servers with PHP 7.3.27 -- those servers were running CentOS 6.10, Apache 2.4.39 and PHP 7.3.27 (with essentially the same configure options). I looked at the error logs on our older fleet and we've had maybe 4 segfaults over the past year as opposed to one segfault every 25-45 minutes.

@nielsdos
Copy link
Member

nielsdos commented Mar 4, 2023

Thanks for trying, that's unfortunate. I'm kinda out of ideas for now to be honest.
It looks like some sort of race conditions. Disabling zend signals worked for me because I seem to have hit another, but similar issue.

The issue for me was that the signal was racy delivered to a thread that hadn't initialised the tsrm properly yet. And the following patch (on top of PHP-8.1) fixed that for me:

diff --git a/TSRM/TSRM.c b/TSRM/TSRM.c
index 7cd924318e..962a295555 100644
--- a/TSRM/TSRM.c
+++ b/TSRM/TSRM.c
@@ -113,7 +113,7 @@ static pthread_key_t tls_key;
 #endif
 
 TSRM_TLS uint8_t in_main_thread = 0;
-TSRM_TLS uint8_t is_thread_shutdown = 0;
+TSRM_TLS uint8_t is_thread_shutdown = 1;
 
 /* Startup TSRM (call once for the entire process) */
 TSRM_API int tsrm_startup(int expected_threads, int expected_resources, int debug_level, const char *debug_filename)

Both of our issues stem from the fact that TSRMLS_CACHE is NULL. However, your issue occurs in the handler itself somehow, which shouldn't be possible.

You may try with the patch above, although I doubt that's gonna make a difference. When you get a crash again you could try the following GDB commands to get more debug info:
p _tsrm_ls_cache
p is_thread_shutdown
p in_main_thread

@ElliotNB
Copy link
Author

ElliotNB commented Mar 4, 2023

@nielsdos Here's two sets of gdb output -- the first set is for a segfault on PHP 7.3.27 and the second is a segfault on PHP 8.1.16 (--disable-zend-signals for both):

gdb output with PHP 7.3.27 and --disable-zend-signals: https://pastebin.com/raw/3cFYHksi
gdb output with PHP 8.1.16 and --disable-zend-signals: https://pastebin.com/raw/a2XPmyDy

There's a difference in those three variables between the PHP versions:

PHP 7.3.27:

(gdb) p _tsrm_ls_cache
$1 = (void *) 0x0
(gdb) p is_thread_shutdown
No symbol "is_thread_shutdown" in current context.
(gdb) p in_main_thread
$2 = 0 '\000'
(gdb)

PHP 8.1.16:

(gdb) p _tsrm_ls_cache
$1 = (void *) 0x0
(gdb) p is_thread_shutdown
$2 = 0 '\000'
(gdb) p in_main_thread
$3 = 0 '\000'
(gdb)

Next, I'll try out your TSRM.c patch on 8.1.16 and see if it resolves the segfaults. Thanks again for digging so deep into this!

@nielsdos
Copy link
Member

nielsdos commented Mar 4, 2023

Also thank you for putting all that effort in testing this and following up!
The difference, i.e. that the shutdown variable doesn't exist on 7.3, is expected.
So what I suspected is correct: _tsrm_ls_cache is NULL, while shutdown is 0, which should be impossible.
I'll also think about it more tomorrow...

@nielsdos
Copy link
Member

@ElliotNB Yep that's the old commit, so I think that's the issue. It's my mistake really... I noticed I had a mistake which could cause a UAF crash, and I thought I could just quickly force push the fix because I thought you wouldn't have checked out yet.
Lesson learned, always notify when pushing. Sorry for the inconvenience.

@nielsdos
Copy link
Member

@ElliotNB Actually, looking more at it, that commit should work too I think. Is it possible to give a GDB backtrace of the crash?
I'll mark the PR as draft for now.

@ElliotNB
Copy link
Author

@nielsdos No worries! I'm going to have to restart Apache anyway to grab a gdb trace so I'll just compile the most recent code from your fork/branch and test it. This time around I'm going to wait at least two hours before giving the 'all clear' so it'll be a while before I post an update.

@ElliotNB
Copy link
Author

ElliotNB commented Mar 16, 2023

@nielsdos Just had a segfault with the code taken from your fork/branch, here are the details: https://pastebin.com/raw/pAzg2eSB Looks a little different than what I've seen before.

I used the same ./configure line as before:

./configure --prefix=/usr/local --enable-fpm --disable-fileinfo --enable-bcmath --enable-calendar --with-libxml --enable-soap --enable-mbstring --enable-pdo --enable-sockets --with-zip --with-apxs2=/usr/local/apache/bin/apxs --with-bz2 --with-curl=/usr/local --with-gettext --with-libdir=lib64 --with-openssl=/usr --with-openssl-dir=/usr --with-pdo-pgsql=/usr/pgsql-11 --with-pgsql=/usr/pgsql-11 --with-pic --with-zlib --with-zlib-dir=/usr --enable-opcache --enable-debug --enable-gd --enable-debug-assertions --disable-zend-signals --with-jpeg PKG_CONFIG_PATH=:/usr/local/lib/pkgconfig:/usr/local/bin/pkgconfig

@ElliotNB
Copy link
Author

ElliotNB commented Mar 17, 2023

@nielsdos I'm currently testing this patch #10737 (comment) -- no segfaults yet after 60 minutes, but I'm going to let it run for a couple hours longer.

Edit: still no segfault at the two hour mark, but I'll let it run overnight just to be sure.

@nielsdos
Copy link
Member

@ElliotNB Thanks! With the information you gave I was able to find and reproduce the issue with the branch.
The bug was another pre-existing issue in which the dependency order of modules during destruction wasn't respected. I changed the destruction routine to destroy in reverse order, which fixes the problem.

I pushed the commit which fixes the issue on my branch. You should be able to pull from it, and recompile as usual.
Could you please test whether it fixed the segfault? Thanks in advance! :)

Branch URL is still the same: https://github.com/nielsdos/php-src/commits/fix-tsrm-apache

Also: if you haven't already, you should probably close the issue on the Apache bugtracker.

@ElliotNB
Copy link
Author

@nielsdos Sounds good, I'll give that a whirl this morning. The other patch I was testing also segfaulted, but it took a long time for the segfaults to show up. Let me know if you want to see that gdb output too.

@nielsdos
Copy link
Member

@ElliotNB Thanks! Yes, maybe it's best if I also see those GDB reports to at least know if it's not another underlying issue.

@ElliotNB
Copy link
Author

ElliotNB commented Mar 17, 2023

@nielsdos Here's the gdb on the segfault from this ( #10737 (comment) ) patch: https://pastebin.com/raw/yXmENG7A -- I captured ~7 core dumps overnight and they all appeared to be the same.

Currently testing the revised code, I'll let it run for at least a couple hours (unless I get a segfault sooner).

nielsdos added a commit to nielsdos/php-src that referenced this issue Mar 17, 2023
Fixes phpGH-8789.
Fixes phpGH-10015.

This is one small part of the underlying bug for phpGH-10737, as in my
attempts to reproduce the issue I constantly hit this crash easily.
(The fix for the other underlying issue for that bug will follow soon.)

It's possible that a signal arrives at a thread that never handled a PHP
request before. This causes the signal globals to dereference a NULL
pointer because the TSRM pointers for the thread aren't set up to point
to the thread resources yet.

PR phpGH-9766 previously fixed this for master by ignoring the signal if
the thread didn't handle a PHP request yet. While this fixes the crash
bug, I think the solution is suboptimal for 3 reasons:

1) The signal is ignored and a message is printed saying there is a bug.
   However, this is not a bug at all. For example in Apache, the signal
   set up happens on child process creation, and the thread resource
   creation happens lazily when the first request is handled by the
   thread. Hence, the fact that the thread resources aren't set up yet
   is not actually buggy behaviour.

2) I believe since it was believed to be buggy behaviour, that fix was
   only applied to master, so 8.1 & 8.2 keep on crashing.

3) We can do better than ignoring the signal. By just acting in the
   same way as if the signals aren't active. This means we need to
   take the same path as if the TSRM had already shut down.
@ElliotNB
Copy link
Author

ElliotNB commented Mar 17, 2023

@nielsdos 2.5 hours of no segfaults using your branch/fork, I think it's safe to call the 'all clear' on this one! My apologies for the false positive earlier. I'm going to check back in another couple hours just in case there's a latent segfault, but we're having peak middle of the day traffic and no segfaults for 2.5 hours seems very safe.

@nielsdos
Copy link
Member

@ElliotNB thanks for reporting back. Let's indeed wait a bit longer to see for sure :)

@ElliotNB
Copy link
Author

ElliotNB commented Mar 17, 2023

@nielsdos Almost 5 hours without a segfault, I'll keep watching but I'd be shocked if one showed up at this point.

Edit: 11 hours and still no segfaults 👍

nielsdos added a commit that referenced this issue Mar 18, 2023
…bals

Fixes GH-8789.
Fixes GH-10015.

This is one small part of the underlying bug for GH-10737, as in my
attempts to reproduce the issue I constantly hit this crash easily.
(The fix for the other underlying issue for that bug will follow soon.)

It's possible that a signal arrives at a thread that never handled a PHP
request before. This causes the signal globals to dereference a NULL
pointer because the TSRM pointers for the thread aren't set up to point
to the thread resources yet.

PR GH-9766 previously fixed this for master by ignoring the signal if
the thread didn't handle a PHP request yet. While this fixes the crash
bug, I think the solution is suboptimal for 3 reasons:

1) The signal is ignored and a message is printed saying there is a bug.
   However, this is not a bug at all. For example in Apache, the signal
   set up happens on child process creation, and the thread resource
   creation happens lazily when the first request is handled by the
   thread. Hence, the fact that the thread resources aren't set up yet
   is not actually buggy behaviour.

2) I believe since it was believed to be buggy behaviour, that fix was
   only applied to master, so 8.1 & 8.2 keep on crashing.

3) We can do better than ignoring the signal. By just acting in the
   same way as if the signals aren't active. This means we need to
   take the same path as if the TSRM had already shut down.

Closes GH-10861.
@nielsdos
Copy link
Member

Assuming no new failures happened, I'll mark my PR as ready tonight.
My PR contains one additional commit now: there was one more place where the destruction happened in the wrong order. That bug is apparantly hard to trigger as it's been years like that and no one complained. Still, better to fix it to avoid problems.

@ElliotNB
Copy link
Author

ElliotNB commented Mar 20, 2023

@nielsdos No segfaults over the weekend either 👍

That bug is apparantly hard to trigger as it's been years like that and no one complained. Still, better to fix it to avoid problems.

Yeah I've been wondering if there's something unique about our applications and/or usage that's made it relatively easy for us to re-pro these segfaults that not many others appear to have reported. As far as I know our stack is pretty common (KVM guest VMs running CentOS 8 Stream, Apache 2.4.54, PHP 8.1.16, PostgreSQL 11, memcache).

@nielsdos
Copy link
Member

@ElliotNB Thanks for reporting back. That convinces me the bug is really gone.

As for the reproduction environment: it looks indeed like a normal setup. I don't know what causes this for you. I know of one other bug report that I'm now pretty convinced it's the same bug as you were experiencing and they also didn't have a special setup. I tried a lot of different things but still to this day I'm unable to reproduce the issue naturally, I can only reproduce it artificially by using an interposer to modify the thread ID.

@ElliotNB
Copy link
Author

ElliotNB commented Mar 22, 2023

@nielsdos I've not seen any segfaults after several days now, but I am seeing something unusual in my Apache error_log that I've not seen before. Our Apache web servers restart nightly for log rotation. When those restarts occur, I see these lines logged in our Apache error_log: https://pastebin.com/raw/9CwYZzhC

Maybe it's just from the newer Apache version, but I thought I'd mention it in case you think that your changes might have an impact on the ability of child processes to exit gracefully when the httpd service is issued the restart command. I've also never seen those memory leak warnings before either.

@nielsdos
Copy link
Member

@ElliotNB this looks unrelated.

The reason you're seeing leak warnings is because the PHP module is forcefully shut down. The reason you haven't seen the leaks before is because you seem to be running a debug PHP build right now, which will report leaks while a production build will not.

And the reason it's forcefully shut down is because the children didn't exit in time. I believe this could also be because you're running a PHP debug build: they are a lot slower than production builds.

You can create a production build of PHP by removing --enable-debug from your configure command line.

If you still experience the message about children that did not exit in time after switching to a production build, then it is simply because you have long running scripts and Apache wants to restart and not wait until the scripts are finished. In this case it's due to the way current versions of Apache work.

@ElliotNB
Copy link
Author

Ahh that makes sense -- I forgot I had added --enable-debug as a part of all this troubleshooting. Thanks for pointing that out 👍

nielsdos added a commit that referenced this issue Apr 8, 2023
* PHP-8.1:
  Fix GH-10737: PHP 8.1.16 segfaults on line 597 of sapi/apache2handler/sapi_apache2.c
nielsdos added a commit that referenced this issue Apr 8, 2023
* PHP-8.2:
  Fix GH-10737: PHP 8.1.16 segfaults on line 597 of sapi/apache2handler/sapi_apache2.c
@ElliotNB
Copy link
Author

ElliotNB commented Apr 11, 2023

@nielsdos

Woohoo! It's closed! Thanks again for all your help on this 👍

I want to credit you in the changelog (aka NEWS file in repository root) when we finally merge the PR. The NEWS file always mentions the people who worked on the issue, and since I couldn't have fixed it without you constantly testing and helping debugging, I think it's only fair if you are mentioned as well. So I wanted to ask for your permission to mention your name as a way to give credit. :)

If it's not too late it would be awesome if you could toss my username in the NEWS file, thank you! 🕺 🥇

@nielsdos
Copy link
Member

Ah right! Sorry I forgot about this, I'll set this right.

@nielsdos
Copy link
Member

@ElliotNB Done!

nielsdos added a commit that referenced this issue Apr 24, 2023
Some tests fail if the error_log is overriden by the loaded ini
configuration. Explicitly set it to an empty value to prevent the
failures.
See #10737 (comment)

Closes GH-10772.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants