Hung apaches on pthread wrlocks #19
Could we know a bit about the software you are running, or anything else about your setup that might help to reproduce the same behaviour ... frame appears significant, but it's difficult to guess what has caused it ... |
@anoakie ping ... any update on this, a repro script maybe? |
We tried with both PHP 5.5 RC3 and PHP 5.5.0 final. It's quite hard to reproduce. It happens to random servers in our apache cluster during heavy load every few days. |
@anoakie alternative https://github.com/laruence/yac |
@anoakie it's been a while ... I've got to assume you still have the error (I think others have mentioned the same thing) ... without a way to reproduce I'm not able to debug ... it may be that your operating system's default rwlock is not suitable in some way (which we'll eventually find and fix when we can reproduce reliably and determine that to be the definite cause) ... the only workaround I can suggest to get things going is to disable the use of rwlocks with --disable-apcu-rwlocks on configure for apcu ... |
We're experiencing what I think is this problem as well - all our Apache children hang, and a GDB trace shows this:
This is on a not-particularly-busy server. There are a couple of places in our code where we write to APCu, and various Apache children are stuck writing various different keys (i.e. they're at different points in our code, but all are hung when writing to APCu). I'm guessing something has obtained the futex and not released it for some reason, but I'm not sure how to find out what. Here's the APC bit from "php -i" (I'm pretty sure we're using the same config in Apache as on the CLI):
I believe I have managed to reproduce this issue with a test script and some load injection. I've uploaded the test script and instructions here: https://github.com/rathers/apcu-repro Essentially the test script does a load of apc_store and apc_fetch calls against the same key. Loading this at increasing rates eventually triggers the number of apache children to spiral out of control and lock up. |
@rathers Thanks for your effort ... got some time this evening to have a look, think we got it ... can I get some feedback ?? |
Hi, I'm trying to get a handle on what has changed in that diff to reason about how the fix works. As far as I can tell, it:
Under what circumstances is the cache considered "busy"? |
The eval serializer is unrelated, it was out in the wild for testing, and caused more harm than good ... so it's gotta go ... The cache is considered busy during gc, which is invoked implicitly throughout, sometimes by apc itself, sometimes by the allocator underneath ... allowing php to call a function that operates on a busy cache is futile ... The thing that fixes the problem with apache is utilizing [un]block interruptions on locks, just like APC did ... |
@krakjoe, thanks for the commit. We've rebuilt apcu including this commit and deployed it to a test box. It hasn't made any difference :( We're still seeing the same locking problem, reproducible in exactly the same way as before. Any further ideas to try out? |
Damn ... I have that pulse you get under your eye when things get a bit stressful ... I'm a bit baffled, let me have some thinking time ... I'm not able to make apache spiral out of control anymore, the number of processes remains steady and apache remains responsive but slow ... |
What version of apache are you using? We're still on 2.2 which may or may not make a difference! |
try now ?? bit of a mistake on my part there, and an omission ... this has got to work ... |
here are the options I am running http_load with:
I've run it several times (lots and lots and still am) like that without a problem ... is that enough ?? |
That looks a lot higher than I've managed to achieve. How beefy is your box? Try it without -parallel as I'm not sure which would take precedence. We're just redeploying APCu with your latest patch then will try again |
8 (4+HT) cores @ 3.4GHz with 16GB DDR3 ... when I run it more aggressively (x10) the number of processes spawned to handle requests does shoot up, but they are all, eventually, shut down gracefully and so cannot be blocking waiting to acquire a mutex ... |
@rathers I can't stand the suspense ?? |
I'm starting to get very confused. I think there may be two things at play here: 1 - apache lockups (2) is still happening, even on the latest build. It is possible we only ever encountered (1) because (2) can occur, I'm really not sure. If we set apc.ttl and apc.gc_ttl to 0 (as opposed to some number) then (1) doesn't happen. The apache procs will recover and reduce in number if we remove the load. I've been trying to think what would cause (2) and wonder if there is some hard limit in APC/PHP/Apache/Linux somewhere that limits how many locks can be processed in a second. Do you think the kernel could have anything to do with it? Our test boxes are fairly old (CentOS 5.9) with a 2.6.18 kernel. WDYT? |
This is APC:
This is APCu:
They both behave in the same way with regard to apache processes now: lots are created but eventually shut down. I don't think there is a bug present anymore, have you tried comparing behaviour with normal APC ?? |
This is APCu now:
something does smell a bit fishy about that result ... I'm still looking, haven't given up ... |
When you're comparing APC and APCu are they using different PHP versions? |
The apache process explosion thing is quite subtle and hard to explain, but using -rate 100 won't reveal it as it's too heavyweight. I found there was an implicit maximum rate that could be sustained (in our case around -rate 3 or 4) with apache behaving perfectly normally, with just a few busy procs and response times constant at around 300ms. Then just by injecting a few manual requests with curl (and I really do mean a few, like 3 or 4!) the whole thing explodes. Response times increase into multiple seconds and beyond, and apache spawns procs until it hits MaxClients. This just from a few extra requests! Never seen apache behave like that before. Depending on the setup (APCu version, compile settings, ini settings etc.) the apache procs may or may not recover. |
Still persists
I've experienced the same issue, but found the cause (at least in my case). At one point I called apc_cache_info() at a time when the result of that call was too big to fit in PHP's memory_limit. This gave me a "500 Internal Server Error", which seems acceptable, BUT it failed to release the lock. Therefore subsequent calls, in that or other apache child processes, all waited for a lock that was never going to be released. This does seem like a bug in my humble opinion: the system should never end up in this state, no matter what flaw the PHP programmer might introduce. One thing I'd like to suggest is that apc_cache_info() returns something like FALSE if the memory_limit prevents the APC system from returning the requested data. PS: to whoever needs a quick fix: I simply raised the memory_limit in the script that called apc_cache_info(), and I haven't seen the problem since. I use 4.0.10 |
Has anyone looked into this issue recently? I was able to reproduce this in the same manner @jr997 mentioned using 4.0.10. It looks like a fix would be as simple as allocating all necessary memory for the cache_info function prior to taking this lock https://github.com/krakjoe/apcu/blob/PHP5/apc_cache.c#L1539. |
The same problem on php7.0.5 and apcu 5.1.2 |
@muxx do you have a script that shows how you reproduce it with PHP 7? I was trying to do it the way it locks up with PHP 5.6 but was not able to. |
@Zlender Unfortunately I don't have a script that can reproduce this case :( The problem appears under a workload of ~20-30 rps and can trigger after one hour, three hours, or five hours of running. At 20-30 rps it is hard to tell which particular request causes the problem. |
Instructions on how to reproduce this with PHP 5.6: https://github.com/Zlender/apcu_19 A similar approach also works on the latest PHP 7 and apcu 5. |
@krakjoe I think that fix is too narrow? Similar code is in apc_cache.c/apc_cache_stat (APC_RLOCK, then array_init()) and several places in apc_iterator.c. Shouldn't basically all parts after any APC_RLOCK (and possibly APC_LOCK, too) be guarded by zend_try? |
@sarumpaet I think you are right, after a quick review I've added some more tries ... I'm looking for feedback now on stability ... hopefully if nothing bad happens @remicollet can do a release in the coming days ... |
No problem, just ping me when ready. |
I don't know how the zend_try mechanism works exactly - is it possible that some of the functions return ill defined values now in the out of memory case? E.g., |
Fixed that on last commit, forgot to tag issue ... |
Crap ... |
Hi, Will there be a 4.0.12 release? I can see that the fix was already backported in the PHP5 branch but I don't know when it'll be released. |
We seem to be getting hung apaches w/PHP 5.5.0 RC3 w/APCu.
0x00007f1a9572853d in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007f1a9572853d in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f1a90debd49 in apc_lock_wlock (lock=) at /home/phpbuild/buildphp-5.5.0RC3/apcu/apc_lock.c:129
#2 0x00007f1a90df05b9 in apc_cache_insert (cache=0x7f1a970d2960, key=..., value=0x7f1a83591880, ctxt=0x7fff1c983920, t=1371246459, exclusive=0 '\000') at /home/phpbuild/buildphp-5.5.0RC3/apcu/apc_cache.c:837
#3 0x00007f1a90df0b85 in apc_cache_store (cache=0x7f1a970d2960, strkey=0x7f1a9778ccc0 "AutoLoad::prefix::6::xxxxxxxxxxxxxxxxxxx::xxxxxxxxxxxxxxxxxx", keylen=52, val=0x7f1a977b5238, ttl=600, exclusive=0 '\000')
#4 0x00007f1a90dedb5b in apc_store_helper (ht=, return_value=0x7f1a977b5208, exclusive=0 '\000', return_value_ptr=, this_ptr=, return_value_used=)
#5 0x00007f1a93f80de6 in zend_do_fcall_common_helper_SPEC (execute_data=0x7f1a96df65b8) at /home/phpbuild/buildphp-5.5.0RC3/php-src-php-5.5.0RC3/Zend/zend_vm_execute.h:543
#6 0x00007f1a93f450d8 in execute_ex (execute_data=0x7f1a96df65b8) at /home/phpbuild/buildphp-5.5.0RC3/php-src-php-5.5.0RC3/Zend/zend_vm_execute.h:356
#7 0x00007f1a93ec8b70 in zend_call_function (fci=0x7fff1c983cb0, fci_cache=) at /home/phpbuild/buildphp-5.5.0RC3/php-src-php-5.5.0RC3/Zend/zend_execute_API.c:939
#8 0x00007f1a93eee3d7 in zend_call_method (object_pp=0x0, obj_ce=, fn_proxy=0x7f1a977137e0, function_name=0x7f1a977136a8 "autoload::loadone", function_name_len=,
#9 0x00007f1a93dd8507 in zif_spl_autoload_call (ht=, return_value=, return_value_ptr=, this_ptr=, return_value_used=)
#10 0x00007f1a93ec8b14 in zend_call_function (fci=0x7fff1c983fc0, fci_cache=0x7fff1c984010) at /home/phpbuild/buildphp-5.5.0RC3/php-src-php-5.5.0RC3/Zend/zend_execute_API.c:957
#11 0x00007f1a93ec93fb in zend_lookup_class_ex (name=0x7f1a96f7d988 "xxxxxxxxx", name_length=9, key=0x7f1a977b52b0, use_autoload=1, ce=0x7fff1c9840d8)
#12 0x00007f1a93ec9b22 in zend_fetch_class_by_name (class_name=0x7f1a96f7d988 "xxxxxxxxx", class_name_len=, key=, fetch_type=4)
#13 0x00007f1a93f12c1a in ZEND_FETCH_CLASS_SPEC_CONST_HANDLER (execute_data=0x7f1a96df4ed8) at /home/phpbuild/buildphp-5.5.0RC3/php-src-php-5.5.0RC3/Zend/zend_vm_execute.h:1188
#14 0x00007f1a93f450d8 in execute_ex (execute_data=0x7f1a96df4ed8) at /home/phpbuild/buildphp-5.5.0RC3/php-src-php-5.5.0RC3/Zend/zend_vm_execute.h:356
#15 0x00007f1a93ed83b3 in zend_execute_scripts (type=8, retval=0x0, file_count=3) at /home/phpbuild/buildphp-5.5.0RC3/php-src-php-5.5.0RC3/Zend/zend.c:1316
#16 0x00007f1a93e7661c in php_execute_script (primary_file=0x7fff1c9865b0) at /home/phpbuild/buildphp-5.5.0RC3/php-src-php-5.5.0RC3/main/main.c:2481
#17 0x00007f1a93f8408d in php_handler (r=0x7f1a960f70a0) at /home/phpbuild/buildphp-5.5.0RC3/php-src-php-5.5.0RC3/sapi/apache2handler/sapi_apache2.c:667
#18 0x00007f1a96239508 in ap_run_handler ()
#19 0x00007f1a9623997e in ap_invoke_handler ()
#20 0x00007f1a96249570 in ap_process_request ()
#21 0x00007f1a96246398 in ?? ()
#22 0x00007f1a9623ffa8 in ap_run_process_connection ()
#23 0x00007f1a9624e1d0 in ?? ()
#24 0x00007f1a9624e93a in ?? ()
#25 0x00007f1a9624f4e7 in ap_mpm_run ()
#26 0x00007f1a962244a4 in main ()