New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High wall time on apc_* functions #106

Closed
itarchmerc opened this Issue Jan 15, 2015 · 15 comments

Comments

Projects
None yet
4 participants
@itarchmerc

itarchmerc commented Jan 15, 2015

I am troubleshooting an issue where our PHP site starts to slow down after about an hour of running a load test. In looking at the xhprof profiles, the calls taking the time are the apc_* functions. We are running the following:

ubuntu 14.04
apache 2.4.7
php 5.5.9
apcu 4.0.7

Our apcu settings are:

apc.coredump_unmap 0
apc.enable_cli 0
apc.enabled 1
apc.entries_hint 4096
apc.gc_ttl 3600
apc.mmap_file_mask /apc.shm.8C5Mjz
apc.preload_path
apc.rfc1867 0
apc.rfc1867_freq 0
apc.rfc1867_name APC_UPLOAD_PROGRESS
apc.rfc1867_prefix upload_
apc.rfc1867_ttl 3600
apc.serializer php
apc.shm_segments 1
apc.shm_size 1024M
apc.slam_defense 0
apc.smart 0
apc.ttl 0
apc.use_request_time 1
apc.writable /tmp

Beyond seeing that these functions are taking most of the wall time, I'm unable to figure out what might be misconfigured. Any ideas on the possible cause, or what other information I could gather to help diagnose this issue? Thanks.

@lieut-data

This comment has been minimized.

Show comment
Hide comment
@lieut-data

lieut-data Jan 15, 2015

Not sure if this helps, but we recently discovered that if apc.ttl is set to some non-zero value, any cache insert operation triggers a removal of items already in the cache that hash to the same value and that haven't been accessed within apc.ttl seconds. In our case, there was plenty of memory for the insert, so we were surprised to find random cache entries ejected just because they hadn't been accessed recently.

I share this because you have apc.ttl already set to 0 and a significant shared memory allocation of 1024M but a relatively low apc.entries_hint of 4096. Is there a chance you are filling your cache with far more than 4096 entries? Since the hash collision purging won't occur with this configuration (unless you started to run out of memory), I could imagine there's a chance that APCu might be spending more and more time traversing ever growing linked lists when querying the cache. If so, you might try increasing apc.entries_hint to better represent the number of entries you expect to insert into the cache.

(Of course, your issue may be entirely unrelated!)

lieut-data commented Jan 15, 2015

Not sure if this helps, but we recently discovered that if apc.ttl is set to some non-zero value, any cache insert operation triggers a removal of items already in the cache that hash to the same value and that haven't been accessed within apc.ttl seconds. In our case, there was plenty of memory for the insert, so we were surprised to find random cache entries ejected just because they hadn't been accessed recently.

I share this because you have apc.ttl already set to 0 and a significant shared memory allocation of 1024M but a relatively low apc.entries_hint of 4096. Is there a chance you are filling your cache with far more than 4096 entries? Since the hash collision purging won't occur with this configuration (unless you started to run out of memory), I could imagine there's a chance that APCu might be spending more and more time traversing ever growing linked lists when querying the cache. If so, you might try increasing apc.entries_hint to better represent the number of entries you expect to insert into the cache.

(Of course, your issue may be entirely unrelated!)

@itarchmerc

This comment has been minimized.

Show comment
Hide comment
@itarchmerc

itarchmerc Jan 15, 2015

Thanks for the information. I looked at the cache after the load test completed and we have 2512 cached variables using 185MB. However, if I look at the detailed memory usage it says Fragmentation: 44.32% (371.6 MBytes out of 838.4 MBytes in 2676 fragments). I tried to set the memory allocation lower, but then I kept hitting cache full and the site got even worse. Here is the current cache information:

Cached Variables 2512 (185.5 MBytes)
Hits 62627113
Misses 993319
Request Rate (hits, misses) 1154.05 cache requests/second
Hit Rate 1136.03 cache requests/second
Miss Rate 18.02 cache requests/second
Insert Rate 101.98 cache requests/second
Cache full count 0

Any other ideas? Thanks!

itarchmerc commented Jan 15, 2015

Thanks for the information. I looked at the cache after the load test completed and we have 2512 cached variables using 185MB. However, if I look at the detailed memory usage it says Fragmentation: 44.32% (371.6 MBytes out of 838.4 MBytes in 2676 fragments). I tried to set the memory allocation lower, but then I kept hitting cache full and the site got even worse. Here is the current cache information:

Cached Variables 2512 (185.5 MBytes)
Hits 62627113
Misses 993319
Request Rate (hits, misses) 1154.05 cache requests/second
Hit Rate 1136.03 cache requests/second
Miss Rate 18.02 cache requests/second
Insert Rate 101.98 cache requests/second
Cache full count 0

Any other ideas? Thanks!

@teo1978

This comment has been minimized.

Show comment
Hide comment
@teo1978

teo1978 Nov 2, 2015

I have what seems to be the same issue.

How is it possible that there's been no response (from the devs) in almost a year on such a hugely critical issue??

teo1978 commented Nov 2, 2015

I have what seems to be the same issue.

How is it possible that there's been no response (from the devs) in almost a year on such a hugely critical issue??

@teo1978

This comment has been minimized.

Show comment
Hide comment
@teo1978

teo1978 Nov 2, 2015

And by the way, @itarchmerc did you ever find any workaround?

teo1978 commented Nov 2, 2015

And by the way, @itarchmerc did you ever find any workaround?

@krakjoe

This comment has been minimized.

Show comment
Hide comment
@krakjoe

krakjoe Nov 3, 2015

Owner

How is it possible that there's been no response (from the devs) in almost a year on such a hugely critical issue??

2 people came along and mentioned they were having problems on one day in January.

They didn't provide me nearly enough information to do anything about it, no reproducing code at all.

I can't guess why something behaves the way it does, when you report a bug anywhere, you need to include enough information to reproduce the problem. Very rarely is providing configuration, or statistics, enough to determine the cause of a problem.

A good description of what is required to actually help with a problem is here: http://stackoverflow.com/help/mcve

APCu is not my favourite thing, so shoot me ... given enough information, I always look, when schedule allows ...

Owner

krakjoe commented Nov 3, 2015

How is it possible that there's been no response (from the devs) in almost a year on such a hugely critical issue??

2 people came along and mentioned they were having problems on one day in January.

They didn't provide me nearly enough information to do anything about it, no reproducing code at all.

I can't guess why something behaves the way it does, when you report a bug anywhere, you need to include enough information to reproduce the problem. Very rarely is providing configuration, or statistics, enough to determine the cause of a problem.

A good description of what is required to actually help with a problem is here: http://stackoverflow.com/help/mcve

APCu is not my favourite thing, so shoot me ... given enough information, I always look, when schedule allows ...

@teo1978

This comment has been minimized.

Show comment
Hide comment
@teo1978

teo1978 Nov 3, 2015

What is "enough information"?
It is pretty clear from the nature of the issue that providing "reproducing code" is nearly impossible.

Finding a reproducing scenario in this case is almost equivalent to finding the root cause of the bug.

So, given you show little interest in the issue, it's clear it's going to stay unfixed for ages.
Can you (or anybody reading this) recommend me an alternative cache (pun not intended) that can be relied upon?

P.S. if there is some information that I can provide, such as any log or the like, that I can extract from the live system while the issue is happening, or better after it has happened, I'll be happy to provide it.

teo1978 commented Nov 3, 2015

What is "enough information"?
It is pretty clear from the nature of the issue that providing "reproducing code" is nearly impossible.

Finding a reproducing scenario in this case is almost equivalent to finding the root cause of the bug.

So, given you show little interest in the issue, it's clear it's going to stay unfixed for ages.
Can you (or anybody reading this) recommend me an alternative cache (pun not intended) that can be relied upon?

P.S. if there is some information that I can provide, such as any log or the like, that I can extract from the live system while the issue is happening, or better after it has happened, I'll be happy to provide it.

@teo1978

This comment has been minimized.

Show comment
Hide comment
@teo1978

teo1978 Nov 3, 2015

http://stackoverflow.com/help/mcve

Yeah right. That's the dream of every developer. That is just not always (often even remotely) possible.
If you expect to have that for every bug (from the same person who experience it), you are simply never going to fix most of them.

teo1978 commented Nov 3, 2015

http://stackoverflow.com/help/mcve

Yeah right. That's the dream of every developer. That is just not always (often even remotely) possible.
If you expect to have that for every bug (from the same person who experience it), you are simply never going to fix most of them.

@krakjoe

This comment has been minimized.

Show comment
Hide comment
@krakjoe

krakjoe Nov 4, 2015

Owner

Every piece of information you discover, or becomes obvious during your effort to create a MCVE is valuable information.

I'm not saying you will always succeed, but try you must ...

Owner

krakjoe commented Nov 4, 2015

Every piece of information you discover, or becomes obvious during your effort to create a MCVE is valuable information.

I'm not saying you will always succeed, but try you must ...

@teo1978

This comment has been minimized.

Show comment
Hide comment
@teo1978

teo1978 Nov 4, 2015

Meanwhile: I strongly suspect this is some kind of race condition, since I observe this in a scenario where a PHP script does a lot of reads and a few writes; so I guess, when there are many concurrent requests, somebody may shoehow get stuck wating to be able to read what somebody else is writing, who in turn for some reason is also stuck waiting (of course this doesn't make sense unless there's a bug)

teo1978 commented Nov 4, 2015

Meanwhile: I strongly suspect this is some kind of race condition, since I observe this in a scenario where a PHP script does a lot of reads and a few writes; so I guess, when there are many concurrent requests, somebody may shoehow get stuck wating to be able to read what somebody else is writing, who in turn for some reason is also stuck waiting (of course this doesn't make sense unless there's a bug)

@teo1978

This comment has been minimized.

Show comment
Hide comment
@teo1978

teo1978 Nov 9, 2015

Is there at least a way I can enable some sort of logging so that I can attach a log of what's going on?
That is the only way I can help fix this critical issue.
There's no way I (or almost anybody, I would guess) can easily provide a reproducible test case (where "easily" means in a lifetime), as this is almost obviously the result of a chain of locks (if not even a deadlock) between reads and writes.

teo1978 commented Nov 9, 2015

Is there at least a way I can enable some sort of logging so that I can attach a log of what's going on?
That is the only way I can help fix this critical issue.
There's no way I (or almost anybody, I would guess) can easily provide a reproducible test case (where "easily" means in a lifetime), as this is almost obviously the result of a chain of locks (if not even a deadlock) between reads and writes.

@teo1978

This comment has been minimized.

Show comment
Hide comment
@teo1978

teo1978 Nov 9, 2015

Thinking about it, this might be intrinsically inevitable as long as #142 isn't fixed.

teo1978 commented Nov 9, 2015

Thinking about it, this might be intrinsically inevitable as long as #142 isn't fixed.

@krakjoe

This comment has been minimized.

Show comment
Hide comment
@krakjoe

krakjoe Nov 20, 2015

Owner

Try with the latest release on pecl and report back ?

Owner

krakjoe commented Nov 20, 2015

Try with the latest release on pecl and report back ?

@krakjoe

This comment has been minimized.

Show comment
Hide comment
@krakjoe

krakjoe Dec 7, 2015

Owner

Bump, can you try with 4.0.10 or 5.1.2 ?

Owner

krakjoe commented Dec 7, 2015

Bump, can you try with 4.0.10 or 5.1.2 ?

@krakjoe krakjoe added the feedback label Dec 7, 2015

@teo1978

This comment has been minimized.

Show comment
Hide comment
@teo1978

teo1978 Dec 7, 2015

Me? Sorry, we switched to Memcached, and since I only observed the issue on a production server, it's unlikely that I'll have the opportunity to test any time soon.

teo1978 commented Dec 7, 2015

Me? Sorry, we switched to Memcached, and since I only observed the issue on a production server, it's unlikely that I'll have the opportunity to test any time soon.

@krakjoe

This comment has been minimized.

Show comment
Hide comment
@krakjoe

krakjoe Sep 29, 2016

Owner

No feedback, closing ...

Owner

krakjoe commented Sep 29, 2016

No feedback, closing ...

@krakjoe krakjoe closed this Sep 29, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment