Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken 5.1.13 release apc_store() #347

Closed
thg2k opened this issue Nov 20, 2018 · 35 comments

Comments

@thg2k
Copy link

commented Nov 20, 2018

Good morning,

with my utter disappointment release 5.1.13 broke my production. Last working release is 5.1.12 and I checked the diff and saw changes I'm not comfortable with in a patch release.

Are you guys scared of bumping major/minor versions? At least you give a signal that the upgrade should go through test staging before hitting production!

Anyway, this is the problem I'm facing: apc_store() randomly fails in this pre-check code:

    /* try to store something to test if the cache is active */
    $_check = "fail";
    if (!apcu_store('th_cachesite_check', 42) ||
        (($_check = apcu_fetch('th_cachesite_check')) !== 42))
      throw new ThException("Site cache APCu supported but not enabled (check=\"$_check\")");

My setup is non-zts PHP 7.2.12 with apache prefork 2.4.37 and php as shared module.

@nikic

This comment has been minimized.

Copy link
Collaborator

commented Nov 20, 2018

I'm sorry that the last release broke your production setup. The release was not intended to cause any significant behavior changes.

However, in this release the cache slam defense implementation, which has been broken for a while now, has been fixed. I strongly suspect that this is what causes the random apc_store() failures you are seeing. Can you please check whether setting the apc.slam_defense=0 ini setting resolves your issue?

If you can confirm that this is the problem, I suggest that we switch the default of apc.slam_defense to off. While this setting has technically been enabled by default forever, in practice the slam defense was effectively off due to an implementation bug, and at this point it probably makes more sense to default it to off.

/cc @LionsAd

@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2018

Yes, I agree. We can properly re-enable cache defense in 5.2.0 as default with a proper announcement.

Technically apcu_store can fail at any time as there are many ways it can fail, so maybe adding a proper apcu_enabled function is the way to go so that OP can use an API function instead?

2 similar comments
@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2018

Yes, I agree. We can properly re-enable cache defense in 5.2.0 as default with a proper announcement.

Technically apcu_store can fail at any time as there are many ways it can fail, so maybe adding a proper apcu_enabled function is the way to go so that OP can use an API function instead?

@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2018

Yes, I agree. We can properly re-enable cache defense in 5.2.0 as default with a proper announcement.

Technically apcu_store can fail at any time as there are many ways it can fail, so maybe adding a proper apcu_enabled function is the way to go so that OP can use an API function instead?

@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2018

Lol, seems like GH could use some slam detection.

I think the proper way to do that for now is to use apc_fetch + apcu_add + apcu_fetch.

Currently OP you are putting a lot of Write pressure on your cache.

I however agree that the behavior change was unintended and am also sorry for breaking your site.

We should put a warning in release notes or release 5.1.14 branched off 5.1.13 with it off.

1 similar comment
@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2018

Lol, seems like GH could use some slam detection.

I think the proper way to do that for now is to use apc_fetch + apcu_add + apcu_fetch.

Currently OP you are putting a lot of Write pressure on your cache.

I however agree that the behavior change was unintended and am also sorry for breaking your site.

We should put a warning in release notes or release 5.1.14 branched off 5.1.13 with it off.

@nikic nikic closed this in c510a5f Nov 20, 2018
@nikic nikic referenced this issue Nov 20, 2018
@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 21, 2018

i guess / hope https://bugs.php.net/bug.php?id=77180 had the same reasons

the whole documentation talks about files and really don't match APCu these days
http://php.net/manual/en/apc.configuration.php#ini.apc.slam-defense
http://php.net/manual/en/apc.configuration.php#ini.apc.write-lock

@nikic

This comment has been minimized.

Copy link
Collaborator

commented Nov 21, 2018

APCu 5.1.14 with slam defense disabled by default has been released and should resolve this issue without manual configuration changes.

@thelounge-zz Yeah, this is very likely the same issue. I've opened #349 to track the outdated documentation.

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

after a few hours with 5.1.14 AND apc.slam_defense=0 on one machine still flooded with failed apcu_store() calls which did not happen for at least a whole year and so the question is why store around 7 MB becomes a problem now

hell - apc.shm_size = 256M - even with a shitload of memory fragmentation i can't imagine any sane reason, that the code is buggy to write again and again the same value and not usong the cached one is a different story, i instrcued the repsonsible developer now that he should disable the SHM option for that specific call but this must not happen to begin with

APCU-Size: 180,20 KB
APCU-Hits: 29763 (30.7/Min)
APCU-Count: 53

'cl_cache::write_shm_private()' failed - key: 'cl_pri_37747663f4c5fbd3fc4173ab9ed1297e7d7fc4f5sap_portal_ws10_m013_0', length: '7683806'

@nikic

This comment has been minimized.

Copy link
Collaborator

commented Nov 22, 2018

@thelounge-zz A couple of questions:

  • Does this store always fail? Or does it say succeed the first few times but then consistently fail? Or always sometimes succeed, sometimes fail?
  • What's the type of the stored value?
  • Are you getting any warnings? I'm assuming not, just double checking.
@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

a) impossible to say butg given it appears hours after update and restart httpd
it looks like it works and starts to fail later, sadly not my code, it writes
external webservice repsonses and i strongly doubt that they don't happen
for hours
b) looks like a large serialized array but that can also be my error handler
c) no, the error comes from my code when apcu_store() fails

@nikic

This comment has been minimized.

Copy link
Collaborator

commented Nov 22, 2018

These are currently the reasons why an apcu_store call may fail:

  • APCu is disabled (no warning).
  • Slam defense triggers (no warning).
  • Persisting fails:
    • Value contains resource (warning).
    • Serialization fails (warning/exception).
    • Invalid type (asserts in debug build).
    • SMA allocation fails (no warning).
  • Write lock acquisition fails (warning).
  • If apcu_add is used and the entry already exists (no warning).

In your case (no warnings generated by APCu), the only one that seems relevant is that SMA allocation fails. This usually should not happen, as SMA allocation failure will trigger a cache expunge and then make the allocation succeed.

b) looks like a large serialized array but that can also be my error handler

It would be good to know whether it's an array or a serialized array that's passed to apcu, as these cases are handled very differently. If it's an actual array, it would be worthwhile to check whether setting the apc.serializer=php ini setting resolves the issue.

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

the guy using my cms cache layer says he hands over a string

APCu can't be disabled and isn't
i said that i explicit set apc.slam_defense=0 everywhere
value can't be a ressource
if serialisation would fail i won't have a serialized error message from my cache-layer
SMA allocation is the most likely case
as said no warnings - we are runnung error_reporting E_ALL with php_errorlog mailed every 30 minuts to the whole IT staff if it's not empty

apcu_add don't exist anywhere in my code because it makes no sense at all to use it when you have to look if the key exists and then switch over to apcu_store - apcu_store all over the codebase

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

but it is an array what is handed over to apcu_store containing the serialized data as value and the metadata to handle 'allow_stale' in the upper layer

that's the whole function - there is even a retry when it fails first in the hope GC would make it sucess the second time, if it would have been an array it would have been mentioned in the error message

private function write_shm_private(array $data): void
{
$key = "cl_pri_{$this->shm_key}{$data['hash']}";
try
{
if($data['ttl'] > $this->shm_content_cache_ttl)
{
$data['ttl'] = $this->shm_content_cache_ttl;
}
$cache_rw = apcu_store($key, $data, $data['ttl']);
if($cache_rw === FALSE)
{
$cache_rw = apcu_store($key, $data, $data['ttl']);
}
if($cache_rw === TRUE)
{
$this->shm_inventory_add($data['hash'], $key);
}
else
{
if(is_array($data['content']) === TRUE)
{
$data['content'] = 'Array: ' . serialize($data['content']);
}
$data['content'] = (string)$data['content'];
$content_length = strlen($data['content']);
$this->error_handler(METHOD, "key: '$key', length: '$content_length', content: '" . substr($data['content'], 0, 256) . "'", /**$internal_call*/TRUE);
}
}
catch(Throwable $exception)
{
$this->exception_handler((string)$exception);
}
}

@nikic

This comment has been minimized.

Copy link
Collaborator

commented Nov 22, 2018

For now the only case I could come up with to make apcu_store fail is high fragmentation:

// apc.shm_size=6M
var_dump(apcu_store("key1", str_repeat('x', 1024 * 1024)));
var_dump(apcu_store("key2", str_repeat('x', 1024 * 1024)));
var_dump(apcu_store("key3", str_repeat('x', 1024 * 1024)));
var_dump(apcu_store("key4", str_repeat('x', 1024 * 1024)));
var_dump(apcu_store("key5", str_repeat('x', 1024 * 1024)));
var_dump(apcu_delete("key1"));
var_dump(apcu_delete("key3"));
var_dump(apcu_delete("key5"));
//var_dump(apcu_sma_info());
var_dump(apcu_store("key6", str_repeat('x', 2 * 1024 * 1024))); // bool(false)
//var_dump(apcu_sma_info());

Under a default configuration, the cache will only be expunged if it is at least half full. If the cache is less than half full, but the allocation cannot succeed due to fragmentation, then apcu_store may fail.

Do you know if this issue occurred at a time when the cache was at least half full? In that case it might be a fragmentation issue. If not, then it must be something else.

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

i doubt that it was half-full after that few hours given on the machine is only running one website and yesterady with the previous APCu version starting to throw these errors i rasied the SHM size to 256 MB

the other machine hosting some hundret domains has a SHM size of 512 MB for APCu and so the chance of high fragmentation would be there much higher while with 5.1.13 it has "only" thrown a few apcu_store() errors

APCu is heavily used with really low TTL values in all cases so there are for sure many deletes and writes

@nikic

This comment has been minimized.

Copy link
Collaborator

commented Nov 22, 2018

Err, I turned that the wrong way around. If the cache was less than half full it might be a fragmentation issue.

This can in principle also happen if the cache is quite empty, e.g. if you do a big allocation, then a small one, then another big one (e.g. because the previous big one is updated to a slightly larger value), then a small one, etc. You can end up with a cache that is mostly empty but regularly interrupted by small allocations. So if this workload regularly writes both large 7MB values and small values, it could run into this situation.

This issue also existed before the changes in apcu 5.1.13, however there is a significant difference: In apcu 5.1.13 allocation sizes are determined precisely, while in apcu 5.1.12 they end up (glossing over some details) rounded to a multiple of at least 512 (and up to 8192). This means that if the size of the large allocations is slightly increasing over time, it will still be possible to reuse the space left by previous allocations.

If this is indeed the issue, then we can fix it by rounding up allocation sizes in the allocator. (Longer term it might be good to separate small and large allocations into separate pools).

@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 22, 2018

@thelounge-zz So you say:

  • It worked perfectly in 5.1.12
  • It was problematic in 5.1.13 (with default configuration)
  • It is worse in APCu 5.1.14 than 5.1.13

Is that correct so far?

Did you try 5.1.13 with apc.slam_defense=0?

Just asking as we improved the locking (less contentions) in 5.1.14 for the SMA and it would be good to know if that is the new cause.

Are you able to reproduce yourself or only on your customers machine?

The other thing is: Can it happen that many things write the same keys and values at the same time?

1 similar comment
@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 22, 2018

@thelounge-zz So you say:

  • It worked perfectly in 5.1.12
  • It was problematic in 5.1.13 (with default configuration)
  • It is worse in APCu 5.1.14 than 5.1.13

Is that correct so far?

Did you try 5.1.13 with apc.slam_defense=0?

Just asking as we improved the locking (less contentions) in 5.1.14 for the SMA and it would be good to know if that is the new cause.

Are you able to reproduce yourself or only on your customers machine?

The other thing is: Can it happen that many things write the same keys and values at the same time?

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

i see not much difference between 5.1.13 and 5.1.14
5.1.14 likely is better because only on one machine the errors with apcu_store()
no way to reproduce myself, it takes hours to manifest in general

"Can it happen that many things write the same keys and values at the same time"
for sure and no way to work around without defeat the whole win of apcu

point is that i did not see any apcu_store() fail over at least a complete years excecpt it lied and did not give back FALSE in case id din't work

@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 22, 2018

@thelounge-zz

I was just thinking that a simple fetch before set might avoid a write operation.

For that you would serialize the data yourself and just compare if the old value is already present.

That should also improve performance.

That said the SMA layer should not fail for 7MB sized items.

But if eg all your processes try to write the 7 MB data at the same time, it means the SMA will need to allocate potentially 7x n clients MB before freeing the data of the old items. Because APCu uses delete + insert, but allocates the memory before the delete.

Also could you confirm that all writes for a key fail or if it manages to write them.

@nikic Might it be that the write lock before was taken before allocating the memory?

@nikic

This comment has been minimized.

Copy link
Collaborator

commented Nov 22, 2018

Might it be that the write lock before was taken before allocating the memory?

Nope, the cache write lock and the SMA lock must be disjoint, otherwise expunge operations may write lock recursively, which deadlocks rwlocks. Things used to work like this in older versions, but not in 5.1.12.

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

I was just thinking that a simple fetch before set might avoid a write operation.
For that you would serialize the data yourself and just compare if the old value is already present.
That should also improve performance

sorry, but no!

thats deeper knowlegde APCu has to handle instead userland

@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 22, 2018

The only difference I see is that before we allocated memory 3x instead of just once as far as I can see.

The problem is imagine:

100 processes wanting to write 4 different 7 MB items at the same time.

If they all queue up before the write lock, that's an allocation of 700 MB up front, which means that some stores will fail, but some will succeed.

The only way to avoid that would be to serialize the writers, but that would make things slower.

I mean we could try fetch + compare before SMA allocation in APCu itself, but that's also not true genrically.

And for small items this likely is never an issue.

Only thing I could think of is to allow for larger items only one through at a time, but again APCu failing to allocate memory under write pressure is not a bug per se - eventually things will get written.

Increasing memory for APCu and clearing keys from time to time should help though.

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

Sorry, but that's all nonsense when you look at that numbers below (at time where the errors appeared) and the fact of a 256 MB SHM for APCu given it did not fail over more than a year without hit that error a single time until the recent two updates

APCU-Size: 180,20 KB
APCU-Hits: 29763 (30.7/Min)
APCU-Count: 53

either the return value of apcu_store() lied the whole last year or there is some rgeression, it's that easy

@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 22, 2018

180,20 KB is that 180 mega byte or kilo byte?

And the count means you have only 53 items right?

And those items are 7 MB large each?

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

KILOBYTE
none of them except one is 7 MB and even not one becuase otherwise 180,20 KB would be larger than that

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

look at "APCU-Hits: 29763 (30.7/Min)"
given that the whole navigation comes from APCu how do you imagine 100 processes want to write the same 7 MB at the same time - it's ridicolous

$apcu_iterator = new APCUIterator();
$apcu_total_hits = $apcu_iterator->getTotalHits();
$apcu_total_count = $apcu_iterator->getTotalCount();
$apcu_cache_size = GetSizeString($apcu_iterator->getTotalSize());

function GetSizeString(string $filepath='', int $bytes=0): string
{
if($bytes < 1)
{
$bytes = filesize($filepath);
}
if($bytes > 1073741824)
{
return (number_format(round(($bytes / 1073741824),2), 2, ',', '.')) . ' GB';
}
elseif($bytes > 1048576)
{
return (number_format(round(($bytes / 1048576),2), 2, ',', '.')) . ' MB';
}
else
{
return (number_format(round(($bytes / 1024),1), 2, ',', '.')) . ' KB';
}
}

@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 22, 2018

@thelounge-zz

Thanks a lot for the data. I am not sure why 53 items with 180 kB in total would ever fail a write.

I assume you want to fix that.

So first we need to see if this is caused by SMA changes from 5.1.12 to 5.1.13 or from making the SMA faster in 5.1.14.

You said that you think that 5.1.14 is a little better as it does not fail on all machines.

However that failing on all machines was due to the slam defense being enabled, right?

My advice would be to:

  • Use 5.1.13
  • Set apc.slam_defense = 0

And then report back if that fixes the errors after leaving that running for a while.


Another question:

As you log the items in question with their size and beginning of data: Are only large items failing to store or are also small items affected? (I was not clear on that)

Are the errors in time "near" each other or far apart, i.e. do they happen in chunks or continously.

That all will help diagnose the problem.

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 22, 2018

sorry, i can't reproduce that on testing machines and i can and will not play around on production and the write code in questions is even no longer there at all - 5.1.13 and 5.1.14 are a step backwards, there is no reason looking at the stats that any write could even imagine to fail and more i can't do than say "rever or careful review the changes from 5.1.12 to now because they are introducing more problems they ever could fix"

@LionsAd

This comment has been minimized.

Copy link
Contributor

commented Nov 22, 2018

That's a fair opinion, but without a way to reproduce or any further information there is nothing we can do about it.

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 26, 2018

5.1.13 and 5.1.14 may contain some improvements but it's fact that in setup making heavily use of apcu_store() with short TTLs things start to break randomly under mixed load after some hours

in fact that's are all arrays

 private function shm_inventory_add(string $cache_hash, string $shm_key): void
 {
  try
  {
   while(apcu_add("cl_{$this->shm_key}_shm_inventory_lock", 1, 1) === FALSE){usleep(1);}
   /** Inventory auslesen */
   $cache_inventory = apcu_fetch("cl_{$this->shm_key}_shm_inventory");
   if($cache_inventory === FALSE)
   {
    $cache_inventory = [];
   }
   $cache_inventory[$cache_hash] = $shm_key;
   if(apcu_store("cl_{$this->shm_key}_shm_inventory", $cache_inventory) === FALSE)
   {
    $this->error_handler(__METHOD__, /**$message*/'', /**$internal_call*/TRUE);
   }
   apcu_delete("cl_{$this->shm_key}_shm_inventory_lock");
  }
  catch(Throwable $exception)
  {
   $this->exception_handler((string)$exception);
  }
 }

[26-Nov-2018 12:00:57 Europe/Vienna] 'cl_cache::shm_inventory_add()' failed - called by show_content_common.php on line 421 (parent: /show_content2.php on line 27) (apache2handler)
[26-Nov-2018 12:00:57 Europe/Vienna] 'cl_cache::shm_inventory_add()' failed - called by api_cache.php on line 318 (parent: api_cssmenu_internal.php on line 64) (apache2handler)
[26-Nov-2018 12:00:57 Europe/Vienna] 'cl_cache::shm_inventory_add()' failed - called by api_cache.php on line 318 (parent: api_navigation_base_internal.php on line 225) (apache2handler)
[26-Nov-2018 12:00:57 Europe/Vienna] 'cl_cache::shm_inventory_add()' failed - called by api_content_ext.php on line 200 (parent: api_content_ext.php on line 101) (apache2handler)
[26-Nov-2018 12:00:57 Europe/Vienna] 'cl_cache::shm_inventory_add()' failed - called by api_cache.php on line 318 (parent: functions_newsticker.php on line 23) (apache2handler)
[26-Nov-2018 12:01:04 Europe/Vienna] 'cl_cache::shm_inventory_add()' failed - called by show_content_common.php on line 421 (parent: /show_content2.php on line 27) (apache2handler)
[26-Nov-2018 12:01:04 Europe/Vienna] 'cl_cache::shm_inventory_add()' failed - called by api_cache.php on line 318 (parent: api_cssmenu_internal.php on line 64) (apache2handler)
[26-Nov-2018 12:01:04 Europe/Vienna] 'cl_cache::shm_inventory_add()' failed - called by api_cache.php on line 318 (parent: api_navigation_base_internal.php on line 225) (apache2handler)
[26-Nov-2018 12:01:04 Europe/Vienna] 'cl_cache::shm_inventory_add()' failed - called by api_cache.php on line 318 (parent: api_imagedb.php on line 232) (apache2handler)
[26-Nov-2018 12:01:04 Europe/Vienna] 'cl_cache::shm_inventory_add()' failed - called by api_content_ext.php on line 200 (parent: api_content_ext.php on line 101) (apache2handler)

@nikic

This comment has been minimized.

Copy link
Collaborator

commented Nov 26, 2018

@thelounge-zz Can you check whether setting the apc.serializer=php ini setting resolves the issue?

If so I will create a new release with that set and try to figure out what's wrong with the default serialization afterwards.

@thelounge-zz

This comment has been minimized.

Copy link

commented Nov 26, 2018

well, i can try it but i don't know how long i have to wait to confirm anything because it tooks days on the main server to get the errors above while the uptime becuas eof kernel update is below 24 hours and hopefully that don't break other things

@thelounge-zz

This comment has been minimized.

Copy link

commented Dec 18, 2018

apcu-5.1.15.tgz still don't make me terrible happy, it shouldn't be possible to reach that code path when everything works as it did before 5.1.13

[18-Dec-2018 13:47:06 Europe/Vienna] 'cl_cache::shm_inventory_delete()' failed - failed to release inventory lock - called by api_cache.php on line 487 (parent: sap_webservice10_api.php on line 854) (apache2handler)

private function shm_inventory_delete(array $hash_list): void
 {
  try
  {
   while(apcu_add("cl_{$this->shm_key}_shm_inventory_lock", 1, 1) === FALSE)
   {
    $lock_retries = isset($lock_retries) ? (int)$lock_retries : 0;
    if($lock_retries > 1000)
    {
     $this->error_handler(__METHOD__, /**$message*/'failed to get inventory lock', /**$internal_call*/TRUE);
     return;
    }
    $lock_retries++;
    usleep(5);
   }
   $cache_inventory = apcu_fetch("cl_{$this->shm_key}_shm_inventory");
   if($cache_inventory !== FALSE)
   {
    foreach($hash_list as $hash)
    {
     unset($cache_inventory[$hash]);
    }
    foreach($cache_inventory as $cache_key=>$shm_key)
    {
     if(apcu_exists($shm_key) === FALSE)
     {
      unset($cache_inventory[$cache_key]);
     }
    }
    if(apcu_store("cl_{$this->shm_key}_shm_inventory", $cache_inventory) === FALSE)
    {
     usleep(5);
     if(apcu_store("cl_{$this->shm_key}_shm_inventory", $cache_inventory) === FALSE)
     {
      $this->error_handler(__METHOD__, /**$message*/'', /**$internal_call*/TRUE);
     }
    }
   }
   if(apcu_delete("cl_{$this->shm_key}_shm_inventory_lock") === FALSE)
   {
    usleep(5);
    if(apcu_delete("cl_{$this->shm_key}_shm_inventory_lock") === FALSE)
    {
     $this->error_handler(__METHOD__, /**$message*/'failed to release inventory lock', /**$internal_call*/TRUE);
    }
   }
  }
  catch(Throwable $exception)
  {
   $this->exception_handler((string)$exception);
  }
 }
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.