Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance issue in cache locking mechanism #22829

Merged
merged 60 commits into from
May 27, 2020

Conversation

IvanChepurnyi
Copy link
Contributor

@IvanChepurnyi IvanChepurnyi commented May 11, 2019

Description

This change replaces the locking strategy for resolving race condition in the configuration cache loader.
The previous implementation was waiting for the release of the lock for all concurrent connections when the cache was flushed. It was redundant to do as the whole website visitors wait for a lock release.

The new implementation creates a lock only for the write operation and the other connections who try to write it just return data uncached without performing any write operations. This approach removes any possibility of cyclic locking.

Fixed Issues

  1. Fixes usleep Statments in Cache Loader causing Pathological Page Load Times #22824

Manual testing scenarios

  1. Create a load test that sends 20 concurrent users to your Magento without Varnish cache enabled.
  2. Flush Magento cache.
  3. Visit Magento backend.
  4. Validate that you still can access and operate Magento store.

Contribution checklist (*)

  • Pull request has a meaningful description of its purpose
  • All commits are accompanied by meaningful commit messages
  • All new or changed code is covered with unit/integration tests (if applicable)
  • All automated tests passed successfully (all builds are green)

@m2-assistant
Copy link

m2-assistant bot commented May 11, 2019

Hi @IvanChepurnyi. Thank you for your contribution
Here is some useful tips how you can test your changes using Magento test environment.
Add the comment under your pull request to deploy test or vanilla Magento instance:

  • @magento-engcom-team give me test instance - deploy test instance based on PR changes
  • @magento-engcom-team give me 2.3-develop instance - deploy vanilla Magento instance

For more details, please, review the Magento Contributor Assistant documentation

@magento-cicd2
Copy link
Contributor

magento-cicd2 commented May 11, 2019

CLA assistant check
All committers have signed the CLA.

@IvanChepurnyi
Copy link
Contributor Author

Static test failures related to legacy code in AbstractBlock that are unrelated to the current PR.

@kandy
Copy link
Contributor

kandy commented May 11, 2019

Still, don't get why we need locks on write and how it will resolve problem? All client will try to write to cache storage and wait on lock

@IvanChepurnyi
Copy link
Contributor Author

The latest update exactly resolves an issue, test it with the load test and put logging on a number of writes happening during the flush operation.
Then make a load test without locking, and then compare.

@kandy
Copy link
Contributor

kandy commented May 11, 2019

Well, now only one customer will write data.
But let's imagine that we need to cache the code that calculates the number of lines in a 100 GB file.
How many times will this operation be performed for your code?

@IvanChepurnyi
Copy link
Contributor Author

@kandy I would say something is wrong if you want to cache a number of lines in the 100GB file. Locking is currently used only in 2 places: config cache and block cache.

@kandy
Copy link
Contributor

kandy commented May 11, 2019

This is just an example. You can choose any resource-intensive operation.

@IvanChepurnyi
Copy link
Contributor Author

IvanChepurnyi commented May 11, 2019

Can you explain how this theoretical problem is affecting this specific pull request?

The second option with continuous lock is still available for such cases - an original method is not removed.

PS: Also, as probably you know, I am not a fan of optimization by caching, but for some places caching is inevitable.

For example in resource-intensive operations, if this data can be stored and invalidated by any other means than volatile cache storage, I prefer that method. Usually, resource-intensive operations can be optimized by using indexation, using a better algorithm, building memory based microservice, etc.

@kandy
Copy link
Contributor

kandy commented May 11, 2019

Well, let's discuss the initial implementation with your suggestion.
They are distinguished by the fact that in the first, parallel processes they sleep (and use no resources) while data is being generated.
In the proposed by you, all parallel processes generate data (makes sql queries, reads from fs, etc.)
I see no reason why the new version is better. Can you explain it for me?

PS: agree that if the cache has overheads comparable to the generation of data, we must delete the cache

@IvanChepurnyi
Copy link
Contributor Author

@kandy The reason is simple, webshop needs to sell goods to make money for the merchant. If it saves CPU cycles to make sure only one non-cached request happen, the merchant won't be happy.

When all the requests are waiting for a released lock, then nothing can be sold. PHP-FPM pool gets used up and NGINX receives a growing number of incoming connections it cannot resolve to send through to fpm, so it responds with 503 error. All of these just because of a single lock.

In the second case, requests are processed, yes slower than with a case when we have cached data, but still, nothing is putting a plug on FPM pool and server responds, hoster can scale by adding additional frontend nodes - everyone is happy!

@kandy
Copy link
Contributor

kandy commented May 11, 2019

your sentence is correct only in case when time on generation is less than time on the check of lock present + sleep timeout.
And pay attention that in your case site will work slower because the client still need to wait until data will be regenerated

@IvanChepurnyi
Copy link
Contributor Author

@kandy I am not talking theoretically, I am talking out of my experience. I've run into similar errors before when was working directly with merchants who received huge traffic where these issues were real. Mutex is not the answer, especially for the cache, especially in PHP.

@maksek
Copy link
Contributor

maksek commented May 11, 2019

@kandy @IvanChepurnyi - What do you think if we will have community architecture call and discuss the solution, instead of typing a lot of text?

@kandy
Copy link
Contributor

kandy commented May 12, 2019

@IvanChepurnyi I more deeply analyzed your solution and think it will resolve problems with locks on reading, especially on edge cases where cache disabled or data always return false.
I believe that it makes sense to accept PR now, and then we will try to find a better solution for improvement on data generation

@vzabaznov vzabaznov self-requested a review May 27, 2020 14:55
@magento magento deleted a comment from m2-assistant bot May 27, 2020
@slavvka slavvka merged commit becbf3d into magento:2.4-develop May 27, 2020
@ghost ghost moved this from Pending Review to Recently Merged in Pull Requests Dashboard May 27, 2020
@magento magento deleted a comment from m2-assistant bot May 28, 2020
@ihor-sviziev
Copy link
Contributor

ihor-sviziev commented May 28, 2020

@vzabaznov @slavvka Wow. This PR was finally merged even after it was closed? 🤯

@slavvka
Copy link
Member

slavvka commented May 28, 2020

@ihor-sviziev Yes, it was closed because the related issue was closed. Once the code synced it got merged. I was also amazed :)

@kingletas
Copy link

Thank you @slavvka. I look forward to upgrading to 2.4 now

@SilvanLaroo
Copy link

We're also facing this issue, is there a patch available for 2.3?

@IvanChepurnyi
Copy link
Contributor Author

@SilvanLaroo you can extract a patch from this pull request but you have to separate patched files into composer patches per package as PR has diff file for a monorepo.

@riconeitzel
Copy link
Contributor

riconeitzel commented Oct 6, 2020

did anyone face installation issues with this patch since 2.3.5?

When we run a setup with an existing codebase and env.php on an empty database (bin/magento set:up) we end up with


SQLSTATE[42S02]: Base table or view not found: 1146 Table 'magento.core_config_data' doesn't exist, query was: SELECT `main_table`.* FROM `core_config_data` AS `main_table`

In Mysql.php line 110:

  SQLSTATE[42S02]: Base table or view not found: 1146 Table 'magento.core_config_data' doesn't exist, query was: SELECT `main_table`.* FROM `core_config_data` AS `main_table`


In Mysql.php line 91:

  SQLSTATE[42S02]: Base table or view not found: 1146 Table 'magento.core_config_data' doesn't exist

  1. Triggered by \Magento\Setup\Console\Command\UpgradeCommand::execute()
  2. calling \Magento\Setup\Model\Installer::updateModulesSequence()
  3. calling \Magento\Setup\Model\Installer::cleanCaches()
  4. iterating over all cache types and for … $type = full_page
  5. calling \Magento\PageCache\Model\Cache\Type::clean()
  6. calling $this->eventManager->dispatch('adminhtml_cache_refresh_type');
  7. which triggers
  8. \Magento\CacheInvalidate\Observer\FlushAllCacheObserver::execute()
  9. calling \Magento\Framework\Cache\LockGuardedCacheLoader::lockedLoadData()
  10. because it needs the config value for default/system/full_page_cache/caching_application
  11. which results in
  12. \Magento\Framework\App\Config\ConfigSourceAggregated::get() loading a collection from the database …
  13. which does not (yet) exist
  14. 💥

@ihor-sviziev
Copy link
Contributor

@riconeitzel if you’ll be able to reproduce this issue on 2.4-develop branch - please report it as separate issue. Thank you!

@riconeitzel
Copy link
Contributor

if it's broken on 2.3.x not?

@IvanChepurnyi
Copy link
Contributor Author

IvanChepurnyi commented Oct 8, 2020

@riconeitzel are you sure this pull request is a source of your issue? As seen in your stack trace, it is calling \Magento\Framework\Cache\LockGuardedCacheLoader::lockedLoadData(), which is is not called in System config class if this pull request is applied:
\Magento\Framework\Cache\LockGuardedCacheLoader::nonBlockingLockedLoadData(). Looks like you found a general issue with the cache being used when Magento is not yet completely installed, not issue with configuration cache locking, and it definitely should be reported as a separate issue affecting both 2.3.5 and 2.4.x

@riconeitzel
Copy link
Contributor

@IvanChepurnyi this patch isn't compatible with 2.3.5-p2, right? 😢

@serbaninno
Copy link

We encountered an issue after applying this patch on a Magento 2.3.5-p2 + Varnish 5. When a Varnish request was made to the homepage for the first time, the X-Magento-Tags were missing. We only got the tags on the next requests to the same page. The "fix" was removing this patch

@ihor-sviziev
Copy link
Contributor

Hi @serbaninno,
Could you check if the issue reproducing on 2.4-develop? If so - please report separate issue for investigation & fixing that.

[$this->currentLock]
)
);
return false;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates infinite loop if Database is used for AbstractBlock locking. All next $this->locker->lock() calls return false.

If a cacheable block has a child cacheable block, the LockGuardedCacheLoader::lockedLoadData() call is nested, but since there is already a lock in the Database backend, and it does not depend on the lock name, lock() returns false. See 2 lockedLoadData() in this call stack.

LockGuardedCacheLoader.php:111, Magento\Framework\Cache\LockGuardedCacheLoader->lockedLoadData()
AbstractBlock.php:1125, Amasty\Label\Block\Label->_loadCache()
AbstractBlock.php:674, Amasty\Label\Block\Label->toHtml()
...
Interceptor.php:429, Magento\Catalog\Block\Product\Image\Interceptor->toHtml()
grid.phtml:52, include()
...
Template.php:301, Magento\CatalogWidget\Block\Product\ProductsList\Interceptor->_toHtml()
AbstractBlock.php:1100, Magento\CatalogWidget\Block\Product\ProductsList\Interceptor->Magento\Framework\View\Element\{closure:/var/www/html/vendor/magento/framework/View/Element/AbstractBlock.php:1094-1101}()
LockGuardedCacheLoader.php:103, Magento\Framework\Cache\LockGuardedCacheLoader->lockedLoadData()
AbstractBlock.php:1125, Magento\CatalogWidget\Block\Product\ProductsList\Interceptor->_loadCache()
AbstractBlock.php:674, Magento\CatalogWidget\Block\Product\ProductsList\Interceptor->toHtml()
Interceptor.php:830, Magento\CatalogWidget\Block\Product\ProductsList\Interceptor->toHtml()
Filter.php:121, Magento\Widget\Model\Template\Filter\Interceptor->generateWidget()
Interceptor.php:24, Magento\Widget\Model\Template\Filter\Interceptor->generateWidget()
Filter.php:132, Magento\Widget\Model\Template\Filter\Interceptor->widgetDirective()
Interceptor.php:37, Magento\Widget\Model\Template\Filter\Interceptor->widgetDirective()
LegacyDirective.php:42, ReflectionMethod->invokeArgs()
LegacyDirective.php:42, Magento\Framework\Filter\DirectiveProcessor\LegacyDirective->process()
Template.php:184, Magento\Widget\Model\Template\Filter\Interceptor->filter()
Filter.php:1080, Magento\Widget\Model\Template\Filter\Interceptor->filter()
Interceptor.php:58, Magento\Widget\Model\Template\Filter\Interceptor->___callParent()
Interceptor.php:138, Magento\Widget\Model\Template\Filter\Interceptor->Magento\Framework\Interception\{closure:/var/www/html/vendor/magento/framework/Interception/Interceptor.php:104-151}()
Interceptor.php:153, Magento\Widget\Model\Template\Filter\Interceptor->___callPlugins()
Interceptor.php:403, Magento\Widget\Model\Template\Filter\Interceptor->filter()
Page.php:170, Magento\VersionsCms\Block\Cms\Page->_toHtml()
AbstractBlock.php:1100, Magento\VersionsCms\Block\Cms\Page->Magento\Framework\View\Element\{closure:/var/www/html/vendor/magento/framework/View/Element/AbstractBlock.php:1094-1101}()
AbstractBlock.php:1104, Magento\VersionsCms\Block\Cms\Page->_loadCache()
AbstractBlock.php:674, Magento\VersionsCms\Block\Cms\Page->toHtml()
...
Http.php:120, Magento\Framework\App\Http\Interceptor->launch()
Interceptor.php:24, Magento\Framework\App\Http\Interceptor->launch()
Bootstrap.php:261, Magento\Framework\App\Bootstrap->run()
index.php:40, {main}()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please report a separate issue with all details for verifying and than fixing that?

@niclashoyer
Copy link

niclashoyer commented May 12, 2021

We're having the same issues with very long load times on a 2.3.5-p2 shop, where sometimes page loading takes around 30 seconds and lockedLoadData is just busy waiting.

We may upgrade to 2.4.x, but I'm a bit unsure if this really helps. This merge merges initial parts of this PR and is tagged with 2.4.2, but if I look e.g. at the test file in the tagged 2.4.2 version, it does not mention the non blocking variant and even the history does not mention fixes in these files.

Can someone clear this up? @vzabaznov @ihor-sviziev is this really merged in 2.4.x and fixed? Or was the non blocking lock method removed again somehow?

@IvanChepurnyi
Copy link
Contributor Author

Can someone clear this up? is this really merged in 2.4.x and fixed? Or was the non blocking lock method removed again somehow?

Probably it was merged and then removed, as they implemented something own; I don't think that rewritten solution is efficient and really solves the concurrency issue, but I already gave up on the original pull request. Maybe someday I will make it as an installable module that can implement a non-blocking cache, but that one is far from even the start of development as I am too busy right now.

@kandy
Copy link
Contributor

kandy commented May 12, 2021

@jonathanribas jonathanribas mentioned this pull request Sep 14, 2021
5 tasks
@dooblem
Copy link

dooblem commented Mar 28, 2024

Hello all.
This is still causing issues on the last magento versions (currently 2.4.5 on my customer).

On our project the config cache is huge because we have a lot of websites. Each time the config cache was flushed (config save in backoffice, or during imports or any other reason), I was seing on PHP process generating the configuration. And the other processes waiting for the lock release.
But once the config was generated and the lock released, all the processes where piling up, consuming a lot of memory and eventually causing oomkills.

After enabling l2 stale cache on the config as suggested by @kandy ☝️ , it worked like magic, and our problems went away !

Important thing : we only enabled the l2 cache for the config cache. Because I heard from colleagues that enabling it for all the caches slows overall response times, and also as it basically duplicates the cache on all local fronts, I did not want to overload my web front memories.

Here is an extract from env.php:

    'cache' => [
        'frontend' => [
            'default' => [
                'id_prefix' => '954_',
                'backend' => 'Magento\\Framework\\Cache\\Backend\\Redis',
                'backend_options' => [
                    'server' => 'myredishost',
                    'database' => '0',
                    'port' => '6379',
                    'password' => 'xxxx',
                    'compress_data' => '1'
                ]
            ],
            'l2_stale_cache_enabled' => [
                'id_prefix' => '954_',
                'backend' => '\\Magento\\Framework\\Cache\\Backend\\RemoteSynchronizedCache',
                'backend_options' => [
                    'remote_backend' => '\\Magento\\Framework\\Cache\\Backend\\Redis',
                    'remote_backend_options' => [
                        'persistent' => 0,
                        'server' => 'myredishost',
                        'database' => '0',
                        'port' => '6379',
                        'password' => 'xxxx',
                        'compress_data' => '1'
                    ],
                    'local_backend' => 'Cm_Cache_Backend_File',
                    'local_backend_options' => [
                        'cache_dir' => '/dev/shm/'
                    ],
                    'use_stale_cache' => true
                ],
                'frontend_options' => [
                    'write_control' => false
                ]
            ]
        ],
        'type' => [
            'default' => [
                'frontend' => 'default'
            ],
            'config' => [
                'frontend' => 'l2_stale_cache_enabled'
            ]
        ]
    ],

In order to test it. On a single web front on dev or preprod:

watch the php fpm processes piling up in one terminal window:

while true; do date; top -bn1 | grep -vF ' 0.0 '  | grep fpm; sleep 0.5; done

on another window, run a bin/magento cache:clean config

then quickly run 4 curls in background, and observe the response time, the locking and how processes pile up.

time curl -is -o/dev/null -H 'Host:www.example.com' -H X-forwarded-proto:https  http://localhost/agen/checkout/cart &

before the config fix: 1 process taking 30s, 3 processes taking 15s
after the config fix: 1 process taking 30s, 3 processes taking less than 1s (those using stale cache)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

usleep Statments in Cache Loader causing Pathological Page Load Times