Magento varnish 6 Too many restarts #24353
Comments
Hi @HOSTED-POWER. Thank you for your report.
Please make sure that the issue is reproducible on the vanilla Magento instance following Steps to reproduce. To deploy vanilla Magento instance on our environment, please, add a comment to the issue:
For more details, please, review the Magento Contributor Assistant documentation. @HOSTED-POWER do you confirm that you were able to reproduce the issue on vanilla Magento instance following steps to reproduce?
PS: We read the information here: https://varnish-cache.org/docs/6.2/whats-new/upgrading-6.2.html#whatsnew-upgrading-2019-03 and added in vcl_recv:
This improved the situation, although we were not sure it was resolved properly and 100% supported. Update: we did further tests and it looks properly solved.
Hi @HOSTED-POWER, Could you add steps to reproduce to make sure that we'll be able to reproduce this issue? That's really good that you already found solution for your issue. Could you create Pull Request with suggested fix? |
Hello,

To reproduce, install any Magento site (we had 2.3.1 and some 2.x versions) and wait for it to happen. You can watch the log like this: `varnishlog -q 'RespStatus == 503' -g request`

Probably after only a few minutes you will see the 503 on certain objects, and Varnish goes into a guru meditation error :) We've seen it on all sites we tried it on, so it will be hard not to notice it.

To be on the safe side, we added this on top in vcl_recv:
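A minimal sketch of that addition, based on the code quoted later in this thread (`if (req.restarts > 0) { set req.hash_always_miss = true; }`). It is a fragment meant to be merged into the top of an existing `vcl_recv`, not a complete VCL file:

```vcl
sub vcl_recv {
    # A restarted request (for example one restarted from vcl_hit) would
    # otherwise find the same expired object again and restart once more.
    # Forcing a miss makes the restarted request go to the backend instead.
    if (req.restarts > 0) {
        set req.hash_always_miss = true;
    }

    # ... the rest of the existing vcl_recv logic follows here ...
}
```

With this in place, a restart leads to a fresh backend fetch rather than another lookup of the stale object, which is what breaks the loop behind "Too many restarts".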
@Stepa4man: can you maybe also take a look at this to see if this proposed fix is ok? HostedPower helped us solve this issue yesterday on one of our shops which is hosted with them, where we ran into unexplainable 503 Varnish errors. Their change seems to have fixed it 👍 |
@engcom-Alfa @engcom-Bravo @engcom-Charlie, |
Hi @engcom-Delta. Thank you for working on this issue.
Hi @HOSTED-POWER thank you for your report. I am not able to reproduce issue by steps you described on 2.3-develop. If you'd like to update the issue, please reopen it. |
Hello @engcom-Delta, did you enable `varnishlog -q 'RespStatus == 503' -g request` and then crawl the whole site? Try that again after 30 minutes and again after 2 hours; it should really start happening :/
Used varnish 6.2 btw, not sure that would matter |
PS: The vcl is for Varnish 6.2: 8823790#diff-2f64f6171deecba61bea147539cf72ec So at least I would test with that and not 6.0.5 which is outdated for this test. Furthermore, if you want to see it even faster, try the caching of static files: I.e. change this part:
In any case you need to crawl the whole site, not just look to the homepage and assume it didn't occur :) |
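For context, the static-files part of `vcl_recv` with caching switched on might look like the sketch below. This is an assumption about the shape of that block, not a copy of the snippet referred to above. The URL pattern and the two headers are illustrative, and a generated Magento VCL may unset further deployment-specific headers:

```vcl
sub vcl_recv {
    if (req.url ~ "^/(pub/)?(media|static)/") {
        # Default behaviour (assumed): static files bypass the cache.
        # return (pass);

        # Static caching enabled: drop headers that would otherwise make
        # these requests vary per visitor, so the objects can be cached.
        unset req.http.Https;
        unset req.http.Cookie;
    }
}
```

Caching static files puts many more objects through `vcl_hit`, which would be consistent with the loop showing up sooner.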
@HOSTED-POWER thanks for reply. Rechecked on varnish 6.2.2 and issue is not reproducible: |
With static caching enabled and with crawling the whole site? I think the shop in your screenshot was empty (it's not happening on all objects on all pages). Also, it takes some time: sometimes it works fine for a few hours, but it happens eventually. Sometimes it happens very fast too, but you need to let it run for a longer time. We noticed that with static file caching enabled it occurred even faster, so it would be nice to check that. Last but not least, Varnish itself states in the documentation how you should replace the "miss": https://varnish-cache.org/docs/6.2/whats-new/upgrading-6.2.html#whatsnew-upgrading-2019-03
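The `vcl_hit` side of that replacement can be sketched roughly as follows. This is a simplified illustration, not Magento's shipped VCL: grace and backend-health handling are left out, and it only works together with forcing a miss on restarted requests in `vcl_recv` (the `req.hash_always_miss` code quoted elsewhere in this thread):

```vcl
sub vcl_hit {
    if (obj.ttl >= 0s) {
        # Object is still fresh: serve it from cache.
        return (deliver);
    }

    # Before Varnish 6.2 this path could end with return (miss).
    # That return was removed, so the request is restarted instead.
    # Without the hash_always_miss handling in vcl_recv, the restarted
    # request finds the same expired object and ends up here again.
    return (restart);
}
```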
@HOSTED-POWER Still cannot reproduce issue: |
Veryyyyyy strange :) We use nginx --> varnish --> nginx, but I doubt that's the reason. We saw it on several sites for sure, at least 7 or 8 different ones (production websites, so not with the default theme etc.). |
Sadly, this problem also occurred on one of my main projects. I can confirm that these 503 errors are happening out of nowhere. In my case there were other problems (memory issues), so I thought the problem came from those. But no, those weren't related. The fix seems to have solved my issue. |
Hi, I have a question regarding this. We were having the same issue. We added the code from @HOSTED-POWER but still saw an error, this time reporting out of "workspace (bo)" in the log file, which led me to this: https://www.claudiokuenzler.com/blog/737/varnish-panic-crash-low-sess-workspace-backend-client-sizing

What I think is happening is one of two things. Either adding the code to set req.hash_always_miss = true; allowed the error stack to finally finish with that error, where before it was just returning 503 early. Or, now that I am setting that, enough restarts happen to run out of workspace. Either that, or this was a totally unrelated error. So my question is: after applying this fix, did anyone else get "workspace (bo)"?

Also, FYI, you can log the 50x errors with this command: `varnishlog -a -A -w /var/log/varnish/varnish50x.log -q "RespStatus >= 500 or BerespStatus >= 500"`

FYI, we also use nginx SSL --> varnish --> nginx, but the last nginx is a separate server, all on ports 80 and 443, with the A record pointed to nginx SSL. We do not have 503 errors anymore except these in admin: https://prnt.sc/qe81q9 which I am still debugging. Anyway, is anyone having the "workspace (bo)" issue with the restart issue fix? |
Hello @weismannweb , I'm not sure if I understand it completely (lack of time atm), however after using the updated VCL we had 0 critical errors. So I don't think we hit that error (if I understand correctly that's a critical one too) |
PS: I see we have this as a default in our optimized settings: "-p workspace_backend=320k " |
I have the same issue, as described. You don't even need to surf the website. The website has ~5 products and 7 CMS pages. |
Let me re-open this issue, it seems that the error only occurs after a while, but @engcom-Delta only took a few minutes to test the issue, so that's not really representative. |
@zhartaunik @weismannweb |
BIG DISCLAIMER: I am totally new to Varnish cache with Magento 2.3, which we used on this project for the first time because it is a large and heavily trafficked site, so what I write below is a total guess. Please bear that in mind.

I think it depends on which of the changes we made fixed it. I think the restarts code fixed it, but then I got the workspace error. I am not sure if they are related or separate. If it is ultimately because of this https://www.claudiokuenzler.com/blog/737/varnish-panic-crash-low-sess-workspace-backend-client-sizing and the code "if (req.restarts > 0) { set req.hash_always_miss = true; }" fixes the restarts, so that the restarts no longer cause the 503 error but the workspace then runs out from too many restarts, then I would look to this statement as to how to reproduce:
Mentioned here http://www.streppone.it/cosimo/blog/2010/03/varnish-sess_workspace-and-why-it-is-important/ Which I think indicates you have to have a large number of headers and/or be manipulating them too. Also, my site has a lot of redirects happening, which might add to it. Also, note we have a store with 5000 products and 350 categories, many extensions, and several layered navigation options on each category page.

Here is one of our varnish logs with 50x errors before we made the final fix. https://www.dropbox.com/s/4a8mlaj03wjl6up/varnish50x.log-old2?dl=0 Here is a working vcl but it might be useful to see what we are doing with headers: Here is our varnish settings for systemd with it now working:

That is about all I can add. I have had a cron run `varnishlog -a -A -w /var/log/varnish/varnish50x.log -q "RespStatus >= 500 or BerespStatus >= 500"` 24x7 and we have yet to get a single 50x error with the restart code fix and the workspace fix. |
@ihor-sviziev it seems confirmation took quite a while, happy it's finally getting confirmed :) |
I've created PR for fixing this issue #28137 |
Hey everyone, while the following does solve the problem, it doesn't solve it when you have a distributed deployment with multiple FE instances:
The problem happens when you add or remove backends in varnish.vcl and reload the Varnish service (emphasis on reload, not restart, to keep everything in cache and just reconfigure): backend fetches fail for a short interval (10ish seconds for us), resulting in HTTP 503 for users that didn't hit the cache. Fix for that is to set N to a number which is
P.S. you can check what the max_restarts value by using Oh, we've also added following snippet to force retry up to
Edit: Note that this kind of setup uses Varnish Transient storage (short-lived cache), and if you don't set a memory limit for that storage, it will eat up your RAM and eventually lead to a crash of the server (source: https://varnish-cache.org/docs/trunk/users-guide/storage-backends.html, search for "By default Varnish would use an unlimited malloc backend for this."), so make sure to edit your startup script for Varnish and name the Transient storage with a limit. E.g. P.S. thanks @robolmos for pointing out |
@lotar I'm not a Varnish expert, but you might want to look at other solutions, like compiling the VCL, letting the backends register as healthy, and only then loading the VCL. Or update the backends config so they are healthy initially. Maybe it's OK to retry on a server-side error to help prevent the client getting transient backend errors. In vcl_backend_response() I believe it's technically return(retry), which uses the max_retries value rather than max_restarts. |
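As an illustration of that last point, a retry on server-side errors could be sketched like this. The status threshold and the cap of 3 attempts are assumptions chosen for the example, not values from this thread. Varnish additionally enforces its own `max_retries` parameter on top of any limit written in VCL:

```vcl
sub vcl_backend_response {
    # Retry the backend fetch on a server-side error instead of passing
    # the error on. bereq.retries counts the retries done so far for
    # this fetch; the limit of 3 is an illustrative assumption.
    if (beresp.status >= 500 && bereq.retries < 3) {
        return (retry);
    }
}
```

Unlike `return (restart)` on the client side, `return (retry)` only repeats the backend fetch, so it is counted against `max_retries` rather than `max_restarts`.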
Hey @robolmos, Thanks for the feedback.
Nor am I, but compiling / reloading the VCL after the new backend is healthy is not an option for us given the rest of the setup.
Fixed, ty ;) |
@ihor-sviziev honestly I think there's no need since my update was specific for infrastructure setup (auto scaling group issue). While it does solve the problem for us, it won't necessarily be 100% correct solution (or even needed) for different kinds of setup. Also rethinking the problem, it would be better to do What I'd suggest on the other hand is to update official documentation for Varnish 6 setup regarding Transient storage explained here. Reason being is that default Varnish installation has no memory limit and fix from the PR actually uses this kind of storage. It is infrastructure part as well (Varnish startup setup) but if not set correctly it will cause Varnish service to eat up RAM in combination with this VCL (from PR) eventually and will lead to site going down. To conclude, I'd say it's up to you ;) |
Hi @HOSTED-POWER. Thank you for your report. The fix will be available with the upcoming 2.4.1 release. |
backports magento#24353 to Magento 2.3
Preconditions (*)
Magento 2.4-develop;
Production mode;
Sample Data;
Php 7.3;
Varnish v. 6.2
We tried 8823790#diff-2f64f6171deecba61bea147539cf72ec
However it results in too many restarts after a while.
Steps to reproduce (*)
Go to Admin->Stores->Configuration->System->Full Page Cache:
Use the VCL on production site with varnish 6.2, after a while certain objects get into a restart loop.
Expected result (*)
We expect no 503 errors caused by restart
Actual result (*)
✖️ VCL keeps restarting forever resulting in: - VCL_Error Too many restarts after a few tries
![Peek 2020-04-28 11-20](https://user-images.githubusercontent.com/51679138/80469217-ecb29a80-8948-11ea-93ed-0082ad2c97e4.gif)
503 Response status and VCL error: Too many restarts
Varnish logs
![Screenshot from 2020-04-28 11-24-25](https://user-images.githubusercontent.com/51679138/80469489-4b781400-8949-11ea-9203-2788b33a0423.png)
![screenshotvar](https://user-images.githubusercontent.com/51679138/80469502-50d55e80-8949-11ea-9fed-08fabe799dab.png)
![Screenshot from 2020-04-28 11-27-17](https://user-images.githubusercontent.com/51679138/80469515-53d04f00-8949-11ea-8a71-22df393d2e44.png)