
Magento varnish 6 Too many restarts #24353

Closed
HOSTED-POWER opened this issue Aug 29, 2019 · 57 comments · Fixed by #28137
Labels
Component: PageCache · Fixed in 2.4.x · Issue: Confirmed · Issue: Format is valid · Issue: Ready for Work · Priority: P2 · Reproduced on 2.4.x · Severity: S1 · Triage: Dev.Experience

Comments

@HOSTED-POWER

HOSTED-POWER commented Aug 29, 2019

Preconditions (*)

Magento 2.4-develop;
Production mode;
Sample Data;
PHP 7.3;
Varnish v. 6.2

We tried the Varnish 6 VCL from 8823790#diff-2f64f6171deecba61bea147539cf72ec.

However, it results in too many restarts after a while.

Steps to reproduce (*)

  1. Configure Varnish and the web server (Apache 2 in my case) as described in the devdocs;
  2. Configure Magento to use Varnish: go to Admin -> Stores -> Configuration -> System -> Full Page Cache:
     • set Cache Application to Varnish Cache;
     • set TTL for public content to 120 sec;
  3. Export the Varnish 6 VCL, configure it and restart the varnish service;
  4. Run varnishlog in a console;
  5. Go to the storefront and browse the site for longer than 10 minutes.

Use the VCL on a production site with Varnish 6.2; after a while certain objects get into a restart loop.

Expected result (*)

We expect no 503 errors caused by restarts.

Actual result (*)

✖️ The request keeps restarting until Varnish gives up, resulting in VCL_Error "Too many restarts" after a few tries.
503 response status and VCL error: Too many restarts.

Varnish logs: (screenshots attached to the original report)

@m2-assistant

m2-assistant bot commented Aug 29, 2019

Hi @HOSTED-POWER. Thank you for your report.
To help us process this issue please make sure that you provided the following information:

  • Summary of the issue
  • Information on your environment
  • Steps to reproduce
  • Expected and actual results

Please make sure that the issue is reproducible on a vanilla Magento instance following the steps to reproduce. To deploy a vanilla Magento instance on our environment, please add a comment to the issue:

@magento give me 2.3-develop instance - upcoming 2.3.x release

For more details, please, review the Magento Contributor Assistant documentation.

@HOSTED-POWER do you confirm that you were able to reproduce the issue on vanilla Magento instance following steps to reproduce?

  • yes
  • no

@magento-engcom-team magento-engcom-team added the Issue: Format is not valid and Issue: Format is valid labels and removed the Issue: Format is not valid label Aug 29, 2019
@HOSTED-POWER
Author

HOSTED-POWER commented Aug 29, 2019

PS: We read the information here: https://varnish-cache.org/docs/6.2/whats-new/upgrading-6.2.html#whatsnew-upgrading-2019-03

and added in vcl_recv:

if (req.restarts > 0) { set req.hash_always_miss = true; }

This improved the situation, although we're not sure whether it is resolved properly and 100% supported.

Update: We did further tests and it looks properly solved.

@ihor-sviziev
Contributor

Hi @HOSTED-POWER,
Thank you for the report!

Could you add steps to reproduce, to make sure that we'll be able to reproduce this issue?

It's really good that you already found a solution for your issue. Could you create a pull request with the suggested fix?

@HOSTED-POWER
Author

HOSTED-POWER commented Aug 30, 2019

Hello

To reproduce, install any Magento site (we had 2.3.1 and some 2.x versions) and wait for it to happen:

You can see the log like this:

varnishlog -q 'RespStatus == 503' -g request

Probably after just a few minutes you will see the 503 on certain objects, and Varnish goes into a guru meditation error :)

We've seen it on all sites we tried it on, so it will be hard not to notice it.

To be on the safe side, we added this at the top of vcl_recv:

sub vcl_recv {

    if (req.restarts > 0) {
        set req.hash_always_miss = true;
    }

    if (req.method == "PURGE") {
...

@hostep
Contributor

hostep commented Aug 30, 2019

@Stepa4man: can you maybe also take a look at this to see if this proposed fix is ok?

HostedPower helped us solve this issue yesterday on one of our shops which is hosted with them, where we ran into unexplainable 503 Varnish errors. Their change seems to have fixed it 👍

@ihor-sviziev
Contributor

ihor-sviziev commented Sep 4, 2019

@engcom-Alfa @engcom-Bravo @engcom-Charlie,
Could you verify this issue?

@engcom-Delta engcom-Delta self-assigned this Nov 7, 2019
@m2-assistant

m2-assistant bot commented Nov 7, 2019

Hi @engcom-Delta. Thank you for working on this issue.
In order to make sure that the issue has enough information and is ready for development, please read and check the following instructions: 👇

  • 1. Verify that issue has all the required information. (Preconditions, Steps to reproduce, Expected result, Actual result).

Details: If the issue has a valid description, the label Issue: Format is valid will be added to the issue automatically. Please edit the issue description if needed, until the label Issue: Format is valid appears.

  • 2. Verify that issue has a meaningful description and provides enough information to reproduce the issue. If the report is valid, add Issue: Clear Description label to the issue by yourself.

  • 3. Add Component: XXXXX label(s) to the ticket, indicating the components it may be related to.

  • 4. Verify that the issue is reproducible on 2.3-develop branch

    Details:
    - Add the comment @magento give me 2.3-develop instance to deploy a test instance on Magento infrastructure.
    - If the issue is reproducible on the 2.3-develop branch, please add the label Reproduced on 2.3.x.
    - If the issue is not reproducible, add a comment that the issue is not reproducible, close the issue, and stop the verification process here!

  • 5. Add label Issue: Confirmed once verification is complete.

  • 6. Make sure that automatic system confirms that report has been added to the backlog.

@engcom-Delta
Contributor

Hi @HOSTED-POWER, thank you for your report. I am not able to reproduce the issue following the steps you described on 2.3-develop.

(screenshots attached in the original comment)

If you'd like to update the issue, please reopen it.

@HOSTED-POWER
Author

Hello @engcom-Delta

Did you run

varnishlog -q 'RespStatus == 503' -g request

And then crawled the whole site? Try again after 30 min and again after 2 hours; it should really start happening :/

@HOSTED-POWER
Author

Used varnish 6.2 btw, not sure that would matter

@HOSTED-POWER
Author

HOSTED-POWER commented Nov 7, 2019

PS: The vcl is for Varnish 6.2: 8823790#diff-2f64f6171deecba61bea147539cf72ec

So I would at least test with that and not 6.0.5, which is outdated for this test.

Furthermore, if you want to see it even faster, try enabling caching of static files:

I.e. change this part:


    # Static files caching
    if (req.url ~ "^/(pub/)?(media|static)/") {
        # Static files should not be cached by default
        # return (pass);

        # But if you use a few locales and don't use CDN you can enable caching static files by commenting previous line (#return (pass);) and uncommenting next 3 lines
        unset req.http.Https;
        unset req.http./* {{ ssl_offloaded_header }} */;
        unset req.http.Cookie;
    }

In any case you need to crawl the whole site, not just look at the homepage and assume it didn't occur :)

@engcom-Delta
Contributor

@HOSTED-POWER thanks for the reply. Rechecked on Varnish 6.2.2 and the issue is not reproducible:
(screenshot attached in the original comment)

@HOSTED-POWER
Author

With static caching enabled and while crawling the whole site? I think the shop in your screenshot was empty (it's not happening on all objects on all pages).

Also, it takes some time; sometimes it even works fine for a few hours, but it happens eventually. Sometimes it happens very fast too, but you need to let it run for a longer time.

We noticed that with static file caching enabled it occurred even faster, so it would be nice to check that.

Last but not least, Varnish itself states in the documentation how you should replace the "miss": https://varnish-cache.org/docs/6.2/whats-new/upgrading-6.2.html#whatsnew-upgrading-2019-03

return(miss) from vcl_hit{} is now removed. An option for implementing similar functionality is:

  • return (restart) from vcl_hit{};
  • in vcl_recv{}, for the restart (when req.restarts has increased), set req.hash_always_miss = true.

The sketch below illustrates this pattern.
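
For context, here is a minimal VCL sketch of that pattern, assuming a vcl_hit that restarts once an object is stale beyond grace (illustrative only, not the exact Magento template):

    sub vcl_recv {
        if (req.restarts > 0) {
            # A restarted request skips the cache lookup and fetches from the backend
            set req.hash_always_miss = true;
        }
        # ... rest of vcl_recv ...
    }

    sub vcl_hit {
        if (obj.ttl >= 0s) {
            # Object is still fresh
            return (deliver);
        }
        if (obj.ttl + obj.grace > 0s) {
            # Object is stale but within grace
            return (deliver);
        }
        # Stale beyond grace: previously return (miss), now restart the request;
        # the hash_always_miss flag above makes the second pass go to the backend
        return (restart);
    }

Without the hash_always_miss guard, the restarted request hits the same stale object in vcl_hit again and restarts once more, until Varnish gives up with "Too many restarts".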

@engcom-Delta
Contributor

@HOSTED-POWER Still cannot reproduce the issue:
(screenshot attached in the original comment)
In default.vcl, static file caching is enabled, and the header with the MISS value is taken from the VCL file that was generated from the Magento admin:
https://devdocs.magento.com/guides/v2.3/config-guide/varnish/config-varnish-final.html

@HOSTED-POWER
Author

Veryyyyyy strange :)

We use nginx --> varnish --> nginx, but I doubt that's the reason.

We saw it on several sites for sure, at least 7 or 8 different ones (production websites, so not with the default theme etc.).

@laccas

laccas commented Dec 20, 2019

Sadly, this problem also occurred on one of my main projects. I can confirm that these 503 errors are happening out of nowhere. In my case there were other problems (memory issues), so I thought the problem came from those. But no, those weren't related. The fix seems to have solved my issue.

@weismannweb

weismannweb commented Dec 22, 2019

Hi,

I have a question regarding this. We were having this same issue. We added that code from @HOSTED-POWER, but we still saw an error; this time it had out of "workspace (bo)" in the log file, which led me to this: https://www.claudiokuenzler.com/blog/737/varnish-panic-crash-low-sess-workspace-backend-client-sizing

Now, what I think is happening is that once we added that code to set req.hash_always_miss = true;, it allowed the request to finally finish with that error, where before it was just returning a 503 early. Or maybe, now that I am setting that, enough restarts happen to run out of workspace. Either that, or this was a totally unrelated error.

So my question is: after applying this fix, did anyone else get "workspace (bo)"? Also, FYI, you can log the 50x errors with this command:

varnishlog -a -A -w /var/log/varnish/varnish50x.log -q "RespStatus >= 500 or BerespStatus >= 500"

FYI, we also use nginx SSL --> varnish --> nginx, but the last nginx is a separate server, all on ports 80 and 443, with the A record pointed to the nginx SSL server.

We do not have 503 errors anymore, except these in admin: https://prnt.sc/qe81q9, which I am still debugging.

Anyway, is anyone else seeing the "workspace (bo)" issue together with the restart fix?

@HOSTED-POWER
Author

Hello @weismannweb, I'm not sure I understand it completely (lack of time atm); however, after using the updated VCL we had 0 critical errors. So I don't think we hit that error (if I understand correctly, that's a critical one too).

@HOSTED-POWER
Author

PS: I see we have this as a default in our optimized settings: "-p workspace_backend=320k "

@zhartaunik
Contributor

We have the same issue as described. You don't even need to surf the website; ours has ~5 products and 7 CMS pages.
I simply opened the homepage once. Within an hour there was already an inline 503 error (returned by Varnish as part of the content) instead of the menu (this was on Friday). After the weekend the site was totally down and we got a 503 error from Varnish.
The issue reproduced on staging and the local environment (but with the same Docker configuration).

@hostep
Contributor

hostep commented Dec 24, 2019

Let me re-open this issue, it seems that the error only occurs after a while, but @engcom-Delta only took a few minutes to test the issue, so that's not really representative.

@hostep hostep reopened this Dec 24, 2019
@ghost ghost unassigned engcom-Delta Dec 24, 2019
@Stepa4man
Contributor

@zhartaunik @weismannweb
Do you have clear steps to reproduce?

@weismannweb

weismannweb commented Dec 24, 2019

BIG DISCLAIMER: I am totally new to Varnish Cache with Magento 2.3, which we used on this project for the first time as it is a large and heavily trafficked site, so what I write below is a total guess. Please bear that in mind.

I think it depends on which change we made actually fixed it. I think the restarts code fixed it, but then I got the workspace error; I am not sure if they are related or separate. If in the end it is because of this (https://www.claudiokuenzler.com/blog/737/varnish-panic-crash-low-sess-workspace-backend-client-sizing), and the code "if (req.restarts > 0) { set req.hash_always_miss = true; }" fixes the restarts, so that once the restarts no longer cause the 503 error the workspace runs out from too many restarts, then I would look to this statement as to how to reproduce:

We found out we had to increase the default (16kb), especially since we're doing quite a bit of HTTP header copying and rewriting around. In fact, if you do that, each varnish thread uses a memory space at most sess_workspace bytes.

If you happen to need more space, maybe because clients are sending long HTTP header values, or because you are (like we do) writing lots of additional varnish-specific headers, then Varnish won't be able to allocate enough memory, and will just write the assert condition on syslog and drop the request.

Mentioned here http://www.streppone.it/cosimo/blog/2010/03/varnish-sess_workspace-and-why-it-is-important/

Which I think indicates you have to have a large number of headers and/or be manipulating them too. Also, my site has a lot of redirects happening, which might add to it.

Also, note we have a store with 5000 products and 350 categories, many extensions, and several layered navigation options on each category page.

Here is one of our varnish logs with 50x errors before we made the final fix.

https://www.dropbox.com/s/4a8mlaj03wjl6up/varnish50x.log-old2?dl=0

Here is a working vcl but it might be useful to see what we are doing with headers:
https://www.dropbox.com/s/ouq5tmhwdgwouew/default.vcl?dl=0

Here are our Varnish settings for systemd with it now working:
ExecStart=/usr/sbin/varnishd -a :80 -T 127.0.0.1:8080 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,4096m -p thread_pool_min=200 -p thread_pool_max=4000 -p thread_pool_add_delay=2 -p http_req_size=64000 -p http_resp_hdr_len=90536 -p http_resp_size=120536 -p cli_timeout=25 -p workspace_client=256k -p workspace_backend=256k

That is about all I can add. I have had a cron run varnishlog -a -A -w /var/log/varnish/varnish50x.log -q "RespStatus >= 500 or BerespStatus >= 500" 24x7 and we have yet to get a single 50x error with the restart code fix and the workspace fix.

@HOSTED-POWER
Author

@ihor-sviziev it seems confirmation took quite a while, happy it's finally getting confirmed :)

ihor-sviziev added a commit to ihor-sviziev/magento2 that referenced this issue May 7, 2020
@ihor-sviziev ihor-sviziev linked a pull request May 7, 2020 that will close this issue
@ihor-sviziev
Contributor

I've created a PR to fix this issue: #28137

@lotar

lotar commented Jul 15, 2020

Hey everyone, while the following does solve the problem, it doesn't solve it when you have a distributed deployment with multiple FE instances:

Just experienced the same issue with Varnish 6. That condition does seem to fix the issue so far. I would suggest editing the varnish6.vcl to include that when generated. See patch code below.

Index: app/code/Magento/PageCache/etc/varnish6.vcl
===================================================================
--- app/code/Magento/PageCache/etc/varnish6.vcl	(revision 7aa94564d85e408baea01abc5315a0441401c375)
+++ app/code/Magento/PageCache/etc/varnish6.vcl	(date 1582819203827)
@@ -23,6 +23,10 @@
 }
 
 sub vcl_recv {
+    if (req.restarts > 0) {
+        set req.hash_always_miss = true;
+    }
+
     if (req.method == "PURGE") {
         if (client.ip !~ purge) {
             return (synth(405, "Method not allowed"));

The problem happens when you add/remove backends in varnish.vcl and reload the varnish service (emphasis on reload, not restart, to keep everything in cache and just reconfigure): backend fetches fail for a short interval (about 10 seconds for us), resulting in HTTP 503 for users that didn't hit the cache.

The fix for that is to set N to max_restarts - 1, just below the configuration for max_restarts:

    if (req.restarts > N) {
        set req.hash_always_miss = true;
    }

P.S. You can check the max_restarts value by using the varnishadm command.

Oh, we've also added the following snippet to force a retry up to max_restarts - 1 times (to give Varnish time to see that the backend is unhealthy via the probe in the backend config):

    # cache only successful responses and 404s
    if (beresp.status != 200 && beresp.status != 404) {
        set beresp.ttl = 0s;
        set beresp.uncacheable = true;
        return (deliver);
    } elsif (beresp.status > 500) { // THIS IS ADDED
        return (retry); // THIS IS ADDED
    } elsif (beresp.http.Cache-Control ~ "private") {
        set beresp.uncacheable = true;
        set beresp.ttl = 86400s;
        return (deliver);
    }

Edit:

Note that this kind of setup uses Varnish Transient storage (a short-lived cache), and if you don't set a memory limit for that storage it will eat up your RAM and eventually lead to a crash of the server (source: https://varnish-cache.org/docs/trunk/users-guide/storage-backends.html, search for "By default Varnish would use an unlimited malloc backend for this."), so make sure to edit your Varnish startup script and define the Transient storage with a limit.

E.g. /usr/sbin/varnishd -a :6081 -f /etc/varnish/default.vcl -s Cache=malloc,2048m -s Transient=malloc,512m

P.S. Thanks @robolmos for pointing out the return(retry) issue; I've updated the example above.

@VladimirZaets VladimirZaets added the Triage: Dev.Experience label Jul 15, 2020
@robolmos

@lotar I'm not a Varnish expert, but you might want to look at other solutions like compiling the VCL, letting the backends register as healthy, then load the VCL. Or, update the backends config to be healthy initially.

Maybe it's OK to retry with a server-side error to help prevent the client getting transient backend errors.. in vcl_backend_response() I believe it's technically return(retry) and uses the max_retries value rather than max_restarts.

@VladimirZaets VladimirZaets added the Priority: P2 label Aug 4, 2020
@ghost ghost assigned ihor-sviziev Aug 4, 2020
@ghost ghost moved this from Ready for Dev to PR In Progress in Community Backlog Aug 4, 2020
@ghost ghost moved this from PR In Progress to Ready for Dev in Community Backlog Aug 5, 2020
@lotar

lotar commented Aug 5, 2020

Hey @robolmos,

Thanks for the feedback.

@lotar I'm not a Varnish expert, but you might want to look at other solutions like compiling the VCL, letting the backends register as healthy, then load the VCL. Or, update the backends config to be healthy initially.

Nor am I, but compiling/reloading the VCL after a new backend is healthy is not an option for us given the rest of the setup.

Maybe it's OK to retry with a server-side error to help prevent the client getting transient backend errors.. in vcl_backend_response() I believe it's technically return(retry) and uses the max_retries value rather than max_restarts.

Fixed, ty ;)

@ihor-sviziev
Contributor

@lotar should I update my PR #28137 ?

@lotar

lotar commented Aug 5, 2020

@lotar should I update my PR #28137 ?

@ihor-sviziev honestly I think there's no need, since my update was specific to our infrastructure setup (auto scaling group issue). While it does solve the problem for us, it won't necessarily be a 100% correct solution (or even needed) for different kinds of setups.

Also, rethinking the problem, it would be better to use
} elsif (beresp.status == 503) {
instead of
} elsif (beresp.status > 500) {
since it's here for a specific reason (see the sketch below)...
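
A minimal sketch of how that narrower condition could be wired into a vcl_backend_response like the fragment quoted earlier (illustrative only, not the shipped Magento template; note the 503 check is placed before the generic non-200/404 branch so the retry is actually reachable):

    sub vcl_backend_response {
        # ... earlier template logic unchanged ...

        if (beresp.status == 503) {
            # Retry only genuine backend-unavailable responses; bounded by max_retries
            return (retry);
        }

        # cache only successful responses and 404s
        if (beresp.status != 200 && beresp.status != 404) {
            set beresp.ttl = 0s;
            set beresp.uncacheable = true;
            return (deliver);
        } elsif (beresp.http.Cache-Control ~ "private") {
            set beresp.uncacheable = true;
            set beresp.ttl = 86400s;
            return (deliver);
        }

        # ... rest of the template unchanged ...
    }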

What I'd suggest, on the other hand, is to update the official documentation for the Varnish 6 setup regarding Transient storage, explained here.

The reason is that a default Varnish installation has no memory limit for it, and the fix from the PR actually uses this kind of storage.

It is an infrastructure concern as well (the Varnish startup setup), but if not set correctly it will, in combination with this VCL (from the PR), eventually cause the Varnish service to eat up RAM and lead to the site going down.

To conclude, I'd say it's up to you ;)

@slavvka
Member

slavvka commented Aug 8, 2020

Hi @HOSTED-POWER. Thank you for your report.
The issue has been fixed in #28137 by @ihor-sviziev in the 2.4-develop branch.
Related commit(s):

The fix will be available with the upcoming 2.4.1 release.

@slavvka slavvka added the Fixed in 2.4.x label Aug 8, 2020
@slavvka slavvka closed this as completed Aug 8, 2020
@ghost ghost moved this from Ready for Dev to Done (last 30 days) in Community Backlog Aug 8, 2020
magento-engcom-team added a commit that referenced this issue Aug 8, 2020
 - Merge Pull Request #28137 from ihor-sviziev/magento2:fix-varnish6-too-many-restarts
 - Merged commits:
   1. cef2ac3
jonashrem added a commit to jonashrem/magento2 that referenced this issue Sep 27, 2021
backports magento#24353  to Magento 2.3
@jonashrem jonashrem mentioned this issue Sep 27, 2021