New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backend marked as unhealthy even so no request was unsuccessful #667

Closed
stp-ip opened this Issue Mar 9, 2016 · 16 comments

Comments

3 participants
@stp-ip
Contributor

stp-ip commented Mar 9, 2016

1. What version of Caddy are you running (caddy -version)?

0.8.1

2. What are you trying to do?

Proxy traffic to an apache backend with mod_php

3. What is your entire Caddyfile?

https://testing.example.com {
        proxy / localhost:8080 {
                proxy_header Host {host}
                proxy_header X-Real-IP {remote}
                proxy_header X-Forwarded-Proto {scheme}
        }

        header / Strict-Transport-Security "max-age=15768000; includeSubDomains; preload"

        log /var/www/access.log {
                rotate {
                        size 100
                }
        }
        errors {
                log /var/www/error.log {
                        size 50
                }
        }
        tls certs@example.com
}

4. How did you run Caddy (give the full command and describe the execution environment)?

./caddy -agree -conf caddyfile
Run on Ubuntu 14.04.4 LTS

5. What did you expect to see?

I did expect to see no errors.

6. What did you see instead (give full error messages and/or log)?

  1. Running siege to check the responses for plain apache and proxied traffic.
  2. Run specific request, which let's caddy jump into error mode
  3. 10s of errors 502 from caddy, while apache is serving none.

It seems as the faulty request is letting caddy mark the backend as down and therefore make it unavailable for 10s (502), but the seemingly faulty request still comes back successful.

Setting a backend to unhealthy with only one backend available seems like a bad default.
max_fails 0 fixes the issue and during the seemingly faulty request no 502 error comes up.

Discussed the issue with @abiosoft on slack. Thanks for taking the time to look into it.

@stp-ip stp-ip changed the title from Backend marked as unhealthy even so no request was unsucessful to Backend marked as unhealthy even so no request was unsuccessful Mar 9, 2016

@mholt

This comment has been minimized.

Owner

mholt commented Mar 9, 2016

Setting a backend to unhealthy with only one backend available seems like a bad default.

I think I agree, although I wonder if there's a gotcha with just letting a single backend get hammered.

Run specific request, which let's caddy jump into error mode

By this do you mean that you issue a request which causes the backend to issue an improper response? Typically, if Caddy gets an error while doing a RoundTrip to upstream, it will return an error thus marking the backend as down.

@stp-ip

This comment has been minimized.

Contributor

stp-ip commented Mar 9, 2016

Setting a backend to unhealthy with only one backend available seems like a bad default.

I think I agree, although I wonder if there's a gotcha with just letting a single backend get hammered.

The proxy could just be used for ssl termination or so and a single backend would be down 10s, if only one error occurs.

Run specific request, which let's caddy jump into error mode

By this do you mean that you issue a request which causes the backend to issue an improper response?

Typically, if Caddy gets an error while doing a RoundTrip to upstream, it will return an error thus marking the backend as down.
The request doesn't seem to issue an improper response from the backend. I am not really sure why caddy marks the backend as unhealthy. I could not reproduce any error with apache. Only caddy did error out, but only when max_fails was not 0. With max_fails set to 0 no error was recorded with caddy.

Hope that clears it up.

@mholt

This comment has been minimized.

Owner

mholt commented Mar 9, 2016

@stp-ip If you have Go installed, could we delve into this further? I'd like to know more why a request that succeeds with Apache tricks Caddy into thinking it is down. Not marking the upstream as "down" if there's only one upstream is ancillary, and we can revisit it after we track this down first.

Could you please print the value of backendErr above this line:

if backendErr == nil {

That will give us a better idea what is happening.

@stp-ip

This comment has been minimized.

Contributor

stp-ip commented Mar 10, 2016

Sure we can dig deeper. I will try to get a clean install and get the backendErr value. Could take me a few days so. Will get back to you.

@mholt mholt added the help wanted label Mar 21, 2016

@mholt

This comment has been minimized.

Owner

mholt commented Mar 21, 2016

Have you had a chance yet to find out what backendErr is? (If backendErr is not nil, it returns a nil error value. Perhaps we should return backendErr instead so it gets logged in the error log. What do you think?)

@stp-ip

This comment has been minimized.

Contributor

stp-ip commented Mar 21, 2016

Not yet unfortunately. But it's not forgotten. ;)

@mholt

This comment has been minimized.

Owner

mholt commented Apr 12, 2016

...The suspense is killin' me 😄

@stp-ip

This comment has been minimized.

Contributor

stp-ip commented Apr 12, 2016

I do my best to make open source work more entertaining and exciting...

@stp-ip

This comment has been minimized.

Contributor

stp-ip commented Apr 14, 2016

Finally got to setting up the test environment. Tried to reproduce this with v0.8.1, v0.8.2 and master.

Unfortunate thing is. I could not get this to reproduce again. Will close for now.

Sorry for the suspense for nothing :(

@stp-ip stp-ip closed this Apr 14, 2016

@mholt mholt removed the help wanted label Apr 14, 2016

@mholt

This comment has been minimized.

Owner

mholt commented Apr 14, 2016

We're actually working on this issue in chat right now with someone else (a brave soul who is willing to make the change in production since reproducing it was hard!), so we may have a fix forthcoming soon anyway.

Will reopen until we know more 😉

@mholt mholt reopened this Apr 14, 2016

@robertmeta

This comment has been minimized.

robertmeta commented Apr 14, 2016

@mholt I ("brave soul") am putting together a patch for this. Basically, it feels like we got a few issues... should we split them up?

  1. Random failures while backend is responding. This is what we are debugging in production.
  2. What to do during an non-bug failure with a single backend. I think the answer is nothing, no timeout, just keep trying!
  3. Do we need a SEPARATE feature for being used as a circuit breaker? This option could trigger on 429, 503, 509 and backend down events? I know some people running delicate stacks behind Caddy might like this (ruby, python, etc).... IE: anti-hammer.
@stp-ip

This comment has been minimized.

Contributor

stp-ip commented Apr 14, 2016

@robertmeta
Might me needed to split them up

  1. I tried to debug the issue both in a test environment and in production, but couldn't reproduce yet.
  2. I would say having "max_fails" default to 1 with multiple upstreams and to 0 with a single upstream. Not perfect, but the saner default and with a minimal documentation easy to reason about.
  3. Not sure what you are getting at. Will jump on gitter to get more insight.
@robertmeta

This comment has been minimized.

robertmeta commented Apr 14, 2016

@stp-ip

  1. We get reproducers at least every few days -- and we get a handful on most normal days. During the weekend rush I can virtually guarantee one, no worried about that part.
  2. max_fails to upstreams-1 would make sense right? 1 backend, it is 0, 5 backends it is 4?
  3. Certain setups can get into corner cases were constant connections forces them into unrecoverable states (or prolongs the recovery window). On those systems haven't a circuit-breaker could be useful.
@stp-ip

This comment has been minimized.

Contributor

stp-ip commented Apr 14, 2016

  1. perfect
  2. max_fails == upstreams-1 seems like a easy solution
  3. adding circuit_breakers (default set to false), which mark a backend down even in single upstream mode on 429, 502 and 509 error codes?
@robertmeta

This comment has been minimized.

robertmeta commented Apr 14, 2016

  1. Cool.
  2. Cool (maybe we can have an override setting?!)
  3. Yep, that is my idea, I think that one I will do last, first 2 seem free and obvious.
    and
  4. If I find the source, I will post it here and try to build a PR around it.
@mholt

This comment has been minimized.

Owner

mholt commented Sep 28, 2016

I think #1135 fixes this. @robertmeta and @stp-ip feel free to pull the latest source and give it a whirl (note the description in the PR so you know what changed!) or wait until 0.9.3 and let me know! (We also improved error reporting.)

Closing for now, since we drastically changed some of the logic and defaults of these parameters.

@mholt mholt closed this Sep 28, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment