responseClassifier not working with http/1.0 requests #409

Closed
klingerf opened this issue May 17, 2016 · 3 comments · Fixed by #413
@klingerf
Member

It appears that retries aren't working as expected for http clients that don't use persistent connections when talking to linkerd.

For example, I have the following router config:

routers:
- protocol: http
  baseDtab: /http => /$/inet/127.1/9000;
  servers:
  - port: 4140
    ip: 0.0.0.0
  responseClassifier:
    kind: retryableRead5XX

This config forwards requests to a Go HTTP server running on port 9000. The Go server is configured to respond to roughly 10% of requests with a 503:

package main

import (
    "fmt"
    "math/rand"
    "net/http"
)

func main() {
    http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
        // Fail roughly 10% of requests with a 503 Service Unavailable.
        if rand.Intn(10) == 0 {
            w.WriteHeader(http.StatusServiceUnavailable)
        }
        fmt.Fprintf(w, "pong\n")
    })
    http.ListenAndServe(":9000", nil)
}
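
As a sanity check (with the server running), hitting it directly on :9000 should show roughly 10% 503s, confirming the failure injection before linkerd is involved:

$ for i in {1..100}; do curl -so /dev/null -w '%{http_code}\n' :9000/ping; done | sort | uniq -c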

If I send 1000 http/1.1 requests to the router with curl, it uses persistent connections by default and all 1000 requests succeed.

$ for i in {1..1000}; do curl -so /dev/null -w '%{http_code}\n' :4140/ping; done | sort | uniq -c
1000 200

From the router stats:

$ curl -s :9990/admin/metrics.json | jq -S . | grep -E '(success|failures)' | grep -v zipkin
  "rt/http/dst/id/$/inet/127.1/9000/failures": 119,
  "rt/http/dst/id/$/inet/127.1/9000/failures/com.twitter.finagle.service.ResponseClassificationSyntheticException": 119,
  "rt/http/dst/id/$/inet/127.1/9000/success": 1000,
  "rt/http/srv/0.0.0.0/4140/success": 1000,

You can see that roughly 10% of client requests failed, but the overall router success rate was 100%. This is confirmed by the requeue stats:

$ curl -s :9990/admin/metrics.json | jq -S . | grep requeues | grep -v zipkin
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues": 119,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.avg": 0.119,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.count": 1000,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.max": 3,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.min": 0,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p50": 0,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p90": 1,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p95": 1,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p99": 1,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p9990": 3,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p9999": 3,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.sum": 119,

If instead I send http/1.0 requests that don't use persistent connections, the retry behavior does not work as expected, and I see 502 and 503 responses from the router.

$ for i in {1..1000}; do curl -0 -so /dev/null -w '%{http_code}\n' :4140/ping; done | sort | uniq -c
 890 200
   3 502
 107 503

The router stats show an overall router success rate of 89%:

$ curl -s :9990/admin/metrics.json | jq -S . | grep -E '(success|failures)' | grep -v zipkin
  "rt/http/dst/id/$/inet/127.1/9000/failures": 122,
  "rt/http/dst/id/$/inet/127.1/9000/failures/com.twitter.finagle.ChannelClosedException": 12,
  "rt/http/dst/id/$/inet/127.1/9000/failures/com.twitter.finagle.service.ResponseClassificationSyntheticException": 110,
  "rt/http/dst/id/$/inet/127.1/9000/sourcedfailures/$/inet/127.1/9000": 9,
  "rt/http/dst/id/$/inet/127.1/9000/sourcedfailures/$/inet/127.1/9000/com.twitter.finagle.ChannelClosedException": 9,
  "rt/http/dst/id/$/inet/127.1/9000/success": 890,
  "rt/http/srv/0.0.0.0/4140/failures": 110,
  "rt/http/srv/0.0.0.0/4140/failures/com.twitter.finagle.ChannelClosedException": 3,
  "rt/http/srv/0.0.0.0/4140/failures/com.twitter.finagle.service.ResponseClassificationSyntheticException": 107,
  "rt/http/srv/0.0.0.0/4140/sourcedfailures/$/inet/127.1/9000": 3,
  "rt/http/srv/0.0.0.0/4140/sourcedfailures/$/inet/127.1/9000/com.twitter.finagle.ChannelClosedException": 3,
  "rt/http/srv/0.0.0.0/4140/success": 890,

And the requeue stats indicate that some requests were retried 4 times and still failed, which is statistically unlikely given a 10% failure rate:

$ curl -s :9990/admin/metrics.json | jq -S . | grep requeues | grep -v zipkin
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues": 12,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.avg": 0.012,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.count": 1000,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.max": 4,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.min": 0,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p50": 0,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p90": 0,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p95": 0,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p99": 0,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p9990": 4,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.p9999": 4,
  "rt/http/dst/id/$/inet/127.1/9000/retries/requeues_per_request.sum": 12,

The 3 502s that we received correspond to 3 ChannelClosedExceptions in the router's logs. For instance:

I 0517 00:39:50.516 THREAD67: FailureAccrualFactory marking connection to "$/inet/127.1/9000" as dead. Remote Address: Inet(/127.0.0.1:9000,Map())
E 0517 00:39:50.521 THREAD67: service failure
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: /127.0.0.1:9000 from service: 0.0.0.0/4140. Remote Info: Upstream Address: /127.0.0.1:52722, Upstream Client Id: Not Available, Downstream Address: /127.0.0.1:9000, Downstream Client Id: $/inet/127.1/9000, Trace Id: 57efcb7401aafc5e.32b071322c81ef00<:57efcb7401aafc5e
    at com.twitter.finagle.NoStacktrace(Unknown Source)

As far as I can tell, when a request is made to linkerd over a non-persistent HTTP connection and the first response from the downstream is a failure, the router goes into a retry loop that fully exhausts the retry budget and still fails the request.

@klingerf klingerf added the bug label May 17, 2016
@olix0r
Member

olix0r commented May 17, 2016

Does this configuration work properly without a responseClassifier?

I expect that this issue isn't related to response classifiers. From the router's perspective, retrying shouldn't be any different from issuing two requests. It may be helpful to get a PCAP file (e.g. from tcpdump) to give us more insight into what's going on between linkerd and your server.
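
For example, a capture along these lines would record the traffic between linkerd and the server on port 9000 (the interface name is a guess for this setup; on macOS the loopback interface is lo0):

$ sudo tcpdump -i lo -w linkerd-9000.pcap 'port 9000'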

@klingerf
Member Author

Good question -- it "works" without responseClassifier, in the sense that 10% of requests fail with a 503 (as expected), and I don't see any ChannelClosedExceptions in the logs.

$ for i in {1..1000}; do curl -0 -so /dev/null -w '%{http_code}\n' :4140/ping; done | sort | uniq -c | sort -rn
 898 200
 102 503
$ curl -s :9990/admin/metrics.json | jq -S . | grep -E '(success|failures)' | grep -v zipkin
  "rt/http/dst/id/$/inet/127.1/9000/failures": 102,
  "rt/http/dst/id/$/inet/127.1/9000/failures/com.twitter.finagle.service.ResponseClassificationSyntheticException": 102,
  "rt/http/dst/id/$/inet/127.1/9000/success": 898,
  "rt/http/srv/0.0.0.0/4140/failures": 102,
  "rt/http/srv/0.0.0.0/4140/failures/com.twitter.finagle.service.ResponseClassificationSyntheticException": 102,
  "rt/http/srv/0.0.0.0/4140/success": 898,

@olix0r
Member

olix0r commented May 17, 2016

Okay, I've been able to reproduce this in a test... So we've got that going for us

olix0r added a commit that referenced this issue May 17, 2016
Test Http/1.0 requests with and without retries.
olix0r added a commit that referenced this issue May 18, 2016
The retry filter may issue multiple requests on the underlying service factory.
This interaction needs to be managed by a FactoryToService module so that the
underlying service (connection) is released to finagle between requests so
that, for instance, it may be closed in the case of HTTP/1.0 requests.

Fixes #409
@olix0r olix0r self-assigned this May 18, 2016
olix0r added a commit that referenced this issue May 18, 2016
* Introduce a FactoryToService module into the path stack

The retry filter may issue multiple requests on the underlying service factory.
This interaction needs to be managed by a FactoryToService module so that the
underlying service (connection) is released to finagle between requests so
that, for instance, it may be closed in the case of HTTP/1.0 requests.

Fixes #409

* Move acquisition failure tracing below retries.

Now that factory acquisition is pushed below retries, the client factory lookup
always succeeds.  In order to record acquisition failures, recording must be
pushed below retries.

* closes aren't synchronous, so we can't rely on them to be timely in CI.

* Fixup commentary

* Move NoBrokersAvailable lifting down under the path stack

With the prior change, NoBrokersAvailableExceptions would not be decorated with
Dtab information because they are thrown at request time and not service
acquisition time (due to factoryToService). This change moves NoBrokers
decoration below the path stack.

* improve DstBindingFactory testing in face of factory-to-service
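
To illustrate the idea behind the fix (a rough sketch, not linkerd's actual module; Finagle provides FactoryToService for this purpose, and the PerRequestService name below is invented for illustration), a per-request wrapper checks a service out of the factory for every call and releases it when the response completes, so a retry gets a fresh connection rather than reusing one that HTTP/1.0 semantics require to be closed:

import com.twitter.finagle.{Service, ServiceFactory}
import com.twitter.util.Future

// Illustrative sketch only: acquire a service (connection) per request
// and release it back to Finagle once the response future completes.
class PerRequestService[Req, Rep](factory: ServiceFactory[Req, Rep])
    extends Service[Req, Rep] {
  def apply(req: Req): Future[Rep] =
    factory().flatMap { svc =>
      svc(req).ensure { svc.close() }
    }
}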
@olix0r olix0r removed the reviewable label May 18, 2016