Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for retries #2038

Merged
merged 6 commits into from
Jan 16, 2019
Merged

Add support for retries #2038

merged 6 commits into from
Jan 16, 2019

Conversation

adleong
Copy link
Member

@adleong adleong commented Jan 3, 2019

This change adds the isRetryable property to routes in a service profile. This causes the proxy to retry failed requests to retryable routes whenever it is able to. A retry budget may also be added to the service profile which can limit the number of retries. The linkerd profile --template command has been updated to describe these new properties.

The linkerd routes command has been updated to differentiate between actual and effective RPS and success rate. This data is only available for outgoing queries i.e. when the --to flag is used. Effective metrics are the metrics as observed by the client whereas actual metrics are the metrics which describe the requests and responses which are actually sent on the wire. Adding retries will usually increase the actual RPS and increase the effective success rate while leaving the actual success rate and effective RPS unchanged.

$linkerd routes deploy/books --to svc/authors
ROUTE                     SERVICE   EFFECTIVE_SUCCESS   EFFECTIVE_RPS   ACTUAL_SUCCESS   ACTUAL_RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
HEAD /authors/{id}.json   authors              68.28%          2.4rps           45.21%       3.6rps          14ms          40ms          48ms
[UNKNOWN]                 authors               0.00%          0.0rps            0.00%       0.0rps           0ms           0ms           0ms

@adleong adleong self-assigned this Jan 3, 2019
@adleong
Copy link
Member Author

adleong commented Jan 3, 2019

How I tested

  1. Install Linkerd control plane and booksapp
  2. Run linkerd routes deploy/webapp --to svc/books, notice: no useful data
  3. Apply service profiles from swagger:
curl https://run.linkerd.io/booksapp/webapp.swagger | linkerd profile --open-api - webapp | kubectl apply -f -
curl https://run.linkerd.io/booksapp/authors.swagger | linkerd profile --open-api - authors | kubectl apply -f -
curl https://run.linkerd.io/booksapp/books.swagger | linkerd profile --open-api - books | kubectl apply -f -
  1. Run linkerd routes deploy/webapp --to svc/books: Notice SR on the POST and PUT routes is ~50%
  2. Run linkerd routes deploy/books --to svc/authors: Notice SR on the HEAD route is ~50%
  3. Download the authors profile: kubectl -n linkerd get sp/authors.default.svc.cluster.local -o yaml > authors-profile.yaml
  4. Edit the HEAD route to be retryable:
  - condition:
      method: HEAD
      pathRegex: /authors/[^/]*\.json
    name: HEAD /authors/{id}.json
    isRetryable: true
  1. Apply the new profile kubectl apply -f authors-profile.yaml
  2. Run linkerd routes deploy/books --to svc/authors notice effective SR go to 100%
  3. Run linkerd routes deploy/webapp --to svc/books notice SR go to 100%

routes = append(routes, pbRoute)
}
budget := DefaultRetryBudget
if profile.RetryBudget != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if this is addressed elsewhere and I couldn't find it: would this be an appropriate place to validate the budget arguments? Would an error here be more visible to a user?

The proxy will already disregard invalid budgets and log at WARN level, which should be visible, but only if the user is looking at the proxy logs...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Probably a good idea to validate here too. Unfortunately, the suffers from the same problem that validation errors will only be visible if you're looking at the logs.

We may be able to some limited validation with CRD schema validation too...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for linkerd inject or other related command to do the validation then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be a great addition to the inject checks.

@grampelberg
Copy link
Contributor

@adleong I can't get stat to work on this branch. Installed booksapp, looked at stat and there's nothing there.

@adleong
Copy link
Member Author

adleong commented Jan 9, 2019

@grampelberg uh oh. can you share a screenshot of the stat command and output?

@grampelberg
Copy link
Contributor

$ bin/go-run cli -n booksapp stat deploy
NAME      MESHED   SUCCESS   RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TLS
authors      1/1         -     -             -             -             -     -
books        1/1         -     -             -             -             -     -
traffic      1/1         -     -             -             -             -     -
webapp       3/3         -     -             -             -             -     -

@grampelberg
Copy link
Contributor

I've replicated my stat bug on stable, so it isn't your code! I'm going to go figure out what the deal is there now .....

@jon-walton
Copy link
Contributor

is it feasible to read the IsRetryable from an openapi spec? The spec supports extensions (https://github.com/OAI/OpenAPI-Specification/blob/master/versions/3.0.2.md#specificationExtensions) as part of the operation object (https://github.com/OAI/OpenAPI-Specification/blob/master/versions/3.0.2.md#operationObject).

We could use something like x-linkerd-is-retryable

The reason I ask this is that our openapi spec is auto generated based on the routes in the controllers. It would be annoying to maintain a separate list of routes and their linkerd settings to then patch the json before passing into the linkerd cli

@grampelberg
Copy link
Contributor

What do you think about reporting effective by default and adding a -o wide option?

bin/go-run cli -n booksapp routes deploy/books --to svc/authors
ROUTE                     SERVICE   SUCCESS   RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
HEAD /authors/{id}.json   authors   100.00%  2.2rps         5ms          16ms          19ms
[UNKNOWN]                 authors     0.00%  0.0rps         0ms           0ms           0ms

bin/go-run cli -n booksapp routes deploy/books --to svc/authors -o wide
ROUTE                     SERVICE   SUCCESS   RPS   ACTUAL_SUCCESS  ACTUAL_RPS  LATENCY_P50   LATENCY_P95   LATENCY_P99
HEAD /authors/{id}.json   authors   100.00%  2.2rps         55.14%      4.0rps          5ms          16ms          19ms
[UNKNOWN]                 authors     0.00%  0.0rps          0.00%      0.0rps          0ms           0ms           0ms

@adleong
Copy link
Member Author

adleong commented Jan 11, 2019

@jon-walton I love this idea! It should be pretty simple to add this to the linkerd profile --open-api command. Would you mind opening an issue?

@adleong
Copy link
Member Author

adleong commented Jan 11, 2019

@grampelberg I dig it. ⛏

@grampelberg
Copy link
Contributor

@adleong also see #2072 if you wanted to tackle that as a separate PR.

Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
@adleong
Copy link
Member Author

adleong commented Jan 14, 2019

@grampelberg I implemented your -o wide suggestion. PTAL

@grampelberg
Copy link
Contributor

Oh yeah, I like that!

@siggy siggy self-requested a review January 15, 2019 21:02
Copy link
Member

@siggy siggy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great. thanks for the test setup instructions.

cli/cmd/root.go Outdated
@@ -162,17 +162,17 @@ func newStatOptionsBase() *statOptionsBase {

func (o *statOptionsBase) validateOutputFormat() error {
switch o.outputFormat {
case "table", "json", "":
case "table", "json", "wide", "":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to implement the EFFECTIVE/ACTUAL columns for the linkerd stat command? if not, it's a bit confusing that stat -o wide works but is a no-op.

related, if routes -o wide only works with --to, consider validating that, or maybe documenting it in the help.

cli/cmd/root.go Outdated
proxyOutboundCapacity: map[string]uint{},
proxyCPURequest: "",
proxyMemoryRequest: "",
tls: "",
disableExternalProfiles: false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is a go-fmt issue? when i save this file in vscode it re-aligns the field assignments.

"SUCCESS",
"RPS",
}
if options.toResource == "" || options.outputFormat != "wide" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider:

outputActual := options.toResource != "" && options.outputFormat == "wide"

... and then use outputActual in the 3 subsequent if statements.

@@ -13,6 +13,7 @@ import (

const (
routeReqQuery = "sum(increase(route_response_total%s[%s])) by (%s, dst, classification)"
actualRouteReqQuery = "sum(increase(route_actual_response_total%s[%s])) by (%s, dst, classification)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the prometheus metrics level and BasicStats level, we're re-using the existing response keys for "effective" stats, and adding new keys for "actual". higher up in the abstractions, around jsonRouteStats and cli table headers, we're instead adding both new "effective" and "actual" keys, and leaving the existing keys empty. what do you think about modifying the higher-level abstractions to match these lower-level ones, such that we'd remove "effective" altogether?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only place that we have "effective" keys and leave the existing keys empty is in the struct that will be serialized to json. I do this so that we can have different field names: effective_rps in the outbound case and just plain rps in the inbound case.

Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
Copy link
Member

@siggy siggy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 🚢

Copy link
Member

@klingerf klingerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⭐️ 🚀 Awesome! So cool to see these stats wired up.

$ bin/linkerd routes deploy/books --to svc/authors -owide
ROUTE                     SERVICE   EFFECTIVE_SUCCESS   EFFECTIVE_RPS   ACTUAL_SUCCESS   ACTUAL_RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
HEAD /authors/{id}.json   authors             100.00%          2.2rps           51.16%       4.3rps          12ms          30ms          38ms

func buildTopRoutesRequest(resource string, options *routesOptions) (*pb.TopRoutesRequest, error) {
err := options.validateOutputFormat()
err := validateOutputFormat(options)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIOLI -- you could still use *routesOptions as the receiver for this method, even though it embeds statOptionsBase, which has its own definition of validateOutputFormat. Something like this would work:

diff --git a/cli/cmd/routes.go b/cli/cmd/routes.go
index bab9fd36..9c3dceb4 100644
--- a/cli/cmd/routes.go
+++ b/cli/cmd/routes.go
@@ -268,12 +268,12 @@ func printRouteJSON(stats []*routeRowStats, w *tabwriter.Writer, options *routes
 	fmt.Fprintf(w, "%s\n", b)
 }
 
-func validateOutputFormat(options *routesOptions) error {
-	switch options.outputFormat {
+func (o *routesOptions) validateOutputFormat() error {
+	switch o.outputFormat {
 	case "table", "json", "":
 		return nil
 	case "wide":
-		if options.toResource == "" {
+		if o.toResource == "" {
 			return errors.New("wide output is only available when --to is specified")
 		}
 		return nil
@@ -283,7 +283,7 @@ func validateOutputFormat(options *routesOptions) error {
 }
 
 func buildTopRoutesRequest(resource string, options *routesOptions) (*pb.TopRoutesRequest, error) {
-	err := validateOutputFormat(options)
+	err := options.validateOutputFormat()
 	if err != nil {
 		return nil, err
 	}

siggy added a commit that referenced this pull request Jan 16, 2019
Depends on #2037, #2038, #2063, #2066, #2086, #2087, #2089

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
siggy added a commit that referenced this pull request Jan 16, 2019
Depends on #2037, #2038, #2063, #2066, #2086, #2087, #2089

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
@adleong adleong merged commit 771542d into master Jan 16, 2019
siggy added a commit that referenced this pull request Jan 17, 2019
Depends on #2037, #2038, #2063, #2066, #2087, #2089, #2105

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
@adleong adleong deleted the alex/retry-two branch December 13, 2019 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants