Add ability to see breakdown of requests by fate #2016

Open
adleong opened this Issue Dec 20, 2018 · 8 comments

Comments

4 participants
@adleong
Member

adleong commented Dec 20, 2018

Feature Request

What problem are you trying to solve?

Setting up retries can be confusing, and it can be difficult to determine whether retries have been configured successfully. Furthermore, it can be very difficult to know WHY retries are or are not happening. Showing effective and actual RPS can reveal whether retries are happening, but it does not explain why. For example, retries may be skipped for a variety of reasons, including:

  • route not configured as retryable
  • timeout exceeded
  • budget exhausted
  • message not bufferable (too large)

How should the problem be solved?

All requests that the Linkerd proxy sends fall into exactly one of these categories:

  • In progress (request sent, response not yet complete)
  • Success (no retries necessary)
  • Success after retry (initial response was failure, but succeeded after some number of retries)
  • Failure: not retryable (failure, route is not retryable)
  • Failure: timeout exceeded
  • Failure: retry skipped: budget exceeded (failure, could not retry due to retry budget)
  • Failure: retry skipped: message not bufferable (request or response too large, could not be buffered)
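
As a rough illustration of the categories above, here is a minimal Go sketch of how a single request could be mapped to exactly one of them. Everything in it is hypothetical: the Fate and Outcome types, their field names, and the order in which the failure reasons are checked are assumptions made for the example, not how the Linkerd proxy tracks requests. The Unclassified value is also not part of the proposal; it only exists so the sketch has somewhere to put a failure that matches none of the listed reasons.

```go
package main

import "fmt"

// Fate is a hypothetical name for the category a request ends up in.
type Fate int

const (
	InProgress Fate = iota
	SuccessFirstTry
	SuccessAfterRetry
	FailureNotRetryable
	FailureTimeout
	FailureNoBudget
	FailureTooLarge
	Unclassified // not in the proposal; fallback for this sketch only
)

var fateNames = [...]string{
	"in progress",
	"success (first try)",
	"success (after retry)",
	"failure: not retryable",
	"failure: timeout exceeded",
	"failure: retry skipped: budget exceeded",
	"failure: retry skipped: message not bufferable",
	"unclassified",
}

func (f Fate) String() string {
	if f < 0 || int(f) >= len(fateNames) {
		return "invalid"
	}
	return fateNames[f]
}

// Outcome holds hypothetical facts about one logical request.
type Outcome struct {
	Complete        bool // response fully received
	Succeeded       bool // final response classified as success
	Retries         int  // number of retries that were sent
	RouteRetryable  bool // route is configured as retryable
	TimedOut        bool // timeout exceeded
	BudgetExhausted bool // retry budget had no room left
	Bufferable      bool // message was small enough to buffer
}

// Classify returns exactly one Fate per Outcome. The precedence of the
// failure checks is an assumption of this sketch.
func Classify(o Outcome) Fate {
	switch {
	case !o.Complete:
		return InProgress
	case o.Succeeded && o.Retries == 0:
		return SuccessFirstTry
	case o.Succeeded:
		return SuccessAfterRetry
	case !o.RouteRetryable:
		return FailureNotRetryable
	case o.TimedOut:
		return FailureTimeout
	case !o.Bufferable:
		return FailureTooLarge
	case o.BudgetExhausted:
		return FailureNoBudget
	default:
		return Unclassified
	}
}

func main() {
	fmt.Println(Classify(Outcome{Complete: true, Succeeded: true, Retries: 2}))
	fmt.Println(Classify(Outcome{Complete: true, RouteRetryable: true, Bufferable: true, BudgetExhausted: true}))
}
```

Because the switch returns on the first matching case, every request lands in one category only, which is the property the breakdown relies on.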

Add a linkerd request-breakdown command (PLEASE help me come up with a better name) that displays a breakdown of how many requests fall into each of these categories:

$ linkerd request-breakdown svc/books

ROUTE                       SERVICE   IN PROGRESS   SUCCESS                   FAILURE
                                                    FIRST TRY   AFTER RETRY   NOT RETRYABLE   TIMEOUT   NO BUDGET   TOO LARGE
DELETE /books/{id}.json     books               1          41             0               0         0           0           0
GET /books.json             books               0          23            19               0         0           7           0
GET /books/{id}.json        books               1          25            20               0         0           6           0
POST /books.json            books               2          15             0              17         0           0           0
PUT /books/{id}.json        books               0          21             0              19         0           0           0
[UNKNOWN]                   books               0          23             0               0         0           0           0

To be able to distinguish between success (first try) and success (after retry), we would probably need to add a new Prometheus label that indicates whether an actual request is an original request or a retry.
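
To make the label idea concrete, here is a minimal Go sketch that splits successful responses per route into first-try and after-retry counts. It assumes each counter sample carries a label marking retries. The label names used here ("route", "classification", "is_retry") and the Sample type are placeholders invented for the example, not existing Linkerd metrics or labels, and the sketch does not use any Prometheus client library.

```go
package main

import "fmt"

// Sample is a hypothetical flattened counter sample: labels plus a count.
type Sample struct {
	Labels map[string]string
	Count  int
}

// SuccessSplit holds the two success columns for one route.
type SuccessSplit struct {
	FirstTry   int
	AfterRetry int
}

// SplitSuccesses sums successful samples per route, using the
// hypothetical "is_retry" label to choose the column.
func SplitSuccesses(samples []Sample) map[string]SuccessSplit {
	out := make(map[string]SuccessSplit)
	for _, s := range samples {
		if s.Labels["classification"] != "success" {
			continue // failures are counted elsewhere
		}
		route := s.Labels["route"]
		split := out[route]
		if s.Labels["is_retry"] == "true" {
			split.AfterRetry += s.Count
		} else {
			split.FirstTry += s.Count
		}
		out[route] = split
	}
	return out
}

func main() {
	samples := []Sample{
		{Labels: map[string]string{"route": "GET /books.json", "classification": "success", "is_retry": "false"}, Count: 23},
		{Labels: map[string]string{"route": "GET /books.json", "classification": "success", "is_retry": "true"}, Count: 19},
		{Labels: map[string]string{"route": "GET /books.json", "classification": "failure", "is_retry": "true"}, Count: 7},
	}
	for route, split := range SplitSuccesses(samples) {
		fmt.Printf("%s first_try=%d after_retry=%d\n", route, split.FirstTry, split.AfterRetry)
	}
}
```

Without such a label, both kinds of success would be summed into a single count, which is why the two columns cannot be derived from a plain success total.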

@olix0r

Member

olix0r commented Dec 20, 2018

I'm not sure the solution should be adding labels to existing metrics. There are some proxy implementation details that will influence how the solution shakes out.

@grampelberg

Contributor

grampelberg commented Dec 20, 2018

This feels valuable, but does it need to be added to routes? I've been hoping that routes and stat have identical look/feel (just with routes).

A -o wide would match the kubectl UX. Honestly, the more I see examples of the impacts of retries on the rest of the UI, I keep wondering whether all the metrics related to retries should be separated out into their own command.

@adleong

Member Author

adleong commented Dec 20, 2018

This is route level data, but it could certainly go in a separate command.

@grampelberg

Contributor

grampelberg commented Dec 20, 2018

@adleong we can't get this data on the stat level? I just assumed that it was a different view.

@adleong

Member Author

adleong commented Dec 20, 2018

I guess it depends on what you mean. Retries happen at the route layer. But we could roll up the data from resources, just like we do for the routes command.

@grampelberg

Contributor

grampelberg commented Dec 20, 2018

Ahhh, I get it. As a user, I would like to see this data at the stat level (pod, deployment, authority) as well as at the route level. How would that work with --from and --to?

@adleong

Member Author

adleong commented Dec 20, 2018

There is a subtle but important difference between this and stat/routes. Stat and routes both default to showing inbound data, i.e. the success rate of requests that the target receives. This behavior is flipped to show outbound data if the --to flag is used.

These retry stats, on the other hand, ONLY make sense for outbound data (there are no inbound retries). So defaulting to inbound like stat/routes wouldn't make sense. So we would need to somehow make it clear that this command works differently and always shows data about requests that the target resource is sending (as opposed to receiving).

@adleong

Member Author

adleong commented Jan 3, 2019

Updated description to re-position this as a retries debugging tool.

@adleong adleong added this to TODO in Service Profiles Jan 9, 2019

@adleong adleong added this to To do in 2.2 Jan 22, 2019

@siggy siggy added the priority/P2 label Jan 23, 2019

@grampelberg grampelberg removed this from To do in 2.2 Feb 5, 2019

@grampelberg grampelberg added this to To do in 2.3 via automation Feb 5, 2019

@grampelberg grampelberg added priority/P1 and removed priority/P0 labels Feb 25, 2019

@admc admc removed this from To do in 2.3 Mar 6, 2019
