Linkerd2 stat CLI and linkerd2 web dashboard delayed compared to Grafana #1453

Closed · willtrking opened this issue Aug 14, 2018 · 4 comments

@willtrking

Hey there,

When monitoring service RPS with linkerd stat and the linkerd dashboard, the RPS statistic lags well behind Grafana.

For example, if traffic spikes from ~10 rps to ~100 rps for 10 seconds, Grafana responds correctly and shows the spike in RPS. linkerd stat and the dashboard seem to respond to the spike as well. However, once the RPS drops back down to ~10 after that 10-second period, linkerd stat and the dashboard are slow to follow, sometimes taking over 30 seconds to come back down to what Grafana shows (which is accurate).

@klingerf (Member)

@willtrking This is a great find! I agree that these should be consistent. I included a bunch of info below to help explain the discrepancy. We should use this issue to track standardizing the queries so that the dashboard and the CLI present the same view of the data.

When you run the linkerd stat command, we fire off a Prometheus query that looks like this:

sum(increase(response_total{direction="inbound"}[1m])) by (namespace, deployment)

On the Grafana dashboard, we instead use a query that looks like this:

sum(irate(request_total{direction="inbound"}[30s])) by (namespace, deployment)

These two queries are different in a couple of ways:

  • The stat query uses the response_total metric, whereas the Grafana query uses the request_total metric. response_total will always lag behind request_total, since request_total is incremented when a request starts and response_total is incremented when a request ends.

  • The time window for the stat query is 1m, whereas the time window for the Grafana query is 30s. Assuming a 10s scrape interval, the Grafana query uses at most 3 datapoints when computing rps, while the stat query uses about twice as many, including some that are farther in the past.

  • Furthermore, the stat query uses increase and then calculates rps by dividing by the entire time window, whereas the Grafana query uses irate, which only looks at the 2 most recent datapoints when calculating rps.
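A simplified Python model may help make the last two points concrete. It is a sketch under stated assumptions, not how Prometheus implements these functions: samples arrive every 10s, the counter never resets, the window is treated as (now - window, now], and the edge extrapolation that increase performs is skipped. The names avg_rate, irate and counter_at are hypothetical, the traffic pattern (10 rps with a 100 rps burst from t=60s to t=70s) is made up to mirror the original report, and the request_total vs response_total difference is not modelled.

```python
# Simplified models of the two PromQL functions discussed above, applied to
# (timestamp_seconds, counter_value) samples. Assumptions: the counter never
# resets, the window is the half-open interval (now - window, now], and there
# is no extrapolation to the window edges (real Prometheus extrapolates).

def in_window(samples, now, window):
    """Samples whose timestamps fall in (now - window, now]."""
    return [(t, v) for t, v in samples if now - window < t <= now]

def avg_rate(samples, now, window):
    """Roughly increase(counter[window]) / window: one average over the window."""
    w = in_window(samples, now, window)
    if len(w) < 2:
        return None
    (t0, v0), (t1, v1) = w[0], w[-1]
    return (v1 - v0) / (t1 - t0)

def irate(samples, now, window):
    """Like irate(counter[window]): only the two most recent samples count."""
    w = in_window(samples, now, window)
    if len(w) < 2:
        return None
    (t0, v0), (t1, v1) = w[-2], w[-1]
    return (v1 - v0) / (t1 - t0)

def counter_at(t):
    """Hypothetical counter: 10 rps, except 100 rps between t=60s and t=70s."""
    if t <= 60:
        return 10 * t
    if t <= 70:
        return 600 + 100 * (t - 60)
    return 1600 + 10 * (t - 70)

# Scraped every 10s (an assumed scrape interval), from t=0s to t=130s.
samples = [(t, counter_at(t)) for t in range(0, 140, 10)]

print(avg_rate(samples, now=80, window=60))  # 28.0
print(irate(samples, now=80, window=30))     # 10.0
```

Ten seconds after the burst ends, the irate-style value is already back at 10 rps, while the one-minute average still reports 28 rps because the burst is still inside its window.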

The main reason for these discrepancies is that the Grafana dashboard shows you timeseries data that includes a value for every datapoint in a given time window. The stat command, on the other hand, shows you a single, rolled-up value for all datapoints in the time window. The default time window for the stat command is 1 minute, so it takes a full minute for changes in request volume to be fully reflected in the stat output.
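To see why a one-minute rolled-up value takes so long to recover, the hypothetical sketch below evaluates it every 10s after a short burst. It assumes a 10s scrape interval, a counter that never resets, a half-open window (now - window, now] and no edge extrapolation; counter_at and rolled_up_rate are illustrative names, and the traffic (10 rps with a 100 rps burst from t=60s to t=70s) is made up to match the scenario in the report.

```python
# Hypothetical model: a rolled-up rate over a 1-minute window, evaluated every
# 10s after a short burst. Assumes a 10s scrape interval, a counter that never
# resets, a half-open window (now - window, now], and no edge extrapolation.

SCRAPE_INTERVAL = 10  # seconds (assumed)

def counter_at(t):
    """Hypothetical counter: 10 rps, except 100 rps between t=60s and t=70s."""
    if t <= 60:
        return 10 * t
    if t <= 70:
        return 600 + 100 * (t - 60)
    return 1600 + 10 * (t - 70)

def rolled_up_rate(now, window):
    """Average rate between the oldest and newest sample in (now - window, now]."""
    times = [t for t in range(0, now + 1, SCRAPE_INTERVAL) if now - window < t <= now]
    if len(times) < 2:
        return None
    return (counter_at(times[-1]) - counter_at(times[0])) / (times[-1] - times[0])

# Traffic is back to 10 rps from t=70s on, but the 1m value lags behind.
for now in range(70, 140, 10):
    print(f"t={now}s  rate={rolled_up_rate(now, window=60)}")
```

Under these assumptions it prints 28.0 for t=70s through t=110s and only drops to 10.0 at t=120s, once the sample taken just before the burst (t=60s) has left the window.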

You can try shortening the time window for the stat command so that it only includes the last two datapoints; that should mirror the behavior of the Grafana dashboard much more closely. You can do this via the -t flag. Testing locally, this command picks up changes in request rate much more quickly:

linkerd stat deploy -t 20s --all-namespaces
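A rough way to compare the default 1m window with the 20s window used in the command above is to measure how long after a burst the rolled-up value needs to return to baseline. This is a hypothetical Python model, not Linkerd or Prometheus code: it assumes a 10s scrape interval (so a 20s window holds only the two most recent samples), a counter that never resets, a half-open window and no edge extrapolation. SPIKE_END, counter_at, rolled_up_rate and seconds_until_baseline are illustrative names.

```python
# Hypothetical sketch comparing the default 1m window with a 20s window.
# Assumes a 10s scrape interval, so a 20s half-open window (now - 20s, now]
# holds just the two most recent samples; the traffic pattern is made up and
# Prometheus' edge extrapolation is ignored.

SCRAPE_INTERVAL = 10  # seconds (assumed)
SPIKE_END = 70        # hypothetical: traffic drops back to 10 rps at t=70s

def counter_at(t):
    """Hypothetical counter: 10 rps, except 100 rps between t=60s and t=70s."""
    if t <= 60:
        return 10 * t
    if t <= SPIKE_END:
        return 600 + 100 * (t - 60)
    return 1600 + 10 * (t - SPIKE_END)

def rolled_up_rate(now, window):
    """Average rate between the oldest and newest sample in (now - window, now]."""
    times = [t for t in range(0, now + 1, SCRAPE_INTERVAL) if now - window < t <= now]
    if len(times) < 2:
        return None
    return (counter_at(times[-1]) - counter_at(times[0])) / (times[-1] - times[0])

def seconds_until_baseline(window, baseline=10.0):
    """Seconds after SPIKE_END until the rolled-up rate is back at baseline."""
    if window < 2 * SCRAPE_INTERVAL:
        raise ValueError("window must span at least two scrapes")
    now = SPIKE_END
    while rolled_up_rate(now, window) != baseline:
        now += SCRAPE_INTERVAL
    return now - SPIKE_END

print(seconds_until_baseline(60))  # 50
print(seconds_until_baseline(20))  # 10
```

With these made-up numbers the 1m window needs 50s to settle, while the 20s window settles at the first scrape after the burst, which is consistent with the faster response observed with -t 20s.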

stale bot commented Nov 15, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 15, 2018
@stale stale bot closed this as completed Nov 29, 2018
@wmorgan (Member) commented Nov 29, 2018

@siggy @grampelberg do you think it's worth turning Kevin's excellent response above into documentation?

@siggy (Member) commented Nov 30, 2018

@wmorgan I think so. The whole area under https://linkerd.io/2/cli/ needs love; this would be a good addition to it.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021
Projects: None yet · Development: No branches or pull requests · 4 participants