SLO - http request external duration and error source #989

Merged: 10 commits merged into main from slo-duration-error-source on May 20, 2024

Conversation

@scottlepp (Contributor) commented May 15, 2024

What this PR does / why we need it:
Provides middleware for HTTP clients to capture external request duration as a custom metric (see the sketch after the reference links below). Since we already capture total duration, we can subtract the two to get plugin duration.

As discussed with @wbrowne, the API server will expose this new metric, it will be scraped into Prometheus, and we can create SLOs based on duration and error source.

The error source is also captured here as a label, so we don't have to pass the error source around.

We will take the same approach in SQLDS, so minimal changes will be required per plugin.

Ref https://github.com/grafana/enterprise-datasources/issues/730
Ref https://github.com/grafana/enterprise-datasources/issues/731
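
For illustration, a minimal sketch of the kind of middleware described above, using hypothetical metric and function names rather than the actual code in this PR: an http.RoundTripper wrapper times each outgoing request and records the duration, with a coarse error source label, in a promauto histogram.

package slo

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metric name, for illustration only.
var externalDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name: "plugin_external_request_duration_seconds",
	Help: "Duration of outgoing HTTP requests made by the plugin.",
}, []string{"error_source"})

// durationRoundTripper wraps an http.RoundTripper and observes how long each
// outgoing request takes, labelled with a rough error source.
type durationRoundTripper struct {
	next http.RoundTripper
}

func (rt *durationRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := rt.next.RoundTrip(req)

	// Illustrative classification only; the real SDK logic may differ.
	errorSource := "none"
	if err != nil || resp.StatusCode >= 400 {
		errorSource = "downstream"
	}

	externalDuration.WithLabelValues(errorSource).Observe(time.Since(start).Seconds())
	return resp, err
}

// Middleware wraps an existing transport, e.g. client.Transport = Middleware(client.Transport).
func Middleware(next http.RoundTripper) http.RoundTripper {
	if next == nil {
		next = http.DefaultTransport
	}
	return &durationRoundTripper{next: next}
}

Attaching the error source as a label where the request is made is what lets the value stay local to the middleware instead of being passed back up the call stack.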

@scottlepp requested a review from a team as a code owner May 15, 2024 16:53
@scottlepp requested reviews from wbrowne, marefr, andresmgot, xnyo, cletter7 and Multimo and removed the request for a team May 15, 2024 16:53
@scottlepp requested a review from asimpson May 15, 2024 17:04
@marefr (Member) left a comment

If you think this will work, that's fine by me.

I've never subtracted one histogram from another, so that's why I'm curious.

Three resolved review threads on experimental/slo/slo_middleware.go (outdated).

Review thread on the following excerpt from experimental/slo/slo_middleware.go:
"github.com/prometheus/client_golang/prometheus/promauto"
)

var duration = promauto.NewHistogramVec(prometheus.HistogramOpts{
@marefr (Member): Since this is a histogram, don't you need to specify buckets? Have you verified that subtracting one histogram from another works as you imagine?

@scottlepp (Contributor, author) May 15, 2024: I grabbed the code from here as it looked similar. I'm no expert, but from my understanding we need a histogram here, or we would have to have shorter scrape intervals. Since that code isn't using buckets, I assumed it would work.
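
For reference on the buckets question: in client_golang, a histogram whose Buckets field is left empty falls back to prometheus.DefBuckets (roughly 5 ms to 10 s), so explicit buckets only matter when that default range or resolution doesn't fit the latencies being measured. A minimal check, assuming only that default behaviour:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// DefBuckets is what a histogram uses when HistogramOpts.Buckets is left unset.
	fmt.Println(prometheus.DefBuckets) // [0.005 0.01 0.025 0.05 0.1 0.25 0.5 1 2.5 5 10]
}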

@scottlepp (Contributor, author) May 15, 2024: @xnyo suggested subtracting the values, but that was with metrics captured from Tempo, which appear to be histograms. I'm not sure he ever confirmed that the subtraction would work.

Span metrics generate two metrics: a counter that counts requests, and a histogram that tracks operation durations.

@marefr (Member): But wasn't he thinking of taking the full latency of a trace, subtracting the downstream span latencies, and then updating a latency histogram? I.e. it's not Prometheus doing the subtraction and you end up with one metric.

So my question is more: given two different latency histogram metrics, how easy is it to reliably subtract one from the other, and how does that scale/perform?

@xnyo (Member): I haven't tested whether subtracting histograms works reliably, but the way this histogram is set up should replicate what Tempo does with the metrics generator. I think @svennergr mentioned he was doing something similar with histograms in the Elasticsearch data source as well, so he may have some guidance on whether this will be accurate or not.
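
To make the trade-off above concrete, here is a minimal sketch of the single-metric alternative, with hypothetical names rather than this PR's actual code: subtract the externally spent time from the total in-process and observe the difference into one histogram, so no cross-histogram subtraction is needed at query time.

package slo

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metric, for illustration only.
var pluginDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name: "plugin_internal_duration_seconds",
	Help: "Time spent inside the plugin, excluding outgoing HTTP calls.",
}, []string{"error_source"})

// ObservePluginDuration records total minus external as one metric, so the
// subtraction happens in-process rather than between two Prometheus histograms.
func ObservePluginDuration(total, external time.Duration, errorSource string) {
	internal := total - external
	if internal < 0 {
		internal = 0 // guard against overlapping/concurrent external calls
	}
	pluginDuration.WithLabelValues(errorSource).Observe(internal.Seconds())
}

If the two separate histograms are kept instead, only aggregate values subtract cleanly at query time (for example the per-second rates of the two _sum series); subtracting bucket counts does not recover the latency distribution of the difference.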

A further resolved review thread on experimental/slo/slo_middleware.go.
@scottlepp requested a review from marefr May 15, 2024 18:56
@marefr (Member) left a comment

LGTM, feel free to evaluate. Suggested some minor changes.

Two resolved review threads on experimental/slo/slo_middleware.go (outdated).
@xnyo (Member) left a comment

Awesome work! I have some minor suggestions but the code LGTM! 🚀

Resolved review threads on experimental/slo/slo_middleware.go (outdated).
@scottlepp merged commit ee141c6 into main May 20, 2024
3 checks passed
@scottlepp deleted the slo-duration-error-source branch May 20, 2024 17:09