Description
Describe the change you'd like to see
Organizations are rapidly adopting SLOs (service level objectives) as the foundation of their SRE (site reliability engineering) efforts.
The idea is to view services from a user perspective and follow these steps:
- Select metrics that make good SLIs
- Use SLIs to create proper SLOs
- Use the error budget implicitly defined by your SLO to mitigate risks (out of scope).
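For intuition, the error budget mentioned in the last step falls directly out of the SLO arithmetic. Here is a minimal Go sketch; the 99.9% target and 30-day window are illustrative values, not anything prescribed by Knative:

```go
package main

import (
	"fmt"
	"time"
)

// errorBudget returns how much downtime a time-based SLO
// tolerates over a measurement window.
// slo is the target as a fraction, e.g. 0.999 for 99.9%.
func errorBudget(slo float64, window time.Duration) time.Duration {
	return time.Duration((1 - slo) * float64(window))
}

func main() {
	// A 99.9% SLO over a 30-day window leaves ~43 minutes of budget.
	budget := errorBudget(0.999, 30*24*time.Hour)
	fmt.Printf("error budget: %v\n", budget.Round(time.Second))
}
```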
Knative Serving exposes a number of metrics from which we can build SLIs/SLOs for end users.
Users should know which metrics are available and should have meaningful SLIs/SLOs ready to use.
As a first step we should document the required information in this repo, and then create and/or suggest the right tooling to help users monitor and enforce SLOs.
Additional context
- Service Level Indicators (SLIs) are metrics that measure a property of a service, i.e., the metrics used to measure the level of service provided to end users (e.g., availability, latency, throughput).
- Service Level Objectives (SLOs) are the targeted levels of service: a statement of performance, measured by SLIs.
They are typically expressed as a percentage over a period of time. SLOs can be either time-based, meaning the fraction of the measured period during which the performance target must be met, or events-based, meaning the percentage of events that must be successful.
A nice blog post on the topic with many details can be found here.
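To make the events-based case concrete, here is a minimal Go sketch of an SLO compliance check; the function name and request counts are illustrative:

```go
package main

import "fmt"

// eventsBasedSLOMet reports whether an events-based SLO holds:
// the fraction of successful events must meet or exceed the target.
func eventsBasedSLOMet(successful, total int64, target float64) bool {
	if total == 0 {
		return true // no events, nothing to violate
	}
	return float64(successful)/float64(total) >= target
}

func main() {
	// 999,500 successful requests out of 1,000,000 against a 99.9% target.
	fmt.Println(eventsBasedSLOMet(999500, 1000000, 0.999)) // true
	fmt.Println(eventsBasedSLOMet(998000, 1000000, 0.999)) // false
}
```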
Some examples:
- SLI 1: “A service submitted/updated by the user should have its revision in a Ready state within N ms.”
  This will help us detect issues like revisions not coming up in time. We can report the timing with a new metric when the controller reconciles the revision (a measurement sketch follows this list).
- SLI 2: “A service deleted by the user should have its resources removed within M ms.”
  Same as above.
- SLI 3: “Services creating a lot of blocked connections at the activator side should be automatically re-configured within X ms.”
  This SLI requires auto-tuning capabilities that don't currently exist. The idea is that when CC (container concurrency) is not 0 (infinite), requests may get blocked and queued at the [activator side](https://github.com/knative/serving/blob/edf5ae036c246b98b4392017fb1d94b7ced066b0/pkg/activator/net/throttler.go#L217), as they are going to be throttled. If queued requests exceed a [limit](https://github.com/knative/serving/blob/edf5ae036c246b98b4392017fb1d94b7ced066b0/pkg/activator/net/throttler.go#L57), errors are returned. The goal is to detect when the activator is under pressure and proactively reconfigure services to avoid request delays.
- SLI 4: “When switching from proxy to serve mode, the 95th percentile of latency for proxied requests (via the activator) should be statistically similar to that of requests served directly.”
  It is desirable to have the same behavior for proxied and non-proxied requests: for a high number of requests issued, the user shouldn't see a difference between requests that go through our system and requests that hit their service directly.
- SLI 5: “Number of auto-scaling actions that reached the desired pod count.”
- SLI 6: “Cold start times are below a threshold of N ms.”
  This requires a metric that measures the time spent on pod bootstrapping and the user's app initialization.
- SLO 1: “99.9% of services submitted/updated should become ready within N ms and be removed within M ms.”
- SLO 2: “99.9% of the time Serving is working in stable mode for all services.”
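As referenced in SLI 1 above, a minimal Go sketch of the underlying measurement; the `revisionStatus` type here is a simplified stand-in for the real Knative revision status, not an actual API object:

```go
package main

import (
	"fmt"
	"time"
)

// revisionStatus is a sketch of the data behind SLI 1: the time
// between a revision's creation and its Ready condition becoming
// true. Field names mirror the Kubernetes convention of a creation
// timestamp plus condition transition times.
type revisionStatus struct {
	CreationTime time.Time
	ReadyTime    time.Time // zero if the revision never became ready
}

// revisionReadyLatency returns the submit-to-ready duration, or
// ok=false if the revision is not ready yet (no sample to report).
func revisionReadyLatency(s revisionStatus) (d time.Duration, ok bool) {
	if s.ReadyTime.IsZero() {
		return 0, false
	}
	return s.ReadyTime.Sub(s.CreationTime), true
}

func main() {
	created := time.Now().Add(-1500 * time.Millisecond)
	s := revisionStatus{CreationTime: created, ReadyTime: time.Now()}
	if d, ok := revisionReadyLatency(s); ok {
		// In a real controller this would feed a latency histogram;
		// here we just print the sample.
		fmt.Printf("revision ready latency: %v\n", d.Round(time.Millisecond))
	}
}
```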
Resources
Calling out several people who might be interested in this: @evankanderson @markusthoemmes @yuzisun @csantanapr @aslom @mattmoor @grantr @tcnghia @abrennan89