Description
Describe the change you'd like to see
Organizations are rapidly adopting SLOs (service level objectives) as the foundation of their SRE (site reliability engineering) efforts.
The idea is to view services from a user perspective and follow these steps:
- Select metrics that make good SLIs
- Use SLIs to create proper SLOs
- Use the error budget implicitly defined by your SLO to mitigate risks (out of scope).
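For intuition, the error budget mentioned in the last step falls directly out of the SLO arithmetic. Here is a minimal Go sketch; the 99.9% target and 30-day window are illustrative values, not anything prescribed by Knative:

```go
package main

import (
	"fmt"
	"time"
)

// errorBudget returns how much downtime a time-based SLO
// tolerates over a measurement window.
// slo is the target as a fraction, e.g. 0.999 for 99.9%.
func errorBudget(slo float64, window time.Duration) time.Duration {
	return time.Duration((1 - slo) * float64(window))
}

func main() {
	// A 99.9% SLO over a 30-day window leaves ~43 minutes of budget.
	budget := errorBudget(0.999, 30*24*time.Hour)
	fmt.Printf("error budget: %v\n", budget.Round(time.Second))
}
```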
Knative Serving exposes a number of metrics from which we can build SLIs/SLOs for end users.
Users should know which metrics are available and should have meaningful SLIs/SLOs ready to use.
As a first step we should document the required information in this repo, and then create and/or suggest the right tooling to help users monitor and enforce SLOs.
Additional context
- Service Level Indicators (SLIs) are metrics that measure a property of a service, i.e., the metrics used to measure the level of service provided to end users (e.g., availability, latency, throughput).
- Service Level Objectives (SLOs) are the targeted levels of service: a statement of performance, measured by SLIs.
They are typically expressed as a percentage over a period of time. SLOs can be either time-based, meaning the fraction of the measured period during which the performance target must be met, or events-based, meaning the percentage of events that must be successful.
A nice blog post on the topic with many details can be found here.
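To make the events-based case concrete, here is a minimal Go sketch of an SLO compliance check; the function name and request counts are illustrative:

```go
package main

import "fmt"

// eventsBasedSLOMet reports whether an events-based SLO holds:
// the fraction of successful events must meet or exceed the target.
func eventsBasedSLOMet(successful, total int64, target float64) bool {
	if total == 0 {
		return true // no events, nothing to violate
	}
	return float64(successful)/float64(total) >= target
}

func main() {
	// 999,500 successful requests out of 1,000,000 against a 99.9% target.
	fmt.Println(eventsBasedSLOMet(999500, 1000000, 0.999)) // true
	fmt.Println(eventsBasedSLOMet(998000, 1000000, 0.999)) // false
}
```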
Some examples:
- SLI 1: “A service submitted/updated by the user should have its revision in a Ready state within N ms.”
  This will help us detect issues like revisions not coming up in time. We can report the timing with a new metric when the controller reconciles the revision (a measurement sketch follows this list).
- SLI 2: “A service deleted by the user should have its resources removed within M ms.”
  Same as above.
- SLI 3: “Services creating a lot of blocked connections at the activator side should be automatically re-configured within X ms.”
  This SLI requires auto-tuning capabilities that don't currently exist. The idea is that when CC (container concurrency) is not 0 (infinite), requests may get blocked and queued at the [activator side](https://github.com/knative/serving/blob/edf5ae036c246b98b4392017fb1d94b7ced066b0/pkg/activator/net/throttler.go#L217), as they are going to be throttled. If queued requests exceed a [limit](https://github.com/knative/serving/blob/edf5ae036c246b98b4392017fb1d94b7ced066b0/pkg/activator/net/throttler.go#L57), errors are returned. The goal is to detect when the activator is under pressure and proactively reconfigure services to avoid request delays.
- SLI 4: “When switching from proxy to serve mode, the 95th percentile of latency for proxied requests (via the activator) should be statistically similar to that of requests served directly.”
  It is desirable to have the same behavior for proxied and non-proxied requests: for a high number of requests issued, the user shouldn't see a difference between requests that go through our system and requests that hit their service directly.
- SLI 5: “Number of auto-scaling actions that reached the desired pod count.”
- SLI 6: “Cold start times are below a threshold of N ms.”
  This requires a metric that measures the time spent on pod bootstrapping and the user's app initialization.
- SLO 1: “99.9% of services submitted/updated should become ready within N ms and be removed within M ms.”
- SLO 2: “99.9% of the time Serving is working in stable mode for all services.”
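As referenced in SLI 1 above, a minimal Go sketch of the underlying measurement; the `revisionStatus` type here is a simplified stand-in for the real Knative revision status, not an actual API object:

```go
package main

import (
	"fmt"
	"time"
)

// revisionStatus is a sketch of the data behind SLI 1: the time
// between a revision's creation and its Ready condition becoming
// true. Field names mirror the Kubernetes convention of a creation
// timestamp plus condition transition times.
type revisionStatus struct {
	CreationTime time.Time
	ReadyTime    time.Time // zero if the revision never became ready
}

// revisionReadyLatency returns the submit-to-ready duration, or
// ok=false if the revision is not ready yet (no sample to report).
func revisionReadyLatency(s revisionStatus) (d time.Duration, ok bool) {
	if s.ReadyTime.IsZero() {
		return 0, false
	}
	return s.ReadyTime.Sub(s.CreationTime), true
}

func main() {
	created := time.Now().Add(-1500 * time.Millisecond)
	s := revisionStatus{CreationTime: created, ReadyTime: time.Now()}
	if d, ok := revisionReadyLatency(s); ok {
		// In a real controller this would feed a latency histogram;
		// here we just print the sample.
		fmt.Printf("revision ready latency: %v\n", d.Round(time.Millisecond))
	}
}
```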
Resources
Calling out several people who might be interested in this: @evankanderson @markusthoemmes @yuzisun @csantanapr @aslom @mattmoor @grantr @tcnghia @abrennan89