Identify SLIs to determine SLO so we can offer an SLA #1213

MylesBorins · 2018-04-05T04:21:24Z

Bit of a mouthful, but I'll dig in briefly.

https://landing.google.com/sre/book/chapters/service-level-objectives.html

We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.

An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. For example, we might decide that we will return Shakespeare search results "quickly," adopting an SLO that our average search request latency should be less than 100 milliseconds.
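To make the "SLI ≤ target" and "lower bound ≤ SLI ≤ upper bound" structure concrete, here is a minimal sketch in Python. The `SLO` class, the `mean_latency_ms` helper, and the sample latencies are all hypothetical names and values made up for illustration; only the 100 millisecond target comes from the quoted example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SLO:
    """A target range for one SLI; either bound may be omitted (hypothetical type)."""
    name: str
    lower: Optional[float] = None
    upper: Optional[float] = None

    def is_met(self, sli_value: float) -> bool:
        # lower bound <= SLI <= upper bound, skipping any bound that is unset
        if self.lower is not None and sli_value < self.lower:
            return False
        if self.upper is not None and sli_value > self.upper:
            return False
        return True


def mean_latency_ms(samples: Sequence[float]) -> float:
    """SLI: average request latency in milliseconds."""
    if not samples:
        raise ValueError("need at least one latency sample")
    return sum(samples) / len(samples)


# Made-up latency samples, for illustration only.
samples = [80.0, 95.0, 110.0, 90.0]
latency_slo = SLO(name="mean search latency (ms)", upper=100.0)

sli = mean_latency_ms(samples)
print(f"{latency_slo.name}: {sli} -> met={latency_slo.is_met(sli)}")
```

Note that one slow request (110 ms) does not break the objective here, because the SLI in the quoted example is an average rather than a per-request limit.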

Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial—a rebate or a penalty—but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask "what happens if the SLOs aren’t met?": if there is no explicit consequence, then you are almost certainly looking at an SLO.

We should identify indicators to measure our service, set high objectives, and then make a service agreement that sits below those objectives, with enough tolerance that small outages do not affect the agreement. Perhaps what we offer isn't a classic SLA, but considering that gyp relies on our hosted header files, server downtime is a huge issue. How quickly our tarballs download (or whether they are available at all) affects CI systems such as Travis.
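As a rough sketch of the "agreement below the objective" idea, assuming availability measured from periodic probes is the indicator: the targets below (99.9% for the SLO, 99% for the SLA) and the probe counts are invented for illustration and are not numbers anyone has proposed.

```python
# Hypothetical targets: the internal objective is stricter than the agreement,
# so a small outage can miss the SLO without breaching the SLA.
SLO_TARGET = 0.999
SLA_TARGET = 0.99


def availability(successes: int, total: int) -> float:
    """SLI: fraction of probes (e.g. header or tarball downloads) that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    return successes / total


def classify(avail: float, slo: float = SLO_TARGET, sla: float = SLA_TARGET) -> str:
    """Return 'ok', 'slo_missed' (act internally), or 'sla_breached' (users affected)."""
    if avail >= slo:
        return "ok"
    if avail >= sla:
        return "slo_missed"
    return "sla_breached"


# Made-up probe counts for one measurement window.
status = classify(availability(successes=9_950, total=10_000))
print(status)
```

The gap between the two targets is the tolerance: a window at 99.5% availability misses the objective, which should prompt action, but stays inside what was promised.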

Overall I think that a slightly more prescriptive approach to our ops might help us not only avoid unforeseen outages but, more importantly, avoid regressing or repeating problems.

gibfahn · 2018-04-11T18:56:44Z

So tl;dr would be something like:

Define acceptable levels of downtime for user-facing web services, and steps to take if we exceed those levels.

?
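One way to read "acceptable levels of downtime" is as a budget derived from an availability target. A minimal sketch, assuming a 30-day window and a 99.9% target (both hypothetical; `downtime_budget` and `budget_exceeded` are illustrative names):

```python
from datetime import timedelta


def downtime_budget(target: float, window: timedelta) -> timedelta:
    """Downtime allowed in `window` while still meeting availability `target`."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return window * (1.0 - target)


def budget_exceeded(observed: timedelta, target: float, window: timedelta) -> bool:
    """True when observed downtime is over budget and follow-up steps should start."""
    return observed > downtime_budget(target, window)


# Hypothetical numbers: 99.9% over 30 days allows roughly 43 minutes of downtime.
window = timedelta(days=30)
budget = downtime_budget(0.999, window)
print(budget, budget_exceeded(timedelta(hours=1), 0.999, window))
```

The "steps to take" would then hang off `budget_exceeded` returning true; what those steps are is left open here.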

gibfahn · 2018-04-11T18:57:30Z

If so, then that seems reasonable to me, but the build team has two sets of users: "people who use node", and collaborators on projects in the node org that use our infra.

MylesBorins mentioned this issue Apr 12, 2018

Travel Fund approval and being careful about funds available nodejs/admin#99

Closed

MylesBorins closed this as completed Nov 20, 2018
