Identify SLIs to determine SLO so we can offer an SLA #1213

Closed
MylesBorins opened this issue Apr 5, 2018 · 2 comments

Comments

@MylesBorins
Contributor

Bit of a mouthful, but I'll dig in briefly.

https://landing.google.com/sre/book/chapters/service-level-objectives.html

We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.

An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. For example, we might decide that we will return Shakespeare search results "quickly," adopting an SLO that our average search request latency should be less than 100 milliseconds.

Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial—a rebate or a penalty—but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask "what happens if the SLOs aren’t met?": if there is no explicit consequence, then you are almost certainly looking at an SLO.

We should identify indicators to measure our service, set high objectives, and then make a service agreement that sits below those objectives with enough tolerance that small outages do not affect the agreement. Perhaps what we offer isn't a classic SLA, but considering that gyp relies on our hosted header files, server downtime is a huge issue. How quickly our tarballs download (and whether they are available at all) affects CI systems such as Travis.
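
To make the layering above concrete (an SLI measured from checks, a high internal SLO, and a looser agreement beneath it), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Probe` record, the function names, the sample data, and the 99.9% / 99.0% thresholds are hypothetical and are not proposed values for our infrastructure.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One synthetic check against a hosted file (hypothetical data model)."""
    ok: bool            # did the download succeed?
    latency_ms: float   # how long the request took


def availability_sli(probes):
    """SLI: fraction of probes that succeeded."""
    if not probes:
        raise ValueError("need at least one probe")
    return sum(p.ok for p in probes) / len(probes)


def classify(sli, slo=0.999, sla=0.99):
    """Compare an availability SLI against an internal SLO and a looser SLA.

    The thresholds are placeholders; the SLA is deliberately below the SLO
    so that small outages miss the objective without breaking the agreement.
    """
    if sla > slo:
        raise ValueError("the SLA should not be stricter than the SLO")
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Made-up sample: 995 successful probes and 5 failures.
probes = [Probe(ok=True, latency_ms=80.0)] * 995 + [Probe(ok=False, latency_ms=0.0)] * 5
sli = availability_sli(probes)
print(f"availability SLI = {sli:.4f}: {classify(sli)}")
```

The gap between the two thresholds is the tolerance mentioned above: a result that lands between them is a signal to act internally before any agreement with users is affected.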

Overall, I think that a slightly more prescriptive approach to our ops might help us not only avoid unforeseen outages but, more importantly, avoid regressing or repeating problems.

@gibfahn
Member

gibfahn commented Apr 11, 2018

So tl;dr would be something like:

Define acceptable levels of downtime for user-facing web services, and steps to take if we exceed those levels.

?
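
For a sense of scale, an "acceptable level of downtime" can be derived from an availability target with simple arithmetic. The sketch below is only an illustration: the 30-day window, the example targets, and the function name `allowed_downtime` are assumptions, not agreed numbers.

```python
from datetime import timedelta


def allowed_downtime(target, window=timedelta(days=30)):
    """Downtime permitted in `window` for an availability target in (0, 1]."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    # The permitted downtime is the unavailable fraction of the window.
    return window * (1.0 - target)


# Hypothetical targets, shown only to compare orders of magnitude.
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} over 30 days -> {allowed_downtime(target)}")
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43 minutes of downtime, which gives the "steps to take if we exceed those levels" a concrete trigger.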

@gibfahn
Member

gibfahn commented Apr 11, 2018

If so, that seems reasonable to me, but the build team has two sets of users: "people who use node", and collaborators on projects in the node org who use our infra.
