We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.
An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. For example, we might decide that we will return Shakespeare search results "quickly," adopting an SLO that our average search request latency should be less than 100 milliseconds.
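As a rough sketch of that SLI ≤ target structure, here is how the latency check might look in code. The 100 ms target matches the Shakespeare example above; the sample latencies are made up for illustration.

```python
# Sketch: an SLO of the form "SLI <= target" for average request latency.
# The sample latencies below are hypothetical, not real measurements.

def average_latency_ms(samples):
    """SLI: average request latency in milliseconds."""
    return sum(samples) / len(samples)

SLO_TARGET_MS = 100  # SLO: average search request latency < 100 ms

latencies = [42, 87, 95, 60, 110]  # per-request latencies (ms)
sli = average_latency_ms(latencies)
print(f"SLI = {sli:.1f} ms, SLO met: {sli < SLO_TARGET_MS}")
```

The same shape works for a bounded SLO (lower bound ≤ SLI ≤ upper bound) by adding a second comparison.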
Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial—a rebate or a penalty—but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask "what happens if the SLOs aren’t met?": if there is no explicit consequence, then you are almost certainly looking at an SLO.
We should identify indicators to measure the service, set high objectives, and make a service agreement that sits below those objectives with enough tolerance that small outages don't breach the agreement. Perhaps what we offer isn't a classic SLA, but considering that gyp relies on our hosted header files, server downtime is a huge issue. How quickly our tarballs download (and their availability) affects CI systems such as Travis.
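To illustrate the "agreement below the objectives, with tolerance" idea: the gap between an internal SLO and the external SLA is the budget that absorbs small outages. The 99.9%/99.5% figures below are illustrative, not proposed targets for our infra.

```python
# Sketch: setting the SLA below the SLO so small outages don't breach
# the agreement. Availability figures here are examples only.

SLO_AVAILABILITY = 0.999  # internal objective
SLA_AVAILABILITY = 0.995  # external agreement, deliberately looser

def monthly_downtime_budget_minutes(availability, minutes_in_month=30 * 24 * 60):
    """Downtime allowed per month at a given availability target."""
    return (1 - availability) * minutes_in_month

slo_budget = monthly_downtime_budget_minutes(SLO_AVAILABILITY)
sla_budget = monthly_downtime_budget_minutes(SLA_AVAILABILITY)
print(f"SLO allows {slo_budget:.1f} min/month; SLA allows {sla_budget:.1f} min/month")
# The difference between the two budgets is the tolerance for small outages.
```

Missing the SLO then triggers internal action (alerting, postmortems) well before the SLA is at risk.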
Overall I think a slightly more prescriptive approach to our ops would help us not only avoid unforeseen outages, but more importantly avoid regressing and repeating past problems.
If so, then this seems reasonable to me, but the build team has two sets of users: "people who use node", and collaborators on projects in the node org that use our infra.
Bit of a mouthful, but I'll dig in briefly.
https://landing.google.com/sre/book/chapters/service-level-objectives.html