Add SLO Lifecycle ADR #13
Signed-off-by: Lisa Seelye <lisa@users.noreply.github.com>
/hold for review

## Proposed Architecture

Add Service Level Objectives (SLOs) as a feature to services with its own distinct lifecycle described herein. The PM will be accountable for the SLO feature.
We got some feedback that using the word "feature" would not work out the way I thought it would. I recommend not using this word.
Good catch. I will make the change.
With an accepted proposal from the Research Phase, an owner can be found to implement the changes. Not all of these changes are necessarily made by software engineers. Certainly, changes to instrument the service’s codebase are in the domain of the software engineer, but changing team policies and practices is more likely to be the domain of the engineering lead. Remember that the work isn’t intended to be wholly owned by the person to whom it is assigned; that person can and should work in concert with people who can assist. [Consider the RACI chart][raci-chart], which shows that more than one role is involved; work as a team to accomplish the goals.

The owner will likely be an engineer on the service's team who may work in concert with an appropriate SRE IC to effect changes to the service and related ecosystem.
s/IC//
These steps can be used to evaluate the SLOs themselves, but the targets set by the service's SLO(s) should also be reviewed:

* Are the service's SLOs being met?
* If they're not, what are common reasons why not?
And whose job is it to fix? E.g., do we have upstream quality issues that could be caught by new tests?
I will add a new section for this.
* The existence of an SLO
* Reporting for SLO compliance and error budget exhaustion
* Willingness to prioritize reliability work over feature work when necessary
Here again, use the word "customer" or "customer experience". It's not reliability for reliability's sake; it's because we're proxying customer experience through SLOs, and these generally focus on reliability (but not exclusively).
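The "SLO compliance and error budget exhaustion" reporting mentioned in the list above can be illustrated with a little arithmetic. The following is only a minimal sketch, assuming an event-based SLI (good events over total events); the `SloWindow` class and its field names are hypothetical and do not come from the ADR under review.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Event counts for one SLO reporting window (hypothetical structure)."""

    target: float      # SLO target as a fraction, e.g. 0.999
    total_events: int  # all valid events observed in the window
    bad_events: int    # events that violated the SLI

    def compliance(self) -> float:
        """Fraction of good events; treated as 1.0 when there was no traffic."""
        if self.total_events == 0:
            return 1.0
        return 1.0 - self.bad_events / self.total_events

    def error_budget(self) -> float:
        """Number of bad events the target tolerates in this window."""
        return (1.0 - self.target) * self.total_events

    def budget_consumed(self) -> float:
        """Share of the error budget used; 1.0 or more means exhausted."""
        budget = self.error_budget()
        if budget == 0:
            return 0.0 if self.bad_events == 0 else float("inf")
        return self.bad_events / budget

    def is_met(self) -> bool:
        """True when measured compliance is at or above the target."""
        return self.compliance() >= self.target


if __name__ == "__main__":
    window = SloWindow(target=0.999, total_events=1_000_000, bad_events=500)
    print(f"compliance:      {window.compliance():.4%}")
    print(f"budget consumed: {window.budget_consumed():.1%}")
    print(f"SLO met:         {window.is_met()}")
```

A report built this way answers both questions at once: whether the target was met, and how much room remains before reliability work has to take priority.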
Signed-off-by: Lisa Seelye <lisa@users.noreply.github.com>
### Permissive SLO Phase

Teams may find adopting this SLO lifecycle challenging for any number of reasons, whether they're listed in the [challenges](#challenges) section or not. For teams that want to adopt a lightweight approach to the SLO lifecycle, a "permissive phase" could be helpful.
We could also mention here that we can have aspirational but realistic SLOs and actual SLOs together.
Don't aspirational SLOs aim to become "actual" SLOs, though?
Yep. Just as a concept, though, they can be implemented to help calibrate your actual SLOs better.
@dastergon I've had a think, and maybe we leave that out of the lifecycle doc here and give the idea a home in a kind of SLO cookbook, since I know we have teams that chose that path.
Let's table the aspirational SLOs for the moment, since we can add them later to either this doc or the cookbook once we have a worked example.
Signed-off-by: Lisa Seelye <lisa@users.noreply.github.com>
It's not really fixed so much as it's no longer pointing to somewhere that's just blatantly wrong.
/hold cancel
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Add the SLO Lifecycle document.
Some places here will need to be edited after the merge: there are references to the RACI document that should link to it, but that document is not ready yet (and it also links back to this one).
For SIGSRE-59.
Signed-off-by: Lisa Seelye <lisa@users.noreply.github.com>