Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SLO Lifecycle ADR #13

Merged
merged 4 commits into from Jun 14, 2022
Merged

Conversation

lisa
Copy link
Contributor

@lisa lisa commented May 11, 2022

Add the SLO Lifecycle document.

There are some places here that will need to be edited after a merge since there are references to RACI document that will want to link to that document, but that's not ready yet (and also links to this document).

For SIGSRE-59.

Signed-off-by: Lisa Seelye lisa@users.noreply.github.com

Signed-off-by: Lisa Seelye <lisa@users.noreply.github.com>
@sesheta sesheta added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 11, 2022
@lisa
Copy link
Contributor Author

lisa commented May 11, 2022

/hold

for review

@sesheta sesheta added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2022

## Proposed Architecture

Add Service Level Objectives (SLOs) as a feature to services with its own distinct lifecycle described herein. The PM will be accountable for the SLO feature.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We got some feedback that using the word feature would not work out the way I thought it would, recommend not using this word.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I will make the change.


With an accepted proposal from the Research Phase, an owner can be found to implement the changes. Not all of these changes are necessarily made by software engineers. Certainly changes to instrument the service’s codebase is in the domain of the software engineer, but changing team policies and practices is more likely to be the domain of the engineering lead. Remember that the work isn’t intended to be wholly owned by the person to whom it is assigned, that person can and should work in concert with people who can assist. [Consider the RACI chart][raci-chart] which shows that more than just one role is involved; work as a team to accomplish the goals.

The owner will likely be an engineer on the service's team who may work in concert with an appropriate SRE IC to affect changes to the service and related ecosystem.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/IC//

These steps can be used to evaluate the SLOs themselves, but there's also review on the targets set by the service's SLO(s):

* Are the service's SLOs being met?
* If they're not, what are common reasons why not?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And who's job is it to fix? e.g. do we have upstream quality issues that could be caught by new tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a new section for this.


* The existence of an SLO
* Reporting for SLO compliance and error budget exhaustion
* Willingness to prioritize reliability work over feature work when necessary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here again, use the word customer or customer experience, it's not reliability for reliabilities sake. its because we're proxying customer experience through SLOs and these generally focus on reliability (but not exclusively).

Signed-off-by: Lisa Seelye <lisa@users.noreply.github.com>

### Permissive SLO Phase

Teams may find that adopting this SLO lifecycle challenging for any number of reasons, whether they're listed in the [challenges](#challenges) section or not. For teams that want adopt a lightweight approach to the SLO lifecycle, a "permissive phase" could be helpful.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also mention here that we can also have aspirational but realistic SLOs and actual SLOs together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't aspirational SLOs aim to become "actual" SLOs, though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, just as a concept though, they can be implemented to help calibrate your actual SLOs better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dastergon I've had a think and maybe we leave that out of the lifecycle doc here and give the idea a home in the kind of slo cookbook, since I know we have teams that chose that path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's table the aspirational SLOs for the moment only because we can add it later to either this, or the cookbook once we have a worked example.

lisa and others added 2 commits June 2, 2022 13:29
Signed-off-by: Lisa Seelye <lisa@users.noreply.github.com>
It's not really fixed in so much as it's not pointing to somewhere that's just blatantly wrong now.
@lisa
Copy link
Contributor Author

lisa commented Jun 13, 2022

/hold cancel

@sesheta sesheta removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2022
@jeremyeder
Copy link
Contributor

/lgtm

@sesheta sesheta added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2022
@dastergon
Copy link
Contributor

/lgtm

@jeremyeder jeremyeder added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 14, 2022
@sesheta
Copy link
Member

sesheta commented Jun 14, 2022

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sesheta sesheta merged commit 2c3d36d into operate-first:main Jun 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants