Add a CI on-call and setup guidelines for reverting PR and freezing master #6801
Comments
Just a thought: why can't we run the full test suite on each PR? I mean, include cwag_integ_test, lte_integ_test, etc. from Circle CI in the mandatory PR tests. That will make a PR take a long time to land, but we will make sure no bad code is landed.
Hey @uri200, that is the plan for the new Jenkins-based CI that @tmdzk is working on. (We can't quite do this with our current setup for security reasons.) Once that is done, the on-call job should get easier, as we would greatly lower the chance of merging bad code. But it is still good to define what steps need to be taken when master gets broken.
Once this gets formalized, let's add it to |
CI Health Strategies
This GH issue outlines actions to be taken to improve CI health on master.
Current state of CI health
In the past 30 days (Mar 20 2021 - Apr 26 2021), less than half of our 325 commits resulted in a deployable build.
*The per-job success rate for deploy jobs is higher because they are only run if the test jobs pass.
Problem statements
Proposed solution to improve problem #1
While PR authors can optionally trigger the Jenkins CWAG/LTE integration tests as a sanity check, these tests do not prevent bad PRs from merging because they are not mandatory. We cannot make them mandatory just yet, as the infrastructure is not ready. The new CI infrastructure has plans to make them mandatory.
In the meantime, we can try to catch more non-functional breakages at the precommit level, for example by bringing up containers to check for crashloops.
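As a rough illustration of the crashloop idea, the sketch below flags containers that look unhealthy after being brought up. It assumes the input is the parsed JSON output of `docker inspect` (which reports `Name`, `RestartCount`, and a `State` object per container). The function name, the restart threshold, and the container names in the usage example are hypothetical and not part of the existing precommit setup.

```python
from typing import Iterable, List


def find_crashlooping(containers: Iterable[dict], max_restarts: int = 0) -> List[str]:
    """Return the names of containers that appear to be crashlooping.

    Each item is assumed to be one entry of parsed `docker inspect` JSON.
    A container is flagged if it is currently restarting, has exited with
    a non-zero code, or has restarted more than `max_restarts` times.
    """
    flagged = []
    for container in containers:
        state = container.get("State", {})
        restarting = state.get("Restarting", False)
        exited_badly = (
            state.get("Status") == "exited" and state.get("ExitCode", 0) != 0
        )
        too_many_restarts = container.get("RestartCount", 0) > max_restarts
        if restarting or exited_badly or too_many_restarts:
            flagged.append(container.get("Name", "<unknown>"))
    return flagged


# Usage with made-up inspect data (container names are hypothetical).
sample = [
    {"Name": "/service_a", "RestartCount": 0,
     "State": {"Status": "running", "Restarting": False, "ExitCode": 0}},
    {"Name": "/service_b", "RestartCount": 4,
     "State": {"Status": "restarting", "Restarting": True, "ExitCode": 1}},
]
unhealthy = find_crashlooping(sample)
```

A precommit job could wait a short period after bringing the containers up, run a check like this, and fail if the returned list is non-empty.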
Proposed solution to improve problems #2, #3, #4
The current method for catching CI breakages on master is via the #ci Slack channel. In the long term, we can set up a simple dashboard with the new CI to make breakage detection easier.
I propose we create an on-call rotation to manually monitor the #ci Slack channel for any obvious breakages. If there is an actual breakage on master, the on-call will declare a merge freeze to all Magma maintainers. By declaring a merge freeze, we should prevent any new breakages from sneaking into master.
As for enforcing the merge freeze, I've found a Merge Freeze GH Action that seems interesting to try out. It adds a check to each PR that fails while a merge freeze is in place, and the check can be made mandatory. Administrators will still be able to bypass the check. The upside of this action is its Slack integration: the merge freeze can be triggered from Slack, and the action can be configured to message a Slack channel when a branch is frozen or unfrozen.
For additional tooling, PagerDuty provides an on-call tracking service that also integrates with Slack.
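To make the mechanism concrete, here is a minimal sketch of the decision a freeze check has to make for each PR. This is not the Merge Freeze action's implementation. `FreezeState`, `freeze_check`, and the rule that revert PRs pass during a freeze are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FreezeState:
    """Hypothetical record of whether master is currently frozen."""
    frozen: bool
    reason: str = ""


def freeze_check(state: FreezeState, pr_is_revert: bool = False) -> Tuple[str, str]:
    """Return a (conclusion, message) pair for a PR status check.

    Assumption: revert PRs are allowed through during a freeze, since
    they are how a breakage on master gets fixed.
    """
    if not state.frozen:
        return ("success", "master is open for merges")
    if pr_is_revert:
        return ("success", "merge freeze in place, revert PRs are allowed")
    return ("failure", f"merge freeze in place: {state.reason}")


# Usage: the on-call declares a freeze, then lifts it once master is green.
during_freeze = freeze_check(FreezeState(frozen=True, reason="integration tests failing"))
after_freeze = freeze_check(FreezeState(frozen=False))
```

Because the check only blocks merging when it is marked as required on the branch, administrators retain the ability to bypass it, as noted above.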
On-call Members
The Magma maintainers should nominate a set of maintainers who have shown themselves to be good shepherds of CI health.
To start off, I nominate the following people
On-call Duration
We can start off with a one-week duration.
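A one-week rotation is simple enough to generate mechanically. The sketch below builds a round-robin schedule. The function name, member names, and start date are placeholders for illustration, not the nominated on-call members or an agreed start date.

```python
from datetime import date, timedelta
from itertools import cycle, islice
from typing import List, Sequence, Tuple


def build_rotation(
    members: Sequence[str], start: date, shifts: int, shift_days: int = 7
) -> List[Tuple[str, date, date]]:
    """Return (member, first_day, last_day) tuples for consecutive shifts.

    Members are assigned round-robin; each shift lasts `shift_days` days.
    """
    schedule = []
    for index, member in enumerate(islice(cycle(members), shifts)):
        first_day = start + timedelta(days=index * shift_days)
        last_day = first_day + timedelta(days=shift_days - 1)
        schedule.append((member, first_day, last_day))
    return schedule


# Usage with placeholder names, starting on a Monday.
rotation = build_rotation(["alice", "bob"], date(2021, 5, 3), shifts=3)
for member, first_day, last_day in rotation:
    print(f"{first_day} to {last_day}: {member}")
```

Changing `shift_days` makes it easy to revisit the duration later without touching the rest of the schedule logic.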
On-call Responsibilities
Relevant Tests / Jobs
While all CI check failures should raise an alert, we should be most careful about checks that are not covered by or slightly differ from precommit checks.
Logical tests not covered in precommit
Deploy jobs not covered in precommit
Responsibilities
Steps on reverting a PR
Use the revert button to create a PR that reverts the original PR. Add the TSC members as reviewers so that they are notified.