Add a CI on-call and setup guidelines for reverting PR and freezing master #6801

themarwhal · 2021-05-06T16:35:58Z

CI Health Strategies

This GH issue outlines actions to be taken to better improve CI health on master.

Current state of CI health

In the past 30 days (Mar 20 2021 - Apr 26 2021), less than half of our 325 commits resulted in a deployable build.
*the per-job success rate for deploy jobs are higher because they are only run if the test jobs pass

job name	success rate per job	success rate per commit	longest breakage
lte-integration-test	42%	N/A	16 days
cwf-integration-test	59%	N/A	12 days
agw-deploy	90%	38%	16 days
cwag-deploy	87%	51%	12 days

Problem statements

Our mandatory pre-commit CI checks severely lack in GW integration testing compared to post-commit checks
It takes the team around a day to notice CI failures and file GitHub issues
Fixing CI breakages on master is not treated as a high priority item
Merging PRs while CI is red adds additional problems on master that prolong CI recovery time

Proposed solution to improve problem #1

While PR authors can optionally trigger Jenkins CWAG/LTE integration test to sanity check, they do not prevent bad PRs from merging as it is not mandatory. We cannot make them mandatory just yet as the infrastructure is not ready. The new CI infrastructure has plans to make them mandatory.

In the meantime, we can try to catch more non-functional breakages at the precommit level such as bringing up containers to check for crashloops.

Proposed solution to improve problem #2, #3, #4

The current method for catching CI breakages on master is via the #ci slack channel. In the long term, we can set up a simple dashboard with the new CI to make breakage detection easier.

I propose we create an on-call rotation to manually monitor the #ci slack channel for any obvious breakages. If there is an actual breakage on master, the oncall will declare a merge freeze to all magma maintainers. By declaring a merge freeze, we should prevent any new breakages from sneaking into master.

As for enforcing the merge freeze, I've found a Merge Freeze GH Action that seems interesting to try out. This essentially adds a failing check that can be toggled into a mandatory check when a merge freeze is in place. Administrators will still be able to bypass the check. The upside of this action is the Slack integration where the merge freeze can be triggered from Slack. Additionally, it can be configured to message a slack chennel when a branch is freezed / unfreezed.

For additional tooling, Pager Duty provides an oncall tracking service that also integrates with Slack.

On-call Members

The Magma maintainers should nominate a set of maintainers that have shown to be good shepherds of CI health.
To start off, I nominate the following people

On-call Duration

We can start off with a 1 week duration.

On-call Responsibilities

Relevant Tests / Jobs

While all CI check failures should raise an alert, we should be most careful about checks that are not covered by or slightly differ from precommit checks.

Logical tests not covered in precommit

cwf-integ-test
lte-integ-test

Deploy jobs not covered in precommit

nms-build
cwag-deploy*
xwfm-deploy-latest
orc8r-build
lte-agw-deploy
feg-build

Responsibilities

If a same test / build fails more than 3 times in a row, the on-call should file a GH issue and announce a merge freeze
Look through merged commits since last healthy commit
1. If it is clear which PR caused the breakage, use the GitHub UI (see below) to create a PR that reverts the change. Contact one of the TSC members so that it can be force merged into master
2. Consult the DevOps team if there seems to be an issue with infrastructure
Once the master has green checks, announce a merge unfreeze

Steps on reverting a PR

Use the revert button to create a PR that reverts the original PR.

Add the TSC members as reviewers so that they are notified.

The text was updated successfully, but these errors were encountered:

uri200 · 2021-05-06T17:02:52Z

Just a thought: why can't we run the full test suit on each PR. I mean, include the cwag_integ_test and lte_integ_test etc.. from Circle CI into the PR mandatory tests?

That will take long time for a PR to land but we will make sure no bad code is landed

themarwhal · 2021-05-06T17:12:24Z

Just a thought: why can't we run the full test suit on each PR. I mean, include the cwag_integ_test and lte_integ_test etc.. from Circle CI into the PR mandatory tests?

That will take long time for a PR to land but we will make sure no bad code is landed

Hey @uri200, so that is the plan for the new jenkins based CI @tmdzk is working on. (We can't quite do this with our current setup due to security reasons.)

Obviously, once that is done the oncall job should get easier as we would greatly lower the chance of merging bad code. But it is still good to define what steps need to be taken when master gets broken.

hcgatewood · 2021-05-06T20:36:35Z

Once this gets formalized, let's add it to Contribute docusaurus

themarwhal mentioned this issue May 6, 2021

doc: Add a CI health oncall proposal #6798

Closed

1 task

themarwhal added the type: proposal Proposals and design documents label May 6, 2021

themarwhal added type: proposal Proposals and design documents and removed type: proposal Proposals and design documents labels May 10, 2021

hcgatewood changed the title ~~[Proposal] Add a CI on-call and setup guidelines for reverting PR and freezing master~~ Add a CI on-call and setup guidelines for reverting PR and freezing master May 20, 2021

hcgatewood added the status: accepted label May 20, 2021

hcgatewood closed this as completed May 20, 2021

hcgatewood added the tsc label May 21, 2021

hcgatewood added proposal:result:accepted and removed status: accepted labels Jun 4, 2021

hcgatewood added the status: accepted label Jul 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a CI on-call and setup guidelines for reverting PR and freezing master #6801

Add a CI on-call and setup guidelines for reverting PR and freezing master #6801

themarwhal commented May 6, 2021 •

edited

Loading

uri200 commented May 6, 2021

themarwhal commented May 6, 2021 •

edited

Loading

hcgatewood commented May 6, 2021

Add a CI on-call and setup guidelines for reverting PR and freezing master #6801

Add a CI on-call and setup guidelines for reverting PR and freezing master #6801

Comments

themarwhal commented May 6, 2021 • edited Loading

CI Health Strategies

Current state of CI health

Problem statements

Proposed solution to improve problem #1

Proposed solution to improve problem #2, #3, #4

On-call Members

On-call Duration

On-call Responsibilities

Relevant Tests / Jobs

Responsibilities

Steps on reverting a PR

uri200 commented May 6, 2021

themarwhal commented May 6, 2021 • edited Loading

hcgatewood commented May 6, 2021

themarwhal commented May 6, 2021 •

edited

Loading

themarwhal commented May 6, 2021 •

edited

Loading