Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a CI on-call and setup guidelines for reverting PR and freezing master #6801

Closed
themarwhal opened this issue May 6, 2021 · 3 comments
Closed
Labels
status: accepted type: proposal Proposals and design documents

Comments

@themarwhal
Copy link
Member

themarwhal commented May 6, 2021

CI Health Strategies

This GH issue outlines actions to be taken to better improve CI health on master.

Current state of CI health

In the past 30 days (Mar 20 2021 - Apr 26 2021), less than half of our 325 commits resulted in a deployable build.
*the per-job success rate for deploy jobs are higher because they are only run if the test jobs pass

job name success rate per job success rate per commit longest breakage
lte-integration-test 42% N/A 16 days
cwf-integration-test 59% N/A 12 days
agw-deploy 90% 38% 16 days
cwag-deploy 87% 51% 12 days

Problem statements

  1. Our mandatory pre-commit CI checks severely lack in GW integration testing compared to post-commit checks
  2. It takes the team around a day to notice CI failures and file GitHub issues
  3. Fixing CI breakages on master is not treated as a high priority item
  4. Merging PRs while CI is red adds additional problems on master that prolong CI recovery time

Proposed solution to improve problem #1

While PR authors can optionally trigger Jenkins CWAG/LTE integration test to sanity check, they do not prevent bad PRs from merging as it is not mandatory. We cannot make them mandatory just yet as the infrastructure is not ready. The new CI infrastructure has plans to make them mandatory.

In the meantime, we can try to catch more non-functional breakages at the precommit level such as bringing up containers to check for crashloops.

Proposed solution to improve problem #2, #3, #4

The current method for catching CI breakages on master is via the #ci slack channel. In the long term, we can set up a simple dashboard with the new CI to make breakage detection easier.

I propose we create an on-call rotation to manually monitor the #ci slack channel for any obvious breakages. If there is an actual breakage on master, the oncall will declare a merge freeze to all magma maintainers. By declaring a merge freeze, we should prevent any new breakages from sneaking into master.

As for enforcing the merge freeze, I've found a Merge Freeze GH Action that seems interesting to try out. This essentially adds a failing check that can be toggled into a mandatory check when a merge freeze is in place. Administrators will still be able to bypass the check. The upside of this action is the Slack integration where the merge freeze can be triggered from Slack. Additionally, it can be configured to message a slack chennel when a branch is freezed / unfreezed.

For additional tooling, Pager Duty provides an oncall tracking service that also integrates with Slack.

On-call Members

The Magma maintainers should nominate a set of maintainers that have shown to be good shepherds of CI health.
To start off, I nominate the following people

On-call Duration

We can start off with a 1 week duration.

On-call Responsibilities

Relevant Tests / Jobs

While all CI check failures should raise an alert, we should be most careful about checks that are not covered by or slightly differ from precommit checks.

Logical tests not covered in precommit

  • cwf-integ-test
  • lte-integ-test

Deploy jobs not covered in precommit

  • nms-build
  • cwag-deploy*
  • xwfm-deploy-latest
  • orc8r-build
  • lte-agw-deploy
  • feg-build

Responsibilities

  1. If a same test / build fails more than 3 times in a row, the on-call should file a GH issue and announce a merge freeze
  2. Look through merged commits since last healthy commit
    1. If it is clear which PR caused the breakage, use the GitHub UI (see below) to create a PR that reverts the change. Contact one of the TSC members so that it can be force merged into master
    2. Consult the DevOps team if there seems to be an issue with infrastructure
  3. Once the master has green checks, announce a merge unfreeze

Steps on reverting a PR

Use the revert button to create a PR that reverts the original PR.

GH-revert-button

GH-created-PR

Add the TSC members as reviewers so that they are notified.

@uri200
Copy link
Contributor

uri200 commented May 6, 2021

Just a thought: why can't we run the full test suit on each PR. I mean, include the cwag_integ_test and lte_integ_test etc.. from Circle CI into the PR mandatory tests?

That will take long time for a PR to land but we will make sure no bad code is landed

@themarwhal
Copy link
Member Author

themarwhal commented May 6, 2021

Just a thought: why can't we run the full test suit on each PR. I mean, include the cwag_integ_test and lte_integ_test etc.. from Circle CI into the PR mandatory tests?

That will take long time for a PR to land but we will make sure no bad code is landed

Hey @uri200, so that is the plan for the new jenkins based CI @tmdzk is working on. (We can't quite do this with our current setup due to security reasons.)

Obviously, once that is done the oncall job should get easier as we would greatly lower the chance of merging bad code. But it is still good to define what steps need to be taken when master gets broken.

@themarwhal themarwhal added the type: proposal Proposals and design documents label May 6, 2021
@hcgatewood
Copy link
Contributor

Once this gets formalized, let's add it to Contribute docusaurus

@themarwhal themarwhal added type: proposal Proposals and design documents and removed type: proposal Proposals and design documents labels May 10, 2021
@hcgatewood hcgatewood changed the title [Proposal] Add a CI on-call and setup guidelines for reverting PR and freezing master Add a CI on-call and setup guidelines for reverting PR and freezing master May 20, 2021
@hcgatewood hcgatewood added the tsc label May 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: accepted type: proposal Proposals and design documents
Projects
None yet
Development

No branches or pull requests

3 participants