
Kubernetes CI Policy: all prow.k8s.io jobs must have testgrid alerts configured #18599

Open
spiffxp opened this issue Aug 1, 2020 · 30 comments
Labels: area/jobs, area/testgrid, help wanted, kind/cleanup, lifecycle/frozen, sig/testing


spiffxp commented Aug 1, 2020

Part of #18551

Why this is important:

  • To ensure effective use of community resources, we expect them to be spent on jobs that provide useful signal and are actively maintained
  • Configuring testgrid alerts requires setting an e-mail address, which gives us a point of contact to escalate to if a job is deemed an ineffective use of community resources
  • We'll use this to implement a policy where we reserve the right to remove/disable jobs that are deemed an ineffective use of resources (e.g. perma-failing for O(weeks)) if the point of contact is unresponsive

TODO:

  • come up with a test to enforce this policy (logging only at first)
  • come up with a report to identify jobs and likely candidate sig owners (see the owner-guessing sketch after this list)
    • e.g. if a job is on "sig-foo"'s dashboard, it's likely they want to own it
    • who do we contact for jobs that don't map to a sig?
  • send out notice to all sigs, give a deadline of N weeks
  • any jobs that don't have e-mail addresses configured should be removed
  • flip test to failing once all jobs meet the policy
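
For illustration, a minimal sketch of the kind of owner-guessing helper such a report could use; guessOwner and the package name are made up, not from any existing tool:

    package report

    import "regexp"

    // ownerRE captures a leading sig-/wg- prefix from a testgrid dashboard
    // name, e.g. "sig-node-kubelet" -> "sig-node".
    var ownerRE = regexp.MustCompile(`^((?:sig|wg)-[a-z0-9]+)`)

    // guessOwner returns the likely owning sig/wg for a dashboard name, or
    // "" when there is no sig-/wg- prefix and a human needs to pick a contact.
    func guessOwner(dashboard string) string {
        if m := ownerRE.FindStringSubmatch(dashboard); m != nil {
            return m[1]
        }
        return ""
    }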

Other thoughts / notes:

  • we need to parse both the prowjob config and the testgrid config for this (see the parsing sketch below)
    • there are some testgrids not populated by prow; we can probably ignore these?
    • there are some prowjobs that don't have all of their testgrid config in annotations
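
A minimal, self-contained sketch of the testgrid half of that parsing (using gopkg.in/yaml.v2, a hypothetical config path, and stripped-down local structs rather than the real testgrid proto types):

    package main

    import (
        "fmt"
        "os"

        "gopkg.in/yaml.v2"
    )

    // Stripped-down mirror of only the testgrid config fields we care about.
    type tgConfig struct {
        Dashboards []struct {
            Name         string `yaml:"name"`
            DashboardTab []struct {
                Name         string `yaml:"name"`
                AlertOptions *struct {
                    AlertMailToAddresses string `yaml:"alert_mail_to_addresses"`
                } `yaml:"alert_options"`
            } `yaml:"dashboard_tab"`
        } `yaml:"dashboards"`
    }

    func main() {
        raw, err := os.ReadFile("config/testgrids/config.yaml") // hypothetical path
        if err != nil {
            panic(err)
        }
        var cfg tgConfig
        if err := yaml.Unmarshal(raw, &cfg); err != nil {
            panic(err)
        }
        // Log every dashboard tab with no alert e-mail configured.
        for _, d := range cfg.Dashboards {
            for _, tab := range d.DashboardTab {
                if tab.AlertOptions == nil || tab.AlertOptions.AlertMailToAddresses == "" {
                    fmt.Printf("NOTICE: %s/%s has no alert_mail_to_addresses\n", d.Name, tab.Name)
                }
            }
        }
    }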

/sig testing


spiffxp commented Sep 10, 2020

Some prior art to serve as starting points:

  • go run ./experiment/prowjob-report/main.go --config ./config/prow/config.yaml --job-config ./config/jobs --format csv > jobs.csv, which I periodically reimport into this spreadsheet
    • the owner_dash column is a guess on who should own the job based on which sig/wg-prefixed testgrid dashboard the job lives under
    • could run with --format json and pipe through to jq
    • there are some gaps; the big one I'm aware of is that it misses testgrid info that is solely in testgrid config
  • This test (which does more than its name lets on) should serve as a good starting point for how to fail or log jobs/tabs that don't have alerts configured:
    func TestReleaseBlockingJobsMustHaveTestgridDescriptions(t *testing.T) {
        // TODO(spiffxp): start with master, enforce for all release branches
        re := regexp.MustCompile("^sig-release-master-(blocking|informing)$")
        for _, dashboard := range cfg.Dashboards {
            if !re.MatchString(dashboard.Name) {
                continue
            }
            suffix := re.FindStringSubmatch(dashboard.Name)[1]
            for _, dashboardtab := range dashboard.DashboardTab {
                intro := fmt.Sprintf("dashboard_tab %v/%v is release-%v", dashboard.Name, dashboardtab.Name, suffix)
                if dashboardtab.Name == "" {
                    t.Errorf("%v: - Must have a name", intro)
                }
                if dashboardtab.TestGroupName == "" {
                    t.Errorf("%v: - Must have a test_group_name", intro)
                }
                if dashboardtab.Description == "" {
                    t.Errorf("%v: - Must have a description", intro)
                }
                // TODO(spiffxp): enforce for informing as well
                if suffix == "informing" {
                    if !strings.HasPrefix(dashboardtab.Description, "OWNER: ") {
                        t.Logf("NOTICE: %v: - Must have a description that starts with OWNER: ", intro)
                    }
                    if dashboardtab.AlertOptions == nil {
                        t.Logf("NOTICE: %v: - Must have alert_options (ensure informing dashboard is listed first in testgrid-dashboards)", intro)
                    } else if dashboardtab.AlertOptions.AlertMailToAddresses == "" {
                        t.Logf("NOTICE: %v: - Must have alert_options.alert_mail_to_addresses", intro)
                    }
                } else {
                    if dashboardtab.AlertOptions == nil {
                        t.Errorf("%v: - Must have alert_options (ensure blocking dashboard is listed first in testgrid-dashboards)", intro)
                    } else if dashboardtab.AlertOptions.AlertMailToAddresses == "" {
                        t.Errorf("%v: - Must have alert_options.alert_mail_to_addresses", intro)
                    }
                }
            }
        }
    }
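
As a hedged sketch, here's what the logging-only version for all jobs (per the TODO list above) might look like; it reuses the same package-level cfg and imports as the snippet above, and the test name is hypothetical:

    // Walk every dashboard tab (not just sig-release-master-*) and log
    // tabs that lack alert configuration. Logging only for now; flip the
    // t.Logf calls to t.Errorf once all jobs meet the policy.
    func TestAllDashboardTabsShouldHaveAlerts(t *testing.T) {
        for _, dashboard := range cfg.Dashboards {
            for _, dashboardtab := range dashboard.DashboardTab {
                intro := fmt.Sprintf("dashboard_tab %v/%v", dashboard.Name, dashboardtab.Name)
                if dashboardtab.AlertOptions == nil {
                    t.Logf("NOTICE: %v: - Must have alert_options", intro)
                } else if dashboardtab.AlertOptions.AlertMailToAddresses == "" {
                    t.Logf("NOTICE: %v: - Must have alert_options.alert_mail_to_addresses", intro)
                }
            }
        }
    }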


spiffxp commented Sep 11, 2020

/help

@k8s-ci-robot added the help wanted label Sep 11, 2020

RobertKielty commented Sep 17, 2020

@spiffxp, @ScrapCodes, @rayandas, and I will work together on the test. I'm coordinating with them both now.


RobertKielty commented Sep 19, 2020

So @rayandas, @ScrapCodes, and I met up to do some exploratory work on this, and as a result we learned a little Bazel!

To run the test above, we need to invoke Bazel as follows:

cd TEST_INFRA_REPO_ROOT
bazel test //config/tests/testgrids:go_default_test --test_output=all

Worth noting that we must use Bazel to run this test, not the plain go test runner.

The Bazel BUILD file pulls in the testgrid runtime configuration via the following dependencies:

        "@com_github_googlecloudplatform_testgrid//config:go_default_library",
        "@com_github_googlecloudplatform_testgrid//pb/config:go_default_library",

and also to pass parameters to the test:

        "--config=$(location testconf.pb)",
        "--prow-config=$(location //config/prow:config.yaml)",
        "--job-config=config/jobs",
 

@RobertKielty RobertKielty moved this from Backlog to In Progress in CI Policy Improvements Sep 20, 2020
@RobertKielty

/assign
/remove help-wanted


RobertKielty commented Sep 20, 2020

To run the specific test, use:

bazel test //config/tests/testgrids:go_default_test \
--test_output=all \
--test_filter=TestReleaseBlockingJobsMustHaveTestgridDescriptions

@RobertKielty

/remove help-wanted


spiffxp commented Nov 3, 2020

@RobertKielty the command is /remove-help (not intuitive IMO)... but speaking of, are you still working on this?


spiffxp commented Nov 3, 2020

Reviewed #19286 (review)

@RobertKielty

Reviewing this now.

@RobertKielty

I want to talk to @spiffxp about this when he gets back.

@RobertKielty

Spoke with @spiffxp about this issue; I proposed writing a helper function to decouple the selection of Kubernetes jobs from testing their policy conformance.

@RobertKielty

/remove help


spiffxp commented Jan 8, 2021

/remove-help

@k8s-ci-robot removed the help wanted label Jan 8, 2021

spiffxp commented Feb 9, 2021

/milestone v1.21
Checking back in on this

@RobertKielty

It might be a good idea to hand this off to someone else.

@BenTheElder modified the milestones: v1.21, v1.23 Jun 25, 2021
@spiffxp moved this from To Triage to Help Wanted in sig-testing issues Jul 27, 2021

spiffxp commented Jul 27, 2021

/help

I don't really see any way around having to use the tests in config/tests/testgrids. It's the only way you can guarantee you're walking testgrid configs that are the result of both whatever prowjob annotations exist and whatever manually defined testgrids there are. And the testgrid config is where the alerting is defined.

Like, you have my full support if you want to try suggesting:

  • "hey what if we didn't allow manually defined testgrids for any prowjobs we define" or
  • "hey if I added support to configurator for this one testgrid-config-only-field as an annotation we wouldn't need the testgrid configs"

Because then, if you could just use prowjob annotations, it'd be way faster/easier to run go test ./config/tests/jobs/..., plus you wouldn't have to worry about iterating over all testgrids, then all job configs (to make sure you didn't miss something that only showed up in one)...

But if you're not willing to tackle that to simplify the problem space, then...

The slowest-but-it-already-works way to do this is use bazel to run the tests. The first time takes a long time (if I've timed it, I've forgotten, but longer than 10min), you end up recompiling the world, and you get a tiny space heater for a while. But then it's faster and you can iterate relatively quickly.

There is a way you can use go to run the configurator to generate the testgrid proto, then run the testgrid tests pointed at that, but it's too easy to get tripped up and use stale data.

Things you could start with:

@k8s-ci-robot

@spiffxp:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted label Jul 27, 2021

spiffxp commented Oct 1, 2021

/kind cleanup

@k8s-ci-robot added the kind/cleanup label Oct 1, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 30, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 29, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot

@k8s-triage-robot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

CI Policy Improvements automation moved this from In Review to Closed Feb 28, 2022
sig-testing issues automation moved this from Help Wanted to Done Feb 28, 2022

ameukam commented Feb 28, 2022

/reopen
/lifecycle frozen

@k8s-ci-robot

@ameukam: Reopened this issue.

In response to this:

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this Feb 28, 2022
CI Policy Improvements automation moved this from Closed to Backlog Feb 28, 2022
sig-testing issues automation moved this from Done to Backlog Feb 28, 2022
@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/rotten label Feb 28, 2022

ameukam commented Feb 28, 2022

/milestone clear
