TRT-662: elevate risk if missing large group of tests #723
Conversation
pkg/api/job_runs.go
Outdated
```go
// Pre-load test bugs as well:
if len(jobRun.Tests) <= maxFailuresToFullyAnalyze {
    for i, tr := range jobRun.Tests {
        bugs, err := query.LoadBugsForTest(dbc, tr.Test.Name, true)
        if err != nil {
            return apitype.ProwJobRunRiskAnalysis{}, err
            // SELECT * FROM "tests" WHERE name = '[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]' ORDER BY "tests"."id" LIMIT 1;
            // causes an error here, don't think we need to fail due to missing bugs - error(*errors.errorString) *{s: "record not found"}
```
I can remove the commented queries, but I'm curious about them and wanted to document / follow up a little.
Interesting. That test is explicitly not imported into sippy because it's a meta-test that just encompasses the results of other tests. I just hit problems with it myself yesterday and started explicitly ignoring it in risk analysis as well: #721
Log and continue here seems good though, and I don't see any reason to elevate risk because inability to find bugs doesn't make the result any more risky.
pkg/db/query/job_queries.go
Outdated
```go
if len(jobRuns) == 0 {
    // select * from prow_job_runs where prow_job_id = 1805 and succeeded = true limit 10;
    // has no 'succeeded = true' runs
```
Same here, interesting to have a job that never passes?
Depressingly frequent. :)
I think rather than falling back to failed runs, which may be blocked on installs and thus run little to no tests, we may as well just skip the check. If the job is permanently broken, this check isn't likely to uncover much.
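The skip-the-check decision can be sketched as below. The function name `historicalTestCount` and the "0 means skip" convention are assumptions for this sketch, not sippy's actual code.

```go
package main

import "fmt"

// historicalTestCount returns the average test count across recent succeeded
// runs, or 0 (meaning "skip the check") when there are none. Hypothetical
// sketch; rather than falling back to failed runs, a perma-failing job simply
// yields no baseline to compare against.
func historicalTestCount(succeededRunTestCounts []int) int {
	if len(succeededRunTestCounts) == 0 {
		// no succeeded runs: no useful baseline, skip the elevation check
		return 0
	}
	total := 0
	for _, c := range succeededRunTestCounts {
		total += c
	}
	return total / len(succeededRunTestCounts)
}

func main() {
	fmt.Println(historicalTestCount(nil))              // 0 -> check skipped
	fmt.Println(historicalTestCount([]int{900, 1100})) // 1000
}
```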
DEVELOPMENT.md
Outdated
```bash
  --database-dsn="postgresql://postgres:password@localhost:5432/sippy_openshift" \
  --mode=ocp
```
This doesn't seem right. I've always loaded DBs using the normal default name of "postgres" for the database, and without creating those user accounts (you will see some warnings, but I believe that's fine).

```bash
pg_restore -d postgres -h localhost -p 5432 -U postgres -W filename
```
Is this what you were using?
I used what Dennis has above
```bash
pg_restore -h localhost -U postgres -p 5432 --verbose -Fc -C -d postgres ./sippy-backup-2022-10-20.dump
```

It works without adding the users/roles but generates errors. The DB goes to sippy_openshift, so when my debugging stopped working (before I updated the DSN) I went chasing the errors I saw, and ultimately had to update the DSN.
Let me know if there are changes you want here
It's the -C causing that behavior: "When this option is used, the database named with -d is used only to issue the initial DROP DATABASE and CREATE DATABASE commands. All data is restored into the database name that appears in the archive."

Let's just drop the -C from above, and then we don't need to document a forked path.
pkg/api/job_runs.go
Outdated
```diff
 // NOTE: we are including bugs for all releases, may want to filter here in future to just those
 // with an AffectsVersions that seems to match our compareRelease?
 jobBugs, err := query.LoadBugsForJobs(dbc, []int{int(jobRun.ProwJob.ID)}, true)
 if err != nil {
-	return apitype.ProwJobRunRiskAnalysis{}, err
+	log.Errorf("Error evaluating bugs for prow job: %d", jobRun.ProwJob.ID)
```
```diff
-log.Errorf("Error evaluating bugs for prow job: %d", jobRun.ProwJob.ID)
+log.WithError(err).Errorf("Error evaluating bugs for prow job: %d", jobRun.ProwJob.ID)
```
pkg/db/query/job_queries.go
Outdated
```go
    t += len(jobRun.Tests)
}

t /= len(jobRuns)
```
/= ? TIL!
heh, I only learned since I got a lint nag
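For reference, `t /= n` is shorthand for `t = t / n`, and with integer operands the division truncates toward zero. A small runnable sketch (the data is made up):

```go
package main

import "fmt"

// avgTests mirrors the averaging above: sum the per-run test counts, then
// divide in place with /=. Integer division truncates toward zero.
func avgTests(counts []int) int {
	t := 0
	for _, c := range counts {
		t += c
	}
	t /= len(counts)
	return t
}

func main() {
	fmt.Println(avgTests([]int{950, 1000, 1051})) // 1000 (3001/3 truncates)
}
```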
Getting the avg number of tests would be faster to do in SQL, e.g. over the last 7 days. Example:
```sql
SELECT
    avg(count)
FROM (
    SELECT
        count(*)
    FROM
        prow_job_run_tests
        INNER JOIN prow_job_runs ON prow_job_runs.id = prow_job_run_tests.prow_job_run_id
        INNER JOIN prow_jobs ON prow_jobs.id = prow_job_runs.prow_job_id
    WHERE
        prow_jobs.name = 'periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-upgrade'
        AND prow_job_runs.timestamp >= CURRENT_DATE - interval '7' day
    GROUP BY
        prow_job_run_id) t;
```
All that new knowledge lost to more efficient SQL queries... I was wondering about raw query performance vs. gorm. Thanks for the pointer.
pkg/db/query/job_queries.go
Outdated
```go
var jobRuns []models.ProwJobRun
res := dbc.DB.Joins("ProwJob").
    Preload("Tests").Order("updated_at desc").Limit(10).Find(&jobRuns, "prow_job_id = ? AND succeeded = true", prowJobID)
```
Nit: suggest using "timestamp" instead of updated_at; it's the time the job ran rather than when it went into the db.

Should we filter here on timestamp > now - 2 weeks? Wondering if there are concerns with finding successes from, say, 3 months back.
pkg/sippyserver/server.go
Outdated
```go
// select id, count(*) from prow_job_run_tests where prow_job_run_id = jobRunID
jobRunTestCount, err = query.JobRunTestCount(s.db, jobRunID)
if err != nil {
    log.Errorf("Error getting job run test count %v", err)
```
```diff
-log.Errorf("Error getting job run test count %v", err)
+log.WithError(err).Error("Error getting job run test count")
```
This one's a nit if you've got the error in the message anyhow, but structured logging is still good for us to keep in mind.
pkg/sippyserver/server.go
Outdated
```go
// if we had an error we will continue the risk analysis and not elevate based on test counts
if err != nil {
    log.Errorf("Error comparing historical job run test count %v", err)
```
Nit: WithError
pkg/api/job_runs.go
Outdated
```diff
@@ -171,7 +204,7 @@ func JobRunRiskAnalysis(dbc *db.DB, jobRun *models.ProwJobRun) (apitype.ProwJobR
 // testResultsFunc is used for injecting db responses in unit tests.
 type testResultsFunc func(testName string, release, suite string, variants []string) (*apitype.Test, error)

-func runJobRunAnalysis(jobRun *models.ProwJobRun, compareRelease string,
+func runJobRunAnalysis(jobRun *models.ProwJobRun, compareRelease string, jobRunTestCount int, historicalRunTestCount int,
 	testResultsFunc testResultsFunc) (apitype.ProwJobRunRiskAnalysis, error) {

 	logger := log.WithFields(log.Fields{
```
Continuing with my theme of nitpicking logging, could you do us a favor and move this contextual logger up into the parent function, then pass it to runJobRunAnalysis as an argument? (Maybe drop or adjust the "func" field per your preference.) Then you can use it for all the logging in the function above, and we'll consistently know which job run the problems are coming from. It could then be passed to any other function we log from, to make sure we know what we're looking at when we see those messages.
I think I should have done it this way originally.
pkg/db/query/job_queries.go
Outdated
```go
// has no 'succeeded = true' runs
// if we didn't get any then open it up to just the most recent runs
res = dbc.DB.Joins("ProwJob").
    Preload("Tests").Order("updated_at desc").Limit(10).Find(&jobRuns, "prow_job_id = ?", prowJobID)
```
Is 10 enough? If you use the SQL approach below, doing 100 or even 1K should be feasible. I'd be worried something breaks, resulting in us not running a lot of tests; once that occurs on 10 runs we lose the signal really quickly.
Force-pushed from cfd3553 to 019dc34.
DEVELOPMENT.md
Outdated
```bash
echo "CREATE USER sippyro;" | psql postgresql://postgres:password@localhost:5432/postgres
echo "CREATE USER rdsadmin;" | psql postgresql://postgres:password@localhost:5432/postgres
```
I'm torn here; I guess the errors do alarm people, but we also probably shouldn't publish the account names we use. Maybe just a note that you can ignore errors about accounts that do not exist?
Force-pushed from 019dc34 to 19493ec.
@neisw: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dgoodwin, neisw. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Analyze current job test counts against historical counts
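The comparison at the heart of this PR can be sketched as below. The function name and the 75% threshold are assumptions for illustration, not sippy's actual values; the real code also respects the skip-the-check behavior discussed above when no historical baseline exists.

```go
package main

import "fmt"

// elevateRisk reports whether a run's test count is suspiciously far below
// the historical average for the job. Hypothetical sketch: the 75% threshold
// is made up. Integer math avoids float comparisons.
func elevateRisk(jobRunTestCount, historicalRunTestCount int) bool {
	if historicalRunTestCount == 0 {
		// no baseline (e.g. no recent succeeded runs): skip the check
		return false
	}
	// elevate when the run executed fewer than 75% of the usual test count
	return jobRunTestCount*100 < historicalRunTestCount*75
}

func main() {
	fmt.Println(elevateRisk(200, 1000)) // true: a large group of tests is missing
	fmt.Println(elevateRisk(980, 1000)) // false: close to the historical count
	fmt.Println(elevateRisk(500, 0))    // false: no baseline, check skipped
}
```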