TRT-662: elevate risk if missing large group of tests #723
Conversation
pkg/api/job_runs.go
Outdated
```go
// Pre-load test bugs as well:
if len(jobRun.Tests) <= maxFailuresToFullyAnalyze {
    for i, tr := range jobRun.Tests {
        bugs, err := query.LoadBugsForTest(dbc, tr.Test.Name, true)
        if err != nil {
            return apitype.ProwJobRunRiskAnalysis{}, err
            // SELECT * FROM "tests" WHERE name = '[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]' ORDER BY "tests"."id" LIMIT 1;
            // causes an error here, don't think we need to fail due to missing bugs - error(*errors.errorString) *{s: "record not found"}
```
I can remove the commented queries, but I'm curious about them and wanted to document / follow up a little.
Interesting. That test is explicitly not imported into sippy because it's a meta-test that just encompasses the results of other tests. I just hit problems with it myself yesterday and started explicitly ignoring it in risk analysis as well: #721
Log and continue here seems good though, and I don't see any reason to elevate risk because inability to find bugs doesn't make the result any more risky.
pkg/db/query/job_queries.go
Outdated
```go
if len(jobRuns) == 0 {
    // select * from prow_job_runs where prow_job_id = 1805 and succeeded = true limit 10;
    // has no 'succeeded = true' runs
```
Same here, interesting to have a job that never passes?
Depressingly frequent. :)
I think rather than falling back to failed runs, which may be blocked on installs and thus run little to no tests, we may as well just skip the check. If the job is permanently broken, this check isn't likely to uncover much.
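The skip-the-check decision can be sketched as below. The function name `historicalTestCount` and the "0 means skip" convention are assumptions for this sketch, not sippy's actual code.

```go
package main

import "fmt"

// historicalTestCount returns the average test count across recent succeeded
// runs, or 0 (meaning "skip the check") when there are none. Hypothetical
// sketch; rather than falling back to failed runs, a perma-failing job simply
// yields no baseline to compare against.
func historicalTestCount(succeededRunTestCounts []int) int {
	if len(succeededRunTestCounts) == 0 {
		// no succeeded runs: no useful baseline, skip the elevation check
		return 0
	}
	total := 0
	for _, c := range succeededRunTestCounts {
		total += c
	}
	return total / len(succeededRunTestCounts)
}

func main() {
	fmt.Println(historicalTestCount(nil))              // 0 -> check skipped
	fmt.Println(historicalTestCount([]int{900, 1100})) // 1000
}
```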
DEVELOPMENT.md
Outdated
```bash
  --database-dsn="postgresql://postgres:password@localhost:5432/sippy_openshift" \
  --mode=ocp
```
This doesn't seem right. I've always loaded DBs using the normal default name of "postgres" for the database, and without creating those user accounts (you will see some warnings, but I believe that's fine).

```bash
pg_restore -d postgres -h localhost -p 5432 -U postgres -W filename
```
Is this what you were using?
I used what Dennis has above
```bash
pg_restore -h localhost -U postgres -p 5432 --verbose -Fc -C -d postgres ./sippy-backup-2022-10-20.dump
```

It works without adding the users/roles but generates errors. The DB goes to sippy_openshift, so when my debugging stopped working (before I updated the DSN) I went chasing the errors I saw, and ultimately had to update the DSN.
Let me know if there are changes you want here
It's the -C causing that behavior: "When this option is used, the database named with -d is used only to issue the initial DROP DATABASE and CREATE DATABASE commands. All data is restored into the database name that appears in the archive."

Let's just drop the -C from above, and then we don't need to document a forked path.
pkg/api/job_runs.go
Outdated
```diff
 // NOTE: we are including bugs for all releases, may want to filter here in future to just those
 // with an AffectsVersions that seems to match our compareRelease?
 jobBugs, err := query.LoadBugsForJobs(dbc, []int{int(jobRun.ProwJob.ID)}, true)
 if err != nil {
-	return apitype.ProwJobRunRiskAnalysis{}, err
+	log.Errorf("Error evaluating bugs for prow job: %d", jobRun.ProwJob.ID)
```
```diff
-log.Errorf("Error evaluating bugs for prow job: %d", jobRun.ProwJob.ID)
+log.WithError(err).Errorf("Error evaluating bugs for prow job: %d", jobRun.ProwJob.ID)
```
pkg/db/query/job_queries.go
Outdated
```go
    t += len(jobRun.Tests)
}

t /= len(jobRuns)
```
/= ? TIL!
heh, I only learned since I got a lint nag
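For reference, `t /= n` is shorthand for `t = t / n`, and with integer operands the division truncates toward zero. A small runnable sketch (the data is made up):

```go
package main

import "fmt"

// avgTests mirrors the averaging above: sum the per-run test counts, then
// divide in place with /=. Integer division truncates toward zero.
func avgTests(counts []int) int {
	t := 0
	for _, c := range counts {
		t += c
	}
	t /= len(counts)
	return t
}

func main() {
	fmt.Println(avgTests([]int{950, 1000, 1051})) // 1000 (3001/3 truncates)
}
```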
Getting the avg number of tests would be faster to do in SQL, e.g. over the last 7 days. Example:
```sql
SELECT
    avg(count)
FROM (
    SELECT
        count(*)
    FROM
        prow_job_run_tests
        INNER JOIN prow_job_runs ON prow_job_runs.id = prow_job_run_tests.prow_job_run_id
        INNER JOIN prow_jobs ON prow_jobs.id = prow_job_runs.prow_job_id
    WHERE
        prow_jobs.name = 'periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-upgrade'
        AND prow_job_runs.timestamp >= CURRENT_DATE - interval '7' day
    GROUP BY
        prow_job_run_id) t;
```
All that new knowledge lost to more efficient SQL queries... I was wondering about raw query performance vs. gorm. Thanks for the pointer.
pkg/db/query/job_queries.go
Outdated
```go
var jobRuns []models.ProwJobRun
res := dbc.DB.Joins("ProwJob").
    Preload("Tests").Order("updated_at desc").Limit(10).Find(&jobRuns, "prow_job_id = ? AND succeeded = true", prowJobID)
```
Nit: suggest using "timestamp" instead of updated_at; it's the time the job ran rather than when it went into the db.

Should we filter here on timestamp > now - 2 weeks? Wondering if there are concerns with finding successes from, say, 3 months back.
pkg/sippyserver/server.go
Outdated
```go
// select id, count(*) from prow_job_run_tests where prow_job_run_id = jobRunID
jobRunTestCount, err = query.JobRunTestCount(s.db, jobRunID)
if err != nil {
    log.Errorf("Error getting job run test count %v", err)
```
```diff
-log.Errorf("Error getting job run test count %v", err)
+log.WithError(err).Error("Error getting job run test count")
```
This one's a nit if you've got the error in the message anyhow, but structured logging is still good for us to keep in mind.
pkg/sippyserver/server.go
Outdated
```go
// if we had an error we will continue the risk analysis and not elevate based on test counts
if err != nil {
    log.Errorf("Error comparing historical job run test count %v", err)
```
Nit: WithError
pkg/api/job_runs.go
Outdated
```diff
@@ -171,7 +204,7 @@ func JobRunRiskAnalysis(dbc *db.DB, jobRun *models.ProwJobRun) (apitype.ProwJobR
 // testResultsFunc is used for injecting db responses in unit tests.
 type testResultsFunc func(testName string, release, suite string, variants []string) (*apitype.Test, error)

-func runJobRunAnalysis(jobRun *models.ProwJobRun, compareRelease string,
+func runJobRunAnalysis(jobRun *models.ProwJobRun, compareRelease string, jobRunTestCount int, historicalRunTestCount int,
 	testResultsFunc testResultsFunc) (apitype.ProwJobRunRiskAnalysis, error) {

 	logger := log.WithFields(log.Fields{
```
Continuing with my theme of nitpicking logging, could you do us a favor and move this contextual logger up into the parent function, then pass it to runJobRunAnalysis as an argument? (Maybe drop or adjust the "func" field per your preference.) Then you can use it for all the logging in the function above, and we'll consistently know which job run the problems are coming from. It could then be passed to any other function we log from, to make sure we know what we're looking at when we see those messages.
I think I should have done it this way originally.
pkg/db/query/job_queries.go
Outdated
```go
// has no 'succeeded = true' runs
// if we didn't get any then open it up to just the most recent runs
res = dbc.DB.Joins("ProwJob").
    Preload("Tests").Order("updated_at desc").Limit(10).Find(&jobRuns, "prow_job_id = ?", prowJobID)
```
Is 10 enough? If you use the SQL approach below, doing 100 or even 1K should be feasible. I'd be worried something breaks, resulting in us not running a lot of tests; once that occurs on 10 runs we lose the signal really quickly.
Force-pushed from cfd3553 to 019dc34.
DEVELOPMENT.md
Outdated
```bash
echo "CREATE USER sippyro;" | psql postgresql://postgres:password@localhost:5432/postgres
echo "CREATE USER rdsadmin;" | psql postgresql://postgres:password@localhost:5432/postgres
```
I'm torn here; I guess the errors do alarm people, but we also probably shouldn't publish the account names we use. Maybe just a note that you can ignore errors about accounts that do not exist?
Force-pushed from 019dc34 to 19493ec.
@neisw: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dgoodwin, neisw. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Analyze current job test counts against historical counts
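The comparison at the heart of this PR can be sketched as below. The function name and the 75% threshold are assumptions for illustration, not sippy's actual values; the real code also respects the skip-the-check behavior discussed above when no historical baseline exists.

```go
package main

import "fmt"

// elevateRisk reports whether a run's test count is suspiciously far below
// the historical average for the job. Hypothetical sketch: the 75% threshold
// is made up. Integer math avoids float comparisons.
func elevateRisk(jobRunTestCount, historicalRunTestCount int) bool {
	if historicalRunTestCount == 0 {
		// no baseline (e.g. no recent succeeded runs): skip the check
		return false
	}
	// elevate when the run executed fewer than 75% of the usual test count
	return jobRunTestCount*100 < historicalRunTestCount*75
}

func main() {
	fmt.Println(elevateRisk(200, 1000)) // true: a large group of tests is missing
	fmt.Println(elevateRisk(980, 1000)) // false: close to the historical count
	fmt.Println(elevateRisk(500, 0))    // false: no baseline, check skipped
}
```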