TRT-662: elevate risk if missing large group of tests #723

Merged

Conversation

@neisw (Contributor) commented Dec 6, 2022

Analyze current job test counts against historical counts
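For orientation, the shape of the change (per the title and description) is: count the tests executed in the current run, compare against a historical count for the same job, and elevate the run's risk when a large group of tests is missing. A minimal sketch of that idea follows; the 75% threshold, the OverallRisk fields, and the risk-level constant are illustrative placeholders rather than the merged implementation (assumes fmt is imported):

```go
// Illustrative sketch only: threshold and names are placeholders for whatever
// the PR actually uses.
if historicalRunTestCount > 0 && jobRunTestCount < (historicalRunTestCount*75)/100 {
    response.OverallRisk.Level = apitype.FailureRiskLevelIncompleteTests // hypothetical constant
    response.OverallRisk.Reasons = append(response.OverallRisk.Reasons,
        fmt.Sprintf("Tests for this run (%d) are below the historical average (%d) for this job",
            jobRunTestCount, historicalRunTestCount))
}
```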


// Pre-load test bugs as well:
if len(jobRun.Tests) <= maxFailuresToFullyAnalyze {
    for i, tr := range jobRun.Tests {
        bugs, err := query.LoadBugsForTest(dbc, tr.Test.Name, true)
        if err != nil {
            return apitype.ProwJobRunRiskAnalysis{}, err
            // SELECT * FROM "tests" WHERE name = '[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]' ORDER BY "tests"."id" LIMIT 1;
            // causes an error here, don't think we need to fail due to missing bugs - error(*errors.errorString) *{s: "record not found"}
@neisw (author):

I can remove the commented queries, but I'm curious about them and wanted to document / follow up a little.

Contributor:

Interesting. That test is explicitly not imported into sippy because it's a meta-test that just encompasses the results of other tests. I just hit problems with it myself yesterday and started explicitly ignoring it in risk analysis as well: #721

Log and continue here seems good though, and I don't see any reason to elevate risk because inability to find bugs doesn't make the result any more risky.
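A minimal sketch of the log-and-continue shape in the loop quoted above (the log level and the way bugs get attached back onto the test are assumptions, not necessarily the merged code):

```go
bugs, err := query.LoadBugsForTest(dbc, tr.Test.Name, true)
if err != nil {
    // "record not found" just means sippy has no row/bugs for this test name;
    // log it and keep analyzing instead of failing the whole risk analysis.
    log.WithError(err).Warningf("error looking up bugs for test: %s", tr.Test.Name)
} else {
    jobRun.Tests[i].Test.Bugs = bugs // assumed field; attach however the surrounding code does
}
```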


if len(jobRuns) == 0 {
    // select * from prow_job_runs where prow_job_id = 1805 and succeeded = true limit 10;
    // has no 'succeeded = true' runs
@neisw (author):

Same here, interesting to have a job that never passes?

Contributor:

Depressingly frequent. :)

Contributor:

I think rather than falling back to failed runs, which may be blocked on installs and thus running little to no tests, we may as well just skip the check. If the job is perma-broken, this check isn't likely to uncover much.
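A sketch of the skip-the-check shape, assuming the helper returns a historical test count that the caller treats as "unknown" when zero (names and the sentinel value are illustrative, not the merged code):

```go
if len(jobRuns) == 0 {
    // Perma-failing jobs give us no useful baseline; return 0 so the caller
    // skips the historical test-count comparison instead of using failed runs.
    log.Warningf("no successful runs found for prow job %d, skipping historical test count check", prowJobID)
    return 0, nil
}
```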

DEVELOPMENT.md Outdated
```bash
  --database-dsn="postgresql://postgres:password@localhost:5432/sippy_openshift" \
  --mode=ocp
```

Contributor:

This doesn't seem right; I've always loaded DBs using the normal default name of "postgres" for the database, and without creating those user accounts (you will see some warnings, but I believe that's fine).

pg_restore -d postgres -h localhost -p 5432 -U postgres -W filename

Is this what you were using?

@neisw (author):

I used what Dennis has above

pg_restore -h localhost -U postgres -p 5432 --verbose -Fc -C -d postgres ./sippy-backup-2022-10-20.dump

It works without adding the users/roles, but it generates errors. The data ends up in sippy_openshift, so when my debugging stopped working (before I updated the DSN) I went chasing the errors I saw, and ultimately had to update the DSN.

@neisw (author):

Let me know if there are changes you want here

Contributor:

It's the -C causing that behavior: "When this option is used, the database named with -d is used only to issue the initial DROP DATABASE and CREATE DATABASE commands. All data is restored into the database name that appears in the archive."

Let's just drop the -C from above, and then we don't need to document a forked path.
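For reference, the restore command from earlier in the thread with the -C dropped would be (same dump file name as that example):

pg_restore -h localhost -U postgres -p 5432 --verbose -Fc -d postgres ./sippy-backup-2022-10-20.dump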

pkg/api/job_runs.go (resolved)
// NOTE: we are including bugs for all releases, may want to filter here in future to just those
// with an AffectsVersions that seems to match our compareRelease?
jobBugs, err := query.LoadBugsForJobs(dbc, []int{int(jobRun.ProwJob.ID)}, true)
if err != nil {
    return apitype.ProwJobRunRiskAnalysis{}, err
    log.Errorf("Error evaluating bugs for prow job: %d", jobRun.ProwJob.ID)
Contributor:

Suggested change:
- log.Errorf("Error evaluating bugs for prow job: %d", jobRun.ProwJob.ID)
+ log.WithError(err).Errorf("Error evaluating bugs for prow job: %d", jobRun.ProwJob.ID)


t += len(jobRun.Tests)
}

t /= len(jobRuns)
Contributor:

/= ? TIL!

@neisw (author):

heh, I only learned it because I got a lint nag

@stbenjam (Member) commented Dec 7, 2022:

Getting the avg number of tests would be faster to do in SQL, e.g. last 7 days

Example:

SELECT
    avg(count)
FROM (
    SELECT
        count(*)
    FROM
        prow_job_run_tests
        INNER JOIN prow_job_runs ON prow_job_runs.id = prow_job_run_tests.prow_job_run_id
        INNER JOIN prow_jobs ON prow_jobs.id = prow_job_runs.prow_job_id
    WHERE
        prow_jobs.name = 'periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-upgrade'
        AND prow_job_runs.timestamp >= CURRENT_DATE - interval '7' day
    GROUP BY
        prow_job_run_id) t;

@neisw (author):

All that new knowledge lost to more efficient SQL queries... I was wondering about raw query performance vs GORM. Thanks for the pointer.
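If the raw SQL route is taken, one way to run that aggregate through GORM is Raw/Scan. A rough sketch, filtering by prow_job_id (since the surrounding code already has prowJobID) rather than by job name as in the example above; NULL handling for a job with no recent runs is omitted, and this is not the merged code:

```go
var avgTestCount float64
rawSQL := `
SELECT avg(count)
FROM (
    SELECT count(*)
    FROM prow_job_run_tests
    INNER JOIN prow_job_runs ON prow_job_runs.id = prow_job_run_tests.prow_job_run_id
    WHERE prow_job_runs.prow_job_id = ?
      AND prow_job_runs.timestamp >= CURRENT_DATE - interval '7' day
    GROUP BY prow_job_run_id) t`
if err := dbc.DB.Raw(rawSQL, prowJobID).Scan(&avgTestCount).Error; err != nil {
    log.WithError(err).Errorf("error getting average test count for prow job: %d", prowJobID)
}
```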


var jobRuns []models.ProwJobRun
res := dbc.DB.Joins("ProwJob").
    Preload("Tests").Order("updated_at desc").Limit(10).Find(&jobRuns, "prow_job_id = ? AND succeeded = true", prowJobID)
Contributor:

Nit: suggest using "timestamp" instead of updated_at; timestamp is the time the job ran rather than when the row went into the db.

Should we filter here on timestamp > now - 2 weeks? Wondering if there are concerns with finding those successes like 3 months back.
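A sketch of that adjustment applied to the query quoted above (the two-week cutoff is the suggestion from this comment, not necessarily what was merged; assumes the standard time package is imported):

```go
cutoff := time.Now().Add(-14 * 24 * time.Hour)
res := dbc.DB.Joins("ProwJob").
    Preload("Tests").
    Order("timestamp desc").
    Limit(10).
    Find(&jobRuns, "prow_job_id = ? AND succeeded = true AND timestamp > ?", prowJobID, cutoff)
```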

// select id, count(*) from prow_job_run_tests where prow_job_run_id = jobRunID
jobRunTestCount, err = query.JobRunTestCount(s.db, jobRunID)
if err != nil {
    log.Errorf("Error getting job run test count %v", err)
Contributor:

Suggested change:
- log.Errorf("Error getting job run test count %v", err)
+ log.WithError(err).Error("Error getting job run test count")

This one's a nit if you've got the error in the message anyhow, but structured logging is still good for us to keep in mind.


// if we had an error we will continue the risk analysis and not elevate based on test counts
if err != nil {
    log.Errorf("Error comparing historical job run test count %v", err)
Contributor:

Nit: WithError

pkg/sippyserver/server.go (resolved)
@@ -171,7 +204,7 @@ func JobRunRiskAnalysis(dbc *db.DB, jobRun *models.ProwJobRun) (apitype.ProwJobR
 // testResultsFunc is used for injecting db responses in unit tests.
 type testResultsFunc func(testName string, release, suite string, variants []string) (*apitype.Test, error)

-func runJobRunAnalysis(jobRun *models.ProwJobRun, compareRelease string,
+func runJobRunAnalysis(jobRun *models.ProwJobRun, compareRelease string, jobRunTestCount int, historicalRunTestCount int,
     testResultsFunc testResultsFunc) (apitype.ProwJobRunRiskAnalysis, error) {

     logger := log.WithFields(log.Fields{
Contributor:

Continuing with my theme of nitpicking logging, could you do us a favor and move this contextual logger up into the parent function, then pass it to runJobRunAnalysis as an argument (maybe drop or adjust the "func" in each per your preference). Then you can use this for all the logging in the function above, and we'll consistently know what job run the problems are coming from. It could then be passed to any other function we log from to make sure we know what we're looking at when we see those messages.

I think I should have done it this way originally.
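A sketch of that refactor, assuming logrus is imported as log; the exact log field names are illustrative, not the merged code:

```go
// In the parent (JobRunRiskAnalysis): build the contextual logger once.
logger := log.WithFields(log.Fields{
    "jobRunID": jobRun.ID,          // illustrative fields
    "prowJob":  jobRun.ProwJob.Name,
})

// ...then pass it down so every message carries the job-run context:
func runJobRunAnalysis(jobRun *models.ProwJobRun, compareRelease string, jobRunTestCount int,
    historicalRunTestCount int, logger *log.Entry,
    testResultsFunc testResultsFunc) (apitype.ProwJobRunRiskAnalysis, error) {
```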

// has no 'succeeded = true' runs
// if we didn't get any then open it up to just the most recent runs
res = dbc.DB.Joins("ProwJob").
    Preload("Tests").Order("updated_at desc").Limit(10).Find(&jobRuns, "prow_job_id = ?", prowJobID)
@stbenjam (Member) commented Dec 7, 2022:

Is 10 enough? If you use the SQL approach above, doing 100 or even 1K should be feasible.

I'd be worried something breaks, resulting in us not running a lot of tests; once that occurs on 10 runs we lose the signal really quickly.

@neisw force-pushed the risk-elevation-missing-tests branch 4 times, most recently from cfd3553 to 019dc34 on December 7, 2022 at 21:32
DEVELOPMENT.md Outdated
```bash
echo "CREATE USER sippyro;" | psql postgresql://postgres:password@localhost:5432/postgres
echo "CREATE USER rdsadmin;" | psql postgresql://postgres:password@localhost:5432/postgres
```
@dgoodwin (Contributor) commented Dec 8, 2022:

I'm torn here; I guess the errors do alarm people, but we also possibly shouldn't publish the account names we use. Maybe just add a note that you can ignore errors about accounts that do not exist?

@neisw force-pushed the risk-elevation-missing-tests branch from 019dc34 to 19493ec on December 8, 2022 at 13:35
openshift-ci bot commented Dec 8, 2022:

@neisw: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dgoodwin (Contributor) commented Dec 8, 2022:

/lgtm

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Dec 8, 2022
openshift-ci bot commented Dec 8, 2022:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, neisw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Dec 8, 2022
openshift-merge-robot merged commit a7c887b into openshift:master on Dec 8, 2022