OCPCRT-455: Add splay configuration to stagger verification job start times #737
stbenjam wants to merge 1 commit into openshift:main
Conversation
Add splay configuration to stagger verification job start times

Add a "splay" option to ReleaseConfig that spreads out verification job launches over a configurable random window. Each job gets a deterministic delay in [0, splay) based on an FNV32 hash of the payload tag and job name, reducing the blast radius of transient infrastructure issues like registry outages.

The splay gate is applied in ensureProwJobForReleaseTag(), the single function through which all ProwJob creation flows. Aggregator jobs are held until all their analysis jobs have been created, preserving the existing ordering guarantee. The controller re-queues itself every 30s while jobs are still deferred.

Example config: `"splay": "15m"`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
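For context, a minimal sketch of where the new option might sit in a release stream's JSON configuration. The surrounding fields and the job name are illustrative placeholders, not taken from this repository; only the `"splay": "15m"` form comes from the PR itself:

```json
{
  "name": "4.22.0-0.nightly",
  "splay": "15m",
  "verify": {
    "e2e-aws": {
      "prowJob": {"name": "example-periodic-e2e-aws"}
    }
  }
}
```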
@stbenjam: This pull request references OCPCRT-455, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough

This change introduces a deterministic splay feature that defers the creation of release-related jobs (analysis and Prow jobs) based on a configured duration. The splay delay is computed from a hash of the tag name and job name, ensuring consistent deferral across multiple invocations while spreading job creation across time.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: stbenjam. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
🧹 Nitpick comments (3)
cmd/release-controller/splay_test.go (2)
53-59: Consider strengthening the payload variation test.The test only logs when different payloads produce identical delays (a 1-in-900 chance with 15-minute splay). While hash collisions are expected occasionally, a stronger assertion across multiple payload pairs would provide better confidence:
♻️ Optional: Test with more samples
```diff
 t.Run("different payloads get different delays for same job", func(t *testing.T) {
-	d1 := splayDelay("4.22.0-0.nightly-2026-03-11-124441", "e2e-aws", splay)
-	d2 := splayDelay("4.22.0-0.nightly-2026-03-12-060000", "e2e-aws", splay)
-	if d1 == d2 {
-		t.Logf("note: same delay for different payloads (possible but unlikely): %v", d1)
+	delays := make(map[time.Duration]int)
+	for i := range 10 {
+		tag := fmt.Sprintf("4.22.0-0.nightly-2026-03-%02d-124441", 11+i)
+		delays[splayDelay(tag, "e2e-aws", splay)]++
+	}
+	if len(delays) < 2 {
+		t.Errorf("expected varied delays across 10 payloads, got %d distinct values", len(delays))
 	}
 })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/release-controller/splay_test.go` around lines 53 - 59, The test "different payloads get different delays for same job" currently only logs when two payloads collide; update it to assert across multiple payload pairs using the splayDelay function to reduce flakiness: generate a short list of distinct payload strings (e.g., several different timestamps or suffixes), compute delays via splayDelay for each payload with the same job ("e2e-aws") and splay value, and then assert that not all computed delays are identical (or assert that at least one pair differs); if all delays are equal call t.Fatalf with a descriptive message. Ensure you reference the splayDelay function and the test name while adding the loop/collection and the assertion.
27-36: Minor: Job name generation produces unexpected characters beyond the initial range.

For `i >= 26`, `string(rune('a'+i))` produces non-letter characters (e.g., `{`, `|`). While this doesn't affect test validity (the hash still works), using `fmt.Sprintf("job-%d", i)` would be clearer:

♻️ Suggested improvement

```diff
 t.Run("within bounds", func(t *testing.T) {
 	for i := range 100 {
 		tag := "4.22.0-0.nightly-2026-03-11-124441"
-		job := "job-" + string(rune('a'+i))
+		job := fmt.Sprintf("job-%d", i)
 		d := splayDelay(tag, job, splay)
 		if d < 0 || d >= splay {
 			t.Errorf("delay %v out of range [0, %v) for job %s", d, splay, job)
 		}
 	}
 })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/release-controller/splay_test.go` around lines 27 - 36: the test's job name generation in the "within bounds" subtest uses string(rune('a'+i)), which yields non-letter characters when i >= 26; update the loop to produce deterministic, readable names (e.g., fmt.Sprintf("job-%d", i)) for the job variable passed to splayDelay so job names remain simple and predictable.

cmd/release-controller/sync_verify_prow.go (1)
65-75: Silent failure when timestamp parsing fails may hide configuration issues.

When `releaseTag.Annotations[ReleaseAnnotationCreationTimestamp]` is missing or malformed, the splay logic is silently skipped and the job is created immediately. While this is fail-safe behavior, it could mask configuration problems. Consider logging when parsing fails so operators can detect issues:
🔧 Suggested improvement
```diff
 if splay > 0 {
 	tagCreated, parseErr := time.Parse(time.RFC3339, releaseTag.Annotations[releasecontroller.ReleaseAnnotationCreationTimestamp])
 	if parseErr == nil {
 		h := fnv.New32a()
 		h.Write([]byte(releaseTag.Name + "/" + prowJobName))
 		delay := time.Duration(h.Sum32()%uint32(splay.Seconds())) * time.Second
 		remaining := delay - time.Since(tagCreated)
 		if remaining > 0 {
 			klog.V(4).Infof("Splay: deferring job %s for %s (delay %s, tag created %s)", prowJobName, remaining.Truncate(time.Second), delay.Truncate(time.Second), tagCreated.Format(time.RFC3339))
 			return nil, nil
 		}
+	} else {
+		klog.V(4).Infof("Splay: skipping delay for %s (failed to parse creation timestamp: %v)", prowJobName, parseErr)
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/release-controller/sync_verify_prow.go` around lines 65 - 75, The splay skip silently hides malformed/missing timestamps: in the block using releaseTag.Annotations[releasecontroller.ReleaseAnnotationCreationTimestamp] where you parse with time.Parse into tagCreated (and currently ignore parseErr), add a klog warning or info (including prowJobName, releaseTag.Name and the raw annotation value) when parseErr != nil so operators see malformed/missing timestamps; keep the existing behavior of creating the job on parse failure but emit the log for visibility and reference splay and tagCreated in the message for context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9bab8503-8c0a-46e8-846f-02ad48fc7eb7
📒 Files selected for processing (6)
- cmd/release-controller/splay_test.go
- cmd/release-controller/sync_analysis.go
- cmd/release-controller/sync_upgrade.go
- cmd/release-controller/sync_verify.go
- cmd/release-controller/sync_verify_prow.go
- pkg/release-controller/types.go
@stbenjam: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Just a crazy idea. I have no way to really test this; I need to ask Brad at some point and think about whether this is the right way to do it.

/hold
Summary
- Adds a `splay` option to `ReleaseConfig` (e.g., `"splay": "15m"`) that spreads out verification job launches over a random window to reduce the blast radius of transient infrastructure issues like registry outages
- Each job gets a deterministic delay in `[0, splay)` via an FNV32 hash of the payload tag + job name, so there is no state to track

Motivation
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-11-124441 started all 30 aggregation jobs simultaneously, and several hit the same registry outage. With a 15-minute splay, fewer jobs would have been affected.
Implementation Details
- The splay gate is applied in `ensureProwJobForReleaseTag()`, the single function through which all ProwJob creation flows (verification, analysis, aggregator, and upgrade jobs)

Jira
https://issues.redhat.com/browse/OCPCRT-455
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests