
OCPCRT-455: Add splay configuration to stagger verification job start times#737

Open
stbenjam wants to merge 1 commit into openshift:main from stbenjam:splay

Conversation

@stbenjam
Member

@stbenjam stbenjam commented Mar 11, 2026

Summary

  • Adds a splay option to ReleaseConfig (e.g., "splay": "15m") that spreads verification job launches across a configurable window to reduce the blast radius of transient infrastructure issues like registry outages
  • Each job gets a deterministic delay in [0, splay) via FNV32 hash of the payload tag + job name — no state to track
  • Aggregator jobs are held until all their analysis jobs are created, preserving the existing ordering guarantee
  • When splay is unset, behavior is unchanged

Motivation

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-11-124441 started all 30 aggregation jobs simultaneously, and several hit the same registry outage. With a 15-minute splay, fewer jobs would have been affected.

Implementation Details

  • The splay gate is applied in ensureProwJobForReleaseTag(), the single function through which all ProwJob creation flows (verification, analysis, aggregator, and upgrade jobs)
  • Delay is deterministic per (payload, job) pair using FNV32 hash — same delay on every re-sync, no state to track
  • Aggregator jobs are held until all their analysis jobs have been created, preserving the existing ordering guarantee
  • The controller re-queues itself every 30 seconds while jobs are still deferred
  • When splay is unset (zero), behavior is unchanged — all jobs launch immediately

Jira

https://issues.redhat.com/browse/OCPCRT-455

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a Splay configuration option to enable deterministic delays for job creation, distributing release job launches over a configurable time window to reduce scheduling load spikes.
  • Tests

    • Added comprehensive test suite validating splay delay computation for determinism and proper distribution across jobs.

OCPCRT-455: Add splay configuration to stagger verification job start times

Add a "splay" option to ReleaseConfig that spreads out verification job
launches over a configurable random window. Each job gets a deterministic
delay in [0, splay) based on an FNV32 hash of the payload tag and job
name, reducing the blast radius of transient infrastructure issues like
registry outages.

The splay gate is applied in ensureProwJobForReleaseTag(), the single
function through which all ProwJob creation flows. Aggregator jobs are
held until all their analysis jobs have been created, preserving the
existing ordering guarantee. The controller re-queues itself every 30s
while jobs are still deferred.

Example config: "splay": "15m"
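In a release stream's ReleaseConfig JSON, the new field would sit alongside the existing top-level keys. A minimal hedged sketch (the "name" value is illustrative, not taken from any real stream config):

```json
{
  "name": "4.22.0-0.nightly",
  "splay": "15m"
}
```

The value is presumably parsed as a Go-style duration string (the field is typed utils.Duration per this PR), so forms like "30m" or "1h" should also be accepted.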

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 11, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 11, 2026

@stbenjam: This pull request references OCPCRT-455 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Mar 11, 2026

Walkthrough

This change introduces a deterministic splay feature that defers the creation of release-related jobs (analysis and Prow jobs) based on a configured duration. The splay delay is computed from a hash of the tag name and job name, ensuring consistent deferral across multiple invocations while spreading job creation across time.

Changes

  • Splay Configuration (pkg/release-controller/types.go): Added a Splay field (type utils.Duration) to the ReleaseConfig struct with JSON serialization support.
  • Splay Implementation (cmd/release-controller/sync_verify_prow.go): Integrated deterministic splay logic that computes a delay via an FNV-1a hash of the tag name and job name, then defers ProwJob creation while the delay is still pending relative to the tag creation timestamp.
  • Job Creation Return Signature (cmd/release-controller/sync_analysis.go): Changed the launchAnalysisJobs return type from error to (bool, error) to track whether all analysis jobs were successfully created or deferred.
  • Deferral Handling (cmd/release-controller/sync_verify.go): Updated call sites of launchAnalysisJobs to handle the new return tuple; defers the aggregator job launch and schedules a retry when analysis jobs are deferred by splay logic.
  • Documentation & Tests (cmd/release-controller/sync_upgrade.go, cmd/release-controller/splay_test.go): Added a clarifying comment on nil-job deferral behavior and introduced a TestSplayDelay suite validating determinism, bounds checking, and distribution of computed delays across different jobs and payloads.
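The (bool, error) contract and the aggregator hold can be sketched as below. This is a simplified, self-contained illustration: the function and variable names (syncOnce, requeue, deferredJobs) are hypothetical; only the (bool, error) return shape and the 30-second retry interval come from the PR itself.

```go
package main

import (
	"fmt"
	"time"
)

// launchAnalysisJobs is a stub standing in for the real function: it
// reports via its bool return whether every analysis job was actually
// created (true) or some were still splay-deferred (false).
func launchAnalysisJobs(deferredJobs int) (bool, error) {
	return deferredJobs == 0, nil
}

// syncOnce shows the caller pattern: hold the aggregator and re-queue
// while any analysis job is deferred; launch it once all are created.
func syncOnce(deferredJobs int, requeue func(time.Duration)) {
	allCreated, err := launchAnalysisJobs(deferredJobs)
	if err != nil {
		return
	}
	if !allCreated {
		// Preserve the ordering guarantee: the aggregator must not
		// start before its analysis jobs exist. Retry in 30 seconds.
		requeue(30 * time.Second)
		return
	}
	fmt.Println("launching aggregator")
}

func main() {
	syncOnce(2, func(d time.Duration) { fmt.Println("requeue after", d) })
	syncOnce(0, func(d time.Duration) {})
}
```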

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: 5 passed
  • Description Check: Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: Passed. The title clearly and specifically describes the main change: adding a splay configuration feature to stagger verification job start times, the central purpose of this PR.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate; docstring coverage check skipped.
  • Stable And Deterministic Test Names: Passed. The test file uses standard Go testing with static, descriptive test names containing no dynamic information.
  • Test Structure And Quality: Passed. splay_test.go follows the standard Go testing pattern used throughout the codebase, with single-responsibility subtests, meaningful assertion messages, and consistency with existing tests.


@openshift-ci openshift-ci bot requested review from AlexNPavel and hoxhaeris March 11, 2026 22:59
@openshift-ci
Contributor

openshift-ci bot commented Mar 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: stbenjam
Once this PR has been reviewed and has the lgtm label, please assign alexnpavel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
cmd/release-controller/splay_test.go (2)

53-59: Consider strengthening the payload variation test.

The test only logs when different payloads produce identical delays (a 1-in-900 chance with 15-minute splay). While hash collisions are expected occasionally, a stronger assertion across multiple payload pairs would provide better confidence:

♻️ Optional: Test with more samples
 	t.Run("different payloads get different delays for same job", func(t *testing.T) {
-		d1 := splayDelay("4.22.0-0.nightly-2026-03-11-124441", "e2e-aws", splay)
-		d2 := splayDelay("4.22.0-0.nightly-2026-03-12-060000", "e2e-aws", splay)
-		if d1 == d2 {
-			t.Logf("note: same delay for different payloads (possible but unlikely): %v", d1)
+		delays := make(map[time.Duration]int)
+		for i := range 10 {
+			tag := fmt.Sprintf("4.22.0-0.nightly-2026-03-%02d-124441", 11+i)
+			delays[splayDelay(tag, "e2e-aws", splay)]++
+		}
+		if len(delays) < 2 {
+			t.Errorf("expected varied delays across 10 payloads, got %d distinct values", len(delays))
 		}
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/release-controller/splay_test.go` around lines 53 - 59, The test
"different payloads get different delays for same job" currently only logs when
two payloads collide; update it to assert across multiple payload pairs using
the splayDelay function to reduce flakiness: generate a short list of distinct
payload strings (e.g., several different timestamps or suffixes), compute delays
via splayDelay for each payload with the same job ("e2e-aws") and splay value,
and then assert that not all computed delays are identical (or assert that at
least one pair differs); if all delays are equal call t.Fatalf with a
descriptive message. Ensure you reference the splayDelay function and the test
name while adding the loop/collection and the assertion.

27-36: Minor: Job name generation produces unexpected characters beyond initial range.

For i >= 26, string(rune('a'+i)) produces non-letter characters (e.g., {, |). While this doesn't affect test validity (the hash still works), using fmt.Sprintf("job-%d", i) would be clearer:

♻️ Suggested improvement
 	t.Run("within bounds", func(t *testing.T) {
 		for i := range 100 {
 			tag := "4.22.0-0.nightly-2026-03-11-124441"
-			job := "job-" + string(rune('a'+i))
+			job := fmt.Sprintf("job-%d", i)
 			d := splayDelay(tag, job, splay)
 			if d < 0 || d >= splay {
 				t.Errorf("delay %v out of range [0, %v) for job %s", d, splay, job)
 			}
 		}
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/release-controller/splay_test.go` around lines 27 - 36, The test's job
name generation in the "within bounds" subtest uses string(rune('a'+i)) which
yields non-letter characters when i>=26; update the loop to produce
deterministic, readable names (e.g., use fmt.Sprintf("job-%d", i) or similar)
when creating the job variable used by splayDelay so job names remain simple and
predictable; modify the test in the same t.Run block where splayDelay(tag, job,
splay) is called to replace string(rune('a'+i)) with a formatted numeric name.
cmd/release-controller/sync_verify_prow.go (1)

65-75: Silent failure when timestamp parsing fails may hide configuration issues.

When releaseTag.Annotations[ReleaseAnnotationCreationTimestamp] is missing or malformed, the splay logic is silently skipped and the job is created immediately. While this is fail-safe behavior, it could mask configuration problems.

Consider logging when parsing fails so operators can detect issues:

🔧 Suggested improvement
 	if splay > 0 {
 		tagCreated, parseErr := time.Parse(time.RFC3339, releaseTag.Annotations[releasecontroller.ReleaseAnnotationCreationTimestamp])
 		if parseErr == nil {
 			h := fnv.New32a()
 			h.Write([]byte(releaseTag.Name + "/" + prowJobName))
 			delay := time.Duration(h.Sum32()%uint32(splay.Seconds())) * time.Second
 			remaining := delay - time.Since(tagCreated)
 			if remaining > 0 {
 				klog.V(4).Infof("Splay: deferring job %s for %s (delay %s, tag created %s)", prowJobName, remaining.Truncate(time.Second), delay.Truncate(time.Second), tagCreated.Format(time.RFC3339))
 				return nil, nil
 			}
+		} else {
+			klog.V(4).Infof("Splay: skipping delay for %s (failed to parse creation timestamp: %v)", prowJobName, parseErr)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/release-controller/sync_verify_prow.go` around lines 65 - 75, The splay
skip silently hides malformed/missing timestamps: in the block using
releaseTag.Annotations[releasecontroller.ReleaseAnnotationCreationTimestamp]
where you parse with time.Parse into tagCreated (and currently ignore parseErr),
add a klog warning or info (including prowJobName, releaseTag.Name and the raw
annotation value) when parseErr != nil so operators see malformed/missing
timestamps; keep the existing behavior of creating the job on parse failure but
emit the log for visibility and reference splay and tagCreated in the message
for context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9bab8503-8c0a-46e8-846f-02ad48fc7eb7

📥 Commits

Reviewing files that changed from the base of the PR and between 893f340 and 7d099e3.

📒 Files selected for processing (6)
  • cmd/release-controller/splay_test.go
  • cmd/release-controller/sync_analysis.go
  • cmd/release-controller/sync_upgrade.go
  • cmd/release-controller/sync_verify.go
  • cmd/release-controller/sync_verify_prow.go
  • pkg/release-controller/types.go

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 11, 2026

@stbenjam: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@stbenjam
Member Author

Just a crazy idea. I have no way to really test this; I need to ask Brad at some point and think about whether this is the right way to do it.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 12, 2026

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
