
OCPCRT-455: Add splay configuration to stagger verification job start times#737

Open
stbenjam wants to merge 1 commit into openshift:main from stbenjam:splay

Conversation

@stbenjam
Member

@stbenjam stbenjam commented Mar 11, 2026

Summary

  • Adds a splay option to ReleaseConfig (e.g., "splay": "15m") that spreads verification job launches across a configurable window to reduce the blast radius of transient infrastructure issues like registry outages
  • Each job gets a deterministic delay in [0, splay) via FNV32 hash of the payload tag + job name — no state to track
  • Aggregator jobs are held until all their analysis jobs are created, preserving the existing ordering guarantee
  • When splay is unset, behavior is unchanged

Motivation

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-11-124441 started all 30 aggregation jobs simultaneously, and several hit the same registry outage. With a 15-minute splay, fewer jobs would have been affected.

Implementation Details

  • The splay gate is applied in ensureProwJobForReleaseTag(), the single function through which all ProwJob creation flows (verification, analysis, aggregator, and upgrade jobs)
  • Delay is deterministic per (payload, job) pair using FNV32 hash — same delay on every re-sync, no state to track
  • Aggregator jobs are held until all their analysis jobs have been created, preserving the existing ordering guarantee
  • The controller re-queues itself every 30 seconds while jobs are still deferred
  • When splay is unset (zero), behavior is unchanged — all jobs launch immediately

Jira

https://issues.redhat.com/browse/OCPCRT-455

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a Splay configuration option to enable deterministic delays for job creation, distributing release job launches over a configurable time window to reduce scheduling load spikes.
  • Tests

    • Added comprehensive test suite validating splay delay computation for determinism and proper distribution across jobs.

OCPCRT-455: Add splay configuration to stagger verification job start times

Add a "splay" option to ReleaseConfig that spreads out verification job
launches over a configurable random window. Each job gets a deterministic
delay in [0, splay) based on an FNV32 hash of the payload tag and job
name, reducing the blast radius of transient infrastructure issues like
registry outages.

The splay gate is applied in ensureProwJobForReleaseTag(), the single
function through which all ProwJob creation flows. Aggregator jobs are
held until all their analysis jobs have been created, preserving the
existing ordering guarantee. The controller re-queues itself every 30s
while jobs are still deferred.

Example config: "splay": "15m"
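In a release stream's ReleaseConfig JSON, the new field would sit alongside the existing top-level keys. A minimal hedged sketch (the "name" value is illustrative, not taken from any real stream config):

```json
{
  "name": "4.22.0-0.nightly",
  "splay": "15m"
}
```

The value is presumably parsed as a Go-style duration string (the field is typed utils.Duration per this PR), so forms like "30m" or "1h" should also be accepted.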

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 11, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 11, 2026

@stbenjam: This pull request references OCPCRT-455 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Mar 11, 2026

Walkthrough

This change introduces a deterministic splay feature that defers the creation of release-related jobs (analysis and Prow jobs) based on a configured duration. The splay delay is computed from a hash of the tag name and job name, ensuring consistent deferral across multiple invocations while spreading job creation across time.

Changes

  • Splay Configuration (pkg/release-controller/types.go): Added a Splay field (type utils.Duration) to the ReleaseConfig struct with JSON serialization support.
  • Splay Implementation (cmd/release-controller/sync_verify_prow.go): Integrated deterministic splay logic that computes a delay via an FNV-1a hash of the tag name and job name, then defers ProwJob creation while the delay is still pending relative to the tag creation timestamp.
  • Job Creation Return Signature (cmd/release-controller/sync_analysis.go): Changed the launchAnalysisJobs return type from error to (bool, error) to track whether all analysis jobs were successfully created or deferred.
  • Deferral Handling (cmd/release-controller/sync_verify.go): Updated call sites of launchAnalysisJobs to handle the new return tuple; defers the aggregator job launch and schedules a retry when analysis jobs are deferred by splay logic.
  • Documentation & Tests (cmd/release-controller/sync_upgrade.go, cmd/release-controller/splay_test.go): Added a clarifying comment on nil-job deferral behavior and introduced a TestSplayDelay suite validating determinism, bounds checking, and distribution of computed delays across different jobs and payloads.
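The (bool, error) contract and the aggregator hold can be sketched as below. This is a simplified, self-contained illustration: the function and variable names (syncOnce, requeue, deferredJobs) are hypothetical; only the (bool, error) return shape and the 30-second retry interval come from the PR itself.

```go
package main

import (
	"fmt"
	"time"
)

// launchAnalysisJobs is a stub standing in for the real function: it
// reports via its bool return whether every analysis job was actually
// created (true) or some were still splay-deferred (false).
func launchAnalysisJobs(deferredJobs int) (bool, error) {
	return deferredJobs == 0, nil
}

// syncOnce shows the caller pattern: hold the aggregator and re-queue
// while any analysis job is deferred; launch it once all are created.
func syncOnce(deferredJobs int, requeue func(time.Duration)) {
	allCreated, err := launchAnalysisJobs(deferredJobs)
	if err != nil {
		return
	}
	if !allCreated {
		// Preserve the ordering guarantee: the aggregator must not
		// start before its analysis jobs exist. Retry in 30 seconds.
		requeue(30 * time.Second)
		return
	}
	fmt.Println("launching aggregator")
}

func main() {
	syncOnce(2, func(d time.Duration) { fmt.Println("requeue after", d) })
	syncOnce(0, func(d time.Duration) {})
}
```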

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: 5 passed
  • Description Check: Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: Passed. The title clearly and specifically describes the main change: adding a splay configuration feature to stagger verification job start times, the central purpose of this PR.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate; docstring coverage check skipped.
  • Stable And Deterministic Test Names: Passed. The test file uses standard Go testing with static, descriptive test names containing no dynamic information.
  • Test Structure And Quality: Passed. splay_test.go follows the standard Go testing pattern used throughout the codebase, with single-responsibility subtests, meaningful assertion messages, and consistency with existing tests.


@openshift-ci openshift-ci bot requested review from AlexNPavel and hoxhaeris March 11, 2026 22:59
@openshift-ci
Contributor

openshift-ci bot commented Mar 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: stbenjam
Once this PR has been reviewed and has the lgtm label, please assign alexnpavel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
cmd/release-controller/splay_test.go (2)

53-59: Consider strengthening the payload variation test.

The test only logs when different payloads produce identical delays (a 1-in-900 chance with 15-minute splay). While hash collisions are expected occasionally, a stronger assertion across multiple payload pairs would provide better confidence:

♻️ Optional: Test with more samples
 	t.Run("different payloads get different delays for same job", func(t *testing.T) {
-		d1 := splayDelay("4.22.0-0.nightly-2026-03-11-124441", "e2e-aws", splay)
-		d2 := splayDelay("4.22.0-0.nightly-2026-03-12-060000", "e2e-aws", splay)
-		if d1 == d2 {
-			t.Logf("note: same delay for different payloads (possible but unlikely): %v", d1)
+		delays := make(map[time.Duration]int)
+		for i := range 10 {
+			tag := fmt.Sprintf("4.22.0-0.nightly-2026-03-%02d-124441", 11+i)
+			delays[splayDelay(tag, "e2e-aws", splay)]++
+		}
+		if len(delays) < 2 {
+			t.Errorf("expected varied delays across 10 payloads, got %d distinct values", len(delays))
 		}
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/release-controller/splay_test.go` around lines 53 - 59, The test
"different payloads get different delays for same job" currently only logs when
two payloads collide; update it to assert across multiple payload pairs using
the splayDelay function to reduce flakiness: generate a short list of distinct
payload strings (e.g., several different timestamps or suffixes), compute delays
via splayDelay for each payload with the same job ("e2e-aws") and splay value,
and then assert that not all computed delays are identical (or assert that at
least one pair differs); if all delays are equal call t.Fatalf with a
descriptive message. Ensure you reference the splayDelay function and the test
name while adding the loop/collection and the assertion.

27-36: Minor: Job name generation produces unexpected characters beyond initial range.

For i >= 26, string(rune('a'+i)) produces non-letter characters (e.g., {, |). While this doesn't affect test validity (the hash still works), using fmt.Sprintf("job-%d", i) would be clearer:

♻️ Suggested improvement
 	t.Run("within bounds", func(t *testing.T) {
 		for i := range 100 {
 			tag := "4.22.0-0.nightly-2026-03-11-124441"
-			job := "job-" + string(rune('a'+i))
+			job := fmt.Sprintf("job-%d", i)
 			d := splayDelay(tag, job, splay)
 			if d < 0 || d >= splay {
 				t.Errorf("delay %v out of range [0, %v) for job %s", d, splay, job)
 			}
 		}
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/release-controller/splay_test.go` around lines 27 - 36, The test's job
name generation in the "within bounds" subtest uses string(rune('a'+i)) which
yields non-letter characters when i>=26; update the loop to produce
deterministic, readable names (e.g., use fmt.Sprintf("job-%d", i) or similar)
when creating the job variable used by splayDelay so job names remain simple and
predictable; modify the test in the same t.Run block where splayDelay(tag, job,
splay) is called to replace string(rune('a'+i)) with a formatted numeric name.
cmd/release-controller/sync_verify_prow.go (1)

65-75: Silent failure when timestamp parsing fails may hide configuration issues.

When releaseTag.Annotations[ReleaseAnnotationCreationTimestamp] is missing or malformed, the splay logic is silently skipped and the job is created immediately. While this is fail-safe behavior, it could mask configuration problems.

Consider logging when parsing fails so operators can detect issues:

🔧 Suggested improvement
 	if splay > 0 {
 		tagCreated, parseErr := time.Parse(time.RFC3339, releaseTag.Annotations[releasecontroller.ReleaseAnnotationCreationTimestamp])
 		if parseErr == nil {
 			h := fnv.New32a()
 			h.Write([]byte(releaseTag.Name + "/" + prowJobName))
 			delay := time.Duration(h.Sum32()%uint32(splay.Seconds())) * time.Second
 			remaining := delay - time.Since(tagCreated)
 			if remaining > 0 {
 				klog.V(4).Infof("Splay: deferring job %s for %s (delay %s, tag created %s)", prowJobName, remaining.Truncate(time.Second), delay.Truncate(time.Second), tagCreated.Format(time.RFC3339))
 				return nil, nil
 			}
+		} else {
+			klog.V(4).Infof("Splay: skipping delay for %s (failed to parse creation timestamp: %v)", prowJobName, parseErr)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/release-controller/sync_verify_prow.go` around lines 65 - 75, The splay
skip silently hides malformed/missing timestamps: in the block using
releaseTag.Annotations[releasecontroller.ReleaseAnnotationCreationTimestamp]
where you parse with time.Parse into tagCreated (and currently ignore parseErr),
add a klog warning or info (including prowJobName, releaseTag.Name and the raw
annotation value) when parseErr != nil so operators see malformed/missing
timestamps; keep the existing behavior of creating the job on parse failure but
emit the log for visibility and reference splay and tagCreated in the message
for context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9bab8503-8c0a-46e8-846f-02ad48fc7eb7

📥 Commits

Reviewing files that changed from the base of the PR and between 893f340 and 7d099e3.

📒 Files selected for processing (6)
  • cmd/release-controller/splay_test.go
  • cmd/release-controller/sync_analysis.go
  • cmd/release-controller/sync_upgrade.go
  • cmd/release-controller/sync_verify.go
  • cmd/release-controller/sync_verify_prow.go
  • pkg/release-controller/types.go

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 11, 2026

@stbenjam: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@stbenjam
Member Author

Just a crazy idea. I have no way to really test this; I need to ask Brad at some point and think about whether this is the right way to do it.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 12, 2026

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
