pj-rehearse: add concurrency control, changed-files prefilter, and drop notifications #5165

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from jmguzik:pj-reharse-change
May 7, 2026

Conversation

@jmguzik
Contributor

@jmguzik jmguzik commented May 7, 2026

pj-rehearse was vulnerable to memory spikes (50+ GB) under load because it handled an unlimited number of webhook requests concurrently, each loading full CI/Prow configurations.

This commit introduces three features to mitigate the problem:

  1. Handler dispatcher with bounded concurrency and queuing:

    • Configurable max concurrent handlers (default 5), max queued (50), queue timeout (5m), and execution timeout (15m) via CLI flags.
    • Requests that exceed the queue capacity or wait too long are dropped with a GitHub comment notifying the user.
    • Separate Prometheus metrics for in-flight handlers, queued requests, drops (by reason), and execution timeouts.
  2. Changed-files prefilter (implements DPTP-2888):

    • Before loading full configs, diffs the PR against the base branch to check if any changed files fall under ci-operator/config, ci-operator/jobs, core-services/prow/02_config, or ci-operator/step-registry.
    • If no rehearsal-relevant paths changed, skips DetermineAffectedJobs entirely, avoiding the expensive config load.
  3. Merged PR event handlers into a single registration:

    • handlePullRequestCreation and handleNewPush are now dispatched as one unit, preventing a single PR event from consuming two dispatcher slots.
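
The bounded dispatcher described in item 1 can be sketched with two buffered channels, one holding execution slots and one holding queue slots. This is a minimal illustration under stated assumptions, not the code merged in this PR: the `dispatcher` type, `newDispatcher`, and the string return value are hypothetical, and metrics, logging, the drop callback, and the execution timeout are left out.

```go
package main

import (
	"sync"
	"time"
)

// dispatcher is a hypothetical stand-in for the PR's handlerDispatcher.
type dispatcher struct {
	executionSlots chan struct{} // capacity = max concurrent handlers
	queueSlots     chan struct{} // capacity = max queued requests
	queueTimeout   time.Duration
	wg             sync.WaitGroup // lets callers wait for started handlers
}

func newDispatcher(maxConcurrent, maxQueued int, queueTimeout time.Duration) *dispatcher {
	return &dispatcher{
		executionSlots: make(chan struct{}, maxConcurrent),
		queueSlots:     make(chan struct{}, maxQueued),
		queueTimeout:   queueTimeout,
	}
}

// dispatch starts handler immediately if an execution slot is free, otherwise
// waits in the bounded queue. It returns "" when the handler was started, or
// the drop reason ("queue_full" or "queue_timeout") when it was not.
func (d *dispatcher) dispatch(handler func()) string {
	select {
	case d.executionSlots <- struct{}{}:
		d.run(handler)
		return ""
	default: // all execution slots are busy, try to queue
	}

	select {
	case d.queueSlots <- struct{}{}:
	default:
		return "queue_full"
	}

	timer := time.NewTimer(d.queueTimeout)
	defer timer.Stop()
	select {
	case d.executionSlots <- struct{}{}:
		<-d.queueSlots // leave the queue as soon as execution starts
		d.run(handler)
		return ""
	case <-timer.C:
		<-d.queueSlots
		return "queue_timeout"
	}
}

// run executes handler in its own goroutine and frees the execution slot
// when the handler returns.
func (d *dispatcher) run(handler func()) {
	d.wg.Add(1)
	go func() {
		defer d.wg.Done()
		defer func() { <-d.executionSlots }()
		handler()
	}()
}
```

With the defaults from the description, such a dispatcher would be built as `newDispatcher(5, 50, 5*time.Minute)`. A non-empty return value marks the point where the real code increments the drop metric and posts the GitHub comment.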

Co-authored-by Cursor

Summary by CodeRabbit

  • New Features

    • Webhook admission control with concurrency limits, bounded queueing, queue timeouts, metrics for queued/dropped requests (with drop reasons), execution-timeout tracking, and user-facing notifications when requests are dropped.
  • Improvements

    • Rehearsal analysis now prefilters changes and skips work when no relevant config paths changed; clearer messages when diffs cannot be determined.
  • Tests

    • Added unit and concurrency tests for dispatch/queue behavior, path-matching logic, and drop-notification content/timing.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai Bot commented May 7, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7174f289-b742-40cf-b014-17c8c344a8d5

📥 Commits

Reviewing files that changed from the base of the PR and between 9102c7d and 325d36e.

📒 Files selected for processing (3)
  • cmd/pj-rehearse/main.go
  • cmd/pj-rehearse/server.go
  • cmd/pj-rehearse/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/pj-rehearse/main.go

📝 Walkthrough

Walkthrough

Adds a bounded webhook handler dispatcher with Prometheus metrics, queuing, and drop notifications; and adds prefilter helpers to skip job-analysis when no rehearsal-relevant files changed, wiring those checks into issue-comment and affected-job flows.

Changes

Handler Admission Control & Request Drop Notifications

| Layer | File(s) | Summary |
|---|---|---|
| Metrics & Dispatcher Foundation | cmd/pj-rehearse/main.go | Adds Prometheus metrics: queuedHandlers gauge, droppedHandlerRequests counter (label reason), timedOutHandlers counter; registers metrics and introduces handlerDispatcher with execution and queue channels. |
| Dispatch Logic & Queueing | cmd/pj-rehearse/main.go | Implements dispatch, tryAcquireExecutionSlot, tryQueue, queue-timeout behavior, dequeue, and drop paths that increment drop metrics and invoke an optional drop callback. |
| Execution Accounting & Timeouts | cmd/pj-rehearse/main.go | Runs handlers in goroutines with inFlightHandlers gauge, panic recovery, and a separate timeout waiter that increments timedOutHandlers and logs a warning without cancelling the handler. |
| Configuration & Validation | cmd/pj-rehearse/main.go | Adds CLI flags: --max-concurrent-handlers, --max-queued-handlers, --queue-timeout-minutes, --handler-timeout-minutes; updates options.validate() to require values > 0. |
| Drop Notification Logic | cmd/pj-rehearse/main.go | Adds notifyDroppedRequest to post GitHub comments for dropped requests; messages differ by drop reason (queue_full / queue_timeout) and userTriggered flag. |
| Webhook Handler Integration | cmd/pj-rehearse/main.go | Routes pull-request and issue-comment webhook handling through dispatcher.dispatch; supplies per-event drop callbacks that call notifyDroppedRequest with appropriate userTriggered value. |
| Tests | cmd/pj-rehearse/server_test.go | Adds dispatcher tests: immediate execution, queue ordering, queue-full drop ("queue_full"), queue-timeout drop ("queue_timeout"); adds commentRecorder and tests for notifyDroppedRequest content for user-triggered and automatic drops. |
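
The configuration and validation row can be illustrated with the standard `flag` package. This is a sketch only: the flag names and defaults (5, 50, 5m, 15m) come from this PR, while the `options` layout, the integer-minutes types, the help strings, and the `parseOptions` helper are assumptions made for the example.

```go
package main

import (
	"errors"
	"flag"
)

// options mirrors the four limits named in the PR; field types are assumed.
type options struct {
	maxConcurrentHandlers int
	maxQueuedHandlers     int
	queueTimeoutMinutes   int
	handlerTimeoutMinutes int
}

// parseOptions is a hypothetical helper that binds, parses, and validates.
func parseOptions(args []string) (*options, error) {
	o := &options{}
	fs := flag.NewFlagSet("pj-rehearse", flag.ContinueOnError)
	fs.IntVar(&o.maxConcurrentHandlers, "max-concurrent-handlers", 5, "maximum number of webhook handlers running at once")
	fs.IntVar(&o.maxQueuedHandlers, "max-queued-handlers", 50, "maximum number of requests waiting for an execution slot")
	fs.IntVar(&o.queueTimeoutMinutes, "queue-timeout-minutes", 5, "minutes a request may wait in the queue before it is dropped")
	fs.IntVar(&o.handlerTimeoutMinutes, "handler-timeout-minutes", 15, "minutes after which a running handler is reported as timed out")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if err := o.validate(); err != nil {
		return nil, err
	}
	return o, nil
}

// validate rejects zero or negative limits, as the summary describes.
func (o *options) validate() error {
	checks := []struct {
		name  string
		value int
	}{
		{"max-concurrent-handlers", o.maxConcurrentHandlers},
		{"max-queued-handlers", o.maxQueuedHandlers},
		{"queue-timeout-minutes", o.queueTimeoutMinutes},
		{"handler-timeout-minutes", o.handlerTimeoutMinutes},
	}
	for _, c := range checks {
		if c.value <= 0 {
			return errors.New("--" + c.name + " must be greater than 0")
		}
	}
	return nil
}
```

Rejecting non-positive values up front matters because a zero-capacity channel would make every request either block or drop.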

Job Analysis Early-Skip Optimization

| Layer | File(s) | Summary |
|---|---|---|
| Path Relevance Helpers | cmd/pj-rehearse/server.go | Adds isRehearsalRelevantPath and hasPathPrefix to identify rehearsal-relevant paths (CI/operator config roots, job config roots including directory-derived roots, and optional registry paths). |
| Prefilter Logic | cmd/pj-rehearse/server.go | Adds shouldAnalyzeRehearsals which diffs baseRef vs HEAD, logs changed-file count, returns diff errors on failure, and returns true if any changed path matches relevance predicates. |
| Command & API Integration | cmd/pj-rehearse/server.go | Issue-comment command flow and getAffectedJobs now call shouldAnalyzeRehearsals; on diff error they abort/return wrapped error, and when no relevant changes exist they skip DetermineAffectedJobs and return empty job lists (issue-comment flow also posts an explanatory comment). |
| Tests | cmd/pj-rehearse/server_test.go | Adds TestHasPathPrefix and TestIsRehearsalRelevantPath to validate exact-or-child prefix semantics and includeRegistryChanges behavior. |
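
The "exact-or-child prefix" semantics mentioned for the path helpers can be sketched as follows. The function names match those in the summary, but the signatures and bodies are assumptions; the real helpers also consider directory-derived job config roots, which this sketch does not model. The root paths are the ones listed in the PR description.

```go
package main

import "strings"

// hasPathPrefix reports whether path is exactly prefix or lies beneath it.
// A plain strings.HasPrefix would wrongly match sibling directories such as
// "ci-operator/config-extra".
func hasPathPrefix(path, prefix string) bool {
	prefix = strings.TrimSuffix(prefix, "/")
	return path == prefix || strings.HasPrefix(path, prefix+"/")
}

// isRehearsalRelevantPath checks a changed file against the config roots
// named in the PR description. Registry paths count only when
// includeRegistryChanges is set.
func isRehearsalRelevantPath(path string, includeRegistryChanges bool) bool {
	roots := []string{
		"ci-operator/config",
		"ci-operator/jobs",
		"core-services/prow/02_config",
	}
	if includeRegistryChanges {
		roots = append(roots, "ci-operator/step-registry")
	}
	for _, root := range roots {
		if hasPathPrefix(path, root) {
			return true
		}
	}
	return false
}

// anyRelevant is a hypothetical helper showing how a prefilter could decide
// whether the expensive config load is needed at all.
func anyRelevant(changedFiles []string, includeRegistryChanges bool) bool {
	for _, f := range changedFiles {
		if isRehearsalRelevantPath(f, includeRegistryChanges) {
			return true
		}
	}
	return false
}
```

Appending the separator before the prefix comparison is what keeps a change under a similarly named sibling directory from triggering a full analysis.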

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 4

❌ Failed checks (3 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Go Error Handling | ⚠️ Warning | Missing nil check before dereferencing in server.go lines 305-320. GetPullRequest result not checked for nil before dereferencing pullRequest fields in handlePotentialCommands. | Add nil check for pullRequest: if pullRequest == nil after GetPullRequest call before passing to handlePotentialCommands or dereferencing its fields. |
| Test Coverage For New Features | ⚠️ Warning | shouldAnalyzeRehearsals lacks unit test coverage. Core function implementing DPTP-2888 prefilter that diffs changed files. Helpers tested but not orchestration. | Add unit test for shouldAnalyzeRehearsals covering diff errors, early return for no relevant changes, and return true when relevant files found. Mock git.RepoClient.Diff. |
| Test Structure And Quality | ❓ Inconclusive | Check specifies "Ginkgo test code" but PR contains standard Go testing (*testing.T). Check applicability is unclear. | If Ginkgo-specific: PASS. If generic criteria: FAIL - TestDispatcherQueuesThenExecutes and TestDispatcherDropsWhenQueueFull each test multiple unrelated behaviors, violating single responsibility. |
✅ Passed checks (10 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the three main changes in the PR: concurrency control, changed-files prefilter, and drop notifications. It is specific, clear, and directly related to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR are stable and deterministic. Standard Go testing package used exclusively. No Ginkgo patterns found. All test names are descriptive and contain no dynamic values. |
| Microshift Test Compatibility | ✅ Passed | The PR contains only standard Go unit tests (TestX functions), not Ginkgo e2e tests. The MicroShift compatibility check for Ginkgo tests is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR adds only standard Go unit tests, not Ginkgo e2e tests. SNO check applies only to Ginkgo patterns (It, Describe, Context, When). Not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only Go CLI tooling for pj-rehearse plugin. No manifests, operators, or topology-specific scheduling constraints added. |
| Ote Binary Stdout Contract | ✅ Passed | No stdout contract violations found. Logrus defaults to stderr, error output explicitly uses os.Stderr, no unauthorized stdout writes detected in process-level code. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests were added. The PR adds standard Go unit tests to cmd/pj-rehearse/server_test.go covering dispatcher logic, path filtering, and notification behavior. The check is not applicable. |

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/pj-rehearse/main.go`:
- Around line 90-100: The queued request currently holds its queue slot until
run() returns because dequeue() is deferred; change this so the queue slot (and
with it the queued-handlers metric) is released as soon as execution starts:
remove the top-level defer d.dequeue() and instead call d.dequeue() immediately
after successfully acquiring the execution slot (i.e., right after the case that
does d.executionSlots <- struct{}{}), then call d.run(logger, handler) as
before; in the timeout/drop branch, call d.dequeue() as well and otherwise keep
it (droppedHandlerRequests, onDrop, queueTimeout) untouched so dropped requests
still increment metrics and trigger onDrop.

In `@cmd/pj-rehearse/server_test.go`:
- Around line 268-272: The test uses time.Sleep to wait for the goroutine
started with d.dispatch to acquire a slot (via the blocker closure), which
causes flakiness; instead add explicit synchronization: create a ready (or
queued) channel that the dispatching goroutine closes or sends on when it has
acquired the slot or enqueued the job, have the test wait on that channel (with
a timeout) rather than sleeping, and apply the same ready/queued signaling
replacement for the other dispatcher tests that currently use Sleep (the tests
invoking d.dispatch and blocker at the other locations).

In `@cmd/pj-rehearse/server.go`:
- Around line 534-537: The diff currently uses a branch ref name (baseRef) in
shouldAnalyzeRehearsals which can be stale; change the function to diff against
the exact base commit SHA used by prepareCandidate: update
shouldAnalyzeRehearsals signature to accept the base commit SHA (e.g., baseSHA
string) instead of or in addition to baseRef, and replace the
repoClient.Diff(baseRef, "HEAD") call with repoClient.Diff(baseSHA, "HEAD");
ensure callers (including prepareCandidate) pass the same SHA that was fetched
and used for the rebase so both functions use the identical commit SHA.
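
The test-synchronization comment above (replacing time.Sleep with an explicit ready signal) can be sketched as follows. Both helpers are hypothetical and are not taken from the PR's test file; `dispatch` stands for whatever function hands a handler to the dispatcher under test.

```go
package main

import "time"

// waitForSignal reports whether ch was closed or received a value before the
// timeout elapsed. Tests call it instead of sleeping for a fixed duration.
func waitForSignal(ch <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-ch:
		return true
	case <-time.After(timeout):
		return false
	}
}

// startBlocker dispatches a handler that closes started once it is actually
// running and then blocks until the caller closes release. This lets a test
// know for certain that an execution slot is occupied.
func startBlocker(dispatch func(handler func())) (started <-chan struct{}, release chan<- struct{}) {
	startedCh := make(chan struct{})
	releaseCh := make(chan struct{})
	go dispatch(func() {
		close(startedCh)
		<-releaseCh
	})
	return startedCh, releaseCh
}
```

A test would wait with `waitForSignal(started, time.Second)` and fail if it returns false, then `close(release)` when the blocker is no longer needed. The timeout only bounds how long a broken test can hang, so it does not need tuning the way a sleep does.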
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 02c61700-85e4-496b-ae8f-7f70bb1a65d3

📥 Commits

Reviewing files that changed from the base of the PR and between 651357b and 667fb7a.

📒 Files selected for processing (3)
  • cmd/pj-rehearse/main.go
  • cmd/pj-rehearse/server.go
  • cmd/pj-rehearse/server_test.go

Comment thread cmd/pj-rehearse/main.go Outdated
Comment thread cmd/pj-rehearse/server_test.go Outdated
Comment thread cmd/pj-rehearse/server.go
@jmguzik jmguzik force-pushed the pj-reharse-change branch from 667fb7a to 0823b29 Compare May 7, 2026 07:11

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
cmd/pj-rehearse/server.go (1)

531-535: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Diff against the rebased base SHA, not the branch name.

prepareCandidate() checks out the exact base SHA from GitHub, but this helper diffs baseRef (for example, main) against HEAD. If the local branch ref is stale or missing, the prefilter can skip rehearsals or fail even though the rebase succeeded. Pass the same base SHA that prepareCandidate() used.

Suggested fix
-func shouldAnalyzeRehearsals(repoClient git.RepoClient, baseRef string, includeRegistryChanges bool, logger *logrus.Entry) (bool, error) {
-	changedFiles, err := repoClient.Diff(baseRef, "HEAD")
+func shouldAnalyzeRehearsals(repoClient git.RepoClient, baseSHA string, includeRegistryChanges bool, logger *logrus.Entry) (bool, error) {
+	changedFiles, err := repoClient.Diff(baseSHA, "HEAD")
 	if err != nil {
-		return false, fmt.Errorf("failed to diff changed files from base ref %s: %w", baseRef, err)
+		return false, fmt.Errorf("failed to diff changed files from base SHA %s: %w", baseSHA, err)
 	}

Then update the callers to pass the exact base SHA used during prepareCandidate().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/pj-rehearse/server.go` around lines 531 - 535, shouldAnalyzeRehearsals
currently diffs the named baseRef (e.g., "main") against HEAD which can be
stale; change it to accept and use the exact base SHA that prepareCandidate()
checks out so the diff is performed against the rebased commit, and update all
callers to pass that same base SHA (ensure the function signature and callers
reference shouldAnalyzeRehearsals and prepareCandidate consistently so the exact
base SHA is propagated).
cmd/pj-rehearse/main.go (1)

82-95: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Release the queue slot as soon as execution starts.

A request that leaves the queue still counts as queued until run() returns because dequeue() is deferred. That inflates pj_rehearse_handlers_queued and can trigger queue_full drops while nothing is actually waiting.

Suggested fix
 	if !d.tryQueue() {
 		droppedHandlerRequests.WithLabelValues("queue_full").Inc()
 		logger.WithField("max_queue", cap(d.queueSlots)).Warn("Dropping webhook request because handler queue is full")
 		if onDrop != nil {
 			onDrop("queue_full")
 		}
 		return
 	}
-	defer d.dequeue()

 	select {
 	case d.executionSlots <- struct{}{}:
+		d.dequeue()
 		d.run(logger, handler)
 	case <-time.After(d.queueTimeout):
+		d.dequeue()
 		droppedHandlerRequests.WithLabelValues("queue_timeout").Inc()
 		logger.WithField("timeout", d.queueTimeout).Warn("Dropping webhook request because it waited too long in queue")
 		if onDrop != nil {
 			onDrop("queue_timeout")
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/pj-rehearse/main.go` around lines 82 - 95, The deferred dequeue keeps the
queue slot occupied until run() returns; remove the defer d.dequeue() and
instead release the queue slot as soon as execution starts by calling
d.dequeue() immediately after successfully acquiring an execution slot (i.e.,
inside the select case after sending to d.executionSlots and before calling
d.run), so queued metrics (d.queueSlots / pj_rehearse_handlers_queued) reflect
actual waiting requests; adjust any error paths similarly to ensure the slot is
always released when execution is handed off.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/pj-rehearse/main.go`:
- Around line 390-404: The event registrations currently submit every PR and
issue-comment event into dispatcher.dispatch even when the action would be
ignored; before calling dispatcher.dispatch in
eventServer.RegisterHandlePullRequestEvent and RegisterHandleIssueCommentEvent,
filter by the GitHub event action (e.g., only allow
"opened"/"reopened"/"synchronize" for PRs that should trigger
s.handlePullRequestCreation or s.handleNewPush, and only allow relevant comment
actions/commands for issue comments) and return early if not relevant so
dispatcher.dispatch is never invoked; update the anonymous handlers wrapping
s.handlePullRequestCreation, s.handleNewPush, and s.handleIssueComment to
perform that pre-check and only call notifyDroppedRequest when an event that
would have been dispatched is actually dropped by the dispatcher.
- Around line 108-116: The worker goroutine started in handlerDispatcher.run
must recover from panics thrown by the callback to avoid crashing the process;
update the anonymous goroutine that calls handler() to wrap the handler
invocation with a deferred recover() that catches any panic, logs the panic and
stack trace (use the same logger variable) and allows the goroutine to clean up
(maintaining the existing deferred concurrentHandlersInFlight.Dec(),
<-d.executionSlots and close(done) behavior); specifically modify the anonymous
func in run to add a defer that calls recover(), formats the panic and
runtime.Stack() and emits a logger.Errorf or logger.WithField(...).Errorf so the
panic is recorded but does not kill the process.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c1c84ebe-fb99-4f49-b011-cc5cd43c5b22

📥 Commits

Reviewing files that changed from the base of the PR and between 667fb7a and 0823b29.

📒 Files selected for processing (3)
  • cmd/pj-rehearse/main.go
  • cmd/pj-rehearse/server.go
  • cmd/pj-rehearse/server_test.go

Comment thread cmd/pj-rehearse/main.go
Comment thread cmd/pj-rehearse/main.go
@jmguzik jmguzik force-pushed the pj-reharse-change branch from 0823b29 to 9102c7d Compare May 7, 2026 07:24

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (2)
cmd/pj-rehearse/main.go (2)

108-124: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add panic recovery in the worker goroutine.

A panic in handler() will crash the plugin process. Wrap with recover() to log and continue.

Suggested fix
 func (d *handlerDispatcher) run(logger *logrus.Entry, handler func()) {
 	done := make(chan struct{})
 	go func() {
 		concurrentHandlersInFlight.Inc()
 		defer concurrentHandlersInFlight.Dec()
-		defer func() { <-d.executionSlots }()
-		defer close(done)
+		defer func() {
+			if r := recover(); r != nil {
+				logger.WithField("panic", r).Error("webhook handler panicked")
+			}
+			<-d.executionSlots
+			close(done)
+		}()
 		handler()
 	}()
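
A self-contained variant of this recovery pattern, using only the standard library, might look like the sketch below. `runRecovered` is a hypothetical name, the standard `log` package stands in for logrus, and the in-flight gauge is omitted. The caller is assumed to have already acquired a slot from `slots`.

```go
package main

import (
	"log"
	"runtime/debug"
)

// runRecovered runs handler in a goroutine. Deferred calls execute in
// last-in-first-out order, so the panic is recovered first, then the
// execution slot is released, and finally done is closed. All three happen
// whether or not the handler panics.
func runRecovered(slots chan struct{}, handler func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer func() { <-slots }()
		defer func() {
			if r := recover(); r != nil {
				log.Printf("webhook handler panicked: %v\n%s", r, debug.Stack())
			}
		}()
		handler()
	}()
	return done
}
```

Keeping the recover in its own deferred function, registered last, means the slot release and the `done` signal do not depend on the logging succeeding.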
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/pj-rehearse/main.go` around lines 108 - 124, The goroutine started in
handlerDispatcher.run can panic inside handler(), which will crash the process;
wrap the goroutine's body with a defer that recovers from panic,
increments/observes the existing metrics as needed, and logs the panic and stack
using the provided logger (use logger.WithField/WithError and include debug
stack) so the process continues; ensure recovery runs before the other defers
that decrement concurrentHandlersInFlight and release d.executionSlots and that
close(done) still executes after recovery.

390-404: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Filter irrelevant events before entering the dispatcher.

Currently all PR and issue-comment events are dispatched, but handlers no-op for most actions (e.g., closed, labeled). Under load this wastes dispatcher capacity and may post drop notifications for events that would have been ignored.

Suggested fix
 		eventServer.RegisterHandlePullRequestEvent(func(l *logrus.Entry, event github.PullRequestEvent) {
+			if event.Action != github.PullRequestActionOpened && event.Action != github.PullRequestActionSynchronize {
+				return
+			}
 			dispatcher.dispatch(l, func() {
 				s.handlePullRequestCreation(l, event)
 				s.handleNewPush(l, event)
 			}, func(reason string) {
 				notifyDroppedRequest(s.ghc, event.Repo.Owner.Login, event.Repo.Name, event.Number, event.PullRequest.User.Login, reason, o.queueTimeoutMinutes, false, l)
 			})
 		})
 		eventServer.RegisterHandleIssueCommentEvent(func(l *logrus.Entry, event github.IssueCommentEvent) {
+			if !event.Issue.IsPullRequest() || event.Action != github.IssueCommentActionCreated {
+				return
+			}
 			dispatcher.dispatch(l, func() {
 				s.handleIssueComment(l, event)
 			}, func(reason string) {
 				notifyDroppedRequest(s.ghc, event.Repo.Owner.Login, event.Repo.Name, event.Issue.Number, event.Comment.User.Login, reason, o.queueTimeoutMinutes, true, l)
 			})
 		})
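
The filtering decision in this suggestion can be isolated into two small predicates, which makes it easy to unit test apart from the webhook wiring. The sketch below uses plain strings for the GitHub webhook action values ("opened", "synchronize", "created") in place of the Prow `github` package constants; the function names are hypothetical.

```go
package main

// GitHub webhook action values used by the suggested filter.
const (
	pullRequestOpened      = "opened"
	pullRequestSynchronize = "synchronize"
	issueCommentCreated    = "created"
)

// shouldDispatchPullRequest mirrors the suggested check: only newly opened
// PRs and new pushes are worth a dispatcher slot.
func shouldDispatchPullRequest(action string) bool {
	return action == pullRequestOpened || action == pullRequestSynchronize
}

// shouldDispatchIssueComment mirrors the suggested check: only newly created
// comments on pull requests can carry a rehearsal command.
func shouldDispatchIssueComment(action string, isPullRequest bool) bool {
	return isPullRequest && action == issueCommentCreated
}
```

Events that fail these checks never reach the dispatcher, so they cannot occupy a queue slot or cause a drop notification.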
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/pj-rehearse/main.go` around lines 390 - 404, Filter out irrelevant GitHub
event actions before calling dispatcher to avoid wasting capacity and spurious
drop notifications: inside the RegisterHandlePullRequestEvent callback check
event.Action and only call dispatcher.dispatch for PR actions your handlers
expect (e.g., "opened", "reopened", "synchronize", "edited"/"ready_for_review"
as applicable) and inside RegisterHandleIssueCommentEvent only dispatch for
relevant comment actions (e.g., "created", and optionally "edited") and, if
appropriate, only for comments on pull requests; keep the existing
notifyDroppedRequest callbacks as they are, so that only events actually handed
to the dispatcher can produce a drop notification. Target the callbacks
registered here (the anonymous
funcs passed to eventServer.RegisterHandlePullRequestEvent and
RegisterHandleIssueCommentEvent) and gate the dispatcher.dispatch invocation
based on event.Action and whether the Issue is a PR.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1979fce6-49e3-45e0-ba01-6ccd28420ab9

📥 Commits

Reviewing files that changed from the base of the PR and between 0823b29 and 9102c7d.

📒 Files selected for processing (3)
  • cmd/pj-rehearse/main.go
  • cmd/pj-rehearse/server.go
  • cmd/pj-rehearse/server_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/pj-rehearse/server_test.go

…op notifications

pj-rehearse was vulnerable to memory spikes (50+ GB) under load because
it handled an unlimited number of webhook requests concurrently, each
loading full CI/Prow configurations.

This commit introduces three features to mitigate the problem:

1. Handler dispatcher with bounded concurrency and queuing:
   - Configurable max concurrent handlers (default 5), max queued (50),
     queue timeout (5m), and execution timeout (15m) via CLI flags.
   - Requests that exceed the queue capacity or wait too long are dropped
     with a GitHub comment notifying the user.
   - Separate Prometheus metrics for in-flight handlers, queued requests,
     drops (by reason), and execution timeouts.

2. Changed-files prefilter (implements DPTP-2888):
   - Before loading full configs, diffs the PR against the base branch to
     check if any changed files fall under ci-operator/config,
     ci-operator/jobs, core-services/prow/02_config, or
     ci-operator/step-registry.
   - If no rehearsal-relevant paths changed, skips DetermineAffectedJobs
     entirely, avoiding the expensive config load.

3. Merged PR event handlers into a single registration:
   - handlePullRequestCreation and handleNewPush are now dispatched as
     one unit, preventing a single PR event from consuming two dispatcher
     slots.

Co-authored-by Cursor
Signed-off-by: Jakub Guzik <jguzik@redhat.com>
@jmguzik jmguzik force-pushed the pj-reharse-change branch from 9102c7d to 325d36e Compare May 7, 2026 07:32
@deepsm007
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepsm007, jmguzik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 53463e8 and 2 for PR HEAD 325d36e in total

@hector-vido
Contributor

/test images
/override ci/prow/breaking-changes

@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

@hector-vido: Overrode contexts on behalf of hector-vido: ci/prow/breaking-changes

Details

In response to this:

/test images
/override ci/prow/breaking-changes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot
Contributor

Pipeline controller notification

No second-stage tests were triggered for this PR.

This can happen when:

  • The changed files don't match any pipeline_run_if_changed patterns
  • All files match pipeline_skip_if_only_changed patterns
  • No pipeline-controlled jobs are defined for the main branch

Use /test ? to see all available tests.

@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

@jmguzik: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 582d460 into openshift:main May 7, 2026
16 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

3 participants