fix: update job closures for all completed builds to prevent schedule… #491

Mic92 · 2025-09-16T08:48:56Z

…r deadlock

The scheduler was only updating job_closures for failed builds, causing it to potentially wait indefinitely for dependencies that had already completed successfully. This resulted in builds hanging when successful dependencies weren't properly removed from the closure tracking.

Rename _handle_failed_job to _handle_completed_job for clarity
Move job_closures update to run after ALL completed jobs (success/failure)
Ensure closure update happens after get_failed_dependents to maintain integrity

Summary by CodeRabbit

New Features
- None
Bug Fixes
- Consistent handling of all completed jobs (successes and failures).
- Dependent failures now scheduled correctly before state updates.
- More accurate caching and linking of failed builds when configured.
- Reduced noisy logging by only reporting removed jobs when present.
Refactor
- Unified job completion handling in the processing loop for clearer, more reliable flow.

…r deadlock The scheduler was only updating job_closures for failed builds, causing it to potentially wait indefinitely for dependencies that had already completed successfully. This resulted in builds hanging when successful dependencies weren't properly removed from the closure tracking. - Rename _handle_failed_job to _handle_completed_job for clarity - Move job_closures update to run after ALL completed jobs (success/failure) - Ensure closure update happens after get_failed_dependents to maintain integrity

coderabbitai · 2025-09-16T08:49:03Z

Walkthrough

Refactors failure-specific handling into a unified completion handler. The run loop now calls _handle_completed_job, which processes both successes and failures. Failure caching is explicitly gated by result checks. Dependent-failure scheduling occurs before closure updates. Removed jobs logging is conditional. Closures are updated for all completed jobs.

Changes

Cohort / File(s)	Summary
Completion handling refactor `buildbot_nix/buildbot_nix/build_trigger.py`	Renamed `_handle_failed_job` → `_handle_completed_job`. Early-return if job not `NixEvalJobSuccess`. Failure path: guarded failed-build caching, compute/schedule dependent failures, update build schedule, extend BRIDs. Move closure updates to post-scheduling for all completions. Conditional logging for removed jobs. Run loop calls new handler.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant RL as RunLoop
  participant BT as BuildTrigger
  participant HC as _handle_completed_job
  participant DB as FailedBuildsDB
  participant SCH as Scheduler/Queue

  RL->>BT: on job completion (job, result)
  BT->>HC: _handle_completed_job(job, ctx, brids, result)

  alt Job is not NixEvalJobSuccess
    HC-->>BT: return (no-op)
  else Job is NixEvalJobSuccess
    opt result == FAILURE
      HC->>DB: cache failed build URL with timestamp (if configured)
      HC->>SCH: compute failed dependents
      loop for each failed dependent
        HC->>SCH: schedule dependency-failure
      end
      HC->>BT: extend BRIDs map
    end
    HC->>BT: update job closures (all completed jobs)
    opt any removed jobs
      HC->>BT: log removed jobs
    end
    HC-->>BT: done
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title concisely and accurately summarizes the primary change—updating job closures for all completed builds to prevent scheduler deadlock—and directly reflects the PR objectives and the code modifications described in the summary, so it is clear and relevant to the changeset.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch refactoring

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29eccb5 and 8571954.

📒 Files selected for processing (1)

buildbot_nix/buildbot_nix/build_trigger.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

buildbot_nix/buildbot_nix/build_trigger.py (2)

buildbot_nix/buildbot_nix/models.py (1)

NixEvalJobSuccess (334-344)

buildbot_nix/buildbot_nix/failed_builds.py (2)

check_build (40-44)

add_build (35-38)

🔇 Additional comments (6)

buildbot_nix/buildbot_nix/build_trigger.py (6)

438-447: Good refactoring: clearer semantics for completion handling.

The renaming from _handle_failed_job to _handle_completed_job better reflects the method's actual purpose of handling all completed jobs, not just failures. The early return for non-success jobs is appropriate.

449-461: LGTM: Failure caching logic is properly scoped.

The nested conditions properly guard failure caching to only occur when result != SUCCESS and specifically for util.FAILURE with a configured database. The URL resolution and timestamp tracking are appropriately handled.

462-483: Critical fix: Correct ordering ensures scheduler doesn't deadlock.

Moving the dependent failure scheduling before updating job_closures is essential. The comment on line 462 correctly emphasizes this ordering requirement. This ensures that get_failed_dependents can properly traverse the dependency graph before the closures are modified.

484-489: Key improvement: Job closures updated for ALL completed jobs.

This is the core fix for the scheduler deadlock issue. By updating job_closures for both successful and failed jobs, the scheduler properly tracks completion state and won't wait indefinitely for dependencies that have already finished.

595-596: LGTM: Clean integration point in the main loop.

The refactored method is cleanly integrated into the main scheduling loop, maintaining the original control flow while fixing the underlying issue.

465-475: Iteration-safe: removals from ctx.build_schedule_order are safe.

get_failed_dependents makes a copy (jobs = list(jobs) at line 333) and the main loop iterates using a snapshot (for build in list(ctx.build_schedule_order) at line 547), so removing items in _handle_completed_job (remove at line 471) and in _process_build_for_scheduling (removes at lines 400/417/421) mutates the underlying list but not any live iterator; no other raw iterations over build_schedule_order were found.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Mic92 added this pull request to the merge queue Sep 16, 2025

Merged via the queue into main with commit 201ae35 Sep 16, 2025
3 checks passed

Mic92 deleted the refactoring branch September 16, 2025 09:04

eljamm mentioned this pull request Sep 16, 2025

flake.lock: Update buildbot-nix ngi-nix/ngipkgs#1639

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix: update job closures for all completed builds to prevent schedule… #491

fix: update job closures for all completed builds to prevent schedule… #491

Uh oh!

Mic92 commented Sep 16, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Sep 16, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

fix: update job closures for all completed builds to prevent schedule… #491

fix: update job closures for all completed builds to prevent schedule… #491

Uh oh!

Conversation

Mic92 commented Sep 16, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Sep 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Mic92 commented Sep 16, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 16, 2025 •

edited

Loading