Skip to content

Conversation

@Mic92
Copy link
Member

@Mic92 Mic92 commented Sep 16, 2025

…r deadlock

The scheduler was only updating job_closures for failed builds, causing it to potentially wait indefinitely for dependencies that had already completed successfully. This resulted in builds hanging when successful dependencies weren't properly removed from the closure tracking.

  • Rename _handle_failed_job to _handle_completed_job for clarity
  • Move job_closures update to run after ALL completed jobs (success/failure)
  • Ensure closure update happens after get_failed_dependents to maintain integrity

Summary by CodeRabbit

  • New Features
    • None
  • Bug Fixes
    • Consistent handling of all completed jobs (successes and failures).
    • Dependent failures now scheduled correctly before state updates.
    • More accurate caching and linking of failed builds when configured.
    • Reduced noisy logging by only reporting removed jobs when present.
  • Refactor
    • Unified job completion handling in the processing loop for clearer, more reliable flow.

…r deadlock

The scheduler was only updating job_closures for failed builds, causing it to
potentially wait indefinitely for dependencies that had already completed
successfully. This resulted in builds hanging when successful dependencies
weren't properly removed from the closure tracking.

- Rename _handle_failed_job to _handle_completed_job for clarity
- Move job_closures update to run after ALL completed jobs (success/failure)
- Ensure closure update happens after get_failed_dependents to maintain integrity
@coderabbitai
Copy link

coderabbitai bot commented Sep 16, 2025

Walkthrough

Refactors failure-specific handling into a unified completion handler. The run loop now calls _handle_completed_job, which processes both successes and failures. Failure caching is explicitly gated by result checks. Dependent-failure scheduling occurs before closure updates. Removed jobs logging is conditional. Closures are updated for all completed jobs.

Changes

Cohort / File(s) Summary
Completion handling refactor
buildbot_nix/buildbot_nix/build_trigger.py
Renamed _handle_failed_job_handle_completed_job. Early-return if job not NixEvalJobSuccess. Failure path: guarded failed-build caching, compute/schedule dependent failures, update build schedule, extend BRIDs. Move closure updates to post-scheduling for all completions. Conditional logging for removed jobs. Run loop calls new handler.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant RL as RunLoop
  participant BT as BuildTrigger
  participant HC as _handle_completed_job
  participant DB as FailedBuildsDB
  participant SCH as Scheduler/Queue

  RL->>BT: on job completion (job, result)
  BT->>HC: _handle_completed_job(job, ctx, brids, result)

  alt Job is not NixEvalJobSuccess
    HC-->>BT: return (no-op)
  else Job is NixEvalJobSuccess
    opt result == FAILURE
      HC->>DB: cache failed build URL with timestamp (if configured)
      HC->>SCH: compute failed dependents
      loop for each failed dependent
        HC->>SCH: schedule dependency-failure
      end
      HC->>BT: extend BRIDs map
    end
    HC->>BT: update job closures (all completed jobs)
    opt any removed jobs
      HC->>BT: log removed jobs
    end
    HC-->>BT: done
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title concisely and accurately summarizes the primary change—updating job closures for all completed builds to prevent scheduler deadlock—and directly reflects the PR objectives and the code modifications described in the summary, so it is clear and relevant to the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch refactoring

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29eccb5 and 8571954.

📒 Files selected for processing (1)
  • buildbot_nix/buildbot_nix/build_trigger.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
buildbot_nix/buildbot_nix/build_trigger.py (2)
buildbot_nix/buildbot_nix/models.py (1)
  • NixEvalJobSuccess (334-344)
buildbot_nix/buildbot_nix/failed_builds.py (2)
  • check_build (40-44)
  • add_build (35-38)
🔇 Additional comments (6)
buildbot_nix/buildbot_nix/build_trigger.py (6)

438-447: Good refactoring: clearer semantics for completion handling.

The renaming from _handle_failed_job to _handle_completed_job better reflects the method's actual purpose of handling all completed jobs, not just failures. The early return for non-success jobs is appropriate.


449-461: LGTM: Failure caching logic is properly scoped.

The nested conditions properly guard failure caching to only occur when result != SUCCESS and specifically for util.FAILURE with a configured database. The URL resolution and timestamp tracking are appropriately handled.


462-483: Critical fix: Correct ordering ensures scheduler doesn't deadlock.

Moving the dependent failure scheduling before updating job_closures is essential. The comment on line 462 correctly emphasizes this ordering requirement. This ensures that get_failed_dependents can properly traverse the dependency graph before the closures are modified.


484-489: Key improvement: Job closures updated for ALL completed jobs.

This is the core fix for the scheduler deadlock issue. By updating job_closures for both successful and failed jobs, the scheduler properly tracks completion state and won't wait indefinitely for dependencies that have already finished.


595-596: LGTM: Clean integration point in the main loop.

The refactored method is cleanly integrated into the main scheduling loop, maintaining the original control flow while fixing the underlying issue.


465-475: Iteration-safe: removals from ctx.build_schedule_order are safe.

get_failed_dependents makes a copy (jobs = list(jobs) at line 333) and the main loop iterates using a snapshot (for build in list(ctx.build_schedule_order) at line 547), so removing items in _handle_completed_job (remove at line 471) and in _process_build_for_scheduling (removes at lines 400/417/421) mutates the underlying list but not any live iterator; no other raw iterations over build_schedule_order were found.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Mic92 Mic92 added this pull request to the merge queue Sep 16, 2025
Merged via the queue into main with commit 201ae35 Sep 16, 2025
3 checks passed
@Mic92 Mic92 deleted the refactoring branch September 16, 2025 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants