
Fix exporter deadlock when lease ends before before_lease_hook is set#569

Merged
mangelajo merged 4 commits into main from
fix/exporter-deadlock-before-lease-hook-567
Apr 16, 2026

Conversation

@ambient-code
Contributor

@ambient-code ambient-code bot commented Apr 16, 2026

Summary

  • Fix race condition where conn_tg gets cancelled before before_lease_hook.set() is reached for no-hook exporters, causing _cleanup_after_lease() to deadlock forever on before_lease_hook.wait() inside a shielded scope
  • Add unconditional before_lease_hook.set() in handle_lease()'s finally block as primary fix
  • Add dynamic safety timeout in _cleanup_after_lease() as defense-in-depth against similar future races. It uses the configured hook timeout plus a 30s margin when a before_lease hook is present, or a 15s default otherwise (addresses review feedback from @mangelajo)
  • Fix diff-cover path resolution: add [tool.coverage.run] source = ["."] to the jumpstarter package's pyproject.toml so that coverage.xml records the correct source directory, enabling diff-cover to match coverage data against git diff paths
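
The race and the primary fix described above can be sketched in a few lines. This is a minimal illustration only, not the exporter's code: it swaps in the standard-library asyncio for the anyio task groups the PR uses, and `connection_task` and the `set_in_finally` flag are hypothetical names introduced for the example.

```python
import asyncio


async def handle_lease(set_in_finally: bool) -> bool:
    """Return True if the event is set by the time cleanup would run."""
    before_lease_hook = asyncio.Event()

    async def connection_task() -> None:
        # Stand-in for the work inside conn_tg that precedes set().
        await asyncio.sleep(1)
        before_lease_hook.set()

    task = asyncio.create_task(connection_task())
    try:
        # The lease ends almost immediately, before set() is reached.
        await asyncio.sleep(0.01)
    finally:
        # Analog of wait_for_lease_end() cancelling conn_tg.
        task.cancel()
        await asyncio.gather(task, return_exceptions=True)
        if set_in_finally:
            # Primary fix: set the event unconditionally before cleanup.
            before_lease_hook.set()
    # Cleanup would await the event here; an unset event means a hang.
    return before_lease_hook.is_set()


def demo() -> tuple:
    # (without the fix, with the fix)
    return asyncio.run(handle_lease(False)), asyncio.run(handle_lease(True))
```

Without the `finally` assignment the event stays unset after cancellation, which is the state in which an unbounded `wait()` never returns.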

Fixes #567

Test plan

  • All existing exporter tests pass (exporter_test.py, exporter_retry_test.py, lease_context_test.py, hooks_test.py)
  • New tests cover safety timeout code paths (fixes diff-coverage CI failure)
  • Ruff linting passes
  • Local diff-cover reports 96% total coverage (80% for exporter.py, 100% for test file)
  • E2E tests should no longer show the deadlock pattern described in the issue

🤖 Generated with Claude Code

@coderabbitai
Contributor

coderabbitai bot commented Apr 16, 2026

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.


await lease_scope.before_lease_hook.wait()
# Safety timeout: prevent permanent deadlock if before_lease_hook
# was never set due to a race (e.g. conn_tg cancelled early).
with move_on_after(30) as timeout_scope:
Member

I think a problem could arise here if 30 < the configured before_lease hook timeout in the exporter

Contributor Author

Good catch. You're right that the hardcoded 30-second timeout could conflict with the configured hook timeout (which defaults to 120s and is user-configurable).

That said, looking at the code flow more carefully, this timeout should never actually fire in practice with the primary fix in place:

  1. run_before_lease_hook (which runs the actual hook) always calls before_lease_hook.set() in its own finally block (hooks.py line 623)
  2. When there's no hook, the new finally block at lines 754-755 in handle_lease() sets it before _cleanup_after_lease is even called

So by the time _cleanup_after_lease reaches the wait(), the event should already be set. The 30-second timeout is purely defense-in-depth for unforeseen edge cases.
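
A bounded wait of this kind can be sketched as follows. This is an illustration under stated assumptions: the PR uses anyio's `move_on_after`, while the sketch uses the standard-library `asyncio.wait_for` as an analog, and `wait_with_safety_timeout` is a hypothetical helper name.

```python
import asyncio


async def wait_with_safety_timeout(event: asyncio.Event, timeout: float) -> bool:
    """Wait for the event; return True only if the safety timeout fired."""
    try:
        await asyncio.wait_for(event.wait(), timeout)
    except asyncio.TimeoutError:
        return True
    return False


async def _demo() -> tuple:
    already_set = asyncio.Event()
    already_set.set()
    never_set = asyncio.Event()
    # Normal path: the event is already set, so the wait returns at once
    # and the 30-second limit is never reached.
    normal = await wait_with_safety_timeout(already_set, 30)
    # Race path: the event is never set, so the (shortened) timeout fires
    # instead of blocking forever.
    raced = await wait_with_safety_timeout(never_set, 0.05)
    return normal, raced


def demo_timeouts() -> tuple:
    return asyncio.run(_demo())
```

The first call shows why the timeout value does not matter once the primary fix is in place; the second shows what the safety net buys if the event is somehow missed.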

Still, I agree it would be better to either:

  • Use the configured hook timeout (e.g. self.hook_executor.config.before_lease.timeout) plus a margin, or
  • Use a longer default (e.g. 300s) since this is a last-resort safety net and not a normal code path

I'll update the timeout to be derived from the hook's configured timeout when a hook executor is present, falling back to a generous default otherwise.

@ambient-code
Contributor Author

ambient-code bot commented Apr 16, 2026

Fix plan

Addressing @mangelajo's feedback and CI failures:

1. Dynamic safety timeout (addresses review feedback)

The hardcoded 30-second safety timeout in _cleanup_after_lease could conflict with the user-configured hook timeout (default 120s). Will change it to:

  • Use self.hook_executor.config.before_lease.timeout + 30 when a hook executor with a before_lease hook is configured
  • Fall back to 300 seconds as a generous default when no hook is configured (since the primary fix in the finally block should handle the no-hook case instantly)
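
The selection rule in the two bullets above might look roughly like the sketch below. The dataclasses are hypothetical stand-ins that only mirror the attribute path `hook_executor.config.before_lease.timeout` quoted in this thread; the real exporter classes are not shown here, and the fallback is left as a parameter because its value is revised later in the review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HookConfig:  # hypothetical stand-in for a single hook's settings
    timeout: float = 120  # default hook timeout mentioned in the review


@dataclass
class HooksConfig:  # hypothetical container for the configured hooks
    before_lease: Optional[HookConfig] = None


@dataclass
class HookExecutor:  # hypothetical stand-in for the exporter's hook executor
    config: HooksConfig


def safety_timeout(
    executor: Optional[HookExecutor],
    margin: float = 30,
    fallback: float = 300,
) -> float:
    """Pick the safety timeout: hook timeout plus margin, else the fallback."""
    if executor is not None and executor.config.before_lease is not None:
        return executor.config.before_lease.timeout + margin
    return fallback
```

For example, `safety_timeout(HookExecutor(HooksConfig(HookConfig(timeout=120))))` yields 150, while an executor with no before_lease hook, or no executor at all, gets the fallback.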

2. Add test coverage for new code paths (fixes CI coverage failure)

The pytest-matrix checks are failing because the new lines (602-605, 608, 754-755) have 0% diff coverage. Will add tests covering:

  • The finally block ensuring before_lease_hook is set when conn_tg is cancelled early
  • The safety timeout path in _cleanup_after_lease

3. type-check-python failure

This failure is in jumpstarter_driver_renode/driver_test.py and is unrelated to this PR's changes.

# was never set due to a race (e.g. conn_tg cancelled early).
# Use the configured hook timeout (+ margin) when available so we
# never interrupt a legitimately-running beforeLease hook.
safety_timeout = 300 # generous default for no-hook / unknown cases
Member

Suggested change
- safety_timeout = 300 # generous default for no-hook / unknown cases
+ safety_timeout = 15 # generous default for no-hook / unknown cases

For the no-hook case I would be even more aggressive, to match the server/client 30-second connection timeouts.

Contributor Author

Applied -- changed the default to 15s. Good call on aligning it with the server/client connection timeouts. Also rebased onto main to pick up the type-check fix from #568.

@mangelajo
Member

This is failing coverage checks, and could benefit from a rebase (which would pick up the fix for the type-checking failure)

@ambient-code
Contributor Author

ambient-code bot commented Apr 16, 2026

Fix plan (addressing latest feedback)

1. Apply suggestion: lower default safety timeout to 15s

@mangelajo's suggestion makes sense — the 300s default is only used when there's no hook executor configured. In the no-hook case, before_lease_hook should be set nearly instantly by the finally block, so 15s is more than enough and aligns better with the server/client 30-second connection timeouts.

2. Rebase onto main

This will pick up the type-check fix for jumpstarter_driver_renode/driver_test.py which is unrelated to this PR but is blocking CI.

3. Coverage

The existing tests cover the timeout logic but the coverage tool doesn't pick them up because they patch move_on_after at the module level rather than exercising the actual production code paths. Will verify the tests properly register in coverage after the rebase.

Ambient Code Bot and others added 3 commits April 16, 2026 10:33
For exporters without a hook_executor, the before_lease_hook event is
only set inside the conn_tg task group. When a lease ends quickly
(before conn_tg reaches the set() call), wait_for_lease_end() cancels
conn_tg, and before_lease_hook is never set. Then _cleanup_after_lease()
blocks forever on before_lease_hook.wait() inside a shielded scope,
which also prevents after_lease_hook_done from being set, causing
serve() to hang permanently.

Two fixes applied:
1. Primary: Set before_lease_hook unconditionally in the finally block
   of handle_lease(), before calling _cleanup_after_lease.
2. Defense-in-depth: Add a 30-second safety timeout with move_on_after
   in _cleanup_after_lease when waiting for before_lease_hook.

Fixes #567

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Address review feedback from @mangelajo: the hardcoded 30s safety
timeout in _cleanup_after_lease could conflict with the user-configured
hook timeout (default 120s). Now uses hook_timeout + 30s margin when
a before_lease hook is configured, falling back to 300s otherwise.

Add tests covering the safety timeout and finally-block code paths
to fix the diff-coverage CI failure (lines 602-608, 754-755).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@ambient-code ambient-code bot force-pushed the fix/exporter-deadlock-before-lease-hook-567 branch from 540dafd to 01f109c Compare April 16, 2026 10:33

Add [tool.coverage.run] source = ["."] to the jumpstarter package's
pyproject.toml so that coverage.xml records the correct source
directory path. Without this, the <source> element in coverage.xml
was empty, causing diff-cover to fail to match coverage data against
git diff paths -- resulting in 0% reported coverage for all changed
lines despite tests actually covering them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@ambient-code
Contributor Author

ambient-code bot commented Apr 16, 2026

Coverage fix pushed

The pytest-matrix coverage failures were caused by a path mismatch between coverage.xml and diff-cover.

Root cause: When --cov is used without specifying a source (e.g. --cov instead of --cov=.), coverage.py leaves the <source> element in coverage.xml empty. diff-cover needs this element to resolve coverage file paths against git diff paths. Since git diff always uses repo-relative paths (e.g. python/packages/jumpstarter/jumpstarter/exporter/exporter.py) but coverage.xml uses package-relative paths (e.g. jumpstarter/exporter/exporter.py), the match fails silently — reporting 0% coverage despite all tests actually passing and covering the code.

Fix: Added [tool.coverage.run] source = ["."] to python/packages/jumpstarter/pyproject.toml. This tells coverage.py to record the current directory as the source root, which populates the <source> element in coverage.xml with the correct absolute path. diff-cover can then correctly match coverage data to git diff paths.
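
The path mismatch can be illustrated with a small sketch. This is not diff-cover's actual implementation, only a model of the matching described above; `to_repo_relative` is a hypothetical helper and `/repo` is an assumed checkout location.

```python
import posixpath


def to_repo_relative(source_root: str, filename: str, repo_root: str) -> str:
    """Resolve a coverage.xml filename against its <source> root.

    Returns a repo-relative path when the source root is known, or the
    filename unchanged when the <source> element is empty.
    """
    if not source_root:
        # Empty <source>: nothing to anchor the package-relative path to.
        return filename
    absolute = posixpath.join(source_root, filename)
    return posixpath.relpath(absolute, repo_root)


# Paths taken from the comment above; the /repo prefix is an assumption.
GIT_DIFF_PATH = "python/packages/jumpstarter/jumpstarter/exporter/exporter.py"
COVERAGE_FILENAME = "jumpstarter/exporter/exporter.py"

before_fix = to_repo_relative("", COVERAGE_FILENAME, "/repo")
after_fix = to_repo_relative(
    "/repo/python/packages/jumpstarter", COVERAGE_FILENAME, "/repo"
)
```

With an empty source root the coverage filename never equals the git diff path, so every changed line looks uncovered; with the source root recorded, the two paths coincide.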

With this fix, local diff-cover reports 80% coverage on exporter.py (only the handle_lease finally-block at lines 762-763 is uncovered, which requires a full gRPC setup to test) and 100% on the test file. Total diff coverage is 96%, well above the 80% threshold.

Note: The CI workflow (.github/workflows/python-tests.yaml) also uses PYTEST_ADDOPTS: "--cov --cov-report=xml" — the bare --cov there is no longer an issue because coverage.py reads the source from the package's pyproject.toml config when source=None is passed by pytest-cov. However, for consistency it would be good to also change it to --cov=. --cov-report=xml in a follow-up (bot can't push workflow changes due to missing workflows permission).

@mangelajo mangelajo merged commit aaf98be into main Apr 16, 2026
32 checks passed
mangelajo added a commit that referenced this pull request Apr 17, 2026
Key beforeLease scheduling and cleanup waits off the configured before_lease hook instead of hook_executor presence. This preserves the deadlock fix from #569 without letting after_lease-only configs or lease cleanup race on a synthetic before_lease_hook signal.

Fixes #613

Made-with: Cursor

Development

Successfully merging this pull request may close these issues.

Flaky E2E: exporter deadlocks when lease ends before before_lease_hook is set (no-hook exporters)

1 participant