Fix exporter deadlock when lease ends before before_lease_hook is set by ambient-code[bot] · Pull Request #569 · jumpstarter-dev/jumpstarter

ambient-code · 2026-04-16T08:48:53Z

Summary

Fix race condition where conn_tg gets cancelled before before_lease_hook.set() is reached for no-hook exporters, causing _cleanup_after_lease() to deadlock forever on before_lease_hook.wait() inside a shielded scope
Add unconditional before_lease_hook.set() in handle_lease()'s finally block as primary fix
Add dynamic safety timeout in _cleanup_after_lease() as defense-in-depth against future similar races — uses the configured hook timeout + 30s margin when a before_lease hook is present, or 15s default otherwise (addresses review feedback from @mangelajo)
Fix diff-cover path resolution: add [tool.coverage.run] source = ["."] to the jumpstarter package's pyproject.toml so that coverage.xml records the correct source directory, enabling diff-cover to match coverage data against git diff paths

Fixes #567
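
The race and the primary fix can be sketched in a few lines. This is a minimal illustration, not the exporter's code: it uses the standard library's asyncio instead of the task group the exporter uses, and `handle_lease_sketch` and `connection_work` are hypothetical names. It returns whether the event is set by the time cleanup would start, that is, whether an unbounded `wait()` would return at all.

```python
import asyncio


async def handle_lease_sketch(set_in_finally: bool) -> bool:
    """Return True if the event is set when cleanup would begin waiting on it."""
    before_lease_hook = asyncio.Event()

    async def connection_work() -> None:
        # Stand-in for the work done inside conn_tg: set() is only reached
        # if the task survives long enough.
        await asyncio.sleep(1)
        before_lease_hook.set()

    task = asyncio.create_task(connection_work())
    try:
        await asyncio.sleep(0)  # let the task start running
        task.cancel()           # the lease ends early, before set() is reached
        await asyncio.gather(task, return_exceptions=True)
    finally:
        if set_in_finally:
            # Primary fix: set the event unconditionally before cleanup runs.
            before_lease_hook.set()

    # Cleanup would call before_lease_hook.wait() here; if the event is not
    # set, that wait never returns.
    return before_lease_hook.is_set()


print(asyncio.run(handle_lease_sketch(set_in_finally=False)))
print(asyncio.run(handle_lease_sketch(set_in_finally=True)))
```

Without the `finally` set, the event stays unset after the cancellation, which corresponds to the deadlock pattern reported in the issue.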

Test plan

All existing exporter tests pass (exporter_test.py, exporter_retry_test.py, lease_context_test.py, hooks_test.py)
New tests cover safety timeout code paths (fixes diff-coverage CI failure)
Ruff linting passes
Local diff-cover reports 96% total coverage (80% for exporter.py, 100% for test file)
E2E tests should no longer show the deadlock pattern described in the issue

🤖 Generated with Claude Code

coderabbitai · 2026-04-16T08:49:00Z

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cde751d0-b5ec-4e8a-ba7a-7b2ad4f5df86

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


mangelajo · 2026-04-16T09:06:05Z

-            await lease_scope.before_lease_hook.wait()
+            # Safety timeout: prevent permanent deadlock if before_lease_hook
+            # was never set due to a race (e.g. conn_tg cancelled early).
+            with move_on_after(30) as timeout_scope:


I think a problem could arise here if 30 < the configured before_lease hook timeout in the exporter

Good catch. You're right that the hardcoded 30-second timeout could conflict with the configured hook timeout (which defaults to 120s and is user-configurable).

That said, looking at the code flow more carefully, this timeout should never actually fire in practice with the primary fix in place:

run_before_lease_hook (which runs the actual hook) always calls before_lease_hook.set() in its own finally block (hooks.py line 623)

When there's no hook, the new finally block at line 754-755 in handle_lease() sets it before _cleanup_after_lease is even called

So by the time _cleanup_after_lease reaches the wait(), the event should already be set. The 30-second timeout is purely defense-in-depth for unforeseen edge cases.

Still, I agree it would be better to either:

Use the configured hook timeout (e.g. self.hook_executor.config.before_lease.timeout) plus a margin, or

Use a longer default (e.g. 300s) since this is a last-resort safety net and not a normal code path

I'll update the timeout to be derived from the hook's configured timeout when a hook executor is present, falling back to a generous default otherwise.

ambient-code · 2026-04-16T09:33:49Z

Fix plan

Addressing @mangelajo's feedback and CI failures:

1. Dynamic safety timeout (addresses review feedback)

The hardcoded 30-second safety timeout in _cleanup_after_lease could conflict with the user-configured hook timeout (default 120s). Will change it to:

Use self.hook_executor.config.before_lease.timeout + 30 when a hook executor with a before_lease hook is configured
Fall back to 300 seconds as a generous default when no hook is configured (since the primary fix in the finally block should handle the no-hook case instantly)
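
The selection rule in the two points above could look roughly like the following. This is a sketch under assumptions: `choose_safety_timeout` is a hypothetical helper, and the hook timeout is passed in as a plain number rather than read from `self.hook_executor.config.before_lease.timeout`. The fallback is a parameter because its value was still under discussion in this thread.

```python
from typing import Optional


def choose_safety_timeout(
    before_lease_timeout: Optional[float],
    margin: float = 30.0,
    no_hook_default: float = 300.0,
) -> float:
    """Pick how long cleanup may wait for before_lease_hook to be set."""
    if before_lease_timeout is not None:
        # A before_lease hook is configured: never cut it off early.
        return before_lease_timeout + margin
    # No hook: the event should already be set, so this is a last resort.
    return no_hook_default


print(choose_safety_timeout(120))   # hook with the default 120s timeout
print(choose_safety_timeout(None))  # no hook configured
```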

2. Add test coverage for new code paths (fixes CI coverage failure)

The pytest-matrix checks are failing because the new lines (602-605, 608, 754-755) have 0% diff coverage. Will add tests covering:

The finally block ensuring before_lease_hook is set when conn_tg is cancelled early
The safety timeout path in _cleanup_after_lease
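
A test for the safety timeout path might take a shape like this. It is only a sketch: `cleanup_wait_sketch` is a hypothetical stand-in for the wait inside `_cleanup_after_lease`, and it uses asyncio's `wait_for` where the exporter uses `move_on_after`.

```python
import asyncio


async def cleanup_wait_sketch(before_lease_hook: asyncio.Event, safety_timeout: float) -> bool:
    """Wait for the event with a bound; return True if the safety timeout fired."""
    try:
        await asyncio.wait_for(before_lease_hook.wait(), timeout=safety_timeout)
    except asyncio.TimeoutError:
        # The event was never set: give up waiting so cleanup can continue.
        return True
    return False


async def run_both_cases() -> tuple:
    never_set = asyncio.Event()
    timed_out = await cleanup_wait_sketch(never_set, 0.05)

    already_set = asyncio.Event()
    already_set.set()
    timed_out_when_set = await cleanup_wait_sketch(already_set, 0.05)
    return timed_out, timed_out_when_set


print(asyncio.run(run_both_cases()))
```

The first case exercises the timeout branch; the second confirms that the normal path returns immediately when the event is already set.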

3. type-check-python failure

This failure is in jumpstarter_driver_renode/driver_test.py and is unrelated to this PR's changes.

mangelajo · 2026-04-16T10:14:59Z

+            # was never set due to a race (e.g. conn_tg cancelled early).
+            # Use the configured hook timeout (+ margin) when available so we
+            # never interrupt a legitimately-running beforeLease hook.
+            safety_timeout = 300  # generous default for no-hook / unknown cases


Suggested change

-            safety_timeout = 300  # generous default for no-hook / unknown cases
+            safety_timeout = 15  # generous default for no-hook / unknown cases

For the no-hook case I would be even more aggressive, to match the server/client 30 sec connection timeouts.

Applied -- changed the default to 15s. Good call on aligning it with the server/client connection timeouts. Also rebased onto main to pick up the type-check fix from #568.

mangelajo · 2026-04-16T10:22:23Z

This is failing coverage checks, and could benefit from a rebase (that fixes type checking linter)

ambient-code · 2026-04-16T10:32:30Z

Fix plan (addressing latest feedback)

1. Apply suggestion: lower default safety timeout to 15s

@mangelajo's suggestion makes sense — the 300s default is only used when there's no hook executor configured. In the no-hook case, before_lease_hook should be set nearly instantly by the finally block, so 15s is more than enough and aligns better with the server/client 30-second connection timeouts.

2. Rebase onto main

This will pick up the type-check fix for jumpstarter_driver_renode/driver_test.py which is unrelated to this PR but is blocking CI.

3. Coverage

The existing tests cover the timeout logic but the coverage tool doesn't pick them up because they patch move_on_after at the module level rather than exercising the actual production code paths. Will verify the tests properly register in coverage after the rebase.

For exporters without a hook_executor, the before_lease_hook event is only set inside the conn_tg task group. When a lease ends quickly (before conn_tg reaches the set() call), wait_for_lease_end() cancels conn_tg, and before_lease_hook is never set. Then _cleanup_after_lease() blocks forever on before_lease_hook.wait() inside a shielded scope, which also prevents after_lease_hook_done from being set, causing serve() to hang permanently.

Two fixes applied:

1. Primary: Set before_lease_hook unconditionally in the finally block of handle_lease(), before calling _cleanup_after_lease.
2. Defense-in-depth: Add a 30-second safety timeout with move_on_after in _cleanup_after_lease when waiting for before_lease_hook.

Fixes #567

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@mangelajo

Address review feedback from @mangelajo: the hardcoded 30s safety timeout in _cleanup_after_lease could conflict with the user-configured hook timeout (default 120s). Now uses hook_timeout + 30s margin when a before_lease hook is configured, falling back to 300s otherwise.

Add tests covering the safety timeout and finally-block code paths to fix the diff-coverage CI failure (lines 602-608, 754-755).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add [tool.coverage.run] source = ["."] to the jumpstarter package's pyproject.toml so that coverage.xml records the correct source directory path. Without this, the <source> element in coverage.xml was empty, causing diff-cover to fail to match coverage data against git diff paths, resulting in 0% reported coverage for all changed lines despite tests actually covering them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ambient-code · 2026-04-16T11:48:55Z

Coverage fix pushed

The pytest-matrix coverage failures were caused by a path mismatch between coverage.xml and diff-cover.

Root cause: When --cov is used without specifying a source (e.g. --cov instead of --cov=.), coverage.py leaves the <source> element in coverage.xml empty. diff-cover needs this element to resolve coverage file paths against git diff paths. Since git diff always uses repo-relative paths (e.g. python/packages/jumpstarter/jumpstarter/exporter/exporter.py) but coverage.xml uses package-relative paths (e.g. jumpstarter/exporter/exporter.py), the match fails silently — reporting 0% coverage despite all tests actually passing and covering the code.
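
The mismatch can be modelled with a small path check. This is a simplified model of the matching step, not diff-cover's actual implementation; `matches_diff_path` and the `/repo` checkout location are hypothetical.

```python
from pathlib import PurePosixPath


def matches_diff_path(source_root: str, coverage_filename: str,
                      repo_root: str, diff_path: str) -> bool:
    """Check whether a coverage.xml entry lines up with a repo-relative diff path."""
    if not source_root:
        # Empty <source>: only the package-relative filename is available.
        candidate = PurePosixPath(coverage_filename)
    else:
        full = PurePosixPath(source_root) / coverage_filename
        try:
            candidate = full.relative_to(repo_root)
        except ValueError:
            return False  # source root lies outside the repository
    return candidate == PurePosixPath(diff_path)


diff_path = "python/packages/jumpstarter/jumpstarter/exporter/exporter.py"
cov_file = "jumpstarter/exporter/exporter.py"

# Empty <source>: the package-relative name never equals the repo-relative one.
print(matches_diff_path("", cov_file, "/repo", diff_path))
# Populated <source>: the joined path resolves to the same repo-relative path.
print(matches_diff_path("/repo/python/packages/jumpstarter", cov_file, "/repo", diff_path))
```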

Fix: Added [tool.coverage.run] source = ["."] to python/packages/jumpstarter/pyproject.toml. This tells coverage.py to record the current directory as the source root, which populates the <source> element in coverage.xml with the correct absolute path. diff-cover can then correctly match coverage data to git diff paths.
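
For reference, the setting described above, written out as it would appear in python/packages/jumpstarter/pyproject.toml (the rest of the file is omitted here):

```toml
[tool.coverage.run]
source = ["."]
```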

With this fix, local diff-cover reports 80% coverage on exporter.py (only the handle_lease finally-block at lines 762-763 is uncovered, which requires a full gRPC setup to test) and 100% on the test file. Total diff coverage is 96%, well above the 80% threshold.

Note: The CI workflow (.github/workflows/python-tests.yaml) also uses PYTEST_ADDOPTS: "--cov --cov-report=xml" — the bare --cov there is no longer an issue because coverage.py reads the source from the package's pyproject.toml config when source=None is passed by pytest-cov. However, for consistency it would be good to also change it to --cov=. --cov-report=xml in a follow-up (bot can't push workflow changes due to missing workflows permission).

Key beforeLease scheduling and cleanup waits off the configured before_lease hook instead of hook_executor presence. This preserves the deadlock fix from #569 without letting after_lease-only configs or lease cleanup race on a synthetic before_lease_hook signal. Fixes #613 Made-with: Cursor

ambient-code bot mentioned this pull request Apr 16, 2026

Flaky E2E: exporter deadlocks when lease ends before before_lease_hook is set (no-hook exporters) #567

Closed

mangelajo reviewed Apr 16, 2026


Ambient Code Bot and others added 3 commits April 16, 2026 10:33

Lower default safety timeout from 300s to 15s per review feedback

01f109c

ambient-code bot force-pushed the fix/exporter-deadlock-before-lease-hook-567 branch from 540dafd to 01f109c Compare April 16, 2026 10:33

mangelajo approved these changes Apr 16, 2026


mangelajo merged commit aaf98be into main Apr 16, 2026
32 checks passed

mangelajo mentioned this pull request Apr 17, 2026

Exporter should key beforeLease orchestration on configured before_lease hook #613

Open

This was referenced Apr 17, 2026

Fix beforeLease gating for afterLease-only exporters #614

Open

Implement JEP-0012: explicit lease lifecycle FSM for exporter #618

Closed

Fix exporter deadlock when lease ends before before_lease_hook is set #569
mangelajo merged 4 commits into main from fix/exporter-deadlock-before-lease-hook-567
