Skip to content
This repository was archived by the owner on Jan 23, 2026. It is now read-only.

fix: lease-acquisition loop can fail with temporary grpc issues#761

Merged
bennyz merged 1 commit intojumpstarter-dev:mainfrom
mangelajo:retry-lease-monitor
Dec 2, 2025
Merged

fix: lease-acquisition loop can fail with temporary grpc issues#761
bennyz merged 1 commit intojumpstarter-dev:mainfrom
mangelajo:retry-lease-monitor

Conversation

@mangelajo
Copy link
Copy Markdown
Member

@mangelajo mangelajo commented Dec 1, 2025

a lease acquisition loop can fail in the following way very easily if a router in openshift/k8s is restarted during the lease get operation.

this retry mechanism avoids this issue.

2025-11-28T15:22:39.557321889Z for lease: There are 9 approved
2025-11-28T15:22:39.557321889Z exporters, (i.e. ) but all of them are
2025-11-28T15:22:39.557321889Z already leased (7:30:58)
2025-11-28T15:22:44.577566967Z [11/28/25 15:22:44] INFO INFO:jumpstarter.client.lease:Waiting lease.py:355
2025-11-28T15:22:44.577566967Z for lease: There are 9 approved
2025-11-28T15:22:44.577566967Z exporters, (i.e. ) but all of them are
2025-11-28T15:22:44.577566967Z already leased (7:31:03)
2025-11-28T15:22:56.436248186Z Error: grpc error: Socket closed

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Enhanced lease operations with automatic retry logic and exponential backoff for improved connection handling during transient failures.

✏️ Tip: You can customize this high-level summary in your review settings.

a lease acquisition loop can fail in the following way very easily if a router
in openshift/k8s is restarted during the lease get operation.

this retry mechanism avoids this issue.

```
2025-11-28T15:22:39.557321889Z for lease: There are 9 approved
2025-11-28T15:22:39.557321889Z exporters, (i.e. ) but all of them are
2025-11-28T15:22:39.557321889Z already leased (7:30:58)
2025-11-28T15:22:44.577566967Z [11/28/25 15:22:44] INFO INFO:jumpstarter.client.lease:Waiting lease.py:355
2025-11-28T15:22:44.577566967Z for lease: There are 9 approved
2025-11-28T15:22:44.577566967Z exporters, (i.e. ) but all of them are
2025-11-28T15:22:44.577566967Z already leased (7:31:03)
2025-11-28T15:22:56.436248186Z Error: grpc error: Socket closed
```
@netlify
Copy link
Copy Markdown

netlify Bot commented Dec 1, 2025

Deploy Preview for jumpstarter-docs ready!

Name Link
🔨 Latest commit 78ad580
🔍 Latest deploy log https://app.netlify.com/projects/jumpstarter-docs/deploys/692dae55e8eb4a000853fe1d
😎 Deploy Preview https://deploy-preview-761--jumpstarter-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 1, 2025

Walkthrough

The changes introduce a retry mechanism with exponential backoff to the lease acquisition process in the Jumpstarter client. A new private method _get_with_retry() wraps lease retrieval operations and automatically retries on connection errors, while _acquire() is updated to use this resilient method during polling.

Changes

Cohort / File(s) Summary
Lease acquisition resilience
packages/jumpstarter/jumpstarter/client/lease.py
Added tenacity import; introduced _get_with_retry() private method with retry decorator and error logging to handle GetLease operations with exponential backoff on ConnectionError; refactored _acquire() to call _get_with_retry() instead of direct get() calls.
Dependency management
packages/jumpstarter/pyproject.toml
Added tenacity >=8.2.0 to runtime dependencies.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify the tenacity retry configuration (decorator parameters, backoff strategy, exception types) is correct for the use case
  • Confirm that replacing get() with _get_with_retry() in the polling loop maintains intended behavior and doesn't mask unexpected failures
  • Validate that the new dependency version is compatible with existing requirements

Poem

🐰 With tenacity's gentle hand,
Lost leases now firmly stand,
Through errors we exponentially hop,
Each retry ensures we don't stop,
A rabbit's resilience—hurrah! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main change: adding retry logic to handle temporary gRPC issues in the lease-acquisition loop.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mangelajo mangelajo requested review from bennyz and bkhizgiy December 1, 2025 15:03
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
packages/jumpstarter/jumpstarter/client/lease.py (1)

81-98: Consider using tenacity's built-in logging to reduce noise.

The current implementation logs an error on every retry attempt (line 96), which could be noisy during transient network issues. Tenacity provides a cleaner pattern using the before_sleep callback.

Apply this diff to use tenacity's built-in logging mechanism:

-from tenacity import retry, retry_if_exception_type, wait_exponential_jitter
+from tenacity import retry, retry_if_exception_type, wait_exponential_jitter, before_sleep_log

 @retry(
     wait=wait_exponential_jitter(initial=1, max=120, jitter=1),
     retry=retry_if_exception_type(ConnectionError),
+    before_sleep=before_sleep_log(logger, logging.WARNING),
     reraise=True,
 )
 async def _get_with_retry(self):
     """Get lease with exponential backoff retry on ConnectionError.
     
     Retries with exponential backoff and jitter indefinitely when ConnectionError occurs.
     The wait time between retries is capped at 2 minutes (120 seconds).
     Jitter helps prevent thundering herd problems when multiple clients retry simultaneously.
     """
-    try:
-        return await self.get()
-    except ConnectionError as e:
-        logger.error("Error while getting lease %s: %s", self.name, e)
-        raise
+    return await self.get()

This approach logs warnings before each retry sleep instead of errors on every attempt, providing clearer visibility into retry behavior without excessive noise.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between afaf058 and 78ad580.

📒 Files selected for processing (2)
  • packages/jumpstarter/jumpstarter/client/lease.py (3 hunks)
  • packages/jumpstarter/pyproject.toml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
packages/*/pyproject.toml

📄 CodeRabbit inference engine (.cursor/rules/project-structure.mdc)

Each package's pyproject.toml must include project metadata with Apache-2.0 license only

Files:

  • packages/jumpstarter/pyproject.toml
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/project-structure.mdc)

Ruff should be used for code formatting and linting, excluding jumpstarter-protocol package

Files:

  • packages/jumpstarter/jumpstarter/client/lease.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: michalskrivanek
Repo: jumpstarter-dev/jumpstarter PR: 704
File: packages/jumpstarter/jumpstarter/client/grpc.py:100-107
Timestamp: 2025-10-14T17:43:07.788Z
Learning: In the Jumpstarter client lease model (packages/jumpstarter/jumpstarter/client/grpc.py), `effective_duration` represents the elapsed time for an active lease so far, not the total duration. To calculate expected release time, use `effective_begin_time + duration` (where `duration` is the configured/requested duration), not `effective_begin_time + effective_duration`.
Learnt from: michalskrivanek
Repo: jumpstarter-dev/jumpstarter PR: 589
File: packages/jumpstarter/jumpstarter/exporter/exporter.py:147-155
Timestamp: 2025-08-14T13:11:35.034Z
Learning: In the jumpstarter fork-based architecture, when the exporter's serve() method exits (e.g., due to lease changes), the child process terminates and the parent process automatically restarts it, eliminating concerns about orphaned tasks since the entire process is restarted.
📚 Learning: 2025-11-27T09:58:55.346Z
Learnt from: CR
Repo: jumpstarter-dev/jumpstarter PR: 0
File: .cursor/rules/project-structure.mdc:0-0
Timestamp: 2025-11-27T09:58:55.346Z
Learning: Applies to packages/jumpstarter-cli-*/pyproject.toml : CLI packages must depend on `jumpstarter` and `jumpstarter-cli-common` in their `pyproject.toml`

Applied to files:

  • packages/jumpstarter/pyproject.toml
📚 Learning: 2025-11-27T09:58:55.346Z
Learnt from: CR
Repo: jumpstarter-dev/jumpstarter PR: 0
File: .cursor/rules/project-structure.mdc:0-0
Timestamp: 2025-11-27T09:58:55.346Z
Learning: Applies to packages/jumpstarter-driver-*/pyproject.toml : Driver packages must depend on `jumpstarter` and specific hardware libraries in their `pyproject.toml`

Applied to files:

  • packages/jumpstarter/pyproject.toml
📚 Learning: 2025-11-27T09:58:55.346Z
Learnt from: CR
Repo: jumpstarter-dev/jumpstarter PR: 0
File: .cursor/rules/project-structure.mdc:0-0
Timestamp: 2025-11-27T09:58:55.346Z
Learning: Core packages must depend on `jumpstarter-protocol`

Applied to files:

  • packages/jumpstarter/pyproject.toml
📚 Learning: 2025-11-27T09:58:55.346Z
Learnt from: CR
Repo: jumpstarter-dev/jumpstarter PR: 0
File: .cursor/rules/project-structure.mdc:0-0
Timestamp: 2025-11-27T09:58:55.346Z
Learning: Applies to packages/jumpstarter-driver-*/pyproject.toml : Driver packages must register via the `jumpstarter.drivers` entry point in `pyproject.toml`

Applied to files:

  • packages/jumpstarter/pyproject.toml
📚 Learning: 2025-11-27T09:58:55.346Z
Learnt from: CR
Repo: jumpstarter-dev/jumpstarter PR: 0
File: .cursor/rules/project-structure.mdc:0-0
Timestamp: 2025-11-27T09:58:55.346Z
Learning: Applies to packages/jumpstarter-driver-*/pyproject.toml : Driver packages must follow the naming pattern `jumpstarter-driver-<name>`

Applied to files:

  • packages/jumpstarter/pyproject.toml
📚 Learning: 2025-10-14T17:43:07.788Z
Learnt from: michalskrivanek
Repo: jumpstarter-dev/jumpstarter PR: 704
File: packages/jumpstarter/jumpstarter/client/grpc.py:100-107
Timestamp: 2025-10-14T17:43:07.788Z
Learning: In the Jumpstarter client lease model (packages/jumpstarter/jumpstarter/client/grpc.py), `effective_duration` represents the elapsed time for an active lease so far, not the total duration. To calculate expected release time, use `effective_begin_time + duration` (where `duration` is the configured/requested duration), not `effective_begin_time + effective_duration`.

Applied to files:

  • packages/jumpstarter/jumpstarter/client/lease.py
📚 Learning: 2025-10-02T15:29:09.070Z
Learnt from: mangelajo
Repo: jumpstarter-dev/jumpstarter PR: 681
File: packages/jumpstarter/jumpstarter/client/lease.py:198-220
Timestamp: 2025-10-02T15:29:09.070Z
Learning: In the jumpstarter codebase, `client_from_path` and `client_from_channel` functions in `packages/jumpstarter/jumpstarter/client/client.py` automatically call `GetReport(empty_pb2.Empty())` during client initialization. This means that successfully entering the client context manager guarantees the exporter is responsive and has answered the GetReport RPC.

Applied to files:

  • packages/jumpstarter/jumpstarter/client/lease.py
🧬 Code graph analysis (1)
packages/jumpstarter/jumpstarter/client/lease.py (1)
packages/jumpstarter/jumpstarter/common/exceptions.py (1)
  • ConnectionError (42-45)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Redirect rules - jumpstarter-docs
  • GitHub Check: Header rules - jumpstarter-docs
  • GitHub Check: build
  • GitHub Check: pytest-matrix (macos-15, 3.13)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
  • GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
  • GitHub Check: pytest-matrix (macos-15, 3.12)
  • GitHub Check: pytest-matrix (macos-15, 3.11)
  • GitHub Check: e2e
  • GitHub Check: Pages changed - jumpstarter-docs
🔇 Additional comments (3)
packages/jumpstarter/jumpstarter/client/lease.py (2)

26-26: LGTM!

The import correctly brings in the necessary tenacity utilities for implementing the retry mechanism with exponential backoff.


158-158: LGTM!

The integration of _get_with_retry() into the acquisition loop correctly handles transient gRPC connection errors with exponential backoff, while the outer fail_after() timeout ensures the process doesn't hang indefinitely.

packages/jumpstarter/pyproject.toml (1)

22-22: Version 8.2.0 already supports async retry—no version change needed unless you require newer patch updates.

Tenacity 8.2.0 includes async support via AsyncRetrying and decorators for coroutines, with async-related bug fixes in that release. While patch versions like 8.2.2 are available, 8.2.0 is stable and functional for async retry use cases. If the intent was to allow newer patch releases (e.g., tenacity>=8.2.0 instead of ==8.2.0), clarify the versioning constraint.

Likely an incorrect or invalid review comment.

@bennyz bennyz merged commit 54ca5b2 into jumpstarter-dev:main Dec 2, 2025
19 checks passed
@jumpstarter-backport-bot
Copy link
Copy Markdown

Successfully created backport PR for release-0.7:

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants