fix: lease-acquisition loop can fail with temporary grpc issues#761
fix: lease-acquisition loop can fail with temporary grpc issues#761bennyz merged 1 commit intojumpstarter-dev:mainfrom
Conversation
a lease acquisition loop can fail in the following way very easily if a router in openshift/k8s is restarted during the lease get operation. this retry mechanism avoids this issue. ``` 2025-11-28T15:22:39.557321889Z for lease: There are 9 approved 2025-11-28T15:22:39.557321889Z exporters, (i.e. ) but all of them are 2025-11-28T15:22:39.557321889Z already leased (7:30:58) 2025-11-28T15:22:44.577566967Z [11/28/25 15:22:44] INFO INFO:jumpstarter.client.lease:Waiting lease.py:355 2025-11-28T15:22:44.577566967Z for lease: There are 9 approved 2025-11-28T15:22:44.577566967Z exporters, (i.e. ) but all of them are 2025-11-28T15:22:44.577566967Z already leased (7:31:03) 2025-11-28T15:22:56.436248186Z Error: grpc error: Socket closed ```
✅ Deploy Preview for jumpstarter-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
WalkthroughThe changes introduce a retry mechanism with exponential backoff to the lease acquisition process in the Jumpstarter client. A new private method Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
packages/jumpstarter/jumpstarter/client/lease.py (1)
81-98: Consider using tenacity's built-in logging to reduce noise.The current implementation logs an error on every retry attempt (line 96), which could be noisy during transient network issues. Tenacity provides a cleaner pattern using the
before_sleepcallback.Apply this diff to use tenacity's built-in logging mechanism:
-from tenacity import retry, retry_if_exception_type, wait_exponential_jitter +from tenacity import retry, retry_if_exception_type, wait_exponential_jitter, before_sleep_log @retry( wait=wait_exponential_jitter(initial=1, max=120, jitter=1), retry=retry_if_exception_type(ConnectionError), + before_sleep=before_sleep_log(logger, logging.WARNING), reraise=True, ) async def _get_with_retry(self): """Get lease with exponential backoff retry on ConnectionError. Retries with exponential backoff and jitter indefinitely when ConnectionError occurs. The wait time between retries is capped at 2 minutes (120 seconds). Jitter helps prevent thundering herd problems when multiple clients retry simultaneously. """ - try: - return await self.get() - except ConnectionError as e: - logger.error("Error while getting lease %s: %s", self.name, e) - raise + return await self.get()This approach logs warnings before each retry sleep instead of errors on every attempt, providing clearer visibility into retry behavior without excessive noise.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
packages/jumpstarter/jumpstarter/client/lease.py(3 hunks)packages/jumpstarter/pyproject.toml(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
packages/*/pyproject.toml
📄 CodeRabbit inference engine (.cursor/rules/project-structure.mdc)
Each package's
pyproject.tomlmust include project metadata with Apache-2.0 license only
Files:
packages/jumpstarter/pyproject.toml
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/project-structure.mdc)
Ruff should be used for code formatting and linting, excluding
jumpstarter-protocolpackage
Files:
packages/jumpstarter/jumpstarter/client/lease.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: michalskrivanek
Repo: jumpstarter-dev/jumpstarter PR: 704
File: packages/jumpstarter/jumpstarter/client/grpc.py:100-107
Timestamp: 2025-10-14T17:43:07.788Z
Learning: In the Jumpstarter client lease model (packages/jumpstarter/jumpstarter/client/grpc.py), `effective_duration` represents the elapsed time for an active lease so far, not the total duration. To calculate expected release time, use `effective_begin_time + duration` (where `duration` is the configured/requested duration), not `effective_begin_time + effective_duration`.
Learnt from: michalskrivanek
Repo: jumpstarter-dev/jumpstarter PR: 589
File: packages/jumpstarter/jumpstarter/exporter/exporter.py:147-155
Timestamp: 2025-08-14T13:11:35.034Z
Learning: In the jumpstarter fork-based architecture, when the exporter's serve() method exits (e.g., due to lease changes), the child process terminates and the parent process automatically restarts it, eliminating concerns about orphaned tasks since the entire process is restarted.
📚 Learning: 2025-11-27T09:58:55.346Z
Learnt from: CR
Repo: jumpstarter-dev/jumpstarter PR: 0
File: .cursor/rules/project-structure.mdc:0-0
Timestamp: 2025-11-27T09:58:55.346Z
Learning: Applies to packages/jumpstarter-cli-*/pyproject.toml : CLI packages must depend on `jumpstarter` and `jumpstarter-cli-common` in their `pyproject.toml`
Applied to files:
packages/jumpstarter/pyproject.toml
📚 Learning: 2025-11-27T09:58:55.346Z
Learnt from: CR
Repo: jumpstarter-dev/jumpstarter PR: 0
File: .cursor/rules/project-structure.mdc:0-0
Timestamp: 2025-11-27T09:58:55.346Z
Learning: Applies to packages/jumpstarter-driver-*/pyproject.toml : Driver packages must depend on `jumpstarter` and specific hardware libraries in their `pyproject.toml`
Applied to files:
packages/jumpstarter/pyproject.toml
📚 Learning: 2025-11-27T09:58:55.346Z
Learnt from: CR
Repo: jumpstarter-dev/jumpstarter PR: 0
File: .cursor/rules/project-structure.mdc:0-0
Timestamp: 2025-11-27T09:58:55.346Z
Learning: Core packages must depend on `jumpstarter-protocol`
Applied to files:
packages/jumpstarter/pyproject.toml
📚 Learning: 2025-11-27T09:58:55.346Z
Learnt from: CR
Repo: jumpstarter-dev/jumpstarter PR: 0
File: .cursor/rules/project-structure.mdc:0-0
Timestamp: 2025-11-27T09:58:55.346Z
Learning: Applies to packages/jumpstarter-driver-*/pyproject.toml : Driver packages must register via the `jumpstarter.drivers` entry point in `pyproject.toml`
Applied to files:
packages/jumpstarter/pyproject.toml
📚 Learning: 2025-11-27T09:58:55.346Z
Learnt from: CR
Repo: jumpstarter-dev/jumpstarter PR: 0
File: .cursor/rules/project-structure.mdc:0-0
Timestamp: 2025-11-27T09:58:55.346Z
Learning: Applies to packages/jumpstarter-driver-*/pyproject.toml : Driver packages must follow the naming pattern `jumpstarter-driver-<name>`
Applied to files:
packages/jumpstarter/pyproject.toml
📚 Learning: 2025-10-14T17:43:07.788Z
Learnt from: michalskrivanek
Repo: jumpstarter-dev/jumpstarter PR: 704
File: packages/jumpstarter/jumpstarter/client/grpc.py:100-107
Timestamp: 2025-10-14T17:43:07.788Z
Learning: In the Jumpstarter client lease model (packages/jumpstarter/jumpstarter/client/grpc.py), `effective_duration` represents the elapsed time for an active lease so far, not the total duration. To calculate expected release time, use `effective_begin_time + duration` (where `duration` is the configured/requested duration), not `effective_begin_time + effective_duration`.
Applied to files:
packages/jumpstarter/jumpstarter/client/lease.py
📚 Learning: 2025-10-02T15:29:09.070Z
Learnt from: mangelajo
Repo: jumpstarter-dev/jumpstarter PR: 681
File: packages/jumpstarter/jumpstarter/client/lease.py:198-220
Timestamp: 2025-10-02T15:29:09.070Z
Learning: In the jumpstarter codebase, `client_from_path` and `client_from_channel` functions in `packages/jumpstarter/jumpstarter/client/client.py` automatically call `GetReport(empty_pb2.Empty())` during client initialization. This means that successfully entering the client context manager guarantees the exporter is responsive and has answered the GetReport RPC.
Applied to files:
packages/jumpstarter/jumpstarter/client/lease.py
🧬 Code graph analysis (1)
packages/jumpstarter/jumpstarter/client/lease.py (1)
packages/jumpstarter/jumpstarter/common/exceptions.py (1)
ConnectionError(42-45)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Redirect rules - jumpstarter-docs
- GitHub Check: Header rules - jumpstarter-docs
- GitHub Check: build
- GitHub Check: pytest-matrix (macos-15, 3.13)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.12)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.11)
- GitHub Check: pytest-matrix (ubuntu-24.04, 3.13)
- GitHub Check: pytest-matrix (macos-15, 3.12)
- GitHub Check: pytest-matrix (macos-15, 3.11)
- GitHub Check: e2e
- GitHub Check: Pages changed - jumpstarter-docs
🔇 Additional comments (3)
packages/jumpstarter/jumpstarter/client/lease.py (2)
26-26: LGTM!The import correctly brings in the necessary tenacity utilities for implementing the retry mechanism with exponential backoff.
158-158: LGTM!The integration of
_get_with_retry()into the acquisition loop correctly handles transient gRPC connection errors with exponential backoff, while the outerfail_after()timeout ensures the process doesn't hang indefinitely.packages/jumpstarter/pyproject.toml (1)
22-22: Version 8.2.0 already supports async retry—no version change needed unless you require newer patch updates.Tenacity 8.2.0 includes async support via
AsyncRetryingand decorators for coroutines, with async-related bug fixes in that release. While patch versions like 8.2.2 are available, 8.2.0 is stable and functional for async retry use cases. If the intent was to allow newer patch releases (e.g.,tenacity>=8.2.0instead of==8.2.0), clarify the versioning constraint.Likely an incorrect or invalid review comment.
|
Successfully created backport PR for |
a lease acquisition loop can fail in the following way very easily if a router in openshift/k8s is restarted during the lease get operation.
this retry mechanism avoids this issue.
Summary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.