Skip to content

Implement device runtime master model#11

Merged
jmagar merged 17 commits into
mainfrom
device-runtime-master
Apr 16, 2026
Merged

Implement device runtime master model#11
jmagar merged 17 commits into
mainfrom
device-runtime-master

Conversation

@jmagar
Copy link
Copy Markdown
Owner

@jmagar jmagar commented Apr 16, 2026

Summary

  • add the device runtime master/non-master model, including role resolution, fleet state, metadata inventory, queue-backed log ingest, and master-only control-plane gating
  • add master-routed lab device and lab logs command groups plus the /v1/device/* fleet API and remote OAuth relay start route
  • document the deployment, runtime, fleet log, transport, config, observability, and error-model changes
  • ensure master-bound device traffic reuses LAB_MCP_HTTP_TOKEN when bearer auth is enabled

Verification

  • cargo test --test device_config --test device_identity --test device_scan --test device_queue --test device_api --test device_runtime --test device_master_only --test device_cli --all-features
  • cargo test -p lab@0.3.3 --all-features
  • cargo build -p lab@0.3.3 --all-features
  • cargo fmt --all --check

Verification Caveat

  • cargo clippy -p lab@0.3.3 --all-features -- -D warnings fails on an existing workspace lint baseline outside this plan, including many pre-existing issues in crates/lab-auth/**
  • cargo clippy -p lab@0.3.3 --all-features --no-deps -- -D warnings also reports existing lab crate lint debt unrelated to this plan's functional changes

Summary by cubic

Implements the device runtime with master/non‑master roles and a /v1/device/* API, and gates the operator control plane (Web UI, MCP, /v1/{service}, /v1/gateway, OpenAPI/docs, OAuth metadata) to the master. Adds master‑routed lab device/lab logs commands, moves the device‑runtime HTTP client into lab-apis with timeouts, and hardens identity, queueing, log search/retention, and OAuth relay.

  • New Features

    • Role resolution from local hostname and [device].master, with short vs FQDN equivalence; only the master serves Web UI, MCP, /v1/{service}, /v1/gateway, OpenAPI/docs, and OAuth metadata.
    • /v1/device/*: hello, status, metadata, syslog/batch, oauth/relay/start, devices list/get, and logs/search (read routes are master‑only). Log search defaults to 200 with a hard max of 1000; per‑device retention capped at 10k events.
    • In‑memory fleet store and normalized log events; durable JSONL outbound queue with ack on delivery to avoid duplicate resend on decode errors; bootstrap logs are tail‑limited and skip unreadable candidates; device IDs are normalized on ingest and read.
    • Non‑masters reuse LAB_MCP_HTTP_TOKEN when bearer auth is enabled. Master‑routed CLI: lab device list|get <device_id>, lab logs search <device_id> <query>.
    • lab-apis DeviceRuntimeClient with request timeouts and safe URL encoding; OAuth relay start validates loopback bind, returns the actual bound address for ephemeral binds, and runs in the background on the device.
  • Migration

    • Set [device].master = "<hostname>" in config.toml on non‑masters; omit to default the local host to master.
    • Ensure the master is reachable at http://<master>:<mcp.port>; if binding beyond loopback, set LAB_MCP_HTTP_TOKEN and reuse it on all devices.
    • Start lab serve --transport http on every device; non‑masters auto‑send hello, metadata, and a bootstrap log flush.
    • OAuth relay bind addresses must be loopback.
    • Query fleet from the master: lab device list|get, lab logs search <id> <query>.

Written for commit ad5c119. Summary will update on new commits.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added device fleet management system with master/non-master runtime support.
    • Added lab device list and lab device get commands to query fleet inventory.
    • Added lab logs search command for searching device logs across the fleet.
    • Added device status reporting and metadata collection capabilities.
    • Added device role configuration for fleet topology setup.

Copilot AI review requested due to automatic review settings April 16, 2026 00:43
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

Introduces a device runtime subsystem enabling Linux machines to run as master or non-master fleet members. Adds device API routes under /v1/device/*, in-memory fleet store, device role resolution, CLI commands for device and log queries, non-master queue-based upload mechanism, and role-based HTTP route gating.

Changes

Cohort / File(s) Summary
Device API Route Handlers
crates/lab/src/api/device.rs, crates/lab/src/api/device/fleet.rs, crates/lab/src/api/device/hello.rs, crates/lab/src/api/device/logs.rs, crates/lab/src/api/device/metadata.rs, crates/lab/src/api/device/oauth.rs, crates/lab/src/api/device/status.rs, crates/lab/src/api/device/syslog.rs
HTTP handlers for device check-in (hello/status/metadata), log search, syslog batch ingestion, OAuth relay startup, and fleet queries with device-ID normalization and master-only access gates.
Device Runtime Core
crates/lab/src/device.rs, crates/lab/src/device/checkin.rs, crates/lab/src/device/config_scan.rs, crates/lab/src/device/identity.rs, crates/lab/src/device/log_collect.rs, crates/lab/src/device/log_event.rs, crates/lab/src/device/master_client.rs, crates/lab/src/device/oauth.rs, crates/lab/src/device/queue.rs, crates/lab/src/device/runtime.rs, crates/lab/src/device/store.rs
Device runtime implementation: role resolution (master vs non-master), master client for remote uploads, fleet store (in-memory async), outbound queue (file-backed JSONL), bootstrap log collection, metadata discovery from AI CLI configs.
CLI Device and Logs Commands
crates/lab/src/cli/device.rs, crates/lab/src/cli/logs.rs
New top-level CLI subcommands lab device {list,get} and lab logs search with async helper functions for fetching device inventory and searching logs via MasterClient.
Configuration and State Management
crates/lab/src/config.rs, crates/lab/src/api/state.rs
Extended LabConfig with optional [device] section; added DeviceRole enum and ResolvedDeviceRuntime; augmented AppState with device_store and device_role fields plus is_master() helper.
Routing and Role-Based Gating
crates/lab/src/api/router.rs, crates/lab/src/api/web.rs, crates/lab/src/cli/serve.rs, crates/lab/src/mcp/server.rs
Role-aware HTTP routing: master-only /v1/{service}, /v1/gateway, /mcp gates; device role propagation into CLI serve handler and MCP server; non-master Web UI disabled (403 response).
Device Runtime Integration Tests
crates/lab/tests/device_api.rs, crates/lab/tests/device_cli.rs, crates/lab/tests/device_config.rs, crates/lab/tests/device_identity.rs, crates/lab/tests/device_master_only.rs, crates/lab/tests/device_queue.rs, crates/lab/tests/device_runtime.rs, crates/lab/tests/device_scan.rs
Integration and unit test coverage for device API endpoints, CLI fetch operations, config deserialization, role resolution, master-only route blocking, queue persistence/ack, runtime startup workflows, and MCP config scanning.
Lab-APIs Device-Runtime Client
crates/lab-apis/src/device_runtime.rs, crates/lab-apis/src/device_runtime/client.rs, crates/lab-apis/src/device_runtime/types.rs, crates/lab-apis/src/lib.rs
New DeviceRuntimeClient wrapper around HttpClient with async methods for device check-in, fleet queries, and log search; includes 5-second timeout wrapping and health check integration.
Device Runtime Documentation
docs/DEVICE_RUNTIME.md, docs/DEPLOY.md, docs/FLEET_LOGS.md, docs/CLI.md, docs/CONFIG.md, docs/ARCH.md, docs/ERRORS.md, docs/GATEWAY.md, docs/MCP.md, docs/OAUTH.md, docs/OPERATIONS.md, docs/OBSERVABILITY.md, docs/TRANSPORT.md, docs/README.md
Comprehensive documentation for device runtime model, deployment topology, fleet log ingestion, master-only endpoint gating, role-aware transport exposure, auth precedence, and observability integration.
Utility and Library Updates
crates/lab/src/lib.rs, crates/lab/src/main.rs, crates/lab/src/oauth/local_relay.rs, crates/lab/src/dispatch/error.rs, Cargo.toml, crates/lab-apis/Cargo.toml
Added percent-encoding workspace dependency; refactored local OAuth relay startup into bind_local_relay_listener and serve_local_relay; added ToolError::internal_message constructor; updated crate-level module exports and lint allows.
Minor Formatting and Import Reordering
crates/lab/src/api/services/helpers.rs, crates/lab/src/cli/gateway.rs, crates/lab/src/dispatch/clients.rs, crates/lab/src/dispatch/gateway/dispatch.rs, crates/lab/src/dispatch/gateway/manager.rs, crates/lab/src/dispatch/gateway/service_catalog.rs, crates/lab/src/dispatch/helpers.rs, crates/lab/src/dispatch/upstream/pool.rs, crates/lab/src/oauth/target.rs, crates/lab/src/registry.rs, crates/lab-auth/src/authorize.rs, crates/lab-auth/src/token.rs
Import ordering shuffles, multi-line formatting refactors for control flow readability; no functional logic changes.

Sequence Diagrams

sequenceDiagram
    participant NonMasterDevice as Non-Master Device Runtime
    participant Queue as Outbound Queue (JSONL)
    participant Master as Master HTTP API
    participant Store as Fleet Store

    NonMasterDevice->>NonMasterDevice: resolve_runtime_role()
    NonMasterDevice->>NonMasterDevice: collect_bootstrap_logs()
    NonMasterDevice->>Queue: queue_syslog_batch(logs)
    NonMasterDevice->>Master: POST /v1/device/hello
    Master->>Store: record_hello()
    Master-->>NonMasterDevice: 200 OK
    NonMasterDevice->>NonMasterDevice: discover_ai_cli_configs()
    NonMasterDevice->>Master: POST /v1/device/metadata
    Master->>Store: record_metadata()
    Master-->>NonMasterDevice: 200 OK
    NonMasterDevice->>Queue: drain_batch(100)
    Queue-->>NonMasterDevice: [QueuedEnvelope, ...]
    loop Each Envelope
        NonMasterDevice->>Master: POST /v1/device/syslog/batch
        Master->>Store: record_logs()
        Master-->>NonMasterDevice: 200 OK
    end
    NonMasterDevice->>Queue: ack_drained(count)
Loading
sequenceDiagram
    participant Operator as Operator (CLI/HTTP)
    participant Master as Master HTTP API
    participant Store as Fleet Store
    participant Devices as Non-Master Devices (Background)

    Devices->>Master: POST /v1/device/hello, status, logs (continuous)
    Devices->>Master: POST /v1/device/syslog/batch
    Master->>Store: record_hello/status/metadata/logs
    Operator->>Master: GET /v1/device/devices
    Master->>Store: list_devices()
    Store-->>Master: [DeviceSnapshot, ...]
    Master-->>Operator: 200 OK + JSON
    Operator->>Master: GET /v1/device/devices/{device_id}
    Master->>Store: device(device_id)
    Store-->>Master: DeviceSnapshot
    Master-->>Operator: 200 OK + JSON
    Operator->>Master: POST /v1/device/logs/search
    Master->>Store: search_logs_for_device(device_id, query, offset, limit)
    Store-->>Master: [DeviceLogEvent, ...]
    Master-->>Operator: 200 OK + JSON
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

This diff spans a comprehensive new subsystem with heterogeneous concerns: device role resolution (hostname/IP logic), fleet store concurrency (RwLock-backed BTreeMap), queue persistence with atomic ack semantics, device API handlers with normalization/validation, CLI client wiring, and role-based routing gates. Review requires careful attention to synchronization invariants, error propagation, master-only access controls, and device-ID validation across multiple handler entry points.

Possibly related PRs

Poem

🐰 A fleet of devices hops into place,
Master and non-master run the race,
Logs queue up, then hop away,
Role resolution guides the way! 🚀
— The Lab Rabbit 🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 54.31% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Implement device runtime master model' clearly and accurately summarizes the main change in the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch device-runtime-master

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements a role-aware “device runtime” model for lab serve, introducing a master/non-master split where the master hosts the operator control plane and fleet state, and non-master devices report metadata/logs to the master and expose only the minimal HTTP surface.

Changes:

  • Add device runtime subsystem (role resolution, master client, fleet store, outbound queue) plus /v1/device/* HTTP namespace and master-only routing gates.
  • Add master-routed CLI command groups (lab device, lab logs) backed by the master’s /v1/device/* APIs.
  • Extend docs and tests to cover the new deployment/runtime model, routing behavior, and fleet log ingestion/search.

Reviewed changes

Copilot reviewed 55 out of 55 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
docs/TRANSPORT.md Documents role-aware HTTP route layout and master-only surfaces.
docs/README.md Adds pointers to new device runtime / fleet / deploy docs.
docs/OPERATIONS.md Adds operational workflows and limits for device runtime fleets.
docs/OBSERVABILITY.md Adds device ingest observability requirements and redaction rules.
docs/OAUTH.md Documents remote device OAuth relay start route and auth expectations.
docs/MCP.md Documents that non-master devices do not expose MCP and describes command-group catalog notes.
docs/GATEWAY.md Clarifies gateway vs device runtime responsibilities and master-only gateway surface.
docs/FLEET_LOGS.md New doc describing fleet log event shape, ingest flow, query surfaces, and limits.
docs/ERRORS.md Adds device runtime-specific error taxonomy notes.
docs/DEVICE_RUNTIME.md New doc describing master/non-master roles, APIs, metadata inventory, queueing, and auth.
docs/DEPLOY.md New doc describing deployment topology and verification steps.
docs/CONFIG.md Documents new [device] config block and device-runtime auth behavior.
docs/CLI.md Documents new device and logs command groups and serve role behavior.
docs/ARCH.md Updates architecture narrative to include device runtime subsystem.
crates/lab/tests/device_scan.rs Tests AI CLI config discovery scanning.
crates/lab/tests/device_runtime.rs Tests device fleet store behavior and non-master initial uploads.
crates/lab/tests/device_queue.rs Tests durable queue persistence and reload.
crates/lab/tests/device_master_only.rs Tests non-master router gating for web/gateway surfaces.
crates/lab/tests/device_identity.rs Tests master/non-master role resolution rules.
crates/lab/tests/device_config.rs Tests parsing of new [device] config block.
crates/lab/tests/device_cli.rs Tests master-routed CLI behavior and bearer propagation.
crates/lab/tests/device_api.rs Tests /v1/device/* handlers (hello/syslog/oauth route).
crates/lab/src/mcp/server.rs Adds device role awareness to MCP service visibility.
crates/lab/src/main.rs Wires in new device module for the binary crate.
crates/lab/src/lib.rs Exposes lab as a library for integration tests and module reuse.
crates/lab/src/dispatch/error.rs Adds ToolError::internal_message helper.
crates/lab/src/device/store.rs Implements in-memory fleet store (snapshots + logs + metadata).
crates/lab/src/device/runtime.rs Implements device runtime behavior (hello/metadata/queue flush).
crates/lab/src/device/queue.rs Implements JSONL durable outbound queue for non-master uploads.
crates/lab/src/device/oauth.rs Wraps starting the local OAuth relay from device runtime context.
crates/lab/src/device/master_client.rs Implements HTTP client for non-master → master device API calls.
crates/lab/src/device/log_event.rs Defines normalized DeviceLogEvent schema.
crates/lab/src/device/log_collect.rs Adds bootstrap log collection stub returning normalized events.
crates/lab/src/device/identity.rs Implements local hostname and runtime role resolution helpers.
crates/lab/src/device/config_scan.rs Implements discovery of AI CLI MCP configs + hashing/metadata.
crates/lab/src/device/checkin.rs Defines hello/status/metadata upload request models.
crates/lab/src/device.rs Declares device subsystem modules.
crates/lab/src/config.rs Adds [device] config preferences + DeviceRole/ResolvedDeviceRuntime types.
crates/lab/src/cli/serve.rs Resolves runtime role, wires device store/role into app state, triggers initial non-master uploads, and gates MCP exposure.
crates/lab/src/cli/logs.rs Adds lab logs search CLI routed to master /v1/device/logs/search.
crates/lab/src/cli/device.rs Adds lab device list/get CLI routed to master /v1/device/devices*.
crates/lab/src/cli.rs Registers new CLI subcommands (device, logs).
crates/lab/src/catalog.rs Extends catalog to include device/logs command groups.
crates/lab/src/api/web.rs Returns 403 for web UI requests on non-master devices.
crates/lab/src/api/state.rs Adds device store + device role to app state.
crates/lab/src/api/router.rs Mounts /v1/device/* and gates master-only HTTP surfaces + MCP merge.
crates/lab/src/api/device/syslog.rs Adds /v1/device/syslog/batch ingest handler.
crates/lab/src/api/device/status.rs Adds /v1/device/status ingest handler.
crates/lab/src/api/device/oauth.rs Adds /v1/device/oauth/relay/start handler.
crates/lab/src/api/device/metadata.rs Adds /v1/device/metadata ingest handler.
crates/lab/src/api/device/logs.rs Adds /v1/device/logs/search handler with master-only gating.
crates/lab/src/api/device/hello.rs Adds /v1/device/hello ingest handler.
crates/lab/src/api/device/fleet.rs Adds master-only fleet query handlers (devices, devices/{id}).
crates/lab/src/api/device.rs Defines /v1/device/* router and shared ack shape.
crates/lab/src/api.rs Exposes new api::device module.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/TRANSPORT.md Outdated
Comment thread crates/lab/src/api/device/oauth.rs Outdated
Comment thread crates/lab/src/api/device/oauth.rs
Comment thread crates/lab/src/cli/serve.rs Outdated
Comment on lines +168 to +169
if let Err(error) = device_runtime.collect_and_flush_bootstrap_logs().await {
tracing::warn!(error = %error, "initial device log flush failed");
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect_and_flush_bootstrap_logs() is invoked unconditionally, but flush_queue_once() is a no-op on the master. If collect_bootstrap_logs() ever returns events on the master, they’ll be queued to disk and never drained. Either skip this step when device_role == Master, or make collect_and_flush_bootstrap_logs() a no-op on the master as well.

Suggested change
if let Err(error) = device_runtime.collect_and_flush_bootstrap_logs().await {
tracing::warn!(error = %error, "initial device log flush failed");
if !matches!(device_role, DeviceRole::Master) {
if let Err(error) = device_runtime.collect_and_flush_bootstrap_logs().await {
tracing::warn!(error = %error, "initial device log flush failed");
}

Copilot uses AI. Check for mistakes.
Comment thread crates/lab/src/device/runtime.rs Outdated
client.post_status(&status).await?;
ack_count += 1;
}
_ => break,
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On an unknown QueuedEnvelope.kind, this loop breaks and then returns Ok(()) (after acking earlier entries). That leaves the unknown kind at the head of the queue and future flushes will stop immediately, effectively wedging the queue without surfacing an error. Consider returning an error (or at least logging and skipping) when an unknown kind is encountered.

Suggested change
_ => break,
_ => {
if ack_count > 0 {
queue.ack_drained(ack_count).await?;
}
return Err(anyhow!(
"unknown queued envelope kind: {}",
envelope.kind
));
}

Copilot uses AI. Check for mistakes.
Comment thread crates/lab/src/catalog.rs Outdated
Comment thread crates/lab/src/device/store.rs Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0738a5748

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/lab/src/api/router.rs
Comment thread crates/lab/src/mcp/server.rs
Comment thread crates/lab/src/cli/serve.rs Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 29

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/lab/src/mcp/server.rs (1)

330-343: ⚠️ Potential issue | 🔴 Critical

Non-master MCP gating is bypassable through direct upstream tool calls.

At Line 331 the visibility block is only applied when the service exists in the local registry (svc.is_some()). For non-master devices, a caller can still invoke an upstream tool name directly and reach the proxy path (Lines 437-542), which defeats the master-only MCP gate.

🔒 Suggested fix
 async fn call_tool(
     &self,
     request: CallToolRequestParams,
     context: RequestContext<RoleServer>,
 ) -> Result<CallToolResult, ErrorData> {
     let service = request.name.as_ref().to_string();
     let raw_arguments = request.arguments.clone();
     let args = request.arguments.unwrap_or_default();
     let action = args
         .get("action")
         .and_then(|v| v.as_str())
         .unwrap_or("")
         .to_string();
     let params = args.get("params").cloned().unwrap_or(Value::Null);
+
+    if matches!(self.device_role, Some(DeviceRole::NonMaster)) {
+        let envelope = build_error(
+            &service,
+            &action,
+            "not_found",
+            "mcp surface is disabled on non-master devices",
+        );
+        return Ok(CallToolResult::error(vec![Content::text(
+            envelope.to_string(),
+        )]));
+    }

     let svc = self.registry.services().iter().find(|s| s.name == service);

Also applies to: 437-542

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/mcp/server.rs` around lines 330 - 343, The current visibility
and permission checks only run when a local service entry exists
(svc.is_some()), allowing callers to bypass MCP gating by invoking upstream
tools directly; update the logic in the handler around
registry.services().iter().find(...) so that
service_visible_on_mcp(&service).await and
action_allowed_on_mcp(&service,&action).await are enforced regardless of svc
presence (i.e., perform the MCP visibility and action checks for both local and
proxied/upstream tool invocations) and return the same CallToolResult::error via
build_error when checks fail so the proxy path (the upstream proxy handling)
cannot be used to bypass master-only MCP gates.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/lab/src/api/device/fleet.rs`:
- Around line 49-60: The helper function require_master_store currently is
private but duplicated in logs.rs; make it reusable by changing its visibility
to pub(super) or pub(crate) (depending on desired scope) so sibling device
submodules can import it, update the signature in the
crate::lab::api::device::fleet module (function require_master_store)
accordingly, and remove the duplicated logic in logs.rs by importing and calling
this shared require_master_store instead of reimplementing the check and error
construction.

In `@crates/lab/src/api/device/logs.rs`:
- Around line 28-34: The current search loads all logs via
store.logs_for_device(&payload.device_id) and collects them into memory before
filtering by needle (payload.query.to_ascii_lowercase()), which can OOM for
high-volume devices; update the SearchLogsRequest to include limit and offset
fields and change the handler to apply offset (skip) and limit (take) while
iterating the logs stream/iterator returned by logs_for_device (or add those
parameters to logs_for_device) and only filter and collect the requested window,
using payload.limit and payload.offset to drive the pagination and avoid
collecting all events.
- Around line 17-27: Duplicate master-role check and device_store extraction in
logs.rs should be replaced by the existing helper; make the helper accessible
and call it instead of reimplementing the logic. Specifically, expose or move
the require_master_store helper (currently in fleet.rs) so logs.rs can call
require_master_store(state) to perform the DeviceRole::NonMaster check and
return the configured store; remove the duplicated if
matches!(state.device_role...) block and the manual device_store
clone/ok_or_else, and use the returned store from require_master_store to
continue.

In `@crates/lab/src/api/device/oauth.rs`:
- Around line 31-43: The handler currently spawns tokio::spawn and immediately
returns Ok(super::ok()) before start_local_oauth_relay(...) has performed its
bind/startup; change this so the bind/startup happens synchronously (or at least
is awaited to confirm success) before returning success. Specifically, either
(A) call start_local_oauth_relay(payload.bind_addr, resolved_target,
timeout).await and only spawn the background task after it successfully
completes, or (B) refactor start_local_oauth_relay into a two-step API (e.g.,
start_local_oauth_relay::start_bind(...) that performs the bind and returns a
RelayHandle) and use that to verify bind success prior to returning
Ok(super::ok()); keep references to start_local_oauth_relay, tokio::spawn,
payload.bind_addr, resolved_target, timeout and the response Ok(super::ok()) so
reviewers can find and update the code paths.

In `@crates/lab/src/api/device/syslog.rs`:
- Around line 7-11: The DeviceSyslogBatch struct accepts device_id unchecked;
add validation to ensure device_id is non-empty, trimmed, and within a sane
length (e.g., max 256 chars) to avoid abuse. Implement this by either adding a
validate() method on DeviceSyslogBatch (or a TryFrom<String> wrapper type) that
checks device_id (trim, ensure not empty, check length and allowed characters)
and returns an error on failure, and call that validator from the request
handler before processing events (where DeviceSyslogBatch and DeviceLogEvent are
used); alternatively implement custom Deserialize for DeviceSyslogBatch to
enforce the same checks at deserialization time and return a descriptive error
for invalid device_id.
- Around line 13-23: handle_batch currently calls store.record_logs but does not
emit the required dispatch observability event; modify handle_batch to record a
start timestamp, await store.record_logs(&payload.device_id,
payload.events).await, compute elapsed_ms, then emit a dispatch event at the
surface boundary with fields surface="device", service="syslog", action="batch",
elapsed_ms (measured), and request_id (take from the current tracing span or
request context / generate if absent), before returning Ok(super::ok());
reference the handle_batch function, DeviceSyslogBatch payload,
AppState.device_store and the record_logs call when adding this instrumentation.

In `@crates/lab/src/api/router.rs`:
- Line 103: The is_master boolean is computed twice; compute it once in
build_router by evaluating let is_master = !matches!(state.device_role,
Some(DeviceRole::NonMaster)); and pass that is_master into build_v1_router as an
explicit parameter (or accept it in build_v1_router's signature) instead of
recomputing inside build_v1_router; update all call sites and the
build_v1_router signature to thread this parameter through so the duplicated
computation is removed.

In `@crates/lab/src/api/web.rs`:
- Around line 84-90: When detecting a non-master device in the web handler (the
matches!(state.device_role, Some(DeviceRole::NonMaster)) branch that currently
returns StatusCode::FORBIDDEN), emit a WARN-level log before returning; call the
project's logging macro (e.g., warn! or tracing::warn!) with context including
the branch reason ("web ui disabled on non-master devices") and available
identifiers from state (e.g., state.device_id or similar) and request context
(remote address or path if accessible) so the denial is observable, then return
the 403 response as before.

In `@crates/lab/src/cli/logs.rs`:
- Around line 7-8: The CLI currently imports and uses MasterClient from the lab
crate (MasterClient in logs.rs) which violates the boundary rule; move the
master log service client logic into lab-apis (create/extend lab-apis' master
client module, e.g. lab_apis::master::client::MasterClient) so all business
logic lives in lab-apis/src/master/client.rs, expose minimal methods the CLI
needs, then change crates/lab/src/cli/logs.rs to remove the direct
crate::device::master_client::MasterClient usage and instead call the new
lab-apis client methods as a thin formatting shim (only call client methods and
format output for the CLI code paths previously at lines ~29-37). Ensure the CLI
only handles presentation and error formatting while all service behavior
resides in the lab-apis client.

In `@crates/lab/src/device/config_scan.rs`:
- Around line 11-17: The DiscoveredMcpConfigFile currently exposes full local
paths and raw server config values; change the serialization/upload path to
redact sensitive data by (1) replacing DiscoveredMcpConfigFile.path with an
anonymized value (e.g., strip directories to a basename or replace with a
sentinel) and (2) replacing DiscoveredMcpConfigFile.servers values with either
redacted placeholders or deterministic non-reversible fingerprints (e.g., hash
of the JSON value) instead of the raw serde_json::Value; apply the same
redaction approach to the other structs/locations referenced in the comment (the
equivalent structs/fields around lines 45-54, 67-81, and 102-108) so no secrets
or full local paths are serialized or uploaded.

In `@crates/lab/src/device/identity.rs`:
- Around line 7-22: The resolve_local_hostname function returns raw HOSTNAME or
/etc/hostname values which can differ by case/whitespace and lacks a Windows
fallback, causing role checks that use exact equality to misclassify nodes;
normalize returned hostnames by trimming whitespace and converting to a
canonical form (e.g., to_lowercase()) and add a Windows fallback (check
"COMPUTERNAME" or use a cross-platform hostname lookup) in
resolve_local_hostname (and the similar logic at the other occurrence around
lines 25-34) so all callers compare normalized host strings consistently.

In `@crates/lab/src/device/log_collect.rs`:
- Around line 5-7: The function collect_bootstrap_logs currently returns
Ok(Vec::new()) which silently succeeds; change it to fail fast until real
collection is implemented by returning an error instead of an empty vector
(e.g., return a Result::Err with a clear message like "collect_bootstrap_logs
unimplemented" or use unimplemented!()/todo!() so callers see a hard failure).
Update the implementation inside pub fn collect_bootstrap_logs(_device_id: &str)
-> Result<Vec<DeviceLogEvent>> to produce that explicit error/placeholder rather
than Ok(empty).

In `@crates/lab/src/device/log_event.rs`:
- Around line 3-11: The DeviceLogEvent struct lacks Default and PartialEq
derives which would simplify test construction and enable direct equality
assertions; update the struct declaration for DeviceLogEvent to derive at least
Default and PartialEq in addition to the existing Debug, Clone, Serialize,
Deserialize so instances can be created with default() and compared in tests,
and ensure that all fields are compatible with Default/PartialEq (e.g., use
serde_json::Map and types that implement Default/PartialEq).

In `@crates/lab/src/device/master_client.rs`:
- Around line 8-135: MasterClient currently performs HTTP path construction,
auth injection, request sending and JSON decoding inside crates/lab (see methods
post_json, get_json, post_json_value and request on MasterClient), which
violates the guideline that those request/response responsibilities belong in
lab-apis; revert that responsibility by making MasterClient a thin wrapper that
delegates to the shared client in lab-apis: remove path constants and JSON
decoding from post_json/get_json/post_json_value and instead call into the
lab-apis client functions that accept path/payload and return parsed
serde_json::Value or Result<()>; keep only from_config and bearer_token wiring
here (or construct/pass a lab-apis client instance), and delete or simplify the
request(...) helper so auth is applied inside lab-apis client code.
- Around line 22-27: The MasterClient::with_bearer_token constructor creates its
reqwest client with Client::new() and needs bounded timeouts to avoid hangs;
replace that with reqwest::Client::builder(), set sensible .connect_timeout(...)
and overall .timeout(...), build the client and assign it to the http field in
with_bearer_token so all requests from MasterClient use those timeouts (mirror
the approach used in dispatch::qbittorrent client and tui::preview client).

In `@crates/lab/src/device/queue.rs`:
- Around line 49-64: push() and ack_drained() mutate the in-memory queue before
the disk operation completes, causing divergence on I/O failure; change both to
perform the durable file operation first (use append_entry(&self.path,
&envelope).await in push and rewrite_entries(&self.path, &new_entries).await in
ack_drained) while holding the entries mutex or using a temporary copy, then
update the in-memory Vec only after the I/O succeeds (for push: append to
entries after append_entry returns Ok; for ack_drained: create a new Vec without
the drained items, call rewrite_entries on that, and replace entries with the
new Vec only on Ok). Ensure you still hold the entries lock around the in-memory
replace to keep the transition atomic relative to concurrent callers.

In `@crates/lab/src/device/runtime.rs`:
- Around line 104-126: Both queue_syslog_batch() and flush_queue_once() reopen
DeviceOutboundQueue via DeviceOutboundQueue::open(...) which yields separate
in-memory snapshots and allows races where ack_drained() overwrites concurrently
appended entries; change the implementation to open the outbound queue once and
reuse the same DeviceOutboundQueue instance for both operations (e.g., add a
field like outbound_queue: Mutex<DeviceOutboundQueue> or lazy-init it in the
DeviceRuntime struct and use that instead of calling DeviceOutboundQueue::open
in queue_syslog_batch() and flush_queue_once()), then update
queue_syslog_batch(), flush_queue_once(), and any ack_drained() call sites to
acquire the same queue instance (and its mutex) so push/drain/ack operate on the
same snapshot and avoid stale rewrites.
- Around line 129-142: The loop over drained envelopes currently uses `_ =>
break`, which leaves the unknown envelope at the queue head and blocks all
future flushes; change that to skip/drop the unknown record instead: in the
match arm for the default case (where envelope.kind is unrecognized) emit a
warning (e.g., via the same logger/context) and drop or move the envelope to a
dead-letter path and increment ack_count so the item is removed from the
persisted queue; update the `_ => break` arm to something like `_ => {
log::warn!(... , kind = %envelope.kind); ack_count += 1; /* or send to
dead-letter via dead_letter_enqueue(envelope) */ }` while keeping the existing
syslog_batch and status handling in client.post_*.
- Around line 32-42: The code currently constructs the master's URL using a
caller-supplied master_port (for_http_master) which in practice is the local
HTTP bind port from serve; change for_http_master to stop deriving the master's
API port from the device's listen port and instead use the master's configured
API port (e.g., a port field on ResolvedDeviceRuntime such as
master_api_port/master_port if present, or read the canonical master API port
from configuration/env like LAB_MCP_HTTP_PORT). Update the for_http_master
signature to remove or ignore the caller-supplied port, build the base URL from
resolved.master_host plus the authoritative master API port field or config, and
update callers (e.g., serve) to stop passing the local bind port to
for_http_master so hello/metadata/syslog uploads target the actual master API
port.
- Around line 151-155: The method collect_and_flush_bootstrap_logs currently
always collects and queues bootstrap events then calls flush_queue_once which is
a no-op for DeviceRole::Master, causing masters to accumulate an unflushable
on-disk queue; guard the queuing step by checking the device role (compare
self.resolved.role or similar against DeviceRole::Master) and skip calling
queue_syslog_batch when the role is Master, while still returning Ok or calling
flush_queue_once as appropriate; update collect_and_flush_bootstrap_logs to
early-return or bypass queue_syslog_batch for DeviceRole::Master to prevent
creating persistent queued entries for master nodes.

In `@crates/lab/src/device/store.rs`:
- Around line 92-106: record_logs currently appends indefinitely to
DeviceSnapshot.logs causing unbounded memory growth; modify record_logs (and
DeviceSnapshot) to enforce a retention policy: introduce a MAX_LOGS constant and
when snapshot.logs is extended trim oldest entries to keep snapshot.logs.len()
<= MAX_LOGS (or replace logs Vec with a VecDeque and pop_front until within
limit) so that after snapshot.logs.extend(events) you remove/evict oldest
entries to maintain the cap; alternatively, implement offload logic to disk/DB
before appending if you prefer, but ensure record_logs enforces a hard bound on
snapshot.logs size.
- Around line 27-43: record_hello currently leaves DeviceSnapshot.connected
false, causing newly seen devices to appear disconnected; update the
record_hello function to set snapshot.connected = true (along with existing
updates to snapshot.last_seen and snapshot.role) when inserting/updating the
snapshot so a freshly registered non-master is shown as connected on first
contact; locate the record_hello method in store.rs and ensure you assign to the
DeviceSnapshot.connected field on the snapshot variable after obtaining or
creating it.

In `@crates/lab/src/lib.rs`:
- Around line 1-2: Remove the crate-wide #![allow(unreachable_pub)] in lib.rs
and instead either restrict unnecessary public items to pub(crate) (audit
exported structs, enums, functions and traits and change their visibility) or
apply #[allow(unreachable_pub)] only to the specific module(s) that legitimately
require it; locate the existing crate attribute unreachable_pub in lib.rs,
remove it, then for each item previously exposed, update its visibility to
pub(crate) or add a module-level #[allow(unreachable_pub)] above the specific
mod declarations that truly need the exception.

In `@crates/lab/tests/device_api.rs`:
- Around line 14-24: The test hello_endpoint_updates_master_store only asserts
HTTP 200 but its name implies it should check that the master store was updated;
update the test to actually assert store-side mutation by using the returned
_store from test_device_router (or change the binding to a named mutable store
variable) and verify the expected state change after calling hello_request,
e.g., query the store via its API/lookup method exposed by test_device_router,
or alternatively rename the test to reflect it only verifies HTTP status; refer
to hello_endpoint_updates_master_store, test_device_router, hello_request and
the _store variable to locate and implement the fix.

In `@crates/lab/tests/device_config.rs`:
- Around line 16-19: The test defaults_device_config_when_block_missing should
assert that the device block is entirely absent rather than using an OR that
allows an unexpected default; update the assertion to require
parsed.device.is_none() (referencing the parsed: lab::config::LabConfig variable
and the defaults_device_config_when_block_missing function) so the test fails if
deserialization creates a default device struct.

In `@crates/lab/tests/device_master_only.rs`:
- Around line 31-40: The tempdir created in test_non_master_router() can be
dropped when the function returns causing flaky filesystem access; keep the
TempDir alive for the lifetime of the request by returning it along with the
Router (or otherwise ensuring it isn't dropped). Change test_non_master_router()
to return (axum::Router, tempfile::TempDir) (or store the TempDir in the test
harness) and pass dir into with_web_assets_dir(dir.path().to_path_buf()) while
returning the TempDir so callers that call oneshot() can hold the TempDir until
the request completes; alternatively, call dir.into_path() to persist the
directory if you prefer not to hold TempDir. Ensure references to
test_non_master_router, with_web_assets_dir, and build_router_with_bearer are
updated accordingly.

In `@docs/GATEWAY.md`:
- Around line 14-15: Clarify the ambiguous phrase "or MCP" in the sentence about
gateway management by replacing it with an explicit path or surface reference;
update the line that mentions `/v1/gateway` to something like "non-master
devices do not mount `/v1/gateway` or the MCP endpoints (e.g., `/v1/mcp`)" or
the actual MCP path used in the codebase so readers know exactly which
route/surface is excluded.

In `@docs/OAUTH.md`:
- Around line 102-122: Add an explicit sentence in the "Device Runtime Relay
Start" section clarifying access control: state that POST
/v1/device/oauth/relay/start is an authenticated, protected control-plane
endpoint that requires operator/admin role (or equivalent) and is master-routed
(not callable by arbitrary devices/users), so implementers/operators understand
the auth/role expectation and routing semantics for safety.

In `@docs/OPERATIONS.md`:
- Line 133: The sentence "Every Linux `x86_64` fleet member runs `lab serve` as
a device runtime." wrongly implies platform restriction; update the
documentation text to either generalize to "Every fleet member runs `lab serve`
as a device runtime." or explicitly list supported platforms/architectures
(e.g., `x86_64`, `arm64`, macOS) and any platform-specific caveats; ensure the
edited phrase appears wherever the original string is used so the doc accurately
reflects platform support for `lab serve`.

---

Outside diff comments:
In `@crates/lab/src/mcp/server.rs`:
- Around line 330-343: The current visibility and permission checks only run
when a local service entry exists (svc.is_some()), allowing callers to bypass
MCP gating by invoking upstream tools directly; update the logic in the handler
around registry.services().iter().find(...) so that
service_visible_on_mcp(&service).await and
action_allowed_on_mcp(&service,&action).await are enforced regardless of svc
presence (i.e., perform the MCP visibility and action checks for both local and
proxied/upstream tool invocations) and return the same CallToolResult::error via
build_error when checks fail so the proxy path (the upstream proxy handling)
cannot be used to bypass master-only MCP gates.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8d9f5e0e-7fba-497f-afe4-d993d01a1641

📥 Commits

Reviewing files that changed from the base of the PR and between 3f1c974 and c0738a5.

📒 Files selected for processing (55)
  • crates/lab/src/api.rs
  • crates/lab/src/api/device.rs
  • crates/lab/src/api/device/fleet.rs
  • crates/lab/src/api/device/hello.rs
  • crates/lab/src/api/device/logs.rs
  • crates/lab/src/api/device/metadata.rs
  • crates/lab/src/api/device/oauth.rs
  • crates/lab/src/api/device/status.rs
  • crates/lab/src/api/device/syslog.rs
  • crates/lab/src/api/router.rs
  • crates/lab/src/api/state.rs
  • crates/lab/src/api/web.rs
  • crates/lab/src/catalog.rs
  • crates/lab/src/cli.rs
  • crates/lab/src/cli/device.rs
  • crates/lab/src/cli/logs.rs
  • crates/lab/src/cli/serve.rs
  • crates/lab/src/config.rs
  • crates/lab/src/device.rs
  • crates/lab/src/device/checkin.rs
  • crates/lab/src/device/config_scan.rs
  • crates/lab/src/device/identity.rs
  • crates/lab/src/device/log_collect.rs
  • crates/lab/src/device/log_event.rs
  • crates/lab/src/device/master_client.rs
  • crates/lab/src/device/oauth.rs
  • crates/lab/src/device/queue.rs
  • crates/lab/src/device/runtime.rs
  • crates/lab/src/device/store.rs
  • crates/lab/src/dispatch/error.rs
  • crates/lab/src/lib.rs
  • crates/lab/src/main.rs
  • crates/lab/src/mcp/server.rs
  • crates/lab/tests/device_api.rs
  • crates/lab/tests/device_cli.rs
  • crates/lab/tests/device_config.rs
  • crates/lab/tests/device_identity.rs
  • crates/lab/tests/device_master_only.rs
  • crates/lab/tests/device_queue.rs
  • crates/lab/tests/device_runtime.rs
  • crates/lab/tests/device_scan.rs
  • docs/ARCH.md
  • docs/CLI.md
  • docs/CONFIG.md
  • docs/DEPLOY.md
  • docs/DEVICE_RUNTIME.md
  • docs/ERRORS.md
  • docs/FLEET_LOGS.md
  • docs/GATEWAY.md
  • docs/MCP.md
  • docs/OAUTH.md
  • docs/OBSERVABILITY.md
  • docs/OPERATIONS.md
  • docs/README.md
  • docs/TRANSPORT.md

Comment thread crates/lab/src/api/device/fleet.rs Outdated
Comment on lines +49 to +60
fn require_master_store(state: &AppState) -> Result<std::sync::Arc<crate::device::store::DeviceFleetStore>, ToolError> {
if matches!(state.device_role, Some(DeviceRole::NonMaster)) {
return Err(ToolError::Sdk {
sdk_kind: "not_found".to_string(),
message: "device fleet queries are only available on the master".to_string(),
});
}
state
.device_store
.clone()
.ok_or_else(|| ToolError::internal_message("device store is not configured"))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider making require_master_store reusable across device submodules.

This helper encapsulates the master-only access pattern well. The logs.rs handler duplicates this same logic. Consider making this function pub(super) or pub(crate) so it can be imported and reused by sibling modules.

♻️ Suggested change
-fn require_master_store(state: &AppState) -> Result<std::sync::Arc<crate::device::store::DeviceFleetStore>, ToolError> {
+pub(super) fn require_master_store(state: &AppState) -> Result<std::sync::Arc<crate::device::store::DeviceFleetStore>, ToolError> {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn require_master_store(state: &AppState) -> Result<std::sync::Arc<crate::device::store::DeviceFleetStore>, ToolError> {
if matches!(state.device_role, Some(DeviceRole::NonMaster)) {
return Err(ToolError::Sdk {
sdk_kind: "not_found".to_string(),
message: "device fleet queries are only available on the master".to_string(),
});
}
state
.device_store
.clone()
.ok_or_else(|| ToolError::internal_message("device store is not configured"))
}
pub(super) fn require_master_store(state: &AppState) -> Result<std::sync::Arc<crate::device::store::DeviceFleetStore>, ToolError> {
if matches!(state.device_role, Some(DeviceRole::NonMaster)) {
return Err(ToolError::Sdk {
sdk_kind: "not_found".to_string(),
message: "device fleet queries are only available on the master".to_string(),
});
}
state
.device_store
.clone()
.ok_or_else(|| ToolError::internal_message("device store is not configured"))
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/api/device/fleet.rs` around lines 49 - 60, The helper function
require_master_store currently is private but duplicated in logs.rs; make it
reusable by changing its visibility to pub(super) or pub(crate) (depending on
desired scope) so sibling device submodules can import it, update the signature
in the crate::lab::api::device::fleet module (function require_master_store)
accordingly, and remove the duplicated logic in logs.rs by importing and calling
this shared require_master_store instead of reimplementing the check and error
construction.

Comment thread crates/lab/src/api/device/logs.rs Outdated
Comment thread crates/lab/src/api/device/logs.rs
Comment thread crates/lab/src/api/device/oauth.rs Outdated
Comment thread crates/lab/src/api/device/syslog.rs
Comment thread crates/lab/tests/device_config.rs
Comment thread crates/lab/tests/device_master_only.rs Outdated
Comment thread docs/GATEWAY.md Outdated
Comment thread docs/OAUTH.md
Comment thread docs/OPERATIONS.md Outdated
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 issues found across 55 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/lab/src/device/runtime.rs">

<violation number="1" location="crates/lab/src/device/runtime.rs:141">
P1: On an unknown `QueuedEnvelope.kind`, this `break` exits the loop and falls through to `Ok(())` after acking only the already-processed entries. The unknown envelope stays at the head of the queue, and every subsequent `flush_queue_once` call will re-drain the same batch and immediately break on the same unknown kind — permanently wedging the queue without surfacing any error. Either return an error or log-and-skip the unknown entry to avoid stalling all future flushes.</violation>

<violation number="2" location="crates/lab/src/device/runtime.rs:164">
P2: Home directory resolution is Unix-only here; missing `USERPROFILE` fallback can place runtime files in the current working directory on Windows.</violation>
</file>

<file name="crates/lab/src/device/config_scan.rs">

<violation number="1" location="crates/lab/src/device/config_scan.rs:38">
P2: Guard on `exists()` is too broad; non-file paths cause `fs::read` to error and stop discovery. Check `is_file()` instead.</violation>
</file>

<file name="crates/lab/src/device/log_collect.rs">

<violation number="1" location="crates/lab/src/device/log_collect.rs:6">
P2: Bootstrap log collection is a stub that always returns no events, so the new bootstrap ingest path never sends any logs.</violation>
</file>

<file name="crates/lab/src/api/device/oauth.rs">

<violation number="1" location="crates/lab/src/api/device/oauth.rs:21">
P1: `bind_addr` is accepted verbatim from the request body, but the underlying relay is designed as a loopback-only forwarder. A caller can bind the relay on `0.0.0.0` or a LAN address, exposing an unauthenticated callback forwarder beyond localhost. Validate that `bind_addr.ip().is_loopback()` before spawning the relay.</violation>

<violation number="2" location="crates/lab/src/api/device/oauth.rs:31">
P1: The start endpoint always returns success even if the OAuth relay fails to bind, so callers can receive a false-positive `ok: true` while the relay never starts.</violation>
</file>

<file name="crates/lab/src/device/queue.rs">

<violation number="1" location="crates/lab/src/device/queue.rs:51">
P1: `push` updates the in-memory queue before the append succeeds, so a failed append leaves a phantom queued entry in memory.</violation>
</file>

<file name="crates/lab/src/catalog.rs">

<violation number="1" location="crates/lab/src/catalog.rs:68">
P2: Catalog now advertises `device`/`logs` as available even though they are not registered services, so discovery can direct clients to non-callable actions.</violation>
</file>

<file name="crates/lab/src/device/store.rs">

<violation number="1" location="crates/lab/src/device/store.rs:106">
P2: `record_logs` appends device logs without any cap or eviction, causing unbounded in-memory growth and increasingly expensive log reads.</violation>
</file>

<file name="crates/lab/src/api/device/syslog.rs">

<violation number="1" location="crates/lab/src/api/device/syslog.rs:21">
P2: Normalize or validate each event’s `device_id` before recording the batch to prevent cross-device log attribution inconsistencies.</violation>
</file>

<file name="crates/lab/src/cli/serve.rs">

<violation number="1" location="crates/lab/src/cli/serve.rs:162">
P1: The initial hello, metadata upload, and bootstrap log flush are all awaited sequentially before `run_http(...)` starts the listener. `MasterClient` is constructed with `reqwest::Client::new()` (no request timeout), so a slow or unreachable master can stall boot indefinitely — leaving `/health` and `/ready` unavailable. Move these off the critical startup path (e.g., `tokio::spawn`) or add a request timeout to `MasterClient`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/lab/src/api/device/oauth.rs Outdated
Comment thread crates/lab/src/device/queue.rs Outdated
Comment thread crates/lab/src/api/device/oauth.rs
Comment thread crates/lab/src/device/runtime.rs Outdated
Comment thread crates/lab/src/cli/serve.rs Outdated
Comment thread crates/lab/src/device/config_scan.rs Outdated
Comment thread crates/lab/src/device/log_collect.rs Outdated
Comment thread crates/lab/src/catalog.rs Outdated
Comment thread crates/lab/src/device/store.rs
Comment thread crates/lab/src/api/device/syslog.rs Outdated
Resolves review thread PRRT_kwDOR8nC1M57SUl1

Resolves review thread PRRT_kwDOR8nC1M57SUl2

Resolves review thread PRRT_kwDOR8nC1M57SUl4

Resolves review thread PRRT_kwDOR8nC1M57SUl5

Resolves review thread PRRT_kwDOR8nC1M57SSqD

Resolves review thread PRRT_kwDOR8nC1M57SSqF

Resolves review thread PRRT_kwDOR8nC1M57SSqH

Resolves review thread PRRT_kwDOR8nC1M57SSYn

Resolves review thread PRRT_kwDOR8nC1M57SSYv

Resolves review thread PRRT_kwDOR8nC1M57SSX8

- gate OAuth metadata routes to master-only exposure

- keep MCP resources aligned with hidden services

- bound startup/master-client behavior, relay bind validation, and log retention/search
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/lab/src/oauth/local_relay.rs (1)

55-75: ⚠️ Potential issue | 🟠 Major

Remove the stdout write from relay startup.

serve_local_relay still prints to stdout after the refactor. In this binary, stdout can be protocol/output-bearing, so this can corrupt callers while also bypassing the structured logging pipeline.

Proposed fix
-    #[allow(clippy::print_stdout)]
-    {
-        println!(
-            "OAuth relay listening on http://{} -> {}",
-            config.bind_addr, config.resolved_target.target_url
-        );
-    }

As per coding guidelines, Use tracing for logging everywhere. Never use println! for debug info.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/oauth/local_relay.rs` around lines 55 - 75, In
serve_local_relay, remove the stdout write done via println! (the block guarded
by #[allow(clippy::print_stdout)]) so the function no longer writes to stdout;
instead rely on the existing tracing::info call (or add another
tracing::info/tracing::debug message if you want the same visible text) to
report "OAuth relay listening on ..." and reference config.bind_addr and
config.resolved_target.target_url; update or delete the println!-wrapped block
so all startup output goes through tracing in function serve_local_relay.
crates/lab/src/mcp/server.rs (1)

575-582: ⚠️ Potential issue | 🟠 Major

Non-master MCP gating is still bypassable through upstream discovery/proxying.

service_visible_on_mcp() only hides built-in registry services. list_tools, list_resources, list_prompts, and the upstream branch of call_tool() still go through current_upstream_pool() without a role check, so a non-master MCP session can still discover and invoke upstream control-plane functionality.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/mcp/server.rs` around lines 575 - 582, service_visible_on_mcp
currently prevents only built-in registry services but the upstream paths still
bypass gating; update the upstream-discovery and proxying code paths
(references: current_upstream_pool(), call_tool(), list_tools(),
list_resources(), list_prompts()) to enforce the same DeviceRole check used in
service_visible_on_mcp (i.e., return/deny when self.device_role matches
Some(DeviceRole::NonMaster)) before using current_upstream_pool() or forwarding
calls upstream, or alternatively add a helper like ensure_mcp_master() invoked
at the start of those functions to centralize the gate logic so non-master MCP
sessions cannot discover or invoke upstream control-plane functionality.
♻️ Duplicate comments (6)
crates/lab/src/api/device/syslog.rs (1)

15-25: ⚠️ Potential issue | 🟠 Major

Add the missing surface dispatch event.

This still writes the batch into the fleet store without emitting a surface-level dispatch log, so device.syslog.batch remains invisible in request tracing and metrics.

Proposed fix
 pub async fn handle_batch(
     State(state): State<AppState>,
     Json(payload): Json<DeviceSyslogBatch>,
 ) -> Result<Json<DeviceAck>, ToolError> {
+    let start = std::time::Instant::now();
+    let event_count = payload.events.len();
     validate_device_id_value(&payload.device_id, "device_id")?;
     let store = state
         .device_store
         .clone()
         .ok_or_else(|| ToolError::internal_message("device store is not configured"))?;
     store.record_logs(&payload.device_id, payload.events).await;
+    tracing::info!(
+        surface = "api",
+        service = "device",
+        action = "syslog.batch",
+        device_id = %payload.device_id,
+        event_count,
+        elapsed_ms = start.elapsed().as_millis() as u64,
+        "device syslog batch recorded"
+    );
     Ok(super::ok())
 }

As per coding guidelines, surface layers must comply with docs/OBSERVABILITY.md: dispatch events belong at the surface boundary, caller context must flow into downstream request logs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/api/device/syslog.rs` around lines 15 - 25, handle_batch
currently writes to the store with store.record_logs(&payload.device_id,
payload.events) but does not emit the required surface dispatch event; before
calling record_logs (at the surface boundary in handle_batch), emit a dispatch
event named "device.syslog.batch" using the app state's dispatcher (e.g.,
state.dispatcher or state.dispatch) and include the caller context,
payload.device_id and events in the event payload so downstream logs/metrics
trace the request; ensure you await or handle the dispatch result and propagate
any dispatch errors as ToolError consistent with other handlers.
crates/lab/tests/device_master_only.rs (1)

63-75: ⚠️ Potential issue | 🟡 Minor

Keep the tempdir alive for the request.

test_non_master_router() drops the TempDir before the caller awaits oneshot(), so /gateways/ can observe a missing assets directory and return 404 for the wrong reason. Return the TempDir alongside the Router (or otherwise persist it) so the filesystem path stays valid until the request completes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/tests/device_master_only.rs` around lines 63 - 75, The test
currently drops the TempDir before the request runs, causing the assets path to
vanish; modify test_non_master_router to return or retain the TempDir for the
lifetime of the request (e.g., return a tuple (Router, TempDir) or store the
TempDir in a struct returned by the helper) so the directory created for
with_web_assets_dir on AppState remains alive until the caller finishes awaiting
the oneshot request produced by build_router_with_bearer; ensure callers of
test_non_master_router are updated to keep the TempDir alive while they invoke
the router.
crates/lab/src/device/runtime.rs (3)

129-147: ⚠️ Potential issue | 🟠 Major

Unsupported queue entries permanently block later deliveries.

The unsupported queued envelope kind path returns without acknowledging that item, so every future flush stops on the same head record and valid entries behind it never get delivered. Drop or dead-letter unknown kinds with a warning and advance the queue instead of treating them as a fatal delivery error.

Suggested fix
         for envelope in drained {
             let delivery = match envelope.kind.as_str() {
                 "syslog_batch" => client.post_syslog_batch(&envelope.payload).await,
                 "status" => {
                     let status = serde_json::from_value::<DeviceStatus>(envelope.payload)
                         .context("decode queued status envelope")?;
                     client.post_status(&status).await
                 }
-                other => Err(anyhow!("unsupported queued envelope kind `{other}`")),
+                other => {
+                    tracing::warn!(kind = %other, "dropping unsupported queued envelope");
+                    ack_count += 1;
+                    continue;
+                }
             };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/device/runtime.rs` around lines 129 - 147, The loop over
drained queued envelopes treats unknown envelope kinds as a fatal error and
returns without acknowledging them, which permanently blocks the queue; change
the handling in the match arm for the `other` case (where `envelope.kind` is not
"syslog_batch" or "status") to log a warning (e.g., using tracing::warn!) and
set `delivery` to Ok(()) or otherwise advance `ack_count` so the item is
acknowledged/dropped (i.e., do not return Err for unsupported kinds). Update the
logic around `delivery` and `ack_processed_prefix` in the same function so
unsupported items are counted/acked and the loop continues to process subsequent
entries instead of aborting delivery.

99-126: ⚠️ Potential issue | 🟠 Major

Reopening the queue makes push/flush non-atomic.

queue_syslog_batch() and flush_queue_once() each call DeviceOutboundQueue::open(), so they operate on separate in-memory snapshots of the same JSONL file. If a push races with a flush, ack_drained() can rewrite stale state and drop the concurrent append. Keep one queue instance inside DeviceRuntime and share it across push/drain/ack.


32-42: ⚠️ Potential issue | 🟠 Major

Don't derive the master's base URL from the local listen port.

for_http_master() uses the caller-supplied port to build http://{master_host}:{port}, and serve passes the device's own HTTP bind port. Changing a non-master's local --port therefore retargets hello/metadata/syslog uploads to that same port on the master, which breaks fleets where the device listener port and the master's API port differ.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/device/runtime.rs` around lines 32 - 42, The code builds the
master base URL from the caller-supplied master_port (parameter master_port)
which is coming from the device's local HTTP bind port; change
for_http_master(ResolvedDeviceRuntime, master_port) to stop deriving the
master's URL from that parameter and instead use the master API address/port
that is part of the resolved runtime (e.g., a field on ResolvedDeviceRuntime
like master_api_port or master_address/master_url). Update for_http_master to
construct the URL from resolved.master_host combined with the resolved master's
API port/address (or accept a properly named master_api_port) and ensure callers
(e.g., serve) pass the master API port from the configuration/resolved runtime
rather than the device's local bind port; use MasterClient::with_bearer_token as
before but with the corrected URL.
crates/lab/src/device/master_client.rs (1)

37-134: ⚠️ Potential issue | 🟠 Major

Move request construction and response parsing into lab-apis service client.

This file still owns endpoint paths, outbound request construction, and JSON decoding inside crates/lab, which breaks the required service boundary. Keep this layer as a thin shim over lab-apis methods.

As per coding guidelines, “All business logic must live in lab-apis/src/<service>/client.rs.” and “Do not move upstream request construction or response parsing out of lab-apis.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/device/master_client.rs` around lines 37 - 134, This module
still constructs URLs, endpoint paths and decodes JSON (see post_json, get_json,
post_json_value and the public methods post_hello, post_status, post_metadata,
post_syslog_batch, fetch_devices, fetch_device, search_logs), violating the
service boundary; replace those implementations with thin forwards to the
centralized lab-apis service client API (call the appropriate lab-apis device
client methods rather than formatting paths or calling self.http), return
whatever lab-apis client returns (or map errors without parsing JSON yourself),
and remove manual URL/path construction and response.json() calls so this crate
only forwards payloads/ids to lab-apis and returns its results; keep from_config
and bearer setup as-is but delegate all request/response work to lab-apis client
functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/lab/src/api/device/fleet.rs`:
- Around line 35-46: The get_device handler currently queries the store with the
raw Path(device_id) so malformed/encoded/overlong IDs surface as not_found;
update get_device to call super::validate_device_id_value(&device_id)
immediately after the master-only gate (after require_master_store(&state)? and
before calling store.device(&device_id).await) and return an appropriate
invalid_param ToolError on validation failure so invalid IDs yield invalid_param
instead of not_found.

In `@crates/lab/src/api/device/logs.rs`:
- Around line 25-31: payload.limit is used directly in search_logs_for_device
which allows unbounded/oversized requests; define a sane maximum (e.g., const
MAX_LOG_LIMIT) and clamp the effective limit used by the handler (replace let
limit = payload.limit.unwrap_or(200); with code that applies unwrap_or and then
min(MAX_LOG_LIMIT)) before calling search_logs_for_device(&payload.device_id,
&needle, offset, limit); optionally return a 4xx error if payload.limit exceeds
MAX_LOG_LIMIT instead of clamping, but ensure the change references
payload.limit, offset, limit, and the call to search_logs_for_device so the
validation happens at the API boundary.

In `@crates/lab/src/catalog.rs`:
- Line 66: The .collect::<Vec<_>>() call used to populate the services field is
redundant because services is already typed as Vec<ServiceCatalog> on the
Catalog struct; replace the turbofish-style .collect::<Vec<_>>() with a plain
.collect() when building services (referencing the services variable and
ServiceCatalog/Catalog types) so the compiler infers the collection type.

In `@crates/lab/src/device/master_client.rs`:
- Around line 57-59: In fetch_device, avoid interpolating raw device_id into the
path; percent-encode the device_id path segment before calling get_json so
reserved characters (/, ?, #, etc.) don't break routing. Replace the string
interpolation of device_id inside fetch_device with a percent-encoded version
(using a path-segment safe encoder such as percent_encoding or
url::form_urlencoded/Url utilities) and then call
self.get_json(&format!("/v1/device/devices/{}", encoded_device_id)). Ensure you
only encode the segment (not the whole URL) and keep the fetch_device signature
and get_json call intact.
- Around line 89-134: post_json, get_json, and post_json_value need to emit
request lifecycle events: before calling .send() emit a request.start (including
method and the computed url), on successful response emit request.finish
(include status code and url), and on any error path emit request.error with the
error details and url; update the request flow inside each function (post_json,
get_json, post_json_value) to wrap the .send()/.error_for_status()/.json()
sequence so that request.error is emitted on any ?-propagated failure (include
the error from with_context) and request.finish is emitted after a successful
.error_for_status() before parsing the body. Ensure you reference the same url
variable and the request(self.http.<method>(&url)) call sites so logging covers
every outbound request.
- Around line 24-31: Change with_bearer_token to return Result<Self,
reqwest::Error> (or a suitable error type) instead of panicking: replace the
current expect(...) on reqwest::Client::builder().build() with the ? operator
(or match) to propagate the build error, construct and return Ok(Self { http,
... }), and update any callers (and any new() wrapper) to handle the Result;
reference the with_bearer_token constructor and the
reqwest::Client::builder().build() call when making the change.

In `@crates/lab/src/device/oauth.rs`:
- Around line 13-25: The code currently drops the listener's actual bound
address after calling bind_local_relay_listener and spawning serve_local_relay;
capture listener.local_addr() immediately after bind_local_relay_listener
returns and include that SocketAddr in the function's return value (e.g., change
the function signature to return Result<SocketAddr, _> or a struct containing
the bound address) so callers using ephemeral ports (127.0.0.1:0) can discover
the real port; preserve the existing behavior of spawning tokio::spawn with
config_for_task and return the captured address (not the listener) on Ok.

---

Outside diff comments:
In `@crates/lab/src/mcp/server.rs`:
- Around line 575-582: service_visible_on_mcp currently prevents only built-in
registry services but the upstream paths still bypass gating; update the
upstream-discovery and proxying code paths (references: current_upstream_pool(),
call_tool(), list_tools(), list_resources(), list_prompts()) to enforce the same
DeviceRole check used in service_visible_on_mcp (i.e., return/deny when
self.device_role matches Some(DeviceRole::NonMaster)) before using
current_upstream_pool() or forwarding calls upstream, or alternatively add a
helper like ensure_mcp_master() invoked at the start of those functions to
centralize the gate logic so non-master MCP sessions cannot discover or invoke
upstream control-plane functionality.

In `@crates/lab/src/oauth/local_relay.rs`:
- Around line 55-75: In serve_local_relay, remove the stdout write done via
println! (the block guarded by #[allow(clippy::print_stdout)]) so the function
no longer writes to stdout; instead rely on the existing tracing::info call (or
add another tracing::info/tracing::debug message if you want the same visible
text) to report "OAuth relay listening on ..." and reference config.bind_addr
and config.resolved_target.target_url; update or delete the println!-wrapped
block so all startup output goes through tracing in function serve_local_relay.

---

Duplicate comments:
In `@crates/lab/src/api/device/syslog.rs`:
- Around line 15-25: handle_batch currently writes to the store with
store.record_logs(&payload.device_id, payload.events) but does not emit the
required surface dispatch event; before calling record_logs (at the surface
boundary in handle_batch), emit a dispatch event named "device.syslog.batch"
using the app state's dispatcher (e.g., state.dispatcher or state.dispatch) and
include the caller context, payload.device_id and events in the event payload so
downstream logs/metrics trace the request; ensure you await or handle the
dispatch result and propagate any dispatch errors as ToolError consistent with
other handlers.

In `@crates/lab/src/device/master_client.rs`:
- Around line 37-134: This module still constructs URLs, endpoint paths and
decodes JSON (see post_json, get_json, post_json_value and the public methods
post_hello, post_status, post_metadata, post_syslog_batch, fetch_devices,
fetch_device, search_logs), violating the service boundary; replace those
implementations with thin forwards to the centralized lab-apis service client
API (call the appropriate lab-apis device client methods rather than formatting
paths or calling self.http), return whatever lab-apis client returns (or map
errors without parsing JSON yourself), and remove manual URL/path construction
and response.json() calls so this crate only forwards payloads/ids to lab-apis
and returns its results; keep from_config and bearer setup as-is but delegate
all request/response work to lab-apis client functions.

In `@crates/lab/src/device/runtime.rs`:
- Around line 129-147: The loop over drained queued envelopes treats unknown
envelope kinds as a fatal error and returns without acknowledging them, which
permanently blocks the queue; change the handling in the match arm for the
`other` case (where `envelope.kind` is not "syslog_batch" or "status") to log a
warning (e.g., using tracing::warn!) and set `delivery` to Ok(()) or otherwise
advance `ack_count` so the item is acknowledged/dropped (i.e., do not return Err
for unsupported kinds). Update the logic around `delivery` and
`ack_processed_prefix` in the same function so unsupported items are
counted/acked and the loop continues to process subsequent entries instead of
aborting delivery.
- Around line 32-42: The code builds the master base URL from the
caller-supplied master_port (parameter master_port) which is coming from the
device's local HTTP bind port; change for_http_master(ResolvedDeviceRuntime,
master_port) to stop deriving the master's URL from that parameter and instead
use the master API address/port that is part of the resolved runtime (e.g., a
field on ResolvedDeviceRuntime like master_api_port or
master_address/master_url). Update for_http_master to construct the URL from
resolved.master_host combined with the resolved master's API port/address (or
accept a properly named master_api_port) and ensure callers (e.g., serve) pass
the master API port from the configuration/resolved runtime rather than the
device's local bind port; use MasterClient::with_bearer_token as before but with
the corrected URL.

In `@crates/lab/tests/device_master_only.rs`:
- Around line 63-75: The test currently drops the TempDir before the request
runs, causing the assets path to vanish; modify test_non_master_router to return
or retain the TempDir for the lifetime of the request (e.g., return a tuple
(Router, TempDir) or store the TempDir in a struct returned by the helper) so
the directory created for with_web_assets_dir on AppState remains alive until
the caller finishes awaiting the oneshot request produced by
build_router_with_bearer; ensure callers of test_non_master_router are updated
to keep the TempDir alive while they invoke the router.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f36f5e5c-9257-4dae-bed4-0fdaec7c44b2

📥 Commits

Reviewing files that changed from the base of the PR and between c0738a5 and 181ca55.

📒 Files selected for processing (18)
  • crates/lab/src/api/device.rs
  • crates/lab/src/api/device/fleet.rs
  • crates/lab/src/api/device/logs.rs
  • crates/lab/src/api/device/oauth.rs
  • crates/lab/src/api/device/syslog.rs
  • crates/lab/src/api/router.rs
  • crates/lab/src/catalog.rs
  • crates/lab/src/cli/serve.rs
  • crates/lab/src/device/master_client.rs
  • crates/lab/src/device/oauth.rs
  • crates/lab/src/device/runtime.rs
  • crates/lab/src/device/store.rs
  • crates/lab/src/mcp/server.rs
  • crates/lab/src/oauth/local_relay.rs
  • crates/lab/tests/device_api.rs
  • crates/lab/tests/device_master_only.rs
  • crates/lab/tests/device_runtime.rs
  • docs/TRANSPORT.md

Comment thread crates/lab/src/api/device/fleet.rs
Comment thread crates/lab/src/api/device/logs.rs
Comment thread crates/lab/src/catalog.rs Outdated
Comment thread crates/lab/src/device/master_client.rs Outdated
Comment thread crates/lab/src/device/master_client.rs Outdated
Comment on lines +57 to +59
pub async fn fetch_device(&self, device_id: &str) -> Result<serde_json::Value> {
self.get_json(&format!("/v1/device/devices/{device_id}"))
.await
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Encode device_id before inserting it into the URL path.

Raw interpolation at Line 58 can misroute requests when device_id contains reserved path characters (for example /, ?, #). Percent-encode path segments before building the endpoint path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/device/master_client.rs` around lines 57 - 59, In
fetch_device, avoid interpolating raw device_id into the path; percent-encode
the device_id path segment before calling get_json so reserved characters (/, ?,
#, etc.) don't break routing. Replace the string interpolation of device_id
inside fetch_device with a percent-encoded version (using a path-segment safe
encoder such as percent_encoding or url::form_urlencoded/Url utilities) and then
call self.get_json(&format!("/v1/device/devices/{}", encoded_device_id)). Ensure
you only encode the segment (not the whole URL) and keep the fetch_device
signature and get_json call intact.

Comment thread crates/lab/src/device/master_client.rs Outdated
Comment on lines +89 to +134
async fn post_json<T: serde::Serialize + ?Sized>(&self, path: &str, payload: &T) -> Result<()> {
let url = format!("{}{}", self.base_url.trim_end_matches('/'), path);
self.request(self.http.post(&url))
.json(payload)
.send()
.await
.with_context(|| format!("POST {url}"))?
.error_for_status()
.with_context(|| format!("POST {url} failed"))?;
Ok(())
}

async fn get_json(&self, path: &str) -> Result<serde_json::Value> {
let url = format!("{}{}", self.base_url.trim_end_matches('/'), path);
let response = self
.request(self.http.get(&url))
.send()
.await
.with_context(|| format!("GET {url}"))?
.error_for_status()
.with_context(|| format!("GET {url} failed"))?;
response
.json()
.await
.with_context(|| format!("decode {url}"))
}

async fn post_json_value<T: serde::Serialize + ?Sized>(
&self,
path: &str,
payload: &T,
) -> Result<serde_json::Value> {
let url = format!("{}{}", self.base_url.trim_end_matches('/'), path);
let response = self
.request(self.http.post(&url))
.json(payload)
.send()
.await
.with_context(|| format!("POST {url}"))?
.error_for_status()
.with_context(|| format!("POST {url} failed"))?;
response
.json()
.await
.with_context(|| format!("decode {url}"))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add required outbound HTTP lifecycle logs.

post_json, get_json, and post_json_value do not emit request.start and request.finish/request.error events, so dispatch observability is incomplete.

As per coding guidelines, “HTTP dispatch must emit request.start and request.finish or request.error for every outbound request.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/device/master_client.rs` around lines 89 - 134, post_json,
get_json, and post_json_value need to emit request lifecycle events: before
calling .send() emit a request.start (including method and the computed url), on
successful response emit request.finish (include status code and url), and on
any error path emit request.error with the error details and url; update the
request flow inside each function (post_json, get_json, post_json_value) to wrap
the .send()/.error_for_status()/.json() sequence so that request.error is
emitted on any ?-propagated failure (include the error from with_context) and
request.finish is emitted after a successful .error_for_status() before parsing
the body. Ensure you reference the same url variable and the
request(self.http.<method>(&url)) call sites so logging covers every outbound
request.

Comment thread crates/lab/src/device/oauth.rs Outdated
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 18 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/lab/src/device/runtime.rs">

<violation number="1" location="crates/lab/src/device/runtime.rs:133">
P2: A status decode error exits the loop before acknowledging already-delivered queue entries, which can cause duplicate re-delivery on the next flush.</violation>
</file>

<file name="crates/lab/src/api/device.rs">

<violation number="1" location="crates/lab/src/api/device.rs:41">
P2: `device_id` validation is bypassable because it validates trimmed length but keeps using the raw value. Oversized/whitespace-padded IDs can pass and be persisted as map keys.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/lab/src/device/runtime.rs Outdated
Comment thread crates/lab/src/api/device.rs
jmagar added 2 commits April 15, 2026 22:08
Resolves review thread PRRT_kwDOR8nC1M57SU4C
Resolves review thread PRRT_kwDOR8nC1M57SU4F
Resolves review thread PRRT_kwDOR8nC1M57SU4G
Resolves review thread PRRT_kwDOR8nC1M57SU30
Resolves review thread PRRT_kwDOR8nC1M57SU32
Resolves review thread PRRT_kwDOR8nC1M57SU37
Resolves review thread PRRT_kwDOR8nC1M57SU3-
Resolves review thread PRRT_kwDOR8nC1M57SUms
Resolves review thread PRRT_kwDOR8nC1M57SUmy
Resolves review thread PRRT_kwDOR8nC1M57SUm0
Resolves review thread PRRT_kwDOR8nC1M57SUm1
Resolves review thread PRRT_kwDOR8nC1M57SUm4
Resolves review thread PRRT_kwDOR8nC1M57SUm6
Resolves review thread PRRT_kwDOR8nC1M57SUl-
Resolves review thread PRRT_kwDOR8nC1M57SUl_
Resolves review thread PRRT_kwDOR8nC1M57SUmC
Resolves review thread PRRT_kwDOR8nC1M57SUmD
Resolves review thread PRRT_kwDOR8nC1M57SUmF
Resolves review thread PRRT_kwDOR8nC1M57SUmJ
Resolves review thread PRRT_kwDOR8nC1M57SUmL
Resolves review thread PRRT_kwDOR8nC1M57SUmR
Resolves review thread PRRT_kwDOR8nC1M57SUmV
Resolves review thread PRRT_kwDOR8nC1M57SUmW
Resolves review thread PRRT_kwDOR8nC1M57SUmX
Resolves review thread PRRT_kwDOR8nC1M57SUmb
Resolves review thread PRRT_kwDOR8nC1M57SUmc
Resolves review thread PRRT_kwDOR8nC1M57SSYC
Resolves review thread PRRT_kwDOR8nC1M57SSYG

- move device-runtime HTTP semantics into lab-apis
- harden queue, identity, bootstrap log, and syslog ingest paths
- redact discovered CLI MCP config metadata before upload
- tighten docs and tests around master-only routing and device runtime behavior
Resolves review thread PRRT_kwDOR8nC1M57S1b8
Resolves review thread PRRT_kwDOR8nC1M57S1cN
Resolves review thread PRRT_kwDOR8nC1M57S1JE
Resolves review thread PRRT_kwDOR8nC1M57S1JI
Resolves review thread PRRT_kwDOR8nC1M57S1JW
Resolves review thread PRRT_kwDOR8nC1M57S1Jj

- normalize device ids on ingest and read paths
- acknowledge delivered queue entries before decode failures return
- clamp log-search limit at the API boundary
- return the actual bound relay address for ephemeral binds
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 35 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/lab/src/device/identity.rs">

<violation number="1" location="crates/lab/src/device/identity.rs:35">
P1: Role detection is overly permissive: matching only the first dot-separated segment can misclassify non-master hosts as master (including many IPv4 addresses with the same first octet).</violation>
</file>

<file name="crates/lab/src/device/log_collect.rs">

<violation number="1" location="crates/lab/src/device/log_collect.rs:23">
P2: `fs::read(path)?` aborts on the first unreadable candidate, so fallback log sources are never tried. Handle read errors per-candidate and continue to the next path.</violation>
</file>

<file name="crates/lab/src/cli/serve.rs">

<violation number="1" location="crates/lab/src/cli/serve.rs:94">
P1: Non-master master-client port ignores resolved CLI/env port overrides, causing device check-ins to target the wrong endpoint when `--port`/`LAB_MCP_HTTP_PORT` is used.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/lab/src/device/identity.rs Outdated
Comment thread crates/lab/src/cli/serve.rs Outdated
Comment thread crates/lab/src/device/log_collect.rs Outdated
Resolves review thread PRRT_kwDOR8nC1M57S-8C
Resolves review thread PRRT_kwDOR8nC1M57S-8H
Resolves review thread PRRT_kwDOR8nC1M57S-8I

- tighten hostname equivalence for master role detection
- thread resolved master port overrides into device runtime clients
- continue bootstrap log discovery when a candidate is unreadable
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/lab/src/api/router.rs (1)

366-368: ⚠️ Potential issue | 🟠 Major

Don't install the Web UI fallback on non-master nodes.

This fallback is still mounted whenever web_assets_dir is set, so unmatched GETs on a non-master continue to hit serve_web_request and return the Web-UI-specific denial path. That breaks the new contract that non-masters expose only /health, /ready, and the limited /v1/device/* surface. Gate the fallback with is_master and keep the handler check as defense in depth.

Suggested fix
-    if state.web_assets_dir.is_some() {
+    if is_master && state.web_assets_dir.is_some() {
         router = router.fallback(crate::api::web::serve_web_request);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/api/router.rs` around lines 366 - 368, The Web UI fallback is
currently mounted whenever state.web_assets_dir.is_some(), which exposes the UI
denial path on non-master nodes; change the condition to mount the fallback only
when web_assets_dir.is_some() AND the node is master (use state.is_master or
state.is_master() as appropriate) before calling
router.fallback(crate::api::web::serve_web_request), and keep the existing
serve_web_request handler's internal master check as defense-in-depth.
♻️ Duplicate comments (3)
crates/lab/src/cli/logs.rs (1)

7-8: ⚠️ Potential issue | 🟠 Major

Move CLI log-search client usage to lab-apis boundary path.
crates/lab/src/cli/logs.rs still binds to crate::device::master_client::MasterClient. This keeps service-client semantics in lab instead of the lab-apis client layer expected by the project architecture.

As per coding guidelines, “All business logic must live in lab-apis/src/<service>/client.rs” and dependency direction must be “cli -> dispatch -> lab-apis”.

Also applies to: 29-37

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/cli/logs.rs` around lines 7 - 8, The file logs.rs currently
imports and uses crate::device::master_client::MasterClient which keeps
service-client logic in the wrong crate; change logs.rs to use the client
exported by the lab-apis boundary instead. Replace the
crate::device::master_client import and all uses of that MasterClient with the
appropriate lab-apis client (the client type exported from lab-apis'
<service>/client.rs, e.g., LogsClient or MasterClient from
lab_apis::<service>::client), update the use statements and constructor calls to
the lab-apis API, and add the lab-apis dependency to this crate's Cargo.toml if
missing so the CLI depends on lab-apis (cli -> dispatch -> lab-apis) rather than
directly on crate::device internals.
crates/lab/tests/device_api.rs (1)

13-24: 🧹 Nitpick | 🔵 Trivial

Test verifies HTTP status but name still implies store mutation.

While marked as addressed, hello_endpoint_updates_master_store still only asserts StatusCode::OK without verifying the store was updated. Consider either:

  1. Adding a store assertion like assert!(store.device("dookie").await.is_some());
  2. Renaming to hello_endpoint_returns_ok

The next test (hello_endpoint_normalizes_device_id_before_storage) demonstrates the proper pattern.

♻️ Proposed fix to add store assertion
 #[tokio::test]
 async fn hello_endpoint_updates_master_store() {
-    let (app, _store) = test_device_router();
+    let (app, store) = test_device_router();
     let response = app
         .oneshot(hello_request(
             r#"{"device_id":"dookie","role":"non-master","version":"1.0.0"}"#,
         ))
         .await
         .unwrap();

     assert_eq!(response.status(), StatusCode::OK);
+    assert!(store.device("dookie").await.is_some());
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/tests/device_api.rs` around lines 13 - 24, The test named
hello_endpoint_updates_master_store only asserts HTTP 200 but doesn't check the
store; update the test (in hello_endpoint_updates_master_store) to verify the
store was mutated by replacing the unused _store with a real binding (e.g.,
store from test_device_router()) and add an assertion like checking
store.device("dookie").await.is_some(), following the pattern in
hello_endpoint_normalizes_device_id_before_storage; alternatively, if you prefer
not to assert store state, rename the test to hello_endpoint_returns_ok to
reflect its current behavior.
crates/lab/src/device/master_client.rs (1)

23-35: ⚠️ Potential issue | 🔴 Critical

Replace .expect() with fallible construction.

Line 33 uses .expect(...) which can panic during client initialization. Per coding guidelines: "Always return Result<T>, never panic."

🔧 Proposed fix
-    #[must_use]
-    pub fn with_bearer_token(base_url: impl Into<String>, bearer_token: Option<String>) -> Self {
+    pub fn with_bearer_token(
+        base_url: impl Into<String>,
+        bearer_token: Option<String>,
+    ) -> Result<Self> {
         let auth = bearer_token.map_or(Auth::None, |token| Auth::Bearer { token });
         let inner = DeviceRuntimeClient::new(
             base_url,
             auth,
             Some(std::time::Duration::from_secs(
                 DEFAULT_MASTER_CLIENT_TIMEOUT_SECS,
             )),
-        )
-        .expect("device runtime client builder should be valid");
-        Self { inner }
+        )?;
+        Ok(Self { inner })
     }

Also update new():

-    #[must_use]
-    #[allow(dead_code)]
-    pub fn new(base_url: impl Into<String>) -> Self {
+    #[allow(dead_code)]
+    pub fn new(base_url: impl Into<String>) -> Result<Self> {
         Self::with_bearer_token(base_url, None)
     }

And update from_config() at line 77:

-        Ok(Self::with_bearer_token(
+        Self::with_bearer_token(
             format!("http://{host}:{port}"),
             master_bearer_token(),
-        ))
+        )
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/device/master_client.rs` around lines 23 - 35, The
with_bearer_token function currently calls
DeviceRuntimeClient::new(...).expect(...), which can panic; change
with_bearer_token to return a Result<Self, E> (choose the crate's appropriate
error type) and propagate the DeviceRuntimeClient::new error instead of
unwrapping (use ?), e.g., return Ok(Self { inner }) on success. Similarly update
the associated new() constructor and from_config() to be fallible (return
Result) and propagate errors from DeviceRuntimeClient::new and any downstream
calls using ? so no .expect() remains in master_client.rs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/lab-apis/src/core/http.rs`:
- Around line 93-107: The new public constructor
with_default_headers_and_timeouts exposes per-client timeouts which violates the
HttpClient timeout policy; change it so timeouts are not caller-configurable by
either removing or converting with_default_headers_and_timeouts to a
private/internal function and hardcode connect_timeout to Duration::from_secs(5)
and timeout to Duration::from_secs(30) when building the reqwest Client (i.e. in
the Client::builder() call where connect_timeout and timeout are set), and keep
the public API surface (HttpClient constructors) using only the fixed 5s/30s
values.

In `@crates/lab-apis/src/device_runtime.rs`:
- Around line 1-3: The crate's service entry-point is missing the required
service wiring: add a pub const META: PluginMeta declaration and move/ensure the
ServiceClient implementation is exposed from this entry-point (not only in
client.rs). Specifically, declare pub const META: PluginMeta with the
appropriate plugin/service metadata in device_runtime (the current file) and
re-export or include the ServiceClient type/impl here so the entry-point
satisfies the repo contract while keeping implementation details in client.rs;
ensure the symbol names PluginMeta, META, and ServiceClient are present and
correctly referenced.

In `@crates/lab-apis/src/device_runtime/client.rs`:
- Around line 68-82: Replace the inline serde_json::json! payload in search_logs
with a typed request struct: add a SearchLogsRequest { device_id: String, query:
String } (derive serde::Serialize) in types.rs, import it into client.rs,
construct SearchLogsRequest { device_id: device_id.to_string(), query:
query.to_string() } inside the search_logs method and pass that to
self.http.post_json("/v1/device/logs/search", &request). This preserves the
existing signature of search_logs and ApiError handling but gives compile-time
type checking and consistent client patterns.
- Around line 62-66: The fetch_device method is interpolating device_id directly
into the path; percent-encode the device_id path segment before building the URL
to prevent path injection. Update fetch_device to encode the device_id (e.g.,
using percent_encoding::percent_encode with the PATH_SEGMENT_ENCODE_SET or an
equivalent path-segment-safe encoder) and use the encoded string in the format!
passed to self.http.get_json; keep the function signature and return type
unchanged and ensure the encoded value replaces the raw device_id in the URL.

In `@crates/lab/src/cli/device.rs`:
- Around line 16-20: The DeviceCommand enum's subcommands (DeviceCommand,
variants List and Get) lack help text for CLI help output; add brief
descriptions either as doc comments (/// ...) above the enum and each variant or
by annotating variants with #[arg(help = "...")] so clap can show them in
--help, e.g., provide a one-line description for List and for Get (mentioning
the device_id parameter) to improve discoverability.

In `@crates/lab/src/device/config_scan.rs`:
- Around line 71-73: The code currently reads bytes into `raw` using fs::read
and parses with `toml::from_slice`, which was removed; change to read text with
`fs::read_to_string(path).with_context(|| format!("read {}", path.display()))?`
and parse with `toml::from_str(&raw).with_context(|| format!("parse {}",
path.display()))?` so the `parsed: toml::Value` construction uses
`toml::from_str` on the UTF-8 string; update the `raw` binding and the call site
where `parsed` is created (identify `raw` and `parsed` in config_scan.rs)
accordingly.

In `@crates/lab/src/device/log_collect.rs`:
- Around line 23-31: The code currently calls fs::read(path) which loads the
entire file into memory; replace that with opening the file
(std::fs::File::open) and reading only the tail: stat the file to get its
length, compute start = file_len.saturating_sub(BOOTSTRAP_LOG_BYTE_LIMIT as
u64), seek to start (std::io::Seek::seek) and then read the remainder into a
buffer (std::io::Read::read_to_end) so tail_bytes is no longer operating on a
full-file Vec; keep the same error handling/log line (tracing::debug!(path =
%path.display(), error = %error, "skipping unreadable bootstrap log candidate"))
and apply the same change to the similar fs::read usage later (the block around
the other occurrence noted in the comment).

In `@crates/lab/src/device/queue.rs`:
- Around line 125-148: The rewrite_entries function currently writes directly to
the target path which is non-atomic; change it to write the serialized content
to a temporary file in the same parent directory (use a unique temp name, e.g.,
original filename + ".tmp" + a random/uuid suffix), fsync the temp file
(File::sync_all) to ensure data hits disk, close it, then atomically
rename/replace the target file with tokio::fs::rename; update error contexts to
mention the temp path and rename step. Ensure the temp file is created in the
same parent (use path.parent()) and cleaned up on errors.

In `@crates/lab/src/device/runtime.rs`:
- Around line 46-52: The test helper non_master_for_test currently calls
resolve_runtime_role(...).expect(...) which can panic; change it to return a
Result (e.g., Result<Self, anyhow::Error> or a custom error) instead of
unwrapping so callers handle failures. Specifically, update non_master_for_test
to propagate the error from resolve_runtime_role (using ? or map_err) and
construct the runtime with Self::new(resolved,
Some(MasterClient::new(base_url))) on success; ensure the function signature is
adjusted (e.g., -> Result<Self, _>) and update callers/tests accordingly to
handle the Result.

In `@crates/lab/tests/device_master_only.rs`:
- Around line 108-129: The test fixture currently leaks TempDir via Box::leak in
test_lab_auth_state; instead create and own the TempDir and return it alongside
the AuthState (e.g., change test_lab_auth_state to return
(lab_auth::state::AuthState, tempfile::TempDir) or a small TestAuthFixture
struct), remove Box::leak, build AuthConfig using tempdir.path().join(...), call
lab_auth::state::AuthState::new(config).await.unwrap(), and update any callers
to accept the tuple/fixture so the TempDir is dropped at test end rather than
leaked.

In `@crates/lab/tests/device_queue.rs`:
- Around line 21-23: The test queue_persists_and_reloads_entries only asserts
the length of the drained batch, which misses corrupted or incorrect payloads;
after calling reopened.drain_batch(10).await.unwrap() and binding to drained,
add an assertion that the actual payload(s) match expected values (e.g., compare
drained[0] to the original enqueued item or its fields) to ensure persisted
entries are reloaded intact; locate the assertions around reopened.drain_batch
and the drained variable in the test and add content equality checks rather than
only length checks.

In `@docs/OPERATIONS.md`:
- Around line 133-160: Update the OPERATIONS.md section to document the required
role and transport setup steps: instruct operators to set the `[device].master`
configuration on non-master nodes, start device runtimes with `lab serve
--transport http`, and explicitly state the master URL and token requirements
(e.g., how to obtain and export LAB_MCP_HTTP_TOKEN) so non-masters can reach the
master’s `/v1/device/*` endpoints; also note that the master must be reachable
at the health endpoint (`/health` on port 8765) and that fleet state is held
in-memory on the master, so ensure the master is started first with its token
and URL configured for non-masters.

---

Outside diff comments:
In `@crates/lab/src/api/router.rs`:
- Around line 366-368: The Web UI fallback is currently mounted whenever
state.web_assets_dir.is_some(), which exposes the UI denial path on non-master
nodes; change the condition to mount the fallback only when
web_assets_dir.is_some() AND the node is master (use state.is_master or
state.is_master() as appropriate) before calling
router.fallback(crate::api::web::serve_web_request), and keep the existing
serve_web_request handler's internal master check as defense-in-depth.

---

Duplicate comments:
In `@crates/lab/src/cli/logs.rs`:
- Around line 7-8: The file logs.rs currently imports and uses
crate::device::master_client::MasterClient which keeps service-client logic in
the wrong crate; change logs.rs to use the client exported by the lab-apis
boundary instead. Replace the crate::device::master_client import and all uses
of that MasterClient with the appropriate lab-apis client (the client type
exported from lab-apis' <service>/client.rs, e.g., LogsClient or MasterClient
from lab_apis::<service>::client), update the use statements and constructor
calls to the lab-apis API, and add the lab-apis dependency to this crate's
Cargo.toml if missing so the CLI depends on lab-apis (cli -> dispatch ->
lab-apis) rather than directly on crate::device internals.

In `@crates/lab/src/device/master_client.rs`:
- Around line 23-35: The with_bearer_token function currently calls
DeviceRuntimeClient::new(...).expect(...), which can panic; change
with_bearer_token to return a Result<Self, E> (choose the crate's appropriate
error type) and propagate the DeviceRuntimeClient::new error instead of
unwrapping (use ?), e.g., return Ok(Self { inner }) on success. Similarly update
the associated new() constructor and from_config() to be fallible (return
Result) and propagate errors from DeviceRuntimeClient::new and any downstream
calls using ? so no .expect() remains in master_client.rs.

In `@crates/lab/tests/device_api.rs`:
- Around line 13-24: The test named hello_endpoint_updates_master_store only
asserts HTTP 200 but doesn't check the store; update the test (in
hello_endpoint_updates_master_store) to verify the store was mutated by
replacing the unused _store with a real binding (e.g., store from
test_device_router()) and add an assertion like checking
store.device("dookie").await.is_some(), following the pattern in
hello_endpoint_normalizes_device_id_before_storage; alternatively, if you prefer
not to assert store state, rename the test to hello_endpoint_returns_ok to
reflect its current behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ab140963-cec1-47d9-8cbb-d46fff3c4e34

📥 Commits

Reviewing files that changed from the base of the PR and between 181ca55 and 49d6d4c.

📒 Files selected for processing (37)
  • crates/lab-apis/src/core/http.rs
  • crates/lab-apis/src/device_runtime.rs
  • crates/lab-apis/src/device_runtime/client.rs
  • crates/lab-apis/src/lib.rs
  • crates/lab/src/api/device.rs
  • crates/lab/src/api/device/fleet.rs
  • crates/lab/src/api/device/hello.rs
  • crates/lab/src/api/device/logs.rs
  • crates/lab/src/api/device/metadata.rs
  • crates/lab/src/api/device/oauth.rs
  • crates/lab/src/api/device/status.rs
  • crates/lab/src/api/device/syslog.rs
  • crates/lab/src/api/router.rs
  • crates/lab/src/api/state.rs
  • crates/lab/src/api/web.rs
  • crates/lab/src/cli/device.rs
  • crates/lab/src/cli/logs.rs
  • crates/lab/src/cli/serve.rs
  • crates/lab/src/device/config_scan.rs
  • crates/lab/src/device/identity.rs
  • crates/lab/src/device/log_collect.rs
  • crates/lab/src/device/log_event.rs
  • crates/lab/src/device/master_client.rs
  • crates/lab/src/device/oauth.rs
  • crates/lab/src/device/queue.rs
  • crates/lab/src/device/runtime.rs
  • crates/lab/src/lib.rs
  • crates/lab/tests/device_api.rs
  • crates/lab/tests/device_config.rs
  • crates/lab/tests/device_identity.rs
  • crates/lab/tests/device_master_only.rs
  • crates/lab/tests/device_queue.rs
  • crates/lab/tests/device_runtime.rs
  • crates/lab/tests/device_scan.rs
  • docs/GATEWAY.md
  • docs/OAUTH.md
  • docs/OPERATIONS.md

Comment thread crates/lab-apis/src/core/http.rs Outdated
Comment thread crates/lab-apis/src/device_runtime.rs
Comment thread crates/lab-apis/src/device_runtime/client.rs
Comment thread crates/lab-apis/src/device_runtime/client.rs
Comment thread crates/lab/src/cli/device.rs
Comment thread crates/lab/src/device/queue.rs
Comment thread crates/lab/src/device/runtime.rs
Comment thread crates/lab/tests/device_master_only.rs Outdated
Comment thread crates/lab/tests/device_queue.rs
Comment thread docs/OPERATIONS.md
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 8 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/lab/src/device/identity.rs">

<violation number="1" location="crates/lab/src/device/identity.rs:67">
P2: IP host comparison short-circuits to `false`, so equivalent IPs with different textual forms are misclassified as different devices.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/lab/src/device/identity.rs Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
crates/lab/src/device/runtime.rs (1)

125-146: ⚠️ Potential issue | 🟠 Major

Unknown envelope kind still bricks the queue.

When envelope.kind is unrecognized (line 134), the code returns Err(anyhow!(...)), which triggers ack_processed_prefix at line 142 with only the successfully delivered count. The unrecognized envelope remains at the head of the queue, blocking all future flushes.

The previous review suggested skipping/dropping unknown envelopes with a warning. Consider incrementing ack_count for unknown kinds to drain them:

🐛 Proposed fix
             "status" => match serde_json::from_value::<DeviceStatus>(envelope.payload)
                 .context("decode queued status envelope")
             {
                 Ok(status) => client.post_status(&status).await,
                 Err(error) => Err(error),
             },
-            other => Err(anyhow!("unsupported queued envelope kind `{other}`")),
+            other => {
+                tracing::warn!(kind = %other, "dropping unsupported queued envelope");
+                Ok(()) // Treat as successfully "delivered" to drain it
+            }
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/device/runtime.rs` around lines 125 - 146, The code currently
treats an unknown envelope.kind as an Err which prevents acking that envelope
and blocks the queue; change the "other" match arm so unknown kinds are logged
as a warning and treated as successfully processed (i.e., set delivery = Ok(()))
instead of Err(anyhow!(...)), so ack_count will be incremented and
ack_processed_prefix will be able to drain past it; update the "other" arm near
the match on envelope.kind and ensure the warning includes envelope.kind (and
maybe envelope.id) so developers can investigate, leaving the existing behavior
for client.post_syslog_batch and client.post_status unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/lab/src/device/identity.rs`:
- Around line 66-69: The comparison block that matches
(local_host.parse::<IpAddr>(), master_host.parse::<IpAddr>()) incorrectly
returns false when one side is an IP and the other a hostname (e.g., "127.0.0.1"
vs "localhost"); update the logic in the function containing local_host and
master_host so that when one side parses to an IpAddr and the other is a
hostname you check for loopback equivalence: if the parsed IpAddr.is_loopback()
and the hostname either equals "localhost" (case-insensitive) or resolves (via
DNS/ToSocketAddrs or lookup_host) to at least one loopback address, treat them
as equal; otherwise keep returning false. Use the existing local_host and
master_host identifiers to locate and change that match arm.

In `@crates/lab/src/device/log_collect.rs`:
- Around line 24-40: The tail reader can start mid-line so the first entry from
read_tail(...) may be a fragmented record; detect this by checking the first
byte of raw (e.g. raw.first()) and if it is not a newline, drop the first parsed
line before mapping to DeviceLogEvent (e.g. after .lines() use .skip(1) when
raw.first().map(|b| *b != b'\n') is true). Apply the same safeguard in the other
similar block around the code handling BOOTSTRAP_LOG_LINE_LIMIT / DeviceLogEvent
(the region noted as lines 61-69).

In `@crates/lab/src/device/queue.rs`:
- Around line 146-154: The current tmp file suffix is generated from
SystemTime::now().as_nanos() (the suffix variable) which could theoretically
collide; replace the timestamp suffix generation used when building tmp_path via
path.with_file_name(...) with a random suffix such as a UUID or a random u64
(e.g., using uuid::Uuid::new_v4() or rand::thread_rng().gen::<u64>()), update
Cargo.toml to add the chosen crate (uuid or rand), and keep the same file_name
extraction logic so tmp_path is created as "{file_name}.{random_suffix}.tmp"
instead of using SystemTime::now().

---

Duplicate comments:
In `@crates/lab/src/device/runtime.rs`:
- Around line 125-146: The code currently treats an unknown envelope.kind as an
Err which prevents acking that envelope and blocks the queue; change the "other"
match arm so unknown kinds are logged as a warning and treated as successfully
processed (i.e., set delivery = Ok(())) instead of Err(anyhow!(...)), so
ack_count will be incremented and ack_processed_prefix will be able to drain
past it; update the "other" arm near the match on envelope.kind and ensure the
warning includes envelope.kind (and maybe envelope.id) so developers can
investigate, leaving the existing behavior for client.post_syslog_batch and
client.post_status unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6bc8407e-6093-4da9-88ef-410252dd6528

📥 Commits

Reviewing files that changed from the base of the PR and between 49d6d4c and ad5c119.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock and included by **/*
📒 Files selected for processing (35)
  • Cargo.toml
  • crates/lab-apis/Cargo.toml
  • crates/lab-apis/src/device_runtime.rs
  • crates/lab-apis/src/device_runtime/client.rs
  • crates/lab-apis/src/device_runtime/types.rs
  • crates/lab-auth/src/authorize.rs
  • crates/lab-auth/src/token.rs
  • crates/lab/src/api/device/oauth.rs
  • crates/lab/src/api/device/syslog.rs
  • crates/lab/src/api/router.rs
  • crates/lab/src/api/services/helpers.rs
  • crates/lab/src/cli.rs
  • crates/lab/src/cli/device.rs
  • crates/lab/src/cli/gateway.rs
  • crates/lab/src/cli/logs.rs
  • crates/lab/src/cli/serve.rs
  • crates/lab/src/device/config_scan.rs
  • crates/lab/src/device/identity.rs
  • crates/lab/src/device/log_collect.rs
  • crates/lab/src/device/master_client.rs
  • crates/lab/src/device/queue.rs
  • crates/lab/src/device/runtime.rs
  • crates/lab/src/dispatch/clients.rs
  • crates/lab/src/dispatch/gateway/dispatch.rs
  • crates/lab/src/dispatch/gateway/manager.rs
  • crates/lab/src/dispatch/gateway/service_catalog.rs
  • crates/lab/src/dispatch/helpers.rs
  • crates/lab/src/dispatch/upstream/pool.rs
  • crates/lab/src/oauth/target.rs
  • crates/lab/src/registry.rs
  • crates/lab/tests/device_cli.rs
  • crates/lab/tests/device_master_only.rs
  • crates/lab/tests/device_queue.rs
  • crates/lab/tests/device_runtime.rs
  • docs/OPERATIONS.md

Comment on lines +66 to +69
match (local_host.parse::<IpAddr>(), master_host.parse::<IpAddr>()) {
(Ok(local_ip), Ok(master_ip)) => return local_ip == master_ip,
(Ok(_), Err(_)) | (Err(_), Ok(_)) => return false,
(Err(_), Err(_)) => {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle loopback hostname/IP equivalence in role resolution.
Line [68] currently forces false when one side is an IP and the other is a hostname. That misclassifies localhost vs 127.0.0.1/::1 as different devices and can incorrectly assign NonMaster.

💡 Proposed patch
 fn hosts_refer_to_same_device(local_host: &str, master_host: &str) -> bool {
@@
     match (local_host.parse::<IpAddr>(), master_host.parse::<IpAddr>()) {
         (Ok(local_ip), Ok(master_ip)) => return local_ip == master_ip,
-        (Ok(_), Err(_)) | (Err(_), Ok(_)) => return false,
+        (Ok(local_ip), Err(_)) => {
+            return local_ip.is_loopback() && is_loopback_hostname(master_host);
+        }
+        (Err(_), Ok(master_ip)) => {
+            return master_ip.is_loopback() && is_loopback_hostname(local_host);
+        }
         (Err(_), Err(_)) => {}
     }
@@
 }
+
+fn is_loopback_hostname(value: &str) -> bool {
+    matches!(value, "localhost" | "localhost.localdomain")
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
match (local_host.parse::<IpAddr>(), master_host.parse::<IpAddr>()) {
(Ok(local_ip), Ok(master_ip)) => return local_ip == master_ip,
(Ok(_), Err(_)) | (Err(_), Ok(_)) => return false,
(Err(_), Err(_)) => {}
fn hosts_refer_to_same_device(local_host: &str, master_host: &str) -> bool {
match (local_host.parse::<IpAddr>(), master_host.parse::<IpAddr>()) {
(Ok(local_ip), Ok(master_ip)) => return local_ip == master_ip,
(Ok(local_ip), Err(_)) => {
return local_ip.is_loopback() && is_loopback_hostname(master_host);
}
(Err(_), Ok(master_ip)) => {
return master_ip.is_loopback() && is_loopback_hostname(local_host);
}
(Err(_), Err(_)) => {}
}
}
fn is_loopback_hostname(value: &str) -> bool {
matches!(value, "localhost" | "localhost.localdomain")
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/device/identity.rs` around lines 66 - 69, The comparison block
that matches (local_host.parse::<IpAddr>(), master_host.parse::<IpAddr>())
incorrectly returns false when one side is an IP and the other a hostname (e.g.,
"127.0.0.1" vs "localhost"); update the logic in the function containing
local_host and master_host so that when one side parses to an IpAddr and the
other is a hostname you check for loopback equivalence: if the parsed
IpAddr.is_loopback() and the hostname either equals "localhost"
(case-insensitive) or resolves (via DNS/ToSocketAddrs or lookup_host) to at
least one loopback address, treat them as equal; otherwise keep returning false.
Use the existing local_host and master_host identifiers to locate and change
that match arm.

Comment on lines +24 to +40
let raw = match read_tail(path, BOOTSTRAP_LOG_BYTE_LIMIT as usize) {
Ok(raw) => raw,
Err(error) => {
tracing::debug!(path = %path.display(), error = %error, "skipping unreadable bootstrap log candidate");
continue;
}
};
let timestamp_unix_ms = SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap_or_default()
.as_millis() as i64;
let events = String::from_utf8_lossy(&raw)
.lines()
.filter(|line| !line.trim().is_empty())
.rev()
.take(BOOTSTRAP_LOG_LINE_LIMIT)
.map(|line| DeviceLogEvent {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Drop the first parsed line when the tail starts mid-record.

When read_tail seeks into the middle of a file, the first lines() entry is a fragment and gets uploaded as a bogus event.

💡 Suggested fix
-        let raw = match read_tail(path, BOOTSTRAP_LOG_BYTE_LIMIT as usize) {
-            Ok(raw) => raw,
+        let (raw, truncated_head) = match read_tail(path, BOOTSTRAP_LOG_BYTE_LIMIT as usize) {
+            Ok(raw) => raw,
             Err(error) => {
                 tracing::debug!(path = %path.display(), error = %error, "skipping unreadable bootstrap log candidate");
                 continue;
             }
         };
@@
-        let events = String::from_utf8_lossy(&raw)
-            .lines()
-            .filter(|line| !line.trim().is_empty())
+        let mut lines = String::from_utf8_lossy(&raw)
+            .lines()
+            .filter(|line| !line.trim().is_empty())
+            .collect::<Vec<_>>();
+        if truncated_head && !lines.is_empty() {
+            lines.remove(0);
+        }
+        let events = lines
+            .into_iter()
             .rev()
             .take(BOOTSTRAP_LOG_LINE_LIMIT)
             .map(|line| DeviceLogEvent {
@@
-fn read_tail(path: &Path, max_len: usize) -> std::io::Result<Vec<u8>> {
+fn read_tail(path: &Path, max_len: usize) -> std::io::Result<(Vec<u8>, bool)> {
     let mut file = fs::File::open(path)?;
     let len = file.metadata()?.len() as usize;
     let start = len.saturating_sub(max_len) as u64;
     file.seek(SeekFrom::Start(start))?;
     let mut buf = Vec::with_capacity(len.saturating_sub(start as usize));
     file.read_to_end(&mut buf)?;
-    Ok(buf)
+    Ok((buf, start > 0))
 }

Also applies to: 61-69

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/device/log_collect.rs` around lines 24 - 40, The tail reader
can start mid-line so the first entry from read_tail(...) may be a fragmented
record; detect this by checking the first byte of raw (e.g. raw.first()) and if
it is not a newline, drop the first parsed line before mapping to DeviceLogEvent
(e.g. after .lines() use .skip(1) when raw.first().map(|b| *b != b'\n') is
true). Apply the same safeguard in the other similar block around the code
handling BOOTSTRAP_LOG_LINE_LIMIT / DeviceLogEvent (the region noted as lines
61-69).

Comment on lines +146 to +154
let suffix = SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap_or_default()
.as_nanos();
let file_name = path
.file_name()
.and_then(|name| name.to_str())
.unwrap_or("queue");
let tmp_path = path.with_file_name(format!("{file_name}.{suffix}.tmp"));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider using a random suffix for temp file uniqueness.

The nanosecond timestamp suffix could theoretically collide if two rewrites occur within the same nanosecond on a fast system. While unlikely in practice, using a random suffix (e.g., from rand or uuid) would eliminate this edge case entirely.

♻️ Optional improvement
+use std::sync::atomic::{AtomicU64, Ordering};
+
+static REWRITE_COUNTER: AtomicU64 = AtomicU64::new(0);
+
 async fn rewrite_entries(path: &Path, entries: &[QueuedEnvelope]) -> Result<()> {
     // ...
-    let suffix = SystemTime::now()
-        .duration_since(UNIX_EPOCH)
-        .unwrap_or_default()
-        .as_nanos();
+    let counter = REWRITE_COUNTER.fetch_add(1, Ordering::Relaxed);
+    let suffix = format!("{}.{}", std::process::id(), counter);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let suffix = SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap_or_default()
.as_nanos();
let file_name = path
.file_name()
.and_then(|name| name.to_str())
.unwrap_or("queue");
let tmp_path = path.with_file_name(format!("{file_name}.{suffix}.tmp"));
let counter = REWRITE_COUNTER.fetch_add(1, Ordering::Relaxed);
let suffix = format!("{}.{}", std::process::id(), counter);
let file_name = path
.file_name()
.and_then(|name| name.to_str())
.unwrap_or("queue");
let tmp_path = path.with_file_name(format!("{file_name}.{suffix}.tmp"));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/lab/src/device/queue.rs` around lines 146 - 154, The current tmp file
suffix is generated from SystemTime::now().as_nanos() (the suffix variable)
which could theoretically collide; replace the timestamp suffix generation used
when building tmp_path via path.with_file_name(...) with a random suffix such as
a UUID or a random u64 (e.g., using uuid::Uuid::new_v4() or
rand::thread_rng().gen::<u64>()), update Cargo.toml to add the chosen crate
(uuid or rand), and keep the same file_name extraction logic so tmp_path is
created as "{file_name}.{random_suffix}.tmp" instead of using SystemTime::now().

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 37 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/lab/src/device/log_collect.rs">

<violation number="1" location="crates/lab/src/device/log_collect.rs:35">
P2: `read_tail` may seek into the middle of the file (when the file is larger than `BOOTSTRAP_LOG_BYTE_LIMIT`), so the first item returned by `.lines()` is a truncated fragment. This bogus partial line gets uploaded as a log event. When `start > 0`, the first line should be dropped.</violation>
</file>

<file name="crates/lab/src/device/identity.rs">

<violation number="1" location="crates/lab/src/device/identity.rs:68">
P1: When one side is an IP and the other is a hostname, this returns `false` unconditionally. This misclassifies `localhost` vs `127.0.0.1` (or `::1`) as different devices, which can incorrectly assign `NonMaster` role to a host that should be master. Add loopback equivalence handling so that a loopback IP matched against `localhost` is treated as the same device.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


match (local_host.parse::<IpAddr>(), master_host.parse::<IpAddr>()) {
(Ok(local_ip), Ok(master_ip)) => return local_ip == master_ip,
(Ok(_), Err(_)) | (Err(_), Ok(_)) => return false,
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: When one side is an IP and the other is a hostname, this returns false unconditionally. This misclassifies localhost vs 127.0.0.1 (or ::1) as different devices, which can incorrectly assign NonMaster role to a host that should be master. Add loopback equivalence handling so that a loopback IP matched against localhost is treated as the same device.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/lab/src/device/identity.rs, line 68:

<comment>When one side is an IP and the other is a hostname, this returns `false` unconditionally. This misclassifies `localhost` vs `127.0.0.1` (or `::1`) as different devices, which can incorrectly assign `NonMaster` role to a host that should be master. Add loopback equivalence handling so that a loopback IP matched against `localhost` is treated as the same device.</comment>

<file context>
@@ -62,14 +63,13 @@ fn hosts_refer_to_same_device(local_host: &str, master_host: &str) -> bool {
-        return false;
+    match (local_host.parse::<IpAddr>(), master_host.parse::<IpAddr>()) {
+        (Ok(local_ip), Ok(master_ip)) => return local_ip == master_ip,
+        (Ok(_), Err(_)) | (Err(_), Ok(_)) => return false,
+        (Err(_), Err(_)) => {}
     }
</file context>
Suggested change
(Ok(_), Err(_)) | (Err(_), Ok(_)) => return false,
(Ok(local_ip), Err(_)) => {
return local_ip.is_loopback() && matches!(master_host, "localhost" | "localhost.localdomain");
}
(Err(_), Ok(master_ip)) => {
return master_ip.is_loopback() && matches!(local_host, "localhost" | "localhost.localdomain");
}
Fix with Cubic

.duration_since(UNIX_EPOCH)
.unwrap_or_default()
.as_millis() as i64;
let events = String::from_utf8_lossy(&raw)
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: read_tail may seek into the middle of the file (when the file is larger than BOOTSTRAP_LOG_BYTE_LIMIT), so the first item returned by .lines() is a truncated fragment. This bogus partial line gets uploaded as a log event. When start > 0, the first line should be dropped.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/lab/src/device/log_collect.rs, line 35:

<comment>`read_tail` may seek into the middle of the file (when the file is larger than `BOOTSTRAP_LOG_BYTE_LIMIT`), so the first item returned by `.lines()` is a truncated fragment. This bogus partial line gets uploaded as a log event. When `start > 0`, the first line should be dropped.</comment>

<file context>
@@ -20,19 +21,18 @@ pub fn collect_bootstrap_logs(device_id: &str) -> Result<Vec<DeviceLogEvent>> {
             .unwrap_or_default()
             .as_millis() as i64;
-        let events = String::from_utf8_lossy(slice)
+        let events = String::from_utf8_lossy(&raw)
             .lines()
             .filter(|line| !line.trim().is_empty())
</file context>
Fix with Cubic

@jmagar jmagar merged commit e621636 into main Apr 16, 2026
6 of 9 checks passed
@jmagar jmagar deleted the device-runtime-master branch April 16, 2026 06:19
jmagar added a commit that referenced this pull request Apr 18, 2026
…one, callback ext, reload reconcile

Resolves review threads #5, #6, #10, #11, #12, #13, #23, #50/#71.

upstream_oauth.rs (threads 5, 23):
- Callback: embed error_kind in redirect URL query params instead of
  x-lab-oauth-error-kind header (browsers silently discard headers on
  302 responses)
- Callback: extract AuthContext via Option<Extension<AuthContext>>
  instead of reconstructing Parts from empty request (which discarded
  middleware extensions); update callback_subject signature accordingly

pool.rs (threads 6, 11):
- subject_scoped_call_tool: add circuit breaker calls (record_success_for
  / record_failure_for) around the peer call
- subject_scoped_read_resource: add circuit breaker calls AND response
  size guard matching the non-scoped read_upstream_resource path
- subject_scoped_get_prompt: add circuit breaker calls

server.rs (thread 12):
- Subject-scoped dispatch path now emits tracing::info! on success and
  tracing::warn! on failure, matching the non-subject-scoped path

cache.rs (thread 13):
- get_or_build: clone DashMap Ref before the .await call on
  build_auth_client to avoid holding a DashMap read-lock across await
  (potential deadlock under contention)

manager.rs (thread 10):
- reload: reconcile upstream_oauth_managers after loading new config;
  remove managers for OAuth upstreams no longer present, warn about
  new OAuth upstreams that need restart to get a manager

Threads 50/71 (TokenRefreshFailed → NeedsReauth): already mapped
correctly in map_auth_error; no change needed.
jmagar added a commit that referenced this pull request Apr 19, 2026
* spike: validate rmcp AuthClient integration with StreamableHttpClientWorker

Task 0 (gating spike) for the upstream MCP OAuth PKCE plan. Confirms four
integration points against rmcp 1.4.0 before Task 2 commits to a design:

1. AuthClient<reqwest::Client> constructs cleanly over AuthorizationManager
   + InMemoryCredentialStore.
2. AuthClient auto-injects Authorization: Bearer <token> when the caller
   passes auth_token: None — its StreamableHttpClient impl calls
   auth_manager.get_access_token() and fills the slot before delegating.
3. rmcp does NOT automatically refresh on a 401 from the upstream.
   AuthorizationManager::get_access_token() only refreshes on the local
   clock (REFRESH_BUFFER_SECS = 30s). Refresh-on-401 is the caller's
   responsibility, so Task 2 must layer it on.
4. Spike runs against a wiremock AS+RS stub by default, and against a
   real OAuth-protected MCP upstream when SPIKE_REAL_AS_URL is set, so
   the operator can validate end-to-end interactively before Task 2
   starts.

Plan A (AuthClient as StreamableHttpClient) is confirmed; Plan B (custom
wrapper that calls get_access_token() pre-request) is strictly inferior
and not needed.

Findings duplicated inline at the top of the spike example AND in the
stub crates/lab/src/oauth/upstream/refresh.rs that Task 2 will replace.
The upstream/* files are intentionally NOT wired into oauth.rs yet —
they are exploratory docs.

* feat: add rmcp-backed upstream oauth manager with single-flight refresh and at-rest encryption

- store.rs: SqliteCredentialStore + SqliteStateStore implementing rmcp CredentialStore/StateStore traits
  - ChaCha20-Poly1305 encryption at rest; decryption failure → AuthorizationRequired
  - StateStore::load uses atomic DELETE…RETURNING (take_upstream_oauth_state); delete is no-op
  - Two-lifetime pattern ('life0: 'async_trait, Self: 'async_trait) matching async_trait expansion
- refresh.rs: RefreshLocks (DashMap per-(upstream,subject) Mutex) + refresh_if_stale()
  - Single-flight serialization prevents concurrent refresh storms
  - AuthError::AuthorizationRequired → OauthError::NeedsReauth
- manager.rs: UpstreamOauthManager orchestrates full authorization_code+PKCE flow
  - begin_authorization: discovers/caches AS metadata, enforces S256, saves PKCE state
  - complete_authorization_callback: exchanges code, persists encrypted credentials
  - clear_credentials: deletes tokens + evicts pending state
  - build_auth_client: creates fresh AuthClient from stored credentials with proactive refresh
  - Supports Preregistered and Dynamic registration; ClientMetadataDocument is TODO
- Cargo.toml: promote oauth2 from dev-dep to regular dep (needed for TokenResponse trait methods)

* feat(upstream-oauth): wire HTTP routes and AppState for upstream OAuth callback

Task 3: mount /v1/upstream-oauth/:name/{start,callback} routes in the axum
router, guarded by upstream_oauth.is_some(). Add upstream_oauth field to
AppState with with_upstream_oauth() builder. Routes are outside the bearer
middleware — browser redirects from the AS cannot carry Authorization headers.

* feat(upstream): wire per-(upstream,subject) AuthClient cache into UpstreamPool

Add oauth_managers field to UpstreamPool (DashMap keyed by upstream name),
with_oauth_managers() builder for injection at serve time, and OAuth branch
in connect_http_upstream: looks up manager, calls build_auth_client("default"),
wraps the AuthClient in StreamableHttpClientWorker. Non-OAuth path unchanged.
oauth_required error on NeedsReauth marks upstream unhealthy via discover_all.

* docs: add upstream oauth pkce gateway guidance and error kinds

* feat(upstream-oauth): enforce S256, issuer binding, canonical resource, CIMD

Completes Task 2 §6 spec-aligned invariants identified by plan audit:

- Must-Fix #1 S256 enforcement: verify_s256 now rejects AS metadata that
  omits `code_challenge_methods_supported` (previously warned and continued)
  or advertises only non-S256 methods. Both paths surface
  `oauth_unsupported_method`.

- Must-Fix #4 canonical resource indicator: UpstreamConfig gains
  `canonical_url()` applying RFC 3986 §6.2.2 normalization at
  validation time (lowercase scheme+host, strip default port,
  dot-segment removal, percent-encoding case). Manager uses the
  canonical form when constructing the AuthorizationManager so
  rmcp's `resource` parameter on authorize and token is byte-identical
  to the canonical upstream URL. Known gap: rmcp 1.4 does not re-emit
  `resource` on the refresh_token grant; documented in UPSTREAM.md.

- Must-Fix #5 issuer binding (scope 6b): verify_issuer_binding requires
  `metadata.issuer` to be present and enforces host-consistency across
  authorization_endpoint, token_endpoint, and registration_endpoint
  (when present). Cannot duplicate rmcp discovery to bind against the
  successful discovery URL, so the check is approximated via endpoint
  host consistency. Violations surface as `oauth_issuer_mismatch`.

- CIMD registration: ClientMetadataDocument strategy now constructs the
  OAuth client locally, using the metadata document URL as the
  client_id. No registration_endpoint call is issued.

- Must-Fix #6 reactive 401 (scope 6b): deferred. rmcp's
  StreamableHttpClientWorker hides the raw HTTP response, so a 401 on
  an MCP call surfaces as a generic transport error. Operators
  recover via `POST /v1/gateway/oauth/start`. Documented in UPSTREAM.md
  so the doc no longer promises retry semantics that code does not
  implement.

Also ships Task 2/3 scaffolding kept uncommitted in the worktree:
UpstreamOauthCredentialRow + UpstreamOauthStateRow (manual redacted
Debug), UpstreamOauthConfig + UpstreamOauthRegistration enum (CIMD,
Preregistered, Dynamic), and `oauth: None` test-fixture fill-ins.

Tests:
- crates/lab/tests/upstream_oauth.rs (8 tests)
  - canonical_url_strips_default_port_and_lowercases_host
  - missing_code_challenge_methods_returns_unsupported
  - plain_pkce_only_returns_unsupported
  - authorize_url_carries_canonical_resource_indicator
  - token_exchange_carries_canonical_resource_indicator
  - issuer_missing_returns_issuer_mismatch
  - issuer_endpoint_host_mismatch_returns_issuer_mismatch
  - cimd_registration_uses_metadata_url_as_client_id

* feat: wire subject-scoped upstream oauth cache

* feat: finish upstream oauth gateway wiring and verification

* fix: align upstream oauth HTTP surface with ERRORS.md spec and add dispatch telemetry

Add elapsed_ms field to all four OAuth handlers (start, status, clear, callback)
so every dispatch event includes surface/service/action/elapsed_ms per OBSERVABILITY.md.

Change clear handler confirmation_required response from 400 plain-text to 422 JSON
envelope (ToolError::Sdk) to match the documented spec in ERRORS.md. Update
accompanying test to assert 422 UNPROCESSABLE_ENTITY and JSON kind field.

Resolves review threads:
  PRRT_kwDOR8nC1M576Bjh (missing dispatch logs)

Note: router callback placement, HTML escaping (html_escape helper already present),
subject validation (subject sourced from JWT auth.sub not query param), and axum
path syntax were all already correctly implemented — no changes needed.

* fix: harden upstream oauth manager - encryption, TOCTOU, issuer binding, client secret

seal()/seal_with_aad() now return Result<_, EncryptionError> instead of panicking via
.expect(); store.rs propagates the error as AuthError::InternalError.

get_or_discover_metadata holds the write lock across discovery to eliminate the
read-lock-drop-write-lock TOCTOU race where two callers could both issue discovery.

verify_issuer_binding changed from host-only comparison to full origin comparison
(scheme + host + explicit port) so http/https scheme and port differences are caught.

resolve_client_config now returns OauthError::Internal when client_secret_env names
an env var that is not set or empty, instead of silently using an empty secret.

Resolves review threads:
  PRRT_kwDOR8nC1M576Bjk  (seal() panics - encryption.rs)
  PRRT_kwDOR8nC1M576Cmu  (seal() panics - encryption.rs)
  PRRT_kwDOR8nC1M576BYd  (refresh lock TOCTOU)
  PRRT_kwDOR8nC1M576Cmn  (metadata cache TOCTOU)
  PRRT_kwDOR8nC1M576ez2  (issuer binding host-only)
  PRRT_kwDOR8nC1M576Bjm  (missing client_secret env var silent)
  PRRT_kwDOR8nC1M576Cmp  (missing client_secret env var silent)

Note: extract_state_param None-guard (FIX E) and in-memory PKCE map TTL (FIX I)
were both already correctly implemented in this codebase — no changes needed.

* fix: enforce upstream config validation at startup and fix cleanup_expired predicate

UpstreamConfig::validate() is now called for each upstream in load_toml(), so invalid
configs (bad URL scheme, conflicting auth fields) are caught at startup rather than
at first OAuth flow attempt. validate() also now rejects non-http/https URL schemes.

cleanup_expired changes both DELETE predicates from < to <= so rows expiring exactly
at the current timestamp are cleaned up consistently with the rest of the expiry checks.

Resolves review threads:
  PRRT_kwDOR8nC1M576ez0  (validate() never called at startup)
  PRRT_kwDOR8nC1M576dqS  (config.rs critical - URL scheme not validated)
  PRRT_kwDOR8nC1M576dqQ  (cleanup_expired < vs <= off-by-one)

Note: OauthError::Internal display/kind strings were already aligned in types.rs
(Display prefix matches kind() return value) — no change needed.

* fix: redact token from spike error, wire gateway validation, correct docs

spike_rmcp_auth_client.rs: remove token value from bail! error message to prevent
leaking access tokens into log output (use placeholder instead).

gateway/config.rs validate_upstream: call upstream.validate() at the start so that
bearer_token_env + oauth mutual-exclusion and other config constraints are enforced
in the gateway dispatch layer, not only in the top-level config loader.

docs/CONFIG.md: fix clear endpoint URL example to include required upstream= param.

docs/GATEWAY.md:
- Update clear endpoint description to reflect 422 JSON response (not 400 plain-text)
  and document required upstream= query param
- Correct callback-security section: remove claim that handler re-validates
  upstream-vs-state-row in application code (enforcement is via SQL primary key)

docs/UPSTREAM.md: correct claim that OAuth upstreams participate in startup discovery;
they are excluded from discover_all and connected per-request, not pooled.

Resolves review threads:
  PRRT_kwDOR8nC1M576Bjb  (token in spike error)
  PRRT_kwDOR8nC1M576dqT  (gateway/config.rs critical - validate not called)
  PRRT_kwDOR8nC1M576dqW  (CONFIG.md clear URL missing upstream=)
  PRRT_kwDOR8nC1M576dqa  (GATEWAY.md:140 clear endpoint description)
  PRRT_kwDOR8nC1M576dqc  (GATEWAY.md:152 callback invariants not in code)
  PRRT_kwDOR8nC1M576dqe  (UPSTREAM.md:158 OAuth discovery claim inaccurate)

Note: with_oauth_managers() wiring (FIX G) was already correctly implemented via
with_oauth_client_cache() in cli/serve.rs. OAuth error kinds in ERRORS.md were
already documented. No changes needed for those items.

* fix(oauth): remove dead url_host fn, duplicate tool_error_from_oauth, stale comment

Resolves review threads #15, #3, #21.
- Remove #[allow(dead_code)] url_host() from manager.rs (thread 15)
- Remove duplicate pub tool_error_from_oauth from gateway/oauth.rs; the
  private copy in manager.rs is the only caller (thread 3)
- Drop unused OauthError import from oauth.rs
- Correct stale comment in spike_rmcp_auth_client.rs: mock returns 401
  to drive re-auth path, not 200 (thread 21)

* docs(oauth): align upstream OAuth docs with implementation

Resolves review threads #4, #17, #18, #19, #24, #25, #26, #27, #28,
#29, #30, #36, #37, #38.

UPSTREAM.md:
- Thread 4: OAuth upstreams are attempted at startup and fail unhealthy
  (not excluded entirely)
- Thread 18: Issuer binding checks origin (scheme+host+port), not just
  host; covers auth/token/revocation/userinfo endpoints
- Thread 19: Remove false LRU-cap claim; lock entries live for process
  lifetime
- Thread 24: Merged catalog is transport-neutral; OAuth upstreams appear
  in catalog but need HTTP session to initiate authorization
- Thread 28: POST /v1/gateway/oauth/start route references are correct
- Thread 38: Remove auto-delete claim on invalid_grant (not implemented)

GATEWAY.md:
- Thread 17: Pending state SQL key is (upstream_name, subject,
  csrf_token), not just (upstream_name, csrf_token)
- Thread 25: Reload eagerly evicts all OAuth AuthClient entries; remove
  false built_with_client_id eviction-on-mismatch claim
- Thread 26: Routes /v1/gateway/oauth/* are implemented as documented
- Thread 27: Callback is browser-facing; subject from session cookie,
  not from state parameter

ERRORS.md:
- Thread 30: Remove "(RFC 7636 absence implies plain-only)" — omission
  of code_challenge_methods_supported is not equivalent to plain-only
- Thread 36: oauth_issuer_mismatch triggers on missing issuer or
  endpoint/issuer origin mismatch, not direct discovered-URL equality
- Thread 37: Route references /v1/gateway/oauth/status and
  POST /v1/gateway/oauth/start are correct (no /v1/upstream-oauth/ routes)

* fix(oauth): evict build_locks, fix param attribution, https guard, TTL guard, epoch default

Resolves review threads #1, #2/#9, #16, #20, #31, #32, #35.

cache.rs (thread 1):
- evict_subject and evict_upstream now also remove entries from
  build_locks, preventing unbounded growth on long-running processes

config.rs (threads 2/9):
- validate_upstream maps ConfigError::InvalidUrl to param="url" instead
  of "bearer_token_env"; auth-conflict errors still map to bearer_token_env

manager.rs (thread 35):
- ClientMetadataDocument URL validation now enforces https scheme;
  http URLs are rejected with OauthError::Internal

store.rs (thread 16):
- token_received_at falls back to now_unix() instead of 0 (Unix epoch)
  when absent; prevents access_token_expires_at underflow for tokens
  that don't carry a received_at timestamp

sqlite.rs (threads 20, 31, 32):
- TTL guard now also rejects expires_at <= created_at (negative delta)
  to prevent integer underflow on malicious/clock-skewed input
- cleanup queries already used <=; threads 31/32 already resolved

* fix(oauth): stdio ordering, oauth URL guard, is_master gating

Resolves review threads #7, #8, #22.

serve.rs (thread 7):
- Compute stdio_mode before build_upstream_oauth_runtime; skip OAuth
  runtime init entirely in stdio mode so missing LAB_PUBLIC_URL /
  LAB_OAUTH_ENCRYPTION_KEY never fails a stdio serve

config.rs / gateway config.rs (thread 8):
- UpstreamConfig::validate now rejects oauth+no-url combinations with
  ConfigError::MissingOauthUrl; gateway config dispatch maps the new
  variant to param="url"

router.rs (thread 22):
- Gateway OAuth routes (/v1/gateway/oauth/*) and browser callback are
  now guarded by is_master; non-master nodes no longer mount them

* fix(oauth): redirect error kind, circuit breaker, tracing, DashMap clone, callback ext, reload reconcile

Resolves review threads #5, #6, #10, #11, #12, #13, #23, #50/#71.

upstream_oauth.rs (threads 5, 23):
- Callback: embed error_kind in redirect URL query params instead of
  x-lab-oauth-error-kind header (browsers silently discard headers on
  302 responses)
- Callback: extract AuthContext via Option<Extension<AuthContext>>
  instead of reconstructing Parts from empty request (which discarded
  middleware extensions); update callback_subject signature accordingly

pool.rs (threads 6, 11):
- subject_scoped_call_tool: add circuit breaker calls (record_success_for
  / record_failure_for) around the peer call
- subject_scoped_read_resource: add circuit breaker calls AND response
  size guard matching the non-scoped read_upstream_resource path
- subject_scoped_get_prompt: add circuit breaker calls

server.rs (thread 12):
- Subject-scoped dispatch path now emits tracing::info! on success and
  tracing::warn! on failure, matching the non-subject-scoped path

cache.rs (thread 13):
- get_or_build: clone DashMap Ref before the .await call on
  build_auth_client to avoid holding a DashMap read-lock across await
  (potential deadlock under contention)

manager.rs (thread 10):
- reload: reconcile upstream_oauth_managers after loading new config;
  remove managers for OAuth upstreams no longer present, warn about
  new OAuth upstreams that need restart to get a manager

Threads 50/71 (TokenRefreshFailed → NeedsReauth): already mapped
correctly in map_auth_error; no change needed.

* fix(oauth): dynamic registration once per upstream, not per call

Resolves review thread PRRT_kwDOR8nC1M579vdo

`configured_authorization_manager` was calling `register_client` on
every invocation (complete_authorization_callback, build_auth_client),
receiving a new AS-assigned client_id each time — mismatching the id
used to start the flow.

Fix: `resolve_client_config` for Dynamic now:
  1. Checks stored credential row (available after token exchange)
  2. Checks in-memory `dynamic_client_ids` cache (populated by begin_authorization)
  3. Only calls register_client on the very first invocation

`clear_credentials` evicts the in-memory cache entry so a fresh
registration is issued when re-authorizing after credential clearance.

* fix(gateway): exhaustive ConfigError match, no catch-all for param attribution

Resolves review thread PRRT_kwDOR8nC1M579vdf

The wildcard arm `_ => param: "bearer_token_env"` was too broad: any
future ConfigError variant would be misattributed to bearer_token_env.

Replace the catch-all with an exhaustive match over all three ConfigError
variants so the compiler enforces correct attribution if new variants are
added. ConflictingAuth → bearer_token_env; MissingOauthUrl + InvalidUrl → url.

* fix(oauth): persist dynamic client registrations to SQLite

Dynamic registration (RFC 7591) must survive server restarts. Replace
the in-memory DashMap cache with a durable `upstream_oauth_dynamic_clients`
SQLite table (WITHOUT ROWID, PK: upstream_name + subject).

- Add `upstream_oauth_dynamic_clients` table to `open_connection` schema
- Add `save_dynamic_client_registration` (UPSERT), `find_dynamic_client_registration`
  (SELECT), and `delete_dynamic_client_registration` (DELETE) to `SqliteStore`
- Add round-trip test covering save, upsert, find, delete, and subject isolation
- Remove `dynamic_client_ids: Arc<DashMap<...>>` field from `UpstreamOauthManager`
- `resolve_client_config` Dynamic branch now checks SQLite before calling
  `register_client`, then persists the assigned client_id immediately
- `clear_credentials` also deletes the dynamic client registration row

client_secret is not stored: `token_endpoint_auth_method: "none"` means the
AS issues public clients and must not return a client_secret.

* feat(gateway-admin): upstream OAuth connection UI

Adds an "Upstream Connections" section to the Gateways page with per-upstream
Connect/Disconnect cards. Cards poll auth status, open the authorization URL
in a new tab, and wait for the callback before clearing the connecting state.

- GET /v1/gateway/oauth/upstreams — lists upstreams with oauth: config
- upstream-oauth-card: badge (Connected/Expiring/Disconnected), Connect/Disconnect
- upstream-oauth-section: grid of cards, null when no oauth upstreams configured
- Fix redirect loop: restore allow_session_cookie guard on browser redirect
- Fix redirect loop: switch v1 + MCP auth middleware to route_layer so unmatched
  SPA paths (e.g. /gateways) are not intercepted by auth middleware

* style: rustfmt router.rs
jmagar added a commit that referenced this pull request May 2, 2026
* spike: validate rmcp AuthClient integration with StreamableHttpClientWorker

Task 0 (gating spike) for the upstream MCP OAuth PKCE plan. Confirms four
integration points against rmcp 1.4.0 before Task 2 commits to a design:

1. AuthClient<reqwest::Client> constructs cleanly over AuthorizationManager
   + InMemoryCredentialStore.
2. AuthClient auto-injects Authorization: Bearer <token> when the caller
   passes auth_token: None — its StreamableHttpClient impl calls
   auth_manager.get_access_token() and fills the slot before delegating.
3. rmcp does NOT automatically refresh on a 401 from the upstream.
   AuthorizationManager::get_access_token() only refreshes on the local
   clock (REFRESH_BUFFER_SECS = 30s). Refresh-on-401 is the caller's
   responsibility, so Task 2 must layer it on.
4. Spike runs against a wiremock AS+RS stub by default, and against a
   real OAuth-protected MCP upstream when SPIKE_REAL_AS_URL is set, so
   the operator can validate end-to-end interactively before Task 2
   starts.

Plan A (AuthClient as StreamableHttpClient) is confirmed; Plan B (custom
wrapper that calls get_access_token() pre-request) is strictly inferior
and not needed.

Findings duplicated inline at the top of the spike example AND in the
stub crates/lab/src/oauth/upstream/refresh.rs that Task 2 will replace.
The upstream/* files are intentionally NOT wired into oauth.rs yet —
they are exploratory docs.

* feat: add rmcp-backed upstream oauth manager with single-flight refresh and at-rest encryption

- store.rs: SqliteCredentialStore + SqliteStateStore implementing rmcp CredentialStore/StateStore traits
  - ChaCha20-Poly1305 encryption at rest; decryption failure → AuthorizationRequired
  - StateStore::load uses atomic DELETE…RETURNING (take_upstream_oauth_state); delete is no-op
  - Two-lifetime pattern ('life0: 'async_trait, Self: 'async_trait) matching async_trait expansion
- refresh.rs: RefreshLocks (DashMap per-(upstream,subject) Mutex) + refresh_if_stale()
  - Single-flight serialization prevents concurrent refresh storms
  - AuthError::AuthorizationRequired → OauthError::NeedsReauth
- manager.rs: UpstreamOauthManager orchestrates full authorization_code+PKCE flow
  - begin_authorization: discovers/caches AS metadata, enforces S256, saves PKCE state
  - complete_authorization_callback: exchanges code, persists encrypted credentials
  - clear_credentials: deletes tokens + evicts pending state
  - build_auth_client: creates fresh AuthClient from stored credentials with proactive refresh
  - Supports Preregistered and Dynamic registration; ClientMetadataDocument is TODO
- Cargo.toml: promote oauth2 from dev-dep to regular dep (needed for TokenResponse trait methods)

* feat(upstream-oauth): wire HTTP routes and AppState for upstream OAuth callback

Task 3: mount /v1/upstream-oauth/:name/{start,callback} routes in the axum
router, guarded by upstream_oauth.is_some(). Add upstream_oauth field to
AppState with with_upstream_oauth() builder. Routes are outside the bearer
middleware — browser redirects from the AS cannot carry Authorization headers.

* feat(upstream): wire per-(upstream,subject) AuthClient cache into UpstreamPool

Add oauth_managers field to UpstreamPool (DashMap keyed by upstream name),
with_oauth_managers() builder for injection at serve time, and OAuth branch
in connect_http_upstream: looks up manager, calls build_auth_client("default"),
wraps the AuthClient in StreamableHttpClientWorker. Non-OAuth path unchanged.
oauth_required error on NeedsReauth marks upstream unhealthy via discover_all.

* docs: add upstream oauth pkce gateway guidance and error kinds

* feat(upstream-oauth): enforce S256, issuer binding, canonical resource, CIMD

Completes Task 2 §6 spec-aligned invariants identified by plan audit:

- Must-Fix #1 S256 enforcement: verify_s256 now rejects AS metadata that
  omits `code_challenge_methods_supported` (previously warned and continued)
  or advertises only non-S256 methods. Both paths surface
  `oauth_unsupported_method`.

- Must-Fix #4 canonical resource indicator: UpstreamConfig gains
  `canonical_url()` applying RFC 3986 §6.2.2 normalization at
  validation time (lowercase scheme+host, strip default port,
  dot-segment removal, percent-encoding case). Manager uses the
  canonical form when constructing the AuthorizationManager so
  rmcp's `resource` parameter on authorize and token is byte-identical
  to the canonical upstream URL. Known gap: rmcp 1.4 does not re-emit
  `resource` on the refresh_token grant; documented in UPSTREAM.md.

- Must-Fix #5 issuer binding (scope 6b): verify_issuer_binding requires
  `metadata.issuer` to be present and enforces host-consistency across
  authorization_endpoint, token_endpoint, and registration_endpoint
  (when present). Cannot duplicate rmcp discovery to bind against the
  successful discovery URL, so the check is approximated via endpoint
  host consistency. Violations surface as `oauth_issuer_mismatch`.

- CIMD registration: ClientMetadataDocument strategy now constructs the
  OAuth client locally, using the metadata document URL as the
  client_id. No registration_endpoint call is issued.

- Must-Fix #6 reactive 401 (scope 6b): deferred. rmcp's
  StreamableHttpClientWorker hides the raw HTTP response, so a 401 on
  an MCP call surfaces as a generic transport error. Operators
  recover via `POST /v1/gateway/oauth/start`. Documented in UPSTREAM.md
  so the doc no longer promises retry semantics that code does not
  implement.

Also ships Task 2/3 scaffolding kept uncommitted in the worktree:
UpstreamOauthCredentialRow + UpstreamOauthStateRow (manual redacted
Debug), UpstreamOauthConfig + UpstreamOauthRegistration enum (CIMD,
Preregistered, Dynamic), and `oauth: None` test-fixture fill-ins.

Tests:
- crates/lab/tests/upstream_oauth.rs (8 tests)
  - canonical_url_strips_default_port_and_lowercases_host
  - missing_code_challenge_methods_returns_unsupported
  - plain_pkce_only_returns_unsupported
  - authorize_url_carries_canonical_resource_indicator
  - token_exchange_carries_canonical_resource_indicator
  - issuer_missing_returns_issuer_mismatch
  - issuer_endpoint_host_mismatch_returns_issuer_mismatch
  - cimd_registration_uses_metadata_url_as_client_id

* feat: wire subject-scoped upstream oauth cache

* feat: finish upstream oauth gateway wiring and verification

* fix: align upstream oauth HTTP surface with ERRORS.md spec and add dispatch telemetry

Add elapsed_ms field to all four OAuth handlers (start, status, clear, callback)
so every dispatch event includes surface/service/action/elapsed_ms per OBSERVABILITY.md.

Change clear handler confirmation_required response from 400 plain-text to 422 JSON
envelope (ToolError::Sdk) to match the documented spec in ERRORS.md. Update
accompanying test to assert 422 UNPROCESSABLE_ENTITY and JSON kind field.

Resolves review threads:
  PRRT_kwDOR8nC1M576Bjh (missing dispatch logs)

Note: router callback placement, HTML escaping (html_escape helper already present),
subject validation (subject sourced from JWT auth.sub not query param), and axum
path syntax were all already correctly implemented — no changes needed.

* fix: harden upstream oauth manager - encryption, TOCTOU, issuer binding, client secret

seal()/seal_with_aad() now return Result<_, EncryptionError> instead of panicking via
.expect(); store.rs propagates the error as AuthError::InternalError.

get_or_discover_metadata holds the write lock across discovery to eliminate the
read-lock-drop-write-lock TOCTOU race where two callers could both issue discovery.

verify_issuer_binding changed from host-only comparison to full origin comparison
(scheme + host + explicit port) so http/https scheme and port differences are caught.

resolve_client_config now returns OauthError::Internal when client_secret_env names
an env var that is not set or empty, instead of silently using an empty secret.

Resolves review threads:
  PRRT_kwDOR8nC1M576Bjk  (seal() panics - encryption.rs)
  PRRT_kwDOR8nC1M576Cmu  (seal() panics - encryption.rs)
  PRRT_kwDOR8nC1M576BYd  (refresh lock TOCTOU)
  PRRT_kwDOR8nC1M576Cmn  (metadata cache TOCTOU)
  PRRT_kwDOR8nC1M576ez2  (issuer binding host-only)
  PRRT_kwDOR8nC1M576Bjm  (missing client_secret env var silent)
  PRRT_kwDOR8nC1M576Cmp  (missing client_secret env var silent)

Note: extract_state_param None-guard (FIX E) and in-memory PKCE map TTL (FIX I)
were both already correctly implemented in this codebase — no changes needed.

* fix: enforce upstream config validation at startup and fix cleanup_expired predicate

UpstreamConfig::validate() is now called for each upstream in load_toml(), so invalid
configs (bad URL scheme, conflicting auth fields) are caught at startup rather than
at first OAuth flow attempt. validate() also now rejects non-http/https URL schemes.

cleanup_expired changes both DELETE predicates from < to <= so rows expiring exactly
at the current timestamp are cleaned up consistently with the rest of the expiry checks.

Resolves review threads:
  PRRT_kwDOR8nC1M576ez0  (validate() never called at startup)
  PRRT_kwDOR8nC1M576dqS  (config.rs critical - URL scheme not validated)
  PRRT_kwDOR8nC1M576dqQ  (cleanup_expired < vs <= off-by-one)

Note: OauthError::Internal display/kind strings were already aligned in types.rs
(Display prefix matches kind() return value) — no change needed.

* fix: redact token from spike error, wire gateway validation, correct docs

spike_rmcp_auth_client.rs: remove token value from bail! error message to prevent
leaking access tokens into log output (use placeholder instead).

gateway/config.rs validate_upstream: call upstream.validate() at the start so that
bearer_token_env + oauth mutual-exclusion and other config constraints are enforced
in the gateway dispatch layer, not only in the top-level config loader.

docs/CONFIG.md: fix clear endpoint URL example to include required upstream= param.

docs/GATEWAY.md:
- Update clear endpoint description to reflect 422 JSON response (not 400 plain-text)
  and document required upstream= query param
- Correct callback-security section: remove claim that handler re-validates
  upstream-vs-state-row in application code (enforcement is via SQL primary key)

docs/UPSTREAM.md: correct claim that OAuth upstreams participate in startup discovery;
they are excluded from discover_all and connected per-request, not pooled.

Resolves review threads:
  PRRT_kwDOR8nC1M576Bjb  (token in spike error)
  PRRT_kwDOR8nC1M576dqT  (gateway/config.rs critical - validate not called)
  PRRT_kwDOR8nC1M576dqW  (CONFIG.md clear URL missing upstream=)
  PRRT_kwDOR8nC1M576dqa  (GATEWAY.md:140 clear endpoint description)
  PRRT_kwDOR8nC1M576dqc  (GATEWAY.md:152 callback invariants not in code)
  PRRT_kwDOR8nC1M576dqe  (UPSTREAM.md:158 OAuth discovery claim inaccurate)

Note: with_oauth_managers() wiring (FIX G) was already correctly implemented via
with_oauth_client_cache() in cli/serve.rs. OAuth error kinds in ERRORS.md were
already documented. No changes needed for those items.

* fix(oauth): remove dead url_host fn, duplicate tool_error_from_oauth, stale comment

Resolves review threads #15, #3, #21.
- Remove #[allow(dead_code)] url_host() from manager.rs (thread 15)
- Remove duplicate pub tool_error_from_oauth from gateway/oauth.rs; the
  private copy in manager.rs is the only caller (thread 3)
- Drop unused OauthError import from oauth.rs
- Correct stale comment in spike_rmcp_auth_client.rs: mock returns 401
  to drive re-auth path, not 200 (thread 21)

* docs(oauth): align upstream OAuth docs with implementation

Resolves review threads #4, #17, #18, #19, #24, #25, #26, #27, #28,
#29, #30, #36, #37, #38.

UPSTREAM.md:
- Thread 4: OAuth upstreams are attempted at startup and fail unhealthy
  (not excluded entirely)
- Thread 18: Issuer binding checks origin (scheme+host+port), not just
  host; covers auth/token/revocation/userinfo endpoints
- Thread 19: Remove false LRU-cap claim; lock entries live for process
  lifetime
- Thread 24: Merged catalog is transport-neutral; OAuth upstreams appear
  in catalog but need HTTP session to initiate authorization
- Thread 28: POST /v1/gateway/oauth/start route references are correct
- Thread 38: Remove auto-delete claim on invalid_grant (not implemented)

GATEWAY.md:
- Thread 17: Pending state SQL key is (upstream_name, subject,
  csrf_token), not just (upstream_name, csrf_token)
- Thread 25: Reload eagerly evicts all OAuth AuthClient entries; remove
  false built_with_client_id eviction-on-mismatch claim
- Thread 26: Routes /v1/gateway/oauth/* are implemented as documented
- Thread 27: Callback is browser-facing; subject from session cookie,
  not from state parameter

ERRORS.md:
- Thread 30: Remove "(RFC 7636 absence implies plain-only)" — omission
  of code_challenge_methods_supported is not equivalent to plain-only
- Thread 36: oauth_issuer_mismatch triggers on missing issuer or
  endpoint/issuer origin mismatch, not direct discovered-URL equality
- Thread 37: Route references /v1/gateway/oauth/status and
  POST /v1/gateway/oauth/start are correct (no /v1/upstream-oauth/ routes)

* fix(oauth): evict build_locks, fix param attribution, https guard, TTL guard, epoch default

Resolves review threads #1, #2/#9, #16, #20, #31, #32, #35.

cache.rs (thread 1):
- evict_subject and evict_upstream now also remove entries from
  build_locks, preventing unbounded growth on long-running processes

config.rs (threads 2/9):
- validate_upstream maps ConfigError::InvalidUrl to param="url" instead
  of "bearer_token_env"; auth-conflict errors still map to bearer_token_env

manager.rs (thread 35):
- ClientMetadataDocument URL validation now enforces https scheme;
  http URLs are rejected with OauthError::Internal

store.rs (thread 16):
- token_received_at falls back to now_unix() instead of 0 (Unix epoch)
  when absent; prevents access_token_expires_at underflow for tokens
  that don't carry a received_at timestamp

sqlite.rs (threads 20, 31, 32):
- TTL guard now also rejects expires_at <= created_at (negative delta)
  to prevent integer underflow on malicious/clock-skewed input
- cleanup queries already used <=; threads 31/32 already resolved

* fix(oauth): stdio ordering, oauth URL guard, is_master gating

Resolves review threads #7, #8, #22.

serve.rs (thread 7):
- Compute stdio_mode before build_upstream_oauth_runtime; skip OAuth
  runtime init entirely in stdio mode so missing LAB_PUBLIC_URL /
  LAB_OAUTH_ENCRYPTION_KEY never fails a stdio serve

config.rs / gateway config.rs (thread 8):
- UpstreamConfig::validate now rejects oauth+no-url combinations with
  ConfigError::MissingOauthUrl; gateway config dispatch maps the new
  variant to param="url"

router.rs (thread 22):
- Gateway OAuth routes (/v1/gateway/oauth/*) and browser callback are
  now guarded by is_master; non-master nodes no longer mount them

* fix(oauth): redirect error kind, circuit breaker, tracing, DashMap clone, callback ext, reload reconcile

Resolves review threads #5, #6, #10, #11, #12, #13, #23, #50/#71.

upstream_oauth.rs (threads 5, 23):
- Callback: embed error_kind in redirect URL query params instead of
  x-lab-oauth-error-kind header (browsers silently discard headers on
  302 responses)
- Callback: extract AuthContext via Option<Extension<AuthContext>>
  instead of reconstructing Parts from empty request (which discarded
  middleware extensions); update callback_subject signature accordingly

pool.rs (threads 6, 11):
- subject_scoped_call_tool: add circuit breaker calls (record_success_for
  / record_failure_for) around the peer call
- subject_scoped_read_resource: add circuit breaker calls AND response
  size guard matching the non-scoped read_upstream_resource path
- subject_scoped_get_prompt: add circuit breaker calls

server.rs (thread 12):
- Subject-scoped dispatch path now emits tracing::info! on success and
  tracing::warn! on failure, matching the non-subject-scoped path

cache.rs (thread 13):
- get_or_build: clone DashMap Ref before the .await call on
  build_auth_client to avoid holding a DashMap read-lock across await
  (potential deadlock under contention)

manager.rs (thread 10):
- reload: reconcile upstream_oauth_managers after loading new config;
  remove managers for OAuth upstreams no longer present, warn about
  new OAuth upstreams that need restart to get a manager

Threads 50/71 (TokenRefreshFailed → NeedsReauth): already mapped
correctly in map_auth_error; no change needed.

* fix(oauth): dynamic registration once per upstream, not per call

Resolves review thread PRRT_kwDOR8nC1M579vdo

`configured_authorization_manager` was calling `register_client` on
every invocation (complete_authorization_callback, build_auth_client),
receiving a new AS-assigned client_id each time — mismatching the id
used to start the flow.

Fix: `resolve_client_config` for Dynamic now:
  1. Checks stored credential row (available after token exchange)
  2. Checks in-memory `dynamic_client_ids` cache (populated by begin_authorization)
  3. Only calls register_client on the very first invocation

`clear_credentials` evicts the in-memory cache entry so a fresh
registration is issued when re-authorizing after credential clearance.

* fix(gateway): exhaustive ConfigError match, no catch-all for param attribution

Resolves review thread PRRT_kwDOR8nC1M579vdf

The wildcard arm `_ => param: "bearer_token_env"` was too broad: any
future ConfigError variant would be misattributed to bearer_token_env.

Replace the catch-all with an exhaustive match over all three ConfigError
variants so the compiler enforces correct attribution if new variants are
added. ConflictingAuth → bearer_token_env; MissingOauthUrl + InvalidUrl → url.

* fix(oauth): persist dynamic client registrations to SQLite

Dynamic registration (RFC 7591) must survive server restarts. Replace
the in-memory DashMap cache with a durable `upstream_oauth_dynamic_clients`
SQLite table (WITHOUT ROWID, PK: upstream_name + subject).

- Add `upstream_oauth_dynamic_clients` table to `open_connection` schema
- Add `save_dynamic_client_registration` (UPSERT), `find_dynamic_client_registration`
  (SELECT), and `delete_dynamic_client_registration` (DELETE) to `SqliteStore`
- Add round-trip test covering save, upsert, find, delete, and subject isolation
- Remove `dynamic_client_ids: Arc<DashMap<...>>` field from `UpstreamOauthManager`
- `resolve_client_config` Dynamic branch now checks SQLite before calling
  `register_client`, then persists the assigned client_id immediately
- `clear_credentials` also deletes the dynamic client registration row

client_secret is not stored: `token_endpoint_auth_method: "none"` means the
AS issues public clients and must not return a client_secret.

* feat(gateway-admin): upstream OAuth connection UI

Adds an "Upstream Connections" section to the Gateways page with per-upstream
Connect/Disconnect cards. Cards poll auth status, open the authorization URL
in a new tab, and wait for the callback before clearing the connecting state.

- GET /v1/gateway/oauth/upstreams — lists upstreams with oauth: config
- upstream-oauth-card: badge (Connected/Expiring/Disconnected), Connect/Disconnect
- upstream-oauth-section: grid of cards, null when no oauth upstreams configured
- Fix redirect loop: restore allow_session_cookie guard on browser redirect
- Fix redirect loop: switch v1 + MCP auth middleware to route_layer so unmatched
  SPA paths (e.g. /gateways) are not intercepted by auth middleware

* style: rustfmt router.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants