Resync to v56 #2
Merged
…A#1413)
- Add publish-manifest input to docker-build.yml (default true); single-arch branch callers set it false so the merge job is skipped and the shared bare :SHA tag in GHCR is never written by branch workflows
- branch-kubernetes-e2e: retag :SHA-amd64 to :SHA before kind load so Helm's image.tag matches what is loaded in kind containerd
- branch-e2e: pass image-tag as :SHA-arm64 to e2e-test so the arch-specific GHCR tag is used directly without depending on the bare tag
- bare :SHA in GHCR is now written only by test-gpu.yml (multi-arch build), eliminating the last-writer-wins race across concurrent workflows
* test(server): cover service endpoint plaintext security * test(server): align tls test with from_files Option<&Path> signature TlsAcceptor::from_files now accepts the client CA path as Option<&Path> (per the require_client_auth refactor on main). Wrap the helper's CA path in Some(...) so the new plaintext-service-http tests compile after rebasing onto current main. --------- Co-authored-by: Taylor Mutch <taylormutch@gmail.com>
…ng hosts (NVIDIA#1449) Exec'ing /bin/sleep (SELinux label bin_t) from a user_home_t test binary causes /proc/<pid>/exe readlink to return ENOENT on SELinux-enforcing hosts due to the cross-domain boundary. Skip the test at runtime when getenforce reports Enforcing. Also adds a ChildGuard drop guard for safe child cleanup on panic and increases the exec-detection deadline from 2s to 5s. Signed-off-by: Derek Carr <decarr@redhat.com>
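The runtime skip described above can be sketched roughly as follows. This is a minimal illustration, not the test's actual code: the function names `output_means_enforcing` and `selinux_enforcing` are hypothetical, and treating a missing `getenforce` binary as "not enforcing" is an assumption.

```rust
use std::process::Command;

/// Pure helper: does `getenforce` stdout indicate enforcing mode?
/// (Hypothetical name; the comparison ignores case and trailing newline.)
fn output_means_enforcing(stdout: &str) -> bool {
    stdout.trim().eq_ignore_ascii_case("Enforcing")
}

/// Runs `getenforce` and reports whether SELinux is enforcing.
/// Assumption: if the binary is missing or fails, SELinux is treated
/// as not enforcing, so the test would run normally.
fn selinux_enforcing() -> bool {
    match Command::new("getenforce").output() {
        Ok(out) if out.status.success() => {
            output_means_enforcing(&String::from_utf8_lossy(&out.stdout))
        }
        _ => false,
    }
}
```

A test would call `selinux_enforcing()` first and return early when it is true.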
* fix(sandbox): allow first-label L7 host wildcards * docs(sandbox): document L7 host wildcard contract + add OPA runtime tests - Add Host Wildcards section to architecture/security-policy.md describing accepted (first-label *, **, intra-label *-X) and rejected (bare, TLD, non-first-label, recursive-in-label) forms, and noting that wildcards never cross '.' boundaries. - Expand the policy-schema.mdx 'host' field description to reflect the same contract instead of only mentioning '*.example.com'. - Add OPA runtime tests asserting '*-aiplatform.googleapis.com' matches 'us-central1-aiplatform.googleapis.com' and does not match 'us-central1.aiplatform.googleapis.com' (cross-dot boundary). Locks validator/runtime alignment for intra-label wildcards. * chore: update mise lockfile * test(server): tolerate serialized inference upserts --------- Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
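The wildcard contract above (wildcards only in the first label, never crossing a '.') might be sketched like this. It is an illustrative model, not the validator or the OPA policy itself: `host_matches` and `label_matches` are hypothetical names, `**` and the rejected forms (bare, TLD) are not modeled, and letting `*` match zero characters is an assumption.

```rust
/// Match one label against a pattern label where a single `*` stands for
/// any run of characters inside that label. Labels never contain '.'.
fn label_matches(pattern: &str, label: &str) -> bool {
    if label.is_empty() {
        return false;
    }
    match pattern.split_once('*') {
        // No wildcard: plain case-insensitive comparison.
        None => pattern.eq_ignore_ascii_case(label),
        Some((prefix, suffix)) => {
            // This sketch supports at most one `*` per label.
            if suffix.contains('*') {
                return false;
            }
            let label = label.to_ascii_lowercase();
            let prefix = prefix.to_ascii_lowercase();
            let suffix = suffix.to_ascii_lowercase();
            label.len() >= prefix.len() + suffix.len()
                && label.starts_with(prefix.as_str())
                && label.ends_with(suffix.as_str())
        }
    }
}

/// Match a host against a pattern label by label, so a wildcard can
/// never span a '.' boundary. Only the first label may hold a wildcard.
fn host_matches(pattern: &str, host: &str) -> bool {
    let p: Vec<&str> = pattern.split('.').collect();
    let h: Vec<&str> = host.split('.').collect();
    if p.len() != h.len() {
        return false;
    }
    if p.iter().skip(1).any(|l| l.contains('*')) {
        return false;
    }
    p.iter().zip(h.iter()).all(|(pl, hl)| label_matches(pl, hl))
}
```

Comparing label counts up front is what keeps `*-aiplatform.googleapis.com` from matching `us-central1.aiplatform.googleapis.com`.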
Add -o/--output flag to `openshell gateway list` matching the existing sandbox list pattern, enabling machine-readable output for scripting. Signed-off-by: Florent Benoit <fbenoit@redhat.com>
Remove ~280 lines of duplicated code across 30 files in 5 areas:
- centered_rect: consolidate 5 identical TUI layout helpers into a single pub fn in openshell-tui/src/ui/mod.rs
- server test helpers: replace ~100 inline Store::connect() calls with local test_store() helpers; deduplicate test_server_state() in grpc/service.rs to use the shared test_support version
- rogue PKI: extract 20-line rogue CA+client cert generation block (duplicated in two integration tests) into generate_rogue_pki() in tests/common/mod.rs
- provider tests: replace 8 identical 28-line test modules with a single macro_rules! test_discovers_env_credential! invocation
- label constants: centralize openshell.ai/ container label keys in openshell-core::driver_utils; update Docker and Kubernetes drivers to import from there instead of redefining them locally
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
…er (NVIDIA#1483) The env var was only wired up via clap in the standalone openshell-driver-podman binary. When the Podman driver runs embedded in the gateway, config came exclusively from TOML deserialization and the env var was never consulted. Apply it as a post-deserialization override, matching the existing OPENSHELL_K8S_WORKSPACE_DEFAULT_STORAGE_SIZE pattern. Closes NVIDIA#1446
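A post-deserialization override of this kind could look roughly like the sketch below. The struct, field, and function names are hypothetical (the commit title is truncated, so the real variable and field are not named here), and ignoring an empty value is an assumption.

```rust
/// Hypothetical stand-in for a driver config parsed from TOML.
#[derive(Debug, Clone, PartialEq)]
struct DriverConfig {
    /// Hypothetical setting that an environment variable may override.
    setting: String,
}

/// Apply an environment override after TOML deserialization.
/// `env_value` would typically come from `std::env::var(NAME).ok()`.
/// Assumption: an unset or blank variable leaves the TOML value intact.
fn apply_env_override(mut cfg: DriverConfig, env_value: Option<String>) -> DriverConfig {
    if let Some(value) = env_value {
        let value = value.trim();
        if !value.is_empty() {
            cfg.setting = value.to_string();
        }
    }
    cfg
}
```

Passing the value in as an `Option<String>` keeps the function free of process-global state, which makes it easy to unit test.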
…safe) (NVIDIA#1505)
In the Rust ecosystem there are largely three ways to do system calls:
- raw libc
- nix
- rustix
Of the three, libc is almost all `unsafe`, and really 95% of use cases should use either nix or rustix. nix is the original one, but after having looked at the code of both, I think rustix is just better designed and organized. It has also reached 1.0, whereas nix is still making semver-breaking changes (in fact we're behind here in this project). Now in practice, we have both *transitively* in the depchain already, and that's true for quite a lot of projects. But I think rustix is better, so let's add rustix as a workspace dependency (process feature) and migrate a few use cases to it. It's an especially clear improvement over raw libc, which is surprisingly widespread. If we agree to do this, then many other calls can be ported.
Signed-off-by: Colin Walters <walters@verbum.org>
…VIDIA#1507)
After NVIDIA#1415 ships, users upgrading from previous releases need guidance on the gateway.env deprecation, port/bind/database path changes, and the podman.socket restart requirement.
- docs(rpm): add 'Migrating from gateway.env' section to TROUBLESHOOTING covering backward compatibility, env-to-TOML key mapping, and three breaking changes (default port 8080->17670, bind address 0.0.0.0->127.0.0.1, database path move). Add podman.socket restart step to upgrade procedure.
- docs(rpm): add upgrade callout to CONFIGURATION.md pointing at migration section.
- fix(podman): retry PodmanComputeDriver ping up to 5 times with 2s delay to tolerate transient socket unavailability after package upgrades. The systemd unit uses Wants=podman.socket (not Requires) so the gateway can start while the socket is briefly re-activating after an RPM upgrade changes its unit file on disk.
- chore(rpm): update EnvironmentFile comment in RPM spec to explain backward-compatibility intent.
Signed-off-by: Adam Miller <admiller@redhat.com>
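The bounded retry in the fix(podman) item can be illustrated with a generic helper. This is a sketch under assumptions: `retry_with_delay` is a hypothetical name, and the real driver presumably retries an async ping rather than a blocking closure.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Call `op` until it succeeds or `attempts` tries have been made,
/// sleeping `delay` between failed tries. Returns the last error
/// if every attempt fails.
fn retry_with_delay<T, E>(
    attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut tries = 0;
    loop {
        tries += 1;
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error.
            Err(err) if tries >= attempts => return Err(err),
            // Otherwise wait and try again.
            Err(_) => sleep(delay),
        }
    }
}
```

With the values from the commit message, a caller would pass `5` and `Duration::from_secs(2)`.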
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* docs(rfc): add sandbox resource requirements proposal Signed-off-by: Evan Lezar <elezar@nvidia.com> * docs(rfc): finalize sandbox resource requirements --------- Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* fix(cli): add json output for policy get * test(cli): cover policy get full json output * fix(cli): address policy get json clippy --------- Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* feat(providers): derive discovery from profiles Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com> * fix(providers): keep v2 discovery profile-only Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com> * docs(providers): update providers v2 behavior Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com> * fix(providers): make github profile read-only Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com> --------- Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* fix(homebrew): repair local driver bootstrap state * fix(bootstrap): satisfy default SAN doc lint
Thread the gateway_insecure flag through gateway_add(), gateway_login(), and all OIDC HTTP clients so that --gateway-insecure and OPENSHELL_GATEWAY_INSECURE apply to OIDC discovery, token exchange, and token refresh requests. Previously, the flag only affected gRPC connections to the gateway. OIDC HTTP clients (reqwest::get and http_client) always verified TLS certificates, causing gateway registration and login to fail when the OIDC issuer used a self-signed certificate (common on OpenShift with edge-terminated routes). Fixes NVIDIA#1534 Signed-off-by: Adel Zaalouk <azaalouk@redhat.com>
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
Bumps [docker/login-action](https://github.com/docker/login-action) from 4.1.0 to 4.2.0. - [Release notes](https://github.com/docker/login-action/releases) - [Commits](docker/login-action@4907a6d...650006c) --- updated-dependencies: - dependency-name: docker/login-action dependency-version: 4.2.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* ci(kubernetes): pin mise in e2e workflow Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * ci(kubernetes): mirror postgres image for ha e2e Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * ci(kubernetes): reuse e2e workflow for ha Signed-off-by: Taylor Mutch <taylormutch@gmail.com> --------- Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
…VIDIA#1661) The gateway.sh script appended supervisor_image after the [openshell.gateway.gateway_jwt] table header, so TOML parsed it as a gateway_jwt field. The Podman driver never saw the override and fell back to the default ghcr.io/nvidia/openshell/supervisor:latest. Move supervisor_image into [openshell.drivers.podman] where the driver config deserializer expects it.
) Move three duplicated definitions into openshell-core so every consumer has a single canonical source: - format_bytes: identical 14-line function existed in docker, kubernetes, and vm drivers. Moved to openshell-core::progress where all three already imported from. - DEFAULT_SANDBOX_PIDS_LIMIT: i64 constant (2048) duplicated in docker driver and podman config. Moved to openshell-core::config alongside other shared defaults. Podman re-exports it for internal call-site compatibility. - current_time_ms: secrets.rs in openshell-sandbox reimplemented the same logic as openshell-core::time::now_ms(). Remove the local copy and call now_ms() directly via the existing dep.
…VIDIA#1666) * fix(config): reject unknown fields in nested gateway config tables The gateway TOML loader silently ignored keys placed under the wrong table header. PR NVIDIA#1661 fixed one instance of this (supervisor_image under [openshell.gateway.gateway_jwt]) but the root cause remained: the nested gateway config tables did not deny unknown fields, so a misplaced key was accepted and dropped instead of erroring. Concretely, tasks/scripts/gateway.sh emitted `sandbox_namespace` right after the [openshell.gateway.gateway_jwt] heredoc, so it landed inside the gateway_jwt table rather than [openshell.gateway]. The k8s driver already receives the namespace via [openshell.drivers.kubernetes], so the stray line was dead config that parsed without complaint. Changes: - Add #[serde(deny_unknown_fields)] to the nested gateway config tables that are part of the config-file parse tree: TlsConfig, OidcConfig, MtlsAuthConfig, GatewayAuthConfig, GatewayJwtConfig. - Remove the misplaced sandbox_namespace line from gateway.sh. - Drop the unused Serialize/Deserialize derives from Config and ServiceRoutingConfig (see below). - Add a regression test asserting a key under the wrong nested table is rejected.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
* feat(providers): add Google Vertex AI provider Adds Vertex AI provider profiles, routing, credential refresh plumbing, CLI support, docs, and regression coverage. Keeps the related NETLINK_ROUTE seccomp allowance needed by Vertex client tooling that calls getifaddrs. * docs: add Vertex AI sandbox usage for Claude Code and OpenCode Cover the full end-to-end setup for running Claude Code and OpenCode inside an OpenShell sandbox via inference.local with a Vertex AI backend: - google-vertex-ai.mdx: add 'Use from a Sandbox' section with tabbed examples for Claude Code (--bare flag, no /v1 suffix) and OpenCode (/v1 suffix required). Add providers_v2_enabled prerequisite and --no-verify note for global region. Document policy proposals table covering metadata.google.internal (always blocked), downloads.claude.ai, and storage.googleapis.com. - inference-routing.mdx: expand 'Use the Local Endpoint' section with tabbed examples for Claude Code, OpenCode, Python OpenAI SDK, and Python Anthropic SDK. Add notes explaining the /v1 path suffix difference between clients. - supported-agents.mdx: update Claude Code and OpenCode rows to mention inference.local support and correct base URL requirements. * fix: address vertex review findings * test(sandbox): retry on spurious Ok in fork-exec ambiguity test On arm64 under heavy CI load, the /proc fd scan in find_socket_inode_owners can transiently miss the parent process's socket fd entry, returning only the child as an owner. This causes resolve_process_identity to return Ok (single owner, no ambiguity check fires) instead of the expected ambiguous-ownership Err. Extend the retry loop to also handle unexpected Ok results, mirroring the existing retry for transient Err results. 10 retries at 50ms gives a 500ms settling window, which is sufficient for procfs to stabilize on loaded arm64 runners. 
* fix: address vertex review regressions * docs(router): clarify stream_response semantics for Vertex rawPredict routing Document the three call sites of prepare_backend_request and their stream_response values in a caller table: - send_backend_request: false → :rawPredict (unary endpoint) - send_backend_request_streaming: true → :streamRawPredict - verify_backend_endpoint: explicitly false to probe the unary endpoint Cross-reference the table from build_provider_url and is_vertex_anthropic_rawpredict_route so the stream_response=true guard in the suffix upgrade branch is understood in full context. Also note that is_vertex_anthropic_rawpredict_route is a structural predicate (model_in_path + anthropic_messages + :rawPredict suffix), not a named-provider check, so any future provider with the same route shape inherits the transforms automatically.
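The `stream_response` semantics in the caller table can be sketched as a small suffix-selection function. The name `upgrade_rawpredict_suffix` and the example URL are hypothetical; only the `:rawPredict` and `:streamRawPredict` suffixes come from the commit message.

```rust
/// Upgrade a Vertex `:rawPredict` URL to `:streamRawPredict` only when
/// the caller asked for a streaming response. Any other URL, and any
/// unary request, is returned unchanged.
fn upgrade_rawpredict_suffix(url: &str, stream_response: bool) -> String {
    match url.strip_suffix(":rawPredict") {
        Some(base) if stream_response => format!("{}:streamRawPredict", base),
        _ => url.to_string(),
    }
}
```

In this model the unary sender and the endpoint probe pass `false` and keep `:rawPredict`, while the streaming sender passes `true`.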
* fix: correct example paths in local-inference README * fix: correct example paths in local-inference routes.yaml
The Fedora RPM canary needs to exercise the install.sh user-service path, but a GitHub Actions job container does not boot with systemd as PID 1. This means the Fedora RPM canary was incomplete compared to the others. With this change, we run Fedora as a nested privileged systemd container instead, wait for systemd to become reachable, then start the root user manager so systemctl --user works for the RPM gateway unit, achieving parity with the other canary tests. Signed-off-by: Kris Hicks <khicks@nvidia.com>
* chore: wip providers v2 tui and codex profile * chore: wip effective policy get and codex profile * chore: wip provider profiles and tui detail views * feat(tui): annotate policy proposal review status
…1699) Install the Snap built by the triggering Release Dev workflow by setting merge-multiple: true on the artifact download. actions/download-artifact otherwise extracts each artifact into its own subdirectory, leaving the package at release/snap-linux-amd64/*.snap, so the install glob ./release/*.snap matched nothing. Merging flattens the artifact's contents directly into release/ where the dangerous local snap install expects it. Harden the Snap canary setup by enabling snapd.socket, waiting for snap seeding (snap wait system seed.loaded), and running every step with strict shell options (set -euo pipefail) so failures surface immediately. Register the snapped gateway with the CLI as the documented local plaintext snap-docker gateway, and print version and snap services, before running openshell status so the canary verifies a configured and reachable gateway instead of only the install. Signed-off-by: Kris Hicks <khicks@nvidia.com>
Add a desktop launcher for the OpenShell TUI so users can launch
"openshell term" from their desktop environment application menu.
The change adds three files:
- snap/local/term.desktop: desktop entry file for the application launcher
- snap/local/icon.png: application icon (copied from snap store data)
- snapcraft.yaml: new "term" app entry that runs "openshell term"
with home, network, ssh-keys, and system-observe plugs, plus install
rules to stage the desktop file and icon under meta/gui/
The desktop file references the icon via ${SNAP} which is resolved
at runtime to the snap installation directory. The term app reuses
the same connection plugs as the main openshell app.
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Auto-detection previously treated Podman as available only when the podman CLI was visible on PATH. However, package manager services can run with a restricted PATH, which lets Docker be selected even when a Podman API socket is reachable. Additionally, podman may symlink /var/run/docker.sock to podman's machine unix socket, which would be incorrectly detected as Docker. Worse still: the podman machine may not even be running. This replaces the Podman binary check with a functional HTTP probe against the standard Podman socket paths. The probe requires /_ping to answer with a Libpod-Api-Version header before treating the socket as Podman, which lets the gateway select the embedded Podman driver only when the API is usable. Signed-off-by: Kris Hicks <khicks@nvidia.com>
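The header check at the heart of the probe might be sketched as a pure function over a raw HTTP/1.1 response. This is illustrative only: `looks_like_podman_ping` is a hypothetical name, the socket I/O is omitted, and requiring a 200 status is an assumption beyond what the commit message states.

```rust
/// Decide whether a raw `/_ping` response came from a Podman API socket.
/// The response must carry a non-empty `Libpod-Api-Version` header.
/// Assumption: a 200 status is also required.
fn looks_like_podman_ping(raw_response: &str) -> bool {
    // Only inspect the header section, before the blank line.
    let head = raw_response.split("\r\n\r\n").next().unwrap_or("");
    let mut lines = head.split("\r\n");
    let status_ok = lines
        .next()
        .map(|status| status.split_whitespace().nth(1) == Some("200"))
        .unwrap_or(false);
    status_ok
        && lines.any(|line| {
            line.split_once(':')
                .map(|(name, value)| {
                    name.trim().eq_ignore_ascii_case("Libpod-Api-Version")
                        && !value.trim().is_empty()
                })
                .unwrap_or(false)
        })
}
```

A Docker daemon answering on the same path would lack the Libpod header, so a symlinked docker.sock is only classified as Podman when Podman is really behind it.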
Signed-off-by: Kris Hicks <khicks@nvidia.com>
The Ubuntu Snap canary downloads its artifact from a different workflow run (the triggering Release Dev run) via run-id. Cross-run downloads require authentication, so pass github.token to actions/download-artifact. Signed-off-by: Kris Hicks <khicks@nvidia.com>
…teway (NVIDIA#1419)
The existing docs omitted or misstated several requirements when running the gateway as a container with the Docker compute driver:
- OPENSHELL_GRPC_ENDPOINT is required; the Docker driver uses only the scheme (http/https), while host and port are substituted automatically with host.openshell.internal and the gateway's own bind port
- Supervisor binary must be extracted to a host path before starting the gateway; bind-mount sources are resolved by the host Docker daemon so the path must be identical inside and outside the gateway container
- Docker socket access requires adding the docker group (UID 1000 default)
- Port binding should remain 127.0.0.1; Docker driver adds a bridge listener automatically
- add --server-san host.openshell.internal to generate-certs for mTLS
- Complete the mTLS docker run with all Docker driver requirements
- Add deploy/docker/gateway.toml: TOML config for the Docker driver
- Add deploy/docker/docker-compose.yml referencing the TOML
- Add docs/get-started/tutorials/docker-compose.mdx tutorial page
- Remote gateway registration instructions (--remote flag)
Address reviewer feedback:
- Move Docker Compose tutorials card to the bottom of the list
- Replace inline YAML snippet in Docker Compose section with a reference to deploy/docker/ to avoid drift
- Clarify OPENSHELL_DB_URL is safe in compose.yml (plain SQLite path, no credentials); the TOML block targets credential-bearing DSNs
- Note that ./ in source: resolves relative to the compose file directory
- Clarify that only the scheme from OPENSHELL_GRPC_ENDPOINT matters
- Add note that the tilde volume mount resolves to the same absolute path on both host and container
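The scheme-only handling of OPENSHELL_GRPC_ENDPOINT can be illustrated as follows. This is a sketch, not the driver's code: `sandbox_grpc_endpoint` is a hypothetical name, and returning `None` for other schemes is an assumption.

```rust
/// Build the endpoint handed to sandboxes: keep only the scheme of the
/// configured value and substitute the host and port.
/// Assumption: anything other than http/https is rejected.
fn sandbox_grpc_endpoint(configured: &str, bind_port: u16) -> Option<String> {
    let (scheme, _host_and_port) = configured.split_once("://")?;
    match scheme {
        "http" | "https" => Some(format!(
            "{}://host.openshell.internal:{}",
            scheme, bind_port
        )),
        _ => None,
    }
}
```

Whatever host and port appear in the configured value are discarded, which is why the docs say only the scheme matters.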
…#1708) Remove three groups of copy-pasted code in openshell-server: 1. grpc/mod.rs had a private current_time_ms() wrapper identical to the one already exported from persistence/mod.rs. Remove the duplicate and update the three grpc sub-modules (policy, sandbox, service) to import directly from crate::persistence. 2. test_store() was repeated verbatim in seven #[cfg(test)] blocks. Promote a single canonical version to persistence/mod.rs (cfg-gated) and replace all copies with crate::persistence::test_store() calls or a thin Arc wrapper in supervisor_session. 3. grpc_client_mtls() and build_tls_root() were copy-pasted across edge_tunnel_auth.rs and multiplex_tls_integration.rs. Move both into the existing tests/common/mod.rs shared module and import from there.
…IDIA#1700) * fix(helm): create sandbox JWT secret under cert-manager The cert-manager install path (certManager.enabled=true, pkiInitJob.enabled=false) left the gateway StatefulSet unable to start because nothing created the openshell-jwt-keys Secret: cert-manager owns TLS Secrets but does not mint the sandbox JWT signing key, and the certgen hook only rendered when pkiInitJob.enabled was true. Separate JWT signing-key provisioning from TLS PKI provisioning: - certgen: add a --jwt-only mode that creates only the Opaque JWT signing Secret, for use when another controller owns TLS Secrets. - certgen.yaml: render the hook when pkiInitJob.enabled OR certManager.enabled is true. cert-manager takes precedence and runs the hook with --jwt-only even if pkiInitJob.enabled remains true. Remove the mutual-exclusion failure between the two values. - _helpers.tpl: add openshell.sandboxJwtSecretName, shared by the hook and the StatefulSet mount. - Update values, README, docs, architecture, and the debug-openshell-cluster skill to reflect the new precedence; the documented cert-manager install no longer needs pkiInitJob.enabled=false. Closes NVIDIA#1691 * fix(helm): honor cert-manager precedence for client CA volume The client CA volume logic treated pkiInitJob.enabled as proof that built-in PKI owns the client CA. With cert-manager precedence now allowing certManager.enabled=true alongside the default pkiInitJob.enabled=true, that assumption mounts the server TLS cert secret as the client CA and ignores certManager.clientCaFromServerTlsSecret=false, which can break mTLS or trust the wrong CA. Gate the pkiInitJob.enabled term with (not certManager.enabled) in all three client CA conditions (volume mount, volume definition, and secret selection) so cert-manager owns TLS when enabled. Add a Helm test suite covering built-in PKI, cert-manager shared CA, the regression config (cert-manager + clientCaFromServerTlsSecret=false + default pkiInitJob), and the no-client-CA case.
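The precedence rules from both Helm fixes can be modeled as two small functions. This is a sketch of the decision logic only, not the chart templates: the enum and function names are hypothetical, and `FullPki` assumes that the hook without --jwt-only provisions TLS PKI as well as the JWT Secret.

```rust
/// What the certgen hook does for a given values combination.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CertgenHook {
    /// Hook is not rendered at all.
    Skip,
    /// Hook runs with --jwt-only; cert-manager owns TLS Secrets.
    JwtOnly,
    /// Hook runs without --jwt-only (assumed: TLS PKI plus JWT Secret).
    FullPki,
}

/// Render the hook when either value is true; cert-manager wins.
fn certgen_hook(pki_init_job_enabled: bool, cert_manager_enabled: bool) -> CertgenHook {
    if cert_manager_enabled {
        CertgenHook::JwtOnly
    } else if pki_init_job_enabled {
        CertgenHook::FullPki
    } else {
        CertgenHook::Skip
    }
}

/// The pkiInitJob term of the client CA conditions, gated so that
/// cert-manager owns TLS whenever it is enabled.
fn builtin_pki_owns_client_ca(pki_init_job_enabled: bool, cert_manager_enabled: bool) -> bool {
    pki_init_job_enabled && !cert_manager_enabled
}
```

The regression config (cert-manager enabled with the default pkiInitJob.enabled=true) lands in `JwtOnly` and leaves client CA ownership with cert-manager.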
…ods (NVIDIA#1729) Allow operators to configure a default Kubernetes runtimeClassName that is applied to sandbox pods when the CreateSandbox request does not specify one. This avoids requiring every API caller to explicitly set the runtime class for clusters that always need a specific RuntimeClass (e.g. kata-containers, nvidia). The fallback is applied in the Kubernetes driver only — per-request values still take priority, and an empty default (the built-in) preserves existing behavior (field omitted, cluster default applies).
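The fallback order might be sketched like this. The function name `effective_runtime_class` is hypothetical, and treating an empty per-request string the same as an unset one is an assumption.

```rust
/// Pick the runtimeClassName for a sandbox pod.
/// A per-request value wins; otherwise the operator default applies;
/// an empty default yields `None`, meaning the field is omitted and the
/// cluster default applies.
/// Assumption: an empty request string counts as "not specified".
fn effective_runtime_class(requested: Option<&str>, operator_default: &str) -> Option<String> {
    requested
        .filter(|r| !r.is_empty())
        .or(Some(operator_default).filter(|d| !d.is_empty()))
        .map(str::to_string)
}
```

Returning `Option<String>` maps directly onto an optional pod spec field: `None` means the key is simply not serialized.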
…1635) (NVIDIA#1645) Closes NVIDIA#1635 Signed-off-by: Philippe Martin <phmartin@redhat.com>
Resolves two conflicts from the aible branch: - .github/workflows/docker-build.yml: keep both the aible `image-registry` input (for docker.io/aible registry support) and the upstream `publish-manifest` input; take upstream's simplified IMAGE_TAG format (always includes arch suffix). - crates/openshell-driver-kubernetes/src/driver.rs: keep aible's `privileged: true` securityContext and `apply_gateway_readiness_init` wait-for-gateway init container alongside upstream's new `image_pull_secret_refs`, projected SA token volume, sandboxJwt bootstrap, and service account support. Also updates deploy/docker/Dockerfile.gateway and Dockerfile.supervisor to add libclang-dev, clang, and python3 (required by the z3/bindgen build deps introduced in v0.0.56), and adds bundled-z3 feature to the gateway build. Updates values.yaml to use docker.io/aible/openshell-supervisor:20260604.
Summary
Changes / conflicts
Update the Docker build to parameterize the image registry so the fork's aible images use docker.io.
In driver.rs, add `privileged: true` to the securityContext and a wait-for-gateway init container to avoid a race condition with the logger.