
Resync to v56#2

Merged
rodbutters merged 105 commits into aible from resync-to-v56
Jun 4, 2026


@rodbutters

Summary

  • Merges upstream v0.0.56 into the aible branch.
  • Resolves conflicts and preserves aible additions.

Changes / conflicts

Update the Docker build to parameterize the image registry so the fork can publish aible images to docker.io.
In driver.rs, add `privileged: true` to the security context and a wait-for-gateway init container to avoid a race condition with the logger.

mesutoezdil and others added 30 commits May 20, 2026 14:06
…A#1413)

- Add publish-manifest input to docker-build.yml (default true); single-arch
  branch callers set it false so the merge job is skipped and the shared
  bare :SHA tag in GHCR is never written by branch workflows
- branch-kubernetes-e2e: retag :SHA-amd64 to :SHA before kind load so
  Helm's image.tag matches what is loaded in kind containerd
- branch-e2e: pass image-tag as :SHA-arm64 to e2e-test so the arch-specific
  GHCR tag is used directly without depending on the bare tag
- bare :SHA in GHCR is now written only by test-gpu.yml (multi-arch build),
  eliminating the last-writer-wins race across concurrent workflows
* test(server): cover service endpoint plaintext security

* test(server): align tls test with from_files Option<&Path> signature

TlsAcceptor::from_files now accepts the client CA path as Option<&Path>
(per the require_client_auth refactor on main). Wrap the helper's CA
path in Some(...) so the new plaintext-service-http tests compile after
rebasing onto current main.

---------

Co-authored-by: Taylor Mutch <taylormutch@gmail.com>
…ng hosts (NVIDIA#1449)

Exec'ing /bin/sleep (SELinux label bin_t) from a user_home_t test binary
causes /proc/<pid>/exe readlink to return ENOENT on SELinux-enforcing
hosts due to the cross-domain boundary. Skip the test at runtime when
getenforce reports Enforcing.

Also adds a ChildGuard drop guard for safe child cleanup on panic and
increases the exec-detection deadline from 2s to 5s.

Signed-off-by: Derek Carr <decarr@redhat.com>
* fix(sandbox): allow first-label L7 host wildcards

* docs(sandbox): document L7 host wildcard contract + add OPA runtime tests

- Add Host Wildcards section to architecture/security-policy.md
  describing accepted (first-label *, **, intra-label *-X) and
  rejected (bare, TLD, non-first-label, recursive-in-label) forms,
  and noting that wildcards never cross '.' boundaries.
- Expand the policy-schema.mdx 'host' field description to reflect
  the same contract instead of only mentioning '*.example.com'.
- Add OPA runtime tests asserting '*-aiplatform.googleapis.com'
  matches 'us-central1-aiplatform.googleapis.com' and does not match
  'us-central1.aiplatform.googleapis.com' (cross-dot boundary). Locks
  validator/runtime alignment for intra-label wildcards.
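
The cross-dot rule in the tests above can be illustrated with a small matcher. This is only a sketch under stated assumptions, not the project's validator or OPA policy: `host_matches` is a hypothetical helper, it handles a single `*` in the first label only (no `**`), and it assumes the wildcard must match at least one character.

```rust
/// Hypothetical helper: match `host` against a pattern whose first label
/// may contain one `*`. The wildcard is confined to that label, so it can
/// never consume a '.'.
fn host_matches(pattern: &str, host: &str) -> bool {
    let (pat_first, pat_rest) = match pattern.split_once('.') {
        Some(parts) => parts,
        None => return false, // bare patterns such as "*" are rejected
    };
    let (host_first, host_rest) = match host.split_once('.') {
        Some(parts) => parts,
        None => return false,
    };
    // Everything after the first label must match exactly (ASCII case-insensitive).
    if !pat_rest.eq_ignore_ascii_case(host_rest) {
        return false;
    }
    match pat_first.split_once('*') {
        // No wildcard: the first labels must be equal.
        None => pat_first.eq_ignore_ascii_case(host_first),
        // One wildcard: the host label must start with the prefix and end
        // with the suffix, with at least one character left for `*`
        // (an assumption of this sketch).
        Some((prefix, suffix)) => {
            let label = host_first.to_ascii_lowercase();
            let prefix = prefix.to_ascii_lowercase();
            let suffix = suffix.to_ascii_lowercase();
            !suffix.contains('*')
                && label.len() > prefix.len() + suffix.len()
                && label.starts_with(prefix.as_str())
                && label.ends_with(suffix.as_str())
        }
    }
}
```

Because the labels after the first are compared verbatim, `us-central1.aiplatform.googleapis.com` fails against `*-aiplatform.googleapis.com`: its remainder is `aiplatform.googleapis.com`, not `googleapis.com`.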

* chore: update mise lockfile

* test(server): tolerate serialized inference upserts

---------

Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
Add -o/--output flag to `openshell gateway list` matching the existing
sandbox list pattern, enabling machine-readable output for scripting.

Signed-off-by: Florent Benoit <fbenoit@redhat.com>
Remove ~280 lines of duplicated code across 30 files in 5 areas:

- centered_rect: consolidate 5 identical TUI layout helpers into a
  single pub fn in openshell-tui/src/ui/mod.rs
- server test helpers: replace ~100 inline Store::connect() calls
  with local test_store() helpers; deduplicate test_server_state()
  in grpc/service.rs to use the shared test_support version
- rogue PKI: extract 20-line rogue CA+client cert generation block
  (duplicated in two integration tests) into generate_rogue_pki()
  in tests/common/mod.rs
- provider tests: replace 8 identical 28-line test modules with a
  single macro_rules! test_discovers_env_credential! invocation
- label constants: centralize openshell.ai/ container label keys
  in openshell-core::driver_utils; update Docker and Kubernetes
  drivers to import from there instead of redefining them locally
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
…er (NVIDIA#1483)

The env var was only wired up via clap in the standalone
openshell-driver-podman binary. When the Podman driver runs embedded
in the gateway, config came exclusively from TOML deserialization and
the env var was never consulted. Apply it as a post-deserialization
override, matching the existing OPENSHELL_K8S_WORKSPACE_DEFAULT_STORAGE_SIZE
pattern.

Closes NVIDIA#1446
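
The post-deserialization override described above might look roughly like the following. This is a sketch, not the driver's code: `DriverConfig`, `example_setting`, and the variable name `EXAMPLE_DRIVER_SETTING` are placeholders, since the real option and variable names are not shown in this commit message.

```rust
/// Hypothetical stand-in for the TOML-deserialized driver config.
#[derive(Debug, Clone, PartialEq)]
struct DriverConfig {
    /// Placeholder field; the real option name is not shown in the commit.
    example_setting: String,
}

/// Apply an environment override after deserialization. `lookup` abstracts
/// `std::env::var` so the logic can run without touching the process env.
fn apply_env_override<F>(mut config: DriverConfig, lookup: F) -> DriverConfig
where
    F: Fn(&str) -> Option<String>,
{
    // "EXAMPLE_DRIVER_SETTING" is a placeholder name, not the real variable.
    if let Some(value) = lookup("EXAMPLE_DRIVER_SETTING") {
        // Treating a blank value as "unset" is an assumption of this sketch.
        if !value.trim().is_empty() {
            config.example_setting = value;
        }
    }
    config
}
```

In a real process the lookup would be `|key| std::env::var(key).ok()`; passing it in as a closure keeps the override logic testable.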
…safe) (NVIDIA#1505)

In the Rust ecosystem there are largely three ways to do system calls:

- raw libc
- nix
- rustix

Of the three, libc is almost all `unsafe` and really 95% of use
cases should be either nix or rustix. nix is the original one,
but after having looked at the code of both, I think rustix
is just better designed and organized. It's also reached 1.0,
whereas nix is still making semver-breaking changes (in fact
we're behind here in this project).

Now in practice, we have both *transitively* in the depchain
already, and that's true for quite a lot of projects.

But I think rustix is better, so let's add rustix as
a workspace dependency (process feature) and migrate
a few use cases to it - it's especially better than the raw
libc which is surprisingly widespread.

If we agree to do this, then many other calls can be ported.

Signed-off-by: Colin Walters <walters@verbum.org>
…VIDIA#1507)

After NVIDIA#1415 ships, users upgrading from previous releases need guidance
on the gateway.env deprecation, port/bind/database path changes, and
the podman.socket restart requirement.

- docs(rpm): add 'Migrating from gateway.env' section to TROUBLESHOOTING
  covering backward compatibility, env-to-TOML key mapping, and three
  breaking changes (default port 8080->17670, bind address 0.0.0.0->127.0.0.1,
  database path move). Add podman.socket restart step to upgrade procedure.
- docs(rpm): add upgrade callout to CONFIGURATION.md pointing at migration
  section.
- fix(podman): retry PodmanComputeDriver ping up to 5 times with 2s delay
  to tolerate transient socket unavailability after package upgrades.
  The systemd unit uses Wants=podman.socket (not Requires) so the gateway
  can start while the socket is briefly re-activating after an RPM upgrade
  changes its unit file on disk.
- chore(rpm): update EnvironmentFile comment in RPM spec to explain
  backward-compatibility intent.
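
The bounded retry in the Podman fix above (up to 5 attempts, 2s apart) follows a common pattern that can be sketched generically. `retry_with_delay` is a hypothetical helper, not the function used in PodmanComputeDriver.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_attempts` times, sleeping `delay` between failed
/// attempts. Returns the first success, or the error from the last attempt.
/// `op` receives the 1-based attempt number.
fn retry_with_delay<T, E, F>(max_attempts: u32, delay: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    assert!(max_attempts > 0, "at least one attempt is required");
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Transient failure: wait, then try again.
            Err(_) => {
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

With the values from the commit message this would be called as `retry_with_delay(5, Duration::from_secs(2), |_| ping())`, where `ping` stands for whatever socket check the driver performs.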

Signed-off-by: Adam Miller <admiller@redhat.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* docs(rfc): add sandbox resource requirements proposal

Signed-off-by: Evan Lezar <elezar@nvidia.com>

* docs(rfc): finalize sandbox resource requirements

---------

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* fix(cli): add json output for policy get

* test(cli): cover policy get full json output

* fix(cli): address policy get json clippy

---------

Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* feat(providers): derive discovery from profiles

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* fix(providers): keep v2 discovery profile-only

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* docs(providers): update providers v2 behavior

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* fix(providers): make github profile read-only

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>

---------

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* fix(homebrew): repair local driver bootstrap state

* fix(bootstrap): satisfy default SAN doc lint
Thread the gateway_insecure flag through gateway_add(), gateway_login(),
and all OIDC HTTP clients so that --gateway-insecure and
OPENSHELL_GATEWAY_INSECURE apply to OIDC discovery, token exchange, and
token refresh requests.

Previously, the flag only affected gRPC connections to the gateway. OIDC
HTTP clients (reqwest::get and http_client) always verified TLS
certificates, causing gateway registration and login to fail when the
OIDC issuer used a self-signed certificate (common on OpenShift with
edge-terminated routes).

Fixes NVIDIA#1534

Signed-off-by: Adel Zaalouk <azaalouk@redhat.com>
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
Bumps [docker/login-action](https://github.com/docker/login-action) from 4.1.0 to 4.2.0.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](docker/login-action@4907a6d...650006c)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-version: 4.2.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
TaylorMutch and others added 28 commits June 1, 2026 15:17
* ci(kubernetes): pin mise in e2e workflow

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* ci(kubernetes): mirror postgres image for ha e2e

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* ci(kubernetes): reuse e2e workflow for ha

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

---------

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
…VIDIA#1661)

The gateway.sh script appended supervisor_image after the
[openshell.gateway.gateway_jwt] table header, so TOML parsed it as a
gateway_jwt field. The Podman driver never saw the override and fell
back to the default ghcr.io/nvidia/openshell/supervisor:latest.
Move supervisor_image into [openshell.drivers.podman] where the driver
config deserializer expects it.

Move three duplicated definitions into openshell-core so every
consumer has a single canonical source:

- format_bytes: identical 14-line function existed in docker,
  kubernetes, and vm drivers. Moved to openshell-core::progress
  where all three already imported from.

- DEFAULT_SANDBOX_PIDS_LIMIT: i64 constant (2048) duplicated in
  docker driver and podman config. Moved to openshell-core::config
  alongside other shared defaults. Podman re-exports it for
  internal call-site compatibility.

- current_time_ms: secrets.rs in openshell-sandbox reimplemented
  the same logic as openshell-core::time::now_ms(). Remove the
  local copy and call now_ms() directly via the existing dep.
…VIDIA#1666)

* fix(config): reject unknown fields in nested gateway config tables

The gateway TOML loader silently ignored keys placed under the wrong
table header. PR NVIDIA#1661 fixed one instance of this (supervisor_image
under [openshell.gateway.gateway_jwt]) but the root cause remained: the
nested gateway config tables did not deny unknown fields, so a misplaced
key was accepted and dropped instead of erroring.

Concretely, tasks/scripts/gateway.sh emitted `sandbox_namespace` right
after the [openshell.gateway.gateway_jwt] heredoc, so it landed inside
the gateway_jwt table rather than [openshell.gateway]. The k8s driver
already receives the namespace via [openshell.drivers.kubernetes], so
the stray line was dead config that parsed without complaint.

Changes:
- Add #[serde(deny_unknown_fields)] to the nested gateway config tables
  that are part of the config-file parse tree: TlsConfig, OidcConfig,
  MtlsAuthConfig, GatewayAuthConfig, GatewayJwtConfig.
- Remove the misplaced sandbox_namespace line from gateway.sh.
- Drop the unused Serialize/Deserialize derives from Config and
  ServiceRoutingConfig (see below).
- Add a regression test asserting a key under the wrong nested table is
  rejected.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
* feat(providers): add Google Vertex AI provider

Adds Vertex AI provider profiles, routing, credential refresh plumbing, CLI support, docs, and regression coverage. Keeps the related NETLINK_ROUTE seccomp allowance needed by Vertex client tooling that calls getifaddrs.

* docs: add Vertex AI sandbox usage for Claude Code and OpenCode

Cover the full end-to-end setup for running Claude Code and OpenCode
inside an OpenShell sandbox via inference.local with a Vertex AI backend:

- google-vertex-ai.mdx: add 'Use from a Sandbox' section with tabbed
  examples for Claude Code (--bare flag, no /v1 suffix) and OpenCode
  (/v1 suffix required). Add providers_v2_enabled prerequisite and
  --no-verify note for global region. Document policy proposals table
  covering metadata.google.internal (always blocked), downloads.claude.ai,
  and storage.googleapis.com.

- inference-routing.mdx: expand 'Use the Local Endpoint' section with
  tabbed examples for Claude Code, OpenCode, Python OpenAI SDK, and
  Python Anthropic SDK. Add notes explaining the /v1 path suffix
  difference between clients.

- supported-agents.mdx: update Claude Code and OpenCode rows to mention
  inference.local support and correct base URL requirements.

* fix: address vertex review findings

* test(sandbox): retry on spurious Ok in fork-exec ambiguity test

On arm64 under heavy CI load, the /proc fd scan in
find_socket_inode_owners can transiently miss the parent process's
socket fd entry, returning only the child as an owner. This causes
resolve_process_identity to return Ok (single owner, no ambiguity
check fires) instead of the expected ambiguous-ownership Err.

Extend the retry loop to also handle unexpected Ok results, mirroring
the existing retry for transient Err results. 10 retries at 50ms gives
a 500ms settling window, which is sufficient for procfs to stabilize
on loaded arm64 runners.

* fix: address vertex review regressions

* docs(router): clarify stream_response semantics for Vertex rawPredict routing

Document the three call sites of prepare_backend_request and their
stream_response values in a caller table:

- send_backend_request: false → :rawPredict (unary endpoint)
- send_backend_request_streaming: true → :streamRawPredict
- verify_backend_endpoint: explicitly false to probe the unary endpoint

Cross-reference the table from build_provider_url and
is_vertex_anthropic_rawpredict_route so the stream_response=true guard
in the suffix upgrade branch is understood in full context.

Also note that is_vertex_anthropic_rawpredict_route is a structural
predicate (model_in_path + anthropic_messages + :rawPredict suffix),
not a named-provider check, so any future provider with the same route
shape inherits the transforms automatically.
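
As a rough illustration of the suffix upgrade described above: only a streaming caller turns `:rawPredict` into `:streamRawPredict`. `apply_stream_suffix` is a hypothetical name, and the real `build_provider_url` logic may differ.

```rust
/// Upgrade a `:rawPredict` URL to `:streamRawPredict` when the caller asked
/// for a streaming response; otherwise return the URL unchanged.
fn apply_stream_suffix(url: &str, stream_response: bool) -> String {
    match url.strip_suffix(":rawPredict") {
        // The guard mirrors the stream_response=true condition.
        Some(base) if stream_response => format!("{base}:streamRawPredict"),
        _ => url.to_owned(),
    }
}
```

Under this sketch, a probe that passes `false` explicitly (as verify_backend_endpoint is described as doing) always hits the unary endpoint.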
* fix: correct example paths in local-inference README

* fix: correct example paths in local-inference routes.yaml
The Fedora RPM canary needs to exercise the install.sh user-service path, but a
GitHub Actions job container does not boot with systemd as PID 1. This means the
Fedora RPM canary was incomplete compared to the others.

With this change, we run Fedora as a nested privileged systemd container
instead, wait for systemd to become reachable, then start the root user manager
so systemctl --user works for the RPM gateway unit, achieving parity with the
other canary tests.

Signed-off-by: Kris Hicks <khicks@nvidia.com>
* chore: wip providers v2 tui and codex profile

* chore: wip effective policy get and codex profile

* chore: wip provider profiles and tui detail views

* feat(tui): annotate policy proposal review status
…1699)

Install the Snap built by the triggering Release Dev workflow by setting
merge-multiple: true on the artifact download. actions/download-artifact
otherwise extracts each artifact into its own subdirectory, leaving the
package at release/snap-linux-amd64/*.snap, so the install glob
./release/*.snap matched nothing. Merging flattens the artifact's contents
directly into release/ where the local `snap install --dangerous` step expects it.

Harden the Snap canary setup by enabling snapd.socket, waiting for snap
seeding (snap wait system seed.loaded), and running every step with strict
shell options (set -euo pipefail) so failures surface immediately.

Register the snapped gateway with the CLI as the documented local plaintext
snap-docker gateway, and print version and snap services, before running
openshell status so the canary verifies a configured and reachable gateway
instead of only the install.

Signed-off-by: Kris Hicks <khicks@nvidia.com>
Add a desktop launcher for the OpenShell TUI so users can launch
"openshell term" from their desktop environment application menu.

The change adds three files:
- snap/local/term.desktop: desktop entry file for the application launcher
- snap/local/icon.png: application icon (copied from snap store data)
- snapcraft.yaml: new "term" app entry that runs "openshell term"
  with home, network, ssh-keys, and system-observe plugs, plus install
  rules to stage the desktop file and icon under meta/gui/

The desktop file references the icon via ${SNAP} which is resolved
at runtime to the snap installation directory. The term app reuses
the same connection plugs as the main openshell app.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Auto-detection previously treated Podman as available only when the podman CLI
was visible on PATH. However, package manager services can run with a
restricted PATH, which lets Docker be selected even when a Podman API socket is
reachable. Additionally, podman may symlink /var/run/docker.sock to podman's
machine unix socket, which would be incorrectly detected as Docker. Worse
still: the podman machine may not even be running.

This replaces the Podman binary check with a functional HTTP probe against the
standard Podman socket paths. The probe requires /_ping to answer with a
Libpod-Api-Version header before treating the socket as Podman, which lets the
gateway select the embedded Podman driver only when the API is usable.
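
The header check at the heart of the probe can be sketched as a pure function over a raw HTTP response. This is an illustration, not the gateway's implementation: `looks_like_podman_ping` is a hypothetical name, the socket I/O is omitted, and also requiring a 200 status is an assumption of this sketch.

```rust
/// Decide whether a raw HTTP/1.1 response to `GET /_ping` looks like it came
/// from a Podman API socket: a 200 status (assumed here) plus a non-empty
/// `Libpod-Api-Version` header.
fn looks_like_podman_ping(raw_response: &str) -> bool {
    // Only the header section matters; any body follows a blank line.
    let head = raw_response.split("\r\n\r\n").next().unwrap_or("");
    let mut lines = head.split("\r\n");

    // Status line, e.g. "HTTP/1.1 200 OK".
    let status_ok = lines
        .next()
        .and_then(|status| status.split_whitespace().nth(1))
        .map_or(false, |code| code == "200");
    if !status_ok {
        return false;
    }

    // Header names are case-insensitive; require a non-empty value.
    lines.any(|line| match line.split_once(':') {
        Some((name, value)) => {
            name.trim().eq_ignore_ascii_case("Libpod-Api-Version") && !value.trim().is_empty()
        }
        None => false,
    })
}
```

A socket that answers `/_ping` without that header, such as a Docker daemon, would be rejected, and so would a symlinked socket whose backing Podman machine is not running, since no response arrives at all.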

Signed-off-by: Kris Hicks <khicks@nvidia.com>
Signed-off-by: Kris Hicks <khicks@nvidia.com>
The Ubuntu Snap canary downloads its artifact from a different workflow run
(the triggering Release Dev run) via run-id. Cross-run downloads require
authentication, so pass github.token to actions/download-artifact.

Signed-off-by: Kris Hicks <khicks@nvidia.com>
…teway (NVIDIA#1419)

The existing docs omitted or misstated several requirements when running
the gateway as a container with the Docker compute driver:

- OPENSHELL_GRPC_ENDPOINT is required; the Docker driver uses only the
  scheme (http/https) — host and port are substituted automatically with
  host.openshell.internal and the gateway's own bind port
- Supervisor binary must be extracted to a host path before starting the
  gateway; bind-mount sources are resolved by the host Docker daemon so
  the path must be identical inside and outside the gateway container
- Docker socket access requires adding the docker group (UID 1000 default)
- Port binding should remain 127.0.0.1; Docker driver adds a bridge
  listener automatically
- Add --server-san host.openshell.internal to generate-certs for mTLS
- Complete the mTLS docker run with all Docker driver requirements
- Add deploy/docker/gateway.toml — TOML config for the Docker driver
- Add deploy/docker/docker-compose.yml referencing the TOML
- Add docs/get-started/tutorials/docker-compose.mdx tutorial page
- Add remote gateway registration instructions (--remote flag)

Address reviewer feedback:
- Move Docker Compose tutorials card to the bottom of the list
- Replace inline YAML snippet in Docker Compose section with a reference
  to deploy/docker/ to avoid drift
- Clarify OPENSHELL_DB_URL is safe in compose.yml (plain SQLite path,
  no credentials); the TOML block targets credential-bearing DSNs
- Note that ./ in source: resolves relative to the compose file directory
- Clarify that only the scheme from OPENSHELL_GRPC_ENDPOINT matters
- Add note that the tilde volume mount resolves to the same absolute
  path on both host and container
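
The scheme-only handling of OPENSHELL_GRPC_ENDPOINT described above can be sketched as follows. `sandbox_grpc_endpoint` is a hypothetical helper; the Docker driver's actual code and error handling are not shown in this commit message.

```rust
/// Build the endpoint handed to sandboxes: keep only the scheme of the
/// configured endpoint, then substitute the host alias and the gateway's
/// own bind port.
fn sandbox_grpc_endpoint(configured: &str, bind_port: u16) -> Result<String, String> {
    let scheme = configured
        .split_once("://")
        .map(|(scheme, _)| scheme.to_ascii_lowercase())
        .ok_or_else(|| format!("endpoint {configured:?} has no scheme"))?;
    match scheme.as_str() {
        // Host and port from `configured` are deliberately discarded.
        "http" | "https" => Ok(format!("{scheme}://host.openshell.internal:{bind_port}")),
        other => Err(format!("unsupported scheme {other:?}")),
    }
}
```

Under this sketch, `https://anything:9999` with a bind port of 17670 yields `https://host.openshell.internal:17670`, which is why the certificate needs `host.openshell.internal` as a SAN when mTLS is enabled.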
…#1708)

Remove three groups of copy-pasted code in openshell-server:

1. grpc/mod.rs had a private current_time_ms() wrapper identical to the
   one already exported from persistence/mod.rs. Remove the duplicate
   and update the three grpc sub-modules (policy, sandbox, service) to
   import directly from crate::persistence.

2. test_store() was repeated verbatim in seven #[cfg(test)] blocks.
   Promote a single canonical version to persistence/mod.rs (cfg-gated)
   and replace all copies with crate::persistence::test_store() calls or
   a thin Arc wrapper in supervisor_session.

3. grpc_client_mtls() and build_tls_root() were copy-pasted across
   edge_tunnel_auth.rs and multiplex_tls_integration.rs. Move both into
   the existing tests/common/mod.rs shared module and import from there.
…IDIA#1700)

* fix(helm): create sandbox JWT secret under cert-manager

The cert-manager install path (certManager.enabled=true,
pkiInitJob.enabled=false) left the gateway StatefulSet unable to start
because nothing created the openshell-jwt-keys Secret: cert-manager owns
TLS Secrets but does not mint the sandbox JWT signing key, and the
certgen hook only rendered when pkiInitJob.enabled was true.

Separate JWT signing-key provisioning from TLS PKI provisioning:

- certgen: add a --jwt-only mode that creates only the Opaque JWT
  signing Secret, for use when another controller owns TLS Secrets.
- certgen.yaml: render the hook when pkiInitJob.enabled OR
  certManager.enabled is true. cert-manager takes precedence and runs
  the hook with --jwt-only even if pkiInitJob.enabled remains true.
  Remove the mutual-exclusion failure between the two values.
- _helpers.tpl: add openshell.sandboxJwtSecretName, shared by the hook
  and the StatefulSet mount.
- Update values, README, docs, architecture, and the
  debug-openshell-cluster skill to reflect the new precedence; the
  documented cert-manager install no longer needs pkiInitJob.enabled=false.

Closes NVIDIA#1691

* fix(helm): honor cert-manager precedence for client CA volume

The client CA volume logic treated pkiInitJob.enabled as proof that
built-in PKI owns the client CA. With cert-manager precedence now
allowing certManager.enabled=true alongside the default
pkiInitJob.enabled=true, that assumption mounts the server TLS cert
secret as the client CA and ignores
certManager.clientCaFromServerTlsSecret=false, which can break mTLS or
trust the wrong CA.

Gate the pkiInitJob.enabled term with (not certManager.enabled) in all
three client CA conditions (volume mount, volume definition, and secret
selection) so cert-manager owns TLS when enabled. Add a Helm test suite
covering built-in PKI, cert-manager shared CA, the regression config
(cert-manager + clientCaFromServerTlsSecret=false + default pkiInitJob),
and the no-client-CA case.
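
The precedence rules from these two Helm fixes reduce to a small truth table. The sketch below models them in Rust for illustration only; the chart itself expresses them in Helm templates, and the names `CertgenHook`, `certgen_hook`, and `builtin_pki_owns_client_ca` are hypothetical.

```rust
/// How the certgen hook is rendered for a given pair of values.
#[derive(Debug, PartialEq, Clone, Copy)]
enum CertgenHook {
    NotRendered,
    /// Built-in PKI path (assumed here to create TLS material and the JWT key).
    FullPki,
    /// cert-manager owns TLS; the hook only creates the JWT signing Secret.
    JwtOnly,
}

/// cert-manager takes precedence even if pkiInitJob.enabled is still true.
fn certgen_hook(pki_init_job_enabled: bool, cert_manager_enabled: bool) -> CertgenHook {
    if cert_manager_enabled {
        CertgenHook::JwtOnly
    } else if pki_init_job_enabled {
        CertgenHook::FullPki
    } else {
        CertgenHook::NotRendered
    }
}

/// The pkiInitJob term is gated with (not certManager.enabled).
fn builtin_pki_owns_client_ca(pki_init_job_enabled: bool, cert_manager_enabled: bool) -> bool {
    pki_init_job_enabled && !cert_manager_enabled
}
```

The regression config (cert-manager enabled with the default pkiInitJob.enabled=true) maps to `JwtOnly` and `false`, so the built-in PKI term no longer forces the server TLS secret in as the client CA.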
…ods (NVIDIA#1729)

Allow operators to configure a default Kubernetes runtimeClassName that
is applied to sandbox pods when the CreateSandbox request does not
specify one. This avoids requiring every API caller to explicitly set the
runtime class for clusters that always need a specific RuntimeClass
(e.g. kata-containers, nvidia).

The fallback is applied in the Kubernetes driver only — per-request
values still take priority, and an empty default (the built-in) preserves
existing behavior (field omitted, cluster default applies).
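
The fallback order described above can be sketched as a single function. `resolve_runtime_class` is a hypothetical name, and treating an empty per-request value as "not specified" is an assumption of this sketch rather than something the commit message states.

```rust
/// Resolve the runtimeClassName for a sandbox pod.
/// `requested` is the per-request value; `configured_default` is the
/// operator default (an empty string means no default). `None` means the
/// field is omitted from the pod spec, so the cluster default applies.
fn resolve_runtime_class(requested: Option<&str>, configured_default: &str) -> Option<String> {
    requested
        .filter(|name| !name.is_empty())
        // Fall back to the operator default only when the request has no value.
        .or(Some(configured_default).filter(|name| !name.is_empty()))
        .map(str::to_owned)
}
```

A request for `nvidia` on a cluster whose default is `kata-containers` keeps `nvidia`, while a request with no value picks up `kata-containers`.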
Resolves two conflicts from the aible branch:

- .github/workflows/docker-build.yml: keep both the aible
  `image-registry` input (for docker.io/aible registry support) and
  the upstream `publish-manifest` input; take upstream's simplified
  IMAGE_TAG format (always includes arch suffix).

- crates/openshell-driver-kubernetes/src/driver.rs: keep aible's
  `privileged: true` securityContext and `apply_gateway_readiness_init`
  wait-for-gateway init container alongside upstream's new
  `image_pull_secret_refs`, projected SA token volume, sandboxJwt
  bootstrap, and service account support.

Also updates deploy/docker/Dockerfile.gateway and Dockerfile.supervisor
to add libclang-dev, clang, and python3 (required by the z3/bindgen
build deps introduced in v0.0.56), and adds bundled-z3 feature to the
gateway build. Updates values.yaml to use docker.io/aible/openshell-supervisor:20260604.
@rodbutters rodbutters merged commit aa975cc into aible Jun 4, 2026
6 of 8 checks passed
@rodbutters rodbutters deleted the resync-to-v56 branch June 4, 2026 22:45