
Add agentic controller: API types, reconciler, and CLI #1

Merged
harche merged 60 commits into openshift:main from harche:wt/api-pr
May 7, 2026

Conversation

Contributor

@harche harche commented Apr 22, 2026

Summary

Initial agentic controller for agentic.openshift.io/v1alpha1, to be imported into lightspeed-operator.

  • CRD types (Proposal, Agent, LLMProvider) and generated manifests
  • Proposal lifecycle reconciler with approval gates, dynamic RBAC, retry, and escalation
  • oc-agentic CLI plugin

Test plan

  • Unit tests pass (go test ./...)
  • Builds and CRD generation succeed

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LLMProviderType identifies the hosting backend for an LLM provider.
Member

nit: Maybe just Provider? We have agentic scoping from the agentic.openshift.io group, and I'm having trouble imagining other providers besides these AI backends. Also, the AI backends like the Anthropic API, etc., are composite harnesses, and not just raw LLM models, right? So even if we need a way to distinguish between other agentic.openshift.io providers, I dunno if LLM-ness is quite the distinction to use. But also, 🤷, it's just a name, and I'm happy to use whatever, if folks don't want to bikeshed at this level of pedantry.

Contributor Author

Yeah fair point, they are more than just LLMs. But I lean toward keeping LLMProvider because when someone reads kind: LLMProvider in a YAML file they immediately know what it is. kind: Provider on its own could be anything without checking the API group. The extra specificity pays for itself in readability especially when people are scanning through manifests. Happy to change it if you feel strongly though.

Member

nobody reads kind: Provider on its own though, apiVersion is right next door. For example:

$ oc get -o yaml clusterversion version | head -n2
apiVersion: config.openshift.io/v1
kind: ClusterVersion

I don't feel strongly though, I'm just explaining my impressions, and you're free to do whatever makes sense to you and whoever this repo's approvers will be, regardless of how closely it matches my understanding.


I'm inclined to just Provider, unless there are more types. E.g. RAGProvider/OKPProvider/XYZProvider (similar to llama-stack).


Provider is definitely more straightforward. If we are concerned about needing additional "provider" types in the future, maybe making this more generic to something like InferenceProvider makes more sense here?

Taking inspiration from llama-stack they basically have inference, vector store, safety, and tool runtime providers and "inference" sounds the most applicable here.

Contributor Author

Good suggestion — llama-stack's InferenceProvider makes sense for their scope since they're a generic inference framework covering LLMs, embeddings, safety, and vector stores.

For us, LLMProvider is intentionally specific. This resource configures which LLM backend (Anthropic, OpenAI, Vertex, Bedrock) an agent talks to — model name, credentials, endpoint. It's not a generic inference abstraction, and we don't plan to extend it to cover embeddings, safety filters, or vector stores.

InferenceProvider also risks confusion with GPU-as-a-service inference platforms (CoreWeave, Lambda, OpenRouter, RunPod, etc.) — which are a different layer entirely. LLMProvider is unambiguous about what it configures.

Comment thread api/v1alpha1/agent_types.go Outdated
// runtime (e.g., forwarded from a user session).
//
// +enum
type MCPHeaderSourceType string
Member

Naming nit, are MCPs the only kind of tools that the agent needs to auth for? And this header is just about auth, right? Not the other things HTTP headers could be used for (content negotiation, cache management, etc.)? Maybe ToolAuthenticationSourceType?

Contributor Author

The naming here follows the existing MCPHeaderSourceType from the lightspeed-operator API so we stay consistent across both operators. Also these headers aren't strictly about auth, they could be used for routing or tenant identification too, so ToolAuthenticationSourceType would be narrowing it. Since the type is already scoped under MCPServerConfig the MCP prefix gives enough context without being too specific about the purpose.

Member

wait this is already merged lightspeed API? Can we just use the existing types then, instead of creating copies here?


Better to start fresh and see what's actually needed rather than inheriting current OLS types that may set the wrong base.

Comment thread api/v1alpha1/agent_types.go Outdated
//
// When omitted, the entire image is mounted as a single volume.
// +optional
Paths []string `json:"paths,omitempty"`
Member

There's translation risk if we deviate from the target APIs we expect to use to deliver this functionality. With this knob expecting to feed some VolumeMount knobs, maybe we should just use:

// Mount configures how the image will be mounted into the agent's sandbox Pod.
Mount *corev1.VolumeMount `json:"mount,omitempty"`

to be transparent about being a pass-through consumer, without having to have our own opinions about how volume-mounting works? And then we're on the hook to vendor-bump k8s.io/api with each release (tracking the oldest version of Kube that we claim compatibility with for a given Lightspeed operator branch?), but we aren't on the hook to keep a closer eye on evolutions like new approaches to declaring subpaths or whatever it is upstream decides to add to VolumeMount next.

Contributor Author

Yeah I considered this. The thing is, paths isn't really a thin wrapper over VolumeMount. The operator does something fairly opinionated with it. It takes a single image volume and creates multiple subPath mounts, using path.Base() to derive the mount name and placing everything under /app/skills/. So /skills/prometheus becomes a subPath mount at /app/skills/prometheus, and /skills/cluster-update/update-advisor becomes /app/skills/update-advisor.

If we exposed a raw VolumeMount here, the user would need to manually specify mountPath, subPath, and name for each skill, and get them all consistent with the conventions the operator expects. That's a lot of ceremony for what should be "give me these skills from this image."

I think paths is the right abstraction for the common case. If we run into situations where someone needs full VolumeMount control we could always add an escape hatch later. Wdyt?

Member

you can also mask the properties you don't want to expose. E.g. you have opinions on Name, so don't expose that. You have opinions on ReadOnly, so don't expose that. You're left with SubPath and SubPathExpr, so expose those? That saves you the work of having to translate from paths to multiple VolumeMounts, while still giving the spec-author some properties/Godocs they're familiar with from their Pod work.

Contributor Author

I see the appeal of familiar field names but I think it would actually be more confusing. If a user sees subPath they'll expect they can also set mountPath, readOnly, etc. and wonder why those are missing. And SubPathExpr uses downward API env var substitution which doesn't really apply to skills mounting.

paths is doing more than subPath under the hood. For example, this spec:

skills:
  - image: registry.ci.openshift.org/ocp/5.0:agentic-skills
    paths:
      - /skills/prometheus
      - /skills/cluster-update/update-advisor

produces two VolumeMounts from a single image volume:

volumeMounts:
  - name: skills
    mountPath: /app/skills/prometheus
    subPath: skills/prometheus
    readOnly: true
  - name: skills
    mountPath: /app/skills/update-advisor
    subPath: skills/cluster-update/update-advisor
    readOnly: true

The operator derives the mount name from path.Base(), hardcodes readOnly, and places everything under /app/skills/. Exposing subPath directly would let users think they control all of this when they don't.
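
As a sketch of that derivation (the local structs and the /app/skills/ prefix are taken from this thread; the actual operator code may differ):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// volumeMount mirrors the few corev1.VolumeMount fields the operator
// sets; a local struct keeps this sketch self-contained.
type volumeMount struct {
	Name      string
	MountPath string
	SubPath   string
	ReadOnly  bool
}

// mountsForSkills derives one subPath mount per requested skill path:
// the mount name is fixed, the mount path is /app/skills/<path.Base>,
// and readOnly is hardcoded, as described above.
func mountsForSkills(paths []string) []volumeMount {
	mounts := make([]volumeMount, 0, len(paths))
	for _, p := range paths {
		mounts = append(mounts, volumeMount{
			Name:      "skills",
			MountPath: "/app/skills/" + path.Base(p),
			SubPath:   strings.TrimPrefix(p, "/"),
			ReadOnly:  true,
		})
	}
	return mounts
}

func main() {
	for _, m := range mountsForSkills([]string{
		"/skills/prometheus",
		"/skills/cluster-update/update-advisor",
	}) {
		fmt.Printf("%s subPath=%s\n", m.MountPath, m.SubPath)
	}
}
```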

Member

But you can generate that same volumeMounts context with just subPath exposed, like:

skills:
- image: registry.ci.openshift.org/ocp/5.0:agentic-skills
  subPath: /skills/prometheus
- image: registry.ci.openshift.org/ocp/5.0:agentic-skills
  subPath: /skills/cluster-update

right? Although for agentic-skills specifically, I'd expect any flattening needed to happen at image-build time, so you can just do:

skills:
- image: registry.ci.openshift.org/ocp/5.0:agentic-skills
  subPath: /skills

and not feel like you have to track and handle each skill you want mounted from that image (unless for some reason you want that control to exclude skills you don't want the agent to access).


I'm not strongly opinionated here, but if the goal is to have a simpler API, the existing shape seems generally OK to me.

Comment thread api/v1alpha1/agent_types.go Outdated
Comment thread api/v1alpha1/llmprovider_types.go Outdated
Comment thread api/v1alpha1/workflow_types.go Outdated
Comment thread api/v1alpha1/workflow_types.go Outdated
Comment thread api/v1alpha1/workflow_types.go Outdated
Comment thread api/v1alpha1/proposal_analysis_types.go Outdated
)

// DiagnosisResult contains the root cause analysis from the analysis agent.
// This is populated by the agent during the Analyzing phase and stored in
Member

nit: "phase" -> "step", here and elsewhere, to align with WorkflowSpec's language like "analysis references an Agent for the analysis step". I don't think you're trying to make a phase-vs-step distinction, and if you aren't, sticking with one word will help keep folks from wondering about that possibility.

Contributor Author

Agree, it should be 'step'. I will change it everywhere. Thanks.

Comment thread api/v1alpha1/proposal_analysis_types.go Outdated
Comment thread api/v1alpha1/proposal_analysis_types.go Outdated
// +kubebuilder:validation:MaxLength=253
Name string `json:"name"`
// command is the command or API call to run for this check
// (e.g., "oc get pod -n production -l app=web -o jsonpath='{.items[0].status.phase}'").
Member

This requires oc in the agent-sandbox Pod, but we've dropped that requirement. Do we need to restore it as part of the "what you can expect the agent to do?" contract between the agent container executing the step and the AI engine generating the advice?

Member

@wking wking Apr 22, 2026

As one possibility, OpenShift docs list prereqs to help readers decide if they will be able to follow the process we're about to talk about. For example:

Prerequisites

You installed the OpenShift CLI (oc) that matches the version for your updated version.
You are logged in to the cluster as user with cluster-admin privileges.
You have paused all MachineHealthCheck resources.

Those aren't dynamically negotiated though. We could have a system that had some dynamic negotiation (e.g. "oh, no oc? How about kubectl?"). Or we could hard-code the requirements for the Agent image as non-negotiable, and then the operator here could hard-code that context ("step commands should be Bash scripts with access to POSIX utilities as well as oc (the OpenShift client) and..."). Or there could be an endpoint served by the image (/context?) where the image tries to describe itself ("I am a POSIX system with the following additional commands..."), and the operator could slurp that up and pass it along to the AI engine when asking for command steps. Or some such...

Contributor Author

The operator communicates with agents via HTTP POST, but the agent image itself ships with CLI tools (oc, promtool, rg, standard unix utilities) that agents use to interact with the cluster inside their sandbox. The command field here is what the analysis agent recommends the verification agent run using those tools, not a direct operator-to-agent call. So oc in the example is correct.

Contributor Author

We could have a system that had some dynamic negotiation (e.g. "oh, no oc? How about kubectl?"). Or we could hard-code the requirements for Agent image as non-negotiable...

The system prompt already handles this today. It tells the LLM what tools are available in the sandbox (oc, promtool, rg, standard POSIX utils). The system prompt is referenced via systemPromptRef on the Agent CR, so different agent images can describe different toolsets.

// before the formal verification step. This gives early signal on whether
// the remediation worked. In trust-mode workflows (verification skipped),
// this is the only verification that occurs.
type ExecutionVerification struct {
Member

I'm not clear of the value add of this type, vs. recycling VerifyCheck for this use-case.

Contributor Author

They serve different purposes. ExecutionVerification is a quick "did things improve?" signal from the execution agent, just a bool and a summary. VerifyCheck is a structured per-check result from the dedicated verification step with name, source, value, and pass/fail. Collapsing them would force the execution agent to produce structured check results for what's meant to be a lightweight sanity check.
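
To make the contrast concrete, a hedged sketch of the two shapes as this thread describes them (field names here are assumptions, not the repo's actual definitions):

```go
package main

import "fmt"

// ExecutionVerification is the lightweight "did things improve?" signal
// reported by the execution agent itself: just a bool and a summary.
type ExecutionVerification struct {
	Improved bool
	Summary  string
}

// VerifyCheck is a structured per-check result from the dedicated
// verification step: a name, source, observed value, and pass/fail.
type VerifyCheck struct {
	Name   string
	Source string
	Value  string
	Passed bool
}

func main() {
	quick := ExecutionVerification{Improved: true, Summary: "error rate dropped after remediation"}
	check := VerifyCheck{Name: "pod-phase", Source: "oc", Value: "Running", Passed: true}
	fmt.Println(quick.Summary)
	fmt.Println(check.Name, check.Passed)
}
```

Collapsing the first into the second would force the execution agent to emit the full structured shape for what is meant to be a lightweight sanity check.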

Comment thread api/v1alpha1/proposal_status_types.go Outdated
Sandbox SandboxInfo `json:"sandbox,omitempty"`
// success indicates whether execution completed successfully.
// +optional
Success *bool `json:"success,omitempty"`
Member

success: false leaves a lot of room for ambiguous stress. Can we drop this, and instead declare some well-known types for Conditions where we talk about success with room for a confidence read and human-facing description string? As an example of that, see the Applies condition for conditional update risks.

Contributor Author

Agree, dropped success from both ExecutionStepStatus and VerificationStepStatus. The operator reads success from the agent's response and sets step conditions with reason and message instead. Conditions already carry the same information with more context.


Looks like this is still here?

Contributor Author

The Success *bool field that was discussed in the original thread was removed from ExecutionStepStatus and VerificationStepStatus as stated.

However, StepResultRef and the four result CRDs (AnalysisResult, ExecutionResult, VerificationResult, EscalationResult) still had a success bool field. Replaced all of them with outcome: ActionOutcome (enum: Succeeded / Failed) per the boolean convention. Updated the controller, CLI, and console UI accordingly.
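
A minimal sketch of the bool-to-enum change (the Succeeded/Failed value set comes from the comment above; the constant names are assumptions):

```go
package main

import "fmt"

// ActionOutcome replaces a bare `success bool` with a named enum, which
// reads unambiguously in YAML and leaves room for additional terminal
// states without another schema change.
type ActionOutcome string

const (
	OutcomeSucceeded ActionOutcome = "Succeeded"
	OutcomeFailed    ActionOutcome = "Failed"
)

func main() {
	fmt.Println(OutcomeSucceeded, OutcomeFailed)
}
```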

Comment thread api/v1alpha1/agent_types.go Outdated
Comment thread api/v1alpha1/proposal_types.go Outdated
// +patchStrategy=merge
// +patchMergeKey=type
// +optional
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`

Should we also add observedGeneration?

Contributor Author

Agree, added observedGeneration to ProposalStatus.

Comment thread api/v1alpha1/agent_types.go Outdated
// execution, or verification). When omitted, the agent uses a default
// prompt appropriate for its workflow step.
// +optional
SystemPromptRef *corev1.LocalObjectReference `json:"systemPromptRef,omitempty"`

Agent type is clusterscoped. Would be better to also have namespace in the reference?

Contributor Author

@harche harche Apr 23, 2026

You're right that namespace is needed. I'm actually reconsidering making all four CRDs (LLMProvider, Agent, Workflow, Proposal) namespace-scoped instead of cluster-scoped. Namespace scoping gives multi-tenancy and RBAC isolation naturally, simplifies references (everything is LocalObjectReference within the same namespace), and makes it easier to scope execution RBAC to the namespace boundary. Cluster-wide use cases still work by putting everything in a shared namespace. Thoughts? cc @wking @mrunalp

Contributor Author

I am updating the code to make those namespace-scoped; I can update the existing operator code to verify everything still works fine.

Member

We'd need cluster-scoped agents and namespace-scoped agents that just look at application issues within those bounds.


@mrunalp If we make agents namespace-scoped, an Agent that needs scope beyond the namespace (cluster-wide) will have to end up in an arbitrary/default namespace - which I think is not a good situation.

Contributor Author

The Agent CR is configuration only, it defines which LLM, skills, and prompt to use. It doesn't control the agent's runtime permissions. Those come from the RBAC rules the analysis step recommends, which admins review and approve as part of the proposal. Only after approval does the operator provision the Role/ClusterRole bindings on the sandbox ServiceAccount. The sandbox pod itself always runs in the operator's namespace (openshift-lightspeed), not the Agent's namespace, so there's no exec-based escalation path. An Agent in openshift-lightspeed namespace can have full cluster-wide access at runtime if admins approve the requested RBAC.

Contributor

See comment above on modes. I think we need mode-level configuration


If the execution of the agents will only ever happen in the openshift-lightspeed namespace, just specify that this configmap must exist in the openshift-lightspeed namespace?

Contributor Author

This field has been removed — the Agent CR no longer has systemPromptRef or any ConfigMap reference. The Agent is now a cluster-scoped tier (LLM provider, timeouts, maxTurns, providerSettings). Skills, system prompts, and MCP servers are configured per-proposal via ToolsSpec.

// set by the operator -- users should not modify status fields directly.
// The status provides complete observability into the proposal's progress,
// including per-step results, retry history, and standard Kubernetes conditions.
type ProposalStatus struct {

nit: What about adding something like Phase here reflecting the lifecycle https://github.com/openshift/lightspeed-agentic-operator/pull/1/changes#diff-608f31719fd83cc2e66fb400f835b53884e9fb407d0c826397d1c0fe82fc79a8R44? It could be then used in printerColumns.

Contributor Author

Good idea. Added a phase field to ProposalStatus with a printerColumn. The operator derives it from conditions on every reconcile — conditions stay the source of truth, phase is for human consumption (oc get, console list views).
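
As a sketch of that conditions-as-source-of-truth approach (the condition types and phase names here are illustrative assumptions, not the repo's actual values):

```go
package main

import "fmt"

// condition is a stripped-down stand-in for metav1.Condition, keeping
// the sketch free of k8s.io dependencies.
type condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// derivePhase computes a display-only phase from conditions on every
// reconcile; conditions remain the source of truth, and phase exists
// only for oc get / console list views.
func derivePhase(conds []condition) string {
	get := func(t string) string {
		for _, c := range conds {
			if c.Type == t {
				return c.Status
			}
		}
		return "Unknown"
	}
	switch {
	case get("Completed") == "True":
		return "Completed"
	case get("Failed") == "True":
		return "Failed"
	case get("Executing") == "True":
		return "Executing"
	default:
		return "Pending"
	}
}

func main() {
	fmt.Println(derivePhase([]condition{{Type: "Executing", Status: "True"}}))
}
```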

Contributor

We seem to be mixing two concerns - phase, which is a workflow/FSM state defined by the Workflow, and proposal status - what is happening with the proposal.

Comment thread api/v1alpha1/agent_types.go Outdated
// +kubebuilder:validation:XValidation:rule="self.type != 'Secret' ? !has(self.secretRef) : true",message="secretRef must not be set when type is 'Kubernetes' or 'Client'"
type MCPHeaderValueSource struct {
// type specifies the source type for the header value.
// +kubebuilder:validation:Required

Suggested change
// +kubebuilder:validation:Required
// +required

This is present everywhere so just calling it out here.

Talking with @JoelSpeed and @camilamacedo86, it seems that there are plans to deprecate this.

kubernetes-sigs/controller-tools#1241

Contributor Author

Agree, switched all +kubebuilder:validation:Required markers to +required.

Comment thread api/v1alpha1/agent_types.go Outdated
// MCPHeader defines an HTTP header to send with every request to an
// MCP server. Used for authentication and routing.
type MCPHeader struct {
// name of the header (e.g., "Authorization", "X-API-Key").

One feedback from API review is to add all requirements on these fields to the GoDoc.

When you do oc explain you want to see what requirements there are on the fields. The markers do not show up as documentation.

Contributor Author

Agree, added all validation requirements (min/max length, enum values, patterns, item limits) to the GoDoc so they show up in oc explain.

Comment thread api/v1alpha1/agent_types.go Outdated
// headers to send to the MCP server.
// +optional
// +listType=map
// +listMapKey=name

Linter should say that you need ssatags here.

Contributor Author

Ran kube-api-linter against the API types — 0 issues. All list fields already have +listType markers (atomic or map with appropriate keys). Let me know if you're seeing something specific I'm missing.


Contributor Author

yeah, my bad, I didn't run it correctly, I ran into those failures too, but fixed now. Thanks.

… skills

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
harche and others added 5 commits May 4, 2026 09:58
- Add EscalationOutputSchema (success/summary/content) and register in
  defaultOutputSchemas map so the escalation agent receives structured
  output constraints
- Fix duplicate EscalationResult: add ConditionUnknown/InProgress guard
  to handleEscalation, matching the pattern used by analysis/execution
  handlers. Without this, a re-queued reconcile during agent execution
  would spin up a second sandbox
- Add TestEscalation_InProgressIsIdempotent to verify re-reconcile
  during escalation is a no-op

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace explicit StartTime/CompletionTime fields on step statuses and
Outcome/StartTime/CompletionTime on result CRs with standard K8s
conditions. Result CRs now have a status subresource with Started and
Completed conditions whose lastTransitionTime provides timestamps and
whose reason encodes the outcome (Succeeded/Failed).

Also adds defaultOption to ApprovalPolicyStage with CEL validation
requiring it when execution approval is Automatic, preventing
auto-approved execution from running with an empty remediation plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor LLMProvider discriminated union, enhance ToolsSpec with
secret mount types, update CLI for revised API types, and adjust
sandbox template generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace string-based mountAs examples and API reference with the
discriminated union SecretMountSpec (type: EnvVar/FilePath).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SandboxClaim's own Ready condition provides pod lifecycle timestamps.
Step execution timing is tracked by step conditions. These fields were
redundant (CompletionTime was never set).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@everettraven everettraven left a comment

Only need to address comments labeled Blocking: and then this should be good to go.

Please ping me once these are addressed and I'll re-review.

Comment thread api/v1alpha1/llmprovider_types.go Outdated
Comment on lines +176 to +181
// Must contain only lowercase letters, digits, and hyphens
// (e.g., "us-east-1", "eu-west-2", "ap-southeast-1").
// +required
// +kubebuilder:validation:MinLength=2
// +kubebuilder:validation:MaxLength=63
// +kubebuilder:validation:XValidation:rule="self.matches('^[a-z][a-z0-9-]*[a-z0-9]$')",message="region must contain only lowercase letters, digits, and hyphens, start with a letter, and not end with a hyphen"

Non-blocking: Explicitly document that values must begin with lowercase letters and must end with lowercase alphanumeric characters.

Contributor Author

Done. Updated region field comments on both Vertex and Bedrock configs to explicitly state: must begin with a lowercase letter and end with a lowercase alphanumeric character.

Comment thread api/v1alpha1/llmprovider_types.go Outdated
// +optional
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=2048
// +kubebuilder:validation:XValidation:rule="isURL(self) && url(self).getScheme() in ['http', 'https'] && url(self).getHostname() != '' && !self.contains('@') && !self.contains('#')",message="url must be a valid HTTP or HTTPS URL with a hostname; fragments and userinfo are not allowed"

Non-blocking: We normally suggest that CEL validations are broken into validations that are as tightly scopes as possible.

For example, I'd probably break this into 4 validations:

  1. HTTP/HTTPS scheme check
  2. Hostname check
  3. userinfo check
  4. fragment check

Contributor Author

Done. Split the monolithic URL CEL validation into 4 separate rules across all URL/endpoint fields: scheme check, hostname check, userinfo check, fragment check. Each now has its own error message.
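
For illustration, a hypothetical split along those four axes (the marker text and messages are illustrative, not the repo's actual rules):

```go
// +kubebuilder:validation:XValidation:rule="isURL(self) && url(self).getScheme() in ['http', 'https']",message="url must use the http or https scheme"
// +kubebuilder:validation:XValidation:rule="isURL(self) && url(self).getHostname() != ''",message="url must include a hostname"
// +kubebuilder:validation:XValidation:rule="!self.contains('@')",message="url must not contain userinfo"
// +kubebuilder:validation:XValidation:rule="!self.contains('#')",message="url must not contain a fragment"
URL string `json:"url,omitempty"`
```

Each rule failing independently means the user sees exactly which constraint was violated.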

Comment on lines +189 to +190
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=2048

Non-blocking: Always explicitly document minimum and maximum length constraints.

Contributor Author

Done. Added "Maximum 2048 characters." to all URL/endpoint field comments across all provider configs.

// When omitted, the SDK default is used.
// +optional
// +kubebuilder:validation:MaxLength=32
// +kubebuilder:validation:XValidation:rule="self.matches('^[0-9]{4}-[0-9]{2}-[0-9]{2}(-preview)?$')",message="apiVersion must be a date in YYYY-MM-DD format with an optional -preview suffix"

Non-blocking: Is date-based the only versioning scheme they have? It looks like they have a v1 API, based on https://learn.microsoft.com/en-us/azure/foundry/openai/latest

Comment on lines +73 to +74
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=500

Non-blocking: Explicitly document these bounds in the field comment.

Contributor Author

Addressed — documented min/max bounds in the maxTurns field comment (same as #39 above).

Comment thread api/v1alpha1/proposalapproval_types.go Outdated
Comment on lines +89 to +92
// denied indicates the user has denied this step, making the entire
// proposal terminal. Once set to true, it cannot be unset.
// +optional
Denied bool `json:"denied,omitempty"`

Blocking: Do not use a boolean and instead use an enum. Maybe something like:

state: Approved | Denied

?

Feel free to use a different field name.

Non-blocking: If we make this CR represent approval+denial, maybe ProposalApproval is a bit of a misleading name?

Contributor Author

Done. Replaced denied bool with decision ApprovalDecision enum (Approved | Denied). Updated CEL rule to prevent changing decision once set to Denied. Updated controller, CLI, and console.
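The bool-to-enum change can be sketched as follows. Type and constant names are assumptions based on the reply; the real enforcement of "cannot change once Denied" is the CEL rule on the CRD:

```go
package main

import "fmt"

// ApprovalDecision replaces the old `denied bool`. An enum reads
// unambiguously in YAML and leaves room for future states.
// +kubebuilder:validation:Enum=Approved;Denied
type ApprovalDecision string

const (
	DecisionApproved ApprovalDecision = "Approved"
	DecisionDenied   ApprovalDecision = "Denied"
)

// terminal reports whether a decision ends the proposal: denying any
// stage makes the entire proposal terminal.
func terminal(d ApprovalDecision) bool { return d == DecisionDenied }

func main() {
	fmt.Println(terminal(DecisionDenied), terminal(DecisionApproved))
}
```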

// +optional
metav1.ObjectMeta `json:"metadata,omitempty"`

// +optional

Blocking: Make spec required.

Contributor Author

Done. Made spec required on ProposalApproval.

Contributor Author

Update: during cluster testing, found that making spec required conflicts with the operator's lifecycle. The operator creates ProposalApproval with an empty spec when all stages are Manual (the default) — users then add approval stages one at a time. Reverting to +optional with omitzero so the initial empty state is valid. The MinItems=1 on stages still prevents stages: [] — omitting stages entirely is the valid empty state.

Comment thread api/v1alpha1/proposalapproval_types.go Outdated
Comment on lines +122 to +127
// +kubebuilder:validation:XValidation:rule="self.stages.all(s1, self.stages.all(s2, s1 == s2 || s1.type != s2.type))",message="stage types must be unique"
type ProposalApprovalSpec struct {
// stages lists the approved (or denied) workflow steps. Each entry is
// a discriminated union keyed by type.
// +optional
// +listType=atomic

Blocking: For uniqueness use +listType=map and +listMapKey=type

Contributor Author

Done. Replaced +listType=atomic + CEL uniqueness with +listType=map +listMapKey=type on stages.
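A sketch of the resulting shape (stand-in types; only the markers come from the thread). With `+listType=map` keyed on `type`, the API server itself rejects duplicate entries and merges per key under server-side apply, so the CEL uniqueness scan goes away:

```go
package main

import "fmt"

// ApprovalStage is a stand-in for the real discriminated-union entry;
// type is the discriminator and the map key for the list.
type ApprovalStage struct {
	Type string `json:"type"`
}

// ProposalApprovalSpec sketches the marker change on stages.
type ProposalApprovalSpec struct {
	// +optional
	// +listType=map
	// +listMapKey=type
	Stages []ApprovalStage `json:"stages,omitempty"`
}

func main() {
	s := ProposalApprovalSpec{Stages: []ApprovalStage{{Type: "Analysis"}, {Type: "Execution"}}}
	fmt.Println(len(s.Stages))
}
```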

// (spec.tools) or per-step (spec.analysis.tools, spec.execution.tools,
// spec.verification.tools). Per-step tools replace the shared default
// for that step.
type ToolsSpec struct {

Blocking: Set MinProperties=1 on this type.

Contributor Author

Done. Added MinProperties=1 to ToolsSpec (addressed as part of the tools pointer-to-value change).

Comment on lines +42 to +82
// proposalName is the name of the parent Proposal in the same namespace.
// +required
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=253
ProposalName string `json:"proposalName"`

// attempt is the 1-based overall attempt number.
// +required
// +kubebuilder:validation:Minimum=1
Attempt int32 `json:"attempt"`

// retryIndex is the 0-based retry index within the current analysis.
// +required
// +kubebuilder:validation:Minimum=0
RetryIndex int32 `json:"retryIndex"`

// checks contains individual verification check results.
// +optional
// +listType=atomic
// +kubebuilder:validation:MaxItems=50
Checks []VerifyCheck `json:"checks,omitempty"`

// summary is a Markdown-formatted verification summary.
// +optional
// +kubebuilder:validation:MaxLength=32768
Summary string `json:"summary,omitempty"`

// components contains optional adapter-defined structured data.
// +optional
// +listType=atomic
// +kubebuilder:validation:MaxItems=20
Components []apiextensionsv1.JSON `json:"components,omitempty"`

// sandbox tracks the sandbox pod used for this verification.
// +optional
Sandbox SandboxInfo `json:"sandbox,omitzero"`

// failureReason is populated when the step failed due to a system error.
// +optional
// +kubebuilder:validation:MaxLength=8192
FailureReason string `json:"failureReason,omitempty"`

Blocking: Move these result fields to status.

Contributor Author

Done. Moved checks, summary, sandbox, failureReason under .status on VerificationResult.

harche and others added 15 commits May 6, 2026 08:32
…1, #13-17, #19-22, #29, #32)

- openshift#1: Change status.attempts from *int32 to int32 with min=1/max=int32 validation
- openshift#2: Move maxAttempts from ProposalSpec to ApprovalPolicy (admin ceiling) and
  ExecutionApproval (user choice); fix semantics so N = N total attempts
- openshift#3: Remove spec.revision, use metadata.generation for revision signaling
- openshift#4: Make ProposalStep.tools non-pointer with omitzero, add MinProperties=1
- #8-11: Make outcome/result/conditionOutcome required; SandboxInfo fields required
- #13: Add CEL singleton rule (metadata.name == "cluster") on ApprovalPolicy
- #14: Use listType=map+listMapKey=name on ApprovalPolicy stages
- #15: Remove defaultOption from ApprovalPolicy (operator defaults to option 0)
- #16-17: Add omitempty to name/approval, document allowed values
- #19: Replace denied bool with decision enum (Approved|Denied) + CEL immutability
- #20: Make ProposalApproval spec required
- #21: Use listType=map+listMapKey=type on ProposalApproval stages
- #22: Make estimatedImpact required on RemediationOption
- #29: Remove components from ExecutionResult and VerificationResult (copy-paste error)
- #32: Add MinProperties=1 to ToolsSpec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…#30-31)

Each Result CRD (AnalysisResult, ExecutionResult, VerificationResult,
EscalationResult) now has its own status type containing result data +
conditions. Identity fields (proposalName, attempt, retryIndex) remain
top-level. Result data (options, actionsTaken, checks, summary, content,
components, sandbox, failureReason) moved under .status since results
communicate outcomes, not desired state.

Also removes components from ExecutionResult and VerificationResult
(copy-paste error — not in their output schemas).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…leanup

- Change ProposalStep fields from pointer to value with omitzero + MinProperties=1
- Add RFC1123 DNS subdomain CEL validation on agent name fields
- Add MinProperties=1 to step status types, RBACResult, AgentTimeouts
- Add MinItems=1 on ApprovalPolicy stages
- Add CEL rule requiring rollbackPlan when reversible is Reversible/Partial
- Fix rollbackPlan location in analysis schema (was under verification)
- Remove MinLength=1 on apiGroups items to allow "" (core group)
- Split monolithic URL CEL into 4 separate rules with specific messages
- Document field constraints: region, model, maxTurns, timeouts, URL length

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every field in the analysis, execution, verification, and escalation
output schemas now has a description guiding the agent on what to
produce. These descriptions are the agent's primary instruction for
structured output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add MinLength=1 on optional strings (FailureReason, Summary, etc.)
- Add MinItems=1 on optional arrays (Conditions, Results, etc.)
- Add MinProperties=1 on status/spec types
- Fix godoc comments to start with field name (metadata, spec, status)
- Add omitempty to required fields per linter convention
- Change Result CR status fields from omitempty to omitzero
- Change ApprovalStage approval fields from pointer to value with omitzero
- Move Conditions to first field in ApprovalStageStatus
- Add MaxLength=253 to ApprovalStageStatus.Name
- Add Maximum=10 to RetryIndex, Minimum=1 to ObservedGeneration
- Document API linting in CLAUDE.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pec, examples

- Add generation to console Proposal metadata type (TS compile error)
- Revert ProposalApproval spec from +required to +optional — empty spec
  is the valid initial state when all stages are Manual
- Fix example YAMLs: mountAs string→struct, execution: {}→agent: default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix createIdempotent to properly write status via status subresource
  (DeepCopy before Create, then Status().Patch with full status)
- Add tests for createIdempotent with WithStatusSubresource
- Make VerificationStep.expected optional (agent doesn't always know)
- Default agent field to "default" on approval types (satisfies linter)
- Make ProposalApprovalSpec.stages required with default=[], remove MinItems
  conflict, keep MinProperties=1 on type
- Change RetryIndex to *int32 (valid zero value needs pointer)
- Update verification schema: source description asks for full command
- Remove reflect import, no linter exclusions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each Result CRD (AnalysisResult, ExecutionResult, VerificationResult,
EscalationResult) now has a Spec type containing identity fields
(proposalName, attempt, retryIndex). Spec is immutable via CEL rule
self == oldSelf. Print columns updated to .spec.* paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
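The `self == oldSelf` rule can be mirrored as an admission-style check in plain Go; a sketch under the assumption that spec carries only the identity fields named in the commit:

```go
package main

import (
	"fmt"
	"reflect"
)

// AnalysisResultSpec holds the identity fields the commit makes immutable.
type AnalysisResultSpec struct {
	ProposalName string
	RetryIndex   int32
}

// validateUpdate mirrors the CEL rule `self == oldSelf`: any change to
// spec after creation is rejected.
func validateUpdate(oldSpec, newSpec AnalysisResultSpec) error {
	if !reflect.DeepEqual(oldSpec, newSpec) {
		return fmt.Errorf("spec is immutable")
	}
	return nil
}

func main() {
	old := AnalysisResultSpec{ProposalName: "p", RetryIndex: 0}
	fmt.Println(validateUpdate(old, old))
	fmt.Println(validateUpdate(old, AnalysisResultSpec{ProposalName: "p", RetryIndex: 1}))
}
```

Doing this in CEL rather than a webhook keeps the guarantee inside the API server, with no extra component in the request path.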
…ghten RetryIndex

- Remove Attempts from ProposalStatus (always 1, never incremented)
- Remove Attempt from all Result CRD specs (always 1)
- Move ObservedGeneration from AnalysisStepStatus to conditions
  (standard K8s pattern — every condition now carries ObservedGeneration)
- Set ObservedGeneration on all 18 condition calls
- Tighten RetryIndex Maximum from 10 to 2 (matches maxAttempts max of 3)
- Update CEL messages: decisions/maxAttempts immutability
- Clarify that denying any stage terminates the entire proposal
- Update verification schema source description for full commands
- Make VerificationStep.expected optional
- Default agent to "default" on approval types (satisfies linter)
- Clean up related tests, CLI, templates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…steps

- Make rbac required in analysis schema only when proposal has execution
- Make verification required only when proposal has verification step
- Advisory proposals get a minimal schema (title, diagnosis, proposal)
- Keep MinProperties=1 on RBACResult (when present, must have rules)
- Make VerificationStep.Command and Expected optional
- Make RollbackPlan.Command optional
- Remove rollbackPlan CEL conditional requirement
- Add tests for outputSchemaForStep across all proposal variants

Tested on cluster:
- Full (exec + verify): passed
- Trust mode (exec, no verify): passed
- Advisory with target: passed
- Trivial advisory: passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Template conditionally includes RBAC and verification instructions
based on whether the proposal has execution/verification steps.
Advisory proposals get a simpler prompt without RBAC/verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OutputSchema defines the structured output contract for analysis agents,
not a tool the sandbox pod uses. Move it to ProposalSpec where it belongs.
When present, it is injected as a required "components" property in each
analysis option. Change RemediationOption.Components from []JSON to *JSON
to accept any schema-defined shape (object or array).

- api: remove OutputSchema from ToolsSpec, add to ProposalSpec with
  immutability CEL rule
- api: change RemediationOption.Components from []apiextensionsv1.JSON
  to *apiextensionsv1.JSON for flexible schema-driven shapes
- api: remove redundant Components from AnalysisResultStatus (already
  per-option via RemediationOption)
- controller: outputSchemaForStep injects spec.outputSchema as
  "components" property in analysis schema when present
- controller: remove top-level Components from analysisResponse,
  AnalysisOutput, and result creation
- Regenerate deepcopy and CRDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the user approves execution and selects an option, the operator
trims the AnalysisResult to keep only that option. No SelectedOption
index field needed — selectedOption() always returns options[0].

- api: remove SelectedOption from AnalysisStepStatus
- controller: add trimNonSelectedOptions() that reads the selected
  index from ProposalApproval and patches AnalysisResult
- controller: extract getLatestAnalysisResult() shared helper
- controller: trimNonSelectedOptions returns the selected option
  directly, eliminating a double-fetch in handleExecution
- cli: remove Selected Option display line
- Consolidate 3 redundant individual trim tests into table-driven test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… output

Introduce AnalysisOutputMode enum (Default/Minimal) and AnalysisOutput
struct wrapping the mode and optional custom schema. Default mode includes
all built-in properties (diagnosis, proposal, rbac, verification). Minimal
mode strips built-in properties for analysis-only proposals that define
their own output shape via a custom schema.

CEL validation enforces: schema is required when mode is Minimal, and
Minimal mode is only allowed for analysis-only proposals (no execution
or verification steps). Diagnosis and proposal fields on RemediationOption
are now optional to support Minimal mode output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allow cluster admins to configure how many proposals the operator
reconciles concurrently via ApprovalPolicy.spec.maxConcurrentProposals.
Defaults to 5. The operator reads this at startup from the cluster
singleton ApprovalPolicy.

Also add CEL validation on RemediationOption enforcing that diagnosis
and proposal must be present together or both absent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
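controller-runtime exposes this knob as MaxConcurrentReconciles on the controller options; the effect is a bounded worker pool. A stdlib sketch of that bound (the default of 5 comes from the commit message; everything else is illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// reconcileAll runs one reconcile per proposal with at most maxConcurrent
// in flight, and returns the peak observed concurrency.
func reconcileAll(proposals []string, maxConcurrent int) int32 {
	if maxConcurrent <= 0 {
		maxConcurrent = 5 // operator default when ApprovalPolicy omits the field
	}
	sem := make(chan struct{}, maxConcurrent)
	var inFlight, peak int32
	var wg sync.WaitGroup
	for range proposals {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire a slot
			defer func() { <-sem }()
			n := atomic.AddInt32(&inFlight, 1)
			// record the high-water mark
			for {
				p := atomic.LoadInt32(&peak)
				if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
					break
				}
			}
			atomic.AddInt32(&inFlight, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	fmt.Println(reconcileAll(make([]string, 100), 5) <= 5)
}
```

Reading the value once at operator startup (as described) means changing maxConcurrentProposals requires a restart to take effect, a reasonable trade for a rarely tuned setting.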
@harche harche changed the title Add CRD types, reconciler skeleton, and API docs Add agentic controller: API types, reconciler, and CLI May 7, 2026
@harche harche marked this pull request as ready for review May 7, 2026 15:56
@openshift-ci bot removed the do-not-merge/work-in-progress label May 7, 2026
@everettraven left a comment


API surface matches what we've discussed and I have no more outstanding blockers to getting this merged and iterating as we learn more.

/lgtm

@openshift-ci bot added the lgtm label May 7, 2026
@harche commented May 7, 2026

API surface matches what we've discussed and I have no more outstanding blockers to getting this merged and iterating as we learn more.

/lgtm

Big thanks to @everettraven and everyone who weighed in here, the feedback really shaped this API for the better. Appreciate all the thoughtful reviews.
