
Add agentic controller: API types, reconciler, and CLI #1

Merged
harche merged 60 commits into openshift:main from harche:wt/api-pr
May 7, 2026

Conversation

Contributor

@harche harche commented Apr 22, 2026

Summary

Initial agentic controller for agentic.openshift.io/v1alpha1, to be imported into lightspeed-operator.

  • CRD types (Proposal, Agent, LLMProvider) and generated manifests
  • Proposal lifecycle reconciler with approval gates, dynamic RBAC, retry, and escalation
  • oc-agentic CLI plugin

Test plan

  • Unit tests pass (go test ./...)
  • Builds and CRD generation succeed

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LLMProviderType identifies the hosting backend for an LLM provider.
Member

nit: Maybe just Provider? We have agentic scoping from the agentic.openshift.io group, and I'm having trouble imagining other providers besides these AI backends. Also, the AI backends like the Anthropic API, etc., are composite harnesses, and not just raw LLM models, right? So even if we need a way to distinguish between other agentic.openshift.io providers, I dunno if LLM-ness is quite the distinction to use. But also, 🤷, it's just a name, and I'm happy to use whatever, if folks don't want to bikeshed at this level of pedantry.

Contributor Author

Yeah fair point, they are more than just LLMs. But I lean toward keeping LLMProvider because when someone reads kind: LLMProvider in a YAML file they immediately know what it is. kind: Provider on its own could be anything without checking the API group. The extra specificity pays for itself in readability especially when people are scanning through manifests. Happy to change it if you feel strongly though.

Member

nobody reads kind: Provider on its own though, apiVersion is right next door. For example:

$ oc get -o yaml clusterversion version | head -n2
apiVersion: config.openshift.io/v1
kind: ClusterVersion

I don't feel strongly though, I'm just explaining my impressions, and you're free to do whatever makes sense to you and whoever this repo's approvers will be, regardless of how closely it matches my understanding.


I'm inclined to just Provider, unless there are more types. E.g. RAGProvider/OKPProvider/XYZProvider (similar to llama-stack).


Provider is definitely more straightforward. If we are concerned about needing additional "provider" types in the future, maybe making this more generic to something like InferenceProvider makes more sense here?

Taking inspiration from llama-stack they basically have inference, vector store, safety, and tool runtime providers and "inference" sounds the most applicable here.

Contributor Author

Good suggestion — llama-stack's InferenceProvider makes sense for their scope since they're a generic inference framework covering LLMs, embeddings, safety, and vector stores.

For us, LLMProvider is intentionally specific. This resource configures which LLM backend (Anthropic, OpenAI, Vertex, Bedrock) an agent talks to — model name, credentials, endpoint. It's not a generic inference abstraction, and we don't plan to extend it to cover embeddings, safety filters, or vector stores.

InferenceProvider also risks confusion with GPU-as-a-service inference platforms (CoreWeave, Lambda, OpenRouter, RunPod, etc.) — which are a different layer entirely. LLMProvider is unambiguous about what it configures.

Comment thread api/v1alpha1/agent_types.go Outdated
// runtime (e.g., forwarded from a user session).
//
// +enum
type MCPHeaderSourceType string
Member

Naming nit, are MCPs the only kind of tools that the agent needs to auth for? And this header is just about auth, right? Not the other things HTTP headers could be used for (content negotiation, cache management, etc.)? Maybe ToolAuthenticationSourceType?

Contributor Author

The naming here follows the existing MCPHeaderSourceType from the lightspeed-operator API so we stay consistent across both operators. Also these headers aren't strictly about auth, they could be used for routing or tenant identification too, so ToolAuthenticationSourceType would be narrowing it. Since the type is already scoped under MCPServerConfig the MCP prefix gives enough context without being too specific about the purpose.

Member

wait this is already merged lightspeed API? Can we just use the existing types then, instead of creating copies here?


Better to start fresh and see what's actually needed rather than inheriting current OLS types that may set the wrong base.

Comment thread api/v1alpha1/agent_types.go Outdated
//
// When omitted, the entire image is mounted as a single volume.
// +optional
Paths []string `json:"paths,omitempty"`
Member

There's translation risk if we deviate from the target APIs we expect to use to deliver this functionality. With this knob expecting to feed some VolumeMount knobs, maybe we should just use:

// Mount configures how the image will be mounted into the agent's sandbox Pod.
Mount *corev1.VolumeMount `json:"mount,omitempty"`

to be transparent about being a pass-through consumer, without having to have our own opinions about how volume-mounting works? And then we're on the hook to vendor-bump k8s.io/api with each release (tracking the oldest version of Kube that we claim compatibility with for a given Lightspeed operator branch?), but we aren't on the hook to keep a closer eye on evolutions like new approaches to declaring subpaths or whatever it is upstream decides to add to VolumeMount next.

Contributor Author

Yeah I considered this. The thing is, paths isn't really a thin wrapper over VolumeMount. The operator does something fairly opinionated with it. It takes a single image volume and creates multiple subPath mounts, using path.Base() to derive the mount name and placing everything under /app/skills/. So /skills/prometheus becomes a subPath mount at /app/skills/prometheus, and /skills/cluster-update/update-advisor becomes /app/skills/update-advisor.

If we exposed a raw VolumeMount here, the user would need to manually specify mountPath, subPath, and name for each skill, and get them all consistent with the conventions the operator expects. That's a lot of ceremony for what should be "give me these skills from this image."

I think paths is the right abstraction for the common case. If we run into situations where someone needs full VolumeMount control we could always add an escape hatch later. Wdyt?

Member

you can also mask the properties you don't want to expose. E.g. you have opinions on Name, so don't expose that. You have opinions on ReadOnly, so don't expose that. You're left with SubPath and SubPathExpr, so expose those? That saves you the work of having to translate from paths to multiple VolumeMounts, while still giving the spec-author some properties/Godocs they're familiar with from their Pod work.

Contributor Author

I see the appeal of familiar field names but I think it would actually be more confusing. If a user sees subPath they'll expect they can also set mountPath, readOnly, etc. and wonder why those are missing. And SubPathExpr uses downward API env var substitution which doesn't really apply to skills mounting.

paths is doing more than subPath under the hood. For example, this spec:

skills:
  - image: registry.ci.openshift.org/ocp/5.0:agentic-skills
    paths:
      - /skills/prometheus
      - /skills/cluster-update/update-advisor

produces two VolumeMounts from a single image volume:

volumeMounts:
  - name: skills
    mountPath: /app/skills/prometheus
    subPath: skills/prometheus
    readOnly: true
  - name: skills
    mountPath: /app/skills/update-advisor
    subPath: skills/cluster-update/update-advisor
    readOnly: true

The operator derives the mount name from path.Base(), hardcodes readOnly, and places everything under /app/skills/. Exposing subPath directly would let users think they control all of this when they don't.
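
As a sketch of that derivation (the local structs and the /app/skills/ prefix are taken from this thread; the actual operator code may differ):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// volumeMount mirrors the few corev1.VolumeMount fields the operator
// sets; a local struct keeps this sketch self-contained.
type volumeMount struct {
	Name      string
	MountPath string
	SubPath   string
	ReadOnly  bool
}

// mountsForSkills derives one subPath mount per requested skill path:
// the mount name is fixed, the mount path is /app/skills/<path.Base>,
// and readOnly is hardcoded, as described above.
func mountsForSkills(paths []string) []volumeMount {
	mounts := make([]volumeMount, 0, len(paths))
	for _, p := range paths {
		mounts = append(mounts, volumeMount{
			Name:      "skills",
			MountPath: "/app/skills/" + path.Base(p),
			SubPath:   strings.TrimPrefix(p, "/"),
			ReadOnly:  true,
		})
	}
	return mounts
}

func main() {
	for _, m := range mountsForSkills([]string{
		"/skills/prometheus",
		"/skills/cluster-update/update-advisor",
	}) {
		fmt.Printf("%s subPath=%s\n", m.MountPath, m.SubPath)
	}
}
```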

Member

But you can generate that same volumeMounts context with just subPath exposed, like:

skills:
- image: registry.ci.openshift.org/ocp/5.0:agentic-skills
  subPath: /skills/prometheus
- image: registry.ci.openshift.org/ocp/5.0:agentic-skills
  subPath: /skills/cluster-update

right? Although for agentic-skills specifically, I'd expect any flattening needed to happen at image-build time, so you can just do:

skills:
- image: registry.ci.openshift.org/ocp/5.0:agentic-skills
  subPath: /skills

and not feel like you have to track and handle each skill you want mounted from that image (unless for some reason you want that control to exclude skills you don't want the agent to access).


I'm not strongly opinionated here, but if the goal is to have a simpler API, the existing shape seems generally OK to me.

Comment thread api/v1alpha1/agent_types.go Outdated
Comment thread api/v1alpha1/llmprovider_types.go Outdated
Comment thread api/v1alpha1/workflow_types.go Outdated
Comment thread api/v1alpha1/workflow_types.go Outdated
Comment thread api/v1alpha1/workflow_types.go Outdated
Comment thread api/v1alpha1/proposal_analysis_types.go Outdated
)

// DiagnosisResult contains the root cause analysis from the analysis agent.
// This is populated by the agent during the Analyzing phase and stored in
Member

nit: "phase" -> "step", here and elsewhere, to align with WorkflowSpec's language like "analysis references an Agent for the analysis step". I don't think you're trying to make a phase-vs-step distinction, and if you aren't, sticking with one word will help keep folks from wondering about that possibility.

Contributor Author

Agree, it should be 'step'. I will change it everywhere. Thanks.

Comment thread api/v1alpha1/proposal_analysis_types.go Outdated
Comment thread api/v1alpha1/proposal_analysis_types.go Outdated
// +kubebuilder:validation:MaxLength=253
Name string `json:"name"`
// command is the command or API call to run for this check
// (e.g., "oc get pod -n production -l app=web -o jsonpath='{.items[0].status.phase}'").
Member

This requires oc in the agent-sandbox Pod, but we've dropped that requirement. Do we need to restore it as part of the "what you can expect the agent to do?" contract between the agent container executing the step and the AI engine generating the advice?

Member

@wking wking Apr 22, 2026

As one possibility, OpenShift docs list prereqs to help readers decide if they will be able to follow the process we're about to talk about. For example:

Prerequisites

You installed the OpenShift CLI (oc) that matches the version for your updated version.
You are logged in to the cluster as user with cluster-admin privileges.
You have paused all MachineHealthCheck resources.

Those aren't dynamically negotiated though. We could have a system that had some dynamic negotiation (e.g. "oh, no oc? How about kubectl?"). Or we could hard-code the requirements for the Agent image as non-negotiable, and then the operator here could hard-code that context ("step commands should be Bash scripts with access to POSIX utilities as well as oc (the OpenShift client) and..."). Or there could be an endpoint served by the image (/context?) where the image tries to describe itself ("I am a POSIX system with the following additional commands..."), and the operator could slurp that up and pass it along to the AI engine when asking for command steps. Or some such...

Contributor Author

The operator communicates with agents via HTTP POST, but the agent image itself ships with CLI tools (oc, promtool, rg, standard unix utilities) that agents use to interact with the cluster inside their sandbox. The command field here is what the analysis agent recommends the verification agent run using those tools, not a direct operator-to-agent call. So oc in the example is correct.

Contributor Author

We could have a system that had some dynamic negotiation (e.g. "oh, no oc? How about kubectl?"). Or we could hard-code the requirements for Agent image as non-negotiable...

The system prompt already handles this today. It tells the LLM what tools are available in the sandbox (oc, promtool, rg, standard POSIX utils). The system prompt is referenced via systemPromptRef on the Agent CR, so different agent images can describe different toolsets.

// before the formal verification step. This gives early signal on whether
// the remediation worked. In trust-mode workflows (verification skipped),
// this is the only verification that occurs.
type ExecutionVerification struct {
Member

I'm not clear of the value add of this type, vs. recycling VerifyCheck for this use-case.

Contributor Author

They serve different purposes. ExecutionVerification is a quick "did things improve?" signal from the execution agent, just a bool and a summary. VerifyCheck is a structured per-check result from the dedicated verification step with name, source, value, and pass/fail. Collapsing them would force the execution agent to produce structured check results for what's meant to be a lightweight sanity check.
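
To make the contrast concrete, a hedged sketch of the two shapes as this thread describes them (field names here are assumptions, not the repo's actual definitions):

```go
package main

import "fmt"

// ExecutionVerification is the lightweight "did things improve?" signal
// reported by the execution agent itself: just a bool and a summary.
type ExecutionVerification struct {
	Improved bool
	Summary  string
}

// VerifyCheck is a structured per-check result from the dedicated
// verification step: a name, source, observed value, and pass/fail.
type VerifyCheck struct {
	Name   string
	Source string
	Value  string
	Passed bool
}

func main() {
	quick := ExecutionVerification{Improved: true, Summary: "error rate dropped after remediation"}
	check := VerifyCheck{Name: "pod-phase", Source: "oc", Value: "Running", Passed: true}
	fmt.Println(quick.Summary)
	fmt.Println(check.Name, check.Passed)
}
```

Collapsing the first into the second would force the execution agent to emit the full structured shape for what is meant to be a lightweight sanity check.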

Comment thread api/v1alpha1/proposal_status_types.go Outdated
Sandbox SandboxInfo `json:"sandbox,omitempty"`
// success indicates whether execution completed successfully.
// +optional
Success *bool `json:"success,omitempty"`
Member

success: false leaves a lot of room for ambiguous stress. Can we drop this, and instead declare some well-known types for Conditions where we talk about success with room for a confidence read and human-facing description string? As an example of that, see the Applies condition for conditional update risks.

Contributor Author

Agree, dropped success from both ExecutionStepStatus and VerificationStepStatus. The operator reads success from the agent's response and sets step conditions with reason and message instead. Conditions already carry the same information with more context.


Looks like this is still here?

Contributor Author

The Success *bool field that was discussed in the original thread was removed from ExecutionStepStatus and VerificationStepStatus as stated.

However, StepResultRef and the four result CRDs (AnalysisResult, ExecutionResult, VerificationResult, EscalationResult) still had a success bool field. Replaced all of them with outcome: ActionOutcome (enum: Succeeded / Failed) per the boolean convention. Updated the controller, CLI, and console UI accordingly.
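
A minimal sketch of the bool-to-enum change (the Succeeded/Failed value set comes from the comment above; the constant names are assumptions):

```go
package main

import "fmt"

// ActionOutcome replaces a bare `success bool` with a named enum, which
// reads unambiguously in YAML and leaves room for additional terminal
// states without another schema change.
type ActionOutcome string

const (
	OutcomeSucceeded ActionOutcome = "Succeeded"
	OutcomeFailed    ActionOutcome = "Failed"
)

func main() {
	fmt.Println(OutcomeSucceeded, OutcomeFailed)
}
```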

Comment thread api/v1alpha1/agent_types.go Outdated
Comment thread api/v1alpha1/proposal_types.go Outdated
// +patchStrategy=merge
// +patchMergeKey=type
// +optional
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`

Should we also add observedGeneration?

Contributor Author

Agree, added observedGeneration to ProposalStatus.

Comment thread api/v1alpha1/agent_types.go Outdated
// execution, or verification). When omitted, the agent uses a default
// prompt appropriate for its workflow step.
// +optional
SystemPromptRef *corev1.LocalObjectReference `json:"systemPromptRef,omitempty"`

Agent type is clusterscoped. Would be better to also have namespace in the reference?

Contributor Author

@harche harche Apr 23, 2026

You're right that namespace is needed. I'm actually reconsidering making all four CRDs (LLMProvider, Agent, Workflow, Proposal) namespace-scoped instead of cluster-scoped. Namespace scoping gives multi-tenancy and RBAC isolation naturally, simplifies references (everything is LocalObjectReference within the same namespace), and makes it easier to scope execution RBAC to the namespace boundary. Cluster-wide use cases still work by putting everything in a shared namespace. Thoughts? cc @wking @mrunalp

Contributor Author

I am updating the code to make those namespace-scoped; I can update the existing operator code to verify everything still works fine.

Member

We'd need cluster-scoped agents and namespace-scoped agents that just look at application issues within those bounds.


@mrunalp If we make agents namespace-scoped, an Agent that needs scope beyond the namespace (cluster-wide) will have to end up in an arbitrary/default namespace - which I think is not a good situation.

Contributor Author

The Agent CR is configuration only, it defines which LLM, skills, and prompt to use. It doesn't control the agent's runtime permissions. Those come from the RBAC rules the analysis step recommends, which admins review and approve as part of the proposal. Only after approval does the operator provision the Role/ClusterRole bindings on the sandbox ServiceAccount. The sandbox pod itself always runs in the operator's namespace (openshift-lightspeed), not the Agent's namespace, so there's no exec-based escalation path. An Agent in openshift-lightspeed namespace can have full cluster-wide access at runtime if admins approve the requested RBAC.

Contributor

See comment above on modes. I think we need mode-level configuration


If the execution of the agents will only ever happen in the openshift-lightspeed namespace, just specify that this configmap must exist in the openshift-lightspeed namespace?

Contributor Author

This field has been removed — the Agent CR no longer has systemPromptRef or any ConfigMap reference. The Agent is now a cluster-scoped tier (LLM provider, timeouts, maxTurns, providerSettings). Skills, system prompts, and MCP servers are configured per-proposal via ToolsSpec.

// set by the operator -- users should not modify status fields directly.
// The status provides complete observability into the proposal's progress,
// including per-step results, retry history, and standard Kubernetes conditions.
type ProposalStatus struct {

nit: What about adding something like Phase here reflecting the lifecycle https://github.com/openshift/lightspeed-agentic-operator/pull/1/changes#diff-608f31719fd83cc2e66fb400f835b53884e9fb407d0c826397d1c0fe82fc79a8R44? It could be then used in printerColumns.

Contributor Author

Good idea. Added a phase field to ProposalStatus with a printerColumn. The operator derives it from conditions on every reconcile — conditions stay the source of truth, phase is for human consumption (oc get, console list views).
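
As a sketch of that conditions-as-source-of-truth approach (the condition types and phase names here are illustrative assumptions, not the repo's actual values):

```go
package main

import "fmt"

// condition is a stripped-down stand-in for metav1.Condition, keeping
// the sketch free of k8s.io dependencies.
type condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// derivePhase computes a display-only phase from conditions on every
// reconcile; conditions remain the source of truth, and phase exists
// only for oc get / console list views.
func derivePhase(conds []condition) string {
	get := func(t string) string {
		for _, c := range conds {
			if c.Type == t {
				return c.Status
			}
		}
		return "Unknown"
	}
	switch {
	case get("Completed") == "True":
		return "Completed"
	case get("Failed") == "True":
		return "Failed"
	case get("Executing") == "True":
		return "Executing"
	default:
		return "Pending"
	}
}

func main() {
	fmt.Println(derivePhase([]condition{{Type: "Executing", Status: "True"}}))
}
```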

Contributor

We seem to be mixing two concerns - phase, which is a workflow/FSM state defined by the Workflow, and proposal status - what is happening with the proposal.

Comment thread api/v1alpha1/agent_types.go Outdated
// +kubebuilder:validation:XValidation:rule="self.type != 'Secret' ? !has(self.secretRef) : true",message="secretRef must not be set when type is 'Kubernetes' or 'Client'"
type MCPHeaderValueSource struct {
// type specifies the source type for the header value.
// +kubebuilder:validation:Required

Suggested change
// +kubebuilder:validation:Required
// +required

This is present everywhere so just calling it out here.

Talking with @JoelSpeed and @camilamacedo86, it seems that there are plans to deprecate this.

kubernetes-sigs/controller-tools#1241

Contributor Author

Agree, switched all +kubebuilder:validation:Required markers to +required.

Comment thread api/v1alpha1/agent_types.go Outdated
// MCPHeader defines an HTTP header to send with every request to an
// MCP server. Used for authentication and routing.
type MCPHeader struct {
// name of the header (e.g., "Authorization", "X-API-Key").

One feedback from API review is to add all requirements on these fields to the GoDoc.

When you do oc explain you want to see what requirements there are on the fields. The markers do not show up as documentation.

Contributor Author

Agree, added all validation requirements (min/max length, enum values, patterns, item limits) to the GoDoc so they show up in oc explain.

Comment thread api/v1alpha1/agent_types.go Outdated
// headers to send to the MCP server.
// +optional
// +listType=map
// +listMapKey=name

Linter should say that you need ssatags here.

Contributor Author

Ran kube-api-linter against the API types — 0 issues. All list fields already have +listType markers (atomic or map with appropriate keys). Let me know if you're seeing something specific I'm missing.


Contributor Author

yeah, my bad, I didn't run it correctly, I ran into those failures too, but fixed now. Thanks.

… skills

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
harche and others added 5 commits May 4, 2026 09:58
- Add EscalationOutputSchema (success/summary/content) and register in
  defaultOutputSchemas map so the escalation agent receives structured
  output constraints
- Fix duplicate EscalationResult: add ConditionUnknown/InProgress guard
  to handleEscalation, matching the pattern used by analysis/execution
  handlers. Without this, a re-queued reconcile during agent execution
  would spin up a second sandbox
- Add TestEscalation_InProgressIsIdempotent to verify re-reconcile
  during escalation is a no-op

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace explicit StartTime/CompletionTime fields on step statuses and
Outcome/StartTime/CompletionTime on result CRs with standard K8s
conditions. Result CRs now have a status subresource with Started and
Completed conditions whose lastTransitionTime provides timestamps and
whose reason encodes the outcome (Succeeded/Failed).

Also adds defaultOption to ApprovalPolicyStage with CEL validation
requiring it when execution approval is Automatic, preventing
auto-approved execution from running with an empty remediation plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor LLMProvider discriminated union, enhance ToolsSpec with
secret mount types, update CLI for revised API types, and adjust
sandbox template generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace string-based mountAs examples and API reference with the
discriminated union SecretMountSpec (type: EnvVar/FilePath).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SandboxClaim's own Ready condition provides pod lifecycle timestamps.
Step execution timing is tracked by step conditions. These fields were
redundant (CompletionTime was never set).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@everettraven everettraven left a comment

Only need to address comments labeled Blocking: and then this should be good to go.

Please ping me once these are addressed and I'll re-review.

Comment thread api/v1alpha1/llmprovider_types.go Outdated
Comment on lines +176 to +181
// Must contain only lowercase letters, digits, and hyphens
// (e.g., "us-east-1", "eu-west-2", "ap-southeast-1").
// +required
// +kubebuilder:validation:MinLength=2
// +kubebuilder:validation:MaxLength=63
// +kubebuilder:validation:XValidation:rule="self.matches('^[a-z][a-z0-9-]*[a-z0-9]$')",message="region must contain only lowercase letters, digits, and hyphens, start with a letter, and not end with a hyphen"

Non-blocking: Explicitly document that values must begin with lowercase letters and must end with lowercase alphanumeric characters.

Contributor Author

Done. Updated region field comments on both Vertex and Bedrock configs to explicitly state: must begin with a lowercase letter and end with a lowercase alphanumeric character.

Comment thread api/v1alpha1/llmprovider_types.go Outdated
// +optional
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=2048
// +kubebuilder:validation:XValidation:rule="isURL(self) && url(self).getScheme() in ['http', 'https'] && url(self).getHostname() != '' && !self.contains('@') && !self.contains('#')",message="url must be a valid HTTP or HTTPS URL with a hostname; fragments and userinfo are not allowed"

Non-blocking: We normally suggest that CEL validations are broken into validations that are as tightly scopes as possible.

For example, I'd probably break this into 4 validations:

  1. HTTP/HTTPS scheme check
  2. Hostname check
  3. userinfo check
  4. fragment check

Contributor Author

Done. Split the monolithic URL CEL validation into 4 separate rules across all URL/endpoint fields: scheme check, hostname check, userinfo check, fragment check. Each now has its own error message.
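
For illustration, a hypothetical split along those four axes (the marker text and messages are illustrative, not the repo's actual rules):

```go
// +kubebuilder:validation:XValidation:rule="isURL(self) && url(self).getScheme() in ['http', 'https']",message="url must use the http or https scheme"
// +kubebuilder:validation:XValidation:rule="isURL(self) && url(self).getHostname() != ''",message="url must include a hostname"
// +kubebuilder:validation:XValidation:rule="!self.contains('@')",message="url must not contain userinfo"
// +kubebuilder:validation:XValidation:rule="!self.contains('#')",message="url must not contain a fragment"
URL string `json:"url,omitempty"`
```

Each rule failing independently means the user sees exactly which constraint was violated.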

Comment on lines +189 to +190
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=2048

Non-blocking: Always explicitly document minimum and maximum length constraints.

Contributor Author

Done. Added "Maximum 2048 characters." to all URL/endpoint field comments across all provider configs.

// When omitted, the SDK default is used.
// +optional
// +kubebuilder:validation:MaxLength=32
// +kubebuilder:validation:XValidation:rule="self.matches('^[0-9]{4}-[0-9]{2}-[0-9]{2}(-preview)?$')",message="apiVersion must be a date in YYYY-MM-DD format with an optional -preview suffix"

Non-blocking: Is date-based the only versioning scheme they have? It looks like they have a v1 API, based on https://learn.microsoft.com/en-us/azure/foundry/openai/latest

Comment on lines +73 to +74
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=500

Non-blocking: Explicitly document these bounds in the field comment.

Contributor Author

Addressed — documented min/max bounds in the maxTurns field comment (same as #39 above).

Comment thread api/v1alpha1/proposalapproval_types.go Outdated
Comment on lines +89 to +92
// denied indicates the user has denied this step, making the entire
// proposal terminal. Once set to true, it cannot be unset.
// +optional
Denied bool `json:"denied,omitempty"`

Blocking: Do not use a boolean and instead use an enum. Maybe something like:

state: Approved | Denied

?

Feel free to use a different field name.

Non-blocking: If we make this CR represent approval+denial, maybe ProposalApproval is a bit of a misleading name?

Contributor Author

Done. Replaced denied bool with decision ApprovalDecision enum (Approved | Denied). Updated CEL rule to prevent changing decision once set to Denied. Updated controller, CLI, and console.
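The bool-to-enum change can be sketched as follows. Type and constant names are assumptions based on the reply; the real enforcement of "cannot change once Denied" is the CEL rule on the CRD:

```go
package main

import "fmt"

// ApprovalDecision replaces the old `denied bool`. An enum reads
// unambiguously in YAML and leaves room for future states.
// +kubebuilder:validation:Enum=Approved;Denied
type ApprovalDecision string

const (
	DecisionApproved ApprovalDecision = "Approved"
	DecisionDenied   ApprovalDecision = "Denied"
)

// terminal reports whether a decision ends the proposal: denying any
// stage makes the entire proposal terminal.
func terminal(d ApprovalDecision) bool { return d == DecisionDenied }

func main() {
	fmt.Println(terminal(DecisionDenied), terminal(DecisionApproved))
}
```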

// +optional
metav1.ObjectMeta `json:"metadata,omitempty"`

// +optional

Blocking: Make spec required.

Contributor Author

Done. Made spec required on ProposalApproval.

Contributor Author

Update: during cluster testing, found that making spec required conflicts with the operator's lifecycle. The operator creates ProposalApproval with an empty spec when all stages are Manual (the default) — users then add approval stages one at a time. Reverting to +optional with omitzero so the initial empty state is valid. The MinItems=1 on stages still prevents stages: [] — omitting stages entirely is the valid empty state.

Comment thread api/v1alpha1/proposalapproval_types.go Outdated
Comment on lines +122 to +127
// +kubebuilder:validation:XValidation:rule="self.stages.all(s1, self.stages.all(s2, s1 == s2 || s1.type != s2.type))",message="stage types must be unique"
type ProposalApprovalSpec struct {
// stages lists the approved (or denied) workflow steps. Each entry is
// a discriminated union keyed by type.
// +optional
// +listType=atomic

Blocking: For uniqueness use +listType=map and +listMapKey=type

Contributor Author

Done. Replaced +listType=atomic + CEL uniqueness with +listType=map +listMapKey=type on stages.
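A sketch of the resulting shape (stand-in types; only the markers come from the thread). With `+listType=map` keyed on `type`, the API server itself rejects duplicate entries and merges per key under server-side apply, so the CEL uniqueness scan goes away:

```go
package main

import "fmt"

// ApprovalStage is a stand-in for the real discriminated-union entry;
// type is the discriminator and the map key for the list.
type ApprovalStage struct {
	Type string `json:"type"`
}

// ProposalApprovalSpec sketches the marker change on stages.
type ProposalApprovalSpec struct {
	// +optional
	// +listType=map
	// +listMapKey=type
	Stages []ApprovalStage `json:"stages,omitempty"`
}

func main() {
	s := ProposalApprovalSpec{Stages: []ApprovalStage{{Type: "Analysis"}, {Type: "Execution"}}}
	fmt.Println(len(s.Stages))
}
```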

// (spec.tools) or per-step (spec.analysis.tools, spec.execution.tools,
// spec.verification.tools). Per-step tools replace the shared default
// for that step.
type ToolsSpec struct {

Blocking: Set MinProperties=1 on this type.

Contributor Author

Done. Added MinProperties=1 to ToolsSpec (addressed as part of the tools pointer-to-value change).

Comment on lines +42 to +82
// proposalName is the name of the parent Proposal in the same namespace.
// +required
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=253
ProposalName string `json:"proposalName"`

// attempt is the 1-based overall attempt number.
// +required
// +kubebuilder:validation:Minimum=1
Attempt int32 `json:"attempt"`

// retryIndex is the 0-based retry index within the current analysis.
// +required
// +kubebuilder:validation:Minimum=0
RetryIndex int32 `json:"retryIndex"`

// checks contains individual verification check results.
// +optional
// +listType=atomic
// +kubebuilder:validation:MaxItems=50
Checks []VerifyCheck `json:"checks,omitempty"`

// summary is a Markdown-formatted verification summary.
// +optional
// +kubebuilder:validation:MaxLength=32768
Summary string `json:"summary,omitempty"`

// components contains optional adapter-defined structured data.
// +optional
// +listType=atomic
// +kubebuilder:validation:MaxItems=20
Components []apiextensionsv1.JSON `json:"components,omitempty"`

// sandbox tracks the sandbox pod used for this verification.
// +optional
Sandbox SandboxInfo `json:"sandbox,omitzero"`

// failureReason is populated when the step failed due to a system error.
// +optional
// +kubebuilder:validation:MaxLength=8192
FailureReason string `json:"failureReason,omitempty"`

Blocking: Move these result fields to status.

Contributor Author

Done. Moved checks, summary, sandbox, failureReason under .status on VerificationResult.

harche and others added 15 commits May 6, 2026 08:32
…1, #13-17, #19-22, #29, #32)

- openshift#1: Change status.attempts from *int32 to int32 with min=1/max=int32 validation
- openshift#2: Move maxAttempts from ProposalSpec to ApprovalPolicy (admin ceiling) and
  ExecutionApproval (user choice); fix semantics so N = N total attempts
- openshift#3: Remove spec.revision, use metadata.generation for revision signaling
- openshift#4: Make ProposalStep.tools non-pointer with omitzero, add MinProperties=1
- #8-11: Make outcome/result/conditionOutcome required; SandboxInfo fields required
- #13: Add CEL singleton rule (metadata.name == "cluster") on ApprovalPolicy
- #14: Use listType=map+listMapKey=name on ApprovalPolicy stages
- #15: Remove defaultOption from ApprovalPolicy (operator defaults to option 0)
- #16-17: Add omitempty to name/approval, document allowed values
- #19: Replace denied bool with decision enum (Approved|Denied) + CEL immutability
- #20: Make ProposalApproval spec required
- #21: Use listType=map+listMapKey=type on ProposalApproval stages
- #22: Make estimatedImpact required on RemediationOption
- #29: Remove components from ExecutionResult and VerificationResult (copy-paste error)
- #32: Add MinProperties=1 to ToolsSpec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…#30-31)

Each Result CRD (AnalysisResult, ExecutionResult, VerificationResult,
EscalationResult) now has its own status type containing result data +
conditions. Identity fields (proposalName, attempt, retryIndex) remain
top-level. Result data (options, actionsTaken, checks, summary, content,
components, sandbox, failureReason) moved under .status since results
communicate outcomes, not desired state.

Also removes components from ExecutionResult and VerificationResult
(copy-paste error — not in their output schemas).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…leanup

- Change ProposalStep fields from pointer to value with omitzero + MinProperties=1
- Add RFC1123 DNS subdomain CEL validation on agent name fields
- Add MinProperties=1 to step status types, RBACResult, AgentTimeouts
- Add MinItems=1 on ApprovalPolicy stages
- Add CEL rule requiring rollbackPlan when reversible is Reversible/Partial
- Fix rollbackPlan location in analysis schema (was under verification)
- Remove MinLength=1 on apiGroups items to allow "" (core group)
- Split monolithic URL CEL into 4 separate rules with specific messages
- Document field constraints: region, model, maxTurns, timeouts, URL length

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every field in the analysis, execution, verification, and escalation
output schemas now has a description guiding the agent on what to
produce. These descriptions are the agent's primary instruction for
structured output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add MinLength=1 on optional strings (FailureReason, Summary, etc.)
- Add MinItems=1 on optional arrays (Conditions, Results, etc.)
- Add MinProperties=1 on status/spec types
- Fix godoc comments to start with field name (metadata, spec, status)
- Add omitempty to required fields per linter convention
- Change Result CR status fields from omitempty to omitzero
- Change ApprovalStage approval fields from pointer to value with omitzero
- Move Conditions to first field in ApprovalStageStatus
- Add MaxLength=253 to ApprovalStageStatus.Name
- Add Maximum=10 to RetryIndex, Minimum=1 to ObservedGeneration
- Document API linting in CLAUDE.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pec, examples

- Add generation to console Proposal metadata type (TS compile error)
- Revert ProposalApproval spec from +required to +optional — empty spec
  is the valid initial state when all stages are Manual
- Fix example YAMLs: mountAs string→struct, execution: {}→agent: default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix createIdempotent to properly write status via status subresource
  (DeepCopy before Create, then Status().Patch with full status)
- Add tests for createIdempotent with WithStatusSubresource
- Make VerificationStep.expected optional (agent doesn't always know)
- Default agent field to "default" on approval types (satisfies linter)
- Make ProposalApprovalSpec.stages required with default=[], remove MinItems
  conflict, keep MinProperties=1 on type
- Change RetryIndex to *int32 (valid zero value needs pointer)
- Update verification schema: source description asks for full command
- Remove reflect import, no linter exclusions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each Result CRD (AnalysisResult, ExecutionResult, VerificationResult,
EscalationResult) now has a Spec type containing identity fields
(proposalName, attempt, retryIndex). Spec is immutable via CEL rule
self == oldSelf. Print columns updated to .spec.* paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
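The `self == oldSelf` rule can be mirrored as an admission-style check in plain Go; a sketch under the assumption that spec carries only the identity fields named in the commit:

```go
package main

import (
	"fmt"
	"reflect"
)

// AnalysisResultSpec holds the identity fields the commit makes immutable.
type AnalysisResultSpec struct {
	ProposalName string
	RetryIndex   int32
}

// validateUpdate mirrors the CEL rule `self == oldSelf`: any change to
// spec after creation is rejected.
func validateUpdate(oldSpec, newSpec AnalysisResultSpec) error {
	if !reflect.DeepEqual(oldSpec, newSpec) {
		return fmt.Errorf("spec is immutable")
	}
	return nil
}

func main() {
	old := AnalysisResultSpec{ProposalName: "p", RetryIndex: 0}
	fmt.Println(validateUpdate(old, old))
	fmt.Println(validateUpdate(old, AnalysisResultSpec{ProposalName: "p", RetryIndex: 1}))
}
```

Doing this in CEL rather than a webhook keeps the guarantee inside the API server, with no extra component in the request path.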
…ghten RetryIndex

- Remove Attempts from ProposalStatus (always 1, never incremented)
- Remove Attempt from all Result CRD specs (always 1)
- Move ObservedGeneration from AnalysisStepStatus to conditions
  (standard K8s pattern — every condition now carries ObservedGeneration)
- Set ObservedGeneration on all 18 condition calls
- Tighten RetryIndex Maximum from 10 to 2 (matches maxAttempts max of 3)
- Update CEL messages: decisions/maxAttempts immutability
- Clarify that denying any stage terminates the entire proposal
- Update verification schema source description for full commands
- Make VerificationStep.expected optional
- Default agent to "default" on approval types (satisfies linter)
- Clean up related tests, CLI, templates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…steps

- Make rbac required in analysis schema only when proposal has execution
- Make verification required only when proposal has verification step
- Advisory proposals get a minimal schema (title, diagnosis, proposal)
- Keep MinProperties=1 on RBACResult (when present, must have rules)
- Make VerificationStep.Command and Expected optional
- Make RollbackPlan.Command optional
- Remove rollbackPlan CEL conditional requirement
- Add tests for outputSchemaForStep across all proposal variants

Tested on cluster:
- Full (exec + verify): passed
- Trust mode (exec, no verify): passed
- Advisory with target: passed
- Trivial advisory: passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Template conditionally includes RBAC and verification instructions
based on whether the proposal has execution/verification steps.
Advisory proposals get a simpler prompt without RBAC/verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OutputSchema defines the structured output contract for analysis agents,
not a tool the sandbox pod uses. Move it to ProposalSpec where it belongs.
When present, it is injected as a required "components" property in each
analysis option. Change RemediationOption.Components from []JSON to *JSON
to accept any schema-defined shape (object or array).

- api: remove OutputSchema from ToolsSpec, add to ProposalSpec with
  immutability CEL rule
- api: change RemediationOption.Components from []apiextensionsv1.JSON
  to *apiextensionsv1.JSON for flexible schema-driven shapes
- api: remove redundant Components from AnalysisResultStatus (already
  per-option via RemediationOption)
- controller: outputSchemaForStep injects spec.outputSchema as
  "components" property in analysis schema when present
- controller: remove top-level Components from analysisResponse,
  AnalysisOutput, and result creation
- Regenerate deepcopy and CRDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the user approves execution and selects an option, the operator
trims the AnalysisResult to keep only that option. No SelectedOption
index field needed — selectedOption() always returns options[0].

- api: remove SelectedOption from AnalysisStepStatus
- controller: add trimNonSelectedOptions() that reads the selected
  index from ProposalApproval and patches AnalysisResult
- controller: extract getLatestAnalysisResult() shared helper
- controller: trimNonSelectedOptions returns the selected option
  directly, eliminating a double-fetch in handleExecution
- cli: remove Selected Option display line
- Consolidate 3 redundant individual trim tests into table-driven test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… output

Introduce AnalysisOutputMode enum (Default/Minimal) and AnalysisOutput
struct wrapping the mode and optional custom schema. Default mode includes
all built-in properties (diagnosis, proposal, rbac, verification). Minimal
mode strips built-in properties for analysis-only proposals that define
their own output shape via a custom schema.

CEL validation enforces: schema is required when mode is Minimal, and
Minimal mode is only allowed for analysis-only proposals (no execution
or verification steps). Diagnosis and proposal fields on RemediationOption
are now optional to support Minimal mode output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allow cluster admins to configure how many proposals the operator
reconciles concurrently via ApprovalPolicy.spec.maxConcurrentProposals.
Defaults to 5. The operator reads this at startup from the cluster
singleton ApprovalPolicy.

Also add CEL validation on RemediationOption enforcing that diagnosis
and proposal must be present together or both absent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
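controller-runtime exposes this knob as MaxConcurrentReconciles on the controller options; the effect is a bounded worker pool. A stdlib sketch of that bound (the default of 5 comes from the commit message; everything else is illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// reconcileAll runs one reconcile per proposal with at most maxConcurrent
// in flight, and returns the peak observed concurrency.
func reconcileAll(proposals []string, maxConcurrent int) int32 {
	if maxConcurrent <= 0 {
		maxConcurrent = 5 // operator default when ApprovalPolicy omits the field
	}
	sem := make(chan struct{}, maxConcurrent)
	var inFlight, peak int32
	var wg sync.WaitGroup
	for range proposals {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire a slot
			defer func() { <-sem }()
			n := atomic.AddInt32(&inFlight, 1)
			// record the high-water mark
			for {
				p := atomic.LoadInt32(&peak)
				if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
					break
				}
			}
			atomic.AddInt32(&inFlight, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	fmt.Println(reconcileAll(make([]string, 100), 5) <= 5)
}
```

Reading the value once at operator startup (as described) means changing maxConcurrentProposals requires a restart to take effect, a reasonable trade for a rarely tuned setting.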
@harche harche changed the title Add CRD types, reconciler skeleton, and API docs Add agentic controller: API types, reconciler, and CLI May 7, 2026
@harche harche marked this pull request as ready for review May 7, 2026 15:56
@openshift-ci bot removed the do-not-merge/work-in-progress label May 7, 2026
@everettraven left a comment


API surface matches what we've discussed and I have no more outstanding blockers to getting this merged and iterating as we learn more.

/lgtm

@openshift-ci bot added the lgtm label May 7, 2026
@harche commented May 7, 2026

API surface matches what we've discussed and I have no more outstanding blockers to getting this merged and iterating as we learn more.

/lgtm

Big thanks to @everettraven and everyone who weighed in here, the feedback really shaped this API for the better. Appreciate all the thoughtful reviews.
