Add agentic controller: API types, reconciler, and CLI #1
Conversation
| metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" |
| ) |
| // LLMProviderType identifies the hosting backend for an LLM provider. |
nit: Maybe just Provider? We have agentic scoping from the agentic.openshift.io group, and I'm having trouble imagining other providers besides these AI backends. Also, the AI backends like the Anthropic API, etc., are composite harnesses, and not just raw LLM models, right? So even if we need a way to distinguish between other agentic.openshift.io providers, I dunno if LLM-ness is quite the distinction to use. But also, 🤷, it's just a name, and I'm happy to use whatever, if folks don't want to bikeshed at this level of pedanticism.
Yeah fair point, they are more than just LLMs. But I lean toward keeping LLMProvider because when someone reads kind: LLMProvider in a YAML file they immediately know what it is. kind: Provider on its own could be anything without checking the API group. The extra specificity pays for itself in readability, especially when people are scanning through manifests. Happy to change it if you feel strongly though.
nobody reads kind: Provider on its own though, apiVersion is right next door. For example:
$ oc get -o yaml clusterversion version | head -n2
apiVersion: config.openshift.io/v1
kind: ClusterVersion
I don't feel strongly though, I'm just explaining my impressions, and you're free to do whatever makes sense to you and whoever this repo's approvers will be, regardless of how closely it matches my understanding.
I'm inclined to just Provider, unless there are more types, e.g. RAGProvider/OKPProvider/XYZProvider (similar to llama-stack).
Provider is definitely more straightforward. If we are concerned about needing additional "provider" types in the future, maybe making this more generic to something like InferenceProvider makes more sense here?
Taking inspiration from llama-stack they basically have inference, vector store, safety, and tool runtime providers and "inference" sounds the most applicable here.
Good suggestion — llama-stack's InferenceProvider makes sense for their scope since they're a generic inference framework covering LLMs, embeddings, safety, and vector stores.
For us, LLMProvider is intentionally specific. This resource configures which LLM backend (Anthropic, OpenAI, Vertex, Bedrock) an agent talks to — model name, credentials, endpoint. It's not a generic inference abstraction, and we don't plan to extend it to cover embeddings, safety filters, or vector stores.
InferenceProvider also risks confusion with GPU-as-a-service inference platforms (CoreWeave, Lambda, OpenRouter, RunPod, etc.) — which are a different layer entirely. LLMProvider is unambiguous about what it configures.
| // runtime (e.g., forwarded from a user session). |
| // |
| // +enum |
| type MCPHeaderSourceType string |
Naming nit, are MCPs the only kind of tools that the agent needs to auth for? And this header is just about auth, right? Not the other things HTTP headers could be used for (content negotiation, cache management, etc.)? Maybe ToolAuthenticationSourceType?
The naming here follows the existing MCPHeaderSourceType from the lightspeed-operator API so we stay consistent across both operators. Also these headers aren't strictly about auth, they could be used for routing or tenant identification too, so ToolAuthenticationSourceType would be narrowing it. Since the type is already scoped under MCPServerConfig the MCP prefix gives enough context without being too specific about the purpose.
wait this is already merged lightspeed API? Can we just use the existing types then, instead of creating copies here?
Better to start fresh and see what's actually needed rather than inheriting current OLS types that may set the wrong base.
| // |
| // When omitted, the entire image is mounted as a single volume. |
| // +optional |
| Paths []string `json:"paths,omitempty"` |
There's translation risk if we deviate from the target APIs we expect to use to deliver this functionality. With this knob expecting to feed some VolumeMount knobs, maybe we should just use:
// Mount configures how the image will be mounted into the agent's sandbox Pod.
Mount *corev1.VolumeMount `json:"mount,omitempty"`
to be transparent about being a pass-through consumer, without having to have our own opinions about how volume-mounting works? And then we're on the hook to vendor-bump k8s.io/api with each release (tracking the oldest version of Kube that we claim compatibility with for a given Lightspeed operator branch?), but we aren't on the hook to keep a closer eye on evolutions like new approaches to declaring subpaths or whatever it is upstream decides to add to VolumeMount next.
Yeah I considered this. The thing is, paths isn't really a thin wrapper over VolumeMount. The operator does something fairly opinionated with it. It takes a single image volume and creates multiple subPath mounts, using path.Base() to derive the mount name and placing everything under /app/skills/. So /skills/prometheus becomes a subPath mount at /app/skills/prometheus, and /skills/cluster-update/update-advisor becomes /app/skills/update-advisor.
If we exposed a raw VolumeMount here, the user would need to manually specify mountPath, subPath, and name for each skill, and get them all consistent with the conventions the operator expects. That's a lot of ceremony for what should be "give me these skills from this image."
I think paths is the right abstraction for the common case. If we run into situations where someone needs full VolumeMount control we could always add an escape hatch later. Wdyt?
you can also mask the properties you don't want to expose. E.g. you have opinions on Name, so don't expose that. You have opinions on ReadOnly, so don't expose that. You're left with SubPath and SubPathExpr, so expose those? That saves you the work of having to translate from paths to multiple VolumeMounts, while still giving the spec-author some properties/Godocs they're familiar with from their Pod work.
I see the appeal of familiar field names but I think it would actually be more confusing. If a user sees subPath they'll expect they can also set mountPath, readOnly, etc. and wonder why those are missing. And SubPathExpr uses downward API env var substitution which doesn't really apply to skills mounting.
paths is doing more than subPath under the hood. For example, this spec:
skills:
- image: registry.ci.openshift.org/ocp/5.0:agentic-skills
  paths:
  - /skills/prometheus
  - /skills/cluster-update/update-advisor
produces two VolumeMounts from a single image volume:
volumeMounts:
- name: skills
  mountPath: /app/skills/prometheus
  subPath: skills/prometheus
  readOnly: true
- name: skills
  mountPath: /app/skills/update-advisor
  subPath: skills/cluster-update/update-advisor
  readOnly: true
The operator derives the mount name from path.Base(), hardcodes readOnly, and places everything under /app/skills/. Exposing subPath directly would let users think they control all of this when they don't.
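For concreteness, a minimal Go sketch of that translation, assuming a hypothetical helper name (buildSkillMounts) and the conventions described above; the operator's actual code may differ:

package main

import (
	"path"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// buildSkillMounts turns skill paths into subPath mounts on a single
// image volume named "skills", all placed under /app/skills/.
func buildSkillMounts(paths []string) []corev1.VolumeMount {
	mounts := make([]corev1.VolumeMount, 0, len(paths))
	for _, p := range paths {
		mounts = append(mounts, corev1.VolumeMount{
			Name:      "skills",                               // one shared image volume
			MountPath: path.Join("/app/skills", path.Base(p)), // e.g. /app/skills/update-advisor
			SubPath:   strings.TrimPrefix(p, "/"),             // e.g. skills/cluster-update/update-advisor
			ReadOnly:  true,                                   // hardcoded by the operator
		})
	}
	return mounts
}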
But you can generate that same volumeMounts context with just subPath exposed, like:
skills:
- image: registry.ci.openshift.org/ocp/5.0:agentic-skills
  subPath: /skills/prometheus
- image: registry.ci.openshift.org/ocp/5.0:agentic-skills
  subPath: /skills/cluster-update
right? Although for agentic-skills specifically, I'd expect any flattening needed to happen at image-build time, so you can just do:
skills:
- image: registry.ci.openshift.org/ocp/5.0:agentic-skills
  subPath: /skills
and not feel like you have to track and handle each skill you want mounted from that image (unless for some reason you want that control to exclude skills you don't want the agent to access).
I'm not strongly opinionated here, but if the goal is to have a simpler API, the existing shape seems generally OK to me.
Force-pushed d7df521 to 297600d.
| ) |
| // DiagnosisResult contains the root cause analysis from the analysis agent. |
| // This is populated by the agent during the Analyzing phase and stored in |
nit: "phase" -> "step", here and elsewhere, to align with WorkflowSpec's language like "analysis references an Agent for the analysis step". I don't think you're trying to make a phase-vs-step distinction, and if you aren't, sticking with one word will help keep folks from wondering about that possibility.
Agree, it should be 'step'. I will change it everywhere. Thanks.
| // +kubebuilder:validation:MaxLength=253 |
| Name string `json:"name"` |
| // command is the command or API call to run for this check |
| // (e.g., "oc get pod -n production -l app=web -o jsonpath='{.items[0].status.phase}'"). |
This requires oc in the agent-sandbox Pod, but we've dropped that requirement. Do we need to restore it as part of the "what you can expect the agent to do?" contract between the agent container executing the step and the AI engine generating the advice?
As one possibility, OpenShift docs list prereqs to help readers decide if they will be able to follow the process we're about to talk about. For example:
Prerequisites
- You installed the OpenShift CLI (oc) that matches the version for your updated version.
- You are logged in to the cluster as a user with cluster-admin privileges.
- You have paused all MachineHealthCheck resources.
Those aren't dynamically negotiated though. We could have a system that had some dynamic negotiation (e.g. "oh, no oc? How about kubectl?"). Or we could hard-code the requirements for the Agent image as non-negotiable, and then the operator here could hard-code that context ("step commands should be Bash scripts with access to POSIX utilities as well as oc (the OpenShift client) and..."). Or there could be an endpoint served by the image (/context?) where the image tries to describe itself ("I am a POSIX system with the following additional commands..."), and the operator could slurp that up and pass it along to the AI engine when asking for command steps. Or some such...
The operator communicates with agents via HTTP POST, but the agent image itself ships with CLI tools (oc, promtool, rg, standard unix utilities) that agents use to interact with the cluster inside their sandbox. The command field here is what the analysis agent recommends the verification agent run using those tools, not a direct operator-to-agent call. So oc in the example is correct.
We could have a system that had some dynamic negotiation (e.g. "oh, no oc? How about kubectl?"). Or we could hard-code the requirements for the Agent image as non-negotiable...
The system prompt already handles this today. It tells the LLM what tools are available in the sandbox (oc, promtool, rg, standard POSIX utils). The system prompt is referenced via systemPromptRef on the Agent CR, so different agent images can describe different toolsets.
| // before the formal verification step. This gives early signal on whether |
| // the remediation worked. In trust-mode workflows (verification skipped), |
| // this is the only verification that occurs. |
| type ExecutionVerification struct { |
I'm not clear on the value add of this type vs. recycling VerifyCheck for this use-case.
They serve different purposes. ExecutionVerification is a quick "did things improve?" signal from the execution agent, just a bool and a summary. VerifyCheck is a structured per-check result from the dedicated verification step with name, source, value, and pass/fail. Collapsing them would force the execution agent to produce structured check results for what's meant to be a lightweight sanity check.
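Sketched in Go, the two shapes as described in this thread (field sets inferred from the discussion, not copied from the actual API):

// ExecutionVerification: the execution agent's lightweight sanity check.
type ExecutionVerification struct {
	Improved *bool  `json:"improved,omitempty"` // quick "did things improve?" signal
	Summary  string `json:"summary,omitempty"`  // short human-readable explanation
}

// VerifyCheck: a structured per-check result from the verification step.
type VerifyCheck struct {
	Name    string `json:"name"`              // which check ran
	Source  string `json:"source"`            // command or API the value came from
	Value   string `json:"value,omitempty"`   // observed value
	Outcome string `json:"outcome,omitempty"` // pass/fail-style result
}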
| Sandbox SandboxInfo `json:"sandbox,omitempty"` |
| // success indicates whether execution completed successfully. |
| // +optional |
| Success *bool `json:"success,omitempty"` |
success: false leaves a lot of room for ambiguity. Can we drop this, and instead declare some well-known types for Conditions where we talk about success with room for a confidence read and human-facing description string? As an example of that, see the Applies condition for conditional update risks.
Agree, dropped success from both ExecutionStepStatus and VerificationStepStatus. The operator reads success from the agent's response and sets step conditions with reason and message instead. Conditions already carry the same information with more context.
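As a sketch of what that looks like in the reconciler, assuming the Proposal type from this PR (condition type and reason strings here are assumptions, not the operator's actual values):

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setExecutionOutcome records the agent-reported outcome as a condition,
// carrying reason and message instead of a bare success bool.
func setExecutionOutcome(proposal *Proposal, agentSucceeded bool, msg string) {
	status, reason := metav1.ConditionTrue, "AgentSucceeded"
	if !agentSucceeded {
		status, reason = metav1.ConditionFalse, "AgentReportedFailure"
	}
	meta.SetStatusCondition(&proposal.Status.Conditions, metav1.Condition{
		Type:               "ExecutionSucceeded",
		Status:             status,
		Reason:             reason,
		Message:            msg,
		ObservedGeneration: proposal.Generation,
	})
}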
The Success *bool field that was discussed in the original thread was removed from ExecutionStepStatus and VerificationStepStatus as stated.
However, StepResultRef and the four result CRDs (AnalysisResult, ExecutionResult, VerificationResult, EscalationResult) still had a success bool field. Replaced all of them with outcome: ActionOutcome (enum: Succeeded / Failed) per the boolean convention. Updated the controller, CLI, and console UI accordingly.
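The enum pattern, sketched (marker placement is assumed):

// +kubebuilder:validation:Enum=Succeeded;Failed
type ActionOutcome string

const (
	ActionOutcomeSucceeded ActionOutcome = "Succeeded"
	ActionOutcomeFailed    ActionOutcome = "Failed"
)

// On each result type:
// outcome reports whether the step succeeded or failed.
// +required
Outcome ActionOutcome `json:"outcome,omitempty"`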
| // +patchStrategy=merge |
| // +patchMergeKey=type |
| // +optional |
| Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"` |
Agree, added observedGeneration to ProposalStatus.
| // execution, or verification). When omitted, the agent uses a default |
| // prompt appropriate for its workflow step. |
| // +optional |
| SystemPromptRef *corev1.LocalObjectReference `json:"systemPromptRef,omitempty"` |
The Agent type is cluster-scoped. Would it be better to also have namespace in the reference?
You're right that namespace is needed. I'm actually reconsidering making all four CRDs (LLMProvider, Agent, Workflow, Proposal) namespace-scoped instead of cluster-scoped. Namespace scoping gives multi-tenancy and RBAC isolation naturally, simplifies references (everything is LocalObjectReference within the same namespace), and makes it easier to scope execution RBAC to the namespace boundary. Cluster-wide use cases still work by putting everything in a shared namespace. Thoughts? cc @wking @mrunalp
I am updating the code to make those namespace-scoped; I can update the existing operator code to verify everything still works fine.
We'd need cluster-scoped agents and namespace-scoped agents that just look at application issues within those bounds.
@mrunalp If we make agents namespace-scoped, an Agent which needs scope beyond its namespace (cluster-wide) will have to end up in an arbitrary/default namespace, which I think is not a good situation.
The Agent CR is configuration only, it defines which LLM, skills, and prompt to use. It doesn't control the agent's runtime permissions. Those come from the RBAC rules the analysis step recommends, which admins review and approve as part of the proposal. Only after approval does the operator provision the Role/ClusterRole bindings on the sandbox ServiceAccount. The sandbox pod itself always runs in the operator's namespace (openshift-lightspeed), not the Agent's namespace, so there's no exec-based escalation path. An Agent in openshift-lightspeed namespace can have full cluster-wide access at runtime if admins approve the requested RBAC.
See comment above on modes. I think we need mode-level configuration.
If the execution of the agents will only ever happen in the openshift-lightspeed namespace, just specify that this configmap must exist in the openshift-lightspeed namespace?
This field has been removed — the Agent CR no longer has systemPromptRef or any ConfigMap reference. The Agent is now a cluster-scoped tier (LLM provider, timeouts, maxTurns, providerSettings). Skills, system prompts, and MCP servers are configured per-proposal via ToolsSpec.
| // set by the operator -- users should not modify status fields directly. |
| // The status provides complete observability into the proposal's progress, |
| // including per-step results, retry history, and standard Kubernetes conditions. |
| type ProposalStatus struct { |
nit: What about adding something like Phase here, reflecting the lifecycle https://github.com/openshift/lightspeed-agentic-operator/pull/1/changes#diff-608f31719fd83cc2e66fb400f835b53884e9fb407d0c826397d1c0fe82fc79a8R44? It could then be used in printerColumns.
Good idea. Added a phase field to ProposalStatus with a printerColumn. The operator derives it from conditions on every reconcile — conditions stay the source of truth, phase is for human consumption (oc get, console list views).
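Roughly, the resulting shape (phase values and the exact column set are assumed here):

// On the Proposal root type:
// +kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"

// In ProposalStatus:
// phase is a human-facing summary of the proposal's lifecycle, derived
// from conditions on every reconcile. Conditions remain the source of truth.
// +optional
Phase string `json:"phase,omitempty"`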
We seem to be mixing 2 concerns: phase, which is a workflow/FSM state defined by the Workflow, and proposal status, which is what is happening with the proposal.
| // +kubebuilder:validation:XValidation:rule="self.type != 'Secret' ? !has(self.secretRef) : true",message="secretRef must not be set when type is 'Kubernetes' or 'Client'" |
| type MCPHeaderValueSource struct { |
| // type specifies the source type for the header value. |
| // +kubebuilder:validation:Required |
Suggested change:
- // +kubebuilder:validation:Required
+ // +required
This is present everywhere so just calling it out here.
Talking with @JoelSpeed and @camilamacedo86, it seems that there are plans to deprecate this.
Agree, switched all +kubebuilder:validation:Required markers to +required.
| // MCPHeader defines an HTTP header to send with every request to an |
| // MCP server. Used for authentication and routing. |
| type MCPHeader struct { |
| // name of the header (e.g., "Authorization", "X-API-Key"). |
One feedback from API review is to add all requirements on these fields to the GoDoc.
When you do oc explain you want to see what requirements there are on the fields. The markers do not show up as documentation.
Agree, added all validation requirements (min/max length, enum values, patterns, item limits) to the GoDoc so they show up in oc explain.
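For example, a header-name field along these lines (the exact bounds are illustrative) now states its constraints in prose where oc explain can show them:

// name of the header (e.g., "Authorization", "X-API-Key").
// Must be between 1 and 256 characters.
// +required
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=256
Name string `json:"name"`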
| // headers to send to the MCP server. |
| // +optional |
| // +listType=map |
| // +listMapKey=name |
Linter should say that you need ssatags here.
Ran kube-api-linter against the API types — 0 issues. All list fields already have +listType markers (atomic or map with appropriate keys). Let me know if you're seeing something specific I'm missing.
https://gist.github.com/kannon92/78c94077c670439dbf40811cad93e2bf
odd, I ran it and I got 50 failures.
Yeah, my bad, I didn't run it correctly. I ran into those failures too, but they're fixed now. Thanks.
… skills
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add EscalationOutputSchema (success/summary/content) and register in defaultOutputSchemas map so the escalation agent receives structured output constraints
- Fix duplicate EscalationResult: add ConditionUnknown/InProgress guard to handleEscalation, matching the pattern used by analysis/execution handlers. Without this, a re-queued reconcile during agent execution would spin up a second sandbox
- Add TestEscalation_InProgressIsIdempotent to verify re-reconcile during escalation is a no-op
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace explicit StartTime/CompletionTime fields on step statuses and Outcome/StartTime/CompletionTime on result CRs with standard K8s conditions. Result CRs now have a status subresource with Started and Completed conditions whose lastTransitionTime provides timestamps and whose reason encodes the outcome (Succeeded/Failed). Also adds defaultOption to ApprovalPolicyStage with CEL validation requiring it when execution approval is Automatic, preventing auto-approved execution from running with an empty remediation plan.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor LLMProvider discriminated union, enhance ToolsSpec with secret mount types, update CLI for revised API types, and adjust sandbox template generation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace string-based mountAs examples and API reference with the discriminated union SecretMountSpec (type: EnvVar/FilePath).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SandboxClaim's own Ready condition provides pod lifecycle timestamps. Step execution timing is tracked by step conditions. These fields were redundant (CompletionTime was never set).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
everettraven left a comment:
Only need to address comments labeled Blocking: and then this should be good to go.
Please ping me once these are addressed and I'll re-review.
| // Must contain only lowercase letters, digits, and hyphens |
| // (e.g., "us-east-1", "eu-west-2", "ap-southeast-1"). |
| // +required |
| // +kubebuilder:validation:MinLength=2 |
| // +kubebuilder:validation:MaxLength=63 |
| // +kubebuilder:validation:XValidation:rule="self.matches('^[a-z][a-z0-9-]*[a-z0-9]$')",message="region must contain only lowercase letters, digits, and hyphens, start with a letter, and not end with a hyphen" |
Non-blocking: Explicitly document that values must begin with lowercase letters and must end with lowercase alphanumeric characters.
Done. Updated region field comments on both Vertex and Bedrock configs to explicitly state: must begin with a lowercase letter and end with a lowercase alphanumeric character.
| // +optional |
| // +kubebuilder:validation:MinLength=1 |
| // +kubebuilder:validation:MaxLength=2048 |
| // +kubebuilder:validation:XValidation:rule="isURL(self) && url(self).getScheme() in ['http', 'https'] && url(self).getHostname() != '' && !self.contains('@') && !self.contains('#')",message="url must be a valid HTTP or HTTPS URL with a hostname; fragments and userinfo are not allowed" |
Non-blocking: We normally suggest that CEL validations are broken into validations that are as tightly scoped as possible.
For example, I'd probably break this into 4 validations:
- HTTP/HTTPS scheme check
- Hostname check
- userinfo check
- fragment check
Done. Split the monolithic URL CEL validation into 4 separate rules across all URL/endpoint fields: scheme check, hostname check, userinfo check, fragment check. Each now has its own error message.
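Sketched, the split looks roughly like this (messages paraphrased):

// +kubebuilder:validation:XValidation:rule="isURL(self) && url(self).getScheme() in ['http', 'https']",message="url must use an http or https scheme"
// +kubebuilder:validation:XValidation:rule="isURL(self) && url(self).getHostname() != ''",message="url must include a hostname"
// +kubebuilder:validation:XValidation:rule="!self.contains('@')",message="userinfo is not allowed in url"
// +kubebuilder:validation:XValidation:rule="!self.contains('#')",message="fragments are not allowed in url"
URL string `json:"url,omitempty"`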
| // +kubebuilder:validation:MinLength=1 |
| // +kubebuilder:validation:MaxLength=2048 |
Non-blocking: Always explicitly document minimum and maximum length constraints.
Done. Added "Maximum 2048 characters." to all URL/endpoint field comments across all provider configs.
| // When omitted, the SDK default is used. |
| // +optional |
| // +kubebuilder:validation:MaxLength=32 |
| // +kubebuilder:validation:XValidation:rule="self.matches('^[0-9]{4}-[0-9]{2}-[0-9]{2}(-preview)?$')",message="apiVersion must be a date in YYYY-MM-DD format with an optional -preview suffix" |
Non-blocking: Is date based the only versioning scheme they have? It looks like they have a v1 api based on https://learn.microsoft.com/en-us/azure/foundry/openai/latest
| // +kubebuilder:validation:Minimum=1 |
| // +kubebuilder:validation:Maximum=500 |
Non-blocking: Explicitly document.
Addressed — documented min/max bounds in the maxTurns field comment (same as #39 above).
| // denied indicates the user has denied this step, making the entire |
| // proposal terminal. Once set to true, it cannot be unset. |
| // +optional |
| Denied bool `json:"denied,omitempty"` |
Blocking: Do not use a boolean and instead use an enum. Maybe something like:
state: Approved | Denied?
Feel free to use a different field name.
Non-blocking: If we make this CR represent approval+denial, maybe ProposalApproval is a bit of a misleading name?
Done. Replaced denied bool with decision ApprovalDecision enum (Approved | Denied). Updated CEL rule to prevent changing decision once set to Denied. Updated controller, CLI, and console.
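Roughly (the exact CEL message and marker placement are assumed):

// +kubebuilder:validation:Enum=Approved;Denied
type ApprovalDecision string

const (
	ApprovalDecisionApproved ApprovalDecision = "Approved"
	ApprovalDecisionDenied   ApprovalDecision = "Denied"
)

// On the stage's decision field:
// +kubebuilder:validation:XValidation:rule="oldSelf != 'Denied' || self == 'Denied'",message="decision cannot be changed once set to Denied"
Decision ApprovalDecision `json:"decision,omitempty"`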
| // +optional |
| metav1.ObjectMeta `json:"metadata,omitempty"` |
| // +optional |
Done. Made spec required on ProposalApproval.
Update: during cluster testing, found that making spec required conflicts with the operator's lifecycle. The operator creates ProposalApproval with an empty spec when all stages are Manual (the default) — users then add approval stages one at a time. Reverting to +optional with omitzero so the initial empty state is valid. The MinItems=1 on stages still prevents stages: [] — omitting stages entirely is the valid empty state.
| // +kubebuilder:validation:XValidation:rule="self.stages.all(s1, self.stages.all(s2, s1 == s2 || s1.type != s2.type))",message="stage types must be unique" |
| type ProposalApprovalSpec struct { |
| // stages lists the approved (or denied) workflow steps. Each entry is |
| // a discriminated union keyed by type. |
| // +optional |
| // +listType=atomic |
Blocking: For uniqueness use +listType=map and +listMapKey=type
Done. Replaced +listType=atomic + CEL uniqueness with +listType=map +listMapKey=type on stages.
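In other words (sketch):

// stages lists the approved (or denied) workflow steps, keyed by type.
// Server-side apply enforces uniqueness on the key, replacing the CEL rule.
// +optional
// +listType=map
// +listMapKey=type
Stages []ApprovalStage `json:"stages,omitempty"`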
| // (spec.tools) or per-step (spec.analysis.tools, spec.execution.tools, |
| // spec.verification.tools). Per-step tools replace the shared default |
| // for that step. |
| type ToolsSpec struct { |
Done. Added MinProperties=1 to ToolsSpec (addressed as part of the tools pointer-to-value change).
| // proposalName is the name of the parent Proposal in the same namespace. |
| // +required |
| // +kubebuilder:validation:MinLength=1 |
| // +kubebuilder:validation:MaxLength=253 |
| ProposalName string `json:"proposalName"` |
| // attempt is the 1-based overall attempt number. |
| // +required |
| // +kubebuilder:validation:Minimum=1 |
| Attempt int32 `json:"attempt"` |
| // retryIndex is the 0-based retry index within the current analysis. |
| // +required |
| // +kubebuilder:validation:Minimum=0 |
| RetryIndex int32 `json:"retryIndex"` |
| // checks contains individual verification check results. |
| // +optional |
| // +listType=atomic |
| // +kubebuilder:validation:MaxItems=50 |
| Checks []VerifyCheck `json:"checks,omitempty"` |
| // summary is a Markdown-formatted verification summary. |
| // +optional |
| // +kubebuilder:validation:MaxLength=32768 |
| Summary string `json:"summary,omitempty"` |
| // components contains optional adapter-defined structured data. |
| // +optional |
| // +listType=atomic |
| // +kubebuilder:validation:MaxItems=20 |
| Components []apiextensionsv1.JSON `json:"components,omitempty"` |
| // sandbox tracks the sandbox pod used for this verification. |
| // +optional |
| Sandbox SandboxInfo `json:"sandbox,omitzero"` |
| // failureReason is populated when the step failed due to a system error. |
| // +optional |
| // +kubebuilder:validation:MaxLength=8192 |
| FailureReason string `json:"failureReason,omitempty"` |
Done. Moved checks, summary, sandbox, failureReason under .status on VerificationResult.
…1, #13-17, #19-22, #29, #32)
- openshift#1: Change status.attempts from *int32 to int32 with min=1/max=int32 validation
- openshift#2: Move maxAttempts from ProposalSpec to ApprovalPolicy (admin ceiling) and ExecutionApproval (user choice); fix semantics so N = N total attempts
- openshift#3: Remove spec.revision, use metadata.generation for revision signaling
- openshift#4: Make ProposalStep.tools non-pointer with omitzero, add MinProperties=1
- #8-11: Make outcome/result/conditionOutcome required; SandboxInfo fields required
- #13: Add CEL singleton rule (metadata.name == "cluster") on ApprovalPolicy
- #14: Use listType=map+listMapKey=name on ApprovalPolicy stages
- #15: Remove defaultOption from ApprovalPolicy (operator defaults to option 0)
- #16-17: Add omitempty to name/approval, document allowed values
- #19: Replace denied bool with decision enum (Approved|Denied) + CEL immutability
- #20: Make ProposalApproval spec required
- #21: Use listType=map+listMapKey=type on ProposalApproval stages
- #22: Make estimatedImpact required on RemediationOption
- #29: Remove components from ExecutionResult and VerificationResult (copy-paste error)
- #32: Add MinProperties=1 to ToolsSpec
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…#30-31)
Each Result CRD (AnalysisResult, ExecutionResult, VerificationResult, EscalationResult) now has its own status type containing result data + conditions. Identity fields (proposalName, attempt, retryIndex) remain top-level. Result data (options, actionsTaken, checks, summary, content, components, sandbox, failureReason) moved under .status since results communicate outcomes, not desired state. Also removes components from ExecutionResult and VerificationResult (copy-paste error — not in their output schemas).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…leanup
- Change ProposalStep fields from pointer to value with omitzero + MinProperties=1
- Add RFC1123 DNS subdomain CEL validation on agent name fields
- Add MinProperties=1 to step status types, RBACResult, AgentTimeouts
- Add MinItems=1 on ApprovalPolicy stages
- Add CEL rule requiring rollbackPlan when reversible is Reversible/Partial
- Fix rollbackPlan location in analysis schema (was under verification)
- Remove MinLength=1 on apiGroups items to allow "" (core group)
- Split monolithic URL CEL into 4 separate rules with specific messages
- Document field constraints: region, model, maxTurns, timeouts, URL length
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every field in the analysis, execution, verification, and escalation output schemas now has a description guiding the agent on what to produce. These descriptions are the agent's primary instruction for structured output.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add MinLength=1 on optional strings (FailureReason, Summary, etc.)
- Add MinItems=1 on optional arrays (Conditions, Results, etc.)
- Add MinProperties=1 on status/spec types
- Fix godoc comments to start with field name (metadata, spec, status)
- Add omitempty to required fields per linter convention
- Change Result CR status fields from omitempty to omitzero
- Change ApprovalStage approval fields from pointer to value with omitzero
- Move Conditions to first field in ApprovalStageStatus
- Add MaxLength=253 to ApprovalStageStatus.Name
- Add Maximum=10 to RetryIndex, Minimum=1 to ObservedGeneration
- Document API linting in CLAUDE.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pec, examples
- Add generation to console Proposal metadata type (TS compile error)
- Revert ProposalApproval spec from +required to +optional — empty spec
is the valid initial state when all stages are Manual
- Fix example YAMLs: mountAs string→struct, execution: {}→agent: default
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix createIdempotent to properly write status via status subresource (DeepCopy before Create, then Status().Patch with full status)
- Add tests for createIdempotent with WithStatusSubresource
- Make VerificationStep.expected optional (agent doesn't always know)
- Default agent field to "default" on approval types (satisfies linter)
- Make ProposalApprovalSpec.stages required with default=[], remove MinItems conflict, keep MinProperties=1 on type
- Change RetryIndex to *int32 (valid zero value needs pointer)
- Update verification schema: source description asks for full command
- Remove reflect import, no linter exclusions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each Result CRD (AnalysisResult, ExecutionResult, VerificationResult, EscalationResult) now has a Spec type containing identity fields (proposalName, attempt, retryIndex). Spec is immutable via CEL rule self == oldSelf. Print columns updated to .spec.* paths.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ghten RetryIndex
- Remove Attempts from ProposalStatus (always 1, never incremented)
- Remove Attempt from all Result CRD specs (always 1)
- Move ObservedGeneration from AnalysisStepStatus to conditions (standard K8s pattern — every condition now carries ObservedGeneration)
- Set ObservedGeneration on all 18 condition calls
- Tighten RetryIndex Maximum from 10 to 2 (matches maxAttempts max of 3)
- Update CEL messages: decisions/maxAttempts immutability
- Clarify that denying any stage terminates the entire proposal
- Update verification schema source description for full commands
- Make VerificationStep.expected optional
- Default agent to "default" on approval types (satisfies linter)
- Clean up related tests, CLI, templates
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…steps
- Make rbac required in analysis schema only when proposal has execution
- Make verification required only when proposal has verification step
- Advisory proposals get a minimal schema (title, diagnosis, proposal)
- Keep MinProperties=1 on RBACResult (when present, must have rules)
- Make VerificationStep.Command and Expected optional
- Make RollbackPlan.Command optional
- Remove rollbackPlan CEL conditional requirement
- Add tests for outputSchemaForStep across all proposal variants
Tested on cluster:
- Full (exec + verify): passed
- Trust mode (exec, no verify): passed
- Advisory with target: passed
- Trivial advisory: passed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Template conditionally includes RBAC and verification instructions based on whether the proposal has execution/verification steps. Advisory proposals get a simpler prompt without RBAC/verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OutputSchema defines the structured output contract for analysis agents, not a tool the sandbox pod uses. Move it to ProposalSpec where it belongs. When present, it is injected as a required "components" property in each analysis option. Change RemediationOption.Components from []JSON to *JSON to accept any schema-defined shape (object or array).
- api: remove OutputSchema from ToolsSpec, add to ProposalSpec with immutability CEL rule
- api: change RemediationOption.Components from []apiextensionsv1.JSON to *apiextensionsv1.JSON for flexible schema-driven shapes
- api: remove redundant Components from AnalysisResultStatus (already per-option via RemediationOption)
- controller: outputSchemaForStep injects spec.outputSchema as "components" property in analysis schema when present
- controller: remove top-level Components from analysisResponse, AnalysisOutput, and result creation
- Regenerate deepcopy and CRDs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the user approves execution and selects an option, the operator trims the AnalysisResult to keep only that option. No SelectedOption index field needed — selectedOption() always returns options[0].
- api: remove SelectedOption from AnalysisStepStatus
- controller: add trimNonSelectedOptions() that reads the selected index from ProposalApproval and patches AnalysisResult
- controller: extract getLatestAnalysisResult() shared helper
- controller: trimNonSelectedOptions returns the selected option directly, eliminating a double-fetch in handleExecution
- cli: remove Selected Option display line
- Consolidate 3 redundant individual trim tests into table-driven test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
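A hedged Go sketch of that trim (the signature is assumed; per the commit message, the real helper also reads the selected index from ProposalApproval and patches the AnalysisResult):

// trimToSelected keeps only the approver-chosen option, so
// selectedOption() can always return options[0].
func trimToSelected(options []RemediationOption, selected int) []RemediationOption {
	if selected < 0 || selected >= len(options) {
		return options // out-of-range index: leave the list untouched
	}
	return []RemediationOption{options[selected]}
}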
… output
Introduce AnalysisOutputMode enum (Default/Minimal) and AnalysisOutput struct wrapping the mode and optional custom schema. Default mode includes all built-in properties (diagnosis, proposal, rbac, verification). Minimal mode strips built-in properties for analysis-only proposals that define their own output shape via a custom schema. CEL validation enforces: schema is required when mode is Minimal, and Minimal mode is only allowed for analysis-only proposals (no execution or verification steps). Diagnosis and proposal fields on RemediationOption are now optional to support Minimal mode output.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allow cluster admins to configure how many proposals the operator reconciles concurrently via ApprovalPolicy.spec.maxConcurrentProposals. Defaults to 5. The operator reads this at startup from the cluster singleton ApprovalPolicy. Also add CEL validation on RemediationOption enforcing that diagnosis and proposal must be present together or both absent.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
everettraven left a comment:
API surface matches what we've discussed and I have no more outstanding blockers to getting this merged and iterating as we learn more.
/lgtm
Big thanks to @everettraven and everyone who weighed in here; the feedback really shaped this API for the better. Appreciate all the thoughtful reviews.
Summary
Initial agentic controller for agentic.openshift.io/v1alpha1, to be imported into lightspeed-operator.
- CRDs (Proposal, Agent, LLMProvider) and generated manifests
- oc-agentic CLI plugin
Test plan
- go test ./...