feat: add dataSource PVC mount and per-proposal timeoutMinutes#17
feat: add dataSource PVC mount and per-proposal timeoutMinutes#17sakshipatels98-byte wants to merge 2 commits into
Conversation
|
Skipping CI for Draft Pull Request. |
📝 WalkthroughWalkthroughThis PR adds per-step sandbox agent timeout configuration and optional PersistentVolumeClaim data source mounting to Proposal CRDs. The timeout is threaded from schema through HTTP client and handler orchestration, while data source support is integrated into template patching. RBAC permissions for PVC access and example proposals are included. ChangesProposal Timeout and Data Source Support
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
2a89d44 to
f71f525
Compare
f71f525 to
31261be
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
controller/proposal/sandbox_agent.go (1)
177-210:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse one shared deadline for the whole step.
timeoutis spent once inWaitReadyand then a second time in the HTTP client, sotimeoutMinutes: 5can block for roughly 10 minutes before failing. If this field is meant to cap a step end-to-end, the later phases need to use the remaining budget, not the original budget again.⏱️ Possible fix
func (s *SandboxAgentCaller) callWithSandbox( ctx context.Context, proposal *agenticv1alpha1.Proposal, stepName string, @@ ) (json.RawMessage, error) { if timeout <= 0 { timeout = defaultSandboxTimeout } + deadline := time.Now().Add(timeout) templateName, err := EnsureAgentTemplate(ctx, s.K8sClient, defaultBaseTemplateName, s.Namespace, stepName, step.Agent, step.LLM, step.Tools, proposal.Spec.DataSource) if err != nil { @@ - endpoint, err := s.Sandbox.WaitReady(ctx, claimName, timeout) + endpoint, err := s.Sandbox.WaitReady(ctx, claimName, time.Until(deadline)) if err != nil { return nil, fmt.Errorf("wait for sandbox: %w", err) } @@ - client := s.ClientFactory(agentURL, timeout) + remaining := time.Until(deadline) + if remaining <= 0 { + return nil, fmt.Errorf("step timeout exceeded before agent call") + } + client := s.ClientFactory(agentURL, remaining) resp, err := client.Run(ctx, "", query, schema, agentCtx)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/sandbox_agent.go` around lines 177 - 210, The step currently reuses the full original timeout for WaitReady and then again for the HTTP client, doubling the effective wait; fix by creating a single deadline-based context and deriving the remaining budget for the client: wrap the incoming ctx with context.WithTimeout/context.WithDeadline using the provided timeout (store the returned cancel and deadline), pass that ctx to s.Sandbox.WaitReady, then compute remaining := time.Until(deadline) (clamp to a minimum and fallback if non-positive) and pass remaining to s.ClientFactory(agentURL, remaining) (and ensure to call cancel()). Reference EnsureAgentTemplate, s.Sandbox.Claim, s.Sandbox.WaitReady, s.ClientFactory(...).Run and the timeout variable when applying this change.
🧹 Nitpick comments (4)
controller/proposal/reconciler_test.go (1)
39-64: ⚡ Quick winCapture the timeout passed by the reconciler.
testAgentCallerdrops the new argument everywhere, so nothing in this file proves thatproposalTimeout(proposal)is actually passed intoAnalyze/Execute/Verify/Escalate. A small field likelastAnalyzeTimeoutplus one case withSpec.TimeoutMinutes = 30would cover the new handler plumbing.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/reconciler_test.go` around lines 39 - 64, The testAgentCaller currently ignores the timeout parameter so tests don't verify that proposalTimeout(proposal) is propagated; add fields on testAgentCaller (e.g., lastAnalyzeTimeout, lastExecuteTimeout, lastVerifyTimeout, lastEscalateTimeout) and assign the incoming timeout argument in the respective methods Analyze, Execute, Verify, Escalate; then add/update a test that sets Spec.TimeoutMinutes = 30 on a Proposal, calls the reconciler path, and asserts each last*Timeout equals the expected duration returned by proposalTimeout(proposal) to prove the plumbing forwards the timeout.controller/proposal/sandbox_agent_test.go (1)
57-82: ⚡ Quick winRecord and assert the propagated timeout in these helpers.
The new
time.Durationparameter is ignored here, so the tests only prove the signatures compile. Capturing the value passed intoClientFactory(and similarly intoWaitReady) would let this suite verify that non-default proposal timeouts actually reach the sandbox layer.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/sandbox_agent_test.go` around lines 57 - 82, Update newTestSandboxAgentCaller and newTestSandboxAgentCallerWithProposal so the ClientFactory closure captures and records the time.Duration it receives and the mockSandboxProvider's WaitReady call records its timeout; specifically, modify the ClientFactory func(_ string, d time.Duration) to store d onto the provided mockHTTPClient (e.g., a LastTimeout field) before returning it, and ensure the mockSandboxProvider exposes a WaitReady(timeout time.Duration) recorder that tests can assert against; this will let tests verify that non-default proposal timeouts are propagated to both ClientFactory and WaitReady.controller/proposal/client_test.go (1)
38-172: ⚡ Quick winAdd assertions for the new timeout contract.
These updates only make the tests compile against the new constructor. There’s still no coverage that a custom timeout is serialized into
timeout_ms, or thattimeout <= 0falls back todefaultSandboxTimeout, so the new behavior can regress silently.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/client_test.go` around lines 38 - 172, Tests for NewAgentHTTPClient lack assertions that the serialized request contains the timeout_ms field and that timeout <= 0 falls back to defaultSandboxTimeout; update relevant tests that decode agentRunRequest (e.g., inside the handlers in TestAgentHTTPClient_RunWithExecutionResult, TestAgentHTTPClient_RunWithoutExecutionResult, TestAgentHTTPClient_RunWithContext and the basic Run test) to assert req.TimeoutMs equals the provided custom timeout when NewAgentHTTPClient(server.URL, X) is called with X>0, and assert req.TimeoutMs equals defaultSandboxTimeout when NewAgentHTTPClient(..., 0) or a negative timeout is used; reference the agentRunRequest struct and NewAgentHTTPClient constructor to locate where to add these checks.controller/proposal/sandbox_templates_test.go (1)
68-70: ⚡ Quick winAdd coverage for non-nil
dataSourcehere.All updated hash calls still pass
nilfordataSource, so this file never asserts that changingclaimNamechanges the derived template name. A focused test for that — plus one for the read-only/data/inputPVC mount — would cover the new behavior directly.Also applies to: 490-524
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/sandbox_templates_test.go` around lines 68 - 70, Update tests to exercise a non-nil dataSource when calling computeTemplateHash: modify mustHash (and the test cases around it) to pass a populated agenticv1alpha1.DataSource with a distinct ClaimName and verify that changing ClaimName alters the returned hash; additionally add an assertion that the computed template includes the expected PVC mount mode (readOnly) for /data/input by constructing a DataSource whose MountPath is /data/input and verifying the template metadata or generated mount flags reflect read-only. Target the helper mustHash and calls to computeTemplateHash so the new assertions cover the branch where dataSource is non-nil.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@api/v1alpha1/proposal_types.go`:
- Line 297: The XValidation rule for the dataSource field currently allows
nil->non-nil updates; replace the rule so only two cases are permitted: both old
and new lack dataSource, or both have it and it is equal. Update the kubebuilder
annotation on the dataSource validation (the comment containing XValidation) to
use an expression such as "(has(oldSelf.dataSource) && has(self.dataSource) &&
self.dataSource == oldSelf.dataSource) || (!has(oldSelf.dataSource) &&
!has(self.dataSource))" so adding a dataSource after creation is disallowed.
In `@controller/proposal/sandbox_templates.go`:
- Around line 523-551: The current patchDataSource unconditionally uses the name
"data-source" and addPVCVolume will overwrite any existing volume with that
name; modify addPVCVolume and patchDataSource so they do not clobber an
unrelated pre-existing volume: in addPVCVolume, when finding an existing volume
with the same name check whether it already has a persistentVolumeClaim with the
same claimName and only replace if it matches; otherwise return a sentinel (or
error) to signal a name conflict. In patchDataSource, detect that conflict and
instead pick a non-colliding name (e.g., "data-source-<uniqueSuffix>") and use
that name for both addPVCVolume and addVolumeMount, or surface the error to the
caller; reference the functions addPVCVolume and patchDataSource and ensure the
chosen name is threaded through to addVolumeMount so mounts reference the
correct volume.
In `@controller/proposal/templates/analysis_query.tmpl`:
- Around line 3-27: The template fails to tell the agent that PVC-backed input
may already be mounted at /data/input when spec.dataSource is set, so update the
contract and template: modify buildAnalysisQuery in
controller/proposal/helpers.go to accept a flag/field (e.g., hasDataSource or
includeMountedInput) derived from the Request/spec.dataSource (so callers can
indicate PVC-backed input is present), and change
controller/proposal/templates/analysis_query.tmpl to conditionally include a
short directive like "Input data is mounted read-only at /data/input; prefer
analyzing that bundle before collecting live data" when that flag is true;
ensure the helpers.go change sets the flag when spec.dataSource is present so
PVC-backed RCA is handled as bundle-first analysis.
- Around line 19-25: The skill-loading branch is inconsistent: "cat
/app/skills/intelliaide/SKILL.md" cannot "return one or more paths." Update the
logic around the atomic command (the literal string "cat
/app/skills/intelliaide/SKILL.md" and references to SKILL.md) so it is
internally consistent — either change the command to a listing/find command that
can return paths (e.g., use a glob/find/ls conceptual step) or change the prose
to state that if the cat command fails because the file does not exist you must
stop and return a JSON error, and otherwise read that SKILL.md and follow its
workflow exactly; ensure the branch no longer claims cat can return multiple
paths.
In `@examples/setup/09-intelliaide-proposals.yaml`:
- Around line 21-31: Add one concrete example of the PVC-backed input to the
IntelliAide proposals by updating a Proposal's spec to include a dataSource
block using the exact shape spec.dataSource.claimName and a placeholder PVC name
(e.g., <pre-existing-pvc>); locate any Proposal objects in this file that
currently run in live mode and add a dataSource: claimName: <pre-existing-pvc>
under their spec so the new /data/input workflow and PVC-backed input are
discoverable (ensure the YAML key is spec.dataSource.claimName and that only the
example uses a placeholder to avoid changing runtime behavior).
- Around line 33-36: The docs instruct building/pushing
quay.io/<org>/intelliaide-skills:latest but all Proposal manifests reference
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest,
so examples fail; either (A) update each Proposal image field to
quay.io/<org>/intelliaide-skills:latest (search for the string
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
in the Proposal resources and replace) or (B) add the missing retag/push steps
after the build: retag quay image to the in-cluster registry name and push it
(i.e., docker/podman tag quay.io/<org>/intelliaide-skills:latest
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
and push), and apply this fix for all occurrences noted (the other blocks at the
same pattern: lines ~70-72, 101-103, 147-149, 183-185).
---
Outside diff comments:
In `@controller/proposal/sandbox_agent.go`:
- Around line 177-210: The step currently reuses the full original timeout for
WaitReady and then again for the HTTP client, doubling the effective wait; fix
by creating a single deadline-based context and deriving the remaining budget
for the client: wrap the incoming ctx with
context.WithTimeout/context.WithDeadline using the provided timeout (store the
returned cancel and deadline), pass that ctx to s.Sandbox.WaitReady, then
compute remaining := time.Until(deadline) (clamp to a minimum and fallback if
non-positive) and pass remaining to s.ClientFactory(agentURL, remaining) (and
ensure to call cancel()). Reference EnsureAgentTemplate, s.Sandbox.Claim,
s.Sandbox.WaitReady, s.ClientFactory(...).Run and the timeout variable when
applying this change.
---
Nitpick comments:
In `@controller/proposal/client_test.go`:
- Around line 38-172: Tests for NewAgentHTTPClient lack assertions that the
serialized request contains the timeout_ms field and that timeout <= 0 falls
back to defaultSandboxTimeout; update relevant tests that decode agentRunRequest
(e.g., inside the handlers in TestAgentHTTPClient_RunWithExecutionResult,
TestAgentHTTPClient_RunWithoutExecutionResult,
TestAgentHTTPClient_RunWithContext and the basic Run test) to assert
req.TimeoutMs equals the provided custom timeout when
NewAgentHTTPClient(server.URL, X) is called with X>0, and assert req.TimeoutMs
equals defaultSandboxTimeout when NewAgentHTTPClient(..., 0) or a negative
timeout is used; reference the agentRunRequest struct and NewAgentHTTPClient
constructor to locate where to add these checks.
In `@controller/proposal/reconciler_test.go`:
- Around line 39-64: The testAgentCaller currently ignores the timeout parameter
so tests don't verify that proposalTimeout(proposal) is propagated; add fields
on testAgentCaller (e.g., lastAnalyzeTimeout, lastExecuteTimeout,
lastVerifyTimeout, lastEscalateTimeout) and assign the incoming timeout argument
in the respective methods Analyze, Execute, Verify, Escalate; then add/update a
test that sets Spec.TimeoutMinutes = 30 on a Proposal, calls the reconciler
path, and asserts each last*Timeout equals the expected duration returned by
proposalTimeout(proposal) to prove the plumbing forwards the timeout.
In `@controller/proposal/sandbox_agent_test.go`:
- Around line 57-82: Update newTestSandboxAgentCaller and
newTestSandboxAgentCallerWithProposal so the ClientFactory closure captures and
records the time.Duration it receives and the mockSandboxProvider's WaitReady
call records its timeout; specifically, modify the ClientFactory func(_ string,
d time.Duration) to store d onto the provided mockHTTPClient (e.g., a
LastTimeout field) before returning it, and ensure the mockSandboxProvider
exposes a WaitReady(timeout time.Duration) recorder that tests can assert
against; this will let tests verify that non-default proposal timeouts are
propagated to both ClientFactory and WaitReady.
In `@controller/proposal/sandbox_templates_test.go`:
- Around line 68-70: Update tests to exercise a non-nil dataSource when calling
computeTemplateHash: modify mustHash (and the test cases around it) to pass a
populated agenticv1alpha1.DataSource with a distinct ClaimName and verify that
changing ClaimName alters the returned hash; additionally add an assertion that
the computed template includes the expected PVC mount mode (readOnly) for
/data/input by constructing a DataSource whose MountPath is /data/input and
verifying the template metadata or generated mount flags reflect read-only.
Target the helper mustHash and calls to computeTemplateHash so the new
assertions cover the branch where dataSource is non-nil.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f4659329-5901-4895-8a4c-9eb649291e97
⛔ Files ignored due to path filters (3)
api/v1alpha1/zz_generated.deepcopy.gois excluded by!**/zz_generated.deepcopy.goconfig/crd/bases/agentic.openshift.io_proposals.yamlis excluded by!config/crd/bases/**config/rbac/role.yamlis excluded by!config/rbac/role.yaml
📒 Files selected for processing (14)
.gitignoreapi/v1alpha1/proposal_types.gocontroller/proposal/agent.gocontroller/proposal/client.gocontroller/proposal/client_test.gocontroller/proposal/handlers.gocontroller/proposal/reconciler.gocontroller/proposal/reconciler_test.gocontroller/proposal/sandbox_agent.gocontroller/proposal/sandbox_agent_test.gocontroller/proposal/sandbox_templates.gocontroller/proposal/sandbox_templates_test.gocontroller/proposal/templates/analysis_query.tmplexamples/setup/09-intelliaide-proposals.yaml
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf.analysis) || (has(self.analysis) && self.analysis == oldSelf.analysis)",message="analysis is immutable once set" | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf.execution) || (has(self.execution) && self.execution == oldSelf.execution)",message="execution is immutable once set" | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf.verification) || (has(self.verification) && self.verification == oldSelf.verification)",message="verification is immutable once set" | ||
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf.dataSource) || (has(self.dataSource) && self.dataSource == oldSelf.dataSource)",message="dataSource is immutable once set" |
There was a problem hiding this comment.
Disallow adding dataSource after creation.
This rule still permits nil -> non-nil updates because the first branch is true whenever oldSelf.dataSource is unset. That breaks the “fixed at creation” contract and allows an in-flight Proposal to change which PVC gets mounted.
🛠️ Possible fix
-// +kubebuilder:validation:XValidation:rule="!has(oldSelf.dataSource) || (has(self.dataSource) && self.dataSource == oldSelf.dataSource)",message="dataSource is immutable once set"
+// +kubebuilder:validation:XValidation:rule="has(oldSelf.dataSource) ? (has(self.dataSource) && self.dataSource == oldSelf.dataSource) : !has(self.dataSource)",message="dataSource is immutable after creation"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // +kubebuilder:validation:XValidation:rule="!has(oldSelf.dataSource) || (has(self.dataSource) && self.dataSource == oldSelf.dataSource)",message="dataSource is immutable once set" | |
| // +kubebuilder:validation:XValidation:rule="has(oldSelf.dataSource) ? (has(self.dataSource) && self.dataSource == oldSelf.dataSource) : !has(self.dataSource)",message="dataSource is immutable after creation" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@api/v1alpha1/proposal_types.go` at line 297, The XValidation rule for the
dataSource field currently allows nil->non-nil updates; replace the rule so only
two cases are permitted: both old and new lack dataSource, or both have it and
it is equal. Update the kubebuilder annotation on the dataSource validation (the
comment containing XValidation) to use an expression such as
"(has(oldSelf.dataSource) && has(self.dataSource) && self.dataSource ==
oldSelf.dataSource) || (!has(oldSelf.dataSource) && !has(self.dataSource))" so
adding a dataSource after creation is disallowed.
| func addPVCVolume(tmpl *unstructured.Unstructured, volumeName, claimName string) error { | ||
| volumes, _, _ := unstructured.NestedSlice(tmpl.Object, "spec", "podTemplate", "spec", "volumes") | ||
| vol := map[string]any{ | ||
| "name": volumeName, | ||
| "persistentVolumeClaim": map[string]any{ | ||
| "claimName": claimName, | ||
| }, | ||
| } | ||
| for i, v := range volumes { | ||
| existing, ok := v.(map[string]any) | ||
| if !ok { | ||
| continue | ||
| } | ||
| if existing["name"] == volumeName { | ||
| volumes[i] = vol | ||
| return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes") | ||
| } | ||
| } | ||
| volumes = append(volumes, vol) | ||
| return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes") | ||
| } | ||
|
|
||
| func patchDataSource(tmpl *unstructured.Unstructured, ds *agenticv1alpha1.DataSource) error { | ||
| volName := "data-source" | ||
| if err := addPVCVolume(tmpl, volName, ds.ClaimName); err != nil { | ||
| return fmt.Errorf("add data source PVC volume: %w", err) | ||
| } | ||
| return addVolumeMount(tmpl, volName, dataSourceMountPath, true) | ||
| } |
There was a problem hiding this comment.
Avoid clobbering a pre-existing base-template volume named data-source.
patchDataSource always uses the hard-coded volume name data-source, and addPVCVolume overwrites any existing volume with that name. If the base SandboxTemplate already defines data-source for something else, this silently rewires that volume — and any mounts that reference it — to the proposal PVC.
💡 Safer direction
+const dataSourceVolumeName = "lightspeed-data-source"
+
func patchDataSource(tmpl *unstructured.Unstructured, ds *agenticv1alpha1.DataSource) error {
- volName := "data-source"
+ volName := dataSourceVolumeName
if err := addPVCVolume(tmpl, volName, ds.ClaimName); err != nil {
return fmt.Errorf("add data source PVC volume: %w", err)
}
return addVolumeMount(tmpl, volName, dataSourceMountPath, true)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func addPVCVolume(tmpl *unstructured.Unstructured, volumeName, claimName string) error { | |
| volumes, _, _ := unstructured.NestedSlice(tmpl.Object, "spec", "podTemplate", "spec", "volumes") | |
| vol := map[string]any{ | |
| "name": volumeName, | |
| "persistentVolumeClaim": map[string]any{ | |
| "claimName": claimName, | |
| }, | |
| } | |
| for i, v := range volumes { | |
| existing, ok := v.(map[string]any) | |
| if !ok { | |
| continue | |
| } | |
| if existing["name"] == volumeName { | |
| volumes[i] = vol | |
| return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes") | |
| } | |
| } | |
| volumes = append(volumes, vol) | |
| return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes") | |
| } | |
| func patchDataSource(tmpl *unstructured.Unstructured, ds *agenticv1alpha1.DataSource) error { | |
| volName := "data-source" | |
| if err := addPVCVolume(tmpl, volName, ds.ClaimName); err != nil { | |
| return fmt.Errorf("add data source PVC volume: %w", err) | |
| } | |
| return addVolumeMount(tmpl, volName, dataSourceMountPath, true) | |
| } | |
| func addPVCVolume(tmpl *unstructured.Unstructured, volumeName, claimName string) error { | |
| volumes, _, _ := unstructured.NestedSlice(tmpl.Object, "spec", "podTemplate", "spec", "volumes") | |
| vol := map[string]any{ | |
| "name": volumeName, | |
| "persistentVolumeClaim": map[string]any{ | |
| "claimName": claimName, | |
| }, | |
| } | |
| for i, v := range volumes { | |
| existing, ok := v.(map[string]any) | |
| if !ok { | |
| continue | |
| } | |
| if existing["name"] == volumeName { | |
| volumes[i] = vol | |
| return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes") | |
| } | |
| } | |
| volumes = append(volumes, vol) | |
| return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes") | |
| } | |
| const dataSourceVolumeName = "lightspeed-data-source" | |
| func patchDataSource(tmpl *unstructured.Unstructured, ds *agenticv1alpha1.DataSource) error { | |
| volName := dataSourceVolumeName | |
| if err := addPVCVolume(tmpl, volName, ds.ClaimName); err != nil { | |
| return fmt.Errorf("add data source PVC volume: %w", err) | |
| } | |
| return addVolumeMount(tmpl, volName, dataSourceMountPath, true) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@controller/proposal/sandbox_templates.go` around lines 523 - 551, The current
patchDataSource unconditionally uses the name "data-source" and addPVCVolume
will overwrite any existing volume with that name; modify addPVCVolume and
patchDataSource so they do not clobber an unrelated pre-existing volume: in
addPVCVolume, when finding an existing volume with the same name check whether
it already has a persistentVolumeClaim with the same claimName and only replace
if it matches; otherwise return a sentinel (or error) to signal a name conflict.
In patchDataSource, detect that conflict and instead pick a non-colliding name
(e.g., "data-source-<uniqueSuffix>") and use that name for both addPVCVolume and
addVolumeMount, or surface the error to the caller; reference the functions
addPVCVolume and patchDataSource and ensure the chosen name is threaded through
to addVolumeMount so mounts reference the correct volume.
| ## Skills | ||
|
|
||
| A specialist deep-RCA pipeline is available at `/app/skills/intelliaide/SKILL.md`. | ||
|
|
||
| Use it ONLY when the request calls for: | ||
| - Root cause analysis (RCA) or deeper / ML-assisted troubleshooting of a cluster issue | ||
| - Must-gather collection or analysis | ||
| - Investigating pod failures, etcd degradation, networking problems, storage issues, etc. | ||
| - Any request that explicitly mentions "deeper analysis", "deeper troubleshooting", "root cause", "RCA", "must-gather", or "IntelliAide" | ||
|
|
||
| For routine inspection (checking pod/node status, listing events, summarising resource state, | ||
| describing objects), use `kubectl`/`oc` commands directly — do NOT invoke the IntelliAide pipeline. | ||
|
|
||
| **Decision rule — apply before doing anything else:** | ||
| 1. Read the `## Request` section below. | ||
| 2. If it is a routine inspection query → proceed with `kubectl`/`oc` directly. | ||
| 3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command: | ||
| ``` | ||
| cat /app/skills/intelliaide/SKILL.md | ||
| ``` | ||
| If the command returns one or more paths, read the most relevant SKILL.md with `cat` | ||
| and follow its workflow **exactly** instead of the instructions below. | ||
| If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration. | ||
|
|
||
| ## Analysis requirements |
There was a problem hiding this comment.
Surface /data/input in the prompt when spec.dataSource is set.
This template never tells the agent that PVC-backed input may already be mounted at /data/input. The operator now mounts spec.dataSource there, but buildAnalysisQuery(...) only passes Request, HasExecution, and HasVerification, so PVC-backed RCA will still look like live-mode analysis and may ignore the supplied bundle entirely.
This needs a small contract change in controller/proposal/helpers.go plus a conditional instruction here, e.g. “input data is mounted read-only at /data/input; prefer analyzing that before collecting new live data.”
Also applies to: 27-42
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@controller/proposal/templates/analysis_query.tmpl` around lines 3 - 27, The
template fails to tell the agent that PVC-backed input may already be mounted at
/data/input when spec.dataSource is set, so update the contract and template:
modify buildAnalysisQuery in controller/proposal/helpers.go to accept a
flag/field (e.g., hasDataSource or includeMountedInput) derived from the
Request/spec.dataSource (so callers can indicate PVC-backed input is present),
and change controller/proposal/templates/analysis_query.tmpl to conditionally
include a short directive like "Input data is mounted read-only at /data/input;
prefer analyzing that bundle before collecting live data" when that flag is
true; ensure the helpers.go change sets the flag when spec.dataSource is present
so PVC-backed RCA is handled as bundle-first analysis.
| 3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command: | ||
| ``` | ||
| cat /app/skills/intelliaide/SKILL.md | ||
| ``` | ||
| If the command returns one or more paths, read the most relevant SKILL.md with `cat` | ||
| and follow its workflow **exactly** instead of the instructions below. | ||
| If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration. |
There was a problem hiding this comment.
Make the skill-loading branch internally consistent.
Line 21 uses cat /app/skills/intelliaide/SKILL.md, but Lines 23-25 talk about that command “returning one or more paths.” cat returns file contents or an error, so this branch is impossible as written and can derail the RCA routing logic.
Suggested prompt fix
3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command:
```
cat /app/skills/intelliaide/SKILL.md
```
- If the command returns one or more paths, read the most relevant SKILL.md with `cat`
- and follow its workflow **exactly** instead of the instructions below.
- If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration.
+ If that command fails because the file does not exist, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration.
+ Follow its workflow **exactly** instead of the instructions below.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| 3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command: | |
| ``` | |
| cat /app/skills/intelliaide/SKILL.md | |
| ``` | |
| If the command returns one or more paths, read the most relevant SKILL.md with `cat` | |
| and follow its workflow **exactly** instead of the instructions below. | |
| If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration. | |
| 3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command: |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@controller/proposal/templates/analysis_query.tmpl` around lines 19 - 25, The
skill-loading branch is inconsistent: "cat /app/skills/intelliaide/SKILL.md"
cannot "return one or more paths." Update the logic around the atomic command
(the literal string "cat /app/skills/intelliaide/SKILL.md" and references to
SKILL.md) so it is internally consistent — either change the command to a
listing/find command that can return paths (e.g., use a glob/find/ls conceptual
step) or change the prose to state that if the cat command fails because the
file does not exist you must stop and return a JSON error, and otherwise read
that SKILL.md and follow its workflow exactly; ensure the branch no longer
claims cat can return multiple paths.
| # IntelliAide triggering: | ||
| # The analysis prompt (analysis_query.tmpl) reads the request text and invokes | ||
| # the IntelliAide pipeline when the request contains RCA keywords such as | ||
| # "root cause analysis", "RCA", "deeper analysis", "must-gather", or "IntelliAide". | ||
| # All requests in this file are phrased to trigger IntelliAide automatically. | ||
| # | ||
| # Modes: | ||
| # Live mode (default) — IntelliAide reads live cluster state via kubectl/oc. | ||
| # Must-gather mode — set RUN_MUST_GATHER=1 env var, or phrase the request | ||
| # with "using must-gather" / "collect must-gather". | ||
| # |
There was a problem hiding this comment.
Add one concrete dataSource example to this manifest.
This PR introduces PVC-backed input, but every Proposal here is live-mode only and none show the required spec.dataSource.claimName shape. That makes the new /data/input workflow hard to discover from the only example file shipped with the feature.
A single example with:
dataSource:
claimName: <pre-existing-pvc>would make the new API immediately usable.
Also applies to: 46-188
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/setup/09-intelliaide-proposals.yaml` around lines 21 - 31, Add one
concrete example of the PVC-backed input to the IntelliAide proposals by
updating a Proposal's spec to include a dataSource block using the exact shape
spec.dataSource.claimName and a placeholder PVC name (e.g., <pre-existing-pvc>);
locate any Proposal objects in this file that currently run in live mode and add
a dataSource: claimName: <pre-existing-pvc> under their spec so the new
/data/input workflow and PVC-backed input are discoverable (ensure the YAML key
is spec.dataSource.claimName and that only the example uses a placeholder to
avoid changing runtime behavior).
| # 1. Build and push skills image: | ||
| # podman build -f Dockerfile.skills \ | ||
| # -t quay.io/<org>/intelliaide-skills:latest . | ||
| # podman push quay.io/<org>/intelliaide-skills:latest |
There was a problem hiding this comment.
Align the documented image build/push steps with the image each Proposal references.
Lines 33-36 tell users to publish quay.io/<org>/intelliaide-skills:latest, but every Proposal points at image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest. Following this file verbatim leaves the referenced image unavailable, so the examples do not work out-of-the-box.
Please either update the Proposal manifests to use the documented image, or add the missing retag/push step for the in-cluster registry.
Also applies to: 70-72, 101-103, 147-149, 183-185
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/setup/09-intelliaide-proposals.yaml` around lines 33 - 36, The docs
instruct building/pushing quay.io/<org>/intelliaide-skills:latest but all
Proposal manifests reference
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest,
so examples fail; either (A) update each Proposal image field to
quay.io/<org>/intelliaide-skills:latest (search for the string
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
in the Proposal resources and replace) or (B) add the missing retag/push steps
after the build: retag quay image to the in-cluster registry name and push it
(i.e., docker/podman tag quay.io/<org>/intelliaide-skills:latest
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
and push), and apply this fix for all occurrences noted (the other blocks at the
same pattern: lines ~70-72, 101-103, 147-149, 183-185).
| // +optional | ||
| // +kubebuilder:validation:Minimum=1 | ||
| // +kubebuilder:validation:Maximum=60 | ||
| TimeoutMinutes *int32 `json:"timeoutMinutes,omitempty"` |
There was a problem hiding this comment.
This should be on ProposalStep, not on ProposalSpec. Right now all four steps (analysis, execution, verification, escalation) get the same timeout — but IntelliAide RCA analysis might need 30 minutes while a verification step that runs kubectl get pods needs 2. A single proposal-level knob forces users to over-provision the timeout for every step just to accommodate the slowest one.
Would you consider moving it to ProposalStep? That would also align with how agent and tools are already scoped — per-step, not per-proposal.
| @@ -1,5 +1,31 @@ | |||
| You are an analysis agent. Your job is to diagnose the problem, determine the root cause, and propose one or more remediation options. Do NOT attempt to fix, patch, or execute any changes — only analyze and propose. | |||
|
|
|||
| ## Skills | |||
There was a problem hiding this comment.
The operator is skill-agnostic by design — it orchestrates Proposal lifecycles without knowing which skills are mounted. Hardcoding intelliaide references and keyword-routing here breaks that separation, and making skill absence fatal would break any Proposal that doesn't mount skills.
No changes to the analysis template are needed. spec.request already carries the user's intent to the agent — the adapter creating the Proposal can embed whatever context it needs there (e.g., "perform deep RCA on diagnostic data at /data/input"). The agent sees the request, sees the skills in its filesystem, and decides what to use. The operator shouldn't be involved in that decision.
There was a problem hiding this comment.
yeah this makes sense - so if we are triggering the proposal manually, this is not needed and if we want it to be automatic we need an adapter. either way this is not necessary.
| // pre-populated with data before the Proposal is created. The operator | ||
| // mounts it read-only at a well-known path (/data/input) accessible to | ||
| // all skills in the sandbox pod. | ||
| type DataSource struct { |
There was a problem hiding this comment.
Do we need PVC support here? The sandbox pods are ephemeral — if the agent needs must-gather data, it can collect it during its run (e.g., oc adm must-gather -o /tmp/must-gather) and work with it locally. The skill or spec.request can instruct the agent to do that.
This avoids adding new API surface (DataSource type, immutability rules, PVC RBAC, volume mount patching in sandbox templates) for something the agent can already do with its existing tools and a writable /tmp.
There was a problem hiding this comment.
@harche while that can be done, it restricts the ability to bring a must-gather from outside and analyze it, i.e customer cases.
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@sakshipatels98-byte: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| // | ||
| // Immutable: input data source is fixed at creation. | ||
| // +optional | ||
| DataSource *DataSource `json:"dataSource,omitzero"` |
There was a problem hiding this comment.
If you set add the DataSource here then all the steps (i.e. analyse, execution etc) would need mount that data source, but that may not be necessary in all use cases. Consider moving this to per-step, so data source is attached only to the steps that need it.
There was a problem hiding this comment.
Or maybe consider making it part of ToolsSpec above, that way it will be available at the proposal as well as step level.
Summary
Enable IntelliAide-style deep RCA by letting Proposals reference pre-loaded
diagnostic data and run long analysis steps in the sandbox.
spec.dataSource) — optional, immutable reference to apre-existing PVC in the Proposal namespace. The operator mounts it read-only
at
/data/inputin the sandbox pod (via sandbox template patching).spec.timeoutMinutes) — optional 1–60minutes per step (default 5). Used for analysis, execution, verification,
and escalation so long-running RCA can override the default.
analysis_query.tmplrequires skill discoveryfrom
/app/skillsbefore generic analysis; whendataSourceis set, tellsthe agent diagnostic data is available at
/data/input.get/list/watch; result CR*/statussub-resources.examples/setup/09-intelliaide-proposals.yamlwithdataSource,timeoutMinutes: 30, and analysis-only IntelliAide workflows.Companion PR:
agentic-skillsadds theintelliaide/skill pipeline thatreads from
/data/input.API Changes
spec.dataSource.claimNamespec.timeoutMinutesTest Plan
make testpasses (unit tests updated for new signatures)make manifestsproduces clean CRD with both new fieldsdataSource→ verify volume mounted at/data/inputtimeoutMinutes: 30→ operator waits 30mSummary by CodeRabbit
Release Notes
New Features
Documentation
Tests