Debug runtime issues rev 1#6
Conversation
…kload status handling
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 366c97f9bb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if delErr := rt.Delete(ctx, workload.ID); delErr != nil { | ||
| m.logger.Warnf("Failed to delete workload during cleanup: %v", delErr) | ||
| } | ||
| m.cleanupManagedStorageAllocations(ctx, managedAllocations, true) |
There was a problem hiding this comment.
Keep managed storage usable for retry recreate
When rt.Start fails after rt.Create succeeded for a workload with managed volumes, this branch detaches/deletes the freshly provisioned allocations while leaving the workload persisted for reconciliation. The retry path later recreates from the saved spec without rerunning managed-storage preparation, so it can reuse stale host paths/devices that were just removed. For start failures on managed-volume workloads, either keep the allocations until retry or ensure the retry path reprovisions them.
Useful? React with 👍 / 👎.
| attachment, err := m.volumeMgr.Attach(ctx, platformSpec, handle, workload.ID) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("attach volume %q (%s): %w", spec.Name, spec.Driver, err) |
There was a problem hiding this comment.
Roll back provisioned volumes when attach fails
If Attach fails after Provision has succeeded, this returns immediately without calling the rollback helper, leaving the saved handle and backing directory/image behind; with multiple managed volumes it also leaves any earlier attachments from the same loop. This can happen on stage-directory permission errors or later-volume failures, so the partial allocations should be cleaned up before returning the storage setup error.
Useful? React with 👍 / 👎.
| if retainPolicy == "retain" { | ||
| continue | ||
| } | ||
| if err := m.volumeMgr.Delete(ctx, handle); err != nil { |
There was a problem hiding this comment.
Do not delete volumes still attached elsewhere
Because volume handles are keyed only by driver:name, two workloads that request the same managed volume share the same stored handle, but deleting one workload deletes the handle and backing storage immediately after removing only that workload's attachment. For the local provider this Delete removes the shared directory, which can destroy data for another still-running workload; check for other attachments to the same volume before deleting it.
Useful? React with 👍 / 👎.
| code = strings.TrimSpace(status.Metadata["failure_reason"]) | ||
| message = strings.TrimSpace(status.Metadata["failure_message"]) | ||
| retryable = strings.EqualFold(strings.TrimSpace(status.Metadata["retryable"]), "true") | ||
| if rawNext := strings.TrimSpace(status.Metadata["retry_next_at"]); rawNext != "" { |
There was a problem hiding this comment.
Read the retry timestamp key that is written
The workload manager records retry delays under next_retry_time in the status metadata, but the heartbeat reason conversion looks for retry_next_at, so ReasonDetail.NextRetryAt is never populated for deferred retries. This makes the control plane miss the scheduled retry time even though it is present in the agent status metadata.
Useful? React with 👍 / 👎.
Summary
This release adds support for managed volume storage orchestration, enhanced workload telemetry, and improved error handling aligned with scheduler platform extensions. Compute-agent now supports NFS and Ceph-RBD managed volumes for containers and VMs, with proper lifecycle management and storage driver capability advertisement.
Major Features
Managed Volume Support
ManagedVolumes[]ManagedVolumeSpecto container and VM workload specslocal- Host bind paths (existing behavior)nfs- NFS server mountsceph-rbd- Ceph RBD block devicesStorage Driver Capability Advertisement
SupportedStorageDrivers[]field populated in heartbeatEnhanced Workload Telemetry
WorkloadUsagemodel with CPU%, memory, disk I/O, and network metricsImproved Failure Diagnostics
WorkloadReasonwith structured code, message, last transition, and next retry metadataCloud-Init Enhancement (VM)
CloudInitConfigwith separate fields:user_data- User-provided cloud-init scriptmeta_data- Cloud-init metadatanetwork_config- Network configuration (optional)vendor_data- Vendor data (optional)Breaking Changes
None. All changes are backward compatible.
Deprecations
cloudInitfield deprecated in favor of structuredCloudInitConfigChanged Files
api/proto/agent.proto (updated)
ManagedVolumeSpecmessage typeManagedVolumesfield toContainerSpecManagedVolumesfield toVMSpecWorkloadUsageandWorkloadReasonmessage typesinternal/models/workload.go (updated)
ManagedVolumes[]ManagedVolumeSpecto container specManagedVolumes[]ManagedVolumeSpecto VM specWorkloadUsagestruct for telemetryWorkloadReasonstruct for structured failuresinternal/control/client.go (updated)
SupportedStorageDriversinternal/runtime/docker.go (updated)
internal/runtime/vm.go (updated)
meta-data,network-config,vendor-datafilesinternal/workload/manager.go (updated)
pkg/api/v1/ (regenerated)
Resource Impact
Storage Overhead:
Network:
Backward Compatibility:
ManagedVolumescontinue to workCloudInitstill supported alongside structured config