AppWrapper gets stuck when wrapped resource is a CRD that is not installed #276

@dgrove-oss

Describe the Bug

Create an AppWrapper around a resource whose CRD is not installed.
As expected, creation fails and the AppWrapper enters a terminal failed state.

Unfortunately, deleting the AppWrapper then gets stuck: the AppWrapper stays in the Terminating phase because the delete of the nonexistent resource fails with an unexpected error (no matches for kind "PyTorchJob" in version "kubeflow.org/v1").

AppWrapper status, from the output of kubectl get appwrapper -o yaml (excerpt):
                        cpu: 1
  status:
    componentStatus:
    - apiVersion: kubeflow.org/v1
      conditions:
      - lastTransitionTime: "2024-12-12T02:07:02Z"
        message: ""
        reason: ComponentCreationInitiated
        status: Unknown
        type: ResourcesDeployed
      kind: PyTorchJob
      name: pytorch-simple
      podSets:
      - path: template.spec.pytorchReplicaSpecs.Master.template
        replicas: 1
      - path: template.spec.pytorchReplicaSpecs.Worker.template
        replicas: 1
    conditions:
    - lastTransitionTime: "2024-12-12T02:07:02Z"
      message: Suspend is false
      reason: Resuming
      status: "True"
      type: QuotaReserved
    - lastTransitionTime: "2024-12-12T02:07:02Z"
      message: Suspend is false
      reason: Resuming
      status: "True"
      type: ResourcesDeployed
    - lastTransitionTime: "2024-12-12T02:07:02Z"
      message: Suspend is false
      reason: Resuming
      status: "False"
      type: PodsReady
    - lastTransitionTime: "2024-12-12T02:07:02Z"
      message: 'error creating components: no matches for kind "PyTorchJob" in version
        "kubeflow.org/v1"'
      reason: CreateFailed
      status: "True"
      type: Unhealthy
    - lastTransitionTime: "2024-12-12T02:07:02Z"
      message: ""
      reason: DeletionInitiated
      status: "True"
      type: DeletingResources
    phase: Terminating
kind: List
metadata:
  resourceVersion: ""
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:227
2024-12-12T02:10:17.473541557Z	ERROR	logr@v1.4.2/logr.go:301	Deletion error	{"controller": "AppWrapper", "controllerGroup": "workload.codeflare.dev", "controllerKind": "AppWrapper", "AppWrapper": {"name":"sample-pytorch-job","namespace":"default"}, "namespace": "default", "name": "sample-pytorch-job", "reconcileID": "936970f7-f7db-4a0c-b561-bad27b1dd2fe", "error": "no matches for kind \"PyTorchJob\" in version \"kubeflow.org/v1\""}
github.com/go-logr/logr.Logger.Error
	/go/pkg/mod/github.com/go-logr/logr@v1.4.2/logr.go:301
github.com/project-codeflare/appwrapper/internal/controller/appwrapper.(*AppWrapperReconciler).deleteComponents.func1
	/workspace/internal/controller/appwrapper/resource_management.go:371
github.com/project-codeflare/appwrapper/internal/controller/appwrapper.(*AppWrapperReconciler).deleteComponents
	/workspace/internal/controller/appwrapper/resource_management.go:386
github.com/project-codeflare/appwrapper/internal/controller/appwrapper.(*AppWrapperReconciler).Reconcile
	/workspace/internal/controller/appwrapper/appwrapper_controller.go:120
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:227
(base) dgrove@Dave's IBM Mac kueue % kubectl get appwrapper -o yaml 

Labels: bug
