Error: invalid memory address or nil pointer dereference #1553

Closed
martialblog opened this issue Mar 18, 2022 · 9 comments
@martialblog

Hi,

I'm running into an "invalid memory address or nil pointer dereference" error when a PyTorchJob on the cluster fails.

The PyTorchJob Pod runs into a Python exception, which then causes the training-operator Deployment to crash.

Training Operator Release: v1.3.0
Kubernetes: rke2 v1.21.10

PyTorchJob Manifest:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  creationTimestamp: "2022-03-18T13:10:53Z"
  generation: 1
  name: pytorch-job-name-hidden
  resourceVersion: "75160182"
  uid: d9fd2dce
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: some-private-pytorch-image
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                cpu: 4
                memory: 45000Mi
                nvidia.com/gpu: 1
              requests:
                cpu: 2
                memory: 45000Mi
            securityContext:
              allowPrivilegeEscalation: false
          imagePullSecrets:
          - name: gitlab-hidden
  runPolicy:
    ttlSecondsAfterFinished: 120
status:
  conditions:
  - lastTransitionTime: "2022-03-18T13:10:53Z"
    lastUpdateTime: "2022-03-18T13:10:53Z"
    message: PyTorchJob pytorch-job-name-hidden is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-03-18T13:10:54Z"
    lastUpdateTime: "2022-03-18T13:10:54Z"
    message: PyTorchJob pytorch-job-name-hidden is running.
    reason: JobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2022-03-18T13:15:46Z"
    lastUpdateTime: "2022-03-18T13:15:46Z"
    message: PyTorchJob pytorch-job-name-hidden is failed because 1 Master replica(s)
      failed.
    reason: JobFailed
    status: "True"
    type: Failed
  replicaStatuses:
    Master:
      failed: 1

Error Message:

time="2022-03-18T14:55:33Z" level=info msg="Reconciling for job pytorch-job-name-hidden"

E0318 14:55:33.703519       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 631 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16da180, 0x27a0b00)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:48 +0x82
panic(0x16da180, 0x27a0b00)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).CleanupJob(0xc000220240, 0xc0004bd798, 0xc002ad7a20, 0x3, 0x3, 0xc0005e9e90, 0x0, 0x0, 0x0, 0x18987c0, ...)
        /go/pkg/mod/github.com/kubeflow/common@v0.3.7/pkg/controller.v1/common/job.go:401 +0xbd
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc000220240, 0x18987c0, 0xc0004bd680, 0xc0005e8690, 0xc002ad7a20, 0x3, 0x3, 0xc0005e9e90, 0x0, 0x0, ...)
        /go/pkg/mod/github.com/kubeflow/common@v0.3.7/pkg/controller.v1/common/job.go:147 +0x76d
github.com/kubeflow/tf-operator/pkg/controller.v1/pytorch.(*PyTorchJobReconciler).Reconcile(0xc000220240, 0x1b88fa0, 0xc000919da0, 0xc000496360, 0x6, 0xc000c34680, 0x1c, 0xc000919da0, 0x40903b, 0xc000030000, ...)
        /workspace/pkg/controller.v1/pytorch/pytorchjob_controller.go:159 +0x83c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00039dd60, 0x1b88ee0, 0xc0003f2000, 0x1750a40, 0xc000ab0260)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00039dd60, 0x1b88ee0, 0xc0003f2000, 0x0)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1b88ee0, 0xc0003f2000)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000b2f750)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00169ff50, 0x1b46440, 0xc000a857d0, 0xc0003f2001, 0xc000a370e0)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000b2f750, 0x3b9aca00, 0x0, 0x1, 0xc000a370e0)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1b88ee0, 0xc0003f2000, 0xc00054fd30, 0x3b9aca00, 0x0, 0x1)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1b88ee0, 0xc0003f2000, 0xc00054fd30, 0x3b9aca00)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:195 +0x4f6
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x14a257d]

goroutine 631 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:55 +0x105
panic(0x16da180, 0x27a0b00)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).CleanupJob(0xc000220240, 0xc0004bd798, 0xc002ad7a20, 0x3, 0x3, 0xc0005e9e90, 0x0, 0x0, 0x0, 0x18987c0, ...)
        /go/pkg/mod/github.com/kubeflow/common@v0.3.7/pkg/controller.v1/common/job.go:401 +0xbd
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc000220240, 0x18987c0, 0xc0004bd680, 0xc0005e8690, 0xc002ad7a20, 0x3, 0x3, 0xc0005e9e90, 0x0, 0x0, ...)
        /go/pkg/mod/github.com/kubeflow/common@v0.3.7/pkg/controller.v1/common/job.go:147 +0x76d
github.com/kubeflow/tf-operator/pkg/controller.v1/pytorch.(*PyTorchJobReconciler).Reconcile(0xc000220240, 0x1b88fa0, 0xc000919da0, 0xc000496360, 0x6, 0xc000c34680, 0x1c, 0xc000919da0, 0x40903b, 0xc000030000, ...)
        /workspace/pkg/controller.v1/pytorch/pytorchjob_controller.go:159 +0x83c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00039dd60, 0x1b88ee0, 0xc0003f2000, 0x1750a40, 0xc000ab0260)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00039dd60, 0x1b88ee0, 0xc0003f2000, 0x0)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1b88ee0, 0xc0003f2000)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000b2f750)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00169ff50, 0x1b46440, 0xc000a857d0, 0xc0003f2001, 0xc000a370e0)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000b2f750, 0x3b9aca00, 0x0, 0x1, 0xc000a370e0)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1b88ee0, 0xc0003f2000, 0xc00054fd30, 0x3b9aca00, 0x0, 0x1)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1b88ee0, 0xc0003f2000, 0xc00054fd30, 0x3b9aca00)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:195 +0x4f6

I don't know if this is related to #1382 (that one seemed to be already fixed to me).

Let me know if you need further information.

Cheers,
Markus

@martialblog martialblog changed the title Errror: invalid memory address or nil pointer dereference Error: invalid memory address or nil pointer dereference Mar 18, 2022
@cheimu
Member

cheimu commented Mar 20, 2022

I think this is the reason:

[screenshot of the CleanupJob code in kubeflow/common]

As indicated in the comments in that code, jobStatus.CompletionTime is nil, so the nil pointer panic happens at finishTime.Add(duration).

And it has been fixed in the latest version, so you can try updating it. :)

[screenshot of the fixed code]
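
To make that concrete, here is a minimal, self-contained sketch of the failure pattern (illustrative names only, not the actual kubeflow/common code; the real field is a *metav1.Time, modeled here as a plain *time.Time): a job that fails before ever completing never gets a CompletionTime, so computing the TTL expiry dereferences a nil pointer unless it is guarded.

package main

import (
    "fmt"
    "time"
)

// jobStatus mirrors only the field relevant to the panic above; in the real
// API object CompletionTime is a *metav1.Time, kept as *time.Time here so the
// sketch has no external dependencies.
type jobStatus struct {
    CompletionTime *time.Time
}

// cleanupAfterTTL sketches the ttlSecondsAfterFinished handling. Without the
// nil guard, a failed job (CompletionTime never set) panics on the Add call,
// which is what the stack trace shows inside JobController.CleanupJob.
func cleanupAfterTTL(status jobStatus, ttlSeconds int64) error {
    if status.CompletionTime == nil {
        return fmt.Errorf("job has no completion time, skipping TTL cleanup")
    }
    expireTime := status.CompletionTime.Add(time.Duration(ttlSeconds) * time.Second)
    if time.Now().After(expireTime) {
        // the controller would delete the job here
    }
    return nil
}

func main() {
    // A failed job: CompletionTime was never set, so it is still nil.
    if err := cleanupAfterTTL(jobStatus{}, 120); err != nil {
        fmt.Println(err)
    }
}

The nil check before the Add call is, as far as I can tell, essentially what the fix shown in the second screenshot adds.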

@martialblog
Author

Hi, thanks for the update. Yes, it seems that might be the cause.

OK, so this was fixed in kubeflow/common#178, however that PR is from 26.11.2021 and the latest kubeflow/common release, v0.4.1, is from 24.11.2021.

And as far as I can tell, the kubeflow/training-operator release 1.4.0 still uses kubeflow/common v0.4.1.

In order to fix this, the kubeflow/common version needs to be bumped, right? Correct me if I am wrong.

Thanks,
Markus

@cheimu
Member

cheimu commented Mar 21, 2022

> Hi, thanks for the update. Yes, it seems that might be the cause.
>
> OK, so this was fixed in kubeflow/common#178, however that PR is from 26.11.2021 and the latest kubeflow/common release, v0.4.1, is from 24.11.2021.
>
> And as far as I can tell, the kubeflow/training-operator release 1.4.0 still uses kubeflow/common v0.4.1.
>
> In order to fix this, the kubeflow/common version needs to be bumped, right? Correct me if I am wrong.
>
> Thanks, Markus

I guess so. You can try updating it and running make docker-build to see if that works. I think updating package dependencies is a periodic task handled by bots...
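
For anyone who wants to try that before an official release, a rough sketch of the change would be to bump the kubeflow/common requirement in the operator's go.mod and rebuild the image. The module path below is taken from the stack trace above, and the version is only a placeholder; the exact path, version, and the other require entries depend on the tag you check out.

// go.mod (excerpt, illustrative) -- only the line that changes is shown
module github.com/kubeflow/tf-operator

require (
    // bump this from v0.4.1 to whichever kubeflow/common tag or commit
    // contains the fix from kubeflow/common#178 (version below is a placeholder)
    github.com/kubeflow/common v0.4.2
)

After that, go mod tidy plus the make docker-build target mentioned above should produce an image that includes the fix.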

@martialblog
Author

Yeah, I figured building an image myself would be a workaround.

It would be great to have a kubeflow/common v0.4.2 and kubeflow/training-operator 1.4.1 release, though.

@martialblog
Author

kubeflow/common is now at v0.4.3 (f554921); any chance we can get a training-operator minor release?

@johnugeorge
Member

Ref: #1622

@martialblog
Author

Awesome possum, looking forward to the release.

Still, I think the project should maybe consider bugfix releases in between major releases? Maybe there is or was already a discussion about that; just mentioning it.

Anywho, keep up the good work!

@johnugeorge
Member

You can try the RC release: #1622 (comment)

@martialblog
Author

Should be fixed in 1.5.0
