Job with unreachable cluster causes a panic #3022

Closed
sinhalvi opened this issue May 23, 2024 · 1 comment · Fixed by #3024
Assignees
rquitales

Labels
impact/panic This bug represents a panic or unexpected crash
kind/bug Some behavior is incorrect or out of spec
p1 A bug severe enough to be the next item assigned to an engineer
resolution/fixed This issue was fixed

Comments


sinhalvi commented May 23, 2024

What happened?

Hi, I suddenly see this error when I run `pulumi up` on a stack:

panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x2daee4d]
    goroutine 103 [running]:
    github.com/pulumi/pulumi-kubernetes/provider/v4/pkg/clients.(*DynamicClientSet).ResourceClient(0xc0005f3300, {{0xc000c2ddd8, 0x5}, {0xc000c2ddde, 0x2}, {0xc000c2ddf4, 0x3}}, {0xc000a96068, 0x4})
        /home/runner/work/pulumi-kubernetes/pulumi-kubernetes/provider/pkg/clients/clients.go:95 +0x10d
    github.com/pulumi/pulumi-kubernetes/provider/v4/pkg/clients.(*DynamicClientSet).ResourceClientForObject(0xc0005f3300, 0xc00087fd40)
        /home/runner/work/pulumi-kubernetes/pulumi-kubernetes/provider/pkg/clients/clients.go:118 +0x11f
    github.com/pulumi/pulumi-kubernetes/provider/v4/pkg/provider.(*kubeProvider).readLiveObject(0xc000628000, 0xc00087fd40)
        /home/runner/work/pulumi-kubernetes/pulumi-kubernetes/provider/pkg/provider/provider.go:2680 +0xf8
    github.com/pulumi/pulumi-kubernetes/provider/v4/pkg/provider.(*kubeProvider).Diff(0xc000628000, {0x59c0b50, 0xc0009902a0}, 0xc000581400)
        /home/runner/work/pulumi-kubernetes/pulumi-kubernetes/provider/pkg/provider/provider.go:1749 +0x13fe
    github.com/pulumi/pulumi/sdk/v3/proto/go._ResourceProvider_Diff_Handler.func1({0x59c0b50?, 0xc0009902a0?}, {0x5364c60?, 0xc000581400?})
        /home/runner/go/pkg/mod/github.com/pulumi/pulumi/sdk/v3@v3.114.0/proto/go/provider_grpc.pb.go:575 +0xcb
    github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.OpenTracingServerInterceptor.func1({0x59c0b50, 0xc0009b3320}, {0x5364c60, 0xc000581400}, 0xc0000b4fa0, 0xc000011d40)
        /home/runner/go/pkg/mod/github.com/grpc-ecosystem/grpc-opentracing@v0.0.0-20180507213350-8e809c8a8645/go/otgrpc/server.go:57 +0x3db
    github.com/pulumi/pulumi/sdk/v3/proto/go._ResourceProvider_Diff_Handler({0x58c95e0, 0xc000628000}, {0x59c0b50, 0xc0009b3320}, 0xc000581380, 0xc000c24660)
        /home/runner/go/pkg/mod/github.com/pulumi/pulumi/sdk/v3@v3.114.0/proto/go/provider_grpc.pb.go:577 +0x143
    google.golang.org/grpc.(*Server).processUnaryRPC(0xc0005ce000, {0x59c0b50, 0xc0009b3290}, {0x5a12540, 0xc0002a0600}, 0xc0009ba5a0, 0xc000c4ca80, 0x9189388, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1369 +0xdf8
    google.golang.org/grpc.(*Server).handleStream(0xc0005ce000, {0x5a12540, 0xc0002a0600}, 0xc0009ba5a0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1780 +0xe8b
    google.golang.org/grpc.(*Server).serveStreams.func2.1()
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1019 +0x8b
    created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 55
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1030 +0x125

Example

I am using the Job resource from:

import { Job } from '@pulumi/kubernetes/batch/v1'
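
For reference, a minimal program of the shape involved might look like the sketch below. This is illustrative only, not the reporter's actual program (which is in TypeScript); it uses the Kubernetes Go SDK, includes the pulumi.com/replaceUnready annotation that the triage below identifies as part of the trigger, and all resource names and images are hypothetical.

```go
package main

import (
	batchv1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/batch/v1"
	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/meta/v1"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// A Job carrying the pulumi.com/replaceUnready annotation. Diffing a
		// resource like this against an unreachable cluster is what triggered
		// the panic reported in this issue.
		_, err := batchv1.NewJob(ctx, "example-job", &batchv1.JobArgs{
			Metadata: &metav1.ObjectMetaArgs{
				Annotations: pulumi.StringMap{
					"pulumi.com/replaceUnready": pulumi.String("true"),
				},
			},
			Spec: &batchv1.JobSpecArgs{
				Template: &corev1.PodTemplateSpecArgs{
					Spec: &corev1.PodSpecArgs{
						RestartPolicy: pulumi.String("Never"),
						Containers: corev1.ContainerArray{
							&corev1.ContainerArgs{
								Name:    pulumi.String("worker"),
								Image:   pulumi.String("busybox"),
								Command: pulumi.StringArray{pulumi.String("true")},
							},
						},
					},
				},
			},
		})
		return err
	})
}
```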

Output of pulumi about

It was working before, and I have no changes in code or packages. I did upgrade Pulumi and a few other tools. This is my env:
CLI
Version 3.116.1
Go Version go1.22.3
Go Compiler gc

Plugins
KIND NAME VERSION
resource aws 6.18.0
resource docker 4.5.1
resource kubernetes 4.12.0
language nodejs unknown

Host
OS darwin
Version 12.6.8
Arch x86_64

This project is written in nodejs: executable='/usr/local/bin/node' version='v21.7.2'

Current Stack:

OPP TYPE URN
Backend
Name 2030009945
URL s3://
User sindhu.halvi
Organizations
Token type personal

Dependencies:
NAME VERSION
@pulumi/kubernetes 4.12.0
@pulumi/pulumi 3.105.0
@types/node 18.19.6
ts-deepmerge 6.2.0
ts-node 10.9.2
typescript 5.3.3
@pulumi/aws 6.18.0
@pulumi/docker 4.5.1
prettier 3.2.1

Additional context

One of the error logs shows this:

Log file created at: 2024/05/22 17:59:21
Running on machine: 2030009945
Binary: Built with gc go1.22.3 for darwin/amd64
Previous log: <none>
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0522 17:59:21.374242   14450 plugins.go:431] GitHub rate limit exceeded for https://api.github.com/repos/pulumi/pulumi-kubernetes/releases/tags/v4.12.0, try again in 9m54.625765s. You can set GITHUB_TOKEN to make an authenticated request with a higher rate limit.
pulumi.2030009945.sindhu_halvi.log.ERROR.20240522-175921.14450 (END)

I have tried everything from updating to re-installing; nothing solves this issue.

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@sinhalvi sinhalvi added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels May 23, 2024
@rquitales
Contributor

Thanks for reporting this error @sinhalvi, and apologies that you're facing it. After tracing through the logs, it appears that this error occurs in a very specific scenario. All of the following must hold for the panic to occur:

  1. The cluster must be unreachable (perhaps due to a malformed kubeconfig). This results in the k8s clients being nil. We return a nil client since the provider is still useful for YAML rendering during the preview operation (ref: https://github.com/pulumi/pulumi-kubernetes/blob/fa7330c3b4db20f1914e10e020b46a1b72af7f66/provider/pkg/clients/clients.go#L62).
  2. The Job resource defined in the Pulumi program must have the `pulumi.com/replaceUnready` annotation set to true.
  3. When we diff the resources to determine whether a replace or update operation should be done, we attempt to hit the live cluster:
     `if live, err := k.readLiveObject(oldLive); err == nil {`
     This results in the panic, since our clients are nil due to the unreachable cluster.

To address this, we will need to add a clusterUnreachable check prior to checking the live cluster for the job's status; a sketch of such a guard follows.
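
Below is a minimal, self-contained sketch of that guard. The types are stand-ins, not the provider's real DynamicClientSet or kubeProvider (see #3024 for the actual patch); the point is to show why a live read through a nil client set panics, and how a clusterUnreachable short-circuit avoids it.

```go
package main

import "fmt"

// clientSet stands in for the provider's DynamicClientSet. The field
// dereference in resourceClient is what makes a nil receiver panic,
// mirroring the nil pointer dereference in the stack trace above.
type clientSet struct {
	apiServer string
}

func (c *clientSet) resourceClient(kind string) string {
	return c.apiServer + "/" + kind // panics when c is nil
}

// provider stands in for kubeProvider: clients is nil when the cluster
// was unreachable at configure time.
type provider struct {
	clients            *clientSet
	clusterUnreachable bool
}

func (p *provider) readLiveObject(kind string) (string, error) {
	return p.clients.resourceClient(kind), nil
}

// diff applies the guard described above: skip the live read entirely
// when the cluster is unreachable and fall back to the last-known state.
func (p *provider) diff(kind string) string {
	if p.clusterUnreachable {
		return "diff " + kind + " against saved state (cluster unreachable)"
	}
	if live, err := p.readLiveObject(kind); err == nil {
		return "diff " + kind + " against " + live
	}
	return "diff " + kind + " against saved state"
}

func main() {
	p := &provider{clients: nil, clusterUnreachable: true}
	// Without the clusterUnreachable check, this call would panic inside
	// readLiveObject, just like Diff does in the reported trace.
	fmt.Println(p.diff("batch/v1/Job"))
}
```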

@rquitales rquitales added p1 A bug severe enough to be the next item assigned to an engineer impact/panic This bug represents a panic or unexpected crash and removed needs-triage Needs attention from the triage team labels May 24, 2024
@rquitales rquitales self-assigned this May 24, 2024
@rquitales rquitales changed the title from "panic: runtime error" to "Job with unreachable cluster causes a panic" May 24, 2024
rquitales added a commit that referenced this issue May 28, 2024
…3024)

### Proposed changes
This PR ensures that we do not make a k8s API request during the
provider's diff when the cluster is unreachable. The request currently
occurs when the Pulumi program contains a Job resource with the
`replaceUnready` annotation set to true; a panic follows if we attempt
the API call, since our clients are nil.

#### Testing done:
1. Created a repro test case that fails with a panic
(https://github.com/pulumi/pulumi-kubernetes/actions/runs/9228447658/job/25392833842?pr=3024)
2. Added logic to prevent the panic, and test passes subsequently
without intervention
(https://github.com/pulumi/pulumi-kubernetes/actions/runs/9228685506/job/25393667599?pr=3024)
3. Manual validation to ensure the panic isn't triggered.

### Related issues (optional)

Fixes: #3022
@pulumi-bot pulumi-bot added the resolution/fixed This issue was fixed label May 28, 2024