Fix hangs in core SDK when monitor unavailable #7734

Merged: 2 commits into master on Aug 11, 2021

Conversation

@t0yv0 (Member) commented on Aug 10, 2021

Description

History

Problem

In the above PR we replaced sys.exit(1) with simply not raising an exception when the monitor service becomes unresponsive over gRPC, which is typical of shutdown scenarios. Unfortunately this means that the various exception handlers and try/finally blocks in register_resource and similar code do not fire in this case. Having read and stepped through the code, I do not think this code path is handled properly outside of the exception handlers. In particular, futures are left dangling, which can cause hangs if anything is waiting on those futures, including the generic wait_for_rpcs waiter.
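To make the failure mode concrete, here is a minimal asyncio sketch (not the SDK's actual code; the names do_register_like and resolve_urn are invented for illustration) of how a swallowed gRPC failure leaves a future dangling while a wait_for_rpcs-style waiter blocks on it:

import asyncio

async def do_register_like(resolve_urn: asyncio.Future, monitor_available: bool):
    try:
        if not monitor_available:
            # Pre-fix behavior: the gRPC failure is swallowed and we return
            # instead of raising, so the handler below never runs and
            # resolve_urn is never resolved.
            return
        resolve_urn.set_result("urn::example")
    except Exception as exn:
        # This only fires if something actually raises; resolving the future
        # with the exception is what normally unblocks waiters.
        resolve_urn.set_exception(exn)

async def main():
    resolve_urn = asyncio.get_running_loop().create_future()
    await do_register_like(resolve_urn, monitor_available=False)
    try:
        # Stand-in for the generic wait_for_rpcs waiter; bounded by a timeout
        # only so this demo terminates instead of hanging forever.
        await asyncio.wait_for(resolve_urn, timeout=1.0)
    except asyncio.TimeoutError:
        print("hang reproduced: the future was never resolved")

asyncio.run(main())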

The user reports a failing program in #7036 that triggers this condition:

import sys
import os

from pulumi import automation as auto
from pulumi.automation.errors import CommandError

import pulumi
import pulumi_gcp as gcp
import faulthandler

def do_park():
    project_id = os.environ.get("PARK_PROJECT_ID")

    sa = gcp.serviceaccount.Account(
        resource_name="long-long-long-long-long-long-resource-name",
        account_id="long-long-long-long-long-long-resource-name",
        display_name="long-long-long-long-long-long-resource-name",
        project=project_id,
    )

    return sa


def park_pulumi():
    faulthandler.enable()
    stack = auto.create_or_select_stack("parking_pulumi",
                                        project_name="parking_pulumi",
                                        program=do_park)
    try:
        result = stack.up(on_output=print)
        print("Successfully deployed stack")
        print(f"Deployment summary {result.summary}")
    except CommandError as command_error:
        print('-----' * 20)
        print(f"Error while deploy stack")
        print(f"Details: {command_error}")
        print("I am now parked")
        print('-----' * 20)


if __name__ == "__main__":
    park_pulumi()

In the register_resource call for gcp.serviceaccount.Account, futures are created that are never fulfilled, likely because an error in validating the Account props is detected elsewhere and causes the monitor gRPC server to terminate. Using the Automation API appears essential to producing this asynchrony.
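Roughly what this looks like, as a hypothetical sketch rather than SDK code: each output property of the resource is backed by a future that register_resource is expected to resolve or fail, and anything downstream that awaits those outputs blocks until that happens.

import asyncio

async def main():
    loop = asyncio.get_running_loop()
    # Stand-ins for the futures backing the Account resource's outputs.
    outputs = {"email": loop.create_future(), "unique_id": loop.create_future()}

    async def register_account(monitor_terminated: bool):
        # Simulates register_resource for the Account: the monitor rejects the
        # call (missing project) and shuts down; on the pre-fix code path we
        # return without resolving or failing the output futures.
        if monitor_terminated:
            return
        for name, fut in outputs.items():
            fut.set_result(f"value-of-{name}")

    await register_account(monitor_terminated=True)

    # Anything gathering outstanding work (the program's completion, apply
    # callbacks, the RPC waiter) now waits on futures nobody will resolve.
    done, pending = await asyncio.wait(outputs.values(), timeout=1.0)
    print(f"resolved: {len(done)}, dangling: {len(pending)}")

asyncio.run(main())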

Fix

The suggested fix is to throw RunError, which the SDK includes to represent "quiet" errors, in this situation. Raising an error triggers the usual exception handlers, the hangs go away, and the user receives a compact "error: Resource monitor has terminated, shutting down". Unfortunately, https://github.com/pulumi/pulumi/pull/6249/files comes without a test, so this change might accidentally regress what that PR was fixing.
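A minimal sketch of the idea, assuming a helper in the gRPC error path (the function name handle_monitor_error and the exact status-code check are illustrative, not the SDK's actual code):

import grpc
from pulumi import RunError

def handle_monitor_error(exn: grpc.RpcError) -> None:
    # When the monitor has shut down (typical during teardown), raise a "quiet"
    # RunError instead of returning silently. The raised error propagates into
    # the except/finally blocks of register_resource and friends, so their
    # futures get resolved and waiters such as wait_for_rpcs can finish.
    if exn.code() == grpc.StatusCode.UNAVAILABLE:
        raise RunError("Resource monitor has terminated, shutting down")
    raise exn

RunError terminates the program with a clean error message rather than the verbose unhandled-exception output, which is why the user sees the compact diagnostic quoted above.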

Issues

Fixes #7036

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • Yes, there are changes in this PR that warrant bumping the Pulumi Service API version

@t0yv0 (Member, Author) commented on Aug 10, 2021

[nix-shell:~/tmp/repro-7036/stack]$ pulumi preview
Previewing update (dev)

View Live: https://app.pulumi.com/t0yv0/stack/dev/previews/0ecc1808-de42-4d26-8a34-76367147c84a

     Type                 Name       Plan       Info
 +   pulumi:pulumi:Stack  stack-dev  create     1 error; 21 messages

Diagnostics:
  pulumi:pulumi:Stack (stack-dev):
    !!CommandError:
     code: 255
     stdout: Updating (parking_pulumi)
    View Live: https://app.pulumi.com/t0yv0/parking_pulumi/parking_pulumi/updates/46
    pulumi:pulumi:Stack parking_pulumi-parking_pulumi running
    +  gcp:serviceAccount:Account long-long-long-long-long-long-resource-name creating
    +  gcp:serviceAccount:Account long-long-long-long-long-long-resource-name creating error: 1 error occurred:
    +  gcp:serviceAccount:Account long-long-long-long-long-long-resource-name **creating failed** error: 1 error occurred:
    pulumi:pulumi:Stack parking_pulumi-parking_pulumi running error: update failed
    pulumi:pulumi:Stack parking_pulumi-parking_pulumi **failed** 1 error
    Diagnostics:
    gcp:serviceAccount:Account (long-long-long-long-long-long-resource-name):
    error: 1 error occurred:
    * project: required field is not set
    pulumi:pulumi:Stack (parking_pulumi-parking_pulumi):
    error: update failed
    Resources:
    1 unchanged
    Duration: 1s
     stderr:

    error: Resource monitor has terminated, shutting down

    error: an unhandled error occurred: Program exited with non-zero exit code: 1

@EvanBoyle (Contributor) left a comment

Nice find! Fix appears unsurprising, but I am not an expert on our python runtime. May want to get a second opinion (perhaps Justin?).

t0yv0 requested a review from justinvp on August 11, 2021 00:48
@justinvp (Member) left a comment

LGTM

t0yv0 merged commit e567b47 into master on Aug 11, 2021
t0yv0 deleted the t0yv0/fix-7710-b branch on August 11, 2021 14:50
t0yv0 added two commits referencing this pull request on Aug 11, 2021
Merging this pull request may close the following issue: #7036, "Pulumi freezes with automation API when a resource allocation fails with python".