Fix hangs in core SDK when monitor unavailable #7734

Merged: 2 commits into master on Aug 11, 2021

Conversation

@t0yv0 (Member) commented on Aug 10, 2021

Description

History

Problem

In the above PR we replaced sys.exit(1) with simply not raising an exception when the monitor service becomes unresponsive over gRPC, which is typical of shutdown scenarios. Unfortunately this means that the various exception handlers and try/finally blocks in register_resource and similar code do not fire in this case. Having read and stepped through the code, I do not think this code path is handled properly outside of the exception handlers. In particular, futures are left dangling, which can cause hangs if anything is waiting on those futures, including the generic wait_for_rpcs waiter.
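To make the failure mode concrete, here is a minimal asyncio sketch (not the SDK's actual code; the names do_register_like and resolve_urn are invented for illustration) of how a swallowed gRPC failure leaves a future dangling while a wait_for_rpcs-style waiter blocks on it:

import asyncio

async def do_register_like(resolve_urn: asyncio.Future, monitor_available: bool):
    try:
        if not monitor_available:
            # Pre-fix behavior: the gRPC failure is swallowed and we return
            # instead of raising, so the handler below never runs and
            # resolve_urn is never resolved.
            return
        resolve_urn.set_result("urn::example")
    except Exception as exn:
        # This only fires if something actually raises; resolving the future
        # with the exception is what normally unblocks waiters.
        resolve_urn.set_exception(exn)

async def main():
    resolve_urn = asyncio.get_running_loop().create_future()
    await do_register_like(resolve_urn, monitor_available=False)
    try:
        # Stand-in for the generic wait_for_rpcs waiter; bounded by a timeout
        # only so this demo terminates instead of hanging forever.
        await asyncio.wait_for(resolve_urn, timeout=1.0)
    except asyncio.TimeoutError:
        print("hang reproduced: the future was never resolved")

asyncio.run(main())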

The user reports a failing program in #7036 that triggers this condition:

import sys
import os

from pulumi import automation as auto
from pulumi.automation.errors import CommandError

import pulumi
import pulumi_gcp as gcp
import faulthandler

def do_park():
    project_id = os.environ.get("PARK_PROJECT_ID")

    sa = gcp.serviceaccount.Account(
        resource_name="long-long-long-long-long-long-resource-name",
        account_id="long-long-long-long-long-long-resource-name",
        display_name="long-long-long-long-long-long-resource-name",
        project=project_id,
    )

    return sa


def park_pulumi():
    faulthandler.enable()
    stack = auto.create_or_select_stack("parking_pulumi",
                                        project_name="parking_pulumi",
                                        program=do_park)
    try:
        result = stack.up(on_output=print)
        print("Successfully deployed stack")
        print(f"Deployment summary {result.summary}")
    except CommandError as command_error:
        print('-----' * 20)
        print(f"Error while deploy stack")
        print(f"Details: {command_error}")
        print("I am now parked")
        print('-----' * 20)


if __name__ == "__main__":
    park_pulumi()

In the register_resource call for gcp.serviceaccount.Account, futures are created that are never fulfilled, likely because an error in validating the Account props is detected elsewhere and causes the monitor gRPC server to terminate. Using the Automation API appears essential to producing this asynchrony.
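Roughly what this looks like, as a hypothetical sketch rather than SDK code: each output property of the resource is backed by a future that register_resource is expected to resolve or fail, and anything downstream that awaits those outputs blocks until that happens.

import asyncio

async def main():
    loop = asyncio.get_running_loop()
    # Stand-ins for the futures backing the Account resource's outputs.
    outputs = {"email": loop.create_future(), "unique_id": loop.create_future()}

    async def register_account(monitor_terminated: bool):
        # Simulates register_resource for the Account: the monitor rejects the
        # call (missing project) and shuts down; on the pre-fix code path we
        # return without resolving or failing the output futures.
        if monitor_terminated:
            return
        for name, fut in outputs.items():
            fut.set_result(f"value-of-{name}")

    await register_account(monitor_terminated=True)

    # Anything gathering outstanding work (the program's completion, apply
    # callbacks, the RPC waiter) now waits on futures nobody will resolve.
    done, pending = await asyncio.wait(outputs.values(), timeout=1.0)
    print(f"resolved: {len(done)}, dangling: {len(pending)}")

asyncio.run(main())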

Fix

The suggested fix is to throw RunError, which the SDK includes to represent "quiet" errors, in this situation. Raising an error triggers the usual exception handlers, the hangs go away, and the user receives a compact "error: Resource monitor has terminated, shutting down". Unfortunately, https://github.com/pulumi/pulumi/pull/6249/files comes without a test, so this change might accidentally regress what that PR was fixing.
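A minimal sketch of the idea, assuming a helper in the gRPC error path (the function name handle_monitor_error and the exact status-code check are illustrative, not the SDK's actual code):

import grpc
from pulumi import RunError

def handle_monitor_error(exn: grpc.RpcError) -> None:
    # When the monitor has shut down (typical during teardown), raise a "quiet"
    # RunError instead of returning silently. The raised error propagates into
    # the except/finally blocks of register_resource and friends, so their
    # futures get resolved and waiters such as wait_for_rpcs can finish.
    if exn.code() == grpc.StatusCode.UNAVAILABLE:
        raise RunError("Resource monitor has terminated, shutting down")
    raise exn

RunError terminates the program with a clean error message rather than the verbose unhandled-exception output, which is why the user sees the compact diagnostic quoted above.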

Issues

Fixes #7036

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • Yes, there are changes in this PR that warrant bumping the Pulumi Service API version

@t0yv0 (Member, Author) commented on Aug 10, 2021

[nix-shell:~/tmp/repro-7036/stack]$ pulumi preview
Previewing update (dev)

View Live: https://app.pulumi.com/t0yv0/stack/dev/previews/0ecc1808-de42-4d26-8a34-76367147c84a

     Type                 Name       Plan       Info
 +   pulumi:pulumi:Stack  stack-dev  create     1 error; 21 messages

Diagnostics:
  pulumi:pulumi:Stack (stack-dev):
    !!CommandError:
     code: 255
     stdout: Updating (parking_pulumi)
    View Live: https://app.pulumi.com/t0yv0/parking_pulumi/parking_pulumi/updates/46
    pulumi:pulumi:Stack parking_pulumi-parking_pulumi running
    +  gcp:serviceAccount:Account long-long-long-long-long-long-resource-name creating
    +  gcp:serviceAccount:Account long-long-long-long-long-long-resource-name creating error: 1 error occurred:
    +  gcp:serviceAccount:Account long-long-long-long-long-long-resource-name **creating failed** error: 1 error occurred:
    pulumi:pulumi:Stack parking_pulumi-parking_pulumi running error: update failed
    pulumi:pulumi:Stack parking_pulumi-parking_pulumi **failed** 1 error
    Diagnostics:
    gcp:serviceAccount:Account (long-long-long-long-long-long-resource-name):
    error: 1 error occurred:
    * project: required field is not set
    pulumi:pulumi:Stack (parking_pulumi-parking_pulumi):
    error: update failed
    Resources:
    1 unchanged
    Duration: 1s
     stderr:

    error: Resource monitor has terminated, shutting down

    error: an unhandled error occurred: Program exited with non-zero exit code: 1

@EvanBoyle (Contributor) left a comment

Nice find! Fix appears unsurprising, but I am not an expert on our python runtime. May want to get a second opinion (perhaps Justin?).

t0yv0 requested a review from justinvp on August 11, 2021 00:48
@justinvp (Member) left a comment

LGTM

t0yv0 merged commit e567b47 into master on Aug 11, 2021
t0yv0 deleted the t0yv0/fix-7710-b branch on August 11, 2021 14:50
t0yv0 added two commits referencing this pull request on Aug 11, 2021
Merging this pull request may close the following issue: #7036, "Pulumi freezes with automation API when a resource allocation fails with python".