
[Automation API] - asyncio errors in python inline programs #6228

Closed
komalali opened this issue Jan 30, 2021 · 8 comments · Fixed by #6249
Labels
area/automation-api kind/bug Some behavior is incorrect or out of spec

@komalali

There is an issue with error handling in Python inline programs.

Expected Behavior

The error is returned gracefully, without asyncio error messages.

Current Behavior

asyncio errors with `Task exception was never retrieved`.

Full output:

pulumi:pulumi:Stack bigquery-dev running
+  gcp:organizations:Project project creating
+  gcp:organizations:Project project created
+  gcp:serviceAccount:Account sa creating
+  gcp:serviceAccount:Account sa created
+  gcp:serviceAccount:Key key creating
+  gcp:projects:IAMBinding permissions creating
+  gcp:projects:IAMBinding permissions creating error: 1 error occurred:
+  gcp:projects:IAMBinding permissions **creating failed** error: 1 error occurred:
+  gcp:serviceAccount:Key key created
Task exception was never retrieved
future: <Task finished name='Task-92' coro=<RPCManager.do_rpc.<locals>.rpc_wrapper() done, defined at /Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/runtime/rpc_manager.py:61> exception=<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1612033464.773224000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1612033464.772631000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>>
Traceback (most recent call last):
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/runtime/rpc_manager.py", line 67, in rpc_wrapper
    result = await rpc
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/runtime/resource.py", line 495, in do_register
    rpc.resolve_outputs(res, resolver.serialized_props, resp.object, deps, resolvers)
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/runtime/rpc.py", line 594, in resolve_outputs
    log.debug(f"incoming output property translated: {key} -> {translated_key}")
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/log.py", line 40, in debug
    _log(engine, engine_pb2.DEBUG, msg, resource, stream_id, ephemeral)
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/log.py", line 111, in _log
    engine.Log(req)
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1612033464.772633000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1612033464.772631000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/runtime/rpc_manager.py", line 70, in rpc_wrapper
    log.debug("RPC failed with exception:")
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/log.py", line 40, in debug
    _log(engine, engine_pb2.DEBUG, msg, resource, stream_id, ephemeral)
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/log.py", line 111, in _log
    engine.Log(req)
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/Users/komal/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1612033464.773224000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1612033464.772631000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>
pulumi:pulumi:Stack bigquery-dev running error: update failed
pulumi:pulumi:Stack bigquery-dev **failed** 1 error

Diagnostics:
gcp:projects:IAMBinding (permissions):
error: 1 error occurred:
* Request "Set IAM Binding for role \"roles/owner\" on \"project \\\"bigquery-project-1059587\\\"\"" returned error: Error applying IAM policy for project "bigquery-project-1059587": Error setting IAM policy for project "bigquery-project-1059587": googleapi: Error 400: Request contains an invalid argument., badRequest

pulumi:pulumi:Stack (bigquery-dev):
error: update failed

Resources:
+ 3 created
1 unchanged

Duration: 1m33s

---------------------------------------------------------------------------
CommandError                              Traceback (most recent call last)
<ipython-input-6-da22296ae34d> in <module>
----> 1 bigquery = deploy_project("bigquery",
      2                           gcp_service_account,
      3                           plugins=[("gcp", "v4.9.1")])

<ipython-input-2-87b8cd4dbe5f> in deploy_project(project_name, program, plugins, config)
     27     stack.refresh(on_output=print)
     28 
---> 29     stack.up(on_output=print)
     30 
     31     return stack

~/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/x/automation/stack.py in up(self, parallel, message, target, expect_no_changes, target_dependents, replace, on_output, program)
    273 
    274         try:
--> 275             up_result = self._run_pulumi_cmd_sync(args, on_output)
    276             outputs = self.outputs()
    277             summary = self.info()

~/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/x/automation/stack.py in _run_pulumi_cmd_sync(self, args, on_output)
    538         args.extend(additional_args)
    539 
--> 540         result = _run_pulumi_cmd(args, self.workspace.work_dir, envs, on_output)
    541         self.workspace.post_command_callback(self.name)
    542         return result

~/go/src/github.com/pulumi/automation-api-examples/python/pulumi_via_jupyter/venv/lib/python3.8/site-packages/pulumi/x/automation/cmd.py in _run_pulumi_cmd(args, cwd, additional_env, on_output)
     77     result = CommandResult(stderr=stderr_contents, stdout='\n'.join(stdout_chunks), code=code)
     78     if code != 0:
---> 79         raise create_command_error(result)
     80 
     81     return result

CommandError: 
 code: 255
 stdout: Updating (dev)

View Live: https://app.pulumi.com/komalali/bigquery/dev/updates/12


pulumi:pulumi:Stack bigquery-dev running
+  gcp:organizations:Project project creating
+  gcp:organizations:Project project created
+  gcp:serviceAccount:Account sa creating
+  gcp:serviceAccount:Account sa created
+  gcp:serviceAccount:Key key creating
+  gcp:projects:IAMBinding permissions creating
+  gcp:projects:IAMBinding permissions creating error: 1 error occurred:
+  gcp:projects:IAMBinding permissions **creating failed** error: 1 error occurred:
+  gcp:serviceAccount:Key key created
pulumi:pulumi:Stack bigquery-dev running error: update failed
pulumi:pulumi:Stack bigquery-dev **failed** 1 error

Diagnostics:
gcp:projects:IAMBinding (permissions):
error: 1 error occurred:
* Request "Set IAM Binding for role \"roles/owner\" on \"project \\\"bigquery-project-1059587\\\"\"" returned error: Error applying IAM policy for project "bigquery-project-1059587": Error setting IAM policy for project "bigquery-project-1059587": googleapi: Error 400: Request contains an invalid argument., badRequest

pulumi:pulumi:Stack (bigquery-dev):
error: update failed

Resources:
+ 3 created
1 unchanged

Duration: 1m33s

 stderr: 

@komalali komalali added the kind/bug Some behavior is incorrect or out of spec label Jan 30, 2021
@komalali komalali changed the title [Automation API] - Runtime errors in python inline programs [Automation API] - asyncio errors in python inline programs Jan 30, 2021
@EvanBoyle

@komalali I think we need to turn off logging for asyncio, in accordance with this comment and snippet in the Python shim:

# We are (unfortunately) suppressing the log output of asyncio to avoid showing to users some of the bad things we
# do in our programming model.
#
# Fundamentally, Pulumi is a way for users to build asynchronous dataflow graphs that, as their deployments
# progress, resolve naturally and eventually result in the complete resolution of the graph. If one node in the
# graph fails (i.e. a resource fails to create, there's an exception in an apply, etc.), part of the graph remains
# unevaluated at the time that we exit.
#
# asyncio abhors this. It gets very upset if the process terminates without having observed every future that we
# have resolved. If we are terminating abnormally, it is highly likely that we are not going to observe every single
# future that we have created. Furthermore, it's *harmless* to do this - asyncio logs errors because it thinks it
# needs to tell users that they're doing bad things (which, to their credit, they are), but we are doing this
# deliberately.
#
# In order to paper over this for our users, we simply turn off the logger for asyncio. Users won't see any asyncio
# error messages, but if they stick to the Pulumi programming model, they wouldn't be seeing any anyway.
logging.getLogger("asyncio").setLevel(logging.CRITICAL)
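
The snippet above silences the asyncio logger for the entire process. A scoped alternative (a hypothetical helper, not part of the SDK) would save and restore the previous level, so asyncio logging in user code outside the inline program is unaffected:

```python
import logging
from contextlib import contextmanager

@contextmanager
def suppress_asyncio_logging():
    """Temporarily raise the asyncio logger's threshold to CRITICAL,
    restoring the previous level on exit (hypothetical helper)."""
    logger = logging.getLogger("asyncio")
    previous = logger.level
    logger.setLevel(logging.CRITICAL)
    try:
        yield
    finally:
        logger.setLevel(previous)

# Usage sketch: wrap only the inline-program run.
# with suppress_asyncio_logging():
#     run_inline_program()
```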

@komalali

komalali commented Jan 31, 2021

Yeah, I'm aware of that comment, and that is certainly a way to paper over the issue; it's how I was first handling it here too. But I see two problems with that:

  1. What if someone is using asyncio elsewhere in their code, outside of the inline program? In that case, yes, we could set logging back to the original level after the function finishes. That wouldn't work well if we wanted to run parallel programs in the same process, but that's not really an option right now due to all the global state anyway.
  2. Turning off logging just covers up the fact that we don't actually handle errors correctly in run_pulumi_func. If we error out the way I'm experiencing, the finally block doesn't run all the way through: it never ends up cancelling the outstanding tasks (i.e., the debug messages "Canceling all outstanding tasks" and "run_pulumi_func completed" are never reached). This is the case for both inline and local programs (i.e., programs run via the CLI directly). I spent some time looking into graceful shutdowns and error handling in asyncio, and we never assign a global exception handler to the event loop, so it's not guaranteed that we would ever clean up correctly.

So, I'm not convinced that just turning off the logging is the "right" answer. But it could be a "good enough" answer since it's what we've been doing for the past 2 years anyway 🤷
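
The cleanup being described, cancelling outstanding tasks and installing a global exception handler on the event loop, can be sketched as follows. This is an illustrative stand-alone example, not the SDK's actual code; `worker` and the simulated failure are made up for the demonstration:

```python
import asyncio

def _handle_loop_exception(loop, context):
    # Observe failures so asyncio never prints
    # "Task exception was never retrieved" at teardown (illustrative).
    print(f"unhandled exception in task: {context.get('exception')!r}")

async def cancel_outstanding(tasks):
    """Cancel still-pending tasks and wait for them to settle:
    the cleanup a finally block like run_pulumi_func's is meant to do."""
    for t in tasks:
        t.cancel()
    # return_exceptions=True so CancelledError doesn't re-raise here.
    await asyncio.gather(*tasks, return_exceptions=True)

async def main():
    loop = asyncio.get_running_loop()
    # A global handler guarantees failed tasks are always observed.
    loop.set_exception_handler(_handle_loop_exception)

    async def worker():
        await asyncio.sleep(10)  # stands in for an in-flight RPC

    tasks = [asyncio.create_task(worker()) for _ in range(3)]
    try:
        raise RuntimeError("simulated RPC failure")  # the failing graph node
    except RuntimeError:
        await cancel_outstanding(tasks)
    return [t.cancelled() for t in tasks]

cancelled = asyncio.run(main())
```

With this pattern the process exits with every future either resolved or explicitly cancelled, so asyncio has nothing to complain about even on an abnormal exit.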

@komalali

The "right" answer would be to fix run_pulumi_func to handle errors correctly.

@spara

spara commented Jan 31, 2021 via email

@komalali

> This is a quality issue, I would ask Alex/Lee Zen for guidance.

Yeah I'll bring this up in planning on Tuesday.

@EvanBoyle

> If we error out the way that I'm experiencing, the finally block doesn't run all the way through

Definitely want to make sure we exit cleanly and don't leave the GRPC server running in the case of an error. Luckily, since this is all in one process we should be able to attach a debugger and step through to see exactly what's going on (in the inline case). Happy to pair on that if it would be helpful.

@leezen leezen added this to the 0.52 milestone Feb 1, 2021
@komalali

komalali commented Feb 2, 2021

In my spelunking through the code, I noticed that there are a number of places in the Python runtime where we sys.exit(0) when the resource monitor is not available (because it is in the process of shutting down).
Examples:

When the process exits within an RPC, run_pulumi_func never completes, resulting in outstanding tasks that can eventually also error with the same grpc.StatusCode.UNAVAILABLE, leading to the `Task exception was never retrieved` log messages.

I noticed that the initial PR that added these exits did so in both the TypeScript and Python runtimes. However, @EvanBoyle updated the TypeScript runtime to absorb these errors and disconnect cleanly rather than exit the process. We should do the same for the Python runtime.
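
A minimal sketch of that "absorb and disconnect" pattern. Note the assumptions: `MonitorUnavailableError` is a stand-in for grpc's real UNAVAILABLE error type, and `log_to_engine`/`failing_rpc` are hypothetical names, not the SDK's actual functions:

```python
class MonitorUnavailableError(Exception):
    """Stand-in for the gRPC UNAVAILABLE error raised when the
    resource monitor has already shut down (hypothetical type)."""

def log_to_engine(send_rpc, msg):
    """Try an engine RPC; if the monitor is gone, absorb the error
    and signal a clean disconnect instead of calling sys.exit(0)."""
    try:
        send_rpc(msg)
        return "sent"
    except MonitorUnavailableError:
        # The deployment is already terminating: swallow the error so
        # outstanding tasks can still be awaited and cleaned up.
        return "disconnected"

def failing_rpc(msg):
    raise MonitorUnavailableError("failed to connect to all addresses")

print(log_to_engine(failing_rpc, "debug message"))  # prints "disconnected"
```

Returning control to the caller, rather than exiting the process mid-RPC, is what lets run_pulumi_func reach its cleanup and cancel outstanding tasks.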
