Improve error messages output by the CLI #1011

swgillespie · 2018-03-07T22:48:29Z

Partially addresses #606.

This fixes a couple known issues with the way that we present errors
from the Pulumi CLI:

Any errors from RPC endpoints were bubbling up as they were to the top-level, which was unfortunate because they contained RPC-specific noise that we don't want to present to the user. This commit unwraps errors from resource providers.
The "catastrophic error" message often got printed twice
Fatal errors are often printed twice, because our CLI top-level prints out the fatal error that it receives before exiting. A lot of the time this error has already been printed.
Errors were prefixed by PU####.

Here's a before and after for a doomed deployment:

before:

$ pulumi update
Performing changes:
* pulumi:pulumi:Stack: (same)
    [urn=urn:pulumi:webserver::webserver::pulumi:pulumi:Stack::webserver-webserver]
    + aws:ec2/securityGroup:SecurityGroup: (create)
        [urn=urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2]
        description        : "Enable HTTP access"
        ingress            : [
            [0]: {
                cidrBlocks: [
                    [0]: "0.0.0.0/0"
                ]
                fromPort  : 80
                protocol  : "not-tcp"
                self      : false
                toPort    : 80
            }
        ]
        name               : "web-secgrp-2-3c2930a"
        revokeRulesOnDelete: false
error PU2000: Plan apply failed: rpc error: code = Unknown desc = creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:Securit
yGroup::web-secgrp-2: Error authorizing security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP protocol. Unk
nown protocol.
        status code: 400, request id: c7dc1479-2123-426b-a3c7-45723de6960e
Step #2 failed [create]: this failure was catastrophic and the provider cannot guarantee recovery
info: no changes required:
      1 resource unchanged
A catastrophic error occurred; resources states may be unknown
error: rpc error: code = Unknown desc = creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2: Erro
r authorizing security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP protocol. Unknown protocol.
        status code: 400, request id: c7dc1479-2123-426b-a3c7-45723de6960e

after:

$ pulumi update
Performing changes:
* pulumi:pulumi:Stack: (same)
    [urn=urn:pulumi:webserver::webserver::pulumi:pulumi:Stack::webserver-webserver]
    + aws:ec2/securityGroup:SecurityGroup: (create)
        [urn=urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2]
        description        : "Enable HTTP access"
        ingress            : [
            [0]: {
                cidrBlocks: [
                    [0]: "0.0.0.0/0"
                ]
                fromPort  : 80
                protocol  : "not-tcp"
                self      : false
                toPort    : 80
            }
        ]
        name               : "web-secgrp-2-392cae9"
        revokeRulesOnDelete: false
error: Plan apply failed: creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2: Error authorizing
security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP protocol. Unknown protocol.
        status code: 400, request id: 69e2df74-69b0-476e-bca6-eb170d1115c1
Step #2 failed [create]: this failure was catastrophic and the provider cannot guarantee recovery
info: no changes required:
      1 resource unchanged
error: Deployment failed due to step application failure

The things to note:

RPC error goo rpc error: code = Unknown desc = is gone, getting stripped away at the provider plugin layer
PU2000 is gone
The error that bubbles up to the CLI top-level is Deployment failed due to step application failure and not the verbose error that we've already reported

swgillespie · 2018-03-07T22:52:00Z

pkg/resource/plugin/provider_plugin.go

+// In general, our resource state is only really unknown if the server
+// had an internal error, in which case it will serve one of `codes.Internal`,
+// `codes.DataLoss`, or `codes.Unknown` to us.
+func interpretRPCError(err error) (resource.Status, error) {


This bleeds in over to #219 a little bit and needs more work in order to reach its potential.

The idea here is that providers can send gRPC error codes alongside their error payloads to inform the engine of the nature of their failure. Today, we are sending the error code Unknown for all errors, so the engine has no idea if we can actually recover from the error we just received.

After this PR lands providers can start sending error codes like e.g. FailedPrecondition if a resource fails Terraform validation and we won't see the angry "catastrophic error" message for run-of-the-mill failures.

I'm not sure if there's a better name for this method. It's sort of a bummer that it's not clear from the name this is going to give you information about the resource's status and that the error returned is not an error from this method but rather a translation of the error argument.

Maybe resourceStateAndErrorFromRPCError Or maybe just resourceStateAndError? I don't know. Your call.

lukehoban · 2018-03-08T00:07:46Z

Re: the after:

error: Plan apply failed: creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2: Error authorizing
security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP protocol. Unknown protocol.
        status code: 400, request id: 69e2df74-69b0-476e-bca6-eb170d1115c1
Step #2 failed [create]: this failure was catastrophic and the provider cannot guarantee recovery
info: no changes required:
      1 resource unchanged
error: Deployment failed due to step application failure

Do we need to present both error: entries here?
Do we want to add newlines or some other form of separation to highlight the actual error message?
Can we maybe make the engine error message text a little more human? ("plan apply", "step application", "failure was catastrophic", etc.)

Perhaps something like:

Error creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2: 

    Error authorizing security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP 
    protocol. Unknown protocol.
            status code: 400, request id: 69e2df74-69b0-476e-bca6-eb170d1115c1

Update failed due to resource creation failure.

info: no changes required:
      1 resource unchanged

swgillespie · 2018-03-08T00:41:05Z

Do we need to present both error: entries here?

If we use cmdutil.RunFunc at the CLI top-level, we do. If we don't, we can just exit with a non-zero exit code instead of printing the error. I'm happy to do that if we're fine changing the CLI top-levels a little bit.

Do we want to add newlines or some other form of separation to highlight the actual error message?

I looked at pkg/errors a little more (since we use errors.Wrap a lot) and it looks like it's possible to iterate over the error chain and print it nicely like this. I'll give it a try.

Can we maybe make the engine error message text a little more human? ("plan apply", "step application", "failure was catastrophic", etc.)

Sure! I'm not really sure what to do with "failure was catastrophic" - should we just omit it completely? We'll emit it for pretty much every error as it is today...

ellismg · 2018-03-08T00:42:14Z

pkg/resource/deploy/plan_apply.go

+			return status, errors.New("Preview failed due to step application failure")
+		}
+
+		return status, errors.New("Deployment failed due to step application failure")


Unsure if we call this Deployment elsewhere in the engine, for messages that may end up in the user's face. The user facing verb is arguably Update.

lukehoban · 2018-03-08T00:59:12Z

Can we maybe make the engine error message text a little more human? ("plan apply", "step application", "failure was catastrophic", etc.)

Sure! I'm not really sure what to do with "failure was catastrophic" - should we just omit it completely? We'll emit it for pretty much every error as it is today...

It's not immediately clear we are gaining any value from that error. In my suggested text above, that whole part of the error was replaced with just Update failed due to resource creation failure. It feels like that is all that's needed. Is there any case this gets printed where it provides more useful information to the user than that simpler message?

lukehoban · 2018-03-08T01:04:03Z

Do we need to present both error: entries here?

If we use cmdutil.RunFunc at the CLI top-level, we do. If we don't, we can just exit with a non-zero exit code instead of printing the error. I'm happy to do that if we're fine changing the CLI top-levels a little bit.

A bit of a shame to have to duplicate, but I can see an argument for being consistent about always printing an error message as the final output when we fail.

Perhaps just keep it really simple then:

error: Update failed

swgillespie · 2018-03-08T20:16:00Z

So unfortunately it's not nice today to present sub-errrors in an "indented" way like in Luke's example. This is because the errors get wrapped on the provider side and serialized out via RPC so, by the time we see it in the engine, it's just a big string.

We could in theory parse the error message, but I'm wary of hardcoding provider implementation details (error message format) into the engine in such a way. I think that, going forward, we'd get a tremendous benefit from re-working our RPC boundaries a little bit to emit structured errors that the engine can inspect. gRPC lets us throw arbitrary "Details" objects onto error payloads, so we can pretty trivially include an error chain in gRPC error messages.

We also have the opportunity to attach arbitrary metadata to error messages (#992).

joeduffy

LGTM, let's try to land this morning.

joeduffy · 2018-03-09T16:56:52Z

pkg/engine/deploy.go

-		b.WriteString(colors.Reset)
-		b.WriteString("\n")
+		//
+		// [pulumi/pulumi#219] Since all errors are "unrecoverable" today, the below error message


I would delete this and think about how to add back something useful as part of #219.

joeduffy · 2018-03-09T16:57:33Z

pkg/engine/plan.go

@@ -195,7 +195,7 @@ func printPlan(result *planResult) (ResourceChanges, error) {
 	actions := newPreviewActions(result.Options)
 	_, _, _, err := result.Walk(actions, true)
 	if err != nil {
-		return nil, errors.Wrapf(err, "An error occurred while advancing the preview")
+		return nil, errors.New("An error occurred while advancing the preview")


Nit, should this be lower case like our other error messages?

joeduffy · 2018-03-09T16:57:50Z

pkg/resource/deploy/plan_apply.go

+	// a more specific error message.
+	if err != nil {
+		if preview {
+			return status, errors.New("Preview failed")


Nit, lower case (also below one).

joeduffy · 2018-03-09T16:59:59Z

pkg/resource/plugin/provider_plugin.go

+// and a message. Based on the error code given to us, we can understand
+// the state of our system and if our resource status is truly unknown.
+//
+// In general, our resource state is only really unknown if the server


Should the propagated error message differentiate on the status code? I'm just thinking that a bug report from a customer that contains minimal data might be easier to diagnose if it said something like [rpc internal error] or somesuch. Obviously, not for the errors that we expect to arise from plugins returning errors.

Good point. No provider actually returns Internal today, but if it did we could go even further and ask that providers return a stack trace of a panic that produced the internal error. (This is pretty trivial with gRPC, we just don't do it.)

Since everything returns Unknown today I'm inclined to leave this code as it is and, when we start plumbing gRPC error codes throughout our system, address it then. Is that OK?

joeduffy · 2018-03-09T17:04:17Z

pkg/resource/plugin/provider_plugin.go

+	}
+
+	glog.V(8).Infof("rpc error kind `%s` is well-understood and recoverable", rpcStatus.Code())
+	return resource.StatusOK, errors.New(rpcStatus.Message())


FWIW, the original intent of this was related to resource tainting. StatusUnknown was actually meant to convey that, because a resource provider returned an error, we cannot guarantee the state of the resources. Or so the thinking went. We should figure this out as part of #219, but AFAIK nothing actually uses this today so changing its meaning shouldn't affect anything. Just wanted you to know the history.

Makes sense, and I figured as much.

joeduffy · 2018-03-09T18:01:36Z

By the way, I really liked Luke's example from earlier:

Error creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2: 

    Error authorizing security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP 
    protocol. Unknown protocol.
            status code: 400, request id: 69e2df74-69b0-476e-bca6-eb170d1115c1

Update failed due to resource creation failure.

info: no changes required:
      1 resource unchanged

I gather we can't get there just yet because we need to communicate more error information across the RPC boundary. However, please do file a follow up work item to track the idea. This would be a lovely place to get to.

swgillespie · 2018-03-09T18:07:30Z

@joeduffy will do. I think that being a little more disciplined about our RPC endpoints allow a lot of very cool scenarios like indented error messages just like this. I've been thinking a lot about it this week and I don't think it's a ton of work.

lukehoban · 2018-03-09T18:09:03Z

So unfortunately it's not nice today to present sub-errrors in an "indented" way like in Luke's example. This is because the errors get wrapped on the provider side and serialized out via RPC so, by the time we see it in the engine, it's just a big string. We could in theory parse the error message, but I'm wary of hardcoding provider implementation details (error message format) into the engine in such a way. I think that, going forward, we'd get a tremendous benefit from re-working our RPC boundaries a little bit to emit structured errors that the engine can inspect. gRPC lets us throw arbitrary "Details" objects onto error payloads, so we can pretty trivially include an error chain in gRPC error messages.

Can't we then just fix this inside the provider? Do the newlines an indentation there?

swgillespie · 2018-03-09T18:14:22Z

We can, but that doesn't feel like the right fix to me. It feels like a layering violation to expect a resource provider to spit out error messages that are nicely-formatted for a console. (I'm happy to do it if we really want this to be nice for M11, though.)

In general I'm a big believer in structured logging over stringy logging and to me there are a lot of engineering and hygiene benefits to having provider errors be structured.

swgillespie · 2018-03-09T19:36:15Z

Is anybody opposed to merging this now? I think I've addressed all feedback and I'll log issues for follow-up work items.

lukehoban · 2018-03-09T19:53:27Z

Is anybody opposed to merging this now? I think I've addressed all feedback and I'll log issues for follow-up work items.

Go for it - big improvement already here - looking forward to even more!

swgillespie · 2018-03-09T21:19:25Z

Thanks! I'll hold off merging until @ellismg 's done to keep the merge conflict pain to a minimum.

This fixes a couple known issues with the way that we present errors from the Pulumi CLI: 1. Any errors from RPC endpoints were bubbling up as they were to the top-level, which was unfortunate because they contained RPC-specific noise that we don't want to present to the user. This commit unwraps errors from resource providers. 2. The "catastrophic error" message often got printed twice 3. Fatal errors are often printed twice, because our CLI top-level prints out the fatal error that it receives before exiting. A lot of the time this error has already been printed. 4. Errors were prefixed by PU####.

…e error message as the final error

…ization

swgillespie · 2018-03-09T23:43:23Z

thanks for the reviews!

swgillespie commented Mar 7, 2018

View reviewed changes

swgillespie requested review from joeduffy and ellismg March 7, 2018 22:52

swgillespie added area/cli UX of using the CLI (args, output, logs) impact/changelog labels Mar 7, 2018

ellismg reviewed Mar 8, 2018

View reviewed changes

joeduffy approved these changes Mar 9, 2018

View reviewed changes

lukehoban approved these changes Mar 9, 2018

View reviewed changes

swgillespie added 5 commits March 9, 2018 14:52

Feedback: Omit the 'catastrophic' error message and use a less verbos…

0518e78

…e error message as the final error

Code review feedback: interpretRPCError -> resourceStateAndError

8a5993e

Code review feedback: deleting some commented-out code, error capital…

27c7d71

…ization

Cleanup after rebase

13aa063

swgillespie force-pushed the swgillespie/better-errors-2 branch from 9aa697e to 13aa063 Compare March 9, 2018 23:10

swgillespie merged commit 703a954 into master Mar 9, 2018

swgillespie deleted the swgillespie/better-errors-2 branch March 9, 2018 23:43

swgillespie mentioned this pull request Mar 9, 2018

Improve failure errors #606

Closed

swgillespie mentioned this pull request Mar 12, 2018

Provide structured errors across RPC endpoints #1038

Closed

lindydonna added the kind/bug Some behavior is incorrect or out of spec label Mar 19, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve error messages output by the CLI #1011

Improve error messages output by the CLI #1011

swgillespie commented Mar 7, 2018

swgillespie Mar 7, 2018

ellismg Mar 8, 2018

lukehoban commented Mar 8, 2018

swgillespie commented Mar 8, 2018

ellismg Mar 8, 2018

lukehoban commented Mar 8, 2018

lukehoban commented Mar 8, 2018

swgillespie commented Mar 8, 2018

joeduffy left a comment

joeduffy Mar 9, 2018

swgillespie Mar 9, 2018

joeduffy Mar 9, 2018

swgillespie Mar 9, 2018

joeduffy Mar 9, 2018

joeduffy Mar 9, 2018

swgillespie Mar 9, 2018

joeduffy Mar 9, 2018

joeduffy Mar 9, 2018

swgillespie Mar 9, 2018

joeduffy commented Mar 9, 2018

swgillespie commented Mar 9, 2018

lukehoban commented Mar 9, 2018

swgillespie commented Mar 9, 2018

swgillespie commented Mar 9, 2018

lukehoban commented Mar 9, 2018

swgillespie commented Mar 9, 2018

swgillespie commented Mar 9, 2018

Improve error messages output by the CLI #1011

Improve error messages output by the CLI #1011

Conversation

swgillespie commented Mar 7, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lukehoban commented Mar 8, 2018

swgillespie commented Mar 8, 2018

Choose a reason for hiding this comment

lukehoban commented Mar 8, 2018

lukehoban commented Mar 8, 2018

swgillespie commented Mar 8, 2018

joeduffy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joeduffy commented Mar 9, 2018

swgillespie commented Mar 9, 2018

lukehoban commented Mar 9, 2018

swgillespie commented Mar 9, 2018

swgillespie commented Mar 9, 2018

lukehoban commented Mar 9, 2018

swgillespie commented Mar 9, 2018

swgillespie commented Mar 9, 2018