Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error messages output by the CLI #1011

Merged
merged 5 commits into from
Mar 9, 2018

Conversation

swgillespie
Copy link
Contributor

Partially addresses #606.

This fixes a couple known issues with the way that we present errors
from the Pulumi CLI:

  1. Any errors from RPC endpoints were bubbling up as they were to the top-level, which was unfortunate because they contained RPC-specific noise that we don't want to present to the user. This commit unwraps errors from resource providers.
  2. The "catastrophic error" message often got printed twice
  3. Fatal errors are often printed twice, because our CLI top-level prints out the fatal error that it receives before exiting. A lot of the time this error has already been printed.
  4. Errors were prefixed by PU####.

Here's a before and after for a doomed deployment:

before:

$ pulumi update
Performing changes:
* pulumi:pulumi:Stack: (same)
    [urn=urn:pulumi:webserver::webserver::pulumi:pulumi:Stack::webserver-webserver]
    + aws:ec2/securityGroup:SecurityGroup: (create)
        [urn=urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2]
        description        : "Enable HTTP access"
        ingress            : [
            [0]: {
                cidrBlocks: [
                    [0]: "0.0.0.0/0"
                ]
                fromPort  : 80
                protocol  : "not-tcp"
                self      : false
                toPort    : 80
            }
        ]
        name               : "web-secgrp-2-3c2930a"
        revokeRulesOnDelete: false
error PU2000: Plan apply failed: rpc error: code = Unknown desc = creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:Securit
yGroup::web-secgrp-2: Error authorizing security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP protocol. Unk
nown protocol.
        status code: 400, request id: c7dc1479-2123-426b-a3c7-45723de6960e
Step #2 failed [create]: this failure was catastrophic and the provider cannot guarantee recovery
info: no changes required:
      1 resource unchanged
A catastrophic error occurred; resources states may be unknown
error: rpc error: code = Unknown desc = creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2: Erro
r authorizing security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP protocol. Unknown protocol.
        status code: 400, request id: c7dc1479-2123-426b-a3c7-45723de6960e

after:

$ pulumi update
Performing changes:
* pulumi:pulumi:Stack: (same)
    [urn=urn:pulumi:webserver::webserver::pulumi:pulumi:Stack::webserver-webserver]
    + aws:ec2/securityGroup:SecurityGroup: (create)
        [urn=urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2]
        description        : "Enable HTTP access"
        ingress            : [
            [0]: {
                cidrBlocks: [
                    [0]: "0.0.0.0/0"
                ]
                fromPort  : 80
                protocol  : "not-tcp"
                self      : false
                toPort    : 80
            }
        ]
        name               : "web-secgrp-2-392cae9"
        revokeRulesOnDelete: false
error: Plan apply failed: creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2: Error authorizing
security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP protocol. Unknown protocol.
        status code: 400, request id: 69e2df74-69b0-476e-bca6-eb170d1115c1
Step #2 failed [create]: this failure was catastrophic and the provider cannot guarantee recovery
info: no changes required:
      1 resource unchanged
error: Deployment failed due to step application failure

The things to note:

  1. RPC error goo rpc error: code = Unknown desc = is gone, getting stripped away at the provider plugin layer
  2. PU2000 is gone
  3. The error that bubbles up to the CLI top-level is Deployment failed due to step application failure and not the verbose error that we've already reported

// In general, our resource state is only really unknown if the server
// had an internal error, in which case it will serve one of `codes.Internal`,
// `codes.DataLoss`, or `codes.Unknown` to us.
func interpretRPCError(err error) (resource.Status, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bleeds in over to #219 a little bit and needs more work in order to reach its potential.

The idea here is that providers can send gRPC error codes alongside their error payloads to inform the engine of the nature of their failure. Today, we are sending the error code Unknown for all errors, so the engine has no idea if we can actually recover from the error we just received.

After this PR lands providers can start sending error codes like e.g. FailedPrecondition if a resource fails Terraform validation and we won't see the angry "catastrophic error" message for run-of-the-mill failures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there's a better name for this method. It's sort of a bummer that it's not clear from the name this is going to give you information about the resource's status and that the error returned is not an error from this method but rather a translation of the error argument.

Maybe resourceStateAndErrorFromRPCError Or maybe just resourceStateAndError? I don't know. Your call.

@swgillespie swgillespie added area/cli UX of using the CLI (args, output, logs) impact/changelog labels Mar 7, 2018
@lukehoban
Copy link
Member

Re: the after:

error: Plan apply failed: creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2: Error authorizing
security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP protocol. Unknown protocol.
        status code: 400, request id: 69e2df74-69b0-476e-bca6-eb170d1115c1
Step #2 failed [create]: this failure was catastrophic and the provider cannot guarantee recovery
info: no changes required:
      1 resource unchanged
error: Deployment failed due to step application failure
  1. Do we need to present both error: entries here?
  2. Do we want to add newlines or some other form of separation to highlight the actual error message?
  3. Can we maybe make the engine error message text a little more human? ("plan apply", "step application", "failure was catastrophic", etc.)

Perhaps something like:

Error creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2: 

    Error authorizing security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP 
    protocol. Unknown protocol.
            status code: 400, request id: 69e2df74-69b0-476e-bca6-eb170d1115c1

Update failed due to resource creation failure.

info: no changes required:
      1 resource unchanged

@swgillespie
Copy link
Contributor Author

Do we need to present both error: entries here?

If we use cmdutil.RunFunc at the CLI top-level, we do. If we don't, we can just exit with a non-zero exit code instead of printing the error. I'm happy to do that if we're fine changing the CLI top-levels a little bit.

Do we want to add newlines or some other form of separation to highlight the actual error message?

I looked at pkg/errors a little more (since we use errors.Wrap a lot) and it looks like it's possible to iterate over the error chain and print it nicely like this. I'll give it a try.

Can we maybe make the engine error message text a little more human? ("plan apply", "step application", "failure was catastrophic", etc.)

Sure! I'm not really sure what to do with "failure was catastrophic" - should we just omit it completely? We'll emit it for pretty much every error as it is today...

return status, errors.New("Preview failed due to step application failure")
}

return status, errors.New("Deployment failed due to step application failure")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure if we call this Deployment elsewhere in the engine, for messages that may end up in the user's face. The user facing verb is arguably Update.

@lukehoban
Copy link
Member

Can we maybe make the engine error message text a little more human? ("plan apply", "step application", "failure was catastrophic", etc.)

Sure! I'm not really sure what to do with "failure was catastrophic" - should we just omit it completely? We'll emit it for pretty much every error as it is today...

It's not immediately clear we are gaining any value from that error. In my suggested text above, that whole part of the error was replaced with just Update failed due to resource creation failure. It feels like that is all that's needed. Is there any case this gets printed where it provides more useful information to the user than that simpler message?

@lukehoban
Copy link
Member

Do we need to present both error: entries here?

If we use cmdutil.RunFunc at the CLI top-level, we do. If we don't, we can just exit with a non-zero exit code instead of printing the error. I'm happy to do that if we're fine changing the CLI top-levels a little bit.

A bit of a shame to have to duplicate, but I can see an argument for being consistent about always printing an error message as the final output when we fail.

Perhaps just keep it really simple then:

error: Update failed

@swgillespie
Copy link
Contributor Author

So unfortunately it's not nice today to present sub-errrors in an "indented" way like in Luke's example. This is because the errors get wrapped on the provider side and serialized out via RPC so, by the time we see it in the engine, it's just a big string.

We could in theory parse the error message, but I'm wary of hardcoding provider implementation details (error message format) into the engine in such a way. I think that, going forward, we'd get a tremendous benefit from re-working our RPC boundaries a little bit to emit structured errors that the engine can inspect. gRPC lets us throw arbitrary "Details" objects onto error payloads, so we can pretty trivially include an error chain in gRPC error messages.

We also have the opportunity to attach arbitrary metadata to error messages (#992).

Copy link
Member

@joeduffy joeduffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's try to land this morning.

b.WriteString(colors.Reset)
b.WriteString("\n")
//
// [pulumi/pulumi#219] Since all errors are "unrecoverable" today, the below error message
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would delete this and think about how to add back something useful as part of #219.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay!

@@ -195,7 +195,7 @@ func printPlan(result *planResult) (ResourceChanges, error) {
actions := newPreviewActions(result.Options)
_, _, _, err := result.Walk(actions, true)
if err != nil {
return nil, errors.Wrapf(err, "An error occurred while advancing the preview")
return nil, errors.New("An error occurred while advancing the preview")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, should this be lower case like our other error messages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

// a more specific error message.
if err != nil {
if preview {
return status, errors.New("Preview failed")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, lower case (also below one).

// and a message. Based on the error code given to us, we can understand
// the state of our system and if our resource status is truly unknown.
//
// In general, our resource state is only really unknown if the server
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the propagated error message differentiate on the status code? I'm just thinking that a bug report from a customer that contains minimal data might be easier to diagnose if it said something like [rpc internal error] or somesuch. Obviously, not for the errors that we expect to arise from plugins returning errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. No provider actually returns Internal today, but if it did we could go even further and ask that providers return a stack trace of a panic that produced the internal error. (This is pretty trivial with gRPC, we just don't do it.)

Since everything returns Unknown today I'm inclined to leave this code as it is and, when we start plumbing gRPC error codes throughout our system, address it then. Is that OK?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM.

}

glog.V(8).Infof("rpc error kind `%s` is well-understood and recoverable", rpcStatus.Code())
return resource.StatusOK, errors.New(rpcStatus.Message())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, the original intent of this was related to resource tainting. StatusUnknown was actually meant to convey that, because a resource provider returned an error, we cannot guarantee the state of the resources. Or so the thinking went. We should figure this out as part of #219, but AFAIK nothing actually uses this today so changing its meaning shouldn't affect anything. Just wanted you to know the history.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, and I figured as much.

@joeduffy
Copy link
Member

joeduffy commented Mar 9, 2018

By the way, I really liked Luke's example from earlier:

Error creating urn:pulumi:webserver::webserver::aws:ec2/securityGroup:SecurityGroup::web-secgrp-2: 

    Error authorizing security group ingress rules: InvalidParameterValue: Invalid value 'not-tcp' for IP 
    protocol. Unknown protocol.
            status code: 400, request id: 69e2df74-69b0-476e-bca6-eb170d1115c1

Update failed due to resource creation failure.

info: no changes required:
      1 resource unchanged

I gather we can't get there just yet because we need to communicate more error information across the RPC boundary. However, please do file a follow up work item to track the idea. This would be a lovely place to get to.

@swgillespie
Copy link
Contributor Author

@joeduffy will do. I think that being a little more disciplined about our RPC endpoints allow a lot of very cool scenarios like indented error messages just like this. I've been thinking a lot about it this week and I don't think it's a ton of work.

@lukehoban
Copy link
Member

So unfortunately it's not nice today to present sub-errrors in an "indented" way like in Luke's example. This is because the errors get wrapped on the provider side and serialized out via RPC so, by the time we see it in the engine, it's just a big string. We could in theory parse the error message, but I'm wary of hardcoding provider implementation details (error message format) into the engine in such a way. I think that, going forward, we'd get a tremendous benefit from re-working our RPC boundaries a little bit to emit structured errors that the engine can inspect. gRPC lets us throw arbitrary "Details" objects onto error payloads, so we can pretty trivially include an error chain in gRPC error messages.

Can't we then just fix this inside the provider? Do the newlines an indentation there?

@swgillespie
Copy link
Contributor Author

We can, but that doesn't feel like the right fix to me. It feels like a layering violation to expect a resource provider to spit out error messages that are nicely-formatted for a console. (I'm happy to do it if we really want this to be nice for M11, though.)

In general I'm a big believer in structured logging over stringy logging and to me there are a lot of engineering and hygiene benefits to having provider errors be structured.

@swgillespie
Copy link
Contributor Author

Is anybody opposed to merging this now? I think I've addressed all feedback and I'll log issues for follow-up work items.

@lukehoban
Copy link
Member

Is anybody opposed to merging this now? I think I've addressed all feedback and I'll log issues for follow-up work items.

Go for it - big improvement already here - looking forward to even more!

@swgillespie
Copy link
Contributor Author

Thanks! I'll hold off merging until @ellismg 's done to keep the merge conflict pain to a minimum.

This fixes a couple known issues with the way that we present errors
from the Pulumi CLI:
    1. Any errors from RPC endpoints were bubbling up as they were to
    the top-level, which was unfortunate because they contained
    RPC-specific noise that we don't want to present to the user. This
    commit unwraps errors from resource providers.
    2. The "catastrophic error" message often got printed twice
    3. Fatal errors are often printed twice, because our CLI top-level
    prints out the fatal error that it receives before exiting. A lot of
    the time this error has already been printed.
    4. Errors were prefixed by PU####.
@swgillespie swgillespie merged commit 703a954 into master Mar 9, 2018
@swgillespie
Copy link
Contributor Author

thanks for the reviews!

@swgillespie swgillespie deleted the swgillespie/better-errors-2 branch March 9, 2018 23:43
@lindydonna lindydonna added the kind/bug Some behavior is incorrect or out of spec label Mar 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cli UX of using the CLI (args, output, logs) kind/bug Some behavior is incorrect or out of spec
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants