
CLI drops (some?) errors on the floor #1762

Closed · joeduffy opened this issue Aug 12, 2018 · 9 comments
Labels: area/cli (UX of using the CLI: args, output, logs), kind/bug (Some behavior is incorrect or out of spec), p0 (Bugs severe enough to interrupt existing work)
Milestone: 0.16

@joeduffy (Member)

I have a resource that is failing its Check, and the CLI simply fails with:

λ pulumi up
Previewing update of stack 'hello-aks'
Previewing changes:

     Type                       Name                    Plan       Info
 +   pulumi:pulumi:Stack        azure-ts-aks-hello-aks  create
 +   └─ pulumi:providers:azure  default                 create
error: an error occurred while advancing the preview

The logs clearly show the failure coming from Check, which never gets displayed:

I0812 09:43:41.744353   11969 rpc.go:68] Marshaling property for RPC[Provider[azure, 0xc42025aff0].Check(urn:pulumi:hello-aks::azure-ts-aks::azure:core/resourceGroup:ResourceGroup::azure-ts-aks-hello-aks).news]: location={East US}
I0812 09:43:41.745081   11969 provider_plugin.go:244] Provider[azure, 0xc42025aff0].Check(urn:pulumi:hello-aks::azure-ts-aks::azure:core/resourceGroup:ResourceGroup::azure-ts-aks-hello-aks) failed: err=name 'azure-ts-aks-hello-aks' plus 8 random chars is longer than maximum length 24
I0812 09:43:41.745091   11969 plan_executor.go:209] PlanExecutor.handleSingleEvent(...): received step event error: name 'azure-ts-aks-hello-aks' plus 8 random chars is longer than maximum length 24
I0812 09:43:41.745118   11969 plan_executor.go:115] PlanExecutor.Execute(...): waiting for incoming events
I0812 09:43:41.745129   11969 plan_executor.go:143] PlanExecutor.Execute(...): context canceled
I0812 09:43:41.745130   11969 step_executor.go:245] StepExecutor worker(7): worker exiting due to cancellation
I0812 09:43:41.745130   11969 plan_executor.go:81] PlanExecutor.Execute(...): cancel goroutine exiting
I0812 09:43:41.745141   11969 step_executor.go:245] StepExecutor worker(1): worker exiting due to cancellation
I0812 09:43:41.745150   11969 plan_executor.go:149] PlanExecutor.Execute(...): exited event loop, waiting for completion
I0812 09:43:41.746223   11969 step_executor.go:245] StepExecutor worker(-1): StepExecutor.waitForCompletion(): waiting for worker threads to exit
I0812 09:43:41.745152   11969 step_executor.go:245] StepExecutor worker(3): worker exiting due to cancellation
I0812 09:43:41.745161   11969 step_executor.go:245] StepExecutor worker(8): worker exiting due to cancellation
I0812 09:43:41.745161   11969 step_executor.go:245] StepExecutor worker(9): worker exiting due to cancellation
I0812 09:43:41.745156   11969 step_executor.go:245] StepExecutor worker(5): worker exiting due to cancellation
I0812 09:43:41.745149   11969 step_executor.go:245] StepExecutor worker(4): worker exiting due to cancellation
I0812 09:43:41.745167   11969 step_executor.go:245] StepExecutor worker(2): worker exiting due to cancellation
I0812 09:43:41.745175   11969 step_executor.go:245] StepExecutor worker(6): worker exiting due to cancellation
I0812 09:43:41.745200   11969 step_executor.go:245] StepExecutor worker(0): worker exiting due to cancellation
I0812 09:43:41.746407   11969 step_executor.go:245] StepExecutor worker(-1): StepExecutor.waitForCompletion(): worker threads all exited
I0812 09:43:41.746413   11969 plan_executor.go:151] PlanExecutor.Execute(...): step executor has completed
I0812 09:43:41.746419   11969 plan_executor.go:161] PlanExecutor.Execute(...): observed that the plan errored
I0812 09:43:41.749407   11969 langruntime_plugin.go:153] langhost[nodejs].Run(pwd=/Users/joeduffy/dev/code/src/github.com/pulumi/examples/azure-ts-aks,program=.,...,dryrun=true) failed: err=transport is closing
 +   └─ pulumi:providers:azure  default                 create
I0812 09:43:41.763428   11969 ignore.go:44] Explicitly ignoring and discarding error: 1 error occurred:

* operation not permitted
I0812 09:43:41.763452   11969 source_eval.go:160] EvalSourceIterator ended with an error: transport is closing
I0812 09:43:41.763471   11969 plan_executor.go:99] PlanExecutor.Execute(...): incoming events goroutine exiting
I0812 09:43:41.764196   11969 sink.go:154] defaultSink::Error(error: an error occurred while advancing the preview

This is on 0.15.0-rc2 bits:

λ pulumi version
v0.15.0-rc2
λ grep '"version":' node_modules/@pulumi/pulumi/package.json
  "version": "0.15.0-rc2"
λ grep '"version":' node_modules/@pulumi/azure/package.json
  "version": "0.15.0-rc2"
@joeduffy joeduffy added kind/bug Some behavior is incorrect or out of spec area/cli UX of using the CLI (args, output, logs) labels Aug 12, 2018
@joeduffy joeduffy added this to the 0.16 milestone Aug 12, 2018
@joeduffy (Member Author)

Bizarrely, I now can't even update or destroy this stack; I get 500s from the service each time I try. I'm not sure whether it's related to the above failure.

Here is the log:

Updating stack 'hello-aks'
I0812 09:55:25.852103   12364 api.go:155] Making Pulumi API call: https://api.pulumi.com/api/stacks/joeduffy/hello-aks/update
I0812 09:55:25.852120   12364 api.go:157] Pulumi API call details (https://api.pulumi.com/api/stacks/joeduffy/hello-aks/update): headers=map[Content-Type:[application/json] User-Agent:[pulumi-cli/1 (v0.15.0-rc2; darwin)] Authorization:[token <SNIP>]]; body={"name":"azure-ts-aks","runtime":"nodejs","main":"","description":"Create an Azure Kubernetes Service (AKS) cluster and deploy a container to it.","options":{"analyzers":[],"color":"raw","dryRun":false,"parallel":10,"showConfig":false,"showReplacementSteps":false,"showNames":false,"summary":false,"debug":false},"config":{"azure-ts-aks:config:clientId":{"string":"clientId","secret":false},"azure-ts-aks:config:clientSecret":{"string":"AA<SNIP>Kg=","secret":true},"azure-ts-aks:config:clientsecret":{"string":"AAA<SNIP>mFiU=","secret":true},"azure-ts-aks:config:sshPublicKey":{"string":"ssh-rsa AAAAB3NzaC1yc2EAA<SNIP>vrCKiS14H74UsSricKFc+CG0S/jZJzX3Y2nu+CZIW0GpqkT9JdlkO32gDXKujxYaF joeduffy@joedu-wallawalla.local","secret":false},"azure:config:environment":{"string":"public","secret":false}},"metadata":{"message":"","environment":{"git.author":"joeduffy","git.author.email":"joe@pulumi.com","git.committer":"joeduffy","git.committer.email":"joe@pulumi.com","git.dirty":"true","git.head":"0b057e1ee855220e8d8a00833407d971de9e4192","github.login":"pulumi","github.repo":"examples"}}}
I0812 09:55:25.937413   12364 api.go:170] Pulumi API call response code (https://api.pulumi.com/api/stacks/joeduffy/hello-aks/update): 500 Internal Server Error
I0812 09:55:25.937645   12364 sink.go:154] defaultSink::Error(error: [500] Internal Server Error)
error: [500] Internal Server Error

/cc @chrsmith

@joeduffy (Member Author)

Got hit by this again just now:

λ pulumi up
Previewing update of stack 'hi-aks'
Previewing changes:

     Type                       Name                 Plan          Info
 *   pulumi:pulumi:Stack        azure-ts-aks-hi-aks  no change
 ~   └─ pulumi:providers:azure  default              update        changes: ~ version
error: an error occurred while advancing the preview

Inspecting the logs reveals that my Azure token has expired:

I0812 10:33:41.195156   58419 provider_plugin.go:197] Provider[azure, 0xc420e705a0].Configure() failed: err=No valid (unexpired) Azure CLI Auth Tokens found. Please run `az login`.
I0812 10:33:41.195175   58419 plan_executor.go:209] PlanExecutor.handleSingleEvent(...): received step event error: No valid (unexpired) Azure CLI Auth Tokens found. Please run `az login`.
I0812 10:33:41.195200   58419 plan_executor.go:115] PlanExecutor.Execute(...): waiting for incoming events
I0812 10:33:41.195207   58419 plan_executor.go:143] PlanExecutor.Execute(...): context canceled
I0812 10:33:41.195212   58419 plan_executor.go:149] PlanExecutor.Execute(...): exited event loop, waiting for completion
I0812 10:33:41.195218   58419 step_executor.go:245] StepExecutor worker(-1): StepExecutor.waitForCompletion(): waiting for worker threads to exit
I0812 10:33:41.195225   58419 step_executor.go:245] StepExecutor worker(5): worker exiting due to cancellation
I0812 10:33:41.195228   58419 step_executor.go:245] StepExecutor worker(8): worker exiting due to cancellation
I0812 10:33:41.195234   58419 plan_executor.go:81] PlanExecutor.Execute(...): cancel goroutine exiting
I0812 10:33:41.195237   58419 step_executor.go:245] StepExecutor worker(7): worker exiting due to cancellation
I0812 10:33:41.195242   58419 step_executor.go:245] StepExecutor worker(9): worker exiting due to cancellation
I0812 10:33:41.195241   58419 step_executor.go:245] StepExecutor worker(1): worker exiting due to cancellation
I0812 10:33:41.195248   58419 step_executor.go:245] StepExecutor worker(4): worker exiting due to cancellation
I0812 10:33:41.195244   58419 step_executor.go:245] StepExecutor worker(0): worker exiting due to cancellation
I0812 10:33:41.195255   58419 step_executor.go:245] StepExecutor worker(2): worker exiting due to cancellation
I0812 10:33:41.195256   58419 step_executor.go:245] StepExecutor worker(3): worker exiting due to cancellation
I0812 10:33:41.195266   58419 step_executor.go:245] StepExecutor worker(6): worker exiting due to cancellation
I0812 10:33:41.195324   58419 step_executor.go:245] StepExecutor worker(-1): StepExecutor.waitForCompletion(): worker threads all exited
I0812 10:33:41.195330   58419 plan_executor.go:151] PlanExecutor.Execute(...): step executor has completed
I0812 10:33:41.195335   58419 plan_executor.go:161] PlanExecutor.Execute(...): observed that the plan errored
I0812 10:33:41.201494   58419 langruntime_plugin.go:153] langhost[nodejs].Run(pwd=/Users/joeduffy/dev/code/src/github.com/pulumi/examples/azure-ts-aks,program=.,...,dryrun=true) failed: err=transport is closing
I0812 10:33:41.201536   58419 ignore.go:44] Explicitly ignoring and discarding error: rpc error: code = Canceled desc = grpc: the client connection is closing
I0812 10:33:41.213520   58419 ignore.go:44] Explicitly ignoring and discarding error: 1 error occurred:

* operation not permitted
I0812 10:33:41.213576   58419 source_eval.go:160] EvalSourceIterator ended with an error: transport is closing
 ~   └─ pulumi:providers:azure  default              update        changes: ~ version
I0812 10:33:41.214373   58419 sink.go:154] defaultSink::Error(error: an error occurred while advancing the preview

@joeduffy (Member Author)

I've hit this in a handful more cases today. This definitely needs a fix before 0.15 is ready to go.

I0812 16:25:22.672234   71528 provider_plugin.go:197] Provider[kubernetes, 0xc420d76be0].Configure() failed: err=Unable to read kubectl config: invalid configuration: no con
I0812 16:25:22.686537   71528 plan_executor.go:209] PlanExecutor.handleSingleEvent(...): received step event error: Unable to read kubectl config: invalid configuration: no configuration has been provided

@lukehoban lukehoban added the p0 Bugs severe enough to interrupt existing work label Aug 13, 2018
@lukehoban (Member)

FWIW, I've also hit this on every error (I think everything so far originates from the resource provider).

lukehoban added a commit to pulumi/pulumi-kubernetes that referenced this issue Aug 13, 2018
This test is failing in Travis, and is not debuggable due to pulumi/pulumi#1762. Disabling for now.

Tracking reenabling this with #130.
@swgillespie (Contributor)

There are potentially two different issues here. The first is that the provider registry's implementation of Check returns an error when Configure or CheckConfig fails, instead of populating the failures field of CheckResponse. This leads us to drop those errors (see also #1712, since it is basically impossible today to reason about who logs what error).
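
For concreteness, here is a minimal sketch of the shape that fix could take, with hypothetical stand-in types (registry, CheckResponse, and CheckFailure here are illustrative, not the engine's actual signatures): a Configure/CheckConfig failure gets surfaced through the failures field, which the display layer already knows how to render against the resource, rather than returned as a bare error the caller may drop.

package sketch

import "fmt"

// Illustrative stand-ins for the engine's types; only the error-routing
// discipline is the point here.
type CheckFailure struct {
	Property string // offending property, if known
	Reason   string // human-readable explanation
}

type CheckResponse struct {
	Inputs   map[string]interface{}
	Failures []CheckFailure
}

type registry struct {
	configureErr error // error remembered from an earlier Configure/CheckConfig
}

// Check surfaces a prior configuration failure as a structured check failure
// in the response, instead of returning a bare error the step generator might
// silently discard.
func (r *registry) Check(news map[string]interface{}) (CheckResponse, error) {
	if r.configureErr != nil {
		return CheckResponse{
			Failures: []CheckFailure{{
				Reason: fmt.Sprintf("provider configuration failed: %v", r.configureErr),
			}},
		}, nil
	}
	return CheckResponse{Inputs: news}, nil
}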

I'm not sure what's going on with Joe's original Azure example. Looks like it's probably a similar issue.

Either way, a band-aid to make this much better until we tackle #1712 in M17 shouldn't be too bad here.

@swgillespie (Contributor)

It seems weird to me that the azure implementation of Check returns a legitimate error in Joe's example rather than a non-error response with the failures field set. That seems to be what's confusing the step generator here.

@lukehoban (Member) commented Aug 13, 2018

I have a feeling this isn't just about Check. I saw this for several failures coming from the provider, which I'm fairly certain were errors from Update or Create operations as well.

I also hit this for both Azure and Kubernetes this afternoon.

@swgillespie (Contributor)

Here's the state of our error reporting:

  1. Failures in Check are (supposed to be) reported to the CLI by the step generator. The step generator may return errors that, today, may or may not have already been logged. As this bug shows, the plan executor doesn't log the errors coming out of the step generator, so they can be dropped.
  2. Failures in step applications are logged by post-step events. This hasn't been changed recently, so I would be surprised if there were issues here.
  3. Errors coming from step execution other than the application of the steps themselves (e.g. pre and post-step callbacks returned errors, we registered the same URN twice) are logged by the step executor.

The de facto contract that the step generator has with the plan executor is that it has already done all error logging that needs to be done before returning an error. We should strive to centralize our error reporting logic in one place, either in the step generator or in the plan executor. In practice this would mean that the step generator always logs every single error that it sees, or that the plan executor logs every error that comes out of step generation.
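
A minimal sketch of the second option, again with hypothetical names (planExecutor, stepGenerator, and the log call are illustrative, not the engine's real API): the plan executor logs every error that comes out of step generation exactly once before propagating it, so nothing can fall on the floor regardless of what the step generator did.

package sketch

import "log"

// Hypothetical stand-ins; only the logging discipline matters here.
type step struct{ op string }

type stepGenerator interface {
	GenerateSteps(ev interface{}) ([]step, error)
}

type stepExecutor interface {
	Execute(steps []step) error
}

type planExecutor struct {
	gen  stepGenerator
	exec stepExecutor
}

// handleSingleEvent centralizes error reporting: any error from step
// generation is logged here, exactly once, before being returned, so the
// step generator no longer carries an implicit "I already logged it"
// contract with its caller.
func (pe *planExecutor) handleSingleEvent(ev interface{}) error {
	steps, err := pe.gen.GenerateSteps(ev)
	if err != nil {
		log.Printf("error: step generation failed: %v", err)
		return err
	}
	return pe.exec.Execute(steps)
}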

@chrsmith (Contributor)

While we do need to focus on the CLI error-reporting issue, the reason the service fails the call is that it looks like we are being sent a duplicate metadata key with the update. We'll probably still want to fail the request with a 400, but we can provide a better error message from the service using the expanded ErrorResponse type.
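
A sketch of what that service-side validation might look like (hypothetical, not the actual service code). The update payload in the first comment carries both azure-ts-aks:config:clientSecret and azure-ts-aks:config:clientsecret, so assuming the collision is a case-insensitive duplicate of that sort, a check that answers with a 400 and a concrete message would replace the opaque 500:

package sketch

import (
	"fmt"
	"net/http"
	"strings"
)

// validateConfigKeys rejects payloads whose keys collide when compared
// case-insensitively, writing a 400 with a descriptive message instead of
// letting the collision surface later as a 500.
func validateConfigKeys(w http.ResponseWriter, cfg map[string]string) bool {
	seen := make(map[string]string) // lower-cased key -> original spelling
	for key := range cfg {
		lower := strings.ToLower(key)
		if prev, ok := seen[lower]; ok {
			msg := fmt.Sprintf("duplicate config key: %q conflicts with %q", key, prev)
			http.Error(w, msg, http.StatusBadRequest)
			return false
		}
		seen[lower] = key
	}
	return true
}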

joeduffy pushed a commit to pulumi/pulumi-kubernetes that referenced this issue Aug 13, 2018
This test is failing in Travis, and is not debuggable due to pulumi/pulumi#1762. Disabling for now.

Tracking reenabling this with #130.
swgillespie added a commit that referenced this issue Aug 15, 2018
Similar to #1762, fixes #1775. The language
host can fail without issuing any diagnostics and it is very unclear
what happens if the engine does not log the error.
swgillespie added a commit that referenced this issue Aug 20, 2018
* Log errors coming from the language host

Similar to #1762, fixes #1775. The language
host can fail without issuing any diagnostics and it is very unclear
what happens if the engine does not log the error.

* CR feedback