Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compute: service account-related BulkInsert error not surfaced to user #7876

Closed
hanming-lu opened this issue May 2, 2023 · 18 comments
Closed
Assignees
Labels
api: compute Issues related to the Compute Engine API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@hanming-lu
Copy link

Client
compute "cloud.google.com/go/compute/apiv1"
computepb "cloud.google.com/go/compute/apiv1/computepb"

instanceClient := compute.NewInstancesRESTClient(context.TODO())
instanceClient.BulkInsert()

Go Environment

$ go version
go version go1.20.3 linux/amd64
$ go env
Not relevant

Code

e.g.

package main

func main() {
         input := &computepb.BulkInsertInstanceRequest{}
	op, err1 := adpt.gceClient.BulkInsert(ctx, input)
        err2 := op.Wait(ctx)
        // instances launched == 0
        // err1 == nil and err2 == nil
}

Expected behavior

  • when error is service account-related (e.g. not found), then either err1 != nil or err2 != nil

Actual behavior

  • when error is service account-related (e.g. not found), then instances launched == 0, err1 == nil and err2 == nil
  • verified by using Google Logs Explorer and found the error "VM_MIN_COUNT_NOT_REACHED,EXTERNAL_RESOURCE_NOT_FOUND"
@hanming-lu hanming-lu added the triage me I really want to be triaged. label May 2, 2023
@product-auto-label product-auto-label bot added the api: compute Issues related to the Compute Engine API. label May 2, 2023
@hanming-lu hanming-lu changed the title compute: BulkInsert error not surfaced to user compute: service account-related BulkInsert error not surfaced to user May 2, 2023
@ijrsvt
Copy link

ijrsvt commented May 3, 2023

It looks like Compute Engine uses its own Operations client, not the longrunning one. The longrunning client correctly inspects the returned proto to check if it is an error type[1]. The Compute Engine Operations Client silently passes this to the .proto field, without inspecting[2]; I'm guessing this can be fixed by just checking if h.proto.Error != nil?

[1]

switch r := op.proto.Result.(type) {

[2]

@noahdietz
Copy link
Contributor

@ijrsvt you're right, but we can't use your suggestion directly. I've filed googleapis/gapic-generator-go#1320 to fix this in our client generator.

We can't use h.proto.Error != nil directly because the Compute Operation.error field is "non standard" vs. other APIs. The http_error_message and http_error_status_code fields, however, are simple to consume and annotated as the "error code" and "error message" fields. Other non-standard APIs will have similar such fields annotated in the same way. We can use these to generically construct an error type.

@hanming-lu in the mean time, you can do something similar to what @ijrsvt suggests and check the error via op.Proto().GetHttpErrorStatusCode() != http.StatusOK. I know this isn't ideal and we will work to get this rectified ASAP.

@ijrsvt
Copy link

ijrsvt commented May 3, 2023

Thanks for the quick follow up @noahdietz, could you elaborate on what you mean by non-standard? Are there other APIs that have this property?

@noahdietz
Copy link
Contributor

Well, you nailed it when you pointed out Compute Engine uses its own Operations client, not the longrunning one. :) This is getting in the weeds, and frankly as user, you shouldn't have to care about this difference, but because pretty much every other Cloud API uses LRO, we are able to generate highly regular code, especially when that code requires inspecting the response in some way (e.g. checking if the Operation.result is an error). Since Compute spun its own Operations pattern, with its own types and services, we had to start annotating the schema with indicators of what each element means relative to something like LRO.

So, this was just a mistake on our part while implementing generator support. We left out this admittedly crucial piece.

@noahdietz noahdietz added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels May 3, 2023
@ijrsvt
Copy link

ijrsvt commented May 3, 2023

@noahdietz, can you comment on the difference between HttpErrorMessage() & Errors().Error[0].Message()? It looks like the Message from inside the Errors field provides more context than the other one:

Code:

fmt.Printf("GetHttpErrorMessage: %v\nGetHttpErrorStatusCode: %v\n", res.GetHttpErrorMessage(), res.GetHttpErrorStatusCode())
fmt.Printf("GetError: %v\n", res.GetError().Errors)
fmt.Printf("GetError.Message: %v\n", res.GetError().Errors[0].GetMessage())

Produces:

GetHttpErrorMessage: BAD REQUEST
GetHttpErrorStatusCode: 400
GetError: [code:"SERVICE_ACCOUNT_ACCESS_DENIED" message:"The user does not have access to service account '<SNIP>'.  User: 'ian@'.  Ask a project owner to grant you the iam.serviceAccountUser role on the service account"]
GetError.Message: The user does not have access to service account '<SNIP>'.  User: 'ian@'.  Ask a project owner to grant you the iam.serviceAccountUser role on the service account

@noahdietz
Copy link
Contributor

@ijrsvt I don't work on the Compute API, but I'm not surprised that this is the case. I already documented that potential issue in the issue I filed on the generator - the first step is returning an error in the first place, the second is figuring out how to give the most descriptive error possible (or at least make that information readily available). Fundamentally, the Compute Error message is a bespoke structure that we can't generate generic code to unpack.

@krousey
Copy link

krousey commented May 4, 2023

I just hit this today while trying to delete a network that still had a subnetwork in it, and the error was undetected. I've noticed that the workaround of checking the http status doesn't work if the operation succeeded (at least in the network deletion case) because that field is not present in the result. So you have to check if it's there at all, and then check it against http.StatusOK.

Here's my temporary workaround:

// ComputeOperationError is a wrapper around a compute.Operation that has
// errored out.
//
// TODO (https://github.com/googleapis/google-cloud-go/issues/7876): Remove once this is fixed.
type ComputeOperationError struct {
	Op *computepb.Operation
}

func (e *ComputeOperationError) Error() string {
	if e.Op.GetStatus() != computepb.Operation_DONE {
		return fmt.Sprintf("operation %q is not done", e.Op.GetName())
	}
	if len(e.Op.GetError().GetErrors()) > 0 && len(e.Op.GetError().GetErrors()[0].GetMessage()) > 0 {
		return e.Op.GetError().Errors[0].GetMessage()
	}
	return e.Op.GetHttpErrorMessage()
}

// ErrorFromComputeOperation returns an error if the compute.Operation has
// errored out. Otherwise, it returns nil.
func ErrorFromComputeOperation(op *compute.Operation) error {
	if op.Proto().HttpErrorStatusCode != nil && op.Proto().GetHttpErrorStatusCode() != http.StatusOK {
		return &ComputeOperationError{op.Proto()}
	}
	return nil
}

@noahdietz
Copy link
Contributor

Thanks @krousey you're right, my mistake, I spend so much time in gRPC where Status OK is 0, which is the default/unset value if the the field is not set. That's my bad. Thanks for the code snippet!

@krousey
Copy link

krousey commented May 4, 2023

Is there a gRPC version of the compute API?

@noahdietz
Copy link
Contributor

Is there a gRPC version of the compute API?

Nope, but we support all of Cloud in this repo and the vast majority support gRPC or HTTP-JSON. Compute does not, I just wasn't thinking straight.

@noahdietz
Copy link
Contributor

noahdietz commented May 9, 2023

Update: we have fixed the generator to do the following:

  1. Poll (and thus Wait) emits an error if the Operation contains one
  2. the error will be of type apierror.APIError wrapping a googleapi.Error (just like API call errors)
  3. For the time being, the googleapi.Error.message will include the details found in Operation.error so that at the very least the extra details are included in logs

Note: We are working on figuring out how to unpack the error detail info in the Operation into the structured error details found in an APIError so that users have something to code against concretely

We hope to have the Compute code regenerated soon and a release following that.

@hanming-lu
Copy link
Author

hanming-lu commented May 9, 2023

@noahdietz thank you for the update! May I ask if there is an estimate (i.e. days or weeks or months) on when a release with the fix will happen?

@noahdietz
Copy link
Contributor

I'm hoping we can get it in this week's release, but if not this week then definitely next.

@noahdietz
Copy link
Contributor

noahdietz commented May 11, 2023

Ok so we have the initial changes merged into main here:

if resp.HttpErrorStatusCode != nil && (resp.GetHttpErrorStatusCode() < 200 || resp.GetHttpErrorStatusCode() > 299) {
aErr := &googleapi.Error{
Code: int(resp.GetHttpErrorStatusCode()),
Message: fmt.Sprintf("%s: %v", resp.GetHttpErrorMessage(), resp.GetError()),
}
err, _ := apierror.FromError(aErr)
return err
}

We want to get a test case into our integration tests, and waiting on finding a good error mode to reliably test with.

I'm hoping we can release this code next week. Feel free to try it yourself though if you want to wire up the unreleased dependency.

@noahdietz
Copy link
Contributor

Ok so the above change was released in cloud.google.com/go/compute@1.19.3.

I'd appreciate some verification from folks on this thread that the behavior you expect is fulfilled.

@hanming-lu
Copy link
Author

Thank you for the update! Will take a look later this week.

@hanming-lu
Copy link
Author

Confirming an error is returned, and it looks like below, thank you!

Error 503: SERVICE UNAVAILABLE: errors:{code:"VM_MIN_COUNT_NOT_REACHED"  message:"Requested minimum count of 1 VMs could not be created."}  errors:{code:"SERVICE_ACCOUNT_ACCESS_DENIED"  message:"The user does not have access to service account 'service-account-to-use'.  User: ''.  Ask a project owner to grant you the iam.serviceAccountUser role on the service account"}

@quartzmo
Copy link
Member

Fantastic!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: compute Issues related to the Compute Engine API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

5 participants