Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure SD Failure metric and 404 Handling #10476

Merged
merged 7 commits into from Mar 31, 2022
Merged

Azure SD Failure metric and 404 Handling #10476

merged 7 commits into from Mar 31, 2022

Conversation

David-N-Perkins
Copy link
Contributor

@David-N-Perkins David-N-Perkins commented Mar 22, 2022

For issue 10455. In Azure service discovery:

  • Added a metric to count failures
  • When looking up Nics, for 404 responses log and continue instead of returning an error

No existing unit tests, so needs manual testing.

David N Perkins added 3 commits March 22, 2022 13:28
…okup

Signed-off-by: David N Perkins <David.N.Perkins@ibm.com>
Signed-off-by: David N Perkins <David.N.Perkins@ibm.com>
Signed-off-by: David N Perkins <David.N.Perkins@ibm.com>
@David-N-Perkins
Copy link
Contributor Author

@roidelapluie Please review when you have the time.

@@ -539,7 +555,7 @@ func mapFromVMScaleSetVM(vm compute.VirtualMachineScaleSetVM, scaleSetName strin
}
}

func (client *azureClient) getNetworkInterfaceByID(ctx context.Context, networkInterfaceID string) (*network.Interface, error) {
func (client *azureClient) getNetworkInterfaceByID(ctx context.Context, networkInterfaceID string) (*network.Interface, *autorest.DetailedError) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions/methods should always return errors as the abstract error interface.

Suggested change
func (client *azureClient) getNetworkInterfaceByID(ctx context.Context, networkInterfaceID string) (*network.Interface, *autorest.DetailedError) {
func (client *azureClient) getNetworkInterfaceByID(ctx context.Context, networkInterfaceID string) (*network.Interface, error) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any advantages to doing this and several drawbacks.

  1. The calling code has to use errors.As() to try and get a DetailedError, which would require another failure check.
  2. There's also no indication from the method's signatures that a DetailedError would get returned. That information gets hidden to the caller. This makes code maintenance and future development harder.

I see how this would make sense if several different errors could be returned. Am I missing something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Go, fallible functions return error and never concrete types. The complexities that arise from concrete error return types are, well, far too numerous to put into a GitHub PR comment 😉

The calling code has to use errors.As()

This is standard operating procedure.

There's also no indication from the method's signatures that a DetailedError would get returned.

It's expected that this kind of thing is captured in method documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any references that go into further details?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we narrow down the error to ignore in getNetworkInterfaceByID? This particular error would ignorer 404 in 3 different calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If callers need to access specific details from the error, then you can return an error type in the function...

    case code was 404:
        return nil, &DetailedError{...}

And use errors.As at the callsite...

var detailedErr *DetailedError
iface, err := c.getNetworkInterfaceByID(ctx, id)
switch {
case err == nil:
    // handle success case
case errors.As(err, &detailedErr):
    // handle DetailedError case
case err != nil:
    // handle default error case
}

Or...

if iface, err := c.getNetworkInterfaceByID(ctx, id); err == nil { 
    // handle success case
} else if detailedErr := &(DetailedError{}); errors.As(err, &detailedErr) {
    // handle DetailedError case
} else if err != nil {
    // handle default error case
}

ref

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I found similar patterns when researching custom errors after your first post. I asked, because I didn't see any advantage to creating custom errors over the current implementation of just returning a DetailedError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s a good idea for functions that return errors always to use the error type in their signature (as we did above) rather than a concrete type such as *MyError, to help guarantee the error is created correctly. As an example, os.Open returns an error even though, if not nil, it’s always of concrete type *os.PathError.

reference, reference, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the references. They were very helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this method to return the error interface with different concrete types. The 404 check is performed here instead of by the caller.

Comment on lines 364 to 365
if detailedErr != nil {
if detailedErr.StatusCode == 404 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need to access elements of a specific error type, use errors.As.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Signed-off-by: David N Perkins <David.N.Perkins@ibm.com>
@gburton1
Copy link

We built and deployed the version in this PR. The results looked good. The happy path with no issues works fine, and we forced an error condition, and the failedCount metric is working as expected.

2022-03-28T22:22:02.970028236Z stderr F ts=2022-03-28T22:22:02.969Z caller=refresh.go:98 level=error component="discovery manager scrape" discovery=azure msg="Unable to refresh target groups" err="could not get virtual machines: could not list virtual machines: compute.VirtualMachinesClient#ListAll: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code=\"InvalidSubscriptionId\" Message=\"The provided subscription identifier '4d' is malformed or invalid.\""

Screen Shot 2022-03-28 at 5 26 02 PM

David N Perkins added 2 commits March 29, 2022 16:48
@roidelapluie
Copy link
Member

@peterbourgon do you want to give this a last review?

@peterbourgon
Copy link
Contributor

LGTM

Copy link
Member

@roidelapluie roidelapluie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Errors should not start with an uppercase, that is my last nit. Thanks @peterbourgon for the extra review.

@@ -561,6 +577,11 @@ func mapFromVMScaleSetVM(vm compute.VirtualMachineScaleSetVM, scaleSetName strin
}
}

var errorNotFound = errors.New("Network interface does not exist")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var errorNotFound = errors.New("Network interface does not exist")
var errorNotFound = errors.New("network interface does not exist")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Signed-off-by: David N Perkins <David.N.Perkins@ibm.com>
@roidelapluie roidelapluie enabled auto-merge (squash) March 31, 2022 12:29
@roidelapluie
Copy link
Member

Thanks!!

@roidelapluie roidelapluie merged commit ed0c682 into prometheus:main Mar 31, 2022
@David-N-Perkins David-N-Perkins deleted the azure-sd-failure-counter branch March 31, 2022 13:10
xinau pushed a commit to xinau/prometheus that referenced this pull request Apr 1, 2022
* For Azure sd, added failure counter and skipping of 404's from Nic lookup

Signed-off-by: David N Perkins <David.N.Perkins@ibm.com>
tjhop added a commit to tjhop/prometheus that referenced this pull request May 9, 2022
This commit introduces a new metric to count the number of failed
requests to Linode's API when using Linode SD. Resolves prometheus#10672, inspired
by prometheus#10476.

_Note_: this doens't count failures when polling the `/account/events`
endpoint, as a `401` there is how we determine if the supplied token has
the needed API scopes to do event polling vs full refreshes each
interval.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
tjhop added a commit to tjhop/prometheus that referenced this pull request May 10, 2022
This commit introduces a new metric to count the number of failed
requests to Linode's API when using Linode SD. Resolves prometheus#10672, inspired
by prometheus#10476.

_Note_: this doens't count failures when polling the `/account/events`
endpoint, as a `401` there is how we determine if the supplied token has
the needed API scopes to do event polling vs full refreshes each
interval.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
roidelapluie pushed a commit that referenced this pull request May 11, 2022
This commit introduces a new metric to count the number of failed
requests to Linode's API when using Linode SD. Resolves #10672, inspired
by #10476.

_Note_: this doens't count failures when polling the `/account/events`
endpoint, as a `401` there is how we determine if the supplied token has
the needed API scopes to do event polling vs full refreshes each
interval.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
roidelapluie pushed a commit to roidelapluie/prometheus that referenced this pull request Jun 22, 2022
This commit introduces a new metric to count the number of failed
requests to Linode's API when using Linode SD. Resolves prometheus#10672, inspired
by prometheus#10476.

_Note_: this doens't count failures when polling the `/account/events`
endpoint, as a `401` there is how we determine if the supplied token has
the needed API scopes to do event polling vs full refreshes each
interval.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants