x/build: set up alerts for errors reported to stackdriver #21315

Open
adams-sarah opened this issue Aug 4, 2017 · 8 comments
@adams-sarah adams-sarah commented Aug 4, 2017

Currently we get a bunch of errors reported, e.g.:

8/4/17 7:53AM: buildID: B32739cd61, name: windows-amd64-2016, hostType: host-windows-amd64-2016, error: failed to get a buildlet: Error creating instance: &{Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' exceeded. Limit: 500.0 ForceSendFields:[] NullFields:[]}

8/4/17 7:53AM: buildID: B059b7649c, name: openbsd-amd64-60, hostType: host-openbsd-amd64-60, error: failed to get a buildlet: Failed to create instance: googleapi: Error 403: Quota 'CPUS' exceeded. Limit: 500.0, quotaExceeded
at main.(*buildStatus).reportErr (coordinator.go:1951)

... etc

Probably not a pressing issue, as the coordinator's retry logic is at least enough to ensure the builds eventually run.

EDIT:
we need to figure out which errors are noise, and alert on the rest.
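One way to separate noise from alertable errors is to match reported messages against a list of known-noise markers. The sketch below is only an illustration: the two quota markers come from the log lines quoted above, while `noiseMarkers`, `isNoise`, and `alertable` are hypothetical names, not part of the coordinator.

```go
package main

import (
	"fmt"
	"strings"
)

// noiseMarkers lists substrings that mark an error as expected noise.
// The two quota markers are copied from the log lines in this issue;
// the list itself is an assumption, not real coordinator configuration.
var noiseMarkers = []string{
	"QUOTA_EXCEEDED",
	"quotaExceeded",
}

// isNoise reports whether msg contains any known noise marker.
func isNoise(msg string) bool {
	for _, m := range noiseMarkers {
		if strings.Contains(msg, m) {
			return true
		}
	}
	return false
}

// alertable returns the messages that are not classified as noise.
func alertable(msgs []string) []string {
	var out []string
	for _, m := range msgs {
		if !isNoise(m) {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	msgs := []string{
		"failed to get a buildlet: Failed to create instance: googleapi: Error 403: Quota 'CPUS' exceeded. Limit: 500.0, quotaExceeded",
		"failed to get a buildlet: connection refused", // hypothetical non-quota error
	}
	for _, m := range alertable(msgs) {
		fmt.Println("ALERT:", m)
	}
}
```

Substring matching is crude; matching on a structured error code would be more robust if the reporting path preserves one.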

@gopherbot gopherbot added this to the Unreleased milestone Aug 4, 2017
@gopherbot gopherbot added the Builders label Aug 4, 2017

@adams-sarah adams-sarah commented Aug 4, 2017

Though perhaps we should think about paying for more cpus.


@bradfitz bradfitz commented Aug 4, 2017

Yeah, that's not an alertable issue.

That's normal during spikes of load and doesn't affect users or builds.

The coordinator tries to do its own throttling and dead reckoning of quota, but there seems to be some lag of the reported quota from GCP which causes our internal usage accounting & usage semaphore to get off.
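To illustrate how local accounting and a lagging quota report can disagree, here is a minimal sketch of a CPU budget guarded by a mutex. It is not the coordinator's code: `cpuBudget`, `tryAcquire`, `release`, and `reconcile` are hypothetical names, and the limit of 500 is taken from the error messages above.

```go
package main

import (
	"fmt"
	"sync"
)

// cpuBudget tracks how many CPUs we believe are in use against a limit.
// Hypothetical type for illustration only.
type cpuBudget struct {
	mu    sync.Mutex
	limit int
	used  int
}

func newCPUBudget(limit int) *cpuBudget { return &cpuBudget{limit: limit} }

// tryAcquire reserves n CPUs if that keeps local usage within the limit.
func (b *cpuBudget) tryAcquire(n int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used+n > b.limit {
		return false
	}
	b.used += n
	return true
}

// release returns n CPUs to the budget, never going below zero.
func (b *cpuBudget) release(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used -= n
	if b.used < 0 {
		b.used = 0
	}
}

// reconcile overwrites local usage with an externally reported figure.
// If that figure lags behind reality, later tryAcquire calls can succeed
// locally while the cloud API still rejects the request.
func (b *cpuBudget) reconcile(reported int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used = reported
}

func main() {
	b := newCPUBudget(500)
	fmt.Println(b.tryAcquire(496)) // fits within the limit
	fmt.Println(b.tryAcquire(8))   // would exceed the limit locally
	b.reconcile(400)               // stale report: real usage may still be 496
	fmt.Println(b.tryAcquire(8))   // allowed locally, may fail remotely
}
```

The last call shows the failure mode described above: the local semaphore admits the request, and the quota error only surfaces when the instance is created.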

> Though perhaps we should think about paying for more cpus.

It's just a simple ticket to bump the quota, but it won't really help. During a few commits, we will almost always have more work to do than CPUs, regardless of our CPU limit. It's fine to hit the limit for a few minutes sometimes.

Better would be to finish the scheduler: #19178


@adams-sarah adams-sarah commented Aug 4, 2017

sounds good.

@adams-sarah adams-sarah closed this Aug 4, 2017

@bradfitz bradfitz commented Aug 4, 2017

I think you should keep this bug open to fix your logging/alerting.

You shouldn't get false alarms.


@adams-sarah adams-sarah commented Aug 4, 2017

ah, no alerts generated. just error logged.
actually, i haven't set any alerts at all. still trying to get a feel for what is normal and what is not.
ok, i'll modify the bug to address this directly (decide what to alert on and set up the alerts) and re-open. sound good?

we should do something similar for the # reverse buildlet alerting stuff. b/c it's just noise ATM.
i can tackle that today. cc @shantuo

@adams-sarah adams-sarah reopened this Aug 4, 2017
@adams-sarah adams-sarah changed the title x/build: frequent "cpu quota exceeded" errors from GCP x/build: set up alerts for errors reported to stackdriver Aug 4, 2017

@gopherbot gopherbot commented Aug 4, 2017

Change https://golang.org/cl/53353 mentions this issue: cmd/buildlet: normalize macstadium host names for monitoring

gopherbot pushed a commit to golang/build that referenced this issue Aug 5, 2017

In prep for better alerting when dedicated (reverse) buildlets
disappear, normalize the MacStadium host names to remove the
extraneous guest OS version from them, so we can track the host's
last-seen time reliably over time, even as the guest OS version
changes.

This CL also cleans up makemac while it's there and fixes some bugs
and adds some paranoia checks and cleans up logging and adds an HTTP
status handler.

A future change will improve coordinator monitoring of reverse
buildlets.

Updates golang/go#21315

Change-Id: I3d09168cc91f37715b65ae2924a1642401e18808
Reviewed-on: https://go-review.googlesource.com/53353
Reviewed-by: Jessie Frazelle <me@jessfraz.com>
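
The idea in the commit message above, keying last-seen times on a host name with the guest OS version stripped, might look roughly like the sketch below. The name format (`mac-host01-os10_12`) is invented for illustration and is not the real MacStadium naming scheme; `normalizeHost`, `lastSeen`, `touch`, and `stale` are likewise hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// normalizeHost strips a trailing "-os<version>" suffix.
// The suffix format is an assumption made for this example.
func normalizeHost(name string) string {
	if i := strings.LastIndex(name, "-os"); i >= 0 {
		return name[:i]
	}
	return name
}

// lastSeen records when each normalized host last checked in.
// It is not safe for concurrent use.
type lastSeen struct {
	seen map[string]time.Time
}

func newLastSeen() *lastSeen {
	return &lastSeen{seen: make(map[string]time.Time)}
}

// touch records that the host with the given raw name was seen at now.
func (l *lastSeen) touch(name string, now time.Time) {
	l.seen[normalizeHost(name)] = now
}

// stale returns, in sorted order, the hosts not seen within maxAge of now.
func (l *lastSeen) stale(now time.Time, maxAge time.Duration) []string {
	var out []string
	for host, t := range l.seen {
		if now.Sub(t) > maxAge {
			out = append(out, host)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	l := newLastSeen()
	start := time.Now()
	l.touch("mac-host01-os10_12", start)
	l.touch("mac-host02-os10_12", start)
	// Same physical host after a guest OS upgrade: updates the same entry.
	l.touch("mac-host01-os10_13", start.Add(10*time.Minute))
	// Only mac-host02 has gone more than 10 minutes without checking in.
	fmt.Println(l.stale(start.Add(15*time.Minute), 10*time.Minute))
}
```

Without normalization, the upgraded guest would appear as a new host and the old name would eventually trigger a false "host disappeared" alert.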

@gopherbot gopherbot commented Feb 22, 2018

Change https://golang.org/cl/96416 mentions this issue: devapp: start of status handler for monitoring

gopherbot pushed a commit to golang/build that referenced this issue Feb 23, 2018
Also bump Go from 1.8 to 1.10, and change how the static binary is
built to avoid warnings during link.

Updates golang/go#21315
Updates golang/go#22603

Change-Id: I426491d48f787a77cb3eea4dff4d11f474236548
Reviewed-on: https://go-review.googlesource.com/96416
Reviewed-by: Andrew Bonventre <andybons@golang.org>

@gopherbot gopherbot commented Feb 27, 2018

Change https://golang.org/cl/97516 mentions this issue: devapp: revert status changes

gopherbot pushed a commit to golang/build that referenced this issue Feb 27, 2018
This reverts the status changes to devapp from CL 96416

This will go into its own server instead.

Updates golang/go#21315
Updates golang/go#22603

Change-Id: Icb17a5915124241b2ef97a1ee2e9a0e4298784ce
Reviewed-on: https://go-review.googlesource.com/97516
Reviewed-by: Andrew Bonventre <andybons@golang.org>