x/build: set up alerts for errors reported to stackdriver #21315

Open
adams-sarah opened this Issue Aug 4, 2017 · 8 comments

@adams-sarah
Contributor

adams-sarah commented Aug 4, 2017

Currently we get a bunch of errors reported to Stackdriver, e.g.:

8/4/17 7:53AM: buildID: B32739cd61, name: windows-amd64-2016, hostType: host-windows-amd64-2016, error: failed to get a buildlet: Error creating instance: &{Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' exceeded. Limit: 500.0 ForceSendFields:[] NullFields:[]}

8/4/17 7:53AM: buildID: B059b7649c, name: openbsd-amd64-60, hostType: host-openbsd-amd64-60, error: failed to get a buildlet: Failed to create instance: googleapi: Error 403: Quota 'CPUS' exceeded. Limit: 500.0, quotaExceeded
at main.(*buildStatus).reportErr (coordinator.go:1951)

... etc

Probably not a pressing issue, as the coordinator's retry logic is at least enough to ensure the builds eventually run.

EDIT:
we need to figure out which errors are noise, and alert on the rest.
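One way to start separating noise from alertable errors is a simple pattern filter in front of the alerting path. The sketch below is only an illustration, not code from x/build: `noisePatterns` and `isNoise` are hypothetical names, and the only pattern listed is the quota message from the logs above.

```go
package main

import (
	"fmt"
	"strings"
)

// noisePatterns lists substrings of error messages treated as expected noise.
// Hypothetical list: only the quota message comes from the logs in this issue.
var noisePatterns = []string{
	"Quota 'CPUS' exceeded",
}

// isNoise reports whether msg contains any known-noisy pattern.
func isNoise(msg string) bool {
	for _, p := range noisePatterns {
		if strings.Contains(msg, p) {
			return true
		}
	}
	return false
}

func main() {
	msgs := []string{
		"failed to get a buildlet: Failed to create instance: googleapi: Error 403: Quota 'CPUS' exceeded. Limit: 500.0, quotaExceeded",
		"failed to get a buildlet: some other failure", // made-up example
	}
	for _, m := range msgs {
		if isNoise(m) {
			fmt.Println("noise:", m)
		} else {
			fmt.Println("alert:", m)
		}
	}
}
```

Anything that does not match a listed pattern falls through to "alert", so new error types are surfaced by default until someone decides they are noise.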

@gopherbot gopherbot added this to the Unreleased milestone Aug 4, 2017

@gopherbot gopherbot added the Builders label Aug 4, 2017

@adams-sarah

Contributor

adams-sarah commented Aug 4, 2017

Though perhaps we should think about paying for more cpus.

@bradfitz

Member

bradfitz commented Aug 4, 2017

Yeah, that's not an alertable issue.

That's normal during spikes of load and doesn't affect users or builds.

The coordinator tries to do its own throttling and dead reckoning of quota, but there seems to be some lag of the reported quota from GCP which causes our internal usage accounting & usage semaphore to get off.
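To illustrate how local accounting can drift from the provider's view, here is a minimal sketch of a quota semaphore. It is not the coordinator's actual code: `cpuBudget` and its methods are hypothetical names, and the numbers are made up around the 500-CPU limit in the logs.

```go
package main

import (
	"fmt"
	"sync"
)

// cpuBudget is a hypothetical local estimate of CPU quota usage.
type cpuBudget struct {
	mu    sync.Mutex
	cond  *sync.Cond
	limit int
	used  int
}

func newCPUBudget(limit int) *cpuBudget {
	b := &cpuBudget{limit: limit}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// tryAcquire reserves n CPUs if the local estimate says they fit.
func (b *cpuBudget) tryAcquire(n int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used+n > b.limit {
		return false
	}
	b.used += n
	return true
}

// acquire blocks until n CPUs fit under the limit.
// It never returns if n is larger than the limit itself.
func (b *cpuBudget) acquire(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.used+n > b.limit {
		b.cond.Wait()
	}
	b.used += n
}

// release returns n CPUs to the budget and wakes any waiters.
func (b *cpuBudget) release(n int) {
	b.mu.Lock()
	b.used -= n
	b.mu.Unlock()
	b.cond.Broadcast()
}

// resync overwrites the local estimate with usage reported by the provider.
// If that report lags, the estimate ends up lower than real usage.
func (b *cpuBudget) resync(reported int) {
	b.mu.Lock()
	b.used = reported
	b.mu.Unlock()
	b.cond.Broadcast()
}

func main() {
	b := newCPUBudget(500)
	fmt.Println(b.tryAcquire(496)) // true: fits under the limit
	fmt.Println(b.tryAcquire(8))   // false: would exceed the limit
	b.resync(400)                  // stale report that predates the 496-CPU reservation
	fmt.Println(b.tryAcquire(8))   // true locally, though the provider may still reject it
}
```

The last call is the interesting one: the local semaphore admits the request, and a quota error from the provider is then the expected outcome rather than a fault.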

> Though perhaps we should think about paying for more cpus.

It's just a simple ticket to bump the quota, but it won't really help. During a few commits, we will almost always have more work to do than CPUs, regardless of our CPU limit. It's fine to hit the limit for a few minutes sometimes.

Better would be to finish the scheduler: #19178

@adams-sarah

Contributor

adams-sarah commented Aug 4, 2017

sounds good.

@adams-sarah adams-sarah closed this Aug 4, 2017

@bradfitz

Member

bradfitz commented Aug 4, 2017

I think you should keep this bug open to fix your logging/alerting.

You shouldn't get false alarms.

@adams-sarah

Contributor

adams-sarah commented Aug 4, 2017

ah, no alerts generated. just error logged.
actually, i haven't set any alerts at all. still trying to get a feel for what is normal and what is not.
ok, i'll modify the bug to address this directly (decide what to alert on and set up the alerts) and re-open. sound good?

we should do something similar for the alerting on the number of reverse buildlets, b/c it's just noise ATM.
i can tackle that today. cc @shantuo

@adams-sarah adams-sarah reopened this Aug 4, 2017

@adams-sarah adams-sarah changed the title from x/build: frequent "cpu quota exceeded" errors from GCP to x/build: set up alerts for errors reported to stackdriver Aug 4, 2017

@gopherbot

gopherbot commented Aug 4, 2017

Change https://golang.org/cl/53353 mentions this issue: cmd/buildlet: normalize macstadium host names for monitoring

gopherbot pushed a commit to golang/build that referenced this issue Aug 5, 2017

cmd/buildlet, cmd/makemac: normalize macstadium host names for monitoring

In prep for better alerting when dedicated (reverse) buildlets
disappear, normalize the MacStadium host names to remove the
extraneous guest OS version from them, so we can track the host's
last-seen time reliably over time, even as the guest OS version
changes.

This CL also cleans up makemac while it's there and fixes some bugs
and adds some paranoia checks and cleans up logging and adds an HTTP
status handler.

A future change will improve coordinator monitoring of reverse
buildlets.

Updates golang/go#21315

Change-Id: I3d09168cc91f37715b65ae2924a1642401e18808
Reviewed-on: https://go-review.googlesource.com/53353
Reviewed-by: Jessie Frazelle <me@jessfraz.com>
@gopherbot

gopherbot commented Feb 22, 2018

Change https://golang.org/cl/96416 mentions this issue: devapp: start of status handler for monitoring

gopherbot pushed a commit to golang/build that referenced this issue Feb 23, 2018

devapp, status: start of status handler for monitoring
Also bump Go from 1.8 to 1.10, and change how the static binary is
built to avoid warnings during link.

Updates golang/go#21315
Updates golang/go#22603

Change-Id: I426491d48f787a77cb3eea4dff4d11f474236548
Reviewed-on: https://go-review.googlesource.com/96416
Reviewed-by: Andrew Bonventre <andybons@golang.org>
@gopherbot

gopherbot commented Feb 27, 2018

Change https://golang.org/cl/97516 mentions this issue: devapp: revert status changes

gopherbot pushed a commit to golang/build that referenced this issue Feb 27, 2018

devapp: revert status changes
This reverts the status changes to devapp from CL 96416

This will go into its own server instead.

Updates golang/go#21315
Updates golang/go#22603

Change-Id: Icb17a5915124241b2ef97a1ee2e9a0e4298784ce
Reviewed-on: https://go-review.googlesource.com/97516
Reviewed-by: Andrew Bonventre <andybons@golang.org>