x/build: set up alerts for errors reported to stackdriver #21315

Open
adams-sarah opened this issue Aug 4, 2017 · 9 comments

@adams-sarah adams-sarah commented Aug 4, 2017

Currently we get a bunch of errors reported, e.g.:

8/4/17 7:53AM: buildID: B32739cd61, name: windows-amd64-2016, hostType: host-windows-amd64-2016, error: failed to get a buildlet: Error creating instance: &{Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' exceeded. Limit: 500.0 ForceSendFields:[] NullFields:[]}

8/4/17 7:53AM: buildID: B059b7649c, name: openbsd-amd64-60, hostType: host-openbsd-amd64-60, error: failed to get a buildlet: Failed to create instance: googleapi: Error 403: Quota 'CPUS' exceeded. Limit: 500.0, quotaExceeded
at main.(*buildStatus).reportErr (coordinator.go:1951)

... etc

Probably not a pressing issue, as the coordinator's retry logic is at least enough to ensure the builds eventually run.

EDIT:
we need to figure out which errors are noise, and alert on the rest.

@gopherbot gopherbot added this to the Unreleased milestone Aug 4, 2017
@gopherbot gopherbot added the Builders label Aug 4, 2017

@adams-sarah adams-sarah commented Aug 4, 2017

Though perhaps we should think about paying for more cpus.

@bradfitz bradfitz commented Aug 4, 2017

Yeah, that's not an alertable issue.

That's normal during spikes of load and doesn't affect users or builds.

The coordinator tries to do its own throttling and dead reckoning of quota, but the quota usage reported by GCP seems to lag, which causes our internal usage accounting and usage semaphore to drift out of sync with reality.

> Though perhaps we should think about paying for more cpus.

It's just a simple ticket to bump the quota, but it won't really help. During a few commits, we will almost always have more work to do than CPUs, regardless of our CPU limit. It's fine to hit the limit for a few minutes sometimes.

Better would be to finish the scheduler: #19178
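
To make the accounting drift described above concrete, here is a minimal sketch of a local CPU budget that is periodically overwritten by a provider-reported figure. It is an illustration only, not the coordinator's code: `cpuQuota`, `tryAcquire`, `release`, `reconcile`, and `errQuotaExceeded` are hypothetical names, and the numbers (other than the 500 CPU limit from the logs) are made up.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errQuotaExceeded is a hypothetical sentinel error for this sketch.
var errQuotaExceeded = errors.New("local CPU budget exhausted")

// cpuQuota tracks how many CPUs this process believes are in use.
type cpuQuota struct {
	mu    sync.Mutex
	limit int
	used  int
}

// tryAcquire reserves n CPUs, or fails without blocking if that would
// exceed the limit.
func (q *cpuQuota) tryAcquire(n int) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.used+n > q.limit {
		return errQuotaExceeded
	}
	q.used += n
	return nil
}

// release returns n CPUs to the budget, never going below zero.
func (q *cpuQuota) release(n int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.used -= n
	if q.used < 0 {
		q.used = 0
	}
}

// reconcile replaces the local estimate with the usage the provider
// reports. If that report lags, the estimate becomes too low.
func (q *cpuQuota) reconcile(reported int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.used = reported
}

func main() {
	q := &cpuQuota{limit: 500}
	fmt.Println(q.tryAcquire(496)) // succeeds: 496 of 500 in use
	fmt.Println(q.tryAcquire(8))   // refused locally: would need 504
	q.reconcile(480)               // stale report: real usage is still 496
	fmt.Println(q.tryAcquire(8))   // succeeds locally; the provider would reject it
}
```

In this sketch the last `tryAcquire` passes the local check even though the real usage would exceed the limit, which is the situation where the provider returns a quota error and the request has to be retried.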

@adams-sarah adams-sarah commented Aug 4, 2017

sounds good.

@adams-sarah adams-sarah closed this Aug 4, 2017

@bradfitz bradfitz commented Aug 4, 2017

I think you should keep this bug open to fix your logging/alerting.

You shouldn't get false alarms.

@adams-sarah adams-sarah commented Aug 4, 2017

ah, no alerts generated. just error logged.
actually, i haven't set any alerts at all. still trying to get a feel for what is normal and what is not.
ok, i'll modify the bug to address this directly (decide what to alert on and set up the alerts) and re-open. sound good?

we should do something similar for the # reverse buildlet alerting stuff. b/c it's just noise ATM.
i can tackle that today. cc @shantuo
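
One way to start separating noise from alertable errors is a simple substring filter over reported error messages. The sketch below is an assumption-laden illustration, not an agreed policy: `noisePatterns`, `isNoise`, and `shouldAlert` are hypothetical names, the two patterns are taken from the quota log lines quoted earlier in this issue, and the second sample message is invented.

```go
package main

import (
	"fmt"
	"strings"
)

// noisePatterns lists substrings of error messages that are treated as
// expected during load spikes. Both entries come from the quota errors
// quoted in this issue; the list itself is an assumption.
var noisePatterns = []string{
	"QUOTA_EXCEEDED",
	"quotaExceeded",
}

// isNoise reports whether msg matches any known-noise pattern.
func isNoise(msg string) bool {
	for _, p := range noisePatterns {
		if strings.Contains(msg, p) {
			return true
		}
	}
	return false
}

// shouldAlert is the inverse of isNoise: alert on everything else.
func shouldAlert(msg string) bool {
	return !isNoise(msg)
}

func main() {
	msgs := []string{
		"failed to get a buildlet: Failed to create instance: googleapi: Error 403: Quota 'CPUS' exceeded. Limit: 500.0, quotaExceeded",
		"failed to get a buildlet: connection refused", // invented example
	}
	for _, m := range msgs {
		fmt.Printf("alert=%v %q\n", shouldAlert(m), m)
	}
}
```

A deny-list like this errs on the side of alerting: anything not yet classified as noise still pages, which fits the "alert on the rest" goal while the set of normal errors is still being worked out.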

@adams-sarah adams-sarah reopened this Aug 4, 2017
@adams-sarah adams-sarah changed the title from "x/build: frequent "cpu quota exceeded" errors from GCP" to "x/build: set up alerts for errors reported to stackdriver" Aug 4, 2017

@gopherbot gopherbot commented Aug 4, 2017

Change https://golang.org/cl/53353 mentions this issue: cmd/buildlet: normalize macstadium host names for monitoring

gopherbot pushed a commit to golang/build that referenced this issue Aug 5, 2017
…ring

In prep for better alerting when dedicated (reverse) buildlets
disappear, normalize the MacStadium host names to remove the
extraneous guest OS version from them, so we can track the host's
last-seen time reliably over time, even as the guest OS version
changes.

This CL also cleans up makemac while it's there and fixes some bugs
and adds some paranoia checks and cleans up logging and adds an HTTP
status handler.

A future change will improve coordinator monitoring of reverse
buildlets.

Updates golang/go#21315

Change-Id: I3d09168cc91f37715b65ae2924a1642401e18808
Reviewed-on: https://go-review.googlesource.com/53353
Reviewed-by: Jessie Frazelle <me@jessfraz.com>

@gopherbot gopherbot commented Feb 22, 2018

Change https://golang.org/cl/96416 mentions this issue: devapp: start of status handler for monitoring

gopherbot pushed a commit to golang/build that referenced this issue Feb 23, 2018
Also bump Go from 1.8 to 1.10, and change how the static binary is
built to avoid warnings during link.

Updates golang/go#21315
Updates golang/go#22603

Change-Id: I426491d48f787a77cb3eea4dff4d11f474236548
Reviewed-on: https://go-review.googlesource.com/96416
Reviewed-by: Andrew Bonventre <andybons@golang.org>

@gopherbot gopherbot commented Feb 27, 2018

Change https://golang.org/cl/97516 mentions this issue: devapp: revert status changes

gopherbot pushed a commit to golang/build that referenced this issue Feb 27, 2018
This reverts the status changes to devapp from CL 96416

This will go into its own server instead.

Updates golang/go#21315
Updates golang/go#22603

Change-Id: Icb17a5915124241b2ef97a1ee2e9a0e4298784ce
Reviewed-on: https://go-review.googlesource.com/97516
Reviewed-by: Andrew Bonventre <andybons@golang.org>

@gopherbot gopherbot commented Dec 10, 2019

Change https://golang.org/cl/210740 mentions this issue: status: delete unused, incomplete status client & server

gopherbot pushed a commit to golang/build that referenced this issue Dec 10, 2019
The plan was to have a stand-alone status server that just kept track
of the last-reported status updates from various things, and then have
components self-report their status, and then the status server would
report a lack of timely updates as a failure.

But this was never completed and a bunch of health checking went into
the coordinator instead, which kinda works well enough.

I still think this project should be revived, but maybe it could use
some existing cloudy product instead.

In any case, I don't see myself finishing this, so I'm going to delete
this unused code for now. Others can resurrect it if desired if/when
they move stuff out of the coordinator.

Updates golang/go#21315
Updates golang/go#22603

Change-Id: I2d3c33358855206ea98283c9871c6e050e931a43
Reviewed-on: https://go-review.googlesource.com/c/build/+/210740
Reviewed-by: Alexander Rakoczy <alex@golang.org>
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>