x/build: set up alerts for errors reported to stackdriver #21315
Though perhaps we should think about paying for more CPUs.
Yeah, that's not an alertable issue. It's normal during spikes of load and doesn't affect users or builds. The coordinator tries to do its own throttling and dead reckoning of quota, but there seems to be some lag in the quota reported by GCP, which causes our internal usage accounting and usage semaphore to drift out of sync.
It's just a simple ticket to bump the quota, but it won't really help. During a few commits, we will almost always have more work to do than CPUs, regardless of our CPU limit. It's fine to hit the limit for a few minutes sometimes. Better would be to finish the scheduler: #19178
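To illustrate the kind of local usage accounting described above, here is a minimal Go sketch. It is not the coordinator's actual code: the `cpuBudget` type and its methods are hypothetical names, and the only figure taken from this thread is the 500-CPU limit. It shows how a local estimate that is overwritten by a lagging provider report can admit work that the provider then rejects with a quota error.

```go
package main

import (
	"fmt"
	"sync"
)

// cpuBudget is a hypothetical local estimate of CPU quota usage.
// It is an illustration only, not the coordinator's implementation.
type cpuBudget struct {
	mu    sync.Mutex
	limit int // provider quota, e.g. 500 CPUs as in the errors in this issue
	used  int // CPUs we believe are currently in use
}

// tryAcquire reserves n CPUs if the local estimate says they fit.
func (b *cpuBudget) tryAcquire(n int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used+n > b.limit {
		return false
	}
	b.used += n
	return true
}

// release returns n CPUs to the budget, clamping at zero.
func (b *cpuBudget) release(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used -= n
	if b.used < 0 {
		b.used = 0
	}
}

// reconcile replaces the local estimate with the provider's reported usage.
// If that report lags behind reality, the estimate ends up too low and a
// later create call can still fail with a quota error.
func (b *cpuBudget) reconcile(reported int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used = reported
}

func main() {
	b := &cpuBudget{limit: 500}
	fmt.Println(b.tryAcquire(496)) // fits under the limit
	fmt.Println(b.tryAcquire(8))   // rejected locally: 496+8 > 500
	b.reconcile(480)               // stale report undercounts real usage
	fmt.Println(b.tryAcquire(8))   // admitted locally; the provider may still refuse
}
```

In a sketch like this, a quota error after a successful local `tryAcquire` is expected behavior rather than a fault, which is why retrying is enough.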
Sounds good.
I think you should keep this bug open to fix your logging/alerting. You shouldn't get false alarms.
Ah, no alerts were generated; the error was just logged. We should do something similar for the # reverse buildlet alerting stuff, because it's just noise at the moment.
Change https://golang.org/cl/53353 mentions this issue:
…ring
In prep for better alerting when dedicated (reverse) buildlets disappear, normalize the MacStadium host names to remove the extraneous guest OS version from them, so we can track the host's last-seen time reliably over time, even as the guest OS version changes.
This CL also cleans up makemac while it's there, fixes some bugs, adds some paranoia checks, cleans up logging, and adds an HTTP status handler. A future change will improve coordinator monitoring of reverse buildlets.
Updates golang/go#21315
Change-Id: I3d09168cc91f37715b65ae2924a1642401e18808
Reviewed-on: https://go-review.googlesource.com/53353
Reviewed-by: Jessie Frazelle <me@jessfraz.com>
Change https://golang.org/cl/96416 mentions this issue:
Also bump Go from 1.8 to 1.10, and change how the static binary is built to avoid warnings during link.
Updates golang/go#21315
Updates golang/go#22603
Change-Id: I426491d48f787a77cb3eea4dff4d11f474236548
Reviewed-on: https://go-review.googlesource.com/96416
Reviewed-by: Andrew Bonventre <andybons@golang.org>
Change https://golang.org/cl/97516 mentions this issue:
This reverts the status changes to devapp from CL 96416. This will go into its own server instead.
Updates golang/go#21315
Updates golang/go#22603
Change-Id: Icb17a5915124241b2ef97a1ee2e9a0e4298784ce
Reviewed-on: https://go-review.googlesource.com/97516
Reviewed-by: Andrew Bonventre <andybons@golang.org>
Change https://golang.org/cl/210740 mentions this issue:
The plan was to have a stand-alone status server that just kept track of the last-reported status updates from various things, and then have components self-report their status, and then the status server would report a lack of timely updates as a failure. But this was never completed and a bunch of health checking went into the coordinator instead, which kinda works well enough.
I still think this project should be revived, but maybe it could use some existing cloudy product instead. In any case, I don't see myself finishing this, so I'm going to delete this unused code for now. Others can resurrect it if desired if/when they move stuff out of the coordinator.
Updates golang/go#21315
Updates golang/go#22603
Change-Id: I2d3c33358855206ea98283c9871c6e050e931a43
Reviewed-on: https://go-review.googlesource.com/c/build/+/210740
Reviewed-by: Alexander Rakoczy <alex@golang.org>
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
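For readers unfamiliar with the stand-alone status server idea described in CL 210740, the following Go sketch shows the core of it under stated assumptions: components self-report, and anything that has not reported within a deadline is flagged. The `statusTracker` type, its method names, and the five-minute deadline are all hypothetical; this is not the deleted code, only an illustration of the staleness check.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// statusTracker is a hypothetical record of when each component last
// reported in. It is a sketch of the idea, not the removed status server.
type statusTracker struct {
	mu       sync.Mutex
	maxAge   time.Duration        // how long a component may stay silent
	lastSeen map[string]time.Time // component name -> last report time
}

func newStatusTracker(maxAge time.Duration) *statusTracker {
	return &statusTracker{maxAge: maxAge, lastSeen: make(map[string]time.Time)}
}

// report records a self-reported status update from a component.
func (t *statusTracker) report(name string, at time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[name] = at
}

// stale returns, in sorted order, the components whose last report is
// older than maxAge as of now. A lack of timely updates counts as failure.
func (t *statusTracker) stale(now time.Time) []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var out []string
	for name, seen := range t.lastSeen {
		if now.Sub(seen) > t.maxAge {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	start := time.Date(2017, time.August, 4, 7, 53, 0, 0, time.UTC)
	t := newStatusTracker(5 * time.Minute) // assumed deadline
	t.report("makemac", start)
	t.report("coordinator", start.Add(9*time.Minute))
	// Ten minutes after start, only makemac has been silent too long.
	fmt.Println(t.stale(start.Add(10 * time.Minute)))
}
```

Passing the current time in explicitly, rather than calling `time.Now` inside `stale`, keeps the check deterministic and easy to test.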
Currently we get a bunch of errors reported, e.g.:
9/4/17 7:53AM: buildID: B32739cd61, name: windows-amd64-2016, hostType: host-windows-amd64-2016, error: failed to get a buildlet: Error creating instance: &{Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' exceeded. Limit: 500.0 ForceSendFields:[] NullFields:[]}
8/4/17 7:53AM: buildID: B059b7649c, name: openbsd-amd64-60, hostType: host-openbsd-amd64-60, error: failed to get a buildlet: Failed to create instance: googleapi: Error 403: Quota 'CPUS' exceeded. Limit: 500.0, quotaExceeded
at main.(*buildStatus).reportErr (coordinator.go:1951)
... etc
Probably not a pressing issue, as the coordinator's retry logic is at least enough to ensure the builds eventually run.
EDIT: We need to figure out which errors are noise, and alert on the rest.
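One possible starting point for separating noise from alertable errors is a substring filter. The Go sketch below assumes that the quota failures quoted above are the only known noise so far; the function names and the marker list are hypothetical, and matching on message text is a stopgap rather than a robust classification.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// noiseMarkers are substrings copied from the quota errors quoted in this
// issue. Treating them as noise is an assumption based on this thread.
var noiseMarkers = []string{"QUOTA_EXCEEDED", "quotaExceeded"}

// isNoiseMessage reports whether msg matches a known-noisy error.
func isNoiseMessage(msg string) bool {
	for _, m := range noiseMarkers {
		if strings.Contains(msg, m) {
			return true
		}
	}
	return false
}

// shouldAlert reports whether err deserves an alert: it must be non-nil
// and must not match any known noise marker.
func shouldAlert(err error) bool {
	return err != nil && !isNoiseMessage(err.Error())
}

func main() {
	errs := []error{
		errors.New("failed to get a buildlet: Failed to create instance: googleapi: Error 403: Quota 'CPUS' exceeded. Limit: 500.0, quotaExceeded"),
		errors.New("hypothetical non-quota failure"), // made-up example
	}
	for _, err := range errs {
		fmt.Printf("alert=%v: %v\n", shouldAlert(err), err)
	}
}
```

Errors that match no marker are alerted on by default, so new failure modes surface until someone explicitly classifies them as noise.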