As a temporary workaround in the meantime, we could write the non-critical services (build.golang.org and dev.golang.org, not golang.org in general) in “crash-only” style, and just make sure that we'll notice if any given service is down.
For non-critical things that'll likely recover on their own, I can just add items (perhaps at WARN level where appropriate) at https://farmer.golang.org/#health. Each of those can easily be hooked up to monitoring too.
I'd prefer not to crash if a non-critical service we depend on is having temporary issues. We have a lot of them.
build.golang.org and dev.golang.org are not non-critical services. If they're down, trybots and builders don't run, gopherbot won't assign reviewers to CLs, etc. People rely on those things working, so I don't think it's a good idea to try to solve this issue at the cost of reducing Go contributor productivity. We should find a non-disruptive way to find "bad" entries in logs.
We should get alerts if we see new/many "bad" log messages from our various services.
For some definition of new, many, and bad.
Maybe bad could mean it has "error" in it. Or a dozen other phrases.
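To make the "new, many, and bad" idea concrete, here is a minimal sketch of one possible reading, not a proposal for how the real services should do it. Everything in it is an assumption for illustration: the `badPhrases` list (only "error" comes from the text above), the `Alert` type, the `scan` function, the `known` set of previously seen messages, and the `manyThreshold` parameter. "Bad" means the line contains one of the phrases, "new" means the exact message is not in `known`, and "many" means it appeared at least `manyThreshold` times in the scanned batch.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// badPhrases is a hypothetical starting list; only "error" is suggested in the issue.
var badPhrases = []string{"error", "failed", "panic"}

// isBad reports whether a log line contains any bad phrase, ignoring case.
func isBad(line string) bool {
	lower := strings.ToLower(line)
	for _, p := range badPhrases {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

// Alert describes why a bad log message was flagged (hypothetical type).
type Alert struct {
	Message string
	Count   int
	Reason  string // "new" or "many"
}

// scan reads newline-separated log text and returns one alert per bad message
// that is either new (absent from known) or frequent (count >= manyThreshold).
// Alerts are returned in order of first appearance.
func scan(logs string, known map[string]bool, manyThreshold int) []Alert {
	counts := make(map[string]int)
	var order []string

	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || !isBad(line) {
			continue
		}
		if counts[line] == 0 {
			order = append(order, line)
		}
		counts[line]++
	}

	var alerts []Alert
	for _, msg := range order {
		n := counts[msg]
		switch {
		case !known[msg]:
			alerts = append(alerts, Alert{Message: msg, Count: n, Reason: "new"})
		case n >= manyThreshold:
			alerts = append(alerts, Alert{Message: msg, Count: n, Reason: "many"})
		}
	}
	return alerts
}

func main() {
	// Invented sample lines, not real output from any service.
	logs := "started ok\nerror: dial timeout\nerror: dial timeout\nFailed to list instances\n"
	known := map[string]bool{"error: dial timeout": true}

	for _, a := range scan(logs, known, 2) {
		fmt.Printf("%s (x%d): %s\n", a.Reason, a.Count, a.Message)
	}
}
```

Matching on exact message text is the simplest choice, but it treats messages that differ only in a timestamp or ID as distinct, so a real version would probably need to normalize lines before counting.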
(forking from https://go-review.googlesource.com/c/build/+/179419/1/cmd/coordinator/gce.go#b193 )
/cc @bcmills @dmitshur