As a temporary workaround, we could write the non-critical services (build.golang.org and dev.golang.org, not golang.org in general) in “crash-only” style, and just make sure that we'll notice if any given service is down.
For non-critical things that'll likely recover on their own, I can just add items (perhaps at WARN level where appropriate) at https://farmer.golang.org/#health. Each of those can easily be hooked up to monitoring too.
I'd prefer not to crash if a non-critical service we depend on is having temporary issues. We have a lot of them.
build.golang.org and dev.golang.org are not non-critical services. If they're down, trybots and builders don't run, gopherbot won't assign reviewers to CLs, etc. People rely on those things working, so I don't think it's a good idea to try to solve this issue at the cost of reducing Go contributor productivity. We should find a non-disruptive way to find "bad" entries in logs.
We should get alerts if we see new/many "bad" log messages from our various services.
For some definition of new, many, and bad.
Maybe bad could mean it has "error" in it. Or a dozen other phrases.
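As a strawman for those definitions, here's a minimal sketch, not a proposal for the real implementation. It assumes "bad" means the line contains one of a few phrases (only "error" comes from the text above; the rest of `badPhrases` is made up), "new" means the exact message hasn't been seen in an earlier scan, and "many" means at least `manyThreshold` occurrences in one scan. The names `isBad`, `scan`, and `alert` are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// badPhrases is an assumed starting list; only "error" is from the issue text.
var badPhrases = []string{"error", "failed", "panic", "timeout"}

// isBad reports whether line contains any bad phrase, ignoring case.
func isBad(line string) bool {
	lower := strings.ToLower(line)
	for _, p := range badPhrases {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

// alert describes one flagged message.
type alert struct {
	Message string
	Count   int
	New     bool
}

// scan counts bad lines in logs, grouped by exact text, and returns an alert
// for each message that is new (absent from seen) or that occurs at least
// manyThreshold times. It records every bad message in seen.
func scan(logs string, seen map[string]bool, manyThreshold int) []alert {
	counts := map[string]int{}
	var order []string // first-seen order, so output is deterministic
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		if !isBad(line) {
			continue
		}
		if counts[line] == 0 {
			order = append(order, line)
		}
		counts[line]++
	}
	var alerts []alert
	for _, msg := range order {
		isNew := !seen[msg]
		if isNew || counts[msg] >= manyThreshold {
			alerts = append(alerts, alert{Message: msg, Count: counts[msg], New: isNew})
		}
		seen[msg] = true
	}
	return alerts
}

func main() {
	logs := "starting up\nerror: creating VM\nerror: creating VM\nlistening\n"
	seen := map[string]bool{}
	for _, a := range scan(logs, seen, 2) {
		fmt.Printf("%q count=%d new=%v\n", a.Message, a.Count, a.New)
	}
}
```

Grouping by exact line text is naive: real log lines carry timestamps and IDs, so they'd need to be normalized before counting or nearly every line would look "new".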
(forking from https://go-review.googlesource.com/c/build/+/179419/1/cmd/coordinator/gce.go#b193 )
/cc @bcmills @dmitshur