As a temporary workaround in the meantime, we could write the non-critical services (build.golang.org and dev.golang.org, not golang.org in general) in “crash-only” style, and just make sure that we'll notice if any given service is down.
For non-critical things that'll likely recover on their own, I can just add items (perhaps at WARN level where appropriate) at https://farmer.golang.org/#health. Each of those can easily be hooked up to monitoring too.
I'd prefer not to crash if a non-critical service we depend on is having temporary issues. We have a lot of them.
build.golang.org and dev.golang.org are not non-critical services. If they're down, trybots and builders don't run, gopherbot won't assign reviewers to CLs, etc. People rely on those things working, so I don't think it's a good idea to try to solve this issue at the cost of reducing Go contributor productivity. We should find a non-disruptive way to find "bad" entries in logs.