This can be for many reasons: VM crash, preemptible batch resource death (VM pulled away from us), us accidentally deleting the VM, us deleting the VM on purpose, etc.
To catch all those cases, we currently rely on TCP keep-alives to at least confirm that the TCP connection is still alive. But detecting a dead connection that way appears to take several minutes, despite us using the http DefaultTransport (with TCP KeepAlive of 30 seconds?).
We should notice much quicker.
We should either reduce the TCP keep-alive time to a few seconds, or have a background goroutine continuously ping the buildlet and interrupt any in-flight operation from a buildlet client (be it an exec, PutTar, etc.). All commands should wait for either the real operation to complete or the buildlet health check goroutine to mark the buildlet dead, whichever happens first.