Increase reliability of debug dumps #8472
Merged
Fixes CORE-1193 and CORE-1294.
This PR does a bunch of things to make debug dumps more reliable, or at least as reliable as they can get without burning the whole thing down and starting over.
`pachctl debug dump` can now specify a timeout; it defaults to 30m. The timeout is adjusted down on the server side to about 90% of the client timeout, which means the debug dumper has some time to handle `context deadline exceeded` and start producing output before the RPC is totally aborted. I've had good results with timeouts as low as 100ms; you don't get everything, but you get some files. At 30m it should be Really Good (tm).
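For the curious, the server-side adjustment is roughly this shape (a sketch, not the actual pachd code; `dumpContext` and the exact fallback are illustrative):

```go
package main

import (
	"context"
	"time"
)

// dumpContext shrinks the server-side deadline to ~90% of whatever the
// client asked for, leaving the dumper a window to catch "context
// deadline exceeded" from sub-operations and still flush partial output
// before the RPC itself is torn down. (Illustrative sketch only.)
func dumpContext(ctx context.Context) (context.Context, context.CancelFunc) {
	deadline, ok := ctx.Deadline()
	if !ok {
		// No client deadline at all; fall back to the 30m default.
		return context.WithTimeout(ctx, 30*time.Minute)
	}
	return context.WithTimeout(ctx, time.Until(deadline)*9/10)
}
```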
Every multi-step operation that the dumper does now continues in the face of errors, as long as the error doesn't affect the next step. Every for loop or function that does two or more things now uses `multierr.Append` to collect all the errors. That means if we hit an issue where we try to do something silly like InspectPipeline an input repo, we just continue doing the whole debug dump anyway. At the very end, an error will be returned, but we can still write all the other files.
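The pattern looks roughly like this (a sketch with hypothetical names; `repo`, `inspectPipeline`, and `dumpPipelines` are stand-ins, not the real code):

```go
package main

import (
	"context"
	"fmt"

	"go.uber.org/multierr"
)

// repo and inspectPipeline are hypothetical stand-ins for the real types.
type repo struct {
	Name    string
	IsInput bool
}

func inspectPipeline(ctx context.Context, name string) error { return nil }

// dumpPipelines keeps going after each failure, accumulating errors with
// multierr.Append instead of returning on the first one.
func dumpPipelines(ctx context.Context, repos []repo) error {
	var errs error
	for _, r := range repos {
		if r.IsInput {
			continue // input repos have no pipeline; this was the missing continue
		}
		if err := inspectPipeline(ctx, r.Name); err != nil {
			// Record the failure, then carry on with the rest of the dump.
			errs = multierr.Append(errs, fmt.Errorf("inspect pipeline %q: %w", r.Name, err))
		}
	}
	return errs // non-nil if anything failed, but every repo was attempted
}
```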
I fixed the thing where we did InspectPipeline on an input repo; there was a missing `continue` statement. I also tried to fix PPS's error message for a pipeline not being found, but it's actually not relevant to this PR. (I don't think the code can ever hit the case I "fixed", but in case it does, hey, now the error type is correct. We still don't return grpc.status = NotFound from PPS under any of these circumstances, though.)

I added some arbitrary timeouts around things I don't think will be too slow, like we did for Loki.
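Those wrappers are basically this shape (a sketch; the helper and the 30s figure are illustrative, not the values in the PR):

```go
package main

import (
	"context"
	"time"
)

// withTimeout gives a single dump step its own short deadline so one
// slow subsystem can't eat the entire dump budget. (Illustrative; the
// real per-step values vary.)
func withTimeout(parent context.Context, step func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel()
	return step(ctx)
}
```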
I noticed that the pod describer from the k8s library can't take a context, which means it could run forever. So I run it in a background goroutine; the foreground goroutine waits for its output until the context expires, then abandons it and moves on. This leaks memory if the describer really does run forever, but hey, after we review the debug dump we'll probably tell you to restart pachd anyway. In the future we'll have to just collect pod YAMLs instead of "describe" output, or fork k8s.io/client-go to make the silly thing take a context.
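The workaround is the standard run-it-in-a-goroutine-and-select trick; something like this (a sketch, with `describe` standing in for client-go's context-less describer):

```go
package main

import (
	"context"
	"fmt"
)

// describeWithContext runs a context-unaware describe call in the
// background and waits for it only as long as ctx allows. The buffered
// channel lets the goroutine finish and exit even after we've given up;
// if describe itself never returns, the goroutine (and its memory) are
// stuck forever, which is the leak mentioned above.
func describeWithContext(ctx context.Context, describe func() (string, error)) (string, error) {
	type result struct {
		out string
		err error
	}
	ch := make(chan result, 1)
	go func() {
		out, err := describe()
		ch <- result{out, err}
	}()
	select {
	case r := <-ch:
		return r.out, r.err
	case <-ctx.Done():
		// Abandon the describe call and move on with the dump.
		return "", fmt.Errorf("describe pod: %w", ctx.Err())
	}
}
```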
As an example, here's what a run with an aggressive timeout looks like now:
We end up with data (and a long chain of error messages) even if we hit timeouts.