Increase reliability of debug dumps #8472
Merged
Fixes CORE-1193 and CORE-1294.
This PR does a bunch of things to make debug dumps more reliable, or at least as reliable as they can get without burning the whole thing down and starting over.
`pachctl debug dump` can now specify a timeout; it defaults to 30m. The timeout is adjusted down on the server side to about 90% of the client timeout, which means the debug dumper has some time to handle `context deadline exceeded` and start producing output before the RPC is totally aborted. I've had good results with timeouts as low as 100ms; you don't get everything, but you get some files. At 30m it should be Really Good (tm).
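For the curious, the server-side adjustment is roughly this shape (a sketch, not the actual pachd code; `dumpContext` and the exact fallback are illustrative):

```go
package main

import (
	"context"
	"time"
)

// dumpContext shrinks the server-side deadline to ~90% of whatever the
// client asked for, leaving the dumper a window to catch "context
// deadline exceeded" from sub-operations and still flush partial output
// before the RPC itself is torn down. (Illustrative sketch only.)
func dumpContext(ctx context.Context) (context.Context, context.CancelFunc) {
	deadline, ok := ctx.Deadline()
	if !ok {
		// No client deadline at all; fall back to the 30m default.
		return context.WithTimeout(ctx, 30*time.Minute)
	}
	return context.WithTimeout(ctx, time.Until(deadline)*9/10)
}
```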
Every multi-step operation that the dumper does now continues in the face of errors, as long as the error doesn't affect the next step. Every for loop or function that does two or more things now uses `multierr.Append` to collect all the errors. That means if we hit an issue where we try to do something silly like InspectPipeline an input repo, we just continue doing the whole debug dump anyway. At the very end, an error will be returned, but we can still write all the other files.
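The pattern looks roughly like this (a sketch with hypothetical names; `repo`, `inspectPipeline`, and `dumpPipelines` are stand-ins, not the real code):

```go
package main

import (
	"context"
	"fmt"

	"go.uber.org/multierr"
)

// repo and inspectPipeline are hypothetical stand-ins for the real types.
type repo struct {
	Name    string
	IsInput bool
}

func inspectPipeline(ctx context.Context, name string) error { return nil }

// dumpPipelines keeps going after each failure, accumulating errors with
// multierr.Append instead of returning on the first one.
func dumpPipelines(ctx context.Context, repos []repo) error {
	var errs error
	for _, r := range repos {
		if r.IsInput {
			continue // input repos have no pipeline; this was the missing continue
		}
		if err := inspectPipeline(ctx, r.Name); err != nil {
			// Record the failure, then carry on with the rest of the dump.
			errs = multierr.Append(errs, fmt.Errorf("inspect pipeline %q: %w", r.Name, err))
		}
	}
	return errs // non-nil if anything failed, but every repo was attempted
}
```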
I fixed the thing where we did InspectPipeline on an input repo; there was a missing `continue` statement. I also tried to fix PPS's error message for a pipeline not being found, but it's actually not relevant to this PR. (I don't think the code can ever hit the case I "fixed", but in case it does, hey, now the error type is correct. We still don't return grpc.status = NotFound from PPS under any of these circumstances, though.)

I added some arbitrary timeouts around things I don't think will be too slow, like we did for Loki.
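Those wrappers are basically this shape (a sketch; the helper and the 30s figure are illustrative, not the values in the PR):

```go
package main

import (
	"context"
	"time"
)

// withTimeout gives a single dump step its own short deadline so one
// slow subsystem can't eat the entire dump budget. (Illustrative; the
// real per-step values vary.)
func withTimeout(parent context.Context, step func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel()
	return step(ctx)
}
```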
I noticed that the pod describer from the k8s library can't take a context, which means it could run forever. So I run it in a background goroutine; the foreground goroutine waits for its output until the context expires, then abandons it and moves on. This leaks memory if the describer really does run forever, but hey, after we review the debug dump we'll probably tell you to restart pachd anyway. In the future we'll have to just collect pod YAMLs instead of "describe" output, or fork k8s.io/client-go to make the silly thing take a context.
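The workaround is the standard run-it-in-a-goroutine-and-select trick; something like this (a sketch, with `describe` standing in for client-go's context-less describer):

```go
package main

import (
	"context"
	"fmt"
)

// describeWithContext runs a context-unaware describe call in the
// background and waits for it only as long as ctx allows. The buffered
// channel lets the goroutine finish and exit even after we've given up;
// if describe itself never returns, the goroutine (and its memory) are
// stuck forever, which is the leak mentioned above.
func describeWithContext(ctx context.Context, describe func() (string, error)) (string, error) {
	type result struct {
		out string
		err error
	}
	ch := make(chan result, 1)
	go func() {
		out, err := describe()
		ch <- result{out, err}
	}()
	select {
	case r := <-ch:
		return r.out, r.err
	case <-ctx.Done():
		// Abandon the describe call and move on with the dump.
		return "", fmt.Errorf("describe pod: %w", ctx.Err())
	}
}
```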
As an example, here's what a run with an aggressive timeout looks like now:
We end up with data (and a long chain of error messages) even if we hit timeouts.