Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase reliability of debug dumps #8472

Merged
merged 1 commit into from Jan 4, 2023

Conversation

jrockway
Copy link
Member

@jrockway jrockway commented Dec 21, 2022

Fixes CORE-1193 and CORE-1294.

This PR does a bunch of stuff to make debug dumps more reliable, at least without burning the whole thing down and starting over.

pachctl debug dump can now specify a timeout; it defaults to 30m.

The timeout is adjusted down on the server side to about 90% of the client timeout. That means the debug dumper has some time to handle context deadline exceeded and start producing output before the RPC is totally aborted. I've had good results with timeouts as low as 100ms; you don't get everything, but you get some files. At 30m it should be Really Good (tm).

Every multi-step operation that the dumper does now continues in the face of errors, if the error doesn't affect the next thing. Every for loop or function that does two+ things now uses multierr.Append to collect all the errors. That means if we hit an issue where we try to do something silly like InspectPipeline an input repo, we just continue doing the whole debug dump anyway. At the very end, an error will be returned, but we can still write all the other files.

I fixed the thing where we did InspectPipeline on an input repo; there was a missing continue statement. I also tried to fix PPS's error message for a pipeline not being found, but it's actually not relevant to this PR. (I don't think the code can ever hit the case I "fixed", but in case it does, hey now the error type is correct. We still don't return grpc.status = NotFound from PPS under any of these circumstances though.)

I added some arbitrary timeouts around things I don't think will be too slow, like we did for Loki.

I noticed that the Pod Describer from the k8s library can't take a context. That means it could run forever, so I put it in a background goroutine; the foreground goroutine tries to get its output until the context expires, and then it just abandons it and moves on. This will leak memory if it runs forever, but hey, after we review the debug dump we'll probably tell you to restart pachd anyway. In the future we'll have to just collect pod YAMLs instead of "describe" output. Or fork k8s.io/client-go to make the silly thing take a context.

As an example, here's what a run with an aggressive timeout looks like now:

$ rm dump.tgz; /usr/bin/time pachctl debug dump dump.tgz  --timeout=1s ; tar tzvf dump.tgz; du -h dump.tgz
rpc error: code = Unknown desc = listPipelines: context deadline exceeded; appLogs: context deadline exceeded; collectDatabaseDump: collectDatabaseTables: list tables: context deadline exceeded
Command exited with non-zero status 1
0.09user 0.02system 0:01.04elapsed 11%CPU (0avgtext+0avgdata 66060maxresident)k
0inputs+2176outputs (0major+2005minor)pagefaults 0swaps
-rwxrwxrwx 0/0            6214 1969-12-31 19:00 source-repos/default/benchmark-upload/commits.json
-rwxrwxrwx 0/0           52020 1969-12-31 19:00 source-repos/default/benchmark-upload/commits-chart.png
-rwxrwxrwx 0/0            8083 1969-12-31 19:00 source-repos/default/images/commits.json
-rwxrwxrwx 0/0           45961 1969-12-31 19:00 source-repos/default/images/commits-chart.png
-rwxrwxrwx 0/0              17 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/version.txt
-rwxrwxrwx 0/0            7612 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/describe.txt
-rwxrwxrwx 0/0         8690350 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/logs.txt
-rwxrwxrwx 0/0              80 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/logs-previous/error.txt
-rwxrwxrwx 0/0         8640042 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/logs-loki.txt
-rwxrwxrwx 0/0           22422 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/go_info.txt
-rwxrwxrwx 0/0           11559 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/goroutine
-rwxrwxrwx 0/0           84444 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/heap
-rwxrwxrwx 0/0              26 1969-12-31 19:00 database/activities/error.txt
-rwxrwxrwx 0/0              26 1969-12-31 19:00 database/row-counts/error.txt
-rwxrwxrwx 0/0              26 1969-12-31 19:00 database/table-sizes/error.txt
1.1M    dump.tgz

We end up with data (and a long chain of error messages) even if we hit timeouts.

@codecov
Copy link

codecov bot commented Dec 21, 2022

Codecov Report

Merging #8472 (1f1b2f3) into master (6421a16) will decrease coverage by 39.03%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #8472       +/-   ##
==========================================
- Coverage   39.91%   0.88%   -39.04%     
==========================================
  Files         455     143      -312     
  Lines      122952   70607    -52345     
==========================================
- Hits        49081     624    -48457     
- Misses      64276   69947     +5671     
+ Partials     9595      36     -9559     
Impacted Files Coverage Δ
src/server/auth/db.go 0.00% <0.00%> (-100.00%) ⬇️
src/internal/log/etcd.go 0.00% <0.00%> (-100.00%) ⬇️
src/internal/log/http.go 0.00% <0.00%> (-100.00%) ⬇️
src/internal/log/step.go 0.00% <0.00%> (-100.00%) ⬇️
src/internal/watch/op.go 0.00% <0.00%> (-100.00%) ⬇️
src/internal/log/amazon.go 0.00% <0.00%> (-100.00%) ⬇️
src/internal/log/stdlog.go 0.00% <0.00%> (-100.00%) ⬇️
src/internal/dbutil/option.go 0.00% <0.00%> (-100.00%) ⬇️
src/internal/randutil/rand.go 0.00% <0.00%> (-100.00%) ⬇️
src/internal/grpcutil/error.go 0.00% <0.00%> (-100.00%) ⬇️
... and 429 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@albscui albscui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Most of my comments are just nits and curiosities, so not blockers for merging.

src/server/debug/server/server.go Show resolved Hide resolved
src/server/debug/server/server.go Show resolved Hide resolved
src/server/debug/server/server.go Show resolved Hide resolved
src/server/debug/server/server.go Show resolved Hide resolved
src/server/debug/server/util.go Show resolved Hide resolved
… little early so we can start sending bytes back when the timeout is near; try to dump thoroughly by using multierr rather than bailing out; add arbitrary timeouts to parts of the debug dump logic; make non-contexty things time out able
@jrockway jrockway force-pushed the jonathan/core-1294-timeouts-in-debug-dump branch from 2377e55 to 1f1b2f3 Compare January 4, 2023 17:00
@jrockway jrockway merged commit 3c06bac into master Jan 4, 2023
@jrockway jrockway deleted the jonathan/core-1294-timeouts-in-debug-dump branch January 4, 2023 20:59
jrockway added a commit that referenced this pull request Jan 5, 2023
… little early so we can start sending bytes back when the timeout is near; try to dump thoroughly by using multierr rather than bailing out; add arbitrary timeouts to parts of the debug dump logic; make non-contexty things time out able (#8472)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants