test_timeline_deletion_with_files_stuck_in_upload_queue flakiness #6681

Closed
jcsp opened this issue Feb 8, 2024 · 7 comments
Labels
a/test Area: related to testing c/storage/pageserver Component: storage: pageserver

Comments

@jcsp
Contributor

jcsp commented Feb 8, 2024

Rare:

AssertionError: assert not [
  (762, '2024-02-23T01:54:42.414003Z  WARN delete_timeline{tenant_id=faa1e715b82ea028c2ab77c827a4e253 shard_id=0000 timeline_id=f8b5e0a4e8c75657837989e9d700addb}: got not found err while removing timeline dir, proceeding anyway timeline_dir="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb" path="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000171F4C1-000000000172BF51"\n'),
  (763, '2024-02-23T01:54:42.436209Z  WARN delete_timeline{tenant_id=faa1e715b82ea028c2ab77c827a4e253 shard_id=0000 timeline_id=f8b5e0a4e8c75657837989e9d700addb}: got not found err while removing timeline dir, proceeding anyway timeline_dir="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb" path="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696321-00000000016A5D69"\n'),
  (764, '2024-02-23T01:54:42.445635Z  WARN delete_timeline{tenant_id=faa1e715b82ea028c2ab77c827a4e253 shard_id=0000 timeline_id=f8b5e0a4e8c75657837989e9d700addb}: got not found err while removing timeline dir, proceeding anyway timeline_dir="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb" path="/tmp/test_output/test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]-1/repo/pageserver_1/tenants/faa1e715b82ea028c2ab77c827a4e253/timelines/f8b5e0a4e8c75657837989e9d700addb/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000175F0B9-000000000176A2F1"\n')]

Unsure what that could be. Only example I've seen so far. Analysis in: #6681 (comment)

Common:

AssertionError: assert not [
  (866, '2024-02-22T17:26:05.355705Z ERROR request{method=PUT path=/v1/tenant/01f38f79d16faf4f1caa45fe1fbed6da/timeline/b2d0a3673159e19a88e51f564cfac803/checkpoint request_id=3e7ffcba-0384-4464-878e-3593c876cb94}: Error processing HTTP request: InternalServerError(queue is in state Stopping\n')]

This sounds like an error message that was changed, so the test's allowed-error pattern no longer matches it. Fixed in #6894.

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/test Area: related to testing labels Feb 8, 2024
@koivunej
Contributor

koivunej commented Feb 23, 2024

Looks like a Stopping/Stopped string error, which I introduced in 8dee990 -- cannot see why. I'll just switch it back.
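
For readers unfamiliar with how a wording change turns into a test failure: the test asserts that the pageserver log has no WARN/ERROR lines except those matching an allow-list of regexes, so a pattern written against the old wording stops matching. Below is a minimal sketch of that mechanism. The helper name `unexpected_errors` and the "stale" pattern are hypothetical illustrations, not the actual fixture code or the actual previous message.

```python
import re


def unexpected_errors(log_lines, allowed_patterns):
    """Return (line_number, line) for WARN/ERROR lines that no allowed regex matches."""
    allowed = [re.compile(p) for p in allowed_patterns]
    found = []
    for lineno, line in enumerate(log_lines, start=1):
        if " ERROR " not in line and " WARN " not in line:
            continue  # only warnings and errors are checked
        if any(p.match(line) for p in allowed):
            continue  # explicitly allowed by the test
        found.append((lineno, line))
    return found


# Log line shortened from the report above.
log = [
    "2024-02-22T17:26:05.355705Z ERROR request{method=PUT}: "
    "Error processing HTTP request: InternalServerError(queue is in state Stopping",
]

# Hypothetical allow-list entry written against a different wording.
stale = [r".*queue is in stopped state.*"]
# Allow-list entry matching the wording that was actually logged.
current = [r".*queue is in state Stopping.*"]
```

With the stale pattern the line is reported as unexpected; with the current one the list is empty. Either the pattern or the message has to change, and here the message was switched back.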

@koivunej
Contributor

The rarer case is a valid situation: it happens when the struct Layer drops run at the same time as timeline deletion is cleaning up local layer files. Perhaps the walkdir now has the upper hand because it is sync code, whereas struct Layer uses spawn_blocking, but either way it is a race.
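
The warning in the rare case is what a tolerant cleanup produces when another actor deletes a file first. A small Python analogy follows (the pageserver itself is Rust). `remove_dir_tolerant` and its `before_remove` hook are hypothetical names used only to make the race deterministic; this is not the pageserver's code.

```python
import os
import tempfile


def remove_dir_tolerant(timeline_dir, before_remove=None):
    """Remove every file in timeline_dir; record, rather than fail on, files that vanished."""
    already_gone = []
    for name in sorted(os.listdir(timeline_dir)):
        path = os.path.join(timeline_dir, name)
        if before_remove is not None:
            before_remove(path)  # test hook: lets a "concurrent" actor act first
        try:
            os.remove(path)
        except FileNotFoundError:
            # Someone else removed it between listing and deleting: proceed anyway.
            already_gone.append(path)
    os.rmdir(timeline_dir)
    return already_gone


def demo():
    timeline_dir = tempfile.mkdtemp()
    for name in ("layer_a", "layer_b"):
        with open(os.path.join(timeline_dir, name), "w") as f:
            f.write("x")

    def layer_drop(path):
        # Stand-in for a Layer drop deleting its own file first.
        if path.endswith("layer_a"):
            os.remove(path)

    gone = remove_dir_tolerant(timeline_dir, before_remove=layer_drop)
    return [os.path.basename(p) for p in gone], os.path.exists(timeline_dir)
```

The cleanup still succeeds, but each lost race yields one "not found" record, which is what the test's log check flags unless it is allowed or the race is removed.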

The individual layers have no knowledge of deletion happening, and they were being kept alive by UploadTask entries in RemoteTimelineClient. I think the answer to this is to hold the gate a bit longer, and be sure to hold the guard until the end of deletion.
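
To illustrate the gate/guard idea: anything that may still touch local files holds a guard, and shutdown or deletion closes the gate, which blocks until all guards are released and rejects new entries. The sketch below is a Python analogy under those assumptions; the `Gate`, `GateClosed`, and `_Guard` classes are hypothetical and are not the pageserver's Rust implementation.

```python
import threading


class GateClosed(Exception):
    pass


class Gate:
    """Hypothetical gate: close() blocks until every outstanding guard is released."""

    def __init__(self):
        self._cond = threading.Condition()
        self._guards = 0
        self._closing = False

    def enter(self):
        with self._cond:
            if self._closing:
                raise GateClosed()
            self._guards += 1
        return _Guard(self)

    def _release(self):
        with self._cond:
            self._guards -= 1
            self._cond.notify_all()

    def close(self):
        with self._cond:
            self._closing = True  # no new guards from now on
            self._cond.wait_for(lambda: self._guards == 0)


class _Guard:
    def __init__(self, gate):
        self._gate = gate

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._gate._release()
        return False


def demo():
    gate = Gate()
    events = []
    started = threading.Event()
    finish = threading.Event()

    def layer_cleanup():
        with gate.enter():  # guard held for the whole cleanup
            started.set()
            finish.wait()
            events.append("layer cleanup done")

    def deletion():
        gate.close()  # waits for the guard above
        events.append("gate closed")

    worker = threading.Thread(target=layer_cleanup)
    worker.start()
    started.wait()
    closer = threading.Thread(target=deletion)
    closer.start()
    finish.set()
    worker.join()
    closer.join()

    try:
        gate.enter()
        late_entry_rejected = False
    except GateClosed:
        late_entry_rejected = True
    return events, late_entry_rejected
```

Because the guard is released only after the cleanup finishes, deletion cannot get past `close()` while a layer is still working on its file, which removes the race instead of tolerating it.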

Something to consider after #6028.

koivunej added a commit that referenced this issue Feb 24, 2024
introduced in 8dee990, should help with
the #6681 common problem which is just a mismatched allowed error.
@koivunej
Contributor

koivunej commented Mar 4, 2024

Next steps:

  • layers need to have a single gateguard (not necessarily gateguard per layer) so that we can synchronize the shutdown

@koivunej
Contributor

The rare case is likely handled by #7082.

@koivunej
Contributor

koivunej commented Mar 18, 2024

A recent failure

I think it was just being slow; the pageserver logs do not show any long stalls. But this may have been with the recently added assertions.

@koivunej
Contributor

koivunej commented Apr 8, 2024

This has not been flaky since Timeline::gate usage was introduced in #7082: the last flaky run was 2024-03-16. However, that work was merged before that date, so I am unsure whether it is what fixed it.

@koivunej koivunej closed this as completed Apr 8, 2024