Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tiered compaction: assertion failed: last.last_key <= key #7296

Closed
Tracked by #7554 ...
arpad-m opened this issue Apr 3, 2024 · 6 comments
Closed
Tracked by #7554 ...

tiered compaction: assertion failed: last.last_key <= key #7296

arpad-m opened this issue Apr 3, 2024 · 6 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver

Comments

@arpad-m
Copy link
Member

arpad-m commented Apr 3, 2024

The following assertion error is hit with new compaction:

panic{thread=mgmt request worker location=pageserver/compaction/src/compact_tiered.rs:757:13}: assertion failed: last.last_key <= key

I have a local test that reproduces it, extracted from/based on test_deletion_queue_recovery.

I think the problem is that the merge_delta_keys function does not return keys in key order.

originally part of #6768
moved to #7554

@arpad-m arpad-m added the c/storage/pageserver Component: storage: pageserver label Apr 3, 2024
@arpad-m arpad-m self-assigned this Apr 3, 2024
@arpad-m arpad-m changed the title new compaction: assertion failed: last.last_key <= key tiered compaction: assertion failed: last.last_key <= key Apr 3, 2024
@arpad-m
Copy link
Member Author

arpad-m commented Apr 3, 2024

I confirm, this is also reproducible with the test_deletion_queue_recovery test, now that #7282 has been merged.

@problame
Copy link
Contributor

problame commented May 6, 2024

No progress last week, plan to address this week.

@arpad-m
Copy link
Member Author

arpad-m commented May 8, 2024

I looked at it this on Tuesday and I haven't made much progress. I have found issues with the k-merge though (MergeDeltaKeys):

  • The items are modified while inside the heap in a way that it influences their ordering relative to other entries of the heap. One has to remove the item, modify it, and then reinsert it.
  • The order should also depend on the lsn: we want lower lsn to come before higher lsn. There is no such sorting right now, it all sorts based on keyspace. If you are fed a bunch of delta layers for 0000-FFFF keyspace with increasing lsn intervals, then the order they get chosen depends on the order of feeding, or some other arbitrary choice. Not what we want :).

I have fixed both issues in a local branch but this is sadly not enough to make the assert not fire. I have however filed a PR to make us assert a little bit more early on: #7648.

arpad-m added a commit that referenced this issue May 8, 2024
Adds ordering asserts to the output of the delta key iterator
`MergeDeltaKeys` that implements a k-merge.

Part of #7296 : the asserts added by this PR get hit in the reproducers
of #7296 as well, but they are earlier in the pipeline.
@problame
Copy link
Contributor

problame commented May 8, 2024

Meeting decision:

  • we'll remove the stream-based k-merge and use the primitive in-memory kmerge from the compact_legacy instead
  • Arpad to create a follow-up in the "Backlog" list of the epic to add back stream-based k-merge

@arpad-m
Copy link
Member Author

arpad-m commented May 8, 2024

The changes to remove the streaming based k-merge are in #7661, also with the two (insufficient) correctness fixes for the stream-based k-merge.

Maybe we can close this with #7661, but note that test_deletion_queue_recovery is still failing even with #7661 applied, so it can't be used as regression test.

arpad-m added a commit that referenced this issue May 9, 2024
This PR does two things:

First, it fixes a bug with tiered compaction's k-merge implementation.
It ignored the lsn of a key during ordering, so multiple updates of the
same key could be read in arbitrary order, say from different layers.
For example there is layers `[(a, 2),(b, 3)]` and `[(a, 1),(c, 2)]` in
the heap, they might return `(a,2)` and `(a,1)`.

Ultimately, this change wasn't enough to fix the ordering issues in
#7296, in other words there is likely still bugs in the k-merge. So as
the second thing, we switch away from the k-merge to an in-memory based
one, similar to #4839, but leave the code around to be improved and
maybe switched to later on.

Part of #7296
@arpad-m
Copy link
Member Author

arpad-m commented May 10, 2024

Closing as #7661 has been merged

@arpad-m arpad-m closed this as completed May 10, 2024
arpad-m added a commit that referenced this issue May 15, 2024
Adds a test that is a reproducer for many tiered compaction bugs,
both ones that have since been fixed as well as still unfxied ones:
* (now fixed) #7296 
* #7707 
* #7759
* Likely also #7244 but I haven't tried that.

The key ordering bug can be reproduced by switching to
`merge_delta_keys` instead of `merge_delta_keys_buffered`, so reverting
a big part of #7661, although it only sometimes reproduces (30-50% of
cases).

part of #7554
a-masterov pushed a commit that referenced this issue May 20, 2024
Adds ordering asserts to the output of the delta key iterator
`MergeDeltaKeys` that implements a k-merge.

Part of #7296 : the asserts added by this PR get hit in the reproducers
of #7296 as well, but they are earlier in the pipeline.
a-masterov pushed a commit that referenced this issue May 20, 2024
This PR does two things:

First, it fixes a bug with tiered compaction's k-merge implementation.
It ignored the lsn of a key during ordering, so multiple updates of the
same key could be read in arbitrary order, say from different layers.
For example there is layers `[(a, 2),(b, 3)]` and `[(a, 1),(c, 2)]` in
the heap, they might return `(a,2)` and `(a,1)`.

Ultimately, this change wasn't enough to fix the ordering issues in
#7296, in other words there is likely still bugs in the k-merge. So as
the second thing, we switch away from the k-merge to an in-memory based
one, similar to #4839, but leave the code around to be improved and
maybe switched to later on.

Part of #7296
a-masterov pushed a commit that referenced this issue May 20, 2024
Adds a test that is a reproducer for many tiered compaction bugs,
both ones that have since been fixed as well as still unfxied ones:
* (now fixed) #7296 
* #7707 
* #7759
* Likely also #7244 but I haven't tried that.

The key ordering bug can be reproduced by switching to
`merge_delta_keys` instead of `merge_delta_keys_buffered`, so reverting
a big part of #7661, although it only sometimes reproduces (30-50% of
cases).

part of #7554
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

No branches or pull requests

2 participants