-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dube: timeout individual layer evictions, log progress and record metrics #6131
Conversation
2448 tests run: 2326 passed, 0 failed, 122 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
d97f1e1 at 2024-02-29T20:29:58.767Z :recycle: |
30b6202
to
0a5965d
Compare
0a5965d
to
106fc0a
Compare
Should do the same for eviction_task. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While convincing myself of the cancel-safety of evict_layers, I noticed that it leaks the JoinSet
js
on cancellation.
Well, it does join_handle.abort()
on each task.
So, I think secondary_tenant.evict_layer()
has probably not been audited for cancellation safety.
I don't know the invariants, but, it doesn't look cancel safe (could be cancelled by the runtime when returning Poll::Ready from remove_file
). It's a pre-existing issue though, and secondaries aren't prod-enabled yet, so, let's punt that to a new issue. At a mininum, the the SecondaryTenant::evict_layer
function needs a comment stating it promises to be cancel-safe.
Excellent point. This PR was done before secondary tenants or at least their eviction, and I did not revisit it while fixing conflicts. |
106fc0a
to
b7b106f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the smoke test, great work!
I like the added metrics, not just their names.
Not sure why you went for a timeout
parameter instead of making callers who care use tokio::time::timeout
, but 🤷
I was thinking it's the mandatory nature makes sense here. Waiting is a nice to have, it's not required except for our tests. Timeout does not cancel it. |
2be668d
to
4e6ad8e
Compare
Rebase needed for build-image tools. |
4e6ad8e
to
583eb5e
Compare
Rebase needed (?) for test_encode #6963. |
5s against bugs.
weird that no dead_code warnings appeared.
mandatory as in internal, add it to EvictionError.
I was by mistake forgetting that we need to synchronize with the spawn_blocking pool instead of hoping the timeout magic to work, as the timeout future always prioritizes the actual timeouted work over the timeout itself.
it didn't work with paused time, unsure why that is.
583eb5e
to
d97f1e1
Compare
Because of bugs evictions could hang and pause disk usage eviction task. One such bug is known and fixed #6928. Guard each layer eviction with a modest timeout deeming timeouted evictions as failures, to be conservative.
In addition, add logging and metrics recording on each eviction iteration:
Additionally remove dead code for which no dead code warnings appeared in earlier PR.
Follow-up to: #6060.