Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(LayerManager): resident layers query #6634

Merged
merged 8 commits into from
Feb 12, 2024

Conversation

koivunej
Copy link
Contributor

@koivunej koivunej commented Feb 5, 2024

Refactor out layer accesses so that we can have easy access to resident layers, which are needed for number of cases instead of layers for eviction. Simplifies the heatmap building by only using Layers, not RemoteTimelineClient.

Cc: #5331

Copy link

github-actions bot commented Feb 5, 2024

2430 tests run: 2319 passed, 0 failed, 111 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_crafted_wal_end[last_wal_record_crossing_segment]: release
  • test_vm_bit_clear_on_heap_lock: debug

Postgres 15

Code coverage (full report)

  • functions: 54.9% (11866 of 21609 functions)
  • lines: 82.1% (66232 of 80663 lines)

The comment gets automatically updated with the latest test results
c66553e at 2024-02-12T14:28:11.599Z :recycle:

@koivunej
Copy link
Contributor Author

koivunej commented Feb 6, 2024

I had not expected heatmap be using the same API we explicitly named *_for_disk_usage_eviction. Will fix a bit later.

@koivunej koivunej force-pushed the only_collect_layers_with_instant_read branch from d082519 to 2bfb450 Compare February 6, 2024 09:31
@koivunej koivunej marked this pull request as ready for review February 6, 2024 16:36
@koivunej koivunej requested a review from a team as a code owner February 6, 2024 16:36
@koivunej koivunej requested review from arpad-m and removed request for a team February 6, 2024 16:36
pageserver/src/tenant/timeline.rs Outdated Show resolved Hide resolved
@arpad-m
Copy link
Member

arpad-m commented Feb 6, 2024

If we contend for the lock, then won't we be missing out on evictions in a situation when there is high contention? In the worst case we'd not evict at all, right?

@koivunej
Copy link
Contributor Author

koivunej commented Feb 8, 2024

If we contend for the lock, then won't we be missing out on evictions in a situation when there is high contention? In the worst case we'd not evict at all, right?

Correct for both. This is however for a single timeline. It is thought to be unlikely that we could have more than 8 timelines in critical sections, or at least I thought so before starting to write this, earlier had noted the following:

This solution is based on a hope that all timelines will not be busy at the same with layer flushing, compaction, applying garbage collection.

There is of course one more additional source for why the try_read would fail: ingestion. We would have to be hitting the lock very precisely when walreceiver is committing, page requests would similarly wait then.

This was not my idea, nor am I sure if this works. Seems however the unluckyness cases would have to be really unlucky. In the meantime, we should still evict from other tenants.

previous version was piggybacking on
get_local_layers_for_disk_usage_eviction AND joining the data over to
RemoteTimelineClient. this was strictly worse than just iterating over
the (new) resident_layers query results because at best
RemoteTimelineClient is up to date, but at worst it is missing a layer
we have just flushed.

Layer otherwise has all of the information needed.
not building the intermediate vec might have an impact with full
pageserver.
@koivunej koivunej force-pushed the only_collect_layers_with_instant_read branch from ca98971 to 5d4d636 Compare February 8, 2024 23:13
we lose on the layer name logging, but no one was really looking at
those loggings either. we still have the ratelimited warn.
Copy link
Member

@arpad-m arpad-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not really convinced that the new behaviour on contention --- us skipping eviction entirely --- is the right call. If there is no contention, there is no real difference either as in that instance the blocking wait just immediately yields the result.
But @koivunej has told me on slack that there is future work building on this, so I'm approving.

@koivunej
Copy link
Contributor Author

Yes, it is unclear so I'll just drop the option returning for now.

@koivunej koivunej changed the title fix(disk usage based eviction): do not contend for layermap rwlock refactor(LayerManager): resident layers query Feb 12, 2024
@koivunej koivunej merged commit 7ea593d into main Feb 12, 2024
49 checks passed
@koivunej koivunej deleted the only_collect_layers_with_instant_read branch February 12, 2024 15:13
jcsp pushed a commit that referenced this pull request Feb 12, 2024
Refactor out layer accesses so that we can have easy access to resident
layers, which are needed for number of cases instead of layers for
eviction. Simplifies the heatmap building by only using Layers, not
RemoteTimelineClient.

Cc: #5331
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants