
docs: sharding phase 1 RFC #5432

Merged
merged 3 commits into from
Mar 15, 2024
Conversation

jcsp
Collaborator

@jcsp jcsp commented Oct 2, 2023

We need to shard our Tenants to support larger databases without those large databases dominating our pageservers and/or requiring dedicated pageservers.

This RFC aims to define an initial capability that will permit creating large-capacity databases using a static configuration
defined at time of Tenant creation.

Online re-sharding is deferred as future work, as is offloading layers for historical reads. However, both of these capabilities would be implementable without further changes to the control plane or compute: this RFC aims to define the cross-component work needed to bootstrap sharding end-to-end.

@jcsp jcsp added t/tech_design_rfc Issue type: tech design RFC c/storage Component: storage labels Oct 2, 2023
@jcsp jcsp marked this pull request as draft October 2, 2023 09:22
@github-actions

github-actions bot commented Oct 2, 2023

2652 tests run: 2528 passed, 0 failed, 124 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_ondemand_activation: debug

Code coverage* (full report)

  • functions: 28.4% (7031 of 24721 functions)
  • lines: 47.2% (43454 of 92101 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
be81eaa at 2024-03-14T11:04:10.157Z

docs/rfcs/029-sharding-phase1.md
@save-buffer
Contributor

rendered

@kelvich
Contributor

kelvich commented Oct 10, 2023

I also wonder if the initial idea with the safekeepers is actually better:

  • for starters we can just send everything to all shards. That will make our underutilized network interfaces a bit less underutilized, but I don't think it would be a big deal short-term.
  • that still leaves us with the problem of one shard holding old WAL and not receiving new updates. But here I think we overstate the importance of that problem: we can allow such WAL segments to be offloaded to S3 and retrieve them on demand later.
  • the SK team was already considering switching the current stream-per-connection model to something multiplexed, with messages sent over a shared wire. gRPC looked like the most actionable option there, so we want to change the protocol anyway.
  • I'm also not concerned about WAL decoding performance -- we only need pagerefs, so we could do stream parsing with minimal state (3-4 variables). Given that the whole stream is written to disk anyway, I'm pretty sure such parsing would not be a bottleneck.

So I'd rather start with "all WAL to all shards" and then work through that list asynchronously, without blocking initial sharding.
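The "minimal state" stream parsing described above can be sketched roughly as follows. Note that the record layout here is deliberately hypothetical (length-prefixed frames with a fixed-offset block number), not the real Postgres WAL format; the point is only that extracting block references needs little more state than a cursor.

```rust
// Scan a framed record stream and pull out only the block references,
// without decoding or buffering record bodies. Hypothetical layout:
// [4-byte LE body length][4-byte LE block number][body...], repeated.
fn scan_block_refs(stream: &[u8]) -> Vec<u32> {
    let mut refs = Vec::new();
    let mut pos = 0usize; // the only parsing state: a cursor
    while pos + 8 <= stream.len() {
        let len = u32::from_le_bytes(stream[pos..pos + 4].try_into().unwrap()) as usize;
        let blkno = u32::from_le_bytes(stream[pos + 4..pos + 8].try_into().unwrap());
        refs.push(blkno);
        pos += 8 + len; // skip the record body without looking at it
    }
    refs
}

fn main() {
    // Two records: body lengths 2 and 0, block numbers 7 and 42.
    let mut stream = Vec::new();
    stream.extend_from_slice(&2u32.to_le_bytes());
    stream.extend_from_slice(&7u32.to_le_bytes());
    stream.extend_from_slice(&[0xaa, 0xbb]); // body, ignored
    stream.extend_from_slice(&0u32.to_le_bytes());
    stream.extend_from_slice(&42u32.to_le_bytes());
    assert_eq!(scan_block_refs(&stream), vec![7, 42]);
}
```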

@knizhnik
Contributor

> I also wonder if the initial idea with the safekeepers is actually better: […] So I'd rather start with "all wal to all shards" and then went through that list async without blocking initial sharding.

I am still not quite sure whether scattering WAL is a good idea.
The drawbacks are clear -- you explained why they may not be so critical. Maybe...
But what are the advantages? Reducing network traffic? Is it actually a bottleneck?

Also, as I already wrote, in principle when we send the same stream to multiple consumers we can use low-level broadcast, which seems much more efficient than scattering. I'm not sure it is applicable to safekeeper<->PS communication, though (at least because pageservers are not synced and may be at different positions in the stream).

Concerning decoding -- there are some WAL records which modify pages not specified in their pagerefs,
e.g. the *ALL_VISIBLE_CLEARED flags in heap INSERT/UPDATE/DELETE records. This actually raises a more serious problem than WAL scattering. The current sharding scheme assumes that we ignore forknum but use blocknum. But related pages in the main and VM forks have different block numbers, so they can be assigned to different shards. Walingest should take this into account and not try to insert into the KV storage changes that are not relevant to this shard.
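To make the fork mismatch concrete, here is a toy Rust sketch. The stripe size, shard count, and key-to-shard function are assumptions for illustration, not Neon's actual key mapping; what it shows is that a heap page and its visibility-map page carry different block numbers and can therefore land on different shards.

```rust
// Hypothetical blocknum-based sharding (forknum ignored, as the RFC's
// scheme assumes): shard = (blkno / stripe) % shard_count.
const STRIPE_SIZE: u32 = 32768; // assumed stripe size, in blocks
const SHARD_COUNT: u32 = 4;

fn shard_for_block(blkno: u32) -> u32 {
    (blkno / STRIPE_SIZE) % SHARD_COUNT
}

// In Postgres the VM stores 2 bits per heap block, so one 8192-byte VM
// page covers 8192 * 8 / 2 = 32768 heap blocks.
const HEAP_BLOCKS_PER_VM_PAGE: u32 = 32768;

fn vm_block_for_heap_block(heap_blkno: u32) -> u32 {
    heap_blkno / HEAP_BLOCKS_PER_VM_PAGE
}

fn main() {
    let heap_blk = 100_000;
    let vm_blk = vm_block_for_heap_block(heap_blk);
    // The heap page maps to shard 3, its VM page to shard 0:
    assert_eq!(shard_for_block(heap_blk), 3);
    assert_eq!(vm_blk, 3);
    assert_eq!(shard_for_block(vm_blk), 0);
}
```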

@knizhnik
Contributor

knizhnik commented Nov 7, 2023

Sorry if I missed something, but I failed to find an answer to one question related to key sharding in this RFC:
how are we going to handle relation size?
Right now relation size is maintained in a separate key-value pair in the KV storage. It is not explicitly set by the SMGR; it is updated on the PS whenever a page image or WAL record is appended to the relation.

So to which shard should compute send the get_rel_size request?
The only working solution seems to be to broadcast this request to all shards and then choose the maximum.
But that seems quite inefficient... Certainly we have a relsize cache at compute, so relation size is not expected to be retrieved very frequently. But relation size is requested at node startup, and one of the primary goals of sharding was to reduce startup time. Broadcasting get_rel_size requests to all shards will definitely not speed up startup.
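The broadcast-and-max approach raised in this question can be sketched as follows. The `ShardClient` trait and `MockShard` are hypothetical stand-ins for illustration, not Neon's pageserver API.

```rust
// Hypothetical per-shard client: reports this shard's view of a
// relation's size, in blocks.
trait ShardClient {
    fn get_rel_size(&self, rel: u32) -> u32;
}

// Broadcast get_rel_size to every shard and take the maximum: the shard
// holding the relation's last block reports the true size.
fn rel_size_broadcast(shards: &[Box<dyn ShardClient>], rel: u32) -> u32 {
    shards
        .iter()
        .map(|s| s.get_rel_size(rel))
        .max()
        .unwrap_or(0)
}

struct MockShard {
    size: u32,
}

impl ShardClient for MockShard {
    fn get_rel_size(&self, _rel: u32) -> u32 {
        self.size
    }
}

fn main() {
    let shards: Vec<Box<dyn ShardClient>> = vec![
        Box::new(MockShard { size: 120 }), // holds the trailing blocks
        Box::new(MockShard { size: 96 }),
        Box::new(MockShard { size: 64 }),
    ];
    assert_eq!(rel_size_broadcast(&shards, 1), 120);
}
```

The inefficiency complained about above is visible here: one logical request fans out into N round trips, all of which must complete before the size is known.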

@knizhnik
Contributor

knizhnik commented Nov 7, 2023

One more key sharding issue not covered by this RFC. The VM fork is updated either by special WAL records or as part of heap insert/update/delete operations (when the corresponding bit is set). The problem is that the block number of the updated VM page is very different from the block number in the main fork specified in the heap_* record.
So even if we broadcast the same WAL stream to all shards, we still want each shard to apply only those WAL records which are associated with it.

But for FSM/VM fork updates this is not so trivial: their buffer tags are not specified in the record's block data. We have to decode the WAL record, check the bit (whether it changes page visibility), then calculate the position of the VM page, check whether it is assigned to this shard, and if so update it.

The task becomes even more challenging if we are going to scatter WAL: in that case we have to look for these VM flags and broadcast the record when they are set.
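The decode-then-route check described above can be sketched like this. The record struct, stripe size, and shard mapping are all hypothetical (not Neon's walingest types); the sketch only shows that one heap record may imply work on two different shards.

```rust
// Assumed blocknum-based sharding, forknum ignored.
const STRIPE_SIZE: u32 = 32768;
const SHARD_COUNT: u32 = 4;
// One VM page covers 32768 heap blocks (2 bits per heap block).
const HEAP_BLOCKS_PER_VM_PAGE: u32 = 32768;

// Simplified stand-in for a decoded heap insert/update/delete record.
struct HeapRecord {
    heap_blkno: u32,
    clears_all_visible: bool, // e.g. an *ALL_VISIBLE_CLEARED flag is set
}

fn shard_for_block(blkno: u32) -> u32 {
    (blkno / STRIPE_SIZE) % SHARD_COUNT
}

// Which pages does this record touch on `my_shard`?
// Returns (apply to heap page, apply to VM page).
fn pages_to_apply(rec: &HeapRecord, my_shard: u32) -> (bool, bool) {
    let apply_heap = shard_for_block(rec.heap_blkno) == my_shard;
    let apply_vm = rec.clears_all_visible
        && shard_for_block(rec.heap_blkno / HEAP_BLOCKS_PER_VM_PAGE) == my_shard;
    (apply_heap, apply_vm)
}

fn main() {
    let rec = HeapRecord { heap_blkno: 100_000, clears_all_visible: true };
    // Heap page lives on shard 3, its VM page on shard 0: a shard must
    // decode the record to learn it owns the implied VM update.
    assert_eq!(pages_to_apply(&rec, 3), (true, false));
    assert_eq!(pages_to_apply(&rec, 0), (false, true));
}
```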

@jcsp
Collaborator Author

jcsp commented Jan 9, 2024

I've cleaned this up to cover what we did in Q4 -- will publish a smaller follow-on RFC that describes shard splitting (Q1).

@jcsp jcsp marked this pull request as ready for review January 9, 2024 11:02
@jcsp jcsp mentioned this pull request Jan 15, 2024
5 tasks
@jcsp jcsp enabled auto-merge (squash) March 14, 2024 10:02
@jcsp jcsp merged commit 23416cc into main Mar 15, 2024
53 checks passed
@jcsp jcsp deleted the jcsp/rfc-sharding-phase1 branch March 15, 2024 11:14