
docs: sharding phase 1 RFC #5432

Merged
merged 3 commits into from
Mar 15, 2024
Conversation

jcsp
Collaborator

@jcsp jcsp commented Oct 2, 2023

We need to shard our Tenants to support larger databases without those large databases dominating our pageservers and/or requiring dedicated pageservers.

This RFC aims to define an initial capability that will permit creating large-capacity databases using a static configuration
defined at time of Tenant creation.

Online re-sharding is deferred as future work, as is offloading layers for historical reads. However, both of these capabilities would be implementable without further changes to the control plane or compute: this RFC aims to define the cross-component work needed to bootstrap sharding end-to-end.

@jcsp jcsp added t/tech_design_rfc Issue type: tech design RFC c/storage Component: storage labels Oct 2, 2023
@jcsp jcsp marked this pull request as draft October 2, 2023 09:22
@github-actions

github-actions bot commented Oct 2, 2023

2652 tests run: 2528 passed, 0 failed, 124 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_ondemand_activation: debug

Code coverage* (full report)

  • functions: 28.4% (7031 of 24721 functions)
  • lines: 47.2% (43454 of 92101 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
be81eaa at 2024-03-14T11:04:10.157Z

docs/rfcs/029-sharding-phase1.md
@save-buffer
Contributor

rendered

@kelvich
Contributor

kelvich commented Oct 10, 2023

I also wonder if the initial idea with the safekeepers is actually better:

  • for starters we can just send everything to all shards. That will make our underutilized network interfaces a bit less underutilized, but I don't think it would be a big deal short-term.
  • that still leaves us with the problem of one shard holding old WAL and not receiving new updates. But here I think we overstate the importance of that problem: we can allow such WAL segments to be offloaded to S3 and retrieve them on demand later.
  • the SK team was already considering switching the current stream-per-connection model to something multiplexed, with messages sent over a shared wire. gRPC looked like the most actionable option there, so we want to change the protocol anyway.
  • I'm also not concerned about WAL decoding performance -- we only need pagerefs, so we could do stream parsing with minimal state (3-4 variables). Given that the whole stream is written to disk anyway, I'm pretty sure such parsing would not be a bottleneck.

So I'd rather start with "all WAL to all shards" and then work through that list asynchronously, without blocking initial sharding.
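The "minimal state" stream parsing described above can be sketched roughly as follows. Note that the record layout here is deliberately hypothetical (length-prefixed frames with a fixed-offset block number), not the real Postgres WAL format; the point is only that extracting block references needs little more state than a cursor.

```rust
// Scan a framed record stream and pull out only the block references,
// without decoding or buffering record bodies. Hypothetical layout:
// [4-byte LE body length][4-byte LE block number][body...], repeated.
fn scan_block_refs(stream: &[u8]) -> Vec<u32> {
    let mut refs = Vec::new();
    let mut pos = 0usize; // the only parsing state: a cursor
    while pos + 8 <= stream.len() {
        let len = u32::from_le_bytes(stream[pos..pos + 4].try_into().unwrap()) as usize;
        let blkno = u32::from_le_bytes(stream[pos + 4..pos + 8].try_into().unwrap());
        refs.push(blkno);
        pos += 8 + len; // skip the record body without looking at it
    }
    refs
}

fn main() {
    // Two records: body lengths 2 and 0, block numbers 7 and 42.
    let mut stream = Vec::new();
    stream.extend_from_slice(&2u32.to_le_bytes());
    stream.extend_from_slice(&7u32.to_le_bytes());
    stream.extend_from_slice(&[0xaa, 0xbb]); // body, ignored
    stream.extend_from_slice(&0u32.to_le_bytes());
    stream.extend_from_slice(&42u32.to_le_bytes());
    assert_eq!(scan_block_refs(&stream), vec![7, 42]);
}
```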

@knizhnik
Contributor

> I also wonder if the initial idea with the safekeepers is actually better: […] So I'd rather start with "all wal to all shards" and then went through that list async without blocking initial sharding.

I am still not quite sure whether scattering WAL is a good idea.
The drawbacks are clear -- you explained why they may not be so critical. Maybe...
But what are the advantages? Reducing network traffic? Is it actually a bottleneck?

Also, as I already wrote, in principle when we send the same stream to multiple consumers we can use low-level broadcast, which seems much more efficient than scattering. I'm not sure it is applicable to safekeeper<->PS communication, though (at least because pageservers are not synced and may be at different positions in the stream).

Concerning decoding -- there are some WAL records which modify pages not specified in their pagerefs,
e.g. the *ALL_VISIBLE_CLEARED flags in heap INSERT/UPDATE/DELETE records. This actually raises a more serious problem than WAL scattering. The current sharding scheme assumes that we ignore forknum but use blocknum. But related pages in the main and VM forks have different block numbers, so they can be assigned to different shards. Walingest should take this into account and not try to insert into the KV storage changes that are not relevant to this shard.
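To make the fork mismatch concrete, here is a toy Rust sketch. The stripe size, shard count, and key-to-shard function are assumptions for illustration, not Neon's actual key mapping; what it shows is that a heap page and its visibility-map page carry different block numbers and can therefore land on different shards.

```rust
// Hypothetical blocknum-based sharding (forknum ignored, as the RFC's
// scheme assumes): shard = (blkno / stripe) % shard_count.
const STRIPE_SIZE: u32 = 32768; // assumed stripe size, in blocks
const SHARD_COUNT: u32 = 4;

fn shard_for_block(blkno: u32) -> u32 {
    (blkno / STRIPE_SIZE) % SHARD_COUNT
}

// In Postgres the VM stores 2 bits per heap block, so one 8192-byte VM
// page covers 8192 * 8 / 2 = 32768 heap blocks.
const HEAP_BLOCKS_PER_VM_PAGE: u32 = 32768;

fn vm_block_for_heap_block(heap_blkno: u32) -> u32 {
    heap_blkno / HEAP_BLOCKS_PER_VM_PAGE
}

fn main() {
    let heap_blk = 100_000;
    let vm_blk = vm_block_for_heap_block(heap_blk);
    // The heap page maps to shard 3, its VM page to shard 0:
    assert_eq!(shard_for_block(heap_blk), 3);
    assert_eq!(vm_blk, 3);
    assert_eq!(shard_for_block(vm_blk), 0);
}
```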

@knizhnik
Contributor

knizhnik commented Nov 7, 2023

Sorry if I missed something, but I failed to find an answer to one question related to key sharding in this RFC:
how are we going to handle relation size?
Right now relation size is maintained in a separate key-value pair in the KV storage. It is not explicitly set by the SMGR; it is updated on the PS whenever a page image or WAL record is appended to the relation.

So to which shard should compute send the get_rel_size request?
The only working solution seems to be to broadcast this request to all shards and then choose the maximum.
But that seems quite inefficient... Certainly we have a relsize cache at compute, so relation size is not expected to be retrieved very frequently. But relation size is requested at node startup, and one of the primary goals of sharding was to reduce startup time. Broadcasting get_rel_size requests to all shards will definitely not speed up startup.
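The broadcast-and-max approach raised in this question can be sketched as follows. The `ShardClient` trait and `MockShard` are hypothetical stand-ins for illustration, not Neon's pageserver API.

```rust
// Hypothetical per-shard client: reports this shard's view of a
// relation's size, in blocks.
trait ShardClient {
    fn get_rel_size(&self, rel: u32) -> u32;
}

// Broadcast get_rel_size to every shard and take the maximum: the shard
// holding the relation's last block reports the true size.
fn rel_size_broadcast(shards: &[Box<dyn ShardClient>], rel: u32) -> u32 {
    shards
        .iter()
        .map(|s| s.get_rel_size(rel))
        .max()
        .unwrap_or(0)
}

struct MockShard {
    size: u32,
}

impl ShardClient for MockShard {
    fn get_rel_size(&self, _rel: u32) -> u32 {
        self.size
    }
}

fn main() {
    let shards: Vec<Box<dyn ShardClient>> = vec![
        Box::new(MockShard { size: 120 }), // holds the trailing blocks
        Box::new(MockShard { size: 96 }),
        Box::new(MockShard { size: 64 }),
    ];
    assert_eq!(rel_size_broadcast(&shards, 1), 120);
}
```

The inefficiency complained about above is visible here: one logical request fans out into N round trips, all of which must complete before the size is known.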

@knizhnik
Contributor

knizhnik commented Nov 7, 2023

One more key sharding issue not covered by this RFC. The VM fork is updated either by special WAL records or as part of heap insert/update/delete operations (when the corresponding bit is set). The problem is that the block number of the updated VM page is very different from the block number in the main fork specified in the heap_* record.
So even if we broadcast the same WAL stream to all shards, we still want each shard to apply only those WAL records which are associated with it.

But for FSM/VM fork updates this is not so trivial: their buffer tags are not specified in the record's block data. We have to decode the WAL record, check the bit (whether it changes page visibility), then calculate the position of the VM page, check whether it is assigned to this shard, and if so update it.

The task becomes even more challenging if we are going to scatter WAL: in that case we have to look for these VM flags and broadcast the record when they are set.
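The decode-then-route check described above can be sketched like this. The record struct, stripe size, and shard mapping are all hypothetical (not Neon's walingest types); the sketch only shows that one heap record may imply work on two different shards.

```rust
// Assumed blocknum-based sharding, forknum ignored.
const STRIPE_SIZE: u32 = 32768;
const SHARD_COUNT: u32 = 4;
// One VM page covers 32768 heap blocks (2 bits per heap block).
const HEAP_BLOCKS_PER_VM_PAGE: u32 = 32768;

// Simplified stand-in for a decoded heap insert/update/delete record.
struct HeapRecord {
    heap_blkno: u32,
    clears_all_visible: bool, // e.g. an *ALL_VISIBLE_CLEARED flag is set
}

fn shard_for_block(blkno: u32) -> u32 {
    (blkno / STRIPE_SIZE) % SHARD_COUNT
}

// Which pages does this record touch on `my_shard`?
// Returns (apply to heap page, apply to VM page).
fn pages_to_apply(rec: &HeapRecord, my_shard: u32) -> (bool, bool) {
    let apply_heap = shard_for_block(rec.heap_blkno) == my_shard;
    let apply_vm = rec.clears_all_visible
        && shard_for_block(rec.heap_blkno / HEAP_BLOCKS_PER_VM_PAGE) == my_shard;
    (apply_heap, apply_vm)
}

fn main() {
    let rec = HeapRecord { heap_blkno: 100_000, clears_all_visible: true };
    // Heap page lives on shard 3, its VM page on shard 0: a shard must
    // decode the record to learn it owns the implied VM update.
    assert_eq!(pages_to_apply(&rec, 3), (true, false));
    assert_eq!(pages_to_apply(&rec, 0), (false, true));
}
```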

@jcsp
Collaborator Author

jcsp commented Jan 9, 2024

I've cleaned this up to cover what we did in Q4 -- will publish a smaller follow-on RFC that describes shard splitting (Q1).

@jcsp jcsp marked this pull request as ready for review January 9, 2024 11:02
@jcsp jcsp mentioned this pull request Jan 15, 2024
5 tasks
@jcsp jcsp enabled auto-merge (squash) March 14, 2024 10:02
@jcsp jcsp merged commit 23416cc into main Mar 15, 2024
53 checks passed
@jcsp jcsp deleted the jcsp/rfc-sharding-phase1 branch March 15, 2024 11:14