
Epic: tenant relocation between pageserver nodes #886

Closed · 1 task done
kelvich opened this issue Nov 16, 2021 · 6 comments · Fixed by #995
Assignees
Labels
a/scalability Area: related to scalability c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic
Milestone

Comments

@kelvich (Contributor)

kelvich commented Nov 16, 2021

Motivation

We want to be able to assign tenants and timelines to pageservers running on appropriate EC2 instances, so that tenants and their workloads are distributed fairly across instances. This helps achieve stable user query latencies, avoids noisy-neighbor issues, and, when necessary, allows maintenance and upgrades on the nodes where the pageserver runs.

See #985 for a similar issue on safekeepers

DoD

Add the ability to move a tenant from one storage node to another. For now it is acceptable to require a compute node restart to do that.

There should be a button in the control plane to trigger the migration. Aside from the compute restart, there shouldn't be any downtime.

Tasks

For more in-depth description see the last RFC on the topic: #3868

For actual progress see the project board: https://github.com/orgs/neondatabase/projects/27/views/2

As described in the RFC, we implement the simpler approach first. The downside is a short downtime. The approach can evolve into one without downtime after we resolve the immediate tech-debt issues.

@kelvich kelvich added this to the Tech preview milestone Nov 16, 2021
@stepashka stepashka removed this from the Tech preview milestone Nov 29, 2021
@stepashka stepashka changed the title Epic: tenant rebalancing Epic: initial version of tenant rebalancing Dec 2, 2021
@stepashka stepashka added the t/Epic Issue type: Epic label Dec 6, 2021
@LizardWizzard (Contributor)

Some more notes.

What is the sequence of actions to test moving a tenant to a different pageserver?

  1. Init pageserver old
  2. Create new tenant
  3. Insert some data
  4. Run a checkpoint
  5. Wait for it to be uploaded to the remote storage
  6. Bootstrap new pageserver
  7. Wait for all the data to be downloaded
  8. Execute callmemaybe so new pageserver starts replication from safekeeper
  9. Wait for replication to catch up
  10. Restart compute with changed pageserver address
  11. Shut down tenant on the old pageserver
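The steps above can be sketched as an in-memory simulation. All names here (`FakePageserver`, `relocate`, the layer lists) are hypothetical test-harness stand-ins, not the actual pageserver API:

```python
# Sketch of the relocation test sequence using an in-memory simulation.
# A real test would call pageserver/safekeeper HTTP APIs instead.

class FakePageserver:
    def __init__(self, name):
        self.name = name
        self.tenants = {}        # tenant_id -> list of layer files
        self.replicating = False

    def checkpoint(self, tenant_id, remote):
        # Upload the tenant's local layers to remote storage (steps 4-5).
        remote[tenant_id] = list(self.tenants[tenant_id])

    def attach(self, tenant_id, remote):
        # Download everything for the tenant from remote storage (steps 6-7).
        self.tenants[tenant_id] = list(remote[tenant_id])

def relocate(tenant_id, old, new, remote, compute):
    old.checkpoint(tenant_id, remote)   # steps 4-5: checkpoint + upload
    new.attach(tenant_id, remote)       # steps 6-7: bootstrap + download
    new.replicating = True              # steps 8-9: start replication, catch up
    compute["pageserver"] = new.name    # step 10: restart compute with new address
    del old.tenants[tenant_id]          # step 11: shut down tenant on old node

old, new = FakePageserver("ps-old"), FakePageserver("ps-new")
remote, compute = {}, {"pageserver": "ps-old"}
old.tenants["tenant-a"] = ["layer-1", "layer-2"]  # steps 1-3: init + insert data
relocate("tenant-a", old, new, remote, compute)
```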

What bothers me the most is that it becomes possible for two pageservers to write checkpoints concurrently to the same S3 path. Though they'll probably write the same data, I wouldn't rely on that: there might be a version upgrade and the format might diverge. So I think we shouldn't allow that. This case can be guarded with flags, e.g. when we attach a timeline to a different pageserver, it shouldn't try to upload anything to S3. But this is still quite tricky. We shouldn't crash because of OOM here, since InMemoryLayers can now be swapped to disk, but we still need to manage this somehow. I'll investigate possible solutions.
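The flag-based guard described above could look roughly like this. This is a minimal sketch under the assumption that a timeline attached during relocation starts in a "no upload" mode; all names are hypothetical:

```python
# Hypothetical guard: a timeline attached during relocation is opened with
# uploads disabled, so two pageservers never write the same S3 path at once.

class Timeline:
    def __init__(self, tenant_id, uploads_allowed=True):
        self.tenant_id = tenant_id
        self.uploads_allowed = uploads_allowed
        self.uploaded = []

    def upload_layer(self, layer):
        if not self.uploads_allowed:
            # Skip: the source pageserver still owns the S3 prefix.
            return False
        self.uploaded.append(layer)
        return True

    def promote(self):
        # Called once the old pageserver has shut the tenant down.
        self.uploads_allowed = True

# On the destination pageserver, attach with uploads disabled:
tl = Timeline("tenant-a", uploads_allowed=False)
blocked = tl.upload_layer("layer-1")   # refused while relocation is in flight
tl.promote()
accepted = tl.upload_layer("layer-1")  # allowed after ownership transfer
```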

@kelvich (Contributor, Author)

kelvich commented Dec 8, 2021

Some more notes.

What is the sequence of actions to test moving a tenant to a different pageserver?

1. Init pageserver old
2. Create new tenant
3. Insert some data
4. Run a checkpoint
5. Wait for it to be uploaded to the remote storage
6. Bootstrap new pageserver
7. Wait for all the data to be downloaded
8. Execute callmemaybe so new pageserver starts replication from safekeeper
9. Wait for replication to catch up
10. Restart compute with changed pageserver address
11. Shut down tenant on the old pageserver

I think for a v0 you can just go from step 7 to step 10 directly. It would be a bigger performance hiccup for the compute, but should be okay from a correctness standpoint. After that we can add more synchronization between steps 7-10. Note that you can connect a pageserver to a safekeeper without callmemaybe.

What bothers me the most is that it becomes possible for two pageservers to write checkpoints concurrently to the same S3 path. Though they'll probably write the same data, I wouldn't rely on that: there might be a version upgrade and the format might diverge. So I think we shouldn't allow that. This case can be guarded with flags, e.g. when we attach a timeline to a different pageserver, it shouldn't try to upload anything to S3. But this is still quite tricky. We shouldn't crash because of OOM here, since InMemoryLayers can now be swapped to disk, but we still need to manage this somehow.

OOM shouldn't be an issue here, since the pageserver can spill files to disk anyway. We can have an unconditional check: if we are trying to overwrite existing valid files on S3, compare a CRC/hash and do nothing if they match; that should help with your case, IIUC.
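The hash-compare check suggested here could be sketched as follows. The storage client is a plain dict standing in for S3, and `upload_if_changed` is a hypothetical name; a real implementation would use the S3 API and likely a cheaper checksum than recomputing SHA-256 on both sides:

```python
# Sketch: before overwriting an existing remote object, compare content
# hashes; skip the upload if they match, refuse it if they differ.
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def upload_if_changed(store: dict, key: str, data: bytes) -> str:
    existing = store.get(key)
    if existing is not None:
        if sha256(existing) == sha256(data):
            return "skipped"  # identical file already present: no-op
        raise RuntimeError(f"refusing to overwrite {key} with different content")
    store[key] = data
    return "uploaded"

s3 = {}
first = upload_if_changed(s3, "tenants/a/layer-1", b"abc")   # new object
second = upload_if_changed(s3, "tenants/a/layer-1", b"abc")  # same bytes again
```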

@LizardWizzard (Contributor)

LizardWizzard commented Dec 8, 2021

I think for a v0 you can just go from 7 to 10 directly

Yeah, currently I just insert sleep calls where something needs waiting, with TODOs to replace them with API calls for proper synchronization.

do nothing if they match

What if they don't? Currently there is an archiving mechanism about to land in #874, and I've raised similar concerns there too. Can there be some non-determinism in the file layout? Can there be a different version of the pageserver with changes to the file layout? So I think for now we might want to avoid both overwriting files in S3 and extracting S3 data into a non-empty local timeline; we should not download anything for a timeline that is already available locally. Though it is good to check for race conditions etc. What do you think?

I've created #971 so maybe it is better to continue discussion there

cc @SomeoneToIgnore

@stepashka stepashka added this to the Technical preview milestone Dec 10, 2021
@stepashka stepashka changed the title Epic: initial version of tenant rebalancing Epic: implement migration of tenants between the pageserver nodes Dec 13, 2021
@stepashka stepashka added c/storage/pageserver Component: storage: pageserver s3 and removed launch blocker labels Dec 13, 2021
@stepashka stepashka changed the title Epic: implement migration of tenants between the pageserver nodes Epic: implement migration of a tenant between the pageserver nodes Dec 15, 2021
@stepashka stepashka changed the title Epic: implement migration of a tenant between the pageserver nodes Epic: implement migration of a tenant between pageserver nodes Dec 15, 2021
@stepashka stepashka added the a/scalability Area: related to scalability label Dec 22, 2021
@stepashka stepashka added the p/cloud Product: Neon Cloud label Dec 27, 2021
@stepashka stepashka removed the p/cloud Product: Neon Cloud label Jan 10, 2022
@stepashka stepashka added m/COW-jan22 Moment: during the Cloud Ops Week in Jan 22 and removed m/COW-jan22 Moment: during the Cloud Ops Week in Jan 22 labels Jan 14, 2022
@stepashka stepashka removed this from the Limited Preview milestone Jan 24, 2022
@stepashka stepashka added this to the 2022/06 milestone Jun 17, 2022
@kelvich kelvich changed the title Epic: implement migration of a tenant between pageserver nodes Epic: rebalancing aka implement migration of a tenant between pageserver nodes Jul 5, 2022
@kelvich kelvich changed the title Epic: rebalancing aka implement migration of a tenant between pageserver nodes Epic: rebalancing aka migration of a tenant between pageserver nodes Jul 5, 2022
@stepashka stepashka modified the milestones: 2022/06, 2022/08 Jul 25, 2022
@stepashka stepashka modified the milestones: 2022/08, 2022/09 Sep 5, 2022
@stepashka stepashka modified the milestones: 2022/09, 2022/10 Oct 17, 2022
@stepashka stepashka changed the title Epic: rebalancing aka migration of a tenant between pageserver nodes Epic: tenant relocation between pageserver nodes Oct 24, 2022
@vadim2404 vadim2404 modified the milestones: 2022/10, 2022/11 Nov 9, 2022
@shanyp shanyp modified the milestones: 2022/11, 2023/01 Dec 20, 2022
@LizardWizzard (Contributor)

Let me summarize what's left here:

  • There is a possible race with branch creation: if a branch is created while relocation is in progress, it can be lost. A possible fix is to track this on the client side, i.e. make some eventually consistent algorithm which converges to the proper branch set on the target pageserver. This solution was implemented but was not committed to the cloud repo.

  • The other issue is related to background operations (GC/compaction) that can corrupt state on remote storage (S3) if they run on both pageservers simultaneously.

This can be mitigated by suspending background operations before relocation. Corresponding issue: #2740. I attempted to implement that, but with the current state of tenant management it's hard to do reliably. PR with that attempt: #2665.

This approach has some downsides, and I wrote an RFC with a better proposal, but it is a much heavier change: #2676. The RFC also covers future problems that will occur when we start thinking about scaling one tenant across multiple pageservers.

So I think it's better to start with suspending background operations now.
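The "suspend background operations before relocation" idea can be sketched with a simple pause flag that the GC/compaction loops check before each iteration. All names here are assumptions; the actual pageserver is written in Rust and structures its background loops differently:

```python
# Sketch: background loops (gc/compaction) consult a pause flag so they
# can be suspended for the duration of a relocation.
import threading

class BackgroundLoops:
    def __init__(self):
        self._paused = threading.Event()
        self.runs = 0

    def maybe_run_once(self):
        if self._paused.is_set():
            return False  # relocation in progress: skip gc/compaction
        self.runs += 1
        return True

    def pause_for_relocation(self):
        self._paused.set()

    def resume(self):
        self._paused.clear()

loops = BackgroundLoops()
ran_before = loops.maybe_run_once()   # normal operation
loops.pause_for_relocation()
ran_during = loops.maybe_run_once()   # suspended while relocating
loops.resume()
ran_after = loops.maybe_run_once()    # back to normal
```

Using an event rather than a plain bool keeps the flag safe to flip from a control thread while the loops run elsewhere.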

@shanyp shanyp modified the milestones: 2023/01, 2023/02 Jan 9, 2023
@LizardWizzard (Contributor)

Another bug that needs to be resolved: #3478

@shanyp shanyp modified the milestones: 2023/02, 2023/03 Feb 19, 2023
@shanyp shanyp removed the s/blocked Status: is blocked on something else label May 10, 2023
problame added a commit that referenced this issue May 16, 2023
…cessing (#4235)

With this patch, the attach handler now follows the same pattern as
tenant create with regards to instantiation of the new tenant:

1. Prepare on-disk state using `create_tenant_files`.
2. Use the same code path as pageserver startup to load it into memory
and start background loops (`schedule_local_tenant_processing`).

It's a bit sad we can't use the
`PageServerConfig::tenant_attaching_mark_file_path` method inside
`create_tenant_files` because it operates in a temporary directory.
However, it's a small price to pay for the gained simplicity.

During implementation, I noticed that we don't handle failures post
`create_tenant_files` well. I left TODO comments in the code linking to
the issue that I created for this [^1].

Also, I'll dedupe the spawn_load and spawn_attach code in a future
commit.

refs #1555
part of #886 (Tenant
Relocation)

[^1]: #4233
problame added a commit that referenced this issue May 24, 2023
This PR adds support for supplying the tenant config upon /attach.

Before this change, when relocating a tenant using `/detach` and
`/attach`, the tenant config after `/attach` would be the default config
from `pageserver.toml`.
That is undesirable for settings such as the PITR-interval: if the
tenant's config on the source was `30 days` and the default config on
the attach-side is `7 days`, then the first GC run would eradicate 23
days worth of PITR capability.

The API change is backwards-compatible: if the body is empty, we
continue to use the default config.
We'll remove that capability as soon as the cloud.git code is updated to
use attach-time tenant config
(#4282 keeps track of this).

unblocks neondatabase/cloud#5092 
fixes #1555
part of #886 (Tenant
Relocation)

Implementation
==============

The preliminary PRs for this work were (most-recent to least-recent)

* #4279
* #4267
* #4252
* #4235
@LizardWizzard LizardWizzard removed their assignment Aug 14, 2023
@jcsp jcsp self-assigned this Aug 14, 2023
@jcsp (Contributor)

jcsp commented Sep 5, 2023

This ticket is quite old and describes an earlier migration approach with downtime, which already exists. Closing in favour of #5199 for the ongoing work to enable seamless migration.

@jcsp jcsp closed this as completed Sep 5, 2023