Deal with concurrent checkpointing from both old and new pageserver during timeline migration #971

Closed
LizardWizzard opened this issue Dec 8, 2021 · 7 comments
Labels
c/storage/pageserver Component: storage: pageserver

@LizardWizzard
Contributor

LizardWizzard commented Dec 8, 2021

Started from discussion here #874 (comment)

There is a dangerous possibility of two pageservers writing data concurrently to the same underlying S3 storage. Given the incremental format we use, this is scary even if it happens to work.

There are some questions and invariants we need to uphold. We shouldn't be able to overwrite the local metadata of an active timeline with something older/newer from S3. As far as I understand, the opposite is not a problem because we save metadata with each checkpoint. Currently a local overwrite shouldn't happen because we do not schedule downloads for timelines that are present locally. So this is good.

  • What if we have some nondeterminism in the checkpointer, so that it can create different files on two pageservers?
  • Will there be two checkpoints at the same LSN, or will they interleave somehow?
  • Even if there is no nondeterminism, what if there are two different pageserver versions which use different code to produce checkpoints? E.g. new format version, changed layout, new files, missing files etc.
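To make the questions above concrete, here is a minimal sketch of the overwrite hazard. Everything in it is hypothetical: `SharedBucket`, the `layer_<LSN>` key layout and the payload format are illustration only and are not the pageserver's actual storage layout. It only assumes that two writers targeting the same key replace each other's object silently.

```python
class SharedBucket:
    """Toy stand-in for an S3 prefix shared by two pageservers (hypothetical)."""

    def __init__(self):
        self.objects = {}    # key -> bytes
        self.conflicts = []  # keys where a different payload replaced an existing one

    def put(self, key, payload):
        old = self.objects.get(key)
        # A plain object store would not notice this; we record it for the demo.
        if old is not None and old != payload:
            self.conflicts.append(key)
        self.objects[key] = payload


def checkpoint(bucket, lsn, format_version):
    """Write one checkpoint file for `lsn`; the key layout is an assumption."""
    key = f"timeline/layer_{lsn:016X}"
    payload = f"v{format_version}:pages-up-to-{lsn}".encode()
    bucket.put(key, payload)
    return key


bucket = SharedBucket()
checkpoint(bucket, 0x100, format_version=1)  # old pageserver
checkpoint(bucket, 0x100, format_version=1)  # deterministic duplicate: harmless
checkpoint(bucket, 0x100, format_version=2)  # new pageserver, different output
print(bucket.conflicts)
```

In this toy model a byte-identical duplicate is harmless, while a second writer with a different format version silently replaces the first writer's file under the same key.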

Feel free to correct me, maybe I'm missing something

@LizardWizzard LizardWizzard changed the title from "Deal with concurrent checkpointing from both old and new pageserver" to "Deal with concurrent checkpointing from both old and new pageserver during timeline migration" Dec 8, 2021
@LizardWizzard LizardWizzard added launch blocker c/storage/pageserver Component: storage: pageserver labels Dec 9, 2021
@stepashka stepashka added this to the Technical preview milestone Dec 13, 2021
@LizardWizzard
Contributor Author

The current vision is that we should allow concurrent checkpointing without any problems. I created a test for that, but ran into unrelated errors which are currently being fixed on main. So I'll continue my investigation.

@LizardWizzard
Contributor Author

Note: while concurrent checkpointing shouldn't lead to correctness issues, we still might want to avoid it in some future scenarios where a timeline is attached to two pageservers, e.g. to spread get-page requests or to support failover. Currently concurrent checkpointing might happen in the process of tenant migration, when the new and old pageservers are active simultaneously.

@stepashka
Member

this is waiting for #1396

@LizardWizzard
Contributor Author

Things changed and we decided to introduce etcd. The new vision is to use it in order to prevent concurrent uploads from happening.
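A rough sketch of what "prevent concurrent uploads" could look like with a lease-style lock. This is an in-memory stand-in, not the etcd client API: `LeaseStore`, `try_acquire`, `upload_if_leader` and the `upload-lock/<timeline>` key are all hypothetical names, and time is passed in explicitly so the behaviour is easy to follow.

```python
class LeaseStore:
    """In-memory stand-in for a coordination service (hypothetical, not etcd's API)."""

    def __init__(self):
        self._leases = {}  # key -> (owner, expires_at)

    def try_acquire(self, key, owner, ttl, now):
        holder = self._leases.get(key)
        if holder is not None:
            current_owner, expires_at = holder
            # Someone else holds an unexpired lease: refuse.
            if current_owner != owner and now < expires_at:
                return False
        # Free, expired, or already ours: take or renew the lease.
        self._leases[key] = (owner, now + ttl)
        return True


def upload_if_leader(store, bucket, timeline, owner, payload, now, ttl=10.0):
    """Upload only while holding the per-timeline lease; returns True on upload."""
    if not store.try_acquire(f"upload-lock/{timeline}", owner, ttl, now):
        return False
    bucket[timeline] = payload
    return True


store, bucket = LeaseStore(), {}
print(upload_if_leader(store, bucket, "tl1", "ps-old", b"a", now=0.0))   # lease taken
print(upload_if_leader(store, bucket, "tl1", "ps-new", b"b", now=5.0))   # refused
print(upload_if_leader(store, bucket, "tl1", "ps-new", b"b", now=11.0))  # lease expired
```

One caveat with this kind of scheme: a lease holder that stalls between acquiring the lease and performing the upload can still write after its lease has expired, so a lease alone narrows the window for concurrent uploads but does not by itself rule them out.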

@problame
Contributor

I think we can close this, given we're set to implement relocation as specified in RFC #3868

@problame problame closed this as not planned Apr 24, 2023
@LizardWizzard
Contributor Author

I think this is still relevant. The issue describes the problem that the RFC should solve in one way or another. In the first iteration we decided to avoid this problem by detaching before the attach, so there is no concurrent background activity from more than one pageserver at a time. The project currently takes into account only the first stage, so I'm not sure whether we should keep the issue in the project (keep it with a separate label?). WDYT @problame?
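The detach-before-attach ordering can be sketched as a tiny state machine. `ControlPlane` and its methods are hypothetical names for illustration, not the actual control plane or pageserver API; the only thing modelled is that a tenant is never attached to two pageservers at once.

```python
class ControlPlane:
    """Toy model of attachment bookkeeping (hypothetical)."""

    def __init__(self):
        self.attached_to = {}  # tenant -> pageserver currently attached

    def attach(self, tenant, pageserver):
        current = self.attached_to.get(tenant)
        if current is not None and current != pageserver:
            # Refuse to create a second attachment: this is the invariant.
            raise RuntimeError(f"{tenant} is still attached to {current}")
        self.attached_to[tenant] = pageserver

    def detach(self, tenant, pageserver):
        if self.attached_to.get(tenant) == pageserver:
            del self.attached_to[tenant]

    def migrate(self, tenant, source, target):
        # Detach first, so background work on `source` has stopped
        # before `target` starts checkpointing.
        self.detach(tenant, source)
        self.attach(tenant, target)


cp = ControlPlane()
cp.attach("tenant-1", "ps-old")
cp.migrate("tenant-1", "ps-old", "ps-new")
print(cp.attached_to)
```

The cost of this ordering is a window during migration in which no pageserver is attached to the tenant.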

@LizardWizzard LizardWizzard reopened this May 10, 2023
@shanyp
Contributor

shanyp commented Dec 26, 2023

fixed by generations
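For readers unfamiliar with the term, a minimal sketch of the general idea behind generation numbers, under the assumption that each attachment gets a strictly increasing generation and includes it in every object key it writes. The key format, `Attachment` and `newest` are hypothetical and not the pageserver's actual implementation.

```python
def layer_key(timeline, layer_name, generation):
    """Object key with a generation suffix (format is an assumption)."""
    return f"{timeline}/{layer_name}-{generation:08x}"


class Attachment:
    """One pageserver's attachment of a timeline at a given generation."""

    def __init__(self, bucket, timeline, generation):
        self.bucket = bucket
        self.timeline = timeline
        self.generation = generation

    def upload_layer(self, layer_name, payload):
        key = layer_key(self.timeline, layer_name, self.generation)
        self.bucket[key] = payload
        return key


def newest(bucket, timeline, layer_name):
    """Return the payload written by the highest generation, or None."""
    prefix = f"{timeline}/{layer_name}-"
    candidates = [
        (int(key[len(prefix):], 16), payload)
        for key, payload in bucket.items()
        if key.startswith(prefix)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


bucket = {}
old = Attachment(bucket, "tl1", generation=1)
new = Attachment(bucket, "tl1", generation=2)
new.upload_layer("layer_100", b"new")
old.upload_layer("layer_100", b"old")  # stale writer arrives late
print(newest(bucket, "tl1", "layer_100"))
```

In this model a stale writer can still upload, but it can never overwrite an object written by a newer attachment, and readers prefer the highest generation.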

@shanyp shanyp closed this as completed Dec 26, 2023
4 participants