Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safe DB migrations using RocksDB Checkpoints #6282

Merged
merged 20 commits into from
Feb 23, 2022
Merged

Conversation

nikurt
Copy link
Contributor

@nikurt nikurt commented Feb 11, 2022

RocksDB checkpoints hardlink files and therefore are free as long as the files are on the same filesystem. Because we usually a separate disk into ~/.near/data, this PR by default writes checkpoints inside that directory.
The PR detects if a checkpoint already exists and if it does refuses a DB migration.
Checkpoints are deleted after successful migrations.

Test plan:

  • Enable checkpoints even if no DB migrations are needed and run localnet
  • Use 1.21 version on localnet. Check that restarting with the current branch creates a checkpoint before applying DB migrations.
  • Use 1.21 version on localnet. When restarting with the current branch, kill the process before the migration is complete.
    • Check that the checkpoint still exists
    • Check that neard panics if tries to restart
    • Restore from the checkpoint, delete the checkpoint and check that the migration starts over.
  • Repeat the test above with use_checkpoints_for_db_migration = false and see that the DB gets corrupted and can't be recovered.

Copy link
Collaborator

@bowenwang1996 bowenwang1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the test plan?

nearcore/src/lib.rs Outdated Show resolved Hide resolved
nearcore/src/lib.rs Outdated Show resolved Hide resolved
nearcore/src/lib.rs Outdated Show resolved Hide resolved
@nikurt
Copy link
Contributor Author

nikurt commented Feb 15, 2022

What is the test plan?

@bowenwang1996 Manual testing. Updated description. Please let me know if you think we need an automated test for it.

core/store/src/db.rs Show resolved Hide resolved
nearcore/src/config.rs Outdated Show resolved Hide resolved
nearcore/src/lib.rs Show resolved Hide resolved
nearcore/src/lib.rs Outdated Show resolved Hide resolved
nearcore/src/lib.rs Outdated Show resolved Hide resolved
match existing_checkpoint_path {
Ok(existing_checkpoint_path) => match existing_checkpoint_path {
Some(existing_checkpoint_path) => {
panic!("Detected an existing DB migration checkpoint: '{}'. Probably a DB migration got interrupted and your DB is corrupted. Please replace the contents of '{}' with data from that checkpoint, delete the checkpoint and try again.",existing_checkpoint_path.display(), path.display());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if here, instead of a long instruction - we shouldn't point people to some README on our webpage (near.org/dev/recover-from-failed-migration) - where we can have more detailed steps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will follow up after I write that README.

nearcore/src/lib.rs Outdated Show resolved Hide resolved
nearcore/src/lib.rs Outdated Show resolved Hide resolved
@nikurt nikurt requested a review from mm-near February 16, 2022 15:09
nearcore/Cargo.toml Outdated Show resolved Hide resolved
/// Absolute path to the root directory for DB checkpoints.
/// If not set, defaults to the database location, i.e. '$home/data/'.
#[serde(skip_serializing_if = "Option::is_none")]
pub db_checkpoints_path: Option<PathBuf>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that other files here (genesis_file, validator_key_file) are treated as relative to the nearhome directory, we might want to do the same treatment here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Treating it as a relative path is complicated as the path needs to exist.
If a user provides "./sub/dir", then ~/.near/sub/dir needs to exist, which can be confusing.
Instead I added a check that the given path is an absolute path.

nearcore/src/lib.rs Outdated Show resolved Hide resolved
nearcore/src/lib.rs Outdated Show resolved Hide resolved
nearcore/src/lib.rs Outdated Show resolved Hide resolved
if near_config.config.use_checkpoints_for_db_migration {
if let Some(checkpoint_path) = checkpoint_path {
info!(target: "near", "Deleting DB checkpoint at '{}'", checkpoint_path.display());
match std::fs::remove_dir_all(&checkpoint_path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that std::fs::remove_dir_all is broken on widows for large dirs, and https://docs.rs/remove_dir_all/latest/remove_dir_all/ should be preferred instead. We probably shouldn't care about this at this point though.

nearcore/src/lib.rs Outdated Show resolved Hide resolved
Update user-visible messages to consistently refer to these snapshots as "database migration snapshot".
Merge checkpoint creation with a check if it already exists.
nearcore/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines 97 to 100
panic!(
"Detected an existing database migration snapshot: '{}'.\n\
Probably a database migration got interrupted and your database is corrupted.\n\
Please replace the contents of '{}' with data from that checkpoint, delete the checkpoint and try again.",checkpoint_path.display(), path.display());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already return Result, no need to panic here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to return an Err. The only concern is that this error message is verbose, and the callsite handling error messages is verbose too.

@near-bulldozer near-bulldozer bot merged commit f951d6e into master Feb 23, 2022
@near-bulldozer near-bulldozer bot deleted the nikurt-checkpoint branch February 23, 2022 14:39
@nikurt nikurt added this to Done in Node Q2 2022 Feb 23, 2022
Copy link
Collaborator

@bowenwang1996 bowenwang1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikurt could you add documentation on how to use this feature? cc @janewang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants