Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Better logging when backfilling ancient blocks fail #10796

Merged
merged 22 commits into from
Jul 1, 2019

Conversation

dvdplm
Copy link
Collaborator

@dvdplm dvdplm commented Jun 26, 2019

Handle databases with holes

After snapshots have been downloaded and imported into the new DB we try to salvage existing blocks from the current DB before switching to the new DB ( this happens in migrate_blocks()).

The current code assumes that the blocks in the current DB form a chain all the way down to the genesis block (i.e. they are "ancient blocks"), which is not necessarily true. There are situations where the current DB have some amount of blocks in it that do not stretch all the way back to genesis.

I think the way this situation can occur is:

  • start a node with an empty DB
  • a snapshot is downloaded and the chain populated with [n..m] blocks, where m is close to the tip and n = m - 10k blocks
  • maybe the node stays up and imports block and is generally fine
  • the node is stopped and naturally falls out of sync
  • the node is started again and as it's too far off the tip, a new snapshot is downloaded
  • the previous db will have blocks [i..j] where i != 0 (it'll be 10k blocks before wherever the tip was in the FIRST snapshot) and j is farther behind the current tip - 10k

Before this PR the final step of the second snapshot will fail with anUnlinkedAncientBlockChain error because the previous DB's blocks do not stretch all the way back to 0.

With this PR, the attempt to salvage the old blocks is simply abbandoned and the new snapshot is allowed to complete, effectively tossing away the content of the previous DB.

I suspect that it might be possible to do better than that, but best do this step by step.

The diff is rather noisy but the key change of the PR starts here.

ref. #10793

Print total blocks imported, closes #10792

@dvdplm dvdplm self-assigned this Jun 28, 2019
@dvdplm dvdplm marked this pull request as ready for review June 28, 2019 09:56
@dvdplm dvdplm requested review from ascjones and tomusdrw June 28, 2019 09:57
@dvdplm dvdplm added the A0-pleasereview 🤓 Pull request needs code review. label Jun 28, 2019
Copy link
Collaborator

@tomusdrw tomusdrw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sane, modulo TODOs and the deadlock.

let find_range = || -> Option<(H256, H256)> {
// TODO: In theory, if the current best_block is > new first_block (i.e. they overlap)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep the comment, but drop the TODO: prefix :)

ethcore/src/snapshot/service.rs Outdated Show resolved Hide resolved
@@ -695,8 +721,9 @@ impl Service {
Ok(()) |
Err(Error::Snapshot(SnapshotError::RestorationAborted)) => (),
Err(e) => {
// TODO: after this we're sometimes deadlocked
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should actually always deadlock - because we already have self.restoration.lock() acquired in line 720. parking_lot locks are not re-entrant.

Instead of calling abort_restore() we should rather use *restoration = None here. and maybe add a trace line if you care about it.
Alternatively we could have abort_restore take the locked restoration. IMHO if we do that for some methods already we should do that for all of them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should actually always deadlock

Well, for some reason we do not. It's pretty rare, I've seen it twice.

Instead of calling abort_restore() we should rather use *restoration = None

That's what abort_restore() does though: here. Why is it better to do it here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it better to do it here?

To avoid locking twice?

As I said the second ption should be abort_restore_with_restoration(locked_res: &mut Option<Restoration>) so we can keep the logic there, but just pass the existing lock.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason I removed *restoration = None from here and replaced it with a call to abort_restore() was to collect all shutdown actions in one place. Not that abort_restore() is doing very much atm, but I still think it's good to have a single point of abortion.
is locking twice really a concern though: this is an error handler and I don't think we care much about performance?

abort_restore_with_restoration(locked_res…)

The restoration is a member of the struct here, so I'm not sure what the point of passing it as a param tbh. Now you got me confused!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since parking_lot::Mutex is not re-entrant locking it twice in the same thread can (or does) lead to a deadlock. So the call like that:

fn some(&self) {
  let mut restorating = self.restoration.lock();
  ....
  self.abort_restore();
}

fn abort_restore(&self) {
  *self.restoration.lock() = None
}

is guaranteed to deadlock, I thought always, but I suspect it might be just random.

In the same file I suppose we already had deadlock issues, so the with_restoration pattern emerged, where instead of locking resources locally in a particular function you get passed a (mutable) reference to the locked resource:

let mut restoration = self.restoration.lock();
self.feed_chunk_with_restoration(&mut restoration)

I propose to use exactly the same patter for abort_restore - I'm totally fine grouping all the shutdown actions there, but to prevent potential double-locking and potential deadlocks we can pass the locked resource, so the first example becomes:

fn some(&self) {
  let mut restoration = self.restoration.lock();
  ....
  self.abort_restore(&mut restoration);
}

fn abort_restore(&self, res: &mut Option<Restoration>) {
  *res = None
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thank you for explaining.

Is this an option?

		let r = {
			let mut restoration = self.restoration.lock();
			self.feed_chunk_with_restoration(&mut restoration, hash, chunk, is_state)
		};
		match r {
			…
			Err(e) => {
				…
				self.abort_restore();
				…
			}
		}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to docs:

Attempts to lock a mutex in the thread which already holds the lock will result in a deadlock.

So I'm not sure either why it didn't always deadlock. The proposed fix looks good.

…ckChain

* master:
  revert changes to .gitlab-ci.yml (#10807)
  Add filtering capability to `parity_pendingTransactions` (issue 8269) (#10506)
  removed EthEngine alias (#10805)
  wait a bit longer in should_check_status_of_request_when_its_resolved (#10808)
  Do not drop the peer with None difficulty (#10772)
  ethcore-bloom-journal updated to 2018 (#10804)
  ethcore-light uses bincode 1.1 (#10798)
  Fix a few typos and unused warnings. (#10803)
  updated project to ansi_term 0.11 (#10799)
  added new ropsten-bootnode and removed old one (#10794)
  updated price-info to edition 2018 (#10801)
  ethcore-network-devp2p uses igd 0.9 (#10797)
  updated parity-local-store to edition 2018 and removed redundant Error type (#10800)
@@ -695,8 +721,9 @@ impl Service {
Ok(()) |
Err(Error::Snapshot(SnapshotError::RestorationAborted)) => (),
Err(e) => {
// TODO: after this we're sometimes deadlocked
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to docs:

Attempts to lock a mutex in the thread which already holds the lock will result in a deadlock.

So I'm not sure either why it didn't always deadlock. The proposed fix looks good.

ethcore/src/snapshot/service.rs Show resolved Hide resolved
@ordian ordian added this to the 2.6 milestone Jun 28, 2019
@ordian ordian added the M4-core ⛓ Core client code / Rust. label Jun 28, 2019
…ckChain

* master:
  cargo update -p smallvec (#10822)
  replace memzero with zeroize crate (#10816)
  Don't repeat the logic from Default impl (#10813)
  removed additional_params method (#10818)
  Add Constantinople eips to the dev (instant_seal) config (#10809)
  removed redundant fmt::Display implementations (#10806)
Copy link
Contributor

@ascjones ascjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits, otherwise LGTM.

ethcore/src/snapshot/error.rs Outdated Show resolved Hide resolved
ethcore/src/snapshot/service.rs Outdated Show resolved Hide resolved
// so the useful set of blocks is defined as:
// [0 ... min(new.first_block, best_ancient_block or best_block)]
//
// If, for whatever reason, the old db does not have ancient blocks (i.e.
// `best_ancient_block` is `None`) AND a non-zero `first_block`, such that the old db looks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think best_ancient_block being None can either mean: "no ancient blocks imported" or "all ancient blocks imported (no gaps)". So comment is misleading, although in the case when first_block is Some then the former is the true.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I should move the parens I think.

ethcore/src/snapshot/service.rs Outdated Show resolved Hide resolved
@dvdplm dvdplm merged commit 5dc5be1 into master Jul 1, 2019
@dvdplm dvdplm deleted the dp/chore/debug-synching-UnlinkedAncientBlockChain branch July 1, 2019 12:41
dvdplm added a commit that referenced this pull request Jul 3, 2019
* master: (22 commits)
  ethcore does not use byteorder (#10829)
  Better logging when backfilling ancient blocks fail (#10796)
  depends: Update wordlist to v1.3 (#10823)
  cargo update -p smallvec (#10822)
  replace memzero with zeroize crate (#10816)
  Don't repeat the logic from Default impl (#10813)
  removed additional_params method (#10818)
  Add Constantinople eips to the dev (instant_seal) config (#10809)
  removed redundant fmt::Display implementations (#10806)
  revert changes to .gitlab-ci.yml (#10807)
  Add filtering capability to `parity_pendingTransactions` (issue 8269) (#10506)
  removed EthEngine alias (#10805)
  wait a bit longer in should_check_status_of_request_when_its_resolved (#10808)
  Do not drop the peer with None difficulty (#10772)
  ethcore-bloom-journal updated to 2018 (#10804)
  ethcore-light uses bincode 1.1 (#10798)
  Fix a few typos and unused warnings. (#10803)
  updated project to ansi_term 0.11 (#10799)
  added new ropsten-bootnode and removed old one (#10794)
  updated price-info to edition 2018 (#10801)
  ...
dvdplm added a commit that referenced this pull request Jul 4, 2019
…me-parent

* master:
  refactor: whisper: Add type aliases and update rustdocs in message.rs (#10812)
  Break circular dependency between Client and Engine (part 1) (#10833)
  tests: Relates to #10655: Test instructions for Readme (#10835)
  refactor: Related #9459 - evmbin: replace untyped json! macro with fully typed serde serialization using Rust structs (#10657)
  idiomatic changes to PodState (#10834)
  Allow --nat extip:your.host.here.org (#10830)
  When updating the client or when called from RPC, sleep should mean sleep (#10814)
  Remove excessive warning (#10831)
  Fix typo in README.md (#10828)
  ethcore does not use byteorder (#10829)
  Better logging when backfilling ancient blocks fail (#10796)
  depends: Update wordlist to v1.3 (#10823)
dvdplm added a commit that referenced this pull request Jul 4, 2019
* master:
  refactor: whisper: Add type aliases and update rustdocs in message.rs (#10812)
  Break circular dependency between Client and Engine (part 1) (#10833)
  tests: Relates to #10655: Test instructions for Readme (#10835)
  refactor: Related #9459 - evmbin: replace untyped json! macro with fully typed serde serialization using Rust structs (#10657)
  idiomatic changes to PodState (#10834)
  Allow --nat extip:your.host.here.org (#10830)
  When updating the client or when called from RPC, sleep should mean sleep (#10814)
  Remove excessive warning (#10831)
  Fix typo in README.md (#10828)
  ethcore does not use byteorder (#10829)
  Better logging when backfilling ancient blocks fail (#10796)
  depends: Update wordlist to v1.3 (#10823)
  cargo update -p smallvec (#10822)
  replace memzero with zeroize crate (#10816)
  Don't repeat the logic from Default impl (#10813)
  removed additional_params method (#10818)
  Add Constantinople eips to the dev (instant_seal) config (#10809)
dvdplm added a commit that referenced this pull request Jul 4, 2019
* master: (21 commits)
  refactor: whisper: Add type aliases and update rustdocs in message.rs (#10812)
  Break circular dependency between Client and Engine (part 1) (#10833)
  tests: Relates to #10655: Test instructions for Readme (#10835)
  refactor: Related #9459 - evmbin: replace untyped json! macro with fully typed serde serialization using Rust structs (#10657)
  idiomatic changes to PodState (#10834)
  Allow --nat extip:your.host.here.org (#10830)
  When updating the client or when called from RPC, sleep should mean sleep (#10814)
  Remove excessive warning (#10831)
  Fix typo in README.md (#10828)
  ethcore does not use byteorder (#10829)
  Better logging when backfilling ancient blocks fail (#10796)
  depends: Update wordlist to v1.3 (#10823)
  cargo update -p smallvec (#10822)
  replace memzero with zeroize crate (#10816)
  Don't repeat the logic from Default impl (#10813)
  removed additional_params method (#10818)
  Add Constantinople eips to the dev (instant_seal) config (#10809)
  removed redundant fmt::Display implementations (#10806)
  revert changes to .gitlab-ci.yml (#10807)
  Add filtering capability to `parity_pendingTransactions` (issue 8269) (#10506)
  ...
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Log progress when restoring snapshots
5 participants