Skip to content
This repository has been archived by the owner on Aug 23, 2020. It is now read-only.

Solid Milestone initialization fix #486

Merged
merged 6 commits into from
Apr 4, 2018

Conversation

Desolatora
Copy link
Contributor

Bug

LedgerValidator initializes wrong consistent milestone after restart when node is not fully synced.

Symptoms

When node is synced to solid milestone X and latest milestone is Y, after restart solid milestone is incorrectly initialized to Z (usually very close to Y).

Cause

Method LedgerValidator.buildSnapshot() is responsible for building the most recent consistent milestone. It uses MilestoneViewModel.firstWithSnapshot() and MilestoneViewModel.nextWithSnapshot() to iterate on every milestone with snapshot then gets the state differences and patches them to the current snapshot. These methods deterministically rely on StateDiffViewModel.exists() which internally uses tangle.maybeHas(). The problem is that tangle.maybeHas() can only guarantee when entry is not present, but does not guarantee if entry is present. This leads to MilestoneViewModel.firstWithSnapshot() and MilestoneViewModel.nextWithSnapshot() sometimes returning milestone without snapshot or more precise milestone without StateDiff. This incorrect behaviour is also supported by the fact that StateDiffViewModel.load() returns an object even if StateDiff is not present. This leads to latestSnapshot being patched with no state differences and incorrect milestone index.

Problems

When node is initialized with wrong solid milestone this may lead to solid milestone tracker appearing frozen while it is trying to solidify next milestone or compute state diff with both operations involving huge number of transactions. Sometimes wrong solid milestone index may lead to "Transaction resolves to incorrect ledger balance" errors (bug: #460).

Proposed fix

Method StateDiffViewModel.isEmpty() is introduced to check if current state difference is empty. LedgerValidator.buildSnapshot() now iterates on every candidate milestone and patches latestSnapshot only with nonempty state differences. This guarantees that initialized consistent milestone is solid.

@paulhandy paulhandy requested a review from alon-e January 18, 2018 03:04
Copy link
Contributor

@GalRogozinski GalRogozinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebase with new code and start using locks instead of synchronized

MilestoneViewModel candidateMilestone = MilestoneViewModel.first(tangle);

while (candidateMilestone != null) {
if (candidateMilestone.index() % 1000 == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please open an if block with brackets.
Please also add && log.isDebugEnabled() and change log level to debug


while (candidateMilestone != null) {
if (candidateMilestone.index() % 1000 == 0)
log.info("Building snapshot... Consistent: #" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are still in Java 8 it is preferable to use StringBuilder to concatenate strings

@@ -221,20 +223,34 @@ public static boolean isApproved(Snapshot snapshot, Hash hash) {
*/
private MilestoneViewModel buildSnapshot(Snapshot latestSnapshot) throws Exception {
MilestoneViewModel consistentMilestone = null;

synchronized (latestSnapshot.snapshotSyncObject) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use locks. See updated code

@@ -24,7 +24,7 @@ public StateDiffViewModel(final Map<Hash, Long> state, final Hash hash) {
this.stateDiff.state = state;
}

public static boolean exists(Tangle tangle, Hash hash) throws Exception {
public static boolean maybeExists(Tangle tangle, Hash hash) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like

@GalRogozinski GalRogozinski added T-Bug C-Consensus S2 - High High Severity L-Awaiting More Reviews Lifecycle - Every PR must have at least 2 reviewers labels Mar 15, 2018
@Desolatora
Copy link
Contributor Author

Rebased with new code.

@GalRogozinski
Copy link
Contributor

Hey @Desolatora,
It seems like the rebase didn't roll out as smoothly as we like.
If you click on the "commits" tab or the "file changes" tab you will immediately see what I mean.

Git can be annoying sometimes, but we need to get it fixed.
If you find it too hard to fix this PR, you can close it and issue a new PR with only a link to this one in the description.

@Desolatora Desolatora force-pushed the buildsnapshot-fix branch 2 times, most recently from 96199e3 to 6662159 Compare March 19, 2018 18:56
@Desolatora
Copy link
Contributor Author

Rebase fixed.

GalRogozinski and others added 3 commits March 24, 2018 20:48
PR should have one area of responsibility
You don't publish ports if `--net=host` (`--network=host`) is used
Copy link
Contributor

@GalRogozinski GalRogozinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just clicked approved but I see that github claims there are still conflicts..

@Desolatora
Copy link
Contributor Author

Desolatora commented Apr 2, 2018

@GalRogozinski
It should be fixed now. If it isn't i will create a new PR.

Copy link
Contributor

@alon-e alon-e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! - tested it locally in different DB scenarios.
see comment about a failure mode in buildSnapshots().

but other than that 👍

consistentMilestone = candidateMilestone;
}
}
candidateMilestone = candidateMilestone.next(tangle);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if a milestone is not consistent (Snapshot.isConsistent == false), but the next ones are (for example due to DB corruption or malice). then a consistent, but not the actual state would be reached, right?
would be best to fail on an existing, but inconsistent milestone stateDiff.

snapshotMilestone = snapshotMilestone.nextWithSnapshot(tangle);
MilestoneViewModel candidateMilestone = MilestoneViewModel.first(tangle);
while (candidateMilestone != null) {
if (log.isDebugEnabled() && candidateMilestone.index() % 10000 == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not your fault, but FYI from my tests log.isDebugEnabled() always returns true in IRI, regardless of the debug flag.

@Desolatora
Copy link
Contributor Author

Desolatora commented Apr 3, 2018

buildSnapshot() will now stop searching for candidates if inconsistent milestone is encountered.
Also i removed the log.isDebugEnabled() part and actually changed the log level to INFO. I think it is essencial for iri to report milestone initialization, because this is fairly slow process and it may appear that iri is stuck while actually it is properly initializing.

@alon-e alon-e added L-Ready For Test Code passed review and needs to be tested and removed L-Awaiting More Reviews Lifecycle - Every PR must have at least 2 reviewers labels Apr 4, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
C-Consensus L-Ready For Test Code passed review and needs to be tested S2 - High High Severity T-Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants