Introducing a repair mechanism for corruptions in the ledger database #1174

Merged
merged 10 commits into from Nov 25, 2018

3 participants
@hmoog
Contributor

hmoog commented Nov 20, 2018

Description

This PR implements a repair mechanism for minor corruptions in the IRI database. It reverts and re-processes conflicting milestones without having to reset the whole database.

(In addition, it removes some unused imports.)

Detailed Description

If IRI faces database corruptions in the "snapshotIndex" of transactions, the same transaction can get processed as confirmed by two or more milestones. Its balance is then booked multiple times, which leads to inconsistent balances. This causes the nodes to fall out of sync and report "Skipping negative value for address: ..." in an endless loop. Until now, the only way to recover from this problem was to do a --rescan or sometimes even to remove the database and start a complete resync.
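
The double booking described above can be illustrated with a minimal sketch. It assumes a heavily simplified model in which a milestone books every referenced transaction whose snapshotIndex is still 0; the class, the `Tx` type and the `applyMilestone` helper are hypothetical and are not taken from the IRI code base.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DoubleBookingSketch {
    /** Hypothetical, simplified transaction that changes the balance of one address. */
    static final class Tx {
        final String address;
        final long value;
        int snapshotIndex; // 0 means "not yet confirmed by any milestone" (assumption)

        Tx(String address, long value) {
            this.address = address;
            this.value = value;
        }
    }

    /** Books every referenced transaction that does not carry a snapshotIndex yet. */
    static void applyMilestone(int index, List<Tx> referenced, Map<String, Long> balances,
                               boolean snapshotIndexPersisted) {
        for (Tx tx : referenced) {
            if (tx.snapshotIndex != 0) {
                continue; // already confirmed by an earlier milestone
            }
            balances.merge(tx.address, tx.value, Long::sum);
            if (snapshotIndexPersisted) {
                tx.snapshotIndex = index;
            }
            // If the write is lost, the transaction still looks unconfirmed afterwards.
        }
    }

    /** Applies two milestones that both reference the same spend and returns the balance. */
    static long finalBalance(boolean snapshotIndexPersisted) {
        Map<String, Long> balances = new HashMap<>();
        balances.put("ADDRESS_A", 10L);
        Tx spend = new Tx("ADDRESS_A", -10L);
        applyMilestone(100, Arrays.asList(spend), balances, snapshotIndexPersisted);
        applyMilestone(101, Arrays.asList(spend), balances, snapshotIndexPersisted);
        return balances.get("ADDRESS_A");
    }

    public static void main(String[] args) {
        System.out.println("healthy database:   " + finalBalance(true));
        System.out.println("corrupted database: " + finalBalance(false));
    }
}
```

With the snapshotIndex persisted the address ends at 0. With the write lost, the spend is booked twice and the address ends at -10, which is the kind of negative balance the log message complains about.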

There are numerous reasons why these corruptions in the snapshotIndex can appear:

  • Milestones got processed in the wrong order (already fixed, but existing databases might still have these corruptions).
  • IRI crashes or gets stopped before the modified snapshotIndex was flushed to the database.
  • Race conditions between different threads that try to write to the same transaction at the same time (for example solid = true + snapshotIndex = xyz), so that one thread overwrites the changes of the other. Note: Updating just a single property of the transaction causes the whole transaction to be serialized and written again. (This happens a lot during times of high load / network activity.)
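
The race condition in the last bullet is a classic lost update. The following sketch replays one problematic interleaving deterministically, without real threads, so the outcome is reproducible. The `Record` type and the in-memory `store` are hypothetical stand-ins for the serialized transaction and the database, not IRI classes.

```java
import java.util.HashMap;
import java.util.Map;

final class LostUpdateSketch {
    /** Hypothetical stand-in for the whole serialized transaction metadata. */
    static final class Record {
        final boolean solid;
        final int snapshotIndex;

        Record(boolean solid, int snapshotIndex) {
            this.solid = solid;
            this.snapshotIndex = snapshotIndex;
        }
    }

    /** Replays: A reads, B reads, B writes snapshotIndex, A writes solid. */
    static Record replayInterleaving() {
        Map<String, Record> store = new HashMap<>();
        store.put("TX", new Record(false, 0));

        // Both writers load the full record before either one has written.
        Record seenByA = store.get("TX");
        Record seenByB = store.get("TX");

        // Writer B changes only snapshotIndex but stores the whole record.
        store.put("TX", new Record(seenByB.solid, 123));

        // Writer A changes only solid, based on its stale copy, and stores the whole record.
        store.put("TX", new Record(true, seenByA.snapshotIndex));

        return store.get("TX");
    }

    public static void main(String[] args) {
        Record result = replayInterleaving();
        System.out.println("solid=" + result.solid + ", snapshotIndex=" + result.snapshotIndex);
    }
}
```

After the replay the record is solid, but its snapshotIndex is back at 0: the update of writer B has been silently overwritten because writer A wrote back its stale copy of the whole record.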

Type of change

  • Enhancement (a non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines for this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • New and existing unit tests pass locally with my changes

@GalRogozinski GalRogozinski requested review from alon-e and GalRogozinski and removed request for GalRogozinski Nov 20, 2018

alon-e and others added some commits Nov 21, 2018

Update src/main/java/com/iota/iri/service/milestone/impl/MilestoneSer…
…viceImpl.java

Co-Authored-By: hmoog <hm@mkjc.net>
Update src/main/java/com/iota/iri/service/milestone/impl/LatestSolidM…
…ilestoneTrackerImpl.java

Co-Authored-By: hmoog <hm@mkjc.net>
Update src/main/java/com/iota/iri/service/milestone/impl/MilestoneSer…
…viceImpl.java

Co-Authored-By: hmoog <hm@mkjc.net>
Update src/main/java/com/iota/iri/service/milestone/MilestoneService.…
…java

Co-Authored-By: hmoog <hm@mkjc.net>
@alon-e

alon-e approved these changes Nov 21, 2018

@iotaledger iotaledger deleted a comment from codacy-bot Nov 22, 2018

    for (int i = errorCausingMilestone.index(); i > errorCausingMilestone.index() - repairBackoffCounter; i--) {
        milestoneService.resetCorruptedMilestone(i);
    }
}


@GalRogozinski

GalRogozinski Nov 22, 2018

Member

I just want to be clear on this:
Let's say indexes 100 and 101 are corrupt.
The first time you go into this method, you reset milestone 100.
The second time, you reset both milestones 100 and 101?


@hmoog

hmoog Nov 22, 2018

Contributor

Let's say 100 is corrupt because it didn't get its snapshotIndex set correctly, and 101 also approves some of the transactions that were already taken into account for the balances of 100. 101 can therefore not be applied.

We will first reset 101 and try to reapply it. If that fails, we reset 101 and 100 and try to reapply both. If that fails, we reset 101, 100 and 99 and try to reapply the three of them.

Depending on which transaction didn't get its snapshotIndex set correctly, we might need to go a few milestones back to "find" the one that wasn't processed correctly. Sometimes milestone 101 would reference a "broken" tx from milestone 97, for example.
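
The widening reset window described in this reply can be sketched as follows. The loop bounds are copied from the reviewed snippet; the surrounding class, the `milestonesToReset` helper, and the assumption that repairBackoffCounter starts at 1 and grows by one per failed attempt are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

final class RepairBackoffSketch {
    /** Returns the milestone indexes that one repair attempt would reset, newest first. */
    static List<Integer> milestonesToReset(int errorCausingIndex, int repairBackoffCounter) {
        List<Integer> indexes = new ArrayList<>();
        // Same bounds as the loop in the reviewed snippet.
        for (int i = errorCausingIndex; i > errorCausingIndex - repairBackoffCounter; i--) {
            indexes.add(i);
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Assumption: the counter is 1 on the first attempt and grows with every failure.
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + " resets " + milestonesToReset(101, attempt));
        }
    }
}
```

For an error-causing milestone 101 this prints [101], then [101, 100], then [101, 100, 99], matching the sequence described in the reply.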

@GalRogozinski
Member

GalRogozinski left a comment

Due to current problems with Buildkite, the checks show a failure even though the build passes the current regression tests.

@GalRogozinski GalRogozinski merged commit 506e074 into iotaledger:dev-localsnapshots Nov 25, 2018

4 of 9 checks passed

buildkite/iri-build-jar-prs Build #217 failed (6 minutes, 22 seconds)
buildkite/iri-build-jar-prs/55af75f1-3d5e-4de1-8f6f-5845960cc52d Failed (exit status 1)
buildkite/iri-build-jar-prs/5feb8180-3410-4b91-b136-34e652ce7693 Failed (exit status 1)
buildkite/iri-build-jar-prs/81977f11-193b-4d29-9152-86b7b59ca8d0 Failed (exit status 1)
buildkite/iri-build-jar-prs/9675a7e1-6471-4ed9-b874-d1768d18bb57 Failed (exit status 1)
Codacy/PR Quality Review Up to standards. A positive pull request.
buildkite/iri-build-jar-prs/build-iri-oracle8 Passed (3 minutes, 3 seconds)
buildkite/iri-build-jar-prs/pull-from-repo Passed (1 minute, 53 seconds)
continuous-integration/travis-ci/pr The Travis CI build passed