Restarted W10 node sometimes becomes stuck adding new block (onPersist insufficient funds) #3273
Are you using Level or Bolt?
LevelDB.
That's more interesting. It obviously means a broken DB, but given the way it's handled internally, it's not trivial for it to get into this state. Yet if it happens, there should be a reason for it. CC @532910.
Uncovered some memory issues on my machine in recent days. It could well be a hardware fault that is responsible, so if you can't replicate, don't spend too long on it.
I was able to reproduce a similar bug on my local machine under the following conditions:
I was also able to reproduce the same problem during network syncing with low disk space, so from my side this problem also seems to be connected to hardware issues.
Do you still have the DB? Can we check what the state difference is? It shouldn't happen this way; whatever is written must be consistent, it's a single transaction after all (or the DB itself is broken if it can't ensure consistency on commit failure because of limited disk space).
I don't; I've removed the DB to test the release compatibility, but I suspect the problem can be reproduced. I'll try to reproduce it and save the DB to look at the diffs.
I have reproduced the same problem with boltdb, but can't share the database, unfortunately. @roman-khimov @AnnaShaleva Do you have any specific place in mind to look into?
Oh, I wrote too soon; my problem seems a bit different:
But there are other nodes which are OK, so it seems to be an application/DB issue, likely related to this one.
Zero ideas on this one. It'd be great if you could at least diff the proper state against whatever is in this broken DB. Luckily, your chain doesn't seem to be long.
My version of neo-go is 0.104 with some commits on top (based at 441eb8a), with fixes from #3279 (we reverted and reapplied #3110, but there were no conflicts, so I don't expect problems there) and some unrelated improvements in the network server (mostly for multi-IP and TLS).
Sanity check: values at 0xc1 differ because the corrupted node has acquired some headers; values at 0xc0 are the same. That being said, here are the seemingly relevant diffs stored under the 0x70 prefix (contract data):
Native notary (
There is also a diff for the GAS contract; it is a bit bigger, so I won't attach it here. The script I used to produce the files:

```go
package export_test

import (
	"encoding/hex"
	"os"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"go.etcd.io/bbolt"
)

func TestExport(t *testing.T) {
	runExport(t, "./mainnet.bolt")
	runExport(t, "./corrupted.mainnet.bolt")
}

// runExport dumps every key-value pair of the "DB" bucket (and nested
// buckets) into a <path>.keys text file, one hex pair per line.
func runExport(t *testing.T, path string) {
	db, err := bbolt.Open(path, os.ModePerm, &bbolt.Options{
		Timeout:  time.Second,
		ReadOnly: true,
	})
	require.NoError(t, err)
	defer db.Close()

	f, err := os.OpenFile(path+".keys", os.O_CREATE|os.O_TRUNC|os.O_WRONLY, os.ModePerm)
	require.NoError(t, err)
	defer f.Close()

	require.NoError(t, db.View(func(tx *bbolt.Tx) error {
		printBucket(f, tx.Bucket([]byte("DB")), "")
		return nil
	}))
}

// printBucket recursively prints the bucket's contents, descending into
// nested buckets (cursor returns a nil value for them).
func printBucket(f *os.File, b *bbolt.Bucket, indent string) {
	c := b.Cursor()
	for k, v := c.First(); k != nil; k, v = c.Next() {
		if v != nil {
			f.WriteString(indent)
			f.WriteString(hex.EncodeToString(k))
			f.WriteString(": ")
			f.WriteString(hex.EncodeToString(v))
			f.WriteString("\n")
		} else {
			printBucket(f, b.Bucket(k), indent+hex.EncodeToString(k)+" : ")
		}
	}
}
```
Account is blocked when it's in the Policy's storage, not when it's missing from the Policy storage. Introduced in bbbc680. This bug leads to the fact that during native Neo cache initialization at the last block in the dBFT epoch, all candidates accounts are "blocked", and thus, stand-by committee and validators are used in the subsequent new epoch. Close #3424. This bug may lead to the consequences described in #3273, but it needs to be confirmed. Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
@EdgeDLT, @fyfyrchik, please take a look at #3443. It should fix the problem, but that needs to be confirmed, so I'm not closing this issue.
@EdgeDLT, @fyfyrchik, 0.106.0 is out, you may try it. Please write back if it succeeded.
Current Behavior
Attempting to start an out-of-sync node produces:
Expected Behavior
The node should successfully sync the next block.
Possible Solution
Something related to #3110, maybe? Haven't looked too deeply into it.
Steps to Reproduce
Context
Ran into this issue with my indexer node. By default the node is not kept alive: it is fully synchronized, then shut down once all operations are complete. Occasionally, but not always, starting the node again produces this error. If it does, no further progress can be made without a full resync.
Your Environment
Windows 10