Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nodes getting stuck syncing #256

Merged
merged 3 commits into from
Apr 2, 2017

Conversation

denravonska
Copy link
Member

@denravonska denravonska commented Mar 31, 2017

When determining if a beacon is old, compare its age to the age of th…e last known block and not the system time.

Without this comparing to the system time would cause unsynced nodes to get stuck if they try to reconnect while being offline for a while. The beacon age is a construct to save local space on the nodes so it needs to be compared within the nodes' time frames. Doing additional testing on this.

This closes #254.

It's possible that I made a bug in the new time limit so it allows everything instead. Master doesn't have unit tests so it's hard to test. I'll double check it.

…e last known block and not the system time.

Without this comparing to the system time would cause unsynced nodes to get stuck if they try to reconnect while being offline for a while. The beacon age is a construct to save local space on the nodes so it needs to be compared within the nodes' time frames.
@denravonska denravonska changed the title When determining if a beacon is old, compare its age to the age of th… Fix nodes getting stuck syncing Mar 31, 2017
int64_t iAge = pindexBest != NULL
? pindexBest->nTime - mvApplicationCacheTimestamp[key]
: 0;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not compare the beacon age with age of block you are currently checking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is much better. I'll check it when I get back to the comp.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomasbrod Alright, there might not be a block to validate against since it also seems to apply when broadcasting your own beacon. What do you think about changing the signature to GetBeaconPublicKey(const std::string& cpid, uint64_t referenceTime);? That way we can use the current time for beacon publishing and the block time when validating blocks.

@@ -7170,7 +7170,7 @@ bool static ProcessMessage(CNode* pfrom, string strCommand, CDataStream& vRecv,
return false;
}

if (pfrom->nVersion < 180322 & !fTestNet && pindexBest->nHeight > 855000)
if (pfrom->nVersion < 180322 && !fTestNet && pindexBest->nHeight > 855000)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a large part of what caused the major syncing issue and 3.5.8.6 clients not getting dropped.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is a typo, but if you expand it it just happens to work out. It's merely a nuisance :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added som braces to the previous version expression:
((pfrom->nVersion < 180322) & (!fTestNet)) && (pindexBest->nHeight > 855000)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result of first '<' is 0/1 and result of '!' is 0/1 so the '&' does the same thing to it
then it goes to '&&' as boolean
which means that the two expressions are equivalent

@gridcoin
Copy link
Contributor

gridcoin commented Apr 2, 2017

I want to fix this as much as you all do but we have to get this right as it will be too embarassing to not fix it permanently and ask exchanges to upgrade. Therefore we need to focus on validating this and getting the network back to smooth sailing (IE instead of improving the code in this ticket).
So looking at the proposed changes, as Tomas said, that boolean agebra typo actually means the same thing - so thats not actually helping (fix the issue). On a side note, we could increment the network version, forcing the entire network to upgrade, but the problem with that is it creates a massive fork and risks GRC to be lost on trading shenanigans and that is especially dangerous when we have such an acute upgrade timeframe like this one (IE unplanned emergency upgrade) so I am shying away from that.
Regarding adding the If pindex == null then return valid, that doesnt really help either because we have the nGrandfather int all through the program where we dont check those low blocks anyway (when current block is null), as soon as wallet syncs more than 1k in there is always a pindexbest.

Moving on to the last change, comparing block timestamp as opposed to getAdjustedTime - either way works as a consensus - as GetAdjustedTime is a network agreed timestamp for the current blocks being checked.

Something changed, I dont know if its a portion of the network checking blocks that were truly online longer than the old bug exploited or what, but looking at this change it does not look like it will help.

I think we need more research, and we need a very simple change to be applied as a madatory - and then Devranoska you can then improve the code and Ill turn the reigns to you. I am going to send you build scripts for windows and all my windows dependencies asap btw.

As far as what can we do right now, the main solution I am thinking of is allowing the client to Pass the block check (for BAD CPID) when the block is older than BlockNeedsChecked- as that would allow clients to sync to the top. Then I could ensure the client does load MORE than 6 months In during the initial load (which I believe we do already). And possibly put a more stringent upgrade penalty in..... Will look into this more now- but I welcome comments.

@denravonska
Copy link
Member Author

denravonska commented Apr 2, 2017

@gridcoin Actually, this PR is intended as a bugfix and not an improvement or refactoring :) GetAdjustedTime() is the host's local time with a network time offset which means that the wallet may toss out beacons it needs to catch up back in sync. Consider a wallet which is taken offline for 6 months at block x. Now every beacon it used to know are old so it can never validate block x+1. That's why I think it should stay the way it was before, that is the age is compared to the age of the wallet's newest block. The proposal in your last paragraph will also solve it and make this PR redundant.

Disregarding that particular issue, there is something very strange going on. I added a findbeacon command to see where the beacon for a CPID of a particular block resided and found that for block 859239 (a current known "get-stuck-at" block) the CPID is 18c25e9a719c92bedd77a3e1c5ac536d and its closest beacon towards the root of the chain is in block 641919. That beacon is too old no matter how you look at it; height, current time or time from block 859239. Can anyone figure out how it got there or am I off in the way I'm thinking?

@gridcoin
Copy link
Contributor

gridcoin commented Apr 2, 2017

@denravonska, Ok, you are correct, now I see why I had it set originally the old way - in that nodes that are Not fully synced would have the same view of the past 6 months of beacons, and before the Mandatory, the main bug that existed before- was that nodes who stayed online for longer than 6 months had TOO many beacons in memory causing them to validate blocks they should have actually failed (IE the consensus was not accurate across all of our network).

OK, I am onboard with your change now. I will pull it in and see if I can add any additional logic regarding blocks that dont need checked because they are old, and then we can do a new mandatory.

Looking at this with a fresh view, I believe our last mandatory did not solve the problem and now this appears to me that it most likely will solve the problem. With a new grandfather block the latest snapshot should allow people to sync to the top.

@gridcoin
Copy link
Contributor

gridcoin commented Apr 2, 2017

Regarding the CPID in question: It should have been declined by 51% and I assume it was accepted because of the nodes staying on an extended period.

@gridcoin gridcoin merged commit 419db74 into gridcoin-community:master Apr 2, 2017
@gridcoin
Copy link
Contributor

gridcoin commented Apr 2, 2017

Thanks for all the hard work.. I think we are on the right track.

@skcin
Copy link
Contributor

skcin commented Apr 2, 2017

@gridcoin getting stuck between block 854400 and 855000 is also very common due to forking and alot of old clients still online. If the disconnect from old clients would happen before block 854400 syncing would be smoother.

@denravonska
Copy link
Member Author

denravonska commented Apr 2, 2017

@gridcoin Cool. My fix caused @skcin to get past 859238 but that shouldn't be possible. Either my fix is buggy or I have pulled the incorrect data for the beacons. Are beacons always stored in block.tx[>0]?

@skcin I think you were correct about your initial, grandfather change. I think there are blocks with very old beacons in the 854400 series so using only .7 clients should not help. Or did it fix it in your test?

@philipswift
Copy link

Excellent spirit and team play regardless of outcome! To the for..

@skcin
Copy link
Contributor

skcin commented Apr 2, 2017

@denravonska disconnecting earlier would just prevent you getting dragged on a fork and getting stuck there. Especially if you are syncing from 0 you might currently be connected to several 3.5.8.6 clients. Between block 854400 and 855000 they can drag you on a fork and you get stuck soon since they accept blocks you don't. I think this this is a temorary issue though, unitl all most old clients are gone.

@gridcoin
Copy link
Contributor

gridcoin commented Apr 2, 2017

@denravonska, the change looks solid, as far as logic. He probably either went on a new chain, or the chain may be full of investor blocks (that dont get checked). I think it will work.

@grctest
Copy link
Contributor

grctest commented Apr 5, 2017

@Dantali0n @tomasbrod - What are your GRC addresses? Small bounty for aiding in development.

@Dantali0n
Copy link

@grctest SA1aEz3wpSUK5NqKGsGYh9vzBie5tqJJyL thank you that's very kind :)

@philipswift
Copy link

@tomasbrod any particular reason for the 'thumb down'?

@grctest
Copy link
Contributor

grctest commented Apr 6, 2017

@Dantali0n:
@grctest SA1aEz3wpSUK5NqKGsGYh9vzBie5tqJJyL thank you that's very kind :)

Sent! TX: 64ea219e125bd57b7f62d8ddae0438217a43c8e7058fc3cc90653e9191ce2ac3-000

@tomasbrod
Copy link
Member

@grctest SD24cwWaqCd1gNNYnwbK2gTkkjVKWUuPuG (Just generated) Do I deserve it? I just talked into the issue. But hey, Thank you.

@philipswift Becouse it was off topic and ... unfinished? "To the for.." <- what does it mean?

@grctest
Copy link
Contributor

grctest commented Apr 7, 2017

@tomasbrod sent! tx: d04dd70a846323244e6ed40a686fafbebdaf11e8cb0577684c897daa4ff29179-000

@philipswift
Copy link

philipswift commented Apr 7, 2017

@tomasbrod lol, you give me a thumbs down because you don't understand? Skewed logic and quite an insight. It was finished. Maybe be a little less hasty to judge and try not to assume. Assuming makes an ASS out of U and ME.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stuck on block 854887
7 participants