bitswap: clean up ledgers when disconnecting #3437

Merged
whyrusleeping merged 2 commits into master from feat/bitswap-cleanup-ledger on May 20, 2017

3 participants
@whyrusleeping
Member

whyrusleeping commented Nov 29, 2016

License: MIT
Signed-off-by: Jeromy <why@ipfs.io>

// TODO: release ledger
e.lock.Lock()
defer e.lock.Unlock()
l, ok := e.ledgerMap[p]

@Kubuxu

Kubuxu Dec 6, 2016

Member

You are not locking l.lk here, and again we have a situation with two locks. To me, it screams deadlock.

@whyrusleeping

whyrusleeping Dec 7, 2016

Member

Added the ledger lock. Re locking concerns: these ones are well scoped. The engine lock is either always held first, or not held while taking the ledger lock, and the engine lock is never taken while holding a ledger lock.
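The ordering rule described here can be sketched as follows. This is a minimal illustration, not the go-ipfs code: the `engine` and `ledger` types are simplified, peer IDs are plain strings, and `addBytes` and `totalBytes` are hypothetical helpers. Only the field names `lock`, `lk`, and `ledgerMap` and the function name `findOrCreate` come from the discussion.

```go
package main

import (
	"fmt"
	"sync"
)

// ledger and engine are simplified stand-ins for the bitswap types.
type ledger struct {
	lk    sync.Mutex
	bytes int
}

type engine struct {
	lock      sync.Mutex
	ledgerMap map[string]*ledger
}

func newEngine() *engine {
	return &engine{ledgerMap: make(map[string]*ledger)}
}

// findOrCreate takes only the engine lock and releases it before returning.
func (e *engine) findOrCreate(p string) *ledger {
	e.lock.Lock()
	defer e.lock.Unlock()
	l, ok := e.ledgerMap[p]
	if !ok {
		l = &ledger{}
		e.ledgerMap[p] = l
	}
	return l
}

// addBytes (hypothetical) takes the ledger lock while the engine lock is NOT held.
func (e *engine) addBytes(p string, n int) {
	l := e.findOrCreate(p)
	l.lk.Lock()
	defer l.lk.Unlock()
	l.bytes += n
}

// totalBytes (hypothetical) takes the engine lock first, then each ledger lock.
// No code path takes the engine lock while already holding a ledger lock.
func (e *engine) totalBytes() int {
	e.lock.Lock()
	defer e.lock.Unlock()
	total := 0
	for _, l := range e.ledgerMap {
		l.lk.Lock()
		total += l.bytes
		l.lk.Unlock()
	}
	return total
}

func main() {
	e := newEngine()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			e.addBytes("peerA", 2)
			_ = e.totalBytes()
		}()
	}
	wg.Wait()
	fmt.Println(e.totalBytes())
}
```

Because the locks are only ever acquired in the order engine then ledger, two goroutines cannot each hold one lock while waiting for the other.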

@Kubuxu

Member

Kubuxu commented Dec 7, 2016

Other concern: what if PeerConnected gets the ledger instance but can't acquire the ledger lock because it is held by PeerDisconnected? Then PeerConnected will increment the value on a ledger that is no longer in the ledger map.

@whyrusleeping

Member

whyrusleeping commented Dec 7, 2016

@Kubuxu hrm... for that to happen, PeerConnected would have to return from findOrCreate, and then PeerDisconnected would have to take the engine lock, pull the ledger out of the map, and take the ledger lock before the PeerConnected process is able to. This IS possible.

One option is to not use findOrCreate and instead take the engine lock ourselves throughout the entire call to PeerConnected, essentially reimplementing findOrCreate in that function.

@Kubuxu

Member

Kubuxu commented Dec 7, 2016

I know that it is something that might happen very rarely or never, but those edge cases add up and create hard-to-track-down bugs and instability. If the chance of this bug occurring is 0.001% per run, then the chance that it occurs at least once across 100000 runs is more than 60%, and if we don't stop introducing possible bugs like that, go-ipfs will always be unstable and unreliable.
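The "rare bugs add up" arithmetic can be checked directly. Assuming independent runs, the probability of at least one occurrence in n runs is 1 - (1 - p)^n. The sketch below (the function name `atLeastOnce` is illustrative) shows that a per-run probability of 1e-5 over 100000 runs gives roughly 63%, while 1e-7 gives about 1%.

```go
package main

import (
	"fmt"
	"math"
)

// atLeastOnce returns the probability that an event with per-run
// probability p happens at least once in n independent runs.
// It computes 1 - (1-p)^n via Log1p/Expm1 to stay accurate for tiny p.
func atLeastOnce(p float64, n int) float64 {
	return -math.Expm1(float64(n) * math.Log1p(-p))
}

func main() {
	fmt.Printf("p=1e-5: %.4f\n", atLeastOnce(1e-5, 100000))
	fmt.Printf("p=1e-7: %.4f\n", atLeastOnce(1e-7, 100000))
}
```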

@whyrusleeping

Member

whyrusleeping commented Dec 7, 2016

@Kubuxu Right, so I think the solution is to make PeerConnected hold the engine lock through the entire method.
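A minimal sketch of that approach, under stated assumptions: the types are simplified, peer IDs are plain strings, and the `ref` counter and the `numLedgers` helper are illustrative names rather than the merged code. Both methods hold the engine lock for their whole body, so a disconnect cannot remove a ledger between the lookup and the increment in PeerConnected.

```go
package main

import (
	"fmt"
	"sync"
)

type ledger struct {
	lk  sync.Mutex
	ref int // number of live connections to this peer (assumed field)
}

type engine struct {
	lock      sync.Mutex
	ledgerMap map[string]*ledger
}

func newEngine() *engine {
	return &engine{ledgerMap: make(map[string]*ledger)}
}

// PeerConnected holds the engine lock throughout, inlining the
// find-or-create step instead of calling a helper that releases the lock.
func (e *engine) PeerConnected(p string) {
	e.lock.Lock()
	defer e.lock.Unlock()
	l, ok := e.ledgerMap[p]
	if !ok {
		l = &ledger{}
		e.ledgerMap[p] = l
	}
	l.lk.Lock()
	defer l.lk.Unlock()
	l.ref++
}

// PeerDisconnected drops one reference and removes the ledger from the
// map once no connections remain.
func (e *engine) PeerDisconnected(p string) {
	e.lock.Lock()
	defer e.lock.Unlock()
	l, ok := e.ledgerMap[p]
	if !ok {
		return
	}
	l.lk.Lock()
	defer l.lk.Unlock()
	l.ref--
	if l.ref <= 0 {
		delete(e.ledgerMap, p)
	}
}

// numLedgers is a hypothetical helper used only to observe the map size.
func (e *engine) numLedgers() int {
	e.lock.Lock()
	defer e.lock.Unlock()
	return len(e.ledgerMap)
}

func main() {
	e := newEngine()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			e.PeerConnected("peerA")
			e.PeerDisconnected("peerA")
		}()
	}
	wg.Wait()
	// Every connect was matched by a disconnect, so the map should be empty.
	fmt.Println(e.numLedgers())
}
```

The cost of this design is that all connect and disconnect events are serialized on the engine lock, which is acceptable as long as the critical sections stay this short.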

@Kubuxu

Member

Kubuxu commented Dec 9, 2016

So now it is thread-safe, but does the findOrCreate function still make sense now that we have introduced ref counting?

@Kubuxu

Kubuxu approved these changes Dec 9, 2016

@Kubuxu

Member

Kubuxu commented Dec 9, 2016

Also, I am still not a fan of those two locks, as some unrelated change can introduce a deadlock (taking the engine lock while holding some ledger lock) and we might not catch it when we introduce it. We should really look into actor-oriented communication and how bad or good it would be.
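For comparison, an actor-style variant might look like the sketch below. This is an assumption about what such a design could be, not something proposed in code in this thread: a single goroutine owns the reference counts, and callers talk to it over channels, so there are no mutexes to order. All type and method names other than PeerConnected and PeerDisconnected are hypothetical.

```go
package main

import "fmt"

// event reports a connect or disconnect for a peer.
type event struct {
	peer      string
	connected bool
}

// query asks the actor how many ledgers it currently tracks.
type query struct {
	reply chan int
}

type engine struct {
	events  chan event
	queries chan query
}

func newEngine() *engine {
	e := &engine{events: make(chan event), queries: make(chan query)}
	go e.run() // in this sketch the actor runs for the life of the process
	return e
}

// run is the only goroutine that touches refs, so no locking is needed.
func (e *engine) run() {
	refs := make(map[string]int)
	for {
		select {
		case ev := <-e.events:
			if ev.connected {
				refs[ev.peer]++
			} else if refs[ev.peer] > 0 {
				refs[ev.peer]--
				if refs[ev.peer] == 0 {
					delete(refs, ev.peer)
				}
			}
		case q := <-e.queries:
			q.reply <- len(refs)
		}
	}
}

func (e *engine) PeerConnected(p string)    { e.events <- event{peer: p, connected: true} }
func (e *engine) PeerDisconnected(p string) { e.events <- event{peer: p, connected: false} }

// numLedgers is a hypothetical helper that queries the actor.
func (e *engine) numLedgers() int {
	q := query{reply: make(chan int)}
	e.queries <- q
	return <-q.reply
}

func main() {
	e := newEngine()
	e.PeerConnected("peerA")
	e.PeerConnected("peerA")
	e.PeerDisconnected("peerA")
	fmt.Println(e.numLedgers())
	e.PeerDisconnected("peerA")
	fmt.Println(e.numLedgers())
}
```

The trade-off is that every operation becomes a channel round trip, and a blocked or slow actor stalls all callers, so this removes lock-ordering bugs rather than all liveness concerns.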

@Kubuxu

Member

Kubuxu commented Dec 9, 2016

I rebased it to run coverage on it.

@Kubuxu

Member

Kubuxu commented Dec 9, 2016

It isn't tested anywhere; it might be worth adding a test.

@whyrusleeping

Member

whyrusleeping commented Dec 9, 2016

@Kubuxu

Member

Kubuxu commented Dec 9, 2016

I am positive that we can make it clean and not so complicated with enough layers of sugarcoating.

I am just almost sure that we will introduce a deadlock around this place sooner or later, and it won't be diagnosed for a long time because reproducing it will be almost impossible.

Also, for someone to report a deadlock like this one, they would have to: 1. encounter the deadlock, 2. not try restarting the node, 3. capture a goroutine dump, and 4. have us find the goroutines blocked on this lock. I miss Java's features in this regard.
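For step 3, one way to capture a goroutine dump from inside a Go process is the standard `runtime/pprof` package. The sketch below is generic Go, not a go-ipfs feature; the helper name `goroutineDump` is illustrative.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// goroutineDump returns the stacks of all current goroutines as text.
// With debug=2 the output uses the same format as an unrecovered panic,
// including each goroutine's state (for example, waiting on a mutex).
func goroutineDump() string {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	fmt.Print(goroutineDump())
}
```

In a dump taken during a lock-ordering deadlock, the stuck goroutines would typically show up blocked in `sync.(*Mutex).Lock`, which is what step 4 relies on.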

This change LGTM if I get some tests. For features that are not directly covered by sharness tests, I would like the codecov/patch build check to be green.

@Kubuxu Kubuxu referenced this pull request Dec 14, 2016

Open

Mutex/Lock limit #3506

@Kubuxu

Member

Kubuxu commented Dec 19, 2016

OK, it shows as if there were no coverage because of the lack of cross-package coverage testing.

@lgierth

Member

lgierth commented Dec 20, 2016

Can I add the RFM label here? Let's continue the locking discussion in #3506.

whyrusleeping added some commits Nov 29, 2016

bitswap: clean up ledgers when disconnecting
License: MIT
Signed-off-by: Jeromy <why@ipfs.io>
test for partner removal
License: MIT
Signed-off-by: Jeromy <jeromyj@gmail.com>

@whyrusleeping whyrusleeping merged commit ec43fe4 into master May 20, 2017

6 of 8 checks passed

codeclimate: 1 new issue (1 fixed)
continuous-integration/travis-ci/pr: The Travis CI build is in progress
ci/circleci: Your tests passed on CircleCI!
codecov/patch: 90.47% of diff hit (target 63.74%)
codecov/project: 63.88% (+0.13%) compared to 8e2aed3
commit-message-check/gitcop: All commit messages are valid
continuous-integration/jenkins/pr-merge: This commit looks good
continuous-integration/travis-ci/push: The Travis CI build passed

@whyrusleeping whyrusleeping deleted the feat/bitswap-cleanup-ledger branch May 20, 2017