Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(connecteventmanager): block Connected() until accepted #435

Merged
merged 4 commits into from
Aug 17, 2023

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Aug 16, 2023

Fixes: #432

Minimal attempt at solving #432

@rvagg rvagg requested a review from a team as a code owner August 16, 2023 01:51
@welcome
Copy link

welcome bot commented Aug 16, 2023

Thank you for submitting this PR!
A maintainer will be here shortly to review it.
We are super grateful, but we are also overloaded! Help us by making sure that:

  • The context for this PR is clear, with relevant discussion, decisions
    and stakeholders linked/mentioned.

  • Your contribution itself is clear (code comments, self-review for the
    rest) and in its best form. Follow the code contribution
    guidelines

    if they apply.

Getting other community members to do a review would be great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.
Next steps:

  • A maintainer will triage and assign priority to this PR, commenting on
    any missing things and potentially assigning a reviewer for high
    priority items.

  • The PR gets reviews, discussed and approvals as needed.

  • The PR is merged by maintainers when it has been approved and comments addressed.

We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate priorities and status of your contribution.
We are very grateful for your contribution!

@codecov
Copy link

codecov bot commented Aug 16, 2023

Codecov Report

Merging #435 (22fd90d) into main (dd32d67) will increase coverage by 0.22%.
The diff coverage is 51.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #435      +/-   ##
==========================================
+ Coverage   49.80%   50.02%   +0.22%     
==========================================
  Files         249      249              
  Lines       29972    29990      +18     
==========================================
+ Hits        14928    15003      +75     
+ Misses      13615    13554      -61     
- Partials     1429     1433       +4     
Files Changed Coverage Δ
bitswap/network/connecteventmanager.go 50.99% <51.02%> (+50.99%) ⬆️

... and 7 files with indirect coverage changes


func (p *peerState) setPending() {
if !p.isPending() {
p.accepted = make(chan struct{})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like some help sanity checking this because I'm not entirely clear of the atomicity of some of these operations.

We rely on connectEventManager's mutex to do most things, but there is a call to waitAccept that is unsynchronised, it just does a <-p.accepted; that could be happening while we're doing this operation of setting accepted to a new channel—although we're only doing this if the channel is already closed so <-p.accepted should have been a quick noop; is that atomic though? Is there potential for a conflict between <-p.accepted where p.accepted is closed, and p.accepted = make(chan struct{})?

@rvagg
Copy link
Member Author

rvagg commented Aug 16, 2023

All good with our test suite that makes heavy use of bitswap and has been heavily flaky because of this problem: filecoin-project/lassie#383

Copy link
Contributor

@hannahhoward hannahhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree no per peer mutex since everything is already mutexed.

In your new simpler implementation, there is an edge case I'm not sure whether you should care about:

  • setState called, isPending set true, change created
  • before change processed, setState called again (to a same or different new state)
    • since isPending = true, no new change encoded, and waitNoop returned. however, the change has yet to be handled

Truthfully, the edge case where this causes downstream problems feels possibly non-existant. Just want to identify it.

The potential solution is to put handled back on the peerState, and return a func() that uses it whenever isPending = true -- but if you do so, I recommend you copy the reference out of the peerState before returning the wait func, so that any mutations to the channel in the peer state don't effect the callback. This can be done with a simple closure:

func makeWaitFunc(handled chan struct{}) waitFn {
   return func() {
      <-handled
   }
}

Also, are there any edge cases where handled never gets closed()? Like the queue shuts down for example?

@rvagg
Copy link
Member Author

rvagg commented Aug 16, 2023

Good feedback!

Here's what I've done:

  1. Handled the Stop() end condition by watching the c.done channel as well:
func (c *connectEventManager) makeWaitFunc(handled chan struct{}) waitFn {
	return func() {
		select {
		case <-handled:
		case <-c.done:
		}
	}
}
  1. Handled the pending case within the current structure with this branch that piggy-backs the existing change in the queue:
	} else if state.pending {
		// Find the change in the queue and return a wait function for it
		for _, change := range c.changeQueue {
			if change.pid == p {
				return c.makeWaitFunc(change.handled)
			}
		}
		log.Error("a peer was marked as change pending but not found in the change queue")
	}

Copy link
Contributor

@hannahhoward hannahhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM by me.

@rvagg
Copy link
Member Author

rvagg commented Aug 17, 2023

Added a simple test to lock this behaviour in - Connected() blocks until PeerConnected() returns.

@rvagg rvagg force-pushed the rvagg/bitswap-connect-race-fix branch from 11fb4a7 to 22fd90d Compare August 17, 2023 00:35
@Jorropo
Copy link
Contributor

Jorropo commented Aug 17, 2023

The connectEventManager is a very complex piece of code for what it does and it took tens of minutes to properly understand it properly.
It's role is to maintain an atomic state of connected / not connected and dispatch callbacks on state transitions. I've submitted #436 which does this using a map. Could any of you please take a look @rvagg or @hannahhoward please ?

@rvagg
Copy link
Member Author

rvagg commented Aug 17, 2023

To reiterate the discussion on Slack - I'm not opposed to your approach in #436, but ripping out the existing piece of infrastructure is more radical than attempting to fix it; at least for the purposes we're trying to achieve right now. Perhaps this could be a two-step thing. I see this as a Chesterton's Fence situation - we're all acknowledging that we don't fully appreciate why the connecteventmanager exists and what it's aiming to achieve, and in that case the more prudent approach, rather than just ripping it out, might be to approach it more incrementally ("prudence is to understand why the fence is there in the first place before you attempt to take it down").

There's a singular issue we're trying to fix, and that is the race condition that exists between when a client is Connected and when the client is notified that PeerConnected so it can perform proper per-client initialisation.

Ditching connecteventmanager possibly fixes that, and other problems that relate to it; but I wouldn't mind having a commit in the main branch that solves the one thing we're concerned about now that we can point to, even if later commits do more radical things; so we can choose to be incremental and opt-in to being more radical in our downstream use of this code.

Copy link
Contributor

@Jorropo Jorropo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, this is tricky code and I'm not sure about all the interactions but tests are passing.
Should get #436 merged within a month ideally.

@Jorropo Jorropo merged commit 1d2f5e5 into main Aug 17, 2023
14 checks passed
@Jorropo Jorropo deleted the rvagg/bitswap-connect-race-fix branch August 17, 2023 06:18
Jorropo added a commit that referenced this pull request Aug 17, 2023
This wasn't caught because the tests hadn't run due to the test.Flaky.

The test were testing exactly for the bug #435 fixed.
Jorropo added a commit that referenced this pull request Aug 21, 2023
Jorropo added a commit that referenced this pull request Aug 21, 2023
…435)" and tests

This reverts commit 7ec68c5.
This reverts commit 59a2bca.
This reverts commit 1d2f5e5.
Jorropo added a commit that referenced this pull request Aug 21, 2023
…435)" and tests

This reverts commit 7ec68c5.
This reverts commit 59a2bca.
This reverts commit 1d2f5e5.
Jorropo added a commit that referenced this pull request Aug 21, 2023
…435)" and tests

This reverts commit 7ec68c5.
This reverts commit 59a2bca.
This reverts commit 1d2f5e5.
hannahhoward pushed a commit that referenced this pull request Jan 25, 2024
* feat(connecteventmanager): block Connected() until accepted

Ref: #432

Minimal attempt at solving #432

* fix(connecteventmanager): less complex channel signalling

* fix(connecteventmanager): handle change queue edge cases and closure

* fix(connecteventmanager): add test to confirm sync Connected() call flow

changelog: put the 435 fix in the right version

fix(connecteventmanager): clean up tests for new synchronous flow
gammazero pushed a commit that referenced this pull request May 17, 2024
* feat(connecteventmanager): block Connected() until accepted

Ref: #432

Minimal attempt at solving #432

* fix(connecteventmanager): less complex channel signalling

* fix(connecteventmanager): handle change queue edge cases and closure

* fix(connecteventmanager): add test to confirm sync Connected() call flow

changelog: put the 435 fix in the right version

fix(connecteventmanager): clean up tests for new synchronous flow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Bitswap peer connection race
3 participants