fix: shutdown queue on publish error if not done #412

Merged: 2 commits into release/v0.13.x from fix/network-close on Mar 7, 2023

@jacobheun jacobheun commented Mar 7, 2023

Summary

This is a patch on top of 0.13.2. While debugging network disconnect issues in Boost during retrieval, we discovered a goroutine leak in graphsync. The issue is that after a hard network disconnect, the MessageQueue may still be restarted if there are messages the server is still attempting to send.

This change does not fully fix the issue, but after multiple runs of force-disconnecting 500 retrievals, we saw a 10x reduction in open goroutines. More robust handling of hard network disconnects may be warranted in the 0.14.x line, but that is likely a nontrivial effort. This gets us most of the way.

Goroutine dumps from boost

The below were goroutine diffs between startup and after executing 500 forceful disconnects.

Before: [image: goroutine diff]

After: [image: goroutine diff]

welcome bot commented Mar 7, 2023

Thank you for submitting this PR!
A maintainer will be here shortly to review it.
We are super grateful, but we are also overloaded! Help us by making sure that:

  • The context for this PR is clear, with relevant discussion, decisions
    and stakeholders linked/mentioned.

  • Your contribution itself is clear (code comments, self-review for the
    rest) and in its best form. Follow the code contribution guidelines
    if they apply.

Getting other community members to do a review would be a great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.
Next steps:

  • A maintainer will triage and assign priority to this PR, commenting on
    any missing things and potentially assigning a reviewer for high
    priority items.

  • The PR gets reviewed, discussed, and approved as needed.

  • The PR is merged by maintainers when it has been approved and comments addressed.

We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate priorities and status of your contribution.
We are very grateful for your contribution!

Collaborator

@hannahhoward hannahhoward left a comment

LGTM. Agree if publish error offers no benefit, let's not worry.

@hannahhoward hannahhoward merged commit 9561a73 into release/v0.13.x Mar 7, 2023
@rvagg rvagg deleted the fix/network-close branch March 8, 2023 00:07
gammazero added a commit that referenced this pull request Mar 28, 2023
rvagg pushed a commit that referenced this pull request Mar 29, 2023
hannahhoward added a commit that referenced this pull request Apr 7, 2023
gammazero added a commit that referenced this pull request Apr 13, 2023
* refactor(peermanager): add shutdown callback
* test(impl): add regression test
* More time for round trip test to complete

---------

Co-authored-by: gammazero <gammazero@users.noreply.github.com>