Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tracking issue] go-libp2p resource manager critical post release fixes #9442

Closed
8 of 9 tasks
BigLep opened this issue Dec 1, 2022 · 12 comments
Closed
8 of 9 tasks

Comments

@BigLep
Copy link
Contributor

BigLep commented Dec 1, 2022

This is the tracking issue for the streams of work that need to get corrected/fixed as result of the Kubo 0.17 release with libp2p resource manager enabled by default: #8761

Must complete before 0.18 RC

Theme 2: reports of default limits not working well for users

Theme 3: improve RM errors coming from other peers

This is related to reports of a disabled go-libp2p resource manager still managing resources. In fact this is an error message from a remote go-libp2p peer exceeding their own resource limits and when the message is printed on the local peer its hard to differentiate between the do.

Things we can do:

  1. Adjust messaging for resource manager on the local Kubo side so it's easier to differentiate between local and remote peer resource manager errors.
    • It looks like the message already is differentiable. Feel free to do more though.
  2. Adds some docs on how to differentiate and point to the go-libp2p issue for handling this better in go-libp2: quic: Error from peer should be explicit that it is coming from remote peer rather than a local error libp2p/go-libp2p#1928

Theme 4: confusion around "magic values"

This is about how "4611686018427388000" looks like a random number when it is actually our "infinity".

Things we can do:

  1. Allow -1 to denote infinity for the go-libp2p resource manager. Requested: Enable specifying infinite limits with a more "intuitive" magic number like -1 libp2p/go-libp2p#1935
  2. Document what Kubo's magic value is:
  3. Using another number that is more obvious a magic number like 999999999999999999

Options not on the table:

  1. ipfs swarm limits all to go from "4611686018427388000" to "infinity" on the output because that isn't valid JSON.
  2. Similarly we can't add a comment after the magic number because that isn't valid json and the go json parser we use doesn't allow parsing non-spec compliant things like comments.

Theme 5: be clearer on startup about what limits are being set and why

Add a log message like:

First computing default go-libp2p resource manager limits based on Swarm.ResourceManager.maxMemory of {Swarm.ResourceManager.maxMemory} and then applying any user-supplied overrides on top. Run ipfs swarm limits all to see the resulting limits.

Theme 6: Wrong tone about resource manager getting in the way (being a bug) vs. being a feature

UX angle about being clear that in general this is a feature not a bug:

Theme 7: clarity around the "error message" meaning

There isn't clarity around what "system: cannot reserve inbound connection: resource limit exceeded". means. For this example, it means Swarm.ResourceMgr.Limits.System.ConnsInbound is exceeded.

Things we can do:

  1. Provide docs on how to interpret this message.
  2. Reverse engineer the message based on https://github.com/libp2p/go-libp2p/blob/master/p2p/host/resource-manager/scope.go so we can map it back to Swarm.ResourceMgr.Limits.$scope.$limit. If we do that, we can then print what the limit value is.
    • 2022-12-06 maintainer conversation: not going to do this

Theme 8: Provide actionable advice when resource limits are hit

When a resource limit is hit, we point users to https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr. We could provide a better documentation path for how someone debugs this situation.

Theme 9: have the ConnMgr limits be set under Swarm.ResourceMgr.Limits.System.ConnsInbound

As discussed in #9468, this allows low priority idle connections to get cleaned up to make space for higher priority connections

Things we can do:

  1. Lower the default limits
  2. Log when ConnMgr limits aren'tunder Swarm.ResourceMgr.Limits.System.ConnsInbound

Theme 10: resource manager doing its job of protecting a node is alarming

This was broughtup throughout #9432, but generally users have found the resource manager "ERRORs" spammy. That they are printed as ERRORs also runs counter to our narrative about this being a feature. Should a feature doing its job be an error?

Things we can do:

Theme 11: fix bugs in the swarm stats command

Theme 12: remove additional footguns around (soft) ConnMgr and (hard) ResourceMgr limits and their interactions

Theme 13: clarify and improve handling of zeroes

Ideally completing before the 0.18 final release

Theme 1: usability issues in entering config

It's too easy for someone to enter invalid config and not get any feedback that they have done so.

Theme 11: Improve usability in the swarm stats command

@BigLep
Copy link
Contributor Author

BigLep commented Dec 2, 2022

@ajnavarro : are there any other critical followups with resource manager we should be tracking?

@BigLep
Copy link
Contributor Author

BigLep commented Dec 2, 2022

(I pasted a comment here that I meant to have in #9432. I moved it over: #9432 (comment) ).

@ajnavarro
Copy link
Member

@BigLep Adding libp2p/go-libp2p#1928 to the list

@ajnavarro
Copy link
Member

After fixing some problems on the routing system when using the parallel router, I'll continue with #9438

@BigLep
Copy link
Contributor Author

BigLep commented Dec 7, 2022

@ajnavarro : I updated the issue description with the latest info and started a PR with some dedicated libp2p resource management docs: #9468

@ajnavarro
Copy link
Member

A new one to the bucket:

@BigLep
Copy link
Contributor Author

BigLep commented Jan 14, 2023

Two more items identified to address (added to issue description):
#9545
#9549

@BigLep
Copy link
Contributor Author

BigLep commented Jan 21, 2023

@ajnavarro
Copy link
Member

Added some deeper possible solutions after talking with @MarcoPolo about how to improve RM integration on Kubo: #9580

@ShadowJonathan
Copy link

ShadowJonathan commented Jan 24, 2023

I just stumbled into theme 3 in #9432, I'll repeat what I said there:

The log entries could be made more clear if (remote) is turned into (from remote), else it's easy to confuse this as being related to remote something, not that it is coming from the remote node.

Wrapping it in quotes to show that it is "verbatim" would probably help as well, to deal with errors foreign to kubo. Printing the remote node version (if its known) would maybe also help debugging it further.

@BigLep
Copy link
Contributor Author

BigLep commented Feb 16, 2023

In the latest round of consolidating/organizing the resource manager/accountant work, I'm closing this in favor of #9650

@BigLep BigLep closed this as not planned Won't fix, can't repro, duplicate, stale Feb 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

No branches or pull requests

3 participants