
AcceleratedDHTClient needs more resources than the ResourceManager has by default #8945

Closed
3 tasks done
aschmahmann opened this issue May 5, 2022 · 8 comments · Fixed by #8980

@aschmahmann
Contributor

Checklist

Installation method

ipfs-update or dist.ipfs.io

Version

go-ipfs version: 0.13.0-rc1
Repo version: 12
System version: amd64/windows
Golang version: go1.18.1

Config

Abridged:
{
  "Experimental": {
    "AcceleratedDHTClient": true,
  },
  "Internal": {
    "Bitswap": {
      "EngineBlockstoreWorkerCount": null,
      "EngineTaskWorkerCount": null,
      "MaxOutstandingBytesPerPeer": null,
      "TaskWorkerCount": null
    }
  },
  "Routing": {
    "Type": "dhtclient"
  },
  "Swarm": {
    "AddrFilters": null,
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 100,
      "LowWater": 50,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": false,
    "RelayClient": {},
    "RelayService": {},
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

Running with the AcceleratedDHTClient enabled produces bursts of connections (e.g. on startup). With the resource manager enabled, this leaves the routing table very small (compared to running without the resource manager) and, as a result, not particularly functional.

Enabling debug logging on the resource manager shows that we're out of available outbound connections.

Some analysis here is still needed; however, AFAICT the reasons for this are:

  • We now have enforceable connection limits
  • The accelerated DHT client makes lots of connections in a short period of time. While it doesn't need to keep them open for very long, it can't safely close them in case they're in use (e.g. for Bitswap requests), so it has to rely on the host's connection garbage collection, which also means waiting for the grace period.

Some options here include:

  1. Increase the resource limits, particularly for outbound connections and some overall limits like total connections and file descriptors. Given that outbound connections are largely controlled by user behavior (e.g. turning on the AcceleratedDHTClient, making lots of API or gateway requests, etc.), this might not be so bad.
  2. Make the AcceleratedDHTClient consume fewer resources. Some examples include:
    • Reduce the (default) amount of concurrent work the AcceleratedDHTClient does when doing routing table initialization or maintenance
    • More aggressively return resources to the system. This could mean having the AcceleratedDHTClient call Trim() or manually close unprotected connections it recently used (a rough sketch follows this list).
    • Have go-libp2p give the application layer more control over how dialing is done (e.g. for expensive but not latency-critical operations, have the client only dial WAN peers and try different transports serially rather than in parallel)
  3. Emit louder errors to users of the AcceleratedDHTClient when insufficient resources are available to maintain the routing tables. At the extremes we could have hard-coded errors in go-ipfs telling users to bump their limits, or errors emitted from the AcceleratedDHTClient based on checking the error types received
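
To make the Trim() idea concrete, here is a rough sketch (the releaseCrawlConns helper and the "dht-crawl" protection tag are made up for illustration; this is not existing go-libp2p-kad-dht code):

package main

import (
    "context"

    "github.com/libp2p/go-libp2p/core/host"
    "github.com/libp2p/go-libp2p/core/peer"
)

// releaseCrawlConns hands connections back after a bulk routing-table crawl
// instead of waiting for the connection manager's grace period to expire.
func releaseCrawlConns(ctx context.Context, h host.Host, crawled []peer.ID) {
    for _, p := range crawled {
        // Drop the protection tag the crawler (hypothetically) added while the
        // peer was being queried; protected connections are never trimmed.
        h.ConnManager().Unprotect(p, "dht-crawl")
    }
    // Ask the connection manager to garbage-collect down to its low water mark
    // now, rather than on its next periodic trim.
    h.ConnManager().TrimOpenConns(ctx)
}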

Looking for feedback here on other ideas or preferences for how to handle this. My initial reaction is to go with increasing the limits and having the AcceleratedDHTClient log errors when its routing table won't update properly due to resource starvation. Without the logs, getting people to report issues properly will likely be difficult, and increasing the limits vs. reducing the resource consumption is something we can then more easily tune over time (i.e. lower the limits as we reduce resource consumption).

@aschmahmann aschmahmann added the kind/bug and need/triage labels May 5, 2022
@guseggert
Contributor

Your plan sounds reasonable to me, as long as the new limits are reasonable. Do you know what the new limits should be for a clean startup?

For 2), adding a rate limit on peer dials with a knob should be sufficient for folks to work around whatever resource limitations they have.
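
Roughly what I mean, as a minimal sketch (not the actual DHT dialing code; throttledDialer, dialFn, and dialsPerSecond are hypothetical names):

package main

import (
    "context"

    "golang.org/x/time/rate"
)

// throttledDialer wraps some underlying dial function and spreads dials out
// over time so a routing-table crawl doesn't exhaust the resource manager's
// outbound connection limit in one burst.
type throttledDialer struct {
    limiter *rate.Limiter
    dialFn  func(ctx context.Context, addr string) error
}

// newThrottledDialer exposes dialsPerSecond as the user-facing knob.
func newThrottledDialer(dialsPerSecond float64, dialFn func(ctx context.Context, addr string) error) *throttledDialer {
    return &throttledDialer{
        limiter: rate.NewLimiter(rate.Limit(dialsPerSecond), 10), // small burst allowance
        dialFn:  dialFn,
    }
}

func (d *throttledDialer) Dial(ctx context.Context, addr string) error {
    // Wait blocks until the limiter permits another dial or ctx is cancelled.
    if err := d.limiter.Wait(ctx); err != nil {
        return err
    }
    return d.dialFn(ctx, addr)
}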

Also libp2p/go-libp2p-kad-dht#254 might help, though not a short-term mitigation.

@BigLep BigLep mentioned this issue May 5, 2022
@BigLep BigLep added this to the go-ipfs 0.13 milestone May 5, 2022
@BigLep
Contributor

BigLep commented May 5, 2022

2022-05-05 conversation:

  1. We need errors to know when this situation occurs
  2. Let's understand how the limits were set.
  • For example, do we need a limit on outbound connections?
  • Capture how this was chosen

Current plan:

We're not tackling bigger items like:

  1. Being able to hand resources back to libp2p
  2. Optimizing resource consumption in common code paths

@BigLep
Contributor

BigLep commented May 17, 2022

2022-05-17 conversation:

Part 1: Conditions for when to print out an "accelerated DHT couldn't bootstrap" message

  1. Time-based, OR
  2. Number of dial failures (X) due to the resource manager
    Going with X = 1.

Part 2: messaging
"You may have a gimped accelerated DHT. Consider raising the limits. Read here for more info."

In the PR or here, let's set our outbound connection limits very low and then paste in what the logs look like.


We also need to confirm whether we can reproduce this locally and quickly. Let's paste the repro steps we took here.

@BigLep BigLep reopened this May 24, 2022
@BigLep
Contributor

BigLep commented May 24, 2022

Exit criteria:

ipfs stats dht | wc -l

Should be roughly equal with and without the resource manager enabled.
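
For concreteness, the comparison could look roughly like this (assuming the Swarm.ResourceMgr.Enabled flag from the config docs; exact key names may differ by version):

# with the resource manager enabled
ipfs config --json Swarm.ResourceMgr.Enabled true
ipfs daemon &                # give the accelerated client time to finish its crawl
ipfs stats dht | wc -l       # note the routing table size
ipfs shutdown

# without the resource manager
ipfs config --json Swarm.ResourceMgr.Enabled false
ipfs daemon &
ipfs stats dht | wc -l       # should be roughly the same as above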

@BigLep
Contributor

BigLep commented Jun 7, 2022

Resolving since we did adjust the limits.

There is a larger issue to address concerning experimental DHT startup here: libp2p/go-libp2p-kad-dht#770

@BigLep BigLep closed this as completed Jun 7, 2022
@ThomasFreedman

I just upgraded a very large repo from v7 to v12, which succeeded according to the info reported. However, I see error messages continually occurring like:

2022-11-26T10:08:04.655-0600    ERROR   resourcemanager libp2p/rcmgr_logging.go:57      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-26T10:08:14.656-0600    ERROR   resourcemanager libp2p/rcmgr_logging.go:53      Resource limits were exceeded 332 times with error "system: cannot reserve inbound connection: resource limit exceeded".

I can find no information on how the logs should be analyzed or what changes to make to the config file to stop these repetitive error messages. The link provided in the error message is of little help in resolving the errors.
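
The docs linked in the error point at the Swarm.ResourceMgr.Limits section; a guess at the relevant fragment for the scope named in the message (abridged, exact key names may differ by Kubo version, and it is unclear what values are reasonable) would be:

{
  "Swarm": {
    "ResourceMgr": {
      "Limits": {
        "System": {
          "ConnsInbound": 1000
        }
      }
    }
  }
}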

@BigLep BigLep reopened this Nov 27, 2022
@BigLep BigLep closed this as completed Nov 27, 2022
@BigLep
Contributor

BigLep commented Nov 27, 2022

@ThomasFreedman : you mentioned v12. Did you mean v0.17 (latest Kubo release)?

@ThomasFreedman

ThomasFreedman commented Nov 28, 2022

Yes, that's probably correct, since I was upgrading to that. I just copied what was printed in the output.
My apologies if I hijacked this topic. My issue is directly related.

I see there is already a bug filed for this error condition, and one person posted his solution which I will apply. Hopefully that will resolve the problem.
