
Runaway connections/resources #6297

Closed
dokterbob opened this issue May 5, 2019 · 9 comments

@dokterbob
Contributor

dokterbob commented May 5, 2019

Version information:

go-ipfs version: 0.4.20-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.4

Type: bug

Description:

Running a high-load node (the ipfs-search.com crawler), we've been seeing sudden, seemingly exponential growth in connection counts, paired with an equally sharp increase in memory usage (after which our memory limit kills go-ipfs at 12 GB(!), which is the sudden recovery in the graph).

[Screenshot: graph of connection count and memory usage, showing the spike and sudden recovery]

Notably, we're running with the connection manager configured for a maximum of 14400 connections.

Full config:

{
  "API": {
    "HTTPHeaders": null
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": [
    ],
    "Gateway": "/ip4/127.0.0.1/tcp/8080",
    "NoAnnounce": [
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001"
    ]
  },
  "Bootstrap": [
  ],
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "10GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false,
      "Interval": 10
    }
  },
  "Experimental": {
    "FilestoreEnabled": true,
    "Libp2pStreamMounting": false,
    "ShardingEnabled": false
  },
  "Gateway": {
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "PathPrefixes": [],
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "<private>",
    "PrivKey": "<nope>"
  },
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "SupernodeRouting": {
    "Servers": null
  },
  "Swarm": {
    "AddrFilters": [
    ],
    "DisableBandwidthMetrics": true,
    "DisableNatPortMap": true,
    "DisableRelay": false,
    "EnableRelayHop": false,
    "EnableAutoNATService": false
  },
  "ConnMgr": {
    "Type": "basic",
    "LowWater": 600,
    "HighWater": 14400,
    "GracePeriod": "20s"
  }
}
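
For reference, in the 0.4.x line these ConnMgr values end up configuring libp2p's basic connection manager. Below is a minimal standalone sketch of the equivalent settings, assuming the 2019-era go-libp2p-connmgr constructor NewConnManager(low, high, grace); it is illustrative only, not go-ipfs's wiring code.

package main

import (
	"fmt"
	"time"

	connmgr "github.com/libp2p/go-libp2p-connmgr"
)

func main() {
	// Equivalent of the ConnMgr section above: once the peer count passes
	// HighWater (14400), trim back towards LowWater (600), but never close
	// connections younger than the GracePeriod (20s).
	cm := connmgr.NewConnManager(600, 14400, 20*time.Second)
	fmt.Printf("connection manager configured: %T\n", cm)
}
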
@dokterbob
Contributor Author

Possibly related: #6237

@5685C4A059D5

I'm having a similar issue on 0.4.19.
I have HighWater set to 500, but overnight the peer count went over 8000.
Probably the only thing stopping it from going higher is the limit on open files (8192).
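
As an aside, the open-file ceiling can be inspected, and the soft limit raised towards the hard limit, from inside the process itself. A minimal Linux-only sketch using the standard syscall package (this is not go-ipfs's actual ulimit handling):

package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", rlim.Cur, rlim.Max)

	// Raise the soft limit to the hard limit so the daemon can hold more
	// sockets before dials start failing with "too many open files".
	rlim.Cur = rlim.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		panic(err)
	}
}
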

@sanderpick
Contributor

I'm seeing the same connection runaways. Attaching some profiles from a machine that had roughly 12k peers in its swarm (HeapSys near 3 GB) and ~70,000 goroutines. This is with EnableRelayHop off.

A machine will be totally fine until it starts taking on tons of connections, as if it's been discovered by a rogue swarm (too dramatic?). Once that happens, it's just a matter of time before it goes down with an OOM kill.

Version: 5fd5d44

textile-profile-ip-172-31-14-164.us-east-2.compute.internal-2019-05-06T02_28_04+0000.tar.gz

[Screenshot: Screen Shot 2019-05-06 at 7.23.02 PM]

@whyrusleeping
Member

Aside from the random connection spike being really bizarre (we should investigate this), we should probably start looking at hard limits on connections. We've avoided doing this until now because it gets really messy when reasoning through DoS protection, but I think having a true maximum above the high water mark makes good sense. Some open questions include (see the sketch after this list):

  • do we disallow outbound dialing over the hard limit?
  • when we hit the hard limit, do we trigger any behavior?
  • grace period?
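
For illustration only, here is a minimal sketch of the kind of policy these questions are about. The hard limit, the chosen numbers, and the function name are all made up for the sketch; none of this comes from go-ipfs.

package main

import "fmt"

// HighWater matches the config above; hardLimit is invented for the sketch.
const (
	highWater = 14400 // connection manager starts trimming here
	hardLimit = 16000 // hypothetical absolute ceiling
)

// allowNewConn is one possible answer to the questions above: between
// HighWater and the hard limit, refuse inbound connections but still allow
// our own outbound dials; above the hard limit, refuse everything.
func allowNewConn(current int, outbound bool) bool {
	switch {
	case current >= hardLimit:
		return false
	case current >= highWater && !outbound:
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(allowNewConn(15000, true))  // true: over HighWater, but an outbound dial
	fmt.Println(allowNewConn(15000, false)) // false: inbound refused above HighWater
	fmt.Println(allowNewConn(16500, true))  // false: over the hard limit
}
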

@dokterbob
Contributor Author

I've argued before and I'll argue again that we should have hard internal resource constraints. It's really bad design for a daemon to swallow unbounded system memory until the OOM killer steps in; it should listen to the OS's signals and/or do its own internal resource management, particularly for something as resource-hungry as IPFS.

Also note that this event takes place on timescales over which the connection manager really should have kicked in. With a grace period of 20 seconds it should have started killing connections much earlier, so there's something freaky going on there.
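
For illustration, a minimal sketch of the sort of internal constraint meant here, assuming the 2019-era go-libp2p-connmgr API (NewConnManager and TrimOpenConns do exist there); the watchdog loop and the memory budget are made up and are not part of go-ipfs.

package main

import (
	"context"
	"runtime"
	"time"

	connmgr "github.com/libp2p/go-libp2p-connmgr"
)

// watchMemory is a hypothetical watchdog: when the Go heap crosses the
// budget, ask the connection manager to trim back towards LowWater rather
// than waiting for the kernel's OOM killer.
func watchMemory(ctx context.Context, cm *connmgr.BasicConnMgr, budget uint64) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	var ms runtime.MemStats
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			runtime.ReadMemStats(&ms)
			if ms.HeapSys > budget {
				cm.TrimOpenConns(ctx) // existing BasicConnMgr method; trims towards LowWater
			}
		}
	}
}

func main() {
	cm := connmgr.NewConnManager(600, 14400, 20*time.Second)
	go watchMemory(context.Background(), cm, 8<<30) // 8 GiB budget, made up
	select {}
}
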

@vyzo
Contributor

vyzo commented May 7, 2019

Have you looked at kernel memory through smem?
We have observed sudden spikes of kernel memory in a development relay that kill the process, even when userspace memory usage is stable.

@dokterbob
Contributor Author

dokterbob commented May 7, 2019

The memory graph above covers strictly the IPFS daemon's systemd slice. I'm not sure whether that includes kernel allocations, but it seems like it shouldn't (perhaps I'm wrong though). Regardless, we shouldn't be seeing this many sockets.

momack2 added this to Inbox in ipfs/go-ipfs on May 9, 2019
@Stebalien
Member

This is definitely #6237. I'm still working on a fix.

@obo20

obo20 commented May 17, 2019

@dokterbob Out of curiosity, what machine specs are you running that allow you to connect to around 14400 peers at once?
