
Runaway connections/resources #6297

Closed
dokterbob opened this issue May 5, 2019 · 9 comments

@dokterbob
Contributor

dokterbob commented May 5, 2019

Version information:

go-ipfs version: 0.4.20-
Repo version: 7
System version: amd64/linux
Golang version: go1.12.4

Type: bug

Description:

Running a high-load node (the ipfs-search.com crawler), we've been seeing sudden, seemingly exponential growth in connection counts, paired with an equally sharp increase in memory usage (after which our memory limit kills go-ipfs at 12 GB(!), which is the sudden recovery in the graph).

[Screenshot: graph of connection count and memory usage, showing the spike and sudden recovery]

Notably, we're running with the connection manager configured for a maximum of 14400 connections.

Full config:

{
  "API": {
    "HTTPHeaders": null
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": [
    ],
    "Gateway": "/ip4/127.0.0.1/tcp/8080",
    "NoAnnounce": [
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001"
    ]
  },
  "Bootstrap": [
  ],
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "10GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false,
      "Interval": 10
    }
  },
  "Experimental": {
    "FilestoreEnabled": true,
    "Libp2pStreamMounting": false,
    "ShardingEnabled": false
  },
  "Gateway": {
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "PathPrefixes": [],
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "<private>",
    "PrivKey": "<nope>"
  },
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "SupernodeRouting": {
    "Servers": null
  },
  "Swarm": {
    "AddrFilters": [
    ],
    "DisableBandwidthMetrics": true,
    "DisableNatPortMap": true,
    "DisableRelay": false,
    "EnableRelayHop": false,
    "EnableAutoNATService": false
  },
  "ConnMgr": {
    "Type": "basic",
    "LowWater": 600,
    "HighWater": 14400,
    "GracePeriod": "20s"
  }
}
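
For reference, in the 0.4.x line these ConnMgr values end up configuring libp2p's basic connection manager. Below is a minimal standalone sketch of the equivalent settings, assuming the 2019-era go-libp2p-connmgr constructor NewConnManager(low, high, grace); it is illustrative only, not go-ipfs's wiring code.

package main

import (
	"fmt"
	"time"

	connmgr "github.com/libp2p/go-libp2p-connmgr"
)

func main() {
	// Equivalent of the ConnMgr section above: once the peer count passes
	// HighWater (14400), trim back towards LowWater (600), but never close
	// connections younger than the GracePeriod (20s).
	cm := connmgr.NewConnManager(600, 14400, 20*time.Second)
	fmt.Printf("connection manager configured: %T\n", cm)
}
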
@dokterbob
Contributor Author

Possibly related: #6237

@5685C4A059D5

I'm having a similar issue on 0.4.19.
I have HighWater set to 500, but overnight the peer count went over 8000.
Probably the only thing stopping it from going higher is the limit on open files (8192).
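
As an aside, the open-file ceiling can be inspected, and the soft limit raised towards the hard limit, from inside the process itself. A minimal Linux-only sketch using the standard syscall package (this is not go-ipfs's actual ulimit handling):

package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", rlim.Cur, rlim.Max)

	// Raise the soft limit to the hard limit so the daemon can hold more
	// sockets before dials start failing with "too many open files".
	rlim.Cur = rlim.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		panic(err)
	}
}
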

@sanderpick
Contributor

I'm seeing the same connection runaways. Attaching some profiles from a machine that had roughly 12k peers in its swarm (HeapSys near 3 GB) and ~70,000 goroutines. This is with EnableRelayHop off.

A machine will be totally fine until it starts taking on tons of connections, as if it's been discovered by a rogue swarm (too dramatic?). Once that happens, it's just a matter of time before it goes down with an OOM kill.

Version: 5fd5d44

textile-profile-ip-172-31-14-164.us-east-2.compute.internal-2019-05-06T02_28_04+0000.tar.gz

[Screenshot: Screen Shot 2019-05-06 at 7.23.02 PM]

@whyrusleeping
Member

Aside from the random connection spike being really bizarre (we should investigate this), we should probably start looking at hard limits on connections. We've avoided doing this until now because it gets really messy when reasoning through DoS protection, but I think having a true maximum above the high water mark makes good sense. Some open questions include (see the sketch after this list):

  • do we disallow outbound dialing over the hard limit?
  • when we hit the hard limit, do we trigger any behavior?
  • grace period?
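
For illustration only, here is a minimal sketch of the kind of policy these questions are about. The hard limit, the chosen numbers, and the function name are all made up for the sketch; none of this comes from go-ipfs.

package main

import "fmt"

// HighWater matches the config above; hardLimit is invented for the sketch.
const (
	highWater = 14400 // connection manager starts trimming here
	hardLimit = 16000 // hypothetical absolute ceiling
)

// allowNewConn is one possible answer to the questions above: between
// HighWater and the hard limit, refuse inbound connections but still allow
// our own outbound dials; above the hard limit, refuse everything.
func allowNewConn(current int, outbound bool) bool {
	switch {
	case current >= hardLimit:
		return false
	case current >= highWater && !outbound:
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(allowNewConn(15000, true))  // true: over HighWater, but an outbound dial
	fmt.Println(allowNewConn(15000, false)) // false: inbound refused above HighWater
	fmt.Println(allowNewConn(16500, true))  // false: over the hard limit
}
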

@dokterbob
Contributor Author

I've argued before and I'll argue again that we should have hard internal resource constraints. It's really bad design for a daemon to swallow unbounded system memory until the OOM killer steps in; it should listen to the OS's signals and/or do its own internal resource management, particularly for something as resource-hungry as IPFS.

Also note that this event takes place on timescales over which the connection manager really should have kicked in. With a grace period of 20 seconds it should have started killing connections much earlier, so there's something freaky going on there.
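
For illustration, a minimal sketch of the sort of internal constraint meant here, assuming the 2019-era go-libp2p-connmgr API (NewConnManager and TrimOpenConns do exist there); the watchdog loop and the memory budget are made up and are not part of go-ipfs.

package main

import (
	"context"
	"runtime"
	"time"

	connmgr "github.com/libp2p/go-libp2p-connmgr"
)

// watchMemory is a hypothetical watchdog: when the Go heap crosses the
// budget, ask the connection manager to trim back towards LowWater rather
// than waiting for the kernel's OOM killer.
func watchMemory(ctx context.Context, cm *connmgr.BasicConnMgr, budget uint64) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	var ms runtime.MemStats
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			runtime.ReadMemStats(&ms)
			if ms.HeapSys > budget {
				cm.TrimOpenConns(ctx) // existing BasicConnMgr method; trims towards LowWater
			}
		}
	}
}

func main() {
	cm := connmgr.NewConnManager(600, 14400, 20*time.Second)
	go watchMemory(context.Background(), cm, 8<<30) // 8 GiB budget, made up
	select {}
}
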

@vyzo
Contributor

vyzo commented May 7, 2019

Have you looked at kernel memory through smem?
We have observed sudden spikes of kernel memory in a development relay that kill the process, even when userspace memory usage is stable.

@dokterbob
Contributor Author

dokterbob commented May 7, 2019

The memory graph above covers strictly the IPFS daemon's systemd slice. I'm not sure whether that includes kernel allocations, but it seems like it shouldn't (perhaps I'm wrong though). Regardless, we shouldn't be seeing this many sockets.

momack2 added this to Inbox in ipfs/go-ipfs on May 9, 2019
@Stebalien
Member

This is definitely #6237. I'm still working on a fix.

@obo20

obo20 commented May 17, 2019

@dokterbob Out of curiosity, what machine specs are you running that allow you to connect to around 14400 peers at once?
