
ipfs daemon memory usage grows over time: killed by OOM after 10~12 days running #3532

Closed
hsanjuan opened this issue Dec 21, 2016 · 41 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@hsanjuan
Contributor

Version information:

go-ipfs version: 0.4.5-dev-
Repo version: 4
System version: arm/linux
Golang version: go1.7

Type: Problem

Priority: P4

Description:

I have some Raspberry Pi 3s running the go-ipfs daemon. Right now they don't do anything: they don't serve any IPFS requests; they just sit there running the daemon. After about 10 days, ipfs gets killed on all of them because it is using too much memory.

The daemons are killed at around RSS=783192.
My longest-running daemon (11 days) has RSS=605868.
A newly started daemon has RSS=92020.
A daemon that has been running for one day has RSS=542408.

Questions:

  • What causes memory usage to grow steadily even though the daemons receive no traffic and do nothing but run?
  • Is there a way to limit it?
  • Do we need to gather more information on this? If so, what's the best way and how can I help?

Related: #3318 and the question about running IPFS on platforms with limited resources.

@jonnycrunch
Member

jonnycrunch commented Dec 29, 2016

Same here:

ipfs version 0.4.3
Ubuntu 16.04 (4.4.0-47-generic)
Go 1.7

After about 10 days, memory grows to about 15 GB despite only a few hundred files being pinned. The issue is reproduced across 10 servers. Restarting the daemon frees the memory, but usage continues to grow and the daemon needs to be restarted again.

UPDATE: Aha! I found the enable-garbage-collection flag in the documentation, so I'm trying:

ipfs daemon --enable-gc

@whyrusleeping
Member

@jonnycrunch the --enable-gc flag refers to disk gc, not memory gc.

The memory leak is coming from somewhere else... Next time the memory gets out of hand, can you get me the debug info described here: https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning

In particular the heap profile, the goroutine dump, and the ipfs binary.
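
The linked guide gathers these with curl; purely as a hedged convenience, here is a small Go sketch that fetches the same profiles, assuming the daemon's API listens on the default 127.0.0.1:5001 and exposes the standard net/http/pprof handlers under /debug/pprof/ as the guide describes (paths may differ between go-ipfs versions).

```go
// Hedged sketch: fetch the profiles requested above, assuming the daemon's
// API is on the default 127.0.0.1:5001 and serves net/http/pprof endpoints
// under /debug/pprof/ (see the linked debug guide; adjust paths if needed).
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func fetch(url, out string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(out)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	base := "http://127.0.0.1:5001/debug/pprof"
	targets := map[string]string{
		base + "/heap":              "ipfs.heap",    // heap profile
		base + "/goroutine?debug=2": "ipfs.stacks",  // full goroutine dump
		base + "/profile":           "ipfs.cpuprof", // 30s CPU profile
	}
	for url, out := range targets {
		if err := fetch(url, out); err != nil {
			fmt.Fprintf(os.Stderr, "failed to fetch %s: %v\n", url, err)
		}
	}
}
```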

@koriaf

koriaf commented Jan 7, 2017

Hi! We are using ipfs 0.4.4 on Linux 4.4.35-33.55.amzn1.x86_64 #1 SMP Tue Dec 6 20:30:04 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Currently it eats 65-76% of memory on a 2 GB instance. The OOM killer sometimes kills it; it then restarts and usage grows back to that level over several hours. That seems to be just low enough for the daemon not to be killed - maybe it has some smart way of figuring out how not to be killed :-) While experimenting with memory limits I saw that usage grows to fill all available memory but not more (no swap is used for IPFS, but other applications may then have trouble getting memory).

ipfs/QmaB2FJr1Z6yGRy9G37aXsBirR43Lc9ya3Q29R4gMYDVDv - dumps. Should I recreate my node after sharing these files? Is there a chance they contain any private keys or other data? The node is disposable and contains no private files yet, but it may in the future.

Also, I noticed that after running the disk GC (ipfs repo gc) memory decreased from 70% to 65%, but after adding this debug directory it's back to 75% of total host memory.

I have no idea how Go works internally, so if you need more debug info or this isn't helpful, please feel free to ask for more details.

Also, I have an ipfs node running on a 512 MB DigitalOcean instance, managed by supervisord. The OOM killer kills it there pretty fast (within several hours), supervisord restarts it, it dies again, and so on, but generally it works okay.

@come-maiz
Contributor

Carla Sella, from the Ubuntu community, reports that with ipfs v0.4.4 her VirtualBox VM starts to get slow after it connects to more than 70 peers. Here are her debugging files.
ipfs.tar.gz

@jonnycrunch
Member

Maybe it is time for Garbage Collection to be enabled by default? @whyrusleeping @RichardLitt @diasdavid

@Kubuxu
Member

Kubuxu commented Jan 16, 2017

@jonnycrunch as @whyrusleeping said, the --enable-gc flag controls datastore garbage collection, not the program's garbage collection.

The core problem is what we call "connection closing": IPFS currently connects to almost everyone, which, combined with the muxer implementation we are currently using, takes a lot of memory. We are working on reducing it, but it might take a while. Connection closing is a much harder problem than we initially expected.

The --enable-gc flag shouldn't matter much; it might reduce memory usage a bit, but it isn't the core problem as far as I know.

@hsanjuan
Contributor Author

hsanjuan commented Feb 1, 2017

This is the debugging information I have collected from 1 node that was still running (2 have died):

https://ipfs.io/ipfs/QmXnYzZT1EAq9pzi6snd6KHD8kNrBSDuyJqLPe7QHzUE23

It was also using 150% CPU when I checked it and >80% MEM. They are still on 0.4.5-pre1 though.

@bdimych

bdimych commented May 14, 2017

Stack dump from #61.
This is a VPS running 64-bit CentOS 7 with 1 GB of memory;
the ipfs daemon crashed 5 days after starting:
ipfs-crash-May-07-grep-ipfs-var-log-messages.zip

ipfs package: go-ipfs_v0.4.8_linux-amd64.tar.gz

@whyrusleeping
Member

Hey everyone, ipfs 0.4.11 should have some significant improvements here. The issue is not entirely resolved, but the leak should be mitigated.

@whyrusleeping whyrusleeping added P0 Critical: Tackled by core team ASAP status/ready Ready to be worked labels Oct 17, 2017
@maznu

maznu commented Dec 25, 2017

Still leaking memory in 0.4.13 — killed after ~12 hours.

@Stebalien
Member

At the moment, the largest issue is the peerstore. We had a rather nasty bug that will be fixed in the next release (we, uh, kind of didn't forget any address of any peer to which we had ever connected and, worse, advertised these (sometimes ephemeral) addresses to the network...).

@victorb
Member

victorb commented Jan 28, 2018

@Stebalien

that will be fixed in the next release

Does that mean that the fix is already in master or is work in progress?

@Stebalien
Member

Stebalien commented Jan 28, 2018 via email

@paralin

paralin commented Mar 29, 2018

I profiled it, and it seems like a lot of the CPU waste is, surprisingly, in AddAddrs in the AddrManager. Reading that code, it seems very hastily written and not performance-minded. I'll PR something to go-libp2p-peerstore to optimize that with concurrent maps, which should help.

@Stebalien
Member

I'll PR something to go-libp2p-peerstore to optimize that with concurrent maps, which should help.

Unfortunately, the issue is libp2p/go-libp2p-peerstore#26 and the fact that the number of multiaddrs assigned to a peer can grow unchecked*. The peerstore actually works fine with a sane number of addresses.

*The previous version of go-ipfs failed to forget observed multiaddrs for peers and, worse, would gossip these observed multiaddrs. That, combined with NATs and ephemeral ports, led to a build-up of addresses for some peers.

The solution to this is really to sign peer address records (we should be doing this anyway), enforce a maximum number of addresses, and require that there be only one valid peer address record per peer.
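
As a purely illustrative sketch of the "enforce a maximum number of addresses" part (this is not the go-libp2p-peerstore code; the type, the cap value, and the string-based addresses are made up for brevity):

```go
// Illustrative sketch only, not the go-libp2p-peerstore implementation:
// cap how many multiaddrs we remember per peer so gossiped ephemeral
// addresses cannot grow the address book without bound.
package main

import (
	"fmt"
	"sync"
)

const maxAddrsPerPeer = 16 // hypothetical cap, not a real go-ipfs setting

type addrBook struct {
	mu    sync.Mutex
	addrs map[string][]string // peer ID -> multiaddrs (strings for brevity)
}

func newAddrBook() *addrBook {
	return &addrBook{addrs: make(map[string][]string)}
}

// AddAddrs records new addresses for a peer, skipping duplicates and
// silently dropping anything beyond the per-peer cap.
func (ab *addrBook) AddAddrs(peer string, newAddrs []string) {
	ab.mu.Lock()
	defer ab.mu.Unlock()

	have := ab.addrs[peer]
	seen := make(map[string]bool, len(have))
	for _, a := range have {
		seen[a] = true
	}
	for _, a := range newAddrs {
		if len(have) >= maxAddrsPerPeer {
			break // at capacity: drop the rest
		}
		if !seen[a] {
			have = append(have, a)
			seen[a] = true
		}
	}
	ab.addrs[peer] = have
}

func main() {
	ab := newAddrBook()
	// Simulate a flood of gossiped ephemeral addresses for one peer.
	for i := 0; i < 1000; i++ {
		ab.AddAddrs("QmPeer", []string{fmt.Sprintf("/ip4/1.2.3.4/tcp/%d", 10000+i)})
	}
	fmt.Println(len(ab.addrs["QmPeer"])) // prints 16, not 1000
}
```

Capping at insert time keeps the worst case bounded regardless of how many ephemeral addresses other peers gossip.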

@paralin

paralin commented Mar 29, 2018

Yeah, but that code is still unoptimized and in general really rough, even for a small number of addresses. Agreed that there is a bigger underlying cause, though, as you describe.

@Stebalien Stebalien added status/deferred Conscious decision to pause or backlog and removed status/ready Ready to be worked labels Dec 18, 2018
@maznu

maznu commented Feb 16, 2019

Still leaking memory in 0.4.18, at between 0 and 100 kB/s (averaging somewhere around 10 kB/s).

@whyrusleeping
Member

@maznu are you sure it's leaking memory? Go is a garbage-collected language, which means memory usage will appear to increase until a GC event. After a GC event, memory doesn't necessarily get released back to the OS, but internally the previously allocated memory will be reused.

How are you measuring this?

@EugeneChung

Still leaking memory in 0.4.18, between 0-100kB/sec (averaging at a rate of somewhere around 10kB/sec).

https://golangcode.com/print-the-current-memory-usage/

Using this periodically, you can gather memory usage over several days. With a graphing tool like Microsoft Excel, you can then check the trend in memory usage.
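
The linked page samples runtime.MemStats from inside a Go program; a hedged sketch of that approach is below. Note this only helps if you are building your own Go binary or patching the daemon; for a stock ipfs daemon, use the pprof endpoints or OS tools (ps, top) instead.

```go
// Hedged sketch of the approach the linked page describes: periodically
// sample runtime.MemStats and log it in a form that is easy to graph.
// It only works inside a Go program you build yourself.
package main

import (
	"log"
	"runtime"
	"time"
)

func logMemUsage() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// HeapAlloc: live heap; Sys: memory obtained from the OS; NumGC: GC cycles.
	log.Printf("HeapAlloc=%d MiB, Sys=%d MiB, NumGC=%d",
		m.HeapAlloc/1024/1024, m.Sys/1024/1024, m.NumGC)
}

func main() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		logMemUsage()
	}
}
```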

@maznu

maznu commented Feb 23, 2019

Several days? It's eating up all the RAM on a 1 GB VPS (and then being killed by the kernel OOM killer) within eight hours.

[Screenshot: memory usage graph, 2019-02-23 07:51]

You can see there that there is garbage collection and freeing back to the OS — plenty of green spikes within that orange lump of usage — but fundamentally it just continues to grow.

@paralin

paralin commented Feb 23, 2019

Can someone with bad memory usage please grab a memory trace?

@alexkursell

Can someone with bad memory usage please grab a memory trace?

I am experiencing this issue using go-ipfs 0.4.19:
https://ipfs.io/ipfs/QmSkYDJV1BJeLm2uEBqnshcmBRb1LMPPPxdBsUrGDNGv8J

For me it takes ~2 days for the daemon to exhaust 1GB of memory and get OOM killed.

@Stebalien
Member

@alexkursell I'm only seeing ~30MiB of memory usage on the heap. Unfortunately, I can't seem to download the goroutine stack traces.

When you grabbed that memory dump, how much memory was go-ipfs using at that point in time?

@whyrusleeping
Member

The biggest problem I'm seeing with memory usage lately isn't that ipfs always uses a lot of memory, it's that it randomly spikes to a lot of memory, and Go will pretty much never release that memory back to the OS.

To debug this further, I would put a memory limit on the ipfs process (say, 1 GB) so that it panics when the memory spikes, and we can then figure out what the problem is.
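
The limit described here would normally be imposed from outside the process (e.g. ulimit or a cgroup), so the Go runtime aborts with stack traces when the spike happens. Purely for illustration, and assuming you could rebuild the daemon, the hypothetical watchdog below approximates the same idea in-process; the 1 GiB threshold and 10-second interval are arbitrary.

```go
// Illustrative sketch, not part of go-ipfs: a watchdog goroutine that
// crashes the process with a full goroutine dump once the runtime's
// OS-visible memory (MemStats.Sys) crosses a threshold, approximating
// the "limit it and let it crash" debugging approach described above.
package main

import (
	"fmt"
	"os"
	"runtime"
	"time"
)

const memLimitBytes = 1 << 30 // 1 GiB, arbitrary threshold for illustration

func memoryWatchdog() {
	for range time.Tick(10 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		if m.Sys > memLimitBytes {
			// Dump every goroutine's stack before crashing so the
			// spike can be attributed to whatever allocated it.
			buf := make([]byte, 1<<20)
			n := runtime.Stack(buf, true)
			os.Stderr.Write(buf[:n])
			panic(fmt.Sprintf("memory watchdog: Sys=%d bytes exceeds limit", m.Sys))
		}
	}
}

func main() {
	go memoryWatchdog()
	select {} // stand-in for the long-running daemon
}
```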

@alexkursell

@Stebalien I've grabbed a new set of diagnostics, along with the output of top: https://ipfs.io/ipfs/QmVB4s9Eu1XYxbikuzQix6SGUoDtqS46oyPJFanWLRMwV5 At the time this was taken, it looks like the daemon was using around 750 MB.

@marrub--

I was able to run an ipfs node just fine for a while, but it has started taxing my server so much that it's impossible to keep using. It would be fine even if it used a gigabyte, but it keeps eating more and more memory until the server simply crashes.

@Stebalien
Member

@alexkursell

Go is "only" using about 300MiB of heap memory so it looks like memory usage spiked at some point and go never returned the memory.

The largest actual memory users appear to be:

@kaysond

kaysond commented Apr 24, 2019

+1. I just set up a node on an Ubuntu 19.04 vps, and it died after about a day. I'll try the latest master and see if that fixes it.

@whyrusleeping
Member

@kaysond (and others) when your nodes die due to running out of memory, can you please send us the stack traces? It will help us track down what's causing the memory spikes.

@kaysond

kaysond commented Apr 24, 2019

I built from the latest source, and it seems to have grown steadily and then leveled off at around 600 MB overnight.
[Graph: memory usage over time, leveling off around 600 MB]

@kaysond

kaysond commented Apr 30, 2019

@whyrusleeping after a few days it looks like it settled at a solid 1 GB of RAM. I've attached all the dumps per the debug guide:
memdebug.tar.gz

@Stebalien
Member

@kaysond

It looks like that memory is:

  1. The peerstore (fix in #6080, "migrate to datastore-backed peerstore").
  2. Bandwidth metric tracking. Unfortunately, we never forget old peers. You can disable bandwidth tracking by setting ipfs config --json "Swarm.DisableBandwidthMetrics true".

@kaysond

kaysond commented Apr 30, 2019

@Stebalien thanks. I'll add that to my config and see how much it helps. Is there a plan to implement said "forgetting"?

@Stebalien
Member

@kaysond not yet but it looks like we'll have to do that at some point. I've never seen that show up in a heap trace. You must have connected to ~0.5M (estimated) unique peers over the course of a few days.

I've filed an issue (https://github.com/libp2p/go-libp2p-metrics/issues/17) but it's unlikely to be a priority given that most systems connecting to that many peers have quite a bit of memory (unless that was entirely DHT traffic...).

That brings up a good point. If you're memory-constrained, try running the daemon with --routing=dhtclient.

@kaysond

kaysond commented Apr 30, 2019

I set up a node mainly to serve a single website from ipfs, so the less memory it uses the cheaper my VPS can be.

I'm skeptical that the site draws that much traffic... so I guess it's just the nature of being connected to the swarm? The node isn't exactly a public gateway, so I'm not sure what caused all of the connections.

I'll try it with that option and see what happens.

@Stebalien
Member

The node isn't exactly a public gateway, so I'm not sure what caused all of the connections.

Probably the DHT.

@mkg20001

mkg20001 commented Jun 2, 2019

Any updates on this?

@mkg20001

mkg20001 commented Jun 2, 2019

Btw, the command to disable bandwidth metrics doesn't work anymore; the new one is ipfs config --bool Swarm.DisableBandwidthMetrics true

Is it even needed anymore?

@kaysond

kaysond commented Jun 2, 2019

With the command /usr/local/bin/ipfs daemon --enable-gc --routing=dhtclient, after several weeks my node has settled at around 500 MB of RAM.

@mkg20001

mkg20001 commented Jun 2, 2019

@kaysond Used that command. This + Swarm.DisableBandwidthMetrics works, thx

@jacobheun jacobheun added this to Backlog in Go IPFS Roadmap Sep 15, 2020
@jacobheun jacobheun removed this from Backlog in Go IPFS Roadmap Sep 15, 2020
@Stebalien Stebalien added kind/bug A bug in existing code (including security flaws) and removed P0 Critical: Tackled by core team ASAP status/deferred Conscious decision to pause or backlog labels Apr 22, 2021
@Stebalien
Member

The remaining issue is #2848. Closing this one as it's quite old.
