Leaks memory #50

Open
mmimica opened this issue Dec 19, 2014 · 28 comments

@mmimica commented Dec 19, 2014

The relay seems to be leaking memory. It consumes all 2 GB of RAM we gave it until the OOM killer kicks in.
The routing rules are pretty simple:

init = [
     "addBlack .net",
     "addRoute sendAllMatch core prefix=org.company  12.56.176.41:2014 spool=false pickle=true  12.56.176.42:2014 spool=false pickle=true",
     "addRoute sendAllMatch other prefix=stats  12.56.176.51:2014 spool=false pickle=true  12.56.176.52:2014 spool=false pickle=true"
]
@ghost commented Jan 22, 2015

I'm also seeing this issue.

instance = "default"
listen_addr = "0.0.0.0:2103"
admin_addr = "0.0.0.0:2104"
http_addr = "0.0.0.0:8089"

spool_dir = "/log/graphite/spool"
log_level = "error"

init = [
"addRoute sendAllMatch carbonrelay-cluster 10.198.34.168:2013 spool=false pickle=false 10.198.34.169:2013 spool=false pickle=false 10.198.35.228:2013 spool=false pickle=false 10.199.223.31:10005 spool=false pickle=false"
]

[instrumentation]
graphite_addr = "localhost:2103"
graphite_interval = 15000

Slowly eats memory over time.

@ghost commented Feb 10, 2015

It appears that my configuration was seeing lots of destination metric drops. This was tracked down to my last destination (10005), an influxdb destination that was having issues. The carbon-relay-ng metrics showed lots of errors and drops. Since I removed the influxdb destination, we've been stable. I don't think the issue is influxdb being a destination per se; rather, it seems to be an issue with the performance of my influxdb backend. The back-up in processing seems to have been negatively impacting carbon-relay-ng.

@willejs commented Feb 24, 2015

Any news on this?

@Dieterbe (Contributor)

Weird. I have a 2 GB machine and never even needed to look at memory; it's never been an issue for me.
Are any of you interested in doing some memory profiling? It can tell you where a lot of memory is being allocated.
See http://blog.golang.org/profiling-go-programs
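
For anyone who wants to try but is new to Go profiling, roughly the standard-library approach the linked post describes is sketched below. This is not code from carbon-relay-ng itself, and the output path is a placeholder.

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

// writeHeapProfile dumps the current heap allocation profile to path,
// for later inspection with `go tool pprof <binary> <path>`.
func writeHeapProfile(path string) {
    f, err := os.Create(path)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    if err := pprof.WriteHeapProfile(f); err != nil {
        log.Fatal(err)
    }
}

func main() {
    writeHeapProfile("/tmp/carbon-relay-ng-heap.prof") // placeholder output path
}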

@shanemhansen

I'm also getting OOM kills. I'm also pushing to influxdb. Here's a Grafana graph of memory usage (sorry for the missing units; the graph is ~2 weeks wide and ~7.5 GB tall). I can try to do some memory profiling soon.

[image: memory usage graph]

@Dieterbe (Contributor) commented May 4, 2015

what's the green vs the yellow line? (which is memory and what's the other?)

@shanemhansen

Sorry. They are both memory. The yellow line is memory usage on a different
server running carbon-relay-ng at a lower throughput.

The green line does represent total memory usage on a server running
carbon-relay-ng and some other processes, but that memory usage is normally pretty constant; I think the graph shows memory growth up to the point where the kernel OOM-kills the process. I also have OOM messages in dmesg.

I have a core file I'm analyzing too.

@shanemhansen

My core file doesn't show anything particularly interesting. Approximately 90% of the core file is NULLs, which would make sense if we had lots of structures initialized to their zero value and never used. I'll make a build with some memory allocation debugging next.

@marcan commented Jun 2, 2015

I'm seeing the same issue. Here's a memory graph of two servers, both running carbon-relay-ng with an identical config. carbon-relay-ng was installed around May 28–29. Decreases in cached memory are server reboots.
[image: memory graph of the two servers]

The difference between the two servers is that the first one is receiving all the incoming carbon-relay-ng traffic right now.

The config is pretty trivial. This is running on 3 servers, each also running carbon-cache. Clients can talk to carbon-relay-ng on any of the 3 servers, which then forwards the metrics to all 3.

instance = "default"
listen_addr = "192.168.80.11:2103"
admin_addr = "127.0.0.1:2004"
http_addr = "127.0.0.1:8081"
spool_dir = "/var/spool/carbon-relay-ng"
log_level = "notice"
bad_metrics_max_age = "24h"

init = [
     'addBlack collectd.localhost',
     'addRoute sendAllMatch mon-carbons  192.168.80.11:2004 spool=true pickle=true  192.168.80.12:2004 spool=true pickle=true  192.168.80.13:2004 spool=true pickle=true',
]

[instrumentation]
graphite_addr = ""
graphite_interval = 1000

It hasn't OOMed yet, though it looks like it will in a few days at this rate. I'll take a shot at profiling it.

@marcan commented Jun 2, 2015

Well, looks like carbon-relay-ng already imports pprof, so that was easy.

(pprof) top10
5.41GB of 5.45GB total (99.16%)
Dropped 109 nodes (cum <= 0.03GB)
      flat  flat%   sum%        cum   cum%
    5.41GB 99.16% 99.16%     5.41GB 99.16%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*WindowSample).Update
         0     0% 99.16%     5.41GB 99.16%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*StandardHistogram).Update
         0     0% 99.16%     5.40GB 99.01%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*StandardTimer).Update
         0     0% 99.16%     5.40GB 99.09%  main.(*Conn).HandleData
         0     0% 99.16%     5.45GB   100%  runtime.goexit

That would be where all the memory is going. Sounds like this is broken in metrics_wrapper.go:

    //histogram: NewHistogram(NewExpDecaySample(1028, 0.015)),
    histogram := metrics.NewHistogram(metrics.NewWindowSample())

WindowSample just accumulates values forever, never freeing them, until it is explicitly cleared. ExpDecaySample is limited to 1028 samples, in this case. I'm changing both instances of WindowSample to ExpDecaySample in metrics_wrapper.go and seeing how that goes over the next few days.
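
For reference, a minimal, self-contained sketch of that swap, assuming the upstream rcrowley/go-metrics API (which the vendored Dieterbe/go-metrics fork extends); the actual change lives in metrics_wrapper.go and the names there may differ:

package main

import (
    "fmt"

    metrics "github.com/rcrowley/go-metrics" // assumed upstream of the vendored fork
)

func main() {
    // Bounded reservoir: keeps at most 1028 samples, exponentially decayed (alpha 0.015),
    // instead of the unbounded WindowSample that grows until explicitly cleared.
    h := metrics.NewHistogram(metrics.NewExpDecaySample(1028, 0.015))

    for i := 0; i < 1000000; i++ {
        h.Update(int64(i)) // memory stays bounded no matter how many values arrive
    }
    fmt.Println(h.Count(), h.Mean())
}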

@TransactCharlie

Hi marcan,

I haven't dug through the code, but just one thing... you aren't sending carbon-relay-ng's internal metrics anywhere:

[instrumentation]
graphite_addr = ""
graphite_interval = 1000

Maybe if you were, it would be flushing the WindowSamples?

@marcan commented Jun 11, 2015

That's possible, though I'm not currently interested in the instrumentation and just left that section untouched from the sample config file :-)

@Dieterbe (Contributor)

Hmm, I have to check the go-metrics code again; it's possible that certain metric types need to be collected in order for them to be aggregated or for their data structures to be trimmed/reset.

@Dieterbe (Contributor)

BTW, I created a little lib that automatically creates a memory profile when a certain memory usage is reached; see https://github.com/Dieterbe/profiletrigger
Anyone feel like building this in and PR'ing? :) It should help all of you track this down.

@olivierHa

Any news about this issue?
I am running into it right now.

Is it linked to a bad 'instrumentation' section?

@ptqa commented Feb 8, 2016

@Dieterbe I did the same as @marcan (changing metrics.NewWindowSample() to metrics.NewExpDecaySample(1028, 0.015)) and carbon-relay-ng stopped leaking memory. I can make a PR with that, but I don't fully understand how the metrics histogram is supposed to work in this case.

@olivierHa nope, it has nothing to do with the instrumentation section.

@olivierHa

On my side, it was the instrumentation :(
If you put a "wrong" or empty address for flushing internal metrics, I guess the metrics just 'stack up' in memory until an OOM occurs.

@ptqa commented Feb 8, 2016

I see. Well, in my case there is no instrumentation section at all and it's leaking.

@pkoraca commented Feb 8, 2016

In my case the OOMs stopped when the instrumentation section was configured.

@rlodge commented Jul 29, 2016

So I've got it set up to flush instrumentation, and I also tweaked it as suggested above to use NewExpDecaySample, but neither made any difference. It constantly receives a low level of incoming metrics (5-15/s) in my test configuration, and it eventually gets killed by the kernel.

I tried to get it to use the referenced profiletrigger, but I haven't had any luck (nothing ever gets written). Basically I did:

if *memprofile != "" {
    log.Warning("Triggering memory profiling at 1GB to '%s'\n", *memprofile)
    errors := make(chan error)
    trigger, _ := heap.New(*memprofile, 1000000000, 60, time.Duration(1)*time.Second, errors)
    go trigger.Run()
    go func() {
        for e := range errors {
            log.Fatal("profiletrigger heap saw error:", e)
        }
    }()
    log.Warning("Started memprofile error thread")
}

I don't know too much about Go, so I may be doing something simple wrong.

@Dieterbe (Contributor) commented Aug 1, 2016

The profiletrigger should be unrelated to the memprofile flag, so instead of using *memprofile you should use a path to a writable directory. Also make sure your threshold makes sense for your system; I believe it's in bytes.
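
For anyone else wiring this up, a sketch based on the heap.New call shown in the snippet above, but with a directory path and the error checked; the directory is a placeholder and the 1000000000 threshold is assumed to be bytes (~1 GB):

package main

import (
    "log"
    "time"

    "github.com/Dieterbe/profiletrigger/heap"
)

func main() {
    profileDir := "/appdata/carbon-relay-ng-profiles" // placeholder: any writable directory

    errors := make(chan error)
    trigger, err := heap.New(profileDir, 1000000000, 60, time.Second, errors)
    if err != nil {
        log.Fatalf("could not set up profiletrigger: %v", err)
    }
    // dump a heap profile into profileDir whenever memory usage crosses the threshold
    go trigger.Run()

    // surface any errors profiletrigger reports while running
    go func() {
        for e := range errors {
            log.Printf("profiletrigger heap saw error: %v", e)
        }
    }()

    select {} // block forever; in carbon-relay-ng this would be the rest of main()
}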

@rlodge commented Aug 1, 2016

Ah, it wasn't clear to me that it was supposed to be a directory.

Here's what I get, from a couple of profiles about 3 minutes apart. My threshold is low at the moment to make it happen quickly, but it will use up pretty much all the system's memory in a few hours. I've got an older version of the program running on a different system entirely (different environment, OS), handling about 10K metrics/second, and it hasn't been restarted since 6/10. I'm testing this system, and it's only receiving 6-10 metrics/second, but it will run out of memory and the process manager will restart the relay service in 4-6 hours.

go tool pprof /usr/local/bin/carbon-relay-ng /appdata/carbon-relay-ng-profiles/1470066792.profile-heap
(pprof) top 10
821.70MB of 822.21MB total (99.94%)
Dropped 54 nodes (cum <= 4.11MB)
      flat  flat%   sum%        cum   cum%
  478.75MB 58.23% 58.23%   478.75MB 58.23%  main.(*keepSafe).keepClean
  219.14MB 26.65% 84.88%   219.14MB 26.65%  main.NewWriter
  103.06MB 12.53% 97.42%   336.58MB 40.94%  main.NewConn
   13.88MB  1.69% 99.10%    13.88MB  1.69%  main.NewKeepSafe
    6.87MB  0.84% 99.94%     6.87MB  0.84%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%   336.58MB 40.94%  main.(*Destination).updateConn
         0     0% 99.94%     6.87MB  0.84%  main.main
         0     0% 99.94%   822.21MB   100%  runtime.goexit
         0     0% 99.94%     6.87MB  0.84%  runtime.main
(pprof) top10 -cum
821.70MB of 822.21MB total (99.94%)
Dropped 54 nodes (cum <= 4.11MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   822.21MB   100%  runtime.goexit
  478.75MB 58.23% 58.23%   478.75MB 58.23%  main.(*keepSafe).keepClean
         0     0% 58.23%   336.58MB 40.94%  main.(*Destination).updateConn
  103.06MB 12.53% 70.76%   336.58MB 40.94%  main.NewConn
  219.14MB 26.65% 97.42%   219.14MB 26.65%  main.NewWriter
   13.88MB  1.69% 99.10%    13.88MB  1.69%  main.NewKeepSafe
    6.87MB  0.84% 99.94%     6.87MB  0.84%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%     6.87MB  0.84%  main.main
         0     0% 99.94%     6.87MB  0.84%  runtime.main

go tool pprof /usr/local/bin/carbon-relay-ng /appdata/carbon-relay-ng-profiles/1470066982.profile-heap
Entering interactive mode (type "help" for commands)
(pprof) top 10
849.39MB of 849.90MB total (99.94%)
Dropped 58 nodes (cum <= 4.25MB)
      flat  flat%   sum%        cum   cum%
  508.82MB 59.87% 59.87%   508.82MB 59.87%  main.(*keepSafe).keepClean
  226.97MB 26.71% 86.57%   226.97MB 26.71%  main.NewWriter
  106.74MB 12.56% 99.13%   334.21MB 39.32%  main.NewConn
    6.87MB  0.81% 99.94%     6.87MB  0.81%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%   334.21MB 39.32%  main.(*Destination).updateConn
         0     0% 99.94%     6.87MB  0.81%  main.main
         0     0% 99.94%   849.90MB   100%  runtime.goexit
         0     0% 99.94%     6.87MB  0.81%  runtime.main
(pprof) top10 -cum
849.39MB of 849.90MB total (99.94%)
Dropped 58 nodes (cum <= 4.25MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   849.90MB   100%  runtime.goexit
  508.82MB 59.87% 59.87%   508.82MB 59.87%  main.(*keepSafe).keepClean
         0     0% 59.87%   334.21MB 39.32%  main.(*Destination).updateConn
  106.74MB 12.56% 72.43%   334.21MB 39.32%  main.NewConn
  226.97MB 26.71% 99.13%   226.97MB 26.71%  main.NewWriter
    6.87MB  0.81% 99.94%     6.87MB  0.81%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%     6.87MB  0.81%  main.main
         0     0% 99.94%     6.87MB  0.81%  runtime.main

@Dieterbe (Contributor) commented Mar 23, 2017

The profile above is not very conclusive, but I suspect it was taken before carbon-relay-ng had run long enough.

From the reports from various people in this ticket, as well as from the code, it's clear that the problem is carbon-relay-ng's internal metrics piling up when it isn't set up to send them anywhere.
We need to do two things:

A) Make this issue extra clear in the readme and sample configs, and set up the default config to work around it, because too many folks are tripping over it (see the config sketch below). I'll do this now.
B) Actually solve it, maybe by using a different library, or by flushing the metrics to /dev/null if graphite_addr is empty. Unfortunately I have no time to do this in the short term.
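
In the meantime, the workaround reported earlier in the thread is to point the instrumentation at a reachable carbon/graphite listener so the internal metrics actually get flushed. A sketch, with placeholder address and interval:

[instrumentation]
# flush carbon-relay-ng's own metrics somewhere so they don't pile up in memory
graphite_addr = "localhost:2003"   # placeholder: any reachable carbon/graphite line receiver
graphite_interval = 10000          # flush interval, same units as the configs above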

@scalp42 commented Aug 2, 2019

@Dieterbe is there a way to discard these internal metrics?

@Dieterbe (Contributor) commented Aug 2, 2019

I don't have time right now to dig into the code (anyone else, feel free to do so). From what I recall, no, not with our current implementation.

@shlok007 commented Feb 7, 2020

Hey, can we add a flag to disable the instrumentation when it's not needed? If you can direct me to the relevant parts of the code so I can better understand the situation, I'll be happy to try to contribute a fix. :)

@james-woods-mnf

@shlok007, did you make any progress on this?

@robert-c-ewing commented Mar 4, 2024

It seems that we start the memory-usage instrumentation before we even know whether we want it:

go func() {
    sys := stats.Gauge("what=virtual_memory.unit=Byte")
    alloc := stats.Gauge("what=memory_allocated.unit=Byte")
    ticker := time.NewTicker(time.Second)
    var memstats runtime.MemStats
    for range ticker.C {
        runtime.ReadMemStats(&memstats)
        sys.Update(int64(memstats.Sys))
        alloc.Update(int64(memstats.Alloc))
    }
}()

I'm going to take a poke at a PR for this.
