Leaks memory #50

Open
mmimica opened this issue Dec 19, 2014 · 28 comments

@mmimica commented Dec 19, 2014

The relay seems to be leaking memory. It consumes all 2 GB of RAM we gave it until the OOM killer kicks in.
The routing rules are pretty simple:

init = [
     "addBlack .net",
     "addRoute sendAllMatch core prefix=org.company  12.56.176.41:2014 spool=false pickle=true  12.56.176.42:2014 spool=false pickle=true",
     "addRoute sendAllMatch other prefix=stats  12.56.176.51:2014 spool=false pickle=true  12.56.176.52:2014 spool=false pickle=true"
]
@ghost commented Jan 22, 2015

I'm also seeing this issue.

instance = "default"
listen_addr = "0.0.0.0:2103"
admin_addr = "0.0.0.0:2104"
http_addr = "0.0.0.0:8089"

spool_dir = "/log/graphite/spool"
log_level = "error"

init = [
"addRoute sendAllMatch carbonrelay-cluster 10.198.34.168:2013 spool=false pickle=false 10.198.34.169:2013 spool=false pickle=false 10.198.35.228:2013 spool=false pickle=false 10.199.223.31:10005 spool=false pickle=false"
]

[instrumentation]
graphite_addr = "localhost:2103"
graphite_interval = 15000

Slowly eats memory over time.

@ghost commented Feb 10, 2015

It appears that my configuration was seeing lots of destination metric drops. This was tracked down to my last destination (10005), an influxdb destination that was having issues. The carbon-relay-ng metrics showed lots of errors and drops. Since I removed the influxdb destination, we've been stable. I don't think the issue is influxdb being a destination per se; rather, it seems to be an issue with the performance of my influxdb backend. The back-up in processing seems to have been negatively impacting carbon-relay-ng.

@willejs commented Feb 24, 2015

Any news on this?

@Dieterbe (Contributor)

Weird. I have a 2 GB machine and never even needed to look at memory; it's never been an issue for me.
Are any of you interested in doing some memory profiling? It can tell you where a lot of memory is being allocated.
See http://blog.golang.org/profiling-go-programs
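
For anyone who wants to try but is new to Go profiling, roughly the standard-library approach the linked post describes is sketched below. This is not code from carbon-relay-ng itself, and the output path is a placeholder.

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

// writeHeapProfile dumps the current heap allocation profile to path,
// for later inspection with `go tool pprof <binary> <path>`.
func writeHeapProfile(path string) {
    f, err := os.Create(path)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    if err := pprof.WriteHeapProfile(f); err != nil {
        log.Fatal(err)
    }
}

func main() {
    writeHeapProfile("/tmp/carbon-relay-ng-heap.prof") // placeholder output path
}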

@shanemhansen

I'm also getting OOM kills. I'm also pushing to influxdb. Here's a Grafana graph of memory usage (sorry for the missing units; the graph is ~2 weeks wide and ~7.5 GB tall). I can try to do some memory profiling soon.

[image: memory usage graph]

@Dieterbe (Contributor) commented May 4, 2015

what's the green vs the yellow line? (which is memory and what's the other?)

@shanemhansen

Sorry. They are both memory. The yellow line is memory usage on a different
server running carbon-relay-ng at a lower throughput.

The green line does represent total memory usage on a server running
carbon-relay-ng and some other processes, but that memory usage is normally pretty constant; I think the graph shows memory growth up to the point where the kernel OOM-kills the process. I also have OOM messages in dmesg.

I have a core file I'm analyzing too.

@shanemhansen

My core file doesn't show anything particularly interesting. Approximately 90% of the core file is NULLs, which would make sense if we had lots of structures initialized to their zero value and never used. I'll make a build with some memory allocation debugging next.

@marcan commented Jun 2, 2015

I'm seeing the same issue. Here's a memory graph of two servers, both running carbon-relay-ng with an identical config. carbon-relay-ng was installed around May 28–29. Decreases in cached memory are server reboots.
[image: memory graph of the two servers]

The difference between the two servers is that the first one is receiving all the incoming carbon-relay-ng traffic right now.

The config is pretty trivial. This is running on 3 servers, each also running carbon-cache. Clients can talk to carbon-relay-ng on any of the 3 servers, which then forwards the metrics to all 3.

instance = "default"
listen_addr = "192.168.80.11:2103"
admin_addr = "127.0.0.1:2004"
http_addr = "127.0.0.1:8081"
spool_dir = "/var/spool/carbon-relay-ng"
log_level = "notice"
bad_metrics_max_age = "24h"

init = [
     'addBlack collectd.localhost',
     'addRoute sendAllMatch mon-carbons  192.168.80.11:2004 spool=true pickle=true  192.168.80.12:2004 spool=true pickle=true  192.168.80.13:2004 spool=true pickle=true',
]

[instrumentation]
graphite_addr = ""
graphite_interval = 1000

It hasn't OOMed yet, though it looks like it will in a few days at this rate. I'll take a shot at profiling it.

@marcan commented Jun 2, 2015

Well, looks like carbon-relay-ng already imports pprof, so that was easy.

(pprof) top10
5.41GB of 5.45GB total (99.16%)
Dropped 109 nodes (cum <= 0.03GB)
      flat  flat%   sum%        cum   cum%
    5.41GB 99.16% 99.16%     5.41GB 99.16%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*WindowSample).Update
         0     0% 99.16%     5.41GB 99.16%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*StandardHistogram).Update
         0     0% 99.16%     5.40GB 99.01%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*StandardTimer).Update
         0     0% 99.16%     5.40GB 99.09%  main.(*Conn).HandleData
         0     0% 99.16%     5.45GB   100%  runtime.goexit

That would be where all the memory is going. Sounds like this is broken in metrics_wrapper.go:

    //histogram: NewHistogram(NewExpDecaySample(1028, 0.015)),
    histogram := metrics.NewHistogram(metrics.NewWindowSample())

WindowSample just accumulates values forever, never freeing them, until it is explicitly cleared. ExpDecaySample is limited to 1028 samples, in this case. I'm changing both instances of WindowSample to ExpDecaySample in metrics_wrapper.go and seeing how that goes over the next few days.
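
For reference, a minimal, self-contained sketch of that swap, assuming the upstream rcrowley/go-metrics API (which the vendored Dieterbe/go-metrics fork extends); the actual change lives in metrics_wrapper.go and the names there may differ:

package main

import (
    "fmt"

    metrics "github.com/rcrowley/go-metrics" // assumed upstream of the vendored fork
)

func main() {
    // Bounded reservoir: keeps at most 1028 samples, exponentially decayed (alpha 0.015),
    // instead of the unbounded WindowSample that grows until explicitly cleared.
    h := metrics.NewHistogram(metrics.NewExpDecaySample(1028, 0.015))

    for i := 0; i < 1000000; i++ {
        h.Update(int64(i)) // memory stays bounded no matter how many values arrive
    }
    fmt.Println(h.Count(), h.Mean())
}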

@TransactCharlie

Hi marcan,

I haven't dug through the code, but just one thing... you aren't sending carbon-relay-ng's internal metrics anywhere:

[instrumentation]
graphite_addr = ""
graphite_interval = 1000

Maybe if you were, it would be flushing the WindowSamples?

@marcan commented Jun 11, 2015

That's possible, though I'm not currently interested in the instrumentation and just left that section untouched from the sample config file :-)

@Dieterbe (Contributor)

Hmm, I have to check the go-metrics code again; it's possible that certain metric types need to be collected in order for them to be aggregated or for their data structures to be trimmed/reset.

@Dieterbe (Contributor)

BTW, I created a little lib that automatically creates a memory profile when a certain memory usage is reached; see https://github.com/Dieterbe/profiletrigger
Anyone feel like building this in and PR'ing? :) It should help all of you track this down.

@olivierHa

Any news about this issue?
I am running into it right now.

Is it linked to a bad 'instrumentation' section?

@ptqa commented Feb 8, 2016

@Dieterbe I did the same as @marcan (changing metrics.NewWindowSample() to metrics.NewExpDecaySample(1028, 0.015)) and carbon-relay-ng stopped leaking memory. I can make a PR with that, but I don't fully understand how the metrics histogram is supposed to work in this case.

@olivierHa nope, it has nothing to do with the instrumentation section.

@olivierHa

On my side, it was the instrumentation :(
If you put a "wrong" or empty address for flushing internal metrics, I guess the metrics just 'stack up' in memory until an OOM occurs.

@ptqa commented Feb 8, 2016

I see. Well, in my case there is no instrumentation section at all and it's leaking.

@pkoraca commented Feb 8, 2016

In my case the OOMs stopped when the instrumentation section was configured.

@rlodge commented Jul 29, 2016

So I've got it set up to flush instrumentation, and I also tweaked it as suggested above to use NewExpDecaySample, but neither made any difference. It constantly receives a low level of incoming metrics (5-15/s) in my test configuration, and it eventually gets killed by the kernel.

I tried to get it to use the referenced profiletrigger, but I haven't had any luck (nothing ever gets written). Basically I did:

if *memprofile != "" {
    log.Warning("Triggering memory profiling at 1GB to '%s'\n", *memprofile)
    errors := make(chan error)
    trigger, _ := heap.New(*memprofile, 1000000000, 60, time.Duration(1)*time.Second, errors)
    go trigger.Run()
    go func() {
        for e := range errors {
            log.Fatal("profiletrigger heap saw error:", e)
        }
    }()
    log.Warning("Started memprofile error thread")
}

I don't know too much about Go, so I may be doing something simple wrong.

@Dieterbe (Contributor) commented Aug 1, 2016

The profiletrigger should be unrelated to the memprofile flag, so instead of using *memprofile you should use a path to a writable directory. Also make sure your threshold makes sense for your system; I believe it's in bytes.
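
For anyone else wiring this up, a sketch based on the heap.New call shown in the snippet above, but with a directory path and the error checked; the directory is a placeholder and the 1000000000 threshold is assumed to be bytes (~1 GB):

package main

import (
    "log"
    "time"

    "github.com/Dieterbe/profiletrigger/heap"
)

func main() {
    profileDir := "/appdata/carbon-relay-ng-profiles" // placeholder: any writable directory

    errors := make(chan error)
    trigger, err := heap.New(profileDir, 1000000000, 60, time.Second, errors)
    if err != nil {
        log.Fatalf("could not set up profiletrigger: %v", err)
    }
    // dump a heap profile into profileDir whenever memory usage crosses the threshold
    go trigger.Run()

    // surface any errors profiletrigger reports while running
    go func() {
        for e := range errors {
            log.Printf("profiletrigger heap saw error: %v", e)
        }
    }()

    select {} // block forever; in carbon-relay-ng this would be the rest of main()
}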

@rlodge commented Aug 1, 2016

Ah, it wasn't clear to me that it was supposed to be a directory.

Here's what I get, from a couple of profiles about 3 minutes apart. My threshold is low at the moment to make it happen quickly, but it will use up pretty much all the system's memory in a few hours. I've got an older version of the program running on a different system entirely (different environment, OS), handling about 10K metrics/second, and it hasn't been restarted since 6/10. I'm testing this system, and it's only receiving 6-10 metrics/second, but it will run out of memory and the process manager will restart the relay service in 4-6 hours.

go tool pprof /usr/local/bin/carbon-relay-ng /appdata/carbon-relay-ng-profiles/1470066792.profile-heap
(pprof) top 10
821.70MB of 822.21MB total (99.94%)
Dropped 54 nodes (cum <= 4.11MB)
      flat  flat%   sum%        cum   cum%
  478.75MB 58.23% 58.23%   478.75MB 58.23%  main.(*keepSafe).keepClean
  219.14MB 26.65% 84.88%   219.14MB 26.65%  main.NewWriter
  103.06MB 12.53% 97.42%   336.58MB 40.94%  main.NewConn
   13.88MB  1.69% 99.10%    13.88MB  1.69%  main.NewKeepSafe
    6.87MB  0.84% 99.94%     6.87MB  0.84%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%   336.58MB 40.94%  main.(*Destination).updateConn
         0     0% 99.94%     6.87MB  0.84%  main.main
         0     0% 99.94%   822.21MB   100%  runtime.goexit
         0     0% 99.94%     6.87MB  0.84%  runtime.main
(pprof) top10 -cum
821.70MB of 822.21MB total (99.94%)
Dropped 54 nodes (cum <= 4.11MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   822.21MB   100%  runtime.goexit
  478.75MB 58.23% 58.23%   478.75MB 58.23%  main.(*keepSafe).keepClean
         0     0% 58.23%   336.58MB 40.94%  main.(*Destination).updateConn
  103.06MB 12.53% 70.76%   336.58MB 40.94%  main.NewConn
  219.14MB 26.65% 97.42%   219.14MB 26.65%  main.NewWriter
   13.88MB  1.69% 99.10%    13.88MB  1.69%  main.NewKeepSafe
    6.87MB  0.84% 99.94%     6.87MB  0.84%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%     6.87MB  0.84%  main.main
         0     0% 99.94%     6.87MB  0.84%  runtime.main

go tool pprof /usr/local/bin/carbon-relay-ng /appdata/carbon-relay-ng-profiles/1470066982.profile-heap
Entering interactive mode (type "help" for commands)
(pprof) top 10
849.39MB of 849.90MB total (99.94%)
Dropped 58 nodes (cum <= 4.25MB)
      flat  flat%   sum%        cum   cum%
  508.82MB 59.87% 59.87%   508.82MB 59.87%  main.(*keepSafe).keepClean
  226.97MB 26.71% 86.57%   226.97MB 26.71%  main.NewWriter
  106.74MB 12.56% 99.13%   334.21MB 39.32%  main.NewConn
    6.87MB  0.81% 99.94%     6.87MB  0.81%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%   334.21MB 39.32%  main.(*Destination).updateConn
         0     0% 99.94%     6.87MB  0.81%  main.main
         0     0% 99.94%   849.90MB   100%  runtime.goexit
         0     0% 99.94%     6.87MB  0.81%  runtime.main
(pprof) top10 -cum
849.39MB of 849.90MB total (99.94%)
Dropped 58 nodes (cum <= 4.25MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   849.90MB   100%  runtime.goexit
  508.82MB 59.87% 59.87%   508.82MB 59.87%  main.(*keepSafe).keepClean
         0     0% 59.87%   334.21MB 39.32%  main.(*Destination).updateConn
  106.74MB 12.56% 72.43%   334.21MB 39.32%  main.NewConn
  226.97MB 26.71% 99.13%   226.97MB 26.71%  main.NewWriter
    6.87MB  0.81% 99.94%     6.87MB  0.81%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%     6.87MB  0.81%  main.main
         0     0% 99.94%     6.87MB  0.81%  runtime.main

@Dieterbe (Contributor) commented Mar 23, 2017

The profile above is not very conclusive, but I suspect it was taken before carbon-relay-ng had run long enough.

From the reports from various people in this ticket, as well as from the code, it's clear that the problem is carbon-relay-ng's internal metrics piling up when it isn't set up to send them anywhere.
We need to do two things:

A) Make this issue extra clear in the readme and sample configs, and set up the default config to work around it, because too many folks are tripping over it (see the config sketch below). I'll do this now.
B) Actually solve it, maybe by using a different library, or by flushing the metrics to /dev/null if graphite_addr is empty. Unfortunately I have no time to do this in the short term.
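
In the meantime, the workaround reported earlier in the thread is to point the instrumentation at a reachable carbon/graphite listener so the internal metrics actually get flushed. A sketch, with placeholder address and interval:

[instrumentation]
# flush carbon-relay-ng's own metrics somewhere so they don't pile up in memory
graphite_addr = "localhost:2003"   # placeholder: any reachable carbon/graphite line receiver
graphite_interval = 10000          # flush interval, same units as the configs above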

@scalp42 commented Aug 2, 2019

@Dieterbe is there a way to discard these internal metrics?

@Dieterbe (Contributor) commented Aug 2, 2019

I don't have time right now to dig into the code (anyone else, feel free to do so). From what I recall, no, not with our current implementation.

@shlok007 commented Feb 7, 2020

Hey, can we add a flag to disable the instrumentation when it's not needed? If you can direct me to the relevant parts of the code so I can better understand the situation, I'll be happy to try to contribute a fix. :)

@james-woods-mnf

@shlok007, did you make any progress on this?

@robert-c-ewing commented Mar 4, 2024

It seems that we start the memory-usage instrumentation before we even know whether we want it:

go func() {
    sys := stats.Gauge("what=virtual_memory.unit=Byte")
    alloc := stats.Gauge("what=memory_allocated.unit=Byte")
    ticker := time.NewTicker(time.Second)
    var memstats runtime.MemStats
    for range ticker.C {
        runtime.ReadMemStats(&memstats)
        sys.Update(int64(memstats.Sys))
        alloc.Update(int64(memstats.Alloc))
    }
}()

I'm going to take a poke at a PR for this.
