[BUG] OOM (out of memory) recurring every 8-9 days #579

Open
interfan7 opened this issue Jan 7, 2024 · 10 comments

interfan7 commented Jan 7, 2024

Describe the bug
When the service is killed by the OS due to OOM, systemd automatically starts it again.
Then the memory consumption on the machine steadily increases for 8-9 days until the next OOM.

Logs
I haven't noticed anything particularly unusual in the logs. The OOM message appears in the system logs (dmesg, etc.).
I'll be happy to provide specific grep results/messages; otherwise the log is huge.

Go-carbon Configuration:

go-carbon.conf:

[common]
user = "carbon"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
max-cpu = 4
metric-interval = "1m0s"

[whisper]
data-dir = "/data/graphite/whisper/"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = "/etc/go-carbon/storage-aggregation.conf"
quotas-file = ""
workers = 4
max-updates-per-second = 0
sparse-create = false
physical-size-factor = 0.75
flock = true
compressed = false
enabled = true
hash-filenames = true
remove-empty-file = false
online-migration = false
online-migration-rate = 5
online-migration-global-scope = ""

[cache]
max-size = 100000000
write-strategy = "max"

[udp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[tcp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0
compression = ""

[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = true
buffer-size = 0

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

[grpc]
listen = "127.0.0.1:7003"
enabled = true

[tags]
enabled = false
tagdb-url = "http://127.0.0.1:8000"
tagdb-chunk-size = 32
tagdb-update-interval = 100
local-dir = "/data/graphite/tagging/"
tagdb-timeout = "1s"

[carbonserver]
listen = "0.0.0.0:8080"
enabled = true
query-cache-enabled = true
streaming-query-cache-enabled = false
query-cache-size-mb = 0
find-cache-enabled = true
buckets = 100
max-globs = 1000
fail-on-max-globs = false
empty-result-ok = true
do-not-log-404s = false
metrics-as-counters = false
trigram-index = true
internal-stats-dir = ""
cache-scan = false
max-metrics-globbed = 1000000000
max-metrics-rendered = 100000000
trie-index = false
concurrent-index = false
realtime-index = 0
file-list-cache = ""
file-list-cache-version = 1
max-creates-per-second = 0
no-service-when-index-is-not-ready = false
max-inflight-requests = 0
render-trace-logging-enabled = false
[carbonserver.grpc]
listen = ""
enabled = false
read-timeout = "1m0s"
idle-timeout = "1m0s"
write-timeout = "1m0s"
scan-frequency = "5m0s"
quota-usage-report-frequency = "1m0s"

[dump]
enabled = false
path = "/var/lib/graphite/dump/"
restore-per-second = 0

[pprof]
listen = "127.0.0.1:7007"
enabled = false

[[logging]]
logger = ""
file = "/var/log/go-carbon/go-carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
sample-tick = ""
sample-initial = 0
sample-thereafter = 0

[prometheus]
enabled = false
endpoint = "/metrics"
[prometheus.labels]

[tracing]
enabled = false
jaegerEndpoint = ""
stdout = false
send_timeout = "10s"

storage-schemas.conf:

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[redash-metrics]
pattern = (.*{something I prefer to not share}.*)
retentions = 1m:7y

[production]
pattern = (^production.*|^secTeam.*)
retentions = 1m:60d,15m:120d,1h:3y

[non-production]
pattern = (^non-production.*|^canary.*)
retentions = 1m:14d,30m:30d,1h:180d

[default]
pattern = .*
retentions = 1m:14d,5m:90d,30m:1y

storage-aggregation.conf files:

[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max

[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = max

[someTeam_aggregation]
pattern = ^someTeam.*
xFilesFactor = 0
aggregationMethod = average

[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average

I wonder whether the fields max-size, max-metrics-globbed, or max-metrics-rendered have anything to do with the issue.

Additional context
The carbonapi service also runs on the same server.
We have an identical dev server, but its carbonapi is almost never queried.
Interestingly, we don't have this issue on the dev server, which suggests the issue has to do with queries.
Here is the memory usage graph for prod (left) and dev (right), side by side, over a period of 22 days:
[image: memory usage graph, prod vs. dev, 22 days]
In addition, the systemd status also indicates a considerable difference, although the prod service has been active for only about 1.5 days.
Dev:

$ sudo systemctl status go-carbon.service | grep -E 'Memory|Active'
     Active: active (running) since Mon 2023-12-18 10:07:53 UTC; 2 weeks 6 days ago
     Memory: 26.5G

Prod:

$ sudo systemctl status go-carbon.service | grep -E 'Memory|Active'
     Active: active (running) since Sat 2024-01-06 05:18:57 UTC; 1 day 8h ago
     Memory: 42.0G

Although that may make sense, since the dev server receives almost no queries.

interfan7 added the bug label Jan 7, 2024

deniszh commented Jan 7, 2024

Hi @interfan7 ,

How many metrics (i.e. whisper files) does this instance serve?
OOM doesn't always mean a bug; go-carbon was designed to use memory instead of disk.


deniszh commented Jan 7, 2024

PS: you can enable the pprof interface in the config, then you can take heap dumps and investigate them with the go tool pprof command.
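
A minimal sketch of what that could look like, reusing the [pprof] block already in the config above (as far as I know, go-carbon's pprof listener serves the standard Go net/http/pprof endpoints on that address):

# go-carbon.conf: expose the profiling HTTP endpoints on localhost only
[pprof]
listen = "127.0.0.1:7007"
enabled = true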

interfan7 (Author) commented:

@deniszh
The number of WSP files is 1,508,149.
At least about 200,000 of them are only occasionally fed with datapoints.

I can count how many are updated or accessed in the last 24 hours if that might help?
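
For example, a rough count of files modified in the last 24 hours could be taken like this (just a sketch, using the data-dir from the config above; mtime only catches updates, not reads):

# count whisper files whose data changed in the last 24 hours
$ find /data/graphite/whisper/ -name '*.wsp' -mtime -1 | wc -l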

interfan7 (Author) commented:

@deniszh
I'm not familiar with pprof. That will require some ramp-up time for me. I'll try when I can.


deniszh commented Jan 9, 2024

@deniszh
The number of WSP files is 1,508,149.
At least about 200,000 of them are only occasionally fed with datapoints.

I can count how many are updated or accessed in the last 24 hours if that might help?

That's not much. I can check our prod memory consumption to compare. OTOH, we're using the trie index and trigram is disabled, IIRC.
Pprof is a great tool for live debugging of Go programs; try to enable it on localhost and experiment. You can even try it on a laptop.
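
A typical session could look something like this (a sketch only, assuming the pprof listener from the config above is enabled; the svg output needs graphviz installed):

# fetch the current heap profile and open an interactive pprof session
$ go tool pprof http://127.0.0.1:7007/debug/pprof/heap
(pprof) top     # largest in-use allocations by function
(pprof) svg     # render the call graph to an SVG file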


interfan7 commented Jan 16, 2024

@deniszh
I've fetched a heap profile from pprof. SVG file attached. When opened locally in a browser, it's very convenient to zoom and move through it.

[attachment: pprof_heap_graphite_go_1]

Would you mind telling me whether anything interesting/suspicious is observable in it?

Since we configured it to be the target of our entire prod, it takes go-carbon not much longer than a day to reach OOM.
We plan to change the instance type to move from 64GB to 128GB and see whether the memory consumption stops at some point. As you said, OOM doesn't necessarily mean there is a memory leak, although it's interesting that the memory occupation grows slowly and steadily over quite a while; that's why we thought it might be a leak.


deniszh commented Jan 16, 2024

@interfan7: that's a memory snapshot, and one snapshot doesn't give you much info.
It's more interesting how it changes over time, i.e. what exactly grows.
BTW, I checked our prod servers. For example, for 4M metrics I see that go-carbon consumes 20-30GB RAM.
Why do you use such huge 'max-metrics-globbed' and 'max-metrics-rendered'? If I'm reading the SVG right, half of your data is glob cache. We're perfectly fine using

max-metrics-rendered = 10001
max-metrics-globbed  = 90000

The defaults are less strict, but your numbers are unusually high.

interfan7 (Author) commented:

@deniszh

go-carbon consumes 20-30GB RAM

How do you measure that? There are various ways to gauge a service's/process's memory occupation.
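
(The ways I know of measure slightly different things; a sketch, assuming the process is named go-carbon:)

# cgroup memory usage in bytes, as shown in the "Memory:" line of systemctl status
$ systemctl show go-carbon.service -p MemoryCurrent

# resident set size of the process itself, in kB
$ ps -o rss= -p "$(pidof go-carbon)"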

Why do you use such huge 'max-metrics-globbed' and 'max-metrics-rendered'?

I think when we set up the node, the Grafana users complained that they were missing data or metrics in the results, and changing this value seemed to resolve it. However, we just set a very high value without gradual try-and-see cycles.
Having said that, if that's the cause of the high memory usage, then why isn't the usage fluctuating over time instead of steadily rising? That's why we thought maybe there are leaks.

I'll get heap profiles at 2 more points in time between the service's start and "end" (i.e. somewhat before OOM). I've read that pprof is capable of comparing profiles.
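
The comparison could look something like this (a sketch; the file names are hypothetical, and -base is pprof's flag for diffing a later profile against an earlier one):

# save heap profiles at two points in time
$ curl -s -o heap_early.pb.gz http://127.0.0.1:7007/debug/pprof/heap
$ curl -s -o heap_late.pb.gz  http://127.0.0.1:7007/debug/pprof/heap

# show what grew between the two snapshots
$ go tool pprof -base heap_early.pb.gz heap_late.pb.gz
(pprof) top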

flucrezia commented:

Hi @interfan7,
have you tried increasing the config attributes max-cpu and workers? If processing can't keep up with the query rate and load, then memory consumption could increase.
I presume your prod machine has more than 4 vCores if it handles 128GB of RAM.
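
For example, something along these lines (a sketch only; 8 is an arbitrary illustrative value and should be matched to the actual core count):

[common]
# allow go-carbon to use more CPU cores
max-cpu = 8

[whisper]
# more concurrent whisper write workers
workers = 8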

interfan7 (Author) commented:

@flucrezia
The cores actually seem fairly relaxed, so I haven't thought this could be the issue.

I decreased the 2 params mentioned above about 2 days ago, and I want to see whether the memory will grow to 100GB+ again.

If I conclude that reducing those params doesn't resolve the issue, at least not for a 128GB machine, then I may try your suggestion 🙏🏻
