[BUG] OOM (out of memory) recurring every 8-9 days #579

Open
interfan7 opened this issue Jan 7, 2024 · 10 comments

interfan7 commented Jan 7, 2024

Describe the bug
When the service is killed by the OS due to OOM, systemd automatically starts it again.
Then the memory consumption on the machine steadily increases for 8-9 days until the next OOM.

Logs
I haven't noticed anything particularly unusual in the logs. The OOM message appears in the system logs (dmesg, etc.).
I'll be happy to provide specific grep results/messages; otherwise the log is huge.

Go-carbon Configuration:

go-carbon.conf:

[common]
user = "carbon"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
max-cpu = 4
metric-interval = "1m0s"

[whisper]
data-dir = "/data/graphite/whisper/"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = "/etc/go-carbon/storage-aggregation.conf"
quotas-file = ""
workers = 4
max-updates-per-second = 0
sparse-create = false
physical-size-factor = 0.75
flock = true
compressed = false
enabled = true
hash-filenames = true
remove-empty-file = false
online-migration = false
online-migration-rate = 5
online-migration-global-scope = ""

[cache]
max-size = 100000000
write-strategy = "max"

[udp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[tcp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0
compression = ""

[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = true
buffer-size = 0

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

[grpc]
listen = "127.0.0.1:7003"
enabled = true

[tags]
enabled = false
tagdb-url = "http://127.0.0.1:8000"
tagdb-chunk-size = 32
tagdb-update-interval = 100
local-dir = "/data/graphite/tagging/"
tagdb-timeout = "1s"

[carbonserver]
listen = "0.0.0.0:8080"
enabled = true
query-cache-enabled = true
streaming-query-cache-enabled = false
query-cache-size-mb = 0
find-cache-enabled = true
buckets = 100
max-globs = 1000
fail-on-max-globs = false
empty-result-ok = true
do-not-log-404s = false
metrics-as-counters = false
trigram-index = true
internal-stats-dir = ""
cache-scan = false
max-metrics-globbed = 1000000000
max-metrics-rendered = 100000000
trie-index = false
concurrent-index = false
realtime-index = 0
file-list-cache = ""
file-list-cache-version = 1
max-creates-per-second = 0
no-service-when-index-is-not-ready = false
max-inflight-requests = 0
render-trace-logging-enabled = false
[carbonserver.grpc]
listen = ""
enabled = false
read-timeout = "1m0s"
idle-timeout = "1m0s"
write-timeout = "1m0s"
scan-frequency = "5m0s"
quota-usage-report-frequency = "1m0s"

[dump]
enabled = false
path = "/var/lib/graphite/dump/"
restore-per-second = 0

[pprof]
listen = "127.0.0.1:7007"
enabled = false

[[logging]]
logger = ""
file = "/var/log/go-carbon/go-carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
sample-tick = ""
sample-initial = 0
sample-thereafter = 0

[prometheus]
enabled = false
endpoint = "/metrics"
[prometheus.labels]

[tracing]
enabled = false
jaegerEndpoint = ""
stdout = false
send_timeout = "10s"

storage-schemas.conf:

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[redash-metrics]
pattern = (.*{something I prefer to not share}.*)
retentions = 1m:7y

[production]
pattern = (^production.*|^secTeam.*)
retentions = 1m:60d,15m:120d,1h:3y

[non-production]
pattern = (^non-production.*|^canary.*)
retentions = 1m:14d,30m:30d,1h:180d

[default]
pattern = .*
retentions = 1m:14d,5m:90d,30m:1y

storage-aggregation.conf files:

[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max

[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = max

[someTeam_aggregation]
pattern = ^someTeam.*
xFilesFactor = 0
aggregationMethod = average

[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average

I wonder whether the fields max-size, max-metrics-globbed, or max-metrics-rendered have anything to do with the issue.

Additional context
The carbonapi service also runs on the same server.
We have an identical dev server, but its carbonapi is almost never queried.
Interestingly, we don't have this issue on the dev server, which suggests the issue has to do with queries.
Here is the memory usage graph for prod (left) and dev (right), side by side, over a period of 22 days:
[image: memory usage graph, prod vs. dev, 22 days]
In addition, the systemd status also indicates a considerable difference, although the prod service has been active for only about 1.5 days.
Dev:

$ sudo systemctl status go-carbon.service | grep -E 'Memory|Active'
     Active: active (running) since Mon 2023-12-18 10:07:53 UTC; 2 weeks 6 days ago
     Memory: 26.5G

Prod:

$ sudo systemctl status go-carbon.service | grep -E 'Memory|Active'
     Active: active (running) since Sat 2024-01-06 05:18:57 UTC; 1 day 8h ago
     Memory: 42.0G

Although that may make sense, since the dev server receives almost no queries.

interfan7 added the bug label Jan 7, 2024

deniszh commented Jan 7, 2024

Hi @interfan7 ,

How many metrics (i.e. whisper files) does this instance serve?
OOM doesn't always mean a bug; go-carbon was designed to use memory instead of disk.


deniszh commented Jan 7, 2024

PS: you can enable the pprof interface in the config, then you can take heap dumps and investigate them with the go tool pprof command.
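
A minimal sketch of what that could look like, reusing the [pprof] block already in the config above (as far as I know, go-carbon's pprof listener serves the standard Go net/http/pprof endpoints on that address):

# go-carbon.conf: expose the profiling HTTP endpoints on localhost only
[pprof]
listen = "127.0.0.1:7007"
enabled = true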

interfan7 (Author) commented:

@deniszh
The number of WSP files is 1,508,149.
At least about 200,000 of them are only occasionally fed with datapoints.

I can count how many are updated or accessed in the last 24 hours if that might help?
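
For example, a rough count of files modified in the last 24 hours could be taken like this (just a sketch, using the data-dir from the config above; mtime only catches updates, not reads):

# count whisper files whose data changed in the last 24 hours
$ find /data/graphite/whisper/ -name '*.wsp' -mtime -1 | wc -l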

interfan7 (Author) commented:

@deniszh
I'm not familiar with pprof. That will require some ramp-up time for me. I'll try when I can.


deniszh commented Jan 9, 2024

@deniszh
The number of WSP files is 1,508,149.
At least about 200,000 of them are only occasionally fed with datapoints.

I can count how many are updated or accessed in the last 24 hours if that might help?

That's not much. I can check our prod memory consumption to compare. OTOH, we're using the trie index and trigram is disabled, IIRC.
Pprof is a great tool for live debugging of Go programs; try to enable it on localhost and experiment. You can even try it on a laptop.
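
A typical session could look something like this (a sketch only, assuming the pprof listener from the config above is enabled; the svg output needs graphviz installed):

# fetch the current heap profile and open an interactive pprof session
$ go tool pprof http://127.0.0.1:7007/debug/pprof/heap
(pprof) top     # largest in-use allocations by function
(pprof) svg     # render the call graph to an SVG file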


interfan7 commented Jan 16, 2024

@deniszh
I've fetched a heap profile from pprof. SVG file attached. When opened locally in a browser, it's very convenient to zoom and move through it.

[attachment: pprof_heap_graphite_go_1]

Would you mind telling me whether anything interesting/suspicious is observable in it?

Since we configured it to be the target of our entire prod, it takes go-carbon not much longer than a day to reach OOM.
We plan to change the instance type to move from 64GB to 128GB and see whether the memory consumption stops at some point. As you said, OOM doesn't necessarily mean there is a memory leak, although it's interesting that the memory occupation grows slowly and steadily over quite a while; that's why we thought it might be a leak.


deniszh commented Jan 16, 2024

@interfan7: that's a memory snapshot, and one snapshot doesn't give you much info.
It's more interesting how it changes over time, i.e. what exactly grows.
BTW, I checked our prod servers. For example, for 4M metrics I see that go-carbon consumes 20-30GB RAM.
Why do you use such huge 'max-metrics-globbed' and 'max-metrics-rendered'? If I'm reading the SVG right, half of your data is glob cache. We're perfectly fine using

max-metrics-rendered = 10001
max-metrics-globbed  = 90000

The defaults are less strict, but your numbers are unusually high.

interfan7 (Author) commented:

@deniszh

go-carbon consumes 20-30GB RAM

How do you measure that? There are various ways to gauge a service's/process's memory occupation.
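
(The ways I know of measure slightly different things; a sketch, assuming the process is named go-carbon:)

# cgroup memory usage in bytes, as shown in the "Memory:" line of systemctl status
$ systemctl show go-carbon.service -p MemoryCurrent

# resident set size of the process itself, in kB
$ ps -o rss= -p "$(pidof go-carbon)"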

Why do you use such huge 'max-metrics-globbed' and 'max-metrics-rendered'?

I think when we set up the node, the Grafana users complained that they were missing data or metrics in the results, and changing this value seemed to resolve it. However, we just set a very high value without gradual try-and-see cycles.
Having said that, if that's the cause of the high memory usage, then why isn't the usage fluctuating over time instead of steadily rising? That's why we thought maybe there are leaks.

I'll get heap profiles at 2 more points in time between the service's start and "end" (i.e. somewhat before OOM). I've read that pprof is capable of comparing profiles.
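
The comparison could look something like this (a sketch; the file names are hypothetical, and -base is pprof's flag for diffing a later profile against an earlier one):

# save heap profiles at two points in time
$ curl -s -o heap_early.pb.gz http://127.0.0.1:7007/debug/pprof/heap
$ curl -s -o heap_late.pb.gz  http://127.0.0.1:7007/debug/pprof/heap

# show what grew between the two snapshots
$ go tool pprof -base heap_early.pb.gz heap_late.pb.gz
(pprof) top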

flucrezia commented:

Hi @interfan7,
have you tried increasing the config attributes max-cpu and workers? If processing can't keep up with the query rate and load, then memory consumption could increase.
I presume your prod machine has more than 4 vCores if it handles 128GB of RAM.
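
For example, something along these lines (a sketch only; 8 is an arbitrary illustrative value and should be matched to the actual core count):

[common]
# allow go-carbon to use more CPU cores
max-cpu = 8

[whisper]
# more concurrent whisper write workers
workers = 8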

interfan7 (Author) commented:

@flucrezia
The cores actually seem fairly relaxed, so I haven't thought this could be the issue.

I decreased the 2 params mentioned above about 2 days ago, and I want to see whether the memory will grow to 100GB+ again.

If I conclude that reducing those params doesn't resolve the issue, at least not for a 128GB machine, then I may try your suggestion 🙏🏻
