checker memory leak #921

Closed
ihard opened this issue Sep 23, 2023 · 15 comments · Fixed by #1011

ihard commented Sep 23, 2023

BUG REPORT

What version of Moira are you using ([binary] --version)?

checker -version
2023/09/23 17:26:12 maxprocs: Leaving GOMAXPROCS=32: CPU quota undefined
Moira Checker
Version: 2.8.4
Git Commit: 2f5f4b314998a7f0a6d00b689ca9b8b251bfbb0f

Configuration

checker.yaml

---

redis:
  addrs: "10.10.88.101:6380,10.10.88.101:6381"
  dbid: 0
  connection_limit: 2048
  metrics_ttl: 6h
  dial_timeout: 1s
  read_timeout: 10s
  write_timeout: 10s
telemetry:
  listen: ":8092"
  pprof:
    enabled: true
  graphite:
    enabled: true
    runtime_stats: false
    uri: "127.0.0.1:2003"
    prefix: metric-server-1.moira.info
    interval: 60s
checker:
  nodata_check_interval: 120s
  check_interval: 60s
  lazy_triggers_check_interval: 10m
  stop_checking_interval: 180s
  max_parallel_checks: 16
  max_parallel_remote_checks: 0
remote:
  enabled: false
log:
  log_file: stdout
  log_level: info

pprof heap + goroutine
pprof.zip
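(Editor's note: with pprof enabled on the telemetry listener as in the config above, heap and goroutine dumps like the ones in pprof.zip can be pulled over HTTP. A minimal sketch, assuming the checker serves the standard net/http/pprof endpoints on the telemetry listen port, 8092 here; the output file names are arbitrary.)

# assumes the standard /debug/pprof handlers are exposed on telemetry.listen (8092)
curl -o heap.pprof "http://localhost:8092/debug/pprof/heap"
curl -o goroutine.pprof "http://localhost:8092/debug/pprof/goroutine?debug=0"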

Metric trigger queue: screenshot 2023-09-23_17-46-46

Grafana dashboard: screenshots 2023-09-23_18-14-54, 2023-09-23_18-15-49, 2023-09-23_18-16-24, 2023-09-23_18-17-14

What did you expect to see?

No memory leak.

What did you see instead?

Memory usage peaks at 60 GB and the process is OOM-killed.

ihard added the bug label Sep 23, 2023
kissken (Member) commented Sep 23, 2023

@ihard Hi, could you please tell me whether the queue graph was captured at the moment of the problem?

Could you also add the following graphs, please?

aliasByNode(keepLastValue(nonNegativeDerivative(*.*.moira.*.checker.metricEventsHandle.count_ps), 1), 2)

keepLastValue(movingAverage(nonNegativeDerivative(*.*.moira.*.checker.local.triggers.count_ps), '5min'), 1)

aliasByNode(*.*.moira.*.checker.metricEventsHandle.95-percentile, 2)

aliasByNode(*.*.moira.*.checker.local.triggers.95-percentile, 2)

ihard (Author) commented Sep 23, 2023

@kissken Added to the first post.

ihard (Author) commented Sep 25, 2023

In the pprof heap profile, the problematic function is:

14676.30MB 96.79% 96.79% 14676.30MB 96.79% github.com/moira-alert/moira/metric_source.MakeEmptyMetricData (inline)
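(Editor's note: that line is the format go tool pprof prints in its top view for a heap profile. A minimal sketch for reproducing the same view from the attached dump; the heap.pprof file name inside pprof.zip is an assumption.)

# print the top memory consumers from the saved heap profile
go tool pprof -top heap.pprof
# or open it interactively, then run: top, followed by: list MakeEmptyMetricData
go tool pprof heap.pprof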

kissken (Member) commented Sep 25, 2023

@ihard Could you please add whether your triggers are tagged or flat?
And do your triggers include system metrics that are not aggregated by pod?

I'm just guessing: does the problem appear when many pods go down and many come up within a single trigger, or not?

ihard (Author) commented Sep 25, 2023

99% are flat triggers.
The tagged triggers all have fewer than 200 metrics.
Yes, there are some triggers with system metrics whose names are almost always constant.
The problem is not related to a mass start or stop of pods and has been occurring continuously for 24 hours.
More likely, some trigger or triggers end up on a checker node and cause large memory consumption; the process then crashes and the trigger or triggers move to another node.

ihard (Author) commented Sep 25, 2023

Exposing memory consumption per trigger check, either in a debug mode or as metrics, would greatly help in analyzing problems like this.

kissken (Member) commented Oct 9, 2023

Hello, could you please tell us how many metrics match per pattern when you open the /patterns page and sort the metrics column in descending order?

ihard (Author) commented Oct 9, 2023

~100,000 at the time the problem was analyzed; by now the number has grown to ~250,000.
Also, right now the patterns page does not render; the interface shows the error:
Load failed
In the API log:
{"level":"info","module":"api","context":"http","http.method":"GET","http.uri":"http://127.0.0.1:7092/api/pattern","http.protocol":"HTTP/1.0","http.remote_addr":"10.225.88.101:50190","username":"anonymous","http.status":200,"http.content_length":377215140,"elapsed_time_ms":3478,"elapsed_time":"3.478224001s","time":"2023-10-09 22:46:16.994","message":"GET http://127.0.0.1:7092/api/pattern/ HTTP/1.0"}
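(Editor's note: the http.content_length of ~377 MB in that log line is the size of the /api/pattern response, so the pattern list can be sized without the web UI. A rough sketch, assuming the API is reachable at the address from the log and that the response is a JSON object with a top-level "list" array; the jq expression is an assumption about the payload shape.)

# response size in bytes
curl -s http://127.0.0.1:7092/api/pattern | wc -c
# approximate number of patterns, assuming a {"list": [...]} payload
curl -s http://127.0.0.1:7092/api/pattern | jq '.list | length'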

ihard (Author) commented Nov 1, 2023

Updating to 2.9.0 revealed about 100 division-by-zero triggers, which were removed.
During the update to 2.9.0, commands were executed that removed all metrics from Redis:

cli -remove-all-metrics
cli -cleanup-last-checks
cli -cleanup-metrics
cli -cleanup-retentions
cli -cleanup-tags

Triggers with the warning
target t2 declared as alone metrics target but do not have any metrics and saved state in last check
have also been removed.
The checker logs are now clean, but the problem persists.

Patterns page: screenshot 2023-11-01_18-30-04-2

ihard (Author) commented Nov 21, 2023

We removed a large batch of triggers for which no metrics were being received, and the problem stopped.

ihard closed this as completed Nov 21, 2023
almostinf (Member) commented:

In addition, I'd like to add that we found a potential issue that may be causing the increased memory consumption: the moira-pattern-metrics key is not being cleared, which leads to a lot of unnecessary requests to Redis.

Temporary fix for the problem:

  1. SET DEL on a trigger automatically clears this key for that trigger.
  2. Manually deleting metrics, or deleting all NODATA metrics, also clears the key for that trigger.
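(Editor's note: as a rough way to gauge how many leaked entries exist, the keys can be counted directly in Redis. A sketch only, assuming the keys are prefixed with moira-pattern-metrics and that redis-cli can reach one of the instances from the config above.)

# count pattern-metrics keys without blocking Redis (SCAN-based iteration)
redis-cli -h 10.10.88.101 -p 6380 --scan --pattern 'moira-pattern-metrics*' | wc -l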

ihard (Author) commented Feb 26, 2024

This looks very similar to our problem: after deleting the metrics, everything goes back to normal for a while, but then the problem returns.

almostinf (Member) commented Apr 26, 2024

Hi! We added cleanup of the moira-pattern-metrics key to the cli command --cleanup-pattern-metrics, which should greatly reduce the load on the checker and the resources it consumes, since the leaking moira-pattern-metrics keys generated unnecessary database queries.

almostinf self-assigned this Apr 26, 2024
ihard (Author) commented Apr 26, 2024

Will the cli clean the keys in the current database, and has the leak itself already been fixed in some release?

almostinf (Member) commented Apr 26, 2024

Yes, the cli command will clean up the existing garbage; the fix for the leak itself will come in the next release. We recommend running cli with --cleanup-pattern-metrics in a regular cronjob.
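(Editor's note: for example, a nightly crontab entry along these lines; the cli binary path and the -config location are assumptions and depend on your installation.)

# run the pattern-metrics cleanup every night at 03:00 (paths below are placeholders)
0 3 * * * /usr/local/bin/moira-cli -config /etc/moira/cli.yml --cleanup-pattern-metrics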
