This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

metrictank memory issues #2009

Closed · thmour opened this issue Oct 11, 2021 · 6 comments

thmour commented Oct 11, 2021

We are trying to use Metrictank (v1.1) to move away from our graphite monitoring, and we have a couple of issues regarding memory.
We run scylladb and metrictank together on one machine and try to limit the memory that metrictank accumulates over time. In scylladb you can set a maximum amount of memory and it will operate within that limit.
In metrictank you instead have to tune some parameters so that memory is flushed to the DB backend fast enough to avoid memory issues. I ran pprof on metrictank and got these values:

(pprof) top 10 
Showing nodes accounting for 12656.25MB, 92.23% of 13722.43MB total
Dropped 139 nodes (cum <= 68.61MB)
Showing top 10 nodes out of 55
      flat  flat%   sum%        cum   cum%
 2317.89MB 16.89% 16.89%  2855.42MB 20.81%  github.com/grafana/metrictank/mdata.NewAggMetric
 2211.26MB 16.11% 33.01%  5901.53MB 43.01%  github.com/grafana/metrictank/idx/memory.(*UnpartitionedMemoryIdx).add
 2168.21MB 15.80% 48.81%  2168.21MB 15.80%  github.com/grafana/metrictank/idx/memory.(*TagIndex).addTagId (inline)
 1522.06MB 11.09% 59.90%  1522.06MB 11.09%  github.com/grafana/metrictank/idx/memory.defByTagSet.add
 1271.10MB  9.26% 69.16%  1271.10MB  9.26%  github.com/grafana/metrictank/mdata/chunk.New
 1081.16MB  7.88% 77.04%  1199.17MB  8.74%  github.com/grafana/metrictank/idx/memory.createArchive
  865.91MB  6.31% 83.35%   865.91MB  6.31%  github.com/grafana/metrictank/mdata/chunk/tsz.(*bstream).writeByte
  490.62MB  3.58% 86.92%   490.62MB  3.58%  github.com/grafana/metrictank/mdata/chunk/tsz.(*bstream).writeBit
  412.02MB  3.00% 89.93%   412.02MB  3.00%  bytes.(*Buffer).String (inline)
  316.02MB  2.30% 92.23%  2567.38MB 18.71%  github.com/grafana/metrictank/mdata.NewAggregator

while top shows this output (27.7g resident), so how do I check where the rest of the RSS memory goes?

876188 root      20   0   30.3g  27.7g  12832 S  37.1  48.2   6729:21 metrictank
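
(For reference, the pprof heap profile above only covers live Go heap allocations, so it will normally read lower than the process RSS. A quick way to cross-check the RSS outside of top - plain Linux tooling, nothing Metrictank-specific, assuming a single metrictank process - is:)

# resident set size as reported by the kernel, in kB
grep VmRSS /proc/$(pidof metrictank)/status
# or via ps, also in kB
ps -o rss= -p $(pidof metrictank)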

I previously changed the config to:

# max age for a chunk before to be considered stale and to be persisted to Cassandra
chunk-max-stale = 30m
# max age for a metric before to be considered stale and to be purged from in-memory ring buffer.
metric-max-stale = 1h
# Interval to run garbage collection job
gc-interval = 1h

But the memory still keeps accumulating, so how should I proceed from here?
Are there any parameters to reduce the memory used? Shouldn't GC take care of this memory?
If I set MemoryMax on systemd I will get periodic holes in our metrics every time it restarts.
Having a second process as a replica, just to keep at least one active while the other restarts from memory issues, should not be the solution.
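
(For clarity, the systemd limit mentioned above would be a hard cap along these lines - just a sketch, the unit and drop-in file names are assumptions - and when the cgroup exceeds it, the process gets killed and restarted, which is what causes the holes in the metrics:)

# /etc/systemd/system/metrictank.service.d/memory.conf (hypothetical drop-in)
[Service]
MemoryMax=28G
Restart=on-failure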

I have also attached some screenshots to help understand the usage. At around 12:00 on 5/10 I restarted the process with the latest config. Most of it is similar to the default, but with 500 read connections to scylladb and the changes I mentioned before.
Screenshot from 2021-10-11 15-54-16
Screenshot from 2021-10-11 15-54-38
Screenshot from 2021-10-11 15-55-17

thmour added the bug label on Oct 11, 2021
thmour changed the title from "pprof accounting half of memory used" to "metrictank memory issues" on Oct 11, 2021

deniszh commented Oct 25, 2021

Another Metrictank user here. It's definitely a memory hog. Also, IMO Metrictank is not really designed to work as a standalone application. E.g. during a restart you will lose access to metrics until the replay from Kafka has finished, so you need at least a pair of instances. It's designed to be used in cluster installations with orchestrated control. In our case, if we see the number of metrics increasing (which causes memory pressure), we just increase the number of nodes and do a rolling restart.
Also, sharing a single node between services is clearly an antipattern.
In my opinion, if you do not want to use clustering software, try:

  1. go-carbon + carbonapi - hardly scalable, but OK for a single node - see e.g. https://github.com/go-graphite/docker-go-graphite for a config example. Also, it does not support tags.
  2. Graphite-clickhouse - a bit more complex, requires a ClickHouse installation - an example is https://github.com/lomik/graphite-clickhouse-tldr
  3. Victoriametrics + carbonapi - less tested, an example is https://github.com/deniszh/graphite-victoriametrics-tldr (VM has built-in Graphite compatibility, but only in the Enterprise version).

shanson7 (Collaborator) commented:

If you look at the "metrics active" graph, it seems like there are new series being indexed regularly. You might need to set up pruning in the index-rules config to trim off stale series (e.g. not seen in 3 days). This will make those series unquery-able, so make sure it's set appropriately.
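
A minimal index-rules.conf sketch of what that pruning could look like (the section name, pattern, and max-stale values below are placeholders to adapt to your own naming scheme and retention needs):

# prune series matching this pattern if they have not been seen for 3 days
[ephemeral]
pattern = ^containers\.
max-stale = 3d

# catch-all: never prune anything else
[default]
pattern =
max-stale = 0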

Also, I agree with @deniszh about running a single instance. MT is meant to scale ingest linearly, so partitioning across multiple instances is the way to grow.

thmour commented Oct 25, 2021

@deniszh We already use go-carbon + whisperdb and it has already reached its limits: 99% disk utilization, and we can't run interesting queries without timeouts. I tried to switch to metrictank with a single-node setup on a 32 CPU + 60GB RAM machine, but it looks like the memory is constantly increasing whatever I try to do. I will start using a second instance, so at least it ping-pongs between those two?

@shanson7 thanks for the insight, there is one subgroup of metrics that is currently at 8.9M. I guess I will need to start pruning it.

GuillaumeConnan (Contributor) commented:

We are experiencing the exact same behavior, with more than 50M existing metrics and ~500k daily new metrics due to ephemeral container or instance IDs in the metric name.

Without index pruning, the MT heap uses something like 250GB of RAM per instance and keeps growing as new metrics are created, which is not suitable in the long term.

On the other hand, with index pruning activated, the MT heap tends to stabilize, but old metrics can no longer be rendered even though the data is still present in the backend.

Is there a way to rely more on the backend to lower memory usage, even if it would degrade (maybe not by much?) request performance?

thmour commented Nov 4, 2021

I have now moved the metrictank instance to a new VM with 60GB of RAM, so it no longer has to run alongside scylladb, I prune metrics that have been inactive for more than a week, and I went from 10M active metrics down to 4M. Now metrictank suddenly uses a lot of memory and does nothing (no render, no metrics ingest) at 50% memory usage. Could this be because a lot of requests (20K packets per minute with varying numbers of metrics) go to the carbon input of metrictank?
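
(For context, the carbon input referred to here is the plaintext listener configured in the [carbon-in] section of metrictank.ini - the snippet below is a sketch based on the documented defaults, and the schemas-file path is an assumption:)

[carbon-in]
enabled = true
# tcp address the plaintext carbon protocol listens on
addr = :2003
# needed to know the raw interval of incoming metrics (not used for retention)
schemas-file = /etc/metrictank/storage-schemas.conf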

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Mar 2, 2022
stale bot closed this as completed on Apr 16, 2022