
Prometheus is eating almost 6GB of memory; how is this possible? Where does the memory usage end? #2358

Closed
xixikaikai opened this Issue Jan 23, 2017 · 15 comments

xixikaikai commented Jan 23, 2017

What did you do?
I made the following changes:
[screenshot of configuration changes]

What did you expect to see?
I expected the memory usage to fall at some point rather than rise endlessly.
When I limit the memory to 1GB, Marathon (DC/OS + Mesos) restarts Prometheus so often that the server cannot even be reached over SSH. I guess the disk is saturated by the restarts, since Prometheus has to recover from disk on startup and write chunks back to disk when memory has to come down.

What did you see instead? Under which circumstances?
curl -s http://172.19.0.176:31090/metrics | grep '^prometheus_local_storage'
prometheus_local_storage_checkpoint_duration_seconds 5.366887795
prometheus_local_storage_chunk_ops_total{type="clone"} 1
prometheus_local_storage_chunk_ops_total{type="create"} 2.575955e+06
prometheus_local_storage_chunk_ops_total{type="load"} 421
prometheus_local_storage_chunk_ops_total{type="persist"} 2.460509e+06
prometheus_local_storage_chunk_ops_total{type="pin"} 8353
prometheus_local_storage_chunk_ops_total{type="transcode"} 2.486157e+06
prometheus_local_storage_chunk_ops_total{type="unpin"} 8353
prometheus_local_storage_chunkdesc_ops_total{type="evict"} 25448
prometheus_local_storage_chunkdesc_ops_total{type="load"} 125
prometheus_local_storage_chunks_to_persist 94579
prometheus_local_storage_fingerprint_mappings_total 0
prometheus_local_storage_inconsistencies_total 0
prometheus_local_storage_indexing_batch_duration_seconds{quantile="0.5"} 0.014941447000000002
prometheus_local_storage_indexing_batch_duration_seconds{quantile="0.9"} 0.016533442000000002
prometheus_local_storage_indexing_batch_duration_seconds{quantile="0.99"} 0.019961967
prometheus_local_storage_indexing_batch_duration_seconds_sum 137.80948704499997
prometheus_local_storage_indexing_batch_duration_seconds_count 11727
prometheus_local_storage_indexing_batch_sizes{quantile="0.5"} 1
prometheus_local_storage_indexing_batch_sizes{quantile="0.9"} 1
prometheus_local_storage_indexing_batch_sizes{quantile="0.99"} 1
prometheus_local_storage_indexing_batch_sizes_sum 11995
prometheus_local_storage_indexing_batch_sizes_count 11727
prometheus_local_storage_indexing_queue_capacity 16384
prometheus_local_storage_indexing_queue_length 0
prometheus_local_storage_ingested_samples_total 1.208599541e+09
prometheus_local_storage_maintain_series_duration_seconds{location="archived",quantile="0.5"} NaN
prometheus_local_storage_maintain_series_duration_seconds{location="archived",quantile="0.9"} NaN
prometheus_local_storage_maintain_series_duration_seconds{location="archived",quantile="0.99"} NaN
prometheus_local_storage_maintain_series_duration_seconds_sum{location="archived"} 0
prometheus_local_storage_maintain_series_duration_seconds_count{location="archived"} 0
prometheus_local_storage_maintain_series_duration_seconds{location="memory",quantile="0.5"} 0.003913998
prometheus_local_storage_maintain_series_duration_seconds{location="memory",quantile="0.9"} 0.006114805
prometheus_local_storage_maintain_series_duration_seconds{location="memory",quantile="0.99"} 0.022362947
prometheus_local_storage_maintain_series_duration_seconds_sum{location="memory"} 1289.5624914789798
prometheus_local_storage_maintain_series_duration_seconds_count{location="memory"} 285090
prometheus_local_storage_max_chunks_to_persist 3.333332e+06
prometheus_local_storage_memory_chunkdescs 3.155664e+06
prometheus_local_storage_memory_chunks 2.576376e+06
prometheus_local_storage_memory_series 20889
prometheus_local_storage_non_existent_series_matches_total 0
prometheus_local_storage_out_of_order_samples_total{reason="multiple_values_for_timestamp"} 0
prometheus_local_storage_out_of_order_samples_total{reason="timestamp_out_of_order"} 0
prometheus_local_storage_persist_errors_total 0
prometheus_local_storage_persistence_urgency_score 0.02837491134996454
prometheus_local_storage_rushed_mode 0
prometheus_local_storage_series_ops_total{type="archive"} 1248
prometheus_local_storage_series_ops_total{type="create"} 11995
prometheus_local_storage_series_ops_total{type="maintenance_in_archive"} 0
prometheus_local_storage_series_ops_total{type="maintenance_in_memory"} 285090
prometheus_local_storage_series_ops_total{type="purge_from_archive"} 0
prometheus_local_storage_series_ops_total{type="purge_from_memory"} 0
prometheus_local_storage_series_ops_total{type="purge_on_request"} 0
prometheus_local_storage_series_ops_total{type="quarantine_completed"} 0
prometheus_local_storage_series_ops_total{type="quarantine_dropped"} 0
prometheus_local_storage_series_ops_total{type="quarantine_failed"} 0
prometheus_local_storage_series_ops_total{type="unarchive"} 6
prometheus_local_storage_started_dirty 0

Environment
4 CPUs
16GB RAM

  • System information:
    Linux 3.10.0-327.36.1.el7.x86_64 x86_64
    CentOS 7.0+

  • Prometheus version:
    1.4.1

  • Alertmanager version:
    none

  • Prometheus configuration file:
    global:
      scrape_interval: 15s # By default, scrape targets every 15 seconds.

      # Attach these labels to any time series or alerts when communicating with
      # external systems (federation, remote storage, Alertmanager).
      external_labels:
        monitor: 'lkt-monitor-prod'

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label job=<job_name> to any timeseries scraped from this config.
      - job_name: 'lkt-prometheus-prod'
        # Override the global default and scrape targets from this job every 5 seconds.
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9090']
            labels:
              instance: prometheus

      - job_name: 'host'
        scrape_interval: 5s
        metrics_path: '/metrics'
        scheme: 'http'
        static_configs:
          - targets: [
              '172.19.0.175:31902',
              '172.19.0.176:31902',
              '172.19.0.177:31902',
              '172.19.0.176:31666',
              '172.19.0.176:31888'
            ]

PS:
31666 is the mesos exporter metrics endpoint
31888 is the marathon exporter metrics endpoint

  • Alertmanager configuration file:
    none

  • Logs:

(none provided)

xixikaikai commented Jan 23, 2017

I added the following args to Prometheus on 176, since we have enough memory (see #1459):
"args": [
"-storage.local.memory-chunks=6666666",
"-storage.local.max-chunks-to-persist=3333332",
"-config.file=/etc/prometheus/prometheus.yml"
]


xixikaikai commented Jan 23, 2017

This is the output of the cloud node (this is when I limit the memory usage to 1GB):
[screenshot of the node's output]
The Prometheus server cannot be accessed over SSH.


brian-brazil commented Jan 23, 2017

With those settings Prometheus will use at least 26GB of RAM.


xixikaikai commented Jan 23, 2017

How did you calculate 26GB of RAM?
I only set memory-chunks=6666666.


xixikaikai commented Jan 23, 2017

What is your advice for lowering the RAM usage?
My node only has 16GB of memory, and if usage keeps rising, the machine will be restarting all the time. Thanks.


songjiayang commented Jan 23, 2017

This confuses me. As far as I know, a chunk is roughly 1KB, so memory-chunks=6666666 should come to about 6GB. Am I missing something?


brian-brazil commented Jan 23, 2017

There are various overheads, so on 1.4.1 you're talking a minimum of 3.9KB per chunk.
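
As a rough back-of-the-envelope check of that 26GB figure, assuming the ~3.9KB-per-chunk minimum quoted above:

6,666,666 memory chunks x ~3.9KB/chunk ≈ 26,000,000KB ≈ 26GB

So the -storage.local.memory-chunks=6666666 setting alone implies on the order of 26GB of RAM before any other overhead.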


xixikaikai commented Jan 23, 2017

Is there a formula showing that the RAM usage will be 26GB?
I don't know whether my Prometheus will eat up all of its 6GB of memory in the end.
And when Prometheus eats up all the memory Marathon offers,
the server becomes abnormal and cannot be accessed over SSH.


xixikaikai commented Jan 23, 2017

What is the relationship between a minimum of 3.9KB per chunk and 26GB of RAM? @brian-brazil
Thank you very much for your kind answer, I appreciate it!


xixikaikai commented Jan 23, 2017

Yeah, it's now using almost 6GB because I raised the chunk values, and when it reaches 6GB the Prometheus server tends to crash and keeps restarting until I have no choice but to restart the Prometheus node manually. @songjiayang


songjiayang commented Jan 24, 2017

@xixikaikai That's too bad. You can try setting max-chunks-to-persist to a smaller value.
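
For illustration only (values not from the thread, assuming the ~3.9KB-per-chunk figure quoted above), a more conservative pair of settings for a 16GB host might look something like:

"-storage.local.memory-chunks=2000000",
"-storage.local.max-chunks-to-persist=1000000",

That budgets roughly 2,000,000 x 3.9KB ≈ 8GB for chunk data and keeps max-chunks-to-persist at about half of memory-chunks, leaving headroom for indexing, queries, and the rest of the system.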


xixikaikai commented Jan 24, 2017

@songjiayang Have you encountered this problem before in production?


songjiayang commented Jan 24, 2017

@xixikaikai Memory problems are a big deal. Maybe you can find more information at https://prometheus.io/docs/operating/storage/#persistence-pressure-and-rushed-mode
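
Roughly speaking (going by that doc and the metrics already quoted earlier in this issue), the series worth watching while tuning are:

prometheus_local_storage_memory_chunks
prometheus_local_storage_chunks_to_persist
prometheus_local_storage_persistence_urgency_score
prometheus_local_storage_rushed_mode
process_resident_memory_bytes

If the urgency score keeps climbing toward 1 and rushed mode stays at 1, persistence is not keeping up with ingestion.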


xixikaikai commented Jan 25, 2017

@brian-brazil So you mean that one memory chunk in Prometheus takes about 3.9KB, and we can calculate the RAM usage from the configuration.
In that case the default Prometheus setup uses almost 4GB of RAM, which is not great for most servers.
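
For reference, that ~4GB figure is consistent with the 1.x defaults, assuming the default storage.local.memory-chunks=1048576 and the ~3.9KB-per-chunk minimum quoted above:

1,048,576 chunks x ~3.9KB/chunk ≈ 4.1GB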


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 23, 2019
