Prometheus OOM #5266

Closed
XSHui opened this Issue Feb 25, 2019 · 4 comments

XSHui commented Feb 25, 2019

Bug Report

What did you do?
We use Prometheus to monitor a TiDB cluster.

What did you expect to see?
Prometheus works well.

What did you see instead? Under which circumstances?
Prometheus sometimes gets OOM-killed (only one cluster is affected; the other clusters are fine).

Environment

  • System information:

    Linux 4.14.0-1.el7.*****.x86_64 x86_64

  • Prometheus version:

    prometheus:v2.2.1

  • Logs:

{"log":"level=warn ts=2019-02-25T04:03:08.305497727Z caller=queue_manager.go:224 component=remote msg=\"Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed.\"\n","stream":"stderr","time":"2019-02-25T04:03:08.305735716Z"}
{"log":"level=info ts=2019-02-25T04:03:15.792538875Z caller=queue_manager.go:337 component=remote msg=\"Currently resharding, skipping.\"\n","stream":"stderr","time":"2019-02-25T04:03:15.79276374Z"}
{"log":"level=warn ts=2019-02-25T04:03:18.305498693Z caller=queue_manager.go:224 component=remote msg=\"Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed.\"\n","stream":"stderr","time":"2019-02-25T04:03:18.305677258Z"}
{"log":"level=info ts=2019-02-25T04:03:25.792549804Z caller=queue_manager.go:337 component=remote msg=\"Currently resharding, skipping.\"\n","stream":"stderr","time":"2019-02-25T04:03:25.792685467Z"}
{"log":"level=warn ts=2019-02-25T04:03:28.305496462Z caller=queue_manager.go:224 component=remote msg=\"Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed.\"\n","stream":"stderr","time":"2019-02-25T04:03:28.305639207Z"}
{"log":"level=info ts=2019-02-25T04:03:35.792549961Z caller=queue_manager.go:337 component=remote msg=\"Currently resharding, skipping.\"\n","stream":"stderr","time":"2019-02-25T04:03:35.79268979Z"}
{"log":"level=warn ts=2019-02-25T04:03:38.305500222Z caller=queue_manager.go:224 component=remote msg=\"Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed.\"\n","stream":"stderr","time":"2019-02-25T04:03:38.305682879Z"}
{"log":"level=info ts=2019-02-25T04:03:45.792529401Z caller=queue_manager.go:337 component=remote msg=\"Currently resharding, skipping.\"\n","stream":"stderr","time":"2019-02-25T04:03:45.792669703Z"}
{"log":"level=warn ts=2019-02-25T04:03:48.305498944Z caller=queue_manager.go:224 component=remote msg=\"Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed.\"\n","stream":"stderr","time":"2019-02-25T04:03:48.305708252Z"}
{"log":"level=info ts=2019-02-25T04:03:55.792499868Z caller=queue_manager.go:337 component=remote msg=\"Currently resharding, skipping.\"\n","stream":"stderr","time":"2019-02-25T04:03:55.792598684Z"}
{"log":"level=warn ts=2019-02-25T04:03:58.305498629Z caller=queue_manager.go:224 component=remote msg=\"Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed.\"\n","stream":"stderr","time":"2019-02-25T04:03:58.30574138Z"}
{"log":"level=info ts=2019-02-25T04:04:05.79249142Z caller=queue_manager.go:334 component=remote msg=\"Remote storage resharding\" from=3 to=31\n","stream":"stderr","time":"2019-02-25T04:04:05.79266542Z"}
{"log":"level=info ts=2019-02-25T04:05:56.846535886Z caller=main.go:220 msg=\"Starting Prometheus\" version=\"(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)\"\n","stream":"stderr","time":"2019-02-25T04:05:56.846711498Z"}
{"log":"level=info ts=2019-02-25T04:05:56.846597305Z caller=main.go:221 build_context=\"(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)\"\n","stream":"stderr","time":"2019-02-25T04:05:56.846754762Z"}
Feb 25 12:05:56 [localhost] kernel: prometheus invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=0
Feb 25 12:05:56 [localhost] kernel: prometheus cpuset=18daf50ee3581581cedafbd17142227dda1d2e7b85c81cf856ab1ee08654b0f8 mems_allowed=0-1
Feb 25 12:05:56 [localhost] kernel: CPU: 9 PID: 34680 Comm: prometheus Tainted: G        W       4.14.0-1.el7.ucloud.x86_64 #1
Feb 25 12:05:56 [localhost] kernel: Hardware name: H3C R4900 G2/RS32M2C9S, BIOS 1.01.13 08/31/2017
Feb 25 12:05:56 [localhost] kernel: Call Trace:
Feb 25 12:05:56 [localhost] kernel: dump_stack+0x5c/0x83
Feb 25 12:05:56 [localhost] kernel: dump_header+0x9c/0x23c
Feb 25 12:05:56 [localhost] kernel: ? mem_cgroup_scan_tasks+0x8c/0xe0
Feb 25 12:05:56 [localhost] kernel: oom_kill_process+0x228/0x420
Feb 25 12:05:56 [localhost] kernel: out_of_memory+0x110/0x490
Feb 25 12:05:56 [localhost] kernel: mem_cgroup_out_of_memory+0x49/0x80
Feb 25 12:05:56 [localhost] kernel: mem_cgroup_oom_synchronize+0x2e3/0x310
Feb 25 12:05:56 [localhost] kernel: ? get_mctgt_type_thp.isra.30+0xc0/0xc0
Feb 25 12:05:56 [localhost] kernel: pagefault_out_of_memory+0x2f/0x74
Feb 25 12:05:56 [localhost] kernel: __do_page_fault+0x440/0x4d0
Feb 25 12:05:56 [localhost] kernel: do_page_fault+0x33/0x120
Feb 25 12:05:56 [localhost] kernel: ? page_fault+0x36/0x60
Feb 25 12:05:56 [localhost] kernel: page_fault+0x4c/0x60
Feb 25 12:05:56 [localhost] kernel: RIP: 0033:0x45b0f3
Feb 25 12:05:56 [localhost] kernel: RSP: 002b:000000c4432d5e98 EFLAGS: 00010287
Feb 25 12:05:56 [localhost] kernel: Task in /docker/18daf50ee3581581cedafbd17142227dda1d2e7b85c81cf856ab1ee08654b0f8 killed as a result of limit of /docker/18daf50ee3581581cedafbd17142227dda1d2e7b85c81cf856ab1ee08654b0f8
Feb 25 12:05:56 [localhost] kernel: memory: usage 1048576kB, limit 1048576kB, failcnt 3
Feb 25 12:05:56 [localhost] kernel: memory+swap: usage 1048576kB, limit 1048576kB, failcnt 5803345
Feb 25 12:05:56 [localhost] kernel: kmem: usage 10772kB, limit 9007199254740988kB, failcnt 0
Feb 25 12:05:56 [localhost] kernel: Memory cgroup stats for /docker/18daf50ee3581581cedafbd17142227dda1d2e7b85c81cf856ab1ee08654b0f8: cache:0KB rss:1037804KB rss_huge:344064KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:1036828KB inactive_file:0KB active_file:0KB unevictable:0KB
Feb 25 12:05:56 [localhost] kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Feb 25 12:05:56 [localhost] kernel: [30160] 65534 30160  1785204   253386     608      13        0             0 prometheus
Feb 25 12:05:56 [localhost] kernel: Memory cgroup out of memory: Kill process 30160 (prometheus) score 968 or sacrifice child
Feb 25 12:05:56 [localhost] kernel: Killed process 30160 (prometheus) total-vm:7140816kB, anon-rss:989352kB, file-rss:24192kB, shmem-rss:0kB
Feb 25 12:05:56 [localhost] kernel: oom_reaper: reaped process 30160 (prometheus), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  • Grafana: (memory-usage screenshot attached in the original issue, not reproduced here)

simonpasquier (Member) commented Feb 25, 2019

You need to use a system with more RAM and/or reduce the memory usage of Prometheus. It looks like you're using remote write and Prometheus can't send samples fast enough to the remote backend so Prometheus buffers samples in memory. Have a look at the remote_write settings to tune this part. You may also want to upgrade to the latest release of Prometheus as it may improve memory usage.
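
For reference, the knobs that matter for memory live under queue_config in the remote_write section of prometheus.yml. A minimal sketch (the URL is a placeholder and the values are only illustrative, not recommended defaults):

remote_write:
- url: http://remote-backend.example/write   # placeholder
  queue_config:
    capacity: 10000              # per-shard buffer; the main memory cost when the backend is slow
    max_shards: 100              # upper bound on concurrent sender shards
    max_samples_per_send: 1000   # samples per outgoing request
    batch_send_deadline: 5s      # flush a partially filled batch after this long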

I'm closing it for now. If you have further questions, please use our user mailing list, which you can also search.

XSHui (Author) commented Feb 26, 2019

Thanks for your reply, @simonpasquier!

Our Prometheus remote_write configuration is below:

remote_write:
- url: http://****:##/write
  remote_timeout: 30s
  queue_config:
    capacity: 100000
    max_shards: 1000
    max_samples_per_send: 100
    batch_send_deadline: 5s
    max_retries: 10
    min_backoff: 30ms
    max_backoff: 100ms

Could you give some advice on how to tune it, and how to confirm that this is the cause of the OOM?

simonpasquier (Member) commented Feb 26, 2019

I'm no remote write expert, but you might want to try the defaults proposed in #5267:

max_shards: 100
max_samples_per_send: 1000

Also, your capacity is quite high.
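
As a rough, untested sketch applied to the config above (only max_shards and max_samples_per_send come from #5267; the lower capacity is an assumption on my part), keeping the other settings unchanged:

queue_config:
  capacity: 10000              # lowered; assumed value, not from #5267
  max_shards: 100              # from #5267
  max_samples_per_send: 1000   # from #5267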

cstyan (Contributor) commented Feb 26, 2019

@XSHui as Simon mentioned, you either need a system with more RAM (given your current config) or you need to change the config. The capacity is the number of samples each remote write shard can buffer before it (force) sends to the endpoint.
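
As a rough upper bound for that config: 100000 samples per shard × up to 1000 shards is 100,000,000 buffered samples; even at 16 bytes per sample (just a value and a timestamp, ignoring label overhead) that is about 1.6 GB, already above the 1048576kB (1 GiB) cgroup limit shown in the kernel log. The resharding log line only shows 3 to 31 shards in practice, but the configured ceiling still allows the queue to outgrow the container.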

If you look at your metrics, how many shards do you have?
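
If it helps, the queue manager exposes shard and drop metrics; I have not verified the exact names in v2.2.1, so they may differ slightly, but they should be close to:

prometheus_remote_storage_shards
rate(prometheus_remote_storage_dropped_samples_total[5m])
rate(prometheus_remote_storage_failed_samples_total[5m])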
