
prometheus 2.5.0 out of memory #5120

Closed
communityexplorer opened this Issue Jan 20, 2019 · 4 comments

communityexplorer commented Jan 20, 2019

Bug Report

What did you do?
My Prometheus server started at 2019-01-15T12:58:57, but at 2019-01-18T17:09:24 it crashed with an out-of-memory error. After a short investigation I noticed the data directory size was more than 200GB; usually it is less than 50GB. New files were appearing every minute, whereas normally one block file covers two hours. Restarting Prometheus also ran out of memory. After we removed many of the files it recovered, but we lost the data from 2019-01-18T17:09:24.
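For reference, this kind of abnormal growth is easy to spot from the shell. A minimal diagnostic sketch, assuming the data directory is /data (per the --storage.tsdb.path flag); DATA_DIR is just an adjustable stand-in:

```shell
# Check total size and the number of TSDB block directories (ULID-named,
# starting with "01"). A healthy server cuts one block roughly every two
# hours, so a count that grows every minute points at retried compactions.
DATA_DIR="${DATA_DIR:-/data}"
du -sh "$DATA_DIR" 2>/dev/null || true
ls -d "$DATA_DIR"/01* 2>/dev/null | wc -l
```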

What did you expect to see?
I expected not to see this error, and for Prometheus to recover quickly.

What did you see instead? Under which circumstances?
Repeated compaction failures with "mmap: cannot allocate memory" errors, followed by a fatal out-of-memory crash; see the logs below.

Environment

  • System information:

    level=info ts=2019-01-15T12:58:57.901538385Z caller=main.go:246 host_details="(Linux 3.0.101-0.47.71-default #1 SMP Thu Nov 12 12:22:22 UTC 2015 (b5b212e) x86_64 Prometheus_Server )"

  • Prometheus version:
    version="(version=2.5.0, branch=HEAD, revision=67dc912ac8b24f94a1fc478f352d25179c94ab9b)"
    build_context="(go=go1.11.1, user=root@578ab108d0b9, date=20181106-11:40:44)"

  • Prometheus configuration file:

   runtime flags
        - '--storage.tsdb.path=/data'
        - '--storage.tsdb.retention=90d'
        - '--web.enable-lifecycle'  
  • Logs:
Log details follow:

1. The first error occurred from 2019-01-16T09:00:26 to 2019-01-16T14:58:05, 126 times in total:

level=error ts=2019-01-16T09:00:26.493644156Z caller=db.go:305 component=tsdb msg="compaction failed" err="compact [/data/01D172NP3PGN45XYVTK743CHGQ /data/01D190EY59TQA6TYHBPE62KYFF /data/01D1AY7GFAF96VA5BJN30Q0N1K]: mmap files: mmap: cannot allocate memory"
level=error ts=2019-01-16T09:00:27.494540662Z caller=db.go:305 component=tsdb msg="compaction failed" err="compact [/data/01D172NP3PGN45XYVTK743CHGQ /data/01D190EY59TQA6TYHBPE62KYFF /data/01D1AY7GFAF96VA5BJN30Q0N1K]: mmap files: mmap: cannot allocate memory"
level=error ts=2019-01-16T09:01:31.4956212Z caller=db.go:305 component=tsdb msg="compaction failed" err="compact [/data/01D172NP3PGN45XYVTK743CHGQ /data/01D190EY59TQA6TYHBPE62KYFF /data/01D1AY7GFAF96VA5BJN30Q0N1K]: mmap files: mmap: cannot allocate memory"

2. The second error occurred after the first, from 2019-01-16T15:00:02 to 2019-01-18T17:09:17, 2850 times in 3009 minutes, i.e. roughly one error log per minute:

level=error ts=2019-01-16T15:01:04.345406506Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: open block /data/01D1BJT6D6DNF19W00DQ0G4TB2: mmap files: mmap: cannot allocate memory"
level=info ts=2019-01-16T15:02:06.233455554Z caller=compact.go:398 component=tsdb msg="write block" mint=1547640000000 maxt=1547647200000 ulid=01D1BJXZSSEMCQ2E9FWHZTK2XC
level=error ts=2019-01-16T15:02:06.454382796Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: open block /data/01D1BJT6D6DNF19W00DQ0G4TB2: mmap files: mmap: cannot allocate memory"
level=info ts=2019-01-16T15:03:08.523632278Z caller=compact.go:398 component=tsdb msg="write block" mint=1547640000000 maxt=1547647200000 ulid=01D1BJZWEPM1MTTT2TNDKRQ372
level=error ts=2019-01-16T15:03:09.086948094Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: open block /data/01D1BJT6D6DNF19W00DQ0G4TB2: mmap files: mmap: cannot allocate memory"

3. The final out-of-memory crash at 2019-01-18T17:09:24:
level=error ts=2019-01-18T17:09:24.822704884Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: open block /data/01D1BJT6D6DNF19W00DQ0G4TB2: mmap files: mmap: cannot allocate memory"
fatal error: runtime: cannot allocate memory
runtime stack:
runtime.throw(0x1ae8f38, 0x1f)
	/usr/local/go/src/runtime/panic.go:608 +0x72
runtime.persistentalloc1(0x4000, 0x0, 0x2db85b8, 0x2b0aa907da40)
	/usr/local/go/src/runtime/malloc.go:1190 +0x280
runtime.persistentalloc.func1()
	/usr/local/go/src/runtime/malloc.go:1143 +0x45
runtime.persistentalloc(0x4000, 0x0, 0x2db85b8, 0x2d9f528)
	/usr/local/go/src/runtime/malloc.go:1142 +0x82
runtime.(*fixalloc).alloc(0x2da21d8, 0x1)
	/usr/local/go/src/runtime/mfixalloc.go:80 +0x109
runtime.(*mheap).allocSpanLocked(0x2d9ed20, 0x1, 0x2db8588, 0x5c000e263e3)
	/usr/local/go/src/runtime/mheap.go:873 +0x13a
runtime.(*mheap).alloc_m(0x2d9ed20, 0x1, 0x2b348c100065, 0xc000098480)
	/usr/local/go/src/runtime/mheap.go:692 +0x119

4. Restarting Prometheus also ran out of memory:
level=error ts=2019-01-18T18:02:37.645414684Z caller=main.go:625 err="opening storage failed: open block /data/01D1BWBXMCCCNYVVBM9XCSP4V4: mmap files: mmap: cannot allocate memory"
simonpasquier (Member) commented Jan 21, 2019

This looks very similar to #4392.

communityexplorer (Author) commented Jan 22, 2019

@simonpasquier
@simonpasquier
I had seen #4392; that one was likely someone running a 64-bit build on a 32-bit system, but my Prometheus build and system are both 64-bit.
Here is my Prometheus package: prometheus-2.5.0.linux-amd64.tar.gz
System version:
$uname -a
Linux PR_Prometheus_Server 3.0.101-0.47.71-default #1 SMP Thu Nov 12 12:22:22 UTC 2015 (b5b212e) x86_64 x86_64 x86_64 GNU/Linux
Some other relevant kernel parameters:
$cat /proc/sys/vm/max_map_count
65530
$cat /proc/sys/kernel/shmmni
4096
$cat /proc/sys/kernel/shmmax
4294967296
$cat /proc/sys/kernel/shmall
1725638
$cat /proc/meminfo
MemTotal: 65585036 kB
MemFree: 40688260 kB
Buffers: 517720 kB
Cached: 20542900 kB
SwapCached: 0 kB
Active: 2988524 kB
Inactive: 20486284 kB
Active(anon): 2416928 kB
Inactive(anon): 52 kB
Active(file): 571596 kB
Inactive(file): 20486232 kB
Unevictable: 5704 kB
Mlocked: 5704 kB
SwapTotal: 16779260 kB
SwapFree: 16779260 kB
Dirty: 620 kB
Writeback: 0 kB
AnonPages: 2420064 kB
Mapped: 922924 kB
Shmem: 168 kB
Slab: 715572 kB
SReclaimable: 678008 kB
SUnreclaim: 37564 kB
KernelStack: 2752 kB
PageTables: 29912 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 49571776 kB
Committed_AS: 11631060 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 126060 kB
VmallocChunk: 34326056744 kB
HardwareCorrupted: 0 kB
AnonHugePages: 622592 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 87552 kB
DirectMap2M: 4106240 kB
DirectMap1G: 62914560 kB
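Given that MemFree is ~40GB here, one non-RAM explanation for the mmap failures is the vm.max_map_count limit (65530 above): every mmap'd chunk file counts against it. A rough diagnostic sketch; $$ (the shell's own PID) is only a stand-in so the commands run anywhere, substitute the Prometheus PID for a live check:

```shell
# Compare the per-process memory-map count against the kernel limit.
# "mmap: cannot allocate memory" is also returned when this limit is
# exhausted, even with plenty of free RAM.
PID=$$   # stand-in; for a live server use e.g. PID=$(pgrep -o prometheus)
echo "max_map_count: $(cat /proc/sys/vm/max_map_count)"
echo "current maps:  $(wc -l < /proc/$PID/maps)"
```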

simonpasquier (Member) commented Jan 22, 2019

Thanks for the additional information, though I don't see anything obvious that would explain the issue. The tsdb code certainly shouldn't fill the filesystem when it encounters "mmap: cannot allocate memory" errors, but we haven't found a way to reproduce the problem yet. If you manage to reproduce it, you can try to take a heap profile while it is happening:

curl -s http://localhost:9090/debug/pprof/heap > heap.out

Do you run queries (dashboards or rules) with large intervals?

As a workaround, one user reported in #4392 that the error went away when running Prometheus as a container. If this is an option for you, it may be worth trying.
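A usage sketch for the profile above, assuming Prometheus is reachable on localhost:9090 and the Go toolchain is installed on whatever machine you analyze the file on:

```shell
# Capture a heap profile while the memory growth is happening, then print
# the largest heap consumers non-interactively. Falls back to a message if
# the server is unreachable.
(curl -s http://localhost:9090/debug/pprof/heap -o heap.out \
  && go tool pprof -top heap.out) \
  || echo "Prometheus not reachable on :9090"
```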

simonpasquier (Member) commented Feb 1, 2019

Thanks for reporting! I'm closing it as this is a duplicate of #5135. The tsdb fix is prometheus/tsdb#512 (still under discussion).
