Data corruption using Prometheus Docker v2.0.0-alpha.2 image on NFS #2805
Comments
cornelf commented Jun 5, 2017

Confirmed bug for my use case as well.
Does the same happen without EFS? We strongly recommend not using NFS or other networked filesystems.
cornelf commented Jun 5, 2017

@brian-brazil thanks for the amazingly quick feedback. Is there any chance that prometheus:2 will eventually support network-attached storage, or is the goal to support only SAN rather than NAS, making EBS the recommended storage solution for the Prometheus time-series DB?
We support working POSIX filesystems, and recommend they be local for reliability and performance. NFS is not known for being a working POSIX filesystem.
cornelf commented Jun 5, 2017

@les FYI
Regardless, two people reporting this is worth investigating. Practically, lots of people will run it with EBS or similar out of necessity. And many have been doing so with the current storage without issues for a long time.
cornelf commented Jun 5, 2017

I switched the alpha-2 deployment to EBS and will let you know whether the issue still occurs there.
@gouthamve do you have any hunch what could possibly cause us to not write (or even delete) a meta.json file? The code seems super straightforward in that regard.
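For reference, a minimal sketch of the write-temp-then-fsync-then-rename pattern being discussed here (the struct and function names are illustrative, not the actual tsdb code): the metadata is written and synced under a temporary name and only renamed into place afterwards, so a crash should leave either the previous meta.json or no meta.json at all.

```go
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// blockMeta is a hypothetical subset of the fields stored in meta.json.
type blockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"`
	MaxTime int64  `json:"maxTime"`
}

// writeMetaFile writes meta.json atomically: temp file, fsync, then rename.
func writeMetaFile(dir string, meta *blockMeta) error {
	path := filepath.Join(dir, "meta.json")
	tmp := path + ".tmp"

	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	enc := json.NewEncoder(f)
	enc.SetIndent("", "\t")
	if err := enc.Encode(meta); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // flush the file contents to stable storage
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// Atomically replace any existing meta.json with the fully written one.
	return os.Rename(tmp, path)
}

func main() {
	dir := "example-block" // hypothetical block directory
	if err := os.MkdirAll(dir, 0o777); err != nil {
		panic(err)
	}
	meta := &blockMeta{ULID: "01EXAMPLEULID", MinTime: 0, MaxTime: 7200000}
	if err := writeMetaFile(dir, meta); err != nil {
		panic(err)
	}
}
```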
fabxc modified the milestone: v2.x on Jun 6, 2017
fabxc added the dev-2.0 label on Jun 6, 2017
I think the issue is only with EFS, but EFS claims to be POSIX compliant. There are some docs on its consistency guarantees: http://docs.aws.amazon.com/efs/latest/ug/using-fs.html#consistency but I am not able to understand why meta.json is not being written. Maybe the changes are not being written because we are not closing the directory (ref: prometheus/tsdb#96).
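A rough sketch of the directory-sync step that prometheus/tsdb#96 appears to be about (assumed behavior, not a copy of that patch): after renaming a file into a block directory, the directory itself is opened and fsynced so the new directory entry becomes durable, which matters after a crash and on some network filesystems.

```go
package main

import (
	"os"
	"path/filepath"
)

// fsyncDir opens a directory and fsyncs it so its entries reach stable storage.
func fsyncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

// renameDurable renames tmp to dst and then syncs the containing directory,
// so the rename itself survives a crash.
func renameDurable(tmp, dst string) error {
	if err := os.Rename(tmp, dst); err != nil {
		return err
	}
	return fsyncDir(filepath.Dir(dst))
}

func main() {
	// Hypothetical usage: finish writing meta.json.tmp, then publish it.
	if err := renameDurable("block/meta.json.tmp", "block/meta.json"); err != nil {
		panic(err)
	}
}
```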
Interesting possibility. Let's see if this keeps getting reported after a release with the fix.
brian-brazil added the kind/more-info-needed label on Jul 14, 2017
stp-ip commented Aug 25, 2017

Seems like we are running into the same issue on an NFS v4 mount.
cainelli commented Sep 14, 2017

Apparently, we got into this issue with EFS and version
sbadakhc commented Sep 23, 2017

Seeing this also using an NFS mount. A fix would be appreciated!
Per the above, the issue here is NFS, and thus the fix is not to use NFS.
sbadakhc commented Sep 24, 2017

Fair play, but unfortunately I don't have a choice. What I did was ramp the sampling interval up to 1 min, and it has now been running longer, although for how long I'm not sure. Thanks.
saily commented Sep 25, 2017

I have the same issue when running
brian-brazil changed the title from "Data corruption using Prometheus Docker v2.0.0-alpha.2 image" to "Data corruption using Prometheus Docker v2.0.0-alpha.2 image on NFS" on Oct 3, 2017
sbadakhc commented Oct 10, 2017

Is there a way to get Prometheus to send data via a network socket rather than write to disk? If it could stream the data via a network socket, I could have it write to non-NFS storage and pick it up from there.
adrissss commented Oct 17, 2017

It happened to us with v2.0.0-rc.0 + Docker Swarm + EFS volume. Since then, Prometheus won't restart anymore.
And it stays there for hours, with no other error message. A Grafana query while in this state gives:
Any suggestions to make it start again without losing all the stored metrics?
@adrissss This just looks like you ran into the WAL read issue. At SoundCloud we saw startup times of over 90m on our busy servers. Fabian has improved the WAL read times drastically and the same server boots in 2m now. All these changes are part of rc.1; please update and report again.
adrissss commented Oct 17, 2017

Thanks for your reply @grobie. It could be reading the WAL as you said, as I see 15-25% of CPU time in io. I've updated to 2.0.0-rc.1 and initially everything seemed exactly the same. But after about 25 min the Prometheus container just died. So I'd say that whatever it was doing before, it is now doing it faster. Actually, I've seen its memory usage grow continuously until it reaches the limit I set for the container:
And it dies when it reaches exactly those 6.348 GB (in an 8GB node hosting just this Prometheus container). The whole process starts again as the service is configured to restart on failure:
Any thoughts?
That indeed sounds like an OOM kill. Prometheus mmaps all storage data and thus externally it appears to just keep using more and more memory. However, the operating system can take this memory back immediately as soon as another process needs it, without killing Prometheus. Running containers with memory limits in Kubernetes confirmed this behavior to work as expected, i.e. the memory occupied by mmap'd files is capped by the container's limits without killing the process. This should not differ in general. Could you monitor the Prometheus instance itself and tell us what the
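To illustrate the mmap accounting point (a hypothetical standalone example, not Prometheus code; the file path is made up): the mapped bytes show up in the process RSS once touched, but they are clean, file-backed pages the kernel can reclaim under memory pressure.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Hypothetical path to one of the block chunk files.
	f, err := os.Open("data/01EXAMPLEULID/chunks/000001")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Map the whole file read-only. The mapping is file-backed: the pages
	// count toward RSS once touched, but the kernel may drop them again
	// at any time without killing the process.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	// Touch every page so it is faulted in; watch RSS grow while doing so.
	var sum byte
	for _, b := range data {
		sum += b
	}
	fmt.Printf("mapped %d bytes (checksum byte %d)\n", len(data), sum)
}
```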
haraldschilly commented Nov 7, 2017

Hi, we're running 2.0.0-rc3 and I got here because I just saw this
I don't know what happened, but yes, it could be OOM related. We're not running this on NFS. I'll look at what this
edit: in case someone is curious, the meta.json file was completely empty.
haraldschilly commented Nov 7, 2017

Well,
The memory behavior sounds normal. I cannot put my finger on your error right now. We are writing the meta.json file in one go and only make the block visible after that has completed. Even a crash should not cause this, theoretically. However, we did make some fixes that should have addressed NFS-related issues. Since you are not running NFS, I'll close here. Please open a new issue if you encounter this again.
fabxc closed this on Nov 12, 2017
auhlig commented Nov 30, 2017

Happened with prometheus:v2.0.0 using NFS:
Config can be found here. Happened twice. One incident was related to a node reboot, which might have caused an ungraceful shutdown of Prometheus. Today this happened again without any obvious reason. Had to delete the folder to get it back working. Memory usage is consistent.
EDIT: Happened again with 2 separate Prometheus instances. Out of nowhere. Memory usage consistent. Nothing obvious. Any help would be much appreciated @fabxc.
anguslees commented Feb 13, 2018

I can confirm that prometheus 2.1.0 still chokes on these directories. FWIW, the issue is caused/exacerbated by NFS's handling of deleted-while-open files.

```
core@localhost ~ $ sudo find /var/lib/kubelet/pods/ecb68feb-10ae-11e8-b1d0-02500b02b531/volumes/kubernetes.io~nfs/pvc-35d43d9d-f4ec-11e7-a8b1-02120902b07c/01C3F05JJQFGNVC9E9N028SVP3 -ls
33033501      4 drwxr-xr-x   3 nobody   nobody       4096 Feb  8 15:30 /var/lib/kubelet/pods/ecb68feb-10ae-11e8-b1d0-02500b02b531/volumes/kubernetes.io~nfs/pvc-35d43d9d-f4ec-11e7-a8b1-02120902b07c/01C3F05JJQFGNVC9E9N028SVP3
33033502      4 drwxr-xr-x   2 nobody   nobody       4096 Feb  8 15:30 /var/lib/kubelet/pods/ecb68feb-10ae-11e8-b1d0-02500b02b531/volumes/kubernetes.io~nfs/pvc-35d43d9d-f4ec-11e7-a8b1-02120902b07c/01C3F05JJQFGNVC9E9N028SVP3/chunks
33033504  58896 -rw-r--r--   1 nobody   nobody   60307522 Jan 10 03:00 /var/lib/kubelet/pods/ecb68feb-10ae-11e8-b1d0-02500b02b531/volumes/kubernetes.io~nfs/pvc-35d43d9d-f4ec-11e7-a8b1-02120902b07c/01C3F05JJQFGNVC9E9N028SVP3/chunks/.nfs0000000001f80d2000000001
33033503   8160 -rw-r--r--   1 nobody   nobody    8352682 Jan 10 03:00 /var/lib/kubelet/pods/ecb68feb-10ae-11e8-b1d0-02500b02b531/volumes/kubernetes.io~nfs/pvc-35d43d9d-f4ec-11e7-a8b1-02120902b07c/01C3F05JJQFGNVC9E9N028SVP3/.nfs0000000001f80d1f00000002
```

If it matters at all, these are nfs4 mounts:

```
core@localhost ~ $ grep nfs /proc/mounts
192.168.0.10:/home/kube/monitoring-prometheus-data-pvc-35d43d9d-f4ec-11e7-a8b1-02120902b07c /var/lib/kubelet/pods/ecb68feb-10ae-11e8-b1d0-02500b02b531/volumes/kubernetes.io~nfs/pvc-35d43d9d-f4ec-11e7-a8b1-02120902b07c nfs4 rw,relatime,vers=4.0,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.127,local_lock=none,addr=192.168.0.10 0 0
```

I need to think further about what's causing these to be created, but I suspect prometheus' file loading should ignore these "empty" directories rather than getting all excited and aborting.
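A sketch of the skip-instead-of-abort behavior suggested here (hypothetical, not the current tsdb loading code): block directories without a readable meta.json are logged and ignored rather than failing startup.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// openableBlockDirs scans a data directory and returns only the block
// directories that contain a readable meta.json, logging and skipping the rest.
func openableBlockDirs(dataDir string) ([]string, error) {
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		return nil, err
	}
	var dirs []string
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		dir := filepath.Join(dataDir, e.Name())
		if _, err := os.Stat(filepath.Join(dir, "meta.json")); err != nil {
			// Leftover or half-written block (e.g. one holding only
			// .nfsXXXX files): skip it instead of failing startup.
			log.Printf("skipping block dir %s: %v", dir, err)
			continue
		}
		dirs = append(dirs, dir)
	}
	return dirs, nil
}

func main() {
	dirs, err := openableBlockDirs("data") // hypothetical TSDB data directory
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("loadable blocks:", dirs)
}
```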
haraldschilly referenced this issue on Apr 7, 2018: "does not start up after corrupted meta.json file" #4058 (closed)
lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
bvis commented Jun 4, 2017

What did you do?
I started a Prometheus task in a Docker Swarm cluster. I started it with the following config:
Where these "secrets" are just the configuration file and alerts used by Prometheus.
What did you expect to see?
I expected to see it working. 0:)
What did you see instead? Under which circumstances?
I saw that the service was not running (as it was running as a "beta" service I didn't have any monitoring over it), and when checking the logs I saw:
Environment
It's running in a Swarm cluster, in AWS in EU-WEST-1 in 3 different AZs.
It's running as a swarm service with 1 task.
The data is stored in an EFS system using the rexray/efs plugin.
Linux 4.10.0-21-generic x86_64
This means that I'm using built-in DNS service discovery in Swarm to autodiscover the task endpoint, to be consumed by Prometheus. It's something I've been doing with previous versions.