[Intermittent] compaction failed after upgrade from 2.2.1 to 2.3.0 #4292
Comments
brian-brazil added the kind/bug and component/local storage labels on Jun 20, 2018
@fabxc @gouthamve Can ye take a look at this?
@aarontams Can you provide the full log line with the error message? Specifically:
I don't have the actual logs any more. I got the full error message from our Kibana instead. Hope this helps.
We have seen this error before; it is consistent with running multiple Prometheus instances against the same data directory, but we are not entirely sure. More information from your side would help us understand the issue better. Do you run multiple Prometheus servers? And if you did an update, are you sure the older Prometheus was completely down before the new one started?
It'd also help if you provide the flags used to start prometheus.
There is only one prometheus installed on that machine.
Unless you think the ansible (systemctl) restart command didn't completely stop the old Prometheus before starting the new one.
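A quick way to check whether a restart left a second Prometheus behind (assuming the systemd unit is simply called prometheus; adjust names to your setup):

```bash
# Did the restart leave the old instance running alongside the new one?
systemctl status prometheus --no-pager | grep -E 'Active:|Main PID:'
pgrep -a prometheus   # more than one line here means two instances may be writing to the same TSDB
```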
More info:
What filesystem is /data using?
ext4
@aarontams any chance you can try to replicate this and share the exact steps? We have seen this issue a few times now, but are still unable to find the root cause, so finding the steps to replicate it would be crucial.
I am also wondering if changing the ExecStart path might have prevented systemd from killing the previous 2.2.1 instance. Prometheus uses a lock file to prevent more than one instance, but @brian-brazil reminded us that we switched the locking package, which means that 2.3.0 will not detect if another instance is running. The old locking used a text file with the PID inside, so if this existed on your system after the restart, the chances are that the old instance might still have been running.
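To illustrate what PID-file locking amounts to (the file name, location, and data directory below are assumptions made for the example, not necessarily what 2.2.1 actually used): the check only refuses to start if the PID recorded in the file still belongs to a process visible on this host, so a crashed owner, or one hidden in another PID namespace, goes undetected.

```bash
#!/usr/bin/env bash
# Hypothetical PID-file lock check, for illustration only.
DATA_DIR=/data
LOCK_FILE="$DATA_DIR/lock"   # assumed name and location

if [[ -f "$LOCK_FILE" ]]; then
  pid="$(tr -cd '0-9' < "$LOCK_FILE")"
  if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then
    echo "refusing to start: PID $pid still holds $LOCK_FILE" >&2
    exit 1
  fi
  echo "stale lock file (PID ${pid:-unknown} not running); removing it"
  rm -f "$LOCK_FILE"
fi
echo "$$" > "$LOCK_FILE"   # record our own PID, as the old scheme did
```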
viralkamdar commented Jun 20, 2018 (edited):
There is no specific exact step: just upgrade Prometheus and restart the service. It may be possible that the old lock file still exists.
If this happens again, please send me the WAL file privately, if possible, so I can investigate: kgeorgie at redhat.com
@krasi-georgiev, @brian-brazil and/or @gouthamve: can you tell us where the old and new lock files are? Sounds like you guys think the issue that I filed here will only happen in the upgrade-to-2.3.0 scenario. I need to double check whether the same problem happens when restarting a 2.3.0 prometheus. I will update here after I gather the info.
The old lock file is called … I just tried locally and it is possible to run both at the same time as long as you provide a different listening port, so in your case, if you run both in the same network namespace, the port clash should prevent running both at the same time, which means the bug is caused by something else.
I just confirmed with my team: the same compaction problem happened last week while we were still on 2.2.1. @viralkamdar will provide the affected WAL file offline.
dmitriy-lukyanchikov commented Jun 20, 2018:
Got the same problem when upgrading from 2.2.1, but when I try to reproduce the error I don't always get it; sometimes the upgrade goes through without error. Not sure why.
In our experience, about 1 out of 4 of the Prometheus servers that we upgraded had that problem.
I haven't tried it yet, but in theory the old locking will not work in a k8s/Docker env. It matches the PID, and since the Pods/Services run in isolated PID namespaces it will not detect any other running instances. So if you scale up/down or reschedule a Prometheus pod/service that uses the same data folder, this will cause more than one instance writing to the same database. This is only a theory and hasn't been tested yet, so if anyone manages to reproduce this it would be of great help.
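An untested sketch of that theory, using the public prom/prometheus image and an invented host path: the two containers share one data directory but sit in separate PID and network namespaces, so a PID-based lock file written by one cannot be validated by the other, and there is no :9090 clash to stop the second instance either.

```bash
# Both containers mount the same host directory as their TSDB path.
# (The host directory must be writable by the container user.)
docker run -d --name prom-a -v /srv/prom-data:/prometheus prom/prometheus:v2.2.1
docker run -d --name prom-b -v /srv/prom-data:/prometheus prom/prometheus:v2.2.1
# If both stay up, two writers now share one TSDB, i.e. the multi-writer
# scenario suspected above.
```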
viralkamdar commented Jun 21, 2018:
I think we made a mistake by restarting the service with a different path. We should have stopped the service which was running Prometheus from 2.2.1 and then started the service to run Prometheus from 2.3.0, to make sure that the old process had died completely.
A different path shouldn't have caused this, as you'd end up with a Prometheus with an empty tsdb on this new path. It'd take something copying that new tsdb over to explain this.
@brian-brazil I think he meant a different path for the Prometheus executable, not the tsdb path. I am not sure if systemd tracks the PID of the running process so it knows how to kill it, or does this by the path of the executable in the config file.
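For what it's worth, systemd stops a service by the main PID (and control group) it recorded when the unit started, not by looking up the executable path; you can see what it is tracking with something like the following (unit name assumed):

```bash
# Which PID will `systemctl stop`/`restart` actually signal?
systemctl show -p MainPID -p ExecMainStartTimestamp prometheus.service
```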
OTOH, in a non-k8s/Docker env the new instance should fail if the old one is still running, as it will try to listen on the same port.
Just tried it locally and I was wrong about releasing the port.
duhang commented Jun 21, 2018 (edited): …
I updated my comment: after trying it locally I can't make another Prometheus start while the first one is shutting down. Although the web handler is stopped early, the network port is only released once shutdown completes. Back to the drawing board. @duhang could you send me those WAL files privately at kgeorgie at redhat.com and I will keep digging. About your second question: the WAL is a temporary holding area for the most recent metrics before they are converted into a block, so deleting it explains the data loss.
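A rough way to watch this from the outside, assuming the default :9090 listen address and that the ss utility is available:

```bash
# Ask the running instance to stop, then poll until the port is actually released.
systemctl stop prometheus &
while ss -ltn 'sport = :9090' | grep -q ':9090'; do
  sleep 0.2   # still bound: the old instance has not finished shutting down yet
done
echo "port 9090 released"
```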
BTW, I am working on a tsdb scan/repair tool, so hopefully, if we figure out what caused this bug, it will help improve the tool as well.
duhang commented Jun 21, 2018 (edited):
In order to reproduce this (one Prometheus 2.2.1 being shut down and one Prometheus 2.3.0 being brought up), you need to put the Prometheus binaries in versioned directories. Here is how it looks in prometheus.service before and after the upgrade: …
We captured this in action: …
Admittedly, this is our problem to solve in order to prevent two Prometheus instances from running at the same time, especially during an upgrade. But the data loss, especially the 11AM-12PM loss, is puzzling. After we deleted wal/000001 and restarted Prom 2.3.0, we didn't touch any other files under the wal directory. Hopefully, we can use the scan tool to put the 12PM-13:30PM metrics back, then we will be all sound and good! OK, sent the wal files to Krasi over email. Thanks for the help!
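The unit file contents didn't survive into this thread. Purely as an illustration of a versioned-directory upgrade that avoids overlapping instances (unit name, paths, and the sed edit below are invented, not taken from the reporter's setup):

```bash
# Stop the old 2.2.1 service and wait until the process has really exited.
systemctl stop prometheus.service
while pgrep -x prometheus > /dev/null; do sleep 1; done

# Point ExecStart at the new versioned binary, then reload and start.
sed -i 's|/opt/prometheus-2.2.1/|/opt/prometheus-2.3.0/|' /etc/systemd/system/prometheus.service
systemctl daemon-reload
systemctl start prometheus.service
```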
Yep, I got the files, thanks. I tried starting both exactly like you describe, but the listening port prevents that: both can't listen on the same 9090 port.
duhang commented Jun 22, 2018 (edited):
@krasi-georgiev Not necessarily. This scenario can be simulated with …
If 2.2.1 released the 9090 port quickly enough (possibly within tens of milliseconds) before 2.3.0 tried to grab the same port number (possibly within hundreds of milliseconds), then this whole scenario would appear to be a normal restart.
As I said before, we are responsible for making sure only one Prometheus runs at a time. But in theory, there could still be a problem in Prometheus web, had other …
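The exact simulation steps were lost from the comment above; a hypothetical way to reproduce the race it describes (binary and config paths invented) is to signal the old instance and start the new one without waiting:

```bash
# Ask the old 2.2.1 instance to shut down, but do not wait for it to exit...
kill -TERM "$(pgrep -f '/opt/prometheus-2.2.1/prometheus')"
# ...and immediately start 2.3.0 against the same data directory.
/opt/prometheus-2.3.0/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data &
# If :9090 is freed before the old process has finished flushing and closing the
# TSDB, the new one binds successfully and both briefly write to /data.
```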
That is strange, I couldn't replicate it locally doing the following: …
But hey, if you are sure that this is what caused it and can replicate it, then happy days, you can close the issue. I will play with the WAL files you sent me and will let you know if I find anything else.
We changed our ansible … Prom team, feel free to close the issue or leave it open for others to add comments to it.
Thanks for the update, that would be a very useful use case. Let's close for now and reopen if needed.
krasi-georgiev closed this Jun 23, 2018
ntindall commented Jul 1, 2018 (edited):
We experienced this same problem. We are running a Prometheus architecture on Amazon ECS with two redundant "master" Prometheuses and ~12 federated nodes. We have recently implemented a remote storage adapter to write aggregated metrics to Amazon Kinesis Firehose -> Redshift. The remote adapter is only configured for one of the master nodes. Unfortunately, we ran into … Turning on …
I believe during this change we had two writers on the same data volume, so we started experiencing this problem. Unwittingly, the same deployment also upgraded Prometheus to … Interestingly, we only saw this error on the remote-writer master; the "vanilla master" was just fine after the upgrade. After some careful surgery on the …
In general, Prometheus seems very delicate when it comes to this error; we were seeing out-of-memory errors on the remote writer until we were able to fix the problem. I have seen some other recommendations that decreasing the block size can decrease the severity of this problem, especially on nodes with large retention (ours has 84 days of retention): #4110
Here is how we are starting Prometheus now:
```bash
cmd=(
  "/bin/prometheus"
  "--config.file=/etc/prometheus/prometheus-$PROMETHEUS_ENV.yml"
  "--storage.tsdb.path=${STORAGE_LOCAL_PATH}"
  "--storage.tsdb.retention=${STORAGE_LOCAL_RETENTION}"
  "--web.external-url=${WEB_EXTERNAL_URL}"
  "--web.console.libraries=/usr/share/prometheus/console_libraries"
  "--web.console.templates=/usr/share/prometheus/consoles"
  "--web.enable-admin-api"
)
```
Not sure we are looking for any "help" right now, but any advice or insight on the above saga would be appreciated. In any case, I hope this helps someone else resolve similar problems in the future.
@duhang I finally had a chance to have a look at those WAL files you sent me. @brian-brazil found a bug in the tsdb which is most likely the culprit, so now I am less convinced that this was caused by more than one instance writing to the same folder. The bug fix will be included in the next release, so please open another report if you still experience the same behaviour.
aarontams commented Jun 20, 2018 (edited):
Bug Report
What did you do?
Upgraded prom from 2.2.1 to 2.3.0.
What did you expect to see?
Prom restarts without any issue.
What did you see instead? Under which circumstances?
Jun 19 21:34:16 prometheus[29758]: level=error ts=2018-06-19T21:34:16.143435625Z caller=db.go:277 component=tsdb msg="compaction failed"...
Environment
…
wal/000001 - no idea why it was created. Also check the timestamp, something is not right.
To work around the issue:
After removing wal/000001 and restarting, everything was back to normal.
After the restart, we lost 1.5 hours of metrics.
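For reference, the workaround described above amounts to something like this (service name assumed; keep a copy of the segment rather than deleting it outright, in case it is needed for debugging):

```bash
systemctl stop prometheus
mv /data/wal/000001 /root/wal-000001.bak   # set the suspect WAL segment aside
systemctl start prometheus
```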