
[Intermittent] compaction failed after upgrade from 2.2.1 to 2.3.0 #4292

Closed
aarontams opened this issue Jun 20, 2018 · 36 comments · 8 participants

aarontams (Author) commented Jun 20, 2018

Bug Report

What did you do?
Upgraded Prometheus from 2.2.1 to 2.3.0.

What did you expect to see?
Prometheus restarts without any issue.

What did you see instead? Under which circumstances?
Jun 19 21:34:16 prometheus[29758]: level=error ts=2018-06-19T21:34:16.143435625Z caller=db.go:277 component=tsdb msg="compaction failed"...

Environment
Note the 000001 file below - no idea why it was created. Also check its timestamp; something is not right.

-rw-r--r--. 1 root root    800655 Jun 19 19:00 000001
-rw-r--r--. 1 root root 268408703 Jun 19 18:34 000027
-rw-r--r--. 1 root root  13269564 Jun 19 20:24 000028
-rw-r--r--. 1 root root 268435456 Jun 19 21:23 000029

To work around the issue:

systemctl stop prometheus
rm ./prometheus/wal/000001
systemctl start prometheus

After removing wal/000001 and restarting, everything was back to normal:

prometheus[4606]: level=info ts=2018-06-19T21:36:47.725951171Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=241.063549ms

After the restart, we lost 1.5 hours of metrics.

brian-brazil (Member) commented Jun 20, 2018

@fabxc @gouthamve Can ye take a look at this?

gouthamve (Member) commented Jun 20, 2018

@aarontams Can you provide the full log line with the error message?

Specifically: Jun 19 21:34:16 prometheus[29758]: level=error ts=2018-06-19T21:34:16.143435625Z caller=db.go:277 component=tsdb msg="compaction failed"..., the error message has been cut off.

aarontams (Author) commented Jun 20, 2018

I don't have the actual logs any more. I got the full error message from our Kibana instead. Hope this helps.

level=error ts=2018-06-19T21:34:16.143435625Z caller=db.go:277 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{__name__=\\\"management_wls_server_apipashp_adminserver_datasource_apimanageranalyticsdatasource_sp588071522_activeconnectionsaveragecount\\\",async=\\\"true\\\",host=\\\"apipashprodwapip-wls-1.svc3.apicsautopod.oraclevcn.com\\\",instance=\\\"129.146.71.234:8080\\\",job=\\\"Telegraf\\\"}\""

gouthamve (Member) commented Jun 20, 2018

We have seen this error before; it is consistent with running multiple Prometheus instances on the same data directory, but we are not entirely sure. More information from your side would help us understand the issue better.

Do you run multiple Prometheus instances? And when you did the update, are you sure the older Prometheus was completely down before the new one started?

gouthamve (Member) commented Jun 20, 2018

It'd also help if you provide the flags used to start prometheus.

aarontams (Author) commented Jun 20, 2018

There is only one Prometheus installed on that machine.
We use Ansible to restart Prometheus after updating the configs to 2.3.0:

- name: Start Prometheus Service
  service:
     name: prometheus
     state: restarted
     enabled: yes
     daemon_reload: yes

Unless you think the Ansible (systemctl) restart command didn't completely stop Prometheus before starting the new one.

aarontams (Author) commented Jun 20, 2018

More info:
/etc/systemd/system/prometheus.service

.....
[Service]
ExecStart=/opt/prometheus/prometheus-2.3.0.linux-amd64/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention=33d
Restart=always
RestartSec=10

brian-brazil (Member) commented Jun 20, 2018

What filesystem is /data using?

aarontams (Author) commented Jun 20, 2018

ext4

krasi-georgiev (Member) commented Jun 20, 2018

@aarontams any chance you can try to replicate this and share the exact steps?

We have seen this issue a few times now, but are still unable to find the root cause, so having steps to reproduce it would be crucial.

krasi-georgiev (Member) commented Jun 20, 2018

I am also wondering whether changing the ExecStart path might prevent systemd from killing the previous 2.2.1 instance.

Prometheus uses a lock file to prevent more than one instance from running, but @brian-brazil reminded us that we switched the locking package, which means 2.3.0 will not detect whether another instance is running.

The old locking used a text file with the PID inside, so if that file existed on your system after the restart, chances are the old instance was still running.

viralkamdar commented Jun 20, 2018

There are no specific exact steps: just upgrade Prometheus and restart the service. It may be that the old lock file still exists.

krasi-georgiev (Member) commented Jun 20, 2018

If this happens again, please send me the WAL file privately if possible so I can investigate further.

kgeorgie at redhat.com

aarontams (Author) commented Jun 20, 2018

@krasi-georgiev, @brian-brazil and/or @gouthamve: can you tell us where the old and new lock files are?

Sounds like you guys think the issue I filed here only happens in the upgrade-to-2.3.0 scenario. I need to double-check whether the same problem happens when restarting a 2.3.0 Prometheus. I will update here after I gather the info.

krasi-georgiev (Member) commented Jun 20, 2018

The old lock file is called lock, a text file containing a number which is the PID of the running Prometheus.
The new lock file is still called lock, but it is empty.

I just tried locally, and it is possible to run both at the same time as long as you provide a different listening port. In your case, if you run both in the same network namespace, the port clash should prevent running both at the same time, which means the bug is caused by something else.
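
For anyone who wants to check for a stale old-style lock before starting the new binary, a rough sketch like this should do (it assumes the /data/prometheus path from the unit file above; adjust to your own --storage.tsdb.path):

# rough sketch: detect a leftover old-style (PID-bearing) lock file before starting 2.3.0
LOCK=/data/prometheus/lock          # assumes the --storage.tsdb.path used earlier in this thread
if [ -s "$LOCK" ]; then             # old-style lock is non-empty (it holds a PID); the new lock is empty
  pid=$(cat "$LOCK")
  if kill -0 "$pid" 2>/dev/null; then
    echo "another Prometheus (PID $pid) still holds the data directory" >&2
    exit 1
  fi
fi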

aarontams (Author) commented Jun 20, 2018

I just confirmed with my team that the same compaction problem happened last week while we were still on 2.2.1. @viralkamdar will provide the affected WAL file offline.

dmitriy-lukyanchikov commented Jun 20, 2018

I got the same problem when upgrading from 2.2.1, but when I try to reproduce it I don't always hit the error; sometimes the upgrade goes through without issues. Not sure why.

aarontams (Author) commented Jun 20, 2018

In our experience, about 1 out of 4 Prometheus instances that we upgraded had that problem.
Note: we also hit the same problem while running 2.2.1.

krasi-georgiev (Member) commented Jun 21, 2018

I haven't tried it yet, but in theory the old locking will not work in a k8s/Docker environment.
It matches the PID, and since pods/services run in isolated PID namespaces, it will not detect any other running instances.

So if you scale up/down or reschedule a Prometheus pod/service that uses the same data folder, this can end up with more than one instance writing to the same database.

This is only a theory and hasn't been tested yet, so if anyone manages to reproduce it, that would be of great help.
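
To illustrate the theory (a sketch only, not something that has been tested here): a PID written by a previous pod is meaningless inside a new pod's PID namespace, so a PID-based lock check can wrongly pass even while another writer is alive:

# sketch: the PID recorded by the old-style lock comes from another PID namespace
pid=$(cat /data/prometheus/lock)     # e.g. "23", written by the previous pod
kill -0 "$pid" 2>/dev/null || echo "PID $pid not visible here; an old-style lock check would wrongly pass"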

viralkamdar commented Jun 21, 2018

brian-brazil (Member) commented Jun 21, 2018

A different path shouldn't have caused this, as you'd end up with a Prometheus with an empty tsdb on the new path. It'd take something copying that new tsdb over to explain this.

krasi-georgiev (Member) commented Jun 21, 2018

@brian-brazil I think he meant a different path for the Prometheus executable, not the tsdb path.

I am not sure whether systemd tracks the PID of the running process so it knows how to kill it, or whether it does this by the path of the executable in the unit file.
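
As far as I know, systemd tracks the unit's main PID and its control group rather than the ExecStart path; that can be inspected like this (unit name taken from the service file above):

# systemd records the main PID and the unit's cgroup, not the ExecStart path
systemctl show prometheus.service -p MainPID -p SubState
# the status output also lists every process still inside the unit's cgroup
systemctl status prometheus.service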

krasi-georgiev (Member) commented Jun 21, 2018

OTOH, in a non-k8s/Docker env the new instance should fail if the old one is still running, as it will try to listen on the same 9090 port.

krasi-georgiev (Member) commented Jun 21, 2018

Just had another look: the web handler is stopped before the database is closed, which in a busy env might take a while.

So in theory, if systemd sends a kill signal to the old instance and doesn't wait for it to complete the shutdown, the web listener will free port 9090 and the new Prometheus instance will be able to start before the old one has completely shut down.

Just tried it locally and I was wrong about releasing the port.
Although the web handler is stopped before the database, the port stays occupied until the Prometheus binary has exited completely, so it seems the bug is caused by something else.
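
If anyone wants to watch this for themselves, checking which PID still holds the port during shutdown is enough (a sketch; either tool works):

# see which process still holds :9090 while the old instance is shutting down
ss -ltnp 'sport = :9090'
# or, where ss is not available:
lsof -iTCP:9090 -sTCP:LISTEN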

duhang commented Jun 21, 2018

We will be reimplementing the Prometheus upgrade/restart logic in Ansible to catch all the conditions we discovered recently with Prom 2.2.1 and the upgrade to 2.3.0.

Basically, Prometheus is doing what it should be doing; we need to learn all the quirks, especially the interaction with systemd restart and system reboot, and now this early-freed 9090 port.

====

But one thing I'm still trying to figure out is the metrics loss, especially the historical metrics loss.

Let's say the normal compaction schedule is 10:00, 12:00, 14:00, and I did see the 12:00 compaction happen in our graph.

At 13:30, we did the Prom 2.2.1 -> 2.3.0 upgrade, where there is a brief Prom-x2 moment (2.2.1 being shut down and 2.3.0 being brought up, both running off the same data directory). On 9 out of roughly 30 Prometheus instances, the WAL got corrupted and we got compaction errors/alerts at 14:00.

Around 15:00, we stopped Prometheus 2.3.0, moved wal/000001 aside, and restarted Prometheus 2.3.0; the compaction error was gone.

BUT, we found that metrics were missing from the impacted Prometheus instances between 11:00 and 13:30 (the upgrade and the first Prom restart).

So the questions are:

  1. Why did we even lose the 11:00-12:00 metrics, which had already been compacted by Prom 2.2.1 and are supposedly immutable, right?

  2. How can we recover the 12:00-13:30 metrics, which I'm assuming are still in the dangling WAL files, such as wal/000135 and wal/000172 below? Is there a manual way to trigger compaction of older WAL files? We still kept the relocated wal/000001, BTW.

ls wal/
3195859 Jun 21 17:00 000135
268195748 Jun 21 16:04 000144
268435456 Jun 21 18:25 000145

On another instance

ls wal/
5245829 Jun 21 17:00 000172
268296390 Jun 21 18:20 000185
268435456 Jun 21 18:24 000186

Thanks a bunch to the Prometheus team. You guys rock!

krasi-georgiev (Member) commented Jun 21, 2018

I updated my comment: after trying it locally, I can't make another Prometheus start while the first one is shutting down.

Although the web handler is stopped early, port 9090 is still in use until the Prometheus binary has exited completely, so in a non-Docker/k8s env more than one instance can't run with the same listening port.

Back to the drawing board.

@duhang, could you send those WAL files to me privately at kgeorgie at redhat.com? I will keep digging.

About your second question: the WAL is a temporary holder for the most recent metrics before they are converted into a block, so deleting it explains the data loss.
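
For context, the data directory looks roughly like this (illustrative layout; the block names are random ULIDs, not ones from this report):

/data/prometheus/
  01CFB1A2...      # immutable block produced by compaction (older data lives here)
  01CFC3D4...      # another block
  wal/             # write-ahead log: recent samples not yet compacted into a block
    000144
    000145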

krasi-georgiev (Member) commented Jun 21, 2018

BTW, I am working on a tsdb scan/repair tool, so if we figure out what caused this bug it will hopefully help improve the tool as well:
prometheus/tsdb#320

duhang commented Jun 21, 2018

In order to reproduce one Prometheus 2.2.1 being shut down and one Prometheus 2.3.0 being brought up, you need to put the Prometheus binaries in versioned directories.

Here is how it looks in prometheus.service before and after the upgrade.
When we call systemctl restart prometheus, the problem shows up intermittently, or 3 out of 10 times, shall I say?

# before
ExecStart=/opt/prometheus/prometheus-2.2.1.linux-amd64/prometheus
# after
ExecStart=/opt/prometheus/prometheus-2.3.0.linux-amd64/prometheus

We captured this in action:

# this is prom 2.2.1 reporting that it is exiting gracefully
# but with prom 2.3.0 starting up in the next ~100ms, the exit won't be graceful anyway
level=warn ts=2018-06-19T20:34:01.352764045Z caller=main.go:374 msg="Received SIGTERM, exiting gracefully..."

# this is prom 2.3.0 starting up ~100ms later
# and maybe creating wal/000001 when it sees the lock file in the same data directory?
level=info ts=2018-06-19T20:34:01.468798605Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.0, branch=HEAD, revision=290d71791a507a5057b9a099c9d48703d86dc941)"

Admittedly, it is our problem to prevent two Prometheus instances from running at the same time, especially during upgrades. But the data loss, especially the 11:00-12:00 loss, is puzzling.

After we deleted wal/000001 and restarted Prom 2.3.0, we didn't touch any other files under the wal directory. Hopefully we can use the scan tool to put the 12:00-13:30 metrics back, and then we will be all sound and good!

OK, sent the wal files to krasi over email.

Thanks for the help!

krasi-georgiev (Member) commented Jun 22, 2018

Yep, I got the files, thanks.

I tried starting both exactly as you describe, but the listening port prevents that: both can't listen on the same 9090 port.

duhang commented Jun 22, 2018

@krasi-georgiev Not starting both, but shutting down a running 2.2.1 and, within milliseconds, bringing up 2.3.0 using the same data directory.

This scenario can be simulated with systemctl restart. Well, I should say it is exactly what happened in our Prometheus 2.3.0 upgrade incident.

# Start a 2.2.1 with systemd unit file like
# [Service]
# ExecStart=/opt/prometheus/prometheus-2.2.1.linux-amd64/prometheus
systemctl start prometheus.service

# Update the systemd unit file with 2.3.0 path
# [Service]
# ExecStart=/opt/prometheus/prometheus-2.3.0.linux-amd64/prometheus
systemctl daemon-reload
systemctl restart prometheus.service

# Now check the wal/ directory: a 000001 segment should appear intermittently,
# or 3 out of 10 times in our experience, meaning a 2.3.0 was starting while a
# 2.2.1 was shutting down.

If 2.2.1 released port 9090 quickly enough (possibly within tens of milliseconds) before 2.3.0 tried to grab the same port (possibly within hundreds of milliseconds), then the whole scenario will appear to be a normal restart, but it isn't. There will be historical data and/or WAL corruption, followed by errors at the next scheduled compaction.

As I said before, we are responsible for not shutting down and then starting up the upgraded Prometheus 2.3.0 correctly, and we will take the blame and fix our startup code.

But in theory there could still be a problem in the Prometheus web layer; had other people run into a similar scenario to ours, they might lose data during a Prometheus upgrade/restart as well.

krasi-georgiev (Member) commented Jun 22, 2018

That is strange; I couldn't replicate it locally doing the following:

  1. Add a big delay in the shutdown code.
  2. Send a SIGTERM; at this point Prometheus goes into a shutdown sequence and closes everything, including the web handler, but remains running due to the delay I added.
  3. Manually delete all WAL files to see whether the new instance creates a new WAL.
  4. Start another Prometheus while the first one is still pending shutdown; here it fails to start, displays an error about the 9090 listening port, and doesn't create any WAL files.

But hey, if you are sure that this is what caused it and can replicate it, then happy days, you can close the issue 😜

I will play with the WAL files you sent me and will let you know if I find anything else.

aarontams (Author) commented Jun 22, 2018

We changed our Ansible from restart to stop and start.
We reran the upgrade scenario 50+ times (without a lot of data this time, though); no corruption was found.
Our original failed env has ~6GB of data.
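
In plain shell, the stop-then-start flow amounts to something like this (a sketch; the explicit wait loop is an extra safety check, not part of our Ansible change):

systemctl stop prometheus.service
# extra safety: wait until the old binary has fully exited before starting again
while pgrep -x prometheus > /dev/null; do sleep 1; done
systemctl start prometheus.service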

Prom team, feel free to close the issue or leave it open for others to add comments to it.

krasi-georgiev (Member) commented Jun 23, 2018

Thanks for the update; that is a very useful use case 👍

Let's close for now and reopen if needed.

ntindall commented Jul 1, 2018

We experienced this same problem. We are running a Prometheus architecture on Amazon ECS with two redundant "master" Prometheus instances and ~12 federated nodes.

We recently implemented a remote storage adapter to write aggregated metrics to Amazon Kinesis Firehose -> Redshift. The remote adapter is configured for only one of the master nodes.

Unfortunately, we ran into this bug during the process (Prometheus would not come back up without manually deleting the lockfile). We deleted the lockfile and turned on the --no-lockfile option.

Turning on the --no-lockfile option seems very dangerous on Amazon ECS as well. We later made some changes to the service placement constraints, which meant the service needed to be replaced (instead of just replacing the underlying Prometheus container). The default behavior here is for Amazon to run the two services concurrently (uh oh) until the new service's task becomes healthy.

I believe that during this change we had two writers on the same data volume, so we started experiencing this problem. Unwittingly, the same deployment also upgraded Prometheus to 2.3.1 (which was expected to fix the above-mentioned lockfile problem for the remote storage writer, so we could turn it back on), so I'm not sure there is so much causation in the upgrade as correlation (at least in our case).

Interestingly, we only saw this error on the remote-writer master; the "vanilla" master was just fine after the upgrade. After some careful surgery on the wal directory and deleting the affected series via the /admin API, we got the system up and running again. I also deleted the *.tmp directories as described here.
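
For anyone doing the same surgery, the admin-API delete looks roughly like this (a sketch; the metric matcher is a placeholder, and it needs --web.enable-admin-api, as in the flags below):

# sketch: drop an affected series through the TSDB admin API (matcher is made up)
curl -g -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__="some_affected_metric"}'
# then reclaim the space taken by the deleted series
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'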

In general, Prometheus seems very delicate when it comes to this error; we were seeing out-of-memory errors on the remote writer until we were able to fix the problem.

I have seen other recommendations that decreasing the block size can reduce the severity of this problem, especially on nodes with long retention (ours has 84 days of retention). #4110

Here is how we are starting prometheus now.

cmd=(
  "/bin/prometheus"
  "--config.file=/etc/prometheus/prometheus-$PROMETHEUS_ENV.yml"
  "--storage.tsdb.path=${STORAGE_LOCAL_PATH}"
  "--storage.tsdb.retention=${STORAGE_LOCAL_RETENTION}"
  "--web.external-url=${WEB_EXTERNAL_URL}"
  "--web.console.libraries=/usr/share/prometheus/console_libraries"
  "--web.console.templates=/usr/share/prometheus/consoles"
  "--web.enable-admin-api"
)

Not sure we are looking for any "help" right now, but any advice or insight on the above saga would be appreciated. In any case, I hope this helps someone else resolve similar problems in the future.

krasi-georgiev (Member) commented Jul 5, 2018

@duhang I finally had a chance to have a look at those WAL files you sent me.
It is probably too late for you, but I thought I would add an update.

@brian-brazil found a bug in the tsdb which is most likely the culprit, so now I am less convinced that this was caused by more than one instance writing to the same folder.

The bug fix will be included in the next release so please open another report if you still experience the same behaviour.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
