Prometheus ended up with empty config after rule-related reload #2501
Interesting. Your explanation sure sounds like it could be exactly that kind of race. Do you think it is possible to reproduce this and then work on it? 🤔
@metalmatze Not sure how easy it is to reproduce reliably, but I'll give it a try!
After repeatedly re-applying rules in a sandbox cluster (but leaving the main config as is), I got Prometheus to log a parse error about the main config:
So in this case it bails out at line 610, probably because the file is in the middle of being written out. I will see if I can reproduce something like this at least a couple of times, then fix the sidecar and verify that the problem does not appear again.
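To illustrate the kind of reproduction attempt described above, here is a synthetic sketch (illustrative Go, not the operator's code; the file name and payload are made up): one goroutine rewrites a file the non-atomic way while another keeps reading it, standing in for a reload landing mid-write.

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
)

func main() {
	const path = "config.yaml"
	// A payload large enough that writing it takes more than an instant.
	content := bytes.Repeat([]byte("scrape_configs: []\n"), 10000)

	// Writer: rewrites the file the way ioutil.WriteFile does, i.e.
	// open with O_TRUNC, then write. Between those two steps the file
	// is empty on disk.
	go func() {
		for {
			if err := ioutil.WriteFile(path, content, 0o644); err != nil {
				panic(err)
			}
		}
	}()

	// Reader: stands in for Prometheus being told to reload while the
	// writer is mid-write.
	for i := 0; i < 1_000_000; i++ {
		data, err := ioutil.ReadFile(path)
		if err != nil {
			continue // the file may not exist on the very first reads
		}
		if len(data) < len(content) {
			fmt.Printf("read truncated config: %d of %d bytes\n", len(data), len(content))
		}
	}
}
```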
cc @s-urbaniak I think we have recently seen this in one of our environments as well, no?
Nearly, I am also still stabbing in the dark here. In our case, the config file was not there in the first place. Relevant Prometheus log entry:
config-reload logs:
Note: so far I could not reproduce this, and the CI failures referenced in the originally posted issue at https://bugzilla.redhat.com/show_bug.cgi?id=1690162 have not reproduced it for some time either. @juliusv Do you see any suspicious entries in the config reloader, maybe?
Having said that (thanks @lucab for the verification ❤️!), @juliusv you are definitely right about the missing atomicity properties of `ioutil.WriteFile`. As per https://golang.org/src/io/ioutil/ioutil.go?s=2534:2602#L69, a lot can happen concurrently between the file being truncated on open and the new contents being fully written. Just to double-check, with atomic moves the following scenario would happen:

1. The config reloader writes the new configuration to a temporary file.
2. The rules sidecar updates the rules and triggers a reload, so Prometheus briefly runs with the new rules but the old configuration.
3. The config reloader atomically moves the temporary file into place and, if the contents changed, triggers another reload.

If we do have configuration changes in the last point, we would have a double-restart of Prometheus, initially with an inconsistent set of configuration/rules.

PS: we need to ensure those moves happen in the same filesystem (i.e. the same directory) to have the atomicity guarantee.
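To make that concrete, here is a minimal sketch of such an atomic write, assuming a temp file in the target's own directory (illustrative code, not the actual reloader implementation; the function name `atomicWriteFile` is made up):

```go
package main

import (
	"io/ioutil"
	"os"
	"path/filepath"
)

// atomicWriteFile writes data to a temporary file in the target's own
// directory (so the rename cannot cross filesystems), then renames it
// over the destination. rename(2) is atomic on POSIX filesystems, so a
// concurrent reader sees either the old contents or the new ones, never
// a truncated in-between state.
func atomicWriteFile(path string, data []byte, perm os.FileMode) error {
	dir := filepath.Dir(path)
	tmp, err := ioutil.TempFile(dir, filepath.Base(path)+".tmp")
	if err != nil {
		return err
	}
	// Best-effort cleanup: after a successful rename the temp name is
	// gone, and this simply fails with a not-exist error we ignore.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush to disk before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// TempFile creates the file with 0600; restore the intended mode.
	if err := os.Chmod(tmp.Name(), perm); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	if err := atomicWriteFile("config.yaml", []byte("scrape_configs: []\n"), 0o644); err != nil {
		panic(err)
	}
}
```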
@s-urbaniak Good point about the same filesystem. Yeah, it could just be written into the same directory, but as a temporary file first. Regarding suspicious entries in the config reloader: no, it doesn't log anything unless it actually reloads the config, which it's not doing in my case (it just writes the config out again, but doesn't reload, since the hash is the same). While I cannot reliably reproduce the issue (not too surprising), I managed to get it to break once in a synthetic setup yesterday. I would propose to just fix the write to be atomic and then see if we get any report like this again?
@juliusv Yes, that's what we discussed as well. Moving the temporary file to the correct location should be atomic and solve our problems. |
@juliusv yes, having the atomic semantics in place is definitely the correct way; let me submit a patch for that (unless you want to take over) 👍
@s-urbaniak Happy to do it! |
This addresses an issue found in the Prometheus Operator, which reuses this reloader sidecar, but which then also has a second sidecar which may trigger rule-based reloads while the config sidecar is in the middle of writing out its config (in a non-atomic way): prometheus-operator/prometheus-operator#2501

Signed-off-by: Julius Volz <julius.volz@gmail.com>
This addresses an issue found in the Prometheus Operator, which reuses this reloader sidecar, but which then also has a second sidecar which may trigger rule-based reloads while the config sidecar is in the middle of writing out its config (in a non-atomic way): prometheus-operator/prometheus-operator#2501

I didn't add a test for this because it's hard to catch the original problem to begin with, but it has happened.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
@s-urbaniak How about this: https://github.com/improbable-eng/thanos/pull/962/files
* Make config reloader file writes atomic

This addresses an issue found in the Prometheus Operator, which reuses this reloader sidecar, but which then also has a second sidecar which may trigger rule-based reloads while the config sidecar is in the middle of writing out its config (in a non-atomic way): prometheus-operator/prometheus-operator#2501

I didn't add a test for this because it's hard to catch the original problem to begin with, but it has happened.

Signed-off-by: Julius Volz <julius.volz@gmail.com>

* Explicitly ignore os.Remove() error

Signed-off-by: Julius Volz <julius.volz@gmail.com>
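(The second commit is likely about the cleanup path: after a successful rename the temporary file no longer exists under its old name, so a deferred os.Remove() on it fails with a harmless not-exist error that can safely be ignored.)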
Now we need to bump the dependency over here.
@metalmatze on it
See #2504.
We've been hitting this issue quite frequently with our Prometheus Operator deployments, so I'm quite keen to get this fix deployed. Is there an ETA on when a release containing this will be available?
I think we've collected a number of things. This is a short week in Germany, so I can't promise we'll get to it this week, but there should likely be a release out by next week. Not promising anything though, things sometimes get in the way 🙂.
Appreciate the update @brancz!
Seeing this quite frequently as I am making a lot of rule updates; happy to hear a fix is on the way. To fix this, we just need to update the …
I think https://github.com/coreos/prometheus-operator/releases/tag/v0.30.0 already includes the fix for this. |
@lucab Unfortunately it does not, since the migration to Go modules accidentally reverted the fix.
^ ping @metalmatze |
This has been fixed in v0.30.1! |
For people who may stumble upon this issue, it should have been completely fixed by #3457 (prometheus-operator >= v0.42.0).
Prometheus Operator version: v0.29.0
Prometheus: v2.7.2
Reloader sidecar Operator flags:
We had an incident here where a rule-based change caused the `rules-configmap-reloader` sidecar to trigger a Prometheus reload, and afterwards Prometheus had an empty config (`/config` was empty: no scrape configs, just showing defaults for global config settings). The `prometheus-config-reloader` did not show any reloads around that time, so the config itself didn't actually change. However, from https://github.com/coreos/prometheus-operator/blob/650359b3e627ae97a1f18cbd10d7ed9b2293c240/vendor/github.com/improbable-eng/thanos/pkg/reloader/reloader.go#L156 it looks like the config reloader sidecar may write out the main config even when it hasn't really changed, but then only trigger a reload if the config contents differ: https://github.com/coreos/prometheus-operator/blob/650359b3e627ae97a1f18cbd10d7ed9b2293c240/vendor/github.com/improbable-eng/thanos/pkg/reloader/reloader.go#L194

The way the config reloader sidecar writes out the config is not atomic (it doesn't go through a temporary file and a rename). Is it possible that there is a race condition like this:

1. The config reloader sidecar starts rewriting the main config file, truncating it to zero length first.
2. Before the write completes, the rules sidecar triggers a Prometheus reload.
3. Prometheus reads the still-empty (or partially written) config file and ends up with an empty config.
TBH I haven't studied the code in depth yet, but seeing a non-atomic write in the sidecar made it look suspicious.
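For reference on that non-atomicity, this is essentially what `ioutil.WriteFile` in the Go standard library does (the code behind the golang.org link in the discussion above): the `O_TRUNC` open empties the file before any of the new contents are written.

```go
package ioutilsketch

import (
	"io"
	"os"
)

// WriteFile as implemented in Go's io/ioutil package: note the O_TRUNC
// flag, which truncates the destination to zero length before a single
// byte of the new data has been written. A reader (or a Prometheus
// reload) landing between the open and the write sees an empty file.
func WriteFile(filename string, data []byte, perm os.FileMode) error {
	f, err := os.OpenFile(filename, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
	if err != nil {
		return err
	}
	n, err := f.Write(data)
	if err == nil && n < len(data) {
		err = io.ErrShortWrite
	}
	if err1 := f.Close(); err == nil {
		err = err1
	}
	return err
}
```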