
Prometheus ended up with empty config after rule-related reload #2501

Closed
juliusv opened this issue Mar 20, 2019 · 23 comments

Comments

@juliusv
Contributor

juliusv commented Mar 20, 2019

Prometheus Operator version: v0.29.0
Prometheus: v2.7.2

Reloader sidecar Operator flags:

--config-reloader-image=quay.io/coreos/configmap-reload:v0.0.1
--prometheus-config-reloader=quay.io/coreos/prometheus-config-reloader:v0.29.0

We had an incident here where a rule-based change caused the rules-configmap-reloader sidecar to trigger a Prometheus reload, after which Prometheus ended up with an empty config (/config was empty: no scrape configs, just the defaults for the global config settings). The prometheus-config-reloader did not log any reloads around that time, so the config itself didn't actually change. However, from https://github.com/coreos/prometheus-operator/blob/650359b3e627ae97a1f18cbd10d7ed9b2293c240/vendor/github.com/improbable-eng/thanos/pkg/reloader/reloader.go#L156 it looks like the config reloader sidecar may write out the main config even when it hasn't really changed, but then only trigger a reload if the config contents differ: https://github.com/coreos/prometheus-operator/blob/650359b3e627ae97a1f18cbd10d7ed9b2293c240/vendor/github.com/improbable-eng/thanos/pkg/reloader/reloader.go#L194

The way the config reloader sidecar writes out the config is not atomic (since it doesn't go through a temporary file and rename). Is it possible that there is a race condition like this:

  1. config reloader sidecar notices watcher events on the main config
  2. rule reloader sidecar notices watcher events on the rules
  3. config reloader sidecar starts writing out the config (file initially empty / truncated)
  4. rule reloader sidecar writes out the new rules
  5. rule reloader sidecar triggers a Prometheus reload
  6. Prometheus reads the empty config and the new rules
  7. config reloader sidecar finishes writing the config
  8. config reloader does not actually trigger a config reload since it sees no real change in the hash

TBH I haven't studied the code in depth yet, but seeing a non-atomic write in the sidecar made this look suspicious.
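
For illustration, a minimal sketch of the non-atomic write pattern in question. This mirrors the general shape of ioutil.WriteFile, not the reloader's actual code, and the paths are hypothetical:

```go
package main

import (
	"log"
	"os"
)

// writeConfigNonAtomic mirrors the ioutil.WriteFile pattern: the target file
// is opened with O_TRUNC, so between the truncation and the final Write/Close
// a concurrent reader (here: Prometheus handling a reload triggered by the
// rule sidecar) can observe an empty or partially written config file.
func writeConfigNonAtomic(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0644)
	if err != nil {
		return err
	}
	// At this point the file is already truncated to zero bytes on disk;
	// a reload triggered right now would read an empty config.
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

func main() {
	// Hypothetical in/out paths, for illustration only.
	data, err := os.ReadFile("/etc/prometheus/config/prometheus.yaml")
	if err != nil {
		log.Fatal(err)
	}
	if err := writeConfigNonAtomic("/etc/prometheus/config_out/prometheus.env.yaml", data); err != nil {
		log.Fatal(err)
	}
}
```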

@metalmatze
Member

Interesting. Your explanation sure sounds like it could be exactly that kind of race. Do you think it is possible to reproduce this and then work on it? 🤔

@juliusv
Contributor Author

juliusv commented Mar 21, 2019

@metalmatze Not sure how easy it is to reproduce reliably, but I'll give it a try!

@juliusv
Contributor Author

juliusv commented Mar 21, 2019

After repeatedly re-applying rules in a sandbox cluster (but leaving the main config as is), I got Prometheus to log a parse error about the main config:

level=error ts=2019-03-21T16:26:36.737700944Z caller=main.go:550 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: yaml: line 610: could not find expected ':'"

So in this case it bails out at line 610, probably because the file is in the middle of being written out.

I will see if I can reproduce something like this at least a couple of times, then fix the sidecar and verify that the problem does not appear again.

@brancz
Contributor

brancz commented Mar 22, 2019

cc @s-urbaniak I think we have recently seen this in one of our environments as well, no?

@s-urbaniak
Contributor

Nearly, I am also still stabbing in the dark here. In our case the config file was not there in the first place:

Relevant Prometheus log entries:

level=info ts=2019-03-17T05:39:01.342Z caller=main.go:620 msg="Starting TSDB ..."
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:635 msg="TSDB started"
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:695 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:489 msg="Stopping scrape discovery manager..."
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:503 msg="Stopping notify discovery manager..."
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:525 msg="Stopping scrape manager..."
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:499 msg="Notify discovery manager stopped"
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:485 msg="Scrape discovery manager stopped"
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:519 msg="Scrape manager stopped"
level=info ts=2019-03-17T05:39:01.352Z caller=manager.go:736 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-03-17T05:39:01.352Z caller=manager.go:742 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-03-17T05:39:01.353Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2019-03-17T05:39:01.353Z caller=main.go:679 msg="Notifier manager stopped"
level=error ts=2019-03-17T05:39:01.354Z caller=main.go:688 err="error loading config from \"/etc/prometheus/config_out/prometheus.env.yaml\": couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"

config-reloader logs:

ts=2019-03-17T05:39:04.034486905Z caller=main.go:70 msg="Starting prometheus-config-reloader version '0.27.0'."
level=info ts=2019-03-17T05:39:04.034665919Z caller=reloader.go:77 msg="started watching config file for changes" in=/etc/prometheus/config/prometheus.yaml out=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2019-03-17T05:39:04.039039746Z caller=runutil.go:43 msg="function failed. Retrying" err="trigger reload: reload request failed: Post http://localhost:9090/-/reload: dial tcp [::1]:9090: connect: connection refused"
level=info ts=2019-03-17T05:39:09.329955433Z caller=reloader.go:182 msg="Prometheus reload triggered" cfg_in=/etc/prometheus/config/prometheus.yaml cfg_out=/etc/prometheus/config_out/prometheus.env.yaml rule_dir=
level=debug ts=2019-03-17T05:39:09.33002366Z caller=reloader.go:97 msg="received watch event" op=CHMOD name=/etc/prometheus/config/prometheus.yaml

Note: so far I have not been able to reproduce this, and the CI failures referenced in the originally reported issue at https://bugzilla.redhat.com/show_bug.cgi?id=1690162 have not reproduced it for some time either.

@juliusv Do you see any suspicious entries in the config reloader logs, maybe?

@s-urbaniak
Contributor

s-urbaniak commented Mar 22, 2019

Having said that (thanks @lucab for the verification ❤️!), @juliusv you are definitely right about the missing atomicity of WriteFile.

As per https://golang.org/src/io/ioutil/ioutil.go?s=2534:2602#L69 there can be a lot happening concurrently between os.OpenFile and f.Close() if processes other than the config reloader are accessing that configuration file (in our case, Prometheus itself).

Just to double-check: with atomic moves, the following scenario would happen instead.

  1. config reloader sidecar notices watcher events on the main config
  2. rule reloader sidecar notices watcher events on the rules
  3. config reloader sidecar starts writing out the config to tmp-file-xxxx, not finished yet
  4. rule reloader sidecar writes out the new rules to tmp-file-yyyy and moves it to the target location
  5. rule reloader sidecar triggers a Prometheus reload
  6. Prometheus reads the old (but at least consistent) config and the new rules
  7. config reloader sidecar finishes writing the config and moves it to the target location
  8. config reloader does not actually trigger a config reload since it sees no real change in the hash

If we do have configuration changes at the last step, we would get a double reload of Prometheus, initially with an inconsistent set of configuration and rules.

PS: we need to ensure those moves happen on the same filesystem (i.e. the temp file is created in the same directory as the target) to get the atomicity guarantee.
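
A minimal sketch of the write-to-temp-file-then-rename approach described above, assuming the temp file is created in the target's directory; the package and function names are illustrative, not the actual reloader code:

```go
package reloader

import (
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file created next to the target
// and then renames it into place. The rename is atomic on POSIX filesystems
// as long as source and target live on the same filesystem (hence the same
// directory), so a concurrent reader only ever sees the old or the new
// complete file, never a truncated one.
func writeFileAtomic(target string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(target), filepath.Base(target)+".tmp-")
	if err != nil {
		return err
	}
	// Best-effort cleanup; this is a no-op once the rename has succeeded,
	// so its error can be ignored.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	// Flush the contents to disk before making them visible via the rename.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), target)
}
```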

@juliusv
Contributor Author

juliusv commented Mar 22, 2019

@s-urbaniak Good point about the same filesystem. Yeah, it could just be written into the same directory, but as a temporary file first.

Regarding suspicious entries in the config reloader: no, it doesn't log anything unless it actually reloads the config (which it isn't doing in my case; it just writes the config out again, but triggers no reload since the hash is the same).

While I cannot reliably reproduce the issue (not too surprising), I managed to get it to break once in a synthetic setup yesterday. I would propose to just make the write atomic and then see if we get any reports like this again?
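
For context, a rough sketch of the hash-gated reload decision being described here (the type, field, and function names are made up for illustration, not the reloader's actual code):

```go
package reloader

import (
	"crypto/sha256"
	"os"
)

// reloadIfChanged hashes the written-out config and only triggers a
// Prometheus reload when the hash differs from the previous one, which is
// why rewriting identical content produces no log line and no reload.
type reloadIfChanged struct {
	lastHash [sha256.Size]byte
	trigger  func() error // e.g. POST to http://localhost:9090/-/reload
}

func (r *reloadIfChanged) maybeReload(cfgPath string) error {
	data, err := os.ReadFile(cfgPath)
	if err != nil {
		return err
	}
	h := sha256.Sum256(data)
	if h == r.lastHash {
		// Content unchanged: nothing to do, nothing logged.
		return nil
	}
	if err := r.trigger(); err != nil {
		return err
	}
	r.lastHash = h
	return nil
}
```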

@metalmatze
Member

@juliusv Yes, that's what we discussed as well. Moving the temporary file to the correct location should be atomic and solve our problems.

@s-urbaniak
Contributor

@juliusv yes, having the atomic semantics in place is definitely the correct way, let me submit a patch for that (unless you want to take over) 👍

@juliusv
Contributor Author

juliusv commented Mar 22, 2019

@s-urbaniak Happy to do it!

juliusv added a commit to juliusv/thanos that referenced this issue Mar 22, 2019
This addresses an issue found in the Prometheus Operator, which reuses
this reloader sidecar, but which then also has a second sidecar which
may trigger rule-based reloads while the config sidecar is in the middle
of writing out its config (in a non-atomic way):

prometheus-operator/prometheus-operator#2501

Signed-off-by: Julius Volz <julius.volz@gmail.com>
juliusv added a commit to juliusv/thanos that referenced this issue Mar 22, 2019
This addresses an issue found in the Prometheus Operator, which reuses
this reloader sidecar, but which then also has a second sidecar which
may trigger rule-based reloads while the config sidecar is in the middle
of writing out its config (in a non-atomic way):

prometheus-operator/prometheus-operator#2501

I didn't add a test for this because it's hard to catch the original
problem to begin with, but it has happened.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
@juliusv
Contributor Author

juliusv commented Mar 22, 2019

bwplotka pushed a commit to thanos-io/thanos that referenced this issue Mar 22, 2019
* Make config reloader file writes atomic

This addresses an issue found in the Prometheus Operator, which reuses
this reloader sidecar, but which then also has a second sidecar which
may trigger rule-based reloads while the config sidecar is in the middle
of writing out its config (in a non-atomic way):

prometheus-operator/prometheus-operator#2501

I didn't add a test for this because it's hard to catch the original
problem to begin with, but it has happened.

Signed-off-by: Julius Volz <julius.volz@gmail.com>

* Explicitly ignore os.Remove() error

Signed-off-by: Julius Volz <julius.volz@gmail.com>
@metalmatze
Member

Now we need to bump the dependency over here ☺️

@juliusv
Contributor Author

juliusv commented Mar 22, 2019

@metalmatze on it

@juliusv
Contributor Author

juliusv commented Mar 22, 2019

see #2504

@shaunmansell

We've been hitting this issue quite frequently with our Prometheus Operator deployments so I'm quite keen to get this fix deployed.

Is there an ETA on when a release containing this will be available?

@brancz
Contributor

brancz commented Apr 15, 2019

I think we've collected a number of things. This is a short week in Germany so I can't promise we'll get to it this week, but likely by next week there should be a release out. Not promising anything though, things sometimes get in the way 🙂.

@shaunmansell

Appreciate the update @brancz!

@deadlytea

I'm seeing this quite frequently as I am making a lot of rule updates, so I'm happy to hear a fix is on the way. To fix this, we just need to update the prometheus-config-reloader sidecar release version when it's available?

@lucab

lucab commented May 29, 2019

I think https://github.com/coreos/prometheus-operator/releases/tag/v0.30.0 already includes the fix for this.

@vsliouniaev
Contributor

@lucab Unfortunately it does not, since the migration to Go modules accidentally reverted the fix.

@s-urbaniak
Contributor

^ ping @metalmatze

@vsliouniaev
Contributor

This has been fixed in v0.30.1!

@brancz brancz closed this as completed Jun 5, 2019
@simonpasquier
Contributor

For people who may stumble upon this issue: it should have been completely fixed by #3457 (prometheus-operator >= v0.42.0).
