
Prometheus ended up with empty config after rule-related reload #2501

Closed
juliusv opened this issue Mar 20, 2019 · 23 comments

Comments

@juliusv
Contributor

juliusv commented Mar 20, 2019

Prometheus Operator version: v0.29.0
Prometheus: v2.7.2

Reloader sidecar Operator flags:

--config-reloader-image=quay.io/coreos/configmap-reload:v0.0.1
--prometheus-config-reloader=quay.io/coreos/prometheus-config-reloader:v0.29.0

We had an incident here where a rule-based change caused the rules-configmap-reloader sidecar to trigger a Prometheus reload, after which Prometheus ended up with an empty config (/config was empty: no scrape configs, just the defaults for the global config settings). The prometheus-config-reloader did not log any reloads around that time, so the config itself didn't actually change. However, from https://github.com/coreos/prometheus-operator/blob/650359b3e627ae97a1f18cbd10d7ed9b2293c240/vendor/github.com/improbable-eng/thanos/pkg/reloader/reloader.go#L156 it looks like the config reloader sidecar may write out the main config even when it hasn't really changed, but then only trigger a reload if the config contents differ: https://github.com/coreos/prometheus-operator/blob/650359b3e627ae97a1f18cbd10d7ed9b2293c240/vendor/github.com/improbable-eng/thanos/pkg/reloader/reloader.go#L194

The way the config reloader sidecar writes out the config is not atomic (since it doesn't go through a temporary file and rename). Is it possible that there is a race condition like this:

  1. config reloader sidecar notices watcher events on the main config
  2. rule reloader sidecar notices watcher events on the rules
  3. config reloader sidecar starts writing out the config (file initially empty / truncated)
  4. rule reloader sidecar writes out the new rules
  5. rule reloader sidecar triggers a Prometheus reload
  6. Prometheus reads the empty config and the new rules
  7. config reloader sidecar finishes writing the config
  8. config reloader does not actually trigger a config reload since it sees no real change in the hash

TBH I haven't studied the code in depth yet, but seeing a non-atomic write in the sidecar made this look suspicious.
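
For illustration, a minimal sketch of the non-atomic write pattern in question. This mirrors the general shape of ioutil.WriteFile, not the reloader's actual code, and the paths are hypothetical:

```go
package main

import (
	"log"
	"os"
)

// writeConfigNonAtomic mirrors the ioutil.WriteFile pattern: the target file
// is opened with O_TRUNC, so between the truncation and the final Write/Close
// a concurrent reader (here: Prometheus handling a reload triggered by the
// rule sidecar) can observe an empty or partially written config file.
func writeConfigNonAtomic(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0644)
	if err != nil {
		return err
	}
	// At this point the file is already truncated to zero bytes on disk;
	// a reload triggered right now would read an empty config.
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

func main() {
	// Hypothetical in/out paths, for illustration only.
	data, err := os.ReadFile("/etc/prometheus/config/prometheus.yaml")
	if err != nil {
		log.Fatal(err)
	}
	if err := writeConfigNonAtomic("/etc/prometheus/config_out/prometheus.env.yaml", data); err != nil {
		log.Fatal(err)
	}
}
```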

@metalmatze
Member

Interesting. Your explanation sure sounds like it could be exactly that kind of race. Do you think it is possible to reproduce this and then work on it? 🤔

@juliusv
Contributor Author

juliusv commented Mar 21, 2019

@metalmatze Not sure how easy it is to reproduce reliably, but I'll give it a try!

@juliusv
Contributor Author

juliusv commented Mar 21, 2019

After repeatedly re-applying rules in a sandbox cluster (but leaving the main config as is), I got Prometheus to log a parse error about the main config:

level=error ts=2019-03-21T16:26:36.737700944Z caller=main.go:550 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: yaml: line 610: could not find expected ':'"

So in this case it bails out at line 610, probably because the file is in the middle of being written out.

I will see if I can reproduce something like this at least a couple of times, then fix the sidecar and verify that the problem does not appear again.

@brancz
Contributor

brancz commented Mar 22, 2019

cc @s-urbaniak I think we have recently seen this in one of our environments as well, no?

@s-urbaniak
Contributor

Nearly, I am also still stabbing in the dark here. In our case the config file was not there in the first place:

Relevant Prometheus log entries:

level=info ts=2019-03-17T05:39:01.342Z caller=main.go:620 msg="Starting TSDB ..."
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:635 msg="TSDB started"
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:695 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:489 msg="Stopping scrape discovery manager..."
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:503 msg="Stopping notify discovery manager..."
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:525 msg="Stopping scrape manager..."
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:499 msg="Notify discovery manager stopped"
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:485 msg="Scrape discovery manager stopped"
level=info ts=2019-03-17T05:39:01.352Z caller=main.go:519 msg="Scrape manager stopped"
level=info ts=2019-03-17T05:39:01.352Z caller=manager.go:736 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-03-17T05:39:01.352Z caller=manager.go:742 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-03-17T05:39:01.353Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2019-03-17T05:39:01.353Z caller=main.go:679 msg="Notifier manager stopped"
level=error ts=2019-03-17T05:39:01.354Z caller=main.go:688 err="error loading config from \"/etc/prometheus/config_out/prometheus.env.yaml\": couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"

config-reloader logs:

ts=2019-03-17T05:39:04.034486905Z caller=main.go:70 msg="Starting prometheus-config-reloader version '0.27.0'."
level=info ts=2019-03-17T05:39:04.034665919Z caller=reloader.go:77 msg="started watching config file for changes" in=/etc/prometheus/config/prometheus.yaml out=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2019-03-17T05:39:04.039039746Z caller=runutil.go:43 msg="function failed. Retrying" err="trigger reload: reload request failed: Post http://localhost:9090/-/reload: dial tcp [::1]:9090: connect: connection refused"
level=info ts=2019-03-17T05:39:09.329955433Z caller=reloader.go:182 msg="Prometheus reload triggered" cfg_in=/etc/prometheus/config/prometheus.yaml cfg_out=/etc/prometheus/config_out/prometheus.env.yaml rule_dir=
level=debug ts=2019-03-17T05:39:09.33002366Z caller=reloader.go:97 msg="received watch event" op=CHMOD name=/etc/prometheus/config/prometheus.yaml

Note: so far I have not been able to reproduce this, and the CI failures referenced in the originally reported issue at https://bugzilla.redhat.com/show_bug.cgi?id=1690162 have not reproduced it for some time either.

@juliusv Do you see any suspicious entries in the config reloader logs, maybe?

@s-urbaniak
Contributor

s-urbaniak commented Mar 22, 2019

Having said that (thanks @lucab for the verification ❤️!), @juliusv you are definitely right about the missing atomicity of WriteFile.

As per https://golang.org/src/io/ioutil/ioutil.go?s=2534:2602#L69 there can be a lot happening concurrently between os.OpenFile and f.Close() if processes other than the config reloader are accessing that configuration file (in our case, Prometheus itself).

Just to double-check: with atomic moves, the following scenario would happen instead.

  1. config reloader sidecar notices watcher events on the main config
  2. rule reloader sidecar notices watcher events on the rules
  3. config reloader sidecar starts writing out the config to tmp-file-xxxx, not finished yet
  4. rule reloader sidecar writes out the new rules to tmp-file-yyyy and moves it to the target location
  5. rule reloader sidecar triggers a Prometheus reload
  6. Prometheus reads the old (but at least consistent) config and the new rules
  7. config reloader sidecar finishes writing the config and moves it to the target location
  8. config reloader does not actually trigger a config reload since it sees no real change in the hash

If we do have configuration changes at the last step, we would get a double reload of Prometheus, initially with an inconsistent set of configuration and rules.

PS: we need to ensure those moves happen on the same filesystem (i.e. the temp file is created in the same directory as the target) to get the atomicity guarantee.
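
A minimal sketch of the write-to-temp-file-then-rename approach described above, assuming the temp file is created in the target's directory; the package and function names are illustrative, not the actual reloader code:

```go
package reloader

import (
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file created next to the target
// and then renames it into place. The rename is atomic on POSIX filesystems
// as long as source and target live on the same filesystem (hence the same
// directory), so a concurrent reader only ever sees the old or the new
// complete file, never a truncated one.
func writeFileAtomic(target string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(target), filepath.Base(target)+".tmp-")
	if err != nil {
		return err
	}
	// Best-effort cleanup; this is a no-op once the rename has succeeded,
	// so its error can be ignored.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	// Flush the contents to disk before making them visible via the rename.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), target)
}
```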

@juliusv
Contributor Author

juliusv commented Mar 22, 2019

@s-urbaniak Good point about the same filesystem. Yeah, it could just be written into the same directory, but as a temporary file first.

Regarding suspicious entries in the config reloader: no, it doesn't log anything unless it actually reloads the config (which it isn't doing in my case; it just writes the config out again, but triggers no reload since the hash is the same).

While I cannot reliably reproduce the issue (not too surprising), I managed to get it to break once in a synthetic setup yesterday. I would propose to just make the write atomic and then see if we get any reports like this again?
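
For context, a rough sketch of the hash-gated reload decision being described here (the type, field, and function names are made up for illustration, not the reloader's actual code):

```go
package reloader

import (
	"crypto/sha256"
	"os"
)

// reloadIfChanged hashes the written-out config and only triggers a
// Prometheus reload when the hash differs from the previous one, which is
// why rewriting identical content produces no log line and no reload.
type reloadIfChanged struct {
	lastHash [sha256.Size]byte
	trigger  func() error // e.g. POST to http://localhost:9090/-/reload
}

func (r *reloadIfChanged) maybeReload(cfgPath string) error {
	data, err := os.ReadFile(cfgPath)
	if err != nil {
		return err
	}
	h := sha256.Sum256(data)
	if h == r.lastHash {
		// Content unchanged: nothing to do, nothing logged.
		return nil
	}
	if err := r.trigger(); err != nil {
		return err
	}
	r.lastHash = h
	return nil
}
```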

@metalmatze
Member

@juliusv Yes, that's what we discussed as well. Moving the temporary file to the correct location should be atomic and solve our problems.

@s-urbaniak
Contributor

@juliusv yes, having the atomic semantics in place is definitely the correct way, let me submit a patch for that (unless you want to take over) 👍

@juliusv
Contributor Author

juliusv commented Mar 22, 2019

@s-urbaniak Happy to do it!

juliusv added a commit to juliusv/thanos that referenced this issue Mar 22, 2019
This addresses an issue found in the Prometheus Operator, which reuses
this reloader sidecar, but which then also has a second sidecar which
may trigger rule-based reloads while the config sidecar is in the middle
of writing out its config (in a non-atomic way):

prometheus-operator/prometheus-operator#2501

Signed-off-by: Julius Volz <julius.volz@gmail.com>
juliusv added a commit to juliusv/thanos that referenced this issue Mar 22, 2019
This addresses an issue found in the Prometheus Operator, which reuses
this reloader sidecar, but which then also has a second sidecar which
may trigger rule-based reloads while the config sidecar is in the middle
of writing out its config (in a non-atomic way):

prometheus-operator/prometheus-operator#2501

I didn't add a test for this because it's hard to catch the original
problem to begin with, but it has happened.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
@juliusv
Contributor Author

juliusv commented Mar 22, 2019

bwplotka pushed a commit to thanos-io/thanos that referenced this issue Mar 22, 2019
* Make config reloader file writes atomic

This addresses an issue found in the Prometheus Operator, which reuses
this reloader sidecar, but which then also has a second sidecar which
may trigger rule-based reloads while the config sidecar is in the middle
of writing out its config (in a non-atomic way):

prometheus-operator/prometheus-operator#2501

I didn't add a test for this because it's hard to catch the original
problem to begin with, but it has happened.

Signed-off-by: Julius Volz <julius.volz@gmail.com>

* Explicitly ignore os.Remove() error

Signed-off-by: Julius Volz <julius.volz@gmail.com>
@metalmatze
Member

Now we need to bump the dependency over here ☺️

@juliusv
Contributor Author

juliusv commented Mar 22, 2019

@metalmatze on it

@juliusv
Contributor Author

juliusv commented Mar 22, 2019

see #2504

@shaunmansell

We've been hitting this issue quite frequently with our Prometheus Operator deployments so I'm quite keen to get this fix deployed.

Is there an ETA on when a release containing this will be available?

@brancz
Contributor

brancz commented Apr 15, 2019

I think we've collected a number of things. This is a short week in Germany so I can't promise we'll get to it this week, but likely by next week there should be a release out. Not promising anything though, things sometimes get in the way 🙂.

@shaunmansell

Appreciate the update @brancz!

@deadlytea

I'm seeing this quite frequently as I am making a lot of rule updates, so I'm happy to hear a fix is on the way. To fix this, we just need to update the prometheus-config-reloader sidecar release version when it's available?

@lucab

lucab commented May 29, 2019

I think https://github.com/coreos/prometheus-operator/releases/tag/v0.30.0 already includes the fix for this.

@vsliouniaev
Contributor

@lucab Unfortunately it does not, since the migration to Go modules accidentally reverted the fix.

@s-urbaniak
Contributor

^ ping @metalmatze

@vsliouniaev
Contributor

This has been fixed in v0.30.1!

@brancz brancz closed this as completed Jun 5, 2019
@simonpasquier
Contributor

For people who may stumble upon this issue: it should have been completely fixed by #3457 (prometheus-operator >= v0.42.0).
