config-reloader doesn't reload a valid configuration after an invalid configuration was pushed #4708

Closed
simonpasquier opened this issue Apr 7, 2022 · 5 comments

Comments

@simonpasquier
Contributor

The issue was originally described in #4705, but that issue was closed because the root cause (scrape timeout > scrape interval) was fixed in v0.54.1.

How to reproduce

  • Deploy a Prometheus object with 2 replicas and wait for the replicas to be ready.
  • Edit the Prometheus object to trigger an invalid configuration (e.g. spec.queryLogFile: tmp/query.log)
  • Watch the config reloader logs and wait to see errors.
  • Revert the changes to the Prometheus object.

I'd expect the reloader to reload Prometheus with the good configuration but instead it stops doing anything.

Logs:

level=error ts=2022-04-04T12:42:55.379011742Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:00.373310041Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:05.377459791Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:10.376687695Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:15.386690876Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:20.380568484Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:25.385406333Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:30.384269587Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:35.376230515Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:40.396371017Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:45.375128759Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: received non-200 response: 500 Internal Server Error; have you set `--web.enable-lifecycle` Prometheus flag?"
level=error ts=2022-04-04T12:43:50.370402327Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": context deadline exceeded"
level=error ts=2022-04-04T12:43:50.37050166Z caller=reloader.go:382 msg="Failed to trigger reload. Retrying." err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": context deadline exceeded"

And the logs stop there. The "context deadline exceeded" message is suspicious, but it coincides with the good configuration being restored.

@heylongdacoder
Contributor

config-reloader tries to reload a config with a three-minute timeout. Once it enters the reload stage, it won't accept any new config, because the config is unzipped during the reload stage. Hence, reloading a faulty config makes the reloader retry until it times out, and I believe the context deadline message is due to this. The new config then triggers a new reload, but this line of code stops the reload, which is why the logs stop. I am not sure whether this can be considered a bug. If it is, maybe the condition could be changed to something like if r.lastReloadSuccess && (bytes.Equal(r.lastCfgHash, cfgHash) && bytes.Equal(r.lastWatchedDirsHash, watchedDirsHash)). I am happy to do a PR if this is considered a bug :)
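
To make the suggested condition concrete, here is a minimal, self-contained sketch. It is not the actual reloader code: the `reloaderState` type and the two method names are hypothetical, `lastReloadSuccess` is modelled as a plain bool, and the sketch assumes the existing check compares only the two hashes. Only the field names and the proposed expression come from the comment above.

```go
package main

import (
	"bytes"
	"fmt"
)

// reloaderState is a hypothetical stand-in for the reloader's bookkeeping.
// The field names follow the comment above; the type itself is an assumption.
type reloaderState struct {
	lastCfgHash         []byte
	lastWatchedDirsHash []byte
	lastReloadSuccess   bool // assumed to be a plain bool for this sketch
}

// skipOnHashOnly models the assumed existing check: skip the reload whenever
// both hashes are unchanged, regardless of how the last reload ended.
func (r *reloaderState) skipOnHashOnly(cfgHash, watchedDirsHash []byte) bool {
	return bytes.Equal(r.lastCfgHash, cfgHash) &&
		bytes.Equal(r.lastWatchedDirsHash, watchedDirsHash)
}

// skipProposed models the suggested check: skip only when the last reload
// succeeded and both hashes are unchanged.
func (r *reloaderState) skipProposed(cfgHash, watchedDirsHash []byte) bool {
	return r.lastReloadSuccess &&
		(bytes.Equal(r.lastCfgHash, cfgHash) &&
			bytes.Equal(r.lastWatchedDirsHash, watchedDirsHash))
}

func main() {
	good := []byte("hash-of-good-config")
	dirs := []byte("hash-of-watched-dirs")

	// State after a failed reload of a bad config, followed by a revert:
	// the incoming hashes match the stored ones, but the last reload failed.
	r := &reloaderState{
		lastCfgHash:         good,
		lastWatchedDirsHash: dirs,
		lastReloadSuccess:   false,
	}

	fmt.Println(r.skipOnHashOnly(good, dirs)) // true: reload is skipped
	fmt.Println(r.skipProposed(good, dirs))   // false: reload is attempted
}
```

With the extra `lastReloadSuccess` guard, a reverted config whose hashes match the stored ones still triggers a reload when the previous attempt failed.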

@simonpasquier
Contributor Author

This looks like a bug to me indeed. Do not hesitate to bring it to the Thanos project :)

@heylongdacoder
Contributor

alright, I will do that :D

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jun 19, 2022
@simonpasquier
Contributor Author

closed by #4887 :)
