[prometheus-thanos] compactor missing liveness and readiness probes #448

Open
mhyllander opened this issue May 18, 2022 · 0 comments

Is this a request for help?: no


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Version of Helm and Kubernetes:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.6", GitCommit:"f59f5c2fda36e4036b49ec027e556a15456108f0", GitTreeState:"clean", BuildDate:"2022-01-19T17:33:06Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.6", GitCommit:"07959215dd83b4ae6317b33c824f845abd578642", GitTreeState:"clean", BuildDate:"2022-03-30T18:28:25Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
$ helm version
version.BuildInfo{Version:"v3.8.2", GitCommit:"6e3701edea09e5d55a8ca2aae03a68917630e91b", GitTreeState:"clean", GoVersion:"go1.17.5"}

Which chart in which version:
prometheus-thanos 4.9.3

What happened:
The Thanos compactor can shut down without exiting: its ready and healthy probe states change, but the process keeps running. Because the chart defines no liveness probe, the unhealthy state goes undetected and the pod is never restarted.

Logs:

level=warn ts=2022-05-17T12:08:22.350952394Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="BaseFetcher: iter bucket: context deadline exceeded"
level=info ts=2022-05-17T12:08:22.350965294Z caller=http.go:74 service=http/server component=compact msg="internal server is shutting down" err="BaseFetcher: iter bucket: context deadline exceeded"
level=info ts=2022-05-17T12:08:22.352757502Z caller=http.go:93 service=http/server component=compact msg="internal server is shutdown gracefully" err="BaseFetcher: iter bucket: context deadline exceeded"
level=info ts=2022-05-17T12:08:22.352802302Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="BaseFetcher: iter bucket: context deadline exceeded"
level=error ts=2022-05-17T12:08:22.840731904Z caller=compact.go:480 msg="critical error detected; halting" err="compaction: group 0@10346066409509485645: compact blocks [data/compact/0@10346066409509485645/01EZE4EQFSY10D4BD48CH48ZFZ data/compact/0@10346066409509485645/01EZEBAEQTX88ADDJANYM36YMV data/compact/0@10346066409509485645/01EZEJ65ZQ0MBE1TFPN3MYDQ4A data/compact/0@10346066409509485645/01EZES1X7R80XQ4XBK1GK4XT5H]: 2 errors: populate block: add series: context canceled; context canceled"

The /-/ready and /-/healthy endpoints were added to the compactor back in 2019, but the corresponding readiness and liveness probes are missing from the chart. (As noted in that issue, the readiness probe is not strictly needed, since the compactor does not serve requests, but the liveness probe should be there.)
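A liveness probe on the compactor container would cover this case. A minimal sketch of what such a probe could look like in the chart's Deployment template, assuming the compactor serves its HTTP endpoints on the default port 10902; the port and thresholds are assumptions, not values taken from the chart:

```yaml
# Hypothetical addition to the compactor container spec.
# Port number and thresholds are assumptions, not chart defaults.
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 10902
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 4
```

With this in place, the not-healthy state shown in the logs above would cause kubelet to restart the pod after the failure threshold is reached.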

What you expected to happen:
The unhealthy state should be detected and the compactor pod restarted when such an error occurs.

How to reproduce it (as minimally and precisely as possible):
In our case, the compactor's internal HTTP server can apparently time out and give up after a number of retries. When this happens, the process goes idle but does not exit. We are using Azure storage, with all timeouts configured to 60s and 5 retries.
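For reference, the timeout and retry settings mentioned above live in the Thanos objstore configuration. A rough sketch of the Azure config described ("all timeouts to 60s with 5 retries"); field names follow the Thanos Azure storage documentation of that era and may differ between Thanos versions, and the account placeholders are illustrative:

```yaml
# Hypothetical Thanos objstore config illustrating the 60s/5-retry setup.
# Field names may vary by Thanos version; placeholders are not real values.
type: AZURE
config:
  storage_account: "<account>"
  storage_account_key: "<key>"
  container: "<container>"
  max_retries: 5
  pipeline_config:
    max_tries: 5
    try_timeout: 60s
  reader_config:
    max_retry_requests: 5
```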

Anything else we need to know:
