Enhance healthy endpoint to detect unhealthy upstreams #195

@luizberti

Description

The healthcheck probe (/-/healthy) seems to be useless at the moment, doing nothing more than what the readiness probe (/-/ready) should do.

In my current situation, the Promtail integration was working correctly and pushing to Loki, but the Prometheus one was botched. The healthcheck probe should not state that the agent is healthy when an upstream is botched. The correct behaviour would be to report the agent as unhealthy if any upstream is faulty, unless the user hasn't configured any upstream to forward metrics to (such as cases where they want to scrape the Agent's /metrics themselves).

On that same note, the accompanying readiness probe should only report a 200 if the config parsed correctly, all internal components have started successfully, and I can see things on /metrics on the agent (even if remote_write is failing, or some SD configuration is failing, as those should be reported on the healthcheck instead). I assume this is what already happens but haven't validated it, so I'm just mentioning it here for reassurance.
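The readiness conditions in the paragraph above could be sketched as follows. Again, this is only an illustration under assumptions: the `readiness` type, its fields, `readyHandler` and `probeReady` are hypothetical and do not reflect how the Agent is actually implemented. Upstream state is deliberately not consulted here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness holds the three conditions described in the issue (hypothetical type).
type readiness struct {
	configParsed      atomic.Bool // config file was parsed without errors
	componentsStarted atomic.Bool // all internal components finished starting
	metricsServing    atomic.Bool // /metrics on the agent is being served
}

// readyHandler answers 200 only when all three conditions hold.
// Failing remote_write or SD is intentionally ignored here.
func (r *readiness) readyHandler(w http.ResponseWriter, _ *http.Request) {
	switch {
	case !r.configParsed.Load():
		http.Error(w, "config not parsed", http.StatusServiceUnavailable)
	case !r.componentsStarted.Load():
		http.Error(w, "components still starting", http.StatusServiceUnavailable)
	case !r.metricsServing.Load():
		http.Error(w, "/metrics not available", http.StatusServiceUnavailable)
	default:
		fmt.Fprintln(w, "ready")
	}
}

// probeReady calls the handler once in-process and returns the status code.
func probeReady(r *readiness) int {
	rec := httptest.NewRecorder()
	r.readyHandler(rec, httptest.NewRequest(http.MethodGet, "/-/ready", nil))
	return rec.Code
}

func main() {
	r := &readiness{}
	fmt.Println(probeReady(r)) // nothing started yet: 503

	r.configParsed.Store(true)
	r.componentsStarted.Store(true)
	r.metricsServing.Store(true)
	fmt.Println(probeReady(r)) // all conditions met: 200
}
```

Keeping this separate from the health check would let readiness gate traffic and startup, while upstream failures surface only on /-/healthy.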

In my case, the remote_write URL was botched, but I've also had this happen when I applied a manifest with botched credentials and the upstream was returning a 403, and I've experienced the same problem with Promtail (not the in-agent one). The 404 error from the latest occurrence is below:

ts=2020-09-23T20:23:27.251090321Z caller=dedupe.go:112 agent=prometheus instance=811f01a1f9d2e40bc66826c12931e4c8 component=remote level=error remote_name=811f01-83fd73 url=http://prometheus/api/prom/push msg="non-recoverable error" count=3 err="server returned HTTP status 404 Not Found: Not Found"

As it stands, even when doing a gradual rollout on Kubernetes, there is nothing that can stop the rollout if I'm pushing a botched configuration and killing all of the cluster's monitoring, unless I were to make a disgusting log-parsing hack, but I am not fond of that idea 😅

Metadata

    Labels

    frozen-due-to-age: Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.
    stale: Issue/PR marked as stale due to lack of activity
