Marathon service discovery successful for 500 response with valid JSON in response body #4090

Closed
rohit01 opened this Issue Apr 16, 2018 · 2 comments

rohit01 commented Apr 16, 2018

Problem statement

We run multiple Prometheus clusters at large scale with Marathon service discovery. At times, Marathon responds with a 500 error whose body is still valid JSON. Marathon service discovery does not check the HTTP status code and processes the body anyway. The result is a job with no targets, so services go unmonitored until the next API call.

Use case. Why is this important?
We miss metrics intermittently.

Solution

Treat any non-2xx HTTP response as an error. This preserves the existing targets.

Bug Report

What did you do?
It happens intermittently on our production systems. I was able to reproduce it in staging with simulated Marathon responses.

What did you expect to see?
No target loss. No missing metrics.

What did you see instead? Under which circumstances?
A sudden drop in metrics for all containers at once, which triggers alerts.

  • Prometheus version:
    Prometheus 2.2.1

  • Alertmanager version:
    version 0.11.0

  • Prometheus configuration file:
    A standard config with marathon_sd_config

  • Logs:
    Nothing unusual in the logs.

rohit01 commented Apr 16, 2018

I raised a pull request to fix this issue: #4091

@beorn7 @fabxc @xperimental: Please take a look and let me know if you want any changes before merging it to master.

Thanks for your time and contributions :)

rohit01 added a commit to rohit01/prometheus that referenced this issue Apr 17, 2018

gouthamve added a commit to gouthamve/prometheus that referenced this issue Aug 1, 2018

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
