
Race condition starting HEAD prometheus with static scrape configs #4551

Closed
csmarchbanks opened this Issue Aug 28, 2018 · 13 comments

csmarchbanks (Contributor) commented Aug 28, 2018

Bug Report

What did you do?
Ran Prometheus locally with some basic static scrape configs.

What did you expect to see?
Prometheus comes up and my targets are scraped.

What did you see instead? Under which circumstances?
Most of the time none of my static targets are scraped. Approximately 5-10% of the time my targets are scraped properly.

Environment

  • System information:

    Darwin 17.7.0 x86_64

  • Prometheus version:

    latest master (87f1dad)

  • Prometheus configuration file:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
  • Logs:
$ go run cmd/prometheus/main.go --config.file=prometheus.yml --log.level=debug
level=info ts=2018-08-28T03:39:41.803111877Z caller=main.go:235 msg="Starting Prometheus" version="(version=, branch=, revision=)"
level=info ts=2018-08-28T03:39:41.803208011Z caller=main.go:236 build_context="(go=go1.10, user=, date=)"
level=info ts=2018-08-28T03:39:41.803227773Z caller=main.go:237 host_details=(darwin)
level=info ts=2018-08-28T03:39:41.803248042Z caller=main.go:238 fd_limits="(soft=4864, hard=unlimited)"
level=info ts=2018-08-28T03:39:41.803265702Z caller=main.go:239 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-08-28T03:39:41.803878491Z caller=main.go:551 msg="Starting TSDB ..."
level=info ts=2018-08-28T03:39:41.80393762Z caller=web.go:395 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-08-28T03:39:41.831430515Z caller=main.go:561 msg="TSDB started"
level=info ts=2018-08-28T03:39:41.831492958Z caller=main.go:621 msg="Loading configuration file" filename=prometheus.yml
level=info ts=2018-08-28T03:39:41.832040894Z caller=main.go:647 msg="Completed loading of configuration file" filename=prometheus.yml
level=info ts=2018-08-28T03:39:41.83207218Z caller=main.go:520 msg="Server is ready to receive web requests."
level=debug ts=2018-08-28T03:39:41.832084248Z caller=manager.go:151 component="discovery manager scrape" msg="discovery receiver's channel was full"
level=debug ts=2018-08-28T03:39:41.83209375Z caller=manager.go:151 component="discovery manager scrape" msg="discovery receiver's channel was full"

This appears to be related to #4523; if I revert that commit, everything starts working. The "discovery receiver's channel was full" log line only shows up in the failure case.


krasi-georgiev (Member) commented Aug 28, 2018

At the moment this is to be expected. The main issue is that the Scrape manager blocks the Discovery manager's channel. Reverting the commit only removes the logging; the underlying problem is still there.
This will be fixed in #4526.
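
To make that concrete, here is a minimal, illustrative sketch (made-up names, not the actual discovery/scrape manager code) of how a busy receiver on an unbuffered Go channel stalls the sender:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	updates := make(chan string) // unbuffered: each send waits for a receive

	// Receiver (stand-in for the scrape manager): slow to apply each update.
	go func() {
		for u := range updates {
			time.Sleep(200 * time.Millisecond) // simulate slow work
			fmt.Println("applied:", u)
		}
	}()

	// Sender (stand-in for the discovery manager): each send blocks while
	// the receiver is still busy, so the sender's own loop stalls too.
	for i := 0; i < 3; i++ {
		start := time.Now()
		updates <- fmt.Sprintf("update %d", i)
		fmt.Printf("sent update %d after waiting %v\n", i, time.Since(start))
	}
}
```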

@csmarchbanks By the way, why do you think this is a race condition?

csmarchbanks (Contributor, author) commented Aug 28, 2018

I called it a race condition because sometimes the scrape configuration would load and sometimes it would not, which I have usually found to be caused by two concurrent tasks not finishing in the correct order. Maybe race was not the best word; I should have called it non-deterministic instead.

Going through the code some more, it looks like the old way would block until it could send an update on syncCh, and the new way just gives up right away, causing the issue.

#4526 does fix it for me, so I will just use that commit for local development until it is merged.

krasi-georgiev (Member) commented Aug 28, 2018

Yeah, that change shouldn't have changed the old behaviour, just added the additional debug logging.

"sometimes the scrape configuration would load, and sometimes it would not."

How do you know that the config didn't load?

In general, until #4526 is tested and merged, any scaling down of targets will be slow, since it happens serially instead of in parallel as in #4526.

csmarchbanks (Contributor, author) commented Aug 28, 2018

I am pretty sure that adding the default cases changes the behavior:
Old way: block until it is possible to write to syncCh, so the update is applied eventually.
New way: log a message and drop the update.
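
As an illustration of that difference (a minimal sketch with made-up names; syncCh here is just a local variable, not Prometheus's internal channel): a select with a default case drops the update if the receiver isn't ready yet, while a plain send waits and delivers it eventually.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	syncCh := make(chan string)
	update := "static targets for job 'prometheus'"

	// Receiver that only becomes ready a little later, e.g. because it is
	// still starting up.
	go func() {
		time.Sleep(50 * time.Millisecond)
		for u := range syncCh {
			fmt.Println("received:", u)
		}
	}()

	// New way: non-blocking send. The receiver is not ready yet, so the
	// default branch runs, the update is dropped, and only a log line is
	// left behind.
	select {
	case syncCh <- update:
		fmt.Println("update delivered")
	default:
		fmt.Println("channel was full, update dropped")
	}

	// Old way: plain send. This blocks until the receiver is ready, so the
	// update is delivered eventually instead of being lost.
	syncCh <- update

	time.Sleep(100 * time.Millisecond)
}
```

If the non-blocking attempt at startup is the only one made, the initial target groups are lost, which would match the empty /targets page described above.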

I know the scrape config didn't load properly because my /targets page is completely empty even though there should be two targets (prometheus and node-exporter). Since these are static targets, there shouldn't be any scaling down, should there?

krasi-georgiev (Member) commented Aug 28, 2018

Yes, I see what you mean.

Here is a PR, #4556, that fixes the behaviour.

I will look at it again tomorrow with fresh eyes, but feel free to comment if you see anything suspicious 👍

csmarchbanks (Contributor, author) commented Aug 29, 2018

👍 Thanks!

bwplotka (Contributor) commented Sep 3, 2018

I can reproduce the same thing from current master (sha 674c76a):

level=debug ts=2018-09-03T11:10:26.961711081Z caller=manager.go:186 component="discovery manager scrape" msg="discovery receiver's channel was full"

And no target shows up for my static, local scrape config. Any progress on this?

bwplotka (Contributor) commented Sep 3, 2018

Ah, do we potentially need #4526 for this? Will help with the review.

krasi-georgiev (Member) commented Sep 3, 2018

Yeah, #4556 (and even #4526) will fix the issue.

bwplotka added a commit to improbable-eng/thanos that referenced this issue Sep 3, 2018

e2e: Run Query Simple e2e tests against pinned version of Prometheus. (#504)

Problems: Because of prometheus/prometheus#4551, Thanos e2e tests were not passing (98% of the time). This caused CI to be almost always red.

Additionally: make format.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
bwplotka (Contributor) commented Sep 5, 2018

Can anyone confirm this is not happening on master?

csmarchbanks (Contributor, author) commented Sep 5, 2018

@Bplotka master is working great for me now.
