"Context deadline exceeded" for responsive target #2459

Closed
AlexRudd opened this Issue Mar 1, 2017 · 19 comments

AlexRudd commented Mar 1, 2017

What did you do?

Tried to run a Prometheus server in a Docker container and scrape metrics from a node_exporter instance running in a separate but linked container (see the Environment section).

Used a basic config which scrapes the linked container, via its container-name DNS address, every 10 seconds.

Noticed that this very basic target was showing as DOWN in the Prometheus dashboard Targets view, with the error "context deadline exceeded".

I attached to the running Prometheus container and timed a manual scrape of the target using time and wget; this worked as expected:

$ docker exec prometheus time wget -qO- http://node_exporter:9100/metrics
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000302033
go_gc_duration_seconds{quantile="0.25"} 0.0005395150000000001
... (other metrics) ...
real	0m 0.01s
user	0m 0.00s
sys	0m 0.00s

Created a separate target group with scrape_interval and scrape_timeout both increased to 30s, still targeting the same node_exporter instance. Scrapes for this group are successful but appear to take a long time (Last Scrape is reported as > 36s ago, which I assume means the scrape took upwards of 6 seconds from Prometheus's perspective).

[screenshot: Targets page showing the 10s "defaults" group DOWN with "context deadline exceeded" and the 30s "increased_timeout" group UP with Last Scrape > 36s ago]
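
As a cross-check of how long these scrapes actually take, Prometheus's own per-target metrics can be queried in the expression browser (a quick sketch; the job names are the ones from the config in the Environment section below):

# Duration of the most recent scrape of each target, in seconds:
scrape_duration_seconds{job=~"defaults|increased_timeout"}

# Whether each target is currently considered healthy (1 = UP, 0 = DOWN):
up{job=~"defaults|increased_timeout"}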

What did you expect to see?

Both target groups reporting as UP and scrapes completing quickly

What did you see instead? Under which circumstances?

DOWN "context deadline exceeded" for the 10s scrape group

Environment

Recreate my exact environment:

# Run monitoring (node_exporter + prometheus server)
docker run --name "node_exporter" -d prom/node-exporter

# Write config into file for mounting into prometheus container
mkdir -p /tmp/prometheus/
cat << EOF > /tmp/prometheus/prometheus.yml
---
global:
  scrape_interval: 10s
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'defaults'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          group: 'defaults'
          
  - job_name: 'increased_timeout'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          group: 'increased_timeout'
EOF

docker run --name "prometheus" \
  -d \
  -p 9090:9090 \
  --link node_exporter:node_exporter \
  -v /tmp/prometheus/prometheus.yml:/prometheus.yml \
  -v /tmp/prometheus/data:/data \
  prom/prometheus:v1.5.2 \
  -storage.local.path=/data \
  -config.file=/prometheus.yml

# Open http://localhost:9090 in your browser and look at the Targets section.

# Stop and remove containers:
docker stop node_exporter prometheus
docker rm node_exporter prometheus
  • Prometheus version:
$ docker run -ti prom/prometheus:v1.5.2 -version
prometheus, version 1.5.2 (branch: master, revision: bd1182d29f462c39544f94cc822830e1c64cf55b)
  build user:       root@1a01c5f68840
  build date:       20170210-16:23:28
  go version:       go1.7.5
  • Prometheus configuration file:
---
global:
  scrape_interval: 10s
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'defaults'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          group: 'defaults'
          
  - job_name: 'increased_timeout'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          group: 'increased_timeout'
  • Logs:
time="2017-03-01T11:53:25Z" level=info msg="Starting prometheus (version=1.5.2, branch=master, revision=bd1182d29f462c39544f94cc822830e1c64cf55b)" source="main.go:75" 
time="2017-03-01T11:53:25Z" level=info msg="Build context (go=go1.7.5, user=root@1a01c5f68840, date=20170210-16:23:28)" source="main.go:76" 
time="2017-03-01T11:53:25Z" level=info msg="Loading configuration file /prometheus.yml" source="main.go:248" 
time="2017-03-01T11:53:26Z" level=info msg="Loading series map and head chunks..." source="storage.go:373" 
time="2017-03-01T11:53:26Z" level=info msg="1434 series loaded." source="storage.go:378" 
time="2017-03-01T11:53:26Z" level=info msg="Listening on :9090" source="web.go:259" 
time="2017-03-01T11:53:26Z" level=info msg="Starting target manager..." source="targetmanager.go:61" 
time="2017-03-01T11:58:26Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:612" 
time="2017-03-01T11:58:26Z" level=info msg="Done checkpointing in-memory metrics and chunks in 107.457463ms." source="persistence.go:639" 
time="2017-03-01T12:03:26Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:612" 
time="2017-03-01T12:03:26Z" level=info msg="Done checkpointing in-memory metrics and chunks in 97.668263ms." source="persistence.go:639"
...

Nothing in the logs seems relevant.

Hopefully I'm not doing anything obviously wrong!

Thanks

juliusv commented Mar 1, 2017

Hm, @alexellis mentioned that he had similar problems with Prometheus on Docker - he couldn't even scrape Prometheus itself at localhost. Alex, did you find a solution to that?

@AlexRudd I used your exact steps to reproduce your scenario and it works for me. I wonder if Docker broke something networking-related in recent versions. I'm still on Docker 1.12.1. This might be one of these cases where you want to pull out Wireshark or tcpdump and look at what's really happening on the network.
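
For example, a capture on the host's Docker bridge while the failing scrape runs would show whether the TCP connection is even being established (a sketch; docker0 is the usual default bridge interface and may differ on your host):

# Watch scrape traffic between Prometheus and node_exporter on the default
# Docker bridge (interface name is an assumption; adjust for your setup):
sudo tcpdump -i docker0 -nn tcp port 9100

# Or write the capture to a file and inspect it in Wireshark afterwards:
sudo tcpdump -i docker0 -nn -w scrape.pcap tcp port 9100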

alexellis commented Mar 1, 2017

I was seeing the same error with the latest Docker images for Prometheus. I thought it was because I was running it as a docker stack, i.e. a Docker Swarm service.

juliusv commented Mar 1, 2017

It does work for me with exactly the same Prometheus version mentioned here btw., so I suspect it's Docker-related. But since I can't reproduce, it would be cool if someone who sees the problem can dig into actual packet traces to see what's going on.

AlexRudd commented Mar 1, 2017

Ah, I didn't think to check my docker version!

I just tried my reproduction steps at home and it worked fine:

$ docker version
Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 08:47:51 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 08:47:51 2017
 OS/Arch:      linux/amd64
 Experimental: true

I'll check what version I'm on at work tomorrow and do some digging through Docker's issues.

I was also having issues scraping Prometheus itself (localhost:9090), but that was giving a different error and I thought I'd focus on one thing at a time.

AlexRudd commented Mar 2, 2017

So at work I'm running:

$ docker version
Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:32:39 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:32:39 2017
 OS/Arch:      linux/amd64

I upgraded to the latest Docker (now Docker CE) and still get the same "context deadline exceeded" error in Prometheus:

$ docker version
Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 08:01:32 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 08:01:32 2017
 OS/Arch:      linux/amd64
 Experimental: false

juliusv commented Mar 2, 2017

Then I'll have to refer to "But since I can't reproduce, it would be cool if someone who sees the problem can dig into actual packet traces to see what's going on." :) It seems it's either a Docker problem, or some weird interaction between how Go resolves things and the way Docker does DNS.
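
One way to check for a DNS-level difference is to compare what the container's resolver returns for the target name (a sketch; whether nslookup is present in the prom/prometheus image depends on its busybox base, so that part is an assumption):

# Resolve the linked container name from inside the Prometheus container:
docker exec prometheus nslookup node_exporter

# Check which resolver the container is actually using:
docker exec prometheus cat /etc/resolv.conf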

agaoglu commented Mar 3, 2017

Any chance you might be hitting moby/moby#19866? Seems a lot has changed regarding the default bridge network and --link.
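
If the default-bridge --link behavior is indeed the culprit, a user-defined network side-steps it while keeping name-based resolution; a sketch of the alternative setup, reusing the reproduction steps above:

# Containers on the same user-defined network can reach each other by name,
# so --link is no longer needed. (--network requires Docker 1.13+; older
# versions use --net instead.)
docker network create monitoring

docker run --name "node_exporter" -d --network monitoring prom/node-exporter

docker run --name "prometheus" \
  -d \
  -p 9090:9090 \
  --network monitoring \
  -v /tmp/prometheus/prometheus.yml:/prometheus.yml \
  -v /tmp/prometheus/data:/data \
  prom/prometheus:v1.5.2 \
  -storage.local.path=/data \
  -config.file=/prometheus.yml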

gesellix commented Mar 12, 2017

I get the same error message (context deadline exceeded) on Travis-CI and Docker 1.12.3. I don't use Prometheus, but run integration tests from a Java client against the Docker daemon. The error appears when creating a service and directly afterwards requesting the current tasks. Locally on Docker for Mac 17.03.0-ce-mac2 I cannot reproduce the error.

In other words: I would confirm @juliusv's assumption that this isn't a Prometheus issue - but maybe the Prometheus setup helps to reproduce the issue?

edit: I fixed my issue by adding a client-side timeout of 1s between service creation and the task list request.

juliusv commented Mar 12, 2017

Yeah - I think it's safe to close this issue for now, as it's likely a Docker issue. I'd rather ask the Docker folks about it.

juliusv closed this Mar 12, 2017

meowtochondria commented Mar 27, 2017

Happening on CentOS 7.3 in AWS without using Docker. This is a very basic setup like the OP reported, with nothing fancy going on, except that I am running everything natively. Would be happy to provide any details that might help with resolution.

prologic commented Jul 18, 2017

Ran into a similar problem on Docker 17.03–17.05 CE.

These versions of Docker have a bug that makes connecting the service/container to the host bridge network harder than it should be.

Work-around:

$ docker network connect bridge <cid_of_prometheus>
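
The container name works in place of the ID, and the attachment can be verified afterwards with docker inspect:

docker network connect bridge prometheus
docker inspect -f '{{json .NetworkSettings.Networks}}' prometheus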

joshpollara commented Sep 15, 2017

Ran into the same issue. Amazon Linux 2017.03, no Docker. Basic setup.

MarilynZ commented Oct 25, 2017

@juliusv
Happening on Ubuntu 16.04 in CDS cloud without using Docker. Noticed that the "node_exporter" target was showing as DOWN, with the error "context deadline exceeded".

Prometheus version:
prometheus, version 1.7.1 (branch: master, revision: 3afb3fffa3a29c3de865e1172fb740442e9d0133)
  build user:       root@0aa1b7fc430d
  build date:       20170612-11:44:05
  go version:       go1.8.3

Prometheus configuration file:
global:
  scrape_interval: 30s
  evaluation_interval: 30s

  external_labels:
    monitor: 'Video-monitor'

rule_files:
  - "rules/alert.rules"

alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "10.8.0.12:9093"

scrape_configs:
  - job_name: 'Video_prometheus'
    static_configs:
      - targets: ['192.168.1.13:9090']
        labels:
          alias: video-prometheus

  - job_name: 'Video_node_exporter'
    scrape_interval: 120s
    scrape_timeout: 120s
    static_configs:
      - targets: ['192.168.1.13:9100']
        labels:
          alias: video-prometheus
      - targets: ['192.168.1.2:9100']
        labels:
          alias: video-master
      - targets: ['192.168.1.3:9100']
        labels:
          alias: video-node01
      - targets: ['192.168.1.4:9100']
        labels:
          alias: video-node02
      - targets: ['192.168.1.5:9100']
        labels:
          alias: video-node03
      ...

benitogf commented Nov 14, 2017

@juliusv found this issue on Ubuntu 16.04:

docker version
Client:
 Version:      17.10.0-ce
 API version:  1.33
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:04:16 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.10.0-ce
 API version:  1.33 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:02:56 2017
 OS/Arch:      linux/amd64
 Experimental: false
docker ps
4bbab0972d91        prom/node-exporter   "/bin/node_exporte..."   19 minutes ago      Up 19 minutes       0.0.0.0:9100->9100/tcp   prometheus_node-exporter_1
a38bf51d25c6        prom/prometheus      "/bin/prometheus -..."   20 minutes ago      Up 20 minutes       0.0.0.0:9090->9090/tcp   vibrant_brown
prometheus
State | Labels                     | Last Scrape  | Error
DOWN |  instance="128.199.213.45:9100" | 56.407s ago | context deadline exceeded

Scrape interval and timeout are set to 30s, but there is no change.
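
A quick check from inside the Prometheus container might narrow down whether this is reachability or just slowness (container name taken from the docker ps output above; busybox wget uses -T for a read timeout):

docker exec vibrant_brown wget -qO- -T 10 http://128.199.213.45:9100/metrics | head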

roberteventival commented Jan 31, 2018

Hi, I've had the same problem, but noticed my Docker environment is very slow and swapping quite a lot. After adding more RAM and restarting Docker, the problem is gone for the moment.

dcmspe commented Sep 27, 2018

I am experiencing the same issue.

CONTAINER ID   NAME         CPU %   MEM USAGE / LIMIT     MEM %   NET I/O          BLOCK I/O    PIDS
c4120abe7244   prometheus   0.03%   21.57MiB / 15.58GiB   0.14%   44.7kB / 128kB   393kB / 0B   21

My Docker container has a 15 GB memory limit; could adding more memory solve the issue?

weskinner commented Dec 18, 2018

The fix for me was editing the NetworkPolicy associated with the Pod that Prometheus was trying to scrape.

rcsavage commented Feb 27, 2019

@weskinner what exactly did you edit in the NetworkPolicy associated with the Pod? I am seeing the same issue right now, so any advice would help.

weskinner commented Feb 27, 2019

@rcsavage sorry, I don't remember the specifics. I just know the NetworkPolicy that applied to that pod was preventing Prometheus from accessing it from the monitoring namespace. Adjusting the labels on the monitoring namespace and/or the selectors in the NetworkPolicy should fix it if that is your problem as well.
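
For anyone hitting the same thing, a policy along these lines should be close; this is a hypothetical sketch, and the names, namespace labels, and port are assumptions to adapt to your cluster:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape      # hypothetical name
  namespace: my-app                  # namespace of the scraped pods (assumption)
spec:
  podSelector: {}                    # apply to all pods in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring       # label on the monitoring namespace (assumption)
      ports:
        - protocol: TCP
          port: 9100                 # metrics port of the scraped pods (assumption)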
