
Remote read is failing with 'http: multiple response.WriteHeader calls' in remote node #4307

Closed
xudawei opened this Issue Jun 24, 2018 · 7 comments

xudawei commented Jun 24, 2018

Use case. Why is this important?
I shard data across three Prometheus servers to scrape more than 2000 nodes, and I use another Prometheus server configured with remote_read as a proxy to serve PromQL queries from Grafana. The architecture looks like this:
Grafana -> Prom (with remote_read; I will call it "proxy" below) -> Prom (three sharded nodes) -> node-exporter

Bug Report

What did you do?
I curl a URL on the proxy: http://proxy.domain.com/api/v1/series?match[]=up

What did you expect to see?
The up series for all active nodes, with their labels.

What did you see instead? Under which circumstances?
Sometimes (only sometimes!) I get an error back: context deadline exceeded

Environment

  • System information:

    X86_64 RHEL 7.4

  • Prometheus version:

level=info ts=2018-06-22T05:39:17.089870496Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
  • Alertmanager version:

    I do not use Alertmanager

  • Prometheus configuration file:

Proxy (Prom with remote_read)

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 1m
    
remote_read:
- url: 'http://10.5.39.160:9090/api/v1/read'
  remote_timeout: 5s
  read_recent: true
- url: 'http://10.5.39.161:9090/api/v1/read'
  remote_timeout: 5s
  read_recent: true
- url: 'http://10.5.39.162:9090/api/v1/read'
  remote_timeout: 5s
  read_recent: true

Prom ( one of the node with shard)

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 1m
  external_labels:
    slave: "10.5.39.162"

scrape_configs:
  - job_name: 'idc_ping'
    scrape_interval: 60s
    metrics_path: /probe
    params:
      module: [icmp]  #ping
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3
        target_label:  __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: ^2$
        action: keep
      - source_labels: [__address__]
        regex: (.*)(:80)?
        target_label: __param_target
        replacement: ${1}
      - source_labels: [__param_target]
        regex: (.*)
        target_label: ping
        replacement: ${1}
      - source_labels: []
        regex: .*
        target_label: __address__
        replacement: 10.5.39.187:9115  # Blackbox exporter.
    file_sd_configs:
      - files:
        - /app/conf/targets/idc_switch/*.json

  - job_name: 'idc_snmp'
    scrape_interval: 300s
    scrape_timeout: 30s
    metrics_path: /snmp
    params:
      module: [if_mib]
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3
        target_label:  __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: ^2$
        action: keep
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.5.39.187:9116
    file_sd_configs:
      - files:
        - /app/conf/targets/idc_switch/*.json
  - job_name: 'idc_jcpu'
    scrape_interval: 300s
    scrape_timeout: 30s
    metrics_path: /snmp
    params:
      module: [JuniperCPU]
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3
        target_label:  __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: ^2$
        action: keep
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.5.39.187:9116
    file_sd_configs:
      - files:
        - /app/conf/targets/idc_switch/*.json
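
For context on the sharding in these jobs: each of the three servers keeps only the targets whose hashmod value matches its own index (regex ^2$ on this shard), so the target list is split three ways. The rough Go sketch below illustrates the idea; Prometheus's actual hashmod action hashes the joined source label values with an MD5-derived sum, and the target addresses here are made-up examples, so treat this as an illustration rather than the exact implementation.

package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// shardOf mimics the idea behind the hashmod relabel action: hash the joined
// source label values (here just __address__) and take the remainder modulo
// the number of shards. This is a simplified hash, not the exact one
// Prometheus uses internally.
func shardOf(address string, modulus uint64) uint64 {
	sum := md5.Sum([]byte(address))
	return binary.BigEndian.Uint64(sum[8:]) % modulus
}

func main() {
	// Hypothetical target addresses, standing in for the file_sd targets.
	targets := []string{"10.5.40.1", "10.5.40.2", "10.5.40.3", "10.5.40.4"}
	for _, t := range targets {
		// A shard whose relabel rule keeps regex ^2$ scrapes only targets
		// for which shardOf(t, 3) == 2.
		fmt.Printf("%s -> shard %d\n", t, shardOf(t, 3))
	}
}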

  • Alertmanager configuration file:
    I do not use Alertmanager
  • Logs:

Proxy (Prom with remote_read)

level=info ts=2018-06-22T05:25:56.159754372Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
level=info ts=2018-06-22T05:25:56.159865588Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
level=info ts=2018-06-22T05:25:56.159902798Z caller=main.go:222 host_details="(Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 c27ad4e6d7b9 (none))"
level=info ts=2018-06-22T05:25:56.15993398Z caller=main.go:223 fd_limits="(soft=65536, hard=65536)"
level=info ts=2018-06-22T05:25:56.164939994Z caller=web.go:382 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-06-22T05:25:56.164954328Z caller=main.go:504 msg="Starting TSDB ..."
level=info ts=2018-06-22T05:25:56.175147161Z caller=main.go:514 msg="TSDB started"
level=info ts=2018-06-22T05:25:56.175188157Z caller=main.go:588 msg="Loading configuration file" filename=/app/conf/prometheus.yml
level=info ts=2018-06-22T05:25:56.176519009Z caller=main.go:491 msg="Server is ready to receive web requests."
level=error ts=2018-06-22T07:22:34.386435013Z caller=engine.go:529 component="query engine" msg="error selecting series set" err="error sending request: context canceled"

Prom ( node with shard)

level=info ts=2018-06-22T05:39:17.089870496Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
level=info ts=2018-06-22T05:39:17.090013649Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
level=info ts=2018-06-22T05:39:17.090054321Z caller=main.go:222 host_details="(Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 37aa86dcc533 (none))"
level=info ts=2018-06-22T05:39:17.090090703Z caller=main.go:223 fd_limits="(soft=65536, hard=65536)"
level=info ts=2018-06-22T05:39:17.094020099Z caller=main.go:504 msg="Starting TSDB ..."
level=info ts=2018-06-22T05:39:17.094025856Z caller=web.go:382 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-06-22T05:39:21.813964411Z caller=main.go:514 msg="TSDB started"
level=info ts=2018-06-22T05:39:21.814052973Z caller=main.go:588 msg="Loading configuration file" filename=/app/conf/prometheus.yml
level=info ts=2018-06-22T05:39:21.815212701Z caller=main.go:491 msg="Server is ready to receive web requests."
level=error ts=2018-06-22T05:54:51.526192719Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=info ts=2018-06-22T07:00:00.246050978Z caller=compact.go:393 component=tsdb msg="compact blocks" count=1 mint=1529640000000 maxt=1529647200000
level=info ts=2018-06-22T07:00:06.669934675Z caller=head.go:348 component=tsdb msg="head GC completed" duration=484.65759ms
level=info ts=2018-06-22T07:00:08.029114057Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=1.359096044s
level=error ts=2018-06-22T07:14:30.287165508Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:14:30.306147777Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:14:30.326146722Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:14:30.331301576Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:14:30.486377958Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:14:30.625925566Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:14:49.192145902Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:14:49.227607287Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:14:49.316431716Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:14:59.750539448Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:15:00.04012553Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:15:00.070867002Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:15:00.104882808Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:15:00.32962463Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2018-06-22T07:15:00.335144468Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"

juliusv commented Jun 24, 2018

Thanks for the report!

Hmm, I have tried to reproduce this by pointing a 2.2.1 Prometheus at the remote read endpoint of another 2.2.1 Prometheus, but it is working for me. Proxy config:

global:
  scrape_interval:     10s
  evaluation_interval: 10s

remote_read:
- url: "http://localhost:9090/api/v1/read"
  read_recent: true
  remote_timeout: 5s
- url: "http://demo.robustperception.io:9090/api/v1/read"
  read_recent: true
  remote_timeout: 5s

And http://localhost:9091/api/v1/series?match[]=up works...

Do you still have the problem with Prometheus 2.3.1 by the way?

xudawei commented Jun 26, 2018

Thanks for your reply
Most of the time it works, but sometimes it goes wrong, and I don't know what causes it. Could you list some possible reasons for this error? I can help with some testing.

juliusv commented Jun 26, 2018

Ah sorry, I didn't understand earlier that http: multiple response.WriteHeader calls also only happens sometimes.

Hmm, at first I thought the proxy Prometheus ran into a query timeout, but that should normally read context deadline exceeded rather than context canceled. Still, something causes it to give up on the query; maybe it is a timeout after all and the error message just ends up slightly different for some reason. On the shard Prometheus, it is running this code: https://github.com/prometheus/prometheus/blob/master/web/api/v1/api.go#L592-L667

The error message indicates that we write HTTP headers twice, which can happen if you first set some headers, then start writing the body (which automatically writes out the headers first), and then try to set more headers. The only place where I currently see that this could happen is in https://github.com/prometheus/prometheus/blob/master/web/api/v1/api.go#L663-L666. EncodeReadResponse() already writes out headers, but if it then fails to write the body back to the requesting Prometheus server (because that server timed out / canceled the request for some reason), we still call http.Error() in the caller, which tries to write out headers again. To handle this more cleanly and give a better error message, we should probably differentiate an error that happens while writing back the body (https://github.com/prometheus/prometheus/blob/master/storage/remote/codec.go#L66) and not call http.Error() if that is what happened.
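
To make that concrete, here is a minimal, self-contained Go sketch (my own illustration, not the actual Prometheus code) of a handler that starts writing the response body and then calls http.Error() after the write fails. The second header write is what makes the Go HTTP server log "http: multiple response.WriteHeader calls":

package main

import (
	"log"
	"net/http"
)

// readHandler mimics the shape of the remote-read handler described above:
// it starts writing the (compressed) response body and, if that write fails,
// falls back to http.Error().
func readHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/x-protobuf")
	w.Header().Set("Content-Encoding", "snappy")

	// Writing the body implicitly calls WriteHeader(200) first.
	_, err := w.Write([]byte("pretend this is a snappy-compressed protobuf"))
	if err != nil {
		// If the client (the proxy Prometheus) has already timed out and
		// closed the connection, the write above fails, and this call tries
		// to write headers a second time, so the server logs
		// "http: multiple response.WriteHeader calls".
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	http.HandleFunc("/api/v1/read", readHandler)
	log.Fatal(http.ListenAndServe(":9090", nil))
}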

Your remote timeout of 5s is pretty low, by the way. Have you tried setting it to something higher to see if the problem goes away?

xudawei commented Jun 27, 2018

Thank you very much for helping me analyze the causes, but I still don't quite understand the mechanism behind writing the headers twice.

I also think a timeout is the most likely cause. My current approach is to optimize the queries to reduce response time and payload size. I have also increased the remote_timeout on the proxy to 30s and will observe for a while.

juliusv commented Jun 27, 2018

@xudawei

Thank you very much for helping me analyze the causes, but I still don't quite understand the mechanism behind writing the headers twice.

Imagine the following:

  1. The querying Prometheus server has already timed out and has closed the connection while the shard is still computing the result.
  2. When the shard is finished building the result, it will try to send it: https://github.com/prometheus/prometheus/blob/master/web/api/v1/api.go#L663-L666
  3. Within EncodeReadResponse(), the shard will try to write the response body via _, err = w.Write(compressed). When writing a response body, the Go HTTP library automatically writes out / finalizes any headers first. This can only be done once, since the body always follows the headers.
  4. Writing out the body fails because the connection is already dead.
  5. EncodeReadResponse() returns an error, upon which we try to set error headers / status code in https://github.com/prometheus/prometheus/blob/master/web/api/v1/api.go#L664. But we can't, because we have already attempted to write the body earlier.

So if you set your timeout higher, this problem will hopefully go away. The only thing remaining here would then be to log this error better.
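
For completeness, here is a minimal sketch (again my own illustration, not an actual patch) of the cleanup described above: separate "failed to build the response" from "failed to write the response", and only call http.Error() in the first case. buildResponse() is a hypothetical stand-in for the query/marshal/compress work that EncodeReadResponse() does before any bytes hit the wire:

package main

import (
	"log"
	"net/http"
)

// buildResponse is a hypothetical stand-in for querying the TSDB, marshalling
// the protobuf and snappy-compressing it, i.e. everything that happens before
// any bytes are written to the connection.
func buildResponse(r *http.Request) ([]byte, error) {
	return []byte("pretend this is a snappy-compressed protobuf"), nil
}

// handleRead separates "failed to build the response" (headers not sent yet,
// http.Error is fine) from "failed to write the response" (headers already
// sent, so only log), avoiding the double WriteHeader described above.
func handleRead(w http.ResponseWriter, r *http.Request) {
	body, err := buildResponse(r)
	if err != nil {
		// Nothing has been written yet, so an error response is still possible.
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	w.Header().Set("Content-Type", "application/x-protobuf")
	w.Header().Set("Content-Encoding", "snappy")

	if _, err := w.Write(body); err != nil {
		// The querying Prometheus may have timed out and closed the
		// connection; headers are already out, so just log the failure.
		log.Printf("error writing remote read response: %v", err)
	}
}

func main() {
	http.HandleFunc("/api/v1/read", handleRead)
	log.Fatal(http.ListenAndServe(":9090", nil))
}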

xudawei commented Jul 2, 2018

Oh, yeah, now I understand it very clearly. Thank you very much.
It works great after increasing the timeout value.
I think this issue can be closed.

xudawei closed this Jul 3, 2018

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
