federation is failing with `http: multiple response.WriteHeader calls` error #2237

Closed
onorua opened this Issue Dec 1, 2016 · 14 comments

onorua commented Dec 1, 2016

What did you do?
Tried to increase the number of nodes monitored by our Prometheus to more than 300 and decided to use federation.

What did you expect to see?
A seamless, working solution.

What did you see instead? Under which circumstances?
Some broken graphs; nothing actually works.

Environment

  • System information:
    Linux prometheus-sfo2-prod-02 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

  • Prometheus version:

prometheus, version 1.4.1 (branch: master, revision: 2a89e8733f240d3cd57a6520b52c36ac4744ce12)
  build user:       root@e685d23d8809
  build date:       20161128-09:59:22
  go version:       go1.7.3
  • Prometheus configuration file:
# my global config
global:
  scrape_interval:     30s # By default, scrape targets every 15 seconds.
  evaluation_interval: 30s # By default, scrape targets every 15 seconds.
  scrape_timeout: 30s
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'bhss'

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "alert.rules"
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # Auto-discovery via consul tags by Registrator
  - job_name: 'federate'
    scrape_interval: 30s

    honor_labels: true
    metrics_path: '/federate'

    params:
      'match[]':
        - '{job="consul-services"}'

    static_configs:
      - targets:
        - 'prometheus1:9090'
  • Logs:
    On the federation server everything seems to be fine:
time="2016-12-01T01:50:56Z" level=info msg="9757 series loaded." source="storage.go:359"
time="2016-12-01T01:50:56Z" level=info msg="Starting target manager..." source="targetmanager.go:63"
time="2016-12-01T01:50:56Z" level=info msg="Listening on :9090" source="web.go:248"
time="2016-12-01T01:59:31Z" level=info msg="Completed maintenance sweep through 13306 in-memory fingerprints in 8m33.461796469s." source="storage.go:1180"
time="2016-12-01T02:00:56Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T02:01:19Z" level=info msg="Done checkpointing in-memory metrics and chunks in 22.858885958s." source="persistence.go:573"

On the federated server (the one which scrapes everything):

time="2016-12-01T01:30:49Z" level=info msg="Listening on :9090" source="web.go:248"
time="2016-12-01T01:30:49Z" level=warning msg="Storage has entered rushed mode." chunksToPersist=1138834 maxChunksToPersist=2572864 maxMemoryChunks=2572864 memoryChunks=2925866 source="storage.go:1607" urgencyScore=1
time="2016-12-01T01:30:54Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=1137413 maxChunksToPersist=2572864 maxToleratedMemChunks=2830150 memoryChunks=2920720 source="storage.go:908"
time="2016-12-01T01:40:49Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T01:42:50Z" level=info msg="Done checkpointing in-memory metrics and chunks in 2m0.862111955s." source="persistence.go:573"
time="2016-12-01T01:44:40Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:45:11Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:45:44Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:46:13Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:46:45Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:47:16Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:47:59Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:48:28Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:48:58Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:49:29Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:49:57Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:52:42Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:52:50Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T01:53:12Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:53:42Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:54:52Z" level=info msg="Done checkpointing in-memory metrics and chunks in 2m2.181545716s." source="persistence.go:573"
time="2016-12-01T01:55:34Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:56:04Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:56:34Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:57:04Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:57:34Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:58:04Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T01:58:34Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"

onorua commented Dec 1, 2016

Apparently, I was missing `- '{__name__=~"job:.*"}'` in the config:

    params:
      'match[]':
        - '{job="consul-services"}'
        - '{__name__=~"job:.*"}'
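
For reference, each entry under params `'match[]'` becomes a separate `match[]` query parameter on the /federate scrape request. A minimal Go sketch (hypothetical, just to show the shape of the resulting URL; the host is the target from static_configs above):

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// Every entry under params 'match[]' becomes one match[] query parameter.
	q := url.Values{}
	q.Add("match[]", `{job="consul-services"}`)
	q.Add("match[]", `{__name__=~"job:.*"}`)

	u := url.URL{
		Scheme:   "http",
		Host:     "prometheus1:9090", // the target from static_configs above
		Path:     "/federate",
		RawQuery: q.Encode(),
	}
	// Prints something like:
	// http://prometheus1:9090/federate?match%5B%5D=%7Bjob%3D%22consul-services%22%7D&match%5B%5D=...
	fmt.Println(u.String())
}
```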

onorua closed this Dec 1, 2016


brian-brazil commented Dec 1, 2016

There's still a bug in here: the config you provided should work, and that error message is bad.

brian-brazil reopened this Dec 1, 2016


onorua commented Dec 1, 2016

Yep, confirming that it was just a coincidence that the errors disappeared after adding that line. I still see this:

time="2016-12-01T18:42:22Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T18:46:41Z" level=info msg="Completed maintenance sweep through 323276 archived fingerprints in 7h1m43.823738488s." source="storage.go:1223"
time="2016-12-01T18:50:55Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T18:51:53Z" level=info msg="Done checkpointing in-memory metrics and chunks in 58.246700323s." source="persistence.go:573"
time="2016-12-01T19:01:53Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T19:02:49Z" level=info msg="Done checkpointing in-memory metrics and chunks in 55.435457784s." source="persistence.go:573"
time="2016-12-01T19:05:36Z" level=info msg="Completed maintenance sweep through 2704215 in-memory fingerprints in 7h46m40.813319602s." source="storage.go:1180"
time="2016-12-01T19:12:49Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T19:13:45Z" level=info msg="Done checkpointing in-memory metrics and chunks in 56.170504668s." source="persistence.go:573"
time="2016-12-01T19:18:19Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T19:23:45Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T19:24:42Z" level=info msg="Done checkpointing in-memory metrics and chunks in 56.871120928s." source="persistence.go:573"
time="2016-12-01T19:34:42Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T19:35:39Z" level=info msg="Done checkpointing in-memory metrics and chunks in 56.586444411s." source="persistence.go:573"
time="2016-12-01T19:39:46Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T19:43:06Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T19:45:39Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T19:46:31Z" level=info msg="Done checkpointing in-memory metrics and chunks in 52.003160946s." source="persistence.go:573"
time="2016-12-01T19:56:31Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T19:57:22Z" level=info msg="Done checkpointing in-memory metrics and chunks in 51.153250578s." source="persistence.go:573"
time="2016-12-01T20:04:54Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T20:07:22Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T20:08:11Z" level=info msg="Done checkpointing in-memory metrics and chunks in 49.640379491s." source="persistence.go:573"
time="2016-12-01T20:12:01Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T20:17:09Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T20:18:11Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T20:18:12Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T20:19:01Z" level=info msg="Done checkpointing in-memory metrics and chunks in 49.264976097s." source="persistence.go:573"
time="2016-12-01T20:25:55Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T20:29:01Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T20:29:49Z" level=info msg="Done checkpointing in-memory metrics and chunks in 48.716060418s." source="persistence.go:573"
time="2016-12-01T20:39:49Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T20:40:37Z" level=info msg="Done checkpointing in-memory metrics and chunks in 47.354029744s." source="persistence.go:573"
time="2016-12-01T20:41:58Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T20:50:19Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"
time="2016-12-01T20:50:37Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T20:51:22Z" level=info msg="Done checkpointing in-memory metrics and chunks in 45.136754927s." source="persistence.go:573"
time="2016-12-01T21:01:22Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-01T21:02:08Z" level=info msg="Done checkpointing in-memory metrics and chunks in 46.118455391s." source="persistence.go:573"
time="2016-12-01T21:03:44Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38"

As you can see, it doesn't happen often, but still...

Do you need me to check anything for you, or do you already know what this is and how to fix it?


brian-brazil commented Dec 1, 2016

I don't think any of us has had a chance to reproduce yet. We'll let you know if we need further information.


fabxc commented Dec 5, 2016

I haven't seen source="<autogenerated>:38" before. That's rather odd.


onorua commented Dec 6, 2016

Do you want me to do anything to help debug it?


brian-brazil commented Dec 6, 2016

Can you tell us what's hitting this Prometheus (e.g. Grafana), and whether that's the full config file?


natalia-k commented Dec 6, 2016

Hi,
We started getting the following error after upgrading to v1.4.1 on a Prometheus pod in Kubernetes:
...
time="2016-12-06T08:29:51Z" level=error msg="http: multiple response.WriteHeader calls\n" source=":38"
...

On the federation Prometheus server the target status just shows "context deadline exceeded"; nothing in the log.

Of the 3 Prometheus servers defined in the federation, only one has this error.

Deleting the problematic Prometheus pod temporarily fixed the problem, but it came back some time after that.
Thanks


onorua commented Dec 6, 2016

The federation server is hitting that Prometheus server; we basically have the same problem as @natalia-k reported. Federation did not work: the targets tab showed the node as "down", with "context deadline exceeded" in the logs.


fabxc commented Dec 6, 2016

Ah, so it is the scraped server – I thought it was the scraping one. Thanks.


fabxc commented Dec 6, 2016

#2259 fixes the issue causing the error message. It means, though, that your federation responses were failing somewhere along the way, but the issue fixed in #2259 prevents us from seeing the actual error.

It must be coming from trying to encode the metric families though, which is rather odd.
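
To illustrate the failure mode (a minimal net/http sketch, not the actual Prometheus federation code): once a handler has started writing the response body, a later attempt to report an error via http.Error triggers exactly this log line, and the original error never surfaces.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// sketchHandler is a hypothetical handler showing the bug class described
// above: the 200 status and part of the body have already been sent when
// an error occurs mid-stream, so the error path calls WriteHeader a second
// time. net/http then logs "http: multiple response.WriteHeader calls" and
// the underlying error (errEncode here) is never reported.
func sketchHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK) // headers are sent from here on
	fmt.Fprintln(w, "some_metric 1")

	errEncode := fmt.Errorf("encoding a metric family failed") // stand-in for the real failure
	if errEncode != nil {
		// Too late: http.Error calls WriteHeader(500) after the 200 already went out.
		http.Error(w, errEncode.Error(), http.StatusInternalServerError)
	}
}

func main() {
	http.HandleFunc("/federate", sketchHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

That would also explain the odd source="<autogenerated>:38": the message presumably comes from the net/http server's own logger routed into Prometheus's logging, not from a direct log call in Prometheus code.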


widgetpl commented Jan 19, 2017

Isn't it connected with too many metrics on the federation endpoint and too low a timeout to collect them all?
I have a similar problem with Telegraf scraping Prometheus, where I have interval = "10s" in Telegraf (a quick check with an explicit client timeout is sketched at the end of this comment).
The Telegraf logs show me:

2017/01/19 08:11:07 I! Using config file: /etc/telegraf/telegraf.conf
2017/01/19 08:11:07 I! Starting Telegraf (version 1.1.2)
2017/01/19 08:11:07 I! Loaded outputs: influxdb influxdb
2017/01/19 08:11:07 I! Loaded inputs: inputs.prometheus
2017/01/19 08:11:07 I! Tags enabled: host=prometheus-339779989-oq0zs k8s_cluster=minikube
2017/01/19 08:11:07 I! Agent Config: Interval:10s, Quiet:false, Hostname:"prometheus-339779989-oq0zs", Flush Interval:10s 
2017/01/19 08:11:20 E! ERROR: input [inputs.prometheus] took longer to collect than collection interval (10s)
2017/01/19 08:11:21 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2017/01/19 08:11:21 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2017/01/19 08:11:30 E! ERROR: input [inputs.prometheus] took longer to collect than collection interval (10s)
2017/01/19 08:11:31 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2017/01/19 08:11:31 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.

or

2017/01/19 08:22:50 I! Using config file: /etc/telegraf/telegraf.conf
2017/01/19 08:22:50 I! Starting Telegraf (version 1.1.2)
2017/01/19 08:22:50 I! Loaded outputs: influxdb influxdb
2017/01/19 08:22:50 I! Loaded inputs: inputs.prometheus
2017/01/19 08:22:50 I! Tags enabled: host=prometheus-339779989-oq0zs k8s_cluster=minikube
2017/01/19 08:22:50 I! Agent Config: Interval:10s, Quiet:false, Hostname:"prometheus-339779989-oq0zs", Flush Interval:10s 
2017/01/19 08:23:03 E! ERROR in input [inputs.prometheus]: error making HTTP request to http://localhost:9090/federate?match[]=%7Bjob%3D~".%2B"%7D: net/http: timeout awaiting response headers
2017/01/19 08:23:12 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2017/01/19 08:23:12 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2017/01/19 08:23:13 E! ERROR in input [inputs.prometheus]: error making HTTP request to http://localhost:9090/federate?match[]=%7Bjob%3D~".%2B"%7D: net/http: timeout awaiting response headers
2017/01/19 08:23:21 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2017/01/19 08:23:21 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2017/01/19 08:23:30 E! ERROR: input [inputs.prometheus] took longer to collect than collection interval (10s)
2017/01/19 08:23:32 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2017/01/19 08:23:32 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2017/01/19 08:23:40 E! ERROR: input [inputs.prometheus] took longer to collect than collection interval (10s)
2017/01/19 08:23:41 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2017/01/19 08:23:41 I! Output [influxdb] buffer fullness: 0 / 10000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.

and in Prometheus:

time="2017-01-19T07:25:45Z" level=info msg="Starting prometheus (version=1.4.1, branch=master, revision=2a89e8733f240d3cd57a6520b52c36ac4744ce12)" source="main.go:77" 
time="2017-01-19T07:25:45Z" level=info msg="Build context (go=go1.7.3, user=root@e685d23d8809, date=20161128-09:59:22)" source="main.go:78" 
time="2017-01-19T07:25:45Z" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:250" 
time="2017-01-19T07:25:46Z" level=info msg="Loading series map and head chunks..." source="storage.go:354" 
time="2017-01-19T07:26:03Z" level=info msg="465991 series loaded." source="storage.go:359" 
time="2017-01-19T07:26:03Z" level=info msg="Listening on :9090" source="web.go:248" 
time="2017-01-19T07:26:03Z" level=info msg="Starting target manager..." source="targetmanager.go:63" 
time="2017-01-19T07:26:43Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38" 
time="2017-01-19T07:26:54Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38" 
time="2017-01-19T07:27:04Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38" 
time="2017-01-19T07:27:34Z" level=info msg="Completed maintenance sweep through 8 archived fingerprints in 1m20.894716613s." source="storage.go:1223" 
time="2017-01-19T07:29:04Z" level=info msg="Completed maintenance sweep through 8 archived fingerprints in 1m20.68575756s." source="storage.go:1223" 
time="2017-01-19T07:30:06Z" level=info msg="Completed maintenance sweep through 5 archived fingerprints in 51.29890788s." source="storage.go:1223" 
time="2017-01-19T07:30:23Z" level=error msg="http: multiple response.WriteHeader calls\n" source="<autogenerated>:38" 
time="2017-01-19T07:31:03Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"

With wget I get:

wget http://192.168.10.10:9090/federate\?match\[\]=\%7Bjob%3D~\".%2B\"\%7D
--2017-01-19 09:35:43--  http://192.168.10.10:9090/federate?match[]=%7Bjob%3D~%22.%2B%22%7D
Connecting to 192.168.10.10:9090... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: 'federate?match[]=%7Bjob%3D~%22.%2B%22%7D'

federate?match[]=%7Bjob%3D~%22.%2B%22%7D    [        <=>        ]  141.97M  3.11MB/s    in 71s
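
The "timeout awaiting response headers" above is a client-side limit, so one way to test the timeout hypothesis is the same request as the wget, but with a deliberately generous overall timeout. A minimal, hypothetical Go sketch (host and matcher are just the ones from this comment):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// Same matcher as the wget above: match[]={job=~".+"}
	q := url.Values{}
	q.Add("match[]", `{job=~".+"}`)

	// Well above Telegraf's 10s collection interval, so a slow-but-working
	// /federate endpoint still completes instead of timing out.
	client := &http.Client{Timeout: 5 * time.Minute}

	start := time.Now()
	resp, err := client.Get("http://192.168.10.10:9090/federate?" + q.Encode())
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	defer resp.Body.Close()

	n, _ := io.Copy(io.Discard, resp.Body) // drain the body to time the full transfer
	fmt.Printf("status=%s bytes=%d took=%s\n", resp.Status, n, time.Since(start))
}
```

If this completes but takes much longer than 10s (the wget above needed about 71s for ~142 MB), then the scraper's interval/timeout is simply too short for this many metrics, or the match[] needs to be narrowed; that would be separate from the WriteHeader error.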

brian-brazil commented Apr 6, 2017

That appears to be a different issue.

The issue as reported is resolved. If there's another bug that was being hidden by this one, a new issue should be opened.

lock bot locked and limited the conversation to collaborators Mar 23, 2019