
Kubernetes discovery not refreshed after intermittent Kubernetes master failure #1603

Closed
rvrignaud opened this Issue Apr 28, 2016 · 20 comments

rvrignaud commented Apr 28, 2016

Hello,

I'm using Kubernetes SD on Google Container Engine. For a yet unknown reason, my Kubernetes master seems to have stopped/failed. It is back up again, but Prometheus does not refresh the discovery.
Here is an extract of the logs:

prometheus.log.txt

My Prometheus server is running inside the cluster as a pod.
I'm running the 0.18.0 release on Ubuntu.

Here is the configuration:

- job_name: 'kubernetes-cluster'

  # This TLS & bearer token file config is used to connect to the actual scrape
  # endpoints for cluster components. This is separate to discovery auth
  # configuration (`in_cluster` below) because discovery & scraping are two
  # separate concerns in Prometheus.
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  kubernetes_sd_configs:
  - api_servers:
    - 'https://kubernetes.default.svc'
    in_cluster: true

  relabel_configs:
  - source_labels: [__meta_kubernetes_role]
    action: keep
    regex: (?:apiserver|node)
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_role]
    action: replace
    target_label: kubernetes_role

# Scrape config for service endpoints.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
- job_name: 'kubernetes-service-endpoints'

  kubernetes_sd_configs:
  - api_servers:
    - 'https://kubernetes.default.svc'
    in_cluster: true

  relabel_configs:
  - source_labels: [__meta_kubernetes_role, __meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: endpoint;true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: (.+)(?::\d+);(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_role]
    action: replace
    target_label: kubernetes_role
  - source_labels: [__meta_kubernetes_service_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name

- job_name: 'kubernetes-service-endpoints-2'

  kubernetes_sd_configs:
  - api_servers:
    - 'https://kubernetes.default.svc'
    in_cluster: true

  relabel_configs:
  - source_labels: [__meta_kubernetes_role, __meta_kubernetes_service_annotation_prometheus_io_scrape2]
    action: keep
    regex: endpoint;true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme2]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path2]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port2]
    action: replace
    target_label: __address__
    regex: (.+)(?::\d+);(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_role2]
    action: replace
    target_label: kubernetes_role
  - source_labels: [__meta_kubernetes_service_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
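
For reference, the two service-endpoints jobs above expect services to opt in to scraping via the `prometheus.io/*` annotations described in the config comments. A minimal sketch of such a Service (the service name, port, and label values are hypothetical; only the annotation keys come from the configuration above):

apiVersion: v1
kind: Service
metadata:
  name: example-app                  # hypothetical name, for illustration only
  annotations:
    prometheus.io/scrape: "true"     # opt this service's endpoints in to scraping
    prometheus.io/path: "/metrics"   # only needed if metrics are not served at /metrics
    prometheus.io/port: "9102"       # only needed if metrics are exposed on another port
spec:
  selector:
    app: example-app
  ports:
  - port: 80
    targetPort: 9102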
rvrignaud (Author) commented Apr 28, 2016

Prometheus did eventually succeed in refreshing the Kubernetes discovery, a few minutes after the Kubernetes API recovered.

jimmidyson (Member) commented Apr 28, 2016

That looks like the amount of time routing took to come back up. I assume the anonymized IP was correct for the Kubernetes master service?

rvrignaud (Author) commented Apr 28, 2016

Yes, the IP is correct, and Prometheus started working fine after it recovered.
How long after the Kubernetes API recovers should it take for Prometheus to refresh the discovery?

jimmidyson (Member) commented Apr 28, 2016

The retry interval is 1 second by default (hence the repeated logs). As soon as the API is available again and routable, Prometheus should reconnect.

rvrignaud (Author) commented Apr 28, 2016

The API was available well before Prometheus refreshed the pods.

jimmidyson (Member) commented Apr 28, 2016

This could be something to do with the Kubernetes service routing syncing up again, but I'm not sure. I can't see how it could be related to the Prometheus reconnection logic.

fabxc (Member) commented Apr 28, 2016

Sounds like what caused my demo yesterday to fail. After my laptop briefly went into hibernation, the queries to the API server didn't go through.


macb commented Jun 6, 2016

I'm seeing similarly stale targets, but my kube-apiserver was consistently available while the results remained stale.

I'm seeing a fair number of EOF errors in the Prometheus logs. This doesn't seem consistent with other applications running within the cluster that talk to the API server (they aren't reporting EOFs).

macb commented Jun 30, 2016

This just happened again. The status page shows attempts to scrape two instances that are no longer listed in the k8s API (the service and pods were all deleted previously). I'd be happy to provide any additional information that would help track this down. As it stands, we use metrics for quite a few things, so I don't want to leave the instance broken, but I can capture that data when it occurs again.

brian-brazil (Member) commented Jun 30, 2016

Can I ask which Prometheus version you're on?

macb commented Jul 1, 2016

branch:    stable
buildDate: 20160418-10:07:02
buildUser: @523c4185767e
goVersion: go1.5.3
revision:  f12ebd6
version:   0.18.0

I see a few Kubernetes-related messages in the changelog since 0.18, though nothing that calls this out explicitly. Is there any reason we couldn't upgrade from 0.18 to 0.20?

matthiasr (Contributor) commented Jul 21, 2016

I'd be very interested in how this works for you in 0.20 / 1.0. I'm running into a similar thing where the first API contact at startup fails (because everything in our integration test starts up at once, I suppose), and then Prometheus never recovers from that (at least not within the timeframes I've waited). It does recover when I SIGHUP it.

This particular instance is a regression from 0.19.2; with that version the same setup worked fine.

matthiasr (Contributor) commented Jul 21, 2016

… and with 1.0.0 it seems to work again.

treed commented Aug 26, 2016

FWIW, I just ran into this running a version of master I pulled on the 22nd. Hitting the process with SIGHUP caused it to update; thanks for the workaround.

rvrignaud (Author) commented Sep 23, 2016

Hi,
We still encounter this issue running 1.1.0.
A SIGHUP indeed fixes the problem.

brian-brazil (Member) commented Oct 19, 2016

We've recently merged a brand-new Kubernetes integration. While it's still in beta, it will be replacing the existing one, so it's unlikely we'll be digging further into this issue with the old version.
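
For anyone finding this later: the new integration changes the shape of the discovery configuration. A rough sketch of what an equivalent in-cluster setup might look like under the new per-role model (field names as of the 1.3.0 beta; treat this as an illustration rather than the authoritative docs):

- job_name: 'kubernetes-nodes'
  kubernetes_sd_configs:
  - role: node          # one SD entry per role: node, pod, service, endpoints

- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
    # When Prometheus runs inside the cluster and no api_server is given,
    # the API server address and service-account credentials are picked up
    # automatically.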

macb commented Oct 19, 2016

@brian-brazil Which version of Prometheus has the new integration, and/or when did it go in?

fabxc (Member) commented Oct 19, 2016

The 1.3.0-beta.0 release will have it in. It should be cut tomorrow.


macb commented Oct 19, 2016

Thanks! Looking forward to giving it a try in some of our k8s clusters.

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
