Log successful target manager shutdown #2715

Closed
svend opened this Issue May 12, 2017 · 8 comments

@svend
Contributor

svend commented May 12, 2017

What did you do?

Shut down prometheus.

What did you expect to see?

A timely shutdown.

What did you see instead? Under which circumstances?

The shutdown took longer than 20 minutes, which is our pod shutdown grace period in Kubernetes, so the pod was killed. According to the logs, about 13 minutes passed after the "Stopping target manager..." message.

Environment

  • System information:

Kubernetes 1.5.6.

Linux 4.4.0-66-generic x86_64
  • Prometheus version:
prometheus, version 1.6.1 (branch: master, revision: 4666df502c0e239ed4aa1d80abbbfb54f61b23c3)
  build user:       root@7e45fa0366a7
  build date:       20170419-14:32:22
  go version:       go1.8.1
  • Prometheus configuration file:
prometheus.yaml
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  # Add the Kubernetes cluster name to all external systems (Alertmanager,
  # federation, etc.). This is useful information to have in alerts/pages.
  external_labels:
    k8s_cluster: {{ context.name }}

rule_files:
- '/etc/prometheus/rules/*.rules'

scrape_configs:

# Scrape config for local Prometheus.
- job_name: 'prometheus'
  static_configs:
    - targets: ['localhost:9090']

# Scrape config for Kubernetes api servers.
- job_name: 'kubernetes-apiservers'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: endpoints
  # Keep only the default/kubernetes service endpoints for the https port. This
  # will add targets for each API server which Kubernetes adds an endpoint to
  # the default/kubernetes service.
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https

# Scrape config for Kubernetes nodes.
- job_name: 'kubernetes-nodes'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)

# Scrape config for node-exporter.
- job_name: 'kubernetes-node-exporter'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - source_labels: [__address__]
    action: replace
    regex: (.+):(?:\d+)
    replacement: ${1}:9100
    target_label: __address__
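  # Replace whatever port is on the node address with node-exporter's port 9100.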

# Scrape config for service endpoints.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
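  # The joined source labels form "<host>[:port];<annotation port>"; when the
  # prometheus.io/port annotation is set, the scrape address becomes
  # "<host>:<annotation port>".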
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_instance]
    action: replace
    target_label: instance
  - source_labels: [__meta_kubernetes_service_label_app]
    action: replace
    target_label: app
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name

# Scrape config for services
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape-service`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
- job_name: 'kubernetes-services'
  honor_labels: true
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape_service]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_instance]
    action: replace
    target_label: instance
  - source_labels: [__meta_kubernetes_service_label_app]
    action: replace
    target_label: app
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name

# Example scrape config for probing services via the Blackbox Exporter.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/probe`: Only probe services that have a value of `true`
# * `prometheus.io/probe-path`: Probe the given URL path on the target
# * `prometheus.io/probe-target`: Probe the given target
- job_name: 'kubernetes-services-blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
    action: keep
    regex: true
  - source_labels: [__address__]
    target_label: instance
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe_target]
    target_label: __param_target
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_probe_path]
    action: replace
    target_label: __param_target
    regex: (.+);(.+)
    replacement: $1$2
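  # Joins the service address and the prometheus.io/probe-path annotation into
  # the probe target URL (only applied when the annotation is non-empty).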
  - target_label: __address__
    replacement: blackbox-exporter
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_service_name]
    target_label: kubernetes_name

# Example scrape config for probing pods via the Blackbox Exporter.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/probe`: Only probe pods that have a value of `true`
# * `prometheus.io/probe-path`: Probe the given URL path on the target
- job_name: 'kubernetes-pods-blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
    action: keep
    regex: true
  - source_labels: [__address__]
    target_label: instance
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_probe_path]
    action: replace
    target_label: __param_target
    regex: (.+);(.+)
    replacement: $1$2
  - target_label: __address__
    replacement: blackbox-exporter
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: kubernetes_name

# Example scrape config for pods
#
# The relabeling allows the actual pod scrape endpoint to be configured via the
# following annotations:
#
# * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: Scrape the pod on the indicated port instead of the exposed ports.
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # Only scrape containers that explicitly expose a port
  - source_labels: [__meta_kubernetes_pod_container_port_number, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: drop
    regex: ;
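  # The joined source labels are "<container port>;<port annotation>"; a bare ";"
  # means both are empty, so such targets are dropped.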
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_name

# Scrape config for Kubernetes etcd.
- job_name: 'kubernetes-etcd'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - source_labels: [__meta_kubernetes_node_label_wp_controller_node]
    action: keep
    regex: "true"
  - source_labels: [__address__]
    action: replace
    regex: (.+):(?:\d+)
    replacement: ${1}:2379
    target_label: __address__
  - source_labels: [__address__]
    target_label: instance

- job_name: 'kubernetes-route-reflectors'

  ec2_sd_configs:
  - region: us-west-2
    access_key: AKIAIKIVBKYYZS67SNMQ
    secret_key: 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
    # Refresh interval to re-read the instance list.
    # [ refresh_interval: <duration> | default = 60s ]
    port: 9100

  relabel_configs:
  # Non-tag metadata
  - source_labels: [__meta_ec2_availability_zone]
    target_label: availability_zone
  - source_labels: [__meta_ec2_instance_id]
    target_label: instance_id
  - source_labels: [__meta_ec2_instance_state]
    target_label: instance_state
  - source_labels: [__meta_ec2_instance_type]
    target_label: instance_type
  - source_labels: [__meta_ec2_private_ip]
    target_label: private_ip
  - source_labels: [__meta_ec2_public_dns_name]
    target_label: public_dns_name
  - source_labels: [__meta_ec2_public_ip]
    target_label: public_ip
  - source_labels: [__meta_ec2_subnet_id]
    target_label: subnet_id
  - source_labels: [__meta_ec2_vpc_id]
    target_label: vpc_id

  # Filter instances
  - source_labels: [__meta_ec2_tag_KubernetesCluster]
    regex: '_clustername_'
    action: keep
  - source_labels: [__meta_ec2_tag_wp_k8s_role]
    regex: 'RouteReflector'
    action: keep
  • Logs:

Shutdown log:

time="2017-05-12T18:20:52Z" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:233" 
time="2017-05-12T18:20:52Z" level=info msg="See you next time!" source="main.go:240" 
time="2017-05-12T18:20:52Z" level=info msg="Stopping target manager..." source="targetmanager.go:75" 
time="2017-05-12T18:33:33Z" level=info msg="Done checkpointing in-memory metrics and chunks in 30m54.920994786s." source="persistence.go:665" 
@grobie

Member

grobie commented May 12, 2017

Were there more log lines after that one? Looking at the last line here, it was probably just the checkpointing that took so long.

@svend

Contributor Author

svend commented May 12, 2017

I think the "Done checkpointing" line was the last line logged before the container was terminated.

@fabxc

Member

fabxc commented May 13, 2017

How big is the given Prometheus instance (targets, samples/sec, #series)? As @grobie said, that's most likely the checkpoint, which can take a long time for large instances. The target manager just waits for pending scrapes on shutdown, which can't really take longer than the maximum scrape timeout.

@grobie

Member

grobie commented May 13, 2017

@fabxc

Member

fabxc commented May 13, 2017

@grobie ah yes, that wasn't worded well. We do cancel in-flight scrapes, but since we don't want partial inserts of scraped data, we wait for all samples currently being written to the storage to be appended.

Prometheus 2.0's storage has transactions, so there we could simply roll back. But in practice, writes are fast enough in either version for this not to matter.
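
As a rough illustration of this shutdown sequence (a minimal sketch with made-up names, not the actual Prometheus code), canceling the in-flight scrape while still waiting for pending storage appends could look like this:

package main

import (
	"context"
	"log"
	"sync"
	"time"
)

// scraper is a hypothetical stand-in for a scrape loop that hands
// samples to a storage appender.
type scraper struct {
	cancel  context.CancelFunc // cancels the in-flight scrape
	appends sync.WaitGroup     // tracks samples currently being appended
}

// stop aborts the running scrape but waits until every sample already
// handed to the storage layer has been appended, so no partial scrape
// result is left half-written.
func (s *scraper) stop() {
	s.cancel()       // cancel the in-flight HTTP scrape
	s.appends.Wait() // wait for pending storage appends to finish
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	s := &scraper{cancel: cancel}

	// Simulate an append that is still in progress when shutdown starts.
	s.appends.Add(1)
	go func() {
		defer s.appends.Done()
		<-ctx.Done()                      // the scrape was canceled...
		time.Sleep(50 * time.Millisecond) // ...but the pending write still completes
	}()

	start := time.Now()
	s.stop()
	log.Printf("scraper stopped in %s", time.Since(start))
}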

@grobie

Member

grobie commented May 13, 2017

In contrast to (almost) all other managers and handlers, the target manager doesn't log once it has been successfully stopped. It should log a final "Target manager stopped." so we can see how long the shutdown took.

This would have prevented my request to open this issue.
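
A minimal sketch of what such a change could look like (the type and fields are illustrative stand-ins, not the real target manager, and the standard library logger stands in for Prometheus's logging):

package main

import (
	"context"
	"log"
	"sync"
)

// TargetManager is a pared-down, illustrative stand-in; the fields are made up.
type TargetManager struct {
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

// Stop shuts down scraping and, as proposed above, also logs once the
// shutdown has finished so the elapsed time is visible in the logs.
func (tm *TargetManager) Stop() {
	log.Println("Stopping target manager...")
	tm.cancel()  // stop discovery and cancel in-flight scrapes
	tm.wg.Wait() // wait for all scrape pools to shut down
	log.Println("Target manager stopped.")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	tm := &TargetManager{cancel: cancel}

	tm.wg.Add(1)
	go func() {
		defer tm.wg.Done()
		<-ctx.Done() // stand-in for a scrape pool shutting down
	}()

	tm.Stop()
}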

@grobie grobie changed the title Shutdown took over 20 minutes, 13 minutes to stop target manager Log successful target manager shutdown May 13, 2017

@svend

Contributor Author

svend commented May 15, 2017

@fabxc This instance has 204 targets, about 6k metrics/second, and about 198,000 local storage series. The instance was overloaded when I was shutting it down, because a target was accidentally exposing over 100,000 metrics.

I see I misread the log output, and it wasn't the target manager stopping that took so long.

gouthamve added a commit to gouthamve/prometheus that referenced this issue Jul 6, 2017

Log the stopping of TargetManager in Info
Fixes prometheus#2715

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>


@fabxc fabxc closed this in #2907 Jul 6, 2017

@lock


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 23, 2019
