Opening storage failed" err="invalid block sequence" #3714

Closed
cauwulixuan opened this issue Jan 20, 2018 · 29 comments

@cauwulixuan

cauwulixuan commented Jan 20, 2018

What did you do?
I ran Prometheus 2.0.0 on Kubernetes v1.8.5.

What did you expect to see?
Everything went well.

What did you see instead? Under which circumstances?
Everything went well at the beginning, but several hours later the pods' statuses turned to "CrashLoopBackOff" and all Prometheus instances became unavailable. After creating the pods, I didn't do anything.

[root@k8s-1 prometheus]# kubectl get all -n monitoring
NAME                          DESIRED   CURRENT   AGE
statefulsets/prometheus-k8s   0         2         16h

NAME                  READY     STATUS             RESTARTS   AGE
po/prometheus-k8s-0   0/1       CrashLoopBackOff   81         16h
po/prometheus-k8s-1   0/1       CrashLoopBackOff   22         16h

Environment

[root@k8s-1 prometheus]# kubectl version --short
Client Version: v1.8.5
Server Version: v1.8.5
[root@k8s-1 prometheus]# docker images | grep -i prometheus
quay.io/prometheus/alertmanager                          v0.12.0             f87cbd5f1360        5 weeks ago         31.2 MB
quay.io/prometheus/node_exporter                         v0.15.2             ff5ecdcfc4a2        6 weeks ago         22.8 MB
quay.io/prometheus/prometheus                            v2.0.0              67141fa03496        2 months ago        80.2 MB
  • System information:

    [root@k8s-1 prometheus]# uname -srm
    Linux 3.10.0-229.el7.x86_64 x86_64

  • Prometheus version:

    v2.0.0

  • Prometheus configuration file:

[root@k8s-1 prometheus]# cat prometheus-configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-k8s-config
  namespace: monitoring
data:
  prometheus.yaml: |
    global:
      scrape_interval: 10s
      scrape_timeout: 10s
      evaluation_interval: 10s
    rule_files:
      - "/etc/prometheus-rules/*.rules"
      
    # A scrape configuration for running Prometheus on a Kubernetes cluster.
    # This uses separate scrape configs for cluster components (i.e. API server, node)
    # and services to allow each to use different authentication configs.
    #
    # Kubernetes labels will be added as Prometheus labels on metrics via the
    # `labelmap` relabeling action.
    #
    # If you are using Kubernetes 1.7.2 or earlier, please take note of the comments
    # for the kubernetes-cadvisor job; you will need to edit or remove this job.
    
    # Scrape config for API servers.
    #
    # Kubernetes exposes API servers as endpoints to the default/kubernetes
    # service so this uses `endpoints` role and uses relabelling to only keep
    # the endpoints associated with the default/kubernetes service using the
    # default named port `https`. This works for single API server deployments as
    # well as HA API server deployments.
    scrape_configs:
    - job_name: 'kubernetes-apiservers'
    
      kubernetes_sd_configs:
      - role: endpoints
    
      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https
    
      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # If your node certificates are self-signed or use a different CA to the
        # master CA, then disable certificate verification below. Note that
        # certificate verification is an integral part of a secure infrastructure
        # so this should only be disabled in a controlled environment. You can
        # disable certificate verification by uncommenting the line below.
        #
        # insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    
      # Keep only the default/kubernetes service endpoints for the https port. This
      # will add targets for each API server which Kubernetes adds an endpoint to
      # the default/kubernetes service.
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    
    # Scrape config for nodes (kubelet).
    #
    # Rather than connecting directly to the node, the scrape is proxied though the
    # Kubernetes apiserver.  This means it will work if Prometheus is running out of
    # cluster, or can't connect to nodes for some other reason (e.g. because of
    # firewalling).
    - job_name: 'kubernetes-nodes'
    
      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https
    
      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    
      kubernetes_sd_configs:
      - role: node
    
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
    
    # Scrape config for Kubelet cAdvisor.
    #
    # This is required for Kubernetes 1.7.3 and later, where cAdvisor metrics
    # (those whose names begin with 'container_') have been removed from the
    # Kubelet metrics endpoint.  This job scrapes the cAdvisor endpoint to
    # retrieve those metrics.
    #
    # In Kubernetes 1.7.0-1.7.2, these metrics are only exposed on the cAdvisor
    # HTTP endpoint; use "replacement: /api/v1/nodes/${1}:4194/proxy/metrics"
    # in that case (and ensure cAdvisor's HTTP server hasn't been disabled with
    # the --cadvisor-port=0 Kubelet flag).
    #
    # This job is not necessary and should be removed in Kubernetes 1.6 and
    # earlier versions, or it will cause the metrics to be scraped twice.
    - job_name: 'kubernetes-cadvisor'
    
      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https
    
      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    
      kubernetes_sd_configs:
      - role: node
    
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    
    # Scrape config for service endpoints.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape services that have a value of `true`
    # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
    # to set this to `https` & most likely set the `tls_config` of the scrape config.
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: If the metrics are exposed on a different port to the
    # service then set this appropriately.
    - job_name: 'kubernetes-service-endpoints'
    
      kubernetes_sd_configs:
      - role: endpoints
    
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    
    # Example scrape config for probing services via the Blackbox Exporter.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-services'
    
      metrics_path: /probe
      params:
        module: [http_2xx]
    
      kubernetes_sd_configs:
      - role: service
    
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
    
    # Example scrape config for probing ingresses via the Blackbox Exporter.
    #
    # The relabeling allows the actual ingress scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-ingresses'
    
      metrics_path: /probe
      params:
        module: [http_2xx]
    
      kubernetes_sd_configs:
        - role: ingress
    
      relabel_configs:
        - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
          regex: (.+);(.+);(.+)
          replacement: ${1}://${2}${3}
          target_label: __param_target
        - target_label: __address__
          replacement: blackbox-exporter.example.com:9115
        - source_labels: [__param_target]
          target_label: instance
        - action: labelmap
          regex: __meta_kubernetes_ingress_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_ingress_name]
          target_label: kubernetes_name
    
    # Example scrape config for pods
    #
    # The relabeling allows the actual pod scrape endpoint to be configured via the
    # following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
    # pod's declared ports (default is a port-free target if none are declared).
    - job_name: 'kubernetes-pods'
    
      kubernetes_sd_configs:
      - role: pod
    
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
[root@k8s-1 prometheus]# cat prometheus-all-together.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
  annotations:
    prometheus.io/scrape: "true"
spec:
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: k8s
  sessionAffinity: None
  type: NodePort
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: prometheus
      prometheus: k8s
  serviceName: prometheus-k8s
  replicas: 2
  template:
    metadata:
      labels:
        app: prometheus
        prometheus: k8s
    spec:
      securityContext:
        runAsUser: 65534
        fsGroup: 65534
        runAsNonRoot: true
      containers:
      - args:
        - --config.file=/etc/prometheus/config/prometheus.yaml
        - --storage.tsdb.path=/cephfs/prometheus/data
        - --storage.tsdb.retention=180d
        - --web.route-prefix=/
        - --web.enable-lifecycle
        - --web.enable-admin-api
        image: quay.io/prometheus/prometheus:v2.0.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /status
            port: web
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        name: prometheus
        ports:
        - containerPort: 9090
          name: web
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /status
            port: web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 500m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
          readOnly: false
        - mountPath: /etc/prometheus/rules
          name: rules
          readOnly: false
        - mountPath: /cephfs/prometheus/data
          name: data
          subPath: prometheus-data
          readOnly: false
      serviceAccount: prometheus-k8s
      serviceAccountName: prometheus-k8s
      terminationGracePeriodSeconds: 60
      volumes:
      - configMap:
          defaultMode: 511
          name: prometheus-k8s-config
        name: config
      - configMap:
          defaultMode: 511
          name: prometheus-k8s-rules
        name: rules
      - name: data
        persistentVolumeClaim:
          claimName: cephfs-pvc
  updateStrategy:
    type: RollingUpdate
  • Logs:
[root@k8s-1 prometheus]# kubectl logs prometheus-k8s-0 -n monitoring
level=info ts=2018-01-20T03:16:32.966070249Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2018-01-20T03:16:32.966225361Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2018-01-20T03:16:32.966252185Z caller=main.go:217 host_details="(Linux 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 prometheus-k8s-0 (none))"
level=info ts=2018-01-20T03:16:32.969789371Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-01-20T03:16:32.971388907Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2018-01-20T03:16:32.971596811Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2018-01-20T03:16:59.781338012Z caller=main.go:323 msg="Opening storage failed" err="invalid block sequence: block time ranges overlap (1516348800000, 1516356000000)"
[root@k8s-1 prometheus]# 
[root@k8s-1 prometheus]# kubectl logs prometheus-k8s-1 -n monitoring
level=info ts=2018-01-20T03:15:22.701351679Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2018-01-20T03:15:22.70148418Z caller=main.go:216 build_context="(go=go1.9.2, user=root@615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2018-01-20T03:15:22.701512333Z caller=main.go:217 host_details="(Linux 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 prometheus-k8s-1 (none))"
level=info ts=2018-01-20T03:15:22.705824203Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-01-20T03:15:22.707629775Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2018-01-20T03:15:22.707837323Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=error ts=2018-01-20T03:15:54.775639791Z caller=main.go:323 msg="Opening storage failed" err="invalid block sequence: block time ranges overlap (1516348800000, 1516356000000)"
[root@k8s-1 prometheus]# kubectl describe po/prometheus-k8s-0 -n monitoring
Name:           prometheus-k8s-0
Namespace:      monitoring
Node:           k8s-3/172.16.1.8
Start Time:     Fri, 19 Jan 2018 17:59:38 +0800
Labels:         app=prometheus
                controller-revision-hash=prometheus-k8s-7d86dfbd86
                prometheus=k8s
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"StatefulSet","namespace":"monitoring","name":"prometheus-k8s","uid":"7593d8ac-fcff-11e7-9333-fa163e48f857"...
Status:         Running
IP:             10.244.2.54
Created By:     StatefulSet/prometheus-k8s
Controlled By:  StatefulSet/prometheus-k8s
Containers:
  prometheus:
    Container ID:  docker://98faabe55fb71050aacd776d349a6567c25c339117159356eedc10cbc19ef02a
    Image:         quay.io/prometheus/prometheus:v2.0.0
    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:53afe934a8d497bb703dbbf7db273681a56677775c462833da8d85015471f7a3
    Port:          9090/TCP
    Args:
      --config.file=/etc/prometheus/config/prometheus.yaml
      --storage.tsdb.path=/cephfs/prometheus/data
      --storage.tsdb.retention=180d
      --web.route-prefix=/
      --web.enable-lifecycle
      --web.enable-admin-api
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 20 Jan 2018 11:11:00 +0800
      Finished:     Sat, 20 Jan 2018 11:11:29 +0800
    Ready:          False
    Restart Count:  84
    Limits:
      cpu:     500m
      memory:  500Mi
    Requests:
      cpu:        100m
      memory:     200Mi
    Liveness:     http-get http://:web/status delay=30s timeout=3s period=5s #success=1 #failure=10
    Readiness:    http-get http://:web/status delay=0s timeout=3s period=5s #success=1 #failure=6
    Environment:  <none>
    Mounts:
      /cephfs/prometheus/data from data (rw)
      /etc/prometheus/config from config (rw)
      /etc/prometheus/rules from rules (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-x8xzh (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-k8s-config
    Optional:  false
  rules:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-k8s-rules
    Optional:  false
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cephfs-pvc
    ReadOnly:   false
  prometheus-k8s-token-x8xzh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-token-x8xzh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason      Age                  From            Message
  ----     ------      ----                 ----            -------
  Normal   Pulled      15m (x83 over 17h)   kubelet, k8s-3  Container image "quay.io/prometheus/prometheus:v2.0.0" already present on machine
  Warning  FailedSync  23s (x1801 over 7h)  kubelet, k8s-3  Error syncing pod

Any suggestions?
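For reference, the bounds in the "block time ranges overlap (1516348800000, 1516356000000)" error are Unix timestamps in milliseconds; converting them (a quick sketch, assuming GNU date is available) shows the affected two-hour block:

$ date -u -d @1516348800    # 1516348800000 ms -> seconds
Fri Jan 19 08:00:00 UTC 2018
$ date -u -d @1516356000    # 1516356000000 ms -> seconds
Fri Jan 19 10:00:00 UTC 2018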

@cauwulixuan (Author)

I think it may be similar to "Prometheus 2.0 fails to start up after couple of restarts" (#3191). Tell me if you need more details, thanks.

@cauwulixuan (Author)

Logs on the k8s node:

[root@k8s-3 01C48JAGH1QCGKGCG72E0B2Y8R]# journalctl -xeu kubelet --no-pager
1月 20 11:21:54 k8s-3 kubelet[14306]: I0120 11:21:54.619924   14306 kuberuntime_manager.go:749] Back-off 5m0s restarting failed container=prometheus pod=prometheus-k8s-0_monitoring(7598959a-fcff-11e7-9333-fa163e48f857)
1月 20 11:21:54 k8s-3 kubelet[14306]: E0120 11:21:54.620042   14306 pod_workers.go:182] Error syncing pod 7598959a-fcff-11e7-9333-fa163e48f857 ("prometheus-k8s-0_monitoring(7598959a-fcff-11e7-9333-fa163e48f857)"), skipping: failed to "StartContainer" for "prometheus" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=prometheus pod=prometheus-k8s-0_monitoring(7598959a-fcff-11e7-9333-fa163e48f857)"
1月 20 11:22:08 k8s-3 kubelet[14306]: I0120 11:22:08.615438   14306 kuberuntime_manager.go:500] Container {Name:prometheus Image:quay.io/prometheus/prometheus:v2.0.0 Command:[] Args:[--config.file=/etc/prometheus/config/prometheus.yaml --storage.tsdb.path=/cephfs/prometheus/data --storage.tsdb.retention=180d --web.route-prefix=/ --web.enable-lifecycle --web.enable-admin-api] WorkingDir: Ports:[{Name:web HostPort:0 ContainerPort:9090 Protocol:TCP HostIP:}] EnvFrom:[] Env:[] Resources:{Limits:map[cpu:{i:{value:500 scale:-3} d:{Dec:<nil>} s:500m Format:DecimalSI} memory:{i:{value:524288000 scale:0} d:{Dec:<nil>} s:500Mi Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:209715200 scale:0} d:{Dec:<nil>} s: Format:BinarySI}]} VolumeMounts:[{Name:config ReadOnly:false MountPath:/etc/prometheus/config SubPath: MountPropagation:<nil>} {Name:rules ReadOnly:false MountPath:/etc/prometheus/rules SubPath: MountPropagation:<nil>} {Name:data ReadOnly:false MountPath:/cephfs/prometheus/data SubPath:prometheus-data MountPropagation:<nil>} {Name:prometheus-k8s-token-x8xzh ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/status,Port:web,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:30,TimeoutSeconds:3,PeriodSeconds:5,SuccessThreshold:1,FailureThreshold:10,} ReadinessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/status,Port:web,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:0,TimeoutSeconds:3,PeriodSeconds:5,SuccessThreshold:1,FailureThreshold:6,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
1月 20 11:22:08 k8s-3 kubelet[14306]: I0120 11:22:08.615662   14306 kuberuntime_manager.go:739] checking backoff for container "prometheus" in pod "prometheus-k8s-0_monitoring(7598959a-fcff-11e7-9333-fa163e48f857)"

@cauwulixuan (Author)

cauwulixuan commented Jan 20, 2018

[root@k8s-1 prometheus]# kubectl delete -f prometheus-all-together.yaml
error when stopping "prometheus-all-together.yaml": timed out waiting for "prometheus-k8s" to be synced

@cauwulixuan (Author)

Probably similar to "Prom2: crash on opening WAL block" (#2795).

@cauwulixuan (Author)

Any updates here?

@kinghrothgar

kinghrothgar commented Mar 9, 2018

I am also seeing this in Prometheus v2.2.0

Edit: I will add more info as I uncover it.

@phreaker0

I just hit the same error with Prometheus v2.2.0 (I installed this version fresh a few days ago). Details:

  • Prometheus crashed because of memory issues:
...
Mar 12 09:34:28 sam systemd-nspawn[9997]: level=error ts=2018-03-12T08:34:28.218226724Z caller=db.go:281 component=tsdb msg="compaction failed" err="reload blocks: open block /prometheus/01C8BDSXTHADS17HGQC9F04341: mmap files: mmap: cannot allocate memory"
Mar 12 09:35:28 sam systemd-nspawn[9997]: fatal error: runtime: cannot allocate memory
  • Starting it again failed because it hit the open-file limit on startup:
Mar 12 09:43:57 sam systemd-nspawn[12671]: level=error ts=2018-03-12T08:43:57.477802976Z caller=main.go:582 err="Opening storage failed open block /prometheus/01C8BJ7GSFJVBBEKW57VATFP91: mmap files: try lock file: open /prometheus/01C8BJ7GSFJVBBEKW57VATFP91/chunks/000358: too many open files"
Mar 12 09:43:57 sam systemd-nspawn[12671]: level=info ts=2018-03-12T08:43:57.477836976Z caller=main.go:584 msg="See you next time!"
  • Starting again with a raised limit (LimitNOFILE=49152); a systemd drop-in sketch follows after this list:
Mar 12 09:47:37 sam systemd-nspawn[15139]: level=error ts=2018-03-12T08:47:37.248961761Z caller=main.go:582 err="Opening storage failed invalid block sequence: block time ranges overlap (1520596800000, 1520791200000)"
Mar 12 09:47:37 sam systemd-nspawn[15139]: level=info ts=2018-03-12T08:47:37.248992015Z caller=main.go:584 msg="See you next time!"
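A minimal sketch of how such a limit can be raised for a systemd-managed Prometheus; the unit name prometheus.service, the drop-in path, and the limit value are assumptions here, not taken from the setup above:

# Sketch only: add a systemd drop-in that raises the open-file limit.
mkdir -p /etc/systemd/system/prometheus.service.d
cat > /etc/systemd/system/prometheus.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=49152
EOF
# Reload units and restart the service so the new limit takes effect.
systemctl daemon-reload
systemctl restart prometheus.service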

@phreaker0

I just checked the logs further; the issue had already appeared at runtime earlier, without crashing:

....
Mar 11 20:00:00 sam systemd-nspawn[9997]: level=info ts=2018-03-11T19:00:00.503660857Z caller=head.go:348 component=tsdb msg="head GC completed" duration=23.481695ms
Mar 11 20:00:01 sam systemd-nspawn[9997]: level=info ts=2018-03-11T19:00:01.340065516Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=836.346465ms
Mar 11 22:00:00 sam systemd-nspawn[9997]: level=info ts=2018-03-11T21:00:00.070781405Z caller=compact.go:394 component=tsdb msg="compact blocks" count=1 mint=1520791200000 maxt=1520798400000
Mar 11 22:00:00 sam systemd-nspawn[9997]: level=info ts=2018-03-11T21:00:00.497704892Z caller=head.go:348 component=tsdb msg="head GC completed" duration=20.147957ms
Mar 11 22:00:00 sam systemd-nspawn[9997]: level=info ts=2018-03-11T21:00:00.6616537Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=163.892808ms
Mar 11 22:00:00 sam systemd-nspawn[9997]: level=info ts=2018-03-11T21:00:00.715142505Z caller=compact.go:394 component=tsdb msg="compact blocks" count=3 mint=1520769600000 maxt=1520791200000
Mar 11 22:00:01 sam systemd-nspawn[9997]: level=info ts=2018-03-11T21:00:01.661388512Z caller=compact.go:394 component=tsdb msg="compact blocks" count=2 mint=1520726400000 maxt=1520769600000
Mar 11 22:00:02 sam systemd-nspawn[9997]: level=info ts=2018-03-11T21:00:02.770002137Z caller=compact.go:394 component=tsdb msg="compact blocks" count=3 mint=1520596800000 maxt=1520791200000
Mar 11 22:00:05 sam systemd-nspawn[9997]: level=error ts=2018-03-11T21:00:05.693815052Z caller=db.go:281 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap (1520726400000, 1520791200000)"
Mar 11 22:00:06 sam systemd-nspawn[9997]: level=info ts=2018-03-11T21:00:06.71243681Z caller=compact.go:394 component=tsdb msg="compact blocks" count=2 mint=1520726400000 maxt=1520791200000
Mar 11 22:00:08 sam systemd-nspawn[9997]: level=error ts=2018-03-11T21:00:08.167673295Z caller=db.go:281 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap (1520596800000, 1520661600000)"
Mar 11 22:01:12 sam systemd-nspawn[9997]: level=info ts=2018-03-11T21:01:12.186361945Z caller=compact.go:394 component=tsdb msg="compact blocks" count=2 mint=1520726400000 maxt=1520791200000
...

@n0guest

n0guest commented Mar 12, 2018

I just checked the logs further; the issue had already appeared at runtime earlier, without crashing:

We also have 2.2.0, and this issue has a few additional symptoms:

  1. It can crash after a while and even refuse to start after that.
  2. Mostly, such errors also come with the process going "all crazy on IOPS" (i.e. huge I/O without any external workload).
  3. It doesn't look related to a "long history" in the TSDB and can appear even on a fresh instance with a small number of samples (i.e. within a day after wiping all data and restarting Prometheus).

I hope this helps diagnose the problem.

@hectorag

Also seeing this problem in my setup, running Prometheus v2.2.0 with an empty DB. After some 2-3 hours, Prometheus starts generating this error:

Mar 12 21:45:15 ip-10-2-2-148.ec2.internal docker[19840]: level=error ts=2018-03-12T21:45:15.648452938Z caller=db.go:281 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap (1520872200000, 1520888400000)"
Mar 12 21:45:16 ip-10-2-2-148.ec2.internal docker[19840]: level=info ts=2018-03-12T21:45:16.720721593Z caller=compact.go:394 component=tsdb msg="compact blocks" count=2 mint=1520872200000 maxt=1520888400000

Then it was not able to recover and started failing again and again, showing the error below:

Mar 13 06:09:51 ip-10-2-2-148.ec2.internal docker[987]: level=error ts=2018-03-13T06:09:51.027694261Z caller=main.go:582 err="Opening storage failed invalid block sequence: block time ranges overlap (1520868600000, 1520888400000)"

@gouthamve (Member)

Sorry about that, this is a bug; the fix is here: prometheus-junkyard/tsdb#299. A new bug-fix release will be out soon.

@phreaker0

@gouthamve I hit it again, but this time rolling back the data to some point in the past (ZFS snapshots) didn't work, as Prometheus started compacting blocks after startup and hit the issue after a couple of seconds. So I grabbed the linked patch, compiled Prometheus, and am now running master plus the patch; it's fine so far, thanks.

@shenshouer

Met the same issue with Prometheus v2.2.0:

Opening storage failed invalid block sequence: block time ranges overlap (1521165600000, 1521547200000)

@brian-brazil (Contributor)

Please try 2.2.1.

@shenshouer

@brian-brazil It worked fine after I deleted all the old data when updating Prometheus from v2.2.0 to v2.2.1.

@brian-brazil (Contributor)

Dupe of #3943.

@bamb00

bamb00 commented Mar 30, 2018

@brian-brazil Hi,

I'm hitting this issue with v2.2.1. Does this issue need to be re-opened?

        level=info ts=2018-03-30T15:16:29.279332879Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
	level=info ts=2018-03-30T15:16:29.279436055Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
	level=info ts=2018-03-30T15:16:29.279459825Z caller=main.go:222 host_details="(Linux 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 prometheus-k8s-0 (none))"
	level=info ts=2018-03-30T15:16:29.279483284Z caller=main.go:223 fd_limits="(soft=1048576, hard=1048576)"
	level=info ts=2018-03-30T15:16:29.283771092Z caller=web.go:382 component=web msg="Start listening for connections" address=0.0.0.0:9090
	level=info ts=2018-03-30T15:16:29.283653706Z caller=main.go:504 msg="Starting TSDB ..."
	level=info ts=2018-03-30T15:16:30.895623919Z caller=main.go:398 msg="Stopping scrape discovery manager..."
	level=info ts=2018-03-30T15:16:30.895697567Z caller=main.go:411 msg="Stopping notify discovery manager..."
	level=info ts=2018-03-30T15:16:30.895731442Z caller=main.go:432 msg="Stopping scrape manager..."
	level=info ts=2018-03-30T15:16:30.895760845Z caller=manager.go:460 component="rule manager" msg="Stopping rule manager..."
	level=info ts=2018-03-30T15:16:30.895779426Z caller=manager.go:466 component="rule manager" msg="Rule manager stopped"
	level=info ts=2018-03-30T15:16:30.895793446Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
	level=info ts=2018-03-30T15:16:30.89582586Z caller=main.go:394 msg="Scrape discovery manager stopped"
	level=info ts=2018-03-30T15:16:30.895851397Z caller=main.go:407 msg="Notify discovery manager stopped"
	level=info ts=2018-03-30T15:16:30.895955998Z caller=main.go:426 msg="Scrape manager stopped"
	level=info ts=2018-03-30T15:16:30.895998626Z caller=main.go:573 msg="Notifier manager stopped"
	level=error ts=2018-03-30T15:16:30.896042941Z caller=main.go:582 err="Opening storage failed invalid block sequence: block time ranges overlap (1522353600000, 1522360800000)"
	level=info ts=2018-03-30T15:16:30.896097327Z caller=main.go:584 msg="See you next time!"

Thanks.

@Sriharivignesh

Is there a way to recover from this error without flushing data out? I don't want to lose a chunk of my metrics data because of this :|

@zhanglijingisme

@bamb00 Any update about this?

@bamb00

bamb00 commented Apr 20, 2018

@zhanglijingisme I have not heard back from the Prometheus team.

@candlerb (Contributor)

After upgrading from v2.2.1 to v2.3.0 I got this error:

Jun 14 08:10:40 wrn-prometheus prometheus[1334]: level=error ts=2018-06-14T08:10:40.476105933Z caller=main.go:597 err="Opening storage failed invalid block sequence: block time ranges overlap: [mint: 1528848000000, maxt: 1528855200000, range: 2h0m0s, blocks: 2]: <ulid: 01CFVHC16MCPM30SVH0D8PFJ3Y, mint: 1528848000000, maxt: 1528855200000, range: 2h0m0s>, <ulid: 01CFYKMB140Z3AMNEKHWDTS3RQ, mint: 1528848000000, maxt: 1528862400000, range: 4h0m0s>\n[mint: 1528855200000, maxt: 1528862400000, range: 2h0m0s, blocks: 2]: <ulid: 01CFYKMB140Z3AMNEKHWDTS3RQ, mint: 1528848000000, maxt: 1528862400000, range: 4h0m0s>, <ulid: 01CFVR7REHFYBN1DP3QHS3KH8C, mint: 1528855200000, maxt: 1528862400000, range: 2h0m0s>"
Jun 14 08:10:40 wrn-prometheus prometheus[1334]: level=info ts=2018-06-14T08:10:40.476152815Z caller=main.go:599 msg="See you next time!"

I have kept the old data via mv /var/lib/prometheus/{data,data.corrupt} in case there's any value in it.

Note: the thing that prompted the upgrade was that Prometheus had started doing much more disk I/O than expected and was saturating the underlying hard drives. It's a relatively small set of time series being monitored: count({__name__=~".+"}) returns 12088. After the 2.3.0 upgrade failed to start and I blew the database away, it is much happier now.

@mysteryegg

I have duplicated the behavior reported by @candlerb when upgrading from 2.2.1 to 2.3.1.
I assume this new behavior falls under prometheus-junkyard/tsdb#347 and should be tracked there.

@uncleNight

uncleNight commented Jul 9, 2018

Here's how it went for me (running the prom/prometheus:v2.3.0 Docker container).
The OS was rebooted (manually); after the reboot Prometheus kept restarting with:

level=error ts=2018-07-09T09:44:19.761219359Z caller=main.go:597 err="Opening storage failed invalid block sequence: block time ranges overlap: [mint: 1530856800000, maxt: 1530864000000, range: 2h0m0s, blocks: 2]: <ulid: 01CHQD40DG2QE2ZE3MFMMQ1VFS, mint: 1530856800000, maxt: 1530864000000, range: 2h0m0s>, <ulid: 01CHZ45KDMB5S64X6R3AQMWSXD, mint: 1530856800000, maxt: 1530878400000, range: 6h0m0s>\n[mint: 1530871200000, maxt: 1530878400000, range: 2h0m0s, blocks: 2]: <ulid: 01CHZ45KDMB5S64X6R3AQMWSXD, mint: 1530856800000, maxt: 1530878400000, range: 6h0m0s>, <ulid: 01CHQTVEXG910WRSSS7S6D264W, mint: 1530871200000, maxt: 1530878400000, range: 2h0m0s>"
I stopped the container and checked the volume data:

# ls -lh
total 36K
drwxr-xr-x 3 nobody nogroup 4.0K Jul  5 09:00 01CHMTQB0CQNF49HZ7CNR2105S
drwxr-xr-x 3 nobody nogroup 4.0K Jul  6 03:00 01CHPRGWBMQJVZ626S20P5QRB6
drwxr-xr-x 3 nobody nogroup 4.0K Jul  6 09:00 01CHQD40DG2QE2ZE3MFMMQ1VFS
drwxr-xr-x 3 nobody nogroup 4.0K Jul  6 09:00 01CHQD4138KK0ZTADMA90MT9N8
drwxr-xr-x 3 nobody nogroup 4.0K Jul  6 13:00 01CHQTVEXG910WRSSS7S6D264W
drwxr-xr-x 3 nobody nogroup 4.0K Jul  6 15:00 01CHR1Q65F38PZFHJZP1WG3PZ8
drwxr-xr-x 3 nobody nogroup 4.0K Jul  9 09:04 01CHZ45KDMB5S64X6R3AQMWSXD
drwxr-xr-x 3 nobody nogroup 4.0K Jul  9 09:04 01CHZ4JT0317TT5HYKKZKW24BJ.tmp
-rw-rw-r-- 1 nobody nogroup    0 Jul  9 09:22 lock
drwxr-xr-x 2 nobody nogroup 4.0K Jul  6 13:00 wal

# du -sh 01*
99M	01CHMTQB0CQNF49HZ7CNR2105S
94M	01CHPRGWBMQJVZ626S20P5QRB6
12M	01CHQD40DG2QE2ZE3MFMMQ1VFS
34M	01CHQD4138KK0ZTADMA90MT9N8
12M	01CHQTVEXG910WRSSS7S6D264W
13M	01CHR1Q65F38PZFHJZP1WG3PZ8
29G	01CHZ45KDMB5S64X6R3AQMWSXD
27G	01CHZ4JT0317TT5HYKKZKW24BJ.tmp

Note the last two directories; they're the heaviest. If you check the ULIDs mentioned in the logs, you'll notice they match the directory names. After messing around a little with moving away the smaller directories with IDs from the logs, I ended up with the same message in the logs: somehow Prometheus encounters the same time ranges in different chunks from different directories (my speculation only, no idea what kind of satanic magic it runs by).
So I did what seemed logical: created a backup directory and moved everything there except the wal directory and the latest (heaviest) non-.tmp directory. It then looked like this:

# ls -lh
total 16K
drwxr-xr-x 3 nobody nogroup 4.0K Jul  9 09:04 01CHZ45KDMB5S64X6R3AQMWSXD
drwxr-xr-x 8 root   root    4.0K Jul  9 09:45 bkp
-rw-rw-r-- 1 nobody nogroup    0 Jul  9 09:22 lock
drwxr-xr-x 2 nobody nogroup 4.0K Jul  6 13:00 wal

Started Prometheus again and, voilà, it works again, and the data is there and accessible (I can see it by running queries from the very beginning of the monitoring history). Hope it helps somebody.
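A rough shell sketch of the workaround described above, assuming Prometheus is stopped and the TSDB volume is mounted at /prometheus; the paths and the bkp directory name are illustrative only:

cd /prometheus
mkdir -p bkp

# Newest block directory (ULIDs sort chronologically); skip .tmp leftovers.
latest=$(ls -d 01* | grep -v '\.tmp$' | sort | tail -n 1)

# Move every other block directory (including .tmp leftovers) into the backup,
# keeping only "$latest" and the wal directory in place.
for d in 01*; do
  [ "$d" = "$latest" ] || mv "$d" bkp/
done

# Start Prometheus again and verify the data before discarding bkp/.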

@lucasgameiro

I had the same issue in a Windows environment with 2.3.*. I updated to version 2.4.3 and it still didn't work.
The only solution for me was changing the storage.tsdb.path attribute; even deleting the data folder didn't solve it.
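For anyone trying the same, a minimal sketch of pointing Prometheus at a fresh data directory via the flag already used in this thread; the paths here are just examples:

# Start Prometheus with a new, empty TSDB directory (example paths only).
prometheus --config.file=prometheus.yml --storage.tsdb.path=/var/lib/prometheus/data-new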

@sumeshkanayi

We also faced a similar issue with 2.3.2. We had to move the data from the existing path set in storage.tsdb.path to a new location and restart Prometheus.

@Sana7H

Sana7H commented Dec 12, 2018

I faced the same issue in version 2.3.2. I tried deleting the duplicated chunks and restarting; it didn't work. Finally I had to move the whole data block to a different folder, create another empty data folder, and restart the Prometheus service to make it work.

@Teriand

Teriand commented Jan 31, 2019

Same in 2.7.0.

The fix from @uncleNight, moving the bad files away, helped.

@yarix

yarix commented May 6, 2019

Had the same issue after killing Prometheus. I removed all *.tmp directories and all the directories reported in the log (the <ulid: entries).

@alevchuk

alevchuk commented Sep 8, 2019

@uncleNight's solution worked for me!

Applying the solution as described, I saw large gaps in the data (I use Grafana for dashboarding). To fill those, I moved back all of the "latest (heaviest)" directories. Now all the data looks great!

BTW, the overall problem for me was triggered by running out of disk space.

The lock bot locked and limited the conversation to collaborators on Mar 10, 2020.