Not Deleting Old Data After TSDB Retention Passed #4176

Closed
diarmuidie opened this Issue May 21, 2018 · 8 comments

diarmuidie commented May 21, 2018

Bug Report

What did you do?
Start Prometheus with 15d retention period:
[Screenshot: Prometheus time series collection and processing server graph]

What did you expect to see?
15 days of metrics stored, per the --storage.tsdb.retention=15d flag.
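
For reference, the startup invocation would have looked roughly like this (a sketch; the full command line is not included in the report, and the storage path here is a placeholder):

    prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.tsdb.path=/prometheus \
      --storage.tsdb.retention=15d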

What did you see instead? Under which circumstances?
7+ weeks of metrics stored and 100% disk usage.
[Screenshot: Prometheus time series collection and processing server graph]

This Prometheus is running in a Kubernetes pod with an Amazon EBS volume mounted for storage. The cluster is used for testing so the pod has been restarted a number of times (in case that makes a difference).

Environment

  • System information:

    Linux 3.10.0-327.10.1.el7.x86_64 x86_64

  • Prometheus version:

    prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
      build user:       root@prometheus-binary-18-build
      build date:       20180322-20:38:40
      go version:       go1.10
    
  • Prometheus configuration file:

global:
  scrape_interval: 1m
  scrape_timeout: 10s
  evaluation_interval: 1m
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
    scheme: http
    timeout: 10s
rule_files:
- /etc/prometheus/*.rules
scrape_configs:
- job_name: kubernetes-apiservers
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  bearer_token_file: <redacted>/token
  tls_config:
    ca_file: <redacted>/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: default;kubernetes;https
    replacement: $1
    action: keep
- job_name: kubernetes-nodes
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: <redacted>/token
  tls_config:
    ca_file: <redacted>/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: openshift_sdn_pod_(setup|teardown)_latency(.*)
    replacement: $1
    action: drop
- job_name: kubernetes-service-endpoints
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  tls_config:
    ca_file: <redacted>/ca.crt
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: prometheus-node-exporter
    replacement: $1
    action: drop
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: (.+)(?::\d+);(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_service_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
    separator: ;
    regex: (.+);(.+)
    target_label: kubernetes_identifier
    replacement: $1/$2
    action: replace
- job_name: kubernetes-pods
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names: []
  tls_config:
    ca_file: <redacted>/ca.crt
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_node_name
    replacement: $1
    action: replace
- job_name: kubernetes-nodes-exporter
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  tls_config:
    ca_file: <redacted>/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*):10250
    target_label: __address__
    replacement: ${1}:9100
    action: replace
  - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
    separator: ;
    regex: (.*)
    target_label: __instance__
    replacement: $1
    action: replace
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
- job_name: openshift-template-service-broker
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  bearer_token_file: <redacted>/token
  tls_config:
    ca_file: <redacted>/service-ca.crt
    server_name: apiserver.openshift-template-service-broker.svc
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: openshift-template-service-broker;apiserver;https
    replacement: $1
    action: keep
  • Logs:
    Startup logs
level=info ts=2018-05-21T08:08:28.205842497Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
level=info ts=2018-05-21T08:08:28.205956393Z caller=main.go:221 build_context="(go=go1.10, user=root@prometheus-binary-18-build, date=20180322-20:38:40)"
level=info ts=2018-05-21T08:08:28.205989311Z caller=main.go:222 host_details="(Linux 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 prometheus-40-x9kqv (none))"
level=info ts=2018-05-21T08:08:28.206020466Z caller=main.go:223 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-05-21T08:08:28.208683799Z caller=main.go:504 msg="Starting TSDB ..."
level=info ts=2018-05-21T08:08:28.208721555Z caller=web.go:382 component=web msg="Start listening for connections" address=localhost:9090
level=warn ts=2018-05-21T08:19:29.988672101Z caller=head.go:320 component=tsdb msg="unknown series references in WAL samples" count=682987
level=info ts=2018-05-21T08:19:31.618723959Z caller=main.go:514 msg="TSDB started"
level=info ts=2018-05-21T08:19:31.618844478Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-05-21T08:19:31.646747609Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-21T08:19:31.664415642Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-21T08:19:31.665699947Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-21T08:19:31.667201564Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-21T08:19:31.668773304Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-21T08:19:31.670424029Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-21T08:19:31.676616026Z caller=main.go:491 msg="Server is ready to receive web requests."
level=info ts=2018-05-21T08:19:37.270613156Z caller=compact.go:393 component=tsdb msg="compact blocks" count=1 mint=1523354400000 maxt=1523361600000

krasi-georgiev commented May 21, 2018

Another unexpected log line is the "unknown series references in WAL samples" warning, which looks similar to #4108, which I am currently investigating.

It would be easiest to replicate if you can send me a copy of the data folder (kgeorgie at redhat.com), or if you use VS Code we can use the new Live Share feature to troubleshoot this together :)
https://code.visualstudio.com/blogs/2017/11/15/live-share

diarmuidie commented May 22, 2018

@krasi-georgiev I think the unknown series references in WAL samples could be a result of the disk filling up and me having to manually delete some files to get Prometheus to restart so I could look at the logs, etc.

I'm afraid I can't share the data with you because it's from our company's system and potentially sensitive. If there is something I can look for in the data that might help with debugging, I'm happy to check.

krasi-georgiev commented May 22, 2018

This is the func that reads the meta info for each block and decides which ones should be deleted based on the retention settings. Adding some debugging there should give you a clue.

// retentionCutoffDirs returns all directories of blocks in dir that are strictly
// before mint.
func retentionCutoffDirs(dir string, mint int64) ([]string, error) {
    df, err := fileutil.OpenDir(dir)
    if err != nil {
        return nil, errors.Wrapf(err, "open directory")
    }
    defer df.Close()

    dirs, err := blockDirs(dir)
    if err != nil {
        return nil, errors.Wrapf(err, "list block dirs %s", dir)
    }

    delDirs := []string{}

    for _, dir := range dirs {
        meta, err := readMetaFile(dir)
        if err != nil {
            return nil, errors.Wrapf(err, "read block meta %s", dir)
        }
        // The first block we encounter marks that we crossed the boundary
        // of deletable blocks.
        if meta.MaxTime >= mint {
            break
        }

        delDirs = append(delDirs, dir)
    }
    return delDirs, nil
}
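
As a rough illustration of that check, here is a standalone sketch (not Prometheus code; the data directory path and the blockMeta struct are assumptions for illustration) that reads each block's meta.json and prints whether its maxTime falls before a 15d cutoff. Unlike retentionCutoffDirs it inspects every block instead of stopping at the first non-deletable one, which is enough to see how the comparison would classify the old blocks:

package main

import (
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
    "time"
)

// blockMeta holds only the meta.json fields needed for the retention check.
type blockMeta struct {
    ULID    string `json:"ulid"`
    MinTime int64  `json:"minTime"`
    MaxTime int64  `json:"maxTime"`
}

func main() {
    dataDir := "/prometheus"          // assumed path to (a copy of) the TSDB data directory
    retention := 15 * 24 * time.Hour  // mirrors --storage.tsdb.retention=15d
    // Cutoff in milliseconds, the unit used by minTime/maxTime in meta.json.
    mint := time.Now().Add(-retention).UnixNano() / int64(time.Millisecond)

    metaFiles, err := filepath.Glob(filepath.Join(dataDir, "*", "meta.json"))
    if err != nil {
        panic(err)
    }
    for _, p := range metaFiles {
        raw, err := os.ReadFile(p)
        if err != nil {
            panic(err)
        }
        var m blockMeta
        if err := json.Unmarshal(raw, &m); err != nil {
            panic(err)
        }
        // Same comparison as retentionCutoffDirs: a block is deletable only
        // if it ends strictly before the cutoff.
        fmt.Printf("%s maxTime=%s deletable=%v\n",
            m.ULID,
            time.Unix(m.MaxTime/1000, 0).UTC().Format(time.RFC3339),
            m.MaxTime < mint)
    }
}

Run against a copy of the data directory, any block that prints deletable=true but is still on disk would point at the deletion step rather than at the cutoff calculation itself.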

Or, if you can share the meta.json files for each block, I can also have a look. These don't include any sensitive data.

diarmuidie commented May 22, 2018

Here is one of the meta files:

{
        "ulid": "01C9KRPEMW3ECT4R02S1Z083KD",
        "minTime": 1522144800000,
        "maxTime": 1522152000000,
        "stats": {
                "numSamples": 16127178,
                "numSeries": 150970,
                "numChunks": 150970
        },
        "compaction": {
                "level": 1,
                "sources": [
                        "01C9KRPEMW3ECT4R02S1Z083KD"
                ]
        },
        "version": 1
}

Both timestamps are for March 27, 2018 (> 15d). I'll look at putting some debugging into the function to see what's going on.
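
(For reference, converting those millisecond timestamps confirms the dates; a quick sketch, not part of the original report:)

package main

import (
    "fmt"
    "time"
)

func main() {
    // minTime and maxTime from the meta.json above are Unix timestamps in milliseconds.
    for _, ms := range []int64{1522144800000, 1522152000000} {
        fmt.Println(time.Unix(ms/1000, 0).UTC()) // 2018-03-27 10:00:00 and 12:00:00 UTC
    }
}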

krasi-georgiev commented May 22, 2018

Maybe it doesn't count holidays and weekends 😄

Joking aside, ping me on IRC if you need any help adding some debugging info.

krasi-georgiev commented May 24, 2018

Any luck with the debugging?

diarmuidie commented May 24, 2018

Unfortunately our test environment got rebuilt before I had a chance to add some debugging :( I'll keep an eye on the new environment and see if the same thing happens again.

I'll reopen if it does. Thanks for the help!

diarmuidie closed this May 24, 2018

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
