Possible data corruption #3888

Closed
tonobo opened this Issue Feb 26, 2018 · 2 comments

tonobo commented Feb 26, 2018

Description

I haven't done anything special, nor has Prometheus reported any exception, but currently it won't serve all metrics; only a few queries work. E.g. the Prometheus Benchmark dashboard won't load.
It seems a single compacted block is broken, because I can load data from now-4h, but going back 5h is too much.
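
For reference, something like the following sketch can narrow down the affected window by walking backwards over the query_range HTTP API hour by hour; the base URL and the `up` query are placeholders, not anything specific to this setup.

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

// probe issues a range query for a single one-hour window ending at `end`
// and reports whether the server answered it successfully.
func probe(base string, end time.Time) error {
	q := url.Values{}
	q.Set("query", "up")
	q.Set("start", strconv.FormatInt(end.Add(-time.Hour).Unix(), 10))
	q.Set("end", strconv.FormatInt(end.Unix(), 10))
	q.Set("step", "60")

	resp, err := http.Get(base + "/api/v1/query_range?" + q.Encode())
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("status %d: %s", resp.StatusCode, body)
	}
	return nil
}

func main() {
	now := time.Now()
	// Walk backwards hour by hour; the first failing window points at the
	// time range covered by the broken block.
	for h := 1; h <= 72; h++ {
		end := now.Add(-time.Duration(h-1) * time.Hour)
		if err := probe("http://localhost:9090", end); err != nil {
			fmt.Printf("window ending %s fails: %v\n", end.Format(time.RFC3339), err)
			return
		}
	}
	fmt.Println("all windows queried successfully")
}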

Environment

  • System information:
level=info ts=2018-02-26T10:02:26.676285659Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
level=info ts=2018-02-26T10:02:26.676339031Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
level=info ts=2018-02-26T10:02:26.676360824Z caller=main.go:227 host_details="(Linux 4.13.0-26-generic #29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 UTC 2018 x86_64 node1 (none))"
level=info ts=2018-02-26T10:02:26.676375007Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
  • Prometheus configuration file:
alerting:
    alertmanagers:
    -   static_configs:
        -   targets:
            - localhost:9095
global:
    evaluation_interval: 15s
    scrape_interval: 15s
rule_files:
- /etc/prometheus/rules.d/*.yml
scrape_configs:
-   job_name: storage
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
    -   targets:
        -  node123:132
-   job_name: elastic_metrics
    metrics_path: /elastic_metrics
    scrape_interval: 10s
    scrape_timeout: 10s
    static_configs:
    -   targets:
        - node123:9009
-   job_name: node_metrics
    scrape_interval: 5s
    scrape_timeout: 5s
    static_configs:
    -   targets:
        - localhost:9100
-   job_name: prometheus
    scrape_interval: 10s
    scrape_timeout: 10s
    static_configs:
    -   targets:
        - localhost:9090
-   file_sd_configs:
    -   files:
        - /etc/prometheus/targets.d/nodes/node_exporter_*.json
    job_name: node_node
    scrape_interval: 30s
    scrape_timeout: 30s
-   file_sd_configs:
    -   files:
        - /etc/prometheus/targets.d/nodes/ceph_exporter_*.json
    job_name: node_ceph
    scrape_interval: 60s
    scrape_timeout: 60s
-   file_sd_configs:
    -   files:
        - /etc/prometheus/targets.d/nodes/bird_exporter_*.json
    job_name: node_bird
    scrape_interval: 20s
    scrape_timeout: 20s
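
For context, the file_sd_configs jobs above read target files in the standard file-based service discovery format. A hypothetical /etc/prometheus/targets.d/nodes/node_exporter_example.json (hostnames made up) would look roughly like this:

[
  {
    "targets": ["node201:9100", "node202:9100"],
    "labels": {
      "role": "compute"
    }
  }
]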
  • Command line:
/bin/prometheus --storage.tsdb.retention 60d --config.file /etc/prometheus/prometheus.yml --web.listen-address 0.0.0.0:9090 --query.max-concurrency 100 --query.lookback-delta 3m --web.enable-lifecycle --web.enable-admin-api --web.external-url https://prom.example.org
  • Filesystem info:
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/vg0-prometheus  1.5T  456G  946G  33% /var/lib/docker/volumes/prom/_data

Filesystem                 Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg0-prometheus    94M   996   94M    1% /var/lib/docker/volumes/prom/_data
  • Logs:
level=error ts=2018-02-26T10:14:39.580288422Z caller=engine.go:527 component="query engine" msg="error selecting series set" err="block: 01C78F1V4NMRS4X4YKQF1WZ9YD: get postings entry: invalid checksum"
level=error ts=2018-02-26T10:14:39.692596291Z caller=engine.go:527 component="query engine" msg="error selecting series set" err="block: 01C78F1V4NMRS4X4YKQF1WZ9YD: get postings entry: invalid checksum"
level=error ts=2018-02-26T10:14:39.692596314Z caller=engine.go:527 component="query engine" msg="error selecting series set" err="block: 01C78F1V4NMRS4X4YKQF1WZ9YD: get postings entry: invalid checksum"
level=error ts=2018-02-26T10:14:39.800154998Z caller=engine.go:527 component="query engine" msg="error selecting series set" err="block: 01C78F1V4NMRS4X4YKQF1WZ9YD: get postings entry: invalid checksum"
level=error ts=2018-02-26T10:14:39.842452025Z caller=engine.go:527 component="query engine" msg="error selecting series set" err="block: 01C78F1V4NMRS4X4YKQF1WZ9YD: get postings entry: invalid checksum"
level=error ts=2018-02-26T10:14:40.128996524Z caller=engine.go:527 component="query engine" msg="error selecting series set" err="block: 01C78F1V4NMRS4X4YKQF1WZ9YD: get postings entry: invalid checksum"
level=error ts=2018-02-26T10:14:40.41587636Z caller=engine.go:527 component="query engine" msg="error selecting series set" err="block: 01C78F1V4NMRS4X4YKQF1WZ9YD: get postings entry: invalid checksum"
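
The "invalid checksum" comes from the per-entry integrity check in the block's index: each entry carries a trailing CRC32 (Castagnoli polynomial), and reading it fails when the stored sum no longer matches the payload. Below is a minimal sketch of that kind of check, not the actual TSDB reader; verifyEntry and the byte layout are assumptions for illustration.

package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32 polynomial used for the checksums in this sketch.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// verifyEntry is a hypothetical helper: it takes a buffer whose last 4 bytes
// are a big-endian CRC32 of the preceding payload and returns the payload.
func verifyEntry(buf []byte) ([]byte, error) {
	if len(buf) < 4 {
		return nil, errors.New("entry too short")
	}
	payload, sum := buf[:len(buf)-4], binary.BigEndian.Uint32(buf[len(buf)-4:])
	if crc32.Checksum(payload, castagnoli) != sum {
		return nil, errors.New("invalid checksum")
	}
	return payload, nil
}

func main() {
	payload := []byte("postings entry payload")
	entry := make([]byte, len(payload)+4)
	copy(entry, payload)
	binary.BigEndian.PutUint32(entry[len(payload):], crc32.Checksum(payload, castagnoli))

	// Flip one byte to simulate on-disk corruption; verification then fails.
	entry[3] ^= 0xFF
	if _, err := verifyEntry(entry); err != nil {
		fmt.Println("err:", err) // err: invalid checksum
	}
}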
  • Corrupted block metadata:
~ # cat /var/lib/docker/volumes/prom/_data/data/01C78F1V4NMRS4X4YKQF1WZ9YD/meta.json 
{
	"ulid": "01C78F1V4NMRS4X4YKQF1WZ9YD",
	"minTime": 1519430400000,
	"maxTime": 1519624800000,
	"stats": {
		"numSamples": 56615840911,
		"numSeries": 8646790,
		"numChunks": 469182443
	},
	"compaction": {
		"level": 4,
		"sources": [
			"01C72W1GW4EEHEMAG13VCXEQ86",
			"01C732X846H9AT17VSHQMETXHC",
			"01C739RZC4PW597GWBVJJ81XM4",
			"01C73GMPM41ARMXVHFY6D05MQQ",
			"01C73QGDW4HZNMMC0F7TY6Z3J6",
			"01C73YC545ZCB51NDQXQ4EBQQ6",
			"01C7457WC4NX55PPN5NVKJQCSB",
			"01C74C3KM5NX6N5FQHG93MX08R",
			"01C74JZAW5BB50BW4H9KKV6F35",
			"01C74SV245ADYYCPH16ZP34KZ8",
			"01C750PSC4YSKWGWDPNXRMKM6H",
			"01C757JGM4AJ0BYQEMBP18K1RS",
			"01C75EE7W41JCB5NXVZE6ERBCB",
			"01C75N9Z4MT8J90CV39RE7RAKT",
			"01C75W5PCMA8SRWJGPQ0R1KQC4",
			"01C7631DMKP57NHJEGVYAEBM0D",
			"01C769X4WM8D9MHWS6GR0N6HVS",
			"01C76GRW4S9VN5T2WQ6XRQ1W7K",
			"01C76QMKCNMGE7QP4DNE5PR7V6",
			"01C76YGAMMEEVFX7R5FVJZPRH0",
			"01C775C1WMWCZJE1AF97GP7070",
			"01C77C7S4MBAXP0P6RH1J2BRWK",
			"01C77K3GCK7RDWT0FV4E00AQTK",
			"01C77SZ7MKJWNC0GAGPE7WK9WQ",
			"01C780TYWM51ZDMZ2HSRSCD0QZ",
			"01C787PP4PT8Z2QRJH077YGA18",
			"01C78EJDCMDCKHQQP2MQHN1AYH"
		]
	},
	"version": 2
}
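
For what it's worth, the meta.json above says the block spans roughly 54 hours at compaction level 4, merged from 27 source blocks. A small sketch for inspecting a block's meta.json like that (field names taken from the file above, nothing else assumed):

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// blockMeta models only the meta.json fields used here.
type blockMeta struct {
	ULID       string `json:"ulid"`
	MinTime    int64  `json:"minTime"` // milliseconds since the Unix epoch
	MaxTime    int64  `json:"maxTime"`
	Compaction struct {
		Level   int      `json:"level"`
		Sources []string `json:"sources"`
	} `json:"compaction"`
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: blockinfo <path to meta.json>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var m blockMeta
	if err := json.NewDecoder(f).Decode(&m); err != nil {
		panic(err)
	}

	min := time.Unix(0, m.MinTime*int64(time.Millisecond)).UTC()
	max := time.Unix(0, m.MaxTime*int64(time.Millisecond)).UTC()
	fmt.Printf("block %s: %s – %s (%s), compaction level %d, %d source blocks\n",
		m.ULID, min, max, max.Sub(min), m.Compaction.Level, len(m.Compaction.Sources))
}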
tonobo commented Feb 27, 2018

Upgrading to 2.2.0-rc.1 triggered a repair. It works great again!

tonobo closed this Feb 27, 2018

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
