After unsuccessful Tombstones deletion Prometheus won't start anymore #3782

Closed
jreineke-nli opened this Issue Feb 1, 2018 · 5 comments

jreineke-nli commented Feb 1, 2018

What did you do?
I ran the tombstone deletion job and my disk filled up. The job exited with a "file system is full" error message. After I resized my drive, Prometheus reports the following:

Opening storage failed invalid block sequence: block time ranges overlap (1510142400000, 1510488000000)

Prometheus terminates after this message.
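
For readers trying to interpret the error, here is a minimal sketch that pulls the two numbers out of the message and prints them as dates. It assumes they are Unix timestamps in milliseconds, which is how block `minTime`/`maxTime` values are stored in `meta.json`. The helper names `overlap_range` and `ms_to_utc` are made up for this example.

```python
import re
from datetime import datetime, timezone

# Error text copied from the report above.
ERR = ("Opening storage failed invalid block sequence: "
       "block time ranges overlap (1510142400000, 1510488000000)")

def overlap_range(message):
    """Return the (start, end) values from the error text, or None if absent."""
    m = re.search(r"overlap \((\d+), (\d+)\)", message)
    return (int(m.group(1)), int(m.group(2))) if m else None

def ms_to_utc(ms):
    """Convert Unix milliseconds (assumed unit) to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()

start, end = overlap_range(ERR)
print(ms_to_utc(start))  # 2017-11-08T12:00:00+00:00
print(ms_to_utc(end))    # 2017-11-12T12:00:00+00:00
```

Under that assumption, the overlapping range covers four days of data from November 2017.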

Environment

  • System information:

    Linux 4.4.0-112-generic x86_64

  • Prometheus version:

    prometheus, version 2.1.0

  • Prometheus configuration file:
    If needed, I can post my configuration here.

  • Logs:

level=info ts=2018-02-01T15:38:52.593290616Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
level=info ts=2018-02-01T15:38:52.593365055Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
level=info ts=2018-02-01T15:38:52.593403783Z caller=main.go:227 host_details="(Linux 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 vmu0301monp01 (none))"
level=info ts=2018-02-01T15:38:52.593437989Z caller=main.go:228 fd_limits="(soft=1024, hard=1048576)"
level=info ts=2018-02-01T15:38:52.597344742Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-02-01T15:38:52.597303187Z caller=main.go:499 msg="Starting TSDB ..."
level=info ts=2018-02-01T15:39:07.349759231Z caller=main.go:386 msg="Stopping scrape discovery manager..."
level=info ts=2018-02-01T15:39:07.349826375Z caller=main.go:400 msg="Stopping notify discovery manager..."
level=info ts=2018-02-01T15:39:07.349838904Z caller=main.go:424 msg="Stopping scrape manager..."
level=info ts=2018-02-01T15:39:07.349862826Z caller=manager.go:460 component="rule manager" msg="Stopping rule manager..."
level=info ts=2018-02-01T15:39:07.349855045Z caller=manager.go:59 component="scrape manager" msg="Starting scrape manager..."
level=info ts=2018-02-01T15:39:07.349905278Z caller=main.go:418 msg="Scrape manager stopped"
level=info ts=2018-02-01T15:39:07.349884951Z caller=manager.go:466 component="rule manager" msg="Rule manager stopped"
level=info ts=2018-02-01T15:39:07.349958913Z caller=main.go:382 msg="Scrape discovery manager stopped"
level=info ts=2018-02-01T15:39:07.349962549Z caller=main.go:396 msg="Notify discovery manager stopped"
level=info ts=2018-02-01T15:39:07.350005377Z caller=notifier.go:493 component=notifier msg="Stopping notification manager..."
level=info ts=2018-02-01T15:39:07.350037357Z caller=main.go:570 msg="Notifier manager stopped"
level=error ts=2018-02-01T15:39:07.350079738Z caller=main.go:579 err="Opening storage failed invalid block sequence: block time ranges overlap (1510142400000, 1510488000000)"
level=info ts=2018-02-01T15:39:07.350128154Z caller=main.go:581 msg="See you next time!"

Can I somehow delete the duplicate ranges and start my Prometheus server with this data set again?

Author

jreineke-nli commented Feb 1, 2018

I also found two meta.json files inside my data directory with the same minTime and maxTime:

cat ./01C58GCYSY54NTQX2HV457AS6R/meta.json

{
        "ulid": "01C58GCYSY54NTQX2HV457AS6R",
        "minTime": 1510142400000,
        "maxTime": 1510488000000,
        "stats": {
                "numSamples": 584195713,
                "numSeries": 110959,
                "numChunks": 4960712
        },
        "compaction": {
                "level": 1,
                "sources": [
                        "01C58GCYSY54NTQX2HV457AS6R"
                ]
        },
        "version": 2
}

cat ./01C58VZK0719S434ZRX0RVB2AJ/meta.json

{
        "ulid": "01C58VZK0719S434ZRX0RVB2AJ",
        "minTime": 1510142400000,
        "maxTime": 1510488000000,
        "stats": {
                "numSamples": 583566123,
                "numSeries": 110840,
                "numChunks": 4955365
        },
        "compaction": {
                "level": 1,
                "sources": [
                        "01C58VZK0719S434ZRX0RVB2AJ"
                ]
        },
        "version": 2
}

Would it be safe to delete one of the folders? It seems like one of them is a duplicate.
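
Comparing `meta.json` files by hand gets tedious with many blocks. The sketch below scans a data directory and lists pairs of blocks whose time ranges overlap. It assumes the layout shown above (one `meta.json` per block folder) and treats each range as half-open, `[minTime, maxTime)`, so that adjacent blocks sharing a boundary are not flagged. `load_block_metas` and `find_overlaps` are hypothetical helper names, not part of Prometheus.

```python
import json
from pathlib import Path

def load_block_metas(data_dir):
    """Read every <data_dir>/<block>/meta.json and return the parsed dicts."""
    metas = []
    for meta_path in sorted(Path(data_dir).glob("*/meta.json")):
        with open(meta_path) as f:
            metas.append(json.load(f))
    return metas

def find_overlaps(metas):
    """Return (ulid, ulid) pairs whose [minTime, maxTime) ranges overlap."""
    ordered = sorted(metas, key=lambda m: (m["minTime"], m["maxTime"]))
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b["minTime"] >= a["maxTime"]:
                break  # sorted by minTime, so no later block can overlap a
            overlaps.append((a["ulid"], b["ulid"]))
    return overlaps
```

Calling `find_overlaps(load_block_metas("/path/to/data"))` on a directory like the one in this report should return the two ULIDs above as a pair. The script only reads files; deciding which block to remove is still a manual step, and Prometheus should be stopped first.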

Member

gouthamve commented Feb 1, 2018

Deleting 01C58VZK0719S434ZRX0RVB2AJ is safe, but the problem is deeper than that.

Essentially, calling CleanTombstones can rewrite all of your existing data into new blocks before deleting the older blocks. This means you might need 2x the disk space available. The fix is not trivial, as we need to close the block before we delete it and there is no good way to do that. I will take a deeper look tomorrow.
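
Until that is fixed, a rough pre-flight check can catch the out-of-space case before it happens. The sketch below is an illustration, not a Prometheus feature: it compares the free space on the volume against the current size of the data directory, on the assumption that a full rewrite needs roughly one extra copy of the data. `dir_size_bytes` and `has_headroom` are hypothetical names, and `factor` is a safety margin you can raise.

```python
import os
import shutil

def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

def has_headroom(data_dir, factor=1.0, free_bytes=None):
    """True if free space is at least factor * the current data size.

    free_bytes can be passed explicitly (useful for testing); otherwise
    it is read from the filesystem that holds data_dir.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(data_dir).free
    return free_bytes >= factor * dir_size_bytes(data_dir)
```

If `has_headroom("/path/to/data")` returns False, it is probably safer to grow the volume before triggering tombstone cleanup.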

https://github.com/prometheus/tsdb/blob/master/db.go#L704-L733

Author

jreineke-nli commented Feb 1, 2018

Thanks for your help. I've deleted that folder, and Prometheus now starts again. No errors were displayed in the log file.

While Prometheus was processing the CleanTombstones command, my data directory doubled in size. What will happen if I execute the command again? Can I safely execute it again?

I've saved the corrupt data. Is there anything from it that I can provide you?

Maybe the documentation should note that you need 2x space to clean tombstones? But I think that would be an issue for the documentation repository.

Member

krasi-georgiev commented Aug 24, 2018

Duplicate of #4200, and fixed in prometheus/tsdb#341.

New blocks will be deleted on failure to allow a clean start of Prometheus.
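
The general pattern described here, removing partially written output when a rewrite fails, can be sketched as follows. This is a generic Python illustration and not the code from prometheus/tsdb#341; `rewrite_blocks` and the `write_block` callback are hypothetical.

```python
import os
import shutil

def rewrite_blocks(data_dir, block_names, write_block):
    """Create each new block directory and fill it via write_block(path).

    If any step raises (for example because the disk is full), every
    directory created by this call is removed before the error propagates,
    so no half-written or duplicate blocks are left behind.
    """
    created = []
    try:
        for name in block_names:
            path = os.path.join(data_dir, name)
            os.mkdir(path)
            created.append(path)
            write_block(path)
    except Exception:
        for path in created:
            shutil.rmtree(path, ignore_errors=True)
        raise
    return created
```

With cleanup like this, a failed run leaves the directory as it was before the rewrite started, which is what allows the next start to succeed.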


lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
