
clean_tombstones not releasing the extra disk space used after it fails #4200

Closed
albertvaka opened this Issue May 28, 2018 · 13 comments


albertvaka commented May 28, 2018

Bug Report

Apparently clean_tombstones allocates additional space during the cleanup, and I guess it doesn't actually reduce the total space used until it finishes. The problem comes when the cleaning fails.

I made a first call to /api/v2/admin/tsdb/clean_tombstones because the disk was filling up, and since the cleanup itself needs some extra space, it failed with the following error:

{"error":"clean tombstones: clean tombstones: /prometheus/data/01CDS7P8RMXN95JVQ9VD49K5YG: write compaction: write chunks: no space left on device","code":13}

The extra space used, however, was not freed afterwards. My disk was now 100% full, so I resized it to 2x the size and ran clean_tombstones again. This time it failed with a different (and more worrying) error:

{"error":"clean tombstones: reload blocks: invalid block sequence: block time ranges overlap (1525068000000, 1525262400000)","code":13}

I don't know how to recover from this one, and I'm now also at about 3x the disk usage I had when I started, because clean_tombstones doesn't release the extra space it used when it fails.

What can I do?


brian-brazil commented May 28, 2018

@krasi-georgiev


krasi-georgiev commented May 28, 2018

@albertvaka I will try to replicate and find the culprit.


krasi-georgiev commented May 29, 2018

@albertvaka when deleting series, the tsdb backend first reads a given block into memory, removes the series, and writes the block to disk again. At that point it needs some extra space to write the new block before deleting the old one.
I am still trying to replicate, but what I imagine happened is that when it failed, it left a mix of old and new blocks with overlapping ranges, causing the block time ranges overlap error.
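To make the space accounting concrete, here is a simplified model of that flow (hypothetical types, not the real tsdb code): each block with tombstones is rewritten next to the original, so the run needs headroom roughly equal to the blocks being rewritten, and a failure mid-run leaves both copies on disk.

```go
// Simplified model of clean_tombstones' disk usage; Block is a hypothetical
// stand-in for a tsdb block directory, not the real API.
package main

import "fmt"

type Block struct {
	Dir           string
	SizeBytes     int64
	HasTombstones bool
}

// cleanTombstones rewrites every block that has tombstones. The rewritten
// copy is created while the old block still exists; the old blocks are only
// removed at the end, during reload. If a write fails, the copies already
// written stay on disk unless they are explicitly deleted.
func cleanTombstones(blocks []Block, freeBytes int64) error {
	for _, b := range blocks {
		if !b.HasTombstones {
			continue
		}
		if b.SizeBytes > freeBytes {
			return fmt.Errorf("write compaction: write chunks: no space left on device")
		}
		freeBytes -= b.SizeBytes // new copy now coexists with the old block
	}
	return nil
}

func main() {
	blocks := []Block{{Dir: "01CDS7P8RMXN95JVQ9VD49K5YG", SizeBytes: 4 << 30, HasTombstones: true}}
	fmt.Println(cleanTombstones(blocks, 2<<30)) // not enough headroom: fails like the report above
}
```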

As a workaround, you can use the tsdb scan tool I implemented recently. It scans for and removes the overlapping blocks, keeping only the biggest block, which should include all the time series that existed before you triggered the deletion.

I have attached the tool so you can try it. It's still WIP, so use at your own risk 💣

tsdb scan /path/to/data
tsdb.zip


albertvaka commented May 29, 2018

Thanks! However, I think we should make sure that everything is cleaned up after an error so this doesn't happen again (it seems like it could be a pretty common case).


krasi-georgiev commented May 29, 2018

Yes, I am looking into this right now.


codesome commented May 29, 2018

From what I understand, during CleanTombstones all the existing blocks are replicated with their tombstones removed, and at the end the old blocks are deleted during a reload.

The deletion of the old blocks happens here: https://github.com/prometheus/tsdb/blob/master/db.go#L865, but the storage error is raised here: https://github.com/prometheus/tsdb/blob/master/db.go#L852, and we return without cleaning up the newly formed blocks. Hence this issue.

Cleaning up the new blocks before this line, https://github.com/prometheus/tsdb/blob/master/db.go#L853, should fix it.

(Someone has to verify, I may be wrong)
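A hedged sketch of that suggestion (hypothetical helper name, not a verbatim db.go patch): remember the directories of the freshly written blocks and remove them before returning the error, so a failed run releases the extra space.

```go
// Sketch of the proposed cleanup; writeCleanBlock is a hypothetical stand-in
// for the compactor writing a tombstone-free copy of a block.
package main

import (
	"fmt"
	"os"
)

func writeCleanBlock(blockDir string) (string, error) {
	newDir := blockDir + ".clean" // placeholder naming; real blocks use ULIDs
	return newDir, os.Mkdir(newDir, 0o755)
}

func cleanTombstones(blockDirs []string) error {
	var newDirs []string
	for _, dir := range blockDirs {
		newDir, err := writeCleanBlock(dir)
		if err != nil {
			// Proposed fix: delete every new block written so far, so the
			// failed run neither wastes space nor leaves overlapping blocks.
			for _, d := range newDirs {
				os.RemoveAll(d)
			}
			return fmt.Errorf("clean tombstones: %w", err)
		}
		newDirs = append(newDirs, newDir)
	}
	return nil
}

func main() {
	fmt.Println(cleanTombstones([]string{"01CDS7P8RMXN95JVQ9VD49K5YG"}))
}
```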


krasi-georgiev commented May 30, 2018

Yep, I think you are right. I am just checking the code in a few different places to make sure the fix doesn't have any side effects.

Just checked again, and I think this needs to be handled in the Compactor, somewhere in the populateBlock func:
https://github.com/prometheus/tsdb/blob/master/compact.go#L509
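For illustration, the usual pattern there (a sketch of the idea, not the actual compact.go code) is to populate the new block in a temporary directory and remove it on any failure, so a broken compaction never leaves a partial block behind:

```go
// Write-then-rename sketch: the block only appears under its final name if
// population fully succeeded; on error the temporary directory is removed.
package main

import (
	"fmt"
	"os"
)

func writeBlock(destDir string, populate func(dir string) error) (err error) {
	tmp := destDir + ".tmp"
	defer func() {
		if err != nil {
			os.RemoveAll(tmp) // clean up after ourselves on any failure
		}
	}()
	if err = os.MkdirAll(tmp, 0o777); err != nil {
		return err
	}
	if err = populate(tmp); err != nil {
		return err
	}
	return os.Rename(tmp, destDir)
}

func main() {
	err := writeBlock("01CDS7P8RMXN95JVQ9VD49K5YG", func(dir string) error {
		return fmt.Errorf("write chunks: no space left on device")
	})
	fmt.Println(err) // the .tmp directory has been removed
}
```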


krasi-georgiev commented May 30, 2018

The code suggests that on error the compaction should clean up after itself, so this partial write shouldn't really happen.

@albertvaka any chance you could send me all the folders whose meta.json includes timestamps between 1525068000000 and 1525262400000, so I can check them?
kgeorgie at redhat.com


albertvaka commented May 30, 2018

We ended up deleting the data altogether, so I can't try your scan tool or provide the corrupt files, sorry :(


codesome commented May 30, 2018

@krasi-georgiev
I don't think the problem is related to that.

Consider a database that currently has block directories B1, B2, B3. During CleanTombstones, B1 and B2 succeed; say B4 and B5 are the new directories created from them. Compacting B3 then fails and, as you said, the new block related to B3 is cleaned up, but B4 and B5 still exist and are not removed (if they are being removed, then I am wrong).

So the overlap might be B1 with B4 and B2 with B5, which would also explain the extra space consumed (see the sketch below).
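A small sketch of that failure mode (hypothetical block names and ranges, not real tool output): after the partial run, B4 and B5 duplicate the time ranges of B1 and B2, which is exactly the condition the "invalid block sequence: block time ranges overlap" check rejects on reload.

```go
// Detect overlapping block time ranges; the blocks below model the B1..B5
// scenario described above, with B4/B5 left over from the failed run.
package main

import (
	"fmt"
	"sort"
)

type meta struct {
	dir        string
	mint, maxt int64 // half-open range [mint, maxt)
}

func overlaps(blocks []meta) []string {
	sort.Slice(blocks, func(i, j int) bool { return blocks[i].mint < blocks[j].mint })
	var out []string
	for i := 1; i < len(blocks); i++ {
		if blocks[i].mint < blocks[i-1].maxt {
			out = append(out, fmt.Sprintf("%s overlaps %s (%d, %d)",
				blocks[i].dir, blocks[i-1].dir, blocks[i].mint, blocks[i-1].maxt))
		}
	}
	return out
}

func main() {
	blocks := []meta{
		{"B1", 1525068000000, 1525154400000}, // old block, rewrite succeeded
		{"B2", 1525154400000, 1525262400000}, // old block, rewrite succeeded
		{"B3", 1525262400000, 1525348800000}, // old block, rewrite failed
		{"B4", 1525068000000, 1525154400000}, // leftover copy of B1
		{"B5", 1525154400000, 1525262400000}, // leftover copy of B2
	}
	for _, o := range overlaps(blocks) {
		fmt.Println(o)
	}
}
```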

@gouthamve any comments?

PS: maybe we can take this discussion to prometheus/tsdb?


krasi-georgiev commented May 30, 2018

@albertvaka no worries. After a short chat on IRC with @codesome, he found the issue, and a PR will follow soon.


krasi-georgiev commented Jun 6, 2018

Happy to say that the fix is now merged and will be included in the next release.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
