
sidecar: Enabled uploading all compaction levels if compaction is disabled. #207

Closed
bwplotka wants to merge 4 commits into master from upload-all

Conversation

@bwplotka
Member

Fixes #206

@V3ckt0r

Signed-off-by: Bartek Plotka bwplotka@gmail.com

@bwplotka bwplotka requested a review from fabxc February 12, 2018 13:24
@V3ckt0r
Contributor

V3ckt0r commented Feb 12, 2018

Hey @bwplotka
This PR should fix the failure you see in the CI build: #208

I think it's caused by the rebase here: prometheus/prometheus#3671

Cheers

@bwplotka
Member Author

@V3ckt0r thanks, rebased. Make sure you wipe the NOT-uploaded blocks out of the uploaded list in thanos.shipper.json, and this PR should help.
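For context, thanos.shipper.json is the shipper's record of which block ULIDs it considers uploaded. A minimal sketch of pruning stale entries so those blocks get re-shipped — the `version`/`uploaded` field names here are assumed from the discussion, not verified against the exact file format:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ShipperMeta mirrors the assumed shape of thanos.shipper.json:
// a version number plus the list of block ULIDs marked as uploaded.
type ShipperMeta struct {
	Version  int      `json:"version"`
	Uploaded []string `json:"uploaded"`
}

// pruneUploaded drops the given block IDs from the uploaded list,
// so the shipper will consider those blocks for upload again.
func pruneUploaded(m ShipperMeta, stale map[string]bool) ShipperMeta {
	var kept []string
	for _, id := range m.Uploaded {
		if !stale[id] {
			kept = append(kept, id)
		}
	}
	m.Uploaded = kept
	return m
}

func main() {
	raw := `{"version":1,"uploaded":["01C4Z28176WR17K7PH37K7FG9V","01C5101JDEX1TK8CMSC4NQK8KP"]}`
	var meta ShipperMeta
	if err := json.Unmarshal([]byte(raw), &meta); err != nil {
		panic(err)
	}
	// Pretend the second block never actually made it to the bucket.
	meta = pruneUploaded(meta, map[string]bool{"01C5101JDEX1TK8CMSC4NQK8KP": true})
	out, _ := json.Marshal(meta)
	fmt.Println(string(out)) // {"version":1,"uploaded":["01C4Z28176WR17K7PH37K7FG9V"]}
}
```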

@V3ckt0r
Contributor

V3ckt0r commented Feb 12, 2018

@bwplotka yup, that did the trick, thanks. After nuking my thanos.shipper.json and bouncing Thanos, I see the blocks get shipped.

level=info ts=2018-02-12T19:50:45.176076702Z caller=sidecar.go:293 msg="starting sidecar" peer=01C65S3MND2CBXRCRKAD58EEB1
level=info ts=2018-02-12T19:50:45.336902012Z caller=shipper.go:182 msg="upload new block" id=01C4Z28176WR17K7PH37K7FG9V
level=info ts=2018-02-12T19:50:54.672920042Z caller=shipper.go:182 msg="upload new block" id=01C5101JDEX1TK8CMSC4NQK8KP
level=info ts=2018-02-12T19:51:04.460480531Z caller=shipper.go:182 msg="upload new block" id=01C52XV3T2ETVSG93K73HVP6D1
level=info ts=2018-02-12T19:51:14.362366858Z caller=shipper.go:182 msg="upload new block" id=01C54VMN5R94ZM7N7F08J20DDA
level=info ts=2018-02-12T19:51:23.197889608Z caller=shipper.go:182 msg="upload new block" id=01C56SE59DGWBV587GG9M2W99W
level=info ts=2018-02-12T19:51:34.769121133Z caller=shipper.go:182 msg="upload new block" id=01C58Q7Q5G4R0DFY5HDGD3XC9Y
level=info ts=2018-02-12T19:51:41.762445014Z caller=shipper.go:182 msg="upload new block" id=01C5AN1885H7B9J1W11DXB137A
level=info ts=2018-02-12T19:51:51.624289098Z caller=shipper.go:182 msg="upload new block" id=01C5CJTST135PPXDYDWTKEPYTD
level=info ts=2018-02-12T19:52:00.28093075Z caller=shipper.go:182 msg="upload new block" id=01C5EGMASFVT63NC0QZTJSABFJ
level=info ts=2018-02-12T19:52:10.69794424Z caller=shipper.go:182 msg="upload new block" id=01C5GEDW21PTYQ8WNKYRCAVQJX
level=info ts=2018-02-12T19:52:18.74876331Z caller=shipper.go:182 msg="upload new block" id=01C5JC7DGMKPJVQD2DZH0JT7QJ
level=info ts=2018-02-12T19:52:26.891070271Z caller=shipper.go:182 msg="upload new block" id=01C5MA0ZC22NSD0H7S8G207JFF
level=info ts=2018-02-12T19:52:34.306757856Z caller=shipper.go:182 msg="upload new block" id=01C5P7TGEMP97FSN779S5E5AYH
level=info ts=2018-02-12T19:52:43.202266458Z caller=shipper.go:182 msg="upload new block" id=01C5R5M1ANW6RYY81S91VC0F75
level=info ts=2018-02-12T19:52:51.89408431Z caller=shipper.go:182 msg="upload new block" id=01C5T3DJX67W411JZ9FP745B2Q
level=info ts=2018-02-12T19:52:59.533732273Z caller=shipper.go:182 msg="upload new block" id=01C5W173WNJA1F01TVJJPT5B93
level=info ts=2018-02-12T19:53:15.615138786Z caller=shipper.go:182 msg="upload new block" id=01C5XZ0MQ5BSY2W5K9KKAGF5N2
level=info ts=2018-02-12T19:53:23.76587919Z caller=shipper.go:182 msg="upload new block" id=01C5ZWT693CHSK8KEKBW04SSDX

As you can see, it took a few minutes to upload ~7 GB worth of data.

@bwplotka
Member Author

bwplotka commented Feb 12, 2018

@fabxc mind having a quick look?

@fabxc
Collaborator

fabxc commented Feb 13, 2018

This generally seems sufficient. However, while this works fine for backfilling already-compacted data, it is still not safe to have Prometheus keep compacting data.

Namely, the case where Prometheus creates a new block and immediately afterwards compacts it together with a few other blocks. We do create hard-link copies of a block before uploading it, but even during that hard-linking there may be a race where files suddenly disappear from underneath us.

Now, with the compactor running, I think that the combination of the sidecars and bucket garbage collection will still converge back to a correct state. But this absolutely needs more reasoning, explicit handling in the sidecar, and testing.

For the sidecar's handling there are various options. I'm not 100% sure which one would be the simplest and safest. I'd like to avoid having the sidecar call TSDB hooks (like enable/disable compaction) in Prometheus, or even do compactions itself.

@bwplotka
Member Author

True, all sorts of races can happen.

@bwplotka
Member Author

Will take a look at how to prevent these race conditions later this week on this PR.

@bwplotka bwplotka changed the title from "sidecar: Enabled uploading all compaction levels." to "[WIP] sidecar: Enabled uploading all compaction levels." on Feb 14, 2018
@V3ckt0r
Contributor

V3ckt0r commented Feb 14, 2018

Hey guys, if I'm understanding the problem space correctly, we are talking about race conditions for the most recent data?

For instance, given:

[data]# ls -ltr
drwxr-xr-x 3 root    root       4096 Feb  6 01:00 01C5MA0ZC22NSD0H7S8G207JFF
-rw------- 1 root    root          6 Feb  6 12:29 lock
drwxr-xr-x 3 root    root       4096 Feb  6 19:00 01C5P7TGEMP97FSN779S5E5AYH
drwxr-xr-x 3 root    root       4096 Feb  7 13:00 01C5R5M1ANW6RYY81S91VC0F75
drwxr-xr-x 3 root    root       4096 Feb  8 07:00 01C5T3DJX67W411JZ9FP745B2Q
drwxr-xr-x 3 root    root       4096 Feb  9 01:00 01C5W173WNJA1F01TVJJPT5B93
drwxr-xr-x 3 root    root       4096 Feb  9 19:00 01C5XZ0MQ5BSY2W5K9KKAGF5N2
drwxr-xr-x 3 root    root       4096 Feb 10 13:00 01C5ZWT693CHSK8KEKBW04SSDX
drwxr-xr-x 3 root    root       4096 Feb 11 07:00 01C61TKQF3YXY1XT01JN2X0A3W
drwxr-xr-x 3 root    root       4096 Feb 12 01:00 01C63RD8T9Y2Z2C7G98YX9RBXV
drwxr-xr-x 3 root    root       4096 Feb 12 07:00 01C64D0D1Y2B9P3QHHH9XCF8NV
drwxr-xr-x 3 root    root       4096 Feb 12 09:00 01C64KW317054P6TNH8DCNGRP9
drwxr-xr-x 3 root    root       4096 Feb 12 11:00 01C64TQT974JFMQV24CH9060XW
drwxrwxrwx 2 root    root       4096 Feb 12 11:56 wal

We're talking about the tail end, such as 01C64TQT974JFMQV24CH9060XW and 01C64KW317054P6TNH8DCNGRP9, right?

Shipping off the compaction level 2 blocks straight away should be fine. For the more recent blocks that have compaction level 1, we could introduce some handling that looks at the meta.json minTime/maxTime to determine whether it's safe to ship, as level 1 compactions happen roughly every 2 hours, right?

@fabxc
Collaborator

fabxc commented Feb 14, 2018

The problem is not knowing what to ship (e.g. by looking at min/max times) but rather that any read or write operation we have to make against the storage dir for that is subject to races with Prometheus's write operations.

@V3ckt0r
Contributor

V3ckt0r commented Feb 14, 2018

oh, right. I see. hmmm 🤔

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
@bwplotka
Member Author

bwplotka commented Feb 19, 2018

As discussed offline we need:

  1. A proper way to deduce from Prometheus whether compaction is enabled or disabled. (I am currently investigating whether we can add an upstream flags API endpoint for that; currently only HTML parsing is available.)
  2. A check in the sidecar for whether Prometheus local compaction is enabled. If compaction is enabled and upload is enabled, we should fail: currently there is no safe and simple way of syncing sidecar upload with local compaction.
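Prometheus 2.2 did add a flags endpoint, /api/v1/status/flags. A sketch of how the sidecar check in point 2 could work, assuming that endpoint's JSON shape and the heuristic that min-block-duration equal to max-block-duration means TSDB never compacts blocks together:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flagsResponse mirrors the assumed shape of the
// /api/v1/status/flags response: a status plus a flag-name -> value map.
type flagsResponse struct {
	Status string            `json:"status"`
	Data   map[string]string `json:"data"`
}

// compactionDisabled reports whether local compaction appears disabled.
// With min-block-duration == max-block-duration, TSDB has no room to
// compact smaller blocks into larger ones, so compaction is a no-op.
func compactionDisabled(body []byte) (bool, error) {
	var fr flagsResponse
	if err := json.Unmarshal(body, &fr); err != nil {
		return false, err
	}
	min := fr.Data["storage.tsdb.min-block-duration"]
	max := fr.Data["storage.tsdb.max-block-duration"]
	return min != "" && min == max, nil
}

func main() {
	// In a real sidecar this body would come from GET <prometheus>/api/v1/status/flags.
	body := []byte(`{"status":"success","data":{"storage.tsdb.min-block-duration":"2h","storage.tsdb.max-block-duration":"2h"}}`)
	ok, err := compactionDisabled(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("compaction disabled:", ok) // compaction disabled: true
}
```

If the check reports compaction as enabled while uploads are requested, the sidecar would refuse to start rather than risk racing the compactor.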

@bwplotka
Member Author

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
@bwplotka bwplotka changed the title from "[WIP] sidecar: Enabled uploading all compaction levels." to "sidecar: Enabled uploading all compaction levels." on Feb 21, 2018
@bwplotka
Member Author

PTAL

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
@bwplotka bwplotka changed the title from "sidecar: Enabled uploading all compaction levels." to "sidecar: Enabled uploading all compaction levels if compaction is disabled." on Feb 22, 2018
@fabxc
Collaborator

fabxc commented Mar 9, 2018

Should we merge this (quite a bit of code) or rather attempt to integrate with the additions we made in Prometheus 2.2?

@bwplotka
Member Author

bwplotka commented Mar 20, 2018

Let's maybe close this, or iterate on it to check that a Prometheus version with delayed compaction is present; if it is, we can run the upload, otherwise just block all uploads.

@bwplotka
Member Author

Explained new approach in #206

@bwplotka bwplotka closed this Apr 6, 2018