This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

address data corruption in chunk encoding for long chunks with >=4.55 hours of nulls #1126

Merged
11 commits merged into master on Nov 2, 2018

Conversation

@Dieterbe (Contributor) commented Oct 31, 2018

gorilla / go-tsz only has 14 bits for the delta between the t0 of the chunk and the first point.
2^14 is 16384 seconds, so about 4.55 hours.
Thus:

  • when using long chunks (e.g. 6 hours) and there are >= 4.55 hours between the start of the chunk and the first point, the delta overflows and data corruption ensues. If the delta is less than 9.1 hours and the chunk has more than 1 datapoint, it is recoverable at read time, see below
  • If the delta is >= 4.55 hours and there is only 1 point, we cannot recover at read time
  • If the delta is >= 9.10 hours (this also requires chunks of more than 9 hours), the data is also not recoverable at read time

A more detailed explanation of the problem, based on an example testcase: "no data for 5 hours, then 1 hour of 60s dense data".

It starts with t0 = 1540728000, which is stored in full.

The first point should have a delta of 5h (18000s), but due to the 14-bit overflow
we instead store a delta of 1616 (18000 - 2^14).

When decoding the first point, the timestamp should have been:
1540728000 + 18000 = 1540746000
instead we get:
1540728000 + 1616 = 1540729616

From the 2nd point onwards, we use delta-of-delta (dod) encoding.
So the 2nd point has a dod of -17940 (because the delta should change from 18000 to 60s);
this is supported fine and we store this dod.

However, at decode time, what should have happened is:
1540746000 + 18000 -17940 = 1540746060
Instead, what happens is:
1540729616 + 1616 -17940 = 1540713292

All subsequent points have dod 0, so instead of the delta:
60 + 0 = 60s
they get:
(1616 - 17940) + 0 = -16324
and they keep going back in time.
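
To make the above easy to verify, here is a minimal standalone sketch (plain arithmetic only, not the go-tsz code itself) that reproduces the numbers from the example:

package main

import "fmt"

// minimal arithmetic sketch (not the go-tsz code) reproducing the example:
// t0 = 1540728000, first point 5h (18000s) later, then 60s-interval data.
func main() {
    t0 := int64(1540728000)

    // encoding: the first delta only gets 14 bits, so 18000 wraps to 18000 - 2^14
    firstDelta := int64(18000)
    storedDelta := firstDelta % (1 << 14) // 1616

    // decoding the first point
    correctT1 := t0 + firstDelta  // 1540746000
    corruptT1 := t0 + storedDelta // 1540729616

    // 2nd point: dod = 60 - 18000 = -17940 is stored fine, but at decode time
    // it gets applied to the already-wrong delta
    dod := int64(-17940)
    correctT2 := correctT1 + firstDelta + dod  // 1540746060
    corruptT2 := corruptT1 + storedDelta + dod // 1540713292

    // all later points have dod 0, so the corrupted delta stays negative
    // and the timestamps keep going back in time
    fmt.Println(correctT1, corruptT1)
    fmt.Println(correctT2, corruptT2)
    fmt.Println("corrupted per-point delta:", storedDelta+dod) // -16324
}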

@Dieterbe (Contributor, Author) commented Oct 31, 2018

As for how to address this, there are 2 separate concerns: remediation and a long term fix

remediation (being able to decode the corrupted chunk data we currently have)

This can use some more thinking, but I see 2 solutions:

  • points should never go back in time, so if delta+dod < 0 we can use 2^14+delta+dod instead, though this requires decoding up to the 2nd point just to get the correct value for the first point, which doesn't work well with our pointwise iterator api
  • we can give "hints" to our iterator. pretty sure all our >4h chunks are rollup chunks, which always have points at a timestamp that is divisible by an interval such as 30, 60, etc. When this is the case, we can tell the iterator about it, so when it decodes the first point and the timestamp is not clean, it can try adding 2^14 to the delta (see the sketch after this list). though we may have one or two deployments for customers with very large intervals (e.g. 1h) that have large chunks with raw data
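
A rough sketch of what that second option could look like (fixFirstDelta and its signature are made up for illustration; this is not what the PR implements):

package main

import "fmt"

// hypothetical "hint" approach: the caller tells the iterator the interval of
// the rollup chunk, and if the decoded first timestamp is not aligned to it we
// assume the 14-bit delta wrapped exactly once and undo the wrap.
func fixFirstDelta(t0, decodedDelta, interval uint32) uint32 {
    if interval > 0 && (t0+decodedDelta)%interval != 0 {
        return decodedDelta + 1<<14 // timestamp not on a clean boundary: unwrap
    }
    return decodedDelta
}

func main() {
    // example from the description: stored delta 1616, rollup interval 60s
    fmt.Println(fixFirstDelta(1540728000, 1616, 60)) // 18000
}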

long term fix

  • new chunk format that uses 15 bits? looking a bit deeper though, it seems strange to store the t0 and a delta at all. we may as well just store the timestamp of the first point in full I think.
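
For illustration only (hypothetical, not an actual chunk format definition), the difference between the two header encodings comes down to:

package main

import "fmt"

// storing the first point's timestamp in full instead of t0 plus a 14-bit
// delta means the header can never overflow, no matter how long the chunk is.
func main() {
    const t0 = uint32(1540728000)
    const firstTs = uint32(1540746000) // 5h after t0

    // current format: 14-bit delta, which silently wraps for deltas >= 2^14
    delta := (firstTs - t0) & (1<<14 - 1)
    fmt.Println("14-bit delta stored:", delta) // 1616, i.e. corrupted

    // proposed: just store firstTs itself in full (e.g. 32 bits)
    fmt.Println("full timestamp stored:", firstTs) // no overflow possible
}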

@Dieterbe (Contributor, Author) commented Nov 1, 2018

points should never go back in time, so if delta+dod < 0 we can use 2^14+delta+dod instead, though this requires decoding up to the 2nd point just to get the correct value for the first point, which doesn't work well with our pointwise iterator api

i've gone with this approach for now. when reading the first point, we clone the stream, read the upcoming dod, make adjustments as needed, and restore the stream.
this can't recover the point if there's only a single point in the chunk, and there's also the clone (allocation) hit, but otherwise seems like a decent remediation.
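
A self-contained sketch of that approach; the stream and iter types below are simplified stand-ins for go-tsz's bstream and iterator, not the actual vendored code:

package main

import "fmt"

// stream is a trivial stand-in for go-tsz's bstream: a cursor over already
// decoded values, cheap to clone so we can peek ahead and then restore.
type stream struct {
    vals []int64
    pos  int
}

func (s *stream) clone() *stream { c := *s; return &c }
func (s *stream) next() int64    { v := s.vals[s.pos]; s.pos++; return v }

type iter struct {
    t0, t, tDelta int64
    s             *stream
}

// readFirstPoint decodes the (possibly wrapped) 14-bit first delta, then peeks
// at the upcoming dod without consuming it: if delta+dod would move time
// backwards, the delta must have wrapped past 2^14, so undo the wrap.
func (it *iter) readFirstPoint() {
    it.tDelta = it.s.next() // 14-bit first delta, possibly wrapped
    if it.s.pos < len(it.s.vals) {
        saved := it.s.clone() // clone the stream so we can restore it afterwards
        if dod := it.s.next(); dod+it.tDelta < 0 {
            it.tDelta += 1 << 14
        }
        it.s = saved // restore, so the 2nd point re-reads its dod normally
    }
    it.t = it.t0 + it.tDelta
}

func main() {
    // example from the description: stored first delta 1616 (wrapped from
    // 18000), then dod -17940 for the 2nd point
    it := &iter{t0: 1540728000, s: &stream{vals: []int64{1616, -17940}}}
    it.readFirstPoint()
    fmt.Println(it.t) // 1540746000, the correct first timestamp
}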

@Dieterbe (Contributor, Author) commented Nov 1, 2018

using a quick and dirty benchmark with docker-dev-custom-cfg-kafka and

fakemetrics feed --kafka-mdm-addr localhost:9092 --mpo 10000
echo 'GET http://localhost:6060/render?target=some.id.of.a.metric.1*&from=-30min' | vegeta attack -rate 5 -duration 5000s  | vegeta report

the alloc_objects overhead of go-tsz.(*bstream).clone is about 0.8%, which is the thing i was most interested in. alloc_space is about 0.15% (and it doesn't show up for inuse, which is to be expected).

@woodsaj (Member) left a comment

I don't see any changes here to how chunks are handled, just changes to comments/docs, a unit test, and an updated dependency.

@Dieterbe (Contributor, Author) commented Nov 1, 2018

@woodsaj look at the last 2 commits, they modify tsz.go ; because it's vendored, GH doesn't show the changes by default. if we take this path, i'll move these into our go-tsz fork so that dep doesn't complain.

    return true
}
// the first delta plus the upcoming dod would make time go backwards, which
// can't happen: the 14-bit delta must have wrapped past 2^14, so undo the wrap
if dod+int32(tDelta) < 0 {
    it.tDelta += 16384
A reviewer (Contributor) commented:

Am I understanding it right that this will only work if the tDelta has been wrapped around once, but e.g. if we had 10h chunks then this fix might not work if the tDelta has been wrapped twice? I guess for us that's fine, it's just something we should remember.

@Dieterbe (Contributor, Author) replied:

correct.
note that the plan is to, after putting the remediation in place, deprecate this chunk format asap (at least for >4h chunks) and start using a chunk format that doesn't have this bug.

@replay (Contributor) commented Nov 2, 2018

I wonder if that copying of .br could be saved if .dod() had a small read buffer that can store 1 return value. If there's a value in the read buffer when .dod() is called, it returns the content of the read buffer and clears it; otherwise it goes to the bstream as it does now. After the first call site of .dod(), where we know that the next call to .dod() needs to return the same value one more time, we could then just put that value into the read buffer. That would save the copying of that bstream, but it makes everything a little more complicated, so it might not be worth it.
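
A hypothetical sketch of that read-buffer idea (all names here are made up, and none of this is part of the PR):

package main

import "fmt"

// dodReader sketches a dod() with a one-value push-back buffer, so the
// first-point code can consume a dod for validation and then "unread" it,
// avoiding the bstream clone. vals is a stand-in for decoding from the stream.
type dodReader struct {
    vals     []int32
    pos      int
    buffered bool
    buf      int32
}

// dod returns the pushed-back value if there is one, otherwise it decodes the
// next delta-of-delta as it does today.
func (r *dodReader) dod() int32 {
    if r.buffered {
        r.buffered = false
        return r.buf
    }
    v := r.vals[r.pos]
    r.pos++
    return v
}

// unreadDod pushes a value back so the next dod() call returns it again.
func (r *dodReader) unreadDod(v int32) {
    r.buf = v
    r.buffered = true
}

func main() {
    r := &dodReader{vals: []int32{-17940, 0, 0}}
    d := r.dod() // peek at the 2nd point's dod while decoding the first point
    r.unreadDod(d)
    fmt.Println(r.dod(), r.dod()) // -17940 0: the 2nd point still sees its dod
}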

@robert-milan (Contributor) left a comment

I think your current fix works for most cases. As replay already pointed out though, there is a possibility that it will wrap more than once, considering our max chunkspan is 24 hours. I think this also means we will need to use 17 bits to cover the entire range (24h is 86400s, and 2^16 = 65536 < 86400 <= 2^17 = 131072), if we decide to pursue that course of action.

As to the implementation, based on your numbers it doesn't seem like the extra copying / allocation is a big deal. If that changes we could look at implementing a peek function to avoid the allocations, although that of course brings its own computational overhead. Just a thought.

Other than that, it looks good to me.

@replay (Contributor) left a comment

Looks good to me.
I added one comment about how I think it could possibly be improved, but I'm not sure if my suggestion is really better than the current state. I'd also prefer if some more people take a look at this, because I'm not fully confident about it, as this isn't simple.

@Dieterbe Dieterbe changed the title from "WIP: address data corruption in chunk encoding for long chunks with >=4.55 hours of nulls" to "address data corruption in chunk encoding for long chunks with >=4.55 hours of nulls" on Nov 2, 2018
@Dieterbe Dieterbe merged commit 3dc1937 into master Nov 2, 2018
Dieterbe added a commit that referenced this pull request Nov 27, 2018
@Dieterbe Dieterbe deleted the chunk-4h-sparse-bugfix branch March 27, 2019 21:09