Replay WAL concurrently without blocking #10973
Conversation
Force-pushed from 2c530ae to e6e47e8.
This is actually a great idea; I am not sure why we did not do this before. I will give this a review in the coming days to verify whether we are missing something. Could you also try to benchmark this?
@damnever thanks for this. We actually saw the same issue with time.Sleep and replays taking too long. We tried to optimize the existing code by removing the lock; then we realized we only needed to wait if the series had already been created, so we were going to optimize that bit.
However, your idea is even better. I'm sure the benchmark will show great results; you can use the benchmark here: Lines 134 to 263 in 4b9f248
This is definitely much cleaner now! My comments are optional and opinionated nitpicks.
One immediate thing: we now have to process tombstones and exemplars in the same processing function, because they rely on the series being created first and must stay in order.
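The ordering requirement above can be sketched in a few lines. This is not the actual Prometheus code; the `record` struct and `replayShards` are hypothetical names for illustration. The idea is to hash every WAL record to a worker by its series reference, so that the series creation and all dependent records (samples, tombstones, exemplars) for the same series are applied by the same goroutine, in WAL order:

```go
package main

import (
	"fmt"
	"sync"
)

// record is a simplified stand-in for a WAL record (series, sample,
// tombstone or exemplar); seriesRef decides which shard handles it.
type record struct {
	seriesRef uint64
	kind      string
}

// replayShards fans records out to n workers, hashing on the series
// reference, so every record for a given series lands on the same
// worker and is applied in WAL order. It returns what each shard saw.
func replayShards(records []record, n int) [][]record {
	chans := make([]chan record, n)
	out := make([][]record, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		chans[i] = make(chan record, 16)
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for r := range chans[i] {
				// Only this goroutine touches out[i], so no locking is
				// needed, and arrival order equals WAL order per series.
				out[i] = append(out[i], r)
			}
		}(i)
	}
	for _, r := range records {
		chans[r.seriesRef%uint64(n)] <- r
	}
	for _, c := range chans {
		close(c)
	}
	wg.Wait()
	return out
}

func main() {
	recs := []record{
		{1, "series"}, {2, "series"},
		{1, "sample"}, {1, "tombstone"}, {2, "exemplar"},
	}
	for i, s := range replayShards(recs, 2) {
		fmt.Println("shard", i, s)
	}
}
```

Because the shard assignment is a pure function of the series reference, a tombstone or exemplar can never be processed before its series record on the same shard.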
@jesusvazquez I have run the benchmark test against the
Force-pushed from e6e47e8 to 99a1f9e.
@damnever I actually ran the benchmark locally myself and these are the results I got. Platform:
We'd actually need to test this on a real-world WAL, since the WALs in these benchmarks are pretty small; the ratio of samples to series in a real-world WAL is much higher. Note: originally I compared the benchmarks in the wrong order and it looked like the changes were taking longer, but that was my mistake.
Force-pushed from 99a1f9e to 3a6deda.
The benchmark result looks better on my mac if I modify the test like this:

```diff
@@ -185,9 +185,13 @@ func BenchmarkLoadWAL(b *testing.B) {
 	// Write series.
 	refSeries := make([]record.RefSeries, 0, c.seriesPerBatch)
+	refSamples := make([]record.RefSample, 0, c.seriesPerBatch)
 	for k := 0; k < c.batches; k++ {
 		refSeries = refSeries[:0]
 		for i := k * c.seriesPerBatch; i < (k+1)*c.seriesPerBatch; i++ {
+			if i%2 == 0 {
+				continue
+			}
 			lbls := make(map[string]string, labelsPerSeries)
 			lbls[defaultLabelName] = strconv.Itoa(i)
 			for j := 1; len(lbls) < labelsPerSeries; j++ {
@@ -196,22 +200,48 @@ func BenchmarkLoadWAL(b *testing.B) {
 			refSeries = append(refSeries, record.RefSeries{Ref: chunks.HeadSeriesRef(i) * 101, Labels: labels.FromMap(lbls)})
 		}
 		populateTestWAL(b, w, []interface{}{refSeries})
-	}
+		// Write samples.
+		for i := k * c.seriesPerBatch; i < (k+1)*c.seriesPerBatch; i++ {
+			if i%2 == 0 {
+				continue
+			}
+			refSamples = refSamples[:0]
+			for j := 0; j < c.samplesPerSeries; j++ {
+				refSamples = append(refSamples, record.RefSample{
+					Ref: chunks.HeadSeriesRef(i) * 101,
+					T:   int64(j) * 10,
+					V:   float64(j) * 100,
+				})
-	// Write samples.
-	refSamples := make([]record.RefSample, 0, c.seriesPerBatch)
-	for i := 0; i < c.samplesPerSeries; i++ {
-		for j := 0; j < c.batches; j++ {
+			}
+			populateTestWAL(b, w, []interface{}{refSamples})
+		}
+
+		refSeries = refSeries[:0]
+		for i := k * c.seriesPerBatch; i < (k+1)*c.seriesPerBatch; i++ {
+			if i%2 != 0 {
+				continue
+			}
+			lbls := make(map[string]string, labelsPerSeries)
+			lbls[defaultLabelName] = strconv.Itoa(i)
+			for j := 1; len(lbls) < labelsPerSeries; j++ {
+				lbls[defaultLabelName+strconv.Itoa(j)] = defaultLabelValue + strconv.Itoa(j)
+			}
+			refSeries = append(refSeries, record.RefSeries{Ref: chunks.HeadSeriesRef(i) * 101, Labels: labels.FromMap(lbls)})
+
+			// Write samples.
 			refSamples = refSamples[:0]
-		for k := j * c.seriesPerBatch; k < (j+1)*c.seriesPerBatch; k++ {
+			for j := 0; j < c.samplesPerSeries; j++ {
 				refSamples = append(refSamples, record.RefSample{
-				Ref: chunks.HeadSeriesRef(k) * 101,
-				T:   int64(i) * 10,
-				V:   float64(i) * 100,
+				Ref: chunks.HeadSeriesRef(i) * 101,
+				T:   int64(j) * 10,
+				V:   float64(j) * 100,
 				})
+
 			}
 			populateTestWAL(b, w, []interface{}{refSamples})
 		}
+		populateTestWAL(b, w, []interface{}{refSeries})
 	}
 	// Write mmapped chunks.
```
Force-pushed from 3a6deda to 2572856.
"we need a little more memory to get the task done": how much are we talking?
You can run the benchmark with
Just pointing that out because it's something I learned recently 😂
@richardkiene I think there is no significant difference in memory consumption. I'd actually expect better memory utilization, because the slice reuse ratio should be better, and slice reuse also leads to better CPU utilization. I have posted some benchmark results in the comment above. The reason I modified the test is that I think it is more appropriate for this case (closer to a real-world case).
I have done some tests manually with more than 1 million series (7 checkpoint segments, 27 WAL segments). The steps are as follows:
Here are the results:
As the results show, this change is faster, but it may use more CPU, since we now get a chance to take advantage of the available cores.
We also need to process []tombstones.Stone and []record.RefExemplar in the same way that series processing is sharded, because tombstones and exemplars have the same ordering requirement with respect to their corresponding series as samples do.
I haven't checked the other code changes yet; I will do that after this change. But the idea in general and the benchmarks look good, so we can go ahead with this approach.
Oh, ignore my above comment. I see what you did with series creation: not all of it happens in the sharded goroutines.
Failed tests seem to be unrelated.
Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
Force-pushed from 2572856 to 699e9ea.
Friendly ping @codesome, is there anything we should do here?
FWIW (I'm not a maintainer here, nor an expert in this code, but you've asked for my review 🙂) this LGTM, good job!
Please consider my nitpick comments, although they're definitely not blockers.
Apologies for the delay. I plan to review it tomorrow. Thanks for your patience!
LGTM with nits above from me and @colega
Nice work!
Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
The WAL replay is extremely slow, probably because we have tons of series, and I found that time.Sleep comes up multiple times in profiles. After some investigation, I think we had better put the WAL processing logic in one place for the sake of speed, even if we need a little more memory to get the task done.