This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Byte Align bstream writes #2010

Closed
wants to merge 3 commits into from

Conversation

@shanson7 (Collaborator) commented Oct 26, 2021

Results in a 20-30% speedup of tsz push op (with 120 points using -benchtime=120x):

$ benchstat ./benches/serieslong.no_mutex.many.bench ./benches/serieslong.partial_byte.bench
name                                old time/op  new time/op  delta
PushSeriesLong-12                   51.7ns ± 4%  36.2ns ± 4%  -30.05%  (p=0.000 n=19+18)
PushSeriesLongMonotonicIncrease-12  42.4ns ± 3%  33.3ns ± 3%  -21.35%  (p=0.000 n=19+20)
PushSeriesLongSawtooth-12           45.8ns ± 4%  34.8ns ± 2%  -24.15%  (p=0.000 n=18+19)
PushSeriesLongSawtoothWithFlats-12  46.2ns ± 3%  35.0ns ± 3%  -24.32%  (p=0.000 n=18+20)
PushSeriesLongSteps-12              47.2ns ± 3%  35.5ns ± 4%  -24.89%  (p=0.000 n=19+19)
PushSeriesLongRealWorldCPU-12       54.6ns ± 2%  38.6ns ± 1%  -29.29%  (p=0.000 n=19+20)
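For context on the methodology discussed below: these are standard Go benchmarks where each b.N iteration pushes one point, which is why `-benchtime=120x` pins a run to exactly 120 points per chunk. The sketch below is a hypothetical stand-in (the real benchmarks push into a tsz compressed chunk; the `series` type and names here are assumptions for illustration):

```go
package main

import (
	"fmt"
	"testing"
)

// series is a hypothetical stand-in for metrictank's tsz series; the real
// benchmarks push points into a compressed chunk. It is shown only to
// illustrate the benchmark shape being discussed.
type series struct{ n int }

func (s *series) Push(ts uint32, val float64) { s.n++ }

// Each b.N iteration pushes exactly one point, so a run with
// `-benchtime=120x` writes exactly 120 points per chunk.
func benchmarkPushSeriesLong(b *testing.B) {
	s := &series{}
	for i := 0; i < b.N; i++ {
		s.Push(uint32(i*10), float64(i))
	}
}

func main() {
	// testing.Benchmark picks b.N on its own here; on the command line,
	// `go test -bench=. -benchtime=120x -count=20` plus benchstat would
	// reproduce comparisons like the table above.
	r := testing.Benchmark(benchmarkPushSeriesLong)
	fmt.Println(r.N >= 1)
}
```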

@Dieterbe (Contributor) commented Nov 2, 2021

Hmm, the benchmark writes b.N points, which will often be artificially high. Also, the values aren't the most realistic; a more real-world benchmark would be useful.

Are you seeing any benefit when deployed?

@shanson7 (Collaborator, Author) commented Nov 2, 2021

> Hmm, the benchmark writes b.N points, which will often be artificially high.

The benchmarks above were run with -benchtime=120x to force 120 points per chunk.

> Also, the values aren't the most realistic; a more real-world benchmark would be useful.
> Are you seeing any benefit when deployed?

We ran profiles before and after and saw the expected drop in time spent in Push. However, we didn't see any real change in blackbox metrics (CPU usage, ingest rate), likely because such a small percentage of overall time was spent in Push.

@Dieterbe (Contributor) commented Nov 2, 2021

> The benchmarks above were run with -benchtime=120x to force 120 points per chunk.

with similar results?

@shanson7 (Collaborator, Author) commented Nov 2, 2021

> > The benchmarks above were run with -benchtime=120x to force 120 points per chunk.
>
> with similar results?

The same results. The numbers in the PR description were from -benchtime=120x runs. I'll edit the note to be clearer.

@petethepig (Member) commented

FWIW we ran one of the benchmarks (PushSeriesLong) with pyroscope (I'm one of the maintainers) and here's a diff flamegraph:

[diff flamegraph screenshot]

Green marks functions that got faster and red marks functions that got slower after the PR. We ran this with 8,000,000 iterations (to get enough profiling samples), inserting 120 data points at a time.

You can see from the diff that calls to writeBit and writeByte happen much less often now, and more work is done in writeBits (e.g. the new alignByte call). Overall there's about a 30% improvement for the whole benchmark (it goes from 1.20 min to 0.83 min).

Here's a link to a publicly available pyroscope instance with this data, if you want to play with it.

I do think running pyroscope on some production cluster would be more interesting and produce more representative results. @shanson7, @Dieterbe — I would love to collaborate on that if any of you are interested.

@Dieterbe (Contributor) commented Nov 27, 2021

So:

  1. In writeByte() you tweaked the optimal path: when the stream ends at a byte boundary, we can skip some instructions and don't append an anticipatory byte at the end. This might be helpful when the last write to the stream is such an "optimal path" write.
  2. In master, in writeBits() we write whole bytes as long as we have any, and then all remaining bits, individually (ignoring byte alignment). Your change does a (likely) partial byte write to achieve alignment, then writes whole bytes, and finally the remaining bits in one shot.

The code looks fine, but I have yet to check out #2005 (I see that the benchmarks quoted here are introduced there). Are your benchmarks against #2005 or against master? There are a couple of distinct optimizations in this PR, and the workload will change due to #2005, so it may be tricky to figure out exactly which changes (and combinations thereof) are ideal.

From only looking at the code, I'm pretty confident that 1) will always be a net improvement, though I'm unsure by how much; 2) is likely also an improvement, though it's less obvious.

I'll have a look at #2005 to better understand the interplay between it and this PR.

@shanson7 (Collaborator, Author) commented Nov 29, 2021

> Are your benchmarks against #2005 or against master?

Hmm, that's a good question. It's been a while, and I can't remember whether I ported the benchmarks or based this whole change on #2005.

Edit:

At any rate, the Sawtooth benchmarks were basically unaffected by #2005 and still saw performance boosts.

@shanson7 (Collaborator, Author) commented

@Dieterbe - Can this get another look?

@Dieterbe (Contributor) commented Feb 1, 2022

I refactored and analyzed via #2024. I suggest we merge that one.

@Dieterbe (Contributor) commented Feb 1, 2022

merged

@Dieterbe closed this Feb 1, 2022

3 participants