This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Byte Align bstream writes #2010

Closed
wants to merge 3 commits into from

Conversation

@shanson7 (Collaborator) commented Oct 26, 2021

Results in a 20-30% speedup of tsz push op (with 120 points using -benchtime=120x):

$ benchstat ./benches/serieslong.no_mutex.many.bench ./benches/serieslong.partial_byte.bench
name                                old time/op  new time/op  delta
PushSeriesLong-12                   51.7ns ± 4%  36.2ns ± 4%  -30.05%  (p=0.000 n=19+18)
PushSeriesLongMonotonicIncrease-12  42.4ns ± 3%  33.3ns ± 3%  -21.35%  (p=0.000 n=19+20)
PushSeriesLongSawtooth-12           45.8ns ± 4%  34.8ns ± 2%  -24.15%  (p=0.000 n=18+19)
PushSeriesLongSawtoothWithFlats-12  46.2ns ± 3%  35.0ns ± 3%  -24.32%  (p=0.000 n=18+20)
PushSeriesLongSteps-12              47.2ns ± 3%  35.5ns ± 4%  -24.89%  (p=0.000 n=19+19)
PushSeriesLongRealWorldCPU-12       54.6ns ± 2%  38.6ns ± 1%  -29.29%  (p=0.000 n=19+20)
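For context on the methodology discussed below: these are standard Go benchmarks where each b.N iteration pushes one point, which is why `-benchtime=120x` pins a run to exactly 120 points per chunk. The sketch below is a hypothetical stand-in (the real benchmarks push into a tsz compressed chunk; the `series` type and names here are assumptions for illustration):

```go
package main

import (
	"fmt"
	"testing"
)

// series is a hypothetical stand-in for metrictank's tsz series; the real
// benchmarks push points into a compressed chunk. It is shown only to
// illustrate the benchmark shape being discussed.
type series struct{ n int }

func (s *series) Push(ts uint32, val float64) { s.n++ }

// Each b.N iteration pushes exactly one point, so a run with
// `-benchtime=120x` writes exactly 120 points per chunk.
func benchmarkPushSeriesLong(b *testing.B) {
	s := &series{}
	for i := 0; i < b.N; i++ {
		s.Push(uint32(i*10), float64(i))
	}
}

func main() {
	// testing.Benchmark picks b.N on its own here; on the command line,
	// `go test -bench=. -benchtime=120x -count=20` plus benchstat would
	// reproduce comparisons like the table above.
	r := testing.Benchmark(benchmarkPushSeriesLong)
	fmt.Println(r.N >= 1)
}
```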

@Dieterbe (Contributor) commented Nov 2, 2021

Hmm, the benchmark writes b.N points, which will often be artificially high. Also, the values aren't the most realistic; a more real-world benchmark would be useful.

Are you seeing any benefit when deployed?

@shanson7 (Collaborator, Author) commented Nov 2, 2021

> Hmm, the benchmark writes b.N points, which will often be artificially high.

The benchmarks above were run with -benchtime=120x to force 120 points per chunk.

> Also, the values aren't the most realistic; a more real-world benchmark would be useful.
> Are you seeing any benefit when deployed?

We ran profiles before and after and saw the expected drop in time spent in Push. However, we didn't see any real change in blackbox metrics (CPU usage, ingest rate), likely because such a small percentage of overall time was spent in Push.

@Dieterbe (Contributor) commented Nov 2, 2021

> The benchmarks above were run with -benchtime=120x to force 120 points per chunk.

with similar results?

@shanson7 (Collaborator, Author) commented Nov 2, 2021

> > The benchmarks above were run with -benchtime=120x to force 120 points per chunk.
>
> with similar results?

The same results. The numbers in the PR description were from -benchtime=120x runs. I'll edit the note to be clearer.

@petethepig (Member) commented

FWIW we ran one of the benchmarks (PushSeriesLong) with pyroscope (I'm one of the maintainers) and here's a diff flamegraph:

[diff flamegraph screenshot]

Green marks functions that got faster and red marks functions that got slower after the PR. We ran this with 8,000,000 iterations (to get enough profiling samples), inserting 120 data points at a time.

You can see from the diff that calls to writeBit and writeByte happen much less often now, and more work is done in writeBits (e.g. the new alignByte call). Overall there's about a 30% improvement for the whole benchmark (it goes from 1.20 min to 0.83 min).

Here's a link to a publicly available pyroscope instance with this data, if you want to play with it.

I do think running pyroscope on some production cluster would be more interesting and produce more representative results. @shanson7, @Dieterbe — I would love to collaborate on that if any of you are interested.

@Dieterbe (Contributor) commented Nov 27, 2021

So:

  1. In writeByte() you tweaked the optimal path: when the stream ends at a byte boundary, we can skip some instructions and don't append an anticipatory byte at the end. This might be helpful when the last write to the stream is such an "optimal path" write.
  2. In master, in writeBits() we write whole bytes as long as we have any, and then all remaining bits, individually (ignoring byte alignment). Your change does a (likely) partial byte write to achieve alignment, then writes whole bytes, and finally the remaining bits in one shot.

The code looks fine, but I have yet to check out #2005 (I see that the benchmarks quoted here are introduced there). Are your benchmarks against #2005 or against master? There are a couple of distinct optimizations in this PR, and the workload will change due to #2005, so it may be tricky to figure out exactly which changes (and combinations thereof) are ideal.

From only looking at the code, I'm pretty confident that 1) will always be a net improvement, though I'm unsure by how much; 2) is likely also an improvement, though it's less obvious.

I'll have a look at #2005 to better understand the interplay between it and this PR.

@shanson7 (Collaborator, Author) commented Nov 29, 2021

> Are your benchmarks against #2005 or against master?

Hmm, that's a good question. It's been a while, and I can't remember whether I ported the benchmarks or based this whole change on #2005.

Edit:

At any rate, the Sawtooth benchmarks were basically unaffected by #2005 and still saw performance boosts.

@shanson7 (Collaborator, Author) commented

@Dieterbe - Can this get another look?

@Dieterbe (Contributor) commented Feb 1, 2022

I refactored and analyzed via #2024. I suggest we merge that one.

@Dieterbe (Contributor) commented Feb 1, 2022

merged

@Dieterbe closed this Feb 1, 2022

3 participants