Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming marshalling #17

Merged
merged 8 commits into from
Apr 1, 2022
Merged

Streaming marshalling #17

merged 8 commits into from
Apr 1, 2022

Conversation

mstoykov
Copy link
Contributor

@mstoykov mstoykov commented Mar 23, 2022

Benchmarks results comparing first and last:

name                        old time/op    new time/op    delta
CompileTemplatesSimple-8       106ns ± 7%     984ns ±10%  +825.96%  (p=0.000 n=10+10)
CompileTemplatesComplex-8      181ns ± 8%    1415ns ± 6%  +681.53%  (p=0.000 n=10+10)
EvaluateTemplatesSimple-8      206ns ±10%      25ns ± 8%   -88.07%  (p=0.000 n=10+10)
EvaluateTemplatesComplex-8     186ns ± 7%       7ns ±10%   -96.24%  (p=0.000 n=10+10)
StoreFromTemplates-8           205ms ±10%     158ms ±10%   -23.00%  (p=0.000 n=10+9)

name                        old alloc/op   new alloc/op   delta
StoreFromTemplates-8           157MB ± 0%     117MB ± 0%   -25.35%  (p=0.000 n=9+10)

name                        old allocs/op  new allocs/op  delta
StoreFromTemplates-8            589k ± 0%        0k ± 1%   -99.95%  (p=0.000 n=10+10)

The really important part is that it no longer near close to 600k allocations but instead some three-digit number. This significantly reduces the CPU usage in the real world.

If we compare

GenerateFromTemplatesAndMarshal-8  84.2ms ±15%

from the first commit - this is the part that has been heavily optimized and in the end replaced to

GenerateFromPrecompiledTemplates-8  27.4ms ±20%

(which does include the marshalling actually) you can see almost 4x improvement.

From some experiments it seems that a real world improvement will be in that vicinity if not better actually.

This now also works for the "old" API. So we might think about not actually adding JS precompilation call and then using that precompilation if it's not needed 🤔

Additionally, template parsing errors are now imported back instead of being ignored and just using the template as the value.

name                               time/op
CompileTemplatesSimple-8            106ns ± 7%
CompileTemplatesComplex-8           181ns ± 8%
EvaluateTemplatesSimple-8           206ns ±10%
EvaluateTemplatesComplex-8          186ns ± 7%
StoreFromTemplates-8                205ms ±10%
GenerateFromTemplates-8            61.9ms ± 7%
GenerateFromTemplatesAndMarshal-8  84.2ms ±15%

name                               alloc/op
CompileTemplatesSimple-8
CompileTemplatesComplex-8
EvaluateTemplatesSimple-8
EvaluateTemplatesComplex-8
StoreFromTemplates-8                157MB ± 0%
GenerateFromTemplates-8            89.5MB ± 0%
GenerateFromTemplatesAndMarshal-8   119MB ± 0%

name                               allocs/op
CompileTemplatesSimple-8
CompileTemplatesComplex-8
EvaluateTemplatesSimple-8
EvaluateTemplatesComplex-8
StoreFromTemplates-8                 589k ± 0%
GenerateFromTemplates-8              589k ± 0%
GenerateFromTemplatesAndMarshal-8    589k ± 0%
This means that now the benchmarks are a bit slower as that was
already part of them but now it's actually handled.

name                               old time/op    new time/op    delta
CompileTemplatesSimple-8              106ns ± 7%     115ns ±10%  +8.02%  (p=0.017 n=10+9)
CompileTemplatesComplex-8             181ns ± 8%     186ns ± 5%    ~     (p=0.218 n=10+10)
EvaluateTemplatesSimple-8             206ns ±10%     221ns ±10%  +7.15%  (p=0.015 n=10+10)
EvaluateTemplatesComplex-8            186ns ± 7%     188ns ± 9%    ~     (p=0.725 n=10+10)
StoreFromTemplates-8                  205ms ±10%     212ms ± 9%    ~     (p=0.190 n=10+10)
GenerateFromTemplates-8              61.9ms ± 7%    62.3ms ± 7%    ~     (p=0.842 n=9+10)
GenerateFromTemplatesAndMarshal-8    84.2ms ±15%    83.9ms ±10%    ~     (p=0.905 n=10+9)

name                               old alloc/op   new alloc/op   delta
StoreFromTemplates-8                  157MB ± 0%     152MB ± 0%  -3.41%  (p=0.000 n=9+10)
GenerateFromTemplates-8              89.5MB ± 0%    89.5MB ± 0%    ~     (p=0.965 n=10+8)
GenerateFromTemplatesAndMarshal-8     119MB ± 0%     117MB ± 0%  -2.19%  (p=0.000 n=7+10)

name                               old allocs/op  new allocs/op  delta
StoreFromTemplates-8                   589k ± 0%      589k ± 0%    ~     (p=0.190 n=10+10)
GenerateFromTemplates-8                589k ± 0%      589k ± 0%    ~     (all equal)
GenerateFromTemplatesAndMarshal-8      589k ± 0%      589k ± 0%    ~     (p=0.173 n=10+9)
This has close to ~27% performance improvement in the creation of
timeseries

GenerateFromTemplates-8              62.3ms ± 7%    45.3ms ± 7%  -27.18%  (p=0.000 n=10+9)
GenerateFromTemplatesAndMarshal-8    83.9ms ±10%    67.3ms ± 9%  -19.79%  (p=0.000 n=9+10)
This has pretty significant performance improvements. And it also
reduces allocations by almost half which likely the increases the
impact.

name                               old time/op    new time/op    delta
CompileTemplatesSimple-8              115ns ± 9%     121ns ±13%     ~     (p=0.138 n=10+10)
CompileTemplatesComplex-8             196ns ± 7%     329ns ± 6%  +67.80%  (p=0.000 n=10+10)
EvaluateTemplatesSimple-8             213ns ± 9%     242ns ±10%  +13.76%  (p=0.000 n=10+10)
EvaluateTemplatesComplex-8            199ns ± 6%       7ns ± 6%  -96.63%  (p=0.000 n=9+10)
StoreFromTemplates-8                  206ms ± 4%     176ms ±11%  -14.41%  (p=0.000 n=10+9)
GenerateFromTemplates-8              45.3ms ± 7%    33.5ms ± 5%  -26.07%  (p=0.000 n=9+8)
GenerateFromTemplatesAndMarshal-8    67.3ms ± 9%    55.3ms ± 7%  -17.79%  (p=0.000 n=10+10)

name                               old alloc/op   new alloc/op   delta
StoreFromTemplates-8                  152MB ± 0%     149MB ± 0%   -2.16%  (p=0.000 n=10+10)
GenerateFromTemplates-8              89.5MB ± 0%    86.2MB ± 0%   -3.68%  (p=0.000 n=9+10)
GenerateFromTemplatesAndMarshal-8     117MB ± 0%     114MB ± 0%   -2.82%  (p=0.000 n=10+10)

name                               old allocs/op  new allocs/op  delta
StoreFromTemplates-8                   589k ± 0%      311k ± 0%  -47.21%  (p=0.000 n=10+10)
GenerateFromTemplates-8                589k ± 0%      311k ± 0%  -47.22%  (p=0.000 n=10+8)
GenerateFromTemplatesAndMarshal-8      589k ± 0%      311k ± 0%  -47.22%  (p=0.000 n=9+10)
This has some benchmark changes which likely are more about the error
handling than anything else

name                               old time/op    new time/op    delta
CompileTemplatesSimple-8              121ns ±13%     824ns ± 7%  +580.51%  (p=0.000 n=10+10)
CompileTemplatesComplex-8             329ns ± 6%    1195ns ±17%  +262.75%  (p=0.000 n=10+9)
EvaluateTemplatesSimple-8             242ns ±10%     249ns ± 5%      ~     (p=0.247 n=10+10)
EvaluateTemplatesComplex-8           6.71ns ± 6%    6.79ns ± 6%      ~     (p=0.684 n=10+10)
StoreFromTemplates-8                  176ms ±11%     186ms ± 8%    +5.22%  (p=0.035 n=9+10)
GenerateFromTemplates-8              33.5ms ± 5%    32.2ms ±12%      ~     (p=0.055 n=8+10)
GenerateFromTemplatesAndMarshal-8    55.3ms ± 7%    50.1ms ±12%    -9.38%  (p=0.009 n=10+10)

name                               old alloc/op   new alloc/op   delta
StoreFromTemplates-8                  149MB ± 0%     149MB ± 0%    -0.00%  (p=0.029 n=10+10)
GenerateFromTemplates-8              86.2MB ± 0%    86.2MB ± 0%      ~     (p=0.305 n=10+10)
GenerateFromTemplatesAndMarshal-8     114MB ± 0%     114MB ± 0%      ~     (p=0.143 n=10+10)

name                               old allocs/op  new allocs/op  delta
StoreFromTemplates-8                   311k ± 0%      311k ± 0%    +0.00%  (p=0.000 n=10+10)
GenerateFromTemplates-8                311k ± 0%      311k ± 0%    +0.00%  (p=0.000 n=8+10)
GenerateFromTemplatesAndMarshal-8      311k ± 0%      311k ± 0%    +0.00%  (p=0.000 n=10+9)
This removes the majority of the allocations, speeding up the whole
process almost by twice.

There is also now 2 new APIs:
1. to compile templates once
2. to use those templates and the streaming API

There are a lot of tests and benchmarks to make certain this isn't
broken, but there might still be problems.

Current benchmark results:

name                                time/op
CompileTemplatesSimple-8            1.06µs ± 9%
CompileTemplatesComplex-8           1.53µs ±14%
EvaluateTemplatesSimple-8            240ns ± 9%
EvaluateTemplatesComplex-8          6.80ns ± 4%
StoreFromPrecompiledTemplates-8      151ms ± 9%
StoreFromTemplates-8                 179ms ± 3%
GenerateFromPrecompiledTemplates-8  30.2ms ±15%
GenerateFromTemplates-8             36.9ms ± 8%
GenerateFromTemplatesAndMarshal-8   55.4ms ± 8%
WriteFor-8                          1.09µs ± 9%

name                                alloc/op
CompileTemplatesSimple-8
CompileTemplatesComplex-8
EvaluateTemplatesSimple-8
EvaluateTemplatesComplex-8
StoreFromPrecompiledTemplates-8      117MB ± 0%
StoreFromTemplates-8                 149MB ± 0%
GenerateFromPrecompiledTemplates-8  79.2MB ± 0%
GenerateFromTemplates-8             86.2MB ± 0%
GenerateFromTemplatesAndMarshal-8    114MB ± 0%
WriteFor-8                            629B ± 7%

name                                allocs/op
CompileTemplatesSimple-8
CompileTemplatesComplex-8
EvaluateTemplatesSimple-8
EvaluateTemplatesComplex-8
StoreFromPrecompiledTemplates-8        175 ± 3%
StoreFromTemplates-8                  311k ± 0%
GenerateFromPrecompiledTemplates-8    36.9 ±11%
GenerateFromTemplates-8               311k ± 0%
GenerateFromTemplatesAndMarshal-8     311k ± 0%
WriteFor-8                            0.00

From GenerateFromPrecompiledTemplates vs GenerateFromTemplatesAndMarshal
it's obvious that :
1. this is almost twice as fast
2. the allocations go from 311k to 36.9 (no ks)
That second part especially makes GCing with a lot of those going on a
lot easier

Comparison with previous commit:

name                               old time/op    new time/op    delta
CompileTemplatesSimple-8              824ns ± 7%    1060ns ± 9%  +28.67%  (p=0.000 n=10+10)
CompileTemplatesComplex-8            1.20µs ±17%    1.53µs ±14%  +27.99%  (p=0.000 n=9+10)
EvaluateTemplatesSimple-8             249ns ± 5%     240ns ± 9%   -3.61%  (p=0.045 n=10+10)
EvaluateTemplatesComplex-8           6.79ns ± 6%    6.80ns ± 4%     ~     (p=0.971 n=10+10)
StoreFromTemplates-8                  186ms ± 8%     179ms ± 3%   -3.73%  (p=0.034 n=10+8)
GenerateFromTemplates-8              32.2ms ±12%    36.9ms ± 8%  +14.80%  (p=0.000 n=10+9)
GenerateFromTemplatesAndMarshal-8    50.1ms ±12%    55.4ms ± 8%  +10.62%  (p=0.004 n=10+10)

name                               old alloc/op   new alloc/op   delta
StoreFromTemplates-8                  149MB ± 0%     149MB ± 0%   +0.00%  (p=0.001 n=10+9)
GenerateFromTemplates-8              86.2MB ± 0%    86.2MB ± 0%   +0.00%  (p=0.000 n=10+10)
GenerateFromTemplatesAndMarshal-8     114MB ± 0%     114MB ± 0%   +0.00%  (p=0.000 n=10+10)

name                               old allocs/op  new allocs/op  delta
StoreFromTemplates-8                   311k ± 0%      311k ± 0%   +0.03%  (p=0.000 n=10+10)
GenerateFromTemplates-8                311k ± 0%      311k ± 0%   +0.03%  (p=0.000 n=10+10)
GenerateFromTemplatesAndMarshal-8      311k ± 0%      311k ± 0%   +0.03%  (p=0.000 n=9+10)

the 10% slowing down is due to the more time to compile both the string
and byte version of the label generator. Event with it the new
implementation is a lot faster.
This makes it *almost* as efficient as StoreFromPrecompiledTemplates

name                                time/op
CompileTemplatesSimple-8            1.05µs ±10%
CompileTemplatesComplex-8           1.54µs ± 7%
EvaluateTemplatesSimple-8            230ns ± 7%
EvaluateTemplatesComplex-8          6.88ns ± 6%
StoreFromPrecompiledTemplates-8      155ms ±13%
StoreFromTemplates-8                 153ms ± 7%
GenerateFromPrecompiledTemplates-8  31.6ms ±19%
GenerateFromTemplates-8             35.7ms ±10%
GenerateFromTemplatesAndMarshal-8   58.3ms ± 9%
WriteFor-8                          1.08µs ±10%

name                                alloc/op
CompileTemplatesSimple-8
CompileTemplatesComplex-8
EvaluateTemplatesSimple-8
EvaluateTemplatesComplex-8
StoreFromPrecompiledTemplates-8      117MB ± 0%
StoreFromTemplates-8                 117MB ± 0%
GenerateFromPrecompiledTemplates-8  79.2MB ± 0%
GenerateFromTemplates-8             86.2MB ± 0%
GenerateFromTemplatesAndMarshal-8    114MB ± 0%
WriteFor-8                            640B ± 5%

name                                allocs/op
CompileTemplatesSimple-8
CompileTemplatesComplex-8
EvaluateTemplatesSimple-8
EvaluateTemplatesComplex-8
StoreFromPrecompiledTemplates-8        175 ± 2%
StoreFromTemplates-8                   309 ± 1%
GenerateFromPrecompiledTemplates-8    39.0 ±21%
GenerateFromTemplates-8               311k ± 0%
GenerateFromTemplatesAndMarshal-8     311k ± 0%
WriteFor-8                            0.00

This difference is now probably negligible but might still be somewhat
useful
This does have some performance improvements due to having to not
support some cases.

name                                old time/op    new time/op    delta
CompileTemplatesSimple-8              1.05µs ±10%    0.98µs ±10%   -5.92%  (p=0.015 n=10+10)
CompileTemplatesComplex-8             1.54µs ± 7%    1.41µs ± 6%   -8.35%  (p=0.001 n=9+10)
EvaluateTemplatesSimple-8              230ns ± 7%      25ns ± 8%  -89.30%  (p=0.000 n=10+10)
EvaluateTemplatesComplex-8            6.88ns ± 6%    6.98ns ±10%     ~     (p=0.796 n=10+10)
StoreFromPrecompiledTemplates-8        155ms ±13%     153ms ± 7%     ~     (p=1.000 n=10+10)
StoreFromTemplates-8                   153ms ± 7%     158ms ±10%     ~     (p=0.243 n=10+9)
GenerateFromPrecompiledTemplates-8    31.6ms ±19%    27.4ms ±20%  -13.42%  (p=0.005 n=10+10)
WriteFor-8                            1.08µs ±10%    1.05µs ±11%     ~     (p=0.280 n=10+10)

name                                old alloc/op   new alloc/op   delta
StoreFromPrecompiledTemplates-8        117MB ± 0%     117MB ± 0%     ~     (p=1.000 n=10+9)
StoreFromTemplates-8                   117MB ± 0%     117MB ± 0%     ~     (p=0.143 n=10+10)
GenerateFromPrecompiledTemplates-8    79.2MB ± 0%    79.2MB ± 0%   -0.00%  (p=0.000 n=10+9)
WriteFor-8                              640B ± 5%      739B ±43%     ~     (p=0.795 n=7+10)

name                                old allocs/op  new allocs/op  delta
StoreFromPrecompiledTemplates-8          175 ± 2%       175 ± 1%     ~     (p=0.643 n=10+10)
StoreFromTemplates-8                     309 ± 1%       281 ± 1%   -9.22%  (p=0.000 n=10+10)
GenerateFromPrecompiledTemplates-8      39.0 ±21%      28.3 ±13%  -27.44%  (p=0.000 n=10+10)
WriteFor-8                              0.00           0.00          ~     (all equal)
@mstoykov
Copy link
Contributor Author

This is #15 but in actually understandable commits

@mstoykov mstoykov mentioned this pull request Mar 23, 2022
Copy link
Contributor

@replay replay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StoreFromTemplates-8 589k ± 0% 0k ± 1% -99.95% (p=0.000 n=10+10)

this is very impressive ^^.

I gave it only a rough review because i don't think it's a high risk change, looks good to me as far as I see, great job 👍

@mstoykov
Copy link
Contributor Author

I don't think it's a high risk change

I mean it changes how we marshal to protobuf in order to get the allocations down so low.

I can only hope that between all the test we've run and https://github.com/grafana/xk6-client-prometheus-remote/pull/17/files#diff-c7b61c54e01102d2023bb4d5bf118795021d724bbef1f4ea661ec86e3f59d2fdR175 it will be fine.

I will wait for one more 👍 as we are not in a hurry to merge this

Copy link

@treid314 treid314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This is amazing!

@mstoykov mstoykov merged commit 56b0706 into main Apr 1, 2022
@mstoykov mstoykov deleted the streamingMarshalling branch April 1, 2022 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants