-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Streaming marshalling #17
Conversation
name time/op CompileTemplatesSimple-8 106ns ± 7% CompileTemplatesComplex-8 181ns ± 8% EvaluateTemplatesSimple-8 206ns ±10% EvaluateTemplatesComplex-8 186ns ± 7% StoreFromTemplates-8 205ms ±10% GenerateFromTemplates-8 61.9ms ± 7% GenerateFromTemplatesAndMarshal-8 84.2ms ±15% name alloc/op CompileTemplatesSimple-8 CompileTemplatesComplex-8 EvaluateTemplatesSimple-8 EvaluateTemplatesComplex-8 StoreFromTemplates-8 157MB ± 0% GenerateFromTemplates-8 89.5MB ± 0% GenerateFromTemplatesAndMarshal-8 119MB ± 0% name allocs/op CompileTemplatesSimple-8 CompileTemplatesComplex-8 EvaluateTemplatesSimple-8 EvaluateTemplatesComplex-8 StoreFromTemplates-8 589k ± 0% GenerateFromTemplates-8 589k ± 0% GenerateFromTemplatesAndMarshal-8 589k ± 0%
This means that now the benchmarks are a bit slower as that was already part of them but now it's actually handled. name old time/op new time/op delta CompileTemplatesSimple-8 106ns ± 7% 115ns ±10% +8.02% (p=0.017 n=10+9) CompileTemplatesComplex-8 181ns ± 8% 186ns ± 5% ~ (p=0.218 n=10+10) EvaluateTemplatesSimple-8 206ns ±10% 221ns ±10% +7.15% (p=0.015 n=10+10) EvaluateTemplatesComplex-8 186ns ± 7% 188ns ± 9% ~ (p=0.725 n=10+10) StoreFromTemplates-8 205ms ±10% 212ms ± 9% ~ (p=0.190 n=10+10) GenerateFromTemplates-8 61.9ms ± 7% 62.3ms ± 7% ~ (p=0.842 n=9+10) GenerateFromTemplatesAndMarshal-8 84.2ms ±15% 83.9ms ±10% ~ (p=0.905 n=10+9) name old alloc/op new alloc/op delta StoreFromTemplates-8 157MB ± 0% 152MB ± 0% -3.41% (p=0.000 n=9+10) GenerateFromTemplates-8 89.5MB ± 0% 89.5MB ± 0% ~ (p=0.965 n=10+8) GenerateFromTemplatesAndMarshal-8 119MB ± 0% 117MB ± 0% -2.19% (p=0.000 n=7+10) name old allocs/op new allocs/op delta StoreFromTemplates-8 589k ± 0% 589k ± 0% ~ (p=0.190 n=10+10) GenerateFromTemplates-8 589k ± 0% 589k ± 0% ~ (all equal) GenerateFromTemplatesAndMarshal-8 589k ± 0% 589k ± 0% ~ (p=0.173 n=10+9)
This has close to ~27% performance improvement in the creation of timeseries GenerateFromTemplates-8 62.3ms ± 7% 45.3ms ± 7% -27.18% (p=0.000 n=10+9) GenerateFromTemplatesAndMarshal-8 83.9ms ±10% 67.3ms ± 9% -19.79% (p=0.000 n=9+10)
This has pretty significant performance improvements. And it also reduces allocations by almost half which likely the increases the impact. name old time/op new time/op delta CompileTemplatesSimple-8 115ns ± 9% 121ns ±13% ~ (p=0.138 n=10+10) CompileTemplatesComplex-8 196ns ± 7% 329ns ± 6% +67.80% (p=0.000 n=10+10) EvaluateTemplatesSimple-8 213ns ± 9% 242ns ±10% +13.76% (p=0.000 n=10+10) EvaluateTemplatesComplex-8 199ns ± 6% 7ns ± 6% -96.63% (p=0.000 n=9+10) StoreFromTemplates-8 206ms ± 4% 176ms ±11% -14.41% (p=0.000 n=10+9) GenerateFromTemplates-8 45.3ms ± 7% 33.5ms ± 5% -26.07% (p=0.000 n=9+8) GenerateFromTemplatesAndMarshal-8 67.3ms ± 9% 55.3ms ± 7% -17.79% (p=0.000 n=10+10) name old alloc/op new alloc/op delta StoreFromTemplates-8 152MB ± 0% 149MB ± 0% -2.16% (p=0.000 n=10+10) GenerateFromTemplates-8 89.5MB ± 0% 86.2MB ± 0% -3.68% (p=0.000 n=9+10) GenerateFromTemplatesAndMarshal-8 117MB ± 0% 114MB ± 0% -2.82% (p=0.000 n=10+10) name old allocs/op new allocs/op delta StoreFromTemplates-8 589k ± 0% 311k ± 0% -47.21% (p=0.000 n=10+10) GenerateFromTemplates-8 589k ± 0% 311k ± 0% -47.22% (p=0.000 n=10+8) GenerateFromTemplatesAndMarshal-8 589k ± 0% 311k ± 0% -47.22% (p=0.000 n=9+10)
This has some benchmark changes which likely are more about the error handling than anything else name old time/op new time/op delta CompileTemplatesSimple-8 121ns ±13% 824ns ± 7% +580.51% (p=0.000 n=10+10) CompileTemplatesComplex-8 329ns ± 6% 1195ns ±17% +262.75% (p=0.000 n=10+9) EvaluateTemplatesSimple-8 242ns ±10% 249ns ± 5% ~ (p=0.247 n=10+10) EvaluateTemplatesComplex-8 6.71ns ± 6% 6.79ns ± 6% ~ (p=0.684 n=10+10) StoreFromTemplates-8 176ms ±11% 186ms ± 8% +5.22% (p=0.035 n=9+10) GenerateFromTemplates-8 33.5ms ± 5% 32.2ms ±12% ~ (p=0.055 n=8+10) GenerateFromTemplatesAndMarshal-8 55.3ms ± 7% 50.1ms ±12% -9.38% (p=0.009 n=10+10) name old alloc/op new alloc/op delta StoreFromTemplates-8 149MB ± 0% 149MB ± 0% -0.00% (p=0.029 n=10+10) GenerateFromTemplates-8 86.2MB ± 0% 86.2MB ± 0% ~ (p=0.305 n=10+10) GenerateFromTemplatesAndMarshal-8 114MB ± 0% 114MB ± 0% ~ (p=0.143 n=10+10) name old allocs/op new allocs/op delta StoreFromTemplates-8 311k ± 0% 311k ± 0% +0.00% (p=0.000 n=10+10) GenerateFromTemplates-8 311k ± 0% 311k ± 0% +0.00% (p=0.000 n=8+10) GenerateFromTemplatesAndMarshal-8 311k ± 0% 311k ± 0% +0.00% (p=0.000 n=10+9)
This removes the majority of the allocations, speeding up the whole process almost by twice. There is also now 2 new APIs: 1. to compile templates once 2. to use those templates and the streaming API There are a lot of tests and benchmarks to make certain this isn't broken, but there might still be problems. Current benchmark results: name time/op CompileTemplatesSimple-8 1.06µs ± 9% CompileTemplatesComplex-8 1.53µs ±14% EvaluateTemplatesSimple-8 240ns ± 9% EvaluateTemplatesComplex-8 6.80ns ± 4% StoreFromPrecompiledTemplates-8 151ms ± 9% StoreFromTemplates-8 179ms ± 3% GenerateFromPrecompiledTemplates-8 30.2ms ±15% GenerateFromTemplates-8 36.9ms ± 8% GenerateFromTemplatesAndMarshal-8 55.4ms ± 8% WriteFor-8 1.09µs ± 9% name alloc/op CompileTemplatesSimple-8 CompileTemplatesComplex-8 EvaluateTemplatesSimple-8 EvaluateTemplatesComplex-8 StoreFromPrecompiledTemplates-8 117MB ± 0% StoreFromTemplates-8 149MB ± 0% GenerateFromPrecompiledTemplates-8 79.2MB ± 0% GenerateFromTemplates-8 86.2MB ± 0% GenerateFromTemplatesAndMarshal-8 114MB ± 0% WriteFor-8 629B ± 7% name allocs/op CompileTemplatesSimple-8 CompileTemplatesComplex-8 EvaluateTemplatesSimple-8 EvaluateTemplatesComplex-8 StoreFromPrecompiledTemplates-8 175 ± 3% StoreFromTemplates-8 311k ± 0% GenerateFromPrecompiledTemplates-8 36.9 ±11% GenerateFromTemplates-8 311k ± 0% GenerateFromTemplatesAndMarshal-8 311k ± 0% WriteFor-8 0.00 From GenerateFromPrecompiledTemplates vs GenerateFromTemplatesAndMarshal it's obvious that : 1. this is almost twice as fast 2. the allocations go from 311k to 36.9 (no ks) That second part especially makes GCing with a lot of those going on a lot easier Comparison with previous commit: name old time/op new time/op delta CompileTemplatesSimple-8 824ns ± 7% 1060ns ± 9% +28.67% (p=0.000 n=10+10) CompileTemplatesComplex-8 1.20µs ±17% 1.53µs ±14% +27.99% (p=0.000 n=9+10) EvaluateTemplatesSimple-8 249ns ± 5% 240ns ± 9% -3.61% (p=0.045 n=10+10) EvaluateTemplatesComplex-8 6.79ns ± 6% 6.80ns ± 4% ~ (p=0.971 n=10+10) StoreFromTemplates-8 186ms ± 8% 179ms ± 3% -3.73% (p=0.034 n=10+8) GenerateFromTemplates-8 32.2ms ±12% 36.9ms ± 8% +14.80% (p=0.000 n=10+9) GenerateFromTemplatesAndMarshal-8 50.1ms ±12% 55.4ms ± 8% +10.62% (p=0.004 n=10+10) name old alloc/op new alloc/op delta StoreFromTemplates-8 149MB ± 0% 149MB ± 0% +0.00% (p=0.001 n=10+9) GenerateFromTemplates-8 86.2MB ± 0% 86.2MB ± 0% +0.00% (p=0.000 n=10+10) GenerateFromTemplatesAndMarshal-8 114MB ± 0% 114MB ± 0% +0.00% (p=0.000 n=10+10) name old allocs/op new allocs/op delta StoreFromTemplates-8 311k ± 0% 311k ± 0% +0.03% (p=0.000 n=10+10) GenerateFromTemplates-8 311k ± 0% 311k ± 0% +0.03% (p=0.000 n=10+10) GenerateFromTemplatesAndMarshal-8 311k ± 0% 311k ± 0% +0.03% (p=0.000 n=9+10) the 10% slowing down is due to the more time to compile both the string and byte version of the label generator. Event with it the new implementation is a lot faster.
This makes it *almost* as efficient as StoreFromPrecompiledTemplates name time/op CompileTemplatesSimple-8 1.05µs ±10% CompileTemplatesComplex-8 1.54µs ± 7% EvaluateTemplatesSimple-8 230ns ± 7% EvaluateTemplatesComplex-8 6.88ns ± 6% StoreFromPrecompiledTemplates-8 155ms ±13% StoreFromTemplates-8 153ms ± 7% GenerateFromPrecompiledTemplates-8 31.6ms ±19% GenerateFromTemplates-8 35.7ms ±10% GenerateFromTemplatesAndMarshal-8 58.3ms ± 9% WriteFor-8 1.08µs ±10% name alloc/op CompileTemplatesSimple-8 CompileTemplatesComplex-8 EvaluateTemplatesSimple-8 EvaluateTemplatesComplex-8 StoreFromPrecompiledTemplates-8 117MB ± 0% StoreFromTemplates-8 117MB ± 0% GenerateFromPrecompiledTemplates-8 79.2MB ± 0% GenerateFromTemplates-8 86.2MB ± 0% GenerateFromTemplatesAndMarshal-8 114MB ± 0% WriteFor-8 640B ± 5% name allocs/op CompileTemplatesSimple-8 CompileTemplatesComplex-8 EvaluateTemplatesSimple-8 EvaluateTemplatesComplex-8 StoreFromPrecompiledTemplates-8 175 ± 2% StoreFromTemplates-8 309 ± 1% GenerateFromPrecompiledTemplates-8 39.0 ±21% GenerateFromTemplates-8 311k ± 0% GenerateFromTemplatesAndMarshal-8 311k ± 0% WriteFor-8 0.00 This difference is now probably negligible but might still be somewhat useful
This does have some performance improvements due to having to not support some cases. name old time/op new time/op delta CompileTemplatesSimple-8 1.05µs ±10% 0.98µs ±10% -5.92% (p=0.015 n=10+10) CompileTemplatesComplex-8 1.54µs ± 7% 1.41µs ± 6% -8.35% (p=0.001 n=9+10) EvaluateTemplatesSimple-8 230ns ± 7% 25ns ± 8% -89.30% (p=0.000 n=10+10) EvaluateTemplatesComplex-8 6.88ns ± 6% 6.98ns ±10% ~ (p=0.796 n=10+10) StoreFromPrecompiledTemplates-8 155ms ±13% 153ms ± 7% ~ (p=1.000 n=10+10) StoreFromTemplates-8 153ms ± 7% 158ms ±10% ~ (p=0.243 n=10+9) GenerateFromPrecompiledTemplates-8 31.6ms ±19% 27.4ms ±20% -13.42% (p=0.005 n=10+10) WriteFor-8 1.08µs ±10% 1.05µs ±11% ~ (p=0.280 n=10+10) name old alloc/op new alloc/op delta StoreFromPrecompiledTemplates-8 117MB ± 0% 117MB ± 0% ~ (p=1.000 n=10+9) StoreFromTemplates-8 117MB ± 0% 117MB ± 0% ~ (p=0.143 n=10+10) GenerateFromPrecompiledTemplates-8 79.2MB ± 0% 79.2MB ± 0% -0.00% (p=0.000 n=10+9) WriteFor-8 640B ± 5% 739B ±43% ~ (p=0.795 n=7+10) name old allocs/op new allocs/op delta StoreFromPrecompiledTemplates-8 175 ± 2% 175 ± 1% ~ (p=0.643 n=10+10) StoreFromTemplates-8 309 ± 1% 281 ± 1% -9.22% (p=0.000 n=10+10) GenerateFromPrecompiledTemplates-8 39.0 ±21% 28.3 ±13% -27.44% (p=0.000 n=10+10) WriteFor-8 0.00 0.00 ~ (all equal)
This is #15 but in actually understandable commits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StoreFromTemplates-8 589k ± 0% 0k ± 1% -99.95% (p=0.000 n=10+10)
this is very impressive ^^.
I gave it only a rough review because i don't think it's a high risk change, looks good to me as far as I see, great job 👍
I mean it changes how we marshal to protobuf in order to get the allocations down so low. I can only hope that between all the test we've run and https://github.com/grafana/xk6-client-prometheus-remote/pull/17/files#diff-c7b61c54e01102d2023bb4d5bf118795021d724bbef1f4ea661ec86e3f59d2fdR175 it will be fine. I will wait for one more 👍 as we are not in a hurry to merge this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! This is amazing!
Benchmarks results comparing first and last:
The really important part is that it no longer near close to 600k allocations but instead some three-digit number. This significantly reduces the CPU usage in the real world.
If we compare
from the first commit - this is the part that has been heavily optimized and in the end replaced to
(which does include the marshalling actually) you can see almost 4x improvement.
From some experiments it seems that a real world improvement will be in that vicinity if not better actually.
This now also works for the "old" API. So we might think about not actually adding JS precompilation call and then using that precompilation if it's not needed 🤔
Additionally, template parsing errors are now imported back instead of being ignored and just using the template as the value.