Optimize implementation for when there are many frames #102
Comments
As a start to this, I profiled and generated a flame graph of running. Very little time is spent in
Hmm, that. There's also a decent amount of time spent in number formatting there. I wonder whether we can use. The buffered writes to output we can't do much about, though you are right that #50 should help a bit with that, since there'll just be less stuff to write overall.
Yes, I think so, but I'm not sure what we should do to improve it. Even though
If the bottleneck is formatting floats with at most two digits after the decimal point, you will want to fork one or both of dtoa or ryu and specialize it for that representation. I don't know which one would turn out faster for this use case. Is the input f32 or f64? Can you provide more detail about the exact output format you need? Never exponential notation? The ryu code is more approachable -- there is an incomprehensible piece that converts a float into mantissa and base-10 exponent which you can pretty much use unchanged, and a comprehensible piece that formats to text, of which we already have two alternative representations for you to base yours on. The dtoa code would be less approachable but potentially faster if you find the right place to stick an early return when the necessary precision is reached. With ryu it is less feasible to avoid generating full precision in the mantissa.
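To illustrate the kind of specialization being suggested here (a sketch of my own, not code from ryu, dtoa, or inferno; the function name and range assumptions are invented), a fixed two-decimal format can bypass shortest-representation float printing entirely by scaling to an integer:

```rust
/// Format a non-negative f64 with exactly two digits after the decimal
/// point by scaling to an integer first. This sidesteps general-purpose
/// shortest-decimal formatters (ryu/dtoa) for this one fixed format.
///
/// Assumes 0.0 <= f and that f * 100.0 fits in a u64. Rounds to nearest;
/// use `.trunc()` instead of `.round()` for truncation.
fn format_two_decimals(f: f64) -> String {
    let scaled = (f * 100.0).round() as u64;
    format!("{}.{:02}", scaled / 100, scaled % 100)
}
```

A production version would write into a reusable byte buffer rather than allocate a fresh `String` per call, but the digit logic would be the same.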
@dtolnay all the inputs are. I did some digging to provide some more detail, and here are the places we do float formatting in a tight loop:
Lines 246 to 247 in 48506e8
Line 534 in 48506e8
There's also this, though I'm a little less worried about this one since it's only used for differentials. We could also probably work around the.
Line 550 in 48506e8
So the format output is actually pretty straightforward. The numbers are also all fairly small, so I don't think we'll ever need exponential notation. Just returning at the right precision in something like. We actually also spend a bit of time formatting integers in:
Lines 772 to 775 in 48506e8
But that's probably not worth doing anything about for the time being.
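For the integer case, the standard trick used by specialized formatters such as the itoa crate is to emit digits into a small stack buffer from the back, skipping the `core::fmt` machinery entirely. A rough pure-std sketch (the function name is mine, and this is not itoa's actual implementation):

```rust
/// Write the decimal digits of `n` into `buf` back-to-front and return
/// the used portion as a &str. 20 bytes is enough for any u64.
fn u64_to_decimal(mut n: u64, buf: &mut [u8; 20]) -> &str {
    let mut pos = buf.len();
    loop {
        pos -= 1;
        buf[pos] = b'0' + (n % 10) as u8;
        n /= 10;
        if n == 0 {
            break;
        }
    }
    // Only ASCII digits were written, so this cannot fail.
    std::str::from_utf8(&buf[pos..]).unwrap()
}
```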
It does seem like. Are all the floating point numbers pretty small (meaning not that many digits to the left of the decimal)? If that's the case, you don't really need comma separators and therefore don't really need. If my assumption is wrong and you would really like comma separators for some reason, I'd be happy to add floating point support to
Hmm, actually, yeah, you're right, we specifically don't want comma separators, since these are used as
Assuming truncation is all you need (not rounding), something like the following might work:

```rust
use std::io;

use rand::Rng;
use str_stack::StrStack;

fn main() -> Result<(), io::Error> {
    let mut rng = rand::thread_rng();
    let mut stack = StrStack::new();
    let mut buf = Vec::new();
    for _ in 0..100 {
        let f = rng.gen::<f64>() * 1000.0;
        write_truncated(&mut stack, &mut buf, f, 2)?;
    }
    println!("{:?}", &stack);
    Ok(())
}

fn write_truncated(
    stack: &mut StrStack,
    buf: &mut Vec<u8>,
    f: f64,
    precision: usize,
) -> Result<(), io::Error> {
    buf.clear();
    dtoa::write(&mut *buf, f)?;
    let decimal_position = buf
        .iter()
        .position(|b| *b == b'.')
        .expect("should always contain a '.'");
    let end = decimal_position + precision + 1;
    // Pad with zeros if dtoa produced fewer decimal digits than requested.
    while buf.len() < end {
        buf.push(b'0');
    }
    let slice = &buf[..end];
    // Safe: dtoa only ever writes ASCII.
    let s = unsafe { std::str::from_utf8_unchecked(slice) };
    stack.push(s);
    Ok(())
}
```
That's a good question -- I don't actually know what
It looks like. That said, if we do not need rounding, here's a branch that uses. It appears to pass all the tests except those it fails because rounding differs from truncation. Perhaps someone who knows more about the benchmark setup you guys are using could see if this does indeed give a speed-up. If so, we can clean it up a little bit and perhaps put it in as a stop-gap on the road to modifying. If it's worth pursuing (it may not be), I can create a PR to start collaborating on how to clean up this "proof-of-concept."

Branch: https://github.com/bcmyers/inferno/tree/float-formatting
It probably doesn't really matter if we truncate instead of round. It will slightly diverge in behavior from the Perl version, but if the performance increase is large enough, I'd say it's worth it. Let me do some benchmarking.
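To make the truncation-vs-rounding divergence concrete, here is a small comparison (hypothetical helper names; the integer-scaling truncation mirrors the approach in the branch above, while `{:.2}` gives round-to-nearest behavior comparable to Perl's `sprintf("%.2f", ...)`):

```rust
/// Truncate to two decimals via integer scaling (assumes small, non-negative f).
fn truncate_2dp(f: f64) -> String {
    let scaled = (f * 100.0).trunc() as u64;
    format!("{}.{:02}", scaled / 100, scaled % 100)
}

/// Round to two decimals using the standard formatter.
fn round_2dp(f: f64) -> String {
    format!("{:.2}", f)
}
```

The two only disagree when the dropped digits would carry the last kept digit upward, e.g. 1.999 truncates to "1.99" but rounds to "2.00".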
When I run the benchmarks with the default options, it doesn't look like there's much of an improvement.
It's a little better when I use the
@jasonrhansen that's interesting -- are you swapping out all the cases I highlighted above where we do formatting? What shows up as the primary hot-spots if you flamegraph the benchmark now? |
I probably missed a couple of areas where we could speed up the code; I'll do a more thorough pass sometime soon. Hopefully we can get more than a 1-2% speedup from this. Since hastily putting together the branch above, I started writing a little standalone library with
Here's the flame graph with the changes @bcmyers made. |
That's interesting... From what I can tell, here are the current bottlenecks:
It certainly seems like there aren't any obvious optimization targets that would give us huge wins here. It's all ~10%. I think the best candidates are likely 4, 5, and 7. Number 6 we should mostly have dealt with at this point, though there are still some allocations in
I tried using
Interesting... I wonder if
Something is still very weird here... Looking at the performance results in #114,
This is a differential flame graph comparing with and without
That's super weird. I don't understand why it's so much slower given that profile :/
Well both
Yeah, but the file is also larger. I guess there are more calls to the expensive drawing functions relative to the number of lines in the file because there is just less frame merging.
I benchmarked with
#99 identified that performance drops significantly when there are many frames in the output (see #99 (comment)). We should figure out why that is, and fix it! Maybe it's time to flamegraph inferno ;)