This is the recording of tuning attempts and performance notes about jfrs
- MacBook Air (M2, 2022)
- macOS Monterey 12.4
- Memory: 24GB
- Cores: 8
- JDK: Temurin-11.0.15+10
- Rust: 1.62.1
% cargo install flamegraph
% time java Example ~/develop/playground/jfr/test.jfr 10 > /dev/null
java Example ~/develop/playground/jfr/test.jfr 10 > /dev/null 2.47s user 0.18s system 239% cpu 1.110 total
% time target/release/example ~/develop/playground/jfr/test.jfr 10 > /dev/null
target/release/example ~/develop/playground/jfr/test.jfr 10 > /dev/null 26.29s user 1.62s system 99% cpu 28.099 total
- lol, x30 slower than Java implementation
- We can see that deserialization cost is the dominator
- Possible cause: Classes inside constant pool are always re-deserialized
- So we should pre-deserialize for known types in constant pool
To implement pre-deserialization, many parts should be re-written. Let's continue with changing to direct ValueDescriptor access for now, to see the max performance.
% time target/release/example ~/develop/playground/jfr/test.jfr 10 > /dev/null
target/release/example ~/develop/playground/jfr/test.jfr 10 > /dev/null 1.83s user 1.56s system 98% cpu 3.453 total
- Improved significantly, but still x3 slower than Java implementation
- This time, the dominant part is
read
from the underlying file. - Maybe better to read whole bytes as byte array for each chunk?
- This is what JDK's RecordingInput is doing: RecordingInput.java
For now, let's continue with just reading all file bytes and create Read
by Cursor
.
% time target/release/example ~/develop/playground/jfr/test.jfr 10 > /dev/null
target/release/example ~/develop/playground/jfr/test.jfr 10 > /dev/null 1.20s user 0.08s system 99% cpu 1.289 total
- Performance is now much close to Java baseline (still bit slow though)
- Now we can see growing the vector is taking significant time
- I found that there are some places that creating vector without specifying capacity, though the size is known at the creation time.
% time target/release/example ~/develop/playground/jfr/test.jfr 10 > /dev/null
target/release/example ~/develop/playground/jfr/test.jfr 10 > /dev/null 1.00s user 0.08s system 99% cpu 1.090 total
- Finally, performance is now same level as Java baseline
- We can see that memory alloc/free overhead are notable
- Can we improve further? Try arena-allocator or something?
Besides the allocation overhead, we can see that calculating hash consumes notable amount of time.
It's known that Rust's default hash algorithm is SipHash, which is robust against Hash DoS but it's not quite fast.
Obviously we don't have to care about DoS attack here so we can try more fast hash algorithm (https://github.com/rust-lang/rustc-hash)
% time target/release/example ~/develop/playground/jfr/test.jfr 10 > /dev/null
target/release/example ~/develop/playground/jfr/test.jfr 10 > /dev/null 0.57s user 0.09s system 99% cpu 0.663 total
TO BE CONTINUED