Improve prefetch strategy #298
Conversation
Can we get the benchmark result if we simply make a duplicate segment for …?
@xxuejie
Care to elaborate more on why?
Can we gather more data for the 3 cases here? Such as cache misses, (more importantly) branch misses, bad speculations, etc. I only have a hunch right now; it might be better explained if we can get a more complete picture.
@xxuejie Just thinking out loud here:
Here's my impression:
For traces that do not end with a branch instruction, execution will more likely run through the full section of code, resulting in …. Note that we have a tradeoff between branch prediction and code size here:
So while duplicating …. That being said, these are only part of the answers, and there are still questions that I do not have good answers to:
Personally I still believe that prefetching in …
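The tradeoff described above can be sketched with a toy model (illustrative only; the function and the trace stream are made up, not taken from the PR): count how many fall-through prefetches get discarded under an "always prefetch" strategy versus a strategy that only prefetches after traces that do not end with a branch.

```rust
// Toy model: for each trace we know whether it ends with a branch and
// whether that branch was taken. "Always prefetch" guesses the
// fall-through target after every trace, so every taken branch wastes
// one prefetch. The conditional strategy never guesses after a branch.
fn discarded_prefetches(ends_with_branch: &[bool], branch_taken: &[bool]) -> (usize, usize) {
    let mut always = 0; // old strategy: prefetch fall-through after every trace
    let conditional = 0usize; // new strategy: never prefetches after a branch
    for i in 0..ends_with_branch.len() {
        if ends_with_branch[i] && branch_taken[i] {
            // The fall-through guess was wrong: the prefetched trace is discarded.
            always += 1;
        }
    }
    (always, conditional)
}

fn main() {
    // Hypothetical stream: traces 2, 4, 5 end with branches; 2 and 5 are taken.
    let ends = [false, false, true, false, true, true];
    let taken = [false, false, true, false, false, true];
    let (old, new) = discarded_prefetches(&ends, &taken);
    println!("old strategy discarded {}, new strategy discarded {}", old, new);
}
```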
I agree with the theory that duplicating the code makes …. I think the most scientific way forward is to duplicate the code and leave the prefetch in both places. As for the code size, we can re-evaluate when it becomes a problem. What do you think?
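A minimal sketch of the "duplicate the code, keep prefetch in both places" idea (assumed shape, not the actual PR code; all names are hypothetical): instead of one execution routine that checks `ends_with_branch` inside the hot body, emit two specialized copies, so the conditional is hoisted into a single dispatch and each copy stays branch-free internally.

```rust
// Specialized copy for traces that fall through: the prefetch of the
// next trace would live here, and its fall-through guess is always right.
fn run_fall_through_trace(pc: usize) -> usize {
    // ... execute the trace body ...
    pc + 1
}

// Specialized copy for traces ending with a branch: per the proposal
// above, a prefetch could also be kept here, accepting that the
// fall-through guess is sometimes wrong.
fn run_branch_trace(_pc: usize, target: usize) -> usize {
    // ... execute the trace body ...
    target
}

// One dispatch decision per trace instead of a check inside the body.
fn dispatch(pc: usize, ends_with_branch: bool, target: usize) -> usize {
    if ends_with_branch {
        run_branch_trace(pc, target)
    } else {
        run_fall_through_trace(pc)
    }
}

fn main() {
    println!("{}", dispatch(10, false, 0)); // falls through to 11
    println!("{}", dispatch(11, true, 4)); // branches to 4
}
```

The cost is the duplicated body (code size), which is the part the comment above suggests re-evaluating only if it becomes a problem.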
I wasn't so sure about this; the question still remains: why do branch misses decrease when we remove prefetching from …? So if we really need that 5% performance, we can duplicate the code. Otherwise I would suggest digging further into this problem.
I reran the benchmarks a couple more times and got similar results. BTW, I'm using an AMD chip (6800H). I'll collect the scripts and upload them later tonight; if you have access to an Intel chip, that could help narrow down the issue.
One thing I do notice from my perf result is that the counting of cache misses seems very unstable.
A more careful look: I think I was reading the earlier numbers wrong. In all 3 benches shown above, …. I think the numbers match the theory: in the case of …. I think this PR is good to go.
The previous prefetch strategy performs sub-optimally in the presence of many branches. This PR improves the strategy so that prefetch is only performed on traces that do not end with a branch instruction.
This improves performance on the bn128 benchmark by roughly 9-10% and has no negative impact on secp256k1.
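The strategy described above can be sketched as follows (a simplified model, not the actual ckb-vm implementation; the `Vm`/`Trace` types, the decode rule, and the counters are all made up for illustration): the next trace is decoded ahead of time only when the current trace cannot branch away, so the prefetch guess is always correct when it is made.

```rust
struct Trace {
    start_pc: usize,
    ends_with_branch: bool,
}

struct Vm {
    pc: usize,
    prefetched: Option<Trace>,
    prefetch_hits: usize,
    decodes: usize,
}

impl Vm {
    fn decode_trace(&mut self, pc: usize) -> Trace {
        self.decodes += 1;
        // Hypothetical decode rule: every 4th trace ends with a branch.
        Trace { start_pc: pc, ends_with_branch: pc % 4 == 3 }
    }

    fn fetch_trace(&mut self, pc: usize) -> Trace {
        // Use the prefetched trace when it matches the actual pc.
        if let Some(t) = self.prefetched.take() {
            if t.start_pc == pc {
                self.prefetch_hits += 1;
                return t;
            }
        }
        self.decode_trace(pc)
    }

    fn step(&mut self) {
        let trace = self.fetch_trace(self.pc);
        let next_pc = self.pc + 1;
        // Prefetch the fall-through trace only when this trace cannot
        // branch away: the guess is then always correct.
        if !trace.ends_with_branch {
            let t = self.decode_trace(next_pc);
            self.prefetched = Some(t);
        }
        // Hypothetical "execution": a branch jumps back to 0, otherwise fall through.
        self.pc = if trace.ends_with_branch { 0 } else { next_pc };
    }
}

fn main() {
    let mut vm = Vm { pc: 0, prefetched: None, prefetch_hits: 0, decodes: 0 };
    for _ in 0..16 {
        vm.step();
    }
    // Every prefetch performed is consumed; no speculative work is wasted.
    println!("decodes={} prefetch_hits={}", vm.decodes, vm.prefetch_hits);
}
```

In this model every prefetched trace is used, which matches the PR's rationale: skipping prefetch on branch-ending traces avoids speculating on a fall-through target that a taken branch would invalidate.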