
Improve prefetch strategy #298

Merged (1 commit) on Oct 27, 2022

Conversation

XiaowenHu96
Contributor

The previous prefetch strategy performs suboptimally in the presence of many branches. This PR improves the strategy so that prefetching is only performed on traces that do not end with a branch instruction.

This improves performance on the bn128 benchmark by roughly 9-10% and has no negative impact on secp256k1.
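For readers less familiar with the assembly VM, a rough C sketch of the idea (the struct, field, and function names are mine, and __builtin_prefetch stands in for the prefetch done in assembly; this is not ckb-vm code):

#include <stddef.h>

/* Sketch only: prefetch the next trace's data only when the current
 * trace cannot branch away, since for a trace ending in a branch the
 * successor trace is not yet known. */
typedef struct {
    int ends_with_branch;   /* hypothetical flag set at decode time */
    const void *code;       /* decoded instructions for this trace */
} Trace;

static Trace traces[1 << 8];  /* hypothetical trace cache */

void end_of_trace(size_t cur, size_t next)
{
    if (!traces[cur].ends_with_branch) {
        /* Fall-through trace: its successor is known, so prefetch it. */
        __builtin_prefetch(&traces[next]);
    }
    /* ... dispatch into the next trace ... */
}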

mohanson requested a review from xxuejie on October 25, 2022 at 02:35
@xxuejie
Collaborator

xxuejie commented Oct 25, 2022

Can we get the benchmark results if we simply make a duplicate segment for .prepare_trace that is the same as .CKB_VM_ASM_LABEL_OP_CUSTOM_TRACE_END, meaning we do the same prefetching work in both places? I have some personal doubts about how the performance gain is obtained here.

@XiaowenHu96
Contributor Author

@xxuejie
Simply duplicating the code brings roughly a 4.5% performance improvement.

prefetch_only_at_end:   5.1614 +- 0.0235 seconds time elapsed  ( +-  0.46% )
prefetch_both_places:   5.4310 +- 0.0184 seconds time elapsed  ( +-  0.34% )
previous_prefetch:      5.6924 +- 0.0234 seconds time elapsed  ( +-  0.41% )

Care to elaborate more on why?

@xxuejie
Collaborator

xxuejie commented Oct 25, 2022

Can we gather more data for the 3 cases here, such as cache misses, (more importantly) branch misses, bad speculation, etc.?

I only have a hunch right now; it might be better explained once we have a more complete picture.

@XiaowenHu96
Contributor Author

@xxuejie
I could only find the following events available on my hardware; hope that helps. Note that prefetch_only_at_end outperforms prefetch_both_places by only a very small margin.

Just thinking out loud here:
It seems like the branch-miss rate is what affects the performance here. Although duplicating the code gives each copy its own branch-prediction state, the first four branch statements in prepare_trace should be fairly easy to predict anyway.
The first two check whether the cache is valid, which should be true most of the time given that the trace-hit rate is over 98%. The last two check whether the cycle count overflows, which is never triggered.
That leaves the dispatch at the end: NEXT_INST. I guess the dispatch gets a better prediction rate for the same reason as indirect threaded code?
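To make that concrete, here is a hypothetical C rendering of the shape I'm describing: four easy-to-predict conditional branches followed by an indirect dispatch. The names and checks are illustrative only, not the actual ckb-vm assembly.

#include <stdint.h>

typedef void (*op_handler)(void);

/* Hypothetical shape of .prepare_trace; illustrative names only. */
int prepare_trace_shape(uint64_t pc, uint64_t cached_pc, uint64_t hash,
                        uint64_t cycles, uint64_t delta, uint64_t max_cycles,
                        op_handler next_inst)
{
    /* Checks 1-2: is the cached trace valid? True ~98% of the time,
     * so these branches predict well. */
    if (cached_pc != pc) return -1;
    if (hash == 0) return -1;

    /* Checks 3-4: cycle overflow / limit; never taken in these runs. */
    if (cycles + delta < cycles) return -2;
    if (cycles + delta > max_cycles) return -3;

    /* NEXT_INST: the indirect jump whose prediction rate seems to be
     * what actually differs between the variants. */
    next_inst();
    return 0;
}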

prefetch_only_at_end:
       767,280,290      L1-dcache-load-misses:u                                       ( +-  0.22% )
       192,412,682      branch-misses:u                                               ( +-  0.35% )
       783,254,655      l2_cache_accesses_from_dc_misses:u                            ( +-  0.20% )
       780,880,531      l2_cache_hits_from_dc_misses:u                                ( +-  0.20% )

            5.2789 +- 0.0168 seconds time elapsed  ( +-  0.32% )

prefetch_both_places:
       763,709,600      L1-dcache-load-misses:u                                       ( +-  0.28% )
       205,590,617      branch-misses:u                                               ( +-  0.36% )
       780,789,354      l2_cache_accesses_from_dc_misses:u                            ( +-  0.30% )
       775,347,232      l2_cache_hits_from_dc_misses:u                                ( +-  0.28% )

            5.4173 +- 0.0198 seconds time elapsed  ( +-  0.37% )

previous_prefetch:
       764,343,211      L1-dcache-load-misses:u                                       ( +-  0.37% )
       267,962,401      branch-misses:u                                               ( +-  1.66% )
       780,868,555      l2_cache_accesses_from_dc_misses:u                            ( +-  0.36% )
       777,054,155      l2_cache_hits_from_dc_misses:u                                ( +-  0.36% )

            5.7112 +- 0.0397 seconds time elapsed  ( +-  0.70% )

@xxuejie
Collaborator

xxuejie commented Oct 26, 2022

Here's my impression:

  • .CKB_VM_ASM_LABEL_OP_CUSTOM_TRACE_END is triggered by traces that do not end with a branch instruction
  • .prepare_trace is more likely triggered by traces that end with a branch instruction

For traces that do not end with a branch instruction, execution will more likely run through the full section of code and reach the NEXT_INST part, while for traces that end with a branch instruction, I guess a significant portion of the cases will branch to .exit_trace for processing (or maybe not? see the questions below). Previously, when we mixed the 2 cases together, the code likely confused the branch predictor, which explains why simply duplicating the code increases performance.

Note that we have a tradeoff between branch prediction and code size here:

  • On the one hand, duplicating code like this does help with performance. Beyond the example above, notice that NEXT_INST is actually implemented as a macro and is duplicated for each opcode; this is also done to aid branch prediction (see the sketch after this list). Another example: a handful of RISC-V instructions all map to the single ADDI opcode; if we duplicated ADDI for each of those RISC-V instructions, I bet it would also help with branch prediction (note I mean branch prediction here; performance is a mixture of many things).
  • On the other hand, we do want to keep the generated code size small, hoping the whole assembly VM can fit in the I-cache for better performance. Simply duplicating opcodes might also have a negative impact on the I-cache, resulting in worse performance. I'm already quite concerned about the code used to handle load/store operations; the code size is already bloated there.
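A minimal computed-goto sketch (GCC/Clang labels-as-values, with made-up opcodes rather than ckb-vm code) of what duplicating NEXT_INST buys: every opcode ends with its own copy of the indirect jump, so each jump site gets its own branch-target prediction state, at the cost of extra code size.

#include <stdio.h>

int run(const unsigned char *ops, int n)
{
    static void *dispatch[] = { &&op_add, &&op_sub, &&op_halt };
    int acc = 0, i = 0;

    #define NEXT_INST goto *dispatch[ops[i++]]   /* duplicated per opcode */

    NEXT_INST;
op_add:
    acc += 1;
    NEXT_INST;        /* each opcode ends with its own indirect jump */
op_sub:
    acc -= 1;
    NEXT_INST;
op_halt:
    (void)n;
    return acc;
}

int main(void)
{
    unsigned char prog[] = { 0, 0, 1, 2 };   /* add, add, sub, halt */
    printf("%d\n", run(prog, 4));            /* prints 1 */
    return 0;
}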

So while duplicating .prepare_trace as shown in this PR can indeed help with performance, I do want to point out that we should be cautious about duplicating code, and also pay attention to the overall compiled code size for cache reasons.

That being said, these are only part of the answers, and there are still questions for which I do not have good answers:

  • Consider traces that end with branch instructions. From the above analysis, a significant portion of them must be running into .exit_trace in order to produce the performance differences; does that mean our current trace slot calculation algorithm is not good enough? In general, the traces should be able to capture most of the code executed after a hot loop has run for a while. It is very strange that branching could cause so many cases where we need to exit the assembly VM. To me that's a sign that traces are not handling branches well enough.
    • EDIT: On second thought, you did mention above that the trace hit rate is 98%, so it could be that this theory is wrong and the differences in performance are due to prediction in NEXT_INST. For branch instructions, we don't really prefetch the branch targets; maybe this is what causes the difference?
  • After duplicating the code, why does removing prefetching from .prepare_trace still reduce so many branch misses? I can see why cache misses would increase since we are prefetching less data, but why does this affect branch misses?

Personally, I still believe that prefetching in .prepare_trace can be helpful: branch instructions typically have 2 targets, and when the branch condition is not met, execution falls through just like a normal trace without a branch instruction. It just doesn't click for me why adding prefetching here actually hurts performance.

@XiaowenHu96
Contributor Author

I agree with the theory that duplicating the code makes NEXT_INST more friendly to the branch predictor: each trace has to end with either .CKB_VM_ASM_LABEL_OP_CUSTOM_TRACE_END or .prepare_trace, which makes those two entries work like central dispatch points, much as a regular switch statement is compiled. Therefore, a performance improvement from better branch prediction is not surprising, for the same reason indirect threaded code helps.
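For illustration, this is the "compiled like a switch" shape in plain C, where every made-up opcode funnels through one shared dispatch site, in contrast to the duplicated per-opcode dispatch sketched earlier (again just a sketch, not ckb-vm code):

#include <stdio.h>

/* Single shared dispatch site: every opcode returns to the same
 * switch, so one indirect jump must predict all opcode transitions. */
static int run_switch(const unsigned char *ops, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++) {
        switch (ops[i]) {
        case 0: acc += 1; break;   /* add */
        case 1: acc -= 1; break;   /* sub */
        case 2: return acc;        /* halt */
        }
    }
    return acc;
}

int main(void)
{
    unsigned char prog[] = { 0, 0, 1, 2 };   /* add, add, sub, halt */
    printf("%d\n", run_switch(prog, 4));     /* prints 1 */
    return 0;
}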

I think the most scientific way forward is to duplicate the code and leave the prefetch in both places. As for the code size, we can reevaluate when it becomes a problem. What do you think?

@xxuejie
Collaborator

xxuejie commented Oct 27, 2022

I'm not so sure about this; the question still remains: why do branch misses decrease when we remove prefetching from .prepare_trace? It's expected to see more cache misses when less data is prefetched (there are still branches that choose the non-taken path, which could be prefetched just like in .CKB_VM_ASM_LABEL_OP_CUSTOM_TRACE_END), but a decrease in branch misses is strange behavior here. I'm concerned we might be jumping to a conclusion too soon and missing something.

So if we really need that 5% performance, we can duplicate the code. Otherwise I would suggest more digging into this problem.

@XiaowenHu96
Contributor Author

I reran the benchmarks a couple more times and got similar results. BTW, I'm using an AMD chip (a 6800H). I'll collect the scripts and upload them later tonight; if you have access to an Intel chip, that could help narrow down the issue.

@XiaowenHu96
Contributor Author

One thing I do notice from my perf results is that the cache-miss counts seem very unstable.

prefetch_only_at_end:
       688,702,661      L1-dcache-load-misses:u            ( +-  0.92% )
       166,575,245      branch-misses:u                    ( +-  0.69% )
       704,911,315      l2_cache_accesses_from_dc_misses:u ( +-  1.04% )
       703,205,885      l2_cache_hits_from_dc_misses:u     ( +-  1.03% )
            4.9158 +- 0.0185 seconds time elapsed  ( +-  0.38% )

prefetch_both_places:
       757,952,124      L1-dcache-load-misses:u            ( +-  1.13% )
       200,547,345      branch-misses:u                    ( +-  0.55% )
       775,782,975      l2_cache_accesses_from_dc_misses:u ( +-  1.13% )
       773,690,973      l2_cache_hits_from_dc_misses:u     ( +-  1.13% )
            5.1698 +- 0.0210 seconds time elapsed  ( +-  0.41% )

previous_prefetch:
       695,961,190      L1-dcache-load-misses:u            ( +-  1.27% )
       253,162,883      branch-misses:u                    ( +-  1.37% )
       711,746,321      l2_cache_accesses_from_dc_misses:u ( +-  1.27% )
       708,690,866      l2_cache_hits_from_dc_misses:u     ( +-  1.27% )
            5.4425 +- 0.0306 seconds time elapsed  ( +-  0.56% )

prefetch_only_at_end:
       671,553,864      L1-dcache-load-misses:u            ( +-  0.83% )
       169,950,009      branch-misses:u                    ( +-  0.67% )
       686,206,649      l2_cache_accesses_from_dc_misses:u ( +-  0.83% )
       685,084,320      l2_cache_hits_from_dc_misses:u     ( +-  0.84% )
            4.9425 +- 0.0150 seconds time elapsed  ( +-  0.30% )

prefetch_both_places:
       656,456,309      L1-dcache-load-misses:u            ( +-  1.21% )
       196,797,811      branch-misses:u                    ( +-  0.40% )
       670,932,194      l2_cache_accesses_from_dc_misses:u ( +-  1.19% )
       669,830,672      l2_cache_hits_from_dc_misses:u     ( +-  1.19% )
            5.1207 +- 0.0208 seconds time elapsed  ( +-  0.41% )

previous_prefetch:
       711,602,003      L1-dcache-load-misses:u            ( +-  1.90% )
       235,210,189      branch-misses:u                    ( +-  1.77% )
       727,782,577      l2_cache_accesses_from_dc_misses:u ( +-  1.91% )
       724,432,774      l2_cache_hits_from_dc_misses:u     ( +-  1.87% )
           5.4903 +- 0.0401 seconds time elapsed  ( +-  0.73% )

@xxuejie
Collaborator

xxuejie commented Oct 27, 2022

On a more careful look, I think I was reading the earlier numbers wrong. In all 3 benchmarks shown above, prefetch_only_at_end has the lowest branch misses; I must've misread something, my apologies.

In this case I think the numbers match the theory: in the .prepare_trace case, it's more likely that a branch instruction will take the branch instead of falling through, which means prefetching in .prepare_trace puts unnecessary extra load on memory.

With that, I think this PR is good to go.

mohanson merged commit 5c464b3 into nervosnetwork:develop on Oct 27, 2022
XiaowenHu96 added a commit to XiaowenHu96/ckb-vm that referenced this pull request Oct 27, 2022
mohanson pushed a commit to mohanson/ckb-vm that referenced this pull request Nov 11, 2022