Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[otbn] Prefetch stage proposal #8898

Closed
GregAC opened this issue Oct 26, 2021 · 6 comments
Closed

[otbn] Prefetch stage proposal #8898

GregAC opened this issue Oct 26, 2021 · 6 comments
Assignees

Comments

@GregAC
Copy link
Contributor

GregAC commented Oct 26, 2021

To implement blanking for the register file in OTBN it is necessary to supply the register file inputs directly from flops. Otherwise glitches from the decode logic may produce the power signatures the blanking is aiming to prevent.

To achieve this in OTBN's single stage design an extra pre-fetch stage needs to be added. This will fetch the next instruction a cycle earlier to allow some pre-decode logic to determine the register file inputs for instruction so they can flopped for the instruction execution.

To minimize disruption/change with the existing RTL this can be implemented within the otbn_instruction_fetch module with only minor interface modifications:

  • New outputs for the flopped result of the pre-decode
  • Information from the loop controller providing loop start and end address and remaining loop iterations
  • Stop controller fetching when an instruction is stalled

A new set of flops will be added to otbn_instruction_fetch holding the address of a prefetched instruction. The address of the incoming fetch request insn_fetch_req_addr will be compared against the flopped prefetch address. If they match the incoming prefetched instruction data from imem will be provided as the insn_fetch_resp_data output the following cycle (having been flopped within otbn_instruction_fetch. This effectively matches the behaviour of the existing instruction memory interface other than it's possible for a request to get an invalid response (so the instruction data must be ignored and no instruction executed).

The next address to prefetch will be decided as follows:

  • If the currently prefetched instruction is a jump/branch, prefetch nothing (resulting in an invalid prefetched instruction next cycle)
  • If the currently prefetched address is the end of a loop and the loop isn't on the final iteration prefetch the first instruction of the loop
  • Otherwise prefetch the next instruction (current prefetch address + 4)

Design Impact

Minimal change will be required outside of otbn_instruction_fetch. Looking at the RTL OTBN should already behave appropriately when a fetched instruction returns an invalid response, this is just something that cannot occur in the current design. Area impact is insignificant and it should improve timing.

Verification Impact

New coverpoints will be required for the cases where invalid instructions are seen (only after a branch). It is likely the existing RIG will hit these. Minor ISS changes will be required to deal with the new instruction latencies

Performance Impact

BN.SID and BN.MOVR will have to become two cycle instructions as the output of the base register file cannot be fed directly into the bignum register file. BN.LID doesn't need an extra cycle as it already takes two cycles with the bignum register file write ocurring in the second cycle.

BN.MOVR is heavily used in our current RSA encode and decode (26% of instructions in rsa_1024_enc_test and 9% of instructions in rsa_1024_dec_test are BN.MOVR). BN.SID and jump instructions are also present but far less frequent. With the proposal above we are in danger of missing our performance targets for RSA without either reduction of BN.MOVR usage in the software or improving BN.MOVR performance with the prefetch stage.

One possibility is to only blank on the bignum register file and perform the base register file read in the prefetch stage. This would mean no extra cycle for BN.MOVR or BN.SID at the cost of more complexity (in particular a forwarding path or extra stalling will be required to deal with the new data hazards). For verification extra coverpoints would be needed to check for the various hazarding cases. RIG changes may be needed to hit the coverpoints but the ISS shouldn't be effected.

p256 & p384 sign & verify make minimal use of the impacted instructions so performance reduction will be insignificant.

@GregAC
Copy link
Contributor Author

GregAC commented Oct 26, 2021

@felixmiller @imphil I'd be particular interested to hear if that's any easy way to reduce the BN.MOVR usage within the RSA code.

@moidx moidx added the Hotlist:Security Security Opinion Needed label Oct 26, 2021
@rswarbrick
Copy link
Contributor

@GregAC: Do you have any data on which bn.movr instructions are in the "tight loops" in the RSA operations that take so long? (Alternatively: what's the easiest way to reproduce what you're seeing?)

@GregAC
Copy link
Contributor Author

GregAC commented Oct 27, 2021

The stats output from the ISS gives frequency of various function calls. I've taken a look at that to see which most heavily used functions use bn.movr. cond_sub_to_reg, mont_loop and cond_sub_mod (within modexp.s) are all heavily used and all use bn.movr.

bn.movr is used so each of those functions can work with varying limb sizes, additionally cond_sub_to_reg has a regfile pointer parameter so the varying limbs of the input can live anywhere in the bignum register file. mont_loop and cond_sub_mod used a fixed starting point in the bignum register file.

There is a limited set of what we'd probably use for number of limbs, 2, 4, 8, 12 corresponding to 512,1024,2048,3072 so we could just do unrolled versions of mont_loop and cond_sub_mod choosing which to use depending on the N, potentially even done in a duffs device style with one unrolled loop you jump into at different points depending upon N (or maybe leave at different points depending on N).

It looks like cond_sub_to_reg is only called from one place, so practically can also work with a fixed starting point rather than needing the register 'pointer' it currently has.

mont_loop might get a big big unrolled, 14 instructions in the loop body (though I think would still happily fit in the 4k with the current modexp.s), cond_sub_mod and cond_sub_to_reg are smaller loops with 5 and 6 instructions in the loop body respectively.

mont_loop is also used less than cond_sub_mod and cond_sub_to_reg so the performance penalty for the bn.movr may not matter as much.

I think I can hack up modexp.s to use an unrolled loop with a fixed limb size and see what we get.

@GregAC
Copy link
Contributor Author

GregAC commented Oct 27, 2021

I was wrong about the mont_loop usage frequency, actually the same as cond_sub_to_reg for decrypt.

After a little hacking on modexp.s I discovered compute_rr had a big loop that generated lots of movr too.

I tried replacing the inner loop in compute_rr and cond_sub_to_reg and cond_sub_mod with unrolled versions hardcoded for 8 limbs and it seems to do the job. I also found an pointless movr in cond_sub_to_reg.

Without touching mont_loop I can greatly decrease the use of movr so we can stay within performance requirements (3.7% of instructions in 2048 decrypt and 6.0% of instructions in 2048 encrypt), at the cost of increased imem usage (348 bytes to a total size of 1700 for my hacky test). That will increase with adding multiple limb size support but we've got plenty of spare space.

I would say it's a shame the concept of a movr within a loop carries this performance penalty though. An 'easy' way around it would be for the movr to use dedicated source and destination index registers (say mapped as CSRs) which we read and write within the prefetch stage. Though this has obvious knock on effects on verification and existing software.

@tjaychen
Copy link

@tjaychen

@GregAC
Copy link
Contributor Author

GregAC commented Nov 1, 2021

Following discussion in the security meeting we will be going ahead with implementing this.

We'll implement the initial proposal without any extras to improve movr performance as we don't feel the performance impact is of significance (in part because it can be worked around in critical cases).

@GregAC GregAC closed this as completed Nov 1, 2021
@moidx moidx removed the Hotlist:Security Security Opinion Needed label Nov 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants