[otbn,simd] Add RTL of SIMD instructions implemented in BN ALU#29344
[otbn,simd] Add RTL of SIMD instructions implemented in BN ALU#29344vogelpi merged 2 commits intolowRISC:masterfrom
Conversation
93e5bdf to
ca2aa31
Compare
andrea-caforio
left a comment
There was a problem hiding this comment.
This is cool @etterli. I focused on the first commit because that's where
the math is. ;-)
| * X0 = X[31:0], X1 = X[63:32], ..., X7 = X[255:224], same for Y | ||
| * Di = Decision by carry bits CXi and CYi | ||
| * | ||
| * D7 D7 D6 D7 D0 |
There was a problem hiding this comment.
Correct me if I'm wrong but is the first stage (decision) of this diagram
part of this module because the carry bits are generated externally?
There was a problem hiding this comment.
Also is D7 D0 correct as the inputs to the first decider stage?
There was a problem hiding this comment.
The diagram does only show the selection stage. The decision stage is not depicted. And yes, this module is closely related to the actual adders in the bignum alu. I factored it out to hide some complexity (it is not much to be honest). The decision bits are based upon the actual carry bits from the two adders, so yes, externally.
D7 and D0 are the correct inputs to the selection MUX for the lowest 32 bits. Because depending on the ELEN (either 32 bit or 256 bit), this chunk must use the decision for chunk 0 (D0) in case ELEN = 32 or the selection must be based upon D7 which is the decision if we are operating on 256 bits. In the 256 bit case, the MSB carry decides for all chunks which result to take.
| * The otbn_alu_bignum calculates pseudo modulo addition and subtraction by using two adders and | ||
| * evaluating their carry bits. Depending on the carry bits adder X or Y is selected as result. | ||
| * | ||
| * For addition, subtract mod if a + b >= mod: |
There was a problem hiding this comment.
Isn't this module in a way independent of the modulus? Because it simply
multiplexes some vector elements. So I'm not sure why the modulus
is mentioned here?
There was a problem hiding this comment.
I see your point. However, the whole selection logic makes only sense if it is put in context of the two adders and what they compute. I don't think this module makes any sense in any other standalone use.
Would it help if the header would introduce this context?
| * - Adder X calculates X = a + b | ||
| * - Adder Y calculates Y = X - mod | ||
| * | ||
| * - If X generates a carry: |
There was a problem hiding this comment.
I know what is meant here but it still slightly confusing to use the term "carry" here.
It is a decision bit that indicates whether a value is in the interval [0, mod-1] or
[mod, 2*mod-1].
There was a problem hiding this comment.
This ties in with my comment on mentioning the modulus here even though
the module is independent of it.
There was a problem hiding this comment.
In the current naming, the decision bit is the bit carrying the information what this evaluation resulted in (0 to take result X, 1 to take result Y). The signal which is referred to here is the actual carry bit of the adder X (which is computed externally)..
| * | ||
| * For subtraction, this stage generates an additional signal whether any vector element uses the | ||
| * result of adder Y. This signal is used for MOD integrity checks and blanking assertions. For | ||
| * addition this signal is always set as the carries of Y are used for the decisions. |
There was a problem hiding this comment.
I don't understand this. So this additional signal is only used in the subtraction case for
some security checks? Why not unconditionally set it to 1 like for addition?
There was a problem hiding this comment.
I followed the behaviour of the current OTBN. I do not know the design rationale for making this check dependent on the result. Let's discuss this offline.
|
|
||
| // Vector element length type for bignum vec ISA implemented in BN ALU for | ||
| // bn.addv(m), bn.subv(m) and bn.shv. | ||
| // The ISA forsees only 4 types (16 to 128 bits). However, only a subset is implemented. |
There was a problem hiding this comment.
With "4 types (16 to 128 bits)" you mean 16, 32, 64 and 128?
There was a problem hiding this comment.
Yes. But this line was updated in the last force push.
| ) ( | ||
| input logic [LVLEN-1:0] operand_a_i, | ||
| input logic [LVLEN-1:0] operand_b_i, | ||
| input logic operand_b_invert_i, |
There was a problem hiding this comment.
This is the indicator bit for performing a subtraction, why not call it like that?
There was a problem hiding this comment.
Because executing a subtraction also requires to set the carry in accordingly (to 1, such that the two's complement is correctly computed). This signal only controls whether the operand B should be inverted or not. Could be useful if we want to use a one's complement (but I don't think so).
| * | ||
| * This carry chaining allows to compute additions over multiples of LVChunkLEN wide elements | ||
| * including the full vector width (i.e., a non vectorized addition). To perform subtraction the | ||
| * input B can be inverted and all carries must be set to 1 as: a - b = a + ~b + 1. |
There was a problem hiding this comment.
This sounds like the caller has to invert B in the subtraction case but it is handled in this
module?
There was a problem hiding this comment.
Would something like this be more clear:
A subtraction can be performed by setting the operand_b_invert_i signal and the input carries to one because: a - b = a + ~b + 1.
| /** | ||
| * OTBN vectorized shifter | ||
| * | ||
| * This shifter is capable of shifting vectors elementwise as well as concatenate and shift 256 |
There was a problem hiding this comment.
Maybe you can mention somewhere that these are logical shifts as opposed
to arithmetic ones, which are only supported for the GPR registers.
| * This module transposes the elements of two input vectors in two different ways. | ||
| * It supports 32b, 64b and 128b element lengths. | ||
| * | ||
| * If there are two vectors with 4 elements the transpositions are as follows: |
There was a problem hiding this comment.
Here you should mention that trn1 interleaves even coordinates and trn2 odd ones otherwise
the word transposition is a bit misleading.
There was a problem hiding this comment.
Thanks, this is indeed more clear. I updated it.
ca2aa31 to
c64a89a
Compare
38f6dd0 to
e26ec29
Compare
c3a5c09 to
d0d5835
Compare
| alu_bignum_adder_y_op_a_en = 1'b1; | ||
| alu_bignum_adder_y_op_shifter_en = 1'b1; | ||
| flags_adder_update[flag_group] = 1'b1; | ||
| rf_ren_a_bignum = 1'b1; |
There was a problem hiding this comment.
I haven't reviewed in depth whether for every instruction you enable only those parts of the ALU which are really needed. How confident are you that you only enable what is really needed?
There was a problem hiding this comment.
I am pretty confident. I just checked this again.
There was a problem hiding this comment.
Okay thank you! Also with the reworked output mux, we would now get errors if two data paths were active at the same time.
There was a problem hiding this comment.
When reworking the mux I also added an assertion which checks that only one path is active at the same time.
| * \-----------------------------------------------------------------/ | ||
| * \---------------------------------------------------------------/ | ||
| * | | ||
| * operation_result_o |
There was a problem hiding this comment.
Right now, operation_result_o is implemented using a unique case. But if you look at this figure, you can see that at least of right most inputs, only ever one input can be non-zero. So at least this part of the mux could be implemented using a OR tree, ran than a real MUX. This can lead to a notable area reduction, because the MUX is wide.
However, I am not sure if synthesis is smart enough to figure that out given all the pre-decoding and blanking. You may want to add a blanker again to the "Y result" and the "MOD result" inputs (the latter may not be needed) just for this purpose. And then take the operation result mux out of the unique case and manually implement it using a bitwise OR tree.
There was a problem hiding this comment.
Yes, this is a smart optimization.
What do you think about ORing the 3 non-arithmetic results and then feeding this combined signal into a 3-to-1 MUX? This way we can save two blankers and their control overhead but still reduce the MUX width. But probably two blankers and one wide OR is more efficient.
There was a problem hiding this comment.
The 3-to-1 MUX requires around 1.9kGE whereas with blankers we are at 2.1kGE (excluding the FF and other logic). Both are better than a full MUX which is around 2.4kGE. Estimated using nangate45 values and 2 input gates only. Maybe there are more efficient multi-input gates but it seems to be around the same.
Implemented to 3to1 option for now.
There was a problem hiding this comment.
We discussed this offline. By slightly modifying the mod_result_selector module, we can convert one of the 256-bit multiplexer inputs in the final multiplexer into an 8-bit multiplexer which can even be predecoded. So we'll manage to save one input in the final multiplexer (the one from Adder Y, the result of Adder Y will then always go through the mod_result_selector) and the final multiplexer can be implemented with an OR tree, which will simplify timing optimization during synthesis (the critical path goes through the shifter and followed by Adder Y).
d0d5835 to
b40f17c
Compare
andreaskurth
left a comment
There was a problem hiding this comment.
Nice work, @etterli! Please find a couple of suggestions and questions below. Overall, I don't see any blocking problems, though. 👍
| // The vector chunk length. This defines the width of the internal adders. | ||
| parameter int LVChunkLEN = VChunkLEN, | ||
| // The number of vector chunks, i.e., the number of adders. | ||
| localparam int LNVecProc = LVLEN / LVChunkLEN |
There was a problem hiding this comment.
Suggest adding an ASSERT_INIT to ensure that this divides without remainder
There was a problem hiding this comment.
Good idea. Added it
| assign op_b = operand_b_invert_i ? ~operand_b_i[i_adder * LVChunkLEN+:LVChunkLEN] | ||
| : operand_b_i[i_adder * LVChunkLEN+:LVChunkLEN]; | ||
|
|
||
| // Do the addition and update carry flag |
There was a problem hiding this comment.
Suggest expanding this to:
// Compute op_a + op_b + carry_in using a two-operand addition.
// By appending 1'b1 and carry_in as the LSBs, the addition of the
// LSB position (1 + carry_in) generates a carry into the upper bits
// exactly when carry_in is set, so result[LVChunkLEN:1] = op_a + op_b + carry_in
// and result[LVChunkLEN+1] is the carry out. The LSB of result is unused.
There was a problem hiding this comment.
Thanks, added this.
| AluElen32: begin | ||
| alu_adder_carry_sel_bignum = 1'b1; | ||
| alu_shift_mask_bignum = (32'd1 << (32 - alu_shift_amt_bignum[4:0])) - 32'd1; | ||
| end | ||
| AluElen256: begin | ||
| alu_adder_carry_sel_bignum = 1'b0; | ||
| alu_shift_mask_bignum = {32{1'b1}}; | ||
| end | ||
| default: begin // same as 256b | ||
| alu_adder_carry_sel_bignum = 1'b0; | ||
| alu_shift_mask_bignum = {32{1'b1}}; | ||
| end |
There was a problem hiding this comment.
This code feels a bit brittle as it will break when VChunkLEN != 32, right? I think using that parameter and $clog2(VChunkLEN)-1 instead of the hard-coded indices/sizes could make the code work with other values.
AluElen32 has the 32 even in the enum name - potentially worth renaming to AluElenVChunkLEN (and AluElen256 could become AluElenWLEN)?
There was a problem hiding this comment.
Hmm, I tried to generalize as much as possible but there are still a few places where stuff will break if VChunkLEN is changed (generalizing everything is far from trivial / sometimes not even possible). So I don't think it is worth to generalize this. It will just make it even more complex to read. Also, the only reason why VChunkLEN should change is when 16-bit elements should be implemented. Then this part must anyway be touched.
I would like to refrain from renaming it because in some places, e.g., in the transposer, it is assumed that this type represents the 32 bit case.
There was a problem hiding this comment.
Also, generating the carry control signal based upon the parameter would require us to define all NVecChunk bits because these are also predecoded. Right now, we only need 1 bit instead of 8 bits.
| AluOpBignumTrn1: begin | ||
| expected_trn_en = 1'b1; | ||
| expected_trn_is_trn1 = 1'b1; | ||
| end | ||
| AluOpBignumTrn2: begin | ||
| expected_trn_en = 1'b1; | ||
| expected_trn_is_trn1 = 1'b0; | ||
| end |
There was a problem hiding this comment.
This could be simplified:
AluOpBignumTrn1,
AluOpBignumTrn2: begin
expected_trn_en = 1'b1;
expected_trn_is_trn1 = operation_i.op == AluOpBignumTrn1;
end| expected_shift_en = 1'b1; | ||
| expected_shift_right = operation_i.shift_right; | ||
| end | ||
| AluOpBignumSubv: begin |
There was a problem hiding this comment.
This could also be implemented in the same case as AluOpBignumSub. Differences are in:
expected_adder_y_carries_topadder_update_flags_en_rawexpected_shift_right
Here the benefit I see is less in the reduced amount of code and more in bundling together what belongs together and make the differences between scalar and vector explicit.
| expected_x_res_operand_a_sel = 1'b1; | ||
| expected_shift_mod_sel = 1'b0; | ||
| end | ||
| AluOpBignumAddvm: begin |
There was a problem hiding this comment.
Also this could be implemented in the same case as AluOpBignumAddm. Only expected_adder_y_carries_top needs to be differentiated.
There was a problem hiding this comment.
Merged these cases together.
| expected_shift_en = 1'b1; | ||
| expected_shift_right = operation_i.shift_right; | ||
| end | ||
| AluOpBignumAddv: begin |
There was a problem hiding this comment.
Similar to the suggestion for Sub, and same differences.
| adder_update_flags_en_raw = 1'b0; | ||
| logic_update_flags_en_raw = 1'b0; |
There was a problem hiding this comment.
Is it worth adding an assertion checking that vector operations don't update adder or logic flags? That's an invariant in the current architecture, and if the code is ever changed to violate that invariant, that can result in bugs that are pretty hard to root-cause. Such an assertion would catch this explicitly.
The assertion may be as simple as
ASSERT(VecOpsNoFlagUpdate_A, is_vec_op |-> !adder_update_flags_en_raw && !logic_update_flags_en_raw)(where the is_vec_op helper signal has to be defined - e.g., with an inside construct.)
There was a problem hiding this comment.
Such an assertion is definitively meaningful but I think the simulator will catch this. And that someone changes by accident both, the RTL and the simulator, seems pretty unlikely. What do you think?
There was a problem hiding this comment.
I would add this SVA as it may save time when debugging. Because it's then obvious that this is not "just" a mismatch between model and RTL, but something which is intentionally not meant to happen.
There was a problem hiding this comment.
Done. It also includes bn.rshi, bn.addm, and bn.subm which are the only other BN ALU operations which do not update flags.
| waive -rules {CLOCK_USE RESET_USE} -location {otbn_alu_bignum.sv} \ | ||
| -regexp {'(clk_i|rst_ni)' is connected to '(otbn_vec_transposer|otbn_vec_shifter)' port} \ | ||
| -comment {The module is fully combinatorial, clk/rst are only used for assertions.} | ||
|
|
There was a problem hiding this comment.
This change should be fixed-up into the commit that makes the waiver necessary, I think.
etterli
left a comment
There was a problem hiding this comment.
Thanks @andreaskurth for the review. I answered to your comments and will push the updated design tomorrow. I first want to test all the changes.
@vogelpi, sorry some comments were not shown on the github page. I now also address them. Hope I haven't missed any.
| AluElen32: begin | ||
| alu_adder_carry_sel_bignum = 1'b1; | ||
| alu_shift_mask_bignum = (32'd1 << (32 - alu_shift_amt_bignum[4:0])) - 32'd1; | ||
| end | ||
| AluElen256: begin | ||
| alu_adder_carry_sel_bignum = 1'b0; | ||
| alu_shift_mask_bignum = {32{1'b1}}; | ||
| end | ||
| default: begin // same as 256b | ||
| alu_adder_carry_sel_bignum = 1'b0; | ||
| alu_shift_mask_bignum = {32{1'b1}}; | ||
| end |
There was a problem hiding this comment.
Hmm, I tried to generalize as much as possible but there are still a few places where stuff will break if VChunkLEN is changed (generalizing everything is far from trivial / sometimes not even possible). So I don't think it is worth to generalize this. It will just make it even more complex to read. Also, the only reason why VChunkLEN should change is when 16-bit elements should be implemented. Then this part must anyway be touched.
I would like to refrain from renaming it because in some places, e.g., in the transposer, it is assumed that this type represents the 32 bit case.
| AluElen32: begin | ||
| alu_adder_carry_sel_bignum = 1'b1; | ||
| alu_shift_mask_bignum = (32'd1 << (32 - alu_shift_amt_bignum[4:0])) - 32'd1; | ||
| end | ||
| AluElen256: begin | ||
| alu_adder_carry_sel_bignum = 1'b0; | ||
| alu_shift_mask_bignum = {32{1'b1}}; | ||
| end | ||
| default: begin // same as 256b | ||
| alu_adder_carry_sel_bignum = 1'b0; | ||
| alu_shift_mask_bignum = {32{1'b1}}; | ||
| end |
There was a problem hiding this comment.
Also, generating the carry control signal based upon the parameter would require us to define all NVecChunk bits because these are also predecoded. Right now, we only need 1 bit instead of 8 bits.
| alu_bignum_adder_y_op_a_en = 1'b1; | ||
| alu_bignum_adder_y_op_shifter_en = 1'b1; | ||
| flags_adder_update[flag_group] = 1'b1; | ||
| rf_ren_a_bignum = 1'b1; |
There was a problem hiding this comment.
I am pretty confident. I just checked this again.
| expected_shift_mod_sel = 1'b0; | ||
| expected_mod_is_subtraction = 1'b1; | ||
| end | ||
| AluOpBignumSubvm: begin |
| AluOpBignumTrn1: begin | ||
| expected_trn_en = 1'b1; | ||
| expected_trn_is_trn1 = 1'b1; | ||
| end | ||
| AluOpBignumTrn2: begin | ||
| expected_trn_en = 1'b1; | ||
| expected_trn_is_trn1 = 1'b0; | ||
| end |
| expected_shift_en = 1'b1; | ||
| expected_shift_right = operation_i.shift_right; | ||
| end | ||
| AluOpBignumSubv: begin |
| expected_shift_en = 1'b1; | ||
| expected_shift_right = operation_i.shift_right; | ||
| end | ||
| AluOpBignumAddv: begin |
| * \-----------------------------------------------------------------/ | ||
| * \---------------------------------------------------------------/ | ||
| * | | ||
| * operation_result_o |
There was a problem hiding this comment.
Yes, this is a smart optimization.
What do you think about ORing the 3 non-arithmetic results and then feeding this combined signal into a 3-to-1 MUX? This way we can save two blankers and their control overhead but still reduce the MUX width. But probably two blankers and one wide OR is more efficient.
62f815f to
0cca6c1
Compare
| alu_bignum_adder_y_op_a_en = 1'b1; | ||
| alu_bignum_adder_y_op_shifter_en = 1'b1; | ||
| flags_adder_update[flag_group] = 1'b1; | ||
| rf_ren_a_bignum = 1'b1; |
There was a problem hiding this comment.
Okay thank you! Also with the reworked output mux, we would now get errors if two data paths were active at the same time.
| adder_update_flags_en_raw = 1'b0; | ||
| logic_update_flags_en_raw = 1'b0; |
There was a problem hiding this comment.
I would add this SVA as it may save time when debugging. Because it's then obvious that this is not "just" a mismatch between model and RTL, but something which is intentionally not meant to happen.
| * \-----------------------------------------------------------------/ | ||
| * \---------------------------------------------------------------/ | ||
| * | | ||
| * operation_result_o |
There was a problem hiding this comment.
We discussed this offline. By slightly modifying the mod_result_selector module, we can convert one of the 256-bit multiplexer inputs in the final multiplexer into an 8-bit multiplexer which can even be predecoded. So we'll manage to save one input in the final multiplexer (the one from Adder Y, the result of Adder Y will then always go through the mod_result_selector) and the final multiplexer can be implemented with an OR tree, which will simplify timing optimization during synthesis (the critical path goes through the shifter and followed by Adder Y).
|
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_alu_bignum.sv This PR adds SIMD support as proposed in an approved RFC. |
5d4fc5a to
4286ba8
Compare
|
@vogelpi @andreaskurth I have now reworked the result mux and added the assertion. I also rebased it on master. If you want to review only the actual changes, see the force push from 2:28PM GMT+1, the next one is the rebase. Please have a look again. |
This adds a vectorized adder, a modulo result selector, a vectorized shifter and a vector transposer module. These modules are the building blocks to construct the vectorized BN ALU. Signed-off-by: Pascal Etterli <pascal.etterli@lowrisc.org>
4286ba8 to
cb521ed
Compare
Add the vectorized instructions implemented in the BN ALU to the OTBN. Signed-off-by: Pascal Etterli <pascal.etterli@lowrisc.org>
cb521ed to
03859b2
Compare
|
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_alu_bignum.sv This PR adds SIMD support as proposed in an approved RFC. |
This PR adds the first part of the SIMD instructions' RTL implementation. It adds the RTL for all instructions implemented in the Bignum ALU. See #29231 for the instruction definition / description.
Note that many regression tests still fail as not yet all new instructions are implemented in RTL.