[otbn,simd] Add RTL of SIMD instructions implemented in BN ALU #29344

Merged
vogelpi merged 2 commits into lowRISC:master from etterli:otbn-simd-rtl-bnalu
Feb 26, 2026

Conversation

Contributor

@etterli etterli commented Feb 20, 2026

This PR adds the first part of the SIMD instructions' RTL implementation. It adds the RTL for all instructions implemented in the Bignum ALU. See #29231 for the instruction definition / description.

Note that many regression tests still fail because not all of the new instructions are implemented in RTL yet.

Contributor

@andrea-caforio andrea-caforio left a comment

This is cool @etterli. I focused on the first commit because that's where
the math is. ;-)

* X0 = X[31:0], X1 = X[63:32], ..., X7 = X[255:224], same for Y
* Di = Decision by carry bits CXi and CYi
*
* D7 D7 D6 D7 D0
Contributor

Correct me if I'm wrong but is the first stage (decision) of this diagram
part of this module because the carry bits are generated externally?

Contributor

Also is D7 D0 correct as the inputs to the first decider stage?

Contributor Author

The diagram only shows the selection stage; the decision stage is not depicted. And yes, this module is closely related to the actual adders in the bignum ALU. I factored it out to hide some complexity (not much, to be honest). The decision bits are derived from the actual carry bits of the two adders, so yes, externally.

D7 and D0 are the correct inputs to the selection MUX for the lowest 32 bits. Depending on ELEN (either 32 or 256 bits), this chunk must use either the decision for chunk 0 (D0) when ELEN = 32, or D7, which is the decision when operating on 256 bits. In the 256-bit case, the MSB carry decides for all chunks which result to take.
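
To make the D0/D7 choice concrete, here is a minimal behavioural sketch in Python (not the PR's SystemVerilog). It models only the selection stage described above, assuming eight 32-bit chunks and the convention stated later in this thread that a decision bit of 0 takes X and 1 takes Y. The names `select_result`, `chunk` and `decisions` are hypothetical.

```python
CHUNK_BITS = 32   # width of one chunk (X0 = X[31:0], X1 = X[63:32], ...)
N_CHUNKS = 8      # 256 / 32
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def chunk(value, i):
    """Return chunk i (bits [32*i+31 : 32*i]) of a 256-bit integer."""
    return (value >> (i * CHUNK_BITS)) & CHUNK_MASK


def select_result(x, y, decisions, elen):
    """Selection stage only: decisions[i] is Di (0: take X, 1: take Y).

    For ELEN = 32 every chunk uses its own decision bit; for ELEN = 256
    every chunk uses D7, the decision of the most significant chunk.
    """
    if elen not in (32, 256):
        raise ValueError("only ELEN 32 and 256 are modelled")
    result = 0
    for i in range(N_CHUNKS):
        d = decisions[i] if elen == 32 else decisions[N_CHUNKS - 1]
        picked = chunk(y, i) if d else chunk(x, i)
        result |= picked << (i * CHUNK_BITS)
    return result
```

With `decisions = [1, 0, 0, 0, 0, 0, 0, 0]`, the lowest chunk comes from Y when `elen == 32` but from X when `elen == 256`, because D7 is 0 in that case.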

* The otbn_alu_bignum calculates pseudo modulo addition and subtraction by using two adders and
* evaluating their carry bits. Depending on the carry bits adder X or Y is selected as result.
*
* For addition, subtract mod if a + b >= mod:
Contributor

Isn't this module in a way independent of the modulus? Because it simply
multiplexes some vector elements. So I'm not sure why the modulus
is mentioned here?

Contributor Author

I see your point. However, the selection logic only makes sense in the context of the two adders and what they compute. I don't think this module makes sense in any standalone use.

Would it help if the header introduced this context?

* - Adder X calculates X = a + b
* - Adder Y calculates Y = X - mod
*
* - If X generates a carry:
Contributor

I know what is meant here, but it is still slightly confusing to use the term "carry" here.
It is a decision bit that indicates whether a value is in the interval [0, mod-1] or
[mod, 2*mod-1].

Contributor

This ties in with my comment on mentioning the modulus here even though
the module is independent of it.

Contributor Author

In the current naming, the decision bit is the bit that carries the result of this evaluation (0 to take result X, 1 to take result Y). The signal referred to here is the actual carry bit of adder X (which is computed externally).
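
The difference between the adder carries and the decision bit can be sketched with a small Python model of the scalar (256-bit) addition case. The quoted header comment is truncated after "If X generates a carry:", so the decision rule used here (take Y if adder X or adder Y produces a carry) is an assumption, and the names `add_with_carry` and `pseudo_mod_add` are hypothetical.

```python
WIDTH = 256
MASK = (1 << WIDTH) - 1


def add_with_carry(a, b, carry_in):
    """WIDTH-bit addition returning (sum, carry_out)."""
    total = a + b + carry_in
    return total & MASK, total >> WIDTH


def pseudo_mod_add(a, b, mod):
    """Return (result, decision) for a + b with one conditional subtraction."""
    # Adder X: X = a + b
    x, carry_x = add_with_carry(a, b, 0)
    # Adder Y: Y = X - mod, computed as X + ~mod + 1
    y, carry_y = add_with_carry(x, ~mod & MASK, 1)
    # Assumed rule: take Y if X overflowed, or if Y did not borrow
    # (carry_y == 1 means the truncated X is >= mod).
    decision = 1 if (carry_x or carry_y) else 0
    return (y if decision else x), decision
```

For operands already reduced below `mod`, the decision bit is 1 exactly when `a + b >= mod`, which matches the interval interpretation given in the review comment above.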

*
* For subtraction, this stage generates an additional signal whether any vector element uses the
* result of adder Y. This signal is used for MOD integrity checks and blanking assertions. For
* addition this signal is always set as the carries of Y are used for the decisions.
Contributor

I don't understand this. So this additional signal is only used in the subtraction case for
some security checks? Why not unconditionally set it to 1 like for addition?

Contributor Author

I followed the behaviour of the current OTBN. I do not know the design rationale for making this check dependent on the result. Let's discuss this offline.

Comment thread hw/ip/otbn/rtl/otbn_pkg.sv Outdated

// Vector element length type for bignum vec ISA implemented in BN ALU for
// bn.addv(m), bn.subv(m) and bn.shv.
// The ISA forsees only 4 types (16 to 128 bits). However, only a subset is implemented.
Contributor

With "4 types (16 to 128 bits)" you mean 16, 32, 64 and 128?

Contributor Author

Yes. But this line was updated in the last force push.

) (
input logic [LVLEN-1:0] operand_a_i,
input logic [LVLEN-1:0] operand_b_i,
input logic operand_b_invert_i,
Contributor

This is the indicator bit for performing a subtraction, why not call it like that?

Contributor Author

Because executing a subtraction also requires setting the carry-in accordingly (to 1, so that the two's complement is computed correctly). This signal only controls whether operand B is inverted. That could be useful if we ever wanted a one's complement (though I doubt we will).

Comment thread hw/ip/otbn/rtl/otbn_vec_adder.sv Outdated
*
* This carry chaining allows to compute additions over multiples of LVChunkLEN wide elements
* including the full vector width (i.e., a non vectorized addition). To perform subtraction the
* input B can be inverted and all carries must be set to 1 as: a - b = a + ~b + 1.
Contributor

This sounds like the caller has to invert B in the subtraction case but it is handled in this
module?

Contributor Author

Would something like this be clearer:

A subtraction can be performed by setting the operand_b_invert_i signal and the input carries to one because: a - b = a + ~b + 1.
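
As an illustration of the carry chaining and of a - b = a + ~b + 1, the following Python sketch models a chunked adder with eight 32-bit chunks. It is not the RTL: the `elementwise` flag stands in for the carry selection of the real design, and the names `vec_add`, `invert_b` and `carry_in` are hypothetical.

```python
CHUNK_BITS = 32
N_CHUNKS = 8
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def vec_add(a, b, invert_b, carry_in, elementwise):
    """Chunked adder returning (result, per-chunk carry outs).

    elementwise=True : every chunk starts from carry_in (32-bit elements).
    elementwise=False: chunk i takes the carry out of chunk i-1, so the
                       chunks together form one 256-bit addition.
    """
    result, carries = 0, []
    carry = carry_in
    for i in range(N_CHUNKS):
        op_a = (a >> (i * CHUNK_BITS)) & CHUNK_MASK
        op_b = (b >> (i * CHUNK_BITS)) & CHUNK_MASK
        if invert_b:
            op_b = ~op_b & CHUNK_MASK
        cin = carry_in if (elementwise or i == 0) else carry
        total = op_a + op_b + cin
        carry = total >> CHUNK_BITS
        carries.append(carry)
        result |= (total & CHUNK_MASK) << (i * CHUNK_BITS)
    return result, carries
```

Calling `vec_add(a, b, True, 1, False)` yields `(a - b) mod 2**256`, while `vec_add(a, b, True, 1, True)` subtracts each 32-bit element independently. Setting `invert_b` without the carry-in of 1 gives `a + ~b`, which is why the invert signal alone is not a "subtract" indicator.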

Contributor Author

Rephrased it

Comment thread hw/ip/otbn/rtl/otbn_vec_shifter.sv Outdated
/**
* OTBN vectorized shifter
*
* This shifter is capable of shifting vectors elementwise as well as concatenate and shift 256
Contributor

Maybe you can mention somewhere that these are logical shifts as opposed
to arithmetic ones, which are only supported for the GPR registers.

Contributor Author

Mentioned it

Comment thread hw/ip/otbn/rtl/otbn_vec_transposer.sv Outdated
* This module transposes the elements of two input vectors in two different ways.
* It supports 32b, 64b and 128b element lengths.
*
* If there are two vectors with 4 elements the transpositions are as follows:
Contributor

Here you should mention that trn1 interleaves even coordinates and trn2 odd ones; otherwise
the word transposition is a bit misleading.

Contributor Author

Thanks, this is indeed clearer. I updated it.
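
The even/odd interleaving can be shown on plain Python lists. The exact element ordering of the module is not visible in this thread, so the ordering below (element from the first vector, then the element with the same index from the second vector) is an assumption modelled on the usual TRN1/TRN2 convention; `trn1` and `trn2` here are illustrative functions, not the RTL.

```python
def trn1(a, b):
    """Interleave the even-indexed elements of a and b."""
    out = []
    for i in range(0, len(a), 2):
        out += [a[i], b[i]]
    return out


def trn2(a, b):
    """Interleave the odd-indexed elements of a and b."""
    out = []
    for i in range(1, len(a), 2):
        out += [a[i], b[i]]
    return out
```

Applying `trn1` and `trn2` to the rows of a 2x2 matrix (two vectors with two elements each) yields its transpose, which is presumably where the name comes from.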

@etterli etterli force-pushed the otbn-simd-rtl-bnalu branch from ca2aa31 to c64a89a Compare February 20, 2026 13:55
@etterli etterli added the CI:Rerun Rerun failed CI jobs label Feb 21, 2026
@github-actions github-actions bot removed the CI:Rerun Rerun failed CI jobs label Feb 21, 2026
@etterli etterli force-pushed the otbn-simd-rtl-bnalu branch 3 times, most recently from 38f6dd0 to e26ec29 Compare February 21, 2026 13:17
@etterli etterli force-pushed the otbn-simd-rtl-bnalu branch 2 times, most recently from c3a5c09 to d0d5835 Compare February 24, 2026 10:58
Contributor

@vogelpi vogelpi left a comment

Thanks @etterli for your PR, I've reviewed the first commit and will continue later. It's great to see that you thoroughly implemented the feedback from our previous discussions :-)

Contributor

@vogelpi vogelpi left a comment

@etterli, I've now also reviewed the rest. This is fantastic work, well done!

I have mostly nits, a few questions and maybe one or two comments requiring actual work. But this looks really good!

Comment thread hw/ip/otbn/rtl/otbn_predecode.sv Outdated
alu_bignum_adder_y_op_a_en = 1'b1;
alu_bignum_adder_y_op_shifter_en = 1'b1;
flags_adder_update[flag_group] = 1'b1;
rf_ren_a_bignum = 1'b1;
Contributor

I haven't reviewed in depth whether for every instruction you enable only those parts of the ALU which are really needed. How confident are you that you only enable what is really needed?

Contributor Author

I am pretty confident. I just checked this again.

Contributor

Okay thank you! Also with the reworked output mux, we would now get errors if two data paths were active at the same time.

Contributor Author

When reworking the mux I also added an assertion which checks that only one path is active at the same time.

Contributor

Very nice, thanks!

Comment thread hw/ip/otbn/rtl/otbn_alu_bignum.sv Outdated
* \-----------------------------------------------------------------/
* \---------------------------------------------------------------/
* |
* operation_result_o
Contributor

Right now, operation_result_o is implemented using a unique case. But if you look at this figure, you can see that, at least for the rightmost inputs, only one input can ever be non-zero. So at least this part of the mux could be implemented using an OR tree rather than a real MUX. This can lead to a notable area reduction, because the MUX is wide.

However, I am not sure if synthesis is smart enough to figure that out given all the pre-decoding and blanking. You may want to add a blanker again to the "Y result" and the "MOD result" inputs (the latter may not be needed) just for this purpose. And then take the operation result mux out of the unique case and manually implement it using a bitwise OR tree.
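
The OR-tree idea can be illustrated with a short Python model: if every input that is not selected is blanked to zero, a bitwise OR of all inputs returns the selected one. This is only a sketch of the principle under that assumption; `blank` and `or_tree_mux` are hypothetical names, and the check that at most one path is active mirrors the kind of assertion mentioned earlier in this conversation.

```python
from functools import reduce

WLEN_MASK = (1 << 256) - 1


def blank(value, enable):
    """Model of a blanker: pass the value when enabled, else drive zero."""
    return (value & WLEN_MASK) if enable else 0


def or_tree_mux(inputs, enables):
    """Select among inputs by ORing the blanked values together."""
    if sum(1 for e in enables if e) > 1:
        # With two active paths the OR would mix both values.
        raise ValueError("more than one data path active")
    blanked = (blank(v, e) for v, e in zip(inputs, enables))
    return reduce(lambda acc, v: acc | v, blanked, 0)
```

The model also shows why the blanking matters: without it, a non-selected input that happens to be non-zero would corrupt the result.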

Contributor Author

@etterli etterli Feb 25, 2026

Yes, this is a smart optimization.

What do you think about ORing the 3 non-arithmetic results and then feeding this combined signal into a 3-to-1 MUX? This way we can save two blankers and their control overhead but still reduce the MUX width. But probably two blankers and one wide OR is more efficient.

Contributor Author

@etterli etterli Feb 26, 2026

The 3-to-1 MUX requires around 1.9kGE whereas with blankers we are at 2.1kGE (excluding the FF and other logic). Both are better than a full MUX which is around 2.4kGE. Estimated using nangate45 values and 2 input gates only. Maybe there are more efficient multi-input gates but it seems to be around the same.

Implemented the 3-to-1 option for now.

Contributor

We discussed this offline. By slightly modifying the mod_result_selector module, we can convert one of the 256-bit multiplexer inputs in the final multiplexer into an 8-bit multiplexer which can even be predecoded. So we'll manage to save one input in the final multiplexer (the one from Adder Y; the result of Adder Y will then always go through the mod_result_selector), and the final multiplexer can be implemented with an OR tree, which will simplify timing optimization during synthesis (the critical path goes through the shifter followed by Adder Y).

@etterli etterli force-pushed the otbn-simd-rtl-bnalu branch from d0d5835 to b40f17c Compare February 25, 2026 11:49
Contributor Author

@etterli etterli left a comment

Thanks @vogelpi for the review. I addressed the points.

@etterli etterli added the CI:Rerun Rerun failed CI jobs label Feb 25, 2026
@github-actions github-actions bot removed the CI:Rerun Rerun failed CI jobs label Feb 25, 2026
Contributor

@andreaskurth andreaskurth left a comment

Nice work, @etterli! Please find a couple of suggestions and questions below. Overall, I don't see any blocking problems, though. 👍

// The vector chunk length. This defines the width of the internal adders.
parameter int LVChunkLEN = VChunkLEN,
// The number of vector chunks, i.e., the number of adders.
localparam int LNVecProc = LVLEN / LVChunkLEN
Contributor

Suggest adding an ASSERT_INIT to ensure that this divides without remainder

Contributor Author

Good idea. Added it

Comment thread hw/ip/otbn/rtl/otbn_vec_adder.sv Outdated
assign op_b = operand_b_invert_i ? ~operand_b_i[i_adder * LVChunkLEN+:LVChunkLEN]
: operand_b_i[i_adder * LVChunkLEN+:LVChunkLEN];

// Do the addition and update carry flag
Contributor

Suggest expanding this to:

    // Compute op_a + op_b + carry_in using a two-operand addition.
    // By appending 1'b1 and carry_in as the LSBs, the addition of the
    // LSB position (1 + carry_in) generates a carry into the upper bits
    // exactly when carry_in is set, so result[LVChunkLEN:1] = op_a + op_b + carry_in
    // and result[LVChunkLEN+1] is the carry out. The LSB of result is unused.

Contributor Author

Thanks, added this.
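
The LSB trick described in the suggested comment can be checked with a small Python model. It assumes the extended operands are `{op_a, 1'b1}` and `{op_b, carry_in}` as the comment states; the function name `add_via_lsb_trick` and the `width` parameter are hypothetical.

```python
def add_via_lsb_trick(op_a, op_b, carry_in, width=32):
    """Compute op_a + op_b + carry_in with a single two-operand addition."""
    mask = (1 << width) - 1
    ext_a = ((op_a & mask) << 1) | 1          # {op_a, 1'b1}
    ext_b = ((op_b & mask) << 1) | carry_in   # {op_b, carry_in}
    result = ext_a + ext_b                    # width + 2 bits wide
    total = (result >> 1) & mask              # result[width:1]
    carry_out = (result >> (width + 1)) & 1   # result[width+1]
    return total, carry_out
```

Arithmetically, the extended sum is 2 * (op_a + op_b) + 1 + carry_in, so dropping the LSB leaves op_a + op_b + carry_in in both cases.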

Comment thread hw/ip/otbn/rtl/otbn_decoder.sv
Comment on lines +793 to +804
AluElen32: begin
alu_adder_carry_sel_bignum = 1'b1;
alu_shift_mask_bignum = (32'd1 << (32 - alu_shift_amt_bignum[4:0])) - 32'd1;
end
AluElen256: begin
alu_adder_carry_sel_bignum = 1'b0;
alu_shift_mask_bignum = {32{1'b1}};
end
default: begin // same as 256b
alu_adder_carry_sel_bignum = 1'b0;
alu_shift_mask_bignum = {32{1'b1}};
end
Contributor

This code feels a bit brittle as it will break when VChunkLEN != 32, right? I think using that parameter and $clog2(VChunkLEN)-1 instead of the hard-coded indices/sizes could make the code work with other values.

AluElen32 has the 32 even in the enum name - potentially worth renaming to AluElenVChunkLEN (and AluElen256 could become AluElenWLEN)?

Contributor Author

Hmm, I tried to generalize as much as possible, but there are still a few places where things will break if VChunkLEN is changed (generalizing everything is far from trivial and sometimes not even possible). So I don't think it is worth generalizing this; it would just make the code harder to read. Also, the only reason to change VChunkLEN would be to implement 16-bit elements, and then this part would have to be touched anyway.

I would like to refrain from renaming it because in some places, e.g., in the transposer, it is assumed that this type represents the 32 bit case.

Contributor Author

Also, generating the carry control signal based upon the parameter would require us to define all NVecChunk bits because these are also predecoded. Right now, we only need 1 bit instead of 8 bits.
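
For readers unfamiliar with the mask expression in the quoted decoder code, here is a Python sketch of what `(32'd1 << (32 - shamt)) - 32'd1` evaluates to and of one plausible way such a mask could be used for an elementwise logical right shift. How the shifter module actually applies the mask (in particular for left shifts) is not shown in this thread, so `vec_shift_right` is an assumption; `shift_mask` follows the quoted expression with 32-bit wraparound.

```python
CHUNK_BITS = 32
N_CHUNKS = 8
CHUNK_MASK = (1 << CHUNK_BITS) - 1
WLEN_MASK = (1 << (CHUNK_BITS * N_CHUNKS)) - 1


def shift_mask(shift_amt, elen):
    """Per-chunk mask as in the quoted decoder code."""
    if elen == 32:
        # 32-bit arithmetic: for shift_amt == 0 the shift yields 0 and
        # the subtraction wraps around to all ones.
        shifted_one = (1 << (32 - (shift_amt & 0x1F))) & CHUNK_MASK
        return (shifted_one - 1) & CHUNK_MASK
    return CHUNK_MASK  # ELEN = 256 and the default case: {32{1'b1}}


def vec_shift_right(value, shift_amt, elen):
    """Assumed usage: shift the full word, then clear the bits that
    crossed over from the neighbouring element."""
    shifted = (value & WLEN_MASK) >> shift_amt
    mask = shift_mask(shift_amt, elen)
    full_mask = 0
    for i in range(N_CHUNKS):
        full_mask |= mask << (i * CHUNK_BITS)
    return shifted & full_mask
```

For example, shifting `0x1_8000_0000` right by 1 gives `0xC000_0000` as a 256-bit value, but `0x4000_0000` with 32-bit elements, because the bit that crossed the element boundary is masked out.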

Comment thread hw/ip/otbn/rtl/otbn_alu_bignum.sv Outdated
Comment on lines +1017 to +1024
AluOpBignumTrn1: begin
expected_trn_en = 1'b1;
expected_trn_is_trn1 = 1'b1;
end
AluOpBignumTrn2: begin
expected_trn_en = 1'b1;
expected_trn_is_trn1 = 1'b0;
end
Contributor

This could be simplified:

AluOpBignumTrn1,
AluOpBignumTrn2: begin
  expected_trn_en = 1'b1;
  expected_trn_is_trn1 = operation_i.op == AluOpBignumTrn1;
end

Contributor Author

merged it.

Comment thread hw/ip/otbn/rtl/otbn_alu_bignum.sv Outdated
expected_shift_en = 1'b1;
expected_shift_right = operation_i.shift_right;
end
AluOpBignumSubv: begin
Contributor

This could also be implemented in the same case as AluOpBignumSub. Differences are in:

  • expected_adder_y_carries_top
  • adder_update_flags_en_raw
  • expected_shift_right

Here the benefit I see is less in the reduced amount of code and more in bundling together what belongs together and making the differences between scalar and vector explicit.

Contributor Author

merged it.

Comment thread hw/ip/otbn/rtl/otbn_alu_bignum.sv Outdated
expected_x_res_operand_a_sel = 1'b1;
expected_shift_mod_sel = 1'b0;
end
AluOpBignumAddvm: begin
Contributor

Also this could be implemented in the same case as AluOpBignumAddm. Only expected_adder_y_carries_top needs to be differentiated.

Contributor Author

Merged these cases together.

Comment thread hw/ip/otbn/rtl/otbn_alu_bignum.sv Outdated
expected_shift_en = 1'b1;
expected_shift_right = operation_i.shift_right;
end
AluOpBignumAddv: begin
Contributor

Similar to the suggestion for Sub, and same differences.

Contributor Author

merged.

Comment on lines +811 to +812
adder_update_flags_en_raw = 1'b0;
logic_update_flags_en_raw = 1'b0;
Contributor

Is it worth adding an assertion checking that vector operations don't update adder or logic flags? That's an invariant in the current architecture, and if the code is ever changed to violate that invariant, that can result in bugs that are pretty hard to root-cause. Such an assertion would catch this explicitly.

The assertion may be as simple as

ASSERT(VecOpsNoFlagUpdate_A, is_vec_op |-> !adder_update_flags_en_raw && !logic_update_flags_en_raw)

(where the is_vec_op helper signal has to be defined - e.g., with an inside construct.)

Contributor Author

Such an assertion is definitely meaningful, but I think the simulator will catch this. And it seems pretty unlikely that someone accidentally changes both the RTL and the simulator. What do you think?

Contributor

I would add this SVA as it may save time when debugging. Because it's then obvious that this is not "just" a mismatch between model and RTL, but something which is intentionally not meant to happen.

Contributor Author

Done. It also includes bn.rshi, bn.addm, and bn.subm which are the only other BN ALU operations which do not update flags.

Comment on lines +38 to +41
waive -rules {CLOCK_USE RESET_USE} -location {otbn_alu_bignum.sv} \
-regexp {'(clk_i|rst_ni)' is connected to '(otbn_vec_transposer|otbn_vec_shifter)' port} \
-comment {The module is fully combinatorial, clk/rst are only used for assertions.}

Contributor

This change should be fixed-up into the commit that makes the waiver necessary, I think.

Contributor Author

@etterli etterli left a comment

Thanks @andreaskurth for the review. I replied to your comments and will push the updated design tomorrow. I first want to test all the changes.

@vogelpi, sorry, some comments were not shown on the GitHub page. I have now addressed them as well. I hope I haven't missed any.

Comment thread hw/ip/otbn/rtl/otbn_alu_bignum.sv Outdated
expected_shift_mod_sel = 1'b0;
expected_mod_is_subtraction = 1'b1;
end
AluOpBignumSubvm: begin
Contributor Author

Merged it.

@etterli etterli force-pushed the otbn-simd-rtl-bnalu branch 2 times, most recently from 62f815f to 0cca6c1 Compare February 26, 2026 08:57
Contributor

@vogelpi vogelpi left a comment

Thanks a lot for implementing the feedback, @etterli. There is one more point regarding the final multiplexer. Once that is resolved, we can merge the PR.

Comment thread hw/ip/otbn/rtl/otbn_vec_shifter.sv
Comment thread hw/ip/otbn/rtl/otbn_vec_shifter.sv
Comment thread hw/ip/otbn/rtl/otbn_alu_bignum.sv
Comment thread hw/ip/otbn/rtl/otbn_alu_bignum.sv
Comment thread hw/ip/otbn/rtl/otbn_mod_result_selector.sv Outdated
Comment thread hw/ip/otbn/rtl/otbn_decoder.sv Outdated
Comment thread hw/ip/otbn/rtl/otbn_predecode.sv Outdated
alu_bignum_adder_y_op_a_en = 1'b1;
alu_bignum_adder_y_op_shifter_en = 1'b1;
flags_adder_update[flag_group] = 1'b1;
rf_ren_a_bignum = 1'b1;
Contributor

Okay, thank you! Also, with the reworked output mux, we would now get errors if two data paths were active at the same time.

Comment on lines +811 to +812
adder_update_flags_en_raw = 1'b0;
logic_update_flags_en_raw = 1'b0;
Contributor

I would add this SVA, as it may save time when debugging: it is then obvious that this is not "just" a mismatch between model and RTL, but something that is intentionally not meant to happen.
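
As a rough illustration of the property such an assertion would check, here is a minimal behavioural sketch in Python. It is not the SVA from this PR; the function name and the enable list are hypothetical. The condition is the same one that SystemVerilog's `$onehot0` tests: zero or one bit set.

```python
from typing import Sequence


def at_most_one_active(enables: Sequence[bool]) -> bool:
    """Return True if zero or one data path enable is set.

    Hypothetical helper: it models the "at most one data path active"
    property, not the actual assertion added in the RTL.
    """
    return sum(1 for enable in enables if enable) <= 1


# Hypothetical enables for three data paths.
print(at_most_one_active([False, False, False]))  # no path active
print(at_most_one_active([False, True, False]))   # exactly one path active
print(at_most_one_active([True, True, False]))    # two paths active: violation
```

A failing check of this kind points directly at the control signals, rather than showing up later as a corrupted result.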

Comment thread hw/ip/otbn/rtl/otbn_alu_bignum.sv Outdated
* \-----------------------------------------------------------------/
* \---------------------------------------------------------------/
* |
* operation_result_o
Contributor

We discussed this offline. By slightly modifying the mod_result_selector module, we can convert one of the 256-bit multiplexer inputs in the final multiplexer into an 8-bit multiplexer, which can even be predecoded. This saves one input in the final multiplexer: the one from Adder Y, whose result will then always go through the mod_result_selector. The final multiplexer can then be implemented with an OR tree, which will simplify timing optimization during synthesis (the critical path goes through the shifter followed by Adder Y).
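
To make the OR-tree idea concrete, here is a minimal behavioural sketch in Python, not the RTL from this PR. It assumes that every data path result is blanked to zero unless its enable is set and that at most one enable is active. The names `blank`, `or_tree_mux`, and `index_mux` are hypothetical.

```python
WIDTH = 256
MASK = (1 << WIDTH) - 1


def blank(value: int, enable: bool) -> int:
    """Force a data path result to all-zero unless its enable is set."""
    return (value & MASK) if enable else 0


def or_tree_mux(results, enables):
    """OR all blanked results together.

    With at most one enable set, the output equals the enabled result,
    so the OR behaves like a multiplexer.
    """
    out = 0
    for value, enable in zip(results, enables):
        out |= blank(value, enable)
    return out


def index_mux(results, select):
    """Reference model: a conventional multiplexer selecting by index."""
    return results[select] & MASK


# Three hypothetical 256-bit results; each is selected in turn.
results = [0x11, 0x22 << 100, MASK]
for select in range(len(results)):
    enables = [i == select for i in range(len(results))]
    assert or_tree_mux(results, enables) == index_mux(results, select)
```

If two enables were set at once, the output would be the bitwise OR of both results, which is why this structure relies on the enables being mutually exclusive.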

@vogelpi
Contributor

vogelpi commented Feb 26, 2026

CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_alu_bignum.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_controller.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_decoder.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_mod_result_selector.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_pkg.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_predecode.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_vec_adder.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_vec_shifter.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_vec_transposer.sv

This PR adds SIMD support as proposed in an approved RFC.

@etterli etterli force-pushed the otbn-simd-rtl-bnalu branch 2 times, most recently from 5d4fc5a to 4286ba8 Compare February 26, 2026 13:34
@etterli
Contributor Author

etterli commented Feb 26, 2026

@vogelpi @andreaskurth I have now reworked the result mux and added the assertion. I also rebased it on master. If you want to review only the actual changes, see the force push from 2:28 PM GMT+1; the next one is the rebase.

Please have a look again.

Contributor

@vogelpi vogelpi left a comment

Thanks, @etterli. I have one more question, but this is great work!

Comment thread hw/ip/otbn/rtl/otbn_alu_bignum.sv Outdated
Comment thread hw/ip/otbn/rtl/otbn_mod_result_selector.sv Outdated
This adds a vectorized adder, a modulo result selector, a vectorized shifter and a vector transposer
module. These modules are the building blocks to construct the vectorized BN ALU.

Signed-off-by: Pascal Etterli <pascal.etterli@lowrisc.org>
@etterli etterli force-pushed the otbn-simd-rtl-bnalu branch from 4286ba8 to cb521ed Compare February 26, 2026 14:02
Add the vectorized instructions implemented in the BN ALU to the OTBN.

Signed-off-by: Pascal Etterli <pascal.etterli@lowrisc.org>
@etterli etterli force-pushed the otbn-simd-rtl-bnalu branch from cb521ed to 03859b2 Compare February 26, 2026 14:07
@nasahlpa
Member

CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_alu_bignum.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_controller.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_decoder.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_mod_result_selector.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_pkg.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_predecode.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_vec_adder.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_vec_shifter.sv
CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_vec_transposer.sv

This PR adds SIMD support as proposed in an approved RFC.

@etterli etterli added the CI:Rerun Rerun failed CI jobs label Feb 26, 2026
@github-actions github-actions bot removed the CI:Rerun Rerun failed CI jobs label Feb 26, 2026
@vogelpi vogelpi added this pull request to the merge queue Feb 26, 2026
Merged via the queue into lowRISC:master with commit b883337 Feb 26, 2026
77 of 81 checks passed

5 participants