Skip to content

Latest commit

 

History

History
65 lines (48 loc) · 14.2 KB

File metadata and controls

65 lines (48 loc) · 14.2 KB

Large Integer Arithmetic Operation Hardware Accelerator Project Report

Objectives

The main objective of the final project is to improve the reference design of Lab #3 by designing a new combinational adder that replaces the current N-bit Ripple-Carry Adder in the multi-precision adder (mp_adder) module to increase its performance while minimizing latency. To fully utilize the FSM and achieve a less computation clock cycles for input operands A and B, each of 512 bits, the adder width should be higher than 32 bits.

Design and Implementation Summary

The final adder design is a 128-bit (N) hybrid adder (HYBRID) of Carry Select Adder (CSelA) and Carry Lookahead Adder (CLA) and is implemented through the functional block diagram. The design strategy of this improved adder was derived from a set of delay calculation comparisons conducted in the lecture of Complex Digital Design (CDD), and specifically the 16-bit (M) carry-lookahead adder design draws the expanded expression logic described in CDD lecture 3A (p.34 and p.38). The resulting hybrid adder can realize features of addition, subtraction, and comparison between two 512-bit operands without compromising given slack constrain. The number of cycles this design requires to complete a 512-bit addition, subtraction, and comparison was identical to three, which was determined by the operand width of 512-bit divided by the adder width of 128-bit. The worst-negative-delay of the N-bit HYBRID adder for addition only is 0.802 ns, for adder/subtractor is 0.735 ns, and for adder/subtractor and comparator is 0.549 ns, which is comparable to conventional ripple carry adder of the same adder width; However, it is worth noting that this HYBRID design can be further improved by utilizing modern architectures such as Kogge-stone, Brent-Kung, or Spanning Tree adder.

Technical Description

Adder Designs Strategy

In the case of adder design using conventional adder structure, a few methods of circuit design were tested, including CLA using for-loop, CLA using expanded expression, CSelA with two-to-one multiplexer (MUX) generated in a for loop different from the M-bit CLA pairs, CSelA with a two-to-one multiplexer and M-bit CLA pairs generated in the same iteration loops. The 16-bit width of CLA was selected from the square root of desired adder N bit higher level adder width, allowing an approximately equal span of delay across groups of M-bit CLAs. image

CLA generation with for-loop and with expanded expression

When implemented in CLA block modules into carry select adder, of pre-fixed computing hypothetically would yield a more attractive propagation delay, [tPG + (N)•(tAND + tOR)] + (N/M-1)•tMUX , than ripple carry adder would, tFA,SUM + (N-1) * tFA,CARRY. It was assumed particularly true for which logic of carry propagate chain in a CLA is expanded that the delay is [ tPG + N•tAND + tOR] + (N/M-1)•tMUX. Theoretically with tPG, tAND, and tOR respectively being 1, tFA,CARRY being 2, and tFA,SUM and tMUX being 3, a HYBRID adder would be faster than for-loop HYBRID adder, where ripple carry adder at same adder width would be the slowest.

CSelA with 16-bit for-loop CLA

First, we will explain how the 16-bit CLA is designed. Compared to the Ripple-Carry Adder, the CLA structure can split the computations into two parts, the upper chain to compute sum terms and the bottom chain to drive the carry propagation, thus speeding up the addition operations. The CLA structure can split the computations into two parts, the upper chain to compute sum terms and the bottom chain to drive the carry propagation, thus speeding up addition.

The upper chain of the M-bit CLA consists of M Partial Full Adders. Each one takes three inputs, A, B andC_in, and generates 3 outputs, a sum (∑), generate carry (G), and propagate carry (P). The relationship between outputs and inputs are ∑=A ⨁▒B, G=AB, P=A+B. The bottom chain is Carry Lookahead Logic, which is responsible for computing the C_in for the next stage by this formula, C_out=G+PC_in. The following diagram shows the basic structure of a 4-bit CLA, which can help to understand the structure of a 16-bit CLA. In this way, we get the 16-bit CLA as a unit adder block.

Figure 2: for-loop CLA circuit Because we need an N-bit adder, if we simply concatenate the M-bit CLAs, it is not an ideal structure, because the C_in of the CLA of the later level needs to wait for the C_out of the CLA of the previous level before computing the results, which still has a long waiting time. Fortunately, the structure using CSelA can pre-compute the result of a block without waiting for the actual C_in. Specifically, we can create 2 instances of the M-bit CLAs, which accept the same M-bit operands, one feeding C_in = 0 and the other feeding C_in = 1. In this way, as long as C_(in-1) is available, it can select the pre-computed correct outputs of the CLAs via multiplexers. Figure 4 shows a model diagram of part of our adder design.

Finally, we can ‘cascade’ these M-bit adders together to form the final N-bit adder, except a single CLA is used at the start of the chain. image

Using this architecture, we can achieve a maximum of a 64-bit adder with a time constraint of 125MHz. It is worth mentioning that the 16-bit CLA unit can be optimized. Performing computations 'in series' of all terms C_out=G+PC_in is not a good strategy. Because this structure is equivalent to just transferring the most time-consuming part, i.e., the carry rippling, from the FA of Ripple_Carry_Adder to the Carry Propagation Logic block. The next approach can solve this problem and significantly improve the 16-bit CLA speed.

CSelA with 16-bit expanded CLA

The expansion architecture of CLA (CLA_16b_EXP) carry propagation logic was chosen to avoid ‘in series’ computation of all carry terms where the worst-case delay is expressed. By substituting each propagation carry bit (Pi) and generation carry bit (Gi) with products of Pi-1 and Gi-1, the redistributed circuit allows for parallel computing, improving speed at the cost of area overhead. The intermediate signals, Propagate and Generate, when adding bit Ai and Bi were denoted as P_i=A_i ⊕B_i or P_i=A_i+B_i and G_i=A_i* B_i. The sum and carry bit can be reformulated: ∑_i=P_i⊕C_(i-1) and C_i=G_i+P_i.C_(i-1). Additionally, the selected design for N-bit CSelA utilize a uniform-sized structure, a variable-sized architecture might however allow less gate delays. To exploit one iteration in Verilog, in the code structure each bit of CSelA output sum and intermediate MUXes, mux_s and mux_c, was assigned in the same cycle as the CLA_16b_EXP pairs and instantiated as a function of m = N/M. The CLA_16b_EXP pairs as observed from the figure above, were inputted with hardcoded 0 or 1 for pre-fixed calculations to minimize delay. The investigation of CSelA code structure design yielded both m-th mux_s and mux_c having outputs based on previous mux_c[3•(m-1)] as shown above.

Subtractor Designs Strategy

The general way to calculate subtraction A-B is to perform it as addition with 2's complement, i.e. A+(-B). Therefore, the N-bit subtraction can be implemented with an N-bit adder, with conditioning the input as ‘inverting all bits in operand B and fixing the carry into 1’. In practice, adder circuit can be generalized to be used for subtraction. We only need an extra select signal ‘add_sub’ to decide whether it will perform addition or subtraction and 2 multiplexers to select the corresponding inputs. When add_sub==0, the addition will be performed. When add_sub==1, subtraction will be performed. The model of the generalized circuit looks like this:

image To implement this, we also need to add an extra state in module ‘uart_top’ to receive the operator signal ‘add_sub’.

In our case, operands A and B are 512-bit, and the result ‘oRes[512:0]’ is shared by the signal defined by module mp_adder, which is 513-bit. There are 2 types of results. If A≥B, the result is non-negative, so we need to drop the overflow bit ‘oRes[512] = 1’, and the remaining 512-bit result 'oRes[511:0]' is the original binary code (also the 2’s complement) of the positive result. While if A<B, result is negative, and it is stored in oRes[511:0] and it is already in 2’s complement.

Since the most significant bit (MSB) ‘oRes[512] ‘ does not store the result, to prevent the generation of latches, we need to assign it artificially. Through the above analysis, we know that when A>=B, we need to assign the MSB ‘oRes[512]’ to 1’b0 because the result of subtraction is non-negative, and when A<B, the result is negative, represented by the 2’s complement, so we need to assign 1’b1 to oRes[512]. By the following assignment, we will always get the result of addition or subtraction in the form of 2’s complement.

image

Then we need to adapt the Python code so that it can interpret the receiving 2’ complement to a signed integer.

image

Comparator Designs Strategy

We chose to design a Magnitude Comparator to decide whether N-bit number A is larger than or equal to B. This can be easily implemented based on the subtractor. If A-B≥0, output Y=1, else Y=0. We also expand the possibility of operation selector by one, when the Python side sends add_sub=2, it indicates that it wants to get the result of comparison from FPGA. But FPGA still does the subtraction inside module ‘mp_adder_sub’ and stores the result in oRes[512:0].

It is worth noting that inside the comparator, the result Y can only be 1 or 0, so we can represent the result in a byte and send it to the Python side. When transferring the solution result, only the value of the first bit of oRes[512:0] needs to be stored inside the buffer 'rTxByte[7:0]', and then only this byte needs to be transferred. So we need to add a judgment to the state s_WAIT_TX of module uart_top, if the operator selector add_sub==2, once we finish passing a byte, we need to empty the whole buffer and return to IDLE state.

Post-Synthesis Performance Evaluation

The performance evaluation is based on a post-synthesis timing report to avoid physical and structural optimization conducted during Implementation stage. Module 512-bit mp_adder was set as top and the timing constrain is set to 125 MHz.

CSelA with 16-bit for-loop CLA

As mentioned earlier, this is not the final structure we have designed, it can be further optimized. In this case, it can only be implemented for a maximum of 64-bit adders. This design reports a positive WNS (worst significant slack) of 0.724ns, and the worst-case delay is 7.14 ns, where 1.765 ns and 5.375 ns correspond to the worst-path logic delay and net delay. We also find that the 64-bit HYBRID adder can perform better when we use a block size of 8 bits, which is the square root of 64. The specific performance evaluation can be found in Table 2.

CSelA with 16-bit expanded CLA

A mp_adder constructed respectively out of 128-bit CSelA with expanded 16-bit CLA post-synthesis yielded a WNS of 0.802 ns and a total worst-case delay of 7.062 ns combing 1.517 logic delay and 5.545 net delays, whereas of 64-bit yields a WNS of 2.066 ns and a total worst-case delay of 5.798 ns combing 1.269 logic delay and 4.529 net delay. The numbers of cycles required correspondingly are 3 and 8, where in parallel the LUTs and FFs needed are 1689 , 1552, and 1295 and 1569.

Subtractor and Comparator

The specific figures on speed and area are recorded in Table 1. Since they shared the generalized add_sub circuit, the numbers are the same. Their WNS is only 0.3ns smaller than the 128-bit adder and slightly larger in area. image

Comparison to Reference Adder

The architecture of adder contributes significantly to WNS and accordingly speed performance. Among the learned adder architectures known in the lecture, CLA was considered the fundamental replacement for a full adder. When implemented in block modules into CSelA, advantages of pre-fixed computing hypothetically would yield a more attractive propagation delay performance than ripple carry adder would give. It was assumed particularly true when the propagation logic in a CLA was expanded. In general, expanded HYBRID adder should be faster than for-loop HYBRID adder, where ripple carry adder at same adder width would be the slowest. Table 2 gives a comparative performance overview considering slack, propagation delay, and area constitution. image It can be observed that at an adder width of 32-bit, mp_adder utilizing a hybrid structure of CSelA and CLA constituting a for-loop carry-chain architecture already produce a higher positive slack than that of ripple_carry_adder_Nb, a negative WNS of -0.73 ns, yielding a delta of 3.522 ns. This result validates our hypothesis. In the case of an adder width of 64, the expanded HYBRID adder produces a WNS of 2.066 ns where the worst-case delay is 5.798 ns, the sum of 1.269 ns and 4.429 ns corresponding to the logic + net delays of the worst path. The expanded HYBRID was observed 1.342 ns faster than the for-loop HYBRID adder at the same block size and adder width. It is in parallel better performed than the optimal for-loop HYBRID adder. The simulation results verified our hypothesis and the taught theories. The final adder structure, expanded HYBRID, is consequently proposed, yielding a constrain-satisfied 0.802ns WNS that requires 3 cycles for 512-bit arithmetic operation, at an area cost of 1689 LUTs and 1552 FFs as provided by the Utilization report.