## Title

Kan Shi, David Boland, Member, IEEE, and George A. Constantinides, Senior Member, IEEE

Abstract—The abstract.

Index Terms-

#### I. INTRODUCTION



#### II. CARRY SELECT ADDER

A. Introduction

B. Timing Models for n-bit Carry Select Adder

In this section, we describe the timing model for the CSA, with the aim of relating the operating frequency to the corresponding maximum word-length of the CSA. This information can be employed to determine the truncation error based on the models presented in Section.xxx.

In a $c$-stage CSA, let the stage delays be denoted as $d_0 \dots d_{c-1}$, where $d_0$ and $d_{c-1}$ represent the delays of the MSB stage and the LSB stage, respectively. In our analysis, we follow the previous assumption that delay is caused by carry propagation and, in this case, multiplexing. Thus the delay of the $i^{th}$ stage can be obtained through (1), where $n_i$ denotes the word-length of stage $i$, $\mu_c$ denotes the delay of 1-bit carry propagation and $\mu_{mux}$ denotes the delay of a multiplexer.

$$d_i = n_i \cdot \mu_c + (i+1) \cdot \mu_{mux} \tag{1}$$

In a timing-driven design environment, the delays of all CSA stages are set to be approximately equal in order to achieve the fastest operation. In this case we obtain (2).

$$d_0 = d_1 = \dots = d_{c-1} \tag{2}$$

Substituting (1) into (2) yields (3), which gives the word-length of stage $i$. Note that the word-lengths of stages $c-1$ and $c-2$ are identical, since the least significant stage is built from an RCA.

$$n_i = \begin{cases} n_0 - i \cdot \frac{\mu_{mux}}{\mu_c}, & \text{if } i \in [0, c - 2] \text{ and } c > 2\\ n_0 - (c - 2) \cdot \frac{\mu_{mux}}{\mu_c}, & \text{if } i = c - 1 \text{ and } c \geqslant 2 \end{cases} \tag{3}$$
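As a minimal sketch of (3), the helper below lists the per-stage word-lengths for a given $n_0$, stage count $c$ and ratio $\mu_{mux}/\mu_c$. The function name and the example values are illustrative assumptions rather than part of the original model, and the sketch assumes that the first case of (3) covers stages $0$ to $c-2$. Fractional results are left unrounded.

```python
def stage_wordlengths(n0, c, ratio):
    """Per-stage word-lengths n_0 .. n_{c-1} following (3).

    n0    -- word-length of the most significant stage
    c     -- number of stages (c >= 2, as required by (3))
    ratio -- mu_mux / mu_c
    Assumption: the first case of (3) holds for stages 0 .. c-2.
    """
    if c < 2:
        raise ValueError("(3) is only defined for c >= 2")
    lengths = [n0 - i * ratio for i in range(c - 1)]  # stages 0 .. c-2
    lengths.append(n0 - (c - 2) * ratio)              # stage c-1 equals stage c-2
    return lengths


# Illustrative values only (not measurements from the paper).
print(stage_wordlengths(16, 3, 2))  # [16, 14, 14]
```

The last two entries are equal, which mirrors the remark that stages $c-1$ and $c-2$ share the same word-length.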

K. Shi, D. Boland and G. A. Constantinides are with the Department of Electrical and Electronic Engineering, Imperial College London, London, UK. Manuscript received April 19, 2005; revised December 27, 2012.

Therefore, the total word-length of the CSA is given by (4).

$$n_{CSA} = \sum_{i=0}^{c-1} n_i = cn_0 - \frac{\mu_{mux}}{\mu_c} \cdot \frac{(c+1)(c-2)}{2} \tag{4}$$
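For reference, the closed form in (4) can be recovered by summing (3) term by term. This is a worked sketch that assumes the first case of (3) covers stages $0$ to $c-2$:

$$\begin{aligned} n_{CSA} &= \sum_{i=0}^{c-2}\left(n_0 - i \cdot \frac{\mu_{mux}}{\mu_c}\right) + \left(n_0 - (c-2) \cdot \frac{\mu_{mux}}{\mu_c}\right) \\ &= c n_0 - \frac{\mu_{mux}}{\mu_c}\left[\frac{(c-1)(c-2)}{2} + (c-2)\right] \\ &= c n_0 - \frac{\mu_{mux}}{\mu_c} \cdot \frac{(c+1)(c-2)}{2} \end{aligned}$$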

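Conventionally, the word-lengths of both the RCA and the CSA must be truncated in order to meet timing. Equating the RCA delay with the delay of the most significant CSA stage, given by (1) with $i=0$, under the same timing constraint yields (5), where $n_{RCA}$ is determined through (xxx).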

$$\mu_c \cdot n_{RCA} = \mu_c n_0 + \mu_{mux} \tag{5}$$

Based on (4) and (5) we form the relationship between the word-length of CSA and RCA under a given timing constraint, as presented in (6).

$$n_{CSA} = c \cdot n_{RCA} - \frac{\mu_{mux}}{\mu_c} \cdot \frac{(c+2)(c-1)}{2} \tag{6}$$
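The substitution behind (6) can be checked numerically. The sketch below evaluates (6) directly and compares it with (4) after replacing $n_0$ by $n_{RCA} - \mu_{mux}/\mu_c$ from (5). Function names and input values are illustrative assumptions, not figures from the experiments.

```python
def csa_wordlength(n_rca, c, ratio):
    """Total CSA word-length from (6) for a c-stage CSA.

    n_rca -- maximum RCA word-length under the same timing constraint
    ratio -- mu_mux / mu_c
    """
    return c * n_rca - ratio * (c + 2) * (c - 1) / 2


def csa_wordlength_via_n0(n_rca, c, ratio):
    """Same quantity, obtained by substituting n_0 from (5) into (4)."""
    n0 = n_rca - ratio  # (5) rearranged: n_0 = n_RCA - mu_mux / mu_c
    return c * n0 - ratio * (c + 1) * (c - 2) / 2


# Illustrative values only (not measurements from the paper).
for c in (1, 2, 3):
    assert csa_wordlength(10, c, 2) == csa_wordlength_via_n0(10, c, 2)
print(csa_wordlength(10, 1, 2))  # 10.0, i.e. n_CSA equals n_RCA for c = 1
print(csa_wordlength(10, 3, 2))  # 20.0
```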

It can be seen that $c=1$ leads to $n_{CSA}=n_{RCA}$. This is because the least significant stage of the CSA is built from an RCA. In addition, combining (3) and (5) with the requirement that $n_{c-1}>0$ bounds the number of stages, as given by (7).

$$c < n_{RCA} \cdot \frac{\mu_c}{\mu_{mux}} + 1 \tag{7}$$
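A short sketch of how the bound in (7) might be evaluated is given below. It returns the largest integer $c$ that satisfies the strict inequality and checks that the last stage still has a positive word-length. The function names and the example inputs are assumptions made for illustration only.

```python
import math


def max_stages(n_rca, ratio):
    """Largest integer c with c < n_rca / ratio + 1, following (7).

    n_rca -- maximum RCA word-length under the timing constraint
    ratio -- mu_mux / mu_c
    """
    if n_rca <= 0 or ratio <= 0:
        raise ValueError("n_rca and ratio must be positive")
    bound = n_rca / ratio + 1
    return math.ceil(bound) - 1  # strict inequality: step below an integer bound


def last_stage_wordlength(n_rca, c, ratio):
    """n_{c-1} from (3) with n_0 taken from (5); valid for c >= 2."""
    return n_rca - (c - 1) * ratio


# Illustrative values only (not measurements from the paper).
c_max = max_stages(20, 8)
print(c_max)                                # 3
print(last_stage_wordlength(20, c_max, 8))  # 4, still positive
```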

## C. Model Verification

As seen in (6), the value of $\mu_{mux}/\mu_c$ must be determined before applying the timing model. This is achieved by keeping $i=0$ in (1) while varying the value of $n_0$ and recording the corresponding $d_0$ from the timing analysis tool. The ratio $\mu_{mux}/\mu_c$ is then obtained by fitting these values, as presented in Fig. 1. Our experimental results show that $\mu_{mux}/\mu_c \approx 8$.



Fig. 1. Fitted curve of (1) for the most significant stage (i = 0), based on the delay value obtained through the Xilinx Timing Analyzer.
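The fitting step can be sketched as an ordinary least-squares line fit, since (1) with $i=0$ reduces to $d_0 = \mu_c \cdot n_0 + \mu_{mux}$: the slope estimates $\mu_c$ and the intercept estimates $\mu_{mux}$. The source does not state which fitting procedure was used, so least squares is an assumption here. The data below are synthetic, generated from assumed coefficients purely to exercise the code, and are not the delays reported by the timing analyzer.

```python
def fit_delay_model(wordlengths, delays):
    """Least-squares fit of d_0 = mu_c * n_0 + mu_mux.

    Returns (mu_c, mu_mux): the slope and intercept of the fitted line.
    """
    m = len(wordlengths)
    if m < 2 or m != len(delays):
        raise ValueError("need at least two matching (n_0, d_0) samples")
    mean_n = sum(wordlengths) / m
    mean_d = sum(delays) / m
    sxx = sum((n - mean_n) ** 2 for n in wordlengths)
    if sxx == 0:
        raise ValueError("word-lengths must not all be equal")
    sxy = sum((n - mean_n) * (d - mean_d) for n, d in zip(wordlengths, delays))
    mu_c = sxy / sxx
    mu_mux = mean_d - mu_c * mean_n
    return mu_c, mu_mux


# Synthetic samples from assumed coefficients (hypothetical, in arbitrary time units).
assumed_mu_c, assumed_mu_mux = 0.05, 0.4
n0_values = [4, 8, 12, 16]
d0_values = [assumed_mu_c * n + assumed_mu_mux for n in n0_values]

mu_c, mu_mux = fit_delay_model(n0_values, d0_values)
print(round(mu_mux / mu_c, 3))  # ratio mu_mux / mu_c, close to 8 for this synthetic data
```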

Using this information, we verify our model against the results obtained through post place and route simulations on a Xilinx Virtex-6 FPGA. Fig. 2 shows both the modelled values and the experimental results for the maximum word-length of the RCA and the 2-stage CSA under a given operating frequency. Note that the maximum input word-length is 16 bits, hence the modelled value is set to 16 if it exceeds this limit.



Fig. 2. Model Verification.

It can be seen that our model provides slightly more conservative results than the experimental data, especially at higher operating frequencies. This is because the model coefficients are obtained from timing analysis, which is designed to ensure correct functionality across a range of operating conditions. In addition, routing delay may increase the overall delay, whereas our models consider logic delay only.

## D. Area Overhead

Fig. 3 shows the maximum word-length for the RCA and for the CSA with 2 and 3 stages, respectively, under a range of operating frequencies. We investigate at most 3 stages, since the maximum number of stages predicted by (7) is 3. Compared to the RCA, the CSA achieves a greater word-length as the frequency is initially increased. This means that the CSA yields a smaller truncation error for a given frequency. The RCA outperforms the CSA only at very high frequencies, where the word-length of both structures is small. In addition, the word-length of the 3-stage CSA is always greater than that of the 2-stage CSA across the entire frequency domain, as expected.

However, the accuracy benefits brought by the CSA come at the cost of a large area overhead. Fig. 4 depicts the resource usage, in terms of the number of Look-Up Tables (LUTs), of all three structures. It can be seen that the 3-stage CSA consumes $2.4\times$ to $3.7\times$ the area of the RCA, while the 2-stage CSA consumes $1.7\times$ to $3.1\times$.

## E. Exploring Trade-offs Between Accuracy, Performance and Area

Although advanced architectures such as the CSA inherently offer better performance than the basic structure, they incur a large area overhead. In the following experiments, the trade-offs between accuracy, performance and area are explored. If the available hardware resources are limited, the full word-length of both the CSA and the RCA might not be implemented. This potentially generates truncation errors. For instance, the number of available LUTs is set to 45, 35, 25 and 15, respectively, while timing requirements are met by reducing the input word-length of each architecture. In addition, we also investigate the scenario where the RCA is implemented with the maximum possible word-length under the given area constraint, and the frequency requirement is met by overclocking. The corresponding error expectations of these scenarios are depicted in Fig.xxx. The optimal design method, which achieves the minimum error expectation, is labelled.

Fig. 3. Maximum word-length for RCA and CSA across a variety of frequencies.

Fig. 4. Area overhead for RCA and CSA.

If there is no area limitation, the CSA with a higher stage number is the optimal design choice unless very high frequencies are applied. This is because, in comparison to the RCA, a long carry chain is divided into multiple overlapped sections, and the higher the stage number, the shorter the carry chain of each stage. When the frequency increases, however, the multiplexer delay becomes comparable to the carry chain delay, and this limits the maximum word-length of the CSA. It can be seen that the overclocked RCA outperforms the other designs at higher frequencies.

For a tighter area budget, only part of the complex structures can be implemented, while the simple structure still keeps full precision. As can be seen in Fig.xxx and Fig.xxx, area rather than timing becomes the dominant constraint when the frequency is initially increased. This leads to the truncation of the word-length of both the 2-stage and the 3-stage CSA. For an even more stringent area constraint, the word-length of the RCA is limited, while the CSA with high stage numbers cannot be implemented, as shown in Fig.xxx. In this situation the overclocked RCA achieves the best accuracy across the whole frequency domain.

Fig. 5. LUT=45

Fig. 6. LUT=35

Fig. 7. LUT=25

Fig. 8. LUT=15

In general, the error expectations at the output of all four design methods for a variety of timing and area constraints are shown in Fig. 9.



Fig. 9. Trade-offs between accuracy, performance and area for 4 design methodologies.
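The selection of the optimal design under an area budget can be sketched as a filter followed by a minimum. The sketch assumes that each candidate has already been sized for one operating frequency, so that its LUT count and error expectation are known. The function name, the tuple layout and all numbers below are hypothetical placeholders, not results from the experiments.

```python
def pick_design(candidates, lut_budget):
    """Return the candidate with the smallest error expectation that fits the budget.

    candidates -- iterable of (name, luts, error_expectation) tuples
    lut_budget -- number of available LUTs
    Returns None when no candidate fits.
    """
    feasible = [cand for cand in candidates if cand[1] <= lut_budget]
    if not feasible:
        return None
    return min(feasible, key=lambda cand: cand[2])


# Hypothetical placeholder values for a single operating frequency.
candidates = [
    ("RCA", 16, 0.30),
    ("2-stage CSA", 30, 0.10),
    ("3-stage CSA", 42, 0.05),
]
print(pick_design(candidates, 35))  # ('2-stage CSA', 30, 0.1)
print(pick_design(candidates, 10))  # None
```

Repeating the selection for each frequency and each LUT budget would give one labelled optimum per operating point, in the spirit of the comparison summarised in Fig. 9.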

