

# Lab 2 Report Integrated Systems Architecture

Master's degree in Computer Engineering

Authors: ISA36

Nicole Dai Prà s274501, Leonardo Izzi s278564

December 7, 2020

# Contents

| 1 | Introduction                               |
|---|--------------------------------------------|
| 2 | FP Multiplier                              |
|   | 2.1 Model Verification                     |
|   | 2.2 Synthesis                              |
|   | 2.3 Fine-grain Pipelining and Optimization |
| 3 | MBE Multiplier                             |
|   | 3.1 MBE Implementation                     |
|   | 3.1.1 Dadda Tree                           |
|   | 3.2 Synthesis                              |
|   | 3.3 Conclusions                            |

### CHAPTER 1

# Introduction

The Lab 2 assignment consisted of several synthesis experiments on a floating-point multiplier and of the development of an unsigned integer multiplier, based on Booth's algorithm and a Dadda tree, to be used within the floating-point multiplier.

As required, there is a GitHub repository available at the following link: https://github.com/leoizzi/isa\_labs/tree/main/lab2.

The folder is organized as follows:

- fpuvhdl, the folder containing all the VHDL files of both the floating-point and the unsigned integer multipliers.
- lab2\_report.pdf, this file.
- report, the folder containing the LaTeX files of the report.
- sim, the folder where all the simulation scripts are stored.
- syn, the folder where all the synthesis scripts, as well as the reports, are saved.
- tb, the folder containing the testbench files.
- dadda.py, a Python script that generates the VHDL instantiation of the Dadda tree.

### CHAPTER 2

# FP Multiplier

## 2.1 Model Verification

Before starting any work on the multiplier, we verified that it worked as intended. To this end, we created a testbench, tb\_fpumult.vhd, in which the DUT computes the square of the numbers stored in fp\_samples.hex and each result is compared against the corresponding value stored in fp\_prod.hex. We then added the required input registers and verified again that the results were correct.

## 2.2 Synthesis

We performed several synthesis runs to analyze the differences between implementations and constraints. To do this, we wrote three synthesis scripts: syn\_script.tcl, syn\_script\_csa.tcl and syn\_script\_pparch.tcl. The first synthesizes the multiplier leaving all the implementation choices to the Synopsys tool, the second forces the use of CSA-based multipliers, and the third forces the use of parallel-prefix-based multipliers. The requested results are shown in table 2.1.

| Multiplier architecture   | $T_{ck}$ | Area                |
|---------------------------|----------|---------------------|
| Chosen by the synthesizer | 1.6 ns   | $3999.04 \ \mu m^2$ |
| CSA                       | 4.6 ns   | $4807.68 \ \mu m^2$ |
| PPArch                    | 4.5 ns   | $3734.91 \ \mu m^2$ |

Table 2.1: Results for different architectures

The best performance is obtained when the synthesizer is allowed to choose the implementation by itself. According to the resources report, the synthesizer chooses the parallel-prefix multiplier, although it is optimized differently from the one used in the synthesis done with syn\_script\_pparch.tcl: the former is optimized for both area and speed, while the latter is optimized only for area. For an area increase of only about 7% (from $3734.91 \ \mu m^2$ to $3999.04 \ \mu m^2$), the clock period drops from 4.5 ns to 1.6 ns, a reduction of about 64%. Hence, in a real application there is no doubt about which implementation we would choose.
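The trade-off discussed above can be rechecked directly from the figures in table 2.1. The following is a minimal sketch; the dictionary keys and the helper name `relative_change` are illustrative and do not come from the synthesis scripts.

```python
# Values copied from table 2.1: (clock period in ns, area in um^2).
# The key names are illustrative labels, not identifiers used in the lab.
results = {
    "synthesizer": (1.6, 3999.04),
    "csa": (4.6, 4807.68),
    "pparch": (4.5, 3734.91),
}


def relative_change(new, old):
    """Return the change from old to new as a percentage of old."""
    return 100.0 * (new - old) / old


t_auto, a_auto = results["synthesizer"]
t_pp, a_pp = results["pparch"]

# Area cost of the synthesizer-chosen multiplier with respect to PPArch.
area_increase = relative_change(a_auto, a_pp)
# Clock period saved by the synthesizer-chosen multiplier, as a positive number.
period_reduction = -relative_change(t_auto, t_pp)
# Equivalent ratio between the two maximum clock frequencies.
speedup = t_pp / t_auto

print(f"area increase:    {area_increase:.1f} %")
print(f"period reduction: {period_reduction:.1f} %")
print(f"frequency ratio:  {speedup:.2f}x")
```

Expressing the gain both as a period reduction and as a frequency ratio avoids ambiguity about what a percentage "boost" refers to.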

## 2.3 Fine-grain Pipelining and Optimization

We added, as requested, the register after the significands' multiplier in the second stage, as well as all the registers required to maintain the correct timing. We verified the correctness of the updated design with the tb\_fpumult\_reg.vhd testbench. We then ran two syntheses, one with the compile\_ultra command (syn\_script\_comp\_ultra.tcl) and one with a plain compile followed by optimize\_registers (syn\_script\_opt\_reg.tcl). The results are summarized in table 2.2.

| Synthesis commands           | $T_{ck}$ | Area                |
|------------------------------|----------|---------------------|
| compile + optimize_registers | 0.8 ns   | $4969.41 \ \mu m^2$ |
| compile_ultra                | 1.5 ns   | $4216.1 \ \mu m^2$  |

Table 2.2: Results for different optimization techniques

The synthesis with retiming reaches twice the clock frequency of the plain compile run of the previous section (0.8 ns versus 1.6 ns) and almost twice that of the compile\_ultra run (0.8 ns versus 1.5 ns). This shows the value of the optimization techniques we have studied. However, it suffers from an area increase of about 18% with respect to the compile\_ultra synthesis, probably because retiming adds registers in order to keep the register count non-negative on every arc of the graph. Nevertheless, its speed is outstanding.

### CHAPTER 3

# MBE Multiplier

## 3.1 MBE Implementation

We implemented an unsigned integer multiplier using the radix-4 Modified Booth Encoding (MBE). We generate the partial products as shown in modified-booth.pdf, without using any adder or subtractor. To reduce the height of the partial products' tree (and hence the number of HAs and FAs used), we organized the bits as explained in sign\_extension\_booth\_multiplier\_Stanford.pdf. Unfortunately, this structure does not seem very regular from the perspective of the Dadda tree implementation, and we found it difficult to express the algorithm with the for generate statement. Writing the whole tree by hand was not an option either, due to the tree size. Therefore, we developed a Python script that generates the tree in VHDL using only wire connections and FA and HA instantiations.
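As an illustration of the encoding, the following minimal sketch recodes an unsigned multiplier into radix-4 Booth digits and checks that the recoded product matches ordinary multiplication. It is an arithmetic model only: the function names are hypothetical, and it does not reproduce the VHDL, where negative multiples are obtained without adders or subtractors and the sign-extension bits are arranged as in the referenced notes.

```python
def booth_radix4_digits(b, n_bits):
    """Recode the unsigned n_bits-wide integer b into radix-4 Booth digits.

    Digit i is b[2i-1] + b[2i] - 2*b[2i+1], with b[-1] = 0 and zeros above
    the MSB, so every digit lies in {-2, -1, 0, 1, 2}.
    """
    padded = b << 1              # append the implicit 0 below the LSB
    n_digits = n_bits // 2 + 1   # one extra digit because b is unsigned
    digits = []
    for i in range(n_digits):
        triplet = (padded >> (2 * i)) & 0b111
        low = triplet & 1
        mid = (triplet >> 1) & 1
        high = (triplet >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits


def mbe_multiply(a, b, n_bits):
    """Multiply a by b by summing one partial product per Booth digit."""
    product = 0
    for i, digit in enumerate(booth_radix4_digits(b, n_bits)):
        product += digit * a * 4 ** i  # partial product weighted by 4^i
    return product


print(booth_radix4_digits(15, 4))  # [-1, 0, 1], i.e. -1 + 0*4 + 1*16 = 15
print(mbe_multiply(13, 11, 4))     # 143
```

The extra digit is what distinguishes the unsigned case: without it, a multiplier with its MSB set would be interpreted as a negative number.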

### 3.1.1 Dadda Tree

The Python script is fairly simple. First, the function gen\_start\_matrix generates the initial partial products matrix, where a 1 in position (i, j) indicates that a dot is present and a 0 that it is absent. Then, the function count\_levels calculates how many levels the Dadda tree has, given the number of partial products, and how many dots are allowed in a column at each level.
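The per-level limits follow the usual Dadda sequence 2, 3, 4, 6, 9, 13, ..., where each term is the previous one multiplied by 1.5 and rounded down. The sketch below shows one way to derive them; `dadda_heights` is a hypothetical helper written for this example and is not the count\_levels function of dadda.py, whose exact signature is not shown in this report.

```python
def dadda_heights(n_rows):
    """Return the maximum column height allowed after each reduction level.

    The list is ordered from the first level to the last one, and its
    length is the number of levels of the tree. n_rows is the height of
    the initial partial products matrix.
    """
    if n_rows <= 2:
        return []  # two rows can be fed directly to the final adder
    heights = [2]
    # Grow the sequence d -> floor(1.5 * d) while it stays below n_rows.
    while heights[-1] * 3 // 2 < n_rows:
        heights.append(heights[-1] * 3 // 2)
    return heights[::-1]


print(dadda_heights(8))   # [6, 4, 3, 2]
print(dadda_heights(13))  # [9, 6, 4, 3, 2]
```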

Next, we generate an array for each level in which we keep the count of how many dots have been placed in each column. This is used to find out when a HA or a FA must be instantiated. The array is accompanied by a per-level matrix, used for debugging purposes and, in the last tree level, to find out where a ground connection must be placed.

The main algorithm works as follows:

- For each tree level *l*, it passes over all the columns of cnt\_matrix[l].
- For the j-th column, it computes the difference *diff* between the number of dots to be placed (the dots present in the current column of the current level, plus any carries coming from the FAs and HAs of column j-1) and the maximum number of dots allowed in the next level.
- If  $diff \leq 0$  all the dots can simply be propagated to the next level, otherwise HAs and/or FAs must be allocated.
- The allocation process first adds as many FAs as possible, then as many HAs as needed. This works because if  $diff \ge 2$  a higher compression is required, hence it is cheaper in terms of HW to insert FAs. All the remaining dots (if any) are then propagated to the next level.
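The steps above can be sketched as follows. This is a simplified model under stated assumptions, not the code of dadda.py: it only counts dots, FAs and HAs per column and generates no VHDL, the function names are hypothetical, and the input is a plain 8x8 AND-array shape chosen for illustration rather than the MBE matrix used in the lab.

```python
def reduce_level(heights, max_height):
    """Compress one tree level; heights[j] is the dot count of column j (LSB first)."""
    next_heights, adders = [], []
    carries = 0  # carries produced by the FAs and HAs of column j-1
    for dots in heights:
        total = dots + carries
        diff = total - max_height
        # A FA removes two dots from the column, a HA removes one.
        n_fa, n_ha = (diff // 2, diff % 2) if diff > 0 else (0, 0)
        next_heights.append(total - 2 * n_fa - n_ha)
        adders.append((n_fa, n_ha))
        carries = n_fa + n_ha  # each adder sends one carry to column j+1
    if carries:  # carries leaving the most significant column open a new one
        next_heights.append(carries)
        adders.append((0, 0))
    return next_heights, adders


def reduce_tree(heights, targets):
    """Apply reduce_level once per target height and total the adders used."""
    total_fa = total_ha = 0
    for max_height in targets:
        heights, adders = reduce_level(heights, max_height)
        total_fa += sum(fa for fa, _ in adders)
        total_ha += sum(ha for _, ha in adders)
    return heights, total_fa, total_ha


# Illustrative input: column heights 1, 2, ..., 8, ..., 2, 1 of an 8x8 AND array.
columns_8x8 = [min(j + 1, 15 - j) for j in range(15)]
final_heights, n_fa, n_ha = reduce_tree(columns_8x8, [6, 4, 3, 2])
print(final_heights)  # no column is taller than 2
print(n_fa, n_ha)     # 35 7
```

Allocating `diff // 2` FAs and `diff % 2` HAs reflects the "FAs first, then HAs" rule: each column is reduced exactly to the allowed height with the fewest adders.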

The script writes the VHDL instantiation to a file, whose content has been copied directly into dadda.vhd, since we used an entity to encapsulate the auto-generated code.

## 3.2 Synthesis

We performed synthesis with compile, compile\_ultra and compile + optimize\_registers. The results are summarized in table 3.1.

| Synthesis commands           | $T_{ck}$ | Area                |
|------------------------------|----------|---------------------|
| compile                      | 2.6 ns   | $5497.42 \ \mu m^2$ |
| compile_ultra                | 1.6 ns   | $5359.37 \ \mu m^2$ |
| compile + optimize_registers | 0.8 ns   | $8050.22 \ \mu m^2$ |

Table 3.1: MBE synthesis results

As can be seen, compile + optimize\_registers still delivers the best result in terms of timing. However, this time it comes with a non-negligible area increase with respect to compile\_ultra: halving the clock period (from 1.6 ns to 0.8 ns) increases the area by about 50% (from $5359.37 \ \mu m^2$ to $8050.22 \ \mu m^2$).

If we compare these results with the ones shown in table 2.2, we can see that the timing performance is similar. The area, however, has increased, especially in the compile + optimize\_registers case.

## 3.3 Conclusions

Given these results, it seems advantageous to let the synthesizer choose the architecture when using a standard compile command. However, when the best performance is required and area does not matter, the optimize\_registers command yields outstanding clock periods, unmatched even by compile\_ultra.