# SEAI\_2024\_R12

Giulio Capecchi, Jacopo Niccolai December 2024



# Contents

- 1 Introduction
- 2 Project Description
  - 2.1 Workflow Overview
  - 2.2 Jupyter Notebooks
  - 2.3 C Implementation
  - 2.4 Loading the project into the Vitis Unified IDE
- 3 Multi-Layer Perceptron (MLP)
  - 3.1 Dataset
  - 3.2 PyTorch Model
    - 3.2.1 Model Architecture
    - 3.2.2 Model Training
    - 3.2.3 Exporting Parameters
  - 3.3 C Implementation
    - 3.3.1 MLP Structure
    - 3.3.2 Forward Pass
    - 3.3.3 Testbench
  - 3.4 Results
    - 3.4.1 Stages of Development in the Vitis Unified IDE
    - 3.4.2 Performance Metrics
    - 3.4.3 RTL Synthesis and Fail Fast Analysis
    - 3.4.4 Place & Route and Fail Fast Analysis
- 4 Convolutional Neural Network (CNN)
  - 4.1 Dataset
  - 4.2 PyTorch Model
    - 4.2.1 Model Architecture
    - 4.2.2 Model Training
    - 4.2.3 Exporting Parameters
  - 4.3 C Implementation
    - 4.3.1 CNN Structure
    - 4.3.2 Forward Pass
    - 4.3.3 Considerations on Pragmas and Forward-pass Code
    - 4.3.4 Testbench
  - 4.4 Results
    - 4.4.1 Performance Metrics
    - 4.4.2 RTL Synthesis and Fail Fast Analysis
    - 4.4.3 Place & Route and Fail Fast Analysis

# 1 Introduction

This project focuses on the synthesis of the forward pass for two types of neural network architectures: a Multilayer Perceptron (MLP) and a Convolutional Neural Network (ConvNet), implemented on an FPGA (Field Programmable Gate Array). To achieve this, the network parameters were first obtained using Python and the *PyTorch* library; these were subsequently hard-coded into C code, enabling the hardware synthesis process.



Figure 1: Xilinx Vitis HLS

Vitis Unified Software Platform is a comprehensive suite designed to accelerate the development of applications on FPGAs, Adaptive SoCs, and ACAPs (Adaptive Compute Acceleration Platforms). By combining high-level software programming techniques with hardware-optimized implementations, Vitis enables developers to write applications in C, C++, or OpenCL while leveraging hardware-specific optimizations for enhanced performance. In this project, Vitis was used to synthesize neural network architectures - MLP and ConvNet - onto an FPGA. Its **High-Level Synthesis (HLS)** tools allow for rapid prototyping and optimization of the C code, ensuring efficient resource utilization, parallelism and low-latency execution. The platform's ability to integrate high-level design, simulation, and hardware synthesis simplifies the workflow, bridging the gap between software and hardware development.

# 2 Project Description

## 2.1 Workflow Overview

The project follows a structured workflow. For each neural network architecture, a Jupyter Notebook is provided in the PyTorch folder. As the folder name suggests, the networks were built and trained using the PyTorch library. Once trained, the weights and biases were exported to be hardcoded into the corresponding C implementation. The C code was designed to be compatible with FPGA synthesis tools, such as Vitis HLS/Vivado, and can be found inside the HLS-Implementation folder.

## 2.2 Jupyter Notebooks

For each network architecture, a Jupyter Notebook was created to train the model and export the parameters. These notebooks are similar to each other since they contain almost the same steps for each neural network. Their main sections are:

- Dataset Preparation: Loading the chosen dataset to train the model.
- Model Definition: Defining the neural network architecture.
- Model Training: Training the model on the dataset, ensuring that it converges to a satisfactory accuracy.
- Exporting Parameters: Extracting the weights and biases for the forward pass to a text file.



Figure 2: Jupyter Notebooks logo

## 2.3 C Implementation

As with the Jupyter Notebooks, the C code is structured in a similar way for each architecture. There are always *three files* provided:

- architecture.c: contains the forward pass function and the definition of the architecture.
- architecture.h: contains the definition of the architecture and the prototype of the forward pass function.
- testbench.c: reads the dataset and contains the main function to test the forward pass function.

To optimize the C implementation for hardware synthesis, specific **HLS directives** were applied to critical portions of the code. These directives guide the High-Level Synthesis (HLS) tool to produce more efficient hardware designs by controlling resource allocation, loop unrolling, and pipeline creation. The two main directives used in this project are:

- HLS INLINE: This directive inserts the body of a function directly at each point where it is called, so the function no longer exists as a separate level of the hardware hierarchy. This removes function-call overhead and reduces latency, but it may increase the hardware area, since the logic is replicated at every call site. It is typically used for simple or frequently invoked functions.
- HLS PIPELINE: This directive breaks a loop or function into multiple stages (pipeline), allowing multiple iterations or operations to execute simultaneously, thereby increasing the design's throughput. It enables the processing of new iterations in every clock cycle (or at specific intervals called *initiation interval (II)*). The options for this directive include II=N (to specify the interval between iterations, such as 1 clock cycle) and rewind (to automatically restart the loop after completion). It is typically used in loops that process large amounts of data to maximize throughput in repetitive operations.
- Other Directives Used: Several other HLS directives are employed in the implementation to further optimize performance and resource utilization. They will be explained in detail at the points where they appear in the code.
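To make the effect of the initiation interval more concrete, the sketch below models the cycle count of a loop with and without pipelining. It uses the common first-order estimate (pipelined cycles ≈ depth + II × (N − 1)). The function names and the example numbers (10 iterations of 4 cycles each) are illustrative assumptions, not values taken from this design.

```python
def sequential_cycles(trip_count: int, depth: int) -> int:
    """Cycles when each iteration must finish before the next one starts."""
    return trip_count * depth


def pipelined_cycles(trip_count: int, depth: int, ii: int = 1) -> int:
    """First-order estimate for a pipelined loop.

    The first iteration needs `depth` cycles; every later iteration
    starts `ii` cycles after the previous one.
    """
    if trip_count <= 0:
        return 0
    return depth + ii * (trip_count - 1)


# Hypothetical loop: 10 iterations, each taking 4 cycles
print(sequential_cycles(10, 4))        # 40
print(pipelined_cycles(10, 4, ii=1))   # 13
print(pipelined_cycles(10, 4, ii=2))   # 22
```

Under this model, a smaller II brings the total close to one cycle per iteration, which is why the II reported by the tool is a useful first indicator of loop throughput.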

It is important to note that the directives discussed in this section represent only a subset of the directives applied by the HLS compiler during the synthesis process. The ones listed here were explicitly added to address specific warnings and improve the design's performance. They were placed on the exact lines of code where they were required to resolve issues or optimize critical sections of the implementation.

**Hardcoding Network Parameters** The network's weights and biases, extracted from the trained PyTorch model, were directly integrated into the C implementation as hardcoded values. This allows the FPGA hardware to access the pre-trained parameters directly during the forward pass, avoiding external memory accesses. The process of exporting these parameters is detailed in Section 3.2.3. After extraction, the parameters were organized in the C code as arrays within dedicated data structures. These structures and their integration will be explained in detail later on for each architecture.

## 2.4 Loading the project into the Vitis Unified IDE

The project was loaded into the Vitis Unified IDE to synthesize the C code and generate the corresponding hardware description. The IDE provides a comprehensive environment for developing, debugging, and deploying applications on Xilinx devices. The following steps have to be followed to load the project into the Vitis Unified IDE:

1. Open the Vitis Unified IDE and under HLS Development select Create component....
2. Click on Create Empty HLS Component. A new window should appear, where you can specify the component name and location.
3. You will then be asked about the configuration file: you can either provide one or click on *next* and let the IDE create one for you.
4. On the following page, specify in the Top Function field the name of the function that you want to synthesize: this should be the one that you want to test on the FPGA. In our case, it is the forward pass function (the name is forward for all the networks we worked on).
5. Click on next and select a device: it should have enough resources to contain the design.
6. Click on next until you reach the finish button. Click on it to create the project.

Once done, you should see on the left a structure similar to the one shown in Figure 3. Inside Settings you can find the configuration file for the project, but the simplest way to add the necessary files to the project is to right-click on Sources and then add the C and header files. Do the same for the testbench, where you should add the txt file containing the dataset and the testbench.c file specific to the current component. To "run" the project, you can then use the buttons at the bottom of this section (Run, C Synthesis, C/RTL Cosimulation, Package, Implementation). Their usage will be explained later on.



Figure 3: IDE Structure

# 3 Multi-Layer Perceptron (MLP)

Let's analyze the implementation of the forward pass for a Multi-Layer Perceptron. The forward pass for an MLP consists of propagating the input through a series of layers, each followed by an activation function. The code for the model can be found inside the PyTorch folder, which contains the notebooks used to train the models and export their parameters.

## 3.1 Dataset

The MLP was trained using the well-known *Iris* dataset, which contains 150 samples of iris flowers, each with four features and a class label (the last value of each row). There is a total of three classes: *setosa*, *versicolor*, and *virginica*. The dataset was split into training and test sets, with 80% of the samples used for training and 20% for testing.

An example of the dataset is shown below:

```
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.5,1.3,versicolor
6.1,2.6,5.6,1.4,virginica
...
```

The dataset's labels were encoded as integers using the Scikit-learn LabelEncoder class, which maps each class to a unique integer value. To facilitate its use in the C implementation, this encoded version of the dataset was saved to a txt file, *iris\_dataset\_encoded.txt*.
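As an illustration of the encoding step, the sketch below reproduces the mapping in plain Python, without Scikit-learn. It relies on the fact that LabelEncoder numbers the classes in sorted order; the helper name `encode_labels` is hypothetical and is not part of the notebooks.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


species = ["setosa", "versicolor", "virginica", "setosa"]
encoded, classes = encode_labels(species)
print(classes)   # ['setosa', 'versicolor', 'virginica']
print(encoded)   # [0, 1, 2, 0]
```

With this ordering, the class index returned by the forward pass can be read as 0 for setosa, 1 for versicolor, and 2 for virginica.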

## 3.2 PyTorch Model

### 3.2.1 Model Architecture

The architecture of the MLP model consists of three fully connected layers. The input layer has 4 neurons, corresponding to the 4 features of the Iris dataset. The first and second hidden layers each have 10 neurons, while the output layer has 3 neurons, corresponding to the 3 classes of the Iris dataset. The ReLU activation function is applied after each dense layer, except for the output layer.

The ReLU function is defined as:

$$\mathrm{ReLU}(x) = \max(0, x)$$

The forward pass of the model applies the ReLU activation function after the first and second layers.

The model was implemented using the PyTorch library as follows, by defining a custom class MLP that inherits from nn.Module:

```
import torch
import torch.nn as nn

# Define the MLP model
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(4, 10)
        self.fc2 = nn.Linear(10, 10)
        self.fc3 = nn.Linear(10, 3)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # Apply ReLU after first layer
        x = torch.relu(self.fc2(x))  # Apply ReLU after second layer
        x = self.fc3(x)              # Output layer (no activation)
        return x
```

### 3.2.2 Model Training

For the training phase, we first defined the model, loss function, and optimizer. We utilized the *CrossEntropyLoss* loss function, which is commonly used for multi-class classification problems, and the *Adam* optimizer, which is an adaptive learning rate optimization algorithm.

```
import torch.optim as optim

model = MLP()

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
```

The following code snippet demonstrates the training process using PyTorch. The model is trained for 100 epochs.

```
# Training loop
NUM_EPOCHS = 100
for epoch in range(NUM_EPOCHS):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    # Evaluate the model after each epoch
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            # Get the class index with the highest probability
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = correct / total

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{NUM_EPOCHS}], '
              f'Loss: {running_loss/len(train_loader):.4f}, '
              f'Accuracy: {accuracy * 100:.2f}%')
```

The results of the training process are reported in the table below:

| Epoch | Loss   | Train Accuracy | Test Accuracy |
|-------|--------|----------------|---------------|
| 10    | 0.3258 | 90.48%         | 88.89%        |
| 20    | 0.1060 | 97.14%         | 97.78%        |
| 30    | 0.1312 | 95.24%         | 97.78%        |
| 40    | 0.0853 | 97.14%         | 100.00%       |
| 50    | 0.0675 | 98.10%         | 100.00%       |
| 60    | 0.0971 | 96.19%         | 97.78%        |
| 70    | 0.0952 | 96.19%         | 97.78%        |
| 80    | 0.1020 | 96.19%         | 97.78%        |
| 90    | 0.0929 | 96.19%         | 97.78%        |
| 100   | 0.0788 | 94.29%         | 97.78%        |

Table 1: Training loss, train accuracy, and test accuracy of the MLP over 100 epochs.

The training process shows that the model is able to achieve a high level of accuracy on both the training and test sets: this is to be expected given the simplicity of the Iris dataset and the effectiveness of the MLP architecture in solving such problems. Given the relatively small dataset and the fact that the model achieves near-perfect accuracy on the test set, we can conclude that the model is generalizing well.

### 3.2.3 Exporting Parameters

The trained model's parameters were exported to be hardcoded into the C implementation. The weights and biases of each layer were extracted and saved in a txt file, *mlp\_weights.txt*, as shown below:

```
import numpy as np

weights = {}
for name, param in model.named_parameters():
    # Move to the CPU first so this also works after training on a GPU
    weights[name] = param.detach().cpu().numpy()

# Print their shapes to verify the network architecture
for name, weight in weights.items():
    print(f"{name}: {weight.shape}")

with open('./mlp_weights.txt', 'w') as f:
    for name, weight in weights.items():
        f.write(f"// {name}, shape: {weight.shape}\n")
        if weight.ndim == 2:  # Fully connected layer weights
            for row in weight:
                f.write("{" + ", ".join(map(str, row)) + "},\n")
        elif weight.ndim == 1:  # Biases or 1D weights
            f.write("{" + ", ".join(map(str, weight)) + "},\n")
```
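To show what the exported text looks like, the sketch below applies the same formatting to plain Python lists. The helper `format_parameter` and the small example tensors are hypothetical stand-ins chosen for illustration; they are not the trained values.

```python
def format_parameter(name, values):
    """Return the text written for one parameter tensor.

    `values` is a list of rows for a 2D weight matrix, or a flat list
    for a 1D bias vector.
    """
    is_matrix = bool(values) and isinstance(values[0], (list, tuple))
    rows = values if is_matrix else [values]
    shape = (len(values), len(values[0])) if is_matrix else (len(values),)
    lines = [f"// {name}, shape: {shape}"]
    for row in rows:
        lines.append("{" + ", ".join(map(str, row)) + "},")
    return "\n".join(lines) + "\n"


# Hypothetical 2x3 weight matrix and 2-element bias vector
print(format_parameter("fc.weight", [[0.5, -1.0, 2.0], [1.5, 0.0, -0.25]]), end="")
print(format_parameter("fc.bias", [0.1, -0.2]), end="")
# // fc.weight, shape: (2, 3)
# {0.5, -1.0, 2.0},
# {1.5, 0.0, -0.25},
# // fc.bias, shape: (2,)
# {0.1, -0.2},
```

Each brace-delimited row already has the shape of a C array initializer row, which is what makes copying the values into the C source straightforward.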

## 3.3 C Implementation

The implementation for Vitis HLS of the MLP was produced with three files:

- mlp.c, which contains the forward pass function, the activation function, and the definition of the MLP structure.
- mlp.h, which contains the definition of the MLP structure and the forward pass function prototype.
- testbench.c, which reads the *iris\_dataset\_encoded.txt* and contains the main function to test the forward pass function.

### 3.3.1 MLP Structure

The MLP structure was defined as follows (inside mlp.h):

```
typedef struct {
    float weights[MAX_NEURONS][MAX_NEURONS]; // weight matrix of the layer
    float biases[MAX_NEURONS];               // biases of the layer
    float output[MAX_NEURONS];               // output of the layer
} Layer;

typedef struct {
    int num_layers;            // number of layers
    Layer layers[MAX_LAYERS];  // layers of the MLP (array of layers)
} MLP;
```

MAX\_NEURONS and MAX\_LAYERS are defined as 100 and 3, respectively.
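As a rough sense of scale, the sketch below compares the number of parameters the 4-10-10-3 network actually uses with the number of float elements declared by the structures above. It only counts array elements and makes no claim about how the synthesis tool maps or trims them; the variable names are illustrative.

```python
MAX_NEURONS = 100
MAX_LAYERS = 3
LAYER_SIZES = [4, 10, 10, 3]  # input, hidden, hidden, output

# Parameters the network actually uses: weights plus biases per layer
used = sum(n_in * n_out + n_out
           for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]))

# float elements declared by the Layer struct: weights + biases + output
declared = MAX_LAYERS * (MAX_NEURONS * MAX_NEURONS + 2 * MAX_NEURONS)

print(used)      # 193
print(declared)  # 30600
```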

### 3.3.2 Forward Pass

The forward pass for the MLP is implemented in mlp.c as a sequence of operations applied to each layer of the network. The function processes four input features through three layers, each defined by its respective weights and biases, to produce the predicted class index. It is defined as follows:

```
int forward(float input0, float input1, float input2, float input3) {
    const int input_sizes[4] = {4, 10, 10, 3};
    const int num_layers = 3;

    float current_input[MAX_NEURONS];
    float next_input[MAX_NEURONS];

    current_input[0] = input0;
    current_input[1] = input1;
    current_input[2] = input2;
    current_input[3] = input3;

    for (int i = 0; i < num_layers; i++) {
        #pragma HLS UNROLL
        Layer *layer = &mlp.layers[i];
        for (int j = 0; j < input_sizes[i + 1]; j++) {
            float sum = layer->biases[j];

            for (int k = 0; k < input_sizes[i]; k++) {
                sum += layer->weights[j][k] * current_input[k];
            }
            next_input[j] = reLu(sum);
        }

        for (int j = 0; j < input_sizes[i + 1]; j++) {
            current_input[j] = next_input[j];
        }
    }

    int max_index = 0;
    float max = current_input[0];
    for (int i = 1; i < NUM_CLASSES; i++) {
        #pragma HLS UNROLL
        if (current_input[i] > max) {
            max = current_input[i];
            max_index = i;
        }
    }
    return max_index;
}
```

As already explained, all weights and biases are hardcoded and directly integrated into the MLP definition at the top of mlp.c. These values are exported from the PyTorch model and were inserted manually into the code.
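When checking the C forward pass against the notebook, a plain-Python reference model can be convenient. The sketch below mirrors the structure of the C function above, including the fact that the C loop applies ReLU after every layer. The toy weights are hypothetical and exist only to make the example runnable; they are not the trained parameters.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def forward(inputs, layers):
    """Mirror of the C forward pass.

    `layers` is a list of (weights, biases) pairs, with weights[j][k]
    connecting input k to neuron j. As in the C code, ReLU is applied
    after every layer, including the last one.
    """
    current = list(inputs)
    for weights, biases in layers:
        current = [
            relu(b + sum(w * x for w, x in zip(row, current)))
            for row, b in zip(weights, biases)
        ]
    # Argmax with the same tie-breaking as the C loop (first maximum wins)
    best = 0
    for i in range(1, len(current)):
        if current[i] > current[best]:
            best = i
    return best


# Hypothetical 2-input, 2-neuron, 2-class network (not the trained weights)
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.1]),   # hidden layer
    ([[1.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),    # output layer
]
print(forward([2.0, 1.0], toy_layers))  # 1
```

Because ReLU leaves positive values unchanged and maps the rest to zero, applying it to the output layer does not change the predicted class whenever at least one output is positive.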

A notable feature of this implementation is the use of the #pragma HLS UNROLL directive. This directive was necessary to address specific warnings related to loop dependencies and to enhance the throughput of the design by enabling parallel execution of loop iterations. Without this directive, the synthesis tool generated warnings indicating potential performance bottlenecks. By unrolling the loops, the forward pass achieves higher performance, making it suitable for FPGA-based acceleration.

### 3.3.3 Testbench

The testbench function reads the encoded Iris dataset file and applies the forward pass function to each sample. The file is read line-by-line by using the fscanf function, and the four features are passed to the forward pass function. The predicted class index is then compared with the actual class index to calculate the accuracy of the model.

The code for the testbench is shown below:

```
#include <stdio.h>
#include "mlp.h" // assumed to declare forward(), MAX_SAMPLES and MAX_FEATURES

int read_data_from_file(const char *path, int num_features, int label_size,
                        float input_data[MAX_SAMPLES][MAX_FEATURES],
                        float true_value[MAX_SAMPLES]) {
    FILE *file = fopen(path, "r");
    if (!file) {
        perror("Failed to open file");
        return -1;
    }

    int sample_count = 0;
    // Stop at end of file or at the first value that cannot be parsed
    while (fscanf(file, "%f", &input_data[sample_count][0]) == 1) {
        for (int i = 1; i < num_features; i++) {
            fscanf(file, "%f", &input_data[sample_count][i]);
        }
        for (int j = 0; j < label_size; j++) {
            fscanf(file, "%f", &true_value[sample_count]);
        }
        sample_count++;
        if (sample_count >= MAX_SAMPLES) {
            break;
        }
    }

    fclose(file);
    return sample_count;
}

int main() {
    float input_data[MAX_SAMPLES][MAX_FEATURES];
    float true_value[MAX_SAMPLES];

    // Read data from file
    const char *path = "./datasets/iris_dataset/iris_dataset_encoded.txt";
    int sample_count = read_data_from_file(path, MAX_FEATURES, 1, input_data, true_value);
    if (sample_count <= 0) {
        return 1; // nothing to evaluate
    }

    // Call the forward function and calculate the accuracy
    int correct_predictions = 0;
    for (int i = 0; i < sample_count; i++) {
        int prediction = forward(input_data[i][0], input_data[i][1],
                                 input_data[i][2], input_data[i][3]);
        if (prediction == true_value[i]) {
            correct_predictions++;
        } else {
            printf("Prediction: %d, True value: %f for input: %f %f %f %f\n",
                   prediction, true_value[i], input_data[i][0],
                   input_data[i][1], input_data[i][2], input_data[i][3]);
        }
    }

    float accuracy = (float)correct_predictions / sample_count * 100.0;
    printf("Accuracy: %.2f%%\n", accuracy);
    return 0;
}
```

The testbench is used to test that everything is working correctly and to evaluate the accuracy of the model on the dataset. The accuracy obtained should be similar to the one achieved during the training phase in PyTorch, confirming that the forward pass function is correctly implemented in C.

The testbench consistently reported an accuracy of 98%, in line with the results obtained with PyTorch.

## 3.4 Results

The results obtained using the Vitis Unified IDE confirm the successful synthesis and implementation of the MLP forward pass on the FPGA. The development process in Vitis offers several stages where detailed reports are generated, providing valuable insights into the design's functionality and performance. The device used in this section belongs to the **zynq** product family, and the target device is the **xc7z007s-clg225-2**.

The selected target device belongs to the Zynq-7000 family and is designed to integrate ARM processing systems with programmable logic. It features the following specifications:

- Logic Resources: The device provides 14,400 Look-Up Tables (LUTs) for implementing combinatorial logic.
- Flip-Flops (FFs): A total of 28,800 flip-flops are available, offering robust sequential logic capabilities.
- DSP Blocks: The device includes 66 DSP slices, making it suitable for high-performance signal processing tasks.
- Block RAM (BRAM): 50 BRAMs are available, ensuring ample on-chip memory for intermediate data storage.

This combination of resources makes the xc7z007s-clg225-2 well-suited for applications requiring both computation and flexibility, such as neural network inference on FPGA hardware.

The selected device has a **speed grade of** -2, which represents a medium performance level within the Zynq-7000 family. Speed grades for this family typically range from -1 (lowest performance) to -3 (highest performance). Higher speed grade numbers correspond to higher achievable clock frequencies and reduced propagation delays. The -2 speed grade offers a balance between performance and power efficiency, providing sufficient timing capabilities for the neural network inference tasks targeted in this design, which makes it a reasonable choice for this application.

For this implementation, we used the **default clock setting** provided by Vitis, which is configured to a period of 10 ns. This corresponds to a clock frequency of 100 MHz. The following results will demonstrate how this default clock period fits well with our solution, as we achieved efficient performance without the need to modify this default value.

### 3.4.1 Stages of Development in the Vitis Unified IDE

These stages include:

- C Simulation: This initial step ensures the functional correctness of the high-level C implementation. During this phase, the input data is processed entirely in software, and the generated reports confirm that the output matches the expected results, validating the logic before hardware synthesis.
- C Synthesis: In this phase, the high-level C code is converted into a hardware description optimized for the target FPGA. The synthesis report provides important details, such as estimated resource utilization (LUTs, DSPs, BRAMs), latency, and initiation intervals. These metrics help identify potential bottlenecks and guide optimization efforts.
- C/RTL Cosimulation: This step bridges the gap between high-level and low-level design by validating the synthesized hardware description against the functional requirements. This stage is particularly important as it ensures consistency between the high-level model and the Register Transfer Level (RTL) implementation. The reports include timing diagrams, functional waveforms, and a comparison of C simulation outputs with RTL simulation outputs to confirm correctness.
- Packaging: After verifying the synthesized hardware, the design is packaged into an IP (Intellectual Property) core. The packaging reports detail the generated IP core's properties, ensuring that it adheres to the FPGA's integration requirements and is ready for system-level implementation.
- Implementation: In the final stage, the IP core is placed and routed on the FPGA. Implementation reports include metrics such as timing analysis, power estimates, and resource utilization on the physical FPGA fabric. These reports confirm that the design meets the FPGA's constraints, such as timing closure and power consumption.

By consulting these reports, the development process is highly transparent: each step ensures the correctness, performance, and compliance with the FPGA's requirements, resulting in an efficient and reliable implementation of the top-function, in this case the forward pass.

These stages also apply to the other architectures, so the following sections will not describe them again and will only present their results.

### 3.4.2 Performance Metrics

Let's go into detail about the performance metrics obtained during the synthesis of the Multilayer-Perceptron forward pass on the FPGA.

**C-Simulation** C simulation provides preliminary performance metrics, focusing on the steady-state execution of the design. These estimates, including the Transaction Interval (TI), highlight potential bottlenecks and optimization areas but may be overly optimistic unless the code is made canonical. This stage serves as an initial evaluation, guiding further refinement and more accurate analysis during synthesis and co-simulation. Here, we could mainly verify the correctness of the code (from the expected output in the terminal), but we also noticed some dependencies in the code: we obtained the guidance message SIM 211-201, "A cyclic dependence prevents further acceleration of this process. This generally requires some algorithmic changes to improve." However, this is still the pre-synthesis phase, so we cannot expect the best performance metrics yet. As we will see, the results are good in the following stages.

**C-Synthesis** Here, we can see the results of the synthesis of the MLP forward pass on the FPGA. The table below shows the *Estimated Quality of results*, which is the first metric presented in the report:

| TARGET   | ESTIMATED | UNCERTAINTY |
|----------|-----------|-------------|
| 10.00 ns | 6.329 ns  | 2.70 ns     |

Table 2: Estimated Quality of Results for MLP Forward Pass

As we can see from the table, the estimated clock period is 6.329 ns against a target of 10.00 ns, with a clock uncertainty of 2.70 ns. The uncertainty is a margin reserved for the additional delays that logic synthesis and place & route may introduce later, so the tool aims for an estimated period no greater than the target minus the uncertainty (7.30 ns). The estimate falls within this budget, indicating that the design should meet timing at the 100 MHz target.
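The three values in Table 2 can be combined into a simple timing check. The sketch below assumes they are clock-period figures, with the uncertainty acting as a margin subtracted from the target; this reading of the report is an assumption made for the example, and the variable names are illustrative.

```python
target_ns = 10.00
estimated_ns = 6.329
uncertainty_ns = 2.70

budget_ns = target_ns - uncertainty_ns   # period left after the margin
slack_ns = budget_ns - estimated_ns      # positive means margin remains

print(f"budget = {budget_ns:.3f} ns")    # budget = 7.300 ns
print(f"slack  = {slack_ns:.3f} ns")     # slack  = 0.971 ns
print(f"estimated max frequency = {1000.0 / estimated_ns:.1f} MHz")  # 158.0 MHz
```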



Figure 4: Performance and Resource Estimates in the C-Synthesis Report

From the above image, we observe that the forward module exhibits an overall estimated latency of 216 cycles, corresponding to an execution time of 2,160 ns under the target clock frequency of 100 MHz (10 ns per cycle). The initiation interval (II) for the main module is reported as 217 cycles, which indicates that the design could benefit from further pipelining to optimize parallel execution and reduce the interval. Examining the resource utilization, the design employs:

- 9,689 Look-Up Tables (LUTs), which corresponds to approximately 67.3% of the total 14,400 LUTs available on the xc7z007s-clg225-2.
- 7,417 Flip-Flops (FFs), utilizing 25.8% of the total 28,800 FFs available.
- 50 DSP slices, accounting for 75.8% of the total 66 available DSPs.
- No BRAM (Block RAM) or URAM (UltraRAM), which indicates that the design keeps its data in registers and LUT-based logic rather than in dedicated memory blocks.
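The utilization percentages follow directly from the device totals listed earlier. A minimal sketch of the arithmetic, using the figures reported in this section (the dictionary and function names are illustrative):

```python
AVAILABLE = {"LUT": 14_400, "FF": 28_800, "DSP": 66, "BRAM": 50}
USED = {"LUT": 9_689, "FF": 7_417, "DSP": 50, "BRAM": 0}


def utilization_percent(used, available):
    """Share of a resource that the design occupies, in percent."""
    return 100.0 * used / available


for name in AVAILABLE:
    pct = utilization_percent(USED[name], AVAILABLE[name])
    print(f"{name:<4} {USED[name]:>6} / {AVAILABLE[name]:>6}  {pct:5.1f}%")
```

DSP slices are the scarcest resource in this design, followed by LUTs, so these two would be the first to constrain a larger network on the same device.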

The internal loops of the forward pass module demonstrate varied latencies, with some loops optimized using #pragma HLS UNROLL and #pragma HLS PIPELINE. Latencies range from 5 ns to 690 ns, with the primary loop exhibiting the highest latency (690 ns). This latency suggests potential bottlenecks in data dependencies or resource contention, which could be addressed by restructuring loops or leveraging more efficient parallelization strategies.
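The cycle counts reported for the forward module translate into time and throughput once the clock period is fixed. The sketch below performs that conversion for the 10 ns clock; treating the initiation interval as the bound on how often a new call can start is the usual interpretation, stated here as an assumption, and the constant names are illustrative.

```python
CLOCK_PERIOD_NS = 10.0   # default Vitis clock period used in this section
LATENCY_CYCLES = 216     # estimated latency of the forward module
II_CYCLES = 217          # initiation interval of the forward module

clock_mhz = 1_000.0 / CLOCK_PERIOD_NS
latency_ns = LATENCY_CYCLES * CLOCK_PERIOD_NS
# A new call can start every II cycles, which bounds the call rate
calls_per_second = 1e9 / (II_CYCLES * CLOCK_PERIOD_NS)

print(f"{clock_mhz:.0f} MHz")               # 100 MHz
print(f"{latency_ns:.0f} ns per call")      # 2160 ns per call
print(f"{calls_per_second:,.0f} calls/s")   # 460,829 calls/s
```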

We can also check from the report that, as expected, the hardware interface corresponds to the input and output of the forward pass function, and that the utilized pragma syntax is correct and corresponds to the one we used in the code.





Figure 5: Hardware interfaces

Figure 6: Pragma syntax

**C/RTL Simulation** C/RTL co-simulation is a verification process that ensures the functional equivalence between the high-level C/C++ design and the synthesized Register-Transfer Level (RTL) code. This step is critical as it confirms that the behavior of the RTL implementation matches the original C/C++ description after synthesis. The benefits are multiple:

- Validation of Functional Correctness: Verifies that the generated RTL implementation functions identically to the original high-level design for the same inputs.
- Timing and Latency Estimates: Provides insights into the actual timing behavior of the synthesized RTL.
- Resource Utilization Check: Highlights any discrepancies between resource usage reported during synthesis and actual utilization in hardware.

### How It Works

1. **Input Stimuli**: A testbench written in C/C++ is used to provide input data to both the high-level C/C++ design and the RTL design.
2. **Output Comparison**: The outputs from the high-level simulation and the RTL simulation are compared.
3. **Reports**: Any mismatches or timing violations are reported for debugging purposes.
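The output-comparison step can be pictured with the following conceptual sketch. It is not how the tool is implemented: the function name, the tolerances, and the sample vectors are all hypothetical, and serve only to show what "comparing the two sets of outputs" means for floating-point results.

```python
import math


def compare_outputs(reference, dut, rel_tol=1e-5, abs_tol=1e-6):
    """Return the indices where the DUT output differs from the reference."""
    if len(reference) != len(dut):
        raise ValueError("output vectors must have the same length")
    return [
        i
        for i, (r, d) in enumerate(zip(reference, dut))
        if not math.isclose(r, d, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


c_model_out = [0.10, 0.75, 0.15]  # hypothetical C-simulation outputs
rtl_out_ok = [0.10, 0.75, 0.15]   # hypothetical RTL outputs that match
rtl_out_bad = [0.10, 0.70, 0.15]  # hypothetical RTL outputs with one mismatch

print(compare_outputs(c_model_out, rtl_out_ok))   # empty list: no mismatches
print(compare_outputs(c_model_out, rtl_out_bad))  # index of the mismatching element
```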



Figure 7: C/RTL Cosimulation Report - Performance and resource estimates

From an analysis of the results of the C/RTL co-simulation and synthesis, we can highlight several key observations and metrics:

- Initiation Interval (II): The initiation interval across all loops in the design remains constant at 205 cycles. This uniform II suggests a consistent level of pipelining efficiency across the design. However, it also indicates that certain dependencies or resource constraints may limit further reduction of the II.
- Loop Latencies: The latencies for individual loops vary significantly:
  - The loop labeled forward\_Pipeline\_VITIS\_LOOP\_73\_2 exhibits an average latency of 37 cycles, which aligns with expectations for its complexity.
  - Other loops, such as forward\_Pipeline\_VITIS\_LOOP\_84\_44, achieve minimal latencies of just 3 cycles, indicating highly efficient implementation.
- Overall Latency: The overall latency of the forward pass main module is **204 cycles**, which is close to the 216 cycles estimated by the C synthesis.
- Total Execution Time: The total execution time for the forward pass is reported as 30,749 ns. This result reflects the aggregated runtime of all components and their interactions.
- Pipeline Observations: While the loops in the forward pass are pipelined, the relatively high initiation interval (205 cycles) suggests potential bottlenecks. These could stem from data dependencies or limited resource availability, particularly in critical paths of the design.
- Resource Utilization:
  - The design effectively utilizes available DSPs, LUTs, and Flip-Flops, as previously described.
  - However, no usage of BRAM or URAM is reported. Leveraging these resources could reduce dependency on external memory and improve performance in memory-intensive operations.
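To put the initiation interval into perspective, the short calculation below converts the reported cycle counts into time and into an upper bound on throughput. It assumes the 10 ns clock period used in the C-synthesis discussion; the variable names are illustrative.

```python
# Convert the co-simulation cycle counts into time and throughput.
CLOCK_PERIOD_NS = 10.0  # assumed: 100 MHz target clock
II_CYCLES = 205         # initiation interval reported by co-simulation
LATENCY_CYCLES = 204    # latency of the forward pass main module

interval_ns = II_CYCLES * CLOCK_PERIOD_NS     # time between two accepted inputs
latency_ns = LATENCY_CYCLES * CLOCK_PERIOD_NS  # time to produce one result
throughput_per_s = 1e9 / interval_ns           # upper bound on forward passes per second

print(f"interval: {interval_ns:.0f} ns, latency: {latency_ns:.0f} ns")
print(f"throughput: {throughput_per_s:,.0f} forward passes per second")
```

Because the II is one cycle longer than the latency, a new input is accepted only after the previous one has completed, so reducing the II is what would raise this bound.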

**Observations and Suggestions** The results confirm that the design is functional and performs as expected. However, several areas for improvement are identified:

1. **Reducing II:** Efforts should be directed towards decreasing the initiation interval by addressing resource contention and loop dependencies. Techniques such as loop unrolling or splitting could be beneficial.
2. **Memory Utilization:** Introducing BRAM or URAM for intermediate data storage can minimize external memory accesses and improve throughput.
3. **Optimization of Critical Loops:** High-latency loops should be reviewed and restructured to enhance parallelism, potentially improving the overall execution time.

By implementing these optimizations, the design could achieve higher efficiency and better alignment with the hardware capabilities of the target FPGA device.

**Packaging** Regarding the *Package* section, there is not much to be said, since the Vitis IDE doesn't provide a report for this stage. However, we can still infer that the packaging process was successful.

**Implementation** In this section we analyze the *RTL synthesis* and the *Place and Route* stages: the former provides a detailed report on the synthesis of the design into Register-Transfer Level (RTL) code, while the latter focuses on the physical implementation of the design on the FPGA. Regarding the *RTL synthesis*, the report provides insights into the resource utilization, timing constraints, and design hierarchy. The metrics include the number of Look-Up Tables (LUTs), Flip-Flops (FFs), and Digital Signal Processors (DSPs) used, as well as the critical path delay and maximum frequency. These metrics are crucial for assessing the design's efficiency and performance, guiding further optimization efforts.

**Fail Fast Analysis Overview:** The Fail Fast analysis is a preliminary verification step designed to ensure that the design adheres to fundamental guidelines before proceeding to computationally intensive stages such as placement and routing. This stage evaluates key aspects of the design, including resource utilization and timing constraints, to identify potential bottlenecks early in the development process.

The following columns are analyzed:

- **Criteria:** The aspects of the design being checked, such as LUT usage, FD (flip-flop, i.e. register, usage), and DSP (DSP slice usage), each monitored to ensure it falls within acceptable thresholds.
- **Guideline:** Defines a reference threshold for each criterion to guide the design toward optimal FPGA resource usage and performance.
- **Actual:** Displays the measured value of each criterion after the analysis of the current design.
- **Status:** Indicates the compliance status of each criterion, with possible values:
  - **OK:** The criterion satisfies the guideline and requires no action.
  - **WARNING:** The criterion approaches the threshold, suggesting caution.
  - **FAIL:** The criterion exceeds the guideline, necessitating immediate attention.

## Criteria Evaluated:

- **LUT Usage:** Monitors the utilization of Look-Up Tables, ensuring that logic mapping remains within device capacity.
- **Flip-Flop (FD) Usage:** Assesses the share of flip-flops in use, since a heavily packed design is more prone to routing congestion.
- **DSP Allocation:** Evaluates the usage of DSP slices for arithmetic operations, critical for neural network implementations.
- **Timing Constraints:** Verifies whether the design meets the required clock period and ensures no timing violations.
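The guideline-versus-actual comparison can be illustrated with a minimal sketch. The rule used here (a criterion passes when the actual value does not exceed the guideline) is an assumption made for illustration; the tool's own thresholds for intermediate states such as WARNING are not reproduced. The rows are a subset of the values reported in Table 3.

```python
def check_criteria(rows):
    """Map each criterion to 'OK' when actual <= guideline, else 'FAIL'."""
    return {
        name: ("OK" if actual <= guideline else "FAIL")
        for name, guideline, actual in rows
    }


# (criterion, guideline, actual) taken from the RTL synthesis Fail Fast table
rtl_synthesis_rows = [
    ("LUT", 70, 30.82),
    ("FD", 50, 18.51),
    ("LUTRAM+SRL", 25, 1.73),
    ("DSP", 80, 75.76),
    ("Control Sets", 270, 86),
]

status = check_criteria(rtl_synthesis_rows)
# Remaining margin before each guideline is reached
headroom = {name: guideline - actual for name, guideline, actual in rtl_synthesis_rows}
tightest = min(headroom, key=headroom.get)

print(status)
print(f"criterion closest to its guideline: {tightest}")
```

Looking at the headroom rather than only at the pass/fail state makes it easy to spot which resource would become the limiting one if the design grew.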

# 3.4.3 RTL Synthesis and Fail Fast Analysis

| Criteria                                                  | Guideline (%) | Actual (%) | Status |
|-----------------------------------------------------------|---------------|------------|--------|
| LUT                                                       | 70            | 30.82      | OK     |
| FD                                                        | 50            | 18.51      | OK     |
| LUTRAM+SRL                                                | 25            | 1.73       | OK     |
| MUXF7                                                     | 15            | 0.03       | OK     |
| DSP                                                       | 80            | 75.76      | OK     |
| RAMB/FIFO                                                 | 80            | 11.00      | OK     |
| DSP+RAMB+URAM (Avg)                                       | 70            | 43.38      | OK     |
| BUFGCE* + BUFGCTRL                                        | 24            | 0          | OK     |
| DONT_TOUCH (cells/nets)                                   | 0             | 0          | OK     |
| MARK_DEBUG (nets)                                         | 0             | 0          | OK     |
| Control Sets                                              | 270           | 86         | OK     |
| Average Fanout for modules > 100k cells                   | 4             | 2.31       | OK     |
| Max Average Fanout for modules > 100k cells               | 4             | 0          | OK     |
| Non-FD high fanout nets > 10k loads                       | 0             | 0          | OK     |
| TIMING-6 (No common primary clock between related clocks) | 0             | 0          | OK     |
| TIMING-7 (No common node between related clocks)          | 0             | 0          | OK     |
| TIMING-8 (No common period between related clocks)        | 0             | 0          | OK     |
| TIMING-14 (LUT on the clock tree)                         | 0             | 0          | OK     |
| TIMING-35 (No common node in paths with the same clock)   | 0             | 0          | OK     |
| Number of paths above max LUT budgeting (0.250ns)         | 0             | 0          | OK     |
| Number of paths above max Net budgeting (0.177ns)         | 0             | 0          | OK     |

Table 3: Resource Utilization: RTL Synthesis Fail Fast

| Resource | Utilization |
|----------|-------------|
| SLICE    | 0           |
| LUT      | 4,337       |
| FF       | 5,404       |
| DSP      | 50          |
| BRAM     | 11          |
| URAM     | 0           |
| LATCH    | 0           |
| SRL      | 104         |
| CLB      | 0           |

| Timing Metric                    | Value     |
|----------------------------------|-----------|
| Target Clock Period              | 10.000 ns |
| Post-Synthesis Clock Period      | 6.872 ns  |
| Post-Implementation Clock Period | N/A       |

Table 4: Vivado RTL Synthesis Resource Summary and Timing

**Resource Utilization and Timing Analysis:** The resource utilization for the RTL synthesis of the MLP forward pass implementation remains within the acceptable limits, as shown in the provided tables: the design uses 4,337 LUTs, 5,404 FFs, 50 DSPs, and 11 BRAMs. The timing analysis confirms that the post-synthesis clock period of 6.872 ns is comfortably below the target clock period of 10.000 ns. The post-implementation clock period is not yet available at this stage, so the final timing must be verified after the place-and-route phase.

**Fail Fast Results Analysis:** The Fail Fast analysis for the RTL synthesis phase indicates that all criteria have been met, with no violations reported. The utilization of LUTs, FFs, and DSPs falls below the specified guidelines, and there are no high-fanout nets or timing-related issues. The results suggest that the design is ready for the subsequent place-and-route phase with no significant concerns.

# 3.4.4 Place & Route and Fail Fast Analysis

| Criteria                                                  | Guideline (%) | Actual (%) | Status |
|-----------------------------------------------------------|---------------|------------|--------|
| LUT                                                       | 70            | 25.96      | OK     |
| FD                                                        | 50            | 18.63      | OK     |
| LUTRAM+SRL                                                | 25            | 1.73       | OK     |
| MUXF7                                                     | 15            | 0.00       | OK     |
| DSP                                                       | 80            | 75.76      | OK     |
| RAMB/FIFO                                                 | 80            | 11.00      | OK     |
| DSP+RAMB+URAM (Avg)                                       | 70            | 43.38      | OK     |
| BUFGCE* + BUFGCTRL                                        | 24            | 0          | OK     |
| DONT_TOUCH (cells/nets)                                   | 0             | 0          | OK     |
| MARK_DEBUG (nets)                                         | 0             | 0          | OK     |
| Control Sets                                              | 270           | 78         | OK     |
| Average Fanout for modules > 100k cells                   | 4             | 2.23       | OK     |
| Max Average Fanout for modules > 100k cells               | 4             | 0          | OK     |
| Non-FD high fanout nets > 10k loads                       | 0             | 0          | OK     |
| TIMING-6 (No common primary clock between related clocks) | 0             | 0          | OK     |
| TIMING-7 (No common node between related clocks)          | 0             | 0          | OK     |
| TIMING-8 (No common period between related clocks)        | 0             | 0          | OK     |
| TIMING-14 (LUT on the clock tree)                         | 0             | 0          | OK     |
| TIMING-35 (No common node in paths with the same clock)   | 0             | 0          | OK     |
| Number of paths above max LUT budgeting (0.350ns)         | 0             | 0          | OK     |
| Number of paths above max Net budgeting (0.177ns)         | 0             | 0          | OK     |

Table 5: Resource Utilization: Place & Route Fail Fast Summary

| Resource | Utilization |
|----------|-------------|
| SLICE    | 1,518       |
| LUT      | 3,738       |
| FF       | 5,364       |
| DSP      | 50          |
| BRAM     | 11          |
| URAM     | 0           |
| LATCH    | 0           |
| SRL      | 104         |
| CLB      | 0           |

| Timing Metric                    | Value     |
|----------------------------------|-----------|
| Target Clock Period              | 10.000 ns |
| Post-Synthesis Clock Period      | 6.872 ns  |
| Post-Implementation Clock Period | 7.860 ns  |

Table 6: Vivado Place & Route Resource Summary and Timing

**Resource Utilization and Timing Analysis:** The Place & Route phase provides a detailed overview of the final resource usage and timing performance. The design consumes 3,738 LUTs, 5,364 FFs, 50 DSP blocks, 11 BRAMs, and 104 SRLs, while not utilizing any URAM, latches, or CLBs. The resource utilization demonstrates efficient use of the FPGA without exceeding critical limits, confirming the robustness of the implementation.

For timing constraints, the target clock period is set to 10.000 ns, corresponding to a frequency of 100 MHz. During the synthesis stage, the achieved clock period was reported as 6.872 ns. Post-implementation, the clock period increased to 7.860 ns, accounting for additional routing delays and placement complexities. Despite this increase, the design still meets the required constraints, providing a sufficient margin for stable operation. The results confirm that the design is capable of operating within the specified timing requirements post-routing, reflecting an efficient and reliable implementation.
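The margin mentioned above can be quantified as slack (target period minus achieved period) and as the maximum clock frequency each achieved period would allow. The sketch below is a plain arithmetic check on the reported values; the helper names are illustrative.

```python
def fmax_mhz(period_ns):
    """Maximum clock frequency (MHz) allowed by a given clock period (ns)."""
    return 1000.0 / period_ns


def slack_ns(target_ns, achieved_ns):
    """Positive slack means the timing constraint is met."""
    return target_ns - achieved_ns


TARGET_NS = 10.000
stages = [("post-synthesis", 6.872), ("post-implementation", 7.860)]

for stage, period in stages:
    print(
        f"{stage}: slack {slack_ns(TARGET_NS, period):.3f} ns, "
        f"fmax {fmax_mhz(period):.1f} MHz"
    )
```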

**Fail Fast Results Analysis:** The Fail Fast analysis for the Place & Route phase reveals that all criteria are marked as OK, indicating full compliance with resource and timing guidelines. The resource utilization across LUTs, FFs, DSPs, and BRAMs is well-balanced, and the average fanout and control sets remain within acceptable ranges. The achieved clock period of 7.860 ns confirms timing compliance, demonstrating that routing and placement phases were successfully managed to maintain performance integrity.

Notably, the DSP utilization is at 75.8%, close to the 80% guideline, suggesting an opportunity for optimization in future iterations, such as redistributing computational loads across LUTs or exploring alternative arithmetic implementations. The analysis further highlights the effective balance between resource utilization and timing constraints, ensuring a robust design suitable for FPGA acceleration.

# 4 Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNNs) are a class of deep learning models specifically designed to process grid-like data, such as images. Their ability to automatically learn spatial hierarchies of features makes them highly effective for image classification tasks. In this implementation, a simple yet efficient CNN is designed and trained to classify handwritten digits from the famous MNIST dataset, with the aim of achieving high accuracy while remaining compatible with hardware synthesis constraints. The ConvNet forward pass was implemented in a similar manner to the MLP, with the main difference being the convolution and pooling operations.

#### 4.1 Dataset

The CNN was trained on the MNIST dataset, a widely used benchmark for handwritten digit classification tasks. The dataset consists of 70,000 grayscale images of digits (0 to 9), each with a resolution of $28 \times 28$ pixels, divided into 60,000 training images and 10,000 test images. The 60,000 training images were split into two subsets, with 70% used as the training set and 30% as the validation set, resulting in the following distribution:

• Training set: 42,000 samples.

• Validation set: 18,000 samples.

• Test set: 10,000 samples for evaluating the model's performance.

The dataset was preprocessed using a normalization that scales the pixel values to the range [-1, 1]. Centering the pixel values around 0 often helps with model training and convergence.
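As a minimal sketch of such a normalization, the function below maps an 8-bit pixel to [-1, 1]. The mean and scale of 0.5 are assumptions chosen because they produce the stated range; the text does not report the exact values used in the notebook.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map an 8-bit pixel (0..255) to [-1, 1] via (value / 255 - mean) / std.

    mean=0.5 and std=0.5 are assumed values, not taken from the project.
    """
    return (value / 255.0 - mean) / std


row = [0, 64, 128, 255]  # a few sample pixel intensities
print([round(normalize_pixel(p), 3) for p in row])
```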

# 4.2 PyTorch Model

### 4.2.1 Model Architecture

The ConvNet is designed to be as simple as possible. The structure includes the following layers:

- **Convolutional Layer:** A single convolutional layer with 3 filters, each of size $3 \times 3$, stride 1, and padding 1. This layer increases feature representation by extracting local patterns from the input images.
- **ReLU Activation:** Applied after the convolutional layer to introduce non-linearity into the model.
- **Pooling Layer:** A max-pooling layer with a kernel size of $2 \times 2$ and a stride of 2, reducing the spatial dimensions of the feature map by half.
- **Fully Connected Layer:** It takes as input the flattened feature map from the convolutional and pooling layers and outputs 10 neurons representing the 10 digit classes (0–9). This layer does not apply an activation function; it is simply followed by max(1) to obtain the predicted class.

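The total number of parameters in this configuration is **5,920** (30 in the convolutional layer and 5,890 in the fully connected layer, 5,880 of which are weights), which is relatively low compared to more complex CNN architectures. The forward pass through the network is done as follows: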

1. Apply the convolutional layer to the input image, followed by the ReLU activation function.
2. Perform max-pooling on the resulting feature map to reduce spatial dimensions.
3. Flatten the feature map into a one-dimensional vector.
4. Pass the flattened vector through the fully connected layer to produce the class scores.

The simplicity of this architecture ensures compatibility with hardware synthesis while maintaining high accuracy for digit classification tasks.
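The layer sizes and the parameter count follow directly from the architecture. The sketch below applies the usual output-size formulas, $(W - F + 2P)/S + 1$ for the convolution and $(W - F)/S + 1$ for the pooling, and then counts weights and biases per layer; the function and variable names are illustrative. The fully connected weight matrix alone accounts for 5,880 values, while biases and convolutional parameters add the rest.

```python
def conv_out_size(w, f, p, s):
    """Output side length of a convolution: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1


def pool_out_size(w, f, s):
    """Output side length of a pooling layer: (W - F) / S + 1."""
    return (w - f) // s + 1


conv_side = conv_out_size(28, 3, 1, 1)      # 28x28 input, 3x3 kernel, padding 1, stride 1
pool_side = pool_out_size(conv_side, 2, 2)  # 2x2 max pooling with stride 2
fc_inputs = 3 * pool_side * pool_side       # 3 channels flattened

conv_params = 3 * 1 * 3 * 3 + 3             # filter weights + one bias per filter
fc_weights = fc_inputs * 10                 # weight matrix of the fully connected layer
fc_params = fc_weights + 10                 # plus one bias per class
total_params = conv_params + fc_params

print(conv_side, pool_side, fc_inputs)
print(conv_params, fc_weights, fc_params, total_params)
```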

The model was defined as follows, using the PyTorch library:

```
import torch
import torch.nn as nn

# Define the CNN model
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        # Input: x = [1, 28, 28]
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3, stride=1, padding=1)
        # Convolutional Layer: Output: x = [3, 28, 28]
        # Formula: (W - F + 2P) / S + 1
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Pooling Layer: Output: x = [3, 14, 14]
        # Formula: (W - F) / S + 1
        self.fc1 = nn.Linear(3 * 14 * 14, 10)
        # Fully Connected Layer: Flatten the output to match 10 classes

    def forward(self, x):
        x = torch.relu(self.conv1(x))  # Apply ReLU after convolution
        x = self.pool(x)               # Apply max pooling
        x = x.view(x.size(0), -1)      # Flatten the tensor
        x = self.fc1(x)                # Fully connected layer
        return x
```

### 4.2.2 Model Training

The training process for the Convolutional Neural Network (CNN) was performed using the PyTorch library. The objective was to minimize the cross-entropy loss, a suitable loss function for multi-class classification problems. The training configuration was as follows:

• Loss Function: CrossEntropyLoss.

• Optimizer: Adam optimizer with a learning rate of 0.001.

• Batch Size: 64.

• Epochs: 100.

```
# initialize network, loss function and optimizer
model = ConvNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
```
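For reference, the quantity minimized by the cross-entropy criterion can be written out for a single sample using only the standard library. PyTorch's CrossEntropyLoss takes the raw outputs of the last layer (logits) and applies a log-softmax internally, which is why the model has no activation after the fully connected layer. The function below is an illustrative re-derivation, not the code used in the notebook.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


uniform_logits = [0.0] * 10  # an untrained model that scores all digits equally
confident_logits = [0.0] * 9 + [10.0]  # strong preference for class 9

print(cross_entropy(uniform_logits, 3))    # equals log(10)
print(cross_entropy(confident_logits, 9))  # close to zero: correct and confident
print(cross_entropy(confident_logits, 0))  # large: confident but wrong
```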

After each epoch, the model was evaluated on the validation set to monitor the training progress and prevent overfitting. The training and validation results at selected epochs are shown in the table below:

| Epoch | Train Loss | Train Accuracy (%) | Val Loss | Val Accuracy (%) |
|-------|------------|--------------------|----------|------------------|
| 10               | 0.1194 | 96.49              | 0.1360   | 96.03            |
| 20               | 0.0907 | 97.30              | 0.1232   | 96.38            |
| 30               | 0.0735 | 97.74              | 0.1130   | 96.77            |
| 40               | 0.0636 | 98.03              | 0.1122   | 96.79            |
| 50               | 0.0571 | 98.27              | 0.1147   | 96.75            |
| 60               | 0.0521 | 98.34              | 0.1147   | 96.88            |
| 70               | 0.0491 | 98.39              | 0.1259   | 96.73            |
| 80               | 0.0459 | 98.53              | 0.1285   | 96.69            |
| 90               | 0.0439 | 98.55              | 0.1342   | 96.61            |
| 100              | 0.0411 | 98.71              | 0.1325   | 96.76            |

Table 7: Training and Validation Results at Selected Epochs

The final test performance was as follows:

• Test Loss: 0.1260

• Test Accuracy: 97.06%

These can be considered good results, showing that the model was able to generalize well to unseen data even with only a small number of parameters and epochs of training.

The following Python code snippet shows the main training loop:

```
# Training loop
for epoch in range(NUM_EPOCHS):
    model.train()
    train_loss = 0.0
    train_correct = 0
    train_total = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        _, predicted = outputs.max(1)
        train_total += labels.size(0)
        train_correct += predicted.eq(labels).sum().item()

    train_accuracy = 100. * train_correct / train_total

    # Validation phase
    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0

    with torch.no_grad():  # Disable gradient calculations
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)

            val_loss += loss.item()
            _, predicted = outputs.max(1)
            val_total += labels.size(0)
            val_correct += predicted.eq(labels).sum().item()

    val_accuracy = 100. * val_correct / val_total

    if (epoch + 1) % 10 == 0:
        # Report the metrics every 10 epochs (assumed logging step:
        # the summed batch losses are averaged over the number of batches)
        print(f"Epoch {epoch + 1}: "
              f"train loss {train_loss / len(train_loader):.4f}, "
              f"train acc {train_accuracy:.2f}%, "
              f"val loss {val_loss / len(val_loader):.4f}, "
              f"val acc {val_accuracy:.2f}%")
```

### 4.2.3 Exporting Parameters

To proceed with hardware implementation, the trained parameters (weights and biases) of the CNN model were exported to a structured format so that they could be easily loaded into the C implementation. The following Python code was used to extract and store the parameters:

```
# `weights` is assumed to map each parameter name to a NumPy array,
# e.g. built from model.state_dict()
with open('./convnet_weights.txt', 'w') as f:
    for name, weight in weights.items():
        f.write(f"// {name}, shape: {weight.shape}\n")
        if weight.ndim == 4:
            # Convolutional weights:
            # [out_channels, in_channels, kernel_height, kernel_width]
            for oc, w_slice in enumerate(weight):  # Iterate over output channels
                f.write(f"{{ // Output Channel {oc}\n")
                for row in w_slice:  # Iterate over rows (flattened kernels)
                    f.write("  {" + ", ".join(map(str, row.flatten())) + "},\n")
                f.write("},\n")
        elif weight.ndim == 2:  # Fully connected layer weights
            for row in weight:
                f.write("{" + ", ".join(map(str, row)) + "},\n")
        elif weight.ndim == 1:  # Biases or 1D weights
            f.write("{")
            f.write(", ".join(map(str, weight)) + "},\n")
```

This code snippet exports the weights and biases of the ConvNet model to the convnet\_weights.txt file. From it, the weights and biases for each layer can be simply copy-pasted inside the ConvNet structure in the C implementation.
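To show what the exported text looks like, the sketch below applies the same brace-and-comma formatting to a tiny, made-up weight matrix stored as nested Python lists. The helper name and the values are hypothetical; each produced line has the shape of one row of a C array initializer.

```python
def format_row(values):
    """Format one row of values as a C initializer row, e.g. '{0.5, -1.25},'."""
    return "{" + ", ".join(map(str, values)) + "},"


fc_weights = [[0.5, -1.25], [2.0, 0.0]]  # hypothetical 2x2 weight matrix
lines = [format_row(row) for row in fc_weights]

print("\n".join(lines))
```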

## 4.3 C Implementation

The implementation of the Convolutional Neural Network (CNN) model in C is divided into three main files:

- ConvNet.h: This header file defines the data structures and constants used in the implementation, including the convolutional and fully connected layers, and the overall network structure.
- ConvNet.c: This file contains the implementation of the CNN's functionality, including the forward pass and activation functions, along with predefined weights and biases.
- testbench.c: This file serves as a testbench to verify the network's functionality. It includes functions for reading input data, executing the forward pass, and evaluating the model's output against the expected results.

The details of each file are described in the following subsections. The device used in this case was the xczu15eg-ffvc900-3-e, from the Zynq UltraScale+ family.



Figure 8: Performance and Resource Estimates in the C-Synthesis Report

### 4.3.1 CNN Structure

The Convolutional Neural Network (CNN) structure is implemented in C to match the architecture defined in the PyTorch model. Going into the details, the CNN consists of the following layers and operations:

- Convolutional Layer: This layer performs the convolution operation by *sliding* a set of filters (or **kernels**) over the input image to extract spatial features. The implementation uses nested loops to apply the filters, taking into account padding and stride.
- **ReLU Activation:** The Rectified Linear Unit (ReLU) activation function is applied after the convolutional operation to introduce non-linearity. The ReLU function replaces negative values in the feature map with zeros; its formula was already provided in the MLP section.
- Pooling Layer: A max-pooling operation reduces the spatial dimensions of the feature map by selecting the maximum value within a kernel-sized region. This layer helps to downsample the feature map and make the model more robust to small spatial variations.
- Fully Connected Layer: The fully connected layer takes the flattened output from the pooling layer as input and maps it to the output classes. This is achieved by performing a linear transformation using the preloaded weights and biases.
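The first three operations can be mirrored by a small pure-Python reference model, which is handy for checking the C code on tiny inputs. This is an illustrative single-channel sketch, not the project's implementation: the function names and the $4 \times 4$ test image are made up, and the kernel size ($3 \times 3$, stride 1, zero padding 1) and pooling window ($2 \times 2$, stride 2) follow the architecture described above.

```python
def relu(x):
    """Replace negative values with zero."""
    return x if x > 0.0 else 0.0


def conv3x3_same(image, kernel, bias):
    """3x3 convolution (stride 1, zero padding 1) followed by ReLU."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = bias
            for kh in range(3):
                for kw in range(3):
                    rr, cc = r + kh - 1, c + kw - 1
                    if 0 <= rr < h and 0 <= cc < w:  # outside the image counts as zero
                        acc += image[rr][cc] * kernel[kh][kw]
            out[r][c] = relu(acc)
    return out


def max_pool2x2(fmap):
    """2x2 max pooling with stride 2 (even-sized feature maps)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]


image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # passes the image through

feature_map = conv3x3_same(image, identity, bias=0.0)
pooled = max_pool2x2(feature_map)
print(pooled)
```

Using an identity kernel makes the expected result easy to verify by hand: the feature map equals the input, and pooling keeps the largest value of each $2 \times 2$ block.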



Figure 9: Convolutional Neural Network Architecture

The CNN structure is defined in the header file ConvNet.h, which outlines the data structures and constants used for the network. The ConvNet structure is shown below:

```
// Structure to represent a convolutional layer
typedef struct {
    float weights[CONV1_OUTPUT_CHANNELS][INPUT_CHANNELS][3][3]; // Filters of the convolutional layer
    float biases[CONV1_OUTPUT_CHANNELS]; // Biases for the convolutional layer
} ConvLayer;

// Structure to represent a fully connected layer
typedef struct {
    float weights[NUM_CLASSES][FC1_INPUT_SIZE]; // Weights of the fully connected layer
    float biases[NUM_CLASSES]; // Biases of the fully connected layer
} FullyConnectedLayer;

// General structure of the network
typedef struct {
    ConvLayer conv1; // First convolutional layer
    FullyConnectedLayer fc1; // Fully connected layer
} ConvNet;
```

As for the MLP implementation, all weights and biases in the CNN are hard-coded and directly integrated into the ConvNet.c file, where they can be found at the top.

### 4.3.2 Forward Pass

The forward pass function propagates an input image through the Convolutional Neural Network (CNN) to generate class probabilities as the output. It sequentially applies operations such as convolution, activation, pooling, and fully connected layers.

The following C code implements the forward pass:

```
int forward(float input[INPUT_HEIGHT][INPUT_WIDTH][INPUT_CHANNELS], float output[NUM_CLASSES]) {
    // Convolutional layer output buffer
    float conv_output[CONV1_OUTPUT_CHANNELS][INPUT_HEIGHT][INPUT_WIDTH];

    // Convolutional layer operation
    // Loop over output channels
    convolutional_layer: for (int oc = 0; oc < CONV1_OUTPUT_CHANNELS; oc++) {
        // Loop over input height
        for (int h = 0; h < INPUT_HEIGHT; h++) {

            // Temporary buffer for input data (with padding)
            // Kernel size is 3x3, with a 3-row buffer
            float temp_input[3][INPUT_WIDTH + 2][INPUT_CHANNELS];

            // Load the input data into the buffer with padding
            // Load 3 rows of input data
            for (int i = 0; i < 3; i++) {
                // Adjust row index for kernel centering
                int curr_h = h + i - 1;
                // Input width with padding
                for (int w = 0; w < INPUT_WIDTH + 2; w++) {
                    // Loop over input channels
                    for (int c = 0; c < INPUT_CHANNELS; c++) {
                        // Apply zero-padding for boundary conditions
                        if (curr_h >= 0 && curr_h < INPUT_HEIGHT && w - 1 >= 0 && w - 1 < INPUT_WIDTH) {
                            temp_input[i][w][c] = input[curr_h][w - 1][c];
                        } else {
                            temp_input[i][w][c] = 0.0f;
                        }
                    }
                }
            }

            // Perform convolution for the current row of the output
```

```
                for (int w = 0; w < INPUT_WIDTH; w++) {
                    // Initialize with bias value
                    float sum = convnet.conv1.biases[oc];

                    // Convolve the kernel with the input buffer
                    // Loop over input channels
                    for (int ic = 0; ic < INPUT_CHANNELS; ic++) {
                        // Kernel height
                        for (int kh = 0; kh < 3; kh++) {
                            // Kernel width
                            for (int kw = 0; kw < 3; kw++) {
                                sum += temp_input[kh][w + kw][ic] * convnet.conv1.weights[oc][ic][kh][kw];
                            }
                        }
                    }
                    // Apply ReLU activation function
                    conv_output[oc][h][w] = reLu(sum);
                }
            }
        }

        // MaxPooling layer output buffer
        float pool_output[CONV1_OUTPUT_CHANNELS][INPUT_HEIGHT / POOL_SIZE][INPUT_WIDTH / POOL_SIZE];

        // MaxPooling operation
        // Loop over output channels
        for (int oc = 0; oc < CONV1_OUTPUT_CHANNELS; oc++) {
            // Loop over pooled height
            for (int h = 0; h < INPUT_HEIGHT / POOL_SIZE; h++) {
                // Loop over pooled width
                for (int w = 0; w < INPUT_WIDTH / POOL_SIZE; w++) {
                    // Initialize with a very small value
                    float max_val = -1e9;
                    // Pooling window height
                    for (int ph = 0; ph < POOL_SIZE; ph++) {
                        // Pooling window width
                        for (int pw = 0; pw < POOL_SIZE; pw++) {
                            // Calculate input height index
                            int ih = h * POOL_SIZE + ph;
                            // Calculate input width index
                            int iw = w * POOL_SIZE + pw;
                            // Ensure within bounds
                            if (ih < INPUT_HEIGHT && iw < INPUT_WIDTH) {
                                // Find max value in window
                                if (conv_output[oc][ih][iw] > max_val) {
                                    max_val = conv_output[oc][ih][iw];
                                }
                            }
                        }
                    }
                    // Store max value in pooled output
                    pool_output[oc][h][w] = max_val;
                }
            }
        }

        // Fully connected layer input buffer
        float fc_input[FC1_INPUT_SIZE];
        #pragma HLS ARRAY_PARTITION variable=fc_input complete

        // Flatten pooling output into a 1D array
        // Index for flattened array
        int idx = 0;
        // Loop over output channels
        flatten: for (int oc = 0; oc < CONV1_OUTPUT_CHANNELS; oc++) {
            // Loop over pooled height
            for (int h = 0; h < INPUT_HEIGHT / POOL_SIZE; h++) {
                // Loop over pooled width
                for (int w = 0; w < INPUT_WIDTH / POOL_SIZE; w++) {
                    fc_input[idx++] = pool_output[oc][h][w];
                }
            }
        }

        // Fully connected layer computation
        // Loop over output classes
        fully_connected_loop: for (int o = 0; o < NUM_CLASSES; o++) {
            // Initialize with bias value
            float sum = convnet.fc1.biases[o];
            #pragma HLS PIPELINE II=1
            // Loop over flattened input
            for (int i = 0; i < FC1_INPUT_SIZE; i++) {
                // Weighted sum of inputs
                sum += fc_input[i] * convnet.fc1.weights[o][i];
            }
            // Store the computed class score
            output[o] = sum;
        }
        return 0; // Success
    }
```

### 4.3.3 Considerations on Pragmas and Forward-pass Code

1. **Array Partitioning (#pragma HLS ARRAY\_PARTITION)**: The ARRAY\_PARTITION pragma improves parallelism in operations that need concurrent access to multiple data elements. Partitioning an array into independent elements, as done for the input array, the convolution weights, and the convolution output, lets the design access different elements in the same cycle without memory-port conflicts. A complete partition maps each element of the array to a separate hardware resource, enabling simultaneous access to the data. For example, applying #pragma HLS ARRAY\_PARTITION variable=fc\_input complete within the forward function allows concurrent access to every element during the fully connected layer's multiply-accumulate loops, avoiding memory bottlenecks and improving overall throughput.
2. **Pipeline (#pragma HLS PIPELINE)**: The PIPELINE pragma pipelines a loop so that a new iteration can start before the previous one has finished, overlapping their execution. The directive II=1 requests an initiation interval of 1 clock cycle, which speeds up loop execution and reduces latency. This is particularly useful in loops whose iterations are independent, such as the convolution and max-pooling calculations. In the convolution loop, for instance, #pragma HLS PIPELINE II=1 asks the tool to start each new iteration on the next clock cycle, leading to lower latency and higher throughput.
3. **Efficiency of Convolution and Max-Pooling**: Convolution and pooling are kept in separate, parallelizable loops that use local buffers (such as the 3-row temp\_input) to handle padding and data retrieval. By loading only three rows of the input at a time, the design holds precisely the rows needed by the $3\times 3$ kernel at each step, improving memory usage and avoiding the overhead of storing unused rows. This targeted buffering minimizes the data transferred per output row, leading to better latency and throughput when combined with the pipelining and partitioning directives.
4. **Memory Efficiency and Buffer Management**: Temporary buffers, such as the 3-row buffer for convolution and the intermediate output arrays, help manage partial computations and boundary conditions efficiently. These buffers must be sized to match the actual kernel requirements, to avoid wasted storage and potential resource conflicts. With a 3-row buffer dedicated to a $3\times3$ kernel, the design loads only the necessary lines, reducing on-chip memory usage and preventing unnecessary data transfers.
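The line-buffer idea described above can be illustrated on a reduced problem. The following is a minimal sketch, not the project's forward function: it assumes a single input channel and a hypothetical 4x4 image (the names `conv3x3_line_buffer`, `H`, `W`, and `rows` are introduced only for this illustration). It shows how a 3-row buffer with one padding column on each side supplies every value a $3\times3$ kernel needs for one output row. The HLS pragmas are ignored by a regular C compiler, which may emit an unknown-pragma warning.

```c
#include <assert.h>
#include <stddef.h>

#define H 4 /* illustrative image height (assumption) */
#define W 4 /* illustrative image width (assumption) */

/* Single-channel 3x3 convolution with zero padding, computed one output
 * row at a time from a 3-row line buffer. Names and sizes are hypothetical. */
void conv3x3_line_buffer(float in[H][W], float k[3][3],
                         float bias, float out[H][W]) {
    assert(in != NULL && k != NULL && out != NULL);
    for (int h = 0; h < H; h++) {
        /* Buffer holds rows h-1, h, h+1 plus one padding column per side. */
        float rows[3][W + 2];
#pragma HLS ARRAY_PARTITION variable=rows complete dim=1
        for (int i = 0; i < 3; i++) {
            int src_h = h + i - 1;
            for (int w = 0; w < W + 2; w++) {
                int src_w = w - 1;
                int inside = src_h >= 0 && src_h < H && src_w >= 0 && src_w < W;
                /* Positions outside the image are filled with zeros. */
                rows[i][w] = inside ? in[src_h][src_w] : 0.0f;
            }
        }
        for (int w = 0; w < W; w++) {
#pragma HLS PIPELINE II=1
            float sum = bias;
            for (int kh = 0; kh < 3; kh++) {
                for (int kw = 0; kw < 3; kw++) {
                    sum += rows[kh][w + kw] * k[kh][kw];
                }
            }
            out[h][w] = sum;
        }
    }
}
```

With an all-ones image and an all-ones kernel, each output equals the number of kernel taps that fall inside the image, which makes the effect of the zero padding at corners and edges easy to check by hand.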

### 4.3.4 Testbench

The testbench reads input data from a file, processes it through the CNN using the forward function, and verifies the output against the expected label. This ensures that the implementation behaves as expected and aligns with the PyTorch model. The testbench performs the following tasks:

- **Input Loading**: Reads the input image and its corresponding label from a text file (input\_image.txt) produced by the Jupyter Notebook. This file has to be loaded inside the *testbench* section of the Vitis IDE.
- **Forward Pass Execution**: Propagates the input through the CNN to compute the class scores.
- **Result Validation**: Compares the predicted label with the true label and outputs the result.

The following code showcases the implementation of the testbench:

```
int main() {
    float input[INPUT_HEIGHT][INPUT_WIDTH][INPUT_CHANNELS];
    float output[NUM_CLASSES];
    int label;

    // Load the input image and its true label from file
    read_input_image(INPUT_FILE_PATH, input, &label);

    // Run the forward pass and check its return code
    int results = forward(input, output);
    if (results != 0) {
        printf("Error during forward pass\n");
        return 1;
    }

    printf("Predicted output:\n");
    for (int i = 0; i < NUM_CLASSES; i++) {
        printf("Class %d: %f\n", i, output[i]);
    }

    // The predicted label is the index of the highest score
    float max_prob = output[0];
    int predicted_label = 0;

    for (int i = 1; i < NUM_CLASSES; i++) {
        if (output[i] > max_prob) {
            max_prob = output[i];
            predicted_label = i;
        }
    }

    printf("Predicted label: %d\n", predicted_label);
    printf("True label: %d\n", label);

    return 0;
}
```

Apart from the read\_input\_image function (not reported here for brevity), the code follows the workflow explained above: it reads an input image, passes it to the forward function, and then compares the predicted label with the true label.

### 4.4 Results

**Target Device** The Convolutional Neural Network forward pass was implemented on the Zynq UltraScale+ MPSoC platform. The target device selected for this design is xczu15eg-ffvc900-3-e, which is part of the Zynq UltraScale+ family. The package used is ffvc900, with a speed grade of -3. This speed grade offers the fastest timing characteristics within the family, which makes the device suitable for designs requiring high clock frequencies and low-latency operations, such as the convolutional neural network implementation.

The device offers the following key resources:

- LUTs: 341,280
- Flip-Flops (FFs): 682,560
- DSP slices: 3,528
- Block RAMs (BRAMs): 744

This device was chosen for its high resource availability and support for efficient hardware acceleration, making it suitable for the implementation and optimization of the ConvNet design.

### 4.4.1 Performance Metrics

Let's dive into the results obtained inside the Vitis IDE for the simple convolutional neural network.

**C-Simulation** As before, the first step is to run the C simulation to verify the correctness of the implementation. The figures below compare the output of the PyTorch model with that of the testbench, showing the predicted label and the true label.



Figure 10: PyTorch Model Prediction

Figure 11: Prediction inside Vitis IDE

As we can see, the results agree to the third decimal digit, which is a good sign that the C implementation works as expected. As before, the *Code Analyzer* section of the Vitis IDE reports the message "A cyclic dependence prevents further acceleration of this process. This generally requires some algorithmic changes to improve." As noted for the MLP, this message is not an error, only a warning that the code could be optimized further.
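The comparison between the two sets of scores can also be automated. The sketch below is an illustration rather than part of the project's testbench: the helper name `scores_match` and the choice of an absolute tolerance are assumptions. It checks that every score produced by the C implementation lies within a given tolerance of the corresponding reference score exported from PyTorch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when every element of `actual` is within `tol` of the
 * corresponding element of `expected` (absolute difference).
 * Hypothetical helper, not taken from the project sources. */
bool scores_match(const float *actual, const float *expected,
                  int n, float tol) {
    assert(actual != NULL && expected != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        float diff = actual[i] - expected[i];
        if (diff < 0.0f) {
            diff = -diff; /* absolute value without pulling in math.h */
        }
        if (diff > tol) {
            return false;
        }
    }
    return true;
}
```

A call such as `scores_match(output, reference, NUM_CLASSES, 1e-3f)`, where `reference` is a hypothetical array holding the PyTorch scores, would express agreement to the third decimal digit as a pass/fail check.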

**C-Synthesis** As for the MLP, we report the *Estimated Quality of results* table:

| TARGET   | ESTIMATED | UNCERTAINTY |
|----------|-----------|-------------|
| 10.00 ns | 7.103 ns  | 2.70 ns     |

Table 8: Estimated Quality of Results for ConvNet Forward Pass

The estimated clock period is 7.103 ns, which is below the target of 10.00 ns, so the design meets the timing constraint at this stage. The uncertainty of 2.70 ns is the margin that the tool reserves for delays introduced by the later stages (RTL synthesis and place and route), which are not yet known during C synthesis. Even with this margin added, the estimate (9.803 ns) remains within the 10.00 ns target.
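The arithmetic behind this check is simple enough to write down. The following sketch assumes that the estimate is acceptable when the estimated period plus the clock uncertainty does not exceed the target period; the function name is hypothetical and the rule is stated here as an assumption about how the table should be read, not as a reproduction of the tool's internal logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when the estimated clock period still fits within the
 * target once the uncertainty margin is added. All values are in ns.
 * Hypothetical helper for illustration only. */
bool hls_estimate_meets_target(double target_ns, double estimated_ns,
                               double uncertainty_ns) {
    assert(target_ns > 0.0 && estimated_ns >= 0.0 && uncertainty_ns >= 0.0);
    return estimated_ns + uncertainty_ns <= target_ns;
}
```

With the values of the table, `hls_estimate_meets_target(10.0, 7.103, 2.70)` evaluates the sum 7.103 + 2.70 = 9.803 ns against the 10.00 ns target.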



Figure 12: Performance and Resource Estimates in the C-Synthesis Report

From the above image, we observe that the forward module exhibits an overall estimated latency of **18,181 cycles**, corresponding to an execution time of approximately **182,010 ns** under the target clock frequency of 100 MHz (10 ns per cycle).

The initiation interval (II) for the forward\_Pipeline\_fully\_connected\_loop is reported as 11 cycles, indicating a moderate level of pipelining: the loop can start a new iteration every 11 cycles, so the II=1 requested by the pragma is not reached for this loop. The fully connected layer within this module shows a latency of 2,366 cycles and uses 2,895 DSPs, 266,776 flip-flops (FFs), and 208,584 lookup tables (LUTs). Similarly, the convolutional layer shows a latency of 14,625 cycles, with an iteration latency of 4,875 cycles and a trip count of 3. This layer makes significant use of BRAMs (26) and DSPs (2,940), as well as 292,026 FFs and 216,165 LUTs, highlighting the resource-intensive nature of this computation.
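The relation between trip count, iteration latency, and total latency quoted for the convolutional layer can be checked with a short calculation. The sketch below assumes a loop that is not pipelined at this level, so that each iteration runs to completion before the next one starts; the function names are hypothetical.

```c
#include <assert.h>

/* Latency in cycles of a loop whose iterations execute one after another
 * (no overlap between iterations). Hypothetical helper. */
long loop_latency_cycles(long trip_count, long iteration_latency) {
    assert(trip_count >= 0 && iteration_latency >= 0);
    return trip_count * iteration_latency;
}

/* Converts a cycle count into nanoseconds for a given clock period. */
double cycles_to_ns(long cycles, double clock_period_ns) {
    assert(cycles >= 0 && clock_period_ns > 0.0);
    return (double)cycles * clock_period_ns;
}
```

With the reported figures, `loop_latency_cycles(3, 4875)` gives 14,625 cycles, which matches the latency quoted for the convolutional layer. At the 10 ns target period this would correspond to 146,250 ns, a value derived here from the cycle count rather than read from the report.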

As we did for the MLP, let's also check from the report that the hardware interface corresponds to the input and output of the forward pass function, and that the utilized pragma syntax is correct.



Figure 13: Hardware interfaces

**C/RTL Simulation** The C/RTL simulation was performed to evaluate the performance and resource usage of the ConvNet design implemented for FPGA acceleration. This analysis focuses on key metrics, such as initiation interval (II), latency, and total execution time, to optimize the design's efficiency while maintaining correctness.



Figure 15: C/RTL Cosimulation Report - Performance and resource estimates

The image above summarizes the breakdown of the simulation's key loops and modules. Each hierarchical level of the design is analyzed for critical factors affecting the throughput and latency:

- The forward function is the top-level module, with a latency of 22,879 clock cycles and a consistent initiation interval (II) of 1, indicating efficient pipeline execution.
- Submodules within the forward function, such as Pipeline\_flatten and Pipeline\_fully\_connected\_loop, achieve low latency values (592 and 2,364 cycles, respectively) due to effective loop unrolling and partitioning strategies.
- The convolutional\_layer function exhibits a latency of 19,330 cycles, with an II of 1 for its critical loop. This highlights the successful application of optimization techniques, such as array partitioning, ensuring maximum utilization of processing resources.
- The VITIS\_LOOP\_55\_1 loop, part of the convolutional layer, demonstrates a significant latency of 6,443 cycles, as it handles the critical computations of the kernel convolution. This indicates room for further optimization in kernel execution.

**Packaging** Regarding the Packaging section, there is not much to report, as the Vitis IDE does not generate a detailed report for this stage. However, it can be inferred that the packaging process was completed successfully, ensuring that the ConvNet design is ready for integration into larger FPGA-based systems.

**Implementation** In this section, we analyze the RTL Synthesis and the Place and Route stages of the ConvNet design implementation. These steps generate reports that provide insight into the resource utilization, timing performance, and critical paths of the design, enabling further optimization and refinement. During the implementation phase, two important reports were generated:

- Vivado RTL Synthesis Report (hls\_impl\_syn.rpt)
- Vivado Place & Route Report (hls\_impl\_pnr.rpt)

These reports correspond to two critical steps in the FPGA design process, discussed in the following subsections.

### 4.4.2 RTL Synthesis and Fail Fast Analysis

The synthesis step translates the high-level design into a register-transfer level (RTL) description. This involves:

- Generating the hardware description in Verilog or VHDL.
- Estimating the resource utilization (LUTs, FFs, DSPs, BRAMs, etc.).
- Providing an initial assessment of the timing performance against the target clock period.

The results of the synthesis step are summarized in the tables below:

| Criteria                                                  | Guideline | Actual (%) | Status |
|-----------------------------------------------------------|-----------|------------|--------|
| LUT                                                       | 70%       | 52.71%     | OK     |
| FD                                                        | 50%       | 36.51%     | OK     |
| LUTRAM+SRL                                                | 25%       | 1.95%      | OK     |
| CARRY8                                                    | 25%       | 16.69%     | OK     |
| MUXF7                                                     | 15%       | 0.00%      | OK     |
| LUT Combining                                             | 20%       | 27.81%     | REVIEW |
| DSP                                                       | 80%       | 83.33%     | REVIEW |
| RAMB/FIFO                                                 | 80%       | 1.75%      | OK     |
| URAM                                                      | 80%       | 0.00%      | OK     |
| DSP+RAMB+URAM (Avg)                                       | 70%       | 42.54%     | OK     |
| BUFGCE* + BUFGCTRL                                        | 24        | 0          | OK     |
| DONT_TOUCH (cells/nets)                                   | 0         | 0          | OK     |
| MARK_DEBUG (nets)                                         | 0         | 0          | OK     |
| Control Sets                                              | 6399      | 3124       | OK     |
| Average Fanout for modules >100k cells                    | 4         | 1.19       | OK     |
| Max Average Fanout for modules >100k cells                | 4         | 1.19       | OK     |
| Non-FD high fanout nets >10k loads                        | 0         | 0          | OK     |
| TIMING-6 (No common primary clock between related clocks) | 0         | 0          | OK     |
| TIMING-7 (No common node between related clocks)          | 0         | 0          | OK     |
| TIMING-8 (No common period between related clocks)        | 0         | 0          | OK     |
| TIMING-14 (LUT on the clock tree)                         | 0         | 0          | OK     |
| TIMING-35 (No common node in paths with the same clock)   | 0         | 0          | OK     |
| Number of paths above max LUT budgeting (0.250ns)         | 0         | 0          | OK     |
| Number of paths above max Net budgeting (0.177ns)         | 0         | 0          | OK     |

Table 9: Resource Utilization: RTL Synthesis Fail Fast

| Resource | Utilization |
|----------|-------------|
| LUT      | 179,894     |
| FF       | 249,181     |
| DSP      | 2,940       |
| BRAM     | 26          |
| URAM     | 0           |
| SRL      | 3,306       |

| Timing Metric               | Value     |
|-----------------------------|-----------|
| Target Clock Period         | 10.000 ns |
| Post-Synthesis Clock Period | 3.232 ns  |

Table 10: Vivado RTL Synthesis Resource Summary and Timing

The synthesis results indicate that the design fits within the available resources of the target FPGA, although the metrics flagged with REVIEW status deserve attention and are discussed below. The achieved clock period (3.232 ns) is well below the target period (10.000 ns), leaving ample timing slack at this stage.

**Fail Fast Results Analysis**: The REVIEW status in the RTL Synthesis report indicates that two metrics, DSP utilization (83.33%) and LUT Combining (27.81%), exceed their recommended guidelines (80% and 20%, respectively). These values do not cause a failure, but they point to potential bottlenecks or inefficiencies that could affect overall design performance. The DSP utilization in particular is high and may require attention during optimization.

- **DSP Utilization**: DSP resources are slightly over the recommended guideline. While acceptable for synthesis, this may limit future scalability or integration with additional components.
- **LUT Combining**: The value above the guideline indicates potential inefficiencies in logic mapping. It is less critical than DSP utilization.

To determine whether these values are acceptable, we should consider the subsequent Place & Route (PnR) Report, which provides post-implementation resource usage and timing analysis. If the PnR report confirms that timing constraints are met and no routing congestion occurs, the design can be considered acceptable despite the REVIEW statuses in the RTL synthesis phase.
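The percentages in the Fail Fast table follow directly from the absolute counts and the device totals listed in the Target Device paragraph. The sketch below reproduces that arithmetic; the helper names are hypothetical, and the rule that a metric is flagged when it exceeds its guideline is an assumption based on how the table reads.

```c
#include <assert.h>
#include <stdbool.h>

/* Utilization of a resource as a percentage of what the device offers. */
double utilization_percent(long used, long available) {
    assert(used >= 0 && available > 0);
    return 100.0 * (double)used / (double)available;
}

/* Assumed rule: a metric is flagged for review when it exceeds its
 * guideline percentage. Hypothetical helper. */
bool needs_review(double actual_percent, double guideline_percent) {
    return actual_percent > guideline_percent;
}
```

For example, 2,940 DSPs out of 3,528 give about 83.33%, above the 80% guideline, while 179,894 LUTs out of 341,280 give about 52.71%, below the 70% guideline.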

### 4.4.3 Place & Route and Fail Fast Analysis

The place-and-route step maps the synthesized RTL onto the physical resources of the FPGA. This step provides:

- The final resource utilization after physical implementation.
- The timing analysis considering routing delays.
- A pass/fail evaluation of key design metrics.

The results of the place-and-route step are summarized in the tables below:

| Criteria                                                  | Guideline | Actual (%) | Status |
|-----------------------------------------------------------|-----------|------------|--------|
| LUT                                                       | 70%       | 44.13%     | OK     |
| FD                                                        | 50%       | 36.25%     | OK     |
| LUTRAM+SRL                                                | 25%       | 1.84%      | OK     |
| CARRY8                                                    | 25%       | 16.69%     | OK     |
| MUXF7                                                     | 15%       | 0.00%      | OK     |
| DSP                                                       | 80%       | 83.33%     | REVIEW |
| RAMB/FIFO                                                 | 80%       | 1.75%      | OK     |
| URAM                                                      | 80%       | 0.00%      | OK     |
| DSP+RAMB+URAM (Avg)                                       | 70%       | 42.54%     | OK     |
| BUFGCE* + BUFGCTRL                                        | 24        | 0          | OK     |
| DONT_TOUCH (cells/nets)                                   | 0         | 0          | OK     |
| MARK_DEBUG (nets)                                         | 0         | 0          | OK     |
| Control Sets                                              | 6399      | 2546       | OK     |
| Average Fanout for modules >100k cells                    | 4         | 0.99       | OK     |
| Max Average Fanout for modules >100k cells                | 4         | 0.99       | OK     |
| Non-FD high fanout nets >10k loads                        | 0         | 0          | OK     |
| TIMING-6 (No common primary clock between related clocks) | 0         | 0          | OK     |
| TIMING-7 (No common node between related clocks)          | 0         | 0          | OK     |
| TIMING-8 (No common period between related clocks)        | 0         | 0          | OK     |
| TIMING-14 (LUT on the clock tree)                         | 0         | 0          | OK     |
| TIMING-35 (No common node in paths with the same clock)   | 0         | 0          | OK     |
| Number of paths above max LUT budgeting (0.250ns)         | 0         | 0          | OK     |
| Number of paths above max Net budgeting (0.177ns)         | 0         | 0          | OK     |

Table 11: Resource Utilization: Place & Route Fail Fast

| Resource | Utilization |
|----------|-------------|
| LUT      | 150,621     |
| FF       | 247,414     |
| DSP      | 2,940       |
| BRAM     | 26          |
| URAM     | 0           |
| SRL      | 3,242       |

| Timing Metric           | Value     |
|-------------------------|-----------|
| Target Clock Period     | 10.000 ns |
| Post-Route Clock Period | 9.354 ns  |

Table 12: Vivado Place & Route Resource Summary and Timing

The resource utilization remains within acceptable limits after place-and-route, with slight changes compared to the synthesis stage.

**Timing Analysis**: Timing was met, with a post-route clock period of 9.354 ns, below the 10.000 ns target. The design can therefore operate at the desired frequency once routing delays are taken into account. The Post-Synthesis Clock Period of 3.232 ns reported earlier reflects the timing achievable after synthesis, where the design is represented at a higher level of abstraction without physical constraints such as placement and routing. The 9.354 ns obtained here accounts for the additional delays introduced by the placement and routing phases, including wire propagation delays and congestion. Both values meet the Target Clock Period of 10.000 ns, although the post-route margin is much narrower (0.646 ns of slack).
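The margin between the target and the achieved clock period, and the clock frequency that an achieved period corresponds to, can be computed as follows. This is a minimal sketch with hypothetical helper names; it uses only the two periods reported in the tables.

```c
#include <assert.h>

/* Slack in ns: positive when the achieved period is shorter than the target. */
double timing_slack_ns(double target_ns, double achieved_ns) {
    return target_ns - achieved_ns;
}

/* Clock frequency in MHz corresponding to a clock period in ns. */
double max_frequency_mhz(double achieved_period_ns) {
    assert(achieved_period_ns > 0.0);
    return 1000.0 / achieved_period_ns;
}
```

For the post-route period of 9.354 ns the slack is 0.646 ns and the corresponding frequency is roughly 106.9 MHz, compared with the 100 MHz target. For the post-synthesis period of 3.232 ns the slack is 6.768 ns.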

**Fail Fast Results Analysis**: The implementation results confirm that the design fits within the available resources and meets the timing requirements. The DSP utilization (83.33%) stays above the 80% guideline at both stages, which suggests room for optimization. Nevertheless, the design is functional and meets the performance criteria, indicating a successful implementation for FPGA acceleration.

# Final Remarks

The implementation and analysis provided in this document are supported by the source code and additional resources available in the GitHub repository. The repository contains the complete project files, including code, configuration scripts, and detailed documentation.

You can access the GitHub repository at the following link:



https://github.com/giuliocapecchi/Vitis-HLS