

## DEPARTMENT OF INFORMATION TECHNOLOGY AND ELECTRICAL ENGINEERING

Spring Semester 2020

## Implementation of a Heterogeneous System for Image Processing on an FPGA

Semester Project / Master Project

Pierre-Hugues BLELLY pblelly@student.ethz.ch

May 2020

Supervisors: Matheus Cavalcante, matheusd@iis.ee.ethz.ch

Samuel Riedel, sriedel@iis.ethz.ch

Professor: Prof. Luca Benini, lbenini@ethz.ch

# Acknowledgements

# Abstract

# Declaration of Originality

I hereby confirm that I am the sole author of the written work here enclosed and that I have compiled it in my own words. Parts excepted are corrections of form and content by the supervisor. For a detailed version of the declaration of originality, please refer to Appendix B.

Pierre-Hugues BLELLY, Zurich, May 2020

# Contents

| Section | Title                                                           |
|---------|-----------------------------------------------------------------|
|         | List of Acronyms                                                |
| 1.      | Introduction                                                    |
| 1.1.    | Design Issue with heterogeneous systems                         |
| 1.2.    | Currently Available Workflow for HERO                           |
| 2.      | Background                                                      |
| 2.1.    | HERO                                                            |
| 2.2.    | Halide Language                                                 |
| 2.2.1.  | Programming model                                               |
| 2.2.2.  | Debugging Options                                               |
| 2.2.3.  | Basic Scheduling Options                                        |
| 3.      | Design Implementation                                           |
| 3.1.    | Porting Halide to new Platforms                                 |
| 3.2.    | Schedule Implementation                                         |
| 3.2.1.  | Modification to the PULP runtime                                |
| 3.3.    | Compilation Workflow                                            |
| 3.3.1.  | Compiling for the full platform                                 |
| 4.      | Results                                                         |
| 4.1.    | Test Setup                                                      |
| 4.2.    | Halide Results                                                  |
| 4.3.    | Comparison with an already working toolchain: OpenMP            |
| 4.4.    | Comparison between OpenMP and Halide on the different platforms |
| 5.      | Conclusion                                                      |
| 5.1.    | Conclusion                                                      |
| 5.2.    | Future Work                                                     |
| A.      | Task Description                                                |
| A.1.    | Introduction                                                    |
| A.2.    | Project description                                             |
| A.3.    | Required skills                                                 |
| A.3.1.  | Meetings & Presentations                                        |
| A.3.2.  | References                                                      |
| B.      | Declaration of Originality                                      |
|         | Glossary                                                        |

# List of Figures

| Figure | Caption                                                                  |
|--------|--------------------------------------------------------------------------|
| 2.1.   | Base Schedule                                                            |
| 2.2.   | Reorder Schedule                                                         |
| 2.3.   | Fused Schedule                                                           |
| 2.4.   | Split Schedule                                                           |
| 2.5.   | Tile Schedule                                                            |
| 2.6.   | Unroll Schedule                                                          |
| 2.7.   | Parallel Schedule                                                        |
| 4.1.   | Halide results relative to Halide base schedule performance              |
| 4.2.   | Roofline plot for Halide                                                 |
| 4.3.   | Roofline plot for Halide                                                 |
| 4.4.   | Halide results relative to Halide base schedule performance              |
| 4.5.   | Impact of the vectorization factor on the performance of the application |

# List of Tables

| Table | Caption                                                                                     |
|-------|---------------------------------------------------------------------------------------------|
| 4.1.  | Benchmark results in number of cycles and operations per cycle for Halide                   |
| 4.2.  | Benchmark results in number of operations (operations per cycle) for Halide and OpenMP      |

# List of Acronyms

| Acronym    | Meaning                                             |
|------------|-----------------------------------------------------|
| AARCH64    | 64-bit ARM architecture                             |
| API        | Application Programming Interface                   |
| CI         | Continuous Integration                              |
| CPU        | Central Processing Unit                             |
| CUDA       | Compute Unified Device Architecture                 |
| EEES       | Energy Efficient Embedded Systems                   |
| ETH Zürich | Eidgenössische Technische Hochschule Zürich         |
| FPGA       | Field Programmable Gate Array                       |
| GPU        | Graphics Processing Unit                            |
| HERO       | Heterogeneous Embedded Research Platform            |
| IIS        | Integrated Systems Laboratory                       |
| ISA        | Instruction Set Architecture                        |
| LLVM       | Low Level Virtual Machine                           |
| MIPS       | Microprocessor without Interlocked Pipelined Stages |
| MIT        | Massachusetts Institute of Technology               |
| OpenCL     | Open Computing Language                             |
| OpenMP     | Open Multi-Processing                               |
| PMCA       | Programmable Many Core Accelerator                  |
| PULP       | Parallel Ultra Low Power                            |
| RTL        | Register Transfer Level                             |
| SIMD       | Single Instruction Multiple Data                    |
| SoC        | System-on-Chip                                      |
| ULP        | Ultra Low Power                                     |



# 1. Introduction

Thanks to the smaller nodes of modern lithography technologies and the transistor density they make possible, modern low-power Central Processing Units (CPUs) can have a large number of cores while keeping their power consumption under a few watts. A single Raspberry Pi 3 has a peak performance of 6 GFLOP/s for a power consumption of only 7 W [1]. Embedded systems can take advantage of this increase in efficiency to become more autonomous and no longer rely on an external computer for heavy computation. This type of architecture can be found on some nano drones such as the CrazyFlie 2.0 [2], which can be extended with additional shields. Using a custom shield, the Integrated Systems Laboratory (IIS) of ETH Zürich managed to analyze a video signal in real time and train a neural network for autonomous navigation [3]. The compute unit achieved a rate of 281 MMAC/s within a power envelope of only 45 mW. These results were achieved thanks to the heterogeneous architecture of the drone. To keep its power consumption low, the drone wakes up the Ultra Low Power (ULP) chip of the shield only during computation. The shield uses a Parallel Ultra Low Power (PULP) cluster, a RISC-V System-on-Chip (SoC) that can be configured with up to eight cores and provides the computing power needed to analyze the video. With this configuration, the energy consumption of the CrazyFlie stays low: the battery life when using the shield only drops by ten seconds compared to when the shield is turned off [?].

In more general terms, heterogeneous systems are composed of multiple coprocessors, all managed by a host processor. This architecture is interesting for embedded systems, as it can achieve greater energy efficiency than homogeneous systems. If each coprocessor has been designed to solve a certain task, it can achieve better energy efficiency than a general purpose CPU. Venkat and Tullsen [4], from the University of California, showed that under heavy design constraints (such as die area or thermal dissipation), systems using multiple Instruction Set Architectures (ISAs) achieved better performance than their best homogeneous counterpart.

This strategy has been used in the SoC industry by ARM since 2011 [5]. The big.LITTLE architecture is based on two clusters of ARM Cortex A7 (the "LITTLE" cores) and A15 (the "big" cores), and was designed to increase the computing power of low-power systems such as smartphones while increasing the battery life of the device. This architecture relied on a single ISA (ARMv7). The goal was to use the more powerful cores during heavy computation or graphics rendering, and let the low-power cores handle the background tasks or manage the device during sleep. Presently, every smartphone SoC manufacturer uses the big.LITTLE architecture or a similar technology.

Even in data centers, where power consumption is also an issue, Graphics Processing Units (GPUs) are used thanks to their massive core count and to Application Programming Interfaces (APIs) such as Compute Unified Device Architecture (CUDA) or Open Computing Language (OpenCL), which simplify the development process for GPU accelerators.

Heterogeneous Embedded Research Platform (HERO) [6] is a heterogeneous system developed by the IIS of ETH Zürich and the Energy Efficient Embedded Systems (EEES) of the University of Bologna. This platform is composed of a hard macro ARM 64 CPU and up to eight PULP clusters (RISC-V cores) running on a Xilinx Zynq ZC706 Field Programmable Gate Array (FPGA).

This platform is designed to "facilitate rapid exploration on all software and hardware layers" [6], and includes a heterogeneous compilation toolchain with support for Open Multi-Processing (OpenMP), an API developed to make the development of multithreaded applications easier [7]. This API adds preprocessor directives that tell the compiler how to execute the code on the system.

## 1.1. Design Issue with heterogeneous systems

When designing a heterogeneous system, numerous design choices need to be made, in particular how the CPUs in the system will interact with each other. These choices impact the peak performance of the design and its power consumption [4]. The computer architect has to choose how the different Programmable Many Core Accelerators (PMCAs) will interact, how they will share data, and whether to extend the existing ISAs to distribute tasks. The software design is also challenging: when compiling for heterogeneous platforms, the compiler needs to create an executable that will run on the host processor, but also dedicate parts of the final binary to the code that will be distributed on the PMCAs.

Even though APIs such as CUDA, OpenCL or OpenMP did a great job at making the overall development easier, most of the work is still done by hand. The programmer has to handle memory mapping: depending on the system architecture, the data may be stored in the shared memory or in the PMCA memory. Every task also needs to be scheduled by hand and distributed on the correct PMCA. Moreover, the code is often not portable, as some schedules are target dependent. An algorithm coded with CUDA will only run on a GPU, so the code cannot be reused for another platform. Porting an API to a new platform is not trivial and might require months of work.

## 1.2. Currently Available Workflow for HERO

Currently, HERO supports OpenMP [8], an API which "defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer" [9]. This API has been implemented on HERO to easily take advantage of the PULP clusters. The toolchain uses the Clang compiler [7] to compile the applications. HERO uses custom Clang front ends to support all the available configurations (the PULP cluster alone for simulation, with the ARM host CPU, or with a 64-bit RISC-V CPU).

To distribute the code, OpenMP uses preprocessor directives to tell Clang where the code will run and how it will be executed. Exploring the design space using OpenMP's directives can be time-consuming. For example, the developer must explicitly state which part of the code to offload. Changing the order of multiple loops may introduce bugs in the algorithm, and complex schedules often hurt code readability, which makes them harder to debug.
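As an illustration of such directives, the following is a minimal sketch of a matrix multiplication offloaded with generic OpenMP 4.x pragmas. It is not taken from the HERO code base: the function name matmul\_offload, the row-major layout and the map clauses are assumptions made for this example. When OpenMP support is not enabled in the compiler, the pragmas are ignored and the loops simply run serially on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: c = a * b for n x n row-major integer matrices.
std::vector<int> matmul_offload(const std::vector<int>& a,
                                const std::vector<int>& b, int n) {
    assert(n >= 0);
    assert(a.size() == static_cast<std::size_t>(n) * n && b.size() == a.size());
    std::vector<int> c(static_cast<std::size_t>(n) * n, 0);
    const int* pa = a.data();
    const int* pb = b.data();
    int* pc = c.data();
    // "target" marks the region to offload; "map" lists the arrays copied
    // to the accelerator (to) and back to the host (from).
    #pragma omp target map(to: pa[0:n*n], pb[0:n*n]) map(from: pc[0:n*n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int acc = 0;
            for (int k = 0; k < n; ++k) {
                acc += pa[i * n + k] * pb[k * n + j];
            }
            pc[i * n + j] = acc;
        }
    }
    return c;
}
```

Note how the offloading decision, the data movement and the loop order are all written into the algorithm itself: changing any of them means editing this function.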

Halide [10] was proposed to explore the idea of separating the algorithm from how the code will be executed on the target (the schedule). This separation makes testing different schedules easier for the developer, as the algorithm code stays the same and only the scheduling changes between tests. Every processing pipeline designed with Halide has two parts. The first part is the functional description of the processing kernel, i.e., the algorithm that will be executed. The second part is the schedule of the pipeline, in which the programmer explicitly tells Halide how the pipeline should be executed. Thanks to specific function calls, the developer can decide whether the code will run on multiple threads or a single one, change the order of execution of different parts, and split or unroll them. The developer is free to implement any schedule they want without having to change the main algorithm. This programming model is interesting because the developer can quickly implement the algorithm without having to take the boundaries of the inputs into account, then work on an optimal schedule, or quickly adapt it if the algorithm needs to be executed on another platform.

The intermediate variables can be bounded afterwards if needed, and the principal variables, such as the characteristics of the inputs, are automatically bounded by Halide. An image processing pipeline will only compute the output on the pixels of the input. The scheduling process can even be done automatically by the library during compilation, in order to find an optimal schedule for the target platform.

The goal of this project was to port Halide to HERO, and execute image processing kernels on the HERO system running on an FPGA. First, Halide needs to be built with RISC-V support and used to compile basic applications for the hardware simulation. From there, we can work on the heterogeneous compilation to support the current HERO test platform.



# 2. Background

## 2.1. HERO

The HERO platform is a heterogeneous platform available in different configurations. It is composed of a hard macro multicore ARM 64 Juno SoC (composed of two Cortex A57 and four Cortex A53 cores) and up to eight PULP clusters (each of them using up to eight RI5CY cores [6]), running on an FPGA (a Xilinx Zynq ZC706). PULP is a cluster of CPUs based on the RISC-V ISA, an open source ISA designed to support a wide range of platforms, from embedded systems to accelerators in data centers [11]. The modularity of the ISA makes it interesting for PMCAs, as the cores can be designed to support only the instructions useful for the task we want to run, which makes them small and energy efficient. The system is also available with an Ariane 64-bit RISC-V core, or just as an independent PULP cluster for hardware simulation. The system uses 256 KiB of L1 scratchpad data cache, coupled with 256 KiB of L2 data and instruction cache and 4 KiB of L1 cache.

HERO has a fully functional software toolchain with "support for OpenMP, a linux driver and runtime libraries for both the host and the PMCA" [6]. The toolchain uses Clang, a C compiler front end of Low Level Virtual Machine (LLVM). Heterogeneous compilation is done by compiling the host and PMCA parts of the code separately and then bundling the two compiled files into a single binary during the linking phase of the host [8]. The final binary uses the host ISA and embeds the PMCA code in dedicated sections.

## 2.2. Halide Language

### 2.2.1. Programming model

Halide is a functional programming language embedded into C++, designed to write high performance image and array-processing code [12]. This language uses a functional paradigm to describe the processing pipeline, and dissociates the array-processing code from its schedule (how the code will be compiled and run on the system).

Every pipeline is a function (Halide::Func) built using other functions, expressions (Halide::Expr) and variables (Halide::Var). Listing 2.1 describes a basic pipeline which computes the distance of each coordinate of a two-dimensional array from a given position (center\_x, center\_y). The creation of the pipeline is straightforward: we only need to write the desired operation using the variables x and y. During the execution of the pipeline or its compilation, Halide will bound x and y according to the size of the output.
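To make the semantics concrete, here is a plain C++ sketch (not Halide code) of what such a one-stage pipeline computes. The function name distance\_map, the use of the Euclidean distance and the row-major output layout are assumptions made for this illustration. The explicit width and height arguments play the role of the bounds that Halide infers from the output size.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Plain C++ equivalent of a one-stage pipeline:
// out(x, y) = Euclidean distance from (x, y) to (center_x, center_y).
// The result is stored row-major: element (x, y) is at index y * width + x.
std::vector<float> distance_map(int width, int height,
                                float center_x, float center_y) {
    assert(width >= 0 && height >= 0);
    std::vector<float> out(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const float dx = static_cast<float>(x) - center_x;
            const float dy = static_cast<float>(y) - center_y;
            out[static_cast<std::size_t>(y) * width + x] =
                std::sqrt(dx * dx + dy * dy);
        }
    }
    return out;
}
```

In Halide, only the expression inside the inner loop is written by the programmer; the two loops and their bounds are generated from the schedule and the output size.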

This simple pipeline only has one stage, but it is possible to create multi-stage pipelines and schedule them as desired. They can be transformed into a single-stage inlined pipeline or kept as is. The different stages can be scheduled to start as soon as they have enough data, or wait for the previous one to finish before starting to compute.

Scheduling is done via basic scheduling primitives implemented by Halide. The primitives consist of basic code transformations such as loop unrolling, reordering, splitting, or fusing several variables into a single one. More advanced instructions like parallelization or vectorization are also available. These instructions can be combined as needed to create complex schedules. Section 2.2.3 explains the most important scheduling instructions in more detail.

Listing 2.2 shows how schedules are written. All instructions are methods of the pipeline object and can be applied to any variable of the pipeline. Some instructions (e.g., the vectorize instruction) need the variables to be bounded before they can be used. The scheduling primitives can be combined as needed, and the programmer can also create intermediate variables via those primitives to precisely control the execution of the code.

```
gradient.parallel(x);    // one task per value of x, executed in parallel
gradient.unroll(y, 10);  // compute ten values along y per loop iteration
```

Listing 2.2: Simple Schedule Example.

In Listing 2.2, Halide creates one task per value of x. These tasks will be executed in parallel on all the cores of the system. Every task will execute a single loop over the y axis, but instead of computing only one output value of the pipeline per iteration, the task will compute ten values per iteration.

The pipeline can be translated or compiled by Halide to be executed directly on the build machine or in another application. The pipeline can be executed immediately using the function .realize(x\_max, y\_max). If an output buffer of the correct size is provided, Halide will execute the pipeline over the rectangular domain from (0, 0) to (x\_max, y\_max). As Halide was designed primarily to work with different hardware platforms, the cross-compilation process has been simplified, and the pipeline can be translated to other languages. Halide supports translation to C code, to an LLVM assembly file, or to an already compiled object file specific to a given target (CUDA, ARM, RISC-V, Microprocessor without Interlocked Pipelined Stages (MIPS), PowerPC...) and a given operating system (Linux, Mac, Windows, Android). The pipeline can also be exported as a static library for use in another application.

### 2.2.2. Debugging Options

Halide provides tools to debug the pipelines, and debugging tips to help developers [13]. The print() instruction prints the value of a variable at any point of the pipeline, while print\_when() only prints when a boolean condition is true. The .trace\_stores() function keeps a trace of every function evaluation during execution: as long as the function has not been inlined, the parameters and the result of each function call are stored in the trace and printed after the execution.

Halide can print more information on the screen during the compilation of the source code when the environment variable HL\_DEBUG\_CODEGEN is set to 1. Halide will then output information about every stage of the compilation and a pseudocode representation of the pipeline loops. Finally, variables and functions can be labeled. Halide will replace the generic name of the variable with the label when printing the pseudocode or when using gdb.

### 2.2.3. Basic Scheduling Options

Every Halide scheduling instruction applies a simple modification to the generated code. Every instruction affects one or multiple variables. There is no limit to the complexity of the schedule or to the number of variables inside a pipeline.



Figure 2.1.: Base Schedule.

#### Default Schedule

If no schedule is specified, Halide will evaluate the pipeline in the same order as its arguments: the first variable is the innermost loop, and the last one the outermost loop. In Figure 2.1, Halide computes the output of the pipeline in a row-major fashion.

#### Reorder



Figure 2.2.: Reorder Schedule.

The .reorder instruction reorders the variables to have the given nesting order, starting from the innermost. In Figure 2.2, the array is now processed in a column-major fashion.
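The difference between the default and the reordered loop nest can be sketched in plain C++ (not Halide) by recording the order in which the coordinates are visited. The function names and the Coord alias are hypothetical and only serve this illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Coord = std::pair<int, int>;  // (x, y)

// Default schedule: x is the innermost loop, so the array is walked row by row.
std::vector<Coord> default_order(int width, int height) {
    assert(width >= 0 && height >= 0);
    std::vector<Coord> visited;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            visited.emplace_back(x, y);
        }
    }
    return visited;
}

// Reordered so that y is innermost: the array is walked column by column.
std::vector<Coord> reordered(int width, int height) {
    assert(width >= 0 && height >= 0);
    std::vector<Coord> visited;
    for (int x = 0; x < width; ++x) {
        for (int y = 0; y < height; ++y) {
            visited.emplace_back(x, y);
        }
    }
    return visited;
}
```

Both functions visit the same set of coordinates; only the traversal order, and therefore the memory access pattern, changes.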

#### Fuse



Figure 2.3.: Fused Schedule.

The .fuse instruction fuses two dimensions together, transforming a two-dimensional array into a one-dimensional array.
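A minimal plain C++ sketch of the fused loop is given below, assuming x is the inner and y the outer of the two fused dimensions. The function name fused\_order is hypothetical; the original coordinates are recovered from the single fused index with a division and a modulo.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Coord = std::pair<int, int>;  // (x, y)

// One loop over the fused index replaces the nested x and y loops.
std::vector<Coord> fused_order(int width, int height) {
    assert(width >= 0 && height >= 0);
    std::vector<Coord> visited;
    for (int fused = 0; fused < width * height; ++fused) {
        const int y = fused / width;  // outer (slow) coordinate
        const int x = fused % width;  // inner (fast) coordinate
        visited.emplace_back(x, y);
    }
    return visited;
}
```

The traversal order is the same as in the default schedule, but a single loop variable now covers the whole array, which is convenient when that one loop is then parallelized.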

#### Split



Figure 2.4.: Split Schedule.

This schedule splits a loop into an inner and an outer sub-dimension, where the size of the inner dimension is specified by the last argument. This schedule is useful to cut the array into smaller pieces that will be computed in parallel or using Single Instruction Multiple Data (SIMD) instructions.

#### Tile



Figure 2.5.: Tile Schedule.

The Tile schedule is similar to the Split schedule, but along two dimensions. It creates multiple smaller rectangular tiles which can be processed independently.
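The resulting loop nest can be sketched in plain C++ as follows. This is an illustration only: the function name tiled\_order is hypothetical, and the sketch assumes that the tile size divides the array size evenly, which avoids any handling of partial tiles.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Coord = std::pair<int, int>;  // (x, y)

// Both dimensions are split; the two outer loops walk over the tiles and
// the two inner loops walk over the pixels inside one tile.
std::vector<Coord> tiled_order(int width, int height, int tile_w, int tile_h) {
    assert(tile_w > 0 && tile_h > 0);
    // Assumption of this sketch: the tiles divide the array evenly.
    assert(width % tile_w == 0 && height % tile_h == 0);
    std::vector<Coord> visited;
    for (int y_outer = 0; y_outer < height / tile_h; ++y_outer) {
        for (int x_outer = 0; x_outer < width / tile_w; ++x_outer) {
            for (int y_inner = 0; y_inner < tile_h; ++y_inner) {
                for (int x_inner = 0; x_inner < tile_w; ++x_inner) {
                    visited.emplace_back(x_outer * tile_w + x_inner,
                                         y_outer * tile_h + y_inner);
                }
            }
        }
    }
    return visited;
}
```

For a 4x2 array with 2x2 tiles, the left tile is completed before the right tile is started.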

#### Unroll



Figure 2.6.: Unroll Schedule.

The Unroll schedule unrolls the code along one dimension. This technique is often used when multiple computations share the same data, to avoid repeated memory accesses. In Figure 2.6, we first split the x dimension before unrolling, as Halide cannot unroll a variable if it is not bounded.
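The combination of a split followed by an unroll can be sketched in plain C++ as follows. The names value\_at and split\_and\_unrolled are hypothetical, value\_at is a placeholder for the per-element computation of the pipeline, and the split factor of two is chosen only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder per-element computation (hypothetical, stands in for the pipeline).
int value_at(int x, int y) { return x + y; }

// x is split by a factor of 2, then the inner loop of extent 2 is unrolled.
std::vector<int> split_and_unrolled(int width, int height) {
    assert(width >= 0 && height >= 0);
    assert(width % 2 == 0);  // sketch assumes the split factor divides the width
    std::vector<int> out(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        for (int x_outer = 0; x_outer < width / 2; ++x_outer) {
            // The x_inner loop has a known extent of 2, so its body is
            // written out twice instead of being executed in a loop.
            const int x0 = x_outer * 2 + 0;
            const int x1 = x_outer * 2 + 1;
            out[static_cast<std::size_t>(y) * width + x0] = value_at(x0, y);
            out[static_cast<std::size_t>(y) * width + x1] = value_at(x1, y);
        }
    }
    return out;
}
```

The split is what gives the inner loop a fixed, known extent; without it there would be no constant number of copies of the body to write out.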

#### Parallel



Figure 2.7.: Parallel Schedule.

The parallel schedule distributes the pipeline to all the available cores.

In Figure 2.7, the code is distributed on three cores, each of which executes a single loop along the y axis.

Halide will create a task for each value the variable can take, and these tasks will be executed with the halide\_do\_par\_for function. This function needs to be overridden on HERO to distribute the tasks on the PULP cluster.

#### Vectorize

The goal of this schedule is to set up the code to make use of the SIMD instructions of the CPU. Currently, LLVM does not support the SIMD extension implemented in the PULP cluster, but the generated code will take advantage of all the available registers to compute the output values, and try to compute multiple values at the same time.



# 3. Design Implementation

To test Halide, I used two applications. The first one was the basic gradient example of Listing 2.1, and the second one a matrix multiplication pipeline that I took from the Halide repository and adapted for use in a HERO application. The matrix example is more interesting, as it represents what a typical signal processing application may do. It is also quite easy to benchmark with different sizes to see the impact of memory accesses on the execution time.

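```
ImageParam A(type_of<int>(), 2);  // first input matrix (2-D, int)
ImageParam B(type_of<int>(), 2);  // second input matrix (2-D, int)
Var x, y;
Func matrix_mul("matrix_mul");
Func out;
RDom k(0, A.width());             // reduction domain over the shared dimension
matrix_mul(x, y) += A(x, k) * B(k, y);  // accumulate the products over k
out(x, y) = matrix_mul(x, y);
```

Listing 3.1: Matrix Multiplication Pipeline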

Listing 3.1 shows the full algorithm implementation. The code is straightforward and is pretty close to the mathematical expression of the operation.
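Written out, the update in Listing 3.1 corresponds to the following sum, where the reduction variable k runs over the width of A. The symbol W for that width is notation introduced here for convenience.

$$
\mathrm{out}(x, y) = \sum_{k=0}^{W-1} A(x, k)\, B(k, y)
$$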

## 3.1. Porting Halide to new Platforms

Halide is compiled using LLVM. As the HERO toolchain already includes a build of this compiler, we can use it to compile Halide. Some of the build options are incompatible with Halide: the -DBUILD\_SHARED\_LIBS flag has to be disabled, and LLVM also needs to support the x86 ISA to compile Halide for the build computer. Halide can now be built automatically using the tc-halide target in the main Makefile. Once Halide has been added to the toolchain, we can work on porting it to HERO.

The library header file generated by Halide explains which functions need to be implemented to make Halide work on our target platform. We can also use the error messages during the compilation to implement the missing functions.

To make Halide work, we only need to port a small subset of functions. As most of the schedules only restructure the code, only the parallel schedule requires platform-specific code. Halide therefore works once the functions behind the parallel schedule, the memory allocation primitives, and the debugging functions (halide\_printf) are implemented.

## 3.2. Schedule Implementation

Most of the schedules implemented in Halide do not require any platform-specific implementation, as they only work with loops, but we have to add the missing functions to the PULP runtime.

### 3.2.1. Modification to the PULP runtime

The missing Halide functions need to be accessible to the PULP runtime. To do so, we created a new file in the kernel (halide\_api.c). This file contains all the basic functions required to run Halide on HERO, as well as the atomic operations.

Currently, only the parallel() instruction needs a specific function: halide\_do\_par\_for. This function initializes the PULP cluster and adds all the parallel tasks to the cluster queue. These tasks execute halide\_do\_par\_for\_fork, which is a wrapper around the pipeline function. Each core selects which tasks it will run based on the task id: if the id of the task modulo the number of cores is equal to the core id, the core executes the task.
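The selection rule can be modelled in a few lines of plain C++. This is a hypothetical model of the rule described above, not the code of halide\_api.c: the function name tasks\_for\_core is an assumption, and the task is only recorded instead of being executed.

```cpp
#include <cassert>
#include <vector>

// Model of the dispatch rule: a core runs a task only when
// task_id % n_cores equals its own core id.
std::vector<int> tasks_for_core(int core_id, int n_cores, int n_tasks) {
    assert(n_cores > 0 && core_id >= 0 && core_id < n_cores);
    std::vector<int> mine;
    for (int task_id = 0; task_id < n_tasks; ++task_id) {
        if (task_id % n_cores == core_id) {
            mine.push_back(task_id);  // the real runtime would run the task here
        }
    }
    return mine;
}
```

With eight cores and twenty tasks, core 0 gets tasks 0, 8 and 16, while core 7 only gets tasks 7 and 15. The assignment is static, so no synchronization is needed to decide who runs what.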

Halide does not support the RISC-V SIMD extension, but the vectorize schedule may still be used, as Halide will reshape the code as if it were manipulating vectors.

## 3.3. Compilation Workflow

Every application has at least two source files: a C++ file that generates the object file of the pipeline, and the main application. The compilation has two phases. During the first one, we compile the Halide application using LLVM and run it on the host platform; this application then generates a RISC-V object file and a header. Then we compile the HERO application using the same Makefile as the OpenMP applications, but we also include the header in the main application and add the object file to the sources in the linking command.

### 3.3.1. Compiling for the full platform

This process only works on the hardware simulation, and I did not manage to make it work on the full HERO platform. I approached the problem using different strategies. The first one was to use the already compiled object file and add it during the linking process. This method did not work, as Clang had no indication that the code should be distributed on the PULP cluster, so the RISC-V object file was incompatible.

The second idea was to use Halide to output C code and include the source in the HERO application. The issue is that the output of Halide is not pure C code: the pipeline function is written in C, but some structures still use C++ style. The output needs to be modified by hand to be included in the application. Even after those modifications, the header creates incompatibilities, so this method is not usable.

The last idea was to use an OpenMP #pragma directive to distribute the execution of the function to the PULP cluster, and then include the LLVM assembly file of the pipeline in the application. As the first step of the compilation process for OpenMP uses the LLVM assembly files to compute the offsets in memory, this method may be the best one to compile a Halide application for HERO. However, I did not have enough time to make the heterogeneous compilation work, so this method remains untested and might not work.



# 4. Results

## 4.1. Test Setup

I tested the Halide implementation using two applications: first the simple pipeline described in Listing 2.1, to check that everything was working, and then a matrix multiplication pipeline, to be closer to a real scenario. Once I had checked that Halide was working correctly, I ran a similar matrix multiplication program using OpenMP to compare the parallelisation performance of Halide against the toolchain currently available on HERO. To check the output of the Halide application, I compared the output of the pipeline to a precomputed result.

During all Halide benchmarks, the two input matrices were randomly generated using a Python script. As a multiplication on the PULP cluster always takes two cycles, the content of the matrices does not impact the execution time of the programs. To get comparable results for both applications, I stored the matrices in the L1 cache of HERO to reduce the memory latency.

To measure the performance of both applications (Halide and OpenMP), I used two functions of the HERO runtime: hero\_reset\_clk\_counter() and hero\_get\_clk\_counter(). These two functions respectively reset a cycle counter and return the counter's value when called. As their execution is short, we get cycle-accurate results for our benchmarks. The benchmarks were executed on the hardware simulator, as I did not have time to make the heterogeneous compilation work.

## 4.2. Halide Results



Figure 4.1.: Halide results relative to Halide base schedule performance



Figure 4.2.: Roofline plot for Halide

| Schedule                     | 15x15         | 20x20         | 25x25          | 30x30          | 35x35          | 40x40          |
|------------------------------|---------------|---------------|----------------|----------------|----------------|----------------|
| Halide: No Schedule          | 40628 (0.166) | 93818 (0.171) | 180686 (0.173) | 309426 (0.175) | 488316 (0.176) | 725606 (0.176) |
| Halide: Parallel             | 6950 (0.971)  | 15585 (1.027) | 30413 (1.028)  | 42659 (1.266)  | 71279 (1.203)  | 92536 (1.383)  |
| Halide: Parallel + Vectorize | 4339 (1.556)  | 8358 (1.914)  | 15585 (2.005)  | 18776 (2.876)  | 32295 (2.655)  | 36487 (3.508)  |

Table 4.1.: Benchmark results for Halide in number of cycles, with operations per cycle in parentheses

Figures 4.1 and 4.2 show the results of Halide for matrix sizes ranging from 15 to 40. The benchmarks were done using three different schedules: the default one, with no parallelisation; one schedule with parallelisation along the y axis; and one schedule which combines a parallel schedule with a vectorize one. In Figure 4.1, we can see that the parallelisation instruction is efficient, as this schedule needs between 5.8 and 7.8 times fewer cycles than the default schedule (Table 4.1).

Figure 4.2 shows the performance of Halide compared to an ideal scenario. One matrix multiplication takes $2n^3$ arithmetic operations and, in the best scenario, each of the eight cores achieves one arithmetic operation per cycle, so we can get up to eight operations per cycle.
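As a worked check of this bound, the ideal cycle count and the fraction of the peak reached by a run can be written out. The symbols $c$ (measured cycles), $c_{\text{ideal}}$ and $\eta$ are notation introduced here for illustration, and the bound assumes that the eight cores are the only source of parallelism. The measured value is the 40x40 parallel and vectorize entry of Table 4.1.

$$
c_{\text{ideal}}(n) = \frac{2n^3}{8} = \frac{n^3}{4},
\qquad
\eta(n) = \frac{2n^3 / c}{8}
$$

$$
n = 40:\quad c_{\text{ideal}} = \frac{2 \cdot 40^3}{8} = 16000 \text{ cycles},
\qquad
\eta = \frac{128000 / 36487}{8} \approx \frac{3.508}{8} \approx 0.44
$$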

For the default schedule, we can see that the overall performance stays constant and does not depend on the matrix size. For the two other schedules, however, the performance increases with the matrix size. This can be explained by the core utilisation: as every core only executes a task when core\_id == task\_id % n\_cores, the overall core utilisation is lower for smaller matrices when the matrix size is not a multiple of the number of cores. For a matrix of size twenty, four cores execute three tasks and four cores execute two tasks. For bigger matrices, cores stay inactive for a smaller fraction of the total execution time, which increases the overall performance of the system.
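The task distribution described above can be reproduced with a few lines of plain C++. This is only a sketch: the function names are hypothetical, one task per matrix row is assumed, and the utilisation figure further assumes that every task costs the same and that the run lasts as long as the busiest core.

```cpp
#include <cassert>

// Number of tasks executed by core `core_id` when task t runs on the core
// for which core_id == t % n_cores (the rule quoted in the text).
int tasks_on_core(int n_tasks, int n_cores, int core_id) {
    assert(n_cores > 0 && core_id >= 0 && core_id < n_cores);
    int count = 0;
    for (int task_id = 0; task_id < n_tasks; ++task_id)
        if (core_id == task_id % n_cores) ++count;
    return count;
}

// Fraction of the available core time spent on useful work, under the
// simplifying assumptions stated above.
double core_utilisation(int n_tasks, int n_cores) {
    int busiest = 0;
    for (int core = 0; core < n_cores; ++core) {
        int t = tasks_on_core(n_tasks, n_cores, core);
        if (t > busiest) busiest = t;
    }
    if (busiest == 0) return 0.0;
    return static_cast<double>(n_tasks) /
           (static_cast<double>(n_cores) * busiest);
}
```

For twenty tasks on eight cores, cores 0 to 3 receive three tasks and cores 4 to 7 receive two, which gives a utilisation of 20/24; forty tasks on eight cores give a utilisation of one.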

## 4.3. Comparison with an already working toolchain: OpenMP



Figure 4.3.: Roofline plot for Halide




Figure 4.4.: Halide results relative to Halide base schedule performance

| Schedule                | 15x15         | 20x20         | 25x25          |
|-------------------------|---------------|---------------|----------------|
| Halide: No Schedule     | 40628 (0.166) | 93818 (0.171) | 180686 (0.173) |
| Halide: Parallel        | 6950 (0.971)  | 15585 (1.027) | 30413 (1.028)  |
| OpenMP: Single Thread   | 39820 (0.170) | 92650 (0.173) | 179030 (0.175) |
| OpenMP: Parallel No DMA | 12079 (0.559) | 24750 (0.646) | 38090 (0.820)  |

| Schedule                | 30x30          | 35x35          | 40x40          |
|-------------------------|----------------|----------------|----------------|
| Halide: No Schedule     | 309426 (0.175) | 488316 (0.176) | 725606 (0.176) |
| Halide: Parallel        | 42659 (1.266)  | 71279 (1.203)  | 92536 (1.383)  |
| OpenMP: Single Thread   | 307210 (0.176) | 485440 (0.177) | 721970 (0.177) |
| OpenMP: Parallel No DMA | 59887 (0.902)  | 89283 (0.960)  | 126523 (1.012) |

Table 4.2.: Benchmark results in number of cycles (operations per cycle) for Halide and OpenMP

As HERO already has a working workflow to distribute computation on the PULP cluster using OpenMP, we can see how Halide compares to this API on a parallel schedule. First, we can see that the single-threaded implementations of the matrix multiplication achieve nearly the same performance, differing by less than four thousand cycles for forty by forty matrices.

The most interesting results come from the parallel code distribution by the two APIs.

I made sure that the two matrices were stored in the L1 cache, to have the best possible access time. To measure the number of cycles needed to run the application, I used two functions available in the HERO SDK: hero\_reset\_clk\_counter() and hero\_get\_clk\_counter(). These functions reset and return the value of a cycle counter. As they only take a few assembly instructions, they are useful to get cycle-accurate measurements of the execution time. With this setup, we can easily compare the performance of Halide and OpenMP in a real-world scenario for at least two basic schedules: single-threaded and multi-threaded. I then experimented with different Halide schedules to see the maximal performance I could get with this application.

To give the results more meaning, I converted the benchmark data into operations per cycle, where one operation can be either an addition or a multiplication. For a matrix of size $n$, the number of operations needed to finish the multiplication is $2n^3$.
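The conversion from cycles to operations per cycle can be sketched as below. The function names are hypothetical; the formula is the $2n^3$ operation count given in the text.

```cpp
#include <cassert>
#include <cstdint>

// Arithmetic operations of an n x n matrix multiplication:
// n^3 multiplications plus n^3 additions.
std::uint64_t matmul_operations(std::uint64_t n) {
    return 2 * n * n * n;
}

// Operations per cycle for a run that took `cycles` cycles.
double operations_per_cycle(std::uint64_t n, std::uint64_t cycles) {
    assert(cycles > 0);
    return static_cast<double>(matmul_operations(n)) /
           static_cast<double>(cycles);
}
```

With the Halide parallel schedule on 20x20 matrices (15585 cycles in Table 4.2), this gives 16000 / 15585, which rounds to the 1.027 operations per cycle reported in the table.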

## 4.4. Comparison between OpenMP and Halide on the different platforms

The Halide base schedule performs similarly to single-threaded OpenMP, differing only by a few thousand cycles for the 35 by 35 matrices. The two implementations are very similar, and the Halide overhead is in this case almost negligible.

The second schedule I tested on both APIs was the parallel schedule, as parallelisation is the most efficient way to increase performance, especially when there is no data dependency in the pipeline. To get the best possible performance, the overhead of the API needs to be as small as possible.

Table 4.2 shows the results of both applications. For every size tested, Halide achieves at least 0.2 operations per cycle more than OpenMP.

I also tried multiple schedules for Halide to see the best performance HERO could achieve on this benchmark. I tried to combine the parallel schedule with loop unrolling or tiling, but I only got worse results, due to the additional jumps or the additional computation implied by loop unrolling. When the unrolling factor is not a divisor of the number of loop iterations, Halide shifts the final iteration so that every iteration computes the same number of elements. This shift forces the pipeline to recompute some output values.
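The shifting described above appears to correspond to what the Halide documentation calls the ShiftInwards tail strategy. The following sketch reproduces only the index arithmetic in plain C++, without using Halide; `shifted_chunk_starts` and `recomputed_elements` are hypothetical helpers, and the loop extent is assumed to be at least as large as the factor.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Start index of every chunk when a loop of `extent` iterations is split by
// `factor` and the last chunk is shifted inwards instead of running past the end.
std::vector<int> shifted_chunk_starts(int extent, int factor) {
    assert(factor > 0 && extent >= factor);
    std::vector<int> starts;
    int n_chunks = (extent + factor - 1) / factor;  // ceiling division
    for (int chunk = 0; chunk < n_chunks; ++chunk)
        starts.push_back(std::min(chunk * factor, extent - factor));
    return starts;
}

// Number of output elements computed more than once because of the shift.
int recomputed_elements(int extent, int factor) {
    assert(factor > 0 && extent >= factor);
    int n_chunks = (extent + factor - 1) / factor;
    return n_chunks * factor - extent;
}
```

For an extent of 10 and a factor of 4, the chunks start at 0, 4 and 6, so elements 6 and 7 are computed twice. When the factor divides the extent, no element is recomputed.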

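```
// Distribute the rows of the output across the cores of the cluster.
out.parallel(y);
// Compute ten consecutive x values per loop iteration.
out.vectorize(x, 10);
```

Listing 4.1: Schedule using Parallel and Vectorize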


The vectorize schedule combined with parallel proved to be the most efficient solution: on twenty by twenty matrices, the parallel schedule alone achieved 1.026 operations per cycle, against 2.169 operations per cycle using the schedule of Listing 4.1. I then exhaustively tried every possible vectorization factor to see which performed best.



Figure 4.5.: Impact of the vectorization factor on the performance of the application



## Conclusion

## 5.1. Conclusion

The goal of the project was to port Halide, an image processing language, to HERO and to run image processing kernels on the test hardware. Some functions were missing in the PULP runtime, so we first added them to the source code. Then we added the Halide source to the HERO project and built the library. The compile options of LLVM needed to be changed to successfully compile Halide. Using the Makefile of the OpenMP applications as a base, I successfully compiled Halide applications to run on the Register Transfer Level (RTL) simulator. I then tested two applications, a gradient and a matrix multiplication, to debug the schedules and check that they were working correctly. After that, I ran benchmarks on different matrix sizes to compare the performance of Halide and OpenMP and determine whether Halide could compete with OpenMP. I then tried to make the heterogeneous compilation work on the HERO platform with the ARM host, and slightly changed the LLVM target to include other object files during linking, but in the end I did not have enough time to make it work on the hardware platform.

Even though I could not finish the project, Halide showed promising results on the RTL simulator. It needs to be benchmarked more thoroughly to get a better idea of the performance it can achieve.

## 5.2. Future Work

A lot of work needs to be done to merge Halide into the main branch of HERO. The heterogeneous workflow for Halide needs to be fixed, as it is currently impossible to distribute code to the PULP cluster from the ARM cluster. The Continuous Integration (CI) is currently not working, which is probably due to the change of options when compiling LLVM; this may also cause compatibility issues with OpenMP or other components of the toolchain. This branch requires in-depth testing before being merged with the main project.



## Task Description

## A.1. Introduction

Heterogeneous systems combine a general-purpose host processor with domain-specific Programmable Many-Core Accelerators (PMCAs). Such systems are highly versatile, due to their host processor capabilities, while having high performance and energy efficiency through their PMCAs. HERO is an FPGA-based research platform developed at IIS that combines a PMCA composed of RISC-V cores, implemented as soft cores on an FPGA fabric, with a hard ARM Cortex-A multicore host processor.

Heterogeneous systems have a complex programming model, which leads to significant effort to develop tools that retain a high programmer productivity. Halide is a domain-specific programming language designed to write fast image processing algorithms. More specifically, it is a C++ dialect with a functional programming paradigm. Its aim is to separate the function applied to the image (pipeline) from the sequence in which the algorithm is executed (schedule). For example, the schedule encompasses how the algorithm is parallelized, whether the image is tiled, whether it is processed in column- or row-major order, whether solutions required by multiple threads are shared or recomputed, whether parts of the computation are offloaded to an accelerator, and so on. This allows a programmer to write a functional description of the image processing algorithm and then explore ways of scheduling the execution with only a couple of lines of code, and without modifying the algorithm. Furthermore, the same algorithm can be run efficiently on multiple different architectures by only changing the schedule. For Halide to generate efficient code, the specific architecture requires an efficient Halide runtime implementation and good compiler support, as Halide is tightly coupled with the compiler.

## A.2. Project description

The goal of this project is to bring up Halide on HERO, using Ariane, a 64-bit RV64GC core, as a host processor. Ariane would manage Halide's frontend, while the image processing tasks would execute on 32-bit cores in the cluster. The final goal of this thesis is to have image processing kernels programmed in Halide running on a HERO system implemented on an FPGA.

The project can be done as a one- or two-semester thesis. The project consists of four parts:

1. Familiarizing with the Halide language and the architecture of HERO (~2 person weeks).
2. Add a RISC-V target to Halide's frontend (~3 person weeks).
3. Test the Halide environment on an FPGA with a set of custom image processing kernels (~1 person week).
4. Documentation and report writing (~1 person week).

## A.3. Required skills

To work on this project, you will need:

- to have worked in the past with at least one RTL language (SystemVerilog or Verilog or VHDL). Having followed the VLSI 1 course is recommended.
- to have prior knowledge of the C++ programming language
- to have prior knowledge of hardware design and computer architecture
- to be motivated to work hard on a super cool open-source project

### Status: In progress

• Student: Pierre-Hugues Blelly

• Supervision: Matheus Cavalcante, Samuel Riedel, Andreas Kurth

#### Professor

Luca Benini


## A.3.1. Meetings & Presentations

The students and advisor(s) agree on weekly meetings to discuss all relevant decisions and decide on how to proceed. Of course, additional meetings can be organized to address urgent issues.

Around the middle of the project there is a design review, where senior members of the lab review your work (bring all the relevant information, such as prelim. specifications, block diagrams, synthesis reports, testing strategy, ...) to make sure everything is on track and decide whether further support is necessary. They also make the definite decision on whether the chip is actually manufactured (no reason to worry, if the project is on track) and whether more chip area, a different package, ... is provided. For more details refer to (1).

At the end of the project, you have to present/defend your work during a 15 min. presentation and 5 min. of discussion as part of the IIS Colloquium.

#### A.3.2. References

1. Andreas Kurth, Pirmin Vogel, Alessandro Capotondi, Andrea Marongiu, Luca Benini. HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA. CARRV 2017.
2. Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, Frédo Durand. Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines. SIGGRAPH 2012.



## Declaration of Originality

Include the declaration of authorship with the \includepdf command (sign it and scan it). For more information about plagiarism, please visit https://www.ethz.ch/students/en/studies/performance-assessments/plagiarism.html

- English version: https://www.ethz.ch/content/dam/ethz/main/education/rechtliches-abschluesse/leistungskontrollen/declaration-originality.pdf
- German version: https://www.ethz.ch/content/dam/ethz/main/education/rechtliches-abschluesse/leistungskontrollen/plagiat-eigenstaendigkeitserklaerung.pdf




Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich

#### **Declaration of originality**

The signed declaration of originality is a component of every semester paper, Bachelor's thesis, Master's thesis and any other degree paper undertaken during the course of studies, including the respective electronic versions.

Lecturers may also require a declaration of originality for other written papers compiled for their courses.

I hereby confirm that I am the sole author of the written work here enclosed and that I have compiled it in my own words. Parts excepted are corrections of form and content by the supervisor.

| Title of work (in block letters):                    |
|------------------------------------------------------|
| IMPLEMENTATION OF AN HETEORGENEOUS SYSTEM ON AN FPGA |

Authored by (in block letters):

For papers written by groups the names of all authors are required.

| Name(s): | First name(s): |
|----------|----------------|
| BLELLY   | PIERRE-HUGUES  |

With my signature I confirm that

- I have committed none of the forms of plagiarism described in the 'Citation etiquette' information sheet.
- I have documented all methods, data and processes truthfully.
- I have not manipulated any data.
- I have mentioned all persons who were significant facilitators of the work.

I am aware that the work may be screened electronically for plagiarism.

| Place, date         | Signature(s) |
|---------------------|--------------|
| Toulouse 25/05/2020 |              |

For papers written by groups the names of all authors are required. Their signatures collectively guarantee the entire content of the written paper.

# Glossary

Atomic Operations An operation during which a processor can simultaneously read a location and write it in the same bus operation. This prevents any other processor or I/O device from writing or reading memory until the operation is complete.

## Bibliography

- [1] P. J.Basford, S. J.Johnston, C. S.Perkins, T. Garnock-Jones, F. P. Tso, D. Pezaros, R. D.Mullins, E. Yoneki, J. Singer, and S. J.Coxa, "Performance analysis of single board computer clusters," 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X1833142X
- [2] Wikipedia, "Crazyflie 2.0 wikipedia," 2020, [Online; accessed 27-May-2020]. [Online]. Available: https://en.wikipedia.org/wiki/Crazyflie\_2.0
- [3] D. Palossi, A. Loquercio, F. Cont, E. Flamand, D. Scaramuzza, and L. Benini, "A 64-mW DNN-Based Visual Navigation Engine for Autonomous Nano-Drones," 2019. [Online]. Available: https://arxiv.org/abs/1805.01831
- [4] A. Venkat and D. M. Tullsen, "Harnessing ISA Diversity: Design of a Heterogeneous-ISA Chip Multiprocessor," *ICSA*, pp. 121–132, 2014.
- [5] A. Shan, "big.little processing arm," 2012, [Online; accessed 17-May-2020]. [Online]. Available: https://web.archive.org/web/20121022055646/http://www.arm.com/products/processors/technologies/bigLITTLEprocessing.php
- [6] A. Kurth, P. Vogel, A. Capotondi, and L. Benini, "HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA," 2017. [Online]. Available: https://arxiv.org/abs/1712.06497
- [7] Wikipedia, "OpenMP Wikipedia," 2020, [Online; accessed 24-May-2020]. [Online]. Available: https://en.wikipedia.org/wiki/OpenMP
- [8] K. Wolters, "Software Stack for the First Fully Open-Source Heterogeneous SoC," 2019, master Thesis.
- [9] OpenMp, "Home OpenMp," 2020, [Online; accessed 17-May-2020]. [Online]. Available: https://www.openmp.org/

### **Bibliography**

- [10] C. Barnes, A. Adams, S. Paris, J. M. Ragan-Kelley, F. Durand, and S. P. Amarasinghe, "Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," 2013. [Online]. Available: https://dspace.mit.edu/handle/1721.1/85943
- [11] EPI, "European processor initiative," 2019, [Online; accessed 28-May-2020]. [Online]. Available: https://www.european-processor-initiative.eu/?p=unsubscribe
- [12] Halide, "Halide," 2020, [Online; accessed 17-May-2020]. [Online]. Available: https://www.halide-lang.org/
- [13] —, "Debugging tips · halide/halide wiki," 2020, [Online; accessed 24-May-2020]. [Online]. Available: https://github.com/halide/Halide/wiki/Debugging-Tips
- [14] PULP, "PULP Platform," 2020, [Online; accessed 17-May-2020]. [Online]. Available: https://pulp-platform.org
- [15] Hero, "HERO: The Open Heterogeneous Research Platform," 2020, [Online; accessed 17-May-2020]. [Online]. Available: https://pulp-platform.org/hero.html
- [16] Wikipedia, "Heterogeneous computing Wikipedia," 2020, [Online; accessed 17-May-2020]. [Online]. Available: https://en.wikipedia.org/wiki/Heterogeneous\_computing
- [17] A. Shan, "Heterogeneous Processing: a Strategy for Augmenting Moore's Law | Linux Journal," 2006, [Online; accessed 17-May-2020]. [Online]. Available: https://www.linuxjournal.com/article/8368