# AutoPhase: Compiler Phase-Ordering for High Level Synthesis with Deep Reinforcement Learning



# AutoPhase: Compiler Phase-Ordering for High-Level Synthesis with Deep Reinforcement Learning

Qijing Huang\*,†,1, Ameer Haj-Ali\*,†,1, William Moses†,2, John Xiang1, Ion Stoica1, Krste Asanovic1, John Wawrzynek<sup>1</sup>

- \* Equal contribution
- † Primary contributor
- <sup>1</sup> University of California, Berkeley {qijing.huang, ameerh, johnxiang, istoica, krste, johnw}@berkeley.edu, wmoses@mit.edu
- <sup>2</sup> Massachusetts Institute of Technology

Abstract—The performance of the code generated by a compiler depends on the order in which the optimization passes are applied. In the context of high-level synthesis, the quality of the generated circuit relates directly to the code generated by the front-end compiler. Unfortunately, choosing a good order-often referred to as the phaseordering problem-is an NP-hard problem. As a result, existing solutions rely on a variety of sub-optimal heuristics.

In this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning. To this end, we implement a framework that takes any group of programs and finds a sequence of passes that optimize the performance of these programs. Without loss of generality, we instantiate this framework in the context of an LLVM compiler and target multiple High-Level Synthesis programs. We compare the performance of deep reinforcement learning to state-of-the-art algorithms that address the phase-ordering problem. Overall, our framework runs one to two orders of magnitude faster than these algorithms, and achieves a 16% improvement in circuit performance over the -O3 compiler flag.

#### I. Introduction

High-Level Synthesis (HLS) automates the process of creating digital hardware circuits from algorithms written in high-level languages. Modern HLS tools [1]-[3] use the same front-end as the traditional software compilers. They rely on traditional compiler techniques to optimize the input program's intermediate representation (IR) and produce circuits in the form of RTL code from the IRs. Thus, the quality of compiler front-end optimizations directly impacts the performance of HLS-generated circuit. Meanwhile, optimizing a program is a notoriously difficult task for the compiler. Programs must be in just the right form for an optimization pass to recognize it a task which a programmer might be able to easily do, but is often difficult for the compiler. Moreover, a programmer might have to perform different optimizations depending on the hardware platform. Despite a decade or so of research, an expert designer can still produce RTL that outperforms the results of HLS, though it incurs huge costs both in terms of time and human capital.

The process of HLS compilation consists of a sequence of analysis and optimization phases that are applied to the program. Each phase consumes the output of the previous phase, and generates a modified version of the program for the next phase. Unfortunately, these phases are not commutative which makes the order in which these phases are applied critical to the performance of the output.

```
_attribute___((const))
double mag(const double *A, int n) {
    double sum = 0;
    for(int i=0; i<n; i++) {</pre>
        sum += A[i] * A[i];
    return sqrt(sum);
void norm(double *restrict out,
           const double *restrict in, int n) {
    for(int i=0; i<n; i++) {</pre>
        out[i] = in[i] / mag(in, n);
```

Fig. 1: A simple program to normalize a vector.

Consider the program in Figure 1, which normalizes a vector. Without any optimizations the norm function will take  $\Theta(n^2)$  to normalize a vector. However, a smart compiler will implement the loop invariant code motion (LICM) [4, Sec. 13.2] optimization, which allows it to move the call to mag above the loop, resulting in the code on the left column in Figure 2. This brings the runtime down to  $\Theta(n)$  – a big speedup improvement.

```
const double *restrict in, int n) {
                                                              double precompute, sum = 0;
                                                              for(int i=0; i<n; i++) {</pre>
void norm(double *restrict out,
                                                                  sum += A[i] * A[i];
          const double *restrict in, int n) {
    double precompute = mag(in, n);
                                                              precompute = sqrt(sum);
    for(int i=0; i<n; i++) {</pre>
                                                              for(int i=0; i<n; i++) {</pre>
        out[i] = in[i] / precompute;
                                                                  out[i] = in[i] / precompute;
}
            Fig. 2: Progressively applying LICM (left) then inlining (right) to the code in Figure 1.
                                                         void norm(double *restrict out,
                                                                    const double *restrict in, int n) {
void norm(double *restrict out,
          const double *restrict in, int n) {
                                                              double sum;
    for(int i=0; i<n; i++) {</pre>
                                                              for (int i=0; i<n; i++) {</pre>
        double sum = 0;
                                                                  sum = 0:
        for(int j=0; j<n; j++) {</pre>
                                                                  for(int j=0; j<n; j++) {</pre>
            sum += A[j] * A[j];
                                                                      sum += A[j] * A[j];
        out[i] = in[i] / sqrt(sum);
                                                                  out[i] = in[i] / sqrt(sum);
    }
```

Fig. 3: Progressively applying inlining (left) then LICM (right) to the code in Figure 1.

Another optimization the compiler could perform is (function) inlining [4, Sec. 15.2]. With inlining, a call to a function is simply replaced with the body of the function, reducing the overhead of the function call. Applying inlining to the code will result in the code in the right column of Figure 2.

}

Consider applying these optimization passes in the opposite order: first inlining then LICM. After inlining, we get the code on the left of Figure 3. Once again we get a modest speedup, having eliminated n function calls, though our runtime is still  $\Theta(n^2)$ . If the compiler afterwards attempted to apply LICM, we would find the code on the right of Figure 3. LICM was able to successfully move the allocation of sum outside the loop. However, it was unable to move the instruction setting sum=0 outside the loop, as doing so would mean that all iterations excluding the first would have a garbage value for sum. Thus, the internal loop could not be moved out.

Consequently, having the wrong phase ordering for a program can mean the difference between one's program running in  $\Theta(n^2)$  versus  $\Theta(n)$ . It is thus crucial to determine the optimal phase ordering for the program for it to run efficiently. Unfortunately, not only is this a difficult task, but the optimal phase ordering may vary from program to program. The goal of this paper is to provide a mechanism for automatically determining good phase orderings for HLS programs to optimize for the circuit speed.

Recent advancements in deep reinforcement learning (RL) [5] offer opportunities to address the phase ordering challenge. In deep RL algorithms, software agents take

actions in an environment to optimize accumulative rewards. Based on these rewards and environment observations, statistical machine learning is applied to estimate the long-term benefit of the actions and optimize these actions accordingly. Significant challenges exist in understanding how to formulate the phase ordering optimization problem in an RL framework. In this paper, we investigate two techniques to characterize the optimization state. One is to directly use salient features from the program. The other one is to use the applied optimizations as features without any information from the programs. We implement a framework that takes a group of programs and quickly finds a phase ordering that produces better quality of results. Our main contributions are:

void norm(double \*restrict out,

- An analysis of two methods to represent program state in the RL setting: (1) program features and (2) a histogram of previously applied passes. We conclude why the later approach is more efficient in achieving better algorithm optimization time and circuit cycle count.
- A framework that integrates the compiler with the deep RL algorithms. This framework takes a group of programs as an input and robustly generates the optimal sequence of passes to apply.
- An evaluation and comparison of multiple RL algorithms against state-of-the-art approaches for compiler phase-ordering showing that RL runs one to two orders of magnitude faster and achieves 16% better performance over -O3.

#### II. BACKGROUND

#### A. Compiler Phase-ordering

Compilers execute optimization passes to transform programs into more efficient forms to run on various hardware targets. Groups of optimizations are often packaged into "optimization levels" -O0 (no optimization), -O1 (some optimization) O2 (more optimization), -O3 (most optimization) for ease. While these optimization levels offer a simple set of choices for developers, they are handpicked by the compiler-designers and often most benefit certain groups of benchmark programs. The compiler community has attempted to address the issue by selecting a particular set of compiler optimizations on a per-program or per-target basis for software [6]–[8].

Since the search space of phase-ordering is too large for exhaustive search, many heuristics have been proposed to explore the space by using machine learning. Huang et al. tried to address this challenge for HLS applications by using modified greedy algorithms [9] [10]. In [11] both independent and Markov models were applied to automatically target an optimized search space for iterative methods to improve the search results. In [12], genetic algorithms were used to tune heuristic priority functions for three compiler optimization passes. Milepost [13] used machine learning to determine the set of passes to apply to a given program, based on a static analysis of its features. It achieved 11% execution time improvement over -O3, for the ARC reconfigurable processor on the MiBench program suite1. In [14] the challenge was formulated as a Markov process and supervised learning was used to predict the next optimization, based on the current program state. Wang et al. [15], described a relationship between machine learning and compiler optimization, where the program features are used as an observation, without explicitly providing results. In this work, we tried using the program features as an observation but found that it faces several challenges explained in Section IV.

# B. Reinforcement Learning

Reinforcement learning (RL) is a machine learning approach in which an agent continually interacts with the environment [16]. In particular, the agent observes the state of the environment, and based on this observation takes an action. The goal of the RL agent is then to compute a policy—a mapping between the environment states and actions—that maximizes a long term reward.

RL can be viewed as a stochastic optimization solution for solving Markov Decision Processes (MDPs) [17], when the MDP is not known. An MDP is defined by a tuple with four elements: S,A,P(s,a),r(s,a) where S is the set of states of the environment, A describes the set of actions or transitions between states,  $s' \sim P(s,a)$  describes the probability distribution of next states given the current state and action and  $r(s,a):S\times A\to R$  is the reward of taking action a in state s. Given an MDP, the goal of the agent is to gain the largest possible aggregate reward. The objective of an RL algorithm associated with an MDP is to find a decision policy  $\pi^*:s\to A$  that achieves this goal for that MDP:

$$\pi^* = \arg\max_{\pi} E_{\tau \sim \rho_{\pi(\tau)}} \left[ \sum_{t} r(s_t, a_t) \right] = \arg\max_{\pi} \sum_{t=1}^{T} E_{(s_t, a_t) \sim \rho_{\pi(s_t, a_t)}} \left[ r(s_t, a_t) \right]. \quad (1)$$

Deep RL leverages a neural network to learn the policy (and sometimes the reward function). Recently, Deep RL has achieved impressive results, such as learning to play 49 Atari games with human-level capabilities [18], and defeating the Go world champion [19]. Over the past couple of years, a plethora of deep RL techniques have been proposed [20]–[23]. In this paper, we focus on two algorithms: Policy Gradient (PG) [22] and Deep Q-Network (DQN), commonly referred to as Q-Learning [23]. We chose these algorithms as they are simple, mature, and adequate to solve the problem.

PG computes the policy directly by differentiating the aggregate reward *i.e.*, the term in Equation 1:

$$\nabla_{\theta} J = \nabla_{\theta} E_{\tau \sim \rho_{\pi(\tau)}} \left[ \sum_{t} r(s_{t}, a_{t}) \right] =$$

$$E_{\tau \sim \rho_{\pi(\tau)}} \left[ \left( \sum_{t} \nabla_{\theta} log \pi_{\theta}(a_{t}|s_{t})) \left( \sum_{t} r(s_{t}, a_{t}) \right) \right] \approx$$

$$\frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t} \nabla_{\theta} log \pi_{\theta}(a_{i,t}|s_{i,t})) \left( \sum_{t} r(s_{i,t}, a_{i,t}) \right) \right]$$
(2)

and then updating the network parameters (weights) in the direction of the gradient:

$$\theta \leftarrow \theta + \alpha \nabla_{\theta} J,$$
 (3)

where  $\alpha$  is called the learning rate. PG is an on-policy method in that it uses only the current policy to compute the new policy.

In contrast, DQN is an off-policy method which takes partially random actions to explore the state space. The DQN's goal is to find which actions will maximize future rewards from a given state. To do so DQN computes a Q-function,  $Q(s_i, a_i)$  that predicts the future overall reward of taking action  $a_i$  from state  $s_i$ . To compute this Q-function, DQN uses a neural network paramterized by weights  $\phi$ ). More formally:

$$y_i = r(s_i, a_i) + \arg\max_{a'_i} Q_{\phi}(s'_i, a'_i)$$
 (4)

$$\phi \leftarrow \underset{\phi}{\operatorname{arg\,min}} \sum_{i} ||Q_{\phi}(s_i, a_i) - y_i||^2, \tag{5}$$

where  $y_i$  is the target result,  $a_i$  and  $s_i$  are the current action and state respectively,  $a_i'$  and  $s_i'$  are the next action and state respectively, and  $r(s_i, a_i)$  is the reward for taking action  $a_i$  at state  $s_i$ . On top of that, the policy is basically defined as follow:

$$\pi_{\theta}(a_t|s_t) = \begin{cases} 1 & \text{if } a_t = \arg\max_{a_t} Q_{\phi}(s_t', a_t') \\ 0 & \text{otherwise} \end{cases}$$
 (6)

#### III. SIMULATION FRAMEWORK

We leverage an existing open-source HLS framework called LegUp [3] that takes a C program as input and outputs a hardware RTL design. In [9], an approach is devised to quickly determine the number of hardware execution cycles without requiring time-consuming logic simulation. We develop our reinforcement learning simulator environment based on the existing harness provided by LegUp and validate our final results by going through the time-consuming logic simulation. The framework takes a program (or multiple programs) and intelligently explores the space of possible passes to figure out an optimal pass sequence to apply. The framework is illustrated in Figure 4.

# A. HLS Compiler

Our framework takes a set of programs as input and compiles them using the Clang front-end of the LLVM compiler. LLVM represents programs in a hardware-independent intermediate representation (IR). Optimization and analysis passes act as transformations on the IR, taking a program as input and emitting a new IR as output. The HLS tool LegUp is invoked after the compiler optimization as a back-end pass, which transforms LLVM IR into hardware modules.

# B. Clock-cycle Time Profiler

Once the hardware RTL is generated one could run a hardware simulation to gather the cycle count results of the synthesized circuit. This process is quite timeconsuming, hindering RL and all other optimization approaches. Therefore, we approximate cycle count using the profiler in LegUp [9], which runs  $20\times$  faster than hardware simulation. In LegUp, the frequency of the generated circuits is set as a compiler constraint that directs the HLS scheduling algorithm. In other words, HLS tool will always try to generate hardware that can run at a certain frequency. In our experiment setting, without loss of generality, we set the target frequency of all generated hardware to 50MHz.

# C. IR Feature Extractor

In addition to the LegUp backend tools, we developed analysis passes to extract 56 static features from the program, such as the number of basic blocks, branches, and instructions of various types. We use these features as partially observable states for the RL learning and hope the neural network can capture the correlation of certain combination of these features and certain optimization. Unfortunately, due to the page limit, we could not include the table of features.

#### D. Deep Reinforcement Learning Framework

We wrapped compilation utilities into a Python script and wrote an environment simulator with APIs similar to an OpenAI gym [24]. Our utilities take an input a program (as LLVM IR) and returns the program feature vectors and the total number of clock cycles. We consider two types of features as the state for the RL: the current program IR and the sequence of passes that have been applied. The overall flow works as follows:

- 1) The program is compiled into LLVM IR.
- LegUp compiles the LLVM IR into hardware RTL and produces an estimated clock-cycle time for the generated circuit.
- 3) The IR feature extractor is run to extract program features
- 4) The clock-cycle time and program features are fed into various machine learning algorithms (random, genetic, greedy, and RL) to derive a good LLVM optimization sequence. In RL specifically, the clock-cycle time is used for reward calculation, and the program features are used as partially observable states for the RL agents.
- 5) In iterative methods (*i.e.*, greedy algorithm, genetic algorithm, and RL), the algorithm then predicts the next best action to take. In RL, the action would be the next optimization to apply to the end of an existing sequence of passes. In a modified greedy algorithm, the next action could be the next best place to insert the optimization pass.
- New LLVM IR is generated after the new optimization sequence is applied.

 The machine learning algorithm iterates through steps 2)–6) until convergence.



Fig. 4: A block diagram of our framework. The programs are compiled using the Clang/LLVM, features and cycles are extracted from the IR and HLS toolchain respectively, and are afterwards fed to the RL network that tries to learn the optimal passes.

# IV. DEEP REINFORCEMENT LEARNING ARCHITECTURE

DQN and PG are used in the implemented architecture. DQN can take advantage of optimal substructures during training. It can also alleviate the impact of long compilation (environment step) time by greatly reducing the data needed. Wang *et al.* [15] proposed to convert a program into an observation by extracting all the features of the program, *i.e.* number of every operation. Similarly, we extracted program features from the programs.

Another approach is to use the actions (passes) themselves as observations. In other words, the input of the network (observation) was defined as a histogram of the passes given so far and the passes were given an index from zero to the number of passes. The output is the next pass to apply. This allows the network to keep track of passes applied so far and learn which sequences should be applied to maximize the reward.

We also considered various initial transformations on the programs before running a learning algorithm. We added seven prepasses that will invoke necessary program analysis passes for all programs. This approach reduces the difficulty for the network in finding the best initialization passes. The reward was defined as the negative number of cycles, and thus a higher reward means a lower number of cycles to run the program.



Fig. 5: The circuit speedup as function of algorithm training time for different optimization algorithms normalized to the performance without any optimization. The RL framework ran for 30 minutes on each program.

#### A. Program Features Versus Histogram of Actions

To evaluate the two approaches, we ran the framework on 12 HLS benchmarks. We trained each benchmark for 30 minutes. The methodology is described in Section V. Figure 5 shows the performance as function of time for the two approaches with DQN as the RL algorithm, and -O3, normalized to the performance without any optimization. Both approaches achieve similar performance. The average performance improvement is 13.3% over -O3 and 50% over the performance without any optimization. These improvements are mainly due to the fact that DQN efficiently and robustly learns which passes are useful and which are not.

Figure 6 shows the mean cycles as a function of time when running the framework on one of the programs (gsm) using the two approaches. Both approaches improve the program's mean cycles over time. We observe two things: (1) Rather than having a smooth line, the instability and chaotic behavior of DQN is seen in the jumps of mean cycles of both approaches. (2) The DQN approach that uses previous passes as the input observation is more stable than using the program features.

While both program feature and histogram approaches are efficient for operating on a single program, the histogram approach provided better performance in the long run. We observed that several passes are effective only if preceded by other passes, which may not change the program features significantly. For example, a pass may change only the order of instructions, but not their distribution. The network mistakenly learns to apply these effective passes directly and gain a zero benefit, resulting in unstable behavior. Furthermore, the space of features for various programs is sufficiently complex that it is difficult for a simple framework to learn, and it is not possible to derive a perfect optimization ordering on multiple programs simultaneously since each program



Fig. 6: The mean cycles as function of time when running the framework on the gsm task.

has different features.

In contrast, using previous actions as the observation guarantees that after applying each pass the observation changes, making it less ambiguous for the network to learn. In addition, it enables operating on multiple programs simultaneously since the observation is independent from the program, which provides faster training. To further optimize the time it takes to run the framework we could also leverage the fact that the observations are directly extracted from the actions. This makes it possible to roll out an entire trajectory of actions and observations from the policy without compiling the programs. Effectively, we apply a pass, generate the new histogram, feed the new histogram to the policy again, which outputs a new pass and so on. The only things missing in this technique are the intermediate rewards that could be calculated by compiling the program after applying each pass, which we are trying to avoid as it takes a long time.

In PG, the term in Equation 2 that depends on the reward value is the sum of rewards in the entire trajectory. This value could be obtained directly by compiling the program once at the end with all the applied passes. This makes using PG feasible although it generally requires more data samples and time to learn compared to DQN. In addition, in DQN, we could compile a few times, each one after applying a sequence of actions and average the reward for all these actions. Beyond significantly reducing the training time, this approach helps alleviate the sparsity of rewards in cases where multiple passes are necessary to get a single reward and make the seemingly ineffective passes more effective.

To operate on multiple programs simultaneously, the

reward function was defined as follows:

$$- reward_t = \sqrt[n]{C\_Prog_1(t) \cdot \dots \cdot C\_Prog_n(t)} - \sqrt[n]{C\_Prog_1(t-1) \cdot \dots \cdot C\_Prog_n(t-1)}, \quad (7)$$

where n is the number of programs and  $C\_Prog_i(t)$  is the cycle count of program i at time step t. This reward function was chosen to guarantee that all programs are equally optimized; any improvement in one will positively affect the reward and make the rewards more dense. To speed the training process we took a multi-threaded approach inspired by [20]. We implemented a multithreaded program that compiles the different programs simultaneously taking the same actions (passes) in each program.

## B. One-Hot Versus Histogram Representation

Instead of using a histogram, a concatenation of K one-hot vectors could be used, where K is the number of passes to apply. The index i of each vector $_j$  is 1 if the  $j^{th}$  pass applied is i, otherwise zero. This would allow RL to better understand the ordering of previously applied passes. In practice, this did not improve performance largely because the state space is much larger, it takes a long time to learn. Moreover, increasing the maximum number of passes will make it effectively impossible for the network to learn. This is because adding one more pass requires adding 45 inputs/features to the network (the total number of passes) making it more difficult for the network to converge.

#### V. RESULTS

We implemented the framework in Python and TensorFlow [25]. The network topology consisted of a  $512 \times 256$  FC layer for the DQN, and  $256 \times 256$  FC layer for the PG. We ran the framework on a four-core Intel i7-4765T CPU [26] with a Tesla K20c GPU [27] for handling the training and inference. We use the number of clock cycles reported by the HLS profiler as the performance metric. In [9], results showed a one-to-one correspondence between the clock cycle count and the actual hardware execution time. Therefore, better clock cycle count will lead to better hardware performance.

Success is defined as learning a sequence of passes that performs better than -O3 within a reasonable time. We use 12 benchmarks for evaluation taken from CH-stone [28] and LegUp examples. The benchmarks contain a variety of applications from different domains, such as arithmetic, media, and cryptography. The number of cycles required to run the benchmarks ranges from thousands to millions. We ran DQN and PG, and compared it against random search, greedy algorithms [9],

TABLE I: The optimization time of the search algorithms for different number of passes.

|           | Optimization Time (minutes) |     |    |        |        |         |
|-----------|-----------------------------|-----|----|--------|--------|---------|
| # passes  | Exhaustive                  | DQN | PG | Random | Greedy | Genetic |
| 3 passes  | 4725                        | 44  | 49 | 4725   | 19     | 150     |
| 12 passes | 3.58E+18                    | 31  | 21 | 360    | 222    | 411     |
| 24 passes | 2.46E+38                    | 20  | 34 | 360    | 837    | 1017    |

genetic algorithms [29], and -O3. All results are normalized to the case where no optimization was applied.

### A. Impact of the Sequence Length

Figure 7 shows the circuit speedup for various algorithms with sequence lengths of 3, 12, and 24. For each algorithm, we set the optimization sequence to a fixed length. The sequences for three passes are generated with each algorithm running to optimize for four programs. The sequences for 12 and 24 passes are generated with each algorithm optimizing six programs. We then applied the generated sequences to all 12 programs, normalized the results to the No-Opt performance, and took the average of the normalized performance for the 12 programs. Results show that three passes is not adequate to achieve peak performance for the 12 benchmarks.

For DQN, PG, genetic and greedy algorithms, little performance improvement is observed by increasing the length from 12 to 24. With 24 passes, PG, DQN and Genetic algorithms perform better than the rest.

#### B. Optimization Time Comparison

Table I lists the time it took to run each algorithm for differing number of passes. For three passes, it is possible to find the optimal sequence as the exhaustive-search is feasible. The exhaustive-search solution took 78.75 hours to run. We ran the random search for 78.75 hours, yet it did not find the optimal sequence. Also the greedy algorithm did not find this sequence.

The DQN, PG, and genetic algorithms quickly achieved this sequence. Additionally, DQN and PG achieved this sequence  $3.4\times$  and  $3.1\times$ , respectively, faster than the genetic algorithm. To ensure RL did not achieve the sequence by luck, we ran the RL framework ten times and verified that it derived the optimal ordering every time. Furthermore, we kept running the RL until it found the optimal sequence, although it found suboptimal ones much earlier. This is why it took more time to run the RL frameworks with three passes.

Results show that adding more passes resulted in a minor impact on optimization time for the RL algorithms but a major impact for the greedy and genetic algorithms. Note that for 12 and 24 passes, it would take more than one trillion years to run exhaustive-search. Therefore, the time given is an estimate based on the optimization



Fig. 7: Circuit speedup of various algorithms compared to No-Opt with sequence length of 3, 12, and 24.

time for three passes. Furthermore, we limited the time to run random search to six hours when applying 12 and 24 passes. Greedy and genetic algorithms run one to two orders of magnitude slower than the RL algorithms and achieve lower circuit performance for greedy algorithm and similar circuit performance for genetic algorithm. The algorithm optimization time benefits are mainly due to the data efficiency of the RL algorithms and the optimizations proposed in Section IV that make it possible to significantly reduce the number of compilations required when running the framework and using the previously applied passes as the input observations.

#### C. Circuit Performance Comparison

Figure 8 shows the circuit speedup of the different algorithms for finding the best 12 passes as a function of time, normalized to No-Opt performance. The highest improvement is seen in DQN and genetic algorithms that both achieve 16% better circuit speed over -O3 (six programs improved, three programs remained the same, and three programs worsened slightly). PG achieves 12% better circuit speed over -O3. DQN requires less data to learn and this is why it achieves better speedup over PG.

Figure 9 shows the geomean of circuit cycles of the programs as a function of time for genetic and greedy algorithms, and the average geomean of cycles for PG and DQN (averaged over batch size). For a fair comparison we cut the time after the algorithms stop improving. The greedy and genetic algorithms both have a large improvement at the beginning of the training phase because of its greedy nature. The greedy algorithm converges to a worse minimum result compared to the other algorithms, suggesting locally optimal choices may not be the best approach for tackling the phase-ordering problem. The genetic algorithm, however, has mutation,



Fig. 8: The circuit speedup for searching the best 12 passes using the different search algorithms for different tasks normalized to the case without any optimization.

introducing randomness which helps to improve the performance. We see that both RL algorithms improve the average cycles of a batch over time.

#### D. Validation

After the framework finishes running on the LegUp simulator, we validate the cycle time results by compiling the generated Verilog and simulating the design with ModelSim. We see a 0.48% difference between the actual and profiler-generated cycle time. We also compile all the optimized circuit designs with a standard FPGA toolflow and verify that frequency meets the 50 MHz constraint. We take the geomean across all the benchmarks for the area results. Compared to the -O3 area results, the DQN-optimized circuits have 0.6%, 3%, 43% increase in LUTs, registers, and BRAMs respectively. The DSPs remained the same. Note that the objective of this paper is to optimize for circuit speed. Optimizing for area or both circuit speed and area is also feasible in a similar manner.

## E. Possible Alternative Machine Learning Algorithms

Inspired by the observed advantage of randomness in random search, genetic algorithm, and DQN that enables more exploration, as well as the redundancy of observations, we believe other techniques such as Bandits [30], Exploration [31], Proximal Policy Gradients [32], and Evolution Strategies [33] might also be good candidates to explore. These works leverage a great deal of randomness and exploration while maximizing rewards. Furthermore, it might be beneficial to



Fig. 9: Geomean of circuit cycles for various algorithms as a function of algorithm training time.

combine the applied optimizations and program features as observations to the RL so it could benefit from both representations of the state.

#### VI. CONCLUSIONS

In this work, we demonstrate a novel deep reinforcement learning based approach to improve performance of HLS designs by optimizing the compiler phase ordering. RL techniques achieve 16% better results than traditional -O3 results. Such improvements require only a few minutes of training—one to two orders of magnitude faster than state-of-the-art approaches based on genetic algorithms. While in this paper, we applied the techniques to HLS, the same RL techniques and framework can also be applied to software compilation and optimization.

#### REFERENCES

- [1] Xilinx, "Vivado Design Suite User Guide -High-Level Synthesis," June 2015. [Online]. Available: http://www.xilinx.com/support/documentation/sw\_manuals/ xilinx2015\_2/ug902-vivado-high-level-synthesis.pdf
- [2] Intel, "Intel FPGA SDK for OpenCL"." [Online]. Available: https://www.intel.com/content/www/us/en/programmable/ products/design-software/embedded-software-developers/opencl/ developer-zone.html
- [3] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski, S. D. Brown, and J. H. Anderson, "Legup: An open-source high-level synthesis tool for fpga-based processor/accelerator systems," ACM Transactions on Embedded Computing Systems (TECS), vol. 13, no. 2, p. 24, 2013.
- [4] S. S. Muchnick, "Advanced compiler design and implementation." Morgan Kaufmann, 1997.
- [5] R. S. Sutton and A. G. Barto, Introduction to reinforcement learning. MIT press Cambridge, 1998, vol. 135.
- [6] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I. August, "Compiler optimization-space exploration," in *Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization.* IEEE Computer Society, 2003, pp. 204–215.
- [7] L. Almagor, K. D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman, "Finding effective compilation sequences," *ACM SIGPLAN Notices*, vol. 39, no. 7, pp. 231–239, 2004.
- [8] Z. Pan and R. Eigenmann, "Fast and effective orchestration of compiler optimizations for automatic performance tuning," in Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, 2006, pp. 319–332.
- [9] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, S. Brown, and J. Anderson, "The effect of compiler optimizations on high-level synthesis for fpgas," in Field-Programmable Custom Computing Machines (FCCM), 2013 IEEE 21st Annual International Symposium on. IEEE, 2013, pp. 89–96.
- [10] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, N. Calagar, S. Brown, and J. Anderson, "The effect of compiler optimizations on high-level synthesis-generated hardware," *ACM Transactions* on *Reconfigurable Technology and Systems (TRETS)*, vol. 8, no. 3, p. 14, 2015.
- [11] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. O'Boyle, J. Thomson, M. Toussaint, and C. K. Williams, "Using machine learning to focus iterative optimization," in *Proceedings of the International Symposium on Code Generation and Optimization*. IEEE Computer Society, 2006, pp. 295–305.
- [12] M. Stephenson, S. Amarasinghe, M. Martin, and U.-M. O'Reilly, "Meta optimization: Improving compiler heuristics with machine learning," in *Proceedings of the ACM SIGPLAN 2003 Conference* on *Programming Language Design and Implementation*, ser. PLDI '03, 2003.

- [13] G. Fursin, Y. Kashnikov, A. W. Memon, Z. Chamski, O. Temam, M. Namolaru, E. Yom-Tov, B. Mendelson, A. Zaks, E. Courtois et al., "Milepost gcc: Machine learning enabled self-tuning compiler," *International journal of parallel programming*, vol. 39, no. 3, pp. 296–327, 2011.
- [14] S. Kulkarni and J. Cavazos, "Mitigating the compiler optimization phase-ordering problem using machine learning," in Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, ser. OOPSLA '12, 2012.
- [15] Z. Wang and M. OBoyle, "Machine learning in compiler optimization," vol. 106, no. 11, Nov 2018, pp. 1879–1901.
- [16] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," vol. 4, 1996, pp. 237–285.
- [17] R. Bellman, "A markovian decision process," in *Journal of Mathematics and Mechanics*, 1957, pp. 679–684.
- [18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski *et al.*, "Human-level control through deep reinforcement learning," vol. 518, no. 7540, 2015, p. 529.
- [19] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot *et al.*, "Mastering the game of go with deep neural networks and tree search," vol. 529, no. 7587, 2016, p. 484
- [20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in *International conference on machine learning*, June 2016, pp. 1928–1937.
- [21] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in *Proceedings of the fourteenth international conference on artificial intelligence and statistics*, 2011, pp. 627–635.
- [22] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in *Advances in neural information processing systems*, 2000, pp. 1057–1063.
- [23] C. J. Watkins and P. Dayan, "Q-learning," vol. 8, no. 3-4. Springer, 1992, pp. 279–292.
- [24] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "Openai gym," 2016.
- [25] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard *et al.*, "Tensorflow: a system for large-scale machine learning." in *OSDI*, vol. 16, 2016, pp. 265–283.
- [26] Intel Inc., "Intel Core i7-4765T Processor Specification," 2017. [Online]. Available: https://ark.intel.com/products/75121/ Intel-Core-i7-4765T-Processor-8M-Cache-up-to-3-00-GHz-
- [27] Nvidia Inc., "NVIDIA Tesla K20c GPU Specification," 2012. [Online]. Available: https://www.nvidia.com/content/tesla/pdf/ Tesla-KSeries-Overview-LR.pdf
- [28] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, "CHstone: A benchmark program suite for practical c-based highlevel synthesis," in *Circuits and Systems*, 2008. ISCAS 2008. IEEE International Symposium on, 2008, pp. 1192–1195.
- [29] F.-A. Fortin, F.-M. De Rainville, M.-A. Gardner, M. Parizeau, and C. Gagné, "DEAP: Evolutionary algorithms made easy," *Journal of Machine Learning Research*, vol. 13, pp. 2171–2175, jul 2012.
- [30] O. Chapelle and L. Li, "An empirical evaluation of thompson sampling," in *Advances in neural information processing systems*, 2011, pp. 2249–2257.
- [31] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, "Unifying count-based exploration and intrinsic motivation," in *Advances in Neural Information Processing Sys*tems, 2016, pp. 1471–1479.

- [32] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," *arXiv preprint arXiv:1707.06347*, 2017.
- [33] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," *arXiv preprint arXiv:1703.03864*, 2017.