# Device-Level Composition in ReWire

A Dissertation presented to the Faculty of the Graduate School at the University of Missouri

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

by

Ian Graves

Dr. William L. Harrison, Dissertation Supervisor

DEC 2015

The undersigned, appointed by the Dean of the Graduate School, have examined the dissertation entitled:

## Device-Level Composition in ReWire

presented by Ian Graves,

a candidate for the degree of Doctor of Philosophy and hereby certify that, in their opinion, it is worthy of acceptance.

Dr. William L. Harrison

Dr. Michela Becchi

Dr. Sean P. Goggins

Dr. Rohit Chadha

#### ACKNOWLEDGMENTS

Works such as these can't be accomplished without the tutelage, help, advice, and company of great people. First and foremost, I would like to thank my advisor Dr. William L. Harrison. It has been a privilege to work and learn in a lab that has grown and evolved so much from its inception.

Thanks go to Dr. Adam Procter, Benjamin Schulz, Chris Hathhorn, Dr. Soumya Deepta Sanyal, and my other colleagues in the Center for High Assurance Computing both past and present. Adam Procter in particular has been an incredible source of technical knowledge and creativity on the ReWire project and an all-around great guy. Additionally, I would like to thank Dr. Sean P. Goggins for his invaluable advice on career and research matters, and Dr. Michela Becchi for her advice and consultation on technical issues related to the work on ReWire.

Graduate work can be trying on a personal level. My case is no exception. I have found that the best way to lean in is to do so with and in good company. I'd like to thank Adam Procter and Ben Schulz for being great guys to talk to (or share animated GIFs with). I'd like to thank my friends Brian Linquist, Ryan VanMaele, and Mark McLaughlin for their support and friendship through this time as well, despite all of us being so far flung.

Lastly, I'd like to thank my family. My parents Leland and Beverly raised me and gave me the privilege of going to school to study the things that fascinate me and bring me joy. I am forever grateful to them for everything they have done for me. I would also like to thank my sister Emma and my brother Jonah for putting up with me and supporting me as well.

My final thanks go to my beautiful wife Amanda. Amanda has been my rock through this process and has supported me the entire way while being incredibly graceful and patient. I'm a pretty lucky guy.

## TABLE OF CONTENTS

- ACKNOWLEDGMENTS
- LIST OF LISTINGS
- LIST OF TABLES
- LIST OF FIGURES
- ABSTRACT
- CHAPTER
- 1 Introduction
  - 1.1 Problems and Questions
  - 1.2 Hypotheses
  - 1.3 Overview of Literature
- 2 Background and Related Work
  - 2.1 Haskell and Monadic Programming
  - 2.2 Reactive Resumptions, Free Monads, and Iteratees
  - 2.3 Specifying Hardware in Haskell
    - 2.3.1 Lava
    - 2.3.2 Clash
    - 2.3.3 Bluespec
    - 2.3.4 ForSyDe
    - 2.3.5 Delite
  - 2.4 Visual or Flow-based Program Specification
    - 2.4.1 Flow-based Programming
    - 2.4.2 Arrows and Profunctors in Haskell
    - 2.4.3 Visualizing Functional Programs
  - 2.5 Functional Module Systems
  - 2.6 Formal Methods for Hardware Design
  - 2.7 Compiling Regular Expression Matchers to Hardware
- 3 Connect Logic
  - 3.1 Motivation
  - 3.2 Connect Logic Primitives
    - 3.2.1 Primitive Functions
    - 3.2.2 Non-Primitive Functions
  - 3.3 Implementation
    - 3.3.1 Structuring Connect Logic Devices
    - 3.3.2 Compiling Primitives
    - 3.3.3 Compiling Non-primitives
- 4 Modularity Principles and a Module System
  - 4.1 Modularity with ReWire
    - 4.1.1 Functions
    - 4.1.2 Reactive Resumptions
    - 4.1.3 Modularity and Composability Follow from Connect Logic
  - 4.2 A Module System for ReWire
    - 4.2.1 The ReWire Module System
    - 4.2.2 From Separate Compilation and Future Work on Module Systems
- 5 Visual Programming in ReWire
  - 5.1 Motivation
  - 5.2 Specification and Features
  - 5.3 Tool Implementation
    - 5.3.1 Front End
    - 5.3.2 Back End
  - 5.4 Using RVT
  - 5.5 Code Generation
- 6 Concurrent Devices in ReWire and Connect Logic
  - 6.1 Barrier Synchronization
  - 6.2 Triple Modular Redundancy
  - 6.3 Mutual Exclusion
  - 6.4 Semaphore Constructions
  - 6.5 Segmentation
    - 6.5.1 Data Types
    - 6.5.2 Security Policy Functions
    - 6.5.3 The Request Master
    - 6.5.4 The Response Master
    - 6.5.5 Composing the Bus Master
    - 6.5.6 Using the Segmenter with Processors
- 7 Case Study: Regular Expression Compilation
  - 7.1 Abstract
  - 7.2 Introduction
  - 7.3 A Methodology for Synthesis from Functional EDSLs
  - 7.4 Case Study 1: Matching State of the Art
  - 7.5 Case Study 2: Surpassing State of the Art
  - 7.6 Conclusions and Future Work
  - 7.7 Acknowledgments
- 8 Case Study: Implementing the Salsa20 Cipher
  - 8.1 Abstract
  - 8.2 Introduction
  - 8.3 Connect Logic in ReWire
    - 8.3.1 Pure Functional Languages & Equational Verification
    - 8.3.2 Extending ReWire with Connect Logic
  - 8.4 Provably Correct Development of Salsa20 Devices in ReWire and Connect Logic
    - 8.4.1 Salsa20 Reference Specification
    - 8.4.2 Salsa20 Iterative Implementation
    - 8.4.3 Pipelining Salsa20
  - 8.5 Evaluating Provably Correct Salsa20 Devices
    - 8.5.1 Performance
    - 8.5.2 Testing the Iterative Salsa20 Device Automatically
    - 8.5.3 Verification of Pipelining
  - 8.6 Summary and Conclusions
- 9 Case Study: Implementing a Pipelined DLX Processor
  - 9.1 Introduction
  - 9.2 The DLX Processor
  - 9.3 Constructing the Processor
    - 9.3.1 Instructions and Architecture
    - 9.3.2 Fetch
    - 9.3.3 Decode
    - 9.3.4 Execute
    - 9.3.5 Memory
    - 9.3.6 Writeback
  - 9.4 Composing the DLX Processor
    - 9.4.1 Parallelizing and Connecting Devices
    - 9.4.2 Considering and Mitigating Pipelining Hazards
    - 9.4.3 Delay Slot Implementation
    - 9.4.4 Stalling Functionality
  - 9.5 Testing
    - 9.5.1 A Haskell Test Bench
  - 9.6 Synthesizing the Design
    - 9.6.1 Proper Compilable ReWire
    - 9.6.2 Back end Primitives
    - 9.6.3 Synthesis Results
  - 9.7 Conclusion
- 10 Summary and Future Works
  - 10.1 Summary of Results
    - 10.1.1 Connect Logic Primitives
    - 10.1.2 Modularity and Modules
    - 10.1.3 Novel Designs with ReWire and Connect Logic
  - 10.2 Future Works
    - 10.2.1 Structural Metaprogramming With Connect Logic
    - 10.2.2 Network-on-Chip Paradigms
    - 10.2.3 Type Level Naturals and Vectors
    - 10.2.4 Program Transformations for Power Consumption and Circuit Depth
- BIBLIOGRAPHY
- APPENDIX
- A Connect Logic Implementation in Haskell
  - A.1 Parallel Combinator
  - A.2 Refold Combinator
  - A.3 RefoldT Combinator
  - A.4 Iter Combinator
- B DLX Component Implementation
  - B.1 Types for DLX
  - B.2 DLX Fetch Stage
  - B.3 DLX Decode Stage
  - B.4 DLX Execute Phase
  - B.5 DLX Memory Access Phase
  - B.6 DLX Writeback Phase
  - B.7 Combining DLX Phases to a Processor
- VITA

# LIST OF LISTINGS

| Listing | Caption |
|---------|---------|
| 2.1  | Reactive Resumption Monads vs. Free Monads |
| 3.1  | Haskell implementation of the Connect Logic parallel `(<&>)` combinator |
| 3.2  | Haskell implementation of the Connect Logic `refold` combinator |
| 3.3  | Haskell implementation of the Connect Logic `refoldT` combinator |
| 3.4  | Haskell implementation of the Connect Logic `iter` combinator |
| 3.5  | The Haskell definition of the pipeline function in ReWire |
| 3.6  | A data type for decomposed CL expressions |
| 3.7  | Types used in the construction of the renamed CL tree |
| 3.8  | An example parallel device |
| 3.9  | VHDL code for the example parallel device |
| 3.10 | An example refolded device |
| 3.11 | VHDL code for the example refold device |
| 4.1  | ReWire single compilation transformation example |
| 5.1  | An Example ReWire device with I/O of product types |
| 5.2  | The data types for the first iteration of RVT |
| 6.1  | A transformation to make any device in ReT a stalling device using the `refoldT` primitive. Here we use the isomorphic types Stall and Busy in place of a Maybe type |
| 6.2  | Creating a barrier in ReWire for devices typed in ReT |
| 6.3  | Simple Triple Modular Redundancy with Connect Logic |
| 6.4  | Functional TMR [1] with redundant voting logic in Connect Logic |
| 6.5  | A left-argument-biased mutex specification for two ReWire devices |
| 6.6  | Utilizing the semaphore as a device in a closed system |
| 6.7  | Types for a 2-semaphore device implementation |
| 6.8  | Pure functions for managing semaphore state and incoming requests |
| 6.9  | The first semaphore device implementation. A stand-alone semaphore device |
| 6.10 | The second semaphore device implementation. A semaphore integrated in primarily pure logic refolded with its constituent devices |
| 6.11 | Types and helper functions for a memory segmenter |
| 6.12 | Policy functions for a memory bus master |
| 6.13 | Definitions for the request master function. The transition function is given by `reqMaster_` and the initialized device is given by `reqMaster`. |
| 6.14 | Definitions for the response master. The transition function is given by `rspMaster_` and the initialized device is given by `rspMaster`. |
| 6.15 | The bus master is composed from the request and response master. We use routing logic in the functions `outputSelect` and `inputSelect` in a refold over the parallelized `reqMaster` and `rspMaster` devices |
| 6.16 | Using the bus master to interface two processors to a memory module unit in ReWire |
| 9.1  | The type of the DLX processor device |
| 9.2  | Types for the fetch component given in Haskell |
| 9.3  | Types for the decode component given in Haskell |
| 9.4  | Types for the execution (ALU) phase given in Haskell |
| 9.5  | Types for the memory access phase given in Haskell |
| 9.6  | Types for the writeback phase given in Haskell |
| 9.7  | The type of the DLX processor device |
| 9.8  | Constructing the intermediate ReWire device |
| 9.9  | Connective functions for each pipelining phase of the DLX processor |
| 9.10 | Functions for forwarding register values |
| 9.11 | DLX assembly code illustrating the appearance of a delay slot instruction on line 3 |
| 9.12 | Haskell code for flushing the pipeline in the execution phase of the ReWire DLX processor implementation |
| 9.13 | Stepping functions with output memoization for stalling |
| 9.14 | The top level DLX device type for testing |
| A.1  | Haskell implementation of the Connect Logic parallel `(<&>)` combinator |
| A.2  | Haskell implementation of the Connect Logic `refold` combinator |
| A.3  | Haskell implementation of the Connect Logic `refoldT` combinator |
| A.4  | Haskell implementation of the Connect Logic `iter` combinator |
| B.1  | Types defined for DLX processor implementation |
| B.2  | Haskell implementation of DLX Fetch Stage |
| B.3  | Haskell implementation of DLX Decode Stage |
| B.4  | The DLX execute phase implemented in Haskell |
| B.5  | Haskell implementation of the DLX Memory Access processor phase |
| B.6  | The DLX Writeback phase implemented in Haskell |
| B.7  | Combining the subcomponents of the DLX processor with support for stalling |

# LIST OF TABLES

| Table | Caption |
|-------|---------|
| 4.1   | Composing synchronous and combinational logic in ReWire. Output from the left is fed to the right |
| 8.1   | Resource usage, Fmax, and throughput (T) of the Salsa20 algorithm as implemented and compiled in ReWire |
| 9.1   | DLX R-Type instructions encoding and semantics |
| 9.2   | DLX I-Type instructions encoding and semantics |
| 9.3   | DLX J-Type instructions encoding and semantics |
| 9.4   | FPGA synthesis results for our DLX implementation |

# LIST OF FIGURES

| Figure | Caption |
|--------|---------|
| 3.1    | Parallel Composition |
| 3.2    | Refold Composition |
| 3.3    | RefoldT Composition |
| 3.4    | Pipeline Composition |
| 5.1    | Diagramming devices using RVT in a web browser |
| 5.2    | Generated code from an RVT specification |
| 6.1    | Barriers |
| 6.2    | Functional Triple Modular Redundancy |
| 6.3    | Mutex Construction |
| 6.4    | Semaphore Construction |
| 6.5    | A segmented memory controller |
| 6.6    | The request master component |
| 6.7    | The response master component |
| 7.1    | FP Methodology for HLS |
| 7.2    | Virtualized, traditional EDSLs |
| 7.3    | RexHacc tcp25 benchmark |
| 7.4    | An NFA and its Sidhu and Prasanna-style implementation |
| 7.5    | RexHacc performance comparisons |
| 7.6    | NFA before and after state splitting |
| 7.7    | Comparisons of RexHacc with state splitting enabled |
| 8.1    | Bird-Wadler Program Development |
| 8.2    | Device Constructors |
| 8.3    | Salsa20 Hashing Algorithm |
| 8.4    | Reference Specification of Salsa20 Hash Function |
| 8.5    | Iterative Salsa20 Device in ReWire |
| 8.6    | Ten Stage Pipeline |
| 8.7    | Twenty Stage Pipeline |
| 8.8    | ReWire circuit diagrams |
| 9.1    | DLX timing diagram |

#### ABSTRACT

ReWire provides engineers with a tool to specify, verify and implement hardware devices for FPGAs from a high-level Haskell-like language. Previous work has shown ReWire to be a productive source language for developing whole systems in the form of single, monolithic monadic specifications.

To achieve scale and reusability, some form of modularity principle must be identified and realized within ReWire. The questions we wish to answer are: what are the basic units of a ReWire specification, and how may such units be identified, abstracted over, and reused to achieve a realistic workflow for device construction in ReWire? This research identifies a modularity principle for ReWire in the form of *Connect Logic*, a suite of language abstractions for breaking apart ReWire specifications into their constituent components, and considers its implementation and application.

Adding flexibility to ReWire to support device-level composition would significantly enhance design with ReWire, as it would promote reuse of specifications that are complete, tested, and verified and thus reduce redundant work on the part of the designer. This work integrates a suite of functions into ReWire which give engineers the ability to incorporate existing specifications into new designs and to decompose projects into constituent components using Connect Logic while remaining fully synthesizable to hardware. These functions allow the developer to manipulate and compose existing device specifications without the need to otherwise modify them. Connect Logic provides an intuitive way to consider synchronous logic versus combinational logic in ReWire designs. We demonstrate applications of Connect Logic to improve the performance of complex systems including cryptographic ciphers. We use Connect Logic to develop a fully pipelined microprocessor and to implement commonly used high-level concurrency primitives in hardware, and we demonstrate Connect Logic as a substrate for visual programming in ReWire.

## Chapter 1

## Introduction

This dissertation is an investigation into the composition and modularity of ReWire for hardware design. Reactive Resumption monads have been shown to be verifiable models for designing software [2] and a basis for the design of verifiable hardware systems [3]. This work introduces connectivity primitives for writing productive and reusable hardware components in the ReWire programming language. We extend the ReWire language with four primitive functions for the composition, manipulation, and introduction of device-level, or Reactive Resumption, devices. We demonstrate that these primitives, called Connect Logic primitives, can be used as tools for optimizing deep combinational circuits (reducing gate delays) with semantics-preserving transformations. We demonstrate foundational software engineering techniques utilizing Connect Logic, including modularization, encapsulation, and information hiding. In total, this dissertation demonstrates that Connect Logic enables productive design that is modular, composable, and produces efficient circuits in hardware implementations.

## 1.1 Problems and Questions

This work seeks to address questions pertaining to composition, modularity, and reuse in a functional hardware description language (HDL). Hughes noted in his seminal work [4] that functional programming matters because it promotes modular and composable programs where other programming paradigms do not. How can we promote modularity and reuse in a functional HDL? What is our notion of modularity in that setting, and what opportunities exist for composing modular devices? Are good performance and resource characteristics possible with modular and composable design? Can a language like ReWire produce implementations that yield high throughput? Functional programming has been shown to give programmers productive techniques for software. Which of these techniques can we incorporate into hardware design?

## 1.2 Hypotheses

This dissertation proceeds with the following hypotheses regarding the problems and questions posed in the previous section. *First Hypothesis*: we can enable communication between synchronous logic using Connect Logic, a series of combinators that extends the ReWire programming language. This gives us complete inter-device communication in ReWire. We can share information between specifications written as combinational (pure) functions as well as synchronous (ReT) specifications.

*Second Hypothesis*: the composition of synchronous and combinational logic that Connect Logic enables gives us a higher degree of control over performance characteristics in a functional HDL. This control will enable us to maintain good performance characteristics and resource utilization in hardware implementations. Combinators for device composition promote device reuse as well as other modular-design-enhancing features commonly seen in software engineering. Appropriating these techniques to hardware design will provide us with additional principled approaches to developing hardware. *Third Hypothesis*: we can compile these combinators in a way that generates device implementations with performance characteristics in line with the state of the art.

## 1.3 Overview of Literature

Chapter 2 is a discussion of background and related work. We introduce Haskell and monadic programming as well as Reactive Resumptions and related programming techniques in Haskell. We discuss the background work of Haskell-based systems for generating hardware devices as well as related structural and visual idioms for programming (in Haskell and otherwise).

Chapter 3 introduces Connect Logic. We discuss the design and implementation of Connect Logic primitives as well as some non-primitive, but useful transformations consisting of Connect Logic primitives. We also provide Haskell definitions of all functions for reference. Approaches to compiling Connect Logic primitives are discussed in this chapter. The first hypothesis is substantiated partially in this chapter by way of an introduction to Connect Logic.

Chapter 4 is a discussion of modularity principles that apply to ReWire. Modules as they appear in Haskell are not as useful for designing hardware. We consider modularity in the context of hardware design with ReWire. We follow this with a discussion of the implementation of ReWire's module system and of module system support for extra features such as polymorphic functions. The second hypothesis is covered in this discussion of modules and module systems in ReWire.

Chapter 5, Chapter 6, and Chapter 9 are case studies using ReWire for applications to demonstrate its efficacy in design and implementation efficiency. Chapter 6 describes the design and implementation of high level concurrency primitives in hardware in addition to memory segmentation functionality and redundancy transformations. These case studies serve as a demonstration of the efficacy of ReWire with Connect Logic for integrating functionality by composition of devices and device transformations with complex compositions. Synchronization functionality for concurrent applications is added by transforming a device to communicate with synchronization primitives. We perform a similar transformation with memory segmentation functionality. For redundancy transformation, we introduce transformations that make arbitrary devices Triple Modular Redundant and Functionally Triple Modular Redundant and demonstrate pipelining for these devices to minimize overhead to the developer.

In Chapter 9, we demonstrate Connect Logic's application to the construction of a fully pipelined microprocessor architecture, the DLX architecture. Inter-device communication (the first hypothesis) and composability (second hypothesis) are demonstrated in this chapter. Prior work [3] has utilized ReWire for the implementation of microprocessors. We continue further by using Connect Logic to construct a processor that is fully pipelined with support for hazard mitigation. We demonstrate the modularity and encapsulation that Connect Logic brings and how it enables us to design subcomponents of a complex system in isolation. We construct a processor by combining its pipeline phases together with Connect Logic primitives, delivering a synthesizable result that operates with reasonable performance characteristics and resource requirements.

Chapter 5 is a case study in using ReWire as a substrate for a visual programming tool for hardware. We demonstrate a tool and visual editing environment using pre-defined devices as blocks for the user to wire together. We provide a transformation from the visual representation to a synthesizable Connect Logic expression. This chapter demonstrates the additional kinds of tooling that Connect Logic in ReWire can enable and the productivity it provides when coupled with the Haskell programming language for tool design.

Chapter 7 and Chapter 8 are studies in the application of ReWire and Connect Logic. Chapter 7 explores high performance regular expression compilation using ReWire as a target. In this chapter we explore high level reasoning about regular expressions, optimizing high level models, converting the models to ReWire, and the performance and resource implications of each of these steps. Chapter 8 covers an implementation of the Salsa20 stream cipher using ReWire and Connect Logic. This work demonstrates a performant implementation in ReWire using Connect Logic, as stated in the third hypothesis. Salsa20 is a complex cipher that is not feasible to implement directly as a pure function in hardware. Steps need to be taken to reduce the combinational depth of the algorithm, and these steps are made feasible in ReWire with Connect Logic. We design two different implementations from the same subcomponent: an iterative version of the cipher and a pipelined version. This illustrates the space/performance tradeoff that is achievable and intuitive using Connect Logic with ReWire.

The dissertation concludes in Chapter 10 with a summary of the work and a discussion of future work.

## Chapter 2

# Background and Related Work

This chapter establishes the background and related work for the work described in this dissertation. The background is framed as it relates to the core contributions of this dissertation. This includes a discussion of Haskell and monadic programming in Haskell, and a discussion of the reactive resumption and its monadic representation in Haskell along with similar models of programming. Also discussed are techniques for dataflow-oriented programming, both visual and textual, as they relate to this work. Lastly, we discuss module systems to support separate compilation in the Haskell programming language.

## 2.1 Haskell and Monadic Programming

Haskell is a lazy, purely functional, strongly typed, and inferred programming language [5]. The most popular implementation of the language is the Glasgow Haskell Compiler (GHC). The Haskell language itself and GHC are very closely linked with one another. New additions and extensions to Haskell syntax and the Haskell type system are routinely facilitated through GHC's infrastructure. A number of language extensions have been integrated into Haskell proper over time. This is evident when one compares Haskell 98 to Haskell 2010 [6].

Haskell has seen the integration of many modern programming language innovations over the span of its existence. One of the most important constructions in Haskell is the monad. The monad, first proposed by Moggi as a useful programming feature [7], provides Haskell programmers a way to regain the sequential style of computation seen in imperative programming, while doing it in a way that is type safe and explicit with respect to effects. Monads are incorporated into Haskell from category theory. In Haskell, the Monad is treated as a kind of type class whose methods must adhere to a set of laws equivalent to the monad laws in category theory.

Monads can be used to model aspects of programs that are considered primitive in many programming languages. Indeed, Haskell itself comes without a system for handling exceptions. Programmers are expected to provide a system for handling exceptional cases. Modeling pure, exceptional computations can be accomplished by using a structure as simple as a sum type, or the Either data type in Haskell's standard prelude. Additional solutions can be constructed from Either. These include Maybe (isomorphic to Either () a) and the Error or Exception monad (sequenced actions in Either a b where the appearance of Left a raises an exception of type a). Popular monads commonly seen "in the wild" include Maybe, Error, State, Reader, and Writer, which model simple failure, exceptions, global state, local state, and logging, respectively. These monadic components can be useful in isolation but are generally used in concert, much as one would combine exceptions with global and local state in a language such as Java. To this end, we use monad transformers. Monad transformers are combinations of monads that are themselves monads. They are constructed by way of a simple lift morphism [8]. Many commonly used Haskell applications (even GHC) have a monad transformer at their core. The core abstraction of ReWire, the reactive resumption, is also a monad transformer [2,9].
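The Either-based treatment of exceptions described above can be illustrated with a short sketch. The error type `ArithError` and the functions `safeDiv`, `safeSqrt`, and `rootOfQuotient` are hypothetical names introduced only for this example; only `Either` and its standard Monad instance come from Haskell's prelude.

```haskell
-- Hypothetical error type introduced for this sketch.
data ArithError = DivideByZero | NegativeRoot
  deriving (Eq, Show)

-- Division that reports failure as a Left value instead of crashing.
safeDiv :: Double -> Double -> Either ArithError Double
safeDiv _ 0 = Left DivideByZero
safeDiv x y = Right (x / y)

-- Square root that rejects negative arguments.
safeSqrt :: Double -> Either ArithError Double
safeSqrt x
  | x < 0     = Left NegativeRoot
  | otherwise = Right (sqrt x)

-- Sequencing in the Either monad: the first Left aborts the remaining steps,
-- which is the "raise an exception" behavior described in the text.
rootOfQuotient :: Double -> Double -> Either ArithError Double
rootOfQuotient x y = do
  q <- safeDiv x y
  safeSqrt q
```

Here the do-block reads like imperative code, yet the possibility of failure is visible in the type of every function involved.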

Haskell uses a monad to trap IO actions. IO in Haskell is problematic because Haskell is an otherwise pure, referentially transparent language. The monad abstraction is used to compartmentalize IO actions and require the user to explicitly type IO and separate it from pure computations [10]. This is in spite of the fact that computations in IO can be made to break the monad laws [11]. Pure actions can be lifted into IO like any monad, but IO actions cannot safely be made "pure" in Haskell. Synthesizable logic in ReWire must occur inside of a reactive resumption monadic computation in the same way that all Haskell programs that do actual work on a machine (e.g., reading or writing to a file or port) occur in an IO typed computation. In the case of software, the IO interface dictates the boundary between the pure program realm and the rest of the world. In ReWire, the reactive resumption model provides us with an interface to set up how work is done within discrete units of time.

## 2.2 Reactive Resumptions, Free Monads, and Iteratees

The Reactive Resumption is part of a family of patterns that is commonly seen across the Haskell community. Members of this family of patterns have taken on a number of names over time [12]. Scholz referred to this as a concurrency monad [13], Claessen refers to it as "a poor man's concurrency monad" [14], and Harrison refers to it as the Reactive Resumption [2,9]. The above patterns are concrete instances of a more general type of monad called the free monad. Free monads in Haskell are a more recent innovation and are based on categorical concepts that bear the same name [15,16]. All of the aforementioned patterns (less the general-case free monad) can be used to readily model cooperative multitasking.

```
-- React with Pause wrapping a tuple of output and resumption function.
-- The resumption yields the rest of the computation, so the type is recursive.
data React i o a = Done a | Pause (o, i -> React i o a)

-- Free monad. Note f is of kind * -> * and is a Functor
data Free f a = Pure a | Free (f (Free f a))

-- A functor to structure our free monad
newtype Funktor i o a = Funktor (o, i -> a)

instance Functor (Funktor i o) where
  fmap f (Funktor (o, r)) = Funktor (o, fmap f r)
```

Listing 2.1: Reactive Resumption Monads vs. Free Monads
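To make the cooperative multitasking reading concrete, the following is a minimal sketch of a recursive `React` type with a Monad instance and a small driver. The instances and the names `signal`, `sumUntil`, and `runReact` are assumptions made for this illustration and are not taken from ReWire itself.

```haskell
-- Recursive reactive resumption: either finished, or paused with an output
-- and a resumption awaiting the next input.
data React i o a = Done a | Pause (o, i -> React i o a)

instance Functor (React i o) where
  fmap f (Done a)       = Done (f a)
  fmap f (Pause (o, k)) = Pause (o, fmap f . k)

instance Applicative (React i o) where
  pure = Done
  Done f       <*> r = fmap f r
  Pause (o, k) <*> r = Pause (o, \i -> k i <*> r)

instance Monad (React i o) where
  Done a       >>= f = f a
  Pause (o, k) >>= f = Pause (o, \i -> k i >>= f)

-- Emit an output and suspend until the next input arrives (hypothetical helper).
signal :: o -> React i o i
signal o = Pause (o, Done)

-- A running sum that outputs its total each step and finishes at a limit.
sumUntil :: Int -> Int -> React Int Int Int
sumUntil limit total
  | total >= limit = return total
  | otherwise      = do
      i <- signal total
      sumUntil limit (total + i)

-- Drive a computation with a finite list of inputs, collecting outputs and
-- the final result if the computation finishes.
runReact :: React i o a -> [i] -> ([o], Maybe a)
runReact (Done a)       _        = ([], Just a)
runReact (Pause (o, _)) []       = ([o], Nothing)
runReact (Pause (o, k)) (i : is) =
  let (os, r) = runReact (k i) is in (o : os, r)
```

Each `signal` is a point at which the computation yields control to its environment, which is what lets a scheduler interleave several such computations.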

A common use case of this reactive family of patterns is another design pattern in Haskell called the iteratee [17]. The iteratee pattern and iteratee IO are identified and named in the work of Kiselyov [18]. Iteratees are composable structures (closely resembling a reactive resumption) that give programmers a precise way to control evaluation of their programs. This is useful in practice where IO and laziness are concerned. A lack of control over evaluation and IO in Haskell can lead to space leaks [19] (especially thunk leaks and stream leaks) because of Haskell's laziness. Iteratees can alleviate this problem by giving the user a straightforward way to control strictness and avoid leaks. Many different Haskell libraries for data stream processing, including Conduit [20], Pipes [21], and Enumerator [22], are based upon the iteratee design formalized by Kiselyov.

ReWire specifications fall into the family of reactive programming because of their implicit reaction to some input stimulus or an argument accompanying a clock tick. ReWire and other Haskell-based HDLs are not the first languages to employ a reactive style to model hardware. SystemC uses the C++ language syntax as a way to model reactive systems and leverages the object model of C++ to facilitate reactive system specification. Like ReWire, it can also be synthesized to VHDL [23].

ReWire inherits many ideas from the functional reactive programming (FRP) paradigm [24]. Conal Elliott pioneered the FRP paradigm under the name of functional reactive animation (FRAN) in his initial work [25]. Later work by Hudak has explored FRP and its application to program and device specification as well as FRP's relation to Arrows [26]. An important difference between FRP and ReWire is FRP's focus on signals and their timing. Devices in ReWire (typed in React) are synchronously clocked, whereas FRP signals are considered to be continuous.

## 2.3 Specifying Hardware in Haskell

#### 2.3.1 Lava

Lava is the oldest approach for circuit specification in Haskell [27]. The modern iteration of Lava is Kansas Lava [28]. Lava leverages the expressiveness of Haskell to specify circuits in succinct and elegant ways, but this isn't without its challenges.

The question of synthesizing a recursive function in a sane and efficient way has resulted in different approaches to the same question. Chalmers Lava relies on the use of monadic constructions in Haskell to accomplish this task while later iterations of the work in Kansas Lava rely on observable sharing, or a form of cyclic graph analysis on Haskell code to identify recursive structures in circuit specification. As a Haskell-hosted domain specific language, Lava can be simulated by using an interpreter written in Haskell. Circuits specified in Haskell using Lava can be simulated and checked for their accuracy using unit testing or other lightweight formal methods.

#### 2.3.2 Clash

Clash (CAES Language for Synchronous Hardware) is a newer solution for specifying hardware in the Haskell programming language [29]. Clash is billed as a Haskell-to-VHDL compiler by way of term rewriting [30]. Like ReWire, Clash places restrictions on the kinds of Haskell programs one can synthesize. Clash's restrictions center on the kinds of algebraic data types a user can employ when specifying a device using Haskell.

### 2.3.3 Bluespec

Bluespec is a well-established tool for specifying hardware in a Haskell-like language. Arvind [31] describes it as "a relatively simple DSL (GAAs [Guarded Atomic Actions] and modules) with a fully functioning Haskell-like meta programming layer on top." Bluespec enjoys a successful reputation in industry. The tool appears to be closed source and I personally have not employed it for comparison with ReWire. From the descriptions that are available, Bluespec appears to be an environment in which developers specify hardware in a synthesizable DSL (of GAAs) and extend the functionality and expressiveness of the tool by using metaprogramming in Haskell. Metaprogramming extensions are planned for ReWire, but it does not currently employ any first-class metaprogramming features.

#### 2.3.4 ForSyDe

ForSyDe (Formal System Design) is a methodology for hardware design that starts from high level formal models and maps them to hardware through a series of refinement stages [32]. ForSyDe is both a framework and a shallowly embedded domain specific language in Haskell. Like Lava, ForSyDe designs can be simulated in Haskell. ForSyDe operates on a higher level of abstraction than Lava in that synthesized systems are specified semantically instead of structurally (à la circuit-style devices in Lava). Additionally, both ReWire and ForSyDe specify synchronous devices. ForSyDe emphasizes the use of synchronous devices as targets because it simplifies reasoning about these devices at the higher level [33]. Users of ForSyDe are restricted to specifying only the kinds of devices that can be mapped through a ForSyDe refinement process.

#### 2.3.5 Delite

The Delite DSL compiler framework [34] seeks to address the "three P's" with respect to implementing software on parallel, heterogeneous systems. Delite addresses portability (i.e., retargetability of DSL compilers to a broad range of parallel hardware) through language virtualization.

## 2.4 Visual or Flow-based Program Specification

There are numerous approaches to programming from an entirely visual standpoint as well as numerous more domain specific approaches or visual representations of specific programming languages [35, 36]. Additionally, some approaches take a reversed approach and use their own syntax to create visual programming techniques in their (non-visual) native languages (Arrows and Profunctors are an example).

#### 2.4.1 Flow-based Programming

Flow-based programming is a visual programming paradigm for specifying applications as asynchronous processes connected visually by directed data flows [37]. The formalization of flow-based programming has seen the creation of numerous visual tools in this vein with new tools continuing to appear on a regular basis.

#### 2.4.2 Arrows and Profunctors in Haskell

There has been a significant amount of work in the Haskell community on methods to generalize composable, functional computations. Arrows are a programming model that generalizes functions in Haskell and offers an alternative way to structure computations compared to more traditional mainstream methods like function composition and monadic computation (though certain types of Arrows are equivalent to monads) [38]. Arrows are named for their correspondence to morphisms (arrows) in category theory. The basic Arrow combinators bear some similarity to the Connect Logic primitives defined in this work. Indeed, Connect Logic regains similar composability to that of Arrows, specifically for reactive resumptions.

Similar to Arrows in Haskell is the community's work on structuring computations using profunctors [39]. Profunctors are built on existing Haskell functor types and serve a purpose similar to that of Arrows. Profunctors are structures in Haskell that are a combination of covariant (i.e., the Functor typeclass) and contravariant functors. A Haskell function is a concrete example of a profunctor: its output type can be extended by providing a function from the original output to a new output (covariance), and its input type can be extended by providing a function from a new input type to the original input type (contravariance). The refold function in Connect Logic can be used as the basis for treating React in ReWire as a profunctor, as it can extend both the input and output of a reactive resumption.
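The function instance described above can be sketched as follows. The class below is a local stand-in that mirrors the `dimap` method of the `Profunctor` class in the `profunctors` package so that the example needs no external dependency; `wordCount` and `wordCountOfLines` are hypothetical names used only for illustration.

```haskell
-- Local stand-in for the Profunctor class (assumption: mirrors `dimap`
-- from the `profunctors` package).
class Profunctor p where
  -- First argument adapts the input (contravariant),
  -- second argument adapts the output (covariant).
  dimap :: (a' -> a) -> (b -> b') -> p a b -> p a' b'

-- Plain functions: pre-compose on the input side, post-compose on the output side.
instance Profunctor (->) where
  dimap pre post f = post . f . pre

-- Original function from String to Int.
wordCount :: String -> Int
wordCount = length . words

-- The same function with a new input type ([String]) and a new output type (String).
wordCountOfLines :: [String] -> String
wordCountOfLines = dimap unwords show wordCount
```

The shape is the same one refold exploits: one pure function on the input side and one on the output side, wrapped around an unchanged core.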

### 2.4.3 Visualizing Functional Programs

Techniques for visualizing programs and visual program manipulation are not new. Functional programming languages make particularly good targets for visualization. The basis of a functional program is an expression of function applications, which is intuitively visualized. Additionally, many functional languages provide a more explicit means to control for side effects, which are inherently hard to visualize, especially when they may not be represented in the source language.

The lambda calculus itself has been a target for visualization. The work of Citrin proposes a series of circle and line arrangements to represent applications in an expression [35]. The work of Massalogin [36] proposes a similar set of circle arrangements. A sophisticated visual representation for the Haskell language was proposed by Reekie [40].

Additional work has taken a more universal approach to defining what a visual language is and what it means. The work of Erwig has seen specifications for semantics of visual languages [41] as well as hybrid visual and textual languages [42]. Other (perhaps more esoteric) work has seen attempts to provide graphical descriptions of existing programming languages including functional languages. Hemann's work in visualizing languages uses unique approaches such as mapping programs onto a Hilbert curve for visualization [43].

## 2.5 Functional Module Systems

The work detailed in this dissertation centers on a module system for ReWire that supports object code in the form of VHDL. ReWire is a proper subset of Haskell and thus a module system in ReWire syntactically follows the structure and semantics of modules described by the Haskell Report. Work has been done to establish the syntax and semantics of module declarations in the Haskell programming language. The Haskell Report [6] establishes the use and structure of modules for engineers to use. Additional work by Diatchki et al. [44] establishes a formal semantics for the Haskell module system, providing an in-Haskell semantics for reasoning about Haskell modules.

Using Haskell for practical applications inevitably leads one to need an interface with a foreign programming environment. This is also the case where modules are concerned. Can we compile Haskell modules to an object format in which they interface with each other? One implementation of a system in this vein is the work by Finnel et al. to interface Haskell with the Microsoft Component Object Model (COM) system [45]. This work, while targeting software instead of hardware, describes a compilation method to extend and compile Haskell so that Haskell programs can be componentized in a mainstream object framework. It addresses the various impedance mismatches between the source language and the target environment (in this case purely functional and lazy vs. object-oriented, strict, and imperative).

The work of Fortounis and Papaspyrou focuses on supporting separate compilation for a subset of Haskell that includes parametric polymorphism in functions [46, 47]. This work emphasizes support for defunctionalization using a kind of closure conversion. The ReWire compiler includes similar program transformations in its compilation pipeline. This work is relevant in that it discusses separate compilation that allows for defunctionalization across different modules that may have overlapping, but disconnected, defunctionalized members. For example, functions with the same arity in their higher-order arguments will have different closure types if compiled separately. This work unifies these identical, yet differently named, definitions. Separate compilation in ReWire supports parametric polymorphism in compiled (VHDL) modules in a way similar to this work, which targets the C programming language instead.
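As a reminder of what defunctionalization does, the sketch below replaces a higher-order argument with a first-order data type and a single `apply` function. It illustrates the general technique only; the names `IntFn`, `apply`, `mapInts`, and `mapIntsD` are hypothetical and do not reflect the ReWire compiler's internals or the cited work's implementation.

```haskell
-- Higher-order original: the argument is an arbitrary function (a closure).
mapInts :: (Int -> Int) -> [Int] -> [Int]
mapInts f = map f

-- Defunctionalized closure type: one constructor per function that may be
-- passed, each carrying that function's free variables.
data IntFn = AddN Int | Scale Int | Negate
  deriving (Eq, Show)

-- The single first-order interpreter for closure values.
apply :: IntFn -> Int -> Int
apply (AddN n)  x = x + n
apply (Scale k) x = x * k
apply Negate    x = negate x

-- First-order version of mapInts: no function values remain.
mapIntsD :: IntFn -> [Int] -> [Int]
mapIntsD _  []       = []
mapIntsD fn (x : xs) = apply fn x : mapIntsD fn xs
```

If two modules were compiled separately, each would generate its own type playing the role of `IntFn`; reconciling such types is the unification problem described above.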

## 2.6 Formal Methods for Hardware Design

There is a long history of formal methods being applied to hardware designs [48]. The general process involves encoding a hardware design in the logic of a theorem prover by hand<sup>1</sup> and then proving theorems about the encoding. There is an obvious danger that the encoding process—which one might call *semantic archaeology*—will introduce errors as well as a problem of soundness (i.e., how do you know a theorem about the encoding applies to the hardware device itself?).

"Semantic archaeology"—the process of developing a formal specification for an existing computing artifact—is the principal reason that formal methods can be so time-consuming and expensive. Sarkar et al. [49] describe the semantic archaeology process in the context of modeling the x86 multiprocessor instruction set architecture: "The key difficulty was to go from the informal-prose vendor documentation, with its often-tantalising ambiguity, to a fully rigorous definition (mechanised in HOL) that one can be reasonably confident is an accurate reflection of the vendor architectures (Intel 64 and IA-32, and AMD64)."

Cryptol [50] is a domain-specific language for specifying, verifying and implementing cryptographic algorithms. Given a cryptographic algorithm, one can specify it in Cryptol, run a number of automatic and semi-automatic proof tools over the specification, and ultimately generate C code implementing the algorithm itself. The current open source version of Cryptol (v.2) does not generate hardware implementations, although a previous proprietary version (v.1) did. ReWire, by contrast, is a subset of Haskell compilable to VHDL and is not restricted to cryptographic algorithms. Salsa20 has been specified in Cryptol v.2, but no effort has been made to backport this specification to Cryptol v.1 and synthesize it.

The usual standards for evaluating hardware architectures and design flows are performance-based metrics (e.g., time and space performance, power usage, etc.). Within the context of mission critical systems, formal analysis and verification are required evaluation modes as well. The Common Criteria for Information Technology Security Evaluation (a.k.a. Common Criteria or CC) is an international standard (ISO/IEC 15408) for computer security certification and the US Federal government mandates following the CC requirements for mission critical systems. The CC sets seven evaluation assurance levels (EAL). The most stringent such level is EAL7, which requires "extensive formal analysis" for applications in "extremely high risk situations and/or where the high value of the assets justifies the higher costs" ensuing from formal verification [51]. For reconfigurable computing to be applied in the space of mission critical systems, cost effective formal methods techniques must be developed. The current research is a step in this direction.

<sup>1</sup>E.g., Isabelle/HOL (http://isabelle.in.tum.de), ACL2 (http://www.cs.utexas.edu/users/moore/acl2), and PVS (http://pvs.csl.sri.com) are the most commonly used provers for hardware verification.

Previous work demonstrated the construction and verification of a secure many-core system in ReWire [52]. The present work, in contrast, demonstrates the expression of a common hardware design pattern (stall-free pipelining) in ReWire and its verification. The emphasis in the former was on the design and implementation of the ReWire language, while the current work focuses on ReWire as a vehicle for hardware verification.

## 2.7 Compiling Regular Expression Matchers to Hardware

The conversion of sets of regular expressions into NFAs is a well-known procedure [53]. Sidhu and Prasanna [54] have proposed an efficient FPGA implementation of NFAs.

Their solution is based on the one-hot encoding scheme; the use of an NFA representation avoids the  $O(2^n)$  space complexity that is characteristic of DFA (deterministic finite automata) representations, typically adopted in memory-based regular expression matching implementations [55–58]. Subsequent efforts on FPGA [59–62] have refined Sidhu and Prasanna's implementation and achieved gigabit/sec processing throughputs on real-world pattern sets.
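The one-hot idea can be sketched in software as follows: each NFA state is a single flag (a flip-flop in hardware), and every flag is updated in the same step from the previous flags and the current input character. This is an illustrative model only, not the circuit construction of Sidhu and Prasanna; the example pattern `a*ab` and the names `Active`, `start`, `step`, and `accepts` are assumptions made for this sketch.

```haskell
-- One flag per NFA state for the pattern a*ab:
--   s0: start state, loops on 'a'
--   s1: reached from s0 on 'a'
--   s2: accepting state, reached from s1 on 'b'
data Active = Active { s0 :: Bool, s1 :: Bool, s2 :: Bool }
  deriving (Eq, Show)

-- Initially only the start state is active.
start :: Active
start = Active True False False

-- One step: every flag is recomputed from the old flags and the input
-- character. Several flags may be set at once, so no subset construction
-- (and no exponential DFA state space) is needed.
step :: Active -> Char -> Active
step (Active a b _) c = Active
  { s0 = a && c == 'a'
  , s1 = a && c == 'a'
  , s2 = b && c == 'b'
  }

-- A string matches if the accepting flag is set after the last character.
accepts :: String -> Bool
accepts = s2 . foldl step start
```

After reading `aa`, both `s0` and `s1` are active, which is the nondeterminism being carried directly in the state vector.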

Chapter 7 demonstrates that the ReWire compiler works at scale as the generated ReWire programs are on the order of 100K LOC. Great care was taken in the design of ReWire so that it possesses a rigorous denotational semantics to support formal verification while maintaining synthesizability for all of its programs [3]. ReWire is also a virtualized DSL in that it has a separate compiler backend for producing FPGA-based implementations while reusing large parts of its host language's infrastructure—including Haskell's type system, front end, etc. In George, et al., [63], the Delite framework is adapted to the generation of hardware from DSLs, specifically the hardware acceleration of kernels in a heterogeneous setting.

# Chapter 3

# Connect Logic

The reactive resumption type is demonstrably useful for specifying whole synchronous devices. In practice, however, devices are not usually monolithic specifications. Engineers and developers reuse prior specifications that are applicable to the project at hand. This is the essence of reusability in practice and is a critical component to all modern software design methods. Connect Logic for ReWire introduces new primitives between synchronous devices to further reusability and introduce aspects for information sharing and timing between different devices. This chapter describes the nature of Connect Logic, its main primitive functions, and new functionality in ReWire that can be derived from Connect Logic. A discussion of the implementation of Connect Logic is also provided.

## 3.1 Motivation

When considering a language or an environment for writing software, one of the first things a programmer considers is how the tool allows for the decomposition of a problem and how easily parts of another previously developed solution can be reapplied to future problems. How easily can we import our old subprograms and subroutines for use in a new program? The same is true for hardware design. Common subcomponents can be factored out of just about any hardware specification. Visual inspection of any hardware device reveals hundreds, if not thousands of embedded subcomponents.

ReWire in its original form supports the composition of non-synchronous logic via pure function composition. Connect Logic seeks to extend the compositionality of ReWire by enabling interfacing between combinational and synchronous logic and between synchronous devices. It does so in the following ways. First, Connect Logic provides the means to represent parallelized synchronous devices. Second, Connect Logic provides a way to compose synchronous devices with combinational logic. Third and last, through the two prior extensions, Connect Logic allows for clocked communication between synchronous devices specified in ReT. In short, Connect Logic brings communication between components specified in ReWire, and in so doing it also brings reusability.

## 3.2 Connect Logic Primitives

Connect Logic comprises four primitives: refold, parallel, iter, and refoldT. Additionally, some important non-primitive functions follow from these, including pipeline.

#### 3.2.1 Primitive Functions

Primitive functions in Connect Logic are called primitive because of the nature of the types they operate on. ReWire does not provide support for general recursion, higher order functions, or operations that can decompose types in ReT (i.e., decompose device specifications). While general recursion is disallowed, ReWire allows for tail recursion in functions where the codomain is in ReT. These tail recursive calls are allowed because they correspond to state transitions in a finite state machine and can be compiled as such. In order to maintain full synthesizability of all proper ReWire programs, it follows that any extensions to ReWire must preserve this correspondence in addition to corresponding to some synthesizable structure in hardware. Connect Logic is a set of specific extensions that adheres to both of these principles.
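The correspondence between tail calls and state transitions can be sketched in plain Haskell. The `ReT` newtype below is an assumed representation consistent with how `ReT` is pattern matched in Listing 3.2, and `pause`, `waiting`, `seenOne`, and `outputs` are hypothetical names for this example, not ReWire library functions.

```haskell
import Data.Functor.Identity (Identity (..))

-- Assumed representation of the reactive resumption transformer.
newtype ReT i o m a = ReT { deReT :: m (Either a (o, i -> ReT i o m a)) }

-- Emit an output for this clock cycle and continue with the given resumption.
pause :: o -> (i -> ReT i o Identity a) -> ReT i o Identity a
pause o k = ReT (Identity (Right (o, k)))

-- Two-state machine: outputs True on the cycle after two consecutive True
-- inputs. Each function is a state; each tail call is a state transition.
waiting, seenOne :: Bool -> ReT Bool Bool Identity ()
waiting out = pause out (\b -> if b then seenOne False else waiting False)
seenOne out = pause out (\b -> if b then waiting True else waiting False)

-- Simulation helper: feed inputs one per cycle and collect the outputs.
outputs :: ReT i o Identity a -> [i] -> [o]
outputs (ReT m) ins = case runIdentity m of
  Left _       -> []
  Right (o, k) -> o : case ins of
    []       -> []
    (i : is) -> outputs (k i) is
```

Because every recursive call is in tail position behind a `pause`, the set of reachable call sites is finite and can be read directly as the states of the machine.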

Intuitively, where one pictures synthesizability of structures in ReT by way of their transformability to finite state machines, one can picture the synthesizability of Connect Logic extensions by picturing them as "the wiring" between whole devices on a board. Connect Logic operations work by encapsulating existing specifications in a new, opaque device that performs wiring "under the hood". In the case of parallel, the new device takes the inputs of both interior parallel devices, splits the input and routes it to each corresponding device, then combines the output again for consumption or inspection from the outside. The refold function behaves similarly, but instead takes two pure functions to manipulate inputs and outputs of the old device in ReT to "route" old input and output types to new ones.

#### The Parallel Operator

The parallel function is a basic parallel operation for devices in the ReT type. Pure functions in ReWire have the benefit of being parallel by their nature. When it comes to whole devices specified in ReT, however, we need to be more explicit with an additional caveat. Devices in ReT are synchronous logic (or "clocked") unlike pure combinational logic. When merging two devices together using the parallel operator, the devices are now synchronized in lock step with one another because they are now treated as one device on the same clock. This is formalized in the definition of the parallel operator in Listing 3.1.

Listing 3.1: Haskell implementation of the Connect Logic parallel (<&>) combinator

The parallel operator (depicted as the ampersand in Listing 3.1) takes two arguments of type ReT with different inputs and outputs over the Identity monad and the same termination type a. The parallel operator restricts the monad stack to Identity to force devices placed in parallel to be fully encapsulated. Devices can still have internal state, but it cannot be represented in the monad transformer stack here, lest we imply the existence of unclocked information channels between devices placed in parallel. We make these channels explicit using refolding, described in a later section. The result is a new reactive resumption with input and output types that are tuples of those of the original two devices. The new device is synchronized such that one "tick" corresponds to a complete iteration of all of its internal devices. In other words, the new device is only as fast as its slowest sub-component. A visual representation of this intuition is given in Figure 3.1.
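
The behavior described above can be sketched in plain Haskell. This is a minimal sketch, not the body of Listing 3.1: the shape of ReT is inferred from the pattern matches in Listing 3.2, the rule that the combined device halts as soon as either side halts is an assumption, and `sumDev`, `negDev`, and `runDev` are hypothetical helpers introduced only for illustration.

```haskell
import Data.Functor.Identity (Identity (..))

-- Assumed shape of ReT, inferred from the pattern matches in Listing 3.2.
newtype ReT i o m a = ReT (m (Either a (o, i -> ReT i o m a)))

type I = Identity

-- Sketch of the parallel combinator: both devices step on the same tick.
-- Assumption: the combined device halts as soon as either side halts.
(<&>) :: ReT i1 o1 I a -> ReT i2 o2 I a -> ReT (i1, i2) (o1, o2) I a
ReT (Identity l) <&> ReT (Identity r) = ReT (Identity step)
  where
    step = case (l, r) of
      (Left a, _) -> Left a
      (_, Left a) -> Left a
      (Right (o1, k1), Right (o2, k2)) ->
        Right ((o1, o2), \(i1, i2) -> k1 i1 <&> k2 i2)

-- Hypothetical device: outputs the running sum of its inputs.
sumDev :: Int -> ReT Int Int I ()
sumDev acc = ReT (Identity (Right (acc, \i -> sumDev (acc + i))))

-- Hypothetical device: outputs the negation of its most recent input.
negDev :: Bool -> ReT Bool Bool I ()
negDev b = ReT (Identity (Right (not b, negDev)))

-- Hypothetical test harness: feed inputs one per tick, collect outputs.
runDev :: ReT i o I a -> [i] -> [o]
runDev (ReT (Identity r)) ins = case r of
  Left _ -> []
  Right (o, k) -> o : case ins of
    [] -> []
    (i : is) -> runDev (k i) is

-- Two unrelated devices advancing in lock step on paired inputs.
demo :: [(Int, Bool)]
demo = runDev (sumDev 0 <&> negDev False) [(1, True), (2, False)]
```

Each element of `demo` pairs the outputs that the two devices produce on the same tick, which is the lock-step behavior the paragraph describes.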



Figure 3.1: An illustration of the parallel operator in Connect Logic.

#### The refold Operator

The refold operator provides the developer with a means to alter a given device in ReT given combinational logic (pure functions) to affect its inputs and outputs. The refold operator can be thought of as a device given two adapters. This is formalized in Listing 3.2.

```
refold :: (Monad m) => (o1 -> o2) ->
                       (o1 -> i2 -> i1) ->
                       ReT i1 o1 m a ->
                       ReT i2 o2 m a
refold otpt inpt (ReT r) =
  ReT (do
    r' <- r
    case r' of
      Left a -> return (Left a)
      Right (o1, res1) ->
        return (Right (otpt o1, \i2 ->
          refold otpt inpt (res1 (inpt o1 i2)))))
```

Listing 3.2: Haskell implementation of the Connect Logic refold combinator

The first argument to refold is a function that converts the device's original output type to a new output type. The second argument adapts the original device's input to accept a new input by way of a function that maps the new input type to the old input type. This conversion of input types observes the original output of the device as part of the mapping. This enables the developer to write adapter functions that allow him or her to observe device state based on device output. This added expressiveness becomes important later on as we use it to implement pipelining with Connect Logic. A visual representation of this mapping is provided in Figure 3.2.
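
As a usage illustration, the sketch below wraps a small counter with refold so that the input adapter inspects the previous output before deciding what to feed the device. The ReT newtype is an assumed shape consistent with Listing 3.2, and `counter`, `capped`, and `runDev` are hypothetical names introduced for this example only.

```haskell
import Data.Functor.Identity (Identity (..))

-- Assumed shape of ReT, consistent with the pattern matches in Listing 3.2.
newtype ReT i o m a = ReT (m (Either a (o, i -> ReT i o m a)))

-- refold as given in Listing 3.2.
refold :: Monad m => (o1 -> o2) -> (o1 -> i2 -> i1) -> ReT i1 o1 m a -> ReT i2 o2 m a
refold otpt inpt (ReT r) = ReT $ do
  r' <- r
  case r' of
    Left a -> return (Left a)
    Right (o1, res1) ->
      return (Right (otpt o1, \i2 -> refold otpt inpt (res1 (inpt o1 i2))))

-- Hypothetical device: counts the ticks on which its enable input is True.
counter :: Int -> ReT Bool Int Identity ()
counter n = ReT (Identity (Right (n, \en -> counter (if en then n + 1 else n))))

-- Output adapter scales the count; input adapter observes the previous
-- output and suppresses the enable once the count has reached 2.
capped :: ReT Bool Int Identity ()
capped = refold (* 10) (\o en -> en && o < 2) (counter 0)

-- Hypothetical test harness: feed inputs one per tick, collect outputs.
runDev :: ReT i o Identity a -> [i] -> [o]
runDev (ReT (Identity r)) ins = case r of
  Left _ -> []
  Right (o, k) -> o : case ins of
    [] -> []
    (i : is) -> runDev (k i) is

demo :: [Int]
demo = runDev capped [True, True, True, True]
```

The counter itself is unchanged; the saturation behavior comes entirely from the input adapter reading the device's last output.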

Figure 3.2: An illustration of the refold operator's functionality in Connect Logic.

#### The refoldT Operator

We can manipulate devices based on their inputs and outputs with refold and we can pair devices together using the parallel operator, but there is a component missing: timing. Reactive Resumptions execute in a step-wise fashion in a manner similar to a hardware device reacting to clock pulses. It would be very useful to have a means to control whether or not a given device steps without needing to alter the device. This is where refoldT comes in. The refoldT function is a generalization of refold developed when working with systems in need of finer-grained execution control.

```
case fi o1 i2 of
  Nothing -> ReT (return (Right (fo o1, dispatch o1 resume)))
  Just x  -> refoldT fo fi (resume x)
```

Listing 3.3: Haskell implementation of the Connect Logic refoldT combinator

Listing 3.3 shows the type of refoldT. The definition of refoldT is nearly identical to that of refold, with the exception of the result of the input-manipulation function. When the input function returns Nothing, the internal device is stalled for that step. Values in Just execute the same way as a normal refold. Indeed, refoldT is a generalization of the original refold function, where refold fo fi = refoldT fo (\x y -> Just (fi x y)). We illustrate refoldT in Figure 3.3.
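
A complete definition consistent with the fragment in Listing 3.3 might look like the following. This is a sketch under assumptions: the type signature and the surrounding equations are reconstructed from the prose rather than taken from the listing, and `counter`, `refoldVia`, and `runDev` are hypothetical names.

```haskell
import Data.Functor.Identity (Identity (..))

-- Assumed shape of ReT, consistent with the pattern matches in Listing 3.2.
newtype ReT i o m a = ReT (m (Either a (o, i -> ReT i o m a)))

-- Sketch of refoldT: the input adapter may return Nothing to stall the
-- wrapped device, in which case the previous output is held.
refoldT :: Monad m => (o1 -> o2) -> (o1 -> i2 -> Maybe i1) -> ReT i1 o1 m a -> ReT i2 o2 m a
refoldT fo fi (ReT r) = ReT $ do
  r' <- r
  case r' of
    Left a -> return (Left a)
    Right (o1, resume) -> return (Right (fo o1, dispatch o1 resume))
  where
    dispatch o1 resume i2 = case fi o1 i2 of
      Nothing -> ReT (return (Right (fo o1, dispatch o1 resume)))
      Just x -> refoldT fo fi (resume x)

-- refold recovered from refoldT, as stated in the text.
refoldVia :: Monad m => (o1 -> o2) -> (o1 -> i2 -> i1) -> ReT i1 o1 m a -> ReT i2 o2 m a
refoldVia fo fi = refoldT fo (\x y -> Just (fi x y))

-- Hypothetical device: counts the ticks on which its enable input is True.
counter :: Int -> ReT Bool Int Identity ()
counter n = ReT (Identity (Right (n, \en -> counter (if en then n + 1 else n))))

-- Hypothetical test harness: feed inputs one per tick, collect outputs.
runDev :: ReT i o Identity a -> [i] -> [o]
runDev (ReT (Identity r)) ins = case r of
  Left _ -> []
  Right (o, k) -> o : case ins of
    [] -> []
    (i : is) -> runDev (k i) is

-- A Nothing input stalls the counter for one tick.
demoStall :: [Int]
demoStall = runDev (refoldT id (\_ mi -> mi) (counter 0)) [Just True, Nothing, Just True]

-- The Just-wrapped adapter behaves like a plain refold.
demoPlain :: [Int]
demoPlain = runDev (refoldVia (* 10) (\o en -> en && o < 2) (counter 0)) [True, True, True, True]
```

In `demoStall` the second tick repeats the previous output because the wrapped counter never receives an input on that tick.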



Figure 3.3: Applying refoldT to a device Dev to manipulate inputs with  $f_i$  and outputs with  $f_o$ .

#### The iter Operator

The iter function is a kind of lifting function for pure functions into the ReT monad. The type of iter is demonstrated in Listing 3.4.

Listing 3.4: Haskell implementation of the Connect Logic iter combinator.

The iter function produces a new synchronous device in ReT that wraps a pure function. The user is required to provide an output initialization for this device. This combinator gives us a way to turn pure functions into synchronous ones.
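
One way such a lifting could be written is sketched below. The type of `iter` here is an assumption based on the description (a pure function plus an initial output), not a transcription of Listing 3.4, and `runDev` is a hypothetical test harness.

```haskell
import Data.Functor.Identity (Identity (..))

-- Assumed shape of ReT, consistent with the pattern matches in Listing 3.2.
newtype ReT i o m a = ReT (m (Either a (o, i -> ReT i o m a)))

-- Sketch of iter: emit the initial output, then on every tick emit the
-- pure function applied to the input received on the previous tick.
iter :: Monad m => (i -> o) -> o -> ReT i o m a
iter f o0 = ReT (return (Right (o0, \i -> iter f (f i))))

-- Hypothetical test harness: feed inputs one per tick, collect outputs.
runDev :: ReT i o Identity a -> [i] -> [o]
runDev (ReT (Identity r)) ins = case r of
  Left _ -> []
  Right (o, k) -> o : case ins of
    [] -> []
    (i : is) -> runDev (k i) is

-- A doubling function turned into a clocked device with initial output 0.
demo :: [Int]
demo = runDev (iter (* 2) 0) [1, 2, 3]
```

The initial output is what the device presents before it has seen any input, which is why the user must supply it.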

#### 3.2.2 Non-Primitive Functions

Non-primitive functions are functions that transform devices using only primitive functions. Put another way, non-primitive functions can be thought of as macros over primitive functions. Here we describe one natural non-primitive function, pipeline, that follows from the primitive Connect Logic functions. Additional non-primitive Connect Logic transformations appear in Chapter 6.

#### The pipeline Operator

The pipeline function is a (non-primitive) combination of the parallel and refold functions. This function creates a pipeline of devices that perform operations in parallel and feed information forward to subsequent devices in the pipeline for further computation. This function allows the developer to break large combinational hardware into smaller synchronous devices to increase throughput. The function is defined in Listing 3.5.

```
pipeline :: ReT i z I a -> ReT z o I a -> ReT i o I a
pipeline dev1 dev2 = let combined = dev1 <&> dev2
                      in refold snd pipe combined
  where
    -- Route the new external input to dev1 and dev1's previous output to dev2.
    pipe whole_output i_input = (i_input, fst whole_output)
```

Listing 3.5: The Haskell definition of the pipeline function in ReWire

The pipeline function is formed from the main primitive functions in Connect Logic. All devices in a pipeline execute in parallel. Connecting inputs and outputs between pipelined devices is accomplished by the refold operator and some simple pure functions that select members of a tuple. This action is visualized in Figure 3.4.
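
The staging behavior can be observed in a small executable model. This is a sketch under assumptions: ReT, the parallel combinator, and `iter` are modeled as in the descriptions above rather than taken from the ReWire sources, refold is applied with its arguments in the order of Listing 3.2, and the input adapter forwards the first stage's output to the second stage with `fst`. The helper `runDev` is hypothetical.

```haskell
import Data.Functor.Identity (Identity (..))

-- Assumed shape of ReT, consistent with the pattern matches in Listing 3.2.
newtype ReT i o m a = ReT (m (Either a (o, i -> ReT i o m a)))

type I = Identity

-- Sketch of the parallel combinator (halts when either side halts).
(<&>) :: ReT i1 o1 I a -> ReT i2 o2 I a -> ReT (i1, i2) (o1, o2) I a
ReT (Identity l) <&> ReT (Identity r) = ReT (Identity step)
  where
    step = case (l, r) of
      (Left a, _) -> Left a
      (_, Left a) -> Left a
      (Right (o1, k1), Right (o2, k2)) ->
        Right ((o1, o2), \(i1, i2) -> k1 i1 <&> k2 i2)

-- refold as given in Listing 3.2.
refold :: Monad m => (o1 -> o2) -> (o1 -> i2 -> i1) -> ReT i1 o1 m a -> ReT i2 o2 m a
refold otpt inpt (ReT r) = ReT $ do
  r' <- r
  case r' of
    Left a -> return (Left a)
    Right (o1, res1) ->
      return (Right (otpt o1, \i2 -> refold otpt inpt (res1 (inpt o1 i2))))

-- Two-stage pipeline: dev1's previous output becomes dev2's next input.
pipeline :: ReT i z I a -> ReT z o I a -> ReT i o I a
pipeline dev1 dev2 = refold snd pipe (dev1 <&> dev2)
  where
    pipe whole_output i_input = (i_input, fst whole_output)

-- Assumed form of iter: a pure function lifted to a clocked stage.
iter :: Monad m => (i -> o) -> o -> ReT i o m a
iter f o0 = ReT (return (Right (o0, \i -> iter f (f i))))

-- Hypothetical test harness: feed inputs one per tick, collect outputs.
runDev :: ReT i o I a -> [i] -> [o]
runDev (ReT (Identity r)) ins = case r of
  Left _ -> []
  Right (o, k) -> o : case ins of
    [] -> []
    (i : is) -> runDev (k i) is

-- Stage 1 adds one, stage 2 doubles; results appear two ticks after input.
demo :: [Int]
demo = runDev (pipeline (iter (+ 1) 0) (iter (* 2) 0)) [1, 2, 3, 4]
```

The first two outputs are the stages' initial values; from the third output on, each value is `(x + 1) * 2` for the input `x` supplied two ticks earlier, which is the expected latency of a two-stage pipeline.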

### 3.3 Implementation

Implementation of Connect Logic occurs at multiple levels of the ReWire compilation process. Primitive functions are factored out of early compilation and are used to direct the port mapping of their subexpressions on the back end of the compilation process. Non-primitive functions must be discharged prior to compilation. This can be accomplished with inlining and simplification.

Figure 3.4: An illustration of the pipeline operation's functionality in Connect Logic.

#### 3.3.1 Structuring Connect Logic Devices

Connect Logic is intended as a system for connecting discrete, separate components. In the source program, these components are presented as subexpressions of a monolithic, whole ReWire expression. The first step in this process is to identify and separate subcomponents in ReWire expressions. For the purpose of this description, we refer to ReWire component specifications containing no Connect Logic primitives as trivial devices and devices containing Connect Logic primitives as non-trivial devices. Devices that contain additional Connect Logic will contain device specifications. We call these contained devices nested devices. Connect Logic expressions have no limit on expression depth. Pipelining devices can be performed in two different ways: a "deep" approach involves nested Connect Logic device specifications, while a "shallow" approach is a single, n-length parallel expression wrapped by a single refold expression. The shallow approach produces less complex VHDL, while the deep approach gives the programmer more flexibility to design a pipeline without knowledge of its entire structure beforehand. Both of these approaches are discussed in the subsequent section on pipelining. Tests using the pipelining technique employed in the Salsa specification in Chapter 8 created expressions that were both deep and shallow. Both approaches yielded synthesized hardware with nearly identical performance characteristics.

During the compilation phase, the Connect Logic extension to the ReWire compiler operates by analyzing the expression tree of the start (main) function specified by the developer. The Connect Logic extension performs type-directed analysis of the main ReWire device expression. A non-trivial ReWire device contains subexpressions in the type of ReT. Trivial device specifications are passed along to the compiler unchanged. This is the same process as a standard non-Connect Logic ReWire compilation procedure. The decomposition process works by constructing a new tree composed of trivial devices, refold nodes, and parallel nodes. The relevant structures are illustrated in Listing 3.6.

Listing 3.6: A data type for decomposed CL expressions

The CLTree type illustrated in Listing 3.6 is a data type parameterized over a function type f and a label type a. The type f parameterizes the type of the pure function that refold uses to manipulate the input and output of a device. The type we use is RWCExp, the type of ReWire expressions in the ReWire front end. This is represented in the CLExp and CLNamed type synonyms. The initial transformation from the ReWire core representation yields a structure in the type CLExp. We perform a monadic transformation from CLExp that maps to CLNamed using the following monadic construction.
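
To make the shape of such a tree concrete, the following sketch gives one possible definition together with the "deep" and "shallow" pipeline shapes discussed earlier. The constructor names, the use of a list for parallel nodes, and the helpers `depth` and `leaves` are assumptions made for illustration; they are not taken from Listing 3.6. Strings stand in for RWCExp so the example is self-contained.

```haskell
-- Hypothetical decomposed Connect Logic tree: f is the type of the pure
-- adapter functions carried by refold nodes, a is the type stored at leaves.
data CLTree f a
  = Leaf a
  | Par [CLTree f a]
  | Refold f f (CLTree f a)
  deriving (Eq, Show)

-- Nesting depth of a tree (a single leaf has depth 1).
depth :: CLTree f a -> Int
depth (Leaf _) = 1
depth (Par ts) = 1 + foldr (max . depth) 0 ts
depth (Refold _ _ t) = 1 + depth t

-- The trivial devices of a tree, left to right.
leaves :: CLTree f a -> [a]
leaves (Leaf x) = [x]
leaves (Par ts) = concatMap leaves ts
leaves (Refold _ _ t) = leaves t

-- Shallow three-stage pipeline: one refold around one n-ary parallel node.
shallow :: CLTree String String
shallow = Refold "out" "in" (Par [Leaf "a", Leaf "b", Leaf "c"])

-- Deep three-stage pipeline: a pipeline nested inside another pipeline.
deep :: CLTree String String
deep =
  Refold "out" "in"
    (Par [Leaf "a", Refold "out" "in" (Par [Leaf "b", Leaf "c"])])
```

Both trees contain the same trivial devices in the same order; they differ only in how much Connect Logic structure surrounds them.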

```
type CLM = WriterT [NCL] (WriterT [NRe] (StateT Int Identity))
clExpr :: RWCExp -> CLExp
rnCLExp :: CLExp -> CLM CLNamed
```

Listing 3.7: Types used in the construction of the renamed CL tree

Given a ReWire expression, we can transform it to the equivalent CLExp representation, which removes the Connect Logic primitive functions from the expression and structures its subexpressions accordingly in the tree. Given the resulting tree, we remove each node from the tree, starting with the leaves, save it, and replace it with a uniquely named reference (a node in the type CLNamed). First, the trivial nodes (all leaf nodes) are replaced, then the connecting non-trivial nodes (corresponding to Connect Logic expressions) are replaced. The end result is the root node of the expression, which we then use to drive VHDL code generation.
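
The leaf-renaming step can be sketched as follows. This is a simplified illustration, not the rnCLExp pass: it uses `mapAccumL` from `Data.Traversable` in place of the WriterT/StateT stack of Listing 3.7, it names only the leaves, and the tree type, the `dev` name prefix, and `nameLeaves` are all hypothetical.

```haskell
{-# LANGUAGE DeriveTraversable #-}

import Data.Foldable (toList)
import Data.Traversable (mapAccumL)

-- Hypothetical decomposed Connect Logic tree (see the assumptions above).
data CLTree f a
  = Leaf a
  | Par [CLTree f a]
  | Refold f f (CLTree f a)
  deriving (Eq, Show, Functor, Foldable, Traversable)

-- Replace every leaf with a fresh name and save the (name, original) pairs.
-- The counter threaded by mapAccumL plays the role of the Int state.
nameLeaves :: CLTree f a -> (CLTree f String, [(String, a)])
nameLeaves t = (named, zip (toList named) (toList t))
  where
    (_, named) = mapAccumL fresh (0 :: Int) t
    fresh n _ = (n + 1, "dev" ++ show n)

-- A refold wrapped around two trivial devices placed in parallel.
example :: CLTree String String
example = Refold "snd" "pipe" (Par [Leaf "not_d Zero", Leaf "not_d Zero"])
```

Naming the interior refold and parallel nodes would proceed the same way on the result, working from the leaves toward the root.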

VHDL code generation is performed on each subexpression (and subdevice) individually. Given any node from the original CLExp tree, we are presented with one of two situations. In the first, the node is a Leaf node, that is, a trivial (non-Connect Logic) device. We can compile this node as is done in the standard ReWire approach. If instead the node is a refold or parallel node, then we compile it using one of the methods described in the next section. The resulting compilation is a series of VHDL entity declarations in a single VHDL file, tied together using VHDL structural mapping features.

#### 3.3.2 Compiling Primitives

The primitive operators in Connect Logic are compiled in one of two different ways. In the case of parallel, the operation is given a reference to the two sub components to execute in parallel, and VHDL code is constructed to execute two mapped devices with structural parallelism. In the case of refold, a singular reference to a device is given, along with two different pure functions. Both functions are compiled to pure VHDL functions encapsulated in a VHDL entity. These functions are applied to the input and the output of the nested, or encapsulated sub-component given by the device reference. The I/O of the encapsulated device is mapped to internal signals of the refold device. The corresponding functions are applied to those signals. The output function transforms the output of the encapsulated device and the input function produces input for the device based on its previous output as well as the input to the refold device. A more detailed description of each follows in the subsections.

#### parallel

A parallel operation is compiled by analyzing its sub-components, compiling them, and then constructing a VHDL specification based on the bit widths of the input and output types of a device in ReT. We provide an example of a hand-compiled device composed of a single parallel (parl) function specified in Listing 3.8. The following example is illustrative of the Connect Logic compilation process for a parallel operator. The actual process produces significantly larger code that includes the compiled trivial devices; these are omitted here.

```
data Bit = Zero | One

not :: Bit -> Bit
not x = case x of {Zero -> One; One -> Zero}

-- Emit the inverted input, then continue with the next input.
not_d :: Bit -> ReT Bit Bit Id ()
not_d x = signal (not x) >>= not_d

start :: ReT (Bit, Bit) (Bit, Bit) Id ()
start = (not_d Zero) <&> (not_d Zero)
```

Listing 3.8: An example parallel device

In Listing 3.8 we specify a simple device composed of two devices that negate their inputs, combined using the parallel combinator. In an intuitive sense, compiling the parallel operation follows from compiling its left and right operands, both of which are not\_d Zero here. These devices are trivial devices. We can inspect their types from the type checker to determine their type width, which is one bit. Thus, the total width for the parallelized device is two bits. We require this information before generating VHDL so we can properly split the input of the combined device to each of its sub-components. The VHDL for the device specified at start in Listing 3.8 is illustrated in Listing 3.9.
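
The width calculation that drives the input split can be sketched as below. The `Ty` representation, `width`, and `slices` are hypothetical names chosen for illustration; the ReWire compiler's own type representation is not shown in this chapter.

```haskell
-- Hypothetical, simplified view of a "bitty" type: a single bit or a
-- product of other such types.
data Ty = TBit | TProd [Ty]
  deriving (Eq, Show)

-- Number of bits needed to encode a value of the given type.
width :: Ty -> Int
width TBit = 1
width (TProd ts) = sum (map width ts)

-- Inclusive (low, high) index ranges, in VHDL "lo to hi" style, that
-- split one combined vector among sub-components of the given widths.
slices :: [Int] -> [(Int, Int)]
slices ws = [(lo, lo + w - 1) | (lo, w) <- zip (scanl (+) 0 ws) ws]

-- The two one-bit inputs of the device in Listing 3.8.
demo :: [(Int, Int)]
demo = slices (map width [TBit, TBit])
```

For the two one-bit devices this yields the ranges 0 to 0 and 1 to 1, which are the slices of the combined input routed to each sub-component.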

```
architecture behavioral of start is
    signal dev0in  : std_logic_vector(0 to 0);
    signal dev0out : std_logic_vector(0 to 0);
    signal dev1in  : std_logic_vector(0 to 0);
    signal dev1out : std_logic_vector(0 to 0);

begin
    dev0in <= input(0 to 0);
    dev1in <= input(1 to 1);

    dev0 : entity work.not_d(behavioral)
        port map (clk, dev0in, dev0out);

    dev1 : entity work.not_d(behavioral)
        port map (clk, dev1in, dev1out);

    output <= dev0out & dev1out;
end behavioral;
```

Listing 3.9: VHDL code for the example parallel device.

The code in Listing 3.9 is an abbreviated version of the VHDL-compiled form of the device specified in the start function in Listing 3.8. Parallelism specified by Connect Logic is accomplished by port mapping instances of the entity not\_d (not illustrated). As noted prior, the input and output of this device are each two bits wide. The clock signal, clk, is passed through to the sub-components, dev0 and dev1. The parallelization of the two sub-components is a simple mapping of inputs and outputs and imposes no overhead of its own on the performance of the parallelized devices. In other words, a parallelized device is as fast as its slowest sub-component.

#### refold

Compiling a refold function follows a similar process to compiling a parallel function. A refold operation is, in essence, "wrapping" a device with functions that manipulate its inputs and outputs. Where in the parallel operation we place two devices side by side and map their combined inputs and outputs to a wrapping device, in refold we wrap one device and map its I/O, but we additionally map the I/O through the pure functions provided as arguments to the refold function. An example hand-translation from a ReWire representation to VHDL is illustrated in Listing 3.10 and Listing 3.11.

Listing 3.10: An example refolded device

The example ReWire code in Listing 3.10 illustrates a device consisting of a single refold that inverts the output of the device opaque\_dev and uses and to map two Bit values to one Bit, the input type of opaque\_dev. In this illustration we keep the definition of opaque\_dev abstract. The compiled form of start is given in Listing 3.11.

```
entity start is
    Port ( clk    : in std_logic;
           input  : in std_logic_vector (0 to 0);
           output : out std_logic_vector (0 to 0));
end start;
architecture behavioral of start is
    signal opaque_dev_in  : std_logic_vector(0 to 0);
    signal opaque_dev_out : std_logic_vector(0 to 0);
    function and(arg1 : std_logic_vector; arg2 : std_logic_vector)
        return std_logic_vector;
    function not(arg1 : std_logic_vector) return std_logic_vector;

begin
    opaque_dev_in <= and(opaque_dev_out, input);
    output        <= not(opaque_dev_out);

    dev0 : entity work.opaque_dev(behavioral)
        port map (clk, opaque_dev_in, opaque_dev_out);
end behavioral;
```

Listing 3.11: VHDL code for the example refold device.

The VHDL generated from start in Listing 3.10 is given in the entity in Listing 3.11. In the compiled code, the start entity wraps the abstract opaque\_dev device. Input from the outer device is mapped to the inner device by way of the opaque\_dev\_in signal, where the and function performs the input mapping specified by the second function argument to refold. The output of opaque\_dev is mapped to the opaque\_dev\_out signal. The output of the whole start entity is taken from the output signal of the interior device, but is first manipulated by the not function. We note that the compiled function definitions in the architecture of the device are omitted.

#### 3.3.3 Compiling Non-primitives

We note that the non-primitive functions in ReWire are comprised of primitives and as such are not normal synthesizeable functions. We discharge non-primitive functions when they are observed in the parsing of an expression by performing appropriate syntax-level transformations of ReWire programs. In the case of pipeline, we in-line the definition of pipeline where it appears and perform a beta-reduction.

# Chapter 4

# Modularity Principles and a Module System

Connect Logic and modular programming are integral concepts in ReWire. We promote modular programming by promoting the reuse of synchronous components and we enable the reuse of synchronous components with Connect Logic. This chapter describes how modular programming in ReWire follows from the implementation of Connect Logic and the module system that is enabled through this addition.

### 4.1 Modularity with ReWire

ReWire is a subset of Haskell, but it is also a hardware description language (HDL). When designing hardware in ReWire it is important to give consideration to the primary ReWire artifacts, how they translate to artifacts in hardware, and the implications for modularity and reuse in the ReWire language.

#### 4.1.1 Functions

Pure functions in ReWire are where most computation occurs. Work performed by a specification should be placed in a pure function. Pure functions are akin to unclocked combinational logic in a circuit. A pure function of the type (i -> o) can be considered a "black box" that operates on inputs of type i and gives outputs of type o. Functions of arity greater than one can be considered akin to entities with multiple input ports, but these functions are isomorphic to unary functions by way of uncurrying.
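
The uncurrying correspondence can be seen in a small example. The names `andB` and `andPort` are hypothetical; `uncurry` and `curry` are the standard Prelude functions.

```haskell
data Bit = Zero | One
  deriving (Eq, Show)

-- A two-argument combinational function: think of two input ports.
andB :: Bit -> Bit -> Bit
andB One One = One
andB _ _ = Zero

-- The same logic viewed as a unary function on a single two-bit input.
andPort :: (Bit, Bit) -> Bit
andPort = uncurry andB

-- The two views agree on every input.
agree :: Bool
agree = and [andPort (x, y) == andB x y | x <- [Zero, One], y <- [Zero, One]]
```

Applying `curry andPort` recovers a function that behaves exactly like `andB`, so nothing is lost in either direction.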

#### 4.1.2 Reactive Resumptions

Reactive Resumptions are the constructions we use to express synchronicity in ReWire. Functions express what work is to be performed; Reactive Resumptions express when to perform it. A Reactive Resumption expresses a state machine similar in function to a Moore machine. Designers interface with these structures in ReWire by way of monadic binding (the >>= operator) and the non-proper morphism signal, which allows a device to yield output and resume on the next input after the call to signal. ReWire restricts the developer to these two methods of encoding synchronous logic. Approaches to managing Reactive Resumptions other than the two mentioned are disallowed. Reactive Resumptions are not allowed to be nested or combined in any way other than traditional monadic binding. No inspection of the underlying structure of ReT is allowed (for example, in Haskell the underlying structure is actually an Either type).
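
For readers who want to experiment outside of ReWire, a reactive resumption monad with these two operations can be modeled in ordinary Haskell roughly as follows. This is a sketch: the newtype matches the Either-based structure mentioned above and the pattern matches in Listing 3.2, but the instance definitions, `toggle`, and `runDev` are written for this illustration and are not copied from the ReWire sources.

```haskell
import Control.Monad (ap, liftM)
import Data.Functor.Identity (Identity (..))

-- Either a final result, or an output paired with a continuation that
-- waits for the next input.
newtype ReT i o m a = ReT { deReT :: m (Either a (o, i -> ReT i o m a)) }

instance Monad m => Functor (ReT i o m) where
  fmap = liftM

instance Monad m => Applicative (ReT i o m) where
  pure = ReT . return . Left
  (<*>) = ap

instance Monad m => Monad (ReT i o m) where
  ReT m >>= f = ReT $ do
    r <- m
    case r of
      Left a -> deReT (f a)
      Right (o, k) -> return (Right (o, \i -> k i >>= f))

-- Yield an output and resume with the next input.
signal :: Monad m => o -> ReT i o m i
signal o = ReT (return (Right (o, pure)))

-- Hypothetical Moore-style device: its output is its state, and the
-- state flips whenever the enable input is True.
toggle :: Bool -> ReT Bool Bool Identity ()
toggle s = do
  en <- signal s
  toggle (if en then not s else s)

-- Hypothetical test harness: feed inputs one per tick, collect outputs.
runDev :: ReT i o Identity a -> [i] -> [o]
runDev d ins = case runIdentity (deReT d) of
  Left _ -> []
  Right (o, k) -> o : case ins of
    [] -> []
    (i : is) -> runDev (k i) is

demo :: [Bool]
demo = runDev (toggle False) [True, False, True]
```

The device is written using only bind (through do-notation) and `signal`; the tail call to `toggle` corresponds to a state transition.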

If there are two devices specified as Reactive Resumptions in ReWire, a programmer can combine them by hand with monadic binding (like do-notation in Haskell) by sequencing one to take place after the other with some control over how this sequencing occurs (conditionally or otherwise). This is not unlike concatenating two state machines together. We can define transitions into and out of existing device state machines in a number of ways, but the issue remains that we cannot execute two devices in parallel without completely refactoring existing specifications. For modular programming and specification, this is not sufficient. If we need to inspect the internals of a device in order to integrate it into a design, this severely diminishes its potential for modularity. In other words, without another way to combine synchronous devices, modularity in ReWire is hampered and a proper module system will not promote code reuse.

#### 4.1.3 Modularity and Composability Follow from Connect Logic

Connect Logic was added to ReWire in order to fully realize modular programming for hardware while maintaining feasible synthesizability of specifications using Connect Logic primitives. With Connect Logic, we can interface synchronous devices specified in Reactive Resumption form without having to refactor them. We can place devices in parallel, execute them in step with one another, and route input and output values among the devices. With these functions fully supported in the compiler as they are now, we can pave the way to modular programming in ReWire with reuse of commonly used synchronous components as well as combinational functions in a module system.

ReWire gives us two notions of modularity: synchronous modularity and combinational modularity. The Connect Logic extension to ReWire enables the developer to combine these different types of components. We consider the base form of modularity to be pure functions (combinational) and Reactive Resumptions (synchronous). Table 4.1 demonstrates how we can use Connect Logic and some traditional Haskell functions (const, etc.) to combine synchronous and combinational modules with one another.

|               | Combinational                  | Synchronous                  |
|---------------|--------------------------------|------------------------------|
| Combinational | left . right                   | refold id (const right) left |
| Synchronous   | refold right (flip const) left | left \`pipeline\` right      |

Table 4.1: Composing synchronous and combinational logic in ReWire. Output from the left is fed to the right.
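
The mixed compositions can be tried out in a small executable model. This is a sketch under assumptions: ReT and `iter` are modeled as described in Chapter 3, and the wrapper names `devThenFun` and `funThenDev` as well as `runDev` are hypothetical. The wrappers name their arguments by role (function or device) rather than by position.

```haskell
import Data.Functor.Identity (Identity (..))

-- Assumed shape of ReT, consistent with the pattern matches in Listing 3.2.
newtype ReT i o m a = ReT (m (Either a (o, i -> ReT i o m a)))

-- refold as given in Listing 3.2.
refold :: Monad m => (o1 -> o2) -> (o1 -> i2 -> i1) -> ReT i1 o1 m a -> ReT i2 o2 m a
refold otpt inpt (ReT r) = ReT $ do
  r' <- r
  case r' of
    Left a -> return (Left a)
    Right (o1, res1) ->
      return (Right (otpt o1, \i2 -> refold otpt inpt (res1 (inpt o1 i2))))

-- Assumed form of iter: a pure function lifted to a clocked device.
iter :: Monad m => (i -> o) -> o -> ReT i o m a
iter f o0 = ReT (return (Right (o0, \i -> iter f (f i))))

-- Synchronous device followed by a combinational function on its output.
devThenFun :: Monad m => ReT i o m a -> (o -> o') -> ReT i o' m a
devThenFun dev g = refold g (flip const) dev

-- Combinational function applied to the input of a synchronous device.
funThenDev :: Monad m => (i' -> i) -> ReT i o m a -> ReT i' o m a
funThenDev f dev = refold id (const f) dev

-- Hypothetical test harness: feed inputs one per tick, collect outputs.
runDev :: ReT i o Identity a -> [i] -> [o]
runDev (ReT (Identity r)) ins = case r of
  Left _ -> []
  Right (o, k) -> o : case ins of
    [] -> []
    (i : is) -> runDev (k i) is

-- Increment device, then double its output.
demoOut :: [Int]
demoOut = runDev (devThenFun (iter (+ 1) 0) (* 2)) [1, 2]

-- Double the input, then feed the increment device.
demoIn :: [Int]
demoIn = runDev (funThenDev (* 2) (iter (+ 1) 0)) [1, 2]
```

Neither wrapper adds a clock cycle: the pure function is applied on the same tick as the device it is attached to, whereas composing two synchronous devices with pipeline introduces a stage boundary.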

This notion of modularity gives hardware developers using ReWire a principled approach to modeling synchronous and combinational systems, separating these concerns, and maximizing reuse through this approach.

### 4.2 A Module System for ReWire

We equipped the ReWire language and compiler with support for single compilation using the same style of imports and exports seen in Haskell modules. We describe the approach in this section.

#### 4.2.1 The ReWire Module System

The ReWire module system is a single compilation system that functions as a subset of the Haskell namespace and module system. Module imports are handled at compile time by merging modules together by fully qualifying their names.

```
module Module1 where
data Foo = A | B
funA :: Foo -> Foo
funA x = ...
-- A module using Module1
module Main where
import Module1
funB :: Foo -> Foo
funB x = ... funA ...
-- Main module, fully merged and qualified
module Main where
data Module1.Foo = Module1.A | Module1.B
Module1.funA :: Module1.Foo -> Module1.Foo
Module1.funA = ...
funB :: Module1.Foo -> Module1.Foo
funB x = ... Module1.funA ...
```

Listing 4.1: ReWire single compilation transformation example

We provide the same tools as Haskell for qualifying and renaming names. The ReWire module system currently assumes a single working directory structure for imported source files, but a package system similar to the Haskell Cabal package system could be integrated as future work. Since we restrict ourselves to single compilation, importing modules is equivalent to a source transformation on ReWire files. We illustrate this source transformation in Listing 4.1.

#### 4.2.2 From Separate Compilation and Future Work on Module Systems

With synchronous device reuse fully realized, we added module support to the ReWire compiler. When this project was undertaken, the initial approach considered a fully separate compilation system with support for previously compiled ReWire components. We determined that this approach was impractical and unnecessary. If a separate compilation module system from ReWire to VHDL were to be complete, it would need to have support for separately compiled polymorphic functions and Reactive Resumptions. ReWire does not support compilation of polymorphic structures. If something is to be compiled to VHDL, it must be monomorphic, that is, without any quantified types (type variables). That is not to say that polymorphic functions are without applications in ReWire programming, however. Functions such as snd and fst are commonly used, and are usually monomorphized by hand by the designer, typically by inlining them.

What does this mean for modularity of polymorphic functions in ReWire? We considered VHDL support for polymorphism with VHDL generic entities. These features allow one to specify entities in VHDL that are parameterized over values specifying attributes of the entity. One potential use of generics in entity specifications is for the developer to allow flexibility in the size of the inputs, outputs, and internal storage and wiring of a device. ReWire types are all assumed to be "grounded" or "bitty" types that can be represented and encoded in a bitwise form. Even quantified types (types with type variables) will have an implied "bittyness" to them that isn't fully realized until the types are evaluated to a monomorphic form. We can use this grounded nature of all ReWire types as a potential avenue for compiling polymorphism, but it comes at a cost. For polymorphic ReWire functions to be represented in terms of VHDL entities with generic arguments, a few requirements need to be met. We need a way to quantify the size of arguments in polymorphic functions and how this relates to the computation performed on the polymorphic values of the function. We believe that bit size is the only requirement for this to occur. When fully evaluating type-level lambdas (or making type variables monomorphic), we must have a way to relay the size of the types to the VHDL entity representing the polymorphic function.

Given that this process works, the VHDL synthesis tools are left to do the work of monomorphizing the device specifications that we opted out of. Generic devices can't be synthesized to hardware. The synthesis tools must discharge all generics with actual values before they can generate an actual design. By adding this feature we will have effectively "kicked the can down the road" to tools that are largely black boxes. Given that the ReWire compiler emphasizes verification in design and implementation, and given that this one feature would require a significant increase in compiler complexity for separate compilation where the benefits are somewhat nebulous, we opted to discontinue working towards a separate compilation system and focus instead on a single compilation system while offering an equivalent alternative to support polymorphic functions in a compilation pipeline for ReWire.

We can support polymorphic functions by partially evaluating them with respect to their types, but not their terms. We describe this process as follows. For this example we consider ReWire expressions in their System F form with regards to their types where type variables are bound by a type-level (big) lambda. ReWire functions are valid for compilation if their types contain no big lambdas. We propose a method for monomorphizing polymorphic pure functions in ReWire:

1. Detect all polymorphic functions.
2. In-line (expand) all polymorphic functions.
3. Discharge (purge) all polymorphic function definitions.
4. $\beta$-reduce all big-lambdas of the expanded lambda terms from the expanded polymorphic functions.
5. Lambda-lift all of the lambda terms from expanded polymorphic functions.
6. Merge all identical lambda-lifted functions to single definitions. Rename replaced definitions accordingly.

The steps enumerated above provide us an analogous replacement to the process that the VHDL tools use in elaboration and synthesis to instantiate generic entities while maximizing sharing of functions. An emphasis is placed on reducing only types and leaving lambda terms as they are without reducing them. This approach allows us to maximize sharing, reduce the amount of redundant work, and promote reuse. The ReWire compiler already contains functionality to perform a number of these steps. We surmise that adding lambda lifting and type reduction would require additional labor, but neither is novel and neither would likely impose significant work on a compiler engineer.
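
The effect of the steps above can be illustrated by hand on a tiny program. This is only a before-and-after sketch written in ordinary Haskell, not output of the ReWire compiler; every name in it (`second`, `pickBit`, `secondBitBit`, and so on) is hypothetical.

```haskell
data Bit = Zero | One
  deriving (Eq, Show)

-- Before: one polymorphic helper used at two different type instances.
second :: (a, b) -> b
second (_, y) = y

pickBit :: (Bit, Bit) -> Bit
pickBit p = second p

pickPair :: (Bit, (Bit, Bit)) -> (Bit, Bit)
pickPair p = second p

pickAgain :: (Bit, Bit) -> Bit
pickAgain p = second p

-- After: the polymorphic definition is replaced by one lifted,
-- monomorphic copy per distinct type instance. The two uses at
-- (Bit, Bit) share a single definition (the merge in step 6).
secondBitBit :: (Bit, Bit) -> Bit
secondBitBit (_, y) = y

secondBitPair :: (Bit, (Bit, Bit)) -> (Bit, Bit)
secondBitPair (_, y) = y

pickBit' :: (Bit, Bit) -> Bit
pickBit' p = secondBitBit p

pickPair' :: (Bit, (Bit, Bit)) -> (Bit, Bit)
pickPair' p = secondBitPair p

pickAgain' :: (Bit, Bit) -> Bit
pickAgain' p = secondBitBit p
```

Only the types were specialized; the term structure of each copy is identical to the original, which is what allows identical instances to be merged.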

# Chapter 5

# Visual Programming in ReWire

This chapter introduces a visual programming environment based largely on Connect Logic as a system for composing devices written in ReWire. We refer to the visual programming environment as the ReWire Visual Tool or RVT. Visual programming in RVT is similar to drawing a block diagram of a hardware system in a tool such as Vivado. The user draws wires between connection points on shapes representing synchronous devices and pure logic. We design RVT with two shape types to represent simple devices and functions with constrained inputs and outputs; devices are restricted to one connection point for input and one connection point for output, while functions are allowed many input connection points but are constrained to one output connection point. We demonstrate a method to compile a DSL representation of this box diagram to a ReWire implementation.

### 5.1 Motivation

The motivation for developing the RVT is two-fold. Firstly, we wished to demonstrate the ease with which one could compile a visual representation of a hardware device specification using ReWire. By carefully selecting our programming model and by utilizing ReWire as the back end target for compilation, we are able to recreate a high-level system-specification tool usually seen as part of very mature and complex system design suites (such as Vivado by Xilinx). This replication of design demonstrates the advantages that the ReWire back end brings to the design of system tools. Specifically, a performant, high-level intermediate representation used as a compiler target allows us to build even higher-level system tools with relatively small effort. Secondly, we wished to develop the visual tool as a layer on top of a specialized DSL for combining ReWire devices, or Reactive Resumptions, along with pure functions. The development and implementation of the RVT follows the design process used in the implementation of the RexHacc compiler in Chapter 7. The design tool emits a machine specification in its DSL, which is then compiled to ReWire and then to VHDL by way of the ReWire compiler.

### 5.2 Specification and Features

We begin the development of this tool by specifying its basic features. First, we wish to be able to compose a system of opaque synchronous devices (essentially something specified as a reactive resumption). This is as simple as a box diagram of devices strung together with lines representing wires. A question that arises immediately from this is "how many connection points should we allow for device shapes?" Reactive resumptions take a single input "argument" and yield a single output "result" when they pause, but those single arguments and results are, in many cases, values in some product type that are decomposed by the device. We illustrate an example of this in Listing 5.1.

```
type Input = (Bit, Bit)
type Output = (Bit, Bit)

device :: Input -> ReT Input Output Id ()
device (left, right) = signal (right, left) >>= device

start = device (Zero, Zero)
```

Listing 5.1: An Example ReWire device with I/O of product types.

At first glance, an obvious way to allow devices to have many I/O connection points would be to inspect the type of the device and, if it is a tuple type, generate n connection points on the rendered device to match the n-tuple of the input or output type. In other words, if an input type were a triple, we would provide three connection points. The problem with this approach is that it does not generalize. Haskell does not have one singular product type: any algebraic data type (ADT) with a data constructor that takes more than one argument is a product type. For this reason, we restrict a basic device representation to a single input connection point and a single output connection point.

The second feature we wish to establish in our base specification is the ability to manipulate data "in between" our devices. At first it seems that one could simply chain devices together to get the desired functionality, but in our model devices perform their actions synchronously and in parallel with one another. Intuitively, this means that each device takes an input on one cycle and yields the corresponding output at the beginning of the next cycle. Chaining devices together would therefore introduce a pipeline delay where we might not want one. Manipulating data outside of this synchronous, cycle-oriented model gives us the flexibility to transform data without affecting the staging of the pipeline. For this we create a second presentation format, the pure function box. Given that this visual representation derives its properties from an actual pure function, we allow it many inputs (as arguments) but limit it to one output for the same reasons we limit synchronous devices.

For our base design and implementation of the visual tool, we assume that all functions and devices available to the user for inclusion in a visual design are abstract. The user builds a larger implementation by composing abstract functions and devices (i.e., ones whose internals are opaque). An important additional feature would be to allow the user to compose a device or function that they can use later but can also inspect and manipulate. We defer this functionality to a later iteration of RVT. Lastly, we specify that the composition of devices and pure functions is itself a device. This allows a user to produce components that can be reused in the environment. To this end we provide an input port and an output port that devices can connect to. These are given by Input and Output in Listing 5.2. Counterintuitively, from the perspective of the interior of the design, Input is an output port (lines originate from it) and Output is an input port (a line terminates at it).

Given the design specifications laid out in the previous section, we proceed with specifying the data types that we will use to model visual connections in the RVT.

```
type NodeRef = Text
data NodeId = Id Text | Input | Output
data Node = Device NodeId NodeRef
```

Listing 5.2: The data types for the first iteration of RVT.

In Listing 5.2 we illustrate the types and data structures used to model RVT. At the top level, a program in RVT comprises lists of Node and Link. The Node data type models functions and devices in a visual specification. The anchor types (IAnchor and OAnchor) correspond to input anchors and output anchors, the points where a user can connect devices by wires. Every constructor of Node contains a NodeRef, which refers to the name of the device or function, as well as a NodeId, which is a unique identifier for the Node. We note that a NodeId can be an identifier or one of the constructors Input or Output. The Input and Output constructors are used only as anchor points that connect to the input and output of the whole device being specified. In our model, the user is designing the interior of what becomes a whole synchronous device that takes input and yields output. A correctly specified visual program is a device in which all components are fully connected: every input port has a connection from some output port, and every output port is connected to some input port. Any number of lines can originate from the input port denoted by Input, while only one line is allowed to terminate at the output port denoted by Output.
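The connectivity rules above can be sketched as a small check over a diagram. This is a minimal sketch and not the RVT implementation: the `Link` type and the `wellFormed` function are hypothetical (Listing 5.2 shows only `NodeRef`, `NodeId`, and `Node`), and `String` stands in for `Text` so that the example depends only on the Prelude.

```haskell
type NodeRef = String                       -- stand-in for Text (assumption)
data NodeId  = Id String | Input | Output deriving (Eq, Show)
data Node    = Device NodeId NodeRef deriving (Eq, Show)

-- Hypothetical: a wire from the output of one node to the input of another.
data Link = Link { linkFrom :: NodeId, linkTo :: NodeId } deriving (Eq, Show)

nodeId :: Node -> NodeId
nodeId (Device i _) = i

-- A diagram is accepted when every node has at least one incoming and one
-- outgoing wire, exactly one wire terminates at Output, and no wire
-- originates from Output or terminates at Input.
wellFormed :: [Node] -> [Link] -> Bool
wellFormed nodes links =
     all hasIncoming ids
  && all hasOutgoing ids
  && length [l | l <- links, linkTo l == Output] == 1
  && all (\l -> linkFrom l /= Output && linkTo l /= Input) links
  where
    ids           = map nodeId nodes
    hasIncoming i = any ((== i) . linkTo) links
    hasOutgoing i = any ((== i) . linkFrom) links
```

For a single device `a`, the links `Input -> a` and `a -> Output` form an accepted diagram, while omitting the second link leaves the device's output unconnected and the diagram is rejected.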

# 5.3 Tool Implementation

The implementation of RVT is split into two components: front end and back end.

The front end of RVT is the visual component for user interaction. The back end consists of a small web service with a small compiler that converts front end graph representations to ReWire.

#### 5.3.1 Front End

We implement the front end as a web application because of the availability of visual diagramming tools in JavaScript and the platform independence this offers end users. There are visual programming libraries and tools written for Haskell, but they lack the platform independence and maturity of their web-based counterparts. The diagramming tool is based on the JointJS JavaScript library for drawing diagrams with connections. The library includes facilities for serializing and deserializing our diagram graphs to and from JSON, as well as aesthetic features such as routing for user-drawn connections between components in a graph.

#### 5.3.2 Back End

The back end of RVT is implemented in Haskell. We use the Scotty web framework to implement the web service that interfaces with the RVT front end. The back end serves the tool software to the user and handles user requests to compile a given visual implementation. Visual implementations are submitted to the server as serialized JSON objects, which are parsed using the Aeson library and transformed to the data types illustrated in Listing 5.2. Once transformed to these types, we compile the RVT representation to ReWire code and return the compiled code to the user.

Figure 5.1: Diagramming devices using RVT in a web browser.

# 5.4 Using RVT

RVT is used as a drawing tool from a web browser: the tool is hosted by a server and editing is performed on the client. An example diagram from the tool is shown in Figure 5.1, where we use the tool to implement a 4-bit deserializer. Functions are represented by square-cornered boxes, while synchronous logic components are represented by boxes with rounded corners.

#### 5.5 Code Generation

The service component of RVT uses the graph drawn in the client component to construct a corresponding Connect Logic expression that implements the graph. All synchronous components are placed in parallel execution using the <&> operator. Combinational pure functions are applied to arguments in the same phase as routing data flow between synchronous devices placed in parallel. For this we use the refold operator. We construct a routing function by observing the source of each input at each component. Each device input is constructed by a let expression that makes use of the pure functions that may manipulate the input before it arrives at the component. This is an automation of the process that we use to construct the DLX processor in Chapter 9. The tool outputs a textual representation of the expression. We illustrate this output code in Figure 5.2.

Figure 5.2: Generated code from an RVT specification.
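The generation scheme described in this section can be sketched as a small pretty-printer. This is only an illustration under stated assumptions, not the generator used by RVT: the names `nest`, `Wiring`, and `emit`, the `_out` suffix for output variables, the variable name `inp` for the external input, and the right-nested shape of the parallel composition are all hypothetical choices, and the sketch inlines each device's input expression rather than binding it with a let expression.

```haskell
-- Nest a non-empty list of strings as right-leaning pairs with a separator,
-- e.g. nest " <&> " ["a","b","c"] gives "(a <&> (b <&> c))".
nest :: String -> [String] -> String
nest _   []       = error "nest: empty list"
nest _   [x]      = x
nest sep (x : xs) = "(" ++ x ++ sep ++ nest sep xs ++ ")"

-- A device name paired with the expression that computes its next input.
-- The expression may mention "inp" and any "<device>_out" variable.
type Wiring = (String, String)

-- Emit a refold expression: an output function, a routing function, and
-- the devices composed in parallel.
emit :: String -> [Wiring] -> String
emit outExpr ws = unwords
  [ "refold"
  , "(\\" ++ outs ++ " -> " ++ outExpr ++ ")"
  , "(\\" ++ outs ++ " inp -> " ++ nest ", " (map snd ws) ++ ")"
  , nest " <&> " names
  ]
  where
    names = map fst ws
    outs  = nest ", " [n ++ "_out" | n <- names]
```

For example, `emit "b_out" [("a", "inp"), ("b", "f a_out")]` produces the text `refold (\(a_out, b_out) -> b_out) (\(a_out, b_out) inp -> (inp, f a_out)) (a <&> b)`, in which the pure function `f` is applied in the routing phase between the two devices.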

RVT is a proof-of-concept tool. It does not type check the code it generates; instead, the generated code is type checked by the ReWire compiler before synthesis. Early test examples were synthesized to hardware and produced good implementations. Further iterations of the tool could see more integration with the compiler as well as better support for saved designs and visual testing and debugging. We leave these features for future work on a more robust and integrated tool.

# Chapter 6

# Concurrent Devices in ReWire and Connect Logic

In this chapter we illustrate the applicability of Connect Logic to a variety of use-cases involving inter-device communication. The applications in this chapter demonstrate the usefulness of Connect Logic in ReWire as a method for implementing concurrent systems in hardware in addition to parallel systems. Parallel programming in ReWire is accomplished by utilizing combinational pure functions. Execution of parallel functions in ReWire is deterministic in nature. Concurrent programming in ReWire is not necessarily deterministic. We can execute synchronous logic in ReT concurrently where threads have internal state and may complete work at different points in time. Concurrent programming employs a number of useful idioms to manage shared resources between concurrently-running devices. We illustrate some of these idioms in this chapter in addition to a redundancy transformation on synchronous logic in ReWire.

## 6.1 Barrier Synchronization



Figure 6.1: Hardware threads in ReWire in a system with barrier synchronization.

In software, one of the most common concurrent synchronization idioms is the barrier. Barriers act as a synchronization point between concurrent threads [64]. Given a collection of threads synchronized to a single barrier, each thread must pause execution once it reaches the barrier until all other threads synchronized to the barrier have reached it. Once all of the threads have reached the barrier, they are un-paused and concurrent execution resumes in the same manner as before: running until they reach the barrier again, pausing, and continuing yet again. This concept is illustrated in Figure 6.1. Barriers are employed in the execution of parallel loops in OpenMP [65] and in the Pthreads (POSIX threads) library [66]. The principle of barrier-halted execution also applies to devices in hardware. Different components in a hardware device may need to synchronize before continuing execution, just like their threaded software counterparts.

Connect Logic can express barriers using refoldT. With refoldT, we can develop barriers that can pause concurrently running hardware devices until all devices have reached the barrier.

Listing 6.1: A transformation to make any device in ReT a stalling device using the refoldT primitive. Here we use the isomorphic types Stall and Busy in place of a Maybe type.

Before constructing a barrier transformation, we need a method by which to make an arbitrary device a *stalling device*. Listing 6.1 defines the function makeStaller, which transforms a device in this way using the refoldT primitive. The refoldT function stalls a device if the input it is given is Stall. The makeStaller function exposes this functionality by extending the input type i of a device ReT i o I a to Stall i. Systems using a transformed device then have a method to pause it, which is to supply Stall instead of Continue a.
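The two wrapper types used by stalling devices can be sketched as follows. The constructor names are the ones that appear in Listing 6.2; the type declarations and the conversion functions below are our reconstruction for illustration, and the conversion function names are hypothetical. The conversions make the isomorphism with Maybe explicit.

```haskell
-- Input wrapper: either hold the device in place or let it consume a value.
data Stall i = Stall | Continue i deriving (Eq, Show)

-- Output wrapper: either no result yet or a finished result.
data Busy o = Busy | Complete o deriving (Eq, Show)

-- Both types carry the same information as Maybe.
stallToMaybe :: Stall i -> Maybe i
stallToMaybe Stall        = Nothing
stallToMaybe (Continue i) = Just i

maybeToStall :: Maybe i -> Stall i
maybeToStall Nothing  = Stall
maybeToStall (Just i) = Continue i

busyToMaybe :: Busy o -> Maybe o
busyToMaybe Busy         = Nothing
busyToMaybe (Complete o) = Just o

maybeToBusy :: Maybe o -> Busy o
maybeToBusy Nothing  = Busy
maybeToBusy (Just o) = Complete o
```

Using dedicated types rather than Maybe keeps the two roles distinct in type signatures: a Stall value is always a command sent to a device, and a Busy value is always a status reported by one.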

```
barrier :: ReT i1 (Busy o1) I a ->
           ReT i2 (Busy o2) I a ->
           ReT (i1, i2) (Busy (o1, o2)) I a
barrier d1 d2 = let dp = (makeStaller d1) <&> (makeStaller d2)
                in refold out inp dp

inp :: (Busy o1, Busy o2) -> (i1, i2) -> (Stall i1, Stall i2)
inp o (i1, i2) = case o of
    -- If neither device has produced output,
    -- keep allowing input to both
    (Busy, Busy)             -> (Continue i1, Continue i2)
    -- If the left device has produced, stall it
    (Complete l, Busy)       -> (Stall, Continue i2)
    -- If the right device has produced, stall it
    (Busy, Complete r)       -> (Continue i1, Stall)
    -- If both devices have produced,
    -- let them both continue
    (Complete l, Complete r) -> (Continue i1, Continue i2)

out :: (Busy o1, Busy o2) -> Busy (o1, o2)
out o = case o of
    (Busy, _)                -> Busy
    (_, Busy)                -> Busy
    (Complete a, Complete b) -> Complete (a, b)
```

Listing 6.2: Creating a barrier in ReWire for devices typed in ReT

In Listing 6.2 we define the barrier-constructing function, which acts on two different devices to produce a single device. The barrier function defined on lines 1-5 takes two devices that yield output in the type Busy o. It combines them into a single device that accepts a pair of inputs, one for each device, and yields output in the type Busy (o1,o2), where the types o1 and o2 correspond to the outputs of the internal devices. The barrier device yields output when both devices have produced a value. The device parameters to barrier are transformed to stalling devices by applying makeStaller to them and are then placed in parallel with the Connect Logic parallel combinator. The synchronization of the devices is managed by the inp input-processing function defined on lines 7-18. Once a device has produced output in the form of Complete x, we feed that device Stall to pause further execution until the other device has also produced output. Once both devices have produced output, both are allowed to proceed once again. In the same cycle in which the devices are allowed to proceed, the barrier yields both of their outputs.
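The behavior of the routing functions can be traced cycle by cycle with a toy model. The sketch below is not ReWire code and makes several assumptions: the `Worker` type, `observe`, `advance`, `step`, and `trace` are hypothetical stand-ins for stalling devices, modeled as counters that report Complete when they reach zero and hold their state while stalled. Only `inp` and `out` follow Listing 6.2.

```haskell
data Stall i = Stall | Continue i deriving (Eq, Show)
data Busy o  = Busy | Complete o deriving (Eq, Show)

-- Routing functions in the style of Listing 6.2.
inp :: (Busy o1, Busy o2) -> (i1, i2) -> (Stall i1, Stall i2)
inp o (i1, i2) = case o of
  (Busy, Busy)             -> (Continue i1, Continue i2)
  (Complete _, Busy)       -> (Stall, Continue i2)
  (Busy, Complete _)       -> (Continue i1, Stall)
  (Complete _, Complete _) -> (Continue i1, Continue i2)

out :: (Busy o1, Busy o2) -> Busy (o1, o2)
out o = case o of
  (Complete a, Complete b) -> Complete (a, b)
  _                        -> Busy

-- Hypothetical worker: cycles remaining in this round, and cycles per round.
data Worker = Worker { remaining :: Int, period :: Int }

-- A worker reports Complete exactly when its countdown has reached zero.
observe :: String -> Worker -> Busy String
observe name w
  | remaining w == 0 = Complete name
  | otherwise        = Busy

-- A stalled worker keeps its state; otherwise it counts down or restarts.
advance :: Stall () -> Worker -> Worker
advance Stall w = w
advance (Continue ()) w
  | remaining w == 0 = w { remaining = period w }
  | otherwise        = w { remaining = remaining w - 1 }

-- One cycle of the barrier around two workers.
step :: (Worker, Worker) -> (Busy (String, String), (Worker, Worker))
step (w1, w2) =
  let o        = (observe "a" w1, observe "b" w2)
      (c1, c2) = inp o ((), ())
  in (out o, (advance c1 w1, advance c2 w2))

-- The barrier's output over the first n cycles.
trace :: Int -> (Worker, Worker) -> [Busy (String, String)]
trace 0 _  = []
trace n ws = let (o, ws') = step ws in o : trace (n - 1) ws'
```

With a fast worker and a slow one, `trace 5 (Worker 1 1, Worker 3 3)` evaluates to `[Busy, Busy, Busy, Complete ("a","b"), Busy]`: the fast worker finishes after one cycle and is held by Stall for two further cycles until the slow worker completes, at which point the barrier yields both outputs and both workers start a new round.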

The barrier device demonstrates a critical application of refoldT: managed execution of ReWire components. While this example exhibits a basic usage of the Connect Logic combinator, it enables a very commonly used concurrency idiom found in software.

# 6.2 Triple Modular Redundancy

Electronic components can suffer from *single event upsets* (SEUs): non-destructive "soft errors" that change the state of a circuit to an erroneous one. These effects have been observed to be common in high-altitude or outer-space environments, where system failures can lead to disastrous results [67]. One approach to mitigating the risk of error propagation in mission-critical devices is redundancy. Redundancy regimes have been studied extensively for decades, and one of the most common, the Triple Modular Redundancy (TMR) regime, was initially conceived by von Neumann [68] and formalized by Lyons [1]. In TMR regimes, the core logic is replicated three times and each replica executes over the same input, yielding the same output in ideal conditions. The output of each redundant component is given to a voting process that determines the correct output by taking the mode of its inputs. In the event of an SEU corrupting one of the redundant components, the other two components still yield correct output to the voting process, and the correct result prevails.

Figure 6.2: A Functional Triple Modular-redundant system constructed using the ftmr transformation.

We can develop transformations on devices in ReWire using Connect Logic to enable TMR in a transparent and scalable way. We demonstrate a simple tmr transformation in Listing 6.3. In this example we rely on the Eq typeclass to provide a definition of equality for polymorphic types, but these could easily be instantiated in ReWire to monomorphic forms. Connect Logic makes this transformation on devices straightforward. We construct a voting transformation, voter, that applies majority-wins filtering to the output of three different devices. The tmr transformation replicates a given device three times using the voter transformation.

```
vote :: Eq a => ((a, a), a) -> a
vote ((a1, a2), a3) | a1 == a2  = a1
                    | a1 == a3  = a1
                    | a2 == a3  = a2
                    | otherwise = a1

fan :: a -> i -> ((i, i), i)
fan _ i = ((i, i), i)

voter :: Eq o => ReT i o I a ->
                 ReT i o I a ->
                 ReT i o I a ->
                 ReT i o I a
voter d1 d2 d3 = refold vote fan ((d1 <&> d2) <&> d3)

tmr :: Eq o => ReT i o I a -> ReT i o I a
tmr dev = voter dev dev dev
```

Listing 6.3: Simple Triple Modular Redundancy with Connect Logic
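As a worked check of the voting logic, the following sketch evaluates vote on replica outputs in which exactly one replica has been corrupted. The vote function is the one from Listing 6.3; the `upset` helper and the `survivesSingleUpset` property are hypothetical names introduced only for this illustration.

```haskell
vote :: Eq a => ((a, a), a) -> a
vote ((a1, a2), a3)
  | a1 == a2  = a1
  | a1 == a3  = a1
  | a2 == a3  = a2
  | otherwise = a1

-- Overwrite the output of one replica (0 = first, 1 = second, otherwise third).
upset :: Int -> a -> ((a, a), a) -> ((a, a), a)
upset 0 bad ((_, b), c) = ((bad, b), c)
upset 1 bad ((a, _), c) = ((a, bad), c)
upset _ bad ((a, b), _) = ((a, b), bad)

-- Corrupting any single replica does not change the voted result.
survivesSingleUpset :: Bool
survivesSingleUpset =
  and [ vote (upset k (99 :: Int) ((7, 7), 7)) == 7 | k <- [0, 1, 2] ]
```

Whichever single replica is corrupted, the other two still agree, so one of the first three guards selects the correct value. The otherwise branch is reached only when all three outputs disagree, in which case no majority exists and the first output is returned arbitrarily.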

There exist more sophisticated (and better) redundancy regimes. The simple TMR regime described by Listing 6.3 has an obvious flaw: the voting logic is not redundant. If an SEU were to affect the voting logic, then the correct results of the components would be for naught!

```
voteRed :: Eq a => ((a, a), a) -> ((a, a), a)
voteRed a = let v1 = vote a
                v2 = vote a
                v3 = vote a
            in ((v1, v2), v3)

ftmr :: Eq o => ReT i o I a -> ReT ((i, i), i) ((o, o), o) I a
ftmr dev = refold
             voteRed
             (\_ i -> i)
             ((dev <&> dev) <&> dev)

pipeline_ftmr :: (Eq z, Eq o) => ReT i z I a ->
                                 ReT z o I a ->
                                 ReT ((i, i), i) ((o, o), o) I a
pipeline_ftmr left right = refold
                             -- Output: the redundant output of the right stage
                             -- (the projection consistent with the result type)
                             (\(left_out, right_out) -> right_out)
                             -- Input: external input to the left stage,
                             -- left stage's output to the right stage
                             (\(left_out, right_out) inp ->
                                   (inp, left_out))
                             ((ftmr left) <&> (ftmr right))
```

Listing 6.4: Functional TMR [1] with redundant voting logic in Connect Logic.

In Listing 6.4 we replace the voter-based transformation with one based on the notion of Functional Triple Modular Redundancy (FTMR). This transformation on a given device dev is illustrated in Figure 6.2. In the ftmr transformation we transform a device by replicating it three times and routing a separate input to each replica. We compute the output by way of three redundant voting components in voteRed (these are pure logic, as indicated by the application of the pure function vote). Unlike our previous definition, the ftmr transformation changes the input and output types. This is a necessity: if we were to use logic to "merge" the outputs, this logic would create a single point of failure. We therefore require inputs as well as outputs to be redundant.

The usefulness of this choice is exhibited in the pipelining function pipeline\_ftmr. This is an alternative to the simple pipeline idiom discussed in Chapter 3, which sequences devices together by their inputs and outputs; pipeline\_ftmr additionally transforms both devices into FTMR form. As with pipeline, a designer could take any number of components, sequence them, and transform them so that the sequence is redundant at every stage. The result is a single device taking redundant inputs and yielding redundant outputs. A designer could then choose how to "merge" the outputs and "fan out" the inputs in a way consistent with their redundancy regime. Alternatively, we can apply ftmr to any number of devices and then combine them using the canonical pipeline transformation.

#### 6.3 Mutual Exclusion

Mutual exclusion locks are a useful primitive for guaranteeing that only one thread of execution has access to a critical section. A design using a mutex component with two concurrent devices is illustrated in Figure 6.3. We implement a non-blocking mutex for two hardware threads as a device with three states: unlocked, left-locked, and right-locked with each state corresponding to which argument, if any, has the mutex lock. In this implementation, the mutex encapsulates the critical section, which is a value (val) of type a.

Figure 6.3: Constructing a system that utilizes a mutex for protecting a value of type a. The mutex is a distinct device in this design and communicates with concurrent devices with synchronous connections via Connect Logic primitives.

```
data Req a = ReqLock | Release | Write a | NullReq
data Rsp = LockGrant | Ack | NullResp

unlocked :: (Req val, Req val) ->
            val ->
            ReT (Req val, Req val) (val, (Rsp, Rsp)) I ()
unlocked reqs val = case reqs of
    (ReqLock, _) -> do
        i <- signal (val, (LockGrant, NullResp))
        leftLocked i val
    (_, ReqLock) -> do
        i <- signal (val, (NullResp, LockGrant))
        rightLocked i val
    _ -> do
        i <- signal (val, (NullResp, NullResp))
        unlocked i val

leftLocked :: (Req val, Req val) ->
              val ->
              ReT (Req val, Req val) (val, (Rsp, Rsp)) I ()
leftLocked reqs val = case reqs of
    (Write v, _) -> do
        i <- signal (v, (Ack, NullResp))
        leftLocked i v
    (Release, _) -> do
        i <- signal (val, (Ack, NullResp))
        unlocked i val
    (ReqLock, _) -> do
        i <- signal (val, (LockGrant, NullResp))
        leftLocked i val
    _ -> do
        i <- signal (val, (LockGrant, NullResp))
        leftLocked i val

rightLocked :: (Req val, Req val) ->
               val ->
               ReT (Req val, Req val) (val, (Rsp, Rsp)) I ()
rightLocked reqs val = case reqs of
    (_, Write v) -> do
        i <- signal (v, (NullResp, Ack))
        rightLocked i v
    (_, Release) -> do
        i <- signal (val, (NullResp, Ack))
        unlocked i val
    (_, ReqLock) -> do
        i <- signal (val, (NullResp, LockGrant))
        rightLocked i val
    _ -> do
        i <- signal (val, (NullResp, LockGrant))
        rightLocked i val
```

Listing 6.5: A left-argument-biased mutex specification for two ReWire devices.
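The three mutually recursive states of Listing 6.5 can also be read as a pure transition function, which makes the left bias easy to test outside of ReT. The sketch below is a model under assumptions and not part of the ReWire implementation: the `Holder` type and the `step` and `run` functions are hypothetical names, and each call to `step` corresponds to one signal in the listing.

```haskell
data Req a = ReqLock | Release | Write a | NullReq deriving (Eq, Show)
data Rsp   = LockGrant | Ack | NullResp deriving (Eq, Show)

-- Which side, if any, currently holds the lock.
data Holder = Unlocked | LeftLocked | RightLocked deriving (Eq, Show)

-- One cycle: from the current state and a pair of requests, compute the
-- signaled output (value and responses) and the next state.
step :: (Holder, a) -> (Req a, Req a) -> ((a, (Rsp, Rsp)), (Holder, a))
step (Unlocked, val) reqs = case reqs of
  (ReqLock, _) -> ((val, (LockGrant, NullResp)), (LeftLocked, val))
  (_, ReqLock) -> ((val, (NullResp, LockGrant)), (RightLocked, val))
  _            -> ((val, (NullResp, NullResp)), (Unlocked, val))
step (LeftLocked, val) reqs = case reqs of
  (Write v, _) -> ((v, (Ack, NullResp)), (LeftLocked, v))
  (Release, _) -> ((val, (Ack, NullResp)), (Unlocked, val))
  _            -> ((val, (LockGrant, NullResp)), (LeftLocked, val))
step (RightLocked, val) reqs = case reqs of
  (_, Write v) -> ((v, (NullResp, Ack)), (RightLocked, v))
  (_, Release) -> ((val, (NullResp, Ack)), (Unlocked, val))
  _            -> ((val, (NullResp, LockGrant)), (RightLocked, val))

-- Feed a sequence of request pairs through the mutex, collecting outputs.
run :: (Holder, a) -> [(Req a, Req a)] -> [(a, (Rsp, Rsp))]
run _ []       = []
run s (r : rs) = let (o, s') = step s r in o : run s' rs
```

When both sides request the lock in the same cycle, the left side wins because its pattern is matched first. The right side's request is only granted once the left side has released the lock and the right side asks again.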

In Listing 6.6 we utilize the mutex previously defined in Listing 6.5. We place devices that make use of the mutex lock to execute in parallel with one another and we use refold to connect the devices together for requests and responses for the mutex lock.

The mutex device is external to the devices requesting use of the lock, so requests are delayed by a clock cycle. If a device requests a lock at time $t_0$, the mutex processes this request at time $t_1$, and the requesting device receives the response at clock cycle $t_2$. Thus, the device needs to do work in the intermediate cycle $t_1$ or otherwise stall while the mutex computes a response. This design choice modularizes our mutex, but comes at the cost of a lost cycle for devices waiting for a response. In certain situations this trade-off could be acceptable, but there are alternative approaches that sacrifice modularity to recover the lost clock cycle. We demonstrate such an approach in our semaphore implementation.

```
-- ... (the opening of devLeft is missing here)
                                   signal NullReq
                                   devLeft
                             _ -> do
                                   devLeft

devRight :: ReT Rsp (Req Int) I ()
devRight = do
    rsp <- signal ReqLock
    case rsp of
        LockGrant -> do
            rsp <- signal (Write 2)
            signal Release
            signal NullReq
            devRight
        _ -> do
            devRight

mutex :: ReT (Req Int, Req Int) (Int, (Rsp, Rsp)) I ()
mutex = unlocked (NullReq, NullReq) 0

device :: ReT () Int I ()
device = refold (\(_, (_, (i, _))) -> i)
                (\(lreq, (rreq, (oval, (lrsp, rrsp)))) () ->
                    (lrsp, (rrsp, (lreq, rreq))))
                (devLeft <&> (devRight <&> mutex))
```

Listing 6.6: Utilizing the mutex as a device in a closed system

## 6.4 Semaphore Constructions



Figure 6.4: A semaphore construction in ReWire. Four concurrent devices request access to a critical section with a semaphore in pure logic. The semaphore maintains its own state in a separate device. The critical section is kept abstract in this design and is illustrated with dashed borders and connections.

Semaphores generalize mutexes. We can use semaphores to regulate resource sharing in which more than one active process is allowed into a critical section. We implement a non-blocking semaphore in a K by N fashion, where K concurrent processes request access to N different resources. In the subsequent code listings, we demonstrate the implementation of a 4 by 2 semaphore. The first implementation treats the semaphore as a separate device with a longer latency for requesting access to the critical section. The revised version of the semaphore, illustrated in Figure 6.4, combines semaphore logic with devices to eliminate request latency. This implementation is a non-blocking semaphore: devices requesting a lock get an immediate response, but may not successfully receive a lock and must act accordingly in response. We note that blocking implementations are also possible utilizing refoldT in a manner similar to the barrier implementation seen earlier in this chapter.

```
-- Two slots to work with
data Count = Z | One | Two

-- Connection Priority
data Priority = C0 | C1 | C2 | C3

-- Protocol
data Req = NRq | P | V
data Rsp = NRp | Ack

inc :: Count -> Count
inc c = case c of
    Z   -> One
    One -> Two
    Two -> Two

dec :: Count -> Count
dec c = case c of
    Z   -> Z
    One -> Z
    Two -> One

adv :: Priority -> Priority
adv p = case p of
    C0 -> C1
    C1 -> C2
    C2 -> C3
    C3 -> C0
```

Listing 6.7: Types for a 2-semaphore device implementation.

We define our data types and helper functions in Listing 6.7. The Count data type tracks how many slots are available in the critical section. The Priority data type names each concurrent process, which we use to manage who has the highest priority when requesting an available slot. The Req and Rsp data types are our request and response protocols. The functions manage the counter of available slots and advance, in round-robin fashion, which process has the highest request priority.

```
-- Rotations for Round-Robin priority
rotate :: Priority -> (a,a,a,a) -> (a,a,a,a)
rotate p (l0,l1,l2,l3) = case p of
    C0 -> (l0,l1,l2,l3)
    C1 -> (l3,l0,l1,l2)
    C2 -> (l2,l3,l0,l1)
    C3 -> (l1,l2,l3,l0)

rotate' :: Priority -> (a,a,a,a) -> (a,a,a,a)
rotate' p (l0,l1,l2,l3) = case p of
    C0 -> (l0,l1,l2,l3)
    C1 -> (l1,l2,l3,l0)
    C2 -> (l2,l3,l0,l1)
    C3 -> (l3,l0,l1,l2)

lock :: Count -> (Req,Req,Req,Req) -> (Count,(Rsp,Rsp,Rsp,Rsp))
lock count reqs = case count of
    Two -> case reqs of
        -- Two Requests
        (P,P,_,_) -> (Z,(Ack,Ack,NRp,NRp))
        (P,_,P,_) -> (Z,(Ack,NRp,Ack,NRp))
        (P,_,_,P) -> (Z,(Ack,NRp,NRp,Ack))
        (_,P,P,_) -> (Z,(NRp,Ack,Ack,NRp))
        (_,P,_,P) -> (Z,(NRp,Ack,NRp,Ack))
        (_,_,P,P) -> (Z,(NRp,NRp,Ack,Ack))
        -- One Request
        (P,_,_,_) -> (One,(Ack,NRp,NRp,NRp))
        (_,P,_,_) -> (One,(NRp,Ack,NRp,NRp))
        (_,_,P,_) -> (One,(NRp,NRp,Ack,NRp))
        (_,_,_,P) -> (One,(NRp,NRp,NRp,Ack))
        -- No Requests
        _         -> (count,(NRp,NRp,NRp,NRp))
    One -> case reqs of
        (P,_,_,_) -> (Z,(Ack,NRp,NRp,NRp))
        (_,P,_,_) -> (Z,(NRp,Ack,NRp,NRp))
        (_,_,P,_) -> (Z,(NRp,NRp,Ack,NRp))
        (_,_,_,P) -> (Z,(NRp,NRp,NRp,Ack))
        -- No Requests
        _         -> (count,(NRp,NRp,NRp,NRp))
    Z   -> (Z,(NRp,NRp,NRp,NRp))

yield :: Count -> (Req,Req,Req,Req) -> Count
yield count reqs = let c1 = inc count
                   in case reqs of
        -- Two Requests
        (V,V,_,_) -> Two
        (V,_,V,_) -> Two
        (V,_,_,V) -> Two
        (_,V,V,_) -> Two
        (_,V,_,V) -> Two
        (_,_,V,V) -> Two
        -- One Request
        (V,_,_,_) -> c1
        (_,V,_,_) -> c1
        (_,_,V,_) -> c1
        (_,_,_,V) -> c1
        -- No Requests
        _         -> count
```

Listing 6.8: Pure functions for managing semaphore state and incoming requests

In Listing 6.8 we define pure functions that manage locking and the rotating priority of processes requesting semaphore locks. The two rotate functions are inverses of one another. The lock function handles incoming lock requests, giving the leftmost argument the highest priority. We use the rotation functions to alternate which process sits in the leftmost position of the 4-tuple and thus has the highest priority. The yield function adjusts the count based on how many devices yield their lock on the semaphore.
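The inverse relationship between the two rotation functions can be checked directly. The following is a minimal sketch, not the dissertation's code: the four-valued `Priority` type and its constructor names `P0` to `P3` are assumptions made for illustration, and the bodies of `rotate` and `rotate'` are one plausible reading of "rotate the 4-tuple by the current priority".

```haskell
-- Hypothetical stand-in for the semaphore's priority type (assumed names).
data Priority = P0 | P1 | P2 | P3
  deriving (Eq, Show, Enum, Bounded)

-- Advance the priority cyclically.
adv :: Priority -> Priority
adv p = if p == maxBound then minBound else succ p

-- Rotate left so that the element selected by the priority
-- ends up in the leftmost (highest-priority) position.
rotate :: Priority -> (a, a, a, a) -> (a, a, a, a)
rotate p (a, b, c, d) = case p of
  P0 -> (a, b, c, d)
  P1 -> (b, c, d, a)
  P2 -> (c, d, a, b)
  P3 -> (d, a, b, c)

-- Rotate right by the same amount, undoing rotate.
rotate' :: Priority -> (a, a, a, a) -> (a, a, a, a)
rotate' p (a, b, c, d) = case p of
  P0 -> (a, b, c, d)
  P1 -> (d, a, b, c)
  P2 -> (c, d, a, b)
  P3 -> (b, c, d, a)

-- rotate' p undoes rotate p for every priority value.
inverseHolds :: Bool
inverseHolds =
  and [ rotate' p (rotate p t) == t | p <- [minBound .. maxBound] ]
  where
    t = (0 :: Int, 1, 2, 3)
```

Under these assumptions, responses computed on the rotated requests can be mapped back to their originating devices with `rotate'`, which is how the rotation functions are used around the lock function.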

```
inp <- signal (rotate' pri rsps)
sem inp final_count (adv pri)
```

Listing 6.9: The first semaphore device implementation. A stand-alone semaphore device.

Like the mutex before, we implement a discrete semaphore device. Each cycle the count is updated first, then the requests are rotated by priority and responses are computed by the lock function. The results are signaled and the next cycle resumes with an advanced priority and the next set of incoming requests. Similarly to the mutex implementation, using this device to manage other devices incurs a clock cycle delay from request to response.

```
dev0, dev1, dev2, dev3 :: ReT Rsp Req I ()

countDev :: ReT Count Count I ()
priorityDev :: ReT Priority Priority I ()

devs :: ReT (Rsp, Rsp, Rsp, Rsp) (Req, Req, Req, Req) I ()
devs = refold (\(d0, (d1, (d2, d3))) -> (d0, d1, d2, d3))
              (\_ -> \(d0, d1, d2, d3) -> (d0, (d1, (d2, d3))))
              (dev0 <&> (dev1 <&> (dev2 <&> dev3)))

system :: ReT () () I ()
system = refold (const ())
                (\out -> \_ -> case out of
                    ((count, priority), reqs) ->
                      let reqsp  = rotate priority reqs
                          countp = yield count reqs
                      in case lock countp reqsp of
                           (final_count, rsps) ->
                             ((final_count, ()), rotate' priority rsps))
                ((countDev <&> priorityDev) <&> devs)
```

Listing 6.10: The second semaphore device implementation. A semaphore integrated in primarily pure logic refolded with its constituent devices.

As an alternative to discrete devices with cycle delays for communication, we introduce another implementation of the semaphore that integrates the functionality with its four processes using refold and parallel functions. In this example we introduce two additional devices countDev and priorityDev that save a value for a clock cycle. Where we saved values as function parameters in the previous example, we feed them forward into these respective devices here using Connect Logic. The result is a refold over the parallel devices that looks very similar to the definition of the sem function seen in Listing 6.9. The result is an integrated 4-device semaphore without the one clock cycle delay between request and response.
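The feed-forward pattern described above can be illustrated with a small executable model. This is a sketch under stated assumptions: `Dev` is a simplified stand-in for `ReT i o I ()` (a device that emits an output and then waits for an input), and `par`, `refold`, `reg`, and `run` are illustrative definitions written for this model, not ReWire's own.

```haskell
-- Simplified model of a non-terminating device: emit an output,
-- then wait for the next input. Not ReWire's ReT definition.
data Dev i o = Dev o (i -> Dev i o)

-- Run two devices in lockstep (models the parallel combinator).
par :: Dev i1 o1 -> Dev i2 o2 -> Dev (i1, i2) (o1, o2)
par (Dev o1 k1) (Dev o2 k2) =
  Dev (o1, o2) (\(i1, i2) -> par (k1 i1) (k2 i2))

-- Re-route a device's interface with pure logic (models refold):
-- f reshapes the output; g builds the inner input from the
-- current inner output and the external input.
refold :: (o1 -> o2) -> (o1 -> i2 -> i1) -> Dev i1 o1 -> Dev i2 o2
refold f g (Dev o k) = Dev (f o) (\i -> refold f g (k (g o i)))

-- A register that holds a value for one clock cycle,
-- in the spirit of countDev and priorityDev.
reg :: a -> Dev a a
reg x = Dev x reg

-- State is fed forward into the register by the routing logic
-- instead of being carried as a function parameter.
accum :: Dev Int Int
accum = refold id (\total x -> total + x) (reg 0)

-- Test harness: feed a list of inputs and collect the outputs.
run :: Dev i o -> [i] -> [o]
run (Dev o _) []       = [o]
run (Dev o k) (i : is) = o : run (k i) is
```

In this model `run accum [1, 2, 3]` evaluates to `[0, 1, 3, 6]`: the routing function computes the next register value combinationally from the register's current output and the external input, so no extra cycle is spent communicating with a separate device.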

# 6.5 Segmentation

We utilize techniques exhibited in the previous examples to implement a policy-enforcing memory controller. The concept of this controller is illustrated in Figure 6.5.



Figure 6.5: Constructing a segmented memory controller from a high and low security processor. Memory requests made by processors are screened by a smart bus controller.

In this scenario, we have two processors making concurrent requests to a single memory module with a single input bus. The processors exist in two different domains: a high domain that can read all of memory while only writing to its own region of memory, and a low domain that can only read and write in its own region.

We begin with a rough sketch of the requirements of such a segmentation device. First, requests made simultaneously by both processors need to be handled so that one request prevails, but no processor is starved. Second, the device must "proxy" valid requests to the memory unit itself. Responses from the memory unit return on some subsequent cycle. We assume that memory read responses return on the immediately subsequent cycle. This is generally not the case in real-world settings, where a response may take one or more cycles to return, but our specification can be easily modified to support any type of read/write latency from a memory unit. We begin by defining types and basic devices in Listing 6.11.

#### 6.5.1 Data Types

```
 1 module Segmenter where

 3 import Types
 4 import Data.Word

 6 type Address = Word32
 7 type Data    = Word8

 9 data MemAcc = NoReq | Read Address | Write Address Data
10 data MemRsp = NoRsp | Success | Retry | ReadResult Data

12 data RspMask = NoRes
13              | Written   -- Notifies memory written
14              | Busy      -- Device busy, reattempt
15              | ReadRes   -- Result of Read

17 data Priority = C0 | C1

19 adv :: Priority -> Priority
20 adv p = case p of
21           C0 -> C1
22           C1 -> C0
```

Listing 6.11: Types and helper functions for a memory segmenter

In this implementation we assume a byte-addressable memory unit with 32-bit addresses. A memory access is represented by MemAcc, which encodes a read (Read), a write (Write), or the absence of a request (NoReq). Responses to processors are the data constructors of MemRsp on Line 10. These indicate no action (NoRsp), a successful write (Success), a signal to retry because the bus is busy (Retry), and the result of a read operation (ReadResult). Memory requests and responses occur on different clock cycles. This example makes the simplifying assumption that the result of a memory access is available on the clock cycle immediately following the request. The RspMask type is an internal encoding of how to handle the response from the memory module. The "response mask" is fed to the response master device from the request master device (both detailed later). The response master device considers the response masks and routes the memory response, as a MemRsp, to each requesting device. The Priority type tracks which requesting device has top priority in the event of simultaneous requests. The priority is rotated after every conflict is resolved, so the losing device has first priority in the next conflict.

#### 6.5.2 Security Policy Functions

```
policyH :: MemAcc -> (MemAcc, RspMask)
policyH req = case req of
    --No Request
    r@(NoReq)        -> (r, NoRes)
    --A read request reads from any address
    r@(Read _)       -> (r, ReadRes)
    --A Write Request
    r@(Write addr _) -> if addr >= 0x7FFFFFFF
                          --Valid address range
                          then (r, Written)
                          --Illegal address range
                          else (NoReq, NoRes)

policyL :: MemAcc -> (MemAcc, RspMask)
policyL req = case req of
    --No Request
    r@(NoReq)        -> (r, NoRes)
    --Read Request
    r@(Read addr)    -> if addr < 0x7FFFFFFF
                          --Valid read range
                          then (r, ReadRes)
                          --Invalid read
                          else (NoReq, NoRes)
    --Write Request
    r@(Write addr _) -> if addr < 0x7FFFFFFF
                          --Valid write
                          then (r, Written)
                          --Invalid write
                          else (NoReq, NoRes)
```

Listing 6.12: Policy functions for a memory bus master

In Listing 6.12 we define memory access policies as two functions. One policy is for the high security domain (policyH) and the other is for the low security domain (policyL). The high domain policy function restricts writes to the upper half of the addressable memory bank while the low policy restricts reads and writes to the lower half. The policies are defined as transformations on memory accesses given by processor devices. Each policy function yields a pair consisting of a memory access and a response mask to be fed to the response master. If the request is not allowed by the policy, it is treated as no request and silently fails.
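Both policies share one shape: a request passes through unchanged if its address is permitted and is otherwise replaced by NoReq. A minimal self-contained sketch of that shape follows. The helper `policy`, the name `boundary`, and the `deriving` clauses are assumptions introduced here for illustration; the types mirror Listing 6.11 and the address bound is the one used in Listing 6.12.

```haskell
import Data.Word (Word32, Word8)

type Address = Word32
type Data    = Word8

-- Types as in Listing 6.11; the deriving clauses are added for testing.
data MemAcc  = NoReq | Read Address | Write Address Data deriving (Eq, Show)
data RspMask = NoRes | Written | Busy | ReadRes          deriving (Eq, Show)

-- The address bound that both policies compare against.
boundary :: Address
boundary = 0x7FFFFFFF

-- Hypothetical helper: permit a request only when its address
-- satisfies the read or write predicate; otherwise drop it silently.
policy :: (Address -> Bool) -> (Address -> Bool)
       -> MemAcc -> (MemAcc, RspMask)
policy _ _ NoReq                             = (NoReq, NoRes)
policy canRead _ r@(Read a)     | canRead a  = (r, ReadRes)
policy _ canWrite r@(Write a _) | canWrite a = (r, Written)
policy _ _ _                                 = (NoReq, NoRes)

-- High domain: read anywhere, write only at or above the boundary.
policyH :: MemAcc -> (MemAcc, RspMask)
policyH = policy (const True) (>= boundary)

-- Low domain: read and write only below the boundary.
policyL :: MemAcc -> (MemAcc, RspMask)
policyL = policy (< boundary) (< boundary)
```

For example, `policyH (Write 0x10 7)` evaluates to `(NoReq, NoRes)` because the write targets the low region, while `policyL (Write 0x10 7)` passes the request through with a `Written` mask.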

#### 6.5.3 The Request Master



Figure 6.6: The request master component. Up to two requests are received in a cycle, are processed by the policy functions and then scheduled. A single memory request is sent to the memory module while response masks are sent to the response master component.

The request master component is one of two subcomponents that comprise the memory segmenting device. A block diagram of this device is illustrated in Figure 6.6. This component handles inbound memory requests from two processors. The requests are checked by the policy functions and, in the event of two simultaneous requests, a winning request is selected by the scheduler. Response masks are sent to the response master subcomponent, which is detailed later.

```
 1 reqMaster_ :: Priority
 2            -> (MemAcc, MemAcc)
 3            -> ReT (MemAcc, MemAcc)
 4                   (MemAcc, (RspMask, RspMask))
 5                   I ()
 6 reqMaster_ p reqs =
 7   case reqs of
 8     (NoReq, NoReq) -> do
 9                         i <- signal (NoReq, (NoRes, NoRes))
10                         reqMaster_ p i
11     (req, NoReq)   -> do
12                         let (acc, rsp) = policyH req
13                         i <- signal (acc, (rsp, NoRes))
14                         reqMaster_ p i
15     (NoReq, req)   -> do
16                         let (acc, rsp) = policyL req
17                         i <- signal (acc, (NoRes, rsp))
18                         reqMaster_ p i
19     (high, low)    -> case p of
20                         C0 -> do
21                                 let (acc, rsp) = policyH high
22                                 i <- signal (acc, (rsp, Busy))
23                                 reqMaster_ (adv p) i
24                         C1 -> do
25                                 let (acc, rsp) = policyL low
26                                 i <- signal (acc, (Busy, rsp))
27                                 reqMaster_ (adv p) i
28 --Here we initialize the request master
29 reqMaster :: ReT (MemAcc, MemAcc) (MemAcc, (RspMask, RspMask)) I ()
30 reqMaster = reqMaster_ C0 (NoReq, NoReq)
```

Listing 6.13: Definitions for the request master function. The transition function is given by reqMaster\_ and the initialized device is given by reqMaster.

The code for the request master device is given in Listing 6.13. The first three branches of the case statement (lines 8-18) handle cases where a single request or no request is made and the scheduling mechanism is not invoked. The final case handles a contention for the bus. If the priority is C0, the high processor gets the bus; otherwise the low processor wins. The priority is then advanced so the loser will win the next contention. In this specification, if a winning processor makes an illegal request during a contention, it will win the contention, but no read or mutation will occur on the memory module.
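The scheduling behavior can be isolated from the policy checks. The sketch below abstracts each processor's request to a Boolean "is requesting" flag and tracks only who is granted the bus. The `Grant` type and the functions `arbitrate` and `grants` are hypothetical names introduced for this illustration; `Priority` and `adv` follow Listing 6.11.

```haskell
data Priority = C0 | C1 deriving (Eq, Show)

adv :: Priority -> Priority
adv C0 = C1
adv C1 = C0

-- Hypothetical result type: who was granted the bus this cycle.
data Grant = Nobody | High | Low deriving (Eq, Show)

-- One scheduling step. The priority only changes when both
-- processors request the bus in the same cycle.
arbitrate :: Priority -> (Bool, Bool) -> (Priority, Grant)
arbitrate p  (False, False) = (p, Nobody)
arbitrate p  (True,  False) = (p, High)
arbitrate p  (False, True)  = (p, Low)
arbitrate C0 (True,  True)  = (adv C0, High)
arbitrate C1 (True,  True)  = (adv C1, Low)

-- Thread the priority through a sequence of cycles.
grants :: Priority -> [(Bool, Bool)] -> [Grant]
grants _ []       = []
grants p (r : rs) = let (p', g) = arbitrate p r in g : grants p' rs
```

With both processors requesting on two consecutive cycles, `grants C0 [(True, True), (True, True)]` yields `[High, Low]`, showing that the loser of one contention wins the next and neither processor is starved.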

#### 6.5.4 The Response Master

The response master is the second half of the memory segmenting component. The block diagram for this subcomponent is illustrated in Figure 6.7. When a memory request is made to the memory module, we send response masks to the response component to indicate how the result from the memory unit should be handled in the next clock cycle. Every clock cycle, the response master reads the masks and the data from the memory unit (if necessary) and signals responses to each processor accordingly.



Figure 6.7: The response master component. The response master component computes the responses to send to both processors based on the output from the memory module and the response masks in a given cycle.

```
 1 rspMaster_ :: (Data, (RspMask, RspMask))
 2            -> ReT (Data, (RspMask, RspMask))
 3                   (MemRsp, MemRsp) I ()
 4 rspMaster_ (dta, (hmask, lmask)) =
 5   case (hmask, lmask) of
 6     (NoRes, NoRes)   -> do
 7                           r <- signal (NoRsp, NoRsp)
 8                           rspMaster_ r
 9     (ReadRes, NoRes) -> do
10                           r <- signal (ReadResult dta, NoRsp)
11                           rspMaster_ r
12     (ReadRes, Busy)  -> do
13                           r <- signal (ReadResult dta, Retry)
14                           rspMaster_ r
15     (NoRes, ReadRes) -> do
16                           r <- signal (NoRsp, ReadResult dta)
17                           rspMaster_ r
18     (Busy, ReadRes)  -> do
19                           r <- signal (Retry, ReadResult dta)
20                           rspMaster_ r
21     (Written, NoRes) -> do
22                           r <- signal (Success, NoRsp)
23                           rspMaster_ r
24     (Written, Busy)  -> do
25                           r <- signal (Success, Retry)
26                           rspMaster_ r
27     (NoRes, Written) -> do
28                           r <- signal (NoRsp, Success)
29                           rspMaster_ r
30     (Busy, Written)  -> do
31                           r <- signal (Retry, Success)
32                           rspMaster_ r
33     --All pathological cases result
34     --in no responses to both processors.
35     pathological     -> do
36                           r <- signal (NoRsp, NoRsp)
37                           rspMaster_ r

39 rspMaster :: ReT (Data, (RspMask, RspMask))
40                  (MemRsp, MemRsp) I ()
41 rspMaster = rspMaster_ (0, (NoRes, NoRes))
```

Listing 6.14: Definitions for the response master. The transition function is given by rspMaster\_ and the initialized device is given by rspMaster.

The code for the response master device is given in Listing 6.14. The device operates on an input tuple that includes data (typed Data) from the memory module and a pair of response masks (typed (RspMask,RspMask)), as given by the input type of the device on Line 2. The device outputs a pair of MemRsp responses to be fed to the requesting processors. The case statement beginning on Line 5 scrutinizes the pair of response masks and acts on all valid pairs. A valid pair consists of one "acting" mask (i.e. a read or a write) paired with either a non-request mask (NoRes) or a "busy" mask (Busy), the latter implying that the other device lost a contention and should retry; the pair (NoRes, NoRes), in which neither processor made a request, is also handled explicitly. All valid pairs are explicit branches in this case statement and all non-listed pairs are considered pathological cases. If a pathological case occurs then no response is sent to either processor.
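The notion of a valid mask pair can be stated as a predicate, which makes it easy to confirm that exactly nine pairs are valid. The following sketch restates the response computation as a pure function. The names `acting`, `valid`, `toRsp`, and `respond` are assumptions introduced here, and the `deriving` clauses are added only to allow comparison in tests.

```haskell
import Data.Word (Word8)

type Data = Word8

data RspMask = NoRes | Written | Busy | ReadRes          deriving (Eq, Show)
data MemRsp  = NoRsp | Success | Retry | ReadResult Data deriving (Eq, Show)

-- An "acting" mask records that a read or write was issued.
acting :: RspMask -> Bool
acting m = m == Written || m == ReadRes

-- Valid pairs: both idle, or exactly one acting mask whose
-- partner is NoRes or Busy.
valid :: (RspMask, RspMask) -> Bool
valid (NoRes, NoRes) = True
valid (h, l)         = acting h /= acting l

-- Translate a single mask into the response for that processor.
toRsp :: Data -> RspMask -> MemRsp
toRsp _   NoRes   = NoRsp
toRsp _   Written = Success
toRsp _   Busy    = Retry
toRsp dta ReadRes = ReadResult dta

-- Pure counterpart of the case statement in the response master:
-- pathological pairs produce no response for either processor.
respond :: Data -> (RspMask, RspMask) -> (MemRsp, MemRsp)
respond dta (h, l)
  | valid (h, l) = (toRsp dta h, toRsp dta l)
  | otherwise    = (NoRsp, NoRsp)
```

Under these definitions `respond 42 (ReadRes, Busy)` evaluates to `(ReadResult 42, Retry)`, and a pair such as `(Busy, Busy)` falls through to `(NoRsp, NoRsp)`.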

#### 6.5.5 Composing the Bus Master

The bus master of the memory segmenting component is composed from the request and response master subcomponents using the parallel and refold combinators. We illustrate this in the code given in Listing 6.15.

```
inputSelect
(reqMaster <&> rspMaster)
```

Listing 6.15: The bus master is composed from the request and response master. We use routing logic in the functions outputSelect and inputSelect in a refold over the parallelized reqMaster and rspMaster devices.

The bus master is the top-level definition of the memory segmenting device. It is given by busMaster in Listing 6.15. We compose it by placing reqMaster and rspMaster in parallel with the <&> combinator and refolding over the combined device with the routing functions inputSelect and outputSelect. We note that the input and output types of the busMaster definition on Line 10 encapsulate the interconnections between the two subcomponents. That is, the response masks are kept internal and are not available for external interfacing in this definition. The bus master takes as input Data from an external memory module together with a pair of memory access requests. It yields a single memory access request (MemAcc) to the external memory module as well as responses to the processors. The high-level processor is represented by the leftmost memory access and response while the low-level processor is on the right.
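The routing described above is determined almost entirely by the types involved. The following sketch shows one pair of routing functions consistent with the stated interfaces: reqMaster consumes the request pair and emits a memory access with masks, and rspMaster consumes memory data with masks and emits the response pair. These bodies are a reconstruction from the types and are not taken from the original listing; the type variables stand for MemAcc, the mask pair, the response pair, Data, and the request pair respectively.

```haskell
-- Expose the memory access and the processor responses;
-- the response masks stay internal to the bus master.
outputSelect :: ((acc, masks), rsps) -> (acc, rsps)
outputSelect ((acc, _), rsps) = (acc, rsps)

-- Route the external input: requests go to the request master,
-- while memory data and the masks just produced by the request
-- master go to the response master.
inputSelect :: ((acc, masks), rsps) -> (dta, reqs) -> (reqs, (dta, masks))
inputSelect ((_, masks), _) (dta, reqs) = (reqs, (dta, masks))
```

Because the masks emitted in one cycle become the response master's input for the next, the responses line up with the memory data returned one cycle after the request, matching the latency assumption made earlier in this section.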

#### 6.5.6 Using the Segmenter with Processors

The bus master is a stand-alone component in ReWire that we can use by interfacing it with two processor devices and a memory module. We illustrate a use case with a modified ReWire DLX implementation in Listing 6.16.

```
 1 memory :: ReT MemAcc Data I ()
 2 proc   :: ReT (Instr, MemRsp) (NextInst, MemAcc) I ()
```

```
 4 type SystemOut = ((NextInst, MemAcc),
 5                   ((NextInst, MemAcc),
 6                    ((MemAcc, (MemRsp, MemRsp)), Data)))

 8 type SystemIn = ((Instr, MemRsp),
 9                  ((Instr, MemRsp),
10                   ((Data, (MemAcc, MemAcc)), MemAcc)))

12 systemOut :: SystemOut
13           -> (NextInst, NextInst)
14 systemOut ((nextInstH, _), ((nextInstL, _), _)) = (nextInstH, nextInstL)

17 systemIn :: SystemOut
18          -> (Instr, Instr)
19          -> SystemIn
20 systemIn ((nextInstH, memAccH), ((nextInstL, memAccL),
21            ((memAccM, (memRspH, memRspL)), dta))) (instH, instL) =
22            ((instH, memRspH), ((instL, memRspL),
23             ((dta, (memAccH, memAccL)), memAccM)))

25 system :: ReT (Instr, Instr) (NextInst, NextInst) I ()
26 system = refold
27            systemOut
28            systemIn
29            (parI proc (parI proc (parI busMaster memory)))
```

Listing 6.16: Using the bus master to interface two processors to a memory module unit in ReWire.

In Listing 6.16 we define a system composed of two identical processors proc, a busMaster, and a memory module memory. The memory module has an input type MemAcc and an output type Data to be compatible with the bus master. The memory module takes a request on cycle n and, if the request is a read, returns the data response on cycle n+1. The memory module makes no impositions or restrictions on requests. This is handled by the segmentation bus master. The processors are a modified version of the ReWire DLX implementation from Chapter 9 that is compatible with this bus master. The specification is modular enough that making this change, or other similar changes with regard to memory units, is trivial. The types SystemOut and SystemIn on Lines 4-10 are the "raw" output and input types of the device that results from composing all the subcomponents together in parallel, as is done on Line 29.

The functions systemIn and systemOut are the routing functions used in our refold of the parallelized components. They operate on the raw input and output of the combined devices. The function systemOut selects the outputs from the combined devices that are meant for external interfacing. The output type of the combined system of processors, bus, and memory is (NextInst,NextInst), the addresses of the next instructions to be fetched. These fetched instructions are the input of the system ((Instr,Instr)) listed in the type on Line 25. The function systemIn routes the external input and internal outputs between the combined devices. The memory accesses and responses are routed between the processors, bus, and memory unit, which is encapsulated by the refold on Line 26. We define the composed and refolded device as system on Line 25. This definition encapsulates the memory module used for reading and writing, but leaves an interface for two separate program memory modules for the high and low-level processors. At a given cycle two fetched instructions are provided as inputs ((Instr,Instr)) and two addresses are yielded for the next instruction fetch ((NextInst,NextInst)).

# Chapter 7

# Case Study: Regular Expression Compilation

The following chapter is from an accepted paper (Applied Reconfigurable Computing 2015) on regular expression compilation in ReWire. This paper demonstrates the use of ReWire to compile domain specific languages to regular expression pattern matchers, primarily for use in network packet inspection, and compares the results to the state of the art [55]. Additionally, we demonstrate a novel and effective method for domain specific language-driven device specification. In the synthesis experiments described in this paper, ReWire is able to generate VHDL specifications that match or exceed the state of the art.

#### 7.1 Abstract

Although FPGAs have the potential to bring software-like flexibility and agility to the hardware world, designing for FPGAs remains a difficult task divorced from standard software engineering norms. A better programming flow would go far towards realizing the potential of widely deployed, programmable hardware. We propose a general methodology based on domain specific languages embedded in the functional language Haskell to bridge the gap between high level abstractions that support programmer productivity and the need for high performance in FPGA circuit implementations. We illustrate this methodology with a framework for regular expression to hardware compilers, written in Haskell, that supports high programmer productivity while producing circuits whose performance matches and, indeed, exceeds that of a state of the art, hand-optimized VHDL-based tool. For example, after applying a novel optimization pass, throughput increased an average of 28.3% over the state of the art tool for one set of benchmarks. All code discussed in the paper is available online [69].

#### 7.2 Introduction

FPGAs are notably difficult to program and this has motivated research into high-level synthesis (HLS) from high level programming languages and, in particular, from domain-specific languages [63]. This language-based approach is attractive because of its potential to make hardware engineering more like software engineering with its support for modularity, reuse, and abstraction, and thereby create a wider group of developers for programmable hardware. This paper describes a methodology for deriving performant hardware implementations directly from high-level functional embedded domain-specific languages (EDSL).

This work makes the following contributions. We present ReWire [70], a subset of the Haskell functional language as a compiler target for compiling domain-specific languages to FPGAs. We show that ReWire can be effectively used as a compiler target because it supports the compilation of large input programs (over 100K LOC) and can generate competitively fast hardware implementations versus state of the art, domain-specific tools.

These contributions comprise a methodology supporting the "three P's" [34] for programming reconfigurable hardware: productivity, performance and portability. DSLs address the first two P's directly because domain specialization supports programmer productivity and, furthermore, allows aggressive optimization of domain-specific idioms. Portability is achieved by using ReWire, a retargetable language for specifying hardware devices.

New language constructs raise issues with respect to performance. Is there a performance price to be paid and, if so, is the increased expressiveness worth it? Does the increased expressiveness enable better performance and programmer productivity? In light of these questions, we evaluate our methodology via two case studies. The case studies presented here consider a purely functional framework for constructing regular expression to hardware compilers (REHC), called RexHacc (for "Regular Expression HArdware compiler-compiler"). RexHacc is an EDSL-structured compiler-compiler, implemented in Haskell, for Perl-compatible regular expressions (PCRE) similar to those seen in popular intrusion detection systems (e.g., Snort [71]).

#### Overview of Methodology.



Figure 7.2: Combining the ease of use of traditional EDSLs with the power and run-time performance of a virtualized language.

The methodology factors the problem of HLS into a series of translations between EDSLs. An EDSL is a domain-specific language that is defined as a collection of constructs within an existing high-level language. The methodology is illustrated in Fig. 7.1. A problem domain can be realized as a DSL embedded in Haskell. DSL cross-compilers targeting ReWire enable synthesis onto an FPGA via the ReWire compiler. Sec. 7.3 presents a more in-depth discussion of our methodology.

Figure 7.1: FP Methodology for HLS

The case studies involve regular expression to hardware compilation (see Fig. 7.2) in which we generate artifacts that perform as well as and often better than state of the art approaches. The case studies reported here consider the problem domain of regular expression to hardware compilers (REHC) [54]. Following Fig. 7.1, we developed a reusable and modular framework for REHC called *RexHacc* and demonstrated that circuits produced with it meet or exceed the performance of state-of-the-art REHC.

The RexHacc Framework. We performed an experiment in which we compared the performance of RexHacc to that of the state-of-the-art REHC of Becchi and Crowley [60] (henceforth reg2vhdl) on its own benchmarks. The goal is to demonstrate both the productivity gain and high performance achievable via our methodology in the construction and testing of compilers generated by RexHacc. The presentation here is deliberately high-level. We suppress the definitions of functions and data types; the code is online [69].

The entry point for RexHacc is the function rexhacc with Haskell type:

The declaration form "::" is pronounced "has type". The function rexhacc takes two inputs, an optimization function (of type NFA a -> NFA a) as well as a regular expression (of type RegEx a). The type NFA a (resp., RegEx a) represents non-deterministic finite automata (resp., regular expressions) over an alphabet of type a. A regular expression compiler is generated with RexHacc by applying the top-level rexhacc function to an optimization pass, opt:

Each  $o_i$  is an optimization pass of functional type NFA a  $\rightarrow$  NFA a, all of which are composed using Haskell's function composition operator (i.e., the infix ".") into a single pass. This composition corresponds to the middle box in Fig. 7.2 and each  $o_i$  is a phase inside that box. The generated compiler takes a regular expression over an alphabet of type a and converts it into an NFA a, which is then fed to the optimization pass opt. The optimization pass produces an NFA a from which ReWire code is generated. The ReWire output from this compiler can either be translated into VHDL by the ReWire compiler or executed as software in any standard Haskell environment.
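Composing optimization passes with "." can be illustrated on a toy automaton type. This is a sketch only: the `NFA` representation below (a bare transition list in which `Nothing` labels an ε-transition), the `Pass` synonym, and the passes `dedup` and `dropEpsLoops` are hypothetical stand-ins and are not RexHacc's definitions.

```haskell
import Data.List (nub)

-- Toy NFA: a list of (source, label, destination) transitions,
-- where Nothing labels an epsilon transition. Assumed for illustration.
newtype NFA a = NFA [(Int, Maybe a, Int)] deriving (Eq, Show)

-- An optimization pass maps an NFA to an equivalent NFA.
type Pass a = NFA a -> NFA a

-- Toy pass: remove duplicate transitions.
dedup :: Eq a => Pass a
dedup (NFA ts) = NFA (nub ts)

-- Toy pass: remove epsilon self-loops, which never affect matching.
dropEpsLoops :: Pass a
dropEpsLoops (NFA ts) = NFA [ t | t@(s, l, d) <- ts, not (isEps l && s == d) ]
  where
    isEps Nothing = True
    isEps _       = False

-- A single pass built with function composition; the rightmost
-- pass runs first.
opt :: Eq a => Pass a
opt = dedup . dropEpsLoops

-- The same idea for an arbitrary list of passes.
composePasses :: [Pass a] -> Pass a
composePasses = foldr (.) id
```

Because every pass has the same type, passes can be reordered, added, or removed without touching the rest of the compiler, which is the property the framework relies on.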



Figure 7.3: Maximum throughput for the tcp25 benchmark, comparing reg2vhdl and the RexHacc case study compilers (Secs. 7.4 and 7.5). Parameter k indicates stride length (Sec. 7.4). Case study 2 shows an average of 28.3% throughput increase over reg2vhdl.

Summary of Case Study Results. Secs. 7.4 and 7.5 each describe the definition of an REHC in the RexHacc framework. Each case study was tested against reg2vhdl using existing test suites [60] with respect to standard metrics for circuit size, clock speed and throughput (see Fig. 7.3). The first case study (Sec. 7.4) implements the same optimization passes as reg2vhdl, and it was clear that this compiler generally matched or exceeded the performance of the hand-optimized compiler reg2vhdl with a tiny increase in circuit size. It was observed that one of the benchmarks (tcp25) seemed to be particularly challenging for both the first case study compiler and reg2vhdl with respect to throughput. This observation motivated the second case study (Sec. 7.5), which improves on the first with an (apparently novel) optimization pass that results in better performance than reg2vhdl on the tcp25 benchmark.

# 7.3 A Methodology for Synthesis from Functional EDSLs

Synthesis from pure functional languages (e.g., Haskell, www.haskell.org) is appealing because combinational hardware is functional in nature, functional languages have powerful features supporting programmer productivity (e.g., modularity, expressive data types, static type inference, etc.), and the absence of side effects (e.g., destructive update) simplifies synthesis. But general purpose functional languages also contain a number of features that cannot be represented in hardware (e.g., general recursion and garbage collection) and this makes HLS directly from existing functional languages more challenging.

ReWire [70] is a proper *sublanguage* of Haskell—i.e., any ReWire program is a Haskell program, but not all Haskell programs are ReWire programs. ReWire programs, in contrast with general purpose functional languages like Haskell, are always synthesizable to hardware. ReWire restricts Haskell by disallowing the use of higher-order functions and general recursion at runtime (though techniques like partial evaluation may enable their use at compile time). RexHacc uses the ReWire hardware compiler as a back-end for producing VHDL implementations.

#### Front End.

The RexHacc compilation process begins with a collection of regular expressions written in Perl-compatible regular expression (PCRE) syntax. We use the parser combinator library Parsec in Haskell to parse the regular expressions in the source file. The regular expression is converted to the NFA type via a textbook translation of regular expressions to NFAs [53]. The resulting NFA is passed to the optimization portion of the compilation chain.

Figure 7.4: An NFA and its corresponding Sidhu and Prasanna-style implementation.

#### Simulating Circuits in Haskell.

Because ReWire is a sublanguage of Haskell, we can execute ReWire code as software in any Haskell environment with a test harness for executing reactive resumptions. The implementation of rexhacc was tested and debugged using a test harness in Haskell which is included in the code base [69].
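To make the idea of a test harness for reactive resumptions concrete, the sketch below defines a small reactive resumption monad and a function that drives a device with a list of inputs. It is an illustration under stated assumptions: `Re` omits the underlying monad that ReWire's `ReT` transformer carries, and `simulate` and `parity` are hypothetical names; this is not the harness from the code base.

```haskell
-- A reactive resumption either finishes with a value or pauses,
-- emitting an output and waiting for the next input.
data Re i o a = Done a | Pause o (i -> Re i o a)

instance Functor (Re i o) where
  fmap f (Done a)    = Done (f a)
  fmap f (Pause o k) = Pause o (fmap f . k)

instance Applicative (Re i o) where
  pure = Done
  Done f    <*> r = fmap f r
  Pause o k <*> r = Pause o (\i -> k i <*> r)

instance Monad (Re i o) where
  Done a    >>= f = f a
  Pause o k >>= f = Pause o (\i -> k i >>= f)

-- Emit an output and receive the next input.
signal :: o -> Re i o i
signal o = Pause o Done

-- Harness: feed inputs one per cycle and collect the outputs.
simulate :: Re i o a -> [i] -> [o]
simulate (Done _)    _        = []
simulate (Pause o _) []       = [o]
simulate (Pause o k) (i : is) = o : simulate (k i) is

-- Example device: outputs the running parity of the input bits.
parity :: Bool -> Re Bool Bool ()
parity acc = do
  b <- signal acc
  parity (acc /= b)
```

Running `simulate (parity False) [True, True, False, True]` produces `[False, True, False, False, True]`: the first output is the initial state and each later output reflects the bits consumed so far.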

# 7.4 Case Study 1: Matching State of the Art

We undertake the construction of a tool equivalent in functionality to the state of the art [60] (reg2vhdl) in order to examine the feasibility of duplicating this functionality with our approach. The purpose of this case study is to demonstrate the ease with which such a tool can be constructed. The optimizations were chosen to match those of Becchi and Crowley [60] and include head zipping, striding, alphabet compression, and epsilon elimination. Our results indicate that the rexhacc-based compiler compares favorably to and often surpasses reg2vhdl where throughput is concerned, and area utilization is similarly competitive. Each optimization phase was implemented in a few dozen lines of Haskell code; this is a rough indication that the amount of programmer effort required is small.

- Head zipping. Head zipping is a transformation that merges outbound transitions from a state that have the same transition labels. Nodes with more than one inbound transition are not head zipped because this would result in a non-equivalent NFA. Head zipping is performed by merging the destination nodes of the matching transitions into one node that includes all of the outbound transitions from the merged nodes.
- Striding. Striding is an optimization pass that doubles the number of characters an NFA matches at each transition. Striding traverses the graph's edges and, looking two transitions ahead from each state, converts two-transition sequences to a single transition consuming two characters.
- Alphabet compression. Alphabet compression is a technique that increases sharing of logic by exploiting the identical treatment of different characters by an NFA. If two characters always result in the same transitions between all states, then these characters are compressed into one character class.
- Epsilon elimination. Eliminating ε-transitions reduces the complexity and size
  of NFAs and simplifies code generation. NFAs with ε-transitions allow state
  transitions without consuming input. States connected to an NFA solely by
  ε-transitions can be eliminated. Eliminating unnecessary states reduces the
  number of flip flops required to implement the NFA on an FPGA. A textbook
  ε-elimination algorithm is used [53].
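To make the alphabet compression pass above concrete, the following is a minimal sketch and not the rexhacc implementation. It assumes a hypothetical edge-list NFA representation (`Edge`) and hypothetical helper names (`signature`, `charClasses`). Two characters fall into the same class exactly when they label the same set of (source, destination) state pairs.

```haskell
import Data.List (nub, sort)

-- ASSUMPTION: a toy NFA given as labelled edges (source, label, destination).
type State = Int
type Edge  = (State, Char, State)

-- The "signature" of a character: every (source, destination) pair it labels,
-- in canonical (sorted, duplicate-free) form.
signature :: [Edge] -> Char -> [(State, State)]
signature es c = sort (nub [(p, d) | (p, c', d) <- es, c' == c])

-- Partition the alphabet into classes of characters with equal signatures.
-- Classes are listed in order of the first character that belongs to them.
charClasses :: [Char] -> [Edge] -> [[Char]]
charClasses alphabet es = go alphabet
  where
    go []       = []
    go (c : cs) = (c : same) : go rest
      where
        sig  = signature es c
        same = [x | x <- cs, signature es x == sig]
        rest = [x | x <- cs, signature es x /= sig]
```

For the edges `[(0,'a',1), (0,'b',1), (1,'c',2)]` over the alphabet `"abcd"`, the characters `a` and `b` share one class, while `c` and `d` (which labels no transition) each form their own.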

#### Experiments and Evaluation.

To test the performance of RexHacc, we selected three benchmark sets of regular expressions from the literature [55,60]. Snort24 is a set of 24 regular expressions drawn from the Snort network intrusion detection system [71]. Tcp25 is a set of 79 regular expressions designed to match malicious SMTP traffic, also drawn from the Snort NIDS. Bro217 is a set of 217 regular expressions drawn from the Bro NIDS [72]. Matchers for each of these benchmarks were generated using reg2vhdl, as well as RexHacc. Each benchmark was tested at stride lengths k = 1, k = 2, and k = 4, producing circuits that consume input streams at one, two, and four bytes per clock cycle. The resulting VHDL was then synthesized using Xilinx's XST synthesis tool for the Xilinx Spartan-3E XC3S500E FPGA, speed grade -4. The synthesis tools were set to optimize for speed. The frequencies that we list are synthesis estimates.

Fig. 7.5 compares the resulting circuits in terms of three performance metrics: (a) logic slice utilization, (b) LUT utilization, and (c) maximum throughput as measured in megabits per second. (Flip flop utilization was extremely close between the two tools and thus is not shown.) RexHacc compares favorably with reg2vhdl on virtually all fronts.

Figure 7.5: Performance comparisons of RexHacc to the reg2vhdl tool (here, "r2v").

#### Throughput.

RexHacc matches or exceeds reg2vhdl's total throughput for all but one of the nine benchmarks. In the best case (benchmark bro217, k = 1) throughput is around 60% higher. In the worst case (benchmark tcp25, k = 2) throughput is around 13% lower. Both tools, in all cases, are capable of processing input at a rate of more than 1 Gbit/sec. In the best case, RexHacc is capable of handling input rates up to 7.5 Gbit/sec on a Xilinx Spartan-3E FPGA at a relatively low clock rate. Tests on a Xilinx 7-series platform (not presented here, but available online [69]) indicate that throughputs of up to 25 Gbit/sec are achievable with a more modern FPGA.
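Since a k-strided matcher consumes k bytes per clock cycle, throughput and clock frequency are related by simple arithmetic. The sketch below spells out that relation under the idealized assumption that one stride is consumed on every cycle. The function names are hypothetical, and the 100 MHz figure in the usage note is an illustrative value, not a measured result.

```haskell
-- Throughput in Mbit/s of a matcher consuming k bytes per clock cycle
-- at a clock frequency given in MHz (8 bits per byte).
throughputMbps :: Int -> Double -> Double
throughputMbps k fMHz = 8 * fromIntegral k * fMHz

-- Clock frequency in MHz needed to sustain a target throughput in Mbit/s
-- at stride k.
requiredMHz :: Int -> Double -> Double
requiredMHz k mbps = mbps / (8 * fromIntegral k)
```

For example, `throughputMbps 4 100` evaluates to 3200 Mbit/s: at stride k = 4, each cycle contributes 32 bits, which is why striding can raise throughput even when the clock rate stays modest.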

#### Logic utilization.

With the exception of the single-strided (k=1) benchmarks, LUT utilization for RexHacc-generated circuits ranged from 88% to 116% of their reg2vhdl counterparts. In the specific case where k=1, RexHacc tends to produce circuits with higher LUT counts (up to 219% higher), suggesting that the combinational next-state logic produced by the RexHacc code generator is more complicated for these circuits. For all benchmarks, flip flop utilization for RexHacc was close to, but slightly higher than, the results generated by reg2vhdl. This is not surprising since each state in the NFA is represented by a single flip flop, and both tools tend to generate similar numbers of NFA states. RexHacc, however, pays a small penalty here, because it generates output signals synchronously, storing them in flip flops, while reg2vhdl does not. Please note, however, that the choice of synchronous outputs rather than asynchronous ones is optional in the most recent version of ReWire.

The results exhibited here suggest that the case study compiler is competitive with the state of the art. The extra flexibility of the modular, purely functional design does not come at a prohibitive cost in terms of circuit size, and indeed brings substantial benefits with respect to throughput.

# 7.5 Case Study 2: Surpassing State of the Art

In this case study, we demonstrate the *agility* of the RexHacc approach by identifying an opportunity for an optimization, and rapidly implementing that optimization as a compiler phase in RexHacc. The modular nature of RexHacc made it easy both to identify a key performance bottleneck, and to implement a new optimization pass to address it.

#### Identifying the bottleneck.

While conducting the experiments of Sec. 7.4, we noticed that one of the benchmarks, tcp25, stood out for its relatively low maximum throughput when processed by RexHacc as well as by reg2vhdl. While striding enabled our compiler to produce circuits with maximum throughput in excess of 6 Gbit/sec for snort24 and bro217, maximum throughput for tcp25 just barely exceeded 4 Gbit/sec. The throughput advantage over reg2vhdl observed for snort24 and bro217 was essentially nonexistent for tcp25.

To explore the reasons for this, we instrumented our compiler pipeline by using the Haskell Functional Graph Library's built-in support for generating graph visualizations via GraphViz (www.graphviz.org). We observed that the tcp25 NFA exhibited a structural feature that was not present in the snort24 and bro217 NFAs. Specifically, the tcp25 NFA contained one state that had a large number of inbound transitions. A simplified example of this problem is exhibited in Fig. 7.6 (left), where state 9 has eight inbound transitions. A large number of inbound transitions emerges when the source regular expression contains a long chain of choice operators. This pattern is not uncommon in packet inspection rulesets (e.g., consider a long chain of alternative filenames followed by the common suffix ".exe").

In the circuit implementation the inbound transitions translate to a large fan-in of signals that must be ORed together to determine whether to activate that state. As the size of this fan-in grows large, the combinational logic involved begins to dominate the critical path of the circuit. The result is a sharp reduction in maximum operating clock frequency, and therefore throughput. This suggested an opportunity for optimization: namely, to transform the NFA in such a way as to reduce the number of inbound transitions to heavily-loaded states.

Figure 7.6: NFA for (a|b|c|d|e|f|g|h)z, before state splitting (left) and after (right).

#### State Splitting Optimization.

To address the performance bottleneck, we extended the compiler of Sec. 7.4 with an optimization called *state splitting*. Suppose we have in our NFA a state s with inbound transitions  $e_1, \dots, e_n$ , and assume without loss of generality that s has no self-loops. Observe that we can produce an equivalent NFA by "splitting" s in two: that is, introducing a new state (call it s'), and reassigning half of the inbound transitions (say,  $e_1, \dots, e_{\lceil n/2 \rceil}$ ) to s' instead of s. State splitting works by applying this transformation to each node whose indegree exceeds a certain fixed threshold t. Fig. 7.6 (right) illustrates the results of applying state splitting to the NFA for t = 2. N.b., the maximum indegree has been reduced from 8 to 2 in this example.

The reader may note that this optimization may have the effect of *increasing* the number of inbound transitions for successor states of split nodes. This is generally not a problem for two reasons: first, as long as state splitting succeeds in reducing the *maximum* indegree, it is likely to pay off even if some states see their number of inbound transitions increased. Second, state splitting may be iterated; if the splitting of state  $s_1$  results in state  $s_2$  exceeding the split threshold,  $s_2$  itself may be split.
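The transformation can be sketched as follows. This is an illustrative sketch only and is not the splitStates function from the code base [69]. It assumes a hypothetical edge-list NFA representation without self-loops, and it copies the outbound transitions of s to s' so that the accepted language is unchanged. Accepting-state flags are omitted here; they would have to be copied in the same way. The `fuel` parameter is an assumption of this sketch that bounds the number of splits, since iterated splitting need not terminate on cyclic graphs with very small thresholds.

```haskell
import Data.List (nub)

-- ASSUMPTION: a toy NFA given as labelled edges (source, label, destination).
type State = Int
type Edge  = (State, Char, State)

states :: [Edge] -> [State]
states es = nub (concat [[p, d] | (p, _, d) <- es])

indegree :: [Edge] -> State -> Int
indegree es s = length [() | (_, _, d) <- es, d == s]

maxIndegree :: [Edge] -> Int
maxIndegree es = maximum (0 : map (indegree es) (states es))

-- Split state s using the fresh state s': the first ceil(n/2) inbound edges
-- of s are redirected to s', and every outbound edge of s is copied to s'.
splitState :: State -> State -> [Edge] -> [Edge]
splitState s s' es = kept ++ moved ++ copied
  where
    inbound = [e | e@(_, _, d) <- es, d == s]
    toMove  = take ((length inbound + 1) `div` 2) inbound
    kept    = filter (`notElem` toMove) es
    moved   = [(p, c, s') | (p, c, _) <- toMove]
    copied  = [(s', c, d) | (p, c, d) <- es, p == s]

-- Repeatedly split some state whose indegree exceeds the threshold t,
-- performing at most `fuel` splits.
splitStatesSketch :: Int -> Int -> [Edge] -> [Edge]
splitStatesSketch fuel t es
  | fuel <= 0 = es
  | otherwise =
      case [s | s <- states es, indegree es s > t] of
        []      -> es
        (s : _) -> splitStatesSketch (fuel - 1) t (splitState s fresh es)
  where
    fresh = maximum (states es) + 1
```

Applied with t = 2 to the edges of the NFA for (a|b|c|d|e|f|g|h)z, that is, eight edges from state 0 into state 9 followed by a single z edge from 9 to 10, this sketch reduces the maximum indegree from 8 to 2. Because successors of split states gain inbound edges, the final state is split as well, which is the effect discussed in the preceding paragraph.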

The full code for the state-splitting optimization, consisting of 17 lines of code, is given as the splitStates function in the code base [69]. We can insert state splitting into the optimization pipeline simply by adding an extra phase to the rexhacc call; this is an instance of (‡) from Sec. 7.2:



Figure 7.7: Comparisons of RexHacc with state splitting enabled to reg2vhdl (here, "r2v") tool.

# 7.6 Conclusions and Future Work

This research is a substantial case study utilizing the ReWire compiler at scale. ReWire is a subset of Haskell limited in expressive power to ensure the synthesizability of every ReWire program. There is a potential drawback to such restrictions: they exclude many powerful functional programming idioms. In spite of this potential drawback, we demonstrate that ReWire maintains sufficient expressiveness to support the design and implementation of high level DSLs for specifying fast hardware accelerators. Future work aims to improve the resource usage of ReWire-generated devices by optimizing ReWire's code generation stages.

The methodology leverages the intrinsic power of Haskell and functional programming. RexHacc is modular and customizable in the sense that optimization passes can be easily added and removed. Because the ordering of passes is exposed as function composition in Haskell, it is easy to experiment with the ordering of optimizations. A RexHacc-generated compiler can be instrumented in a straightforward manner, as we did with GraphViz, and can take advantage of existing external Haskell tools.

The flexibility of the RexHacc framework derives from the cross-compilation to ReWire and the ability of ReWire to generate VHDL synthesizable to efficient circuits. The methodology we have introduced lowers the barrier to entry for reconfigurable computing for functional programmers. At the same time, it provides an opportunity for hardware designers to leverage the power of the functional paradigm to improve productivity. The choice of a purely functional language does not come at a performance cost: our benchmarking demonstrates that we match or exceed the performance of a state-of-the-art hand-tuned compiler for a number of real-world tests.

The two research directions we are pursuing have to do with increasing the expressiveness of the type system to support metaprogramming and hardware security. The current methodology is based on metaprogramming (i.e., ReWire/Haskell programs are generated by Haskell programs) and there are type systems for staged programming (e.g., MetaML [73]) that we believe will improve programmer productivity further while automatically enforcing type safety. We developed a type system for enforcing fault isolation on ReWire [74] and we are currently extending it to information flow security.

# 7.7 Acknowledgments

The authors would like to thank Jason Agron of Intel Corporation and David Andrews of the University of Arkansas for their helpful feedback.

# Chapter 8

# Case Study: Implementing the Salsa20 Cipher

This chapter describes the design and implementation of the Salsa20 stream cipher algorithm using ReWire and Connect Logic. It emphasizes equational reasoning to prove correct the pipelining transformation described in Chapter 3. We implement an iterative and pipelined form of Salsa20 here and show that the implementation is performant with good resource characteristics.

# 8.1 Abstract

There is a semantic gap between the hardware description languages used to design and implement hardware and the languages and logics used to formally specify and verify them. Bridging this gap (i.e., constructing formal models from existing hardware artifacts) can be costly, time-consuming, and error prone, and yet it is utterly necessary if formal verification is to proceed. This work demonstrates that this gap can be collapsed by starting in a pure functional language that is also a hardware description language, and that equational style verifications may be performed directly on the source text of a hardware design, thereby significantly lowering the verification cost for reconfigurable designs. When combined with an efficient compiler, this methodology achieves both good performance and low cost verification.

# 8.2 Introduction

Reconfigurable computing emphasizes a "mix and match" approach to system construction, frequently involving specially tailored "one off" components. Formal methods can provide high confidence that systems obey critical properties (e.g., safety and security), but, by reputation, they can also involve a substantial investment of time and effort. Formal methods may, therefore, seem somewhat antithetical to reconfigurable computing. Can it make economic sense to invest the resources for formal methods on potentially "one off" reconfigurable systems?

The proposed methodology aims to make hardware verification cost effective for reconfigurable designs via a functional programming language that also serves as a hardware description language. The principal hypothesis of this research is that following this methodology can significantly reduce the effort of verifying hardware designs, thereby making formal verification cost effective for reconfigurable computing. The functional language, ReWire [52], plays a dual rôle, serving for both hardware description and formal specification. We support this hypothesis with a demonstration of the approach in which the stream cipher Salsa20 [75] is implemented efficiently in ReWire and verified using equational reasoning on the implementation source code.

In the functional programming community, equational reasoning about programs frequently goes by the moniker "Bird-Wadler style" (so named for the influential textbook [76]). Functional programmers reason about source programs in an equational style, by substituting equals for equals, simplifying, applying induction and coinduction, etc. Equational reasoning is commonly used to justify, among other things, source-to-source transformations and program correctness. This is precisely what we use Bird-Wadler reasoning for in this paper, although, in ReWire, programs are hardware descriptions.

This research demonstrates that formal methods and reconfigurable systems are not antithetical to one another at all. The contributions of this paper are as follows. (1) We describe a methodology for developing high assurance, reconfigurable systems leveraging pure functional languages and equational reasoning. A standard practice in functional programming—Bird-Wadler reasoning—is repurposed to hardware design with this methodology. (2) We introduce an extension to ReWire called Connect Logic, which consists of domain specific language abstractions for hardware devices that support a mixture of functional and structural design styles. (3) Encapsulation of a pipelining structuring technique in Connect Logic is exhibited along with (4) several performant implementations of the Salsa20 stream cipher based on it.

#### Reconfigurable Salsa20 without ReWire

Consider the following experiment. A hardware designer decides to implement the Salsa20 cipher in hardware. There are a number of good reasons to do so, not the least of which is that reconfigurable hardware can increase the possible throughput compared to a software implementation. The hardware designer uses a tried and true hardware description language (HDL) like VHDL or Verilog. The implementation path is straightforward: she implements Bernstein's defining equations [75] in terms of the HDL and performs her usual development process involving synthesis, simulation, and testing.

**Theorem** (Fib). For all  $n \ge 0$ , fib(n) = fst(fib2(n)).

Figure 8.1: Bird-Wadler Program Development

This first implementation is one step removed from Bernstein's high-level specification, and, furthermore, is expressed in a language without a formal semantics. So, how does she prove that the first implementation is correct? It becomes clear to the hardware engineer that the first implementation does not suffice: even implemented in the most optimized fashion, it contains too many gates for most FPGAs. So, the hardware engineer produces a second implementation structured in an explicitly pipelined form resulting in a circuit that fits on her FPGA.

Is she all done? Not if formal proof is required that the second implementation is correct. The second implementation is two steps removed from Bernstein's high level specification and it is written in a language without a formal semantics. To verify its correctness, where does she even start? She could attempt to verify the implementation by encoding it in the logic of a theorem prover, but observe that this involves yet another translation, and one which is not straightforward. With this approach, how can we be sure that her logical specification faithfully relates Bernstein's high-level specification to a VHDL implementation?

#### Bird-Wadler Provably Correct Development

To illustrate the formal methodology we advocate for reconfigurable computing, consider first this classic example (p.131, [76]) of Bird-Wadler style equational reasoning in Fig. 8.1. On the left is the usual recursive definition of the Fibonacci function. It serves as a reference specification defining the meaning of the Fibonacci function, but it has terrible  $O(2^n)$  performance. The other version of the Fibonacci function on the right is in an optimized, "accumulator-passing style" form with O(n) performance.

The hallmark of Bird-Wadler development is that there is a reference specification (e.g., fib) and one or more transformations from it (e.g., into fib2) that give rise to an equational verification (e.g., the Fib theorem in Fig. 8.1). This verification justifies using the optimized version (i.e., replacing fib(n) with fst(fib2(n))).
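For readers without the figure at hand, the two definitions can be sketched in Haskell along the following lines. This is a reconstruction of the standard textbook definitions and may differ in detail from the listing in Fig. 8.1. The helper `fibTheoremUpTo` is a hypothetical name introduced here to test the Fib theorem on an initial segment of the naturals. Both functions assume n ≥ 0.

```haskell
-- Reference specification: the usual doubly recursive definition,
-- which takes exponential time.
fib :: Integer -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

-- Optimized version: fib2 n computes the pair (fib n, fib (n + 1))
-- with a single recursive call, i.e. in O(n) steps.
fib2 :: Integer -> (Integer, Integer)
fib2 0 = (0, 1)
fib2 n = (b, a + b)
  where
    (a, b) = fib2 (n - 1)

-- Executable spot check of the Fib theorem for all 0 <= n <= m.
fibTheoremUpTo :: Integer -> Bool
fibTheoremUpTo m = and [fib n == fst (fib2 n) | n <- [0 .. m]]
```

The equational proof strengthens the theorem to fib2 n = (fib n, fib (n + 1)) and proceeds by induction on n. In the inductive step, the hypothesis gives (a, b) = (fib n, fib (n + 1)), so fib2 (n + 1) = (b, a + b) = (fib (n + 1), fib n + fib (n + 1)) = (fib (n + 1), fib (n + 2)). A test such as `fibTheoremUpTo 15` only checks finitely many instances; the induction is what establishes the theorem for all n.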

#### Provably Correct Development of Salsa20 with ReWire

It is precisely the Bird-Wadler style of development that ReWire enables for reconfigurable computing. Fig. 8.4 presents the hash function from the Salsa20 stream cipher [75] represented in a Haskell-like syntax. We discuss this figure in some detail as well as explain the requisite Haskell syntax in subsequent sections. It suffices to say that Fig. 8.4 contains a functional program defining the Salsa20 hash function that also serves as the high-level reference specification in the Bird-Wadler development presented in our case study. To render it into a synthesizable form, we add some Connect Logic annotations to produce the ReWire code in Fig. 8.5. The ReWire compiler can now synthesize a circuit for Salsa20. This new implementation can now be measured in two ways: against standard performance metrics as in Table 8.1 or by verifying that it produces the same answers as the reference specification (Theorem 1). The first ReWire implementation is now rewritten using pipelining constructs also written in Connect Logic (the ten and twenty stage pipelines in Figures 8.6 and 8.7, resp.). The correctness of the pipelining transformation is given in Theorem 2.

Section 8.3 introduces Connect Logic and the pipelining structuring technique applied and verified in Sections 8.4 and 8.5, resp. Section 8.6 summarizes and concludes. Most of the subject matter in this paper relates to provably correct development of reconfigurable hardware rather than to more traditional areas of reconfigurable computing. The targeted audience for the paper is, however, the reconfigurable computing community and so considerable effort has been made to make the paper as self-contained as possible.

# 8.3 Connect Logic in ReWire

Connect Logic has operations for composing and connecting smaller devices to create larger ones. Sec. 8.3.2 below introduces Connect Logic at a high level; for reasons of space, a semantic treatment of Connect Logic is left for future work. We then illustrate the use of Connect Logic via the design of a pipelining transformation for ReWire. Section 8.3.1 gives background information on pure functional languages and equational verification.



Figure 8.2: Device Constructors

#### 8.3.1 Pure Functional Languages & Equational Verification

#### Primer on Haskell/ReWire Syntax

For the sake of being as self-contained as possible, this section presents a quick overview of Haskell—and, hence, ReWire—syntax necessary to understand this paper.

Haskell [5] is a strongly-typed, purely functional language. A Haskell program consists of a number of function and datatype declarations. The type of a function from type a to type b is written,  $a \rightarrow b$ . The type for a tuple with first and second components a and b, resp., is written (a, b). The fact that a Haskell expression e has type a is written e :: a. Haskell has a built-in list type constructor: [a] is the type of all lists of elements of type a. Because of Haskell's lazy evaluation strategy, lists can have an infinite number of elements—such lists are also called streams.

Below are a number of function declarations. The simplest function is the identity function, which takes its argument and simply returns it.
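A few small declarations of this kind are sketched here. These are illustrative examples written for this overview rather than a reproduction of the original listing, and the names `identity`, `swapPair`, and `naturals` are hypothetical.

```haskell
-- The identity function: returns its argument unchanged, at any type a.
identity :: a -> a
identity x = x

-- A function on tuples: the type (a, b) pairs a first and second component.
swapPair :: (a, b) -> (b, a)
swapPair (x, y) = (y, x)

-- An infinite list (a stream). Lazy evaluation means only the demanded
-- prefix is ever computed, e.g. by `take 3 naturals`.
naturals :: [Integer]
naturals = [0 ..]
```

Each declaration pairs a type signature, written with `::`, with a defining equation, which is the shape used for every function in this chapter.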

In Haskell/ReWire, we can introduce new datatypes with the data keyword. In the following declarations, Quad and Hex are type constructors that, given any type a, construct new types, Quad a and Hex a, resp. To construct a value of a datatype, apply a data constructor; the data constructors below are Q and H. For example, the value Q 1 2 3 4 is of type Quad Int; we write this type declaration as Q 1 2 3 4 :: Quad Int. A Bit is either High or Low.

```
data Quad a = Q a a a a
data Hex a = H a a a a a a a a a a a a a a a a
data Bit = High | Low
```

ReWire has built-in types for words. A 32-bit (128-bit) word belongs to the type W32 (W128). For example, a value of type ( $Quad\ W32$ ) has the form ( $Q\ w1\ w2\ w3\ w4$ ), which is nothing more than four 32-bit words.

#### Purity and Equational Verification

Haskell (and, hence, ReWire) is a pure language, which is a critical foundation for equational reasoning. Purity means that the type of a Haskell program faithfully represents its value and behavior. If a Haskell function has type  $Int \rightarrow Int$ , then the function takes an Int as input and produces an Int as output. Furthermore, we can conclude that the function possesses no side effects whatsoever because, in Haskell, side effects are reflected accurately in the types. The expression (print "Hello World"), for instance, prints out Hello World to the prompt and, therefore, (print "Hello World") :: IO (); that is, it produces the unit value, (), and its type is tagged with IO, meaning that it performs input/output in some form.

To prove an equation, e = e', one starts from e and "replaces equals for equals" until e' is reached. In symbols, this proof is  $e = e_1 = e_2 = \cdots = e_n = e'$  in which each step is justified by a known equation x=y, as in "replace x in  $e_i$  by y to obtain  $e_{i+1}$ ". Purity supports this style of reasoning because all Haskell expressions are side-effect free and so cannot interact unpredictably with the expressions into which they are substituted.
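As a small worked instance, assume the standard defining equations $id\ x = x$ and $(g \circ f)\ x = g\ (f\ x)$. The equation $(id \circ f)\ x = f\ x$ then follows in two replacement steps:

$$(id \circ f)\ x \;=\; id\ (f\ x) \;=\; f\ x$$

The first step replaces the left-hand side of the defining equation for $\circ$ by its right-hand side, and the second does the same for $id$. No step depends on when or how often $f\ x$ is evaluated, which is exactly what purity guarantees.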

#### 8.3.2 Extending ReWire with Connect Logic

This section presents the ReWire operators for the compositional construction of devices from other devices. We refer to these particular operators as "Connect Logic". Connect Logic enables two or more existing devices to be composed in parallel and connected together. Connect Logic supports a compositional style of hardware design akin to structural VHDL. Formulating the design of a hardware device may be accomplished as in previous work [52] (i.e., without Connect Logic), or, existing devices may be composed with Connect Logic operations into bigger devices.

There is a type constructor Dev for synchronous devices in ReWire. There are three basic architectural constructors that Connect Logic adds to the ReWire language. The first, iter, constructs a synchronous device from a pure function from inputs to outputs. The second,  $\langle \& \rangle$ , composes two devices in parallel. The third, refold, is a recursion operator that is used to interconnect devices and/or express feedback loops (i.e., feed back device outputs to inputs).

#### Types for Devices

There is one basic unit of Connect Logic, devices, for which we introduce the following type:  $Dev\ i\ o$  for any types i and o. A term of type  $Dev\ i\ o$  represents a clocked computation that, for each clock cycle, takes an input of type i, produces an output of type o, and may possess internal storage. We eschew the formal definition of  $Dev$  as it is unnecessary to understanding Connect Logic and its uses. Device d is clocked, as illustrated in the inset figure. The clock is represented by the underlying structure of  $Dev\ i\ o$ , rather than as an explicit parameter. A device is created in Connect Logic by either iterating a function or through composition of existing devices. We introduce operators for constructing devices and composing them into larger, interconnected devices. All Connect Logic operations are constructors for Dev, meaning that they are functions producing  $Dev\ i\ o$  values for some i and o types.



#### Iteration

The most basic Connect Logic constructor, iter, iterates a pure function of type  $i \rightarrow o$ , producing an output corresponding to the input at each clock cycle. The Haskell definition of iter is as follows:

Fig. 8.2(a) illustrates the device created with the *iter* operation. The type declaration above means that *iter* is a device constructor that takes a function from inputs i to outputs o and an initial output value and constructs a corresponding device. The device  $(iter\ f\ o)$  will, at the first clock cycle, return output o and, in the next clock cycle after consuming an input i, will produce a new output,  $(f\ i)$ . This pattern repeats recursively ad infinitum. The  $(signal\ o)$  operator outputs its argument o and returns the next input. The definition of the  $(iter\ f\ o)$  constructor above may be read as (1) output o (i.e.,  $signal\ o$ ), (2) receive the next input (i.e.,  $do\ i \leftarrow signal\ o$ ), and then (3) repeat the pattern with new "initial" output  $(f\ i)$ .
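The behavior just described can be simulated in plain Haskell. The sketch below is not ReWire's definition: it assumes a simplified Moore-machine model of `Dev` (a current output plus a function from the next input to the next device state) in place of ReWire's monadic type, and `run` is a hypothetical test harness introduced for illustration.

```haskell
-- ASSUMPTION: a simplified stand-in for ReWire's Dev type. A device holds its
-- current output and a function from the next input to its next state.
data Dev i o = Dev o (i -> Dev i o)

-- iter f o outputs o first; after consuming an input i it outputs (f i).
-- In the monadic style described in the text this would read roughly
--   iter f o = do { i <- signal o; iter f (f i) }
iter :: (i -> o) -> o -> Dev i o
iter f o = Dev o (\i -> iter f (f i))

-- Hypothetical harness: feed one input per clock cycle and collect the output
-- visible at each cycle, including the output after the last input.
run :: Dev i o -> [i] -> [o]
run (Dev o _) []       = [o]
run (Dev o k) (i : is) = o : run (k i) is
```

With this model, `run (iter (+1) 0) [10, 20, 30]` yields `[0, 11, 21, 31]`: the initial output 0 appears at the first cycle, and each input's image under the function appears one cycle after that input is consumed.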

#### Parallelism

Parallelism is expressed with the device constructor,  $\langle \& \rangle$ , that composes two existing devices, d1 and d2, into a single device,  $d1 \langle \& \rangle d2$ , in which both devices operate in parallel and in isolation from one another. N.b., we are assuming, here and elsewhere, that both arguments d1 and d2 are non-terminating. The type declaration of  $\langle \& \rangle$  is:

$$\langle \& \rangle :: Dev\ i_1\ o_1 \rightarrow Dev\ i_2\ o_2 \rightarrow Dev\ (i_1, i_2)\ (o_1, o_2)$$

We omit its Haskell definition as doing so would require an unnecessary excursion into Haskell's syntax and semantics. Fig. 8.2(b) presents a pictorial version of  $d1 \langle \& \rangle d2$ . The type signature of  $\langle \& \rangle$  means that the input and output types of constructed device  $d1 \langle \& \rangle d2$  are pairs of the inputs and outputs of d1 and d2, resp. Both subdevices d1 and d2 are isolated from one another in  $d1 \langle \& \rangle d2$ —i.e., there is no intercommunication or shared state between them. Such interaction may be added explicitly using the *refold* operator described below. The parallelism operator may be generalized to arbitrary numbers of devices (i.e., beyond two), but, for lack of space, we only present the simplest case.

#### Interdevice Communication & Feedback

Interconnections between devices are made using another device-level operator, refold. The refold operator can be used to connect sub-devices within its third argument and to hide internal connections as well. The use of refold is illustrated in Fig. 8.2(c). Given a device  $d :: Dev\ i_1\ o_1$ , and two pure functions,  $out :: o_1 \rightarrow o_2$  and  $conn :: (o_1 \rightarrow i_2 \rightarrow i_1)$ , refold out conn d is a new device with the following behavior. Given an external input i' and current value output o by internal device d, the new input to d is  $conn \ o \ i'$  and the new external output is  $out \ o$ . The type of refold is:

$$refold :: (o_1 \rightarrow o_2) \rightarrow (o_1 \rightarrow i_2 \rightarrow i_1) \rightarrow Dev\ i_1\ o_1 \rightarrow Dev\ i_2\ o_2$$
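Both the parallel operator and refold can be simulated under the same kind of simplified model. As before, this is a sketch under stated assumptions and not ReWire's implementation: `Dev` is modeled as a Moore machine, `run` is a hypothetical harness, and `accumulator` is a hypothetical example device showing feedback.

```haskell
-- ASSUMPTION: simplified Moore-machine stand-in for ReWire's Dev type.
data Dev i o = Dev o (i -> Dev i o)

iter :: (i -> o) -> o -> Dev i o
iter f o = Dev o (\i -> iter f (f i))

-- Parallel composition: both devices step in lockstep and in isolation.
(<&>) :: Dev i1 o1 -> Dev i2 o2 -> Dev (i1, i2) (o1, o2)
Dev o1 k1 <&> Dev o2 k2 = Dev (o1, o2) (\(i1, i2) -> k1 i1 <&> k2 i2)

-- refold out conn d: the external output is (out o), and the inner device
-- receives (conn o i') when the external input is i'.
refold :: (o1 -> o2) -> (o1 -> i2 -> i1) -> Dev i1 o1 -> Dev i2 o2
refold out conn (Dev o k) =
  Dev (out o) (\i' -> refold out conn (k (conn o i')))

-- Feedback example: a running sum built from a plain register (iter id 0)
-- by feeding the register's own output back into its input.
accumulator :: Dev Int Int
accumulator = refold id (+) (iter id 0)

-- Hypothetical harness: one input per clock cycle, collecting each output.
run :: Dev i o -> [i] -> [o]
run (Dev o _) []       = [o]
run (Dev o k) (i : is) = o : run (k i) is
```

Here `run accumulator [1, 2, 3]` yields `[0, 1, 3, 6]`: the connection function adds each external input to the fed-back output, so the register accumulates a running sum. In contrast, the two halves of a device built with `<&>` alone never influence each other.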

#### Defining a Pipeline

The form of pipeline we consider is a simple one, namely stall-free pipelines, in which the output from a stage flows directly into the input of the next stage. It is possible to define more complex pipelines (e.g., instruction pipelines that stall, etc.) with Connect Logic, but we leave that subject for a follow-on publication.

Stall-free pipelines—henceforth simply "pipelines"—have the flavor of functional composition, and the architectural combinators of ReWire allow the formalization of this intuition. For functions,  $f_j$ , of appropriate type, the composition,  $f_n \circ \cdots \circ f_1$ , resembles a pipeline. Of course, this ignores the timing aspect of a pipeline. In ReWire, we can express this pipeline, along with its timing, as the following:

$$iter\ f_1\ o_1 \leadsto \cdots \leadsto iter\ f_n\ o_n$$

where  $f_j :: a_j \rightarrow a_{j+1}$  are pure functions from input of type  $a_j$  to output of type  $a_{j+1}$  and each  $o_j :: a_{j+1}$  is the initial output value produced by pipeline stage  $iter\ f_j\ o_j$ . The  $\leadsto$  combinator chains the stages together, connecting the output of the  $j^{th}$  stage to the input of the  $(j+1)^{th}$  stage. The combinators for pipelining, etc., are defined below.

$$
\begin{array}{clll}
\text{R1} & 1 & x[4] \oplus= (x[0] \boxplus x[12]) \lll 7 & x[9] \oplus= (x[5] \boxplus x[1]) \lll 7 \\
 & & x[14] \oplus= (x[10] \boxplus x[6]) \lll 7 & x[3] \oplus= (x[15] \boxplus x[11]) \lll 7 \\
 & 2 & x[8] \oplus= (x[4] \boxplus x[0]) \lll 9 & x[13] \oplus= (x[9] \boxplus x[5]) \lll 9 \\
 & & x[2] \oplus= (x[14] \boxplus x[10]) \lll 9 & x[7] \oplus= (x[3] \boxplus x[15]) \lll 9 \\
 & 3 & x[12] \oplus= (x[8] \boxplus x[4]) \lll 13 & x[1] \oplus= (x[13] \boxplus x[9]) \lll 13 \\
 & & x[6] \oplus= (x[2] \boxplus x[14]) \lll 13 & x[11] \oplus= (x[7] \boxplus x[3]) \lll 13 \\
 & 4 & x[0] \oplus= (x[12] \boxplus x[8]) \lll 18 & x[5] \oplus= (x[1] \boxplus x[13]) \lll 18 \\
 & & x[10] \oplus= (x[6] \boxplus x[2]) \lll 18 & x[15] \oplus= (x[11] \boxplus x[7]) \lll 18 \\
\text{R2} & 5 & x[1] \oplus= (x[0] \boxplus x[3]) \lll 7 & x[6] \oplus= (x[5] \boxplus x[4]) \lll 7 \\
 & & x[11] \oplus= (x[10] \boxplus x[9]) \lll 7 & x[12] \oplus= (x[15] \boxplus x[14]) \lll 7 \\
 & 6 & x[2] \oplus= (x[1] \boxplus x[0]) \lll 9 & x[7] \oplus= (x[6] \boxplus x[5]) \lll 9 \\
 & & x[8] \oplus= (x[11] \boxplus x[10]) \lll 9 & x[13] \oplus= (x[12] \boxplus x[15]) \lll 9 \\
 & 7 & x[3] \oplus= (x[2] \boxplus x[1]) \lll 13 & x[4] \oplus= (x[7] \boxplus x[6]) \lll 13 \\
 & & x[9] \oplus= (x[8] \boxplus x[11]) \lll 13 & x[14] \oplus= (x[13] \boxplus x[12]) \lll 13 \\
 & 8 & x[0] \oplus= (x[3] \boxplus x[2]) \lll 18 & x[5] \oplus= (x[4] \boxplus x[7]) \lll 18 \\
 & & x[10] \oplus= (x[9] \boxplus x[8]) \lll 18 & x[15] \oplus= (x[14] \boxplus x[13]) \lll 18
\end{array}
$$

Figure 8.3: Salsa20 Hashing Algorithm [77]. Operation  $\oplus$  is bitwise exclusive OR,  $\boxplus$  is addition modulo  $2^{32}$ , and  $\lll$  is left rotate. Each set of four assignments numbered 1–8 is a quarter round, and each round, R1 and R2, consists of four quarter rounds. The algorithm consists of repeating the double round (R1; R2) ten times in succession. Argument x is a 16-element array of 32-bit words.
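The arithmetic in Fig. 8.3 can be tried out directly in ordinary Haskell. The following sketch of a single quarter round uses `Word32` from the base library as a stand-in for ReWire's W32 and a plain 4-tuple in place of the Quad type; both substitutions are assumptions of this sketch and not the ReWire code of the later figures.

```haskell
import Data.Bits (rotateL, xor)
import Data.Word (Word32)

-- One Salsa20 quarter round on four 32-bit words. Addition on Word32 wraps
-- modulo 2^32, xor is bitwise exclusive OR, and rotateL is left rotation.
quarterround
  :: (Word32, Word32, Word32, Word32)
  -> (Word32, Word32, Word32, Word32)
quarterround (y0, y1, y2, y3) = (z0, z1, z2, z3)
  where
    z1 = y1 `xor` ((y0 + y3) `rotateL` 7)
    z2 = y2 `xor` ((z1 + y0) `rotateL` 9)
    z3 = y3 `xor` ((z2 + z1) `rotateL` 13)
    z0 = y0 `xor` ((z3 + z2) `rotateL` 18)
```

For instance, `quarterround (1, 0, 0, 0)` evaluates to `(0x08008145, 0x00000080, 0x00010200, 0x20500000)`, and the all-zero input maps to itself. Each later word depends on the one computed just before it, which is the data dependency that the numbered steps of the figure make explicit.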

Note that  $\rightsquigarrow$  is not syntactic sugar for function composition. For example, while it is true that  $id \circ f = f$ , it is also the case that  $iter\ id\ o_1 \rightsquigarrow iter\ f\ o_2 \neq iter\ f\ o_2$ . The LHS of this inequality is a two stage pipeline while the RHS is a one stage pipeline. The outputs both pipelines produce will be related, of course.

Given two devices, d1 and d2, the ReWire code for connecting them in pipelined sequence is below. This construction is illustrated in Fig. 8.2(d). The two devices are first placed unconnected in parallel (i.e., $d1 \langle \& \rangle d2 :: Dev\ (a, b)\ (b, c)$) and, in this context, both devices operate in isolation. The combined device consumes a single input of type (a, b) and produces a single output of type (b, c). The output type of $(d1 \leadsto d2)$ is c; i.e., the second component of the output tuple of $d1 \langle \& \rangle d2$. The external input (of type a) to $(d1 \leadsto d2)$ is passed to the subdevice d1, and the output of d1 is passed to the input of d2; thus the routing function pipe is as defined below:

```
salsa20 :: W128 -> Hex W32
salsa20 nonce = hash (initialize key_0 key_1 nonce)

hash :: Hex W32 -> Hex W32
hash x = x + doubleround (\cdots (doubleround x) \cdots)

doubleround :: Hex W32 -> Hex W32
doubleround x = rowround (columnround x)

quarterround :: Quad W32 -> Quad W32
quarterround (y_0, y_1, y_2, y_3) = (z_0, z_1, z_2, z_3)
  where
    z_1 = y_1 \oplus (y_0 + y_3) \ll 7
    z_2 = y_2 \oplus (z_1 + y_0) \ll 9
    z_3 = y_3 \oplus (z_2 + z_1) \ll 13
    z_0 = y_0 \oplus (z_3 + z_2) \ll 18

rowround :: Hex W32 -> Hex W32
rowround (y_0, \ldots, y_{15}) = (z_0, \ldots, z_{15})
  where
    (z_0, z_1, z_2, z_3)             = quarterround (y_0, y_1, y_2, y_3)
    (z_5, z_6, z_7, z_4)             = quarterround (y_5, y_6, y_7, y_4)
    (z_{10}, z_{11}, z_8, z_9)       = quarterround (y_{10}, y_{11}, y_8, y_9)
    (z_{15}, z_{12}, z_{13}, z_{14}) = quarterround (y_{15}, y_{12}, y_{13}, y_{14})

columnround :: Hex W32 -> Hex W32
columnround (x_0, \ldots, x_{15}) = (y_0, \ldots, y_{15})
  where
    (y_0, y_4, y_8, y_{12})    = quarterround (x_0, x_4, x_8, x_{12})
    (y_5, y_9, y_{13}, y_1)    = quarterround (x_5, x_9, x_{13}, x_1)
    (y_{10}, y_{14}, y_2, y_6) = quarterround (x_{10}, x_{14}, x_2, x_6)
    (y_{15}, y_3, y_7, y_{11}) = quarterround (x_{15}, x_3, x_7, x_{11})
```

Figure 8.4: Reference specification of the Salsa20 hash function [75] used in our case study. Operation $\oplus$ is bitwise exclusive OR, + is addition modulo $2^{32}$, and $\ll$ is left rotate.
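For readers who want to experiment with the specification, the following is a minimal executable sketch of the round functions in ordinary Haskell. It is not the ReWire source: `Word32` stands in for `W32`, a sixteen-element list (the hypothetical `Hex16`) stands in for `Hex W32`, `hashCore` is a hypothetical name for the figure's `hash`, and `columnround` is written as `rowround` on the transposed 4x4 word matrix, which yields the same assignments as the explicit definition in the figure. The omitted endianness and byte-conversion routines are not modeled.

```haskell
import Data.Bits (rotateL, xor)
import Data.Word (Word32)

-- Hypothetical stand-ins for the Quad W32 and Hex W32 types of the figure.
type Quad  = (Word32, Word32, Word32, Word32)
type Hex16 = [Word32]  -- assumed to hold exactly sixteen words

-- Word32 addition wraps modulo 2^32; rotateL on Word32 is a left rotation.
quarterround :: Quad -> Quad
quarterround (y0, y1, y2, y3) = (z0, z1, z2, z3)
  where
    z1 = y1 `xor` ((y0 + y3) `rotateL` 7)
    z2 = y2 `xor` ((z1 + y0) `rotateL` 9)
    z3 = y3 `xor` ((z2 + z1) `rotateL` 13)
    z0 = y0 `xor` ((z3 + z2) `rotateL` 18)

rowround :: Hex16 -> Hex16
rowround [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15] =
  [z0, z1, z2, z3, z4, z5, z6, z7, z8, z9, z10, z11, z12, z13, z14, z15]
  where
    (z0, z1, z2, z3)     = quarterround (y0, y1, y2, y3)
    (z5, z6, z7, z4)     = quarterround (y5, y6, y7, y4)
    (z10, z11, z8, z9)   = quarterround (y10, y11, y8, y9)
    (z15, z12, z13, z14) = quarterround (y15, y12, y13, y14)
rowround _ = error "rowround: expected sixteen words"

-- Swap rows and columns of the 4x4 word matrix.
transpose16 :: Hex16 -> Hex16
transpose16 xs = [xs !! (4 * c + r) | r <- [0 .. 3], c <- [0 .. 3]]

-- columnround expressed as rowround on the transposed matrix.
columnround :: Hex16 -> Hex16
columnround = transpose16 . rowround . transpose16

doubleround :: Hex16 -> Hex16
doubleround = rowround . columnround

-- Ten double rounds followed by a word-wise addition of the input.
hashCore :: Hex16 -> Hex16
hashCore x = zipWith (+) x (iterate doubleround x !! 10)

main :: IO ()
main = do
  print (quarterround (1, 0, 0, 0))
  print (columnround (concat (replicate 4 [1, 0, 0, 0])))
```

Working the first call by hand from the definitions gives `(0x08008145, 0x00000080, 0x00010200, 0x20500000)`, which is a convenient sanity check when porting the functions to another representation.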

```
sls20dev :: Dev (Bit, W128) (Hex W32)
sls20dev = refold out conn (passthru <&> dblrd)

zeros :: Hex W32
zeros = <...sixteen all-zero words...>

dblrd :: Dev (Hex W32) (Hex W32)
dblrd = iter doubleround (doubleround zeros)

passthru :: Dev (Hex W32) (Hex W32)
passthru = iter id zeros

out :: (Hex W32, Hex W32) -> Hex W32
out ((x_0, \dots, x_{15}), (y_0, \dots, y_{15})) = (x_0 + y_0, \dots, x_{15} + y_{15})

conn :: (Hex W32, Hex W32) -> (Bit, W128) -> (Hex W32, Hex W32)
conn (o_1, o_2) (Low, nonce)  = (o_1, o_2)
conn (o_1, o_2) (High, nonce) = (x, x)
  where
    x = initialize key_0 key_1 nonce
```

Figure 8.5: Iterative Salsa20 Device in ReWire.

```
(\leadsto) :: Dev a b -> Dev b c -> Dev a c
d1 \leadsto d2 = refold snd pipe (d1 <&> d2)
  where
    pipe (b, c) a = (a, b)
```
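The claim that the pipelining operator is not function composition can be checked on a small executable model. The sketch below is an assumption made for illustration, not ReWire's actual `Dev` type: a device is modeled as its current output plus a step function, `~>` is a hypothetical ASCII stand-in for the pipelining operator, and `feed` is one plausible definition consistent with the properties stated for it in this chapter.

```haskell
-- Simplified, hypothetical model of a device: its current output and a
-- function that consumes one input and yields the device's next state.
data Dev i o = Dev o (i -> Dev i o)

-- One-stage device: emits the initial output, then f applied to each input.
iter :: (i -> o) -> o -> Dev i o
iter f o = Dev o (\i -> iter f (f i))

-- Parallel composition: both devices step in isolation.
(<&>) :: Dev i1 o1 -> Dev i2 o2 -> Dev (i1, i2) (o1, o2)
Dev o1 k1 <&> Dev o2 k2 = Dev (o1, o2) (\(i1, i2) -> k1 i1 <&> k2 i2)

-- Rewire a device: out selects the visible output, conn builds the inner
-- input from the current inner output and the external input.
refold :: (o1 -> o2) -> (o1 -> i2 -> i1) -> Dev i1 o1 -> Dev i2 o2
refold out conn (Dev o k) = Dev (out o) (\i -> refold out conn (k (conn o i)))

-- ASCII stand-in for the pipelining operator, following the definition above.
infixr 1 ~>
(~>) :: Dev a b -> Dev b c -> Dev a c
d1 ~> d2 = refold snd pipe (d1 <&> d2)
  where pipe (b, _) a = (a, b)

-- Run a device on a stream of inputs, recording its outputs in order.
feed :: [i] -> Dev i o -> [o]
feed []       (Dev o _) = [o]
feed (i : is) (Dev o k) = o : feed is (k i)

main :: IO ()
main = do
  let f = (* 2) :: Int -> Int
  print (take 5 (feed [1 ..] (iter f 0)))               -- [0,2,4,6,8]
  print (take 5 (feed [1 ..] (iter id 0 ~> iter f 0)))  -- [0,0,2,4,6]
```

In this model the two-stage pipeline produces the same values as the one-stage device, delayed by one cycle, which is the sense in which the outputs are related but the devices are not equal.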

#### Compiling Connect Logic

A pure function f in ReWire will be compiled into a combinational circuit of fixed depth that, in turn, determines a fixed delay. If  $f = f_n \circ \cdots \circ f_1$ , then its depth is additive as is its delay. Composition of pure functions exposes an opportunity for a pipelining optimization to reduce the average propagation delay of the entire circuit.

The two operators  $\langle \& \rangle$  and refold are treated as primitives in the ReWire compilation process. These operations correspond directly to structural features in generated VHDL. The  $\langle \& \rangle$  operation is compiled to a single VHDL entity that handles the combined IO of two ReWire devices, and port maps it accordingly. The refold operator is a single entity with included functions to manipulate the IO of a device in the manner prescribed by the type of the refold function.

# 8.4 Provably Correct Development of Salsa20 Devices in ReWire and Connect Logic

Salsa20 is a stream cipher developed by Bernstein [77] and is part of the ECRYPT eSTREAM [78] portfolio of cryptographic ciphers. Salsa20 was originally intended for software implementation, but can also be synthesized on an FPGA with careful consideration given to space and mapping constraints. Fig. 8.3 presents the Salsa20 hashing algorithm, which is the heart of Salsa20 and where the bulk of its computation occurs. The inputs to the algorithm include a 16-element array of 32-bit words, called x in the figure.

## 8.4.1 Salsa20 Reference Specification

Fig. 8.4 contains the reference specification for Salsa20. This specification simply recasts Bernstein's functional specification [75] using Haskell syntax. The function hash formulates the original specification from Fig. 8.3 and the function salsa20 is the entry point for the whole algorithm. There are certain details which we have left out of this code for the sake of brevity and comprehensibility; these include routines to change endianness, to reform words as sequences of bytes, and similar such routines.

The function *initialize* sets up the initial input; its definition is omitted as well.

## 8.4.2 Salsa20 Iterative Implementation

Fig. 8.5 contains the additional ReWire code to create an iterative version of Salsa20. Two devices are created, *dblrd* and *passthru*, using the *iter* constructor in Connect Logic. A diagrammatic view of the circuit produced is found in Fig. 8.8(a). Synthesis estimates of resource usage and FMax for *sls20dev* are in Table 8.1.

There is one functional unit performing the doubleround operation. It operates for ten cycles to produce one answer. When the inputs to the device sls20dev are $[(High, n), (Low, n_0), \ldots, (Low, n_9), \ldots]$, then, on the cycle with input $(Low, n_9)$, the output will be salsa20 n. The High bit signifies that the device should start hashing n. A (Low, n') input signifies that n' should be ignored and that the iteration should continue.

## 8.4.3 Pipelining Salsa20

The numbers for the iterative device are reasonable, but the structure of the cipher algorithm would indicate that there is room for improvement. There is an apparent performance gap with this approach: nine cycles of the device do not yield useful output. Pipelining our base components together gives us a way to keep our performance characteristics with respect to clock speed roughly the same while enabling our device to be productive on every clock cycle. We do so by placing ten different $passthru \langle \& \rangle dblrd$ devices in sequence, connecting their inputs and outputs together to obtain *pipe10* in Fig. 8.6.

```
pipe10 :: Dev W128 (Hex W32)
pipe10 = refold out inpt tenstage
  where
    tenstage = stage \leadsto \cdots \leadsto stage   -- ten copies of stage
    stage    = passthru <&> dblrd
```

Figure 8.6: Ten Stage Pipeline

A twenty stage pipeline may be created by increasing the granularity of each stage. Now, instead of staging each *doubleround* as before, each component *columnround* and *rowround* is staged (see Fig. 8.7).

# 8.5 Evaluating Provably Correct Salsa20 Devices

This section evaluates the devices created in the previous section in two respects: performance and verification. The devices synthesized by the ReWire compiler exhibit performance comparable to a previously published, hand-optimized design [79]. We also sketch the verification of a general theorem that characterizes the correctness of the pipelining transformation applied in Section 8.4.3.

In this section, we sketch the verification of the pipelining transformation defined in Section 8.4. There is a function of the following type that serves to run a device on a stream of inputs: $feed :: [i] \rightarrow Dev\ i\ o \rightarrow [o]$. For a stream of inputs $is :: [i]$ and a device $d :: Dev\ i\ o$, $feed\ is\ d$ is the stream of outputs created by running the device d on is. N.b., feed preserves the order of outputs with respect to inputs; i.e., if i is the $n^{th}$ input in is, then the $(n+1)^{st}$ item in $feed\ is\ d$ was produced by d on i. We omit the definition of feed.

Figure 8.7: Twenty Stage Pipeline.

## 8.5.1 Performance

We evaluated the performance of the VHDL generated from our high-level specifications by synthesizing it using Xilinx ISE targeting a Kintex 7 FPGA (xc7k160t-3fbg676). The synthesis results detailed in Table 8.1 show an increase in throughput and resource utilization as we pipeline that is in line with intuitive expectations. The 10-stage pipeline is the design core of the iterative implementation replicated tenfold. We observe a nearly tenfold increase in flip-flop usage and a notable increase in LUT usage (likely impacted by optimizations in the synthesis tools). In the 20-stage pipeline, we divide our basic unit into separate rowround and columnround pipeline stages. This introduces some additional LUT usage, but doubles flip-flop (slice) usage because the number of stages in the pipeline is doubled. The maximum frequency of the 20-stage pipeline increases by a factor of approximately 1.7, which indicates a doubling effect from doubling the pipeline with a moderate amount of overhead. These numbers demonstrate that our approach is competitive with similar work in the area of synthesizing Salsa20 [79] on modern FPGAs.

Figure 8.8: Diagrammatic views of circuits produced by ReWire in Figs. 8.5 and 8.6.

## 8.5.2 Testing the Iterative Salsa20 Device Automatically

We used the QuickCheck tool [80] to test the putative correctness of the relationship between the reference specification salsa20 and the iterative ReWire definition sls20dev (from Figs. 8.4 and 8.5, resp.). Below is a Bool-valued function, test, that takes a W128 nonce n as input and checks an equation. Note that the input stream is has the form [(High, n), (Low, undefined), (Low, undefined), ...] where undefined is a special "don't care" constant built into Haskell.

```
test :: W128 -> Bool
test n = reference == iterative
  where
    reference = salsa20 n
    iterative = nth 10 (feed is sls20dev)
    is        = (High, n) : repeat (Low, undefined)
```

QuickCheck can generate random inputs to *test* and, if *test* returns *True* for each input, then QuickCheck remarks that the tests were passed; below is a transcript of running QuickCheck on this correctness condition for *sls20dev*:

The correctness condition is neatly summed up in the following theorem (stated without proof):

**Theorem 1** (Correctness of Iterative Salsa20). For all nonces $n, n_0, \ldots, n_9 :: W128$, assume input stream is has the form $[(High, n), (Low, n_0), \cdots, (Low, n_9), \ldots]$. Then, the following equation holds: $salsa20\ n = nth\ 10\ (feed\ is\ sls20dev)$.
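The timing in this theorem (the result appears at position 10 of the output stream) can be illustrated on a toy analogue of the iterative device. Everything in the sketch below is an assumption for illustration: the device model is a simplified stand-in for ReWire's `Dev`, `Start` stands in for `Bit`, the arithmetic function `step` stands in for doubleround, the hypothetical `reference` plays the role of salsa20, and initialization is the identity. Only the wiring (`out`, `conn`, and the passthru/worker pair) follows Fig. 8.5.

```haskell
-- Simplified, hypothetical device model (not ReWire's actual Dev type).
data Dev i o = Dev o (i -> Dev i o)

iter :: (i -> o) -> o -> Dev i o
iter f o = Dev o (\i -> iter f (f i))

(<&>) :: Dev i1 o1 -> Dev i2 o2 -> Dev (i1, i2) (o1, o2)
Dev o1 k1 <&> Dev o2 k2 = Dev (o1, o2) (\(i1, i2) -> k1 i1 <&> k2 i2)

refold :: (o1 -> o2) -> (o1 -> i2 -> i1) -> Dev i1 o1 -> Dev i2 o2
refold out conn (Dev o k) = Dev (out o) (\i -> refold out conn (k (conn o i)))

feed :: [i] -> Dev i o -> [o]
feed []       (Dev o _) = [o]
feed (i : is) (Dev o k) = o : feed is (k i)

-- Hypothetical stand-in for the start bit.
data Start = High | Low

-- Toy stand-in for doubleround.
step :: Int -> Int
step x = 3 * x + 1

-- Toy stand-in for the reference: input plus ten applications of step.
reference :: Int -> Int
reference n = n + iterate step n !! 10

-- Same wiring as the iterative device: a pass-through register holds the
-- original input while the worker repeatedly applies step to its own output.
device :: Dev (Start, Int) Int
device = refold out conn (iter id 0 <&> iter step (step 0))
  where
    out (x, y) = x + y
    conn (o1, o2) (Low, _)  = (o1, o2)
    conn _        (High, n) = (n, n)

main :: IO ()
main = print (feed inputs device !! 10 == reference 7)  -- prints True
  where
    inputs = (High, 7) : repeat (Low, 0)
```

In this model, position 0 of the output stream is the device's initial output, position 1 reflects one application of `step`, and position 10 reflects ten, matching the index used in the theorem.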

|           | LUTs  | Slices | Fmax (MHz) | T (Gbit/s) |
|-----------|-------|--------|------------|------------|
| Iterative | 3459  | 651    | 99.4       | 5.1        |
| 10 Stage  | 22840 | 6019   | 97.5       | 49.9       |
| 20 Stage  | 25519 | 12309  | 167.4      | 85.7       |

Table 8.1: Resource usage, Fmax, and throughput (T) of the Salsa20 algorithm as implemented and compiled in ReWire.
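The throughput column appears to be consistent with a simple calculation: each result is a 512-bit block (sixteen 32-bit words), the iterative device needs ten cycles per block, and the pipelined devices deliver one block per cycle once full. The sketch below reproduces that arithmetic; the formula is our reading of the table rather than a statement from the synthesis reports, and `throughputGbps` is a hypothetical helper name.

```haskell
-- Throughput in Gbit/s of a device clocked at fmaxMHz that emits one
-- 512-bit block (sixteen 32-bit words) every cyclesPerBlock cycles.
-- Assumption: the table's T column is Fmax * 512 / cyclesPerBlock.
throughputGbps :: Double -> Double -> Double
throughputGbps fmaxMHz cyclesPerBlock =
  fmaxMHz * 1e6 * 512 / cyclesPerBlock / 1e9

main :: IO ()
main = mapM_ report
  [ ("Iterative", 99.4, 10)  -- ten cycles per block
  , ("10 Stage", 97.5, 1)    -- one block per cycle once the pipeline is full
  , ("20 Stage", 167.4, 1)
  ]
  where
    report (name, fmax, cycles) =
      putStrLn (name ++ ": " ++ show (throughputGbps fmax cycles) ++ " Gbit/s")
```

Under this assumption the three rows evaluate to roughly 5.09, 49.92, and 85.71 Gbit/s, which round to the values in the table.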

## 8.5.3 Verification of Pipelining

#### Lemmas

This section states the Lemmas used in proving the correctness of pipelining (Theorem 2 below). Each lemma is left unproven, although we describe the intuitive meaning of each.

Lemma 1 says that the pipelining operator is associative. The associativity of  $\rightsquigarrow$  allows for "parentheses to be dropped"; i.e.,  $(f \rightsquigarrow g \rightsquigarrow h)$  can stand for either the right- or left-hand sides of the equation in the lemma.

**Lemma 1** (Associativity). The  $\rightsquigarrow$  operation is associative.

$$f \rightsquigarrow (g \rightsquigarrow h) = (f \rightsquigarrow g) \rightsquigarrow h$$

Lemma 2 relates stages in a pipeline of devices created with iter. The LHS below performs f and g in successive stages. The RHS performs both f and g in the first stage and the identity function in the second stage. N.b., the RHS is *not* identical to $iter\ (g \circ f)\ (g\ o_2)$ because the former has two stages while the latter has one.

**Lemma 2.** Let $g :: b \rightarrow c$, $f :: a \rightarrow b$, $o_1 :: c$, and $o_2 :: b$. Then, we have:

$$iter\ f\ o_2 \rightsquigarrow iter\ g\ o_1 = iter\ (g \circ f)\ (g\ o_2) \rightsquigarrow iter\ id\ o_1$$

Lemma 3 relates feed l with  $\rightsquigarrow$  in terms of infinite streams. It gives a condition under which the pipeline may be reduced by one stage.

**Lemma 3.** Let l be an infinite stream and  $\varphi$  :: Dev i o, then: feed l ( $\varphi \leadsto iter\ id\ o$ ) = o : feed l  $\varphi$ 

Lemma 4 characterizes the interaction of feed and iter in terms of a stream recording the outputs of device argument to feed. The first is just the initial output of one stage pipeline device ( $iter\ f\ o$ ) and the rest are simply  $f\ mapped$  onto l.

**Lemma 4.** Let l :: [i] be an infinite stream and f ::  $i \rightarrow o$ . Then, feed l (iter f o) = o : map f l.

#### Correctness Theorem

The following theorem says that feeding an n-stage pipeline a stream of inputs is the same as mapping a composite function across those inputs, as long as the first n outputs are ignored.

**Theorem 2** (Correctness of Pipelining). Assuming that  $f = f_1 \circ \cdots \circ f_n$  and that l is an infinite stream, then:

$$map\ f\ l = drop\ n\ (feed\ l\ (iter\ f_n\ o_n \leadsto \cdots \leadsto iter\ f_1\ o_1))$$

First, define:  $F_0 = id$  and  $F_{i+1} = F_i \circ f_{i+1}$ . Observe that, by Lemmas 1 and 2 (n-1 times),

$$\begin{aligned}
& iter\ f_n\ o_n \leadsto \cdots \leadsto iter\ f_1\ o_1 \\
&= iter\ F_n\ (F_{n-1}\ o_n) \leadsto iter\ id\ (F_{n-2}\ o_{n-1}) \leadsto \cdots \leadsto iter\ id\ (F_0\ o_1) \\
&\qquad \{ f = F_n,\ F_0 = id \} \\
&= iter\ f\ (F_{n-1}\ o_n) \leadsto iter\ id\ (F_{n-2}\ o_{n-1}) \leadsto \cdots \leadsto iter\ id\ o_1
\end{aligned} \tag{$\ddagger$}$$

Working from the RHS of the theorem statement:

$$\begin{aligned}
& drop\ n\ (feed\ l\ (iter\ f_n\ o_n \leadsto \cdots \leadsto iter\ f_1\ o_1)) \\
&\qquad \{ \text{By } (\ddagger) \} \\
&= drop\ n\ (feed\ l\ (iter\ f\ (F_{n-1}\ o_n) \leadsto iter\ id\ (F_{n-2}\ o_{n-1}) \leadsto \cdots \leadsto iter\ id\ o_1)) \\
&\qquad \{ \text{Lemma 3, } n-1 \text{ times} \} \\
&= drop\ n\ (o_1 : \cdots : F_{n-2}\ o_{n-1} : feed\ l\ (iter\ f\ (F_{n-1}\ o_n))) \\
&\qquad \{ (\dagger), \text{Section 8.3.1} \} \\
&= drop\ 1\ (feed\ l\ (iter\ f\ (F_{n-1}\ o_n))) \\
&\qquad \{ \text{Lemma 4} \} \\
&= drop\ 1\ (F_{n-1}\ o_n : map\ f\ l) \\
&\qquad \{ \text{Defn. drop} \} \\
&= map\ f\ l
\end{aligned}$$
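For intuition, the $n = 2$ instance of this argument can be written out in full. This is only an illustrative specialization, with $f = f_1 \circ f_2$, using Lemma 2 with $f_2$ in the first stage and $f_1$ in the second, and the usual property that $drop\ (k+1)\ (x : xs) = drop\ k\ xs$:

$$\begin{aligned}
& drop\ 2\ (feed\ l\ (iter\ f_2\ o_2 \leadsto iter\ f_1\ o_1)) \\
&= drop\ 2\ (feed\ l\ (iter\ (f_1 \circ f_2)\ (f_1\ o_2) \leadsto iter\ id\ o_1)) && \{ \text{Lemma 2} \} \\
&= drop\ 2\ (o_1 : feed\ l\ (iter\ f\ (f_1\ o_2))) && \{ \text{Lemma 3} \} \\
&= drop\ 2\ (o_1 : f_1\ o_2 : map\ f\ l) && \{ \text{Lemma 4} \} \\
&= map\ f\ l && \{ \text{Defn. drop} \}
\end{aligned}$$

The two discarded items, $o_1$ and $f_1\ o_2$, are exactly the values that occupy the pipeline before the first real input has passed through both stages.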

## 8.6 Summary and Conclusions

This chapter considered the provably correct development of several reconfigurable designs and implementations of the Salsa20 stream cipher. The vehicle for this development is the ReWire language. ReWire is a sublanguage of the pure, functional language Haskell, and, as such, possesses a rigorous semantics that supports formal verification. Functional languages are generally quite expressive, and, consequently, the Salsa20 specifications in ReWire were quickly produced, concise, comprehensible, and elegant. Connect Logic, a previously unpublished part of ReWire, supports a structural style of development in a functional HDL. Connect Logic was key to rapidly prototyping Salsa20 in ReWire, especially in the introduction of pipelining optimizations to the specifications.

It is commonplace for hardware engineers to "think in diagrams". Any circuit or device specification will include a diagram depicting the high-level structure of the device. This diagrammatic abstraction is used as an informal guide for comprehending the design. But how do we express such structural notions in a functional language-based HDL like ReWire? To this end, we introduced an extension to ReWire called Connect Logic that encapsulates the diagrammatic style directly in the syntax of ReWire. This chapter defines Connect Logic and illustrates its use with a case study of the construction of an efficient, pipelined hardware design and implementation of the Salsa20 stream cipher. Furthermore, and more to the point, we verify the correctness of this device through equational reasoning on the ReWire source text.

New language abstractions are not typically cost free. There is usually some tradeoff with respect to performance, and language implementers attempt to minimize such overheads. Furthermore, new abstractions tend to be more useful in some situations than in others. The Salsa20 cipher was chosen as a test for ReWire to evaluate (1) how well cryptographic algorithms might be expressed in ReWire and (2) what performance trade-off, if any, might arise with respect to carefully hand-optimized implementations. The performance of the synthesized ReWire devices (as shown in Table 8.1) was quite good and, although there are not any published numbers on hand-optimized implementations of Salsa20 that afford direct comparison with our results, the achieved performance was in line with the only relevant publication in the area [79]. Question (1) concerns what is, admittedly, more of an aesthetic issue than a measurable quantity. Still, it is safe to say that the Salsa20 specifications in ReWire would be readily comprehensible to those with experience in functional programming.

More importantly, a clear advantage of the ReWire methodology is that the artifacts we produced were verified in the manner of ordinary functional programs directly on the text of the design. This is a point worth emphasizing: verification of ReWire programs takes place on the program itself. Because VHDL has no mathematical semantics, artifacts produced in VHDL (or in Verilog for that matter) would require an additional step in which the formal specification of the device would be encoded by hand in the logic of a theorem prover [48]. This hand-encoding is fraught with the potential for error as well as being quite time-consuming.

## Chapter 9

# Case Study: Implementing a Pipelined DLX Processor

## 9.1 Introduction

In this chapter we detail the design and implementation of the DLX microprocessor using ReWire with Connect Logic. Prior work has demonstrated ReWire's viability as a tool to design and implement processors [3]. In this case, we choose a pipelined architecture to illustrate Connect Logic's usefulness in building pipelined devices in the manner of those seen in processor architectures. This chapter describes the implementation from the high level and how we arrive at the resulting device illustrated in Listing 9.1.

```
dlx :: ReT (Data, Instruction) (Address, Maybe Data, PC) I ()
dlx = refold output_select wiring
        (fetch <&> decode <&> execute <&> memory <&> writeback)
```

Listing 9.1: The type of the DLX processor device

## 9.2 The DLX Processor

The DLX processor architecture is a micro-architecture designed by Hennessy and Patterson for didactic purposes in their seminal book on micro-architecture design [81]. The DLX architecture is similar in form to the MIPS processor family. DLX is intended to be implemented as a pipelined processor with five processor phases: fetch, decode, execute (ALU), memory access, and register write back. We design the stages as separate ReWire components in near isolation from one another to demonstrate the modularity that Connect Logic brings to VLSI design in ReWire. The DLX specification, while less common than more popular processor architectures (MIPS, POWER, x86, etc.), has a large enough community to provide us with mature tools to program our processor. We use these tools to demonstrate, by testing in Haskell, that our specification behaves correctly. From there, we move to converting to ReWire and synthesizing to a Xilinx FPGA to report resource utilization and performance characteristics.

## 9.3 Constructing the Processor

The specification of a physical processor follows primarily from its instruction set. The features that we need to support in the instruction set translate to physical features in our final product. We build each processor phase individually, giving consideration to the instructions we support. Finally, we combine these isolated components into DLX using Connect Logic.

## 9.3.1 Instructions and Architecture

The DLX processor is a reduced instruction set computer (RISC) with thirty-two 32-bit registers. These registers include two special-purpose registers. The first register, R0, holds the constant zero and is read-only. The last register, R31, is intended for use as a return address register for procedure calls [82] and similar routines. In this work, we treat DLX as a big-endian processor in the spirit of Sailer et al. [82]. The DLX architecture includes a number of logic, arithmetic, control-flow, and data-flow-affecting instructions. For brevity, this implementation covers a representative subset of the DLX architecture: it excludes all floating-point operations and includes only a subset of the remaining instructions. Supported instructions are listed in Table 9.1, Table 9.2, and Table 9.3.

DLX instructions are classified into three different types. The first type, the R-type instruction, covers instructions that use register-to-register functionality in the ALU stage of the processor. These instructions include a function opcode and three register operands: two source registers and one destination register. Tables 9.1-9.3 list the instructions supported by our DLX implementation with their opcode values and their semantics. Destination and source register variables are denoted by rd and $rs_x$, respectively. Immediate values provided by a programmer in an encoded instruction are denoted by immediate. The extend function is a sign-extension operation. The MEM keyword serves as a C-like array interface to system memory in our semantics.

The R-type instructions supported by our DLX implementation are listed in Table 9.1. I-type instructions are similar to their R-type cousins, except that they contain one fewer source register and reserve 16 bits for an immediate value that the programmer determines in advance. I-type instructions include immediate-argument variations of R-type instructions as well as jumping, branching, and memory access instructions; these are listed in Table 9.2. The final group, the J-type instructions, is a smaller group of unconditional jump instructions, listed in Table 9.3. This format consists of an opcode and an address for a jump target. It is the smallest group of instructions in the DLX ISA.
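To make the three formats concrete, the sketch below extracts instruction fields from a 32-bit word in plain Haskell. It is not the decoder from this chapter. The field positions follow the conventional DLX layout (6-bit opcode in the top bits, 5-bit register fields, an 11-bit function field for R-type, a 16-bit immediate for I-type); the text confirms only the 6-bit, 5-bit, and 16-bit widths, so the exact bit positions, the zero primary opcode for R-type words, and the names `field`, `RType`, and `decodeR` are assumptions.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Int (Int16)
import Data.Word (Word32)

-- Extract a field of the given width whose least significant bit sits at
-- position lo (bit 0 is the least significant bit of the word).
field :: Int -> Int -> Word32 -> Word32
field lo width w = (w `shiftR` lo) .&. (2 ^ width - 1)

-- Primary opcode: the top six bits (assumed layout).
opcode :: Word32 -> Word32
opcode = field 26 6

-- Hypothetical record of the raw R-type fields.
data RType = RType { rs1, rs2, rd, func :: Word32 }
  deriving (Eq, Show)

decodeR :: Word32 -> RType
decodeR w = RType
  { rs1  = field 21 5 w
  , rs2  = field 16 5 w
  , rd   = field 11 5 w
  , func = field 0 11 w
  }

-- I-type immediate: the low sixteen bits.
immediate :: Word32 -> Word32
immediate = field 0 16

-- Sign extension of a 16-bit immediate to 32 bits (the tables' extend).
extend :: Word32 -> Word32
extend imm = fromIntegral (fromIntegral imm :: Int16)

main :: IO ()
main = do
  -- add r3, r1, r2 under the assumed layout: function code 0x20 from Table 9.1.
  let addWord = 0x00221820
  print (opcode addWord, decodeR addWord)
  print (extend 0xFFFF == 0xFFFFFFFF)
```

Under these assumptions, `decodeR 0x00221820` yields source registers 1 and 2, destination register 3, and function code 0x20.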

| Instruction | Opcode | Semantics                                                            |
|-------------|--------|----------------------------------------------------------------------|
| add         | 0x20   | $rd \leftarrow (rs_1) + (rs_2)$                                      |
| and         | 0x24   | $rd \leftarrow (rs_1) \& (rs_2)$                                     |
| or          | 0x25   | $rd \leftarrow (rs_1) \mid (rs_2)$                                   |
| seq         | 0x28   | $rd \leftarrow (rs_1) = (rs_2) ? (0^{31} \parallel 1) : (0^{32})$    |
| sle         | 0x2C   | $rd \leftarrow (rs_1) \le (rs_2) ? (0^{31} \parallel 1) : (0^{32})$  |
| sll         | 0x04   | $rd \leftarrow (rs_1)[(rs_2 \% 8) : 31] \parallel 0^{rs_2\%8}$       |
| slt         | 0x2A   | $rd \leftarrow (rs_1) < (rs_2) ? (0^{31} \parallel 1) : (0^{32})$    |
| sne         | 0x29   | $rd \leftarrow (rs_1) \neq (rs_2) ? (0^{31} \parallel 1) : (0^{32})$ |
| sra         | 0x07   | $rd \leftarrow (rs_1[0])^{rs_2\%8} \parallel rs_1[0:(31-(rs_2\%8))]$ |
| srl         | 0x06   | $rd \leftarrow 0^{rs_2\%8} \parallel rs_1[0: (31 - (rs_2 \% 8))]$    |
| sub         | 0x22   | $rd \leftarrow (rs_1) - (rs_2)$                                      |
| xor         | 0x26   | $rd \leftarrow (rs_1) \oplus (rs_2)$                                 |

Table 9.1: DLX R-Type instructions encoding and semantics.

| Instruction | Opcode | Semantics                                                                      |
|-------------|--------|--------------------------------------------------------------------------------|
| addi        | 0x08   | $rd \leftarrow (rs_1) + immediate$                                             |
| andi        | 0x0C   | $rd \leftarrow (rs_1) \& immediate$                                            |
| beqz        | 0x04   | $PC \leftarrow (rs_1 = 0 ? extend(immediate) + 4 : 4)$                         |
| bnez        | 0x05   | $PC \leftarrow (rs_1 \neq 0 ? extend(immediate) + 4 : 4)$                      |
| jalr        | 0x13   | $R31 \leftarrow (PC + 8); PC \leftarrow rs_1$                                  |
| jr          | 0x12   | $PC \leftarrow rs_1$                                                           |
| lhi         | 0x0F   | $rd \leftarrow (immediate \parallel 0^{16})$                                   |
| lw          | 0x23   | $rd \leftarrow MEM[rs_1 + extend(immediate)]$                                  |
| ori         | 0x0D   | $rd \leftarrow rs_1 \mid immediate$                                            |
| seqi        | 0x18   | $rd \leftarrow rs_1 = extend(immediate) ? 1 : 0$                               |
| slei        | 0x1C   | $rd \leftarrow rs_1 \leq extend(immediate) ? 1 : 0$                            |
| slli        | 0x14   | $rd \leftarrow (rs_1)[(immediate \% 8): 31] \parallel 0^{immediate\%8}$        |
| slti        | 0x1A   | $rd \leftarrow (rs_1) < immediate ? (0^{31} \parallel 1) : (0^{32})$           |
| snei        | 0x19   | $rd \leftarrow (rs_1) \neq immediate ? (0^{31} \parallel 1) : (0^{32})$        |
| srai        | 0x17   | $rd \leftarrow (rs_1[0])^{immediate\%8} \parallel rs_1[0:(31-(immediate\%8))]$ |
| srli        | 0x16   | $rd \leftarrow 0^{immediate\%8} \parallel rs_1[0:(31-(immediate \% 8))]$       |
| subi        | 0x0A   | $rd \leftarrow (rs_1) - immediate$                                             |
| sw          | 0x2B   | $MEM[rs_1 + extend(immediate)] \leftarrow rd$                                  |
| xori        | 0x0E   | $rd \leftarrow (rs_1) \oplus immediate$                                        |

Table 9.2: DLX I-Type instructions encoding and semantics.

| Instruction | Opcode | Semantics                                                 |
|-------------|--------|-----------------------------------------------------------|
| j           | 0x02   | $PC \leftarrow PC + extend(value)$                        |
| jal         | 0x03   | $R31 \leftarrow PC + 4; PC \leftarrow PC + extend(value)$ |

Table 9.3: DLX J-Type instructions encoding and semantics.

## 9.3.2 Fetch

The fetch phase is the first phase of the DLX execution pipeline. This phase uses the program counter to issue read requests to program memory. The fetch stage receives an instruction word from the program memory each cycle. In our design we assume a one-cycle read time from program memory, so every cycle the fetch phase can receive a new instruction to feed forward. For example, at clock cycle $t_n$, the instruction received by the fetch component is the instruction at the address held by the program counter at $t_{n-1}$ (or $PC_{n-1}$). The types for the fetch phase are given in Listing 9.2. The fetch component feeds forward to the decode phase the instruction from memory, an updated program counter (NextInst), and the current program counter (PC).

The fetch component accepts an additional value of type NewAdd as input. This value represents a new value to assign to the program counter in the event of a jump or branch processed further down the instruction pipeline. We omit the definition of this device in this section (the code for the implementation is in the Appendix beginning on Page 187), but note that if the value of type NewAdd is Nothing then the program counter is incremented; otherwise, the program counter is set to the argument of the Just. Lastly, the first member of the 3-tuple input argument to the fetch device is a stall bit. All components up to and including the execute stage include a stall bit to support stalling messages from the memory access component.

The fetch component requires two IO connections that route outside of the processor. The fetched instruction comes from a program memory bank outside of the processor, and the address of the next instruction is fed to a program memory module to be read for the subsequent instruction fetch.

```
type Instr    = Vector32 Bit
type NewAdd   = Maybe (Vector32 Bit)
type PC       = Vector32 Bit
type NextInst = Vector32 Bit
type Stall    = Bit
type FetchI   = (Stall, Instr, NewAdd)
type FetchO   = (NextInst, Instr, PC)

fetch :: ReT FetchI FetchO I ()
```

Listing 9.2: Types for the fetch component given in Haskell
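The program counter update described for the fetch component (increment on Nothing, redirect on Just) can be sketched as a pure function. This is an illustration only, not the device from the Appendix: `Word32` stands in for `Vector32 Bit`, a `Bool` stands in for the stall bit, the increment of 4 assumes byte-addressed 32-bit instructions, and holding the program counter while stalled (with stall taking priority over a redirect) is our assumption about the stall behavior. The name `nextPC` is hypothetical.

```haskell
import Data.Word (Word32)

type PC     = Word32        -- stand-in for Vector32 Bit
type NewAdd = Maybe Word32  -- redirect target from a jump or branch, if any

-- Compute the next program counter.
--   stalled (assumption): hold the current value
--   Just addr:            jump or branch taken further down the pipeline
--   Nothing:              advance to the next instruction (assumed 4 bytes)
nextPC :: Bool -> NewAdd -> PC -> PC
nextPC True  _           pc = pc
nextPC False (Just addr) _  = addr
nextPC False Nothing     pc = pc + 4

main :: IO ()
main = do
  print (nextPC False Nothing 0x100)        -- sequential fetch
  print (nextPC False (Just 0x400) 0x100)   -- redirect
  print (nextPC True (Just 0x400) 0x100)    -- stalled: unchanged
```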

## 9.3.3 Decode

The decode phase of the DLX pipeline is responsible for decoding instructions from their 32-bit binary form into a form that subsequent processor stages can interpret and act on. The decode phase is also responsible for loading source register values from the register file. In our DLX implementation, the values of the register file are stored and managed by the writeback phase, with read-only values being fed backwards to the decode stage. The decoding component takes a stall bit, the un-decoded instruction as a Vector32 Bit, the values comprising the register file (RegFile), and a register value (RegVal) indicating the address of the instruction (the value of the program counter when the instruction was fetched). The decoder outputs a decoded version of the instruction along with register names in Haskell types. We note that this transformation incurs a minimal cost. The DLX instruction encoding is such that microcode isn't required to represent parsed instructions because there are as many registers as can be represented by a 5-bit register encoding, as well as instructions represented by a 6-bit opcode encoding. When expressing these as algebraic data types in ReWire, the compiler re-encodes the constructors in a form represented by just as many bits, so long as we keep the same number of instruction opcodes and registers (or fewer). Register values require an additional 64 bits of routing to be fed forward to the ALU from the decoder. The output is expressed as a 4-tuple of the opcode, the destination register, and the first and second source registers paired with their values retrieved from the register file, respectively.

```
type RegFile = Vector32 (Vector32 Bit)
type RegVal  = Vector32 Bit
data Opcode  = ADD | .. | NOP   -- All supported opcodes
data Reg     = R0 | .. | R31
type DecodeI = (Stall, Vector32 Bit, RegFile, RegVal)
type DecodeO = (Opcode, Reg, (Reg, RegVal), (Reg, RegVal))

decode :: ReT DecodeI DecodeO I ()
```

Listing 9.3: Types for the decode component given in Haskell
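
The bit-width claim above can be checked with a little arithmetic. The following is a minimal sketch, not ReWire's actual encoding scheme: it assumes an enumeration with n constructors is encoded in the smallest number of bits that can distinguish n values. The five-constructor `Opcode` type and the helpers `bitsFor` and `opcodeBits` are hypothetical names introduced only for illustration.

```haskell
-- Hypothetical, abbreviated opcode set; the full list is elided in Listing 9.3.
data Opcode = ADD | SUB | AND | OR | NOP
  deriving (Show, Eq, Enum, Bounded)

-- | Smallest number of bits that can distinguish n constructors
-- (assumption: a dense binary encoding).
bitsFor :: Int -> Int
bitsFor n = length (takeWhile (< n) (iterate (* 2) 1))

-- | Bits needed for the abbreviated Opcode type above.
opcodeBits :: Int
opcodeBits = bitsFor (length [minBound .. maxBound :: Opcode])
```

Under this assumption, 32 register constructors need `bitsFor 32` bits and 64 opcode constructors need `bitsFor 64` bits, which matches the 5-bit and 6-bit fields of the DLX encoding; adding a 33rd register would push the width to 6 bits.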

#### 9.3.4 Execute

The execute component, or the ALU, is where the bulk of the computation occurs in the DLX processing pipeline. The most combinationally expensive computations, such as arithmetic operations, occur here, and the purpose of this phase is to isolate those operations so that they are at most one operation deep. Other phases include some fixed addition (fetch increments the program counter) that is combinationally separate from this phase, but pipelined so as to balance the delay across phases and keep this separate work executing in parallel.

```
1 type Flush = Bit
2 type ExecuteI = (Stall, Opcode, Flush, Reg, RegVal, RegVal)
3 type ExecuteO = Maybe (Opcode, RegDest, RegVal, RegVal)
4 execute :: ReT ExecuteI ExecuteO I ()
```

Listing 9.4: Types for the execution (ALU) phase given in Haskell.

On line 2 of Listing 9.4, the input type of execute is specified by the first argument of ReT as a 6-tuple. The execute phase component takes a stall bit, opcode, flush bit, destination register for writebacks, and two source register values. The opcode determines which functionality we execute in the ALU. The register values are used according to the semantics of the instruction corresponding to the opcode. Two values are always supplied, but they are not always used and are sometimes meaningless (zero) values. Likewise, destination registers are not required by all instructions and are ignored by some. The flush bit indicates whether a branch in the subsequent memory phase has been evaluated as taken. When it has, the execution phase accepts as many of the instructions remaining in the pipeline as there are delay slots in our implementation and ignores the rest until it receives the branch target instruction.

#### 9.3.5 Memory

The memory access phase of the DLX pipeline is responsible for interfacing with the system memory external to the processor. In this case we define the memory as a RAM with an output 32-bit data bus, an input 32-bit data bus, and an input 32-bit address bus with an additional signalling bit for specifying a read or a write.

```
1 type Data = Vector32 Bit
2 type Address = Vector32 Bit
3 type MemI = (Data, Maybe (Opcode, Reg, RegVal, RegVal))
4 type MemO = (Maybe Data, Address, Stall, Flush,
5              Maybe (Reg, RegVal), Maybe PC)
6 mem :: ReT MemI MemO I ()
```

Listing 9.5: Types for the memory access phase given in Haskell.

The input type to the memory access phase is represented in Listing 9.5 on line 3 by MemI consisting of a tuple of the output data bus from a memory module as well as a potential instruction to process. The memory unit acts upon memory access instructions, but is also responsible for calculating jumps and branches. Instructions not within these two groups are fed forward to the writeback phase of the pipeline in the same way as NOP instructions.

The output from the device, specified on lines 4-5 of Listing 9.5, is a 6-tuple consisting of a number of signals and data outputs going to memory and to every other component in the pipeline. The first argument is a value of type Maybe Data. This argument encapsulates reading and writing. If it is Nothing, this signifies a read operation, as no data is sent to the attached memory unit. If the argument is a Just, its argument is the data to be written to memory. Encoding a Maybe is equivalent to using a separate read/write signalling bit, because ReWire only needs one bit to encode the two different constructors of a Maybe type, and that bit is appended to the encoding of the type variable of Maybe (Data here). The second argument of the output tuple is the address to be read from or written to. The Stall and Flush bits are signals to the previous devices to stall or drop current work. A high stall bit indicates that work should be held. A high flush bit indicates that a branch is to be taken and all work aside from the delay slot should be ignored. The other devices accomplish this by yielding work equivalent to a NOP instruction when receiving a signal to flush. The fifth argument is a register and register value pair. When this value is populated (not Nothing), the writeback stage will save this value to the indicated register in the register file. The final argument is the value to set the program counter to in the event of a jump or branch. If the memory access unit encounters a jump or a branch that is to be taken, this value is populated with the branch or jump target.
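
The equivalence between a Maybe Data value and a data bus with a separate read/write bit can be made concrete. The sketch below is only an illustration of the idea, not the encoding emitted by the ReWire compiler: `Word32` stands in for Vector32 Bit, a `Bool` stands in for the single constructor-tag bit, and the names `encodeRW` and `decodeRW` are hypothetical.

```haskell
import Data.Word (Word32)

-- Assumption: a 32-bit machine word stands in for Vector32 Bit.
type Data = Word32

-- | Flatten a Maybe into a one-bit constructor tag plus a payload.
-- The tag is True for Just (a write) and False for Nothing (a read).
encodeRW :: Maybe Data -> (Bool, Data)
encodeRW Nothing  = (False, 0)
encodeRW (Just d) = (True, d)

-- | Recover the Maybe from the tag and payload.
-- The payload is ignored when the tag says "read".
decodeRW :: (Bool, Data) -> Maybe Data
decodeRW (False, _) = Nothing
decodeRW (True, d)  = Just d
```

Because `decodeRW . encodeRW` is the identity, nothing is lost by carrying the read/write decision inside the Maybe rather than on a dedicated wire.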

The memory component of the pipeline assumes that a read instruction will require two clock cycles to complete: one cycle to signal the memory unit and one cycle to receive the result. We choose this form because it is both reasonable and amenable to expression using reactive resumptions in ReWire with signal, i.e., three signals: two with stalled (NOP) outputs and the third with the output containing the read result from memory. The delay for a write instruction is one cycle. This is less than a read because the processor doesn't need to wait on a result from the memory to continue processing. The implementation of our processor does not depend on the admittedly high requirements we place on our memory modules with regard to speed. If we were to change the number of cycles required to wait on loads and stores, the only changes we would need to make to our processor would be in the number of cycles for which stalls are signalled.

#### 9.3.6 Writeback

The writeback phase of the DLX pipeline is responsible for managing updates to the register file. In our implementation, the writeback receives a value that is populated by a register-value pair as input. This pair indicates what value to assign to the specified register as an update (if the value is not Nothing). We note again that the register R0 is fixed to a zero value and cannot be altered.

```
type WriteBackI = Maybe (Reg,RegVal)
type WriteBackO = RegFile
writeback :: ReT WriteBackI WriteBackO I ()
```

Listing 9.6: Types for the writeback phase given in Haskell.
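
The update rule of the writeback phase can be sketched as a pure function. This is an illustrative model under stated assumptions rather than the code of our device: registers are identified by an `Int` index, `Word32` stands in for Vector32 Bit, the register file is a plain 32-element list, and `writeReg` and `emptyRegs` are hypothetical names.

```haskell
import Data.Word (Word32)

type RegVal  = Word32     -- assumption: stands in for Vector32 Bit
type RegFile = [RegVal]   -- assumption: 32 entries indexed by register number

-- | Apply an optional register update. Writes to register 0 are discarded
-- so that R0 always reads as zero.
writeReg :: Maybe (Int, RegVal) -> RegFile -> RegFile
writeReg Nothing       rf = rf
writeReg (Just (0, _)) rf = rf
writeReg (Just (r, v)) rf = [ if i == r then v else x | (i, x) <- zip [0 ..] rf ]

-- | A register file with every register cleared.
emptyRegs :: RegFile
emptyRegs = replicate 32 0
```

In the device itself this function would be applied once per cycle to the incoming WriteBackI value, with the resulting register file signalled as the WriteBackO output.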

## 9.4 Composing the DLX Processor

Our DLX processor implementation is composed of the subcomponents described in the previous section. We construct the whole processor by using the Connect Logic primitives refold and parI. We begin by specifying the processor's type in Listing 9.7.

```
dlx :: ReT (Data, Instruction) (Address, Maybe Data, PC) I ()
```

Listing 9.7: The type of the DLX processor device

The dlx device is our top level device that we ultimately want to synthesize. For this reason, we keep the types of the inputs, outputs, and failure (here we use unit, since failure is not used) monomorphic even though failure could remain polymorphic. The inputs of our DLX processor are a pair consisting of Data values from memory read operations and Instruction values from program memory for instruction fetches. The output of the processor is a triple consisting of an Address for memory operations, a Maybe Data signalling a potential read or a write with a value, and a program counter value for reading the next fetched instruction from program memory. This type outlines the goal of what remains of our work. We need to compose our subcomponents in a way that is true to the behavior of the DLX specification and that follows the type laid out at the top level. The following subsections illustrate the necessary steps to reach this end.

#### 9.4.1 Parallelizing and Connecting Devices

The first step is to construct an intermediate device that consists of all components placed in parallel. We do this by using the parallel operator and we illustrate this step in Listing 9.8.

```
1  type InterI = (
2    (Stall, Instr, NewAdd),                           -- Fetch
3    (Stall, Instr, RegFile, RegVal),                  -- Decode
4    (Stall, Opcode, Flush, RegDest, RegVal, RegVal),  -- Execute
5    (Data, Maybe (Opcode, Reg, RegVal, RegVal)),      -- Memory
6    Maybe (Reg, RegVal))                              -- Writeback
7  type InterO = (
8    (NextInst, Instr, PC),                            -- Fetch
9    (Opcode, Reg, (Reg, RegVal), (Reg, RegVal)),      -- Decode
10   (Maybe (Opcode, RegDest, RegVal, RegVal)),        -- Execute
11   (Maybe Data, Address, Stall,
12    Flush, Maybe (Reg, RegVal), Maybe PC),           -- Memory
13   RegFile)                                          -- Writeback

16 dlx_inter :: ReT InterI InterO I ()
17 dlx_inter = fetch
18             <&> decode
19             <&> execute <&> memory <&> writeback
```

Listing 9.8: Constructing the intermediate ReWire device.

Lines 1-13 of Listing 9.8 show the input and output types of the devices combined using the parallel operator. We illustrate the types as flat tuples, which are isomorphic to the nested structure that would be produced by successive applications of the parallel combinator. The combined dlx\_inter device is akin to an un-wired device that has output and input ports open, but unconnected. We connect these ports in the next steps by defining connecting functions for each device based on the combined device type and the types of the top level DLX processor defined earlier.

```
1  type DLXI = (Data, Instruction)
2  type DLXO = (Address, Maybe Data, PC)

4  connFetch   :: InterO -> DLXI -> (Instr, NewAdd)
5  connDecode  :: InterO -> DLXI -> (Instr, RegFile, RegVal)
6  connExecute :: InterO -> DLXI -> (Opcode, Flush, RegDest, RegVal, RegVal)
7  connMem     :: InterO -> DLXI ->
8                   (Data, Maybe (Opcode, Reg, RegVal, RegVal))
9  connWB      :: InterO -> DLXI -> RegFile

11 dlxOut      :: InterO -> DLXO

13 dlx :: ReT DLXI DLXO I ()
14 dlx = refold
15         dlxOut
16         (\interO -> \dlxI -> let f = connFetch interO dlxI
17                                  d = connDecode interO dlxI
18                                  e = connExecute interO dlxI
19                                  m = connMem interO dlxI
20                                  w = connWB interO dlxI
21                              in (f,d,e,m,w))
22         dlx_inter
```

Listing 9.9: Connective functions for each pipelining phase of the DLX processor.

We re-introduce the input and output types of the top level DLX device as DLXI and DLXO on lines 1 and 2 of Listing 9.9. These types, along with the intermediate dlx\_inter output types, appear in the typing of the five different connecting functions on lines 4-9. The definitions of these functions are omitted. Trivial connection functions construct their output by selecting the corresponding members of the input provided to them. No additional computation is performed by a trivial connection function and they are akin to "wiring". We will discuss particular non-trivial functions in a later section and note here that the non-trivial functions are different from trivial ones because they facilitate register value forwarding to account for data flow hazards in the pipeline. The dlxOut function on line 11 selects the values from the InterO type to comprise the output for the whole processor. These are the address for the memory address bus, the read/write Maybe Data type and the program counter for reading the next instruction from program memory.

The top level device representing the whole processor is the dlx definition that appears on lines 13-22. We refold over the dlx\_inter device by providing the output selection function dlxOut as the output modification function. An anonymous function constructs the input for dlx\_inter by using each connection function to first construct the input for each individual stage in the let bindings on lines 16-20. The input is the tuple constructed on line 21.
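
The pattern of placing devices in parallel and then wiring them with refold can be illustrated on a toy two-stage pipeline. The sketch below is a deliberately simplified model and not the ReWire implementation of these operators: it drops the inner monad and termination, so a device is just a current output plus a function from the next input to the next device. The type `Dev`, the register device `reg`, the simulator `run`, and `pipeline` are hypothetical names; `refold` and `<&>` are re-defined here for this simplified type only.

```haskell
-- Simplified device: a current output and a reaction to the next input.
-- Assumption: no inner monad and no termination, unlike ReT.
data Dev i o = Dev o (i -> Dev i o)

-- | Rewire a device: map its output, and build its input from its own
-- previous output together with the new external input.
refold :: (o1 -> o2) -> (o1 -> i2 -> i1) -> Dev i1 o1 -> Dev i2 o2
refold fo fi (Dev o k) = Dev (fo o) (\i2 -> refold fo fi (k (fi o i2)))

-- | Place two devices side by side with paired inputs and outputs.
(<&>) :: Dev i1 o1 -> Dev i2 o2 -> Dev (i1, i2) (o1, o2)
Dev o1 k1 <&> Dev o2 k2 = Dev (o1, o2) (\(i1, i2) -> k1 i1 <&> k2 i2)

-- | A one-cycle register: emits what it stored, then stores its input.
reg :: a -> Dev a a
reg x = Dev x reg

-- | Two registers in parallel, wired so that stage 1 feeds stage 2 and
-- only the output of stage 2 is visible externally.
pipeline :: Dev Int Int
pipeline = refold snd (\(o1, _) x -> (x, o1)) (reg 0 <&> reg 0)

-- | Collect the initial output and one output per supplied input.
run :: Dev i o -> [i] -> [o]
run (Dev o _) []       = [o]
run (Dev o k) (i : is) = o : run (k i) is
```

Here the output-selection function (`snd`) plays the role of dlxOut, and the anonymous wiring function plays the role of the connection functions: `run pipeline [1,2,3]` evaluates to `[0,0,1,2]`, so each input emerges two cycles after it enters.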

#### 9.4.2 Considering and Mitigating Pipelining Hazards

Pipelined architectures do not come without drawbacks. Aside from the latency introduced by multiple pipeline stages, the more critical issue that arises from pipelining a processor comes in the form of hazards. We describe the hazards applicable to our implementation as they appear in the Hennessy and Patterson text [81] and discuss our remedies to these hazards in our implementation, as well as how they relate to the connection functions described in Listing 9.9.

#### Data Hazards

Data flow hazards arise when an earlier stage of the pipeline depends on a value produced at a later stage, but that value hasn't yet reached the register file via the writeback (final) phase of the pipeline. The hazard creates a window of subsequent instructions in the pipeline that require a "forward glance" to the results of their prior instructions at different phases. Supplying those results ahead of writeback is called *register forwarding* [81].

```
                  Just (frd, frv) -> case otro of
                    (rd, rv) -> case frd == rd of
                      H -> frv
                      L -> rv

connExecute :: InterO -> DLXI ->
               (Stall, Opcode, Flush, RegDest, RegVal, RegVal)
connExecute (_, (dcOp, dcDreg, regA, regB),
             aluO, _, _, stall, flush, mbWbReg, _, _) =
  (stall, dcOp, flush, dcDreg,
   fwd2 aluO mbWbReg regA,
   fwd2 aluO mbWbReg regB)
```

Listing 9.10: Functions for forwarding register values.

Listing 9.10 illustrates our approach for forwarding written register values not yet written to the register file back to the execute phase of the processor pipeline. We utilize two functions fwd2 and fwd1 to facilitate forwarding. Values for assignment are computed at the execute or memory phases of the pipeline by either computation from the ALU or reads from memory. The freshest (or most recently written) place to find a potential value of a register is in the output of the execute phase. The second freshest place is the output of the memory access phase. Lowest priority of register freshness is given to the register file, which is read at the decode phase.

Our forwarding functions work by checking whether either of the register-value pairs emitted by the decoder overlaps with a register assignment emitted by the execute phase. If this is the case, we replace the incoming register value from the decoder with the value emitted by the execute phase. This is performed in fwd2. If there isn't an overlap with the output of the execution phase, we check for an overlap with the output of the memory access phase in fwd1. If neither function discovers a match, the value from the decode phase is fed through. Note that after the forwarding test, source register information is no longer required and none of the subsequent components accept it as input. This reduces the complexity of the post-decode phases and reduces resource utilization post-synthesis.
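
The priority order described above (execute output first, then memory output, then the decoded value) can be sketched as follows. This is a simplified illustration rather than the definitions from Listing 9.10: registers are `Int` indices, `Word32` stands in for Vector32 Bit, and the execute-phase output is assumed to have already been reduced to an optional register-value pair instead of the full ALU output tuple.

```haskell
import Data.Word (Word32)

type Reg    = Int      -- assumption: register index instead of the Reg data type
type RegVal = Word32   -- assumption: stands in for Vector32 Bit

-- | Second-priority forwarding: take the memory-phase value if it targets
-- the same register, otherwise keep the value read at decode.
fwd1 :: Maybe (Reg, RegVal) -> (Reg, RegVal) -> RegVal
fwd1 memOut (rd, rv) = case memOut of
  Just (frd, frv) | frd == rd -> frv
  _                           -> rv

-- | First-priority forwarding: take the execute-phase value if it targets
-- the same register, otherwise fall back to fwd1.
fwd2 :: Maybe (Reg, RegVal) -> Maybe (Reg, RegVal) -> (Reg, RegVal) -> RegVal
fwd2 exOut memOut src@(rd, _) = case exOut of
  Just (frd, frv) | frd == rd -> frv
  _                           -> fwd1 memOut src
```

For example, `fwd2 (Just (1, 10)) (Just (1, 20)) (1, 30)` selects 10, the freshest value, while `fwd2 Nothing Nothing (1, 30)` passes the decoded value 30 through unchanged.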

#### Control Flow Hazards

Control flow hazards occur in pipelined architectures when branching occurs, but the pipeline is saturated with instructions that occur after the branching instruction and shouldn't be executed. Handling a control flow hazard is in essence making sure that these instructions have no effect on the machine. In our processor implementation, we manage this by directing the execute phase to replace these invalid instructions in the pipeline with something equivalent to a NOP. This is, in essence, a small state machine: when the Flush bit (output from the memory phase) is high, the execute phase drops the instructions that are invalidated by the branch.

#### 9.4.3 Delay Slot Implementation

We don't discard *all* extra instructions in the pipeline, however. One instruction after a branch in DLX is considered a valid instruction known as the *delay slot* instruction. Listing 9.11 demonstrates a section of DLX assembly code that shows an instruction  $i_0$  occupying the delay slot on line 3.

```
1 START:
2   BEQZ r0, TARGET ; branchIns
3   i_0
```

Listing 9.11: DLX assembly code illustrating the appearance of a delay slot instruction on line 3.

A timing diagram in Figure 9.1 illustrates the execution of the code in Listing 9.11 in our DLX pipeline. The lines represented are the clock CLK, the various input lines to each execution phase of the processor, and a selected flushing bit signal FLUSH. The state of the processor can be interpreted by viewing the timing diagram in a column-wise fashion. We abstract the particular values of each input in this diagram. The inputs of each execution phase represent an instruction as it moves along the pipeline. The diagram begins at the first upward clock cycle with the branch instruction reaching the input of the memory phase. In the subsequent clock cycle, it has been determined that the branch is to be taken, so the FLUSH signal is raised to high, notifying the execution phase that the pipeline currently contains invalid instructions and that it should "drop" the next three instructions it receives on the pipeline. The execution phase of the processor inserts NOP instructions in the pipeline for subsequent pipeline phases. This ensures that no effects will occur from the invalid instructions currently in the pipeline. This process is referred to by Hennessy and Patterson as "bubbling" [81].



Figure 9.1: Timing diagram illustrating branching and delay slots. Pipeline state can be read column-wise.

Instruction $i_0$, however, has already proceeded past the execute device before bubbling begins. If we did not have a delay slot, the memory phase would have to be extended to ignore the instruction in this circumstance. For a pipelined processor such as ours, it appears that allowing for a single delay slot actually decreases pipeline complexity! Indeed, it spares us from altering the design of the memory access phase in our implementation and lets us focus only on adding extra states to the execute phase.

```
...non-flushing cases ...
```

Listing 9.12: Haskell code for flushing the pipeline in the execution phase of the ReWire DLX processor implementation.

Adding flushing or bubbling functionality to our execute phase is a fairly straightforward process. Listing 9.12 shows the code of our execution stepping function that checks whether or not the flush bit is high. If the bit is high, then we signal Nothing three times, using the input provided by the third signal to continue evaluation. We note that adding this type of stalling functionality adds three "dead states" to the state machine that comprises the exec device. This comes at a cost of (potentially) more flip flops and additional circuit complexity to facilitate these extra state transitions. We localize this additional complexity in one device rather than spreading it across the other stages, which keeps the code and types of the other devices simple.
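
The "dead states" can be pictured with an explicit counter. The sketch below is a pure approximation and not the code of Listing 9.12: instead of three consecutive calls to signal in ReT, it tracks how many bubbles remain in an `Int` state and is driven with `mapAccumL`. The names `execStep` and `runExec`, and the use of `String` for instructions, are assumptions made only for this illustration.

```haskell
import Data.List (mapAccumL)

type Instr = String   -- assumption: instructions are just labels here

-- | One cycle of a flushing execute phase. The state counts the bubbles
-- still owed; while it is positive, inputs (including flush) are ignored.
execStep :: Int -> (Bool, Instr) -> (Int, Maybe Instr)
execStep n (flush, ins)
  | n > 0     = (n - 1, Nothing)   -- inside the bubble: drop the input
  | flush     = (2, Nothing)       -- first of three bubbles
  | otherwise = (0, Just ins)      -- normal operation

-- | Run the phase over a finite input trace, starting with no bubbles owed.
runExec :: [(Bool, Instr)] -> [Maybe Instr]
runExec = snd . mapAccumL execStep 0
```

Feeding the trace `[(False,"a"), (True,"b"), (False,"c"), (False,"d"), (False,"e")]` to `runExec` yields `[Just "a", Nothing, Nothing, Nothing, Just "e"]`: the instruction that arrives with the flush and the two after it are replaced by bubbles, and normal execution resumes with the fourth.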

#### 9.4.4 Stalling Functionality

Certain DLX instructions take longer than one cycle to execute at a single stage. In our implementation, these instructions are memory access instructions at the memory device. Read instructions take two cycles to complete, for example. To ameliorate the *structural hazard* of resource contention on the memory device, we must stall the pipeline so subsequent instructions cannot access the device while it is still in use by a previous instruction. We accomplish this by adding stall support in each processor phase prior to the memory phase by adding output memoization to each device's stepping function.

```
fetch_step :: FetchO -> FetchI -> ReT FetchI FetchO I ()
fetch_step o i = case i of
                   (H, _, _) -> do
                                  i' <- signal o
                                  fetch_step o i'
                   ..non-stalling code..

decode_step :: DecodeO -> DecodeI -> ReT DecodeI DecodeO I ()
decode_step o i = case i of
                    (H, _, _, _) -> do
                                      i' <- signal o
                                      decode_step o i'
                    ..non-stalling code..

execute_step :: ExecuteO -> ExecuteI -> ReT ExecuteI ExecuteO I ()
execute_step o i = case i of
                     (H, _, _, _, _, _) -> do
                                             i' <- signal o
                                             execute_step o i'
                     ..non-stalling code..
```

Listing 9.13: Stepping functions with output memoization for stalling.

We refer to stepping functions as functions that allow us to accept input and transition (or step) to another state in a state machine. This is similar to the kind of transitions seen in Moore machines. Stepping functions for Reactive Resumptions of the type ReT i o m a have the type i -> ReT i o m a. For stalling devices, we require an extra argument to the stepping function to remember the previous cycle's output when evaluating the current cycle's input for a stall. If a stalling device receives a signal to stall, the previous cycle's output is re-used and saved for the next cycle (by passing it again as the first argument to our memoizing stepping function), where we evaluate the next cycle's input for a stall. The stalling function of each stalling portion of our DLX implementation appears in Listing 9.13. In each definition, there is a case for acting on a stall signal. We omit non-stalling cases. In each case, we signal with the previous cycle's output (saved as the first argument) and save the output again for use in the next cycle by supplying it as an argument to a tail-recursive call to the stepping function.
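
To make output memoization concrete, the following is a small self-contained model of a stalling stage. It assumes a conventional definition of the reactive resumption monad and of signal, written out here so the example runs in plain Haskell; it is a sketch and not the ReWire library code. The toy stage `stageStep` (which doubles its input), the start-up wrapper `device`, and the driver `simulate` are hypothetical, and `Bool` stands in for the H/L stall bit.

```haskell
import Data.Functor.Identity (Identity (..))

-- Assumed shape of the reactive resumption monad transformer.
newtype ReT i o m a = ReT { deReT :: m (Either a (o, i -> ReT i o m a)) }

instance Monad m => Functor (ReT i o m) where
  fmap f m = m >>= (return . f)

instance Monad m => Applicative (ReT i o m) where
  pure a    = ReT (return (Left a))
  mf <*> ma = mf >>= \f -> ma >>= \a -> return (f a)

instance Monad m => Monad (ReT i o m) where
  ReT m >>= f = ReT $ do
    r <- m
    case r of
      Left a       -> deReT (f a)
      Right (o, k) -> return (Right (o, \i -> k i >>= f))

-- | Emit an output and wait for the next input.
signal :: Monad m => o -> ReT i o m i
signal o = ReT (return (Right (o, return)))

type Stall = Bool   -- assumption: True plays the role of H

-- | Memoizing stepping function: the first argument is the previous output.
stageStep :: Int -> (Stall, Int) -> ReT (Stall, Int) Int Identity ()
stageStep o i = case i of
  (True, _) -> do            -- stalled: repeat the previous output
    i' <- signal o
    stageStep o i'
  (False, x) -> do           -- not stalled: compute and remember a new output
    let o' = 2 * x
    i' <- signal o'
    stageStep o' i'

-- | The stage started with an initial output of 0.
device :: ReT (Stall, Int) Int Identity ()
device = signal 0 >>= stageStep 0

-- | Collect the initial output and one output per supplied input.
simulate :: ReT i o Identity a -> [i] -> [o]
simulate (ReT (Identity r)) is = case r of
  Left _       -> []
  Right (o, k) -> o : case is of
    []        -> []
    (i : is') -> simulate (k i) is'
```

With the inputs `[(False,1), (True,9), (True,9), (False,3)]`, `simulate device` produces `[0,2,2,2,6]`: the output 2 is held for the two stalled cycles, and every recursive call to `stageStep` follows a `signal`.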

## 9.5 Testing

The ReWire language is a subset of the Haskell programming language and as such device specifications written in ReWire can be tested in Haskell. We validate our DLX implementation by testing it in a Haskell-hosted runtime.

#### 9.5.1 A Haskell Test Bench

Testing ReWire devices in Haskell is akin to setting up a test bench in a popular VLSI testing suite, except that Haskell provides us with a more powerful and more expressive substrate to work with. The process of establishing a testing suite for a ReWire device begins with the type of the device "under test".

```
5  type I = IO

8  runReacT :: Monad m => ReacT input output m a ->
9                (output -> m input) -> m a
10 runReacT = ...

12 -- Top level DLX device
13 dlx :: ReT (Data, Instruction) (Address, Maybe Data, PC) I ()
14 dlx = ...

16 -- A Memory stepping function
17 memory :: (Address, Maybe Data) -> IOUArray Int32 Word8 ->
18             ReT (Address, Data) (Vector32 Bit) I ()
19 memory = ...

21 mem4096 :: ReT (Address, Maybe Data) (Vector32 Bit) IO ()
22 mem4096 = do
23             arr <- lift (newArray (0,4096) 0)
24             inp <- signal 0
25             memory inp arr

27 loadMem :: FilePath -> ReT (Address, Maybe Data) (Vector32 Bit) IO ()
28 loadMem f = do
29               bs <- lift (BS.readFile f)
30               let bs' = BS.unpack bs
31               arr <- lift (newListArray (0::Int32, fromIntegral (length bs' - 1)) bs')
32               inp <- signal 0
33               memory inp arr

35 program :: FilePath -> ReT () RegFile IO ()
36 program f = do
37               let prog = refold
38                            id
39                            (\_ i -> (i, Nothing))
40                            (ramFromFile f)
41               let paired = parI mem65535 (parI (prog) (dlx_testreg))
42               refold devout devinp paired
43   where
44     devout (ramout, (progout, (nextinst, rwdata, addr, rf))) =
45        (nextinst, addr, rwdata, rf)
46     devinp (ramout, (progout, (nextinst, rwdata, addr, rf))) () =
47        ((addr,rwdata),(nextinst, (progout,ramout)))

49 test :: FilePath -> IO ()
50 test f = runReacT (program f) (\regfile -> do
51                       .. analyze and report processor state here..
52                       )
```

Listing 9.14: The top level DLX device type for testing

Listing 9.14 illustrates a test bench for the ReWire DLX implementation. To make testing with files and terminal I/O easier, we begin by swapping out the underlying Identity monad (I) in the ReWire DLX stack for the IO monad. Haskell uses the IO monad to represent side-effecting I/O operations, which include file operations and other standard operations such as printing to and reading from the command line. Our implementation makes no use of its inner monad stack because it consists only of Identity, so making this alteration for testing doesn't affect any of our components. The inner monad is given by I, which is treated as a type synonym for IO on line 5. The other components required for testing are memory banks for system memory and for program memory. We provide the type of our run function for Reactive Resumptions on line 8. This function allows us to operate on outputs from the device during test and represent them for observation in the IO monad. On line 17 we provide the type of the stepping function for a memory device (definition omitted) that is readable and writable. This device is a reactive wrapper over an unboxed Haskell array that emulates a byte-addressable memory bank. The device accepts an Address to read or write and a Maybe Data specifying data to write, if any. We use this memory simulator for both system memory and program memory for DLX, but we only allow reads from program memory. Lines 21-25 define a small memory device for testing. The loadMem function on lines 27-33 produces a memory device that is populated with values from a file. This enables us to create program memory for assembled DLX programs using third-party assemblers.

The program function on lines 35-42 creates a sealed system around our processor that accepts the Unit type as input and yields the state of the register file as output. We yield the register file as output here to easily inspect the values of the registers in test. The input of type Unit is akin to a hardware device receiving a clock pulse signalling it to step forward. The top level testing function takes a program file path as an argument and simulates a processor run, supplying a user-provided inspection function that produces the next input in IO. This function is executed for every cycle of the device. In the case of this test bench, we inspect the register file at every clock cycle. More elaborate tests could select additional values for inspection, including specific registers, memory locations, signals between processor stages, and any other value yielded from a reactive device contained in the processor. We stay with the top level signals in this example, but with some tweaks to the underlying implementation, a programmer can expose any signal for top level evaluation in test.
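
The role of the run function in the test bench can be sketched independently of DLX. The code below is an assumption-laden illustration, not the definition of runReacT elided in Listing 9.14: it assumes the conventional shape of ReT, bounds the run by a cycle count so that a non-terminating device can be observed, and drives a toy `counter` device. The names `runFor` and `counter` are hypothetical.

```haskell
-- Assumed shape of the reactive resumption monad transformer.
newtype ReT i o m a = ReT { deReT :: m (Either a (o, i -> ReT i o m a)) }

-- | Run a device for at most n cycles. Each output is handed to the
-- inspection function, whose result becomes the device's next input.
-- Returns Just the final value if the device terminates within n cycles.
runFor :: Monad m => Int -> ReT i o m a -> (o -> m i) -> m (Maybe a)
runFor n (ReT m) inspect
  | n <= 0    = return Nothing
  | otherwise = do
      r <- m
      case r of
        Left a       -> return (Just a)
        Right (o, k) -> do
          i <- inspect o
          runFor (n - 1) (k i) inspect

-- | A toy sealed device: takes unit as its "clock pulse" and emits a count.
counter :: Monad m => Int -> ReT () Int m a
counter n = ReT (return (Right (n, \() -> counter (n + 1))))
```

In IO, `runFor 3 (counter 0) print` prints the outputs of the first three cycles, in the same way that the test function of Listing 9.14 inspects the register file each cycle. Because `runFor` is polymorphic in the inner monad, the same device can also be run against a pure logging monad when no I/O is wanted.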

Using Haskell as a substrate for testing devices is immensely helpful, especially when validating a device using third party tools. In this case, we utilized an assembler found online to test the behavior of the processor to ensure it complied with expected norms by observing register values when doing normal arithmetic, loading and storing to memory, and executing branches.

## 9.6 Synthesizing the Design

Our DLX implementation was synthesized to an FPGA by converting the specification to VHDL. The process is straightforward and we note the stages involved in synthesizing the device here.

#### 9.6.1 Proper Compilable ReWire

ReWire that is synthesizable to hardware needs to follow the conventions established by Procter [3]. That is, all pure functions are prohibited from being recursively defined. Functions in ReT (called stepping functions in this chapter) are allowed to be tail recursive, but only if the tail calls are guarded. Guarded tail calls are tail calls following a call to signal in a monadic expression. Anecdotally, these restrictions, while perhaps unusual to adhere to at first, become entirely natural when writing device specifications with consideration to clocked behavior. The core ReWire language is a subset of the whole Haskell syntax. At the time of this writing, we eschew some of the more sugar-y aspects of Haskell to simplify parsing and compilation (whitespace rule, irrefutable patterns in let-bindings, Haskell-style do-notation, etc.).

It is feasible to write Haskell in this impoverished style, however. Doing so can speed the process to testing on hardware, but we acknowledge this comes at a cost to some expressive freedom at the source level.

All DLX code was written in the ReWire-style when being implemented and tested in Haskell. Migrating the code to ReWire requires a few minor tweaks, but is a fairly painless process. We took care to write the device specifications so they adhered to ReWire's guardedness and recursion requirements so these parts of the code would not need to be rewritten during the code migration process.

Connect Logic gives us a kernel of operators that act on devices typed in ReT. It is allowable to write non-primitive functions composed of these calls, but these functions must be inlined before compilation. Functions typed in ReT that are not called from a tail position in a larger expression typed in ReT must also be inlined. The ReWire compiler provides an interactive REPL phase where the user can invoke commands to inline and reduce (beta-reduce) expressions in ReWire. Functions that must be discharged through this process can be thought of as user-directed macros. Currently, this process is done by hand, and automated approaches to discharging these macros are left to future work on the ReWire compiler.

#### 9.6.2 Back end Primitives

There are many functions that, while feasible to write in ReWire, are better deferred to synthesis tools to generate. Such features generally include arithmetic and logical functionality. ReWire includes a vhdl keyword, similar to Haskell's foreign keyword, that one can use to declare the existence of VHDL functions in the back end and treat them as first-class ReWire functions. FPGA synthesis tools carry significant insight for optimizing functions (such as addition) for the targetable devices. We opt to use standard VHDL functions over "home-grown" ReWire implementations where possible and practical. The majority of the functionality of the arithmetic logic unit in our DLX implementation relies on VHDL standard functions. We use ReWire to facilitate how these functions are applied, but we treat the functions themselves as black boxes when composing our specifications.

#### 9.6.3 Synthesis Results

ReWire compiled to VHDL was synthesized using the Xilinx ISE development environment. We targeted a Kintex-7 family **xc7k160t-3fbg676** chip. The results of the synthesis are given in Table 9.4.

| Emphasis | Effort | LUT Usage | Flip Flops | Maximum Frequency $f_{max}$ (MHz) |
|----------|--------|-----------|------------|-----------------------------------|
| Speed    | Normal | 26928     | 7371       | 106.200                           |
| Speed    | High   | 26928     | 7371       | 106.200                           |
| Area     | Normal | 26928     | 7367       | 105.076                           |
| Area     | High   | 26928     | 7367       | 105.076                           |

Table 9.4: FPGA synthesis results for our DLX implementation.

Synthesis runs covered every combination of normal and high effort with speed and area emphasis. As shown in Table 9.4, the results are essentially identical in all cases: the effort level has no effect, and the difference between speed and area emphasis is negligible (four flip flops and roughly 1.1 MHz). The LUT usage amounted to about 26% of the chip's available LUTs, and the flip flop usage to about 3% of the chip's available flip flops. Much of the LUT usage is likely due to the space requirements of our arithmetic logic unit in the execute phase of the processor. Implementing the full DLX ISA would result in a significant increase in area usage. ReWire currently does not utilize chip-specific features, such as DSP slices, that could reduce space requirements. Work to enhance the ReWire compiler in this area will be FPGA family specific and perhaps chip specific. We leave these enhancements to future work.
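The utilization percentages can be checked with a short calculation. The device totals below (101,400 LUTs and 202,800 flip flops for the XC7K160T) are an assumption taken from the Kintex-7 family overview, not from the synthesis report; the usage counts come from Table 9.4.

```haskell
-- Device totals for the XC7K160T: an assumption taken from the
-- Kintex-7 family overview, not from the synthesis report.
deviceLuts, deviceFlipFlops :: Int
deviceLuts      = 101400
deviceFlipFlops = 202800

-- Percentage of a resource consumed by a design.
utilization :: Int -> Int -> Double
utilization used total = 100 * fromIntegral used / fromIntegral total

lutPercent, ffPercent :: Double
lutPercent = utilization 26928 deviceLuts      -- LUT usage from Table 9.4
ffPercent  = utilization 7371 deviceFlipFlops  -- flip flops, speed emphasis
```

Under these assumed totals the figures come to roughly 26.6% and 3.6%, in line with the approximate values quoted above.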

#### 9.7 Conclusion

In this chapter we demonstrate various ReWire Connect Logic features and their application to building a pipelined processor implementing the DLX ISA. Previous work by Procter [3] illustrates ReWire's applicability to processor design with a Xilinx Picoblaze implementation in ReWire. This work goes a step further by using the Connect Logic primitives to implement a fully pipelined processor architecture with support for pipeline hazard avoidance. We demonstrate that ameliorating hazards is a straightforward process that is natural when using the Connect Logic primitives to compose devices. Furthermore, this process enables developers to incorporate popular software engineering techniques such as encapsulation and data hiding in natural ways when implementing and testing DLX and its subcomponents.

## Chapter 10

## Summary and Future Works

## 10.1 Summary of Results

This dissertation delivers three results. First, we selected and implemented primitive functions in ReWire to support composition and manipulation of device-level constructs. Second, we established a principle of modularity in ReWire built on those primitives. Lastly, we demonstrated novel design techniques that use ReWire and Connect Logic to implement sophisticated and efficient designs in hardware.

## 10.1.1 Connect Logic Primitives

In this dissertation, we introduced four primitive Connect Logic functions: refold, <&>, iter, and refoldT. These functions were implemented in Haskell, and the ReWire compiler has been extended to compile them as primitives. Prior to Connect Logic, functions taking Resumptions as arguments had been disallowed. In this work, we relax this restriction with primitives that give significant control over structuring devices and transforming their input and output values, and that provide a lifting of pure functions to synchronous devices and a method for controlling device stalling.
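The following is a minimal sketch of a refold-style combinator over a simplified reactive resumption type. The shape of `ReacT` follows the pattern matched by `extrude` in Listing B.1; the definition of `refold` given here is an illustration under that assumption rather than the dissertation's implementation, and `counter` and `runDev` are hypothetical helpers added for testing.

```haskell
import Data.Functor.Identity (Identity (..))

-- Simplified reactive resumption: a device either halts with a value or
-- emits an output and waits for the next input.
newtype ReacT i o m a =
  ReacT { deReacT :: m (Either a (o, i -> ReacT i o m a)) }

-- Rewrite a device's ports: fo maps each output, and fi computes the
-- inner input from the device's last output and the new outer input.
refold :: Monad m
       => (o1 -> o2) -> (o1 -> i2 -> i1)
       -> ReacT i1 o1 m a -> ReacT i2 o2 m a
refold fo fi (ReacT m) = ReacT $ do
  r <- m
  case r of
    Left a       -> return (Left a)
    Right (o, k) -> return (Right (fo o, \i -> refold fo fi (k (fi o i))))

-- Hypothetical example device: outputs the running sum of its inputs.
counter :: Monad m => Int -> ReacT Int Int m ()
counter n = ReacT (return (Right (n, \i -> counter (n + i))))

-- Hypothetical test harness: feed a list of inputs, collect the outputs.
runDev :: ReacT i o Identity a -> [i] -> [o]
runDev (ReacT m) is = case runIdentity m of
  Left _       -> []
  Right (o, k) -> case is of
    []         -> [o]
    (i : rest) -> o : runDev (k i) rest
```

With these definitions, `runDev (refold show (\_ b -> if b then 1 else 0) (counter 0)) [True, False, True]` yields `["0","1","1","2"]`: the wrapped device accepts Booleans and emits strings while the underlying counter is unchanged.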

#### 10.1.2 Modularity and Modules

Modularity in ReWire follows from Connect Logic. With Connect Logic in place, we establish a notion of modularity centered on synchronous devices, or Resumptions. This idea of modularity is similar to that seen in HDLs such as VHDL or Verilog. Connect Logic allows us to compose preexisting devices with other devices or pure functions, or to elevate pure functions to devices, creating complex systems that we can synthesize to hardware. Connect Logic reinforces best practices common in high-level programming languages, such as encapsulation and data hiding, by giving designers fine-grained control over device placement and input and output routing: a designer can introduce and hide ports when composing devices, for maximum flexibility and a better workflow.

## 10.1.3 Novel Designs with ReWire and Connect Logic

This work introduces a number of novel design techniques for hardware using ReWire with Connect Logic as a medium. We use ReWire as a target language for implementing efficient pattern matching for regular expression files while giving careful consideration to resource and performance implications by applying transformations at the intermediate level. We utilize Connect Logic with ReWire to construct a processor from pipeline stages, and implement a stalling pipeline from those stages with Connect Logic. We provide a number of implementations for important hardware and concurrency features, including concurrency idioms like mutexes and semaphores as well as transformations on devices that create redundant hardware at minimal overhead to the designer. These techniques exhibit encapsulation and data hiding: integrating concurrency into groups of devices requires very little knowledge about the devices themselves other than that they support a communication protocol with the synchronization logic. Redundancy transformations require no introspection into the devices they replicate and thus impose very little cost on the design process.
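To illustrate why a redundancy transformation needs no introspection, the sketch below triplicates a device and votes on its outputs. It models a device as a plain state-transition function rather than as a Resumption, so it is a simplified stand-in for the transformations described above; `Step`, `vote`, `tmr`, and `acc` are hypothetical names.

```haskell
-- A device modeled as a state-transition function (a Mealy machine).
type Step s i o = s -> i -> (s, o)

-- Majority vote over three replicated outputs. With no majority the
-- second output is returned, an arbitrary choice made for this sketch.
vote :: Eq o => o -> o -> o -> o
vote a b c
  | a == b || a == c = a
  | otherwise        = b

-- Triplicate a device: each copy keeps its own state, all copies see the
-- same input, and the outputs are voted on. The device is used only
-- through its step function.
tmr :: Eq o => Step s i o -> Step (s, s, s) i o
tmr step (s1, s2, s3) i = ((s1', s2', s3'), vote o1 o2 o3)
  where
    (s1', o1) = step s1 i
    (s2', o2) = step s2 i
    (s3', o3) = step s3 i

-- Example device: an accumulator that outputs its updated sum.
acc :: Step Int Int Int
acc s i = (s + i, s + i)
```

If one replica's state is corrupted, as in `tmr acc (5, 5, 99) 1`, the voted output is still `6`, the value produced by the two healthy replicas.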

In summary, the applications presented in this work demonstrate that, although hardware composition with Connect Logic is a novel process, it lets us reincorporate classical software engineering design concepts.

#### 10.2 Future Works

#### 10.2.1 Structural Metaprogramming With Connect Logic

In this dissertation we introduce a non-primitive function called pipeline which is, in essence, a macro or metaprogrammatic function that we must discharge before compilation because it is not a primitive. In fact, the Connect Logic implementation requires that the named Connect Logic primitives be the only functions that act upon Reactive Resumptions in ReWire. This limits our ability to reuse useful idioms when combining devices. While it is fairly easy to rewrite or inline instances of a pipeline function, there may exist more complicated structural idioms that we would wish to reuse. As such, it would be exceptionally useful to devise a way to enable functions that manipulate devices, but only in ways that rely on Connect Logic primitives.

Simple approaches to this problem involve inlining any function that makes calls to Connect Logic primitives, but this may not be enough. Work needs to be done to establish the formalisms of Connect Logic and how they extend ReWire so that we can easily express these structural macros without hand-inlining them at compile time. This work would likely include type-driven syntactic analysis of Connect Logic expressions.
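The kind of structural macro in question can be sketched as a function built only from a parallel combinator and a refold-style combinator. Everything below is an illustrative model: `par` stands in for <&>, `pipeline2` is a hypothetical two-stage analogue of pipeline, and neither definition is taken from the ReWire sources.

```haskell
import Data.Functor.Identity (Identity (..))

-- Simplified reactive resumption: halt with a value, or emit an output
-- and wait for the next input.
newtype ReacT i o m a =
  ReacT { deReacT :: m (Either a (o, i -> ReacT i o m a)) }

-- Rewrite a device's ports (see the refold primitive).
refold :: Monad m
       => (o1 -> o2) -> (o1 -> i2 -> i1)
       -> ReacT i1 o1 m a -> ReacT i2 o2 m a
refold fo fi (ReacT m) = ReacT $ do
  r <- m
  case r of
    Left a       -> return (Left a)
    Right (o, k) -> return (Right (fo o, \i -> refold fo fi (k (fi o i))))

-- Run two devices in lockstep. Halting as soon as either device halts
-- is an assumption of this sketch.
par :: Monad m
    => ReacT i1 o1 m a -> ReacT i2 o2 m a -> ReacT (i1, i2) (o1, o2) m a
par (ReacT l) (ReacT r) = ReacT $ do
  x <- l
  y <- r
  case (x, y) of
    (Right (o1, k1), Right (o2, k2)) ->
      return (Right ((o1, o2), \(i1, i2) -> par (k1 i1) (k2 i2)))
    (Left a, _) -> return (Left a)
    (_, Left a) -> return (Left a)

-- Structural macro: route the first stage's output into the second
-- stage and expose only the second stage's output.
pipeline2 :: Monad m => ReacT i x m a -> ReacT x o m a -> ReacT i o m a
pipeline2 a b = refold snd (\(x, _) i -> (i, x)) (par a b)

-- Hypothetical registered stage: outputs f applied to the previous input.
stage :: Monad m => (i -> o) -> o -> ReacT i o m a
stage f o = ReacT (return (Right (o, \i -> stage f (f i))))

-- Hypothetical test harness: feed a list of inputs, collect the outputs.
runDev :: ReacT i o Identity a -> [i] -> [o]
runDev (ReacT m) is = case runIdentity m of
  Left _       -> []
  Right (o, k) -> case is of
    []         -> [o]
    (i : rest) -> o : runDev (k i) rest
```

Running `runDev (pipeline2 (stage (+1) 0) (stage (*2) 0)) [1, 2, 3]` gives `[0,0,4,6]`: each input emerges two cycles later as `(i + 1) * 2`. Because `pipeline2` touches devices only through `par` and `refold`, inlining its body at each call site leaves an expression made purely of primitives, which is the property a more general macro mechanism would need to preserve.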

#### 10.2.2 Network-on-Chip Paradigms

The Network-on-Chip (NoC) is a hardware design paradigm [83,84] that emphasizes component-level reuse and reduced complexity of inter-component wiring (in some cases, the degree of component connectivity seen in networks on chip would be infeasible without networking). ReWire provides a significant offering with regard to high assurance and ease of programming for designers targeting HDLs. With Connect Logic, we can proceed a step further by focusing on combinators for modeling a variety of network topologies, switching, and communication between devices. We can build on previous work on equational reasoning with ReWire to provide a variety of assurances for inter-device networking on chip. This has implications for NoC security properties, performance characteristics, and network quality-of-service (QoS) guarantees.

#### 10.2.3 Type Level Naturals and Vectors

Circuits can be designed with great flexibility when the designer is able to abstract over the size of types in designs. Extending ReWire with vector types that are parameterized by their size would be immensely useful for implementing operations that are flexible with regard to their input size (logical functions, addition, etc.). It is possible that we could implement this with an approach rooted in partial evaluation. Partial evaluation for constructing synthesizable logic for varying-length structures is particularly challenging. The result of partial evaluation needs to "construct" a functional specification that can be converted to hardware, or the compiler itself needs to be instrumented to carry out this process automatically, provided that doing so is computationally feasible.
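For reference, size-indexed vectors can be written in GHC Haskell as sketched below. This relies on GHC extensions that ReWire is not assumed to support; `Vec`, `vzipWith`, `toList`, `andB`, and `vand` are illustrative names rather than a proposed ReWire API.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

-- Natural numbers, promoted to the type level by DataKinds.
data Nat = Z | S Nat

-- A vector whose length is tracked in its type.
data Vec (n :: Nat) a where
  VNil  :: Vec 'Z a
  VCons :: a -> Vec n a -> Vec ('S n) a

data Bit = L | H deriving (Eq, Show)

-- Pointwise combination. The types guarantee that both inputs and the
-- result share the same length.
vzipWith :: (a -> b -> c) -> Vec n a -> Vec n b -> Vec n c
vzipWith _ VNil         VNil         = VNil
vzipWith f (VCons x xs) (VCons y ys) = VCons (f x y) (vzipWith f xs ys)

toList :: Vec n a -> [a]
toList VNil         = []
toList (VCons x xs) = x : toList xs

andB :: Bit -> Bit -> Bit
andB H H = H
andB _ _ = L

-- One width-generic definition serves every vector size.
vand :: Vec n Bit -> Vec n Bit -> Vec n Bit
vand = vzipWith andB
```

A single definition such as `vand` would replace the separate fixed-width types (`Vector5`, `Vector16`, `Vector32`, and so on) used in Appendix B, while a mismatch between operand widths would still be rejected at compile time.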

## 10.2.4 Program Transformations for Power Consumption and Circuit Depth

Circuit power consumption and heat profile are critical factors to consider when evaluating a design. Currently, ReWire offers very little for evaluating high-level specifications and how they may behave with regard to power requirements and heat generation. A method for estimating potential power usage based on expression structure could go far toward strengthening ReWire's viability as an HDL. Additionally, evaluating the depth of a behavioral ReWire specification could lead us to a method for automating circuit depth reduction by automated pipelining. This work illustrates the equivalence between pipelined and unpipelined expressions (pull equivalence). The final piece of the pipeline puzzle in a compiler would be a method for identifying points to pipeline and exploiting them with optimizations. Even if this process is based solely on heuristics, we believe that the performance implications are sufficient to warrant more work in this area.
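A depth metric of the kind mentioned above can be sketched on a toy expression language. The `Expr` type, the `depth` function, and the `stagesNeeded` heuristic are all hypothetical; they are not part of ReWire and ignore the differing delays of real operators.

```haskell
-- Toy expression language of two-input operators over named inputs.
data Expr = Input String | Op Expr Expr

-- Combinational depth: the longest path, counted in operators, from any
-- input to the output.
depth :: Expr -> Int
depth (Input _) = 0
depth (Op l r)  = 1 + max (depth l) (depth r)

-- Hypothetical heuristic: the number of pipeline stages needed so that
-- no stage exceeds a given operator depth.
stagesNeeded :: Int -> Expr -> Int
stagesNeeded perStage e = max 1 ((depth e + perStage - 1) `div` perStage)

-- A left-nested chain of eight inputs, as a naive fold would produce.
chain8 :: Expr
chain8 = foldl1 Op [ Input ("x" ++ show k) | k <- [1 .. 8 :: Int] ]
```

Here `depth chain8` is `7`, so limiting each stage to three operator levels gives `stagesNeeded 3 chain8 == 3`. A compiler pass could use such a count to decide where to insert registers.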

## **BIBLIOGRAPHY**

- [1] Robert E Lyons and Wouter Vanderkulk. The use of triple-modular redundancy to improve computer reliability. *IBM Journal of Research and Development*, 6(2):200–209, 1962.
- [2] W. L. Harrison and A. Procter. Cheap (but functional) threads. Submitted to Journal of Functional Programming, 2005.
- [3] Adam Procter. Semantics-Driven Design and Implementation of High-Assurance Hardware. PhD thesis, University of Missouri, 2014.
- [4] John Hughes. Why functional programming matters. *The computer journal*, 32(2):98–107, 1989.
- [5] Simon L. Peyton Jones. Haskell 98 language and libraries: the revised report. Cambridge University Press, 2003.
- [6] Simon Marlow et al. Haskell 2010 language report. Available online http://www.haskell.org/ (May 2011), 2010.
- [7] Eugenio Moggi. Notions of computation and monads. *Information and Computation*, 93(1):55–92, 1991.

- [8] Sheng Liang, Paul Hudak, and Mark Jones. Monad transformers and modular interpreters. In *Proceedings of the 22nd ACM SIGPLAN-SIGACT symposium on Principles of programming languages*, pages 333–343. ACM.
- [9] William L. Harrison. The essence of multitasking, 2006.
- [10] SP Jones. Tackling the awkward squad: monadic input/output, concurrency, exceptions and foreign-language calls. Lecture Notes for a tutorial given at Marktoberdorf Summer School, 2002.
- [11] Russel O'Connor. Io is not a monad.
- [12] S. Marlow. The haxl project at facebook, 2014.
- [13] Enno Scholz. A concurrency monad based on constructor primitives, or, being first-class is not enough. Freie Univ., Fachbereich Mathematik, 1995.
- [14] Koen Claessen. A poor man's concurrency monad. *Journal of Functional Programming*, 9(03):313–323, 1999.
- [15] Janis Voigtländer. Asymptotic Improvement of Computations over Free Monads, volume 5133 of Lecture Notes in Computer Science, book section 20, pages 388– 403. Springer Berlin Heidelberg, 2008.
- [16] E. Kmett. Free monads for less (part 1 of 3): Codensity, 2011.
- [17] John W Lato. Iteratee: Teaching an old fold new tricks. *The Monad. Reader*, 16:19–35, 2010.
- [18] Oleg Kiselyov. *Iteratees*, pages 166–181. Springer, 2012.

- [19] E.Z. Yang. Space leak zoo, 2011.
- [20] M. Snoyman. Conduit library on hackage.
- [21] G. Gonzalez. Pipes library on hackage.
- [22] J. Millikin. Enumerator library on hackage.
- [23] Stan Liao, Steve Tjiang, and Rajesh Gupta. An efficient implementation of reactivity for modeling hardware in the scenic design environment. In *Proceedings of the 34th annual Design Automation Conference*, pages 70–75. ACM.
- [24] Edward Amsden. A survey of functional reactive programming. *Unpublished*, 2011.
- [25] Conal Elliott and Paul Hudak. Functional reactive animation. Proceedings of the second ACM SIGPLAN international conference on Functional programming -ICFP '97, pages 263–273, 1997.
- [26] P. Hudak. Modular domain specific languages and tools. Proceedings. Fifth International Conference on Software Reuse (Cat. No.98TB100203), pages 134– 142, 1998.
- [27] P. Bjesse, K. Claessen, M. Sheeran, and S. Singh. Lava: hardware design in haskell. ACM SIGPLAN Notices, 1998.
- [28] Andy Gill, Tristan Bull, Garrin Kimmell, Erik Perrins, Ed Komp, and Brett Werling. Introducing kansas lava. Implementation and Application of Functional Languages, pages 18–35, 2011.
- [29] R. Wester, C. Baaij, and Kuper. A two step hardware design method using clash. 2012.
- [30] Christiaan Baaij and Jan Kuper. Using rewriting to synthesize functional languages to digital circuits. Trends in Functional Programming, pages 17–33, 2014.
- [31] Arvind. Bluespec and haskell, 2013.
- [32] A. Acosta. Hardware synthesis in forsyde. Sweden: KTH/ICT/ETS, 2007.
- [33] Ingo Sander and Axel Jantsch. Modelling adaptive systems in forsyde. *Electronic Notes in Theoretical Computer Science*, 200(2):39–54, 2008.
- [34] Hyouk Joong Lee, Kevin Brown, Arvind Sujeeth, Hassan Chafi, Tiark Rompf, Martin Odersky, and Kunle Olukotun. Implementing domain-specific languages for heterogeneous parallel computing. *IEEE Micro*, 31(5):42–53, September 2011.
- [35] W. Citrin, R. Hall, and B. Zorn. Programming with visual expressions. In Visual Languages, Proceedings., 11th IEEE International Symposium on, pages 294–301.
- [36] Viktor Massalõgin. Visual lambda calculus. Thesis, 2008.
- [37] J Paul Morrison. Flow-Based Programming: A new approach to application development. CreateSpace, 2010.

- [38] John Hughes. Generalising monads to arrows. Science of Computer Programming, 37(1-3):67–111, 2000.
- [39] Dan Piponi. Profunctors in haskell, 2011.
- [40] H John Reekie. Visual haskell: a first attempt. Report, Citeseer, 1994.
- [41] Martin Erwig. Abstract syntax and semantics of visual languages. *Journal of Visual Languages and Computing*, 9(5):461–483, 1998.
- [42] M. Erwig and B. Meyer. Heterogeneous visual languages-integrating visual and textual programming. *Proceedings of Symposium on Visual Languages*, pages 318–325, 1995.
- [43] Jason Hemann and Eric Holk. Visualizing the turing tarpit. In *Proceedings of the first ACM SIGPLAN workshop on Functional art, music, modeling and design*, pages 71–76. ACM.
- [44] Iavor S Diatchki, Mark P Jones, and Thomas Hallgren. A formal specification of the haskell 98 module system. In *Proceedings of the 2002 ACM SIGPLAN workshop on Haskell*, pages 17–28. ACM, 2002.
- [45] Sigbjorn Finne, Daan Leijen, Erik Meijer, and Simon Peyton Jones. Calling hell from heaven and heaven from hell. In ACM SIGPLAN Notices, volume 34, pages 114–125. ACM, 1999.
- [46] Georgios Fourtounis and Nikolaos S Papaspyrou. Supporting separate compilation in a defunctionalizing compiler. In *SLATE*, pages 39–49, 2013.

- [47] Georgios Fourtounis, Nikolaos S Papaspyrou, and Panagiotis Theofilopoulos. Modular polymorphic defunctionalization. Computer Science and Information Systems, (00):30–30, 2014.
- [48] T. Melham. Higher Order Logic and Hardware Verification, volume 31 of Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 1993.
- [49] Susmit Sarkar, Peter Sewell, Francesco Zappa Nardelli, Scott Owens, Tom Ridge, Thomas Braibant, Magnus O. Myreen, and Jade Alglave. The semantics of x86-cc multiprocessor machine code. In *Proceedings of the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages*, POPL '09, pages 379–391, 2009.
- [50] Levent Erkok, Dylan McNamee, Joe Kiniry, Iavor Diatchki, and John Launchbury. Programming Cryptol. Galois Inc., 2014.
- [51] NIAP-CCEVS. Common criteria for information technology security evaluation part 3: Security assurance components. Technical Report CCMB-2012-09-003, National Information Assurance Partnership, September 2012. https://www.niap-ccevs.org/.
- [52] Adam Procter, William L. Harrison, Ian Graves, Michela Becchi, and Gerard Allwein. Semantics driven hardware design, implementation, and verification with ReWire. In ACM SIGPLAN/SIGBED Conf. on Languages, Compilers, Tools and Theory for Embedded Systems (LCTES), 2015.

- [53] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley, 2006.
- [54] Reetinder Sidhu and Viktor K. Prasanna. Fast regular expression matching using FPGAs. In Proc. of the 9th Annual IEEE Symp. on Field-Programmable Custom Computing Machines, pages 227–238, 2001.
- [55] Michela Becchi and Patrick Crowley. An improved algorithm to accelerate regular expression evaluation. In *Proc. of the 2007 ACM/IEEE Symp. on Architecture for Networking and Communications Sys.*, pages 145–154.
- [56] Sailesh Kumar, Sarang Dharmapurikar, Fang Yu, Patrick Crowley, and Jonathan Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proc. of the 2006 Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM '06, pages 339–350, 2006.
- [57] Benjamin C. Brodie, David E. Taylor, and Ron K. Cytron. A scalable architecture for high-throughput regular-expression pattern matching. In 2006 ISCA, pages 191–202.
- [58] Michela Becchi and Patrick Crowley. A hybrid finite automaton for practical deep packet inspection. In *Proc. of the 2007 ACM CoNEXT Conf.*, pages 1–12.
- [59] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating SNORT IDS. In Proc. of the 2007 ACM/IEEE Symp. on Architecture for Networking and Communications Sys., pages 127–136.

- [60] Michela Becchi and Patrick Crowley. Efficient regular expression evaluation: theory to practice. In Proc. of the 4th ACM/IEEE Symp. on Architectures for Networking and Communications Systems, pages 50–59. ACM, 2008.
- [61] Ioannis Sourdis, João Bispo, João M. Cardoso, and Stamatis Vassiliadis. Regular expression matching in reconfigurable hardware. J. Signal Process. Syst., 51(1):99–121, April 2008.
- [62] Yi-Hua E. Yang, Weirong Jiang, and Viktor K. Prasanna. Compact architecture for high-throughput regular expression matching on fpga. In Proc. of the 2008 ACM/IEEE Symp. on Architectures for Networking and Communications Sys., pages 30–39.
- [63] Nithin George, Hyoukjoong Lee, David Novo, Tiark Rompf, Kevin Brown, Arvind Sujeeth, Martin Odersky, Kunle Olukotun, and Paolo Ienne. Hardware system synthesis from domain-specific languages. In Proc. of 24th Int. Conf. on Field Prog. Logic and App. (FPL '14).
- [64] Gregory R Andrews. Concurrent programming: principles and practice.
- [65] Barbara Chapman, Gabriele Jost, and Ruud Van Der Pas. *Using OpenMP: portable shared memory parallel programming*, volume 10. MIT press, 2008.
- [66] Bradford Nichols, Dick Buttlar, and Jacqueline Farrell. Pthreads programming: A POSIX standard for better multiprocessing. "O'Reilly Media, Inc.", 1996.
- [67] James F Ziegler and WA Lanford. Effect of cosmic rays on computer memories. Science, 206(4420):776–788, 1979.

- [68] John Von Neumann. Probabilistic logics and the synthesis of reliable organisms from unreliable components. *Automata studies*, 34:43–98, 1956.
- [69] ARC 15 Code Base. http://goo.gl/efJ6SO.
- [70] A. Procter, W.L. Harrison, I. Graves, M. Becchi, and G. Allwein. Semantics-directed machine architecture in ReWire. In 2013 Int. Conf. on Field Programmable Technology (FPT '13), pages 446–449.
- [71] Martin Roesch. Snort lightweight intrusion detection for networks. In Proc. of the 13th USENIX Conf. on System Administration, LISA '99, pages 229–238, 1999.
- [72] Vern Paxson. Bro: A system for detecting network intruders in real-time. In *Proc. of the 1998 Conf. on USENIX Security Symp.*, pages 3–3.
- [73] Walid Taha and Tim Sheard. Metaml and multi-stage programming with explicit annotations. *Theoretical Computer Science*, 248(1–2):211–242, 2000.
- [74] William L. Harrison, Adam Procter, and Gerard Allwein. The confinement problem in the presence of faults. In *Proc. 14th ICFEM*, pages 182–197, 2012.
- [75] Daniel J. Bernstein. Salsa20 specification, 2005. http://cr.yp.to/snuffle/spec.pdf.
- [76] Richard Bird and Phillip Wadler. *Introduction to Functional Programming*. Prentice Hall, 1988.
- [77] Daniel J. Bernstein. New stream cipher designs. chapter The Salsa20 Family of Stream Ciphers, pages 84–97. 2008.

- [78] Daniel J. Bernstein. The eSTREAM project eSTREAM phase 3 Salsa20 (portfolio profile 1), 2005. Retrieved November 11, 2014.
- [79] Jaroslaw Sugier. Low-cost hardware implementations of salsa20 stream cipher in programmable devices. Journal of Polish Safety and Reliability Association Summer Safety and Reliability Seminars, 4(1), 2013.
- [80] Koen Claessen and John Hughes. Quickcheck: A lightweight tool for random testing of haskell programs. SIGPLAN Not., 46(4):53–64, May 2011.
- [81] John L Hennessy and David A Patterson. Computer architecture: a quantitative approach. Elsevier, 1990.
- [82] Patty M. Sailer, Philip M. Sailer, and David R. Kaeli. The DLX Instruction Set Architecture Handbook. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 1996.
- [83] Luca Benini and Giovanni De Micheli. Networks on chip: a new paradigm for systems on chip design. In *Design, Automation and Test in Europe Conference and Exhibition, 2002. Proceedings*, pages 418–419. IEEE, 2002.
- [84] Axel Jantsch, Hannu Tenhunen, et al. *Networks on chip*, volume 396. Springer, 2003.

## Appendix A

# Connect Logic Implementation in Haskell

#### A.1 Parallel Combinator

This is the Parallel combinator.

Listing A.1: Haskell implementation of the Connect Logic parallel (<&>) combinator

#### A.2 Refold Combinator

This is the refold combinator.

Listing A.2: Haskell implementation of the Connect Logic refold combinator

## A.3 RefoldT Combinator

This is the refoldT combinator.

```
dispatch o resume = \i2 ->
  case fi o i2 of
    Nothing -> ReT (return (Right (fo o, dispatch o resume)))
    Just x  -> refoldT fo fi (resume x)
```

Listing A.3: Haskell implementation of the Connect Logic refoldT combinator
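The stalling behavior suggested by the fragment in Listing A.3 can be sketched in a self-contained form. The surrounding definition of `refoldT` below is a reconstruction under assumptions (a simplified `ReacT` type and a guess at the missing first clause), not the dissertation's listing; `counter` and `runDev` are hypothetical helpers.

```haskell
import Data.Functor.Identity (Identity (..))

-- Simplified reactive resumption: halt with a value, or emit an output
-- and wait for the next input.
newtype ReacT i o m a =
  ReacT { deReacT :: m (Either a (o, i -> ReacT i o m a)) }

-- Like refold, but fi may return Nothing to stall the inner device: the
-- last output is emitted again and the same continuation is kept.
refoldT :: Monad m
        => (o1 -> o2) -> (o1 -> i2 -> Maybe i1)
        -> ReacT i1 o1 m a -> ReacT i2 o2 m a
refoldT fo fi (ReacT m) = ReacT $ do
  r <- m
  case r of
    Left a       -> return (Left a)
    Right (o, k) -> return (Right (fo o, dispatch o k))
  where
    dispatch o k i2 = case fi o i2 of
      Nothing -> ReacT (return (Right (fo o, dispatch o k)))
      Just x  -> refoldT fo fi (k x)

-- Hypothetical example device: outputs the running sum of its inputs.
counter :: Monad m => Int -> ReacT Int Int m ()
counter n = ReacT (return (Right (n, \i -> counter (n + i))))

-- Hypothetical test harness: feed a list of inputs, collect the outputs.
runDev :: ReacT i o Identity a -> [i] -> [o]
runDev (ReacT m) is = case runIdentity m of
  Left _       -> []
  Right (o, k) -> case is of
    []         -> [o]
    (i : rest) -> o : runDev (k i) rest
```

With `Nothing` as the stall signal, `runDev (refoldT id (\_ mi -> mi) (counter 0)) [Just 1, Nothing, Just 2]` produces `[0,1,1,3]`: the counter holds its output for the stalled cycle and resumes on the next valid input.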

### A.4 Iter Combinator

This is the iter combinator.

Listing A.4: Haskell implementation of the Connect Logic iter combinator.

## Appendix B

## DLX Component Implementation

## B.1 Types for DLX

```
module Redux.Types (
   Bit (..), Vector5 (..), Vector6 (..), Vector16 (..),
   Vector26 (..), Vector32 (..), Reg (..),
   Opcode (..), Maybe (..), RegVal, regEq, zeroReg, oneReg, I, ReT,
   StT, parI, refold, refoldT, signal, extrude, get, put, lift, return
) where

import Control.Monad.Resumption.Reactive
import Control.Monad.Resumption.Connectors
import Data.Functor.Identity
import Control.Monad.State hiding (lift)
import Control.Monad.Morph

type ReT = ReacT
-- Over IO for testing in Haskell
type I = IO
type StT s m = StateT s m
type RegVal = Vector32 Bit

parI :: ReT i o I a -> ReT j p I a -> ReT (i,j) (o,p) I a
parI = (<||>)

data Bit = L | H deriving Show

data Vector5 a = Vector5 a a a a a
data Vector6 a = Vector6 a a a a a a
data Vector16 a = Vector16 a a a a a a a a
                           a a a a a a a a

data Vector26 a = Vector26 a a a a a a a a
                           a a a a a a a a
                           a a a a a a a a
                           a a

data Vector32 a = Vector32 a a a a a a a a
                           a a a a a a a a
                           a a a a a a a a
                           a a a a a a a a deriving Show

data Reg = R0 | R1 | R2 | R3 | R4 | R5 | R6 | R7
         | R8 | R9 | R10 | R11 | R12 | R13 | R14
         | R15 | R16 | R17 | R18 | R19 | R20 | R21
         | R22 | R23 | R24 | R25 | R26 | R27 | R28
         | R29 | R30 | R31 | R32

regEq :: Reg -> Reg -> Bit
regEq r1 r2 = case (r1, r2) of
  (R1,R1)   -> H
  (R2,R2)   -> H
  (R3,R3)   -> H
  (R4,R4)   -> H
  (R5,R5)   -> H
  (R6,R6)   -> H
  (R7,R7)   -> H
  (R8,R8)   -> H
  (R9,R9)   -> H
  (R10,R10) -> H
  (R11,R11) -> H
  (R12,R12) -> H
  (R13,R13) -> H
  (R14,R14) -> H
  (R15,R15) -> H
  (R16,R16) -> H
  (R17,R17) -> H
  (R18,R18) -> H
  (R19,R19) -> H
  (R20,R20) -> H
  (R21,R21) -> H
  (R22,R22) -> H
  (R23,R23) -> H
  (R24,R24) -> H
  (R25,R25) -> H
  (R26,R26) -> H
  (R27,R27) -> H
  (R28,R28) -> H
  (R29,R29) -> H
  (R30,R30) -> H
  (R31,R31) -> H
  (R32,R32) -> H
  _         -> L

data Opcode = ADD | ADDI | AND | ANDI | BEQZ | BNEZ
            | J | JAL | JALR | JR | LHI | LW | OR
            | ORI | SEQ | SEQI | SLE | SLEI | SLL
            | SLLI | SLT | SLTI | SNE | SNEI | SRA
            | SRAI | SRL | SRLI | SUB | SUBI | SW
            | XOR | XORI | NOP deriving Show

extrude :: (Monad m) => ReT i o (StT s m) a ->
             s ->
             ReT i o m (a,s)
extrude (ReacT m) s =
  let a = flip evalStateT s $ do
            res <- m
            s <- get
            case res of
              Left a      -> return $ Left (a,s)
              Right (o,r) -> return $ Right (o, \i -> extrude (r i) s)
  in ReacT a
```

Listing B.1: Types defined for DLX processor implementation

## B.2 DLX Fetch Stage

```
module Redux.Fetch where

import Redux.Types
-- Stand in function for primitive VHDL adding function
import Redux.Test.ArithLogic (add_)

type Instr    = Vector32 Bit
type NewAdd   = Maybe (Vector32 Bit)
type PC       = Vector32 Bit
type NextInst = Vector32 Bit
type FetchI   = (Instr, NewAdd)
type FetchO   = (NextInst, Instr, PC)

fourReg :: Vector32 Bit
fourReg =
  Vector32 L L L L L L L L
           L L L L L L L L
           L L L L L L L L
           L L L L L H L L

fetch_ :: Monad m => PC -> FetchI -> ReT FetchI FetchO m ()
fetch_ pc inp = case inp of
  (_, Just addr) -> do
    inp <- signal (addr, zeroReg, zeroReg)
    fetch_ addr inp
  (inst, Nothing) -> do
    let pc4 = add_ pc fourReg
    inp <- signal (pc4, inst, pc)
    fetch_ pc4 inp

fetch :: Monad m => ReT FetchI FetchO m ()
fetch = fetch_ zeroReg (zeroReg, Just zeroReg)
```

Listing B.2: Haskell implementation of DLX Fetch Stage
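The address logic of the fetch stage can be modeled as a pure step function, which makes it easy to test outside the Resumption framework. The sketch below replaces `Vector32 Bit` with `Word32` and the VHDL-backed `add_` with ordinary addition; `fetchStep` and `runFetch` are hypothetical names, and the output tuple follows the `FetchO` ordering (next address, instruction, PC).

```haskell
import Data.List (mapAccumL)
import Data.Word (Word32)

-- One fetch step: from the current PC and an input of (instruction word,
-- optional branch target), compute the new PC and the stage's output.
fetchStep :: Word32 -> (Word32, Maybe Word32)
          -> (Word32, (Word32, Word32, Word32))
-- A branch target redirects the PC and emits zeros in place of an
-- instruction and PC, as the first case of fetch_ does.
fetchStep _  (_, Just addr)  = (addr, (addr, 0, 0))
-- Otherwise advance by four and pass the instruction through.
fetchStep pc (inst, Nothing) = (pc + 4, (pc + 4, inst, pc))

-- Run the fetch logic over a list of inputs, starting from address zero.
runFetch :: [(Word32, Maybe Word32)] -> [(Word32, Word32, Word32)]
runFetch = snd . mapAccumL fetchStep 0
```

For instance, `runFetch [(1, Nothing), (2, Nothing), (3, Just 64), (4, Nothing)]` yields `[(4,1,0),(8,2,4),(64,0,0),(68,4,64)]`, showing sequential fetch, a redirect that squashes the in-flight instruction, and resumption from the new address.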

## B.3 DLX Decode Stage

```
module Redux.Decode where

import Redux.Types
import Redux.Writeback
import Redux.Test.ArithLogic
import Debug.Trace

data IJ = I | Jm

extend :: Vector16 Bit -> Vector32 Bit
extend vec =
  case vec of
    Vector16 b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0 ->
      Vector32 b15 b15 b15 b15 b15 b15 b15 b15
               b15 b15 b15 b15 b15 b15 b15 b15
               b15 b14 b13 b12 b11 b10 b9  b8
               b7  b6  b5  b4  b3  b2  b1  b0

lextend :: Vector16 Bit -> Vector32 Bit
lextend vec =
  case vec of
    Vector16 b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0 ->
      Vector32 L   L   L   L   L   L   L   L
               L   L   L   L   L   L   L   L
               b15 b14 b13 b12 b11 b10 b9  b8
               b7  b6  b5  b4  b3  b2  b1  b0

highHalf :: Vector16 Bit -> Vector32 Bit
highHalf vec = case vec of
  Vector16 b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0 ->
    Vector32 b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0
             L L L L L L L L
             L L L L L L L L

decodeReg :: (Bit, Bit, Bit, Bit, Bit) -> Reg
decodeReg inp = case inp of
  (L,L,L,L,L) -> R0
  (L,L,L,L,H) -> R1
  (L,L,L,H,L) -> R2
  (L,L,L,H,H) -> R3
  (L,L,H,L,L) -> R4
  (L,L,H,L,H) -> R5
  (L,L,H,H,L) -> R6
  (L,L,H,H,H) -> R7
  (L,H,L,L,L) -> R8
  (L,H,L,L,H) -> R9
  (L,H,L,H,L) -> R10
  (L,H,L,H,H) -> R11
  (L,H,H,L,L) -> R12
  (L,H,H,L,H) -> R13
  (L,H,H,H,L) -> R14
  (L,H,H,H,H) -> R15
  (H,L,L,L,L) -> R16
  (H,L,L,L,H) -> R17
  (H,L,L,H,L) -> R18
  (H,L,L,H,H) -> R19
  (H,L,H,L,L) -> R20
  (H,L,H,L,H) -> R21
  (H,L,H,H,L) -> R22
  (H,L,H,H,H) -> R23
  (H,H,L,L,L) -> R24
  (H,H,L,L,H) -> R25
  (H,H,L,H,L) -> R26
  (H,H,L,H,H) -> R27
  (H,H,H,L,L) -> R28
  (H,H,H,L,H) -> R29
  (H,H,H,H,L) -> R30
  (H,H,H,H,H) -> R31

decodeIJ :: (Bit, Bit, Bit, Bit, Bit, Bit) -> (Opcode, IJ)
decodeIJ inp = case inp of
  (L,L,L,L,L,L) -> (NOP,I)
  (L,L,L,L,L,H) -> (NOP,I)
  (L,L,L,L,H,L) -> (J,Jm)
  (L,L,L,L,H,H) -> (JAL,Jm)
  (L,L,L,H,L,L) -> (BEQZ,I)
  (L,L,L,H,L,H) -> (BNEZ,I)
  (L,L,L,H,H,L) -> (NOP,I)
  (L,L,L,H,H,H) -> (NOP,I)
  (L,L,H,L,L,L) -> (ADDI,I)
  (L,L,H,L,L,H) -> (NOP,I)
  (L,L,H,L,H,L) -> (SUBI,I)
  (L,L,H,L,H,H) -> (NOP,I)
  (L,L,H,H,L,L) -> (ANDI,I)
  (L,L,H,H,L,H) -> (ORI,I)
  (L,L,H,H,H,L) -> (XORI,I)
  (L,L,H,H,H,H) -> (LHI,I)
  (L,H,L,L,L,L) -> (NOP,I)
  (L,H,L,L,L,H) -> (NOP,I)
  (L,H,L,L,H,L) -> (JR,I)
  (L,H,L,L,H,H) -> (JALR,I)
  (L,H,L,H,L,L) -> (SLLI,I)
  (L,H,L,H,L,H) -> (NOP,I)
  (L,H,L,H,H,L) -> (SRLI,I)
  (L,H,L,H,H,H) -> (SRAI,I)
  (L,H,H,L,L,L) -> (SEQI,I)
  (L,H,H,L,L,H) -> (SNEI,I)
  (L,H,H,L,H,L) -> (SLTI,I)
  (L,H,H,L,H,H) -> (NOP,I)
  (L,H,H,H,L,L) -> (SLEI,I)
  (L,H,H,H,L,H) -> (NOP,I)
  (L,H,H,H,H,L) -> (NOP,I)
  (L,H,H,H,H,H) -> (NOP,I)
  (H,L,L,L,L,L) -> (NOP,I)
  (H,L,L,L,L,H) -> (NOP,I)
  (H,L,L,L,H,L) -> (NOP,I)
  (H,L,L,L,H,H) -> (LW,I)
  (H,L,L,H,L,H) -> (NOP,I)
  (H,L,L,H,H,L) -> (NOP,I)
  (H,L,L,H,H,H) -> (NOP,I)
  (H,L,H,L,L,H) -> (NOP,I)
  (H,L,H,L,H,L) -> (NOP,I)
  (H,L,H,L,H,H) -> (SW,I)
  (H,L,H,H,L,L) -> (NOP,I)
  (H,L,H,H,L,H) -> (NOP,I)
  (H,L,H,H,H,L) -> (NOP,I)
  (H,L,H,H,H,H) -> (NOP,I)
  (H,H,L,L,L,L) -> (NOP,I)
  (H,H,L,L,L,H) -> (NOP,I)
  (H,H,L,L,H,L) -> (NOP,I)
  (H,H,L,L,H,H) -> (NOP,I)
  (H,H,L,H,L,L) -> (NOP,I)
  (H,H,L,H,L,H) -> (NOP,I)
  (H,H,L,H,H,L) -> (NOP,I)
```

```
(H,H,L,H,H,H) \rightarrow (NOP,I)
124
                      (H,H,H,L,L,L) \rightarrow (NOP,I)
125
                      (H,H,H,L,L,H) \rightarrow (NOP,I)
126
                      (H,H,H,L,H,L) \rightarrow (NOP,I)
127
                      (H,H,H,L,H,H) \rightarrow (NOP,I)
128
                      (H,H,H,H,L,L) \rightarrow (NOP,I)
129
                      (H,H,H,H,L,H) \rightarrow (NOP,I)
130
                      (H,H,H,H,H,L) \rightarrow (NOP,I)
131
                      (H,H,H,H,H,H) \rightarrow (NOP, I)
132
decodeR :: (Bit, Bit, Bit, Bit, Bit, Bit) -> Opcode
decodeR inp = case inp of
                  (L,L,L,L,L,L) -> NOP
                  (L,L,L,L,L,H) -> NOP
                  (L,L,L,L,H,L) -> NOP
                  (L,L,L,L,H,H) -> NOP
                  (L,L,L,H,L,L) -> SLL
                  (L,L,L,H,L,H) -> NOP
                  (L,L,L,H,H,L) -> SRL
                  (L,L,L,H,H,H) -> SRA
                  (L,L,H,L,L,L) -> NOP
                  (L,L,H,L,L,H) -> NOP
                  (L,L,H,L,H,L) -> NOP
                  (L,L,H,L,H,H) -> NOP
                  (L,L,H,H,L,L) -> NOP
                  (L,L,H,H,L,H) -> NOP
                  (L,L,H,H,H,L) -> NOP
                  (L,L,H,H,H,H) -> NOP
                  (L,H,L,L,L,L) -> NOP
                  (L,H,L,L,L,H) -> NOP
                  (L,H,L,L,H,L) -> NOP
                  (L,H,L,L,H,H) -> NOP
                  (L,H,L,H,L,L) -> NOP
                  (L,H,L,H,L,H) -> NOP
                  (L,H,L,H,H,L) -> NOP
                  (L,H,L,H,H,H) -> NOP
                  (L,H,H,L,L,L) -> NOP
                  (L,H,H,L,L,H) -> NOP
                  (L,H,H,L,H,L) -> NOP
                  (L,H,H,L,H,H) -> NOP
                  (L,H,H,H,L,L) -> NOP
                  (L,H,H,H,L,H) -> NOP
                  (L,H,H,H,H,L) -> NOP
                  (L,H,H,H,H,H) -> NOP
                  (H,L,L,L,L,L) -> ADD
                  (H,L,L,L,L,H) -> NOP
                  (H,L,L,L,H,L) -> SUB
                  (H,L,L,L,H,H) -> NOP
                  (H,L,L,H,L,L) -> AND
                  (H,L,L,H,L,H) -> OR
                  (H,L,L,H,H,L) -> XOR
                  (H,L,L,H,H,H) -> NOP
                  (H,L,H,L,L,L) -> SEQ
                  (H,L,H,L,L,H) -> SNE
                  (H,L,H,L,H,L) -> SLT
                  (H,L,H,L,H,H) -> NOP
                  (H,L,H,H,L,L) -> SLE
                  (H,L,H,H,L,H) -> NOP
                  (H,L,H,H,H,L) -> NOP
                  (H,L,H,H,H,H) -> NOP
                  (H,H,L,L,L,L) -> NOP
                  (H,H,L,L,L,H) -> NOP
                  (H,H,L,L,H,L) -> NOP
                  (H,H,L,L,H,H) -> NOP
                  (H,H,L,H,L,L) -> NOP
                  (H,H,L,H,L,H) -> NOP
                  (H,H,L,H,H,L) -> NOP
                  (H,H,L,H,H,H) -> NOP
                  (H,H,H,L,L,L) -> NOP
                  (H,H,H,L,L,H) -> NOP
                  (H,H,H,L,H,L) -> NOP
                  (H,H,H,L,H,H) -> NOP
                  (H,H,H,H,L,L) -> NOP
                  (H,H,H,H,L,H) -> NOP
                  (H,H,H,H,H,L) -> NOP
                  (H,H,H,H,H,H) -> NOP

rtype :: Vector32 Bit -> RegFile -> RegVal
            -> (Opcode, Reg, (Reg, RegVal), (Reg, RegVal))
rtype vec rf pc = case vec of
    Vector32 _ _ _ _ _ _
             b25 b24 b23 b22
             b21 b20 b19 b18 b17
             b16 b15 b14 b13 b12
             b11 b10 b9 b8 b7
             b6 b5 b4 b3 b2 b1 b0 ->
      let rs1 = decodeReg (b25, b24, b23, b22, b21)
          rs2 = decodeReg (b20, b19, b18, b17, b16)
      in (decodeR (b5, b4, b3, b2, b1, b0),
          decodeReg (b15, b14, b13, b12, b11),
          (rs1, getReg rs1 rf),
          (rs2, getReg rs2 rf)
         )

ijtype :: Vector32 Bit -> RegFile -> RegVal ->
      (Opcode, Reg, (Reg, RegVal), (Reg, RegVal))
ijtype vec rf pc = case vec of
    Vector32 b31 b30 b29 b28 b27
             b26 b25 b24 b23 b22
             b21 b20 b19 b18 b17
             b16 b15 b14 b13 b12
             b11 b10 b9 b8 b7
             b6 b5 b4 b3 b2 b1 b0 ->
      let exvalue = Vector32 b25 b25 b25 b25 b25 b25 b25
                             b24 b23 b22 b21 b20 b19 b18
                             b17 b16 b15 b14 b13 b12 b11
                             b10 b9 b8 b7 b6 b5 b4 b3 b2
                             b1 b0
          imm     = Vector16 b15 b14 b13 b12 b11 b10 b9
                             b8 b7 b6 b5 b4 b3 b2 b1 b0
          eximm   = extend imm
          leximm  = lextend imm
      in case decodeIJ (b31, b30, b29, b28, b27, b26) of
           (opc, I) ->
             let rs1  = decodeReg (b25, b24, b23, b22, b21)
                 rs1v = getReg rs1 rf
                 rd   = decodeReg (b20, b19, b18, b17, b16)
             in case opc of
                  ADDI -> (opc, rd, (rs1, rs1v), (R0, eximm))
                  ANDI -> (opc, rd, (rs1, rs1v), (R0, leximm))
                  BEQZ -> (opc, R0, (rs1, rs1v), (R0, add_ pc eximm))
                  BNEZ -> (opc, R0, (rs1, rs1v), (R0, add_ pc eximm))
                  JALR -> (opc, rd, (R0, pc), (rs1, rs1v))
                  JR   -> (opc, R0, (rs1, rs1v), (R0, zeroReg))
                  LHI  -> (opc, rd, (R0, highHalf imm), (R0, zeroReg))
                  LW   -> (opc, rd, (rs1, rs1v), (R0, leximm))
                  ORI  -> (opc, rd, (rs1, rs1v), (R0, leximm))
                  SEQI -> (opc, rd, (rs1, rs1v), (R0, eximm))
                  SLEI -> (opc, rd, (rs1, rs1v), (R0, eximm))
                  SLLI -> (opc, rd, (rs1, rs1v), (R0, leximm))
                  SLTI -> (opc, rd, (rs1, rs1v), (R0, eximm))
                  SNEI -> (opc, rd, (rs1, rs1v), (R0, eximm))
                  SRAI -> (opc, rd, (rs1, rs1v), (R0, eximm))
                  SRLI -> (opc, rd, (rs1, rs1v), (R0, leximm))
                  SUBI -> (opc, rd, (rs1, rs1v), (R0, eximm))
                  SW   -> (opc, rd, (rs1, add_ rs1v eximm),
                           (rd, getReg rd rf))
                  XORI -> (opc, rd, (rs1, rs1v), (R0, leximm))
                  _    -> (NOP, R0, (R0, zeroReg), (R0, zeroReg))
           (opc, Jm) ->
             case opc of
               J   -> (J, R0,
                       (R0, exvalue), (R0, pc))
               JAL -> (JAL, R31, (R0, pc),
                       (R0, exvalue))
               _   -> (NOP, R0,
                       (R0, zeroReg), (R0, zeroReg))
```

```
decodeInst :: Vector32 Bit -> RegFile -> RegVal ->
                 (Opcode, Reg, (Reg, RegVal), (Reg, RegVal))
decodeInst inst rf pc =
     case inst of
       Vector32 L L L L L L
                b25 b24 b23 b22
                b21 b20 b19 b18 b17
                b16 b15 b14 b13 b12
                b11 b10 b9 b8 b7
                b6 b5 b4 b3 b2 b1 b0 ->
                  rtype inst rf pc
       _ -> ijtype inst rf pc

decoder_ :: Monad m => (Vector32 Bit, RegFile, RegVal)
                           -> ReT (Vector32 Bit, RegFile, RegVal)
                                  (Opcode, Reg, (Reg, RegVal), (Reg, RegVal))
                                  m ()
decoder_ inp = case inp of
              (i, rf, pc) -> do
                               i <- signal (decodeInst i rf pc)
                               decoder_ i

decode :: Monad m => ReT (Vector32 Bit, RegFile, RegVal)
                             (Opcode, Reg, (Reg, RegVal), (Reg, RegVal))
                             m ()
decode = decoder_ (zeroReg, zeroFile, zeroReg)
```

Listing B.3: Haskell implementation of DLX Decode Stage
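
The register table in `decodeReg` amounts to reading its five bits as an unsigned binary number, most significant bit first; for example, `(L,H,H,H,L)` selects `R14`. A minimal sketch of that correspondence follows. The `Bit` type here is a local stand-in for the one used in the listing, and `bitVal` and `regIndex` are hypothetical helpers introduced only for illustration; they are not part of the ReWire sources.

```haskell
-- Local stand-in for the listing's Bit type (an assumption for this sketch).
data Bit = L | H deriving (Eq, Show)

-- Read a single bit as 0 or 1.
bitVal :: Bit -> Int
bitVal L = 0
bitVal H = 1

-- Hypothetical helper: interpret a 5-tuple of bits, most significant
-- first, as a register index in the range 0..31.
regIndex :: (Bit, Bit, Bit, Bit, Bit) -> Int
regIndex (b4, b3, b2, b1, b0) =
  foldl (\acc b -> 2 * acc + bitVal b) 0 [b4, b3, b2, b1, b0]
```

Under these assumptions, `regIndex (H,H,H,H,H)` evaluates to 31, which lines up with the `R31` entry of the table.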

### B.4 DLX Execute Phase

This is the DLX execute phase.
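
As a reading aid for the shift cases in the execute-phase listing: each alternative on the three-bit shift amount writes out one fixed rewiring of the 32 input bits. A left shift by n drops the n most significant bits and appends n `L` bits; an arithmetic right shift by n repeats the sign bit n times and drops the n least significant bits. The sketch below expresses the same rewiring over plain lists. `Bit`, `shiftAmount`, `shiftLeft`, and `shiftRightArith` are hypothetical names local to this example, and it assumes that `mod8` yields the three least significant bits of its argument, as its name suggests.

```haskell
-- Local stand-in for the listing's Bit type (an assumption for this sketch).
data Bit = L | H deriving (Eq, Show)

-- Shift amount 0..7, read from the three least significant bits of a
-- most-significant-first bit list (the assumed behaviour of mod8).
shiftAmount :: [Bit] -> Int
shiftAmount bits = foldl step 0 lowThree
  where
    lowThree   = drop (length bits - 3) bits
    step acc b = 2 * acc + (if b == H then 1 else 0)

-- Logical left shift: drop the k leading bits and pad with L on the right.
shiftLeft :: Int -> [Bit] -> [Bit]
shiftLeft n bits = drop k bits ++ replicate k L
  where k = max 0 (min n (length bits))

-- Arithmetic right shift: repeat the sign bit k times on the left and
-- drop the k trailing bits.
shiftRightArith :: Int -> [Bit] -> [Bit]
shiftRightArith _ [] = []
shiftRightArith n bits@(sign : _) =
  replicate k sign ++ take (length bits - k) bits
  where k = max 0 (min n (length bits))
```

For instance, `shiftLeft 1 [H,L,H,H]` gives `[L,H,H,L]`. The listing unrolls this into eight explicit `Vector32` constructions per shift opcode, one for each value of the shift amount.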

```
module Redux.ALU where

import Redux.Types
import Redux.Test.ArithLogic

type RegDest = Reg
type Flush   = Bit

type ALUI = (Opcode, Flush, RegDest, RegVal, RegVal)
type ALUO = (Maybe (Opcode, RegDest, RegVal, RegVal))

mod8 :: RegVal -> (Bit, Bit, Bit)
mod8 rv = case rv of
            (Vector32 _ _ _ _ _ _ _ _
                      _ _ _ _ _ _ _ _
                      _ _ _ _ _ _ _ _
                      _ _ _ _ _ b2 b1 b0) -> (b2, b1, b0)

proc :: (Opcode, Flush, RegDest, RegVal, RegVal) ->
          (Maybe (Opcode, RegDest, RegVal, RegVal))
proc inp = case inp of
  (op, _, rdest, ra, rb) -> case op of
    ADD  -> Just (op, rdest, add_ ra rb, zeroReg)
    ADDI -> Just (op, rdest, add_ ra rb, zeroReg)
    AND  -> Just (op, rdest, and ra rb, zeroReg)
    ANDI -> Just (op, rdest, and ra rb, zeroReg)
    BEQZ -> case eq_ ra zeroReg of
              H -> Just (op, R0, rb, oneReg)
              L -> Just (op, R0, zeroReg, zeroReg)
    BNEZ -> case not_ (eq_ ra zeroReg) of
              H -> Just (op, R0, rb, oneReg)
              L -> Just (op, R0, zeroReg, zeroReg)
    J    -> Just (op, R0, add_ ra rb, zeroReg)
    JAL  -> Just (op, R31, add_ ra (lit (4 :: BWord)), add_ ra rb)
    JALR -> Just (op, R31, add_ ra (lit (4 :: BWord)), rb)
    JR   -> Just (op, R0, ra, zeroReg)
    LHI  -> Just (op, rdest, ra, zeroReg)
    LW   -> Just (op, rdest, add_ ra rb, zeroReg)
    OR   -> Just (op, rdest, or_ ra rb, zeroReg)
    ORI  -> Just (op, rdest, or_ ra rb, zeroReg)
    SEQ  -> case eq_ ra rb of
              H -> Just (op, rdest, oneReg, zeroReg)
              L -> Just (op, rdest, zeroReg, zeroReg)
    SEQI -> case eq_ ra rb of
              H -> Just (op, rdest, oneReg, zeroReg)
              L -> Just (op, rdest, zeroReg, zeroReg)
    SLE  -> case lte_ ra rb of
              H -> Just (op, rdest, oneReg, zeroReg)
              L -> Just (op, rdest, zeroReg, zeroReg)
    SLEI -> case lte_ ra rb of
              H -> Just (op, rdest, oneReg, zeroReg)
              L -> Just (op, rdest, zeroReg, zeroReg)
    SLL -> case ra of
      Vector32 b31 b30 b29 b28 b27 b26 b25 b24
               b23 b22 b21 b20 b19 b18 b17 b16
               b15 b14 b13 b12 b11 b10 b9 b8
               b7 b6 b5 b4 b3 b2 b1 b0 ->
        case mod8 rb of
          (L,L,L) -> Just (op, rdest, ra, zeroReg)
          (L,L,H) -> let vec = Vector32 b30 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L
                     in Just (op, rdest, vec, zeroReg)
          (L,H,L) -> let vec = Vector32 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L L
                     in Just (op, rdest, vec, zeroReg)
          (L,H,H) -> let vec = Vector32 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L L L
                     in Just (op, rdest, vec, zeroReg)
          (H,L,L) -> let vec = Vector32 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L L L L
                     in Just (op, rdest, vec, zeroReg)
          (H,L,H) -> let vec = Vector32 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L L L L L
                     in Just (op, rdest, vec, zeroReg)
          (H,H,L) -> let vec = Vector32 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L L L L L L
                     in Just (op, rdest, vec, zeroReg)
          (H,H,H) -> let vec = Vector32 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0
                                        L L L L L L L
                     in Just (op, rdest, vec, zeroReg)
    SLLI -> case ra of
      Vector32 b31 b30 b29 b28 b27 b26 b25 b24
               b23 b22 b21 b20 b19 b18 b17 b16
               b15 b14 b13 b12 b11 b10 b9 b8
               b7 b6 b5 b4 b3 b2 b1 b0 ->
        case mod8 rb of
          (L,L,L) -> Just (op, rdest, ra, zeroReg)
          (L,L,H) -> let vec = Vector32 b30 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L
                     in Just (op, rdest, vec, zeroReg)
          (L,H,L) -> let vec = Vector32 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L L
                     in Just (op, rdest, vec, zeroReg)
          (L,H,H) -> let vec = Vector32 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L L L
                     in Just (op, rdest, vec, zeroReg)
          (H,L,L) -> let vec = Vector32 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L L L L
                     in Just (op, rdest, vec, zeroReg)
          (H,L,H) -> let vec = Vector32 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0 L L L L L
                     in Just (op, rdest, vec, zeroReg)
          (H,H,L) -> let vec = Vector32 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0
                                        L L L L L L
                     in Just (op, rdest, vec, zeroReg)
          (H,H,H) -> let vec = Vector32 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1 b0
                                        L L L L L L L
                     in Just (op, rdest, vec, zeroReg)
    SLT  -> case lt_ ra rb of
              H -> Just (op, rdest, oneReg, zeroReg)
              L -> Just (op, rdest, zeroReg, zeroReg)
    SLTI -> case lt_ ra rb of
              H -> Just (op, rdest, oneReg, zeroReg)
              L -> Just (op, rdest, zeroReg, zeroReg)
    SNE  -> case not_ (eq_ ra rb) of
              H -> Just (op, rdest, oneReg, zeroReg)
              L -> Just (op, rdest, zeroReg, zeroReg)
    SNEI -> case not_ (eq_ ra rb) of
              H -> Just (op, rdest, oneReg, zeroReg)
              L -> Just (op, rdest, zeroReg, zeroReg)
    SRA -> case ra of
      Vector32 b31 b30 b29 b28 b27 b26 b25 b24
               b23 b22 b21 b20 b19 b18 b17 b16
               b15 b14 b13 b12 b11 b10 b9 b8
               b7 b6 b5 b4 b3 b2 b1 b0 ->
        case mod8 rb of
          (L,L,L) -> Just (op, rdest, ra, zeroReg)
          (L,L,H) -> let vec = Vector32 b31 b31
                                        b30 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2 b1
                     in Just (op, rdest, vec, zeroReg)
          (L,H,L) -> let vec = Vector32 b31 b31 b31
                                        b30 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3 b2
                     in Just (op, rdest, vec, zeroReg)
          (L,H,H) -> let vec = Vector32 b31 b31 b31 b31
                                        b30 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4 b3
                     in Just (op, rdest, vec, zeroReg)
          (H,L,L) -> let vec = Vector32 b31 b31 b31 b31 b31
                                        b30 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5 b4
                     in Just (op, rdest, vec, zeroReg)
          (H,L,H) -> let vec = Vector32 b31 b31 b31 b31 b31 b31
                                        b30 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6 b5
                     in Just (op, rdest, vec, zeroReg)
          (H,H,L) -> let vec = Vector32 b31 b31 b31 b31 b31 b31 b31
                                        b30 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7 b6
                     in Just (op, rdest, vec, zeroReg)
          (H,H,H) -> let vec = Vector32 b31 b31 b31 b31 b31 b31 b31 b31
                                        b30 b29 b28 b27 b26 b25 b24
                                        b23 b22 b21 b20 b19 b18 b17 b16
                                        b15 b14 b13 b12 b11 b10 b9 b8
                                        b7
                     in Just (op, rdest, vec, zeroReg)
                 SRAI -> case ra of
213
                                 Vector32 b31 b30 b29 b28 b27 b26 b25 b24
214
                                           b23 b22 b21 b20 b19 b18 b17 b16
215
                                           b15 b14 b13 b12 b11 b10 b9
                                                                           b8
216
                                           b7
                                              ^{\mathrm{b6}}
                                                   b5 b4 b3 b2 b1
                                                                           b0 ->
217
                case mod8 rb of
218
                   (L,L,L) -> Just (op,rdest,ra,zeroReg)
219
                   (L, L, H) \rightarrow
220
                     let vec = Vector 32 b 31 b 31 b 30 b 29
221
                                           b28 b27 b26 b25 b24
222
                                           b23 b22 b21 b20 b19 b18 b17 b16
223
                                           b15 b14 b13 b12 b11 b10 b9
                                                                           b8
224
                                               ^{\mathrm{b6}}
                                                   b5 b4 b3 b2 b1
                                           b7
225
                                    in Just (op, rdest, vec, zeroReg)
226
                   (L,H,L) \rightarrow let vec = Vector 32 b 31 b 31 b 31 b 30 b 29
227
                                           b28 b27 b26 b25 b24
                                           b23 b22 b21 b20 b19 b18 b17 b16
229
```

```
                                           b15 b14 b13 b12 b11 b10 b9 b8
                                           b7 b6 b5 b4 b3 b2
                                     in Just (op, rdest, vec, zeroReg)
                   (L,H,H) -> let vec = Vector32 b31 b31 b31 b31 b30
                                           b29 b28 b27 b26 b25 b24
                                           b23 b22 b21 b20 b19 b18 b17 b16
                                           b15 b14 b13 b12 b11 b10 b9 b8
                                           b7 b6 b5 b4 b3
                                     in Just (op, rdest, vec, zeroReg)
                   (H,L,L) -> let vec = Vector32 b31 b31 b31 b31 b31
                                           b30 b29 b28 b27 b26 b25 b24
                                           b23 b22 b21 b20 b19 b18 b17 b16
                                           b15 b14 b13 b12 b11 b10 b9 b8
                                           b7 b6 b5 b4
                                     in Just (op, rdest, vec, zeroReg)
                   (H,L,H) -> let vec = Vector32 b31 b31 b31 b31 b31
                                           b31 b30 b29 b28 b27 b26 b25 b24
                                           b23 b22 b21 b20 b19 b18 b17 b16
                                           b15 b14 b13 b12 b11 b10 b9 b8
                                           b7 b6 b5
                                     in Just (op, rdest, vec, zeroReg)
                   (H,H,L) -> let vec = Vector32 b31 b31 b31 b31 b31 b31
                                           b31 b30 b29 b28 b27 b26 b25 b24
                                           b23 b22 b21 b20 b19 b18 b17 b16
                                           b15 b14 b13 b12 b11 b10 b9 b8
                                           b7 b6
                                     in Just (op, rdest, vec, zeroReg)
                   (H,H,H) -> let vec = Vector32 b31 b31 b31 b31 b31 b31
                                           b31 b31 b30 b29 b28 b27 b26 b25 b24
```

```
                                           b23 b22 b21 b20 b19 b18 b17 b16
                                           b15 b14 b13 b12 b11 b10 b9 b8
                                           b7
                                     in Just (op, rdest, vec, zeroReg)
                 SRL -> case ra of
                   Vector32 b31 b30 b29 b28 b27 b26 b25 b24
                            b23 b22 b21 b20 b19 b18 b17 b16
                            b15 b14 b13 b12 b11 b10 b9 b8
                            b7 b6 b5 b4 b3 b2 b1 b0 ->
                     case mod8 rb of
                       (L,L,L) -> Just (op, rdest, ra, zeroReg)
                       (L,L,H) -> let vec = Vector32 L b31 b30 b29 b28
                                             b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6 b5 b4 b3 b2 b1
                                  in Just (op, rdest, vec, zeroReg)
                       (L,H,L) -> let vec = Vector32 L L b31 b30 b29
                                             b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6 b5 b4 b3 b2
                                  in Just (op, rdest, vec, zeroReg)
                       (L,H,H) -> let vec = Vector32 L L L b31 b30 b29
                                             b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6 b5 b4 b3
                                  in Just (op, rdest, vec, zeroReg)
```

```
                       (H,L,L) -> let vec = Vector32 L L L L b31 b30 b29
                                             b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6 b5 b4
                                  in Just (op, rdest, vec, zeroReg)
                       (H,L,H) -> let vec = Vector32 L L L L L b31 b30 b29
                                             b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6 b5
                                  in Just (op, rdest, vec, zeroReg)
                       (H,H,L) -> let vec = Vector32 L L L L L L b31
                                             b30 b29 b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6
                                  in Just (op, rdest, vec, zeroReg)
                       (H,H,H) -> let vec = Vector32 L L L L L L L b31
                                             b30 b29 b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7
                                  in Just (op, rdest, vec, zeroReg)
                 SRLI -> case ra of
                   Vector32 b31 b30 b29 b28 b27 b26 b25 b24
                            b23 b22 b21 b20 b19 b18 b17 b16
                            b15 b14 b13 b12 b11 b10 b9 b8
                            b7 b6 b5 b4 b3 b2 b1 b0 ->
```

```
                     case mod8 rb of
                       (L,L,L) -> Just (op, rdest, ra, zeroReg)
                       (L,L,H) -> let vec = Vector32 L b31 b30 b29
                                             b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6 b5 b4 b3 b2 b1
                                  in Just (op, rdest, vec, zeroReg)
                       (L,H,L) -> let vec = Vector32 L L b31 b30
                                             b29 b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6 b5 b4 b3 b2
                                  in Just (op, rdest, vec, zeroReg)
                       (L,H,H) -> let vec = Vector32 L L L
                                             b31 b30 b29 b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6 b5 b4 b3
                                  in Just (op, rdest, vec, zeroReg)
                       (H,L,L) -> let vec = Vector32 L L L L
                                             b31 b30 b29 b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6 b5 b4
                                  in Just (op, rdest, vec, zeroReg)
                       (H,L,H) -> let vec = Vector32 L L L L L
                                             b31 b30 b29 b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
```

```
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6 b5
                                  in Just (op, rdest, vec, zeroReg)
                       (H,H,L) -> let vec = Vector32 L L L L L L
                                             b31 b30 b29 b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8
                                             b7 b6
                                  in Just (op, rdest, vec, zeroReg)
                       (H,H,H) -> let vec = Vector32 L L L L L L L
                                             b31 b30 b29 b28 b27 b26 b25 b24
                                             b23 b22 b21 b20 b19 b18 b17 b16
                                             b15 b14 b13 b12 b11 b10 b9 b8 b7
                                  in Just (op, rdest, vec, zeroReg)
                 SUB  -> Just (op, rdest, sub_ ra rb, zeroReg)
                 SUBI -> Just (op, rdest, sub_ ra rb, zeroReg)
                 SW   -> Just (op, rdest, ra, rb)
                 XOR  -> Just (op, rdest, xor_ ra rb, zeroReg)
                 XORI -> Just (op, rdest, xor_ ra rb, zeroReg)
                 _    -> Nothing

procp :: Monad m => ALUI -> ReT ALUI ALUO m ()
procp inp = do
  case inp of
    (_,H,_,_,_) -> do
      signal Nothing
      signal Nothing
      signal Nothing
```

```
      inp' <- signal Nothing
      procp inp'
    _ -> do
      inp' <- signal (proc inp)
      procp inp'

alu :: Monad m => ReT ALUI ALUO m ()
alu = procp (NOP, L, R0, zeroReg, zeroReg)
```

Listing B.4: The DLX execute phase implemented in Haskell
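
The sign-filling and zero-filling shift cases in Listing B.4 write out all 32 output bits for each shift amount. As a cross-check on those tables, the following minimal sketch models the same two shifts over a plain list of bits, most significant bit first. It is ordinary Haskell for illustration only and is not part of the Redux sources: `shiftRightWith`, `srl`, and `sra` are hypothetical names, and the local `Bit` type is an assumption that merely mirrors the listing's `L`/`H` constructors.

```haskell
-- Illustrative model only; Bit mirrors the L/H constructors used in the listing.
data Bit = L | H deriving (Eq, Show)

-- Hypothetical helper: shift right by n positions, filling the vacated
-- high-order positions with `fill` and dropping the n low-order bits.
shiftRightWith :: Bit -> Int -> [Bit] -> [Bit]
shiftRightWith fill n bits = take (length bits) (replicate n fill ++ bits)

-- Logical shift: vacated positions are filled with L.
srl :: Int -> [Bit] -> [Bit]
srl = shiftRightWith L

-- Arithmetic shift: vacated positions are filled with the sign bit,
-- i.e. the head of the most-significant-first list.
sra :: Int -> [Bit] -> [Bit]
sra _ []             = []
sra n bits@(msb : _) = shiftRightWith msb n bits
```

For the 8-bit word `[H,L,H,H,L,L,H,L]`, `srl 3` gives `[L,L,L,H,L,H,H,L]` and `sra 3` gives `[H,H,H,H,L,H,H,L]`, which matches the shape of the `(L,H,H)` cases in the listing: three fill bits followed by the original bits down to bit 3.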

## B.5 DLX Memory Access Phase

```
module Redux.Memory where

import Redux.Types

type Stall   = Bit
type Flush   = Stall
type Data    = RegVal
type Address = Data
type PC      = RegVal

type MemI = (
              --Data bus coming from Memory unit
              Data,
              --Output from the ALU. Opcode, Dest Reg Name, Register A, Register B values
```

```
              Maybe (Opcode, Reg, RegVal, RegVal)
            )

type MemO = (
              --Data to be written to memory, if Nothing, we are reading
              Maybe Data,
              --Address to be read/written
              Address,
              --Shall we stall the pipeline?
              Stall,
              --Shall we flush the pipeline (in the event of a branch)?
              Flush,
              --Destination register and value to be written back to it
              Maybe (Reg, RegVal),
              --New value of the PC in the event of a branch
              Maybe PC
            )

lastBit :: Vector32 Bit -> Bit
lastBit vect = case vect of
                 Vector32 _ _ _ _ _ _ _ _
                          _ _ _ _ _ _ _ _
                          _ _ _ _ _ _ _ _
                          _ _ _ _ _ _ _ b -> b

feed :: Monad m => (Reg, RegVal) -> ReT MemI MemO m MemI
feed fwd = signal (Nothing, zeroReg, L, L, Just fwd, Nothing)

nop :: Monad m => ReT MemI MemO m MemI
nop = signal (Nothing, zeroReg, L, L, Nothing, Nothing)
```

```
readStall :: Monad m => Address -> ReT MemI MemO m MemI
readStall addr = do
  _ <- signal (Nothing, addr, H, L, Nothing, Nothing)
  signal (Nothing, addr, H, L, Nothing, Nothing)

writeStall :: Monad m => Data -> Address -> ReT MemI MemO m MemI
writeStall dta addr = signal (Just dta, addr, L, L, Nothing, Nothing)

branch :: Monad m => Maybe (Reg,RegVal) -> RegVal -> ReT MemI MemO m MemI
branch mb val = signal (Nothing, zeroReg, L, H, mb, Just val)

memProc :: Monad m => MemI -> ReT MemI MemO m ()
memProc inp = case inp of
  (dta, Nothing) -> do
    i <- signal (Nothing, zeroReg, L, L, Nothing, Nothing)
    memProc i
  --RA is also the value to be written for all others
  (dta, Just (opcode, reg, ra, rb)) ->
    case opcode of
      LW -> do
        i <- readStall ra
        case i of
          (dta, _) ->
            do
              --Stall one last time
```

```
              inp <- signal
                       (Nothing, zeroReg, H,
                        L, Just (reg, dta), Nothing)
              memProc inp
      SW -> do
        i <- writeStall rb ra
        memProc i
      J -> do
        inp <- branch Nothing ra
        memProc inp
      JAL -> do
        --RA is PC+4
        --RB is PC += extend(value)
        inp <- branch (Just (R31, ra)) rb
        memProc inp
      JALR -> do
        --RA is PC+4
        --RB is PC = Rs1
        inp <- branch (Just (R31, ra)) rb
        memProc inp
      JR -> do
        --RA is Rs1 (value)
        inp <- branch Nothing ra
        memProc inp
      BEQZ -> do
        case lastBit rb of
          H -> do
            _ <- branch Nothing ra
```

```
            inp <- signal (Nothing, zeroReg, L, L, Nothing, Nothing)
            memProc inp
          L -> do
            inp <- nop
            memProc inp
      BNEZ -> do
        case lastBit rb of
          H -> do
            _ <- branch Nothing ra
            inp <- signal (Nothing, zeroReg, L, L, Nothing, Nothing)
            memProc inp
          L -> do
            inp <- nop
            memProc inp
      NOP -> do
        inp <- nop
        memProc inp
      _ -> do
        --Feed RA and reg forward
        inp <- feed (reg,ra)
        memProc inp

mem :: Monad m => ReT MemI MemO m ()
mem = memProc (zeroReg, Nothing)
```

Listing B.5: Haskell implementation of the DLX Memory Access processor phase.

## B.6 DLX Writeback Phase

```
module Redux.Writeback where

import Prelude (Monad)

import Redux.Types
import Redux.Instructions
import Control.Monad.Resumption.Reactive

type RegFile = Vector32 (RegVal)

zeroFile :: RegFile
zeroFile =
  Vector32 zeroReg zeroReg zeroReg zeroReg
           zeroReg zeroReg zeroReg zeroReg
           zeroReg zeroReg zeroReg zeroReg
           zeroReg zeroReg zeroReg zeroReg
           zeroReg zeroReg zeroReg zeroReg
           zeroReg zeroReg zeroReg zeroReg
           zeroReg zeroReg zeroReg zeroReg
           zeroReg zeroReg zeroReg zeroReg

--Reg Muxer
getReg :: Reg -> RegFile -> RegVal
getReg reg regfile = case regfile of
  (Vector32 b31 b30 b29 b28 b27 b26
            b25 b24 b23 b22 b21 b20
            b19 b18 b17 b16 b15 b14
```

```
            b13 b12 b11 b10 b9 b8
            b7 b6 b5 b4 b3 b2 b1 b0) ->
    case reg of
      R0 -> b0
      R1 -> b1
      R2 -> b2
      R3 -> b3
      R4 -> b4
      R5 -> b5
      R6 -> b6
      R7 -> b7
      R8 -> b8
      R9 -> b9
      R10 -> b10
      R11 -> b11
      R12 -> b12
      R13 -> b13
      R14 -> b14
      R15 -> b15
      R16 -> b16
      R17 -> b17
      R18 -> b18
      R19 -> b19
      R20 -> b20
      R21 -> b21
      R22 -> b22
      R23 -> b23
      R24 -> b24
      R25 -> b25
```

```
      R26 -> b26
      R27 -> b27
      R28 -> b28
      R29 -> b29
      R30 -> b30
      R31 -> b31

setReg :: (Reg,RegVal) -> RegFile -> RegFile
setReg rrv rfile = case rfile of
  (Vector32 b31 b30 b29 b28 b27 b26 b25 b24 b23 b22 b21 b20
            b19 b18 b17 b16 b15 b14 b13 b12 b11 b10 b9 b8
            b7 b6 b5 b4 b3 b2 b1 b0) ->
    case rrv of
      (R0, _) -> rfile --R0 remains unchanged always at zero
      (R1, b1) -> (Vector32 b31 b30 b29 b28 b27 b26
                            b25 b24 b23 b22 b21 b20
                            b19 b18 b17 b16 b15 b14
                            b13 b12 b11 b10 b9 b8
                            b7 b6 b5 b4 b3 b2 b1 b0)
      (R2, b2) -> (Vector32 b31 b30 b29 b28 b27 b26
                            b25 b24 b23 b22 b21 b20
                            b19 b18 b17 b16 b15 b14
                            b13 b12 b11 b10 b9 b8
                            b7 b6 b5 b4 b3 b2 b1 b0)
      (R3, b3) -> (Vector32 b31 b30 b29 b28 b27 b26
                            b25 b24 b23 b22 b21 b20
                            b19 b18 b17 b16 b15 b14
                            b13 b12 b11 b10 b9 b8
```

```
                            b7 b6 b5 b4 b3 b2 b1 b0)
      (R4, b4) -> (Vector32 b31 b30 b29 b28 b27 b26
                            b25 b24 b23 b22 b21 b20
                            b19 b18 b17 b16 b15 b14
                            b13 b12 b11 b10 b9 b8
                            b7 b6 b5 b4 b3 b2 b1 b0)
      (R5, b5) -> (Vector32 b31 b30 b29 b28 b27 b26
                            b25 b24 b23 b22 b21 b20
                            b19 b18 b17 b16 b15 b14
                            b13 b12 b11 b10 b9 b8
                            b7 b6 b5 b4 b3 b2 b1 b0)
      (R6, b6) -> (Vector32 b31 b30 b29 b28 b27 b26
                            b25 b24 b23 b22 b21 b20
                            b19 b18 b17 b16 b15 b14
                            b13 b12 b11 b10 b9 b8
                            b7 b6 b5 b4 b3 b2 b1 b0)
      (R7, b7) -> (Vector32 b31 b30 b29 b28 b27 b26
                            b25 b24 b23 b22 b21 b20
                            b19 b18 b17 b16 b15 b14
                            b13 b12 b11 b10 b9 b8
                            b7 b6 b5 b4 b3 b2 b1 b0)
      (R8, b8) -> (Vector32 b31 b30 b29 b28 b27 b26
                            b25 b24 b23 b22 b21 b20
                            b19 b18 b17 b16 b15 b14
                            b13 b12 b11 b10 b9 b8
                            b7 b6 b5 b4 b3 b2 b1 b0)
      (R9, b9) -> (Vector32 b31 b30 b29 b28 b27 b26
                            b25 b24 b23 b22 b21 b20
                            b19 b18 b17 b16 b15 b14
```

```
                            b13 b12 b11 b10 b9 b8
                            b7 b6 b5 b4 b3 b2 b1 b0)
      (R10, b10) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R11, b11) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R12, b12) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R13, b13) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R14, b14) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R15, b15) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
```

```
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R16, b16) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R17, b17) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R18, b18) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R19, b19) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R20, b20) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R21, b21) -> (Vector32 b31 b30 b29 b28 b27 b26
```

```
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R22, b22) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R23, b23) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R24, b24) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R25, b25) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
      (R26, b26) -> (Vector32 b31 b30 b29 b28 b27 b26
                              b25 b24 b23 b22 b21 b20
                              b19 b18 b17 b16 b15 b14
                              b13 b12 b11 b10 b9 b8
                              b7 b6 b5 b4 b3 b2 b1 b0)
           (R27, b27) -> (Vector32 b31 b30 b29 b28 b27 b26
                                 b25 b24 b23 b22 b21 b20
                                 b19 b18 b17 b16 b15 b14
                                 b13 b12 b11 b10 b9 b8
                                 b7 b6 b5 b4 b3 b2 b1 b0)
           (R28, b28) -> (Vector32 b31 b30 b29 b28 b27 b26
                                 b25 b24 b23 b22 b21 b20
                                 b19 b18 b17 b16 b15 b14
                                 b13 b12 b11 b10 b9 b8
                                 b7 b6 b5 b4 b3 b2 b1 b0)
           (R29, b29) -> (Vector32 b31 b30 b29 b28 b27 b26
                                 b25 b24 b23 b22 b21 b20
                                 b19 b18 b17 b16 b15 b14
                                 b13 b12 b11 b10 b9 b8
                                 b7 b6 b5 b4 b3 b2 b1 b0)
           (R30, b30) -> (Vector32 b31 b30 b29 b28 b27 b26
                                 b25 b24 b23 b22 b21 b20
                                 b19 b18 b17 b16 b15 b14
                                 b13 b12 b11 b10 b9 b8
                                 b7 b6 b5 b4 b3 b2 b1 b0)
           (R31, b31) -> (Vector32 b31 b30 b29 b28 b27 b26
                                 b25 b24 b23 b22 b21 b20
                                 b19 b18 b17 b16 b15 b14
                                 b13 b12 b11 b10 b9 b8
                                 b7 b6 b5 b4 b3 b2 b1 b0)

writeback_ :: (Monad m) => Maybe (Reg, RegVal)
           -> ReT (Maybe (Reg, RegVal)) RegFile (StT RegFile m) ()
writeback_ i = do
    -- Read the current register file from the state layer.
    s <- lift get
    -- Apply the pending register write, if there is one.
    let sp = case i of
               Just inp -> setReg inp s
               Nothing  -> s
    lift (put sp)
    -- Output the updated register file and receive the next input.
    inpp <- signal sp
    writeback_ inpp

writeback :: (Monad m) => ReT (Maybe (Reg, RegVal)) RegFile m ()
writeback = do
    -- Run the loop from an all-zero register file with no pending write.
    extrude (writeback_ Nothing) zeroFile
    return ()
```

Listing B.6: The DLX Writeback phase implemented in Haskell
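
As a rough illustration of what Listing B.6 computes on each cycle, the sketch below models the writeback loop with ordinary lists in place of the ReWire `ReT` and `StT` monads. The definitions of `Reg`, `RegFile`, `setReg`, `zeroFile`, `step`, and `runWriteback` are simplified stand-ins assumed for this example only; they are not the definitions from `Redux.Types`, and the register file is a plain 32-element list rather than a `Vector32`.

```haskell
import Data.Word (Word32)

-- Hypothetical stand-ins for the listing's Reg, RegVal, and RegFile types.
newtype Reg = Reg Int deriving (Eq, Show)
type RegVal  = Word32
type RegFile = [RegVal]          -- position i holds the value of register i

-- An all-zero file of 32 registers (plays the role of zeroFile).
zeroFile :: RegFile
zeroFile = replicate 32 0

-- Replace one register and leave the others untouched (plays the role of setReg).
setReg :: (Reg, RegVal) -> RegFile -> RegFile
setReg (Reg r, v) rf = [ if i == r then v else old | (i, old) <- zip [0 ..] rf ]

-- One cycle: apply the optional write to the current register file.
step :: Maybe (Reg, RegVal) -> RegFile -> RegFile
step (Just w) rf = setReg w rf
step Nothing  rf = rf

-- Feed a list of per-cycle inputs and collect the register file that is
-- output on each cycle, starting with the initial all-zero file.
runWriteback :: [Maybe (Reg, RegVal)] -> [RegFile]
runWriteback = scanl (flip step) zeroFile
```

Loaded into GHCi, `map (!! 3) (runWriteback [Just (Reg 3, 7), Nothing])` evaluates to `[0,7,7]`: the initial file is output first, the write to register 3 is visible on the next cycle, and a `Nothing` input leaves the file unchanged.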

## B.7 Combining DLX Phases to a Processor

```
module Redux.Proc where

import Redux.Types
import Redux.Fetch
import Redux.Decode
import Redux.ALU
import Redux.Memory
import Redux.Writeback


flatten (a,(b,(c,(d,e)))) = (a,b,c,d,e)
fanOut ((a,b,c),(d,e,f,g),h,(i,j,k,l,m,n),o) =
      (a,b,c,d,e,f,g,h,i,j,k,l,m,n,o)
pack (a,b,c,d,e) = (a,(b,(c,(d,e))))
packIn (a,b,c,d,e,f,g,h,i,j,k,l,m,n,o) =
      ((a,b),(c,d,e,f),(g,h,i,j,k),(l,m),(n,o))

-- Stall wrappers: a high stall bit (H) maps the stage input to Nothing;
-- a low bit (L) passes the input through as Just f.
alu_stall = refoldT
                   id
                   (\(s,f) -> case s of
                                    H -> Nothing
                                    L -> Just f
                   )
                   alu

decode_stall = refoldT
                     id
                     (\(s,f) -> case s of
                                       H -> Nothing
                                       L -> Just f
                     )
                     decode

fetch_stall = refoldT
                    id
                    (\(s,f) -> case s of
                                     H -> Nothing
                                     L -> Just f
                    )
                    fetch

devOut :: (NextInst,
            Instr,
            Redux.Fetch.PC,
            Opcode,
            Reg,
            (Reg, RegVal),
            (Reg, RegVal),
            ALUO,
            Maybe Data,
            Address,
            Stall,
            Redux.Memory.Flush,
            Maybe (Reg, RegVal),
            Maybe Redux.Memory.PC,
            RegFile) -> (NextInst, Maybe Data, Address)
devOut (a,b,c,d,e,f,g,h,i,j,k,l,m,n,o) = (a,i,j)

devOutTR :: (NextInst,
            Instr,
            Redux.Fetch.PC,
            Opcode,
            Reg,
            (Reg, RegVal),
            (Reg, RegVal),
            ALUO,
            Maybe Data,
            Address,
            Stall,
            Redux.Memory.Flush,
            Maybe (Reg, RegVal),
            Maybe Redux.Memory.PC,
            RegFile) -> (NextInst, Maybe Data, Address, RegFile)
devOutTR (a,b,c,d,e,f,g,h,i,j,k,l,m,n,o) = (a,i,j,o)

-- The processor: five stages composed in parallel, with each stage's next
-- input computed from the combined stage outputs and the external input.
dlx = refold
           (\x -> devOut (fanOut (flatten x)))
           (\out -> \inp -> let outp = (fanOut (flatten out))
                     in let fe = fetchIn outp inp
                     in let dc = decodeIn outp inp
                     in let al = aluIn outp inp
                     in let me = memIn outp inp
                     in let wb = wbIn outp inp
                     in pack (fe, dc, al, me, wb))
           (parI fetch_stall
            (parI decode_stall
             (parI alu_stall (parI mem writeback))))

-- Same composition as dlx, but the register file is also exposed as output.
dlx_testreg = refold
           (devOutTR . fanOut . flatten)
           (\out inp -> let outp = (fanOut (flatten out))
                            fe   = fetchIn outp inp
                            dc   = decodeIn outp inp
                            al   = aluIn outp inp
                            me   = memIn outp inp
                            wb   = wbIn outp inp
                         in pack (fe, dc, al, me, wb))
           (parI fetch_stall
            (parI decode_stall
             (parI alu_stall (parI mem writeback))))

fwd2 :: Maybe (Opcode, RegDest, RegVal, RegVal)
         -> Maybe (Reg, RegVal)
         -> (Reg, RegVal) -> RegVal
fwd2 aluo memo otro = case aluo of
                               Nothing         -> fwd1 memo otro
                               Just (_,R0,_,_) -> fwd1 memo otro
                               Just (_, frd, frv, _) ->
                                 case otro of
                                  (rd,rv) -> case regEq frd rd of
                                                 -- ALU FWD is a hit
                                                 H -> frv
                                                 -- ALU FWD is a miss, try Mem FWD
                                                 L -> fwd1 memo otro

fwd1 :: Maybe (Reg, RegVal) -> (Reg, RegVal) -> RegVal
fwd1 memo otro = case memo of
                              -- No matches in the current inst
                              Nothing -> (snd otro)
                              Just (frd, frv) -> case otro of
                                                       (rd, rv) ->
                                                         case regEq frd rd of
                                                              H -> frv
                                                              L -> rv

fetchIn :: (NextInst, Instr,
            Redux.Fetch.PC,
            Opcode,
            Reg,
            (Reg, RegVal),
            (Reg, RegVal),
            ALUO,
            Maybe Data,
            Address,
            Stall,
            Redux.Memory.Flush,
            Maybe (Reg, RegVal),
            Maybe Redux.Memory.PC,
            RegFile) -> (Instr, Data) -> (Bit, (Instr, NewAdd))

fetchIn (nextInst, instr, fetchPC,
             dcOp, dcDreg, (regA, regAv),
             (regB, regBv), aluO, rwData,
             dAddr, stall, flush, mbWbReg,
             mbPC, rfile) (instIn, dataIn) = (stall, (instIn, mbPC))

decodeIn :: (NextInst, Instr,
            Redux.Fetch.PC,
            Opcode,
            Reg,
            (Reg, RegVal),
            (Reg, RegVal),
            ALUO,
            Maybe Data,
            Address,
            Stall,
            Redux.Memory.Flush,
            Maybe (Reg, RegVal),
            Maybe Redux.Memory.PC,
            RegFile) -> (Instr, Data) -> (Bit, (Vector32 Bit, RegFile, RegVal))

decodeIn (nextInst, instr, fetchPC,
             dcOp, dcDreg, (regA, regAv),
             (regB, regBv), aluO, rwData,
             dAddr, stall, flush, mbWbReg,
             mbPC, rfile) (instIn, dataIn) = (stall, (instr, rfile, fetchPC))

aluIn :: (NextInst, Instr,
            Redux.Fetch.PC,
            Opcode,
            Reg,
            (Reg, RegVal),
            (Reg, RegVal),
            ALUO,
            Maybe Data,
            Address,
            Stall,
            Redux.Memory.Flush,
            Maybe (Reg, RegVal),
            Maybe Redux.Memory.PC,
            RegFile) -> (Instr, Data) -> (Bit, ALUI)

aluIn (nextInst, instr, fetchPC,
             dcOp, dcDreg, regA,
             regB, aluO, rwData,
             dAddr, stall, flush, mbWbReg,
             mbPC, rfile) (instIn, dataIn) =
             (stall,(dcOp,flush,dcDreg,
              fwd2 aluO mbWbReg regA, fwd2 aluO mbWbReg regB))

memIn :: (NextInst, Instr,
            Redux.Fetch.PC,
            Opcode,
            Reg,
            (Reg, RegVal),
            (Reg, RegVal),
            ALUO,
            Maybe Data,
            Address,
            Stall,
            Redux.Memory.Flush,
            Maybe (Reg, RegVal),
            Maybe Redux.Memory.PC,
            RegFile) -> (Instr, Data) -> MemI

memIn (nextInst, instr, fetchPC,
             dcOp, dcDreg, regA,
             regB, aluO, rwData,
             dAddr, stall, flush, mbWbReg,
             mbPC, rfile) (instIn, dataIn) = (dataIn, aluO)

wbIn :: (NextInst, Instr,
            Redux.Fetch.PC,
            Opcode,
            Reg,
            (Reg, RegVal),
            (Reg, RegVal),
            ALUO,
            Maybe Data,
            Address,
            Stall,
            Redux.Memory.Flush,
            Maybe (Reg, RegVal),
            Maybe Redux.Memory.PC,
            RegFile) -> (Instr, Data) -> Maybe (Reg, RegVal)

wbIn (nextInst, instr, fetchPC,
             dcOp, dcDreg, regA,
             regB, aluO, rwData,
             dAddr, stall, flush, mbWbReg,
             mbPC, rfile) (instIn, dataIn) = mbWbReg
```

Listing B.7: Combining the subcomponents of the DLX processor with support for stalling
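
The operand forwarding performed by `fwd2` and `fwd1` in Listing B.7 can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the listing's code: `Reg` is a hypothetical newtype over `Int` compared with `==` instead of `regEq`, the ALU result is reduced to a `(Reg, RegVal)` pair rather than the four-field tuple used in the listing, and the names `fwdMem` and `fwdAlu` are invented for this example.

```haskell
import Data.Word (Word32)

-- Hypothetical stand-ins for the listing's Reg and RegVal types.
newtype Reg = Reg Int deriving (Eq, Show)
type RegVal = Word32

-- Memory-stage forwarding (the role of fwd1): use the value being written
-- back if it targets the operand's register, else keep the operand's value.
fwdMem :: Maybe (Reg, RegVal) -> (Reg, RegVal) -> RegVal
fwdMem (Just (frd, frv)) (rd, _) | frd == rd = frv
fwdMem _ (_, rv) = rv

-- ALU-stage forwarding (the role of fwd2): an ALU result takes priority over
-- the memory-stage value, except that results destined for register 0 are
-- never forwarded from the ALU. On a miss, fall back to fwdMem.
fwdAlu :: Maybe (Reg, RegVal) -> Maybe (Reg, RegVal) -> (Reg, RegVal) -> RegVal
fwdAlu (Just (frd, frv)) _ (rd, _) | frd /= Reg 0 && frd == rd = frv
fwdAlu _ memo operand = fwdMem memo operand
```

For instance, `fwdAlu (Just (Reg 1, 10)) (Just (Reg 1, 20)) (Reg 1, 0)` evaluates to `10`, because the ALU result is the more recent write to register 1, while `fwdAlu Nothing (Just (Reg 1, 20)) (Reg 1, 0)` evaluates to `20`.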

### VITA

Ian Graves was born in Kansas City, Missouri on July 11th, 1986, to Beverly and Leland Graves. He is an Eagle Scout and an alumnus of Lee's Summit Senior High School in the class of 2005. He currently resides in the Portland, Oregon area with his wife, Amanda Graves (B.S. 2011, M.A. 2013). He received his B.S. degree in Computer Science *cum laude* with a minor in mathematics in May of 2009 and he completed the Ph.D. degree in December, 2015.