James Hegarty jhegarty@stanford.edu
Rigel is a language for describing image processing hardware embedded in Lua. Rigel can compile to Verilog hardware designs for Xilinx FPGAs, and also can compile to fast x86 test code using Terra.
See our SIGGRAPH 2016 Paper on the design of Rigel
Installing Rigel is not required. Rigel requires luajit, which can be easily installed with either apt-get or homebrew. Rigel provides a wrapper script (rigelLuajit) that automatically sets up the correct paths for luajit:
cd [rigel root]/examples
../rigelLuajit pointwise_wide_handshake.lua
Which will then generate the verilog file "[rigel root]/examples/out/pointwise_wide_handshake.v".
Rigel has two optional dependencies: Verilator and Terra. Verilator can be installed using apt-get or homebrew, and enables fast verilog simulation:
cd [rigel root]/examples
make out/pointwise_wide_handshake.verilator.bmp
Which runs the pointwise example verilog through Verilator and produces an output image.
Optionally installing terra enables our fast Terra simulator. You can download the Terra binary files on https://github.com/zdevito/terra/releases (tested with release-2016-03-25) or build it. You can clone the repository and build Terra using the instructions in the Terra Readme. Run the REPL and make sure it installed correctly. The terra simulator can then be run similarly to Verilator:
cd [rigel root]/examples
make out/pointwise_wide_handshake.terra.bmp
Once set up, make can be used to run our full suite of tests:
cd [rigel root]/examples
make verilog
make verilator
make terra
This runs 100s of test pipelines, so it may be slow. The output of each test is located at examples/out/[testname].v
, examples/out/[testname].verilator.bmp
, and examples/out/[testname].terra.bmp
. If make completes without errors, this means that all tests were successful.
make verilog
will write out verilog files for each test (located at examples/out/[testname].v
). You can compile and run these verilog files on your board using your own flow. However, we also provide a simplified Xilinx FPGA flow that we recommend (implemented in examples/makefile
). Our flow supports both generating a Xilinx bitstream (.bit) and also loading and running the pipeline on the board automatically (over ethernet).
Rigel's FPGA flow was tested with Xilinx ISE 14.5. This is the only version of ISE that works with the Zynq 7100. Later versions of ISE should work fine as well, if you are building for the Zynq 7020. If ISE is installed, our makefile should be able to build bitstreams for the Zynq 7020 and 7100.
Rigel's FPGA flow optionally also supports loading the generated bitstream onto an FPGA board and running an image through the pipeline for testing. The board must be accessible via SSH over ethernet. We tested with Xilinx Linux. This is the version of linux that comes by default on the Zedboards, etc. Our flow may be compatible with other Linux releases. We require that /dev/xdevcfg
exists, which allows us to reprogram the FPGA fabric from within linux by cat'ing the bitstream to this device.
By default, the makefile is set up for two board configurations:
Zynq 7020 (zedboard, Trenz Electric TE0720)
address: 192.168.2.2
username: root
password: root
fpga local write path: /var/volatile
Zynq 7100 (Avnet)
address: 192.168.1.10
username: root
password: root
fpga local write path: /tmp
Addresses, etc can be modified in examples/makefile
.
Our makefile supports the following options:
- make verilog: Build all (generic, platform-independent) Verilog files. Outputs at
examples/out/[testname].v
- make verilator: Simulator Verilog using Verilator. Outputs at
examples/out/[testname].verilator.bmp
- make terra: Run compiler and Terra x86 simulator on all tests. Sim outputs at
examples/out/[testname].terra.bmp
. - make: build and run all simulations and bitstreams on both boards
- make clean: delete all built files from
examples/out/
On Xilinx, we support three FPGAs, the Zynq 7010, 7020 (most common, used in the Zedboard), and 7100. We also support both ISE and Vivado.
- make zynq20isebits: Builds all 7020 bitstreams using ISE. Outputs at
examples/out/[testname].zynq20ise.bit
- make zynq20ise: Run all 7020 bitstream on board. Outputs at
examples/out/[testname].zynq20ise.bmp
- make zynq100vivadobits: Builds all 7100 bitstreams using Vivado. Outputs at
examples/out/[testname].zynq20vivado.bit
- make zynq100vivado: Run all 7100 bitstream on board. Outputs at
examples/out/[testname].zynq20vivado.bmp
- make isim: Run Xilinx ISIM simulator on all tests. Outputs at
examples/out/[testname].isim.bmp
- etc
For Ross Daly's camera test rig:
- make camerabits: Build all bitstreams for the camera rig. Outputs at
examples/out/[testname].camera.bit
- make [testname].camera.run: Run bitstream on camera rig (over ethernet)
A verbose debug mode can be activated by setting the environment variable v
, i.e.:
v=1 make terra
Rigel hosts build scripts that automatically test streaming modules (e.g. those produced with Rigel) on multiple Xilinx Zynq SoC platforms. The scripts are not tied to Rigel however, and can be used with any verilog block that implements a streaming (Handshake) interface, similar to AXI-stream, and provides valid configuration options.
Each supported SoC platform has a folder in the platform/ directory. Each platform is identified by a name in the format CHIPCOMPILER. Currently support CHIP options are zynq10 (Zynq 7010, such as the Zybo board), zynq20 (Zynq 7020, such as the Zedboard), zynq100 (Zynq 7100, such as the AES-MINI-ITX-7z100-G), and zu9 (Zynq Ultrascale+ MPSOC XCZU9EG, such as the ZCU102 board). Currently supported COMPILER options are ise and vivado. For example, compiling for the Zedboard (which uses the 7020 chip) on Vivado would use the scripts in platform/zynq20vivado.
As the scripts only use internal wiring on the FPGA (no external IO), they should be robust to working on different FPGA board with the same chip. Different SKUs in the same chip series can typically be supported by simply changing the Xilinx 'part' option in the compile script. Pull requests for more parts would be appreciated.
Simulator platforms are also supported: platform/verilator contains scripts to test the target block using the open source verilator simulator.
Within each platform directory, two scripts are provided:
compile VERILOG_FILE METADATA_FILE BUILDDIR OUTFILE
Wrap the input Verilog file with a test harness, and compile it to either a bitstream or executable (for simulators).
- VERILOG_FILE: path to the verilog file input (see description below for expected module interface).
- METADATA_FILE: path to the metadata file describing the Verilog module in VERILOG_FILE (see description below).
- BUILDDIR: Path to a folder where build temporaries, logs, and stats can be written. Each script will expect the directory BUILDDIR to exist in the working directory from which they are called.
- OUTFILE: Path where the final output file of the script should be written. For Zynq plaforms this is a bitfile, for Verilator it is an executable.
run BITFILE METADATA_FILE OUTFILE PLATFORM_SPECIFIC
For FPGA platforms, this logs in to the FPGA, programs the fabric, runs the tests, and transfers it back to the host. For simulator platforms, this runs the simulator.
- BITFILE: Compiled FPGA bitfile to run (returned from the compile script). For simulator platforms, this is the compiled executable.
- METADATA_FILE: path to the metadata file describing the Verilog module in the bitfile (see description below).
- OUTFILE: Path where the final output file of the script should be written. This will be the output of the streaming module, written to a file as raw binary data.
- PLATFORM_SPECIFIC: (optional) platform dependent data field. For FPGA platforms, this is the IP address of the FPGA board (e.g. 192.168.1.10).
Each script will expect all paths to be relative to the working directory from which they are called.
The module metadata file describes the IO interface of your verilog module, and the test data to run. Module metadata is provided in lua table format:
return {topModule=[String], tapBits=[Number], tapValue=[String], inputImage=[String],
inputBitsPerPixel=[Number], inputV=[Number], inputWidth=[Number], inputHeight=[Number],
inputV=[Number], outputBitsPerPixel=[Number], outputV=[Number], outputWidth=[Number],
outputHeight=[Number]}
With the following definition:
- topModule: Name of verilog module to run as top.
- tapBits: Total size in bits of runtime configurable GPIO constants.
- tapValue: GPIO constants to use for test, in hex.
- inputImage: Path to test input data file (stored in raw binary format). Path should be relative to the working directory where the build scripts are run from.
- bitsPerPixel: Number of bits per pixel.
- V: Number of pixels your module expects per cycle. Width of your verilog module should be VbitsPerPixel*.
- width: image width in pixels.
- height: image height in pixels.
The options 'bitsPerPixel', 'V', 'width', and 'height' behave the same for input and output.
Note: the metadata file format contains redundant information so that the output of your module can be interpreted as an image. However, this is not a strict requirement, and the script works for raw binary data as well. If IO is not an image, simply set 'width=# of tokens', 'height=1', 'bitsPerPixel=bitwidth of module interface', and 'V=1'.
The scripts expect the Verilog top module being tested to have the following format:
module Top( input CLK, input reset, input [tapBits-1:0] taps,
input [inputBitsPerPixel*inputV:0] process_input, output ready,
output [outputBitsPerPixel*outputV:0] process_output, input ready_downstream );
With the following definition:
- CLK: module clock
- reset: synchronous module reset, resets when high.
- taps: Packed GPIO pins.
- process_input: Input data token. Valid bit is packed in top bit of this value.
- ready: upstream ready (for the module input)
- process_output: Output data token. Valid bit is packed in top bit of this value.
- ready_downstream downstream ready (for the module output)
If your Verilog module does not conform to this interface, we suggest adding a small wrapper module to translate the interfaces, and marking that as top.
While the scripts should be able to build bitstreams successfully for most boards, Zynq SoCs also contain a large software stack that is more variable. Non-standard Linux configurations can cause our scripts to need to vary significantly. For most run scripts, we tested it with the default binaries from the Xilinx binary releases.
Users implement applications in a image processing pipeline description language that we call Rigel. Rigel pipelines are then lowered to a low-level RTL-like hardware description language of our own design called Systolic. Users of Rigel sometimes need to interact with Systolic directly to extend Rigel's core functionality.
This readme documents the core functionality and APIs of both Rigel and Systolic. We intend this document to cover both how to use the system, and also to document the code so that users can modify and extend the system.
We first cover the Base Types that are common to both Rigel and Systolic. Then we cover the core APIs needed to construct pipelines in Rigel. Then we cover the large suite of Rigel Modules that accomplish image processing tasks. Finally, we provide information about how to construct hardware modules in Systolic.
- Base Types
- Rigel
- Rigel Modules
- Core Modules lift, lambda, stencil, linebuffer
- Array Manipulation SoAtoAoS, pad, crop, border
- Multi-Rate Modules downsample, upsample, filter
- Higher-Order Modules map, reduce
- Interfaces liftBasic, liftDecimate, liftHandshake
- Streams broadcastStream, fifo
- Misc lut, fread, fwrite
- Systolic
- Misc Other Files
Rigel and Systolic's type system supports arbitrary precision ints, uints, floats, bitfields, bools, 2d arrays, and tuples. types.t provides the types that are used throughout both Rigel and Systolic.
Types can be marked as constant, indicating that their value will not change while the pipeline is executing. This allows the compiler to perform additional optimizations. Rigel's constants are roughly equivilant to 'uniforms' in shader programming, or 'taps' in image processing.
types = require "types"
-- base type constructors
types.bool( constant:bool )
types.uint( precision:uint, constant:bool )
types.int( precision:uint, constant:bool )
types.bits( precision:uint, constant:bool ) -- bitfield
types.float( precision:uint, constant:bool )
types.array2d( arrayover:Type, width:uint, height:uint )
types.tuple( typelist:Type[] )
types.t also provides a number of helper functions for manipulating and introspecting on types.
types.isType(x) -- is table a Type?
type:const() -- is type constant?
ty = type:makeConst() -- returns constant version of type
ty = type:stripConst() -- returns non-constant version of type
type:isArray()
type:arrayOver() -- if type is array, return type that array is over. Otherwise error out.
type:baseType() -- if type is array, return type that array is over. Otherwise return type.
sz = type:arrayLength() -- return array dimensions as lua type {width,height}
type:channels() -- returns width*height
type:isTuple()
type:tupleList() -- return lua table of Types in tuple
type:isBool()
type:isInt()
type:isUint()
type:isBits()
type:isFloat()
type:isNumber() -- float, int, or uint
ty = type:toTerraType( asPointer:bool, [vectorWidth:uint] ) -- lower Rigel type to Terra type
type:valueToTerra( value:lua ) -- convert lua value into terra value, coerced into format of this type
-- this checks that a lua value can be coerced into a Rigel type.
-- Lua numbers can convert to float, uint, or int (as long as they don't overflow).
-- Lua tables can convert into tuples or arrays, as long as sizes and types match.
type:checkLuaValue(value:lua)
type:fakeValue() -- returns a arbitrary lua value that will successfully convert to 'type'
type:sizeof() -- bytes when type is used in Terra
type:verilogBits() -- bits when type is used in Verilog
Users construct image processing pipelines in Rigel by creating a Directed Acyclic Graph (DAG) of Rigel operations, which manipulate Rigel values. Most Rigel opreations involve applying a Rigel Module to a Rigel value. Rigel contains a large suite of built-in modules for performaning typical image processing operations, which will be covered later in the document.
Each Rigel module has an interface type, which specifies low-level information about the underlying hardware interface of the module. This is typically used to indicate whether the module support a synchronous interface, handshake interface, or bus interface. Modules can only be applied if interface types match. The current implementation does not perform automatic type conversions on interface types. For now, simply applying type conversion modules (Interface Modules) until the types match should work correctly.
rigel.t
contains core functionality for manipulating Rigel DAGs, and the core classes for Rigel interface types and modules. modules.t
(shown later) contains all the built-in Rigel modules that can be applied.
The table representation for all RigelIR nodes will have at least the following fields:
{
type : Type, -- output type of node
inputs : RigelIR[], -- inputs to node, as lua list
sdfRate : SDFRate,
name : String, -- a string that identifies this node when lowering to Verilog
loc : String -- file location where node was constructed
}
The following functions are used for defining leaf values of Rigel DAGs.
rigel.input( type:Type, [sdfRate:SDFRate] ) : RigelIR
fields: {..., kind="input" }
rigel.input
is used to declare an input to a pipeline of a given type types
, and optionally with a SDF rate sdfRate
.
rigel.constant( name:String, value:LuaValue, type:Type ) : RigelIR
fields: {..., kind="constant", value = value }
rigel.constant
is used to create a constant value (set at compile time and never changed).
The following functions construct RigelIR nodes that perform various operations on Rigel values.
rigel.apply( name:String, module:RigelModule, input:RigelIR ) : RigelIR
fields: {..., kind="apply", fn=module }
Apply the Rigel Module module
to input
. name
is used to identify this application when the graph is lowered to Verilog.
rigel.applyMethod( name:String, instance: RigelInstance, functionName:String, input:RigelIR ) : RigelIR
fields: {..., kind="applyMethod", inst=instance, fnname=functionName }
Apply the function named functionName
on module instance instance
to input
. name
is used to identify this application when the graph is lowered to Verilog.
Typically, this only comes up with FIFOs. FIFO modules have both a load
and store
'function', which must be applied separately.
TODO: really, the IR representation for this and apply should be the same.
rigel.tuple( name:String, inputList:RigelIR[], [packStreams:Bool] ) : RigelIR
fields: {..., kind="tuple", packStreams=packStreams }
Compose a tuple of the ordered list of values in lua array inputList
. name
is used to identify this concatenation when lowering to Verilog. When dealing with concatenating Handshake streams, packStreams
indicates whether we should treat the output as one Handshake stream, instead of multiple streams (default true - one stream).
TODO: remove packStreams
(legacy option). It seems like this only needs to be a real operator when we are operating on streams? Packing regular tuples could just be a module
rigel.array2d( name:String, inputList:RigelIR[width*height], width:Uint, height:Uint, [packStreams:Bool] ) : RigelIR
fields: {..., kind="array2d", W=width, H=height, packStreams=packStreams }
Compose an array of the ordered list of values in lua array inputList
. inputList
must be passed as a 2D array flattened into a 1D lua array in row major order. 2D dimensions for this data are then specified by width
and height
. name
is used to identify this concatenation when lowering to Verilog. When dealing with concatenating Handshake streams, packStreams
indicates whether we should treat the output as one Handshake stream, instead of multiple streams (default true - one stream). TODO: remove packStreams
(legacy option).
rigel.selectStream( name:String, input:RigelIR, i:Uint ) : RigelIR
fields: {..., kind="selectStream", i=i }
Given an array or tuple of Handshake values, this returns the stream at index i
(0 indexed).
TODO: turn this into a module.
rigel.statements( values:RigelIR[] ) : RigelIR
fields: {..., kind="statements" }
Rigel graphs can only have one input and output, however statements
allows you to pack multiple pipelines into one, reducing this restriction. Typically, this is only necessary in graphs with FIFOs. Only the first value is passed as the actual output of the graph (lua index 1).
Each module's interface type specifies low-level information about the underlying hardware implementation interface of the module. This is typically used to indicate whether the module support a synchronous interface, handshake interface, or bus interface. Modules can only be applied if interface types match. The current implementation does not perform automatic type conversions on interface types.
It is useful to think of these as abstract types that enforce rules of composition, not as literal descriptions of the signals of the resulting hardware. The Rigel compiler runs a small amount of code to correctly wire each interface type to its neighbors when it lowers to RTL. TODO: in the future it may be possible to make the low-level RTL layer powerful enough to type all of these interfaces and compose them directly.
By convention this document will use capital letters at the start of the alphabet like A, B, etc to refer to an arbitrary type. In this section they will refer to base types only (see types.t).
Synchronous Module that performs decimation. This corresponds to a statically-timed hardware module with a valid bit on the output only. V stands for valid.
Synchronous Module that performs both decimation and expansion. This corresponds to a statically-timed hardware module with a valid bit on both inputs/output, and a ready bit. RV stands for ready-valid.
Note that while it is possible to construct a statically-timed pipeline out of multiple modules with ready bits, this will not work! And this case is not checked by the compiler! A statically-timed pipeline can only have one module with a ready bit, because upstream stalls are not supported by static pipelines. To compose multiple modules with ready bits, the modules must first be lifed to Handshake types (below).
Handshake (full Ready-Valid on both input/output) interface. Supports composition of modules that perform both decimation and expansion.
Rigel also has the ability to make aggregates (arrays and tuples) of Handshakes. Since every Rigel module only has a single input and output, this allows us to implement modules that work on multiple streams. These types can not be nested (no Handshakes of Handshakes). Handshake(A) only works over base types.
Array of Handshakes - multiple handshake streams treated as one input.
Similar to the array of Handshakes above, but allows for different types.
Rigel also has provisional support for the ability to pack multiple streams onto one shared bus. Compared to the aggregates above (which have N streams on N data lines), multiplexed busses have N streams on 1 data line. This allows for supporting time-multiplexed hardware (hardware that operates on different streams in different cycles), or serializing multiple streams into one stream.
Unlike an array of Handshakes (above), a HandshakeArrayOneHot can only have 1 of the N streams active per cycle. HandshakeArrayOneHot's downstream ready bit identifies the stream that should be written on the data bus in the current cycle.
HandshakeTmuxed behaves like a regular Handshake but allows multiple different data streams to flow over one interface. HandshakeTmuxed's valid bit identifies the ID of the stream that is active.
Regular Handshake modules can be lifted to work over HandshakedTmuxed, but this should only be done if the module isn't stateful - there is no way to automatically duplicate the module state so that it stays coherent for each stream!
Next we will cover Rigel Modules: both the list of built-in modules, and how to construct new modules out of RigelIRs. This section covers functionality that is common to all modules.
Lower the Rigel module to a simulator module in Terra.
Each module is lowered to a Terra class with the following format:
local Struct Module {}
-- reset the module. Must call this before start of a frame
-- reset function always exists, but doesn't necessarily do anything
terra Module:reset() end
-- fire the module (perform 1 cycle of operation)
-- note that the caller must allocate space for the input and output
terra Module:process( inputValue : &module.inputType, output : &module.outputType )
-- calculate ready bit in backwards pass (see below)
-- only appears if module has ready bit
terra Module:calculateReady( readyDownstream : bool )
-- write out useful simulator statistics about this module
-- function is passed the name of the instance - to identify itself in output
-- stats must be provided, but doesn't necessarily have to do anything
terra Module:stats( moduleInstanceName : &int8 ) end
Rigel types are lowered to Terra like this:
Base types (the types in types.t
) are lowered to Terra directly - the type systems are compatible (ints become ints, arrays become arrays etc).
-- the bools are the valid bit
Handshake(A): tuple(A,bool)
V(A): tuple(A,bool)
RV(A): tuple(A,bool)
Notice that ready bit inputs do not appear in the lowered types. Ready bits are handled separately. Ready bit data flows the opposite direction of input data. Input data flows from the front of the pipeline, but ready bit data flows from downstream. For this reason we special-case ready bits and calculate them in a separate backwards pass. Each simulation cycle, the simulator will call calculateReady
on all ready-valid modules in backwards order first, before calling process
on all the modules. Each module stores the result of the ready bit calculation internally, so that it can remember the value of the ready bit when process
fires.
Lowers the Rigel module to a Systolic module.
The Rigel module is lowered to a module with the following format:
SystolicModule Module:
-- this dataflow resets stateful modules.
-- Note that this reset on a high value, which is unconventional
Dataflow Module:reset( doReset : bool ) : nil
-- Perform the module calculation
Dataflow Module:process( inputValue : module.inputType ) : module.outputType
-- Perform the ready bit calculation
Dataflow Module:ready( readyDownstream : bool ) : bool
Convenience function - simply calls toSystolic()
and then calls toVerilog()
on the resulting systolic module.
modules.t contains Rigel's core set of modules. This is intended to be the smallest and simplest set of core functionality that an architecture needs to support to implement Rigel's programming model. Some additional useful modules built out of these core modules are implemented in Rigel's standard library, examplescommon.t. A few of these modules from examplescommon.t are described here for completeness, and are marked with a (*).
All valid Rigel modules contain at least these fields:
{ kind:String, -- unique name for module
inputType:Type,
outputType:Type,
sdfInput:SDFRate,
sdfOutput:SDFRate,
delay:Number, -- pipeline delay (synchronous modules only)
stateful:Bool, -- Is module referentially transparent?
terraModule:Terra,
systolicModule:SystolicModule
}
Each module may include additional fields to describe the particular configuation of that module. Those additions are documented below.
type: input.type -> output.type
modules.lambda( newModuleName : String, input : RigelIR, output : RigelIR, [instances : RigelInstance[]] )
fields: {..., kind="lambda", name=newModuleName, input=input, output=output, instances=instances }
Define a new Rigel module based on a user-constructed pipeline. input
must be a RigelIR node returned from rigel.input
. output
is a RigelIR of the intented output. instances
is a list of any module instances returned from rigel.instantiateRegistered
that you want to include (typically only used for FIFOs).
modules.lift( newModuleName:String, inputType:Type, outputType:Type, delay:Number, terraFunction:terraFunction, systolicInput:SystolicValue, systolicOutput:SystolicValue)
fields: {..., kind="lift_"..newModuleName}
lift takes a function from the lower-level languages (Terra and Systolic) and lifts them into a Rigel module. delay
must correspond to the pipeline delay from systolicInput to systolicOutput.
type: null->null
modules.seqMap( module:RigelModule, width:uint, height:Uint, T:Uint )
fields: {..., kind="seqMap", W=width, H=height, T=T }
seqMap
executes a module multiple times sequentially over image data stored in memory. This is used to generate the top level module that actually executes the pipeline in the simulator or in hardware from DRAM.
seqMap
always reads/write data of the width expected by module
. This module will then read (W*H)/T
items of that size before halting. TODO: change this to match the inputCount
interface of seqMapHandshake
.
seqMap
expects the module that it is mapping to have a basic input and output type - Handshake interfaces should use seqMapHandshake
below.
type: null->null
modules.seqMapHandshake( module:RigelModule, [tapInputType:Type], [tapValue=tapValue], inputCount:Uint, outputCount:Uint, axi:Bool, [readyRate:Uint] )
fields: {..., kind="seqMapHandshake", inputCount=inputCount, outputCount=outputCount }
seqMapHandshake
executes a module multiple times sequentially over image data stored in memory. This is used to generate the top level module that actually executes the pipeline.
tapInputType
is an optional 'tap' value to send into the pipeline. Taps are runtime-configurable constants that can only be modified between frames. The Verilog we generate is designed to make sure the tap values are not common subexpression eliminated by the compiler. tapValue
gives an initial value for the tap input. Taps are passed into the pipeline in the high bits. If taps are included, the pipeline must include this as part of its type, and the tap input should be marked as a constant type.
if axi
is true, generate hardware for reading from DRAM on the Zynq platform. If axi
is false, generate a test harness for the Verilog simulator.
This module will read inputCount
items of size module.inputType
and write outputCount
items of size module.outputType
.
type: A[width,height] -> A[xmax-xmin+1,ymax-ymin+1][width,height]
examplescommon.stencil( A:Type, width:Uint, height:Uint, xmin:Int, xmax:Int, ymin:Int, ymax:Int )
fields: {..., kind="stencil", type=A, w=width, h=height, xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax }
stencil takes an array of size widthxheight and turns it into an array of stencils. Similar to linebuffer, for pixels outside the array the value is undefined (this occurs on borders of the array). Borders with defined values can be implemented using a pad and crop, or with the unpackStencil module, which does not read outside the array.
type: A[T]->A[T,-ymin+1]
modules.linebuffer( A:Type, width:Uint, height:Uint, T:Uint, ymin:Int <= 0 )
fields: {..., kind="linebuffer", type=A, w=width, h=height, T=T, ymin=ymin }
linebuffer takes a stream of pixels of type A running at throughput T and converts it into a stream of columns of past lines, with size set by ymin. Essentially, this module returns a 1D column stencil. This stream of columns can then be converted into a stream of 2D stencils using the SSR module.
As explained in Darkroom 2014, when processing the array in scanline order (low indices to high indices), the linebuffer can only provide pixels at indices ymin<=0, because these are the only values that have been seen.
type: A[T,-ymin+1]->A[T-xmin,-ymin+1]
modules.SSR( A:Type, T:Uint, xmin:Int <= 0, ymin:Int <= 0 )
fields: {..., kind="SSR", type=A, T=T, xmin=xmin, ymin=ymin }
SSR implements a stencil shift register. The stencil shift register converts a stream of columns of lines (1D stencil) into a stream of 2D stencils. The resulting stencil size is (-xmin+1)x(-ymin+1)
. If T>1, this module returns multiple stencils per firing coalesced into one large stencil, because neighboring stencils share values. This can then be broken into multiple stencils using unpackStencil.
type: A[T]->A[T-xmin,-ymin+1]
examplescommon.stencilLinebuffer( A:Type, width:Uint, height:Uint, T:Uint, xmin:Int <= 0, xmax:Int = 0, ymin:Int <= 0, ymax: Int = 0)
stencilLinebuffer is a fusion of a linebuffer and a SSR for convenience. It takes in a stream of pixels (running a T
pixels per cycle), and turns it into a stream of stencils of size (-xmin+1)x(-ymin+1)
.
Currently xmax
and ymax
must always be 0, and xmin
and ymin
must be <=0, which means that the stencil always reads valid values from the past (as explained in Darkroom 2014). TODO: for extra programmer convenience, implement the scheduling algorithm from Darkroom 2014 to allow the user to write pipelines that read values with coordinates >0.
type: if fullOutput==false (the defualt) then A[1,-ymin+1]->A[(-xmin+1)*T,-ymin+1], else A[1,-ymin+1]->A[-xmin+1,-ymin+1]
modules.SSRPartial( A:Type, T:Number <= 1, xmin:Int <= 0, ymin:Int <= 0, [stride:Uint], [fullOutput:Bool] )
fields: {..., kind="SSRPartial", type=A, T=T, xmin=xmin, ymin=ymin, stride=stride, fullOutput=fullOutput }
SSRPartial is a special-case fusion of SSR and changeRate, which reduces the hardware cost by 50% by merging the shift registers for both modules into a single shift register with more complicated control logic.
type: Handshake(A[T])->Handshake(A[1(-xmin+1)*T,-ymin+1])
examplescommon.stencilLinebufferPartial( A:Type, width:Uint, height:Uint, T:Number <= 1, xmin:Int <= 0, xmax:Int = 0, ymin:Int <= 0, ymax: Int = 0 )
Convenience module that combines a linebuffer and SSRPartial.
type: Handshake(A[T]) -> Handshake()
examplescommon.stencilLinebufferPartialOffsetOverlap( A: Type, width:Uint, height:Uint, T:Number<=1, xmin:Int<=0, xmax: Int=0, ymin:Int<=0, ymax:Int=0, offset : Number, overlap : number )
This is a special case variant of stencilLinebufferPartial that generates partial stencils with overlaps. This allows you to implement stencils of stencils efficiently in hardware.
type: A[stencilWidth+width-1, stencilHeight+height-1] -> A[stencilWidth, stencilHeight][width,height]
examplescommon.unpackStencil( A:Type, stencilWidth:Uint, stencilHeight:Uint, width:Uint, [height:Uint] )
unpackStencil takes an array of type A and converts it into an array of stencils. Each stencil is composed of values with coordinates in the range X=[-stencilWidth+1,0] and Y={-stencilHeight+1,0]. Unlike stencil, this module never reads outside of the array, so all stencils are made entirely of valid values. As a result the output array is smaller than the input array.
TODO: make the arguments this takes more compatible with stencil.
type: {Array2d(a,W,H),Array2d(b,W,H),...} -> Array2d({a,b,c},W,H)
examplescommon.SoAtoAos( width:uint, height:uint, typelist:Type[], [asArray:bool] )
SoAtoAos converts a tuple of arrays of the same size into an array of tuples (i.e. struct of arrays to arrays of structs transform).
type: null -> {uint16,uint16}[T]
modules.posSeq( width:Uint, height:Uint, T:Uint )
fields = {..., kind="posSeq", W=width, H=height, T=T }
posSeq is a state machine that provides a {x,y} image coordinate tuple at throughput T.
type: A[width,height] -> A[width,height]
examplescommon.border( A:Type, width:Uint, height:Uint, left:Uint, right:Uint, bottom:Uint, top:Uint, value:LuaValue )
border take in an array, and applies a border to the edges, keeping the array size the same. border is useful for implementing boundary conditions for stencil reads. Currently the only supported border condition is setting it to a constant value value
. left
number of pixels on the left of the image are replaced with the constant value, along with right
number of pixels on the right etc.
type: A[T]->A[T]
examplescommon.borderSeq( A:Type, width:Uint, height:Uint, T:Uint, left:Uint, right:Uint, bottom:Uint, top:Uint, value:LuaValue )
Sequentialized version of border
that operates on T
pixels at a time, instead of the whole array at a time.
type: A[width,height] -> A[width-left-right,height-top-bottom]
examplescommon.crop( A : Type, width : Uint, height : Uint, left : Uint, right : Uint, bottom : Uint, top : Uint )
cropSeq takes an image of size width x height and removes left pixels from the left, right pixels from the right etc. This yields a smaller image of size (width-left-right x height-top-bottom).
type: A[T]->V(A[T])
modules.cropSeq( A:Type, width:Uint, height:Uint, T:Uint, left:Uint, right:Uint, bottom:Uint, top:Uint )
fields: NYI
cropSeq takes an image of size width x height and removes left pixels from the left, right pixels from the right etc. This yields a smaller image of size (width-left-rightxheight-top-bottom). cropSeq is sequentialized to perform this operation on images of type A at throughput T.
type: A[width,height] -> A[width+left+right,height+top+bottom]
examplescommon.pad( A : Type, width : Uint, height : Uint, left : Uint, right : Uint, bottom : Uint, top : Uint, value : LuaValue )
pad is the opposite of crop. pad takes an image of size width x height and adds left pixels to the left, right pixels to the right, etc., with value value.
type: V(A[T])->RV(A[T])
modules.padSeq( A:Type, width:Uint, height:Uint, T:Uint, left:Uint, right:Uint, bottom:Uint, top:Uint, value:A )
fields: {..., kind="padSeq", type=A, T=T, width=width, height=height, L=left, R=right, B=bottom, Top=top, value=value }
padSeq takes an image of size width x height and adds left pixels to the left, right pixels to the right, etc., with value value. This yields a larger image of size (width+left+rightxheight+top+bottom). padSeq is sequentialized to perform this operation on images of type A at throughput T.
type: V(A[inputRate,H]) -> RV(A[outputRate,H])
modules.changeRate( A:Type, H:Uint, inputRate:Uint, outputRate:Uint )
fields: {..., kind="changeRate", type=A, H=H, inputRate=inputRate, outputRate=outputRate }
changeRate implements either the vectorize or devectorize operator. vectorize takes smaller vectors and concatenates them (inputRate<outputRate
). devectorize takes larger vectors and writes them out over multiple firings (inputRate>outputRate
).
Note that by default changeRate de/vectorizes 2D arrays column at a time. This can limit the amount of rate change if the array has many more rows than columns. However, the array can always be casted to a 1D array prior to changeRate, which will expose the maximum number of rate factors.
type: {A,Bool} -> V(A)
modules.filterSeq( A:Type, width:Number, height:Number, rate:Number, fifoSize:Number )
fields: { ..., kind="filterSeq", A=A }
filterSeq takes two inputs: a stream of data of type A, and a stream of bools which indicate whether the data should pass through (true) or be filtered out (false). filterSeq's SDF rate is set as 1/rate.
filterSeq may override the stream of bools on some cycles in order to keep the stream's behavior valid within the SDF model, assuming the filterSeq is followed by a fifo with fifoSize entries. In particular, filterSeq will override the filter if the fifo will under/overflow or if the total number of output tokens does not equal width*height/rate
by the end of frame execution.
TODO: refactor this module into a few sub modules (ie an unprotected filter, underflow module, and average rate module). Add a module to support stream compaction, so that filterSeq can support throughputs > 1.
type: A[T]->V(A[T/scale]). If scale>T, this is A[T]->V(A[1]).
modules.downsampleYSeq( A: Type, width:Uint, height:Uint, T:Uint, scale:Uint )
fields: NYI
downsampleXSeq performs a downsample in X (i.e. keeps only 1/scale columns, where X%scale==0). downsampleXSeq is sequentialized to work on scanline streams of vectors of type A with size T. width and height indicate the total size of the image to be downsampled.
type: A[T]->V(A[T])
modules.downsampleYSeq( A: Type, width:Uint, height:Uint, T:Uint, scale:Uint )
fields: NYI
downsampleYSeq performs a downsample in Y (i.e. keeps only 1/scale lines, where Y%scale==0). downsampleYSeq is sequentialized to work on scanline streams of vectors of type A with size T. width and height indicate the total size of the image to be downsampled.
type: V(A[T])->RV(A[T])
examplescommon.downsampleSeq( A: Type, width:Uint, height:Uint, T:Uint, scaleX:Uint, scaleY:Uint )
downsampleSeq is a convenience function to instantiate downsampleXSeq and/or downsampleYSeq modules to perform a combined scale in X and Y. scaleX and scaleY must be >= 1.
type: Handshake(A[T])->Handshake(A[T])
modules.upsampleXSeq( A:Type, T:Uint, scale:Uint )
fields: { ..., kind="upsampleXSeq", A=A, T=T, scale=scale }
upsampleXSeq performs an upsample in X (i.e. duplicates each column scale times, starting with column X=0). upsampleXSeq is sequentialized to work on scanline streams of vectors of type A with size T. width and height indicate the total size of the image to be upsampled.
type: V(A[T])->RV(A[T])
modules.upsampleYSeq( A:Type, width:Uint, height:Uint, scale:Uint )
fields: {..., kind="upsampleYSeq", A=A, T=T, width=width, height=height, scale=scale }
upsampleYSeq performs an upsample in Y (i.e. duplicates each line scale times, starting with line Y=0). upsampleYSeq is sequentialized to work on scanline streams of vectors of type A with size T. width and height indicate the total size of the image to be upsampled.
type: Handshake(A[T]) -> Handshake(A[T])
examplescommon.upsampleSeq( A:Type, width:Uint, height:Uint, T:Uint, scaleX:Uint, scaleY:Uint )
upsampleSeq is a convenience function to instantiate upsampleXSeq and/or upsampleYSeq modules to perform a combined upscale in X and Y.
given f:A->B has type A[w,h]->B[w,h]
modules.map( f : RigelModule, width:Number, height:Number )
fields: {..., kind="map", fn=f, W=width, H=height }
map lifts a scalar function to work on arrays.
given f:{A,A}->A has type A[W,H]->A
modules.reduce( f : RigelModule, W : Uint, H : Uint )
fields: {..., kind="reduce", fn=f, W=W, H=H }
reduce performs an efficient tree reduction. The user passes any module f
that provides a binary operator to perform as the reduction operation, and reduce
lifts it work on arrays (similar to fold in other languages).
given f:{A,A}->A has type A->V(A)
modules.reduceSeq( f : RigelModule, T : Uint>1 )
fields: {..., kind="reduce", fn=f, T=T }
reduceSeq performs a sequential reduction (e.g. sort of like an accumulator module). The user passes any module f
that provides a binary operator to, and this module lifts it to perform this binary operator sequentially of T items over T firings. Every T firings reduceSeq writes out the result and resets. f must have pipeline delay of 0.
given f:A->B, has type A->V(B)
modules.liftBasic( f:Module )
fields: {..., kind="liftBasic", fn=f}
liftBasic takes a basic module, and lifts it to have a valid bit (i.e. supports decimation).
given f:A->V(B), has type V(A)->RV(B)
modules.liftDecimate( f:Module )
fields: {..., kind="liftDecimate", fn=f}
liftDecimate lifts modules that support decimation only (i.e. A->V(B)) to the synchronous interface type that supports both data decimation and increase (V(A)->RV(B)).
given f:V(A)->RV(B), has type Handshake(A)->Handshake(B)
modules.liftHandshake( f:Module )
fields: {..., kind="liftHandshake", fn=f}
liftHandshake converts the most general synchronous interface type V(A)->RV(B) to a handshake interface. Modules with handshake interfaces support upstream stalls, whereas synchronous interfaces do not.
given f:A->B, has type Handshake(A)->Handshake(B)
modules.makeHandshake( f:Module )
fields: {..., kind="makeHandshake"}
Convenience module that directly converts a module with a basic interface to a Handshake interface.
given f:V(A)->RV(B) has type RV(A)->RV(B)
modules.RPassthrough( f:Module )
fields: {..., kind="RPassthrough", fn=f}
RPassthrough allows for composition of V(A)->RV(B) interfaces. Note that synchronous pipelines do not support upstream stalls! This should only be used in pipelines where that can not occur.
given f:A->RV(B) has type V(A)->RV(B)
modules.waitOnInput( f:Module )
waitOnInput is typically for internal compiler use only. Modules with A->RV(B) type have ambiguous behavior, because they do not define how the module will behave with invalid input. This higher-order module provides one possible semantic:
if f is ready, f with execute iff input valid is true (i.e. it waits on input). if f is not ready, inner will always run (input data is undefined).
This is useful for implementing modules that upsample data. If they are ready to read the input, the module will only ever see valid data. If they are not ready (i.e. are generating data themselves), input data is irrelevant so the module runs regardless.
type: {Handshake(a), Handshake(b),...} -> Handshake({a,b,...})
modules.packTuple( typelist:Type[] )
fields: {..., kind="packTuple"}
packTuple takes multiple Handshake streams and synchronizes them into a single stream (i.e. to implement fan-in).
type: Handshake(A) -> Handshake(A)[N]
modules.broadcastStream( A:Type, N:Uint )
fields: {..., kind="broadcastStream", A=A, N=N }
broadcastStream
takes a single Handshake stream, and splits it into multiple streams to implement fan-out. This is essentially the opposite of packTuple
. The reason this module is necessary is that when there are multiple downstream streams, the upstream ready bit must be an AND of all the downstream ready bits. It must stall if ANY of the downstream streams are stalled. This module implements the AND.
Note: Currently the compiler does not check that you have correctly inserted this module! Handshake streams can be wired multiple times, which will compile, but won't behave correctly. TODO: Add a compiler pass to check for this (or rearchitect it so that fan-out is not allowed without an explicit module).
store type: Handshake(A) -> nil
load type: nil -> Handshake(A)
modules.fifo( A : Type, size : Uint>0, [nostall : bool], [W : Uint], [H : Uint], [T : Uint] )
The fifo module implements a first-in first-out queue. FIFOs are used to hide latency variation in modules, match pipeline delays, or support back edges (cycles). FIFOs are different than other modules because they have two separate 'ports' on it, a load and a store, which are decoupled. This allows us to support cycles in our language that only has DAGs.
Because FIFOs have multiple ports, they must be instantiated differently than other modules:
fifoinst = rigel.instantiateRegistered( "myfifo", modules.fifo( ... ) )
fifoloadval = rigel.applyMethod( "myload", fifoinst, "load" ) -- apply 'load' method on fifo
-- fifoinst must also be passed to rigel.lambda in the instances list
size
is the size of the FIFO (# of entries). Native FIFO hardware on the Zynq platform supports either 128, 512, 1024, or 2048 entries, so it is probably a good idea to choose one of those sizes. 128 entries will be implemented in slices, >128 is implemented with BRAMs.
nostall
disables the upstream stall signal as a performance optimization. If your FPGA clock period is limited by the length of the stall lines (which is pretty common), this will remove the stall lines, which can significantly improve the clock. This is only valid if you know that the FIFO is sufficiently sized to never need to stall. If nostall=true
and the FIFO fills with data, the board will deadlock!
W
, H
, T
are used to calculate the number of input items the FIFO expects to see total. Then the Verilog simulator uses this to print out usage statistics at the end of the frame (specifically, max fifo size seen). This debug information is optional. TODO: could probably simplify this.
type: HandshakeArrayOneHot(A,N)->HandshakeTmuxed(A,N)
modules.serialize( A:Type, inputRates:SDFRate[N], schedule:Module )
fields: {..., kind="serialize", A=A, inputRates=inputRates, schedule=schedule }
serialize is a higher-order module that takes multiple Handshake streams and serializes them into a single stream based on a user-specified ordering module.
Ordering modules have type null -> uint8. Each firining they return the ID of the stream to run. Ordering modules are stateful (or they wouldn't be useful).
modules.interleveSchedule( N:Uint, period:Uint)
fields: {..., kind="interleveSchedule", N=N, period=period }
interleveSchedule simply interleves N stream with a fixed repeating pattern. This schedule returns 2^period items of stream 0, then stream 1, etc. Like this:
period=1: ABABABAB
period=2: AABBAABB
period=3: AAAABBBB
modules.pyramidSchedule( depth:Uint, wtop:Uint, T:Uint )
fields: {..., kind="pyramidSchedule", depth=depth, wtop=wtop, T=T }
pyramidSchedule takes depth streams with pyramid rates (i.e. 1, 1/4, 1/16, 1/64) and serializes them into a "human readable" image pyramid format. wtop is the width of the largest (finest resolution) pyramid level. T is the number of pixels being processed in parallel (i.e. the module will expect wtop/T tokens per line for the first pyramid level).
This module is a compromise between human readability and FIFO size. Likely, there is a schedule that further reduces FIFO size but results in an image that is less understandable. Likewise, it would be nice to make this more human readable, but this would likely use too much FIFO size.
with depth=3, wtop=4, T=1 the pattern is:
AAAAAAAAAAAABBBBC
i.e. 4 lines of A, 2 lines of B, 1 line of C
TODO: in the future, we would like to extend serialize to support ordering modules that make a dynamic decision on which stream to run. These modules would have type {bool,bool,...}->uint8 (bools indicating valid data on each stream) or {uint8,uint8,...}->uint8 (numbers indicating fifo sizes on each stream). This would allow for dynamically-scheduled time-multiplexed hardware modules.
type: Handshake(A)[N] -> HandshakeArrayOneHot(A,N)
modules.toHandshakeArrayOneHot( A:Type, inputRates:SDFRate[N] )
fields: {..., kind="toHandshakeArrayOneHot", A=A, inputRates=inputRates }
toHandshakeArrayOneHot
converts N streams onto a shared bus (N streams transported over 1 data line). Only one stream can be active each cycle, and this is chosen by the downstream ready id. inputRates
is a list of SDF rates for each input stream.
type: HandshakeTmuxed(A,N)->Handshake(A)[N]
modules.demux( A:Type, rates:SDFRate[N] )
fields: {..., kind="demux", A=A, rates=rates }
A HandshakeTmuxed
interface has multiple streams transported over one shared bus. demux
takes the N streams and demultiplexes them into N individual busses. rates
gives the expected SDF rates of the N busses. The rates must sum to one (the bus must be fully utilized but no more).
type: HandshakeTmuxed(A,N) -> Handshake(A)
modules.flattenStreams( A : Type, rates : SDFRate[N] )
fields: {..., kind="flattenStreams", A=A, rates=rates }
A HandshakeTmuxed
interface has multiple streams transported over one shared bus. flattenStreams
flattens these N streams into one single stream, mixing values on the different streams together. Ordering of values is not guaranteed - they simply occur in the order that things happen on the Tmuxed bus. However, the serialize module can be used earlier in the pipeline to give the streams an ordering. rates
gives the expected SDF rate of the stream, which must sum to 1.
type: V(A) -> RV(A)
modules.reduceThroughput( A:Type, factor:Number)
{..., kind="reduceThroughput", factor=factor}
reduceThroughput is a debugging module. It artificially reduces the SDF throughput of the output stream by 1/factor.
type: inputType -> outputType
modules.lut( inputType : Type, outputType : Type, values : LuaValue[] )
fields: {..., kind="lut", inputType=inputType, outputType=outputType, values = values }
lut
creates a lookup table. values
must be an array of size 2^(inputType:bits()), and each of its values must be convertible to outputType
. When lowered to hardware, these lookup tables typically end up being instantiated as BRAMs, so trying to fit the input/output data type to match the size of the BRAMs is typically a good idea.
type: nil -> A[W*T,H]
modules.constSeq( value : LuaValue, A : Type, W : Uint, H : Uint, T : Number )
T*W must be an integer <=W and >0.
fields: {..., kind="constSeq", A=A, w=W, h=H, T=T, value=value}
constSeq
is a combination of a constant and a shift register. constSeq
returns sub-arrays of a 2d array constant (with value value
). The sub-arrays returned are column subsets of width W*T
and height H
from low indicies to high. This matches the behavior of stencilLinebufferPartial - so it can be used to feed stencil computations with filter coefficients.
type: A -> V(A)
modules.overflow( A : Type, count : Uint )
fields: {..., kind="overflow", count = count }
overflow
is used to set up the DRAM test harness. If the AXI bus recieves more data items than it was expecting, it crashes the board. overflow
caps the number of data items at count
to make sure this doesn't happen, even if the user's code has a bug.
type: Handshake(A) -> Handshake(A)
modules.underflow( A : Type, count : Uint, cycles : Uint, upstream : Bool, [tooSoonCycles : Bool] )
fields: {..., kind="underflow", A=A, count = count }
underflow
is used to set up the DRAM test harness. If the AXI bus receives fewer data items than it was expecting, it crashes the board. Underflow can occur for two reasons:
Underflow on the output: The pipeline didn't write enough values (not enough valid bits were true). For this case, set upstream=false
, and it will monitor the writes. It will expect to see at least count
items in cycles
cycles. If this doesn't occur, it will write DEADBEEFs to the bus until count
items have occured. tooSoonCycles
will optionally also raise an assert in the Verilog simulator if the pipeline exits too soon (which is probably also an error of some sort).
Underflow on the input: The pipeline didn't read enough values (not enough ready bits were true). For this case, set upstream=true
, and it will monitor the reads.
type: Handshake(A) -> Handshake(A)
modules.cycleCounter( A : Type, count : Uint )
fields: {..., kind="cycleCounter", A=A, count=count }
Debug module. cycleCounter
counts the number of cycles until count
items have been written on the bus, and then writes out this number of cycles in one full AXI burst (128 bytes). Cycle count is written out as 32 bit unsigned ints (repeated 32 times). This modules is used to measure the number of execution cycles used by the pipeline.
type: nil -> type
modules.freadSeq( filename : String, type : Type )
fields: {..., kind="freadSeq", filename=filename, type = type }
Debug module. Reads a file on disk in the Verilog and Terra simulator. Acts as a data source. This is a NOOP in hardware. The module will continue to read data until the file runs out of data.
type: type -> type
modules.fwriteSeq( filename : String, type : Type )
fields: {..., kind="fwriteSeq", filename=filename, type = type }
Debug module. Writes a file to disk in the Verilog and Terra simulator. Data is passed through this module unmodified, so this can be used to record data in the middle of the pipeline. This is a NOOP in hardware. The module will continue to write data whenever data is passed to it.
The Systolic
library is used to construct pipelined datapaths and lower them to Verilog. In this system, users define SystolicModule
's, which correspond to modules in Verilog. SystolicModules can be instantiated (returning a SystolicInstance
), and SystolicModules can contain instances of other modules. Instances indicate that a copy of the hardware necessary to implement that module should be created.
SystolicModules contain one or more SystolicDataflow
's, which represent operations to be performed by the module. Each SystolicDataflow can have one input and/or one output (which correspond to input/output ports on modules in Verilog). The output is typically driven by the input, but can also be driven by an internal module instance, such as a register. The user specifies the operations to be performed by the dataflow by creating a DAG of SystolicIR
nodes. SystolicIR
is a specification of the primitive operations that can be performed by the hardware.
The SystolicIR
and SystolicDataflows are restricted so that they can always be pipelined to an arbitrary depth by the compiler. The pipelineing is performed automatically prior to lowering to Verilog. SystolicDataflows can each have an associated valid bit (to implement pipeline bubbles) and clock enable CE
(to stall the pipeline). These two signals are wired automatically by the compiler.
The table representation for each SystolicIR node has at least the following values:
{
type:Type, -- output type of node
inputs:SystolicIR[] -- lua list of inputs
loc:String, -- source file location where node was created
}
These standard fields will not be repeated below.
Each SystolicIR node has some common functionality:
SystolicIR:setName( name : String )
Set the intermediate variable name to name
(when lowered to Verilog).
systolic.parameter( name : String, type : Type )
fields: {..., kind="parameter", name=name }
Return a formal parameter of type type
and name name
(used when lowering to Verilog).
systolic.constant( value : Lua, type : Type )
fields: {..., kind="constant", value=value }
Return a value with value value
and type type
. value
must be convertible to type.
systolic.null()
fields: {..., kind="null"}
Return a value of type null. TODO: shouldn't this just be a constant?
systolic.cast( expr : SystolicAST, type : Type )
fields: {..., kind="cast" }
Cast expr
to Type type
.
systolic.slice( expr : SystolicAST, idxLow : Uint, idxHigh : Uint, idyLow : Uint, idyHigh : Uint )
fields: {..., kind="slice", idxLow=idxLow, idxHigh=idxHigh, idyLow=idyLow, idyHigh=idyHigh }
if expr
is of type Array2D, return a new array of smaller size. The new array will include X coordinates in range [idxLow,idxHigh] and Y in range [idyLow, idyHigh]. Coordinates are inclusive and must be in rate of expr
array size.
systolic.index( expr : SystolicAST, idx : Uint, idy : Uint )
fields: none - helper function
if expr
is of type Array2D, select one element of the array (at coordinate [idx,idy]
) and return its value (as a scalar).
systolic.bitSlice( expr : SystolicAST, low : Uint, high : Uint )
fields: {..., kind="bitSlice", low=low, high=high }
Perform a bitwise slice on expr
. expr
can be any type. This performs the same operation as writing expr[high:low]
in Verilog. Returns bit type.
systolic.tuple( list : SystolicAst[] )
fields: {..., kind="tuple" }
Takes a list list
of SystolicAST values and returns them packed together as a tuple type.
Note: this is used in combination with cast
to perform many type conversions. e.g. concatenating multiple values into an Array2D is accomplished by turning them into a tuple, and then casting to Array2D.
systolic.select( cond : SystolicAST, a : SystolicAST, b : SystolicAST )
fields: {..., kind="select" }
Perform the ternary select operation like in C, cond?a:b
.
+ (lua operator)
lhs+rhs{...,kind="binop", op="+"}
- (lua operator)
lhs-rhs* (lua operator)
lhs*rhssystolic.eq( lhs : SystolicAST, rhs : SystolicAST )
lhs == rhssystolic.le( lhs : SystolicAST, rhs : SystolicAST )
lhs <= rhssystolic.lt( lhs : SystolicAST, rhs : SystolicAST )
lhs < rhssystolic.ge( lhs : SystolicAST, rhs : SystolicAST )
lhs >= rhssystolic.gt( lhs : SystolicAST, rhs : SystolicAST )
lhs > rhssystolic.__or( lhs : SystolicAST, rhs : SystolicAST )
lhs or rhs (boolean)systolic.__and( lhs : SystolicAST, rhs : SystolicAST )
lhs and rhs (boolean)systolic.xor( lhs : SystolicAST, rhs : SystolicAST )
lhs xor rhs (boolean)systolic.rshift( lhs : SystolicAST, rhs : SystolicAST )
lhs >> rhs (arithmatic shift)systolic.lshift( lhs : SystolicAST, rhs : SystolicAST )
lhs << rhs (arithmatic shift)
SystolicIR( delay : Uint )
(Lua Call Operator). Apply Delay ofdelay
cycles.systolic.__not( input : SystolicAST )
boolean or bitwise notsystolic.neg( input : SystolicAST )
-inputsystolic.abs( input : SystolicAST )
|input|systolic.isX( input : SystolicAST )
is input Verilog value X?
TODO: most of these 'operators' should really just be turned into modules.
SystolicDataflows are similar to ports in Verilog - they are an externally-visible interface of functionality that can be wired to.
systolic.lambda( name : String, inputParameter : SystolicIR, [output : SystolicAST], [outputName : String], [pipelines : SystolicAST[]], [valid : SystolicAST], [CE : SystolicAST] ) : SystolicDataflow
Define a Systolic Dataflow (i.e. a SystolicIR pipeline that drives an external port)
name
: Name of the dataflow. Used when wiring inputs the dataflow.inputParameter
: formal parameter for the dataflow. Must be a SystolicIR returned bysystolic.parameter()
orsystolic.null()
output
: output of the dataflow (optional)outputName
: name of the output port (when lowered to Verilog)pipelines
: List of other pipelines that execute when this dataflow executes. These pipelines do not return a value (typically they store values internally to registers, etc)valid
: SystolicIR Parameter (of type bool) to drive the valid bit. Optional - if not given, a valid bit is automatically created. This is only necessary if you need to explicitly access the valid bit for some reason.CE
: SystolicAST Parameter (of type bool) to drive the clock enable. Optional - if not given, function has no clock enable.
SystolicInstance
s represent a concrete instantiation of a SystolicModule
. You can then wire SystolicIR values as inputs to the dataflows on the instance.
SystolicInstance:anyDataflowName( input : SystolicIR ) : SystolicIR
Return a SystolicIR node representing the value of applying any dataflow (anyDataflowName
) of an instantiated module on the SystolicIR value input
.
All SystolicModules share some common functionality:
SystolicModule:toVerilog() : String
Return the definition of this Systolic Module as a string of Verilog. Lowering to Verilog is relatively straightforward. The systolic module gets turned into a Verilog module. All dataflow inputs and outputs appear as input and output ports on the Verilog module. Port names are set by the dataflow input/output parameter names (which are explicitly passed by the user when the dataflow is constructed). All types get packed into bitfields with the same layout as the type in Terra.
SystolicModule:instantiate( name : String, [coherent : bool], [arbitrate : bool] ) : SystolicInstance
Generate a SystolicInstance
instance of this module. name
is a string id for this instance.
Systolic contains a number of built-in primitive modules:
systolic.module.new( name : String, dataflows : SystolicDataflow[], instances : SystolicInstance[], [onlyWire : bool], [coherentDefault : bool], [verilogParameters : ], [verilog : String], [verilogDelays : Uint[String]] ) : SystolicModule
Define a new Systolic module.
name
: Name of the new module (used when lowering to Verilog)dataflows
: list of all dataflows to pack into this module.instances
: list of all module instances to be contained in this module.onlyWire
: If true, do not apply pipelining. Module is lowered directly to Verilog. Systolic then serves simply as a higher-level interface to generating Verilog. (optional)coherentDefault
: (optional)verilogParameters
: (optional)verilog
: provide a string of Verilog to stand in for this module (i.e. don't use Systolic to generate Verilog, just return this string). (optional)verilogDelays
: If a Verilog string is provided, this allows you to pass pipelining delays for each dataflow on the module. This is a map from dataflow names to delay. (optional)
systolic.module.reg( type : Type, hasCE : Bool, [initial : LuaValue], [hasValid : Bool] ) : SystolicModule
Create a register (D flip flop) of type type
. hasCE
and hasValid
indicate whether the flip flop should be generated with a CE and Valid bit. hasValid
is true by default. initial
is an initial value for the register.
The returned register module R
has two dataflows on it:
R:set( input : type ) : nil
R:get() : type
systolic.module.regBy( type : Type, setBy : SystolicModule, [hasCE : Bool], [init : LuaValue] )
RegBy is a generalization of an accumulator. RegBy is a register that atomically performs a user-defined operation whenever is it executed. The setBy
argument is the SystolicModule defining the operation to perform. In the case of an accumulator, setBy would perform an addition.
More precisely, setBy
must be a module with one :process(input:{type,type})
dataflow on it. The first tuple index is the previous register value, and the second tuple index is the new user input. process
must have a pipeline delay of 0 and return a value of type type
.
As in the case of the register module, hasCE
indicates whethere the module should have a CE, and init
is an optional initial value for the register.
The returned regby module R
has three dataflows on it:
R:set( input : type ) : nil
Reset the value toinput
R:setBy( input : type ) : nil
Set and Perform the setBy operationR:get() : type
systolic.module.ram128( [hasWrite:Bool], [hasRead : Bool], [init:LuaValue] ) : SystolicModule
ram128 creates a Xilinx 128 bit ram, which occupies one slice. hasWrite
and hasRead
indicate whether a read/write dataflow should be generated.
The returned ram128 module R
has two dataflows on it:
R:read( addr : Uint7) : Bits(1)
read bit at addressaddr
R:write( {addr : Uint7, Bits(1)} ) : Bits(1)
write bit at addressaddr
systolic.module.bram2KSDP( writeAndReturnOriginal : Bool, inputBits : Number, [outputBits: Number], CE : Bool, [init : LuaValue] ) : SystolicModule
bram2KSDP instantiates a Xilinx BRAM module. Xilinx BRAM macros support a number of different configurations (sizes and bandwidths), but they all see to be built out of this module, so I believe this is the fundamental hardware primitive.
The returned bram2KSDP module R
has two dataflows on it:
R:read( addr : Uint7) : Bits(1)
read bit at addressaddr
R:write( {addr : Uint7, Bits(1)} ) : Bits(1)
write bit at addressaddr
systolic.module.file( filename : String, type :Type, CE : bool )
The file module allows you to read or write a file on disk in the Verilog simulator (only). type
is the type of data to read each cycle.
The returned file module F
has three dataflows on it:
F:read() : type
read one valueF:write( input : type ) : nil
write one valueF:reset() : nil
return to the start of the file
systolic.module.print( type : Type, string : String, [CE : Bool], [showIfInvalid : bool] )
The print module allows you to print a string to the console in the Verilog simulator. The module can also take an arbitrary type as input each cycle, which can be printed using the normal Verilog string formatting operators (i.e. %d
etc). If type
is a tuple, the tuple will be unpacked, so that you can print multiple values (i.e. %d %d
). If showIfInvalid
is true, this module prints the string always (even in cycles when the dataflow is not running).
The returned file module P
has one dataflow on it:
P:process( input : type )
print then string withinput
systolic.module.assert( error : String, CE : bool, [exit : Bool] )
The assert module allows you to fire an assert in the Verilog simulator. String error
is printed on assert failure. If exit
is true, the simulation will finish (default true).
The returned assert module A
has one dataflow on it:
A:process( flag : Bool )
check the assert condition onflag
Provides helper functions for dealing with SDFRates.
The SDFRate
refered to in this document has no formal constructor. It simply refers to a lua table in the following format:
{{n,d}} -- a single SDF stream with rate n/d. n and d must be unsigned integers.
{{1,2}} -- e.g. SDF rate 1/2
{{n1,d1},{n2,d2}} -- two SDF streams with rate n1/d1, n2/d2 (for modules with multiple stream inputs...)
{{1,3},{2,3}} -- e.g. SDF rate 1/3,2/3
SDF Rates are stored in this rational format because technically doubles aren't sufficient to store all rationals exactly, which has caused problems for us in the past.
Convenience classes for incrementally constructing Systolic modules and dataflows.
A number of useful Systolic module constructors.
- arbitrary-width ram built of slices
- arbitrary-width ram built of BRAMs
- FIFO
- Shift register constructor
Experimental. A fixed-point math library built on top of Systolic (which only supports plain ints). Internally keeps track of the correct fixed point location, which saves you from having to implement this with shifts and track the point yourself. Also has the ability to automatically lower the math to a Terra implementation, with the same (arbitrary) precision as the hardware.
Experimental. A floating point implementation of fixed.t
. Used to generate ground-truth implementations. Simply load fixed_float.t
instead of fixed.t
and your implementation should still work, but now has 'infinite' precision (from the floats).
A library of helper Rigel modules that are used for multiple examples.
Code for generating the test harness for Terra (harness.terraOnly
), Verilog simulator + Terra (harness.sim
) and AXI+Terra+Sim (harness.axi
).
TODO: These functions are a giant mess of arguments and spaghetti - refactor this to be cleaner.
Given a Rigel module module
with input type I and output type O, we lower it to a systolic module smodule
using the following rules.
definitions:
HS = Handshake, HandshakeArray, HandshakeTuple, HandshakeArrayOneHot, HandshakeTmuxed, HandshakeTrigger
A = one of the basic types (uint, bool, array)
smodule:process() always exists, except for modules with multiple functions (FIFOs), in which case, this method is the same as the name of the function (e.g. 'load' or 'store'), but follows the same lowering rules.
we map from I->O to SI->SO as follows:
I->O: SIdataSlot,SIvalidSlot,SICESlot->SOdataSlot,SOvalidSlot,SOCESlot
A->B: A,bool,bool -> B,null,null
A->V(B): A,bool,bool -> {B,bool},null,null
A->Vtrigger: A,bool,bool -> bool,null,null
V(A)->RV(B): {A,bool},bool,bool -> {B,bool},null,null
Note that for the above cases, there are sometimes two valid bits! The valids get ANDed together at runtime - one is controlled by the system, the other is controlled by the user.
The following can appear as either inputs or outputs, and these are the lowering rules respectively:
Handshake(A) : A,bool,null
HandshakeArray(A) : A[N],bool[N],null
HandshakeTuple(A) : A,bool[N],null
HandshakeTrigger : null,bool,null
smodule:ready() exists if O==RV, or O==HS, or I==HS
for the HS cases, if the input or output type is T, SI/SO gets the value:
T: ST
null: null
Handshake: bool
HandshakeArray: bool[N]
HandshakeTuple: bool[N]
HandshakeArrayOneHot: uint8
HandshakeTmuxed: bool
HandshakeTrigger: bool
Note that it is perfectly valid for one of the ready input or output to be null! This is for sources/sinks.
For the O==RV case, or O==RVTrigger case, smodule:ready() must have type nil->bool
smodule:reset() exists iff module.stateful is true.
Given a Rigel module module
with input type I and output type O, we lower it to a terra class tmodule
with the following methods.
tmodule:process() always exists, except for modules with multiple functions (FIFOs), in which case, this method is the same as the name of the function (e.g. 'load' or 'store'), but follows the same lowering rules.
If the module is either a source or a sink, either TI or TO may not exist (the module will just take 1 argument).
Types are lowered as follows. Valid bits are packed into the function arguments.
A : A (transcription of rigel type to terra type)
V(A) : {A,bool}
RV(A) : {A,bool}
Handshake(A) : {A,bool}
HandshakeArray(A,N) : {A[N],bool[N]}
HandshakeTuple(A) : {A,bool[N]}
HandshakeArrayOneHot(A) :
tmodule:calculateReady() computes and stores the ready bit for the module based on the downstream ready. The ready bit (upstream output) of this module is stored in tmodule.ready
. Note that this function does not return a value (the result is stored into tmodule.ready
. This should be called before tmodule:process(), within each iteration of the inner simulation loop.
Separating out ready bit calculation from process
allows us to approximate the real behavior of the hardware. First, we do the ready bit calculation (in reverse), and store this off to the side temporarily (in tmodule.ready). Modules often also buffer the downstream ready in tmodule.readyDownstream
. Then, we run process
, and actually step the pipes, but only if tmodule.readyDownstream
was true. tmodule.readyDownstream
is not required by the system, but tmodule.ready
is.
The types of DS, tmodule.ready are as follows:
A->B : DNE, DNE
A->V(B) : DNE, DNE
V(A)->RV(B) : void, bool
The following can appear as either inputs or outputs, and these are the lowering rules respectively:
VTrigger : bool
Handshake(A) : bool
HandshakeArray(A) : bool[N]
HandshakeTuple(A) : bool[N]
HandshakeTrigger : bool
Provides similar behavior to the hardware reset, and puts the modules into their default state (at the start of time, or possibly between frames). Note that this should be called after tmodule:init().
For simplicity, tmodule:reset() always exists.
Allocates any memory on the heap that the module needs for simulation.
For simplicity, tmodule:init() always exists.
Frees any memory on the heap that the module allocated (but note, this does not free the module itself!).
For simplicity, tmodule:free() always exists.