# High level synthesis #2: Vectors

For somewhat complex handling of data structures, for example transformations such as the Discrete Cosine Transform (DCT) used for JPEG encoding, it is desirable to work with vectors. This is an introduction to vector processing using the `hls` subsystem.

The complex multiplication example can also be handled using a vector extension.
There are a few reasons to use arithmetic extensions for vector pipelines as well, depending on the architecture that the code is generated for:

* classic inference of DSP elements where a synthesis back end decides on the mapping
* more controlled 'inline synthesis' where a fixed number of primitives is allocated/cascaded and micro-code is emitted

The microcode approach is not elaborated here. It offers greater flexibility, but requires more complexity on the back end for optimum pipelining. However, we will look into primitive instancing with somewhat transparent inline components.

In [1]:
from cyhdl import *

## Vector signals

Instead of using two separate signals for a complex vector, a `Vector` data type can be pipelined as well. Here, we always use a 2D vector type.

This has the advantage that vector operations can be declared that act as binary operations within a pipeline stage.

We import a few auxiliaries from the pipelined vector module:

In [2]:
from cyrite.library.hls.pipelined_vector import VectorSignal as Vector, CustomVectorOp

In [3]:
v0, v1, v2 = [ Vector(2, intbv()[12:]) for _ in range(3) ]

A standard vector addition and assignment to a signal is performed as follows:

In [4]:
v0.set(v1 + v2)

<myirl.vector._VectorAssign at 0x7f8b5cdd60c0>

Other operations are left undefined for this basic vector class and may not translate or synthesize.

### Vector primitives

Let's assume we have a built-in primitive in our configurable logic that can perform a few extra vector operations, depending on a `mode` parameter. The functional model would look like the one below (note that we are not using any strict interface declaration):

In [5]:
class vector_primitives:
    t_alu = enum('ADD', 'SUB', 'ADDSUB', 'ASSIGN', name = "t_alu") 

    @rtl_function
    def vector_op(rtl, v, a, b, mode, t_alu):
        if mode == t_alu.ADD:
            v.next = (a[0] + b[0], a[1] + b[1])
        elif mode == t_alu.ADDSUB:
            v.next = (a[0] + b[0], a[1] - b[1])
        elif mode == t_alu.SUB:
            v.next = (a[0] - b[0], a[1] - b[1])
        else:
            v.next = (b[0], b[1])

As an RTL function, it can be executed either as a native Python function or as a logic generator, depending on the `rtl` context.
For the time being, we assume that the primitive exists as a hardware implementation, meaning that a built-in blackbox primitive is referenced in some way.
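To make the four modes concrete, the functional model above can be mirrored in plain Python without any HDL library. This is only a sketch for illustration: `AluMode` and `vector_op_model` are hypothetical names that are not part of `cyhdl`, and bit widths and wrap-around are ignored.

```python
from enum import Enum


class AluMode(Enum):
    # Hypothetical stand-in for the `t_alu` enum used in the notebook
    ADD = 0
    SUB = 1
    ADDSUB = 2
    ASSIGN = 3


def vector_op_model(a, b, mode):
    """Return the 2D result that `vector_op` would assign to `v`."""
    if mode == AluMode.ADD:
        return (a[0] + b[0], a[1] + b[1])
    if mode == AluMode.ADDSUB:
        return (a[0] + b[0], a[1] - b[1])
    if mode == AluMode.SUB:
        return (a[0] - b[0], a[1] - b[1])
    return (b[0], b[1])  # ASSIGN: pass `b` through unchanged


a, b = (10, 7), (3, 2)
results = {mode.name: vector_op_model(a, b, mode) for mode in AluMode}
print(results)
```

A plain model like this can serve as a reference when checking the values produced by the generated logic.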

However, we cannot a priori use such a blackbox in a functional way within a `@pipe` construct. In a classic V* HDL, we would have to create instances manually and wire them up for the correct pipe stages. A more readable way is to define custom operators as follows.

### Defining custom operators

We can define new infix operators on a class, such as one that adds the first and subtracts the second element. In the code, the operator name is framed by `@` characters (this actually abuses Python's matrix multiplication operator).
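The `a @op@ b` notation relies on Python evaluating `(a @ op) @ b` from left to right. The following is a minimal sketch of that idiom in plain Python. `InfixOp` and `_Bound` are hypothetical helper names and do not show how `CustomVectorOp` is actually implemented.

```python
class _Bound:
    """Intermediate object holding the left operand."""

    def __init__(self, func, left):
        self.func = func
        self.left = left

    def __matmul__(self, right):
        # Second step: `bound @ right` applies the wrapped function
        return self.func(self.left, right)


class InfixOp:
    """Hypothetical helper: wraps a two-argument function as `a @op@ b`."""

    def __init__(self, func):
        self.func = func

    def __rmatmul__(self, left):
        # First step: `left @ op` is dispatched here when `left`
        # does not implement `__matmul__` itself
        return _Bound(self.func, left)


# Add the first elements, subtract the second elements
addsub = InfixOp(lambda a, b: (a[0] + b[0], a[1] - b[1]))

result = (1, 2) @addsub@ (3, 4)
print(result)
```

Note that this only works if the left operand does not handle `@` on its own, which is why the sketch uses tuples.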

We pull in the Vector extension from the ALU DSP library:

In [6]:
from cyrite.library.hls.lib_dspalu import Vector as VectorExt

Finally, we make use of a `VectorExt.addsub` custom vector operation below, using a one stage pipeline and a bit of raw data slicing into vector elements:

In [7]:
from cyrite.library.hls.mypipe import pipelined, pipe

PipeEn = pipelined(Signal).Type(bool)

Data = Signal.Type(intbv, 24)

@block
def unit_comp(clk : ClkSignal, en : PipeEn,
                   d0 : Data, d1 : Data,  q: Data.Output, valid : PipeEn.Output):

    a, b = [ Vector(2, intbv()[12:]) for _ in range(2) ]
    iq = Vector(2, intbv()[13:])

    @pipe(clk, None, en, None, valid)
    def worker(ctx):
        yield [
            iq <= a @VectorExt.addsub@ b
        ]

    connections = [
        a[0] @assign@ d0[12:], a[1] @assign@ d0[:12],
        b[0] @assign@ d1[12:], b[1] @assign@ d1[:12],
        q @assign@ concat(iq[0][12:], iq[1][12:])
    ]
    
    return instances()



We instance this unit and elaborate for VHDL output:

In [8]:
clk = ClkSignal()
a, b = [ Data() for _ in range(2) ]
q = Data()
en, valid = [ PipeEn() for _ in range(2) ]

u = unit_comp(clk, en, a, b, q, valid)
f = u.elab(targets.VHDL)

 DEBUG Inline builtin instance [pipe_block_inline 'vector_op/vector_op']
 Declare obj 'vector_op' in context '(EmulationModule 'unit_comp')'(<class 'myirl.emulation.myhdl2irl.EmulationModule'>)
 Writing 'unit_comp' to file /tmp/myirl_unit_comp_2hgc2v74/unit_comp.vhdl 




### Resulting HDL

By outputting the resulting VHDL code below, we can see that a `vector_op` unit is referenced from the `work` library. However, as a blackbox reference, it is not created by the above elaboration. Uncomment the command below to see the full VHDL source.

In [9]:
# !cat {f[0]}

We will also note that the `t_alu` parameter is not used. The reason is that a static constant value is passed to the inline function in this implementation. Different constants will also cause several implementations of the `vector_op` unit to be emitted. This is because the `vector_op` inline component was declared as a whitebox component in `lib_dspalu`.

If a built-in blackbox component was instanced, the `t_alu` parameter might be inferred into an internal constant signal. The `@inference` rule of the inline component effectively decides how to handle flexible parameters. It may also decide to allocate extra logic.

## Complex multiplication (vector)

We can now finally tackle complex multiplication from the HLS introduction using the above vector type. It is, however, no longer as simple as a basic vector op, because we do not want to multiply and add in the same clock cycle. The complex multiplication will thus operate on a higher level. Eventually, we'll want to define a `CVect.mul` operator, but in this case we need to forward at minimum a clock plus validity signals telling when the output is ready.

This user-defined operation is typically implemented as an inline whitebox component. Its `_level` property defines in which hierarchy context it is allowed. For the `InlineContainer` class below, we need to adapt the default level:

In [10]:
from myirl.library.blackbox import _inline_whitebox_component

def inline_whitebox_component(func):
    def _inline(self, *args, **kwargs):
        c = _inline_whitebox_component(self, func)
        c._level = 2 # Adapt level
        c.blackbox = True
        return c(self, *args, **kwargs)

    return _inline


Here we use an Inline container class that takes `clk` and `enable` parameters, and provides a `mul` member as infix operator:

In [11]:
from myirl.kernel.components import InlineContainer

class CVect(InlineContainer):
    def __init__(self, clk, en, valid):
        self.clk = clk
        self.dv = en, valid
        self.mul = CustomVectorOp(lambda x, y : cvmul(self.clk,
                                                      self.dv[0],
                                                      self.dv[1],
                                                      x[0], x[1], y[0], y[1]))

Obviously, latency and delays come into play. Let's model a unit that does such a multiplication.
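To illustrate what latency means for the validity signal, here is a small clocked model in plain Python. It is a sketch under the assumption that the valid flag is delayed by one register per pipeline stage. `ValidPipe` is a hypothetical name, and the exact timing of a real `@pipe` construct may differ.

```python
class ValidPipe:
    """Hypothetical shift-register model of a pipeline's valid flag."""

    def __init__(self, stages):
        self.regs = [False] * stages

    def clock(self, en):
        # One rising clock edge: shift `en` in and return the
        # flag held by the last stage after the edge
        self.regs = [en] + self.regs[:-1]
        return self.regs[-1]


pipe_model = ValidPipe(stages=2)
enables = [True, False, False, False]
valids = [pipe_model.clock(en) for en in enables]
print(valids)
```

With two stages, a single enable pulse shows up at the output after the second clock edge.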

We use a strict interface this time, using the same mapping as the `unit_cmul` component from the [HLS introduction](hls.ipynb).

In [12]:
PS = pipelined(Signal)

CData = PS.Type(intbv, 12)
CRData = PS.Type(intbv, 25)

@block
def unit_cvmul(clk : ClkSignal, ce : PipeEn, valid: PipeEn.Output,
        ra  : CData, ia : CData, rb : CData, ib : CData,
        rq : CRData.Output,
        iq : CRData.Output):

    a, b = [ Vector(2, intbv()[len(ra):]) for _ in range(2) ]
    q = Vector(2, intbv()[len(rq):])

    dv = PipeEn()
    
    c = CVect(clk, ce, dv) # Instance factory (uses the `ce` port of this unit)
    
    # Here, we use the `.wireup` statement for the Vector type
    logic = [
        q.wireup(a @c.mul@ b),
        a.wireup((ra, ia)),  # Vector 'wireup'
        b.wireup((rb, ib)),
        rq   @assign@  q[0],
        iq   @assign@  q[1],
        valid  @assign@  dv
    ]

    return instances()

Next, we implement a `cvmul` unit. We use the basic myirl Signal type in order to stay compatible with other extensions that derive from this signal class:

In [13]:
import myirl
_Signal = myirl.Signal

@block
def cvmul_impl(clk : ClkSignal, dvin : PipeEn, dvout : PipeEn.Output,
              x0 : _Signal, x1 : _Signal, y0 : _Signal, y1 : _Signal,
              q0 : _Signal.Output, q1 : _Signal.Output ):

    N = len(x0)
    za, zb = [ Vector(2, intbv()[2*N:]) for _ in range(2) ]

    q = Vector(2, intbv()[len(q0):])
    
    @pipe(clk, None, dvin, None, dvout)
    def worker(ctx):
        yield [
            za[1].set(x0 * y0),
            za[0].set(x0 * y1),
            zb[0].set(x1 * y0),
            zb[1].set(x1 * y1)
        ]
        yield [
            q.set(za @VectorExt.addsub@ zb)
        ]

    wires = [
        q0.wireup(q[1]), q1.wireup(q[0])
    ]

    return instances()
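The arrangement of the four products in `za` and `zb` can be checked against the usual identity $(x_0 + j x_1)(y_0 + j y_1) = (x_0 y_0 - x_1 y_1) + j(x_0 y_1 + x_1 y_0)$. The sketch below replays both pipeline stages with plain Python integers. `cvmul_model` is a hypothetical name, and the model ignores bit widths and the unsigned wrap-around of the generated hardware.

```python
def cvmul_model(x, y):
    """Two-step reference model of the `worker` pipeline body."""
    x0, x1 = x  # real and imaginary part of the first operand
    y0, y1 = y  # real and imaginary part of the second operand

    # Stage 1: four products, arranged like `za` and `zb`
    za = (x0 * y1, x0 * y0)
    zb = (x1 * y0, x1 * y1)

    # Stage 2: addsub, i.e. (za[0] + zb[0], za[1] - zb[1])
    q = (za[0] + zb[0], za[1] - zb[1])

    # Output wiring: q0 takes q[1] (real), q1 takes q[0] (imaginary)
    return q[1], q[0]


re, im = cvmul_model((3, 4), (5, 6))
ref = complex(3, 4) * complex(5, 6)
print((re, im), ref)
```

The element order may look swapped at first: the addition lands in element 0 and the subtraction in element 1, so the output wiring crosses them back into real and imaginary parts.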


Then, we create a corresponding `@inline_whitebox` that allows us to call a multiplication like a function, silently instancing a pipelined unit.

In [14]:
from myirl.library.blackbox import inline_whitebox, PortSpec

@inline_whitebox(cvmul_impl)
def cvmul(clk : ClkSignal, dvin : PipeEn, dvout : PipeEn.Output,
              x0 : _Signal, x1 : _Signal, y0 : _Signal, y1 : _Signal):
    @myirl.inference(myirl.base.IRL)
    def generate(instance, interface, rule):
        "Generate signals and logic instances in the caller (module)."
        N = len(x0) * 2 + 1
        t0, t1 = [ Signal(intbv()[N:]) for _ in range(2) ]

        # Explicitly add Port:
        interface.addPort('q0', PortSpec(PortSpec.OUT, t0), t0)
        interface.addPort('q1', PortSpec(PortSpec.OUT, t1), t1)
        return (t0, t1)

    return generate
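The output width `len(x0) * 2 + 1` follows from the bit growth of the arithmetic: a product of two unsigned N bit values needs 2N bits, and adding two such products needs one bit more. A quick check in plain Python for N = 12 (the helper `max_unsigned` is a hypothetical name used only for this illustration):

```python
def max_unsigned(bits):
    """Largest value representable in an unsigned word of `bits` bits."""
    return (1 << bits) - 1


N = 12
max_product = max_unsigned(N) * max_unsigned(N)
max_sum = max_product + max_product

# Number of bits needed to hold the worst-case values
product_bits = max_product.bit_length()
sum_bits = max_sum.bit_length()
print(product_bits, sum_bits)
```

This matches the 24 bit intermediate vectors and the 25 bit output ports of the generated `cvmul` entity. The subtraction path of an unsigned implementation can still wrap around, which this check does not cover.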

Finally, we run a test instance to see if all resolves:

In [15]:
sigs = unit_cvmul.signals_from_interface()
uut = unit_cvmul(**sigs)
files = uut.elab(targets.VHDL, elab_all = True, outpath = '/tmp')

 DEBUG Inline builtin instance [block_inline 'cvmul/cvmul']
 Declare obj 'cvmul' in context '(EmulationModule 'unit_cvmul')'(<class 'myirl.emulation.myhdl2irl.EmulationModule'>)
 DEBUG Inline builtin instance [pipe_block_inline 'vector_op/vector_op']
 Declare obj 'vector_op' in context '(EmulationModule 'unit_cvmul')'(<class 'myirl.emulation.myhdl2irl.EmulationModule'>)
 Collected inline component vector_opu_24u_24u_24u_24u_25u_25E_ADDSUB2 
 Register type enum_ALUMODE_type in context 'module_defs'
 Writing 'vector_op' to file /tmp/vector_op.vhdl 
 Writing 'cvmul' to file /tmp/cvmul.vhdl 
 Writing 'unit_cvmul' to file /tmp/unit_cvmul.vhdl 
 Creating library file /tmp/module_defs.vhdl 


In [16]:
!cat {files[1]}

-- File generated from source:
--     /tmp/ipykernel_4650/2801045659.py
-- (c) 2016-2022 section5.ch
-- Modifications may be lost, edit the source file instead.

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

library work;

use work.module_defs.all;
use work.txt_util.all;
use work.myirl_conversion.all;

entity cvmul is
    port (
        clk : in std_ulogic;
        dvin : in std_ulogic;
        dvout : out std_ulogic;
        x0 : in unsigned(11 downto 0);
        x1 : in unsigned(11 downto 0);
        y0 : in unsigned(11 downto 0);
        y1 : in unsigned(11 downto 0);
        q0 : out unsigned(24 downto 0);
        q1 : out unsigned(24 downto 0)
    );
end entity cvmul;

architecture myIRL of cvmul is
    -- Local type declarations
    -- Signal declarations
    signal worker_ce1 : std_ulogic;
    signal worker_ce2 : std_ulogic;
    signal worker_ce0 : std_ulogic;
    type a_v_8a50 is array (0 to 1) of unsigned(23 downto 0);
    signal v_8a50 : a_v_8a50    ;


## Verification

In the [HLS Introduction](hls.ipynb) we have already modelled a complex multiplication and a test bench. We rerun this notebook:

In [17]:
%run hls.ipynb

 Declare obj 'tb_unit' in context '(TBDesign 'tb')'(<class '__main__.TBDesign'>)
 Module tb: Existing instance unit_cmul, rename to unit_cmul_1
 Declare obj 'sigdelay' in context '(TBDesign 'tb')'(<class '__main__.TBDesign'>)
 DEBUG Inline instance [CompInline 'sigdelay/sigdelay']
 DEBUG Inline instance [CompInline 'sigdelay/sigdelay']
EVAL MACRO[0] 1 0 0 1
 Writing 'sigdelay' to file /tmp/sigdelay.vhdl 
 Writing 'unit_cmul_1' to file /tmp/unit_cmul_1.vhdl 
 Writing 'tb_unit' to file /tmp/tb_unit.vhdl 
EVAL MACRO[0] 1 0 0 1
EVAL MACRO[1] 2 1 2 2
EVAL MACRO[2] 4 2 1 0
 Creating library file /tmp/module_defs.vhdl 
DEBUG_FILES ['/tmp/sigdelay.vhdl', '/tmp/unit_cmul_1.vhdl', '/tmp/tb_unit.vhdl', '/tmp/module_defs.vhdl', '/home/cyrite/.local/lib/python3.10/site-packages/cyritehdl-0.1b0-py3.10-linux-x86_64.egg/myirl/targets/vhdl/libmyirl.vhdl', '/home/cyrite/.local/lib/python3.10/site-packages/cyritehdl-0.1b0-py3.10-linux-x86_64.egg/myirl

This time we use our vector unit:

In [18]:
m = TBDesign("tb", ghdl.GHDL)
tb = m.tb_unit(signals, unit_cvmul)
tb.run(200, debug = True)

 Declare obj 'tb_unit' in context '(TBDesign 'tb')'(<class '__main__.TBDesign'>)
 Module tb: Existing instance unit_cvmul, rename to unit_cvmul_1
 DEBUG Inline builtin instance [block_inline 'cvmul/cvmul']
 Declare obj 'cvmul' in context '(TBDesign 'tb')'(<class '__main__.TBDesign'>)
 Declare obj 'sigdelay' in context '(TBDesign 'tb')'(<class '__main__.TBDesign'>)
 DEBUG Inline instance [CompInline 'sigdelay/sigdelay']
 DEBUG Inline instance [CompInline 'sigdelay/sigdelay']




 DEBUG Inline builtin instance [pipe_block_inline 'vector_op/vector_op']
 Declare obj 'vector_op' in context '(TBDesign 'tb')'(<class '__main__.TBDesign'>)
EVAL MACRO[0] 1 0 0 1
 Collected inline component vector_opu_24u_24u_24u_24u_25u_25E_ADDSUB2 
 Register type enum_ALUMODE_type in context 'module_defs'
 Writing 'vector_op' to file /tmp/vector_op.vhdl 
 Writing 'sigdelay' to file /tmp/sigdelay.vhdl 
 Writing 'cvmul' to file /tmp/cvmul.vhdl 
 Writing 'unit_cvmul_1' to file /tmp/unit_cvmul_1.vhdl 
 Writing 'tb_unit' to file /tmp/tb_unit.vhdl 
EVAL MACRO[0] 1 0 0 1
EVAL MACRO[1] 2 1 2 2
EVAL MACRO[2] 4 2 1 0
 Creating library file /tmp/module_defs.vhdl 
DEBUG_FILES ['/tmp/vector_op.vhdl', '/tmp/sigdelay.vhdl', '/tmp/cvmul.vhdl', '/tmp/unit_cvmul_1.vhdl', '/tmp/tb_unit.vhdl', '/tmp/module_defs.vhdl', '/home/cyrite/.local/lib/python3.10/site-packages/cyritehdl-0.1b0-py3.10-linux-x86_64.egg/myirl/targets/vhdl/libmyirl.vhdl', '/home/cyrite/.local/lib/pytho

0

## Conclusion

We have run a vectorized variant of our complex multiplication through the simple test bench written for our previous bare metal implementation and have (very cheaply) verified that it does the same thing. However, this is not yet a *proof* that this is the case, as we are merely checking output results.

To actually create a chain of proofs, we need to start with a trusted component, then work our way forward by induction, based on the assumption that the previous operation was correct. For pipelines like the above, this is automated in several ways:

* Latency checks of involved signals
* Bit size or overflow checks during inference

This is where a new data type family is introduced: the `flexbv` fixed point arithmetic.

To be documented...