# High level synthesis introduction

The cyrite `hls` subsystem provides a few library auxiliaries to enable generation of digital signal processing pipelines for various architectures.

Unlike other HLS tools, the `hls` subsystem does not provide a ready-made solution for translating a C-like routine into hardware elements. The workflow is more like:

* Take a Python processing routine as prototype
* Transform this routine into processing parts in another notation that translates to a pipeline
* Verify that the numerical errors are within a certain range -- this is relevant when moving from float to fixed point elements
* Specify a synthesis rule in the parenting context that:
    * Emits the code into pure hardware elements for direct synthesis, suitable for co-simulation
    * Maps micro-code to perform the calculations using a **fixed** set of arithmetic primitives
    * Outputs HDL core for verification purposes
* Reuse the same 'trusted' code for various target verification flows
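
The numerical verification step in the list above can be sketched in plain Python, independent of cyrite. The helper names `to_fixed`, `from_fixed` and `mul_fixed`, the choice of 10 fractional bits and the tolerance are assumptions made for illustration only; they are not part of the `hls` API.

```python
# Hypothetical helpers, not part of cyrite: quantize floats to fixed point
# and check that the error of a fixed point product stays within a bound.
FRAC_BITS = 10  # assumed number of fractional bits


def to_fixed(x, frac_bits=FRAC_BITS):
    """Round a float to the nearest fixed point integer representation."""
    return int(round(x * (1 << frac_bits)))


def from_fixed(v, frac_bits=FRAC_BITS):
    """Convert a fixed point integer back to a float."""
    return v / (1 << frac_bits)


def mul_fixed(a, b, frac_bits=FRAC_BITS):
    """Multiply two fixed point values and rescale to frac_bits again."""
    return (a * b) >> frac_bits


samples = [(0.5, 0.25), (1.75, -0.3), (-2.1, 3.3)]
max_err = 0.0
for x, y in samples:
    approx = from_fixed(mul_fixed(to_fixed(x), to_fixed(y)))
    max_err = max(max_err, abs(approx - x * y))

TOLERANCE = 0.01  # assumed acceptable error for these samples
assert max_err < TOLERANCE
```

In a real flow, the tolerance and the number of fractional bits would follow from the requirements of the application rather than from fixed constants as used here.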

By default, the `hls` subsystem works with fixed point arithmetic internally and uses float types only for interfacing. The core wire type is `flexbv`: an `intbv` derivation that can have some adaptive properties within a pipeline.

For the time being, however, this introduction uses `intbv`.

## Pipelines

When a series of values is to be processed at each clock cycle using several operations such as multiplications chained with additions (like matrix multiplications), pipelines come into play. The general philosophy of a `hls` style processing description is to separate the description of the calculation steps from the actual inference rules.
The caller thus decides, according to the `@pipeline` decorator, on how to lay out the resulting HDL.

Therefore, a `@pipeline` style RTL function reads like a sequential description of calculation steps that are returned using `yield` in the explicit MyIRL notation.

The following unit implements a simple addition as a pipeline of depth one.

In [1]:
from cyrite.library.hls.mypipe import pipelined_block, pipelined, pipe
from cyhdl import *

PS = pipelined(Signal)
PipeEn = PS.Type(bool)

@pipelined_block
def unit_add(clk : ClkSignal, ce : PipeEn, valid: PipeEn.Output,
        a : PS, b : PS, q : PS.Output):

    z = q.clone()

    @pipe(clk, None, ce, None, valid)
    def gen_add(ctx):
        yield [
            z.set(a + b)
        ]

    wires = [
        q.wireup(z)
    ]
    return locals()


Note:

* The `@pipe` decorator denotes that a pipeline is to be created from the `gen_add` generator routine. It requires a ClkSignal as parameter.
* Every `yield []` statement generates a pipeline stage. No transpilation is taking place in this context, but you can call `@rtl_function`s.
* For latency accounting, a specific `pipelined()` construct is used to create a specific `PipelinedTracker` type class.
* The output port (here `q`) cannot be assigned to within the pipeline. A separate `.wireup` statement is required.

We now perform an analysis on this unit as follows:

In [2]:
clk = ClkSignal()
dv0, dv1 = [ PipeEn() for _ in range(2) ]
a, b = [ PS(intbv()[6:]) for _ in range(2) ]
q = PS(intbv()[7:])

u = unit_add(clk, dv0, dv1, a, b, q)
u.analyze()

After analysis, we can verify that the result `q` is delayed with respect to `a` and `b` by a latency of `1`:

In [3]:
assert q.latency() == 1
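
To make the latency statement concrete, the following plain Python model mimics one pipeline stage: the sum computed from the inputs of cycle n is visible at the output in cycle n + 1. This is not cyrite code, and the class name `RegisterStage` is made up for this sketch.

```python
class RegisterStage:
    """Hypothetical behavioural model of a single pipeline register."""

    def __init__(self):
        self.q = None  # register content, undefined before the first clock

    def clock(self, a, b):
        """Model one clock cycle: return the current output, then store a + b."""
        out = self.q
        self.q = a + b
        return out


stage = RegisterStage()
inputs = [(1, 2), (3, 4), (5, 6)]
outputs = [stage.clock(a, b) for a, b in inputs]
# outputs == [None, 3, 7]: each sum shows up one cycle after its operands
```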

### Pipelined signals and parameters

In more complex, parallel computing units, it has to be ensured that latencies match. Assume we are adding two values that are computed by separately running pipelines: in this case, the latencies must be the same.

In case of an accumulation (e.g. `z.set(z + r)`) however, a latency check would not make sense.

The default `pipelined()` signal class factory performs simple latency checks within pipeline generation. We exercise this later with a deeper pipeline.
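
The idea behind such a latency check can be illustrated with a small stand-alone sketch. The `Tracked` class below is hypothetical and far simpler than the tracker type created by `pipelined()`; it only shows why operands with different latencies are rejected.

```python
class LatencyMismatch(Exception):
    """Raised when two operands of different latency are combined."""


class Tracked:
    """Hypothetical value wrapper that carries a latency count."""

    def __init__(self, value, latency=0):
        self.value = value
        self.latency = latency

    def __add__(self, other):
        if self.latency != other.latency:
            raise LatencyMismatch(
                "latency %d != %d" % (self.latency, other.latency))
        return Tracked(self.value + other.value, self.latency)

    def registered(self):
        """Pass through one pipeline register: latency grows by one."""
        return Tracked(self.value, self.latency + 1)


a = Tracked(3)
b = Tracked(4).registered()      # b is one stage 'later' than a

try:
    a + b
    mismatch_detected = False
except LatencyMismatch:
    mismatch_detected = True

s = a.registered() + b           # aligned operands can be added
```

Here, `mismatch_detected` ends up `True`, and `s` carries the value 7 with a latency of 1.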

## Example #1: Complex multiplication

Let us calculate a simple complex multiplication. Since myirl has no support for the built-in `complex` data type, we need to spell out the real and imaginary parts explicitly:

In [4]:
ra, ia = [ PS(intbv()[12:]) for _ in range(2) ]
rb, ib = [ PS(intbv()[12:]) for _ in range(2) ]

A multiplication of such a complex value pair would be performed as follows:

In [5]:
rq_op = ra * rb - ia * ib
iq_op = ra * ib + rb * ia
rq_op

SUB(MUL(s_4021:0, s_e6f4:0), MUL(s_fb97:0, s_bad6:0))

In [6]:
@pipelined_block
def unit_cmul(clk : ClkSignal, ce : PipeEn, valid: PipeEn.Output,
        ra  : PS, ia : PS, rb : PS, ib : PS,
        rq : PS.Output,
        iq : PS.Output):

    N = len(ra) * 2
    rz = [ PS(intbv()[N:]) for _ in range(2) ]
    iz = [ PS(intbv()[N:]) for _ in range(2) ]
    x, y = [ PS(intbv()[N+1:]) for _ in range(2) ]

    @pipe(clk, None, ce, None, valid)
    def gen_cmul(ctx):
        # Stage 0:
        yield [
            rz[0].set(ra * rb),
            rz[1].set(ia * ib),
            iz[0].set(ra * ib),
            iz[1].set(rb * ia)
        ]
        # Stage 1:
        yield [
            x.set(rz[0] - rz[1]),
            y.set(iz[0] + iz[1])
        ]
        
    wires = [
        rq.wireup(x), iq.wireup(y)
    ]
    return locals()


In [7]:
rq, iq = [ PS(intbv()[25:]) for _ in range(2) ]

u = unit_cmul(clk, dv0, dv1, ra, ia, rb, ib, rq, iq)

In [8]:
u.analyze()
assert rq.latency() == 2

We note:
* A pipeline inference would allocate four multiplier elements in hardware for stage '0' and two adders for stage '1'
* We might actually split this up into two configurable multiplier/adder units using a 'subtract' option.
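
The signal widths chosen in `unit_cmul` (`N = 24` bits for the products, `N + 1 = 25` bits for the results) follow from worst-case arithmetic, which can be checked in plain Python. The sketch assumes that `intbv()[12:]` denotes an unsigned 12 bit value, as it does in MyHDL, and it only covers the unsigned value ranges; a negative difference in the real part is not considered here.

```python
W = 12                                   # operand width, as in intbv()[12:]
max_operand = (1 << W) - 1               # largest unsigned 12 bit value
max_product = max_operand * max_operand  # worst case of a single multiplier
max_sum = 2 * max_product                # worst case of adding two products

product_bits = max_product.bit_length()
sum_bits = max_sum.bit_length()

assert product_bits == 2 * W       # 24 bits, matches N
assert sum_bits == 2 * W + 1       # 25 bits, matches N + 1
```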

First, we create hardware elements and run this unit in a basic HDL simulation.

### Simulation

For simulation, we feed a few values and delay a computed result through a `.delayed(clk, cycles)` signal delay composite type (which creates a signal and instantiates a `sigdelay` unit at the same time). The compare unit `cmp` compares those two results and throws an assertion exception if they mismatch.
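
Conceptually, the delayed signal behaves like a FIFO of fixed depth. The following plain Python sketch shows this behaviour for a delay of two cycles. The function `delay_line` is hypothetical and unrelated to the actual `sigdelay` implementation.

```python
from collections import deque


def delay_line(samples, cycles, fill=None):
    """Return the sample stream delayed by `cycles` clock ticks."""
    fifo = deque([fill] * cycles)   # pre-filled with 'undefined' entries
    out = []
    for s in samples:
        fifo.append(s)              # new sample enters the delay line
        out.append(fifo.popleft())  # oldest entry leaves it
    return out


delayed = delay_line([10, 20, 30, 40], cycles=2)
# delayed == [None, None, 10, 20]
```

Delaying the reference result by the latency of the unit under test lines both values up in the same clock cycle, so they can be compared directly.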

Now the question is, where do we want to do the actual calculation of the result:
* Pure native Python?
* Transpiled HDL language?

Because everything inside an HDL test bench's `@sequence` function is transpiled, native Python calculations have to be done explicitly outside the test bench, in a special generator function decorated with `@evalmacro`. This auxiliary separates native execution from what is actually emitted to HDL. However, unlike an `@hdlmacro`, it allows evaluation of passed iterators at the time when emission to HDL occurs.

In [9]:
from cyrite.simulation import sim, ghdl
from cyhdl import *
from myirl.library.custom_generators import evalmacro

@evalmacro
def calc_complex(values, isig):
    "Do a complex calculation in the Python native domain and return a generator assignment"
    ra, ia, rb, ib = values.evaluate()
    print("EVAL MACRO[%d]" % values.index, ra, ia, rb, ib)
    va = complex(ra, ia)
    vb = complex(rb, ib)
    z = va * vb
    yield [
        isig[0].set(int(z.real)),
        isig[1].set(int(z.imag))
    ]
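
For reference, the values this macro computes for the three test vectors used in the test bench further down can be reproduced in plain Python without any cyrite imports. The function name `reference` is chosen for this sketch only.

```python
vectors = [(1, 0, 0, 1), (2, 1, 2, 2), (4, 2, 1, 0)]


def reference(ra, ia, rb, ib):
    """Complex product via the built-in type, returned as an integer pair."""
    z = complex(ra, ia) * complex(rb, ib)
    return int(z.real), int(z.imag)


expected = [reference(*v) for v in vectors]
# expected == [(0, 1), (2, 6), (4, 2)]

# The same values follow from the explicit real/imaginary formulation:
explicit = [(ra * rb - ia * ib, ra * ib + rb * ia)
            for ra, ia, rb, ib in vectors]
assert explicit == expected
```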

### Test bench unrolling
Now we create the test bench. Because we might want to extend such a design later, we create a `cyrite_factory.Module` class with a testbench method.

Inside the sequential part of the testbench, we pass a series of test values in a sequential way that are run through the evaluating macro function. Meanwhile, the test values are passed through the generated HDL.

Because the `@evalmacro` runs in the Python domain and we don't create a HDL procedure from it, the test value series iteration actually unrolls in the resulting HDL.
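
The effect of unrolling can be pictured with a plain Python sketch that collects textual statements instead of emitting HDL. The function `unroll`, the names `expect_r` and `expect_i`, and the statement strings are purely illustrative and do not reflect the VHDL that cyrite actually emits.

```python
def unroll(vectors):
    """Build one flat list of pseudo statements, one group per test vector."""
    statements = []
    for ra, ia, rb, ib in vectors:
        statements.append(
            "ra <= %d; ia <= %d; rb <= %d; ib <= %d;" % (ra, ia, rb, ib))
        # The expected result is computed here, in Python, and ends up
        # as a literal constant in the generated statement.
        statements.append(
            "expect_r <= %d; expect_i <= %d;"
            % (ra * rb - ia * ib, ra * ib + rb * ia))
        statements.append("wait until falling_edge(clk);")
    return statements


code = unroll([(1, 0, 0, 1), (2, 1, 2, 2)])
```

No loop remains in the output: two test vectors yield six consecutive statements.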

In [10]:
class TBDesign(cyrite_factory.Module):

    @cyrite_factory.testbench('ns')
    def tb_unit(self, signals : dict, unit):
        for n, s in signals.items():
            s.rename(n)
        clk = signals['clk']
        uut = unit(**signals)
    
        # We analyze the `uut` now in order to
        # retrieve the latency of the output signal
        uut.analyze(targets.VHDL)
        LATENCY = signals['rq'].latency()
    
        # Create an internal signal with same properties
        # as the result signal:
        int_r = signals['rq'].clone(), signals['iq'].clone()
    
        # Create a FIFO delay line:
        del_r0 = [ i.delayed(clk, LATENCY) for i in int_r ]
        
        @self.always(delay(2))
        def clkgen():
            clk.next = ~clk
    
        @self.always(clk.posedge)
        def cmp():
            if signals['valid']:
                print("DEBUG SIG", signals['rq'], signals['iq'])
                assert del_r0[0] == signals['rq']
                assert del_r0[1] == signals['iq']
    
    
        @self.sequence
        def main():
            signals['ce'].next = False
            yield delay(10)
            yield clk.negedge
    
            signals['ce'].next = True
    
            for values in [ (1, 0, 0, 1), (2, 1, 2, 2), (4, 2, 1, 0) ]:
                signals['ra'].next = values[0]
                signals['ia'].next = values[1]
                signals['rb'].next = values[2]
                signals['ib'].next = values[3]
                calc_complex(values, int_r) # Must use explicit, nonportable call to macro
                
                yield clk.negedge
                
            signals['ce'].next = False
    
            yield delay(20)
            raise StopSimulation
        return locals()
            

We create a set of signals to pass to the test bench as one possible signal configuration. Here, we still define the sizes explicitly and use `intbv`:

In [11]:
N, M = 12, 25

signals = {
    'clk' : ClkSignal(),
    'ce' : PipeEn(),
    'valid' : PipeEn(),
    'ra'    : PS(intbv()[N:]),
    'ia'    : PS(intbv()[N:]),
    'rb'    : PS(intbv()[N:]),
    'ib'    : PS(intbv()[N:]),
    'rq'    : PS(intbv()[M:]),
    'iq'    : PS(intbv()[M:])
}

Then instantiate the test bench and run:

In [12]:
m = TBDesign("tb", ghdl.GHDL)

tb = m.tb_unit(signals, unit_cmul)
# print(tb_unit.obj.unparse())

 Declare obj 'tb_unit' in context '(TBDesign 'tb')'(<class '__main__.TBDesign'>) 
 Module tb: Existing instance unit_cmul, rename to unit_cmul_1 
 Declare obj 'sigdelay' in context '(TBDesign 'tb')'(<class '__main__.TBDesign'>) 
 DEBUG Inline instance [CompInline 'sigdelay/sigdelay'] 
 DEBUG Inline instance [CompInline 'sigdelay/sigdelay'] 


In [13]:
tb.run(200, debug = True, wavetrace= True)

EVAL MACRO[0] 1 0 0 1
 Writing 'sigdelay' to file /tmp/sigdelay.vhdl 
 Writing 'unit_cmul_1' to file /tmp/unit_cmul_1.vhdl 
 Writing 'tb_unit' to file /tmp/tb_unit.vhdl 
EVAL MACRO[0] 1 0 0 1
EVAL MACRO[1] 2 1 2 2
EVAL MACRO[2] 4 2 1 0
 Creating library file /tmp/module_defs.vhdl 
DEBUG_FILES ['/tmp/sigdelay.vhdl', '/tmp/unit_cmul_1.vhdl', '/tmp/tb_unit.vhdl', '/tmp/module_defs.vhdl', '/home/cyrite/.local/lib/python3.10/site-packages/cyritehdl-0.1b0-py3.10-linux-x86_64.egg/myirl/targets/vhdl/libmyirl.vhdl', '/home/cyrite/.local/lib/python3.10/site-packages/cyritehdl-0.1b0-py3.10-linux-x86_64.egg/myirl/targets/vhdl/txt_util.vhdl']
==== COSIM stdout ====
analyze /home/cyrite/.local/lib/python3.10/site-packages/cyritehdl-0.1b0-py3.10-linux-x86_64.egg/myirl/targets/vhdl/txt_util.vhdl
analyze /home/cyrite/.local/lib/python3.10/site-packages/cyritehdl-0.1b0-py3.10-linux-x86_64.egg/myirl/targets/vhdl/libmyirl.vhdl
analyze /tmp/unit_cmul_1.vhdl
analyze /tmp/sigdelay.vhdl
analyze /tmp/tb_unit.v

0

### Wave trace

Using the code below, we dump a few selected signals from the wave trace and display them.

In [14]:
from cyrite.waveutils import draw_wavetrace
selection = {
    'tb_unit.clk' : None,
    'tb_unit.ce' : None,
    'tb_unit.ra[11:0]' : None,
    'tb_unit.rb[11:0]' : None,
    'tb_unit.ia[11:0]' : None,
    'tb_unit.ib[11:0]' : None,
    'tb_unit.valid' : None,

    'tb_unit.rq[24:0]' : None,
    'tb_unit.iq[24:0]' : None
}

In [15]:
draw_wavetrace(tb, 'tb_unit.vcd', sample_clk = 'clk', cfg = selection)

Here we can clearly see that the results are delayed by two clock cycles with respect to the input.

### Conclusion

Using the `calc_complex` generator function, we calculate the results in native Python and emit the literal result values to the HDL test bench, which asserts that they match the unit's outputs by means of delayed signals and a compare process. This way, we probe a set of values using an HDL simulator.

Another option would be to use Co-Simulation. Here, we would not need to do any out of band tricks, as everything on the test bench level would be running natively in Python.

A few more notable points:
* The macro is called once at creation time to collect drivers and sources from the logic it will generate
* Unlike a `@hdlmacro`, it is repeatedly called during emission, i.e. here another N times during the `for` iteration. The logic is then generated ad-hoc after evaluation of the iterator values.