Introduction to Logic Synthesis for AI
--------------------

Logic synthesis is a crucial process in digital circuit design where a high-level description of a system's behavior is transformed into a gate-level representation. In the context of AI, particularly deep learning, logic synthesis enables the efficient implementation of neural networks on hardware such as FPGAs and ASICs.

This notebook uses the NNgen library to demonstrate how deep neural network models can be synthesized into hardware. NNgen converts a neural network description into a hardware description in a language such as Verilog HDL. This step is essential for building custom hardware accelerators, which can improve the throughput and energy efficiency of AI models.

Understanding how to synthesize AI models into hardware is vital for advancing AI applications, particularly in areas requiring high performance and low power consumption, such as embedded systems, edge computing, and real-time data processing.

This notebook is based on the [NNgen tutorial](https://github.com/NNgen/nngen).



## Installation

In [4]:
pip install torch torchvision veriloggen numpy onnx nngen

Note: you may need to restart the kernel to use updated packages.


## Defining the Neural Network Architecture with NNgen

In this section, we define a simple deep neural network (DNN) model using the NNgen library.

The network consists of:

1. **Input Layer**: Defined as a placeholder for input data with dimensions 32x32x3 (e.g., an RGB image) and a batch size of 1.

2. **First Convolutional Layer**:
   - Weights (`w0`), biases (`b0`), and scales (`s0`) are initialized.
   - A convolution operation (`conv2d`) is applied to the input, followed by ReLU activation and max pooling.

3. **Second Convolutional Layer**:
   - Similar to the first layer, with new weights (`w1`), biases (`b1`), and scales (`s1`).
   - The output is reshaped to prepare for fully connected layers.

4. **First Fully Connected Layer**:
   - Weights (`w2`), biases (`b2`), and scales (`s2`) are defined.
   - A matrix multiplication (`matmul`) is performed with ReLU activation.

5. **Second Fully Connected Layer**:
   - New weights (`w3`), biases (`b3`), and scales (`s3`) are set.
   - The final matrix multiplication operation produces the output layer.

This structure showcases the typical layers found in a convolutional neural network (CNN) and highlights how NNgen can be used to define each layer's operations and data types, preparing the model for efficient hardware synthesis.
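
To make the layer dimensions concrete, the following sketch traces the tensor shape through the network in plain Python, without NNgen. It assumes `'SAME'` padding for the convolutions (which is what the synthesis log later in this notebook reports for the first convolution) and a pooling window that divides the feature map evenly. The helper names `conv2d_same_shape`, `max_pool_shape`, and `trace_shapes` are hypothetical and not part of NNgen.

```python
import math


def conv2d_same_shape(shape, out_channels, stride=1):
    """Output shape (N, H, W, C) of a convolution, assuming 'SAME' padding."""
    n, h, w, _ = shape
    return (n, math.ceil(h / stride), math.ceil(w / stride), out_channels)


def max_pool_shape(shape, stride=2):
    """Output shape of max pooling, assuming H and W are divisible by the stride."""
    n, h, w, c = shape
    return (n, h // stride, w // stride, c)


def trace_shapes(batchsize=1):
    """Follow the shapes of the tensors defined in the cell below."""
    shapes = {}
    shapes['input_layer'] = (batchsize, 32, 32, 3)           # N, H, W, C
    shapes['a0'] = conv2d_same_shape(shapes['input_layer'], 64)
    shapes['a0p'] = max_pool_shape(shapes['a0'])
    shapes['a1'] = conv2d_same_shape(shapes['a0p'], 64)
    n, h, w, c = shapes['a1']
    shapes['a1r'] = (n, h * w * c)                           # reshape to [batchsize, -1]
    shapes['a2'] = (n, 256)                                  # first fully connected layer
    shapes['output_layer'] = (n, 10)                         # second fully connected layer
    return shapes


for name, shape in trace_shapes().items():
    print(name, shape)
```

Under these assumptions, the flattened tensor `a1r` has 16 x 16 x 64 = 16384 features, which is the input width of the first fully connected layer.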


In [5]:
from __future__ import absolute_import
from __future__ import print_function

import nngen as ng


# data types
act_dtype = ng.int8
weight_dtype = ng.int8
bias_dtype = ng.int32
scale_dtype = ng.int8
batchsize = 1

# input
input_layer = ng.placeholder(dtype=act_dtype,
                             shape=(batchsize, 32, 32, 3),  # N, H, W, C
                             name='input_layer')

# layer 0: conv2d (with bias and scale (= batchnorm)), relu, max_pool
w0 = ng.variable(dtype=weight_dtype,
                 shape=(64, 3, 3, 3),  # Och, Ky, Kx, Ich
                 name='w0')
b0 = ng.variable(dtype=bias_dtype,
                 shape=(w0.shape[0],), name='b0')
s0 = ng.variable(dtype=scale_dtype,
                 shape=(w0.shape[0],), name='s0')

a0 = ng.conv2d(input_layer, w0,
               strides=(1, 1, 1, 1),
               bias=b0,
               scale=s0,
               act_func=ng.relu,
               dtype=act_dtype,
               sum_dtype=ng.int32)

a0p = ng.max_pool_serial(a0,
                         ksize=(1, 2, 2, 1),
                         strides=(1, 2, 2, 1))

# layer 1: conv2d, relu, reshape
w1 = ng.variable(weight_dtype,
                 shape=(64, 3, 3, a0.shape[-1]),
                 name='w1')
b1 = ng.variable(bias_dtype,
                 shape=(w1.shape[0],),
                 name='b1')
s1 = ng.variable(scale_dtype,
                 shape=(w1.shape[0],),
                 name='s1')

a1 = ng.conv2d(a0p, w1,
               strides=(1, 1, 1, 1),
               bias=b1,
               scale=s1,
               act_func=ng.relu,
               dtype=act_dtype,
               sum_dtype=ng.int32)

a1r = ng.reshape(a1, [batchsize, -1])

# layer 2: full-connection, relu
w2 = ng.variable(weight_dtype,
                 shape=(256, a1r.shape[-1]),
                 name='w2')
b2 = ng.variable(bias_dtype,
                 shape=(w2.shape[0],),
                 name='b2')
s2 = ng.variable(scale_dtype,
                 shape=(w2.shape[0],),
                 name='s2')

a2 = ng.matmul(a1r, w2,
               bias=b2,
               scale=s2,
               transposed_b=True,
               act_func=ng.relu,
               dtype=act_dtype,
               sum_dtype=ng.int32)

# layer 3: full-connection (no activation function)
w3 = ng.variable(weight_dtype,
                 shape=(10, a2.shape[-1]),
                 name='w3')
b3 = ng.variable(bias_dtype,
                 shape=(w3.shape[0],),
                 name='b3')
s3 = ng.variable(scale_dtype,
                 shape=(w3.shape[0],),
                 name='s3')

# output
output_layer = ng.matmul(a2, w3,
                         bias=b3,
                         scale=s3,
                         transposed_b=True,
                         name='output_layer',
                         dtype=act_dtype,
                         sum_dtype=ng.int32)

## (Alternative) Import an Existing Model on a DNN Framework via ONNX

Instead of constructing the model explicitly, you can import an existing model through the ONNX importer. This lets you reuse pre-trained models from, e.g., [`torchvision`](https://pytorch.org/vision/stable/index.html).

First export the model to an ONNX file, then import that file as an NNgen model definition using the `ng.from_onnx` method.

Here's a brief example:

```python
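import torch
import torchvision
import nngen as ng

# Download a pre-trained model from Torchvision
model = torchvision.models.resnet18(weights='IMAGENET1K_V1')

# PyTorch to ONNX
onnx_filename = 'resnet18_imagenet.onnx'
# N, H, W, C shape; 224x224 RGB is an assumed value (the usual ImageNet input size)
act_shape = (1, 224, 224, 3)
# swap axes 1 and 3 to get the channels-first tensor that PyTorch expects
dummy_input = torch.randn(*act_shape).transpose(1, 3)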
input_names = ['act']
output_names = ['out']
model.eval()
torch.onnx.export(model, dummy_input, onnx_filename,
                  input_names=input_names, output_names=output_names)

# ONNX to NNgen
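dtypes = {}             # no per-value dtype overrides; the defaults below apply
disable_fusion = False  # assumed value: keep operator fusion enabled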
(outputs, placeholders, variables,
 constants, operators) = ng.from_onnx(onnx_filename,
                                      value_dtypes=dtypes,
                                      default_placeholder_dtype=act_dtype,
                                      default_variable_dtype=weight_dtype,
                                      default_constant_dtype=weight_dtype,
                                      default_operator_dtype=act_dtype,
                                      default_scale_dtype=scale_dtype,
                                      default_bias_dtype=bias_dtype,
                                      disable_fusion=disable_fusion)
```

## Assigning Random Values and Quantizing Weights

In this example, we assign random floating-point values to the network weights and biases for demonstration purposes. In a real-world scenario, you would use actual trained weight values from a deep neural network framework.

### Weight Initialization
We initialize the weights and biases with values drawn from a standard normal distribution and clipped to the range [-3.0, 3.0]. Scales are set to 1. This is done for all layers of the network.

### Quantization

To prepare the model for hardware synthesis, we use NNgen’s quantizer to convert these floating-point weights to integers. This process involves:

- Choosing the input scale factor from the bit width of the activation data type (64 for `int8`).
- Describing the input distribution with the ImageNet mean and standard deviation, each multiplied by the same scale factor.
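
As an illustration of what converting floating-point weights to integers means, the sketch below applies a generic symmetric quantization scheme with NumPy. It is not NNgen's quantizer, whose internals may differ. The function name `quantize_symmetric` and the choice of one scale factor per tensor are assumptions made for this example.

```python
import numpy as np


def quantize_symmetric(values, width=8):
    """Map float values onto signed integers of the given bit width.

    A single scale factor is chosen so that the largest magnitude
    maps to the largest representable integer (127 for 8 bits).
    """
    max_int = 2 ** (width - 1) - 1
    max_abs = float(np.max(np.abs(values)))
    scale_factor = max_int / max_abs if max_abs > 0 else 1.0
    quantized = np.round(values * scale_factor).astype(np.int64)
    quantized = np.clip(quantized, -max_int, max_int)
    return quantized, scale_factor


rng = np.random.default_rng(0)
w = np.clip(rng.normal(size=(64, 3, 3, 3)), -3.0, 3.0)  # same shape as w0

w_q, w_scale = quantize_symmetric(w)
w_restored = w_q / w_scale  # approximate float values recovered from the integers

print('scale factor:', w_scale)
print('max abs error:', np.max(np.abs(w - w_restored)))
```

With this scheme the rounding error of each value is at most half an integer step, that is, `0.5 / w_scale` in the original floating-point units.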


In [6]:
import numpy as np

def initialize_weight(shape):
    value = np.random.normal(size=np.prod(shape)).reshape(shape)
    value = np.clip(value, -3.0, 3.0)
    return value

def initialize_bias(shape):
    value = np.random.normal(size=np.prod(shape)).reshape(shape)
    value = np.clip(value, -3.0, 3.0)
    return value

def initialize_scale(shape):
    return np.ones(shape)

def set_values(variables, init_function):
    for var in variables:
        var.set_value(init_function(var.shape))

# Initialize weights
weights = [w0, w1, w2, w3]
set_values(weights, initialize_weight)

# Initialize biases
biases = [b0, b1, b2, b3]
set_values(biases, initialize_bias)

# Initialize scales
scales = [s0, s1, s2, s3]
set_values(scales, initialize_scale)

# Quantizing the floating-point weights using the NNgen quantizer
imagenet_mean = np.array([0.485, 0.456, 0.406]).astype(np.float32)
imagenet_std = np.array([0.229, 0.224, 0.225]).astype(np.float32)

act_scale_factor = 128 if act_dtype.width > 8 else int(round(2 ** (act_dtype.width - 1) * 0.5))

input_scale_factors = {'input_layer': act_scale_factor}
input_means = {'input_layer': imagenet_mean * act_scale_factor}
input_stds = {'input_layer': imagenet_std * act_scale_factor}

ng.quantize([output_layer], input_scale_factors, input_means, input_stds)


## Assigning Hardware Attributes

In deep learning, models are typically executed on hardware with parallel processing capabilities to speed up computations. This is particularly important when dealing with large datasets and complex models, where performance can be a bottleneck.

The code cell below assigns hardware attributes to the layers of the model. NNgen attributes can configure the degree of parallelism in several directions (input channels, output channels, pixel columns, and rows) and the right-shift amount for integer-precision execution. This notebook sets only the channel parallelism of the convolution and fully connected layers and the parallelism of the pooling layer.

### Key Attributes:

**Parallelism**:
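- `par_ich`: Parallelism in input channels. A larger value processes more input channels simultaneously.
- `par_och`: Parallelism in output channels. A larger value processes more output channels simultaneously.
- `par`: Parallelism of the max-pooling operator (`a0p`), set here to the same value as `par_och`.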

### Why It's Important:

**Performance**: By configuring parallelism, the model can leverage hardware capabilities to process multiple operations simultaneously, leading to faster execution.
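
The effect of channel parallelism can be sketched with an idealized cycle count. The model below assumes that one group of `par_ich` x `par_och` multiply-accumulate operations completes per cycle and ignores pipelining, memory transfers, and any other concurrency in the generated hardware, so it only illustrates relative scaling and does not predict NNgen's actual cycle counts. The function name `estimate_conv_cycles` is hypothetical.

```python
import math


def estimate_conv_cycles(out_h, out_w, k_h, k_w, ich, och, par_ich=1, par_och=1):
    """Idealized cycle count: one cycle per group of par_ich x par_och MACs."""
    ich_steps = math.ceil(ich / par_ich)
    och_steps = math.ceil(och / par_och)
    return out_h * out_w * k_h * k_w * ich_steps * och_steps


# Second convolution layer: 16x16 output, 3x3 kernel, 64 input and 64 output channels
serial = estimate_conv_cycles(16, 16, 3, 3, 64, 64)
parallel = estimate_conv_cycles(16, 16, 3, 3, 64, 64, par_ich=2, par_och=2)

print('relative speedup:', serial / parallel)
```

In this simplified model, setting both attributes to 2 reduces the count by a factor of 4. The price is more multipliers and wider memory ports, so larger values consume more FPGA resources.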



In [7]:
par_ich = 2
par_och = 2

a0.attribute(par_ich=par_ich, par_och=par_och)
a1.attribute(par_ich=par_ich, par_och=par_och)
a2.attribute(par_ich=par_ich, par_och=par_och)
output_layer.attribute(par_ich=par_ich, par_och=par_och)

par = par_och

a0p.attribute(par=par)

## Verify the DNN Model Behavior by Executing the NNgen Dataflow as Software

After assigning weight values, the constructed NNgen dataflow can be executed as software to verify the behavior of a quantized DNN model. The `ng.eval` method evaluates the NNgen dataflow according to input values provided via method arguments.

In this example, random values are generated with NumPy, scaled, and rounded to integers before being assigned as the input. In practice, actual input values, such as image data opened with PIL, should be used.

### Steps:
1. Generate Input Values:
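    - Random values are drawn from a normal distribution using NumPy, then scaled by the ImageNet standard deviation and shifted by the ImageNet mean.
    - The values are clipped, multiplied by the activation scale factor, and clipped again to the range of the activation data type.
    - The values are rounded and converted to an integer type.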

2. Evaluate the Model:
    - The `ng.eval` method is called with the input values to evaluate the NNgen dataflow.
    - The output of the model is captured and printed.
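
When a real image replaces the random input, a conversion along the following lines could be used. This is a sketch under assumptions: the image is taken to be an `(H, W, 3)` array of `uint8` pixels, the pixels are mapped to the range [0, 1] and multiplied by the activation scale factor in the same way as the random input in this notebook, and a random array stands in for the image so that the example runs without an image file. The function name `image_to_input` is hypothetical, and whether additional normalization is needed depends on how the model was trained.

```python
import numpy as np


def image_to_input(img_uint8, scale_factor, width=8):
    """Convert an (H, W, 3) uint8 image into a quantized (1, H, W, 3) input."""
    value = img_uint8.astype(np.float32) / 255.0   # pixel range [0, 1]
    value = value * scale_factor                   # scale like the random input
    lo = -(2 ** (width - 1))                       # -128 for 8 bits
    hi = 2 ** (width - 1) - 1                      # 127 for 8 bits
    value = np.clip(np.round(value), lo, hi).astype(np.int64)
    return value.reshape((1,) + value.shape)       # add the batch dimension


# Stand-in for a real image, e.g.
# np.array(PIL.Image.open(path).convert('RGB').resize((32, 32)))
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

x = image_to_input(fake_image, scale_factor=64)
print(x.shape, x.dtype, x.min(), x.max())
```

The resulting array has the shape of `input_layer` and could be passed to `ng.eval` in place of the random values.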


In [17]:
input_layer_value = np.random.normal(size=input_layer.length).reshape(input_layer.shape)
input_layer_value = input_layer_value * imagenet_std + imagenet_mean
input_layer_value = np.clip(input_layer_value, -3.0, 3.0)
input_layer_value = input_layer_value * act_scale_factor
# clip to the signed range of act_dtype (e.g. [-128, 127] for int8)
input_layer_value = np.clip(input_layer_value,
                            -1 * 2 ** (act_dtype.width - 1), 2 ** (act_dtype.width - 1) - 1)
input_layer_value = np.round(input_layer_value).astype(np.int64)

eval_outs = ng.eval([output_layer], input_layer=input_layer_value)
output_layer_value = eval_outs[0]

print(output_layer_value)

[[ 11  -2  25 -11  -1  12   7   0  -3  -4]]


## Convert the NNgen dataflow to a hardware description (Verilog HDL and IP-XACT)

After all the weights are assigned and the hardware attributes are configured, the NNgen dataflow is ready to be converted to an actual hardware description.

You can specify hardware parameters, such as the data width of the AXI interface and system-wide signal names, via the `config` argument. See `nngen/verilog.py` for the full list of configurable hardware parameters.

NNgen generates an all-inclusive dedicated hardware design for an input DNN model. The design includes parallel processing elements, on-chip memories, an on-chip network between the processing elements and the on-chip memories, a DMA controller between off-chip and on-chip memories, and FSM-based control circuits. Therefore, no external control, such as DMA driven by the CPU, is required once the generated hardware begins a computation.

NNgen supports three types of output: 1) a Veriloggen object, which is a Python-based high-level hardware abstraction, 2) IP-XACT, which is a common IP-core format, and 3) Verilog HDL RTL as a text file.
A generated Veriloggen object can be verified with the testing mechanism of Veriloggen and a Verilog simulator.
A generated IP-XACT IP core can be integrated with other components via the AMBA AXI4 interface on an FPGA.

In [18]:
silent = False
axi_datawidth = 32

# to Veriloggen object
# targ = ng.to_veriloggen([output_layer], 'dnn_accelerator', silent=silent,
#                        config={'maxi_datawidth': axi_datawidth})

# to IP-XACT (like to_veriloggen, the method returns a Veriloggen object)
targ = ng.to_ipxact([output_layer], 'dnn_accelerator', silent=silent,
                    config={'maxi_datawidth': axi_datawidth})
print('# IP-XACT was generated. Check the current directory.')

# to Verilog HDL RTL (the method returns a source code text)
# rtl = ng.to_verilog([output_layer], 'dnn_accelerator', silent=silent,
#                    config={'maxi_datawidth': axi_datawidth})

NNgen: Neural Network Accelerator Generator (version 1.3.4)
[IP-XACT]
  Output: dnn_accelerator
[Configuration]
(AXI Master Interface)
  Data width   : 32
  Address width: 32
(AXI Slave Interface)
  Data width   : 32
  Address width: 32
[Schedule Table]
(Stage 0)
(Stage 1)
  <conv2d None dtype:int8 shape:(1, 32, 32, 64) strides:(1, 1, 1, 1) padding:'SAME'-(1, 1, 1, 1) bias:(64,) scale:(64,) cshamt_out:17 act_func:relu sum_dtype:int32 par_ich:2 par_och:2 concur_och:16 stationary:filter keep_input default_addr:4242240 g_index:0 l_index:1 word_alignment:4 aligned_shape:(1, 32, 32, 64) scale_factor:2.625163>
  | <placeholder input_layer dtype:int8 shape:(1, 32, 32, 3) default_addr:64 g_index:2 word_alignment:4 aligned_shape:(1, 32, 32, 4) scale_factor:64.000000>
  | <variable w0 dtype:int8 shape:(64, 3, 3, 3) default_addr:4160 g_index:3 word_alignment:4 aligned_shape:(64, 3, 3, 4) scale_factor:42.333333>
  | <variable b0 dtype:int32 shape:(64,) default_addr:4160 g_index:3 word_alignment:2 

## Save the quantized weights

All weight parameters are packed into a single `np.ndarray` by the `ng.export_ndarray` method. This array can be used on an actual FPGA platform.
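
When the saved file is read back later, for example on the FPGA host, it helps to know how NumPy names the stored array. The sketch below uses a small dummy array as a hypothetical stand-in for the exported parameters and a temporary directory instead of the notebook's working directory. An array passed positionally to `np.savez_compressed` is stored under the key `arr_0`.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the array returned by ng.export_ndarray
param_data = np.arange(256, dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'dnn_accelerator.npz')
    np.savez_compressed(path, param_data)   # positional arrays are named arr_0, arr_1, ...

    with np.load(path) as archive:
        restored = archive['arr_0']         # read the array back by its default key

print(restored.dtype, restored.shape, np.array_equal(restored, param_data))
```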

In [19]:
# convert weight values to a memory image:
# on a real FPGA platform, this image will be used as a part of the model definition.
param_filename = 'dnn_accelerator.npz'
chunk_size = 64

param_data = ng.export_ndarray([output_layer], chunk_size)
np.savez_compressed(param_filename, param_data)

## Simulate the generated hardware by Veriloggen and Verilog simulator

To reduce development time, you can skip this Verilog simulation section.

If you generate the hardware as a Veriloggen object or as IP-XACT, you can simulate its behavior on a Verilog simulator via the testing mechanism of Veriloggen.

Before the hardware runs, the input data and weight values must be placed in the shared off-chip memory. In the Verilog simulation of this example, an `np.ndarray` object represents a dump image of the off-chip memory. You can copy the pre-computed values to the memory image with the `axi.set_memory` method.

`param_data` is the unified parameter data of all variables and constants. The locations of the data are configurable and can be changed from the CPU via the configuration registers of the NNgen hardware. In the following example, the head address of the unified parameter data (`variable_addr`) is calculated by the same rule as the address calculator in the NNgen compiler.
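
The address rule used in the simulation cell rounds each end address up to the next multiple of `chunk_size`. The sketch below reproduces that rounding with integer arithmetic. The helper name `align_up` is hypothetical, and the byte count of the input assumes one byte per `int8` value over the aligned shape `(1, 32, 32, 4)` reported in the synthesis log.

```python
def align_up(addr, chunk_size):
    """Round addr up to the next multiple of chunk_size using integers only."""
    return (addr + chunk_size - 1) // chunk_size * chunk_size


chunk_size = 64
input_addr = 64                 # default_addr of input_layer in the synthesis log
input_bytes = 1 * 32 * 32 * 4   # assumed: aligned shape (1, 32, 32, 4), 1 byte per value

variable_addr = align_up(input_addr + input_bytes, chunk_size)
print(variable_addr)
```

Under these assumptions the result is 4160, which agrees with the `default_addr` that the synthesis log reports for `w0`.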

The `ctrl` method in the following example emulates a control program on the CPU. It is actually synthesized into an FSM circuit of the control sequence by the procedural high-level synthesis compiler of Veriloggen. Via the `ng.sim.start` method, the program writes `1` to the `start` register of the NNgen hardware. The hardware then begins the computation, and the CPU waits via the `ng.sim.wait` method until the computation finishes.

### Data alignment, `word_alignment`, and `aligned_shape`

**Note that all input, weight, and output data must be placed according to their alignments.** In particular, using a data width narrower than the AXI interconnect interface (for any data) or applying parallelization via the hardware attributes requires special care in data arrangement. In the synthesis log, you can find the `word_alignment` and `aligned_shape` of each placeholder, variable, and operator. When placing the corresponding data in off-chip memory, padding is required according to the word alignment. The difference between the original shape and the aligned shape is the size of the padding. In NNgen, padding is required only in the innermost dimension.

Unified variable images, such as `param_data`, are already aligned according to the word alignment, so you don't have to rearrange them.
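
The relation between a shape and its aligned shape can be sketched with NumPy as padding of the innermost dimension up to a multiple of the word alignment. The function name `pad_innermost` is hypothetical, and filling the padding with zeros is an assumption of this example.

```python
import numpy as np


def pad_innermost(array, word_alignment):
    """Zero-pad the last dimension up to a multiple of word_alignment."""
    last = array.shape[-1]
    aligned_last = -(-last // word_alignment) * word_alignment  # round up to a multiple
    pad_width = [(0, 0)] * (array.ndim - 1) + [(0, aligned_last - last)]
    return np.pad(array, pad_width, mode='constant')


x = np.ones((1, 32, 32, 3), dtype=np.int64)    # same shape as input_layer
x_aligned = pad_innermost(x, word_alignment=4)

print(x.shape, '->', x_aligned.shape)
```

For the input placeholder, this reproduces the step from shape `(1, 32, 32, 3)` to aligned shape `(1, 32, 32, 4)` shown in the synthesis log.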

In [7]:
import math
import os
from veriloggen import *
import veriloggen.thread as vthread
import veriloggen.types.axi as axi

outputfile = 'dnn_accelerator.out'
filename = 'dnn_accelerator.v'
# simtype = 'iverilog'
simtype = 'verilator'

param_bytes = len(param_data)

variable_addr = int(
    math.ceil((input_layer.addr + input_layer.memory_size) / chunk_size)) * chunk_size
check_addr = int(math.ceil((variable_addr + param_bytes) / chunk_size)) * chunk_size
tmp_addr = int(math.ceil((check_addr + output_layer.memory_size) / chunk_size)) * chunk_size

memimg_datawidth = 32
mem = np.zeros([1024 * 1024 * 256 // memimg_datawidth], dtype=np.int64)
mem = mem + [100]

# placeholder
axi.set_memory(mem, input_layer_value, memimg_datawidth,
               act_dtype.width, input_layer.addr,
               max(int(math.ceil(axi_datawidth / act_dtype.width)), par_ich))

# parameters (variable and constant)
axi.set_memory(mem, param_data, memimg_datawidth,
               8, variable_addr)

# verification data
axi.set_memory(mem, output_layer_value, memimg_datawidth,
               act_dtype.width, check_addr,
               max(int(math.ceil(axi_datawidth / act_dtype.width)), par_och))

# test controller
m = Module('test')
params = m.copy_params(targ)
ports = m.copy_sim_ports(targ)
clk = ports['CLK']
resetn = ports['RESETN']
rst = m.Wire('RST')
rst.assign(Not(resetn))

# AXI memory model
if outputfile is None:
    outputfile = os.path.splitext(os.path.basename(__file__))[0] + '.out'

memimg_name = 'memimg_' + outputfile

memory = axi.AxiMemoryModel(m, 'memory', clk, rst,
                            datawidth=axi_datawidth,
                            memimg=mem, memimg_name=memimg_name,
                            memimg_datawidth=memimg_datawidth)
memory.connect(ports, 'maxi')

# AXI-Slave controller
_saxi = vthread.AXIMLite(m, '_saxi', clk, rst, noio=True)
_saxi.connect(ports, 'saxi')

# timer
time_counter = m.Reg('time_counter', 32, initval=0)
seq = Seq(m, 'seq', clk, rst)
seq(
    time_counter.inc()
)


def ctrl():
    for i in range(100):
        pass

    ng.sim.set_global_addrs(_saxi, tmp_addr)

    start_time = time_counter.value
    ng.sim.start(_saxi)

    print('# start')

    ng.sim.wait(_saxi)
    end_time = time_counter.value

    print('# end')
    print('# execution cycles: %d' % (end_time - start_time))

    # verify
    ok = True
    for bat in range(output_layer.shape[0]):
        for x in range(output_layer.shape[1]):
            orig = memory.read_word(bat * output_layer.aligned_shape[1] + x,
                                    output_layer.addr, act_dtype.width)
            check = memory.read_word(bat * output_layer.aligned_shape[1] + x,
                                     check_addr, act_dtype.width)

            if vthread.verilog.NotEql(orig, check):
                print('NG (', bat, x,
                      ') orig: ', orig, ' check: ', check)
                ok = False
            else:
                print('OK (', bat, x,
                      ') orig: ', orig, ' check: ', check)

    if ok:
        print('# verify: PASSED')
    else:
        print('# verify: FAILED')

    vthread.finish()


th = vthread.Thread(m, 'th_ctrl', clk, rst, ctrl)
fsm = th.start()

uut = m.Instance(targ, 'uut',
                 params=m.connect_params(targ),
                 ports=m.connect_ports(targ))

# simulation.setup_waveform(m, uut)
simulation.setup_clock(m, clk, hperiod=5)
init = simulation.setup_reset(m, resetn, m.make_reset(), period=100, polarity='low')

init.add(
    Delay(10000000),
    Systask('finish'),
)

# output source code
if filename is not None:
    m.to_verilog(filename)

# run simulation
sim = simulation.Simulator(m, sim=simtype)
rslt = sim.run(outputfile=outputfile)

print(rslt)

# start
# end
# execution cycles:     1788938
OK (           0           0 ) orig:           11  check:           11
OK (           0           1 ) orig:            1  check:            1
OK (           0           2 ) orig:          -24  check:          -24
OK (           0           3 ) orig:           -2  check:           -2
OK (           0           4 ) orig:            2  check:            2
OK (           0           5 ) orig:           20  check:           20
OK (           0           6 ) orig:          -22  check:          -22
OK (           0           7 ) orig:           -4  check:           -4
OK (           0           8 ) orig:            1  check:            1
OK (           0           9 ) orig:           14  check:           14
# verify: PASSED
- /home/marcel/git/lsforai/hello_nngen.out/out.v:1528: Verilog $finish

