### How to use the compiler folder for the example of LeNet-5

**Introduction to LeNet-5 : one of the first CNNs, useful for image recognition.**

*Architecture :* 
- **Input image :** 32x32 pixels, 1 channel
- **First convolutional layer C1 :** 6 convolutional filters of size 5x5, resulting in 6 feature maps of size 28x28 | ReLU activation
- **Average pooling layer AP2 :** 2x2 kernel with stride = 2, resulting in feature maps of size 14x14 
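
The feature-map sizes listed above follow from the usual sliding-window formula `(in + 2*pad - kernel) // stride + 1`. A minimal sketch of that arithmetic (the helper name `conv_output_size` is hypothetical and not part of the compiler folder):

```python
def conv_output_size(in_size, kernel, stride=1, pad=0):
    """Spatial size after a convolution or pooling window slides over the input."""
    return (in_size + 2 * pad - kernel) // stride + 1

# C1: 5x5 filters, stride 1, no padding, applied to a 32x32 image
c1_size = conv_output_size(32, kernel=5, stride=1, pad=0)

# AP2: 2x2 window, stride 2, applied to the C1 feature maps
ap2_size = conv_output_size(c1_size, kernel=2, stride=2, pad=0)

print(c1_size, ap2_size)  # 28 14
```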

*How to use data_definition :*
**Toolbox :** `numpy` is required for matrix manipulation (`conda install numpy` in a conda environment)

**Defining the size of the matrices :** 
The input image, represented by an input tensor of size 32x32x1 (height x width x channels), goes through C1 to become an output tensor of size 28x28x6.
To run the convolution on the VTA as a GEMM, the tensors are converted into 2D matrices with an Im2row method. This yields an input matrix A (784x25) and a weight matrix B (25x6), whose product is an output matrix of size 784x6. This conversion is done by ACETONE.
The dimensions of the matrices are obtained using `tensor_matrix_converter.py` (no matrices are generated, only the dimensions):

In [19]:
"""IMPORTING NECESSARY FUNCTIONS"""

%pip install numpy
import numpy as np
import sys
sys.path.append('../compiler/data_definition')
import tensor_matrix_converter
import matrix_generator
import matrix_split
import matrix_multiplication

Note: you may need to restart the kernel to use updated packages.


In [20]:
# To illustrate, let's compute the dimensions of the Input and Weight matrices (post-Im2Row conversion) and of the Output matrix (after GeMM).

# ----------------------
# To do so, enter the dimensions of the Input tensor and of the Kernel:

"""INPUT TENSOR"""
input_channel = 1
input_height = 32
input_width = 32

"""KERNEL"""
kernel_channel = 6 # Number of filters
kernel_height = 5
kernel_width = 5

"""Computation Parameters (for convolution)"""
stride_height = 1
stride_width = 1
pad_height = 0
pad_width = 0

# Using `tensor_matrix_converter.py`, we can print the dimensions of the Output tensor (post-convolution) :

"""OUTPUT TENSOR"""
output_tensor_height, output_tensor_width = tensor_matrix_converter.output_dimension(
    inp_dim=(input_height, input_width),
    wgt_dim=(kernel_height, kernel_width),
    stride=(stride_height, stride_width),
    padding=(pad_height, pad_width))

# Then, we can print the dimensions of the Input and Weight matrices
tensor_matrix_converter.im2row_matrix_dimension(
    nc=input_channel, nh=input_height, nw=input_width,
    mc=kernel_channel, mh=output_tensor_height, mw=output_tensor_width,
    fh=kernel_height, fw=kernel_width,
    sh=stride_height, sw=stride_width,
    ph=pad_height, pw=pad_width)

# Size of the input matrix
inp_height = output_tensor_height * output_tensor_width
inp_width = input_channel * kernel_height * kernel_width
# Size of the weight matrix
wgt_height = inp_width
wgt_width = kernel_channel
# Size of the output matrix
out_height = output_tensor_height * output_tensor_width
out_width = kernel_channel



Input tensor: nc = 1, nh = 32, nw = 32 
Output tensor: mc = 6, mh = 28, mw = 28 
Kernel: fh = 5, fw = 5 
Parameters: stride = (1, 1), pad = (0, 0) 


Input matrix: height = 784, width = 25 
Weight matrix: height = 25, width = 6 
Output matrix: height = 784, width = 6 
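
To make these dimensions concrete, here is a minimal NumPy sketch of an Im2row conversion followed by a GEMM. It is an illustration only: the helper `im2row` and the column ordering (channel, then kernel row, then kernel column) are assumptions, and the actual layout produced by ACETONE may differ.

```python
import numpy as np

def im2row(image, kh, kw, stride=1):
    """Flatten each (kh x kw) patch of a (C, H, W) tensor into one matrix row."""
    c, h, w = image.shape
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    rows = np.empty((oh * ow, c * kh * kw), dtype=image.dtype)
    for y in range(oh):
        for x in range(ow):
            patch = image[:, y * stride:y * stride + kh, x * stride:x * stride + kw]
            rows[y * ow + x] = patch.reshape(-1)
    return rows

rng = np.random.default_rng(0)
image = rng.integers(-4, 4, size=(1, 32, 32))     # input tensor (C, H, W)
kernels = rng.integers(-4, 4, size=(6, 1, 5, 5))  # 6 filters of size 5x5

A = im2row(image, 5, 5)          # input matrix, shape (784, 25)
B = kernels.reshape(6, -1).T     # weight matrix, shape (25, 6)
C = A @ B                        # output matrix, shape (784, 6)

# Cross-check one output value against a direct convolution at pixel (y, x), filter f
y, x, f = 3, 7, 2
direct = np.sum(image[:, y:y + 5, x:x + 5] * kernels[f])
print(A.shape, B.shape, C.shape, C[y * 28 + x, f] == direct)
```

Each row of `C` holds the 6 filter responses of one output pixel, which is why the output matrix has 28 x 28 = 784 rows.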




**Configuring the data generation :** 
This step sets whether to randomize the content of the matrices, whether to pad them, whether to apply an activation function (ReLU), which types of files to write or print (JSON, binary), etc.
These options are set in `user_configuration.py`, by switching the parameters to True / False depending on the desired outcome.

*For example, these parameters initialise the 784x25 input matrix A and the 25x6 weight matrix B with random content:*

```
isInitRandom = True
A_row = 784
A_col = 25
B_col = 6
```

*The VTA requires square 16x16 matrices for multiplication, and a ReLU activation is then applied to the result:*

```
block_size = 16
isSquare = True
useReLU = True
```

*We want JSON files as outputs, so :*

```
doWriteBinaryFile = False
doWriteJSON = True
```

In [21]:
"""MATRIX GENERATION"""
# Matrices initialised with random value? (True / False)
isInitRandom = True
# If yes, random_bound limits the value range (int8 = [-128; 127] -> random_bound = 128)
random_bound = 4

"""COMPUTATION SPECIFICATION"""
# The size of the square matrix multiplication (multiply two block_size x block_size square matrices together)
block_size = 16 # VTA requirement

# Use square matrix or not
isSquare = True

# Compute the non-padded matrix? (True / False)
doMultiplyNonPadded = False

# C matrix option
# Reduction from int16 to int8: useClip (True / False)
# => True: if x > 0: clip => min(127, x)
# => False: Truncate the MSB
useClip = False

# Apply ReLU on the result
useReLU = False


"""PROMPTING AND DUMPING FILES FEATURES"""
# Print the data (True / False)
doPrint = True

# Write matrices in binary files in OUTPUT dir (True / False)
doWriteBinaryFile = False

# Write a JSON file for CHISEL Compute in OUTPUT dir (True / False)
doWriteJSON = True
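
The `useClip` option chooses between two ways of reducing int16 accumulator values to int8. The sketch below shows the difference with NumPy; it assumes a symmetric saturation to [-128, 127] for the clipping case, which may not match the exact behaviour of the compiler scripts.

```python
import numpy as np

acc = np.array([-300, -129, -128, -1, 0, 127, 128, 300], dtype=np.int16)

# useClip = True (assumed semantics): saturate to the int8 range
clipped = np.clip(acc, -128, 127).astype(np.int8)

# useClip = False: keep only the 8 least significant bits (values wrap around)
truncated = (acc & 0xFF).astype(np.uint8).view(np.int8)

print(clipped)    # [-128 -128 -128   -1    0  127  127  127]
print(truncated)  # [ -44  127 -128   -1    0  127 -128   44]
```

With `random_bound = 4` and 25 products per output value, the results stay well inside the int8 range, so both options give the same output in this example.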

**Generating the data :**
The program `main_matrix_generator.py` can generate .bin (binary) files for the *functional_simulator* and .json files for the *cycle_accurate_simulator* (using CHISEL). The files will be generated in the *compiler_output/* directory.


It calls functions from several other programs : 
- `matrix_generator.py` : generates the input and weight matrices (A of size 784x25 and B of size 25x6) according to `user_configuration.py`: the number of rows (height) and columns (width) of each matrix, the padding, and whether the content is randomized or filled with 0s. A and B are padded into 784x32 and 32x16 matrices so that they can be split into 16x16 blocks.
- `matrix_split.py` : splits A and B into square 16x16 sub-matrices, as required by the VTA (which only accepts matrices of this size for matrix multiplication).
- `matrix_multiplication.py` : performs the block matrix multiplication. A_block_i (16x16) and B_block_j (16x16) are multiplied to obtain an output sub-matrix (also 16x16). If ReLU is enabled, it is also applied to each value of the output matrices.
- `json_generator.py` : translates the input data (and expected output) into a .json file.
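
The pad / split / multiply pipeline described in this list can be sketched in plain NumPy. This is not the code of `matrix_generator.py`, `matrix_split.py` or `matrix_multiplication.py`: the helpers `pad_to_block` and `split_blocks` are hypothetical, and the row-major tile order is an assumption.

```python
import numpy as np

BLOCK = 16  # block size required by the VTA

def pad_to_block(m, block=BLOCK):
    """Zero-pad a 2D matrix so that both dimensions are multiples of `block`."""
    extra_rows = -m.shape[0] % block
    extra_cols = -m.shape[1] % block
    return np.pad(m, ((0, extra_rows), (0, extra_cols)))

def split_blocks(m, block=BLOCK):
    """Return the (block x block) tiles in row-major order and the number of tiles per row."""
    n_rows, n_cols = m.shape[0] // block, m.shape[1] // block
    tiles = [m[r * block:(r + 1) * block, c * block:(c + 1) * block]
             for r in range(n_rows) for c in range(n_cols)]
    return tiles, n_cols

rng = np.random.default_rng(0)
A = rng.integers(-4, 4, size=(784, 25))
B = rng.integers(-4, 4, size=(25, 6))

A_pad, B_pad = pad_to_block(A), pad_to_block(B)   # (784, 32) and (32, 16)
A_tiles, a_cols = split_blocks(A_pad)              # 98 tiles, 2 per tile row
B_tiles, b_cols = split_blocks(B_pad)              # 2 tiles, 1 per tile row

# Block product: C[i][j] = sum over k of A[i][k] @ B[k][j]
C_tiles = []
for i in range(len(A_tiles) // a_cols):
    for j in range(b_cols):
        acc = np.zeros((BLOCK, BLOCK), dtype=np.int64)
        for k in range(a_cols):
            acc += A_tiles[i * a_cols + k] @ B_tiles[k * b_cols + j]
        C_tiles.append(acc)

C = np.vstack(C_tiles)  # stacking is valid here because b_cols == 1
print(C.shape, np.array_equal(C[:, :6], A @ B))
```

The zero padding only adds zero columns to the result, so the first 6 columns of `C` equal the unpadded product `A @ B`.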

In [22]:
# ----------------------
# Generate the matrix A and B with random values

# Input Matrix A
input_matrix = matrix_generator.matrix_int8_creation(n_row=inp_height, n_col=inp_width, isInitRandom=isInitRandom, random_bound=random_bound)

# Weight Matrix B
weight_matrix = matrix_generator.matrix_int8_creation(n_row=inp_width, n_col=wgt_width, isInitRandom=isInitRandom, random_bound=random_bound)

print("Input Matrix (",inp_height, "x", inp_width,") :\n", input_matrix)
print("Weight Matrix (",inp_width, "x", wgt_width,") :\n", weight_matrix)

Input Matrix ( 784 x 25 ) :
 [[-4 -4 -2 ...  2 -2  1]
 [-4 -4 -2 ...  0 -3  2]
 [ 2 -2 -4 ... -1  1 -4]
 ...
 [ 0 -3 -2 ...  2 -2  1]
 [-2  1  1 ... -1  0 -1]
 [-2  0  2 ... -1  2  0]]
Weight Matrix ( 25 x 6 ) :
 [[-1 -1  1  0 -4  1]
 [-3  0 -3  2  1  0]
 [ 2  1 -4 -2 -2  0]
 [-3  0 -2 -1 -2  1]
 [-4 -1  0 -2 -1 -3]
 [-1 -3 -4 -3 -3  0]
 [ 0  0 -1 -1 -4 -3]
 [ 0 -1  1 -2  1 -4]
 [-4 -1  1  1  1 -3]
 [-2 -4  0  1  2  2]
 [ 1 -3 -1  0 -4 -1]
 [-4 -3 -4  1  2  1]
 [ 1  1 -3 -1  1 -3]
 [ 1 -2 -2 -4 -1  1]
 [ 1  2 -4 -4 -3 -1]
 [ 1 -1 -3 -1  0 -4]
 [-2  0 -1 -1 -4  1]
 [-1 -2  0 -4 -4 -3]
 [ 2  1 -4  0  2  2]
 [ 0 -3 -4  1  0 -1]
 [ 2  0  2 -4  0 -3]
 [-3  2  0 -4  2  2]
 [-4  2 -3 -1  0 -2]
 [-3 -2 -2 -3 -2 -1]
 [ 1  0 -1  2 -2 -1]]


In [23]:
# ----------------------
# Padding the matrices so their dimensions can be divided by 16

# Padded Input Matrix A
input_matrix_padded = matrix_generator.matrix_padding(input_matrix)

# Padded Weight Matrix B
weight_matrix_padded = matrix_generator.matrix_padding(weight_matrix)

print("Padded Input Matrix (",input_matrix_padded.shape[0], "x", input_matrix_padded.shape[1],") :\n", input_matrix_padded)
print("Padded Weight Matrix (",weight_matrix_padded.shape[0], "x", weight_matrix_padded.shape[1],") :\n", weight_matrix_padded)

Padded Input Matrix ( 784 x 32 ) :
 [[-4 -4 -2 ...  0  0  0]
 [-4 -4 -2 ...  0  0  0]
 [ 2 -2 -4 ...  0  0  0]
 ...
 [ 0 -3 -2 ...  0  0  0]
 [-2  1  1 ...  0  0  0]
 [-2  0  2 ...  0  0  0]]
Padded Weight Matrix ( 32 x 16 ) :
 [[-1 -1  1  0 -4  1  0  0  0  0  0  0  0  0  0  0]
 [-3  0 -3  2  1  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  1 -4 -2 -2  0  0  0  0  0  0  0  0  0  0  0]
 [-3  0 -2 -1 -2  1  0  0  0  0  0  0  0  0  0  0]
 [-4 -1  0 -2 -1 -3  0  0  0  0  0  0  0  0  0  0]
 [-1 -3 -4 -3 -3  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0 -1 -1 -4 -3  0  0  0  0  0  0  0  0  0  0]
 [ 0 -1  1 -2  1 -4  0  0  0  0  0  0  0  0  0  0]
 [-4 -1  1  1  1 -3  0  0  0  0  0  0  0  0  0  0]
 [-2 -4  0  1  2  2  0  0  0  0  0  0  0  0  0  0]
 [ 1 -3 -1  0 -4 -1  0  0  0  0  0  0  0  0  0  0]
 [-4 -3 -4  1  2  1  0  0  0  0  0  0  0  0  0  0]
 [ 1  1 -3 -1  1 -3  0  0  0  0  0  0  0  0  0  0]
 [ 1 -2 -2 -4 -1  1  0  0  0  0  0  0  0  0  0  0]
 [ 1  2 -4 -4 -3 -1  0  0  0  0  0  0  0  0  0  0]
 [ 1 -1

In [24]:
# ----------------------
# Splitting the matrices into 16 x 16 matrices and displaying the first block (for each matrix A & B) that would be obtained using `matrix_split.py`

# Block Input Matrices (Ai) (16 x 16)
block_input_matrix, input_block_col = matrix_split.matrix_splitting(input_matrix_padded)

# Block Weight Matrices (Bi) (16 x 16)
block_weight_matrix, weight_block_col = matrix_split.matrix_splitting(weight_matrix_padded)

print("First Block Input Matrix (",block_input_matrix[0].shape[0], "x", block_input_matrix[0].shape[1],") :\n", block_input_matrix[0])
print("First Block Weight Matrix (",block_weight_matrix[0].shape[0], "x", block_weight_matrix[0].shape[1],") :\n", block_weight_matrix[0])

First Block Input Matrix ( 16 x 16 ) :
 [[-4 -4 -2 -4  1 -1 -1 -2  0 -1 -2  0 -2 -1  2  2]
 [-4 -4 -2  0  1  1 -2  0 -4 -1 -4 -3  1 -4 -2  2]
 [ 2 -2 -4 -1  0  1  0  0 -3 -1 -3  2 -2  0 -4 -4]
 [-1 -2 -4 -1 -3  1  0 -3 -4  2 -4 -1  1  0  1 -1]
 [ 2  2 -1  1 -2 -2 -4  0  1  1  2 -3 -3  1 -1 -4]
 [-3 -3 -3 -3 -4 -3 -2 -2  1  1 -3 -3 -2 -4  0  0]
 [ 1  2 -4 -3 -3 -4  1  0 -2  2  2  1  1 -4  0 -4]
 [ 0  2 -1 -2  1 -4 -2  1  0 -1  1 -2  1  1 -2 -4]
 [-1 -3  1  2 -3 -4  2 -2  2 -4  0 -4 -4 -3 -3  1]
 [-2  1 -4  1 -3 -2 -4 -3 -2 -4 -1 -3 -3 -1 -3 -4]
 [-2 -4  2 -2  2  0 -1  1 -1  1 -2 -2 -4  0 -4  2]
 [ 2  2 -2  2 -2  2 -4  2 -4 -1  0 -3  1  0  2  0]
 [ 2 -2 -4  1 -2 -1 -3  1  1 -3  0  2 -3 -2 -4 -3]
 [ 2 -2  1  0 -1 -1  1 -2 -2 -4  2 -4 -1 -2  0 -2]
 [-2 -1 -4 -4  0 -3 -1 -2  2 -2 -3  2 -1 -4  2  2]
 [ 0 -3  0  1 -2  2 -2 -2  0 -2 -3 -1 -2  1  0 -4]]
First Block Weight Matrix ( 16 x 16 ) :
 [[-1 -1  1  0 -4  1  0  0  0  0  0  0  0  0  0  0]
 [-3  0 -3  2  1  0  0  0  0  0  0  0  0  0  0  0]


In [25]:
# ----------------------
# The (16 x 16) block matrices we've obtained are then multiplied using the VTA (GeMM)

block_output_matrix, combinations = matrix_multiplication.block_matrix_multiply(block_input_matrix, block_weight_matrix, input_block_col, weight_block_col)
print("First Block Output Matrix (",block_output_matrix[0].shape[0], "x", block_output_matrix[0].shape[1],") :\n", block_output_matrix[0])

First Block Output Matrix ( 16 x 16 ) :
 [[ 39  21  12  16  31 -14   0   0   0   0   0   0   0   0   0   0]
 [ 41  27  36  47  41  -2   0   0   0   0   0   0   0   0   0   0]
 [-26 -17  63  18   2  39   0   0   0   0   0   0   0   0   0   0]
 [ 47  11  31  14  25  39   0   0   0   0   0   0   0   0   0   0]
 [ -3  -6  43  43  12  46   0   0   0   0   0   0   0   0   0   0]
 [ 40  28  97  52  47  27   0   0   0   0   0   0   0   0   0   0]
 [ -2  22  57  64  54  26   0   0   0   0   0   0   0   0   0   0]
 [  5  14  44  33  52  15   0   0   0   0   0   0   0   0   0   0]
 [ 49  56  77  39  42  16   0   0   0   0   0   0   0   0   0   0]
 [ 34  39  96  64  54  66   0   0   0   0   0   0   0   0   0   0]
 [  9  13  61  -8  36   4   0   0   0   0   0   0   0   0   0   0]
 [ 21   7  22   1 -23  20   0   0   0   0   0   0   0   0   0   0]
 [-12 -17  81  63   4  37   0   0   0   0   0   0   0   0   0   0]
 [ 48  10  43  15 -61   6   0   0   0   0   0   0   0   0   0   0]
 [  4  37  49  66  75

In [26]:
"""Using the example of LeNet-5 first convolutional layer C1 :"""

# If `doWriteBinaryFile=True`, running this command will generate the binary files containing the data for the A block matrices, transposed B block matrices (input data) and expected output (ACC).
# The files will be generated in compiler_output/, under the names 'input.bin', 'weight.bin', 'expected_out.bin'.

# If `doWriteJSON=True`, running this command will also generate the JSON files containing the instructions, UOPs (added later), A block matrices, transposed B block matrices (input data) and expected output (ACC).
# The file will be generated in compiler_output/, under the name 'generated_for_compute.json'.

%run ../compiler/data_definition/main_matrix_generator.py examples.data_lenet5_conv1

Binary files successfully generated.


JSON file successfully generated.

 INITIAL MATRICES:
A_matrix: ((h, w) = (784, 25)) 
 [[ 1  0 -1 ... -4 -4 -2]
 [-4 -4 -3 ... -1  2 -4]
 [-3 -3  1 ...  2 -3 -2]
 ...
 [ 2 -1 -4 ... -1 -2  2]
 [-2  0  0 ... -3  1  0]
 [ 2 -2 -3 ...  0 -4  0]]

 x 
 B_matrix: ((h, w) = (25, 6)) 
 [[-1 -2 -3 -1  2  2]
 [ 1  2 -4  1  2 -1]
 [ 0 -3  2  1 -4  1]
 [ 2 -3  2  1 -1 -3]
 [ 2 -2  0  0  2 -2]
 [-4  2  2  0  0 -1]
 [-1  2 -4  0  0 -4]
 [-4  2  2 -1 -2 -4]
 [-2 -1 -2  0 -4  1]
 [-1  2  0 -3  1 -4]
 [ 0  2  2 -4 -1 -2]
 [-1 -1  0 -1 -4  2]
 [-2 -2  0 -1  1 -3]
 [ 0 -3  2 -1  0  1]
 [-3 -4 -1  1 -1  1]
 [-1 -3  1  1  1 -3]
 [-4  0 -2  1  0 -1]
 [ 2 -1  2 -2  1 -3]
 [-2  0  0 -4 -2  2]
 [ 0 -3  1 -2  2  1]
 [-2 -4 -1  2  0  0]
 [-1 -3 -3 -3 -1 -2]
 [-4 -4 -3  0  2 -1]
 [ 0  2  2 -3  1  0]
 [ 0  2 -4  1  2  2]]



 PADDED MATRICES:
A_padded: ((h, w) = (784, 32)) 
 [[ 1  0 -1 ...  0  0  0]
 [-4 -4 -3 ...  0  0  0]
 [-3 -3  1 ...  0  0  0]
 ...
 [ 2 -1 -4 ...  0  0  0]
 [-2  0  0 ...  0  0  0]
 [ 2 -2 -3

*How to use operations_definition :*

**Objective :** To use the VTA simulators, instructions must be generated (in .json and .bin files) so that they can be run using Scala and/or CHISEL. The following programs use the data obtained from `data_definition` to generate the instructions for each operation (load, GeMM, ReLU, ALU, store, reset, etc.) that the VTA needs to perform.

**Generating the instructions :** 
Taking the example of LeNet-5's first convolutional layer (GeMM), followed by ReLU and average pooling (which reduces the size of the output matrix after GeMM) :
Currently, we have (16 x 16) block INP matrices *Ai* and (16 x 16) block WGT matrices *Bi*. To execute **GeMM**, each *Ai* is split into horizontal vectors of 16 values (noted (16 x 1)). Multiplying these vectors by the *Bi* blocks gives the block OUTPUT matrices *ACCi*, which are also composed of (16 x 1) horizontal vectors. The vectors are then reassembled into (16 x 16) blocks.
We then apply **ReLU** to *ACCi*. The resulting block matrices can then go through Average Pooling, composed of **2 ADD** and **1 SHR** (the amount of relevant data is divided by 4). 
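
The decomposition of a 2x2 average pooling into two ADDs and one SHR can be sketched as follows. The sketch assumes that ACC stores one pixel per row in row-major order (row = y * 28 + x), with one column per channel, and that the SHR shifts by 2 bits (a floor division by 4); both are assumptions about the layout rather than facts taken from the instruction files.

```python
import numpy as np

W = 28                                         # width of a C1 feature map
rng = np.random.default_rng(0)
acc = rng.integers(0, 128, size=(W * W, 16))   # one row per pixel, one column per channel
reference = acc.copy()

# ADD #1: add each pixel's right-hand neighbour (row 2i+1 is added into row 2i)
acc[0::2] += acc[1::2]

# ADD #2: add the partial sums of the image row below (offset of W rows)
for block_row in range(W // 2):
    top = 2 * W * block_row
    acc[top:top + W:2] += acc[top + W:top + 2 * W:2]

# SHR: shift right by 2 bits, keeping only the rows that hold a full 2x2 sum
pooled = np.stack([acc[2 * W * i:2 * W * i + W:2] >> 2 for i in range(W // 2)])

# Reference: sum of each 2x2 window, floor-divided by 4
ref = reference.reshape(W, W, 16)
expected = (ref[0::2, 0::2] + ref[0::2, 1::2] + ref[1::2, 0::2] + ref[1::2, 1::2]) >> 2
print(pooled.shape, np.array_equal(pooled, expected))
```

After the two ADDs only one row out of four carries a useful value, which is why the relevant data shrinks from 784 to 196 rows per channel.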

For each operation (reset, GEMM, ALU), the UOP buffers (`VTAUop` structure, according to `structures_insn_uop.py`) have to be filled in to determine at which indexes of the matrices the data will be read / stored. 
The fields of the instruction buffers (`VTAGemInsn` for GeMM instructions, `VTAAluInsn` for ALU instructions, `VTAMemInsn` for load / store instructions) must also be set according to the dimensions of the input tensor and filters.

- **LOAD (128-bit):** Data is extracted from DRAM and stored temporarily in SRAM to compute the operations
- **ALU (128-bit) :** ReLU activation (for each value x of the matrices => max(0, x)), ADD 1 & 2 then SHR (averaging some of the data).
- **GEMM (128-bit) :** Matrix multiplication of A and B (transposed).
- **STORE (128-bit) :** Data from SRAM is copied at the desired location in DRAM, after all the operations have been resolved.
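
To see how the UOP fields and the instruction loop factors combine into buffer indexes, the following sketch enumerates the (ACC, INP, WGT) indexes touched by one instruction. The generator `uop_indices` is a hypothetical helper that mirrors the pseudo-code used in this notebook; the numeric values are those of the GEMM instruction of this example.

```python
def uop_indices(uops, loop_out, loop_in, dst_f, src_f, wgt_f):
    """Yield (dst, src, wgt) indexes for every micro-op execution.

    `uops` is a list of (dst_idx, src_idx, wgt_idx) tuples and each `*_f`
    argument is a (factor_out, factor_in) pair.
    """
    for i_out in range(loop_out):
        for i_in in range(loop_in):
            for dst, src, wgt in uops:
                yield (i_out * dst_f[0] + i_in * dst_f[1] + dst,
                       i_out * src_f[0] + i_in * src_f[1] + src,
                       i_out * wgt_f[0] + i_in * wgt_f[1] + wgt)

# GEMM of this example: 2 UOPs, 49 output blocks of 16 vectors each
gemm_uops = [(0, 0, 0), (0, 16, 1)]
indices = list(uop_indices(gemm_uops, loop_out=49, loop_in=16,
                           dst_f=(16, 1), src_f=(32, 1), wgt_f=(0, 0)))

print(indices[:4])   # [(0, 0, 0), (0, 16, 1), (1, 1, 0), (1, 17, 1)]
print(len(indices))  # 1568
print(indices[-1])   # (783, 1567, 1)
```

Every ACC vector receives two contributions (one per WGT block), and the largest INP index is 1567, which matches the 1568 INP vectors loaded from DRAM.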

*Using the program `insn_lenet5_conv1_relu_average_pooling.py` as an example of how to generate the data in binary and JSON files :*

In [27]:
"""CONFIGURATION"""

# Initializing the buffers and importing the necessary structures from `structures_insn_uop.py`

# PACKAGE IMPORT
# --------------
import os

# Parent folder
sys.path.append('../compiler/operations_definition')
import structures_insn_uop
#sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

# UOP DEFINITION
# --------------
# Define empty UOP buffer
uop_buffer = []

# INSTRUCTION DEFINITION
# ----------------------
# Define empty instruction buffer
insn_buffer = []

In [28]:
"""LOAD DATA FROM DRAM"""

# Reset for GEMM operation

if (len(uop_buffer) < 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 0 - reset
        dst_idx=0, 
        src_idx=0,
        wgt_idx=0
    ))

# Loading the memory for data accessibility

if (len(insn_buffer) < 1):
    insn_buffer.append(structures_insn_uop.VTAMemInsn( # I0: LOAD UOP
        opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=0,
        # Memory interaction
        buffer_id=0, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
        sram_base=0x0000,
        dram_base=0x00001000,
        unused=0, # UNUSED
        # Operation over the data
        y_size=1,
        x_size=1,
        x_stride=1,
        y_pad_top=0,
        y_pad_bottom=0,
        x_pad_left=0,
        x_pad_right=0
    ))

# The ACC matrix is wiped, in case of RESET

    insn_buffer.append(structures_insn_uop.VTAGemInsn( # I1: GEMM RESET
        opcode=2, # 2-GEMM
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=1, # Ready signal to LOAD
        push_next_dep=0,
        # Operations
        reset=1, # 0-no, 1-reset
        uop_bgn=0, # UOP 0
        uop_end=1,
        loop_out=49, # Number of (16 x 16) blocks in ACC
        loop_in=16,  # Block size
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=16, # Block size
        dst_factor_in=1,
        src_factor_out=0,
        src_factor_in=0,
        wgt_factor_out=0,
        wgt_factor_in=0
    ))

    insn_buffer.append(structures_insn_uop.VTAMemInsn( # I2: LOAD INP
        opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=1, # Acknowledge COMPUTE ready signal
        push_prev_dep=0,
        push_next_dep=0,
        # Memory interaction
        buffer_id=2, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
        sram_base=0x0000,
        dram_base=0x00000100,
        unused=0, # UNUSED
        # Operation over the data
        y_size=1,
        x_size=1568, # Load 98*16 INP
        x_stride=1568,
        y_pad_top=0,
        y_pad_bottom=0,
        x_pad_left=0,
        x_pad_right=0
    ))

    insn_buffer.append(structures_insn_uop.VTAMemInsn( # I3: LOAD WGT
        opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=1, # Ready signal to COMPUTE
        # Memory interaction
        buffer_id=1, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
        sram_base=0x0000,
        dram_base=0x00000020,
        unused=0, # UNUSED
        # Operation over the data
        y_size=1,
        x_size=2, # Load 2 WGT
        x_stride=2,
        y_pad_top=0,
        y_pad_bottom=0,
        x_pad_left=0,
        x_pad_right=0
    ))

    insn_buffer.append(structures_insn_uop.VTAMemInsn( # I4: LOAD UOP
        opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
        # DEP FLAG
        pop_prev_dep=1, # Acknowledge LOAD ready signal
        pop_next_dep=0, 
        push_prev_dep=0,
        push_next_dep=0,
        # Memory interaction
        buffer_id=0, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
        sram_base=0x0001,
        dram_base=0x00001001,
        unused=0, # UNUSED
        # Operation over the data
        y_size=1,
        x_size=6, # Load 6 UOP (2 GeMM + 1 ReLU + 3 Pool)
        x_stride=6,
        y_pad_top=0,
        y_pad_bottom=0,
        x_pad_left=0,
        x_pad_right=0
    ))
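
The sizes used in the LOAD and GEMM RESET instructions above (`x_size=1568`, `x_size=2`, `loop_out=49`, `x_size=6`) follow from the padded matrix dimensions. A short sketch of that arithmetic, with variable names chosen for illustration only:

```python
BLOCK = 16
a_rows, a_cols = 784, 32   # padded input matrix A
b_rows, b_cols = 32, 16    # padded weight matrix B

inp_blocks = (a_rows // BLOCK) * (a_cols // BLOCK)   # 49 * 2 = 98 blocks of A
wgt_blocks = (b_rows // BLOCK) * (b_cols // BLOCK)   # 2 * 1 = 2 blocks of B
acc_blocks = (a_rows // BLOCK) * (b_cols // BLOCK)   # 49 * 1 = 49 blocks of ACC

inp_vectors = inp_blocks * BLOCK   # x_size of LOAD INP: 16 vectors per block
uop_count = 2 + 1 + 3              # 2 GeMM + 1 ReLU + 3 pooling UOPs (second LOAD UOP)

print(inp_vectors, wgt_blocks, acc_blocks, uop_count)  # 1568 2 49 6
```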

In [29]:
# ----------------------
# Splitting (16 x 16) block INP matrices into (16 x 1) vectors

# Input Matrix INP
print("First vector of first block of INP matrix (", np.shape(block_input_matrix[0][0])[0], " x ", 1, ")")
print(block_input_matrix[0][0], "A@0")

# Weight Matrix WGT
print("x \nFirst block of WGT matrix (", block_weight_matrix[0].shape[0], " x ", block_weight_matrix[0].shape[1], ")")
print(block_weight_matrix[0], "B@0")

# Output Matrix ACC
print("= \nFirst vector of first block of ACC (", np.shape(block_output_matrix[0][0])[0], " x ", 1, ")")
print(block_output_matrix[0][0], "C@0")

First vector of first block of INP matrix ( 16  x  1 )
[-4 -4 -2 -4  1 -1 -1 -2  0 -1 -2  0 -2 -1  2  2] A@0
x 
First block of WGT matrix ( 16  x  16 )
[[-1 -1  1  0 -4  1  0  0  0  0  0  0  0  0  0  0]
 [-3  0 -3  2  1  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  1 -4 -2 -2  0  0  0  0  0  0  0  0  0  0  0]
 [-3  0 -2 -1 -2  1  0  0  0  0  0  0  0  0  0  0]
 [-4 -1  0 -2 -1 -3  0  0  0  0  0  0  0  0  0  0]
 [-1 -3 -4 -3 -3  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0 -1 -1 -4 -3  0  0  0  0  0  0  0  0  0  0]
 [ 0 -1  1 -2  1 -4  0  0  0  0  0  0  0  0  0  0]
 [-4 -1  1  1  1 -3  0  0  0  0  0  0  0  0  0  0]
 [-2 -4  0  1  2  2  0  0  0  0  0  0  0  0  0  0]
 [ 1 -3 -1  0 -4 -1  0  0  0  0  0  0  0  0  0  0]
 [-4 -3 -4  1  2  1  0  0  0  0  0  0  0  0  0  0]
 [ 1  1 -3 -1  1 -3  0  0  0  0  0  0  0  0  0  0]
 [ 1 -2 -2 -4 -1  1  0  0  0  0  0  0  0  0  0  0]
 [ 1  2 -4 -4 -3 -1  0  0  0  0  0  0  0  0  0  0]
 [ 1 -1 -3 -1  0 -4  0  0  0  0  0  0  0  0  0  0]] B@0
= 
First vector of first bl

In [30]:
"""GEMM"""

# Generating the instructions for the GeMM, using A vectorized and B.

# ----------------------
# Defining the GEMM UOP buffer

if (len(uop_buffer) < 1 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 1 - GEMM 0
        dst_idx=0, 
        src_idx=0,
        wgt_idx=0
    ))

if (len(uop_buffer) < 2 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 2 - GEMM 1
        dst_idx=0, 
        src_idx=16,
        wgt_idx=1
    ))

# ----------------------
# Defining the GEMM Instruction buffer

index_insn = 5 # Instruction index

if (len(insn_buffer) < index_insn + 1):
    insn_buffer.append(structures_insn_uop.VTAGemInsn( # I5: GEMM
        opcode=2, # 2-GEMM
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=0, 
        # Operations
        reset=0, # 0-no, 1-reset
        uop_bgn=1, # UOP 1 + UOP 2
        uop_end=3,
        loop_out=49,
        loop_in=16,
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=16,
        dst_factor_in=1,
        src_factor_out=32,
        src_factor_in=1,
        wgt_factor_out=0,
        wgt_factor_in=0
    ))

# ----------------------
# Print the buffers
 
# Printing UOP Buffer
def print_uop_buffer(OP, uop_bgn, uop_end) :
    print(OP, "UOP BUFFER\nACC  INP  WGT\n")
    for i in range(uop_bgn, uop_end):
        print(uop_buffer[i].dst_idx, "  ", uop_buffer[i].src_idx, "  ", uop_buffer[i].wgt_idx, "\n")

# Printing ALU Instruction Buffer      
def print_insn_buffer_ALU(n_insn, OP):
    print(OP, "INSTRUCTIONS\nLP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  OPCODE  IMM\n")
    print(insn_buffer[n_insn].loop_out, "     ", insn_buffer[n_insn].loop_in, "     ", insn_buffer[n_insn].dst_factor_out, "     ", insn_buffer[n_insn].dst_factor_in, "     ", 
          insn_buffer[n_insn].src_factor_out, "     ", 
          insn_buffer[n_insn].src_factor_in, "     ", insn_buffer[n_insn].opcode, "    ", insn_buffer[n_insn].imm)
    
# ----------------------
# Defining GEMM operation

def GEMM(A, B):
#    assert(A.shape[1] == B.shape[0])
    A = np.array(A)
    B = np.array(B)
    return A @ B

# ----------------------
# Pseudo-code GEMM

def insn_GEMM(ACC, WGT, INP):
    for i0 in range(insn_buffer[index_insn].loop_in):
        for i1 in range(insn_buffer[index_insn].loop_out):
            for uop_index in range(insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end):
                X, Y, Z = uop_buffer[uop_index].dst_idx, uop_buffer[uop_index].src_idx, uop_buffer[uop_index].wgt_idx
                dst_idx = i0 * insn_buffer[index_insn].dst_factor_in + i1 * insn_buffer[index_insn].dst_factor_out + X # Index ACC
                inp_idx = i0 * insn_buffer[index_insn].src_factor_in + i1 * insn_buffer[index_insn].src_factor_out + Y # Index INP
                wgt_idx = i0 * insn_buffer[index_insn].wgt_factor_in + i1 * insn_buffer[index_insn].wgt_factor_out + Z # Index WGT
                ACC[dst_idx] += GEMM(INP[inp_idx], WGT[wgt_idx])                                                       # Storage of GEMM(A, B) in ACC
    return ACC

# ----------------------
# Printing the data
# ----------------------

# Printing GEMM UOP Buffer
print_uop_buffer("GEMM", insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end)

# Printing GEMM Instruction Buffer 
print("GEMM INSTRUCTIONS\nLP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  WGT_OUT  WGT_IN\n")
print(insn_buffer[index_insn].loop_out, "     ", insn_buffer[index_insn].loop_in, "     ", insn_buffer[index_insn].dst_factor_out, "     ", insn_buffer[index_insn].dst_factor_in, "     ", 
        insn_buffer[index_insn].src_factor_out, "     ", 
        insn_buffer[index_insn].src_factor_in, "     ", insn_buffer[index_insn].wgt_factor_out, "     ", insn_buffer[index_insn].wgt_factor_in, "\n")

# Printing the Output Matrix
INP_stack = np.vstack(block_input_matrix)       # Stacking the 98 (16 x 16) blocks of A
ACC = np.zeros((inp_height, block_size))        # Initializing the Output Matrix C (49 blocks of size (16 x 16) stacked) with zeros

ACC_GEMM = insn_GEMM(ACC, block_weight_matrix, INP_stack)
#assert(ACC_GEMM[0] == block_output_matrix[0][0])
print("ACC - Output matrix post-GEMM (", ACC_GEMM.shape[0], "x", ACC_GEMM.shape[1], ")")
print(ACC_GEMM)

GEMM UOP BUFFER
ACC  INP  WGT

0    0    0 

0    16    1 

GEMM INSTRUCTIONS
LP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  WGT_OUT  WGT_IN

49       16       16       1       32       1       0       0 

ACC - Output matrix post-GEMM ( 784 x 16 )
[[ 39.  21.  12. ...   0.   0.   0.]
 [ 41.  27.  36. ...   0.   0.   0.]
 [-26. -17.  63. ...   0.   0.   0.]
 ...
 [ 49.  56.  48. ...   0.   0.   0.]
 [-16.   7.  21. ...   0.   0.   0.]
 [ -3.  12.  29. ...   0.   0.   0.]]
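
One way to gain confidence in the UOP and factor values is to replay the same addressing scheme on random data and compare the result with a plain matrix product. The sketch below is self-contained and does not use the notebook's buffers; it assumes the INP tiles are stacked in row-major tile order, as `np.vstack(block_input_matrix)` appears to do above.

```python
import numpy as np

rng = np.random.default_rng(0)
A_pad = rng.integers(-4, 4, size=(784, 32))   # padded input matrix
B_pad = rng.integers(-4, 4, size=(32, 16))    # padded weight matrix

# INP: 98 tiles stacked in row-major tile order, i.e. 1568 vectors of 16 values
INP = np.vstack([A_pad[r:r + 16, c:c + 16]
                 for r in range(0, 784, 16) for c in range(0, 32, 16)])
# WGT: two (16 x 16) tiles
WGT = [B_pad[0:16], B_pad[16:32]]

ACC = np.zeros((784, 16), dtype=np.int64)
uops = [(0, 0, 0), (0, 16, 1)]        # (dst_idx, src_idx, wgt_idx)
for i_out in range(49):               # loop_out, dst_factor_out=16, src_factor_out=32
    for i_in in range(16):            # loop_in, dst_factor_in=1, src_factor_in=1
        for dst, src, wgt in uops:
            ACC[i_out * 16 + i_in + dst] += INP[i_out * 32 + i_in + src] @ WGT[wgt]

matches = np.array_equal(ACC, A_pad @ B_pad)
print(matches)  # True
```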


In [31]:
"""ReLU ACTIVATION"""

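# ReLU is only applied when `useReLU=True` in data_definition/user_configuration.py.
# With `useReLU = False` (as set in the configuration cell above), ACC is left unchanged: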

# ----------------------
# Defining the ALU-RELU UOP buffer

if (len(uop_buffer) < 3 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 3 - ALU (relu)
        dst_idx=0, 
        src_idx=0,
        wgt_idx=0
    ))

# ----------------------
# Defining the ALU-RELU Instruction buffer

index_insn = 6 # Instruction index

if (len(insn_buffer) < index_insn + 1):
    insn_buffer.append(structures_insn_uop.VTAAluInsn( # I6: ALU - MAX IMM 0 (relu)
        opcode=4, # 4-ALU
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=0,
        # Operations
        reset=0, # 0-no, 1-reset
        uop_bgn=3, # UOP 3
        uop_end=4,
        loop_out=49,
        loop_in=16,
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=16,
        dst_factor_in=1, # ACC incremented by 1
        src_factor_out=16,
        src_factor_in=1, # INP incremented by 1
        alu_opcode=1, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
        use_imm=1, # 0-no, 1-yes
        imm=0
    ))

# ----------------------
# Defining RELU operation
def RELU(A):
    if (useReLU):
        A = np.maximum(A, 0)
    return A

# ----------------------
# Pseudo-code ALU RELU

def insn_RELU(ACC):
    for i0 in range(insn_buffer[index_insn].loop_in):
        for i1 in range(insn_buffer[index_insn].loop_out):
            for uop_index in range(insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end):
                X = uop_buffer[uop_index].dst_idx
                dst_idx = i0 * insn_buffer[index_insn].dst_factor_in + i1 * insn_buffer[index_insn].dst_factor_out + X # Index ACC
                ACC[dst_idx] = RELU(ACC[dst_idx]) # For every row of ACC, we do max(0, value) for each value of the row
    return ACC

# ----------------------
# Printing the data
# ----------------------

# Printing ReLU UOP Buffer
print_uop_buffer("RELU", insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end)

# Printing ReLU Instruction Buffer 
print_insn_buffer_ALU(index_insn, "RELU")

# Printing the Output Matrix

ACC_ReLU = insn_RELU(ACC_GEMM)
print("\nACC - Output matrix post-ReLU (", ACC_ReLU.shape[0], "x", ACC_ReLU.shape[1], ")")
print(ACC_ReLU)

RELU UOP BUFFER
ACC  INP  WGT

0    0    0 

RELU INSTRUCTIONS
LP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  OPCODE  IMM

49       16       16       1       16       1       4      0

ACC - Output matrix post-ReLU ( 784 x 16 )
[[ 39.  21.  12. ...   0.   0.   0.]
 [ 41.  27.  36. ...   0.   0.   0.]
 [-26. -17.  63. ...   0.   0.   0.]
 ...
 [ 49.  56.  48. ...   0.   0.   0.]
 [-16.   7.  21. ...   0.   0.   0.]
 [ -3.  12.  29. ...   0.   0.   0.]]
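
The ALU instructions of this example differ only by their `alu_opcode` and by whether the second operand is an immediate. A minimal sketch of such a dispatch is given below; the opcode numbering follows the comments of this notebook (0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL), while the function `alu` itself is hypothetical and not part of `structures_insn_uop.py`.

```python
import numpy as np

ALU_OPS = {
    0: np.minimum,      # MIN
    1: np.maximum,      # MAX
    2: np.add,          # ADD
    3: np.right_shift,  # SHR (arithmetic shift on signed integers)
    4: np.multiply,     # MUL
}

def alu(alu_opcode, dst, src=None, use_imm=False, imm=0):
    """Apply one ALU operation to a vector, using either `src` or the immediate."""
    operand = imm if use_imm else src
    return ALU_OPS[alu_opcode](dst, operand)

row = np.array([39, 21, -26, -17, 0, 63], dtype=np.int32)

relu_row = alu(1, row, use_imm=True, imm=0)    # ReLU = MAX with immediate 0
print(relu_row)                                # [39 21  0  0  0 63]

shifted = alu(3, np.array([80, 7, -8]), use_imm=True, imm=2)
print(shifted)                                 # [20  1 -2]
```

With this definition, ReLU is the MAX operation with an immediate of 0, so every negative value of the vector becomes 0.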


In [32]:
"""AVERAGE POOLING - First ADD"""

# After this step, the amount of relevant data is divided by two (only every other row of ACC holds a useful partial sum).

# ----------------------
# Defining the ADD #1 UOP buffer

if (len(uop_buffer) < 4 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 4 - ALU (first add)
        dst_idx=0, 
        src_idx=1,
        wgt_idx=0
    ))

# ----------------------
# Defining the ADD #1 Instruction buffer

index_insn = 7 # Instruction index

if (len(insn_buffer) < index_insn + 1):
    insn_buffer.append(structures_insn_uop.VTAAluInsn( # I7: ALU - ADD (Average Pooling 1/3)
        opcode=4, # 4-ALU
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=0,
        # Operations
        reset=0, # 0-no, 1-reset
        uop_bgn=4, # UOP 4
        uop_end=5,
        loop_out=1,
        loop_in=392,
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=0,
        dst_factor_in=2, 
        src_factor_out=0,
        src_factor_in=2, 
        alu_opcode=2, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
        use_imm=0, # 0-no, 1-yes
        imm=0
    ))

# ----------------------
# Define ADD operation

def ADD(A, B):
    A = np.array(A)
    B = np.array(B)
    return A + B
        
# ----------------------
# Pseudo-code ALU ADD

def insn_ADD(ACC):
    for i0 in range(insn_buffer[index_insn].loop_in):
        for i1 in range(insn_buffer[index_insn].loop_out):
            for uop_index in range(insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end):
                X, Y = uop_buffer[uop_index].dst_idx, uop_buffer[uop_index].src_idx
                dst_idx = i0 * insn_buffer[index_insn].dst_factor_in + i1 * insn_buffer[index_insn].dst_factor_out + X
                inp_idx = i0 * insn_buffer[index_insn].src_factor_in + i1 * insn_buffer[index_insn].src_factor_out + Y
                ACC[dst_idx] = ADD(ACC[dst_idx], ACC[inp_idx])
    return ACC

# ----------------------
# Printing the data
# ----------------------

# Printing ADD #1 UOP Buffer
print_uop_buffer("ADD #1", insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end)

# Printing ADD #1 Instruction Buffer 
print_insn_buffer_ALU(index_insn, "ADD #1")

# Printing the Output Matrix
ACC_ADD1 = insn_ADD(ACC_ReLU)
print("\nACC - Output matrix post-first ADD (", ACC_ADD1.shape[0], "x", ACC_ADD1.shape[1], ")")
print(ACC_ADD1)

ADD #1 UOP BUFFER
ACC  INP  WGT

0    1    0 

ADD #1 INSTRUCTIONS
LP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  OPCODE  IMM

1       392       0       2       0       2       4      0

ACC - Output matrix post-first ADD ( 784 x 16 )
[[ 80.  48.  48. ...   0.   0.   0.]
 [ 41.  27.  36. ...   0.   0.   0.]
 [ 21.  -6.  94. ...   0.   0.   0.]
 ...
 [ 49.  56.  48. ...   0.   0.   0.]
 [-19.  19.  50. ...   0.   0.   0.]
 [ -3.  12.  29. ...   0.   0.   0.]]
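
One way to read instruction I7: with `loop_in=392`, both `*_factor_in` values equal to 2, and UOP 4 holding `dst_idx=0` and `src_idx=1`, iteration `i` adds ACC row `2i+1` into ACC row `2i`. The sketch below checks that this loop is equivalent to a single strided-slice addition. It runs on random stand-in data rather than the notebook's ACC, and `add_pairs_loop` is a hypothetical helper, not part of the compiler folder.

```python
import numpy as np

def add_pairs_loop(acc, loop_in=392, dst_factor_in=2, src_factor_in=2,
                   uop_dst=0, uop_src=1):
    """Replay I7's index arithmetic on a copy of acc (loop_out=1, so it is omitted)."""
    out = acc.copy()
    for i in range(loop_in):
        dst = i * dst_factor_in + uop_dst
        src = i * src_factor_in + uop_src
        out[dst] += out[src]
    return out

rng = np.random.default_rng(0)
acc = rng.integers(0, 100, size=(784, 16))  # stand-in data, not the notebook's ACC

looped = add_pairs_loop(acc)

sliced = acc.copy()
sliced[0::2] += acc[1::2]  # every even row receives the odd row that follows it

print(np.array_equal(looped, sliced))  # True
```

Odd rows are read but never written, which is why they still show their previous values in the printed matrix.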


In [33]:
"""AVERAGE POOLING - Second ADD"""

# After this step, the number of ACC rows holding relevant data is halved again (divided by 4 in total).

# ----------------------
# Defining the ADD #2 UOP buffer

if (len(uop_buffer) < 5 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 5 - ALU (second add)
        dst_idx=0, 
        src_idx=28,
        wgt_idx=0
    ))

# ----------------------
# Defining the ADD #2 Instruction buffer

index_insn = 8 # Instruction index

if (len(insn_buffer) < index_insn + 1):
    insn_buffer.append(structures_insn_uop.VTAAluInsn( # I8: ALU - ADD (Average Pooling 2/3)
        opcode=4, # 4-ALU
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=0,
        # Operations
        reset=0, # 0-no, 1-reset
        uop_bgn=5, # UOP 5
        uop_end=6,
        loop_out=14,
        loop_in=14,
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=56,
        dst_factor_in=2, 
        src_factor_out=56,
        src_factor_in=2, 
        alu_opcode=2, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
        use_imm=0, # 0-no, 1-yes
        imm=0
    ))

# ----------------------
# Printing the data
# ----------------------

# Printing ADD #2 UOP Buffer
print_uop_buffer("ADD #2", insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end)

# Printing ADD #2 Instruction Buffer 
print_insn_buffer_ALU(index_insn, "ADD #2")

# Printing the Output Matrix
ACC_ADD2 = insn_ADD(ACC_ADD1)
print("\nACC - Output matrix post-second ADD (", ACC_ADD2.shape[0], "x", ACC_ADD2.shape[1], ")")
print(ACC_ADD2)

ADD #2 UOP BUFFER
ACC  INP  WGT

0    28    0 

ADD #2 INSTRUCTIONS
LP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  OPCODE  IMM

14       14       56       2       56       2       4      0

ACC - Output matrix post-second ADD ( 784 x 16 )
[[ 94.  88.  61. ...   0.   0.   0.]
 [ 41.  27.  36. ...   0.   0.   0.]
 [ 22. -10. 132. ...   0.   0.   0.]
 ...
 [ 49.  56.  48. ...   0.   0.   0.]
 [-19.  19.  50. ...   0.   0.   0.]
 [ -3.  12.  29. ...   0.   0.   0.]]
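
The factors of I8 are easier to read against the 28x28 feature map. Assuming the 784 ACC rows follow row-major pixel order (an assumption consistent with 784 = 28 x 28), `dst_factor_in=2` moves two pixels to the right, `dst_factor_out=56` moves two image rows down (2 x 28), and UOP 5's `src_idx=28` points to the pixel directly below the destination. The sketch below lists the first index pairs; `add2_pairs` and `to_pixel` are hypothetical helpers.

```python
WIDTH = 28  # feature-map width after C1 (assumes row-major pixel order in ACC)

def add2_pairs(loop_out=14, loop_in=14, factor_out=56, factor_in=2,
               uop_dst=0, uop_src=28):
    """Yield the (dst, src) ACC row indices touched by I8."""
    for i_out in range(loop_out):
        for i_in in range(loop_in):
            base = i_out * factor_out + i_in * factor_in
            yield base + uop_dst, base + uop_src

def to_pixel(index):
    """Convert a flat ACC row index to (image_row, image_col)."""
    return divmod(index, WIDTH)

pairs = list(add2_pairs())

for dst, src in pairs[:3]:
    print(dst, to_pixel(dst), "<-", src, to_pixel(src))
# 0 (0, 0) <- 28 (1, 0)
# 2 (0, 2) <- 30 (1, 2)
# 4 (0, 4) <- 32 (1, 4)
```

The 196 destinations are exactly the pixels with an even row and an even column, i.e. the top-left corner of each 2x2 pooling window.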


In [34]:
"""AVERAGE POOLING - SHR"""

# This step averages the summed values: a right shift by 2 divides each 4-value sum by 4 (rounded down).

# ----------------------
# Defining the SHR UOP buffer

if (len(uop_buffer) < 6 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 6 - ALU (shift right)
        dst_idx=0, 
        src_idx=0,
        wgt_idx=0
    ))

# ----------------------
# Defining the ALU-SHR Instruction buffer

index_insn = 9 # Instruction index

if (len(insn_buffer) < index_insn + 1):
    insn_buffer.append(structures_insn_uop.VTAAluInsn( # I9: ALU - SHR (Average Pooling 3/3)
        opcode=4, # 4-ALU
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=1, # Ready signal to STORE
        # Operations
        reset=0, # 0-no, 1-reset
        uop_bgn=6, # UOP 6
        uop_end=7,
        loop_out=14,
        loop_in=14,
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=56,
        dst_factor_in=2, 
        src_factor_out=56,
        src_factor_in=2, 
        alu_opcode=3, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
        use_imm=1, # 0-no, 1-yes
        imm=2 # Division by 4 (rounded down)
    ))

# ----------------------
# Defining SHR operation

def SHR(A, IMM):
    for i in range(len(A)): # A is one ACC row (a horizontal vector of 16 values)
        A[i] = int(A[i]) >> IMM # np.float was removed in NumPy 1.24; int() is enough here
    return A

# ----------------------
# Pseudo-code ALU SHR (modifies ACC in place and returns the same array)

def insn_SHR(ACC):
    for i0 in range(insn_buffer[index_insn].loop_in):
        for i1 in range(insn_buffer[index_insn].loop_out):
            for uop_index in range(insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end):
                X = uop_buffer[uop_index].dst_idx
                dst_idx = i0 * insn_buffer[index_insn].dst_factor_in + i1 * insn_buffer[index_insn].dst_factor_out + X
                ACC[dst_idx] = SHR(ACC[dst_idx], insn_buffer[index_insn].imm)
    return ACC

# ----------------------
# Printing the data
# ----------------------

# Printing SHR UOP Buffer
print_uop_buffer("SHR", insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end)

# Printing SHR Instruction Buffer 
print_insn_buffer_ALU(index_insn, "SHR")

# Printing the Output Matrix
ACC_SHR = insn_SHR(ACC_ADD2)
print("\nACC - Output matrix post-SHR (", ACC_SHR.shape[0], "x", ACC_SHR.shape[1], ")")
print(ACC_SHR)

SHR UOP BUFFER
ACC  INP  WGT

0    0    0 

SHR INSTRUCTIONS
LP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  OPCODE  IMM

14       14       56       2       56       2       4      2

ACC - Output matrix post-SHR ( 784 x 16 )
[[ 23.  22.  15. ...   0.   0.   0.]
 [ 41.  27.  36. ...   0.   0.   0.]
 [  5.  -3.  33. ...   0.   0.   0.]
 ...
 [ 49.  56.  48. ...   0.   0.   0.]
 [-19.  19.  50. ...   0.   0.   0.]
 [ -3.  12.  29. ...   0.   0.   0.]]
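
Taken together, I7, I8 and I9 implement a 2x2 average pooling with stride 2. The sketch below replays the three steps with NumPy slices and compares the result with a reshape-based reference. It assumes row-major pixel order in ACC and integer values, and it uses random stand-in data; none of the names come from the compiler folder.

```python
import numpy as np

H = W = 28  # C1 feature-map size
rng = np.random.default_rng(1)
acc = rng.integers(-50, 100, size=(H * W, 16))  # stand-in data with some negatives

work = acc.copy()
work[0::2] += work[1::2]             # ADD #1: add the right-hand neighbour
img = work.reshape(H, W, 16)         # view indexed as (image_row, image_col, lane)
img[0::2, 0::2] += img[1::2, 0::2]   # ADD #2: add the partial sum from the row below
img[0::2, 0::2] >>= 2                # SHR: divide each 4-value sum by 4, rounding down

pooled = img[0::2, 0::2]             # 14 x 14 x 16, stored at ACC rows 56*r + 2*c

# Reference: sum every 2x2 window, then floor-divide by 4
ref = acc.reshape(H // 2, 2, W // 2, 2, 16).sum(axis=(1, 3)) // 4

print(np.array_equal(pooled, ref))  # True
```

The right shift rounds toward negative infinity, which is why the value -10 in the post-second-ADD matrix becomes -3 (not -2) after SHR.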


In [35]:
"""DATA STORAGE FROM SRAM TO DRAM"""

insn_buffer.append(structures_insn_uop.VTAMemInsn( # I10: STORE
    opcode=1, # 0-LOAD, 1-STORE, 3-FINISH
    # DEP FLAG
    pop_prev_dep=1, # Acknowledge COMPUTE ready signal
    pop_next_dep=0,
    push_prev_dep=1, # Ready signal to COMPUTE
    push_next_dep=0,
    # Memory interaction
    buffer_id=4, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
    sram_base=0x0000,
    dram_base=0x00000300,
    unused=0, # UNUSED
    # Operation over the data
    y_size=1,
    x_size=784, # Store 49*16 OUT
    x_stride=784,
    y_pad_top=0,
    y_pad_bottom=0,
    x_pad_left=0,
    x_pad_right=0
))

insn_buffer.append(structures_insn_uop.VTAMemInsn( # I11: NOP-MEMORY-STAGE
    opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
    # DEP FLAG
    pop_prev_dep=0,
    pop_next_dep=0,
    push_prev_dep=0, 
    push_next_dep=1, # Ready signal to COMPUTE
    # Memory interaction
    buffer_id=2, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
    sram_base=0x0000,
    dram_base=0x00000000,
    unused=0, # UNUSED
    # Operation over the data
    y_size=0,
    x_size=0,
    x_stride=0,
    y_pad_top=0,
    y_pad_bottom=0,
    x_pad_left=0,
    x_pad_right=0
))

insn_buffer.append(structures_insn_uop.VTAMemInsn( # I12: NOP-COMPUTE-STAGE
    opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
    # DEP FLAG
    pop_prev_dep=1, # Acknowledge LOAD ready signal
    pop_next_dep=1, # Acknowledge STORE ready signal
    push_prev_dep=0,
    push_next_dep=0,
    # Memory interaction
    buffer_id=0, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
    sram_base=0x0000,
    dram_base=0x00000000,
    unused=0, # UNUSED
    # Operation over the data
    y_size=0,
    x_size=0,
    x_stride=0,
    y_pad_top=0,
    y_pad_bottom=0,
    x_pad_left=0,
    x_pad_right=0
))

insn_buffer.append(structures_insn_uop.VTAMemInsn( # I13: FINISH
    opcode=3, # 0-LOAD, 1-STORE, 3-FINISH
    # DEP FLAG
    pop_prev_dep=0,
    pop_next_dep=0,
    push_prev_dep=0,
    push_next_dep=0,
    # Memory interaction
    buffer_id=0, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
    sram_base=0x0000,
    dram_base=0x00000000,
    unused=0, # UNUSED
    # Operation over the data
    y_size=0,
    x_size=0,
    x_stride=0,
    y_pad_top=0,
    y_pad_bottom=0,
    x_pad_left=0,
    x_pad_right=0
))
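
The dependency flags of I9 to I13 can be sanity-checked by counting tokens. The sketch below assumes the usual VTA arrangement of three modules (LOAD, COMPUTE, STORE) linked by four token queues, and it assumes that a LOAD targeting the UOP or ACC buffer is executed by COMPUTE, which is how the I12 comment reads. The routing function, the queue names (`l2c`, `c2l`, `c2s`, `s2c`) and the tuple layout are illustrative and not taken from the compiler folder.

```python
from collections import Counter

# (name, opcode, buffer_id, pop_prev, pop_next, push_prev, push_next) for I9..I13
INSNS = [
    ("I9 ALU-SHR", 4, None, 0, 0, 0, 1),
    ("I10 STORE",  1, 4,    1, 0, 1, 0),
    ("I11 NOP",    0, 2,    0, 0, 0, 1),
    ("I12 NOP",    0, 0,    1, 1, 0, 0),
    ("I13 FINISH", 3, 0,    0, 0, 0, 0),
]

def module_of(opcode, buffer_id):
    """Assumed routing: STORE -> store, LOAD of WGT/INP -> load, everything else -> compute."""
    if opcode == 1:
        return "store"
    if opcode == 0 and buffer_id in (1, 2):
        return "load"
    return "compute"

# Queue touched by (pop_prev, pop_next, push_prev, push_next) for each module
QUEUES = {
    "load":    (None, "c2l", None, "l2c"),
    "compute": ("l2c", "s2c", "c2l", "c2s"),
    "store":   ("c2s", None, "s2c", None),
}
DELTAS = (-1, -1, +1, +1)  # pops consume a token, pushes produce one

def token_balance(insns):
    """Return pushes minus pops per queue (execution order is ignored)."""
    balance = Counter()
    for _name, opcode, buffer_id, *flags in insns:
        queues = QUEUES[module_of(opcode, buffer_id)]
        for flag, queue, delta in zip(flags, queues, DELTAS):
            if flag and queue is not None:
                balance[queue] += delta
    return balance

balance = token_balance(INSNS)
print({q: balance[q] for q in ("l2c", "c2l", "c2s", "s2c")})
# {'l2c': 0, 'c2l': 0, 'c2s': 0, 's2c': 0}
```

A balance of zero on every queue means each ready signal pushed by these instructions is acknowledged by exactly one pop, so no token is left pending when FINISH is reached. This check covers only I9 to I13; the earlier instructions would have to be added to validate the whole program.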

**How to retrieve the encoded data :** 

Each 32-bit UOP and each 128-bit instruction is encoded as a sequence of bytes, written as pairs of hexadecimal digits.

*Example of encoding with the SHR Instruction :*

Structure used : 

```
class VTAAluInsn(LittleEndianStructure):
    """ALU instruction structure (128-bit)."""
    _pack_ = 1
    _fields_ = [
        ("opcode", c_uint64, 3),
        ("pop_prev_dep", c_uint64, 1),
        ("pop_next_dep", c_uint64, 1),
        ("push_prev_dep", c_uint64, 1),
        ("push_next_dep", c_uint64, 1),
        ("reset", c_uint64, 1),
        ("uop_bgn", c_uint64, 13),
        ("uop_end", c_uint64, 14),
        ("loop_out", c_uint64, 14),
        ("loop_in", c_uint64, 14),
        ("unused", c_uint64, 1),
        ("dst_factor_out", c_uint64, 11),
        ("dst_factor_in", c_uint64, 11),
        ("src_factor_out", c_uint64, 11),
        ("src_factor_in", c_uint64, 11),
        ("alu_opcode", c_uint64, 3), # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL/SHL
        ("use_imm", c_uint64, 1), # 0-NO, 1-YES
        ("imm", c_uint64, 16)
    ]
```

Buffer configuration :

```
insn_buffer.append(structures_insn_uop.VTAAluInsn( # I9: ALU - SHR (Average Pooling 3/3)
        opcode=4,           # >>> 100 [3-bit]
        # DEP FLAG          # >>> 0001
        pop_prev_dep=0,     
        pop_next_dep=0,    
        push_prev_dep=0,    
        push_next_dep=1,    
        # Operations
        reset=0,            # >>> 0
        uop_bgn=6,          # >>> 0000 0000 0011 0
        uop_end=7,          # >>> 0000 0000 0001 11
        loop_out=14,        # >>> 0000 0000 0011 10
        loop_in=14,         # >>> 0000 0000 0011 10
        # UNUSED
        unused=0,           # >>> 0 
        # Index factors
        dst_factor_out=56,  # >>> 0000 0111 000
        dst_factor_in=2,    # >>> 0000 0000 010
        src_factor_out=56,  # >>> 0000 0111 000
        src_factor_in=2,    # >>> 0000 0000 010
        alu_opcode=3,       # >>> 011
        use_imm=1,          # >>> 1
        imm=2               # >>> 0000 0000 0000 0010
    ))
```

==> In hexadecimal, we obtain the following instruction (Big Endian): **I9 : 00 02 b0 04 0e 00 10 38 00 1c 00 70 00 e0 06 44**

In the .bin files (Little Endian), we obtain : **I9 : 44 06 e0 00 70 00 1c 00 38 10 00 0e 04 b0 02 00**
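
The two byte strings above can be reproduced without `ctypes` by packing the fields least-significant-bit first, in the order of `_fields_`. This is a minimal sketch: `pack_fields` is a hypothetical helper, and the LSB-first layout is an assumption that matches the I9 value shown above.

```python
# (field name, bit width, value) in _fields_ order, with the values of I9
I9_FIELDS = [
    ("opcode", 3, 4), ("pop_prev_dep", 1, 0), ("pop_next_dep", 1, 0),
    ("push_prev_dep", 1, 0), ("push_next_dep", 1, 1), ("reset", 1, 0),
    ("uop_bgn", 13, 6), ("uop_end", 14, 7), ("loop_out", 14, 14),
    ("loop_in", 14, 14), ("unused", 1, 0),
    ("dst_factor_out", 11, 56), ("dst_factor_in", 11, 2),
    ("src_factor_out", 11, 56), ("src_factor_in", 11, 2),
    ("alu_opcode", 3, 3), ("use_imm", 1, 1), ("imm", 16, 2),
]

def pack_fields(fields):
    """Pack (name, width, value) triples LSB-first into one integer."""
    word, offset = 0, 0
    for name, width, value in fields:
        if value >> width:
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << offset
        offset += width
    return word, offset

word, total_bits = pack_fields(I9_FIELDS)
little = word.to_bytes(total_bits // 8, "little")

print(total_bits)             # 128
print(little.hex(" "))        # 44 06 e0 00 70 00 1c 00 38 10 00 0e 04 b0 02 00
print(little[::-1].hex(" "))  # 00 02 b0 04 0e 00 10 38 00 1c 00 70 00 e0 06 44
```

The first eleven fields add up to exactly 64 bits, so no field straddles the boundary between the two 64-bit halves of the instruction.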

***BINARY FILES***

To obtain the binary files needed for `functional_simulator`, run the following commands in the *OUTPUT/* directory :

>> This command will generate the UOP and Instruction files in *OUTPUT/* : `python ../compiler/operations_definition/examples/insn_lenet5_conv1_relu_average_pooling.py`

>> To access the data in the files, run the commands :
- For the UOP : `hexdump -C uop_lenet5_conv1_relu_average_pooling.bin > uop_lenet5_bin.txt` 
- For the instructions : `hexdump -C insn_lenet5_conv1_relu_average_pooling.bin > insn_lenet5.txt`

The generated binary files can then be copied into *simulators/functional_simulator/binary_input_files/*
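
As an alternative to `hexdump`, the instruction file can be read back in Python and printed in the four-word layout used for the CHISEL listing (for example `0x0002B004 0E001038 001C0070 00E00644`). The sketch assumes the file is a plain concatenation of 16-byte little-endian instructions, as the I9 example suggests; `read_insn_words` and `format_128bit` are hypothetical helpers, and the demo writes a throwaway file containing only I9.

```python
import tempfile
from pathlib import Path

def read_insn_words(path, insn_bytes=16):
    """Return each instruction stored in the file as a 128-bit integer."""
    data = Path(path).read_bytes()
    if len(data) % insn_bytes:
        raise ValueError(f"{path}: size {len(data)} is not a multiple of {insn_bytes}")
    return [int.from_bytes(data[i:i + insn_bytes], "little")
            for i in range(0, len(data), insn_bytes)]

def format_128bit(word):
    """Format a 128-bit integer as four 32-bit groups, most significant first."""
    text = f"{word:032X}"
    return "0x" + " ".join(text[i:i + 8] for i in range(0, 32, 8))

# Demo on a temporary file that holds only the bytes of I9
i9 = bytes.fromhex("4406e00070001c003810000e04b00200")
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_insn.bin"
    demo.write_bytes(i9)
    words = read_insn_words(demo)

for index, word in enumerate(words):
    print(f"I{index}: {format_128bit(word)}")
# I0: 0x0002B004 0E001038 001C0070 00E00644
```

For the generated file, the same call would be `read_insn_words("insn_lenet5_conv1_relu_average_pooling.bin")`, run from the *OUTPUT/* directory.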

***JSON FILES***

For CHISEL, all of the instructions described above are encoded below :

In [36]:
"""JSON FILE"""

# Output to be copied into a .json file (in our case: 'generated_for_compute.json'). The UOPs must be added as well.

print("Hexadecimal instructions for CHISEL simulation (cycle_accurate_simulator)\n")
for i, insn in enumerate(insn_buffer):
    print(f"\nI{i}:")
    structures_insn_uop.print_hex_128bit(insn)

Hexadecimal instructions for CHISEL simulation (cycle_accurate_simulator)


I0:
0x00000001 00010001 00000040 00000000

I1:
0x00000000 00000810 00200188 002000A2

I2:
0x00000620 06200001 00000004 00000110

I3:
0x00000002 00020001 00000000 800000C0

I4:
0x00000006 00060001 00000040 04000408

I5:
0x00000002 08000810 00200188 00600102

I6:
0x00009002 04000810 00200188 00800304

I7:
0x00002004 00001000 03100008 00A00404

I8:
0x00002004 0E001038 001C0070 00C00504

I9:
0x0002B004 0E001038 001C0070 00E00644

I10:
0x00000310 03100001 0000000C 00000229

I11:
0x00000000 00000000 00000000 00000140

I12:
0x00000000 00000000 00000000 00000018

I13:
0x00000000 00000000 00000000 00000003


The generated .json files can be copied into *simulators/cycle_accurate_simulator/src/test/resources/*