### How to use the compiler folder for the example of LeNet-5

**Introduction to LeNet-5 : one of the first CNNs, useful for image recognition.**

*Architecture :* 
- **Input image :** 32x32 pixels, 1 channel
- **First convolutional layer C1 :** 6 convolutional filters of size 5x5, resulting in 6 feature maps of size 28x28 | ReLU activation
- **Average pooling layer AP2 :** 2x2 kernel with stride = 2, resulting in 6 feature maps of size 14x14

*How to use data_definition :*
**Toolbox :** `numpy` is required for matrix manipulation (`conda install numpy` in a conda environment).

**Defining the size of the matrices :** 
The input image, represented by an input tensor of size 32x32x1 (height x width x channels), goes through C1 to become an output tensor of size 28x28x6.
To run the convolution on the VTA as a GEMM, the tensors are converted into 2D matrices with the Im2row method. This gives an input matrix A (784x25, i.e. 28x28 output positions by 1x5x5 kernel values) and a weight matrix B (25x6), whose product is an output matrix of size 784x6. This conversion is done by ACETONE.
The dimensions of the matrices are obtained using `tensor_matrix_converter.py` (no matrices are generated, only the dimensions):

In [2]:
import numpy as np
import sys
sys.path.append('../compiler/data_definition')
import tensor_matrix_converter
import matrix_generator
import matrix_split
import matrix_multiplication

In [3]:
# To illustrate, let's generate the dimensions of the Input, Weight (post-Im2Row conversion), and Output matrices (after GeMM). 
# For that, the given dimensions of the Input tensor and Kernel are to be input :

"""INPUT TENSOR"""
input_channel = 1
input_height = 32
input_width = 32

"""KERNEL"""
kernel_channel = 6 # Number of filters
kernel_height = 5
kernel_width = 5

"""Computation Parameters (for convolution)"""
stride_height = 1
stride_width = 1
pad_height = 0
pad_width = 0

# Using `tensor_matrix_converter.py`, we can print the dimensions of the Output tensor (post-convolution) :

"""OUTPUT TENSOR"""
output_tensor_height, output_tensor_weight = tensor_matrix_converter.output_dimension(inp_dim=(input_height, input_width), \
                     wgt_dim=(kernel_height, kernel_width), \
                     stride=(stride_height, stride_width), \
                     padding=(pad_height, pad_width))

# Then, we can print the dimensions of the Input and Weight matrices
tensor_matrix_converter.im2row_matrix_dimension(nc=input_channel, nh=input_height, nw=input_width, \
                            mc=kernel_channel, mh=output_tensor_height, mw=output_tensor_weight, \
                            fh=kernel_height, fw=kernel_width, \
                            sh=stride_height, sw=stride_width, \
                            ph= pad_height, pw=pad_width)

# Size of the input matrix
inp_height = output_tensor_height * output_tensor_weight
inp_width = input_channel * kernel_height * kernel_width
# Size of the weight matrix
wgt_height = inp_width
wgt_width = kernel_channel
# Size of the output matrix
out_height = output_tensor_height * output_tensor_weight
out_width = kernel_channel



Input tensor: nc = 1, nh = 32, nw = 32 
Output tensor: mc = 6, mh = 28, mw = 28 
Kernel: fh = 5, fw = 5 
Parameters: stride = (1, 1), pad = (0, 0) 


Input matrix: height = 784, width = 25 
Weight matrix: height = 25, width = 6 
Output matrix: height = 784, width = 6 
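
The dimensions printed above follow from the usual convolution output-size formula. The sketch below recomputes them without the repository code; `conv_output_size` and `im2row_dimensions` are hypothetical helper names used only for illustration, not the functions of `tensor_matrix_converter.py`.

```python
def conv_output_size(inp, kernel, stride=1, pad=0):
    """Output length of a convolution along one axis."""
    return (inp + 2 * pad - kernel) // stride + 1

def im2row_dimensions(nc, nh, nw, mc, fh, fw, sh=1, sw=1, ph=0, pw=0):
    """Return the (rows, cols) of the input, weight and output matrices."""
    mh = conv_output_size(nh, fh, sh, ph)   # output tensor height
    mw = conv_output_size(nw, fw, sw, pw)   # output tensor width
    inp = (mh * mw, nc * fh * fw)           # one row per output position
    wgt = (nc * fh * fw, mc)                # one column per filter
    out = (mh * mw, mc)
    return inp, wgt, out

# LeNet-5 C1: 32x32x1 input, six 5x5 filters, stride 1, no padding
print(im2row_dimensions(nc=1, nh=32, nw=32, mc=6, fh=5, fw=5))
# ((784, 25), (25, 6), (784, 6))
```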




**Configuring the data generation :** 
This step sets whether to randomize the content of the matrices, whether to pad them, whether to apply an activation function (ReLU), which types of files to write or print (JSON, binary), etc.
These options are set in `user_configuration.py`, by switching the parameters to True / False depending on the desired outcome.
*(For more details on the LeNet-5 architecture and parameter choices, see the README.)*

*For example, these parameters initialise the 784x25 input matrix A and 25x6 weight matrix B, so that their content is randomized.*

isInitRandom = True
A_row = 784
A_col = 25
B_col = 6

*The VTA multiplies square 16x16 matrices, and a ReLU activation is then applied to the result :*

block_size = 16
isSquare = True
useReLU = True

*We want JSON files as outputs, so :*

doWriteBinaryFile = False
doWriteJSON = True

In [4]:
"""MATRIX GENERATION"""
# Matrices initialised with random value? (True / False)
isInitRandom = True
# If yes, random_bound limits the value range (int8 = [-128; 127] -> random_bound = 128)
random_bound = 4

"""COMPUTATION SPECIFICATION"""
# The size of the square matrix multiplication (multiply two block_size x block_size square matrices together)
block_size = 16

# Use square matrix or not
isSquare = True

# Compute the non-padded matrix? (True / False)
doMultiplyNonPadded = False

# C matrix option
# Reduction from int16 to int8: useClip (True / False)
# => True: if x > 0: clip => min(127, x)
# => False: Truncate the MSB
useClip = False

# Apply ReLU on the result
useReLU = False


"""PROMPTING AND DUMPING FILES FEATURES"""
# Print the data (True / False)
doPrint = True

# Write matrices in binary files in OUTPUT dir (True / False)
doWriteBinaryFile = False

# Write a JSON file for CHISEL Compute in OUTPUT dir (True / False)
doWriteJSON = True
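
The two reductions selected by `useClip` behave very differently on values outside the int8 range. The sketch below is only an illustration with numpy: it assumes that clipping saturates on both sides ([-128; 127]) and that truncation keeps the 8 low-order bits, which may differ in detail from the repository code.

```python
import numpy as np

acc = np.array([-300, -129, -1, 0, 127, 128, 300], dtype=np.int16)

# useClip = True (assumed): saturate to the int8 range
clipped = np.clip(acc, -128, 127).astype(np.int8)

# useClip = False (assumed): drop the high-order byte, values wrap around
truncated = acc.astype(np.int8)

print(clipped)    # [-128 -128   -1    0  127  127  127]
print(truncated)  # [ -44  127   -1    0  127 -128   44]
```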

**Generating the data :**
The program `main_matrix_generator.py` can generate .bin (binary) files for the *functional_simulator* and .json files for the *cycle_accurate_simulator* (using CHISEL). The files are generated in the *standalone-vta/compiler_output/* directory.
It calls functions from several other programs : 
- `matrix_generator.py` : generates the input and weight matrices (A of size 784x25 and B of size 25x6) according to `user_configuration.py`: the number of rows (height) and columns (width) of each matrix, the padding, and whether the content is randomized or filled with 0s. A and B are padded into 784x32 and 32x16 matrices so that they can be split into 16x16 blocks.
- `matrix_split.py` : splits A and B into square 16x16 sub-matrices, as required by the VTA (it only multiplies matrices of this size).
- `matrix_multiplication.py` : performs the block matrix multiplication. A_block_i (16x16) and B_block_j (16x16) are multiplied to obtain an output sub-matrix (also 16x16). If ReLU is enabled, it is also applied to every value of the output matrices.
- `json_generator.py` : rather than outputting a binary file, it generates a .json file in which the data from the output matrices are written explicitly in hexadecimal.
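
The padding, splitting and block multiplication steps listed above can be sketched with plain numpy. This is a minimal illustration under assumptions, not the repository implementation: `pad_to_block` and `split_blocks` are hypothetical helpers, and the padding is assumed to be zeros appended at the bottom and right.

```python
import numpy as np

BLOCK = 16

def pad_to_block(m, block=BLOCK):
    """Zero-pad a matrix so both dimensions are multiples of `block`."""
    rows = -m.shape[0] % block
    cols = -m.shape[1] % block
    return np.pad(m, ((0, rows), (0, cols)))

def split_blocks(m, block=BLOCK):
    """Return a grid (list of lists) of block x block sub-matrices."""
    return [[m[r:r + block, c:c + block] for c in range(0, m.shape[1], block)]
            for r in range(0, m.shape[0], block)]

rng = np.random.default_rng(0)
A = rng.integers(-4, 4, size=(784, 25)).astype(np.int32)
B = rng.integers(-4, 4, size=(25, 6)).astype(np.int32)

A_pad, B_pad = pad_to_block(A), pad_to_block(B)    # (784, 32) and (32, 16)
A_blk, B_blk = split_blocks(A_pad), split_blocks(B_pad)

# Block product: C[i][j] = sum over k of A[i][k] @ B[k][j]
C = np.block([[sum(A_blk[i][k] @ B_blk[k][j] for k in range(len(B_blk)))
               for j in range(len(B_blk[0]))]
              for i in range(len(A_blk))])

print(A_pad.shape, B_pad.shape, C.shape)
# The zero padding does not change the useful part of the result
print(np.array_equal(C[:784, :6], A @ B))
```

Because the padding only adds zeros, the top-left 784x6 part of the block product equals the product of the unpadded matrices.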

In [5]:
#Generate the matrix A and B
input_matrix = matrix_generator.matrix_int8_creation(n_row=inp_height, n_col=inp_width, isInitRandom=isInitRandom, random_bound=random_bound)
weight_matrix = matrix_generator.matrix_int8_creation(n_row=inp_width, n_col=wgt_width, isInitRandom=isInitRandom, random_bound=random_bound)

print("Input Matrix (",inp_height, "x", inp_width,") :\n", input_matrix)
print("Weight Matrix (",inp_width, "x", wgt_width,") :\n", weight_matrix)

Input Matrix ( 784 x 25 ) :
 [[-1 -1  1 ... -1 -3  1]
 [ 1 -3 -1 ...  2  2  2]
 [-3 -3  2 ... -3  0 -4]
 ...
 [ 0 -1  0 ...  2  1 -4]
 [-1  1  0 ...  0 -2 -1]
 [-2  1 -3 ...  1 -4 -1]]
Weight Matrix ( 25 x 6 ) :
 [[-2  1  1  2  0 -2]
 [ 2  1  2  2 -4 -4]
 [-3  0 -4 -1  1 -4]
 [ 0 -4  0  2 -2 -1]
 [-1  0  2  2 -1  0]
 [ 0  1 -4  2  1  0]
 [ 1  1 -2  1  2  1]
 [-3 -1 -3 -1 -4 -3]
 [ 1  0  0 -3  1 -4]
 [ 2 -2 -1 -4 -1  0]
 [-4  2  0  2 -1 -1]
 [ 0 -1 -3 -2 -3 -3]
 [-4 -3  2  2 -2  1]
 [-2  0 -1  2  1  2]
 [-1 -2 -2 -3 -3  2]
 [ 2  0  2  1 -4  1]
 [ 2 -4 -3 -4  0  1]
 [-4 -1 -2 -4 -4  0]
 [-4  0  1 -2 -1 -2]
 [-1 -3 -3  1  2  1]
 [-1  2 -3  2 -2 -4]
 [-4 -3  1  2  0 -3]
 [ 2 -1  1  2 -1  0]
 [-4 -2  2 -1 -3 -2]
 [-1  2  2 -3 -3  0]]


In [6]:
# Padding the matrices so their dimensions can be divided by 16

input_matrix_padded = matrix_generator.matrix_padding(input_matrix)
weight_matrix_padded = matrix_generator.matrix_padding(weight_matrix)

print("Padded Input Matrix (",input_matrix_padded.shape[0], "x", input_matrix_padded.shape[1],") :\n", input_matrix_padded)
print("Padded Weight Matrix (",weight_matrix_padded.shape[0], "x", weight_matrix_padded.shape[1],") :\n", weight_matrix_padded)

Padded Input Matrix ( 784 x 32 ) :
 [[-1 -1  1 ...  0  0  0]
 [ 1 -3 -1 ...  0  0  0]
 [-3 -3  2 ...  0  0  0]
 ...
 [ 0 -1  0 ...  0  0  0]
 [-1  1  0 ...  0  0  0]
 [-2  1 -3 ...  0  0  0]]
Padded Weight Matrix ( 32 x 16 ) :
 [[-2  1  1  2  0 -2  0  0  0  0  0  0  0  0  0  0]
 [ 2  1  2  2 -4 -4  0  0  0  0  0  0  0  0  0  0]
 [-3  0 -4 -1  1 -4  0  0  0  0  0  0  0  0  0  0]
 [ 0 -4  0  2 -2 -1  0  0  0  0  0  0  0  0  0  0]
 [-1  0  2  2 -1  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  1 -4  2  1  0  0  0  0  0  0  0  0  0  0  0]
 [ 1  1 -2  1  2  1  0  0  0  0  0  0  0  0  0  0]
 [-3 -1 -3 -1 -4 -3  0  0  0  0  0  0  0  0  0  0]
 [ 1  0  0 -3  1 -4  0  0  0  0  0  0  0  0  0  0]
 [ 2 -2 -1 -4 -1  0  0  0  0  0  0  0  0  0  0  0]
 [-4  2  0  2 -1 -1  0  0  0  0  0  0  0  0  0  0]
 [ 0 -1 -3 -2 -3 -3  0  0  0  0  0  0  0  0  0  0]
 [-4 -3  2  2 -2  1  0  0  0  0  0  0  0  0  0  0]
 [-2  0 -1  2  1  2  0  0  0  0  0  0  0  0  0  0]
 [-1 -2 -2 -3 -3  2  0  0  0  0  0  0  0  0  0  0]
 [ 2  0

In [7]:
# Splitting the matrices into 16 x 16 matrices and displaying the first block (for each matrix A & B) that would be obtained using `matrix_split.py`

block_input_matrix, input_block_col = matrix_split.matrix_splitting(input_matrix_padded)
block_weight_matrix, weight_block_col = matrix_split.matrix_splitting(weight_matrix_padded)

print("First Block Input Matrix (",block_input_matrix[0].shape[0], "x", block_input_matrix[0].shape[1],") :\n", block_input_matrix[0])
print("First Block Weight Matrix (",block_weight_matrix[0].shape[0], "x", block_weight_matrix[0].shape[1],") :\n", block_weight_matrix[0])

First Block Input Matrix ( 16 x 16 ) :
 [[-1 -1  1 -4  1 -2 -2 -4  1  0 -2  1 -2  0  0  1]
 [ 1 -3 -1 -4 -2  0 -1 -4  0 -3  0 -1  1 -2 -3  0]
 [-3 -3  2 -2  2 -3  0 -1 -1 -4 -3 -2 -4 -4  2 -1]
 [ 0  0  2 -2  0 -2 -3 -1  1 -3 -1  1  2  0 -2 -4]
 [ 0  1 -2 -4  2  0  0 -2 -3 -2 -2  0  2  0 -2 -4]
 [ 1  2 -2  1 -1 -4 -3 -1  1 -3  2  1  1  1  1 -1]
 [-4 -4  1  1  2  1 -4  2  0  2 -1  0 -1 -1  0  1]
 [-2  2  2  1 -1  0 -2 -2  0 -4 -4  1  2 -3  0  0]
 [-1  0  0  1 -3 -4 -3 -1  0 -1 -2  0  2  0  0  2]
 [-1 -3 -2 -4 -2 -2 -3 -3 -4  0 -3  0 -3  0 -3  0]
 [-4 -1  2 -1  2  0  0 -3  2  1 -4 -4  1 -1 -4  0]
 [-3 -3  2  0 -2  1  0 -4 -3 -1 -1 -3  0  0 -2  2]
 [-1 -3 -1  0  2 -3  2  2  0  2 -3 -2 -2 -3  0  1]
 [-2  1 -2 -1 -1 -4 -2 -3 -1 -2  1  2 -2  0  2  1]
 [-1  2  0 -3  0 -4 -1  2  0  0 -1 -3 -2 -4  1  2]
 [-4  0 -3 -4 -2 -2 -3 -2  1 -3  2  1  2  0  0  2]]
First Block Weight Matrix ( 16 x 16 ) :
 [[-2  1  1  2  0 -2  0  0  0  0  0  0  0  0  0  0]
 [ 2  1  2  2 -4 -4  0  0  0  0  0  0  0  0  0  0]


In [20]:
# Those two block matrices we've just obtained are then multiplied using the VTA

block_output_matrix, combinations = matrix_multiplication.block_matrix_multiply(block_input_matrix, block_weight_matrix, input_block_col, weight_block_col)

print("First Block Output Matrix (",block_output_matrix[0].shape[0], "x", block_output_matrix[0].shape[1],") :\n", block_output_matrix[0])

First Block Output Matrix ( 16 x 16 ) :
 [[ 51  40  20 -34  34  32   0   0   0   0   0   0   0   0   0   0]
 [ 12  26  37   7  41  43   0   0   0   0   0   0   0   0   0   0]
 [ 46  -6 -15 -33  61  54   0   0   0   0   0   0   0   0   0   0]
 [-37   4  13  23  22 -38   0   0   0   0   0   0   0   0   0   0]
 [ 17  18   2  21  33  24   0   0   0   0   0   0   0   0   0   0]
 [  8  17  57   5 -21  -2   0   0   0   0   0   0   0   0   0   0]
 [  6 -42 -32  -3   1   0   0   0   0   0   0   0   0   0   0   0]
 [ 38  12  24  16   7 -10   0   0   0   0   0   0   0   0   0   0]
 [ 26  -8  32  20   9  21   0   0   0   0   0   0   0   0   0   0]
 [ 53  40  43  -1  40  46   0   0   0   0   0   0   0   0   0   0]
 [ 40  23  46  -7  45  30   0   0   0   0   0   0   0   0   0   0]
 [  4  12  29  39  53  37   0   0   0   0   0   0   0   0   0   0]
 [ 45  -9 -12 -23  10  22   0   0   0   0   0   0   0   0   0   0]
 [ 44  -7   4 -12  -9  33   0   0   0   0   0   0   0   0   0   0]
 [ 49  19  11 -14   1

In [6]:
# Using the example of LeNet-5 first convolutional layer C1 :
# So that we can then write the data into a .json file

%run ../compiler/data_definition/main_matrix_generator.py examples.data_lenet5_conv1

Binary files successfully generated.
JSON file successfully generated.

 INITIAL MATRICES:
A_matrix: ((h, w) = (784, 25)) 
 [[-1  2 -4 ...  2  2 -1]
 [ 1 -1 -2 ...  2 -4  1]
 [ 2 -4  2 ...  0 -2  1]
 ...
 [ 0 -2 -1 ...  2 -1  1]
 [-2  1 -2 ...  2  2 -1]
 [ 0  2  0 ...  2 -1  0]]

 x 
 B_matrix: ((h, w) = (25, 6)) 
 [[-3 -1 -3 -4 -4 -2]
 [-2  0  1  2  2  2]
 [ 2 -2  1 -4  2 -1]
 [ 0 -2  0  0 -3 -4]
 [-4  2 -1 -1  1 -3]
 [ 1 -4 -3  0  0 -2]
 [ 1 -2 -4  1  1 -1]
 [-1  1  0  2  1 -4]
 [ 2  1 -3  0 -2  2]
 [-3 -1  1  1 -1 -3]
 [-2 -1  1 -3  0  0]
 [ 2 -1  1 -2  2 -3]
 [-2  0 -1 -1 -3  2]
 [ 2 -1  2  0  1 -4]
 [-1  2 -1 -4  1 -2]
 [ 1  0  0 -1 -4 -4]
 [-4 -4  0  1 -1  2]
 [-4  1 -3 -1 -2  2]
 [ 0 -4 -4 -1  0  0]
 [-1  2 -4  2  0 -4]
 [ 0 -4 -1  1 -1  1]
 [ 0  2 -2  0 -3 -1]
 [-4 -2 -2 -2  2 -4]
 [-1 -1  2 -3 -2 -4]
 [-1  0  0 -4  2  0]]


 X_matrix: ((h, w) = (1, 1)) 
 [[-3]]



 PADDED MATRICES:
A_padded: ((h, w) = (784, 32)) 
 [[-1  2 -4 ...  0  0  0]
 [ 1 -1 -2 ...  0  0  0]
 [ 2 -4  2 ..

*How to use operations_definition :*

**Objective :** To use the VTA simulators, instructions have to be generated (in .json and .bin files) so they can be run using Scala and/or CHISEL. The following programs use the data obtained from 'data_definition' to generate the instructions for each operation the VTA needs to perform (load, GeMM, ALU operations such as ReLU, store, reset, etc.).

**Generating the instructions :** *(To be completed: reuse the same matrices as in the previous part (and print them), add the instructions for the examples in the examples folder, and describe how the VTA operates: how it receives the information, in what order, what each instruction does and how many times.)*

On the example of LeNet-5's first convolutional layer (GeMM), followed by ReLU and average pooling (the aim is to reduce the size of the output matrix after GeMM) :
At this point, we have (16 x 16) block INP matrices *Ai* and (16 x 16) block WGT matrices *Bi*. To execute **GeMM**, each *Ai* is split into (16 x 1) vectors (one row of 16 values each). Each vector is multiplied by a WGT block, which gives one (16 x 1) vector of the block OUTPUT matrix *ACCi*. These vectors are reassembled into (16 x 16) blocks.
We then apply **ReLU** to *ACCi* (each value x of the matrices becomes max(0, x)). Average pooling is then applied to the resulting block matrices, using **2 ADD** and **1 SHR** (a right shift by 2 bits, i.e. each sum is divided by 4).

- **LOAD (128-bit):** 
- **ALU (128-bit) :** ReLU, SHR, ADD 1 & 2
- **GEMM (128-bit) :**
- **STORE (128-bit) :**
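
To see why average pooling reduces to two additions and one shift, here is a small single-channel sketch on a 4x4 map (an illustration only, independent of the VTA buffers): the first ADD sums each pair of horizontal neighbours, the second ADD sums each pair of vertical neighbours, and the SHR by 2 divides the four-value sum by 4, rounding down.

```python
import numpy as np

x = np.arange(16, dtype=np.int32).reshape(4, 4)

s = x.copy()
s[:, 0::2] += s[:, 1::2]      # first ADD: even columns += right neighbour
s[0::2, :] += s[1::2, :]      # second ADD: even rows += row below
pooled = s[0::2, 0::2] >> 2   # SHR by 2: sum of 4 values divided by 4

print(pooled)
# [[ 2  4]
#  [10 12]]
```

The exact 2x2 means are 2.5, 4.5, 10.5 and 12.5; the shift rounds them down.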

*Using the program `insn_lenet5_conv1_relu_average_pooling.py` as an example on how to generate the data in binary and JSON files :*

In [3]:
"""CONFIGURATION"""
# PACKAGE IMPORT
# --------------
import os
import sys

# Parent folder
sys.path.append('../compiler/operations_definition')
import structures_insn_uop
#sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

# UOP DEFINITION
# --------------
# Define empty UOP buffer
uop_buffer = []

# INSTRUCTION DEFINITION
# ----------------------
# Define empty instruction buffer
insn_buffer = []

In [25]:
# Splitting (16 x 16) block INP matrices into (16 x 1) vectors

# Input Matrix INP
print("First vector of first block of INP matrix (", np.shape(block_input_matrix[0][0])[0], " x ", 1, ")")
print(block_input_matrix[0][0], "A@0")

# Weight Matrix WGT
print("x \nFirst block of WGT matrix (", block_weight_matrix[0].shape[0], " x ", block_weight_matrix[0].shape[1], ")")
print(block_weight_matrix[0], "B@0")

# Output Matrix ACC
print("= \nFirst vector of first block of ACC (", np.shape(block_output_matrix[0][0])[0], " x ", 1, ")")
print(block_output_matrix[0][0], "C@0")

First vector of first block of INP matrix ( 16  x  1 )
[-1 -1  1 -4  1 -2 -2 -4  1  0 -2  1 -2  0  0  1] A@0
x 
First block of WGT matrix ( 16  x  16 )
[[-2  1  1  2  0 -2  0  0  0  0  0  0  0  0  0  0]
 [ 2  1  2  2 -4 -4  0  0  0  0  0  0  0  0  0  0]
 [-3  0 -4 -1  1 -4  0  0  0  0  0  0  0  0  0  0]
 [ 0 -4  0  2 -2 -1  0  0  0  0  0  0  0  0  0  0]
 [-1  0  2  2 -1  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  1 -4  2  1  0  0  0  0  0  0  0  0  0  0  0]
 [ 1  1 -2  1  2  1  0  0  0  0  0  0  0  0  0  0]
 [-3 -1 -3 -1 -4 -3  0  0  0  0  0  0  0  0  0  0]
 [ 1  0  0 -3  1 -4  0  0  0  0  0  0  0  0  0  0]
 [ 2 -2 -1 -4 -1  0  0  0  0  0  0  0  0  0  0  0]
 [-4  2  0  2 -1 -1  0  0  0  0  0  0  0  0  0  0]
 [ 0 -1 -3 -2 -3 -3  0  0  0  0  0  0  0  0  0  0]
 [-4 -3  2  2 -2  1  0  0  0  0  0  0  0  0  0  0]
 [-2  0 -1  2  1  2  0  0  0  0  0  0  0  0  0  0]
 [-1 -2 -2 -3 -3  2  0  0  0  0  0  0  0  0  0  0]
 [ 2  0  2  1 -4  1  0  0  0  0  0  0  0  0  0  0]] B@0
= 
First vector of first bl

In [None]:
"""GEMM"""

# Generating the instructions for the GeMM, using A vectorized and B.

# ----------------------
# Defining the GEMM UOP buffer

uop_buffer.append(structures_insn_uop.VTAUop( # UOP 1 - GEMM 0
    dst_idx=0, 
    src_idx=0,
    wgt_idx=0
))

uop_buffer.append(structures_insn_uop.VTAUop( # UOP 2 - GEMM 1
    dst_idx=0, 
    src_idx=16,
    wgt_idx=1
))

# ----------------------
# Defining the GEMM Instruction buffer

insn_buffer.append(structures_insn_uop.VTAGemInsn( # I5: GEMM
    opcode=2, # 2-GEMM
    # DEP FLAG
    pop_prev_dep=0,
    pop_next_dep=0,
    push_prev_dep=0,
    push_next_dep=0, 
    # Operations
    reset=0, # 0-no, 1-reset
    uop_bgn=1, # UOP 1 + UOP 2
    uop_end=3,
    loop_out=49,
    loop_in=16,
    # UNUSED
    unused=0, # UNUSED
    # Index factors
    dst_factor_out=16,
    dst_factor_in=1,
    src_factor_out=32,
    src_factor_in=1,
    wgt_factor_out=0,
    wgt_factor_in=0
))

# Print the buffers

def print_uop_buffer(uop_buffer, OP):
    print(OP, "UOP BUFFER")
    for uop in uop_buffer:
        print(uop)

# ----------------------
# Defining GEMM operation

def GEMM(A, B):
    A = np.array(A)
    B = np.array(B)
    return A @ B

# ----------------------
# Pseudo-code GEMM

def insn_GEMM(ACC, WGT, INP):
    for i0 in range(loop_in):
        for i1 in range(loop_out):
#            for uop_index in range(uop_bgn, uop_end):
#                X, Y, Z = uop_buffer[uop_index]
                X, Y, Z = uop_buffer1
                dst_idx = i0 * dst_factor_in + i1 * dst_factor_out + X # Index ACC
                inp_idx = i0 * src_factor_in + i1 * src_factor_out + Y # Index INP
                wgt_idx = i0 * wgt_factor_in + i1 * wgt_factor_out + Z # Index WGT
                ACC[dst_idx] += GEMM(INP[inp_idx], WGT[wgt_idx])       # Storage of GEMM(A, B) in ACC
    return ACC

# ----------------------
# Printing the data

ACC_GEMM = insn_GEMM(ACC, WGT, INP) # ACC, WGT and INP still have to be defined from the blocks
print("ACC - Output matrix post-GEMM (", ACC_GEMM.shape[0], "x", ACC_GEMM.shape[1], ")")
print(ACC_GEMM)
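
The GEMM loop nest can be checked on its own with a self-contained sketch. It replays the loop counts, index factors and the two UOPs of the instruction above on random padded matrices. The buffer layout is an assumption made for this illustration: INP is taken to hold the 16x16 blocks of A one after another (row by row over the blocks), one 16-value vector per index, and UOPs are plain tuples rather than `VTAUop` objects.

```python
import numpy as np

rng = np.random.default_rng(0)
BLOCK = 16
A = rng.integers(-4, 4, size=(784, 32)).astype(np.int32)   # padded input matrix
B = rng.integers(-4, 4, size=(32, 16)).astype(np.int32)    # padded weight matrix

# Assumed layout: blocks of A stored consecutively, 16 vectors per block
INP = np.concatenate([A[r:r + BLOCK, c:c + BLOCK]
                      for r in range(0, 784, BLOCK)
                      for c in range(0, 32, BLOCK)])        # shape (1568, 16)
WGT = [B[0:16], B[16:32]]                                   # two 16x16 blocks
ACC = np.zeros((784, BLOCK), dtype=np.int32)

uops = [(0, 0, 0), (0, 16, 1)]    # (dst_idx, src_idx, wgt_idx) of UOP 1 and UOP 2
loop_out, loop_in = 49, 16
dst_factor_out, dst_factor_in = 16, 1
src_factor_out, src_factor_in = 32, 1
wgt_factor_out, wgt_factor_in = 0, 0

for i_out in range(loop_out):
    for i_in in range(loop_in):
        for dst, src, wgt in uops:
            d = i_out * dst_factor_out + i_in * dst_factor_in + dst
            s = i_out * src_factor_out + i_in * src_factor_in + src
            w = i_out * wgt_factor_out + i_in * wgt_factor_in + wgt
            ACC[d] += INP[s] @ WGT[w]   # vector (16) x block (16x16)

print(np.array_equal(ACC, A @ B))
```

Under this layout, the two UOPs accumulate the contributions of the two column blocks of A, so ACC ends up holding the full 784x16 product.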

In [34]:
"""ReLU ACTIVATION"""

# In data_definition/user_configuration.py, if `useReLU=True` :

ACC = block_output_matrix[0]
if (useReLU):
        ACC = np.maximum(ACC, 0)
print("ACC - Output matrix post-ReLU (", ACC.shape[0], "x", ACC.shape[1], ")")
print(ACC)

# ----------------------
# Defining the ALU-RELU UOP buffer

uop_buffer.append(structures_insn_uop.VTAUop( # UOP 3 - ALU (relu)
    dst_idx=0, 
    src_idx=0,
    wgt_idx=0
))

# ----------------------
# Defining the ALU-RELU Instruction buffer

insn_buffer.append(structures_insn_uop.VTAAluInsn( # I6: ALU - MAX IMM 0 (relu)
    opcode=4, # 4-ALU
    # DEP FLAG
    pop_prev_dep=0,
    pop_next_dep=0,
    push_prev_dep=0,
    push_next_dep=0,
    # Operations
    reset=0, # 0-no, 1-reset
    uop_bgn=3, # UOP 3
    uop_end=4,
    loop_out=49,
    loop_in=16,
    # UNUSED
    unused=0, # UNUSED
    # Index factors
    dst_factor_out=16,
    dst_factor_in=1, # ACC incremented by 1
    src_factor_out=16,
    src_factor_in=1, # INP incremented by 1
    alu_opcode=1, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
    use_imm=1, # 0-no, 1-yes
    imm=0
))

# ----------------------
# Defining RELU operation

def RELU(A):
    if (useReLU):
        A = np.maximum(A, 0)
    return A

# ----------------------
# Pseudo-code ALU RELU

def insn_RELU(ACC):
    for i0 in range(loop_in):
        for i1 in range(loop_out):
#            for uop_index in range(uop_bgn, uop_end):
#                X, Y = uop_buffer[uop_index]
                X, Y, Z = uop_buffer1 # uop_buffer1 is a (dst_idx, src_idx, wgt_idx) tuple
                dst_idx = i0 * dst_factor_in + i1 * dst_factor_out + X
                ACC[dst_idx] = RELU(ACC[dst_idx])
    return ACC

# ----------------------
# Printing the data

ACC_ReLU = insn_RELU(ACC_GEMM)
print("ACC - Output matrix post-ReLU (", ACC_ReLU.shape[0], "x", ACC_ReLU.shape[1], ")")
print(ACC_ReLU)

ACC - Output matrix post-ReLU ( 16 x 16 )
[[51 40 20  0 34 32  0  0  0  0  0  0  0  0  0  0]
 [12 26 37  7 41 43  0  0  0  0  0  0  0  0  0  0]
 [46  0  0  0 61 54  0  0  0  0  0  0  0  0  0  0]
 [ 0  4 13 23 22  0  0  0  0  0  0  0  0  0  0  0]
 [17 18  2 21 33 24  0  0  0  0  0  0  0  0  0  0]
 [ 8 17 57  5  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 6  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0]
 [38 12 24 16  7  0  0  0  0  0  0  0  0  0  0  0]
 [26  0 32 20  9 21  0  0  0  0  0  0  0  0  0  0]
 [53 40 43  0 40 46  0  0  0  0  0  0  0  0  0  0]
 [40 23 46  0 45 30  0  0  0  0  0  0  0  0  0  0]
 [ 4 12 29 39 53 37  0  0  0  0  0  0  0  0  0  0]
 [45  0  0  0 10 22  0  0  0  0  0  0  0  0  0  0]
 [44  0  4  0  0 33  0  0  0  0  0  0  0  0  0  0]
 [49 19 11  0  1  0  0  0  0  0  0  0  0  0  0  0]
 [54 17 31 10 42 57  0  0  0  0  0  0  0  0  0  0]]


In [45]:
"""AVERAGE POOLING - First ADD"""

# ----------------------
# Defining the ADD #1 UOP buffer

uop_buffer.append(structures_insn_uop.VTAUop( # UOP 4 - ALU (first add)
    dst_idx=0, 
    src_idx=1,
    wgt_idx=0
))

uop_buffer1 = 0, 1, 0

# ----------------------
# Defining the ADD #1 Instruction buffer

insn_buffer.append(structures_insn_uop.VTAAluInsn( # I7: ALU - ADD (Average Pooling 1/3)
    opcode=4, # 4-ALU
    # DEP FLAG
    pop_prev_dep=0,
    pop_next_dep=0,
    push_prev_dep=0,
    push_next_dep=0,
    # Operations
    reset=0, # 0-no, 1-reset
    uop_bgn=4, # UOP 4
    uop_end=5,
    loop_out=1,
    loop_in=392,
    # UNUSED
    unused=0, # UNUSED
    # Index factors
    dst_factor_out=0,
    dst_factor_in=2, 
    src_factor_out=0,
    src_factor_in=2, 
    alu_opcode=2, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
    use_imm=0, # 0-no, 1-yes
    imm=0
))

uop_bgn=4 # UOP 4
uop_end=5
loop_out=1
loop_in=392
# UNUSED
unused=0 # UNUSED
# Index factors
dst_factor_out=0
dst_factor_in=2
src_factor_out=0
src_factor_in=2

# ----------------------
# Define ADD operation

def ADD(A, B):
    A = np.array(A)
    B = np.array(B)
    return A + B
        
# ----------------------
# Pseudo-code ALU ADD

def insn_ADD(ACC):
    for i0 in range(loop_in):
        for i1 in range(loop_out):
#            for uop_index in range(uop_bgn, uop_end):
#                X, Y = uop_buffer[uop_index]
                X, Y, Z = uop_buffer1
                dst_idx = i0 * dst_factor_in + i1 * dst_factor_out + X
                print(dst_idx)
                inp_idx = i0 * src_factor_in + i1 * src_factor_out + Y
                ACC[dst_idx] = ADD(ACC[dst_idx], ACC[inp_idx])
    return ACC

# ----------------------
# Printing the data

ACC_ADD1 = insn_ADD(ACC_ReLU)
print("ACC - Output matrix post-first ADD (", ACC_ADD1.shape[0], "x", ACC_ADD1.shape[1], ")") # TODO: update the dimensions
print(ACC_ADD1)

0
2
4
6
8
10
12
14
16


IndexError: index 16 is out of bounds for axis 0 with size 16
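
The `IndexError` above is consistent with the sizes involved: `ACC` is a single (16 x 16) block here, while the first ADD instruction walks through 392 iterations with a step of 2. A quick way to check how many ACC vectors an instruction needs is to compute the highest index it can reach; `max_index` is a hypothetical helper written for this illustration.

```python
def max_index(loop_out, loop_in, factor_out, factor_in, base_idx):
    """Highest buffer index reached by a two-level VTA loop nest."""
    return (loop_out - 1) * factor_out + (loop_in - 1) * factor_in + base_idx

# First ADD: loop_out=1, loop_in=392, src factors (0, 2), src_idx=1
needed = max_index(1, 392, 0, 2, 1) + 1
print(needed)  # 784
```

The instruction therefore expects the whole (784 x 16) accumulator, not one (16 x 16) block.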

In [None]:
"""AVERAGE POOLING - Second ADD"""

# ----------------------
# Defining the ADD #2 UOP buffer

uop_buffer.append(structures_insn_uop.VTAUop( # UOP 5 - ALU (second add)
    dst_idx=0, 
    src_idx=28,
    wgt_idx=0
))

# ----------------------
# Defining the ADD #2 Instruction buffer

insn_buffer.append(structures_insn_uop.VTAAluInsn( # I8: ALU - ADD (Average Pooling 2/3)
    opcode=4, # 4-ALU
    # DEP FLAG
    pop_prev_dep=0,
    pop_next_dep=0,
    push_prev_dep=0,
    push_next_dep=0,
    # Operations
    reset=0, # 0-no, 1-reset
    uop_bgn=5, # UOP 5
    uop_end=6,
    loop_out=14,
    loop_in=14,
    # UNUSED
    unused=0, # UNUSED
    # Index factors
    dst_factor_out=56,
    dst_factor_in=2, 
    src_factor_out=56,
    src_factor_in=2, 
    alu_opcode=2, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
    use_imm=0, # 0-no, 1-yes
    imm=0
))

# ----------------------
# Printing the data

ACC_ADD2 = insn_ADD(ACC_ADD1)
print("ACC - Output matrix post-second ADD (", ACC_ADD2.shape[0], "x", ACC_ADD2.shape[1], ")") # TODO: update the dimensions
print(ACC_ADD2)

In [None]:
"""AVERAGE POOLING - SHR"""

# ----------------------
# Defining the SHR UOP buffer

uop_buffer.append(structures_insn_uop.VTAUop( # UOP 6 - ALU (shift right)
    dst_idx=0, 
    src_idx=0,
    wgt_idx=0
))

# ----------------------
# Defining the ALU-SHR Instruction buffer

insn_buffer.append(structures_insn_uop.VTAAluInsn( # I9: ALU - SHR (Average Pooling 3/3)
    opcode=4, # 4-ALU
    # DEP FLAG
    pop_prev_dep=0,
    pop_next_dep=0,
    push_prev_dep=0,
    push_next_dep=1, # Ready signal to STORE
    # Operations
    reset=0, # 0-no, 1-reset
    uop_bgn=6, # UOP 6
    uop_end=7,
    loop_out=14,
    loop_in=14,
    # UNUSED
    unused=0, # UNUSED
    # Index factors
    dst_factor_out=56,
    dst_factor_in=2, 
    src_factor_out=56,
    src_factor_in=2, 
    alu_opcode=3, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
    use_imm=1, # 0-no, 1-yes
    imm=2 # Division by 4 (rounded down)
))

# ----------------------
# Defining SHR operation

def SHR(A, IMM):
    # Arithmetic right shift: x >> IMM divides by 2**IMM, rounding down
    return np.right_shift(np.array(A), IMM)

# ----------------------
# Pseudo-code ALU SHR

def insn_SHR(ACC, IMM):
    for i0 in range(loop_in):
        for i1 in range(loop_out):
#            for uop_index in range(uop_bgn, uop_end):
#                X = uop_buffer[uop_index]
                X, Y, Z = uop_buffer1
                dst_idx = i0 * dst_factor_in + i1 * dst_factor_out + X
                ACC[dst_idx] = SHR(ACC[dst_idx], IMM)
    return ACC

# ----------------------
# Printing the data

ACC_SHR = insn_SHR(ACC_ADD2, 2) # imm=2, as in the SHR instruction above
print("ACC - Output matrix post-SHR (", ACC_SHR.shape[0], "x", ACC_SHR.shape[1], ")") # TODO: update the dimensions
print(ACC_SHR)
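
As a cross-check of the four ALU instructions above (ReLU, first ADD, second ADD, SHR), the sketch below replays their loop parameters on a full (784 x 16) accumulator filled with random values. `run_alu` is a hypothetical helper, not part of the repository. It assumes one 16-value vector per ACC index and the 28x28 output positions stored row by row.

```python
import numpy as np

def run_alu(acc, uop, loop_out, loop_in, dst_out, dst_in, src_out, src_in, op, imm=None):
    """Apply one ALU instruction in place; `uop` is (dst_idx, src_idx)."""
    dst0, src0 = uop
    for i_out in range(loop_out):
        for i_in in range(loop_in):
            d = i_out * dst_out + i_in * dst_in + dst0
            s = i_out * src_out + i_in * src_in + src0
            operand = imm if imm is not None else acc[s]
            acc[d] = op(acc[d], operand)
    return acc

rng = np.random.default_rng(0)
acc = rng.integers(-60, 60, size=(784, 16)).astype(np.int32)
start = acc.copy()

run_alu(acc, (0, 0), 49, 16, 16, 1, 16, 1, np.maximum, imm=0)      # ReLU
run_alu(acc, (0, 1), 1, 392, 0, 2, 0, 2, np.add)                   # ADD 1: right neighbour
run_alu(acc, (0, 28), 14, 14, 56, 2, 56, 2, np.add)                # ADD 2: row below
run_alu(acc, (0, 0), 14, 14, 56, 2, 56, 2, np.right_shift, imm=2)  # SHR: divide by 4

# The pooled vectors sit at indices 56 * i_out + 2 * i_in
pooled = acc[[56 * o + 2 * i for o in range(14) for i in range(14)]]

# Reference: ReLU, then 2x2 / stride-2 average pooling (rounded down)
relu = np.maximum(start, 0).reshape(14, 2, 14, 2, 16)
reference = (relu.sum(axis=(1, 3)) >> 2).reshape(196, 16)

print(pooled.shape, np.array_equal(pooled, reference))
```

After the SHR, the 196 pooled vectors are still spread over the original 784-vector buffer, which is why they are gathered by index before the comparison.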