## *How to use operations_definition :*

**Objective :** The VTA simulators read their instructions from binary files. The following programs use the data obtained from *data_definition/* to generate the instructions for each operation (load, GeMM, ReLU, ALU, store, reset, etc.) that the VTA needs to perform.

**Generating the instructions :** 
The example used here is LeNet-5's first convolutional layer (GeMM), followed by ReLU and average pooling (which reduces the size of the output matrix after GeMM) :
We start with (16 x 16) block INP matrices *Ai* and (16 x 16) WGT matrices *Bi*. To execute **GeMM**, each *Ai* has to be split into (16 x 1) horizontal vectors. The result is the block OUTPUT matrices *ACCi*, also composed of (16 x 1) horizontal vectors, which are then reassembled into (16 x 16) blocks.
We then apply **ReLU** to *ACCi*. Average pooling is applied to the resulting block matrices as **2 ADD** and **1 SHR** operations, which divides the amount of stored data by 4.

For each compute operation (reset, GEMM, ALU), the UOP buffers (`VTAUop` structure, defined in `structures_insn_uop.py`) have to be filled to determine at which indexes of the matrices the data is read and stored.
The fields of the instruction buffers (`VTAGemInsn` for GeMM instructions, `VTAAluInsn` for ALU instructions, `VTAMemInsn` for store / load instructions) should also be set according to the dimensions of the input tensor and filters.

- **LOAD (128-bit):** Data is extracted from DRAM and stored temporarily in SRAM (local memory inside the VTA) to compute the operations
- **ALU (128-bit) :** ReLU activation (for each value x of the matrices => max(0, x)), ADD 1 & 2 then SHR (averaging some of the data).
- **GEMM (128-bit) :** Matrix multiplication of A and B (transposed).
- **STORE (128-bit) :** Data from SRAM is copied at the desired location in DRAM, after all the operations have been resolved.

*Using the program `insn_lenet5_layer1.py` as an example of how to generate the data in binary files :*

### CONFIGURATION

Initializing the buffers and importing the necessary structures from `structures_insn_uop.py`.
The buffers are used to store the data :
- **UOP buffers** : this is where the indexes for each instruction are stored. A UOP buffer has three fields : ACC, INP, WGT. The indexes are the initial indexes upon which the operations will be executed.
- **Instruction buffers** : memory storage to describe each instruction. The buffers have different structures depending on whether they're used for memory interactions, or for operation definitions. 

In [22]:
"""CONFIGURATION"""

# PACKAGE IMPORT
# --------------
%pip install numpy
import numpy as np
import sys

# Parent folder
sys.path.append('../src/compiler/vta_compiler/operations_definition')
sys.path.append('../src/compiler/vta_compiler/data_definition')
import matrix_generator as MG
import matrix_split as MS
import structures_insn_uop

# UOP DEFINITION
# --------------
# Define empty UOP buffer
uop_buffer = []

# INSTRUCTION DEFINITION
# ----------------------
# Define empty instruction buffer
insn_buffer = []

# INPUT DATA
# --------------
block_size = 16

# A matrix size
A_row = 784
A_col = 25

block_input_matrix, _ = MS.matrix_splitting(MG.matrix_padding(MG.matrix_creation(n_row=A_row, n_col=A_col, isInitRandom=True, random_bound=4)), block_size, isWeight=False, isSquare=True)

# B matrix size
B_row = A_col # Required by matrix multiplication
B_col = 6

block_weight_matrix, _ = MS.matrix_splitting(MG.matrix_padding(MG.matrix_creation(n_row=B_row, n_col=B_col, isInitRandom=True, random_bound=4)), block_size, isWeight=True, isSquare=True)

Note: you may need to restart the kernel to use updated packages.


#### LOAD DATA FROM DRAM

This first buffer is used to reset ACC, before GEMM.
The fields indicate the initial indexes where the data should be read/written :
- `dst_idx` : where the data is written (ACC)
- `src_idx` : where the data is read (ACC or INP)
- `wgt_idx` : where the data is read in WGT (only used for GEMM)

In [23]:
if (len(uop_buffer) < 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 0 - reset
        dst_idx=0, 
        src_idx=0,
        wgt_idx=0
    ))

###### *MEMORY BUFFERS*

The following buffers mimic memory interactions between the LOAD module and the COMPUTE module.

Their different fields are :
- `opcode` : indicates which operation the buffer defines
- DEP FLAG : the `pop_*_dep` flags wait for permission before the instruction runs, the `push_*_dep` flags give that permission to a neighbouring module. This ensures that the modules do not overlap during data processing.
- `buffer_id` : indicates what kind of data is being read
- `sram_base`, `dram_base` : `sram_base` is the local memory address to which the data at `dram_base` is copied, in the case of LOAD
- `y_size`, `x_size`, etc : the size, stride and padding of the data being transferred

In [24]:
if (len(insn_buffer) < 1):
    
# Loading the RESET UOP

    insn_buffer.append(structures_insn_uop.VTAMemInsn( # I0: LOAD UOP
        opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=0,
        # Memory interaction
        buffer_id=0, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
        sram_base=0x0000,
        dram_base=0x00001000,
        unused=0, # UNUSED
        # Operation over the data
        y_size=1,
        x_size=1,
        x_stride=1,
        y_pad_top=0,
        y_pad_bottom=0,
        x_pad_left=0,
        x_pad_right=0
    ))

# The ACC matrix is wiped, in case of RESET

    insn_buffer.append(structures_insn_uop.VTAGemInsn( # I1: GEMM RESET
        opcode=2, # 2-GEMM
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=1, # Ready signal to LOAD
        push_next_dep=0,
        # Operations
        reset=1, # 0-no, 1-reset
        uop_bgn=0, # UOP 0
        uop_end=1,
        loop_out=49, # Number of (16 x 16) blocks in ACC
        loop_in=16,  # Block size
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=16, # Block size
        dst_factor_in=1,
        src_factor_out=0,
        src_factor_in=0,
        wgt_factor_out=0,
        wgt_factor_in=0
    ))
    
# Loading INP

    insn_buffer.append(structures_insn_uop.VTAMemInsn( # I2: LOAD INP
        opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=1, # Acknowledge COMPUTE ready signal
        push_prev_dep=0,
        push_next_dep=0,
        # Memory interaction
        buffer_id=2, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
        sram_base=0x0000,
        dram_base=0x00000100,
        unused=0, # UNUSED
        # Operation over the data
        y_size=1,
        x_size=1568, # Load 98*16 INP
        x_stride=1568,
        y_pad_top=0,
        y_pad_bottom=0,
        x_pad_left=0,
        x_pad_right=0
    ))
    
# Loading WGT

    insn_buffer.append(structures_insn_uop.VTAMemInsn( # I3: LOAD WGT
        opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=1, # Ready signal to COMPUTE
        # Memory interaction
        buffer_id=1, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
        sram_base=0x0000,
        dram_base=0x00000020,
        unused=0, # UNUSED
        # Operation over the data
        y_size=1,
        x_size=2, # Load 2 WGT
        x_stride=2,
        y_pad_top=0,
        y_pad_bottom=0,
        x_pad_left=0,
        x_pad_right=0
    ))
    
# Loading UOPs for GEMM & Average Pooling operations

    insn_buffer.append(structures_insn_uop.VTAMemInsn( # I4: LOAD UOP
        opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
        # DEP FLAG
        pop_prev_dep=1, # Acknowledge LOAD ready signal
        pop_next_dep=0, 
        push_prev_dep=0,
        push_next_dep=0,
        # Memory interaction
        buffer_id=0, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
        sram_base=0x0001,
        dram_base=0x00001001,
        unused=0, # UNUSED
        # Operation over the data
        y_size=1,
        x_size=6, # Load 6 UOP (2 GeMM + 1 ReLU + 3 Pool)
        x_stride=6,
        y_pad_top=0,
        y_pad_bottom=0,
        x_pad_left=0,
        x_pad_right=0
    ))

## GEMM

'General Matrix Multiplication' (GEMM) operation. Instead of multiplying block by block, all of the INP blocks are stacked into a (1568 x 16) matrix, which is then split into vectors. The result is a (784 x 16) ACC matrix: there are two (16 x 16) WGT block matrices, so two INP blocks accumulate into each ACC block.

The operation performed by the VTA is as follows : each vector of INP is multiplied with WGT transposed (line by line). The results are accumulated in ACC.
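The stacking and accumulation described above can be checked on a reduced example. The sketch below is only an illustration: `pad_to_blocks` and `split_blocks` are hypothetical helpers (they are not the `matrix_generator` / `matrix_split` functions), the sizes are smaller than LeNet-5's, and, like the notebook's `GEMM` helper, it uses the plain product `A @ B` without modelling how the hardware stores WGT.

```python
import numpy as np

BLOCK = 16  # block size, assumed to match block_size in the notebook


def pad_to_blocks(m, block=BLOCK):
    """Zero-pad a matrix so both dimensions are multiples of `block` (hypothetical helper)."""
    rows = -(-m.shape[0] // block) * block  # round up to the next multiple
    cols = -(-m.shape[1] // block) * block
    out = np.zeros((rows, cols), dtype=m.dtype)
    out[:m.shape[0], :m.shape[1]] = m
    return out


def split_blocks(m, block=BLOCK):
    """Return the (block x block) tiles of `m` in row-major order (hypothetical helper)."""
    return [m[r:r + block, c:c + block]
            for r in range(0, m.shape[0], block)
            for c in range(0, m.shape[1], block)]


rng = np.random.default_rng(0)
A = pad_to_blocks(rng.integers(-4, 5, size=(32, 25)))  # (32 x 32) -> 4 INP blocks
B = pad_to_blocks(rng.integers(-4, 5, size=(25, 6)))   # (32 x 16) -> 2 WGT blocks

inp_stack = np.vstack(split_blocks(A))  # (64 x 16): A0, A1, A2, A3 stacked
wgt_blocks = split_blocks(B)            # [B0, B1]

# Two INP blocks accumulate into each ACC block:
# A0 @ B0 + A1 @ B1 -> C0, then A2 @ B0 + A3 @ B1 -> C1
acc = np.zeros((A.shape[0], BLOCK), dtype=A.dtype)
for out_blk in range(A.shape[0] // BLOCK):
    for k, wgt in enumerate(wgt_blocks):
        src = (out_blk * len(wgt_blocks) + k) * BLOCK
        acc[out_blk * BLOCK:(out_blk + 1) * BLOCK] += inp_stack[src:src + BLOCK] @ wgt

# The block-wise accumulation matches the full padded product
print(np.array_equal(acc, A @ B))
```

With the notebook's sizes the same reasoning gives 98 INP blocks (1568 stacked vectors) and 49 ACC blocks (784 vectors).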

#### VECTORIZATION OF INP AND ACC

*Hardware requirement :* The VTA requires, for Compute, (`block_size=16` x `block_size=16`) WGT matrices and (`block_size=16` x 1) vectors for the rest of the data. 

The (`block_size=16` x `block_size=16`) INP and ACC matrices should therefore be split into vectors : the following operations are performed on vectors.

What follows is an example of a GEMM operation, as performed by the VTA :

In [25]:
# ----------------------
# Splitting (16 x 16) block INP matrices into (16 x 1) vectors

# Input Matrix INP
print("First vector of first block of INP matrix (", np.shape(block_input_matrix[0][0])[0], " x ", 1, ")")
print(block_input_matrix[0][0], "A@0")

# Weight Matrix WGT
print("x \nFirst block of WGT matrix (", block_weight_matrix[0].shape[0], " x ", block_weight_matrix[0].shape[1], ")")
print(block_weight_matrix[0], "B@0")

# Output Matrix ACC
C_0 = np.zeros((1, block_size))
print("= \nFirst vector of first block of ACC [to be filled] (", block_size, " x ", 1, ")")
print(C_0, "C@0")

First vector of first block of INP matrix ( 16  x  1 )
[ 2 -4 -4  0 -3  2  1 -1 -2  1 -4 -2 -3  2  2 -1] A@0
x 
First block of WGT matrix ( 16  x  16 )
[[ 2  2 -2  1  0 -2  0  0  0  0  0  0  0  0  0  0]
 [ 2 -3 -4 -2 -1  0  0  0  0  0  0  0  0  0  0  0]
 [ 2 -2 -2 -4  0 -1  0  0  0  0  0  0  0  0  0  0]
 [-3 -3  0  2  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 1 -3  0 -3 -4  2  0  0  0  0  0  0  0  0  0  0]
 [-3 -3 -4  1  1  1  0  0  0  0  0  0  0  0  0  0]
 [-3  0 -2  1 -4  2  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  1 -2  2 -3  0  0  0  0  0  0  0  0  0  0]
 [ 2  2 -1 -3  1  2  0  0  0  0  0  0  0  0  0  0]
 [-2  2  0 -4 -2  0  0  0  0  0  0  0  0  0  0  0]
 [-4 -4  2  1 -3  2  0  0  0  0  0  0  0  0  0  0]
 [-2 -1 -2  0  0 -1  0  0  0  0  0  0  0  0  0  0]
 [ 2  2 -2  2  2 -4  0  0  0  0  0  0  0  0  0  0]
 [ 1  2 -3 -3  2 -3  0  0  0  0  0  0  0  0  0  0]
 [-2 -4 -4 -2 -2  2  0  0  0  0  0  0  0  0  0  0]
 [-1 -1 -3 -4  1  0  0  0  0  0  0  0  0  0  0  0]] B@0
= 
First vector of first block of ACC [to be filled] ( 16  x  1 )
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]] C@0

##### *OPERATION BUFFERS*

The following buffers give the structure of each operation. 
Their fields are :
- `opcode` : indicates which operation the buffer defines
- DEP FLAG : the `pop_*_dep` flags wait for permission before the instruction runs, the `push_*_dep` flags give that permission to a neighbouring module. This ensures that the modules do not overlap during data processing.
- `reset` : if set to 1, the stored data is cleared
- `uop_bgn`, `uop_end` : the range of UOP indexes used by the instruction (`uop_end` is excluded)
- `loop_out` : external loop, number of outer iterations for the operation
- `loop_in` : internal loop, number of inner iterations
- `dst_factor_out`, `dst_factor_in` : index increment in ACC after each `loop_out` and `loop_in` iteration, respectively
- `src_factor_out`, `src_factor_in` : likewise, but in INP
- `wgt_factor_out`, `wgt_factor_in` : likewise, but in WGT

The following buffers lead to these multiplications :

>*A0 * B0 +=> C0*

>*A1 * B1 +=> C0*

>*A2 * B0 +=> C1*

>*A3 * B1 +=> C1*

>*etc...*
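The sequence of multiplications above follows from the UOP indexes and the loop factors. As a rough sketch (the function `gemm_indices` and its argument names are illustrative, and the loop nesting shown, `loop_out` outside `loop_in`, is an assumption that does not change the set of index triples), the indexes can be enumerated as follows with the values of the GEMM instruction defined in this notebook:

```python
def gemm_indices(uops, loop_out, loop_in, dst_factor, src_factor, wgt_factor):
    """Yield (dst, src, wgt) vector indexes for one GEMM instruction.

    `uops` is a list of (dst_idx, src_idx, wgt_idx) tuples; each `*_factor`
    is an (out, in) pair of index increments. Names are illustrative only.
    """
    for i_out in range(loop_out):
        for i_in in range(loop_in):
            for dst_idx, src_idx, wgt_idx in uops:
                yield (
                    dst_idx + i_out * dst_factor[0] + i_in * dst_factor[1],
                    src_idx + i_out * src_factor[0] + i_in * src_factor[1],
                    wgt_idx + i_out * wgt_factor[0] + i_in * wgt_factor[1],
                )


# Values taken from UOP 1, UOP 2 and instruction I5
triples = list(gemm_indices(
    uops=[(0, 0, 0), (0, 16, 1)],
    loop_out=49, loop_in=16,
    dst_factor=(16, 1), src_factor=(32, 1), wgt_factor=(0, 0),
))

print(triples[:2])     # first ACC vector: INP vector 0 with B0, INP vector 16 with B1
print(triples[32:34])  # first vector of the next ACC block (A2 with B0, A3 with B1)
```

Each INP block holds 16 vectors, so INP vectors 0 to 15 belong to A0, 16 to 31 to A1, 32 to 47 to A2, and so on.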

In [26]:
"""GEMM"""

# Generating the instructions for the GeMM, using A vectorized and B.

# ----------------------
# Defining the GEMM UOP buffer

if (len(uop_buffer) < 1 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 1 - GEMM 0
        dst_idx=0, 
        src_idx=0,
        wgt_idx=0
    ))

if (len(uop_buffer) < 2 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 2 - GEMM 1
        dst_idx=0, 
        src_idx=16,
        wgt_idx=1
    ))

# ----------------------
# Defining the GEMM Instruction buffer

index_insn = 5 # Instruction index

if (len(insn_buffer) < index_insn + 1):
    insn_buffer.append(structures_insn_uop.VTAGemInsn( # I5: GEMM
        opcode=2, # 2-GEMM
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=0, 
        # Operations
        reset=0, # 0-no, 1-reset
        uop_bgn=1, # UOP 1 + UOP 2
        uop_end=3,
        loop_out=49,
        loop_in=16,
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=16,
        dst_factor_in=1,
        src_factor_out=32,
        src_factor_in=1,
        wgt_factor_out=0,
        wgt_factor_in=0
    ))

# ----------------------
# Print the buffers
 
# Printing UOP Buffer
def print_uop_buffer(OP, uop_bgn, uop_end) :
    print(OP, "UOP BUFFER\nACC  INP  WGT\n")
    for i in range(uop_bgn, uop_end):
        print(uop_buffer[i].dst_idx, "  ", uop_buffer[i].src_idx, "  ", uop_buffer[i].wgt_idx, "\n")

# Printing ALU Instruction Buffer      
def print_insn_buffer_ALU(n_insn, OP):
    print(OP, "INSTRUCTIONS\nLP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  OPCODE  IMM\n")
    print(insn_buffer[n_insn].loop_out, "     ", insn_buffer[n_insn].loop_in, "     ", insn_buffer[n_insn].dst_factor_out, "     ", insn_buffer[n_insn].dst_factor_in, "     ", 
          insn_buffer[n_insn].src_factor_out, "     ", 
          insn_buffer[n_insn].src_factor_in, "     ", insn_buffer[n_insn].opcode, "    ", insn_buffer[n_insn].imm)
    
# ----------------------
# Defining GEMM operation

def GEMM(A, B):
#    assert(A.shape[1] == B.shape[0])
    A = np.array(A)
    B = np.array(B)
    return A @ B

# ----------------------
# Pseudo-code GEMM

def insn_GEMM(ACC, WGT, INP):
    for i0 in range(insn_buffer[index_insn].loop_in):
        for i1 in range(insn_buffer[index_insn].loop_out):
            for uop_index in range(insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end):
                X, Y, Z = uop_buffer[uop_index].dst_idx, uop_buffer[uop_index].src_idx, uop_buffer[uop_index].wgt_idx
                dst_idx = i0 * insn_buffer[index_insn].dst_factor_in + i1 * insn_buffer[index_insn].dst_factor_out + X # Index ACC
                inp_idx = i0 * insn_buffer[index_insn].src_factor_in + i1 * insn_buffer[index_insn].src_factor_out + Y # Index INP
                wgt_idx = i0 * insn_buffer[index_insn].wgt_factor_in + i1 * insn_buffer[index_insn].wgt_factor_out + Z # Index WGT
                ACC[dst_idx] += GEMM(INP[inp_idx], WGT[wgt_idx])                                                       # Storage of GEMM(A, B) in ACC
    return ACC

# ----------------------
# Printing the data
# ----------------------

# Printing GEMM UOP Buffer
print_uop_buffer("GEMM", insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end)

# Printing GEMM Instruction Buffer 
print("GEMM INSTRUCTIONS\nLP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  WGT_OUT  WGT_IN\n")
print(insn_buffer[index_insn].loop_out, "     ", insn_buffer[index_insn].loop_in, "     ", insn_buffer[index_insn].dst_factor_out, "     ", insn_buffer[index_insn].dst_factor_in, "     ", 
        insn_buffer[index_insn].src_factor_out, "     ", 
        insn_buffer[index_insn].src_factor_in, "     ", insn_buffer[index_insn].wgt_factor_out, "     ", insn_buffer[index_insn].wgt_factor_in, "\n")

# Printing the Output Matrix
INP_stack = np.vstack(block_input_matrix)       # Stacking the 98 (16 x 16) blocks of A
ACC = np.zeros((A_row, block_size))             # Initializing the Output Matrix C (49 blocks of size (16 x 16) stacked) with zeros

ACC_GEMM = insn_GEMM(ACC, block_weight_matrix, INP_stack)
#assert(ACC_GEMM[0] == block_output_matrix[0][0])
print("ACC - Output matrix post-GEMM (", ACC_GEMM.shape[0], "x", ACC_GEMM.shape[1], ")")
print(ACC_GEMM)

GEMM UOP BUFFER
ACC  INP  WGT

0    0    0 

0    16    1 

GEMM INSTRUCTIONS
LP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  WGT_OUT  WGT_IN

49       16       16       1       32       1       0       0 

ACC - Output matrix post-GEMM ( 784 x 16 )
[[  7.  44.  13. ...   0.   0.   0.]
 [ 30.  27.  26. ...   0.   0.   0.]
 [ 19.  -9.  18. ...   0.   0.   0.]
 ...
 [ 47.  13.  36. ...   0.   0.   0.]
 [ 16.  50.  82. ...   0.   0.   0.]
 [ 24. -24.  29. ...   0.   0.   0.]]


## ALU

#### ReLU ACTIVATION

ALU operation that, for each value of ACC, returns the maximum value between 0 and the value in ACC. This aims to ensure there are no negative values in ACC.

In [27]:
"""ReLU ACTIVATION"""

# Applied when `useReLU=True` in data_definition/user_configuration.py

# ----------------------
# Defining the ALU-RELU UOP buffer

if (len(uop_buffer) < 3 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 3 - ALU (relu)
        dst_idx=0, 
        src_idx=0,
        wgt_idx=0
    ))

# ----------------------
# Defining the ALU-RELU Instruction buffer

index_insn = 6 # Instruction index

if (len(insn_buffer) < index_insn + 1):
    insn_buffer.append(structures_insn_uop.VTAAluInsn( # I6: ALU - MAX IMM 0 (relu)
        opcode=4, # 4-ALU
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=0,
        # Operations
        reset=0, # 0-no, 1-reset
        uop_bgn=3, # UOP 3
        uop_end=4,
        loop_out=49,
        loop_in=16,
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=16,
        dst_factor_in=1, # ACC incremented by 1
        src_factor_out=16,
        src_factor_in=1, # INP incremented by 1
        alu_opcode=1, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
        use_imm=1, # 0-no, 1-yes
        imm=0
    ))

# ----------------------
# Defining RELU operation
def RELU(A, useReLU):
    if (useReLU):
        A = np.maximum(A, 0)
    return A

# ----------------------
# Pseudo-code ALU RELU

def insn_RELU(ACC):
    for i0 in range(insn_buffer[index_insn].loop_in):
        for i1 in range(insn_buffer[index_insn].loop_out):
            for uop_index in range(insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end):
                X = uop_buffer[uop_index].dst_idx
                dst_idx = i0 * insn_buffer[index_insn].dst_factor_in + i1 * insn_buffer[index_insn].dst_factor_out + X # Index ACC
                ACC[dst_idx] = RELU(ACC[dst_idx], True) # For every row of ACC, we do max(0, value) for each value of the row
    return ACC

# ----------------------
# Printing the data
# ----------------------

# Printing ReLU UOP Buffer
print_uop_buffer("RELU", insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end)

# Printing ReLU Instruction Buffer 
print_insn_buffer_ALU(index_insn, "RELU")

# Printing the Output Matrix

ACC_ReLU = insn_RELU(ACC_GEMM)
print("\nACC - Output matrix post-ReLU (", ACC_ReLU.shape[0], "x", ACC_ReLU.shape[1], ")")
print(ACC_ReLU)

RELU UOP BUFFER
ACC  INP  WGT

0    0    0 

RELU INSTRUCTIONS
LP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  OPCODE  IMM

49       16       16       1       16       1       4      0

ACC - Output matrix post-ReLU ( 784 x 16 )
[[ 7. 44. 13. ...  0.  0.  0.]
 [30. 27. 26. ...  0.  0.  0.]
 [19.  0. 18. ...  0.  0.  0.]
 ...
 [47. 13. 36. ...  0.  0.  0.]
 [16. 50. 82. ...  0.  0.  0.]
 [24.  0. 29. ...  0.  0.  0.]]


#### AVERAGE POOLING
##### *First ADD*

ALU operation that adds two vectors of ACC and stores the result in the first vector of the addition. 

We first define the UOP buffer, where the data is to be read and stored, then the instruction buffer. The operation is then computed into ACC and printed out.

The following buffers lead to these operations :

>*LOOP 0 : ACC@(dst_idx=0) + ACC@(src_idx=1) => ACC@(dst_idx=0)*

>*LOOP 1 : ACC@(dst_idx=0 + 1 * dst_factor_in=2) + ACC@(src_idx=1 + 1 * src_factor_in=2) = ACC@2 + ACC@3 => ACC@2*

>*etc...*

In [28]:
"""AVERAGE POOLING - First ADD"""

# After this step, the relevant data storage is divided by two.

# ----------------------
# Defining the ADD #1 UOP buffer

if (len(uop_buffer) < 4 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 4 - ALU (first add)
        dst_idx=0, 
        src_idx=1,
        wgt_idx=0
    ))

# ----------------------
# Defining the ADD #1 Instruction buffer

index_insn = 7 # Instruction index

if (len(insn_buffer) < index_insn + 1):
    insn_buffer.append(structures_insn_uop.VTAAluInsn( # I7: ALU - ADD (Average Pooling 1/3)
        opcode=4, # 4-ALU
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=0,
        # Operations
        reset=0, # 0-no, 1-reset
        uop_bgn=4, # UOP 4
        uop_end=5,
        loop_out=1,
        loop_in=392,
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=0,
        dst_factor_in=2, 
        src_factor_out=0,
        src_factor_in=2, 
        alu_opcode=2, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
        use_imm=0, # 0-no, 1-yes
        imm=0
    ))

# ----------------------
# Define ADD operation

def ADD(A, B):
    A = np.array(A)
    B = np.array(B)
    return A + B
        
# ----------------------
# Pseudo-code ALU ADD

def insn_ADD(ACC):
    for i0 in range(insn_buffer[index_insn].loop_in):
        for i1 in range(insn_buffer[index_insn].loop_out):
            for uop_index in range(insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end):
                X, Y = uop_buffer[uop_index].dst_idx, uop_buffer[uop_index].src_idx
                dst_idx = i0 * insn_buffer[index_insn].dst_factor_in + i1 * insn_buffer[index_insn].dst_factor_out + X
                inp_idx = i0 * insn_buffer[index_insn].src_factor_in + i1 * insn_buffer[index_insn].src_factor_out + Y
                ACC[dst_idx] = ADD(ACC[dst_idx], ACC[inp_idx])
    return ACC

# ----------------------
# Printing the data
# ----------------------

# Printing ADD #1 UOP Buffer
print_uop_buffer("ADD #1", insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end)

# Printing ADD #1 Instruction Buffer 
print_insn_buffer_ALU(index_insn, "ADD #1")

# Printing the Output Matrix
ACC_ADD1 = insn_ADD(ACC_ReLU)
print("\nACC - Output matrix post-first ADD (", ACC_ADD1.shape[0], "x", ACC_ADD1.shape[1], ")")
print(ACC_ADD1)

ADD #1 UOP BUFFER
ACC  INP  WGT

0    1    0 

ADD #1 INSTRUCTIONS
LP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  OPCODE  IMM

1       392       0       2       0       2       4      0

ACC - Output matrix post-first ADD ( 784 x 16 )
[[ 37.  71.  39. ...   0.   0.   0.]
 [ 30.  27.  26. ...   0.   0.   0.]
 [ 37.  31.  70. ...   0.   0.   0.]
 ...
 [ 47.  13.  36. ...   0.   0.   0.]
 [ 40.  50. 111. ...   0.   0.   0.]
 [ 24.   0.  29. ...   0.   0.   0.]]


##### *Second ADD*

ALU operation that adds two vectors of ACC and stores the result in the first vector of the addition. 

We first define the UOP buffer, then the instruction buffer. The operation is then computed into ACC and printed out.

The following buffers lead to these operations :

*LOOP_IN 0 :*

>*LOOP_OUT 0 : ACC@(dst_idx=0) + ACC@(src_idx=28) => ACC@(dst_idx=0)*

>*LOOP_OUT 1 : ACC@(dst_idx=0 + 1 * dst_factor_out=56) + ACC@(src_idx=28 + 1 * src_factor_out=56) = ACC@56 + ACC@84 => ACC@56*

>*etc...*

*LOOP_IN 1 :*

>*LOOP_OUT 0 : ACC@(dst_idx=0 + 1 * dst_factor_in=2) + ACC@(src_idx=28 + 1 * src_factor_in=2) = ACC@2 + ACC@30 => ACC@2*

>*LOOP_OUT 1 : ACC@(dst_idx=0 + 1 * dst_factor_out=56 + 1 * dst_factor_in=2) + ACC@(src_idx=28 + 1 * src_factor_out=56 + 1 * src_factor_in=2) = ACC@58 + ACC@86 => ACC@58*
    
*etc...*

In [29]:
"""AVERAGE POOLING - Second ADD"""

# After this step, the relevant data storage is divided by two. (4 total)

# ----------------------
# Defining the ADD #2 UOP buffer

if (len(uop_buffer) < 5 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 5 - ALU (second add)
        dst_idx=0, 
        src_idx=28,
        wgt_idx=0
    ))

# ----------------------
# Defining the ADD #2 Instruction buffer

index_insn = 8 # Instruction index

if (len(insn_buffer) < index_insn + 1):
    insn_buffer.append(structures_insn_uop.VTAAluInsn( # I8: ALU - ADD (Average Pooling 2/3)
        opcode=4, # 4-ALU
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=0,
        # Operations
        reset=0, # 0-no, 1-reset
        uop_bgn=5, # UOP 5
        uop_end=6,
        loop_out=14,
        loop_in=14,
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=56,
        dst_factor_in=2, 
        src_factor_out=56,
        src_factor_in=2, 
        alu_opcode=2, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
        use_imm=0, # 0-no, 1-yes
        imm=0
    ))

# ----------------------
# Printing the data
# ----------------------

# Printing ADD #2 UOP Buffer
print_uop_buffer("ADD #2", insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end)

# Printing ADD #2 Instruction Buffer 
print_insn_buffer_ALU(index_insn, "ADD #2")

# Printing the Output Matrix
ACC_ADD2 = insn_ADD(ACC_ADD1)
print("\nACC - Output matrix post-second ADD (", ACC_ADD2.shape[0], "x", ACC_ADD2.shape[1], ")")
print(ACC_ADD2)

ADD #2 UOP BUFFER
ACC  INP  WGT

0    28    0 

ADD #2 INSTRUCTIONS
LP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  OPCODE  IMM

14       14       56       2       56       2       4      0

ACC - Output matrix post-second ADD ( 784 x 16 )
[[ 75.  81. 131. ...   0.   0.   0.]
 [ 30.  27.  26. ...   0.   0.   0.]
 [ 55.  34.  72. ...   0.   0.   0.]
 ...
 [ 47.  13.  36. ...   0.   0.   0.]
 [ 40.  50. 111. ...   0.   0.   0.]
 [ 24.   0.  29. ...   0.   0.   0.]]


##### *SHIFT-RIGHT*

ALU operation that shifts each value contained in ACC's vectors to the right by `imm` bits. Here `imm=2`, so the values are divided by 2² and rounded down.

We first define the UOP buffer, then the instruction buffer. The operation is then computed into ACC and printed out.

The following buffers lead to these operations :

*LOOP_IN 0 :*

>*LOOP_OUT 0 : ACC@(dst_idx=0) / 2² => ACC@(dst_idx=0)*

>*LOOP_OUT 1 : ACC@(dst_idx=0 + 1 * dst_factor_out=56) / 2² = ACC@56 / 2² => ACC@56*

>*etc...*

*LOOP_IN 1 :*

>*LOOP_OUT 0 : ACC@(dst_idx=0 + 1 * dst_factor_in=2) / 2² = ACC@2 / 2² => ACC@2*

>*LOOP_OUT 1 : ACC@(dst_idx=0 + 1 * dst_factor_out=56 + 1 * dst_factor_in=2) / 2² = ACC@58 / 2² => ACC@58*
    
*etc...*

In [30]:
"""AVERAGE POOLING - SHR"""

# With this step, we average the added values.

# ----------------------
# Defining the SHR UOP buffer

if (len(uop_buffer) < 6 + 1):
    uop_buffer.append(structures_insn_uop.VTAUop( # UOP 6 - ALU (shift right)
        dst_idx=0, 
        src_idx=0,
        wgt_idx=0
    ))

# ----------------------
# Defining the ALU-SHR Instruction buffer

index_insn = 9 # Instruction index

if (len(insn_buffer) < index_insn + 1):
    insn_buffer.append(structures_insn_uop.VTAAluInsn( # I9: ALU - SHR (Average Pooling 3/3)
        opcode=4, # 4-ALU
        # DEP FLAG
        pop_prev_dep=0,
        pop_next_dep=0,
        push_prev_dep=0,
        push_next_dep=1, # Ready signal to STORE
        # Operations
        reset=0, # 0-no, 1-reset
        uop_bgn=6, # UOP 6
        uop_end=7,
        loop_out=14,
        loop_in=14,
        # UNUSED
        unused=0, # UNUSED
        # Index factors
        dst_factor_out=56,
        dst_factor_in=2, 
        src_factor_out=56,
        src_factor_in=2, 
        alu_opcode=3, # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL
        use_imm=1, # 0-no, 1-yes
        imm=2 # Division by 4 (rounded down)
    ))

# ----------------------
# Defining SHR operation

def SHR(A, IMM) :
    for i in range(len(A)): # A composed of horizontal vectors (16 x 1)
        A[i] = int(np.float64(A[i])) >> IMM
    return A

# ----------------------
# Pseudo-code ALU SHR

def insn_SHR(ACC):
    for i0 in range(insn_buffer[index_insn].loop_in):
        for i1 in range(insn_buffer[index_insn].loop_out):
            for uop_index in range(insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end):
                X = uop_buffer[uop_index].dst_idx
                dst_idx = i0 * insn_buffer[index_insn].dst_factor_in + i1 * insn_buffer[index_insn].dst_factor_out + X
                ACC[dst_idx] = SHR(ACC[dst_idx], insn_buffer[index_insn].imm)
    return ACC

# ----------------------
# Printing the data
# ----------------------

# Printing SHR UOP Buffer
print_uop_buffer("SHR", insn_buffer[index_insn].uop_bgn, insn_buffer[index_insn].uop_end)

# Printing SHR Instruction Buffer 
print_insn_buffer_ALU(index_insn, "SHR")

# Printing the Output Matrix
ACC_SHR = insn_SHR(ACC_ADD2)
print("\nACC - Output matrix post-SHR (", ACC_SHR.shape[0], "x", ACC_SHR.shape[1], ")")
print(ACC_SHR)

SHR UOP BUFFER
ACC  INP  WGT

0    0    0 

SHR INSTRUCTIONS
LP_OUT  LP_IN  DST_OUT  DST_IN  SRC_OUT  SRC_IN  OPCODE  IMM

14       14       56       2       56       2       4      2

ACC - Output matrix post-SHR ( 784 x 16 )
[[ 18.  20.  32. ...   0.   0.   0.]
 [ 30.  27.  26. ...   0.   0.   0.]
 [ 13.   8.  18. ...   0.   0.   0.]
 ...
 [ 47.  13.  36. ...   0.   0.   0.]
 [ 40.  50. 111. ...   0.   0.   0.]
 [ 24.   0.  29. ...   0.   0.   0.]]
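
The two ADDs followed by the SHR amount to a 2 x 2 average pooling with the result rounded down. The sketch below checks this on a small image. It is an illustration only, assuming that the rows of ACC hold the pixels of an (H x W) image in row-major order (W = 28 in the notebook, which gives the offsets 28 and 56); all names are hypothetical, and the second ADD and the SHR are merged into one loop for brevity.

```python
import numpy as np

H, W, C = 4, 4, 3  # small image; the notebook uses 28 x 28 with 16 channels
rng = np.random.default_rng(0)
acc = rng.integers(0, 100, size=(H * W, C))  # one row per pixel, non-negative as after ReLU
image = acc.reshape(H, W, C).copy()          # reference copy for the direct computation

# First ADD: each even row receives its right-hand neighbour (dst = 2i, src = 2i + 1)
for i in range(H * W // 2):
    acc[2 * i] += acc[2 * i + 1]

# Second ADD, then SHR: add the row one image line below (src = dst + W), divide by 4
for i_out in range(H // 2):
    for i_in in range(W // 2):
        dst = i_out * 2 * W + i_in * 2
        acc[dst] += acc[dst + W]
        acc[dst] >>= 2

# Direct 2 x 2 average, rounded down
expected = (image[0::2, 0::2] + image[0::2, 1::2]
            + image[1::2, 0::2] + image[1::2, 1::2]) >> 2

# The pooled values sit at rows i_out * 2W + i_in * 2 of ACC
pooled = acc.reshape(H, W, C)[0::2, 0::2]
print(np.array_equal(pooled, expected))
```

Only one row in four carries a pooled value; the other rows of ACC keep intermediate data, as in the output above.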


##### STORING THE DATA INTO DRAM

After computation, the locally-stored data should be copied into DRAM. The following buffers mimic the memory interactions between the COMPUTE module and the STORE module.

In [31]:
"""DATA STORAGE FROM SRAM TO DRAM"""

insn_buffer.append(structures_insn_uop.VTAMemInsn( # I10: STORE
    opcode=1, # 0-LOAD, 1-STORE, 3-FINISH
    # DEP FLAG
    pop_prev_dep=1, # Acknowledge COMPUTE ready signal
    pop_next_dep=0,
    push_prev_dep=1, # Ready signal to COMPUTE
    push_next_dep=0,
    # Memory interaction
    buffer_id=4, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
    sram_base=0x0000,
    dram_base=0x00000300,
    unused=0, # UNUSED
    # Operation over the data
    y_size=1,
    x_size=784, # Store 49*16 OUT
    x_stride=784,
    y_pad_top=0,
    y_pad_bottom=0,
    x_pad_left=0,
    x_pad_right=0
))

insn_buffer.append(structures_insn_uop.VTAMemInsn( # I11: NOP-MEMORY-STAGE
    opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
    # DEP FLAG
    pop_prev_dep=0,
    pop_next_dep=0,
    push_prev_dep=0, 
    push_next_dep=1, # Ready signal to COMPUTE
    # Memory interaction
    buffer_id=2, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
    sram_base=0x0000,
    dram_base=0x00000000,
    unused=0, # UNUSED
    # Operation over the data
    y_size=0,
    x_size=0,
    x_stride=0,
    y_pad_top=0,
    y_pad_bottom=0,
    x_pad_left=0,
    x_pad_right=0
))

insn_buffer.append(structures_insn_uop.VTAMemInsn( # I12: NOP-COMPUTE-STAGE
    opcode=0, # 0-LOAD, 1-STORE, 3-FINISH
    # DEP FLAG
    pop_prev_dep=1, # Acknowledge LOAD ready signal
    pop_next_dep=1, # Acknowledge STORE ready signal
    push_prev_dep=0,
    push_next_dep=0,
    # Memory interaction
    buffer_id=0, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
    sram_base=0x0000,
    dram_base=0x00000000,
    unused=0, # UNUSED
    # Operation over the data
    y_size=0,
    x_size=0,
    x_stride=0,
    y_pad_top=0,
    y_pad_bottom=0,
    x_pad_left=0,
    x_pad_right=0
))

insn_buffer.append(structures_insn_uop.VTAMemInsn( # I13: FINISH
    opcode=3, # 0-LOAD, 1-STORE, 3-FINISH
    # DEP FLAG
    pop_prev_dep=0,
    pop_next_dep=0,
    push_prev_dep=0,
    push_next_dep=0,
    # Memory interaction
    buffer_id=0, # 0-UOP, 1-WGT, 2-INP, 3-ACC, 4-OUT, 5-ACC8bit
    sram_base=0x0000,
    dram_base=0x00000000,
    unused=0, # UNUSED
    # Operation over the data
    y_size=0,
    x_size=0,
    x_stride=0,
    y_pad_top=0,
    y_pad_bottom=0,
    x_pad_left=0,
    x_pad_right=0
))

## DATA ENCODING

**How to retrieve the encoded data :** 

Each 32-bit UOP and each 128-bit instruction is encoded as a sequence of bytes, shown here as pairs of hexadecimal digits.

*Example of encoding with the SHR Instruction :*

Structure used : 

```
class VTAAluInsn(LittleEndianStructure):
    """ALU instruction structure (128-bit)."""
    _pack_ = 1
    _fields_ = [
        ("opcode", c_uint64, 3),
        ("pop_prev_dep", c_uint64, 1),
        ("pop_next_dep", c_uint64, 1),
        ("push_prev_dep", c_uint64, 1),
        ("push_next_dep", c_uint64, 1),
        ("reset", c_uint64, 1),
        ("uop_bgn", c_uint64, 13),
        ("uop_end", c_uint64, 14),
        ("loop_out", c_uint64, 14),
        ("loop_in", c_uint64, 14),
        ("unused", c_uint64, 1),
        ("dst_factor_out", c_uint64, 11),
        ("dst_factor_in", c_uint64, 11),
        ("src_factor_out", c_uint64, 11),
        ("src_factor_in", c_uint64, 11),
        ("alu_opcode", c_uint64, 3), # 0-MIN, 1-MAX, 2-ADD, 3-SHR, 4-MUL/SHL
        ("use_imm", c_uint64, 1), # 0-NO, 1-YES
        ("imm", c_uint64, 16)
    ]
```

Buffer configuration :

```
insn_buffer.append(structures_insn_uop.VTAAluInsn( # I9: ALU - SHR (Average Pooling 3/3)
        opcode=4,           # >>> 100 [3-bit]
        # DEP FLAG          # >>> 0001
        pop_prev_dep=0,     
        pop_next_dep=0,    
        push_prev_dep=0,    
        push_next_dep=1,    
        # Operations
        reset=0,            # >>> 0
        uop_bgn=6,          # >>> 0000 0000 0011 0
        uop_end=7,          # >>> 0000 0000 0001 11
        loop_out=14,        # >>> 0000 0000 0011 10
        loop_in=14,         # >>> 0000 0000 0011 10
        # UNUSED
        unused=0,           # >>> 0 
        # Index factors
        dst_factor_out=56,  # >>> 0000 0111 000
        dst_factor_in=2,    # >>> 0000 0000 010
        src_factor_out=56,  # >>> 0000 0111 000
        src_factor_in=2,    # >>> 0000 0000 010
        alu_opcode=3,       # >>> 011
        use_imm=1,          # >>> 1
        imm=2               # >>> 0000 0000 0000 0010
    ))
```

Encoding this 128-bit sequence in Little Endian, we obtain the following instruction in the .bin files : **I9 : 44 06 e0 00 70 00 1c 00 38 10 00 0e 04 b0 02 00**
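
The byte sequence can be reproduced by hand. The following sketch packs the fields least-significant-bit first into a 128-bit integer and writes it out in little-endian byte order. It is an independent illustration: the `pack_fields` helper is hypothetical and is not part of `structures_insn_uop.py`, which relies on a `LittleEndianStructure`. The field widths and values are the ones listed above.

```python
def pack_fields(fields, total_bits=128):
    """Pack (value, width) pairs LSB-first and return little-endian bytes."""
    word, offset = 0, 0
    for value, width in fields:
        assert 0 <= value < (1 << width), "value does not fit in its field"
        word |= value << offset
        offset += width
    assert offset == total_bits, "field widths must add up to the instruction size"
    return word.to_bytes(total_bits // 8, "little")


# (value, width) pairs for instruction I9, in the order of VTAAluInsn._fields_
shr_insn = [
    (4, 3),    # opcode
    (0, 1),    # pop_prev_dep
    (0, 1),    # pop_next_dep
    (0, 1),    # push_prev_dep
    (1, 1),    # push_next_dep
    (0, 1),    # reset
    (6, 13),   # uop_bgn
    (7, 14),   # uop_end
    (14, 14),  # loop_out
    (14, 14),  # loop_in
    (0, 1),    # unused
    (56, 11),  # dst_factor_out
    (2, 11),   # dst_factor_in
    (56, 11),  # src_factor_out
    (2, 11),   # src_factor_in
    (3, 3),    # alu_opcode
    (1, 1),    # use_imm
    (2, 16),   # imm
]

encoded = pack_fields(shr_insn)
print(encoded.hex(" "))  # 44 06 e0 00 70 00 1c 00 38 10 00 0e 04 b0 02 00
```

The first byte, `44`, combines `opcode=4` in bits 0 to 2 with `push_next_dep=1` in bit 6.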

***BINARY FILES***

To obtain the binary files needed for the `functional_simulator` and `cycle-accurate simulator`, run the following command :

In [32]:
%%capture
%run ../src/compiler/vta_compiler/operations_definition/examples/insn_lenet5_layer1.py

To inspect the data in the files, run the following commands :
- For the UOP : `hexdump -C uop.bin > uop.txt` 
- For the instructions : `hexdump -C instructions.bin > instructions.txt`

We obtain in the `compiler_output/` folder : 
- `uop.bin` : file containing the encoded UOP 
- `instructions.bin` : file containing the encoded instructions
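
As an alternative to `hexdump`, the instruction file can be inspected from Python. This is a minimal sketch, assuming the file is a plain concatenation of 16-byte (128-bit) instructions; the helper name and the file path in the usage comment are illustrative.

```python
def split_instructions(data, insn_bytes=16):
    """Split raw bytes into one hex string per 128-bit instruction."""
    if len(data) % insn_bytes != 0:
        raise ValueError("data size is not a multiple of the instruction size")
    return [data[i:i + insn_bytes].hex(" ")
            for i in range(0, len(data), insn_bytes)]


# Example with the encoded SHR instruction from the DATA ENCODING section, repeated twice
sample = bytes.fromhex("4406e00070001c003810000e04b00200") * 2
for index, line in enumerate(split_instructions(sample)):
    print(f"I{index}: {line}")

# With a real file (path assumed):
# with open("compiler_output/instructions.bin", "rb") as f:
#     lines = split_instructions(f.read())
```

The same function can be applied to `uop.bin` with `insn_bytes=4`, since each UOP is 32 bits wide.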