# Multiple Matrix Multiplication Tutorial

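In this example, we'll build on the Matmul Tutorial by adding a second matrix multiplication, so that the end result is conceptually y = x^3, where x is the input matrix. Strictly speaking, because the API implicitly transposes the second tensor of each matmul, the program computes (x · xᵀ) · xᵀ, which is what the oracle in the Check Results section verifies. We'll also reuse some of the concepts from the previous tutorials.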

By the end of this tutorial, you should feel comfortable with the following concepts:
* Matrix Multiplication on Groq hardware
* Memory Layouts for Tensors
* Buffered Resource Scopes
* Memory Copy

You should have read the Multiple Matrix Multiplications section of the Groq API Tutorial Guide before going through this tutorial.

## Build a program and Compile with Groq API

Begin by importing the following packages. Note that for this example, in addition to the Groq API, we're also importing the Neural Net library from the Groq API as 'nn'. 

In [None]:
import groq.api as g
from groq.runner import tsp
import groq.api.nn as nn
import numpy as np
print("Python packages imported successfully")

Create your input matrix and set its size and data type. Remember to name it for easier debugging and to provide the recommended memory layout for the first matrix.

In [None]:
matrix = g.input_tensor(shape=(120, 120), dtype=g.float16, name="matrix", layout="H1(W), -1, S2")

Before looking at the code, let's discuss what this program will do:

<b>STEPS:</b>

1. First, we need to make a copy of our input matrix in order to compute X*X. We'll use what we learned in the Memory Copy Tutorial to do this, but this time inside a buffered resource scope. Copying memory within the GroqChip reduces the amount of data sent over the PCIe bus, which reduces load on the bus.

2. Now that we have two sets of 'x', we can multiply them together.

3. Since the result of the matrix multiply is float32, we need to cast it to float16 before the second matrix multiply.

4. Multiply the cast result by the copy of 'x', which is already in the memory layout required for the second matrix.

5. Return the result
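
Before reading the Groq API version, it can help to model the five steps on the host in NumPy. The sketch below is not Groq API code: `reference_pipeline` is a hypothetical helper, and it assumes that the second operand of each matmul is transposed and that products are accumulated in float32.

```python
import numpy as np

def reference_pipeline(x):
    """Host-side NumPy model of the five steps (hypothetical helper, not Groq API)."""
    x = np.asarray(x, dtype=np.float16)
    x_copy = x.copy()                                    # Step 1: copy the input
    # Step 2: first matmul; second operand transposed, float32 result
    first = np.matmul(x.astype(np.float32), x_copy.astype(np.float32).T)
    first_fp16 = first.astype(np.float16)                # Step 3: cast float32 -> float16
    # Step 4: second matmul reuses the copy as the second operand
    second = np.matmul(first_fp16.astype(np.float32), x_copy.astype(np.float32).T)
    return second                                        # Step 5: float32 result

x = np.random.default_rng(0).random((120, 120)).astype(np.float16)
y = reference_pipeline(x)
print(y.shape, y.dtype)
```

The model only describes the arithmetic; it says nothing about memory layouts, scheduling, or resource scopes, which are what the Groq API code below adds.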

In [None]:
class TopLevel(g.Component):  # Create our top level component
    def __init__(self):
        super().__init__()
        self.mm = nn.MatMul(name="MyMatMul", arith_mode_warmup=True)     #Matmul 1: using the nn.MatMul() component.
        self.mm2 = nn.MatMul(name="MyMatMul2", arith_mode_warmup=True)   #Matmul 2: using the nn.MatMul() component.

    def build(self, in1_mt, time=0):             #Provide the value of 'x' and a default time of 0
        with g.ResourceScope(name="MemCopy", is_buffered=True, time=0) as memcopy:   #STEP 1: COPY INPUT
            in1_st = in1_mt.read(streams=g.SG2, time=0)
            in1_copy_mt = in1_st.write(name="write_copy", layout="H1(W), -1, S16(4-38)")    #Assign a layout preferable to the MXM for the second matrix

        with g.ResourceScope(name="mmscope", is_buffered=True, predecessors=[memcopy], time=None) as mmscope :   #STEP2: MATMUL
            result = self.mm(in1_mt, in1_copy_mt, time=0).write(name="mm_result", layout="H1(W), -1, S4")

        with g.ResourceScope(name="cast", is_buffered=True, time=None, predecessors=[mmscope]) as castscope:     #STEP3: CAST FP32 -> FP16
            result_fp16_st = g.cast(result, dtype=g.float16, fp16_inf=False, time=0)    #fp16_inf = false is a non-saturating conversion
            result_fp16_mt = result_fp16_st.write(name="write_cast")
        
        with g.ResourceScope(name="mmscope2", is_buffered=True, predecessors=[castscope], time=None) as mmscope2 :  #STEP4: MATMUL2
            result_final = self.mm2(result_fp16_mt, in1_copy_mt, time=0).write(name="mm_result2", layout="H1(W), -1, S4")
            g.add_mem_constraints([result_final], [result_fp16_mt, in1_copy_mt], g.MemConstraintType.BANK_EXCLUSIVE)
        return result_final    #STEP5: FINAL ANSWER!
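
Step 3 narrows the first product from float32 to float16, so it is worth checking that the intermediate values fit within float16's finite range (largest finite value 65504). The NumPy sketch below is a host-side illustration only. It does not model `g.cast` or its `fp16_inf` flag, and `fits_in_fp16` is a hypothetical helper.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # largest finite float16 value

def fits_in_fp16(values):
    """True if every value is finite and within float16's finite range (hypothetical helper)."""
    values = np.asarray(values, dtype=np.float32)
    return bool(np.all(np.isfinite(values)) and np.max(np.abs(values)) <= FP16_MAX)

# Inputs drawn from [0, 1): each element of x @ x.T is a sum of 120 products
# that are each below 1, so it stays below 120.
x = np.random.default_rng(0).random((120, 120)).astype(np.float16)
first = np.matmul(x.astype(np.float32), x.astype(np.float32).T)
print(fits_in_fp16(first))

# A value beyond the range overflows to inf in a plain NumPy cast,
# or can be clamped to the largest finite value first.
big = np.array([1.0e5], dtype=np.float32)
with np.errstate(over="ignore"):
    overflowed = big.astype(np.float16)
clamped = np.clip(big, -FP16_MAX, FP16_MAX).astype(np.float16)
print(overflowed, clamped)
```

For the random inputs used in this tutorial the intermediate product is far from the float16 limit, so either cast behavior would give the same values here.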

A few points to remember about the API's matrix multiplication:
* The Matrix Multiply in the Neural Net library expects two Rank-2 tensors and supports the following data types: 
  * int8 
  * float16 
  * Mixed float16/float32 (See API Reference Guide)
* The API implicitly transposes the 2nd tensor before performing the matmul operation. 
* The inner dimension of both memory tensors must be the same.
* For a float16 matmul, the recommended memory layout for the first matrix is `layout="H1(W), -1, S2"` and the layout for the second matrix is `layout="H1(W), -1, S16(4-38)"`, noting that the output is float32.
* As always, name your tensors for easier debug
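
To make the implicit transpose and the inner-dimension rule concrete, here is a NumPy sketch. It assumes that "transposes the 2nd tensor" means the result equals `a @ b.T`, which is also what the oracle later in this tutorial computes. `matmul_rhs_transposed` is a hypothetical helper, not part of the Groq API.

```python
import numpy as np

def matmul_rhs_transposed(a, b):
    """Compute a @ b.T in float32 for two rank-2 arrays (hypothetical helper)."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.ndim != 2 or b.ndim != 2:
        raise ValueError("both operands must be rank-2")
    if a.shape[-1] != b.shape[-1]:
        raise ValueError(f"inner dimensions differ: {a.shape[-1]} vs {b.shape[-1]}")
    return np.matmul(a.astype(np.float32), b.astype(np.float32).T)

a = np.ones((3, 4), dtype=np.float16)
b = np.ones((5, 4), dtype=np.float16)   # same inner (last) dimension as a
out = matmul_rhs_transposed(a, b)
print(out.shape, out.dtype)             # (3, 5) float32
```

Under this convention both operands share their last dimension, which is why a square 120 x 120 input can be used as both the first and the second tensor.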

In [None]:
top = TopLevel()
total_result = top(matrix, time=0)

Compile the program:

In [None]:
iop_file = g.compile(base_name="multi_matmul", result_tensor=total_result)

## GroqView
GroqView can be used to view the instructions of your program in the GroqChip. Note: it is expected that you are familiar with the GroqView tool (See "GroqView User Guide") for this section of this tutorial. You may skip viewing the program in GroqView and move to the "Prepare Data for Program" section.

Using the following command, we can create a .json file that can be used to view the program in hardware. This will show:
* what instructions occur
* where on the chip they take place, as well as 
* when in time (cycles) each instruction occurs.

In [None]:
g.write_visualizer_data("multi_matmul")

To launch GroqView, uncomment and run the following command. Remember that you still need to create a tunnel to the server running the GroqView tool in order to load it in another window.

In [None]:
#!groqview multi_matmul/visdata.json

<b>Note:</b> before proceeding to the next section, you'll want to stop the above cell. 

## Run on Hardware
Program the GroqChip with the binary file of the matrix multiply program.

In [None]:
program = tsp.create_tsp_runner(iop_file)

Provide the input data to the GroqChip, which will return the results of the matrix multiplication.

In [None]:
# Call the program and provide an input matrix
t1_data = np.random.rand(120, 120).astype(np.float16)
result = program(matrix=t1_data)

## Check Results
Note that the oracle value is float32 because the output of the MXM matrix multiply is float32 for two float16 inputs.

In [None]:
# Compute the oracle value for comparison
oracle = np.matmul(t1_data, t1_data.transpose(), dtype=np.float32)
oracle = np.matmul(oracle, t1_data.transpose(), dtype=np.float32)

# Ensure it matches the Groq Chip
print(np.allclose(oracle, result['mm_result2'], rtol=1e-1, atol=1e-1, equal_nan=True))
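
`np.allclose(a, b, rtol, atol)` passes when `|a - b| <= atol + rtol * |b|` holds for every element. The host-only sketch below estimates how much error the intermediate float16 cast introduces relative to an all-float32 oracle, which gives a feel for how much headroom `rtol=1e-1` leaves. It assumes the cast rounds to nearest, as NumPy's `astype` does; results from hardware may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((120, 120)).astype(np.float16)
x32 = x.astype(np.float32)

# All-float32 oracle: (x @ x.T) @ x.T, the same computation as the oracle above.
oracle = np.matmul(np.matmul(x32, x32.T), x32.T)

# Same computation with the intermediate rounded to float16 (models Step 3).
first_fp16 = np.matmul(x32, x32.T).astype(np.float16)
modelled = np.matmul(first_fp16.astype(np.float32), x32.T)

# All oracle entries are positive for inputs in [0, 1), so the division is safe.
rel_err = np.max(np.abs(modelled - oracle) / np.abs(oracle))
print(f"max relative error from the float16 cast: {rel_err:.2e}")
print(np.allclose(oracle, modelled, rtol=1e-1, atol=1e-1))
```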

## Back to Back Computations
The GroqChip is still programmed with the matmul program, so we can continue to provide input data and it will return the result of the two chained matrix multiplications. Now let's look at how to call the same program repeatedly with different input tensors.

In [None]:
for i in range(3):
    print(f"Matrix {i}")
    t1_data = np.random.rand(120, 120).astype(np.float16)
    result = program(matrix=t1_data)
    oracle = np.matmul(t1_data, t1_data.transpose(), dtype=np.float32)
    oracle = np.matmul(oracle, t1_data.transpose(), dtype=np.float32)
    print("For input tensor of size {}. Results are: ".format(t1_data.shape))
    print(np.allclose(oracle, result['mm_result2'], rtol=1e-1, atol=1e-1, equal_nan=True))
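
The loop above needs hardware. To exercise the same checking logic offline, one option is a stand-in callable that mimics the calling convention used in this tutorial (a keyword argument `matrix` and a result dictionary keyed by `mm_result2`). `fake_program` and `run_checks` below are hypothetical NumPy helpers, not part of `groq.runner`.

```python
import numpy as np

def fake_program(matrix):
    """NumPy stand-in for the compiled program (hypothetical; not the Groq runner)."""
    x32 = np.asarray(matrix, dtype=np.float32)
    first_fp16 = np.matmul(x32, x32.T).astype(np.float16)
    return {"mm_result2": np.matmul(first_fp16.astype(np.float32), x32.T)}

def run_checks(program, n_runs=3, shape=(120, 120), seed=0):
    """Call `program` on fresh random inputs and return one pass/fail flag per run."""
    rng = np.random.default_rng(seed)
    flags = []
    for _ in range(n_runs):
        data = rng.random(shape).astype(np.float16)
        result = program(matrix=data)
        d32 = data.astype(np.float32)
        oracle = np.matmul(np.matmul(d32, d32.T), d32.T)
        flags.append(bool(np.allclose(oracle, result["mm_result2"],
                                      rtol=1e-1, atol=1e-1, equal_nan=True)))
    return flags

print(run_checks(fake_program))
```

Passing the real `program` object in place of `fake_program` should work the same way, provided it accepts `matrix=` and returns a mapping with that key, as the cells above do.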